Data / Interactive / Journalism

Tutorial: How to clean data using Microsoft Excel

Sometimes when you open a spreadsheet, the data is not exactly laid out how you expect it to be. This means that you have to clean the dataset before you can use it. Here’s my tutorial on the five most commons ways to clean a dirty dataset:

1) Removing extra white space from cells

Sometimes cells may include too much white space, like the following: Trim function Excel As you can see, the cells have too much white space surrounding the characters. There’s a really quick fix for this in Excel – the TRIM function: TRIM function Magic! Now just drag the formula down for however many cells as necessary.

2) Numbers stored as text

This is a problem that I encountered recently with a dataset. I was trying to work out the sum of a series of numbers but the formula was simply returning nothing. Why? Because the numbers were stored as text! An easy fix, but it took me an infuriating amount of time to figure it out… Go to format cells and double check that the number is stored as a number, rather than text: Number stored as text

3) Splitting one column into two

A common example of this is when you receive a column of full names that you need to split into forename and surname: Excel Thankfully, Excel has a clever “Convert Text to Columns” wizard that makes this a doddle. In Excel go to the Data tab and select “Text to columns”. Excel The wizard automatically works out that the data in the column I need splitting is delimited. Data that is “delimiter separated” is separated by a character such as comma, for instance, comma-delimited data e.g. “Lynn, Fenner, 45, London”. In this case the data is delimited by a space. Therefore on the next screen, we select “space” as the delimiter: Excel When you’re done, the data has been neatly separated:

Excel4) Merging data into one column

In some cases, you might want to do the exact opposite. What if postcodes are stored in two separate columns, for example? Excel The postcodes have also been entered into the spreadsheet in lowercase. This is easily solved using the formula =UPPER, as shown above. Excel The CONCATENATE function can combine up to 255 text strings into one string. Here I’ve added ” ” into the formula as I wanted to include a space between the two postcodes, as this is the typical format used. You can also add text to the function, which I wanted to do alongside the names used previously in this example: Concatenate The formula, incase not visible, is here:

 =CONCATENATE("The postcode of ",E2," ",F2," is ",D2,"")

 5) Removing odd characters

Screen Shot 2014-04-06 at 12.59.19In this example, instead of a ” “, the data includes a strange character. This can easily be modified using the Find and Replace tool: Excel In the replace with box, I’ve added one space. Select “Replace All” and you can see that all the data that been updated and is much more workable now: Excel Hopefully, these five steps will make it easier for you the next time that you receive the dreaded dirty dataset!

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s