Sometimes when you open a spreadsheet, the data is not exactly laid out how you expect it to be. This means that you have to clean the dataset before you can use it. Here’s my tutorial on the five most commons ways to clean a dirty dataset:
1) Removing extra white space from cells
Sometimes cells may include too much white space, like the following: As you can see, the cells have too much white space surrounding the characters. There’s a really quick fix for this in Excel – the TRIM function: Magic! Now just drag the formula down for however many cells as necessary.
2) Numbers stored as text
This is a problem that I encountered recently with a dataset. I was trying to work out the sum of a series of numbers but the formula was simply returning nothing. Why? Because the numbers were stored as text! An easy fix, but it took me an infuriating amount of time to figure it out… Go to format cells and double check that the number is stored as a number, rather than text:
3) Splitting one column into two
A common example of this is when you receive a column of full names that you need to split into forename and surname: Thankfully, Excel has a clever “Convert Text to Columns” wizard that makes this a doddle. In Excel go to the Data tab and select “Text to columns”. The wizard automatically works out that the data in the column I need splitting is delimited. Data that is “delimiter separated” is separated by a character such as comma, for instance, comma-delimited data e.g. “Lynn, Fenner, 45, London”. In this case the data is delimited by a space. Therefore on the next screen, we select “space” as the delimiter: When you’re done, the data has been neatly separated:
In some cases, you might want to do the exact opposite. What if postcodes are stored in two separate columns, for example? The postcodes have also been entered into the spreadsheet in lowercase. This is easily solved using the formula =UPPER, as shown above. The CONCATENATE function can combine up to 255 text strings into one string. Here I’ve added ” ” into the formula as I wanted to include a space between the two postcodes, as this is the typical format used. You can also add text to the function, which I wanted to do alongside the names used previously in this example: The formula, incase not visible, is here:
=CONCATENATE("The postcode of ",E2," ",F2," is ",D2,"")
5) Removing odd characters
In this example, instead of a ” “, the data includes a strange character. This can easily be modified using the Find and Replace tool: In the replace with box, I’ve added one space. Select “Replace All” and you can see that all the data that been updated and is much more workable now: Hopefully, these five steps will make it easier for you the next time that you receive the dreaded dirty dataset!