Data / Journalism

Data tutorial: Using OutWit Hub and Open Refine to scrape data

This tutorial will show you:

  • The basics of using OutWit Hub to scrape data (I posted a full tutorial on this here)
  • How to get the longitude and latitude from postcodes using a Post Code API and Open Refine (previously known as Google Refine).

For part of my data journalism project, I wanted to scrape The Churches Conservation Trust list of conserved/historic churches in England (I’ll explain why I needed to do this in a later post.)

The Churches Conservation Trust

The Churches Conservation Trust’s website listing the 344 churches that they care for.

As you can see, the churches are not listed in an easily accessible manner. But this was very easy to scrape using OutWit Hub. The expressions that I used are here:

OutWit Hub

The expressions that I used in OutWit Hub to scrape the data that I needed.

The 344 churches were spread across seven different pages, but because of the URL structure this was easy to work out and automate.

The URL of this list is:

"http://www.visitchurches.org.uk/Ourchurches/
Completelistofchurches/Page-1/?pageSize=50"

As I had seven pages to scrape I know that I just have to change “Page-1” to Page-2, Page-3, etc until I’ve got up to Page-7. As shown in my last OutWit Hub tutorial, I then just saved the list of URLs, saved them as a .HTML file, opened the page in OutWit Hub, then fast-scraped the URLs. This gave me a list of all the conserved churches in England, including their post codes and even the description of each church.

However, I wanted to find out the longitude and latitude code in order to map the churches using Google Fusion Tables. Manually searching for the longitude and latitude for 344 postcodes would be an incredibly painful experience.

Finding the

longitudes and latitudes

Thankfully, you can use an API and Open Refine to quickly automate this. The API that I’ll be using, UK Postcodes API, returns data on a postcode that you search for in the URL.

The URL that I’m using to find the data that I require is:

http://uk-postcodes.com/postcode/
[postcode (no space)].['xml', 'csv', 'json'* or 'rdf']

I have chosen to receive the data in JavaScript Object Notation format as this is easier to deal with on this occasion. Therefore if you searched for the first postcode:

http://uk-postcodes.com/postcode/PE19 6UJ.json

It would return the following:

<{"postcode":"PE19 6UJ",
"geo":{"lat":52.193917468796194,
"lng":-0.2054113922963551,
"easting":522761.0,"northing":256593.0,
"geohash":"http://geohash.org/gcrbgmd7te6j"},
"administrative":{"council":{"title":"Huntingdonshire",
"uri":"http://statistics.data.gov.uk/id/statistical-geography/E07000011",
"code":"E07000011"},
"county":{"title":"Cambridgeshire",
"uri":"http://statistics.data.gov.uk/id/statistical-geography/E10000003",
"code":"E10000003"},
"ward":{"title":"Gransden and The Offords",
"uri":"http://statistics.data.gov.uk/id/statistical-geography/E05008528",
"code":"E05008528"},
"constituency":{"title":"Huntingdon",
"uri":"http://statistics.data.gov.uk/id/statistical-geography/E14000757",
"code":"E14000757"}}}

Although this looks extremely confusing, the only information that I want from this is the latitude and longitude – which is in the first sentence (though you could, if you wanted, pull other details such as the ward name and the local council). I’m therefore going to use Open Refine to automatically pull this information for me. Once you know how to do this, it’s very straight-forward to do and replicate in other data work.

Using Open Refine

Firstly, I downloaded Open Refine (download link here). I then saved a list of all the post codes as an .XLS file in Microsoft Excel. Open the program and go to “Create Project” and select your .XLS file with the postcodes in. This will lead you to this screen:

Open Refine

The list of post codes in Open Refine

Click on the post code column and click on “Add column by fetching URLs”:

Open Refine

Add column by fetching URLs

As explained before, in order to get the longitude and latitude we need to visit a URL – hence why we are fetching this data from a URL. But we need to change the URL so it is unique for every postcode. Typing this manually is an unattractive prospective. However, Open Refine allows you do this extremely quickly. The URL we need is in this style:

http://uk-postcodes.com/postcode/PE19 6UJ.json

Therefore, the only element changing is the postcode. This is our “value”. In the next screen that appears we therefore need to paste the following code:

"http://www.uk-postcodes.com/postcode/"+value+".json"
Google Refine

Open Refine easily automatically creates the URLs

As the preview column shows, this changes every URL so that it displays the corresponding postcode. Now Open Refine will fetch the data for every URL, like we saw before. As there are 344 rows of post codes, just returning this information took about 45 minutes:

Google Refine

The data that the Post Code API returned in Open Refine

The next step is click on the column where all your information is and “Add column based on this column”. We are now going to grab the latitude and longitude for every post code.

When the next screen appears, enter the following code (Google Refine Expression Language, if you want to be specific!) into the box:

value.parseJson()['geo']['lat']

You’ll find that this will return the latitude. For the longitude enter the following code:

value.parseJson()['geo']['long']
Google Refine

Finding the longitude using Open Refine

Now I have the longitude and latitude for every postcode, hurrah!

Google Refine

The list of churches in Microsoft Excel

To produce a map showing every conserved church in England, I uploaded the spreadsheet to Google Fusion Tables and this was the finished product:

Conserved churches

My map of all the conserved churches in England

I’ll be using this process a lot in the future, especially for mapping. But there are a wide variety of APIs out there which can help you automatically find data. Open Refine is also an excellent tool to know about in order to clean data – which I shall be writing a tutorial on soon!

Disclaimer: I am much indebted to Paul Bradshaw’s tutorial on “Adding geographical information to a spreadsheet based on postcodes – Google Refine and APIs” and his data journalism intensive class at City held this Wednesday.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s