Data / Journalism

Data journalism tutorial: Using OutWit Hub to scrape data

As part of my Online Journalism module at City University London, run by Paul Bradshaw, I co-manage a literacy blog called Reading for London. I wanted to create a map of every reading group in London as a useful resource for our readers. A website called Reading Groups for Everyone lists every reading group in London, but unfortunately not on one single page. To copy this information for every borough would take a lot of time. So what is the easiest way to do it? Answer: using a piece of software called OutWit Hub. Thankfully, we’ve been learning about this excellent piece of software in our data journalism classes.

Reading groups in Hackney

A list of reading groups in Hackney

As you can see, this website outlines the main details about every reading group in Hackney: the name of the group, the suitable age group, location, and time. I want to get all this key information into a Excel spreadsheet. Copying and pasting would take a long time as there are 17 different reading groups in Hackney so OutWit Hub is an easy solution. You can download OutWit Hub Light here for free. However, this does unfortunately limit you to only a 100 rows of data.

1) Open OutWit and go to File > Open and load the page where you want to scrape the information from.

2) In OutWit Hub you want to tell the software where it can find the relevant information that you are looking for. For the reading group name, for instance, I searched for the name of the first reading group, Clapton Library Reading Group, into the HTML code.

Reading Groups OutWit Hub

Trying to find where the name of the reading group appears in the code

3) In order to get the name of the reading group, you need to find the code that comes before and after the name, and enter this into the “marker before” and “marker after” boxes.

In this case, the name of the reading group comes before a URL:

“<h3><a href=”http://readinggroups.org/groups/location/london

I copied the URL up until London as this URL will be the same for every reading group. As you want to scrape various reading groups, you want the information in the “marker before” and “marker after” boxes to be uniform i.e. not distinct to only one borough.

In this case, the “marker after” is easier to fill as the code after the Reading Group name is just “</a></h3><p>”.

4) Now if you click Save > Execute the scraper will pull all the reading groups names out for you. Easy!

Reading Groups OutWit Hub

OutWit Hub quickly scrapes the names of the reading groups

This is obviously a lot quicker than copying and pasting.

5) Next I want to find out what age group the reading group is most suitable for, its location and the time. I therefore go to Scraper > New and again edit the “marker before” and “marker after” text.

Finding the time is very simple and it always comes after the word “Meet:” so that is the “marker before”. Once you’ve got the hang of what you’re looking for, it is very easy to work out what the markers are. Once you’ve done this for all the different elements, your scraper section should look like this:

Reading Groups OutWit Hub

What your expression markers should look like in OutWit Hub

6) Now if you go to Save > Execute > Export, you can export all the information to XLS format so that you can view it in Excel:

Reading Groups OutWit Hub

What the data looks like once it has been exported from OutWit to Excel

Now the data is a lot easier to work with than it was on the web page. But I want the data on all the reading groups in London! Now this becomes slightly more complicated, but stay with me because this is a very handy tip (With thanks to John Burn-Murdoch)!

To do this, you just need to understand the URL structure of every different London page. These are the URLs for the Islington and Camden page:

http://readinggroups.org/library/?l=London+Borough+of+Islington
http://readinggroups.org/library/?l=London+Borough+of+Camden

As you can see, this URL is very basic with the borough name always listed at the end. This means that it is very easy to generate the URLs of all the London Boroughs relatively quickly using Microsoft Excel.

6) To do this, open Excel and in Column A type the basic part of the URL that never changes. In this example it would be: “http://readinggroups.org/library/?l=London+Borough+of+&#8221;Now click the + sign in the bottom right corner of the cell to duplicate this URL.

7) Next, In Column B add in the borough names e.g. Islington, Camden, and Barnet. I knew that the website had a list of the different areas here so it was easy to copy.

8) Now, in Column C type the formulae =A2&B2 and drag it down so that it is entered in every cell in that column. This will combine the two columns so after adding all the areas, you’ll have this list of URLs:

OutWit Hub

Generating URLs for scraping in Excel is easier than you think

9) Now copy and paste Column C, what I have called “New URLs”, into an entirely new Excel workbook. Make sure to paste the column as values! This removes the formulae, otherwise it will not work. Save the spreadsheet as ReadingGroupURLs.html

10) Move back to OutWit and open your html file as a page then view under Data > Table. Select the URL Column and select Fast Scrape > then select the scraper that you originally created (I called mine Reading Group). Now this will scrape all the information from every page, hurrah!

OutWit Hub

This tells OutWit to “fast scrape” these websites using the earlier created Scraper called “Reading Group”

Once you’ve clicked this, OutWit will then work its magic and produce a list of every reading group in London:

Reading Groups London

A spreadsheet of all the reading groups in London

Data scraping is a lot easier than it seems thanks to OutWit Hub. My next stage with this data piece would be to visualise all the London reading groups onto a map – I’ll be putting that into another tutorial!

What tools do you recommend for data scraping? Let me know in the comments.

Advertisements

One thought on “Data journalism tutorial: Using OutWit Hub to scrape data

  1. Pingback: Data tutorial: Using OutWit Hub and Open Refine to scrape data | Sophie Murray-Morris: Freelance Journalist

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s