As part of my Online Journalism module at City University London, run by Paul Bradshaw, I co-manage a literacy blog called Reading for London. I wanted to create a map of every reading group in London as a useful resource for our readers. A website called Reading Groups for Everyone lists every reading group in London, but unfortunately not on one single page. To copy this information for every borough would take a lot of time. So what is the easiest way to do it? Answer: using a piece of software called OutWit Hub. Thankfully, we’ve been learning about this excellent piece of software in our data journalism classes.
As you can see, this website outlines the main details about every reading group in Hackney: the name of the group, the suitable age group, location, and time. I want to get all this key information into a Excel spreadsheet. Copying and pasting would take a long time as there are 17 different reading groups in Hackney so OutWit Hub is an easy solution. You can download OutWit Hub Light here for free. However, this does unfortunately limit you to only a 100 rows of data.
1) Open OutWit and go to File > Open and load the page where you want to scrape the information from.
2) In OutWit Hub you want to tell the software where it can find the relevant information that you are looking for. For the reading group name, for instance, I searched for the name of the first reading group, Clapton Library Reading Group, into the HTML code.
3) In order to get the name of the reading group, you need to find the code that comes before and after the name, and enter this into the “marker before” and “marker after” boxes.
In this case, the name of the reading group comes before a URL:
I copied the URL up until London as this URL will be the same for every reading group. As you want to scrape various reading groups, you want the information in the “marker before” and “marker after” boxes to be uniform i.e. not distinct to only one borough.
In this case, the “marker after” is easier to fill as the code after the Reading Group name is just “</a></h3><p>”.
4) Now if you click Save > Execute the scraper will pull all the reading groups names out for you. Easy!
This is obviously a lot quicker than copying and pasting.
5) Next I want to find out what age group the reading group is most suitable for, its location and the time. I therefore go to Scraper > New and again edit the “marker before” and “marker after” text.
Finding the time is very simple and it always comes after the word “Meet:” so that is the “marker before”. Once you’ve got the hang of what you’re looking for, it is very easy to work out what the markers are. Once you’ve done this for all the different elements, your scraper section should look like this:
6) Now if you go to Save > Execute > Export, you can export all the information to XLS format so that you can view it in Excel:
Now the data is a lot easier to work with than it was on the web page. But I want the data on all the reading groups in London! Now this becomes slightly more complicated, but stay with me because this is a very handy tip (With thanks to John Burn-Murdoch)!
To do this, you just need to understand the URL structure of every different London page. These are the URLs for the Islington and Camden page:
As you can see, this URL is very basic with the borough name always listed at the end. This means that it is very easy to generate the URLs of all the London Boroughs relatively quickly using Microsoft Excel.
6) To do this, open Excel and in Column A type the basic part of the URL that never changes. In this example it would be: “http://readinggroups.org/library/?l=London+Borough+of+”. Now click the + sign in the bottom right corner of the cell to duplicate this URL.
7) Next, In Column B add in the borough names e.g. Islington, Camden, and Barnet. I knew that the website had a list of the different areas here so it was easy to copy.
8) Now, in Column C type the formulae =A2&B2 and drag it down so that it is entered in every cell in that column. This will combine the two columns so after adding all the areas, you’ll have this list of URLs:
9) Now copy and paste Column C, what I have called “New URLs”, into an entirely new Excel workbook. Make sure to paste the column as values! This removes the formulae, otherwise it will not work. Save the spreadsheet as ReadingGroupURLs.html
10) Move back to OutWit and open your html file as a page then view under Data > Table. Select the URL Column and select Fast Scrape > then select the scraper that you originally created (I called mine Reading Group). Now this will scrape all the information from every page, hurrah!
Once you’ve clicked this, OutWit will then work its magic and produce a list of every reading group in London:
Data scraping is a lot easier than it seems thanks to OutWit Hub. My next stage with this data piece would be to visualise all the London reading groups onto a map – I’ll be putting that into another tutorial!
What tools do you recommend for data scraping? Let me know in the comments.