18 Nov 2012

Google Refine, the best way I know to scrape data from websites.

There are many ways to scrape data from a website. You could use your favourite programming language to GET a web page and run an HTML parser over it to pull out the data you need. But Google Refine comes with lots of techniques for manipulating the data you scraped and then exporting it in any format you like, with little or no coding. If you don’t know what Google Refine is, I recommend watching some of the videos on their website here.
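For comparison, here is a minimal sketch of the do-it-yourself route in Python, using only the standard library. The names `TitleParser`, `extract_title` and `fetch` are made up for this illustration, and `fetch` assumes the page is UTF-8 encoded. It only pulls out the page `<title>`; anything more selective needs more code.

```python
from html.parser import HTMLParser
from urllib.request import urlopen


class TitleParser(HTMLParser):
    """Collects the text inside the page's <title> element."""

    def __init__(self):
        super().__init__()
        self.in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self.in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self.in_title = False

    def handle_data(self, data):
        # Only keep text that appears between <title> and </title>.
        if self.in_title:
            self.title += data


def extract_title(html):
    """Return the stripped <title> text of an HTML string ('' if none)."""
    parser = TitleParser()
    parser.feed(html)
    return parser.title.strip()


def fetch(url):
    """Plain HTTP GET; assumes the response body is UTF-8 (an assumption)."""
    with urlopen(url) as response:
        return response.read().decode("utf-8", errors="replace")
```

You would call it as `extract_title(fetch(url))` with whatever URL you want to scrape. Even this small case needs a parser class and some state tracking, which is the coding Google Refine saves you.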

I don’t think I should demonstrate scraping on just any website, so I’m going to use my own site as the example.

  1. I start with Excel and put the URL I want to scrape in the second row. I could have put it in the first row, but Google Refine treats the first row as the column header by default, so I would have to change the import configuration.

  1. Create a project in Google Refine.

  1. If it succeeds, you will see this.

  1. Then we fetch the URL, using “Add column by fetching URLs”.

  1. You can set the delay and give the new column a name, then click OK. You can also use a formula or expression to build the URL.

  1. Then you will see that the content has been fetched and put into the column we named.

  1. Then we extract only the titles from the page. This one is tricky because every website has its own HTML, so the expression has to be adapted for each site.

For this example I use the expression below. It means: get me every element matching “#main-content article header h2 a” and “join” their contents with “,”.

forEach(value.parseHtml().select("#main-content article header h2 a"), e, e.innerHtml()).join(",")
  1. Then we reshape the data so that each title gets its own row, using “Split multi-valued cells”.
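To make the last two steps concrete, here is a rough Python sketch of what the expression and the split do together. It is not how Google Refine works internally. It is a hand-rolled stand-in that handles only descendant selectors made of tag names and `#id` steps, and it collects link text rather than inner HTML. The `page` sample, `select_text` and the other names are invented for this illustration.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(step, element):
    """A step is '#some-id' or a tag name; an element is a (tag, id) pair."""
    tag, element_id = element
    if step.startswith("#"):
        return element_id == step[1:]
    return tag == step


def chain_matches(steps, stack):
    """True if the innermost element matches the last step and the earlier
    steps match some of its ancestors, in order (descendant selector)."""
    if not stack or not matches(steps[-1], stack[-1]):
        return False
    remaining = list(steps[:-1])
    for element in reversed(stack[:-1]):
        if remaining and matches(remaining[-1], element):
            remaining.pop()
    return not remaining


class SelectText(HTMLParser):
    def __init__(self, selector):
        super().__init__()
        self.steps = selector.split()
        self.stack = []            # open elements as (tag, id) pairs
        self.capture_depth = None  # stack depth of the element being captured
        self.results = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        self.stack.append((tag, dict(attrs).get("id")))
        if self.capture_depth is None and chain_matches(self.steps, self.stack):
            self.capture_depth = len(self.stack)
            self.results.append("")

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this tag name.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break
        if self.capture_depth is not None and len(self.stack) < self.capture_depth:
            self.capture_depth = None

    def handle_data(self, data):
        if self.capture_depth is not None:
            self.results[-1] += data


def select_text(html, selector):
    parser = SelectText(selector)
    parser.feed(html)
    return [text.strip() for text in parser.results]


# Invented sample page, shaped like the selector expects.
page = """
<div id="main-content">
  <article><header><h2><a href="/one">First post</a></h2></header></article>
  <article><header><h2><a href="/two">Second post</a></h2></header></article>
</div>
<div id="sidebar"><h2><a href="/about">About</a></h2></div>
"""

# Same shape as the expression: select, then join with ",".
cell = ",".join(select_text(page, "#main-content article header h2 a"))
# Same idea as splitting the multi-valued cell: one title per row.
rows = cell.split(",")
```

One caveat with joining on “,” and splitting again: a title that itself contains a comma would end up on two rows, so a separator that is unlikely to appear in the data may be a safer choice.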

That’s it. There are many other ways you could use Google Refine to manipulate your data. One thing to be careful about when fetching a large amount of data: if Google Refine cannot finish the process in one go, you might have to start over. So, if your connection is not reliable, I would suggest using Amazon EC2; you can easily deploy Google Refine in the cloud. Also, give Google Refine plenty of memory when you start it, otherwise you might get an OutOfMemory error in the middle of the process.

One thing I found, though I haven’t researched it thoroughly, is that Google Refine seems to lack the ability to extract text with a regular expression. You can match text against a regular expression, but you cannot pull out the part you want. Overall, though, it’s still the best tool for the job.
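To show the difference between matching and extracting that I mean, here is a small Python sketch. The sample text and the patterns are made up for the illustration and have nothing to do with Google Refine itself.

```python
import re

# Hypothetical sample text.
text = "Posted on 18 Nov 2012 by noppanit"

# Matching only answers the question "does the pattern occur?"
has_date = re.search(r"\d{1,2} \w{3} \d{4}", text) is not None

# Extracting pulls the captured pieces out of the text via groups.
found = re.search(r"(\d{1,2}) (\w{3}) (\d{4})", text)
day, month, year = found.groups() if found else (None, None, None)
```

Here `has_date` is just a yes or no, while `day`, `month` and `year` hold the pieces of the text I actually want.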

Til next time,
noppanit at 00:00