04 Apr 2013

How to index all of the English Wikipedia data using Elasticsearch

I have been researching how to extract context from a piece of text. There are lots of techniques for that: information retrieval, noun chunking, text classification and so on. One technique I have been trying uses Wikipedia, but it is quite difficult because of the resources it needs, and it cannot be done on the fly. Wikipedia is like a central repository where people contribute content for other people, so every article is written by a human and carries some potential keywords in its text. For example, if the article is about dogs, the text is likely to contain the word “dog” or its synonyms.

However, just indexing all the Wikipedia data is not quite enough, because then you can only search through the content, which Google or Bing obviously do a better job of. What interests me in Elasticsearch is cosine similarity, which roughly speaking is a technique to determine how similar two vectors are. Elasticsearch provides just that through its “More like this” functionality, an easy way to measure how similar two pieces of content are. The usage is easy: index all the text you want to be searchable, input another piece of text, and Elasticsearch will give you a score.
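
To give a rough idea of what such a “More like this” search looks like, here is a minimal sketch. It makes several assumptions that are not from my setup notes: the node listens on localhost:9200, the index is called wikipedia, the article body lives in a field called text, and the helper names mlt_body and similar_pages are made up for illustration. The like_text parameter is the one I would expect for Elasticsearch versions of this era; newer releases use like instead and also want a Content-Type header on the request.

```shell
#!/bin/sh
# Assumption: Elasticsearch node address (override with ES_HOST).
ES_HOST="${ES_HOST:-localhost:9200}"

# Build a more_like_this query body for the text passed as $1.
# Assumption: the searchable article body is stored in a field named "text".
# Only backslashes and double quotes are escaped, so keep the input on one line.
mlt_body() {
    escaped=$(printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g')
    printf '{"query":{"more_like_this":{"fields":["text"],"like_text":"%s","min_term_freq":1,"max_query_terms":25}}}' "$escaped"
}

# Send the query to the (assumed) "wikipedia" index and print the raw JSON response.
similar_pages() {
    curl -s -XPOST "http://$ES_HOST/wikipedia/_search?pretty" -d "$(mlt_body "$1")"
}
```

Calling similar_pages "a short text about dogs and puppies" should return the usual search response, where each hit carries a _score you can use as the similarity measure.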

Instead of doing everything yourself, you can use an Elasticsearch plugin called elasticsearch-river-wikipedia, which does everything for you, from downloading the dumped wiki data to indexing it so that you can search immediately. However, I found little documentation on how to use it for an Elasticsearch virgin like me.

So, here’s how I do it.

First, of course, you need to install Elasticsearch. If you’re on a Mac, I suggest using Homebrew.

Then you need to install the plugin by following these steps. And that should be it. However, I have looked at the plugin source code: it downloads the 30GB of data, unzips it and indexes it for you in the background. There are no logs or any other indication of progress, so I found it quite hard to tell when it had finished. I actually had to listen for my CPU fan to stop before I realised that it was done indexing. But here’s what I did to make it slightly faster and more obvious.

The download takes a while, depending on your connection. So I suggest downloading the file yourself and unzipping it. Then use this command to create the index.

curl -XPUT localhost:9200/_river/wikipedia/_meta -d '
{
    "type" : "wikipedia",
    "wikipedia" : {
        "url" : "file:///${PATH_TO_YOUR_FOLDER}/enwiki-latest-pages-articles.xml"
    }
}'

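The manual download and unzip step could be scripted roughly like this. It is only a sketch: the URL is the standard Wikimedia location for the latest English dump, and fetch_dump and river_url are helper names made up for illustration, not part of the plugin.

```shell
#!/bin/sh
# File name of the uncompressed dump, as used in the river settings.
DUMP_NAME="enwiki-latest-pages-articles.xml"
# Standard Wikimedia location of the latest compressed English dump.
DUMP_URL="https://dumps.wikimedia.org/enwiki/latest/$DUMP_NAME.bz2"

# Print the file:// URL to put into the river settings for folder $1.
# $1 is expected to be an absolute path such as /data/wiki.
river_url() {
    printf 'file://%s/%s\n' "$1" "$DUMP_NAME"
}

# Download the compressed dump into folder $1 and unzip it there.
fetch_dump() {
    dir="$1"
    mkdir -p "$dir" || return 1
    # -L follows redirects, -C - resumes a partially downloaded file.
    curl -L -C - -o "$dir/$DUMP_NAME.bz2" "$DUMP_URL" || return 1
    # -k keeps the compressed file next to the extracted XML.
    bunzip2 -k "$dir/$DUMP_NAME.bz2"
}
```

For example, fetch_dump /data/wiki followed by river_url /data/wiki gives you the value for the "url" setting.
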
This speeds up the process a lot and reduces the chance that your JVM will hit a PermGen exception, because otherwise the plugin will try to unzip the 30GB of data for you as well. Then just wait for the indexing to finish. You can check the status of your node by opening http://localhost:9200/_stats in your favourite browser. The size of the finished index is around 40GB.
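
If you would rather not listen to your CPU fan, a small polling loop can make the progress visible. This is a sketch under assumptions: it uses the _count endpoint instead of _stats because it returns a single number, it assumes the index is called wikipedia on localhost:9200, and the "count stopped changing" rule is just a heuristic, not something the plugin reports. The function names are made up.

```shell
#!/bin/sh
# Assumptions: node address and index name (override via environment).
ES_HOST="${ES_HOST:-localhost:9200}"
INDEX="${INDEX:-wikipedia}"

# Read a _count response on stdin and print the number after "count".
parse_count() {
    sed -n 's/.*"count"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p' | head -n 1
}

# Ask Elasticsearch how many documents the index currently holds.
doc_count() {
    curl -s "http://$ES_HOST/$INDEX/_count" | parse_count
}

# Print the document count every $1 seconds (default 60) and stop
# once two consecutive polls return the same value (heuristic only).
watch_indexing() {
    interval="${1:-60}"
    prev=-1
    while :; do
        now=$(doc_count)
        echo "$(date '+%H:%M:%S') documents indexed: $now"
        [ "$now" = "$prev" ] && break
        prev=$now
        sleep "$interval"
    done
}
```

Run watch_indexing 120 in a terminal and leave it there; pick an interval long enough that a slow batch is not mistaken for the end of indexing.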

Til next time,
noppanit at 00:00