30 Dec 2012

First experiment on Naive Bayes with NLTK

I have been experimenting with Natural Language Processing for text classification for a while now, so I’m going to keep a little journal on my blog. There are lots of academic papers and even commercial APIs on the Internet for sentiment analysis, but most of them only classify sentiment as negative, positive or neutral. My experiment is based on Plutchik’s wheel of emotions and classifies a text into one of the eight emotions.

To get things done quickly, I used the example from Laurent’s blog. You can also use nltk-trainer to train the classifier without writing a single line of Python code.

Most papers suggest that the bag-of-words model is one of the best techniques to classify text, so I decided to use it. However, since this is about sentiment analysis, I first used only adjectives for feature extraction. The result was unacceptable: only 19.7530864198%.

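for word, pos_tag in nltk.pos_tag(words):
    # note: with the default Penn Treebank tagset, adjectives are tagged 'JJ', 'JJR', 'JJS', not 'ADJ'
    if pos_tag == 'ADJ':
        adjectives.append(word)  # `adjectives` is an illustrative list, not in the original snippet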
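
As a rough sketch of adjective-only feature extraction, the function below takes (word, tag) pairs in the shape nltk.pos_tag returns and keeps only the adjectives. The name adjective_features and the sample pairs are made up for illustration. It assumes Penn Treebank tags, where adjectives are JJ, JJR and JJS, and also accepts the universal tag ADJ.

```python
def adjective_features(tagged_words):
    """Build a {word: True} feature dict from (word, tag) pairs, keeping adjectives only."""
    features = {}
    for word, tag in tagged_words:
        # Penn Treebank adjective tags start with 'JJ'; 'ADJ' is the universal-tagset label.
        if tag.startswith('JJ') or tag == 'ADJ':
            features[word.lower()] = True
    return features


# Hand-written sample in the shape nltk.pos_tag returns (not real tagger output).
sample = [('What', 'WP'), ('a', 'DT'), ('Lovely', 'JJ'), ('day', 'NN')]
print(adjective_features(sample))
```

Checking which tags the tagger actually emits is worth doing first: if no tag ever equals the one being tested, the feature set ends up empty.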

For the second attempt I fell back to the plain bag-of-words model, and the result went up to 61.316872428%.

filtered_words = [e.lower() for e in words.split() if len(e) >= 3]
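
To show where that filter might sit, here is a minimal sketch that wraps it in a function and turns the surviving words into a feature dictionary, the {feature: value} shape that NLTK classifiers are trained on when paired with a label. The function name bag_of_words and the min_length parameter are my own for illustration, not part of the original code.

```python
def bag_of_words(text, min_length=3):
    """Lower-case the whitespace tokens of at least min_length characters and mark each as present."""
    filtered_words = [e.lower() for e in text.split() if len(e) >= min_length]
    return {word: True for word in filtered_words}


print(bag_of_words("I am SO happy today"))
```

Each word simply becomes a boolean "is present" feature, so word order and counts are thrown away.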

Next, I tried cleaning up the text a bit: removing stopwords, stripping ‘RT’ or ‘rt’ for retweets, deleting @peoplename mentions and tokenising on whitespace, so that “i’m” stays as one word rather than [“i”, “‘m”]. The result went up to 69.9588477366%.

# Stripping and cleaning.
# Strip out the stopwords using the NLTK Text Corpora (http://nltk.googlecode.com/svn/trunk/doc/book/ch02.html)
from nltk.corpus import stopwords

english_stopwords = set(stopwords.words('english'))  # build the set once instead of once per token
stripped_words = [w for w in tokenized_text if w not in english_stopwords]
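
Put together, the clean-up steps could look roughly like the sketch below. The function name clean_tweet, the lower-casing and the tiny STOPWORDS set are assumptions for illustration; the set is only a stand-in for stopwords.words('english') so that the example runs without downloading the NLTK corpus.

```python
# Tiny stand-in stopword list (an assumption); the post uses NLTK's English stopwords.
STOPWORDS = {'the', 'a', 'is', 'to', 'and', 'so'}


def clean_tweet(text):
    """Tokenise on whitespace, then drop retweet markers, @mentions and stopwords."""
    tokens = []
    for token in text.split():
        lowered = token.lower()
        if lowered == 'rt':          # retweet marker, 'RT' or 'rt'
            continue
        if token.startswith('@'):    # @peoplename mention
            continue
        if lowered in STOPWORDS:
            continue
        tokens.append(lowered)
    return tokens


print(clean_tweet("RT @someone: i'm so happy the sun is out"))
```

Splitting on whitespace is what keeps "i'm" as a single token here; a tokenizer that splits contractions would produce two tokens instead.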

I’ll keep experimenting and post more techniques to see whether I can get something more out of this.

Til next time,
noppanit at 00:00