Thursday, June 13, 2013

Quick-and-dirty Sentiment Analysis in Ruby + R

Sentiment analysis is a hot topic these days, and it is easy to see why. The idea that one could mine a bunch of Twitter drivel in order to guesstimate the popularity of a topic, company or celebrity must have induced seizures in marketing departments across the globe.

All the more so because, given the right tools, it's not all that hard.

The R Text Mining package (tm) can be used to perform rather painless sentiment analysis on choice topics.

The Web Mining plugin (tm.plugin.webmining) can be used to query a search engine and build a corpus of the documents in the results:

corpus <- WebCorpus(YahooNewsSource('drones'))

The corpus is a standard tm corpus object, meaning it can be passed to other tm plugins without a problem.

One of the more interesting plugins that can be fed a corpus object is the Sentiment Analysis plugin (tm.plugin.sentiment):

corpus <- score(corpus)
sent_scores <- meta(corpus)

The score() method performs sentiment analysis on the corpus, and stores the results in the metadata of the corpus R object. Examining the output of the meta() call will display these scores:

     MetaID     polarity         subjectivity     pos_refs_per_ref  neg_refs_per_ref 
 Min.   :0   Min.   :-0.33333   Min.   :0.02934   Min.   :0.01956   Min.   :0.00978  
 1st Qu.:0   1st Qu.:-0.05263   1st Qu.:0.04889   1st Qu.:0.02667   1st Qu.:0.02266  
 Median :0   Median : 0.06926   Median :0.06767   Median :0.03009   Median :0.02755  
 Mean   :0   Mean   : 0.04789   Mean   :0.06462   Mean   :0.03343   Mean   :0.03118  
 3rd Qu.:0   3rd Qu.: 0.15862   3rd Qu.:0.07579   3rd Qu.:0.03981   3rd Qu.:0.03526  
 Max.   :0   Max.   : 0.37778   Max.   :0.10145   Max.   :0.06280   Max.   :0.05839  
             NA's   : 2.00000   NA's   :2.00000   NA's   :2.00000   NA's   :2.00000  
 Min.   :-0.029197  
 1st Qu.:-0.002451  
 Median : 0.003501  
 Mean   : 0.002248  
 3rd Qu.: 0.009440  
 Max.   : 0.026814  
 NA's   : 2.000000 

These sentiment scores are based on the Lydia/TextMap system, and are explained in the TestMap paper as well as in the tm.plugin.sentiment presentation:

  • polarity (p - n / p + n) : difference of positive and negative sentiment references / total number of sentiment references
  • subjectivity (p + n / N) :  total number of sentiment references / total number of references
  • pos_refs_per_ref (p / N) : total number of positive sentiment references / total number of references
  • neg_refs_per_ref (n / N) : total number of negative sentiment references / total number of references
  • senti_diffs_per_ref  (p - n / N) :  difference of positive and negative sentiment references  / total number of references                                                        
The pos_refs_per_ref and neg_refs_per_ref are the rate at which positive and negative references occur in the corpus, respectively (i.e., "x out of n textual references were positive/negative"). The polarity metric is used to determine the bias (positive or negative) of the text, while the subjectivity metric is used to determine the rate at which biased (i.e. positive or negative) references occur in the text.

The remaining metric, senti_diffs_per_ref, is a combination of polarity and subjectivity: it determines the bias of the text in relation to the size of the text (actually, number of references in the text) as a whole. This is likely to be what most people expect the output of a sentiment analysis to be, but it may be useful to create a ratio of pos_refs_per_ref to neg_refs_per_ref.

Having some R code to perform sentiment analysis is all well and good, but it doesn't make for a decent command-line utility. For that, it is useful to call R from within Ruby. The rsruby gem can be used to do this.

# initialize R
ENV['R_HOME'] ||= '/usr/lib/R'
r = RSRuby.instance

# load TM libraries

# perform search and sentiment analysis
r.eval_R("corpus <- WebCorpus(YahooNewsSource('drones'))")
r.eval_R('corpus <- score(corpus)')

# output results
scores = r.eval_R('meta(corpus)')
puts scores.inspect

The output of the last eval_R command is a Hash corresponding to the sent_scores dataframe in the R code.

Naturally, in order for this to be anything but a throwaway script, there has to be some decent command line parsing, maybe an option to aggregate or summarize the results, and of course some sort of output formatting.

As usual, the source code for such a utility has been uploaded to GitHub:

Usage: sentiment_for_symbol.rb TERM [...]
Perform sentiment analysis on a web query for keyword

Google Engines:
    -b, --google-blog                Include Google Blog search
    -f, --google-finance             Include Google Finance search
    -n, --google-news                Include Google News search
Yahoo Engines:
    -F, --yahoo-finance              Include Yahoo Finance search
    -I, --yahoo-inplay               Include Yahoo InPlay search
    -N, --yahoo-news                 Include Yahoo News search
Summary Options:
    -m, --median                     Calculate median
    -M, --mean                       Calculate mean
Output Options:
    -p, --pipe-delim                 Print pipe-delimited table output
    -r, --raw                        Serialize output as a Hash, not an Array
Misc Options:
    -h, --help                       Show help screen

1 comment:

  1. Thank you for posting the great content ... I was looking for something like this ...
    I found it quiet interesting, hopefully you will keep posting such blogs .... Keep sharing.