Sentiment analysis is a hot topic these days, and it is easy to see why. The idea that one could mine a bunch of Twitter drivel in order to guesstimate the popularity of a topic, company or celebrity must have induced seizures in marketing departments across the globe.
All the more so because, given the right tools, it's not all that hard.
The R Text Mining package (tm) can be used to perform rather painless sentiment analysis on choice topics.
The Web Mining plugin (tm.plugin.webmining) can be used to query a search engine and build a corpus of the documents in the results:
library(tm.plugin.webmining)
corpus <- WebCorpus(YahooNewsSource('drones'))
The corpus is a standard tm corpus object, meaning it can be passed to other tm plugins without a problem.
One of the more interesting plugins that can be fed a corpus object is the Sentiment Analysis plugin (tm.plugin.sentiment):
library(tm.plugin.sentiment)
corpus <- score(corpus)
sent_scores <- meta(corpus)
The score() method performs sentiment analysis on the corpus and stores the results in the metadata of the corpus R object. Calling summary() on the extracted metadata displays these scores:
summary(sent_scores)
MetaID polarity subjectivity pos_refs_per_ref neg_refs_per_ref
Min. :0 Min. :-0.33333 Min. :0.02934 Min. :0.01956 Min. :0.00978
1st Qu.:0 1st Qu.:-0.05263 1st Qu.:0.04889 1st Qu.:0.02667 1st Qu.:0.02266
Median :0 Median : 0.06926 Median :0.06767 Median :0.03009 Median :0.02755
Mean :0 Mean : 0.04789 Mean :0.06462 Mean :0.03343 Mean :0.03118
3rd Qu.:0 3rd Qu.: 0.15862 3rd Qu.:0.07579 3rd Qu.:0.03981 3rd Qu.:0.03526
Max. :0 Max. : 0.37778 Max. :0.10145 Max. :0.06280 Max. :0.05839
NA's : 2.00000 NA's :2.00000 NA's :2.00000 NA's :2.00000
senti_diffs_per_ref
Min. :-0.029197
1st Qu.:-0.002451
Median : 0.003501
Mean : 0.002248
3rd Qu.: 0.009440
Max. : 0.026814
NA's : 2.000000
These sentiment scores are based on the Lydia/TextMap system, and are explained in the TextMap paper as well as in the tm.plugin.sentiment presentation:
- polarity ((p - n) / (p + n)) : difference of positive and negative sentiment references, divided by the total number of sentiment references
- subjectivity ((p + n) / N) : total number of sentiment references, divided by the total number of references
- pos_refs_per_ref (p / N) : number of positive sentiment references, divided by the total number of references
- neg_refs_per_ref (n / N) : number of negative sentiment references, divided by the total number of references
- senti_diffs_per_ref ((p - n) / N) : difference of positive and negative sentiment references, divided by the total number of references
The pos_refs_per_ref and neg_refs_per_ref metrics are the rates at which positive and negative references occur in the corpus, respectively (i.e., "x out of N textual references were positive/negative"). The polarity metric determines the bias (positive or negative) of the text, while the subjectivity metric determines the rate at which biased (i.e. positive or negative) references occur in the text.
The remaining metric, senti_diffs_per_ref, combines polarity and subjectivity: it measures the bias of the text in relation to the size of the text (more precisely, the number of references in the text) as a whole. This is likely to be what most people expect sentiment analysis to output, although a ratio of pos_refs_per_ref to neg_refs_per_ref may also prove useful.
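To make the arithmetic concrete, here is a small Ruby sketch that computes the five metrics from per-document reference counts. The counts (p_refs, n_refs, total) are invented purely for illustration:

```ruby
# Hypothetical counts for a single document:
# p_refs = positive sentiment references, n_refs = negative,
# total  = all references in the document
p_refs = 12
n_refs = 8
total  = 400

polarity            = (p_refs - n_refs).to_f / (p_refs + n_refs)  # bias among sentiment refs
subjectivity        = (p_refs + n_refs).to_f / total              # rate of biased references
pos_refs_per_ref    = p_refs.to_f / total
neg_refs_per_ref    = n_refs.to_f / total
senti_diffs_per_ref = (p_refs - n_refs).to_f / total              # bias relative to text size

puts "polarity:            #{polarity}"              # 0.2
puts "subjectivity:        #{subjectivity}"          # 0.05
puts "senti_diffs_per_ref: #{senti_diffs_per_ref}"   # 0.01
```

Note how a document can be strongly polarized (polarity 0.2) yet barely subjective (subjectivity 0.05): most of its references carry no sentiment at all.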
Having some R code to perform sentiment analysis is all well and good, but it doesn't make for a decent command-line utility. For that, it is useful to call R from within Ruby. The rsruby gem can be used to do this.
require 'rsruby'

# initialize R
ENV['R_HOME'] ||= '/usr/lib/R'
r = RSRuby.instance
# load TM libraries
r.eval_R("suppressMessages(library('tm.plugin.webmining'))")
r.eval_R("suppressMessages(library('tm.plugin.sentiment'))")
# perform search and sentiment analysis
r.eval_R("corpus <- WebCorpus(YahooNewsSource('drones'))")
r.eval_R('corpus <- score(corpus)')
# output results
scores = r.eval_R('meta(corpus)')
puts scores.inspect
The output of the last eval_R command is a Hash corresponding to the sent_scores dataframe in the R code.
Naturally, in order for this to be anything but a throwaway script, there has to be some decent command line parsing, maybe an option to aggregate or summarize the results, and of course some sort of output formatting.
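A summarizing step might look like the following Ruby sketch. The shape of the scores Hash (metric name mapped to an array of per-document values, with nil standing in for R's NA) is an assumption based on how rsruby typically converts data frames; adjust to match what eval_R actually returns:

```ruby
# Summarize a Hash of sentiment scores: for each metric, drop NA (nil)
# entries and compute the mean and median of the remaining values.
def summarize(scores)
  scores.each_with_object({}) do |(metric, values), summary|
    nums = Array(values).compact          # drop NA/nil entries
    next if nums.empty?
    mean   = nums.sum.to_f / nums.size
    sorted = nums.sort
    mid    = sorted.size / 2
    median = sorted.size.odd? ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2.0
    summary[metric] = { 'mean' => mean, 'median' => median }
  end
end

# Example with made-up polarity scores for four documents (one NA):
scores = { 'polarity' => [0.1, -0.05, 0.3, nil] }
puts summarize(scores).inspect
```

Dropping the nil entries first mirrors R's na.rm=TRUE behavior, which the summary() output above (with its "NA's" rows) shows is necessary for this data.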
As usual, the source code for such a utility has been uploaded to GitHub: https://github.com/mkfs/sentiment-analysis
Usage: sentiment_for_symbol.rb TERM [...]
Perform sentiment analysis on a web query for keyword
Google Engines:
-b, --google-blog Include Google Blog search
-f, --google-finance Include Google Finance search
-n, --google-news Include Google News search
Yahoo Engines:
-F, --yahoo-finance Include Yahoo Finance search
-I, --yahoo-inplay Include Yahoo InPlay search
-N, --yahoo-news Include Yahoo News search
Summary Options:
-m, --median Calculate median
-M, --mean Calculate mean
Output Options:
-p, --pipe-delim Print pipe-delimited table output
-r, --raw Serialize output as a Hash, not an Array
Misc Options:
-h, --help Show help screen