Sunday, March 4, 2018
Downgrading R 3.4 to 3.3 on Debian
Apparently the current crop of R developers was never told about the standard policy of never breaking backwards compatibility on a minor version increment.
Put simply, packages available for 3.3 do not work in 3.4. I'm sure there is some perfectly justifiable reason for making the release 3.4 instead of 4.0, while breaking all existing packages (until manually updated by the maintainers), but guess what: it's bollocks.
Anyways, to forestall a long rant about disregarding long-held policy due to historical ignorance (*cough* Uber), here is how to downgrade 3.4 to 3.3 on Debian:
1. add stretch to /etc/apt/sources.list
deb http://ftp.us.debian.org/debian/ stretch main non-free contrib
deb-src http://ftp.us.debian.org/debian/ stretch main non-free contrib
2. apt-get update
apt-cache showpkg r-base should show version 3.3.3-1
3. apt-get remove r-base r-base-core r-base-html r-base-dev
4. go to https://packages.debian.org/stretch/r-base-html to get version number (3.3.3-1)
also https://packages.debian.org/stretch/r-recommended for version numbers for dependencies
5. now do the big install:
sudo apt-get install r-base=3.3.3-1 r-base-core=3.3.3-1 \
r-recommended=3.3.3-1 r-base-dev=3.3.3-1 r-base-html=3.3.3-1 \
r-cran-boot=1.3-18-2 r-cran-class=7.3-14-1 r-cran-cluster=2.0.5-1 \
r-cran-codetools=0.2-15-1 r-cran-foreign=0.8.67-1 \
r-cran-kernsmooth=2.23-15-2 r-cran-lattice=0.20-34-1 \
r-cran-mass=7.3-45-1 r-cran-matrix=1.2-7.1-1 r-cran-mgcv=1.8-16-1 \
r-cran-nlme=3.1.129-1 r-cran-nnet=7.3-12-1 r-cran-rpart=4.1-10-2 \
r-cran-spatial=7.3-11-1 r-cran-survival=2.40-1-1
6. prevent R from updating again, because its developers obviously cannot be trusted:
sudo apt-mark hold r-base r-base-core
7. comment out the lines added to sources.list in step 1
In related news, here is how to downgrade an R package to a specific version number. The example uninstalls the package 'BH' and replaces it with version 1.62.0. As with most package management in R these days, you need the devtools package installed.
remove.packages('BH')
library(devtools)
install_version('BH', '1.62.0')
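To confirm that both downgrades took, a quick check from a fresh R session (illustrative only):
getRversion()         # should now report '3.3.3'
packageVersion('BH')  # should now report '1.62.0'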
Wednesday, November 5, 2014
Creating an R list using RsRuby
For the most part, rsruby works as advertised. Where things blow up unexpectedly is when using a Ruby Hash to create an R List object.
irb > require 'rsruby'
irb > ENV['R_HOME'] ||= '/usr/lib/R'
irb > $R = RSRuby.instance
irb > $R.assign('test.x', { :a => 1, :b => "abc", :c => [8,9] } )
RException: Error in (function (x, value, pos = -1, envir = as.environment(pos), inherits = FALSE, :
unused arguments (a = 1, b = "abc", c = 8:9)
This happens because rsruby treats a trailing Hash as a collection of keyword arguments to the R assign() function. All that metaprogramming magic ain't free, y'know?
The solution is to wrap the data Hash inside an outer Hash that maps it to the appropriate keyword argument of the R function.
A quick look at the R help file for assign() shows that it has the following signature:
assign(x, value, pos = -1, envir = as.environment(pos),
inherits = FALSE, immediate = TRUE)
This means that the Hash containing the R List data will have to be passed as the value argument to the assign() call.
$R.assign('test.x', { :value => { :a => 1, :b => "abc", :c => [8,9] } } )
ArgumentError: Unsupported object ':a' passed to R.
Of course, R cannot handle Symbols unless they are the names of function keyword arguments. This is easy to fix.
irb > $R.assign('test.x', { :value => { 'a' => 1, 'b' => "abc", 'c' => [8,9] } } )
=> {"a"=>1, "b"=>"abc", "c"=>[8, 9]}
irb > $R.eval_R("print(test.x)")
$a
[1] 1
$b
[1] "abc"
$c
[1] 8 9
=> {"a"=>1, "b"=>"abc", "c"=>[8, 9]}
All's well that ends well!
Thursday, March 20, 2014
Custom tick labels in R perspective plots
In R, persp() is a built-in function for creating surface plots. The basic usage is straightforward: create a matrix of values and pass it in:
# plot a 10x10 matrix of random values in the range -100..100:
persp( matrix(runif(100, min=-100, max=100), nrow=10, ncol=10) )
All well and good, until it's time to prepare the plots for presentation -- and suddenly it becomes apparent that plots created with persp() do not work well with axis(), text(), mtext(), par(), and other standard graphics device functions.
The trans3d documentation refers the reader to the persp documentation for examples; those examples are too convoluted to serve any useful educational purpose. A quick note to documentation writers: always include an example showing the simplest possible use of your function on trivial data sets (usually the array of integers from 1 to 10, or sin(x) if a function is required). Do not use only edge cases and exciting demos as examples.
The discussion that follows will demonstrate how to construct a perspective plot with custom labels using persp() and trans3d(). The data to be plotted is a 10x10 matrix of values in the range -100:100.
The first thing to understand is that persp() does not just draw a plot; it also returns a perspective matrix (or pmat) which can be used to translate 3-dimensional coordinates to the 2-dimensional coordinate system used in the image of the plot.
The function that performs this translation is trans3d(). Its arguments are the x, y, and z coordinates to be translated, followed by the pmat. The return value is a list with two elements: x and y, the two-dimensional coordinates in the image.
If one of the x, y, and z arguments is a vector, then the vector is considered to be a line at the other two coordinates. Thus, a line along the X axis from (0, 10, 10) to (10, 10, 10) would be translated using trans3d(0:10, 10, 10, pmat); a line along the Y axis from (0, 3, 10) to (0, 7, 10) would be translated using trans3d(0, 3:7, 10, pmat).
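As a minimal illustration (assuming pmat has been obtained from a persp() call like the one in the next section), a single 3D point can be translated and marked on the plot:
pt <- trans3d(5, 5, 0, pmat)  # point (x=5, y=5, z=0) -> 2D image coordinates
points(pt$x, pt$y, pch=19, col='red')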
Enough background; time for an example.
Basic Perspective Plot
First, some definitions of the data ranges to keep things clear:
x.axis <- 1:10
min.x <- 1
max.x <- 10
y.axis <- 1:10
min.y <- 1
max.y <- 10
z.axis <- seq(-100, 100, by=25)
min.z <- -100
max.z <- 100
Pay particular attention to z.axis: in addition to specifying the range of each axis, the *.axis variables also specify the tick marks of each axis.
Next, draw the initial perspective plot, saving the pmat:
pmat <- persp( x=x.axis, y=y.axis,
matrix(runif(100, min=-100, max=100), nrow=10, ncol=10),
xlab='', ylab='', zlab='',
ticktype='detailed', box=FALSE, axes=FALSE,
mar=c(10, 1, 0, 2), expand=0.25,
col='green', shade=0.25, theta=40, phi=30 )
Note the theta (rotation about the vertical axis) and phi (rotation about the horizontal axis) parameters. It is useful to play with these a bit, as different data sets will require different viewing angles. The r ("eyepoint distance") and d ("perspective strength") parameters provide further control of the view. Note also that the box and axes parameters are FALSE: we will be drawing our own axes.
Drawing the Axes
In this plot, the X axis will be drawn at min.y and min.z (left side of Y, bottom of Z), Y at max.x and min.z (right side of X, bottom of Z), and Z at min.x and min.y (left side of X, left side of Y).
These parameters are passed to trans3d() to calculate the coordinates of a line at each axis, as described previously. The translated coordinates can be passed directly to lines().
lines(trans3d(x.axis, min.y, min.z, pmat) , col="black")
lines(trans3d(max.x, y.axis, min.z, pmat) , col="black")
lines(trans3d(min.x, min.y, z.axis, pmat) , col="black")
Drawing Tick Marks
Adding tick marks requires calculating the position of a second line, parallel to the axis, and using segments() to draw ticks that span the distance between the axis and the second line. The basic procedure is as follows:
tick.start <- trans3d(x.axis, min.y, min.z, pmat)
tick.end <- trans3d(x.axis, (min.y - 0.20), min.z, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)
Note the (min.y - 0.20) in the calculation of tick.end. This places the second line, parallel to the X axis, 0.20 units below min.y on the Y axis (i.e., just outside the plotted area), so the tick marks span the gap between the axis and that line.
The tick marks on the Y and Z axes can be handled similarly:
tick.start <- trans3d(max.x, y.axis, min.z, pmat)
tick.end <- trans3d(max.x + 0.20, y.axis, min.z, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)
tick.start <- trans3d(min.x, min.y, z.axis, pmat)
tick.end <- trans3d(min.x, (min.y - 0.20), z.axis, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)
Adding Tick Mark Labels
The final step is to label the ticks on each axis. Once again, the procedure is to calculate the position of a line, parallel to the axis, at the position where the labels are to be displayed:
labels <- c('first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth')
label.pos <- trans3d(x.axis, (min.y - 0.25), min.z, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(0, NA), srt=270, cex=0.5)
The adj=c(0, NA) expression is used to left-justify the labels, the srt=270 expression is used to rotate the labels 270°, and the cex=0.5 expression is used to scale the label text to 50% of its original size.
The labels on the Y and Z axes are produced similarly:
labels <- c('alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa')
label.pos <- trans3d((max.x + 0.25), y.axis, min.z, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(0, NA), cex=0.5)
labels <- as.character(z.axis)
label.pos <- trans3d(min.x, (min.y - 0.5), z.axis, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(1, NA), cex=0.5)
Note that the Y and Z axis tick labels do not need to be rotated.
The Final Product
Saturday, February 15, 2014
Sending email from R using Gmail
As mentioned previously, the sendmailR package generally works well for sending an email from within R. When an ISP blocks traffic on port 25, however, sendmailR cannot be used unless a local mailserver is configured to act as a relay. This means that sendmailR cannot reliably send email on arbitrary machines and across arbitrary network connections.
There is gmailR, of course, but it requires rJython, and using Python through R via Java is just too many levels of indirection for something as simple as sending an email.
Instead, it should be possible to use Curl to send an email by directly connecting to Gmail.
The RCurl package should be able to provide this. According to the SSL SMTP example, an email (subject + body) can be uploaded with code (converted from C) such as the following:
library(RCurl)
rmail.rcurl.read.function <- function(x) return(x)
gmail.rcurl.send <- function( username, password, to.email, subject, body ) {
email.data <- paste(
paste('Subject:', subject),
'', body, '', sep="\r\n")
curl <- getCurlHandle()
curlSetOpt( .opts=list(
"useragent"='Mozilla 5.0',
"use.ssl"=3, # AKA CURLUSESSL_ALL
"username"=username,
"password"=password,
"readdata"=email.data,
"mail.from"=username,
"mail.rcpt"=to.email,
"readfunction"=rmail.rcurl.read.function,
"upload"=TRUE
), curl=curl )
getURL('smtp://smtp.gmail.com:587', curl=curl, verbose=TRUE)
}
Unfortunately, this crashes R -- it appears to be a bug in RCurl, possibly due to a lack of SMTP support (a call to curlVersion() shows that SMTP and SMTPS are supported, but listCurlOptions() does not include mail.from or mail.rcpt).
Instead, Curl must be called directly via system. Sure, this is ugly, and yes, system should never be used, but it was RCurl that drove us to this. Remember to sanitize your inputs (in this case, the email addresses and the password)!
gmail.curl.send <- function( username, password, to.email, subject, body ) {
email.data <- paste(
paste('Subject:', subject),
'', body, '', sep="\r\n")
# IMPORTANT: username, password, and to.email must be cleaned!
cmd <- paste('curl -n --ssl-reqd --mail-from "<',
username,
'>" --mail-rcpt "<',
to.email,
'<" --url ',
'smtp://smtp.gmail.com:587',
' --user "', username, ':', password,
'" -T -', sep='')
system(cmd, input=email.data)
}
This works correctly, as one would expect: it's hard to go wrong with a simple shell call to Curl. Note the use of the input parameter in the system call: this creates a temp file with the email contents, which Curl then uploads to Gmail using the -T flag.
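A hypothetical invocation (the addresses and password here are placeholders; Gmail typically requires an app-specific password for this kind of login):
gmail.curl.send('me@gmail.com', 'app-password',
                'friend@example.com', 'test from R',
                'Sent via a shell call to curl.')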
Friday, February 7, 2014
Disasm Wordcloud
In a recent discussion of possible applications of R to binary analysis, the usual visualizations (byte entropy, size of basic blocks, number of times a function is called during a trace, etc.) came to mind. Past experiments with tm.plugin.webmining, however, also raised the following question: Why not use the R textmining packages to generate a wordcloud from a disassembled binary?
Why not, indeed.
The objdump disassembler can be used to generate a list of terms from a binary file. The template Ruby code for generating a list of terms is a simple wrapper around objdump:
# generate a space-delimited string of terms occurring in target at 'path'
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
# ...extract terms from line and append to arr...
arr
}.join(" ")
The R code for generating wordclouds has been covered before. The code for disassembly terms can be simpler, as the terms have already been extracted from the raw text (disassembly):
library('tm')
library('wordcloud')
# term occurrences must be in variable "terms"
corpus <- Corpus(VectorSource(terms))
tdm <- TermDocumentMatrix(corpus)
vec <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
df <- data.frame(word=names(vec), freq=vec)
# output file path must be in variable "img_path"
png(file=img_path)
# minimum frequency should be higher than 1 if there are many terms
wordcloud(df$word, df$freq, min.freq=1)
dev.off()
The most interesting terms in a binary are the library functions that are invoked. The following regex will extract the symbol name from call instructions:
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
arr << $1 if line =~ /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/
arr
}
When run on /usr/bin/xterm, this generates the following wordcloud:
The other obvious terms in a binary are the instruction mnemonics. The following regex will extract the instruction mnemonics from an objdump disassembly:
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
arr << $1 if line =~ /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/
arr
}
When run on /usr/bin/xterm, this generates the following wordcloud:
Of course, there is always the possibility of generating a wordcloud from the ASCII strings in a binary. The following Ruby code is a crude attempt at creating a terms string from the output of the strings command:
terms = `strings '#{path}'`.gsub(/[[:punct:]]/, '').lines.to_a.join(' ')
When run on /usr/bin/xterm, this generates the following wordcloud:
Not as nice as the others, but some pre-processing of the strings output would clear that up.
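One possible pre-processing step, done on the R side before building the corpus (the length cutoff and pattern here are arbitrary choices):
# keep only alphabetic words of a reasonable length
words <- unlist(strsplit(terms, '\\s+'))
words <- words[nchar(words) >= 4 & grepl('^[A-Za-z]+$', words)]
terms <- paste(words, collapse=' ')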
There is, of course, a github for the code. Note that the implementation is in Ruby, using the rsruby gem to interface with R.
Wednesday, February 5, 2014
Finding stock symbols by industry in R
The quantmod package is fantastic, but it has one shortcoming: there is no facility for retrieving information about a specific industry (e.g., "is the entire industry on a downward trend, or just this company?").
Yahoo Finance provides this information via its CSV API; this means it should be easy to retrieve from within R. Details about the API have been provided by the C# yahoo-finance-managed project.
The first step is to get a list of all possible sectors. This is a straightforward Curl download and CSV parse:
library(RCurl)
get.sectors <- function() {
url <- 'http://biz.yahoo.com/p/csv/s_conameu.csv'
csv <- rawToChar(getURLContent(url, binary=TRUE))
df <- read.csv(textConnection(csv))
# sector ID is its index in this alphabetical list
df$ID <- 1:nrow(df)
return(df)
}
Note the use of textConnection() to parse an in-memory string instead of an on-disk file. The binary=TRUE flag causes Curl to return a "raw" object which is converted to a character vector by the rawToChar() call; this is necessary because the CSV file ends with a NULL byte.
The next step is to fetch a list of the industries in each sector. At first, this seems to be straightforward:
get.sector.industries <- function( sector ) {
url <- paste('http://biz.yahoo.com/p/csv',
paste(as.integer(sector), 'conameu.csv', sep=''),
sep='/')
csv <- rawToChar(getURLContent(url, binary=TRUE))
df <- read.csv(textConnection(csv))
# fix broken Industry names
df[,'Industry'] <- gsub(' +', ' ', df[,'Industry'])
# default (incorrect) ID column
df$ID <- (sector * 100) + 1:nrow(df)
df$Sector <- sector
return(df)
}
Unfortunately, there is one problem: the industry IDs are not based on the index value. In fact, there does not seem to be a way to obtain the industry IDs using the Yahoo Finance API, which appears to be a pretty egregious oversight.
Yahoo Finance provides an alphabetical list of industries in all sectors; the URL for each industry entry contains its ID. This means that the page can be parsed in order to build a list of industries and their IDs.
The code is a little hairy, involving a couple of XPath queries to extract the URLs and their descriptions:
library(XML)
get.industry.ids <- function() {
html <- htmlParse('http://biz.yahoo.com/ic/ind_index_alpha.htm')
# extract description from A tags
html.names <- as.vector(xpathSApply(html, "//td/a/font", xmlValue))
# extract URL from A tags
html.urls <- as.vector(xpathSApply(html, "//td/a/font/../@href"))
if (length(html.names) != length(html.urls)) {
warning(paste("Got", length(html.names), "names but",
length(html.urls), "URLs"))
}
html.names <- gsub("\n", " ", html.names)
html.urls <- gsub("http://biz.yahoo.com/ic/([0-9]+).html", "\\1", html.urls)
df <- data.frame(Name=character(length(html.urls)),
ID=numeric(length(html.urls)), stringsAsFactors=FALSE)
for (i in 1:length(html.urls)) {
url = html.urls[i]
val = suppressWarnings(as.numeric(url))
if (! is.na(val) ) {
df[i,'Name'] = html.names[i]
df[i,'ID'] = val
}
}
return(df)
}
In this function, htmlParse() was used to download the web page instead of Curl. This is necessary because the webpage contains one or more non-trailing NULL bytes; rawToChar() can only strip trailing NULL bytes. The parser in htmlParse() is able to handle the NULL bytes just fine.
With this function, the IDs of industries can be set as follows:
df <- get.sector.industries( sector.id )
id.df <- get.industry.ids()
for (i in 1:nrow(id.df)) {
name <- id.df[i, 'Name']
if (nrow(df[df$Industry == name,]) > 0) {
df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
}
}
It is now possible to build a dataframe that contains the industries of all of the sectors:
df.sectors <- get.sectors()
id.df <- get.industry.ids()
df.industries <- NULL
for (id in df.sectors$ID) {
  df <- get.sector.industries(id)
  for (i in 1:nrow(id.df)) {
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
  }
  if (is.null(df.industries)) {
    df.industries <- df
  } else {
    df.industries <- rbind(df.industries, df)
  }
}
This list is probably not going to change much, so the dataframe can be stored for reuse in an .RData object.
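Saving and restoring the dataframe is a one-liner each way (the filename is arbitrary):
save(df.industries, file='yahoo_industries.RData')
# in a later session:
load('yahoo_industries.RData')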
The final step, getting the stock symbols for a specific industry, is much more straightforward:
get.industry.symbols <- function(id) {
url <- paste('http://biz.yahoo.com/p/csv',
paste(as.integer(id), 'conameu.csv', sep=''),
sep='/')
csv <- rawToChar(getURLContent(url, binary=TRUE))
df <- read.csv(textConnection(csv))
return(df)
}
As usual, there is a github for the code.
One final note: the sector and industry data is also available via the FinViz API. Yahoo Finance was selected for this project in order to be compatible with the quantmod data.
Sunday, February 2, 2014
Daily stock symbol reports with R
This is a simple R script that uses the quantmod package to look up stock symbols on Yahoo Finance, and the sendmailR package to send an email alert if the latest stock price ("Last" in the Yahoo report) is below a "buy" threshold or above a "sell" threshold.
The input file format is tab-delimited with three columns: Symbol, BuyAt, SellAt. There is no need for a header row. For example:
AAPL 300 750
BA 65 90
...
The function that does all the work is symbol.report. This reads the input file containing buy and sell thresholds, performs a Yahoo Finance query on all symbols in the file, and generates a dataframe with the details (BuyAt, SellAt, Open, Close, Last, etc) of every symbol whose latest price is either below the buy threshold, or above the sell threshold.
library(quantmod)
symbol.report <- function(filename, header=FALSE, sep = "\t") {
watch.df <- read.delim(filename, header=header, sep=sep)
colnames(watch.df) <- c('Symbol', 'BuyAt', 'SellAt')
quote.df <- getQuote(paste(watch.df$Symbol, collapse=';'))
quote.df$Symbol <- rownames(quote.df)
df <- merge(watch.df, quote.df)
df[(df$Last <= df$BuyAt) | (df$SellAt > 0 & df$Last >= df$SellAt), ]
}
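An illustrative interactive run, assuming a watchlist file in the format above:
df <- symbol.report('/home/me/monitored_symbols.dat')
df[, c('Symbol', 'BuyAt', 'SellAt', 'Last')]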
The symbol.report function is invoked by symbol.alert, which will send an email to the provided address if the dataframe returned by symbol.report is not empty. If an email address is not provided, the dataframe will be printed to STDOUT.
library(sendmailR)
symbol.alert <- function(filename, email=NULL, verbose=FALSE) {
df <- symbol.report(filename)
if (nrow(df) == 0) {
return(df)
}
if ( is.null(email) ) {
print(df)
} else {
sendmail(# from: fake email address
paste('<', "r.script@nospam.org", '>', sep=''),
# to: provided email address
paste('<', email, '>', sep=''),
# subject
"SYMBOL ALERT",
# body
capture.output(print(df, row.names=FALSE)),
# SMTP server (gmail)
control=list(smtpServer='ASPMX.L.GOOGLE.COM'),
verbose=verbose)
}
return(df)
}
A few things to note here:
* a fake email address is used as the From address, allowing easy filtering of these emails
* the SMTP server used is the GMail server, which may not be appropriate for some users
This function can be called from a shell script in a cron job, invoking R with the --vanilla option:
R --vanilla -e "source('/home/me/symbol.alert.R'); symbol.alert('/home/me/monitored_symbols.dat', 'me@gmail.com')"
And again, there is a github for the code.
Thursday, July 25, 2013
Including binary files in an R package
The R package format provides support for data in standard formats (.R, .Rdata, .csv) in the data/ directory. Unfortunately, data in unsupported formats (e.g. audio files, images, SQLite databases) is ignored by the package build command.
The solution, as hinted at in the manual, is to place such data in the inst/extdata/ directory:
"It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files."
Using a SQLite database file as an example, an R package can provide a default database by including the path to the built-in database as a default parameter to functions. Because the path is determined at runtime, the best solution is to include an exported function that provides the path to the built-in database:
pkg.default.database <- function() {
  system.file('extdata', 'default_db.sqlite', package='pkg')
}
In this example, the package name is pkg, and the SQLite database file is inst/extdata/default_db.sqlite.
Package functions that take a path to the SQLite database can then invoke this function as a default parameter. For example:
library(RSQLite)
pkg.fetch.rows <- function(db=pkg.default.database(), where=NULL, limit=NULL) {
  # Connect to database
  conn <- dbConnect(SQLite(), db)
  if (! dbExistsTable(conn, 'sensor_data')) {
    warning(paste('Table SENSOR_DATA does not exist in', db))
    dbDisconnect(conn)
    return(NULL)
  }
  # build query for table SENSOR_DATA
  query <- 'SELECT * FROM sensor_data'
  if (! is.null(where) ) {
    query <- paste(query, 'WHERE', where)
  }
  if (! is.null(limit) ) {
    query <- paste(query, 'LIMIT', limit)
  }
  # send query and retrieve rows as a dataframe
  ds <- dbSendQuery(conn, query)
  df <- fetch(ds, n=-1)
  # cleanup
  dbClearResult(ds)
  dbDisconnect(conn)
  return(df)
}
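A hypothetical call, using the bundled database by default and an ad-hoc WHERE clause (the column names here are placeholders):
df <- pkg.fetch.rows(where="sensor_id = 3", limit=100)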
Monday, June 17, 2013
(English) word-clouds in R
The R wordcloud package can be used to generate static images similar to tag-clouds. These are a fun way to visualize document contents, as demonstrated on the R Data Mining website and at the One R Tip A Day site.
Running the sample code from these examples on any real English prose results in lists of words that are far from satisfactory, even when using a stemmer. English is a difficult language to parse, especially when the source is nontechnical writing or, worse, a transcript. In this particular case, an entirely accurate parsing of English isn't necessary; the wordcloud generation only has to be intelligent enough to not make the viewer snort in derision.
To begin with, use the R Text Mining package to load a directory of documents to be analyzed:
library(tm)
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'))
This creates a Corpus containing all files in the directory supplied to DirSource. The files are assumed to be in plaintext; for different formats, use the Corpus readerControl argument:
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'), readerControl=list(reader=readPDF))
If the text is already loaded in R, then a VectorSource can of course be used:
wc_corpus <- Corpus(VectorSource(data_string))
Next, the text in the Corpus must be normalized. This involves the following steps:
- convert all text to lowercase
- expand all contractions
- remove all punctuation
- remove all "noise words"
The last step requires detecting what are known as "stop words": words in a language which provide no information (articles, prepositions, and extremely common words). Note that in most text processing, a fifth step would be added to stem the words in the Corpus; in generating word clouds, this produces undesirable output, as the stemmed words tend to be roots that are not recognizable as actual English words.
The following code performs these steps:
wc_corpus <- tm_map(wc_corpus, tolower)
# fix_contractions is defined later in the article
wc_corpus <- tm_map(wc_corpus, fix_contractions)
wc_corpus <- tm_map(wc_corpus, removePunctuation)
wc_corpus <- tm_map(wc_corpus, removeWords, stopwords('english'))
# Not executed: stem the words in the corpus
# wc_corpus <- tm_map(wc_corpus, stemDocument)
This code makes use of the tm_map function, which invokes a function for every document in the Corpus.
A support function is required to remove contractions from the Corpus. Note that this step must be performed before punctuation is removed, or it will be more difficult to detect contractions.
The purpose of the fix_contractions function is to expand all contractions to their "formal English" equivalents: don't to do not, we'll to we will, etc. The following function uses gsub to perform this expansion, except in the case of possessives and plurals ('s) which are simply removed.
fix_contractions <- function(doc) {
# "won't" is a special case as it does not expand to "wo not"
doc <- gsub("won't", "will not", doc)
doc <- gsub("n't", " not", doc)
doc <- gsub("'ll", " will", doc)
doc <- gsub("'re", " are", doc)
doc <- gsub("'ve", " have", doc)
doc <- gsub("'m", " am", doc)
# 's could be is or possessive: it has no expansion
doc <- gsub("'s", "", doc)
return(doc)
}
The Corpus has now been normalized, and can be used to generate a list of words along with counts of their occurrence. First, a TermDocument matrix is created; next, a Word-Frequency Vector (a list of the number of occurrences of each word) is generated. Each element in the vector is the number of occurrences for a specific word, and the name of the element is the word itself (use names(v) to verify this).
td_mtx <- TermDocumentMatrix(wc_corpus, control = list(minWordLength = 3))
v <- sort(rowSums(as.matrix(td_mtx)), decreasing=TRUE)
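A quick, purely illustrative sanity check confirms the structure of the vector:
head(v)         # highest frequency counts first
head(names(v))  # the corresponding words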
At this point, the vector is a list of all words in the document, along with their frequency counts. This can be cleaned up by removing obvious plurals (dog, dogs; address, addresses; etc), and adding their occurrence count to the singular case.
This doesn't have to be completely accurate (it's only a wordcloud, after all), and it is not necessary to convert plural words to singular if there is no singular form present. The following function will check each word in the Word-Frequency Vector to see if a plural form of that word (specifically, the word followed by s or es) exists in the Vector as well. If so, the frequency count for the plural form is added to the frequency count for the singular form, and the plural form is removed from the Vector.
aggregate.plurals <- function (v) {
aggr_fn <- function(v, singular, plural) {
if (! is.na(v[plural])) {
v[singular] <- v[singular] + v[plural]
v <- v[-which(names(v) == plural)]
}
return(v)
}
for (n in names(v)) {
n_pl <- paste(n, 's', sep='')
v <- aggr_fn(v, n, n_pl)
n_pl <- paste(n, 'es', sep='')
v <- aggr_fn(v, n, n_pl)
}
return(v)
}
The function is applied to the Word-Frequency Vector as follows:
v <- aggregate.plurals(v)
All that remains is to create a dataframe of the word frequencies, and supply that to the wordcloud function in order to generate the wordcloud image:
df <- data.frame(word=names(v), freq=v)
library(wordcloud)
wordcloud(df$word, df$freq, min.freq=3)
It goes without saying that the default R graphics device can be changed to save the file. An example for PNG output:
png(file='wordcloud.png', bg='transparent')
wordcloud(df$word, df$freq, min.freq=3)
dev.off()
The techniques used previously to create a standalone sentiment analysis command-line utility can be used in this case as well.
Thursday, June 13, 2013
Quick-and-dirty Sentiment Analysis in Ruby + R
Sentiment analysis is a hot topic these days, and it is easy to see why. The idea that one could mine a bunch of Twitter drivel in order to guesstimate the popularity of a topic, company or celebrity must have induced seizures in marketing departments across the globe.
All the more so because, given the right tools, it's not all that hard.
The R Text Mining package (tm) can be used to perform rather painless sentiment analysis on choice topics.
The Web Mining plugin (tm.plugin.webmining) can be used to query a search engine and build a corpus of the documents in the results:
library(tm.plugin.webmining)
corpus <- WebCorpus(YahooNewsSource('drones'))
The corpus is a standard tm corpus object, meaning it can be passed to other tm plugins without a problem.
One of the more interesting plugins that can be fed a corpus object is the Sentiment Analysis plugin (tm.plugin.sentiment):
library(tm.plugin.sentiment)
corpus <- score(corpus)
sent_scores <- meta(corpus)
The score() method performs sentiment analysis on the corpus, and stores the results in the metadata of the corpus R object. Examining the output of the meta() call will display these scores:
summary(sent_scores)
MetaID polarity subjectivity pos_refs_per_ref neg_refs_per_ref
Min. :0 Min. :-0.33333 Min. :0.02934 Min. :0.01956 Min. :0.00978
1st Qu.:0 1st Qu.:-0.05263 1st Qu.:0.04889 1st Qu.:0.02667 1st Qu.:0.02266
Median :0 Median : 0.06926 Median :0.06767 Median :0.03009 Median :0.02755
Mean :0 Mean : 0.04789 Mean :0.06462 Mean :0.03343 Mean :0.03118
3rd Qu.:0 3rd Qu.: 0.15862 3rd Qu.:0.07579 3rd Qu.:0.03981 3rd Qu.:0.03526
Max. :0 Max. : 0.37778 Max. :0.10145 Max. :0.06280 Max. :0.05839
NA's : 2.00000 NA's :2.00000 NA's :2.00000 NA's :2.00000
senti_diffs_per_ref
Min. :-0.029197
1st Qu.:-0.002451
Median : 0.003501
Mean : 0.002248
3rd Qu.: 0.009440
Max. : 0.026814
NA's : 2.000000
These sentiment scores are based on the Lydia/TextMap system, and are explained in the TextMap paper as well as in the tm.plugin.sentiment presentation:
- polarity ((p - n) / (p + n)) : difference of positive and negative sentiment references / total number of sentiment references
- subjectivity ((p + n) / N) : total number of sentiment references / total number of references
- pos_refs_per_ref (p / N) : total number of positive sentiment references / total number of references
- neg_refs_per_ref (n / N) : total number of negative sentiment references / total number of references
- senti_diffs_per_ref ((p - n) / N) : difference of positive and negative sentiment references / total number of references
The pos_refs_per_ref and neg_refs_per_ref are the rate at which positive and negative references occur in the corpus, respectively (i.e., "x out of n textual references were positive/negative"). The polarity metric is used to determine the bias (positive or negative) of the text, while the subjectivity metric is used to determine the rate at which biased (i.e. positive or negative) references occur in the text.
The remaining metric, senti_diffs_per_ref, is a combination of polarity and subjectivity: it determines the bias of the text in relation to the size of the text (actually, number of references in the text) as a whole. This is likely to be what most people expect the output of a sentiment analysis to be, but it may be useful to create a ratio of pos_refs_per_ref to neg_refs_per_ref.
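That ratio is easy to derive from the scores dataframe (pos_neg_ratio is a hypothetical column name, assuming sent_scores from above):
# illustrative: rate of positive vs. negative references
sent_scores$pos_neg_ratio <- sent_scores$pos_refs_per_ref /
                             sent_scores$neg_refs_per_ref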
Having some R code to perform sentiment analysis is all well and good, but it doesn't make for a decent command-line utility. For that, it is useful to call R from within Ruby. The rsruby gem can be used to do this.
require 'rsruby'
# initialize R
ENV['R_HOME'] ||= '/usr/lib/R'
r = RSRuby.instance
# load TM libraries
r.eval_R("suppressMessages(library('tm.plugin.webmining'))")
r.eval_R("suppressMessages(library('tm.plugin.sentiment'))")
# perform search and sentiment analysis
r.eval_R("corpus <- WebCorpus(YahooNewsSource('drones'))")
r.eval_R('corpus <- score(corpus)')
# output results
scores = r.eval_R('meta(corpus)')
puts scores.inspect
The output of the last eval_R command is a Hash corresponding to the sent_scores dataframe in the R code.
Naturally, in order for this to be anything but a throwaway script, there has to be some decent command line parsing, maybe an option to aggregate or summarize the results, and of course some sort of output formatting.
As usual, the source code for such a utility has been uploaded to GitHub: https://github.com/mkfs/sentiment-analysis
Usage: sentiment_for_symbol.rb TERM [...]
Perform sentiment analysis on a web query for keyword
Google Engines:
-b, --google-blog Include Google Blog search
-f, --google-finance Include Google Finance search
-n, --google-news Include Google News search
Yahoo Engines:
-F, --yahoo-finance Include Yahoo Finance search
-I, --yahoo-inplay Include Yahoo InPlay search
-N, --yahoo-news Include Yahoo News search
Summary Options:
-m, --median Calculate median
-M, --mean Calculate mean
Output Options:
-p, --pipe-delim Print pipe-delimited table output
-r, --raw Serialize output as a Hash, not an Array
Misc Options:
-h, --help Show help screen
Tuesday, April 16, 2013
Pre-allocating a DataFrame in R
Anyone who has ever tried to load a few thousand rows of data into an R dataframe of a couple hundred columns will have learned the hard way that the storage space should be allocated in advance.
Normally this is not a problem. The columns are initialized with empty vectors sized to the number of rows expected:
n <- 100
df <- data.frame( x=numeric(n), y=character(n) )
for ( i in 1:n ) {
df[i,] = list(...)
}
R dataframes act a little funny with time series, though. When storing time series in a dataframe, the rows represent the data points of a time series (or attributes), while each column represents a time series itself (or entity). Thus, the two time series
1 3 5 7 9
8 2 5 1 4
should be stored in an R data frame as
1 8
3 2
5 5
7 1
9 4
...i.e. the transpose of how data is normally stored in R dataframes (rows being the entity, columns being the attributes). This is mostly due to an assumption in tools like ggplot: the analysis or visualization is performed on the values of an attribute (column) in a set of entities (rows).
This poses a problem when dynamically allocating a dataframe for time series: the number of columns is not known in advance, while the number of rows often is (e.g. in DSP samples).
The solution is to create a list of columns, then pass the list to the data.frame() constructor:
ts.allocate.dataframe <- function(num_ts, ts_size) {
# create a list of numeric vectors
cols = lapply(1:num_ts, function(x) numeric(ts_size))
data.frame(index=1:ts_size,
# initialize a column of timestamps to now()
timestamp=as.POSIXct(1:ts_size, origin=Sys.time()),
# add the columns for the time series
as.data.frame(cols))
}
When filling the dataframe, be sure to set the column name when inserting the data:
# ... build lists ts_data and ts_names ...
df.ts <- ts.allocate.dataframe(length(ts_data),
length(ts_data[[1]]) )
for ( i in 1:length(ts_data) ) {
# set column i+2 to ts_data[i] contents
# note that the first two columns in the dataframe
# are 'index' and 'timestamp'
df.ts[,i+2] <- ts_data[[i]]
# set column name to ts_names[i]
names(df.ts)[[i+2]] <- ts_names[[i]]
}
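As a concrete (if trivial) check, the two example series from earlier can be fed through the same loop; the column names here are arbitrary:
ts_data  <- list(c(1, 3, 5, 7, 9), c(8, 2, 5, 1, 4))
ts_names <- list('ts.a', 'ts.b')
# ...run the allocation and fill loop above, then:
df.ts[, c('ts.a', 'ts.b')]  # shows the transposed layout described earlier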