Sunday, October 27, 2013

Qt4: Cannot mix incompatible Qt library


This problem occurs every now and then when using closed-source Qt4 binaries or libraries:

bash$ LD_LIBRARY_PATH=./lib bin/TestBench
Cannot mix incompatible Qt library (version 0x40801) with this library (version 0x40803)
Aborted (core dumped)

Setting LD_LIBRARY_PATH in order to override the Qt4 library never works, even though it should.


It turns out that this error has nothing to do with the proprietary software being linked to an incompatible Qt version. Instead, the user's Qt theme (often Oxygen) is incompatible with the Qt libraries shipped with the proprietary software.


This can be solved in two ways. The permanent way is to run qtconfig-qt4 and choose another theme (e.g. Cleanlooks, which always seems to work).

The second is to pass a compatible theme to the proprietary software using the -style command-line argument:


 LD_LIBRARY_PATH=./lib bin/TestBench -style=Cleanlooks

This will override the theme only for this invocation of the application.


Thursday, July 25, 2013

Including binary files in an R package

The R package format provides support for data in standard formats (.R, .Rdata, .csv) in the data/ directory. Unfortunately, data in unsupported formats (e.g. audio files, images, SQLite databases) is ignored by the package build command.

The solution, as hinted at in the manual, is to place such data in the inst/extdata/ directory:
"It should not be used for other data files needed by the package, and the convention has grown up to use directory inst/extdata for such files."

Using a SQLite database file as an example, an R package can provide a default database by including the path to the built-in database as a default parameter to functions. Because the path is determined at runtime, the best solution is to include an exported function that provides the path to the built-in database:

pkg.default.database <- function() {
    system.file('extdata', 'default_db.sqlite', package='pkg')
}

In this example, the package name is pkg, and the SQLite database file is inst/extdata/default_db.sqlite.

Package functions that take a path to the SQLite database can then invoke this function as a default parameter. For example:

pkg.fetch.rows <- function(db=pkg.default.database(), where=NULL, limit=NULL) {
    # Connect to database (requires the DBI and RSQLite packages)
    conn <- dbConnect(SQLite(), db)
    if (! dbExistsTable(conn, 'sensor_data')) {
        warning(paste('Table SENSOR_DATA does not exist in', db))
        dbDisconnect(conn)
        return(NULL)
    }

    # build query for table SENSOR_DATA
    query <- 'SELECT * FROM sensor_data'
    if (! is.null(where)) {
        query <- paste(query, 'WHERE', where)
    }
    if (! is.null(limit)) {
        query <- paste(query, 'LIMIT', limit)
    }

    # send query and retrieve rows as a dataframe
    ds <- dbSendQuery(conn, query)
    df <- fetch(ds, n=-1)

    # cleanup
    dbClearResult(ds)
    dbDisconnect(conn)

    return(df)
}
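
A quick usage sketch (the sensor_id column and the /tmp path below are invented for illustration):

# fetch every row from the database bundled with the package
df <- pkg.fetch.rows()

# fetch up to 100 rows for a single sensor from a user-supplied database
df <- pkg.fetch.rows('/tmp/field_test.sqlite', where='sensor_id = 3', limit=100)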

Monday, June 17, 2013

(English) word-clouds in R

The R wordcloud package can be used to generate static images similar to tag-clouds. These are a fun way to visualize document contents, as demonstrated on the R Data Mining website and at the One R Tip A Day site.

Running the sample code from these examples on any real English prose results in lists of words that are far from satisfactory, even when using a stemmer. English is a difficult language to parse, especially when the source is nontechnical writing or, worse, a transcript. In this particular case, an entirely accurate parsing of English isn't necessary; the wordcloud generation only has to be intelligent enough to not make the viewer snort in derision.

To begin with, use the R Text Mining package to load a directory of documents to be analyzed:
library(tm)
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'))

This creates a Corpus containing all files in the directory supplied to DirSource. The files are assumed to be in plaintext; for different formats, use the Corpus readerControl argument:
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'), readerControl=list(reader=readPDF))

If the text is already loaded in R, then a VectorSource can of course be used:
wc_corpus <- Corpus(VectorSource(data_string))

Next, the text in the Corpus must be normalized. This involves the following steps:
  1. convert all text to lowercase
  2. expand all contractions
  3. remove all punctuation
  4. remove all "noise words"
The last step requires detecting what are known as "stop words": words that provide no information (articles, prepositions, and extremely common words). Note that in most text processing, a fifth step would be added to stem the words in the Corpus; when generating word clouds, this produces undesirable output, as the stemmed words tend to be roots that are not recognizable as actual English words.

The following code performs these steps:
wc_corpus <- tm_map(wc_corpus, tolower)
# fix_contractions is defined later in the article
wc_corpus <- tm_map(wc_corpus, fix_contractions)
wc_corpus <- tm_map(wc_corpus, removePunctuation)
wc_corpus <- tm_map(wc_corpus, removeWords, stopwords('english'))
# Not executed: stem the words in the corpus
# wc_corpus <- tm_map(wc_corpus, stemDocument)

This code makes use of the tm_map function, which invokes a function for every document in the Corpus.

A support function is required to expand contractions in the Corpus. Note that this step must be performed before punctuation is removed, or it will be much more difficult to detect contractions.

The purpose of the fix_contractions function is to expand all contractions to their "formal English" equivalents: don't to do not, we'll to we will, etc. The following function uses gsub to perform this expansion, except in the case of possessives and plurals ('s) which are simply removed.

fix_contractions <- function(doc) {
   # "won't" is a special case as it does not expand to "wo not"
   doc <- gsub("won't", "will not", doc)
   doc <- gsub("n't", " not", doc)
   doc <- gsub("'ll", " will", doc)
   doc <- gsub("'re", " are", doc)
   doc <- gsub("'ve", " have", doc)
   doc <- gsub("'m", " am", doc)
   # 's could be is or possessive: it has no expansion
   doc <- gsub("'s", "", doc) 
   return(doc)
}

The Corpus has now been normalized, and can be used to generate a list of words along with counts of their occurrence. First, a term-document matrix is created; next, a Word-Frequency Vector is generated from it. Each element in the vector is the number of occurrences of a specific word, and the name of the element is the word itself (use names(v) to verify this).

td_mtx <- TermDocumentMatrix(wc_corpus, control = list(minWordLength = 3))
v <- sort(rowSums(as.matrix(td_mtx)), decreasing=TRUE)

At this point, the vector is a list of all words in the document, along with their frequency counts. This can be cleaned up by removing obvious plurals (dog, dogs; address, addresses; etc), and adding their occurrence count to the singular case.

This doesn't have to be completely accurate (it's only a wordcloud, after all), and it is not necessary to convert plural words to singular if there is no singular form present. The following function will check each word in the Word-Frequency Vector to see if a plural form of that word (specifically, the word followed by s or es) exists in the Vector as well. If so, the frequency count for the plural form is added to the frequency count for the singular form, and the plural form is removed from the Vector.

aggregate.plurals <- function(v) {
    aggr_fn <- function(v, singular, plural) {
        if (! is.na(v[plural])) {
            v[singular] <- v[singular] + v[plural]
            v <- v[-which(names(v) == plural)]
        }
        return(v)
    }
    for (n in names(v)) {
        n_pl <- paste(n, 's', sep='')
        v <- aggr_fn(v, n, n_pl)
        n_pl <- paste(n, 'es', sep='')
        v <- aggr_fn(v, n, n_pl)
    }
    return(v)
}

The function is applied to the Word-Frequency Vector as follows:
v <- aggregate.plurals(v)

All that remains is to create a dataframe of the word frequencies, and supply that to the wordcloud function in order to generate the wordcloud image:
df <- data.frame(word=names(v), freq=v)
library(wordcloud)
wordcloud(df$word, df$freq, min.freq=3)

It goes without saying that the default R graphics device can be changed to save the file. An example for PNG output:
png(file='wordcloud.png', bg='transparent')
wordcloud(df$word, df$freq, min.freq=3)
dev.off()

The techniques used previously to create a standalone sentiment analysis command-line utility can be used in this case as well.
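
As a simpler alternative, the whole pipeline can also be bundled into a standalone Rscript. The sketch below is only an outline: the script name, its two command-line arguments, and the omission of the contraction and plural fixes are all choices made here for brevity.

#!/usr/bin/env Rscript
# Usage: wordcloud.R DOC_DIR OUT_PNG
library(tm)
library(wordcloud)

args <- commandArgs(trailingOnly=TRUE)

# load and normalize the corpus (contraction expansion omitted for brevity)
wc_corpus <- Corpus(DirSource(args[1]))
wc_corpus <- tm_map(wc_corpus, tolower)
wc_corpus <- tm_map(wc_corpus, removePunctuation)
wc_corpus <- tm_map(wc_corpus, removeWords, stopwords('english'))

# build the Word-Frequency Vector
td_mtx <- TermDocumentMatrix(wc_corpus, control=list(minWordLength=3))
v <- sort(rowSums(as.matrix(td_mtx)), decreasing=TRUE)

# write the wordcloud to the requested PNG file
df <- data.frame(word=names(v), freq=v)
png(file=args[2], bg='transparent')
wordcloud(df$word, df$freq, min.freq=3)
dev.off()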

Friday, June 14, 2013

Git Trick: Preview before pull

A little out of sync with your teammates? Not sure if that next git pull is going to send you into a half-hour of merging?

Use git-fetch and git-diff to see what evils await you:


git fetch
git diff origin/master


As usual, difftool can be used to launch a preferred diff utility (*cough*meld*cough*).


git difftool origin/master


To see just what files have changed, use the --stat option:


git diff --stat origin/master

...or --dirstat to see what directories have changed:


git diff --dirstat origin/master


With any luck, everything is more or less in sync and you can proceed with your usual git pull.

For those looking for something to add to their .bashrc:

alias git-dry-run='git fetch && git diff --stat origin/master'                  

Thursday, June 13, 2013

Quick-and-dirty Sentiment Analysis in Ruby + R


Sentiment analysis is a hot topic these days, and it is easy to see why. The idea that one could mine a bunch of Twitter drivel in order to guesstimate the popularity of a topic, company or celebrity must have induced seizures in marketing departments across the globe.

All the more so because, given the right tools, it's not all that hard.


The R Text Mining package (tm) can be used to perform rather painless sentiment analysis on choice topics.

The Web Mining plugin (tm.plugin.webmining) can be used to query a search engine and build a corpus of the documents in the results:

library(tm.plugin.webmining)
corpus <- WebCorpus(YahooNewsSource('drones'))

The corpus is a standard tm corpus object, meaning it can be passed to other tm plugins without a problem.


One of the more interesting plugins that can be fed a corpus object is the Sentiment Analysis plugin (tm.plugin.sentiment):

library(tm.plugin.sentiment)
corpus <- score(corpus)
sent_scores <- meta(corpus)


The score() method performs sentiment analysis on the corpus, and stores the results in the metadata of the corpus R object. Examining the output of the meta() call will display these scores:


summary(sent_scores)
     MetaID     polarity         subjectivity     pos_refs_per_ref  neg_refs_per_ref 
 Min.   :0   Min.   :-0.33333   Min.   :0.02934   Min.   :0.01956   Min.   :0.00978  
 1st Qu.:0   1st Qu.:-0.05263   1st Qu.:0.04889   1st Qu.:0.02667   1st Qu.:0.02266  
 Median :0   Median : 0.06926   Median :0.06767   Median :0.03009   Median :0.02755  
 Mean   :0   Mean   : 0.04789   Mean   :0.06462   Mean   :0.03343   Mean   :0.03118  
 3rd Qu.:0   3rd Qu.: 0.15862   3rd Qu.:0.07579   3rd Qu.:0.03981   3rd Qu.:0.03526  
 Max.   :0   Max.   : 0.37778   Max.   :0.10145   Max.   :0.06280   Max.   :0.05839  
             NA's   : 2.00000   NA's   :2.00000   NA's   :2.00000   NA's   :2.00000  
 senti_diffs_per_ref
 Min.   :-0.029197  
 1st Qu.:-0.002451  
 Median : 0.003501  
 Mean   : 0.002248  
 3rd Qu.: 0.009440  
 Max.   : 0.026814  
 NA's   : 2.000000 


These sentiment scores are based on the Lydia/TextMap system, and are explained in the TextMap paper as well as in the tm.plugin.sentiment presentation:

  • polarity ((p - n) / (p + n)) : difference of positive and negative sentiment references / total number of sentiment references
  • subjectivity ((p + n) / N) : total number of sentiment references / total number of references
  • pos_refs_per_ref (p / N) : total number of positive sentiment references / total number of references
  • neg_refs_per_ref (n / N) : total number of negative sentiment references / total number of references
  • senti_diffs_per_ref ((p - n) / N) : difference of positive and negative sentiment references / total number of references
The pos_refs_per_ref and neg_refs_per_ref metrics are the rates at which positive and negative references occur in the corpus, respectively (i.e., "x out of n textual references were positive/negative"). The polarity metric is used to determine the bias (positive or negative) of the text, while the subjectivity metric is used to determine the rate at which biased (i.e. positive or negative) references occur in the text.

The remaining metric, senti_diffs_per_ref, is a combination of polarity and subjectivity: it determines the bias of the text in relation to the size of the text (actually, number of references in the text) as a whole. This is likely to be what most people expect the output of a sentiment analysis to be, but it may be useful to create a ratio of pos_refs_per_ref to neg_refs_per_ref.
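
That ratio is simple to add to the scores dataframe; a minimal sketch (the pos_neg_ratio column name is invented here):

# ratio of positive to negative reference rates:
# values > 1 suggest net-positive text, values < 1 net-negative text
sent_scores$pos_neg_ratio <- sent_scores$pos_refs_per_ref / sent_scores$neg_refs_per_ref
summary(sent_scores$pos_neg_ratio)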


Having some R code to perform sentiment analysis is all well and good, but it doesn't make for a decent command-line utility. For that, it is useful to call R from within Ruby. The rsruby gem can be used to do this.

# initialize R
require 'rsruby'
ENV['R_HOME'] ||= '/usr/lib/R'
r = RSRuby.instance

# load TM libraries
r.eval_R("suppressMessages(library('tm.plugin.webmining'))")
r.eval_R("suppressMessages(library('tm.plugin.sentiment'))")

# perform search and sentiment analysis
r.eval_R("corpus <- WebCorpus(YahooNewsSource('drones'))")
r.eval_R('corpus <- score(corpus)')

# output results
scores = r.eval_R('meta(corpus)')
puts scores.inspect


The output of the last eval_R command is a Hash corresponding to the sent_scores dataframe in the R code.


Naturally, in order for this to be anything but a throwaway script, there has to be some decent command line parsing, maybe an option to aggregate or summarize the results, and of course some sort of output formatting.

As usual, the source code for such a utility has been uploaded to GitHub: https://github.com/mkfs/sentiment-analysis

Usage: sentiment_for_symbol.rb TERM [...]
Perform sentiment analysis on a web query for keyword

Google Engines:
    -b, --google-blog                Include Google Blog search
    -f, --google-finance             Include Google Finance search
    -n, --google-news                Include Google News search
Yahoo Engines:
    -F, --yahoo-finance              Include Yahoo Finance search
    -I, --yahoo-inplay               Include Yahoo InPlay search
    -N, --yahoo-news                 Include Yahoo News search
Summary Options:
    -m, --median                     Calculate median
    -M, --mean                       Calculate mean
Output Options:
    -p, --pipe-delim                 Print pipe-delimited table output
    -r, --raw                        Serialize output as a Hash, not an Array
Misc Options:
    -h, --help                       Show help screen

Wednesday, May 22, 2013

RVM 'rvm_path does not exist' error

This seems to occur when RVM is used in a new user account, with a multi-user RVM installation (i.e. installed to /usr/local/rvm):

bash$ source /usr/local/rvm/scripts/rvm

$rvm_path (/usr/local/rvm/scripts/rvm) does not exist.rvm_is_a_shell_function: command not found
__rvm_teardown: command not found

The solution is to set the rvm_path variable in ~/.rvmrc :

bash$ cat ~/.rvmrc 
export rvm_path="/usr/local/rvm"

This has to be done in order for the additions to .bash_profile or .bashrc to work.

Tuesday, April 16, 2013

Pre-allocating a DataFrame in R

Anyone who has ever tried to load a few thousand rows of data into an R dataframe of a couple hundred columns will have learned the hard way that the storage space should be allocated in advance.

Normally this is not a problem. The columns are initialized with empty vectors sized to the number of rows expected:

n <- 100
df <- data.frame( x=numeric(n), y=character(n) )
for ( i in 1:n ) {
  df[i,] = list(...)
}

R dataframes act a little funny with time series, though. When storing time series in a dataframe, the rows represent the data points in a series (the attributes), while each column represents a time series (an entity). Thus, the two time series
  1 3 5 7 9
  8 2 5 1 4
should be stored in an R data frame as
  1 8
  3 2
  5 5
  7 1
  9 4
...i.e. the transpose of how data is normally stored in R dataframes (rows being the entity, columns being the attributes). This is mostly due to an assumption in tools like ggplot: the analysis or visualization is performed on the values of an attribute (column) in a set of entities (rows).
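
For the two series above, that layout can be built directly (the column names a and b are arbitrary):

ts.df <- data.frame(a=c(1, 3, 5, 7, 9), b=c(8, 2, 5, 1, 4))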

This poses a problem when dynamically allocating a dataframe for time series: the number of columns is not known in advance, while the number of rows often is (e.g. in DSP samples).

The solution is to create a list of columns, then pass the list to the data.frame() constructor:

ts.allocate.dataframe <- function(num_ts, ts_size) {
        # create a list of numeric vectors
        cols = lapply(1:num_ts, function(x) numeric(ts_size))
        data.frame(index=1:ts_size, 
                   # initialize a column of timestamps to now()
                   timestamp=as.POSIXct(1:ts_size, origin=Sys.time()),
                   # add the columns for the time series
                   as.data.frame(cols))
}


When filling the dataframe, be sure to set the column name when inserting the data:

# ... build lists ts_data and ts_names ...

df.ts <- ts.allocate.dataframe(length(ts_data), 
                                  length(ts_data[[1]]) )

for ( i in 1:length(ts_data) ) {
    # set column i+2 to ts_data[i] contents
    # note that the first two columns in the dataframe 
    #      are 'index' and 'timestamp'
    df.ts[,i+2] <- ts_data[[i]]
    # set column name to ts_names[i]
    names(df.ts)[[i+2]] <- ts_names[[i]]
}

Friday, February 8, 2013

Ruby Version Check


As good a practice as it is to make Ruby code portable between interpreter versions, sometimes it is just not possible. And other times, a particular feature (such as little- and big-endian flags in Array#pack and String#unpack) makes requiring a minimum version a small burden to bear.

There are a few underfeatured or overengineered solutions to this problem already, but none that feels "just right" for a task that should be so simple. What follows are two brief one-liners for enforcing a minimum Ruby version.


Method 1 : A simple to_i check

This method is useful if the required version is in the format of the 1.8.x and 1.9.x releases. That is, the version number must consist of at least three single-digit components (i.e. 0-9) separated by decimal points (or periods, if you prefer). The trick is simply to concatenate the digits and convert them to an integer:

RUBY_VERSION.split('.')[0,3].join.to_i

This can be used in a straightforward check as follows:

raise "Ruby 1.9.3 required" if RUBY_VERSION.split('.')[0,3].join.to_i < 193


Method 2 : Padded-decimal to_i check

This method is useful if the version components can contain multiple-digit decimal numbers. For simplicity, assume that the largest version component will be 999. The trick then is to format a string in which each version component is zero-padded to three places, then converted to an integer:

("%03d%03d%03d" % RUBY_VERSION.split('.')[0,3]).to_i 

The required version number must also be zero-padded, as the following check demonstrates:

raise "Ruby 1.9.3 required" if ("%03d%03d%03d" % RUBY_VERSION.split('.')[0,3]).to_i < 1009003 


Futureproofing

Of course, to be absolutely safe, one must consider the possibility of versions such as 3.0 or even 3. In these cases, the output of split('.') will contain fewer than three components, and the format string will fail.

The fix is to append enough zeroes to the array after splitting.

A two-component version number requires a single zero to be appended:

irb> ("%03d%03d%03d" % (("1.8").split('.')<<0)).to_i
=> 1008000 

While a one-component version number requires two zeroes to be appended:

irb> ("%03d%03d%03d" % (("1").split('.')<<0<<0)).to_i
=> 1000000


The ultimate, maximally-futureproof version-string-to-integer comparison is therefore:

("%03d%03d%03d" % (RUBY_VERSION.split('.')<<0<<0)).to_i

The accompanying version check for Ruby 1.9.3:

raise "Ruby 1.9.3 required" if ("%03d%03d%03d" % (RUBY_VERSION.split('.')<<0<<0)).to_i < 1009003 

A bit ugly, and a bit more than an 80-character line, but still simple enough to chuck under the shebang.