Monday, June 17, 2013

(English) word-clouds in R

The R wordcloud package can be used to generate static images similar to tag-clouds. These are a fun way to visualize document contents, as demonstrated on the R Data Mining website and at the One R Tip A Day site.

Running the sample code from these examples on any real English prose results in lists of words that are far from satisfactory, even when using a stemmer. English is a difficult language to parse, especially when the source is nontechnical writing or, worse, a transcript. In this particular case, an entirely accurate parsing of English isn't necessary; the wordcloud generation only has to be intelligent enough to not make the viewer snort in derision.

To begin with, use the R Text Mining package to load a directory of documents to be analyzed:
library(tm)
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'))

This creates a Corpus containing all files in the directory supplied to DirSource. The files are assumed to be plain text; for other formats, pass a reader through the Corpus readerControl argument, which takes a list:
wc_corpus <- Corpus(DirSource('/tmp/wc_documents'), readerControl=list(reader=readPDF))

If the text is already loaded in R, a VectorSource can be used instead:
wc_corpus <- Corpus(VectorSource(data_string))

Next, the text in the Corpus must be normalized. This involves the following steps:
  1. convert all text to lowercase
  2. expand all contractions
  3. remove all punctuation
  4. remove all "noise words"
The last step relies on a list of what are known as "stop words": words in a language that carry little information (articles, prepositions, and other extremely common words). Note that in most text processing, a fifth step would be added to stem the words in the Corpus; in generating word clouds, this produces undesirable output, as the stemmed words tend to be roots that are not recognizable as actual English words.

The following code performs these steps:
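# content_transformer is needed in tm 0.6 and later so that plain string
# functions return documents rather than bare character vectors
wc_corpus <- tm_map(wc_corpus, content_transformer(tolower))
# fix_contractions is defined later in the article
wc_corpus <- tm_map(wc_corpus, content_transformer(fix_contractions))
wc_corpus <- tm_map(wc_corpus, removePunctuation)
wc_corpus <- tm_map(wc_corpus, removeWords, stopwords('english'))
# Not executed: stem the words in the corpus
# wc_corpus <- tm_map(wc_corpus, stemDocument)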

This code makes use of the tm_map function, which invokes a function for every document in the Corpus.
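For readers new to tm, the idea behind tm_map can be sketched in base R with lapply. This is only an analogy, not how the normalization above is run: the docs list and the remove_word helper below are hypothetical stand-ins for a Corpus and for removeWords.

```r
# Toy stand-in for a corpus: a named list of document strings (hypothetical data)
docs <- list(a = "The Quick Brown Fox", b = "Jumped OVER the dog")

# Apply one function to every document, much as tm_map does for a Corpus
lowered <- lapply(docs, tolower)

# Hypothetical helper: delete one whole word from a document
remove_word <- function(doc, word) {
  gsub(paste0("\\b", word, "\\b"), "", doc, perl = TRUE)
}

# Extra arguments after the function name are passed through to it, in the
# same way tm_map(corpus, removeWords, stopwords('english')) passes a word list
without_the <- lapply(lowered, remove_word, word = "the")

print(lowered)
print(without_the)
```

Note that deleting a word leaves its surrounding whitespace behind; this is harmless for word counting, since the text is split into words later.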

A support function is required to expand contractions in the Corpus. This step must be performed before punctuation is removed; once the apostrophes are gone, contractions are much harder to detect.

The purpose of the fix_contractions function is to expand all contractions to their "formal English" equivalents: don't to do not, we'll to we will, etc. The following function uses gsub to perform this expansion, except in the case of 's, which may be a possessive or a contraction of is or has; because the two cannot be told apart by pattern alone, the 's is simply removed.

fix_contractions <- function(doc) {
   # "won't" is a special case as it does not expand to "wo not"
   doc <- gsub("won't", "will not", doc)
   doc <- gsub("n't", " not", doc)
   doc <- gsub("'ll", " will", doc)
   doc <- gsub("'re", " are", doc)
   doc <- gsub("'ve", " have", doc)
   doc <- gsub("'m", " am", doc)
   # 's could be is or possessive: it has no expansion
   doc <- gsub("'s", "", doc) 
   return(doc)
}
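A quick check on a sample sentence shows what the function does, and why it has to run before punctuation is stripped. The function is repeated in compact form so the snippet runs on its own; the sample sentence is made up for illustration.

```r
# Compact copy of fix_contractions so this snippet is self-contained
fix_contractions <- function(doc) {
  doc <- gsub("won't", "will not", doc)
  doc <- gsub("n't", " not", doc)
  doc <- gsub("'ll", " will", doc)
  doc <- gsub("'re", " are", doc)
  doc <- gsub("'ve", " have", doc)
  doc <- gsub("'m", " am", doc)
  doc <- gsub("'s", "", doc)
  return(doc)
}

# Lowercase first, as in the normalization steps, so "Won't" is matched too
sample_text <- tolower("We won't stop; they don't care, and it's Bob's turn.")

expanded <- fix_contractions(sample_text)
print(expanded)
# "we will not stop; they do not care, and it bob turn."

# Stripping punctuation first leaves fragments that no longer match any pattern
stripped_first <- gsub("[[:punct:]]", "", sample_text)
print(stripped_first)
# "we wont stop they dont care and its bobs turn"
```

The leftover "it" in the expanded text is a stop word, so it disappears in the stop-word removal step.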

The Corpus has now been normalized, and can be used to generate a list of words along with counts of their occurrence. First, a term-document matrix (TermDocumentMatrix) is created, with one row per word and one column per document; next, a Word-Frequency Vector is generated by summing each row. Each element in the vector is the number of occurrences for a specific word, and the name of the element is the word itself (use names(v) to verify this).

# wordLengths replaces the minWordLength option of older tm releases
td_mtx <- TermDocumentMatrix(wc_corpus, control = list(wordLengths = c(3, Inf)))
v <- sort(rowSums(as.matrix(td_mtx)), decreasing=TRUE)
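To see what the rowSums and sort calls produce without loading any documents, the same two steps can be run on a small hand-built matrix. The words, file names, and counts below are invented for illustration; a real term-document matrix has the same shape, with terms as rows and documents as columns.

```r
# Hypothetical 3-term, 3-document count matrix (rows = terms, columns = documents)
toy_mtx <- matrix(c(2, 0, 1,
                    1, 3, 0,
                    0, 1, 1),
                  nrow = 3, byrow = TRUE,
                  dimnames = list(Terms = c("cloud", "word", "text"),
                                  Docs  = c("a.txt", "b.txt", "c.txt")))

# Sum each row to get the total count per term, then sort with most frequent first
toy_v <- sort(rowSums(toy_mtx), decreasing = TRUE)

print(toy_v)
# word cloud  text
#    4     3     2

# The words themselves are the names of the vector
print(names(toy_v))
```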

At this point, the vector holds every word in the documents along with its frequency count. This can be cleaned up by folding obvious plurals (dog, dogs; address, addresses; etc.) into their singular forms and adding the counts together.

This doesn't have to be completely accurate (it's only a wordcloud, after all), and it is not necessary to convert plural words to singular if there is no singular form present. The following function will check each word in the Word-Frequency Vector to see if a plural form of that word (specifically, the word followed by s or es) exists in the Vector as well. If so, the frequency count for the plural form is added to the frequency count for the singular form, and the plural form is removed from the Vector.

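aggregate.plurals <- function (v) {
    # Fold the count of 'plural' into 'singular' when both are present
    aggr_fn <- function(v, singular, plural) {
       if (singular %in% names(v) && plural %in% names(v)) {
           v[singular] <- v[singular] + v[plural]
           v <- v[names(v) != plural]
       }
       return(v)
    }
    # names(v) is evaluated once, so words already merged away may still
    # come up here; the checks in aggr_fn skip them rather than creating NAs
    for (n in names(v)) {
       v <- aggr_fn(v, n, paste(n, 's', sep=''))
       v <- aggr_fn(v, n, paste(n, 'es', sep=''))
    }
    return(v)
}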

The function is applied to the Word-Frequency Vector as follows:
v <- aggregate.plurals(v)
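To make the merge concrete, here is the same operation worked by hand on a toy vector, one singular/plural pair at a time. The words and counts are invented, and merge_pair is a hypothetical helper written only for this illustration.

```r
# Hypothetical word-frequency vector containing two singular/plural pairs
toy_v <- c(dog = 5, cat = 4, dogs = 3, address = 2, addresses = 1)

# Hypothetical helper: add the plural's count to the singular, then drop the plural
merge_pair <- function(v, singular, plural) {
  if (singular %in% names(v) && plural %in% names(v)) {
    v[singular] <- v[singular] + v[plural]
    v <- v[names(v) != plural]
  }
  v
}

toy_v <- merge_pair(toy_v, "dog", "dogs")           # word followed by "s"
toy_v <- merge_pair(toy_v, "address", "addresses")  # word followed by "es"
toy_v <- merge_pair(toy_v, "cat", "cats")           # no plural present: unchanged

print(toy_v)
#     dog     cat address
#       8       4       3
```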

All that remains is to create a data frame of the word frequencies and supply it to the wordcloud function to generate the wordcloud image:
df <- data.frame(word=names(v), freq=v)
library(wordcloud)
wordcloud(df$word, df$freq, min.freq=3)

The default R graphics device can be changed to save the image to a file. An example for PNG output:
png(filename='wordcloud.png', bg='transparent')
wordcloud(df$word, df$freq, min.freq=3)
dev.off()

The techniques used previously to create a standalone sentiment analysis command-line utility can be used in this case as well.
