Friday, February 7, 2014

Disasm Wordcloud

In a recent discussion of possible applications of R to binary analysis, the usual visualizations (byte entropy, size of basic blocks, number of times a function is called during a trace, etc.) came to mind. Past experiments with tm.plugins.webmining, however, also raised the following question: Why not use the R text-mining packages to generate a wordcloud from a disassembled binary?

Why not, indeed.

The objdump disassembler can be used to generate a list of terms from a binary file. The template Ruby code for generating a list of terms is a simple wrapper around objdump:

# generate a space-delimited string of terms occurring in target at 'path'
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  # ...extract terms from line and append to arr...
  arr
}.join(" ")  
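One way to fill in the template is sketched below. The helper names disassemble and extract_terms are made up for this illustration and are not taken from the original code. The sketch assumes that every extraction rule can be written as a regex whose first capture group is the term. It also swaps the backticks for Open3.capture2, so that a path containing quotes or spaces does not need shell escaping.

```ruby
require 'open3'

# Hypothetical helper: run objdump with the same flags as the template, but
# without going through a shell, so the path needs no quoting.
def disassemble(path)
  output, status = Open3.capture2('objdump', '-DRTgrstx', path)
  raise "objdump failed on #{path}" unless status.success?
  output
end

# Hypothetical helper: collect the first capture group of every line that
# matches pattern and return the terms as one space-delimited string.
def extract_terms(text, pattern)
  text.each_line.inject([]) { |arr, line|
    match = pattern.match(line)
    arr << match[1] if match
    arr
  }.join(' ')
end

# Stand-in text so the sketch runs without a target binary
sample = "alpha: one\nno match here\nbeta: two\n"
puts extract_terms(sample, /^(\w+):/)
```

With a real target, the call would look like extract_terms(disassemble(path), pattern), where pattern is one of the regexes from this post.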

The R code for generating wordclouds has been covered before. The code for disassembly terms can be simpler, as the terms have already been extracted from the raw text (the disassembly):

library('tm')
library('wordcloud')

# term occurrences must be in variable "terms"
corpus <- Corpus(VectorSource(terms))
tdm <- TermDocumentMatrix(corpus)
vec <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
df <- data.frame(word=names(vec), freq=vec)

# output file path must be in variable "img_path"
png(file=img_path)
# minimum frequency should be higher than 1 if there are many terms
wordcloud(df$word, df$freq, min.freq=1) 
dev.off()
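A rough way to pick a sensible min.freq is to look at the term counts on the Ruby side before handing the string to R. The sketch below is only an approximation of what the term-document matrix will contain: the helper names are hypothetical, and tm may normalize terms itself (for example by case folding or by dropping very short words, depending on its control options), so the counts in R can differ.

```ruby
# Hypothetical helper: count how often each term occurs in the
# space-delimited terms string, most frequent first (ties sorted by name).
def term_frequencies(terms)
  counts = terms.split.each_with_object(Hash.new(0)) { |term, h| h[term] += 1 }
  counts.sort_by { |term, count| [-count, term] }
end

# Hypothetical helper: keep only the terms that occur at least min_freq times.
def frequent_terms(terms, min_freq)
  term_frequencies(terms).select { |_, count| count >= min_freq }
end

# Stand-in terms string; a real one would come from the objdump wrapper
sample = 'mov push mov call mov push ret'
term_frequencies(sample).each { |term, count| puts "#{term}\t#{count}" }
```

If the list has a long tail of terms that occur only once or twice, raising min.freq in the wordcloud call should thin out the image.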
                             
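The most interesting terms in a binary are the library functions that are invoked. The following regex will extract the symbol name from call instructions (more precisely, from any line of objdump output that ends in a <symbol> or <symbol@suffix> reference):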

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/
  arr
}.join(" ")
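To see what this regex picks up, it can be run against a few hand-written lines. The lines below only imitate the layout of objdump output on x86-64 and are not taken from xterm. The constant name SYMBOL_REF is made up for the example.

```ruby
# Same regex as above, held in a (hypothetically named) constant
SYMBOL_REF = /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/

# Illustrative lines in the style of objdump output (assumed layout)
sample = [
  "0000000000401136 <main>:\n",
  "  40113a:\te8 f1 fe ff ff       \tcall   401030 <puts@plt>\n",
  "  40113f:\te8 0c 00 00 00       \tcall   401150 <helper>\n",
  "  401144:\teb 05                \tjmp    40114b <main+0x15>\n",
  "  401146:\tc3                   \tret\n"
]

symbols = sample.inject([]) { |arr, line|
  arr << $1 if line =~ SYMBOL_REF
  arr
}
puts symbols.join(' ')
```

The function label is skipped because of its trailing colon, and the jump target is skipped because the + offset falls outside the character class. That leaves the two call targets, with the @plt suffix stripped from the first one.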

When run on /usr/bin/xterm, this generates the following wordcloud:

The other obvious terms in a binary are the instruction mnemonics. The following regex will extract the instruction mnemonics from an objdump disassembly:

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/
  arr
}.join(" ")
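The same kind of check works for the mnemonic regex. Again, the sample lines are hand-written in the style of objdump output (address, tab, instruction bytes, tab, instruction) rather than taken from a real binary, and the constant name MNEMONIC is made up for the example.

```ruby
# Same regex as above, held in a (hypothetically named) constant
MNEMONIC = /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/

# Illustrative lines in the style of objdump output (assumed layout)
sample = [
  "0000000000401136 <main>:\n",
  "  401136:\t55                   \tpush   %rbp\n",
  "  401137:\t48 89 e5             \tmov    %rsp,%rbp\n",
  "  40113a:\te8 f1 fe ff ff       \tcall   401030 <puts@plt>\n",
  "  40113f:\t5d                   \tpop    %rbp\n",
  "  401140:\tc3                   \tret\n"
]

mnemonics = sample.inject([]) { |arr, line|
  arr << $1 if line =~ MNEMONIC
  arr
}
puts mnemonics.join(' ')
```

The label line has no colon directly after the address, so it does not match. Every instruction line contributes the first word that follows the instruction bytes.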

When run on /usr/bin/xterm, this generates the following wordcloud:       

Of course, there is always the possibility of generating a wordcloud from the ASCII strings in a binary. The following Ruby code is a crude attempt at creating a terms string from the output of the strings command:

terms = `strings '#{path}'`.gsub(/[[:punct:]]/, '').lines.map(&:strip).join(' ')

When run on /usr/bin/xterm, this generates the following wordcloud:     

Not as nice as the others, but some pre-processing of the strings output would clear that up.
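As a guess at what such pre-processing might look like, the sketch below replaces punctuation with spaces (instead of deleting it, which glues path components together) and keeps only alphabetic tokens of a minimum length. The helper name clean_strings and the default length of 4 are assumptions made for this example, not something from the original code.

```ruby
# Hypothetical filter: split on punctuation and whitespace, then keep only
# purely alphabetic tokens that are at least min_len characters long.
def clean_strings(raw, min_len = 4)
  raw.gsub(/[[:punct:]]/, ' ')
     .split
     .select { |tok| tok =~ /\A[[:alpha:]]+\z/ && tok.length >= min_len }
     .join(' ')
end

# Stand-in for the output of the strings command (illustrative, not from xterm)
sample = "/lib64/ld-linux-x86-64.so.2\nXtOpenDisplay\n%s: %d\nGLIBC_2.2.5\nusage\n"
puts clean_strings(sample)
```

On a real binary the input would be the output of the strings command. Whether the length cutoff or the alphabetic-only rule throws away too much is something to tune by eye against the resulting wordcloud.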

There is, of course, a GitHub repository for the code. Note that the implementation is in Ruby, using the rsruby gem to interface with R.
