In a recent discussion of possible applications of R to binary analysis, the usual visualizations (byte entropy, basic-block size, number of times a function is called during a trace, etc.) came to mind. Past experiments with tm.plugins.webmining, however, also raised the following question: why not use the R text-mining packages to generate a wordcloud from a disassembled binary?
Why not, indeed.
The objdump disassembler can be used to generate a list of terms from a binary file. The template Ruby code is a simple wrapper around objdump:
# generate a space-delimited string of terms occurring in target at 'path'
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  # ...extract terms from line and append to arr...
  arr
}.join(" ")
The R code for generating wordclouds has been covered before. The code for disassembly terms can be simpler, as the terms have already been extracted from the raw disassembly text:
library('tm')
library('wordcloud')
# term occurrences must be in variable "terms"
corpus <- Corpus(VectorSource(terms))
tdm <- TermDocumentMatrix(corpus)
vec <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
df <- data.frame(word=names(vec), freq=vec)
# output file path must be in variable "img_path"
png(file=img_path)
# minimum frequency should be higher than 1 if there are many terms
wordcloud(df$word, df$freq, min.freq=1)
dev.off()
The most interesting terms in a binary are the library functions that are invoked. The following regex extracts the symbol name from any line that ends in a symbol reference such as <name> or <name@plt>, which covers call instructions:
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/
  arr
}.join(" ")
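As a quick sanity check, the same regex can be run against a few lines of disassembly without invoking objdump at all. The sketch below is only an illustration: the sample lines are made up rather than taken from a real binary, and the names SAMPLE_CALLS, SYMBOL_RE, symbol_terms, and term_counts are hypothetical helpers that do not appear in the original code.

```ruby
# Hypothetical objdump output lines (not taken from a real binary).
SAMPLE_CALLS = [
  "0000000000401126 <main>:",
  "  40112a:\te8 01 ff ff ff       \tcall   401030 <puts@plt>",
  "  40112f:\te8 0c ff ff ff       \tcall   401040 <XtAppMainLoop@plt>",
  "  401134:\te8 f7 fe ff ff       \tcall   401030 <puts@plt>",
  "  401139:\teb 05                \tjmp    401140 <main+0x1a>"
].freeze

# Same pattern as in the post: a trailing <symbol> or <symbol@suffix>.
SYMBOL_RE = /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/

# Collect the captured symbol name from every matching line.
def symbol_terms(lines)
  lines.each_with_object([]) { |line, arr| arr << $1 if line =~ SYMBOL_RE }
end

# Count occurrences; the wordcloud sizes each term by this frequency.
def term_counts(terms)
  terms.each_with_object(Hash.new(0)) { |term, counts| counts[term] += 1 }
end

counts = term_counts(symbol_terms(SAMPLE_CALLS))
counts.each { |term, n| puts "#{term}: #{n}" }
```

The function label line is skipped because it ends in a colon, and the `<main+0x1a>` target is skipped because `+` is not in the character class. Note that a jump to a bare symbol would still match, so the terms are not strictly limited to call instructions.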
When run on /usr/bin/xterm, the symbol-name extraction generates the following wordcloud:
The other obvious terms in a binary are the instruction mnemonics. The following regex will extract the instruction mnemonics from an objdump disassembly:
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/
  arr
}.join(" ")
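One caveat with a whitespace-based pattern is that mnemonics made up entirely of hex digits (add, dec) can be absorbed into the byte column, and byte-only continuation lines can yield a stray byte as a term. A possible alternative, sketched below, anchors on the tab characters that GNU objdump typically places between the address, byte, and instruction columns. That tab-separated layout is an assumption about the objdump build in use, the sample lines are made up, and SAMPLE_INSNS, MNEMONIC_RE, and mnemonics are hypothetical names.

```ruby
# Hypothetical sample lines in the layout GNU objdump typically emits:
# "address:<TAB>raw bytes<TAB>mnemonic operands".
SAMPLE_INSNS = [
  "0000000000401000 <main>:",
  "  401000:\t55                   \tpush   %rbp",
  "  401001:\t03 43 10             \tadd    0x10(%rbx),%eax",
  "  401004:\t48 b8 00 00 00 00 00 \tmovabs $0x0,%rax",
  "  40100b:\t00 00 00 "
].freeze

# Tab-anchored variant (an assumption, not the regex from the post):
# address, colon, tab, hex bytes and spaces, tab, then the mnemonic.
MNEMONIC_RE = /^\s*[[:xdigit:]]+:\t[[:xdigit:] ]+\t([[:alnum:]]+)/

# Collect the mnemonic from every line that has an instruction column.
def mnemonics(lines)
  lines.each_with_object([]) { |line, arr| arr << $1 if line =~ MNEMONIC_RE }
end

puts mnemonics(SAMPLE_INSNS).join(" ")
```

Because the byte-only continuation line has no second tab, it produces no term, and the `add` line yields its mnemonic rather than the `0x10` operand.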
When run on /usr/bin/xterm, the mnemonic extraction generates the following wordcloud:
Of course, there is always the possibility of generating a wordcloud from the ASCII strings in a binary. The following Ruby code is a crude attempt at creating a terms string from the output of the strings command:
terms = `strings '#{path}'`.gsub(/[[:punct:]]/, '').lines.map(&:strip).join(' ')
When run on /usr/bin/xterm, this generates the following wordcloud:
Not as nice as the others, but some pre-processing of the strings output would clear that up.
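One way such pre-processing might look is sketched below. It keeps only alphabetic runs of a minimum length and folds case, which drops format specifiers, version numbers, and short fragments. The helper name clean_strings, the min_len parameter and its default of 4, and the sample input are all assumptions made for illustration.

```ruby
# Hypothetical helper: reduce raw `strings` output to word-like terms.
def clean_strings(raw, min_len: 4)
  raw.scan(/[[:alpha:]]+/)                 # keep purely alphabetic runs
     .select { |w| w.length >= min_len }   # drop short fragments
     .map(&:downcase)                      # merge case variants
     .join(" ")
end

# Made-up sample of what `strings` might print for a dynamically linked binary.
sample = "/lib64/ld-linux-x86-64.so.2\nXtAppMainLoop\n%s: cannot open %s\n"
puts clean_strings(sample)
```

The threshold is a trade-off: a higher min_len removes more noise but also discards short, legitimate words.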
There is, of course, a GitHub repository for the code. Note that the implementation is in Ruby, using the rsruby gem to interface with R.
Friday, February 7, 2014