Wednesday, November 5, 2014

Creating an R list using RsRuby

For the most part rsruby, works as advertised. Where things blow up unexpectedly is when using a Ruby Hash to create an R List object.

irb > require 'rsruby'
irb > ENV['R_HOME'] ||= '/usr/lib/R' 
irb > $R = RSRuby.instance

irb > $R.assign('test.x', { :a => 1, :b => "abc", :c => [8,9] } )
RException: Error in (function (x, value, pos = -1, envir = as.environment(pos), inherits = FALSE,  : 
  unused arguments (a = 1, b = "abc", c = 8:9)

This happens because rsruby treats a trailing Hash as a collection of keyword arguments to the R assign() function. All that metaprogramming magic ain't free, y'know?


The solution is to wrap the Hash argument into an actual Hash storing keyword arguments to the R function.

A quick look at the R help file for assign() shows that it has the following signature:

assign(x, value, pos = -1, envir = as.environment(pos),
            inherits = FALSE, immediate = TRUE)

This means that the Hash containing the R List data will have to be passed as the value argument to the assign() call.

$R.assign('test.x', { :value => { :a => 1, :b => "abc", :c => [8,9] } } )
ArgumentError: Unsupported object ':a' passed to R.

Of course, R cannot handle Symbols unless they are the names of function keyword arguments. This is easy to fix.

irb >$R.assign('test.x', { :value => { 'a' => 1, 'b' => "abc", 'c' => [8,9] } } )
 => {"a"=>1, "b"=>"abc", "c"=>[8, 9]} 
irb > $R.eval_R("print(test.x)")
$a
[1] 1

$b
[1] "abc"

$c
[1] 8 9

 => {"a"=>1, "b"=>"abc", "c"=>[8, 9]}

All's well that ends well!

Thursday, March 20, 2014

Custom tick labels in R perspective plots

In R, the persp() is a built-in function to create surface plots. The basic usage is straightforward: create a matrix of values

# plot a 10x10 matrix of random values in the range -100..100:
persp( matrix(runif(100, min=-100, max=100), nrow=10, ncol=10) )

All well and good, until it's time to prepare the plots for presentation -- and suddenly it becomes apparent that plots created with persp() do not work well with axis(), text(), mtext(), par(), and other standard graphics device functions.

The trans3d documentation refers the reader to the persp documentation for examples; those examples are too convoluted to serve any useful educational purpose. A quick note to documentation writers: always include an example showing the simplest possible use of your function on trivial data sets (usually the array of integers from 1 to 10, or sin(x) if a function is required). Do not use only edge cases and exciting demos as examples.

The discussion that follows will demonstrate how to construct a perspective plot with custom labels using persp() and trans3d(). The data to be plotted is a 10x10 matrix of values in the range -100:100.


The first thing to understand is that persp() does not just draw a plot; it also returns a perspective matrix (or pmat) which can be used to translate 3-dimensional coordinates to the 2-dimensional coordinate system used in the image of the plot.

The function that performs this translation is trans3d(). Its arguments are the x, y, and z coordinates to be translated, followed by the pmat. The return value is a list with two elements: x and y, the two-dimensional coordinates in the image.

If one of the x, y, and z arguments is a vector, then the vector is considered to be a line at the other two coordinates. Thus, a line along the X axis from (0, 10, 10) to (10, 10, 10) would be translated using trans3d(0:10, 10, 10, pmat); a line along the Y axis from (0, 3, 10) to (0, 7, 10) would be translated using trans3d(0, 3:7, 10, pmat).


Enough background; time for an example.


Basic Perspective Plot

First, some definitions of the data ranges to keep things clear:

x.axis <- 1:10
min.x <- 1
max.x <- 10
y.axis <- 1:10
min.y <- 1
max.y <- 10
z.axis <- seq(-100, 100, by=25)
min.z <- -100
max.z <- 100

Pay particular attention to z.axis: in addition to specifying the range of each axis, the *.axis variables also specify the tick marks of each axis.

Next, a draw the initial perspective plot, saving the pmat:

pmat <- persp( x=x.axis, y=y.axis,
               matrix(runif(100, min=-100, max=100), nrow=10, ncol=10), 
               xlab='', ylab='', zlab='', 
               ticktype='detailed', box=FALSE, axes=FALSE, 
               mar=c(10, 1, 0, 2), expand=0.25,
               col='green', shade=0.25, theta=40, phi=30 )

Note the theta (rotation along the vertical axis) and phi (rotation along the horizontal axis) parameters. It is useful to play with these a bit, as different data sets will require different viewing angles. The r ("eyepoint distance") and d ("perspective strength") parameters provide further control of the view. Note also that box and axes parameters are FALSE: we will be drawing our own axes.


Drawing the Axes

In this plot, the X axis will be drawn at min.y and min.z (left side of Y, bottom of Z), Y at max.x and min.z (right side of X, bottom of Z), and Z at min.x and min.y (left side of X, left side of Y).

These parameters are passed to trans3d() to calculate the coordinates of a line at each axis, as described previously. The translated coordinates can be passed directly to lines().

lines(trans3d(x.axis, min.y, min.z, pmat) , col="black")
lines(trans3d(max.x, y.axis, min.z, pmat) , col="black")
lines(trans3d(min.x, min.y, z.axis, pmat) , col="black")


Drawing Tick Marks

Adding tick marks requires calculating the position of a second line, parallel to the axis, and using segments() to draw ticks that span the distance between the axis and the second line. The basic procedure is as follows:

tick.start <- trans3d(x.axis, min.y, min.z, pmat)
tick.end <- trans3d(x.axis, (min.y - 0.20), min.z, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)

Note the (min.y - 0.20) in the calculation of tick.end. This places the second line, parallel to the X axis, at the position -0.20 on the Y axis (i.e., into negative/unplotted space).

The tick marks on the Y and Z axes can be handled similarly:

tick.start <- trans3d(max.x, y.axis, min.z, pmat)
tick.end <- trans3d(max.x + 0.20, y.axis, min.z, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)

tick.start <- trans3d(min.x, min.y, z.axis, pmat)
tick.end <- trans3d(min.x, (min.y - 0.20), z.axis, pmat)
segments(tick.start$x, tick.start$y, tick.end$x, tick.end$y)


Adding Tick Mark Labels

The final step is to label the ticks on each axis. Once again, the procedure is to calculate the position of a line, parallel to the axis, at the position where the labels are to be displayed:

labels <- c('first', 'second', 'third', 'fourth', 'fifth', 'sixth', 'seventh', 'eighth', 'ninth', 'tenth')
label.pos <- trans3d(x.axis, (min.y - 0.25), min.z, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(0, NA), srt=270, cex=0.5)

The adj=c(0, NA) expression is used to left-justify the labels, the srt=270 expression is used to rotate the labels 270°, and the cex=0.5 expression is used to scale the label text to 75% of its original size.

The labels on the Y and Z axes are produced similarly:

labels <- c('alpha', 'beta', 'gamma', 'delta', 'epsilon', 'zeta', 'eta', 'theta', 'iota', 'kappa')
label.pos <- trans3d((max.x + 0.25), y.axis, min.z, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(0, NA), cex=0.5)

labels <- as.character(z.axis)
label.pos <- trans3d(min.x, (min.y - 0.5), z.axis, pmat)
text(label.pos$x, label.pos$y, labels=labels, adj=c(1, NA), cex=0.5)

Note that the Y and Z axis tick labels do not need to be rotated.


The Final Product



Saturday, February 15, 2014

Sending email from R using Gmail

As mentioned previously, the sendmailR package generally works well for sending an email from within R. When an ISP blocks traffic on port 25, however, sendmailR cannot be used unless a a local mailserver is configured to act as a relay. This means that sendmailR is unable to reliably sending email on arbitrary machines and across arbitrary network connections.

There is gmailR, of course, but it requires rJython, and using Python through R via Java is just too many levels of indirection for something as simple as sending an email.

Instead, it should be possible to use Curl to send an email by directly connecting to Gmail.

The RCurl package should be able to provide this. According to the SSL SMTP example, an email (subject + body) can be uploaded with code (converted from C) such as the following:

library(RCurl)

rmail.rcurl.read.function <- function(x) return(x)

gmail.rcurl.send <- function( username, password, to.email, subject, body ) {
  email.data <- paste(
    paste('Subject:', subject),
    '', body, '', sep="\r\n")
  
  curl <- getCurlHandle()
  
  curlSetOpt( .opts=list(
    "useragent"='Mozilla 5.0',
    "use.ssl"=3, # AKA CURLUSESSL_ALL 
    "username"=username,
    "password"=password,
    "readdata"=email.data,
    "mail.from"=username,
    "mail.rcpt"=to.email,
    "readfunction"=rmail.rcurl.read.function,
    "upload"=TRUE
  ), curl=curl )
  
  
  getURL('smtp://smtp.gmail.com:587', curl=curl, verbose=TRUE)
}

Unfortunately, this crashes R -- it appears to be a bug in RCurl, possibly due to a lack of SMTP support ( a call to curlVersion() shows that SMTP and SMTPS are supported, but listCurlOptions() does not include mail.from or mail.rcpt).

Instead, Curl must be called directly via system. Sure, this is ugly, and yes, system should never be used, but it was RCurl that drove us to this. Remember to sanitize your inputs (in this case, the email addresses and the password)!

gmail.curl.send <- function( username, password, to.email, subject, body ) {
  email.data <- paste(
    paste('Subject:', subject),
    '', body, '', sep="\r\n")
  
  # IMPORTANT: username, password, and to.email must be cleaned!
  cmd <- paste('curl -n --ssl-reqd --mail-from "<',
               username,
               '>" --mail-rcpt "<',
               to.email,
               '<" --url ',
               'smtp://smtp.gmail.com:587',
               ' --user "', username, ':', password, 
               '" -T -', sep='')
  system(cmd, input=email.data)
}

This works correctly, as one would expect: it's hard to go wrong with a simple shell call to Curl. Note the use of the input parameter in the system call: this creates a temp file with the email contents, which Curl then uploads to Gmail using the -T flag.

Friday, February 7, 2014

Disasm Wordcloud

In a recent discussion of what the possible applications of R to binary analysis are, the usual visualizations (byte entropy, size of basic blocks, number of times a function is called during a trace, etc) came to mind. Past experiments with tm.plugins.webmining, however, also raised the following question: Why not use the R textmining packages to generate a wordcloud from a disassembled binary?

Why not, indeed.

The objdump disassembler can be used to generate a list of terms from a binary file. The template Ruby code for generating a list of terms is a simple wrapper around objdump:

# generate a space-delimited string of terms occurring in target at 'path'
terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  # ...extract terms from line and append to arr...
  arr
}.join(" ")  

The R code for generating wordclouds has been covered before. The code for disassembly terms can be more simple, as the terms have already been extracted from the raw text (disassembly):

library('tm')
library('wordcloud')

# term occurrences must be in variable "terms"
corpus <- Corpus(VectorSource(terms))
tdm <- TermDocumentMatrix(corpus)
vec <- sort(rowSums(as.matrix(tdm)), decreasing=TRUE)
 df <- data.frame(word=names(vec), freq=vec)

# output file path must be in variable "img_path"
png(file=img_path)
# minimum frequency should be higher than 1 if there are many terms
wordcloud(df$word, df$freq, min.freq=1) 
dev.off()
                             
The most interesting terms in a binary are the library functions that are invoked. The following regex will extract the symbol name from call instructions:

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /<([_[:alnum:]]+)(@[[:alnum:]]+)?>\s*$/
  arr
}  

When run on /usr/bin/xterm, this generates the following wordcloud:

The other obvious terms in a binary are the instruction mnemonics. The following regex will extract the instruction mnemonics from an objdump disassembly:

terms = `objdump -DRTgrstx '#{path}'`.lines.inject([]) { |arr, line|
  arr << $1 if line =~ /^\s*[[:xdigit:]]+:[[:xdigit:]\s]+\s+([[:alnum:]]+)\s*/
  arr
}  

When run on /usr/bin/xterm, this generates the following wordcloud:       

Of course, there is always the possibility of generating a wordcloud from the ASCII strings in a binary. The following Ruby code is a crude attempt at creating a terms string from the output of the strings command:

 terms = `strings '#{path}'`.gsub(/[[:punct:]]/, '').lines.to_a .join(' ')

When run on /usr/bin/xterm, this generates the following wordcloud:     

Not as nice as the others, but some pre-processing of the strings output would clear that up.

There is, of course, a github for the code. Note that the implementation is in Ruby, using the rsruby gem to interface with R.

Wednesday, February 5, 2014

Finding stock symbols by industry in R

The quantmod package is fantastic, but it has one shortcoming: there is no facility for retrieving information about a specific industry (e.g., "is the entire industry on a downward trend, or just this company?").

Yahoo Finance provides this information via its CSV API; this means it should be easy to retrieve from within R. Details about the API have been provided by the C# yahoo-finance-managed project.

The first step is to get a list of all possible sectors. This is a straightforward Curl download and CSV parse:

library(RCurl)
get.sectors <- function() {
  url <- 'http://biz.yahoo.com/p/csv/s_conameu.csv'
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  # sector ID is its index in this alphabetical list
  df$ID <- 1:nrow(df)
  return(df)
}

Note the use of textConnection() to parse an in-memory string instead of an on-disk file. The binary=TRUE flag causes Curl to return a "raw" object which is converted to a character vector by the rawToChar() call; this is necessary because the CSV file ends with a NULL byte.

The next step is to fetch a list of the industries in each sector. At first, this seems to be straightforward:

get.sector.industries <- function( sector ) {
  url <- paste('http://biz.yahoo.com/p/csv', 
               paste(as.integer(sector), 'conameu.csv', sep=''), 
               sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  
  # fix broken Industry names
  df[,'Industry'] <- gsub(' +', ' ', df[,'Industry'])
  
  # default (incorrect) ID column
  df$ID <- (sector * 100) + 1:nrow(df)
    
  df$Sector <- sector
  return(df)
}

Unfortunately, there is one problem: the industry IDs are not based on the index value. In fact, there does not seem to be a way to obtain the industry IDs using the Yahoo Finance API, which appears to be a pretty egregious oversight.

Yahoo Finance provides an alphabetical list of industries in all sectors; the URL for each industry entry contains its ID. This means that the page can be parsed in order to build a list of industries and their IDs.

The code is a little hairy, involving a couple of XPath queries to extract the URLs and their descriptions:

library(XML)
get.industry.ids <- function() {
  html <- htmlParse('http://biz.yahoo.com/ic/ind_index_alpha.htm')

  # extract description from A tags
  html.names <- as.vector(xpathSApply(html, "//td/a/font", xmlValue))
  # extract URL from A tags
  html.urls <- as.vector(xpathSApply(html, "//td/a/font/../@href"))
  
  if (length(html.names) != length(html.urls)) {
    warning(paste("Got", length(html.names), "names but", 
                  length(html.urls), "URLs"))
  }

  html.names <- gsub("\n", " ", html.names)
  html.urls <- gsub("http://biz.yahoo.com/ic/([0-9]+).html", "\\1", html.urls)
  
  df <- data.frame(Name=character(length(html.urls)), 
                   ID=numeric(length(html.urls)), stringsAsFactors=FALSE)
  for (i in 1:length(html.urls)) {
    url = html.urls[i]
    val = suppressWarnings(as.numeric(url))
    if (! is.na(val) ) {
      df[i,'Name'] = html.names[i]
      df[i,'ID'] = val
    }
  }
  return(df)
}

In this function, htmlParse() was used to download the web page instead of Curl. This  is necessary because the webpage contains one or more non-trailing NULL bytes; rawToChar() can only strip trailing NULL bytes. The parser in htmlParse() is able to handle the NULL bytes just fine.

With this function, the IDs of industries can be set as follows:

df <- get.sector.industries( sector.id )
id.df <- get.industry.ids()
for (i in 1:nrow(id.df)) {
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
  }

It is now possible to build a dataframe that contains the industries of all of the sectors:

df.sectors <- get.sectors()
id.df <- get.industry.ids()
df.industries <- NULL
for (id in df.sectors) {
  df <- get.sector.industries(id)
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
    if (is.null(ind.df)) {
      ind.df <- df
    } else {
      ind.df <- rbind(ind.df, df)
    }
}

This list is probably not going to change much, so the dataframe can be stored for reuse in an .RData object.

The final step, getting the stock symbols for a specific industry, is much more straightforward:

get.industry.symbols <- function(id) {
  url <- paste('http://biz.yahoo.com/p/csv', 
                 paste(as.integer(id), 'conameu.csv', sep=''), 
                 sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  return(df)
}

As usual, there is a github for the code.

One final note: the sector and industry data is also available via the FinViz API. Yahoo Finance was selected for this project in order to be compatible with the quantmod data.

Sunday, February 2, 2014

Daily stock symbol reports with R

This is a simple R script that uses the quantmod package to look up stock symbols on Yahoo Finance, and the sendmailR package to send an email alert if the latest stock price ("Last" in the Yahoo report) is below a "buy" threshold or above a "sell" threshold.

The input file format is tab-delimited with three columns: Symbol, BuyAt, SellAt. There is no need for a header column. For example:
  AAPL        300     750
  BA           65      90
  ...

The function that does all the work is symbol.report. This reads the input file containing buy and sell thresholds, performs a Yahoo Finance query on all symbols in the file, and generates a dataframe with the details (BuyAt, SellAt, Open, Close, Last, etc) of every symbol whose latest price is either below the buy threshold, or above the sell threshold.

library(quantmod)
symbol.report <- function(filename, header=FALSE, sep = "\t") {
  watch.df <- read.delim(filename, header=header, sep=sep)
  colnames(watch.df) <- c('Symbol', 'BuyAt', 'SellAt')

  quote.df <- getQuote(paste(watch.df$Symbol, collapse=';'))
  quote.df$Symbol <- rownames(quote.df)

  df <- merge(watch.df, quote.df)

  df[(df$Last <= df$BuyAt) | (df$SellAt > 0 & df$Last >= df$SellAt), ]
}  

The symbol.report function is invoked by  symbol.alert, which will send an email to the provided address if the dataframe returned by symbol.report is not empty. If an email address is not provided, the dataframe will be printed to STDOUT.

library(sendmailR)
symbol.alert <- function(filename, email=NULL, verbose=FALSE) {
  df <- symbol.report(filename)

  if (nrow(df) == 0) {
    return(df)
  }

  if ( is.null(email) ) {
    print(df)
  } else {
    sendmail(# from: fake email address
             paste('<', "r.script@nospam.org", '>', sep=''),
             # to: provided email address
             paste('<', email, '>', sep=''),
             # subject
             "SYMBOL ALERT",
             # body
             capture.output(print(df, row.names=FALSE)),
             # SMTP server (gmail)
             control=list(smtpServer='ASPMX.L.GOOGLE.COM'),
             verbose=verbose)
  } 
  return(df)
}

A few things to note here:
  * a fake email address is used as the From address, allowing easy filtering of these emails
  * the SMTP server used is the GMail server, which may not be appropriate for some users

This function can be called from a shell script in a cron job, invoking R with the --vanilla option:

  R --vanilla -e "source('/home/me/symbol.alert.R'); ticker.email.alert('/home/me/monitoried_symbols.dat', 'me@gmail.com')"

And again, there is a github for the code.