Wednesday, February 5, 2014

Finding stock symbols by industry in R

The quantmod package is fantastic, but it has one shortcoming: there is no facility for retrieving information about a specific industry (e.g., "is the entire industry on a downward trend, or just this company?").

Yahoo Finance provides this information via its CSV API; this means it should be easy to retrieve from within R. Details about the API have been provided by the C# yahoo-finance-managed project.

The first step is to get a list of all possible sectors. This is a straightforward Curl download and CSV parse:

library(RCurl)
get.sectors <- function() {
  url <- 'http://biz.yahoo.com/p/csv/s_conameu.csv'
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  # sector ID is its index in this alphabetical list
  df$ID <- 1:nrow(df)
  return(df)
}

Note the use of textConnection() to parse an in-memory string instead of an on-disk file. The binary=TRUE flag causes Curl to return a "raw" object which is converted to a character vector by the rawToChar() call; this is necessary because the CSV file ends with a NULL byte.

The next step is to fetch a list of the industries in each sector. At first, this seems to be straightforward:

get.sector.industries <- function( sector ) {
  url <- paste('http://biz.yahoo.com/p/csv', 
               paste(as.integer(sector), 'conameu.csv', sep=''), 
               sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  
  # fix broken Industry names
  df[,'Industry'] <- gsub(' +', ' ', df[,'Industry'])
  
  # default (incorrect) ID column
  df$ID <- (sector * 100) + 1:nrow(df)
    
  df$Sector <- sector
  return(df)
}

Unfortunately, there is one problem: the industry IDs are not based on the index value. In fact, there does not seem to be a way to obtain the industry IDs using the Yahoo Finance API, which appears to be a pretty egregious oversight.

Yahoo Finance provides an alphabetical list of industries in all sectors; the URL for each industry entry contains its ID. This means that the page can be parsed in order to build a list of industries and their IDs.

The code is a little hairy, involving a couple of XPath queries to extract the URLs and their descriptions:

library(XML)
get.industry.ids <- function() {
  html <- htmlParse('http://biz.yahoo.com/ic/ind_index_alpha.htm')

  # extract description from A tags
  html.names <- as.vector(xpathSApply(html, "//td/a/font", xmlValue))
  # extract URL from A tags
  html.urls <- as.vector(xpathSApply(html, "//td/a/font/../@href"))
  
  if (length(html.names) != length(html.urls)) {
    warning(paste("Got", length(html.names), "names but", 
                  length(html.urls), "URLs"))
  }

  html.names <- gsub("\n", " ", html.names)
  html.urls <- gsub("http://biz.yahoo.com/ic/([0-9]+).html", "\\1", html.urls)
  
  df <- data.frame(Name=character(length(html.urls)), 
                   ID=numeric(length(html.urls)), stringsAsFactors=FALSE)
  for (i in 1:length(html.urls)) {
    url = html.urls[i]
    val = suppressWarnings(as.numeric(url))
    if (! is.na(val) ) {
      df[i,'Name'] = html.names[i]
      df[i,'ID'] = val
    }
  }
  return(df)
}

In this function, htmlParse() was used to download the web page instead of Curl. This  is necessary because the webpage contains one or more non-trailing NULL bytes; rawToChar() can only strip trailing NULL bytes. The parser in htmlParse() is able to handle the NULL bytes just fine.

With this function, the IDs of industries can be set as follows:

df <- get.sector.industries( sector.id )
id.df <- get.industry.ids()
for (i in 1:nrow(id.df)) {
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
  }

It is now possible to build a dataframe that contains the industries of all of the sectors:

df.sectors <- get.sectors()
id.df <- get.industry.ids()
df.industries <- NULL
for (id in df.sectors) {
  df <- get.sector.industries(id)
    name <- id.df[i, 'Name']
    if (nrow(df[df$Industry == name,]) > 0) {
      df[df$Industry == name, 'ID'] <- id.df[i, 'ID']
    }
    if (is.null(ind.df)) {
      ind.df <- df
    } else {
      ind.df <- rbind(ind.df, df)
    }
}

This list is probably not going to change much, so the dataframe can be stored for reuse in an .RData object.

The final step, getting the stock symbols for a specific industry, is much more straightforward:

get.industry.symbols <- function(id) {
  url <- paste('http://biz.yahoo.com/p/csv', 
                 paste(as.integer(id), 'conameu.csv', sep=''), 
                 sep='/')
  csv <- rawToChar(getURLContent(url, binary=TRUE))
  df <- read.csv(textConnection(csv))
  return(df)
}

As usual, there is a github for the code.

One final note: the sector and industry data is also available via the FinViz API. Yahoo Finance was selected for this project in order to be compatible with the quantmod data.

No comments:

Post a Comment