The Sorry Scheme of Things Entire: Pre-allocating a DataFrame in R

Anyone who has ever tried to load a few thousand rows of data into an R dataframe of a couple hundred columns will have learned the hard way that the storage space should be allocated in advance.

Normally this is not a problem. The columns are initialized with empty vectors sized to the number of rows expected:

n <- 100
df <- data.frame( x=numeric(n), y=character(n) )
for ( i in 1:n ) {
df[i,] = list(...)
}

R dataframes act a little funny with time series, though. When storing a time series in a dataset, the rows represent the data points in a time series (or attributes), while the column represents the time series itself (or entity). Thus, the two time series
1 3 5 7 9
8 2 5 1 4
should be stored in an R data frame as
1 8
3 2
5 5
7 1
9 4
...i.e. the transpose of how data is normally stored in R dataframes (rows being the entity, columns being the attributes). This is mostly due to an assumption in tools like ggplot: the analysis or visualization is performed on the values of an attribute (column) in a set of entities (rows).

This poses a problem when dynamically allocating a dataframe for time series: the number of columns is not known in advance, while the number of rows often is (e.g. in DSP samples).

The solution is to create a list of columns, then pass the list to the data.frame() constructor:

ts.allocate.dataframe <- function(num_ts, ts_size=) {
# create a list of numeric vectors
cols = lapply(1:num_ts, function(x) numeric(ts_size))
data.frame(index=1:ts_size,
# initialize a column of timestamps to now()
timestamp=as.POSIXct(1:ts_size, origin=Sys.time()),
# add the columns for the time series
as.data.frame(cols))
}

When filling the dataframe, be sure to set the column name when inserting the data:

# ... build lists ts_data and ts_names ...

df.ts <- ts.allocate.dataframe(length(ts_data[[1]]),
length(ts_data) )

for ( i in 1:length(ts_data) ) {
# set column i+2 to ts_data[i] contents
# note that the first two columns in the dataframe
# are 'index and 'timestamp'
df.ts[,i+2] <- ts_data[[i]]
# set column name to ts_names[i]
names(df.ts)[[i+2]] <- ts_names[[i]]
}

The Sorry Scheme of Things Entire

Tuesday, April 16, 2013

Pre-allocating a DataFrame in R

No comments:

Post a Comment

Labels

Blog Archive