Question

I have a dataset with longitudinal data in a person-oriented format, as such:

pid varA_1 varB_1 varA_2 varB_2 varA_3 varB_3 ...
1   1      1      0      3      2      1
2   0      1      0      2      2      1
...
50k 1      0      1      3      1      0

This results in a large data frame, with at least 50k observations and 90 variables measured for up to 29 periods.

I would like to get a more period-oriented format, as such:

pid index start stop varA varB varC ...
1   1     ...
1   2     
...
1   29
2   1

I have tried different approaches for reshaping the data frame (*apply, plyr, reshape2, loops, appending vs. prefilling all-numeric matrices, etc.), but I cannot get a decent processing time (over 40 minutes even for subsets). I have picked up various hints along the way on what to avoid, but I am still not sure whether I am missing a bottleneck or a possible speedup.

Is there an optimal way to approach this kind of data processing, so that I can evaluate the best-case processing time achievable in pure R code? There have been similar questions on Stack Overflow, but they did not result in convincing answers...


Solution

First, let's build the data example (I am using 5e3 instead of 50e3 to avoid memory problems with my configuration):

nObs <- 5e3
nVar <- 90
nPeriods <- 29

dat <- matrix(rnorm(nObs*nVar*nPeriods), nrow=nObs, ncol=nVar*nPeriods)

df <- data.frame(id=seq_len(nObs), dat)

nmsV <- paste('Var', seq_len(nVar), sep='')
nmsPeriods <- paste('T', seq_len(nPeriods), sep='')

nms <- c(outer(nmsV, nmsPeriods, paste, sep='_'))
names(df)[-1] <- nms

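Now stats::reshape converts the wide format to long. With sep = "_", each column name is split into the variable name (e.g. Var1) and the period (e.g. T1):

# varying = columns 2..(nVar*nPeriods + 1), i.e. everything except id
df2 <- reshape(df, direction = "long", varying = 2:((nVar*nPeriods)+1), sep = "_")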

I am not sure if this is the fast solution you are looking for.
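
To make the result of the reshape() call easier to inspect, here is a minimal sketch on a toy data frame. The values and the object names wide and long are made up for illustration, and it assumes the same Variable_Period naming pattern as above.

```r
# Toy wide data (hypothetical values): 3 people, 2 variables, 2 periods.
wide <- data.frame(
  id      = 1:3,
  Var1_T1 = c(1, 0, 1),
  Var2_T1 = c(1, 1, 0),
  Var1_T2 = c(0, 0, 1),
  Var2_T2 = c(3, 2, 3)
)

# sep = "_" tells reshape() to split each name into variable and period.
long <- reshape(wide, direction = "long", varying = 2:5, sep = "_")

# Sort so that each person's periods are adjacent, then drop the row names.
long <- long[order(long$id, long$time), ]
rownames(long) <- NULL

# Columns are id, time, Var1, Var2; there is one row per person-period.
print(long)
```

Wrapping the reshape() call in system.time() on the full-size df is a simple way to judge whether it is fast enough for the 50k case.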

OTHER TIPS

The well-aged stack() function can be very fast, if things fit into memory.
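
As a rough sketch of how stack() could be applied here (not necessarily what was meant above), one option is to stack the period columns of each variable separately and then bind the results. The toy values and the names wide, vars, periods and long_stack are assumptions for illustration.

```r
# Toy wide data (hypothetical values): 3 people, 2 variables, 2 periods.
wide <- data.frame(
  id      = 1:3,
  Var1_T1 = c(1, 0, 1),
  Var2_T1 = c(1, 1, 0),
  Var1_T2 = c(0, 0, 1),
  Var2_T2 = c(3, 2, 3)
)

vars    <- c("Var1", "Var2")
periods <- c("T1", "T2")

# For each variable, stack() concatenates its period columns into one vector
# (all rows of T1 first, then all rows of T2).
stacked <- lapply(vars, function(v) {
  stack(wide[paste(v, periods, sep = "_")])$values
})
names(stacked) <- vars

# id and time are repeated to match the column-by-column stacking order.
long_stack <- data.frame(
  id   = rep(wide$id, times = length(periods)),
  time = rep(periods, each = nrow(wide)),
  stacked
)

print(long_stack)
```

Because every variable is stacked in the same period order, the id and time columns only need to be built once.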

For large data sets, using a (transparent) SQLite database as an intermediate is best. Try Gabor's sqldf package; there are many examples on its Google Code page:

http://code.google.com/p/sqldf/

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow