idata.frame: Why the error "is.data.frame(df) is not TRUE"?

https://stackoverflow.com/questions/3980986

09-10-2019

Question

I'm working with a large data frame called exp (file here) in R. In the interest of performance, it was suggested that I try the idata.frame() function from plyr. But I think I'm using it wrong.

My original call, slow but it works:

df.median<-ddply(exp, 
                 .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), 
                 na.rm=TRUE)

With idata.frame, I get Error: is.data.frame(df) is not TRUE

library(plyr)
df.median<-ddply(idata.frame(exp), 
                 .(groupname,starttime,fPhase,fCycle), 
                 numcolwise(median), 
                 na.rm=TRUE)

So I thought maybe it was my data, and tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow). But if I try something similar to my desired call using baseball, it fails:

bb.median<-ddply(idata.frame(baseball), 
                 .(id,year,team), 
                 numcolwise(median), 
                 na.rm=TRUE)
>Error: is.data.frame(df) is not TRUE

Perhaps my error is in the way I'm specifying the groupings? Does anyone know how to make my example work?

ETA:

I also tried:

groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')

i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable

which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.

Solution

Given that you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.

Specifically, the lapply(.SD, FUN) idiom combined with the by and .SDcols arguments.

Set up the data.table

library(data.table)
library(plyr)  # provides idata.frame() and ddply(), used in the benchmark below
DT <- as.data.table(exp)
iexp <- idata.frame(exp)

Find which columns are numeric

numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))

Compute the median of each numeric column within each group

dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase, 
    fCycle), .SDcols = numeric_columns]
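As a minimal sketch of the same pattern on made-up data (the table toy and its columns grp, x, y and label are hypothetical, not from the question), the following also passes na.rm = TRUE through lapply() to median(), which the original ddply call asked for:

```r
library(data.table)

# Hypothetical toy data: one grouping column, two numeric columns,
# and one character column that should be ignored.
toy <- data.table(
  grp   = c("a", "a", "b", "b", "b"),
  x     = c(1, 3, 10, NA, 30),
  y     = c(2, 4, 6, 8, 10),
  label = c("p", "q", "r", "s", "t")
)

# Names of the numeric columns only
num_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

# Median of every numeric column within each group; extra arguments
# to lapply() (here na.rm = TRUE) are forwarded to median()
toy_median <- toy[, lapply(.SD, median, na.rm = TRUE),
                  by = grp, .SDcols = num_cols]
print(toy_median)
```

The label column is left out of the result because .SDcols restricts .SD to the numeric columns.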

Some benchmarking

library(rbenchmark)
benchmark(
  data.table = DT[, lapply(.SD, median),
                  by = list(groupname, starttime, fPhase, fCycle),
                  .SDcols = numeric_columns],
  plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle),
               numcolwise(median), na.rm = TRUE),
  # Note: this entry calls ddply() on exp itself, not on iexp
  idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(
    inadist = median(x$inadist), smldist = median(x$smldist), lardist = median(x$lardist),
    inadur = median(x$inadur), smldur = median(x$smldur), lardur = median(x$lardur),
    emptyct = median(x$emptyct), entct = median(x$entct), inact = median(x$inact),
    smlct = median(x$smlct), larct = median(x$larct), na.rm = TRUE)),
  aggregate = aggregate(exp[, numeric_columns],
                        exp[, c("groupname", "starttime", "fPhase", "fCycle")],
                        median),
  replications = 5)

##         test replications elapsed relative user.self
## 4  aggregate            5    5.42    1.789      5.30
## 1 data.table            5    3.03    1.000      3.03
## 3 idataframe            5   11.81    3.898     11.77
## 2       plyr            5    9.47    3.125      9.45

Other tips

Strange behaviour, but even the docs say that idata.frame is experimental. You have probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
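For what it's worth, the failing check can be reproduced in isolation. This is a small sketch using the built-in mtcars data (my choice, not the asker's data). It only shows that an idata.frame is an environment-based object rather than a data.frame, which would be consistent with both error messages in the question:

```r
library(plyr)

# Wrap a built-in data frame as an immutable data frame
idf <- idata.frame(mtcars)

class(idf)           # the object carries its own class, not "data.frame"
is.data.frame(idf)   # FALSE, so a stopifnot(is.data.frame(df)) check fails
is.environment(idf)  # TRUE, matching "object of type 'environment'"
```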

In any case, this cuts about 20% off the time (on my system):

system.time(
  df.median <- ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(
    inadist = median(x$inadist),
    smldist = median(x$smldist),
    lardist = median(x$lardist),
    inadur = median(x$inadur),
    smldur = median(x$smldur),
    lardur = median(x$lardur),
    emptyct = median(x$emptyct),
    entct = median(x$entct),
    inact = median(x$inact),
    smlct = median(x$smlct),
    larct = median(x$larct),
    # Note: na.rm = TRUE is an argument to data.frame() here, not to median(),
    # so it adds a column named na.rm rather than dropping missing values.
    na.rm = TRUE))
)

Shane asked you in another post whether you could cache the results of your script. I don't really know your workflow, but it may be best to set up a cron job to run this and store the results daily, hourly, or whatever suits.
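A rough sketch of the caching idea in base R follows. The helper cached_result() and the file name are hypothetical, and mtcars stands in for the real data. A scheduled job could refresh the file while interactive sessions simply read it:

```r
# Hypothetical helper: compute a result once, store it on disk,
# and reuse the stored copy on later calls.
cached_result <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))  # cache hit: skip the expensive computation
  }
  result <- compute()
  saveRDS(result, path)    # cache miss: compute, then store for next time
  result
}

# Assumed location; a real job would use a persistent directory
cache_file <- file.path(tempdir(), "df_median.rds")

# Stand-in for the slow grouped-median call from the question
medians <- cached_result(cache_file, function() {
  aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = median)
})
print(medians)
```

Deleting the file forces a recomputation on the next call, which is the simplest way to invalidate the cache when the underlying data changes.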

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow