idata.frame: Why the error "is.data.frame(df) is not TRUE"?
09-10-2019
Question
I am working with a large data frame called exp (file here) in R. In the interest of performance, it was suggested that I look at the idata.frame() function from plyr. But I think I'm using it wrong.
My original call, slow but it works:
df.median<-ddply(exp,
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
With idata.frame, I get Error: is.data.frame(df) is not TRUE
library(plyr)
df.median<-ddply(idata.frame(exp),
.(groupname,starttime,fPhase,fCycle),
numcolwise(median),
na.rm=TRUE)
So I thought maybe it was my data, and I tried the baseball dataset. The idata.frame example works fine: dlply(idata.frame(baseball), "id", nrow)
But if I try something similar to my desired call using baseball, it doesn't work:
bb.median<-ddply(idata.frame(baseball),
.(id,year,team),
numcolwise(median),
na.rm=TRUE)
>Error: is.data.frame(df) is not TRUE
Maybe my mistake is in the way I'm specifying the groupings? Does anyone know how to make my example work?
ETA:
I also tried:
groupVars <- c("groupname","starttime","fPhase","fCycle")
voi<-c('inadist','smldist','lardist')
i<-idata.frame(exp)
ag.median <- aggregate(i[,voi], i[,groupVars], median)
Error in i[, voi] : object of type 'environment' is not subsettable
which uses a faster way of getting the medians, but gives a different error. I don't think I understand how to use idata.frame at all.
Solution
Given that you are working with 'big' data and looking for performance, this seems a perfect fit for data.table.
Specifically, the lapply(.SD, FUN) idiom with the .SDcols argument, combined with by.
Set up the data.table
library(data.table)
DT <- as.data.table(exp)
iexp <- idata.frame(exp)
Find which columns are numeric
numeric_columns <- names(which(unlist(lapply(DT, is.numeric))))
dt.median <- DT[, lapply(.SD, median), by = list(groupname, starttime, fPhase,
fCycle), .SDcols = numeric_columns]
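To make the lapply(.SD, FUN) / .SDcols / by pattern easier to try out, here is a minimal sketch on toy data. The table toy, its values, and the reduced set of columns are invented for illustration and are not taken from exp; the grouping columns here are character, so they are not picked up as numeric.

```r
library(data.table)

# Toy data standing in for exp (names and values are made up for this sketch)
toy <- data.table(
  groupname = c("a", "a", "a", "b", "b", "b"),
  fPhase    = c("day", "day", "night", "day", "day", "day"),
  inadist   = c(1, 3, 5, 2, 4, 6),
  smldist   = c(10, 30, 50, 20, 40, 60)
)

# Names of the numeric columns; the character grouping columns are excluded
num_cols <- names(toy)[vapply(toy, is.numeric, logical(1))]

# Median of every numeric column within each (groupname, fPhase) group
toy_median <- toy[, lapply(.SD, median, na.rm = TRUE),
                  by = .(groupname, fPhase), .SDcols = num_cols]
print(toy_median)
```

Here na.rm = TRUE is forwarded by lapply to median, which mirrors the na.rm = TRUE in the original ddply call.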
Some benchmarking
library(rbenchmark)
benchmark(data.table = DT[, lapply(.SD, median), by = list(groupname, starttime,
fPhase, fCycle), .SDcols = numeric_columns],
plyr = ddply(exp, .(groupname, starttime, fPhase, fCycle), numcolwise(median), na.rm = TRUE),
idataframe = ddply(exp, .(groupname, starttime, fPhase, fCycle), function(x) data.frame(inadist = median(x$inadist),
smldist = median(x$smldist), lardist = median(x$lardist), inadur = median(x$inadur),
smldur = median(x$smldur), lardur = median(x$lardur), emptyct = median(x$emptyct),
entct = median(x$entct), inact = median(x$inact), smlct = median(x$smlct),
larct = median(x$larct), na.rm = TRUE)),
aggregate = aggregate(exp[, numeric_columns],
exp[, c("groupname", "starttime", "fPhase", "fCycle")],
median),
replications = 5)
##         test replications elapsed relative user.self
## 4  aggregate            5    5.42    1.789      5.30
## 1 data.table            5    3.03    1.000      3.03
## 3 idataframe            5   11.81    3.898     11.77
## 2       plyr            5    9.47    3.125      9.45
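The aggregate entry in the benchmark is run on the ordinary data frame, not on an idata.frame. A small base R sketch of that same call shape, on invented toy data (toy, group_vars and the values are assumptions for illustration only):

```r
# Toy data standing in for exp (names and values are made up for this sketch)
toy <- data.frame(
  groupname = c("a", "a", "a", "b", "b", "b"),
  fPhase    = c("day", "day", "night", "day", "day", "day"),
  inadist   = c(1, 3, 5, 2, 4, 6),
  smldist   = c(10, 30, 50, 20, 40, 60)
)

group_vars <- c("groupname", "fPhase")
voi        <- c("inadist", "smldist")

# Same shape as the benchmarked call: value columns, grouping columns, function
toy_median <- aggregate(toy[, voi], toy[, group_vars], median)
print(toy_median)
```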
Other tips
Strange behaviour, but even in the docs it says that idata.frame is experimental. You probably found a bug. Perhaps you could rewrite the check at the top of ddply that tests is.data.frame().
In any case, this cuts about 20% off the time (on my system):
system.time(df.median<-ddply(exp, .(groupname,starttime,fPhase,fCycle), function(x) data.frame(
inadist=median(x$inadist),
smldist=median(x$smldist),
lardist=median(x$lardist),
inadur=median(x$inadur),
smldur=median(x$smldur),
lardur=median(x$lardur),
emptyct=median(x$emptyct),
entct=median(x$entct),
inact=median(x$inact),
smlct=median(x$smlct),
larct=median(x$larct),
na.rm=TRUE))
)
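One thing to watch in the call above: na.rm=TRUE sits inside data.frame(), not inside each median() call, so as written it would presumably not remove NAs and would instead become a column of its own. A base R sketch of the difference, using a made-up vector x rather than the real data:

```r
x <- c(1, NA, 3)

# na.rm given to data.frame(): median() never sees it, and data.frame()
# turns it into an extra column named na.rm
misplaced <- data.frame(inadist = median(x), na.rm = TRUE)

# na.rm given to median() itself, which is probably what was intended
intended <- data.frame(inadist = median(x, na.rm = TRUE))

print(misplaced)
print(intended)
```

If NA handling matters for the result, passing na.rm = TRUE to each median() call would be the variant to time.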
Shane asked you in another post whether you could cache the results of your script. I don't really have an idea of your workflow, but it may be best to set up a cron job to run this and store the results daily, hourly, or at whatever interval suits.