What is the most efficient way to split a combined factor column into two factor columns in an R data.table?

https://stackoverflow.com/questions/17115571

31-05-2022

Question

I have a large data.table (9 million rows) with two columns: fcombined and value. fcombined is a factor, but it is actually the result of interacting two factors. What is the most efficient way to split this one factor column back into two? I have already come up with a solution that works OK, but maybe there is a more straightforward way that I have missed. The working example is:

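library(data.table)
library(stringr)
f1=1:20
f2=1:20
g=expand.grid(f1,f2)
combinedfactor=as.factor(paste(g$Var1,g$Var2,sep="_"))
largedata=1:10^6
# repeat the 400 combined levels explicitly to match the 10^6 values
DT=data.table(fcombined=rep(combinedfactor,length.out=length(largedata)),value=largedata)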


# split the factor column `colname` of `res` into length(namesofnewcols) new columns
splitfactorcol=function(res,colname,splitby="_",namesofnewcols){
  # lookup table: one row per factor level, with the level string split into its parts
  helptable=data.table(.factid=seq_along(levels(res[[colname]])) ,str_split_fixed(levels(res[[colname]]),splitby,length(namesofnewcols)))
  setnames(helptable,colnames(helptable),c(".factid",namesofnewcols))
  setkey(helptable,.factid)
  # attach each row's integer level code, then merge on it
  res$.factid=as.integer(res[[colname]])
  setkey(res,.factid)
  m=merge(res,helptable)
  m$.factid=NULL
  m
}
splitfactorcol(DT,"fcombined",splitby="_",c("f1","f2"))

Answer

I think this does the trick and is about 5x faster.

setkey(DT, fcombined)
DT[DT[, data.table(fcombined = levels(fcombined),
                   do.call(rbind, strsplit(levels(fcombined), "_")))]]

I split the levels and then simply merged that result back into the original data.table.
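The same idea (split only the unique levels, then map the result back to the rows) can also be sketched without a join, by indexing the split levels with the factor's integer codes. This is a variation for illustration, not the code from the answer above; the names lev, parts and codes are made up here, and tstrsplit assumes a reasonably recent data.table version.

```r
library(data.table)

# Small stand-in for the question's data: 400 combined levels repeated to 1e6 rows.
g  <- expand.grid(f1 = 1:20, f2 = 1:20)
DT <- data.table(
  fcombined = rep(factor(paste(g$f1, g$f2, sep = "_")), length.out = 1e6),
  value     = 1:1e6
)

# Split only the unique levels (400 strings), not all 1e6 rows.
lev   <- levels(DT$fcombined)
parts <- tstrsplit(lev, "_", fixed = TRUE)  # list of two character vectors

# The factor's integer codes index directly into the split levels.
codes <- as.integer(DT$fcombined)
DT[, f1 := factor(parts[[1]][codes])]
DT[, f2 := factor(parts[[2]][codes])]

head(DT)
```

Unlike the keyed join, this adds the two columns by reference and leaves the row order of DT untouched. Whether it is faster on the real 9 million rows would need to be measured.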

Btw, in my tests strsplit was about 2x faster (for this task) than the stringr function.
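To check the timing claim on your own machine, a rough comparison could look like the sketch below. The helper names base_split and stringr_split and the repetition count are made up for illustration, and fixed = TRUE is an addition that the answer's code does not use. Results will vary with R and package versions.

```r
library(stringr)

# Same 400 level strings as in the question's example.
g   <- expand.grid(1:20, 1:20)
lev <- levels(factor(paste(g$Var1, g$Var2, sep = "_")))

# Both calls produce a 400 x 2 character matrix of the split parts.
base_split    <- function() do.call(rbind, strsplit(lev, "_", fixed = TRUE))
stringr_split <- function() str_split_fixed(lev, "_", 2)

# Repeat each call many times so the elapsed time is measurable.
n_reps <- 200
system.time(for (i in seq_len(n_reps)) base_split())
system.time(for (i in seq_len(n_reps)) stringr_split())
```

Since only the levels are split, either function runs on a few hundred strings, so the difference matters far less than it would if the split ran over every row.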

Licensed under: CC-BY-SA with attribution
