How to divide data programmatically in a data table in R

https://stackoverflow.com/questions/22421312

15-06-2023
|

Domanda

I have a code as given below:

newdata <- ddply(data, .(SIC,FYEAR), function(x){if(nrow(x)>7) x else NULL});

In this code the function is applied on each fragment of data being divided by SIC, and FYEAR. How can I write a generic function which allows for creation of fragments programmatically in data tables? Something like,

newdata <- ddply(data, .(col1,col2,...,coln), function(x){if(nrow(x)>7) x else NULL});

but an equivalent solution in data tables, where both n and the names of columns are given programmatically. It would be more helpful if someone can let me know a solution based on data tables. Thanks.

A reproducible example goes below:

require(data.table)
data <- data.table(structure(list(SIC = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1), FYEAR = c(1999, 1999, 1999, 1999, 1999, 2000, 2000, 
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2001, 2001, 2001, 
2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002, 
2002, 2002, 2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003, 2003, 
2003, 2003, 2003, 2003, 2003, 2004, 2004, 2004, 2004, 2004, 2004, 
2004, 2004, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2006, 2006, 
2006, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2007, 2008, 
2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009, 2009, 2010, 
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011, 
2011, 2011, 2011, 2011, 2012), BIG4 = c(0, 0, 1, 1, 1, 0, 0, 
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 
1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 
0, 0, 1, 1, 1, 1, 1, 1, 0)), .Names = c("SIC", "FYEAR", "BIG4"
), row.names = c(31842L, 48128L, 982L, 2173L, 8655L, 31843L, 
55799L, 62384L, 983L, 2174L, 7034L, 8656L, 36790L, 51631L, 69782L, 
31844L, 55800L, 62385L, 984L, 7035L, 8657L, 18874L, 36791L, 51632L, 
69783L, 985L, 7036L, 8658L, 13375L, 18875L, 31845L, 36792L, 51633L, 
62386L, 69784L, 986L, 2177L, 7037L, 8659L, 18876L, 36793L, 51634L, 
55801L, 62387L, 69785L, 36794L, 987L, 2178L, 7038L, 8660L, 18877L, 
51635L, 62388L, 7039L, 36795L, 988L, 2179L, 8661L, 18878L, 62389L, 
19823L, 36796L, 989L, 2180L, 8662L, 18879L, 62390L, 19824L, 36797L, 
2181L, 8663L, 18880L, 19825L, 36798L, 2182L, 8664L, 69790L, 24268L, 
24325L, 36799L, 2183L, 8665L, 31852L, 24269L, 29392L, 36800L, 
2184L, 8666L, 18883L, 31853L, 69792L, 24270L, 36801L, 2185L, 
8667L, 18884L, 26989L, 31854L, 69793L, 30612L), class = "data.frame"))

Edit:

Let say I want to write a function, and pass cols = c("BIG4") or cols = c("SIC", "FYEAR") etc. to define the columns which is used to fragment the data and then delete those fragments having less than 8 data points and then combine the left fragments and then return the joined dataset.

Soluzione

Here are two approaches.

# using .SD
foo.SD <- function(x, .by,.thresh){
        x[,if(.N>.thresh){.SD},by=.by]
      }
# using .I (should be slightly faster as .SD is not loaded into memory for
# each group
foo.I <- function(x, .by,.thresh){
       x[x[,if(.N>.thresh){.I},by=.by]$V1]
 }

foo.SD(data, c('SIC','FYEAR'), 7)
foo.I(data, c('SIC','FYEAR'), 7)

Altri suggerimenti

You can put the nrow condition in the j argument of [.data.table, the trick is to return an empty version of the same data.table for groups which do not pass the required number of rows, using data[0]:

# Discard chunks of a data.table which have less than a specified number of rows
throwAwaySmall <- function(data, cols, rowSizeThreshold) {
  data[, .SD[.N>rowSizeThreshold], by=cols]
}

The following is now equivalent to your first piece of code:

throwAwaySmall(data, c("SIC", "FYEAR"), 7)

How about this:

DT<-data.table(df,key=c("SIC","FYEAR"))
DT[,list(BIG4,incl=length(BIG4)>7),by=c("SIC","FYEAR")][incl==T]

Autorizzato sotto: CC-BY-SA insieme a attribuzione

Non affiliato a StackOverflow