I have a code as given below:
newdata <- ddply(data, .(SIC,FYEAR), function(x){if(nrow(x)>7) x else NULL});
In this code the function is applied on each fragment of data being divided by SIC
, and FYEAR
. How can I write a generic function which allows for creation of fragments programmatically in data tables? Something like,
newdata <- ddply(data, .(col1,col2,...,coln), function(x){if(nrow(x)>7) x else NULL});
but an equivalent solution in data tables, where both n and the names of columns are given programmatically. It would be more helpful if someone can let me know a solution based on data tables. Thanks.
A reproducible example goes below:
require(data.table)
data <- data.table(structure(list(SIC = c(1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1), FYEAR = c(1999, 1999, 1999, 1999, 1999, 2000, 2000,
2000, 2000, 2000, 2000, 2000, 2000, 2000, 2000, 2001, 2001, 2001,
2001, 2001, 2001, 2001, 2001, 2001, 2001, 2002, 2002, 2002, 2002,
2002, 2002, 2002, 2002, 2002, 2002, 2003, 2003, 2003, 2003, 2003,
2003, 2003, 2003, 2003, 2003, 2004, 2004, 2004, 2004, 2004, 2004,
2004, 2004, 2005, 2005, 2005, 2005, 2005, 2005, 2005, 2006, 2006,
2006, 2006, 2006, 2006, 2006, 2007, 2007, 2007, 2007, 2007, 2008,
2008, 2008, 2008, 2008, 2009, 2009, 2009, 2009, 2009, 2009, 2010,
2010, 2010, 2010, 2010, 2010, 2010, 2010, 2011, 2011, 2011, 2011,
2011, 2011, 2011, 2011, 2012), BIG4 = c(0, 0, 1, 1, 1, 0, 0,
0, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1,
1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,
1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 0, 0, 1,
1, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1,
0, 0, 1, 1, 1, 1, 1, 1, 0)), .Names = c("SIC", "FYEAR", "BIG4"
), row.names = c(31842L, 48128L, 982L, 2173L, 8655L, 31843L,
55799L, 62384L, 983L, 2174L, 7034L, 8656L, 36790L, 51631L, 69782L,
31844L, 55800L, 62385L, 984L, 7035L, 8657L, 18874L, 36791L, 51632L,
69783L, 985L, 7036L, 8658L, 13375L, 18875L, 31845L, 36792L, 51633L,
62386L, 69784L, 986L, 2177L, 7037L, 8659L, 18876L, 36793L, 51634L,
55801L, 62387L, 69785L, 36794L, 987L, 2178L, 7038L, 8660L, 18877L,
51635L, 62388L, 7039L, 36795L, 988L, 2179L, 8661L, 18878L, 62389L,
19823L, 36796L, 989L, 2180L, 8662L, 18879L, 62390L, 19824L, 36797L,
2181L, 8663L, 18880L, 19825L, 36798L, 2182L, 8664L, 69790L, 24268L,
24325L, 36799L, 2183L, 8665L, 31852L, 24269L, 29392L, 36800L,
2184L, 8666L, 18883L, 31853L, 69792L, 24270L, 36801L, 2185L,
8667L, 18884L, 26989L, 31854L, 69793L, 30612L), class = "data.frame"))
Edit:
Let say I want to write a function, and pass cols = c("BIG4")
or cols = c("SIC", "FYEAR")
etc. to define the columns which is used to fragment the data and then delete those fragments having less than 8 data points and then combine the left fragments and then return the joined dataset.