Row Differences in Dataframe by Group

https://stackoverflow.com//questions/22029051

21-12-2019

Question

My problem has to do with finding row differences in a data frame by group. I've tried to do this a few ways. Here's an example. The real data set is several million rows long.

set.seed(314)
df = data.frame("group_id"=rep(c(1,2,3),3),
            "date"=sample(seq(as.Date("1970-01-01"),Sys.Date(),by=1),9,replace=F),
            "logical_value"=sample(c(T,F),9,replace=T),
            "integer"=sample(1:100,9,replace=T),
            "float"=runif(9))
df = df[order(df$group_id,df$date),]

I ordered it by group_id and date so that diff computes sequential, time-ordered differences of the logical, integer, and float variables. I could easily do something like apply(df,2,diff), but I need the differences within each group_id. Applied to the whole data frame, diff would also compute differences across group boundaries, which I don't need.

df
  group_id       date logical_value integer      float
1        1 1974-05-13         FALSE       4 0.03472876
4        1 1979-12-02          TRUE      45 0.24493995
7        1 1980-08-18          TRUE       2 0.46662253
5        2 1978-12-08          TRUE      56 0.60039164
2        2 1981-12-26          TRUE      34 0.20081799
8        2 1986-05-19         FALSE      60 0.43928929
6        3 1983-05-22         FALSE      25 0.01792820
9        3 1994-04-20         FALSE      34 0.10905326
3        3 2003-11-04          TRUE      63 0.58365922

So I thought I could break up my data frame into chunks by group_id, and pass each chunk into a user-defined function:

create_differences = function(data_group){
  apply(data_group, 2, diff)
}

But I get errors using the code:

diff_df = lapply(split(df,df$group_id),create_differences)
 Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator 

by(df,df$group_id,create_differences)
 Error in r[i1] - r[-length(r):-(length(r) - lag + 1L)] : non-numeric argument to binary operator

As a side note, the data is clean: no NAs, nulls, or blanks, and every group_id has at least 2 rows associated with it.

Edit 1: User alexis_laz correctly pointed out that my function needs to be sapply(data_group, diff).

Using this edit, I get a list of data frames (one list entry per group).
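
As a minimal sketch of why apply fails and how the per-group pieces could be combined: apply first converts the data frame to a matrix, and with mixed column types that matrix is character, so diff has nothing numeric to subtract. The helper name diff_by_group below is hypothetical, and the data is retyped from the printed data frame above rather than regenerated with sample. Converting every difference with as.numeric is also an assumption, made so that all columns come out as plain numbers.

```r
# Sample data retyped from the printed data frame (already sorted by group_id and date).
df <- data.frame(
  group_id      = rep(1:3, each = 3),
  date          = as.Date(c("1974-05-13", "1979-12-02", "1980-08-18",
                            "1978-12-08", "1981-12-26", "1986-05-19",
                            "1983-05-22", "1994-04-20", "2003-11-04")),
  logical_value = c(FALSE, TRUE, TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, TRUE),
  integer       = c(4L, 45L, 2L, 56L, 34L, 60L, 25L, 34L, 63L),
  float         = c(0.03472876, 0.24493995, 0.46662253, 0.60039164, 0.20081799,
                    0.43928929, 0.01792820, 0.10905326, 0.58365922)
)

# apply() works on as.matrix(df); mixed column types give a character matrix.
is.character(as.matrix(df))

# Hypothetical helper: diff every column except the grouping column,
# then put the group id back in front of the differences.
diff_by_group <- function(data_group) {
  value_cols <- setdiff(names(data_group), "group_id")
  diffs <- lapply(data_group[value_cols], function(col) as.numeric(diff(col)))
  data.frame(group_id = data_group$group_id[-1], diffs)
}

# Split by group, diff each chunk, and stack the pieces into one data frame.
diff_df <- do.call(rbind, lapply(split(df, df$group_id), diff_by_group))
rownames(diff_df) <- NULL
diff_df
```

Working column by column (lapply or sapply over the data frame) avoids the matrix conversion, which is why the sapply version runs where the apply version does not.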

Edit 2:

The expected output would be a combined data frame of differences. Ideally, I would like to keep the group_id, but if not, it's not a big deal. Here is what the sample output should be like:

diff_df
     group_id date logical_value integer     float
[1,]        1 2029             1      41 0.2102112
[2,]        1  260             0     -43 0.2216826
[1,]        2 1114             0     -22 -0.3995737
[2,]        2 1605            -1      26 0.2384713
[1,]        3 3986             0       9 0.09112507
[2,]        3 3485             1      29 0.47460596

Solution

Since you have millions of rows, I think you could move to data.table, which is well suited to by-group operations.

library(data.table)
DT <- as.data.table(df)
## sort by group_id, then by date within each group
setkeyv(DT,c('group_id','date'))
## apply diff to every column, by group
DT[,lapply(.SD,diff),group_id]

# group_id      date logical_value integer       float
# 1:        1 2029 days             1      41  0.21021119
# 2:        1  260 days             0     -43  0.22168257
# 3:        2 1114 days             0     -22 -0.39957366
# 4:        2 1604 days            -1      26  0.23847130
# 5:        3 3987 days             0       9  0.09112507
# 6:        3 3485 days             1      29  0.47460596
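
The date column above prints as "N days" because diff on a Date vector returns a difftime object. If plain numbers are preferred, as in the expected output in the question, one option is to convert explicitly. A small base R sketch, using the three dates of group 1 from the question's printed data (the variable names are only illustrative):

```r
# Dates of group 1, copied from the question's printed data frame.
dates <- as.Date(c("1974-05-13", "1979-12-02", "1980-08-18"))

day_diffs <- diff(dates)
class(day_diffs)                         # a difftime object, not a plain number

# Convert to a plain numeric vector of days.
numeric_days <- as.numeric(day_diffs, units = "days")
numeric_days
```

The same conversion could be applied inside the by-group call, for example by wrapping diff in a small function that calls as.numeric on its result.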

Other tips

It certainly won't be as quick as data.table, but below is an only slightly ugly base R solution using aggregate:

result <- aggregate(. ~ group_id, data=df, FUN=diff)
result <- cbind(result[1],lapply(result[-1], as.vector))
result[order(result$group_id),]

#  group_id date logical_value integer       float
#1        1 2029             1      41  0.21021119
#4        1  260             0     -43  0.22168257
#2        2 1114             0     -22 -0.39957366
#5        2 1604            -1      26  0.23847130
#3        3 3987             0       9  0.09112507
#6        3 3485             1      29  0.47460596
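
To see why the second line (the as.vector step) is there: when FUN returns more than one value per group, aggregate stores each result column as a matrix with one row per group. The sketch below uses only the integer column from the question's printed data to keep it small. The names small and flat are illustrative.

```r
# Two columns retyped from the question's printed data frame.
small <- data.frame(group_id = rep(1:3, each = 3),
                    integer  = c(4L, 45L, 2L, 56L, 34L, 60L, 25L, 34L, 63L))

result <- aggregate(. ~ group_id, data = small, FUN = diff)

# One row per group; the "integer" column is itself a 3 x 2 matrix of differences.
is.matrix(result$integer)
dim(result$integer)

# as.vector() unrolls each matrix column by column, and cbind() recycles
# group_id (1, 2, 3, 1, 2, 3) so every difference keeps its group label.
flat <- cbind(result[1], lapply(result[-1], as.vector))
flat[order(flat$group_id), ]
```

This relies on every group producing the same number of differences, as in the sample data where each group has three rows.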

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow