Using dlply with unsplit

https://stackoverflow.com/questions/23573739

r
plyr

19-07-2023

Question

I want to append a column to a data frame that holds the result of a cumulative function computed within each group. I can accomplish this with split/unsplit, like this:

> set.seed(3)
> d <- data.frame(type=sample(c('a','b'),10,replace=TRUE), val=rnorm(10))
> d
   type         val
1     a  0.03012394
2     b  0.08541773
3     a  1.11661021
4     a -1.21885742
5     b  1.26736872
6     b -0.74478160
7     a -1.13121857
8     a -0.71635849
9     b  0.25265237
10    b  0.15204571

So I use split/lapply/unsplit to get my desired result:

> d$sum <- unsplit(lapply(split(d,d$type), function(x) { cumsum(x$val)}), d$type)
> d
   type         val         sum
1     a  0.03012394  0.03012394
2     b  0.08541773  0.08541773
3     a  1.11661021  1.14673416
4     a -1.21885742 -0.07212326
5     b  1.26736872  1.35278645
6     b -0.74478160  0.60800486
7     a -1.13121857 -1.20334183
8     a -0.71635849 -1.91970032
9     b  0.25265237  0.86065723
10    b  0.15204571  1.01270293

This is the desired result, but I'd really like to use the simpler syntax of plyr in this case. So I tried:

> d$sum2 <- unsplit(dlply(d, .(type), summarise, cumsum(val)), d$type)
Error in `row.names<-.data.frame`(`*tmp*`, value = value) : 
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': '1', '2', '3', '4', '5'

The outputs of dlply and of lapply/split are almost the same, except that the dlply output carries some extra attributes, which I think unsplit will ignore, and its row.names have been re-indexed within each group. I think the latter is what the error is complaining about.
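The row-name diagnosis can be reproduced in base R without plyr. This is a minimal sketch, not the dlply internals: `pieces` is a hypothetical stand-in for the dlply output, and the sample values are copied from the printout above.

```r
# Sample data copied from the printout above.
d <- data.frame(
  type = c("a", "b", "a", "a", "b", "b", "a", "a", "b", "b"),
  val  = c(0.03012394, 0.08541773, 1.11661021, -1.21885742, 1.26736872,
           -0.74478160, -1.13121857, -0.71635849, 0.25265237, 0.15204571)
)

# Hypothetical stand-in for the dlply output: one data frame per group,
# each with row names restarting at 1.
pieces <- lapply(split(d, d$type), function(x) data.frame(sum = cumsum(x$val)))

# For data frame pieces, unsplit() tries to restore the row names,
# and the restarted row names collide across groups.
err <- tryCatch(
  suppressWarnings(unsplit(pieces, d$type)),
  error = function(e) conditionMessage(e)
)
print(err)

# Extracting the plain vectors first avoids the row-name step entirely.
d$sum <- unsplit(lapply(pieces, `[[`, "sum"), d$type)
print(d)
```

If this reading is right, the same extraction step applied to the dlply result should also let unsplit succeed, provided the groups come back in the same order as the levels of `d$type`.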

Note also that I am aware I can approach this with ddply/transform:

> ddply(d, .(type), transform, sum2=cumsum(val))                                
   type         val         sum        sum2
1     a  0.03012394  0.03012394  0.03012394
2     a  1.11661021  1.14673416  1.14673416
3     a -1.21885742 -0.07212326 -0.07212326
4     a -1.13121857 -1.20334183 -1.20334183
5     a -0.71635849 -1.91970032 -1.91970032
6     b  0.08541773  0.08541773  0.08541773
7     b  1.26736872  1.35278645  1.35278645
8     b -0.74478160  0.60800486  0.60800486
9     b  0.25265237  0.86065723  0.86065723
10    b  0.15204571  1.01270293  1.01270293

This won't work in my case because, as you can see, it has the side effect of reordering the rows by group. If ddply had an argument that kept the rows in their original order, it would be perfect for my purposes.
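A common workaround for this kind of reordering is to record each row's original position before the grouped call and sort on it afterwards. The following is a minimal sketch, assuming plyr is installed; `row_id` is a hypothetical helper column, and the sample values are copied from the printout above.

```r
# Sample data copied from the printout above.
d <- data.frame(
  type = c("a", "b", "a", "a", "b", "b", "a", "a", "b", "b"),
  val  = c(0.03012394, 0.08541773, 1.11661021, -1.21885742, 1.26736872,
           -0.74478160, -1.13121857, -0.71635849, 0.25265237, 0.15204571)
)

# Remember the original row order (hypothetical helper column).
d$row_id <- seq_len(nrow(d))

# Grouped cumulative sum; the result comes back sorted by group.
out <- plyr::ddply(d, "type", transform, sum2 = cumsum(val))

# Restore the original order and drop the helper column.
out <- out[order(out$row_id), ]
rownames(out) <- NULL
out$row_id <- NULL
print(out)
```

The cost is one extra column and one sort, which is usually negligible next to the grouped computation itself.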

Solution

Perhaps you could try dplyr instead? In contrast to ddply, it keeps the original order.

library(dplyr)
# %>% replaces the %.% operator used in early dplyr releases
d %>%
  group_by(type) %>%
  mutate(sum = cumsum(val))
# Source: local data frame [10 x 3]
# Groups: type
# 
#    type         val         sum
# 1     a  0.03012394  0.03012394
# 2     b  0.08541773  0.08541773
# 3     a  1.11661021  1.14673416
# 4     a -1.21885742 -0.07212326
# 5     b  1.26736872  1.35278645
# 6     b -0.74478160  0.60800486
# 7     a -1.13121857 -1.20334183
# 8     a -0.71635849 -1.91970032
# 9     b  0.25265237  0.86065723
# 10    b  0.15204571  1.01270293

OTHER TIPS

Why not use ave?

d$sum <-   # absolutely terrible name for a variable
  ave( d$val, d$type, FUN=cumsum)

The lapply( split(d, d$type) , func)-approach is overkill for a function that will only operate on one vector at a time.
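To make the comparison concrete, here is a small sketch checking that ave gives the same values as the split/lapply/unsplit route when only one vector is involved. The column name `cum_val` is a hypothetical choice, and the sample values are copied from the question's printout.

```r
# Sample data copied from the question's printout.
d <- data.frame(
  type = c("a", "b", "a", "a", "b", "b", "a", "a", "b", "b"),
  val  = c(0.03012394, 0.08541773, 1.11661021, -1.21885742, 1.26736872,
           -0.74478160, -1.13121857, -0.71635849, 0.25265237, 0.15204571)
)

# Long form: split the vector by group, apply cumsum, reassemble in row order.
by_split <- unsplit(lapply(split(d$val, d$type), cumsum), d$type)

# Short form: ave() does the split/apply/reassemble in one call.
by_ave <- ave(d$val, d$type, FUN = cumsum)

# Both routes should agree, and the original row order is preserved.
stopifnot(isTRUE(all.equal(by_split, by_ave)))

d$cum_val <- by_ave  # hypothetical column name
print(d)
```

Because ave returns a vector the same length as its input, in the same order, it can be assigned straight back as a new column.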

Licensed under: CC-BY-SA with attribution
