Question

While using R, I am often interested in performing operations on a data.frame in which I summarize a variable by a group, and then want to add those summary values back into the data.frame. This is most easily shown by example:

myDF <- data.frame(A = runif(5), B = c("A", "A", "A", "B", "B"))
myDF$Total <- with(myDF, by(A, B, sum))[myDF$B]
myDF$Proportion <- with(myDF, A / Total)

which produces:

          A B     Total Proportion
1 0.5272734 A 1.7186369  0.3067975
2 0.5105128 A 1.7186369  0.2970452
3 0.6808507 A 1.7186369  0.3961574
4 0.2892025 B 0.6667133  0.4337734
5 0.3775108 B 0.6667133  0.5662266

This trick -- essentially getting a vector of named values, and "spreading" or "stretching" them across the relevant rows by group -- generally works, although class(myDF$Total) is "array" unless I put the by() inside of a c().

I am wondering:

  1. Is there a commonly-used name for this operation?
  2. Is there another, less hacky-feeling, and/or faster way of doing this?
  3. Is there a way to do this with dplyr? Maybe there is a Hadley-approved verb operation (like mutate, arrange, etc.) about which I am unaware. I know that it is easy to summarise(), but I often need to put those summaries back into the data.frame.

Solution

Here's a "less hacky" way to do this with base R.

set.seed(1)
myDF <- data.frame(A = runif(5), B = c("A", "A", "A", "B", "B"))

within(myDF, {
  Total <- ave(A, B, FUN = sum)
  Proportion <- A/Total
})

#           A B Proportion    Total
# 1 0.2655087 A  0.2193406 1.210486
# 2 0.3721239 A  0.3074170 1.210486
# 3 0.5728534 A  0.4732425 1.210486
# 4 0.9082078 B  0.8182865 1.109890
# 5 0.2016819 B  0.1817135 1.109890
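
To see why ave() removes the need for the indexing trick, it may help to compare it with tapply() on a small example. The sketch below uses hand-picked values (not the random data above) so the arithmetic is easy to follow; the object names group_sums and row_sums are only illustrative.

```r
# Toy data with easy-to-check values (an assumption for illustration only).
df <- data.frame(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# tapply() collapses to one value per group, so the result has length 2.
group_sums <- tapply(df$A, df$B, sum)

# ave() applies FUN within each group but returns one value per row,
# already aligned with the rows of df, so no "stretching" step is needed.
row_sums <- ave(df$A, df$B, FUN = sum)

df$Total <- row_sums
df$Proportion <- df$A / df$Total
df
```

Because ave() returns a plain numeric vector of the same length as its input, the resulting column also avoids the "array" class issue mentioned in the question.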

In "dplyr" language, I guess you're looking for mutate:

library(dplyr)

myDF %>%
  group_by(B) %>%
  mutate(Total = sum(A), Proportion = A/Total)

# Source: local data frame [5 x 4]
# Groups: B
# 
#           A B    Total Proportion
# 1 0.2655087 A 1.210486  0.2193406
# 2 0.3721239 A 1.210486  0.3074170
# 3 0.5728534 A 1.210486  0.4732425
# 4 0.9082078 B 1.109890  0.8182865
# 5 0.2016819 B 1.109890  0.1817135

From the "Introduction to dplyr" vignette, you would find the following description:

As well as selecting from the set of existing columns, it's often useful to add new columns that are functions of existing columns. This is the job of mutate(). dplyr::mutate() works the same way as plyr::mutate() and similarly to base::transform(). The key difference between mutate() and transform() is that mutate allows you to refer to columns that you just created.
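
The difference described in that passage can be checked directly. This is a minimal sketch with hand-picked values, assuming dplyr is installed and that no object named Total already exists in the calling environment (otherwise transform() would silently pick that one up instead of failing).

```r
library(dplyr)

# Toy data with easy-to-check values (an assumption for illustration only).
df <- data.frame(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# transform() evaluates all of its arguments against the original columns,
# so Total, created in the same call, is not visible when Proportion is computed.
res <- tryCatch(
  transform(df, Total = ave(A, B, FUN = sum), Proportion = A / Total),
  error = function(e) e
)
inherits(res, "error")

# mutate() evaluates its arguments in order, so Proportion can refer to Total.
out <- df %>%
  group_by(B) %>%
  mutate(Total = sum(A), Proportion = A / Total) %>%
  ungroup()
out
```

With base R, the same effect needs either two separate transform() calls or within(), as in the first example of this answer.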


Also, since you've tagged this "data.table", you can "chain" commands together in "data.table" quite easily to do something like:

library(data.table)

DT <- data.table(myDF)
DT[, Total := sum(A), by = B][, Proportion := A/Total][]
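
If you would rather not chain, one possible variant is to assign both columns in a single `:=` call. This is only a sketch with hand-picked values, assuming data.table is installed; the helper name tot is hypothetical.

```r
library(data.table)

# Toy data with easy-to-check values (an assumption for illustration only).
DT <- data.table(A = c(1, 2, 3, 4, 6), B = c("A", "A", "A", "B", "B"))

# Compute the group total once per group, then add both columns by reference.
# The length-1 total is recycled across the rows of each group.
DT[, c("Total", "Proportion") := {
  tot <- sum(A)
  list(tot, A / tot)
}, by = B]

DT[]
```

The chained form above is arguably easier to read; this version just avoids the second `[` call.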
Licensed under: CC-BY-SA with attribution