Calculate unique combinations of values in dataframe, and summary values

https://stackoverflow.com/questions/4697106

11-10-2019

Question

I would like to work with unique combinations of var1 and var2 in my dataframe:

foo <- data.frame(var1 = c(1,1,2,2,2,2,3,3,3,3,3,4,4,4,4),
                  var2 = c(1,1,1,1,2,2,1,1,2,2,2,2,2,3,3))

As has been noted, unique(foo) results in this:

      var1  var2
 1    1     1
 2    2     1
 3    2     2
 4    3     1
 5    3     2
 6    4     2
 7    4     3

Based on the unique combinations, how do I get:

n, the number of unique combinations in which each var1 value occurs, and
svar, the sum of the var2 values across those unique combinations for each var1 value.

The output could look like this:

      var1  n    svar
1     1     1    1
2     2     2    3
3     3     2    3
4     4     2    5

Solution

unique(foo) should give you what you are after here.

UPDATE 2014: use dplyr instead of plyr

I recommend looking into the plyr package for other aggregation tasks, or the base R equivalents such as tapply() and aggregate().

While redundant for this exercise, here's how you would use plyr:

library(plyr)
ddply(foo, .(var1), unique)

Note you can replace unique with any number of functions, such as finding the mean and sd of var2 like so:

ddply(foo, .(var1), summarise, mean = mean(var2), sd = sd(var2))
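The base R route mentioned earlier can produce the same kind of summary without plyr. Here is a minimal sketch using aggregate(); the names m, s and out are only illustrative, not part of the original answer:

```r
foo <- data.frame(var1 = c(1,1,2,2,2,2,3,3,3,3,3,4,4,4,4),
                  var2 = c(1,1,1,1,2,2,1,1,2,2,2,2,2,3,3))

# one aggregate() call per statistic, each grouped by var1
m <- aggregate(var2 ~ var1, data = foo, FUN = mean)
s <- aggregate(var2 ~ var1, data = foo, FUN = sd)

# combine the two results into one data frame (illustrative column names)
out <- data.frame(var1 = m$var1, mean = m$var2, sd = s$var2)
out
```

Calling aggregate() once per statistic keeps each result a plain numeric column, which is easier to work with than the matrix column you get when FUN returns several values.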

Response to edit

Now you have a more legitimate use for plyr. Taking what we learned from above:

x <- unique(foo)

combined with plyr:

ddply(x, .(var1), summarise, n = length(var2), sum = sum(var2))

Should give you what you are after.
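Following the 2014 update note, the same two steps might look like this in dplyr. This is only a sketch: it assumes the dplyr package is installed, and it names the sum column svar as in the question rather than sum as in the plyr call:

```r
library(dplyr)  # assumption: dplyr is installed

foo <- data.frame(var1 = c(1,1,2,2,2,2,3,3,3,3,3,4,4,4,4),
                  var2 = c(1,1,1,1,2,2,1,1,2,2,2,2,2,3,3))

# keep one row per (var1, var2) combination, then summarise per var1
res <- foo %>%
  distinct() %>%
  group_by(var1) %>%
  summarise(n = n(), svar = sum(var2))
res
```

Here distinct() plays the role of unique(foo), and group_by() plus summarise() replaces the ddply() call.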

OTHER TIPS

I hope I understand your question correctly. Try:

unique(foo)

After question was edited:

To avoid repeating @Chase's answer, a very simple but not very elegant solution could be:

foo$var12 <- paste(foo$var1, foo$var2, sep='|')      # the two variables combined to one
table(foo$var12)                                     # and showing its frequencies

And the output is a table of course:

 1|1 2|1 2|2 3|1 3|2 4|2 4|3 
   2   2   2   2   3   2   2
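If pasting the two columns together feels too hacky, the same frequencies can probably be obtained by tabulating both columns directly. A small sketch, where the name counts is just for illustration:

```r
foo <- data.frame(var1 = c(1,1,2,2,2,2,3,3,3,3,3,4,4,4,4),
                  var2 = c(1,1,1,1,2,2,1,1,2,2,2,2,2,3,3))

# cross-tabulate var1 by var2, then flatten to one row per combination
counts <- as.data.frame(table(var1 = foo$var1, var2 = foo$var2))

# drop the combinations that never occur in foo
counts <- counts[counts$Freq > 0, ]
counts
```

This keeps var1 and var2 as separate columns (as factors), so there is no need to split a combined key such as "2|1" afterwards.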

The answers are different than you state, but I trust my code more than I trust your answer, and I cannot bring myself to commit the sin of naming a variable "sum":

 newfoo <- data.frame(
                 var1=unique(foo$var1),
                 n = with(foo, tapply(var2, var1, length) ),
                 svar = with(foo, tapply(var2, var1, sum) ) )
 newfoo
#  var1 n svar
#1    1 2    2
#2    2 4    6
#3    3 5    8
#4    4 4   10

EDIT: (I hadn't at first figured out what Chase was trying to tell me.)

newfoo <- data.frame(
                  var1=unique(unique(foo)$var1),
                  n = with(unique(foo), tapply(var2, var1, length) ),
                  svar = with(unique(foo), tapply(var2, var1, sum) ) )

> newfoo
  var1 n svar
1    1 1    1
2    2 2    3
3    3 2    3
4    4 2    5

Licensed under: CC-BY-SA with attribution
