Question

I have been searching for this for a while, but haven't been able to find a clear answer so far. I have probably been searching for the wrong terms, but maybe somebody here can quickly help me. The question is fairly basic.

Sample data set:

set <- structure(list(VarName = structure(c(1L, 5L, 4L, 2L, 3L),
 .Label = c("Apple/Blue/Nice", 
"Apple/Blue/Ugly", "Apple/Pink/Ugly", "Kiwi/Blue/Ugly", "Pear/Blue/Ugly"
), class = "factor"), Color = structure(c(1L, 1L, 1L, 1L, 2L), .Label = c("Blue", 
"Pink"), class = "factor"), Qty = c(45L, 34L, 46L, 21L, 38L)), .Names = c("VarName", 
"Color", "Qty"), class = "data.frame", row.names = c(NA, -5L))

This gives a data set like:

set


      VarName      Color Qty
1 Apple/Blue/Nice  Blue  45
2  Pear/Blue/Ugly  Blue  34
3  Kiwi/Blue/Ugly  Blue  46
4 Apple/Blue/Ugly  Blue  21
5 Apple/Pink/Ugly  Pink  38

What I would like to do is fairly straightforward. I would like to sum (or average, or take the standard deviation of) the Qty column. But I would also like to do the same operation under the following conditions:

  1. VarName includes "Apple"
  2. VarName includes "Ugly"
  3. Color equals "Blue"

Can anybody give me a quick introduction on how to perform this kind of calculation?

I am aware that some of it can be done by the aggregate() function, e.g.:

aggregate(set[3], FUN=sum, by=set[2])[1,2]

However, I believe that there is a more straightforward way of doing this than this. Are there some filters that can be added to functions like sum()?


Solution

Is this what you're looking for?

 # sum for those including 'Apple'
 apple <- set[grep('Apple', set[, 'VarName']), ]
 aggregate(apple[3], FUN=sum, by=apple[2])
  Color Qty
1  Blue  66
2  Pink  38

 # sum for those including 'Ugly'
 ugly <- set[grep('Ugly', set[, 'VarName']), ]
 aggregate(ugly[3], FUN=sum, by=ugly[2])
  Color Qty
1  Blue 101
2  Pink  38

 # sum for Color==Blue
 sum(set[set[, 'Color']=='Blue', 3])
[1] 146

The last sum could also be done using subset():

sum(subset(set, Color=='Blue')[,3])
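
If the totals are wanted across all colors rather than per color, one possible approach is to index Qty with a logical vector built by grepl(). The following is only a sketch: it rebuilds the sample data with a simpler constructor, and the names is_apple, is_ugly, is_blue and the result variables are made up for illustration.

```r
# Rebuild the sample data from the question (same values, simpler constructor)
set <- data.frame(
  VarName = factor(c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
                     "Apple/Blue/Ugly", "Apple/Pink/Ugly")),
  Color   = factor(c("Blue", "Blue", "Blue", "Blue", "Pink")),
  Qty     = c(45L, 34L, 46L, 21L, 38L)
)

# grepl() returns TRUE/FALSE per row, so the result can index Qty directly.
# as.character() is used because VarName is a factor.
is_apple <- grepl("Apple", as.character(set$VarName), fixed = TRUE)
is_ugly  <- grepl("Ugly",  as.character(set$VarName), fixed = TRUE)
is_blue  <- set$Color == "Blue"

sum_apple <- sum(set$Qty[is_apple])            # 45 + 21 + 38
mean_ugly <- mean(set$Qty[is_ugly])            # mean of 34, 46, 21, 38
sd_blue   <- sd(set$Qty[is_blue])              # sd of 45, 34, 46, 21
sum_both  <- sum(set$Qty[is_apple & is_ugly])  # combine conditions with &
```

The same logical vectors work with any summary function, so sum, mean and sd can be swapped freely.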

OTHER TIPS

The easiest way is to split up your VarName column; then subsetting becomes very easy. So, let's create an object where VarName has been separated:

##There must(?) be a better way than this. Anyone?
new_set =  t(as.data.frame(sapply(as.character(set$VarName), strsplit, "/")))

Brief explanation:

  • We use as.character because set$VarName is a factor
  • sapply takes each value in turn and applies strsplit
  • The strsplit function splits up the elements
  • We convert to a data frame
  • Transpose to get the correct rotation
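
A possibly shorter way to get the same shape, sketched here under the assumption that every VarName has exactly three "/"-separated parts, is to call strsplit once on the whole vector and bind the pieces row-wise. The variable parts is a name made up for this illustration.

```r
# The VarName column from the question, as a factor
VarName <- factor(c("Apple/Blue/Nice", "Pear/Blue/Ugly", "Kiwi/Blue/Ugly",
                    "Apple/Blue/Ugly", "Apple/Pink/Ugly"))

# strsplit is vectorised: it returns a list with one character vector per row
parts <- strsplit(as.character(VarName), "/", fixed = TRUE)

# rbind the list elements into a 5 x 3 character matrix, then convert;
# the columns get the default names V1, V2, V3
new_set <- as.data.frame(do.call(rbind, parts), stringsAsFactors = FALSE)
```

If some rows had a different number of parts, rbind would recycle values with a warning, so this sketch is only safe for regular data like the sample.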

Next,

##Convert to a data frame
new_set = as.data.frame(new_set)
##Make nice rownames - not actually needed
rownames(new_set) = 1:nrow(new_set)
##Add in the Qty column
new_set$Qty = set$Qty

This gives

R> new_set
     V1   V2   V3 Qty
1 Apple Blue Nice  45
2  Pear Blue Ugly  34
3  Kiwi Blue Ugly  46
4 Apple Blue Ugly  21
5 Apple Pink Ugly  38

Now all the usual operations work as standard. For example,

##Add up all blue Qtys
sum(new_set[new_set$V2 == "Blue",]$Qty)
[1] 146

##Average of Blue and Ugly Qtys
mean(new_set[new_set$V2 == "Blue" & new_set$V3 == "Ugly",]$Qty)
[1] 33.66667

Once it's in the correct form, you can use ddply, which does everything you want (and more):

library(plyr)
##Split the data frame up by V1 and take the mean of Qty
ddply(new_set, .(V1), summarise, m = mean(Qty))

##Split the data frame up by V1 & V2 and take the mean of Qty
ddply(new_set, .(V1, V2), summarise, m = mean(Qty))
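
If plyr is not available, the formula interface of base R's aggregate() should give comparable grouped means. This is a sketch that rebuilds new_set by hand so it runs on its own; the names by_fruit and by_fruit_color are made up for illustration.

```r
# The split-up data frame, rebuilt directly so the example is self-contained
new_set <- data.frame(
  V1  = c("Apple", "Pear", "Kiwi", "Apple", "Apple"),
  V2  = c("Blue", "Blue", "Blue", "Blue", "Pink"),
  V3  = c("Nice", "Ugly", "Ugly", "Ugly", "Ugly"),
  Qty = c(45L, 34L, 46L, 21L, 38L)
)

# Mean of Qty for each value of V1
by_fruit <- aggregate(Qty ~ V1, data = new_set, FUN = mean)

# Mean of Qty for each combination of V1 and V2 that occurs in the data
by_fruit_color <- aggregate(Qty ~ V1 + V2, data = new_set, FUN = mean)
```

Unlike the ddply calls above, the result column keeps the name Qty rather than m.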
Licensed under: CC-BY-SA with attribution