The function you will likely want to apply to your dataframe is aggregate()
with either mean
or sd
as the function parameter.
mean and standard deviation by group for multiple variables [duplicate]
Question
I am sure this question has been answered before, but I would like to caclulate mean and sd by treatment for multiple variables (100s) all at once and cannot figure out how to do it aside from using a long winded ddply code.
This is a portion of my dataframe (g):
trt blk til res sand silt clay ibd1_6 ibd9_14 ibd_ave
1 CTK 1 CT K 74 15 11 1.323 1.593 1.458
2 CTK 2 CT K 71 15 14 1.575 1.601 1.588
3 CTK 3 CT K 72 14 14 1.551 1.594 1.573
4 CTR 1 CT R 72 15 13 1.560 1.647 1.604
5 CTR 2 CT R 73 14 13 1.612 1.580 1.596
6 CTR 3 CT R 73 13 14 1.709 1.577 1.643
7 ZTK 1 ZT K 72 16 12 1.526 1.546 1.536
8 ZTK 2 ZT K 71 16 13 1.292 1.626 1.459
9 ZTK 3 ZT K 71 17 12 1.623 1.607 1.615
10 ZTR 1 ZT R 66 16 18 1.719 1.709 1.714
11 ZTR 2 ZT R 67 17 16 1.529 1.708 1.618
12 ZTR 3 ZT R 66 17 17 1.663 1.655 1.659
I would like to have a function that does what ddply does, i.e ddply(g, trt, meanSand=mean(sand), sdSand=sd(sand), meanSilt=mean(silt). . . .) without having to write it all out. Any ideas? Thank you for your patience!
La solution
Autres conseils
assuming myDF
is your original dataset:
library(data.table)
myDT <- data.table(myDF)
# Which variables to calculate All columns but the first five? :
variables <- tail( names(myDT), -5)
myDT[, lapply(.SD, function(x) list(mean(x), sd(x))), .SDcols=variables, by=list(trt, til)]
## OR Separately, if you prefer shorter `lapply` statements
myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
--
> myDT[, lapply(.SD, mean), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 14.66667 13.00000 1.483000 1.596000 1.539667
# 2: CTR CT 14.00000 13.33333 1.627000 1.601333 1.614333
# 3: ZTK ZT 16.33333 12.33333 1.480333 1.593000 1.536667
# 4: ZTR ZT 16.66667 17.00000 1.637000 1.690667 1.663667
> myDT[, lapply(.SD, sd), .SDcols=variables, by=list(trt, til)]
# trt til silt clay ibd1_6 ibd9_14 ibd_ave
# 1: CTK CT 0.5773503 1.7320508 0.13908271 0.004358899 0.07112196
# 2: CTR CT 1.0000000 0.5773503 0.07562407 0.039576929 0.02514624
# 3: ZTK ZT 0.5773503 0.5773503 0.17015973 0.041797129 0.07800214
# 4: ZTR ZT 0.5773503 1.0000000 0.09763196 0.030892286 0.04816984
aggregate(g[, c("sand", "silt", "clay")], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )
Using an anonymous function with aggregate.data.frame
allows one to get both values with one call. You only want to pass in the columns to be aggregated.If you had a long list of columns and only wanted to exclude let's say the first 4 from calculations, it could be written as:
aggregate(g[, names(g)[-(1:4)], g$trt, function(x) c(mean=mean(x), sd=sd(x) ) )