existing function to combine standard deviations in R?
-
28-04-2021 - |
Question
I have 4 populations with known means and standard deviations. I would like to know the grand mean and grand sd. The grand mean is obviously simple to calculate, but R has a handy utility function, weighted.mean(). Does a similar function exist for combining standard deviations?
The calculation is not complicated, but an existing function would make my code cleaner and easier to understand.
Bonus question, what tools do you use to search for functions like this? I know it must be out there, but I've done a lot of searching and can't find it. Thanks!
Solution
Are the populations non overlapping?
library(fishmethods)
combinevar
For instance the example in wikipedia would work like this:
xbar <- c(70,65)
s<-c(3,2)
n <- c(1,1)
combinevar(xbar,s,n)
and standard deviation would be sqrt(combinevar(xbar,s,n)[2])
if you don't want to download the library the function goes like this:
combinevar <-
function (xbar = NULL, s_squared = NULL, n = NULL)
{
if (length(xbar) != length(s_squared) | length(xbar) != length(n) |
length(s_squared) != length(n))
stop("Vector lengths are different.")
sum_of_squares <- sum((n - 1) * s_squared + n * xbar^2)
grand_mean <- sum(n * xbar)/sum(n)
combined_var <- (sum_of_squares - sum(n) * grand_mean^2)/(sum(n) -
1)
return(c(grand_mean, combined_var))
}
OTHER TIPS
I don't know of a specific package or function name but it seems easy to roll your own function from Wikipedia's page. Assuming no overlap in the populations:
## N: vector of sizes
## M: vector of means
## S: vector of standard deviations
grand.mean <- function(M, N) {weighted.mean(M, N)}
grand.sd <- function(S, M, N) {sqrt(weighted.mean(S^2 + M^2, N) -
weighted.mean(M, N)^2)}
Use the sample.decomp
function in the utilities
package
Statistical problems of this kind have now been automated in the sample.decomp
function in the utilities
package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order ---i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.
How to use the function: Here we give an example where we use the function to compute the sample moments of a pooled sample composed of four subgroups. To do this, we first generate a mock dataset DATA
containing four subgroups with unequal sizes, and we pool these as the single dataset POOL
. The moments of the subgroups and the pooled sample can be obtained using the moments
function in the same package.
#Create some subgroups of mock data and a pooled dataset
set.seed(1)
N <- c(28, 44, 51, 102)
SUB1 <- rnorm(N[1])
SUB2 <- rnorm(N[2])
SUB3 <- rnorm(N[3])
SUB4 <- rnorm(N[4])
DATA <- list(SUB1 = SUB1, SUB2 = SUB2, SUB3 = SUB3, SUB4 = SUB4)
POOL <- c(SUB1, SUB2, SUB3, SUB4)
#Show sample statistics for the subgroups
library(utilities)
moments(DATA)
n sample.mean sample.var sample.skew sample.kurt NAs
SUB1 28 0.09049834 0.9013829 -0.7648008 3.174128 0
SUB2 44 0.18637936 0.8246700 0.3653918 3.112901 0
SUB3 51 0.05986594 0.6856030 0.3076281 2.306243 0
SUB4 102 -0.05135660 1.0526184 0.3348429 2.741974 0
#Show sample statistics for the pooled sample
moments(POOL)
n sample.mean sample.var sample.skew sample.kurt NAs
POOL 225 0.03799749 0.9030244 0.1705622 2.828833 0
Now that we have set of moments for subgroups, we can use the sample.decomp
function to obtain the pooled sample moments from the subgroup sample moments. As an input to this function you can either use the moments
output for the subgroups or you can input the sample sizes and sample moments separately as vectors (here we will do the latter). As you can see, this gives the same sample moments for the pooled sample as direct computation from the underlying data.
#Compute sample statistics for subgroups
library(utilities)
MEAN <- c(mean(SUB1), mean(SUB2), mean(SUB3), mean(SUB4))
VAR <- c( var(SUB1), var(SUB2), var(SUB3), var(SUB4))
#Compute sample decomposition
sample.decomp(n = N, sample.mean = MEAN, sample.var = VAR, names = names(DATA))
n sample.mean sample.var
SUB1 28 0.09049834 0.9013829
SUB2 44 0.18637936 0.8246700
SUB3 51 0.05986594 0.6856030
SUB4 102 -0.05135660 1.0526184
--pooled-- 225 0.03799749 0.9030244
As you can see, the sample.decomp
function allows computation of the pooled sample variance. You can read about this function in the package documentation.