Question

I have 4 populations with known means and standard deviations. I would like to know the grand mean and grand sd. The grand mean is obviously simple to calculate, but R has a handy utility function, weighted.mean(). Does a similar function exist for combining standard deviations?

The calculation is not complicated, but an existing function would make my code cleaner and easier to understand.

Bonus question, what tools do you use to search for functions like this? I know it must be out there, but I've done a lot of searching and can't find it. Thanks!

Was it helpful?

Solution

Are the populations non overlapping?

library(fishmethods)
combinevar

For instance the example in wikipedia would work like this:

xbar <- c(70,65)
s<-c(3,2)
n <- c(1,1)
combinevar(xbar,s,n)

and standard deviation would be sqrt(combinevar(xbar,s,n)[2])

if you don't want to download the library the function goes like this:

combinevar <- 
function (xbar = NULL, s_squared = NULL, n = NULL) 
{
    if (length(xbar) != length(s_squared) | length(xbar) != length(n) | 
        length(s_squared) != length(n)) 
        stop("Vector lengths are different.")
    sum_of_squares <- sum((n - 1) * s_squared + n * xbar^2)
    grand_mean <- sum(n * xbar)/sum(n)
    combined_var <- (sum_of_squares - sum(n) * grand_mean^2)/(sum(n) - 
        1)
    return(c(grand_mean, combined_var))
}

OTHER TIPS

I don't know of a specific package or function name but it seems easy to roll your own function from Wikipedia's page. Assuming no overlap in the populations:

## N: vector of sizes
## M: vector of means
## S: vector of standard deviations

grand.mean <- function(M, N) {weighted.mean(M, N)}
grand.sd   <- function(S, M, N) {sqrt(weighted.mean(S^2 + M^2, N) -
                                      weighted.mean(M, N)^2)}

Use the sample.decomp function in the utilities package

Statistical problems of this kind have now been automated in the sample.decomp function in the utilities package. This function can compute pooled sample moments from subgroup moments, or compute missing subgroup moments from the other subgroup moments and pooled moments. It works for decompositions up to fourth order ---i.e., decompositions of sample size, sample mean, sample variance/standard deviation, sample skewness, and sample kurtosis.


How to use the function: Here we give an example where we use the function to compute the sample moments of a pooled sample composed of four subgroups. To do this, we first generate a mock dataset DATA containing four subgroups with unequal sizes, and we pool these as the single dataset POOL. The moments of the subgroups and the pooled sample can be obtained using the moments function in the same package.

#Create some subgroups of mock data and a pooled dataset
set.seed(1)
N    <- c(28, 44, 51, 102)
SUB1 <- rnorm(N[1])
SUB2 <- rnorm(N[2])
SUB3 <- rnorm(N[3])
SUB4 <- rnorm(N[4])
DATA <- list(SUB1 = SUB1, SUB2 = SUB2, SUB3 = SUB3, SUB4 = SUB4)
POOL <- c(SUB1, SUB2, SUB3, SUB4)

#Show sample statistics for the subgroups
library(utilities)
moments(DATA)

       n sample.mean sample.var sample.skew sample.kurt NAs
SUB1  28  0.09049834  0.9013829  -0.7648008    3.174128   0
SUB2  44  0.18637936  0.8246700   0.3653918    3.112901   0
SUB3  51  0.05986594  0.6856030   0.3076281    2.306243   0
SUB4 102 -0.05135660  1.0526184   0.3348429    2.741974   0

#Show sample statistics for the pooled sample
moments(POOL)

       n sample.mean sample.var sample.skew sample.kurt NAs
POOL 225  0.03799749  0.9030244   0.1705622    2.828833   0

Now that we have set of moments for subgroups, we can use the sample.decomp function to obtain the pooled sample moments from the subgroup sample moments. As an input to this function you can either use the moments output for the subgroups or you can input the sample sizes and sample moments separately as vectors (here we will do the latter). As you can see, this gives the same sample moments for the pooled sample as direct computation from the underlying data.

#Compute sample statistics for subgroups
library(utilities)
MEAN   <- c(mean(SUB1), mean(SUB2), mean(SUB3), mean(SUB4))
VAR    <- c( var(SUB1),  var(SUB2),  var(SUB3),  var(SUB4))

#Compute sample decomposition
sample.decomp(n = N, sample.mean = MEAN, sample.var  = VAR, names = names(DATA))

             n sample.mean sample.var
SUB1        28  0.09049834  0.9013829
SUB2        44  0.18637936  0.8246700
SUB3        51  0.05986594  0.6856030
SUB4       102 -0.05135660  1.0526184
--pooled-- 225  0.03799749  0.9030244

As you can see, the sample.decomp function allows computation of the pooled sample variance. You can read about this function in the package documentation.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top