Question

I am trying to create two data sets. The first summarizes the data by two groups, which I have done using the following code:

x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)

aggregate(x, list(g1, g2), mean)

The second needs to summarize the data by the first group and NOT the second group.

If we consider the possible pairs from the previous example:

A - X    B - X    C - X
A - Y    B - Y    C - Y
A - Z    B - Z    C - Z

The second dataset should summarize the data as the average of the outgroup:

A - not X
A - not Y
A - not Z etc. 

Is there a way to manipulate aggregate functions in R to achieve this? I also thought there could be a dummy variable that represents the data in this way, although I am unsure how it would look.

I have found this answer here: R using aggregate to find a function (mean) for "all other"

I think this indicates that a dummy variable for each pairing is necessary. However, if anyone can offer a better or more efficient way, that would be appreciated, as there are many pairings in the true data set.

Thanks in advance

Solution

First let us generate the data reproducibly (using set.seed):

# same as question but added set.seed for reproducibility
set.seed(123)
x = rnorm(1:100)
g1 = sample(LETTERS[1:3], 100, replace = TRUE)
g2 = sample(LETTERS[24:26], 100, replace = TRUE)

Here are two solutions, both of which use aggregate:

1) ave

# x holds the sum of each (g1, g2) cell and n holds the count of each cell
ag = cbind(aggregate(x, list(g1, g2), sum),
            n = aggregate(x, list(g1, g2), length)[, 3])

ave.not <- function(x, g) ave(x, g, FUN = sum) - x
transform(ag, 
     x = NULL, # don't need x any more
     n = NULL, # don't need n any more
     mean = x/n, 
     mean.not = ave.not(x, Group.1) / ave.not(n, Group.1)
)

This gives:

  Group.1 Group.2       mean     mean.not
1       A       X  0.3155084 -0.091898832
2       B       X -0.1789730  0.332544353
3       C       X  0.1976471  0.014282465
4       A       Y -0.3644116  0.236706489
5       B       Y  0.2452157  0.099240545
6       C       Y -0.1630036  0.179833987
7       A       Z  0.1579046 -0.009670734
8       B       Z  0.4392794  0.033121335
9       C       Z  0.1620209  0.033714943
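
The arithmetic behind mean.not may be easier to see in matrix form: for each cell, take the row total minus the cell itself, for both sums and counts, then divide. The following is only a sketch of that idea and not part of the original answer; the names S, N and mean.not.mat are made up for illustration, and the data are regenerated so the block runs on its own.

```r
# Regenerate the data (rnorm(100) gives the same draws as rnorm(1:100),
# since a vector argument only supplies the number of observations)
set.seed(123)
x  <- rnorm(100)
g1 <- sample(LETTERS[1:3], 100, replace = TRUE)
g2 <- sample(LETTERS[24:26], 100, replace = TRUE)

# S and N are hypothetical names: cell sums and cell counts,
# with rows indexed by g1 and columns by g2
S <- tapply(x, list(g1, g2), sum)
N <- tapply(x, list(g1, g2), length)
S[is.na(S)] <- 0   # an empty cell contributes nothing to its row total
N[is.na(N)] <- 0

# Row total minus the cell itself = sum (or count) over all other g2 values.
# rowSums() has one entry per row, so it recycles down each column as intended.
mean.not.mat <- (rowSums(S) - S) / (rowSums(N) - N)
mean.not.mat
```

Each entry mean.not.mat[a, b] should then match mean(x[g1 == a & g2 != b]), which is the same quantity that ave.not computes on the long-format data frame.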

To double check the first value under mean and under mean.not:

> mean(x[g1 == "A" & g2 == "X"])
[1] 0.3155084
> mean(x[g1 == "A" & g2 != "X"])
[1] -0.09189883

2) sapply

Here is a second approach which gives the same answer:

ag <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == ag$Group.1[i] & g2 != ag$Group.2[i]])
ag$mean.not = sapply(seq_len(nrow(ag)), f)
ag
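
If you want to confirm that the two approaches agree on every row rather than just the first one, something like the following should do it. This is a sketch, not part of the original answer: the names res1, res2 and agree are made up here, and the data are regenerated so the block is self-contained.

```r
set.seed(123)
x  <- rnorm(100)   # same draws as rnorm(1:100)
g1 <- sample(LETTERS[1:3], 100, replace = TRUE)
g2 <- sample(LETTERS[24:26], 100, replace = TRUE)

# Approach 1: per-cell sums and counts, then subtract each cell
# from the total of its Group.1
ag1 <- cbind(aggregate(x, list(g1, g2), sum),
             n = aggregate(x, list(g1, g2), length)[, 3])
ave.not <- function(x, g) ave(x, g, FUN = sum) - x
res1 <- transform(ag1,
                  mean = x / n,
                  mean.not = ave.not(x, Group.1) / ave.not(n, Group.1))

# Approach 2: compute the "all other" mean directly for each row
res2 <- aggregate(list(mean = x), list(g1, g2), mean)
f <- function(i) mean(x[g1 == res2$Group.1[i] & g2 != res2$Group.2[i]])
res2$mean.not <- sapply(seq_len(nrow(res2)), f)

# all.equal() compares with a small numeric tolerance
agree <- isTRUE(all.equal(res1$mean, res2$mean)) &&
         isTRUE(all.equal(res1$mean.not, res2$mean.not))
agree
```

Approach 1 only touches the aggregated table, so it should scale better when there are many pairings; approach 2 rescans the raw data once per row but is easier to read.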

REVISED: Revised based on comments by the poster; added a second approach and made some minor improvements.

Licensed under: CC-BY-SA with attribution