Must ddply use all possible combinations of the splitting variable(s), or only observed?

https://stackoverflow.com/questions/16363834

14-04-2022

Question

I have a data frame called thetas containing about 2.7 million observations.

> str(thetas)
'data.frame':   2700000 obs. of  8 variables:
 $ rho_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ pct_cnd   : num  0 0 0 0 0 0 0 0 0 0 ...
 $ sx        : num  1 2 3 4 5 6 7 8 9 10 ...
 $ model     : Factor w/ 7 levels "dN.mN","dN.mL",..: 1 1 1 1 1 1 1 1 1 1 ...
 $ estTheta  : num  -1.58 -1.716 0.504 -2.296 0.98 ...
 $ trueTheta : num  0.0962 -3.3913 3.6006 -0.1971 2.1906 ...
 $ estError  : num  -1.68 1.68 -3.1 -2.1 -1.21 ...
 $ trueAberSx: num  0 0 0 0 0 0 0 0 0 0 ...

I would like to use ddply, or some similar function, to sum the error of estimation (the column estError in my data frame), but where the sums are within each condition of my simulation. The problem is, I don't have a simple way to combine values from the other columns of this data frame to uniquely identify all those conditions. To be more specific: the column model contains 7 possible values. Three of these possible values are only matched up with one possible value in each of rho_cnd and pct_cnd, while the other four possible values of model are matched up with 6 possible pairings of values in rho_cnd and pct_cnd.

The obvious solution, I know, would be to go back and make a variable that uniquely identifies all the conditions that I would need to identify here, so that the following code would work:

> sums <- ddply(thetas, .(condition1, condition2, etc.), summarise, sumError = sum(estError))

But I just don't want to go back and recreate how this data frame is built. Right now I have two data frames created with two separate calls to expand.grid that are then rbinded and sorted to create a data frame listing all valid conditions, but even if I kept those few lines of code in I'm not sure how to reference them with ddply. I would rather not even use this solution, but I will if necessary.

> conditions 
   models rhos pcts
1   dN.mN  0.0 0.00
2   dN.mL  0.0 0.00
3   dN.mH  0.0 0.00
4   dL.mN  0.1 0.01
12  dL.mN  0.1 0.02
20  dL.mN  0.1 0.10
8   dL.mN  0.2 0.01
16  dL.mN  0.2 0.02
24  dL.mN  0.2 0.10
5   dL.mL  0.1 0.01
13  dL.mL  0.1 0.02
21  dL.mL  0.1 0.10
9   dL.mL  0.2 0.01
17  dL.mL  0.2 0.02
25  dL.mL  0.2 0.10
6   dH.mN  0.1 0.01
14  dH.mN  0.1 0.02
22  dH.mN  0.1 0.10
10  dH.mN  0.2 0.01
18  dH.mN  0.2 0.02
26  dH.mN  0.2 0.10
7   dH.mH  0.1 0.01
15  dH.mH  0.1 0.02
23  dH.mH  0.1 0.10
11  dH.mH  0.2 0.01
19  dH.mH  0.2 0.02
27  dH.mH  0.2 0.10

Any advice for better code and/or more efficiency? Thanks!

Solution

I agree with the comment that ddply(thetas, .(model, rho_cnd, pct_cnd), ...) should work. If certain combinations of those variables never occur in the data, .drop=TRUE (which is ddply's default) ensures that the unobserved combinations don't show up in the result.

However, if you wanted to avoid ddply looking through some of the nonexistent combinations, you could build a single key column, something like the following:
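To illustrate the point, here is a minimal sketch on a made-up data frame. `toy` and its values are invented for this example; only the column names follow `thetas`. It assumes the plyr package is installed. Two models and three rho values are possible, but only three combinations are actually observed, and only those come back:

```r
library(plyr)

# Hypothetical stand-in for thetas: only 3 combinations of the
# splitting variables actually occur in the data.
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01),
  estError = c(-1, 2, 0.5, 1, 1.5)
)

# Default .drop = TRUE: one output row per observed combination.
sums <- ddply(toy, .(model, rho_cnd, pct_cnd), summarise,
              sumError = sum(estError))
print(sums)
```

Note the `summarise, sumError = sum(estError)` form: ddply expects a function as its third argument, so a bare `sum(estError)` there would not work.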

#newCond <- apply(thetas[,c("model", "rho_cnd", "pct_cnd")], 1, paste, collapse="_")
# sep has to be part of the argument list; do.call() itself takes no sep argument
newCond <- do.call(paste, c(thetas[,c("model", "rho_cnd", "pct_cnd")], sep="_")) #as suggested by baptiste
thetas2 <- cbind(thetas, newCond)

I admit, the above code might run slowly on 2.7 million rows, so I'm not sure it's what you want. But from there you should be able to use ddply() with .variables="newCond".

Furthermore, because you're returning only a single number for each subset of the data, you could just use aggregate, if you wanted.

# by must be a list (a one-column data frame works), and sum is the summary function
sums <- aggregate(thetas2["estError"], by=thetas2["newCond"], FUN=sum)
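As a rough sketch of the aggregate route in base R only, the example below builds a pasted key on a made-up data frame and sums within it. `toy` and its values are hypothetical. It also shows the formula interface, which groups on the three columns directly and so may make the pasted key unnecessary:

```r
# Hypothetical stand-in for thetas; values are invented for illustration.
toy <- data.frame(
  model    = factor(c("dN.mN", "dN.mN", "dL.mN", "dL.mN", "dL.mN")),
  rho_cnd  = c(0, 0, 0.1, 0.2, 0.2),
  pct_cnd  = c(0, 0, 0.01, 0.01, 0.01),
  estError = c(-1, 2, 0.5, 1, 1.5)
)

# Option 1: one pasted key per row, then aggregate by that key.
toy$newCond <- do.call(paste,
                       c(toy[, c("model", "rho_cnd", "pct_cnd")], sep = "_"))
sumsByKey <- aggregate(toy["estError"], by = toy["newCond"], FUN = sum)

# Option 2: group on the original columns via the formula interface.
sumsByCols <- aggregate(estError ~ model + rho_cnd + pct_cnd,
                        data = toy, FUN = sum)

print(sumsByKey)
print(sumsByCols)
```

In both cases only the combinations present in the data produce a row, so no list of valid conditions has to be supplied.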

I hope this helps.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow