To get grouped results in R

https://stackoverflow.com/questions/12863547

07-07-2021
|

Question

Here I have like below data set. In SAS system, to get grouped results in whatever I do such as means, GLM or REG, I can cord as:

proc sort; by B;
proc glm;
class A;
model C=A; by B;
run;

Then, I can get GLM results within or by B level. But, I DO NOT know how to use like'by' to grouping in R system. You might want to suggest me to use >subset(), however, this cord would be really difficult when you have, for example, 10 levels of B. You might want to learn me but only anova analysis but also regression and average. Here is anyone to help me?

raw data

A   B   C
a   a   0.47
a   b   0.88
a   c   2.32
a   d   3.26
a   a   0.93
a   b   1.86
a   c   3.22
a   d   0.92
a   a   0.45
a   b   0.92
a   c   2.31
a   d   3.24
b   a   0.91
b   b   1.84
b   c   3.27
b   d   0.86
b   a   0.47
b   b   0.90
b   c   2.33
b   d   3.19
b   a   0.92
b   b   1.84
b   c   3.25
b   d   0.93
c   a   0.45
c   b   0.92
c   c   2.33
c   d   3.08
c   a   0.93
c   b   1.86
c   c   3.25
c   d   0.93
c   a   0.47
c   b   0.90
c   c   2.26
c   d   3.09

Solution

Based on @Brian Diggs' answer but using R base functions. I also used Brian's data set

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) 
do.call(rbind, lapply(models, function(y) y$coef))
  (Intercept)         Ab            Ac
a   0.6166667  0.1500000  1.802585e-17
b   1.2200000  0.3066667  6.666667e-03
c   2.6166667  0.3333333 -3.333333e-03
d   2.4733333 -0.8133333 -1.066667e-01

Edit 1: Results with P.values (alternative 1)

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) #the same as above 
Coef <- lapply(models, function(y) y$coef) # the same as above
Pval <-  lapply(models, function(z) summary(z)$coefficients[, 'Pr(>|t|)'])
Result <- cbind(do.call(rbind, Coef), do.call(rbind, Pval))
colnames(Result)[4:6] <- paste('P.val', colnames(Result)[4:6])
Result

 (Intercept)         Ab            Ac P.val (Intercept)  P.val Ab  P.val Ac
a   0.6166667  0.1500000  1.802585e-17       0.007088207 0.5167724 1.0000000
b   1.2200000  0.3066667  6.666667e-03       0.008446002 0.5191737 0.9886090
c   2.6166667  0.3333333 -3.333333e-03       0.000151772 0.4762936 0.9941859
d   2.4733333 -0.8133333 -1.066667e-01       0.016803317 0.4744404 0.9235623

Edit 2: Results with P.values (alternative 2)

models <- lapply(split(dat, dat$B), function(x) glm(C~A, data=x)) #the same as above
do.call(rbind, lapply(models, function(z) summary(z)$coefficients))
                 Estimate Std. Error       t value    Pr(>|t|)
(Intercept)  6.166667e-01  0.1540202  4.003804e+00 0.007088207
Ab           1.500000e-01  0.2178175  6.886500e-01 0.516772358
Ac           1.802585e-17  0.2178175  8.275668e-17 1.000000000
(Intercept)  1.220000e+00  0.3167661  3.851423e+00 0.008446002
Ab           3.066667e-01  0.4479749  6.845622e-01 0.519173660
Ac           6.666667e-03  0.4479749  1.488179e-02 0.988608995
(Intercept)  2.616667e+00  0.3103164  8.432253e+00 0.000151772
Ab           3.333333e-01  0.4388537  7.595545e-01 0.476293582
Ac          -3.333333e-03  0.4388537 -7.595545e-03 0.994185937
(Intercept)  2.473333e+00  0.7538543  3.280917e+00 0.016803317
Ab          -8.133333e-01  1.0661110 -7.628974e-01 0.474440418
Ac          -1.066667e-01  1.0661110 -1.000521e-01 0.923562285

OTHER TIPS

Making your dataset easier to import to R:

dat <-
structure(list(A = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 
1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 
3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L, 3L), .Label = c("a", 
"b", "c"), class = "factor"), B = structure(c(1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 
1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L, 1L, 2L, 3L, 4L
), .Label = c("a", "b", "c", "d"), class = "factor"), C = c(0.47, 
0.88, 2.32, 3.26, 0.93, 1.86, 3.22, 0.92, 0.45, 0.92, 2.31, 3.24, 
0.91, 1.84, 3.27, 0.86, 0.47, 0.9, 2.33, 3.19, 0.92, 1.84, 3.25, 
0.93, 0.45, 0.92, 2.33, 3.08, 0.93, 1.86, 3.25, 0.93, 0.47, 0.9, 
2.26, 3.09)), .Names = c("A", "B", "C"), class = "data.frame", row.names = c(NA, 
-36L))

Much of SAS by group processing maps to split-apply-combine methods (divide the data into parts, do something with each part, put those parts back together in some way). In this case, the results of models are objects (lists), and the natural way to "combine" multiple models is to put them in a list.

library("plyr")
models <- dlply(dat, .(B), function(DF) glm(C~A, data=DF))

models is now a list, each element of which is the result of a glm on a subset of dim.

> models
$a

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
  6.167e-01    1.500e-01    6.799e-17  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      0.472 
Residual Deviance: 0.427        AIC: 6.107 

$b

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
   1.220000     0.306667     0.006667  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      1.99 
Residual Deviance: 1.806        AIC: 19.09 

$c

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
   2.616667     0.333333    -0.003333  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      1.958 
Residual Deviance: 1.733        AIC: 18.72 

$d

Call:  glm(formula = C ~ A, data = DF)

Coefficients:
(Intercept)           Ab           Ac  
     2.4733      -0.8133      -0.1067  

Degrees of Freedom: 8 Total (i.e. Null);  6 Residual
Null Deviance:      11.4 
Residual Deviance: 10.23        AIC: 34.69 

attr(,"split_type")
[1] "data.frame"
attr(,"split_labels")
  B
1 a
2 b
3 c
4 d

Extraction of information from all models at once follows the same paradigm:

> ldply(models, coefficients)
  B (Intercept)         Ab            Ac
1 a   0.6166667  0.1500000  6.798700e-17
2 b   1.2200000  0.3066667  6.666667e-03
3 c   2.6166667  0.3333333 -3.333333e-03
4 d   2.4733333 -0.8133333 -1.066667e-01

Maybe the aggregate function is what you're looking for

aggregate(data$C, by=list(data$A), FUN=sum)

This will group your data by the first column, and collapsing column C into the sum for each group in the first column

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow