
I would like to automate a simple multiple regression for the subsets defined by the unique combinations of the grouping variables. I have a dataframe with several grouping variables df1[,1:6] and some independent variables df1[,8:10] and a response df1[,7].

This is an excerpt from the data.

structure(list(Surface = structure(c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("NiAu", "Sn"), class = "factor"), Supplier = structure(c(1L, 1L, 1L, 2L, 2L, 1L, 1L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor"), ParticleSize = structure(c(1L, 1L, 1L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("3", "5"), class = "factor"), T1 = structure(c(2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 1L, 1L), .Label = c("130", "144"), class = "factor"), T2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "200", class = "factor"), O2 = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L, 1L), .Label = "1300", class = "factor"), Shear = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89, 62.94, 76.34, 70.56, 70.4, 54.15), Gap = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417), Clearance = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33, 500.32, 503.28, 507.44, 500.5, 498.39), Void = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116)), .Names = c("Surface", "Supplier", "ParticleSize","T1", "T2", "O2", "Shear", "Gap", "Clearance", "Void"), class = "data.frame", row.names = c(NA, -11L))

Using unique(df1[,1:6]) returns 5 factor combinations of the grouping variables, so there should be 5 subsets to which I apply the lm() function. My call (with its result assigned to a variable) looks like this: with(df1, by(df1, df1[,1:6], function(x) lm(Shear~Gap+Clearance+Void, data=x)))

Problem 1: it returns a list with 16 entries. Apparently, it calculates all possible factor combinations of the six grouping variables. (In the excerpt, V5 and V6 have only one level each but V1 to V4 have two levels each, resulting in 2^4 = 16.) But it should only use the factor combinations that actually exist in the data. So I suppose by() is not the correct function to achieve that. Any suggestions?
Problem 2: I find it easier to refer to column indices rather than variable names. So I was initially trying to use my lm() function in the way lm(df1[,7]~df1[,8]+df1[,9]). That did not work out, because I always access the entire df1 dataframe instead of the subsets. So probably I should pass the row indices for the factor combinations to the lm() function rather than a complete dataframe.

I think the solutions to problems 1 and 2 are somehow related and can be found by using another subset function. It would be nice if someone could explain where my mistake is. If possible, I would like to stick to the standard packages, simply because I want to improve my understanding of R. Thanks

EDIT: a minor mistake in the variable assignment


You could use the plyr package:

library(plyr)
list_reg <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2), function(df) lm(Shear~Gap+Clearance+Void, data=df))
#We have indeed five different results
length(list_reg)
#That's how you check out one particular regression, in this case the first
summary(list_reg[[1]])

The function dlply takes a data.frame (that's what the d... stands for), in your case df1, and returns a list (that's what the .l... stands for), in your case consisting of five elements, each containing the results of one regression.

Internally, your df1 is split up into five sub-data.frames according to the columns specified by .(Surface, Supplier, ParticleSize, T1, T2), and the function lm(Shear~Gap+Clearance+Void,data=df) is applied to each of these sub-data.frames.

To get a better feeling of what dlply really does, just call

list_sub_df <- dlply(df1, .(Surface, Supplier, ParticleSize, T1, T2))

and you can look at each sub-data.frame to which lm will be applied.
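
Since the question asks for standard packages, the same split-then-fit steps can also be sketched in base R. This is only a sketch, not what dlply does internally: it assumes the excerpt from the question is available as df1 (rebuilt compactly here) and relies on split() with drop = TRUE to keep only the factor combinations that occur in the data. The names subsets and fits are made up for illustration.

```r
# Rebuild the excerpt from the question as df1 (same values, written compactly).
df1 <- data.frame(
  Surface      = rep(c("NiAu", "Sn"), times = c(5, 6)),
  Supplier     = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  ParticleSize = rep(c("3", "5"), times = c(3, 8)),
  T1           = rep(c("144", "130"), times = c(9, 2)),
  T2           = "200",
  O2           = "1300",
  Shear        = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89,
                   62.94, 76.34, 70.56, 70.4, 54.15),
  Gap          = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance    = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33,
                   500.32, 503.28, 507.44, 500.5, 498.39),
  Void         = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116),
  stringsAsFactors = TRUE
)

# Split on the six grouping columns; drop = TRUE discards the combinations
# that never occur, which is what by() did not do.
subsets <- split(df1, df1[, 1:6], drop = TRUE)

# Fit one regression per subset; data = d restricts lm() to that subset.
fits <- lapply(subsets, function(d) lm(Shear ~ Gap + Clearance + Void, data = d))

# Number of fits should equal the number of existing combinations.
length(fits)
nrow(unique(df1[, 1:6]))
```

With only two or three rows per group in the excerpt, some coefficients come back as NA; the full data set should not have that limitation.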

And just a general note at the end: The paper by the package author Hadley Wickham is really great: even if you won't end up using his package, it is still really good to get a feeling about the split-apply-combine approach.


I just did a quick search and as expected, this was already explained better before, so also make sure to read this SO post.


If you want to use the column numbers directly, try this (taken from this SO post):

 list_reg <- dlply(df1, names(df1[, 1:5]), function(df) lm(Shear~Gap+Clearance+Void, data=df))
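
On problem 2: a formula such as lm(df1[,7]~df1[,8]+df1[,9]) names df1 itself, so every call sees the full data frame regardless of which subset is passed in. One possible base R workaround, sketched here under the assumption that the excerpt is available as df1 (rebuilt below), is to build the formula from the column indices once with reformulate() and let data = point at each subset. The names fml and fits_idx are made up for illustration.

```r
# Rebuild the excerpt from the question as df1 (same values, written compactly).
df1 <- data.frame(
  Surface      = rep(c("NiAu", "Sn"), times = c(5, 6)),
  Supplier     = c("A", "A", "A", "B", "B", "A", "A", "B", "B", "B", "B"),
  ParticleSize = rep(c("3", "5"), times = c(3, 8)),
  T1           = rep(c("144", "130"), times = c(9, 2)),
  T2           = "200",
  O2           = "1300",
  Shear        = c(56.83, 67.73, 78.51, 62.61, 66.78, 60.89,
                   62.94, 76.34, 70.56, 70.4, 54.15),
  Gap          = c(373, 450, 417, 450, 406, 439, 439, 417, 439, 441, 417),
  Clearance    = c(500.13, 509.85, 495.97, 499.55, 502.66, 505.33,
                   500.32, 503.28, 507.44, 500.5, 498.39),
  Void         = c(316, 343, 89, 247, 271, 326, 304, 282, 437, 243, 116),
  stringsAsFactors = TRUE
)

# Turn column indices into a formula: response = column 7, predictors = 8:10.
fml <- reformulate(names(df1)[8:10], response = names(df1)[7])
fml

# The formula only carries column names, so data = d decides which rows are used.
fits_idx <- lapply(split(df1, df1[, 1:6], drop = TRUE),
                   function(d) lm(fml, data = d))
length(fits_idx)
```
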
Licensed under: CC-BY-SA with attribution