How is regsubsets in R's leaps package optimized under the hood for exhaustive model selection searches? [closed]

StackOverflow https://stackoverflow.com/questions/23638049

  •  21-07-2023
Question

I'm running some model selection procedures on 43 survey responses, each with answers to about 50 variables. I have narrowed the candidate predictors down to the 22 most useful and discarded the rest.

I wanted to perform model selection using the exhaustive algorithm of regsubsets from the leaps library in R. I set nvmax = 22 (the number of predictors in my set), and regsubsets blew me away with its speed: just a few seconds to cover 2^22 ≈ 4 million candidate models. It cannot possibly be enumerating every combination of 22 choose k for each k from 1 to 22 and running a regression on each one, can it?
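A minimal sketch of that setup, assuming a hypothetical data frame survey.df with a response column y and the 22 predictor columns:

```r
# Hypothetical data frame: response y plus 22 predictors.
library(leaps)

fit <- regsubsets(y ~ ., data = survey.df,
                  nvmax  = 22,            # allow models up to all 22 predictors
                  method = "exhaustive")  # the default search method

summary(fit)$which  # which predictors the best model of each size includes
```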

Is regsubsets optimized in some way so that the "exhaustive" algorithm can skip the vast majority of regressions that it knows a priori will have a worse R^2 than the best available?

For comparison, running lm(y ~ ., data = some.df) in a loop took 25 seconds for only 10,000 regressions, a far cry from the 3 or 4 seconds it took regsubsets to cover 4 million models, so it clearly has some optimization in the code. How is that optimization implemented?
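The comparison can be reproduced with something along these lines (again using the hypothetical survey.df; the loop simply fits random subsets one at a time to get 10,000 lm calls):

```r
# Rough timing comparison with hypothetical data; subsets are picked at
# random just to have 10,000 distinct model formulas to fit with lm().
predictors <- setdiff(names(survey.df), "y")

lm_time <- system.time({
  for (i in 1:10000) {
    vars <- sample(predictors, sample(seq_along(predictors), 1))
    fit  <- lm(reformulate(vars, response = "y"), data = survey.df)
  }
})

leaps_time <- system.time(
  leaps::regsubsets(y ~ ., data = survey.df, nvmax = 22)
)

c(lm = lm_time[["elapsed"]], leaps = leaps_time[["elapsed"]])
```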


Solution

The underlying Fortran code (by Allan Miller) uses a branch-and-bound algorithm based on the original ideas of Furnival & Wilson (Technometrics, 1974). This rules out large chunks of the model space using the principle that removing variables from a model can never decrease the residual sum of squares: once the RSS bound for a branch is already worse than the best model found so far, every subset in that branch can be discarded without ever being fitted.
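A toy illustration of that bound, using made-up data (this is just the monotonicity property, not the leaps code itself):

```r
# Dropping variables from a model can never decrease its RSS, so the RSS
# of a larger model is a lower bound for every model nested inside it.
set.seed(1)
df <- data.frame(y = rnorm(43), x1 = rnorm(43), x2 = rnorm(43),
                 x3 = rnorm(43), x4 = rnorm(43))

rss <- function(f) sum(resid(lm(f, data = df))^2)

rss(y ~ x1 + x2 + x3 + x4)  # full model: smallest RSS of this family
rss(y ~ x1 + x2)            # nested subset: RSS can only be >= the above
rss(y ~ x1)                 # smaller still: RSS grows (or stays equal) again
```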

The implementation is also efficient: it computes only the residual sum of squares for each model (never the full coefficient and standard-error output that lm produces), and it computes the QR decomposition of the predictor matrix only once.
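A sketch of why a single QR decomposition is enough, on synthetic data (this illustrates the linear-algebra shortcut, not the actual Fortran routine):

```r
# After computing Q'y once, the RSS of the model built from the first k
# columns of X is just a tail sum of squares, so no model is refit.
set.seed(1)
n <- 43
X <- cbind(1, matrix(rnorm(n * 4), n, 4))  # intercept plus 4 predictors
y <- rnorm(n)

qx  <- qr(X)          # one decomposition of the full predictor matrix
qty <- qr.qty(qx, y)  # Q'y, also computed once

# RSS of the model using the first k columns of X (no column pivoting here)
rss_from_qr <- function(k) sum(qty[(k + 1):n]^2)

# agrees with an explicit least-squares fit of those columns
all.equal(rss_from_qr(3), sum(resid(lm(y ~ X[, 2:3]))^2))
```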

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow