Question

I'm working on a problem with too many features, and training my models takes way too long. I implemented a forward selection algorithm to choose features.

However, I was wondering: does scikit-learn have a forward selection/stepwise regression algorithm?

Solution

No, sklearn doesn't seem to have a forward selection algorithm. However, it does provide recursive feature elimination, which is a greedy feature elimination algorithm similar to sequential backward selection. See the documentation here:

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
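
For reference, a minimal usage sketch of RFE on a regression problem might look like this (the dataset, estimator, and the choice of 5 features are placeholders, not anything prescribed above):

```
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# recursively refit the model and drop the weakest feature until 5 remain
selector = RFE(estimator=LinearRegression(), n_features_to_select=5)
selector.fit(X, y)

print(selector.support_)   # boolean mask of the selected features
print(selector.ranking_)   # rank 1 = selected; higher ranks were eliminated earlier
```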

OTHER TIPS

Sklearn DOES have a forward selection algorithm, although it isn't called that in scikit-learn. The feature selection method called f_regression in scikit-learn will sequentially include features that improve the model the most, until there are K features in the model (K is an input).

It starts by regressing the labels on each feature individually and observing which feature improves the model the most, as measured by the F-statistic. It then incorporates the winning feature into the model and iterates through the remaining features to find the next one that improves the model the most, again using the F-statistic (F-test). It repeats this until there are K features in the model.

Notice that remaining features which are correlated with features already incorporated into the model will probably not be selected, since they do not correlate with the residuals (even though they might correlate well with the labels). This helps guard against multicollinearity.
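
As a minimal sketch, keeping the K best features by F-statistic can be done with SelectKBest and the f_regression scorer (note that f_regression scores each feature against the target on its own, so check the documentation if you need exactly the sequential, residual-based behaviour described above; the dataset and K=5 here are placeholders):

```
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import SelectKBest, f_regression

X, y = load_diabetes(return_X_y=True)

# keep the K=5 features with the highest F-statistic against the target
selector = SelectKBest(score_func=f_regression, k=5)
X_new = selector.fit_transform(X, y)

print(selector.scores_)        # F-statistic of each feature
print(selector.get_support())  # boolean mask of the 5 kept features
```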

Scikit-learn indeed does not support stepwise regression. That's because what is commonly known as 'stepwise regression' is an algorithm based on p-values of linear-regression coefficients, and scikit-learn deliberately avoids the inferential approach to model learning (significance testing and the like). Moreover, pure OLS is only one of numerous regression algorithms, and from the scikit-learn point of view it is neither very important nor one of the best.

There are, however, some pieces of advice for those who still need a good way for feature selection with linear models:

  1. Use inherently sparse models like ElasticNet or Lasso.
  2. Normalize your features with StandardScaler, and then order your features just by model.coef_. For perfectly independent covariates this is equivalent to sorting by p-values. The class sklearn.feature_selection.RFE will do it for you, and RFECV will even evaluate the optimal number of features. (A short sketch of options 1 and 2 appears after the code example below.)
  3. Use an implementation of forward selection by adjusted $R^2$ that works with statsmodels.
  4. Do brute-force forward or backward selection to maximize your favorite metric on cross-validation (it could take approximately quadratic time in the number of covariates). The scikit-learn-compatible mlxtend package supports this approach for any estimator and any metric.
  5. If you still want vanilla stepwise regression, it is easier to base it on statsmodels, since this package calculates p-values for you. A basic forward-backward selection could look like this:

```
# note: load_boston was removed in scikit-learn 1.2; run this with an older
# version or substitute another regression dataset
from sklearn.datasets import load_boston
import pandas as pd
import numpy as np
import statsmodels.api as sm

data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


def stepwise_selection(X, y, 
                       initial_list=[], 
                       threshold_in=0.01, 
                       threshold_out=0.05,
                       verbose=True):
    """ Perform a forward-backward feature selection 
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features 
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list)
    while True:
        changed=False
        # forward step
        excluded = list(set(X.columns)-set(included))
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included+[new_column]]))).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()
        if best_pval < threshold_in:
            best_feature = new_pval.idxmin()  # label of the lowest p-value
            included.append(best_feature)
            changed=True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step
        model = sm.OLS(y, sm.add_constant(pd.DataFrame(X[included]))).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty
        if worst_pval > threshold_out:
            changed=True
            worst_feature = pvalues.idxmax()  # label of the highest p-value
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included

result = stepwise_selection(X, y)

print('resulting features:')
print(result)
```

This example would print the following output:

```
Add  LSTAT                          with p-value 5.0811e-88
Add  RM                             with p-value 3.47226e-27
Add  PTRATIO                        with p-value 1.64466e-14
Add  DIS                            with p-value 1.66847e-05
Add  NOX                            with p-value 5.48815e-08
Add  CHAS                           with p-value 0.000265473
Add  B                              with p-value 0.000771946
Add  ZN                             with p-value 0.00465162
resulting features:
['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']
```
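
For completeness, a rough sketch of options 1 and 2 from the list above (the alpha value, dataset, and estimator are arbitrary placeholders):

```
import numpy as np
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFECV
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.preprocessing import StandardScaler

X, y = load_diabetes(return_X_y=True)
X_scaled = StandardScaler().fit_transform(X)

# option 1: an inherently sparse model -- features with zero coefficients are dropped
lasso = Lasso(alpha=0.5).fit(X_scaled, y)
print("kept by Lasso:", np.flatnonzero(lasso.coef_))

# option 2: order standardized features by |coef_|, or let RFECV pick how many to keep
order = np.argsort(-np.abs(LinearRegression().fit(X_scaled, y).coef_))
print("features ordered by |coef_|:", order)

rfecv = RFECV(estimator=LinearRegression(), cv=5, scoring="r2")
rfecv.fit(X_scaled, y)
print("RFECV keeps", rfecv.n_features_, "features")
```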

In fact, there is a nice algorithm called "Forward_Select" that uses statsmodels and lets you set your own metric (AIC, BIC, adjusted R-squared, or whatever you like) to progressively add variables to the model. The algorithm can be found in the comments section of this page; scroll down and you'll see it near the bottom:

https://planspace.org/20150423-forward_selection_with_statsmodels/

I would add that this algorithm has another nice feature: you can apply it to either classification or regression problems; you just have to tell it which.

Try it and see for yourself.
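
If you would rather not copy the code from that page, a bare-bones forward selection by adjusted $R^2$ with statsmodels, in the same spirit as "Forward_Select" but not the exact routine from those comments, could look roughly like this (the dataset is just a placeholder):

```
import pandas as pd
import statsmodels.formula.api as smf
from sklearn.datasets import load_diabetes

data = load_diabetes()
df = pd.DataFrame(data.data, columns=data.feature_names)
df["target"] = data.target

def forward_by_adj_r2(df, response):
    """Greedily add the feature that most improves adjusted R^2."""
    remaining = [c for c in df.columns if c != response]
    selected, best_score = [], 0.0
    while remaining:
        scores = []
        for candidate in remaining:
            formula = "{} ~ {}".format(response, " + ".join(selected + [candidate]))
            scores.append((smf.ols(formula, df).fit().rsquared_adj, candidate))
        score, candidate = max(scores)
        if score <= best_score:
            break  # no remaining feature improves adjusted R^2
        remaining.remove(candidate)
        selected.append(candidate)
        best_score = score
    return selected

print(forward_by_adj_r2(df, "target"))
```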

Actually, sklearn doesn't have a forward selection algorithm, though a pull request with an implementation of forward feature selection has been waiting in the scikit-learn repository since April 2017.

As an alternative, there is forward and one-step-ahead backward selection in mlxtend. You can find its documentation under Sequential Feature Selector.
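
A minimal sketch with mlxtend's SequentialFeatureSelector (the estimator, dataset, and k_features=5 are arbitrary placeholders):

```
from mlxtend.feature_selection import SequentialFeatureSelector as SFS
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# forward=True gives forward selection; forward=False gives backward elimination
sfs = SFS(LinearRegression(),
          k_features=5,
          forward=True,
          floating=False,
          scoring="r2",
          cv=5)
sfs = sfs.fit(X, y)

print(sfs.k_feature_idx_)  # indices of the selected features
print(sfs.k_score_)        # cross-validated score of that feature subset
```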

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange