Does scikit-learn have a forward selection/stepwise regression algorithm?

https://datascience.stackexchange.com/questions/937

16-10-2019

Question

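I'm working on a problem with too many features, and training my models takes far too long. I implemented a forward selection algorithm to choose features.

However, I was wondering: does scikit-learn have a forward selection/stepwise regression algorithm?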

Solution

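No, sklearn doesn't seem to have a forward selection algorithm. However, it does provide recursive feature elimination (RFE), a greedy feature elimination algorithm similar to sequential backward selection. See the documentation here: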

http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.RFE.html
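
As a rough illustration of the recursive feature elimination mentioned above, here is a minimal sketch. The synthetic data set, the choice of `LinearRegression` as the estimator, and the target of three features are all assumptions made for the example, not part of the original answer.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this sketch): 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Fit the model, drop the feature with the weakest coefficient, and repeat
# until only 3 features remain.
selector = RFE(estimator=LinearRegression(), n_features_to_select=3, step=1)
selector.fit(X, y)

selected = [i for i, keep in enumerate(selector.support_) if keep]
print("selected feature indices:", selected)
print("ranking (1 = kept):", selector.ranking_)
```

Because RFE works backward from the full feature set, every iteration refits the model on all remaining features, which can be slow when there are many of them.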

Other tips

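Sklearn DOES have a forward selection algorithm, although it isn't called that in scikit-learn. The feature selection method called `f_regression` in scikit-learn will sequentially include the features that improve the model the most, until there are K features in the model (K is an input).

It starts by regressing the labels on each feature individually, and then observes which feature improves the model the most, using the F-statistic. It then incorporates the winning feature into the model. Next it iterates through the remaining features to find the one that improves the model the most, again using the F-statistic (F-test). It repeats this until there are K features in the model.

Notice that the remaining features that are correlated with features already in the model will probably not be selected, since they do not correlate with the residuals (although they might correlate well with the labels). This helps guard against multicollinearity.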
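
For readers who want to try this, `f_regression` is normally used as the scoring function of `SelectKBest`. The sketch below assumes synthetic data and K = 3. One caveat worth checking against the scikit-learn documentation: it describes `f_regression` as a univariate test that scores each feature on its own, so the behaviour may differ from the sequential, residual-based procedure described in this answer.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SelectKBest, f_regression

# Synthetic data (an assumption for this sketch): 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Keep the K = 3 features with the largest F-statistic.
selector = SelectKBest(score_func=f_regression, k=3)
X_reduced = selector.fit_transform(X, y)

print("F-statistics:", selector.scores_.round(1))
print("kept columns:", selector.get_support(indices=True))
```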

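Scikit-learn indeed does not support stepwise regression. That is because what is commonly called "stepwise regression" is an algorithm based on the p-values of linear regression coefficients, and scikit-learn deliberately avoids the inferential approach to model learning (significance testing and so on). Moreover, pure OLS is only one of numerous regression algorithms, and from the scikit-learn point of view it is neither especially important nor one of the best.

There are, however, some pieces of advice for those who still need a good way to do feature selection with linear models: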

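1. Use inherently sparse models such as `ElasticNet` or `Lasso`.
2. Normalize your features with `StandardScaler`, and then order them simply by `model.coef_`. For perfectly independent covariates this is equivalent to sorting by p-values. The class `sklearn.feature_selection.RFE` will do this for you, and `RFECV` will even evaluate the optimal number of features.
3. Use an implementation of forward selection by adjusted $R^2$ that works with `statsmodels`.
4. Do brute-force forward or backward selection to maximize your favorite metric under cross-validation (this could take roughly quadratic time in the number of covariates). The scikit-learn-compatible `mlxtend` package supports this approach for any estimator and any metric.
5. If you still want vanilla stepwise regression, it is easier to base it on `statsmodels`, since that package calculates p-values for you. A basic forward-backward selection could look like this: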

```
# Note: load_boston was removed in scikit-learn 1.2, so this example
# needs an older scikit-learn release (or another data set).
from sklearn.datasets import load_boston
import pandas as pd
import statsmodels.api as sm

data = load_boston()
X = pd.DataFrame(data.data, columns=data.feature_names)
y = data.target


def stepwise_selection(X, y,
                       initial_list=None,
                       threshold_in=0.01,
                       threshold_out=0.05,
                       verbose=True):
    """ Perform a forward-backward feature selection
    based on p-value from statsmodels.api.OLS
    Arguments:
        X - pandas.DataFrame with candidate features
        y - list-like with the target
        initial_list - list of features to start with (column names of X)
        threshold_in - include a feature if its p-value < threshold_in
        threshold_out - exclude a feature if its p-value > threshold_out
        verbose - whether to print the sequence of inclusions and exclusions
    Returns: list of selected features
    Always set threshold_in < threshold_out to avoid infinite looping.
    See https://en.wikipedia.org/wiki/Stepwise_regression for the details
    """
    included = list(initial_list) if initial_list is not None else []
    while True:
        changed = False
        # forward step: fit one model per candidate feature not yet included
        excluded = [col for col in X.columns if col not in included]
        new_pval = pd.Series(index=excluded, dtype=float)
        for new_column in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [new_column]])).fit()
            new_pval[new_column] = model.pvalues[new_column]
        best_pval = new_pval.min()  # NaN if no candidates remain
        if best_pval < threshold_in:
            # idxmin returns the column label (argmin returns a position)
            best_feature = new_pval.idxmin()
            included.append(best_feature)
            changed = True
            if verbose:
                print('Add  {:30} with p-value {:.6}'.format(best_feature, best_pval))

        # backward step: refit on the current set and drop the weakest feature
        model = sm.OLS(y, sm.add_constant(X[included])).fit()
        # use all coefs except intercept
        pvalues = model.pvalues.iloc[1:]
        worst_pval = pvalues.max()  # NaN if pvalues is empty
        if worst_pval > threshold_out:
            changed = True
            worst_feature = pvalues.idxmax()
            included.remove(worst_feature)
            if verbose:
                print('Drop {:30} with p-value {:.6}'.format(worst_feature, worst_pval))
        if not changed:
            break
    return included


result = stepwise_selection(X, y)

print('resulting features:')
print(result)
```

This example prints the following output:

```
Add  LSTAT                          with p-value 5.0811e-88
Add  RM                             with p-value 3.47226e-27
Add  PTRATIO                        with p-value 1.64466e-14
Add  DIS                            with p-value 1.66847e-05
Add  NOX                            with p-value 5.48815e-08
Add  CHAS                           with p-value 0.000265473
Add  B                              with p-value 0.000771946
Add  ZN                             with p-value 0.00465162
resulting features:
['LSTAT', 'RM', 'PTRATIO', 'DIS', 'NOX', 'CHAS', 'B', 'ZN']
```
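
Going back to the first two suggestions in the list above (a sparse model, and ranking standardized coefficients), a minimal sketch might look like the following. The synthetic data, the use of `LassoCV` to pick the regularization strength, and the variable names are assumptions made for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (an assumption for this sketch): 10 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=10, n_informative=3,
                       noise=0.1, random_state=0)

# Standardize first so that coefficient magnitudes are comparable,
# then let cross-validation choose the Lasso penalty.
pipe = make_pipeline(StandardScaler(), LassoCV(cv=5, random_state=0))
pipe.fit(X, y)

coefs = pipe.named_steps["lassocv"].coef_
order = np.argsort(-np.abs(coefs))            # strongest feature first
nonzero = [int(i) for i in order if coefs[i] != 0]

print("features ranked by |coef|:", order)
print("features with non-zero coefficients:", nonzero)
```

The Lasso penalty pushes some coefficients to exactly zero, so the non-zero set can serve as the selected features, while the ranking by absolute coefficient corresponds to the second suggestion.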

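Actually, there is a nice algorithm called "forward_select" that uses statsmodels and lets you set your own metric (AIC, BIC, adjusted R-squared, or whatever you like) to progressively add variables to the model. The algorithm can be found in the comments section of this page; scroll down and you will see it near the bottom.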

https://planspace.org/20150423-forward_selection_with_statsmodels/

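I would add that the algorithm also has one nice feature: you can apply it to either classification or regression problems! You just have to tell it which.

Try it out and see for yourself.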

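In fact, sklearn doesn't have a forward selection algorithm, though a pull request implementing forward feature selection has been waiting in the scikit-learn repository since April 2017.

As an alternative, mlxtend provides sequential feature selection. You can find its documentation under Sequential Feature Selector.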
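
Since this answer was written, scikit-learn 0.24 added its own `SequentialFeatureSelector`. The sketch below uses that class rather than mlxtend's (a deliberate substitution, so that only scikit-learn is required); mlxtend's selector follows a similar fit/transform pattern. The synthetic data, estimator, and parameter values are assumptions for illustration.

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.linear_model import LinearRegression

# Synthetic data (an assumption for this sketch): 8 features, 3 informative.
X, y = make_regression(n_samples=200, n_features=8, n_informative=3,
                       noise=0.1, random_state=0)

# Forward selection: start from no features and, at each step, add the one
# that gives the best 5-fold cross-validated R^2.
sfs = SequentialFeatureSelector(LinearRegression(), n_features_to_select=3,
                                direction="forward", scoring="r2", cv=5)
sfs.fit(X, y)

kept = sfs.get_support(indices=True)
print("selected feature indices:", kept)
```

Setting `direction="backward"` starts from the full feature set and removes one feature per step instead.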

Licensed under: CC-BY-SA with attribution

Not affiliated with datascience.stackexchange