Question

One way to train a logistic regression model is with stochastic gradient descent, which scikit-learn offers an interface to.

What I would like to do is take scikit-learn's SGDClassifier and have it score the same as a Logistic Regression here. However, I must be missing some machine learning enhancement, since my scores are not equivalent.

This is my current code. What am I missing on the SGDClassifier that would have it produce the same results as a Logistic Regression?

from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.linear_model import SGDClassifier
import numpy as np
import pandas as pd
# Note: sklearn.cross_validation was renamed to sklearn.model_selection
# in sklearn >= 0.18, with a slightly different KFold API.
from sklearn.cross_validation import KFold
from sklearn.metrics import accuracy_score

# Note that the iris dataset is available in sklearn by default.
# This data is also conveniently preprocessed.
iris = datasets.load_iris()
X = iris["data"]
Y = iris["target"]

numFolds = 10
kf = KFold(len(X), numFolds, shuffle=True)

# These are class objects. For each class, find the accuracy through
# 10-fold cross validation.
Models = [LogisticRegression, SGDClassifier]
params = [{}, {"loss": "log", "penalty": "l2"}]
for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:

        train_X = X[train_indices, :]; train_Y = Y[train_indices]
        test_X = X[test_indices, :]; test_Y = Y[test_indices]

        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)
    accuracy = total / numFolds
    print("Accuracy score of {0}: {1}".format(Model.__name__, accuracy))

My output:

Accuracy score of LogisticRegression: 0.946666666667
Accuracy score of SGDClassifier: 0.76

Solution

The comments about the iteration number are spot on. The default SGDClassifier n_iter is 5, meaning you do 5 * num_rows steps in weight space. The sklearn rule of thumb is ~1 million steps for typical data. For your example, just set it to 1000 and it might reach tolerance first. Your accuracy is lower with SGDClassifier because it is hitting the iteration limit before tolerance, so you are effectively "early stopping".
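
To make that rule of thumb concrete, here is the arithmetic for this data set (a back-of-the-envelope sketch; 135 is the approximate training-fold size for a 10-fold split of the 150-row iris set):

import numpy as np

n_samples = 135  # ~90% of the 150 iris rows land in each training fold
# sklearn's heuristic: choose n_iter so that n_iter * n_samples ~ 10**6
n_iter = int(np.ceil(10**6 / n_samples))
print(n_iter)  # 7408 -- even 1000 is a big step up from the default of 5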

Modifying your code quick and dirty, I get:

# Added n_iter here
params = [{}, {"loss": "log", "penalty": "l2", "n_iter": 1000}]

for param, Model in zip(params, Models):
    total = 0
    for train_indices, test_indices in kf:
        train_X = X[train_indices, :]; train_Y = Y[train_indices]
        test_X = X[test_indices, :]; test_Y = Y[test_indices]
        reg = Model(**param)
        reg.fit(train_X, train_Y)
        predictions = reg.predict(test_X)
        total += accuracy_score(test_Y, predictions)

    accuracy = total / numFolds
    print("Accuracy score of {0}: {1}".format(Model.__name__, accuracy))

Accuracy score of LogisticRegression: 0.96
Accuracy score of SGDClassifier: 0.96

Other tips

SGDClassifier, as the name suggests, uses stochastic gradient descent as its optimization algorithm.

If you look at the implementation of LogisticRegression in sklearn, there are five optimization techniques (solvers) provided, and by default it is 'liblinear', which uses coordinate descent (CD) to converge.

Other than the number of iterations, the optimizer, the type of regularization (penalty), and its magnitude (C) also affect the performance of the algorithm.
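
To make that concrete, here is a minimal sketch that sets the solver, penalty, and C explicitly instead of relying on the defaults (the specific values are illustrative, not tuned):

from sklearn import datasets
from sklearn.linear_model import LogisticRegression

iris = datasets.load_iris()
X, Y = iris["data"], iris["target"]

# Pick the solver, penalty type, and regularization strength explicitly.
# Note that C is the *inverse* regularization strength: smaller C = stronger penalty.
clf = LogisticRegression(solver="liblinear", penalty="l2", C=1.0)
clf.fit(X, Y)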

If you are running it on the iris data set, tuning all these hyper-parameters may not bring a significant change, but for a complex data set they do play a meaningful role.

For more, you can refer to the sklearn Logistic Regression documentation.

You should also do a grid search for the "alpha" hyperparameter for the SGDClassifier. It is explicitly mentioned in the sklearn documentation and, from my experience, has a big impact on accuracy. The second hyperparameter you should look at is "n_iter"; however, I saw a smaller effect with my data.
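
A minimal sketch of such a search using scikit-learn's own GridSearchCV (the alpha grid is illustrative; on sklearn >= 0.19, swap n_iter for max_iter):

from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV

iris = datasets.load_iris()
X, Y = iris["data"], iris["target"]

param_grid = {
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1.0],  # illustrative range
    "n_iter": [5, 100, 1000],                # called "max_iter" in sklearn >= 0.19
}
search = GridSearchCV(SGDClassifier(loss="log", penalty="l2"), param_grid, cv=10)
search.fit(X, Y)
print(search.best_params_, search.best_score_)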

TL;DR: You could specify a grid of alpha and n_iter (or max_iter) and use parfit for hyper-parameter optimization on SGDClassifier.

My colleague, Vinay Patlolla, wrote an excellent blog post on How to make SGD Classifier perform as well as Logistic Regression using parfit.

Parfit is a hyper-parameter optimization package that he utilized to find the appropriate combination of parameters which served to optimize SGDClassifier to perform as well as Logistic Regression on his example data set in much less time.

In summary, the two key parameters for SGDClassifier are alpha and n_iter. To quote Vinay directly:

n_iter in sklearn is None by default. We are setting it here to a sufficiently large amount (1000). An alternative parameter to n_iter, which has been recently added, is max_iter. The same advice should apply for max_iter.

The alpha hyper-parameter serves a dual purpose. It is both a regularisation parameter and the initial learning rate under the default schedule. This means that, in addition to regularising the Logistic Regression coefficients, the output of the model is dependent on an interaction between alpha and the number of epochs (n_iter) that the fitting routine performs. Specifically, as alpha becomes very small, n_iter must be increased to compensate for the slow learning rate. This is why it is safer (but slower) to specify n_iter sufficiently large, e.g. 1000, when searching over a wide range of alphas.
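
For reference, a sketch of what that parfit search might look like. The bestFit import path and call signature below follow parfit's README and should be treated as assumptions, and the binary iris subset is only there to make roc_auc_score directly applicable:

from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import ParameterGrid, train_test_split
from parfit import bestFit  # assumed import path per parfit's README

# Binary subset of iris (classes 0 and 1) so roc_auc_score applies directly.
iris = datasets.load_iris()
mask = iris["target"] < 2
X, Y = iris["data"][mask], iris["target"][mask]
X_train, X_val, y_train, y_val = train_test_split(X, Y, test_size=0.3)

grid = {
    "alpha": [1e-4, 1e-3, 1e-2, 1e-1, 1e0],  # illustrative range
    "n_iter": [1000],
    "loss": ["log"],
    "penalty": ["l2"],
}
# bestFit fits one model per grid point and scores each on the validation set.
best_model, best_score, all_models, all_scores = bestFit(
    SGDClassifier, ParameterGrid(grid),
    X_train, y_train, X_val, y_val, metric=roc_auc_score)
print(best_model, best_score)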

Licensed under: CC-BY-SA with attribution