Question

I understand this question may sound strange, but how do I pick the final random_state for my classifier?

Below is some example code. It uses SGDClassifier from scikit-learn on the iris dataset, with GridSearchCV to find the best random_state:

from sklearn.linear_model import SGDClassifier
from sklearn import datasets
from sklearn.model_selection import train_test_split, GridSearchCV

iris = datasets.load_iris()
X, y = iris.data, iris.target
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)


# Treat random_state itself as the "hyper-parameter" to search over
parameters = {'random_state':[1, 42, 999, 123456]}

# 5-fold cross-validated grid search over the candidate seeds
sgd = SGDClassifier(max_iter=20, shuffle=True)
clf = GridSearchCV(sgd, parameters, cv=5)

clf.fit(X_train, y_train)

print("Best parameter found:")
print(clf.best_params_)
print("\nScore per grid set:")
means = clf.cv_results_['mean_test_score']
stds = clf.cv_results_['std_test_score']
for mean, std, params in zip(means, stds, clf.cv_results_['params']):
    print("%0.3f (+/-%0.03f) for %r" % (mean, std * 2, params))

The results are the following:

Best parameter found:
{'random_state': 999}

Score per grid set:
0.732 (+/-0.165) for {'random_state': 1}
0.777 (+/-0.212) for {'random_state': 42}
0.786 (+/-0.277) for {'random_state': 999}
0.759 (+/-0.210) for {'random_state': 123456}

In this case, the difference between the best and second-best score is 0.009. Of course, the random_state of the train/test split also makes a difference.
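
For reference, here is a quick sketch of that second point (the 20 split seeds and the fixed classifier seed of 42 are arbitrary choices of mine, not part of the code above): keep the classifier's random_state fixed, vary only the random_state of train_test_split, and look at the spread of test scores.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Keep the classifier seed fixed; vary only the split seed
split_scores = []
for split_seed in range(20):
    X_tr, X_te, y_tr, y_te = train_test_split(
        X, y, test_size=0.25, random_state=split_seed)
    model = SGDClassifier(max_iter=20, shuffle=True, random_state=42)
    model.fit(X_tr, y_tr)
    split_scores.append(model.score(X_te, y_te))

print("Test accuracy over 20 splits: %0.3f +/- %0.3f"
      % (np.mean(split_scores), np.std(split_scores)))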

This is just an example, where one could argue that it doesn't matter which one I pick: the random_state should not affect how the algorithm works. However, nothing prevents a scenario where the difference between the best and the second best is 0.1, 0.2, or even 0.99, i.e. a scenario where the random_state makes a big impact.
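
One way to gauge how big that impact really is would be to look at the distribution of cross-validated scores over many seeds rather than four hand-picked ones. A minimal sketch, assuming the same data and SGDClassifier settings as above (the 50 seeds are an arbitrary choice):

import numpy as np
from sklearn import datasets
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score

iris = datasets.load_iris()
X, y = iris.data, iris.target

# Mean 5-fold CV score for each of 50 candidate seeds
seed_means = np.array([
    cross_val_score(
        SGDClassifier(max_iter=20, shuffle=True, random_state=seed),
        X, y, cv=5).mean()
    for seed in range(50)
])

print("Mean across seeds: %0.3f" % seed_means.mean())
print("Std across seeds:  %0.3f" % seed_means.std())
print("Best - worst seed: %0.3f" % (seed_means.max() - seed_means.min()))

Comparing the spread across seeds with the per-fold +/- values that GridSearchCV already reports above gives a rough sense of whether the difference between two seeds is within ordinary cross-validation noise.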

  • In the case where the random_state makes a big impact, is it fair to optimize it as a hyper-parameter?
  • When is the impact too small to care about?
