Question

I have some data that I'm fitting an sklearn DecisionTreeClassifier to. Because the classifier uses a bit of randomness, I run it several times and save the best model. However, I want to be able to re-train the model and get the same results on a different machine.

Is there a way to find out, after training, what random_state each classifier actually used?

EDIT: The sklearn models have a method called get_params() that shows what the constructor inputs were, but for random_state it still says None. According to the documentation, when that's the case the classifier uses numpy's global random state. I'm trying to figure out what that random number was.


Solution

You have to pass an explicit random_state to the DecisionTreeClassifier constructor:

>>> DecisionTreeClassifier(random_state=42).get_params()['random_state']
42

Leaving it at the default value of None means that the fit method will use numpy.random's singleton random state, which is not predictable and not the same across runs.
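A minimal sketch of the reproducibility this buys you, using a toy dataset (load_iris is just a stand-in for the asker's data):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

# Two classifiers fitted with the same explicit seed build identical trees
clf_a = DecisionTreeClassifier(random_state=42).fit(X, y)
clf_b = DecisionTreeClassifier(random_state=42).fit(X, y)

# get_params() now reports the seed instead of None
print(clf_a.get_params()['random_state'])  # 42

# Matching feature importances confirm the two fits are the same
print((clf_a.feature_importances_ == clf_b.feature_importances_).all())
```

Run the same script on another machine with the same sklearn version and you should get the same tree.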

Other tips

I would suggest that you are potentially better off using a Random Forest for this purpose: a Random Forest contains a number of trees, each modelled on a random subset of your predictors. You can then see the random_state each tree used simply by inspecting the fitted forest's estimators_ attribute.

I'll use my code as an example here:

import csv
import numpy as np

# Raw string avoids the invalid \U escape in the Windows path
with open(r'C:\Users\Saskia Hill\Desktop\Exported\FinalSpreadsheet.csv', newline='') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    row = next(titanic_reader)      # header row
    feature_names = np.array(row)

    # Load dataset, and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[11])   # the target values are your class labels

    titanic_X = np.array(titanic_X)
    titanic_y = np.array(titanic_y)
    print(titanic_X, titanic_y)

print(feature_names, titanic_X[0], titanic_y[0])
titanic_X = titanic_X[:, [2, 3, 4, 5, 6, 7, 8, 9, 10]]  # these are your predictors/features
feature_names = feature_names[[2, 3, 4, 5, 6, 7, 8, 9, 10]]

from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier(criterion='entropy', min_samples_leaf=1,  max_features='auto', max_leaf_nodes=None, verbose=0)

rfclf = rfclf.fit(titanic_X,titanic_y)

rfclf.estimators_    # the output for this is pasted below:

[DecisionTreeClassifier(compute_importances=None, criterion='entropy',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_density=None, min_samples_leaf=1, min_samples_split=2,
        random_state=1490702865, splitter='best'),
DecisionTreeClassifier(compute_importances=None, criterion='entropy',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_density=None, min_samples_leaf=1, min_samples_split=2,
        random_state=174216030, splitter='best') ......

Random Forests thus introduce the randomness into the decision trees for you and require no changes to the data you were already feeding a single Decision Tree; by averaging over many trees they also give you more confidence in the accuracy of your results (particularly if, like me, you have a small dataset).
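If you go the forest route, the per-tree seeds can be read back programmatically rather than copied out of the printed repr. A sketch on a toy dataset (load_iris and the variable names are illustrative, not the answerer's setup); note that fixing random_state on the forest also fixes every per-tree seed:

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

# Seeding the forest makes the seeds drawn for its trees reproducible
rf = RandomForestClassifier(n_estimators=5, random_state=0).fit(X, y)

# Each fitted sub-estimator carries the integer seed it was given
seeds = [est.random_state for est in rf.estimators_]
print(seeds)  # the same five integers on every re-run with random_state=0
```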

License: CC-BY-SA with attribution