Question

I have some data that I'm fitting an sklearn DecisionTreeClassifier to. Because the classifier uses a bit of randomness, I run it several times and save the best model. However, I want to be able to re-train on the data and get the same results on a different machine.

Is there a way to find out, after I train the model, what the initial random_state was for each classifier?

EDIT: The sklearn models have a method called get_params() that shows what the inputs were, but for random_state it still says None. However, according to the documentation, when that is the case it uses numpy to produce a random number. I'm trying to figure out what that random number was.
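
For illustration, a minimal snippet showing what I mean (a freshly constructed classifier reports None):

>>> from sklearn.tree import DecisionTreeClassifier
>>> DecisionTreeClassifier().get_params()['random_state'] is None
True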

Solution

You have to pass an explicit random state to the d-tree constructor:

>>> DecisionTreeClassifier(random_state=42).get_params()['random_state']
42

Leaving it at the default value of None means that the fit method will use numpy.random's singleton random state, which is not predictable and not the same across runs.
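
For the workflow described in the question (train several times, keep the best), a minimal sketch is to choose the seeds yourself so the winning one is always recorded; the dataset (load_iris) and the seed range below are illustrative assumptions, not from the original post:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

best_seed, best_score = None, -1.0
for seed in range(10):                      # explicit seeds instead of None
    clf = DecisionTreeClassifier(random_state=seed).fit(X_train, y_train)
    score = clf.score(X_test, y_test)
    if score > best_score:
        best_seed, best_score = seed, score

print(best_seed, best_score)
# DecisionTreeClassifier(random_state=best_seed) now rebuilds the same
# tree on any machine, because the winning seed is known.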

Other tips

I would suggest that you are potentially better off using a Random Forest for this purpose: Random Forests contain a number of trees, each modelled on a subset of your predictors. You can then see the random_state of every tree used in the model simply by inspecting RandomForestVariableName.estimators_

I'll use my code as an example here:

import csv
import numpy as np

with open(r'C:\Users\Saskia Hill\Desktop\Exported\FinalSpreadsheet.csv', newline='') as csvfile:
    titanic_reader = csv.reader(csvfile, delimiter=',', quotechar='"')
    row = next(titanic_reader)          # header row
    feature_names = np.array(row)

    # Load dataset, and target classes
    titanic_X, titanic_y = [], []
    for row in titanic_reader:
        titanic_X.append(row)
        titanic_y.append(row[11])       # The target values are your class labels

    titanic_X = np.array(titanic_X)
    titanic_y = np.array(titanic_y)
    print(titanic_X, titanic_y)

print(feature_names, titanic_X[0], titanic_y[0])
titanic_X = titanic_X[:, [2, 3, 4, 5, 6, 7, 8, 9, 10]]  # these are your predictors/features
feature_names = feature_names[[2, 3, 4, 5, 6, 7, 8, 9, 10]]

from sklearn.ensemble import RandomForestClassifier

rfclf = RandomForestClassifier(criterion='entropy', min_samples_leaf=1, max_features='auto', max_leaf_nodes=None, verbose=0)

rfclf = rfclf.fit(titanic_X, titanic_y)

rfclf.estimators_     # the output for this is pasted below:

[DecisionTreeClassifier(compute_importances=None, criterion='entropy',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_density=None, min_samples_leaf=1, min_samples_split=2,
        random_state=1490702865, splitter='best'),
DecisionTreeClassifier(compute_importances=None, criterion='entropy',
        max_depth=None, max_features='auto', max_leaf_nodes=None,
        min_density=None, min_samples_leaf=1, min_samples_split=2,
        random_state=174216030, splitter='best') ......
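
If you just want the seeds rather than the full repr of every tree, you can read the random_state attribute off each fitted sub-estimator (a small sketch using the rfclf fitted above):

# Each element of estimators_ is a fitted DecisionTreeClassifier whose
# random_state was drawn by the forest; collect the integers directly.
seeds = [est.random_state for est in rfclf.estimators_]
print(seeds)    # e.g. [1490702865, 174216030, ...]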

Random Forests thus introduce randomness into the individual decision trees, require no adjustment of the data you initially used for a single Decision Tree, and, by aggregating many trees, give you more confidence in the accuracy of your results (particularly if, like me, you have a small dataset).

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow