Discretisation Using Decision Trees
10-12-2020
Question
I'm new to machine learning and working on a supervised classification problem. I used discretisation to transform continuous variables into discrete ones, following this article to implement it. But when I repeat the same process with the same values, it generates different boundary values. Can anyone explain why?
X_train, X_test, y_train, y_test = train_test_split(
    train[['tripid', 'Hour', 'is_FairCorrect']],
    train.is_FairCorrect, test_size=0.3)
tree_model = DecisionTreeClassifier(max_depth=2)
tree_model.fit(X_train.Hour.to_frame(), X_train.is_FairCorrect)
X_train['Age_tree'] = tree_model.predict_proba(X_train.Hour.to_frame())[:, 1]
pd.concat([X_train.groupby(['Age_tree'])['Hour'].min(),
           X_train.groupby(['Age_tree'])['Hour'].max()], axis=1)
Solution
But when I repeat the same process with the same values, it generates different boundary values. Can anyone explain why?
This is because you're not setting random_state in train_test_split, which means the training data is shuffled differently on each run, so the tree learns different splits.
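To make this concrete, here is a minimal sketch of the fix applied to the code from the question. The column names (tripid, Hour, is_FairCorrect) are taken from the question, but the original data isn't available, so a toy DataFrame stands in for it; the only substantive change is passing random_state to both train_test_split and DecisionTreeClassifier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the original `train` DataFrame (column names assumed
# from the question; the real data is not available here).
rng = np.random.RandomState(0)
train = pd.DataFrame({
    'tripid': np.arange(200),
    'Hour': rng.randint(0, 24, 200),
    'is_FairCorrect': rng.randint(0, 2, 200),
})

def fit_bins(seed):
    # Fixing random_state makes the split (and therefore the learned
    # boundaries) identical across runs.
    X_train, X_test, y_train, y_test = train_test_split(
        train[['tripid', 'Hour', 'is_FairCorrect']],
        train.is_FairCorrect, test_size=0.3, random_state=seed)
    tree_model = DecisionTreeClassifier(max_depth=2, random_state=seed)
    tree_model.fit(X_train.Hour.to_frame(), y_train)
    return tree_model.predict_proba(X_train.Hour.to_frame())[:, 1]

# Two runs with the same seed produce the same discretisation.
assert (fit_bins(seed=0) == fit_bins(seed=0)).all()
```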
With a quick check using one of sklearn's datasets, you can verify that this is the issue:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=4)
tree_model = DecisionTreeClassifier(max_depth=2, random_state=2)
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict_proba(X_train)[:,1]
X_train_df = pd.DataFrame(X_train, columns = ['sepal_len', 'sepal_wid',
'petal_len', 'petal_wid'])
X_train_df['Age_tree'] = tree_model.predict_proba(X_train)[:,1]
X_train_df.Age_tree.unique()
This will produce the same boundaries on every run, in this case array([0. , 0.90697674, 0.03030303]). If you don't set the random seed, you'll get different probabilities and boundaries on each run.
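As a side note, the groupby min/max trick recovers the bin boundaries indirectly; the tree also stores its split thresholds directly, which is often cleaner. A sketch below fits a depth-2 tree on a single iris feature (petal length) so it acts as a 1-D discretiser, then reads the cut points from tree_.threshold (internal nodes are those whose children_left is not -1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Use a single feature (petal length) so the tree bins one variable,
# as in the Hour example from the question.
X_feat = X[:, 2:3]
X_train, X_test, y_train, y_test = train_test_split(
    X_feat, y, test_size=0.3, random_state=4)

tree_model = DecisionTreeClassifier(max_depth=2, random_state=2)
tree_model.fit(X_train, y_train)

# Leaves have children_left == -1; internal nodes carry the thresholds,
# which are exactly the cut points the discretiser uses.
tree = tree_model.tree_
internal = tree.children_left != -1
boundaries = np.sort(tree.threshold[internal])
print(boundaries)
```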