Discretisation Using Decision Trees
10-12-2020
Question
I'm new to machine learning and working on a supervised classification problem. I used discretisation to transform continuous variables into discrete ones, following this article to implement it. But when I repeat the same process with the same values, it generates different boundary values. Can anyone explain why?
X_train, X_test, y_train, y_test = train_test_split(
    train[['tripid', 'Hour', 'is_FairCorrect']],
    train.is_FairCorrect, test_size=0.3)
tree_model = DecisionTreeClassifier(max_depth=2)
tree_model.fit(X_train.Hour.to_frame(), X_train.is_FairCorrect)
X_train['Age_tree'] = tree_model.predict_proba(X_train.Hour.to_frame())[:, 1]
pd.concat([X_train.groupby(['Age_tree'])['Hour'].min(),
           X_train.groupby(['Age_tree'])['Hour'].max()], axis=1)
Solution
But when I repeat the same process with the same values, it generates different boundary values. Can anyone explain why?
This is because you're not setting random_state in train_test_split, which means the training data is shuffled differently on each run, so the tree learns different splits.
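To make this concrete, here is a minimal sketch of the fix applied to the code from the question. The column names (tripid, Hour, is_FairCorrect) are taken from the question, but the original data isn't available, so a toy DataFrame stands in for it; the only substantive change is passing random_state to both train_test_split and DecisionTreeClassifier.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Toy stand-in for the original `train` DataFrame (column names assumed
# from the question; the real data is not available here).
rng = np.random.RandomState(0)
train = pd.DataFrame({
    'tripid': np.arange(200),
    'Hour': rng.randint(0, 24, 200),
    'is_FairCorrect': rng.randint(0, 2, 200),
})

def fit_bins(seed):
    # Fixing random_state makes the split (and therefore the learned
    # boundaries) identical across runs.
    X_train, X_test, y_train, y_test = train_test_split(
        train[['tripid', 'Hour', 'is_FairCorrect']],
        train.is_FairCorrect, test_size=0.3, random_state=seed)
    tree_model = DecisionTreeClassifier(max_depth=2, random_state=seed)
    tree_model.fit(X_train.Hour.to_frame(), y_train)
    return tree_model.predict_proba(X_train.Hour.to_frame())[:, 1]

# Two runs with the same seed produce the same discretisation.
assert (fit_bins(seed=0) == fit_bins(seed=0)).all()
```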
With a quick check using one of sklearn's datasets, you can verify that this is the issue:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state=4)
tree_model = DecisionTreeClassifier(max_depth=2, random_state=2)
tree_model.fit(X_train, y_train)
y_pred = tree_model.predict_proba(X_train)[:,1]
X_train_df = pd.DataFrame(X_train, columns = ['sepal_len', 'sepal_wid',
'petal_len', 'petal_wid'])
X_train_df['Age_tree'] = tree_model.predict_proba(X_train)[:,1]
X_train_df.Age_tree.unique()
This will produce the same boundaries on every run, in this case array([0. , 0.90697674, 0.03030303]). If you don't set the random seed, you'll get different probabilities and boundaries on each run.
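As a side note, the groupby min/max trick recovers the bin boundaries indirectly; the tree also stores its split thresholds directly, which is often cleaner. A sketch below fits a depth-2 tree on a single iris feature (petal length) so it acts as a 1-D discretiser, then reads the cut points from tree_.threshold (internal nodes are those whose children_left is not -1).

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# Use a single feature (petal length) so the tree bins one variable,
# as in the Hour example from the question.
X_feat = X[:, 2:3]
X_train, X_test, y_train, y_test = train_test_split(
    X_feat, y, test_size=0.3, random_state=4)

tree_model = DecisionTreeClassifier(max_depth=2, random_state=2)
tree_model.fit(X_train, y_train)

# Leaves have children_left == -1; internal nodes carry the thresholds,
# which are exactly the cut points the discretiser uses.
tree = tree_model.tree_
internal = tree.children_left != -1
boundaries = np.sort(tree.threshold[internal])
print(boundaries)
```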