Python and GridSearchCV how to eliminate input contains NaN error when using cross validation and decision tree classifier?

https://datascience.stackexchange.com/questions/68291

09-12-2020
|

Pergunta

I am trying to do cross validation on Decision tree classifier for kaggle's titanic dataset. The first step after cleaning data is to split into train and test sets:

from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train, Y, test_size=0.2, random_state=0)

Then transform numbers into scaled values:

from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)

In addition:

from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()

For the grid search, I used GridSearchCV:

#Make a grid search
from sklearn.model_selection import GridSearchCV
tree_param = [{'criterion': ['entropy', 'gini'], 'max_depth': [2,3, 4]}]

And at the end to fit the GridSearchCV classifier into data:

clf = GridSearchCV(classifier, tree_param, cv=4)
clf.fit(X=x_train, y=y_train)

The error I am getting is as follows:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

I checked my x_train and y_train sets, and they both contains numeric values within a specific range:

Solução 2

My problem was that the datasets are as Data Frames.

Once I read them using:

X = X.iloc[:, [0,8]].values

It was transformed into int32 list and the error gone.

Outras dicas

You should impute missing values, try with:

from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(x_train)
x_train = imp.transform(x_train)
x_test = imp.transform(x_test)

Notice that I am fiting just in the train data, so you are not leaking information to the test.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a datascience.stackexchange