Python and GridSearchCV how to eliminate input contains NaN error when using cross validation and decision tree classifier?
-
09-12-2020 - |
문제
I am trying to do cross validation on Decision tree classifier for kaggle's titanic dataset. The first step after cleaning data is to split into train and test sets:
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(train, Y, test_size=0.2, random_state=0)
Then transform numbers into scaled values:
from sklearn.preprocessing import StandardScaler
sc = StandardScaler()
x_train = sc.fit_transform(x_train)
x_test = sc.transform(x_test)
In addition:
from sklearn.tree import DecisionTreeClassifier
classifier = DecisionTreeClassifier()
For the grid search, I used GridSearchCV:
#Make a grid search
from sklearn.model_selection import GridSearchCV
tree_param = [{'criterion': ['entropy', 'gini'], 'max_depth': [2,3, 4]}]
And at the end to fit the GridSearchCV classifier into data:
clf = GridSearchCV(classifier, tree_param, cv=4)
clf.fit(X=x_train, y=y_train)
The error I am getting is as follows:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
I checked my x_train
and y_train
sets, and they both contains numeric values within a specific range:
해결책 2
My problem was that the datasets are as Data Frames.
Once I read them using:
X = X.iloc[:, [0,8]].values
It was transformed into int32
list and the error gone.
다른 팁
You should impute missing values, try with:
from sklearn.impute import SimpleImputer
imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(x_train)
x_train = imp.transform(x_train)
x_test = imp.transform(x_test)
Notice that I am fiting just in the train data, so you are not leaking information to the test.