ValueError: Input contains NaN, infinity or a value too large for dtype('float32')
16-10-2019
Question
I got a ValueError when predicting test data using a RandomForest model.
My code:
clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2)
clf.fit(X_fit, y_fit)
df_test.fillna(df_test.mean())
X_test = df_test.values
y_pred = clf.predict(X_test)
The error:
ValueError: Input contains NaN, infinity or a value too large for dtype('float32').
How do I find the bad values in the test dataset? Also, I do not want to drop these records; can I just replace them with the mean or median?
Thanks.
Solution
With np.isnan(X) you get a boolean mask back with True for positions containing NaNs.
With np.where(np.isnan(X)) you get back a tuple with i, j coordinates of NaNs.
Finally, with np.nan_to_num(X) you "replace nan with zero and inf with finite numbers".
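As a minimal sketch of those three calls, here is a small made-up array standing in for X (the values are hypothetical, not taken from the question's data). It also uses np.isfinite, which is not mentioned above but flags NaN and both infinities in one pass:

```python
import numpy as np

# Hypothetical toy array standing in for the test matrix X.
X = np.array([[1.0, np.nan, 3.0],
              [4.0, 5.0, np.inf]])

nan_mask = np.isnan(X)            # True where the value is NaN
rows, cols = np.where(nan_mask)   # row and column indices of the NaNs

bad_mask = ~np.isfinite(X)        # True for NaN, +inf and -inf alike
bad_rows, bad_cols = np.where(bad_mask)

cleaned = np.nan_to_num(X)        # NaN -> 0.0, inf -> a very large finite number
```

Pairing rows[k] with cols[k] gives the position of each offending value, so you can inspect the records before deciding how to fill them.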
Alternatively, you can use:
- sklearn.impute.SimpleImputer for mean / median imputation of missing values, or
- pandas' pd.DataFrame(X).fillna(), if you need something other than filling it with zeros.
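A rough sketch of the SimpleImputer route, assuming a reasonably recent scikit-learn. The two arrays are invented for illustration, and fitting on the training data is a design choice rather than something the answer prescribes:

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy data; np.nan marks the missing entries.
X_train = np.array([[1.0, 10.0],
                    [3.0, np.nan],
                    [5.0, 30.0]])
X_test = np.array([[np.nan, 20.0],
                   [2.0, np.nan]])

# Learn the per-column medians from the training data only.
imputer = SimpleImputer(strategy="median")
imputer.fit(X_train)

# Fill the gaps in the test data with those training medians.
X_test_filled = imputer.transform(X_test)
```

As far as I know, SimpleImputer only treats NaN as missing, so infinite values would need to be turned into NaN first.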
Other tips
Assuming X_test is a pandas dataframe, you can use DataFrame.fillna to replace the NaN values with the mean:
X_test.fillna(X_test.mean())
For anybody happening across this, to actually modify the original:
X_test.fillna(X_train.mean(), inplace=True)
To overwrite the original:
X_test = X_test.fillna(X_train.mean())
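To see why the reassignment matters, here is a small sketch with made-up frames (X_train and X_test are hypothetical stand-ins). A bare fillna call returns a new DataFrame and leaves the original untouched, which is likely what went wrong in the question's code:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frames.
X_train = pd.DataFrame({"a": [1.0, 3.0], "b": [10.0, 30.0]})
X_test = pd.DataFrame({"a": [np.nan, 2.0], "b": [5.0, np.nan]})

# The result is discarded here, so X_test still contains its NaNs.
X_test.fillna(X_train.mean())
still_missing = int(X_test.isna().sum().sum())

# Reassigning keeps the filled copy.
X_test = X_test.fillna(X_train.mean())
now_missing = int(X_test.isna().sum().sum())
```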
To check if you're in a copy vs a view:
X_test._is_view
Don't forget
col_mask=df.isnull().any(axis=0)
which returns a boolean mask marking the columns that contain at least one np.nan.
row_mask=df.isnull().any(axis=1)
which returns a boolean mask marking the rows where np.nan appeared. Then, by simple indexing, you can pull out every row and column that contains np.nan:
df.loc[row_mask,col_mask]
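Put together on a small invented frame (the column names and values are hypothetical), the two masks narrow the view down to just the rows and columns that contain a missing value:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: "a" and "c" each contain one NaN, "b" is clean.
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, 6.0],
                   "c": [np.nan, 8.0, 9.0]})

col_mask = df.isnull().any(axis=0)   # one flag per column
row_mask = df.isnull().any(axis=1)   # one flag per row

# Only the rows and columns that hold at least one NaN.
suspects = df.loc[row_mask, col_mask]
```

Note that suspects can still include valid numbers: a cell is kept whenever its row and its column each contain a NaN somewhere, not only when the cell itself is missing.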
I faced a similar problem and saw that numpy handles NaN and Inf differently.
If your data has Inf, try this:
np.where(x.values >= np.finfo(np.float64).max)
where x is my pandas DataFrame.
This gives a tuple of the locations where Inf values are present. Note that this comparison only catches positive infinity; np.isinf(x.values) also catches negative infinity.
If your data has NaN, try this:
np.isnan(x.values).any()
Do not forget to check for inf values as well. The only thing that worked for me:
df[df==np.inf]=np.nan
df.fillna(df.mean(), inplace=True)
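The snippet above only catches positive infinity. A variant that also handles negative infinity might look like the sketch below; it uses DataFrame.replace instead of boolean assignment, and the frame is made up for illustration:

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one +inf and one -inf.
df = pd.DataFrame({"a": [1.0, np.inf, 3.0],
                   "b": [-np.inf, 2.0, 4.0]})

# Turn both infinities into NaN so they are excluded from the column means.
df = df.replace([np.inf, -np.inf], np.nan)

# Fill every NaN with the mean of the remaining values in its column.
df = df.fillna(df.mean())
```

Doing the replacement first matters: if an infinity were still in a column, that column's mean would itself be infinite or NaN.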
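And even better if you are using sklearn (Imputer was removed in newer scikit-learn releases; SimpleImputer from sklearn.impute is its replacement):
import pandas as pd
from sklearn.impute import SimpleImputer

def replace_missing_value(df, number_features):
    # Fill NaNs in the numeric columns with each column's median.
    imputer = SimpleImputer(strategy="median")
    df_num = df[number_features]
    imputer.fit(df_num)
    X = imputer.transform(df_num)
    # Keep the original index so the result lines up with df.
    res_def = pd.DataFrame(X, columns=df_num.columns, index=df_num.index)
    return res_def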
Here number_features is a list of the numeric column labels, for example:
number_features = ['median_income', 'gdp']
Here is the code for how to "Replace NaN with zero and infinity with large finite numbers." using numpy.nan_to_num.
df[:] = np.nan_to_num(df)
Also see fernando's answer.