Question

I get a ValueError when predicting on test data with a RandomForest model.

My code:

clf = RandomForestClassifier(n_estimators=10, max_depth=6, n_jobs=1, verbose=2)
clf.fit(X_fit, y_fit)

df_test.fillna(df_test.mean())
X_test = df_test.values  
y_pred = clf.predict(X_test)

The error:

ValueError: Input contains NaN, infinity or a value too large for dtype('float32').

How do I find the bad values in the test dataset? Also, I do not want to drop these records. Can I just replace them with the mean or median?

Thanks.


Solution

With np.isnan(X) you get a boolean mask back with True for positions containing NaNs.

With np.where(np.isnan(X)) you get back a tuple with i, j coordinates of NaNs.

Finally, with np.nan_to_num(X) you "replace nan with zero and inf with finite numbers".

Alternatively, you can use:

  • sklearn.impute.SimpleImputer for mean / median imputation of missing values, or
  • pandas' pd.DataFrame(X).fillna(), if you need something other than filling it with zeros.
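As a rough sketch of the options above, the following uses a small made-up array in place of X_test (the values are hypothetical). It locates the NaNs with np.isnan and np.where, then fills them with column means via SimpleImputer. In a real pipeline you would probably fit the imputer on the training data and only call transform on the test data.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical toy matrix standing in for X_test
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [5.0, np.nan]])

mask = np.isnan(X)            # boolean mask, True where a value is NaN
rows, cols = np.where(mask)   # row and column indices of the NaNs

# Replace each NaN with the mean of its column
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Here rows and cols pair up, so (rows[k], cols[k]) is the position of the k-th NaN.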

Other tips

Assuming X_test is a pandas dataframe, you can use DataFrame.fillna to replace the NaN values with the mean:

X_test.fillna(X_test.mean())

For anybody happening across this, to actually modify the original:

X_test.fillna(X_train.mean(), inplace=True)

To overwrite the original:

X_test = X_test.fillna(X_train.mean())
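A minimal sketch of the reassignment approach, assuming X_train and X_test are DataFrames with the same columns (the toy values here are made up). The means come from the training set, so the test set is filled with statistics the model has already seen.

```python
import numpy as np
import pandas as pd

# Hypothetical training and test frames with matching columns
X_train = pd.DataFrame({"a": [1.0, 2.0, 3.0], "b": [10.0, np.nan, 30.0]})
X_test = pd.DataFrame({"a": [np.nan, 4.0], "b": [5.0, np.nan]})

# Column means of the training data (NaNs are skipped by default)
train_means = X_train.mean()

# fillna returns a new frame, so assign the result back
X_test = X_test.fillna(train_means)
```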

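To check whether you're working with a view rather than a copy (note that _is_view is a private pandas attribute and may change between versions):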

X_test._is_view

Don't forget

col_mask = df.isnull().any(axis=0)

which returns a boolean mask flagging the columns that contain at least one np.nan, and

row_mask = df.isnull().any(axis=1)

which flags the rows where np.nan appears. Then, by simple indexing, you can pull out every row and column that contains np.nan:

df.loc[row_mask,col_mask]
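To make the two masks concrete, here is a small sketch on a made-up DataFrame (the column names and values are hypothetical). It keeps only the rows and columns that contain at least one NaN.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: NaNs in columns "a" and "c", rows 0 and 1
df = pd.DataFrame({"a": [1.0, np.nan, 3.0],
                   "b": [4.0, 5.0, 6.0],
                   "c": [np.nan, 8.0, 9.0]})

col_mask = df.isnull().any(axis=0)  # True for columns with any NaN
row_mask = df.isnull().any(axis=1)  # True for rows with any NaN

# Sub-frame restricted to the affected rows and columns
bad = df.loc[row_mask, col_mask]
```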

I faced a similar problem and saw that numpy handles NaN and Inf differently.
If your data has Inf, try this:

np.where(x.values >= np.finfo(np.float64).max)

where x is my pandas DataFrame.

This gives a tuple of index arrays locating the positive-infinity values (np.isinf(x.values) also catches negative infinity).

If your data has NaN, try this:

np.isnan(x.values).any()

Do not forget to check for inf values as well. The only thing that worked for me:

df[df==np.inf]=np.nan
df.fillna(df.mean(), inplace=True)
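The snippet above only matches positive infinity. A possible variant that also covers negative infinity, sketched on a made-up DataFrame, uses np.isinf to locate the values and DataFrame.replace to turn both signs into NaN before filling with the column means.

```python
import numpy as np
import pandas as pd

# Hypothetical frame containing both +inf and -inf
df = pd.DataFrame({"a": [1.0, np.inf, 3.0], "b": [-np.inf, 2.0, 4.0]})

# Row and column positions of every infinite value
inf_rows, inf_cols = np.where(np.isinf(df.values))

# Turn both infinities into NaN, then fill with the column means
df = df.replace([np.inf, -np.inf], np.nan)
df = df.fillna(df.mean())
```

Replacing before computing the mean matters: a column that still contains an infinity would have an infinite (or NaN) mean.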

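And even better if you are using sklearn (SimpleImputer replaces the older Imputer class, which was deprecated in scikit-learn 0.20 and later removed):

import pandas as pd
from sklearn.impute import SimpleImputer

def replace_missing_value(df, number_features):
    # Impute NaNs in the numeric columns with each column's median
    imputer = SimpleImputer(strategy="median")
    df_num = df[number_features]
    X = imputer.fit_transform(df_num)
    # Keep the original index so the rows still line up with df
    return pd.DataFrame(X, columns=df_num.columns, index=df_num.index)

Here number_features is a list of the numeric column labels, for example: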

number_features = ['median_income', 'gdp']

Here is the code for how to "Replace NaN with zero and infinity with large finite numbers." using numpy.nan_to_num.

df[:] = np.nan_to_num(df)
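For reference, a small sketch of what np.nan_to_num does with its default arguments on a made-up array: NaN becomes 0.0, and the infinities become the largest and smallest finite float64 values. Those extreme values can still distort a model, so mean or median imputation may be the safer choice.

```python
import numpy as np

# Hypothetical array with one of each problem value
x = np.array([np.nan, np.inf, -np.inf, 1.5])

# Defaults: NaN -> 0.0, +inf -> float64 max, -inf -> float64 min
y = np.nan_to_num(x)
```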

Also see fernando's answer.

Licensed under: CC-BY-SA with attribution
Not affiliated with datascience.stackexchange