Question

I'm importing some data from a CSV file. The file has missing values flagged with the text 'NA'. I import the data with:

X = genfromtxt(data, delimiter=',', dtype=float, skip_header=1)
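
For reference, genfromtxt can also be told explicitly how the 'NA' markers should be handled; the call below is just an illustrative variant of the line above, assuming numpy is imported as np:

import numpy as np

# Treat the literal string 'NA' as a missing value and store it as NaN
X = np.genfromtxt(data, delimiter=',', dtype=float, skip_header=1,
                  missing_values='NA', filling_values=np.nan)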

I then use this code to replace the NaNs with a previously calculated column mean:

inds = np.where(np.isnan(X))
X[inds] = np.take(col_mean, inds[1])
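
Here col_mean was computed earlier; a minimal sketch of that step, assuming column-wise means that ignore the NaN entries, would be:

import numpy as np

# Per-column mean computed while ignoring NaN entries
col_mean = np.nanmean(X, axis=0)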

I then run a couple of checks, both of which return empty arrays:

np.where(np.isnan(X))
np.where(np.isinf(X))

Finally I run a scikit classifier:

RF = ensemble.RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2)
RF.fit(X, y)

and get the following error:

  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\ensemble\forest.py", line 257, in fit
    check_ccontiguous=True)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any idea why it is telling me that there are NaN or infinity values? I read this post and tried running:

RF.fit(X.astype(float), y.astype(float))

but I get the same error.


Solution

scikit-learn's decision trees cast their input to float32 for efficiency, but some of your values won't fit in that type, so they overflow to infinity:

>>> np.float32(8.9932064170227995e+41)
inf
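
A quick way to confirm this is the cause (a sketch, assuming X is the float64 array from the question) is to look for entries whose magnitude exceeds the float32 range:

import numpy as np

# Anything larger in magnitude than float32's maximum will become inf
# once scikit-learn downcasts the array to float32.
float32_max = np.finfo(np.float32).max
print(np.where(np.abs(X) > float32_max))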

The solution is to standardize the data with sklearn.preprocessing.StandardScaler prior to fitting the model. Don't forget to apply the same transform prior to predicting. You can use a sklearn.pipeline.Pipeline to combine standardization and classification in a single object:

rf = Pipeline([("scale", StandardScaler()),
               ("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))])

Or, with the current dev version/next release:

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))
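
For completeness, a sketch of how the pipeline might be used end to end; the imports and the X_test array are assumptions for illustration, not part of the original answer:

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))

# fit() scales X and then trains the forest on the scaled data;
# predict() applies the same scaling to new data automatically.
rf.fit(X, y)
predictions = rf.predict(X_test)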

(I admit the error message could be improved.)

Other tips

I came across this problem as well, but in my case the cause was the opposite: there actually were some NaN values left in the array.

Here is how to fix it.

from sklearn.preprocessing import Imputer

# Imputer's default strategy replaces each NaN with the mean of its column
X = Imputer().fit_transform(X)
RF.fit(X, y)

Reference here: sklearn.preprocessing.Imputer
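
Note that in newer scikit-learn releases Imputer was replaced; a roughly equivalent sketch, assuming scikit-learn >= 0.20, would be:

from sklearn.impute import SimpleImputer

# The default strategy="mean" replaces each NaN with its column mean,
# matching what Imputer() does above.
X = SimpleImputer().fit_transform(X)
RF.fit(X, y)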
