Question

I'm importing some data from a CSV file. Missing values in the file are flagged with the text 'NA'. I import the data with:

X = genfromtxt(data, delimiter=',', dtype=float, skip_header=1)

I then use this code to replace each NaN with a previously calculated column mean:

inds = np.where(np.isnan(X))
X[inds]=np.take(col_mean,inds[1])

I then run a couple of checks and get empty arrays:

np.where(np.isnan(X))
np.where(np.isinf(X))

Finally I run a scikit classifier:

RF = ensemble.RandomForestClassifier(n_estimators=100,n_jobs=-1,verbose=2)
RF.fit(X, y)

and get the following error:

  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\ensemble\forest.py", line 257, in fit
    check_ccontiguous=True)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 233, in check_arrays
    _assert_all_finite(array)
  File "C:\Users\m&g\Anaconda\lib\site-packages\sklearn\utils\validation.py", line 27, in _assert_all_finite
    raise ValueError("Array contains NaN or infinity.")
ValueError: Array contains NaN or infinity.

Any ideas why it is telling me that there are NaN or infinity? I read this post and tried to run:

RF.fit(X.astype(float), y.astype(float))

but I get the same error.


Solution

scikit-learn's decision trees cast their input to float32 for efficiency, but your values won't fit in that type:

>>> np.float32(8.9932064170227995e+41)
inf
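This would also explain why the np.isnan and np.isinf checks in the question come back empty: they run on the float64 array, where every value is still finite. A minimal sketch with made-up data (the array below is hypothetical, not the asker's file) shows the overflow appearing only after the cast:

```python
import numpy as np

# Hypothetical data: the second column holds values beyond the float32 range.
X = np.array([[1.0, 8.9932064170227995e+41],
              [2.0, 1.5e+40]])

# In float64 the checks from the question find nothing wrong.
assert not np.isnan(X).any()
assert not np.isinf(X).any()

# Casting to float32 overflows the large values to inf.
# (Newer NumPy versions warn on this cast, so the warning is silenced here.)
with np.errstate(over="ignore"):
    X32 = X.astype(np.float32)
overflowed = np.isinf(X32)

# The same cells can be found up front by comparing against the float32 maximum.
float32_max = np.finfo(np.float32).max
too_large = np.abs(X) > float32_max
```

Comparing against np.finfo(np.float32).max before fitting is one way to locate the offending columns.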

The solution is to standardize prior to fitting a model with sklearn.preprocessing.StandardScaler. Don't forget to transform prior to predicting. You can use a sklearn.pipeline.Pipeline to combine standardization and classification in a single object:

rf = Pipeline([("scale", StandardScaler()),
               ("rf", RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))])

Or, with the current dev version/next release:

rf = make_pipeline(StandardScaler(),
                   RandomForestClassifier(n_estimators=100, n_jobs=-1, verbose=2))

(I admit the error message could be improved.)
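For completeness, here is a self-contained sketch of the pipeline approach, including the imports. The data is synthetic and only stands in for the real X and y; the small n_estimators and the random_state are choices made to keep the example quick and repeatable, not recommendations.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Synthetic stand-in for the real data: the second column is scaled far
# beyond what float32 can represent.
X = rng.rand(200, 2)
X[:, 1] *= 1e41
y = (X[:, 0] > 0.5).astype(int)

# The scaler runs in float64 and brings every column to roughly unit scale,
# so the values fit once the forest casts its input to float32.
model = make_pipeline(
    StandardScaler(),
    RandomForestClassifier(n_estimators=10, random_state=0),
)
model.fit(X, y)

# predict() applies the fitted scaler first, so the transform step
# cannot be forgotten at prediction time.
pred = model.predict(X[:5])
```

Because the scaler lives inside the pipeline, the same fitted object can be passed to cross-validation or grid search without leaking scaling statistics from the held-out folds.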

Other tips

I came across this error as well, but in my case the cause was different: the array really did contain some NaN values.

Here is how to fix it.

from sklearn.preprocessing import Imputer
X = Imputer().fit_transform(X)
RF.fit(X, y)

Reference here: sklearn.preprocessing.Imputer
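Note that sklearn.preprocessing.Imputer was removed in scikit-learn 0.22. On newer versions, sklearn.impute.SimpleImputer fills the same role. A minimal sketch with a small made-up array (not the asker's data):

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical array with one missing value in each column.
X = np.array([[1.0, 2.0],
              [np.nan, 4.0],
              [3.0, np.nan]])

# strategy="mean" replaces each NaN with the mean of its column,
# computed from the non-missing entries.
imputer = SimpleImputer(strategy="mean")
X_filled = imputer.fit_transform(X)
```

Here the column means are 2.0 and 3.0, so those are the values written into the gaps. The result can then be passed to RF.fit(X_filled, y) as before.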

Licensed under: CC-BY-SA with attribution