Question

I'm trying to process a numpy array with 71,000 rows and 200 columns of floats, and the two scikit-learn models I'm trying both give different errors once I exceed 5,853 rows. I tried removing the problematic row, but it continues to fail. Can scikit-learn not handle this much data, or is it something else? X is a numpy array built from a list of lists.

KNN:

from sklearn.neighbors import NearestNeighbors

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)

Error:

File "knn.py", line 48, in <module>
  nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 642, in fit
  return self._fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/neighbors/base.py", line 180, in _fit
  raise ValueError("data type not understood")

ValueError: data type not understood

K-Means:

from sklearn.cluster import KMeans

kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)

Error:

Traceback (most recent call last):
File "knn.py", line 48, in <module>
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 702, in fit
X = self._check_fit_data(X)
File "/usr/local/lib/python2.7/dist-packages/sklearn/cluster/k_means_.py", line 668, in _check_fit_data
X = atleast2d_or_csr(X, dtype=np.float64)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 134, in atleast2d_or_csr
"tocsr", force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 111, in _atleast2d_or_sparse
force_all_finite=force_all_finite)
File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 91, in array2d
X_2d = np.asarray(np.atleast_2d(X), dtype=dtype, order=order)
File "/usr/local/lib/python2.7/dist-packages/numpy/core/numeric.py", line 235, in asarray
return array(a, dtype, copy=False, order=order)
ValueError: setting an array element with a sequence.

Solution

Check the dtype of your matrix X, e.g. by typing X.dtype. If it is object or dtype('O'), then write the lengths of the rows of X into a list:

lengths = [len(row) for row in X]

Then check whether all rows have the same length by invoking

np.unique(lengths)

If there is more than one number in the output, then your row lengths differ, e.g. from row 5853 onward, though possibly not for every row after that.
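To locate the offending rows, you can compare each row's length against the most common one. A minimal sketch (the variable names here are just illustrative):

from collections import Counter

lengths = [len(row) for row in X]
# the width shared by most rows is presumably the intended one
expected = Counter(lengths).most_common(1)[0][0]
# indices of the rows that deviate from it
bad_rows = [i for i, n in enumerate(lengths) if n != expected]
print(bad_rows)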

NumPy arrays are only useful if all rows have the same length (building an array from ragged rows can still succeed, but it produces an object array of lists, which does not behave the way you expect). You should find out what is causing the ragged rows, correct it, and then return to KNN.

Here is an example of what happens if the row lengths are not the same:

import numpy as np

rng = np.random.RandomState(42)
X = rng.randn(100, 20)

# now remove one element from the 56th row
X = list(X)
X[55] = X[55][:-1]

# turn it back into an ndarray; because the rows are ragged,
# numpy falls back to an object array of arrays
# (recent NumPy versions require dtype=object explicitly here)
X = np.array(X)

# check the dtype
print(X.dtype)  # dtype('O')

from sklearn.neighbors import NearestNeighbors
nbrs = NearestNeighbors()
nbrs.fit(X)  # raises your first error

from sklearn.cluster import KMeans
kmeans = KMeans()
kmeans.fit(X)  # raises your second error
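Once the ragged rows are identified, one way forward is to keep only the rows with the expected width and refit. This is just a sketch, not the only option; whether to drop or repair rows depends on where the bad data comes from:

# keep only the rows with the intended number of columns (20 in this example)
expected = 20
X_clean = np.array([row for row in X if len(row) == expected])

print(X_clean.shape)  # (99, 20)
print(X_clean.dtype)  # float64

nbrs = NearestNeighbors(n_neighbors=2, algorithm='ball_tree').fit(X_clean)  # works
kmeans_model = KMeans(n_clusters=2, random_state=1).fit(X_clean)            # works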