Question

I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions, and fitting fails on this data set with a numpy "array is too big" error. It works fine on a relatively small data set of, say, 200 points. Can anyone suggest a fix?

label_prop_model.fit(np.array(data), labels)
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
graph_matrix = self._build_graph()
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 108, in _build_graph
affinity_matrix = self._get_kernel(self.X_) # get the affinty martix from the data using rbf kernel
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 26, in _get_kernel
return rbf_kernel(X, X, gamma=self.gamma)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 350, in rbf_kernel
K = euclidean_distances(X, Y, squared=True)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 173, in euclidean_distances
distances = safe_sparse_dot(X, Y.T, dense_output=True)
File "/usr/lib/pymodules/python2.7/sklearn/utils/extmath.py", line 79, in safe_sparse_dot
return np.dot(a, b)
ValueError: array is too big.

Solution

How much memory does your computer have?

What sklearn might be doing here (I haven't gone through the source, so I might be wrong) is computing the squared Euclidean distance between every pair of data points by multiplying the 17000xK data matrix with its own transpose (that is what the safe_sparse_dot(X, Y.T) call in your traceback does). That yields the squared Euclidean distances for all pairs, but unfortunately produces an NxN output matrix when you have N data points. As far as I know numpy uses double precision by default, so the result is a 17000x17000x8-byte matrix, approximately 2.15 GiB.
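A quick back-of-the-envelope sketch of that arithmetic (the numbers are just the ones from the question):

import numpy  # only for consistency with the snippet below; plain Python works too

n_points = 17000
bytes_per_double = 8  # numpy defaults to float64
matrix_bytes = n_points * n_points * bytes_per_double
print("Dense affinity matrix: %.2f GiB" % (matrix_bytes / 1024.0 ** 3))  # ~2.15 GiB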

If your memory can't hold a matrix of that size that would cause trouble. Try creating a matrix of this size with numpy:

import numpy
mat = numpy.ones((17000, 17000))  # note the tuple: the shape is a single argument

If that succeeds, I'm mistaken and the problem is something else (though certainly still related to memory size and the matrices sklearn is trying to allocate).

Off the top of my head, one way to work around this might be to propagate labels in parts by subsampling the unlabeled data points (and possibly the labeled points, if you have many of them). If you are able to run the algorithm on 17000/2 data points and you have L labeled points, build each new data set by randomly drawing (17000-L)/2 of the unlabeled points from the original set and combining them with the L labeled points. Run the algorithm on each partition of the full set; a sketch follows below.
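A minimal sketch of that idea, assuming your data lives in an array X of shape (17000, 7) and y holds the labels with -1 marking unlabeled points (the usual sklearn convention); the variable names and the choice of two partitions are just placeholders:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

labeled_idx = np.flatnonzero(y != -1)
unlabeled_idx = np.flatnonzero(y == -1)

rng = np.random.RandomState(0)
rng.shuffle(unlabeled_idx)

predictions = np.full(len(y), -1)
for part in np.array_split(unlabeled_idx, 2):      # two partitions of the unlabeled points
    subset = np.concatenate([labeled_idx, part])   # all labeled points + this partition
    model = LabelPropagation()
    model.fit(X[subset], y[subset])
    # transduction_ is ordered like the fitted subset, so the tail corresponds to `part`
    predictions[part] = model.transduction_[len(labeled_idx):]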

Note that this will probably reduce the performance of the label propagation algorithm, since each run has fewer data points to work with. Inconsistencies between the labels produced for the different subsets might also cause trouble. Use with extreme caution and only if you have some way to evaluate the performance :)

A safer approach would be to A: get more memory, or B: use a label propagation algorithm that is less memory intensive. It is certainly possible to trade memory complexity for time complexity by recalculating Euclidean distances when needed rather than constructing the full all-pairs distance matrix, as scikit-learn appears to be doing here.
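One concrete option along those lines: LabelPropagation also accepts a k-nearest-neighbours kernel, which builds a sparse graph instead of the dense 17000x17000 RBF affinity matrix and may fit in memory. Whether this is enough will depend on your data and sklearn version; the n_neighbors value below is just a placeholder:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# kNN kernel keeps the graph sparse instead of allocating a dense all-pairs matrix
label_prop_model = LabelPropagation(kernel='knn', n_neighbors=7)
label_prop_model.fit(np.array(data), labels)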

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow