Question

I am using label propagation in scikit-learn for semi-supervised classification. I have 17,000 data points with 7 dimensions, and fitting fails on this data set with a numpy "array is too big" error. It works fine on a relatively small data set of, say, 200 points. Can anyone suggest a fix?

label_prop_model.fit(np.array(data), labels)
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 58, in fit
graph_matrix = self._build_graph()
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 108, in _build_graph
affinity_matrix = self._get_kernel(self.X_) # get the affinty martix from the data using rbf kernel
File "/usr/lib/pymodules/python2.7/sklearn/semi_supervised/mylabelprop.py", line 26, in _get_kernel
return rbf_kernel(X, X, gamma=self.gamma)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 350, in rbf_kernel
K = euclidean_distances(X, Y, squared=True)
File "/usr/lib/pymodules/python2.7/sklearn/metrics/pairwise.py", line 173, in euclidean_distances
distances = safe_sparse_dot(X, Y.T, dense_output=True)
File "/usr/lib/pymodules/python2.7/sklearn/utils/extmath.py", line 79, in safe_sparse_dot
return np.dot(a, b)
ValueError: array is too big.

Solution

How much memory does your computer have?

What sklearn might be doing here (I haven't gone through the source, so I might be wrong) is computing the squared Euclidean distance between every pair of data points by multiplying the 17000xK data matrix with its own transpose (that is what the safe_sparse_dot(X, Y.T) call in your traceback does). That yields the squared Euclidean distances for all pairs, but unfortunately produces an NxN output matrix when you have N data points. As far as I know numpy uses double precision by default, so the result is a 17000x17000x8-byte matrix, approximately 2.15 GiB.
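A quick back-of-the-envelope sketch of that arithmetic (the numbers are just the ones from the question):

import numpy  # only for consistency with the snippet below; plain Python works too

n_points = 17000
bytes_per_double = 8  # numpy defaults to float64
matrix_bytes = n_points * n_points * bytes_per_double
print("Dense affinity matrix: %.2f GiB" % (matrix_bytes / 1024.0 ** 3))  # ~2.15 GiB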

If your memory can't hold a matrix of that size that would cause trouble. Try creating a matrix of this size with numpy:

import numpy
mat = numpy.ones((17000, 17000))  # note the tuple: the shape is a single argument

If that succeeds, I'm mistaken and the problem is something else (though certainly still related to memory size and the matrices sklearn is trying to allocate).

Off the top of my head, one way to work around this might be to propagate labels in parts by subsampling the unlabeled data points (and possibly the labeled points, if you have many of them). If you are able to run the algorithm on 17000/2 data points and you have L labeled points, build each new data set by randomly drawing (17000-L)/2 of the unlabeled points from the original set and combining them with the L labeled points. Run the algorithm on each partition of the full set; a sketch follows below.
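A minimal sketch of that idea, assuming your data lives in an array X of shape (17000, 7) and y holds the labels with -1 marking unlabeled points (the usual sklearn convention); the variable names and the choice of two partitions are just placeholders:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

labeled_idx = np.flatnonzero(y != -1)
unlabeled_idx = np.flatnonzero(y == -1)

rng = np.random.RandomState(0)
rng.shuffle(unlabeled_idx)

predictions = np.full(len(y), -1)
for part in np.array_split(unlabeled_idx, 2):      # two partitions of the unlabeled points
    subset = np.concatenate([labeled_idx, part])   # all labeled points + this partition
    model = LabelPropagation()
    model.fit(X[subset], y[subset])
    # transduction_ is ordered like the fitted subset, so the tail corresponds to `part`
    predictions[part] = model.transduction_[len(labeled_idx):]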

Note that this will probably reduce the performance of the label propagation algorithm, since each run has fewer data points to work with. Inconsistencies between the labels produced for the different subsets might also cause trouble. Use with extreme caution and only if you have some way to evaluate the performance :)

A safer approach would be to A: get more memory, or B: use a label propagation algorithm that is less memory intensive. It is certainly possible to trade memory complexity for time complexity by recalculating Euclidean distances when needed rather than constructing the full all-pairs distance matrix, as scikit-learn appears to be doing here.
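One concrete option along those lines: LabelPropagation also accepts a k-nearest-neighbours kernel, which builds a sparse graph instead of the dense 17000x17000 RBF affinity matrix and may fit in memory. Whether this is enough will depend on your data and sklearn version; the n_neighbors value below is just a placeholder:

import numpy as np
from sklearn.semi_supervised import LabelPropagation

# kNN kernel keeps the graph sparse instead of allocating a dense all-pairs matrix
label_prop_model = LabelPropagation(kernel='knn', n_neighbors=7)
label_prop_model.fit(np.array(data), labels)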

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow