Classic kNN data structures such as the KD tree used in sklearn
become very slow as the dimension of the data increases. For very high-dimensional problems it is advisable to switch algorithm class and use approximate nearest neighbour (ANN) methods, which sklearn
unfortunately seems to be lacking. See the links below for papers on the algorithms and the theory of why approximate nearest neighbours are so much faster in these cases.
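To give an intuition for the speed-up, here is a toy sketch (plain NumPy, not FLANN's actual algorithm - FLANN uses randomized KD trees and hierarchical k-means) of the locality-sensitive-hashing idea behind many ANN methods: hash points with random hyperplanes, then do the exact distance computation only on the small bucket of candidates instead of the whole dataset:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dataset: 10,000 points in 128 dimensions (a typical feature-descriptor size).
data = rng.standard_normal((10_000, 128)).astype(np.float32)
query = rng.standard_normal(128).astype(np.float32)

# --- Exact search: brute-force scan over all n points, O(n * d). ---
exact_idx = np.argmin(np.linalg.norm(data - query, axis=1))

# --- Approximate search: random-projection hashing (LSH-style sketch). ---
# Each random hyperplane contributes one bit: which side of the plane a point lies on.
n_bits = 12
planes = rng.standard_normal((128, n_bits)).astype(np.float32)

def signature(x):
    # Boolean bit signature of a point (or of every row of a matrix).
    return x @ planes > 0

sigs = signature(data)      # (10_000, n_bits) bit signatures, computed once up front
q_sig = signature(query)

# Candidate set: points whose signature differs from the query's in at most 2 bits.
hamming = (sigs != q_sig).sum(axis=1)
candidates = np.where(hamming <= 2)[0]

# Exact distances only over the candidates -- typically a few percent of the data.
best = candidates[np.argmin(np.linalg.norm(data[candidates] - query, axis=1))]
```

The result `best` is only approximately the nearest neighbour (the true one may hash into a distant bucket), but the candidate set is a small fraction of the dataset, which is where the speed-up in high dimensions comes from.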
A well-known ANN library in the C++ world, widely used in computer vision for nearest-neighbour search in feature-descriptor spaces, is
FLANN
. The homepage says it contains Python bindings (I have never worked with them). Another popular alternative is the
ANN
library with a Python wrapper here, although the newer FLANN seems to be more popular at the moment. See also this answer (but some of its links are dead).
One caveat: your data seems to be very high-dimensional - I don't know how these libraries will perform for you. They should still beat sklearn
.