Fast way of doing k-means clustering on binary vectors in C++

Question 1

Indeed, k-means is not a good fit here, because the mean of a set of binary vectors is generally a fractional vector rather than a binary one, so it is not a meaningful cluster representative.

Why do you need exactly k clusters? Forcing a fixed k will likely mean that some vectors do not fit their clusters very well.

Some techniques you could look into instead: MinHash and locality-sensitive hashing (LSH).
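To give a rough idea of what locality-sensitive hashing looks like for binary vectors, here is a minimal sketch of bit sampling, a common LSH scheme for Hamming distance. It is only an illustration: the `BitVec` packing, the helper names (`samplePositions`, `lshKey`, `buildBuckets`) and the single hash table are all assumptions made for this example, not something from the question or from any particular library.

```cpp
#include <cstddef>
#include <cstdint>
#include <random>
#include <unordered_map>
#include <vector>

// Assumed representation: a binary vector packed into 64-bit words.
using BitVec = std::vector<std::uint64_t>;

// Pick `numBits` random bit positions in [0, dims). Hypothetical helper.
std::vector<std::size_t> samplePositions(std::size_t dims, std::size_t numBits,
                                         unsigned seed) {
    std::mt19937 rng(seed);
    std::uniform_int_distribution<std::size_t> pick(0, dims - 1);
    std::vector<std::size_t> pos(numBits);
    for (auto& p : pos) p = pick(rng);
    return pos;
}

// Hash key: the sampled bits concatenated into one integer.
// Assumes pos.size() <= 64 so the key fits in a single word.
std::uint64_t lshKey(const BitVec& v, const std::vector<std::size_t>& pos) {
    std::uint64_t key = 0;
    for (std::size_t p : pos)
        key = (key << 1) | ((v[p / 64] >> (p % 64)) & 1u);
    return key;
}

// Group vector indices by key. Vectors that agree on every sampled bit
// land in the same bucket, so vectors at small Hamming distance are
// more likely to collide than distant ones.
std::unordered_map<std::uint64_t, std::vector<std::size_t>>
buildBuckets(const std::vector<BitVec>& data,
             const std::vector<std::size_t>& pos) {
    std::unordered_map<std::uint64_t, std::vector<std::size_t>> buckets;
    for (std::size_t i = 0; i < data.size(); ++i)
        buckets[lshKey(data[i], pos)].push_back(i);
    return buckets;
}
```

In practice you would typically build several such tables with different sampled positions and treat vectors that collide in any of them as candidate neighbours; the number of tables and sampled bits would need tuning for your data.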

Question 2

It is possible to do k-means clustering with binary feature vectors. The TopSig paper, which I co-authored, has the details. The centroids are calculated by taking the most frequently occurring bit in each dimension. The TopSig paper applied this to document clustering, where the binary feature vectors were created by random projection of sparse, high-dimensional bag-of-words feature vectors. There is an implementation in Java at http://ktree.sf.net. We are currently working on a C++ version; it is very early code that is still messy and probably contains bugs, but you can find it at http://github.com/cmdevries/LMW-tree. If you have any questions, please feel free to contact me at chris@de-vries.id.au.
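As a rough illustration of the majority-bit centroid idea, here is a minimal sketch in C++. It is not the TopSig or LMW-tree code. The `BitVec` packing, the function names, the use of Hamming distance for assignment, the tie-breaking rule (ties go to 0) and the caller-supplied initial centroids are all assumptions made for this example.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <limits>
#include <vector>

// Assumed representation: a binary vector packed into 64-bit words.
using BitVec = std::vector<std::uint64_t>;

// Hamming distance: number of bit positions where a and b differ.
std::size_t hamming(const BitVec& a, const BitVec& b) {
    std::size_t d = 0;
    for (std::size_t i = 0; i < a.size(); ++i)
        d += std::bitset<64>(a[i] ^ b[i]).count();
    return d;
}

// Index of the centroid closest to v (first one wins on ties).
std::size_t nearest(const BitVec& v, const std::vector<BitVec>& centroids) {
    std::size_t best = 0;
    std::size_t bestDist = std::numeric_limits<std::size_t>::max();
    for (std::size_t c = 0; c < centroids.size(); ++c) {
        std::size_t d = hamming(v, centroids[c]);
        if (d < bestDist) { bestDist = d; best = c; }
    }
    return best;
}

// Majority-vote centroid: a bit is set only if strictly more than half
// of the members have it set (so ties resolve to 0 in this sketch).
BitVec majorityCentroid(const std::vector<const BitVec*>& members,
                        std::size_t words) {
    BitVec centroid(words, 0);
    for (std::size_t w = 0; w < words; ++w) {
        for (int bit = 0; bit < 64; ++bit) {
            std::size_t ones = 0;
            for (const BitVec* m : members)
                ones += ((*m)[w] >> bit) & 1u;
            if (2 * ones > members.size())
                centroid[w] |= (std::uint64_t{1} << bit);
        }
    }
    return centroid;
}

// Lloyd-style iterations: assign by Hamming distance, update by majority
// vote. `centroids` holds the initial centroids and is updated in place.
std::vector<std::size_t> binaryKMeans(const std::vector<BitVec>& data,
                                      std::vector<BitVec>& centroids,
                                      int maxIters) {
    std::vector<std::size_t> assign(data.size(), 0);
    for (int it = 0; it < maxIters; ++it) {
        bool changed = (it == 0);  // always do at least one update step
        for (std::size_t i = 0; i < data.size(); ++i) {
            std::size_t c = nearest(data[i], centroids);
            if (c != assign[i]) { assign[i] = c; changed = true; }
        }
        if (!changed) break;
        for (std::size_t c = 0; c < centroids.size(); ++c) {
            std::vector<const BitVec*> members;
            for (std::size_t i = 0; i < data.size(); ++i)
                if (assign[i] == c) members.push_back(&data[i]);
            if (!members.empty())  // empty clusters keep their old centroid
                centroids[c] = majorityCentroid(members, centroids[c].size());
        }
    }
    return assign;
}
```

The per-bit counting loop is written for clarity rather than speed; a faster version would likely accumulate per-bit counters in a single pass over each cluster's members.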

If you need to cluster a large number of binary vectors, there are also more scalable tree-based clustering algorithms: K-tree, TSVQ and EM-tree. For more details on these algorithms, see a paper on the EM-tree that I have recently submitted for peer review; it is not yet published.