Question

Consider the following problem:

Input: We are given a set $S$ of $n$ binary vectors. A binary vector is of the form $(b_1, b_2, \ldots, b_n)$, where $b_i \in \{0, 1\}$. Therefore, there are $n$ vectors, each of dimension $n$.

Output: For each vector in $S$, we would like to (efficiently) compute the list of vectors from $S$ whose Hamming distance to it is at most some predefined constant $c$.

Output (alternative): Find the pair of vectors with minimum Hamming distance.


Clearly, the above can be done in $O(n^3)$ time by computing the Hamming distance between every pair of vectors in $S$ ($\binom{n}{2} = \frac{n(n-1)}{2}$ pairs, and $O(n)$ to compute each Hamming distance). However, I am looking for something more efficient, since $n$ can take large values (say, $n = 10^6$).
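For concreteness, here is a minimal sketch of that brute-force baseline (Python; packing each vector into an integer makes one distance computation a single XOR plus popcount rather than an explicit $O(n)$ loop — the function names are mine, just for illustration):

```python
from itertools import combinations

def hamming(a: int, b: int) -> int:
    # Hamming distance of two bit-packed vectors: popcount of their XOR.
    return (a ^ b).bit_count()  # Python 3.10+; use bin(a ^ b).count("1") otherwise

def pairs_within(vectors: list[int], c: int) -> list[tuple[int, int]]:
    # Brute force: all n*(n-1)/2 pairs, one distance computation each.
    return [(i, j)
            for i, j in combinations(range(len(vectors)), 2)
            if hamming(vectors[i], vectors[j]) <= c]
```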

To speed up the computation, one could use some form of hashing, where we place the binary vectors into $n + 1$ buckets, such that every vector in the $i^{th}$ bucket contains exactly $i$ ones. This could work, but in my particular case one can expect all binary vectors to have approximately the same number of ones, so not much would be gained from such hashing (see the sketch below). One could also compress the binary vectors and compute the Hamming distance on the smaller compressed vectors as a bound on the true distance, but I suspect there are better approaches.
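For what it's worth, a minimal sketch of that popcount-bucketing idea: since the Hamming distance between two vectors is at least the difference of their popcounts, bucket $i$ only needs to be compared against buckets $i - c$ through $i + c$.

```python
from collections import defaultdict

def popcount_candidates(vectors: list[int], c: int):
    # Bucket i holds the indices of all vectors with exactly i ones.
    buckets = defaultdict(list)
    for idx, v in enumerate(vectors):
        buckets[v.bit_count()].append(idx)
    # dist(u, v) >= |popcount(u) - popcount(v)|, so bucket i only needs
    # to be checked against buckets i .. i + c (lower buckets are covered
    # symmetrically, when they play the role of the smaller index).
    for i, members in buckets.items():
        for j in range(i, i + c + 1):
            for a in members:
                for b in buckets.get(j, []):
                    if j > i or b > a:  # avoid self- and duplicate pairs
                        yield a, b      # candidate; still verify exactly
```

Each candidate pair still needs the exact check (e.g. `hamming` above), and as noted, if all popcounts are nearly equal, almost everything lands in one bucket and essentially nothing is pruned.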

Another hashing approach would be to subdivide the space into hypercubes and place each vector into one of the cubes. I believe similar approaches are used in physics simulations, where space is divided into a grid of 2D or 3D cells and interactions are only computed between elements within the same cell or neighboring cells.
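For binary vectors, the analogue of those grid cells is to key each vector on a small random subset of its coordinates: two vectors within distance $c$ agree on a uniformly random coordinate with probability at least $1 - c/n$, so they tend to land in the same cell in at least one of several independent tables. This is essentially bit-sampling locality-sensitive hashing; a hedged sketch follows (`num_tables` and `bits_per_table` are illustrative parameters that would need tuning for a given dataset):

```python
import random
from collections import defaultdict

def lsh_candidates(vectors: list[int], n_bits: int,
                   num_tables: int = 10, bits_per_table: int = 16,
                   seed: int = 0) -> set[tuple[int, int]]:
    # Bit-sampling LSH: each table keys vectors by the values of a few
    # randomly chosen coordinates (one "hypercube cell" per distinct key).
    rng = random.Random(seed)
    candidates = set()
    for _ in range(num_tables):
        coords = rng.sample(range(n_bits), bits_per_table)
        table = defaultdict(list)
        for idx, v in enumerate(vectors):
            key = tuple((v >> pos) & 1 for pos in coords)
            table[key].append(idx)
        for cell in table.values():
            for k, a in enumerate(cell):
                for b in cell[k + 1:]:
                    candidates.add((a, b))  # a < b: cells are built in index order
    return candidates  # verify each pair with an exact distance check
```

This also matches the "mostly correct" requirement below: a close pair can be missed if it collides in no table (the miss probability drops exponentially in `num_tables`), and false positives are removed by the final exact check.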

I have seen variations of the above in many applications, but I am not sure what the name of the problem is or how to compute the solution efficiently. It is worth noting that in my particular case the solution does not need to be exact; some errors in the output are acceptable as long as most of it is correct. The main issue seems to be how to avoid computing the distance between vectors that are clearly far apart.

(The structure of the binary vectors in $S$ can be diverse. In some cases there are few bits set in each vector, while in other cases the vectors can be dense. In the sparse case the Hamming distance computation itself can be greatly sped up, but the bottleneck is still the $\frac{n(n-1)}{2}$ comparisons that need to be made; see the sketch below.)
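For the sparse case specifically, the quadratic number of comparisons can often be avoided with an inverted index over bit positions: if $\mathrm{dist}(u, v) \le c$ and the popcounts of $u$ and $v$ sum to more than $c$, the two vectors must share at least one set bit. A sketch under that assumption (pairs whose combined popcount is at most $c$ are trivially within distance $c$ and would be handled separately):

```python
from collections import defaultdict
from itertools import combinations

def sparse_pairs_within(vectors: list[set[int]], c: int):
    # vectors[i] is the set of positions of the ones in vector i.
    index = defaultdict(list)  # bit position -> indices of vectors with that bit set
    for idx, bits in enumerate(vectors):
        for b in bits:
            index[b].append(idx)
    seen = set()
    for posting in index.values():
        # Only vectors sharing this set bit are compared; vectors with no
        # common one-bit (and combined popcount > c) are never touched.
        for a, b in combinations(posting, 2):
            if (a, b) not in seen:
                seen.add((a, b))
                # Exact sparse distance = size of the symmetric difference.
                if len(vectors[a] ^ vectors[b]) <= c:
                    yield a, b
```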
