Question

Ex: If I have the string "asdf" and the set of strings ("qwer", "aswr", "asdv"). The hamming distance between the set and the string would be 1 as the "asdv" and "asdf" have hamming distance one.

It's easy to brute force with something like this

def hamming_distance(string, set):
    min = len(string)
    for element in set:
        element_distance = sum(ch1 != ch2 for ch1, ch2 in zip(string, element))
        if min > element_distance:
            min = element_distance
        if min == 0:
            break
    return min

I think this has O(n*k) where n = len(string) and k = len(set). However, maximum set size scales with n^2, which means we're essentially dealing with O(n^3). The sets rather static, so if preprocessing would help that is definitely an option.

Finally, I should mention that the application here is to determine which set(s) are closest to the string in question, but I've reduced the problem because string-length is a much more limiting factor than the number of sets. If there's another way to approach this by looking at the space as a whole instead of the individual subsets I would be all ears. When I first took that approach it seemed like space complexity was going to get completely ridiculous though.

Was it helpful?

Solution

First of all, the hamming distance between strings is a Metric. Thus, you are trying to find the k-nearest-neighbors in a metric space (where k = 1).

Consequently, you may want to consider a tree similar to the M-Tree data structure: (see http://en.wikipedia.org/wiki/M-tree and http://www.vldb.org/conf/1997/P426.PDF ). This tree is designed to reduce the number distance comparisons that need to be performed to find "nearest neighbors".

Personally, I could not find an implementation of an M-Tree online that I was satisfied with (see my closed thread Looking for a mature M-Tree implementation) so I rolled my own.

My implementation is here: https://github.com/jon1van/MTreeMapRepo

The ONLY other implementation I could find was this one: https://github.com/erdavila/M-Tree I did not like this implementation because it had no remove functionality (and several other problems)(but it was free so...that's good).

You may want to consider using my code (which solves kNN searches in a generic metric space) with a Levensthtein distance metric(http://en.wikipedia.org/wiki/Levenshtein_distance). Finding a fully implemented Levenshtein distance metric online should be pretty easy

Added Levenstein distance function ** http://code.google.com/p/google-refine/source/browse/trunk/src/main/java/edu/mit/simile/vicino/distances/LevensteinDistance.java?r=181

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top