Question

I've a set of string [S1 S2 S3 ... Sn] and I'm to count all such target strings T such that each one of S1 S2... Sn can be converted into T within a total of K edits.
All the strings are of fixed length L and an edit here is hamming distance.

All I've is sort of brute force approach. so, If my alphabet size is 4, I've sample space of O(4^L) and it takes O(L) time to check each one of them. I can't seem to bring down the complexity from exponential to some poly or pseudo-poly! Is there any way to prune down the sample space to do better?

I tried to visualize it as in a L-dimensional vector space. I've been given N points and have to count all the points whose sum of distance from the given N points is less than or equal to K.
i.e. d1 + d2 + d3 +...+ dN <= K
Is there any known geometric algorithm which solves this or similar problem with a better complexity? Kindly point me in the right direction or any hints are appreciated.
Thank you

Was it helpful?

Solution

You can do this efficiently with dynamic programming.

The key idea is that you don't need to enumerate all possible target strings, you just need to know how many ways targets are possible with K edits considering only the string indicies after I.

alphabet = 'abcd'
s = [ 'aabbbb', 'bacaaa', 'dabbbb', 'cabaaa']

# use memoized from http://wiki.python.org/moin/PythonDecoratorLibrary          
@memoized
def count(edits_left, index):
  if index == -1 and edits_left >= 0:
    return 1
  if edits_left < 0:
    return 0
  ret = 0
  for char in alphabet:
    edits_used = 0
    for mutate_str in s:
      if mutate_str[index] != char:
        edits_used += 1
    ret += count(edits_left - edits_used, index - 1)
  return ret

OTHER TIPS

Thinking out loud, it seems to me that this problem boils down to a combinatorial problem.

In general for a string S of length L, there are a total of C(L,K) (binomial coefficient) positions that can be substituted and therefore (ALPHABET_SIZE^K)*C(L,K) target strings T from a Hamming Distance of K.

Binomial Coefficient can be computed quite easily using Dynamic Programming and the Pascal Triangle... No need to get crazy into factoriel etc...

Now that one string case is treated, dealing with multiple strings is a little bit more tricky since you might double count targets. Intuitively though if S1 is K far from S2 then both string will generate the same set of target so you don't double count in this case. This last statement might be a long shot that's why I made sure to say "intuitively" :)

Hope it helps,

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top