Domanda

I have been using Longest Common Subsequence (LCS) to find the similarly between sequences. The following dynamic programming code computes the answer.

def lcs(a, b):
    lengths = [[0 for j in range(len(b)+1)] for i in range(len(a)+1)]
    # row 0 and column 0 are initialized to 0 already
    for i, x in enumerate(a):
        for j, y in enumerate(b):
            if x == y:
                lengths[i+1][j+1] = lengths[i][j] + 1
            else:
                lengths[i+1][j+1] = \
                    max(lengths[i+1][j], lengths[i][j+1])
    # read the substring out from the matrix
    result = ""
    x, y = len(a), len(b)
    while x != 0 and y != 0:
        if lengths[x][y] == lengths[x-1][y]:
            x -= 1
        elif lengths[x][y] == lengths[x][y-1]:
            y -= 1
        else:
            assert a[x-1] == b[y-1]
            result = a[x-1] + result
            x -= 1
            y -= 1
    return result

However I have realised what I really want to solve is a little different. Given a fixed k I need to make sure that the common subsequence only involves substrings of length exactly k. For example, set k = 2 and let the two strings be

A = "abcbabab"
B = "baababcc"

The subsequence that I need would be "ba"+"ab" = baab.

Is it possible to modify the dynamic programming solution to solve this problem?

The original longest common subsequence problem would just be the k=1 case.

A method that doesn't work.

If we perform the LCS algorithm above, we can get the alignment from the dynamic programming table and check whether those symbols appear in non-overlapping substrings of length k in both input sequences, and delete them if not. The problem is that this doesn't give an optimal solution.

È stato utile?

Soluzione

The correction is basically when the two strings in the relevant index have a matching substring (instead of a matching letter, as it is now).

The idea is instead of simply checking for a substring of size 1 in the original solution, check for a substring of length k, and add 1 to the solution (and 'jump' by k in the string).

The formula for the recursive approach that should be translated to the DP solution is:

f(i,0) = 0
f(0,j) = 0
f(i,j) = max{
          f(i-k,j-k) + aux(s1,s2,i,j,k)
          f(i-1,j)
          f(i,j-1)
             }

where aux(s1,s2,i,j,k) is a function that is aimed to check if the two substrings are a match, and is defined as:

aux(s1,s2,i,j,k) = 1         | if s1.substring(i-k,i) equals s2.substring(j-k, j)
                   -infinity | otherwise

You can reconstruct the alignment later using an auxilary matrix that marks the choices of max{}, and go from last to first when the matrix is complete.

Example:

bcbac and cbcba. k=2

Matrix generated by f:

      c   b   c  b   a
   0  0   0   0  0   0
b  0  0   0   0  0   0
c  0  0   0   1  1   1   
b  0  0   1   1  1   1
a  0  0   1   1  1   2
c  0  0   1   1  1   2

And for reproducing the alignment you generate 'choices' matrix:

1 - chose f(i-k,j-k) + aux(s1,s2,i,j,k)
2 - chose f(i-1,j)
3 - chose f(i,j-1)
d - don't care, (all choices are fine)
x/y -means one of x or y.

c      b      c     b      a

b   2/3    2/3    2/3   2/3    2/3
c   2/3    2/3     1     2      2   
b   2/3     3     2/3    d     2/3
a   2/3     3     2/3   2/3     1
c   2/3     3     2/3   2/3     3

Now, reconstructing the alignment - start from the last (bottom right) cell:

  1. It's '3' - move up, don't add anything to the alignment.
  2. It's 1 - we need to add 'ba' to the alignment. (alignment='ba' currently). move k up and left.
  3. It's 1, ad 'bc' to the alignment. current alignment: 'bcba'. move k up and left.
  4. It's 2/3 - move left OR up.

The order of visiting while reconstructing is: (0 means - not visited, number in many cells means any of these is OK).

      c   b   c  b   a
   0  4   0   0  0   0
b  4  3   0   0  0   0
c  0  0   0   0  0   0   
b  0  0   0   2  0   0
a  0  0   0   0  0   0
c  0  0   0   0  0   1
Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top