3+字符串的最长常见随后

https://stackoverflow.com/questions/5057243

15-11-2019
|

题

我试图找到3个或更多字符串的最长的常见子序列。Wikipedia文章对如何执行此操作，但我'm一点不确定如何将其扩展到3个或更多字符串。

有很多库来查找2个字符串的LCS，因此如果可能，我想使用其中一个。如果我有3个字符串A，B和C，它有效地找到A和B为x的LCS，然后找到X和C的LCS，或者是错误的方法吗？

我在Python中实现了它，如下所示：

import difflib

def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)

print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])

这个输出“ba”，但它应该是“baa”。

解决方案

Just generalize the recurrence relation.

For three strings:

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

Should be easy to generalize to more strings from this.

其他提示

To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally like shown in the Link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that given value has already been solved. That is why you traverse the 2-dimensional array diagonally.

This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate an array of N dimensions such that any element is reached only when all sub-problems the element requires a solution to has been solved.

Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:

string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}

My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.

On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"

The LCS returned is "54442"

I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.

The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.

So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.

Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.

def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]

    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], 
                                              subs[i][j+1][k+1], 
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1

    return lcs[::-1]

This below code can find the longest common subsequence in N strings. This uses itertools to generate required index combinations and then use these indexes for finding common substring.

Example Execution:
Input:
Enter the number of sequences: 3
Enter sequence 1 : 83217
Enter sequence 2 : 8213897
Enter sequence 3 : 683147

Output:
837

from itertools import product
import numpy as np
import pdb

def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))

def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs);
    lengths = np.array([len(sequence) for sequence in sqs]);
    incrLengths = lengths + 1;
    lengths = tuple(lengths);
    inverseDistances = np.zeros(incrLengths);
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]];
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1];
        neighborIndexes = list(neighbors(tupleIndex));
        operationsWithMisMatch = np.array([]);
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex]);
        operationsWithMatch = np.copy(operationsWithMisMatch);
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1;
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)];
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch);
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch);
        # pdb.set_trace();

    subString = "";
    mainTupleIndex = lengths;
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex));
        anyOperation = False;
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex];
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex;
                anyOperation = True;
                break;
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1]);
            mainTupleIndex = neighborsIndexes[-1];
    return subString[::-1];

numberOfSequences = int(input("Enter the number of sequences: "));
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)];
print(longestCommonSubSequenceOfN(sequences));

许可以下： CC-BY-SA 和归因

不隶属于 StackOverflow