3 가지 이상의 문자열의 가장 긴 공통의 서브 시퀀스

https://stackoverflow.com/questions/5057243

15-11-2019
|

문제

나는 3 개 이상의 문자열의 가장 긴 공통된 서브 그런스를 찾으려고 노력하고 있습니다.Wikipedia 기사는 2 strings 에 대해이 작업을 수행하는 방법에 대한 훌륭한 설명을 가지고 있습니다.m 이이를 3 개 이상의 문자열로 확장하는 방법을 조금 확실하게 확신하지 못합니다.

2 개의 문자열의 LCS를 찾는 데 많은 도서관이 있으므로 가능한 한 그들 중 하나를 사용하고 싶습니다.3 개의 문자열 a, b 및 c가있는 경우 a와 b의 lcs를 x로 찾은 다음 x와 c의 lcs를 찾아내는 것이 유효하지 않거나이를 수행하는 데 잘못된 방법입니까?

다음과 같이 파이썬에서 구현했습니다.

import difflib

def lcs(str1, str2):
    sm = difflib.SequenceMatcher()
    sm.set_seqs(str1, str2)
    matching_blocks = [str1[m.a:m.a+m.size] for m in sm.get_matching_blocks()]
    return "".join(matching_blocks)

print reduce(lcs, ['abacbdab', 'bdcaba', 'cbacaa'])

이 출력 "BA"가 아니지만 "BAA"가되어야합니다.

해결책

Just generalize the recurrence relation.

For three strings:

dp[i, j, k] = 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

Should be easy to generalize to more strings from this.

다른 팁

To find the Longest Common Subsequence (LCS) of 2 strings A and B, you can traverse a 2-dimensional array diagonally like shown in the Link you posted. Every element in the array corresponds to the problem of finding the LCS of the substrings A' and B' (A cut by its row number, B cut by its column number). This problem can be solved by calculating the value of all elements in the array. You must be certain that when you calculate the value of an array element, all sub-problems required to calculate that given value has already been solved. That is why you traverse the 2-dimensional array diagonally.

This solution can be scaled to finding the longest common subsequence between N strings, but this requires a general way to iterate an array of N dimensions such that any element is reached only when all sub-problems the element requires a solution to has been solved.

Instead of iterating the N-dimensional array in a special order, you can also solve the problem recursively. With recursion it is important to save the intermediate solutions, since many branches will require the same intermediate solutions. I have written a small example in C# that does this:

string lcs(string[] strings)
{
    if (strings.Length == 0)
        return "";
    if (strings.Length == 1)
        return strings[0];
    int max = -1;
    int cacheSize = 1;
    for (int i = 0; i < strings.Length; i++)
    {
        cacheSize *= strings[i].Length;
        if (strings[i].Length > max)
            max = strings[i].Length;
    }
    string[] cache = new string[cacheSize];
    int[] indexes = new int[strings.Length];
    for (int i = 0; i < indexes.Length; i++)
        indexes[i] = strings[i].Length - 1;
    return lcsBack(strings, indexes, cache);
}
string lcsBack(string[] strings, int[] indexes, string[] cache)
{
    for (int i = 0; i < indexes.Length; i++ )
        if (indexes[i] == -1)
            return "";
    bool match = true;
    for (int i = 1; i < indexes.Length; i++)
    {
        if (strings[0][indexes[0]] != strings[i][indexes[i]])
        {
            match = false;
            break;
        }
    }
    if (match)
    {
        int[] newIndexes = new int[indexes.Length];
        for (int i = 0; i < indexes.Length; i++)
            newIndexes[i] = indexes[i] - 1;
        string result = lcsBack(strings, newIndexes, cache) + strings[0][indexes[0]];
        cache[calcCachePos(indexes, strings)] = result;
        return result;
    }
    else
    {
        string[] subStrings = new string[strings.Length];
        for (int i = 0; i < strings.Length; i++)
        {
            if (indexes[i] <= 0)
                subStrings[i] = "";
            else
            {
                int[] newIndexes = new int[indexes.Length];
                for (int j = 0; j < indexes.Length; j++)
                    newIndexes[j] = indexes[j];
                newIndexes[i]--;
                int cachePos = calcCachePos(newIndexes, strings);
                if (cache[cachePos] == null)
                    subStrings[i] = lcsBack(strings, newIndexes, cache);
                else
                    subStrings[i] = cache[cachePos];
            }
        }
        string longestString = "";
        int longestLength = 0;
        for (int i = 0; i < subStrings.Length; i++)
        {
            if (subStrings[i].Length > longestLength)
            {
                longestString = subStrings[i];
                longestLength = longestString.Length;
            }
        }
        cache[calcCachePos(indexes, strings)] = longestString;
        return longestString;
    }
}
int calcCachePos(int[] indexes, string[] strings)
{
    int factor = 1;
    int pos = 0;
    for (int i = 0; i < indexes.Length; i++)
    {
        pos += indexes[i] * factor;
        factor *= strings[i].Length;
    }
    return pos;
}

My code example can be optimized further. Many of the strings being cached are duplicates, and some are duplicates with just one additional character added. This uses more space than necessary when the input strings become large.

On input: "666222054263314443712", "5432127413542377777", "6664664565464057425"

The LCS returned is "54442"

I just had to do this for a homework, so here is my dynamic programming solution in python that's pretty efficient. It is O(nml) where n, m and l are the lengths of the three sequences.

The solution works by creating a 3D array and then enumerating all three sequences to calculate the path of the longest subsequence. Then you can backtrack through the array to reconstruct the actual subsequence from its path.

So, you initialize the array to all zeros, and then enumerate the three sequences. At each step of the enumeration, you either add one to the length of the longest subsequence (if there's a match) or just carry forward the longest subsequence from the previous step of the enumeration.

Once the enumeration is complete, you can now trace back through the array to reconstruct the subsequence from the steps you took. i.e. as you travel backwards from the last entry in the array, each time you encounter a match you look it up in any of the sequences (using the coordinate from the array) and add it to the subsequence.

def lcs3(a, b, c):
    m = len(a)
    l = len(b)
    n = len(c)
    subs = [[[0 for k in range(n+1)] for j in range(l+1)] for i in range(m+1)]

    for i, x in enumerate(a):
        for j, y in enumerate(b):
            for k, z in enumerate(c):
                if x == y and y == z:
                    subs[i+1][j+1][k+1] = subs[i][j][k] + 1
                else:
                    subs[i+1][j+1][k+1] = max(subs[i+1][j+1][k], 
                                              subs[i][j+1][k+1], 
                                              subs[i+1][j][k+1])
    # return subs[-1][-1][-1] #if you only need the length of the lcs
    lcs = ""
    while m > 0 and l > 0 and n > 0:
        step = subs[m][l][n]
        if step == subs[m-1][l][n]:
            m -= 1
        elif step == subs[m][l-1][n]:
            l -= 1
        elif step == subs[m][l][n-1]:
            n -= 1
        else:
            lcs += str(a[m-1])
            m -= 1
            l -= 1
            n -= 1

    return lcs[::-1]

This below code can find the longest common subsequence in N strings. This uses itertools to generate required index combinations and then use these indexes for finding common substring.

Example Execution:
Input:
Enter the number of sequences: 3
Enter sequence 1 : 83217
Enter sequence 2 : 8213897
Enter sequence 3 : 683147

Output:
837

from itertools import product
import numpy as np
import pdb

def neighbors(index):
    N = len(index)
    for relative_index in product((0, -1), repeat=N):
        if not all(i == 0 for i in relative_index):
            yield tuple(i + i_rel for i, i_rel in zip(index, relative_index))

def longestCommonSubSequenceOfN(sqs):
    numberOfSequences = len(sqs);
    lengths = np.array([len(sequence) for sequence in sqs]);
    incrLengths = lengths + 1;
    lengths = tuple(lengths);
    inverseDistances = np.zeros(incrLengths);
    ranges = [tuple(range(1, length+1)) for length in lengths[::-1]];
    for tupleIndex in product(*ranges):
        tupleIndex = tupleIndex[::-1];
        neighborIndexes = list(neighbors(tupleIndex));
        operationsWithMisMatch = np.array([]);
        for neighborIndex in neighborIndexes:
            operationsWithMisMatch = np.append(operationsWithMisMatch, inverseDistances[neighborIndex]);
        operationsWithMatch = np.copy(operationsWithMisMatch);
        operationsWithMatch[-1] = operationsWithMatch[-1] + 1;
        chars = [sqs[i][neighborIndexes[-1][i]] for i in range(numberOfSequences)];
        if(all(elem == chars[0] for elem in chars)):
            inverseDistances[tupleIndex] = max(operationsWithMatch);
        else:
            inverseDistances[tupleIndex] = max(operationsWithMisMatch);
        # pdb.set_trace();

    subString = "";
    mainTupleIndex = lengths;
    while(all(ind > 0 for ind in mainTupleIndex)):
        neighborsIndexes = list(neighbors(mainTupleIndex));
        anyOperation = False;
        for tupleIndex in neighborsIndexes:
            current = inverseDistances[mainTupleIndex];
            if(current == inverseDistances[tupleIndex]):
                mainTupleIndex = tupleIndex;
                anyOperation = True;
                break;
        if(not anyOperation):
            subString += str(sqs[0][mainTupleIndex[0] - 1]);
            mainTupleIndex = neighborsIndexes[-1];
    return subString[::-1];

numberOfSequences = int(input("Enter the number of sequences: "));
sequences = [input("Enter sequence {} : ".format(i)) for i in range(1, numberOfSequences + 1)];
print(longestCommonSubSequenceOfN(sequences));

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow