Question

Problem:

I have 3 strings s1, s2, s3. Each contain garbage text on either side, with a defining pattern in its centre: text1+number1. number1 increases by 2 in each string. I want to extract text1+number1.

I have already written code to find number1

How would I extend an LCS function to get text1?

#include <iostream>

const std::string longestCommonSubstring(int, std::string const& s1, std::string const& s2, std::string const& s3);

int main(void) {
    std::string s1="hello 5", s2="bolo 7", s3="lo 9sdf";
    std::cout << "Trying to get \"lo 5\", actual result: \"" << longestCommonSubstring(5, s1, s2, s3) << '\"';
}

const std::string longestCommonSubstring(int must_include, std::string const& s1, std::string const& s2, std::string const& s3) {
    std::string longest;

    for(size_t start=0, length=1; start + length <= s1.size();) {
        std::string tmp = s1.substr(start, length);
        if (std::string::npos != s2.find(tmp) && std::string::npos != s3.find(tmp)) {
            tmp.swap(longest);
            ++length;
        } else ++start;
    }

    return longest;
}

Example:

From "hello 5", "bolo 7", "lo 9sdf" I would like to get "lo 5"

Code:

I have been able to write a simple LCS function(test-case) but I am having trouble writing this modified one.

Was it helpful?

Solution 4

Wrote my own solution:

#include <iostream>
#include <string>
#include <sstream>
#include <vector>

typedef std::pair<std::pair<std::string, std::string>, std::pair<std::pair<std::string, std::string>, std::pair<std::string, std::string>>> pairStringTrio;
typedef std::pair<std::string,std::pair<std::string,std::string>> stringPairString;

stringPairString longestCommonSubstring(const pairStringTrio&);
std::string strFindReplace(const std::string&, const std::string&, const std::string&);

int main(void) {
        std::string s1= "6 HUMAN ACTIONb", s2="8 HUMAN ACTIONd", s3="10 HUMAN ACTIONf";
        pairStringTrio result = std::make_pair(std::make_pair(s1, "6"), std::make_pair(std::make_pair(s2, "8"), std::make_pair(s3, "10")));

        stringPairString answer = longestCommonSubstring(result);
        std::cout << '\"' << answer.first << "\"\t\"" << answer.second.first << "\"\t\"" << answer.second.second << '\"';
}


stringPairString longestCommonSubstring(const pairStringTrio &foo) {
        std::string longest;

        for(size_t start=0, length=foo.first.first.size()-1; start + length <= foo.first.first.size();) {
                std::string s1_tmp = foo.first.first.substr(start, length);
                std::string s2_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.first.second);
                std::string s3_tmp = strFindReplace(s1_tmp, foo.first.second, foo.second.second.second);

                if (std::string::npos != foo.second.first.first.find(s2_tmp) && std::string::npos != foo.second.second.first.find(s3_tmp)) {
                        s1_tmp.swap(longest);
                        ++length;
                } else ++start;
        }

        return std::make_pair(longest, std::make_pair(strFindReplace(longest, foo.first.second, foo.second.first.second), strFindReplace(longest, foo.first.second, foo.second.second.second)));
}

std::string strFindReplace(const std::string &original, const std::string& src, const std::string& dest) {
        std::string answer=original;
        for(std::size_t pos = 0; (pos = answer.find(src, pos)) != answer.npos;)
                answer.replace(pos, src.size(), dest);
        return answer;
}

OTHER TIPS

Let's say you're looking for a pattern *n, *n+2, *n+4, etc. And you have the following strings: s1="hello 1,bye 2,ciao 1", s2="hello 3,bye 4,ciao 2" and s3="hello 5,bye 6,ciao 5". Then the following will do:

//find all pattern sequences
N1 = findAllPatterns(s1, number);
 for i = 2 to n:
  for item in Ni-1:
   for match in findAllPatterns(si, nextPattern(item))
    Ni.add([item, (match, indexOf(match))]);

//for all pattern sequences identify the max common substring
maxCommonLength = 0; 
for sequence in Nn:
 temp = findLCS(sequence);
 if(length(temp[0]) > maxCommonLength):
  maxCommonLength = length(temp[0]);
  result = temp;

return result;

` The first part of the algorithm will identify the sequences: [(1, 6), (3, 6), (5, 6)], [(1, 19), (3, 6), (5, 6)], [(2, 12), (4, 12), (6, 12)]

The second part will identify: ["hello 1", "hello 3", "hello 5"] as the longest substrings matching the pattern.

The algorithm can be further optimized by combining the two parts and discarding early sequences that match the pattern but are suboptimal, but I preferred to present it in two parts for better clarity.

-- Edit fixed code block

If you know number1 already, and you know these numbers all appear just once in their corresponding strings, then the following should work:

I'll call your strings s[0], s[1], etc. Set longest = INT_MAX. For each string s[i] (i >= 0) just:

  • Find where number1 + 2 * i occurs in s[i]. Suppose it occurs at position j.
  • If (i == 0) j0 = j; else
    • for (k = 1; k <= j && k <= longest && s[i][j - k] == s[0][j0 - k]; ++k) {}
    • longest = k;

At the end, longest will be the length of the longest substring common to all the strings.

Basically we're just scanning backwards from the point where we find the number, looking for a mismatch with the corresponding character in your s1 (my s[0]), and keeping track of what the longest matching substring is so far in longest -- this can only stay the same or decrease with each new string we look at.

Rather than try to modify the internals of the LCS algorithm, you could take its output and find it in s1. From there, your number will be located at an offset of the length of the output plus 1.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top