Question

I try to produce / get an efficient implementation in C++ for the following problem:

I have to blobs (const char *data, size_t length), i call them "blob1" and "blob2". Now i like to get the longest prefix of "blob2" in "blob1". If the longest prefix is multiple times in "blob1" i like to get the one which has the biggest index.

Here an example (the blobs are here just ASCII-strings, so it's easier to read the example):

blob1 = HELLO LOOO HELOO LOOO LOO JU

blob2 = LOOO TUS

The following are all valid prefixes of blob2:

{ L, LO, LOO, LOOO, LOOO, LOOO T, LOOO TU, LOOO TUS }

The longest prefix of blob2 in blob1 is LOOO. It is there twice: HELLO *LOOO *HELOO *LOOO *LOO JU

So i like to get the index of the second one, which would be 6, and the length of the prefix which would be 4.

Unfortunately blob1 and blob2 change many times, so it is probably slow to create a tree or some other complex structure.

Do you know a good algorithm to solve this problem?

Thank you.

Cheers Kevin

Was it helpful?

Solution

I dont know if this the best algorithm to solve this (I'm sure, this is not), but, I guess this is a good one. The idea is simple, start by searching for the lowest token from blob2 in blob1. When you find a match, try to see if you can match biggers tokens at this position. If this is true, update your token length.

Continue your search, from your last stop, but, at this time, searching for token with the updated token length from blob2. When you find a match, try to see if you can match biggers tokens at this position. If this is true, update your token length. Repeat the previous procedure until the end of your buffer.

Bellow is a simple fluxogram trying to explain this algorithm and, in sequence, a simple complete program, showing an implementation.

enter image description here

#include <algorithm>
#include <vector>
#include <iostream>

/////////////////////0123456789012345678901234567
const char str1[] = "HELLO LOOO HELOO LOOO LOO JU";
const char str2[] = "LOOO TUS";

int main()
{
  std::vector<char> blob1(strlen(str1));
  std::vector<char> blob2(strlen(str2));
  blob1.reserve(30);
  blob2.reserve(30);

  std::copy(str1, str1+strlen(str1), blob1.begin());
  std::copy(str2, str2+strlen(str2), blob2.begin());

  auto next = blob1.begin();
  auto tokenLength = 1;
  auto position = -1;

  while ( std::next(next, tokenLength) < blob1.end() ) {
    auto current = std::search(next, 
                               blob1.end(), 
                               blob2.begin(), 
                               std::next(blob2.begin(), tokenLength));

    if (current == blob1.end() )
      break;

    position = std::distance(blob1.begin(), current);
    next = std::next(current, 1);

    for (auto i = tokenLength; std::next(blob2.begin(), i) < blob2.end(); ++i) {
      auto x = std::search(std::next(current, i), 
                           std::next(current, i + 1), 
                           std::next(blob2.begin(), i), 
                           std::next(blob2.begin(), i + 1));
      if ( x != std::next(current, i) ) 
            break;

      ++tokenLength;
    }
  }

  std::cout << "Index: " << position << ", length: " << tokenLength << std::endl;

}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top