Computing the longest common substring of two strings using suffix arrays

https://cs.stackexchange.com/questions/9555

16-10-2019
|

Pergunta

After I learned how to build a suffix array in $O(N)$ complexity, I am interested in discovering the applications of the suffix arrays. One of these is finding the longest common substring between two strings, in $O(N)$ time. I found on the internet the following algorithm:

merge the two strings $A$ and $B$ into one string $AB$
compute the suffix array of $AB$
compute the $LCP$(longest common prefix) array
the answer is the largest value $LCP[i]$

I tried to implement it, but as many implementation details were not said(i.e. when concatenating the strings, should I put a special character between them($AcB$)?), my code failed on many test cases. Could someone elaborate more on this algorithm?

Thanks in advance.

Note: I do not guarantee the correctness of this algorithm; I found it on a blog, and I'm not sure it is working. If you think it is incorrect, please suggest another algorithm.

Solução

Your algorithm is incorrect. I assume you know how to compute the suffix array and the LCP array of a string, that is, their efficient implementation. As has been pointed out in the comments, you should try to understand what each component is, and why it works.

First of all, is the suffix array ($SA$) of a string. A suffix array is basically all the suffixes of the string $S$ arranged in ascending lexicographic order. More specifically, the value $SA[i]$ indicates that the suffix of $S$ starting from position $SA[i]$ is ranked $i$ in the lexicographic ordering of all suffixes of $S$.

Next is the $LCP$ array. $LCP[i]$ indicates the length of the longest common prefix between the suffixes starting from $SA[i-1]$ and $SA[i]$. That is, it keeps track of the length of the longest common prefix among two consecutive suffixes of $S$ when arranged in lexicographic order.

As an example, consider the string $S = abbabca$. The suffixes in lexicographic order would be $\{a, abbabca, abca, babca, bbabca, bca, ca\}$, so $SA = [7, 1, 4, 3, 2, 5, 6]$ for a 1-indexed array. The $LCP$ array would be $LCP = [-, 1, 2, 0, 1, 1, 0]$.

Now, given two strings $A$ and $B$, we concatenate them as $S = A\#B$, where $\#$ is a character not present in both $A$ and $B$. The reason for choosing such a character is so that when computing the LCP of two suffixes, say $ab\#dabd$ and $abd$, the comparison will break off at the end of the first string (since it only occurs once, two different suffixes will never have it in the same position), and won't "overflow" into the other string.

Now, it can be seen that you should be able to see why you only need to see consecutive values in the $LCP$ array (the argument is based on contradiction and the fact that the suffixes in $SA$ are in lexicographic order). Keep checking the $LCP$ array for the maximum value such that the two suffixes being compared do not belong to the same original string. If they don't belong to the same original string (one begins in $A$ and the other in $B$), then the largest such value is the length of the largest common substring.

As an example, consider $A = abcabc$ and $B = bc$. Then, $S = abcabc\#bc$. Sorted suffixes are $\{abc\#bc, abcabc\#bc, bc, bc\#bc, bcabc\#bc, c, c\#bc, cabc\#bc\}$.
$\begin{align*} SA &= [4, 1, 8, 5, 2, 9, 6, 3, 7] \\ LCP &= [-, 3, 0, 2, 2, 0, 1, 1, 0] \end{align*}$

Now, the greatest value is $LCP[2] = 3$, but it is for $SA[1]$ and $SA[2]$, both of which start in the string $A$. So, we ignore that. On the other hand, $LCP[4] = 2$ is for $SA[3]$ (corresponds to the suffix $bc$ of $B$) and $SA[4]$ (corresponding to suffix $bcabc\#bc$ of $A$). So, this is the longest common substring between the two strings. For getting the actual substring, you take a length $2$ (value of the greatest feasible $LCP$) substring starting from either $SA[3]$ or $SA[4]$, which is $bc$.

Outras dicas

The algorithm you found online is not entirely correct. As mentioned by Paresh, it will fail in the example given by him.

However, if you ensure that while checking the LCP, you only check the LCP of substrings of different strings. For example, if you are finding the LCS of strings A and B, then you need to ensure that the adjacent entries of the Suffix Array while checking for LCP are both not from the same string.

More details here.

I think something like the algorithm you cite should indeed work if a character that is not part of the character set is used as a separator, and the suffix/prefix arrays are built to exclude all strings that contain the separator, probably the intention of the designer. this is basically equivalent to building suffix/prefix arrays for the two separate strings.

it would be helpful for future ref if you posted a link to the algorithm. note that wikipedia has the algorithm for this in pseudocode & many other algorithms. and there are implementations in most standard languages available online.

Licenciado em: CC-BY-SA com atribuição

Não afiliado a cs.stackexchange