How does LCP help in finding the number of occurrences of a pattern?

https://stackoverflow.com/questions/11373453

19-06-2021

Question

I have read that the Longest Common Prefix (LCP) could be used to find the number of occurrences of a pattern in a string.

Specifically, you just need to create the suffix array of the text, sort it, and then instead of doing binary search to find the range so that you can figure out the number of occurrences, you simply compute the LCP for each successive entry in the suffix array.

Although using binary search to find the number of occurrences of a pattern is obvious, I can't figure out how the LCP helps find the number of occurrences here.

For example for this suffix array for banana:

LCP  Suffix entry
N/A  a  
1    ana  
3    anana  
0    banana  
0    na  
2    nana
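
For reference, the table above can be reproduced with a short sketch. It uses a naive construction (sorting the suffixes directly), which is fine for a six-letter example but not how a suffix array would be built for a large text; the function names `suffix_array` and `lcp_array` are just illustrative.

```python
def suffix_array(text):
    # Naive construction: sort the suffix start positions by the suffix text.
    return sorted(range(len(text)), key=lambda i: text[i:])


def lcp_array(text, sa):
    # lcp[i] is the length of the common prefix of the suffixes at
    # sa[i-1] and sa[i]; lcp[0] has no predecessor and is left as None.
    def common(a, b):
        n = 0
        while n < len(a) and n < len(b) and a[n] == b[n]:
            n += 1
        return n

    return [None] + [
        common(text[sa[i - 1]:], text[sa[i]:]) for i in range(1, len(sa))
    ]


text = "banana"
sa = suffix_array(text)
lcp = lcp_array(text, sa)
for start, value in zip(sa, lcp):
    print("N/A" if value is None else value, text[start:])
```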

How the LCP helps find the number of occurrences of a substring like "banana" or "na" is not obvious to me.

Any help figuring out how LCP helps here?

Solution

I do not know any way of using the LCP array instead of carrying out a binary search, but I believe what you refer to is the technique described by Udi Manber and Gene Myers in Suffix arrays: a new method for on-line string searches.

(Note: The below explanation has been copied into a Wikipedia article on 9th April 2014, see diff. If you look at the revision history here and on Wikipedia, you'll see that the one here was written first. Please don't insert comments like "taken from Wikipedia" into my answer.)

The idea is this: In order to find the number of occurrences of a given string P (length m) in a text T (length N),

- You use binary search against the suffix array of T (just like you suggested).
- But you speed it up using the LCP array as an auxiliary data structure. More specifically, you generate a special version of the LCP array (I will call it LCP-LR below) and use that.

The issue with using standard binary search (without the LCP information) is that in each of the O(log N) comparisons you need to make, you compare P to the current entry of the suffix array, which means a full string comparison of up to m characters. So the complexity is O(m*log N).
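
As a baseline, here is a minimal sketch of that plain O(m*log N) search, counting occurrences with two binary searches and no LCP information. The name `count_occurrences` is illustrative, and the suffix array is built naively inside the function purely to keep the example self-contained.

```python
def count_occurrences(text, pattern):
    # Naive suffix array, for illustration only.
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    m = len(pattern)

    def first_index(strictly_greater):
        # Binary search for the first suffix whose m-character prefix is
        # >= pattern (or > pattern when strictly_greater is True).
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            prefix = text[sa[mid]:sa[mid] + m]  # up to m characters per comparison
            if prefix < pattern or (strictly_greater and prefix == pattern):
                lo = mid + 1
            else:
                hi = mid
        return lo

    # The matches form one contiguous block of the suffix array.
    return first_index(True) - first_index(False)


print(count_occurrences("banana", "na"))
```

Each of the O(log N) iterations slices and compares up to m characters, which is exactly the cost the LCP-LR technique is meant to avoid.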

The LCP-LR array helps improve this to O(m+log N), in the following way:

At any point during the binary search algorithm, you consider, as usual, a range (L,...,R) of the suffix array and its central point M, and decide whether you continue your search in the left sub-range (L,...,M) or in the right sub-range (M,...,R).
In order to make the decision, you compare P to the suffix at M. If P is identical to that suffix, you are done, but if not, you will have matched the first k characters of P and then decided whether P is lexicographically smaller or larger than M. Let's assume the outcome is that P is larger than M.
So, in the next step, you consider (M,...,R) and a new central point M' in the middle:
```
              M ...... M' ...... R
              |
       we know:
          lcp(P,M)==k
```
The trick now is that LCP-LR is precomputed such that an O(1) lookup tells you the longest common prefix of M and M', lcp(M,M').

You know already (from the previous step) that M itself has a prefix of k characters in common with P: lcp(P,M)=k. Now there are three possibilities:
- Case 1: k < lcp(M,M'), i.e. P has fewer prefix characters in common with M than M has in common with M'. This means the (k+1)-th character of M' is the same as that of M, and since P is lexicographically larger than M, it must be lexicographically larger than M', too. So we continue in the right half (M',...,R).
- Case 2: k > lcp(M,M'), i.e. P has more prefix characters in common with M than M has in common with M'. Consequently, if we were to compare P to M', the common prefix would be smaller than k, and M' would be lexicographically larger than P, so, without actually making the comparison, we continue in the left half (M,...,M').
- Case 3: k == lcp(M,M'). So M and M' are both identical with P in the first k characters. To decide whether we continue in the left or right half, it suffices to compare P to M' starting from the (k+1)-th character.
We continue recursively.
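
The three cases can be sketched as a single decision step. This is only an illustration under stated assumptions: `next_step` is a hypothetical helper, the value lcp(M,M') is passed in as `lcp_m_m2` (standing in for the O(1) LCP-LR lookup), and "found" is taken to mean that P is a prefix of the suffix at M'.

```python
def lcp(a, b, start=0):
    """Length of the common prefix of a and b, assuming they agree before `start`."""
    n = start
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def next_step(pattern, k, lcp_m_m2, suffix_m2):
    """One binary-search step.

    Assumes pattern > M, lcp(pattern, M) == k, M' lies to the right of M,
    and lcp_m_m2 == lcp(M, M').
    Returns (direction, k_new), where direction is "right", "left" or "found"
    and k_new is the known lcp of pattern with the new left boundary.
    """
    if k < lcp_m_m2:
        # Case 1: M' agrees with M past position k, so pattern > M' as well.
        return "right", k
    if k > lcp_m_m2:
        # Case 2: M' diverges from M (and from pattern) before position k,
        # so M' > pattern; the left boundary stays M.
        return "left", k
    # Case 3: compare pattern with M' starting at the (k+1)-th character.
    matched = lcp(pattern, suffix_m2, start=k)
    if matched == len(pattern):
        return "found", matched
    if matched == len(suffix_m2) or pattern[matched] > suffix_m2[matched]:
        return "right", matched
    return "left", k


# Searching for "nan" in the suffix array of "banana":
print(next_step("nan", 0, 0, "na"))    # M = "anana", M' = "na"
print(next_step("nan", 2, 2, "nana"))  # M = "na",    M' = "nana"
```

Only Case 3 touches the characters of P, and it resumes at position k rather than at the start, which is where the saving comes from.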

The overall effect is that no character of P that has already been matched is compared to the text again. The number of successful character comparisons is therefore bounded by m, and each of the O(log N) steps adds at most one mismatch, so the total complexity is indeed O(m+log N).

Obviously, the key remaining question is how did we precompute LCP-LR so it is able to tell us in O(1) time the lcp between any two entries of the suffix array? As you said, the standard LCP array tells you the lcp of consecutive entries only, i.e. lcp(x-1,x) for any x. But M and M' in the description above are not necessarily consecutive entries, so how is that done?

The key to this is to realize that only certain ranges (L,...,R) will ever occur during the binary search: It always starts with (0,...,N) and divides that at the center, and then continues either left or right and divides that half again, and so forth. If you think about it: Every entry of the suffix array occurs as the central point of exactly one possible range during binary search. So there are exactly N distinct ranges (L...M...R) that can possibly play a role during binary search, and it suffices to precompute lcp(L,M) and lcp(M,R) for those N possible ranges. So that is 2*N distinct precomputed values, hence LCP-LR is O(N) in size.

Moreover, there is a straightforward recursive algorithm to compute the 2*N values of LCP-LR in O(N) time from the standard LCP array; I'd suggest posting a separate question if you need a detailed description of that.
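
One possible way to do this (a sketch, not necessarily the construction from the paper) relies on the fact that the lcp of two suffix array entries is the minimum of the standard LCP values between them, so lcp(L,R) = min(lcp(L,M), lcp(M,R)). The function name `build_lcp_lr`, the two output lists, and the use of (0, n-1) as the initial range are assumptions made for this example.

```python
def build_lcp_lr(lcp):
    """Build LCP-LR from the standard LCP array.

    lcp[i] is the lcp of suffix array entries i-1 and i (lcp[0] is unused).
    Returns (left, right) where, for every central point M of a range (L, R)
    that binary search can visit, left[M] == lcp(L, M) and right[M] == lcp(M, R).
    Entries that are never a central point stay None.
    """
    n = len(lcp)
    left = [None] * n
    right = [None] * n

    def fill(lo, hi):
        # Returns lcp(entry lo, entry hi) for lo < hi.
        if hi - lo == 1:
            return lcp[hi]  # adjacent entries: read the standard LCP array
        mid = (lo + hi) // 2
        left[mid] = fill(lo, mid)
        right[mid] = fill(mid, hi)
        return min(left[mid], right[mid])

    if n > 1:
        fill(0, n - 1)
    return left, right


# Standard LCP array of "banana" (suffixes: a, ana, anana, banana, na, nana).
left, right = build_lcp_lr([None, 1, 3, 0, 0, 2])
print(left)
print(right)
```

Every range is visited once and does constant work beyond its two recursive calls, so the whole table is filled in O(N) time; the recursion depth is only O(log N).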

To sum up:

- It is possible to compute LCP-LR in O(N) time and O(2*N)=O(N) space from LCP.
- Using LCP-LR during binary search helps accelerate the search procedure from O(m*log N) to O(m+log N).
- As you suggested, you can use two binary searches to determine the left and right end of the match range for P, and the length of the match range corresponds to the number of occurrences of P.

OTHER TIPS

In a suffix tree, the Longest Common Prefix (LCP) of two suffixes is the string spelled out by the path from the root to the Lowest Common Ancestor (LCA) of their two leaves. Once you have located the node at (or just below) the point where the pattern ends, you can count the leaves in the subtree under it. This gives you the number of occurrences of the pattern in the text. This is the relationship between the LCP and the LCA.

Licensed under: CC-BY-SA with attribution
