I have to clarify what is meant by "grouping by LCP value" in the original answer. In fact, the group for a selected LCP value should also include all entries with larger LCP values.
This means that for your example, while processing LCP3, we need to merge the preceding entries 0 and 6 into this group. And while processing LCP2, we need to merge all preceding entries with LCP3 and LCP4: 0, 1, 2, 6, 7.
As a result, two (ab) pairs as well as one (ba) pair remain after the filter. But since (ba) "intersects" with the first (ab) pair, it is rejected at step 5.
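To make the merging concrete, here is a minimal Python sketch of that accumulation. The group contents `{4: [0, 6], 3: [1, 2, 7]}` are an assumption consistent with the merges described above, not taken from the actual suffix array:

```python
def groups_with_larger(lcp_groups):
    """For each LCP value (largest first), return that group's entries
    merged with all entries from larger LCP values."""
    merged = {}
    positions = set()
    for lcp in sorted(lcp_groups, reverse=True):
        positions |= set(lcp_groups[lcp])
        merged[lcp] = sorted(positions)
    return merged

# Entry numbers as in the example: LCP4 holds 0 and 6, LCP3 holds 1, 2, 7.
merged = groups_with_larger({4: [0, 6], 3: [1, 2, 7]})
# merged[3] == [0, 1, 2, 6, 7]: entries 0 and 6 were merged into the LCP3 group.
```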
Also, he says that at the end of step 4 we should have all possible sequences in the text. Does he mean all possible repeating sequences?
That's right, I mean all possible repeating sequences.
Is this a known algorithm with a name that I can find more details on elsewhere?
I don't know. I've never seen such an algorithm before.
Here is how steps 2 .. 4 may be implemented (in pseudo-code):
for (in_sa, in_src) in suffix_array:                           # step 2
    lcp_groups[max(LCP[in_sa.left], LCP[in_sa.right])].append((in_sa, in_src))
apply(sort_by_position_in_src, lcp_groups)                     # step 3

for lcp from largest downto 1:                                 # step 4
    # sorted by position in src array and unique:
    positions = merge_and_uniq(positions, lcp_groups[lcp])
    for start in positions:
        # to process each chain only once, skip entries that continue a chain:
        if (prev = positions[start.in_src - lcp]).exists
                and LCP.RMQ(start.in_sa, prev.in_sa) >= lcp:
            continue
        pos = start
        while (next = positions[pos.in_src + lcp]).exists
                and LCP.RMQ(pos.in_sa, next.in_sa) >= lcp:
            pos = next
        if pos != start:
            pass_to_step5(start, lcp, pos + lcp)
Here I don't assume any particular data structure for positions; for convenience, an ordered associative array may be used. RMQ is Range Minimum Query, so the LCP array should be preprocessed accordingly.
This code is practically the same as the code in the OP. But instead of expensive string comparisons (like common != prefix(input, pos+lcp*i, lcp)) it uses RMQ, which (if properly implemented) works almost instantly. It has the same effect as the string comparison because it allows us to reject a sub-string when it has too few starting characters in common with the preceding sub-string.
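For the RMQ preprocessing, a sparse table is a common choice: O(n log n) build time and O(1) per query. Here is a generic sketch in Python, not tied to any particular suffix-array implementation:

```python
class RMQ:
    """Sparse table for range-minimum queries: O(n log n) build, O(1) query."""

    def __init__(self, a):
        n = len(a)
        # table[j][i] holds the minimum of a[i .. i + 2**j - 1]
        self.table = [list(a)]
        k = 1
        while 2 * k <= n:
            prev = self.table[-1]
            self.table.append([min(prev[i], prev[i + k])
                               for i in range(n - 2 * k + 1)])
            k *= 2

    def query(self, lo, hi):
        """Minimum of a[lo .. hi], both bounds inclusive."""
        j = (hi - lo + 1).bit_length() - 1  # floor(log2(length))
        row = self.table[j]
        # two (possibly overlapping) blocks of length 2**j cover [lo, hi]
        return min(row[lo], row[hi - (1 << j) + 1])
```

The overlap trick in `query` is what makes minimum (unlike sum) answerable in constant time.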
It has quadratic worst-case time complexity, so it should be slow for inputs like "aaaaaaaaa". And it is not easy to determine its time complexity for "better" strings; probably it is sub-quadratic in the "average" case. The same problem could be solved with a much simpler quadratic-time algorithm:
def loop_rolling(begin, end):
    distance = (end - begin) / 2
    for d from distance downto 1:
        start = pos = begin
        while pos + d < end:
            while (pos + d < end) and (src[pos] == src[pos + d]):
                ++pos
            repeats = floor((pos - start) / d)
            if repeats > 0:
                pass_to_step5(start, d, start + d * (repeats + 1))
            ++pos        # skip the mismatch position so the scan advances
            start = pos
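Translated into runnable Python (same structure; pass_to_step5 is replaced here by collecting (start, period, end) triples, with end exclusive):

```python
def find_repeats(src):
    """For each candidate period d (largest first), scan src and report
    maximal runs where src[pos] == src[pos + d]."""
    results = []  # (start, period, end) triples, end exclusive
    n = len(src)
    for d in range(n // 2, 0, -1):
        start = pos = 0
        while pos + d < n:
            # extend the run of positions matching d characters ahead
            while pos + d < n and src[pos] == src[pos + d]:
                pos += 1
            repeats = (pos - start) // d
            if repeats > 0:
                results.append((start, d, start + d * (repeats + 1)))
            pos += 1      # skip the mismatch position so the scan advances
            start = pos
    return results

# find_repeats("ababab") -> [(0, 2, 6)]: "ab" repeated three times
```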
Or it may be made even simpler by removing steps 5 and 6. But the variant below has a disadvantage: it is much too greedy, so instead of 5*(ab) it will find 2*(2*(ab))ab:
def loop_rolling(begin, end, distance):
    distance = min(distance, (end - begin) / 2)
    for d from distance downto 1:
        start = pos = begin
        while pos + d < end:
            while (pos + d < end) and (src[pos] == src[pos + d]):
                ++pos
            repeats = floor((pos - start) / d)
            if repeats > 0:
                loop_rolling(begin, start, d - 1)
                print repeats + 1, "*("
                loop_rolling(start, start + d, d - 1)    # "nested loops"
                print ')'
                loop_rolling(start + d * (repeats + 1), end, d)
                return
            else:
                if d == 1: print src[start .. pos]       # literal characters
                start = ++pos    # skip the mismatch so the scan advances
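For reference, a runnable Python version of this greedy variant. It returns the factoring as a list of tokens instead of printing, and the literal-tail handling is slightly simplified; the names are my own:

```python
def factor(src, begin, end, distance):
    """Greedily factor src[begin:end] into repeats with period <= distance,
    e.g. producing the tokens of "2*(2*(ab))ab" for "ababababab"."""
    out = []
    distance = min(distance, (end - begin) // 2)
    for d in range(distance, 0, -1):
        start = pos = begin
        while pos + d < end:
            while pos + d < end and src[pos] == src[pos + d]:
                pos += 1
            repeats = (pos - start) // d
            if repeats > 0:
                out += factor(src, begin, start, d - 1)      # prefix
                out.append(str(repeats + 1) + "*(")
                out += factor(src, start, start + d, d - 1)  # "nested loops"
                out.append(")")
                out += factor(src, start + d * (repeats + 1), end, d)
                return out
            pos += 1      # skip the mismatch position so the scan advances
            start = pos
    if begin < end:
        out.append(src[begin:end])  # no repeat found: emit literally
    return out

# "".join(factor("ababababab", 0, 10, 10)) == "2*(2*(ab))ab"
```

Running it on "ababababab" reproduces the too-greedy result mentioned above: the period-4 repeat is found before the period-2 one, yielding 2*(2*(ab))ab rather than 5*(ab).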