سؤال

I'm trying to implement the algorithm given by Evgeny Kluev in his answer to loop rolling algorithm but I'm having trouble getting it to work. Below is an example I tried to work out by hand following his instructions:

text: ababacababd


<STEP 1>                                    
  suffixes and LCPs:                               

  ababacababd                                            
  4 
  ababd 
  3 
  abacababd 
  2  
  abd
  1
  acababd
  0
  babacababd
  3
  babd
  2
  bacababd
  1
  bd
  0
  cababd
  0
  d

<STEP 2>
  sorted LCP array indices: 0,1,5,2,6,3,7,4,8,9
                 (values) : 4,3,3,2,2,1,1,1,0,0

<STEP 3>
LCP groups sorted by position in text (format => position: entry):
  lcp 4:
    0: ababacababd
    6: ababd

  lcp 3:
    1: babacababd
    2: abacababd
    6: ababd
    7: babd

  lcp 2:
    2: abacababd
    3: bacababd
    7: babd
    8: abd

  lcp 1:
    3: bacababd
    4: acababd
    8: abd
    9: bd

  lcp 0:
    0: ababacababd
    1: babacababd
    4: acababd
    5: cababd
    9: bd
   10: d

<STEP 4>
entries remaining after filter (LCP == positional difference):
   none! only (abd),(bd) and (bacababd),(acababd) from LCP 1 group
   have positional difference equal to 1 but they don't share prefixes
   with each other. shouldn't i have at least (ab) and (ba) here?

Can anybody tell me what I'm doing wrong in this process?

Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?

Is this a known algorithm with a name that I can find more details on elsewhere?

(I'm also confused about his definition of intersecting sequences in step 5, but maybe it would make sense if I understood the preceding steps correctly).

EDIT: Here is what I have for step 4,5,6 after Evgeny's helpful clarification:

<STEP 4>
filter pseudocode:
results = {}
for (positions,lcp) in lcp_groups:
  results[lcp] = []
  while positions not empty:
    pos = positions.pop(0) #pop lowest element
    if (pos+lcp) in positions:
      common = prefix(input, pos, lcp)
      if common.size() < lcp:
        next
      i = 1
      while (pos+lcp*(i+1)) in positions:
        if common != prefix(input, pos+lcp*i, lcp):
          break
        positions.delete(pos+lcp*i)
        i += 1

      results[lcp].append( (common, pos, i+1) )

application of filter logic:
  lcp 4:
    0: ababacababd # 4 not in {6}
    6: ababd       # 10 not in {}

  lcp 3:
    0: ababacababd # 3 not in {1,2,6,7}
    1: babacababd  # 4 not in {2,6,7}
    2: abacababd   # 5 not in {6,7}
    6: ababd       # 9 not in {7}
    7: babd        # 10 not in {}

  lcp 2:
    0: ababacababd # 2 in {1,2,3,6,7,8}, 4 not in {1,2,3,6,7,8} => ("ab", 0, 2)
    1: babacababd  # 3 in {2,3,6,7,8}, 5 not in {2,3,6,7,8} => ("ba", 1, 2)
    2: abacababd   # 4 not in {3,6,7,8}
    3: bacababd    # 5 not in {6,7,8}
    6: ababd       # 8 in {7,8}, 10 not in {7,8} => ("ab", 6, 2)
    7: babd        # 9 not in {8}
    8: abd         # 10 not in {}

  lcp 1:
    0: ababacababd # 1 in {1,2,3,4,6,7,8,9}, prefix is ""
    1: babacababd  # 2 in {2,3,4,6,7,8,9}, prefix is ""
    2: abacababd   # 3 in {3,4,6,7,8,9}, prefix is ""
    3: bacababd    # 4 in {4,6,7,8,9}, prefix is ""
    4: acababd     # 5 not in {6,7,8,9}
    6: ababd       # 7 in {7,8,9}, prefix is ""
    7: babd        # 8 in {8,9}, prefix is ""
    8: abd         # 9 in {9}, prefix is ""
    9: bd          # 10 not in {}

sequences: [("ab", 0, 2), ("ba", 1, 2), ("ab", 6, 2)]

<STEP 5>
add sequences in order of LCP grouping. sequences within an LCP group
are added according to which was generated first:
  lcp 4: no sequences
  lcp 3: no sequences
  lcp 2: add ("ab", 0, 2)
  lcp 2: dont add ("ba", 1, 2) because it intersects with ("ab", 0, 2)
  lcp 2: add ("ab", 6, 2)
  lcp 1: no sequences

collection = [("ab", 0, 2), ("ab", 6, 2)]
(order by position not by which one was added first)

<STEP 6>
recreate input by iterating through the collection in order and 
filling in gaps with the normal input:
  input = "ab"*2 + input[4..5] + "ab"*2 + input[10..10]

Evgeny, one quick question for you if you happen to look at this again: Am I doing step 5 correctly? That is, do I add sequences according to which LCP group they were generated from (with higher LCP valued groups coming first)? Or is it something else related to LCP?

Also, if there is anything wrong with step 4 or 6 please let me know, but it seems what I have works very well for this example.

هل كانت مفيدة؟

المحلول

I have to clarify what is meant by "grouping by LCP value" in the original answer. In fact, to the group with selected LCP value we should include all entries with larger LCP values.

This means that for your example, while processing LCP3, we need to merge preceding entries 0 and 6 to this group. And while processing LCP2, we need to merge all preceding entries with LCP3 and LCP4: 0, 1, 2, 6, 7.

As a result two (ab) pairs as well as one (ba) pair are remaining after filter. But since (ba) "intersects" with the first (ab) pair, it is rejected on step 5.

Also, he says at the end of step 4 we should have all possible sequences in the text, does he mean all possible repeating sequences?

That's right, I mean all possible repeating sequences.

Is this a known algorithm with a name that I can find more details on elsewhere?

I don't know. Never seen such algorithm before.


Here is how steps 2 .. 4 may be implemented (in pseudo-code):

for (in_sa, in_src) in suffix_array: # step 2
  lcp_groups[max(LCP[in_sa.left], LCP[in_sa.right])].append((in_sa, in_src))
apply(sort_by_position_in_src, lcp_groups) # step 3
for lcp from largest downto 1: # step 4
  # sorted by position in src array and unique:
  positions = merge_and_uniq(positions, lcp_groups[lcp])
  for start in positions:
    pos = start
    while (next = positions[pos.in_src + lcp]).exists
           and LCP.RMQ(pos.in_sa, next.in_sa) >= lcp
           and not(prev = positions[pos.in_src - lcp]).exists  # to process each
                   and LCP.RMQ(pos.in_sa, prev.in_sa) >= lcp): # chain only once
      pos = next
    if pos != start:
      pass_to_step5(start, lcp, pos + lcp)

Here I don't plan any particular data structure for positions. But for convenience an ordered associative array is assumed. RMQ is Range Minimum Query, so LCP array should be preprocessed accordingly.

This code is practically the same as the code in OP. But instead of expensive string comparisons (like common != prefix(input, pos+lcp*i, lcp)) it uses RMQ, which (if properly implemented) works almost instantly (and has the same effect as the string comparison because it allows to reject a sub-string when it has too few starting characters in common with preceding sub-string).

It has quadratic worst-case time complexity. Should be slow for input arrays like "aaaaaaaaa". And it's not easy to find its time complexity for "better" strings, probably it is sub-quadratic in "average" case. The same problem could be solved with much simpler quadratic-time algorithm:

def loop_rolling(begin, end):
  distance = (end - begin) / 2)
  for d from distance downto 1:
    start = pos = begin
    while pos + d < end:
      while (pos + d < end) and (src[pos] == src[pos + d]):
        ++pos
      repeats = floor((pos - start) / d)
      if repeats > 0:
        pass_to_step5(start, d, start + d * (repeats + 1))
      start = pos

Or it may be made even simpler by removing steps 5 and 6. But the variant below has a disadvantage. It is much too greedy, so instead of 5*(ab) it will find 2*(2*(ab))ab:

def loop_rolling(begin, end, distance):
  distance = min(distance, (end - begin) / 2))
  for d from distance downto 1:
    start = pos = begin
    while pos + d < end:
      while (pos + d < end) and (src[pos] == src[pos + d]):
        ++pos
      repeats = floor((pos - start) / d)
      if repeats > 0:
        loop_rolling(begin, start, d - 1)
        print repeats+1, "*("
        loop_rolling(start, start + d, d - 1) # "nested loops"
        print ')'
        loop_rolling(start + d * (repeats + 1), end, d)
        return
      else:
        if d == 1: print src[start .. pos]
        start = pos
مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top