Question

I have come up with the term loop rolling myself with the hope that it does not overlap with an existing term. Basically I'm trying to come up with an algorithm to find loops in a printed text.

Some examples from simple to complicated

Example1

Given:

a a a a a b c d

I want to say:

5x(a) b c d

or algorithmically:

for 1 .. 5
    print a
end
print b
print c
print d

Example2

Given:

a b a b a b a b c d

I want to say:

4x(a b) c d

or algorithmically:

for 1 .. 4
    print a
    print b
end
print c
print d

Example3

Given:

a b c d b c d b c d b c e

I want to say:

a 3x(b c d) b c e

or algorithmically:

print a
for 1 .. 3
    print b
    print c
    print d
end
print b
print c
print d

It didn't remind me of any algorithm that I know of. I feel like some of the problems can be ambiguous but finding one of the solutions is enough to me for now. Efficiency is always welcome but not mandatory. How can I do this?

EDIT

First of all, thanks for all the discussion. I have adapted an LZW algorithm from rosetta and ran it on my input:

abcdbcdbcdbcdef

which gave me:

a
b
c
d
8  => bc
10 => db
9  => cd
11 => bcd
e
f

where I have a dictionary of:

a a
c c
b b
e e
d d
f f
8 bc
9 cd
10 db
11 bcd
12 dbc
13 cdb
14 bcde
15 ef
7 ab

It looks good for compression but it's not quite what I wanted. What I need is more like compression in the algorithmic representation from my examples which would have:

  • subsequent sequences (if a sequence is repeating, there would be no other sequence in between)
  • no dictionary but only loops
  • irreducable
  • with maximum sequence sizes (which would minimize the algorithmic representation)
  • and let's say nested loops are allowed (contrary to what I said before in the comment)
Was it helpful?

Solution

I start with an algorithm, which gives maximum sequence sizes. Though it would not always minimize the algorithmic representation, it may be used as an approximation algorithm. Or it may be extended to optimal algorithm.

  1. Start with constructing Suffix array for your text along with LCP array.
  2. Sort an array of indexes of LCP array, indexes of larger elements of LCP array come first. This groups together repeating sequences of the same length and allows to process sequences in greedy manner, starting from maximum sequence sizes.
  3. Extract suffix array entries, grouped by LCP value (by group I mean all the entries with selected LCP value as well as all entries with larger LCP values), and sort them by position in the text.
  4. Filter out entries with positional difference not equal to LCP. For remaining entries, get prefixes of length, equal to LCP. This gives all possible sequences in the text.
  5. Add sequences, sorted by starting position, to ordered collection (for example, binary search tree). Sequences are added in order of appearance in sorted LCP, so longer sequences are added first. Sequences are added only if they are independent or if one of them is completely nested inside the other one. Intersecting intervals are ignored. For example, in caba caba bab sequence ab intersects with caba and so it is ignored. But in cababa cababa babab one instance of ab is dropped, 2 instances are completely inside larger sequence, and 2 instances are completely outside of it.
  6. At the end, this ordered collection contains all the information, needed to produce the algorithmic representation.

Example:

Text          ababcabab

Suffix array  ab abab ababcabab abcabab b bab babcabab bcabab cabab
LCP array       2    4         2       0 1   3        1      0

Sorted LCP            4 3 2 2 1 1 0 0
Positional difference 5 5 2 2 2 2 - -
Filtered LCP          - - 2 2 - - - -
Filtered prefixes   (ab ab) (ab ab)

Sketch of an algorithm, producing the minimal algorithmic representation.

Start with the first 4 steps of previous algorithm. Fifth step should be modified. Now it is not possible to ignore intersecting intervals, so every sequence is added to the collection. Since the collection now contains intersecting intervals, it is better to implement it as some advanced data structure, for example, Interval tree.

Then recursively determine the length of algorithmic representation for all sequences, that contain any nested sequences, starting from the smallest ones. When every sequence is evaluated, compute optimal algorithmic representation for whole text. Algorithm for processing either a sequence or whole text uses dynamic programming: allocate a matrix with number of columns, equal to text/sequence length and number of rows, equal to the length of algorithmic representation; doing in-order traversal of interval tree, update this matrix with all sequences, possible for each text position; when more than one value for some cell is possible, either choose any of them, or give preference to longer or shorter sub-sequences.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top