Discovering Consecutive Repetitive Patterns in a String

Question 1

This is what I used for a similar problem:

<?php

$input="AACABABCABCABCP";

//Prepare index array (A..Z) - adapt to your character range
$idx=array();
for ($i="A"; strlen($i)==1; $i++) $idx[$i]=array();

//Prepare hits array
$hits=array();

//Loop
$len=strlen($input);
for ($i=0;$i<$len;$i++) {

    //Current character
    $current=$input[$i];

    //Cycle past occurrences of character
    foreach ($idx[$current] as $offset) {

        //Check if substring from past occurrence to now matches oncoming
        $matchlen=$i-$offset;
        $match=substr($input,$offset,$matchlen);
        if ($match==substr($input,$i,$matchlen)) {
            //match found - store it
            if (isset($hits[$match])) $hits[$match][]=$i;
            else $hits[$match]=array($offset,$i);
        }
    }

    //Store current character in index
    $idx[$current][]=$i;
}

print_r($hits);

?>

I suspect it to be O(N*N/M) time with N being string length and M being the width of the character range.

It outputs what I think are the correct answers for your example.

Edit:

This algo hast the advantage of keeping valid scores while running, so it is usable for streams, asl long as you can look-ahaead via some buffering. It pays for this with efficiency.

Edit 2:

If one were to allow a maximum length for repetition detection, this will decrease space and time usage: Expelling too "early" past occurrences via something like if ($matchlen>MAX_MATCH_LEN) ... limits index size and string comparison length

Question 2

Suffix tree related algorithms are useful here.

One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It uses a combination of divide-and-conquer approach and suffix trees and has time complexity O(N log N + Z) where Z is the number of substring repetitions.

The same book describes simpler O(N²) algorithm for this problem, also using suffix trees.