Question

I am trying to search for the maximal number of substring repetitions inside a string, here are some few examples:

"AQMQMB" => QM (2x)
"AQMPQMB" => <nothing>
"AACABABCABCABCP" => A (2x), AB (2x), ABC (3x)

As you can see I am searching for consecutive substrings only and this seems to be a problem because all compression algorithms (at least that I am aware of) don't care about the consecutivity (LZ*), or too simple to handle consecutive patterns instead of single data items (RLE). I think using suffix tree-related algorithms is also not useful due to the same problem.

I think there are some bio-informatics algorithms that can do this, does anyone have an idea about such algorithm?

Edit In the second example there might be multiple possibilities of consecutive patterns (thanks to Eugen Rieck for the notice, read comments below), however in my use case any of these possibilities is actually acceptable.

Was it helpful?

Solution

This is what I used for a similar problem:

<?php

$input="AACABABCABCABCP";

//Prepare index array (A..Z) - adapt to your character range
$idx=array();
for ($i="A"; strlen($i)==1; $i++) $idx[$i]=array();

//Prepare hits array
$hits=array();

//Loop
$len=strlen($input);
for ($i=0;$i<$len;$i++) {

    //Current character
    $current=$input[$i];

    //Cycle past occurrences of character
    foreach ($idx[$current] as $offset) {

        //Check if substring from past occurrence to now matches oncoming
        $matchlen=$i-$offset;
        $match=substr($input,$offset,$matchlen);
        if ($match==substr($input,$i,$matchlen)) {
            //match found - store it
            if (isset($hits[$match])) $hits[$match][]=$i;
            else $hits[$match]=array($offset,$i);
        }
    }

    //Store current character in index
    $idx[$current][]=$i;
}

print_r($hits);

?>

I suspect it to be O(N*N/M) time with N being string length and M being the width of the character range.

It outputs what I think are the correct answers for your example.

Edit:

This algo hast the advantage of keeping valid scores while running, so it is usable for streams, asl long as you can look-ahaead via some buffering. It pays for this with efficiency.

Edit 2:

If one were to allow a maximum length for repetition detection, this will decrease space and time usage: Expelling too "early" past occurrences via something like if ($matchlen>MAX_MATCH_LEN) ... limits index size and string comparison length

OTHER TIPS

Suffix tree related algorithms are useful here.

One is described in Algorithms on Strings, Trees and Sequences by Dan Gusfield (Chapter 9.6). It uses a combination of divide-and-conquer approach and suffix trees and has time complexity O(N log N + Z) where Z is the number of substring repetitions.

The same book describes simpler O(N2) algorithm for this problem, also using suffix trees.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top