Longest common substring across multiple sequences

https://stackoverflow.com/questions/20584536

01-09-2022
|

Question

I'm trying to find the longest common substring (LCS) across multiple sequences.

There are a number of modules on CPAN that implement the LCS algorithm for 2 sequences, such as ~~Algorithm::Diff and~~ String::LCSS_XS, but I'm having a hard time in extending them to work with more than 2 sequences because the LCS across multiple sequences is not necessarily a LCS between any two of them.

Of note, despite its name, the Algorith::MLCS doesn't actually return the LCS but all the common elements (also non-consecutive) of a number of arrays. My impression is that it is broken by design, but I might be wrong.

Algorithm::Diff and Algorith::MLCS solve the longest common subsequence problem, not the longest common substring one.

Is there an obvious way to extend the n=2 algorithms or do I have to implement my version? If yes, how?

Thanks.

Solution

This can be solved pretty easily using the Tree::Suffix module.

Example:

#!/usr/bin/env perl
use Modern::Perl;
use Bio::SeqIO;
use Tree::Suffix;

my $seqio = Bio::SeqIO->new(
    -file => "fasta_sequences.txt",
    -format => "Fasta");

my @seqs;

while (my $seqobj = $seqio->next_seq) {
    push @seqs, $seqobj->seq;
}

my $tree = Tree::Suffix->new(@seqs);
my @lcss = $tree->lcs;

say $_ for @lcss;

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow