Question

OK this is what I want to do:

Get more than two strings and "align" them (no DNA/RNA sequence or the like, just regular strings with not like 1000 items in each of them)

I've already done some work with pairwise alignment (align two strings) however the "gaps" create some issues for me when trying to align more than one pair.

Example (one I'm currently testing):

ABCDEF
ABGHCEEF
AJKLBCDYEOF

AB--CDEF
ABGHCEEF
=======================
AB--C-EF

A-B--C--E-F
AJKLBCDYEOF
=======================
A----C--E-F

And another (more illustrative) example :

http://nest.drkameleon.com
http://www.google.com
http://www.yahoo.com

http://nest.drkameleon.com
http://-www.--google--.com

=======================
http://----.------le--.com

http://----.------le--.com
http://-www.-----yahoo.com

=======================
http://----.----------.com

What I'm currently doing :

  • Sort the strings (longer strings come first in the list)
  • Align the first pair : A-B and get the result (let's say R1)
  • Then align the second pair : R1 and C (result in R2)
  • Then align the third pair : R2 and D
  • And so on...

So what's in your mind? How could I go for that? Is there a better way? (Of course, there must be...)

I'd rather do that in Perl/Python or something along these lines, however any type of code/reference would be more than welcome! :-)

Was it helpful?

Solution

I think you may be able to cast this problem as a more general string diff problem instead of a string alignment. Consider how GNU diff is used for finding differences between two files, and use the same algorithms as are used to perform an N-way diff.

I'm not sure if the time/memory complexity of this approach is amenable to your needs, but you can at least think about the problem this way.

OTHER TIPS

There is an algorithm based on Levenshtein algorithm to compute the longest common sequence, with optional spaces. Not sure if that helps.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top