Question

I'm working on a program to find the longest common substring between multiple strings. I've lowered my approach down to either using suffix array's or a suffix tree. I want to see which is the better approach (if there is one) and why. Also for suffix array's I've seen a few algorithms for two strings but not any for more then two strings. Any solid examples would be appreciated, thanks again for the advice!

Note: I didn't see any other questions that specifically addressed this issue, but if they exist please point me in that direction!

Was it helpful?

Solution

If you have a substring that occurs in all sequences, then in a suffix array, the pointers to each occurrence of that substring must sort close together. So you could attempt to find them by moving a window along the suffix array, where the window is just large enough to contain at least one occurrence of each sequence. You could do this in linear time by maintaining a table that tells you, for each sequence, how many times that sequence occurs within that window. Then when you move the rear end of the window forwards decrement the count for the sequence associated with the pointer you have just skipped over and, if necessary, move the forward end of the window just far enough to pick up a new occurrence of this sequence and update the table.

Now you need to be able to find the length of the common prefix shared by all substrings starting at the pointers in the window. This should be the minimum LCP value occurring between the pointers in the window. If you use a red-black tree, such as a Java Treeset, with a key which consists of the LCP value as most significant component and some tie-breaker such as the pointer itself as a less significant component, then you can maintain the minimum LCP value within the window at a cost of roughly log window size per window adjustment.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top