Question

Are there an efficient algorithm to search and dump all common substrings (which length is 3 or longer) between 2 strings?

Example input:

Length   : 0    5    10   15   20   25   30

String 1 : ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y
String 2 : ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y

Example output:

In string 1           2
---------------------------
ABC-DEF   0           12
ABC-DE    0           12
BC-DEF    1           13
:
-ABC-     15,19       11
-JKL-     11          3
-DEF-     3           15
-JKL      11          3
JKL-      12          4
-DEF      3           15,23
DEF-      4           16
WX-Y      29          29
ABC-      0,16,20     0,12
-ABC      15,19       11
DEF-      4           16,24
DEF       4           16,24
ABC       0,16,20     0,12
JKL       12          4
WX-       29          29
X-Y       30          30
-AB       15,19       11
BC-       1,17,21     1,13
-DE       3           15,23
EF-       5           17,25
-JK       11          3
KL-       13          5
:

In the example, "-D", "-M" is also a common substring but is not required, because it's length is only 2. (There might be some missing outputs in example because there are so many of them...)

Was it helpful?

Solution

You can find all common substrings using a data structure called a Generalized suffix tree

Libstree contains some example code for finding the longest common substring. That example code can be modified to obtain all common substrings.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top