How to find all Common substrings which is 3 chars or longer
-
30-05-2021 - |
Question
Are there an efficient algorithm to search and dump all common substrings (which length is 3 or longer) between 2 strings?
Example input:
Length : 0 5 10 15 20 25 30
String 1 : ABC-DEF-GHI-JKL-ABC-ABC-STU-MWX-Y
String 2 : ABC-JKL-MNO-ABC-DEF-PQR-DEF-ZWX-Y
Example output:
In string 1 2
---------------------------
ABC-DEF 0 12
ABC-DE 0 12
BC-DEF 1 13
:
-ABC- 15,19 11
-JKL- 11 3
-DEF- 3 15
-JKL 11 3
JKL- 12 4
-DEF 3 15,23
DEF- 4 16
WX-Y 29 29
ABC- 0,16,20 0,12
-ABC 15,19 11
DEF- 4 16,24
DEF 4 16,24
ABC 0,16,20 0,12
JKL 12 4
WX- 29 29
X-Y 30 30
-AB 15,19 11
BC- 1,17,21 1,13
-DE 3 15,23
EF- 5 17,25
-JK 11 3
KL- 13 5
:
In the example, "-D", "-M" is also a common substring but is not required, because it's length is only 2. (There might be some missing outputs in example because there are so many of them...)
Solution
You can find all common substrings using a data structure called a Generalized suffix tree
Libstree contains some example code for finding the longest common substring. That example code can be modified to obtain all common substrings.
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow