Question

I have found an algorithm for Longest Common Substring. It is usually done using dynamic programming, using a 2-D array of size mxn where m and n are lengths of the two strings under consideration.

I will construct the following matrix for the two strings.

M[i][j] = 1 if s1[i]==s2[j] else 0.

For example, if the strings are: abcxy and pqaabx

The matrix looks as follows:

    a b c x y
 p  0 0 0 0 0
 q  0 0 0 0 0
 a  1 0 0 0 0
 a  1 0 0 0 0
 b  0 1 0 0 0
 x  0 0 0 1 0

Now, I search for a maximal continuous sequence of 1s in every diagonal which is in top-left to bottom-right direction.

The maximum value among these will be the answer.

I can perform the above operation without using the array explicitly. The time-complexity is still O(M*N). So, there is no need of memory.

Can anyone point me where I am going wrong?

Was it helpful?

Solution

Your method is correct. For proof suppose the longest common substring for S1 and S2 was from S1[i..j] and S2[p..q]. this implies S1[i+k] = S2[p+k]

These all lie on the diagonal starting from (i,p).

The dynamic programming solution does the same thing but instead of computing the table first and going through diagonal paths it computes the table depending on it's diagonal parent plus whether or not they match.

EDITED

On your comment on the wikipedia solution using additional memory. It's there only for clarity. In principle you need only two rows of the matrix in the wikipedia solution and keep the current maximum count in one variable. This is correct since for any (i,j)th entry in the matrix

M(i,j) = 1 + M(i-1, j-1) (if s1[i] == s2[j])

as you can see the current row elements depend only on the elements of the immediately upper row.

OTHER TIPS

Your algorithm is correct, but the standard DP approach eliminates your second phase, and makes the solution simpler.

Instead of marking boolean values and then scanning the diagonals to look for longest sequences, you can compute the diagonal lengths as you build the matrix - Only one pass is required.

In terms of time and space complexity, both solutions are O(NxM). Your solution can save some memory if you use a bit matrix representation, while the other solution is probably slightly faster.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top