longest common substring with a rolling hash

https://stackoverflow.com/questions/21428189

04-10-2022
|

Question

I'm implementing the longest common substring problem with a rolling hash logic and I can't understand how it works in O(nlogn) time. The algorithm states that,

1) choose a length of substring by binary search( low, high, mid)
2) Compute rolling hash values for both strings for a given length mid
3) match the two rolling hashes and see if they are the same and reposition the binary search accordingly.

I can see that computing the rolling hashes is O(n) and the binary search for length is O(logn) thus giving O(nlogn), but from what I understand, the rolling hashes by themselves are not sufficient to get a guaranteed hit. Multiple substrings could have the same rolling hash value, so its another times O(n) to check if the strings really match. Am I missing something?

Solution

It works in O(nlogn) only if you assume that there are no hash-collisions. If you need an algorithm which has the same or better time complexity and perfomance and is guaranteed to produce correct result for all possible inputs, try to use suffix structures(such as suffix array, suffix tree or suffix automaton).

Edit: Here is a nice explanation of Suffix tree algorithm Stack overflow answer to a question on suffix tree. Some other useful resources are here and @makagonov has nice implementation of suffix tree in c++ here

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow