How is python's difflib.find_longest_match implemented?

https://stackoverflow.com/questions/17508235

02-06-2022
|

Question

Originally wanted an algorithm to find the longest substring between two python Strings. The general answer for the best runtime was "to construct a suffix tree", based on the online consensus for a linear runtime. However, there are zero examples online on any of this, and it's not surprising because suffix trees are noted to be incredibly complicated and unintuitive to construct.

I implemented a DP solution (which is still quadratic) and too slow for what i'm trying to do.

Tried using python's difflib.find_longest_match, and it was faster (but it still was not as fast as id like).

So if anyone knows, what algorithm does the find_longest_match() method use?

Also, if the algorithm of find_longest_match() isn't optimal, does anyone know where one can find out how the linear max substring algorithm can be implemented. This is a somewhat famous problem, and its weird how there are so few to none optimal solutions online. Or maybe im just misinformed and the lower bound is quadratic, or even nlogn.

Thanks.

Solution

Here is the code: http://svn.python.org/view/python/tags/r271/Lib/difflib.py?view=markup seems to me that it is quadratic.

Only speedup seems to be a dictionary to get all indices where a given character is used. This reduces the time by a factor which is the number of different characters used.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow