Unexplaned behavior with difflib.SequenceMatcher get_matching_blocks()

https://stackoverflow.com/questions/12142339

28-06-2021
|

سؤال

I was experimenting with fuzzywuzzy and encountered that for quite a few cases it was generating wrong result. I tried to debug and encountered a scenario with get_matching_blocks() which was difficult to explain.

My understanding of get_matching_blocks() is, it should return a triplet tuple (i,j,n) where the sub-string of length n in the first string at index i should match exactly with the sub-string of length n in the second string at index j.

>>> hay = """"Find longest matching block in a[alo:ahi] and b[blo:bhi]. If isjunk was omitted or None, find_longest_match() returns (i, j, k) such that a[i:i+k] is equal to b[j:j+k], where alo <= i <= i+k <= ahi and blo <= j <= j+k <= bhi. For all (i', j', k') meeting those conditions, the additional conditions k >= k', i <= i', and if i == i', j <= j' are also met. In other words, of all maximal matching blocks, return one that starts earliest in a, and of all those maximal matching blocks that start earliest in a, return the one that starts earliest in b."""
>>> needle = "meeting those conditions"
>>> needle in hay
True
>>> sm = difflib.SequenceMatcher(None,needle,hay)
>>> sm.get_matching_blocks()
[Match(a=5, b=8, size=2), Match(a=24, b=550, size=0)]
>>>

SO why the above code fails to find the matching block?

المحلول

I may not see well, but you aren't matching hay and needle. You got

sm = difflib.SequenceMatcher(None,needle, sms)

shouldn't it be

sm = difflib.SequenceMatcher(None, needle, hay)

? Also, for the record, the last element in the list returned by get_matching_blocks() is a dummy of format (len(a), len(b), 0).

Maybe it's just mistake in the pasting, but then please update your question with actual code (I'm thinking of SequenceMatcher() method)

SequenceMatcher has broken "junk heuristic" - if second string is at least 200 chars long, a junk is every letter which (count-1) is more than 1% of total length. Comment from official bug ticket (namely this comment):

The reason for the bug is the heuristic: if the second sequence is at least 200 items long then any item occurring more than one percent of the time in the second sequence is treated as junk. This was aimed at recurring code lines like 'else:' and 'return', but can be fatal for small alphabets where common items are necessary content.

I will also let myself quote code examples provided by author of above:

Here len(a) == 200, len(b) == 199:

>>> print(SM(None, 'x' + 'y'*199, 'y'*199).ratio())
>>> 0.9975 #correct

Here len(a) == 199, len(b) == 200 (we switch a and b):

>>> print(SM(None, 'y'*199, 'x' + 'y'*199).ratio())
>>> 0 #wrong

Where obviously both cases should give the same output.

The bug was fixed by adding optional parameter I mentioned, autojunk, which - for correct behaviour here - should be manually set to False.

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow