difflib returns different ratio depending on order of sequences

https://stackoverflow.com/questions/9322135

30-04-2021

Question

Does anyone know why these two calls return different ratios?

>>> import difflib
>>> difflib.SequenceMatcher(None, '10101789', '11426089').ratio()
0.5
>>> difflib.SequenceMatcher(None, '11426089', '10101789').ratio()
0.625

Solution

This gives some ideas of how matching works.

>>> import difflib
>>> 
>>> def print_matches(a, b):
...     s = difflib.SequenceMatcher(None, a, b)
...     # Each block is a Match(a, b, size) named tuple, so it fills the three %d fields.
...     for block in s.get_matching_blocks():
...         print("a[%d] and b[%d] match for %d elements" % block)
...     print(s.ratio())
... 
>>> print_matches('01017', '14260')
a[0] and b[4] match for 1 elements
a[5] and b[5] match for 0 elements
0.2
>>> print_matches('14260', '01017')
a[0] and b[1] match for 1 elements
a[4] and b[2] match for 1 elements
a[5] and b[5] match for 0 elements
0.4

It looks as if the matcher matches as much as it can of the first sequence against the second, and then continues to the left and to the right of each match. With ('01017', '14260'), the first match found is the 0, which is the first character of the first sequence and the last character of the second, so nothing remains to pair up on either side and no further matches are possible. With ('14260', '01017'), the 1s match first and the 0 is still available to the right of that match in both sequences, so two matches are found.
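To see which block gets picked first in each order, here is a small sketch using find_longest_match(). The helper name first_match is made up for this illustration. The difflib documentation says that among equally long blocks, the one starting earliest in a wins, then the one starting earliest in b.

```python
import difflib

def first_match(a, b):
    # Hypothetical helper: returns the block the matcher picks before recursing.
    # Explicit bounds are passed because find_longest_match() only gained
    # default arguments in Python 3.9.
    s = difflib.SequenceMatcher(None, a, b)
    return s.find_longest_match(0, len(a), 0, len(b))

m1 = first_match('01017', '14260')
m2 = first_match('14260', '01017')
print(m1)  # Match(a=0, b=4, size=1)
print(m2)  # Match(a=0, b=1, size=1)
```

In the first order the chosen block sits at the very end of b, so nothing is left to its right. In the second order the chosen block leaves '4260' and '017' to the right, which still share a 0.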

I think the matching algorithm is commutative (order-independent) for sorted sequences.

OTHER TIPS

I was working with difflib lately, and though this answer is late, I thought it might add a little spice to the answer provided by hughdbrown as it shows what's happening visually.

Before I get to the code snippet, let me quote the documentation:

The idea is to find the longest contiguous matching subsequence that contains no "junk" elements; these "junk" elements are ones that are uninteresting in some sense, such as blank lines or whitespace. (Handling junk is an extension to the Ratcliff and Obershelp algorithm.) The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.
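The recursion in that quote can be sketched in a few lines. This is not difflib's actual implementation: it ignores junk handling, uses a slow brute-force search, and all function names are invented for the illustration. It only assumes the documented tie-break (earliest in a, then earliest in b).

```python
def longest_common_block(a, b, alo, ahi, blo, bhi):
    """Return (i, j, size) of the longest block with a[i:i+size] == b[j:j+size]."""
    best_i, best_j, best_size = alo, blo, 0
    for i in range(alo, ahi):
        for j in range(blo, bhi):
            k = 0
            while i + k < ahi and j + k < bhi and a[i + k] == b[j + k]:
                k += 1
            # Strict '>' keeps the earliest i, then the earliest j, on ties.
            if k > best_size:
                best_i, best_j, best_size = i, j, k
    return best_i, best_j, best_size

def matching_blocks(a, b, alo=0, ahi=None, blo=0, bhi=None):
    """Find the longest block, then recurse on the pieces left and right of it."""
    ahi = len(a) if ahi is None else ahi
    bhi = len(b) if bhi is None else bhi
    i, j, size = longest_common_block(a, b, alo, ahi, blo, bhi)
    if size == 0:
        return []
    left = matching_blocks(a, b, alo, i, blo, j)
    right = matching_blocks(a, b, i + size, ahi, j + size, bhi)
    return left + [(i, j, size)] + right

def sketch_ratio(a, b):
    """2.0 * M / T, with M matched elements and T total elements."""
    matched = sum(size for _, _, size in matching_blocks(a, b))
    total = len(a) + len(b)
    return 2.0 * matched / total if total else 1.0

print(sketch_ratio('10101789', '11426089'))  # 0.5
print(sketch_ratio('11426089', '10101789'))  # 0.625
```

Because each recursive call can only look to the left or right of the block already chosen, the first pick constrains everything after it, and the first pick depends on which sequence is a.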

I think comparing the first string against the second one and then finding matches looks right enough to people. This is explained nicely in the answer by hughdbrown.

Now try and run this code snippet:

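from difflib import SequenceMatcher

def show_matching_blocks(a, b):
    # Wrap every matching block in braces so the matches are visible in both strings.
    s = SequenceMatcher(None, a, b)
    m = s.get_matching_blocks()  # Match(a, b, size) tuples, ending with a zero-size sentinel
    seqs = [a, b]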

    new_seqs = []
    for select, seq in enumerate(seqs):
        i, n = 0, 0
        new_seq = ''
        while i < len(seq):
            if i == m[n][select]:
                new_seq += '{' + seq[m[n][select]:m[n][select] + m[n].size] + '}'
                i += m[n].size
                n += 1
            elif i < m[n][select]:
                new_seq += seq[i:m[n][select]]
                i = m[n][select]
        new_seqs.append(new_seq)
    for seq, n in zip(seqs, new_seqs):
        print('{} --> {}'.format(seq, n))
    print('')

a, b = '10101789', '11426089'
show_matching_blocks(a, b)
show_matching_blocks(b, a)

Output:

10101789 --> {1}{0}1017{89}
11426089 --> {1}1426{0}{89}

11426089 --> {1}{1}426{0}{89}
10101789 --> {1}0{1}{0}17{89}

The parts inside braces ({}) are the matching parts. I just used SequenceMatcher.get_matching_blocks() to put the matching blocks within braces for better visibility. You can clearly see the difference when the order is reversed. With the first order, 4 characters match, so the ratio is 2*4/16 = 0.5. With the order reversed, 5 characters match, so the ratio becomes 2*5/16 = 0.625. The documentation defines the ratio as 2.0*M/T, where M is the number of matches and T is the total number of elements in both sequences.
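The arithmetic above can be checked by summing the block sizes yourself. The sketch below does that, and also shows one possible way to get an order-independent score by taking the larger of the two ratios. Both helper names are made up here, and using max() is just an assumption about what "order-independent" should mean for your use case, not something difflib provides.

```python
from difflib import SequenceMatcher

def ratio_from_blocks(a, b):
    # Hypothetical helper: recompute 2.0*M/T from the matching blocks.
    # Assumes at least one of the sequences is non-empty.
    blocks = SequenceMatcher(None, a, b).get_matching_blocks()
    matched = sum(block.size for block in blocks)
    return 2.0 * matched / (len(a) + len(b))

def symmetric_ratio(a, b):
    # Hypothetical workaround: score both orders and keep the larger value,
    # so swapping the arguments cannot change the result.
    return max(SequenceMatcher(None, a, b).ratio(),
               SequenceMatcher(None, b, a).ratio())

a, b = '10101789', '11426089'
print(ratio_from_blocks(a, b))  # 0.5
print(ratio_from_blocks(b, a))  # 0.625
print(symmetric_ratio(a, b) == symmetric_ratio(b, a))  # True
```

Note that this doubles the work per comparison, which may matter if you compare many long sequences.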

Licensed under: CC-BY-SA with attribution
