Question

Is there a way to let difflib consider deletion in string matching?

I've tried difflib.get_close_matches(), but it doesn't seem to consider shorter strings in the close-matches output. E.g.

from difflib import get_close_matches as gcm

x = """Erfreulich
Erfreuliche
Erfreulicher
Erfreulicherem
Erfreulicheres
Erfreulicherweis
Erfreulicherweise
Erfreuliches
Erfreulichste"""

x = x.split("\n")

for i in x:
    print(i, gcm(i, x))

Output:

Erfreulich ['Erfreulich', 'Erfreuliche', 'Erfreuliches']
Erfreuliche ['Erfreuliche', 'Erfreuliches', 'Erfreulicher']
Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres']
Erfreulicherem ['Erfreulicherem', 'Erfreulicheres', 'Erfreulicher']
Erfreulicheres ['Erfreulicheres', 'Erfreulicherweis', 'Erfreulicherem']
Erfreulicherweis ['Erfreulicherweis', 'Erfreulicherweise', 'Erfreulicheres']
Erfreulicherweise ['Erfreulicherweise', 'Erfreulicherweis', 'Erfreulicheres']
Erfreuliches ['Erfreuliches', 'Erfreuliche', 'Erfreulicheres']
Erfreulichste ['Erfreulichste', 'Erfreuliche', 'Erfreuliches']

Note that for the string Erfreulicher, Erfreulich isn't considered a close match although the distance is only -1.


Solution

Per the documentation, the n parameter can be increased to return more matches. difflib does consider deletions; the shorter words simply fall outside the default top 3.

difflib.get_close_matches(word, possibilities[, n][, cutoff])
Return a list of the best “good enough” matches. word is a sequence for which close matches are desired (typically a string), and possibilities is a list of sequences against which to match word (typically a list of strings).

Optional argument n (default 3) is the maximum number of close matches to return; n must be greater than 0.

Optional argument cutoff (default 0.6) is a float in the range [0, 1]. Possibilities that don’t score at least that similar to word are ignored.

The best (no more than n) matches among the possibilities are returned in a list, sorted by similarity score, most similar first.

Here is the same word with gcm(i,x,6):

Erfreulicher ['Erfreulicher', 'Erfreuliche', 'Erfreulicheres', 'Erfreulicherem',
              'Erfreuliches', 'Erfreulich']
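
If you'd rather not guess at n, the cutoff parameter quoted above is the other knob: lowering it admits weaker matches, and raising n keeps them from being trimmed away. A minimal sketch, assuming x is the list from the question:

from difflib import get_close_matches as gcm

# cutoff defaults to 0.6; n defaults to 3. Both are documented above.
print(gcm("Erfreulicher", x, n=9, cutoff=0.8))

Given the similarity scores shown in the answer below (every word in this list rates at least 0.828 against "Erfreulicher"), that call should return all nine words, including the shorter "Erfreulich".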

OTHER TIPS

You should accept Mark Tolonen's answer - he read the docs ;-)

For a bit more insight, note that difflib's notion of similarity has nothing to do with Levenshtein edit distance - but maybe that's what you really want. When you say:

Note that for the string Erfreulicher, Erfreulich isn't considered a close match although the distance is only -1.

I have no idea what notion of "distance" you have in mind either. The strings differ by 2 characters, right? "-1" is mysterious.

difflib computes a "similarity score", which is a float in the range 0.0 through 1.0. Here's how to see what it's doing internally, using your list x:

import difflib

s = difflib.SequenceMatcher()
# SequenceMatcher caches information about the second sequence, so set
# the fixed word once with set_seq2() and vary seq1 in the loop.
s.set_seq2("Erfreulicher")
full = []
for i in x:
    s.set_seq1(i)
    full.append((s.ratio(), i))
full.sort(reverse=True)
for score, i in full:
    print("{:20} {:.3f}".format(i, score))

Here's the result, sorted from highest similarity score to lowest:

Erfreulicher         1.000
Erfreuliche          0.957
Erfreulicheres       0.923
Erfreulicherem       0.923
Erfreuliches         0.917
Erfreulich           0.909
Erfreulichste        0.880
Erfreulicherweis     0.857
Erfreulicherweise    0.828

As the docs say, by default get_close_matches() only returns the top 3. The specific word you're asking about happens to be sixth on the list, and would be returned if you told the function to return the top 6 (or 7, etc) matches (see Mark's answer).

How the score is computed is also documented. Since "Erfreulich" is a prefix of "Erfreulicher", it reduces to:

>>> 2.0 * len("Erfreulich") / (len("Erfreulich") + len("Erfreulicher"))
0.9090909090909091

All the strings above "Erfreulich" on the list have at least one more character in common, which makes the numerator larger. The denominator is also larger for them, but increasing the numerator by (say) 1 has a bigger effect on the result than increasing the denominator by 1. That may or may not match your intuition, but it is how it works ;-)
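
To check that arithmetic directly, here is a small sketch using nothing beyond the standard library:

import difflib

# ratio() returns 2.0*M/T, where M is the number of matching
# characters and T is the combined length of both strings.
m = difflib.SequenceMatcher(None, "Erfreulich", "Erfreulicher")
print(m.ratio())                     # 0.9090909090909091
print(2.0 * 10 / (10 + 12))          # M = 10, T = 22: same value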

I'm not a Python developer, but it sounds like you need to compute Levenshtein distances between strings. From Wikipedia:

the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.

If you compute the distance from each word to each word you can always get the closest matches, based on what you define as "close". Now, as I said, I'm not a Python developer, so I can't help you with the language-specific implementation, but I found a python-levenshtein package.
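
For completeness, here is a minimal pure-Python sketch of the textbook dynamic-programming implementation (just an illustration of the idea, not the python-levenshtein package itself):

def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions,
    and substitutions needed to turn a into b."""
    # prev[j] holds the distance between the current prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # delete ca
                           cur[j - 1] + 1,             # insert cb
                           prev[j - 1] + (ca != cb)))  # substitute
        prev = cur
    return prev[-1]

print(levenshtein("Erfreulich", "Erfreulicher"))  # 2

With a distance function like this you can rank the candidate list yourself and keep whatever you define as "close", e.g. everything within distance 2.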

Licensed under: CC-BY-SA with attribution