Ignore case with difflib.get_close_matches()

https://stackoverflow.com/questions/11384714

19-06-2021

Question

How can I tell difflib.get_close_matches() to ignore case? I have a dictionary of names in a defined format that includes capitalisation. The test string, however, might be fully capitalised or not capitalised at all, and these should be treated as equivalent. The results need to keep their proper capitalisation, so I can't simply use a modified dictionary.

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
s = 'Acacia kochi W.Fitzg.'

# base case: proper capitalisation
print(difflib.get_close_matches(s,names,1,0.9))

# this should be equivalent from the perspective of my program
print(difflib.get_close_matches(s.upper(),names,1,0.9))

# this won't work because of the dictionary formatting
print(difflib.get_close_matches(s.upper().capitalize(),names,1,0.9))

Output:

['Acacia kochii W.Fitzg.']
[]
[]

Working code:

Based on Hugh Bothwell's answer, I have modified the code as follows to get a working solution (which should also work when more than one result is returned):

import difflib

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
    'Acacia koa A.Gray var. waianaeensis H.St.John',
    'Acacia koaia Hillebr.',
    'Acacia kochii W.Fitzg. ex Ewart & Jean White',
    'Acacia kochii W.Fitzg.']
test = {n.lower():n for n in names}    
s1 = 'Acacia kochi W.Fitzg.'   # base case
s2 = 'ACACIA KOCHI W.FITZG.'   # test case

results = [test[r] for r in difflib.get_close_matches(s1.lower(),test,1,0.9)]
results += [test[r] for r in difflib.get_close_matches(s2.lower(),test,1,0.9)]
print(results)

Output:

['Acacia kochii W.Fitzg.', 'Acacia kochii W.Fitzg.']

Solution

I don't see any quick way to make difflib do case-insensitive comparison.
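To illustrate why the upper-case query in the question returns nothing, here is a small sketch that scores the same pair of strings with difflib.SequenceMatcher, the class get_close_matches is built on. The helper name ratio is made up for this illustration; the strings come from the question.

```python
import difflib

def ratio(a, b):
    """Similarity score in [0, 1] as computed by difflib.SequenceMatcher."""
    return difflib.SequenceMatcher(None, a, b).ratio()

target = 'Acacia kochii W.Fitzg.'
query = 'ACACIA KOCHI W.FITZG.'

# Characters are compared exactly, so 'A' and 'a' count as different.
raw_score = ratio(query, target)

# Lower-casing both sides removes the case differences before scoring.
folded_score = ratio(query.lower(), target.lower())

print(raw_score, folded_score)
```

The raw score falls well below the 0.9 cutoff used in the question, while the case-folded score clears it, so any fix has to normalise both the query and the candidates before they reach difflib.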

The quick-and-dirty solution seems to be:

1. Make a function that converts the string to some canonical form (for example: upper case, single spaced, no punctuation).
2. Use that function to make a dict of {canonical string: original string} and a list of [canonical string].
3. Run .get_close_matches against the canonical-string list, then plug the results through the dict to get the original strings back.
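The steps above could be sketched roughly as follows. The function names canonical and close_matches_canonical are hypothetical, and the canonical form chosen here (upper case, ASCII punctuation removed, whitespace collapsed) is just one reading of the example given; adjust it to your data. Note that two originals that collapse to the same canonical string will overwrite each other in the dict.

```python
import difflib
import string

# Translation table that deletes every ASCII punctuation character.
_STRIP_PUNCT = str.maketrans('', '', string.punctuation)

def canonical(text):
    """Upper-case, drop ASCII punctuation, collapse runs of whitespace."""
    return ' '.join(text.translate(_STRIP_PUNCT).upper().split())

def close_matches_canonical(word, possibilities, n=3, cutoff=0.6):
    """Match on canonical forms, then map hits back to the originals."""
    # Step 2: canonical string -> original string (last one wins on collision).
    lookup = {canonical(p): p for p in possibilities}
    # Step 3: compare canonical forms only.
    hits = difflib.get_close_matches(canonical(word), list(lookup), n, cutoff)
    return [lookup[h] for h in hits]

names = ['Acacia koa A.Gray var. latifolia (Benth.) H.St.John',
         'Acacia koa A.Gray var. waianaeensis H.St.John',
         'Acacia koaia Hillebr.',
         'Acacia kochii W.Fitzg. ex Ewart & Jean White',
         'Acacia kochii W.Fitzg.']

print(close_matches_canonical('ACACIA KOCHI W.FITZG.', names, 1, 0.9))
```

Because punctuation and spacing are normalised as well as case, this variant is a little more forgiving than a plain lower() comparison.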

OTHER TIPS

After a lot of searching around I am sadly surprised to see no simple pre-canned answer to this obvious use case.

The only alternative seems to be the "FuzzyWuzzy" library. It relies on Levenshtein distance, whereas difflib's SequenceMatcher uses an algorithm similar to Ratcliff/Obershelp "gestalt pattern matching", and in my opinion its API is not production quality. Some of its more obscure methods are indeed case-insensitive, but it provides no direct or simple replacement for get_close_matches.

So here is the simplest implementation I can think of:

import difflib

def get_close_matches_icase(word, possibilities, *args, **kwargs):
    """ Case-insensitive version of difflib.get_close_matches """
    lword = word.lower()
    lpos = {p.lower(): p for p in possibilities}
    lmatches = difflib.get_close_matches(lword, lpos.keys(), *args, **kwargs)
    return [lpos[m] for m in lmatches]

@gatopeich had the right idea, but the problem is that there may be many strings which differ only in capitalization, and a dict with one value per lower-cased key keeps only one of them. We surely want them all in our results, not just one of them!
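A tiny sketch of the collision, using made-up names that differ only in case (the variable names one_to_one and grouped are illustrative):

```python
# Hypothetical possibilities that differ only in capitalisation.
variants = ['Acacia Koa', 'ACACIA KOA', 'acacia koa']

# One value per key: each later variant overwrites the previous one.
one_to_one = {v.lower(): v for v in variants}

# One list per key: every original spelling is kept, in input order.
grouped = {}
for v in variants:
    grouped.setdefault(v.lower(), []).append(v)

print(one_to_one)
print(grouped)
```

With the one-to-one dict only the last spelling survives, so a match on the lower-cased key can return at most one of the three originals.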

The following adaptation manages to do this:

import difflib
import itertools

def get_close_matches_icase(word, possibilities, *args, **kwargs):
    """ Case-insensitive version of difflib.get_close_matches """
    lword = word.lower()
    # Map each lower-cased string to every original spelling that produces it.
    lpos = {}
    for p in possibilities:
        lpos.setdefault(p.lower(), []).append(p)
    lmatches = difflib.get_close_matches(lword, lpos.keys(), *args, **kwargs)
    # Flatten the groups into a list (not a set) to keep difflib's
    # best-match-first ordering.
    return list(itertools.chain.from_iterable(lpos[m] for m in lmatches))

Licensed under: CC-BY-SA with attribution
