How does the python difflib.get_close_matches() function work?

https://stackoverflow.com/questions/10879247

12-06-2021
|

Pergunta

The following are two arrays:

import difflib
import scipy
import numpy

a1=numpy.array(['198.129.254.73','134.55.221.58','134.55.219.121','134.55.41.41','198.124.252.101'], dtype='|S15')
b1=numpy.array(['198.124.252.102','134.55.41.41','134.55.219.121','134.55.219.137','134.55.220.45', '198.124.252.130'],dtype='|S15')

difflib.get_close_matches(a1[-1],b1,2)

output:

['198.124.252.130', '198.124.252.102']

shouldnt '198.124.252.102' be the closest match for '198.124.252.101'?

I looked at the documentation where they have specified about some floating type weights but no information on algorithm use.

I am in need to find if the absolute difference between the last two octet is 1 (provided the first three octets are same).

So I am finding the closest string first and then checking that closest string for the above condition.

Is there any other function or way to achieve this? Also how does get_close_matches() behave?

ipaddr doesnt seem to have such a manipulation for ips.

Solução

Well, there is this part in the docs explaining your issue:

This does not yield minimal edit sequences, but does tend to yield matches that “look right” to people.

For getting the results you are expecting you could use the Levenshtein_distance.

But for comparing IPs I would suggest to use integer comparison:

>>> parts = [int(s) for s in '198.124.252.130'.split('.')]
>>> parts2 = [int(s) for s in '198.124.252.101'.split('.')]
>>> from operator import sub
>>> diff = sum(d * 10**(3-pos) for pos,d in enumerate(map(sub, parts, parts2)))
>>> diff
29

You can use this style to create a compare function:

from functools import partial
from operator import sub

def compare_ips(base, ip1, ip2):
    base = [int(s) for s in base.split('.')]
    parts1 = (int(s) for s in ip1.split('.'))
    parts2 = (int(s) for s in ip2.split('.'))
    test1 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts1)))
    test2 = sum(abs(d * 10**(3-pos)) for pos,d in enumerate(map(sub, base, parts2)))
    return cmp(test1, test2)

base = '198.124.252.101'
test_list = ['198.124.252.102','134.55.41.41','134.55.219.121',
             '134.55.219.137','134.55.220.45', '198.124.252.130']
sorted(test_list, cmp=partial(compare_ips, base))
# yields:
# ['198.124.252.102', '198.124.252.130', '134.55.219.121', '134.55.219.137', 
#  '134.55.220.45', '134.55.41.41']

Outras dicas

Some hint from difflib:

SequenceMatcher is a flexible class for comparing pairs of sequences of any type, so long as the sequence elements are hashable. The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching". The basic idea is to find the longest contiguous matching subsequence that contains no "junk" elements (R-O doesn't address junk). The same idea is then applied recursively to the pieces of the sequences to the left and to the right of the matching subsequence. This does not yield minimal edit sequences, but does tend to yield matches that "look right" to people.

Regarding your requirement to compare IPs based on custom logic. You should first validate if the string is proper ip. Then writing comparison logic using simple integer arithmetic should be an easy task to fulfill your requirement. A library is not needed at all.

difflib mentions:

The basic algorithm predates, and is a little fancier than, an algorithm published in the late 1980's by Ratcliff and Obershelp under the hyperbolic name "gestalt pattern matching".

And in terms of what that could mean, the "gestalt pattern matching" Wikipedia page can provide some answers. Also, in the Wikipedia page, some mentions about the Python difflib library, and its implementations are given, in the 'Applications' section.

https://en.wikipedia.org/wiki/Gestalt_Pattern_Matching

Licenciado em: CC-BY-SA com atribuição

Não afiliado a StackOverflow