Question

I am working through a database of names that may contain duplicate entries, and I am trying to identify which names appear twice. Unfortunately, the formatting is less than optimal: some entries have the first, middle, last, or maiden names mashed into one string, while others have just first and last names.

I need a way to see if, say, 'John Marvulli' matches 'John Michael Marvulli', and to be able to do an operation on those matches. However, if you try:

>>> 'John Marvulli' in 'John Michael Marvulli'
False

It returns False. Is there an easy way to compare two strings in this manner to see if one name is contained in another?


Solution 2

I recently discovered the power of the difflib module.
I think this will help you:

import difflib

datab = ['Pnk Flooyd', 'John Marvulli',
         'Ld Zeppelin', 'John Michael Marvulli',
         'Led Zepelin', 'Beetles', 'Pink Fl',
         'Beatlez', 'Beatles', 'Poonk LLoyds',
         'Pook Loyds']
print(datab)
print()


li = []  # groups of names considered similar to each other
s = difflib.SequenceMatcher()

def yield_ratios(s, iterable):
    # Compare each already-grouped name (seq1) with the current name (seq2).
    for x in iterable:
        s.set_seq1(x)
        yield s.ratio()

for text_item in datab:
    s.set_seq2(text_item)
    for gathered in li:
        # Join the first group that holds a sufficiently similar name.
        if any(r > 0.45 for r in yield_ratios(s, gathered)):
            gathered.append(text_item)
            break
    else:
        # No existing group matched, so start a new one.
        li.append([text_item])


for el in li:
    print(el)

Result:

['Pnk Flooyd', 'Pink Fl', 'Poonk LLoyds', 'Pook Loyds']
['John Marvulli', 'John Michael Marvulli']
['Ld Zeppelin', 'Led Zepelin']
['Beetles', 'Beatlez', 'Beatles']
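
To check a single pair of names rather than group a whole list, the same ratio can be wrapped in a small helper. This is only a sketch: the function names `similarity` and `is_probable_duplicate` are hypothetical, lowercasing the inputs is an added assumption, and the 0.45 threshold is simply reused from the code above and would likely need tuning on real data.

```python
import difflib

def similarity(a, b):
    """Return a ratio between 0 and 1; identical strings give 1.0."""
    # Lowercasing is an assumption, so that 'john' and 'John' compare equal.
    return difflib.SequenceMatcher(None, a.lower(), b.lower()).ratio()

def is_probable_duplicate(a, b, threshold=0.45):
    # Same cutoff as the grouping loop above; adjust it for your data.
    return similarity(a, b) > threshold

print(is_probable_duplicate('John Marvulli', 'John Michael Marvulli'))
print(is_probable_duplicate('John Marvulli', 'Led Zepelin'))
```

A low threshold catches more misspellings but also produces more false positives, so it is worth reviewing the matches before acting on them.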

Other tips

You need to split the strings and look for the individual words:

>>> all(x in 'John Michael Marvulli'.split() for x in 'John Marvulli'.split())
True
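
The word-by-word check above can be sketched as a reusable function. The names `name_words`, `name_contained`, and `either_contains` are made up for illustration, and ignoring case is an added assumption rather than part of the original one-liner.

```python
def name_words(name):
    # Split on any whitespace and ignore case (an assumption).
    return set(name.lower().split())

def name_contained(short, full):
    """True if every word of `short` also appears as a word in `full`."""
    return name_words(short) <= name_words(full)

def either_contains(a, b):
    # Useful when you do not know which entry is the longer one.
    return name_contained(a, b) or name_contained(b, a)

print(name_contained('John Marvulli', 'John Michael Marvulli'))
print(either_contains('John Michael Marvulli', 'John Marvulli'))
```

Because this compares whole words, 'John' does not match 'Johnny', but word order is ignored, so 'Marvulli John' would count as contained in 'John Michael Marvulli'.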
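
Another option is a regular expression that allows any number of middle names between the first and last name:

import re

# Extra whitespace between the names should not matter.
n1 = "john   Miller"

n2 = "johnas Miller"

n3 = "john doe Miller"
n4 = "john doe paul Miller"


# First name, any number of middle names, then the last name.
regex = r"john\s+(\w+\s+)*Miller"
compiled = re.compile(regex)

# Each line prints True when the name does NOT match.
print(compiled.search(n1) is None)
print(compiled.search(n2) is None)
print(compiled.search(n3) is None)
print(compiled.search(n4) is None)

'''
output:

False
True
False
False
'''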
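
Since the pattern above hard-codes one name, a sketch like the following builds it from any first and last name. The helper `build_name_pattern` is hypothetical, and the word boundaries and case-insensitive matching are added assumptions.

```python
import re

def build_name_pattern(first, last):
    """Compile a pattern for: first name, optional middle names, last name."""
    # re.escape guards against regex metacharacters in the names.
    return re.compile(
        r"\b" + re.escape(first) + r"\s+(?:\w+\s+)*" + re.escape(last) + r"\b",
        re.IGNORECASE,  # assumption: case differences should be ignored
    )

pattern = build_name_pattern("John", "Marvulli")
for candidate in ["John Marvulli", "John Michael Marvulli", "Johnas Marvulli"]:
    # bool() is True when the candidate matches the pattern.
    print(candidate, bool(pattern.search(candidate)))
```

Note that `\w+` only covers letters, digits, and underscores, so hyphenated or apostrophized middle names would need a wider character class.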
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow