I use fuzzywuzzy to fuzzy match based on threshold and fuzzysearch to fuzzy extract words from the match.
process.extractBests
takes a query, list of words and a cutoff score and returns a list of tuples of match and score above the cutoff score.
find_near_matches
takes the result of process.extractBests
and returns the start and end indices of words. I use the indices to build the words and use the built word to find the index in the large string. max_l_dist
of find_near_matches
is 'Levenshtein distance' which has to be adjusted to suit the needs.
from fuzzysearch import find_near_matches
from fuzzywuzzy import process
large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
def fuzzy_extract(qs, ls, threshold):
'''fuzzy matches 'qs' in 'ls' and returns list of
tuples of (word,index)
'''
for word, _ in process.extractBests(qs, (ls,), score_cutoff=threshold):
print('word {}'.format(word))
for match in find_near_matches(qs, word, max_l_dist=1):
match = word[match.start:match.end]
print('match {}'.format(match))
index = ls.find(match)
yield (match, index)
To test:
query_string = "manhattan"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 70):
print('match: {}\nindex: {}'.format(match, index))
query_string = "citi"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
print('match: {}\nindex: {}'.format(match, index))
query_string = "greet"
print('query: {}\nstring: {}'.format(query_string, large_string))
for match,index in fuzzy_extract(query_string, large_string, 30):
print('match: {}\nindex: {}'.format(match, index))
Output:
query: manhattan
string: thelargemanhatanproject is a great project in themanhattincity
match: manhatan
index: 8
match: manhattin
index: 49
query: citi
string: thelargemanhatanproject is a great project in themanhattincity
match: city
index: 58
query: greet
string: thelargemanhatanproject is a great project in themanhattincity
match: great
index: 29