Assuming the maximum Levenshtein distance allowed is small, this can be done in a single-pass while keeping a list of candidates for fuzzy matches.
Here is an example implementation I just worked up. It's not thoroughly tested, documented or optimized. But at least it works for simple examples (see below). I've tried to avoid having it return several matches due to skipping characters at the edges of the sub-sequence, but as I've said, I haven't thoroughly tested this.
If you're interested I'll be happy to clean this up, write some tests, do basic optimization and make this available as an open-source library.
from collections import namedtuple
Candidate = namedtuple('Candidate', ['start', 'subseq_index', 'dist'])
Match = namedtuple('Match', ['start', 'end', 'dist'])
def find_near_matches(subsequence, sequence, max_l_dist=0):
prev_char = None
candidates = []
for index, char in enumerate(sequence):
for skip in range(min(max_l_dist+1, len(subsequence))):
candidates.append(Candidate(index, skip, skip))
if subsequence[skip] == prev_char:
break
new_candidates = []
for cand in candidates:
if char == subsequence[cand.subseq_index]:
if cand.subseq_index + 1 == len(subsequence):
yield Match(cand.start, index + 1, cand.dist)
else:
new_candidates.append(cand._replace(
subseq_index=cand.subseq_index + 1,
))
else:
if cand.dist == max_l_dist or cand.subseq_index == 0:
continue
# add a candidate skipping a sequence char
new_candidates.append(cand._replace(dist=cand.dist + 1))
# try skipping subsequence chars
for n_skipped in range(1, max_l_dist - cand.dist + 1):
if cand.subseq_index + n_skipped == len(subsequence):
yield Match(cand.start, index + 1, cand.dist + n_skipped)
break
elif subsequence[cand.subseq_index + n_skipped] == char:
# add a candidate skipping n_skipped subsequence chars
new_candidates.append(cand._replace(
dist=cand.dist + n_skipped,
subseq_index=cand.subseq_index + n_skipped,
))
break
candidates = new_candidates
prev_char = char
Now:
>>> list(find_near_matches('bde', 'abcdefg', 0))
[]
>>> list(find_near_matches('bde', 'abcdefg', 1))
[Match(start=1, end=5, dist=1), Match(start=3, end=5, dist=1)]
>>> list(find_near_matches('cde', 'abcdefg', 0))
[Match(start=2, end=5, dist=0)]
>>> list(find_near_matches('cde', 'abcdefg', 1))
[Match(start=2, end=5, dist=0)]
>>> match = _[0]
>>> 'abcdefg'[match.start:match.end]
'cde'
EDIT:
Following this question, I'm writing a Python library for searching for nearly matching sub-sequences: fuzzysearch
. It it still very much a work-in-progress.
For now, try the find_near_matches_with_ngrams()
function! It should perform particularly well in your use-case.