Get position of subsequence using Levenshtein-Distance

Question

Assuming the maximum Levenshtein distance allowed is small, this can be done in a single-pass while keeping a list of candidates for fuzzy matches.

Here is an example implementation I just worked up. It's not thoroughly tested, documented or optimized. But at least it works for simple examples (see below). I've tried to avoid having it return several matches due to skipping characters at the edges of the sub-sequence, but as I've said, I haven't thoroughly tested this.

If you're interested I'll be happy to clean this up, write some tests, do basic optimization and make this available as an open-source library.

from collections import namedtuple

Candidate = namedtuple('Candidate', ['start', 'subseq_index', 'dist'])
Match = namedtuple('Match', ['start', 'end', 'dist'])

def find_near_matches(subsequence, sequence, max_l_dist=0):
    prev_char = None
    candidates = []
    for index, char in enumerate(sequence):
        for skip in range(min(max_l_dist+1, len(subsequence))):
            candidates.append(Candidate(index, skip, skip))
            if subsequence[skip] == prev_char:
                break
        new_candidates = []
        for cand in candidates:
            if char == subsequence[cand.subseq_index]:
                if cand.subseq_index + 1 == len(subsequence):
                    yield Match(cand.start, index + 1, cand.dist)
                else:
                    new_candidates.append(cand._replace(
                        subseq_index=cand.subseq_index + 1,
                    ))
            else:
                if cand.dist == max_l_dist or cand.subseq_index == 0:
                    continue
                # add a candidate skipping a sequence char
                new_candidates.append(cand._replace(dist=cand.dist + 1))
                # try skipping subsequence chars
                for n_skipped in range(1, max_l_dist - cand.dist + 1):
                    if cand.subseq_index + n_skipped == len(subsequence):
                        yield Match(cand.start, index + 1, cand.dist + n_skipped)
                        break
                    elif subsequence[cand.subseq_index + n_skipped] == char:
                        # add a candidate skipping n_skipped subsequence chars
                        new_candidates.append(cand._replace(
                            dist=cand.dist + n_skipped,
                            subseq_index=cand.subseq_index + n_skipped,
                        ))
                        break
        candidates = new_candidates
        prev_char = char

Now:

>>> list(find_near_matches('bde', 'abcdefg', 0))
[]
>>> list(find_near_matches('bde', 'abcdefg', 1))
[Match(start=1, end=5, dist=1), Match(start=3, end=5, dist=1)]
>>> list(find_near_matches('cde', 'abcdefg', 0))
[Match(start=2, end=5, dist=0)]
>>> list(find_near_matches('cde', 'abcdefg', 1))
[Match(start=2, end=5, dist=0)]
>>> match = _[0]
>>> 'abcdefg'[match.start:match.end]
'cde'

EDIT:

Following this question, I'm writing a Python library for searching for nearly matching sub-sequences: fuzzysearch. It it still very much a work-in-progress.

For now, try the find_near_matches_with_ngrams() function! It should perform particularly well in your use-case.