Locate “N Gram” substrings that are smallest distance away from a target string N character long

https://stackoverflow.com/questions/4203142

25-09-2019
|

Question

I am looking for an algorithm, preferably in Python that would help me locate substrings, N characters long, of exisiting strings that are closest to a target string N character long.

Consider the target string, that is, say, 4 characters long, to be:

targetString -> '1111'

Assume this is the string I have available with me ( I will generate substrings of this for "best alignment" matching ):

nonEmptySubStrings -> ['110101']

Substrings of the above that are 4 characters long:

nGramsSubStrings -> ['0101', '1010', '1101']

I want to write/use a "Magic Function" that would select the string closest to targetString :

someMagicFunction -> ['1101']

Some more examples:

nonEmptySubStrings -> ['101011']
nGramsSubStrings -> ['0101', '1010', '1011']

someMagicFunction -> ['1011']

nonEmptySubStrings -> ['10101']
nGramsSubStrings -> ['0101', '1010']

someMagicFunction -> ['0101', '1010']

Is this "Magic Function" a well known substring problem?

I really want to find the min. number of changes in nonEmptySubStrings so that it would have targetString as a substring.

Solution

Base on OP's comment to question, this is what is desired

import functools

def edit_distance(str1, str2): 
    #implement it here

f = functools.operator(edit_distance, target_string)
return min(f(s) for s in slices(string_))   # use slices from below

This will return the minimum edit distance of any substring to the target string. It will not indicate which string that is or what its index is. It could be easily modified to do so though.

The naive way, which can be the best way, is

import functools

def diff(str1, str2):
    # However you test the distance gets defined here. e.g. Hamming distance, 
    # Levenshtein distance, etc.


def slices(string_, L):
    for i in xrange(len(string_) - L + 1)):
        yield string_[i:i+L]

best_match = min(slices(string_), key=functools.partial(diff, target_string))

This wont return the index at which the substring occurs though. Of course you didn't specify that you need it in your question ;)

If you want to get better than this, it will depend on how you're measuring the distance and will basically boil down to avoiding checking some substrings by infering that you would have to change at least x chars to get a better match than you already have. At that point, you might as well just change x chars by jumping ahead x chars.

OTHER TIPS

I believe you need Edit Distance. Peter Norvig's spelling corrector is an implementation example in python. Here's an implementation of Levenshtein Distance. See also this question.

EDIT: This is fairly frequent in bioinformatics. See e.g. FASTA and BLAST. Bioinformatics has many flavors of this algorithm. See Sequence Alignment for a survey of methods.

As part of a discussion a while ago on gene matching, I wrote this pyparsing example, implementing a pyparsing class CloseMatch. Normally pyparsing expressions return a structure containing matched strings and any named results, but CloseMatch returns a 2-tuple containing the matching string and a list of mismatch locations within the matched string. Here is how CloseMatch would be used:

searchseq = CloseMatch("TTAAATCTAGAAGAT", 3)
for g in genedata: 
    print "%s (%d)" % (g.id, g.genelen) 
    print "-"*24 
    for t,startLoc,endLoc in searchseq.scanString(g.gene): 
        matched, mismatches = t[0] 
        print "MATCH:", searchseq.sequence 
        print "FOUND:", matched 
        if mismatches: 
            print "      ", ''.join(' ' if i not in mismatches else '*'  
                            for i,c in enumerate(searchseq.sequence)) 
        else: 
            print "<exact match>" 
        print "at location", startLoc

Here is a sample output of a partial match:

organism=Toxoplasma_gondii_RH (258)
------------------------
MATCH: TTAAATCTAGAAGAT
FOUND: TTAAATTTAGGAGCT
             *   *  * 
at location 195

Note that this class does not find overlapping matches. That can still be accomplished, but with a slightly different approach with scanString (which I will include in the next pyparsing release).

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow