Approximate string matching with a letter confusion matrix?

https://stackoverflow.com/questions/2702332

01-10-2019
|

Question

I'm trying to model a phonetic recognizer that has to isolate instances of words (strings of phones) out of a long stream of phones that doesn't have gaps between each word. The stream of phones may have been poorly recognized, with letter substitutions/insertions/deletions, so I will have to do approximate string matching.

However, I want the matching to be phonetically-motivated, e.g. "m" and "n" are phonetically similar, so the substitution cost of "m" for "n" should be small, compared to say, "m" and "k". So, if I'm searching for [mein] "main", it would match the letter sequence [meim] "maim" with, say, cost 0.1, whereas it would match the letter sequence [meik] "make" with, say, cost 0.7. Similarly, there are differing costs for inserting or deleting each letter. I can supply a confusion matrix that, for each letter pair (x,y), gives the cost of substituting x with y, where x and y are any letter or the empty string.

I know that there are tools available that do approximate matching such as agrep, but as far as I can tell, they do not take a confusion matrix as input. That is, the cost of any insertion/substitution/deletion = 1. My question is, are there any open-source tools already available that can do approximate matching with confusion matrices, and if not, what is a good algorithm that I can implement to accomplish this?

EDIT: just to be clear, I'm trying to isolate approximate instances of a word such as [mein] from a longer string, e.g. [aiammeinlimeiking...]. Ideally, the algorithm/tool should report instances such as [mein] with cost 0.0 (exact match), [meik] with cost 0.7 (near match), etc, for all approximate string matches with a cost below a given threshold.

No correct solution

OTHER TIPS

I'm not aware of any phonetic recognizers that use confusion matrices. I know of Soundex, and match rating.

I think that the K-nearest neighbour algorithm might be useful for the type of approximations you are interested in.

Peter Kleiweg's Rug/L04 (for computational dialectology) includes an implementation of Levenshtein distance which allows you to specify non-uniform insertion, deletion, and substitution costs.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow