Fuzzy string-matching that can "skip"? e.g. "i am (.*)." has 0 distance to "I am here."

https://stackoverflow.com/questions/17016947

31-05-2022
|

Вопрос

I'm writing a Python chatbot. No matter what the technique is(Levenshtein, LCS, regex, etc.), I want a pattern like My name is [ A ]. smart enough to match strings like:

My name is Tslmy.              #Distance should = 0, and groupdict()['a'] outputs "Tslmy"
My name is Tesla Tahomana.     #Distance should = 0(!), and groupdict()['a'] outputs "Tesla Tahomana"
my  naem ist tslmy .           #With a little typo, the distance = 5, and groupdict()['a'] outputs "tslmy "

Allow me to use groupdict()['a'] to refer to what the [ A ] thing (actually (?P<identifier>match)) has captured, please.

In other way, I'm looking for a "Levenshtein" with omits/skippings/blanks/neglects, and pick out what has been skipped as well.
In another way, I'm looking for a fuzzy(a.k.a. approximate) regex that can be less strict with the pattern, still provides the good old groupdict(), as well as a "fuzziness" value (or "edit distance", required to determine "the best matched pattern to the string" later).
This is the preferred solution, since it provides "sufficient" groupdict() if well managed.
However, The TRE library and the REGEX library, which is found to be the closest solution, don't seem to provide a "fuzziness" value. If this can be solved, then so much the better!

Is that possible? Thanks for paying attention.

Update:

I decided to use the powerful regex module in the end, but still unable to get the "fuzziness value".

Since the question on this page is theoratically solved, appending too further will be dishonorable. So I put forward another question about this new issue, and hopes you could solve it!

Решение

You could use a RegEx for the basic match:

r"My name is (\w+){1,2}."

And then use the TRE library to allow for variations.

Другие советы

DAT REGEX O_O

(?i)(?:(?:my|ym).?|.?(?:my|ym))\s+(?:.?(?:..me|n..e|na..)|(?:..me|n..e|na..).?)\s+(?:(?:is|si).?|.?(?:is|si))\s+(\w[\w\s])\s

Let's split it up:

(?i) : set the i modifier to match case insensitive
(?:(?:my|ym).?|.?(?:my|ym)) : this will match my, ym, My, Ym, may, amy etc...
\s+ : match white space one or more times
(?:.?(?:..am|n..e|na..)|(?:..am|n..e|na..).?) : match name, naao, tame, lame, n99e, names, Naats etc...
\s+ : match white space one or more times
(?:(?:is|si).?|.?(?:is|si)) : Match is, si, ist, sit, siR etc...
\s+ : match white space one or more times
(\w[\w\s]*) : match words and spaces one or more times and group it (it must start with a word \w)
\s* : match white spaces zero or more times

Online demo

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow