Domanda

I have the problem that I want to match all strings in the database having a certain edit distance to a given string.

My idea was to generate a regular expression that would match all strings with edit distance d to string s.

So for example I want to generate a regex r for d = 1 and s = 'abc' in the form of: r = 'abc|.abc|.bc|a.c|ab.|abc.' and so on. But I'm not sure if this is very efficient or are there already some good algorithms to that problem? I want to consider even character swaps in the edit distance. so 'acb' should also be part of r. I want to realise it in PHP and then make an SQL query: SELECT * FROM table WHERE name RLIKE TheRegularExpression.

Is it a good way to make it like that? Or what would you recommend?

È stato utile?

Soluzione

Probably the best thing to do is build up an iterative process for all the possibilities. In other words, something like this:

function findall($startString) {
    // create an array of all strings that are distance one away
    // each element would be $returnArray["abc"] = "abc";
}

$d = 2; // distance
$myArray[$startString] = $startString;

for($i = 0; $i < $d; $i++) {
    $newCombos = array_merge(array(), $myArray);
    foreach($myArray as $element) {
        $newCombos = array_merge($newCombos, findall($element));
    }
    $myArray = array_merge(array(), $newCombos);
}

$myRegex = implode("|", $myArray);

Altri suggerimenti

You can store a Levenshtein function in Mysql. After that you can simply do the search like this:

mysql_qery("SELECT `term` FROM `words` WHERE levenshtein('$word', `term`) BETWEEN 0 AND '$d'");

You need an implementation of Levenshtein Distance (or something very similar). Here is a function definition for use with MySQL.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top