Question

Is there an algorithm that finds the word-level edit distance between two sentences? For example, "A big fat dog" and "The big house with the fat dog" differ by 1 substitution and 3 insertions.

Solution

You can use the same algorithms that are used for finding edit distance in strings to find edit distances in sentences. You can think of a sentence as a string drawn from an alphabet where each character is a word in the English language (assuming that spaces are used to mark where one "character" starts and the next ends). Any standard algorithm for computing edit distance, such as the standard dynamic programming approach for computing Levenshtein distance, can be adapted to solve this problem.
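Since the question asks for a breakdown by operation rather than just a total, here is a minimal sketch of how the dynamic programming table could be walked backwards to count substitutions, insertions, and deletions. The class name WordEditOps, the method countOps, and the tie-breaking order (diagonal first, then insertion, then deletion) are assumptions made for illustration; splitting on a single space and case-sensitive comparison are simplifications.

```java
final class WordEditOps {

    /** Returns {substitutions, insertions, deletions} needed to turn sentence a into sentence b. */
    static int[] countOps(String a, String b) {
        String[] s = a.split(" ");
        String[] t = b.split(" ");
        int m = s.length, n = t.length;

        // d[i][j] = edit distance between the first i words of s and the first j words of t.
        int[][] d = new int[m + 1][n + 1];
        for (int i = 0; i <= m; i++) d[i][0] = i;
        for (int j = 0; j <= n; j++) d[0][j] = j;
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int cost = s[i - 1].equals(t[j - 1]) ? 0 : 1;
                d[i][j] = Math.min(d[i - 1][j - 1] + cost,
                        Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1));
            }
        }

        // Walk back from the bottom-right corner along one optimal path.
        int sub = 0, ins = 0, del = 0;
        int i = m, j = n;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0
                    && d[i][j] == d[i - 1][j - 1] + (s[i - 1].equals(t[j - 1]) ? 0 : 1)) {
                if (!s[i - 1].equals(t[j - 1])) sub++;   // diagonal step with cost 1
                i--;
                j--;
            } else if (j > 0 && d[i][j] == d[i][j - 1] + 1) {
                ins++;                                   // word present only in t
                j--;
            } else {
                del++;                                   // word present only in s
                i--;
            }
        }
        return new int[] {sub, ins, del};
    }

    public static void main(String[] args) {
        int[] ops = countOps("A big fat dog", "The big house with the fat dog");
        // prints "substitutions=1 insertions=3 deletions=0"
        System.out.println("substitutions=" + ops[0]
                + " insertions=" + ops[1] + " deletions=" + ops[2]);
    }
}
```

When several optimal paths exist, the split between operation types can depend on the tie-breaking order, so treat the counts as one valid breakdown rather than the only one.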

Other tips

In general, this is called the sequence alignment problem. It does not matter what entities you align (bits, characters, words, or DNA bases): an algorithm that works for one type of item works for any other, as long as the items can be compared for equality. What matters is whether you want global or local alignment.

Global alignment, which attempts to align every residue in every sequence, is most useful when the sequences are similar and of roughly equal size. A general global alignment technique is the Needleman-Wunsch algorithm, which is based on dynamic programming. When people talk about Levenshtein distance, they usually mean global alignment. The algorithm is so straightforward that several people discovered it independently, and you may come across the Wagner-Fischer algorithm, which is essentially the same thing but is mentioned more often in the context of edit distance between two strings of characters.

Local alignment is more useful for dissimilar sequences that are suspected to contain regions of similarity or similar sequence motifs within their larger sequence context. The Smith-Waterman algorithm is a general local alignment method, also based on dynamic programming. It is used rarely in natural language processing and far more often in bioinformatics.
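To make the local variant concrete, here is a rough sketch of Smith-Waterman scoring applied to word tokens. The class name LocalWordAlignment, the method bestLocalScore, and the scoring values used in the example (match +2, mismatch -1, gap -1) are assumptions chosen for illustration, not values taken from this answer. The only structural differences from the global table are that every cell is clamped at 0 and the result is the maximum over the whole table rather than the bottom-right cell.

```java
final class LocalWordAlignment {

    /**
     * Returns the best local alignment score between two word sequences.
     * match should be positive; mismatch and gap should be negative.
     */
    static int bestLocalScore(String[] a, String[] b, int match, int mismatch, int gap) {
        // h[i][j] = best score of any alignment ending at a[i-1] and b[j-1].
        // Row 0 and column 0 stay at 0, so an alignment may start anywhere.
        int[][] h = new int[a.length + 1][b.length + 1];
        int best = 0;
        for (int i = 1; i <= a.length; i++) {
            for (int j = 1; j <= b.length; j++) {
                int diag = h[i - 1][j - 1] + (a[i - 1].equals(b[j - 1]) ? match : mismatch);
                int up = h[i - 1][j] + gap;    // skip a word of a
                int left = h[i][j - 1] + gap;  // skip a word of b
                // Clamping at 0 lets a poor prefix be dropped instead of dragging the score down.
                h[i][j] = Math.max(0, Math.max(diag, Math.max(up, left)));
                best = Math.max(best, h[i][j]);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        String[] a = "we saw the big fat dog yesterday".split(" ");
        String[] b = "the big fat dog barked loudly".split(" ");
        // The shared run "the big fat dog" has four matches at +2 each, so this prints 8.
        System.out.println(bestLocalScore(a, b, 2, -1, -1));
    }
}
```

Recovering the matched words themselves would require a traceback from the highest-scoring cell until a 0 is reached, which this sketch leaves out.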

Here is a sample implementation of @templatetypedef's idea in ActionScript (it worked great for me). It calculates the normalized Levenshtein distance, in other words a value in the range [0..1]:

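  private function nlevenshtein(s1:String, s2:String):Number {
     // Split both sentences into words; each word acts as one "character".
     var tokens1:Array = s1.split(" ");
     var tokens2:Array = s2.split(" ");
     const len1:uint = tokens1.length, len2:uint = tokens2.length;

     // Declare the loop counters before their first use.
     var i:int;
     var j:int;

     // d[i][j] = edit distance between the first i words of s1
     // and the first j words of s2.
     var d:Vector.<Vector.<uint> > = new Vector.<Vector.<uint> >(len1 + 1);
     for (i = 0; i <= len1; ++i)
        d[i] = new Vector.<uint>(len2 + 1);

     // Base cases: turning a prefix into the empty sentence (or back)
     // costs one edit per word.
     for (i = 0; i <= len1; ++i) d[i][0] = i;
     for (j = 0; j <= len2; ++j) d[0][j] = j;

     // Each cell takes the cheapest of deletion, insertion, and match/substitution.
     for (i = 1; i <= len1; ++i)
        for (j = 1; j <= len2; ++j)
           d[i][j] = Math.min( Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
              d[i - 1][j - 1] + (tokens1[i - 1] == tokens2[j - 1] ? 0 : 1) );

     // Normalize by the word count of the longer sentence.
     return d[len1][len2] / Math.max(len1, len2);
  }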

I hope this will help!

The implementation in D is generalized over any range, and therefore over arrays. Split your sentences into arrays of strings, run them through the algorithm, and it returns the edit distance.

https://dlang.org/library/std/algorithm/comparison/levenshtein_distance.html
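A similar element-agnostic version can be sketched in Java with generics. This is not the D library's code, only an illustration of the same idea: the class name GenericEditDistance and the method distance are hypothetical, elements are compared with Objects.equals, and only two rows of the table are kept to save memory.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

final class GenericEditDistance {

    /** Levenshtein distance between two lists of any element type. */
    static <T> int distance(List<T> a, List<T> b) {
        int m = a.size(), n = b.size();
        int[] prev = new int[n + 1];  // row i - 1 of the table
        int[] curr = new int[n + 1];  // row i of the table
        for (int j = 0; j <= n; j++) prev[j] = j;

        for (int i = 1; i <= m; i++) {
            curr[0] = i;
            for (int j = 1; j <= n; j++) {
                int cost = Objects.equals(a.get(i - 1), b.get(j - 1)) ? 0 : 1;
                curr[j] = Math.min(prev[j - 1] + cost,
                        Math.min(prev[j] + 1, curr[j - 1] + 1));
            }
            // Swap the rows so that prev holds the row just computed.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n];
    }

    public static void main(String[] args) {
        List<String> s1 = Arrays.asList("first second third".split(" "));
        List<String> s2 = Arrays.asList("second".split(" "));
        System.out.println(distance(s1, s2));  // prints 2 (delete "first" and "third")
    }
}
```

The same method works on a list of characters, words, or any other objects with a meaningful equals. Index access with get is assumed to be cheap, as it is for ArrayList and the lists returned by Arrays.asList.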

Here is a Java implementation of the edit distance algorithm for sentences, using the dynamic programming approach.

public class EditDistance {

    public int editDistanceDP(String sentence1, String sentence2) {
        // Treat each space-separated word as one symbol.
        String[] s1 = sentence1.split(" ");
        String[] s2 = sentence2.split(" ");
        int m = s1.length;
        int n = s2.length;

        // solution[i][j] = edit distance between the first i words of s1
        // and the first j words of s2.
        int[][] solution = new int[m + 1][n + 1];

        // Base case: building j words from nothing takes j insertions.
        for (int j = 0; j <= n; j++) {
            solution[0][j] = j;
        }

        // Base case: reducing i words to nothing takes i deletions.
        for (int i = 0; i <= m; i++) {
            solution[i][0] = i;
        }

        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                if (s1[i - 1].equals(s2[j - 1]))
                    // Same word: no extra cost.
                    solution[i][j] = solution[i - 1][j - 1];
                else
                    // One edit plus the cheapest of insertion, deletion, substitution.
                    solution[i][j] = 1
                            + Math.min(solution[i][j - 1], Math.min(solution[i - 1][j], solution[i - 1][j - 1]));
            }
        }
        return solution[m][n];
    }

    public static void main(String[] args) {
        String sentence1 = "first second third";
        String sentence2 = "second";
        EditDistance ed = new EditDistance();
        // Prints "Edit Distance: 2" (delete "first" and "third").
        System.out.println("Edit Distance: " + ed.editDistanceDP(sentence1, sentence2));
    }
}
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow