Question

I'm using the Levenshtein Distance algorithm in C++ to compare two strings to measure how close they are to each other. However, the plain Levenshtein Distance algorithm does not distinguish word boundaries as delimited by spaces. This results in smaller distance calculations than I want. I'm comparing titles to see how close they are to each other and I wish for the algorithm to not count characters as matching if they come from across multiple words.

For example, if I compare these two strings I get the following result with + designating a match and - designating a non-match:

Al Chertoff Et
Al Church Department of finance Et
+++++------+--++-----++-+------+++
Al Ch      e  rt     of f       Et

I get a get a distance of 20 with the word "Chertoff" matching across the four words "Church Department of finance" whereas, I really want them to be considered further apart from each other by not allowing characters to match from more than one word and getting a distance of 25 with the word "Chertoff" most matching the one word "Department", with three characters matching:

Al Chertoff Et
Al Church Department of finance Et
+++--------+--++---------------+++
Al         e  rt                Et
         Ch     off

How could I adapt the Levenshtein Distance to accomplish this or is there another distance algorithm that would be better suited for this? Perhaps using the Levenshtein distance on each word individually word work and choosing the word with the least distance? However, what if matching one word well deep into the string caused the subsequent words to match poorly because their matches were best earlier in the string? Could this somehow be done with Levenshtein distance adapted to be at a word level?

For example, the shortest distance by this idea for the following more complicated example is 20:

Al Chertoff Deport Et
Al Church Department of finance Et
+++++----++++-++---------------+++
Al Ch     Dep rt                Et
     ertoff  o

Instead of maximizing "Chertoff"'s match and getting the longer distance of 24:

Al Chertoff Deport Et
Al Church Department of finance Et
+++--------+--++-----+---------+++
Al         e  rt     o          Et
         Ch     off
                  Dep rt

My current implementation of the Levenshtein Distance is as follows:

size_t
levenshtein_distance(const std::string& a_compare1,
                     const std::string& a_compare2) {
  const size_t length1 = a_compare1.size();
  const size_t length2 = a_compare2.size();
  std::vector<size_t> curr_col(length2 + 1);
  std::vector<size_t> prev_col(length2 + 1);

  // Prime the previous column for use in the following loop:
  for (size_t idx2 = 0; idx2 < length2 + 1; ++idx2) {
    prev_col[idx2] = idx2;
  }

  for (size_t idx1 = 0; idx1 < length1; ++idx1) {
    curr_col[0] = idx1 + 1;

    for (size_t idx2 = 0; idx2 < length2; ++idx2) {
      const size_t compare = a_compare1[idx1] == a_compare2[idx2] ? 0 : 1;

      curr_col[idx2 + 1] = std::min(std::min(curr_col[idx2] + 1,
                                             prev_col[idx2 + 1] + 1),
                                    prev_col[idx2] + compare);
    }

    curr_col.swap(prev_col);
  }

  return prev_col[length2];
}
Was it helpful?

Solution

I can get pretty close to what you want by making levenshtein_distance a generic algorithm on a sequence container and including a cost function that calculates the distance between two elements:

template<typename T, typename C>
size_t
seq_distance(const T& seq1, const T& seq2, const C& cost,
             const typename T::value_type& empty = typename T::value_type()) {
  const size_t size1 = seq1.size();
  const size_t size2 = seq2.size();

  std::vector<size_t> curr_col(size2 + 1);
  std::vector<size_t> prev_col(size2 + 1);

  // Prime the previous column for use in the following loop:
  prev_col[0] = 0;
  for (size_t idx2 = 0; idx2 < size2; ++idx2) {
    prev_col[idx2 + 1] = prev_col[idx2] + cost(empty, seq2[idx2]);
  }

  for (size_t idx1 = 0; idx1 < size1; ++idx1) {
    curr_col[0] = curr_col[0] + cost(seq1[idx1], empty);

    for (size_t idx2 = 0; idx2 < size2; ++idx2) {
      curr_col[idx2 + 1] = std::min(std::min(
        curr_col[idx2] + cost(empty, seq2[idx2]),
        prev_col[idx2 + 1] + cost(seq1[idx1], empty)),
        prev_col[idx2] + cost(seq1[idx1], seq2[idx2]));
    }

    curr_col.swap(prev_col);
    curr_col[0] = prev_col[0];
  }

  return prev_col[size2];
}

Given the above seq_distance, the edit distance between two sentences such that edits can not be made between word boundaries, can be defined with the following:

size_t
letter_distance(char letter1, char letter2) {
  return letter1 != letter2 ? 1 : 0;
}

size_t
word_distance(const std::string& word1, const std::string& word2) {
  return seq_distance(word1, word2, &letter_distance);
}

size_t
sentence_distance(const std::string& sentence1, const std::string& sentence2) {
  std::vector<std::string> words1;
  std::vector<std::string> words2;
  std::istringstream iss1(sentence1);
  std::istringstream iss2(sentence2);
  std::copy(std::istream_iterator<std::string>(iss1),
            std::istream_iterator<std::string>(),
            std::back_inserter(words1));
  std::copy(std::istream_iterator<std::string>(iss2),
            std::istream_iterator<std::string>(),
            std::back_inserter(words2));
  return seq_distance(words1, words2, &word_distance);
}

Here's the code working on ideone. I've tested a few cases and I'm pretty sure it does the right thing, but you should try it out more to make sure the results are reasonable.

Note that this isn't exactly what you asked for, since it ignores all spaces in the edit distance measurement: I think it shouldn't be too hard to modify it not to do that, but I haven't thought it through completely. In any case, this might be just as good (or even better), depending on your needs, so I'll let you decide if you want to try to tweak it.

Just a minor note, your original code was slightly buggy in that the following two lines:

curr_col.reserve(length2 + 1);
prev_col.reserve(length2 + 1);

reserve capacity in the vectors, but do not actually change the sizes of them, so accessing the array after that was undefined behavior. You should actually resize the vector if you're going to access elements in a range: reserve is usually for situations where you are about to push_back a certain number of elements one-by-one (which increases the size as you go, not all at once) and you want to avoid the cost of multiple internal reallocations (since the internal capacity only increases by a certain factor each time the capacity is exceeded).

EDIT:

This version takes into consideration spaces between words as part of the edit distance, but the results are still not exactly the same as your examples because of the requirement to add multiple spaces in some cases.

OTHER TIPS

Word boundaries will be crossed if the individual words are not of the same length. If you want to keep the indices being compared within the respective words then you will need to make the words of the same length. For example, here's a Javascript (yes, I know you asked or C++ but this is for illustration - code taken from Wikipedia) distance calculation routine:

var memo = {};

function d(str1, i, len1, str2, j, len2){
    var key = [i,len1,j,len2].join(',');
    if(memo[key] != undefined) return memo[key];

    if(len1 == 0) return len2;
    if(len2 == 0) return len1;
    var cost = 0;
    if(str1[i] != str2[j]) cost = 1;

    var dist = Math.min(
        d(str1, i+1,len1-1, str2,j,len2)+1, 
        d(str1,i,len1,str2,j+1,len2-1)+1,
        d(str1,i+1,len1-1,str2,j+1,len2-1)+cost);
    memo[key] = dist;
    return dist;
}

var str1 = "Al Chertoff Deport$$$$ $$ $$$$$$$ Et";
var str2 = "Al Church$$ Department of finance Et";

console.log(d(str1, 0, str1.length, str2, 0, str2.length));

Notice how I have modified the two input strings to match at individual word level. Running this I got a distance of 19. Similarly if I change the strings to:

var str1 = "Al Chertoff $$$$$$$$$$ $$ $$$$$$$ Et";
var str2 = "Al Church$$ Department of finance Et";

I get a distance of 24.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top