I can get pretty close to what you want by making levenshtein_distance
a generic algorithm on a sequence container and including a cost function that calculates the distance between two elements:
template<typename T, typename C>
size_t
seq_distance(const T& seq1, const T& seq2, const C& cost,
const typename T::value_type& empty = typename T::value_type()) {
const size_t size1 = seq1.size();
const size_t size2 = seq2.size();
std::vector<size_t> curr_col(size2 + 1);
std::vector<size_t> prev_col(size2 + 1);
// Prime the previous column for use in the following loop:
prev_col[0] = 0;
for (size_t idx2 = 0; idx2 < size2; ++idx2) {
prev_col[idx2 + 1] = prev_col[idx2] + cost(empty, seq2[idx2]);
}
for (size_t idx1 = 0; idx1 < size1; ++idx1) {
curr_col[0] = curr_col[0] + cost(seq1[idx1], empty);
for (size_t idx2 = 0; idx2 < size2; ++idx2) {
curr_col[idx2 + 1] = std::min(std::min(
curr_col[idx2] + cost(empty, seq2[idx2]),
prev_col[idx2 + 1] + cost(seq1[idx1], empty)),
prev_col[idx2] + cost(seq1[idx1], seq2[idx2]));
}
curr_col.swap(prev_col);
curr_col[0] = prev_col[0];
}
return prev_col[size2];
}
Given the above seq_distance
, the edit distance between two sentences such that edits can not be made between word boundaries, can be defined with the following:
size_t
letter_distance(char letter1, char letter2) {
return letter1 != letter2 ? 1 : 0;
}
size_t
word_distance(const std::string& word1, const std::string& word2) {
return seq_distance(word1, word2, &letter_distance);
}
size_t
sentence_distance(const std::string& sentence1, const std::string& sentence2) {
std::vector<std::string> words1;
std::vector<std::string> words2;
std::istringstream iss1(sentence1);
std::istringstream iss2(sentence2);
std::copy(std::istream_iterator<std::string>(iss1),
std::istream_iterator<std::string>(),
std::back_inserter(words1));
std::copy(std::istream_iterator<std::string>(iss2),
std::istream_iterator<std::string>(),
std::back_inserter(words2));
return seq_distance(words1, words2, &word_distance);
}
Here's the code working on ideone. I've tested a few cases and I'm pretty sure it does the right thing, but you should try it out more to make sure the results are reasonable.
Note that this isn't exactly what you asked for, since it ignores all spaces in the edit distance measurement: I think it shouldn't be too hard to modify it not to do that, but I haven't thought it through completely. In any case, this might be just as good (or even better), depending on your needs, so I'll let you decide if you want to try to tweak it.
Just a minor note, your original code was slightly buggy in that the following two lines:
curr_col.reserve(length2 + 1);
prev_col.reserve(length2 + 1);
reserve capacity in the vectors, but do not actually change the sizes of them, so accessing the array after that was undefined behavior. You should actually resize
the vector if you're going to access elements in a range: reserve
is usually for situations where you are about to push_back
a certain number of elements one-by-one (which increases the size as you go, not all at once) and you want to avoid the cost of multiple internal reallocations (since the internal capacity only increases by a certain factor each time the capacity is exceeded).
EDIT:
This version takes into consideration spaces between words as part of the edit distance, but the results are still not exactly the same as your examples because of the requirement to add multiple spaces in some cases.