First of all, it doesn't matter what similar_text() outputs, because it uses a different algorithm to calculate the similarity between strings.
Let's try to understand why levenshtein() thinks that "hw r u my dear angel" is closer to "orange" than to "how are you". Wikipedia has a good definition of what Levenshtein distance is:
Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertion, deletion, substitution) required to change one word into the other.
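To make the definition concrete, here is a minimal sketch of the usual dynamic-programming way to compute this distance. It is written in Python purely for illustration; the function name levenshtein_distance is hypothetical, and it is not PHP's actual implementation, although PHP's built-in levenshtein() with its default costs would be expected to agree with it.

```python
def levenshtein_distance(source: str, target: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn `source` into `target`."""
    # previous[j] = distance between the source prefix handled so far
    # and the first j characters of target (starts with the empty prefix).
    previous = list(range(len(target) + 1))
    for i, source_char in enumerate(source, start=1):
        current = [i]  # turning i characters into "" takes i deletions
        for j, target_char in enumerate(target, start=1):
            substitution_cost = 0 if source_char == target_char else 1
            current.append(min(
                previous[j] + 1,                      # delete source_char
                current[j - 1] + 1,                   # insert target_char
                previous[j - 1] + substitution_cost,  # keep or substitute
            ))
        previous = current
    return previous[-1]


print(levenshtein_distance("kitten", "sitting"))  # 3
```

Each cell only looks at three neighbours, so keeping two rows is enough and the whole table never has to be stored.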
Now let's count how many edits we have to make to change "hw r u my dear angel" into "orange".
- hw r u my dear angel → hw r u my dear ange (deletion of last character)
- hw r u my dear ange → hw r u my dearange (deletion of last space)
- hw r u my dearange → arange (deletion of first 12 characters)
- arange → orange (substitution of a with o)
So it takes 1 + 1 + 12 + 1 = 15 edits in total to change "hw r u my dear angel" into "orange".
And here is the transformation of "hw r u my dear angel" into "how are you".
- hw r u my dear angel → how r u my dear angel (insertion of o character)
- how r u my dear angel → how dear angel (deletion of 7 characters)
- how dear angel → how ar angel (deletion of 2 characters)
- how ar angel → how are angel (insertion of e character)
- how are angel → how are ang (deletion of last 2 characters)
- how are ang → how are you (substitution of last 3 characters)
Total: 1 + 7 + 2 + 1 + 2 + 3 = 16 edits. So, as you can see, in terms of Levenshtein distance "orange" is closer to "hw r u my dear angel" ;-)
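If you want to double-check the hand counts, the two lists of steps above can be replayed mechanically. The sketch below is Python for illustration only, and the helper names (is_subsequence, step_cost, script_cost) are made up for this example. It assumes every step is purely deletions, purely insertions, or purely substitutions, which is true of the steps listed here. Note that it only confirms the cost of these particular edit scripts; it does not by itself prove that no shorter script exists.

```python
from typing import Sequence


def is_subsequence(shorter: str, longer: str) -> bool:
    """True if `shorter` can be obtained from `longer` by deleting characters."""
    remaining = iter(longer)
    return all(char in remaining for char in shorter)


def step_cost(before: str, after: str) -> int:
    """Edits used by one step made only of deletions, only of insertions,
    or only of substitutions."""
    if len(before) == len(after):
        # Substitutions only: count the positions that differ.
        return sum(a != b for a, b in zip(before, after))
    shorter, longer = sorted((before, after), key=len)
    if not is_subsequence(shorter, longer):
        raise ValueError(f"{before!r} -> {after!r} mixes edit types")
    return len(longer) - len(shorter)


def script_cost(strings: Sequence[str]) -> int:
    """Total edits along a chain of intermediate strings."""
    return sum(step_cost(a, b) for a, b in zip(strings, strings[1:]))


to_orange = [
    "hw r u my dear angel",
    "hw r u my dear ange",
    "hw r u my dearange",
    "arange",
    "orange",
]
to_how_are_you = [
    "hw r u my dear angel",
    "how r u my dear angel",
    "how dear angel",
    "how ar angel",
    "how are angel",
    "how are ang",
    "how are you",
]

print(script_cost(to_orange))       # 15
print(script_cost(to_how_are_you))  # 16
```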