Word distance algorithm for OCR
24-06-2023

Question

I am working with OCR output and searching for specific words inside it.
As the output is not clean, I look for elements that match my inputs within a word distance below a given threshold.
However, I feel that the Levenshtein distance and the Hamming distance are not the best fit, because the OCR always seems to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" mistakes seem less serious than, say, "A for K". These distances, however, take no account of how visually similar (or dissimilar) the confused characters are.
Is there a word distance algorithm designed specifically for OCR that would better fit my case? Or should I build a custom word distance empirically, based on the visual differences between characters?

Solution

The Levenshtein distance allows you to specify a different cost for every substitution pair (http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications, fifth item), so you can tune it to your needs by giving less weight to the common mistakes.
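A weighted Levenshtein distance with per-pair substitution costs can be sketched as follows. The cost table here is a hypothetical example: cheap costs for the classic OCR confusions mentioned above (I/1, O/0, O/Q), full cost for any other substitution.

```python
# Hypothetical cost table: visually similar pairs get a low cost.
OCR_CONFUSIONS = {
    frozenset("I1"): 0.2,
    frozenset("O0"): 0.2,
    frozenset("OQ"): 0.2,
}

def substitution_cost(a: str, b: str) -> float:
    if a == b:
        return 0.0
    return OCR_CONFUSIONS.get(frozenset((a, b)), 1.0)

def weighted_levenshtein(s: str, t: str) -> float:
    # Standard dynamic-programming table; insertions and deletions
    # keep a fixed cost of 1, substitutions use the table above.
    m, n = len(s), len(t)
    d = [[0.0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        d[i][0] = float(i)
    for j in range(1, n + 1):
        d[0][j] = float(j)
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(
                d[i - 1][j] + 1,  # deletion
                d[i][j - 1] + 1,  # insertion
                d[i - 1][j - 1] + substitution_cost(s[i - 1], t[j - 1]),
            )
    return d[m][n]
```

With this table, `weighted_levenshtein("K1NG", "KING")` is 0.2 while `weighted_levenshtein("KANG", "KING")` is 1.0, so the common OCR confusion stays well under a matching threshold that still rejects arbitrary mismatches.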
Other tips
If you want a custom cost function for letter mismatches, you could look at the Needleman–Wunsch algorithm (NW):
- Wikipedia: http://en.wikipedia.org/wiki/Needleman%E2%80%93Wunsch_algorithm
- OCR paper related to the NW algorithm: http://oro.open.ac.uk/20855/1/paper-15.pdf
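As a rough illustration (not the implementation from the paper above), a basic Needleman–Wunsch global alignment with an assumed scoring scheme — +2 for a match, -1 for a cheap OCR confusion, -2 for other mismatches, and a gap penalty of -2 — might look like:

```python
# Assumed scoring scheme; tune these values for your own OCR error model.
CHEAP_MISMATCHES = {frozenset("I1"), frozenset("O0"), frozenset("OQ")}
GAP = -2

def score(a: str, b: str) -> int:
    if a == b:
        return 2
    if frozenset((a, b)) in CHEAP_MISMATCHES:
        return -1
    return -2

def needleman_wunsch(s: str, t: str) -> int:
    # Fill the global-alignment score table; higher scores mean a
    # better alignment (unlike a distance, which you would minimize).
    m, n = len(s), len(t)
    f = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        f[i][0] = i * GAP
    for j in range(1, n + 1):
        f[0][j] = j * GAP
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            f[i][j] = max(
                f[i - 1][j - 1] + score(s[i - 1], t[j - 1]),  # (mis)match
                f[i - 1][j] + GAP,                            # gap in t
                f[i][j - 1] + GAP,                            # gap in s
            )
    return f[m][n]
```

Because NW scores alignments rather than counting edits, "K1NG" vs "KING" scores 5 here while "KANG" vs "KING" scores 4: the cheap confusion is penalized less, which is exactly the property you are after.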