Question

I am working with OCR output and searching for specific words inside it.

As the output is not clean, I look for elements that match my inputs with a word distance below a specific threshold.

However, I feel that neither the Levenshtein distance nor the Hamming distance is the best fit, because the OCR always seems to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" confusions should weigh less than, say, "A for K". In other words, these distances take no account of how visually similar or different two characters are.

Is there a word distance algorithm designed specifically for OCR that would better fit my case? Or should I build a custom word distance empirically, based on the visual differences between characters?


Solution

The Levenshtein distance allows you to specify a different cost for every substitution pair (see the fifth item under "Possible modifications": http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications). You can tune it to your needs by assigning a lower cost to the confusions your OCR commonly makes.
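A minimal sketch of this idea in Python, assuming a hypothetical confusion-cost table (the pairs and cost values below are illustrative; calibrate them against the mistakes your OCR engine actually produces):

```python
# Pairs the OCR commonly confuses get a reduced substitution cost.
# (Hypothetical values -- tune them on your own data.)
CONFUSION_COST = {
    frozenset("I1"): 0.1,
    frozenset("O0"): 0.1,
    frozenset("OQ"): 0.2,
}

def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)

def weighted_levenshtein(s: str, t: str,
                         insert_cost: float = 1.0,
                         delete_cost: float = 1.0) -> float:
    """Dynamic-programming edit distance with per-pair substitution costs."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j].
    prev = [j * insert_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * delete_cost] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + delete_cost,                       # delete s[i-1]
                cur[j - 1] + insert_cost,                    # insert t[j-1]
                prev[j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
            )
        prev = cur
    return prev[n]

if __name__ == "__main__":
    print(weighted_levenshtein("INVOICE", "1NVO1CE"))  # 0.2 (two cheap I/1 swaps)
    print(weighted_levenshtein("INVOICE", "KNVOICE"))  # 1.0 (A-for-K-style mismatch)
```

With the costs above, a word full of "classic" OCR swaps stays well under the threshold, while a genuinely different word does not.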

Other tips

If you want a custom cost function for letter mismatches, you could look at the Needleman–Wunsch algorithm (NW).
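Needleman–Wunsch is the score-maximizing dual of weighted edit distance: you align the two words globally and give visually similar pairs a milder mismatch penalty. A minimal sketch, with hypothetical scores and a hypothetical similar-pairs set:

```python
def nw_score(s: str, t: str,
             match: float = 1.0,
             mismatch: float = -1.0,
             gap: float = -1.0,
             near_miss: float = -0.2) -> float:
    """Needleman-Wunsch global alignment score; visually similar
    character pairs receive a milder penalty than true mismatches."""
    SIMILAR = {frozenset("I1"), frozenset("O0"), frozenset("OQ")}  # illustrative

    def score(a: str, b: str) -> float:
        if a == b:
            return match
        if frozenset((a, b)) in SIMILAR:
            return near_miss
        return mismatch

    m, n = len(s), len(t)
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * gap] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = max(
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # align the two chars
                prev[j] + gap,                            # gap in t
                cur[j - 1] + gap,                         # gap in s
            )
        prev = cur
    return prev[n]
```

Here a higher score means a better match, so you would accept candidates whose score exceeds a threshold rather than falls below one.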

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow