Question

I am working with OCR output and searching for specific words inside it.

As the output is not clean, I look for elements that match my inputs with a word distance below a specific threshold.

However, I feel that neither the Levenshtein distance nor the Hamming distance is the best fit, because the OCR always seems to make the same mistakes: I for 1, 0 for O, Q for O... and these "classic" confusions should weigh less than, say, "A for K". In other words, these distances take no account of how visually similar or different two characters are.

Is there a word distance algorithm designed specifically for OCR that would better fit my case? Or should I build a custom word distance empirically, based on the visual differences between characters?


Solution

The Levenshtein distance allows you to specify a different cost for every substitution pair (see the fifth item under "Possible modifications": http://en.wikipedia.org/wiki/Levenshtein_distance#Possible_modifications). You can tune it to your needs by assigning a lower cost to the confusions your OCR commonly makes.
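A minimal sketch of this idea in Python, assuming a hypothetical confusion-cost table (the pairs and cost values below are illustrative; calibrate them against the mistakes your OCR engine actually produces):

```python
# Pairs the OCR commonly confuses get a reduced substitution cost.
# (Hypothetical values -- tune them on your own data.)
CONFUSION_COST = {
    frozenset("I1"): 0.1,
    frozenset("O0"): 0.1,
    frozenset("OQ"): 0.2,
}

def sub_cost(a: str, b: str) -> float:
    """Cost of substituting character a with b."""
    if a == b:
        return 0.0
    return CONFUSION_COST.get(frozenset((a, b)), 1.0)

def weighted_levenshtein(s: str, t: str,
                         insert_cost: float = 1.0,
                         delete_cost: float = 1.0) -> float:
    """Dynamic-programming edit distance with per-pair substitution costs."""
    m, n = len(s), len(t)
    # prev[j] holds the distance between s[:i-1] and t[:j].
    prev = [j * insert_cost for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * delete_cost] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = min(
                prev[j] + delete_cost,                       # delete s[i-1]
                cur[j - 1] + insert_cost,                    # insert t[j-1]
                prev[j - 1] + sub_cost(s[i - 1], t[j - 1]),  # substitute
            )
        prev = cur
    return prev[n]

if __name__ == "__main__":
    print(weighted_levenshtein("INVOICE", "1NVO1CE"))  # 0.2 (two cheap I/1 swaps)
    print(weighted_levenshtein("INVOICE", "KNVOICE"))  # 1.0 (A-for-K-style mismatch)
```

With the costs above, a word full of "classic" OCR swaps stays well under the threshold, while a genuinely different word does not.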

Other tips

If you want a custom cost function for letter mismatches, you could look at the Needleman–Wunsch algorithm (NW).
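Needleman–Wunsch is the score-maximizing dual of weighted edit distance: you align the two words globally and give visually similar pairs a milder mismatch penalty. A minimal sketch, with hypothetical scores and a hypothetical similar-pairs set:

```python
def nw_score(s: str, t: str,
             match: float = 1.0,
             mismatch: float = -1.0,
             gap: float = -1.0,
             near_miss: float = -0.2) -> float:
    """Needleman-Wunsch global alignment score; visually similar
    character pairs receive a milder penalty than true mismatches."""
    SIMILAR = {frozenset("I1"), frozenset("O0"), frozenset("OQ")}  # illustrative

    def score(a: str, b: str) -> float:
        if a == b:
            return match
        if frozenset((a, b)) in SIMILAR:
            return near_miss
        return mismatch

    m, n = len(s), len(t)
    prev = [j * gap for j in range(n + 1)]
    for i in range(1, m + 1):
        cur = [i * gap] + [0.0] * n
        for j in range(1, n + 1):
            cur[j] = max(
                prev[j - 1] + score(s[i - 1], t[j - 1]),  # align the two chars
                prev[j] + gap,                            # gap in t
                cur[j - 1] + gap,                         # gap in s
            )
        prev = cur
    return prev[n]
```

Here a higher score means a better match, so you would accept candidates whose score exceeds a threshold rather than falls below one.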

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow