In my application, I am reading text from an image that contains digits and letters separated by hyphens (-).

For example: 1-TT88TY5-AD5G

However, Tesseract ignores the - and gives me 1TT88TY5AD5G..

How can I force it to read the hyphens too?

Here's my initial code:

Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];

Solution

I'm pretty much guessing here since I haven't used Tesseract, but shouldn't the - be in the whitelist?

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
                              ^

Other tips

Tesseract will not always recognize exactly what you want. Run it on as many sample images as you can, then apply some pattern matching based on how it actually performs.

Check what it returns in place of the -, and replace that character with '-' in the output.

In your case the - seems to come back as ., which looks wrong because your whitelist string doesn't contain a . at all. Add the - to the whitelist:

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
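
If stray characters still show up where the hyphen should be, the replacement step suggested above could look roughly like this. It is only a sketch in plain C++ that works on the recognized string and does not touch Tesseract itself. The helper name `normalizeHyphens` and the list of look-alike characters are assumptions for illustration; use whatever your own test runs show Tesseract returning in place of the -.

```cpp
#include <string>
#include <vector>

// Replace every occurrence of each look-alike string with a plain hyphen.
// Hypothetical helper: the look-alike list is an assumption and should be
// filled in from what Tesseract actually returns on your images.
std::string normalizeHyphens(std::string text,
                             const std::vector<std::string>& lookalikes) {
    for (const std::string& bad : lookalikes) {
        if (bad.empty()) continue;  // an empty pattern would match everywhere
        std::string::size_type pos = 0;
        while ((pos = text.find(bad, pos)) != std::string::npos) {
            text.replace(pos, bad.size(), "-");
            pos += 1;  // continue searching after the inserted hyphen
        }
    }
    return text;
}
```

For example, calling `normalizeHyphens(recognized, {".", "_"})` would turn every `.` and `_` in the recognized text into `-`. Only do this if those characters can never legitimately appear in your codes.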

You can use the methods below to see how confident Tesseract is in its result, overall and per word:

  /** Returns the (average) confidence value between 0 and 100. */
  int MeanTextConf();
  /**
   * Returns all word confidences (between 0 and 100) in an array, terminated
   * by -1.  The calling function must delete [] after use.
   * The number of confidences should correspond to the number of space-
   * delimited words in GetUTF8Text.
   */
  int* AllWordConfidences();
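
As a rough illustration of how the -1 terminated array described above might be consumed, here is a small sketch that picks out the low-confidence words. It deliberately takes a plain pointer so it can be tried without Tesseract. The helper name `lowConfidenceWords` and the threshold value are assumptions, not part of the Tesseract API.

```cpp
#include <cstddef>
#include <vector>

// Collect the indices of words whose confidence is below `threshold`.
// `confidences` is expected to follow the convention quoted above: values
// between 0 and 100, terminated by -1. This hypothetical helper does not
// take ownership of the array, so the caller still has to delete [] a
// pointer obtained from AllWordConfidences().
std::vector<std::size_t> lowConfidenceWords(const int* confidences,
                                            int threshold) {
    std::vector<std::size_t> indices;
    if (confidences == nullptr) return indices;
    for (std::size_t i = 0; confidences[i] != -1; ++i) {
        if (confidences[i] < threshold) indices.push_back(i);
    }
    return indices;
}
```

In real use the pointer would come from `AllWordConfidences()`, and the returned indices tell you which space-delimited words in the recognized text deserve a second look, for instance the ones where a hyphen may have been misread.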
Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow