Question

In my application, I am reading text from an image that contains numbers and letters separated by hyphens (-).

For example: 1-TT88TY5-AD5G

However, Tesseract ignores the - and gives me 1TT88TY5AD5G.

How can I force it to read the hyphens too?

Here's my initial code:

Tesseract* tesseract = [[Tesseract alloc] initWithDataPath:@"tessdata" language:@"eng"];
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];

Solution

I'm pretty much guessing here since I haven't used Tesseract, but shouldn't the - be in the whitelist?

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
                              ^

Other tips

Tesseract will not always recognize exactly what you want. Test it on as many samples as you can, then apply pattern matching based on how it actually performs.

See what it returns in place of the -, then replace that character with '-' in the output.

In your case the - appears to be replaced with ., which looks wrong because your whitelist string doesn't contain a . at all. The whitelist should include the hyphen:

[tesseract setVariableValue:@"-0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ" forKey:@"tessedit_char_whitelist"];
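The "replace what Tesseract returns, then pattern match" idea above can be sketched in plain C++ on the recognized string. This is only an illustration: RestoreHyphens and LooksLikeCode are hypothetical helper names, the set of stand-in characters is something you would find by testing your own images, and the pattern (alphanumeric groups joined by single hyphens) is an assumption based on the example 1-TT88TY5-AD5G.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper: replace every character listed in `stand_ins`
// (characters the OCR was observed to emit instead of '-') with '-'.
std::string RestoreHyphens(std::string text, const std::string& stand_ins) {
  for (char& c : text) {
    if (stand_ins.find(c) != std::string::npos) {
      c = '-';
    }
  }
  return text;
}

// Hypothetical check: two or more alphanumeric groups joined by single
// hyphens. The exact shape of a valid code is an assumption.
bool LooksLikeCode(const std::string& text) {
  static const std::regex pattern("[0-9A-Za-z]+(-[0-9A-Za-z]+)+");
  return std::regex_match(text, pattern);
}
```

If the corrected string still fails the check (for example because the hyphens were dropped entirely rather than replaced), that is a signal to reprocess the image or reject the result instead of trusting it.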

You can use the methods below to see how confident Tesseract is in its output (a mean over the whole text, and one value per word):

  /** Returns the (average) confidence value between 0 and 100. */
  int MeanTextConf();
  /**
   * Returns all word confidences (between 0 and 100) in an array, terminated
   * by -1.  The calling function must delete [] after use.
   * The number of confidences should correspond to the number of space-
   * delimited words in GetUTF8Text.
   */
  int* AllWordConfidences();
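As a rough sketch of how the -1-terminated array described above might be consumed, the helpers below copy it into a vector and pick out low-confidence words. CollectConfidences, LowConfidenceWords, and the threshold are hypothetical names and values chosen for illustration; the commented-out usage assumes `api` is an initialized tesseract::TessBaseAPI that has already run recognition.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy a -1-terminated confidence array into a vector.
// The terminating -1 is not included in the result.
std::vector<int> CollectConfidences(const int* confidences) {
  std::vector<int> out;
  if (confidences == nullptr) {
    return out;
  }
  for (const int* p = confidences; *p != -1; ++p) {
    out.push_back(*p);
  }
  return out;
}

// Hypothetical helper: indices of words whose confidence is below `threshold`.
std::vector<std::size_t> LowConfidenceWords(const std::vector<int>& conf,
                                            int threshold) {
  std::vector<std::size_t> low;
  for (std::size_t i = 0; i < conf.size(); ++i) {
    if (conf[i] < threshold) {
      low.push_back(i);
    }
  }
  return low;
}

// Intended use with a real engine (not compiled here):
//   int* raw = api.AllWordConfidences();
//   std::vector<int> conf = CollectConfidences(raw);
//   delete[] raw;  // the caller owns the array, per the header comment
```

Note that these confidences are per word, not per character, so a low value only tells you which word to inspect or re-run, not which character was misread.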
License: CC-BY-SA with attribution
Not affiliated with StackOverflow