Question

I am learning how to use the Tesseract API and I am interested in the hOCR output function. Currently I am using this code to scan the image.

Tesseract *tesseract = [[Tesseract alloc] initWithLanguage:@"eng"];
tesseract.delegate = self;
[tesseract setVariableValue:@"0123456789abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ@.-():" forKey:@"tessedit_char_whitelist"];
[tesseract setVariableValue:@"0" forKey:@"tessedit_create_hocr"];

UIImage *image = [UIImage imageNamed:@"card.jpg"];

CGFloat newWidth = 1200;
CGSize newSize = CGSizeMake(newWidth, newWidth);
image = [image resizedImage:newSize interpolationQuality:kCGInterpolationHigh];


[tesseract setImage:image]; //image to check
[tesseract recognize];

NSLog(@"Here is the text %@", [tesseract recognizedText]);

Everything is compiling fine, but I want to know how to store the .html that is returned by the hOCR function. Can I store it inside of a variable? I need to be able to access this file in my program after it has been generated. Any insight on how to use hOCR on iOS is appreciated.

Solution

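Yes, you can keep the hOCR output in a variable. Add a method like the following to the Tesseract wrapper class and it will return the hOCR markup as an NSString: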

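- (NSString *)getHOCRText {
    // GetHOCRText returns a heap-allocated C string; the caller must release it with delete[].
    char *hocrText = _tesseract->GetHOCRText(0); // 0 = page number
    if (hocrText == NULL) return nil;
    NSString *result = [NSString stringWithUTF8String:hocrText];
    delete[] hocrText;
    return result;
}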
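
Since the question also asks about accessing the output as a file later, here is a minimal sketch of writing the returned string to disk. `SaveHOCR` and its parameters are hypothetical names chosen for illustration; they are not part of the Tesseract wrapper.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper (not part of the Tesseract wrapper): writes the hOCR
// markup to <directory>/<fileName> as UTF-8 and returns the full path,
// or nil if the write fails.
static NSString *SaveHOCR(NSString *hocr, NSString *directory, NSString *fileName) {
    NSString *path = [directory stringByAppendingPathComponent:fileName];
    NSError *error = nil;
    BOOL ok = [hocr writeToFile:path
                     atomically:YES
                       encoding:NSUTF8StringEncoding
                          error:&error];
    if (!ok) {
        NSLog(@"Could not write hOCR file: %@", error);
        return nil;
    }
    return path;
}
```

On iOS the directory would typically be the app's Documents folder, for example `NSSearchPathForDirectoriesInDomains(NSDocumentDirectory, NSUserDomainMask, YES).firstObject`, and the file can be read back with `+[NSString stringWithContentsOfFile:encoding:error:]`.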

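Later you can convert this NSString (called hocrString here) to NSData. Use UTF-8 rather than ASCII, because an ASCII conversion returns nil as soon as the recognized text contains a non-ASCII character.

    NSData *xmlData = [hocrString dataUsingEncoding:NSUTF8StringEncoding];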

You can then parse this data with NSXMLParser:

        NSXMLParser *xmlParser = [[NSXMLParser alloc] initWithData:xmlData];

From there, parsing works as with any other NSXMLParser: assign a delegate that implements NSXMLParserDelegate and call parse.
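
As a rough sketch of that remaining parsing step, the delegate below collects every recognized word together with its `title` attribute, which in hOCR carries the bounding box and confidence. `HOCRWordCollector` and `WordsFromHOCR` are hypothetical names, and the sketch assumes the markup returned by Tesseract is well-formed XML with word spans marked `class='ocrx_word'`.

```objc
#import <Foundation/Foundation.h>

// Hypothetical helper: gathers the text and "title" attribute of each
// element whose class is "ocrx_word".
@interface HOCRWordCollector : NSObject <NSXMLParserDelegate>
@property (nonatomic, strong) NSMutableArray<NSDictionary *> *words;
@end

@implementation HOCRWordCollector {
    NSMutableString *_currentText;  // non-nil only while inside a word element
    NSString *_currentTitle;
    NSInteger _nestedDepth;         // elements opened inside the current word
}

- (instancetype)init {
    if ((self = [super init])) { _words = [NSMutableArray array]; }
    return self;
}

- (void)parser:(NSXMLParser *)parser didStartElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName
    attributes:(NSDictionary<NSString *, NSString *> *)attributes {
    if (_currentText != nil) {
        _nestedDepth++;  // e.g. <strong> or <em> inside a word
    } else if ([attributes[@"class"] isEqualToString:@"ocrx_word"]) {
        _currentText = [NSMutableString string];
        _currentTitle = attributes[@"title"] ?: @"";
        _nestedDepth = 0;
    }
}

- (void)parser:(NSXMLParser *)parser foundCharacters:(NSString *)string {
    [_currentText appendString:string];  // no-op when outside a word
}

- (void)parser:(NSXMLParser *)parser didEndElement:(NSString *)elementName
  namespaceURI:(NSString *)namespaceURI qualifiedName:(NSString *)qName {
    if (_currentText == nil) return;
    if (_nestedDepth > 0) { _nestedDepth--; return; }
    [self.words addObject:@{@"text": [_currentText copy], @"title": _currentTitle}];
    _currentText = nil;
}
@end

// Parses an hOCR string and returns one dictionary per recognized word.
static NSArray<NSDictionary *> *WordsFromHOCR(NSString *hocr) {
    NSData *data = [hocr dataUsingEncoding:NSUTF8StringEncoding];
    NSXMLParser *parser = [[NSXMLParser alloc] initWithData:data];
    HOCRWordCollector *collector = [[HOCRWordCollector alloc] init];
    parser.delegate = collector;  // the parser does not retain its delegate
    [parser parse];
    return collector.words;
}
```

Calling `WordsFromHOCR([tesseract getHOCRText])` after `recognize` would then give you the words and their `title` strings to work with in the rest of your program.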

Licensed under: CC-BY-SA with attribution