Question

I am looking to convert a large number of image files into text using Tesseract.

I have looked at their documentation but have no idea how it relates to PHP or how my PHP script will interact with Tesseract OCR. Other questions suggest that PHP's exec() might be the way.

$img = myimage.png;
$text = exec($img,'tesseract');

I have downloaded and installed Tesseract. I am using Windows 7 with a recent version of XAMPP installed. I have beginner-to-intermediate knowledge of PHP. What knowledge am I missing?

Update: I now have it working in PowerShell and cmd with

tesseract.exe D:\Documents\Web_Development\Sandbox\php\images\23.png D:\Documents\Web_Development\Sandbox\php\images\23

But when I try to run it through exec() like this:

<?php 
exec('tesseract.exe D:\Documents\Web_Development\Sandbox\images\23.png D:\Documents\Web_Development\Sandbox\images\23');
?>

I get a popup from Windows that says tesseract.exe has stopped working. Here are the error details, if they mean anything to anyone.

Problem signature:
  Problem Event Name:   BEX
  Application Name: tesseract.exe
  Application Version:  0.0.0.0
  Application Timestamp:    4ca507b3
  Fault Module Name:    MSVCR90.dll
  Fault Module Version: 9.0.30729.4926
  Fault Module Timestamp:   4a1743c1
  Exception Offset: 0002f93e
  Exception Code:   c0000417
  Exception Data:   00000000
  OS Version:   6.1.7600.2.0.0.768.3
  Locale ID:    1033
  Additional Information 1: e958
  Additional Information 2: e95831f9d00a16a326250da660e931c5
  Additional Information 3: 040a
  Additional Information 4: 040a259d27c5ccf749ee18722d5fbec0

Solution

First get it working without PHP, that is, run it from the Windows command line (the MS-DOS prompt). After that, pass whatever you typed at the command line to PHP's exec() or another process-control function, and eventually parameterize it with PHP variables.

For example, if in the CLI you would be typing

ipconfig /all

to get the IP configuration of the system, then in PHP you'd simply use:

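<?php
// exec() returns only the last line, so collect every line in $output
exec('ipconfig /all', $output);
echo '<pre>';
echo implode("\n", $output);
echo '</pre>';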

Back to your problem, if in the CLI you'd be issuing:

tesseract document.tif result

Then in PHP you'd do

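<?php
// tesseract writes the recognized text to result.txt, not to standard output
exec('tesseract document.tif result');
echo '<pre>';
echo htmlspecialchars(file_get_contents('result.txt'));
echo '</pre>';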
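
Since the goal is to convert many images, the call above can be wrapped in a small function that also checks whether tesseract succeeded. The following is only a sketch: build_tesseract_command() and ocr_image() are hypothetical helper names, the images/ folder is an assumed location, and the type declarations assume PHP 7.1 or later. It relies on tesseract writing its text to <output base>.txt, as in the command-line usage shown earlier.

```php
<?php
// Hypothetical helper: builds the command line for one image.
// escapeshellarg() quotes paths that contain spaces or special characters.
function build_tesseract_command(string $image, string $outputBase): string
{
    // "2>&1" folds tesseract's error messages into the captured output
    return 'tesseract ' . escapeshellarg($image) . ' '
        . escapeshellarg($outputBase) . ' 2>&1';
}

// Hypothetical helper: runs tesseract on one image and returns the
// recognized text, or null if the command failed.
function ocr_image(string $image, string $outputBase): ?string
{
    $output = [];
    $exitCode = 0;
    exec(build_tesseract_command($image, $outputBase), $output, $exitCode);

    if ($exitCode !== 0 || !is_file($outputBase . '.txt')) {
        // Log what tesseract printed so the failure can be diagnosed
        error_log("tesseract failed ($exitCode): " . implode("\n", $output));
        return null;
    }
    return file_get_contents($outputBase . '.txt');
}

// Assumed layout: PNG files in an "images" folder next to this script.
foreach (glob('images/*.png') ?: [] as $image) {
    // images/23.png -> output base images/23 -> text file images/23.txt
    $base = pathinfo($image, PATHINFO_DIRNAME) . '/'
        . pathinfo($image, PATHINFO_FILENAME);
    $text = ocr_image($image, $base);
    echo $image . ': ' . ($text === null ? 'failed' : strlen($text) . ' bytes') . "\n";
}
```

Checking the exit code and the captured output is usually the quickest way to see why a command that works at the prompt fails when PHP launches it.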

That's about it. It's not specific to tesseract; it works with any program that has a command-line interface.

If you need more control over the output or the input (as is the case when the program asks the user for input while it is running), you should use the proc_*() family of functions; see http://ch2.php.net/manual/en/function.exec.php
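
As a rough illustration of the proc_*() approach, the sketch below uses proc_open() to capture standard output, standard error and the exit code separately. run_command() is a hypothetical helper name, not part of PHP. It reads stdout completely before stderr, which is fine for short output but could stall if a program writes a very large amount to stderr first.

```php
<?php
// Hypothetical helper: runs a command and returns its exit code,
// standard output and standard error as separate values.
function run_command(string $command): array
{
    $descriptors = [
        0 => ['pipe', 'r'], // the child reads its stdin from this pipe
        1 => ['pipe', 'w'], // the child writes its stdout to this pipe
        2 => ['pipe', 'w'], // the child writes its stderr to this pipe
    ];

    $process = proc_open($command, $descriptors, $pipes);
    if (!is_resource($process)) {
        throw new RuntimeException("Could not start: $command");
    }

    fclose($pipes[0]); // nothing to send to the program

    $stdout = stream_get_contents($pipes[1]);
    $stderr = stream_get_contents($pipes[2]);
    fclose($pipes[1]);
    fclose($pipes[2]);

    // proc_close() waits for the program to end and returns its exit code
    $exitCode = proc_close($process);

    return ['exit' => $exitCode, 'stdout' => $stdout, 'stderr' => $stderr];
}
```

For example, run_command('tesseract document.tif result') would return tesseract's messages under 'stdout' and 'stderr' and its exit code under 'exit', while the recognized text would still be written to result.txt.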

Good luck!

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow