Question

Good morning.

First of all: this is the most impressive community I have ever seen!

For several days I have been musing about a three-fold job:

a. getting, b. parsing, c. storing a number of pages.

Two days ago I thought that getting the pages would be the major task. That is not the case - I guess the parsing will be the heroic part. Each of the pages to be parsed is a PNG image.

So the question is: after getting all of them, how do I parse them? This seems to be the real issue. I guess there are some Perl modules out there that can help with this...

I think this job can only be done with some OCR embedded. Question: is there a Perl module that can be used to support this task?

BTW: see the result-pages.

see an image

BTW: since I expect to find all 790 result pages within a certain range, between Id=0 and Id=100000, I thought I could go through them with a loop:

http://www.foundationfinder.ch/ShowDetails.php?Id=11233&InterfaceLanguage=&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=927&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=949&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=20011&InterfaceLanguage=1&Type=Html
http://www.foundationfinder.ch/ShowDetails.php?Id=10579&InterfaceLanguage=1&Type=Html

I thought I could go the Perl way, but I am not sure. I was trying to use LWP::UserAgent on the same URLs (see above) with different query arguments, and I am wondering whether LWP::UserAgent provides a way to loop through the query arguments. I am not sure that it has a method for that. I have sometimes heard that it is easier to use Mechanize. But is it really easier?
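As far as I know, LWP::UserAgent has no dedicated method for iterating over query arguments, but an ordinary Perl loop that builds each URL does the job. The following is only a minimal sketch under some assumptions: the parameter names are copied from the URLs in the question, InterfaceLanguage is fixed to 1, and the helper names `details_url` and `fetch_range`, the callback interface, and the one-second pause are made up for illustration.

```perl
use strict;
use warnings;
use LWP::UserAgent;
use URI;

# Build the details URL for one Id. Parameter names come from the
# URLs in the question; InterfaceLanguage=1 is an assumption.
sub details_url {
    my ($id) = @_;
    my $uri = URI->new('http://www.foundationfinder.ch/ShowDetails.php');
    $uri->query_form(Id => $id, InterfaceLanguage => 1, Type => 'Html');
    return $uri->as_string;
}

# Request every Id in a range and pass each successful
# HTTP::Response object to a caller-supplied callback.
sub fetch_range {
    my ($first, $last, $on_page) = @_;
    my $ua = LWP::UserAgent->new(timeout => 30);
    for my $id ($first .. $last) {
        my $response = $ua->get(details_url($id));
        sleep 1;                             # pause between requests
        next unless $response->is_success;   # skip Ids without a page
        $on_page->($id, $response);
    }
}
```

A call could look like `fetch_range(900, 950, sub { my ($id, $res) = @_; print "$id: ", length($res->content), " bytes\n" });`. Scanning the whole range from 0 to 100000 means 100001 requests, so a pause between requests is worth keeping.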

But to be frank: the first task, GETTING all the pages, is not very difficult compared with the parsing. How can the parsing be done?

Any ideas or suggestions?

I look forward to hearing from you...

zero


Solution

You do not need a Perl module for the OCR itself; you only need the built-in system function to call an external OCR program such as Tesseract.

use File::Slurp;                         # provides read_file
system qw[ tesseract.exe foo.png foo ];  # Tesseract appends .txt to the output base name
my $text = read_file('foo.txt');

You may need to preprocess the images to help Tesseract, for example by enlarging them with ImageMagick:

system qw[ convert.exe -resize 200%   image.jpg foo.png ];
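
The two steps above can be combined into one routine per image. This is only a sketch under assumptions: the programs are taken to be on the PATH as `convert` and `tesseract` (use `convert.exe` and `tesseract.exe` on Windows if needed), and the helper names `ocr_commands` and `ocr_image` and the intermediate file name `BASE.scaled.png` are hypothetical. It uses the list form of system, so no shell quoting is involved, and it checks the exit status of each command.

```perl
use strict;
use warnings;

# Return the two command lists for one image: upscale, then OCR.
# Tesseract appends ".txt" to the output base name it is given.
sub ocr_commands {
    my ($image, $base) = @_;
    return (
        [ 'convert', $image, '-resize', '200%', "$base.scaled.png" ],
        [ 'tesseract', "$base.scaled.png", $base ],
    );
}

# Run both commands and return the recognized text; die on any failure.
sub ocr_image {
    my ($image, $base) = @_;
    for my $cmd (ocr_commands($image, $base)) {
        system(@$cmd) == 0
            or die "'@$cmd' failed with status $?\n";
    }
    open my $fh, '<', "$base.txt" or die "Cannot read $base.txt: $!\n";
    local $/;                 # slurp mode: read the whole file at once
    return scalar <$fh>;
}
```

For example, `my $text = ocr_image('page927.png', 'page927');` would leave `page927.scaled.png` and `page927.txt` next to the original image. Whether a 200% resize actually improves recognition depends on the images, so it is worth trying a few pages by hand first.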
Licensed under: CC-BY-SA with attribution