Extracting text from a PDF using PDF Clown's TextExtractor

https://stackoverflow.com/questions/16572369

29-05-2022

Question

I am getting an error while using the TextExtractor class of the PDF Clown library. The code I used is:

TextExtractor textExtractor = new TextExtractor(true, true);
for(final Page page : file.getDocument().getPages())
{
  System.out.println("\nScanning page " + (page.getIndex()+1) + "...\n");

  // Extract the page text, grouped by the area it was found in.
  Map<Rectangle2D,List<ITextString>> textStrings = textExtractor.extract(page);
  // (the rest of the loop body is omitted here)
}

Part of the error I got is:

Exception in thread "main" java.lang.ExceptionInInitializerError
at org.pdfclown.documents.contents.fonts.Encoding.put
at ......
at ......
<about 30 such lines>
Caused by: java.lang.NullPointerException
at java.io.Reader.<init>(Reader.java:78)
at java.io.InputStreamReader
<about 30 lines more>

I also found out that this happens when my PDF contains bullets, for example:

item 1
item 2
item 3

Please help me extract the text from such PDFs.

Solution

(The following comment turned out to be the solution:)

Using your highlighter.java class (provided on your Google Drive in a comment) together with the current PDF Clown trunk version as a jar, the PDF was processed without incident, in particular without a NullPointerException (some of the highlights were not in the right position, though).

After looking at the contents of your shared Google Drive, though, I assumed you did not use a PDF Clown jar but instead merely compiled the classes from the distribution source folder and used them.

The PDF Clown jar files contain additional resources, though, which your setup consequently did not include. Thus:

Your highlighter.java has to be used with pdfclown.jar in the classpath.
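
To illustrate why a missing resource shows up as a NullPointerException in java.io.Reader.<init>, here is a minimal sketch using only the JDK. It assumes (this is my guess, not something checked against the PDF Clown source) that the library opens its bundled data files through the class loader during static initialization. The class name MissingResourceDemo and the resource path are hypothetical and exist only for this example. When a resource is not on the classpath, getResourceAsStream returns null, and passing that null to InputStreamReader makes the Reader constructor throw. If this happens inside a static initializer, the JVM reports it as an ExceptionInInitializerError with the NullPointerException as its cause.

```java
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;

public class MissingResourceDemo {

    // Hypothetical resource path, chosen so that it does not exist anywhere.
    static final String RESOURCE = "hypothetical/missing-resource.txt";

    /** Returns true if the named resource is visible to this class's class loader. */
    static boolean resourceAvailable(String name) {
        return MissingResourceDemo.class.getClassLoader().getResource(name) != null;
    }

    /**
     * Opens the resource without a null check, the way a careless loader might.
     * Returns "ok" on success, otherwise the simple name of the exception thrown.
     */
    static String openUnchecked(String name) {
        // getResourceAsStream returns null (it does not throw) for a missing resource.
        InputStream in = MissingResourceDemo.class.getClassLoader().getResourceAsStream(name);
        // new InputStreamReader(null) fails inside the java.io.Reader constructor.
        try (Reader reader = new InputStreamReader(in)) {
            return "ok";
        } catch (Exception e) {
            return e.getClass().getSimpleName();
        }
    }

    public static void main(String[] args) {
        System.out.println("resource available: " + resourceAvailable(RESOURCE));
        System.out.println("unchecked open result: " + openUnchecked(RESOURCE));
    }
}
```

A check like resourceAvailable can be pointed at any file you expect to find inside the jar, as a quick way to tell whether your classpath contains only the compiled classes or the resources as well.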

Licensed under: CC-BY-SA with attribution
