extracting from specific areas using pdfclown

https://stackoverflow.com/questions/16663930

30-05-2022

Question

I am trying to highlight text in a PDF with two columns, but the extractor reads the text row-wise across both columns, so the queried text never matches. Is there a function in pdfclown that would let me extract the first half of the page (i.e., the first column) and then the second one, perhaps by selecting areas?

Thanks.

Solution

As you talk about text extraction with PDF Clown, I assume you are using the TextExtractor class of that library.

This class offers several attributes for restricting the parsing area:

public void setAreas(List<Rectangle2D> value);
public void setAreaTolerance(double value);
public void setAreaMode(AreaModeEnum value);

setAreas allows you to set the page areas to extract text from, setAreaTolerance allows you to add some tolerance to these areas (essentially enlarging the areas by this value in all directions), and setAreaMode is used to control whether a string must be contained by the area (Containment) or merely needs to intersect the area (Intersection) to be included in the scan results.
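
To make the three settings concrete, here is a minimal stand-alone sketch in plain Java. It does not call PDF Clown at all. The class ColumnAreas and its methods are hypothetical names invented for illustration, the 600 x 800 page size in the usage note is an assumption, and the tolerance and mode logic only mirrors the description above rather than the library's actual implementation.

```java
import java.awt.geom.Rectangle2D;

// Hypothetical helper, not part of PDF Clown.
final class ColumnAreas {
    /** Splits a page of the given size into a left and a right half. */
    static Rectangle2D[] splitIntoColumns(double pageWidth, double pageHeight) {
        double half = pageWidth / 2;
        return new Rectangle2D[] {
            new Rectangle2D.Double(0, 0, half, pageHeight),
            new Rectangle2D.Double(half, 0, half, pageHeight)
        };
    }

    /** Grows an area by the tolerance on every side. */
    static Rectangle2D enlarge(Rectangle2D area, double tolerance) {
        return new Rectangle2D.Double(
            area.getX() - tolerance, area.getY() - tolerance,
            area.getWidth() + 2 * tolerance, area.getHeight() + 2 * tolerance);
    }

    /** containment == true: the box must lie inside the area; false: overlapping is enough. */
    static boolean matches(Rectangle2D area, Rectangle2D textBox, boolean containment) {
        return containment ? area.contains(textBox) : area.intersects(textBox);
    }
}
```

For a 600 x 800 page, a text box that pokes slightly past the column boundary is rejected under containment, accepted under intersection, and accepted under containment once the area is enlarged by a sufficient tolerance. Rectangles built like the two halves here are the kind of values one would pass to setAreas as a List<Rectangle2D>.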

How these attributes work can be seen in the TextExtractor method

public Map<Rectangle2D,List<ITextString>> filter(
    List<? extends ITextString> textStrings,
    Rectangle2D... areas
);

which filters the list of all text strings on the page.
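
As an illustration of why per-area filtering helps with a two-column page, the following sketch groups text boxes by area and then reads each area top to bottom. It is plain Java and does not use PDF Clown. AreaFilterSketch, TextBox and readInColumnOrder are hypothetical names, TextBox is only a stand-in for a located text string, and the code assumes containment mode and a coordinate system in which y grows downward.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not PDF Clown's implementation.
final class AreaFilterSketch {
    /** Stand-in for a located text string (not a PDF Clown type). */
    static final class TextBox {
        final String text;
        final Rectangle2D box;
        TextBox(String text, double x, double y, double w, double h) {
            this.text = text;
            this.box = new Rectangle2D.Double(x, y, w, h);
        }
    }

    /** Groups boxes under each area that fully contains them, keeping the area order. */
    static Map<Rectangle2D, List<TextBox>> filter(List<TextBox> boxes, Rectangle2D... areas) {
        Map<Rectangle2D, List<TextBox>> result = new LinkedHashMap<>();
        for (Rectangle2D area : areas) {
            List<TextBox> hits = new ArrayList<>();
            for (TextBox b : boxes) {
                if (area.contains(b.box)) hits.add(b);
            }
            // Assumption: smaller y means higher on the page.
            hits.sort(Comparator.comparingDouble((TextBox b) -> b.box.getY()));
            result.put(area, hits);
        }
        return result;
    }

    /** Joins the text area by area, so the left column is read before the right one. */
    static String readInColumnOrder(List<TextBox> boxes, Rectangle2D... areas) {
        StringBuilder sb = new StringBuilder();
        for (List<TextBox> column : filter(boxes, areas).values()) {
            for (TextBox b : column) {
                if (sb.length() > 0) sb.append(' ');
                sb.append(b.text);
            }
        }
        return sb.toString();
    }
}
```

Given boxes that arrive in row order (left line 1, right line 1, left line 2, right line 2) and two areas covering the left and right halves, readInColumnOrder returns the left column's lines first and the right column's lines afterwards, which is the order a search for a phrase inside one column needs.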

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow