The PDFTextStripper class you use is documented (cf. its JavaDoc comment) like this:
* This class will take a pdf document and strip out all of the text and ignore the
* formatting and such.
To get specific font information, therefore, you have to change the class somewhat.
The font information is available in this class all along and is only discarded when a line is output. Have a look at its source:
protected void writePage() throws IOException
{
    [...]
    for( int i = 0; i < charactersByArticle.size(); i++)
    {
        [...]
        List<TextPosition> line = new ArrayList<TextPosition>();
        [...]
        while( textIter.hasNext() )
        {
            [...]
            if( lastPosition != null )
            {
                [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    line.clear();
                    [...]
                }
                [...]
The TextPosition instances in that list line still have all formatting information available, among them the font used; only while "normalizing" line is it reduced to pure characters.
To keep font information, therefore, you have different options, depending on how you want to retrieve the font information:
If you want to continue retrieving all page content information (including fonts) in a single String via getText: You change the method

private List<String> normalize(List<TextPosition> line, boolean isRtlDominant, boolean hasRtl)

to include some font tags (e.g. [Arial]) of your choice whenever the font changes. Unfortunately this method is private. Thus, you have to copy the whole PDFTextStripper class and change the code of the copy.

If you want to retrieve the specific font information in a different structure (e.g. as List<List<TextPosition>>), you can derive your own stripper class from PDFTextStripper, add some variable of your desired type, and override the protected method writePage mentioned above, copying it and only enhancing it right before or after the line

writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);

with code adding the information to your new variable. E.g.

public class MyPDFTextStripper extends PDFTextStripper
{
    public List<List<TextPosition>> myLines = new ArrayList<List<TextPosition>>();
    [...]
                if(!overlap(positionY, positionHeight, maxYForLine, maxHeightForLine))
                {
                    writeLine(normalize(line,isRtlDominant,hasRtl),isRtlDominant);
                    myLines.add(new ArrayList<TextPosition>(line));
                    line.clear();
                    [...]
                }
Now you can call getText for an instance of your MyPDFTextStripper, retrieve the plain text as result, and access the additional data via the new variable.
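For the first option, the tagging logic one might add to the copied normalize method could look roughly like the following sketch. It is not PDFBox code: Glyph is a hypothetical stand-in for TextPosition (in the real class the font would come from TextPosition.getFont()), and the names FontTagSketch and tagFontChanges are made up for illustration. It only shows the idea of emitting a tag like [Arial] whenever the font changes within a line.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

public class FontTagSketch {

    /** Hypothetical stand-in for PDFBox's TextPosition: one character plus its font name. */
    static final class Glyph {
        final String character;
        final String fontName;

        Glyph(String character, String fontName) {
            this.character = character;
            this.fontName = fontName;
        }
    }

    /** Joins the characters of one line, emitting "[fontName]" whenever the font changes. */
    static String tagFontChanges(List<Glyph> line) {
        StringBuilder result = new StringBuilder();
        String currentFont = null; // no font seen yet
        for (Glyph glyph : line) {
            if (!Objects.equals(glyph.fontName, currentFont)) {
                // Font differs from the previous glyph: remember it and write a tag.
                currentFont = glyph.fontName;
                result.append('[').append(currentFont).append(']');
            }
            result.append(glyph.character);
        }
        return result.toString();
    }

    public static void main(String[] args) {
        List<Glyph> line = Arrays.asList(
                new Glyph("H", "Arial"),
                new Glyph("i", "Arial"),
                new Glyph("!", "Courier"));
        System.out.println(tagFontChanges(line)); // prints [Arial]Hi[Courier]!
    }
}
```

In a copied stripper class, the same comparison would sit inside the loop over the TextPosition instances of line, before each character is appended to the output.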