While parsing an already present PDF, I am using if(op.getOperation().equals( "TJ")) to get text operators. What I want to do is to target only the ones whose color is black (or some other specifiable color). I am unable to find a method for this in the PDFBox docs.

Edit: Basically, what I want to do is keep only the black text in the PDF and remove any other text operator that doesn't match the criteria.

Can anyone share a solution?

Thanks!

Solution

Text showing operators

While parsing an already present PDF, I am using if(op.getOperation().equals( "TJ")) to get text operators,

There are more text showing operators you have to take care of in general:

string Tj Show a text string.

string ' Move to the next line and show a text string. This operator shall have the same effect as the code T* string Tj

aw ac string " Move to the next line and show a text string, using aw as the word spacing and ac as the character spacing (setting the corresponding parameters in the text state). aw and ac shall be numbers expressed in unscaled text space units. This operator shall have the same effect as this code: aw Tw ac Tc string '

array TJ Show one or more text strings, allowing individual glyph positioning. Each element of array shall be either a string or a number. If the element is a string, this operator shall show the string. If it is a number, the operator shall adjust the text position by that amount; that is, it shall translate the text matrix, Tm. The number shall be expressed in thousandths of a unit of text space (see 9.4.4, "Text Space Details"). This amount shall be subtracted from the current horizontal or vertical coordinate, depending on the writing mode. In the default coordinate system, a positive adjustment has the effect of moving the next glyph painted either to the left or down by the given amount.

(Table 109 in the PDF specification ISO 32000-1)
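As a minimal sketch of the point above, the check for "TJ" alone can be widened to all four text showing operators. The class and method names here are made up for illustration and are not part of PDFBox. The string passed in is assumed to be whatever op.getOperation() returns in the question's code.

```java
import java.util.Set;

/** Sketch: recognise the four text showing operators quoted above. */
final class TextShowingOperators {
    // Operator names exactly as they appear in a content stream.
    private static final Set<String> NAMES = Set.of("Tj", "'", "\"", "TJ");

    private TextShowingOperators() {}

    /** Returns true if the given operator name shows text. */
    static boolean isTextShowing(String operatorName) {
        // Set.of(...) rejects null lookups, so guard against null first.
        return operatorName != null && NAMES.contains(operatorName);
    }

    public static void main(String[] args) {
        for (String name : new String[] {"Tj", "'", "\"", "TJ", "Tf", "rg"}) {
            System.out.println(name + " -> " + isTextShowing(name));
        }
    }
}
```

In the question's loop, the condition would then become something like if (TextShowingOperators.isTextShowing(op.getOperation())).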

Text color

The color used to show text depends on the current text rendering mode.

The text rendering mode, Tmode, determines whether showing text shall cause glyph outlines to be stroked, filled, used as a clipping boundary, or some combination of the three.

(section 9.3.6 in the PDF specification ISO 32000-1)

It is set using the Tr operator:

render Tr Set the text rendering mode, Tmode, to render, which shall be an integer. Initial value: 0.

(Table 105 in the PDF specification ISO 32000-1)

Depending on this mode, you have to consider the current stroke color, the current fill color, the color of whatever is later painted within the defined clipping boundary, or some combination of the three.
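The dependency on the rendering mode can be sketched as a small lookup. The mapping follows the eight text rendering modes (0 to 7) listed in section 9.3.6 of the specification. The class and method names are hypothetical helpers, not a PDFBox API.

```java
/** Sketch: which colors matter for each text rendering mode (0 to 7). */
final class TextRenderingModes {
    private TextRenderingModes() {}

    /** Modes 0, 2, 4 and 6 fill the glyph outlines with the fill color. */
    static boolean usesFillColor(int mode) {
        checkRange(mode);
        return mode == 0 || mode == 2 || mode == 4 || mode == 6;
    }

    /** Modes 1, 2, 5 and 6 stroke the glyph outlines with the stroke color. */
    static boolean usesStrokeColor(int mode) {
        checkRange(mode);
        return mode == 1 || mode == 2 || mode == 5 || mode == 6;
    }

    /** Modes 4 to 7 add the glyph outlines to the clipping path. */
    static boolean addsToClippingPath(int mode) {
        checkRange(mode);
        return mode >= 4;
    }

    private static void checkRange(int mode) {
        if (mode < 0 || mode > 7) {
            throw new IllegalArgumentException("Unknown text rendering mode: " + mode);
        }
    }

    public static void main(String[] args) {
        for (int mode = 0; mode <= 7; mode++) {
            System.out.println(mode + " Tr: fill=" + usesFillColor(mode)
                    + " stroke=" + usesStrokeColor(mode)
                    + " clip=" + addsToClippingPath(mode));
        }
    }
}
```

Mode 3 uses neither color (invisible text), and mode 7 only contributes to the clipping path, so in those cases the visible color, if any, comes from whatever is painted afterwards.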

The color setting operators are defined in Table 74 of the specification ISO 32000-1.

Most often the glyph outlines are merely filled (mode 0). Thus, most often you only have to consider the current fill color. That still leaves quite a lot of color setting operators to consider.

Most often gray, RGB, or CMYK colors are used here. Thus, most often you will have to check the g, rg, or k operators, which set the fill color (their uppercase counterparts G, RG, and K set the stroke color).

Pure black is set by 0 g, 0 0 0 rg, or 0 0 0 1 k. You might also want to consider values which are very near to those values; they might have been intended as black and only differ due to rounding issues.
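A rough sketch of such a check, assuming the operands have already been parsed into numbers. The tolerance of 0.01 is an arbitrary assumption, and the class is a hypothetical helper rather than anything PDFBox provides. It also tracks q and Q, because those operators save and restore the graphics state, including the fill color.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Sketch: track whether the current fill color is (nearly) black. */
final class FillColorTracker {
    private static final double EPS = 0.01; // assumed tolerance for rounding

    // Saved states for q (push) and Q (pop).
    private final Deque<Boolean> saved = new ArrayDeque<>();
    // The initial fill color of a page is black.
    private boolean black = true;

    /** True for operands close to 0 g, 0 0 0 rg, or 0 0 0 1 k. */
    static boolean isNearBlack(String operator, double... operands) {
        switch (operator) {
            case "g":
                return operands.length == 1 && near(operands[0], 0);
            case "rg":
                return operands.length == 3 && near(operands[0], 0)
                        && near(operands[1], 0) && near(operands[2], 0);
            case "k":
                return operands.length == 4 && near(operands[0], 0)
                        && near(operands[1], 0) && near(operands[2], 0)
                        && near(operands[3], 1);
            default:
                return false;
        }
    }

    /** Feed every operator of the content stream, in order. */
    void accept(String operator, double... operands) {
        switch (operator) {
            case "q": saved.push(black); break;
            case "Q": if (!saved.isEmpty()) black = saved.pop(); break;
            case "g": case "rg": case "k":
                black = isNearBlack(operator, operands); break;
            case "cs": case "sc": case "scn":
                black = false; break; // assumption: other color spaces count as "not black"
            default: break; // all other operators leave the fill color alone
        }
    }

    boolean isFillBlack() { return black; }

    private static boolean near(double a, double b) { return Math.abs(a - b) <= EPS; }

    public static void main(String[] args) {
        FillColorTracker t = new FillColorTracker();
        t.accept("q");
        t.accept("rg", 1.0, 0.0, 0.0);
        System.out.println("after 1 0 0 rg: " + t.isFillBlack());
        t.accept("Q");
        System.out.println("after Q: " + t.isFillBlack());
    }
}
```

A filter along these lines would call accept for each operator and drop a text showing operator whenever isFillBlack() is false. It ignores stroke colors, rendering modes other than 0, and colors set inside form XObjects, so it is only a starting point.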

Color transformations

To make things a bit more complex: the colors mentioned above may still be transformed into some completely different color, e.g. by means of transfer functions (cf. section 10.4) or transparency and blending (cf. section 11).

If you also want to take these effects into account, you are essentially writing your own PDF renderer.

Normally, though, PDFs intended mainly for text on the web don't use these features. Thus, for your purposes I would not consider them at first.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow