How to get style information of elements in PDF using Apache Tika?

Question 1

It is probably more convenient to use another API such as PDFTextStream. Tika extracts only the raw text from a PDF, while PDFTextStream gives you structured text with associated information such as character encoding, height, and the region of the text.

Question 2

I used PDF Clown (https://pdfclown.org) to extract text blocks from the content stream along with their font heights:

Example

v.0.2.0

Question 3

Converting the PDF to the Scalable Vector Graphics (SVG) XML format with mupdf will give you the information you want.

Download the mupdf tool here: http://artifex.com/developers-mupdf-download/mupdf-download-resources/ and choose the GNU AGPL license

Or here: https://mupdf.com/downloads/

Details: https://mupdf.com/index.html

After you download the executable, add the directory containing the mupdf executable to your PATH environment variable.

You can then run the following from a command line interface (CLI) to convert the PDF (note: there will be a separate SVG file for each page):

mutool convert -F svg -O text=text -o "your_pdf_pg.svg" your_pdf.pdf

More CLI details: https://mupdf.com/docs/manual-mutool-convert.html
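If you want to run the conversion from a script rather than by hand, the same command can be wrapped in a few lines of Python. This is only a minimal sketch: the function names build_mutool_command and convert_pdf_to_svg are made up for illustration, the flags are copied unchanged from the command above, and it assumes mutool is already on your PATH.

```python
import subprocess


def build_mutool_command(pdf_path, svg_path):
    """Build the same mutool invocation as the command line shown above."""
    return [
        "mutool", "convert",
        "-F", "svg",        # output format
        "-O", "text=text",  # keep text as text elements
        "-o", svg_path,     # output file name
        pdf_path,
    ]


def convert_pdf_to_svg(pdf_path, svg_path):
    """Run mutool; assumes the executable can be found on PATH."""
    # check=True raises CalledProcessError if mutool exits with an error status.
    subprocess.run(build_mutool_command(pdf_path, svg_path), check=True)
```

It would be called as convert_pdf_to_svg("your_pdf.pdf", "your_pdf_pg.svg"). Passing the arguments as a list avoids shell quoting problems with file names that contain spaces.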

In all of the cases I have seen, the SVG file gives the font, size, style, color, and page coordinates for each run of text where that information is the same. The exceptions are underlines and strikeouts, which are included in the SVG file as <path> elements in the same coordinate system as the text. So you can write some code to parse the XML and tag the text with <u> </u> or <del> </del> accordingly.
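As a starting point for that parsing step, here is a rough Python sketch that collects each piece of text together with the style attributes that apply to it. It assumes the styling is stored as plain attributes (font-family, font-size, fill and so on) on the text or tspan elements or on their ancestors. The exact layout can differ between mupdf versions, so check one of your own SVG files first. The name extract_text_runs is hypothetical.

```python
import xml.etree.ElementTree as ET

SVG_NS = "{http://www.w3.org/2000/svg}"
# Attributes copied down from parent elements to the text they contain.
# This list is an assumption; extend it to match your own SVG output.
TRACKED_ATTRS = ("font-family", "font-size", "font-weight", "font-style",
                 "fill", "transform", "x", "y")


def extract_text_runs(svg_source):
    """Return one dict per non-empty text run, with the attributes found on it
    or on its ancestors. Nested transforms are NOT composed: the innermost
    transform value simply wins."""
    root = ET.fromstring(svg_source)
    runs = []

    def walk(elem, inherited):
        attrs = dict(inherited)
        for name in TRACKED_ATTRS:
            if name in elem.attrib:
                attrs[name] = elem.attrib[name]
        tag = elem.tag.replace(SVG_NS, "")
        if tag in ("text", "tspan") and elem.text and elem.text.strip():
            runs.append({"text": elem.text, **attrs})
        for child in elem:
            walk(child, attrs)

    walk(root, {})
    return runs
```

Each returned dict holds the text plus whatever tracked attributes were present, as strings. Matching the underline and strikeout paths against these runs is a separate step that depends on how the coordinates are written in your files.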