Question

I am playing around with Apache Tika to extract text from PDF files. How can I use Apache Tika to get style information such as font size, text color, and whether a specific piece of text (a few words) is in italics, bold, etc.?

Is it even possible to get this type of information?

I would also like to know whether it is possible to get table information using Apache Tika, such as the start of a table, the start of the first row, the first cell, etc.

Solution

It is probably more convenient to use another API such as PDFTextStream. Tika extracts the raw text from a PDF, while PDFTextStream gives you structured text with associated information such as character encoding, height, and the region of the text.

Other tips

I used https://pdfclown.org to extract text blocks and font heights from the content stream:

Example

v.0.2.0

Converting the PDF to Scalable Vector Graphics (SVG), an XML format, with mupdf will give you the information you want.

Download the mupdf tool here: http://artifex.com/developers-mupdf-download/mupdf-download-resources/ and choose the GNU AGPL LICENSE

Or here: https://mupdf.com/downloads/

Details: https://mupdf.com/index.html

After you download the executable, add its directory to your PATH environment variable.

You can then use the following from a command line interface (CLI) to convert the pdf (note - there will be a separate svg file for each page):

mutool convert -F svg -O text=text -o "your_pdf_pg.svg" your_pdf.pdf

More CLI details: https://mupdf.com/docs/manual-mutool-convert.html
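
If you want to drive the conversion from a script rather than typing the command by hand, a minimal Python sketch might look like the one below. It only wraps the exact mutool command shown above. The function names `build_mutool_command` and `convert_pdf_to_svg` and the default output naming are my own choices for illustration, not part of mupdf.

```python
import shutil
import subprocess
from pathlib import Path


def build_mutool_command(pdf_path, out_pattern=None):
    """Return the argument list for the mutool call shown above.

    The default output name mirrors the "<name>_pg.svg" pattern used in
    the command above; that naming is an assumption, not a requirement.
    """
    pdf = Path(pdf_path)
    if out_pattern is None:
        out_pattern = pdf.with_name(pdf.stem + "_pg.svg")
    return [
        "mutool", "convert",
        "-F", "svg",          # output format
        "-O", "text=text",    # keep text as text instead of outlines
        "-o", str(out_pattern),
        str(pdf),
    ]


def convert_pdf_to_svg(pdf_path, out_pattern=None):
    """Run mutool; raises if the executable is missing or the call fails."""
    if shutil.which("mutool") is None:
        raise FileNotFoundError("mutool was not found on PATH")
    subprocess.run(build_mutool_command(pdf_path, out_pattern), check=True)


if __name__ == "__main__":
    # Show the command that would be run for a hypothetical input file.
    print(" ".join(build_mutool_command("your_pdf.pdf")))
```

Keeping the command construction separate from the call that runs it makes it easy to inspect or log the command before anything is executed.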

In all of the cases I have seen, the SVG file records the font, size, style, color, and page coordinates for each line of text in which that information stays the same. The exceptions are underlines and strikeouts, which are included in the SVG file as <path> elements in the same coordinate system as the text. So you can write some code to parse the XML and tag the text with <u> </u> or <del> </del> accordingly.
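
As a rough illustration of the parsing step, here is a minimal Python sketch that walks an SVG document and collects the style attributes attached to each piece of text. It assumes the styling appears as standard SVG presentation attributes (font-family, font-size, font-weight, font-style, fill) on <text> and <tspan> elements. The exact attributes mutool writes may differ between versions, so check a real output file first. The sample SVG is hand-written for the example, not actual mutool output, and the names `extract_styled_text` and `STYLE_ATTRS` are my own.

```python
import xml.etree.ElementTree as ET

# Presentation attributes to collect (assumed; adjust to your SVG files).
STYLE_ATTRS = ("font-family", "font-size", "font-weight", "font-style", "fill")

# Hand-written sample, NOT real mutool output.
SAMPLE_SVG = """<svg xmlns="http://www.w3.org/2000/svg">
<text font-family="Times" font-size="12" fill="#000000">
<tspan x="72" y="100">plain words</tspan>
<tspan x="72" y="115" font-style="italic" font-weight="bold">emphasised words</tspan>
</text>
</svg>"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def extract_styled_text(svg_source):
    """Return one dict per text run: its text plus the style in effect."""
    root = ET.fromstring(svg_source)
    runs = []

    def walk(elem, inherited):
        # Children inherit style from their ancestors unless they override it.
        style = dict(inherited)
        for name in STYLE_ATTRS:
            if name in elem.attrib:
                style[name] = elem.attrib[name]
        is_text_node = local_name(elem.tag) in ("text", "tspan")
        if is_text_node and elem.text and elem.text.strip():
            runs.append({"text": elem.text, **style})
        for child in elem:
            walk(child, style)

    walk(root, {})
    return runs


if __name__ == "__main__":
    for run in extract_styled_text(SAMPLE_SVG):
        print(run)
```

Matching underline or strikeout paths to text would be a further step on top of this: compare each path's coordinates with the x/y position of the text runs and wrap the matching runs in the corresponding tag.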

License: CC-BY-SA with attribution
Not affiliated with StackOverflow