Question

PDFBox provides classes to convert a PDF to a Lucene document. Does it preserve the formatting of the document? By formatting I mean: does it store details about location, font type/size, and other such options?

Solution

By default, it strips all formatting, extracts only the textual content, and makes that content searchable. The original PDF can be kept outside the index and returned with the search results when a hit is found. If your intent is to rebuild a PDF from the Lucene index, that may not be the best approach.
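
As a rough illustration of that arrangement (searchable text in the index, the original PDF kept outside it), here is a minimal sketch. It uses a toy in-memory index as a stand-in for Lucene and does not call PDFBox or Lucene at all; the class `TextOnlyIndex` and its methods are hypothetical names, and the text is assumed to have been extracted from the PDF already.

```java
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Toy stand-in for a search index (hypothetical, not Lucene): keeps words plus the source PDF path. */
public class TextOnlyIndex {
    // term -> paths of the PDFs whose extracted text contains that term
    private final Map<String, Set<Path>> postings = new HashMap<>();

    /** Index already-extracted plain text; fonts and positions never reach this method. */
    public void add(Path originalPdf, String extractedText) {
        for (String term : extractedText.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new LinkedHashSet<>()).add(originalPdf);
            }
        }
    }

    /** Look up a single term and return the original PDF files to hand back with the hit. */
    public Set<Path> search(String term) {
        return postings.getOrDefault(term.toLowerCase(Locale.ROOT), Collections.emptySet());
    }

    public static void main(String[] args) {
        TextOnlyIndex index = new TextOnlyIndex();
        // The file names and text below are made-up sample data.
        index.add(Paths.get("report.pdf"), "Quarterly Revenue Summary");
        index.add(Paths.get("notes.pdf"), "Meeting notes on revenue targets");
        System.out.println(index.search("revenue"));
    }
}
```

The index only ever sees plain words, so layout cannot be recovered from it; a search hit instead yields the path of the untouched original file, which still carries all of its formatting.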

PDFBox is quite capable of extracting metadata, though, and it can certainly be used to index formatting, font, and similar data if you want to be able to search on that sort of thing.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow