Parse document structure with Java

https://stackoverflow.com/questions/4962102

java
apache
apache-tika

12-11-2019

Question

We need to extract a tree-like structure from a given text document using Java. The file type should be common and open (rtf, odt, ...). Currently we use Apache Tika to parse plain text from multiple documents.

Which file type and API should we use to get the correct structure parsed most reliably? If this is possible with Tika, I would be happy to see a demonstration.

For example, we should get this kind of data from the given document:

Main Heading
  Heading 1
    Heading 1.1
  Heading 2
    Heading 2.2

Main Heading is the title of the paper. The paper has two main headings, Heading 1 and Heading 2, and each of them has one subheading. We should also get the contents under each heading (paragraph text).

Any help is appreciated.
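
To make the target structure concrete, here is a minimal sketch of how such a tree could be built from XHTML-style SAX events. It is not a confirmed Tika recipe. It assumes that headings arrive as h1..h6 elements and paragraphs as p elements; whether a given parser emits them that way for rtf or odt input would have to be checked per format. The names OutlineNode and OutlineHandler are hypothetical, and the sketch uses only the JDK's SAX classes.

```java
import java.io.StringReader;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical tree node: one heading plus the paragraphs and subheadings below it. */
class OutlineNode {
    final int level;                                  // 0 for the synthetic root, 1..6 for h1..h6
    String title = "";
    final List<String> paragraphs = new ArrayList<>();
    final List<OutlineNode> children = new ArrayList<>();

    OutlineNode(int level) { this.level = level; }
}

/** Hypothetical handler: builds the heading tree from h1..h6 and p SAX events. */
class OutlineHandler extends DefaultHandler {
    final OutlineNode root = new OutlineNode(0);
    private final Deque<OutlineNode> open = new ArrayDeque<>(); // currently open headings
    private StringBuilder text;                                  // null outside h1..h6 and p

    OutlineHandler() { open.push(root); }

    // Non-namespace-aware parsers leave localName empty, so fall back to qName.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    // Returns 1..6 for h1..h6 and 0 for any other element name.
    private static int headingLevel(String n) {
        boolean heading = n.length() == 2 && n.charAt(0) == 'h'
                && n.charAt(1) >= '1' && n.charAt(1) <= '6';
        return heading ? n.charAt(1) - '0' : 0;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        String n = name(localName, qName);
        int level = headingLevel(n);
        if (level > 0) {
            // Close every open heading at the same or a deeper level.
            while (open.peek().level >= level) open.pop();
            OutlineNode node = new OutlineNode(level);
            open.peek().children.add(node);
            open.push(node);
            text = new StringBuilder();
        } else if ("p".equals(n)) {
            text = new StringBuilder();
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        if (text != null) text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (text == null) return;
        String n = name(localName, qName);
        if (headingLevel(n) > 0) {
            open.peek().title = text.toString().trim();
            text = null;
        } else if ("p".equals(n)) {
            String paragraph = text.toString().trim();
            // Attach the paragraph to the innermost open heading.
            if (!paragraph.isEmpty()) open.peek().paragraphs.add(paragraph);
            text = null;
        }
    }

    /** Parses an XHTML string with the JDK's SAX parser and returns the synthetic root. */
    static OutlineNode parse(String xhtml) throws Exception {
        OutlineHandler handler = new OutlineHandler();
        SAXParserFactory.newInstance().newSAXParser()
                .parse(new InputSource(new StringReader(xhtml)), handler);
        return handler.root;
    }

    /** Appends the headings below node to out, indented two spaces per level. */
    static void render(OutlineNode node, String indent, StringBuilder out) {
        for (OutlineNode child : node.children) {
            out.append(indent).append(child.title).append('\n');
            render(child, indent + "  ", out);
        }
    }
}
```

Since Tika's Parser.parse(InputStream, ContentHandler, Metadata, ParseContext) reports document content as XHTML SAX events, a handler of this kind could presumably be passed as the ContentHandler argument in place of a plain-text handler. Calling OutlineHandler.parse on an XHTML string and then render on the returned root gives an indented outline like the one shown above.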

No correct solution

License: CC-BY-SA with attribution

Not affiliated with StackOverflow