Getting started with Apache Tika?

https://stackoverflow.com/questions/17821895

04-06-2022
Question

I would like to write a Java web crawler that uses Apache Tika to download the textual content of web pages, but I'm new to Apache projects and I haven't found a definitive source that explains exactly how to integrate Tika into a program. Based on what I've gathered from the Internet, I have built Tika with Maven on the command line, but I'm not sure where to go from here to use Tika classes(?) such as Parser in my Java programs. I'm using Eclipse, if that makes a difference. I've also installed the Maven plugin for Eclipse, but I'm not sure what to do with it. Do I need to add an "import..." line? Please excuse my "beginner" questions, but a step-by-step guide to preparing Tika for use would be appreciated.

Solution

First up, you'll want to read through the Apache Tika getting started guide, which covers how to include Tika in your project. (This assumes you have some basic knowledge of adding third-party jars to your own project; if not, you'll need to read some tutorials on that first.)

The easiest way to get started with Tika in your project is via the Tika facade class (org.apache.tika.Tika). It gives you a single class for content type detection, parsing to a plain text String, and parsing to a Reader that streams the extracted text, all from a variety of sources such as files, streams and URLs. All the basics are available there.
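As a rough illustration of the facade, here is a minimal sketch. It assumes the Tika jars are already on your classpath (for Maven, the org.apache.tika tika-core artifact plus a parsers artifact such as tika-parsers-standard-package on the 2.x line, or tika-parsers on 1.x); the class name TikaFacadeSketch and its helper methods are made up for this example. And yes, you do need import lines, as shown at the top.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

// Hypothetical example class; only org.apache.tika.Tika comes from the library.
public class TikaFacadeSketch {

    // One facade instance, reused for every call in this sketch.
    private static final Tika TIKA = new Tika();

    // Guess the media type (for example "text/html") from the leading bytes.
    public static String detectType(byte[] data) {
        return TIKA.detect(data);
    }

    // Extract plain text from any stream that Tika has a parser for.
    // Note: parseToString truncates long documents by default;
    // see Tika.setMaxStringLength if you need more.
    public static String extractText(InputStream in) throws IOException, TikaException {
        return TIKA.parseToString(in);
    }

    public static void main(String[] args) throws Exception {
        if (args.length > 0) {
            // Crawler-style use: pass a page address as the first argument.
            URL url = URI.create(args[0]).toURL();
            System.out.println("Type: " + TIKA.detect(url));
            System.out.println(TIKA.parseToString(url));
            return;
        }
        // Offline demo on an in-memory HTML document.
        byte[] html = "<html><body><p>Hello Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        System.out.println("Type: " + detectType(html));
        System.out.println(extractText(new ByteArrayInputStream(html)));
    }
}
```

If only tika-core is on the classpath, detection still works but text extraction will come back empty, because the actual format parsers live in the separate parsers artifact.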

For more advanced use, you'll want to follow the information given on the Parser API page and the Content Detection page. You can also follow the Tika Examples on parsing with the AutoDetectParser, which will probably do what you want. Otherwise, browse the annotated list of Tika examples with explanations to get a good idea of how to start!
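To give an idea of what the Parser API route looks like, here is a minimal sketch using AutoDetectParser with a BodyContentHandler. It is not copied from the Tika Examples; the class name AutoDetectSketch and the method parseBody are invented for illustration, and the same classpath assumption applies (tika-core plus a parsers artifact).

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

// Hypothetical example class built on the documented Parser interface.
public class AutoDetectSketch {

    // Parses the stream, fills the supplied Metadata object as a side effect,
    // and returns the text found in the document body.
    public static String parseBody(InputStream in, Metadata metadata)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();   // picks a parser from the detected type
        // -1 disables the handler's default limit on collected characters.
        BodyContentHandler handler = new BodyContentHandler(-1);
        parser.parse(in, handler, metadata, new ParseContext());
        return handler.toString();
    }

    public static void main(String[] args) throws Exception {
        byte[] html = "<html><head><title>Demo</title></head><body><p>Hello Tika</p></body></html>"
                .getBytes(StandardCharsets.UTF_8);
        Metadata metadata = new Metadata();
        String body = parseBody(new ByteArrayInputStream(html), metadata);
        System.out.println(body.trim());
        // Print whatever metadata the chosen parser reported.
        for (String name : metadata.names()) {
            System.out.println(name + " = " + metadata.get(name));
        }
    }
}
```

Compared with the facade, this form gives you access to the Metadata and lets you swap in a different ContentHandler, at the cost of a little more setup.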

License: CC-BY-SA with attribution

Not affiliated with StackOverflow