Question

I have a requirement to extract specific columns/rows from an Excel/CSV file. Somebody suggested using Tika for this task.

While going through Tika, I came across the POI API and found it friendlier to use.

We may also need to parse PDF files in the future.

I am new to this technology. I would like to know the difference between the two and which one is more suitable for my requirement.

Thanks, Krishna


Solution

Apache Tika provides a common way to extract consistent text and metadata from a wide range of formats. It also provides content detection, language detection and a few other bits. If you write your code to work with Apache Tika, then your code will be able to work with a huge range of formats in the same way. You don't need to worry about whether one format has a Title while another calls the same logical thing a LongTitle or a Subject. You don't need to worry about which library to use for which format. You call Tika, it does the hard work for you, and back comes your consistent Metadata and Textual Content.
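To make the "call Tika and get text back" idea concrete, here is a minimal sketch using the `Tika` facade class. The class name `TikaSketch` and its two helper methods are made up for illustration. It assumes `tika-core` is on the classpath; getting real text out of Excel or PDF files would also need the Tika parser modules, which in turn pull in POI and a PDF library.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;

// Hypothetical helper class, not part of Tika itself.
public class TikaSketch {

    // The facade bundles type detection and parsing behind one object.
    private static final Tika TIKA = new Tika();

    /** Guesses the media type (for example "text/plain") from the leading bytes. */
    public static String detectType(byte[] data) {
        return TIKA.detect(data);
    }

    /**
     * Extracts plain text regardless of the underlying format.
     * Note: parseToString truncates very long documents by default.
     */
    public static String extractText(byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return TIKA.parseToString(in);
        } catch (IOException | TikaException e) {
            // Wrapped so callers do not have to handle checked exceptions.
            throw new IllegalStateException("Tika could not parse the content", e);
        }
    }
}
```

The same two calls would be used for a spreadsheet, a PDF or a Word document. The trade-off is that you get back one block of text, not individual cells.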

Apache POI is one of the libraries that Tika uses. POI supports most of the main Microsoft Office formats, including Excel (.xls and .xlsx). It provides access to the whole of the file format, allowing you complete control over what information you read out. (It also supports writing). Tika uses POI to get text and metadata out of the various different Microsoft formats, but doesn't extract everything. Using POI directly would allow you to decide what you care about and get that.
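For the original requirement (specific columns and rows from an Excel file), direct POI access might look roughly like the sketch below. `PoiColumnSketch`, `pick` and `pickFromFile` are hypothetical names. The code assumes the POI `poi` and `poi-ooxml` jars are on the classpath, that the data sits in the first sheet, and that row and column indexes are 0-based.

```java
import org.apache.poi.ss.usermodel.Cell;
import org.apache.poi.ss.usermodel.DataFormatter;
import org.apache.poi.ss.usermodel.Row;
import org.apache.poi.ss.usermodel.Sheet;
import org.apache.poi.ss.usermodel.Workbook;
import org.apache.poi.ss.usermodel.WorkbookFactory;

import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper class, not part of POI itself.
public class PoiColumnSketch {

    /** Returns the chosen columns of the chosen rows (both 0-based) as display strings. */
    public static List<List<String>> pick(Sheet sheet, int[] rowIndexes, int[] columnIndexes) {
        // DataFormatter renders a cell the way Excel would show it.
        DataFormatter formatter = new DataFormatter();
        List<List<String>> result = new ArrayList<>();
        for (int r : rowIndexes) {
            Row row = sheet.getRow(r); // null when the row was never written
            List<String> values = new ArrayList<>();
            for (int c : columnIndexes) {
                Cell cell = (row == null) ? null : row.getCell(c);
                values.add(formatter.formatCellValue(cell)); // "" for a missing cell
            }
            result.add(values);
        }
        return result;
    }

    /** Opens an .xls or .xlsx file and picks cells from its first sheet (an assumption). */
    public static List<List<String>> pickFromFile(File file, int[] rows, int[] cols) {
        // WorkbookFactory chooses HSSF or XSSF based on the file content.
        try (Workbook workbook = WorkbookFactory.create(file)) {
            return pick(workbook.getSheetAt(0), rows, cols);
        } catch (Exception e) {
            // Checked exception types differ between POI versions, so catch broadly here.
            throw new IllegalStateException("Could not read " + file, e);
        }
    }
}
```

A call such as `PoiColumnSketch.pickFromFile(new File("data.xlsx"), new int[]{1, 2}, new int[]{0, 3})` would return the first and fourth columns of the second and third rows; `data.xlsx` is a placeholder file name.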

If you want to support lots of file formats, use Tika. If you want full control of how you get the information out, use POI.

Other tips

Apache POI is a full-blown parser/writer for most Microsoft document formats. It supports both the newer Office 2007 format (XSSF) and the older Office 2003 format (HSSF). Apache POI provides two levels of API for parsing and generating Microsoft files. The higher-level API is somewhat memory intensive: it reads the whole file and keeps it in memory, similar to DOM parsing in XML. The lower-level API is meant for low-memory use and is similar to SAX/StAX parsing.

Apache Tika, on the other hand, is a content analysis tool. As far as I know, it handles Microsoft Excel by delegating to POI, and it wraps a lot of other extraction components for other formats. Tika has no support for writing new files or generating content; that is not its use case at all.

So, you have to choose depending on your need.

License: CC-BY-SA with attribution