Lucene 4.2.0 index pdf

https://stackoverflow.com/questions/16640292

30-05-2022

Question

I am using example source code from the Lucene 4.2.0 demo API: http://lucene.apache.org/core/4_2_0/demo/overview-summary.html

I run IndexFiles.java to create an index from a directory of rtf, pdf, doc, and docx files. I then run SearchFiles.java and find that several searches fail, i.e. they do not return a document that contains the word I searched for.

I suspect it has to do with Lucene 4.2.0 not being able to correctly index non .txt files without additional customization.

Question: Can the IndexFiles.java source code (Lucene 4.2.0) correctly index pdf, doc, docx files as it is written in the provided link? Does anyone have examples or references on how to code that functionality?

Thank You

Solution

No, it can't. IndexFiles is a demo, an example for you to learn from, not something designed for production use. If you take a look at the code, you'll see it just uses a FileInputStream (wrapped in an InputStreamReader, wrapped in a BufferedReader), so every file is read as if it were plain text. Lucene itself won't parse different file formats (except its own index files, of course). How to parse a file into meaningful content for Lucene is up to you to define.
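To illustrate why reading a binary file through a plain Reader loses the words, here is a minimal, self-contained sketch. It is not code from the demo: the class name RawReadDemo is hypothetical, and Deflate compression is only an assumed stand-in for a binary format (PDF content streams are commonly Flate-compressed, but a real PDF has far more structure than this).

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.zip.DeflaterOutputStream;

public class RawReadDemo {

    // Decode bytes with the same kind of reader chain the demo uses:
    // InputStream -> InputStreamReader -> BufferedReader.
    static String readAsText(byte[] bytes) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8))) {
            int c;
            while ((c = reader.read()) != -1) {
                sb.append((char) c);
            }
        }
        return sb.toString();
    }

    // Assumed stand-in for a binary document format: Deflate-compress the text.
    static byte[] compress(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (DeflaterOutputStream deflater = new DeflaterOutputStream(out)) {
            deflater.write(text.getBytes(StandardCharsets.UTF_8));
        }
        return out.toByteArray();
    }

    public static void main(String[] args) throws IOException {
        String text = "Lucene indexes the tokens it is given. "
                + "Lucene indexes the tokens it is given.";

        // Plain text survives the reader chain, so the word can be tokenized.
        System.out.println(readAsText(text.getBytes(StandardCharsets.UTF_8)).contains("tokens")); // true

        // Compressed bytes decode to noise, so the word never reaches the analyzer.
        System.out.println(readAsText(compress(text)).contains("tokens")); // false
    }
}
```

The analyzer can only tokenize the characters it receives, so a term that is present in the rendered document but absent from the raw decoded bytes is never indexed.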

Apache Tika might be a good place to look for this functionality. Here is a simple example using Tika with Lucene.
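As a rough sketch of that approach (not the linked example, and not a drop-in replacement for IndexFiles), the following extracts text with Tika's facade class and indexes it with Lucene 4.2. It assumes Tika (with its parsers) plus lucene-core and lucene-analyzers-common 4.2 on the classpath; the class name TikaIndexFiles and the two command-line arguments are made up for illustration, and the field names "path" and "contents" are chosen to match what the demo's SearchFiles is understood to query.

```java
import java.io.File;
import java.io.IOException;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.util.Version;
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;

public class TikaIndexFiles {

    // The Tika facade detects the file type and picks a matching parser.
    private static final Tika TIKA = new Tika();

    // Extract plain text from a pdf, doc, docx, rtf, ... file.
    static String extractText(File file) throws IOException, TikaException {
        return TIKA.parseToString(file);
    }

    // Build a Lucene document from the extracted text.
    static Document toDocument(File file) throws IOException, TikaException {
        Document doc = new Document();
        // Stored but not tokenized, so the path can be shown in search results.
        doc.add(new StringField("path", file.getPath(), Field.Store.YES));
        // Tokenized and indexed, not stored.
        doc.add(new TextField("contents", extractText(file), Field.Store.NO));
        return doc;
    }

    // Hypothetical usage: java TikaIndexFiles <docsDir> <indexDir>
    public static void main(String[] args) throws Exception {
        File docsDir = new File(args[0]);
        Directory indexDir = FSDirectory.open(new File(args[1]));
        IndexWriterConfig config = new IndexWriterConfig(
                Version.LUCENE_42, new StandardAnalyzer(Version.LUCENE_42));

        try (IndexWriter writer = new IndexWriter(indexDir, config)) {
            File[] files = docsDir.listFiles();
            if (files == null) {
                throw new IOException("Not a readable directory: " + docsDir);
            }
            for (File file : files) {
                if (file.isFile()) {
                    writer.addDocument(toDocument(file));
                }
            }
        }
    }
}
```

Two caveats with this sketch: it does not recurse into subdirectories, and parseToString truncates long documents at the facade's default length limit, which can be changed with setMaxStringLength on the Tika instance.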

You might also consider using Solr.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow