Question

I thought I had a simple question, but somehow I can't find a source for the answer: which document formats can be indexed by the Lucene version that is packaged with Railo 4.0?

Somehow .doc and .pdf seem to go well, but .docx and .rtf just don't seem to get indexed. Is there a list available somewhere? And for the formats that aren't supported, what would be the best way to get their content indexed by cfindex as well?

        <cfindex 
        collection = "#collection#"   
        action = "update"   
        type = "file"
        key ="#ABSfilepath#"
        title="#ABSfilepath#"
        >

thanks!

Question also posted to Railo mailing list: web link.


Solution

Railo 4 uses Lucene 2.4.1. How do you tell? The same way you tell the version of any third-party software that Railo uses: locate the JAR file (in the lib/ext directory), open the archive (using 7-Zip or equivalent), and look at META-INF/MANIFEST.MF, where you will find content like this:

        Specification-Title: Lucene Search Engine: core
        Specification-Version: 2.4.1
        Specification-Vendor: The Apache Software Foundation
        Implementation-Title: org.apache.lucene
        Implementation-Version: 2.4.1 750176 - 2009-03-04 21:56:52
        Implementation-Vendor: The Apache Software Foundation
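If there are many JARs to check, the manual step above can be scripted. The following is a minimal sketch in Python (standard library only) that reads the main section of each JAR's META-INF/MANIFEST.MF and reports the version attributes. The helper names `read_manifest` and `report_versions` are made up for this illustration and are not part of Railo or Lucene.

```python
import zipfile
from pathlib import Path


def read_manifest(jar_path):
    """Return the main-section attributes of a JAR's MANIFEST.MF as a dict."""
    with zipfile.ZipFile(jar_path) as jar:
        raw = jar.read("META-INF/MANIFEST.MF").decode("utf-8", errors="replace")

    attrs = {}
    last_key = None
    for line in raw.splitlines():
        if not line:
            break  # a blank line ends the main section of the manifest
        if line.startswith(" ") and last_key is not None:
            # Continuation line: append everything after the leading space.
            attrs[last_key] += line[1:]
            continue
        key, sep, value = line.partition(": ")
        if sep:
            last_key = key
            attrs[key] = value
    return attrs


def report_versions(lib_dir, name_filter=""):
    """List (jar name, title, version) for JARs whose file name contains name_filter."""
    results = []
    for jar_path in sorted(Path(lib_dir).glob("*.jar")):
        if name_filter.lower() not in jar_path.name.lower():
            continue
        try:
            attrs = read_manifest(jar_path)
        except (KeyError, zipfile.BadZipFile):
            continue  # no manifest in the archive, or not a valid ZIP file
        version = (attrs.get("Implementation-Version")
                   or attrs.get("Specification-Version"))
        results.append((jar_path.name, attrs.get("Implementation-Title"), version))
    return results
```

Calling `report_versions("lib/ext", "lucene")` would then list the Lucene JARs and their versions, assuming the script is run from the directory that contains lib/ext. Not every JAR declares these attributes, so a version of `None` only means the manifest does not state one.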

This is a pretty old version, and there don't appear to be any docs for it on the Apache Lucene website. (It might be possible to upgrade Lucene by replacing the relevant JARs, but this might also cause dependency issues; do so at your own risk.)

Since the Lucene website doesn't help, a search for "lucene 2.4.1 indexable documents" brings back a pertinent question about v2.3.2 which asks:

Does Lucene java supports parsing of extensions *.docx, *.pptx, *.mpp i.e. Microsoft Windows 2007 documents?

With the response:

Lucene doesn't actually support any of the document types. What happens is that some program is used to parse the files into an indexable stream and that stream is indexed. That used to be POI in the old days.

OK, so assuming that is still accurate, Lucene doesn't control the file types; Apache POI does.

Checking the JARs tells us Railo 4.0 uses Apache POI v3.8, and the POI changelog reveals that .docx support arrived in v3.5.

So, your .docx files should be supported along with the other MS Office formats. If they are definitely not being indexed, you probably need to identify whether it's a POI issue, a Lucene issue, or a Railo issue. Creating a simple reproducible test case with both .doc and .docx documents is probably a good first step.
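When picking files for that test case, it can help to rule out the document itself first, by confirming that the .docx really contains extractable text. The sketch below does that in Python with the standard library only. It is an assumption-laden illustration: it does not use POI and says nothing about how Railo extracts text, and `docx_text` is a hypothetical helper name. It relies on a .docx being a ZIP archive whose main body lives in word/document.xml.

```python
import zipfile
import xml.etree.ElementTree as ET

# Namespace used by WordprocessingML elements such as <w:p> and <w:t>.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(path):
    """Return the paragraph text of a .docx file, one paragraph per line."""
    with zipfile.ZipFile(path) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))

    paragraphs = []
    for paragraph in root.iter(f"{{{W_NS}}}p"):
        # Text is stored in <w:t> elements inside the runs of each paragraph.
        pieces = [t.text or "" for t in paragraph.iter(f"{{{W_NS}}}t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

If `docx_text` returns the expected words for a file that cfindex still does not find, the file is probably fine and the problem sits further along, in POI, Lucene, or Railo. Text in nested structures such as text boxes may appear twice in this simple version.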

Beyond that, you'll need someone familiar with Lucene/POI to advise. There may or may not be log files that contain details of possible indexing/retrieval errors, or ways to interact with Lucene directly (not via Railo/cfindex) that can help identify where the issue lies.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow