Question

I've stored a Lucene document with a single TextField that contains unstemmed words.

I need to implement a search program that lets users search for both stemmed and exact words, but since I've indexed the words without stemming, a stemmed search cannot be done. Is there a way to search for exact words and/or stemmed words in documents without storing two fields?

Thanks in advance.

Solution

Indexing two separate fields seems like the right approach to me.

Stemmed and unstemmed text require different analysis strategies, and so require you to provide a different Analyzer to the QueryParser. Lucene doesn't really support indexing text in the same field with different analyzers. That is by design. Furthermore, duplicating the text in the same field could result in some fairly strange scoring impacts (heavier scoring on terms that are not touched by the stemmer, particularly).

There is no need to store the text in both of these fields, since storing is independent of indexing, but the text should be indexed in two separate fields.

By the way, you can apply a different analyzer to each field by using a PerFieldAnalyzerWrapper. For example:

Map<String,Analyzer> analyzerList = new HashMap<String,Analyzer>();
analyzerList.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
analyzerList.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
PerFieldAnalyzerWrapper analyzer = new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), analyzerList);
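
To show how the two fields would be used end to end, here is a minimal sketch. It assumes Lucene 4.4 with the core, analyzers-common and queryparser jars on the classpath. The class name, the helper methods and the sample text are made up for illustration; only the field names come from the snippet above.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.en.EnglishAnalyzer;
import org.apache.lucene.analysis.miscellaneous.PerFieldAnalyzerWrapper;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

// Illustrative class name, not part of Lucene.
class TwoFieldSearchSketch {

    // Same per-field setup as above: stemming is applied only to "stemmedText".
    static Analyzer buildAnalyzer() {
        Map<String, Analyzer> perField = new HashMap<String, Analyzer>();
        perField.put("stemmedText", new EnglishAnalyzer(Version.LUCENE_44));
        perField.put("unstemmedText", new StandardAnalyzer(Version.LUCENE_44));
        return new PerFieldAnalyzerWrapper(new StandardAnalyzer(Version.LUCENE_44), perField);
    }

    // Indexes one text into both fields in an in-memory index,
    // then returns the number of hits for the given query string.
    static int countHits(String text, String queryString) throws Exception {
        Analyzer analyzer = buildAnalyzer();
        Directory dir = new RAMDirectory();

        IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(Version.LUCENE_44, analyzer));
        Document doc = new Document();
        // The same text is indexed twice, but stored only once.
        doc.add(new TextField("stemmedText", text, Field.Store.NO));
        doc.add(new TextField("unstemmedText", text, Field.Store.YES));
        writer.addDocument(doc);
        writer.close();

        // The wrapper also picks the analyzer by field name at query time,
        // so one parser can serve both "field:term" forms.
        QueryParser parser = new QueryParser(Version.LUCENE_44, "unstemmedText", analyzer);
        DirectoryReader reader = DirectoryReader.open(dir);
        try {
            IndexSearcher searcher = new IndexSearcher(reader);
            return searcher.search(parser.parse(queryString), 10).totalHits;
        } finally {
            reader.close();
        }
    }

    public static void main(String[] args) throws Exception {
        String text = "The foxes were running quickly";
        System.out.println(countHits(text, "unstemmedText:running")); // exact word
        System.out.println(countHits(text, "unstemmedText:run"));     // exact field, stem only
        System.out.println(countHits(text, "stemmedText:run"));       // stemmed field
    }
}
```

With this layout the application decides per query which field to target: an exact search goes to unstemmedText, a stemmed search goes to stemmedText, and both can be combined in a single query if needed.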

If you really want to do this in a single field, though, I can see a couple of possibilities.

One would be to create your own stem filter, based on (or possibly extending) the one you wish to use already, and add in the ability to keep the original tokens after stemming. Mind your position increments, in this case. Phrase queries and the like may be problematic.
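
To make the position increment point concrete, here is a small library-free sketch of what such a filter would emit. It is not Lucene code: the Token class and the toyStem method are hypothetical stand-ins (the toy stemmer just strips a trailing "ing" or "s"). The idea it illustrates is that the original token advances the position by 1, while its stem is stacked on the same position with an increment of 0.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative class name; a plain-Java model, not a Lucene TokenFilter.
class KeepOriginalStemSketch {

    // A term together with its position increment relative to the previous token.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }

        @Override
        public String toString() {
            return term + "(+" + positionIncrement + ")";
        }
    }

    // Hypothetical toy stemmer for the demo only; a real filter would
    // delegate to an actual stemming algorithm.
    static String toyStem(String term) {
        if (term.length() > 5 && term.endsWith("ing")) {
            return term.substring(0, term.length() - 3);
        }
        if (term.length() > 3 && term.endsWith("s")) {
            return term.substring(0, term.length() - 1);
        }
        return term;
    }

    // Emits every original word, followed by its stem when the stem differs.
    static List<Token> expand(String[] words) {
        List<Token> out = new ArrayList<Token>();
        for (String word : words) {
            out.add(new Token(word, 1));     // original token advances the position
            String stem = toyStem(word);
            if (!stem.equals(word)) {
                out.add(new Token(stem, 0)); // stem shares the original's position
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // prints [walking(+1), walk(+0), dogs(+1), dog(+0), home(+1)]
        System.out.println(expand(new String[] {"walking", "dogs", "home"}));
    }
}
```

If the stem were emitted with an increment of 1 instead, every later token would shift by one position, which is the kind of thing that breaks phrase queries.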

The other (probably worse) possibility would be to add the text to the field normally, then add it again to the same field, but this time after manually stemming it. Two fields added with the same name are effectively concatenated. You'd want to store the original text in a separate field in this case. Expect wonky scoring.

Again, though, both of these are bad ideas. I see no benefit whatsoever to either of these strategies over the much easier and more useful approach of just indexing two fields.

Licensed under: CC-BY-SA with attribution