Question

I have a large DB with binary documents (like PDFs) and an index created without TermFreqVector, just "Store.NO, Index.ANALYZED". I'm trying to implement a phrase suggester/predictor on top of that. I would like to search for one or more words, such as "where" or "where are", and get back something like "where are you john".

I'm surprised that Luke is somehow able to restore documents term by term from the created index (I've checked its sources, but I still don't know how that's possible without TermFreqVector). Does anyone know how it works? I see two options for my suggester:

1) Somehow use Luke's mechanism to restore a document from the index I have now. (That would be the best.)

2) Create another index just for the phrase suggester. (However, the current indexing takes about 2-3 days and about 4-5 GB.) I've searched the net for a solution, but most results lead to Solr, which I can't use.

I've tried a few solutions already, but I'm stuck.

I would be grateful for any hints.


Solution 2

OK. After a few retries with a different approach, I got it working, and it's very fast. :) Here's what I did: I re-indexed all my documents with the additional option "TermVector.WITH_POSITIONS", and I search for terms directly in the index using PrefixQuery. I then take all positions of the matched term within each document and store them in a map. Finally, I iterate over the document's terms, checking whether each term's position is within (number of words in the suggested phrase) of the matched term's position.

If you need examples, please ask :)
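
The steps described in this answer can be sketched roughly as follows. This is not the answerer's code and does not use the Lucene API: a plain `TreeMap` stands in for one document's positional term vector, a sorted prefix scan stands in for `PrefixQuery`, and the names `PhraseSuggesterSketch`, `termVector`, and `suggest` are made up for illustration. It only handles a single (possibly partial) word as input.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Toy stand-in for one document's positional term vector (hypothetical, not the Lucene API). */
class PhraseSuggesterSketch {

    /** Builds term -> positions, roughly the data a term vector with positions holds per document. */
    static TreeMap<String, List<Integer>> termVector(String text) {
        TreeMap<String, List<Integer>> vector = new TreeMap<>();
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        for (int pos = 0; pos < tokens.length; pos++) {
            vector.computeIfAbsent(tokens[pos], k -> new ArrayList<>()).add(pos);
        }
        return vector;
    }

    /** For each term starting with prefix, returns up to maxWords consecutive terms from each occurrence. */
    static List<String> suggest(TreeMap<String, List<Integer>> vector, String prefix, int maxWords) {
        // Invert the vector (position -> term) so the following words can be looked up.
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        vector.forEach((term, positions) -> positions.forEach(p -> byPosition.put(p, term)));

        List<String> suggestions = new ArrayList<>();
        // Terms are sorted, so all terms sharing the prefix are contiguous.
        for (Map.Entry<String, List<Integer>> e : vector.tailMap(prefix, true).entrySet()) {
            if (!e.getKey().startsWith(prefix)) {
                break;
            }
            for (int start : e.getValue()) {
                // Keep terms whose position p satisfies start <= p < start + maxWords.
                suggestions.add(String.join(" ", byPosition.subMap(start, start + maxWords).values()));
            }
        }
        return suggestions;
    }
}
```

For example, `suggest(termVector("where are you john"), "wh", 4)` walks from the single occurrence of "where" and joins the terms at positions 0 through 3. Supporting multi-word input such as "where are" would additionally require checking that the matched terms sit at adjacent positions.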

OTHER TIPS

Firstly, I wouldn't recommend trying to emulate Luke's document rebuilding. It's meant for debugging. It's costly, complicated, and lossy. If you really want to know how it works, Luke is open source, so grab the source code and take a look at: /src/org/getopt/luke/DocReconstructor.java
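
To give a rough idea of why this is possible at all without term vectors: as far as I understand, an analyzed field keeps term positions in its postings by default (phrase queries need them), so a document can be pieced together by walking every term and placing it at the positions recorded for that document. The sketch below illustrates that idea on a plain in-memory map. It is not Luke's code and not the Lucene API; the class name, the `postings` layout, and the `?` gap marker are assumptions made for illustration.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Illustrative only: rebuilds a token sequence from positional postings (not Luke's implementation). */
class DocReconstructionSketch {

    /** postings: term -> (docId -> positions of that term in the document). */
    static String reconstruct(Map<String, Map<Integer, List<Integer>>> postings, int docId) {
        TreeMap<Integer, String> byPosition = new TreeMap<>();
        // Costly: every term in the index has to be visited to find the ones in this document.
        for (Map.Entry<String, Map<Integer, List<Integer>>> e : postings.entrySet()) {
            List<Integer> positions = e.getValue().get(docId);
            if (positions == null) {
                continue;
            }
            for (int p : positions) {
                byPosition.put(p, e.getKey());
            }
        }
        if (byPosition.isEmpty()) {
            return "";
        }
        StringBuilder sb = new StringBuilder();
        for (int p = 0; p <= byPosition.lastKey(); p++) {
            if (p > 0) {
                sb.append(' ');
            }
            // Lossy: positions with no term (e.g. removed stop words) cannot be recovered.
            sb.append(byPosition.getOrDefault(p, "?"));
        }
        return sb.toString();
    }
}
```

The result contains only analyzed tokens (lowercased, stemmed, or otherwise transformed by the analyzer), never the original text, which is one reason the reconstruction is lossy.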

The implementation of phrase suggestion I've seen stores the phrases as a StringField and uses SpellChecker to find recommendations. This would require you to define what qualifies as a "phrase" in this context, and to index the phrases separately. I would probably just create another field for this, rather than an entirely separate index, but that is up to you.
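
One possible way to define a "phrase", shown here only as an assumption, is every run of consecutive words of a given length (word n-grams). The sketch below produces such phrases with plain Java; the class and method names are hypothetical, and each returned string would then be indexed as its own untokenized value.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Illustrative only: treats every run of minWords..maxWords consecutive words as a phrase. */
class PhraseExtractionSketch {

    static List<String> wordNGrams(String text, int minWords, int maxWords) {
        String[] tokens = text.trim().toLowerCase().split("\\s+");
        List<String> phrases = new ArrayList<>();
        for (int start = 0; start < tokens.length; start++) {
            // Emit phrases of increasing length that still fit inside the text.
            for (int n = minWords; n <= maxWords && start + n <= tokens.length; n++) {
                phrases.add(String.join(" ", Arrays.copyOfRange(tokens, start, start + n)));
            }
        }
        return phrases;
    }
}
```

Note that the number of phrases grows with both the text length and the range of phrase lengths, so the extra field can become large for a big corpus.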

Licensed under: CC-BY-SA with attribution