Can I insert a Document into Lucene without generating a TokenStream?

https://stackoverflow.com/questions/17432365

02-06-2022

Question

Is there a way to add a document to the index by supplying terms and term frequencies directly, rather than via Analysis and/or TokenStream? I ask because I want to model some data where I know the term frequencies, but there is no underlying text document to be analyzed. I could create one by repeating the same term many times (I don't care about positions or highlighting in this case, either, just scoring), but that seems a bit perverse (and probably slower than just supplying the counts directly).

(also asked on the mailing list)

Solution

You don't need to pass everything through an Analyzer in order to create the document. I'm not aware of any way to pass in terms and frequencies directly, as you've asked (though I'd be interested to know if you find a good approach), but you can certainly pass in IndexableFields one term at a time. That would still require you to add each term multiple times, like:

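// Field.Store.NO: index the term without storing its value.
// Note: StringField does not index term frequencies, so if the repeats must
// influence scoring, a Field with a custom FieldType that does may be needed.
IndexableField field = new StringField(fieldName, myTerm, Field.Store.NO);
for (int i = 0; i < frequency; i++) {
    document.add(field);
}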

You can also take a step further back and cut the Document class out entirely by using any Iterable<IndexableField> (a simple List, for instance), which might be a more direct way to model your data.
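
To make the List idea concrete, here is a minimal sketch of expanding known term counts into a flat list with one entry per occurrence. It is deliberately Lucene-free so it runs on its own: TermExpansion, expand and fieldFactory are hypothetical names for illustration, not Lucene API, and the identity factory in main merely stands in for a field constructor.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class TermExpansion {

    /**
     * Expands a term -> frequency map into a flat list holding one element per
     * occurrence. The factory (an assumption of this sketch) turns a term into
     * whatever element type the index expects.
     */
    public static <T> List<T> expand(Map<String, Integer> termFrequencies,
                                     Function<String, T> fieldFactory) {
        List<T> fields = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : termFrequencies.entrySet()) {
            // One shared instance per term, added `frequency` times.
            T field = fieldFactory.apply(entry.getKey());
            for (int i = 0; i < entry.getValue(); i++) {
                fields.add(field);
            }
        }
        return fields;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("lucene", 3);
        counts.put("index", 1);
        // Identity factory keeps the sketch Lucene-free.
        // Prints: [lucene, lucene, lucene, index]
        System.out.println(expand(counts, term -> term));
    }
}
```

With Lucene on the classpath, the factory could be something like term -> new StringField("body", term, Field.Store.NO) (the field name "body" is made up here), and the resulting list could then be handed to IndexWriter.addDocument, which accepts an Iterable of IndexableField.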

Not sure if that gets you any closer to what you are looking for, but perhaps a step vaguely in the right direction.

Licensed under: CC-BY-SA with attribution
