Is it possible to modify term frequencies / term vectors directly?

https://stackoverflow.com/questions/21970462

15-10-2022

Question

I would like to use Lucene.NET to store and query term vectors. However, I do not want the term vectors to be created from documents. Instead, I want to be able to write and update the term vectors directly, without positions or offsets of the term/token.

The workaround would be to generate text from a term vector, i.e. from the term vector

foo: 3; bar: 1

generate the text

foo, foo, foo, bar

and let Lucene index that text. If I want to update the term frequency of bar to 2, I could get the stored text (or generate it from the old term vector, if I don't store it), change it to

foo, foo, foo, bar, bar

and update the corresponding document in the index.
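The workaround above can be sketched in plain Java as follows. This is only an illustration of the text generation step, not Lucene code: the class name `TermVectorText` and the method `toText` are hypothetical, and the term vector is assumed to be held in an ordinary map from term to frequency.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.StringJoiner;

class TermVectorText {
    // Expands a term -> frequency map into text in which every term is
    // repeated as often as its frequency says.
    static String toText(Map<String, Integer> termVector) {
        StringJoiner text = new StringJoiner(", ");
        for (Map.Entry<String, Integer> entry : termVector.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                text.add(entry.getKey());
            }
        }
        return text.toString();
    }

    public static void main(String[] args) {
        // LinkedHashMap keeps insertion order, so the output is stable.
        Map<String, Integer> termVector = new LinkedHashMap<>();
        termVector.put("foo", 3);
        termVector.put("bar", 1);
        System.out.println(toText(termVector)); // foo, foo, foo, bar

        // Updating a frequency means regenerating the whole text,
        // which would then have to be reindexed.
        termVector.put("bar", 2);
        System.out.println(toText(termVector)); // foo, foo, foo, bar, bar
    }
}
```

The generated string is what would be handed to the index as the field content; every frequency change regenerates it in full, which is where the cost comes from.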

This is quite expensive for such a simple task. Obviously, this is not the use case Lucene was built for. Still, I would like to be able to use the power of Lucene for querying and so on.

Is there a way to write term vectors for a document directly or do you have any other good ideas?

Solution

As I said in my question, Lucene is not intended for storing and manipulating term vectors directly. The initial approach is more or less the way to go, at least with regard to updating the term vector:

1. Retrieve the document that represents the relevant term vector
2. Update the corresponding field of the document
3. Reindex the document (in Lucene, an update is a delete followed by an add)

I haven't found a way to update a single term frequency in the vector without reindexing the entire document.

One improvement on the method described in the question is to encode the term vector as term:frequency pairs.

Instead of

foo foo foo bar

the field content can be written as

foo:3; bar:1;

You can then write a custom TokenFilter which reads these tokens one by one and emits each term n times, where n is the encoded frequency. This will not improve performance, since the document still has to be reindexed on every change, but it simplifies handling of the term vectors. If you're not familiar with custom token filters and analyzers, this approach is probably not worth the effort, and I would stick with the naive version I already suggested in the question.
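The expansion such a filter has to perform can be sketched in plain Java. This is a minimal sketch of the parsing logic only and deliberately uses no Lucene classes: the class name `TermFrequencyExpander` and the method `expand` are hypothetical, and the input is assumed to follow the `term:frequency;` format shown above. In a real token filter the same logic would run per incoming token instead of on the whole string.

```java
import java.util.ArrayList;
import java.util.List;

class TermFrequencyExpander {
    // Parses content such as "foo:3; bar:1;" and returns every term
    // repeated as many times as its frequency.
    static List<String> expand(String encoded) {
        List<String> tokens = new ArrayList<>();
        for (String pair : encoded.split(";")) {
            String trimmed = pair.trim();
            if (trimmed.isEmpty()) {
                continue; // skip blanks left over between or after separators
            }
            // Split on the last colon so that the frequency is always the tail.
            int separator = trimmed.lastIndexOf(':');
            if (separator < 1) {
                throw new IllegalArgumentException("Expected term:frequency, got: " + trimmed);
            }
            String term = trimmed.substring(0, separator).trim();
            int frequency = Integer.parseInt(trimmed.substring(separator + 1).trim());
            for (int i = 0; i < frequency; i++) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(expand("foo:3; bar:1;")); // [foo, foo, foo, bar]
    }
}
```

With this encoding, changing the frequency of bar to 2 only means rewriting `bar:1` as `bar:2` in the stored field before reindexing, rather than editing a run of repeated words.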

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow