As I said in my question, Lucene is not intended for storing and manipulating term vectors directly. The initial approach is more or less the way to go, at least with regard to updating the term vector:
- Retrieve the document which represents the relevant term vector
- Update the corresponding field of the document
- Reindex the document (Delete, then Add, equals Update in Lucene)
I haven't found a way to update a single term frequency in the vector without reindexing the entire document.
One improvement on the method described in the question is to encode the term vector as term-frequency pairs:
Instead of
foo foo foo bar
the field content can be written as
foo:3; bar:1;
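To illustrate why this encoding is easier to handle, here is a minimal sketch in plain Java (no Lucene calls) that reads the field content into a map, bumps one frequency, and writes it back. `TermFreqCodec` and its methods are hypothetical names made up for this example, and the sketch assumes that terms never contain `;` and that every pair has the form `term:count`.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper, not part of Lucene: converts between the
// "term:freq;" field encoding and an in-memory map.
final class TermFreqCodec {

    // Parse "foo:3; bar:1;" into {foo=3, bar=1}, keeping the original order.
    static Map<String, Integer> decode(String field) {
        Map<String, Integer> freqs = new LinkedHashMap<>();
        for (String pair : field.split(";")) {
            String p = pair.trim();
            if (p.isEmpty()) {
                continue; // skip blank fragments between separators
            }
            int sep = p.lastIndexOf(':');
            if (sep < 1) {
                throw new IllegalArgumentException("Malformed pair: " + p);
            }
            String term = p.substring(0, sep);
            int count = Integer.parseInt(p.substring(sep + 1).trim());
            // Sum the counts if the same term shows up more than once.
            freqs.merge(term, count, Integer::sum);
        }
        return freqs;
    }

    // Serialize {foo=3, bar=1} back to "foo:3; bar:1;".
    static String encode(Map<String, Integer> freqs) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, Integer> e : freqs.entrySet()) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(e.getKey()).append(':').append(e.getValue()).append(';');
        }
        return sb.toString();
    }
}
```

Incrementing the frequency of `bar` is then `freqs.merge("bar", 1, Integer::sum)` on the decoded map, followed by `encode(freqs)`. The document still has to be reindexed afterwards with the new field value.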
You can then write a custom TokenFilter which reads these tokens one by one and emits each term n times, where n is the frequency stored with it. This will not improve performance, but it simplifies handling of the term vectors. If you're not familiar with custom token filters and analyzers, it is probably not worth using this approach, and I would stick with the naive version I already suggested in the question.
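The expansion step itself is small. The following plain-Java sketch shows only the core logic such a filter would need, not an actual Lucene TokenFilter; in a real filter this would presumably sit inside `incrementToken()`, with the remaining repeat count kept as state between calls. `TermFreqExpander` is a hypothetical name, and the sketch assumes the upstream tokenizer splits on whitespace and `;`, so that each incoming token looks like `foo:3`.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the expansion a custom filter would perform.
final class TermFreqExpander {

    // Turn one encoded token such as "foo:3" into ["foo", "foo", "foo"].
    static List<String> expandToken(String token) {
        int sep = token.lastIndexOf(':');
        if (sep < 1) {
            throw new IllegalArgumentException("Malformed token: " + token);
        }
        String term = token.substring(0, sep);
        int count = Integer.parseInt(token.substring(sep + 1));
        List<String> out = new ArrayList<>(count);
        for (int i = 0; i < count; i++) {
            out.add(term);
        }
        return out;
    }

    // Expand a whole field value, e.g. "foo:3; bar:1;".
    static List<String> expandField(String field) {
        List<String> out = new ArrayList<>();
        for (String token : field.split("[;\\s]+")) {
            if (!token.isEmpty()) {
                out.addAll(expandToken(token));
            }
        }
        return out;
    }
}
```

For the example above, `TermFreqExpander.expandField("foo:3; bar:1;")` yields the same token sequence as the naive field content `foo foo foo bar`.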