Question

We have the following problem at hand: we want to do a full reindex with 100% read availability during the process. The problem arises when deleting old documents from the index. At the moment we're doing something like this:

1) Fetch all data from the database and update the Solr index via solrServer.add().
2) Get all document ids that were updated and compare them with all the document ids in the index.
3) Delete all documents that are in the index but weren't updated.

This seems to work, but is there maybe a better or easier solution for this task?
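Steps 2 and 3 above amount to a set difference between the ids currently in the index and the ids touched by the reindex run. A minimal sketch of that diff (class and method names are illustrative, not part of any Solr API):

```java
import java.util.Set;
import java.util.TreeSet;

// Sketch of steps 2 and 3: given the ids currently in the index and the
// ids that were just reindexed, compute the stale ids to delete.
public class StaleIds {
    static Set<String> staleIds(Set<String> idsInIndex, Set<String> idsUpdated) {
        Set<String> stale = new TreeSet<>(idsInIndex);
        stale.removeAll(idsUpdated); // anything not touched by this run is stale
        return stale;
    }

    public static void main(String[] args) {
        // docs a, b, c are in the index; only a and c were reindexed -> delete b
        System.out.println(staleIds(Set.of("a", "b", "c"), Set.of("a", "c")));
    }
}
```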


Solution

The changes do not become visible until you commit, so you can issue a delete-all and then index all your documents before committing once. Just make sure automatic commits are disabled. This obviously requires more memory.
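With SolrJ this would presumably be a `deleteByQuery("*:*")`, then `add(...)` for every document, then a single `commit()`. The toy in-memory stand-in below (not the real SolrJ API) illustrates the visibility semantics the answer relies on: readers keep seeing the old index until the one commit:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Toy in-memory stand-in (not the real SolrJ API) illustrating that
// deletes and adds issued before a commit stay invisible to readers.
public class ToyIndex {
    private final Map<String, String> visible = new HashMap<>();
    private final List<Runnable> pending = new ArrayList<>();

    void deleteAll()                { pending.add(visible::clear); }
    void add(String id, String doc) { pending.add(() -> visible.put(id, doc)); }
    void commit()                   { pending.forEach(Runnable::run); pending.clear(); }
    Set<String> visibleIds()        { return new TreeSet<>(visible.keySet()); }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.add("old-1", "stale doc");
        idx.commit();                         // old index is live

        idx.deleteAll();                      // full reindex: delete everything...
        idx.add("new-1", "fresh doc");        // ...and re-add the current data
        System.out.println(idx.visibleIds()); // readers still see only old-1
        idx.commit();                         // one commit makes the switch
        System.out.println(idx.visibleIds()); // now only new-1 is visible
    }
}
```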

Alternatively, you can add a separate field with a generation stamp (e.g. an increasing ID or a timestamp). Then you issue a delete-by-query to pick up the leftover documents with an old generation.
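A sketch of that cleanup query; the field name `generation` is an assumption, while `field:[* TO n]` is standard Solr range-query syntax. Each reindex run stamps its documents with the current generation, then the query below is handed to a delete-by-query call:

```java
// Sketch of the generation-stamp cleanup. The field name "generation" is
// an assumption; "field:[* TO n]" is standard Solr range-query syntax.
public class GenerationCleanup {
    static String staleGenerationQuery(long currentGeneration) {
        // matches every document stamped before the current reindex run
        return "generation:[* TO " + (currentGeneration - 1) + "]";
    }

    public static void main(String[] args) {
        System.out.println(staleGenerationQuery(42)); // generation:[* TO 41]
    }
}
```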

Finally, you can index into a new core/collection and then swap the active one to point to the new index. Afterwards, you can simply delete the old collection directory.
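For standalone Solr, that swap goes through the CoreAdmin API's `action=SWAP`; the host and core names below are placeholders for illustration:

```java
// Sketch: build the CoreAdmin SWAP request used in the core-swap approach.
// "action=SWAP" with "core" and "other" is the CoreAdmin API; the base URL
// and core names are assumptions for this example.
public class CoreSwap {
    static String swapUrl(String solrBase, String liveCore, String rebuiltCore) {
        return solrBase + "/admin/cores?action=SWAP&core=" + rebuiltCore
                + "&other=" + liveCore;
    }

    public static void main(String[] args) {
        // after the rebuilt core is fully indexed and committed, swap it in
        System.out.println(swapUrl("http://localhost:8983/solr", "live", "rebuild"));
    }
}
```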

Other tips

It sounds like you may have a performance issue with the deletes. If you do this:

delete id:12345
delete id:23456
delete id:13254

then it is a lot slower than this:

delete id:(12345 OR 23456 OR 13254)

Collect the list of ids that need to be deleted, batch them in groups of 100 or so, and transform those batches into delete queries using parentheses and OR. I have done this with batches of deletes numbering several thousand, and it is much faster than stepping through one at a time.
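The batching described above can be sketched as follows: chunk the stale ids and turn each chunk into one `id:(a OR b OR ...)` query, each of which would then be sent as a single delete-by-query instead of one delete per id (class name and batch size of 2 are illustrative):

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of the batched delete: chunk the stale ids and build one
// "id:(a OR b OR ...)" query per chunk.
public class DeleteBatcher {
    static List<String> deleteQueries(List<String> ids, int batchSize) {
        List<String> queries = new ArrayList<>();
        for (int i = 0; i < ids.size(); i += batchSize) {
            List<String> batch = ids.subList(i, Math.min(i + batchSize, ids.size()));
            queries.add("id:(" + String.join(" OR ", batch) + ")");
        }
        return queries;
    }

    public static void main(String[] args) {
        // batches of 2 just to keep the demo small; use ~100 in practice
        deleteQueries(List.of("12345", "23456", "13254"), 2)
                .forEach(System.out::println);
    }
}
```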

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow