Question

One of the things we learn from the "Index Cardinality" video [M101J: MongoDB for Java Developers] is that when a document with a multikey index gets moved, all of its indexes must be updated as well, which incurs significant overhead.

I've wondered whether it would be possible to somehow bypass this constraint. The obvious solution is to add another level of indirection (a famous pattern for solving computer science problems :-)): instead of referencing the document directly from the index, we create an entity for each document that references that document, and have the indexes reference that entity. Now when we move the document we only have to modify that entity (the entity itself will never move, because its BSON shape will always be the same). The problem with this solution, of course, is that it trades space for performance (indexes also suffer from this problem).

But all hope is not lost: in MongoDB every document has an immutable _id field, which is automatically indexed. Given this, we know that if a document is ever moved, its associated _id index entry will also be updated, so why not just make all the other indexes reference the document's corresponding _id index entry?

Given this solution, the only index that will ever be updated when a document moves is the _id index.
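To make the write-side difference concrete, here is a minimal toy model in Python. It is only a sketch: the class names, the integer "locations", and the `tags` array field are all hypothetical and say nothing about how MongoDB actually stores its indexes. It just counts how many index entries must be rewritten when a document moves under each scheme.

```python
# Toy model only: every name here is hypothetical, not MongoDB internals.

class DirectStore:
    """Every index entry stores the document's physical location."""

    def __init__(self):
        self.heap = {}       # location -> document
        self.id_index = {}   # _id -> location
        self.tag_index = {}  # tag value -> set of locations (multikey index)

    def insert(self, loc, doc):
        self.heap[loc] = doc
        self.id_index[doc["_id"]] = loc
        for tag in doc["tags"]:
            self.tag_index.setdefault(tag, set()).add(loc)

    def move(self, doc_id, new_loc):
        """Relocate a document; return the number of index entries rewritten."""
        old_loc = self.id_index[doc_id]
        doc = self.heap.pop(old_loc)
        self.heap[new_loc] = doc
        self.id_index[doc_id] = new_loc
        writes = 1  # the _id index entry
        for tag in doc["tags"]:
            # One multikey entry per array element has to be repointed.
            self.tag_index[tag].discard(old_loc)
            self.tag_index[tag].add(new_loc)
            writes += 1
        return writes


class IndirectStore:
    """Secondary entries store the _id; only the _id index stores a location."""

    def __init__(self):
        self.heap = {}       # location -> document
        self.id_index = {}   # _id -> location
        self.tag_index = {}  # tag value -> set of _id values

    def insert(self, loc, doc):
        self.heap[loc] = doc
        self.id_index[doc["_id"]] = loc
        for tag in doc["tags"]:
            self.tag_index.setdefault(tag, set()).add(doc["_id"])

    def move(self, doc_id, new_loc):
        """Relocate a document; only the _id index entry changes."""
        old_loc = self.id_index[doc_id]
        self.heap[new_loc] = self.heap.pop(old_loc)
        self.id_index[doc_id] = new_loc
        return 1

    def find_by_tag(self, tag):
        # Extra hop: tag -> _id -> location -> document.
        ids = sorted(self.tag_index.get(tag, ()))
        return [self.heap[self.id_index[i]] for i in ids]
```

For a document with three array elements, `DirectStore.move` rewrites four entries in this model, while `IndirectStore.move` rewrites one. The price is the extra `_id` lookup in `find_by_tag`.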

I want to know whether this solution could be implemented in MongoDB, or are there hidden gotchas that would make it impractical?

Thanks


Solution

Here is the answer I got from Andy Schwerin when I posted the same question as a Jira ticket: https://jira.mongodb.org/browse/SERVER-12614

Andy Schwerin's answer:

  • It's feasible, but it makes all reads access the primary index. So, if you want to read a document that you find via a secondary index, you must take that _id and then look it up in the primary index to find the current location. Depending on the application, that might be a good tradeoff or a bad one. Other database systems in the past have used special markers in the old location of records, sometimes called tombstones, to point to new locations. This lets you pay for the indirection only when a document does move, at the cost of needing to periodically clean up the indexes so that you can garbage collect old tombstones.
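The tombstone idea mentioned above can be sketched the same way. This is an illustrative Python model under stated assumptions, not a description of any particular database: `TombstoneHeap`, its slot layout, and the `compact` routine are hypothetical names. A moved document leaves a forwarding marker at its old location, readers follow the marker, and a periodic cleanup repoints index entries so the markers can be dropped.

```python
# Toy model only: names and slot layout are hypothetical.

class TombstoneHeap:
    def __init__(self):
        # location -> ("doc", document) or ("moved", new_location)
        self.slots = {}

    def put(self, loc, doc):
        self.slots[loc] = ("doc", doc)

    def move(self, old_loc, new_loc):
        kind, doc = self.slots[old_loc]
        if kind != "doc":
            raise ValueError("no live document at this location")
        self.slots[new_loc] = ("doc", doc)
        self.slots[old_loc] = ("moved", new_loc)  # tombstone left behind

    def read(self, loc):
        """Return (document, hops); hops counts the tombstones followed."""
        hops = 0
        kind, payload = self.slots[loc]
        while kind == "moved":
            loc = payload
            kind, payload = self.slots[loc]
            hops += 1
        return payload, hops

    def compact(self, index):
        """Repoint every index entry (key -> location), then drop tombstones."""
        for key, loc in index.items():
            while self.slots[loc][0] == "moved":
                loc = self.slots[loc][1]
            index[key] = loc
        self.slots = {l: s for l, s in self.slots.items() if s[0] == "doc"}
```

In this model a read costs an extra hop only for documents that have moved since the last `compact`, which is the tradeoff described in the answer: indirection is paid on demand, and index cleanup is deferred to a background pass.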

Also, thanks to leif for the informative link: http://www.tokutek.com/2014/02/the-effects-of-database-heap-storage-choices-in-mongodb/ I asked the author the same question, and here is his answer:

Zardosht Kasheff's answer:

  • You could, but then a point query into a secondary index may cause three I/Os instead of two. Currently, with either scheme, a point query into the secondary index may require an I/O to get the row identifier, and another to retrieve the document. With this scheme, you would need one I/O to get the _id, another to get the row identifier, and a third to get the document. This seems undesirable.
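The two-versus-three count can be traced step by step. The sketch below uses plain dictionaries as stand-ins for the indexes and the heap, and assumes the worst case where every lookup misses the cache and costs one I/O. The function names and data layout are hypothetical.

```python
# Toy model only: plain dicts stand in for on-disk structures.

def point_query_direct(secondary, heap, key):
    """Secondary index stores the row identifier directly."""
    steps = []
    row_id = secondary[key]
    steps.append("secondary index -> row identifier")
    doc = heap[row_id]
    steps.append("heap -> document")
    return doc, steps


def point_query_via_id(secondary, id_index, heap, key):
    """Secondary index stores the _id; the _id index stores the row identifier."""
    steps = []
    doc_id = secondary[key]
    steps.append("secondary index -> _id")
    row_id = id_index[doc_id]
    steps.append("_id index -> row identifier")
    doc = heap[row_id]
    steps.append("heap -> document")
    return doc, steps
```

Each entry in `steps` corresponds to one potential I/O, so the direct scheme lists two and the `_id`-based scheme lists three for the same point query.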
Licensed under: CC-BY-SA with attribution