How do you design a schema to efficiently query nested items in a key-value database?

https://stackoverflow.com/questions/607746

03-07-2019
|

Question

I'm using Mnesia with Erlang, but this question applies to any key-value db like couchdb, etc.

I'm trying to break free of the RDBMS thought process, but I can't wrap my head around how to efficiently implement this kind of schema.

Say I have a User record, and he has many SubItemA records, which has many SubItem B records, so:

User
-SubItem A
--SubItem B
...

I need to run queries on SubItem B. Is it efficient to do it when it's this nested? Should I just normalize it so it will be quicker?

I have heard of some people using data duplication so the data is both nested and separate, is this ridiculous or is this actually useful in some cases?

Solution

The underlying question is, when is the performance good enough?

Table-scanning the User dictionary isn't excessive overhead if you really do need to examine every SubItem B in detail and the size of the B's dominates the overall size of the dictionary.

If that isn't good enough, normalize it so you can avoid reading in all the User and SubItem A data up front when you're querying SubItem B. Use a compound key such as (UserId, SubItemAId, SubItemBId) in the SubItem B dictionary if the table is ordered so you can do range queries.

If that totally kills your User/SubItem A query performance, then consider data duplication as a last resort because it's more error-prone.

OTHER TIPS

In CouchDb it would be trivial to emit view entries for each of the SubItems. This would give you very fast access to those items. Depending on what you also put in the view entries you could probably provide any information you need for linking back to parent documents / sub items.

I'm not sure about Mnesia, and I'm only just getting started with CouchDB, but my understanding is that in CouchDB, since you generate your own custom indexes ("views"), you can straightforwardly build an index on those sub-items.

An example map function:

function(doc) {
    for(var i in doc.subitems_a) {
        var subitem_a = doc.subitems_a[i];

        for(var j in doc.subitems_a[item_a].subitems_b) {
            var subitem_b = subitem_a.subitems_b[j];

            emit(subitem_b, doc)
        }
    }
}

That is effectively an indexed listing of SubItem Bs and then you could cut and splice from that listing as you choose.

Actually it depends on the database you are using I think. In CouchDB one thing will work better while in Mnesia something else would be better. Should you partition and shard the data? On what criteria should you do so? How much data duplication is enough?

As Jeffery Hantin said it's going to take some experimentation and analysis to figure out the right solution. That said most of the non-relational databases out there provide you with the tools you need to solve the problem. Your part is figuring out the trade-offs of each one and which trade-off you can accept versus the others.

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow