Use embedded documents, if the embedded document get's queried very often?

https://stackoverflow.com/questions/21273195

01-10-2022
|

Вопрос

The basic data model consists of 3 objects:

A: ObjectId id, String name, List of B objects
B: ObjectId id, String name
C: ObjectId id, reference to B object

Objects of type A and B will get created very infrequently (maybe one of those objects per month). And every A object will contain only a few (= 1-5) B objects. If I use embedding it would look like this:

{
  id : ObjectId(...),
  name : "some A name",
  bList : [
    {
      id : ObjectId(...),
      name : "some B name"
    },{
      id : ObjectId(...),
      name : "some other B name"
    }
  ]
}

Objects of type C will always look like that:

{
  id : ObjectId(...),
  bReference : ObjectId(...)
}

I can't embed C into B, because there can be an infinite (= millions of entries connected to a single B object) number of C objects.

A and B objects will get modified maybe once per month. C objects will never get modified. But C objects will get inserted very often (peak load should be about 20 inserts per second).

The problem/question:

Everytime I query for an document of type C, I need some information from the referenced B document (and some times additionally from the A document). For example I need to check, if I am allowed to access the C object(s):

c = db.c.find({ ... })
a = db.a.find({ bList.id : c.bReference })
// do something with the access permission information from within a.

If I need information from A and B this seems pretty good. Right?

But what if I only need information from the B document (which is pretty often the case)?

c = db.c.find({ ... })
b = db.a.find({ bList.id : c.bReference }, { bList.$ : 1 })

If I query for multiple C objects, they will always(!) reference the same B object. When using non-embedded database design, the query would look like that:

c = db.c.find({ ... })
b = db.b.find({ _id : c.bReference })

The most important question is: If I do this veeeeery frequently, will there be a huge performance difference between those two variants?

Bonus question:

Can I have a Unique constraint across all A objects? I want to have all those a.bList.id ObjectIds to be unique (I know they should(!) be globally unique, but having a constraint on DB level makes me feel more secure)

Решение

The answer to your first question is, as usually: it depends. If the objects are large (say hundreds of k each), there could be a slight performance hit. If the objects are relatively small, there shouldn't be a big difference.

Of course, 'slight', 'large', 'small', etc. are all very vague, so let me elaborate: Since there are very few A and B objects and they are queried very often, they will probably remain in RAM which is good, so the only potential bottleneck are the serializer/deserializer and network. But let's say you're doing 1000 queries per second and the objects are 1k in size then this is 1MB/s - still not a particularly high load.

If the objects are 100k in size we'd be talking about 100MB/s or roughly 1Gbit/s which is a completely different game - certainly not a high load for a powerful server, but it will likely choke a $20/m virtual private server. Then again, even at that rate, depending on your consistency constraints you might want to cache those objects for a second or two, dramatically reducing the network load for your reads.

In a nutshell, at 20 operations per second, there's not much to worry about.

As you can see from my other answers here, I very often advice against embedding, but since that data is modified very rarely, embedding should be fine.

Лицензировано под: CC-BY-SA с атрибуция

Не связан с StackOverflow