Combining relational and document databases for movies

https://softwareengineering.stackexchange.com/questions/399392

03-03-2021

Question

As an architecture design brainstorm, I am pondering how I should define schemas for a movie database where heavy text searching is delegated to a document-based database (e.g. Elasticsearch) while the RDBMS is kept for relations and querying by ID.

Imagine having the following entities...

actor (id, name, dob)
studio (id, name, city)
movie (id, title, description, studio_id)

... with relations

actor [N..N] movie
studio [1..N] movie

The way I would split this between the two database types is:

RDBMS: the 3 entities described above, plus a join table for the N..N relationship
Elasticsearch: a single index (movie) with the following properties:
- title [txt]
- description [txt]
- studio [txt] (studio name)
- actors [array of txt] (actor names)
- movie_id [int] (id of an actual rdbms record)
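The layout above can be sketched in code. This is a minimal illustration, assuming the RDBMS is the side that holds the normalised rows: SQLite stands in for the RDBMS, and the function `build_movie_document` and the table name `movie_actor` are hypothetical names introduced here. The returned dict is the flat document that would be sent to the search index; no Elasticsearch client is involved.

```python
import sqlite3

# Hypothetical schema mirroring the entities in the question.
SCHEMA = """
CREATE TABLE studio (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT);
CREATE TABLE actor  (id INTEGER PRIMARY KEY, name TEXT NOT NULL, dob TEXT);
CREATE TABLE movie  (
    id INTEGER PRIMARY KEY,
    title TEXT NOT NULL,
    description TEXT,
    studio_id INTEGER NOT NULL REFERENCES studio(id)
);
-- Join table for the actor [N..N] movie relationship.
CREATE TABLE movie_actor (
    movie_id INTEGER NOT NULL REFERENCES movie(id),
    actor_id INTEGER NOT NULL REFERENCES actor(id),
    PRIMARY KEY (movie_id, actor_id)
);
"""


def build_movie_document(conn, movie_id):
    """Project one movie row and its relations into a flat search document."""
    row = conn.execute(
        "SELECT m.id, m.title, m.description, s.name "
        "FROM movie m JOIN studio s ON s.id = m.studio_id "
        "WHERE m.id = ?",
        (movie_id,),
    ).fetchone()
    if row is None:
        return None
    actors = [
        name
        for (name,) in conn.execute(
            "SELECT a.name FROM actor a "
            "JOIN movie_actor ma ON ma.actor_id = a.id "
            "WHERE ma.movie_id = ? ORDER BY a.name",
            (movie_id,),
        )
    ]
    return {
        "movie_id": row[0],   # ID of the RDBMS record
        "title": row[1],
        "description": row[2],
        "studio": row[3],     # studio name, denormalised
        "actors": actors,     # actor names, denormalised
    }


def demo():
    """Load placeholder data and return the document for one movie."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO studio VALUES (1, 'Example Studio', 'Springfield')")
    conn.executemany(
        "INSERT INTO actor VALUES (?, ?, ?)",
        [(1, "Bob Example", "1970-01-01"), (2, "Alice Example", "1980-02-02")],
    )
    conn.execute(
        "INSERT INTO movie VALUES (10, 'Example Movie', 'A placeholder plot.', 1)"
    )
    conn.executemany("INSERT INTO movie_actor VALUES (?, ?)", [(10, 1), (10, 2)])
    return build_movie_document(conn, 10)
```

Calling `demo()` returns one dict with the five properties listed above, with the studio and actor names copied in from the related tables.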

Questions to answer:

1. Should Elasticsearch documents keep the IDs of RDBMS entries?
2. Should I keep (and thus duplicate) movie title and description in the RDBMS?
3. If #1 == true, should I also keep studio and actor IDs?

Solution

Pick a system of record.

What happens when these two databases disagree? Which one is right, and which one needs correcting?

If you pick:

- the document database as the system of record, then the RDBMS should track the document IDs.
- the RDBMS as the system of record, then the document database should track the RDBMS entity IDs.

It's possible that one datastore will be the system of record for some bounded contexts, and the other datastore the system of record for others.

Please don't say that both are the system of record for a given set of data. That path is paved in blood.

Whichever you pick, the other is there to provide optimisations: it is faster or more convenient than the system of record for providing feature/property X.

This means a measure of duplication, and desynchronisation. You will need to consider how you are going to verify and maintain the integrity of the data.
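One way to verify integrity is a periodic reconciliation pass that treats the system of record as correct and reports where the copy differs. The sketch below is only an illustration: `reconcile` is a hypothetical helper, both arguments are assumed to be plain dicts keyed by movie ID, and how those dicts are loaded from each datastore is left out.

```python
def reconcile(record_docs, index_docs):
    """Compare documents derived from the system of record with the copy.

    Both arguments map movie_id -> document dict. The system of record
    is assumed to be right; the report lists what the copy must fix.
    """
    missing = sorted(record_docs.keys() - index_docs.keys())
    orphaned = sorted(index_docs.keys() - record_docs.keys())
    stale = sorted(
        movie_id
        for movie_id in record_docs.keys() & index_docs.keys()
        if record_docs[movie_id] != index_docs[movie_id]
    )
    return {"missing": missing, "stale": stale, "orphaned": orphaned}


# Illustrative data only.
record_docs = {
    1: {"title": "Example Movie"},
    2: {"title": "Second Example"},
    3: {"title": "Third Example"},
}
index_docs = {
    1: {"title": "Example Movie"},
    2: {"title": "Secnd Example"},    # differs from the system of record
    4: {"title": "Deleted Example"},  # no longer in the system of record
}
report = reconcile(record_docs, index_docs)
```

Entries under `missing` and `stale` would be re-sent to the copy and entries under `orphaned` deleted from it. The direction of repair follows from which datastore was picked as the system of record.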

Also, the optimisation data-store will likely lag behind the system of record and be slightly out of date. The lag might be as little as milliseconds, but it could be much longer. Is the staleness going to affect your business processes?

You should engineer the data structure to align with how you are using it.

If you are searching through Elasticsearch, storing the RDBMS row ID in the document can make sense. With that ID:

- RDBMS lookups can be performed more efficiently.
- Business logic that performs joins or record filtering client-side can do so with less margin for error.
- The ID can be displayed to customers, or written to error logs.

In these cases the data is by definition useful. It might even be facilitating the robustness or speed of the system.
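As an illustration of the first point, the IDs returned by a text search can drive a single primary-key lookup in the RDBMS. This is a sketch under assumptions: SQLite stands in for the RDBMS, `fetch_movies_in_hit_order` is a hypothetical helper, and `hit_ids` is assumed to be the list of `movie_id` values taken from the search hits in relevance order.

```python
import sqlite3


def fetch_movies_in_hit_order(conn, hit_ids):
    """Fetch movie rows by primary key, keeping the search ranking order."""
    if not hit_ids:
        return []
    placeholders = ",".join("?" for _ in hit_ids)
    rows = conn.execute(
        f"SELECT id, title, studio_id FROM movie WHERE id IN ({placeholders})",
        list(hit_ids),
    ).fetchall()
    by_id = {row[0]: row for row in rows}
    # Skip IDs the RDBMS no longer has: the search copy may be stale.
    return [by_id[movie_id] for movie_id in hit_ids if movie_id in by_id]


def demo():
    """Run the lookup against placeholder rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT, studio_id INTEGER)"
    )
    conn.executemany(
        "INSERT INTO movie VALUES (?, ?, ?)",
        [(1, "First Example", 7), (2, "Second Example", 7), (3, "Third Example", 8)],
    )
    # 99 stands for a hit whose row has since been deleted from the RDBMS.
    return fetch_movies_in_hit_order(conn, [3, 1, 99])
```

Here `demo()` returns the rows for movies 3 and 1, in that order, and silently drops the hit that no longer exists in the RDBMS.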

Conversely, if you put the ID in there on the grounds that one day it might be useful, it is not just useless: it actively slows down the system.

- Every document/record has to allocate extra storage, along with indexes, which reduces overall query performance.
- Data transfer has to move extra data. This reduces the amount of bandwidth available for useful data.
- This data has to be stored and manipulated in the memory of the client (assuming it's retrieved). This reduces the computational resources available for actually useful work, and might even make that work harder once you start hitting capacity limits.

If you are not sure which way to go, not writing code for it is usually the better option.

A part that isn't there is not going to throw exceptions, or misbehave, or cause you to work around it later.

The only exception would be if the part would be too expensive to add later. In which case: do you really, really need it? Ask that question twice.

Licensed under: CC-BY-SA with attribution