Question

I have recently read this excellent article on the microservice architecture: http://www.infoq.com/articles/microservices-intro

It states that when you load a web page on Amazon, 100+ microservices cooperate to serve that page.

That article describes that all communication between microservices can only go through an API. My question is: why is it so bad to say that all database writes can only go through an API, while you are free to read directly from the databases of the various microservices? One could, for example, say that only a few database views are accessible outside a microservice, so that the team maintaining it knows that as long as they keep these views intact, they can change the database structure of their microservice as much as they want.

Am I missing something here? Is there some other reason why data should only be read via an API?

Needless to say, my company is significantly smaller than Amazon (and always will be) and the maximum number of users we can ever have is about 5 million.


Solution

Databases are not very good at information hiding, which is quite plausible, because their job is to actually expose information. But this makes them a lousy tool when it comes to encapsulation. Why do you want encapsulation?

Scenario: you tie a couple of components to an RDBMS directly, and you see one particular component becoming a performance bottleneck, for which you might want to denormalize the database, but you can't because all other components would be affected. You may even realize that you'd be better off with a document store or a graph database than with an RDBMS. If the data is encapsulated by a small API, you have a realistic chance to reimplement said API any way you need. You can transparently insert cache layers and whatnot.

Fumbling with the storage layer directly from the application layer is the diametrical opposite of what the dependency inversion principle suggests.
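To make that concrete, here is a minimal sketch in Python, with SQLite standing in for the RDBMS and illustrative names (ProductRepository and friends) that are not from the article: callers depend only on a small abstraction, so the storage behind it can be reimplemented, or a cache layer inserted, without touching them.

```python
# A minimal sketch of the encapsulation described above, assuming a hypothetical
# "product" microservice; the names (ProductRepository, etc.) are illustrative.
from abc import ABC, abstractmethod
from typing import Optional
import sqlite3


class ProductRepository(ABC):
    """The small API the rest of the code depends on (dependency inversion):
    callers see this abstraction, never the storage technology."""

    @abstractmethod
    def find_name(self, product_id: int) -> Optional[str]: ...


class SqlProductRepository(ProductRepository):
    """One possible implementation; it could be swapped for a document store
    or a graph database without touching any caller."""

    def __init__(self, conn: sqlite3.Connection):
        self.conn = conn

    def find_name(self, product_id: int) -> Optional[str]:
        row = self.conn.execute(
            "SELECT name FROM products WHERE id = ?", (product_id,)
        ).fetchone()
        return row[0] if row else None


class CachingProductRepository(ProductRepository):
    """A cache layer inserted transparently behind the same interface."""

    def __init__(self, inner: ProductRepository):
        self.inner = inner
        self.cache = {}

    def find_name(self, product_id: int) -> Optional[str]:
        if product_id not in self.cache:
            self.cache[product_id] = self.inner.find_name(product_id)
        return self.cache[product_id]


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT)")
    conn.execute("INSERT INTO products VALUES (1, 'widget')")

    repo: ProductRepository = CachingProductRepository(SqlProductRepository(conn))
    print(repo.find_name(1))  # callers never see the schema or the cache
```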

OTHER TIPS

What is more important and significant about a microservice: its API or its database schema? The API, because that is its contract with the rest of the world. The database schema is simply a convenient way of storing the data managed by the service, hopefully organised in a way that optimises the microservice's performance. The development team should be free to reorganise that schema - or switch to an entirely different datastore solution - at any time. The rest of the world should not care. The rest of the world cares when the API changes, because the API is the contract.

Now, if you go peeking into their database

  • You add an unwanted dependency on their schema. They cannot change it without having an impact on your service.
  • You add unwanted and unpredictable load to their internals.
  • The performance of your own service will be affected by the performance of their database (they will be optimising their service to perform well for its clients, and their database to perform well only for their own service).
  • You are tying your implementation to a schema which may well not accurately and distinctively represent the resources in their data store - it may have extra details which are only needed to track internal state or satisfy their particular implementation (which you should not care about).
  • You may unwittingly destroy or corrupt the state of their service (and they will not know you are doing this)
  • You may update/delete/remove resources from their database without them knowing this has happened.

The last two points may not happen if you are only granted read access, but the other points are more than a good enough reason. Shared databases are a bad thing.

It is common for less experienced developers (or those who do not learn) to see the database as more important than the service, to see the database as the real thing and the service just a way of getting to it. That is the wrong way round.

Microservice Architecture is hard to describe, but the best way to think about it is as a marriage between Component Oriented Architecture and Service Oriented Architecture. The software suite is composed of many small business components, each with a very specific business-domain responsibility. Their interface to the outside world, whether in provided services or required services, is an API of clearly defined services.

Writing to, and even reading from, a database that is outside of your component's business domain is against this style of architecture.

The primary reason for this is that an API provided through a service by another software component comes with the reasonable expectation that the API will remain backwards compatible as new releases of the providing component become available. If I am the developer of a "providing" component, then I only have to worry about backwards compatibility of my API. If I know that there are three other development teams that wrote custom queries directly against my database, then my job has become much more complicated.

Even worse, maybe the other team that wrote those queries is mid-sprint in a critical project and cannot accept this change from your component right now. Now software development for your component, on a business domain that you own, is being driven by development on another business domain.
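On the backwards-compatibility point, a small sketch of why the API contract is the easier thing to keep stable (the payload and field names here are hypothetical): a tolerant consumer reads only the fields it needs, so the providing team can add to the response freely.

```python
# A sketch of why the API contract is the easier thing to keep backwards
# compatible; the JSON payload and field names here are hypothetical.
import json

def parse_order(payload: str) -> dict:
    """A tolerant consumer: it reads only the fields it needs and ignores
    the rest, so the providing team can add fields in new releases."""
    doc = json.loads(payload)
    return {"id": doc["id"], "total": doc["total"]}

# Release 1 of the providing component's response.
v1 = '{"id": 42, "total": 99.5}'
# A later release adds a field; existing consumers are unaffected.
v2 = '{"id": 42, "total": 99.5, "currency": "EUR"}'

print(parse_order(v1))
print(parse_order(v2))  # same result, no change needed in the consumer
```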

Full interaction through services reduces coupling between the various software components, so situations like this do not occur so frequently. When it comes to other components using a view in the database, you have more capability to keep the view backwards compatible if anybody else wrote queries against it. I still feel, however, that this should be the exception, and should only be done for something like reporting or batch processing where an application needs to read in enormous amounts of data.
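If a view is used as that exceptional read-only contract, here is a small sketch (SQLite, with a hypothetical customers table) of how the owning team can still restructure its internal schema as long as it keeps the view intact:

```python
# A sketch of the "view as the exception" idea: SQLite, with a hypothetical
# customers table. The view is the only thing outside readers may touch.
import sqlite3

conn = sqlite3.connect(":memory:")

# Original internal schema, plus a view exposed as the read-only contract.
conn.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, full_name TEXT)")
conn.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")
conn.execute("CREATE VIEW customer_report AS SELECT id, full_name FROM customers")

# The owning team later restructures its internal tables...
conn.executescript("""
    DROP VIEW customer_report;
    ALTER TABLE customers RENAME TO customers_old;
    CREATE TABLE customers (id INTEGER PRIMARY KEY, first_name TEXT, last_name TEXT);
    -- hard-coded split, just to keep this one-row sketch short
    INSERT INTO customers SELECT id, 'Ada', 'Lovelace' FROM customers_old;
    DROP TABLE customers_old;
    -- ...but recreates the view, so outside readers still see the old shape.
    CREATE VIEW customer_report AS
        SELECT id, first_name || ' ' || last_name AS full_name FROM customers;
""")

# A reporting/batch query written against the view keeps working unchanged.
print(conn.execute("SELECT full_name FROM customer_report WHERE id = 1").fetchone())
```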

Clearly this works well in large distributed organisations where development teams are separated out by business domain, as at Amazon. If you are a small development shop you can still benefit from this model, especially if you need to ramp up for a big project quickly, but also if you have to deal with vendor software.

Over the last 20 years I've seen a few large modular database designs and I've seen the scenario suggested by David quite a few times now where applications have write access to their own schema/set of tables and read access to another schema/set of tables. Most often this data that an application/module gets read-only access to could be described as "master data".

In that time I have not seen the problems the prior answers suggest I should have seen, so I think it is worth taking a closer look at the points raised in those answers.

Scenario: you tie a couple of components to an RDBMS directly, and you see one particular component becoming a performance bottleneck

I agree with this comment, except that it is also an argument for having a copy of the data locally for the microservice to read. That is, most mature databases support replication, so without any developer effort the "master data" can be physically replicated to the microservice's database if that is desired or needed.

Some might recognise this, in an older guise, as an "Enterprise database" replicating core tables to a "Departmental database". The point here is that generally it is good if a database does this for us with built-in replication of changed data (deltas only, in binary form and at minimal cost to the source database).

Conversely, when our database choices do not offer this off-the-shelf replication support, we can get into a situation where we want to push "master data" out to the microservice databases ourselves, which can require a significant amount of developer effort and also be a substantially less efficient mechanism.

might want to denormalize the database, but you can't because all other components would be affected

To me this statement is just not correct. Denormalisation is an "additive" change and not a "breaking change" and no application should break due to denormalisation.

The only way this breaks an application is where application code uses something like "select * ..." and does not handle an extra column. To me that would be a bug in the application.

How can denormalisation break an application? Sounds like FUD to me.
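As a small illustration of that point, here is a sketch (SQLite, hypothetical orders table) showing that adding a denormalised column only troubles code that makes positional assumptions about the result set:

```python
# A sketch of the "select *" point: an additive change (a new, denormalised
# column) only troubles code that makes positional assumptions. SQLite,
# hypothetical orders table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("INSERT INTO orders VALUES (1, 9.99)")

def fragile_reader():
    # Relies on SELECT * returning exactly two columns, in this order.
    order_id, total = conn.execute("SELECT * FROM orders WHERE id = 1").fetchone()
    return total

def robust_reader():
    # Names its columns, so columns added later are irrelevant to it.
    (total,) = conn.execute("SELECT total FROM orders WHERE id = 1").fetchone()
    return total

print(fragile_reader(), robust_reader())  # both fine before the change

# Additive schema change: a denormalised column appears.
conn.execute("ALTER TABLE orders ADD COLUMN customer_name TEXT")

print(robust_reader())       # still fine
try:
    print(fragile_reader())  # breaks: too many values to unpack
except ValueError as e:
    print("fragile reader broke:", e)
```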

Schema dependency:

Yes, the application now has a dependency on the database schema, and the implication is that this ought to be a major problem. While adding any extra dependency is obviously not ideal, my experience is that a dependency on the database schema has not been a problem. So why might that be the case? Have I just been lucky?

Master data

The schema that we typically might want a microservice to have read-only access to is most commonly what I'd describe as "master data" for the enterprise. It has the core data that is essential to the enterprise.

Historically this means the schema we add the dependency on is both mature and stable (somewhat fundamental to the enterprise and unchanging).

Normalisation

If 3 database designers go and design a normalised DB schema, they'll end up with the same design. OK, there might be some 4NF/5NF variation, but not much. What's more, there is a series of questions the designer can ask to validate the model, so the designer can be confident that they got to 4NF (am I too optimistic? Are people struggling to get to 4NF?).

Update: by 4NF here I mean that all tables in the schema were normalised appropriately, up to their highest normal form at 4NF.

I believe the normalisation design process is why database designers are generally comfortable with the idea of depending on a normalised database schema.

The process of normalisation gets the DB design to a known "correct" design and the variations from there ought to be denormalisation for performance.

  1. There can be variations based on DB types supported (JSON, ARRAY, Geo type support etc)
  2. Some might argue for variation based on 4NF/5NF
  3. We exclude physical variation (because that doesn't matter)
  4. We restrict this to OLTP design and not DW design because those are the schemas we want to grant read-only access to

If 3 programmers were given a design to implement (as code), the expectation would be 3 different implementations (potentially very different).

To me there is potentially a question of "faith in normalisation".

Breaking schema changes?

Denormalisation, adding columns, altering columns for bigger storage, extending the design with new tables, etc. are all non-breaking changes, and DB designers who got to 4th normal form will be confident of that.

Breaking changes are obviously possible by dropping columns/tables or making a breaking type change. Possible yes, but in practical terms I've not experienced any problems here at all. Perhaps because it is understood what breaking changes are and these have been well managed?

I'd be interested to hear of cases of breaking schema changes in the context of shared read-only schemas.

What is more important and significant about a microservice: its API or its database schema? The API, because that is its contract with the rest of the world.

While I agree with this statement, I think there is an important caveat that we might hear from an Enterprise Architect, which is that "data lives forever". That is, while the API might be the most important thing, the data is also rather important to the enterprise as a whole, and it will stay important for a very long time.

For example, once there is a requirement to populate the Data Warehouse for business intelligence, the schema and change data capture (CDC) support become important from the business reporting perspective, irrespective of the API.

Issues with APIs?

Now, if APIs were perfect and easy, all these points would be moot, as we'd always choose an API rather than local read-only access. So the motivation for even considering local read-only access is that there might be some problems with using APIs that local access avoids.

What motivates people to desire local read-only access?

API optimisation:

LinkedIn have an interesting presentation (from 2009) on the issue of optimising their API and why it is important to them at their scale. http://www.slideshare.net/linkedin/building-consistent-restful-apis-in-a-highperformance-environment

In short, once an API has to support many different use cases it can easily get into the situation where it supports one use case optimally and the rest rather poorly from a network perspective and database perspective.

If the API does not have the same sophistication as LinkedIn's, then you can easily get into scenarios where:

  • The API fetches much more data than you need (wasteful)
  • Chatty APIs, where you have to call the API many times

Yes, we can of course add caching to APIs, but ultimately an API call is a remote call, and there is a whole series of optimisations available to developers when the data is local.
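As a rough illustration of the wasteful and chatty patterns above, here is a sketch with a stubbed client standing in for a remote API (the order data and endpoints are hypothetical), counting round trips for a per-item fetch versus a purpose-built batch call:

```python
# A sketch of the chatty-vs-wasteful trade-off, with a stubbed remote API so
# the round-trip counts are visible; the endpoints and payloads are hypothetical.
class FakeOrderApi:
    """Stand-in for a remote service; counts round trips."""

    def __init__(self):
        self.calls = 0
        self._orders = {i: {"id": i, "total": i * 10.0, "lines": ["..."] * 50}
                        for i in range(1, 101)}

    def get_order(self, order_id):
        self.calls += 1
        return self._orders[order_id]  # full payload, even if we only need totals

    def get_order_totals(self, order_ids):
        self.calls += 1
        return {i: self._orders[i]["total"] for i in order_ids}  # batch endpoint


api = FakeOrderApi()

# Chatty: one remote call per order, and each response is far bigger than needed.
chatty_total = sum(api.get_order(i)["total"] for i in range(1, 101))
print("chatty:", api.calls, "calls")   # 100 round trips

api.calls = 0
# Batched: a single call returning only what this use case needs.
batched_total = sum(api.get_order_totals(list(range(1, 101))).values())
print("batched:", api.calls, "call")   # 1 round trip
assert chatty_total == batched_total
```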

I suspect there is a set of people out there who might add it up as:

  • Low cost replication of master data to microservice database (at no development cost and technically efficient)
  • Faith in Normalisation and the resilience of applications to schema changes
  • Ability to easily optimise every use case and potentially avoid chatty/wasteful/inefficient remote API calls
  • Plus some other benefits in terms of constraints and coherent design

This answer has got way too long. Apologies!!

State management (potentially a database) can be deployed in the microservice's container and exposed via an API. A microservice's database is not visible to other systems outside the container - only the API is. Alternatively, you could have another service (e.g. a cache) manage state via an API. Having all the microservice's dependencies (other than API calls to other services) within a single deployable container is a key distinction of the architecture. If one does not get that, go back and study the architecture.
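As a minimal sketch of that point (Python with Flask and an in-memory SQLite store; the service and endpoint names are illustrative), the state lives entirely inside the process/container and only the HTTP API is reachable from outside:

```python
# A minimal sketch of "state stays inside the container": Flask plus an
# in-memory SQLite store. The service name and endpoints are illustrative.
import sqlite3
from flask import Flask, jsonify

app = Flask(__name__)

# Internal state: nothing outside this process/container can reach this database.
db = sqlite3.connect(":memory:", check_same_thread=False)
db.execute("CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT)")
db.execute("INSERT INTO customers VALUES (1, 'Ada Lovelace')")

@app.route("/customers/<int:customer_id>")
def get_customer(customer_id: int):
    # The API is the only contract; the schema behind it can change freely.
    row = db.execute(
        "SELECT id, name FROM customers WHERE id = ?", (customer_id,)
    ).fetchone()
    if row is None:
        return jsonify(error="not found"), 404
    return jsonify(id=row[0], name=row[1])

if __name__ == "__main__":
    # Other services call http://localhost:5000/customers/1 - never the database.
    app.run(port=5000)
```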

Licensed under: CC-BY-SA with attribution