Question

I am designing a multi-tenant application which will use the same database schema for a separate DB instance per-tenant (i.e. every time I get a new customer, I create a new database for them). This approach is desirable for its simplicity in that I won't need to filter data per-tenant -> I can just select every User in dbo.Users, for example - no need to filter by CustomerId / no danger that I forget to and accidentally expose the wrong data to the wrong customer.

What I am wondering is how to handle data which is common to all databases. Postcodes (Zipcodes) would be a great example. It looks like I either need to replicate this data in each database (which will be a nightmare to maintain and will mean I am storing loads of duplicated data) OR I need to use some common database which then prevents me from joining tables in the natural way, or using e.g. EntityFramework out of the box. Neither of these sound right/better than the other.

Does anyone have a good strategy for this?


Solution

I need to use some common database which then prevents me from joining tables in the natural way, or using e.g. EntityFramework out of the box.

You can use EntityFramework.DynamicFilters for this. It allows you to put a dynamic filter on your model that will be applied to all queries (both direct queries and loading related entities).

For example, I use DynamicFilters to filter soft-deleted items. Items with DeletedOn != null will be hidden from sight.

modelBuilder.Filter("IsDeleted", (BaseEntity d) => d.DeletedOn, null);

You could use a similar approach, something like:

modelBuilder.Filter("CustomerId", (User u) => u.CustomerId, GetCurrentCustomerId());

This largely removes your stated reason for multi-tenancy, namely using it purely to separate customer data.


However, there are other considerations that can lead you to want to use a multi-tenant platform.

I am designing a multi-tenant application which will use the same database schema for a separate DB instance per-tenant (i.e. every time I get a new customer, I create a new database for them).

  • What happens when you have existing customers and you then wish to upgrade your database schema? Are you enforcing upgrades across the board, or are you going to allow customers to upgrade when/if they want to?

  • What happens when one customer requires a restore of their data? Do you want to be able to only restore that customer's data? (I assume so - I just wanted to point this out).

Both cases can be strong points for using multi-tenancy: customers can upgrade at their own pace, and can receive data restores without affecting other customers.

However, then we run into another issue: common data between different application versions. If the common data changes during an upgrade, you're going to run into trouble.

I know your example of ZIP codes is less applicable here, as ZIP codes aren't prone to changing between versions; but the general point still stands: some common data may indeed change between versions.

There are two solutions here; I will discuss both briefly.


1. Hosting the common data

You can keep a centralized database of common data, but I suggest hiding this behind a service. This gives you the possibility of easily returning versioned common data. Every tenant will tell you what their current version is (e.g. v1.3) and your service then ensures that it returns the common data for version 1.3.

This gives you the separation you need, but there are some issues: creating and maintaining the web service adds overhead, and it is effectively an external dependency that you're always going to have to rely on.

I prefer this approach for common data which is not version-specific and instead considered "globally correct" (such as a list of ZIP codes).
However, the data needs to be reasonably sized to warrant putting it into a centralized repository. If it's 5 rows, the overhead of creating the service far outweighs the footprint of copying those 5 rows into every tenant database.
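To make the versioning idea concrete, here is a minimal sketch of such a service in C#. Every name here (ICommonDataService, VersionedCommonDataService, GetZipCodes) is hypothetical, and the in-memory dictionary stands in for whatever store actually backs the service; the point is only that each tenant's reported schema version selects a matching snapshot of the common data.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch: a common-data service that returns the dataset
// matching the schema version a tenant reports. Names are illustrative.
public interface ICommonDataService
{
    IReadOnlyList<string> GetZipCodes(string tenantSchemaVersion);
}

public class VersionedCommonDataService : ICommonDataService
{
    // One snapshot of the common data per published schema version.
    private readonly SortedDictionary<string, List<string>> _snapshots =
        new SortedDictionary<string, List<string>>(StringComparer.Ordinal)
        {
            ["v1.2"] = new List<string> { "90210", "10001" },
            ["v1.3"] = new List<string> { "90210", "10001", "60601" },
        };

    public IReadOnlyList<string> GetZipCodes(string tenantSchemaVersion)
    {
        if (_snapshots.TryGetValue(tenantSchemaVersion, out var data))
            return data;
        throw new ArgumentException(
            $"No common data published for version {tenantSchemaVersion}");
    }
}
```

A tenant on v1.2 keeps receiving the v1.2 snapshot until it upgrades, which is exactly the decoupling this option buys you.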


2. Loading the common data into the tenant

I prefer this approach for common data which is version-specific, unless the data is so large that it becomes a problem to include it in every tenant separately.

In short, you can achieve this using (for example) database seeding in EF, which allows you to update the common data at the same time you upgrade the database schema to a newer version.

There are many ways to achieve this. I like database seeding as it ties nicely into the schema upgrade process.
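The essence of seeding is an upsert: insert a common-data row if its key is new, overwrite it if it already exists, so re-running the seed after a schema upgrade converges on the new version's data. A minimal sketch of that semantics, using a plain dictionary in place of an EF DbSet (all names illustrative):

```csharp
using System;
using System.Collections.Generic;

// Sketch of seed-style upsert semantics: add the row if its key is new,
// overwrite it if it already exists. The "table" is modeled as a
// dictionary keyed by ZIP code purely for illustration.
public static class CommonDataSeeder
{
    public static void Seed(IDictionary<string, string> zipCodes,
                            params (string Code, string City)[] rows)
    {
        foreach (var (code, city) in rows)
            zipCodes[code] = city; // add-or-update, idempotent on re-run
    }
}
```

Because the seed is idempotent, it can safely run as part of every schema upgrade, which is what ties the common-data refresh to the version bump.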


I understand why you want to centralize data - it's shared data, right? But there is a line of reasonability here; not every abstraction is necessary.

As an oversimplified example, consider the idea that when your entities all have audit fields (CreatedOn, ModifiedOn, ...), you tend to abstract this into an IAuditable or AuditableEntity. That is good practice.
However, when you have three entities (Person, Country and StuffedAnimal) which all have a Name property, that doesn't mean you should abstract this into an INamed or NamedEntity. This is no longer a reasonable argument.

The same is happening here. When you apply the theory to the letter, shared data should be abstracted into a centralized point. However, by that same logic you also shouldn't be using multi-tenancy, because your tenants share a database schema, right?

You shouldn't apply the theory to the letter here; consider the practical application instead. A tenant is created specifically to run independently from the other tenants. There are several benefits to doing so, but "pure abstraction" isn't one of them. If anything, multi-tenancy is deliberately refusing to abstract or share resources, specifically so you can prevent an issue from becoming a global issue for all your customers.

If you need updates to your common data to be propagated to all tenants at all times, then option 1 is better.

If you want the ability to version your common data per tenant, then you are better off keeping the data locally inside the tenants themselves.

Other tips

... every time I get a new customer, I create a new database for them). This approach is desirable for its simplicity in that I won't need to filter data per-tenant ... no danger that I forget to and accidentally expose the wrong data to the wrong customer.

Really, what I'd recommend is reconsidering this approach. Separating tenants by schema does provide isolation and most RDBMSes can handle it, but it creates other headaches in the process. There's a lot of economy of scale in space and performance with large databases that get wiped away when the same thing is rubber stamped many times across schemas. There are also logistical challenges, such as how to handle backups for a large number of databases or how you roll back modifications to 10,000 tenant databases when the 5,758th fails.

If you're using a database with row-level security (RLS), the multi-tenancy problem becomes very easy to solve while keeping all of your tenant and common data under one roof. RLS lets you set per-table policies that determine whether or not a row is included in queries that affect it. For SELECT, UPDATE and DELETE queries, this effectively forces an additional constraint into the WHERE clause that makes the database behave as if rows not fitting the policy don't exist. Something similar is done on INSERT, usually rejecting queries that would add rows not matching the policy.

On databases without RLS, you can deny tenants permission on the tables holding their data and force them to access it through views. Modern RDBMSes have views that are updatable, which means they can be made to behave as if they were tables.

There are other ways to go about it, such as requiring clients to access the database through stored procedures, but that merely relocates the problem of forgetting to filter out other tenants' data.

If none of these alternatives work and you still want to do separate schemas, see if your database supports the notion of a foreign data source. This will allow external tables to be treated as if they're local and enables all of the referential integrity features you're trying to find for your common data.

I would use a shared database for common data such as ZIP codes. You won't be able to join tables across databases using Entity Framework, but that may not have been an option anyway (due to performance issues) if your DbContext classes are large. You can still join on-the-fly using LINQ.

I would probably wrap the shared data in some kind of SharedDataService and use that to retrieve it. That service could use a low-latency, long-life cache to make it fast.
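A minimal sketch of such a wrapper, under stated assumptions: SharedDataService and its fetch delegate are hypothetical names, and the delegate stands in for the actual query against the shared database. The cache is a long-lived in-process memoization, so each dataset is fetched at most once:

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;

// Hypothetical SharedDataService: wraps the shared database behind one
// entry point and memoizes each lookup in a long-lived in-process cache.
// The fetch delegate stands in for the real cross-database query.
public class SharedDataService
{
    private readonly ConcurrentDictionary<string, Lazy<IReadOnlyList<string>>> _cache =
        new ConcurrentDictionary<string, Lazy<IReadOnlyList<string>>>();
    private readonly Func<string, IReadOnlyList<string>> _fetchFromSharedDb;

    public SharedDataService(Func<string, IReadOnlyList<string>> fetchFromSharedDb)
        => _fetchFromSharedDb = fetchFromSharedDb;

    public IReadOnlyList<string> Get(string dataSet) =>
        _cache.GetOrAdd(dataSet,
            key => new Lazy<IReadOnlyList<string>>(() => _fetchFromSharedDb(key))).Value;
}
```

Wrapping the fetch in Lazy<T> ensures the shared database is hit only once per dataset even under concurrent first requests; rarely-changing data like ZIP codes tolerates a very long cache lifetime.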

Being able to keep the shared data in one place will make your life easier if you ever need to update it. ZIP codes rarely change, but other types may change more frequently (language, ethnicity, SOC, NAICS, ICD, etc.).

Not being able to use Entity Framework properly should not factor into your decision. If you are properly abstracting away data access with repositories (not optional for something as complicated as a multi-tenant application), having a few entities that are not using Entity Framework should not be noticeable. Also, consider using dapper.net: it is used by this site, is easy to use, performant, and will not have problems with either of these approaches.

If I had to choose between a separate database and duplicating the common data in every database, I would go with a separate database for the common data. Fortunately, you do not have to choose between these two approaches.

Another option is to have a separate database which serves as the source of truth, then push its data to individual customer databases. The common records stored in the customer databases are treated as a persistent cache. All writes of common data go to the dedicated common database first, and then are pushed to the customer DBs. In my mind, this is the preferred approach.

|C1| <-- |S| --> |C2|
C = Customer DB, S = Shared DB

Leverage a queuing service such as RabbitMQ or SQS to schedule pushes when common data is added or edited.
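The write path above can be sketched as follows. This is a hand-rolled illustration, not RabbitMQ/SQS usage: the queue is an in-memory list, CommonDataPublisher and its members are hypothetical names, and the shared database is a dictionary.

```csharp
using System;
using System.Collections.Generic;

// Sketch of the source-of-truth write path: the shared store is updated
// first, then an update message is enqueued for every tenant database.
// The queue is an in-memory list standing in for RabbitMQ/SQS; all
// names are illustrative.
public class CommonDataPublisher
{
    public Dictionary<string, string> SharedDb { get; } =
        new Dictionary<string, string>();
    public List<(string Tenant, string Key, string Value)> Queue { get; } =
        new List<(string, string, string)>();
    private readonly IEnumerable<string> _tenants;

    public CommonDataPublisher(IEnumerable<string> tenants) => _tenants = tenants;

    public void Write(string key, string value)
    {
        SharedDb[key] = value;            // 1. source of truth updated first
        foreach (var tenant in _tenants)  // 2. then fan out to each tenant DB
            Queue.Add((tenant, key, value));
    }
}
```

Consumers on the tenant side would drain their messages and upsert the rows into the local cache copy, which keeps every customer DB eventually consistent with the shared database.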

The thing is, you will need a distinct common database anyway for things such as customer records, so you might as well leverage it for common data. Also, it's very helpful to have a well defined source of truth.

There is nothing wrong with having multiple copies of the same data; the real problems happen when you do not have a single defined source of truth for a piece of data.

Licensed under: CC-BY-SA with attribution