Question

I work for a franchise and am beginning the complete redesign of the application that each franchise uses to manage its daily operations. It functions as a point-of-sale and scheduling application. Currently, each franchise has its own local SQL Express database; the desktop application runs in their facility and reads from and writes to this database. On the corporate end, we need aggregate information about all of the franchises, so each time the application saves a local record, it sends a copy to the corporate server, which inserts the data from all franchises into a shared database with an identical schema. Corporate can then run reports on the shared database containing records from all franchises. Why is it designed like this? At the time of its design (over a decade ago), stakeholders felt it was important that franchises be able to access their data without an internet connection.

Well times have changed and all involved are finally willing to move to a single centralized database with a web app front end to support all franchises instead of a desktop app. So in technical terms, we are designing a multi-tenant app with a shared database. Like I said, we actually already have the database on the corporate end, but it's not supporting the moment-to-moment in-franchise reads and writes.

To give you an idea of the amount of data we're dealing with: there are currently a few hundred franchises, 3 million customer accounts (of which a much smaller portion are actually active customers), and 8 million purchases. One of the largest tables contains weekly customer calendar entries and has over 91 million rows. Again, each of these records is specific to a single franchise, but they will be stored together; e.g., tbl_Customers has 3 million customers, each with an fk_FranchiseID pointing back to tbl_Franchises.

The database will need to support the daily franchise operations (a web app, most likely using an Entity Framework-based data layer), reporting for both individual franchises and aggregate corporate reports, and some customer-facing website functions such as displaying schedules and customer account information. I think aggregate reporting, cost, and the ease of adding new franchises are the factors driving the idea of a shared database. Additionally, the current plan is to have a load balancer with two web servers and two database servers.

With the caveat that there is a lot I still don't understand and am still learning about SQL Server, my biggest concerns are:

  • Partitioning franchise data. The franchise web app will need to retrieve records only for its current user's franchise. In the current database, some tables are 5-10 joins away from tbl_Franchises. Having to do all of these joins just to filter on the franchise ID, when otherwise no joins might have been required, seems like it could really harm performance. Queries need to run at least as fast as they do today against the local private databases, so the cost of filtering by franchise needs to be negligible. As wrong as it seems, would it be better to sacrifice normalization and include the franchise ID directly in some or all of these child tables? Or are the joins negligible if the keys are indexed and the query doesn't use any other columns from the joined tables?
  • Privacy. We need to make it impossible for a user or a developer to intentionally or accidentally pull up data that belongs to the wrong franchise. SQL Server 2016 row level security seems to be an option for this. This, again, leads me to point 1, though. A whole lot of joins will be required to associate the rows with their franchise record in the predicate function.
  • Lock contention. If indexes let us safely put sifting through other franchises' records aside, the franchise management app should, for the most part, be doing short read and write operations. But suddenly having all users potentially accessing these tables at the same time has alarm bells sounding in my mind. Additionally, we need to continue offering reporting to the franchises for their individual data, and to corporate for the aggregate data. A few of these reports are computationally heavy and cannot necessarily accept dirty reads, so I'm worried that while a report is running, we might have a few hundred franchises unable to use the system.

So the overall question is: what strategies and techniques can be used to efficiently segregate tenant data in a shared database such as this?


Solution

The primary concerns of this design are security and size. But before I get there, I want to clear up a misunderstanding:

As wrong as it seems, would it be better to sacrifice normalization and include the franchise ID directly in some or all of these child tables?

I see why you might think this: if you consider franchise ID to be an attribute, then including it in every table violates third normal form.

But here's an alternative way to look at it: in the logical design the row key includes franchise ID: (FRANCHISE_ID, TABLE_ID).

But, you say, TABLE_ID is an identity column! To which I answer: yes, but that's a physical detail, not a logical detail. And logically, tables are allowed to have multiple "candidate" keys (I turn to C. J. Date as my authority for this statement).

And once you accept this logical design, you'll get a lot of physical benefits. First, you don't need joins to access data; while, logically, joins take no time, physically they do. Plus, if your queries tend to retrieve multiple rows for the same franchise, you can also benefit by using a clustered index to collocate rows.
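As a sketch of what this looks like in practice (table and column names here are hypothetical, not from your schema): carry the franchise ID into the child table and make it the leading column of the clustered index, so each franchise's rows are stored together on disk.

```sql
-- Hypothetical child table: FranchiseID is part of the logical key,
-- even though AppointmentID alone would be unique.
CREATE TABLE dbo.tbl_Appointments (
    FranchiseID   int           NOT NULL,
    AppointmentID int IDENTITY  NOT NULL,
    CustomerID    int           NOT NULL,
    StartsAt      datetime2     NOT NULL,
    CONSTRAINT PK_Appointments PRIMARY KEY NONCLUSTERED (AppointmentID),
    CONSTRAINT FK_Appointments_Franchise
        FOREIGN KEY (FranchiseID) REFERENCES dbo.tbl_Franchises (FranchiseID)
);

-- Clustering on (FranchiseID, AppointmentID) collocates one franchise's
-- rows, so "all appointments for franchise 42" is a single range scan.
CREATE UNIQUE CLUSTERED INDEX CX_Appointments
    ON dbo.tbl_Appointments (FranchiseID, AppointmentID);
```

Note the primary key stays on the identity column for foreign-key purposes; only the physical clustering changes.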

OK, now on to the main topics.

Security

From a corporate management perspective, this is probably your most important issue. Clearly, you can't allow one franchisee to see data that belongs to another. But there are many ways to accomplish this, imposing different levels of load on the system and its developers. I'm just going to throw out some ideas here for you to consider.

Predicate applied to individual queries

This is the simplest approach, but it puts the heaviest load on the developers. Every one of your protected queries will have to include a check against franchise ID. Forgetting even one could have economic consequences for your company (i.e., lawsuits).

However, I think you can probably overcome this with a combination of code review, static analysis, and integration testing. You need the discipline to ensure that all queries go through a data access layer that's rigorously verified.
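The pattern itself is trivial; the discipline is the hard part. Every query in the data access layer takes the caller's franchise ID as a parameter and filters on it (column names below follow your tbl_Customers example; the rest is illustrative):

```sql
-- @FranchiseID is supplied by the application from the authenticated
-- user's session -- never from user input.
SELECT c.CustomerID, c.FirstName, c.LastName
FROM   dbo.tbl_Customers AS c
WHERE  c.fk_FranchiseID = @FranchiseID;
```

A static-analysis rule as simple as "every SELECT against a tenant table must reference fk_FranchiseID" catches most omissions before code review does.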

Views

To ensure that all queries include a check for franchise ID, you can hide your tables behind views, and ensure that each view includes a franchise check. Each franchisee will have their own set of views, stored in a different schema.

An additional benefit of this approach is that you'll be able to expose data directly to the franchisees. It also allows your physical tables to change without affecting the exposed data.

However, there are several significant drawbacks. First, your developers will have to ensure that they use the correct set of views for each query (maybe not that bad, depending on how you manage connections). Second, you will have a long-term maintenance cost, as changes have to be propagated to all of schemas that hold a particular view (although this should be easily automatable).
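To make the idea concrete (schema and view names are invented for illustration), each franchisee gets a schema whose views bake in the franchise filter:

```sql
-- One schema per franchisee; the franchise ID is hard-coded into the view,
-- so nothing selected through this schema can leak another tenant's rows.
CREATE SCHEMA Franchise042;
GO
CREATE VIEW Franchise042.Customers
AS
SELECT CustomerID, FirstName, LastName     -- expose only permitted columns
FROM   dbo.tbl_Customers
WHERE  fk_FranchiseID = 42;
```

The per-schema view sets would be generated by a script rather than written by hand, which is what makes the maintenance cost bearable.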

Row-level security

SQL Server's row-level security (available since SQL Server 2016) works by attaching a security policy to a table; the policy invokes an inline table-valued predicate function that decides which rows are visible. The predicate can be based on the database user, in which case you'd need (at least) one user per franchisee, with all the connection management and user administration that implies. But the predicate can instead read a tenant ID that the application places in SESSION_CONTEXT, which lets a single shared application login serve all franchisees. Note that filter predicates silently hide non-matching rows rather than raising errors, so a query that "forgets" the franchise condition simply returns that franchise's rows and nothing else; separate block predicates can be added to reject writes against another franchise's rows.
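A minimal sketch of the SESSION_CONTEXT pattern (the Security schema, function name, and policy name are my own; the table and column follow your tbl_Customers example):

```sql
CREATE SCHEMA Security;
GO
-- Inline predicate: a row is visible only when its franchise ID matches
-- the tenant ID the application stored in SESSION_CONTEXT.
CREATE FUNCTION Security.fn_FranchiseFilter (@FranchiseID int)
RETURNS TABLE
WITH SCHEMABINDING
AS
RETURN SELECT 1 AS ok
       WHERE @FranchiseID = CAST(SESSION_CONTEXT(N'FranchiseID') AS int);
GO
-- FILTER hides other tenants' rows on read; BLOCK rejects inserts that
-- would create rows for another tenant.
CREATE SECURITY POLICY Security.FranchisePolicy
    ADD FILTER PREDICATE Security.fn_FranchiseFilter(fk_FranchiseID)
        ON dbo.tbl_Customers,
    ADD BLOCK  PREDICATE Security.fn_FranchiseFilter(fk_FranchiseID)
        ON dbo.tbl_Customers AFTER INSERT
WITH (STATE = ON);
GO
-- The web app sets the tenant once per request, right after opening
-- (or checking out) the connection:
EXEC sp_set_session_context @key = N'FranchiseID', @value = 42;
```

With connection pooling you must set the session context on every checkout, since a pooled connection may have served a different franchise on its previous request.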

All in all, this takes some up-front setup, but it is the one route that guarantees security at the database level regardless of what the application code does, so I'm guessing it's the way you'll go.

Size

From the development perspective this is going to be the bigger pain point -- especially when your users complain about slow response times.

Your overriding goal should be to touch as few data blocks as possible per query. Here are a few techniques that I've used successfully in the past:

Buy as much RAM as you can afford

Your goal should be to keep the entire database in memory. Really. It doesn't matter that SSDs are blindingly fast; they still require time to read and write data blocks.

In a perfect world, you would read the entire database into memory at startup, and the only IO would be writes.

Reduce the "active" size of tables

You mentioned one table with over 91 million rows. How much of this table is accessed by a typical query? Can you partition the table so that infrequently accessed data is stored elsewhere? SQL Server does support declarative table partitioning (in every edition since 2016 SP1; Enterprise only before that), and failing that you can manually move or duplicate rows into an archive table.

Large tables force queries to touch a lot of data even when you have indexes, because those indexes will also be large.
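For the calendar table, a sketch of declarative partitioning by date (function, scheme, table, and boundary values here are all illustrative): current entries stay in a small, hot partition while old years sit elsewhere.

```sql
-- RANGE RIGHT: each boundary value starts a new partition.
CREATE PARTITION FUNCTION pf_ByYear (date)
    AS RANGE RIGHT FOR VALUES ('2016-01-01', '2017-01-01', '2018-01-01');

-- ALL TO ([PRIMARY]) is the simplest mapping; old partitions could
-- instead be mapped to a filegroup on cheaper storage.
CREATE PARTITION SCHEME ps_ByYear
    AS PARTITION pf_ByYear ALL TO ([PRIMARY]);

CREATE TABLE dbo.tbl_CalendarEntries (
    FranchiseID int  NOT NULL,
    EntryID     int  NOT NULL,
    CustomerID  int  NOT NULL,
    EntryDate   date NOT NULL
) ON ps_ByYear (EntryDate);
```

Queries that filter on EntryDate then read only the relevant partitions (partition elimination), so the "active" working set is a fraction of the 91 million rows.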

Collocate data

By default, databases store rows wherever they can find the space. Which means that data that is typically accessed together, such as the transactions for a user, might be spread all over the disk.

However, you generally have some level of control over this, either with clustered indexes (mentioned above) or covering indexes. Leverage these to their fullest.
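A covering index is the other half of this: the key prefix groups one franchise's entries together, and INCLUDE carries the selected columns so the query never touches the base table (index and column names are illustrative).

```sql
-- Serves "SELECT CustomerID, EntryID FROM tbl_CalendarEntries
--         WHERE FranchiseID = @f AND EntryDate BETWEEN @a AND @b"
-- entirely from the index pages.
CREATE NONCLUSTERED INDEX IX_Calendar_Franchise_Date
    ON dbo.tbl_CalendarEntries (FranchiseID, EntryDate)
    INCLUDE (CustomerID, EntryID);
```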

Use read replicas

A typical application executes selects far more frequently than updates, and tends to select multiple rows while updates affect a single row.

By separating these two operations, you get a couple of benefits. First, you can scale capacity independently: if you have a lot of reads you can buy more or bigger machines. And second, you can reduce contention: a long-running select won't block an update (personally, I think this is less of an issue today than, say, 20 years ago, but it's still worth considering).

The downside of read replicas is that there's a lag between the time a row is updated on the server and on the replica. This may or may not be a problem for you (and in my experience, lag is caused by undersized machines; more money solves that problem).

Offload reporting to a data warehouse

True "reporting" queries tend to be very different from operational queries. For example, an operational query might retrieve the most recent order for a single user, while a reporting query might find all users that bought a particular product. As a result, attempting to support both operational and reporting queries with the same physical design is a recipe for failure.

At the very least, shift reporting to a dedicated read replica, one that is indexed appropriately. Better yet, use a completely different DBMS whose storage and query characteristics more closely match your reporting needs: something like Amazon Redshift, Google BigQuery, or Azure SQL Data Warehouse (since renamed Azure Synapse Analytics). Or maybe a locally-hosted option like Apache Cassandra.

And now for something completely different

Don't do this.

The time taken to develop a multi-tenant solution is time not available to add features that may be more relevant to your franchisees, or to improve the current code and processes.

If the issue is maintenance, or per-franchise capital expenses, look to alternatives that enable central management. For example, use Azure or another cloud provider with an ops person at the corporate office. You should be able to deploy a cloud-based solution for a per-franchise cost of a few hundred dollars per month (if that), cutting both capital and operational expenses for the franchisees.

If the issue is reporting, focus on more efficient data acquisition and transformation. Again, cloud-based solutions can help with this.


Update

The idea of moving to a cloud provider -- and Azure is only one option, which I picked because you seem to be a Microsoft shop -- is to eliminate the problems caused by franchisees who aren't trained computer operators.

In the simplest form, you would create a database server for each franchisee in the cloud, and their existing applications would point to that server rather than the local database. The database would always be up, so that you could retrieve data at any time. And, typically, the cloud provider does regular backups and provides other options for fault-tolerance and recovery.

Pricing in the cloud largely depends on the features you want. For example, looking at the Azure SQL Database pricing page, the base price for a "Standard" database is $0.0202/hour, or about $15/month. I have no idea what that actually provides in terms of database performance; in my experience, $100/month is more realistic.

There is nothing that prevents you from using cloud hosting as a first step, and then moving on to a true multi-tenant solution. And if you have hundreds or thousands of franchisees, that makes sense to manage cost. But it seems like your real problem is one of operations management.

Other tips

I would suggest you look into table partitioning. With table partitioning (or its older cousin, partitioned views), your application still sees a single "logical" table; physically, however, each client's data can live in its own partition on a separate filegroup -- or, with partitioned views, in a separate table, database, or even SQL Server instance.

This gives you two things:

1) It gives you an excuse to add the client ID to all of your tables, since it would act as the partitioning key, which is required.

2) It allows you to isolate clients from each other, which could be very important if you plan to let them access their data directly. If you don't partition the data, a careless developer could launch an unwitting denial-of-service attack on your other clients by locking too many rows or just by hitting the tables too hard -- even with nothing more than datareader permissions. But if the tables are physically separated, this is impossible.
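A sketch of partitioning on the client ID itself (function, scheme, and filegroup names are invented; the boundary values assume a few hundred franchises): each range of franchise IDs is mapped to its own filegroup.

```sql
-- RANGE LEFT: boundary values belong to the partition on their left,
-- giving four partitions for the three boundaries below.
CREATE PARTITION FUNCTION pf_ByFranchise (int)
    AS RANGE LEFT FOR VALUES (100, 200, 300);

-- One filegroup per partition; a noisy franchise's I/O is then
-- physically confined to its own files.
CREATE PARTITION SCHEME ps_ByFranchise
    AS PARTITION pf_ByFranchise
    TO (FG_Franchise_0_100, FG_Franchise_101_200,
        FG_Franchise_201_300, FG_Franchise_301_up);
```

Tables created `ON ps_ByFranchise (FranchiseID)` then get per-client physical separation while the application continues to see one logical table.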

That being said, I probably would not give them direct access to the live database. That seems extremely risky. If they need to run reports and ad hoc queries, you can give them a data mart that contains their data only (potentially at an additional charge).

Licensed under: CC-BY-SA with attribution