SQL Database design for huge datasets

https://stackoverflow.com/questions/7542390

29-01-2021

Question

I have a customer with the following data structure: for each patient, there may be multiple samples, and each sample may, after processing, have 4 million data objects. The maximum number of samples per patient is 20, so a single patient may end up with 80 million rows of data, and there will eventually be many hundreds of patients.

In setting up a database to store the objects (each of which contains about 30 fields of statistics and measurements), the challenge is pretty clear: how do I manage this vast amount of data?

I was thinking that I would have one database with a table for each sample, so each table would have at most 4 million records.

A colleague of mine had an interesting suggestion, which was to take it one step further: create a new database per patient and then have a table per sample. His thinking was that having one log per patient, being able to move databases on a per-patient basis, and so on, was good. I can't disagree with him.

Is this reasonable? Is it a bad idea for some reason to have many databases?

Thoughts? Thank you!

Solution

While the idea is interesting from a privacy and migration standpoint, it is NOT a good idea to have a separate database per patient. Think about managing and backing up all those databases, and about having a set of files for each one. I'm not even sure a DBMS can handle millions of databases at the same time in a single instance or on a single server.

What I would do is accept the data volume as a fact of life and deal with it through the data types and tables you choose. Let the DBMS worry about the scale. Make sure you have a deployment model that allows you to scale your tables up and out. At the very least, a table per entity would be wise: one for patient, one for measurement, and so on.
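As a rough illustration of the table-per-entity layout, here is a minimal sketch using Python's built-in sqlite3 module. All table and column names (patient, sample, measurement, area, intensity) are assumptions made up for the example, and SQLite only stands in for whatever DBMS is actually used. The point is that every data object goes into one shared measurement table keyed by sample, rather than into a table or database of its own.

```python
import sqlite3

# Hypothetical schema: all table and column names are illustrative only.
SCHEMA = """
CREATE TABLE patient (
    patient_id INTEGER PRIMARY KEY,
    label      TEXT NOT NULL
);
CREATE TABLE sample (
    sample_id  INTEGER PRIMARY KEY,
    patient_id INTEGER NOT NULL REFERENCES patient(patient_id),
    taken_on   TEXT
);
-- One shared table for all data objects; a real one would carry ~30 columns.
CREATE TABLE measurement (
    sample_id INTEGER NOT NULL REFERENCES sample(sample_id),
    object_no INTEGER NOT NULL,
    area      REAL,
    intensity REAL,
    PRIMARY KEY (sample_id, object_no)
);
"""


def build_demo(conn):
    """Create the schema and load one patient with two tiny samples."""
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO patient (patient_id, label) VALUES (1, 'Patient1')")
    conn.executemany(
        "INSERT INTO sample (sample_id, patient_id) VALUES (?, 1)", [(1,), (2,)]
    )
    rows = [(s, n, float(n), float(s * n)) for s in (1, 2) for n in range(1, 4)]
    conn.executemany("INSERT INTO measurement VALUES (?, ?, ?, ?)", rows)
    conn.commit()


def count_objects_for_patient(conn, patient_id):
    """Count all data objects of a patient across all of that patient's samples."""
    row = conn.execute(
        "SELECT COUNT(*) FROM measurement m "
        "JOIN sample s ON s.sample_id = m.sample_id "
        "WHERE s.patient_id = ?",
        (patient_id,),
    ).fetchone()
    return row[0]


conn = sqlite3.connect(":memory:")
build_demo(conn)
print(count_objects_for_patient(conn, 1))
```

Because the primary key of measurement starts with sample_id, all rows of one sample can be found through the key's index, which is roughly what a table per sample would have given, without creating thousands of tables.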

Just do what you are good at as a developer, and let the DBMS do what it was built for.

OTHER TIPS

When working with that much data, you will definitely want to explore alternatives to MySQL and other RDBMSs. Have you looked into any NoSQL solutions (e.g. key-value stores)? There are several open-source options, though some of them would be ruled out immediately for this application, given that any data loss is probably unacceptable.

Perhaps try looking at Apache Cassandra: http://cassandra.apache.org/. It's a distributed database system (key-value store), but it can run on a single node as well. It would allow you to store all of the data for each patient under a single key (e.g. "Patient1"), and from there you could organize the data into whatever key-value structure is best for querying in your application.
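To make the "one key per patient" idea concrete, here is a small plain-Python sketch of the key layout. It does not use Cassandra or its driver; the store dictionary, the put and objects_for_sample helpers, and the area field are all hypothetical names invented for the illustration. The outer key plays the role of the patient key, and the inner (sample, object) keys order the data within that patient.

```python
from collections import defaultdict

# Plain-Python stand-in for a keyed store; this is NOT the Cassandra driver API.
# Outer key: the patient key, e.g. "Patient1".
# Inner key: (sample_id, object_no), which orders objects within a patient.
store = defaultdict(dict)


def put(patient_key, sample_id, object_no, fields):
    """Store one data object under its patient key."""
    store[patient_key][(sample_id, object_no)] = fields


def objects_for_sample(patient_key, sample_id):
    """Return the objects of one sample, ordered by object number."""
    part = store.get(patient_key, {})
    return [part[k] for k in sorted(part) if k[0] == sample_id]


# Insert out of order to show that reads come back sorted by the inner key.
put("Patient1", 2, 1, {"area": 3.5})
put("Patient1", 1, 2, {"area": 1.2})
put("Patient1", 1, 1, {"area": 0.9})
print(objects_for_sample("Patient1", 1))
```

The inner key should be shaped around the queries the application needs; here it assumes reads are mostly "all objects of one sample for one patient".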

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow