Question

Happy new year everyone. I'm hoping for some general guidance in the following situation...

I have an application that has been running for about 10 years. The datastore is MySQL (now on AWS Aurora).

Some of the tables that are in one-to-many relations are starting to have a larger number of rows:

Records (~1.4million rows) 
        |
        V
    (1 to many)
        |
        V
SubRecords (~10million rows)
        |
        V
    (1 to many)
        |
        V
SubSubRecords (~22million rows)

There is not a lot of actual data stored in these rows (e.g. subSubRecords is only about 5 GB in total), and the queries I run are very straightforward, using indexed keys with no joins. For example...

SELECT ... FROM Records WHERE id = ?;
SELECT ... FROM SubRecords WHERE recordId = ?;
SELECT ... FROM SubSubRecords WHERE subRecordId = ?;

So far, everything continues to be highly performant.

However, I'm starting to worry about how this design will hold up over time. While it took 10 years to get to 22 million rows in SubSubRecords, the db is growing a lot faster now. I wouldn't be surprised to see that table climb to 100 million rows over the next 5 years, which feels like a lot. And I'm not sure at what point it will become a problem.

I realize this is a rather broad question and is situation-dependent. But what types of solutions are generally recommended in these cases?

  • Set up partitions? (The tables use foreign keys to enforce integrity, and my understanding is that foreign keys are incompatible with partitioning.)

  • Convert the data in subRecords and subSubRecords to a JSON payload and store it directly in a JSON column on the main records table? (Same amount of data, but fewer rows, if that matters.)

  • Move to an entirely different db? (Mongo? I know nothing about it but have heard it scales better in certain situations.)

  • Ignore it until it becomes an issue? :D

Any suggestions / pearls of wisdom from those who have wrestled with similar problems are welcome. Thanks (in advance) for your help!

Addendum:

As requested, here is the CREATE TABLE syntax for the above tables...

CREATE TABLE records (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    typeId TINYINT(1) UNSIGNED NOT NULL,
    userId INT UNSIGNED NOT NULL, 
    updated TIMESTAMP DEFAULT NOW() NOT NULL,
    savename VARCHAR(100) NOT NULL,
    title VARCHAR(100) NOT NULL,
    instructions TEXT NOT NULL,
    FULLTEXT ftRecords(savename, title),
    PRIMARY KEY(id),
    FOREIGN KEY(typeId) REFERENCES recordTypes(id),
    FOREIGN KEY(userId) REFERENCES users(id) ON DELETE CASCADE
) ENGINE=InnoDB CHARACTER SET=utf8;

CREATE TABLE subRecords (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    recordId INT UNSIGNED NOT NULL,
    thumbnailId INT UNSIGNED NULL,
    sortOrder SMALLINT NOT NULL,
    enabled TINYINT(1) DEFAULT 0 NOT NULL,
    title VARCHAR(100) NOT NULL,
    instructions TEXT NOT NULL,
    parameters VARCHAR(500) NOT NULL,
    PRIMARY KEY(id),
    FOREIGN KEY(recordId) REFERENCES records(id) ON DELETE CASCADE,
    FOREIGN KEY(thumbnailId) REFERENCES thumbnails(id) ON DELETE SET NULL
) ENGINE=InnoDB CHARACTER SET=utf8;

CREATE TABLE subSubRecords (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    subRecordId INT UNSIGNED NOT NULL,
    thumbnailId INT UNSIGNED NULL,
    sortOrder SMALLINT NOT NULL,
    caption VARCHAR(200) NOT NULL,
    PRIMARY KEY(id),
    FOREIGN KEY(subRecordId) REFERENCES subRecords(id) ON DELETE CASCADE,
    FOREIGN KEY(thumbnailId) REFERENCES thumbnails(id) ON DELETE SET NULL
) ENGINE=InnoDB CHARACTER SET=utf8;

The solution

100M rows is not scary.

Partitioning -- No. It is unlikely to add any performance benefit. If, however, you need to purge "old" data, there might be a use for partitioning.
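
If that purge scenario ever materializes, the rough shape would be something like the sketch below. This is only an illustration, not your schema: it assumes an added created date column that the posted tables do not have, and partitioned InnoDB tables cannot have foreign keys, so those would have to be dropped.

-- Hypothetical sketch: assumes an added `created` column and no foreign keys,
-- since partitioned InnoDB tables do not support them. The partitioning column
-- must also be part of every unique key, hence PRIMARY KEY(id, created).
CREATE TABLE subSubRecordsPartitioned (
    id INT UNSIGNED NOT NULL AUTO_INCREMENT,
    subRecordId INT UNSIGNED NOT NULL,
    created DATE NOT NULL,
    sortOrder SMALLINT NOT NULL,
    caption VARCHAR(200) NOT NULL,
    PRIMARY KEY(id, created),
    INDEX(subRecordId)
) ENGINE=InnoDB CHARACTER SET=utf8
PARTITION BY RANGE (TO_DAYS(created)) (
    PARTITION p2024 VALUES LESS THAN (TO_DAYS('2025-01-01')),
    PARTITION p2025 VALUES LESS THAN (TO_DAYS('2026-01-01')),
    PARTITION pmax  VALUES LESS THAN MAXVALUE
);

-- Purging a year then becomes a cheap metadata operation instead of a huge DELETE:
ALTER TABLE subSubRecordsPartitioned DROP PARTITION p2024;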

If recordId is an INDEX of SubRecords -- Good.
If recordId is the first column in the PRIMARY KEY of SubRecords -- better.

Show us SHOW CREATE TABLE for further advice.

Your 3 SELECTs would run faster if you JOINed the 3 tables in a single SELECT.
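
For example, something along these lines (a sketch only; the selected columns and aliases are illustrative) returns a Record and all of its descendants in one round trip, at the cost of repeating the parent columns on every result row:

-- One round trip instead of three; LEFT JOINs keep the Record
-- even when it has no SubRecords or SubSubRecords yet.
SELECT  r.id, r.title,
        sr.id  AS subRecordId,    sr.title AS subTitle,
        ssr.id AS subSubRecordId, ssr.caption
    FROM Records AS r
    LEFT JOIN SubRecords    AS sr  ON sr.recordId     = r.id
    LEFT JOIN SubSubRecords AS ssr ON ssr.subRecordId = sr.id
    WHERE r.id = ?
    ORDER BY sr.sortOrder, ssr.sortOrder;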

Clustering

A slight improvement is to increase the "clustering" of the rows that you fetch at the same time. For SubRecords,

ALTER TABLE SubRecords
    DROP PRIMARY KEY,
    ADD PRIMARY KEY(recordId, id),
    ADD INDEX(id);

That way, when you get the several SubRecords for one Record, they will be located next to each other. This is because the PRIMARY KEY (in InnoDB) is 'clustered' with the data. The INDEX(id) is to keep AUTO_INCREMENT happy.

This change may not show any noticeable improvement until the dataset is bigger than RAM.

A similar thing can be done with SubSubRecords.
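
For example, a sketch of the equivalent change (untested; adjust the table-name casing to your schema):

ALTER TABLE SubSubRecords
    DROP PRIMARY KEY,
    ADD PRIMARY KEY(subRecordId, id),
    ADD INDEX(id);  -- keeps AUTO_INCREMENT happy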

Other tips

It's good to plan for the future, but you're also likely worrying about a problem that may never materialize. It sounds like your tables are pretty lightweight, since your 22-million-row table is only about 5 GB.

By denormalizing the sub-table data into JSON and stuffing it into a column of the main table, you actually risk making your system slower, because each row of the main table becomes larger. When the main table is read from disk into memory, that operation can become slower, since there is more data per row (data you may not always need) to load than with your current normalized setup.

Most database systems are on par with each other when it comes to the standard operations of writing to a table and reading that data back. MongoDB doesn't have anything unique that would make your example use cases faster on larger datasets than MySQL. Rather, the choice between a NoSQL database system and an RDBMS is mainly a question of how structured your data is, not normally a question of performance.

A B-Tree can very efficiently handle a large number of nodes, so your tables' indexes can easily handle storing their rows' keys and do lookups very quickly. The search time for a B-Tree is O(log n): using base-10 logs for illustration, log(10,000,000) = 7, and if your table grew 10x to 100,000,000 rows, log(100,000,000) only goes up to 8. The search time changes very little for every factor of 10 your data grows by.

Without information about your server's provisioning, I can only offer an example database I've worked with for reference: a single table had tens of billions of rows, and seeks on its indexes never took more than a few seconds (generally under a second). This was on an AWS server with an 8-core CPU, 16 GB of memory, and regular SSDs. The table itself held about 1 TB of data, too (the rows were wider and stored a lot of verbose data).

Although the term Big Data is somewhat subjective, these days the total data size generally needs to be in the tens to hundreds of terabytes before it counts as Big Data and possibly needs an alternative to a standard RDBMS implementation. Even then, partitioning and sharding exist to help a regular RDBMS handle data that large. In your case, it sounds like your system will never reach such a limit.
