Can MySQL Cluster handle a terabyte database

https://stackoverflow.com/questions/850307

21-08-2019
|

Question

I have to look into solutions for providing a MySQL database that can handle data volumes in the terabyte range and be highly available (five nines). Each database row is likely to have a timestamp and up to 30 float values. The expected workload is up to 2500 inserts/sec. Queries are likely to be less frequent but could be large (maybe involving 100Gb of data) though probably only involving single tables.

I have been looking at MySQL Cluster given that is their HA offering. Due to the volume of data I would need to make use of disk based storage. Realistically I think only the timestamps could be held in memory and all other data would need to be stored on disk.

Does anyone have experience of using MySQL Cluster on a database of this scale? Is it even viable? How does disk based storage affect performance?

I am also open to other suggestions for how to achieve the desired availability for this volume of data. For example, would it be better to use a third party libary like Sequoia to handle the clustering of standard MySQL instances? Or a more straight forward solution based on MySQL replication?

The only condition is that it must be a MySQL based solution. I don't think that MySQL is the best way to go for the data we are dealing with but it is a hard requirement.

Solution

Speed wise, it can be handled. Size wise, the question is not the size of your data, but rather the size of your index as the indices must fit fully within memory.

I'd be happy to offer a better answer, but high-end database work is very task-dependent. I'd need to know a lot more about what's going on with the data to be of further help.

OTHER TIPS

Okay, I did read the part about mySQL being a hard requirement.

So with that said, let me first point out that the workload you're talking about -- 2500 inserts/sec, rare queries, queries likely to have result sets of up to 10 percent of the whole data set -- is just about pessimal for any relational data base system.

(This rather reminds me of a project, long ago, where I had a hard requirement to load 100 megabytes of program data over a 9600 baud RS-422 line (also a hard requirement) in less than 300 seconds (also a hard requirement.) The fact that 1kbyte/sec × 300 seconds = 300kbytes didn't seem to communicate.)

Then there's the part about "contain up to 30 floats." The phrasing at least suggests that the number of samples per insert is variable, which suggests in turn some normaliztion issues -- or else needing to make each row 30 entries wide and use NULLs.

But with all that said, okay, you're talking about 300Kbytes/sec and 2500 TPS (assuming this really is a sequence of unrelated samples). This set of benchmarks, at least, suggests it's not out of the realm of possibility.

This article is really helpful in identifying what can slow down a large MySQL database.

Possibly try out hibernate shards and run MySQL on 10 nodes with 1/2 terabyte each so you can handle 5 terabytes then ;) well over your limit I think?

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow