Question

I have an extremely large dataset (250M rows in one table, 170G on disk) on a very powerful desktop computer that I'm using to clean up the data and run analysis on it.

As part of this analysis:

  • I need to run updates across the entire database to find things like phone numbers in a non-standard format and convert them to a standardized format.
  • I need to be able to run analyses such as "give me all the records which share the same phone number as another record", and similar queries (an example query is sketched just below).
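
For instance, something along these lines, with simplified table and column names (contacts, phone_num) standing in for the real schema:

```sql
-- Illustrative only: contacts / phone_num are placeholder names.
SELECT c.*
FROM   contacts AS c
JOIN  (SELECT phone_num
       FROM   contacts
       GROUP  BY phone_num
       HAVING COUNT(*) > 1) AS dups
       ON dups.phone_num = c.phone_num;
```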

All the fields I will primarily be querying have been set up as indexes, but the only unique index is the primary key.

The powerful desktop machine I have is a Ryzen processor with 4 CPUs with 4 cores per CPU, and it has 64GB of RAM. There is also a dedicated 2TB SSD drive for the database (the OS is running on a different drive).

I am currently operating off a default XAMPP configuration on Windows 10 Pro, running MariaDB 10.4.17.

What changes do you recommend to the MariaDB or computer setup in order to get the best performance out of this scenario and hopefully make things go faster? I've had some update queries take over 16 hours. I know I could likely optimize those queries more, but I also imagine there are a lot of settings in the default XAMPP config that could be significantly tweaked.


Update

I'm kind of surprised by the responses. I know I can make the queries better, but isn't there even one database setting I can tweak to make things run a little smoother, or to gear the server towards the size and type of data I have?

Solution

Rick James is absolutely right that there are much bigger gains to be had from process than from tuning: using multiple threads to run queries on different tables, indexing tables to avoid table scans in the queries, dropping unnecessary indexes before mass updates, avoiding function calls in WHERE clauses, and so on. One of the most basic things you can do, for instance, is to bundle many updates into a single transaction. Committing a transaction is expensive, so if you find and delete 1000 rows per transaction instead of letting autocommit issue one commit for each DELETE statement, things will move along a lot faster (see the sketch below).
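
As a minimal sketch of that batching pattern, assuming a hypothetical contacts table:

```sql
-- Bundle many statements into one transaction instead of letting autocommit
-- flush the log once per DELETE. Table/column names are placeholders.
START TRANSACTION;
DELETE FROM contacts WHERE id = 101;
DELETE FROM contacts WHERE id = 102;
-- ... roughly 1000 rows' worth of work per batch ...
COMMIT;
```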

However, there is definitely some gain to be had from tuning the database properly. Many parameters are best tuned through an iterative process, and that is where the last 10-20% of gains come from, but some very basic tuning already yields significant gains (a combined sketch of these settings follows the list):

  1. You won't be needing more than the default 151 simultaneous connections, so we'll work from the assumption that max_connections is set to the default of 151, which could account for as much as 2.5G of RAM. Leaving another 8G for the OS, you can then safely allocate at least 50G to innodb_buffer_pool_size.

  2. In addition, if you can afford to lose one second's worth of transactions in the event of a crash (which is fine here, because you're not capturing new data, just fixing data that's already there), you will see a significant increase in performance by setting innodb_flush_log_at_trx_commit=2.

  3. Disable binary logging if it's enabled and you're not using it (the log_bin parameter), or, if you need binary logging, set binlog_row_image=MINIMAL to save a lot of IO.

  4. Set innodb_flush_neighbors=0; the default is 1, and because SSDs have essentially 0 seek-times, this default is wasted IO when running on solid state storage.
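
As a rough sketch, assuming the 64GB/SSD machine described above, most of these can be applied at runtime from a client session. To persist across restarts, the same values belong under [mysqld] in XAMPP's my.ini, and turning the binary log off entirely requires a config-file change plus a restart:

```sql
-- Apply the settings at runtime (MariaDB 10.4); mirror them in my.ini to persist.
SET GLOBAL innodb_buffer_pool_size        = 53687091200;  -- 50G, leaves room for OS + connections
SET GLOBAL innodb_flush_log_at_trx_commit = 2;            -- flush the redo log ~once per second
SET GLOBAL innodb_flush_neighbors         = 0;            -- skip neighbor flushing on SSD
SET GLOBAL binlog_row_image               = 'MINIMAL';    -- only relevant if the binary log stays on
-- Disabling the binary log itself (log_bin) is not dynamic:
-- remove/comment it in the [mysqld] section of my.ini and restart MariaDB.
```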

It's easy to go crazy tweaking the more than 600 available configuration parameters. Resist the urge to fiddle with parameters you don't fully understand. Just these four should give you a pretty good baseline performance boost on a 64GB RAM machine with SSD storage.

OTHER TIPS

Yes, data should be cleansed before inserting. (Apparently, you are belatedly doing the cleansing.)

This may help: DROP INDEX on phone_num while cleaning up the numbers, then ADD INDEX when finished.
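
A minimal sketch of that, assuming a hypothetical contacts table with an index named idx_phone_num:

```sql
-- Hypothetical table and index names; adjust to the real schema.
ALTER TABLE contacts DROP INDEX idx_phone_num;

-- ... run the mass UPDATEs that rewrite phone_num here ...

ALTER TABLE contacts ADD INDEX idx_phone_num (phone_num);
```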

Do the UPDATEs in chunks: http://mysql.rjweb.org/doc.php/deletebig#deleting_in_chunks . Chunk based on the PRIMARY KEY, check for errors, and keep the chunk size to only about 1000 rows.
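
One way a single chunk might look, assuming a hypothetical contacts table with an AUTO_INCREMENT primary key id; an outer loop in the application advances @start by 1000 until it passes MAX(id):

```sql
-- One chunk of the mass UPDATE, bounded by a PRIMARY KEY range.
-- Table/column names and the normalization expression are placeholders.
SET @start := 0;                      -- advanced by the outer loop

UPDATE contacts
   SET phone_num = REPLACE(REPLACE(REPLACE(phone_num, '-', ''), '(', ''), ')', '')
 WHERE id >= @start
   AND id <  @start + 1000;

SET @start := @start + 1000;          -- next chunk
```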

It would take some code, but you might benefit from having 16 connections (one per core), each working on its own chunks of the UPDATE. (Don't expect more than an 8x improvement; you might not get even that much.)

Please provide more details if you want to discuss further.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange