Question

I want to update tables (maybe 20-30) having tens of millions of records each.

The problem is that the update process takes too much time and drives CPU usage very high while it runs. I want to do it in such a way that it does not use much CPU while processing the data. Longer processing time is not a problem for me, but the update should use limited CPU resources while processing the tables. I am using PostgreSQL as the database, and the server operating system is Linux.

A sample query of mine looks like this:

UPDATE TEMP 
SET CUSTOMERNAME = 
  ( select customername from user where user.customerid = temp.customerid );

Solution

The first question is: Why is it important that you not use a lot of CPU time? The query will be bottlenecked on some resource; if you could introduce enough additional disk access, the CPU time used per second would go down, but would that really be an improvement? What resource would you prefer to saturate? Understanding why you have emphasized this might help guide people to providing an answer that you will find useful.

As suggested in a comment, your query may run faster with a join rather than a correlated subquery. Something like this:

UPDATE temp
  SET customername = user.customername
  FROM user
  WHERE user.customerid = temp.customerid;

Another important thing to know is whether you want to update all rows in the table. Are some of the values already correct? If so, you will get a big performance boost by not updating the rows that don't need it. Add AND temp.customername is distinct from user.customername to the WHERE clause.
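Putting those two suggestions together, a sketch of the full statement might look like this (assuming the table and column names from the question; "user" is quoted because USER is a reserved word in PostgreSQL):

UPDATE temp
  SET customername = u.customername
  FROM "user" u
  WHERE u.customerid = temp.customerid
    AND temp.customername IS DISTINCT FROM u.customername;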

If you limit the number of rows updated in each statement, and run VACUUM ANALYZE after each UPDATE, you will avoid table bloat. If the point of minimizing CPU time is to avoid a performance impact on concurrent transactions, this also gives you the opportunity to introduce a short delay (a sleep or something similar) before you start the next UPDATE of a set of rows.
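Between batches, that could be as simple as the following (the 5-second pause is an arbitrary choice):

VACUUM ANALYZE temp;   -- reclaim dead row versions left by the previous UPDATE
SELECT pg_sleep(5);    -- optional pause to reduce the impact on concurrent transactions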

Even better, why are you redundantly storing the information in the temp table rather than joining to it when needed? (Sometimes there is a good reason; quite often there isn't.)

OTHER TIPS

If you follow kgrittn's very good advice and still have performance issues, you may need to perform the update in batches. You would still perform set-based updates, but limit them to the first 1,000 (or whatever number works for you; I've used anywhere from 500 to 50,000) records that don't match, and then keep looping until all are done.
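One possible shape for such a batch, assuming the names from the question and an arbitrary batch size of 10,000 (the driving script re-runs this statement, followed by the VACUUM ANALYZE shown above, until it reports 0 rows updated):

UPDATE temp
  SET customername = u.customername
  FROM "user" u
  WHERE u.customerid = temp.customerid
    AND temp.customerid IN (
      SELECT t.customerid            -- assumes customerid identifies a row in temp
      FROM temp t
      JOIN "user" u2 ON u2.customerid = t.customerid
      WHERE t.customername IS DISTINCT FROM u2.customername
      LIMIT 10000);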

If there is an index on TEMP.CUSTOMERNAME and you are updating a significant part of the TEMP table, then drop that index before the update and rebuild it afterwards.
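A sketch, using a hypothetical index name:

DROP INDEX IF EXISTS temp_customername_idx;   -- hypothetical name; check \d temp for the real one
-- ... run the UPDATE (or the batch loop) here ...
CREATE INDEX temp_customername_idx ON temp (customername);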

PostgreSQL has no way to reduce the amount of CPU time a process can use. On Linux you can use OS features like the renice command to do that. See Priorities for more information and some samples.
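For example, you could look up the backend process ID for your session and lower its priority from a shell on the database server (the nice value of 10 is an arbitrary example, and renice only affects CPU scheduling, not I/O):

SELECT pg_backend_pid();          -- returns the server process ID for this session
-- then, in a shell on the database server:
--   renice -n 10 -p <pid>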

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange