Question

My database is very CPU-constrained, and I can't find the root cause of the issue. I currently have two application servers, each with a Rails API connecting to PostgreSQL via the ruby-pg gem. Both application servers also run Sidekiq background jobs, and I have a handful of support servers processing new posts from a national feed via Sidekiq. If I were running out of memory, the solution would seemingly be straightforward. Any general ideas why I am CPU-constrained?

Database Specs:

  • Rackspace 8GB Performance Tier cloud VM (8GB RAM, 8x Core CPU, SSD)
  • Debian 7 Wheezy Linux OS
  • PostgreSQL 9.1 with PostGIS extension

Possible Problems:

  • PostgreSQL 9.1 is bad at indexes

The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2, since index-only scans were introduced in 9.2.
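
For illustration, this is the kind of plan change I'm hoping to see after the upgrade; the table, columns, and index name here are made up:

    -- On 9.2+ a covering index can satisfy this query from the index alone,
    -- which shows up as "Index Only Scan" in the plan (table/index names are hypothetical).
    CREATE INDEX posts_created_at_id_idx ON posts (created_at, id);
    VACUUM ANALYZE posts;  -- the visibility map must be current for index-only scans to help
    EXPLAIN SELECT id FROM posts WHERE created_at > now() - interval '1 day';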

  • Too many connections

In postgresql.conf I have set max_connections to 500. Usually throughout the day only about 175 connections are in use, but during peak times Sidekiq jobs push the count up to 350. How many connections are recommended for an 8GB server instance?
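
A rough way to see who is holding those connections (9.1 column names; on 9.2+ pg_stat_activity gains a state column and current_query becomes query) is something like:

    -- Connections per user/application, and how many of them are sitting idle (9.1 syntax).
    SELECT usename, application_name,
           count(*) AS total,
           sum(CASE WHEN current_query = '<IDLE>' THEN 1 ELSE 0 END) AS idle
    FROM pg_stat_activity
    GROUP BY usename, application_name
    ORDER BY total DESC;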

  • Idle connections

When I look at pg_stat_activity in the psql console, I see that Sidekiq is leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix belong in the API or in Sidekiq?
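
A quick way to break down what each backend is doing (again using the 9.1 convention of encoding state in current_query) is something like:

    -- Active vs. idle vs. idle-in-transaction backends (9.1 syntax).
    SELECT CASE
             WHEN current_query = '<IDLE>' THEN 'idle'
             WHEN current_query = '<IDLE> in transaction' THEN 'idle in transaction'
             ELSE 'active'
           END AS activity,
           count(*) AS backends
    FROM pg_stat_activity
    GROUP BY 1
    ORDER BY 2 DESC;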

  • Need a more powerful server

Maybe there is no bug. I might simply need a larger server instance. Again, this would make more sense if I were memory-bound. However, both app servers and three of the support Sidekiq servers are 4GB performance-tier instances, so the servers that talk to the database collectively have more than double the database's resources. Should this even matter?

Additional questions:

  • What tools/techniques should I employ to troubleshoot the issue?
  • Are there any basic settings in postgresql.conf related to CPU usage?
  • Are there any known issues related to Rails, Sidekiq, or the pg gem that could be a contributing factor? (I haven't seen any open issues.)
  • Are there any general PostgreSQL guidelines for CPU usage?
  • Any other ideas or thoughts that might help my search?

Solution

You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.

On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.

If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
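
A minimal pgbouncer.ini sketch of that setup might look like the following; the database name, paths, and pool size are assumptions you would adapt to your environment:

    ; Clients (Rails, Sidekiq) connect to port 6432; PostgreSQL only ever sees a small, steady pool.
    [databases]
    myapp_production = host=127.0.0.1 port=5432 dbname=myapp_production

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction        ; release the server connection at each transaction end
    default_pool_size = 20         ; a few active connections per core is plenty on 8 cores
    max_client_conn = 500          ; the app side can keep its many client connections

Be aware that transaction pooling breaks session-level state such as named prepared statements, so with Rails/ActiveRecord you would typically disable prepared statements when pooling this way.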

See Number of Database Connections (PostgreSQL wiki).

Additionally, PostGIS can be very CPU-heavy; it sometimes has to do very complex calculations. I suggest using the auto_explain module to log long-running queries, and pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
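
A sketch of turning on auto_explain for a single session (to enable it everywhere, add it to shared_preload_libraries in postgresql.conf and restart); the threshold is only an example value:

    -- Load auto_explain into this session and log the plans of slow queries.
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '500ms';  -- log any query slower than this
    SET auto_explain.log_analyze = off;           -- 'on' adds per-node timings but also overhead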

Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they hold a transaction ID, they might be causing serious table bloat. They also create unnecessary signalling overhead within PostgreSQL, which has to do more co-ordination with backends that are actively doing things. Finally, the number of open transactions itself increases the cost of some internal housekeeping operations.
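
To see how bad the idle-in-transaction problem is, something like this (9.1 column names) lists those sessions along with how long their transactions have been open:

    -- Idle-in-transaction backends, oldest transaction first (9.1 syntax).
    SELECT procpid, usename, application_name, client_addr,
           now() - xact_start  AS transaction_age,
           now() - query_start AS time_since_last_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
    ORDER BY xact_start;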

So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.

OTHER TIPS

Most likely you are CPU constrained because your work needs a lot of CPU. :)

9.1 is not generally bad at indexes. There may be specific issues, as there are in any version, and exactly what those issues are can change from version to version.

Index-only scans are mostly a benefit when you are I/O constrained. I wouldn't hold out much hope for them being a magic bullet for you.

350 connections are certainly not helpful, but they are probably not very harmful either. When they are harmful, though, it can be downright catastrophic. The correct value is determined more by the number of cores than by the amount of RAM. If it is easy to throttle down the Sidekiq connections, do it even if you can't prove that it helps.

If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.

The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so it probably means the backend is rapidly cycling through statements and top just happened to catch it between them. But you didn't say how many similar lines there were in top; if it is just that one, it suggests your code is not running concurrently and 7 of your 8 CPUs are being wasted.

Regarding the db server versus the other servers: if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more of the computation that is currently done on the db, and let the db focus on ACID concerns, that would be good. But no one but you can know whether that is possible or feasible.

My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.
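
A sketch of that first stop, assuming pg_stat_statements is in shared_preload_libraries and the extension has been created in the database (note that on 9.1 queries are not normalized, so the same statement with different literals shows up as separate rows):

    -- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf (restart),
    -- then: CREATE EXTENSION pg_stat_statements;  in the target database.
    SELECT calls, total_time, rows, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;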
