Question

My database is very CPU-constrained, and I can't find the root cause of the issue. I currently have two application servers, each with a Rails API connecting to PostgreSQL via the ruby-pg gem. Both application servers also run Sidekiq background jobs, and I have a handful of support servers processing new posts from a national feed via Sidekiq. If I were running out of memory, the solution would seemingly be straightforward. Any general ideas why I am CPU-constrained?

Database Specs:

  • Rackspace 8GB Performance Tier cloud VM (8GB RAM, 8x Core CPU, SSD)
  • Debian 7 Wheezy Linux OS
  • PostgreSQL 9.1 with PostGIS extension

Possible Problems:

  • PostgreSQL 9.1 is bad at indexes

The database has nearly 10GB of indexes. I am going to upgrade my database to PostgreSQL version >= 9.2, since index-only scans were introduced in 9.2.
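
For illustration, this is the kind of plan change I'm hoping to see after the upgrade; the table, columns, and index name here are made up:

    -- On 9.2+ a covering index can satisfy this query from the index alone,
    -- which shows up as "Index Only Scan" in the plan (table/index names are hypothetical).
    CREATE INDEX posts_created_at_id_idx ON posts (created_at, id);
    VACUUM ANALYZE posts;  -- the visibility map must be current for index-only scans to help
    EXPLAIN SELECT id FROM posts WHERE created_at > now() - interval '1 day';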

  • Too many connections

In postgresql.conf I have set max_connections to 500. Usually throughout the day only about 175 connections are in use, but during peak times Sidekiq jobs push the count up to 350. How many connections are recommended for an 8GB server instance?
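
A rough way to see who is holding those connections (9.1 column names; on 9.2+ pg_stat_activity gains a state column and current_query becomes query) is something like:

    -- Connections per user/application, and how many of them are sitting idle (9.1 syntax).
    SELECT usename, application_name,
           count(*) AS total,
           sum(CASE WHEN current_query = '<IDLE>' THEN 1 ELSE 0 END) AS idle
    FROM pg_stat_activity
    GROUP BY usename, application_name
    ORDER BY total DESC;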

  • Idle connections

When I look at pg_stat_activity in the psql console, I see that Sidekiq is leaving a lot of IDLE connections. Could these connections result in CPU inflation? Does the fix belong in the API or in Sidekiq?
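
A quick way to break down what each backend is doing (again using the 9.1 convention of encoding state in current_query) is something like:

    -- Active vs. idle vs. idle-in-transaction backends (9.1 syntax).
    SELECT CASE
             WHEN current_query = '<IDLE>' THEN 'idle'
             WHEN current_query = '<IDLE> in transaction' THEN 'idle in transaction'
             ELSE 'active'
           END AS activity,
           count(*) AS backends
    FROM pg_stat_activity
    GROUP BY 1
    ORDER BY 2 DESC;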

  • Need a more powerful server

Maybe there is no bug. I might simply need a larger server instance. Again, this would make more sense if I were memory-bound. However, both app servers and three of the support Sidekiq servers are 4GB performance-tier instances, so the servers that talk to the database collectively have more than double the database's resources. Should this even matter?

Additional questions:

  • What tools/techniques should I employ to troubleshoot the issue?
  • Are there any basic settings in postgresql.conf related to CPU usage?
  • Are there any known issues related to Rails, Sidekiq, or the pg gem that could be a contributing factor? (I haven't seen any open issues.)
  • Are there any general PostgreSQL guidelines for CPU usage?
  • Any other ideas or thoughts that might help my search?

Solution

You are using massively too many concurrent connections. PostgreSQL will be wasting lots of its time on housekeeping and juggling concurrent queries. All the concurrent work will be fighting for CPU and buffer space, there'll be heavy contention on spinlocks, and it'll all generally be a mess.

On an 8 core machine, you should probably not have more than 20 actively working connections if you're mostly CPU constrained. If you're I/O limited, you can go higher, but 350 is just ridiculous.

If possible, put a PgBouncer in transaction pooling mode in front of your PostgreSQL instance, so queries get queued up and executed rapidly in series instead of slowly in parallel.
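
A minimal pgbouncer.ini sketch of that setup might look like the following; the database name, paths, and pool size are assumptions you would adapt to your environment:

    ; Clients (Rails, Sidekiq) connect to port 6432; PostgreSQL only ever sees a small, steady pool.
    [databases]
    myapp_production = host=127.0.0.1 port=5432 dbname=myapp_production

    [pgbouncer]
    listen_addr = 0.0.0.0
    listen_port = 6432
    auth_type = md5
    auth_file = /etc/pgbouncer/userlist.txt
    pool_mode = transaction        ; release the server connection at each transaction end
    default_pool_size = 20         ; a few active connections per core is plenty on 8 cores
    max_client_conn = 500          ; the app side can keep its many client connections

Be aware that transaction pooling breaks session-level state such as named prepared statements, so with Rails/ActiveRecord you would typically disable prepared statements when pooling this way.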

See Number of Database Connections (PostgreSQL wiki).

Additionally, PostGIS can be very CPU-heavy; it sometimes has to do very complex calculations. I suggest using the auto_explain module to log long-running queries, and pg_stat_statements / pg_stat_plans to record what's taking up resources. Examine these queries to see if they need improvement.
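
A sketch of turning on auto_explain for a single session (to enable it everywhere, add it to shared_preload_libraries in postgresql.conf and restart); the threshold is only an example value:

    -- Load auto_explain into this session and log the plans of slow queries.
    LOAD 'auto_explain';
    SET auto_explain.log_min_duration = '500ms';  -- log any query slower than this
    SET auto_explain.log_analyze = off;           -- 'on' adds per-node timings but also overhead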

Your idle in transaction sessions must be dealt with, too. Depending on why they're idle and whether they hold a transaction ID, they might be causing serious table bloat. They also create unnecessary signalling overhead within PostgreSQL, which has to do more co-ordination with backends that are actively doing things. Finally, the number of open transactions itself increases the cost of some internal housekeeping operations.
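
To see how bad the idle-in-transaction problem is, something like this (9.1 column names) lists those sessions along with how long their transactions have been open:

    -- Idle-in-transaction backends, oldest transaction first (9.1 syntax).
    SELECT procpid, usename, application_name, client_addr,
           now() - xact_start  AS transaction_age,
           now() - query_start AS time_since_last_query
    FROM pg_stat_activity
    WHERE current_query = '<IDLE> in transaction'
    ORDER BY xact_start;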

So. Your DB will probably perform better if you reduce the connection counts, put a PgBouncer in transaction pooling mode in front, and fix those idle connections.

OTHER TIPS

Most likely you are CPU constrained because your work needs a lot of CPU. :)

9.1 is not generally bad at indexes. There may be specific issues, as there are in any version, and exactly what those issues are can change from version to version.

Index-only scans are mostly a benefit when you are I/O constrained. I wouldn't hold out much hope for them being a magic bullet for you.

350 connections are certainly not helpful, but they are probably not very harmful either. When they are harmful, though, it can be downright catastrophic. The correct value is determined more by the number of cores than by the amount of RAM. If it is easy to throttle down the Sidekiq connections, do it even if you can't prove that it helps.

If the connections are just IDLE, not IDLE in transaction, then they probably aren't very harmful, but again there are a few cases where they can be. That is pretty much the same issue as the number of connections.

The connection you showed from top was idle in transaction. That status shouldn't be taking up much CPU, so it probably means the backend is rapidly cycling through statements and top just happened to catch it between them. But you didn't say how many similar lines there were in top; if it is just that one, it suggests your code is not running concurrently and 7 of your 8 CPUs are being wasted.

Regarding the db server versus the other servers: if the database is fundamentally the limit, beating on it with a bigger hammer is not going to help. Often there is some flexibility about where computation is done. If you can get the app servers to do more of the computation that is currently done on the db, and let the db focus on ACID concerns, that would be good. But no one but you can know whether that is possible or feasible.

My first stop would be to use pg_stat_statements to see what SQL statements are taking the most time. Maybe just adding an index to the slowest/most frequent query would make the problem magically go away.
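
A sketch of that first stop, assuming pg_stat_statements is in shared_preload_libraries and the extension has been created in the database (note that on 9.1 queries are not normalized, so the same statement with different literals shows up as separate rows):

    -- Requires shared_preload_libraries = 'pg_stat_statements' in postgresql.conf (restart),
    -- then: CREATE EXTENSION pg_stat_statements;  in the target database.
    SELECT calls, total_time, rows, query
    FROM pg_stat_statements
    ORDER BY total_time DESC
    LIMIT 10;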
