Question

I have built a small inventory system using PostgreSQL and psycopg2. Everything works great, except when I want to create aggregated summaries/reports of the content: I get really bad performance due to the count()'ing and sorting.

The DB schema is as follows:

CREATE TABLE hosts
(
        id SERIAL PRIMARY KEY,
        name VARCHAR(255)
);
CREATE TABLE items
(
        id SERIAL PRIMARY KEY,
        description TEXT
);
CREATE TABLE host_item
(
        id SERIAL PRIMARY KEY,
        host INTEGER REFERENCES hosts(id) ON DELETE CASCADE ON UPDATE CASCADE,
        item INTEGER REFERENCES items(id) ON DELETE CASCADE ON UPDATE CASCADE
);

There are some other fields as well, but those are not relevant.

I want to extract two different reports:

  • A list of all hosts with the number of items per host, ordered from highest to lowest count

  • A list of all items with the number of hosts per item, ordered from highest to lowest count

I have used 2 queries for the purpose:

Items with host count:

SELECT i.id, i.description, COUNT(hi.id) AS count
FROM items AS i
LEFT JOIN host_item AS hi
ON (i.id=hi.item)
GROUP BY i.id
ORDER BY count DESC
LIMIT 10;

Hosts with item count:

SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;

The problem is that the queries run for 5-6 seconds before returning any data. As this is a web-based application, 6 seconds is just not acceptable. The database is heavily populated, with approximately 50k hosts, 1,000 items and 400,000 host/item relations, and it will likely grow significantly when (or perhaps if) the application comes into use.

After playing around, I found that by removing the "ORDER BY count DESC" part, both queries would execute instantly without any delay whatsoever (less than 20ms to finish the queries).

Is there any way I can optimize these queries so that I can get the result sorted without the delay? I have tried different indexes, but seeing as the count is computed, it is not possible to utilize an index for this. I have read that count()'ing in PostgreSQL is slow, but it's the sorting that is causing me problems...

My current workaround is to run the queries above as an hourly job, putting the result into a new table with an index on the count column for quick lookup.
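A rough sketch of what that hourly job boils down to (the summary table and index names used here are just examples):

-- Rebuild a summary table with an index on the count column
BEGIN;
DROP TABLE IF EXISTS item_counts;
CREATE TABLE item_counts AS
SELECT i.id, i.description, count(hi.id) AS count
FROM   items i
LEFT   JOIN host_item hi ON hi.item = i.id
GROUP  BY i.id, i.description;
CREATE INDEX item_counts_count_idx ON item_counts (count DESC);
COMMIT;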

I use Postgresql 9.2.

Update: Query plan as ordered :)

EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
ORDER BY count DESC
LIMIT 10;


 Limit  (cost=699028.97..699028.99 rows=10 width=21) (actual time=5427.422..5427.424 rows=10 loops=1)
   ->  Sort  (cost=699028.97..699166.44 rows=54990 width=21) (actual time=5427.415..5427.416 rows=10 loops=1)
         Sort Key: (count(hi.id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=613177.95..697840.66 rows=54990 width=21) (actual time=3317.320..5416.440 rows=54990 loops=1)
               ->  Merge Left Join  (cost=613177.95..679024.94 rows=3653163 width=21) (actual time=3317.267..5025.999 rows=3653163 loops=1)
                     Merge Cond: (h.id = hi.host)
                     ->  Index Scan using hosts_pkey on hosts h  (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.012..15.693 rows=54990 loops=1)
                     ->  Materialize  (cost=613177.95..631443.77 rows=3653163 width=8) (actual time=3317.245..4370.865 rows=3653163 loops=1)
                           ->  Sort  (cost=613177.95..622310.86 rows=3653163 width=8) (actual time=3317.199..3975.417 rows=3653163 loops=1)
                                 Sort Key: hi.host
                                 Sort Method: external merge  Disk: 64288kB
                                 ->  Seq Scan on host_item hi  (cost=0.00..65124.63 rows=3653163 width=8) (actual time=0.006..643.257 rows=3653163 loops=1)
 Total runtime: 5438.248 ms





EXPLAIN ANALYZE
SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi
ON (h.id=hi.host)
GROUP BY h.id
LIMIT 10;


 Limit  (cost=0.00..417.03 rows=10 width=21) (actual time=0.136..0.849 rows=10 loops=1)
   ->  GroupAggregate  (cost=0.00..2293261.13 rows=54990 width=21) (actual time=0.134..0.845 rows=10 loops=1)
         ->  Merge Left Join  (cost=0.00..2274445.41 rows=3653163 width=21) (actual time=0.040..0.704 rows=581 loops=1)
               Merge Cond: (h.id = hi.host)
               ->  Index Scan using hosts_pkey on hosts h  (cost=0.00..1779.16 rows=54990 width=17) (actual time=0.015..0.021 rows=11 loops=1)
               ->  Index Scan Backward using idx_host_item_host on host_item hi  (cost=0.00..2226864.24 rows=3653163 width=8) (actual time=0.005..0.438 rows=581 loops=1)
 Total runtime: 1.143 ms

Update: All the answers to this question are really good for learning and understanding how Postgres works. There does not seem to be any definitive solution to this problem, but I really appreciate all the excellent answers you have provided, and I will use them in my future work with PostgreSQL. Thanks a lot, guys!


Solution

@Gordon and @willglynn have provided a lot of useful background as to why your query is slow.

A workaround would be to add a counter column to the tables items and hosts, plus triggers that keep them up to date (sketched below) - for a non-trivial cost to write operations.
Or use materialized views like you do. I might opt for that.
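A minimal sketch of the trigger route for the hosts side (my own illustration, not part of the original answer; the items side would mirror it, and updates to host_item.host would need handling as well):

-- Counter column on hosts, kept current by a trigger on host_item.
-- Column, function and index names here are my own.
ALTER TABLE hosts ADD COLUMN item_count integer NOT NULL DEFAULT 0;

CREATE OR REPLACE FUNCTION host_item_count_trg()
  RETURNS trigger
  LANGUAGE plpgsql AS
$func$
BEGIN
   IF TG_OP = 'INSERT' THEN
      UPDATE hosts SET item_count = item_count + 1 WHERE id = NEW.host;
   ELSIF TG_OP = 'DELETE' THEN
      UPDATE hosts SET item_count = item_count - 1 WHERE id = OLD.host;
   END IF;
   RETURN NULL;   -- return value is ignored for AFTER triggers
END
$func$;

CREATE TRIGGER host_item_count
AFTER INSERT OR DELETE ON host_item
FOR EACH ROW EXECUTE PROCEDURE host_item_count_trg();

-- The report is then a cheap index scan:
CREATE INDEX hosts_item_count_idx ON hosts (item_count DESC);
-- SELECT id, name, item_count FROM hosts ORDER BY item_count DESC LIMIT 10;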

For the materialized-view route, you still need to execute these queries on a regular basis, and they can be improved. Rewrite your first one to:

SELECT id, i.description, hi.ct
FROM   items i
JOIN  (
    SELECT item AS id, count(*) AS ct
    FROM   host_item
    GROUP  BY item
    ORDER  BY ct DESC
    LIMIT  10
    ) hi USING (id);
  • If there is a row in table items for most rows in table host_item, it is faster to aggregate first and then JOIN. Contrary to what @willglynn speculates, this is not optimized automatically in Postgres 9.1.

  • count(*) is faster than count(col) on principle - and equivalent as long as col cannot be NULL. (A LEFT JOIN might introduce NULL values.)

  • Simplified LEFT JOIN to JOIN. It should be safe to assume that at least ten distinct items appear in host_item. It doesn't matter much for your original query, but it's a requirement for this one.

  • Indexes on table host_item won't help, and the PK on items covers the rest.

Probably still not good enough for your case, but in my tests with Postgres 9.1 this form is more than twice as fast. Should translate to 9.2, but test with EXPLAIN ANALYZE to be sure.
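The second report (hosts with item count) can be rewritten along the same lines; this adaptation is mine, following the same pattern:

SELECT id, h.name, hi.ct
FROM   hosts h
JOIN  (
    SELECT host AS id, count(*) AS ct
    FROM   host_item
    GROUP  BY host
    ORDER  BY ct DESC
    LIMIT  10
    ) hi USING (id);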

Other suggestions

As @GordonLinoff says, these queries are going to be slow regardless of the database involved, but it's helpful to know why. Consider how a database can execute this query:

SELECT table1.*, count(*)
FROM table1
JOIN table2 ON table2.id1 = table1.id
GROUP BY table1.id

Assuming table2 contains data for most rows in table1 and both tables are nontrivially sized, relational databases will tend to do the following:

  • Scan table2, computing the aggregate on id1, yielding a { id1, count } result set.
  • Scan table1.
  • Hash join.

Adding or not adding an ORDER BY count doesn't materially change the amount of work: you've still got two table scans and a JOIN, you've just added a sort step. You might try to add an index on table2 (id1), but all that could improve is the aggregation step: now instead of reading two whole tables, you're reading one whole table and a whole index. Joy.

If you can eliminate most of the rows from consideration using indexes on one or both tables, by all means do so. Otherwise, the operation will always boil down to two scans, and as your data set gets larger, this gets less and less performant.

Incidentally, this is the effect of dropping the ORDER BY in your query: by leaving the LIMIT clause, you've told PostgreSQL you're only interested in the first N rows. This means it can pick N rows from table1 and do a nested loop against table2 -- for each of those N rows in table1, it finds count(*) in table2 for that specific ID using an index on that ID. This is what makes it much faster: you've excluded most of table2.

If your application commonly needs a count of associated records, the usual solution is to maintain a counter yourself. One convention (natively supported by Rails and several other ORMs) is to add a table2_count column to table1. If you index this counter, an ORDER BY ... LIMIT query will be extremely performant.
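A sketch of that convention with the generic table names from above (column and index names here are illustrative):

ALTER TABLE table1 ADD COLUMN table2_count integer NOT NULL DEFAULT 0;
CREATE INDEX table1_table2_count_idx ON table1 (table2_count DESC);

-- With the counter maintained (by the ORM or by triggers), the report
-- becomes a simple index-backed scan:
SELECT id, table2_count
FROM   table1
ORDER  BY table2_count DESC
LIMIT  10;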

If your tools can't do this out of the box, or you're using a diverse set of tools to manipulate this database, triggers are a better option. You could put this in a separate summary table as @GordonLinoff suggests -- that might mean less contention in the base table, but it forces a JOIN when retrieving counts. I'd suggest adding a table2_count column to table1 first and breaking it out only if performance measurements indicate it's a win.

The queries that you have written are going to be slow in any database. The comparison to the query without the order by is interesting. The speedy return suggests that an index is involved; if so, it can find the counts from the index.

A fairer comparison is to the query without the order by and with no limit clause. That way, all the rows will be generated, just as in the version with the order by. Basically, the database engine must evaluate all the rows in order to find the top 10. The optimizer decides whether it needs to sort the data or take some other approach.
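For concreteness, that fairer comparison would be the original hosts query with the last two clauses dropped:

SELECT h.id, h.name, COUNT(hi.id) AS count
FROM hosts AS h
LEFT JOIN host_item AS hi ON (h.id = hi.host)
GROUP BY h.id;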

You have several options. The first is to see if you can speed up the query by changing parameters specific to Postgres. For instance, perhaps the page cache is too small and could be expanded. Or perhaps there are sort optimization parameters that can help.

Second, you can have a summary table, as you suggest, which is built by a job that runs periodically. If slightly out-of-date data is no problem, then this is fine.

Third, you can have the summary table, but populate it using triggers rather than a job. When the data changes, update the various counts.

Fourth, you might experiment with other approaches. For instance, perhaps Postgres optimizes the window function COUNT(*) over () better than the aggregation. Or, it might optimize row_number() on the aggregated results better than an order by. Or, if you can live with only one value instead of 10, then MAX() is sufficient.
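To illustrate that last idea, a row_number() variant might look like this (my own sketch; whether the planner actually does any better with it would have to be tested):

SELECT id, name, ct
FROM (
    SELECT h.id, h.name, count(hi.id) AS ct,
           row_number() OVER (ORDER BY count(hi.id) DESC) AS rn
    FROM   hosts h
    LEFT   JOIN host_item hi ON hi.host = h.id
    GROUP  BY h.id, h.name
) sub
WHERE  rn <= 10;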

Based on the posted plans your row-count estimates are fine and the plans look vaguely sane. Your main issue is the big sort, probably necessitated by the ORDER BY:

Sort Method: external merge Disk: 64288kB

That's going to hurt even if you have fast storage. If you're on a single hard drive or (worse) a RAID5 array, that's going to be very, very slow. That sort goes away with Erwin's updated query, but increasing work_mem is still likely to gain you some performance.

You should increase work_mem, either for this query or (by a smaller amount) globally, to get much better performance. Try:

SET work_mem = '100MB';
SELECT your_query

and see what difference it makes.

You may also want to play with the random_page_cost and seq_page_cost parameters to see if a different balance produces cost estimates that are a better match for your environment, and thus causes the planner to choose a quicker query. For relatively small amounts of data like this, where most of it will be cached in RAM, I'd start with something like random_page_cost = 0.22 and seq_page_cost = 0.2. You can use SET for them like you do for work_mem, e.g.:

SET work_mem = '100MB';
SET random_page_cost = 0.22;
SET seq_page_cost = 0.2;
SELECT your_query

Do not set work_mem that high in postgresql.conf if you have lots of active connections: it is a per-sort limit, not a per-query limit, so a single query can use several times work_mem, and just a couple of such queries at once could drive the system into memory exhaustion. You need to set it low enough that every connection in max_connections can use 2x or 3x work_mem without your system running out of memory. You can set it per-transaction with SET LOCAL, per-user with ALTER USER ... SET, per-database with ALTER DATABASE ... SET, or globally in postgresql.conf.
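For reference, those scoping options look like this (the role and database names are placeholders):

-- Per-transaction:
BEGIN;
SET LOCAL work_mem = '100MB';
-- ... run the report query here ...
COMMIT;

-- Per-user and per-database (picked up by new sessions):
ALTER USER report_user SET work_mem = '100MB';
ALTER DATABASE inventory SET work_mem = '100MB';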

