Can an index speed up a query with GROUP BY/aggregate over the whole table (no selectivity)?

https://dba.stackexchange.com/questions/160586

05-10-2020
|

Question

Let's say i have a table with 3 columns a,b and c.

Could i possible speed up a query that looks like this by using index(es) ??

SELECT a,b,SUM(c)  # or AVG(c)
FROM table
GROUP BY a,b
ORDER BY a,b
;

If the above question is positive, what type of index do you recommend and how would this work?

Solution

Not likely. GROUP BY and ORDER BY typically entail a sort. However, in this case a HashAggregate is used (likely because we're working with the whole table).

CREATE TABLE foo AS
SELECT x % 5 AS a, x % 10 AS b, x AS c
FROM generate_series(1,1e6) AS x;

With the HashAggregate plan,

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=611.607..611.608 rows=10 loops=1)
   Sort Key: a, b
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=611.589..611.593 rows=10 loops=1)
         Group Key: a, b
         ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.012..71.157 rows=1000000 loops=1)
 Planning time: 0.168 ms
 Execution time: 611.665 ms

so we add an index...

CREATE INDEX idx ON foo (a,b);
VACUUM FULL ANALYZE foo;

... still shows the same query plan. So we disable HashAggregate

SET enable_hashagg = false;

And, try again..

# EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                            QUERY PLAN                                                            
----------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=0.42..61292.04 rows=50 width=14) (actual time=108.149..655.536 rows=10 loops=1)
   Group Key: a, b
   ->  Index Scan using idx on foo  (cost=0.42..53791.41 rows=1000000 width=14) (actual time=0.066..272.299 rows=1000000 loops=1)
 Planning time: 0.121 ms
 Execution time: 655.594 ms
(5 rows)

And, it takes more time 655ms compared to the previous 611ms.

Need faster?

If that's not fast enough (and 611ms to group and sum a million rows isn't bad). Then you can use a MATERIALIZED VIEW if your workload permits it (if the query is hot and/or infrequently updated),

CREATE MATERIALIZED VIEW foo2 AS
SELECT a,b,sum(c)
FROM foo
GROUP BY a,b
ORDER BY a,b;

Now you're down to sub-MS time when TABLE foo2. Then just do REFRESH MATERIALIZED VIEW foo2; to refresh the view. Or, you can create a trigger update another table and update it with triggers.

On the actually column being aggregated.

There are some exceptions, but sum() isn't one of them. Most aggregates do not use an index because they usually do not have a need for one. The exceptions are the aggregates that are order-specific (like min() and max()). So for instance, if after we created the index on (a,b) you run a sum(a),

# EXPLAIN ANALYZE SELECT sum(a) FROM foo;
                                                     QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=18666.00..18666.01 rows=1 width=4) (actual time=287.063..287.063 rows=1 loops=1)
   ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=4) (actual time=0.015..85.435 rows=1000000 loops=1)
 Planning time: 0.098 ms
 Execution time: 287.104 ms
(4 rows)

you can see that it still uses a seq scan.. You'll see the same plan for sum(c) which doesn't have an index at all. Now here is the kicker,

# EXPLAIN ANALYZE SELECT min(a) FROM foo;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 Result  (cost=0.48..0.49 rows=1 width=0) (actual time=0.041..0.041 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.42..0.48 rows=1 width=4) (actual time=0.036..0.037 rows=1 loops=1)
           ->  Index Only Scan using idx on foo  (cost=0.42..56291.41 rows=1000000 width=4) (actual time=0.035..0.035 rows=1 loops=1)
                 Index Cond: (a IS NOT NULL)
                 Heap Fetches: 1
 Planning time: 0.171 ms
 Execution time: 0.080 ms
(8 rows)

min(a) unlike sum(a) can make use of an ordering, and so the query planner realizes that the index scan, which isn't free, has a benefit.

Proof using index on (a,b,c)

For whatever reason if you want to see proof that a further index on c doesn't matter for the purposes of summing (ask a question if after reading the above you still do not understand why),

-- turn this back on we turned it off earlier
SET enable_hashagg = true;
DROP INDEX idx;
CREATE INDEX idx ON foo (a,b,c);
VACUUM FULL ANALYZE foo;
EXPLAIN ANALYZE SELECT a,b,sum(c) FROM foo GROUP BY a,b ORDER BY a,b;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Sort  (cost=23668.04..23668.16 rows=50 width=14) (actual time=608.888..608.889 rows=10 loops=1)
   Sort Key: a, b
   Sort Method: quicksort  Memory: 25kB
   ->  HashAggregate  (cost=23666.00..23666.62 rows=50 width=14) (actual time=608.869..608.871 rows=10 loops=1)
         Group Key: a, b
         ->  Seq Scan on foo  (cost=0.00..16166.00 rows=1000000 width=14) (actual time=0.015..72.613 rows=1000000 loops=1)
 Planning time: 0.130 ms
 Execution time: 608.947 ms
(8 rows)

No improvement whatsoever. Disabling hashagg still shows no improvement whatsoever.

TLDR;

In this specific and simple use case indexes don't matter. The planner chooses the best method.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange