Why is array_agg() slower than the non-aggregate ARRAY() constructor?

https://dba.stackexchange.com/questions/159710

05-10-2020

Question

I was just reviewing some old code written for pre-8.4 PostgreSQL, and I saw something really nifty. I remember having a custom function do some of this back in the day, but I forgot what array aggregation looked like before array_agg(). For reference, modern aggregation is written like this:

SELECT array_agg(x ORDER BY x DESC) FROM foobar;

However, once upon a time, it was written like this:

SELECT ARRAY(SELECT x FROM foobar ORDER BY x DESC);

So I tried it with some test data:

CREATE TEMP TABLE foobar AS
SELECT * FROM generate_series(1,1e7)
  AS t(x);

The results were surprising. The #OldSchoolCool way was markedly faster: about a 25% speedup. Moreover, dropping the ORDER BY from both queries showed the same gap.

test=# EXPLAIN ANALYZE SELECT ARRAY(SELECT x FROM foobar);
                                                         QUERY PLAN                                                          
-----------------------------------------------------------------------------------------------------------------------------
 Result  (cost=104425.28..104425.29 rows=1 width=0) (actual time=1665.948..1665.949 rows=1 loops=1)
   InitPlan 1 (returns $0)
     ->  Seq Scan on foobar  (cost=0.00..104425.28 rows=6017728 width=32) (actual time=0.032..716.793 rows=10000000 loops=1)
 Planning time: 0.068 ms
 Execution time: 1671.482 ms
(5 rows)

test=# EXPLAIN ANALYZE SELECT array_agg(x) FROM foobar;
                                                        QUERY PLAN                                                         
---------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=119469.60..119469.61 rows=1 width=32) (actual time=2155.154..2155.154 rows=1 loops=1)
   ->  Seq Scan on foobar  (cost=0.00..104425.28 rows=6017728 width=32) (actual time=0.031..717.831 rows=10000000 loops=1)
 Planning time: 0.054 ms
 Execution time: 2174.753 ms
(4 rows)

So, what's going on here? Why is array_agg(), an internal function, so much slower than the planner's SQL voodoo?

Using "PostgreSQL 9.5.5 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 6.2.0-5ubuntu12) 6.2.0 20161005, 64-bit"

Solution

There is nothing "old school" or "outdated" about an ARRAY constructor (which is what ARRAY(SELECT x FROM foobar) is). It's as modern as ever. Use it for simple array aggregation.

The manual:

It is also possible to construct an array from the results of a subquery. In this form, the array constructor is written with the key word ARRAY followed by a parenthesized (not bracketed) subquery.

The aggregate function array_agg() is more versatile: it can be integrated in a SELECT list with more columns, possibly with more aggregates in the same SELECT, and arbitrary groups can be formed with GROUP BY. An ARRAY constructor, by contrast, can only return a single array from a SELECT returning a single column.
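
A minimal sketch of that difference. The orders table and its columns are hypothetical, made up only for this illustration:

```sql
-- Hypothetical table, used only for illustration.
CREATE TEMP TABLE orders (customer_id int, item text);
INSERT INTO orders VALUES (1, 'apple'), (1, 'pear'), (2, 'fig');

-- array_agg(): one array per group, next to other aggregates.
SELECT customer_id,
       array_agg(item ORDER BY item) AS items,
       count(*)                      AS n_items
FROM   orders
GROUP  BY customer_id;

-- ARRAY constructor: a single array built from a single-column subquery.
SELECT ARRAY(SELECT item FROM orders ORDER BY item) AS all_items;
```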

I did not study the source code, but it would seem obvious that a much more versatile tool is also more expensive.

One notable difference: the ARRAY constructor returns an empty array ({}) if no rows qualify. array_agg() returns NULL for the same.
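
This can be checked against the foobar table from the question with a predicate that matches no rows. The COALESCE wrapper in the last query is only a suggestion for cases where an empty array is wanted from the aggregate form as well:

```sql
-- No row qualifies in either query.
SELECT ARRAY(SELECT x FROM foobar WHERE false) AS from_constructor;  -- empty array
SELECT array_agg(x) AS from_aggregate FROM foobar WHERE false;       -- NULL

-- Replace the NULL with an empty array when that is the desired result.
SELECT COALESCE(array_agg(x), '{}') AS never_null FROM foobar WHERE false;
```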

OTHER TIPS

I believe the accepted answer by Erwin could be supplemented with the following.

Usually, we are working with regular tables with indices, instead of temporary tables (without indices) as in the original question. It's useful to note that aggregations, such as ARRAY_AGG, cannot leverage existing indices when the sorting is done during the aggregation.

For example, assume the following query:

SELECT ARRAY(SELECT c FROM t ORDER BY id);

If we have an index on t(id, ...), the index could be used instead of a sequential scan on t followed by a sort on t.id. Additionally, if the output column being wrapped in the array (here c) is part of the index (such as an index on t(id, c) or a covering index on t(id) INCLUDE (c)), this could even be an index-only scan.
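
A sketch of the two index variants mentioned above. The definition of t and the index names are assumptions made for illustration, and whether the planner actually picks an index or index-only scan depends on table size, statistics, and how recently the table was vacuumed:

```sql
-- Hypothetical table matching the names used in this answer.
CREATE TABLE t (id int PRIMARY KEY, c text);

-- Variant A: plain multicolumn index.
CREATE INDEX t_id_c_idx ON t (id, c);

-- Variant B (PostgreSQL 11 or later): covering index with INCLUDE.
-- CREATE INDEX t_id_incl_c_idx ON t (id) INCLUDE (c);

-- Inspect the plan the ARRAY constructor query gets.
EXPLAIN SELECT ARRAY(SELECT c FROM t ORDER BY id);
```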

Now, let's rewrite that query as follows:

SELECT ARRAY_AGG(c ORDER BY id) FROM t;

Here the aggregation does not use the index; it has to sort the rows in memory (or, even worse for large data sets, on disk). The plan is always a sequential scan on t followed by a sort inside the aggregate.

As far as I know, this is not covered in the official documentation, but it can be derived from the source. It should hold for all versions current at the time of writing, v11 included; for later releases, check the plan with EXPLAIN.
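
To compare on your own version, the plans can be inspected side by side. The second query is a commonly used alternative, not something from the answer above: it feeds a plain array_agg() from a sorted subquery, which lets the planner consider an index on id for the ordering. Treat it as a sketch, since the ordering is not guaranteed if the outer query adds further processing such as a join:

```sql
-- Ordered aggregate: on the versions discussed above, the sort is done
-- inside the aggregate.
EXPLAIN SELECT array_agg(c ORDER BY id) FROM t;

-- Sorted subquery feeding a plain aggregate: the subquery is planned like
-- any other ORDER BY query, so an index on id can be considered.
EXPLAIN SELECT array_agg(c) FROM (SELECT c FROM t ORDER BY id) AS sub;
```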

Licensed under: CC-BY-SA with attribution
