Question

Postgres 9.3

In the following example query, why does the HashAggregate process 10 million rows (in 5s) instead of stopping as soon as it has collected 1 row as specified by the limit (which should take less than 1ms)?

I run into this with many limited queries: the HashAggregate makes them take as long as the unlimited query, which makes the limit practically useless.

Is there a reason why it cannot stop after having collected n rows?

Create some test data:

create table foo (x integer);
insert into foo (x) (select * from generate_series(1, 10000000));

Run the query:

explain analyze
select x from foo group by x limit 1;

or with distinct instead of group by (results in the same query plan!):

explain analyze
select distinct x from foo limit 1;

http://explain.depesz.com/s/arPX

 Limit  (cost=176992.00..176992.01 rows=1 width=4) (actual time=5185.125..5185.125 rows=1 loops=1)
   ->  HashAggregate  (cost=176992.00..176994.00 rows=200 width=4) (actual time=5185.124..5185.124 rows=1 loops=1)
         ->  Seq Scan on foo  (cost=0.00..150443.20 rows=10619520 width=4) (actual time=0.018..949.926 rows=10000000 loops=1)
 Total runtime: 5244.966 ms

Solution

In a query with an "order by", "distinct", or aggregate function, the entire result set has to be gathered, sorted, and aggregated before the limit can be applied, regardless of the limit value. You can rewrite the query in a number of ways to get the same result faster; however, I'd need to see a more lifelike query, as the example isn't very representative of an actual use case.
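For instance, with the test table above and limit 1, the aggregation can simply be dropped, since any single row of foo yields a valid answer (a minimal sketch for this exact case, not a general rewrite):

select x from foo limit 1;  -- Limit over a Seq Scan; stops after the first row

With no grouping step, the plan becomes a plain Limit over a Seq Scan and returns in well under a millisecond.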

When considering your example, think about how the DB would determine which result to show for limit 1: it has to perform some kind of sort or aggregation first. I'm assuming your actual queries use a limit greater than 1, but if they really do use limit 1, then there are many ways to rewrite them to take advantage of the small number of rows requested, as in the sketch below.
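For a limit greater than 1, one common rewrite (a sketch, assuming you add an index on foo(x); this is the recursive-CTE "loose index scan" pattern described on the PostgreSQL wiki) walks the index one distinct value at a time instead of aggregating the whole table:

create index on foo (x);

with recursive distinct_x as (
    (select x from foo order by x limit 1)           -- smallest x
    union all
    select (select f.x
            from foo f
            where f.x > d.x                          -- jump to the next distinct value
            order by f.x
            limit 1)
    from distinct_x d
    where d.x is not null                            -- stop when the index is exhausted
)
select x
from distinct_x
where x is not null
limit 5;  -- hypothetical n = 5; recursion stops once the outer limit is satisfied

Each iteration does a single index probe for the next distinct value, so fetching the first n distinct values touches only n small index ranges rather than hashing all 10 million rows.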

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow