Question

I have four tables:

create table web_content_3 ( content integer, hits bigint, bytes bigint, appid varchar(32)  );
create table web_content_4 ( content character varying (128 ), hits bigint, bytes bigint, appid varchar(32)  );
create table web_content_5 ( content character varying (128 ), hits bigint, bytes bigint, appid integer );
create table web_content_6 ( content integer, hits bigint, bytes bigint, appid integer );

I am running the same GROUP BY query against approximately 2 million records in each table:

SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_{3,4,5,6} GROUP BY content, appid;

The results are:

Table Name    | Content   | appid     | Time Taken [in ms]
===========================================================
web_content_3 | integer   | character | 27277.931
web_content_4 | character | character | 151219.388
web_content_5 | character | integer   | 127252.023
web_content_6 | integer   | integer   | 5412.096

Here the web_content_6 query takes only about 5 seconds, compared to the other three combinations. From these statistics we can say that the integer/integer combination is much faster for GROUP BY, but the question is: WHY?

I also have the EXPLAIN results, but they do not explain the drastic difference between the web_content_4 and web_content_6 queries. Here they are:

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_4 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=482173.36..507552.31 rows=17680 width=63) (actual time=138099.612..151565.655 rows=17680 loops=1)
   ->  Sort  (cost=482173.36..487196.11 rows=2009100 width=63) (actual time=138099.202..149256.707 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 152488kB
         ->  Seq Scan on web_content_4  (cost=0.00..45218.00 rows=2009100 width=63) (actual time=0.010..349.144 rows=2009100 loops=1)
 Total runtime: 151613.569 ms
(6 rows)

Time: 151614.106 ms

test=# EXPLAIN ANALYSE SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid from web_content_6 GROUP BY content,appid;
                                                              QUERY PLAN                                                              
--------------------------------------------------------------------------------------------------------------------------------------
 GroupAggregate  (cost=368814.36..394194.51 rows=17760 width=24) (actual time=3282.333..5840.953 rows=17760 loops=1)
   ->  Sort  (cost=368814.36..373837.11 rows=2009100 width=24) (actual time=3282.176..3946.025 rows=2009100 loops=1)
         Sort Key: content, appid
         Sort Method:  external merge  Disk: 74632kB
         ->  Seq Scan on web_content_6  (cost=0.00..34864.00 rows=2009100 width=24) (actual time=0.011..297.235 rows=2009100 loops=1)
 Total runtime: 6172.960 ms

Solution

Gordon Linoff is right, of course. Spilling over to disk is expensive.

If you can spare the memory, you can tell PostgreSQL to use more for sorting and such. I built a table, populated it with random data, and analyzed it before running this query.
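A setup roughly along these lines is enough to reproduce the scenario in a fresh test database; the row count and the random value ranges below are assumptions for illustration, not the exact script used:

create table web_content_4 ( content character varying(128), hits bigint, bytes bigint, appid varchar(32) );

-- roughly 2 million rows of random-looking data (placeholder value ranges)
insert into web_content_4 (content, hits, bytes, appid)
select md5(random()::text),
       (random() * 1000)::bigint,
       (random() * 1000000)::bigint,
       'app_' || (random() * 20)::int
from generate_series(1, 2000000);

-- refresh planner statistics before timing the query
analyze web_content_4;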

EXPLAIN ANALYSE 
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
from web_content_4 
GROUP BY content,appid;

"GroupAggregate  (cost=364323.43..398360.86 rows=903791 width=96) (actual time=25059.086..29789.234 rows=1998067 loops=1)"
"  ->  Sort  (cost=364323.43..369323.34 rows=1999961 width=96) (actual time=25057.540..27907.143 rows=2000000 loops=1)"
"        Sort Key: content, appid"
"        Sort Method: external merge  Disk: 216016kB"
"        ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.010..475.187 rows=2000000 loops=1)"
"Total runtime: 30012.427 ms"

I get the same execution plan you did. In my case, this query does an external merge sort that requires about 216MB of disk. I can tell PostgreSQL to allow more memory for this query by setting the value of work_mem. (Setting work_mem this way affects only my current connection.)

set work_mem = '250MB';
EXPLAIN ANALYSE 
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid 
from web_content_4 
GROUP BY content,appid;

"HashAggregate  (cost=72472.22..81510.13 rows=903791 width=96) (actual time=3196.777..4505.290 rows=1998067 loops=1)"
"  ->  Seq Scan on web_content_4  (cost=0.00..52472.61 rows=1999961 width=96) (actual time=0.019..437.252 rows=2000000 loops=1)"
"Total runtime: 4726.401 ms"

Now PostgreSQL is using a hash aggregate, and execution time dropped by a factor of six, from 30 seconds to 5 seconds.
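Since the setting was made at the session level, it is worth scoping it carefully; a sketch of the usual options in plain SQL:

-- revert this session to the server default
RESET work_mem;

-- or confine the higher setting to a single transaction
BEGIN;
SET LOCAL work_mem = '250MB';
SELECT content, sum(hits) as hits, sum(bytes) as bytes, appid
from web_content_4
GROUP BY content, appid;
COMMIT;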


I didn't test web_content_6, because replacing text with integers will usually require a couple of joins to recover the text. So I'm not sure we'd be comparing apples to apples there.
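For reference, the lookup pattern being alluded to would look something like this; content_lookup and its columns are hypothetical names invented for illustration, not part of the original schema:

create table content_lookup ( content_id integer primary key, content_text varchar(128) );

SELECT l.content_text, t.appid, t.hits, t.bytes
FROM ( SELECT content, sum(hits) AS hits, sum(bytes) AS bytes, appid
       FROM web_content_6
       GROUP BY content, appid ) t
JOIN content_lookup l ON l.content_id = t.content;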

OTHER TIPS

The performance of this aggregation is going to be driven by the speed of the sort. All other things being equal, larger data takes longer to sort than smaller data. The "fast" case sorts 74MB on disk; the "slow" case, 152MB.
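The width difference is easy to check directly; pg_column_size reports a fixed 4 bytes for an integer versus a header plus the string bytes for a varchar (the exact varchar figure depends on the value stored):

SELECT pg_column_size(123456::integer)        AS integer_size,
       pg_column_size('123456'::varchar(128)) AS varchar_size;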

The size difference would account for some of the gap in performance, but in most cases not a 30x difference. The one case where you would see such a drastic difference is when the smaller dataset fits into memory and the larger one does not. Spilling over to disk is expensive.

One suspicion is that the data in web_content_6 is already sorted, or almost sorted, on (content, appid). This might shorten the time needed for the sort. If you compare the actual time and the "cost" for each of the two sorts, you'll see that the "fast" version runs relatively much faster than expected (assuming the costs are comparable).
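One way to check that suspicion is to look at the planner's correlation statistic for the grouping columns; a value near 1 (or -1) means the physical row order closely tracks the column's sort order:

SELECT tablename, attname, correlation
FROM pg_stats
WHERE tablename IN ('web_content_4', 'web_content_6')
  AND attname IN ('content', 'appid');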

Licensed under: CC-BY-SA with attribution