Question

I have 2 tables: mailings and form_responses. The form_responses table has a foreign key with the mailings table, so a mailing has one-to-many form_responses. What I want to do is calculate mailing stats: per mailing get total form_responses and when the last form_response was received. I want to return the first 20 mailings that have last received a form_response. But when it has to query 50k+ rows, the query gets really slow.

mailings (~100k rows)

    Column    |            Type             |                     Modifiers                     
--------------+-----------------------------+---------------------------------------------------
 id           | integer                     | pkey
 project_id   | integer                     | not null
 deleted      | boolean                     |
 active       | boolean                     |
 created_at   | timestamp with time zone    | 

Indexes:
 "mailings_pkey" PRIMARY KEY, btree (id)
 "mailing_on_project_id_where_not_deleted_and_active" btree (project_id) (WHERE deleted IS NOT true AND active IS TRUE)
 + individual indexes on almost all columns

form_responses (~6m rows)

    Column    |            Type             |                     Modifiers                     
--------------+-----------------------------+---------------------------------------------------
 id           | integer                     | pkey
 project_id   | integer                     | not null
 mailing_id   | integer                     | 
 deleted      | boolean                     |
 is_unanswered| boolean                     |
 created_at   | timestamp with time zone    | 

Indexes:
 "form_responses_pkey" PRIMARY KEY, btree (id)
 "form_responses_on_pid_cr_time_where_not_del_not_is_unans" btree (project_id, created_time) (WHERE deleted IS NOT true AND is_uanswered IS NOT TRUE)
 + individual indexes on almost all columns

I want to get the mailing stats of a project (id=1), within a date range. I only want the stats of mailings that are not deleted and active and only count non deleted responses.

So this gives me the following query

EXPLAIN (ANALYZE, BUFFERS)
SELECT
  count(DISTINCT form_responses.id) AS total,
  max(form_responses.created_time) AS form_response_last_received,
  (SELECT mailings.name FROM mailings WHERE id = form_responses.mailing_id) AS name,
  form_responses.mailing_id AS mailing_id FROM form_responses
JOIN
  mailings ON mailings.id = form_responses.mailing_id
WHERE
  form_responses.deleted IS NOT true AND
  form_responses.is_unanswered IS NOT true AND
  form_responses.created_time >= '2017-08-02 00:00:00' AND
  form_responses.created_time <= '2018-01-29 00:00:00' AND
  form_responses.project_id = 1 AND
  mailings.deleted IS NOT true AND
  mailings.project_id = 1 AND
  mailings.active IS true
GROUP BY form_responses.mailing_id
ORDER BY form_response_last_received DESC
LIMIT 20 OFFSET 0

This gives me the following output:

Limit  (cost=308997.69..308997.74 rows=20 width=16) (actual time=7762.554..7762.565 rows=20 loops=1)
  Buffers: shared hit=74535 read=13513 dirtied=1
  ->  Sort  (cost=308997.69..308998.19 rows=197 width=16) (actual time=7762.552..7762.557 rows=20 loops=1)
        Sort Key: (max(form_responses.created_time))
        Sort Method: top-N heapsort  Memory: 26kB
        Buffers: shared hit=74535 read=13513 dirtied=1
        ->  GroupAggregate  (cost=307039.18..308992.45 rows=197 width=16) (actual time=7673.523..7762.278 rows=686 loops=1)
              Group Key: form_responses.mailing_id
              Buffers: shared hit=74535 read=13513 dirtied=1
              ->  Sort  (cost=307039.18..307111.58 rows=28961 width=16) (actual time=7673.277..7702.783 rows=137542 loops=1)
                    Sort Key: form_responses.mailing_id
                    Sort Method: quicksort  Memory: 12592kB
                    Buffers: shared hit=71790 read=13512 dirtied=1
                    ->  Hash Join  (cost=7376.42..304892.90 rows=28961 width=16) (actual time=44.576..7609.818 rows=137542 loops=1)
                          Hash Cond: (form_responses.mailing_id = mailings.id)
                          Buffers: shared hit=71790 read=13512 dirtied=1
                          ->  Bitmap Heap Scan on form_responses  (cost=3815.27..299862.55 rows=134811 width=16) (actual time=41.439..7499.305 rows=137543 loops=1)
                                Recheck Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone) AND (deleted IS NOT TRUE) AND (is_unanswered IS NOT TRUE))
                                Heap Blocks: exact=84014
                                Buffers: shared hit=71032 read=13512
                                ->  Bitmap Index Scan on form_responses_on_pid_cr_time_where_not_del_not_is_unans  (cost=0.00..3781.57 rows=134811 width=0) (actual time=23.892..23.892 rows=137543 loops=1)
                                      Index Cond: ((project_id = 1) AND (created_time >= '2017-08-02 00:00:00+00'::timestamp with time zone) AND (created_time <= '2018-01-29 00:00:00+00'::timestamp with time zone))
                                      Buffers: shared hit=530
                          ->  Hash  (cost=3286.82..3286.82 rows=21946 width=4) (actual time=3.093..3.093 rows=2866 loops=1)
                                Buckets: 4096  Batches: 1  Memory Usage: 101kB
                                Buffers: shared hit=758 dirtied=1
                                ->  Bitmap Heap Scan on mailings  (cost=874.50..3286.82 rows=21946 width=4) (actual time=0.403..2.145 rows=2866 loops=1)
                                      Recheck Cond: ((project_id = 1) AND (deleted IS NOT TRUE) AND (active IS TRUE))
                                      Heap Blocks: exact=732
                                      Buffers: shared hit=758 dirtied=1
                                      ->  Bitmap Index Scan on mailing_on_project_id_where_not_deleted_and_active  (cost=0.00..869.01 rows=21946 width=0) (actual time=0.315..0.315 rows=2866 loops=1)
                                            Index Cond: (project_id = 1)
                                            Buffers: shared hit=26
              SubPlan 1
                ->  Index Scan using mailings_pkey on mailings mailings_1  (cost=0.42..8.44 rows=1 width=26) (actual time=0.003..0.003 rows=1 loops=686)
                      Index Cond: (id = form_responses.mailing_id)
                      Buffers: shared hit=2745 read=1
                      Planning time: 0.865 ms
Execution time: 7764.951 ms

In this example it queries ~135k rows, this can go up to almost ~600k. As you can see 99% of the time is spend on the Bitmap heap scan. I run a db.t2.large RDS database at AWS with 100gb (SSD) of storage, which runs postgres 9.4.7

  • Is there anyway to avoid doing the bitmap heap scan?
  • Or a way to improve the speed of this?
  • Can I create a covering index, so it doesn't have to read the form responses table itself?
  • Should I cluster the form_responses table on project_id and/or mailing_id?
  • Should I upgrade my hardware?
  • Am I asking too much of my Database?
Was it helpful?

Solution

You can avoid the bitmap scan by set enable_bitmapscan to off. Then you can see what it chooses instead, and how that choice performs. This is generally an exploratory tool, it is usually not a good idea to set it this way globally in a production system. But you can set just in one session, so there is no problem with doing the test on a production system.

If you run the same query a second time with the same parameters, is it much faster? If so you can improve the performance by pre-warming the form_responses table into memory (for example, using the pg_prewarm extension), assuming that your memory is large enough to hold it all. You can also try increasing effective_io_concurrency, if RDS will let you.

If you make an index just like mailing_on_project_id_where_not_deleted_and_active but also including mailing_id and id at the end of the column list, then it could act like a covering index and you might get it to use an index-only scan. You would have to keep your table well-vacuumed for this to be maximally effective. The default level of vacuuming might not be aggressive enough, depending on the pattern in which your data is updated.

PostgreSQL clusters tables based on an index, not based on columns. Clustering on mailing_on_project_id_where_not_deleted_and_active would likely help this particular query.

Based on the answers and results of the above, you might want to upgrade your hardware. If form_responses doesn't fit in RAM, more RAM would help. If that is not feasible and it needs to be read from disk, then upgrading from HDD to SSD would help (unless RDS is already on SSD, in which case you shouldn't have a problem in the first place unless it is getting throttled).

This query might benefit in some parallel query implementations introduced in PostgreSQL 9.6 and improved in 10, but I don't know if RDS supports those.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top