Postgres 9.5 foreign table inheritance not using indexes

https://dba.stackexchange.com/questions/127535

29-09-2020
|

Question

In PostgreSQL 9.5.0 I have a partitioned table that collects data by months. I tried to use the new PostgreSQL feature of foreign table inheritance and pushed one month of data to another PostgreSQL server, so I got a foreign table. When I run my query from the primary server, the query takes 7x longer to execute than on the server where I have the foreign table. I am not passing a lot of data by network, my query looks like:

explain analyze
SELECT source, global_action, paid, organic, device, count(*) as count, sum(price) as sum
FROM "toys"
WHERE "toys"."container_id" = 857 AND (toys.created_at >= '2015-12-02 05:00:00.000000') AND
(toys.created_at <= '2015-12-30 04:59:59.999999') AND ("toys"."source" IS NOT NULL)
GROUP BY "toys"."source", "toys"."global_action", "toys"."paid", "toys"."organic", "toys"."device";

HashAggregate  (cost=1143634.94..1143649.10 rows=1133 width=15) (actual time=1556.894..1557.017 rows=372 loops=1)
   Group Key: toys.source, toys.global_action, toys.paid, toys.organic, toys.device
   ->  Append  (cost=0.00..1143585.38 rows=2832 width=15) (actual time=113.420..1507.373 rows=76593 loops=1)
         ->  Seq Scan on toys  (cost=0.00..0.00 rows=1 width=242) (actual time=0.001..0.001 rows=0 loops=1)
               Filter: ((source IS NOT NULL) AND (created_at >= '2015-12-02 05:00:00'::timestamp without time zone) AND (created_at <= '2015-12-30 04:59:59.999999'::timestamp without time zone) AND (container_id = 857))
         ->  Foreign Scan on toys_201512_new  (cost=100.00..1143585.38 rows=2831 width=15) (actual time=113.419..1488.445 rows=76593 loops=1)
 Planning time: 2.990 ms
 Execution time: 1560.131 ms

Does PostgreSQL use indexes on foreign tables? (I have indexes defined in the foreign table.) If I run the query directly on that server, it takes 200ms.

Here is parent table definition:

Table "public.toys"
id bigint
job_reference character varying(100)
container_id integer
user_token character varying(1000)
user_ip character varying(100)
user_zip character varying(10)
user_agent character varying(2000)
url_referrer character varying(2000)
page_url character varying(2000)
source character varying(100)
action integer
created_at timestamp without time zone
cpa numeric(9,3) not null default 0.0
duplicate boolean not null default false
fingerprint character varying(255)
email character varying(1000)
mobile_email_apply boolean
country character varying(255)
country_matched boolean
device integer
organic boolean
job_seeker_id character varying(255)
applicant_status integer
ats_applicant_status character varying(255)
ats_applicant_source character varying(255)
price numeric(9,4)
job_group_id integer
analytic_source character varying(255)
global_action integer
paid_organic integer
paid boolean
meta text
params character varying(2000)
analytic_associated_click_id bigint
external_id character varying(100)
associated_click_id bigint
cpc numeric(9,3)
Indexes:
    "job_stats_master_pkey1" PRIMARY KEY, btree (id)

Child table has check constraint:

"toys_201512_new_created_at_check" CHECK (
    created_at >= '2015-11-30 19:00:00'::timestamp without time zone AND
    created_at <  '2015-12-31 19:00:00'::timestamp without time zone)
Inherits: toys

And indexes on:

"toys_201512_new_analytic_source" btree (analytic_source)
"toys_201512_new_country" btree (country)
"toys_201512_new_created_at" btree (created_at)
"toys_201512_new_duplicate" btree (duplicate) WHERE duplicate = false
"toys_201512_new_container_id" btree (container_id)
"toys_201512_new_container_id_created_at" btree (container_id, created_at)
"toys_201512_new_fingerprint" btree (fingerprint)
"toys_201512_new_global_action" btree (global_action)
"toys_201512_new_id" btree (id)
"toys_201512_new_job_group_id" btree (job_group_id)
"toys_201512_new_job_reference" btree (job_reference)
"toys_201512_new_on_country_matched" btree (country_matched) WHERE country_matched = true
"toys_201512_new_on_cpa" btree (cpa) WHERE cpa <> 0::numeric
"toys_201512_new_on_duplicate_and_country_matched" btree (duplicate, country_matched) WHERE duplicate = false AND country_matched = true
"toys_201512_new_on_mobile_email_apply" btree (mobile_email_apply) WHERE mobile_email_apply = true
"toys_201512_new_source" btree (source)
"toys_201512_new_user_ip_user_agent" btree (user_ip, user_agent)
"toys_201512_new_user_token" btree (user_token)

Solution

Postgres can use indexes on the foreign server. But there are quite a few more obstacles than for local tables. Read the chapter Remote Query Optimization in the manual.

Comments in the current source code of postgres_fdw.c are revealing, too:

521  * [...] For a foreign
522  * table, we don't know what indexes are present on the remote side but
523  * want to speculate about which ones we'd like to use if they existed.
...
675      * need to be a bit cautious here.  It would sure be nice to have a local
676      * cache of information about remote index definitions...
...
722      * corresponds to SeqScan path of regular tables (though depending on what
723      * baserestrict conditions we were able to send to remote, there might
724      * actually be an indexscan happening there).

Index

This index of yours looks good for it:

"toys_201512_new_container_id_created_at" btree (container_id, created_at)

If you have many NULL values you might even make that a partial index by appending WHERE source IS NOT NULL, making the index look even better for the Postgres query planner.

Statistics for query planning

Make sure the query planner can work with valid statistics. The numbers in your EXPLAIN output show quite a mismatch:

Foreign Scan on toys_201512_new  (cost=100.00..1143585.38 rows=2831 width=15)
                           (actual time=113.419..1488.445 rows=76593 loops=1)

27x as many rows as Postgres expected were actually returned . The manual:

Running ANALYZE on the foreign table is the way to update the local statistics; this will perform a scan of the remote table and then calculate and store statistics just as though the table were local. Keeping local statistics can be a useful way to reduce per-query planning overhead for a remote table — but if the remote table is frequently updated, the local statistics will soon be obsolete.

Since accessing foreign tables is potentially expensive / delicate, this does not happen automatically. Foreign tables are not covered by autovacuum. The manual:

Foreign tables are analyzed only when explicitly selected.

If the remote table changes a lot, you might want to activate use_remote_estimate. The manual:

This option, which can be specified for a foreign table or a foreign server, controls whether postgres_fdw issues remote EXPLAIN commands to obtain cost estimates. A setting for a foreign table overrides any setting for its server, but only for that table. The default is false.

Finally, test to see what is actually sent to the foreign server:

The query that is actually sent to the remote server for execution can be examined using EXPLAIN VERBOSE.

Query

Your query decluttered and formatted, with one minor improvement:

SELECT source, global_action, paid, organic, device
     , count(*) AS count, sum(price) AS sum
FROM   toys
WHERE  container_id = 857
AND    created_at >= '2015-12-02 05:00:00'
AND    created_at <  '2015-12-30 05:00:00'
AND    created_at <= '2015-12-30 04:59:59.999999'
AND    source IS NOT NULL
GROUP  BY source, global_action, paid, organic, device;

Simpler, cleaner and also matches your CHECK constraint better and avoids possible corner case problems.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange