Question

I have two tables, one with around 1M entries and the other with around 4M. I want to join these tables in a materialized view, but creating the materialized view takes too long (~4 hours).

I don't see the indexes being used, which is expected as per this answer: when a join involves the whole tables, indexes don't help much.

So what is the workaround? My tables contain time-series data. Would the TimescaleDB extension be of any help? Any other suggestions to improve the performance?

EDIT

CREATE TABLE public.fco_assets
(
  id serial,
  symbol character varying(64) NOT NULL,
  currency integer NOT NULL,
  CONSTRAINT fco_assets_pkey1 PRIMARY KEY (id),
  CONSTRAINT fco_assets_symbol_key1 UNIQUE (symbol)
)

CREATE TABLE public.fco_prices
(
  aid integer NOT NULL,
  cid integer NOT NULL,
  "time" timestamp without time zone NOT NULL,
  close double precision,
  CONSTRAINT fco_prices_pkey PRIMARY KEY (aid, cid, "time")
)   

CREATE TABLE public.fco_chr_res
(
  pid integer NOT NULL,
  aid integer NOT NULL,
  cid integer NOT NULL,
  "time" timestamp without time zone NOT NULL,
  res double precision[] NOT NULL,
  CONSTRAINT fco_chr_res_pkey PRIMARY KEY (pid, aid, cid, "time")
)

There are around 8M rows in fco_chr_res. I have created various indexes on the prices and res tables besides the primary keys, which should themselves act as indexes.

 SELECT count(*)
 FROM fco_prices pr
 JOIN fco_assets a ON pr.aid = a.id AND pr.cid = a.currency
 LEFT JOIN fco_chr_res ch ON pr.aid = ch.aid AND pr.cid = ch.cid AND pr."time" = ch."time";

For the above query, here is the EXPLAIN ANALYZE report:

  1. Aggregate (rows=1 loops=1)
  2. Gather (rows=1 loops=1)
  3. Aggregate (rows=1 loops=1)
  4. Merge Left Join (rows=45636664 loops=1)
  5. Merge Inner Join (rows=21023282 loops=1)
  6. Index Only Scan using fco_prices_pkey on fco_prices as pr (rows=21023282 loops=1)
  7. Sort (rows=3428 loops=1)
  8. Seq Scan on fco_assets as a (rows=3428 loops=1)
  9. Index Only Scan using aid_cid_time on fco_chr_res as ch (rows=34245723 loops=1)

Is there any scope for improvement?

Solution

First, about the count query: the explain gives me the impression that the query takes time due to IO. The table fco_chr_res contains more than 34M rows, and one row in the index is around 20 bytes, which results in more than 600MB read from the index. This definitely will not fit in work_mem, which is 4MB by default. So it is worth paying attention to how good the disk IO is.
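To see whether IO is in fact the bottleneck, one way is to rerun the query with buffer statistics and, as a session-level experiment, a larger work_mem (the 256MB value below is purely illustrative, not a recommendation):

```sql
-- Experiment only: raise work_mem for this session (default is 4MB)
SET work_mem = '256MB';

-- BUFFERS shows how much data was read from disk vs. found in cache
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM fco_prices pr
JOIN fco_assets a ON pr.aid = a.id AND pr.cid = a.currency
LEFT JOIN fco_chr_res ch
  ON pr.aid = ch.aid AND pr.cid = ch.cid AND pr."time" = ch."time";
```

In the output, a large `read=` count under `Buffers`, or a `Sort Method: external merge` line, points at disk IO as the limiting factor.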

It was mentioned that this is timeseries data, i.e., it can be assumed that no data arrives with old timestamps. In that case, the count over old data can be pre-calculated and stored in an aggregation table. I don't think a MATERIALIZED VIEW can help here, since refreshing it to include newly arrived data will re-calculate the entire view. Continuous aggregates in TimescaleDB are designed to maintain time-bucketed aggregates over timeseries data; however, joins are not supported (which I discuss later).
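A minimal sketch of such an aggregation table, assuming counts are bucketed by day and only completed days are materialized (the table name fco_join_counts and the bucketing are hypothetical choices):

```sql
-- Hypothetical pre-aggregation table: one row per completed day
CREATE TABLE fco_join_counts (
  day       date   PRIMARY KEY,
  row_count bigint NOT NULL
);

-- Append counts for days not yet materialized; run on a schedule
INSERT INTO fco_join_counts (day, row_count)
SELECT date_trunc('day', pr."time")::date, count(*)
FROM fco_prices pr
JOIN fco_assets a ON pr.aid = a.id AND pr.cid = a.currency
LEFT JOIN fco_chr_res ch
  ON pr.aid = ch.aid AND pr.cid = ch.cid AND pr."time" = ch."time"
WHERE pr."time" < date_trunc('day', now())   -- only completed days
GROUP BY 1
ON CONFLICT (day) DO NOTHING;

-- Pre-calculated total over old data (cheap to read)
SELECT coalesce(sum(row_count), 0) FROM fco_join_counts;
```

A full total would add a live count over the current day on top of the stored sum; that live query only touches one day of data, so it stays fast.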

Another note: the explain plan shows that the row estimates assume there are no referential constraints between the tables, while the comments clarify that there are. Thus introducing foreign keys might help the query planner calculate better estimates.
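Assuming the data already satisfies them, the constraints might look like the following. Note that fco_prices references the composite (id, currency) in fco_assets, which requires a matching unique constraint on that pair first (constraint names here are illustrative):

```sql
-- A unique constraint on the referenced pair is needed before it
-- can be a foreign-key target
ALTER TABLE fco_assets
  ADD CONSTRAINT fco_assets_id_currency_key UNIQUE (id, currency);

ALTER TABLE fco_prices
  ADD CONSTRAINT fco_prices_asset_fkey
  FOREIGN KEY (aid, cid) REFERENCES fco_assets (id, currency);

-- fco_chr_res rows reference the price they were computed from
ALTER TABLE fco_chr_res
  ADD CONSTRAINT fco_chr_res_price_fkey
  FOREIGN KEY (aid, cid, "time") REFERENCES fco_prices (aid, cid, "time");
```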

As discussed in the comments, applying count is not the common case; instead, columns from all three tables will be returned by a query with the same joins. If the query projection includes columns outside the indexes, then the query plan will be different, and the amount of data processed may increase significantly. Note that the indexes are large and close in size to the actual tables, since they include almost all columns. You might want to reconsider them, but they will be needed for the foreign keys and for duplicate elimination.
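To judge how much the current indexes cost relative to the tables themselves, their sizes can be compared directly from the catalog:

```sql
-- Compare table vs. total index size to spot oversized or
-- redundant indexes
SELECT relname,
       pg_size_pretty(pg_table_size(oid))   AS table_size,
       pg_size_pretty(pg_indexes_size(oid)) AS index_size
FROM pg_class
WHERE relkind = 'r'
  AND relname IN ('fco_assets', 'fco_prices', 'fco_chr_res');
```

If index_size rivals table_size, each extra index is roughly doubling the write and IO cost for little read benefit.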

Such a query produces millions of rows, which is incomprehensible for humans. So the question is how the join result is used and analysed. In particular, will grouping, aggregation, and/or selections be applied to the join result?

Since the question explicitly asks about TimescaleDB, and I am affiliated with it, I expand my answer with TimescaleDB. If the data and its usage are timeseries related, i.e., data are inserted mainly in time order and queries aggregate on time or are limited to time intervals, then TimescaleDB can be a good fit: it can provide better performance and some timeseries-specific features. TimescaleDB optimises how data are stored, so it is efficient to insert recent data and to query data in a time interval. Performance can be further improved by using the community features: continuous aggregates and compression.

In particular, continuous aggregates materialise aggregation results continuously, adding results over new data on a schedule without recalculating already-materialised results unless needed. However, a continuous aggregate can only be defined on a single timeseries table (hypertable), and no joins can be used. The reason is that, in general, changes to one of the tables (e.g., fco_assets) can affect the materialised result, which would invalidate previous work. A possible solution is to create continuous aggregates on each timeseries table (hypertable) and then join the continuous aggregates in queries. To further investigate whether TimescaleDB is applicable, I suggest checking the TimescaleDB support Slack, where TimescaleDB users discuss such issues.
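As an illustration, a continuous aggregate on the prices table alone might look like this. This is a sketch using the TimescaleDB 2.x community syntax, and it assumes fco_prices is first converted to a hypertable; the view name, bucket width, and refresh intervals are arbitrary choices:

```sql
-- Convert the timeseries table to a hypertable, keeping existing rows
SELECT create_hypertable('fco_prices', 'time', migrate_data => true);

-- Daily row counts per asset, maintained incrementally
CREATE MATERIALIZED VIEW fco_prices_daily
WITH (timescaledb.continuous) AS
SELECT aid, cid,
       time_bucket('1 day', "time") AS day,
       count(*) AS n
FROM fco_prices
GROUP BY aid, cid, day;

-- Refresh policy: keep everything older than one hour up to date
SELECT add_continuous_aggregate_policy('fco_prices_daily',
  start_offset      => NULL,
  end_offset        => INTERVAL '1 hour',
  schedule_interval => INTERVAL '1 hour');
```

Queries can then join fco_prices_daily against fco_assets (or against a similar aggregate on fco_chr_res) instead of joining the raw multi-million-row tables.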

Summary of recommendations from the above:

  • Check disk IO and see if it can be improved: better disks, or a cloud machine with better disk IO;
  • Materialize the result of count over old data in an aggregation table; the automation will require some work;
  • Define foreign keys;
  • Reconsider indexes;
  • Understand better your application if it fits timeseries use cases supported by TimescaleDB.
Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange