How to speed up querying last values in a time series?

https://dba.stackexchange.com/questions/202248

28-12-2020

Question

I have a time series table prices in a PostgreSQL 10 DB.
Here is a simplified test case to illustrate the problem:

CREATE TABLE prices (
    currency text NOT NULL,
    side     boolean NOT NULL,
    price    numeric NOT NULL,
    ts       timestamptz NOT NULL
);

I want to quickly query the last values of each currency/side duo, as this would give me the current buy/sell price of each currency.

My current solution is:

create index on prices (currency, side, ts desc);

select distinct on (currency, side) *
  from prices
 order by currency, side, ts desc;

But this gives me very slow queries (~500 ms) on a table with only ~30k rows.

The actual table has four columns that I want to group by, instead of two. Here is what the actual table and query look like:

create table prices (
    exchange integer not null,
    pair text not null,
    side boolean not null,
    guaranteed_volume numeric not null,
    ts timestamp with time zone not null,
    price numeric not null,
    constraint prices_pkey primary key (exchange, pair, side, guaranteed_volume, ts),
    constraint prices_exchange_fkey foreign key (exchange)
        references exchanges (id) match simple
        on update no action
        on delete no action
);

create index prices_exchange_pair_side_guaranteed_volume_ts_idx
      on prices (exchange, pair, side, guaranteed_volume, ts desc);

create view last_prices as
select distinct on (exchange, pair, side, guaranteed_volume)
       exchange
     , pair
     , side
     , guaranteed_volume
     , price
     , ts
  from prices
 order by exchange
        , pair
        , side
        , guaranteed_volume
        , ts desc;

There are 34441 rows, currently. Some useful debug queries:

# explain (analyze,buffers) select * from last_prices;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2662.03..2997.71 rows=1224 width=37) (actual time=403.218..459.041 rows=392 loops=1)
   Buffers: shared hit=418
   ->  Sort  (cost=2662.03..2729.17 rows=26854 width=37) (actual time=403.213..411.041 rows=28353 loops=1)
         Sort Key: prices.exchange, prices.pair, prices.side, prices.guaranteed_volume, prices.ts DESC
         Sort Method: quicksort  Memory: 2984kB
         Buffers: shared hit=418
         ->  Seq Scan on prices  (cost=0.00..686.54 rows=26854 width=37) (actual time=0.022..31.407 rows=28353 loops=1)
               Buffers: shared hit=418
 Planning time: 0.911 ms
 Execution time: 460.190 ms

Explain analyze with seqscan disabled:

# explain (analyze,buffers) select * from last_prices;
                                                                                  QUERY PLAN                                                                                  
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=0.41..4458.07 rows=1224 width=37) (actual time=0.037..122.237 rows=392 loops=1)
   Buffers: shared hit=15182
   ->  Index Scan using prices_exchange_pair_side_guaranteed_volume_ts_idx on prices  (cost=0.41..4189.53 rows=26854 width=37) (actual time=0.034..91.237 rows=29649 loops=1)
         Buffers: shared hit=15182
 Planning time: 0.291 ms
 Execution time: 122.417 ms

The same query run directly, without going through the view:

# explain (analyze, buffers)
select distinct on (exchange, pair, side, guaranteed_volume)
       exchange
     , pair
     , side
     , guaranteed_volume
     , price
     , ts
  from prices
 order by exchange
        , pair
        , side
        , guaranteed_volume
        , ts desc;
                                                       QUERY PLAN                                                       
------------------------------------------------------------------------------------------------------------------------
 Unique  (cost=2163.56..2429.99 rows=1224 width=37) (actual time=364.716..391.405 rows=380 loops=1)
   Buffers: shared hit=418
   ->  Sort  (cost=2163.56..2216.85 rows=21314 width=37) (actual time=364.711..370.458 rows=24011 loops=1)
         Sort Key: exchange, pair, side, guaranteed_volume, ts DESC
         Sort Method: quicksort  Memory: 2644kB
         Buffers: shared hit=418
         ->  Seq Scan on prices  (cost=0.00..631.14 rows=21314 width=37) (actual time=0.025..13.751 rows=24011 loops=1)
               Buffers: shared hit=418
 Planning time: 0.258 ms
 Execution time: 392.110 ms

Solution

"I want to quickly query the last values of each currency/side duo"

DISTINCT ON excels when there are few rows per combination of interest. But your use case obviously has many rows per distinct (currency, side), so DISTINCT ON is a poor choice as far as performance is concerned. You'll find a detailed assessment and an arsenal of solutions in two related answers on SO.

If all you need is the latest timestamp ts, the column is both the sort criterion and the desired return value, and the case is very simple. See Evan's simple solution with max(ts).

(Well, ideally, you'd have an index on (currency, side, ts desc NULLS LAST), since max(ts) ignores NULL values and better matches this sort order. But that won't matter much with a column defined NOT NULL.)

Typically, you need additional columns from each selected row (like the current price!) and/or you need to sort by multiple columns, so you need to do more.

Ideally, you have another table listing all currencies - and a FK constraint to enforce referential integrity and disallow nonexistent currency values. Then use the query technique from chapter "2a. LATERAL join" in the linked answer, expanded to account for the added side:

Based on your initial simple test case:

SELECT c.currency, s.side, p.*
FROM   currency c
CROSS  JOIN (VALUES (true), (false)) s(side)  -- account for side
CROSS  JOIN LATERAL (
   SELECT ts, price              -- more columns?
   FROM   prices
   WHERE  currency = c.currency
   AND    side = s.side
   ORDER  BY ts DESC             -- ts is NOT NULL
   LIMIT  1
   ) p
ORDER  BY 1, 2;  -- optional, whatever you prefer;

You should see very fast index scans on an index on (currency, side, ts DESC).

If index-only scans are possible and you only need ts and price, it might pay to add price as the last column of the index.
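
A minimal sketch of such an index for the simple test case. The index name is made up for illustration. PostgreSQL 10 has no INCLUDE clause (that arrived in PostgreSQL 11), so price is simply appended as a trailing key column. Index-only scans also depend on the visibility map being reasonably current, hence the VACUUM:

```sql
-- Hypothetical index name; columns are those of the simple test case.
-- price comes last so the leading columns still match
-- WHERE currency = ... AND side = ... ORDER BY ts DESC.
CREATE INDEX prices_currency_side_ts_price_idx
    ON prices (currency, side, ts DESC, price);

-- Refresh the visibility map and statistics so the planner can
-- consider an index-only scan.
VACUUM ANALYZE prices;
```

Whether the planner actually picks an index-only scan is best checked with EXPLAIN (ANALYZE, BUFFERS) on the real data.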

Whether you save this query in a VIEW or not doesn't affect performance.
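
If there is no separate table listing the currencies, a recursive CTE can emulate a loose index scan and step through the index one (currency, side) combination at a time. The following is a sketch for the simple test case, not a tested drop-in. It assumes the index on (currency, side, ts DESC) exists and relies on false sorting before true for the boolean side:

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses required because of ORDER BY / LIMIT
   -- latest row of the first (currency, side) combination
   SELECT currency, side, ts, price
   FROM   prices
   ORDER  BY currency, side, ts DESC
   LIMIT  1
   )
   UNION ALL
   -- latest row of the next combination after the previous one
   SELECT p.*
   FROM   cte c
   CROSS  JOIN LATERAL (
      SELECT currency, side, ts, price
      FROM   prices
      WHERE  (currency, side) > (c.currency, c.side)  -- row-wise comparison
      ORDER  BY currency, side, ts DESC
      LIMIT  1
      ) p
   )
SELECT currency, side, ts, price
FROM   cte
ORDER  BY currency, side;
```

The recursion stops once the LATERAL subquery finds no further combination. For the actual table, the same pattern would need all four grouping columns in the row comparison and in ORDER BY.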

OTHER TIPS

If you have the index on (currency, side, ts desc), what's the time like for:

SELECT currency, side, max(ts)
FROM prices
GROUP BY currency, side;

It gets much faster, but then again, the reason that I used distinct on was to get the price value associated with the last ts. – ivarec 56 mins ago
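
One way to keep the max(ts) aggregate and still get the matching price is to join the result back to the table. This is only a sketch on the simple test case; the alias names are arbitrary:

```sql
SELECT p.currency, p.side, p.ts, p.price
FROM  (
   -- latest timestamp per combination
   SELECT currency, side, max(ts) AS ts
   FROM   prices
   GROUP  BY currency, side
   ) m
JOIN   prices p USING (currency, side, ts);  -- fetch the row holding that timestamp
```

If two rows share the same (currency, side, ts), both are returned. The simple test table has no constraint preventing that; in the actual table the primary key, which includes ts, rules it out.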

Licensed under: CC-BY-SA with attribution
