How to speed up querying last values in a time series?
Question
I have a time series table prices
in a PostgreSQL 10 DB.
Here is a simplified test case to illustrate the problem:
CREATE TABLE prices (
currency text NOT NULL,
side boolean NOT NULL,
price numeric NOT NULL,
ts timestamptz NOT NULL
);
I want to quickly query the last values of each currency
/side
duo, as this would give me the current buy/sell price of each currency.
My current solution is:
create index on prices (currency, side, ts desc);
select distinct on (currency, side) *
order by currency, side, ts desc;
But this will give me very slow queries (~500ms) in this table with only ~30k rows.
The actual table has four columns that I want to group, instead of two. Here is what the actual table and query really looks like:
create table prices (
exchange integer not null,
pair text not null,
side boolean not null,
guaranteed_volume numeric not null,
ts timestamp with time zone not null,
price numeric not null,
constraint prices_pkey primary key (exchange, pair, side, guaranteed_volume, ts),
constraint prices_exchange_fkey foreign key (exchange)
references exchanges (id) match simple
on update no action
on delete no action
);
create index prices_exchange_pair_side_guaranteed_volume_ts_idx
on prices (exchange, pair, side, guaranteed_volume, ts desc);
create view last_prices as
select distinct on (exchange, pair, side, guaranteed_volume)
exchange
, pair
, side
, guaranteed_volume
, price
, ts
from prices
order by exchange
, pair
, side
, guaranteed_volume
, ts desc;
There are 34441 rows, currently. Some useful debug queries:
# explain (analyze,buffers) select * from last_prices;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=2662.03..2997.71 rows=1224 width=37) (actual time=403.218..459.041 rows=392 loops=1)
Buffers: shared hit=418
-> Sort (cost=2662.03..2729.17 rows=26854 width=37) (actual time=403.213..411.041 rows=28353 loops=1)
Sort Key: prices.exchange, prices.pair, prices.side, prices.guaranteed_volume, prices.ts DESC
Sort Method: quicksort Memory: 2984kB
Buffers: shared hit=418
-> Seq Scan on prices (cost=0.00..686.54 rows=26854 width=37) (actual time=0.022..31.407 rows=28353 loops=1)
Buffers: shared hit=418
Planning time: 0.911 ms
Execution time: 460.190 ms
Explain analyze with seqscan disabled:
# explain (analyze,buffers) select * from last_prices;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique (cost=0.41..4458.07 rows=1224 width=37) (actual time=0.037..122.237 rows=392 loops=1)
Buffers: shared hit=15182
-> Index Scan using prices_exchange_pair_side_guaranteed_volume_ts_idx on prices (cost=0.41..4189.53 rows=26854 width=37) (actual time=0.034..91.237 rows=29649 loops=1)
Buffers: shared hit=15182
Planning time: 0.291 ms
Execution time: 122.417 ms
Adding a query with the view's query being accessed directly:
# explain (analyze, buffers)
select distinct on (exchange, pair, side, guaranteed_volume)
exchange
, pair
, side
, guaranteed_volume
, price
, ts
from prices
order by exchange
, pair
, side
, guaranteed_volume
, ts desc;
QUERY PLAN
------------------------------------------------------------------------------------------------------------------------
Unique (cost=2163.56..2429.99 rows=1224 width=37) (actual time=364.716..391.405 rows=380 loops=1)
Buffers: shared hit=418
-> Sort (cost=2163.56..2216.85 rows=21314 width=37) (actual time=364.711..370.458 rows=24011 loops=1)
Sort Key: exchange, pair, side, guaranteed_volume, ts DESC
Sort Method: quicksort Memory: 2644kB
Buffers: shared hit=418
-> Seq Scan on prices (cost=0.00..631.14 rows=21314 width=37) (actual time=0.025..13.751 rows=24011 loops=1)
Buffers: shared hit=418
Planning time: 0.258 ms
Execution time: 392.110 ms
Solution
I want to quickly query the last values of each
currency
/side
duo
DISTINCT ON
excels for few rows per combination of interest. But your use case obviously has many rows per distinct (currency, side)
. So DISTINCT ON
is a bad choice as far as performance is concerned. You'll find a detailed assessment and an arsenal of solutions in these two related answer on SO:
If all you need is the latest timestamp ts
, the column is sort criteria and desired return value in one and the case is very simple. Look to Evan's simple solution with max(ts)
.
(Well, ideally, you'd have an index on (currency, side, ts desc NULLS LAST)
, since max(ts)
ignores NULL values and better matches this sort order. But that won't matter much with a column defined NOT NULL
.)
Typically, you need additional columns from each selected row (like the current price!) and/or you need to sort by multiple columns, so you need to do more.
Ideally, you have another table listing all currencies - and a FK constraint to enforce referential integrity and disallow nonexistent currency values. Then use the query technique from chapter "2a. LATERAL join" in the linked answer, expanded to account for the added side
:
Based on your initial simple test case:
SELECT c.currency, s.side, p.*
FROM currency c
CROSS JOIN (VALUES (true), (false)) s(side) -- account for side
CROSS JOIN LATERAL (
SELECT ts, price -- more columns?
FROM prices
WHERE currency = c.currency
AND side = s.side
ORDER BY ts DESC -- ts is NOT NULL
LIMIT 1
) p
ORDER BY 1, 2; -- optional, whatever you prefer;
You should see very fast index scans on an index on (currency, side, ts DESC)
.
If index-only scans are possible and you only need ts
and price
it might pay to add price
as last column to the index.
dbfiddle here
Whether you save this query in a VIEW
or not doesn't affect performance.
OTHER TIPS
If you have the index on (currency, side, ts desc)
, what's the time like for:
SELECT currency, side, max(ts)
FROM prices
GROUP BY currency, side;
It gets much faster, but then again, the reason that I used distinct on was to get the price value associated with the last ts. – ivarec 56 mins ago