Question

Hi I am trying to optimise the a timestamp range contains <@ query for Postgres 12

I have done a bit of reading postgres documentation and discovered only GiST and SP-GiST indexes support this operator. However, I can't add one of these ( I tink I would need to add one to the heartrate table - see Schema below, but that is not a range type...).

My question is similar to this question and this one which also indicate I will need a GiST index. However, they are the other way around, eg the column they have a single timestamp and want to return from a table of tsranges all records that are contained. I have a table of timestamps and want to join it to a table of tsranges

For a bit of context for my schema, I have a collection of heartrates in the real dataset sampled ~1/3seconds, and a list of songs I have listen to, and when. I would like to query things like

  • avg(heartrate) for a particular track and artist
  • avg(heartrate) for a particular artist
  • etc.

Schema

create table heartrate (
    "time"  timestamp primary key ,
    value float
)
;

CREATE INDEX ON heartrate ("time", value);
-- CREATE INDEX ON heartrate USING GIST ("time", value); can't do as "time" is not a range column.
-- one gets the following error: 
--- ERROR: data type timestamp without time zone has no default operator class for access method "gist" Hint: You must specify an operator class for the index or define a default operator class for the data type.


create table song_play(
    track       TEXT      NOT NULL,
    artist      TEXT      NOT NULL,
    play        tsrange not null
)
;
CREATE INDEX ON song_play(track, artist);


INSERT INTO heartrate("time", value)
SELECT d, 60+60*random()
FROM   generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '5 min'::interval)  d
;

INSERT INTO song_play(track,artist, play)
SELECT case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end 
, case when random() > 0.5 then 'a' when random() > 0.5 then 'b' else 'c' end
, tsrange(d, d+ (((random()*3+1)::text|| 'min')::interval))
FROM   generate_series('2015-01-01 00:00:00'::timestamp, '2020-01-01 00:00:00'::timestamp, '1 day'::interval)  d
;

EXPLAIN  SELECT sp.track, sp.artist, avg(h.value) FROM song_play sp left join heartrate h ON h.time <@ sp.play where sp.track='a' and sp.artist='b' GROUP BY sp.track, sp.artist;

Which results in the following:

✓

✓

✓

✓

525889 rows affected

1827 rows affected

| QUERY PLAN                                                                                                 |
| :--------------------------------------------------------------------------------------------------------- |
| GroupAggregate  (cost=0.28..14689.24 rows=1 width=72)                                                      |
|   Group Key: sp.track, sp.artist                                                                           |
|   ->  Nested Loop Left Join  (cost=0.28..14685.28 rows=526 width=72)                                       |
|         Join Filter: (h."time" <@ sp.play)                                                                 |
|         ->  Index Scan using song_play_track_artist_idx on song_play sp  (cost=0.28..8.29 rows=1 width=96) |
|               Index Cond: ((track = 'a'::text) AND (artist = 'b'::text))                                   |
|         ->  Seq Scan on heartrate h  (cost=0.00..8102.55 rows=525955 width=16)                             |

Note: the above plan results in A Full seq scan of the heartrate table, the largest table - not ideal at all!

I then decided to create the following function to see if it would help speed up queries. It converts a range eg tsrange('2020-01-01 00:00:00', '2020-01-02 00:00:00') to a conditional query eg field >= 2020-01-01 00:00:00 and field < '2020-01-02 00:00:00' .

essentially the same as the <@ contains operator.

And it seems to work! Although this is only helpful for looking up a particular song_play's heartrate... not all of a track / artist's song_play's heartrates

CREATE OR REPLACE FUNCTION range_to_conditional(range anyrange, field text)
    RETURNS text
    LANGUAGE SQL
    IMMUTABLE STRICT AS
$$
SELECT case
           when isempty(range) then 'false'
           when upper_inf(range) and lower_inf(range) then 'true'
           when upper_inf(range) then case
                                          when lower_inc(range) then format(' %L <= %I ', lower(range), field)
                                          else format(' %L < %I ', lower(range), field)
               end
           when lower_inf(range) then case
                                          when upper_inc(range) then format(' %L >= %I ', upper(range), field)
                                          else format(' %L > %I ', upper(range), field)
               end
           else
               case
                   when lower_inc(range) and upper_inc(range)
                       then format(' %1$L <= %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
                   when lower_inc(range)
                       then format(' %1$L <= %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
                   when upper_inc(range)
                       then format(' %1$L < %3$I AND %2$L >= %3$I ', lower(range), upper(range), field)
                   else format(' %1$L < %3$I AND %2$L > %3$I ', lower(range), upper(range), field)
                   end
           end
$$


;


create function avg_heartrate(sp song_play)
returns double precision as $$
DECLARE
    retval double precision ;
BEGIN
    EXECUTE format('select avg(h.value) from heartrate h where %s',  range_to_conditional(sp.play, 'time'))
    INTO STRICT retval;
    RETURN retval;
END
$$
    LANGUAGE plpgsql stable;

SELECT sp.track, sp.artist, sp.play, avg_heartrate(sp) from song_play sp where  sp.track='a' and sp.artist='b' limit 10;
✓

✓

track | artist | play                                                 | avg_heartrate     
:---- | :----- | :--------------------------------------------------- | :-----------------
a     | b      | ["2015-01-03 00:00:00","2015-01-03 00:03:42.413608") | 78.93074469582096 
a     | b      | ["2015-01-10 00:00:00","2015-01-10 00:01:32.299356") | 83.89127804586359 
a     | b      | ["2015-01-11 00:00:00","2015-01-11 00:03:24.722083") | 62.333722293527885
a     | b      | ["2015-01-19 00:00:00","2015-01-19 00:01:14.845757") | 77.65872734128969 
a     | b      | ["2015-01-30 00:00:00","2015-01-30 00:01:40.991165") | 102.88233680407437
a     | b      | ["2015-02-06 00:00:00","2015-02-06 00:03:51.264716") | 70.34797302970127 
a     | b      | ["2015-02-13 00:00:00","2015-02-13 00:01:23.358657") | 62.91734005187344 
a     | b      | ["2015-02-25 00:00:00","2015-02-25 00:02:04.856602") | 115.45533419257616
a     | b      | ["2015-02-28 00:00:00","2015-02-28 00:02:46.800728") | 117.39846990343175
a     | b      | ["2015-03-18 00:00:00","2015-03-18 00:02:54.893186") | 68.1618921408235  

db<>fiddle here

Thanks!

Was it helpful?

Solution

Change the join condition from

ON h.time <@ sp.play

to

ON h.time >= lower(sp.play) AND h.time < upper(sp.play)

(if your ranges are open at the right end, else use different inequality operators).

Then the nested loop join can use a regular b-tree index on heartrate(time) to speed up the inner query.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top