What is the meaning of `(ORDER BY x RANGE BETWEEN n PRECEDING…)` if x is a date?
02-10-2020
Question
In another thread, the OP wanted a sliding average for the last 365 days. Using ROWS BETWEEN ... would be fine if it were guaranteed that there was exactly one occurrence per day, but that is not the case here. RANGE BETWEEN ... seems like a good fit, but it is not clear to me what it means in DB2. I am not sure whether it matters that DB2 does not have an INTERVAL type, but mimics it with labeled durations.
The documentation says: (https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0023461.html)
unsigned-constant PRECEDING
Specifies either the range or number of rows preceding the current row. If ROWS is specified, then unsigned-constant must be zero or a positive integer indicating a number of rows. If RANGE is specified, then the data type of unsigned-constant must be comparable to the type of the sort-key-expression of the window-order-clause. There can only be one sort-key-expression, and the data type of the sort-key-expression must allow subtraction. This clause cannot be specified in group-bound2 if group-bound1 is CURRENT ROW or unsigned-constant FOLLOWING.
unsigned-constant FOLLOWING
Specifies either the range or number of rows following the current row. If ROWS is specified, then unsigned-constant must be zero or a positive integer indicating a number of rows. If RANGE is specified, then the data type of unsigned-constant must be comparable to the type of the sort-key-expression of the window-order-clause. There can only be one sort-key-expression, and the data type of the sort-key-expression must allow addition.
DB2 allows constructions like:
values current_date - 1
The default unit for a date is the day, so this means:
values current_date - 1 day
Given this I would expect this example to work:
create table test
( d date not null
, x decimal(3,0) not null);
insert into test (d,x)
values ('2016-01-01',10),('2016-01-07',20),('2016-01-12',30);
I would expect the query:
select d, avg(x) over (order by d
range between 30 preceding
and current row)
from test
order by d;
to return:
2016-01-01-00.00.00 10
2016-01-07-00.00.00 15
2016-01-12-00.00.00 20
or possibly generate an error, but instead the result is:
2016-01-01-00.00.00 10
2016-01-07-00.00.00 20
2016-01-12-00.00.00 25
I also tried adding day to the query:
select d, avg(x) over (order by d
range between 30 days preceding
and:
select d, avg(x) over (order by d
range between cast(30 as day) preceding
just in case, but both of these attempts result in errors:
SQL0104N An unexpected token "day" was found following "y ...
SQL0104N An unexpected token "cast(30 as day)" was found following "r ...
My first suspicion was that the unit is something smaller than a day, but increasing the preceding part to 300 returns the same result. Perhaps even more surprising is that increasing it to 500 changes the result:
select d, avg(x) over (order by d
range between 500 preceding
and current row)
from test
order by d
2016-01-01-00.00.00 10
2016-01-07-00.00.00 20
2016-01-12-00.00.00 30
Given the query:
select d, avg(x) over (order by d
range between n preceding
and current row)
For 300 <= n <= 399 the result for the last row is 25; for n < 300 or n > 399 the result for the last row is 30.
I cannot figure out which rows are seen by the avg function. My best guess is that there is some implicit cast of the date to something else in the framing clause, but I have no idea how to prove or disprove this assumption. Can someone shed some light on this?
Solution
Firstly, the expression values current_date - 1 would only be valid if Oracle compatibility mode were in effect; it mimics Oracle's datetime arithmetic, where the default interval is expressed in (potentially fractional) days.
I think that, regardless of Oracle compatibility, the range bounds should be comparable as integers, and comparing DATE values with integers might produce unexpected results. If you convert your DATEs to the number of days since some past moment, you can use integer comparisons. You could use JULIAN_DAY(), for example:
select d, avg(x) over (order by julian_day(d)
range between 30 preceding
and current row)
from test
order by d
which produces the result you expect:
D 2
---------- ---------------------------------
01/01/2016 10.0000000000000000000000000000
01/07/2016 15.0000000000000000000000000000
01/12/2016 20.0000000000000000000000000000
3 record(s) selected.
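The same idea can be applied to the 365-day sliding average that motivated the question. The following is only a sketch: it uses the DAYS() scalar function instead of JULIAN_DAY() (both return an integer day number, so either should serve here), and it assumes that "the last 365 days" means the current day plus the 364 days before it. The alias avg_365 is made up for illustration.

```sql
-- Sketch: sliding average over a 365-day window ending on the current row's date.
-- days(d) turns the DATE into an integer day count, so the RANGE bound
-- is compared as a plain integer number of days.
-- Assumption: the window is the current day plus the 364 preceding days.
select d,
       avg(x) over (order by days(d)
                    range between 364 preceding
                          and current row) as avg_365
from test
order by d;
```

Because RANGE works on the sort-key value rather than on row positions, several rows on the same day all fall into the same frame, and days with no rows are simply skipped.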
In the first fix pack of 10.5 it was allowed to use RANGE over dates, but the results were unpredictable. In recent fix packs this is no longer allowed, so much of the confusion in the question could have been avoided by using a recent fix pack.
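If you would rather not depend on how RANGE treats the sort key at all, a correlated subquery with ordinary date arithmetic is one possible alternative. This is a sketch against the test table from the question, not something taken from the documentation; the aliases t, u and avg_x are only illustrative.

```sql
-- Sketch: 30-day sliding average without a window frame.
-- For each row t, average x over all rows u whose date lies in
-- the closed interval [t.d - 30 days, t.d], using a labeled duration.
select t.d,
       (select avg(u.x)
        from test u
        where u.d between t.d - 30 days and t.d) as avg_x
from test t
order by t.d;
```

For the three sample rows this should give the averages 10, 15 and 20 that the question expected. The trade-off is that the subquery is evaluated per row, so on a large table it may be slower than the window-function form.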