What is the meaning of `(ORDER BY x RANGE BETWEEN n PRECEDING…)` if x is a date?

https://dba.stackexchange.com/questions/141263

02-10-2020

Question

In another thread:

https://stackoverflow.com/questions/37759659/db2-query-to-find-average-sale-for-each-item-1-year-previous

the OP wanted a sliding average over the last 365 days. Using ROWS BETWEEN ... would be fine if it were guaranteed that there were exactly one occurrence per day, but that is not the case here. RANGE BETWEEN ... seems like a good fit, but it is not clear to me what it means in DB2. I am not sure whether it matters that DB2 does not have an INTERVAL type but mimics it with labeled durations.
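To make the ROWS versus RANGE distinction concrete, here is a minimal Python sketch of the two framing rules. It is not DB2 code; the `sales` rows and the helper names `rows_frame` and `range_frame` are hypothetical and exist only for illustration. It assumes the sort key is an integer day number, so that "n preceding" means n days.

```python
from datetime import date

# Hypothetical sales rows (sale_date, amount); not the data from the question.
sales = [
    (date(2016, 1, 1), 10),
    (date(2016, 1, 1), 40),  # two sales on the same day
    (date(2016, 1, 7), 20),
    (date(2016, 3, 1), 30),  # long gap before this sale
]

def rows_frame(rows, i, n):
    """ROWS BETWEEN n PRECEDING AND CURRENT ROW: counts physical rows."""
    return rows[max(0, i - n): i + 1]

def range_frame(rows, i, n):
    """RANGE BETWEEN n PRECEDING AND CURRENT ROW over a day number:
    keeps rows whose sort key lies in [key - n, key], peers included."""
    key = rows[i][0].toordinal()
    return [r for r in rows if key - n <= r[0].toordinal() <= key]

last = len(sales) - 1
rows_amounts = [amount for _, amount in rows_frame(sales, last, 2)]
range_amounts = [amount for _, amount in range_frame(sales, last, 30)]
print(rows_amounts)   # amounts of the last three physical rows
print(range_amounts)  # amounts of rows within 30 days of the last sale
```

In this sketch the ROWS frame reaches back across the gap to January because it only counts rows, while the RANGE frame stops at 30 days regardless of how many rows fall inside it.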

The documentation says: (https://www.ibm.com/support/knowledgecenter/SSEPGG_10.5.0/com.ibm.db2.luw.sql.ref.doc/doc/r0023461.html)

unsigned-constant PRECEDING

Specifies either the range or number of rows preceding the current row. If ROWS is specified, then unsigned-constant must be zero or a positive integer indicating a number of rows. If RANGE is specified, then the data type of unsigned-constant must be comparable to the type of the sort-key-expression of the window-order-clause. There can only be one sort-key-expression, and the data type of the sort-key-expression must allow subtraction. This clause cannot be specified in group-bound2 if group-bound1 is CURRENT ROW or unsigned-constant FOLLOWING.

unsigned-constant FOLLOWING

Specifies either the range or number of rows following the current row. If ROWS is specified, then unsigned-constant must be zero or a positive integer indicating a number of rows. If RANGE is specified, then the data type of unsigned-constant must be comparable to the type of the sort-key-expression of the window-order-clause. There can only be one sort-key-expression, and the data type of the sort-key-expression must allow addition.

DB2 allows constructions like:

values current_date - 1

The default unit for a date is the day, so this means:

values current_date - 1 day

Given this I would expect this example to work:

create table test 
( d date not null
, x decimal(3,0) not null);

insert into test (d,x) 
values ('2016-01-01',10),('2016-01-07',20),('2016-01-12',30);

I would expect the query:

select d, avg(x) over (order by d 
                       range between 30 preceding 
                                 and current row) 
from test 
order by d;

to return:

2016-01-01-00.00.00  10
2016-01-07-00.00.00  15
2016-01-12-00.00.00  20

or possibly generate an error, but instead the result is:

2016-01-01-00.00.00  10
2016-01-07-00.00.00  20
2016-01-12-00.00.00  25

I also tried adding day to the query:

select d, avg(x) over (order by d 
                       range between 30 days preceding

select d, avg(x) over (order by d 
                       range between cast(30 as day) preceding

just in case, but both attempts result in:

 SQL0104N  An unexpected token "day" was found following "y ...
 SQL0104N  An unexpected token "cast(30 as day)" was found following "r ...

My first suspicion was that the unit is something smaller than a day, but increasing the preceding value to 300 returns the same result. Perhaps even more surprising, increasing it to 500 changes the result to:

 select d, avg(x) over (order by d 
                        range between 500 preceding 
                                  and current row) 
 from test 
 order by d

2016-01-01-00.00.00  10
2016-01-07-00.00.00  20
2016-01-12-00.00.00  30

Given the query:

 select d, avg(x) over (order by d 
                        range between n preceding 
                                  and current row)

for 300 <= n <= 399 the result for the last row is 25; for n < 300 or n > 399 the result for the last row is 30.

I cannot figure out which rows are seen by the avg function. My best guess is that there is some implicit cast of the date to something else in the framing clause, but I have no idea how to prove or disprove this assumption. Can someone shed some light on this?

Solution

Firstly, the expression values current_date - 1 would only be valid if Oracle compatibility mode were in effect -- it mimics Oracle's datetime arithmetic, where the default interval is expressed in (potentially fractional) days.

I think that regardless of Oracle compatibility, the range bounds should be comparable as integers, and comparing DATE values with integers might produce unexpected results. If you convert your DATEs to the number of days since some past moment you can use integer comparisons. You could use JULIAN_DAY(), for example:

select d, avg(x) over (order by julian_day(d) 
                   range between 30 preceding 
                             and current row) 
from test 
order by d

which produces the result you expect:

D          2                                
---------- ---------------------------------
01/01/2016   10.0000000000000000000000000000
01/07/2016   15.0000000000000000000000000000
01/12/2016   20.0000000000000000000000000000

  3 record(s) selected.
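As a cross-check of which rows each frame should contain, the following Python sketch emulates the query above on the three rows from the question. It is an illustration only, not DB2 code: `sliding_avg` is a hypothetical helper, and `date.toordinal()` is used as a stand-in for JULIAN_DAY(d) on the assumption that all that matters is an integer that increases by one per day.

```python
from datetime import date
from decimal import Decimal

# The three rows inserted into the test table in the question.
test = [
    (date(2016, 1, 1), Decimal(10)),
    (date(2016, 1, 7), Decimal(20)),
    (date(2016, 1, 12), Decimal(30)),
]

def sliding_avg(rows, n):
    """Average x over the rows whose day number lies within n days
    before (or on) the current row's day number."""
    result = []
    for d, _ in rows:
        key = d.toordinal()  # integer day number, stand-in for JULIAN_DAY(d)
        frame = [x for other, x in rows if key - n <= other.toordinal() <= key]
        result.append((d.isoformat(), sum(frame) / len(frame)))
    return result

for d, avg in sliding_avg(test, 30):
    print(d, avg)
```

With n = 30 every earlier row falls inside the frame, so the averages come out as 10, 15 and 20, in line with the output above.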

In the first fixpak of 10.5 it was allowed to use RANGE over dates, but the results were unpredictable. In recent fixpaks this is no longer allowed, so much of the confusion in the question could have been avoided by using a recent fixpak.

Licensed under: CC-BY-SA with attribution
