Oracle: Find previous record for a ranked list of forecasts
-
19-09-2019 - |
Question
Hi I am faced with a difficult problem:
I have a table (oracle 9i) of weather forecasts (many 100's of millions of records in size.) whose makeup looks like this:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk) integer (pk) datetime (pk) integer
where:
stationid
refers to one of the many weather stations that may create a forecast;forecastdate
refers to the date the forecast is for (date only not time.)forecastinterval
refers to the hour in theforecastdate
for the forecast (0 - 23).forecastcreated
refers to the time the forecast was made, can be many days beforehand.forecastvalue
refers to the actual value of the forecast (as the name implies.)
I need to determine for a given stationid
and a given forecastdate
and forecastinterval
pair, the records where a forecastvalue
increments more than a nominal number (say 500). I'll show a table of the condition here:
stationid forecastdate forecastinterval forecastcreated forecastvalue
---------------------------------------------------------------------------------
'stationa' 13-dec-09 10 10-dec-09 04:50:10 0
'stationa' 13-dec-09 10 10-dec-09 17:06:13 0
'stationa' 13-dec-09 10 12-dec-09 05:20:50 300
'stationa' 13-dec-09 10 13-dec-09 09:20:50 300
In the above scenario, I'd like to pull out the third record. This is the record where the forecast value increased by a nominal (say 100) amount.
The task is proving to be very difficult due to the sheer size of the table (many 100s of millions of records.) and taking so long to finish (so long in fact that my query has never returned.)
Here is my attempt so far to grab these values:
select
wtr.stationid,
wtr.forecastcreated,
wtr.forecastvalue,
(wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
(select inner.*
rank() over (partition by stationid,
(inner.forecastdate + inner.forecastinterval),
inner.forecastcreated
order by stationid,
(inner.forecastdate + inner.forecastinterval) asc,
inner.forecastcreated asc
) rk
from weathertable inner) wtr
where
wtr.forecastvalue - 100 > (
select lastvalue
from (select y.*,
rank() over (partition by stationid,
(forecastdate + forecastinterval),
forecastcreated
order by stationid,
(forecastdate + forecastinterval) asc,
forecastcreated asc) rk
from weathertable y
) z
where z.stationid = wtr.stationid
and z.forecastdate = wtr.forecastdate
and (z.forecastinterval =
wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
and wtr.rk = z.rk + 1)
Solution
Rexem's suggestion of using LAG() is the right approach but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...
SQL> select * from t
2 /
STATIONID FORECASTDATE INTERVAL FORECASTCREATED FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa 13-12-2009 10 10-12-2009 04:50:10 0
stationa 13-12-2009 10 10-12-2009 17:06:13 0
stationa 13-12-2009 10 12-12-2009 05:20:50 300
stationa 13-12-2009 10 13-12-2009 09:20:50 300
stationa 13-12-2009 11 13-12-2009 09:20:50 400
stationb 13-12-2009 11 13-12-2009 09:20:50 500
6 rows selected.
SQL> SELECT v.stationid,
2 v.forecastcreated,
3 v.forecastvalue,
4 (v.forecastdate + v.forecastinterval / 24) fcst_date
5 FROM (SELECT t.stationid,
6 t.forecastdate,
7 t.forecastinterval,
8 t.forecastcreated,
9 t.forecastvalue,
10 t.forecastvalue - LAG(t.forecastvalue, 1)
11 OVER (ORDER BY t.forecastcreated) as difference
12 FROM t) v
13 WHERE v.difference >= 100
14 /
STATIONID FORECASTCREATED FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa 12-12-2009 05:20:50 300 13-12-2009 10:00:00
stationa 13-12-2009 09:20:50 400 13-12-2009 11:00:00
stationb 13-12-2009 09:20:50 500 13-12-2009 11:00:00
SQL>
To remove the false positives we group the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL from the first calculation of each partition window.
SQL> SELECT v.stationid,
2 v.forecastcreated,
3 v.forecastvalue,
4 (v.forecastdate + v.forecastinterval / 24) fcst_date
5 FROM (SELECT t.stationid,
6 t.forecastdate,
7 t.forecastinterval,
8 t.forecastcreated,
9 t.forecastvalue,
10 t.forecastvalue - LAG(t.forecastvalue, 1)
11 OVER (PARTITION BY t.stationid
12 , t.forecastdate
13 , t.forecastinterval
14 ORDER BY t.forecastcreated) as difference
15 FROM t) v
16 WHERE v.difference >= 100
17 /
STATIONID FORECASTCREATED FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa 12-12-2009 05:20:50 300 13-12-2009 10:00:00
SQL>
Working with large volumes of data
You describe your tables as containing many hundreds of millions of rows. Such huge tables are like black holes, they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. It's more than a five minute answer.
But here's the five minute answer anyway.
Assuming your table is the live table it is presumably being populating by adding forecasts as they occur, which is basically an appending operation. This would mean forecasts for any given station are scattered throughout the table. Consequently indexes on just STATIONID or even FORECASTDATE would have a poor clustering factor.
On that assumption, the one thing I would suggest you try first is building an index on (STATIONID, FORCASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE)
. This will take some time (and disk space) to build, but it ought to speed up your subsequent queries quite considerably, because it has all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.