Oracle: finding the previous record in a ranked list of forecasts

https://stackoverflow.com/questions/1518054

19-09-2019

Question

Hi, I am facing a difficult problem:

I have a table (Oracle 9i) of weather forecasts, many hundreds of millions of records in size. Its layout looks like this:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk)   integer (pk)        datetime (pk)      integer

Where:

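stationid refers to one of the many weather stations that may create a forecast;
forecastdate refers to the date the forecast is for (date only, no time);
forecastinterval refers to the hour within forecastdate that the forecast is for (0-23);
forecastcreated refers to the time the forecast was made, which can be many days beforehand;
forecastvalue refers to the actual value of the forecast (as the name implies).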
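
For anyone who wants to reproduce the setup, the layout described above could be declared roughly as follows. This is only a sketch: the column types, sizes, and constraint names are assumptions (the question gives only "varchar", "datetime", and "integer"), and the table name weathertable is taken from the query further down.

```sql
-- Hypothetical DDL. Types, sizes, and constraint names are assumptions;
-- only the column names and the primary key columns come from the question.
CREATE TABLE weathertable (
    stationid        VARCHAR2(30) NOT NULL,  -- weather station identifier
    forecastdate     DATE         NOT NULL,  -- day the forecast is for (time part unused)
    forecastinterval NUMBER(2)    NOT NULL,  -- hour of forecastdate, 0-23
    forecastcreated  DATE         NOT NULL,  -- when the forecast was made
    forecastvalue    NUMBER,                 -- the forecast value itself
    CONSTRAINT weathertable_pk PRIMARY KEY
        (stationid, forecastdate, forecastinterval, forecastcreated),
    CONSTRAINT weathertable_interval_chk
        CHECK (forecastinterval BETWEEN 0 AND 23)
);
```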

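For a given stationid and a given forecastdate and forecastinterval pair, I need to find the records where forecastvalue increases by more than a nominal amount (say 500). Here is a table illustrating the condition: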

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
'stationa'   13-dec-09       10                  10-dec-09 04:50:10  0
'stationa'   13-dec-09       10                  10-dec-09 17:06:13  0
'stationa'   13-dec-09       10                  12-dec-09 05:20:50  300
'stationa'   13-dec-09       10                  13-dec-09 09:20:50  300

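In the above scenario, I would like to pull out the third record. This is the record where the forecast value increased by a nominal amount (say 100).

The task has proven very difficult because of the sheer size of the table (many hundreds of millions of records) and because the query takes so long to finish (so long, in fact, that it has never returned).

This is what I have tried so far to get these values: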

select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*,
            rank() over (partition by stationid, 
                                   (inner.forecastdate + inner.forecastinterval),
                                   inner.forecastcreated
                         order by stationid, 
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
            ) rk
      from weathertable inner) wtr 
      where
      wtr.forecastvalue - 100 > (
                     select lastvalue
                      from (select y.*,
                            rank() over (partition by stationid, 
                                            (forecastdate + forecastinterval),
                                            forecastcreated
                                         order by stationid, 
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
                             from weathertable y
                            ) z
                       where z.stationid = wtr.stationid
                             and z.forecastdate = wtr.forecastdate                                                   
                             and (z.forecastinterval =    
                                         wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
                             and wtr.rk = z.rk + 1)

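Solution

Rexem's suggestion of using LAG() is the right approach, but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...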

SQL> select * from t
  2  /    
STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /    
STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

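To eliminate the false positives, we partition the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL for the first calculation in each partition window.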

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>
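
The reliance on NULL can also be made explicit. The sketch below is a variant, not part of the original answer: it passes a third argument to LAG, which is the default returned when a partition has no previous row. The first forecast in each partition then gets a difference of 0 instead of NULL. Neither NULL nor 0 satisfies the filter, so for the six sample rows in t it should select the same single row.

```sql
-- Variant of the partitioned query above; the default argument of LAG
-- is the only change.
SELECT v.stationid,
       v.forecastcreated,
       v.forecastvalue,
       (v.forecastdate + v.forecastinterval / 24) fcst_date
  FROM (SELECT t.stationid,
               t.forecastdate,
               t.forecastinterval,
               t.forecastcreated,
               t.forecastvalue,
               -- With no previous row in the partition, LAG returns the
               -- current value, so the difference is 0 rather than NULL.
               t.forecastvalue - LAG(t.forecastvalue, 1, t.forecastvalue)
                    OVER (PARTITION BY t.stationid,
                                       t.forecastdate,
                                       t.forecastinterval
                          ORDER BY t.forecastcreated) AS difference
          FROM t) v
 WHERE v.difference >= 100;
```

Whether the NULL or the 0 form is preferable is a matter of taste; the NULL form keeps "no previous forecast" distinguishable from "no change".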

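Working with large volumes of data

You describe your table as containing many hundreds of millions of rows. Such huge tables are like black holes: they have a different physics. There are various possible approaches, depending on your needs, timescales, finances, database version, and any other use of the system's data. This is more than a five-minute answer.

But here is the five-minute answer anyway.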

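Assuming your table is the live table, it is probably populated by adding forecasts as they occur, which is basically an append operation. This would mean that the forecasts for any given station are scattered throughout the table. Consequently, an index on just STATIONID or even FORECASTDATE would have a poor clustering factor.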
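
One way to check that assumption is to compare each index's clustering factor with the table's block and row counts in the data dictionary. The query below is a sketch that assumes the table is called WEATHERTABLE, that it lives in the current schema, and that optimizer statistics have been gathered recently.

```sql
-- Assumes the table name WEATHERTABLE and up-to-date optimizer statistics.
SELECT i.index_name,
       i.clustering_factor,
       t.blocks   AS table_blocks,
       t.num_rows AS table_rows
  FROM user_indexes i,
       user_tables  t
 WHERE t.table_name = i.table_name
   AND i.table_name = 'WEATHERTABLE';
```

As a rule of thumb, a clustering factor close to the number of table blocks means rows with adjacent index keys sit close together, while a value close to the number of rows means they are scattered.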

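On that assumption, the first thing I would suggest you try is building an index on (STATIONID, FORECASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries considerably, because it contains all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.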
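
A sketch of what that could look like follows. The index name and the bind variables :station, :fcst_date and :fcst_hour are hypothetical, and the table name weathertable is taken from the question. Whether the optimizer actually chooses a range scan on this index depends on statistics and settings, so it is worth checking the execution plan.

```sql
-- Hypothetical index name; the column order follows the suggestion above.
CREATE INDEX weathertable_fcst_ix
    ON weathertable (stationid, forecastdate, forecastinterval,
                     forecastcreated, forecastvalue);

-- Look at one station, date and hour at a time. The filter pins a single
-- (stationid, forecastdate, forecastinterval) combination, so no
-- PARTITION BY is needed here. Every column used is in the index.
SELECT v.stationid,
       v.forecastcreated,
       v.forecastvalue
  FROM (SELECT w.stationid,
               w.forecastcreated,
               w.forecastvalue,
               w.forecastvalue - LAG(w.forecastvalue, 1)
                    OVER (ORDER BY w.forecastcreated) AS difference
          FROM weathertable w
         WHERE w.stationid        = :station
           AND w.forecastdate     = :fcst_date
           AND w.forecastinterval = :fcst_hour) v
 WHERE v.difference >= 100;
```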

Licensed under: CC-BY-SA with attribution

Not affiliated with StackOverflow