SQL - Can Cursors be avoided in this case?
21-12-2019
Question
In case it matters: I am using a Netezza backend with SPSS Modeler and/or Advanced Query Tool for the query itself, and I have no access to the CLI. I'm trying to understand whether cursors and walking through a sorted table are necessary to handle the following:
Imagine a table with 2 columns, the first is a non-unique ID and the second is a date. Any given ID may occur in the table multiple times with one or more dates.
My goal is to select from this table rows for which the dates are spaced out by no less than a fixed number of days, say 90. For example:
| ID | DATE       |
===================
| X  | 2014-01-01 |
| X  | 2014-02-01 |
| X  | 2014-07-01 |
| Y  | 2014-02-01 |
| Y  | 2014-06-01 |
| Y  | 2014-07-01 |
In the above example, the rows I want to select for X would be Jan 1 and Jul 1 (exclude Feb 1 because it is less than 90 days from Jan 1), and the rows for Y would be Feb 1 and Jun 1 (exclude Jul 1 because it is within 90 days of a prior selected row).
In practice there could be well over 100M rows in the table. Is it possible to do this without cursors? What would the optimum method be?
Thanks in advance for any advice!
EDIT: I expanded the test table data here: SQL Fiddle
For the expanded example, the desired output would be:
| ID | DATE       |
===================
| X  | 2014-01-01 |
| X  | 2014-04-01 |
| X  | 2014-10-01 |
| Y  | 2014-01-15 |
| Y  | 2014-04-15 |
| Y  | 2014-10-15 |
| Z  | 2014-01-01 |
| Z  | 2014-04-01 |
| Z  | 2014-10-01 |
Solution
I found an iterative solution native to SPSS/Netezza that worked. SPSS supports the @OFFSET(Field, integer) function, which can read ahead of or behind the current row. I had tried this function before but hit recursion errors when I set "Field" to the same field that receives the function's result, combined with a negative integer to read the previous result.
Today, while working on another project, I discovered that the documentation for @OFFSET() is poor. I had believed it evaluated rows first to last, with positive integers meaning read-ahead, but that is backwards: it is actually evaluated last row to first, and positive integer offsets read previous rows. Retrying my original method with the sign of the offset corrected eliminated the recursion error and solved the problem.
The actual solution can be thought of in this way. This description is overly verbose for the sake of clarity; in practice most of these items can be condensed into the same step. It also ignores the possibility of an ID having multiple instances of the same date (which isn't hard to handle with extra logic in the comparison, but wasn't necessary for my needs).
- Select a sorted set on ID then Date.
- Define a new column PR_ID by @OFFSET(ID, 1) to store the previous row's ID
- Define a new column called LST_CNT_DT (last countable date) containing the current row's date if PR_ID <> ID. Otherwise, compare the difference in days between the current row's date and @OFFSET(LST_CNT_DT,1) [i.e. the prior row's value for the same field], if the difference is >=90, store the current row's date. Otherwise store @OFFSET(LST_CNT_DT,1).
- Select from this new set all rows where LST_CNT_DT = DT (where DT is date for current row).
This is not quite as elegant as the recursive CTE method available in MS SQL Server, but it can be built entirely in SPSS v15, and it ran on the full table in a few minutes.
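The steps above amount to a single pass over the rows sorted by ID and date, carrying the last countable date forward. As a rough illustration outside SPSS, here is a minimal Python sketch of that walk. The function name select_spaced_rows and the variables prev_id and last_counted are hypothetical stand-ins for PR_ID and LST_CNT_DT; this is not the SPSS stream itself.

```python
from datetime import date


def select_spaced_rows(rows, min_gap_days=90):
    """Keep each row whose date is at least min_gap_days after the
    last kept date for the same ID (the first row of an ID is always kept)."""
    kept = []
    prev_id = None        # stand-in for PR_ID (previous row's ID)
    last_counted = None   # stand-in for LST_CNT_DT (last countable date)
    for row_id, row_date in sorted(rows):  # sort on ID, then date
        new_id = row_id != prev_id
        if new_id or (row_date - last_counted).days >= min_gap_days:
            last_counted = row_date
            kept.append((row_id, row_date))
        prev_id = row_id
    return kept


# The first example table from the question.
rows = [
    ("X", date(2014, 1, 1)), ("X", date(2014, 2, 1)), ("X", date(2014, 7, 1)),
    ("Y", date(2014, 2, 1)), ("Y", date(2014, 6, 1)), ("Y", date(2014, 7, 1)),
]
print(select_spaced_rows(rows))
```

Because a row is kept only when the last countable date actually changes, a repeated date for the same ID is kept once in this sketch, which differs slightly from filtering on LST_CNT_DT = DT.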
OTHER TIPS
If you accept something that works in SQL Server, then the following code will work:
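    -- CTE: for every (ID, DATA) row, the earliest later date at least 90 days away
    WITH CTE AS (
        SELECT A.ID, A.DATA, MIN(B.DATA) DATA1
        FROM Table1 A
        INNER JOIN Table1 B
            ON A.ID = B.ID
            AND DATEADD(DAY, 90, A.DATA) <= B.DATA
        GROUP BY A.ID, A.DATA
    ), REC AS (
        -- Anchor: the earliest date for each ID
        SELECT ID, MIN(DATA) DATA
        FROM Table1
        GROUP BY ID
        UNION ALL
        -- Recursive step: jump from the current date to its successor in CTE
        SELECT A.ID, B.DATA1
        FROM REC A
        INNER JOIN CTE B
            ON A.ID = B.ID
            AND A.DATA = B.DATA
    )
    SELECT *
    FROM REC
    ORDER BY ID, DATA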
It uses a recursive CTE. Starting from the minimum date for each ID, each recursive step takes the earliest date that is at least 90 days after the previously selected one. As written, though, this will only work in SQL Server.
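To try the same recursive idea without a SQL Server instance, here is a minimal sketch that runs an equivalent query in SQLite through Python's sqlite3 module. It is an adaptation, not the query above: the function name spaced_rows_sql and the CTE name NXT are made up for illustration, SQLite's date() modifier stands in for DATEADD, and dates are assumed to be stored as ISO yyyy-mm-dd text so that string comparison orders them correctly.

```python
import sqlite3


def spaced_rows_sql(rows, min_gap_days=90):
    """Run a recursive-CTE version of the query against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE Table1 (ID TEXT, DATA TEXT)")
    conn.executemany("INSERT INTO Table1 VALUES (?, ?)", rows)
    query = """
    WITH RECURSIVE
    NXT AS (
        -- For every row, the earliest date at least min_gap_days later
        SELECT A.ID, A.DATA, MIN(B.DATA) AS DATA1
        FROM Table1 A
        JOIN Table1 B
          ON A.ID = B.ID
         AND date(A.DATA, '+' || ? || ' days') <= B.DATA
        GROUP BY A.ID, A.DATA
    ),
    REC AS (
        -- Anchor: earliest date per ID
        SELECT ID, MIN(DATA) AS DATA FROM Table1 GROUP BY ID
        UNION ALL
        -- Recursive step: follow the link to the next qualifying date
        SELECT A.ID, B.DATA1
        FROM REC A
        JOIN NXT B ON A.ID = B.ID AND A.DATA = B.DATA
    )
    SELECT ID, DATA FROM REC ORDER BY ID, DATA
    """
    result = conn.execute(query, (min_gap_days,)).fetchall()
    conn.close()
    return result


# The first example table from the question, with dates as ISO text.
rows = [
    ("X", "2014-01-01"), ("X", "2014-02-01"), ("X", "2014-07-01"),
    ("Y", "2014-02-01"), ("Y", "2014-06-01"), ("Y", "2014-07-01"),
]
print(spaced_rows_sql(rows))
```

The self-join that builds NXT compares every row with every later row of the same ID, so on a table with 100M+ rows it may be far more expensive than a single sorted pass.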
UPDATE
Since you are collecting ideas to implement somewhere else, it is useful to have more than one way to do it. In SQL Server it is also possible to implement this with a T-SQL loop:
    DECLARE @TABLE1 TABLE (ID VARCHAR(1), DATA DATE, DATA1 DATE)

    -- Seed: pair each ID's earliest date with the first date at least 90 days later
    INSERT INTO @TABLE1
    SELECT A.ID, A.DATA, MIN(B.DATA) DATA1
    FROM (
        SELECT ID, MIN(DATA) DATA
        FROM Table1
        GROUP BY ID
    ) A
    INNER JOIN Table1 B
        ON A.ID = B.ID
        AND DATEADD(DAY, 90, A.DATA) <= B.DATA
    GROUP BY A.ID, A.DATA

    DECLARE @AUX INT = 0

    -- Repeat until a pass adds no new rows
    WHILE (SELECT COUNT(*) FROM @TABLE1) <> @AUX
    BEGIN
        SELECT @AUX = COUNT(*) FROM @TABLE1

        -- Extend every chain by one link: each DATA1 becomes the start (DATA)
        -- of a new row whose DATA1 is the earliest date at least 90 days later
        INSERT INTO @TABLE1
        SELECT *
        FROM (
            SELECT A.ID, A.DATA1 DATA, MIN(B.DATA) DATA1
            FROM @TABLE1 A
            INNER JOIN Table1 B
                ON A.ID = B.ID
                AND DATEADD(DAY, 90, A.DATA1) <= B.DATA
            GROUP BY A.ID, A.DATA1
        ) A
        WHERE NOT EXISTS (
            SELECT 1
            FROM @TABLE1 T
            WHERE T.ID = A.ID
                AND T.DATA = A.DATA
                AND T.DATA1 = A.DATA1
        )
    END

    -- Every selected date appears as DATA or DATA1; the last branch covers IDs with no second date
    SELECT ID, DATA
    FROM @TABLE1
    UNION
    SELECT ID, DATA1
    FROM @TABLE1
    UNION
    SELECT ID, MIN(DATA) DATA
    FROM Table1
    GROUP BY ID