Question

In case it matters: for the following I will use a Netezza backend with SPSS Modeler and/or Advanced Query Tool for the query itself. I have no access to the CLI. I'm trying to understand whether cursors and walking through a sorted table are necessary to handle the following:

Imagine a table with 2 columns, the first is a non-unique ID and the second is a date. Any given ID may occur in the table multiple times with one or more dates.

My goal is to select from this table rows for which the dates are spaced out by no less than a fixed number of days, say 90. For example:

| ID |  DATE      |
===================
  X    2014-01-01
  X    2014-02-01
  X    2014-07-01
  Y    2014-02-01
  Y    2014-06-01
  Y    2014-07-01

In the above example, the rows I want to select for X would be Jan 1 and Jul 1 (exclude Feb 1 because it is less than 90 days from Jan 1), and the rows for Y would be Feb 1 and Jun 1 (exclude Jul 1 because it is within 90 days of a prior case).

In practice there could be well over 100M rows in the table. Is it possible to do this without cursors? What would the optimum method be?

Thanks in advance for any advice!

EDIT: Expanded the test table data here. SQL Fiddle

In the above edited example, the desired output would be

| ID |  DATE      |
===================
  X    2014-01-01
  X    2014-04-01
  X    2014-10-01
  Y    2014-01-15
  Y    2014-04-15
  Y    2014-10-15
  Z    2014-01-01
  Z    2014-04-01
  Z    2014-10-01

Solution

I found a native SPSS/Netezza iterative solution that worked. SPSS supports the @OFFSET(Field, integer) function, which can read ahead of or behind the current row. I had tried this function before but ran into recursion errors when I passed it the same field that receives the function's result, together with a negative integer intended to read the previous result.

Today, while working on another project, I discovered that the documentation for @OFFSET() is poor. I had believed it worked in first-to-last row order, with positive integers meaning read-ahead, but that is backwards. It is actually evaluated last-row-to-first, and positive integer offsets read previous rows. Retrying my original method with the sign of the integer offset corrected eliminated the recursion error and solved the problem.

The actual solution can be thought of in this way. The description is deliberately verbose for the sake of clarity; in practice most of these items can be condensed into the same step. It also ignores the possibility of an ID having multiple instances of the same date (which isn't hard to handle with extra logic in the comparison, but wasn't necessary for my needs).

  1. Select a sorted set on ID then Date.
  2. Define a new column PR_ID by @OFFSET(ID, 1) to store the previous row's ID
  3. Define a new column called LST_CNT_DT (last countable date). If PR_ID <> ID, it holds the current row's date. Otherwise, compute the difference in days between the current row's date and @OFFSET(LST_CNT_DT,1) [i.e. the prior row's value for the same field]: if the difference is >= 90, store the current row's date; otherwise store @OFFSET(LST_CNT_DT,1).
  4. Select from this new set all rows where LST_CNT_DT = DT (where DT is date for current row).
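
The four steps above can be sketched outside SPSS as a single pass over the sorted rows. This is only an illustrative Python emulation, not SPSS code: the function name select_spaced and the variables prev_id and last_counted are hypothetical stand-ins for PR_ID and @OFFSET(LST_CNT_DT, 1). Like the description above, it does not treat duplicate dates within an ID specially.

```python
from datetime import date

MIN_GAP_DAYS = 90  # spacing threshold from the question


def select_spaced(rows, min_gap_days=MIN_GAP_DAYS):
    """Return the (id, date) rows whose date equals their LST_CNT_DT."""
    kept = []
    prev_id = None       # stand-in for PR_ID, i.e. @OFFSET(ID, 1)
    last_counted = None  # stand-in for @OFFSET(LST_CNT_DT, 1)
    for row_id, dt in sorted(rows):  # step 1: sort on ID, then date
        if row_id != prev_id:
            lst_cnt_dt = dt          # step 3: first row of a new ID
        elif (dt - last_counted).days >= min_gap_days:
            lst_cnt_dt = dt          # step 3: far enough from the last counted date
        else:
            lst_cnt_dt = last_counted  # step 3: carry the last counted date forward
        if lst_cnt_dt == dt:         # step 4: keep rows where LST_CNT_DT = DT
            kept.append((row_id, dt))
        prev_id, last_counted = row_id, lst_cnt_dt
    return kept


# The small example table from the question.
sample = [
    ("X", date(2014, 1, 1)), ("X", date(2014, 2, 1)), ("X", date(2014, 7, 1)),
    ("Y", date(2014, 2, 1)), ("Y", date(2014, 6, 1)), ("Y", date(2014, 7, 1)),
]
for row in select_spaced(sample):
    print(row[0], row[1].isoformat())
```

On the question's first example this keeps Jan 1 and Jul 1 for X, and Feb 1 and Jun 1 for Y. Because each row only needs the previous row's result, the work is one sort plus one linear scan.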

This is not quite as elegant as the recursive CTE method available in SQL Server, but it can be built entirely in SPSS v15, and it executed on the full table in a few minutes.

OTHER TIPS

If you accept something that works in SQL Server, then the following code will work:

With CTE as (
    select A.ID, A.DATA, MIN(B.DATA) DATA1 
    from Table1 A
    inner join Table1 B
      on A.ID = B.ID
      and DATEADD(DAY, 90, A.DATA) <= B.DATA
    GROUP BY A.ID, A.DATA
), REC AS (
   SELECT ID, MIN(DATA) DATA
   FROM Table1
   GROUP BY ID
   UNION ALL
   SELECT A.ID, B.DATA1
   FROM REC A
   INNER JOIN CTE B
     ON A.ID = B.ID
     AND A.DATA = B.DATA
)

SELECT *
FROM REC
ORDER BY ID, DATA

It uses a recursive CTE. Starting from the minimum date for each ID, each recursive step takes the earliest date that is at least 90 days after the previously selected one. As written, this will only work in SQL Server.
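
For readers who want to try the recursive idea without a SQL Server instance, here is a minimal sketch of the same query adapted to SQLite and run through Python's sqlite3 module. The adaptation is an assumption on my part, not part of the original answer: it uses WITH RECURSIVE, replaces DATEADD with SQLite's date(..., '+90 days'), stores dates as ISO text, and renames the CTEs to nxt and rec. The helper name select_spaced_sql is hypothetical. It says nothing about whether Netezza accepts a similar query.

```python
import sqlite3

QUERY = """
WITH RECURSIVE
nxt AS (
    -- For every row, the earliest date of the same ID at least 90 days later.
    SELECT A.ID, A.DATA, MIN(B.DATA) AS DATA1
    FROM Table1 A
    JOIN Table1 B
      ON A.ID = B.ID
     AND date(A.DATA, '+90 days') <= B.DATA
    GROUP BY A.ID, A.DATA
),
rec AS (
    -- Anchor: the first date of each ID.
    SELECT ID, MIN(DATA) AS DATA
    FROM Table1
    GROUP BY ID
    UNION ALL
    -- Recursive step: jump from the last selected date to its successor.
    SELECT r.ID, n.DATA1
    FROM rec r
    JOIN nxt n
      ON r.ID = n.ID
     AND r.DATA = n.DATA
)
SELECT ID, DATA FROM rec ORDER BY ID, DATA
"""


def select_spaced_sql(rows):
    """Load (ID, ISO date string) rows into an in-memory table and run QUERY."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE Table1 (ID TEXT, DATA TEXT)")
        conn.executemany("INSERT INTO Table1 VALUES (?, ?)", rows)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


# The small example table from the question.
sample_rows = [
    ("X", "2014-01-01"), ("X", "2014-02-01"), ("X", "2014-07-01"),
    ("Y", "2014-02-01"), ("Y", "2014-06-01"), ("Y", "2014-07-01"),
]
for row_id, day in select_spaced_sql(sample_rows):
    print(row_id, day)
```

The nxt CTE is a self-join over every pair of rows within an ID, so on a table with very many rows per ID it may be considerably more expensive than a single sorted pass.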

SQL Fiddle

UPDATE

Since you are collecting ideas to implement elsewhere, it is useful to have more than one way to do it. In SQL Server it is also possible to implement this in T-SQL as follows:

DECLARE @TABLE1 TABLE (ID VARCHAR(1), DATA DATE, DATA1 DATE)

  INSERT INTO @TABLE1
    select A.ID, A.DATA, MIN(B.DATA) DATA1 
    from (
       SELECT ID, MIN(DATA) DATA
       FROM Table1
       GROUP BY ID
    ) A
    inner join Table1 B
      on A.ID = B.ID
      and DATEADD(DAY, 90, A.DATA) <= B.DATA
    GROUP BY A.ID, A.DATA

DECLARE @AUX INT = 0

WHILE (SELECT COUNT(*) FROM @TABLE1) <> @AUX
BEGIN

  SELECT @AUX = COUNT(*) FROM @TABLE1

  INSERT INTO @TABLE1
    select *
    from ( 
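       -- Group by DATA1 as well, so that each chain link yields its own
       -- successor and chains longer than three dates keep growing.
       select A.ID, A.DATA, MIN(B.DATA) DATA1 
       from @TABLE1 A
       inner join Table1 B
         on A.ID = B.ID
         and DATEADD(DAY, 90, A.DATA1) <= B.DATA
       GROUP BY A.ID, A.DATA, A.DATA1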
    ) A
    where not exists (
        SELECT 1
        FROM @TABLE1
        WHERE ID = A.ID
         AND DATA = A.DATA
         AND DATA1 = A.DATA1
    )

END

SELECT ID, DATA
FROM @TABLE1
UNION
SELECT ID, DATA1
FROM @TABLE1
UNION
SELECT ID, MIN(DATA) DATA
FROM Table1
GROUP BY ID

SQL Fiddle

Licensed under: CC-BY-SA with attribution