Finding earliest connected value over two columns

https://dba.stackexchange.com/questions/132994

30-09-2020
|

Question

I have a, perhaps slightly esoteric, use case.

Given a table like the following:

 id1 | id2 |      timestamp      
-----+-----+---------------------
   3 |   4 | 2016-03-22 09:52:15
   1 |   2 | 2016-03-22 12:05:28
   2 |   3 | 2016-03-22 11:55:56
   5 |   6 | 2016-03-22 11:58:23
   6 |   7 | 2016-03-22 09:51:54

I would like to "trace" each id in id1 to its oldest id2 where connection can span over id1 and id2. For instance in the above dataset I would like the resultset to contain:

1 |   4 | 2016-03-22 09:52:15
5 |   7 | 2016-03-22 09:51:54

The results should be ordered by the timestamp column.
And ideally no other occurrences of "1" and "5".

I am using Amazon Redshift, which imposes some limitations.

So far I have dabbled with using window functions combined with a self join:

SELECT t1.id1,
FIRST_VALUE(t2.id2) OVER (
   PARTITION BY t1.id1, (t1.id1=t2.id1 or t1.id2=t2.id2)
   ORDER BY t2.timestamp ASC
   ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
FROM    test_decision t1
LEFT JOIN test_decision t2 ON t1.id1 = t2.id2

The general idea is to group based on logical expression (e.g. contains either id1 or id2), rather than partitioning only on rows with both. I've tried SELECT ... GROUP BY in a similar manner, but without getting the results I want.

I hope this use case makes sense to some of you.

In an effort to eliminate ambiguity: The rows are connected by having a value of id1 or id2 in either id1 or id2 columns of other rows. For instance, 1 leads to 4 through rows having (id1,id2) = (1,2),(2,3) and (3,4) in my example.

Regarding recursiveness: I think I can solve it by joining multiple times and coalescing the "outermost" value, my problem here is that the problem can be arbitrarily deep, e.g. I can solve this case by joining twice, but is there a way in SQL to get arbitrary recursion depth? And is that a good idea?

Solution

It's a recursive problem by nature. SQL introduced recursive CTEs for that purpose quite some time ago. Since Redshift does not seem to support this, your only remaining option to solve this within the database is a user defined function (UDF). It seems like the only server-side language supported by Redshift is plpythonu.

Here is an old answer with a PL/pgSQL function and an rCTE implementing the same solution side-by-side. But it's using the default procedural language of Postgres: PL/pgSQL:

Loop in function does not work as expected

No idea why Redshift chose not to support PL/pgSQL.

OTHER TIPS

I think you need a recursive query. Something like this:

WITH RECURSIVE 
  groups (id1, id2, timestamp, level) AS   
    ( SELECT t.id1, t.id2, t.timestamp, 1 
      FROM test_decision AS t
      WHERE NOT EXISTS
            ( SELECT * 
              FROM test_decision AS p
              WHERE p.id2 = t.id1
            )
     UNION ALL
       SELECT g.id1, t.id2, t.timestamp, g.level + 1
       FROM groups AS g
         JOIN test_decision AS t
           ON g.id2 = t.id1 
    ) 
SELECT DISTINCT ON (id1) 
    id1, id2, timestamp
FROM groups 
ORDER BY id1, level DESC ;

You may have to modify the final SELECT, depending on what you want exactly for the timestamp result. The minimum value in a group or the one from the "latest" id (by latest I mean the 4 in the group 1 -> 2 -> 3 -> 4, which is not necessarily the maximum id).

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange