Question

I have a table where every entry is a node and the table contains the direct connections of each node to other nodes. I am looking to create a view with a column for each node containing all the nodes in the chain, not just the nodes the node itself is connected to.

An example would be generating the Nodes in Chain column from the first two columns of the following table:

CREATE TABLE example
(
    node text,
    connections text[],
    nodes_in_chain text[]
)

INSERT INTO example VALUES
('a', ARRAY['a','b'],         null),
('b', ARRAY['a','b','c','d'], null),
('c', ARRAY['b','c'],         null),
('d', ARRAY['b','d'],         null),
('e', ARRAY['e','f'],         null),
('f', ARRAY['e','f'],         null);
Node   Connections    Nodes in Chain
"a"    "{a,b}"        "{a,b,c,d}"
"b"    "{a,b,c,d}"    "{a,b,c,d}"
"c"    "{b,c}"        "{a,b,c,d}"
"d"    "{b,d}"        "{a,b,c,d}"
"e"    "{e,f}"        "{e,f}"
"f"    "{e,f}"        "{e,f}"

That is a small simplified version of the real problem. If I can solve the example, the full table should be no problem.

The data of this table can be visualized in the following way:

data

I have looked into several different methods to solve this problem. I have looked into recursive CTEs but I have not managed to make them work.

Each node is connected to itself currently in the database. Not a problem to remove the connection to itself in the database if that is necessary.

Probably unnecessary background to the problem:

The origin of this problem comes from attempting to identify vehicles in traffic. The original database contains a vehicles location and speed at every time step t in a given area. The goal is to determine the time spent at a traffic light. To solve this problem a stopping area for the traffic light was identified. Each vehicle in this area with a speed below a certain threshold is considered to be waiting for the traffic light. Due to a long queue line, vehicles may however be queuing outside this area. Therefore a traffic line ("Chain of Nodes") is made with all vehicles that are within a certain distance of each other and have a low speed below. Starting from a vehicle within the identified queue area. The problem is part of scientific research into aircraft taxi times. The vehicles are therefore aircraft and the stoplights are runway thresholds.

I first performed the calculation to identify vehicles in an area with Python and pandas. However the code took 10x longer to run which made it prohibitive to the project. The code was very simple without manually introduced loops and could therefore not be accelerated (I believe). I will also be comparing the speed of performing the queuing algorithm in Python versus PostgreSQL.

Was it helpful?

Solution

Approach 1:

At first sight, it seems that I could apply a basic solution because, according to your sample data, each single connection is included in another connection.

SELECT 
    e1.node, 
    e1.connections, 
    COALESCE(e2.connections, e1.connections) nodes_in_chain
FROM
    example e1
LEFT JOIN 
    example e2
    ON e2.node <> e1.node
    AND e1.connections <@ e2.connections;
node | connections | nodes_in_chain
:--- | :---------- | :-------------
a    | {a,b}       | {a,b,c,d}     
b    | {a,b,c,d}   | {a,b,c,d}     
c    | {b,c}       | {a,b,c,d}     
d    | {b,d}       | {a,b,c,d}     
e    | {e,f}       | {e,f}         
f    | {e,f}       | {e,f}         

Approach 2:

But, as @ypercube pointed out, this solution doesn't work if there are 3 or more linear points in a row.

Ex: e -> f -> g -> h

As a reference to solve this question I used the answers in another related question:

It uses a method called transitive closure to solve the problem.

Transitive closure

In mathematics, the transitive closure of a binary relation R on a set X is the smallest relation on X that contains R and is transitive.

For example, if X is a set of airports and xRy means "there is a direct flight from airport x to airport y" (for x and y in X), then the transitive closure of R on X is the relation R+ such that x R+ y means "it is possible to fly from x to y in one or more flights". Informally, the transitive closure gives you the set of all places you can get to from any starting place.

First let me change your sample data by adding a linear connection of 4 nodes.

DELETE FROM example WHERE node = 'f';
INSERT INTO example VALUES
('f', ARRAY['e','f','g'],     null),
('g', ARRAY['f','g','h'],     null),
('h', ARRAY['g','h'],         null);

Now applying the math:

WITH RECURSIVE al (dst, src) AS --adjacent list or list of all related nodes
(
  SELECT e1.node, e2.node
  FROM   example e1
  JOIN   example e2
         ON e1.node = any(e2.connections)   
), tc (dst, src) AS
   (
     SELECT dst, src FROM al -- transitive closure
     UNION
     SELECT a1.dst, a2.src
     FROM   al as a1 
     JOIN   tc as a2 
            ON a1.src = a2.dst
   )
   SELECT   src, array_agg(DISTINCT dst ORDER BY dst) AS nodes_in_chain
   FROM     tc
   GROUP BY src;

Give us this result:

src | nodes_in_chain
:-- | :-------------
a   | {a,b,c,d}     
b   | {a,b,c,d}     
c   | {a,b,c,d}     
d   | {a,b,c,d}     
e   | {e,f,g,h}     
f   | {e,f,g,h}     
g   | {e,f,g,h}     
h   | {e,f,g,h}     

db<>fiddle here

NOTE: The original relation has only the immediate connections, which can be seen as paths of length 1 (2 nodes each). Approach 1 finds all paths of length 2 (3 nodes) as it applies a method of connecting once. To find paths of length N, you have to apply the methods N-1 times. To find all paths of arbitrary lengths (alas the transitive closure), you need a recursive solution, or a while loop. It can't be done with simple SQL. (ie: one query without CTEs.)

@ypercube

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange
scroll top