Aggregate rows for each group of rows

https://dba.stackexchange.com/questions/270157

04-03-2021

Question

I am trying to solve a problem that is quite easy to solve in a procedural language, but I have failed to solve it efficiently in SQL.

Let me first explain the problem. I have a series of events, each of which happens at a certain point in time. To keep it simple, let's assume that each event happens at a distinct point in time. An event is represented by a number. Take for example the following data:

create table event
(
    time   time,
    status integer
);

insert into event
values
('12:00', 0),
('13:00', 8),
('14:00', 4),
('15:00', 2),
('16:00', 0),
('17:00', 9),
('18:00', 5),
('19:00', 8),
('20:00', 0),
('21:00', 1),
('22:00', 3),
('23:00', 0);

Now, a cycle is defined as a sequence of events happening between two events with status 0. So, for the data above, I have the following cycles:

cycle 1: 0 -> 8 -> 4 -> 2 -> 0
cycle 2: 0 -> 9 -> 5 -> 8 -> 0
cycle 3: 0 -> 1 -> 3 -> 0

The goal is to find these cycles.

I have a working solution (see fiddle), which goes as follows:

with
cycle_boundary(begin_time, end_time) as
(
    select begin_time, end_time
    from   (
             select    time, lead(time) over(order by time)
             from      event
             where     status = 0
           ) as cycle(begin_time, end_time)
    where  end_time is not null
)
select     begin_time, end_time, array_agg(row(time, status)) as events
from       cycle_boundary
cross join event
where      cycle_boundary.begin_time < event.time and
           cycle_boundary.end_time > event.time
group by   cycle_boundary.begin_time,
           cycle_boundary.end_time;

This outputs:

begin_time  end_time    events
12:00:00    16:00:00    {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00    20:00:00    {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00    23:00:00    {"(21:00:00,1)","(22:00:00,3)"}

The problem is that this solution is quite inefficient. First, I scan through all the events to find the boundaries (each boundary pair being two consecutive events with status 0), and then, for each pair of boundaries, I scan through the events again to find the ones it contains. This is basically a nested loop, so O(n^2).

In a procedural language, this can easily be solved in O(n) under the precondition that the events are sorted (which we can achieve in a database as well if there is an index on event(time)): loop through the ordered events and collect them in a temporary collection as long as we do not encounter an event with status 0; once we encounter such an event, we output the events collected so far and clear the temporary collection.
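For reference, the single pass described above could look roughly like the following Python sketch. The function name find_cycles and the in-memory list of (time, status) tuples are assumptions made for illustration; the list is expected to be sorted by time already, and events before the first status 0 or after the last one are dropped.

```python
from datetime import time

# Sample rows from the question, already sorted by time.
events = [
    (time(12), 0), (time(13), 8), (time(14), 4), (time(15), 2),
    (time(16), 0), (time(17), 9), (time(18), 5), (time(19), 8),
    (time(20), 0), (time(21), 1), (time(22), 3), (time(23), 0),
]


def find_cycles(sorted_events):
    """Yield (begin_time, end_time, events) for each cycle, in one pass."""
    begin_time = None  # time of the last status-0 event seen so far
    collected = []     # events seen since that status-0 event
    for event_time, status in sorted_events:
        if status == 0:
            # A status-0 event closes the open cycle (if any) and opens a new one.
            if begin_time is not None:
                yield begin_time, event_time, collected
            begin_time = event_time
            collected = []
        elif begin_time is not None:
            collected.append((event_time, status))


cycles = list(find_cycles(events))
for begin, end, members in cycles:
    print(begin, end, members)
```

Every event is visited exactly once, which is where the O(n) bound comes from.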

So my question boils down to: how can we solve this in O(n) in SQL? I believe that one of the problems is that the FILTER clause for aggregate window functions is not implemented in PostgreSQL, but that might be irrelevant here.

Solution

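You can use status to define the groups: a running count of the rows with status = 0 assigns every row to a group, and you can then take the min and max time of each group.

If there is a serial id (PK) that reflects the order of the events, ordering the window by it instead of time may give better performance.

Because each intermediate status = 0 row belongs to two groups (it closes one cycle and opens the next), I've added a column with the time of the next row, so that max(next_time) gives the end of each cycle.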

with ev as
(
  select
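      time, status,
      -- time of the following row: supplies the closing boundary of each group
      lead(time) over (order by time) as next_time,
      -- running count of status = 0 rows: every 0 starts a new group
      -- (the event table above has no id column, so the window is ordered by time)
      sum(case when status = 0 then 1 else 0 end) over (order by time) as grp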
  from
      event
)
select 
    min(time) as min_time,
    max(next_time) as max_time,
    array_agg(row(time, status)) filter (where status <> 0) as events
from 
    ev
group by
    grp
order by
    grp;

min_time | max_time | events                                        
:------- | :------- | :---------------------------------------------
12:00:00 | 16:00:00 | {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00 | 20:00:00 | {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00 | 23:00:00 | {"(21:00:00,1)","(22:00:00,3)"}               
23:00:00 | null     | null
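To make the grouping idea easier to follow, here is a rough Python sketch that mirrors the query step by step on the sample data. It is only an illustration of the logic, not a statement about how PostgreSQL executes the query; the variable names are assumptions, and the events are assumed to be sorted by time.

```python
from datetime import time
from itertools import accumulate, groupby

# Sample rows from the question, already sorted by time.
events = [
    (time(12), 0), (time(13), 8), (time(14), 4), (time(15), 2),
    (time(16), 0), (time(17), 9), (time(18), 5), (time(19), 8),
    (time(20), 0), (time(21), 1), (time(22), 3), (time(23), 0),
]

times = [t for t, _ in events]
# lead(time) over (order by time): time of the following row, None for the last row
next_times = times[1:] + [None]
# sum(case when status = 0 then 1 else 0 end) over (order by time)
grps = list(accumulate(1 if status == 0 else 0 for _, status in events))

result = []
rows = zip(grps, events, next_times)
for grp, members in groupby(rows, key=lambda row: row[0]):
    members = list(members)
    min_time = min(ev[0] for _, ev, _ in members)
    # SQL max() ignores NULLs and returns NULL when nothing is left
    max_time = max((nxt for _, _, nxt in members if nxt is not None), default=None)
    # array_agg(...) filter (where status <> 0) returns NULL for an empty input
    in_cycle = [ev for _, ev, _ in members if ev[1] != 0] or None
    result.append((min_time, max_time, in_cycle))

for row in result:
    print(row)
```

The last tuple corresponds to the trailing 23:00:00 row in the output above: the final status = 0 event opens a group that has no following row and no members.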


Licensed under: CC-BY-SA with attribution
