Question

I am trying to solve a problem that is easy to solve in a procedural language, but I cannot find an efficient way to solve it in SQL.

Let me first explain the problem. I have a series of events, each of which happens at a certain point in time. To keep it simple, let's assume that each event happens at a distinct point in time. An event is represented by a number. Take for example the following data:

create table event
(
    time   time,
    status integer
);

insert into event
values
('12:00', 0),
('13:00', 8),
('14:00', 4),
('15:00', 2),
('16:00', 0),
('17:00', 9),
('18:00', 5),
('19:00', 8),
('20:00', 0),
('21:00', 1),
('22:00', 3),
('23:00', 0);

Now, a cycle is defined as a sequence of events happening between two events with status 0. So, for the data above, I have the following cycles:

cycle 1: 0 -> 8 -> 4 -> 2 -> 0
cycle 2: 0 -> 9 -> 5 -> 8 -> 0
cycle 3: 0 -> 1 -> 3 -> 0

The goal is to find these cycles.

I have a working solution, which goes as follows:

with
cycle_boundary(begin_time, end_time) as
(
    select begin_time, end_time
    from   (
             select    time, lead(time) over(order by time)
             from      event
             where     status = 0
           ) as cycle(begin_time, end_time)
    where  end_time is not null
)
select     begin_time, end_time, array_agg(row(time, status)) as events
from       cycle_boundary
cross join event
where      cycle_boundary.begin_time < event.time and
           cycle_boundary.end_time > event.time
group by   cycle_boundary.begin_time,
           cycle_boundary.end_time;

This outputs:

begin_time  end_time    events
12:00:00    16:00:00    {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00    20:00:00    {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00    23:00:00    {"(21:00:00,1)","(22:00:00,3)"}

The problem is that this solution is quite inefficient. First, I scan through all events to find the boundaries (each boundary is a pair of consecutive events with status 0), and then I scan through these boundaries to find the events contained in each one. This is basically a nested loop, so O(n^2).

In a procedural language, this can easily be solved in O(n) under the precondition that the events are sorted (which we can achieve in a database as well if there is an index on event(time)): loop through the ordered events and collect them in a temporary collection as long as we do not encounter an event with status 0; once we encounter such an event, we output the events collected so far and clear the temporary collection.
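
For reference, the single-pass procedure described above could be sketched in Python roughly as follows. This is only an illustration under stated assumptions: the function name find_cycles is made up here, times are kept as 'HH:MM' strings, and the input is assumed to be already sorted by time.

```python
from typing import Iterable, Iterator, Optional

Event = tuple[str, int]  # (time, status); 'HH:MM' strings sort chronologically


def find_cycles(events: Iterable[Event]) -> Iterator[tuple[str, str, list[Event]]]:
    """Yield (begin_time, end_time, inner_events) for each cycle.

    Assumes `events` is already sorted by time. Hypothetical helper,
    not part of the original question.
    """
    begin_time: Optional[str] = None  # time of the most recent status-0 event
    collected: list[Event] = []       # events seen since that status-0 event
    for time, status in events:
        if status == 0:
            # A status-0 event closes the open cycle (if any) and opens a new one.
            # Two consecutive zeros would yield a cycle with no inner events.
            if begin_time is not None:
                yield begin_time, time, collected
            begin_time = time
            collected = []
        elif begin_time is not None:
            # Events before the first status-0 event belong to no cycle.
            collected.append((time, status))


events = [
    ("12:00", 0), ("13:00", 8), ("14:00", 4), ("15:00", 2),
    ("16:00", 0), ("17:00", 9), ("18:00", 5), ("19:00", 8),
    ("20:00", 0), ("21:00", 1), ("22:00", 3), ("23:00", 0),
]

cycles = list(find_cycles(events))
for begin, end, inner in cycles:
    print(begin, end, inner)
```

Each event is visited exactly once, which is where the O(n) bound comes from.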

So my question boils down to: how can we solve this in O(n) in SQL? I believe that one of the problems is that the FILTER clause for aggregate window functions is not implemented in PostgreSQL, but that might be irrelevant here.


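Solution

You can use a running count of the status = 0 rows to assign each event to a group, and then take the minimum and maximum time of each group.

If the table has a serial id (primary key) that reflects the event order, ordering by it instead of time may give better performance.

Because each intermediate status = 0 row closes one group and opens the next, I've added a column holding the time of the next row (next_time), so that max(next_time) gives the end time of each group.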

with ev as
(
  select
      time, status,
      lead(time) over (order by time) as next_time,
      -- the event table defined in the question has no id column, so order by time
      sum(case when status = 0 then 1 else 0 end) over (order by time) as grp
  from
      event
)
select 
    min(time) as min_time,
    max(next_time) as max_time,
    array_agg(row(time, status)) filter (where status <> 0) as events
from 
    ev
group by
    grp
order by
    grp;

min_time | max_time | events                                        
:------- | :------- | :---------------------------------------------
12:00:00 | 16:00:00 | {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00 | 20:00:00 | {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00 | 23:00:00 | {"(21:00:00,1)","(22:00:00,3)"}               
23:00:00 | null     | null                                          
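
To make the grouping trick easier to follow, here is a rough Python analogue of the query above. It is a sketch, not a translation of how PostgreSQL executes the query: itertools.accumulate stands in for the running sum() window, a shifted list stands in for lead(), and the variable names are chosen here for illustration. It assumes the events are sorted by time and that the first event has status 0.

```python
from itertools import accumulate, groupby

events = [
    ("12:00", 0), ("13:00", 8), ("14:00", 4), ("15:00", 2),
    ("16:00", 0), ("17:00", 9), ("18:00", 5), ("19:00", 8),
    ("20:00", 0), ("21:00", 1), ("22:00", 3), ("23:00", 0),
]

# Running count of status-0 rows, the analogue of
# sum(case when status = 0 then 1 else 0 end) over (order by time).
grp = list(accumulate(1 if status == 0 else 0 for _, status in events))

# Analogue of lead(time) over (order by time); None for the last row.
next_time = [time for time, _ in events[1:]] + [None]

result = []
rows = zip(grp, events, next_time)
for _, members in groupby(rows, key=lambda row: row[0]):
    members = list(members)
    min_time = members[0][1][0]  # rows are time-ordered, so the first is the minimum
    max_time = members[-1][2]    # next_time of the last row in the group
    inner = [event for _, event, _ in members if event[1] != 0]
    result.append((min_time, max_time, inner))

# The final status-0 row opens a group that is never closed (max_time is None),
# which corresponds to the last row of the query output; drop it if unwanted.
cycles = [row for row in result if row[1] is not None]

for row in cycles:
    print(row)
```

The same filtering idea should carry over to the SQL version, for example by discarding groups whose max(next_time) is null.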


Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange