Aggregate rows for each group of rows
04-03-2021
Question
I am trying to solve a problem which is quite easy to solve in a procedural language, but I fail to solve it efficiently in SQL.
Let me first explain the problem. I have a series of events, each of which happens at a certain point in time. To keep it simple, let's assume that each event happens at a distinct point in time. An event is represented by a number (its status). Take for example the following data:
create table event
(
time time,
status integer
);
insert into event
values
('12:00', 0),
('13:00', 8),
('14:00', 4),
('15:00', 2),
('16:00', 0),
('17:00', 9),
('18:00', 5),
('19:00', 8),
('20:00', 0),
('21:00', 1),
('22:00', 3),
('23:00', 0);
Now, a cycle is defined as a sequence of events happening between two events with status 0. So, for the data above, I have the following cycles:
cycle 1: 0 -> 8 -> 4 -> 2 -> 0
cycle 2: 0 -> 9 -> 5 -> 8 -> 0
cycle 3: 0 -> 1 -> 3 -> 0
The goal is to find these cycles.
I have a working solution (see fiddle), and it goes as follows:
with
cycle_boundary(begin_time, end_time) as
(
    select begin_time, end_time
    from (
        select time, lead(time) over(order by time)
        from event
        where status = 0
    ) as cycle(begin_time, end_time)
    where end_time is not null
)
select begin_time, end_time, array_agg(row(time, status)) as events
from cycle_boundary
cross join event
where cycle_boundary.begin_time < event.time and
      cycle_boundary.end_time > event.time
group by cycle_boundary.begin_time,
         cycle_boundary.end_time;
This outputs:
begin_time end_time events
12:00:00 16:00:00 {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00 20:00:00 {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00 23:00:00 {"(21:00:00,1)","(22:00:00,3)"}
The problem is that this solution is quite inefficient. First, I scan through all the events to find the boundaries (which are two subsequent events with status 0), and then, for each boundary, I scan through the events again to find the ones it contains. This is basically a nested loop, so O(n^2).
In a procedural language, this can be easily solved in O(n) under the pre-condition that the events are sorted (which we can achieve in a database as well if there is an index on event(time)): loop through the ordered events and collect the events (in a temporary collection) as long as we do not encounter an event with status 0; once we encounter such an event, we output the events collected so far and clear this temporary collection.
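The single-pass loop described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the events arrive already sorted by time as (time, status) tuples, times are zero-padded "HH:MM" strings, and the function name `find_cycles` is hypothetical rather than something from the question.

```python
from typing import Iterable, List, Tuple

Event = Tuple[str, int]  # (time, status); assumed to be sorted by time


def find_cycles(events: Iterable[Event]) -> List[Tuple[str, str, List[Event]]]:
    """Single pass over sorted events; returns (begin_time, end_time, inner events)."""
    cycles = []
    begin_time = None  # time of the most recent status-0 event, if any
    collected = []     # events seen since that status-0 event
    for time, status in events:
        if status == 0:
            # A status-0 event closes the open cycle (if there is one)
            # and opens the next one.
            if begin_time is not None:
                cycles.append((begin_time, time, collected))
            begin_time = time
            collected = []
        elif begin_time is not None:
            # Events before the first status-0 event belong to no cycle.
            collected.append((time, status))
    return cycles


events = [
    ("12:00", 0), ("13:00", 8), ("14:00", 4), ("15:00", 2),
    ("16:00", 0), ("17:00", 9), ("18:00", 5), ("19:00", 8),
    ("20:00", 0), ("21:00", 1), ("22:00", 3), ("23:00", 0),
]

for begin, end, inner in find_cycles(events):
    print(begin, end, inner)
```

Each event is visited exactly once, so the work is linear in the number of events. Note one difference from the join-based query: two adjacent status-0 events produce a cycle with an empty event list here, whereas the join returns no row for that pair.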
So my question boils down to: how can we solve this in O(n) in SQL? I believe that one of the problems is that the FILTER clause for aggregate window functions is not implemented in PostgreSQL, but that might be irrelevant here.
Solution
You can use status to define groups, and then get the min and max time of each group.
If there is a serial id (PK) and it can be used to set the order, you may get better performance by ordering on it.
Because each intermediate status = 0 row belongs to two groups (it ends one cycle and begins the next), I've added a new column with the time of the next row, so that max() over that column gives the end time of each group.
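with ev as
(
    select
        time, status,
        -- time of the following event; for the last row of a group
        -- this is the status-0 event that closes the cycle
        lead(time) over (order by time) as next_time,
        -- running count of status-0 rows: all rows of one cycle share a number
        -- (the event table above has no id column, so this orders by time;
        -- use a serial id here instead if one exists)
        sum(case when status = 0 then 1 else 0 end) over (order by time) as grp
    from
        event
)
select
    min(time) as min_time,
    max(next_time) as max_time,
    array_agg(row(time, status)) filter (where status <> 0) as events
from
    ev
group by
    grp
order by
    grp;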
min_time | max_time | events
:------- | :------- | :---------------------------------------------
12:00:00 | 16:00:00 | {"(13:00:00,8)","(14:00:00,4)","(15:00:00,2)"}
16:00:00 | 20:00:00 | {"(17:00:00,9)","(18:00:00,5)","(19:00:00,8)"}
20:00:00 | 23:00:00 | {"(21:00:00,1)","(22:00:00,3)"}
23:00:00 | null     | null
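To make the grouping idea concrete, the following Python sketch mimics what the query computes: a running count of status-0 rows serves as the group key, and every row carries the time of the row after it. It illustrates the logic only and says nothing about how PostgreSQL executes the query; the function name `group_by_running_zero_count` and the tuple layout are assumptions made for this example.

```python
from itertools import accumulate, groupby
from typing import List, Optional, Tuple

Event = Tuple[str, int]  # (time, status); assumed to be sorted by time


def group_by_running_zero_count(events: List[Event]):
    # Counterpart of lead(time) over (order by time): time of the following row.
    next_times: List[Optional[str]] = [t for t, _ in events[1:]] + [None]
    # Counterpart of sum(case when status = 0 then 1 else 0 end) over (order by time).
    grps = list(accumulate(1 if status == 0 else 0 for _, status in events))

    rows = zip(grps, events, next_times)
    result = []
    # Rows are already ordered by grp, so groupby sees each group exactly once.
    for _, members in groupby(rows, key=lambda r: r[0]):
        members = list(members)
        min_time = min(ev[0] for _, ev, _ in members)
        # Like SQL's max(), ignore missing values; None if there are none left.
        nexts = [nt for _, _, nt in members if nt is not None]
        max_time = max(nexts) if nexts else None
        # Like array_agg(...) filter (where status <> 0): None when nothing qualifies.
        inner = [ev for _, ev, _ in members if ev[1] != 0]
        result.append((min_time, max_time, inner or None))
    return result


events = [
    ("12:00", 0), ("13:00", 8), ("14:00", 4), ("15:00", 2),
    ("16:00", 0), ("17:00", 9), ("18:00", 5), ("19:00", 8),
    ("20:00", 0), ("21:00", 1), ("22:00", 3), ("23:00", 0),
]

for row in group_by_running_zero_count(events):
    print(row)
```

As in the query output, the final status-0 event forms a group of its own with no end time and no events; filtering on a non-null max_time would drop it.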
db<>fiddle here