How to select blocks with sequential data and aggregate the ids
20-01-2021
Question
I have the following table:
id(int) startDate(timestamp) endDate(timestamp) plan_id(int) planned(bool) machine(int)
--------------------------------------------------------------------------------------------------------------
2005 '2019-01-16 08:29:24.872736' '2019-01-16 08:30:23.529706' 34 true 6
2004 '2019-01-16 08:19:28.011148' '2019-01-16 08:29:22.680828' 34 true 6
2003 '2019-01-16 08:18:27.074312' '2019-01-16 08:19:25.753475' 34 true 6
2002 '2019-01-16 08:08:30.206288' '2019-01-16 08:18:24.856308' 34 true 6
2001 '2019-01-16 08:07:29.163124' '2019-01-16 08:08:27.949013' 34 true 6
2000 '2019-01-16 07:59:03.221309' '2019-01-16 08:00:14.654391' null false 7
1999 '2019-01-16 08:00:00.986367' '2019-01-16 08:00:03.221309' null false 6
1998 '2019-01-16 07:57:30.711044' '2019-01-16 07:59:58.778444' null false 6
1997 '2019-01-16 07:56:32.466508' '2019-01-16 07:57:28.489287' null false 6
1996 '2019-01-16 07:50:06.887349' '2019-01-16 07:56:30.237725' null false 6
1995 '2019-01-16 07:46:34.327582' '2019-01-16 07:50:04.619592' 33 true 6
1994 '2019-01-16 07:45:33.813483' '2019-01-16 07:46:32.014849' 33 true 6
1993 '2019-01-16 07:24:39.267365' '2019-01-16 07:39:23.786911' null false 6
1992 '2019-01-16 07:23:39.646218' '2019-01-16 07:24:37.093414' null false 6
1991 '2019-01-16 07:13:41.166337' '2019-01-16 07:23:37.403375' null false 6
1990 '2019-01-16 07:12:39.961234' '2019-01-16 07:13:38.907838' null false 6
1989 '2019-01-16 07:10:46.984236' '2019-01-16 07:12:37.647108' null false 6
1988 '2019-01-15 17:05:59.832834' '2019-01-15 17:08:21.603931' 31 true 6
1987 '2019-01-15 17:04:59.567046' '2019-01-15 17:05:57.565188' 31 true 6
1986 '2019-01-15 17:00:01.411266' '2019-01-15 17:10:57.255158' 31 true 7
I have to select the IDs of the blocks of unplanned records for a specific machine. I have been trying to use window functions but, unfortunately, I couldn't work out the logic.
The problem is that, since we have different machines, we cannot rely on sequential ids, only on the fact that the endDate of one record in a sequence is very close to the next startDate (it is OK to set a tolerance constant, e.g. 3 seconds).
I would like a query whose result is: the min startDate, the max endDate, and the IDs of the block. For this sample with machine = 6, it would be:
blockStartDate blockEndDate ids
-------------------------------------------------------------------------------
"2019-01-16 07:50:06.887349" "2019-01-16 08:00:03.221309" [1999,1998,1997,1996]
"2019-01-16 07:10:46.984236" "2019-01-16 07:39:23.786911" [1989,1990,1991,1992,1993]
Note that the answer, in this case, has sequential IDs but this is not always the case. I am working on providing real data where 2 machines are producing data at the same time and the ids become useless.
Solution
If blocks are defined by continuously incrementing IDs (or continuously incrementing startDate or endDate, same principle) with the same machine and NOT planned, ignoring possible gaps, only separated by rows with planned IS NOT FALSE, then there is no need to speculate with tolerance between end & begin and you can just use this:
SELECT min(startDate) AS block_start_date
     , max(endDate)   AS block_end_date
     , array_agg(id)  AS ids
FROM  (
   SELECT id, startDate, endDate, planned
        , row_number() OVER (ORDER BY id)
        - row_number() OVER (PARTITION BY planned ORDER BY id) AS grp
   FROM   tbl
   WHERE  machine = 6
   ORDER  BY id  -- to get sorted arrays
   ) sub
WHERE  NOT planned
GROUP  BY grp;
db<>fiddle here (building on McNets' fiddle)
Note, this returns uniformly sorted arrays, unlike your example with varying sort order.
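To see why the difference of the two row numbers identifies blocks, here is a minimal sketch on a made-up data set (the CTE t, its six rows, and the helper columns rn_all and rn_part are hypothetical, not part of the original table). Within a run of consecutive rows with the same planned value, both row numbers advance in step, so their difference stays constant; a row with the other planned value in between shifts it.

```sql
-- Hypothetical rows for one machine: ids 1..6, two unplanned blocks (1,2) and (4,5).
WITH t(id, planned) AS (
   VALUES (1, false), (2, false), (3, true), (4, false), (5, false), (6, true)
)
SELECT id, planned
     , row_number() OVER (ORDER BY id)                      AS rn_all   -- position among all rows
     , row_number() OVER (PARTITION BY planned ORDER BY id) AS rn_part  -- position among rows with the same planned value
     , row_number() OVER (ORDER BY id)
     - row_number() OVER (PARTITION BY planned ORDER BY id) AS grp      -- constant within each block
FROM   t
ORDER  BY id;

-- id | planned | rn_all | rn_part | grp
--  1 | f       |      1 |       1 |   0
--  2 | f       |      2 |       2 |   0
--  3 | t       |      3 |       1 |   2
--  4 | f       |      4 |       3 |   1
--  5 | f       |      5 |       4 |   1
--  6 | t       |      6 |       2 |   4
```

The grp value is only meaningful together with planned: with other data, a planned row and an unplanned row can end up with the same number. That is why the query filters on NOT planned before grouping by grp.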
Aside: planned seems to say no more than plan_id IS NOT NULL. If so, you can remove the redundant column completely. Redundant columns only add cost.
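Whether the column really is redundant can be checked before dropping anything. A minimal sketch, assuming the table is called tbl as in the query above and that "redundant" means planned always equals (plan_id IS NOT NULL):

```sql
-- Counts rows where planned disagrees with (plan_id IS NOT NULL).
-- IS DISTINCT FROM is used so that rows with planned IS NULL are counted as mismatches too.
SELECT count(*) AS mismatches
FROM   tbl
WHERE  planned IS DISTINCT FROM (plan_id IS NOT NULL);
```

If this returns 0 for the full data set, filtering on plan_id IS NULL should select the same rows as NOT planned.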
OTHER TIPS
Basically, it sets a reset point each time the gap between startDate and the previous endDate is greater than 3 seconds.
It then forms blocks (groups) by taking the running sum of the reset points, and finally returns the MIN/MAX dates and an array with the aggregated ids of every group.
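WITH x AS
(
    SELECT
        id,
        startDate,
        endDate,
        machine,
        -- Compare the whole interval with the tolerance. date_part('second', ...)
        -- would return only the seconds field, so a gap of 1 min 2 s would count as 2.
        -- The first row has no predecessor; the coalesce makes it a reset point.
        CASE WHEN
            startDate -
            coalesce(lag(endDate) OVER (PARTITION BY machine ORDER BY startDate),
                     startDate - interval '10 seconds') > interval '3 seconds'
        THEN 1 ELSE 0 END AS reset
    FROM
        tbl
    WHERE
        machine = 6
        AND planned = false
), y AS
(
    SELECT
        id, startDate, endDate,
        -- running sum of reset points = block number
        sum(reset) OVER (PARTITION BY machine ORDER BY startDate) grp
    FROM
        x
)
SELECT
    MIN(startDate) blockStartDate,
    MAX(endDate) blockEndDate,
    array_agg(id ORDER BY startDate) ids  -- explicit order for a deterministic array
FROM
    y
GROUP BY
    grp
ORDER BY
    grp;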
blockstartdate             | blockenddate               | ids
:------------------------- | :------------------------- | :-------------------------
2019-01-16 07:10:46.984236 | 2019-01-16 07:39:23.786911 | {1989,1990,1991,1992,1993}
2019-01-16 07:50:06.887349 | 2019-01-16 08:00:03.221309 | {1996,1997,1998,1999}
db<>fiddle here
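A minimal sketch of the reset-and-running-sum idea on four made-up rows (the CTE t, its ids and timestamps, and the helper columns gap and seconds_field are hypothetical). It also illustrates a detail worth keeping in mind when measuring the gap: date_part('second', ...) returns only the seconds field of an interval, so a gap of 1 minute 2 seconds yields 2, whereas comparing the interval itself with the tolerance treats it as longer than 3 seconds.

```sql
-- Hypothetical rows for a single machine; only the start and end times matter here.
WITH t(id, startDate, endDate) AS (
   VALUES (1, timestamp '2019-01-16 07:00:00', timestamp '2019-01-16 07:05:00')
        , (2, timestamp '2019-01-16 07:05:02', timestamp '2019-01-16 07:10:00')  -- 2 s gap: same block
        , (3, timestamp '2019-01-16 07:11:02', timestamp '2019-01-16 07:15:00')  -- 1 min 2 s gap: new block
        , (4, timestamp '2019-01-16 07:15:01', timestamp '2019-01-16 07:20:00')  -- 1 s gap: same block
)
, x AS (
   SELECT *
        , startDate - lag(endDate) OVER (ORDER BY startDate) AS gap  -- NULL for the first row
   FROM   t
)
SELECT id, gap
     , date_part('second', gap)                               AS seconds_field  -- seconds field only
     , CASE WHEN gap > interval '3 seconds' THEN 1 ELSE 0 END AS reset
     , sum(CASE WHEN gap > interval '3 seconds' THEN 1 ELSE 0 END)
          OVER (ORDER BY startDate)                           AS grp            -- running sum of resets
FROM   x
ORDER  BY startDate;

-- id | gap      | seconds_field | reset | grp
--  1 | (null)   | (null)        |     0 |   0
--  2 | 00:00:02 |             2 |     0 |   0
--  3 | 00:01:02 |             2 |     1 |   1
--  4 | 00:00:01 |             1 |     0 |   1
```

Rows 2 and 3 have the same seconds_field, yet only row 3 starts a new block. In this sketch the first row gets reset = 0 because its gap is NULL; the grouping is unaffected, since only changes in the running sum matter.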