How to select blocks with sequential data and aggregate the ids

https://dba.stackexchange.com/questions/227382

20-01-2021

Question

I have the following table:

 id(int) startDate(timestamp)         endDate(timestamp)            plan_id(int)  planned(bool)  machine(int)
--------------------------------------------------------------------------------------------------------------
  2005  '2019-01-16 08:29:24.872736'  '2019-01-16 08:30:23.529706'    34          true              6
  2004  '2019-01-16 08:19:28.011148'  '2019-01-16 08:29:22.680828'    34          true              6
  2003  '2019-01-16 08:18:27.074312'  '2019-01-16 08:19:25.753475'    34          true              6
  2002  '2019-01-16 08:08:30.206288'  '2019-01-16 08:18:24.856308'    34          true              6
  2001  '2019-01-16 08:07:29.163124'  '2019-01-16 08:08:27.949013'    34          true              6
  2000  '2019-01-16 07:59:03.221309'  '2019-01-16 08:00:14.654391'    null        false             7
  1999  '2019-01-16 08:00:00.986367'  '2019-01-16 08:00:03.221309'    null        false             6
  1998  '2019-01-16 07:57:30.711044'  '2019-01-16 07:59:58.778444'    null        false             6
  1997  '2019-01-16 07:56:32.466508'  '2019-01-16 07:57:28.489287'    null        false             6
  1996  '2019-01-16 07:50:06.887349'  '2019-01-16 07:56:30.237725'    null        false             6
  1995  '2019-01-16 07:46:34.327582'  '2019-01-16 07:50:04.619592'    33          true              6
  1994  '2019-01-16 07:45:33.813483'  '2019-01-16 07:46:32.014849'    33          true              6
  1993  '2019-01-16 07:24:39.267365'  '2019-01-16 07:39:23.786911'    null        false             6
  1992  '2019-01-16 07:23:39.646218'  '2019-01-16 07:24:37.093414'    null        false             6
  1991  '2019-01-16 07:13:41.166337'  '2019-01-16 07:23:37.403375'    null        false             6
  1990  '2019-01-16 07:12:39.961234'  '2019-01-16 07:13:38.907838'    null        false             6
  1989  '2019-01-16 07:10:46.984236'  '2019-01-16 07:12:37.647108'    null        false             6
  1988  '2019-01-15 17:05:59.832834'  '2019-01-15 17:08:21.603931'    31          true              6
  1987  '2019-01-15 17:04:59.567046'  '2019-01-15 17:05:57.565188'    31          true              6
  1986  '2019-01-15 17:00:01.411266'  '2019-01-15 17:10:57.255158'    31          true              7

I have to select the IDs of the blocks of unplanned records for a specific machine. I have been trying to use window functions, but I couldn't work out the logic.

The problem is that, since we have different machines, we cannot rely on sequential ids. We can only rely on the endDate of one record being very close to the startDate of the next record in the same sequence (it is OK to set a tolerance constant, e.g. 3 seconds).

I would like to have a query where the result would be: the min startDate, the max endDate and the IDs of the block. For this sample with machine = 6, it would be:

blockStartDate                blockEndDate                  ids
-------------------------------------------------------------------------------
"2019-01-16 07:50:06.887349" "2019-01-16 08:00:03.221309" [1999,1998,1997,1996]
"2019-01-16 07:10:46.984236" "2019-01-16 07:39:23.786911" [1989,1990,1991,1992,1993]

Note that the answer, in this case, has sequential IDs but this is not always the case. I am working on providing real data where 2 machines are producing data at the same time and the ids become useless.

Solution

If a block is defined as a run of rows for the same machine with NOT planned, ordered by id (or by startDate or endDate, same principle), ignoring possible gaps and separated only by rows with planned IS NOT FALSE, then there is no need to work with a tolerance between end and start. You can just use this:

SELECT min(startDate) AS block_start_date
     , max(endDate)   AS block_end_date
     , array_agg(id)  AS ids
FROM (
   SELECT id, startDate, endDate, planned
        , row_number() OVER (ORDER BY id)
        - row_number() OVER (PARTITION BY planned ORDER BY id) AS grp
   FROM   tbl
   WHERE  machine = 6
   ORDER  BY id  --  to get sorted arrays
   ) sub
WHERE  NOT planned
GROUP  BY grp;

db<>fiddle here (building on McNets' fiddle)

Basics:

Select longest continuous sequence

Note, this returns uniformly sorted arrays, unlike your example with varying sort order.
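
To see why the difference of the two row_number() calls identifies a block, it can help to look at the intermediate values. The following is a minimal sketch for PostgreSQL. It uses a hypothetical inline sample (a few machine-6 rows from the question) instead of the real table, and the aliases rn_all and rn_same are illustrative names only.

```sql
-- Hypothetical inline sample so the query runs without the real table.
WITH tbl(id, planned) AS (
   VALUES (1992, false), (1993, false)
        , (1994, true),  (1995, true)
        , (1996, false), (1997, false)
)
SELECT id, planned
     , row_number() OVER (ORDER BY id)                      AS rn_all   -- position among all rows
     , row_number() OVER (PARTITION BY planned ORDER BY id) AS rn_same  -- position among rows with the same flag
     , row_number() OVER (ORDER BY id)
     - row_number() OVER (PARTITION BY planned ORDER BY id) AS grp      -- constant within one run
FROM   tbl
ORDER  BY id;

--   id  | planned | rn_all | rn_same | grp
-- ------+---------+--------+---------+-----
--  1992 | f       |      1 |       1 |   0
--  1993 | f       |      2 |       2 |   0
--  1994 | t       |      3 |       1 |   2
--  1995 | t       |      4 |       2 |   2
--  1996 | f       |      5 |       3 |   2
--  1997 | f       |      6 |       4 |   2
```

Within a run of rows sharing the same flag, both counters advance together, so their difference stays constant. Note that grp is only meaningful per value of planned: rows 1994 to 1997 all get 2 here, which is why the outer query filters on NOT planned before grouping by grp.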

Aside: planned seems to say no more than plan_id IS NOT NULL. If so, you can remove the redundant column completely. Redundant columns only add cost.
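
Before dropping the column, one might want to confirm that assumption. A minimal sketch, again with a hypothetical inline sample standing in for the real table: it counts rows where planned disagrees with plan_id IS NOT NULL.

```sql
-- Hypothetical inline sample; replace the CTE with the real table.
WITH tbl(id, plan_id, planned) AS (
   VALUES (1995, 33,   true)
        , (1996, NULL, false)
        , (1997, NULL, false)
)
SELECT count(*) AS mismatches   -- 0 means planned carries no extra information
FROM   tbl
WHERE  planned IS DISTINCT FROM (plan_id IS NOT NULL);
```

If the count is 0 on the real data, NOT planned in the query above could be written as plan_id IS NULL instead.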

OTHER TIPS

Basically, the query below sets a reset point each time the difference between startDate and the previous endDate is greater than 3 seconds.

It then forms blocks (groups) by taking a running sum of the reset points, and finally returns the MIN/MAX dates and an array with the aggregated ids of every group.

WITH x AS
(
SELECT
    id,
    startDate,
    endDate,
    machine,
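    -- Flag a new block when the gap to the previous endDate exceeds the tolerance.
    -- Compare intervals directly: date_part('second', ...) returns only the
    -- seconds field (0-59), so a gap of e.g. 1 min 2 s would not be detected.
    case when
         startDate -
         coalesce(lag(endDate) over (partition by machine order by startDate),
                  startDate - interval '10 seconds') > interval '3 seconds'
         then 1 else 0 end as reset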

FROM
    tbl
WHERE
    machine = 6
    AND planned = false
ORDER BY
    machine,
    startDate
), y AS
 (
 SELECT 
     id, startDate, endDate,
     sum(reset) over (partition by machine order by machine, startDate) grp
 FROM
     x
 )
 SELECT
     MIN(startDate) blockStartDate,
     MAX(endDate) blockEndDate,
     array_agg(id ORDER BY startDate) ids
 FROM
     y
 GROUP BY
     grp
 ORDER BY
     grp;

blockstartdate             | blockenddate               | ids                       
:------------------------- | :------------------------- | :-------------------------
2019-01-16 07:10:46.984236 | 2019-01-16 07:39:23.786911 | {1989,1990,1991,1992,1993}
2019-01-16 07:50:06.887349 | 2019-01-16 08:00:03.221309 | {1996,1997,1998,1999}
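
The reset / running-sum logic can be inspected row by row. The following is a minimal sketch for PostgreSQL under stated assumptions: it uses a hypothetical inline sample (four unplanned machine-6 rows from the question), compares the gap as an interval against the 3-second tolerance, and adds an illustrative column sec_field to show what date_part('second', ...) would return for the same gap.

```sql
-- Hypothetical inline sample so the query runs without the real table.
WITH tbl(id, startDate, endDate) AS (
   VALUES (1992, timestamp '2019-01-16 07:23:39.646218', timestamp '2019-01-16 07:24:37.093414')
        , (1993, '2019-01-16 07:24:39.267365', '2019-01-16 07:39:23.786911')
        , (1996, '2019-01-16 07:50:06.887349', '2019-01-16 07:56:30.237725')
        , (1997, '2019-01-16 07:56:32.466508', '2019-01-16 07:57:28.489287')
)
SELECT id, gap
     , date_part('second', gap)           AS sec_field  -- seconds field only, minutes are ignored
     , reset
     , sum(reset) OVER (ORDER BY startDate) AS grp       -- running sum = block number
FROM  (
   SELECT id, startDate
        , startDate - lag(endDate) OVER w AS gap
        , CASE WHEN startDate - lag(endDate) OVER w > interval '3 seconds'
               THEN 1 ELSE 0 END          AS reset       -- NULL gap (first row) falls through to 0
   FROM   tbl
   WINDOW w AS (ORDER BY startDate)
   ) sub
ORDER  BY startDate;

--   id  |       gap       | sec_field | reset | grp
-- ------+-----------------+-----------+-------+-----
--  1992 |                 |           |     0 |   0
--  1993 | 00:00:02.173951 |  2.173951 |     0 |   0
--  1996 | 00:10:43.100438 | 43.100438 |     1 |   1
--  1997 | 00:00:02.228783 |  2.228783 |     0 |   1
```

The first row has no predecessor, so its gap is NULL and it simply starts block 0; the grouping is the same as with a coalesce() default. The sec_field column shows why comparing intervals is the safer test: for the 10 min 43 s gap it yields 43.1, which happens to exceed 3, but a gap of exactly 1 min 2 s would yield 2 and go unnoticed.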

Licensed under: CC-BY-SA with attribution
