Question

I think the best way to describe what I am looking for is to show a table of data and what I want returned from my Query. This is a simple data table in SQL Server:

JobNumber TimeOfWeigh 
100       01/01/2014 08:00 
100       01/01/2014 09:00 
100       01/01/2014 10:00 
200       01/01/2014 12:00 
200       01/01/2014 13:00 
300       01/01/2014 15:00 
300       01/01/2014 16:00 
100       02/01/2014 08:00 
100       02/01/2014 09:00 
100       03/01/2014 10:00 

I want a query that groups by job and returns the first and last DateTime from each group. However, as you can see, there are two sets of Job Number 100, and I don't want the second set merged with the first.

Instead I would like this:

JobNumber   First Weigh         Last Weigh
100         01/01/2014 08:00    01/01/2014 10:00
200         01/01/2014 12:00    01/01/2014 13:00
300         01/01/2014 15:00    01/01/2014 16:00
100         02/01/2014 08:00    03/01/2014 10:00

I have been struggling with this for hours. Any help would be appreciated.

EDITED

The dates and times are just dummy data; the actual table has thousands of weighs within one day. I want the first and last weigh of each job to determine its duration so I can represent it on a timeline, but I want Job 100 displayed twice, indicating it was paused and resumed after Jobs 200 and 300 were completed.


Solution

Here's my attempt at this, using row_number() with a partition. I've broken it into steps to hopefully make it easy to follow. If your table already has a column with integer identifiers in it, then you can omit the first CTE. Even after that, you might be able to simplify this further, but it does appear to work.

(Edited to add a flag indicating jobs with multiple ranges as requested in a comment.)

declare @sampleData table (JobNumber int, TimeOfWeigh datetime);
insert into @sampleData values
    (100, '01/01/2014 08:00'),
    (100, '01/01/2014 09:00'), 
    (100, '01/01/2014 10:00'),
    (200, '01/01/2014 12:00'),
    (200, '01/01/2014 13:00'),
    (300, '01/01/2014 15:00'),
    (300, '01/01/2014 16:00'),
    (100, '02/01/2014 08:00'),
    (100, '02/01/2014 09:00'),
    (100, '03/01/2014 10:00');

-- The first CTE assigns an ordering to the records according to TimeOfWeigh,
-- giving each row a consecutive integer row number.
with JobsCTE as
(    
    select 
        row_number() over (order by TimeOfWeigh) as RowNumber, 
        JobNumber,
        TimeOfWeigh
    from @sampleData
),

-- The second CTE orders by the RowNumber we created above, but restarts the
-- ordering every time the JobNumber changes. The difference between RowNumber
-- and this new ordering will be constant within each group.
GroupsCTE as
(
    select
        RowNumber - row_number() over (partition by JobNumber order by RowNumber) as GroupNumber,
        JobNumber,
        TimeOfWeigh
    from JobsCTE
),
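
-- For the sample data, Job 100 illustrates the arithmetic behind GroupNumber:
--   RowNumber   row_number() within job   GroupNumber
--       1                  1                   0
--       2                  2                   0
--       3                  3                   0
--       8                  4                   4
--       9                  5                   4
--      10                  6                   4
-- The two contiguous runs of Job 100 therefore land in different groups.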

-- This CTE finds jobs that fall into more than one group; it is left-joined
-- below to flag them.
DuplicatedJobsCTE as
(
    select JobNumber 
    from GroupsCTE 
    group by JobNumber 
    having count(distinct GroupNumber) > 1
)

-- Finally, we use GroupNumber to get the mins and maxes from contiguous ranges.
select
    G.JobNumber,
    min(G.TimeOfWeigh) as [First Weigh],
    max(G.TimeOfWeigh) as [Last Weigh],
    case when D.JobNumber is null then 0 else 1 end as [Multiple Ranges]
from
    GroupsCTE G
    left join DuplicatedJobsCTE D on G.JobNumber = D.JobNumber
group by
    G.JobNumber,
    G.GroupNumber,
    D.JobNumber
order by
    [First Weigh];
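
For the sample data, the two Job 100 ranges come back with [Multiple Ranges] = 1, while Jobs 200 and 300 get 0. Since the stated goal is to chart durations on a timeline, the final select can also return the length of each range directly. Here is a minimal sketch of an alternative final select (it reuses GroupsCTE, so it would replace the final select in the query above; the [Duration Minutes] column name is mine, not from the question):

-- Alternative final select: same grouping, plus the length of each range in minutes.
select
    G.JobNumber,
    min(G.TimeOfWeigh) as [First Weigh],
    max(G.TimeOfWeigh) as [Last Weigh],
    datediff(minute, min(G.TimeOfWeigh), max(G.TimeOfWeigh)) as [Duration Minutes]
from GroupsCTE G
group by G.JobNumber, G.GroupNumber
order by [First Weigh];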

OTHER TIPS

You can use self-joins to create pseudo-tables that contain the first and last row of each run. This assumes the table (written as [table] below as a placeholder) has a consecutive integer id column that follows TimeOfWeigh order.

Select f.JobNumber, 
       f.TimeOfWeigh as FirstWeigh, 
       l.TimeOfWeigh as LastWeigh
From [table] f        -- candidate first record of a run
   join [table] l     -- candidate last record of a run
       on l.JobNumber = f.JobNumber 
          And f.id <= l.id   -- keep the pair in order
          -- f is the first row of its run: no same-job row immediately before it
          And Not exists
              (Select * from [table]
               Where JobNumber = f.JobNumber 
                  And id = f.id - 1)
          -- l is the last row of its run: no same-job row immediately after it
          And Not exists
              (Select * from [table]
               Where JobNumber = f.JobNumber 
                  And id = l.id + 1)
          -- nothing from another job sits between f and l
          And Not Exists
              (Select * from [table]
               Where JobNumber <> f.JobNumber 
                  And id Between f.id and l.id);
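
If the real table has no such id column, one way to manufacture it is a ROW_NUMBER() CTE that the self-join can then read from instead of [table]; a minimal sketch (the NumberedWeighs name is mine, not from the question):

-- Hypothetical helper: derive a consecutive id when the table has none.
with NumberedWeighs as
(
    select JobNumber,
           TimeOfWeigh,
           row_number() over (order by TimeOfWeigh) as id
    from [table]   -- placeholder table name, as above
)
select JobNumber, TimeOfWeigh, id
from NumberedWeighs
order by id;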

This one fascinated me when I saw it, and I wondered how I would go about solving it. I was too busy to get an answer in first; I got it working later but have sat on it for a few days since. I still understand what I devised, which is a good sign :)

I've added some extra data at the end to demonstrate that this works with single-row JobNumber entries, rather than assuming that weighings will always be in batches, but the first rows in the results match the original solution.

This approach also uses cascading CTEs (one more than the accepted answer here, but I won't let that discourage me!), with the first being the test data setup:

With Weighs AS   -- sample data
(
SELECT 100 AS JobNumber, '01/01/2014 08:00' AS TimeOfWeigh UNION ALL 
SELECT 100 AS JobNumber, '01/01/2014 09:00' AS TimeOfWeigh UNION ALL 
SELECT 100 AS JobNumber, '01/01/2014 10:00' AS TimeOfWeigh UNION ALL 
SELECT 200 AS JobNumber, '01/01/2014 12:00' AS TimeOfWeigh UNION ALL 
SELECT 200 AS JobNumber, '01/01/2014 13:00' AS TimeOfWeigh UNION ALL 
SELECT 300 AS JobNumber, '01/01/2014 15:00' AS TimeOfWeigh UNION ALL 
SELECT 300 AS JobNumber, '01/01/2014 16:00' AS TimeOfWeigh UNION ALL 
SELECT 100 AS JobNumber, '02/01/2014 08:00' AS TimeOfWeigh UNION ALL 
SELECT 100 AS JobNumber, '02/01/2014 09:00' AS TimeOfWeigh UNION ALL 
SELECT 100 AS JobNumber, '03/01/2014 10:00' AS TimeOfWeigh UNION ALL
SELECT 400 AS JobNumber, '04/01/2014 14:00' AS TimeOfWeigh UNION ALL
SELECT 300 AS JobNumber, '04/01/2014 14:30' AS TimeOfWeigh
)
,
Numbered AS  -- add on a unique consecutive row number
( SELECT *, ROW_NUMBER() OVER (ORDER BY TimeOfWeigh) AS ID FROM Weighs )
, 
GroupEnds AS  -- add on a 1/0 flag for whether it's the first or last in a run
( SELECT *,
    CASE WHEN -- next row is different JobNumber?
      (SELECT ID FROM Numbered n2 WHERE n2.ID=n1.ID+1 AND n2.JobNumber=n1.JobNumber) IS NULL
    THEN 1 ELSE 0 END AS GroupEnd,
    CASE WHEN -- previous row is different JobNumber?
      (SELECT ID FROM Numbered n2 WHERE n2.ID=n1.ID-1 AND n2.JobNumber=n1.JobNumber) IS NULL
    THEN 1 ELSE 0 END AS GroupBegin
  FROM Numbered n1 
)
,
Begins_and_Ends AS  -- make sure there are always matching pairs
( SELECT * FROM GroupEnds WHERE GroupBegin=1
    UNION ALL
  SELECT * FROM GroupEnds WHERE GroupEnd=1
)
,
Pairs AS  -- give matching pairs the same ID number for GROUPing next..
( SELECT *, (1+Row_Number() OVER (ORDER BY ID))/2 AS PairID
  FROM Begins_and_Ends
)
SELECT
  Min(JobNumber) AS JobNumber,
  Min(TimeOfWeigh) as [First Weigh],
  Max(TimeOfWeigh) as [Last Weigh]
FROM Pairs
GROUP BY PairID
ORDER BY PairID

The Numbered CTE is fairly obvious, giving an ordered ID number to each row.

CTE GroupEnds adds on a pair of booleans - a 1 or 0 if the row is the first or last in a run of JobNumbers - by trying to see if the next or previous row is the same JobNumber.

From there I simply needed a way to pair up the adjacent GroupBegins and GroupEnds. I played with the N-tile ranking function NTILE() to generate these pair numbers, counting the GroupEnds and SELECTing that result as the parameter for NTILE(), but this broke when there was an odd number of rows, caused by single-row batches where the same row is both the Begin and the End of a batch.

I got around this by guaranteeing an equal number of Begin and End rows: a UNION of Begin rows and End rows, even if some are the same rows. This is CTE Begins_and_Ends.

The Pairs CTE adds on pair numbers using Row_Number() divided by two, the integer result PairID being the same for each pair of rows (for example, row numbers 1 and 2 both give a PairID of 1, since (1+1)/2 and (1+2)/2 both truncate to 1).

This gives us the following - all rows in the middle of JobNumber batches have been filtered out by now:

JobNumber  TimeOfWeigh       ID  GroupEnd  GroupBegin  PairID
100        01/01/2014 08:00   1      0         1          1
100        01/01/2014 10:00   3      1         0          1
200        01/01/2014 12:00   4      0         1          2
200        01/01/2014 13:00   5      1         0          2
300        01/01/2014 15:00   6      0         1          3
300        01/01/2014 16:00   7      1         0          3
100        02/01/2014 08:00   8      0         1          4
100        03/01/2014 10:00  10      1         0          4
400        04/01/2014 14:00  11      1         1          5
400        04/01/2014 14:00  11      1         1          5
300        04/01/2014 14:30  12      1         1          6
300        04/01/2014 14:30  12      1         1          6

From there it's a final piece of cake to GROUP BY the PairID and grab the first and last weigh times. I enjoyed the challenge; I wonder if anyone else finds it useful in any weigh!
http://sqlfiddle.com/#!3/b4f39/48

Yep, this is a fascinating mind puzzle. Thank you for sharing it. I wanted to come up with a solution that does not involve EXISTS or JOINs.

First I created a table with a job id (j_id) and an integer value to be used for sequencing (j_v). Ints are just easier to type, and the logic is exactly the same as for datetimes.

     select * from j order by j_v;
 j_id | j_v 
------+-----
  100 |   1
  100 |   2
  100 |   2
  100 |   2
  100 |   2
  100 |   3
  200 |   4
  200 |   5
  300 |   6
  300 |   6
  300 |   6
  300 |   7
  300 |   7
  100 |   8
  100 |   9
(15 rows)

I used window functions and 3 CTEs:

  • The first one adds lead and lag values from the table
  • The second one filters, leaving only those rows that are either the start or the end of a job run
  • The third one introduces a row_number used to remove all the even rows.

Here you go:

with X AS (
select j_id, j_v,
       coalesce ( lag(j_id,1) OVER (MY_W), -1)  as j_id_lag,
       lag(j_v,1) over (MY_W) as j_v_lag,
       coalesce ( lead(j_id,1) OVER (MY_W), -1)  as j_id_lead,
       lead(j_v,1) over (MY_W) as j_v_lead
from j
WINDOW MY_W as ( ORDER BY j_v)
order by j_v 
),
Y AS ( 
select *
from X
where j_id_lag != j_id_lead
),
Z AS ( 
select * ,
      lead(j_v) OVER () AS L2,
      row_number() OVER () as my_row
from Y
) 
SELECT j_id, j_v as job_start ,l2 as job_end
from Z
where my_row %2 = 1
;
 j_id | job_start | job_end
------+-----------+---------
  100 |         1 |       3
  200 |         4 |       5
  300 |         6 |       7
  100 |         8 |       9
(4 rows)

Here is the query plan:

                                                    QUERY PLAN                                                     
--------------------------------------------------------------------------------------------------------------------
 CTE Scan on z  (cost=325.94..379.17 rows=11 width=12) (actual time=0.047..0.071 rows=4 loops=1)
   Filter: ((my_row % 2::bigint) = 1)
   Rows Removed by Filter: 4
   CTE x
     ->  WindowAgg  (cost=149.78..203.28 rows=2140 width=8) (actual time=0.027..0.039 rows=15 loops=1)
           ->  Sort  (cost=149.78..155.13 rows=2140 width=8) (actual time=0.019..0.019 rows=15 loops=1)
                 Sort Key: j.j_v
                 Sort Method: quicksort  Memory: 25kB
                 ->  Seq Scan on j  (cost=0.00..31.40 rows=2140 width=8) (actual time=0.004..0.006 rows=15 loops=1)
   CTE y
     ->  CTE Scan on x  (cost=0.00..48.15 rows=2129 width=24) (actual time=0.031..0.050 rows=8 loops=1)
           Filter: (j_id_lag <> j_id_lead)
           Rows Removed by Filter: 7
   CTE z
     ->  WindowAgg  (cost=0.00..74.51 rows=2129 width=24) (actual time=0.042..0.062 rows=8 loops=1)
           ->  CTE Scan on y  (cost=0.00..42.58 rows=2129 width=24) (actual time=0.031..0.052 rows=8 loops=1)
 Total runtime: 0.122 ms
(17 rows)

As you can see, there is one sort (to order the data by the sequence value, or by time in the original question) and several CTE scans, but no joins. The complexity is N log N for the sort, which is exactly what I was looking for.
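
Mapping this back to the original question's columns is mostly a rename exercise. Here is a minimal sketch in the same PostgreSQL dialect (the table name weighs is an assumption, and, like the integer example above, it expects every job run to contain at least two weighs so that the begin/end boundary rows pair up cleanly):

-- Sketch: same lead/lag approach applied directly to JobNumber/TimeOfWeigh.
with x as (
    select JobNumber, TimeOfWeigh,
           coalesce(lag(JobNumber)  over w, -1) as prev_job,
           coalesce(lead(JobNumber) over w, -1) as next_job
    from weighs                      -- assumed table name
    window w as (order by TimeOfWeigh)
),
y as (
    -- keep only rows that begin or end a run
    select * from x where prev_job <> next_job
),
z as (
    select *,
           lead(TimeOfWeigh) over (order by TimeOfWeigh) as next_time,
           row_number()      over (order by TimeOfWeigh) as rn
    from y
)
select JobNumber, TimeOfWeigh as "First Weigh", next_time as "Last Weigh"
from z
where rn % 2 = 1;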
