Count Items not In Time Range
05-02-2021
Question
I'm trying to count the items that are not within 30 seconds of the first item in their "group", and I'm having a hard time figuring it out.
So, I have this table:
WITH ADates AS (
SELECT
Id
, SharedId
, TheDateTime
FROM (VALUES
(CAST(1 AS int), CAST(1 AS int), CAST('2019-01-01 01:01:00.00' AS datetime2(7))),
(2, 1, '2019-01-01 01:01:33.00'),
(3, 1, '2019-01-01 01:02:00.00'),
(4, 1, '2019-01-01 01:02:01.00'),
(5, 1, '2019-01-01 01:02:04.00'),
(6, 1, '2019-01-01 01:06:15.00'),
(7, 2, '2019-01-01 01:06:00.00'),
(8, 2, '2019-01-01 01:06:45.00'),
(9, 1, '2019-01-01 01:02:31.00'),
(10, 2, '2019-01-01 01:06:05.00'),
(11, 2, '2019-01-01 01:06:46.00')
) X (Id, SharedId, TheDateTime)
)
So, the expected result that I'm looking for is:
+==========+=======+
| SharedId | Count |
+==========+=======+
| 1 | 4 |
+----------+-------+
| 2 | 2 |
+----------+-------+
The numbers are determined by (one entry per item, Ids 1 through 8):
- Item 1: Count since it is the first in a new group.
- Item 2: Not within 30 seconds of the previous group, so it is a new group and counts.
- Item 3: Don't count since it is within 30 seconds of item 2.
- Item 4: Don't count since it is within 30 seconds of item 2.
- Item 5: Count since it is not within 30 seconds of the previous group (item 2).
- Item 6: Count since it is not within 30 seconds of the previous group (item 2).
- Item 7: Count since it is in a new group for its SharedId.
- Item 8: Count since it is not within the previous grouping.
I'm thinking I should be using a window function for this. I'm just not sure how to have it rely on only the first item of the group.
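The counting rule listed above can be cross-checked outside the database. The following is a minimal Python sketch, not part of the original question: the function name `count_groups` and the `ROWS` list are illustrative, and the rule is assumed to be "a row opens a new group when it is more than 30 seconds after the first row of the current group for its SharedId".

```python
from datetime import datetime, timedelta

# Sample rows from the question: (Id, SharedId, TheDateTime).
ROWS = [
    (1, 1, "2019-01-01 01:01:00"),
    (2, 1, "2019-01-01 01:01:33"),
    (3, 1, "2019-01-01 01:02:00"),
    (4, 1, "2019-01-01 01:02:01"),
    (5, 1, "2019-01-01 01:02:04"),
    (6, 1, "2019-01-01 01:06:15"),
    (7, 2, "2019-01-01 01:06:00"),
    (8, 2, "2019-01-01 01:06:45"),
    (9, 1, "2019-01-01 01:02:31"),
    (10, 2, "2019-01-01 01:06:05"),
    (11, 2, "2019-01-01 01:06:46"),
]


def count_groups(rows, window_seconds=30):
    """Count groups per SharedId (hypothetical helper, assumed rule).

    A row starts a new group when it falls more than window_seconds
    after the first row of the current group.
    """
    window = timedelta(seconds=window_seconds)
    times_by_shared_id = {}
    for _id, shared_id, ts in rows:
        parsed = datetime.strptime(ts, "%Y-%m-%d %H:%M:%S")
        times_by_shared_id.setdefault(shared_id, []).append(parsed)

    counts = {}
    for shared_id, times in times_by_shared_id.items():
        group_start = None
        groups = 0
        for t in sorted(times):
            # Compare against the group's first row, not the previous row.
            if group_start is None or t - group_start > window:
                group_start = t
                groups += 1
        counts[shared_id] = groups
    return counts


print(count_groups(ROWS))  # {1: 4, 2: 2}
```

The comparison is always against the start of the current group, which is the part that a single window function cannot express directly.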
Solution
Problem Categorization
I was looking for a way to use an analytic function that keeps track of state as it moves through the rows. A single pass of an analytic function can only do so much, and not to the extent needed to solve this problem. The problem with nesting analytic functions is that we lose information about our dynamic pattern.
To allow dynamic inline pattern matching, in Oracle you can use MATCH_RECOGNIZE. I had no clue how to do it in SQL Server, though. Then I came across a similar problem, which was resolved using a recursive CTE.
Proposed Solution
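+==========+=====================+
| SharedId | GroupStartDateTime  |
+==========+=====================+
| 1        | 01/01/2019 01:01:00 |
+----------+---------------------+
| 1        | 01/01/2019 01:01:33 |
+----------+---------------------+
| 1        | 01/01/2019 01:02:04 |
+----------+---------------------+
| 1        | 01/01/2019 01:06:15 |
+----------+---------------------+
| 2        | 01/01/2019 01:06:00 |
+----------+---------------------+
| 2        | 01/01/2019 01:06:45 |
+----------+---------------------+
6 rows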
CteBase and CteRecursive are heavily inspired by Bogdan Sahlean's answer on this related question.
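-- Assumes ADates (the sample rows from the question) is available as a table,
-- or that its CTE definition is prepended to this WITH clause.
-- Recursion depth defaults to 100 levels; longer chains per SharedId
-- need an OPTION (MAXRECURSION n) hint on the final SELECT.
WITH CteBase
AS
(
    -- Number the rows of each SharedId in chronological order.
    SELECT v.SharedId,
           v.TheDateTime,
           ROW_NUMBER() OVER(PARTITION BY v.SharedId ORDER BY v.TheDateTime)
               AS RowNum
    FROM ADates v
), CteRecursive
AS
(
    -- Anchor: the first row of each SharedId starts the first group.
    SELECT crt.SharedId,
           crt.TheDateTime,
           crt.TheDateTime AS GroupStartDateTime,
           crt.RowNum,
           1 AS SharedIdRowNum
    FROM CteBase crt
    WHERE crt.RowNum = 1
    UNION ALL
    -- Recursive step: keep the group start while within 30 seconds of it,
    -- otherwise the current row becomes the start of a new group.
    SELECT crt.SharedId,
           crt.TheDateTime,
           CASE
               WHEN DATEDIFF(SECOND, prv.GroupStartDateTime, crt.TheDateTime) <= 30
                   THEN prv.GroupStartDateTime
               ELSE crt.TheDateTime
           END,
           crt.RowNum,
           CASE
               WHEN DATEDIFF(SECOND, prv.GroupStartDateTime, crt.TheDateTime) <= 30
                   THEN prv.SharedIdRowNum + 1
               ELSE 1
           END
    FROM CteBase crt
    INNER JOIN CteRecursive prv ON crt.SharedId = prv.SharedId
        AND crt.RowNum = prv.RowNum + 1
)
-- Rows with SharedIdRowNum = 1 open a group; count them per SharedId.
SELECT SharedId, COUNT(*) AS [COUNT] FROM (
    SELECT r.SharedId,
           r.GroupStartDateTime
    FROM CteRecursive r
    WHERE r.SharedIdRowNum = 1
) X
GROUP BY SharedId;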
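To follow what the recursion carries from row to row, here is a rough Python trace of the same steps. It is only a sketch: `trace_recursion` and `ADATES` are illustrative names, the tuple layout mirrors the columns of CteRecursive, and `total_seconds()` measures elapsed time whereas `DATEDIFF(SECOND, ...)` counts second boundaries crossed (the two agree on this whole-second sample).

```python
from datetime import datetime

FMT = "%Y-%m-%d %H:%M:%S"

# (SharedId, TheDateTime) pairs from the question's sample data.
ADATES = [
    (1, "2019-01-01 01:01:00"), (1, "2019-01-01 01:01:33"),
    (1, "2019-01-01 01:02:00"), (1, "2019-01-01 01:02:01"),
    (1, "2019-01-01 01:02:04"), (1, "2019-01-01 01:06:15"),
    (2, "2019-01-01 01:06:00"), (2, "2019-01-01 01:06:45"),
    (1, "2019-01-01 01:02:31"), (2, "2019-01-01 01:06:05"),
    (2, "2019-01-01 01:06:46"),
]


def trace_recursion(rows, window_seconds=30):
    """Return tuples shaped like CteRecursive rows:
    (SharedId, TheDateTime, GroupStartDateTime, RowNum, SharedIdRowNum).
    """
    times_by_shared_id = {}
    for shared_id, ts in rows:
        times_by_shared_id.setdefault(shared_id, []).append(
            datetime.strptime(ts, FMT))

    trace = []
    for shared_id in sorted(times_by_shared_id):
        prev = None
        # enumerate(..., start=1) plays the role of ROW_NUMBER() in CteBase.
        for row_num, t in enumerate(sorted(times_by_shared_id[shared_id]), 1):
            if prev is not None and (t - prev[2]).total_seconds() <= window_seconds:
                # Still inside the current group: carry its start forward.
                cur = (shared_id, t, prev[2], row_num, prev[4] + 1)
            else:
                # Anchor row, or too far from the group start: open a group.
                cur = (shared_id, t, t, row_num, 1)
            trace.append(cur)
            prev = cur
    return trace


trace = trace_recursion(ADATES)
group_starts = [(sid, start.strftime(FMT))
                for sid, _, start, _, n in trace if n == 1]
for row in group_starts:
    print(row)
```

On the sample data this prints six group starts, four for SharedId 1 and two for SharedId 2, which lines up with the GroupStartDateTime listing shown earlier.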
OTHER TIPS
It has something to do with item 6 coming after item 7, which is in another group. Because of that, my query sees it as a new count, resulting in 5. Is the time for id=6 correct? Can it be after id=7?
The query has gotten so complex that I cannot explain it, or it is already too late (🍺), which is not good, but… 😊
The next one seems to produce the correct results (with the 9 records):
WITH ADates AS (
SELECT
Id
, SharedId
, TheDateTime
FROM (VALUES
(CAST(1 AS int), CAST(1 AS int), CAST('2019-01-01 01:01:00.00' AS datetime2(7))),
(2, 1, '2019-01-01 01:01:33.00'),
(3, 1, '2019-01-01 01:02:00.00'),
(4, 1, '2019-01-01 01:02:01.00'),
(5, 1, '2019-01-01 01:02:04.00'),
(6, 1, '2019-01-01 01:06:15.00'),
(7, 2, '2019-01-01 01:06:00.00'),
(9, 1, '2019-01-01 01:02:31.00'),
(8, 2, '2019-01-01 01:06:45.00')
) X (Id, SharedId, TheDateTime)
),
TMPADates AS (
    SELECT
        Id,
        SharedId,
        TheDateTime,
        --DATEADD(S, 30, TheDateTime ) TheDateTime30,
        -- Earliest row of the same SharedId in the 30 seconds up to this row.
        ISNULL((SELECT MIN(TheDateTime)
                FROM ADates t2
                WHERE t2.TheDateTime BETWEEN DATEADD(s,-30,ADates.TheDateTime) AND ADates.TheDateTime
                  AND ADates.SharedId=t2.SharedId),TheDateTime) ingroup
    FROM ADates
),
TMPADates2 AS (
    SELECT
        Id,
        SharedId,
        TheDateTime,
        ingroup,
        -- ISNULL(LAG(ingroup) OVER (ORDER BY SharedId,ingroup DESC),ingroup) as ingroup2
        ISNULL(LAG(ingroup) OVER (PARTITION BY SharedId ORDER BY SharedId,ingroup DESC),ingroup) AS ingroup2
    FROM TMPADates
)
SELECT
    SharedId, COUNT(DISTINCT ingroup2) AS Count
FROM TMPADates2
GROUP BY SharedId
What about:
WITH ADates AS (
SELECT
Id
, SharedId
, TheDateTime
FROM (VALUES
(CAST(1 AS int), CAST(1 AS int), CAST('2019-01-01 01:01:00.00' AS datetime2(7))),
(2, 1, '2019-01-01 01:01:33.00'),
(3, 1, '2019-01-01 01:02:00.00'),
(4, 1, '2019-01-01 01:02:01.00'),
(5, 1, '2019-01-01 01:02:04.00'),
(6, 1, '2019-01-01 01:06:15.00'),
(7, 2, '2019-01-01 01:06:00.00'),
(8, 2, '2019-01-01 01:06:45.00')
) X (Id, SharedId, TheDateTime)
),
TMPADates AS (
    SELECT
        Id,
        SharedId,
        TheDateTime,
        DATEADD(S, 30, TheDateTime) TheDateTime30,
        -- Earliest row (of any SharedId) in the 30 seconds up to this row.
        ISNULL((SELECT MIN(TheDateTime)
                FROM ADates t2
                WHERE t2.TheDateTime BETWEEN DATEADD(s,-30,ADates.TheDateTime) AND ADates.TheDateTime),
               TheDateTime) ingroup
    FROM ADates
)
SELECT
    SharedId, COUNT(DISTINCT ingroup) AS Count
FROM TMPADates
GROUP BY SharedId
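The lookback subquery above can be transcribed into Python to see how it behaves once item 9 is part of the data. This is a hedged sketch: `lookback_counts` is an illustrative name, the nine-row sample from the earlier query is assumed as input, and the `same_shared_id_only` flag is an assumption added here to compare the two variants of the subquery.

```python
from datetime import datetime, timedelta

FMT = "%Y-%m-%d %H:%M:%S"

# Nine-row sample (Id, SharedId, TheDateTime), assumed as input.
ROWS = [
    (1, 1, "2019-01-01 01:01:00"),
    (2, 1, "2019-01-01 01:01:33"),
    (3, 1, "2019-01-01 01:02:00"),
    (4, 1, "2019-01-01 01:02:01"),
    (5, 1, "2019-01-01 01:02:04"),
    (6, 1, "2019-01-01 01:06:15"),
    (7, 2, "2019-01-01 01:06:00"),
    (9, 1, "2019-01-01 01:02:31"),
    (8, 2, "2019-01-01 01:06:45"),
]


def lookback_counts(rows, same_shared_id_only, window_seconds=30):
    """Per SharedId, count distinct 'earliest row within the last
    window_seconds' values, like COUNT(DISTINCT ingroup)."""
    window = timedelta(seconds=window_seconds)
    parsed = [(sid, datetime.strptime(ts, FMT)) for _id, sid, ts in rows]

    distinct = {}
    for sid, t in parsed:
        # BETWEEN t - 30s AND t is inclusive on both ends; the row itself
        # always qualifies, so the candidate list is never empty.
        candidates = [
            t2 for sid2, t2 in parsed
            if t - window <= t2 <= t
            and (sid2 == sid or not same_shared_id_only)
        ]
        distinct.setdefault(sid, set()).add(min(candidates))
    return {sid: len(values) for sid, values in distinct.items()}


print(lookback_counts(ROWS, same_shared_id_only=False))  # {1: 5, 2: 2}
print(lookback_counts(ROWS, same_shared_id_only=True))   # {1: 5, 2: 2}
```

On this sample both variants give 5 for SharedId 1. That suggests the extra count comes from the window sliding with every row: item 5 looks back to 01:02:00 and item 9 looks back to 01:02:01, while under the rule in the question both belong to the group that starts at item 5 (01:02:04).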