Validate id sequence consistency in records in Redshift
07-10-2020
Question
I have a backend that logs events to Redshift; for each event it generates a unique id. The id is a sequential number.
I have something like this (`events` table):
+-------------------------+------+
| created_at | id |
+-------------------------+------+
| 2017-06-30 09:20:47 UTC | 100 |
| 2017-06-30 09:18:31 UTC | 101 |
| 2017-06-30 09:16:19 UTC | 102 |
| 2017-06-30 09:12:08 UTC | 103 |
| 2017-06-30 09:11:59 UTC | 104 |
| 2017-06-30 09:11:15 UTC | 105 |
| 2017-06-30 07:03:41 UTC | 106 |
+-------------------------+------+
I have a task that runs every hour and moves a few records to another table (`deactivated_events`).
I would like to verify that for the last 3 hours I don't have any missing records, using the `id` sequence.
First I thought about using `generate_series`, but this is not present in Redshift. Others suggested creating a table containing only the ids, but populating a table with a million sequential integers is still a pain.
I'm wondering if the best approach isn't to use min, max and count, like this:
WITH merged_events AS (
    SELECT *
    FROM (
        SELECT id, created_at FROM events
        UNION
        SELECT id, created_at FROM deactivated_events
    ) AS q  -- the derived table needs an alias
    WHERE created_at > GETDATE() - INTERVAL '3 hours'
    ORDER BY id
)
SELECT COUNT(*), (MAX(id) - MIN(id) + 1) AS diff
FROM merged_events;
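For what it's worth, the arithmetic behind this check can be sketched in a few lines of plain Python (the ids below are illustrative, with 101 missing):

```python
# Merged ids from the last 3-hour window (illustrative: id 101 is missing).
ids = [100, 102, 103, 104, 105, 106]

count = len(set(ids))               # UNION deduplicates, hence set()
expected = max(ids) - min(ids) + 1  # row count if the sequence were complete

# A gap exists whenever count < expected. Note the check cannot say WHICH
# ids are missing, and duplicate ids are invisible because of the dedup.
print(count, expected)  # 6 7
assert count < expected
```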
PS: Bonus: how to find missing or duplicate records?
The solution
Let's imagine you have this setup:
INSERT INTO events
(created_at, id)
VALUES
('2017-06-30 09:20:47 UTC', 100),
-- ('2017-06-30 09:18:31 UTC', 101), -- Missing row
('2017-06-30 09:16:19 UTC', 102),
('2017-06-30 09:12:08 UTC', 103),
('2017-06-30 09:11:59 UTC', 104),
('2017-06-30 09:11:15 UTC', 105),
('2017-06-30 07:03:41 UTC', 106) ;
and ...
INSERT INTO deactivated_events
(created_at, id)
VALUES
('2017-06-30 07:03:41 UTC', 97),
('2017-06-30 09:11:15 UTC', 98),
('2017-06-30 09:11:15 UTC', 99),
('2017-06-30 09:18:31 UTC', 100) -- Repeated row
;
Given that Redshift supports window functions, you can use a slight variation of your query to do two things:

- Consider that there are repeated `id`s when you group by `id` and find the count to be > 1.
- Consider that there is a missing row (or more!) when the previous row's `id` plus 1 is not the current `id`. This is done by means of the `LAG` function.
This can be done with the following query:
WITH merged_events AS (
    SELECT id
    FROM (
        SELECT id FROM events
        UNION ALL  /* Must be UNION ALL, because we want to find repeated values */
        SELECT id FROM deactivated_events
    ) AS q0
    WHERE true  -- in practice: created_at > GETDATE() - INTERVAL '3 hours'
                -- (q0 would then also need to select created_at)
)
SELECT
    id,
    COUNT(id) > 1 AS repeated_event,
    ((LAG(id) OVER (ORDER BY id)) /* previous id */ + 1) <> id AS previous_event_missing
FROM
    merged_events
GROUP BY
    id;
This will produce:
|  id | repeated_event | previous_event_missing |
| --: | :------------- | :--------------------- |
|  97 | f              | null                    |
|  98 | f              | f                       |
|  99 | f              | f                       |
| 100 | TRUE           | f                       |
| 102 | f              | TRUE                    |
| 103 | f              | f                       |
| 104 | f              | f                       |
| 105 | f              | f                       |
| 106 | f              | f                       |
You can check the whole setup (with PostgreSQL instead of Redshift) at dbfiddle here.
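If you'd rather poke at the logic locally without spinning up PostgreSQL or Redshift, the same query also runs on SQLite (3.25+ ships window functions). A minimal sketch reproducing the setup above, with SQLite standing in for Redshift:

```python
import sqlite3

# Reproduce the answer's two-table setup in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE events (created_at TEXT, id INTEGER);
    CREATE TABLE deactivated_events (created_at TEXT, id INTEGER);
    INSERT INTO events (created_at, id) VALUES
        ('2017-06-30 09:20:47', 100),
        -- ('2017-06-30 09:18:31', 101)  -- missing row
        ('2017-06-30 09:16:19', 102),
        ('2017-06-30 09:12:08', 103),
        ('2017-06-30 09:11:59', 104),
        ('2017-06-30 09:11:15', 105),
        ('2017-06-30 07:03:41', 106);
    INSERT INTO deactivated_events (created_at, id) VALUES
        ('2017-06-30 07:03:41', 97),
        ('2017-06-30 09:11:15', 98),
        ('2017-06-30 09:11:15', 99),
        ('2017-06-30 09:18:31', 100);  -- repeated row
""")

rows = conn.execute("""
    WITH merged_events AS (
        SELECT id FROM events
        UNION ALL               -- UNION ALL keeps the repeated values
        SELECT id FROM deactivated_events
    )
    SELECT id,
           COUNT(id) > 1 AS repeated_event,
           (LAG(id) OVER (ORDER BY id)) + 1 <> id AS previous_event_missing
    FROM merged_events
    GROUP BY id
    ORDER BY id
""").fetchall()

duplicates = [i for i, rep, _ in rows if rep]  # ids seen more than once
gaps = [i for i, _, miss in rows if miss]      # ids whose predecessor is absent
print("duplicate ids:", duplicates)
print("gap detected before ids:", gaps)
```

Running this flags id 100 as repeated and id 102 as having a missing predecessor, matching the table above.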
Caveats: your `id` might have gaps if it actually comes from an `IDENTITY` column (or a `SEQUENCE`, if Redshift happens to implement them, which I don't think is the case). So you should first guarantee by some other means that your ids are originally consecutive...
References:
- Sequence number generation function in AWS Redshift, which discusses different ways of getting sequences, and their tradeoffs.
- How to find missing values in a sequence with SQL which shows different alternatives.
Other advice
> I have a task that runs every hour and moves a few records to another table (deactivated_events).
This sounds like a really bad idea. Instead, I would urge you to add a new column `active` of type `boolean`, especially if you need to merge events from that table back in.
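To make that concrete, here's a sketch of the single-table design (SQLite standing in for Redshift; the `active` column default and sample rows are made up for illustration):

```python
import sqlite3

# Hypothetical single-table design: rows are flagged inactive instead of
# being moved to deactivated_events. Sample data is illustrative.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE events (
        created_at TEXT,
        id         INTEGER,
        active     BOOLEAN NOT NULL DEFAULT 1
    )
""")
conn.executemany(
    "INSERT INTO events (created_at, id) VALUES (?, ?)",
    [("2017-06-30 09:20:47", 100), ("2017-06-30 09:18:31", 101)],
)

# "Deactivating" an event is an UPDATE, not a DELETE + re-INSERT elsewhere.
conn.execute("UPDATE events SET active = 0 WHERE id = 100")

# The sequence check no longer needs a UNION: every id stays in one table.
all_ids = [i for (i,) in conn.execute("SELECT id FROM events ORDER BY id")]
active_ids = [i for (i,) in
              conn.execute("SELECT id FROM events WHERE active ORDER BY id")]
print(all_ids, active_ids)
```

With this layout the gap/duplicate query from the accepted answer runs against `events` alone, with no merge step.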
> PS: Bonus: how to find missing or duplicate records?
That will stop being a problem if you stop deleting from the table.
Also, see @joanolo's notes on sequences. That's also true of `IDENTITY` columns on Redshift:

> To be sure that the identity values are unique, Amazon Redshift skips a number of values when creating the identity values. As a result, identity values are sequential, but not consecutive.
Generally, you don't want to move data out of a table to archive it, especially if you don't have to. PostgreSQL 10 is coming out, and Redshift may very well upgrade to it. If it does, you'll have value partitioning, which will do this transparently and automatically.