Question

I have a backend that logs events to Redshift; for each event it generates a unique id. The id is a sequential number.

I have something like this (events table):

+-------------------------+------+
| created_at              |  id  |
+-------------------------+------+
| 2017-06-30 09:20:47 UTC | 100  |
| 2017-06-30 09:18:31 UTC | 101  |
| 2017-06-30 09:16:19 UTC | 102  |
| 2017-06-30 09:12:08 UTC | 103  |
| 2017-06-30 09:11:59 UTC | 104  |
| 2017-06-30 09:11:15 UTC | 105  |
| 2017-06-30 07:03:41 UTC | 106  |
+-------------------------+------+

I have a task that runs every hour and moves a few records to another table (deactivated_events).

I would like to verify, using the id sequence, that I don't have any missing records for the last 3 hours. First I thought about using generate_series, but it is not available in Redshift. Others suggested making a table with only the ids, but it's still a pain to populate a table with integers (here is one example to generate 1 million).

I'm wondering if the best approach isn't to use min, max, and count, like this:

WITH merged_events AS
  (SELECT *
   FROM
     (SELECT id, created_at
      FROM events
      UNION
      SELECT id, created_at
      FROM deactivated_events
     ) AS q  -- Redshift requires an alias for the derived table
   WHERE created_at > GETDATE() - INTERVAL '3 hours'
   ORDER BY id)
SELECT count(*), (max(id) - min(id) + 1) AS diff
FROM merged_events;
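
With the sample events table above (ids 100 to 106, all present), this yields COUNT(*) = 7 and diff = 7, so the two match; a missing id would show up as COUNT(*) < diff. One caveat: a duplicated id combined with a missing id can cancel out and make the counts match again, and the plain UNION silently collapses exact duplicates, so this check alone can't detect duplicates.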

PS: Bonus, how to find missing or duplicate records?


Solution

Let's imagine you have this setup:
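
For the inserts below to actually run, both tables need to exist first. A minimal sketch of the schema (the column types here are an assumption, not taken from the original post):

CREATE TABLE events
    (created_at timestamptz, id bigint) ;

CREATE TABLE deactivated_events
    (created_at timestamptz, id bigint) ;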

INSERT INTO events
    (created_at, id)
VALUES
    ('2017-06-30 09:20:47 UTC', 100),
    -- ('2017-06-30 09:18:31 UTC', 101), -- Missing row
    ('2017-06-30 09:16:19 UTC', 102),
    ('2017-06-30 09:12:08 UTC', 103),
    ('2017-06-30 09:11:59 UTC', 104),
    ('2017-06-30 09:11:15 UTC', 105),
    ('2017-06-30 07:03:41 UTC', 106) ;

and ...

INSERT INTO deactivated_events
    (created_at, id)
VALUES
    ('2017-06-30 07:03:41 UTC', 97),
    ('2017-06-30 09:11:15 UTC', 98),
    ('2017-06-30 09:11:15 UTC', 99),
    ('2017-06-30 09:18:31 UTC', 100)     -- Repeated row
    ;

Since Redshift supports window functions, you can use a slight variation of your query to do two things:

  1. Detect repeated ids: group by id and flag any group whose count is > 1.
  2. Detect a missing row (or more!): flag any row where the previous row's id + 1 is not the current id. This is done by means of the LAG window function.

This can be done with the following query:

WITH merged_events AS
(
  SELECT 
      id
  FROM
  (
      SELECT 
          id
      FROM 
          events
      UNION ALL  /* Must be UNION ALL, because we want to find repeated values */
      SELECT 
          id
      FROM 
          deactivated_events
  ) AS q0
  WHERE true -- In practice: created_at > GETDATE() - INTERVAL '3 hours' (created_at would then also have to be selected in q0)
)
SELECT 
    id, 
    count(id) > 1 AS repeated_event, 
    ((lag(id) OVER(ORDER BY id)) /* previous id */ + 1) <> id AS previous_event_missing
FROM 
    merged_events
GROUP BY
    id ;

This will produce:

 id | repeated_event | previous_event_missing
--: | :------------- | :---------------------
 97 | f              | null
 98 | f              | f
 99 | f              | f
100 | t              | f
102 | f              | t
103 | f              | f
104 | f              | f
105 | f              | f
106 | f              | f

You can check the whole setup (with PostgreSQL instead of Redshift) at dbfiddle here.
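
For the bonus question (listing which records are missing or duplicated), here is a sketch along the same lines that reports only the problem ids; the extra CTE is needed because a window function's result can't be referenced directly in a WHERE clause:

WITH merged_events AS
(
  SELECT id FROM events
  UNION ALL
  SELECT id FROM deactivated_events
),
checked AS
(
  SELECT
      id,
      count(*) AS occurrences,                   -- > 1 means the id is duplicated
      lag(id) OVER (ORDER BY id) AS previous_id  -- null for the first id
  FROM
      merged_events
  GROUP BY
      id
)
SELECT
    id,
    occurrences > 1 AS repeated_event,
    previous_id + 1 <> id AS previous_event_missing
FROM
    checked
WHERE
    occurrences > 1
    OR previous_id + 1 <> id ;

With the sample data, this returns just id 100 (repeated) and id 102 (its predecessor, 101, is missing).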


Caveats: your ids might have gaps if they actually come from an IDENTITY column (or from a SEQUENCE, if Redshift happened to implement them, which I don't think is the case). So, you should first guarantee by some other means that your ids are originally consecutive...


OTHER TIPS

I have a task that runs every hour and moves a few records to another table (deactivated_events).

This sounds like a really bad idea. Instead, I would urge you to create a new column active of type bool, especially since you need to merge events back from that table anyway.
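
A minimal sketch of that approach (the deactivation criterion in the UPDATE is hypothetical, since the original post doesn't say how rows are selected for deactivation):

ALTER TABLE events ADD COLUMN active boolean DEFAULT true ;

-- The hourly task flips the flag instead of moving rows:
UPDATE events
SET active = false
WHERE created_at < GETDATE() - INTERVAL '30 days' ;  -- hypothetical criterion

With everything kept in one table, the gap/duplicate check above no longer needs the UNION at all.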

PS: Bonus, how to find missing or duplicate records?

That will stop being a problem if you stop deleting from the table.

Also, see @joanolo's notes on sequences. That's also true of IDENTITY columns on Redshift:

To be sure that the identity values are unique, Amazon Redshift skips a number of values when creating the identity values. As a result, identity values are sequential, but not consecutive.

Generally, you don't want to move data out of a table to archive it, especially if you don't have to. PostgreSQL 10 is coming out, and Redshift may very well upgrade to it. If they do, you'll get value partitioning, which would do this transparently and automatically.

Licensed under: CC-BY-SA with attribution
Not affiliated with dba.stackexchange