Updating table of versioned rows with historical records in PostgreSQL

https://dba.stackexchange.com/questions/95845

17-12-2020
|

Question

I have a master table of versioned rows:

CREATE TABLE master (
    id SERIAL PRIMARY KEY,
    rec_id integer, 
    val text, 
    valid_on date[], 
    valid_during daterange
);

INSERT INTO master (rec_id, val, valid_on, valid_during) VALUES
    (1, 'a', '{2015-01-01,2015-01-05}', '[2015-01-01,infinity)'),
    (2, 'b', '{2015-01-01,2015-01-05}', '[2015-01-01,infinity)'),
    (3, 'c', '{2015-01-01,2015-01-05}', '[2015-01-01,infinity)');

SELECT * FROM master ORDER BY rec_id, id;
/*
     id | rec_id | val |        valid_on         |     valid_during
    ----+--------+-----+-------------------------+-----------------------
      1 |      1 | a   | {2015-01-01,2015-01-05} | [2015-01-01,infinity)
      2 |      2 | b   | {2015-01-01,2015-01-05} | [2015-01-01,infinity)
      3 |      3 | c   | {2015-01-01,2015-01-05} | [2015-01-01,infinity)
*/

The rec_id is a the record's natural key, the valid_on is an array of dates on which the record was valid, and the valid_during is a date range describing the interval during which the record is valid. (The upper bound on the valid_during is 'infinity' if there is no record with the same rec_id with a more recent valid_on value.)

Given a second table of updated records, along with new dates on which each record was valid:

CREATE TABLE updates (id SERIAL PRIMARY KEY, rec_id integer, val text, valid_on date); 
INSERT INTO updates (rec_id, val, valid_on) VALUES
(1, 'a', '2015-01-03'), -- (1) same "val" for id 1, just add valid_on date
(2, 'd', '2015-01-06'), -- (2) different val for id 2,
(3, 'e', '2015-01-03'); -- (3) different val for id 3 with new date 
                        --     intersecting old date range

SELECT * FROM updates;
/*
     id | rec_id | val |  valid_on
    ----+--------+-----+------------
      1 |      1 | a   | 2015-01-03
      2 |      2 | d   | 2015-01-06
      3 |      3 | e   | 2015-01-03
*/

I would like to insert/update the master table to wind up with something like this:

-- The goal
SELECT rec_id, val, valid_on, valid_during FROM master ORDER BY rec_id, id;
/*
     rec_id | val |        valid_on                    |     valid_during
    --------+-----+------------------------------------+-----------------------
      1     | a   | {2015-01-01,2015-01-05,2015-01-03} | [2015-01-01,infinity)
      2     | b   | {2015-01-01,2015-01-05}            | [2015-01-01,2015-01-06)
      2     | d   | {2015-01-06}                       | [2015-01-06,infinity)
      3     | c   | {2015-01-01}                       | [2015-01-01,2015-01-03)
      3     | e   | {2015-01-03}                       | [2015-01-03,2015-01-05)
      3     | c   | {2015-01-05}                       | [2015-01-05,infinity)
*/

Specifically:

If a new record's rec_id exists in the master table with the same val, but the new valid_on date is not in the valid_on array in the master, simply add the new date to the master table's valid_on field (see rec_id 1)
If a new record's rec_id exists with a different val, insert the new record into the master table. The old record in the master table should have its valid_during value end on the date of the new record's valid_on (see rec_id 2)
If the new record's valid_on date intersects the old record's valid_during range, the old record should appear on both "sides" of the updated record (see rec_id 3)

I can get most of the way there. The first case is straightforward: we just need to update the valid_on field in the master table (we'll worry about the valid_during field momentarily in a separate step):

UPDATE master m
SET valid_on = m.valid_on || u.valid_on
FROM updates u
WHERE m.rec_id = u.rec_id 
    AND m.val = u.val 
    AND NOT m.valid_on @> ARRAY[u.valid_on];

SELECT * FROM master ORDER BY rec_id, id;
/*
     id | rec_id | val |              valid_on              |     valid_during
    ----+--------+-----+------------------------------------+-----------------------
      1 |      1 | a   | {2015-01-01,2015-01-05,2015-01-03} | [2015-01-01,infinity)
      2 |      2 | b   | {2015-01-01,2015-01-05}            | [2015-01-01,infinity)
      3 |      3 | c   | {2015-01-01,2015-01-05}            | [2015-01-01,infinity)
*/

For case #2, we can do a simple insert:

INSERT INTO master (rec_id, val, valid_on)
SELECT u.rec_id, u.val, ARRAY[u.valid_on]
FROM updates u 
    LEFT JOIN master m ON u.rec_id = m.rec_id AND u.val = m.val
WHERE m.id IS NULL;

SELECT * FROM master ORDER BY rec_id, id;
/*
     id | rec_id | val |              valid_on              |     valid_during
    ----+--------+-----+------------------------------------+-----------------------
      1 |      1 | a   | {2015-01-01,2015-01-05,2015-01-03} | [2015-01-01,infinity)
      2 |      2 | b   | {2015-01-01,2015-01-05}            | [2015-01-01,infinity)
      4 |      2 | d   | {2015-01-06}                       |
      3 |      3 | c   | {2015-01-01,2015-01-05}            | [2015-01-01,infinity)
      5 |      3 | e   | {2015-01-03}                       |
*/

Now, we can correct the valid_during range in one pass by joining on a subquery which uses a window function that checks for the next valid date for a record with the same rec_id:

-- Helper function...
CREATE OR REPLACE FUNCTION arraymin(anyarray) 
RETURNS anyelement AS $$ 
    SELECT min($1[i]) 
    FROM generate_series(array_lower($1,1), array_upper($1,1)) g(i); 
$$ language sql immutable strict; 


UPDATE master m
SET valid_during = daterange(arraymin(valid_on), new_valid_until)
FROM (
    SELECT
        id,
        lead(arraymin(valid_on), 1, 'infinity'::date)
        OVER (partition by rec_id ORDER BY arraymin(valid_on)) AS new_valid_until
    FROM master ) t
WHERE
    m.id = t.id;

SELECT * FROM master ORDER BY rec_id, id;
/*
     id | rec_id | val |              valid_on              |      valid_during
    ----+--------+-----+------------------------------------+-------------------------
      1 |      1 | a   | {2015-01-01,2015-01-05,2015-01-03} | [2015-01-01,infinity)
      2 |      2 | b   | {2015-01-01,2015-01-05}            | [2015-01-01,2015-01-06)
      4 |      2 | d   | {2015-01-06}                       | [2015-01-06,infinity)
      3 |      3 | c   | {2015-01-01,2015-01-05}            | [2015-01-01,2015-01-03)
      5 |      3 | e   | {2015-01-03}                       | [2015-01-03,infinity)
*/

And here's where I'm stuck: rec_id 1 and 2 are exactly what I want, but rec_id 3 needs to be inserted again to appear valid on '2015-01-05'. I can't seem to wrap my head around the array operation to perform that insert. Any thoughts on approaches that don't involve unnesting the master table? Or is that the only/best approach here?

I'm using PostgreSQL 9.3 (but would happily upgrade to 9.4 if there's a graceful way to do this in the newer version).

Solution

1st case

You seem to forget the valid_during range. As your third case suggests, there can be multiple entries per (rec_id, val), so you must select the right one:

UPDATE master m
SET    valid_on = f_array_sort(m.valid_on || u.valid_on) -- sorted array, see below
FROM   updates u
WHERE  m.rec_id = u.rec_id 
AND    m.valid_during @> u.valid_on  -- additional check
AND    m.val = u.val 
AND    NOT m.valid_on @> ARRAY[u.valid_on];

I assume the whole possible date range is always covered for each existing rec_id and valid_during shall not overlap per rec_id, or you'd have to do more.

After installing the additional module btree_gist, add an exclusion constraint to rule out overlapping date ranges if you don't have one, yet:

ALTER TABLE master ADD CONSTRAINT EXCLUDE
USING gist (rec_id WITH =, valid_during WITH &&)  -- disallow overlap

The GiST index this is implemented with is also a perfect match for the query. Details:

2nd / 3rd case

Assuming that every date range starts with the smallest date in the (now sorted!) array: lower(m.valid_during) = m.valid_on[1]. I would enforce that with a CHECK constraint.

Here we need to create one or two new rows In the 2nd case it is enough to shrink the range of the old row and insert one new row In the 3rd case we update the old row with the left half of array and range, insert the new row and finally insert the with the right half of array and range.

Helper functions

To keep it simple I introduce a new constraint: every array is sorted. Use this helper function

-- sort array
CREATE OR REPLACE FUNCTION f_array_sort(anyarray) 
  RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT unnest($1) ORDER BY 1)$$;

I don't need your helper function arraymin() any more, but it could be simplified to:

CREATE OR REPLACE FUNCTION f_array_min(anyarray) 
  RETURNS anyelement LANGUAGE sql IMMUTABLE AS
$$SELECT min(a) FROM unnest($1) a$$;

Two more to get the left and right half of an array split at a given element:

-- split left array at given element
CREATE OR REPLACE FUNCTION f_array_left(anyarray, anyelement) 
  RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a < $2 ORDER BY 1)$$;

-- split right array at given element
CREATE OR REPLACE FUNCTION f_array_right(anyarray, anyelement) 
  RETURNS anyarray LANGUAGE sql IMMUTABLE AS
$$SELECT ARRAY (SELECT * FROM unnest($1) a WHERE a >= $2 ORDER BY 1)$$;

Query

This does all the rest:

WITH u AS (  -- identify candidates
   SELECT m.id, rec_id, m.val, m.valid_on, m.valid_during
        , u.val AS u_val, u.valid_on AS u_valid_on
   FROM   master  m
   JOIN   updates u USING (rec_id)
   WHERE  m.val <> u.val
   AND    m.valid_during @> u.valid_on
   FOR    UPDATE  -- lock for update
   )
, upd1 AS (  -- case 2: no overlap, no split
   UPDATE master m  -- shrink old row
   SET    valid_during = daterange(lower(u.valid_during), u.u_valid_on)
   FROM   u
   WHERE  u.id = m.id
   AND    u.u_valid_on > m.valid_on[array_upper(m.valid_on, 1)]
   RETURNING m.id
   )
, ins1 AS (  -- insert new row
   INSERT INTO master (rec_id, val, valid_on, valid_during)
   SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
        , daterange(u.u_valid_on, upper(u.valid_during))
   FROM   upd1
   JOIN   u USING (id)
   )
, upd2 AS (  -- case 3: overlap, need to split row
   UPDATE master m  -- shrink to first half
   SET    valid_during = daterange(lower(u.valid_during), u.u_valid_on)
        , valid_on = f_array_left(u.valid_on, u.u_valid_on)
   FROM   u
   LEFT   JOIN upd1 USING (id)
   WHERE  upd1.id IS NULL  -- all others
   AND    u.id = m.id
   RETURNING m.id, f_array_right(u.valid_on, u.u_valid_on) AS arr_right
   )
INSERT INTO master (rec_id, val, valid_on, valid_during)
          -- new row
SELECT u.rec_id, u.u_val, ARRAY[u.u_valid_on]
     , daterange(u.u_valid_on, upd2.arr_right[1])
FROM   upd2
JOIN   u USING (id)
UNION ALL  -- second half of old row
SELECT u.rec_id, u.val, upd2.arr_right
     , daterange(upd2.arr_right[1], upper(u.valid_during))
FROM   upd2
JOIN   u USING (id);

SQL Fiddle.

Notes

You need to understand the concept of data-modifying CTEs (writeable CTEs), before you touch this. Judging from the code you provided, you know your way around Postgres.
- Are SELECT type queries the only type that can be nested?
FOR UPDATE is to avoid race conditions with concurrent write access. If you are the only user writing to the tables, you don't need it.
I took a piece of paper and drew a timeline so not to get lost in all of this.
Each row is only updated / inserted once, and operations are simple and roughly optimized. No expensive window functions. This should perform well. Much faster than your previous approach in any case.
It would be a bit less confusing if you'd use distinct column names for u.valid_on and m.valid_on, which are related but different things.
I compute the right half of the split array in the RETURNING clause of CTE upd2: f_array_right(u.valid_on, u.u_valid_on) AS arr_right, because I need it several times in the next step. This is a (legal) trick to save one more CTE.
As for solutions that don't involve unnesting the master table: You have to unnest the array valid_on either way, to split it, at least as long as it's not sorted. Also, your helper function arraymin() is already unnesting it anyway.

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange