How to atomically replace a subset of table data

https://dba.stackexchange.com/questions/166816

06-10-2020
|

Question

In PostgreSQL 9.6 I have a table T like this

category | id | data
---------+----+------
A        | 1  | foo
A        | 2  | bar
A        | 3  | baz
B        | 4  | eh
B        | 5  | whatcomesafterfoobarbaz

There is a view V giving me the data for T, so it has columns category, id, data. T is essentially the materialized view for V, except that I need to refresh it with more granularity than "refresh everything".

So I will select from V for example

SELECT * FROM V WHERE category = 'A';

SELECT * FROM V WHERE category = 'A' AND id = 2;

And replace the relevant rows in T with whatever data V gives me. Unfortunately I cannot do a simple UPDATE: asking V eg. for WHERE category = 'A' might give me a totally different set of rows than before. Therefore I need to do this sequence:

DELETE FROM T WHERE <condition>;
INSERT INTO T (SELECT FROM V WHERE <condition>);

<condition> is either WHERE category = ? or WHERE category = ? AND id = ?.

How do I do this so that the following conditions hold?

Reads from rows not satisfying <condition> should be unaffected.
The change should be atomic, meaning reads from rows satisfying <condition> should either see the the old row set or the new row set, not a mix.

Note: unlike this question, I don't want to replace the whole table at once - only the rows affected.

Added details

There are more reads than writes, on the order of 10-100 times more. After each write there will be a read to the adjacent categories. The application is looking at a set of categories, ids and data and updates the data for one or more categories at a time. Right after it will re-fetch those categories and display them, and it must see the fresh data. All the ids are always fetched with "their" category.
Each category will have something like 1-10 ids, there will be tens of thousands of categories.

More details after first answer

Transactions can run concurrently. There can definitely be a case when two transactions start with DELETE FROM T WHERE category = 'A';.
There is a table categories where it's possible to lock rows FOR UPDATE. There also is a table where ids can be locked FOR UPDATE.
RETURNING makes not much sense here as I need to fetch more than just the rows changed. Thus all is simpler with a separate SELECT.

Solution

Concurrent reads are not a problem. Writers don't block readers and vice versa in the default READ COMMITTED isolation level. Enclose DELETE and INSERT in a single transaction to make the operation atomic (all applied or nothing).

If there can be multiple transactions trying to write at the same time, that's a game changer. A single transaction protects you from inconsistent updates, but it cannot protect you from race conditions between concurrent transactions.

Say we have two transactions T1 and T2, and category 'A' has 10 IDs:

T1: DELETE FROM T WHERE category = 'A';
-- starts taking row locks in arbitrary order: id 1,2,3,4,5,6,7 ...
                    T2: DELETE FROM T WHERE category = 'A';
                    -- starts taking row locks in arbitrary order: id 10, 9, 8, ...
T1: wait for T2 to release lock on id 8
                    T2: wait for T1 to release lock on id 7

DEADLOCK.

Postgres detects the deadlock after some time and kills one of the two transactions.

You could switch to SERIALIZABLE transaction isolation. But that's much more expensive and you need to prepare for serialization failures and retry in this case.

Or you can avoid the problem by always deleting rows in identical, deterministic order. Like:

WITH del AS (
   SELECT category, id
   FROM   T
   WHERE  category = 'A'
   ORDER  BY category, id  -- enforce this order in *all* writing queries
   FOR    UPDATE
   )
DELETE FROM T 
USING  del
WHERE  T.category = del.category
AND    T.id = del.id;

But typically, there is a more convenient option. If you have a separate table holding unique categories named, say, cat, you can lock the single parent row in cat with:

SELECT * FROM cat WHERE category = 'A' FOR UPDATE;

and then (in the same transaction) write to category 'A' rows in T at will (still encapsulated in a single transaction to avoid intermediary, inconsistent states being visible). Of course, all writing queries must follow the same protocol. Then, concurrent transactions will wait for the lock on cat before they write to T and everything is groovy ...

In Postgres 9.4 or later consider FOR NO KEY UPDATE instead:

How to perform conditional insert based on row count?

Concerning:

After each write there will be a read to the adjacent categories.

You are aware of the RETURNING clause, right? No need for a separate read, if you just inserted all rows for a given category. Example:

PostgreSQL multi INSERT…RETURNING with multiple columns

Licensed under: CC-BY-SA with attribution

Not affiliated with dba.stackexchange