Question

I am using a Postgres DB to implement job scheduling for a large number of computers/processes. To make a long story short, every job has its id, and all scheduling is implemented with three tables: all jobs, currently running jobs, and already completed jobs.

The key functionality of the scheduler is to (1) request a job and (2) inform the DB about a completed job. Requesting a job takes any id from the job list that is not in the running table and not in the completed table:

insert into piper.jobs_running
select x.fid from (
  select fid from piper.jobs
  except
  select fid from piper.jobs_running
  except
  select fid from piper.jobs_completed
) as x limit 1
returning fid;

Completing a job removes it from the running list and inserts it into the completed list. Since that part is not concurrency-sensitive, I omit the SQL commands. (It takes anything from tens of minutes to a couple of hours to finish a job.)

It came as a nasty surprise to me that two processes running exactly the same query above (pretty much requesting a job at the same time) may get the same job id (fid). The only possible explanation I can come up with is that Postgres is not really ACID-compliant. Comments?

Additional information: I set the transactions to be serializable (in postgresql.conf, set default_transaction_isolation = 'serializable'). Now the DBMS fails transactions when the isolation cannot be fulfilled. Is it possible to force Postgres to restart them automatically?


Solution

Specific issues with the query

PostgreSQL defaults to READ COMMITTED isolation, and it looks like you're not using anything different. In READ COMMITTED each statement gets its own snapshot. It cannot see uncommitted changes from other transactions; it's as if they simply haven't happened.

Now, imagine you run this simultaneously in three sessions, on a setup with three entries in jobs, zero in jobs_running and zero in jobs_completed:

insert into piper.jobs_running
select x.fid from (
  select fid from piper.jobs
  except
  select fid from piper.jobs_running
  except
  select fid from piper.jobs_completed
) as x limit 1
returning fid;

Each run will select the three jobs in jobs. Because their snapshots were all taken before any of them committed a change, or even created the uncommitted row, all of them will find zero rows in jobs_running and jobs_completed.

So they all claim a job. Probably the same job, because even though there's no ORDER BY, the scan order will be the same.
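
For concreteness, a minimal setup matching this scenario might look like the following. The question doesn't show the table definitions, so the bigint column type and the primary keys are assumptions:

create schema piper;
create table piper.jobs           (fid bigint primary key);
create table piper.jobs_running   (fid bigint primary key);
create table piper.jobs_completed (fid bigint primary key);
-- three jobs, none running, none completed
insert into piper.jobs values (1), (2), (3);

Note that with a primary key on jobs_running(fid), the race at least surfaces as a duplicate-key error for all but one of the sessions, rather than silently handing the same job to several workers.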

Locking

Row locking crosses transaction boundaries, and lets you communicate between transactions to enforce ordering. So you might think this would solve your problem, but it won't.

If you take a FOR UPDATE lock on a row, so that the row is locked exclusively, the lock is held until the transaction commits or rolls back. So you'd think that the next transaction would get a different row, or, if it tried to get the same row, that it would wait for the lock to be released, see that there was now an entry in jobs_running, and skip the row. You'd be wrong.

What'll happen is that you'll lock all rows. One transaction will successfully get the locks on all rows. The other transactions, which will do the same index or sequential scan, will generally try to lock the rows in the same order, get stuck on the first lock attempt, and wait for the first transaction to roll back or commit. If you're unlucky, they might start locking different sets of rows and deadlock against each other instead, causing a deadlock abort, but usually you're just getting no useful concurrency.

Worse, though: the first transaction picks one of the rows it locked, inserts a row into jobs_running and commits, releasing the locks. Another transaction is then able to continue and locks all the rows... but it doesn't get a new snapshot of the database state (the snapshot was taken at statement start), so it can't see that the first transaction inserted a row into jobs_running. Thus, it grabs the same job, inserts a row for that job into jobs_running, and commits.
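
For illustration, the locking variant might look roughly like this. It's a sketch, not a fix: EXCEPT cannot be combined with FOR UPDATE, so the candidate set is rewritten with NOT EXISTS, and as just explained it still hands out duplicate jobs under READ COMMITTED:

-- sketch only: locks every candidate row in jobs, then claims one;
-- still broken under READ COMMITTED for the reasons described above
insert into piper.jobs_running
select x.fid
from (
  select j.fid
  from piper.jobs j
  where not exists (select 1 from piper.jobs_running r where r.fid = j.fid)
    and not exists (select 1 from piper.jobs_completed c where c.fid = j.fid)
  for update
) as x
limit 1
returning fid;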

Condition re-checks

PostgreSQL has a quirky feature, not found in most databases, where if a statement blocks on a row lock, it re-checks that the selected row still matches the query's conditions once it finally gets the lock, after the first transaction commits.

That's why the example in https://stackoverflow.com/questions/11532550/atomic-update-select-in-postgres works - it relies on qualifier re-checks in WHERE clauses after reclaiming a lock.
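
The shape of that pattern, transplanted onto a hypothetical single-table queue with a started flag (the linked question's style of schema, not the one from this question):

-- hypothetical single-table queue; a session that blocked on the lock
-- re-checks started = false once it obtains the lock, so it won't claim
-- a row another session has already marked as started
update queue
set started = true
where fid = (
  select fid
  from queue
  where started = false
  limit 1
  for update
)
returning fid;

A session that blocked and then finds the row no longer qualifies simply gets zero rows back, so callers still have to loop and retry.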

The use of locking forces everything to run serially, so in practice you might as well have a single connection doing the work.

Isolation, ACID, and reality

Transaction isolation in PostgreSQL is not the perfect ideal in which transactions run concurrently yet their effects are the same as if they had been executed serially.

The only real-world database that would deliver perfect isolation would be one that exclusively locks every table a write transaction touches when it first accesses it, so transactions could only run concurrently if they accessed entirely different parts of the DB. Nobody wants to use such a database in situations where concurrency is desirable or useful.

All real world implementations are compromises.

READ COMMITTED default

PostgreSQL's default is READ COMMITTED isolation, a well-defined isolation level that permits non-repeatable and phantom reads, as discussed in the PostgreSQL manual on transaction isolation.

SERIALIZABLE isolation

You can request stricter SERIALIZABLE isolation on a per-transaction basis or as a per-user, per-database or (not recommended) global default. This provides much stronger guarantees, though still not perfect, at the cost of forcibly rolling back transactions if they interact in ways that couldn't happen if they had run serially.
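
Concretely, those scopes look like this (the role and database names here are placeholders):

-- per transaction
begin transaction isolation level serializable;
-- ... work ...
commit;

-- per user or per database
alter role some_worker set default_transaction_isolation = 'serializable';
alter database some_db set default_transaction_isolation = 'serializable';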

Because your concurrent queries will always try to grab the first job, no matter how many jobs there are, all but one will abort with serialization failures. So in practice you don't get any useful concurrency and might as well have a single connection handing out jobs to workers.

(Note that prior to PostgreSQL 9.1 SERIALIZABLE offered much weaker guarantees and would not detect quite a lot of cases where transactions were interdependent.)

Auto-re-run SERIALIZABLE

PostgreSQL does not auto-re-run SERIALIZABLE transactions that abort due to serialization failures. There are cases where this would be quite useful, but other cases where doing so would be completely wrong and dangerous, especially where read/modify/write cycles via the application are involved. The application is expected to retry.

Don't write a DIY queuing system

It looks like what you're trying to do is write a queuing system. Consider not doing that. Writing a robust, reliable and correct queuing system is hard, and there are a few good ones already available that you could adopt. You have to handle things like crash-safety, what happens when someone takes a task and then fails to complete it, race conditions around completion just as you give up on the task handler completing it, etc. There are lots of subtle concurrency problems. Don't try to DIY.

9.5 and SKIP LOCKED

PostgreSQL 9.5, still in development, adds a feature called SKIP LOCKED that makes queuing quite a bit easier.

It lets you say that when you SELECT ... FOR UPDATE, any row that is already locked should be skipped, carrying on to find the next not-locked row. This is very useful when combined with a LIMIT, as it makes it possible to say "find me the first row someone else isn't already trying to claim". It then becomes very simple to write queuing that grabs jobs safely and concurrently.
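
Applied to the tables from this question, the claim query might then look something like this (again rewriting the EXCEPTs as NOT EXISTS, since FOR UPDATE doesn't work with EXCEPT):

-- 9.5+ sketch: skip rows other sessions have locked, claim the first
-- free one; concurrent callers get different jobs instead of colliding
insert into piper.jobs_running
select x.fid
from (
  select j.fid
  from piper.jobs j
  where not exists (select 1 from piper.jobs_running r where r.fid = j.fid)
    and not exists (select 1 from piper.jobs_completed c where c.fid = j.fid)
  limit 1
  for update skip locked
) as x
returning fid;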

Until that feature is available I strongly suggest that you stick to single connection queue management, or use a well-tested task queue system.

OTHER TIPS

The answer from craig-ringer fully answers the question. For the sake of completeness, I add the following trivial code that solves the issue using locks:

create or replace function f_requestJob(
    jobs           text,
    jobs_running   text,
    jobs_completed text
)
    returns bigint as
$$
declare
    sql varchar;
    c   bigint;
begin
    -- Take ACCESS EXCLUSIVE locks on all three tables so only one caller
    -- at a time can claim a job. The locks are held until the calling
    -- transaction ends. Note that the table names are concatenated into
    -- dynamic SQL, so they must come from a trusted source.
    execute 'LOCK TABLE ' || jobs           || ' IN ACCESS EXCLUSIVE MODE';
    execute 'LOCK TABLE ' || jobs_running   || ' IN ACCESS EXCLUSIVE MODE';
    execute 'LOCK TABLE ' || jobs_completed || ' IN ACCESS EXCLUSIVE MODE';

    -- Claim one job that is neither running nor completed.
    sql := '
    insert into ' || jobs_running || '
    select x.fid
    from (
      select fid from ' || jobs || '
      except
      select fid from ' || jobs_running || '
      except
      select fid from ' || jobs_completed || '
    ) as x limit 1
    returning fid';

    execute sql into c;

    return c; -- null when no job is available
end;
$$ language plpgsql;
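
A call then looks like this, using the question's table names:

select f_requestJob('piper.jobs', 'piper.jobs_running', 'piper.jobs_completed');

Bear in mind that ACCESS EXCLUSIVE locks block even plain SELECTs on those tables until the calling transaction ends, so this serializes all access to the queue, which is exactly the "single connection handing out jobs" trade-off described above.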
Licensed under: CC-BY-SA with attribution