Question

I've recently been learning about concurrency control techniques in transactional databases, but I'm confused about the differences between concurrency control in operating systems and in transactional databases.

As I understand it, the concurrency control techniques introduced in the database literature could be used in a multithreaded program whose threads share variables, and vice versa: the techniques that multithreaded programs use to coordinate access to shared variables could also be used in databases for concurrency control.

Why, then, is this topic treated differently in the database literature and in the operating systems literature?


Solution

Databases have two concurrency requirements. One is the very short-term physical management of memory blocks as they are referenced. These are known as latches in the DB world and can be implemented using mutexes and the like. The concern here is the stability of a block of memory while a worker thread is accessing it.

The second controls the validity and isolation of the client data held within the DB. These controls govern access to logical things (tuples, tables, etc.) rather than physical, hardware-related things (memory blocks), and are referred to as locks. They last as long as the client transaction lasts, which can be an arbitrarily long time. Locks are typically held in a separate structure from the one that holds the object being locked. Consequently the thing locked need not be in memory for the whole duration that the lock is held. For example, a page may be read in, a row inserted, and a lock taken on that row to ensure isolation. If the page containing the new row is later evicted, the lock remains in effect. Indeed, the things locked may not even exist when the lock is taken! A range lock protects key values from a lower bound to an upper bound. Neither of those bounds need be present in the data when the lock is acquired, nor must there be any values that fall within the range. Nevertheless the lock takes effect and isolates the clients' work.
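To make the range-lock idea concrete, here is a minimal T-SQL sketch, assuming a hypothetical dbo.Orders table with an index on OrderDate (all names are invented). Under the SERIALIZABLE isolation level, SQL Server takes key-range locks on the index so that no other transaction can insert a key into the scanned range, even though no row with such a key may exist yet.

    -- Hypothetical table; the index is what the key-range locks attach to.
    CREATE TABLE dbo.Orders
    (
        OrderID   int  NOT NULL PRIMARY KEY,
        OrderDate date NOT NULL
    );
    CREATE INDEX IX_Orders_OrderDate ON dbo.Orders (OrderDate);

    SET TRANSACTION ISOLATION LEVEL SERIALIZABLE;
    BEGIN TRANSACTION;

    -- Scanning this range takes key-range locks, even for key values
    -- that do not currently exist in the table.
    SELECT OrderID
    FROM   dbo.Orders
    WHERE  OrderDate >= '20240101' AND OrderDate < '20240201';

    -- Until this transaction commits, another session's INSERT of an
    -- order dated in January 2024 will block.
    COMMIT TRANSACTION;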

The introductory sections of Goetz Graefe's "Survey of B-Tree Locking Techniques" cover these ideas nicely.

An OS deals only with the first of these (in my limited experience). Any extension to disk is hidden behind virtual memory and the page file. In contrast, a DBMS explicitly manages RAM and disk residence to optimise for its specific use case. Different requirements lead to different treatments in the OS and in the DBMS.

OTHER TIPS

In addition to what Michael Green has pointed out in his excellent answer, you should also be aware of optimistic concurrency, an application-level technique in a database that is used to guard against two users (or processes) attempting to modify the same piece of data. The technique is used when there is a low but non-zero chance of two updates being made to a single piece of data. It's a similar scenario, conceptually at least, to a race condition, but it's not identical and it doesn't call for the same handling, so the techniques that an OS might use can't be applied in exactly the same way to an application database.

The scenario is that two users, for example Bob and Jane, read a customer record in the database. They both see the same version of the record. Then Bob saves a change, let's say to the customer's address on that record. A little bit later, Jane saves a different change, let's say to the customer's credit limit. Since Jane didn't know about Bob's change, Jane's change overwrites Bob's, causing Bob's change to be lost.

At the application level, you can protect against this scenario in a couple of different ways. One is to re-read all of the data just before saving changes to make sure that it hasn't been changed since the last time it was read. This is a little onerous if the record has a lot of fields. A second way is to use a single field in each record as a sentinel that is updated every time anyone saves a change to the record. You could do this with something like a last_modified_datetime field, but depending on how actively records are updated this may not be precise enough. Many RDBMSs have a feature to help with this. SQL Server, for example, has a data type called ROWVERSION (formerly TIMESTAMP), which is a system-generated binary field that is automatically modified by the database every time a record is updated.
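For example, here is a minimal sketch of such a sentinel column in SQL Server; the table and column names (dbo.Customer, CustRowVersion, and so on) are invented for illustration.

    -- Hypothetical Customer table with a system-maintained version column.
    CREATE TABLE dbo.Customer
    (
        CustomerID      int           NOT NULL PRIMARY KEY,
        Address         nvarchar(200) NULL,
        CreditLimit     decimal(12,2) NULL,
        CustRowVersion  rowversion    NOT NULL  -- changes automatically on every update
    );

    -- Both Bob and Jane read the row and remember the version value they saw.
    DECLARE @CustID int = 42;

    SELECT CustomerID, Address, CreditLimit, CustRowVersion
    FROM   dbo.Customer
    WHERE  CustomerID = @CustID;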

To be sure that Bob had not swooped in and modified the data out from under Jane, she would write her update statement to include something like: ...WHERE CustomerID=@CustID AND CustRowVersion=@LastModRowVersion(*)

Jane checks the count of affected rows for her update statement; if the number is 0, she knows that Bob (i.e. someone else) has been up to his old tricks and she needs to refresh her view of the data and reapply her changes.
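Jane's guarded update might look something like the following sketch (again using the invented names from above), with @@ROWCOUNT used to detect that someone else got there first.

    -- Values captured when Jane originally read the row (hypothetical).
    DECLARE @CustID             int           = 42;
    DECLARE @LastModRowVersion  binary(8)     = 0x00000000000007D1;
    DECLARE @NewCreditLimit     decimal(12,2) = 5000.00;

    UPDATE dbo.Customer
    SET    CreditLimit = @NewCreditLimit
    WHERE  CustomerID     = @CustID
      AND  CustRowVersion = @LastModRowVersion;  -- no match if Bob changed the row

    IF @@ROWCOUNT = 0
    BEGIN
        -- Someone modified the row since Jane read it: re-read the
        -- current data, then reapply and resubmit the change.
        RAISERROR('Row was changed by another user; refresh and retry.', 16, 1);
    END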

Why use optimistic concurrency at the application level? You could instead use pessimistic concurrency and not let Jane read the record while Bob has it locked. Some systems are built this way instead. The issue is one of design choice and user requirements. Optimistic concurrency is used when the chance of a genuinely conflicting update is low, even though two people may often be working with the same record at the same time. For example, let's say Bob only wants to read (not update) the customer's address. In that case, why lock the record and prevent Jane from doing her job and modifying the customer's credit limit?
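For contrast, one common pessimistic pattern in SQL Server is to take and hold an update lock when reading the row, so that a second session following the same read-for-update pattern blocks instead of overwriting. This is only a sketch against the hypothetical dbo.Customer table above; exact blocking behaviour depends on the isolation level and indexing in use.

    DECLARE @CustID     int           = 42;
    DECLARE @NewAddress nvarchar(200) = N'1 New Street';

    BEGIN TRANSACTION;

    -- UPDLOCK takes an update lock on the row; HOLDLOCK keeps it until the
    -- transaction ends. Another session running this same read-for-update
    -- blocks here until we commit.
    SELECT Address, CreditLimit
    FROM   dbo.Customer WITH (UPDLOCK, HOLDLOCK)
    WHERE  CustomerID = @CustID;

    -- ... the user edits the data in the application ...

    UPDATE dbo.Customer
    SET    Address = @NewAddress
    WHERE  CustomerID = @CustID;

    COMMIT TRANSACTION;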

(*) these are terrible field/variable names; I'm not suggesting you use such bad names!

Michael Green and Joel Brown have given excellent answers that cover most of the ground. Let me add my two cents. If concurrency control is basically about preventing phantom updates, the question arises of what level of granularity is relevant for the data containers being protected.

For an OS, the user data being protected from phantom updates boils down to the block for disk files, and the page for user memory. There are smaller units of data managed inside the OS itself, but concurrency control here is generally transparent to the user community.

Some OSes have a record management system layered on top of the file management system, and this record management system may have some locking control over data at the record level. I'm disregarding this feature.

A DBMS is generally concerned about data sharing and concurrency control at a very different level of granularity than the OS. At the very least, a DBMS has to be concerned with data sharing at the table row level and at the index node level. Many DBMSes have lots more levels than this.
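As a rough illustration of those extra levels, SQL Server lets a statement suggest the granularity at which it locks through table hints; the hints are only requests, and the engine may still escalate. The example reuses the hypothetical dbo.Customer table from above.

    -- Request row-level locks (the usual default for a small update).
    UPDATE dbo.Customer WITH (ROWLOCK)
    SET    CreditLimit = 1000
    WHERE  CustomerID = 42;

    -- Request page-level locks instead.
    UPDATE dbo.Customer WITH (PAGLOCK)
    SET    CreditLimit = 1000
    WHERE  CustomerID = 42;

    -- Request a single table-level lock.
    UPDATE dbo.Customer WITH (TABLOCK)
    SET    CreditLimit = 1000
    WHERE  CustomerID = 42;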

In addition, the DBMS user gets actively involved in transaction management, using SET TRANSACTION, COMMIT, and ROLLBACK to indicate transaction boundaries to the DBMS. Many OSes do not have any user-level transaction indicators when users share files at the disk block level.
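For example, a client marks its own transaction boundaries explicitly; here is a minimal T-SQL sketch, again against the hypothetical dbo.Customer table.

    SET TRANSACTION ISOLATION LEVEL READ COMMITTED;
    BEGIN TRANSACTION;

    UPDATE dbo.Customer
    SET    CreditLimit = CreditLimit + 500
    WHERE  CustomerID = 42;

    -- If the client decides the change is wrong, it can undo it:
    -- ROLLBACK TRANSACTION;

    -- Otherwise make it permanent and release the locks.
    COMMIT TRANSACTION;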

Without the transaction management and concurrency control features of a DBMS, applications that operate on a shared database would face unacceptable tradeoffs between execution bottlenecks and irreproducible results.

Licensed under: CC-BY-SA with attribution