Question

I am trying to implement a database index based on the data structure (the B-link tree) and the algorithms suggested by Lehman and Yao in this paper. On page 2, the authors state that:

The disk is partitioned in sections of fixed size (physical pages; in this paper, these correspond to the nodes of the tree). These are the only units that can be read or written by a process. [emphasis mine] (...)

(...) a process is allowed to lock and unlock a disk page. This lock gives that process exclusive modification rights to that page; also, a process must have a page locked in order to modify that page. (...) Locks do not prevent other processes from reading the locked page. [emphasis mine]

I am not completely sure my interpretation is correct (I am not used to reading academic papers), but I think it can be concluded from the emphasized sentences that the authors mean the operations that read and write a page are assumed to be "atomic", in the sense that, if a process A has already begun reading (resp. writing) a page, another process B may not begin writing (resp. reading) that same page until A is done performing its read (resp. write) operation. Multiple processes simultaneously reading the same page is, of course, a legitimate condition, as is having multiple processes simultaneously performing arbitrary operations on distinct pages (process A on page P, process B on page Q, process C on page R, etc.).

  1. Is my interpretation correct?

  2. Can I assume POSIX read() and write() system calls are "atomic" in the sense described above? Can I rely on these system calls having some internal logic to determine whether a specific read() or write() call should be temporarily blocked based on the position of the file descriptor and the specified size of the chunk to be read or written?

  3. If the answer to the above questions is "No", how should I roll my own locking mechanism?


Solution

I don't believe the text you cite implies anything of the sort. It doesn't even mention read() or write() or POSIX. In fact, read() and write() cannot be relied on to be atomic. The only thing POSIX guarantees is that write() must be atomic if the size of the write is less than PIPE_BUF bytes, and even that only applies to pipes.

I didn't read the context around the part of the paper you cited, but it sounds like the passage you cited is stating constraints which must be placed on an implementation in order for the algorithm to work correctly. In other words, it states that an implementation of this algorithm requires locking.

How you do that locking is up to you (the implementor). If we are dealing with a regular file and multiple independent processes, you might try fcntl(F_SETLKW)-style locking. If your data structure is in memory and you are dealing with multiple threads in the same process, something else might be appropriate.
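To illustrate the fcntl(F_SETLKW) route for the multi-process case, here is a minimal sketch, assuming an index file named index.db whose nodes occupy fixed-size pages of PAGE_SIZE bytes (the file name, PAGE_SIZE, and the lock_page helper are placeholders of mine, not anything from the paper): the writer takes an exclusive advisory lock on just the byte range of the page it wants to modify, does a read-modify-write with pread()/pwrite(), and then unlocks.

    #include <sys/types.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <stdio.h>
    #include <string.h>

    #define PAGE_SIZE 4096   /* placeholder: match your node/page size */

    /* Lock or unlock one page-sized byte range; F_SETLKW blocks until granted. */
    static int lock_page(int fd, off_t page_no, short lock_type)
    {
        struct flock fl;
        memset(&fl, 0, sizeof fl);
        fl.l_type   = lock_type;            /* F_WRLCK to lock, F_UNLCK to release */
        fl.l_whence = SEEK_SET;
        fl.l_start  = page_no * PAGE_SIZE;  /* byte range covering just this page */
        fl.l_len    = PAGE_SIZE;
        return fcntl(fd, F_SETLKW, &fl);
    }

    int main(void)
    {
        unsigned char page[PAGE_SIZE];
        int fd = open("index.db", O_RDWR);  /* hypothetical index file */
        if (fd < 0) { perror("open"); return 1; }

        off_t page_no = 3;                  /* arbitrary page to modify */

        if (lock_page(fd, page_no, F_WRLCK) < 0) { perror("lock"); return 1; }

        /* Read-modify-write of the locked page (error handling simplified). */
        if (pread(fd, page, PAGE_SIZE, page_no * PAGE_SIZE) < 0) { perror("pread"); return 1; }
        page[0] ^= 1;                       /* stand-in for a real node update */
        if (pwrite(fd, page, PAGE_SIZE, page_no * PAGE_SIZE) < 0) { perror("pwrite"); return 1; }

        lock_page(fd, page_no, F_UNLCK);    /* release the advisory lock */
        close(fd);
        return 0;
    }

Keep in mind these locks are advisory, so they only help if every cooperating process uses them; and in the Lehman/Yao scheme readers deliberately take no locks at all, so whether an unlocked reader can observe a torn page comes down to the update-atomicity behaviour summarised in the answer below.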

Other tips

Answers:

  1. Whether concurrent readers can observe torn writes depends on the OS, the filing system, and the flags the file was opened with. A quick summary by flags, OS, and filing system is below.

  2. You can lock byte ranges in a file before accessing them using fcntl() on POSIX or LockFile() on Windows.


Without O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = 1 byte

Linux 4.2.6 with ext4: update atomicity = 1 byte

FreeBSD 10.2 with ZFS: update atomicity = at least 1 MB, probably infinite (*)

With O_DIRECT/FILE_FLAG_NO_BUFFERING:

Microsoft Windows 10 with NTFS: update atomicity = up to 4096 bytes, but only if page aligned; otherwise 512 bytes if FILE_FLAG_WRITE_THROUGH is off, else 64 bytes. Note that this atomicity is probably a feature of PCIe DMA rather than designed in (*).

Linux 4.2.6 with ext4: update atomicity = at least 1 MB, probably infinite (*). Note that earlier Linux kernels with ext4 definitely did not exceed 4096 bytes; XFS certainly used to have custom locking, but it looks like recent Linux has finally fixed this.

FreeBSD 10.2 with ZFS: update atomicity = at least 1 MB, probably infinite (*)


You can see the raw empirical test results at https://github.com/BoostGSoC13/boost.afio/blob/master/fs_probe/fs_probe_results.yaml. The results were generated by a program using asynchronous file I/O on all platforms. Note that we test for tearing only at 512-byte multiples, so I cannot say whether a partial update of a 512-byte sector would tear during the read-modify-write cycle.
