Question

I'm working on a project on an unreliable system which I'm assuming can fail at any point. What I want to guarantee is that if I write_state and the machine fails mid-operation, a read_state will either read a valid state or no state at all. I've implemented something which I think will work below -- I'm interested in criticism of that or alternative solutions if anyone knows of one.

My idea:

import hashlib, cPickle, os

def write_state(logname, state):
    state_string = cPickle.dumps(state, cPickle.HIGHEST_PROTOCOL)
    state_string += hashlib.sha224(state_string).hexdigest()

    handle = open('%s.1' % logname, 'wb')
    handle.write(state_string)
    handle.close()

    handle = open('%s.2' % logname, 'wb')
    handle.write(state_string)
    handle.close()

def get_state(logname):
    def read_file(name):
        try:
            f = open(name,'rb')
            data = f.read()
            f.close()
            return data
        except IOError:
            return ''
    def parse(data):
        if len(data) < 56:
            return (None, False)
        hash = data[-56:]
        data = data[:-56]
        valid = hashlib.sha224(data).hexdigest() == hash
        try:
            parsed = cPickle.loads(data)
        except cPickle.UnpicklingError:
            parsed = None
        return (parsed, valid)

    data1,valid1 = parse(read_file('%s.1'%logname))
    data2,valid2 = parse(read_file('%s.2'%logname))

    if valid1 and valid2:
        return data1
    elif valid1 and not valid2:
        return data1
    elif valid2 and not valid1:
        return data2
    elif not valid1 and not valid2:
        raise Exception('Theoretically, this never happens...')

e.g.:

write_state('test_log', {'x': 5})
print get_state('test_log')

Solution

Your two copies won't work. The filesystem can reorder writes so that both files have been truncated on disk before either has had its new contents written.

There are a few filesystem operations that are guaranteed to be atomic: renaming a file over another is one of them, insofar as the destination name will refer either to the old file or to the new one. However, as far as POSIX is concerned, there is no guarantee that the new file's contents have hit the disk before the rename does, meaning that by itself it only gives you locking, not durability.

Linux filesystems have traditionally ensured that file contents hit the disk before the atomic move does (though not synchronously), so this does what you want. ext4 broke that assumption for a while, making those files much more likely to end up empty. This was widely regarded as a dick move, and it has since been remedied.

Anyway, the proper way to do this is: create a temporary file in the same directory (so it's on the same filesystem); write the new data; fsync the temporary file; rename it over the previous version. This is as atomic as the OS can guarantee. It also gives you durability, at the cost of spinning up the disks, which is why application developers prefer not to use fsync and instead blacklist the offending ext4 versions.
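
A minimal sketch of that pattern, assuming POSIX rename semantics (the function name and the cleanup details are mine; write_state could pickle the state and hand the resulting bytes to it):

import os, tempfile

def atomic_write(path, data):
    # Publish `data` (bytes) at `path` so that a reader sees either the old
    # contents or the new ones, never a half-written file.
    dirname = os.path.dirname(path) or '.'
    # The temporary file lives in the same directory so the rename stays on
    # the same filesystem and remains atomic.
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        try:
            os.write(fd, data)
            os.fsync(fd)               # new contents hit the disk first...
        finally:
            os.close(fd)
        os.rename(tmp_path, path)      # ...then the atomic rename publishes them
    except Exception:
        os.unlink(tmp_path)            # don't leave the temporary file behind
        raise

If you also want the new name itself to survive a crash, some applications additionally fsync the containing directory after the rename.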

OTHER TIPS

I will add a heretical response: what about using sqlite? Or possibly bsddb, though that seems to be deprecated and you would have to use a third-party module.
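
For what it's worth, a minimal sketch of the sqlite route using the standard library sqlite3 module (the table layout, key column and function names here are made up for illustration):

import sqlite3
import pickle  # cPickle on Python 2

def open_state_db(path):
    conn = sqlite3.connect(path)
    with conn:  # commit the schema change
        conn.execute('CREATE TABLE IF NOT EXISTS state '
                     '(key TEXT PRIMARY KEY, value BLOB)')
    return conn

def write_state(conn, key, state):
    blob = sqlite3.Binary(pickle.dumps(state, pickle.HIGHEST_PROTOCOL))
    with conn:  # commits on success, rolls back if the write fails
        conn.execute('INSERT OR REPLACE INTO state (key, value) VALUES (?, ?)',
                     (key, blob))

def get_state(conn, key):
    row = conn.execute('SELECT value FROM state WHERE key = ?', (key,)).fetchone()
    return pickle.loads(bytes(row[0])) if row else None

SQLite's own journaling then provides the crash guarantee: after a failure a reader sees the last committed state, never a partial write.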

My vague recollection of how databases handle this is as follows. It involves three files: a control file, the target database file and a pending transaction log.

The control file has a global transaction counter and a hash or other checksum. It's a small file, one physical block in size, so updating it is a single OS-level write.

Have a global transaction counter in your target file with the real data, plus a hash or other checksum.

Have a pending transaction log that just grows or is a circular queue of a finite size, or perhaps rolls over. It doesn't much matter.

  1. Log all pending transactions to the simple log. There's a sequence number and the content of the change.

  2. Update the transaction counter and the hash in the control file. One write, flushed. If this fails, nothing has changed. If it succeeds, the control file and the target file no longer match, indicating that a transaction was started but not finished.

  3. Do the expected update on the target file: seek to the beginning and update the counter and the checksum there as well. If this fails, the control file's counter is one ahead of the target file's and the target file is damaged. When it works, the last logged transaction, the control file and the target file all agree on the sequence number.

You can recover by replaying the log, since you know the last good sequence number.
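
A rough sketch of those three steps, with a made-up file layout (name.ctl for the control file, name.dat for the target, name.log for the pending log) and JSON headers for readability; a real database would use fixed-size binary blocks and in-place updates:

import os, json, hashlib

def _write_flushed(path, data):
    # A single small write, forced to disk before the next step runs.
    with open(path, 'wb') as f:
        f.write(data)
        f.flush()
        os.fsync(f.fileno())

def log_transaction(name, seq, change):
    # Step 1: append the pending change and its sequence number to the log.
    record = json.dumps({'seq': seq, 'change': change}).encode('utf-8') + b'\n'
    with open(name + '.log', 'ab') as f:
        f.write(record)
        f.flush()
        os.fsync(f.fileno())

def write_control(name, seq, new_data):
    # Step 2: one flushed write of the counter plus a checksum of the new data.
    ctl = json.dumps({'seq': seq,
                      'sha': hashlib.sha224(new_data).hexdigest()}).encode('utf-8')
    _write_flushed(name + '.ctl', ctl)

def write_target(name, seq, new_data):
    # Step 3: write the real data, tagged with the same counter and checksum.
    header = json.dumps({'seq': seq,
                         'sha': hashlib.sha224(new_data).hexdigest()}).encode('utf-8')
    _write_flushed(name + '.dat', header + b'\n' + new_data)

def needs_recovery(name):
    # If the control file's counter is ahead of the target's, the last logged
    # transaction was started but never finished and should be replayed.
    with open(name + '.ctl', 'rb') as f:
        ctl = json.loads(f.read().decode('utf-8'))
    with open(name + '.dat', 'rb') as f:
        dat = json.loads(f.read().split(b'\n', 1)[0].decode('utf-8'))
    return ctl['seq'] > dat['seq']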

Under UNIX-like systems the usual answer is to do the link dance: create the file under a unique name (use the tempfile module), then use the os.link() function to create a hard link to the destination name after you have synchronized the contents into the desired (publication) state.

Under this scheme your readers don't see the file until its state is sane. The link operation is atomic. You can unlink the temporary name after you've successfully linked to the "ready" name. There are some additional wrinkles to handle if you need to guarantee these semantics over old versions of NFS without depending on the locking daemons.
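
A sketch of that dance (the names are illustrative; note that os.link() raises if the destination already exists, which is exactly what makes it usable as an atomic publish step):

import os, tempfile

def publish(destination, data):
    # Build the file under a unique temporary name in the same directory,
    # sync it, then hard-link it to the public name; readers never see a
    # half-written file.
    dirname = os.path.dirname(destination) or '.'
    fd, tmp_path = tempfile.mkstemp(dir=dirname)
    try:
        os.write(fd, data)
        os.fsync(fd)                  # contents are on disk before publication
    finally:
        os.close(fd)
    os.link(tmp_path, destination)    # atomic; fails if `destination` exists
    os.unlink(tmp_path)               # drop the temporary name afterwards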

I think you can simplify a few things:

def read_file(name):
    try:
        with open(name,'rb') as f:
            return f.read()
    except IOError:
        return ''

if valid1:
    return data1
elif valid2:
    return data2
else:
    raise Exception('Theoretically, this never happens...')

You probably don't need to write both files every time; just write file2 and rename it over file1.

I think there is still a chance that a hard reset (e.g. a power cut) could cause both files not to be written to disk properly, due to delayed writes.

Licensed under: CC-BY-SA with attribution