Question

I have a very latency-sensitive routine that generates integers sequentially, but needs to store the last generated one to disk in case of a crash or restart.

Currently I'm doing a seek to the beginning of the file, then writing out the integer, then a flush, each time a new int is generated. The flush is required so the write at least hits the battery-backed controller cache.

The seek is quite costly, so I was thinking about just appending 4 bytes and, if recovery is needed, seeking to the end and reading the last 4 bytes. This obviously assumes that there isn't too much other disk activity happening, so the write head would ideally stay at the end of the file.

The number won't typically go higher than 10,000,000, so 40 MB (10,000,000 values × 4 bytes) isn't so bad.

Any advice as to how to achieve minimum latency without sacrificing integrity?

C or C++ on Linux 2.6+

Solution

I would think the fastest/easiest way to do this would be with mmap/msync: mmap one page of the file into memory and store the value on that page. Any time the value changes, call msync(2) to force the page back to disk. This way you need only one system call per store.
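
A minimal sketch of that approach, assuming a file that is pre-extended to at least the mapped length (the file name is illustrative and error handling is kept short):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void) {
    /* "last_int.dat" is an illustrative name; adjust to taste. */
    int fd = open("last_int.dat", O_RDWR | O_CREAT, 0644);
    if (fd < 0) { perror("open"); return 1; }

    /* The file must be at least as long as the mapping. */
    if (ftruncate(fd, sysconf(_SC_PAGESIZE)) < 0) { perror("ftruncate"); return 1; }

    unsigned *last = mmap(NULL, sizeof *last, PROT_READ | PROT_WRITE,
                          MAP_SHARED, fd, 0);
    if (last == MAP_FAILED) { perror("mmap"); return 1; }

    for (unsigned i = 1; i <= 5; i++) {
        *last = i;                           /* store the new value         */
        msync(last, sizeof *last, MS_SYNC);  /* one syscall: flush the page */
    }
    return 0;
}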

OTHER TIPS

If I read correctly, how about using a memory-mapped file? Just write your number to the mapped address and it appears in the file. This assumes that the OS writes the cache to disk robustly when needed, but you might find it worth a try.

int len = sizeof(unsigned);
int fildes = open(...);  /* must be opened O_RDWR for a writable shared mapping */
/* PROT_WRITE and MAP_SHARED are needed so the store actually reaches the file;
   the file itself must already be at least len bytes long. */
void* address = mmap(0, len, PROT_READ | PROT_WRITE, MAP_SHARED, fildes, 0);
unsigned* mappedNumber = (unsigned*)(address);

*mappedNumber can now contain your integer.

Measure.

How much control do you have over the hardware? If anything less than full, you'll get no guarantees.

On Linux I'd probably try making a kernel driver that would do its writes with the highest priority, possibly even without using a file system.

But, theoretically... If it is enough for you to hit the controller cache, data will hit it every time you flush anything to disk. This means that regardless of whether there is a physical seek inside the drive or not, the data will already be there. And because you'll never know what other applications will do, or how fast the disk rotates, your seeks will be random even if you keep the logical file handle at the beginning or end of the file.

And you can always ask your user to use a flash drive.

The fastest way to write a file is to map that file into memory and treat it as a char array.

You don't need to sync the file if you don't care about OS crashes (Linux has never crashed on me in production). All your writes go to that file mapping, bypassing the kernel; in other words, real zero-copy (you can't do that with sockets on standard hardware yet). You may need to keep a header in that file containing the number of records written, in case your application crashes while writing a record into the mapping: write the record first, and only after that increment the record counter.
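
A sketch of that write-then-publish ordering (the layout and names are hypothetical):

#include <stdint.h>
#include <string.h>

/* Hypothetical layout: a count in a header, then fixed-size records. */
struct log_header { uint64_t nrecords; };

static void append_record(void *map, const void *rec, size_t recsize) {
    struct log_header *hdr = map;
    char *slots = (char *)map + sizeof *hdr;

    /* 1. Write the record body into the next free slot. */
    memcpy(slots + hdr->nrecords * recsize, rec, recsize);

    /* 2. Only then publish it by bumping the count; a crash between the
       two steps leaves the old count, so the torn record is ignored.
       (Production code would add a barrier or release store here to keep
       the compiler from reordering the increment before the memcpy.) */
    hdr->nrecords += 1;
}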

Resizing this file requires an ftruncate()/mremap() sequence, which may take a bit too long, so you may want to minimize resizing by growing the file by a factor, much as std::vector<> grows by roughly 1.5× its size on push_back() when it overflows. Depending on your throughput and latency requirements, certain optimizations can be applied.
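
A sketch of that growth step (the function name is hypothetical; mremap() is Linux-specific):

#define _GNU_SOURCE   /* for mremap() */
#include <sys/mman.h>
#include <unistd.h>

/* Grow the backing file and its mapping by ~1.5x, vector-style. */
static void *grow_mapping(int fd, void *old, size_t oldsize, size_t *newsize) {
    *newsize = oldsize + oldsize / 2;
    if (ftruncate(fd, (off_t)*newsize) < 0)  /* extend the file first */
        return MAP_FAILED;
    return mremap(old, oldsize, *newsize, MREMAP_MAYMOVE);
}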

The kernel is going to write the file mapping to disk asynchronously (as if there were another thread in your application dedicated to writing to disk). There is a way to force the writes to disk if necessary, using msync(). This is only necessary, however, if you'd like to survive an OS crash. But surviving an OS crash requires sophisticated application design anyway, so in practice surviving an application crash is good enough.

Why does your application have to wait for the write to complete at all?

Write your data asynchronously, or perhaps from another thread.

You don't really have much low-level control over the hard drive. As long as you write so little data at a time, you're going to incur a lot of expensive seeks. But since you're only using the writes as "checkpoints" to recover from in case of a crash, there seems to be no reason why the write couldn't occur asynchronously.
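
A sketch of the another-thread idea (names are hypothetical; error handling omitted): the generator publishes each new value to an atomic, and a dedicated thread checkpoints it whenever it changes, so the hot path never blocks on I/O.

#include <pthread.h>
#include <stdatomic.h>
#include <unistd.h>

static atomic_uint latest;   /* the hot path only stores here; it never blocks */
static int checkpoint_fd;    /* assumed opened elsewhere with O_RDWR           */

static void *checkpoint_thread(void *arg) {
    unsigned last_written = 0;
    (void)arg;
    for (;;) {
        unsigned v = atomic_load(&latest);
        if (v != last_written) {
            pwrite(checkpoint_fd, &v, sizeof v, 0);  /* overwrite in place */
            fdatasync(checkpoint_fd);                /* push to controller */
            last_written = v;
        }
        usleep(1000);  /* real code would block on a condvar or eventfd */
    }
    return NULL;
}

The tradeoff is the one implied above: a crash can lose the values generated since the last completed checkpoint.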

Storing an int only takes one block on disk, regardless of block size. So you have to sync one block to disk, and it takes as long as it takes; there is nothing you can do to make it faster.

Whatever else you do, fdatasync() will be the killer, time-wise. It will sync one block into your (battery-backed RAID) controller.

Unless you have some kind of non-volatile RAM, all (sensible) methods are going to be exactly equivalent, because they all require one block to be synced.

Doing a seek system call is not going to make any difference, as it has no effect on the hardware. In any case, you can avoid it by using pwrite().
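
For example, the seek-plus-write pair collapses into a single call (a sketch; the fd is assumed to be open with O_RDWR):

#include <unistd.h>

/* One syscall instead of lseek() + write(); the offset is explicit. */
void store_int(int fd, unsigned value) {
    pwrite(fd, &value, sizeof value, 0);  /* overwrite bytes 0..3 in place */
    fdatasync(fd);                        /* still the expensive part      */
}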

Consider what "appending 4 bytes" means. Disks don't store files, or even bytes; they store clusters, and a fixed number of them. The notion of a file is created by the OS, which allocates some clusters to file system tables to keep track of where a file is precisely located. Now, appending 4 bytes means at least writing the 4 bytes to a cluster, but it also means determining which cluster. What's the existing file size? Do we need a new cluster? If not, we need to read the last cluster, patch the 4 bytes into the correct position, write the cluster back, and then update the file size in the file system. If we do append a new cluster, we can write the 4 bytes followed by zeroes (we don't need the old contents), but we have to do a whole lot of bookkeeping to add a cluster to the file.

So the absolute fastest way can never be to append 4 bytes. You must overwrite 4 existing bytes, preferably in a sector you already have in memory. Others have already pointed out that you can achieve this with mmap/msync.

Obviously, given current SSD and developer prices, and your 40 MB limit, you'll be using an SSD. It pays for itself if you save an hour. Therefore seek times are irrelevant; SSDs don't have physical heads.

There are a lot of people here talking about mmap() as if that will fix something, but your syscall overhead is basically zero compared to the disk write overhead. Remember that appending or writing to a file requires you to update the inode (mtime, filesize) anyway, so that means a disk seek.

I suggest you consider storing the integer somewhere other than a disk. For example:

  • write it to some NVRAM that you control (e.g. on an embedded system). (If your RAID controller has NVRAM for writes, it might do this for you already. But if you're asking this question, it probably doesn't.)

  • write it to free bytes in the system CMOS memory (e.g. on PC hardware).

  • write it to another machine on the network (if it's a fast network) and have it acknowledge the write.

  • redesign your application so you can get away with syncing after every n transactions, instead of after every transaction; that will be about n times faster than syncing every time (see the sketch after this list).

  • redesign your application so that if the integer is lost, the changes from your most recent transaction are also lost. Then the fact that you've technically lost an integer update doesn't matter; when you reboot, it'll be as if you never incremented it, so you can just resume from there.
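
A sketch of the sync-every-n item from the list above (the batch size and names are illustrative):

#include <unistd.h>

enum { SYNC_EVERY = 64 };  /* illustrative batch size */

/* ~SYNC_EVERY times fewer fdatasync() calls, at the cost of losing up
   to SYNC_EVERY - 1 of the most recent updates in a crash. */
void store_int_batched(int fd, unsigned value) {
    static unsigned pending;
    pwrite(fd, &value, sizeof value, 0);
    if (++pending == SYNC_EVERY) {
        fdatasync(fd);
        pending = 0;
    }
}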

You didn't explain why you need this behaviour; to be honest, if your app needs it, it sounds like your application isn't designed very well. For example, some people suggested using a database, because databases do this sort of thing all the time; true, but databases do it by being slow (i.e. syncing the disk every time), unless you open a transaction first, in which case the disk only needs to be synced on 'commit transaction'. But if you absolutely must have a sync after every integer, you'd be constantly committing transactions, and a database couldn't save you from that; there's no magical way a database can guarantee not to lose data unless it does at least an fdatasync().

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow