What is constraining cross-platform asynchronous file I/O?

https://softwareengineering.stackexchange.com/questions/364572

26-01-2021

Question

Looking at a range of cross-platform languages, libraries and GUI toolkits, I often notice a conspicuous absence of support for asynchronous file I/O. This seems like too much of a common factor to be a coincidental oversight in all of them. However, I don't know enough about the individual OSs or the development of these languages and libraries to understand why they don't feature it.

Here are some examples.

Python's asyncio (started in 3.4) contains ways to handle network and subprocess activity asynchronously, but nothing for reading files from disk.
The Twisted library for event and protocol driven programming in Python seems to contain nothing for async file I/O.
Qt 5's QFile specifically does not emit the readyRead or bytesWritten signals, in contrast to other QIODevice implementations for networking.
All of wxWidgets' file-related classes are completely synchronous.

These are just the last four I looked up; it's possible that I picked four in a row without async file IO by coincidence, but they're all highly popular and stable. I know they're not the only popular libraries I've used over the last ten years in which I've missed it.

Maybe there's less demand for this than for reading data from a socket or subprocess, but is it so little as to be considered unwanted? A seemingly local file could be on the other side of a network connection (e.g. an SMB share or NFS mount), so local disk access will not necessarily be faster than a user would notice. It's not even necessarily true for a local, brand new SSD for that matter.

Common advice to roll one's own async file IO with threads seems counter to prevailing wisdom, when so much of what motivates toolkit usage is not rolling your own, especially when it comes to anything involving threading. I know it's possible, and maybe not that hard, but neither are many other things that are commonly available in these libraries and languages.

Let's take Qt as an example (extracting this from a comment on an answer). In Qt, if I want to do X without pausing my entire program, I can use Y:

If X = redraw a canvas; Y = use signals and slots
If X = read data via HTTP; Y = use signals and slots
If X = get data from a subprocess; Y = use signals and slots
But if X = get data from the hard drive; Y = implement something with a thread, or maybe two threads, two semaphores, special shared memory pointers, and maybe a bunch of other stuff.

Whether or not it's simple for me to implement is beside the point. The point is that file IO is an operation that can block, but it's consistently the odd one out in cross-platform toolkits and libraries by not having high-level handling.

Basically it seems that this omission demands extra complexity from the library user for a not-uncommon task, when the goal of these libraries is to absorb that kind of low-level implementation complexity. It seems odd that putting a local file on the other side of a local webserver makes it simpler to access in an event-driven program.

To be clear, and to make my question clear, this isn't a complaint about missing functionality. I want to understand cross-platform libraries and OS differences better, and so I'm genuinely curious about why this situation arose and whether there's some technical limitation or other constraint at the root of it.

Solution

I see two issues regarding asynchronous file IO:

Absence of async file IO on Linux.
Completion-based vs readiness-based async IO.

Linux provides the syscalls io_setup, io_submit, io_getevents and a few others to manage asynchronous file IO. This API has the following constraints:

The file must be opened with the O_DIRECT flag, i.e. all operations bypass the file cache. This alone makes it worthless for most applications.
Both the file offset and the buffer address must be aligned to 512 or 4096 bytes (depending on the underlying filesystem). This makes it possible to read/write data directly to/from the user buffer.

If the user violates any of those constraints, io_submit will silently perform all operations synchronously.

I read somewhere on the Nginx mailing list years ago that this API was implemented by Oracle for their database. They only needed asynchronous file IO that bypasses the file cache (something databases do), so they left the implementation incomplete.

POSIX provides the aio_write and aio_read functions, but on Linux those are implemented in userspace using a thread pool, which makes the existing implementations non-conforming (it is illegal to use those functions from a signal handler, for example).

The distinction between completion-based and readiness-based IO is not specific to files. With completion-based IO, the user is notified when the whole operation has completed, while with a readiness-based API the user is only notified that reading or writing can be performed without blocking.

Completion-based IO is more general and works better with threads. Readiness-based IO can only be used with non-blocking IO, and regular files have no useful non-blocking mode (they are always reported as ready), so it cannot be used with files.
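To make the point about files concrete, here is a small sketch, assuming a POSIX system (on Windows, select only accepts sockets). The helper name is_ready_to_read is made up for illustration. It asks select whether a descriptor is readable: an idle socket is reported as not ready until data arrives, while a regular file is reported as ready straight away, even though the actual read may still block on the disk.

```python
import select
import socket
import tempfile


def is_ready_to_read(fileobj, timeout=0.0):
    """Return True if select() reports fileobj as readable within timeout seconds."""
    readable, _, _ = select.select([fileobj], [], [], timeout)
    return bool(readable)


# A socket with no pending data is not ready; it becomes ready once the peer sends.
a, b = socket.socketpair()
socket_ready_before = is_ready_to_read(a)
b.sendall(b"x")
socket_ready_after = is_ready_to_read(a, timeout=1.0)
a.close()
b.close()

# A regular file is reported as ready whether or not its data is in the page cache,
# so the readiness notification says nothing about whether read() will block.
with tempfile.TemporaryFile() as f:
    f.write(b"hello")
    f.flush()
    f.seek(0)
    file_ready = is_ready_to_read(f)

print(socket_ready_before, socket_ready_after, file_ready)
```

Because the answer for a regular file is always "ready", an event loop built on select, poll or kqueue learns nothing useful from watching one.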

Completion-based IO can be implemented using readiness-based IO, but the opposite is not true. So if a library provides readiness-based IO that works with sockets, it cannot provide the same interface for files.
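A minimal sketch of the first claim, building a completion-style read on top of a readiness API. The CompletionReader class and its method names are hypothetical, not taken from any library; only the standard selectors and socket modules are used. The caller asks for N bytes and gets a single callback when the whole operation is done, while the helper handles the individual readiness events internally.

```python
import selectors
import socket


class CompletionReader:
    """Hypothetical helper: completion-style reads built on a readiness selector."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()

    def read_exactly(self, sock, nbytes, on_complete):
        """Arrange for on_complete(data) to be called once nbytes (> 0) have
        arrived, or with the partial data if the peer closes early."""
        sock.setblocking(False)
        state = {"buf": bytearray(), "want": nbytes, "callback": on_complete}
        self._selector.register(sock, selectors.EVENT_READ, state)

    def pending(self):
        """Number of operations that have not completed yet."""
        return len(self._selector.get_map())

    def run_once(self, timeout=None):
        """Wait for readiness events and advance the pending operations."""
        for key, _events in self._selector.select(timeout):
            sock, state = key.fileobj, key.data
            try:
                chunk = sock.recv(state["want"] - len(state["buf"]))
            except BlockingIOError:
                continue  # spurious wakeup, wait for the next readiness event
            state["buf"] += chunk
            if not chunk or len(state["buf"]) == state["want"]:
                # The whole operation is finished: notify the caller exactly once.
                self._selector.unregister(sock)
                state["callback"](bytes(state["buf"]))


a, b = socket.socketpair()
results = []
reader = CompletionReader()
reader.read_exactly(a, 10, results.append)

b.sendall(b"hello")          # only half of the requested bytes
reader.run_once(timeout=1)   # readiness fires, but the operation is not complete
b.sendall(b"world")
while reader.pending():
    reader.run_once(timeout=1)

a.close()
b.close()
print(results)
```

The reverse direction has no equivalent trick: a completion API only reports that an operation has finished, so there is no event from which to derive "a read would not block right now".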

On Windows, the most efficient native asynchronous API is completion-based and is called overlapped I/O. Unix-like systems primarily use readiness-based IO: epoll, kqueue, /dev/poll.

On Linux it is possible to receive completion notifications through an eventfd, but there is little point when there are so many limitations.

I think FreeBSD implements POSIX async IO in the kernel and allows you to receive completion notification through kqueue. I am not sure how good it is though.

Other answers

There's nothing particularly interesting about file I/O from an async standpoint. You can't even parallelize it well in the general case, because your I/O channel inevitably has intrinsic limitations that can't be overcome with parallelization. Some parallelization schemes will actually hurt performance on media that lends itself more readily to consecutive reads and writes, such as spindle drives.

Consequently, it's better to treat such I/O as a payload, not a special feature of async. Once you make this design decision, you are now free to write an async library that works with any payload, not just file I/O. This arrangement shouldn't be surprising; it's merely another form of Separation of Concerns.

Consider this: any synchronous method can be made asynchronous by using mechanisms like callbacks, promises, threads, continuations, etc.
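As a rough illustration of that last point, the sketch below wraps an ordinary blocking file read so that it can be awaited, using the thread pool behind asyncio's run_in_executor. The names read_file and run_async are made up for this example, and the temporary file exists only to make it runnable.

```python
import asyncio
import os
import tempfile


def read_file(path):
    """Ordinary blocking read; this is the 'payload'."""
    with open(path, "rb") as f:
        return f.read()


async def run_async(func, *args):
    """Run any blocking callable on the event loop's default thread pool."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(None, func, *args)


async def main(path):
    # The file read and the timer run concurrently; the event loop stays responsive.
    data, _ = await asyncio.gather(run_async(read_file, path), asyncio.sleep(0.01))
    return data


fd, path = tempfile.mkstemp()
os.write(fd, b"payload")
os.close(fd)
try:
    result = asyncio.run(main(path))
finally:
    os.remove(path)

print(result)
```

Nothing in run_async is specific to files, which is the separation of concerns described above: the async machinery is generic and the file read is just one possible payload.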

Licensed under: CC-BY-SA with attribution

Not affiliated with softwareengineering.stackexchange