Kevin McGuire wrote:
Emulate the POSIX functionality on top of what you implement. If you decide to make a complete asynchronous I/O mechanism, just wrap these system calls with the POSIX API in a library that software can link against.
Yeah. That's the plan, at least when I reach a point where it makes sense to attempt to port POSIX programs to the system.
Quote:
The page aligned block cache would make more sense. It should allow you to, like you said, eliminate the copy completely and still provide read and/or write permissions independently for each descriptor opened for that file. This is also known as a memory mapped file. It will also let you unload cold caches and reload them on a page fault. So this makes better sense in the long run.
Well, it's true that a page aligned block cache can simplify some things. For one, even without memory mapped files, it makes it easier to release the memory used by the block cache to the rest of the system (just unmap the pages), though it means one must keep the blocks and the cache headers separate. Not a huge complication, I guess.
The problem I can see with page-aligned caches, though, is that now we have two kinds of blocks to handle: disk blocks and cache blocks. Disk blocks would normally be 512 bytes, but filesystems tend to cluster several of them, so one could expect block sizes at powers of two between 512 and something like 16k. That means a cache block can contain several filesystem blocks, and a filesystem block might need more than one cache block.
More than one cache block per filesystem block isn't that hard to handle (just break each filesystem block into several cache blocks in the VFS and handle them as if the filesystem block size were the same as the cache block size), but several filesystem blocks per cache block is nastier: if one only needs a particular filesystem block, but the cache mandates page-sized blocks, then one must fetch the extra blocks too, and in the worst case these could be fragmented around the disk, requiring extra seeks. Missing blocks at the end of a file must also be handled, but that's not too different from having to zero partially used blocks anyway..
I guess it could still be done.
As for memory mapped I/O, if one is willing to use separate APIs for stream (pipe/socket) and file I/O, then I guess one could make memory mapped I/O the only supported I/O model for files. Memory mapped I/O kinda solves asynchronous writing, but then again that isn't the hard case anyway. As for reading, page faults are necessarily synchronous, so some sort of "prefetch these pages" system call is then needed for truly asynchronous operation, or even "prefetch and lock" if the process really must never stop in an I/O wait. It still saves some buffering, but I'm not sure if there's any other advantage. I've been thinking of doing it anyway.
Anyway, none of this really solves any problems with regard to slow directory lookups. I don't wanna have an application stuck in "still trying", so I guess the whole VFS has to be made an event-driven service, with the state of the operations exported to user space somehow. It's just a large scary chunk of design work, and I guess I'm afraid...
Basically, what it seems to involve is first identifying those operations which are known to complete in bounded time. That's stuff like close(), dup(), or lseek(), which only deal with file descriptors. The rest of the operations then need to be split into a request and a reply, with events to notify the process about completion or failure. Currently I can only signal events for file descriptors, which is fine for something like connect() or open(), but I'm not quite sure what to do with stuff like unlink(), rename() or stat().
One possibility would be to make these operations on file descriptors of the directories. The other possibility would be for those operations to return some dummy descriptor with no purpose other than signaling the outcome of the operation. I guess the latter approach is cleaner, especially as it fits nicely with my idea that a "file descriptor" is really a generalized handle that could be anything from a semaphore to a timer...
Actually, now that I've written it down, that sounds like a pretty decent design. So what remains, then, is how to structure the VFS to be driven by events from the device drivers (or, well, the block cache). I guess the cleanest approach is to have the operation add a block into the cache, attach a "continuation" to the block, and request a device driver to fill the block with the relevant data and then call the "continuation". If another operation needs the same block before it's been filled, it'll add another continuation, so as long as several of them can be queued, I guess it should work...
...and this is one of the days I wish I could just write my kernel in Scheme instead of C. Dealing with continuation-passing code would be so much easier if the language natively supported closures.