Kevin McGuire wrote:
Emulate the POSIX functionality on top of what you implement. If you decide to make a complete asynchronous I/O mechanism, just wrap these system calls with the POSIX API in a library that software can link against.
Yeah. That's the plan, at least when I reach a point where it makes sense to attempt to port POSIX programs to the system.
Quote:
The page aligned block cache would make more sense. It should allow you to, like you said, eliminate the copy completely and still provide read and/or write permissions independently for each descriptor opened for that file. This is also known as a memory mapped file. It will also let you unload cold caches and reload them on a page fault. So this makes better sense in the long run.
Well, it's true that a page aligned block cache can simplify some things. For one, even without memory mapped files, it makes it easier to release the memory used by the block cache to the rest of the system (just unmap the pages), though it means one must keep the blocks and the cache headers separate. Not a huge complication, I guess.
The problem I can see with page-aligned caches, though, is that now we have two kinds of blocks to handle: disk blocks and cache blocks. Disk blocks would normally be 512 bytes, but filesystems tend to cluster several of them, so one could expect block sizes at powers of two between 512 and something like 16k. That means a cache block can contain several filesystem blocks, and a filesystem block might need more than one cache block.
More than one cache block per filesystem block isn't that hard to handle (just break each filesystem block into several cache blocks in the VFS and handle them as if the filesystem block size were the same as the cache block size), but several filesystem blocks per cache block is nastier: if one only needs a particular filesystem block, but the cache mandates page-sized blocks, then one must fetch the extra blocks too, and in the worst case these could be fragmented around the disk, requiring extra seeks. Missing blocks at the end of a file must also be handled, but that's not too different from having to zero partially used blocks anyway..
I guess it could still be done.
As for memory mapped I/O, if one is willing to use separate APIs for stream (pipe/socket) and file I/O, then I guess one could make memory mapped I/O the only supported I/O model for files. Memory mapped I/O kinda solves asynchronous writing, but then again that isn't the hard case anyway. As for reading, page faults are necessarily synchronous, so some sort of "prefetch these pages" system call is then needed for truly asynchronous operation, or even "prefetch and lock" if the process really must never stop in an I/O wait. It still saves some buffering, but I'm not sure if there's any other advantage. I've been thinking of doing it anyway.
Anyway, none of this really solves any problems with regard to slow directory lookups. I don't wanna have an application stuck in "still trying", so I guess the whole VFS has to be made an event-driven service, with the state of the operations exported to user space somehow. It's just a large scary chunk of design work, and I guess I'm afraid...
Basically, what it seems to involve is first identifying those operations which are known to complete in bounded time. That's stuff like close(), dup(), or lseek(), which only deal with file descriptors. The rest of the operations then need to be split into a request and a reply, with events to notify the process about completion or failure. Currently I can only signal events for file descriptors, which is fine for something like connect() or open(), but I'm not quite sure what to do with stuff like unlink(), rename() or stat().
One possibility would be to make these operations on file descriptors of the directories. The other possibility would be for those operations to return some dummy descriptor with no purpose other than signaling the outcome of the operation. I guess the latter approach is cleaner, especially as it fits nicely with my idea that a "file descriptor" is really a generalized handle that could be anything from a semaphore to a timer...
Actually, now that I've written it down, that sounds like a pretty decent design. So what remains, then, is how to structure the VFS to be driven by events from the device drivers (or, well, the block cache). I guess the cleanest approach is to have the operation add a block into the cache, attach a "continuation" to the block, and request a device driver to fill the block with the relevant data and then call the "continuation". If another operation needs the same block before it's been filled, it'll add another continuation, so as long as several of them can be queued, I guess it should work...
...and this is one of the days I wish I could just write my kernel in Scheme instead of C. Dealing with continuation-passing code would be so much easier if the language natively supported closures.