Design Issues
In these pages, you'll find basic information about the main design choices an OS developer will face. We intentionally will not suggest one model over another, but instead try to offer comprehensive information about strengths and weaknesses of each model.
Kernel Models
Monolithic Kernel : One Big Chunk
Included from Monolithic Kernel
A monolithic kernel includes all (or at least, most) of its services in the kernel proper.
This reduces the amount of context switches and messaging involved, making the concept faster than a Microkernel. On the downside, the amount of code running in kernel space makes the kernel more prone to fatal bugs.
The word "monolithic" by itself means a single piece (mono) that is of or like stone (lithic), however when applied to kernels the exact meaning is more general. Most people consider that a kernel where device drivers and services run as part of the kernel is a monolithic kernel, regardless of whether parts are dynamically loaded "kernel modules" or if everything is a true single unchangeable binary. For this reason I draw a distinction between "monolithic" and "pure monolithic".
Modern versions of Linux are a well-known example of a monolithic kernel - while drivers are shipped as dynamically loaded "kernel modules" they are still loaded into and running in kernel space. Monolithic kernels are common for the 80x86/PC architecture.
Examples of "pure monolithic" kernels are rare for the 80x86/PC architecture (but more common in embedded systems). This is because of the wide variety of devices, hardware and CPU features that may be present within a modern PC - a pure monolithic kernel would need to be far too large or compiled specifically for the computer before use.
In general most OS's aren't "pure monolithic" or "pure micro-kernel", but fall somewhere between these extremes in order to make use of the advantages of both methods.
Microkernel : Divide and Conquer
Included from Microkernel
A Microkernel tries to run most services - like networking, filesystem, etc. - as daemons / servers in user space. All that's left to do for the kernel are basic services, like memory allocation, scheduling, and messaging (Inter Process Communication).
In theory, this concept makes the kernel more responsive (since much functionality resides in preemptible user-space threads and processes, removing the need for context-switching into the kernel proper), and improves the stability of the kernel by reducing the amount of code running in kernel space. There are also additional benefits for OS's that support multi-CPU computers (much simpler re-entrancy protection and better suitability for asynchronious functionality) and distributed OS's (code can use services without knowing if the service provider is running on the same computer or not). A drawback is the amount of messaging and Context Switching involved, which makes microkernels conceptually slower than a Monolithic Kernel. (This isn't to say that a smartly designed microkernel could not beat a foolishly designed monolithic one.)
The "in theory" starting above paragraph means that, yes, your kernel would not crash alongside with e.g. a crashing filesystem. But your user would still be stuck with data he cannot save, unless you made provisions to restart the filesystem server / daemon. Microkernels can be more stable, but it requires additional design work and does not come free with the architecture. Likewise, the additional design work that has to be done to get a microkernel design right could also be spent on making a monolithic kernel preemptable.
AmigaOS, for example, was a microkernel - and an unusual one: Since the original AmigaOS had no memory protection, its messaging was as quick as it could get (passing a pointer to memory), making the AmigaOS kernel one of the fastest ever devised. On the other hand, that lack of memory protection also meant that the microkernel architecture gave no added stability (later versions did implement MMU support, but at the same speed cost that affects other microkernel systems).
One of the side-effects of using a micro-kernel design (or a modular Monolithic Kernel), is the changes needed for booting the OS. If the file system is handled by a user-space process loaded by the kernel, the kernel itself contains no code to handle file systems or storage device drivers to load the file system process in the first place. One method of resolving this is for the boot loader to load a "RAM disk image", containing the kernel and several support files (device drivers, etc).
It's also possible for an OS design to borrow concepts from both monolithic kernels and micro-kernels in order to use the benefits of either method where appropriate.
Recommended Reading
There are two papers available on the web discussing the performance issues involved in microkernel designs:
The Increasing Irrelevance of IPC performance in Microkernel-based Operating Systems by Brian N. Bershad
The Persistent Relevance of IPC performance in Microkernel-based Operating Systems by Wilson C. Hsieh, M. Frans Kaashoek, and William E. Weihl
µ-Kernels Must And Can Be Small by Jochen Liedtke
Related Links
Related Threads
Exotic kernel models
Included from Exokernels
Exokernels?
Exokernels are an attempt to separate security from abstraction, making non-overrideable parts of the operating system do next to nothing but securely multiplex the hardware. The goal is to avoid forcing any particular abstraction upon applications, instead allowing them to use or implement whatever abstractions are best suited to their task without having to layer them on top of other abstractions which may impose limits or unnecessary overhead. This is done by moving abstractions into untrusted user-space libraries called "library operating systems" (libOSes), which are linked to applications and call the operating system on their behalf.
The concept of an exokernel is orthogonal to that of micro- vs. monolithic kernels. It is not important whether the secure multiplexing is performed in privileged kernel code or usermode servers, just that there are no forced abstractions.
"Separate security from abstraction"? Huh?
Take, for example, the typical abstraction of a file. Files, as the application sees them, do not exist on disk. On the disk, there are disk sectors. The operating system abstracts the reality of the disk to create the more convenient illusion of files and a filesystem. Usually, security is also provided at this level. ACLs, Unix-style permissions, etc. are applied to files. Security is combined with the abstraction.
On exokernels, security is provided on the unabstracted hardware level, in this example, to disk sectors. LibOSes provide any desired abstractions on top of this interface. Non-overrideable security is put in the exokernel, and the overrideable abstraction is implemented in a libOS. Security is separated from abstraction.
What are the advantages of doing this?
The upside of this is that user-space applications are allowed to implement their own, optimized memory management (by directly accessing e.g. memory tables), file access (by "raw" disk access), etc. This can, for special applications, result in significant performance increases. Engler et al. benchmarked a web server, Cheetah, running on an exokernel being up to eight times faster than the competition. Cheetah, among other things, uses knowledge of HTTP to combine IO requests, knowledge of HTML to appropriately colocate resources on disk, avoids copying data by sending resources to clients directly from the file cache, and caches pregenerated network packets. See
Application Performance and Flexibility on Exokernel Systems.
In addition, by being aware of resource availability, revocation and allocation, it is hoped that applications can make more efficient and intelligent use of hardware resources. For example, a server would know not to make its cache in memory larger than the amount of memory it actually has. In traditional systems, the operating system would fake however much the server had allocated with virtual memory, and the server would carry on, ignorant of any page faults, swapping, and paging.
It is also hoped that exokernels will, in a similar way to microkernels, ease development and testing of new operating system ideas. New scheduling techniques, memory management methods, filesystems, etc. can be developed and tested in libOSes, which are more quickly and easily modified than current operating systems. In
The Exokernel Operating System Architecture, Engler recalls the story of an undergraduate who was able to develop and test a new page table structure on an exokernel in a week while the designers had only been able to simulate it.
What about the downsides?
Exokernel technology is still not thoroughly researched, so surprises might lurk everywhere. Moreover, the added flexibility for user-space programs also means reduced consistency. While it would be theoretically possible to provide libOSes that enable e.g. MacOS, Windows, and Linux applications to run simultaneously on the same system, that would also mean different look & feels for each of them. In addition, different libOSes may have varying levels of compatibility and interoperability with each other.
It can be difficult to design exokernel interfaces. The designer must develop adequate and appropriate interfaces to low level hardware and delicately balance power and minimalism vs. enough protection. Engler et al. remark that their exokernel interfaces went through many revisions.
The ease of creation and mixing of libOSes could lead to code messes that would be a nightmare for maintenance coders and system administrators. Mantainence coders would have to deal with not only the application code, but any overriden abstractions or new implementations.
Information that could otherwise be useful to a kernel can be lost if the related abstractions are implemented in libOSes rather than the kernel.
What about nanokernels, picokernels, cache kernels, virtualizing kernels, etc.?
While Monolithic Kernel and Microkernel are rather well-defined terms, the advocates of exokernel-like technology have coined many different terms - nanokernel, picokernel, cache kernel, virtualizing kernel etc. Most of these are relatively minor variations of each other.
Nanokernels and picokernels are usually small kernels considered by their creators to be even smaller than microkernels. Examples include:
Adeos,
KeyKOS, and
LSE/OS.
The
Stanford cache kernel caches kernel objects, like address spaces and threads, and allows usermode “application kernels” to manage them, loading and unloading them as needed. Application kernels manage their threads' page faults, exceptions, etc., and the cache kernel allows several of these application kernels to coexist in a single system.
Virtualizing kernels are usually designed to allow multiple operating systems to run on a single computer, by allowing free execution of unprivileged instructions and trapping and simulating privileged instructions. Adeos, while it calls itself a nanokernel, is similar in concept to virtualizing kernels. Unlike exokernels, virtualizing kernels attempt to be as transparent as possible, to avoid requiring many modifications, if any, to the hosted operating systems.
Task Models
Monotasking Systems
Included from Monotasking Systems
Monotasking systems, also referred to as single-tasking systems, are operating systems in which only one thread of execution is run at a given time. When an application is executed, it takes control of the whole computer, save for a 'resident' part of the OS which handles system calls and reloads the rest of the system when the application exits. Generally speaking, such systems have little or no protection against malicious software.
The canonical example of a monotasking operating system is MS-DOS.
While they are easier to design and write, monotasking operating systems are extremely limited. As a result, nearly all new OS designs are forMultitasking Systems.
Multitasking Systems
Included from Multitasking Systems
What is Multitasking ?
Multitasking Systems are operating systems (or even system extensions) which divide available processor time between several tasks automatically, creating the illusion that the tasks are running simultaneously.
This can be achieved in several ways.
Cooperative Multitasking
This concept runs an application until it exits or yields control back to the OS. Examples for cooperative multitasking systems are pre-X MacOS, or Windows 3.x.
In some single language cooperative multitasking systems, such as Oberon and ruby, the compiler/interpreter automatically ensures that the code will periodically yield control; it allows such program to run in multi-threading on non-preemptive OS such as DOS. As yielding isn't necessary anymore, I'm not sure if we can still say it's cooperative multitasking.
Preemptive Multitasking
In a preemptive multitasking system, the OS can take away control from (preempt) an application after a time slice is used up or a signal occurred. An example would be e.g. Linux, *BSD, post-3.x Windows, BeOS, or AmigaOS.
You can further subdivide these systems into those who can preempt applications, and those who can preempt the kernel itself. Linux (pre-2.6 kernel) is an example of the former, while e.g. AmigaOS is an example for the latter. This is a major concern for multimedia applications or any "soft" Real-Time Systems because a non-preemptive kernel introduces latencies that can ruin such "near real-time" performance.
How does it work ?
You have programs running. Each program has some binary code to be executed by the processor and an execution context made of e.g. registers state, stack content, etc.
Since you have a single CPU, you have a single program executed at a given moment and its execution context is the state of the cpu's registers while you may have plenty of programs sleeping, waiting for their turn with their context saved in OS's datastructures, right ?
Now, the OS has set up a timer interrupt, which causes the OS call a specific interrupt service routine at regular interval. When the timeslot for the current program runs out, the routine will save the current CPU context into a datastructure, select a new program to be run for the next timeslot, and load the CPU registers with the values that were saved in that process's datastructure.
If you still want to figure out, imagine a machine with an accumulator and a stack pointer. You can save the machine state, switch and restore another machine state with
PUSH ;; put the accumulator's content on stack LOAD [current_pid] ;; put it in the accumulator STORE_SP [context_table+ACC] ;; get somehow the next pid into ACC, without using the stack STORE [current_pid] ;; we'll need it here for the next run LOAD_SP [context_table+ACC] POP ;; get back the accumulator's content IRET ;; end of interrupt
Now imagine two programs on that machine, e.g.
IMM 0 ;; into accumulator here: INC ;; accumulator++ JMP here
and
IMM 0 there: DEC ;; accumulator -- JMP there
Sketch on a papersheet the memory of the machine, with the pid_table, the two stacks , the current_pid variable, and then go. You should be able to emulate the behaviour of that machine if --say-- you have a interrupt every 10+some_number_you_get_by_rolling_a_dice machine instructions.
And you'll see it can continue working on program 1 after it has been interrupted for 10+ instructions by program 2.
Of course "10+dice" is not enough to get decent performances. On modern systems, your timeslice is usually of a few milliseconds, which makes millions of instructions! Moreover, you don't always switch when the timer arise (because you might want the system clock more accurate than the switching rate) and you can switch on other events (e.g. because a program needs to wait for something or because something important like the end of a disk request happened).
Related Threads:
Designing a kernel to be preemptible
Realtime Systems
Included from Real-Time Systems
Real-Time Operating Systems are systems in which certain processes or operations have guaranteed minimum and/or maximum response times. That is to say, the system ensures that it will complete operation x after time t but before time t', whatever t and t' are, without fail, even at the expense of other lower priority operations.
Speed, in and of itself, is not critical; the primary goal is predictability. A response time less than t may be just as bad as, or worse than, one greater than t'.
'Soft' real-time systems (e.g., laser-printer microcontrollers) promise best-effort reliability; 'hard' real-time systems (e.g., 'fly-by-wire' avionics in an aircraft) need to have zero-failure-tolerance reliability.
Real time operating systems are, as a general rule, only used in time-dependent embedded applications; general-purpose systems rarely if ever need to meet real-time constraints. When the do occur, real-time considerations may rule out the use of some common techniques, such as virtual memory, which may make the system's behavior less deterministic.
One of the best-known Real Time OS for the x86 platform is QnX. Each system call of QnX is documented with a 'worst case completion time'.
It should be noted that "being Real Time" not necessarily means that an OS is very good at playing MPEGs, or using hardware efficiently - this is a common misunderstanding. To the contrary, for a system to provide hard real-time services implies that it can use only a limited percentage of the system's resources, including CPU time. It also fundamentally changes how software can be built for the system. For example, Rate Monotonic Scheduling - a hard real-time scheduling algorithm - can guarantee time restraints only up to 70% CPU load. Beyond that, the system has "to hit the red button" because it can no longer guarantee anything.
This implies that applications have to state their run-time requirements beforehand - how often they must be called in a second, which maximum response time is acceptable etc. All this information must be provided by the application programmer.
Bottom line, hard real time is for industrial, medical, or military systems. On your average desktop, it is misplaced.
See also Real-Time Scheduling Algorithms
Memory and Resource Management
Physical, Virtual, Paging, help?!?
Included from Physical, Virtual, Paging, help?!?
Physical Addresses
...is when every address "visible" to a program directly relates to an address on the address bus between CPU and RAM. If I access 0x0023 5fc4, I mean the address identified by the bit pattern 0x0023 5fc4 on the address bus.
In this memory model, every executable or library must either use PIC (position-independent code), or come with relocation tables so jump and branch targets can be adjusted by the loader.
The AmigaOS used this memory model, in absence of a MMU in early 680x0 CPUs. It is most efficient, but it does not allow for protecting processes from each other, thus it is considered obsolete in today's desktop operating systems. It is also prone to memory fragmentation; certain embedded systems still use it, though.
Virtual Addresses
The advent of MMUs (Memory Management Units) allowed to play tricks to the addressing. A virtual address could be mapped to any physical address. It was possible to provide each executable with its own address space, so that memory always starts at 0x0000 0000. This relieves the executable loader of some relocation work, and solves the memory fragmentation problem - you no longer need physically continuous blocks of memory. And since the kernel is in control of the virtual-to-physical mapping, processes cannot access each other's memory unless allowed to do so by the kernel.
Paging
Having an individual virtual-to-physical mapping for each address is of course ineffective. The traditional approach to virtual memory is to split up the available physical memory into chunks (pages), and to map virtual to physical addresses page-wise. This task is largely handled by the MMU, so the performance impact is low, and generally accepted as an appropriate price to pay for memory protection.
physical memory virtual memory (A) virtual memory (B)
+-------+ +-------+ +-------+
00x |H E L L| page table |H E L L| page table |H A V E|
01x |R L D !| for proc. A |O W O| for proc. B | L O T|
02x |O W O| 00x => 00 |R L D !| 00x => 03 |S O F|
03x |H A V E| 01x => 02 |#######| 01x => 05 | F U N|
04x | F U N| 02x => 01 |#######| 02x => 06 |#######|
05x | L O T| 03x n.a. |; - ) | 03x => 04 |; - ) |
06x |S O F| 04x n.a. +-------+ 04x => n.a. +-------+
07x |; - ) | 05x => 07 05x => 07
+-------+
paging illustrated: two process with different views of the
same physical memory
Memory Protection
Probably the most useful application of virtual addresses is memory protection. In a memory-protected environment, every process (executable) gets its own address space. To the executable, it looks like it is running alone on the system, no-one else there but the kernel. The bargain is that a misbehaved, malicious, or buggy executable cannot damage / corrupt other processes in the system. If the executable fails, the rest of the system keeps running instead of meeting the Guru Meditation, Blue Screen of Death, or whatever your system tells you when the system gets borked.
Virtual Memory (Swap file/partitions)
The next step is, instead of reporting an "out of memory" once the physical memory runs out, is to take pages that are not actually accessed currently, and write them to hard disk (swapfile or -partition) - freeing up the physical memory page. This is referred to as "paging out" memory.
This requires additional bookkeeping and scheduling, introduces a severe performance hit when a process accesses a page that's currently swapped out and must be swapped in again from hard drive, and requires some smart design to run efficiently at all. Do it wrong, and this one part of your OS can severely impact your performance.
On the other hand, your "virtual address space" grows to whatever your CPU and hard drive can handle. In concept, CPU caches and RAM simply become cache layers on top of your hard drive, which represents your "real" memory limitation.
Page swapping systems relies on the assumption that, at a given time, a process does not need all of its memory to work properly, but only a subset of it (like, if you're copying a book, you certainly don't need the whole book and a full set of blank pages: the current chapter and a bunch of blank page can be enough if someone can bring you new blank pages and archive the pages you've just written when you come short on blank pages, or bring you the next chapter when you're almost done with the current one). This is known as the working set abstraction. In order to run correctly, a process requires at least its working set of physical pages: if less pages are provided to the process, there's a high risk of thrashing, which means the process will be constantly requiring pages to be swapped in -- which forces other pages from this process's working set to be swapped out while they should have remained present.
Note: there are alternatives to page-swapping like segments-swapping and process-swapping. In those cases, the swap is rather user-software controlled, which puts more stress on the application developer and leads to longer swapping burst as the logical things to be swapped are bigger than 4K pages.
Other note: mainstream desktop OSes have a speculative algorithm that tries to reduce the 'page miss' frequency by loading more than what is required, and hoping that these extra pages will be useful. As programs tend to have localized access and that disks can read a track of N sectors faster than N independent sector, speculative swap-in may pay.
Summary
Virtual addresses and paging are A Good Thing (tm), even a requirement for every decent OS not necessarily limited to desktop/server systems (sometimes even embedded ones). Virtual memory certainly is a nice-to-have (and every modern desktop OS has it), but it comes at a cost and requires significant additional design and know-how to get right.
Related Threads
Allocating and Freeing Memory
Included from Allocating and Freeing Memory
One of the most basic functions of a kernel is the memory management, i.e. the allocating and freeing of memory.
At square one, the kernel is the only process in the system. But it is not alone: BIOS data structures, memory-mapped hardware registers etc. populate the address space. Among the first things a kernel must do is to start bookkeeping about which areas of physical memory are available for use and which are to be considered "occupied".
The free space will subsequently be used for kernel data structures, application binaries, their heap and stack etc. - the kernel needs a function that marks a memory area as reserved, and makes that memory available to the process requiring it. In the C Standard Library, this is handled by malloc() and free(); in C++ by new() and delete().
A very very simple Memory Manager
The easiest you can do is the WaterMark allocator. Just keep track of how far you've allocated and forget about the notion of "freeing".
before ...
<-N-->
+----+-----+--+--------//--+ +----+-----+--+----+---//--+
|##A#|##B##|C#| free | |##A#|##B##|C#|##D#|free |
+----+-----+--+---------//-+ +----+-----+--+----+---//--+
^freebase ^freetop ^d ^fbase ^ftop
When allocating N bytes for D, simply check that freetop-freebase>N and increment freeebase by N. Period.
A very simple Memory Manager
Now, if you need to free things, one of the easiest solution is to put at the start of the freed zone a descriptor that allows you to insert it in a list of free zones. Keeping that list sorted by address helps you identifying contiguous free zones and allows you to merge them in larger free zones.
first free Structure of a +--.--.--.--+ +---> +-- ...
\/ free zone |F R E E | | | FREE
+----+-----+--+----+---+-----//--+ |nextfreeptr|----+
|##A#|free |C#|free|#E#| free | <----|prevfreeptr|
+----+-|---+--+-|--+---+-----//--+ | zone size |
+-next>--+ +///////////+
What data structure and algorithms can be used with such lists has been discussed in
Memmaker's started thread (as well as in many others)
Tips to go further
- Sometimes, and especially when you're working with objects, you have to allocate many objects that always have a certain size. It is wise to create pools of pre-divided large blocks for such objects.
- It's way easier to keep the size of allocated objects in a header hidden from the requestor, so that a call to "free" doesn't require the object's size.
- It's way easier to design a memory allocator in a host OS than in your kernel.
- Magic words like "F R E E" and "U S E D" will make your life easier when debugging. TimRobinson even allows 32 bits to store the address of the requestor so that you can see "okay, this is a N-bytes block that was requested by MySillyFunction(), line 3405" ...
Memory & Microkernels
The following paragraphs were contributed to the Pro-POS Wiki by "Beyond Infinity". He agreed to have them forwarded here due to their catch-all nature.
-- MartinBaute
In a microkernel environment, there comes up a question: where the hell shall I put the memory management? In sense of heap management: give the kernel a dedicated allocator and a dedicated memory area to use - you might need two of them: one for the messages, and one for all the other stuff.
The other task of memory management: Process address space management, keeping track of used pages (yes, lets talk about paging, it is nice, it is neat, it causes a warm fuzzy feeling beneath the toes) and assigning memory to processes as needed, you can split it up. To be more precise, you have to split this task up - or keep every aspect of memory management in kernel land to make it easy. A recommended method for managing process address space: handle it in a per-process-manner. The actual process needs memory: allocate the memory and hand it out to the process. So you can keep the page allocating routines easy and straight forward. For this task of allocating/deallocating memory, you should take into consideration, that the task performing such actions should be able to slip into adress spaces at needs (it loads the required pagedirectory and does what it has to do - slimy little weasel thou' it is.) Take those things into good consideration and put quite an amount of brainwork into them. It's worth doing good engineering here.
File Management
Included from File Management
This has been contributed to the Pro-POS Wiki by "Beyond Infinity". He agreed to have it forwarded here due to its catch-all nature.
-- MartinBaute
A monolithic kernel handles everything in one monolithic (hence the name) process. A Microkernel consists of several processes which are responsible for all the grunt work: allocating memory, managing processes - and housekeeping permanent data storage like hard disks or floppies. Yes, about filesystems the talk is.
Say, we have a file system service. It is responsible for keeping the File System on a Floppy/HD what so ever up to date (and eventually handle pipes and so forth...) The very basic -- the uttermost basic task of the file system service is to keep track of allocated/deallocated blocks and (in case of ext2) inodes - files, to which blocks are allocated - you also need blocks for the file/block management. Another task a file system service usually performs, is keeping blocks in memory for quick access. They are slow to retrieve from disk storage - compared to a simple memory access. So the file system service keeps the so called block cache. Now, think about it: you use paging, you have the file system service as a process of its own. It fetches blocks and keeps them in its adress space.
You'll have to do lots of work to transfer the data of a retrieved block to a user process which performs the read() call - or vice versa a write().
Check out, how it would be to have a task in kernel space keep the block cache and performing the actual reading/writing.
Algorithms and Tips for Memory Management
Included from Algorithms and Tips for Memory Management
Physical Memory Allocators
These are the algorithm that will provide you a new page frame when you'll need it. The client of this algorithm is usually indifferent to which frame is returned, and especially, a request for n frames needn't to return contiguous frames.
N will be the size of the memory in pages in the following text.
Bitmap
A large array of N/8 bytes is used as a large bit map of the memory usage (that is, bit #i in byte #n define the status of page #n*8+i). Setting the state of a given page is fast (O(1)), allocating a page may take time (O(N)).
- a dword comparison can test up to 32 bits at once and thus speed up allocs
- keeping a pointer to the last allocated bit may improve the performance of the next search (keeping information about the fact all the previous bytes were searched unsucessfully)
-- any pointer to a PD reference implementation is welcome /PypeClicker
Stack/List of pages
The address of each available physical frame is stored in a stack-like dynamic structure. Allocating a page is fast (O(1)), freeing a page too but checking the state of a page is not practical. Note that when all the memory is free, up to N*4 bytes are required to store the states.
-- any pointer ...
Sized Portion Scheme
- You split each area of, say 16kb into (for example) chunks of 1 8kb, and 2 4kb's. Then you hand out each chunk. By doing this you can find closer fits to exact sizes. That means less waste. So say that you have an area of 32kb
<---------------------------------------->
|8kb....|4kb|4kb||8kb....|4kb|4kb|
<---------------------------------------->
You can even have 1, 2, even 3 or 4 (or more!) types of layouts for each portion. This way you have even more sizes to choose from.
Buddy Allocation System
This is the physical memory allocator of Linux kernel. Note that linux has several buddies depending on whether the memory is suitable for ISA DMA, or is coming from 'high physical memory' or just 'normal'. Each buddy contains k bitmaps, each indicating the availability of 2^i-sized and 2^i aligned blocks of free pages. Usually, linux uses from 4K to 512K blocks.
0 4 8 12 16 20 24 28 32 36
###.#....#........#...###...########.... real memory pattern
buddy[0]---> ###.#.xx.#xxxxxxxx#.xx###.xx########xxxx 5 free 4K , 5-byte bitmap
buddy[1]---> # # # . # . x x . # . # # . # # # # x x 5 free 8K , 20-bits map
buddy[2]---> # # # . # # # # # . 2 free 16K, 10-bits map
buddy[3]---> # # # # # 0 free 32K, 5-bits map
A buddy for N pages is about twice the size of a bitmap for the same area, but it allows faster location of collections of pages. Figure above shows a 4-buddy with free pages/blocks denoted as . and used pages/blocks denoted as #. When a block contains at least one used sub-block, it is itself marked as used and sub-blocks that are part of a larger block are also marked as used (x on the figure). Say we want to allocate a 12-K region on this buddy, we'll lookup the bitmap of free 16K blocks (which says we have one such starting at page #12 and another starting at page #36). buddy[2]->bit[4] is then set to 'used'. Now we only want 3 pages out of the 4 we got, so the remaining page is returned to the appropriated buddy bitmap (e.g. the single pages map). The resulting buddy is
0 4 8 12 16 20 24 28 32 36
###.#....#..###...#...###...########.... real memory pattern
buddy[0]---> ###.#.xx.#xx###.xx#.xx###.xx########xxxx 6 free 4K , 5-byte bitmap
buddy[1]---> # # # . # . # # . # . # # . # # # # x x 5 free 8K , 20-bits map
buddy[2]---> # # # # # # # # # . 1 free 16K, 10-bits map
buddy[3]---> # # # # # 0 free 32K, 5-bits map
Note that initially, only the largest regions are available, so if buddy[0] is apparently empty, we need to check buddy[1], then buddy[2] etc. for a free block to be split.
Hybrid scheme
Allocators may be chained so that (for instance) a stack only covers the last operations and that the 'bottom' of the stack is committed to a bitmap (for compact storage). If the stack lacks pages, it can scan the bitmap to find some (possibly in a background job).
Hybrid scheme #2
Instead of keeping track of just bits representing pages, or just page numbers on a stack, use a big array of structs to represent the memory. In these page frame structs, store a single link to a next page (pointer-sized) and a 8-16 bit information block indicating its status. Also include the virtual page pointer and the TCB to which the page number belongs. Keep pointers to each type of page, to both the start and the end of their lists. This way, you can easily display information about their content, the amount of pages for each type available, mix types, allow dynamic cleaning threads, do copy-on-write fairly easily and keep clear & concise overviews of the pages. It functions as a reverse page mapping table that lists types of pages too.
For an example implementation see AtlantisOS 0.0.2 or higher. Think I did this, not entirely sure :)
Virtual Addresses Allocator
Flat List
One straightforward way to manage big areas of addresses space is a linked-list (as depicted below). Each "free" region is associated with a descriptor giving its size and its base address. When some space needs to be allocated, the list is scanned for a region being large enough with a "first fit" or "best fit" (or wathever) algorithm. This was e.g. the way memory was managed by MS-DOS. When memory is allocated, the suitable entry is either removed (the whole region is allocated) or resized (only a portion of the region is allocated).
Note that with flat linked-lists, both "is memory at address XXX free" or "where can i get a block of size YYY" questions may require a complete traversal of the list to get answered. If virtual memory gets fragmented and the list gets longer, that may become an issue. "Is memory at address XXX free?" is mainly used to merge two free zone into a new (bigger) one when a block is released, and it is easier to deal with if the list is kept ordered by growing addresses.
address space
_______ ______
Start--------------->| | | (A) |
Size--------------\ | free | -> | free |
Block_before \ | space | |------|
,-Block_after \|_______| |\\\\\\|
| |\used\|
| _______ |\\\\\\|
| Start--------------->| | |------|
| Size--------------\ | free | | (B)|
'-Block_before \ | space | -> | free |
,-Block_after \|_______| | |
| |------|
| _______ |\used\|
| Start--------------->| | |---(C)|
| Size--------------\ | free | -> | free |
'-Block_before \ | space | |------|
,-Block_after \|_______| |\\\\\\|
| |\used\|
| _______ |\\\\\\|
| Start--------------->| | |------|
| Size--------------\ | free | | (D)|
'-Block_before \ | space | -> | free |
Block_after \|_______| |______|
Tree-based approach.
Since it is frequent that the list is searched for a given address or a given size, it may be interresting to use more efficient data structures. One of them that still keeps the ability of traversing the whole list is the AVL Tree. Each "node" in the AVL tree will describe a memory region and has pointer to the subtree of lower nodes and to the subtree of higher nodes.
tree _________ tree ________
root | B.start | root | D.size |
| B.size | | D.start|
|_________| |________|
/ \ / \
____/____ ____\____ ____/___ ___\____
| A.start | | C.start | | C.size | | B.size |
| A.size | | C.size | | C.start| | B.start|
|_________| |_________| |________| |________|
\ \
____\____ ____\___
| D.start | | A.size |
| D.size | | A.start|
|_________| |________|
by-address ordering by-size ordering.
While insertion/deletion in such a balanced tree requires more complex operations than linked list manipulation, searching the tree is usually achieved with O(log2(N)) rather than O(N) for linked lists -- that is, if you have 1000 entries, it requires 1000 iterations to scan the list against 10 iterations to find a given interval in the tree.
Linux has used AVL trees for virtual addresses management for quite a while. Note however that it uses it for regions (like what you find in /proc/xxxx/maps), not for a malloc-like interface.
Related Threads
Allocating memory for an allocator without an allocator,
A bitmap based allocation technique
Ways to keep track of allocated RAM,
Questions about Memory Allocation,
Memory Management,
Memory Management to the X'th,
MM Questions,
(about)Tim Robinson Memory Management Tutorial #1
Managing used/free pages
Malloc, etc. (tute by curufir)
Physical MM (by Freanan)
Additionnal References
mystran's
Basic VMM for Dummies
Scheduling
Definition of terms
Included from Tasks, Processes and Threads
The following is a "pseudo-Unixish" definition of the terms. Various sources use these terms in different ways.
Process
A process is a running program, including code, data, heap, and stack. In most implementations (but not always), each process has its own virtual address space (i.e. its own logical address-to-physical memory mapping) and its own set of system resources (files, environment variable, etc.)
Thread
A thread is a control flow in an executable image. Threads can be "user level" (i.e., the process handles multiple threads within itself) or "kernel level" (i.e., the OS scheduler handles multiple threads within a single process). Two threads from the same process naturally share the same code and global data but are given different stacks so that they do not interfere with each other's local variable and may have their own call chain.
Task
This is the least well-defined term of the three; it is frequently used to describe a control flow that can be scheduled by the OS (i.e., a kernel-level thread or a process)...
Context Switching
Included from Context Switching
In your average, memory-protected environment, a "context" is a virtual address space, the executable contained in it, its data etc.
A "context switch" occurs for a variety of reasons - because a kernel function has been called, the application has been preempted, or because it had yielded its time slice.
A context switch involves storing the old state and retrieving the new state. The actual information stored and retrieved may include EIP, the general registers, the segment registers, CR3 (and the paging structures), FPU/MMX registers, SSE registers and other things. Because a context switch can involve changing a large amount data it can be the one most costly operation in an operating system.
There are many ways of performing a context switch. The x86 CPU provides a way of doing it completely in hardware, but for performance and portability reasons most modern OS's do context switches in software.
Hardware Context Switching
Some CPU's have a special mechanism to perform context switches in hardware. The following information gives details on 80x86 CPU's only.
The hardware context switching mechanism (called Hardware Task Switching in the CPU manuals) can be used to change all of the CPU's state except for the FPU/MMX and SSE state. To use the hardware mechanism you need to tell the CPU where to save the existing CPU state, and where to load the new CPU state. The CPU state is always stored in a special data structure called a TSS (Task State Segment).
To trigger a context switch and tell the CPU where to load it's new state from the far version of CALL and JMP instructions are used. The offset given is ignored, and the segment is used to refer to a "TSS Descriptor" in the GDT. The TSS descriptor is used to specify the base address and limit of the TSS to be used to load the new CPU state from.
The CPU has a register called the "TR" (or Task Register) which tells which TSS will receive the old CPU state. When the TR register is loaded with an "LDTR" instruction the CPU looks at the GDT entry (specified with LDTR) and loads the visable part of TR with the GDT entry, and the hidden part with the base and limit of the GDT entry. When the CPU state is saved the hidden part of TR is used.
A step further with Hardware Switches ...
In addition to the CALL and JMP instructions, a context switch can be triggered by a using a Task-Gate Descriptor. Unlike TSS Descriptors, task-gate descriptors can be in the GDT, LDT or IDT. Normally, task-gate descriptors are used in the IDT, so that an exception (or IRQ) can cause a context switch, which is the only way of handling a double fault exception with complete reliability.
The design of the basic hardware mechanism is limited by the number of usable entries in the GDT because TSS descriptors can be in the GDT only (theoretical limit is 8190 tasks). However, it is possible to avoid this restriction by dynamically changing TSS descriptor/s, by setting the TSS descriptor's base before each context switch. Care must be taken when using this approach when task-gate descriptors in the IDT are also used (the TSS descriptors refered to by each task-gate descriptor would have to be constant). Also context switches can't be initiated with a CALL instruction, because the CPU saves the GDT entry to use for the return in the TSS's "backlink" field.
If the FPU/MMX and SSE state also needs to be changed during a context switch there are a few options. The data could be explicitly saved by any code that causes a context switch, or the CPU can generate an exception the first time an FPU/MMX or SSE instruction is used. With the second option, the exception handlers would save the old FPU/MMX/SSE state and reload the new state. This option may prevent this data from being changed when it's not necessary (for e.g. when no tasks or only one task is using them), but fails to work correctly in a multi-processor environment without additional syncronisation which may be more expensive than using the first option.
Performance Considerations
Because the hardware mechanism saves almost all of the CPU state it can be slower than is necessary. For example, when the CPU loads new segment registers it does all of the access and permission checks that are involved. As most modern operating systems don't use segmentation loading the segment registers during context switches may be not be required, so for performance reasons these operating systems tend not to use the hardware context switching mechanism. Due to it not being used as much CPU manufacturers don't optimise CPUs for this method anymore (AFAIK). In addition the new 64 bit CPU's do not support hardware context switches when in 64 bit/long mode.
However, there was an interesting post on OSNews by Aage in July 2004, quantifying the amount of unavoidable hardware overhead involved in a context switch. It appears that the hardware overhead in a context switch on a modern P4 processor dwarfs the overhead involved in saving/loading registers (995ns of HW overhead vs 67ns to save/load registers). From this, it would appear that any performance gains from switching to software task switching would be minimal, amounting to no more than a few percentage points.
There is actually quite little you can do in software to improve the overhead of context switches. Most of the overhead is hardware related. Sure you can tweak the code that stores/restores registers, performs scheduling, and stuff, but in the grand scheme of things hw overhead dominates (I'll substantiate that below). Using the x86 as an example architecture:
Assuming the context switch is initiated by an interrupt, the overhead of switching from user-level to kernel-level on a (2.8 GHz) P4 is 1348 cycles, on a (200 MHz) P2 227 cycles. Why the big cycle difference? It seems like the P4 flushes its micro-op cache as part of handling an interrupt (go to arstechnica.com for some details on the micro-op cache). Counting actual time, the P4 takes 481 ns and the P2 1335 ns.
The return from kernel-level to user-level will cost you 923 cycles (330 ns) on a P4, 180 cycles (900 ns) on a P2.
The overhead of storing / restoring registers (not counting any TLB overhead / excluding cost of FPU register store / restore) is 188 cycles on the P4 (67 ns), 79 cycles on the P2 (395 ns).
A context switch also includes the overhead of switching address spaces (if we're switching between processes, not threads). The minimal cost of switching between two address spaces (counting a minimal TLB reload of 1 code page, 1 data page, and 1 stack page) is 516 cycles on a P4 (184 ns) and 177 cycles on a P3 (885 ns).
So the equation is (for a P4):
811 ns (HW) + 184 ns (HW: address space switch) + 67 ns (register store / restore) + ?? (scheduling overhead) = cost of context switch.
That'll leave you with 995 ns of HW overhead. You can spend as much as 2598 cycles in the scheduler before SW overhead dominates.
So, measured in actual time the cost of context switches is declining (P2: 3120 ns vs. P4: 995 ns - 3:1). But looking at CPU clock speed differences (P2: 200 MHz vs P4: 2800 MHz - 1:14), one can only conclude that the cost of context switches is rising.
And yes, I used some home-grown software to perform these measurements.
Aage
Software Context Switching
Software context switching can be used on all CPUs, and can be used to save and reload only the state that needs to be changed. The basic idea is to provide a function that saves the current stack pointer (ESP) and reloads a new stack pointer (SS:ESP). When the function is called EIP would be stored on the old stack and a new EIP would be popped off the new stack when the function returns. Of course the operating system usually needs to change much more than just the stack and EIP.
Eflags, the general registers and any data segment registers should also be pushed on the old stack and popped off the new stack. If the paging structures need to be changed, CR3 will also need to be reloaded.
The FPU/MMX and SSE state could be saved and reloaded, but the CPU can also be tricked into generating an exception the first time that an FPU/MMX or SSE instruction is used by copying the hardware context switch mechanism (setting the TS flag in CR0).
When the going gets tough ...
One of the complications is that when the CPU changes to a higher privilege level (CPL) it will load new values for SS and ESP from the current TSS. If the operating system uses multiple privilege levels it will need to create a single TSS and TSS descriptor. During a software context switch the values for SS0:ESP0, SS1:ESP1 and/or SS2:ESP2 will need to be corrected in the TSS.
In addition the CPU will need to be in CPL=0 (with interrupts disabled) in order to switch contexts properly.
The main problems with software context switching is that virtual 8086 tasks are harder to support, and the IO permission map (at the end of a TSS) can't be used to automatically provide IO port protection.
If the functionality of the IO permission map is desired, the single TSS can be dynamically changed during each context switch, or you can set IOPL=0 and emulate IO port protection within the GPF handler. Both of these methods may end up using more CPU time than software context switches save (this depends on a lot of things).
Other Possibilities
During a context switch the operating system can do additional work that isn't strictly part of the context switch. One common thing is calculating the amount of time the last thread/task/process used so that software (and the end user) can determine where all the CPU time is going. Another possibility would be dynamically changing thread/task/process priorites.
Related Links
BonaFide hosted tutorials:
Multitasking Howto by BeyondInfinity,
Software Task Switching from alt.os.development
Scheduling Algorithms
Included from Scheduling Algorithms
Known scheduling algorithms include Round Robin Scheduling, Priority-Based Scheduling, Weighted Scheduling, Rate Monotonic Scheduling, Feedback Scheduling, ...
To make some clearance in the mist, and to help you choose scheduling algorithms, here is a simple classification:
#Real-Time Scheduling Algorithms
- #Rate Monotonic Scheduling (RMS)
- #Earliest Deadline First (EDF)
#General Scheduling Algorithms
- #First Come First Serve (FCFS)
- #Round-Robin (RR)
- #Priority-based Round-Robin (PRR)
-
- #Shortest Process Next (SPN)
- #Shortest Remaining Time (SRT)
- #Highest Response Ratio Next (HRRN)
The trick with any scheduling algorithms is that it must fulfill a number of criteria:
- no task must be starved of resources - all tasks must get their chance at CPU time;
- if using priorities, a low-priority task must not hold up a high-priority task;
- the scheduler must scale well with a growing number of tasks (ideally, being O(1), i.e. taking constant time no matter how many tasks are queued - yes, this has been done, e.g. in the Linux kernel by Ingo Molnar).
An Example (Priority-Based Round Robin)
Imagine you have a Scheduler, which is responsible for the management of 10 Queues of runnable processes. A runnable process is stuck to the tail of the queue corresponding to his priority. Say, 1 is the highest priority and 10 the lowest.
The Scheduler then checks each of the ten ready-queues for runnable processes (if (queue[Priority].head->next!=NULL)), picks the first runnable process (so, the higher the priority the earlier a process gets its turn of cpu time) and hands it to the task switching mechanism.
Now, imagine the following: You have a queue whose property is Round-Robin in the list of queues: The actual running process at the head of this queue has a certain share of time. Each time the timer irq fires, this share is decremented by one. If the share is down to zero, the timer isr invokes the round-robin scheduler, which picks the process from the queues head and stuffs it to the tail of it whilst renewing its share of time. The next process is to get the cpu for its share of time.
you see, the task of scheduling is usually not concentrated in one single function but there are more functions working together.
Let's have a look on the round robin scheduler with three processes in the queue: A B C:
- A(time 0) B(time 10) C(time 10) A's time slice is zero: let's do round robin scheduling:
- B(time 10) C(time 10) A(time 10) ... one clock tick occurs ... the next one ...
- B(time 8) C(time 10) A(time 10) ... several clock ticks occur ... b's time slice is worn out ...
- C(time 10) A(time 10) B(time 10) ... ten clock ticks later ...
- A(time 10) B(time 10) C(time 10) ... now A has it's share of cpu time again.
Real-Time Scheduling Algorithms
Real-Time Scheduling Algorithms are a special class of algorithms of which it is required that they can guarantee a process will be done before its deadline. The only way these algorithms can work is if they at least know when the deadline for a process is, and how much the process takes of the system. Only if the system is not overloaded (subjective term) can the threads be guaranteed to finish before their deadline.
Each task has to be scheduled Xt times a second, or every Yt milliseconds (Yt = 1000 / Xt). Each run of that task takes at most Zt milliseconds. This task then creates a load of Lt = Zt / Yt.
The system as a whole has a load L, which is the sum of all task-loads: L = E Lt. If this systemload exceeds 0.7 (in some rare cases it can be slightly larger, but we don't count them) the system is unschedulable using Rate Monotonic Scheduling. If this systemload exceeds 1.0 it is unschedulable for any real-time system. Note that for normal systems any load is possible, including the ones that are extremely large. They will make the system very unusable though.
See also Real-Time Systems.
There are two well-known Real-time algorithms, they are:
Rate Monotonic Scheduling
Rate Monotonic Scheduling is a way to schedule Real-Time threads in such a way, that can be guaranteed that none of the threads will ever exceed their deadline. The load of the system can vary, but in order to guarantee the deadline the load may not go above 0.7 or 70% CPU time usage.
The RMS scheduling works by assigning each task a priority based on its interval. The task with the shortest interval gets the highest priority and the task with the largest interval gets the lowest priority (still realtime though). The tasks are then run similar to a prioritized preempting #Round-Robin. This means, any task that can run runs, and if a task runs but a task with a higher priority is available, the higher one runs instead.
If your system is based on a #Round-Robin scheduler, this is the easiest way to do Real-Time scheduling. However, because of the 70% limit it's not the best algorithm. See also #Earliest Deadline First.
Earliest Deadline First
Each task in an EDF scheduler is assigned a deadline (e.g. a moment in the future at which the task must be completed). Everytime a task is inserted in the system or completed, the scheduler looks for the task which has the closest deadline and selects it for execution. In order to ensure that the scheduler is still able to meet each deadline, a policer must evaluate if each new task doesn't overload the system and deny execution if it will do so.
In order to implement EDF-based system, one will have to know both the deadline of the task (which could optionnally be computed as "no more than X ms in the future") and the expected time needed to perform the task (required by the policer). QoS network routers usually implement variants of EDF scheduling.
Related Threads
Real-Time Scheduling (the what and why)
Linux 2.6 scheduler, under FDL
What changes if multiple CPUs exists
Included from Multiprocessor Scheduling
Introduction
Scheduling Algorithms can get tricky when more than one CPU is involved, and with Intel's Hyperthreading technology (one complex CPU appearing as mutliple CPUs), MultiProcessors OSdev'ing come to the hobbyist aswell...
All of a sudden, the details of the CPU architecture become even more important than with singleprocessor systems. How is the Memory Management Unit (MMU) designed? Some CPUs are able to hold page tables for more than one process in their Translation Lookaside Buffer (TLB) - it would benefit the system performance if the same process would always run on the same CPU, instead of a different one each time it is scheduled.
Another issue is multithreading. On a single-CPU system, multithreading (i.e., more than one control flow in a single address space) is a fine thing, and if done right can greatly benefit a system's performance. But on a multiprocessor system, there are even greater benefits to win - while at the same time it gets harder to achieve them.
ToDo: I have to remember putting that piece about scheduler activations in here...
Other facts ...
One thing that I have read on the subject is that some threads may be pinned on a specific CPU while other threads will execute on the first available CPU. This can make sense, for instance, to make the GUI server's main thread 'resident' on a CPU to achive high responsiveness (other threads that want to interact with the GUI can do so without incurring a context switch penalty :)''
Pinning a thread to a CPU also helps the CPU's cache. If a thread is jumping between CPUs each time it is run, it never gets chance to fill either CPU's cache with code or data: it might run for a moment, and load some instructions from main memory, then get pre-empted. Next time it is scheduled, it might be running on a different CPU, where it would need to load the cache all over again. If the scheduler can assign each thread its own favourite CPU, it can help increase cache performance. Of course, if two threads with the same favourite CPU want to run at the same time on a dual processor system, one of them will have to take the 'wrong' CPU.
System V (iirc) had a nice synchronisation tool (a sort of smart spinlock) for multiprocessors that allowed a thread to busy-wait for a resource as long as the resource holder was active on another CPU and to enter the SLEEP state when the holder wasn't running. This has been proved in papers (which i lost the references) that this method pays and leads to better performances than always entering the SLEEP state when a resource was busy.
Related Threads
Sleeping and Waiting Processes
Included from Sleeping and Waiting Processes
A process is described as waiting (also referred to as blocking) when it is set inactive until a specific event occurs, such as a semaphore being released or a message arriving in its message queue. In multitasking systems, such processes are expected to notify the scheduler with a system call that it is to wait, so that they can be removed from the active scheduling queue until the event occurs. A process that continues to run while waiting (i.e., continuously polling for the event in a tight loop) is said to be busy-waiting, which is undesireable as it wastes clock cycles which could be used for other processes.
A special case of waiting is sleeping, which is when a process is set inactive for a given period of time (i.e., 20 milliseconds). This is usually handled separately for ordinary waiting, primarily for efficiency reasons; since the clock interrupt is frequent, and keeping track of individual ticks is infeasible, the usual approach to use a system of relative counters to mark when a given process will be awakened.
A commonly used data structure for tracking sleeping processes is the delta queue, and ordered list of sleeping processes in which each process has a counter which is offset relative to last process in the list before it; for example, if process A is sleeping for 3 ticks, process B for 5 ticks, and process C for 5 ticks, and process D for 8 ticks, then the list would be
{A, 3} -> {B, 2} -> {C, 0} -> {D, 3}
At each clock tick, the counter for the topmost process is decremented, like so:
{A, 2} -> {B, 2} -> {C, 0} -> {D, 3}
If at this point process E is added which is to wait 6 ticks, then it would be inserted before D, and both E and D would be updated appropriately:
{A, 2} -> {B, 2} -> {C, 0} -> {E, 2} -> {D, 1}
if the topmost process reaches zero, then it (and any processes immediately following it that are also zero) is placed back into the active scheduling queue. Thus, after 2 more ticks, A is set as active; 2 ticks after that, both B and C are set as active; and so on. This approach has the advantage of minimizing the number of changes required in a given clock tick.
It is generally good idea to make a difference between interruptible wait/sleep, and uninterruptible wait/sleep. Depending on your system, you will probably want most wait states interruptible, while you almost definitely will need some uninterruptible states. You might want to consider building interruptable wait on top of uninterruptible wait, instead of using special flags.
For example, in my kernel, all waiting/sleeping is done on semaphores. Semaphore wait is uninterruptible. But since most things are done asynchronously using queued messages (sent to threads) and event-driven programming, most of the time a thread is waiting on it's own message queue semaphore. Sending any message (like timer expiry, or even dummy interruption) to that message queue, will cause the thread to be released from it's uninterruptible semaphore wait. Added bonus of this kind of system is the ability to wait for several things at once. -- MystranTheGreen who learned this the hard way.
Process Sychronization / Inter-Process Communication
Semaphores
Included from Semaphores
All the techniques presented here are basic building blocks to address the problem of process synchronization. E.g. given programs that are running independently from each other on the same machine, how can one ensure some properties about what combination of operations are allowed and what combinations are not.
Among other examples of real-world problems, we're looking for technique that can grant:
- Mutual exclusion of processes: a portion of code cannot be executing by two process simultaneously.
- Rendez-vous: one process must not perform one operation (e.g. generating a summary) before other processes have completed their operations.
Shared readers/Single writer approach of resource locking: many process may be reading a table at the same time, but only one can write at a time and it should prevent readers to access the table until the table has returned to a consistent state.
Note: A good synchronizator should not only guarantee correctness, but also fairness (all process have equal chance to get the access) and non-starvation (any waiting process will eventually have the resource).
Semaphores
Semaphores are one of the oldest and most widely used methods of ensuring Mutual Exclusion between two or more processes. A semaphore is a special integer variable which is (usually) initialized to 1, and can only altered by a pair of functions. Each of these functions, historically called p and v (from the Dutch words proberen, to test, and verhogen, to increment), must be an Atomic Action. Each semaphore has an associated queue for processes waiting on the resource it guards.
The function p, also called wait() (or test()), decrements the value of the semaphore, and if the semaphore is negative, puts the process on the waiting queue until the semaphore is released by the process holding it.
The function v, also called signal() (or release()), increments the semaphore and, if it is still negative, indicates to the scheduler to wake the next waiting process in the queue.
Note that a general semaphore can do much more than just guaranteeing mutual exclusion. Some FIFO queue (single reader and single writer) can for instance be implemented by using one semaphore counting "how many messages are available" and another one counting "how many free slots are available"
Message queue[N];
Semaphore slots=new Semaphore(N);
Semaphore messages=new Semaphore(0);
int last_read=0, last_written=0;
Message get() {
Message m;
messages.wait();
m=queue[last_read]; last_read=(last_read+1)%N;
slots.signal();
return m;
}
void put(Message m) {
slots.wait();
queue[last_written]=m; last_written=(last_written+1)%N;
messages.signal();
}
Mutexes
A variant on this, called a binary semaphore uses a boolean value instead of an integer. In that case, p tests the value of the semaphore, and if it is true, sets it to false, and if false, waits. The binary v function checks the waiting queue, and if it is empty, set the semaphore to true; otherwise, it indicates to the scheduler to wake the next queued process.
In either form, it is important that a process release a semaphore once it has finished using the resource it guards, otherwise the resource could be left inaccessible.
Note that while "semaphore" is a globally-unique semantic items, "mutex" is a fuzzy name and system designers tends to have "their own mutex" which may look more like a spinlock or like a binary semaphore or like a general semaphore...
Spinlocks
Spinlocks try to address the same problem of Mutual Exclusion, but without relying on a scheduler infrastructure to make the process sleep if the resource is busy. Instead, a spinlock will keep checking the value until it has changed and usually relies on some atomic test_and_set instruction on the CPU to perform its task (See
Intel Manuals to see how xchg can be used to mimmic test_and_set virtual operation).
While poorly used spinlocks will lead to severe performance penalty in single-cpu systems, wise use on multi-cpu may achieve higher throughput.
- If you need more information on spinlocks, you're suggested to walk through these documents
Note for IA-32 programmers: If you consider to use spinlocks, be aware that the P4 / Xeon CPUs will falsely detect a possible memory order violation as the spinloop finishes, resulting in an large additional performance penalty. Place a PAUSE instruction into the spinlock to avoid this undesirable behavior. Refer to the
Intel Manuals for more information.
Relate Threads
Userland only Semaphores,
Spinlocks that disable interrupts,
SMP compatibility
Signals
Included from Signals
Signals are a Unix invention for asynchronous signaling, and were integrated into the C standard (<signal.h>). When a process receives a signal (send by hardware, or another process using raise()), a signal handler is called. A signal handler is a C function that handles the signal; which function to call on which signal is defined by passing its function pointer to the signal() function. (If no signal handler is defined for a given signal, a raise() of that signal aborts the program.)
Signal handling is tricky, since it breaks the single-control-flow structure of a C program. Make sure you read the manuals...
Message Passing
Included from Message Passing
If you consider writing a Microkernel, you should layout how you'll manage the message passing. This page collects considerations on how you can do it.
Synchronous vs Asynchronous
- synchronous
- Process A delivers the message directly to process B. If process B is not ready to receive at the moment of sending, Process A is suspended and queued upon Process B's 'sending' queue, until Process B is receiving. Using this method, you don't need to care for message queues - but you enqueue processes.
- asynchronous
- Process A sends the message to Process B. The message is copied to a dedicated region in kernel space and attached to Process B's message queue (lets call it the Post box). The next time, Process B looks into its postbox, it will retrieve the message. You can have processes block for receiving a message or just stroll by and check whether there is something to fetch. One could call this method "store and forward message passing". The main benefit of asynchronous messaging is that it allows a single thread/process to wait for the results of many unrelated "requests" at the same time.
We may note that in the synchronous approach, A has the guarantee that B received the message when the deliver call terminates, which makes it especially interesting for local RPC. It should be noted that it can be very difficult to implement standard C library functions without this guarantee.
It isn't a bad idea to wrap some brains around the following problem: What, if Process A expects a message from Process C. It needs a possibility to communicate this to the message routing subsystem. If Process B sends a message to Process A, Process A shall not be woken up for it wants something from Process C. Process C sends a message to Process A. Then and only then shall the message routing subsystem wake up Process A.
What primitives will you need to get synchronous message passing running? Two: request(message) and respond(message).
What primitives will you need to get asynchronous message passing running? Two: send(message) and receive(message).
What primitives will you need to get both asynchronous and synchronous message passing running? Three: send(message), receive(message) and request(message), as the send(message) primitive is/can be the same as the respond(message) primitive.
The "Port" abstraction
Many Microkernel use the notion of port, which is a 'point for receiving messages', or a 'one-way message communication channel'. The idea of ports is that you no longer send a message to a thread but to a port (which is linked to a unique thread).
Ports can be allocated by servers at will and may restrict who can send and who cannot send on it. Using ports, the 'A listen to C, B sends to A' problem above can be solved by creating a new port p that A will open to C only (and sending a message to C to tell p's port number). A can now wait for C's message by receiving on p only. B's attempt to send a message on port q of A will not interfere.
To solve situations when a server is willing to receive messages from different ports (and synchronous messaging is used), we need a select-like interface that will allow reception of a message from any port p in a given set of ports.
Variable or Fixed Messages Size
A message typically includes the ID of the emitting thread (or a replying port), a message code and a few arguments. Message queueing and dispatching code can be greatly simplified if message size is fixed.
When messages longer than the fixed-size are required, one will need to provide a way to describe the long message in the small structure. If copying a 4-word message forth and back doesn't imply excessive processing cost, it will be very different for the 1MB of data you send to a disk server. In that case, it is suggested to toy with paging in order to map the real data from the emitter to the receiver's address space.
Contributors: BeyondInfinity, PypeClicker
Related Threads:
Shared Memory
Included from Shared Memory
The concept of memory protection has become a well-accepted one in OS design. A special CPU component, the MMU (Memory Management Unit), provides each user-space application with a "virtual" address space. Each application "thinks" it has the whole address space for itself.
In reality, the MMU maps each "virtual" address to a "physical" one, usually in units of "pages" (4k or 2 or 4MB on the IA32 architecture). Page 0 of one application is mapped to physical 0x0007 0000, while page 0 of another application is mapped to physical 0x0008 0000. You get the idea.
This means that applications cannot communicate with (pass data to) each other directly, because they are "virtually alone" in their address space: They have to ask the OS kernel to pass a message (data). (Only the kernel and services in the kernel address space can see the real address spaces)
Now, Message Passing can be costly, in terms of clock cycles. If two processes intend to closely cooperate, the penalty of passing messages back and forth can become significant. An alternative would be to share memory.
To do this, the kernel is asked to map a certain address range of process A to the same physical address as a similar sized address range of process B. This address range is thus shared between process A and process B.
There are many things to consider here: Authentication, for example. Or that, if the shared physical memory is not mapped to identical virtual address ranges in process A and process B, any pointers in the shared memory structures must be offsets instead of absolute pointers, or they will no longer point to the same thing in the other process' address space.
Remote Procedure Calls
Included from Remote Procedure Calls
What is RPC?
Remote Procedure Calls, or RPC's for short, are a means to have certain services executed by some code that lies in an different process (maybe even on a machine located somewhere in the network), instead of being available locally. This is a common practice, employed e.g. by application servers or web applications.
Popular protocols for RPC's are COM+, Corba, or SOAP.
RPC in a nutshell
A RPC system relies on a compiler that will generate helper code from a high-level interface description (usually IDL, "Interface Description Language"). The 'client' code that wish to call a service will actually call a proxy function locally, passing it the arguments for the remote code. The proxy then marshalls (or serializes) the arguments according so that they fit in a flat buffer and send that buffer to the server code (for instance involving a UDP packet or some local message queue) and waits for response.
On its side, the server suddenly receive the marshalled packet and will pass it to a stub that will re-create the data (unmarshall) and call the real service procedure, getting back the results and marshall them back in a message to the client.
( client ) ( server )
| ^
| sayHello("World") | serverSayHello("World")
V |
+------\ - - - <channel>- - - - \-------+
| PROXY > -> iXX mYY Str'W o r l d \0' -> > STUB |
+------/ - - - - - |- - - - /-------+
|
marshalled message
^ ^
/_\ /_\
: :
: :
: |------------|\ :
: | .idl file L_\ :
+ - compiler - | (describing | - compiler +
| interface) |
+--------------+
Is RPC useful for me ?
Only you know for sure...
Links
RPC Chapter in "Computer Networks, A System Approach"
