Tell me about Filesystems

Filesystems are the machine's way of ordering your data on readable and/or writable media. They provide a logical way to access the stuff that you have down on disk so that you can read or modify extit. Which file system you use depends upon what you want to do with it. For example, Windows uses the Fat32 or NTFS filesystem. If your disk is really huge, then there's no point using Fat32 because the FAT system was designed in the days when nobody had disks as big as we do now. At the same time, there's no point using a NTFS filesystem on a tiny disk, because it was designed to work with large volumes of data - the overhead would be pointless for, say, reading a 1.44m floppy disk.

There are many different kinds of filesystems around, from the well-known to the more obscure ones. The most unfortunate thing about filesystems is that every hobbyist OS programmer thinks that the filesystem they design is the ultimate technology, when in reality it's usually just a bad copy of DOS FAT with a change here and there. The world doesn't need another crap filesystem. Investigate all the possibilites before you decide you roll your own.

FAT And its Variants

File Allocation Table (FAT) was introduced with DOS v1.0 (and possibly CP/M), supposedly written by Bill Gates. FAT is a very simple filesystem which is nothing more than a singular linked list of clusters. FAT filesystems use very little memory and is one of, if not the, most basic filesystem in existance today.

There are two versions of this simplified FAT, FAT12 and FAT16. FAT12 was designed for floppy disks and can manage a maximum size of 16mb using 12bit cluster numbers. FAT16 was designed for early hard disks and could handle a maximum size of 64kb * cluster_size. The larger the hard disk, the larger the cluster size would be, which lead to large amounts of "slack space" on the disk.

FAT12+FAT16 filesystems have fixed size for filenames of "8.3" and limited support for file attributes. You could also read the FAT12 document. You could also check out the FAT tutorial reported by Kemp.

VFAT

VFAT is an extension of FAT16 and FAT12 that has the ability to use long filenames (up to 255 characters i think). First introduced by Windows95, it uses a "cludge" whereby long filenames are marked with an "volume label"; attribute and filenames are subsequently stored in the 8.3 format in sequential directory entries. (This is a bit of an oversimplification, but close enough).

FAT32

FAT32 was introduced to us by Windows95-B and Windows98. FAT32 solved some of FAT's problems. No more 64kb max clusters! FAT32, as its name suggests, can handle a maximum of 4gig clusters per partition. This enables very large hard disks to still maintain very small cluster sizes and thus reduce slack space between files.

Is FAT32 really able to handle 4G clusters ? last time i looked to a FAT table, it looked to have last 4 bits cleared all the time, including the 'end of chain' tag, giving me the feeling that it was somehow more a "FAT28" than "FAT32" -- PypeClicker

Correct - FAT32 is actually only FAT28. Top 4 bits are currently "Undefined" (according to Microsoft's specification). Note that this does not mean 'top four bits should be "0"' - they should be honored. This means that if you need to alter a FAT entry, and the top 4 bits contain 1001, you should write it back to disk containing 1001 and not 0000. -- djhayman

Inode-based File Systems

Inodes (information nodes) are a crucial design element in most Unix filesystems: Each file is made of data blocks (the sectors that contains your raw data bits), index blocks (containing pointers to data blocks so that you know which sector is the nth in the sequence), and one inode block.

The inode is the root of the index blocks, and can also be the sole index block if the file is small enough. Moreover, as unix filesystems support hard links (the same file may appear several times in the directory tree), inodes are a natural place to store metadata such as file size, owner, creation/access/modification times, locks, etc.

HPFS (High Performace Filesystem)

The HPFS was designed by IBM/Microsoft for IBMs new windowing system, OS/2. It was designed to be fast, remove all the shortcomings of FAT, support long filenames, small cluster sizes, remove degfragmentation as much as possible and support more attributes.

HPFS is the precursor to NTFS and is, in a nutshell, NTFS minus the security features embeded into NTFS. Instead of storing cluster chains in a single linked list format, HPFS stores its information in sorted B-Tree's. This makes searching for files very fast.

Instead of keeping the directory tables and other descriptors at the start of the disk, HPFS bands them at regular intervals throughout the disk and in the middle of the disk, with the theory being that the heads only have to move half as much in any direction.

More information about HPFS can be found at: http://www.wotsit.org/download.asp?f=hpfs

NTFS (New Technology Filesystem)

NTFS is the native filesystem of WindowsNT. It is much like HPFS, but supports security features in the filesystem such as access control. Since WindowsNT is entirly unicode, NTFS is a unicode filesystem, each "character" being 16bits wide. NTFS adds quite a bit more to HPFS than just security features, though. First, it adds quite a bit of builtin redundancy -- with HPFS, wiping out one sector in the wrong place can render an entire volume inaccessible. Second, it adds support for multiple hard-links to a file (up 'til now, the only easy access has been via the POSIX subsystem, but NT 5/Win2K adds this to Win32 as well). Third, it supports an arbitrary number of file forks a la MacOS (except MacOS always has exactly 2 forks per file). Fourth, HPFS decrees that a cluster is always 512 bytes, and a cluster is always one sector. For the sake of performance and compatibility with some (especially Japanese) machines, NTFS allows sectors of other sizes. It also supports clusters of more than one sector, which tends to help performance a little.

NTFS is probably one of the most difficult file system to deal with, especially because of the lack of hacking experience and reliable documents about it. A read-only stable driver is in Linux source code base since kernel 2.4, while an experimental read-write driver is coming with linux 2.6.

The best information found about it so far is Andrew Tanenbaum's article. The Linux NTFS project also has some information about it, at http://linux-ntfs.sourceforge.net/ntfs/index.html. (Use the next / previous links at the top of the pages, or use the glossary.) You are welcome to add more.

ext2fs (Second Extended Filesystem)

The Second Extended Filesystem (ext2fs) was the default filesystem of Linux prior the advent of the journaling file systems ext3fs and ReiserFS. It has native support for UNIX ownership / access rights, symbolic and hard links and other Unix-native properties. Like HPFS, it tries to minimize head movement by distributing data across the disk. Also, by using "groups", it minimizes the impact of fragmentation. It is another "inode" based system. An ext2fs-partition is made up from blocks, which normally are 1K each. The first block (the bootblock) is zeroized, all the other blocks are divided into so-called block groups (normally, between 256 and 8192 blocks form a group). Each block group contains:

The first inode is a special one; it is the bad blocks inode, which references all the damaged sectors of the partition. The fifth inode contains the bootloader, whereas the 11th contains the root directory.

Windows users can access ext2fs partitions with explore2fs.

Additional information about ext2fs:

ext3fs (Third Extended File System)

ext3fs is basically ext2fs with journaling added. If your ext3fs partition does not need journal replay, it can even be accessed with a 'simple' ext2fs driver.

ReiserFS

ReiserFS - homepage at http://www.namesys.com - is a file system that is free if your OS is free, and comes at a licensing cost if your OS is not free. It has excellent performance on large directories and small files, using "dancing trees" instead of B-trees, and does meta-data journaling to improve file-system stability across system crashes.

Unlike "classic" filesystems, Reiser allows you to have files that occupy less than one sector on the disk (i.e. it can store several tiny files or tails of files on the same sector) through its tree organization.

As of version 3.6, ReiserFS supports:

Network-based File Systems

All these file systems are a way to create a large, distributed storage system from a collection of "back end" systems. That means you cannot (for instance) format a disk in 'NFS' but you instead mount a 'virtual' NFS partition that will reflect what's on another machine. Note that a new generation of File Systems is under heavy research, basing on latest P2P, cryptography and error correction techniques (such as the Ocean Store Project or Archival Intermemory.

NFS

NFS was invented by Sun Microsystems. It became widespread largely because it'a quite easy to implement. In return for its simplicity, it tends to give relatively poor performance and a nearly complete lack of safety. These are both largely due to its connectionless nature. When you request data from a file, the server sends you the requested information, but does NOT keep track of which clients have which files open. To keep you from seeing (terribly) out-of-date information from a file, the data you read has an "expiration date". If you refer to the data from more than, say, a minute, it will expire and your client will request the data from the server again, whether it's changed or not. If you write data to the file, you have no way of knowing whether somebody else has updated the information between your reading and writing your data, so you may overwrite things they've written with older data. To ensure at least a little bit of safety, the server is supposed to actually commit data you write to disk before it returns to you.

In other words, NFS works pretty well for read-only access to things like executables on a server. For things like on-line databases, it's essentially a disaster waiting to happen (and it usually doesn't wait very long).

More recent versions of the NFS spec have cured most of these problems, but support for these updates is still (years later) somewhat uneven.

AFS

AFS is the Andrew File System aka Advanced File System, similar to NFS to about the same degree that a tricycle is similar to a fighter jet -- they're both typically one-person vehicles. AFS is a drastically more robust design than NFS, and is intended for MUCH larger networks. OTOH, it's also much more difficult to implement completely -- to the point that it's not likely to be of much interest to most hobbyists and such writing a new OS.

RFS

RFS (Remote File System) was introduced in UNIX System V to compete with NFS and such. Unlike NFS, RFS is a connection-oriented system, so if, for example, two different machines access a file on a server, they get about the same semantics as if two processes on a single machine accessed the file. Note that NFS and RFS are both built on top of some sort of local file system, which determines things like inodes and such.

Unclassifiable and Other Stuff

BeFS

BeFS is the new filesystem for the Be Operating system. It is very much like the MacOS Filesystem, supporting multiple forks in a 64bit filesystem. One very useful feature it shares with the AmigaOS FFS is the ability for an application to set a "notify callback", i.e. being notified when a file or directory changes.

Get way more information about BeFS in Practical File System Design

FFS (Amiga)

The Amiga Fast File System, to put it bluntly, is not - or rather, it's fast only when compared to the OFS, the Original File System of AmigaOS 1.x.

There are many bright design ideas making the AmigaOS a very special thing, but the file system was not exactly part of it. It is prone to invalidation, holds redundant data, and its directory structure is comparatively slow to traverse. It also lacks any concept of multi-user environments.

Perhaps the only good thing with the Amiga FFS was the concept of the Rigid Disk Block (RDB) - a special area at the beginning of a disk, holding not only the partitioning information. It was also possible to store a file system there - a module that would tell a different AmigaOS machine how to read a partition if it was not formatted in FFS format but something else.

For those interested in its internals should try to find a copy of "The Amiga Guru Book" by Ralph Babel, which holds a complete reference of its rather complex block structure. (It also has a complete reference of the DOS library, as well as interesting information on various internals of the Amiga architecture. It is long out of print, but perhaps you can still find copies on eBay.) The old FAQ also held some info in the internals, which are preserved in the AmigaFFS Document.

FFS / UFS (BSD)

Not to be confused with the Amiga FFS, the BSD FFS / UFS is commonly used on hard disks for the *BSD and derivatives. What is usually called a "partition" is called a "slice" in *BSD, which is in turn subdivided into "partitions" - a naming pattern that leads to some confusion, and to rather cryptic device names (ad0s1c for the third partition on the second slice on the primary master ATAPI hard drive...).

XFS

XFS is Silicon Graphics "Next Generation Journalled 64-Bit Filesystem With Guaranteed Rate I/O" designed for IRIX based systems. XFS uses the standard inodes, bitmaps and blocks, and is compatable with EFS and NFS filesystems.

According to the XFS white paper it has;

BFS

BFS (UnixWare Boot File System) is a SCO specification for a KISS filesystem used at bootstrap. It only offers one directory and, due to the way information about blocks are stored, only one file opened for writing at a time.

http://www.penguin.cz/~mhi/fs/bfs/bfs-structure.html

From what i see, it also means BFS will have to do nasty things if a file must be extended after some other file has been created -- PypeClicker

Agreed, but it's not a general-purpose filesystem. One tends not to extend things like the kernel image or modules. -- Strib