OSDev.org
https://forum.osdev.org/

Design problem/question on FS design
https://forum.osdev.org/viewtopic.php?f=1&t=12623
Page 1 of 1

Author:  Candy [ Fri Jan 05, 2007 1:29 pm ]
Post subject:  Design problem/question on FS design

I'm working on the *FS mkfs tool. It works, and I've now come to the part where you take minute decisions that have a huge impact later on.

I'm implementing support for copy-on-write files, where a file inode is shared by multiple links; a link can only write to the file by copying it before the actual write. This works fine. What I'm considering is whether to allow this for directory inodes as well: if you do, writing a file involves checking all of its ancestors and possibly copying a number of them. Not allowing it would weaken the feature, since you'd still end up with loads of copies of the directory tree.
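
In rough pseudo-C, the file write path looks something like this (all names here are illustrative, not the actual *FS code; clone_inode() and do_write() stand in for the real extent copying and block I/O):

Code:
#include <stddef.h>
#include <sys/types.h>

struct inode { int refcount; /* ... data extents ... */ };
struct link  { struct inode *inode; };

struct inode *clone_inode(struct inode *ino);  /* copies inode + data */
int do_write(struct inode *ino, off_t off, const void *buf, size_t len);

int cow_write(struct link *lnk, off_t off, const void *buf, size_t len)
{
    struct inode *ino = lnk->inode;

    if (ino->refcount > 1) {
        /* Shared inode: copy it (and its data) before the write,
         * then point this link at the private copy. */
        struct inode *copy = clone_inode(ino);
        if (copy == NULL)
            return -1;
        ino->refcount--;
        copy->refcount = 1;
        lnk->inode = copy;
        ino = copy;
    }
    /* The inode is now private to this link; write in place. */
    return do_write(ino, off, buf, len);
}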

What do you think?

Author:  Brynet-Inc [ Fri Jan 05, 2007 1:54 pm ]
Post subject: 

Sounds cool Candy.. :)

*FS? I'm guessing by the asterisk that you're designing your own file system and haven't picked a name?

How about CFS? Candy File System.. hehehehe :P

An example of COW here is neat.. Surprised I've never seen it before :?

To me it just seems like a bit of a space waster.. As it looks like it's just making duplicates of the file.. How's it different from making a copy of the file normally using cp? (Or is the duplicate temporary.. in memory?)

Author:  Candy [ Fri Jan 05, 2007 2:07 pm ]
Post subject: 

Brynet-Inc wrote:
*FS? I'm guessing by the asterisk that you're designing your own file system and haven't picked a name?

The name is StarFS, short name is *FS.

Quote:
An example of COW here is neat.. Surprised I've never seen it before :?

To me it just seems like a bit of a space waster.. As it looks like it's just making duplicates of the file.. How's it different from making a copy of the file normally using cp? (Or is the duplicate temporary.. in memory?)


Suppose you have a source tree. You make a daily backup by copying it to a subdirectory with the date in its name. Every copy takes space for the entire tree of files. The idea is: since the files are the same, and you're most likely not going to write to any of them (the most likely reason for copying is backups or something similar), why copy at all?

So, instead, you link to the inode like a Unix hardlink. Except that when you write, the file is copied first and then written, so it behaves like just another independent file.

It allows for a number of things that previously were plain stupid:

- You could make your file system function like GMail: instead of putting a file in one directory, you put it in every directory that applies. As long as you don't change it, it'll be the same in each directory.

- You can add intelligent diff management to the filesystem, so instead of actually COWing the file it creates a new inode with a backlink and a diff method, and only stores the diff. That way you never delete anything you've backed up, and you don't waste space duplicating half the file either. Not too sure on this one though; it's a point we're still thinking about.
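
Roughly, such a diff inode could look like this (field names are purely illustrative; this is the idea, not a final layout):

Code:
/* Sketch of a diff inode: instead of fully COWing the file, store a
 * backlink to the base inode plus only the changed ranges.
 * Reading = read the base file, then overlay the diffs. */
#include <stdint.h>

struct diff_extent {
    uint64_t offset;      /* where in the file the change applies     */
    uint64_t length;      /* length of the changed range              */
    uint64_t disk_start;  /* on-disk location of the replacement data */
};

struct diff_inode {
    uint64_t base_inode;   /* backlink to the unmodified original */
    uint32_t diff_method;  /* how the diff is encoded/applied     */
    uint32_t diff_count;   /* number of entries in diffs[]        */
    struct diff_extent diffs[];
};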

Author:  smbogan [ Fri Jan 05, 2007 4:32 pm ]
Post subject: 

Yes, but I would include the diff system with the filesystem. I don't see a reason why you wouldn't want it...except maybe speed...

Author:  Candy [ Fri Jan 05, 2007 4:59 pm ]
Post subject: 

I've got a preliminary version of mkfs in my Subversion repository; it follows an XML file to create a filesystem with a preloaded set of information. An example script is also included (one that makes a disk image of my OS, of course), including all the code and scripts to do so (though actually executing it takes about an hour on my machine, mostly due to compiling two custom cross-compilers).

I'll try to upload an image in case anybody cares to take a look at the binary structures it creates. It merges identical files, but doesn't try to do that for directories (yet). It's also pretty limited in a few other aspects, but I think this'll work as a basic version.

I might rip the hashes out of the inodes; they're large. The inodes still need information for allowing diffs instead of just plain files, and I'm also not quite sure about the disk-spanning methods to be used, so they might grow a bit and lose the hash, staying pretty much the same size. The file / directory / section structs aren't final, or even close to it.

There's no way to find out where the inode file is right now; you'll have to wait until I add that info to the boot section somehow. The system information is in the first four inodes, in the order [ inode, boot, free, section ]. The inode file contains the inodes, the boot file contains the boot block (32k at the start), the free file contains all the free extents (and is at the moment the only indirect file), and the section file contains all the info on the sections.
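
To make that layout concrete, roughly (these are not the final on-disk structures):

Code:
/* The first four inodes are reserved for system files (sketch only). */
enum reserved_inodes {
    INO_INODE   = 0,  /* the inode file: contains all inodes            */
    INO_BOOT    = 1,  /* the boot file: the 32k boot block at the start */
    INO_FREE    = 2,  /* the free file: all free extents (indirect)     */
    INO_SECTION = 3,  /* the section file: info on all sections         */
};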

Author:  Combuster [ Fri Jan 05, 2007 5:15 pm ]
Post subject: 

Sounds like: "Why would we put a jet engine into a motor cycle?"
Answer: "Because we can" 8)

Apart from that, I like the idea. And since you are COWing files, doing it for directories is just the logical next step.

Author:  m [ Fri Jan 05, 2007 11:01 pm ]
Post subject:  Re: Design problem/question on FS design

Candy wrote:
I'm working on the *FS mkfs tool. It works, and I've now come to the part where you take minute decisions that have a huge impact later on.

I'm implementing support for copy-on-write files, where a file inode is shared by multiple links; a link can only write to the file by copying it before the actual write. This works fine. What I'm considering is whether to allow this for directory inodes as well: if you do, writing a file involves checking all of its ancestors and possibly copying a number of them. Not allowing it would weaken the feature, since you'd still end up with loads of copies of the directory tree.

What do you think?


So when files are shared, only links (references) are added to indicate the shared files, and only when an actual write occurs are they copied?

The key point is that the links (pointers or references) are separated from the contents (data); may I understand it like that?

Will a shared file be completely copied when it's written through one of its links (sometimes that's unnecessary, or may slow the process down)? Or are you going to generate additional, minimal descriptions that specify just what is modified (which copy, and the location within it), so that when the file is next read, the FS combines the original with the modification description for the requested copy and returns the final output?

Generally I think that if the backups are for updates over different periods, this will work fine.

Anyway, I think that's good. :)

Combuster wrote:
Sounds like: "Why would we put a jet engine into a motor cycle?"
Answer: "Because we can"

Apart from that, I like the idea. And since you are COWing files, doing it for directories is just the logical next step.


Yeah...
Copy-on-write may help more in a distributed system; however, similar techniques can also be valuable in desktop areas.

Author:  Candy [ Mon Jan 08, 2007 2:07 pm ]
Post subject: 

I've been tweaking a few bugs out of the mkfs tool and it now properly handles huge images and lots of files. Here's a current example showing that COW actually does save space (this is without directory COW, but with file COW):

Quote:
candy@blackbox:~$ ls -l /data/disk.img
-rw-r--r-- 1 candy users 16492674416640 2007-01-08 21:10 /data/disk.img
candy@blackbox:~$ du -sh /data/disk.img
4.3G /data/disk.img
candy@blackbox:~$ du -sh .
12G .
candy@blackbox:~$


The example target output is 15 TB (terabytes); since it's a sparse file, it fits. It takes 4.3 GB of space in total, including management information for the files. The biggest directory in it takes 12 GB by itself on the host file system. That's about 60-65% space saved without any actual compression. The downside is that writing a big shared file is slower, but I'm gonna work on a diff thing that allows that without duplicating the files.
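
For the curious, that's the standard sparse-file trick; in C you'd create such an image with ftruncate (hypothetical example, not the mkfs code):

Code:
/* Create a sparse image: the length is set without allocating blocks;
 * disk space is only used for the parts that actually get written.
 * E.g. make_sparse_image("/data/disk.img", 15ULL << 40) gives exactly
 * the 16492674416640 bytes shown above. */
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

int make_sparse_image(const char *path, off_t size)
{
    int fd = open(path, O_WRONLY | O_CREAT, 0644);
    if (fd < 0)
        return -1;
    if (ftruncate(fd, size) < 0) {  /* sets length, allocates nothing */
        close(fd);
        return -1;
    }
    return close(fd);
}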

Author:  rexlunae [ Mon Jan 08, 2007 5:03 pm ]
Post subject:  Re: Design problem/question on FS design

Candy wrote:
I'm working on the *FS mkfs tool. It works, and I've now come to the part where you take minute decisions that have a huge impact later on.

I'm implementing support for copy-on-write files, where a file inode is shared by multiple links; a link can only write to the file by copying it before the actual write. This works fine. What I'm considering is whether to allow this for directory inodes as well: if you do, writing a file involves checking all of its ancestors and possibly copying a number of them. Not allowing it would weaken the feature, since you'd still end up with loads of copies of the directory tree.

What do you think?


This is the coolest idea I've heard here in a long time. It seems like it could lead to pretty severe fragmentation, though. Have you put any thought into avoiding that?

Author:  Jules [ Tue Jan 09, 2007 2:53 am ]
Post subject: 

My immediate thought for an application was setting up filesystems for chroot (or similar) virtual machines -- you'd no longer need a copy of the OS and standard applications for each VM; it could all be done with COW links.

Nice one. :)

Author:  Candy [ Thu Jan 11, 2007 11:50 am ]
Post subject:  Re: Design problem/question on FS design

rexlunae wrote:
This is the coolest idea I've heard here in a long time. It seems like it could lead to pretty severe fragmentation, though. Have you put any thought into avoiding that?


I'm not sure how you figure it'd cause fragmentation.

I do intend to avoid fragmentation, though; here are a few tidbits on how:

- When you write to a small file or directory, instead of always overwriting and reusing the existing sectors, allocate a new extent using best-fit allocation (which can/should include the current block) and write there instead (see the sketch after this list). That should lead to less fragmentation from small increments in file size.
- Attempt to keep extents at least a certain number of clusters in length. The longer the extents are, the less fragmented the files will be.
- When handling a large file, attempt to keep it in extents of a logical length, which depends on the file type itself. For example, for MP3 files keep 1M segments; for MPEG2 video keep segments of about a group of pictures, etc. That keeps the performance more deterministic in most cases, leading to more reliable behaviour. I'm not quite sure how to do this yet, but the FS has support at the inode level for storing the item's file type, so the info is/can be available...
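
As a sketch of the best-fit idea from the first point (illustrative code, not the actual allocator):

Code:
/* Best-fit extent allocation: pick the smallest free extent that still
 * fits the request, so large free extents are preserved for large
 * files and fragmentation of free space is kept down. */
#include <stddef.h>
#include <stdint.h>

struct extent { uint64_t start, length; };

struct extent *best_fit(struct extent *free_list, size_t n, uint64_t want)
{
    struct extent *best = NULL;
    for (size_t i = 0; i < n; i++) {
        if (free_list[i].length >= want &&
            (best == NULL || free_list[i].length < best->length))
            best = &free_list[i];
    }
    return best;  /* NULL if nothing fits */
}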

I'm still considering whether I'll keep very small / active files in the log only. I think I might, but I'm not quite sure. That would lead to less fragmentation, as the log is by its nature rewritten often.

Do you have any ideas to avoid fragmentation?

Author:  Combuster [ Thu Jan 11, 2007 1:56 pm ]
Post subject: 

For the average case, using this to prevent fragmentation should be pretty effective without hurting stability. If you want to take it to the extreme, you should read up on the likes of XFS...

Author:  Candy [ Thu Jan 11, 2007 3:46 pm ]
Post subject: 

Combuster wrote:
For the average case, using this to prevent fragmentation should be pretty effective without hurting stability. If you want to take it to the extreme, you should read up on the likes of XFS...


Actually, it's better for stability. You only need an atomic update for the inode, which is fairly trivial to do. This way you can un-journal the other files without a problem: you just write the new data to a number of sectors that were/are unused, overwriting them without any care, and when that's done you overwrite the inode. Sector writes are atomic, and inodes are <= one sector (at the moment 1/8th of a sector), so you will never get a half-completed operation. No journal is even needed for this.
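
In pseudo-C, the update order is something like this (helper names are illustrative, not the actual *FS code):

Code:
/* Crash-safe update without a journal (sketch).  Data goes to unused
 * sectors first; the single atomic sector write of the inode is what
 * commits the change.  A crash at any earlier point leaves the old
 * file fully intact. */
#include <stddef.h>

struct extent { unsigned long long start, length; };
struct inode  { struct extent data; /* ... */ };

int alloc_unused_extent(size_t len, struct extent *out);
int write_extent(const struct extent *e, const void *buf, size_t len);
int write_inode(const struct inode *ino);  /* one sector -> atomic */

int atomic_update(struct inode *ino, const void *buf, size_t len)
{
    struct extent ext;

    if (alloc_unused_extent(len, &ext) < 0)  /* fresh, unused sectors */
        return -1;
    if (write_extent(&ext, buf, len) < 0)    /* safe to fail here:    */
        return -1;                           /* old file is untouched */

    ino->data = ext;          /* not visible until the inode is written */
    return write_inode(ino);  /* single atomic sector write = commit    */
}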

I will add a journal for larger transactional actions, such as recursively removing a directory. Going to have a lot of fun testing that ;)
