OSDev.org

The Place to Start for Operating System Developers

All times are UTC - 6 hours




 Post subject: MM ponderings
PostPosted: Wed May 23, 2007 7:02 pm 

Joined: Mon Dec 11, 2006 10:54 am
Posts: 190
Location: Dunedin, New Zealand
Hey all.
Yes, I know - I'm stuck back into my never-ending obsession (that being Memory Management).

But... I was wondering if anyone could explain to me Buddy Allocation, implemented via bitmaps? To me, that sounds less efficient on resources - since you'd always need X many bitmaps in RAM... or am I misunderstanding it?

Was pondering page stacks too - and I figured, sure, they are more efficient in the sense that the memory the 'pagestack' occupies will shrink as memory starts to be used.

The problem is that the page stack itself could occupy a great deal of pages.

For example, say we have a machine with 8MB of RAM.
Assuming a page size of 4 kilobytes, we have 2048 pages.

That's 2048 entries that need to be tracked; at 4 bytes each, 2048 * 4 = 8192 bytes. We need two pages to actually store the pagestack.

Okay, that's no big deal - say we allocate 1024 pages (half of the RAM). The second tracking page is no longer needed, and it is up for grabs.
Assume it is now in use, servicing some page request, JUST WHEN another two pages are released.

There is no more room to store those free pages. To me, it makes sense to say 'okay, this free page is taking over the page-stack role', and it takes on the role that the 2nd page did earlier. BUT... these 'pagestack nodes' need to be tracked, and they need to be able to be connected - which brings me to think of Intel's Page Directory.

Each Pagetable tracks 4MB, just like each page of the Pagestack tracks 4MB. When a pagestack page ends up being used, its link in the pagestack... directory is nulled. When something is freed and needs another page to store the pagestack info... it is just plugged into the pagestack directory.

The overhead would be 4K more than the typical pagestack - and even that 4K of overhead could scale away too, in the event that it is the last page to be allocated.
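To make the arithmetic above concrete, here is a minimal free-page stack for that 8MB machine - an illustrative sketch only (the array is statically allocated here rather than living in the two tracked pages, and the self-shrinking 'pagestack directory' idea is not modelled):

```c
#include <stdint.h>

#define NPAGES    2048u               /* 8 MB of RAM / 4 kB pages          */
#define PAGE_SIZE 4096u

/* 2048 entries * 4 bytes = 8192 bytes: exactly the two pages calculated  */
static uint32_t page_stack[NPAGES];   /* physical addresses of free pages */
static uint32_t sp;                   /* how many free pages are stacked  */

void pstack_init(void) {
    sp = 0;
    for (uint32_t p = 0; p < NPAGES; p++)
        page_stack[sp++] = p * PAGE_SIZE;  /* every page starts out free  */
}

/* Pop a free page; returns its physical address, or -1 when exhausted.   */
int64_t pstack_alloc(void) {
    return sp ? (int64_t)page_stack[--sp] : -1;
}

/* Push a released page back; freed pages can return in any order.        */
void pstack_free(uint32_t addr) {
    page_stack[sp++] = addr;
}
```

Note how the stack's live portion (`sp` entries) shrinks as pages are handed out, which is exactly what makes the second tracking page reclaimable.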

Personally, I like the sound of Buddy Allocation; it seems elegant. I can deal with the internal fragmentation issues, but external fragmentation will bug me - so Buddy Allocation makes sense.

But HOW to implement it is what gets to me. My ideas make me think using a tree would be far, far too... memory hungry - 16MB worst-case tracking for 4GB. That is a little hungry, don't you think? If I could represent it all as a complete BST, as an array, it'd be 4MB to track 4GB - which is much, much less painful. Problem is, that won't do.

Perhaps I shall investigate the ADT Heap.

Anywho, Any ideas... suggestions, references, materials... explanations you guys can give, would be highly useful and appreciated.

~Zeii


PostPosted: Thu May 24, 2007 2:45 am 
Joined: Mon Apr 09, 2007 12:10 pm
Posts: 775
Location: London, UK
Stacks for physical memory management are easy to implement, but as you rightly point out, they require (for 32-bit) a 4-byte address to store each free page. I think the idea of using a page that the stack no longer fills is interesting, but horrendously difficult to implement. Stacks require 1 kB of space per 1 MB of total memory.

Bitmaps on the other hand only require 1 bit per page = 32 bytes per MB.

Buddies are similar to bitmaps in that you need 1 bit per page at the lowest (4 kB) level, followed by 1 bit per 2 pages at the next level, and so on. It is basically a geometric series converging on 2 bits per page = 64 bytes per MB.
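The per-MB figures being compared here can be written out as a quick sanity check (4 kB pages, so 256 pages per MB; the function names are just for illustration):

```c
/* Bookkeeping cost per MB of physical memory, assuming 4 kB pages. */
enum { PAGES_PER_MB = 256 };

/* Stack: one 4-byte address per page. */
unsigned stack_bytes_per_mb(void)     { return PAGES_PER_MB * 4;     /* 1024 B = 1 kB */ }

/* Bitmap: one bit per page. */
unsigned bitmap_bytes_per_mb(void)    { return PAGES_PER_MB / 8;     /* 32 B          */ }

/* Buddy: 1 + 1/2 + 1/4 + ... bits per page, bounded by 2 bits per page. */
unsigned buddy_bytes_per_mb_max(void) { return 2 * PAGES_PER_MB / 8; /* 64 B          */ }
```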

I use a buddy system, but actually store 2 bits per entry in each buddy, as I also wish to encode a little more than just free or used. In particular, I mark memory-mapped devices and other system information as such, so that the swapping system will not attempt to save and restore the BIOS data area.

I have attached my physical memory manager code under a BSD licence. I must admit that I haven't altered it in a long time, so I am a bit rusty on how it works. It seems to work a charm though.

I basically call it like this:

edit: this is actually the code to initialise the buddies; use pmem_alloc() to actually get some space...

Code:
   buddy_size = allocate_buddy(~0, 0x1000, 0x400000);
   buddy_loc = find_free_address(buddy_size);
   allocate_buddy(buddy_loc, 0x1000, 0x400000);


Regards,
John.


Attachments:
pmem.c [18.55 KiB]
Downloaded 115 times
pmem.h [3.21 KiB]
Downloaded 92 times


Last edited by jnc100 on Thu May 24, 2007 5:14 am, edited 1 time in total.
PostPosted: Thu May 24, 2007 3:44 am 
Zenobit 1dc0.1 and 1dc0.2 (current) use bitmaps at the moment as well, but the entire system is very, very tightly integrated.

Both Physical and Virtual Memory managers use the same code - since the Address space that the Paging stuff is happening on is abstracted.

Conceptually, it works like a revolver's cylinder: it fires out pages and rotates, ready to fire another when requested. Bullets are replaced in any order. Eventually, the cylinder rotates a full 360 degrees, back to the original bullet (which may have been replaced by now, or may still be empty). At this point, it just spins the cylinder to find the next bullet it can fire.

Technically, it is a circular stack implemented as a bitmap. Pages are allocated at the current 'vector' (position in the stack).

for example:

switch_space(physical);
bla = alloc_page();
switch_space(virtual);
valloc_page(ROOT | READ | WRITE, bla);

(That isn't the actual code, but it is very similar.)

The VMM functions (such as VALLOC) are simply wrappers, automating a lot of the stuff that would ordinarily make code much larger and more cumbersome (such as space switching, aliasing, etc.).

BUT... the problem is that, because it is a circular stack, the potential for external fragmentation is very, very high. This doesn't matter so much in the physical sense, but in the virtual sense it matters a great deal.

Thus, my need to find a replacement - or some way to implement Buddy allocation with a much smaller overhead.

If you could go into a little more detail about the buddy-allocation-as-bitmap thing, I would be very, very grateful.

As I understand it, you have several bitmaps - one for each block size?
For example, bitmaps for 4, 8, and 16 KB.

If you allocate a 4k block, you mark the corresponding bits in the 8k and 16k maps as used.

If you allocate an 8k block, you mark two 4k blocks in use, and the corresponding 16k block in use.

If you allocate a 16k block, you mark two 8k blocks in use, as well as the corresponding four 4k pages.

Is this how it works, mostly? Because if that is true, it sounds inefficient.
OR... Is there some way in which you are merging the Bitmap?

So that you have ONE big bitmap... and all the other bitmaps are actually just different views of that same bitmap?

~zeii


PostPosted: Thu May 24, 2007 4:32 am 
No, I use a separate bitmap for each level, from 4 kB to 4 MB.

For a 32MB system, you only need 1 byte to represent the 4MB buddy (eight 4MB blocks = 8 bits).

At system startup, all buddies are marked as not free, apart from the highest level buddy.

When a request for, e.g., a 4kB page is made, you look for a free entry in the 4 kB buddy; if there isn't one, you recursively ask the 8 kB buddy for a free entry. If that doesn't return anything, then you're out of memory. If it does, then that 8 kB entry is marked as used in the 8kB buddy, and you now have two corresponding entries in the 4 kB buddy.

One of these you mark as free, and the other as used, which you return.

Of course, the recursive calling doesn't stop at the 8kB buddy - if it can't find any space it calls the 16kB one and so on, until it reaches the 4MB buddy. If nothing is found here, then it does return out of memory.

Frees work like allocs in reverse - if an entry is freed, and its 'complement' (e.g. bit 1 is the complement of bit 0, 2 of 3, 4 of 5 and 6 of 7) is free, then you mark both as _used_ and mark the combined entry free in the buddy above (recursively, up to 4MB again). If the complement is not free, then you just mark the entry free in the current buddy.

You can split the buddies at the 16MB mark if you like to allow you to allocate space for DMA.

Of course this implementation is very slow when looking for a free page at the start (you have to check every bit in every buddy from 4kB up to 4MB). You can improve it by maintaining a 'free entry' count for each buddy level, which is updated on allocs and frees. If this count is 0, then we don't bother searching that buddy for a free entry but just go up to the next.
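A toy version of the scheme described above (separate per-level bitmaps, recursive split on alloc, complement-merge on free, plus the free-entry counters) might look like this. This is an illustrative sketch for a 64 kB toy memory with four levels, not the attached pmem.c:

```c
#include <stdint.h>
#include <string.h>

#define PAGE_SIZE 4096u
#define LEVELS    4                   /* 4 kB, 8 kB, 16 kB, 32 kB blocks  */
#define MEM_BYTES (64u * 1024)        /* toy machine: 64 kB = 16 pages    */

/* free_map[l] has one bit per block of size (PAGE_SIZE << l); 1 = free.  */
static uint32_t free_map[LEVELS];     /* 16 blocks max fits in one dword  */
static unsigned free_count[LEVELS];   /* the 'free entry' optimisation    */

static unsigned blocks_at(unsigned l) { return MEM_BYTES / (PAGE_SIZE << l); }
static int  test_bit(unsigned l, unsigned i) { return (free_map[l] >> i) & 1; }
static void set_bit (unsigned l, unsigned i) { free_map[l] |=  1u << i; free_count[l]++; }
static void clr_bit (unsigned l, unsigned i) { free_map[l] &= ~(1u << i); free_count[l]--; }

void buddy_init(void) {
    memset(free_map, 0, sizeof free_map);     /* all marked not free...   */
    memset(free_count, 0, sizeof free_count);
    for (unsigned i = 0; i < blocks_at(LEVELS - 1); i++)
        set_bit(LEVELS - 1, i);               /* ...except the top level  */
}

/* Allocate one block at level l; returns the block index, or -1. */
long buddy_alloc(unsigned l) {
    if (l >= LEVELS)
        return -1;                            /* out of memory            */
    if (free_count[l] > 0)                    /* skip empty levels fast   */
        for (unsigned i = 0; i < blocks_at(l); i++)
            if (test_bit(l, i)) { clr_bit(l, i); return i; }
    long parent = buddy_alloc(l + 1);         /* split a larger block     */
    if (parent < 0)
        return -1;
    set_bit(l, 2 * parent + 1);               /* one half becomes free    */
    return 2 * parent;                        /* the other is handed out  */
}

/* Free block i at level l, merging with its complement when possible. */
void buddy_free(unsigned l, unsigned i) {
    if (l + 1 < LEVELS && test_bit(l, i ^ 1)) {
        clr_bit(l, i ^ 1);                    /* complement free: merge up */
        buddy_free(l + 1, i / 2);
    } else {
        set_bit(l, i);
    }
}
```

A block's physical address is just `index * (PAGE_SIZE << level)`; real code would size the bitmaps from the detected memory and mark firmware-reserved regions as used at init.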

I haven't implemented this optimisation in my code yet. I have other matters to work on...

Regards,
John.


PostPosted: Thu May 24, 2007 5:26 am 
:) Thanks John, that helped my understanding.
I've been toying with the idea for a good few minutes.
And I think I've got a minor optimization, at least in terms of the memory required to store the bitmaps.

I will draw some quick diagrams and post them.
I believe my alternative would take more CPU time, but... it's still a neat idea, I think. :) I will post shortly - tell me what you think!

~zeii.


PostPosted: Thu May 24, 2007 6:29 am 
Here is my idea.
Basically, we are still using a single bitmap.
So, the overhead would be no larger than my existing allocation mechanism.

However... it is a Buddy Allocator.
Yes, normally we need separate bitmaps to track the various block 'lists'.
BUT... I was thinking... why can't we somehow represent all the layers we need in the SAME bitmap?

And... I thought of this.

[Image: diagram of the single-bitmap scheme (attachment not preserved)]

Any ideas?
~zeii


PostPosted: Thu May 24, 2007 6:52 am 
Nice diagram, but that's just a bitmap system that supports searching for different block sizes.

The point of a buddy is that you can _quickly_ find free memory of any size.

Your system works fine for anything that fits within 32 bits (max 128 kB), as you can do a single test in a 32-bit register, but it becomes more complex when you are looking for larger sizes - e.g. 4MB to use as a large page. You need to find 1024 contiguous bits set to '0', optionally aligned on a 1024-bit boundary.

Then the test becomes a long loop testing 32 dwords for each possible return. The point of a buddy is that you can just check for one bit to be marked free.

As I said before, a buddy system does not occupy more than twice as much memory as a bitmap. If a bitmap requires 32 bytes per 1 MB of physical memory, then you are saving at most about 0.003% of your total memory by using a bitmap over a buddy system, at the expense of a performance gain (albeit a small one in most cases).

Regards,
John.

edit: I have now updated my physical memory allocator to use the optimisation described above. Let me know if you want to see it. I really should get a website at some point...


PostPosted: Thu May 24, 2007 6:56 am 
Not necessarily.

All of the scanning and checking can be done in a rather fast manner.

For example, for a 4MB chunk - you would simply need to check 32 dwords, not 1024 individual bits.

All of it can be optimized, but I do admit it would need to do a lot more work (all the extra computation needed).

Then again, I may have misunderstood what you were saying.
:) So you'll have to go into more depth.

If possible, a diagram would help! :)

For example, if we have an 8MB computer, we know there are at most two 4MB blocks.

4MB Units 0 and 1. We can calculate the precise bits in the bitmap to check:

1024 pages * 4kB = 4MB.
Unit 0 starts at page 0 * 1024 = bit 0.
Unit 1 starts at page 1 * 1024 = bit 1024.
dword = (1024 / 32) = 32.
bit = (1024 % 32) = 0.

Check 4M Unit 0: check dword 0, bit 0.
Check 4M Unit 1: check dword 32, bit 0.

They are aligned by the nature of the system, as all blocks are powers of two. When you want to set a 4MB block as used, it just falls down to clever bit setting - en masse. Why set 1024 separate bits when you can simply set 32 dwords to 0xFFFFFFFF?
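The 'check 32 dwords, fill en masse' idea might look roughly like this, over a single 1-bit-per-page bitmap where a set bit means the page is used (the names are illustrative, not from any real allocator):

```c
#include <stdint.h>

#define PAGES_PER_4MB  1024u
#define DWORDS_PER_4MB (PAGES_PER_4MB / 32)  /* 1024 bits = 32 dwords */

/* Is every 4 kB page in 4 MB unit `unit` free?  32 compares, not 1024. */
int unit4mb_is_free(const uint32_t *bitmap, unsigned unit) {
    const uint32_t *d = bitmap + unit * DWORDS_PER_4MB;
    for (unsigned i = 0; i < DWORDS_PER_4MB; i++)
        if (d[i] != 0)
            return 0;                 /* some page in this unit is used */
    return 1;
}

/* Mark a whole 4 MB unit used or free: 32 dword stores, en masse. */
void unit4mb_mark(uint32_t *bitmap, unsigned unit, int used) {
    uint32_t fill = used ? 0xFFFFFFFFu : 0u;
    for (unsigned i = 0; i < DWORDS_PER_4MB; i++)
        bitmap[unit * DWORDS_PER_4MB + i] = fill;
}
```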

Don't get me wrong, I'm just exploring alternative implementation ideas :)
~zeii.


Last edited by elderK on Thu May 24, 2007 7:06 am, edited 1 time in total.

PostPosted: Thu May 24, 2007 7:00 am 
Joined: Fri Dec 15, 2006 5:26 pm
Posts: 437
Location: Church Stretton Uk
Whenever you do a calculation about how much memory an OS's look-up tables are going to occupy, you get a horrendous answer - until you remember how much RAM a modern computer contains. The thing to keep in mind is the percentage of available memory, rather than absolute numbers.


PostPosted: Thu May 24, 2007 7:03 am 
4 MiB = 1024 * 4 kiB

Given that 1 bit represents a 4 kiB page, that does indeed need 1024 bits.

Regards,
John.


PostPosted: Thu May 24, 2007 7:09 am 
Yes, it does need 1024 bits, but that doesn't mean you have to set each bit in a slow, sickly way!

And if you want to check whether a 4MB block is free, you can just check the dwords in ONE go, rather than all the individual bits.

It doesn't have to be as slow as it looks.
~zeii.


PostPosted: Thu May 24, 2007 7:23 am 
By your system, to search for a 4MB space, I assume you plan to do something like:

Code:
uint total_bits = total_mem / 4096;

for(uint i = 0; i < total_bits; i += 1024) {
  uint all_free = 1;

  for(uint j = i; j < i + 1024; j += 32) {
    if(dword_at_bit_not_free(j)) {
      all_free = 0;
      break;
    }
  }

  if(all_free)
    return address_represented_by_bit(i);
}

return NULL;


In order to find a free 4 MB block, you need to check at least the 1024 bits in the block you return, plus a variable number in all the blocks before that.

I simply check for one free entry in the 4 MB buddy (after checking whether its free counter is > 0, of course).
This requires me to check between 1 bit and the full size of the 4 MB buddy (total_mem / 4 MB bits).

Regards,
John.


PostPosted: Thu May 24, 2007 7:41 am 
Well, no.

:P First of all, we know how many 4MB groups there would be, since we can easily calculate that on the fly from the system's RAM count.

Then... all we would have to do is scan the possible 4MB positions. To get the 'starting' bit locations of those units, we would perform simple hashing (really just a direct index calculation).

For example:

Unit 0: dword = 0, bit = 0.
Unit 1: dword = 32, bit = 0.
Unit 2: dword = 64, bit = 0.

Once we know the dword that the logical 4MB block starts from, we can just scan that all 32 dwords are ZERO (starting from the dword we have hashed to).

If all 32 dwords check out as zero, we have a free 4MB block.
AND YES... I realize this is still checking 1024 bits :P. It's just that checking each whole dword is faster than checking every single bit individually.

The performance hit would start to hit home when you allocate things that span less than a dword :P

For example, 16k. Four pages.

Unit 0 = 0/0
Unit 1 = 0/4
Unit 2 = 0/8
Unit 3 = 0/12

Even then, the search can be made faster.
Instead of checking each individual bit, why not calculate a mask for the allocation size?

16k = four pages = mask 1111.

So, progressively scan each unit by ANDing it with the mask. If the result isn't zero, shift the mask by four and repeat. Once you have located a decent 'block', you will have the dword and bit index. OR the mask into the right location, calculate the address to return from the bit index, boom, done.
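The mask idea could be sketched like this: build a run-of-ones mask for the allocation size, then slide it through each dword in allocation-sized steps (an aligned search, matching the power-of-two alignment mentioned above; function names are illustrative):

```c
#include <stdint.h>

/* Search one dword of a 1-bit-per-page bitmap (1 = used) for `npages`
   consecutive free pages, npages a power of two <= 32.  Returns the
   starting bit index within the dword, or -1 if no aligned run fits. */
int find_run_in_dword(uint32_t dword, unsigned npages) {
    uint32_t mask = (npages >= 32) ? 0xFFFFFFFFu : ((1u << npages) - 1u);
    for (unsigned shift = 0; shift + npages <= 32; shift += npages)
        if ((dword & (mask << shift)) == 0)
            return (int)shift;        /* the whole run is free here */
    return -1;
}

/* Claim the run: OR the mask into place, en masse. */
uint32_t claim_run(uint32_t dword, unsigned npages, unsigned shift) {
    uint32_t mask = (npages >= 32) ? 0xFFFFFFFFu : ((1u << npages) - 1u);
    return dword | (mask << shift);
}
```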

~zeii.


PostPosted: Thu May 24, 2007 10:33 am 
Don't worry, I was assuming you'd check it a (insert register size here) unit at a time.

Aside from the speed issues, what happens if you have 8MB of memory and try the following:

Code:
void *a = palloc(4MB);
void *b = palloc(4kB);
free(a);
void *c = palloc(4kB);
a = palloc(4MB);


I assume you wouldn't have enough memory for the last palloc, because b would sit at the 4 MB boundary and c at the 0 MB boundary.

In other words, you have no way of telling, without performing complicated bit manipulations each time, whether a bit set free at the lowest level is actually part of a larger block that should be kept together. I generally believe it's best to try to allocate the smallest possible block that fits in each case.

Regards,
John.


PostPosted: Thu May 24, 2007 10:43 am 
Well, that is true.
I believe also that it is best to locate the smallest possible free block.
And yes, I would have to do rather complicated checking to do that.
I wouldn't run out of memory though ;).

I would appreciate a detailed explanation of your algorithm :).
~zeii.

