Questions on Hash Tables...

Alboin · **Posted:** Sat May 26, 2007 4:15 pm

This is part 2 in my nth part series consisting of my inability to understand a subject after thoroughly researching it.

I'm working on implementing some associative arrays, and came across hash tables.

My understanding consists of this:

You run your key through the hash function. (I'm using FNV.)
The hash is then the index to the array.
Access the element in the array.
Do chaining if collision occurs.

I'm somewhat confused by the whole thing:

How does one keep an arrays with hashed keys so large? For instance, "Hello World!" hashed is 1754836950. Am I then supposed to mod that by the array size? How would one go about this?

And then, with the whole chaining and open addressing situation: If implementing an associative array, would collisions be met? If so, how would one determine what element in the list (Considering I went with chaining.) was wanted?

Thanks... :cry:

Kevin McGuire · **Posted:** Sat May 26, 2007 4:29 pm

Quote:

How does one keep an arrays with hashed keys so large? For instance, "Hello World!" hashed is 1754836950. Am I then supposed to mod that by the array size? How would one go about this?

One way it works to make things work really fast! Is to consider:

Code:

uint8_t **stringArray;
stringArray = (uint8_t**)malloc(sizeof(uint8_t) * 1000);
// fill the array with strings
FillArrayWithStrings(stringArray);
// search the array for a certain string.
for(uint32_t x = 0; x < 1000; ++x)
{
   if(strcmp(stringArray[x], WhatToFind) == 0)
   {
        printf("found string %s at %x\n", stringArray[x], x);
   }
}

Okay. That is pretty fast right? Well. This one below is even faster, and I mean much faster!

Code:

struct sPair{
   uint32_t hash;
   uint8_t *string;
};
sPair *pairs = (sPair*)malloc(sizeof(sPair) * 1000);
// fill array with strings
lookForHash = doMyHash("ForThisWord");
/// check the entire array.
for(uint32_t x = 0; x < 1000; ++x)
{
   /// check hash (a collision could exist right this moment).
   if(pairs[x].hash == lookForHash)
   {
      /// error checking prevents a collision for causing problems.
      if(strcmp(pairs[x].string, "ForThisWord") == 0)
      {
          /// this was not a collision and we have found the string.
          printf("found string %s at %x.\n", pairs[x].string, x);
       }
   }
}

See the difference? I am sure there may exist hashing algorithms that can produce a index somehow. I have tried before and come really close to doing it I thought. The above should serve as a practical example of using a hash to decrease execution time when searching for strings.

The processor only has to make one thirty-two bit compare instead of a series of compares with code to handle strings that are not a multiple of four bytes by doing a last two byte(sixteen bit) or one byte compare(eight bit).

If you stored the hash separately from the strings in a really large list you might even enable the processor to use a cache line to hold the entire array of hashes (and using the rep instruction) which should make the operation extremely fast I would imagine.

Kevin McGuire · **Posted:** Sat May 26, 2007 6:15 pm

Maybe to continue the conversation for the sake that as a child I had tried to perform compression using permutation. During that time as a child I did not realize that what I was dealing with was called permutations or that I was simply wanting the change bases which is why my idea failed. However, the fondness of permutations, or combinations as I called them, has remained over the years. Some times I start thinking about it. The last time I took a look at it again was a few months ago when I wanted a hash algorithm that had a domain that was much smaller than the items to be hashed which is exactly like most hashing algorithms I know of today.

What I wanted to do differently was to make the algorithm aware of previous hashes so it could generate a hash in a sequential order with out storing the string. I think a equivalent example would be:

Code:

struct sPair
{
   uint32_t hash;
   uint8_t *string;
};
sLinkedListItem<sPair> pairs;

uint32_t doHash(uint8_t *string)
{
   // check if string already exists.
   if(STRING_EXISTS_IN_LIST(pairs))
   {
      return GET_HASH_FROM_LIST_BY_STRING(string, pairs);
   }
   // create a new hash - and the return value is the index of the pair in the list.
   index = CREATE_NEW_PAIR_IN_LIST(string, pairs)
   hash = index;
   return hash;
}

This is bloated though. What I wanted to do was remove the need to store current pairs and remove the need to illiterate a list or I suppose I could say make it more linear. The way I tried solving this was using premutations of permutations. I think that is the proper way to say it. The idea was similar to:

Here is a direct mapping.
kevin = kevin
apple = apple
pear = pear
shirt = shirt
wall = wall
a = a

However, I wanted to treat the mapping as a permutation it's self and call this permutation of the mapping a sequence so that something simpler like:

Mapping Sequence: aa
aa = 'aa
ab = 'ab
ba = 'ba
bb = 'bb
Mapping Sequence: ab
aa = 'aa
ab = 'ab
ba = 'bb
bb = 'ba
Mapping Sequence: ba
aa = 'aa
bb = 'bb
ba = 'ab
bb = 'ba
Mapping Sequence: bb
aa = 'aa
bb = 'bb
ba = 'ba
bb = 'ab
Mapping Sequence: baa
aa = 'aa
bb = 'ba
ba = 'bb
bb = 'ab
Mapping Sequence:
aa = 'aa
bb = 'ba
ba = 'ab
bb = 'bb
Mapping Sequence:
.... continued onward ...

So if we hashed in this order: 'aa, 'bb, 'ba, 'bb. Then, a direct mapping could be achieved by changing the mapping sequence from the default aa to bb. The need to store the current pairs should be removed (eliminated) since this is obviously stored by the mapping sequence instead of being expanded. More like a compressed state that retains the order. This is most likely already done somewhere although it is named something else and represented differently. I have no real idea but then again it might not be.

I forgot to mention that the point is to completely eliminate collision up to a point where you can detect that a collision will happen and you could remove a older or unused hashing by changing the mapping sequence.

Alboin · **Posted:** Sat May 26, 2007 7:10 pm

So...If one was to access an item in the associative array, it would go a little something like:

Hash the string(ie. the input.) and create an index to the array.
Goto the index, if linked list, search for original string; else return string.

Therefore, each structure, or element in the linked list (If there were collisions causing chaining.) would then consist of the string index, the hash, and the data, correct?

An implementation question. The array containing the array of structures, of the type described in the paragraph above, would that be a linear array? Or a linked list?

Also, the creation of an index to the array from a hash is somewhat foggy, yet...If the array size changes, how can a hash%array_size be constant? Wouldn't the place of the items change as more and more were added?

Kevin McGuire · **Posted:** Sat May 26, 2007 7:33 pm

I think I understand what you mean by chaining. Let me see if I can think it out. I had to read about associative arrays since I do not know the terminology very well about these things. I generally come up with them on my own where I see a fit.

Code:

struct ObjectData
{
   uint8_t   *address;
   uint8_t   *country;
   uint8_t   *state;
};
struct sPair;
struct sPairLinkedListHeader
{
   struct sPair *first;
};
struct sPair
{
   uint32_t   key;
   uint8_t   *string;
   struct ObjectData *data;
   struct sPair *next;
   struct sPair *prev;
   struct sPairLinkedListHeader hdr;
};

sPair* AddPairToLinkedList(struct sPairLinkedListHeader *hdr, uint8_t *string)
{
   struct sPair *pair, *chainPair;
   uint32_t hash = hashString(string);
   pair = (struct sPair*)malloc(sizeof(struct sPair));
   pair->string = (uint8_t*)malloc(strlen(string)+1);
   pair->key = hash;
   strcpy(pair->string, string);
   cpair = FindPairByHash(hdr, hash);
   if(chainPair)
   {
      if(GetMatchFromPairOrChain(chainPair))
      {
         /// chain it.
         pair->next = chainPair->hdr.first;
         pair->prev = 0;
         if(chainPair->hdr.first)
         {
            chainPair->hdr.first->prev = pair;
         }
         chainPair->hdr.first = pair;
         return pair;
      }
   }
   // no chain create first mapping.
   pair->next = hdr->first;
   pair->prev = 0;
   if(hdr->first)
   {
      hdr->first->prev = pair;
   }
   hdr->first = pair;
   return pair;
}

struct sPair* FindPairByHash(struct sPairLinkedListHeader *hdr, uint32_t hash)
{
   struct sPair *c;
   for(c = hdr->first; c != 0; c = c->next)
   {
      if(c->key == hash)
      {
         return c;
      }
   }
   return 0;
}

struct ObjectData* GetMatchFromPairOrChain(struct sPair* pair, uint8_t *string)
{
   struct sPair *c;
   if(strcmp(string, pair->string) == 0)
   {
      return pair->data;
   }
   // search chain..
   for(c = pair->hdr.first; c != 0; c = c->next)
   {
      if(strcmp(string, c->string) == 0)
      {
         return pair->data;
      }
   }
   return 0;
}

I think doing the chain pairs like I did could be a bit inefficient in space usage since we could omit the member key/hash since we already know it and just need to do a manual compare of strings for a fast lookup of accessing the object data associated with a string (say a name).

There is some syntax coloring here:
http://kmcguire.jouleos.galekus.com/dok ... collisions

I do not know of a method to generate a hash that is a index into a array. Instead I only know of generating a hash so that you can search for it instead of a string and the newly encountered method of chaining items that have the same hash (or AKA collision) so that you do you have to continue wasting time comparing more hashes and go straight to trying to find the right string using a manual method of comparing each byte in the string (in general).

You might modify the hashing function to use a different method for a chain so that:

hash1 = doHash1(string);
hash2 = doHash2(string);
hash1 != hash2

So that when entering into the chain you could check for hash2 instead of hash1 which might show a large performance increase if collisions were abundant. There might be a way to continue this trend where you have chaining on chaining on chaining and so on?

Alboin · **Posted:** Mon May 28, 2007 12:25 pm

Kevin McGuire wrote:

I do not know of a method to generate a hash that is a index into a array.

Could anyone shed some light on this?

Colonel Kernel · **Posted:** Mon May 28, 2007 12:34 pm

Alboin wrote:

Kevin McGuire wrote:

I do not know of a method to generate a hash that is a index into a array.

Could anyone shed some light on this?

Mod by the size of the array as part of your hash function. If you decide to grow or shrink the array, that means you'll have to re-hash everything in it.

Candy · **Posted:** Mon May 28, 2007 12:41 pm

Alboin wrote:

You run your key through the hash function. (I'm using FNV.)
The hash is then the index to the array.
Access the element in the array.
Do chaining if collision occurs.

I'm somewhat confused by the whole thing:

How does one keep an arrays with hashed keys so large? For instance, "Hello World!" hashed is 1754836950. Am I then supposed to mod that by the array size? How would one go about this?

int modulated_hash = hash(text) % table_size;

Quote:

And then, with the whole chaining and open addressing situation: If implementing an associative array, would collisions be met? If so, how would one determine what element in the list (Considering I went with chaining.) was wanted?

struct item *curitem = items[modulated_hash];
while (curitem != 0 && curitem->key != requested_item_key) curitem = curitem->next;
if (curitem) return curitem->value;
return 0;

This is with closed addressing (the names confuse, yes). It's closed since given an item, you know in which box it must be. You could also use open addressing:

end_hash = modulated_hash;
second_hash = hash(text) % small_prime;
while (items[modulated_hash] != 0 && items[modulated_hash]->key != requested_item_key && ((modulated_hash + second_hash) != end_hash))
modulated_hash = modulated_hash + second_hash % table_size;
if (curitem && curitem->key == requested_item_key) return curitem->value;
return 0;

It's better to limit the hash during generation to the given limit instead of to an arbitrary other limit since you will get an uneven distribution given this method. Admitted, the difference is fairly negligible until you get to table sizes somewhat closer to your original hash size. For instance, if you have a hash function that gives 65536 equally distributed hashes and you use that as a hash on a 32767 entry table the first two entries will be a lot larger than the rest.

Any other specific questions?

Alboin · **Posted:** Mon May 28, 2007 12:47 pm

Candy wrote:

Any other specific questions?

I'm pretty set at the moment! I'll post here if I come to any other problems.

Thanks greatly!

(Especially for the quick replies.)

Kevin McGuire · **Posted:** Mon May 28, 2007 1:10 pm

Holy crap. I did not realize that modulo could do what it does such as distributing the hash key space over the entire array index space.

I wish I was wrong and corrected more often..

From now on I want you guys to make me feel like I know nothing. I want to be corrected so much I start crying and sucking my thumb. I want you to be so through in correcting everything I do from grammar to code.

Alboin · **Posted:** Tue May 29, 2007 2:38 pm

Hey, fancy that. Another question.

If to get the index I do hash%table_size, when do I need to change my table size? (I'm using chaining to fix collisions.) Wouldn't all the elements have to fit in the table, because of the modulo?

Also, if I don't change size, then what would be a good table size?

Thanks.

Kevin McGuire · **Posted:** Tue May 29, 2007 3:24 pm

Alboin wrote:

Hey, fancy that. Another question.

If to get the index I do hash%table_size, when do I need to change my table size? (I'm using chaining to fix collisions.) Wouldn't all the elements have to fit in the table, because of the modulo?

You change the game from static size to more length to stay in frame with the gain if to be wise, for there is a price to pay when you chain away your day.

Quote:

Also, if I don't change size, then what would be a good table size?
Thanks.

The size is a matter of being wise for there is no prize for being enticed.

Alboin · **Posted:** Tue May 29, 2007 3:35 pm

Kevin McGuire wrote:

Alboin wrote:

Hey, fancy that. Another question.

If to get the index I do hash%table_size, when do I need to change my table size? (I'm using chaining to fix collisions.) Wouldn't all the elements have to fit in the table, because of the modulo?

You change the game from static size to more length to stay in frame with the gain if to be wise, for there is a price to pay when you chain away your day.

Quote:

Also, if I don't change size, then what would be a good table size?
Thanks.

The size is a matter of being wise for there is no prize for being enticed.

I knew I shouldn't have started that...

But how do I know when I should increase the array size?

Kevin McGuire · **Posted:** Tue May 29, 2007 3:40 pm

Quote:

I knew I shouldn't have started that...
But how do I know when I should increase the array size?

You played the game now you want to change,
But to show blame and still be sane,
I want to say you should have framed what your teachers hanged,
Outside there door frame which says 101 you so called brain.

Alboin · **Posted:** Tue May 29, 2007 3:53 pm

So because you don't know the answer to my question, you're mocking me in rhyme? Could we please just get back to the purpose of this thread? I don't feel like rhyming at the moment, and that's reserved for that one thread anyway.

If anyone could shed some more light on the question... :cry:

Thanks.

OSDev.org

Questions on Hash Tables...

Who is online