OSDev.org

The Place to Start for Operating System Developers

PostPosted: Wed May 30, 2007 1:07 pm
Member

Joined: Tue Oct 17, 2006 11:33 pm
Posts: 3882
Location: Eindhoven
Kevin McGuire wrote:
Code:
uint32_t doHash(uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;   /* running total of the input bytes */
    for (uint32_t x = 0; x < len; ++x)
    {
        hash = hash + in[x];
    }
    return hash;
}


(And the above algorithm might be wrong. I am still thinking it over.)


The algorithm isn't wrong, but it's a bad hashing algorithm. Given random inputs, each 257 bytes in length, the hash will never come above 65535 (257 × 255). Give it any permutation of an input and it'll collide; lower one byte by 1 and raise another by 1 and you'll have yet another collision. For example, try building a hash table containing "abc", "acb", "bac", "bca", "cab" and "cba" with this algorithm: every key hashes to the same value. Then try again after building one along the lines described below.
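To see it concretely, here's a small self-contained check (the driver code is just for illustration; the hash is the one quoted above):

Code:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* the quoted sum-of-bytes hash */
static uint32_t doHash(const uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;
    for (uint32_t x = 0; x < len; ++x)
        hash += in[x];
    return hash;
}

int main(void)
{
    const char *perms[] = { "abc", "acb", "bac", "bca", "cab", "cba" };
    for (int i = 0; i < 6; ++i)
        printf("%s -> %u\n", perms[i],
               doHash((const uint8_t *)perms[i], (uint32_t)strlen(perms[i])));
    /* all six print 294: every permutation of the same bytes collides */
    return 0;
}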

You make a hash algorithm by working out which bits of your input hold the most entropy, and by making sure that entropy is spread across as many bits of the output as possible. If, for instance, your input is only ASCII digits, the top four bits hold no entropy at all, since they're always "0011". You won't gain anything there: E(7) = E(6) = E(5) = E(4) = 0.

The fourth bit (bit 3) does hold entropy: given a truly random digit between 0 and 9, it has a 0.2 chance of being 1 and 0.8 of being 0. That's not a full bit of entropy, but it's something. The third bit (bit 2) has a 0.4/0.6 split between 1 and 0, so it carries more, though still not a full bit, and the second bit (bit 1) has the same 0.4/0.6 split. The first bit (bit 0) is a full bit of entropy, since it's 0 or 1 with a chance of 0.5 each.
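If you want to check those figures, a throwaway loop like this (purely for illustration) counts how often each bit is set across the ASCII digits '0' to '9'; the constant bits carry no entropy regardless of whether they're always 0 or always 1:

Code:
#include <stdio.h>

int main(void)
{
    /* for each bit position, count how many of the ten ASCII digits have it set */
    for (int bit = 7; bit >= 0; --bit) {
        int set = 0;
        for (char c = '0'; c <= '9'; ++c)
            set += (c >> bit) & 1;
        printf("bit %d: set in %d of 10 digits\n", bit, set);
    }
    return 0;
}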

(less sure about the next)
So, one of these characters holds about 0.2 + 0.4 + 0.4 + 0.5 ≈ 1.5 bits of entropy by this rough count (the true figure for a random decimal digit is log2(10) ≈ 3.3 bits, since a partial bit is worth more than its raw probability). The top four bits don't contribute any, so ignore them.

Adding together two numbers that each hold N bits of entropy gives you about N+1 bits of entropy. Adding a bit without entropy to a number with entropy retains the entropy in the output. XORing also preserves entropy: XORing a bit with entropy into one without keeps it, and XORing two bits that both have entropy gives one bit whose entropy is at least as high as either of them (and, if both are nonzero, closer to a full bit).

So, the least useful thing you can do is just add them together, since that gives you only N+1 bits of entropy out of the 2N bits you put in. You could instead shift one of them left first, so that bits without entropy line up against bits with entropy and vice versa, which lets you keep all 2N bits of entropy.

At some point you'll have gathered so much entropy that the new, possibly entropic values no longer fit in the state you have. Dumping bits is a waste, so it's usually best to chop bits off the top of the state and mix them back into the remaining bits, spreading their entropy as widely as possible.
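As a sketch of that shape (just an illustration of shifting new bytes in and folding high bits back down, not a recommendation of this particular mix), the well-known shift-add-xor pattern looks like this:

Code:
#include <stdint.h>
#include <stddef.h>

/* shift the state, add the new byte, and XOR the result back into the state,
 * so high bits that would otherwise be lost get folded into the low end */
static uint32_t shift_add_xor(const uint8_t *in, size_t len)
{
    uint32_t hash = 0;
    for (size_t i = 0; i < len; ++i)
        hash ^= (hash << 5) + (hash >> 2) + in[i];
    return hash;
}

Unlike the sum-of-bytes hash, this one already distinguishes permutations of the same bytes.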

You can test for the first behaviour (using the input's entropy) by giving your algorithm a lot of small inputs to hash: if they collide, it's not using the entropy properly. You can test for the second behaviour (mixing the state) by giving it a lot of long inputs that are equal in the first N bytes and then padded with identical low-entropy bytes, say all zero bytes or all spaces. If those collide, you're either wasting entropy by shifting it out of the state, or failing to change the state at all on such near-empty input.
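A rough illustration of the first test (nothing rigorous, just counting distinct outputs over all two-byte inputs, here against the sum-of-bytes hash from the quote):

Code:
#include <stdio.h>
#include <stdint.h>

/* the sum-of-bytes hash again, repeated so this test stands alone */
static uint32_t doHash(const uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;
    for (uint32_t x = 0; x < len; ++x)
        hash += in[x];
    return hash;
}

int main(void)
{
    /* hash every possible two-byte input and count the distinct outputs */
    static int seen[65536];
    unsigned distinct = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b) {
            uint8_t buf[2] = { (uint8_t)a, (uint8_t)b };
            uint32_t h = doHash(buf, 2);   /* never exceeds 510 for two bytes */
            if (!seen[h]++)
                ++distinct;
        }
    printf("%u distinct hashes out of 65536 two-byte inputs\n", distinct);
    return 0;
}

It prints 511, so 65536 inputs are crammed into 511 outputs: the entropy clearly isn't being used.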


People have done these kinds of things in the past and come up with a number of methods that produce fairly good output. One particularly useful tool for hashing is a CRC. A CRC is more or less based on the theory above: it shifts the new byte into the state in full (shifting the state left by 8 bits and adding in the new byte), and then uses the 8 bits shifted out (together with the leftover state) to distribute their entropy over all the other bits. Originally this is done bit by bit, but it can be collapsed into a byte-at-a-time operation by using a lookup table for the effect each byte has on the rest of the output. It's very fast for what it does, and very good.
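A sketch of the table-driven form, using the common reflected CRC-32 with polynomial 0xEDB88320 (this variant shifts right rather than left, but the idea is the same):

Code:
#include <stdint.h>
#include <stddef.h>

static uint32_t crc32_table[256];

/* precompute, for each possible byte, its effect on the rest of the state */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t c = i;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc32_table[i] = c;
    }
}

/* one table lookup per input byte instead of eight bit-by-bit steps */
static uint32_t crc32(const uint8_t *in, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i)
        crc = (crc >> 8) ^ crc32_table[(crc ^ in[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}

crc32_init() has to run once before the first call; the table is exactly the lookup table for "the effect on the rest of the output" described above.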

For more information, buy a good discrete mathematics book and an introductory cryptography book. I can't vouch for the first but "Applied Cryptography" by Bruce Schneier is a good one for the second.


PostPosted: Thu May 31, 2007 5:06 pm
Member

Joined: Thu Jan 04, 2007 3:29 pm
Posts: 1466
Location: Noricum and Pannonia
Candy wrote:
"Applied Cryptography" by Bruce Schneier is a good one for the second.

Agreed. (Although I've only read half of it.)

I'm up to the rehashing, and was thinking of simply creating a new, larger table, going through the old one, and entering each item one by one. (I would then free the old table, of course.)
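Something along these lines, with placeholder names for my structures and hash function:

Code:
#include <stdlib.h>

/* placeholder structures, just to sketch the idea */
struct entry { void *key; void *value; struct entry *next; };
struct hashtable { struct entry **buckets; size_t size; };
extern size_t hash_key(const void *key);   /* whatever hash the table already uses */

void rehash(struct hashtable *ht)
{
    size_t new_size = ht->size * 2;                          /* grow the table */
    struct entry **new_buckets = calloc(new_size, sizeof *new_buckets);
    if (!new_buckets)
        return;                                              /* keep the old table on failure */

    for (size_t i = 0; i < ht->size; ++i) {                  /* walk the old buckets */
        struct entry *e = ht->buckets[i];
        while (e) {
            struct entry *next = e->next;
            size_t b = hash_key(e->key) % new_size;          /* re-enter each item one by one */
            e->next = new_buckets[b];
            new_buckets[b] = e;
            e = next;
        }
    }

    free(ht->buckets);                                       /* free the old bucket array */
    ht->buckets = new_buckets;
    ht->size = new_size;
}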

I'm thinking this is slow, however. Any recommendations?

Thanks!

_________________
C8H10N4O2 | #446691 | Trust the nodes.

