OSDev.org

The Place to Start for Operating System Developers

PostPosted: Wed May 30, 2007 1:07 pm
Member

Joined: Tue Oct 17, 2006 11:33 pm
Posts: 3882
Location: Eindhoven
Kevin McGuire wrote:
Code:
uint32_t doHash(uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;   /* running total of the input bytes */
    for (uint32_t x = 0; x < len; ++x)
    {
        hash = hash + in[x];
    }
    return hash;
}


(And the above algorithm might be wrong. I am still thinking it over.)


The algorithm isn't wrong, but it's a bad hashing algorithm. Given random inputs, each 257 bytes in length, the hash will never come above 65535 (257 × 255). Give it any permutation of an input and it'll collide; lower one byte by 1 and raise another by 1 and you'll have yet another collision. For example, try building a hash table containing "abc", "acb", "bac", "bca", "cab" and "cba" with this algorithm: every key hashes to the same value. Then try again after building one along the lines described below.
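To see it concretely, here's a small self-contained check (the driver code is just for illustration; the hash is the one quoted above):

Code:
#include <stdio.h>
#include <stdint.h>
#include <string.h>

/* the quoted sum-of-bytes hash */
static uint32_t doHash(const uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;
    for (uint32_t x = 0; x < len; ++x)
        hash += in[x];
    return hash;
}

int main(void)
{
    const char *perms[] = { "abc", "acb", "bac", "bca", "cab", "cba" };
    for (int i = 0; i < 6; ++i)
        printf("%s -> %u\n", perms[i],
               doHash((const uint8_t *)perms[i], (uint32_t)strlen(perms[i])));
    /* all six print 294: every permutation of the same bytes collides */
    return 0;
}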

You make a hash algorithm by working out which bits of your input hold the most entropy, and by making sure that entropy is spread across as many bits of the output as possible. If, for instance, your input is only ASCII digits, the top four bits hold no entropy at all, since they're always "0011". You won't gain anything there: E(7) = E(6) = E(5) = E(4) = 0.

The fourth bit (bit 3) does hold entropy: given a truly random digit between 0 and 9, it has a 0.2 chance of being 1 and 0.8 of being 0. That's not a full bit of entropy, but it's something. The third bit (bit 2) has a 0.4/0.6 split between 1 and 0, so it carries more, though still not a full bit, and the second bit (bit 1) has the same 0.4/0.6 split. The first bit (bit 0) is a full bit of entropy, since it's 0 or 1 with a chance of 0.5 each.
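If you want to check those figures, a throwaway loop like this (purely for illustration) counts how often each bit is set across the ASCII digits '0' to '9'; the constant bits carry no entropy regardless of whether they're always 0 or always 1:

Code:
#include <stdio.h>

int main(void)
{
    /* for each bit position, count how many of the ten ASCII digits have it set */
    for (int bit = 7; bit >= 0; --bit) {
        int set = 0;
        for (char c = '0'; c <= '9'; ++c)
            set += (c >> bit) & 1;
        printf("bit %d: set in %d of 10 digits\n", bit, set);
    }
    return 0;
}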

(less sure about the next)
So, one of these characters holds about 0.2 + 0.4 + 0.4 + 0.5 ≈ 1.5 bits of entropy by this rough count (the true figure for a random decimal digit is log2(10) ≈ 3.3 bits, since a partial bit is worth more than its raw probability). The top four bits don't contribute any, so ignore them.

Adding together two numbers that each hold N bits of entropy gives you about N+1 bits of entropy. Adding a bit without entropy to a number with entropy retains the entropy in the output. XORing also preserves entropy: XORing a bit with entropy into one without keeps it, and XORing two bits that both have entropy gives one bit whose entropy is at least as high as either of them (and, if both are nonzero, closer to a full bit).

So, the least useful thing you can do is just add them together, since that gives you only N+1 bits of entropy out of the 2N bits you put in. You could instead shift one of them left first, so that bits without entropy line up against bits with entropy and vice versa, which lets you keep all 2N bits of entropy.

At some point you'll have gathered so much entropy that the new, possibly entropic values no longer fit in the state you have. Dumping bits is a waste, so it's usually best to chop bits off the top of the state and mix them back into the remaining bits, spreading their entropy as widely as possible.
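As a sketch of that shape (just an illustration of shifting new bytes in and folding high bits back down, not a recommendation of this particular mix), the well-known shift-add-xor pattern looks like this:

Code:
#include <stdint.h>
#include <stddef.h>

/* shift the state, add the new byte, and XOR the result back into the state,
 * so high bits that would otherwise be lost get folded into the low end */
static uint32_t shift_add_xor(const uint8_t *in, size_t len)
{
    uint32_t hash = 0;
    for (size_t i = 0; i < len; ++i)
        hash ^= (hash << 5) + (hash >> 2) + in[i];
    return hash;
}

Unlike the sum-of-bytes hash, this one already distinguishes permutations of the same bytes.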

You can test for the first behaviour (using the input's entropy) by giving your algorithm a lot of small inputs to hash: if they collide, it's not using the entropy properly. You can test for the second behaviour (mixing the state) by giving it a lot of long inputs that are equal in the first N bytes and then padded with identical low-entropy bytes, say all zero bytes or all spaces. If those collide, you're either wasting entropy by shifting it out of the state, or failing to change the state at all on such near-empty input.
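A rough illustration of the first test (nothing rigorous, just counting distinct outputs over all two-byte inputs, here against the sum-of-bytes hash from the quote):

Code:
#include <stdio.h>
#include <stdint.h>

/* the sum-of-bytes hash again, repeated so this test stands alone */
static uint32_t doHash(const uint8_t *in, uint32_t len)
{
    uint32_t hash = 0;
    for (uint32_t x = 0; x < len; ++x)
        hash += in[x];
    return hash;
}

int main(void)
{
    /* hash every possible two-byte input and count the distinct outputs */
    static int seen[65536];
    unsigned distinct = 0;
    for (int a = 0; a < 256; ++a)
        for (int b = 0; b < 256; ++b) {
            uint8_t buf[2] = { (uint8_t)a, (uint8_t)b };
            uint32_t h = doHash(buf, 2);   /* never exceeds 510 for two bytes */
            if (!seen[h]++)
                ++distinct;
        }
    printf("%u distinct hashes out of 65536 two-byte inputs\n", distinct);
    return 0;
}

It prints 511, so 65536 inputs are crammed into 511 outputs: the entropy clearly isn't being used.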


People have done these kinds of things in the past and come up with a number of methods that produce fairly good output. One particularly useful tool for hashing is a CRC. A CRC is more or less based on the theory above: it shifts the new byte into the state in full (shifting the state left by 8 bits and adding in the new byte), and then uses the 8 bits shifted out (together with the leftover state) to distribute their entropy over all the other bits. Originally this is done bit by bit, but it can be collapsed into a byte-at-a-time operation by using a lookup table for the effect each byte has on the rest of the output. It's very fast for what it does, and very good.
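A sketch of the table-driven form, using the common reflected CRC-32 with polynomial 0xEDB88320 (this variant shifts right rather than left, but the idea is the same):

Code:
#include <stdint.h>
#include <stddef.h>

static uint32_t crc32_table[256];

/* precompute, for each possible byte, its effect on the rest of the state */
static void crc32_init(void)
{
    for (uint32_t i = 0; i < 256; ++i) {
        uint32_t c = i;
        for (int k = 0; k < 8; ++k)
            c = (c & 1) ? (c >> 1) ^ 0xEDB88320u : c >> 1;
        crc32_table[i] = c;
    }
}

/* one table lookup per input byte instead of eight bit-by-bit steps */
static uint32_t crc32(const uint8_t *in, size_t len)
{
    uint32_t crc = 0xFFFFFFFFu;
    for (size_t i = 0; i < len; ++i)
        crc = (crc >> 8) ^ crc32_table[(crc ^ in[i]) & 0xFF];
    return crc ^ 0xFFFFFFFFu;
}

crc32_init() has to run once before the first call; the table is exactly the lookup table for "the effect on the rest of the output" described above.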

For more information, buy a good discrete mathematics book and an introductory cryptography book. I can't vouch for the first but "Applied Cryptography" by Bruce Schneier is a good one for the second.


PostPosted: Thu May 31, 2007 5:06 pm
Member

Joined: Thu Jan 04, 2007 3:29 pm
Posts: 1466
Location: Noricum and Pannonia
Candy wrote:
"Applied Cryptography" by Bruce Schneier is a good one for the second.

Agreed. (Although I've only read half of it.)

I'm up to the rehashing, and was thinking of simply creating a new, larger table, going through the old one, and entering each item one by one. (I would then free the old table, of course.)
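Something along these lines, with placeholder names for my structures and hash function:

Code:
#include <stdlib.h>

/* placeholder structures, just to sketch the idea */
struct entry { void *key; void *value; struct entry *next; };
struct hashtable { struct entry **buckets; size_t size; };
extern size_t hash_key(const void *key);   /* whatever hash the table already uses */

void rehash(struct hashtable *ht)
{
    size_t new_size = ht->size * 2;                          /* grow the table */
    struct entry **new_buckets = calloc(new_size, sizeof *new_buckets);
    if (!new_buckets)
        return;                                              /* keep the old table on failure */

    for (size_t i = 0; i < ht->size; ++i) {                  /* walk the old buckets */
        struct entry *e = ht->buckets[i];
        while (e) {
            struct entry *next = e->next;
            size_t b = hash_key(e->key) % new_size;          /* re-enter each item one by one */
            e->next = new_buckets[b];
            new_buckets[b] = e;
            e = next;
        }
    }

    free(ht->buckets);                                       /* free the old bucket array */
    ht->buckets = new_buckets;
    ht->size = new_size;
}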

I'm thinking this is slow, however. Any recommendations?

Thanks!

_________________
C8H10N4O2 | #446691 | Trust the nodes.

