Alboin wrote:
Hey, fancy that. Another question.
If to get the index I do hash%table_size, when do I need to change my table size? (I'm using chaining to fix collisions.) Wouldn't all the elements have to fit in the table, because of the modulo?
Also, if I don't change size, then what would be a good table size?
Thanks.
That's because the hash table is in effect a mix between a pure linked list and a direct mapping, where the linked list is a hash table with one chain (all hashes are 0) and the direct map is one with, for N possible objects, N possible hashes, so there can't be a collision unless there is an object -> by definition no chains. Both have really bad cases and you're mostly trying to find the optimal case for your application:
Making it a linked list gives average lookup times of O(n), which is bad. Making it a direct map when there are huge gaps between objects, you get space complexity O(lots) which is even worse. If you fold the space complexity, you can reduce it to O(n) without raising its average amortized lookup time to beyond O(1) significantly. That requires keeping your chains short enough to not be related to N. That means, if your average chain length is too large, you should rehash. Inversely, if your table has lots of empty items, you should also rehash.
The concept of using primes is to ensure that any input will more or less distribute evenly over the bins. The exact definition is, every input will divide over the amount of bins divided by the GCD of the hash modulus and the average stride of the input values.
That means, if you have objects that occur at, say, every 16 bytes or a multiple of that (allocated a lot of 24-byte objects and stuffed them into a hashtable) where the hash table uses 32 bins with a simple modulus over the pointer, all objects (ALL) will end up in bin 0. If you have 64 bins, they'll end up in (64/32) bins, or 2. If you have 63 bins, even though it isn't prime (but it is coprime to 32), you'll use all 63 bins.