2w ago

The probability of a hash collision

kevingal.com

13 comments

This is valid on a single unlisted assumption: the hash function has an equal distribution. If your hash function ends by multiplying the hash value by 4, for example, your number of possible boxes is 1/4 of what you'd otherwise expect from the size of the hash output.
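
To put rough numbers on that, here is a minimal sketch using the standard birthday approximation p ≈ 1 − exp(−k(k−1)/(2N)) for k items in N equally likely boxes. The function name, the 32-bit width, and the item count are assumptions picked for illustration, not values from the article.

```python
import math

def collision_probability(k: int, n_boxes: int) -> float:
    """Approximate chance that at least two of k items share a box,
    assuming every one of the n_boxes is equally likely."""
    # -expm1(-x) equals 1 - exp(-x) but stays accurate for tiny x.
    return -math.expm1(-k * (k - 1) / (2 * n_boxes))

bits = 32    # assumed output width
k = 10_000   # assumed number of hashed items

full = collision_probability(k, 2**bits)          # every output reachable
reduced = collision_probability(k, 2**bits // 4)  # only 1/4 of outputs reachable

print(f"all {bits}-bit outputs reachable: {full:.4f}")
print(f"a quarter of them reachable:  {reduced:.4f}")
```

Plugging in the nominal output width when only a quarter of the outputs can occur understates the collision probability.
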
- The distribution is super important here too. Hashing any value to zero (or h(x) = 0) is valid, but a terrible distribution. The challenge is getting real-world values hashed in a mostly uniform distribution to avoid collisions where possible.
  Still, the contents of the article are useful even outside of hashing. It should just disclaim that the width of the output isn't the only thing important in a hash function.
- Of course. That's one of the basic requirements for something to be an actual hash function.
  
  Not only is md5sum not proven to have equal distribution, it is specifically known to not have equal distribution, only nearly equal distribution.
  A hash function is any function that deterministically converts an input of arbitrary size into an output of a specific size. There are no other requirements. A hash for a simple job could just add the ASCII values together and return the sum. Needless to say, that would not have an even distribution.
- The assumption is there though.
  Wouldn't multiplying the hash simply relabel the hash sites, since hashes not divisible by the factor would simply be inaccessible or nonexistent?
  
  The hashes not being there isn't particularly relevant within a hash function outputting a specific size. If your hash function always outputs 64 bits, for example, the fact that 3/4 of the values don't exist means you should be operating as if it's a 62-bit hash (2^64 / 4 = 2^62 boxes), not a 64-bit hash. If you still do this math based on the 64 bits of output (2^64 boxes), you'd arrive at inaccurate numbers.
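
Both points in the replies above can be checked on a toy scale. This is only a sketch: the 16-bit width is an assumption chosen so every output can be enumerated, and `times_four` and `ascii_sum_hash` are hypothetical helpers standing in for the multiply-by-4 step and the add-the-ASCII-values hash described in the thread.

```python
import math

OUTPUT_BITS = 16  # assumed small width so every output can be enumerated

def times_four(h: int) -> int:
    """Hypothetical final step: multiply by 4, keep the low OUTPUT_BITS bits."""
    return (h * 4) % (1 << OUTPUT_BITS)

def ascii_sum_hash(text: str) -> int:
    """Toy hash: sum of the character codes, reduced to the output width."""
    return sum(ord(c) for c in text) % (1 << OUTPUT_BITS)

# Feed every possible value through the final step and count what survives.
reachable = {times_four(h) for h in range(1 << OUTPUT_BITS)}
effective_bits = math.log2(len(reachable))
print(f"{len(reachable)} of {1 << OUTPUT_BITS} outputs are reachable "
      f"(about {effective_bits:.0f} effective bits)")

# Summing character codes ignores order, so anagrams always collide.
print(ascii_sum_hash("listen") == ascii_sum_hash("silent"))
```

The output width stays the same, but the number of boxes that matters for the collision math is the count of reachable outputs, and a factor of 4 costs exactly two bits of it.
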
