Domanda

So I've read the Wikipedia page on Hash functions as I'm currently playing with some. Both on that page and other sources I've read mention that the distribution of the data affects the hash function.

Despite some explanations it is still unclear to me what exactly those effects are and perhaps why. So my question:

  1. Just to make sure I've got it right, when they mention distribution is this the frequency of each word in the input data set?
  2. What effect does the distribution of input data have on hash functions? Of particular interest is, the performance of the hash function, in terms of both speed and uniformity of the output produced by the hash algorithm.

EDIT 1: I'm thinking specifically of the Wikipedia English corpus vs data from a more dynamic source, Twitter's tweets for example.

È stato utile?

Soluzione

Usually you do not have as many input datasets as you have possible inputs. The distribution is therefore more of a propability, that a certain input with certain features will be picked. (essentially the same as you said, but with p<1 for every word instead of some count n>1) E.g. if you know, that the first bit of the input will always be 1, then the data is not uniformly distributed.

If your hash were very simple, eg. by only taking the first byte as 'hash', then this non-uniform distribution would lead to more collisions than anticipated. (only 128 values are possible even though you expected to get 256 different values)

Most (cryptographic) hash functions that you might know by name are good enough so that you do not have to care about this. For cryptography it is even an explicit condition: you must not be able to tell how many bits in the input changed just by looking at the difference of the hashes. That does not mean that it is impossible though. I can vaguely remember a paper stating an increased collision rate for md5 when only ascii letters and digits were hashed. I cannot find it right now, so enjoy this piece of information with care - but even if i have mixed up something, such a scenario is easily possible. And no matter whether it is md5 or some other algorithm, if you actually have such a relation, then certainly your distribution of input datasets is relevant again.

Autorizzato sotto: CC-BY-SA insieme a attribuzione
Non affiliato a StackOverflow
scroll top