Understanding the effect the distribution of data has on hashing

Question

Usually you do not have as many input datasets as you have possible inputs. The distribution is therefore more of a propability, that a certain input with certain features will be picked. (essentially the same as you said, but with p<1 for every word instead of some count n>1) E.g. if you know, that the first bit of the input will always be 1, then the data is not uniformly distributed.

If your hash were very simple, eg. by only taking the first byte as 'hash', then this non-uniform distribution would lead to more collisions than anticipated. (only 128 values are possible even though you expected to get 256 different values)

Most (cryptographic) hash functions that you might know by name are good enough so that you do not have to care about this. For cryptography it is even an explicit condition: you must not be able to tell how many bits in the input changed just by looking at the difference of the hashes. That does not mean that it is impossible though. I can vaguely remember a paper stating an increased collision rate for md5 when only ascii letters and digits were hashed. I cannot find it right now, so enjoy this piece of information with care - but even if i have mixed up something, such a scenario is easily possible. And no matter whether it is md5 or some other algorithm, if you actually have such a relation, then certainly your distribution of input datasets is relevant again.