Question

The Count-Min Sketch is an awesome data structure for estimating the frequencies of different elements in a data stream. Intuitively, it works by picking several hash functions, one per table; each incoming element is hashed with every function, and the counter in the corresponding slot of each table is incremented. To estimate the frequency of an element, the Count-Min Sketch applies the hash functions to that element and takes the minimum value out of all the slots that are hashed to.
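For concreteness, here is a minimal Python sketch of the structure as described above. The class name, the `width`/`depth` parameters, and the particular hash family `h(x) = ((a*x + b) mod p) mod width` are illustrative assumptions, not taken from the paper.

```python
import random


class CountMinSketch:
    """Minimal Count-Min Sketch: `depth` rows of `width` counters each."""

    PRIME = (1 << 61) - 1  # a large Mersenne prime used as the modulus (assumption)

    def __init__(self, width, depth, seed=0):
        rng = random.Random(seed)
        self.width = width
        self.depth = depth
        self.table = [[0] * width for _ in range(depth)]
        # One (a, b) pair per row, defining h(x) = ((a*x + b) mod p) mod width.
        self.params = [
            (rng.randrange(1, self.PRIME), rng.randrange(self.PRIME))
            for _ in range(depth)
        ]

    def _slots(self, item):
        # Map the item to an integer key first; this relies on Python's hash().
        x = hash(item) % self.PRIME
        for row, (a, b) in enumerate(self.params):
            yield row, ((a * x + b) % self.PRIME) % self.width

    def add(self, item, count=1):
        # Increment one counter per row.
        for row, col in self._slots(item):
            self.table[row][col] += count

    def estimate(self, item):
        # Collisions only ever add to a counter, so the minimum is the
        # tightest available upper bound on the true count.
        return min(self.table[row][col] for row, col in self._slots(item))


sketch = CountMinSketch(width=64, depth=4)
for word in ["a", "b", "a", "c", "a"]:
    sketch.add(word)
print(sketch.estimate("a"))  # never less than the true count of 3
```

Because counts are only ever incremented, the estimate can overshoot but never undershoot; the question is how strong the hash family must be to bound the overshoot.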

The original paper on the Count-Min Sketch mentions that the data structure requires pairwise independent hash functions in order to get the necessary guarantees on its expected performance. However, looking over the structure, I don't see why pairwise independence is necessary. Intuitively, I would think that all that would be required would be that the hash function be a universal hash function, since universal hash functions are hash functions with low probabilities of collisions. The analysis of the collision probabilities in the Count-Min Sketch looks remarkably similar to the analysis of collision probabilities in a chained hash table (which only requires a family of universal hash functions, not pairwise independent hash functions), and I can't spot the difference in the analyses.

Why is it necessary for the hash functions in the Count-Min Sketch to be pairwise independent?

Thanks!


Solution

You are right: universal hashing suffices. Pairwise independence is a stronger property, but pairwise independent families are the usual way to construct a universal hash family. The paper also contrasts pairwise independence with the 4-wise independence required by previous methods, such as the AMS sketch.
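To see where the collision bound enters, here is a sketch of the standard error analysis. The notation is an assumption made for this illustration: $w$ is the row width, $d$ the number of rows, $a_y \ge 0$ the true count of element $y$, and $\hat a_x^{(j)}$ the counter that row $j$ reports for $x$. The only property of $h_j$ used is $\Pr[h_j(x) = h_j(y)] \le 1/w$ for $x \ne y$, which is exactly universality; the last line additionally assumes the $d$ hash functions are drawn independently of each other.

```latex
\begin{align*}
\hat a_x^{(j)} &= a_x + \sum_{y \ne x} a_y \,\mathbf{1}[h_j(x) = h_j(y)] \\
\mathbb{E}\bigl[\hat a_x^{(j)} - a_x\bigr]
  &= \sum_{y \ne x} a_y \Pr[h_j(x) = h_j(y)]
  \le \frac{\|a\|_1}{w} \\
\Pr\bigl[\hat a_x^{(j)} - a_x \ge \varepsilon \|a\|_1\bigr]
  &\le \frac{1}{\varepsilon w} \le \frac{1}{e}
  \quad \text{(Markov, with } w = \lceil e/\varepsilon \rceil \text{)} \\
\Pr\bigl[\min_j \hat a_x^{(j)} - a_x \ge \varepsilon \|a\|_1\bigr]
  &\le e^{-d} \le \delta
  \quad \text{(with } d = \lceil \ln(1/\delta) \rceil \text{)}
\end{align*}
```

Nothing in this chain requires the joint distribution of $(h_j(x), h_j(y))$ to be uniform, only that the two values rarely coincide.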

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange