Hash sketches/Flajolet-Martin Sketches
Flajolet, P./Martin, G. (1985): Probabilistic counting algorithms for data base applications, in: Journal of Computer and System Sciences, Vol. 31, No. 2 (September 1985), pp. 182-209.
Durand, M./Flajolet, P. (2003): Loglog Counting of Large Cardinalities, in: Springer LNCS 2832, Algorithms ESA 2003, pp. 605–617.
Hash sketches are used to count the number of distinct elements in a set.
given:
- a bit array B[] of length l
- a (single) hash function h() that maps to [0,1,...2^l)
- a function r() that gives the position of the least-significant 1-bit in the binary representation of its input (e.g. 000101 returns 1, 001000 returns 4)
insertion of element x:
- pn := h(x) returns a pseudo-random number
- apply r(pn) to get the position of the bit array to set to 1 since output of h() is pseudo-random every bit i is set to 1 ~n/(2^(i+1)) times
number of distinct elements in the set:
- find the position p of the right-most 0 in the bit array
- p = log2(n), solve for n to get the number of distinct element in the set; the result might be up to 1.83 magnitudes off
usage:
- in Data Mining, P2P/distributed applications, estimation of the document frequency, etc.
Bloom filters
Bloom, H. (1970): Space/time trade-offs in hash coding with allowable errors, in: Communications of the ACM, Vol. 13, No. 7 (July 1970), pp. 422-426.
Bloom filters are used to test whether an element is a member of a set.
given:
- a bit array B[] of length m
- k different hash functions h_k() that map to [0,...,m-1], i.e. to one of the position of the m-bit array
insertion of element x:
- apply h_k to x (h_k(x)), for all k, i.e. you get k values
- set the resulting bits in the array B to 1 (if already set to 1, don't change anything)
check if y is already in the set:
- get the positions p_k to check using all the hash functions h_k (h_k(y)), i.e. for each function h_k you get a position p_k
- if one of the positions p_k is set to 0 in the array B, the element y is definitively not in the set
- if all positions given by p_k are 1, the element y might (!) be in the set
- false positive rate is approximately (1 - e^(-kn/m))^k, no false negatives are possible!
- by increasing the number of hashing functions, the false positive rate can be decreased; however, at the same time your bloom filter gets slower; the optimal value of k is k = (m/n)ln(2)
usage:
- in the beginning used as a cheap filter in databases to filter out elements that do not match a query
- various applications today, e.g. in Google BigTable, but also in networking for IP lookups, etc.