Question

I am trying to understand this sketch but am not able to. Correct me if I am wrong, but basically: let's say I have text data, i.e. words, and a hash function that takes a word and creates an integer hash, which I then convert to a binary bit vector. Right? Then I keep track of the first 1 I see from the left, and the position where that 1 is (say, k). The cardinality of this set is 2^k?

http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html

But say I have just one word, and the hash function happens to generate the hash 2^5 for it. Then I am guessing there are 5 trailing 0's, so it will predict a cardinality of 2^5? That doesn't sound right. What am I missing?
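
To make the setup concrete, here is a minimal Python sketch of the bookkeeping described above; the trailing_zeros helper and the hard-coded hash value are illustrative assumptions of mine, not code from the linked blog post:

```python
def trailing_zeros(x: int) -> int:
    """Count trailing zero bits of x (returns 0 for x == 0 in this toy version)."""
    if x == 0:
        return 0
    r = 0
    while x & 1 == 0:
        x >>= 1
        r += 1
    return r

h = 2 ** 5                # suppose the single word hashes to 0b100000
r = trailing_zeros(h)     # r = 5
print(r, 2 ** r)          # raw estimate 2**r = 32
```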


Solution

For a single word, the distribution of R (the number of trailing zero bits) is geometric with p = 1/2, so its standard deviation is sqrt((1 - p)/p^2) = sqrt(2) ≈ 1.41.

So for a word with a hash ending in 100000b the algorithm will, indeed, yield 2^5/0.77351 = 32/0.77351 ≈ 41.37. But the probability of that is only 1/64, which is consistent with the statement that the standard deviation of R is close to 1.
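
As a rough numerical check of the numbers above (the code and names are mine; phi = 0.77351 is the correction constant used in this answer):

```python
import math
import random

phi = 0.77351  # Flajolet-Martin correction constant used above

# Estimate for a hash ending in 100000b, i.e. R = 5:
print(2 ** 5 / phi)                    # ~41.37

# Standard deviation of R for a single element: R is geometric with p = 1/2,
# so Var(R) = (1 - p) / p**2 = 2 and sd = sqrt(2) ~ 1.41.
p = 0.5
print(math.sqrt((1 - p) / p ** 2))     # ~1.414

# Probability that a random hash ends in exactly 100000b (its low 6 bits):
trials = 10 ** 6
hits = sum((random.getrandbits(32) & 0b111111) == 0b100000 for _ in range(trials))
print(hits / trials, 1 / 64)           # both ~0.0156
```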

OTHER TIPS

http://ravi-bhide.blogspot.com/2011/04/flajolet-martin-algorithm.html

We had a good, random hash function that acted on strings and generated integers. What can we say about the generated integers? Since they are themselves random, we would expect:

1/2 of them to have their binary representation end in 0 (i.e. be divisible by 2),
1/4 of them to have their binary representation end in 00 (i.e. be divisible by 4),
1/8 of them to have their binary representation end in 000 (i.e. be divisible by 8), and so on.

Turning the problem around: if the hash function generated an integer ending in 0^m (m zero bits), then, intuitively, the number of unique strings is around 2^m.
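
A quick empirical check of those fractions and of the 2^m intuition, using Python's random bits as a stand-in for a good hash function (an assumption of mine):

```python
import random

def trailing_zeros(x: int) -> int:
    r = 0
    while r < 32 and x & 1 == 0:
        x >>= 1
        r += 1
    return r

n = 100_000                                    # number of distinct "strings"
hashes = [random.getrandbits(32) for _ in range(n)]

for m in (1, 2, 3):
    frac = sum(h % (2 ** m) == 0 for h in hashes) / n
    print(f"end in {m} zero bit(s): {frac:.3f}  (expected {2.0 ** -m})")

m = max(trailing_zeros(h) for h in hashes)
print("max trailing zeros:", m, "-> 2**m =", 2 ** m, "vs. actual count", n)
```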

What is really important to remember is that the Flajolet-Martin algorithm is meant to count the distinct elements (say M distinct elements) in a set of N elements, when M is expected to be very large.

There is no point in using the algorithm if N or M is small enough for us to store all distinct elements in memory.

In the case where N and M are really large, the probability that the estimate 2^k is close to the true count is actually very reasonable.

There is an explanation of this at http://infolab.stanford.edu/~ullman/mmds/ch4.pdf (page 143).
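
To put a rough number on "very reasonable", here is a small calculation of my own (under the idealized assumption of a uniformly random hash) of how often the single-hash estimate 2^R lands within a given factor of the true count M:

```python
def prob_R_le(r: int, M: int) -> float:
    # P(max trailing zeros over M independent uniform hash values <= r);
    # a single hash has >= r+1 trailing zeros with probability 2**-(r+1).
    return (1.0 - 2.0 ** -(r + 1)) ** M

def prob_within_factor(M: int, c: float) -> float:
    # Probability that the estimate 2**R falls in [M/c, c*M].
    total = 0.0
    for r in range(64):
        p_r = prob_R_le(r, M) - (prob_R_le(r - 1, M) if r > 0 else 0.0)
        if M / c <= 2 ** r <= c * M:
            total += p_r
    return total

for M in (10 ** 3, 10 ** 5, 10 ** 7):
    print(M, prob_within_factor(M, 2), prob_within_factor(M, 4))
```

A single hash only gets you into the right ballpark; the full algorithm tightens the estimate by combining many hash functions, which the linked chapter covers.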

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow