Question

I am considering using a Cuckoo filter for a business case. To simplify the explanation, here is an analogy for my needs:

  1. There are over $n = 30\,000$ first names in the whole world
  2. I have a Cuckoo filter storing the names of people who gave me a gift at least once
  3. I am certain that the Cuckoo filter will not have to store more than 600 names
  4. The names the Cuckoo filter will store are evenly distributed across the 30 000 names
  5. Of course, the final goal is to be able to know which names gave me a gift at least once; I might have to query for any of the 30 000 names.

I would like to design my filter for 600 elements; however, the original paper about Cuckoo filters assumes that the number of buckets $m$ is a multiple of $n$:

> Now consider a construction process that inserts $n$ random items to an empty table of $m = cn$ buckets for a constant $c$
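
To make that convention concrete, here is a minimal Python sketch of the paper's sizing; $b = 4$ entries per bucket and a ~95% load factor are assumptions taken from the paper's experiments, and rounding the bucket count up to a power of two is what the XOR-based partial-key indexing typically requires:

```python
import math

# A minimal sizing sketch; b = 4 and alpha = 0.95 are assumptions
# taken from the paper's experiments, not requirements.
b, alpha = 4, 0.95

for n in (600, 30_000):  # items actually stored vs. the whole name universe
    m = 2 ** math.ceil(math.log2(n / (b * alpha)))  # power-of-two bucket count
    print(f"n = {n:6d}: m = {m:5d} buckets (~{int(m * b * alpha)} usable slots)")
```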

Am I supposed to size the filter according to the set of all names in the world? Is there something I am missing? This is not addressed in the original publication.

Constraints

Size: The Cuckoo filter will be transmitted over the network, but asynchronously (say, an update every 2 hours), so this should be factored into the size constraint.

Time: The real constraint is time. When I query the filter, I would like a response in less than 5 seconds.

False positives: I also hope for a false positive rate of at most 20%.

Security: In my example I store names, but in reality I will store sensitive data that must be protected and anonymized under the GDPR. After a statistical analysis, I concluded that the data's entropy is too low to store it as a list of hashes.

Other options: I'm open-minded by nature, so I always welcome other options.

Attempt at a self-answer

Considering that the fingerprint is $f$ bits long and that I have $m$ buckets, the probability that the name Bob has the same features (bucket index and fingerprint) as the name Alice is:

$$ \frac{1}{2^f} \cdot \frac{1}{m} $$

Since each name has two candidate buckets, the probability that a specific other name collides with the name Anna is $\frac{2}{2^f \cdot m}$, which means that the probability that there is another name colliding with the name Anna is:

$$ (n - 1) \cdot \frac{2}{2^f \cdot m} $$

With 30 000 names, a fingerprint of 8 bits, and 600 buckets, the result is $0.39$, which means a false positive rate of 39%.

To reduce the false positive rate I can either increase the fingerprint size or the number of buckets. Using 6 000 buckets instead of 600 gives me a false positive rate of 3.9%. This hack might work in my example case, but in reality we are talking about $10^{19}$ names in the whole world and $10\,000$ names to store in the filter.
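
To sanity-check the arithmetic, here is a small Python sketch of the estimate above (my own numbers; the factor 2 accounts for the two candidate buckets):

```python
# Sanity check of the union-bound estimate above.
f = 8            # fingerprint bits
n = 30_000       # names in the whole world

for m in (600, 6_000):                  # buckets
    p_pair = 2 / (2**f * m)             # one given other name collides
    print(f"m = {m:5d}: estimated FPR ~ {(n - 1) * p_pair:.3f}")
# m =   600: estimated FPR ~ 0.391
# m =  6000: estimated FPR ~ 0.039
```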

It seems to me that Cuckoo filters were not designed with this use case in mind: when people use a Cuckoo filter, they expect to store almost every existing item in it at some point.


Solution

As discussed in the comments, it turns out that, yes, I need to size my Cuckoo filter for the set of every existing name. That said, it is not much of a problem in the end if I only send the buckets that contain fingerprints, which shouldn't be more than 600.
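
For illustration, a sparse encoding could look like the following sketch; the layout, field widths, and function names are my own, not from any particular library. At 5 bytes per stored fingerprint, 600 entries come to about 3 KB no matter how many buckets are empty:

```python
import struct

# Hypothetical sparse encoding: one (bucket index, fingerprint) pair
# per stored item; 4-byte index + 1-byte fingerprint (for f = 8 bits).

def serialize_sparse(buckets):
    """buckets: list of lists of fingerprints (ints in 0..255)."""
    out = bytearray()
    for index, bucket in enumerate(buckets):
        for fingerprint in bucket:
            out += struct.pack(">IB", index, fingerprint)
    return bytes(out)

def deserialize_sparse(data, num_buckets):
    buckets = [[] for _ in range(num_buckets)]
    for offset in range(0, len(data), 5):  # 5 bytes per pair
        index, fingerprint = struct.unpack_from(">IB", data, offset)
        buckets[index].append(fingerprint)
    return buckets
```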

In addition to that, my attempt at a self-answer was very wrong. The probability that there is another name colliding with the name Anna is actually:

$$P\left(\bigcup_{i=1}^{n} A_i\right) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le i_1 < i_2 < \cdots < i_k \le n} P\left(A_{i_1} \cap A_{i_2} \cap \cdots \cap A_{i_k}\right)$$

This is the inclusion-exclusion principle. Because of my hypotheses (the events $A_i$ are independent and each has the same probability $p$), the probability simplifies to:

$$P\left(\bigcup_{i = 1}^n A_i\right) = 1 - (1 - p)^n$$
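
With the example's numbers (and $p = 2/(2^f \cdot m)$ as before), a quick sketch comparing the exact value with my earlier naive sum:

```python
f, m, n = 8, 600, 30_000
p = 2 / (2**f * m)        # per-name collision probability (two buckets)

exact = 1 - (1 - p) ** n  # inclusion-exclusion under independence
naive = n * p             # the earlier estimate, which over-counts overlaps
print(f"exact ~ {exact:.3f}, naive sum ~ {naive:.3f}")
# exact ~ 0.323, naive sum ~ 0.391
```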

As highlighted in the original paper, we also have to take into account the bound $2b/2^f \le \epsilon$, where $b$ is the bucket size and $\epsilon$ the target false positive rate.
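
That bound gives a minimum fingerprint size directly; for instance, a sketch assuming $b = 4$ and my 20% target:

```python
import math

b, eps = 4, 0.20                           # bucket size, target FPR
f_min = math.ceil(math.log2(2 * b / eps))  # rearranged from 2b / 2**f <= eps
print(f_min)                               # 6 bits: 2*4 / 2**6 = 0.125 <= 0.20
```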

Licensed under: CC-BY-SA with attribution
Not affiliated with cs.stackexchange