سؤال

Since they fill up and the percentage of false positives increase, what are some of the techniques used to keep them from saturating? It seems like you cannot empty out bits, since that would make an immediate negative on data stored in that node.

Even if you have a set of known size, in a data store using bloom filters like Cassandra what confuses me is that the data in a node is going to be added and removed, right? But when you remove a key you cannot set its bloom filter buckets to 0 since that might create a false negative for data in the node that hash to one or more same buckets as the removed key. So over time, it is as if the filter fills up

هل كانت مفيدة؟

المحلول

I think you need to set an upper bound on the size of the set that the bloom filter covers. If the set exceeds that size, you need to recalculate the bloom filter.

As used in cassandra, the size of the set covered by the bloom filter is known before creating the filter, so this is not an issue.

Another aproach is Scalable Bloom Filters

نصائح أخرى

The first thing you should realize is that bloom filters are only additive. There are some approaches to approximate deletion:

  • Rewriting the bloom filter
    • You have to keep the old data
    • You pay a performance price
  • A negative bloom filter
    • Much cheaper than the above, also helps deal with false positives if you can detect them.
  • Counting bloom filters (decrement the count)
  • Buckets
    • Keep multiple categorized bloom filters, discarding a category when it is no longer needed (e.g. 'Tuesday', 'Wednesday', 'Thursday',...)
  • Others?

If you have time-limited data, it may be efficient to use buckets, and discard filters that are too old.

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top