Bloom filter usage

Question 1

Bloom filters are designed for situations where a false negative is a Very Bad Thing and a false positive is acceptable.

For example, suppose that you are making a web browser and have a known blacklist of scam websites. Your blacklist is massive - in the hundreds of gigabytes - so you can't ship it with the browser. However, you can store it on your own servers. In that case, you could ship the browser with a Bloom filter of an appropriate size that holds all the URLs. Before visiting a site, you look it up in the filter. Then, if you get a "no" answer, you're guaranteed that the URL is not blacklisted and can just visit the site. If you get a "yes" answer, the site might be evil, so you can have the browser call up your main server to get the real answer. The fact that you can save a huge number of calls to the server here without ever sacrificing accuracy is important.

The cache idea is similar to this setup. You can query the filter to see if the page is in the cache. If you get a "no" answer, you're guaranteed it's not cached and can do an expensive operation to pull the data from the main source. Otherwise, you can then check the cache to see if it really is there. In rare instances you might need to check the cache, see that it isn't there, then pull from the main source, but you will never accidentally miss something really in cache.

Hope this helps!

Question 2

Bloom filters may be useful when both of the following conditions are met:

A False Negative is unacceptable
Cost of lookup is expensive relative to the cost of lookup in a Bloom Filter

The first point is pretty simple. The second point generally becomes significant when a Bloom Filter can be stored in primary memory but the actual lookup may take database hits which can be very "costly" relative to performing some number of hashes on the key followed by in-memory lookup (ie. a Bloom filter).

If only one of the above criteria are met, then Bloom Filters are not the best solution to the problem.

Savings come when one can eliminate a costly lookup because it is known there is no possibility of getting a match. This is the value of the first point - Bloom Filters do not generate false negatives so if a match is not found in the filter there is no point going to the next, more costly step, of retrieving the data associated with the key.

When the filter has a hit then a costly lookup is required to verify the hit (eliminate false positive) and retrieve related data. Here there is a possibilty of not finding anything due to a false positive and this is why the filter needs to be tuned to minimize this risk to an acceptable level. A functional Bloom filter must have a low false positive rate so the overall cost of lookup remains low.

Now if, as you say, your cache is already optomized for fast lookup then the utility of a Bloom filter may be questionalble.

Question 3

the problem is that your example isn't so great.

in a web cache, if the url is not present, you have to make an expensive call to the net anyway, so saving a disk access is not a big deal. so you're right to question this (and Diego Basch's comment is not very well thought out, imho).

so i went looking for why you used that example. and it turns out that the wikipedia article mentions that the squid web cache uses bloom filters. but they're not used in the way you described. instead they're used to decide which cache to select from amongst a set of distributed caches. and they're used mainly to save space (because squid can cache a lot of objects, so these tables would otherwise be very large).

for more info on squid and bloom filters see http://wiki.squid-cache.org/SquidFaq/CacheDigests

otherwise, the other answer here, from templatetypedef, is fine - checking for bad sites is a much better example.