Bloom filter design

https://stackoverflow.com/questions/8776549

14-04-2021
|

سؤال

I wanted to know where I can find an implementation of the Bloom filter, with some explanation about the choice of the hash functions.

Additionally I have the following questions:
1) the Bloom filter is known to have false positives. Is it possible to reduce them by using two filters, one for used elements, and one for non-used elements (assuming the set is finite and known a-priori) and compare the two?
2) are there other similar algorithms in the CS literature?

المحلول

My intuition is that you'll get a better reduction in false positives by using the additional space that the anti-filter would have occupied to just expand the positive filter.

As for resources, the papers referenced for March 8 from my course syllabus would be useful.

نصائح أخرى

An Java implementation of the Bloom filter can be found from here. In case you cannot view it, I will paste the code in the following (with comments in Chinese).

import java.util.BitSet;

publicclass BloomFilter 
{
    /* BitSet初始分配2^24个bit */ 
    privatestaticfinalint DEFAULT_SIZE =1<<25; 
    /* 不同哈希函数的种子，一般应取质数 */
    privatestaticfinalint[] seeds =newint[] { 5, 7, 11, 13, 31, 37, 61 };
    private BitSet bits =new BitSet(DEFAULT_SIZE);
    /* 哈希函数对象 */ 
    private SimpleHash[] func =new SimpleHash[seeds.length];

    public BloomFilter() 
    {
        for (int i =0; i < seeds.length; i++)
        {
            func[i] =new SimpleHash(DEFAULT_SIZE, seeds[i]);
        }
    }

    // 将字符串标记到bits中
    publicvoid add(String value) 
    {
        for (SimpleHash f : func) 
        {
            bits.set(f.hash(value), true);
        }
    }

    //判断字符串是否已经被bits标记
    publicboolean contains(String value) 
    {
        if (value ==null) 
        {
            returnfalse;
        }
        boolean ret =true;
        for (SimpleHash f : func) 
        {
            ret = ret && bits.get(f.hash(value));
        }
        return ret;
    }

    /* 哈希函数类 */
    publicstaticclass SimpleHash 
    {
        privateint cap;
        privateint seed;

        public SimpleHash(int cap, int seed) 
        {
            this.cap = cap;
            this.seed = seed;
        }

        //hash函数，采用简单的加权和hash
        publicint hash(String value) 
        {
            int result =0;
            int len = value.length();
            for (int i =0; i < len; i++) 
            {
                result = seed * result + value.charAt(i);
            }
            return (cap -1) & result;
        }
    }
}

In terms of designing Bloom filter, the number of hash functions your bloom filter need can be determined as in here also refering the Wikipedia article about Bloom filters, then you find a section Probability of false positives. This section explains how the number of hash functions influences the probabilities of false positives and gives you the formula to determine k from the desired expected prob. of false positives.

Quote from the Wikipedia article:

Obviously, the probability of false positives decreases as m (the number of bits in the array) increases, and increases as n (the number of inserted elements) increases. For a given m and n, the value of k (the number of hash functions) that minimizes the probability is

It's very easy to implement a Bloom filter using Java 8 features. You just need a long[] to store the bits, and a few hash functions, which you can represent with ToIntFunction<T>. I made a brief write up on doing this from scratch.

The part to be careful about is selecting the right bit from the array.

public class BloomFilter<T> {

     private final long[] array;
     private final int size;
     private final List<ToIntFunction<T>> hashFunctions;

     public BloomFilter(long[] array, int logicalSize, List<ToIntFunction<T>> hashFunctions) {
         this.array = array;
         this.size = logicalSize;
         this.hashFunctions = hashFunctions;
     }

     public void add(T value) {
         for (ToIntFunction<T> function : hashFunctions) {
              int hash = mapHash(function.applyAsInt(value));
              array[hash >>> 6] |= 1L << hash;
         }
     }

     public boolean mightContain(T value) {
         for (ToIntFunction<T> function : hashFunctions) {
              int hash = mapHash(function.applyAsInt(value));
              if ((array[hash >>> 6] & (1L << hash)) == 0) {
                   return false;
              }
         }
         return true;
    }

    private int mapHash(int hash) {
         return hash & (size - 1);
    }


    public static <T> Builder<T> builder() {
         return new Builder<>();
    }

    public static class Builder<T> {
         private int size;
         private List<ToIntFunction<T>> hashFunctions;

         public Builder<T> withSize(int size) {
             this.size = size;
             return this;
         }

         public Builder<T> withHashFunctions(List<ToIntFunction<T>> hashFunctions) {
              this.hashFunctions = hashFunctions;
              return this;
          }

         public BloomFilter<T> build() {
              return new BloomFilter<>(new long[size >>> 6], size, hashFunctions);
         }
     }  
}

I think we should look at the application of Bloom Filters, and the secret is in the name, it is a Filter and not a data-structure. It is mostly used for saving resources by checking whether items are not part of a set. If you want to minimize false positives to 0, you will have to insert all the items that aren't apart of the set, since there are no false negatives for a well designed Bloom Filter, except that bloom filter would be gigantic and unpractical, might as well just store the items in a skip-list :) I wrote a simple tutorial on Bloom Filters if anyone is interested.

http://techeffigy.wordpress.com/2014/06/05/bloom-filter-tutorial/

مرخصة بموجب: CC-BY-SA مع الإسناد

لا تنتمي إلى StackOverflow