سؤال

I want to build a bloom filter in Clojure but I don't have much knowledge of all the hashing libraries that may be available to JVM based languages.

What should I use for the fastest (as opposed to most accurate) bloom map implementation in Clojure?

هل كانت مفيدة؟

المحلول

So the fun thing about bloom filters is that to work effectively they need multiple hash functions.

Java Strings already have one hash function built in that you can use - String.hashCode() with returns a 32-bit integer hash. It's an OK hashcode for most purposes, and it's possible that this is sufficient: if you partition this into 2 separate 16-bit hashcodes for example then this might be good enough for your bloom filter to work. You will probably get a few collisions but that's fine - bloom filters are expected to have some collisions.

If not, you'll probably want to roll your own, in which case I'd recommend using String.getChars() to access the raw char data, then use this to calculate multiple hashcodes.

Clojure code to get you started (just summing up the character values):

(let [s "Hello"
      n (count s)
      cs (char-array n)]
  (.getChars s 0 n cs 0)
  (areduce cs i v 0 (+ v (int (aget cs i)))))
=> 500

Note the use of Clojure's Java interop to call getChars, and the use of areduce to give you a very fast iteration over the character array.

You may also be interested in this Java bloom filter implementation I found on Github: https://github.com/MagnusS/Java-BloomFilter . The hashcode implementation looks OK at first glance but it uses a byte array which I think is a bit less efficient than using chars because of the need to deal with the character encoding overhead.

نصائح أخرى

Take a look at the Bloom Filter implementation in Apache Cassandra. This uses the very fast MurmurHash3 algorithm and combines two hashes (or two portions of the same hash, since upgrading to MurmurHash3 instead of MurmurHash2) in different ways to calculate the desired number of hashes.

The combinatorial generation approach is described in this paper

and here's a snippet from the Cassandra sourcecode:

    long[] hash = MurmurHash.hash3_x64_128(b, b.position(), b.remaining(), 0L);
    long hash1 = hash[0];
    long hash2 = hash[1];
    for (int i = 0; i < hashCount; ++i)
    {
        result[i] = Math.abs((hash1 + (long)i * hash2) % max);
    }

See also Bloomfilter and Cassandra = Why used and why hashed several times?

مرخصة بموجب: CC-BY-SA مع الإسناد
لا تنتمي إلى StackOverflow
scroll top