Hash function on a list, independent of the order of the items in it

https://stackoverflow.com/questions/8188877

04-03-2021

Question

I want to have a dictionary that assigns a value to a set of integers.

For example, the key is [1 2 3] and it maps to some value.

The thing is that [3 2 1] needs to be treated as the same key in my case, so its hash needs to be equal if I go with a hashing approach.

The set will have 2 to 10 items.

The sum of the items is usually fixed, so we cannot base the hash code on the sum, which is the first natural idea here.

Not a homework task; I am actually facing this problem in my code.

This set is basically an IEnumerable<int> in C#, so any data structure is fine for storing the items.

Any help appreciated. Performance is pretty important here too.

An immediate thought: we could sum up items^2 and already get some kind of better hash, but still I would like to hear some thoughts.

EDIT: Really sorry, everyone suggests ordering. It didn't occur to me to mention that sorting and then hashing is the solution I currently use, and I am looking for faster alternatives.

Solution

Basically all of the approaches here are instantiations of the same template. Map x₁, …, x_n to f(x₁) op … op f(x_n), where op is a commutative associative operation on some set X, and f is a map from items to X. This template has been used a couple of times in ways that are provably good.

Choose a random large prime p and a random residue b in [1, p - 1]. Let f(x) = b^x mod p and let op be addition. We essentially interpret a set as a polynomial and use the Schwartz–Zippel lemma to bound the probability of a collision (= the probability that a nonzero polynomial has b as a root mod p).

Let op be XOR and let f be a randomly chosen table. This is Zobrist hashing, and it minimizes in expectation the number of collisions by straightforward linear-algebraic arguments.

Modular exponentiation is slow, so don't use it. As for Zobrist hashing, with 3 million items, the table f probably won't fit into the L2 cache, though a table lookup does set an upper bound of one main-memory access.

I would instead take Zobrist hashing as a departure point and look for a cheap function f that behaves like a random function. This is essentially the job description of a non-cryptographic pseudorandom generator – I would try computing f by seeding a fast PRG with x and generating one value.
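
A minimal sketch of that idea, assuming 32-bit integer items. The mixing function is the finalizer step of MurmurHash3 (fmix32), used here only as a stand-in for "a cheap function that behaves like a random function". The names SetHash, Mix and Compute and the offset constant are illustrative and not part of the answer. Because op is XOR, an item that occurs twice cancels itself out, so this assumes genuine sets without duplicates.

```csharp
using System;
using System.Collections.Generic;

public static class SetHash
{
    // Finalizer step of MurmurHash3 (fmix32): a cheap bijective bit mixer.
    private static uint Mix(uint h)
    {
        unchecked
        {
            h ^= h >> 16;
            h *= 0x85ebca6bu;
            h ^= h >> 13;
            h *= 0xc2b2ae35u;
            h ^= h >> 16;
            return h;
        }
    }

    // Order-independent hash: XOR of the mixed items.
    public static int Compute(IEnumerable<int> items)
    {
        uint hash = 0;
        foreach (int x in items)
        {
            // Offset by an arbitrary constant so that the item 0 does not mix to 0.
            hash ^= Mix(unchecked((uint)x + 0x9E3779B9u));
        }
        return unchecked((int)hash);
    }
}
```

SetHash.Compute(new[] { 1, 2, 3 }) and SetHash.Compute(new[] { 3, 2, 1 }) return the same value, and no sorting or allocation is needed. Whether this mixer is good enough for a particular key distribution is something to measure rather than assume.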

EDIT: given that the sets all have the same sums, don't choose f to be a degree 1 polynomial (e.g., the step function of a linear congruential generator).
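
To spell out why, assume op is addition and f(x) = cx + d for some constants c and d (symbols chosen here for illustration). For a set of n items x_1, …, x_n with sum S:

```latex
\sum_{i=1}^{n} f(x_i) = \sum_{i=1}^{n} (c\,x_i + d) = c \sum_{i=1}^{n} x_i + n\,d = c\,S + n\,d
```

The result depends only on S and n, so all sets with the same size and the same sum would collide.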

OTHER TIPS

Use a HashSet<T> and HashSet<T>.CreateSetComparer(), which returns an IEqualityComparer<HashSet<T>>.
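
A short sketch of how that comparer can be plugged into a dictionary. The class and method names (SetKeyDemo, BuildTable, Lookup) and the stored string are made up for the example.

```csharp
using System;
using System.Collections.Generic;

public static class SetKeyDemo
{
    public static Dictionary<HashSet<int>, string> BuildTable()
    {
        // CreateSetComparer() compares HashSet<int> instances by content, ignoring order.
        var table = new Dictionary<HashSet<int>, string>(HashSet<int>.CreateSetComparer());
        table[new HashSet<int> { 1, 2, 3 }] = "value for {1, 2, 3}";
        return table;
    }

    public static string Lookup()
    {
        var table = BuildTable();
        // A set built in a different order finds the same entry.
        return table[new HashSet<int> { 3, 2, 1 }];
    }
}
```

This is the least code to write, but every lookup needs a HashSet<int> as the key, so it is worth measuring against the current sort-and-hash solution before assuming it is faster.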

I think what is mentioned in this paper would definitely help:

http://people.csail.mit.edu/devadas/pubs/mhashes.pdf

Incremental Multiset Hash Functions and Their Application to Memory Integrity Checking

Abstract: We introduce a new cryptographic tool: multiset hash functions. Unlike standard hash functions which take strings as input, multiset hash functions operate on multisets (or sets). They map multisets of arbitrary finite size to strings (hashes) of fixed length. They are incremental in that, when new members are added to the multiset, the hash can be updated in time proportional to the change. The functions may be multiset-collision resistant in that it is difficult to find two multisets which produce the same hash, or just set-collision resistant in that it is difficult to find a set and a multiset which produce the same hash.

I think your squaring idea is going in the right direction, but squaring is a poor choice of function. I'd try something more like a PRNG step function, or just multiplication by a large prime, followed by XOR of all the resulting values.

If the range of the values in the key happens to be limited to low-ish positive integers, you could map each one to a prime number using a simple lookup, then multiply them together to arrive at the hash value.

Using the example in the question:

[1, 2, 3] maps to 2 x 3 x 5 = 30
[3, 2, 1] maps to 5 x 3 x 2 = 30
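
A minimal sketch of the prime lookup, assuming the items are limited to 1..16. The table size, the class name PrimeProductHash and the method name Compute are assumptions made for this example.

```csharp
using System;
using System.Collections.Generic;

public static class PrimeProductHash
{
    // Hypothetical lookup table: item k (1-based) maps to the k-th prime.
    private static readonly ulong[] Primes =
        { 2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47, 53 };

    public static ulong Compute(IEnumerable<int> items)
    {
        ulong product = 1;
        foreach (int x in items)
        {
            if (x < 1 || x > Primes.Length)
                throw new ArgumentOutOfRangeException(nameof(items), "item outside the table range");
            // Multiplication is commutative, so order does not matter;
            // unchecked lets the product wrap around instead of throwing.
            product = unchecked(product * Primes[x - 1]);
        }
        return product;
    }
}
```

With this 16-entry table and at most 10 items, the product stays below 2^64 (53^10 is about 1.7 × 10^17), so by unique factorization different sets get different values. With a larger table the product can wrap around, and the result is then an ordinary hash that may collide.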

One possibility: sort the items in the list, then hash that.

You could sort the numbers, take a sample from predetermined indices, and pad with zeros if the current set has fewer numbers. Or you could XOR them, or whatever.

Why not something like

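using System.Collections.Generic;
public int GetOrderIndependentHashCode(IEnumerable<int> source)
{
    // Accumulate with unchecked arithmetic: Enumerable.Sum throws OverflowException when an int sum overflows.
    int hash = 0;
    foreach (int x in source)
    {
        int x2 = unchecked(x * x);
        hash = unchecked(hash + x2 + x2 * x + x2 * x2); // x^2 + x^3 + x^4
    }
    return hash & 0x7FFFFF; // keeps the low 23 bits, as in the original mask
}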

Create your own type that implements IEnumerable<T>.

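Override GetHashCode. In it, sort your collection and combine the hash codes of the sorted elements. Do not simply return ToArray().GetHashCode(): arrays in .NET use the default reference-based hash code, so two arrays with the same contents will generally hash differently.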
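
A minimal sketch of such a key type. The name IntSetKey and the 17/31 combining constants are illustrative, and the sketch leaves out the IEnumerable<T> implementation to stay short. It combines the sorted element values itself rather than relying on the array's own GetHashCode(), and it overrides Equals as well, since a dictionary needs both.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public sealed class IntSetKey : IEquatable<IntSetKey>
{
    private readonly int[] sorted;
    private readonly int hash;

    public IntSetKey(IEnumerable<int> items)
    {
        sorted = items.OrderBy(x => x).ToArray();
        // Combine the element values over the sorted order; the hash is
        // computed once and cached because the key is immutable.
        int h = 17;
        foreach (int x in sorted)
            h = unchecked(h * 31 + x);
        hash = h;
    }

    public bool Equals(IntSetKey other) =>
        other != null && sorted.SequenceEqual(other.sorted);

    public override bool Equals(object obj) => Equals(obj as IntSetKey);

    public override int GetHashCode() => hash;
}
```

This is essentially the sort-and-hash approach the question already uses, so it is a correctness baseline rather than a faster alternative.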

Licensed under: CC-BY-SA with attribution
