C++: Suggestions about a hash function for a sequence of strings where the order of the strings is irrelevant

StackOverflow https://stackoverflow.com/questions/15741615

Question

Let's say you have these two sequences of strings:

abc cba bc

bc abc cba

I'm trying to create a mapping for such sequences (the sequence itself is also a string) so that the two sequences above are mapped into the same bucket.

My initial thought is to add together the results of a hash function applied to each string separately; that way their order won't matter. If I applied the hash function to the sequence string as a whole, then of course the hash result would depend on the order.

However, I'm very new to the world of string hash functions and I have no idea whether this approach would be efficient.

On this website, http://www.partow.net/programming/hashfunctions/index.html, I found many different implementations of string hash functions, but I'm not sure which one would be the "best" for my needs.

Some technical details: each string in a sequence won't have more than 25 characters, and each sequence won't have more than 3 strings.

Questions

1. Would this approach of adding the results of a string hash function applied to each string of the sequence work?

2. If yes, which string hash function should I use to get a low number of collisions while also being time efficient?

Thank you in advance

Solution

Just a demonstration of the idea (the string copying is very inefficient). The complexity is O(N log N), where N is the size of the key (which amounts to O(1) if your keys have a constant length known at compile time); I don't think you can do better than that:

#include <boost/functional/hash.hpp>
#include <string>
#include <algorithm>

std::size_t make_hash(
  std::string const& a,
  std::string const& b,
  std::string const& c)
{
    // Sort the strings first so that every permutation of the same three
    // strings produces the same hash, then hash the sorted range.
    std::string input[] = {a, b, c};
    std::sort(input, input + (sizeof(input)/sizeof(*input)));
    return boost::hash_range(input, input + (sizeof(input)/sizeof(*input)));
}

#include <iostream>
// g++ -I.../boost_1_47_0 string_set_hash.cpp
int main()
{
    std::cout << make_hash("abc", "bcd", "def") << std::endl; // 46247451276990640
    std::cout << make_hash("bcd", "def", "abc") << std::endl; // 46247451276990640
}
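
To plug this into an unordered container, one option is to store the three strings as the key and wrap the same sort-then-hash idea in a hasher (a sketch only; the Key, KeyHash and KeyEqual names below are just illustrative):

#include <boost/functional/hash.hpp>
#include <algorithm>
#include <array>
#include <iostream>
#include <string>
#include <unordered_set>

using Key = std::array<std::string, 3>;

// Hash a sorted copy of the key, so permutations collide on purpose.
struct KeyHash {
    std::size_t operator()(Key k) const {
        std::sort(k.begin(), k.end());
        return boost::hash_range(k.begin(), k.end());
    }
};

// Equality must ignore the order as well, otherwise lookups within a bucket fail.
struct KeyEqual {
    bool operator()(Key a, Key b) const {
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        return a == b;
    }
};

int main() {
    std::unordered_set<Key, KeyHash, KeyEqual> seen;
    seen.insert({"abc", "cba", "bc"});
    std::cout << seen.count({"bc", "abc", "cba"}) << std::endl; // prints 1
}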

A fragment of boost/functional/hash.hpp for reference:

template <class T>
inline void hash_combine(std::size_t& seed, T const& v)
{
    boost::hash<T> hasher;
    seed ^= hasher(v) + 0x9e3779b9 + (seed<<6) + (seed>>2);
}

template <class It>
inline std::size_t hash_range(It first, It last)
{
    std::size_t seed = 0;

    for(; first != last; ++first)
    {
        hash_combine(seed, *first);
    }

    return seed;
}

OTHER TIPS

Whatever hash function you pick, you want the operator used for the final combination of the individual hashes to be:

  • commutative
  • associative

The sum, the product, and the exclusive or come to mind as candidates for integral values. So yes, adding would work. You would still get collisions between unrelated sequences that need to be resolved, so you would also need a string comparison function, but permutations of the same set of strings would end up in the same bucket.
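
For instance, a minimal sketch of the additive combination using std::hash (the function name is only illustrative):

#include <functional>
#include <string>

// Sum the per-string hashes: addition is commutative and associative, so any
// permutation of the same three strings produces the same value (unsigned
// overflow simply wraps around, which is harmless here).
std::size_t order_insensitive_hash(const std::string& a,
                                   const std::string& b,
                                   const std::string& c)
{
    std::hash<std::string> h;
    return h(a) + h(b) + h(c);
}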

You could also reverse the order of the operations: combine the strings character-wise first (e.g. combining "ab" and "cba" gives ('a' + 'c')('b' + 'b')('\0' + 'a'); with a sum or a product you have to deal with carry propagation, so perhaps xor is the more interesting candidate here), and then apply a hash function to the result. You could even interleave the two operations while performing them (a sketch follows):

#include <algorithm>
#include <string>

// Incremental hashing step: reduce modulo a large prime, as described below
// (the multiplier and the prime here are just one possible choice).
int hash(int r, int k) {
    return (r * 131 + k) % 1000003;
}

int hash(const std::string& a, const std::string& b, const std::string& c) {
    int r = 0, k;
    std::size_t m = std::max(a.length(), std::max(b.length(), c.length()));
    for (std::size_t i = 0; i < m; i++) {
        // xor the i-th characters (positions past a string's end count as 0)
        k = (i < a.length() ? a[i] : 0) ^
            (i < b.length() ? b[i] : 0) ^
            (i < c.length() ? c[i] : 0);
        r = hash(r, k);
    }
    return r;
}

Here hash(r, k) is the incremental hashing function. A simple modulo against a prime that is large enough (i.e. larger than the expected size of the bucket array) should be alright for normal purposes.

A completely different (and better?) solution is to simply sort the sequence (3 entries means quasi-constant time), then build an ordered map whose comparison function treats each string as a "digit" of a 3-digit number. But this is out of the scope of the question.
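
For completeness, a sketch of that alternative, assuming the sequence is stored as three strings and the comparator sorts a copy of each key (the names are illustrative):

#include <algorithm>
#include <array>
#include <map>
#include <string>

using Key = std::array<std::string, 3>;

// Compare the sorted keys lexicographically: the smallest string acts as the
// most significant "digit", the next one as the second digit, and so on.
struct OrderInsensitiveLess {
    bool operator()(Key a, Key b) const {
        std::sort(a.begin(), a.end());
        std::sort(b.begin(), b.end());
        return a < b; // std::array compares element by element
    }
};

// std::map<Key, int, OrderInsensitiveLess> counts;
// counts[{"abc", "cba", "bc"}]++; // hits the same entry as {"bc", "abc", "cba"}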

I would hash each element individually.

Then sort those hashes. Sorting three size_t values is fast.

Then chain those hashes. Your library may have hash-chaining functions, or you can even use hash( a+b+c ) with overflow wrap.

Avoid xor, because xoring two identical hash values gives zero, and the hashes of identical strings are identical. So a naive xor can lead to (a, a, b) and (c, c, b) having the same hash output, which sucks.
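
A minimal sketch of this sort-then-chain approach, reusing the same mixing step as boost::hash_combine (the function name is only illustrative):

#include <algorithm>
#include <array>
#include <functional>
#include <string>

std::size_t hash_unordered(const std::string& a,
                           const std::string& b,
                           const std::string& c)
{
    std::hash<std::string> h;
    std::array<std::size_t, 3> hs = { h(a), h(b), h(c) };
    std::sort(hs.begin(), hs.end()); // the order of the inputs no longer matters

    // Chain the sorted hashes (the mixing step used by boost::hash_combine).
    std::size_t seed = 0;
    for (std::size_t v : hs)
        seed ^= v + 0x9e3779b9 + (seed << 6) + (seed >> 2);
    return seed;
}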

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow