Question

I want to implement of sort of Symbol the same way ruby does.

For this, I created a user defined literal which returned a std::hash of the std::basic_string<T> corresponding.

The code was great, but as I read somewhere the hash function may not be consistent over several executions of the same program. Moreover, I wanted to make this computation at compile-time, which was 1) not supported by std::hash and 2) would break code if std::hash return value changes.

So I wrote up the following implementation, based on java.lang.String.hashCode implementation.

typedef size_t symbol;

template<typename CharT>
constexpr size_t constant_hash(const CharT* p, size_t h = 0) noexcept
{
    return (*p == 0) ? h : constant_hash(p + 1, h * 31 + static_cast<size_t>(*p));
}

constexpr symbol operator "" _sym (const char* p, size_t n) noexcept
{
    return constant_hash(p);
}

My question is: are there any problem with this implementation?

I'm only able to test it on GCC 4.7.1, and I'd like to know if it is standard-compliant and should also work on other compiler.

I'm asking this because a previous implementation was working on GCC but caused segfault if the binary was compiled with clang++ (problem of undefined behavior with increment operators I think).

Thanks in advance

Edit

Work with clang++ (thanks KennyTM)

Was it helpful?

Solution

There's no UB, it works fine, as long as the string has '\0' termination. Note that constexpr evaluation cannot invoke UB; arithmetic or pointer operations that cause UB at runtime are required to produce a compile error in the context of a constant-expression.

Note that the static_cast is unnecessary; the char operand will be promoted to size_t.

Also, on first glance the hash function doesn't look very good because h * 31 is just ( h << 5 ) - h. You might pick a larger number with 1's randomly spread out across entire the binary value. But on the other hand, they might be trying to be clever since the low 5 bits of ASCII has the most entropy and this eliminates the possibility of collisions between short strings of different length.

OTHER TIPS

Note: n3333 is a proposal for C++17. While I don't believe there is a requirement for C++11 that hash produce the same result on multiple runs, in practice, I believe all current implementations do.

In the current active C++ standard, the definition of a hash function in general is written that way to allow more possibilities for domain-specific implementations, rather than to require that hashes be performed in a particular way. For example, it allows for the possibility of doing pooled strings and using the memory location of the pool instance as the hash value (incidentally, this is how ruby does its strings and hashes thereof, which has led to some interesting issues). If you're computing your hash on the data, rather than the reference, then the value will be stable - unless you've discovered some form of math where constant expressions aren't.

Basically, "may not" is, in this case, granting permission for things to behave in a particular way, as opposed to stating what might happen.

That said, if you are using std::hash, you cannot guarantee that the values will always be the same between executions (and in the future, if n3333 is adopted, any code which relies on that will break), so it is better to define your own stable hash function if you require stable hashing.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top