Question

What kind of hashing algorithm is used in the built-in HASH() function?

Ideally I'm looking for a SHA512/SHA256 hash, similar to what the SHA() function offers in the LinkedIn DataFu UDFs for Pig.


Solution

The HASH function (as of Hive 0.11) uses an algorithm similar to java.util.List#hashCode.

Its code looks like this:

int hashCode = 0; // Hive HASH uses 0 as the seed, List#hashCode uses 1. I don't know why.
for (Object item: items) {
   hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
}
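To make the effect of the seed concrete, here is a small self-contained sketch of the loop above. The class and method names (HiveStyleHash, hiveStyleHash) are made up for illustration, and it uses plain item.hashCode() exactly as in the snippet. Hive computes per-column hash codes with its own internals, so treat this as an approximation of the combining step, not a guaranteed reproduction of HASH() output for every type.

```java
import java.util.Arrays;
import java.util.List;

public class HiveStyleHash {
    // Combines item hash codes like the snippet above: seed 0, multiplier 31.
    // Assumption: item.hashCode() stands in for Hive's per-column hash.
    static int hiveStyleHash(List<?> items) {
        int hashCode = 0;
        for (Object item : items) {
            hashCode = hashCode * 31 + (item == null ? 0 : item.hashCode());
        }
        return hashCode;
    }

    public static void main(String[] args) {
        List<Object> row = Arrays.asList("a", 1, null);
        System.out.println(hiveStyleHash(row)); // 93248  (seed 0)
        System.out.println(row.hashCode());     // 123039 (List#hashCode, seed 1)
    }
}
```

For a list of n items the two results differ by 31^n (with int overflow), here 31^3 = 29791, so the seed only shifts the value by an amount that depends on the number of arguments.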

Basically it's a classic hash algorithm as recommended in the book Effective Java. To quote a great man (and a great book):

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.
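The shift-and-subtract identity from the quote is easy to check yourself. A minimal sketch (the class name ShiftIdentity is just for illustration); because int arithmetic in Java wraps around, the identity also holds when the multiplication overflows:

```java
public class ShiftIdentity {
    // True when 31 * i equals (i << 5) - i under Java's wrapping int arithmetic.
    static boolean holds(int i) {
        return 31 * i == (i << 5) - i;
    }

    public static void main(String[] args) {
        int[] samples = {0, 1, -1, 12345, Integer.MAX_VALUE, Integer.MIN_VALUE};
        for (int i : samples) {
            System.out.println(i + " -> " + holds(i));
        }
    }
}
```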

I digress. You can look at the HASH source here.

If you want SHA-256 or SHA-512 in Hive, you can call the Apache Commons Codec DigestUtils class through Hive's built-in reflect function (I hope that'll work):

SELECT reflect('org.apache.commons.codec.digest.DigestUtils', 'sha256Hex', 'your_string')
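To sanity-check what that query returns: sha256Hex gives the lowercase hex encoding of the SHA-256 digest of the string's UTF-8 bytes. Below is a rough JDK-only equivalent, assuming that behaviour. The class name Sha256HexSketch is hypothetical, and this is not the commons-codec source, just a way to compute the same value outside Hive.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class Sha256HexSketch {
    // Returns the SHA-256 digest of the UTF-8 bytes of input as lowercase hex.
    static String sha256Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] digest = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                // Mask to 0..255 so negative bytes print as two hex digits.
                hex.append(String.format("%02x", b & 0xff));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // Every Java platform is required to provide SHA-256.
            throw new IllegalStateException("SHA-256 not available", e);
        }
    }

    public static void main(String[] args) {
        // Standard test vector: prints
        // ba7816bf8f01cfea414140de5dae2223b00361a396177a9cb410ff61f20015ad
        System.out.println(sha256Hex("abc"));
    }
}
```

Swapping "SHA-256" for "SHA-512" gives the counterpart of DigestUtils.sha512Hex, which can be called through reflect in the same way.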

OTHER TIPS

As of Hive 2.1.0 there is a mask_hash function that will hash string values.

In Hive 2.x it uses MD5 as the hashing algorithm; this was changed to SHA-256 in Hive 3.x.

Licensed under: CC-BY-SA with attribution