Question

I am writing a simple tool to find duplicate files (i.e. files containing the same data). I generate a hash for each file using the SHA-512 algorithm, and I store the hashes in a MySQL database.

My question is: what data type, size and collation should I use to store a SHA-512 hash in my database? The file hash (SHA-512) is going to be the primary key of the table. Should I use a binary data type (such as BLOB, VARBINARY(64) or BINARY) or VARCHAR(128)?

What else should I take care of to get the best performance?


Solution

You'll want to use BINARY(64): a SHA-512 hash is always exactly 64 bytes long, so there is no reason to use a variable-length type. You don't want a character data type either, because the hash is binary in nature (VARCHAR(128) would only be needed for the hex-encoded form, which takes twice the space). A binary column has no character set, so there is no collation to choose.
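To make the size argument concrete, here is a minimal Python sketch that computes a SHA-512 digest and compares its raw and hex-encoded lengths. The table name `file_hashes`, its columns, and the `%s` placeholder style are assumptions chosen for illustration and are not part of the original question; the SQL strings are only defined here, never executed.

```python
import hashlib


def sha512_file(path, chunk_size=1 << 20):
    """Return the raw SHA-512 digest of the file at `path`, read in chunks."""
    h = hashlib.sha512()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            h.update(chunk)
    return h.digest()


# The digest length does not depend on the input size.
raw = hashlib.sha512(b"example file contents").digest()
hex_text = raw.hex()

print(len(raw))       # raw digest length in bytes: fits BINARY(64)
print(len(hex_text))  # hex form is twice as long: would need CHAR(128)

# Hypothetical schema (names are assumptions, not from the question).
CREATE_TABLE_SQL = """
CREATE TABLE file_hashes (
    sha512 BINARY(64) NOT NULL,
    path   VARCHAR(1024) NOT NULL,
    PRIMARY KEY (sha512)
)
"""

# With a DB-API style MySQL driver, the raw bytes would typically be
# passed as a query parameter rather than as a hex string.
INSERT_SQL = "INSERT INTO file_hashes (sha512, path) VALUES (%s, %s)"
```

If the hash is only available as hex text, MySQL's `UNHEX()` converts it to the 64 raw bytes on insert, and `HEX()` converts it back for display.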

The fixed length and the absence of a character encoding and collation help performance: values are compared as plain bytes, with no collation rules to apply. If it is still too slow, try some generic MySQL optimizations or, if that doesn't help either, a smaller primary key.
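The answer does not say how to obtain a smaller primary key. One possible reading, sketched below in Python, is to keep only a prefix of the digest and store it in a narrower column such as BINARY(32). The function name `short_key` and the 32-byte default are assumptions for illustration. Truncation lowers collision resistance (a 32-byte prefix still leaves 256 bits), so treat this as a trade-off rather than a free optimization.

```python
import hashlib


def short_key(data: bytes, key_bytes: int = 32) -> bytes:
    """Return the first `key_bytes` bytes of the SHA-512 digest of `data`.

    Hypothetical helper: a shorter key means a smaller primary key,
    at the cost of a higher (though still tiny) collision probability.
    """
    if not 1 <= key_bytes <= 64:
        raise ValueError("key_bytes must be between 1 and 64")
    return hashlib.sha512(data).digest()[:key_bytes]


key = short_key(b"example file contents")
print(len(key))  # length of the truncated key in bytes
```

Another common option is an integer surrogate primary key with a unique index on the full BINARY(64) hash column; which one helps more depends on the workload and is best measured.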

Licensed under: CC-BY-SA with attribution