Question

right now, we are looking for generating some unique and deterministic ID for some string value (file URL). Based on this link How to Create Deterministic Guids, looks like that we could create a GUID based on MD5 hash or Sha1 Hash (type 3 or type 5, see GUID wiki page). I did some search on internet as well, I think that it is pretty much same, basically generate a deterministic GUID based on hash.

It looks great when I first saw it, however I am still not comfortable to use it as key to identify something. I think that generally hash is used to check:

  1. whether 2 string matches without revealing the content of the original string
  2. whether some string/file content is changed

Here, even if there is some collision on hash value, it is not great, but it is ok and it would self correct itself once data is changed again and it won't overwrite other unrelated data. However if we use hash as a Primary key to identify some data, a collision would mean that we would override some unrelated data, there is no way to self correct once overwrite happens.

So it seems to me that we should use database to really generate the deterministic GUID here instead of relying on Hash:

  1. have a table at database with 2 columns: str_val, guid_val. str_val is the PK
  2. if we need to generate a guid for string1, we would try to find the record at the table
  3. If we could find a guid, we are done.
  4. If we could not find a guid, we do the insert logic. If insert is failed, most probably it is due to that other thread just inserts one, however it is probably ok since insert race should happen very rarely.

Just before I am going to post my question, I saw this stackoverflow post: How safe is it to rely on hashes for file identification?, it use the hash as the file identification where the accepted answer thinks that it is ok to use hash as the key. Again, I feel that I still need more convincing here for this.

If anyone could give more suggestions, it would be really appreciated.

Was it helpful?

Solution

Talking about GUIDs or UUIDs in this context is confusing.

The Key is either

  1. a function of the input (a hash, be it a "deterministic GUID" or otherwise) or;
  2. independent of the input, such as random GUID or auto-increment ID

All hashes for which the range (all possible outputs) is less than the domain (all possible inputs) will have collisions due to the Pigeonhole Principle. However, the idea is to use an appropriate hash for which it is improbable for a collision to occur, which is discussed in the linked questions.

Cryptographic hash functions also have additional goals, such as "it is infeasible to find two different messages with the same hash." (That is, even if someone was trying, it should still be impractical to generate a collision - MD5 fails here and is considered broken; and SHA-1 has been superseded by SHA-2.)

If the Key is independent of the data (e.g. is not a hash) then we might as well use an auto-increment ID - which is deterministic, albeit independent of the data, and guaranteed unique by the database - and call it a day.


So, using a Hash, the key is a form of a Natural Key that identifies data. This makes it possible to query the database with just the hash. And, if we trust that the client generated the hash from data on-hand then it can generally be assumed that the client has the data which results in the hash.

While using an "independent deterministic value", the Key is a Surrogate Key that identifies the tuple. In this case we need to query against the data to find the appropriate data if the "ID" is unknown. This of course requires that the client uses the original data in such queries.

Both of these approaches are valid in the appropriate context, and I use both in database design. (I generally prefer Candidate Keys to impose the multiplicity, but the result is the same.)


OTHER TIPS

You can use ticks to get a unique name from time. I have use both the tick and a random string, then concadenate both and obtain a unique string name for my files. Hope it helps

http://msdn.microsoft.com/en-us/library/system.datetime.ticks(v=vs.110).aspx

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top