Question

I have some data that comes regularily as a dump from a data souce with a string natural key that is long (up to 60 characters) and not relevant to the end user. I am using this key in a url. This makes urls too long and user unfriendly.

I would like to transform the string keys into integers with the following requirements:

The source dataset will change over time.

The ID should be:

  • non negative integer
  • unique and constant even if the set of input keys changes
  • preferrably reversible back to key (not a strong requirement)

The database is rebuilt from scratch every time so I can not remember the already assigned IDs and match the new data set to existing IDs and generate sequential IDs for the added keys.

There are currently around 30000 distinct keys and the set is constantly growing.

How to implement a function that will map string keys to integer IDs?

What I have thought about:

1. Built-in string.GetHashCode:

ID(key) = Math.Abs(key.GetHashCode())

  • is not guaranteed to be unique
  • (not reversible)

1.1 "Re-hashing" the built-in GetHashCode until a unique ID is generated to prevent collisions.

  • existing IDs may change if something colliding is added to the beginning of the input data set

2. a perfect hashing function

  • I am not sure if this can generate constant IDs if the set of inputs changes
  • (not reversible)

3. translate to base 36/64/??

  • does not shorten the long keys enough

What are the other options?

Was it helpful?

Solution

You can only do that if you can keep a list of assigned IDs.

For any give algorithm that actually gives you unique ID for the current set, any new value is not guaranteed to get a unique ID.

The strings contain about 400 bits of information, so to get an integer that is guaranteed to be unique it would have to contain all the information from the string and be about 400 bits. That's a 120 characters expressed as a decimal number so that's not shorter than what you have now.

OTHER TIPS

A Base64-encoded sha1sum is 27 characters. base64(md5(...)) is 22 characters. Any smaller and you will have a non-negligible risk of collisions.

Perfect hashing functions aren't possible when the set of inputs changes.

Set up a second, persistant DB and store your KEY/ID pairs there. Make sure you also have the data's date in the table so you can do some house-keeping.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow
scroll top