Question

I'm looking for a good way to anonymize data in my database while retaining the capability of aggregating / summarizing statistical information.

As an example, let's say I want to track clicks by IP address per hour but I don't actually want to store the IP address.

My first thought is to store only a hash (e.g. SHA-256) of the IP. However, I'm not sure this provides sufficient security. If an attacker got hold of our database and was determined to reverse our anonymization, they could generate a rainbow table of IPs and get back the real IP info fairly easily.
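To illustrate the concern, here is a minimal sketch of reversing an unsalted hash by enumeration. The helper names `sha256_hex` and `reverse_ip_hash` are made up for this example, and the search is limited to one /16 network to keep it quick; the full IPv4 space is only 2^32 addresses, so the same loop scales to it without much trouble.

```python
import hashlib
import ipaddress
from typing import Optional


def sha256_hex(text: str) -> str:
    """Unsalted SHA-256 of a string, as hex."""
    return hashlib.sha256(text.encode("ascii")).hexdigest()


def reverse_ip_hash(target_hash: str, network: str) -> Optional[str]:
    """Try every address in `network` until one hashes to `target_hash`."""
    for addr in ipaddress.ip_network(network):
        if sha256_hex(str(addr)) == target_hash:
            return str(addr)
    return None


# A hash as it might appear in a leaked table (hypothetical example value).
leaked = sha256_hex("192.168.1.10")
recovered = reverse_ip_hash(leaked, "192.168.0.0/16")
```

No precomputed table is even needed here: hashing the 65,536 candidates on the fly is enough.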

My next thought was to add a static prefix to the IP before hashing (e.g. 192.168.1.10 becomes MY_SECRET_STRING-192.168.1.10). Of course, if the attacker finds the static prefix then it is essentially useless.

I've been searching for sound solutions to this problem and I haven't found anything I really like so far. Are there any well known methods for anonymizing data like this?

Solution

If someone has access to both your salt and your database, I would say it's almost impossible (if not impossible) to keep them from building some sort of lookup table and "cracking" your hashes. The only option you have is to make their job hard and expensive.

Using a static salt is a bad idea, though, since the whole point of a salt is to prevent an attacker from generating one rainbow table for all your records. Uniqueness is what makes a salt a good salt: its purpose is to make each hash unique even when the original content is the same as in another record, which forces an attacker to brute-force each row to figure out its content. It is also worth noting that salts don't need to be secret, so you can just store the salt in an additional column.

There is this nice article about salting and hashing if you have any doubt about the topic.

The problem with the described approach is that in the end, just like an attacker, you won't be able to tell which of the rows are the same IPs.
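As a rough sketch of that trade-off, assuming a 16-byte random salt stored next to each digest (the function names `salted_hash` and `matches` are made up for this example), two rows for the same IP end up with different digests:

```python
import hashlib
import os
from typing import Tuple


def salted_hash(ip: str) -> Tuple[bytes, str]:
    """Return (salt, digest) for one row; the salt is stored alongside the digest."""
    salt = os.urandom(16)
    digest = hashlib.sha256(salt + ip.encode("ascii")).hexdigest()
    return salt, digest


def matches(ip: str, salt: bytes, digest: str) -> bool:
    """Check a candidate IP against one stored row."""
    return hashlib.sha256(salt + ip.encode("ascii")).hexdigest() == digest


# Two clicks from the same IP produce two unrelated-looking rows.
row_a = salted_hash("192.168.1.10")
row_b = salted_hash("192.168.1.10")
```

Because `row_a` and `row_b` no longer share a value, a `GROUP BY` on the digest column cannot count clicks per IP, even though each row can still be checked individually with `matches`.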

One potential solution, if you really need to implement this, is to have a table where you store the IPs and click counts, and then run a process every hour that anonymizes the data by replacing all the IPs/hashes from the last hour with a good random value. In the end this means you will only be able to group the clicks per hour without knowing the actual IP, but please note two things:

  1. Although an attacker will never be able to figure out the past data, you will have 1 hour's worth of data that is not anonymized at any given time. This means an attacker could "spy" on you and store this information over time, which could become a much bigger problem than "we just leaked 1 hour's worth of data".
  2. You won't be able to recognize the same IP across hours. For example: if IP 127.0.0.1 did 3 clicks from 17:00 to 18:00 and the same IP did 6 clicks from 18:00 to 19:00, you wouldn't be able to tell that 127.0.0.1 did 9 clicks from 17:00 to 19:00.
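The hourly replacement step could look roughly like the following sketch. It uses an in-memory `Counter` in place of a real table, and `anonymize_hour` is a hypothetical name; in practice this would be an update against the database run by a scheduled job.

```python
import secrets
from collections import Counter
from typing import Dict


def anonymize_hour(clicks_by_ip: Dict[str, int]) -> Dict[str, int]:
    """Replace each IP with a fresh random token, keeping only the counts."""
    return {secrets.token_hex(16): count for count in clicks_by_ip.values()}


# Clicks collected during the current hour, keyed by the real IP.
current_hour = Counter()
for ip in ["127.0.0.1", "127.0.0.1", "127.0.0.1", "10.0.0.5"]:
    current_hour[ip] += 1

# At the end of the hour: archive under random tokens, then drop the real IPs.
archived = anonymize_hour(current_hour)
current_hour.clear()
```

The tokens are generated independently each hour and have no relation to the IP, which is why past data cannot be reversed and also why the same IP cannot be linked across hours.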

Also, to make the hourly non-anonymized IPs a bit harder to crack, you could have a function that takes an IP, generates a unique salt, and caches that salt for that IP until the next hour, meaning that each IP would have its own unique salt every hour. This way the attacker would have to compute a new rainbow table for each row every hour, and you could still figure out which IP row to increment or create.
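One way to read that idea is the sketch below. The class `HourlySaltCache` and its `row_key` method are assumptions made for illustration, not an established pattern, and the cache itself necessarily holds the plain IPs in memory for the current hour.

```python
import hashlib
import os
from typing import Dict, Optional


class HourlySaltCache:
    """Keeps one random salt per IP and forgets all of them when the hour changes."""

    def __init__(self) -> None:
        self._hour: Optional[int] = None
        self._salts: Dict[str, bytes] = {}

    def row_key(self, ip: str, hour: int) -> str:
        """Return the salted hash used to find or create this IP's row for `hour`."""
        if hour != self._hour:
            # New hour: drop every cached salt so old rows can no longer be matched.
            self._hour = hour
            self._salts = {}
        if ip not in self._salts:
            self._salts[ip] = os.urandom(16)
        return hashlib.sha256(self._salts[ip] + ip.encode("ascii")).hexdigest()


cache = HourlySaltCache()
key_17_first = cache.row_key("127.0.0.1", hour=17)
key_17_again = cache.row_key("127.0.0.1", hour=17)  # same row, so increment it
key_18 = cache.row_key("127.0.0.1", hour=18)        # new salt, so a new row
```

Within an hour the key is stable, so the row can be incremented; once the hour rolls over, the salt is gone and the old row can only be matched by brute-forcing it against its stored salt, if one is kept.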

Other tips

Why, yes, there are. The most well-known is called "salting". Basically, instead of adding a static string to all of the plaintexts, you add a unique string to each one. This string is randomly or algorithmically generated and stored separately. It doesn't make a single hash any harder to crack, but it prevents the use of precomputed tables to crack multiple hashes. See the Wikipedia article on Salt (cryptography).

That being said, I think that a one-way hash of the IP is sufficient. An attacker would have to crack each IP address. No matter what method you use, once an IP is cracked then all of the records for that IP will be exposed. But cracking one IP doesn't help with any of the others.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow