Question

I've got a requirement to encrypt personally identifiable information (PII) in an application DB. The application uses smart searches (sounds-like, name-root and partial-word searches) to find names and addresses quickly.

If we put encryption on those fields (with the PII encrypted at the application tier), search performance will degrade with the volume of records, because we can't rely on SQL in the normal way: the search engine (in the application) would have to read all the values, decrypt them and then do the searches.

Is there any easy way of solving this so we can always encrypt the PII data and also give our user base the fast search functionality?

We are using a PHP Web/App Tier (Zend Server) and a SQL Server DB. The application does not currently use technology like Lucene etc.

Thanks

Cheers


Solution

Encrypting the data makes it look a great deal like randomized bit strings. This precludes any operations that shortcut searching via an index.

For some encrypted data, e.g. Social Security numbers, you can store a hash of the number in a separate column, then index this hash column and search for the hash. This has limited utility, obviously, and is of no value in searches such as name LIKE 'ROB%'.
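As a rough sketch of the hash-column idea (not code from the asker's system): the snippet below uses Python with a keyed hash (HMAC-SHA256) instead of a bare hash, which is my own substitution, since a bare hash of a nine-digit number is easy to brute-force. The key, the function names and the dictionary standing in for the indexed column are all hypothetical.

```python
import hashlib
import hmac

# Hypothetical key for illustration only; a real one would come from a secret store.
INDEX_KEY = b"demo-only-index-key"


def normalize_ssn(ssn: str) -> str:
    """Keep only ASCII digits so '123-45-6789' and '123 45 6789' hash alike."""
    return "".join(ch for ch in ssn if ch in "0123456789")


def ssn_search_hash(ssn: str, key: bytes = INDEX_KEY) -> str:
    """Deterministic keyed hash: the same SSN always yields the same value."""
    digits = normalize_ssn(ssn).encode("ascii")
    return hmac.new(key, digits, hashlib.sha256).hexdigest()


# Stand-in for an indexed hash column: search hash -> row ID.
hash_index = {ssn_search_hash("123-45-6789"): 231}


def find_by_ssn(ssn: str):
    """Exact-match lookup; returns the row ID or None."""
    return hash_index.get(ssn_search_hash(ssn))
```

In a real schema the dictionary would be an indexed column next to the encrypted SSN, and the lookup would be an equality predicate on that column. It only ever answers exact-match questions.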

"If your database is secured properly" may sound nice, but it is very difficult to achieve if the bad guys can break in and steal your servers or backups. And if encryption truly is a requirement (not just a negotiable, marketing-driven item), you are forced to comply.

You may be able to negotiate storing partial data unencrypted, e.g., the first 3 characters of the last name or some such, so that you can still have useful (if not perfect) indexing.

ADDED

I should have added that you might be allowed to hash part of a name field, and search on that hash -- assuming you are not allowed to store partial name unencrypted -- you lose usefulness again, but it may still be better than no index at all.

For this hashing to be useful, it cannot be seeded per record -- i.e., all records must be hashed with the same seed (or no seed), or you will be stuck performing a table scan.
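To make the seeding point concrete, here is a small Python sketch (hypothetical names and key, and a three-character prefix chosen purely as an example). With one shared key, two names with the same prefix produce the same token, so an index on the token column can find them. With a fresh random salt per record, the tokens never line up and every row would have to be recomputed.

```python
import hashlib
import hmac
import os

# Hypothetical shared key for illustration; the same key is used for every record.
SHARED_KEY = b"demo-only-shared-key"


def name_prefix(last_name: str, length: int = 3) -> str:
    """Normalize and take the leading characters that will be hashed."""
    return last_name.strip().upper()[:length]


def shared_key_token(last_name: str, key: bytes = SHARED_KEY) -> str:
    """Same key for all rows: equal prefixes give equal, indexable tokens."""
    return hmac.new(key, name_prefix(last_name).encode("utf-8"), hashlib.sha256).hexdigest()


def per_record_token(last_name: str, salt: bytes) -> str:
    """Different salt per row: equal prefixes give unrelated tokens."""
    return hashlib.sha256(salt + name_prefix(last_name).encode("utf-8")).hexdigest()


# 'Roberts' and 'robinson' share the prefix ROB, so their tokens match.
same_bucket = shared_key_token("Roberts") == shared_key_token("robinson")

# With per-record salts the two tokens differ, so no index lookup is possible.
salted_a = per_record_token("Roberts", os.urandom(16))
salted_b = per_record_token("Robinson", os.urandom(16))
```

The trade-off is the usual one: the deterministic token is searchable precisely because it reveals which rows share a prefix.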

You could also create a covering index, still encrypted of course, so that the search scans the index rather than the table. That scan could be considerably quicker than a full table scan due to the reduced I/O & memory required.

OTHER TIPS

I'll try to write about this simply because often the crypto community can be tough to understand (I resisted the urge to insert a pun here).

A specific solution I have used which works nicely for names is to create index tables for things you wish to index and search quickly like last names, and then encrypt these index column(s) only.

For example, you could create a table where the key column contains one entry for every possible combination of characters A-Z in a 3-letter string (with blanks, shown here as underscores, allowed in all but the first position). Like this:

A__
AA_
AAA
AAB
AAC
AAD
..
..
..
ZZY
ZZZ
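A minimal Python sketch of how that key column could be generated, and how a name maps onto it. I am assuming the padding only ever trails the letters (A__, AA_, AAA), as in the sample rows above; the function names are made up for illustration.

```python
from itertools import product
from string import ascii_uppercase

PAD = "_"  # stands in for the blank shown in the listing above


def all_bucket_keys(width: int = 3) -> list:
    """Every key of 1..width letters A-Z, right-padded to a fixed width."""
    keys = []
    for n in range(1, width + 1):
        for letters in product(ascii_uppercase, repeat=n):
            keys.append("".join(letters).ljust(width, PAD))
    # Sort as if the pad were a space, so A__ comes before AA_ and AAA.
    return sorted(keys, key=lambda k: k.replace(PAD, " "))


def bucket_key_for(name: str, width: int = 3) -> str:
    """Map a name to its bucket: letters only, upper-cased, first `width` kept."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        raise ValueError("name contains no letters A-Z")
    return "".join(letters[:width]).ljust(width, PAD)


keys = all_bucket_keys()
```

Under that assumption the table has 26 + 26^2 + 26^3 = 18,278 rows, and `bucket_key_for("Smith")` lands in the `SMI` row.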

Then when you add a person to your database, you add their ID to a second column in this index table, which is just a list of person IDs.

Example: In your patients table, you would have an entry for Smith like this:

231    Smith    John    A     1/1/2016   .... etc

and this entry would be encrypted, perhaps all columns but the ID 231. You would then add this person to the index table:

SMH    [342, 2342, 562, 12]
SMI    [123, 175, 11, 231]

Now you encrypt this second column (the list of IDs). So when you search for a last name, you can type in 'smi' and quickly retrieve all of the people whose last names start with this letter combination. If you don't have the key, you will just see ciphertext. You can actually create two columns in such a table, one for first name and one for last name.
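Here is a rough Python sketch of the lookup flow just described, with an in-memory dictionary standing in for the index table. The class and function names are hypothetical. The cipher is passed in as a pair of callables because the Python standard library has no authenticated cipher; the `placeholder_*` functions below are NOT encryption and exist only so the sketch runs on its own. In practice you would plug in a vetted library there.

```python
import base64
import json
from typing import Callable, Dict, List


class EncryptedBucketIndex:
    """Bucket key (plaintext) -> encrypted JSON list of person IDs."""

    def __init__(self, encrypt: Callable[[bytes], bytes],
                 decrypt: Callable[[bytes], bytes]) -> None:
        self._encrypt = encrypt
        self._decrypt = decrypt
        self._rows: Dict[str, bytes] = {}

    @staticmethod
    def bucket_key(name: str) -> str:
        """First three letters, upper-cased, padded with '_' (e.g. 'smi' -> 'SMI')."""
        letters = [c for c in name.upper() if "A" <= c <= "Z"]
        return "".join(letters[:3]).ljust(3, "_")

    def _load(self, key: str) -> List[int]:
        blob = self._rows.get(key)
        return json.loads(self._decrypt(blob)) if blob is not None else []

    def add(self, last_name: str, person_id: int) -> None:
        """Decrypt the bucket's list, append the ID, re-encrypt and store."""
        key = self.bucket_key(last_name)
        ids = self._load(key)
        if person_id not in ids:
            ids.append(person_id)
        self._rows[key] = self._encrypt(json.dumps(ids).encode("utf-8"))

    def lookup(self, search: str) -> List[int]:
        """One keyed read plus one decrypt, instead of scanning every row."""
        return self._load(self.bucket_key(search))


# PLACEHOLDER codec, not encryption: swap in a real cipher before any real use.
def placeholder_encrypt(data: bytes) -> bytes:
    return base64.b64encode(data[::-1])


def placeholder_decrypt(blob: bytes) -> bytes:
    return base64.b64decode(blob)[::-1]


index = EncryptedBucketIndex(placeholder_encrypt, placeholder_decrypt)
index.add("Smith", 231)
index.add("Smithers", 123)
index.add("Smart", 342)
candidates = index.lookup("smi")
```

The IDs that come back are only candidates: for a longer search term such as 'smithe' the application would still decrypt those few patient rows and filter them, and a search shorter than three letters would have to read several buckets.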

This method is nearly as quick as a plaintext index (one keyed lookup plus a decryption of the ID list) and uses some of the same underlying principles. You can do the same thing with a soundex ('sounds like') by constructing a table with all possible soundex patterns as your left column, and person (patient?) IDs as another column. By creating multiple such indices you can develop a nice way to home in on the name you are looking for.
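For the soundex variant, the bucket key is simply the soundex code instead of the first three letters. The Python sketch below implements the common American Soundex rules as I understand them (H and W are skipped without breaking a run of equal codes, vowels do break it); SQL Server's built-in SOUNDEX may differ in edge cases, so treat this as an illustration rather than a drop-in match. The index dictionary and helper names are hypothetical.

```python
from collections import defaultdict

# Letter -> digit table for American Soundex.
_CODES = {}
for _letters, _digit in (("BFPV", "1"), ("CGJKQSXZ", "2"), ("DT", "3"),
                         ("L", "4"), ("MN", "5"), ("R", "6")):
    for _ch in _letters:
        _CODES[_ch] = _digit


def soundex(name: str) -> str:
    """Return the 4-character soundex code, or '' if the name has no letters."""
    letters = [c for c in name.upper() if "A" <= c <= "Z"]
    if not letters:
        return ""
    result = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "HW":
            continue  # H and W do not separate letters with the same code
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev to '', allowing a repeat afterwards
    return (result + "000")[:4]


# Soundex code -> list of person IDs (this list is what would be encrypted).
soundex_index = defaultdict(list)


def add_person(last_name: str, person_id: int) -> None:
    soundex_index[soundex(last_name)].append(person_id)


add_person("Smith", 231)
add_person("Smyth", 175)
add_person("Robert", 11)
sounds_like_smith = soundex_index.get(soundex("Smithe"), [])
```

Because a soundex code is one letter plus three digits, the full key column is small (at most 26,000 rows), which keeps this index cheap compared with extending the prefix table by more letters.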

You can also extend to more characters if you like, but obviously this lengthens your table by more than an order of magnitude for each letter. It does have the advantage of making your index more specific (not always what you want). Truthfully, any type of histogram where you can bin the persons using their names will work. I have seen this done with date of birth as well, and it applies to anything you need to search on.

A table like this suffers from some vulnerabilities. In particular, because the ID lists for certain buckets may be very short or empty, it would be possible for an attacker to determine which name prefixes have few or no entries in the system. However, using a sort of random 'salt' in your index list can help with this. Other problems include the need to update all of your indices every time the underlying values get updated.

But even so, this method creates a nicely encrypted system that goes beyond data-at-rest encryption. Data-at-rest encryption only protects you from attackers who cannot gain authorization to your systems, but this system provides a layer of protection against DBAs and other personnel who may need to work in the database but do not need (or want) to see the personal data contained within. They will just see ciphertext. So, an additional key is needed by the users or systems that actually need/want to access this information. Ashley Madison would have been wise to employ such a tactic.

Hope this helps.

Sometimes, "encrypt the data" really means "encrypt the data at rest", which is to say that you can use Transparent Data Encryption to protect your database files, backups, and the like, while the data remains plainly viewable through querying. Find out if this would be sufficient to meet whatever regulations you're trying to satisfy; if so, that will make your job a whole lot easier.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow