Question

I need to store several documents in the cloud, along with their metadata and the words/phrases used for searching them.

My plan is to use a symmetric cipher to encrypt each whole document, but I'm unsure of the right way to hash each word. I would like something secure, but I don't want to make each word unnecessarily longer.

Which implementation is most suitable for symmetrically encrypting a document, and what is the best way to hash a word or phrase without making it many times larger than it needs to be?

Solution

First, I suggest different tags. It sounds like you're really interested in offloading searching to a server in a cryptographically secure way (such that the server doesn't have access to the plaintext and such that the client need not transfer the entire index).

Issues:

  • An attacker being able to figure out which words are in the index (and which are not) could be an issue for you. State whether it is as part of your requirements.
  • An attacker being able to figure out which items in the index occur more frequently could be an issue for you. State whether it is as part of your requirements.
  • An attacker being able to associate words with a document could be an issue for you. State whether it is as part of your requirements.
  • An attacker may be able to subvert the server entirely and observe queries / retrievals. State your security needs in this circumstance as well.
  • Probably others I haven't thought of.

I'm assuming that you're designing your own scheme, but there is probably prior art, research, etc. that is smarter than what I suggest below:

For the first, I suggest hashing the words: combine the plaintext with a secret (not shared with the index server) before hashing, and truncate the hash to the point where it is likely to be non-unique in the index. This costs you some lookup efficiency, since truncated hashes produce false matches, but it helps prevent an attacker from using the hash as a plaintext equivalent or experimentally determining the secret.
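As a rough illustration of the keyed, truncated hash described above, here is a minimal Python sketch. It assumes HMAC-SHA256 as the way to combine the word with the secret, lowercasing as the normalization step, and a 4-byte truncation; the function name `index_token` and all of those choices are mine, not part of the original suggestion.

```python
import hashlib
import hmac


def index_token(word: str, secret: bytes, digest_bytes: int = 4) -> str:
    """Return a short keyed tag for a word (hypothetical helper).

    The secret stays on the client; the index server only ever sees the tag.
    """
    # Normalize so "Cloud" and " cloud " map to the same tag (an assumption).
    normalized = word.strip().lower().encode("utf-8")
    # HMAC mixes the secret into the hash, so the server cannot recompute tags.
    tag = hmac.new(secret, normalized, hashlib.sha256).digest()
    # Truncate on purpose: shorter tags are smaller and may collide in the index.
    return tag[:digest_bytes].hex()
```

With `digest_bytes=4` each word costs 8 hex characters in the index regardless of its length. Because tags can collide, the client would have to discard false matches after decrypting whatever the server returns.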

For the second and third, you should encrypt any indexed data (such as counts or document+position) and decrypt it on the client. This may cost you latency.

For the fourth, you'd want to consider concealing real requests inside groups of unrelated requests, things like that, but you'd want a lot of math to make sure you weren't still vulnerable to statistical analysis.
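To make the idea of concealing real requests concrete, here is a small Python sketch that hides one real index tag inside a shuffled batch of decoys. The function name `padded_query`, the decoy pool, and the group size are all hypothetical, and as noted above this alone does not protect against statistical analysis over many queries.

```python
import secrets


def padded_query(real_token: str, decoy_pool: list[str], group_size: int = 8) -> list[str]:
    """Return the real token mixed with randomly chosen decoys (hypothetical helper)."""
    # Remove duplicates and the real token so every entry in the batch is distinct.
    candidates = sorted(set(decoy_pool) - {real_token})
    if len(candidates) < group_size - 1:
        raise ValueError("decoy pool too small for the requested group size")
    rng = secrets.SystemRandom()  # OS-backed randomness rather than a seeded PRNG
    batch = rng.sample(candidates, group_size - 1) + [real_token]
    rng.shuffle(batch)  # the position in the batch should not reveal the real token
    return batch
```

The client would send the whole batch to the server and throw away the results for the decoys.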

For the fifth, do some web research. I'm confident there will be stuff out there, and this is a pretty specific (and less common) need, so you'll want someone who put more thought into it than I just have.

OTHER TIPS

Your requirements are mutually exclusive. That kind of metadata will leak a huge amount of information about the document content, to the point it can't be called secure.

Furthermore, encrypting individual words is futile. Breaking encryption is usually said to be as hard as finding the key, but this assumes the information content of the plaintext is greater than that of the key. For single words, that certainly isn't true.
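The low information content of single words is easiest to see in the simplest case, an unkeyed hash. The Python sketch below (function names and the tiny word list are made up for illustration) recovers words from their SHA-256 hashes by hashing every entry of a dictionary. A keyed scheme blocks this particular attack, but it stays deterministic, so equal words still produce equal values in the index.

```python
import hashlib


def unkeyed_hash(word: str) -> str:
    """Plain SHA-256 of a word, with no secret involved."""
    return hashlib.sha256(word.encode("utf-8")).hexdigest()


def dictionary_attack(stored_hashes: list[str], wordlist: list[str]) -> dict[str, str]:
    """Map each stored hash back to a word, if the word is in the attacker's list."""
    # Hashing an entire dictionary is cheap because there are so few candidates.
    lookup = {unkeyed_hash(w): w for w in wordlist}
    return {h: lookup[h] for h in stored_hashes if h in lookup}
```

For example, an index containing `unkeyed_hash("invoice")` is reversed as soon as the attacker's word list includes "invoice".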

Licensed under: CC-BY-SA with attribution