Question

I've been reading about deploying machine learning models and building pipelines, and most sources suggest that the data should be ingested from some sort of cloud-based storage or source, be it AWS S3, Kaggle, BigQuery, or something else.

The thing is, my company analyzes sensitive client data, which I think should not be stored in the cloud because it's a potential security risk, or at least should not leave the country/EU because of GDPR.

Given this, how can machine learning pipelines work with offline, local data?

Solution

Four options:

  • If you have plenty of means (e.g. you're a government, a bank, a Fortune 500 company), you can build your own private cloud. You could even put it out to tender and let a cloud provider build it for you in your own facilities, under your control and clearances.
  • Depending on how or where the processing is done, you could encrypt the data with a key that never leaves your facility. The data in the cloud would then be unusable for anybody else (except perhaps the NSA).
  • Depending on sensitivity, you could anonymize the data. It can then only be linked back to real people by someone who knows how to de-anonymize it (which usually relies on data that stays on your premises). There are a couple of tricky risks to cope with (e.g. digital twins that could help to identify someone based on a combination of properties).
  • You accept the risk and find the right lawyers to make sure the cloud provider complies with all legal requirements and security standards and can be sued in case of a breach. In this case, you could plant some dummy records with monitored email addresses in your data, so that you can spot data theft (e.g. sudden spam campaigns hitting the dummies).
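
To make the third option a bit more concrete, here is a minimal sketch of keyed pseudonymization using only the Python standard library. It is an illustration under assumptions, not a complete anonymization scheme: the field names (`client_id`, `email`, `balance`), the sample records, and the helper names `pseudonymize` and `pseudonymize_records` are all hypothetical. Direct identifiers are replaced with HMAC-SHA256 tokens, and both the secret key and the token-to-original lookup table are meant to stay on your premises. Note that this does nothing about the combination-of-properties risk mentioned above, and pseudonymized data is generally still treated as personal data under GDPR.

```python
import hashlib
import hmac
import secrets


def pseudonymize(value: str, key: bytes) -> str:
    """Return a deterministic keyed token (HMAC-SHA256, hex) for one identifier."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_records(records, id_fields, key):
    """Replace identifier fields with tokens.

    Returns the cleaned records plus a token -> original lookup table.
    The lookup table (like the key) is meant to stay on premises.
    """
    lookup = {}
    cleaned = []
    for record in records:
        out = dict(record)  # copy so the original record is not modified
        for field in id_fields:
            if field in out:
                token = pseudonymize(str(out[field]), key)
                lookup[token] = out[field]
                out[field] = token
        cleaned.append(out)
    return cleaned, lookup


# Hypothetical example data; the key is generated locally and never uploaded.
key = secrets.token_bytes(32)
records = [
    {"client_id": "C-1001", "email": "anna@example.com", "balance": 2500.0},
    {"client_id": "C-1002", "email": "ben@example.com", "balance": 310.5},
]
cleaned, lookup = pseudonymize_records(records, ["client_id", "email"], key)
```

Only `cleaned` would be sent to the external pipeline. Because the tokens are deterministic for a given key, records belonging to the same client still join correctly, and results can be mapped back locally through `lookup`.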

OTHER TIPS

There is no such thing as “the cloud,” it’s just somebody else’s computer

(This is a common saying; the version above came from an article by Hayato Huseman.)

"The cloud" is just a fancy way of saying that the data is stored on someone else's server, somewhere. Often, you don't know where the server is, and it could be moved without telling you.

If you don't want to do that, create a "private cloud", which is just a fancy way of saying that you are storing the data on your own server, in your own server room.

Licensed under: CC-BY-SA with attribution