Question

I am trying to build a neural network simulation running on several high-CPU diskless instances. I plan to use a persistent disk to store my simulation code and training data, mounting it on all server instances. It is basically a map-reduce-style task: several nodes work on the same training data, and the results from all nodes need to be collected into one single results file.

My only question now is: what are my options for (permanently) saving the simulation results from the different servers, either at certain points during the simulation or once at the end? Ideally, I would love to write them to the single persistent disk mounted on all servers, but that is not possible because a persistent disk can only be mounted read-only on more than one server.

What is the smartest (and cheapest) way to collect all simulation results of all servers back to one persistent disk?


Solution

Google Cloud Storage is a great way to permanently store information in the Google Cloud. All you need to do is enable that product for your project, and you'll be able to access Cloud Storage directly from your Compute Engine virtual machines. If you create your instances with the 'storage-rw' service account, access is even easier because you can use the gsutil command built into your virtual machines without needing to do any explicit authorization.

To be more specific, go to the Google Cloud Console, select the project with which you'd like to use Compute Engine and Cloud Storage, and make sure both of those services are enabled. Then use the 'storage-rw' service account scope when creating your virtual machine. If you use gcutil to create your VM, simply add the --storage_account_scope=storage-rw flag (there's also an intuitive way to set the service account scope if you're using the Cloud Console to start your VM). Once your VM is up and running, you can use the gsutil command freely without worrying about interactive login or OAuth steps. You can also script your usage by integrating any desired gsutil requests into your application (gsutil will also work in a startup script).
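As a concrete sketch of that workflow (the bucket name and file paths here are hypothetical placeholders, not anything from the answer above): each worker uploads its partial results under a per-host object name so writers never collide, and a collector node later pulls everything back onto its persistent disk.

```shell
# Hypothetical bucket and paths -- substitute your own.
RESULTS_BUCKET=gs://my-simulation-results

# On each worker: write results locally, then upload with a per-host
# name so concurrent writers never collide.
gsutil cp /tmp/results.csv "$RESULTS_BUCKET/results-$(hostname).csv"

# On the collector node: pull all partial results onto the persistent
# disk and concatenate them into a single results file.
gsutil cp "$RESULTS_BUCKET/results-*.csv" /mnt/pd/results/
cat /mnt/pd/results/results-*.csv > /mnt/pd/results/all-results.csv
```

Because objects in a bucket are independent, this avoids the single-writer restriction of the shared persistent disk entirely: no coordination between workers is needed beyond choosing non-conflicting object names.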

More background on the service account features of GCE can be found in the Compute Engine documentation.

OTHER TIPS

Marc's answer is definitely best for long-term storage of results. Depending on your I/O and reliability needs, you can also set up one server as an NFS server, and use it to mount the volume remotely on your other nodes.

Typically, the NFS server would be your "master node", and it can serve both binaries and configuration. Workers would periodically re-scan the directories exported from the master to pick up new binaries or configuration. If you don't need a lot of disk I/O (you mentioned neural simulation, so I'm presuming the data set fits in memory and you only output final results), it can be acceptably fast to simply write your output to NFS directories on your master node, and then have the master node back up results to some place like GCS.
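A minimal sketch of that setup, assuming Debian/Ubuntu package names and a master reachable as `master-node` (both are placeholders for your environment):

```shell
# On the master node: install the NFS server and export a results directory.
sudo apt-get install -y nfs-kernel-server
sudo mkdir -p /export/results
# Restrict the export to your private network, never the public internet.
echo '/export/results 10.0.0.0/8(rw,sync,no_subtree_check)' | sudo tee -a /etc/exports
sudo exportfs -ra

# On each worker: install the NFS client and mount the share.
sudo apt-get install -y nfs-common
sudo mkdir -p /mnt/results
sudo mount master-node:/export/results /mnt/results

# Workers can now write per-host result files, e.g.:
# cp /tmp/results.csv /mnt/results/results-$(hostname).csv
```

Writing per-host filenames (rather than appending to one shared file) sidesteps NFS file-locking issues between concurrent writers.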

The main advantage of using NFS over GCS is that NFS offers familiar filesystem semantics, which can help if you're using third-party software that expects to read files off a filesystem. It's pretty easy to sync files down from GCS to local storage periodically, but doing so requires running an extra agent on the host.
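The periodic sync in either direction can be as simple as a cron entry using `gsutil rsync` (bucket name and paths are hypothetical):

```shell
# Crontab fragment on the master node: every 15 minutes, mirror the
# NFS-exported results directory up to a GCS bucket for durable storage.
# m   h dom mon dow  command
*/15  *  *   *   *   gsutil rsync -r /export/results gs://my-simulation-results
```

Reversing the source and destination arguments gives you the opposite flow: pulling files down from GCS onto local storage for software that expects a real filesystem.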

The disadvantages of setting up NFS are that you probably need to sync UIDs between hosts, that NFS can be a security hole (I'd only expose NFS on my private network, not to anything outside 10/8), and that it requires installing additional packages on both client and server to set up the shares. Also, NFS will only be as reliable as the hosting machine, while an object store like GCS or S3 will be implemented with redundant servers and possibly even geographic diversity.

If you want to stay in the Google product space, how about Google Cloud Storage?

Otherwise, I've used S3 and boto for these kinds of tasks.

As a more general option, you're asking for some sort of general object store. Google, as noted in previous responses, makes a nice package, but nearly all cloud providers offer some storage option. Make sure your cloud provider has BOTH key options: a volume store (a store for data similar to a virtual disk) and an object store (a key/value store). Both have their strengths and weaknesses. Volume stores are drop-in replacements for virtual disks; if you can use stdio, you can likely use a remote volume store. The problem is that they often have the structure of a disk, and if you want anything more than that, you're really asking for a database. The object store is a "middle ground" between the disk and the database: it's fast and semi-structured.

I'm an OpenStack user myself -- first, because it provides both storage families, and second, because it's supported by a variety of vendors, so if you decide to move from vendor A to vendor B, your code can remain unchanged. You can even run a copy of it on your own machines (go to www.openstack.org). Note, however, that OpenStack does like memory: you're not going to run your private cloud on a 4 GB laptop! Consider two 16 GB machines.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow