There are many methodologies for refactoring data to suit GPU computation. One challenge is copying data between device and host; the other is representing the data (and designing the algorithm) on the GPU so that memory bandwidth is used efficiently. I'll highlight 3 general approaches, focusing on ease of copying data between host and device.
Since you mention `std::vector`, you might take a look at thrust, which has vector container representations that are compatible with GPU computing. However, thrust won't conveniently handle vectors of vectors AFAIK, which is what I interpret your "3D std::vector" nomenclature to mean, so some (non-trivial) refactoring will still be involved. And thrust still doesn't let you use a vector directly in ordinary CUDA device code, although the data it contains is usable.
You could manually refactor the vector of vectors into flat (1D) arrays. You'll need one array for the data elements (length = total number of elements contained in your "3D" `std::vector`), plus one or more additional (1D) vectors to store the start (and implicitly the end) points of each individual sub-vector. Yes, folks will say this is inefficient because it involves indirection or pointer chasing, but so does the use of vector containers on the host. I would suggest that getting your algorithm working first is more important than worrying about one level of indirection in some aspects of your data access.
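
To make the flat-array idea concrete, here is a minimal host-side sketch. It assumes your "3D" container is a `std::vector<std::vector<std::vector<float>>>` with possibly ragged inner vectors; the names `Flat3D`, `flatten`, `at`, `row_start` and `plane_start` are made up for illustration and are not from any library.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical flattened form of a ragged vector<vector<vector<float>>>.
struct Flat3D {
    std::vector<float> data;              // every element, stored contiguously
    std::vector<std::size_t> row_start;   // row_start[r] .. row_start[r+1] bounds inner vector r in data
    std::vector<std::size_t> plane_start; // plane_start[p] .. plane_start[p+1] bounds plane p's rows in row_start
};

// Copy the nested vectors into one data array plus two offset arrays.
Flat3D flatten(const std::vector<std::vector<std::vector<float>>>& v) {
    Flat3D f;
    f.row_start.push_back(0);
    f.plane_start.push_back(0);
    for (const auto& plane : v) {
        for (const auto& row : plane) {
            f.data.insert(f.data.end(), row.begin(), row.end());
            f.row_start.push_back(f.data.size());        // end of this row, start of the next
        }
        f.plane_start.push_back(f.row_start.size() - 1); // number of rows seen so far
    }
    return f;
}

// Element c of inner vector r of plane p: one extra lookup per nesting level.
inline float at(const Flat3D& f, std::size_t p, std::size_t r, std::size_t c) {
    const std::size_t row = f.plane_start[p] + r;
    return f.data[f.row_start[row] + c];
}
```

Each of the three members is a single contiguous buffer, so each can be moved to the device with one `cudaMemcpy` from its `.data()` pointer, and a kernel can index them the same way `at` does. The end of a sub-vector is implicit: it is the start of the next one.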
As you point out, the "deep-copy" issue with CUDA can be a tedious one. It's pretty new, but you might want to take a look at Unified Memory, which is available on 64-bit Windows and Linux platforms, under CUDA 6, with a Kepler (cc 3.0) or newer GPU. With C++ especially, UM can be very powerful because we can extend operators like `new` under the hood and provide almost seamless usage of UM for shared host/device allocations.
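
As a rough sketch of what extending `new` might look like: a base class overloads `operator new` and `operator delete` so that anything derived from it lands in managed memory. The class names `Managed` and `Particle` are hypothetical, and the `malloc` fallback is my own addition so the sketch also builds with a plain C++ compiler; only the `__CUDACC__` branch (compiled with nvcc) actually uses Unified Memory.

```cpp
#include <cstddef>
#include <cstdlib>
#include <new>
#ifdef __CUDACC__
#include <cuda_runtime.h>
#endif

// Hypothetical base class: `new Derived` allocates from managed memory
// when built with nvcc, and from the ordinary heap otherwise.
class Managed {
public:
    static void* operator new(std::size_t len) {
        void* ptr = nullptr;
#ifdef __CUDACC__
        // Managed allocation: the same pointer is valid on host and device.
        if (cudaMallocManaged(&ptr, len) != cudaSuccess) throw std::bad_alloc();
#else
        // Host-only fallback (assumption, for building without CUDA).
        ptr = std::malloc(len);
        if (ptr == nullptr) throw std::bad_alloc();
#endif
        return ptr;
    }

    static void operator delete(void* ptr) noexcept {
#ifdef __CUDACC__
        cudaDeviceSynchronize(); // make sure no kernel is still using the object
        cudaFree(ptr);
#else
        std::free(ptr);
#endif
    }
};

// Example type: inherits the overloaded new/delete.
struct Particle : public Managed {
    float x, y, z;
};
```

With this in place, `Particle* p = new Particle;` gives a pointer that can be passed straight to a kernel, with no explicit copy. Note that the sketch only covers single-object `new`; array forms (`new Particle[n]`) would need `operator new[]` and `operator delete[]` overloaded the same way.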