What is (if any) the standard approach for designing out-of-core / external-memory algorithms? [closed]

StackOverflow https://stackoverflow.com/questions/16552805

Question

I am looking for rules of thumb for designing algorithms where the data is accessed slowly due to limitations of disk speed, PCI speed (GPGPU), or other bottlenecks.

Also, how does one manage GPGPU programs where the memory needed by the application exceeds GPU memory?


Solution

In general, the GPU memory should not be an arbitrary limitation on the size of data for algorithms. The GPU memory could be considered to be a "cache" of data that the GPU is currently operating on, but many GPU algorithms are designed to operate on more data than can fit in the "cache". This is accomplished by moving data to and from the GPU while computation is going on, and the GPU has specific concurrent execution and copy/compute overlap mechanisms to enable this.
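To make the overlap mechanism concrete, here is a minimal sketch using CUDA streams. It is illustrative only: the `busywork` kernel and buffer names are hypothetical, and the key assumption is that the host buffer is pinned (page-locked) so `cudaMemcpyAsync` can actually overlap with kernel execution.

```cuda
// Minimal sketch of copy/compute overlap: an async host->device copy in stream s0
// can proceed while an unrelated kernel runs in stream s1.
#include <cuda_runtime.h>

__global__ void busywork(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f + 1.0f;
}

int main() {
    const int n = 1 << 22;
    float *hA, *dA, *dB;
    cudaMallocHost(&hA, n * sizeof(float));   // pinned host memory enables async copies
    cudaMalloc(&dA, n * sizeof(float));
    cudaMalloc(&dB, n * sizeof(float));
    cudaMemset(dB, 0, n * sizeof(float));

    cudaStream_t s0, s1;
    cudaStreamCreate(&s0);
    cudaStreamCreate(&s1);

    // The copy into dA (stream s0) and the kernel on dB (stream s1) touch different
    // buffers, so the copy engine and the compute units can run them concurrently.
    cudaMemcpyAsync(dA, hA, n * sizeof(float), cudaMemcpyHostToDevice, s0);
    busywork<<<(n + 255) / 256, 256, 0, s1>>>(dB, n);

    cudaDeviceSynchronize();
    cudaFree(dA); cudaFree(dB); cudaFreeHost(hA);
    cudaStreamDestroy(s0); cudaStreamDestroy(s1);
    return 0;
}
```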

This usually implies that independent work can be completed on sections of the data, which is typically a good indicator for acceleration in a parallelizable application. Conceptually, this is similar to large scale MPI applications (such as high performance linpack) which break the work into pieces and then send the pieces to various machines (MPI ranks) for computation.

If the amount of work to be done on the data is small compared to the cost of transferring the data, then the data transfer speed will still be the bottleneck, unless it is addressed directly, for example via changes to the storage system.
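A rough back-of-the-envelope estimate shows why. The numbers below (host-only code, no device calls) are assumptions chosen for illustration: ~12 GB/s effective PCIe bandwidth, ~5 TFLOP/s sustained compute, and an algorithm that does only about 0.5 FLOP per byte transferred.

```cuda
// Back-of-envelope check: compare estimated transfer time against compute time.
// All figures are illustrative assumptions, not measurements.
#include <cstdio>

int main() {
    const double bytes            = 1e9;   // 1 GB of input data
    const double pcie_bytes_per_s = 12e9;  // ~PCIe Gen3 x16 effective bandwidth (assumed)
    const double gpu_flops_per_s  = 5e12;  // ~5 TFLOP/s sustained (assumed)
    const double flops_per_byte   = 0.5;   // low arithmetic intensity (assumed)

    double t_transfer = bytes / pcie_bytes_per_s;                  // ~0.083 s
    double t_compute  = bytes * flops_per_byte / gpu_flops_per_s;  // ~0.0001 s

    printf("transfer %.4f s vs compute %.4f s\n", t_transfer, t_compute);
    // With such low arithmetic intensity, the transfer dominates by orders of
    // magnitude; overlapping copies with compute can only hide a small fraction
    // of the cost, so the job is effectively bound by the bus or the storage.
    return 0;
}
```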

The basic approach to handling out-of-core algorithms, i.e. those where the data set is too large to fit in GPU memory all at once, is to determine a version of the algorithm that can work on separable data, and then craft a "pipelined" algorithm to work on the data in chunks. An example tutorial covering such a programming technique is here (the focus starts around the 40-minute mark, but the whole video is relevant).
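The sketch below shows one common shape of that pipelined pattern, under the assumption that per-chunk work is independent. The `process` kernel and the sizes are hypothetical; the full data set lives in pinned host memory while only two chunk-sized buffers exist on the device, cycled so that the copy of one chunk overlaps the compute of the previous one.

```cuda
// Sketch of a pipelined, chunked out-of-core pattern with double buffering.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] = d[i] * 2.0f;           // stand-in for real per-chunk work
}

int main() {
    const int chunkElems = 1 << 20;          // elements per chunk
    const int numChunks  = 16;               // whole data set = 16 chunks
    const size_t chunkBytes = chunkElems * sizeof(float);

    float *h;                                // full data set, pinned for async copies
    cudaMallocHost(&h, chunkBytes * numChunks);
    for (size_t i = 0; i < (size_t)chunkElems * numChunks; ++i) h[i] = 1.0f;

    float *d[2];
    cudaStream_t s[2];
    for (int b = 0; b < 2; ++b) {            // only two chunk-sized device buffers
        cudaMalloc(&d[b], chunkBytes);
        cudaStreamCreate(&s[b]);
    }

    for (int k = 0; k < numChunks; ++k) {
        int b = k % 2;                       // alternate buffer/stream each chunk
        float *hChunk = h + (size_t)k * chunkElems;
        // Work issued to s[b] only waits on earlier work in s[b], so chunk k's copy
        // can run while chunk k-1 (in the other stream) is still computing.
        cudaMemcpyAsync(d[b], hChunk, chunkBytes, cudaMemcpyHostToDevice, s[b]);
        process<<<(chunkElems + 255) / 256, 256, 0, s[b]>>>(d[b], chunkElems);
        cudaMemcpyAsync(hChunk, d[b], chunkBytes, cudaMemcpyDeviceToHost, s[b]);
    }
    cudaDeviceSynchronize();

    printf("h[0] = %f (expected 2.0)\n", h[0]);

    for (int b = 0; b < 2; ++b) { cudaFree(d[b]); cudaStreamDestroy(s[b]); }
    cudaFreeHost(h);
    return 0;
}
```

Because chunks k and k+2 reuse the same buffer but are issued to the same stream, they serialize naturally, so no extra synchronization is needed for correctness in this simple two-buffer scheme.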

Licensed under: CC BY-SA with attribution