What would be the general structure of such algorithm, and what would be the resulting complexity?
This is essentially the n-body problem. Solved using a direct particle-particle approach. It's been written about a lot. The order of the algorithm is O(N^2) on the GPU, just as it is on the CPU.
The core algorithm as implemented in CUDA doesn't change a lot except to take advantage of local block memory and optimize for it. Essentially the implementation would still come does to two loops.
The following paper is a good place to start, Chapter 31. Fast N-Body Simulation with CUDA.
Could we optimize it for locality?
Yes. Many n-body algorithms attempt to optimize for locality as gravitational and E-M forces decrease as a power of the distance between particles so distant particles can be ignored or their contribution can be approximated. Which of these approximation approaches to take largely depends on the type of system you are trying to simulate.
The following is a good overview of some of the more popular approaches, Seminar presentation, N-body algorithms