How can I enforce CUDA global memory coherence without declaring pointer as volatile?

Question 1

It seems based on the comments that the only working solution is to turn off L1 caching. This can be accomplished on a program-wide basis by passing the following switch to nvcc when compiling:

-Xptxas -dlcm=cg

The L1 caches are a property/resource of the SM, not the device as a whole. Since threadblocks execute on specific SMs, the activity of one threadblock in its L1 cache can be incoherent with the activity of another threadblock in its own L1 cache (assuming it happens to be running on a different SM), even though both are referencing the same locations in global memory. L1 caches in different SMs have no connection with each other and are not guaranteed to be coherent with each other.

Note that the L2 cache is device-wide and therefore "coherent" from the perspective of individual threadblocks. Turning off L1 caching has no effect on L2 caching, so some caching benefit remains. However, a request satisfied out of L2 takes longer than one satisfied out of L1, so turning off L1 caching program-wide is a pretty large hammer for getting things working.

The volatile keyword in front of a variable definition should have the effect of telling the compiler to skip L1 caching on loads (according to my understanding). But volatile by itself doesn't address the write path, so it's possible for one threadblock in one SM to do a volatile read, pulling a value out of L2, modify that value, and then write it back, where it ends up in L1 (until it is evicted). If another threadblock reads the same global value, it may not see the effect of the update.

Diligent use of __threadfence(), while tedious, should force any such updates out of L1 into L2, so that other threadblocks can read them. However, this still leaves a synchronization gap between the moment the value is written and the moment it becomes observable by other SMs/threadblocks.
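As a rough illustration of the volatile-read plus __threadfence()-on-write pattern, here is a minimal sketch in which one block hands a value to another through global memory. The names handoff, runHandoff, payload, ready and result are made up for this example. It assumes both blocks are resident on the GPU at the same time (true for two small blocks); otherwise the spin loop could hang.

```cuda
#include <cuda_runtime.h>

// Hypothetical example: block 0 publishes a value, block 1 waits for it.
__global__ void handoff(int *payload, volatile int *ready, int *result)
{
    if (threadIdx.x != 0) return;            // one thread per block does the work

    if (blockIdx.x == 0) {
        *payload = 42;                        // ordinary global store
        __threadfence();                      // order the payload store before the flag store
        *ready = 1;                           // volatile store of the flag
    } else {
        while (*ready == 0) { }               // volatile load: re-read on every iteration
        __threadfence();                      // order the flag load before the payload load
        *result = *(volatile int *)payload;   // volatile load of the payload
    }
}

// Host helper: launches two blocks and returns what block 1 observed.
int runHandoff()
{
    int *payload, *ready, *result;
    cudaMalloc(&payload, sizeof(int));
    cudaMalloc(&ready, sizeof(int));
    cudaMalloc(&result, sizeof(int));
    cudaMemset(payload, 0, sizeof(int));
    cudaMemset(ready, 0, sizeof(int));
    cudaMemset(result, 0, sizeof(int));

    handoff<<<2, 32>>>(payload, ready, result);

    int host = -1;
    cudaMemcpy(&host, result, sizeof(int), cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(payload);
    cudaFree(ready);
    cudaFree(result);
    return host;
}
```

Calling runHandoff() from main should return 42. The fence on the writer side is what keeps the flag from becoming visible before the payload.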

(Global) Atomics should also have the effect of going directly to "global memory" to read and write the values used.
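A small sketch of using atomics for both the read and the write path might look like the following. The helpers atomicLoad and atomicStore and the kernel countArrivals are hypothetical names, not CUDA API. They are built from atomicAdd(addr, 0) and atomicExch, on the assumption that an atomic read-modify-write on a global address is resolved outside the per-SM L1.

```cuda
#include <cuda_runtime.h>

// Hypothetical helpers: route a read and a write through atomic operations.
__device__ int atomicLoad(int *addr)           { return atomicAdd(addr, 0); }
__device__ void atomicStore(int *addr, int v)  { atomicExch(addr, v); }

// Each block takes a ticket; the block that arrives last records the final count.
__global__ void countArrivals(int *arrived, int *lastSeen)
{
    if (threadIdx.x == 0) {
        int ticket = atomicAdd(arrived, 1);          // returns the value before the add
        if (ticket == (int)gridDim.x - 1)            // every other block has already arrived
            atomicStore(lastSeen, atomicLoad(arrived));
    }
}

// Host helper: returns the count observed by the last-arriving block.
int runCountArrivals(int blocks)
{
    int *arrived, *lastSeen;
    cudaMalloc(&arrived, sizeof(int));
    cudaMalloc(&lastSeen, sizeof(int));
    cudaMemset(arrived, 0, sizeof(int));
    cudaMemset(lastSeen, 0, sizeof(int));

    countArrivals<<<blocks, 32>>>(arrived, lastSeen);

    int host = -1;
    cudaMemcpy(&host, lastSeen, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(arrived);
    cudaFree(lastSeen);
    return host;
}
```

With 64 blocks, runCountArrivals(64) should return 64, since the block holding the last ticket reads the counter only after all increments have been applied.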

It may be instructive to also go through the code to ensure that every possible read from a globally synchronized location is handled properly (e.g. with volatile or using atomics) and that every possible write to a globally synchronized location is handled properly (e.g. with __threadfence() or atomics), and also check for race conditions between different blocks.

As discovered, the process of creating a stable globally-synchronized environment within the GPU is non-trivial. These other questions may also be of interest (e.g. with respect to Kepler) (and e.g. discussing global semaphores).

Edit: To respond to a question posted in the comments, I would say this:

Perhaps there's no issue. However __threadfence() provides no guarantee (that I know of) for a maximum completion time. Therefore at the moment an update is made to a global location, only the L1 associated with the executing threadblock/SM gets updated. Then we hit the __threadfence(). Presumably threadfence takes some time to complete, and during this time another threadblock could be resident on the same SM, brought in for execution (while the previous thread/warp/block is stalled at the threadfence), and "see" the updated global value in the (local) L1 associated with that SM. Other threadblocks executing in other SMs will see the "stale" value until the __threadfence() completes. This is what I am referring to as a possible "synchronization gap". Two different blocks can still see two different values, for a brief period of time. Whether this matters or not will be dependent on how the global value is being used for synchronization between blocks (since that is the topic under discussion.) Therefore atomics + volatile may be a better choice than volatile + threadfence, to try and cover both read and write paths for synchronization.

Edit #2: It seems from the comments that the combination of atomics plus volatile also solved the problem.

Question 2

Frankly, I find your code overly complicated with its indices and, more importantly, incomplete. How do popBottom and popTop function? Moreover, how is the push operation implemented? Those two have to be carefully crafted in order to work correctly and to ensure that synchronization problems do not occur.

For example: what will happen when one block tries to push something to its global-memory queue while another block tries to read from it at the very same moment? This is very important, and if it is not done right, it can crash in some very rare circumstances, e.g. you may pop from a data cell which has not been written to yet.

When I was implementing a similar thing (a single global-memory deque shared between all blocks), I additionally marked each data cell as one of: free, taken or dead. In pseudocode the algorithm worked more or less like this:

/* Objects of this class should reside in CUDA global memory */
template <typename T, size_t size>
class WorkQueue {
private:
    // unsigned int, not size_t: atomicInc and atomicCAS operate on unsigned int
    unsigned int head, tail;
    unsigned int status[size];
    T data[size];

    enum {
        FieldFree = 0,
        FieldDead = 1,
        FieldTaken = 2
    };      

public:
    /* 
       This construction should actually be done by host on the device,
       before the actual kernel using it is launched!
       Zeroing the memory should suffice.
    */
    WorkQueue() : head(0), tail(0) {
        for (size_t i=0; i<size; ++i)
            status[i]=FieldFree;
    }   

    __device__ bool isEmpty() { return head==tail; }

    /* single thread of a block should call this */
    __device__ bool push(const T& val) {
        size_t oldFieldStatus;
        do {
            size_t cell = atomicInc(&tail,size-1);
            data[cell]=val;
            __threadfence(); // make the data write visible to all blocks before the status update
            oldFieldStatus=atomicCAS(&status[cell],FieldFree,FieldTaken); //mark the cell as occupied
        } while (oldFieldStatus!=FieldFree); 
        return true;
    }

    /* single thread of a block should call this */
    __device__ bool pop(T& out) {
        size_t cellStatus;
        size_t cell;
        do {
            cell=atomicInc(&head,size-1);
            cellStatus=atomicCAS(&status[cell],FieldFree,FieldDead);
            //If cell was free, make it dead - any data stored there will not be processed. Ever.
        } while (cellStatus==FieldDead);
        if (cellStatus!=FieldTaken)
            return false;
        out = data[cell];
        __threadfence(); // finish reading the data before the cell is released for reuse
        status[cell]=FieldFree;
        return true;
    }
};

I do not see a reliable way of implementing it without the cell status; otherwise bad things will happen if two threads from two different blocks try to push/pop into the same cell of the deque. With the approach above, the worst case is that the popping thread fails to pop, marks the cell as dead and returns false, and the pushing thread retries pushing into the next cell. The idea is that if the popping thread fails to pop, then there is probably not much work to do anyway and the block can terminate. With that approach you will "kill" only as many cells as there are blocks running in parallel.

Note, in the above code I do not check for overflow!
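To show how the host-side setup by zeroing could look, here is a minimal sketch. MiniQueue is a condensed, hypothetical copy of the class above (no constructor, unsigned int fields so the atomics apply), and pushKernel, popKernel and runQueueDemo are names invented for the example. Pushes and pops run in separate kernel launches so that the result is deterministic, and like the original it does not check for overflow, so the number of blocks must not exceed the queue size.

```cuda
#include <cuda_runtime.h>

// Condensed copy of the queue; zero-filled memory is a valid empty queue.
template <typename T, unsigned int size>
struct MiniQueue {
    enum : unsigned int { FieldFree = 0, FieldDead = 1, FieldTaken = 2 };
    unsigned int head, tail;
    unsigned int status[size];
    T data[size];

    __device__ void push(const T &val) {
        unsigned int old;
        do {
            unsigned int cell = atomicInc(&tail, size - 1);   // wraps to 0 after size-1
            data[cell] = val;
            __threadfence();                                  // data before status
            old = atomicCAS(&status[cell], FieldFree, FieldTaken);
        } while (old != FieldFree);                           // cell was dead: try the next one
    }

    __device__ bool pop(T &out) {
        unsigned int cell, st;
        do {
            cell = atomicInc(&head, size - 1);
            st = atomicCAS(&status[cell], FieldFree, FieldDead);  // kill a free cell
        } while (st == FieldDead);
        if (st != FieldTaken) return false;                   // nothing had been pushed here
        out = data[cell];
        __threadfence();                                      // read before releasing the cell
        status[cell] = FieldFree;
        return true;
    }
};

typedef MiniQueue<int, 64> Queue;

__global__ void pushKernel(Queue *q) {
    if (threadIdx.x == 0) q->push((int)blockIdx.x + 1);       // one push per block
}

__global__ void popKernel(Queue *q, int *sum) {
    int v;
    if (threadIdx.x == 0 && q->pop(v)) atomicAdd(sum, v);     // one pop per block
}

// Host helper: returns the sum of all popped values (blocks must be <= 64).
int runQueueDemo(int blocks) {
    Queue *q;
    int *sum;
    cudaMalloc(&q, sizeof(Queue));
    cudaMalloc(&sum, sizeof(int));
    cudaMemset(q, 0, sizeof(Queue));     // head = tail = 0, every status = FieldFree
    cudaMemset(sum, 0, sizeof(int));

    pushKernel<<<blocks, 32>>>(q);
    popKernel<<<blocks, 32>>>(q, sum);   // same stream, so it runs after all pushes

    int host = -1;
    cudaMemcpy(&host, sum, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(q);
    cudaFree(sum);
    return host;
}
```

With 8 blocks the values 1 through 8 are pushed and then all popped, so runQueueDemo(8) should return 36. When pushes and pops are mixed in a single kernel, some pops may fail and kill cells, as described above.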