Question

In OpenCL, are there any performance benefits to flagging buffers as READ_ONLY or WRITE_ONLY?

This kernel is what I often see (a is READ_ONLY and b is WRITE_ONLY):

__kernel void two_buffer_double(__global float* a, __global float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2;
}

This kernel seems better, because it uses less global memory (a is READ_WRITE):

__kernel void one_buffer_double(__global float* a)
{
    int i = get_global_id(0);
    a[i] = a[i] * 2;
}

Do READ_ONLY and WRITE_ONLY flags just exist to help with debugging and catching errors?


Solution 2

To answer your question straightforwardly, I'd say: no, these flags do not exist just to help with debugging and catching errors. However, it is hard to find any reference on how these flags are used by a given implementation and how they impact performance.

My understanding (unfortunately not backed up by any documentation) is that these flags put more constraints on how the buffers will be used, and can therefore help the runtime/driver/compiler make assumptions that might improve performance. For instance, there should be no worries about memory consistency within a read-only buffer while a kernel is using it, since the work-items are not supposed to write to it. Therefore some checks could be skipped... though in OpenCL you are supposed to take care of this yourself using barriers and so on.

Note also that since OpenCL 1.2 some further flags have been introduced, this time related to how the host needs to access the buffers (strictly speaking, the CL_MEM_HOST_* flags are new in 1.2, while the *_HOST_PTR flags date back to OpenCL 1.0); a sketch of how they combine with the device-side flags follows the list. They are:

CL_MEM_HOST_NO_ACCESS,
CL_MEM_HOST_{READ, WRITE}_ONLY,
CL_MEM_{USE, ALLOC, COPY}_HOST_PTR
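
A minimal, hedged sketch of how these might combine with the device-side flags (ctx, size and err are assumed to exist; error handling omitted):

cl_mem in  = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_HOST_WRITE_ONLY,
    size, NULL, &err); // device reads, host only writes: a typical input
cl_mem out = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY | CL_MEM_HOST_READ_ONLY,
    size, NULL, &err); // device writes, host only reads: a typical output
cl_mem tmp = clCreateBuffer(ctx, CL_MEM_READ_WRITE | CL_MEM_HOST_NO_ACCESS,
    size, NULL, &err); // scratch data the host never touches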

Again, I'm guessing that these must help the people implementing OpenCL to enhance performance, but I guess we'd need input from some AMD or NVIDIA experts.

Please note that everything I have said so far is only my own thinking and is not based on any serious documentation (I didn't manage to find any).

On the other hand, I can tell you for sure that the standard does not force a read-only buffer to be in constant space, as @Quonux stated. It might be that some implementations do this for small buffers. Let's not forget that constant memory is small, so you can have read-only buffers too large to fit in it. The only way to make sure a buffer is placed in constant memory is to use the constant keyword in your kernel code, as explained here. Of course, on the host side, if you want to use a constant buffer you have to use the read-only flag.
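
For illustration, a minimal sketch of the kernel-side constant keyword this refers to (the buffer bound to coeffs would be created with CL_MEM_READ_ONLY on the host):

__kernel void scale(__global float* data, __constant float* coeffs)
{
    int i = get_global_id(0);
    data[i] *= coeffs[0]; // coeffs is placed in constant space by the qualifier
}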

OTHER TIPS

Note that there are actually two kinds of these qualifiers. You have CL_MEM_READ_ONLY, CL_MEM_WRITE_ONLY and CL_MEM_READ_WRITE when allocating your buffers, but you also have __read_only, __write_only and __read_write to decorate the pointers in your kernel code (a sketch showing both kinds follows).
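
A hedged sketch showing the two kinds side by side, following the usage in this answer (ctx, n and err are illustrative names; error handling omitted):

// host side: access flags given at allocation time
cl_mem a = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  n * sizeof(float), NULL, &err);
cl_mem b = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, n * sizeof(float), NULL, &err);

// kernel side: the matching pointer decorations
__kernel void two_buffer_double(__global __read_only float* a,
                                __global __write_only float* b)
{
    int i = get_global_id(0);
    b[i] = a[i] * 2.0f;
}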

These can be used for both optimization and error checking. Let's look at performance first. If a write-only buffer is encountered, the writes need not be cached (as in a write-through cache), saving more cache for the reads. This depends a lot on the GPU hardware, and at least NVIDIA hardware does have the instructions needed to actually implement it (the .cs and .lu cache modifiers). You can refer to their PTX ISA. I haven't seen any evidence of the compiler actually performing this optimization; e.g.:

__kernel void Memset4(__global __write_only unsigned int *p_dest,
    const unsigned int n_dword_num)
{
    unsigned int i = get_global_id(0);
    if(i < n_dword_num)
        p_dest[i] = 0; // this
}

gets compiled as:

st.global.u32 [%r10], %r11; // no cache operation specified

This makes sense, as CUDA does not have equivalents for those qualifiers, so the compiler most likely silently ignores them. But it does not hurt to put them there; we might get luckier in the future. In CUDA, some of this functionality is exposed through the __ldg function and through compiler flags that opt in/out of caching global memory transfers in L1 (-Xptxas -dlcm=cg). You can also always use asm if you find that bypassing the cache yields a major advantage.

As for error checking, writing to a read-only buffer is readily prevented using the const specifier in the kernel declaration. Disallowing reading from a write-only buffer is not possible in pure "C".
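
For instance, with the question's kernel, a const on the input pointer turns any accidental write into a compile-time error:

__kernel void two_buffer_double(__global const float* a, __global float* b)
{
    int i = get_global_id(0);
    // a[i] = 0.0f; // would be rejected by the compiler
    b[i] = a[i] * 2.0f;
}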

Another possible optimization happens when mapping those buffers to host memory. When mapping a CL_MEM_READ_ONLY buffer, the mapped region can be left uninitialized, because the host will only write to that memory, for the device to read. Similarly, when un-mapping a CL_MEM_WRITE_ONLY buffer, the driver does not need to copy the (potentially host-modified) contents from host memory to device memory. I did not measure this.
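
A sketch of the mapping pattern described above, assuming a queue q and a CL_MEM_READ_ONLY buffer buf of n floats (no error handling):

float *p = (float*)clEnqueueMapBuffer(q, buf, CL_TRUE,
    CL_MAP_WRITE_INVALIDATE_REGION, // OpenCL 1.2: old contents need not be preserved
    0, n * sizeof(float), 0, NULL, NULL, &err);
for (size_t i = 0; i < n; ++i)
    p[i] = (float)i; // host only writes; no device-to-host copy was needed for the map
clEnqueueUnmapMemObject(q, buf, p, 0, NULL, NULL);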

As a side note, I have tried using:

inline unsigned int n_StreamingLoad(__global __read_only const unsigned int *p_src)
{
#ifdef NVIDIA
    unsigned int n_result;
    asm("ld.global.cs.u32 %r0, [%r1];" : "=r" (n_result) : "r" (p_src));
    return n_result;
#else // NVIDIA
    return *p_src; // generic
#endif // NVIDIA
}

inline void StreamingWrite(__global __write_only unsigned int *p_dest, const unsigned int n_value)
{
#ifdef NVIDIA
    asm("st.global.cs.u32 [%r0], %r1;" : : "r" (p_dest), "r" (n_value) : "memory");
#else // NVIDIA
    *p_dest = n_value; // generic
#endif // NVIDIA
}

which gives you about 15 extra GB/sec even on a simple memcpy kernel with sm_35 devices (tested on a GTX 780 and a K40). I haven't seen a noticeable speedup on sm_30 (not sure if it's even meant to be supported there, although the instructions are not stripped from the PTX). Note that you need to define NVIDIA yourself (or see Detect OpenCL device vendor in kernel code); one host-side way is sketched below.
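
A hedged sketch of doing that detection on the host and forwarding it as a build option (device and program are assumed to exist; strstr comes from <string.h>):

char vendor[256] = "";
clGetDeviceInfo(device, CL_DEVICE_VENDOR, sizeof(vendor), vendor, NULL);
const char *options = strstr(vendor, "NVIDIA") ? "-DNVIDIA" : "";
clBuildProgram(program, 1, &device, options, NULL, NULL);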

It depends.

A READ_ONLY __global memory location may be stored in the "Global / Constant Memory Data cache", which is much faster than the normal cache path or RAM on a GPU (see here); on a CPU it doesn't matter.

I don't know of any advantages of WRITE_ONLY; maybe it helps too, because the GPU knows it can stream the data out without the need for caching.

Just go and measure it if you're unsure.
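
For instance, a sketch using OpenCL profiling events (the queue must have been created with CL_QUEUE_PROFILING_ENABLE; q, kernel and global_size are assumed; printf comes from <stdio.h>):

cl_event ev;
cl_ulong t_start, t_end;
clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global_size, NULL, 0, NULL, &ev);
clWaitForEvents(1, &ev);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t_start), &t_start, NULL);
clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t_end), &t_end, NULL);
printf("kernel time: %f ms\n", (t_end - t_start) * 1e-6); // timestamps are in nanoseconds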
