Pointers and bit operators in GPU kernels

https://stackoverflow.com/questions/16786685

30-05-2022

Question

I want to apply a double threshold to a volume using a GPU kernel. I send the volume to the kernel slice by slice, each slice as a read_only image2d_t. The output is a binary volume in which each bit specifies whether its corresponding voxel is enabled or disabled. The kernel checks whether the current pixel value lies within the lower/upper threshold range and, if so, sets the corresponding bit in the binary volume.

For debugging purposes, I have commented out the actual check for now and simply use the passed slice number to decide whether the bit should be on or off: the first slices (slicenr < 15 in the code) are set to "on", the rest to "off". I have also verified this logic on the CPU, using the code pasted at the bottom of this post. That code shows both paths; the CPU path is currently commented out.

The CPU code works as intended; the following image is returned after rendering the volume with the binary mask applied:

[Image: rendering with a correctly computed mask]

Running the exact same logic in my GPU kernel returns incorrect results (first the 3D rendering, then a slice view):

[Image: rendering with an incorrect GPU-computed mask]

[Image: rendering with an incorrect GPU-computed mask (slice view)]

What goes wrong here? I read that OpenCL does not support bit fields, but as far as I can tell from the OpenCL specification it does support bitwise operators. My bit logic, which selects the right bit from the 32-bit word and sets or clears it, should be supported, right? Or is my simple flag considered a bit field? The logic selects bit voxel%32 counted from the left (not from the right, hence the subtraction).
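The bit selection described above can be sketched in plain C as follows. This is only an illustration of the arithmetic, not the poster's code: the function names bit_mask and write_bit are made up for this example. The restriction in OpenCL C concerns bit-field struct members; the bitwise operators used here are ordinary integer operations.

```c
#include <stdint.h>

/* Mask for voxel x inside its 32-bit word, counting bits from the left
 * (most significant bit first), as the question describes.
 * bit_mask is a hypothetical helper name used only in this sketch. */
static uint32_t bit_mask(uint32_t x)
{
    /* 1u keeps the shift unsigned; shifting a signed 1 into bit 31
     * is undefined behaviour in C. */
    return 1u << (31u - (x & 31u));
}

/* Set or clear that bit in a word and return the updated word.
 * Note that this needs the old value of the word: it is a
 * read-modify-write, not a plain write. */
static uint32_t write_bit(uint32_t word, uint32_t x, int enabled)
{
    uint32_t mask = bit_mask(x);
    return enabled ? (word | mask) : (word & ~mask);
}
```

For example, voxel 0 maps to the mask 0x80000000 and voxel 31 to 0x00000001; voxel 32 starts over at 0x80000000 in the next word.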

Another possibility is that the uint pointer passed to my kernel is not what I expect. I assumed this was a valid way to use pointers and to pass data to my kernel. The arithmetic applied to "uint* word" in the kernel accounts for padding words per row and padding rows per slice. The CPU variant confirmed that the pointer calculation itself is valid, though.
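To make the pointer arithmetic concrete, here is a minimal C sketch of the word offset computed in both the CPU and the kernel code. The function name word_offset is hypothetical; words_per_row and words_per_plane are taken as given inputs (the helpers that compute them are not shown in the post), and the leading 1 simply reproduces the "+ 1" from the posted code.

```c
#include <stddef.h>
#include <stdint.h>

/* Offset, in 32-bit words, of the word that holds voxel (x, y, z).
 * words_per_row and words_per_plane are assumed to include any padding.
 * word_offset is a hypothetical name used only for this sketch. */
static size_t word_offset(uint32_t x, uint32_t y, uint32_t z,
                          uint32_t words_per_row, uint32_t words_per_plane)
{
    return 1u                              /* one-word offset, as in the post */
         + (size_t)z * words_per_plane     /* skip whole slices */
         + (size_t)y * words_per_row       /* skip whole rows */
         + (x >> 5);                       /* 32 voxels per word */
}
```

One consequence worth noticing: voxels x = 0..31 of the same row all yield the same offset, so 32 neighbouring positions share one word.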

The host code (C#, using Cloo) is below:

            uint wordsPerRow = (uint)BinaryVolumeWordsPerRow(volume.Geometry.NumberOfVoxels);
            uint wordsPerPlane = (uint)BinaryVolumeWordsPerPlane(volume.Geometry.NumberOfVoxels);

            int[] dims = new int[3];
            dims[0] = volume.Geometry.NumberOfVoxels.X;
            dims[1] = volume.Geometry.NumberOfVoxels.Y;
            dims[2] = volume.Geometry.NumberOfVoxels.Z;

            uint[] arrC = dstVolume.BinaryData.ObtainArray() as uint[];
            unsafe {
                fixed(int* dimPtr = dims) {
                    fixed(uint *arrcPtr = arrC) {
                        // pick Cloo Platform
                        ComputePlatform platform = ComputePlatform.Platforms[0];

                        // create context with all gpu devices
                        ComputeContext context = new ComputeContext(ComputeDeviceTypes.Gpu,
                            new ComputeContextPropertyList(platform), null, IntPtr.Zero);

                        // load opencl source
                        StreamReader streamReader = new StreamReader(@"C:\views\pii-sw113v1\PMX\ADE\Philips\PmsMip\Private\Viewing\Base\BinaryVolumes\kernels\kernel.cl");
                        string clSource = streamReader.ReadToEnd();
                        streamReader.Close();

                        // create program with opencl source
                        ComputeProgram program = new ComputeProgram(context, clSource);

                        // compile opencl source
                        program.Build(null, null, null, IntPtr.Zero);

                        // Create the event wait list. An event list is not really needed for this example but it is important to see how it works.
                        // Note that events (like everything else) consume OpenCL resources and creating a lot of them may slow down execution.
                        // For this reason their use should be avoided if possible.
                        ComputeEventList eventList = new ComputeEventList();

                        // Create the command queue. This is used to control kernel execution and manage read/write/copy operations.
                        ComputeCommandQueue commands = new ComputeCommandQueue(context, context.Devices[0], ComputeCommandQueueFlags.None);

                        // Create the kernel function and set its arguments.
                        ComputeKernel kernel = program.CreateKernel("LowerThreshold");

                        int slicenr = 0;
                        foreach (IntPtr ptr in pinnedSlices) {
                            /*// CPU VARIANT FOR TESTING PURPOSES 
                            for (int y = 0; y < dims[1]; y++) {
                                for (int x = 0; x < dims[0]; x++) {
                                    long pixelOffset = x + y * dims[0];
                                    ushort* ushortPtr = (ushort*)ptr;
                                    ushort pixel = *(ushortPtr + pixelOffset);

                                    int BinaryWordShift = 5;
                                    int BinaryWordBits = 32;
                                    if (
                                        (0 <= x) &&
                                        (0 <= y) &&
                                        (0 <= slicenr) &&
                                        (x < dims[0]) &&
                                        (y < dims[1]) &&
                                        (slicenr < dims[2])
                                    ) {
                                        uint* word =
                                            arrcPtr + 1 + (slicenr * wordsPerPlane) +
                                            (y * wordsPerRow) +
                                            (x >> BinaryWordShift);
                                        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (byte)(x & 0x1f)));
                                        //if (pixel > lowerThreshold && pixel < upperThreshold) {
                                        if (slicenr < 15) {
                                            *word |= mask;
                                        } else {
                                            *word &= ~mask;
                                        }
                                    }
                                }
                            }*/

                            ComputeBuffer<int> dimsBuffer = new ComputeBuffer<int>(
                                context,
                                ComputeMemoryFlags.ReadOnly | ComputeMemoryFlags.CopyHostPointer,
                                3,
                                new IntPtr(dimPtr));

                            ComputeImageFormat format = new ComputeImageFormat(ComputeImageChannelOrder.Intensity, ComputeImageChannelType.UnsignedInt16);
                            ComputeImage2D image2D = new ComputeImage2D(
                                context, 
                                ComputeMemoryFlags.ReadOnly, 
                                format, 
                                volume.Geometry.NumberOfVoxels.X, 
                                volume.Geometry.NumberOfVoxels.Y, 
                                0, 
                                ptr
                            );

                            // The output buffer doesn't need any data from the host. Only its size is specified (arrC.Length).
                            ComputeBuffer<uint> c = new ComputeBuffer<uint>(
                                context, ComputeMemoryFlags.WriteOnly, arrC.Length);

                            kernel.SetMemoryArgument(0, image2D);
                            kernel.SetMemoryArgument(1, dimsBuffer);
                            kernel.SetValueArgument(2, wordsPerRow);
                            kernel.SetValueArgument(3, wordsPerPlane);
                            kernel.SetValueArgument(4, slicenr);
                            kernel.SetValueArgument(5, lowerThreshold);
                            kernel.SetValueArgument(6, upperThreshold);
                            kernel.SetMemoryArgument(7, c);

                            // Execute the kernel "count" times. After this call returns, "eventList" will contain an event associated with this command.
                            // If eventList == null or typeof(eventList) == ReadOnlyCollection<ComputeEventBase>, a new event will not be created.
                            commands.Execute(kernel, null, new long[] { dims[0], dims[1] }, null, eventList);

                            // Read back the results. If the command-queue has out-of-order execution enabled (default is off), ReadFromBuffer 
                            // will not execute until any previous events in eventList (in our case only eventList[0]) are marked as complete 
                            // by OpenCL. By default the command-queue will execute the commands in the same order as they are issued from the host.
                            // eventList will contain two events after this method returns.
                            commands.ReadFromBuffer(c, ref arrC, false, eventList);

                            // A blocking "ReadFromBuffer" (if 3rd argument is true) will wait for itself and any previous commands
                            // in the command queue or eventList to finish execution. Otherwise an explicit wait for all the opencl commands 
                            // to finish has to be issued before "arrC" can be used. 
                            // This explicit synchronization can be achieved in two ways:
                            // 1) Wait for the events in the list to finish,
                            //eventList.Wait();
                            //}
                            // 2) Or simply use
                            commands.Finish();

                            slicenr++;
                        }

                    }
                }
            }

And my kernel code:

const sampler_t smp = CLK_FILTER_NEAREST | CLK_ADDRESS_CLAMP |   CLK_NORMALIZED_COORDS_FALSE;
kernel void LowerThreshold(
    read_only image2d_t image,
    global int* brickSize,
    uint wordsPerRow,
    uint wordsPerPlane,
    int slicenr,
    int lower,
    int upper,
    global write_only uint* c )
{

    int4 coord = (int4)(get_global_id(0),get_global_id(1),slicenr,1);
    uint4 pixel = read_imageui(image, smp, coord.xy);

    uchar BinaryWordShift = 5;
    int BinaryWordBits = 32;
    if (
            (0 <= coord.x) &&
            (0 <= coord.y) &&
            (0 <= coord.z) &&
            (coord.x < brickSize[0]) &&
            (coord.y < brickSize[1]) &&
            (coord.z < brickSize[2])
    ) {
        global uint* word =
            c + 1 + (coord.z * wordsPerPlane) +
            (coord.y * wordsPerRow) +
            (coord.x >> BinaryWordShift);

        uint mask = (uint)(0x1 << ((BinaryWordBits - 1) - (uchar)(coord.x & 0x1f)));
        //if (pixel.w > lower && pixel.w < upper) {
        if (slicenr < 15) {
            *word |= mask;
        } else {
            *word &= ~mask;
        }
    }
}

Solution

Two issues:

1. You've declared "c" as "write_only", yet you use the "|=" and "&=" operators, which are read-modify-write operations.
2. As the other posters mentioned, if two work items access the same word, their read-modify-write sequences race with each other and cause errors. Atomic operations would avoid the race, but they are much slower than non-atomic operations, so while possible, they are not recommended.
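For reference, this is roughly what the atomic variant looks like, sketched in plain C11 with <stdatomic.h> rather than in OpenCL C. The function name write_bit_atomic is made up for this example. In a kernel, the counterparts would be the built-in atomic_or and atomic_and functions on a global uint pointer, which are assumed here to be available on the target device.

```c
#include <stdatomic.h>
#include <stdint.h>

/* Atomically set or clear the bit for voxel x in its shared word.
 * The read, the modification and the write happen as one indivisible
 * step, so concurrent updates to other bits of the same word are kept.
 * write_bit_atomic is a hypothetical name used only in this sketch. */
static void write_bit_atomic(_Atomic uint32_t *word, uint32_t x, int enabled)
{
    uint32_t mask = 1u << (31u - (x & 31u));   /* bit counted from the left */
    if (enabled)
        atomic_fetch_or(word, mask);
    else
        atomic_fetch_and(word, ~mask);
}
```

Even with atomics, the word must hold defined data before it is modified. The posted host code creates the output buffer without copying any host data into it, so the values that "|=" and "&=" read are not guaranteed to be anything in particular.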

I'd recommend making your output 8x larger and using one byte per voxel rather than one bit. Each work item then writes only its own byte, which makes the output genuinely write-only and removes the contention, and with it the race conditions.
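A minimal CPU-side sketch of the byte-per-voxel idea, in plain C. The function name threshold_slice_bytes and the tightly packed row layout are assumptions made for illustration; each loop iteration stands in for one work item, and the comparison is the one from the commented-out check in the post.

```c
#include <stddef.h>
#include <stdint.h>

/* One byte per voxel: position (x, y) writes only its own output element,
 * so nothing is read back and no two positions touch the same location.
 * threshold_slice_bytes is a hypothetical name; rows are assumed to be
 * tightly packed (no padding). */
static void threshold_slice_bytes(const uint16_t *pixels, uint8_t *out,
                                  size_t width, size_t height,
                                  uint16_t lower, uint16_t upper)
{
    for (size_t y = 0; y < height; y++) {
        for (size_t x = 0; x < width; x++) {   /* one work item each */
            uint16_t p = pixels[y * width + x];
            out[y * width + x] = (p > lower && p < upper) ? 1 : 0;
        }
    }
}
```

The cost is memory: the mask takes one byte per voxel instead of one bit.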

Alternatively, if data compactness or the output format matters, process 8 elements per work item and write the composite 8-bit result as a single byte. This is also write-only with no contention, and it keeps the compact layout.
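The packed variant could look like the following C sketch. The names pack8 and threshold_row_packed are hypothetical, the row width is assumed to be a multiple of 8, and the first pixel of each group is placed in the most significant bit to match the left-to-right bit order from the question.

```c
#include <stddef.h>
#include <stdint.h>

/* Pack the threshold results of 8 consecutive pixels into one byte,
 * first pixel in the most significant bit.
 * pack8 is a hypothetical helper name used only in this sketch. */
static uint8_t pack8(const uint16_t *pixels, uint16_t lower, uint16_t upper)
{
    uint8_t bits = 0;
    for (int i = 0; i < 8; i++) {
        int on = pixels[i] > lower && pixels[i] < upper;
        bits = (uint8_t)((bits << 1) | on);
    }
    return bits;
}

/* Each loop iteration stands in for one work item: it alone owns out[g],
 * so the byte is written once and never read back.
 * Assumes width is a multiple of 8. */
static void threshold_row_packed(const uint16_t *row, uint8_t *out,
                                 size_t width, uint16_t lower, uint16_t upper)
{
    for (size_t g = 0; g < width / 8; g++)
        out[g] = pack8(row + 8 * g, lower, upper);
}
```

If the result must remain bit-compatible with the existing 32-bit word layout, the position of each byte inside its word depends on the byte order of the machine. One way around this, under the same idea, is to handle 32 elements per work item and write a whole uint at once.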

Licensed under: CC-BY-SA with attribution
