What might cause "Undefined Behaviour" in this parallel GPU code?

https://stackoverflow.com/questions/21705865

10-10-2022
|

문제

Lets assume core1 and core2 try writing their variables a and b to same memory location.

How can UB be explained here?

We dont know if a or b is written to that memory location(as a last action).
We dont even know what is written there (a garbage)
Even the target memory address can be miscalculated(segfault?).
Some logical gates make wrong currents and CPU disables itself
CPU's frequency information becomes corrupt and goes high overclock(and break itself)

Can I assume only the first option is valid for all vendors of CPU( and GPU)?

I just converted below code into a parallel GPU code and it seems to be working fine.

Generic code:

for (j=0; j<YRES/CELL; j++) // this is parallelized
        for (i=0; i<XRES/CELL; i++) // this is parallelized
        {
            r = fire_r[j][i];
            g = fire_g[j][i];
            b = fire_b[j][i];
            if (r || g || b)
                for (y=-CELL; y<2*CELL; y++)
                    for (x=-CELL; x<2*CELL; x++)
                        addpixel(i*CELL+x, j*CELL+y, r, g, b, fire_alpha[y+CELL][x+CELL]);
   //addpixel accesses neighbour cells' informations and writes on them
   //and makes UB
            r *= 8;
            g *= 8;
            b *= 8;
            for (y=-1; y<2; y++)
                for (x=-1; x<2; x++)
                    if ((x || y) && i+x>=0 && j+y>=0 && i+x<XRES/CELL && j+y<YRES/CELL)
                    {
                        r += fire_r[j+y][i+x];
                        g += fire_g[j+y][i+x];
                        b += fire_b[j+y][i+x];
                    }
            r /= 16;
            g /= 16;
            b /= 16;
            fire_r[j][i] = r>4 ? r-4 : 0; // UB
            fire_g[j][i] = g>4 ? g-4 : 0; // UB
            fire_b[j][i] = b>4 ? b-4 : 0;
        }

Opencl:

"   int i=get_global_id(0); int j=get_global_id(1);"
"   int VIDXRES="+std::to_string(kkVIDXRES)+";"
                        "   int VIDYRES="+std::to_string(kkVIDYRES)+";"
                        "   int XRES="+std::to_string(kkXRES)+";"
                        "   int CELL="+std::to_string(kkCELL)+";"
                        "   int YRES="+std::to_string(kkYRES)+";"

                        "   int x=0,y=0,r=0,g=0,b=0,nx=0,ny=0;"

                        "       r = fire_r[j*(XRES/CELL)+i];"
                        "       g = fire_g[j*(XRES/CELL)+i];"
                        "       b = fire_b[j*(XRES/CELL)+i];"

                        "       int counterx=0;"
                        "       if (r || g || b)"
                        "       for (y=-CELL; y<2*CELL; y++){"
                        "       for (x=-CELL; x<2*CELL; x++){"
                        "       addpixel(i*CELL+x, j*CELL+y, r, g, b, fire_alpha[(y+CELL)*(3*CELL)+(x+CELL)],vid,vido);"
                        "       }}"

                        "       r *= 8;"
                        "       g *= 8;"
                        "       b *= 8;"
                        "       for (y=-1; y<2; y++){"
                        "       for (x=-1; x<2; x++){"
                        "       if ((x || y) && i+x>=0 && j+y>=0 && i+x<XRES/CELL && j+y<YRES/CELL)"
                        "       {"
                        "           r += fire_r[(j+y)*(XRES/CELL)+(i+x)];"
                        "           g += fire_g[(j+y)*(XRES/CELL)+(i+x)];"
                        "           b += fire_b[(j+y)*(XRES/CELL)+(i+x)];"
                        "       }}}"
                        "       r /= 16;"
                        "       g /= 16;"
                        "       b /= 16;"
                        "       fire_r[j*(XRES/CELL)+i] = (r>4 ? r-4 : 0);"
                        "       fire_g[j*(XRES/CELL)+i] = (g>4 ? g-4 : 0);"
                        "       fire_b[j*(XRES/CELL)+i] = (b>4 ? b-4 : 0);"

Here is picture of some rare artifacts of a 2D NDrangeKernel 's local boundary UB. Can these kill my GPU?

enter image description here

해결책

On xf86 and xf86_64 architectures it means We dont know if a or b is written to that memory location(as a last action), because load/store operations of 32 (for both) or 64 bit (xf86_64 only) memory aligned datatypes are atomic.

On other architectures usually We dont even know what is written there (a garbage) is a valid answer - for sure on RISC architectures, I currently don't know on GPU's.

Note that The fact the code works doesn't imply that it is correct and in the 99% of the times it's the source of sentences like "there's a compiler bug, the code was working until the previous version" or "the code works on the development machine. The server selected for production is broken" :)

EDIT:

On NVidia GPUs we have weakly-ordered memory model. In the description on the Cuda C Programming guide it's not explicitly stated that store operations are atomic. The write operations come from the same thread, so it does not mean that load/store operations are atomic.

다른 팁

For the code above, IMHO the first option is the only possible one. Basically, if you assume that you have enough threads/processors to execute all the loops in parallel, the inner nested loops (the x and y ones) will have undetermined values.

For example, if we consider only the

r += fire_r[j+y][i+x];

section, the value at fire_r[j+y][i+x] can be the original one just as well as the result of another instance of the same loop being finished in another thread.

라이센스 : CC-BY-SA ~와 함께 속성

제휴하지 않습니다 StackOverflow