Question

In my program I have one work deque for each block issued in the kernel launch. Each block loops, popping work from its own deque, processing it, and pushing any dynamically generated work back. An array of deque flags is maintained, indicating which deques are active, i.e., have work. If a block's deque is empty, the kernel enters another loop that tries to steal work from the deque of another block. The stop condition is reached when no deques are active anymore.

In a test I initialize every deque with 1 work item. My problem is that some blocks don't appear to run at all. Since they never run, their deques stay flagged as active and my program enters an infinite loop.

Now to the code. Kernel and auxiliary pop and push functions:

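//Reserves uiSize slots at the bottom of deque iBid, returning the old bottom through uiPushStartIdx; fails on overflow.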
bool __inline__ __device__ pushBottom(int *aiDequeBottoms , const int &iBid , const unsigned int &uiSize ,
unsigned int &uiPushStartIdx)   
{
int iOldBot = aiDequeBottoms[iBid];
uiPushStartIdx = iOldBot;
iOldBot += uiSize;
if(iOldBot < DEQUE_SIZE)
{
    aiDequeBottoms[iBid] = iOldBot;
    return true;
}
else
{
    return false;
}
}

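//Thief-side pop: takes up to WORK_STEALING_BATCH_SIZE items from the top of deque iBid; succeeds only if its atomicCAS on the age wins.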
bool __inline__ __device__ popTop(int *aiDequesBottoms , unsigned int *auiDequesAges , const int &iBid ,
int2 &popStartIdxAndSize)
{
int index;
unsigned int oldAge = auiDequesAges[iBid];
int localBot = aiDequesBottoms[iBid];
index = oldAge >> WORK_STEALING_TAG_NBITS;
if(localBot < index + 2*WORK_STEALING_BATCH_SIZE)
{
    return false;
}

int localTag = oldAge & WORK_STEALING_TAG_MASK;
int size = min(WORK_STEALING_BATCH_SIZE , localBot - index);
unsigned int newAge = ((index + size) << WORK_STEALING_TAG_NBITS) | localTag;

if(oldAge == atomicCAS(&auiDequesAges[iBid] , oldAge , newAge))
{
    popStartIdxAndSize.x = index;
    popStartIdxAndSize.y = size;
    return true;
}
else
{
    return false;
}
}

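//Owner-side pop: takes a batch from the bottom of deque iBid, falling back to an atomicCAS on the age when few items remain.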
bool __inline__ __device__ popBottom(int *aiDequesBottoms , unsigned int *auiDequesAges , const int &iBid ,
int2 &popStartIdxAndSize)
{
int localBot = aiDequesBottoms[iBid];
if(localBot == 0)
{
    return false;
}

int index = localBot;
localBot = localBot - WORK_STEALING_BATCH_SIZE;
aiDequesBottoms[iBid] = localBot;
unsigned int oldAge = auiDequesAges[iBid];
int oldAgeTop = int(oldAge >> WORK_STEALING_TAG_NBITS);
if(localBot > oldAgeTop)
{
    popStartIdxAndSize.y = WORK_STEALING_BATCH_SIZE;
    popStartIdxAndSize.x = index - WORK_STEALING_BATCH_SIZE;
    return true;
}

aiDequesBottoms[iBid] = 0;
unsigned int newAge = ((oldAge & WORK_STEALING_TAG_MASK) + 1) % (WORK_STEALING_TAG_MASK + 1);
if(index > oldAgeTop)
{
    if(oldAge == atomicCAS(&auiDequesAges[iBid] , oldAge , newAge))
    {
        popStartIdxAndSize.y = index - oldAgeTop;
        popStartIdxAndSize.x = index - popStartIdxAndSize.y;
        return true;
    }
}

auiDequesAges[iBid] = newAge;
return false;
}

//----------------------------------------------------------------------------------------------------------------------------
// Function to pop work from the deques. Each block tries to pop from its own deque. If no work is available, it tries to
// steal from other deques.
//----------------------------------------------------------------------------------------------------------------------------
template <typename Work>
bool __inline__ __device__ popWork(bool *abDequeFlags , int *aiDequesBottoms , unsigned int *auiDequesAges ,
const Work *aDeques , const int &iTid , const int &iBid , unsigned int &uiPopDequeIdx , int2 &popStartIdxAndSize ,
int &iLocalDequeCounter , bool &bPopFlag , unsigned int *uiActiveDeques , unsigned int &uiActiveDequesIdx , Work &work)
{
if(iTid == 0)
{   //Try to pop from block deque
    iLocalDequeCounter = 0;
    bPopFlag = popBottom(aiDequesBottoms , auiDequesAges , iBid , popStartIdxAndSize);
    if(bPopFlag)
    {
        uiPopDequeIdx = iBid;
    }
    else
    {
        abDequeFlags[iBid] = false;
    }
}
__syncthreads();

while(!bPopFlag)
{   //No more work, try to steal some!
    if(iTid == 0)
    {
        uiActiveDequesIdx = 0;
    }
    __syncthreads();

    if(iTid < NDEQUES)
    {
        if(abDequeFlags[iTid] == true) //assuming number of threads >= NDEQUES
        {   //Set this deque for a work stealing attempt.
            unsigned int uiIdx = atomicAdd(&uiActiveDequesIdx,1);
            uiActiveDeques[uiIdx] = iTid;
        }
    }
    __syncthreads();

    if(iTid == 0)
    {   //Try to steal until it succeeds or there are no deques left to search
        bPopFlag = false;
        for(uiPopDequeIdx = 0 ; uiPopDequeIdx < uiActiveDequesIdx && bPopFlag == false ; ++uiPopDequeIdx)
        {
            bPopFlag = popTop(aiDequesBottoms , auiDequesAges , uiPopDequeIdx , popStartIdxAndSize);
        }
    }
    __syncthreads();

    if(uiActiveDequesIdx == 0)
    { //No more work to steal. End.
        return false;
    }
}

//Get popped data
if(iTid < popStartIdxAndSize.y) //assuming number of threads >= WORK_SIZE
{
    work = aDeques[uiPopDequeIdx*DEQUE_SIZE + popStartIdxAndSize.x + iTid];
}

return true;
}

//----------------------------------------------------------------------------------------------------------------------------
// Function to push work onto the deques. To achieve better coalesced global memory accesses, the input data is assumed to be
// tightly packed in shared mem.
//----------------------------------------------------------------------------------------------------------------------------
template <typename Work>
bool __inline__ __device__ pushWork(int *aiDequesBottoms , Work *aDeques , const int &iTid , const int &iBid ,
const unsigned int &uiDequeOutputCounter , Work *aOutputLocalWork)
{
//Transfer to global mem.
unsigned int uiWorkLeft = uiDequeOutputCounter;
unsigned int uiThreadOffset = iTid;

while(uiWorkLeft > 0)
{
    unsigned int uiWorkTransfered = min(WORK_STEALING_BATCH_SIZE , uiWorkLeft);
    unsigned int uiPushStartIdx;
    bool bPushFlag;
    if(iTid == 0)
    {
        bPushFlag = pushBottom(aiDequesBottoms , iBid , uiWorkTransfered , uiPushStartIdx);
    }
    __syncthreads();
    if(!bPushFlag)
    {
        return false;
    }

    if(iTid < uiWorkTransfered)
    {
        aDeques[DEQUE_SIZE*iBid + uiPushStartIdx +  uiThreadOffset] = aOutputLocalWork[uiThreadOffset];
    }

    uiThreadOffset += WORK_STEALING_BATCH_SIZE;
    uiWorkLeft -= uiWorkTransfered;
}

return true;
}

void __global__ workKernel(bool *abDequeFlags , int *aiDequesBottoms , unsigned int *auiDequesAges , int2 *aDeques , 
int2 *aOutput , unsigned int *puiOutputCounter)
{
int iTid = threadIdx.x;
int iBid = blockIdx.x;

__shared__ int2 aOutputLocalWork[DEQUE_SHARED_SIZE];
__shared__ unsigned int uiPopDequeIdx;
__shared__ int2 popStartIdxAndSize;
__shared__ int iLocalDequeCounter;
__shared__ bool bPopFlag;
__shared__ unsigned int uiActiveDeques[NDEQUES]; //Contains indices for deques with useful work that can be stolen
__shared__ unsigned int uiActiveDequesIdx;
__shared__ unsigned int uiLastOutputCounter;

int2 work;
int iRun = 0;

while(true) //Work loop will continue until cannot pop from bottom or cannot steal work from other deques
{
    if(!popWork<int2>(abDequeFlags , aiDequesBottoms , auiDequesAges , aDeques , iTid , iBid , uiPopDequeIdx ,
        popStartIdxAndSize , iLocalDequeCounter , bPopFlag , uiActiveDeques , uiActiveDequesIdx , work))
    {   //No more work
        return; 
    }

    //Useful work comes here. For now, just some dummy code for testing.
    if(iRun < 5)
    {   //Just 5 iterations that generate more work
        if(iTid < popStartIdxAndSize.y)
        {
            unsigned int uiNewWorkCounter = 1;
            int iDequeOutputCounter = atomicAdd(&iLocalDequeCounter , uiNewWorkCounter);
            work.x++; work.y++;
            aOutputLocalWork[iDequeOutputCounter] = work;
            __syncthreads();

            if(iTid == 0)
            {
                uiLastOutputCounter = atomicAdd(puiOutputCounter , iLocalDequeCounter);
            }
            __syncthreads();

            if(iTid < iLocalDequeCounter) //assuming iLocalDequeCounter <= blockDim.x
            {
                aOutput[uiLastOutputCounter + iTid] = aOutputLocalWork[iTid];
            }
        }
    }

    //Push back to global mem
    if(!pushWork<int2>(aiDequesBottoms , aDeques , iTid , iBid , iLocalDequeCounter , aOutputLocalWork))
    {   //overflow
        return;
    }
    ++iRun;
}
}

And this is the test:

#define NDEQUES 256
#define DEQUE_SIZE 20000

void workStealingWrap(bool *abDequeFlags , int *auiDequesBottoms , unsigned int *auiDequesAges , int2 *aDeques , 
int2 *aOutput , unsigned int *puiOutputCounter)
{
workKernel<<<NDEQUES , WORK_STEALING_THREADS>>>(abDequeFlags , auiDequesBottoms , auiDequesAges , aDeques , aOutput , 
    puiOutputCounter);
CUT_CHECK_ERROR("workKernel");
}


//----------------------------------------------------------------------------------------------------------
// This entry point is for work stealing testing.
//----------------------------------------------------------------------------------------------------------
int main(int argc, char* argv[])
{
//Test 0: All deques start with 1 work item.
bool h_abDequeFlags[NDEQUES];
int h_aiDequesBottoms[NDEQUES];
unsigned int h_auiDequesAges[NDEQUES];
int2 *h_aDeques = (int2*) malloc(sizeof(int2)*NDEQUES*DEQUE_SIZE);
unsigned int h_uiOutputCounter;
int2 *h_aOutput = (int2*) malloc(sizeof(int2)*NDEQUES*DEQUE_SIZE);

for(int i = 0 ; i < NDEQUES ; ++i)
{
    h_abDequeFlags[i] = true;
    h_aiDequesBottoms[i] = 1;
    h_auiDequesAges[i] = 0;
    int2 work; work.x = i ; work.y = i;
    h_aDeques[DEQUE_SIZE*i] = work;
}

bool *d_abDequeFlags;
int *d_auiDequesBottoms;
unsigned int *d_auiDequesAges;
int2 *d_aDeques;
GPUMALLOC((void**)&d_abDequeFlags , sizeof(bool)*NDEQUES);
GPUMALLOC((void**)&d_auiDequesBottoms , sizeof(int)*NDEQUES);
GPUMALLOC((void**)&d_auiDequesAges , sizeof(unsigned int)*NDEQUES);
GPUMALLOC((void**)&d_aDeques , sizeof(int2)*NDEQUES*DEQUE_SIZE);

TOGPU(d_abDequeFlags , h_abDequeFlags , sizeof(bool)*NDEQUES);
TOGPU(d_auiDequesBottoms , h_aiDequesBottoms , sizeof(int)*NDEQUES);
TOGPU(d_auiDequesAges , h_auiDequesAges , sizeof(unsigned int)*NDEQUES);
TOGPU(d_aDeques , h_aDeques , sizeof(int2)*NDEQUES*DEQUE_SIZE);

int2 *d_aOutput;
unsigned int *d_puiOutputCounter;
GPUMALLOC((void**)&d_aOutput , sizeof(int2)*NDEQUES*DEQUE_SIZE);
GPUMALLOC((void**)&d_puiOutputCounter , sizeof(unsigned int));
GPUMEMSET(d_aOutput , -1 , sizeof(int2)*NDEQUES*DEQUE_SIZE);
GPUMEMSET(d_puiOutputCounter , 0 , sizeof(unsigned int));

workStealingWrap(d_abDequeFlags , d_auiDequesBottoms , d_auiDequesAges , d_aDeques , d_aOutput , d_puiOutputCounter);

FROMGPU(h_aOutput , d_aOutput , sizeof(int2)*NDEQUES*DEQUE_SIZE);
FROMGPU(&h_uiOutputCounter , d_puiOutputCounter , sizeof(unsigned int));
assert(h_uiOutputCounter == NDEQUES);
for(int i = 0 ; i < NDEQUES*DEQUE_SIZE ; ++i)
{
    int2 work = h_aOutput[i];
    if(i < NDEQUES)
    {
        assert(work.x >= 1 && work.x < NDEQUES*5 && work.y >= 1 && work.y < NDEQUES*5);
    }
    else
    {
        assert(work.x == -1 && work.y == -1);
    }
}

GPUFREE(d_abDequeFlags);
GPUFREE(d_auiDequesBottoms);
GPUFREE(d_auiDequesAges);
GPUFREE(d_aDeques);
GPUFREE(d_aOutput);
GPUFREE(d_puiOutputCounter);

safeFree(h_aDeques);
safeFree(h_aOutput);
}

Debugging this code with NSight I have verified that only the first 8 blocks are running. I'm wondering whether this is a scheduling problem, with the popWork polling consuming all the resources, or just a bug in my program. Any help would be much appreciated.

Solution

I found some problems in the code. I didn't run it, so I don't know to what extent this is relevant.

popTop can either succeed or fail, and the result influences bPopFlag: its atomicCAS can lose against a concurrent pop, so you can end up with a nonzero uiActiveDequesIdx while bPopFlag is still false.

  • In that case, consider the while(!bPopFlag) loop in popWork. Imagine two warps reaching the last __syncthreads() in the loop. The first warp checks uiActiveDequesIdx (assume it is nonzero), goes back to the beginning of the loop, and thread 0 sets uiActiveDequesIdx to 0. The second warp then resumes execution from that same __syncthreads(), reads uiActiveDequesIdx (which is now 0) and quits the loop. From this point onward your warps have diverged across __syncthreads() and bad things will happen. See the first sketch after this list.

  • In pushWork, bPushFlag is a local register. It is written only by the iTid == 0 thread but then read by all threads, so the others may read an uninitialized value and the warps diverge at further __syncthreads() calls. See the second sketch after this list.
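
A minimal sketch of one way to close the first race (a fragment of the posted popWork; everything elided stays as in the original): a trailing __syncthreads() keeps every warp in the same iteration, so thread 0 cannot reset uiActiveDequesIdx while another warp is still testing it:

while(!bPopFlag)
{   //No more work, try to steal some!
    if(iTid == 0)
    {
        uiActiveDequesIdx = 0;
    }
    __syncthreads();

    //... collect the active deques and attempt the steals, as in the original ...

    __syncthreads();
    if(uiActiveDequesIdx == 0)
    {   //No more work to steal. End.
        return false;
    }
    __syncthreads(); //added barrier: no warp may re-enter the loop and reset
                     //uiActiveDequesIdx before all warps pass the check above
}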
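
For the second point, a minimal sketch assuming nothing else in pushWork changes: hoist the flag into shared memory so every thread tests the value thread 0 wrote. Note that uiPushStartIdx is read by all threads during the copy as well, so it likely needs the same treatment:

__shared__ bool bPushFlag;              //written by thread 0, read by the whole block
__shared__ unsigned int uiPushStartIdx; //same issue: every thread uses it for the copy

if(iTid == 0)
{
    bPushFlag = pushBottom(aiDequesBottoms , iBid , uiWorkTransfered , uiPushStartIdx);
}
__syncthreads(); //publish bPushFlag and uiPushStartIdx to the whole block
if(!bPushFlag)
{
    return false;
}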

There may be more problems that I didn't spot yet. In particular, I am worried that you may have skipped the __threadfence() needed when setting your flags in global memory (check the Programming Guide for what it does).
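
For example (an illustrative pattern, not the posted code; iNewIdx and iNewBot are hypothetical names): the payload write should be fenced before updating the bottom pointer or flag that other blocks poll:

aDeques[DEQUE_SIZE*iBid + iNewIdx] = work; //write the payload first
__threadfence();                           //order the payload before the publish
aiDequesBottoms[iBid] = iNewBot;           //only now may other blocks observe it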

Also, double-check the other loops for synchronization problems similar to the one reported above, where some warps are still in the old iteration of a loop while others are already in the new one. A good general approach: if you have a non-divergent loop that depends on shared values, always put a __syncthreads() at the end of the loop body, and only try removing it later as a fine-grained optimization.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow