Question

If I have a kernel which looks back over the last X minutes and calculates the average of all the values in a float[], would I experience a performance drop if the threads are not all executing the same line of code at the same time?

E.g.: say at x = 1500 there are 500 data points spanning the last 2-hour period.

At x = 1510 there are 300 data points spanning the last 2-hour period.

The thread at x = 1500 has to look back 500 places, yet the thread at x = 1510 only looks back 300, so the latter thread will move on to the next position before the first thread is finished.

Is this typically an issue?

EDIT: Example code. Sorry, but it's in C#, as I was planning to use CUDAfy.net. Hopefully it gives a rough idea of the kind of programming structures I need to run (the actual code is more complicated, but has a similar structure). Any comments on whether this is suitable for a GPU/coprocessor or just a CPU would be appreciated.

[Cudafy]
public static void PopulateMeanArray(GThread thread, float[] data)
{
    float lookFwdDistance = 108000000000f;    // Look-forward window, in ticks.
    float lookBkDistance = 12000000000f;      // Look-back window, in ticks.
    int counter = thread.blockIdx.x * 1000;   // Ensures a unique region of data is written to (assuming I have fewer than 1000 entries).
    int numberOfTicksInLookBack = 0;
    float sum = 0;    // Stores the sum of differences between consecutive time ticks during the X-min look back.

    // Note: the time difference between ticks is not consistent, so numberOfTicksInLookBack differs at each position.
    // Thread 1 could be working here.
    int startPosition = thread.blockIdx.x;
    for (int tickPosition = startPosition; SDS.tick[tickPosition] < SDS.tick[startPosition] + lookFwdDistance; tickPosition++)
    {
        sum = 0;
        numberOfTicksInLookBack = 0;

        // Thread 2 could be working here. Is this warp divergence?
        for (int pastPosition = tickPosition - 1; SDS.tick[pastPosition] > SDS.tick[tickPosition] - lookBkDistance; pastPosition--)
        {
            sum += SDS.tick[pastPosition + 1] - SDS.tick[pastPosition];
            numberOfTicksInLookBack++;
        }
        data[counter] = sum / numberOfTicksInLookBack;
        counter++;
    }
}

Solution

CUDA runs threads in groups called warps. On all CUDA architectures implemented so far (up to compute capability 3.5), the size of a warp is 32 threads. Only threads in different warps can truly be at different locations in the code; within a warp, threads are always at the same location. Any threads that should not be executing the code at a given location are disabled as that code is executed. The disabled threads then just take up room in the warp, and their corresponding processing cycles are lost.
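For intuition, here is a minimal CUDAfy-style sketch (a hypothetical kernel, not your code, assuming the usual [Cudafy] setup) in which the two halves of each 32-thread warp take different branches. The warp executes both paths one after the other, masking off the threads that do not belong to the path currently being executed:

[Cudafy]
public static void DivergentKernel(GThread thread, float[] data)
{
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    if (thread.threadIdx.x % 32 < 16)
        data[i] *= 2f;    // Executed first; lanes 16-31 are disabled during this line.
    else
        data[i] += 1f;    // Executed second; lanes 0-15 are disabled during this line.
}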

In your algorithm, you get warp divergence because the exit condition in the inner loop is not satisfied at the same time for all the threads in the warp. The GPU must keep executing the inner loop until the exit condition is satisfied for ALL the threads in the warp. As more threads in a warp reach their exit condition, they are disabled by the machine and represent lost processing cycles.

In some situations, the lost processing cycles may not impact performance, because disabled threads do not issue memory requests. This is the case if your algorithm is memory bound and the memory that would have been required by the disabled thread was not included in the read done by one of the other threads in the warp. In your case, though, the data is arranged in such a way that accesses are coalesced (which is a good thing), so you do end up losing performance in the disabled threads.
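For reference, coalescing means that the 32 threads of a warp touch one contiguous, aligned span of memory that the hardware can service in a single transaction. A hypothetical sketch of the two access patterns (illustrative kernels, not from your code):

[Cudafy]
public static void CoalescedRead(GThread thread, float[] src, float[] dst)
{
    // Adjacent threads read adjacent floats: roughly one memory transaction per warp.
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    dst[i] = src[i];
}

[Cudafy]
public static void StridedRead(GThread thread, float[] src, float[] dst, int stride)
{
    // Adjacent threads read floats 'stride' apart: up to 32 transactions per warp.
    int i = thread.blockIdx.x * thread.blockDim.x + thread.threadIdx.x;
    dst[i] = src[i * stride];
}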

Your algorithm is very simple and, as it stands, it does not fit that well on the GPU. However, I think the same calculation can be dramatically sped up on both the CPU and the GPU with a different algorithm that uses an approach more like that used in parallel reductions. I have not considered how that might be done in a concrete way, though.
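To give a flavour of the reduction pattern I mean, here is the standard shared-memory sum reduction in CUDAfy-style C# (a textbook kernel, not a worked-out replacement for your calculation; the 256-thread block size is an assumption):

[Cudafy]
public static void BlockSum(GThread thread, float[] src, float[] partialSums)
{
    // Each 256-thread block reduces 256 elements of 'src' to a single partial sum.
    float[] cache = thread.AllocateShared<float>("cache", 256);
    int tid = thread.threadIdx.x;
    cache[tid] = src[thread.blockIdx.x * 256 + tid];
    thread.SyncThreads();

    // Tree reduction: halve the number of active threads at each step.
    for (int s = 128; s > 0; s /= 2)
    {
        if (tid < s)
            cache[tid] += cache[tid + s];
        thread.SyncThreads();
    }

    if (tid == 0)
        partialSums[thread.blockIdx.x] = cache[0];
}

The if (tid < s) branch also diverges, but only in the last five steps, when fewer than 32 threads remain active; for s >= 32, whole warps are either fully active or fully idle, so the bulk of the work is divergence-free.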

A simple thing to try, for a potentially dramatic increase in speed on the CPU, would be to alter your algorithm so that the inner loop iterates forwards instead of backwards. This is because CPUs do cache prefetching, which only works when you iterate forwards through your data.
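As a sketch of what that could look like on the CPU, in plain C# (the names here are illustrative; ticks plays the role of SDS.tick): first locate the start of the look-back window, then accumulate forwards from it.

public static float MeanTickInterval(float[] ticks, int tickPosition, float lookBkDistance)
{
    // Locate the first tick inside the look-back window. Since the window start
    // only ever moves forward as tickPosition advances, a real implementation
    // could maintain it incrementally instead of searching every time.
    int windowStart = tickPosition;
    while (windowStart > 0 && ticks[windowStart - 1] > ticks[tickPosition] - lookBkDistance)
        windowStart--;

    float sum = 0;
    int count = 0;
    for (int p = windowStart; p < tickPosition; p++)    // Forwards: prefetch-friendly.
    {
        sum += ticks[p + 1] - ticks[p];
        count++;
    }
    return count > 0 ? sum / count : 0f;
}

(Incidentally, in this simplified form the sum telescopes to ticks[tickPosition] - ticks[windowStart], so the inner loop could be eliminated entirely; presumably your real calculation does not collapse like that.)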
