Question

I am just beginning to learn DirectX programming, using F# and SharpDX as .NET wrapper. As a test case I render the Mandelbrot set. The computation is done using 2 compute shaders.

The first shader computes the depth for each pixel (function "CalcMandel"); the results are stored in a RWStructuredBuffer. This computation requires a lot of single- or double-precision multiplications, yet it's blindingly fast on my GPU (AMD 7790). "CalcMandel" has the attribute

[numthreads(16, 16, 1)]

and is dispatched via

context.Dispatch (imageWidth / 16, imageHeight / 16, 1)

NO PROBLEM here – a 1000 x 800 pixel image of the “core” Mandelbrot set runs with over 1000 fps (using single precision on the GPU).


The second shader has almost nothing to do: it computes the min, max, and average of the previous computation (function "CalcMinMax"). "CalcMinMax" has the attribute

[numthreads(1, 1, 1)]

and is called via

context.Dispatch (1,1,1)

For the given image size, a single GPU thread has to traverse a buffer of 800,000 integers to calculate the min, max, and average. I use a single thread because I have no idea how to implement this calculation in a parallel fashion.

THE PROBLEM: “CalcMinMax” is terribly slow: the frame rate drops from over 1000 to 5 fps!

MY QUESTIONS: What’s wrong here? Am I using the wrong setup / parameters (numthreads)? How can I speed up the min-max calculation?

MY THOUGHTS: My first assumption was that accessing the RWBuffer might be slow – this is not the case. When I replaced the buffer access by a constant, the frame rate did not increase.

My GPU has approx. 900 shader cores and uses thousands of threads to compute the Mandelbrot set, while "CalcMinMax" uses only one thread. Yet I still don't understand why things get this slow.

I would appreciate any advice!

================================================

// HLSL CONTENT (Mandelbrot set calculation is omitted):

cbuffer cbCSMandel : register( b0 )
{
    double a0, b0, da, db;
    double ja0, jb0;
    int    max_iterations;
    bool   julia;
    int    cycle;
    int    width;
    int    height;
    double colorFactor;
    int    algoIndex;
    int    step;
};


struct statistics
{
    int   minDepth;
    int   maxDepth;   // was "axDepth" (typo): calcMinMax writes statsTable[0].maxDepth
    float avgDepth;
    int   loops;
};

RWStructuredBuffer<float4>     colorOutputTable : register(u0);
StructuredBuffer<float4>       output2          : register(t0);
RWStructuredBuffer<int>        counterTable     : register(u1);
RWStructuredBuffer<float4>     colorTable       : register(u2);

RWStructuredBuffer<statistics> statsTable       : register(u3);


// Mandelbrot calculations….
// Results are written to counterTable and colorOutputTable


// I limit the samples to 10000 pixels because calcMinMax is too slow
#define NUM_SAMPLES 10000   // no trailing semicolon: it would be pasted into every expansion

void calcMinMax()
{
    int   minDepth = 64000;
    int   maxDepth = 0;
    int   len      = width * height;
    int   crit     = len / NUM_SAMPLES;
    int   steps    = max(crit, 1);
    int   index    = 0;
    int   sumCount = 0;
    float sum      = 0.0f;

    while (index < len)
    {
        int cnt = counterTable[index];
        minDepth  = (cnt < minDepth && cnt > 0) ? cnt : minDepth;   // logical && instead of bitwise &
        maxDepth  = cnt > maxDepth ? cnt : maxDepth;
        sum      += cnt > 0 ? cnt : 0.0f;
        sumCount += cnt > 0 ? 1 : 0;
        index    += steps;
    }

    statsTable[0].minDepth = minDepth;
    statsTable[0].maxDepth = maxDepth;
    statsTable[0].avgDepth = sum / sumCount;
    statsTable[0].loops   += 1;
}


[numthreads(1, 1, 1)]
void CalcMinMax( uint3 Gid : SV_GroupID, uint3 DTid : SV_DispatchThreadID,
                 uint3 GTid : SV_GroupThreadID, uint GI : SV_GroupIndex )
{
    switch (GI)     // this switch is only used to verify the GI number (always 0)
    {
        case 0:
            calcMinMax();
            break;
        default:
            break;
    }
}

// ******************* F# program (min-max part) *************

shader setup:

use minMaxShaderCode = ShaderBytecode.CompileFromFile(shaderPath, "CalcMinMax", "cs_5_0")                                                                
minMaxShader <- new ComputeShader(device, minMaxShaderCode.Bytecode.Data  )  

shader usage:

// ---------- CONNECT MinMax Shader
context.ComputeShader.Set(minMaxShader)    
context.ComputeShader.SetUnorderedAccessView(STATS_SLOT, statsBuffer.BufferView) 

context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, constantBuffer)
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, dataBuffer.BufferView)  
context.Dispatch (1,1,1)

// ---------- DISCONNECT MinMax Shader
context.ComputeShader.SetConstantBuffer(CONSTANT_SLOT, null)
context.ComputeShader.SetUnorderedAccessView (STATS_SLOT, null) 
context.ComputeShader.SetUnorderedAccessView (COUNTER_SLOT, null) 
context.ComputeShader.Set (null) 

reading statistics:

context.CopyResource(statsBuffer.DataBuffer, statsBuffer.StagingBuffer)
let boxer, stream = context.MapSubresource(statsBuffer.StagingBuffer, MapMode.Read, MapFlags.None)
calcStatistics <- stream.Read<CalcStatistics>()
context.UnmapSubresource(statsBuffer.StagingBuffer, 0)   // unmap the staging buffer that was mapped, not DataBuffer

Solution

If you dispatch only one thread, every shader unit on your GPU but one will sit idle waiting for that thread to finish. You need to parallelize your min/max algorithm, and since you have to collapse an array of values into a single value, this is a classic reduce problem. A more efficient approach is to compute local min/max values recursively. A detailed explanation, with an example of summing the values of an array, can be seen here (starting at slide 19).
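As a rough illustration of the first pass of such a reduction, the sketch below lets each 256-thread group reduce its slice of `counterTable` into one partial result via group-shared memory. The names `partial`, `partialTable`, `ReducePass1`, and `GROUP_SIZE`, as well as the `u4` register, are hypothetical; this is a sketch of the technique from the slides, not the asker's code:

```hlsl
// Sketch: pass 1 of a parallel min/max/avg reduction.
// Dispatch with context.Dispatch(len / GROUP_SIZE, 1, 1).
#define GROUP_SIZE 256

struct partial
{
    int   minDepth;
    int   maxDepth;
    float sum;
    int   count;
};

RWStructuredBuffer<int>     counterTable : register(u1);
RWStructuredBuffer<partial> partialTable : register(u4);   // one entry per thread group

groupshared int   sMin[GROUP_SIZE];
groupshared int   sMax[GROUP_SIZE];
groupshared float sSum[GROUP_SIZE];
groupshared int   sCnt[GROUP_SIZE];

[numthreads(GROUP_SIZE, 1, 1)]
void ReducePass1(uint3 DTid : SV_DispatchThreadID, uint GI : SV_GroupIndex,
                 uint3 Gid : SV_GroupID)
{
    // Each thread loads one counter into group-shared memory.
    int cnt  = counterTable[DTid.x];
    sMin[GI] = cnt > 0 ? cnt : 64000;          // ignore zero counters for the min
    sMax[GI] = cnt;
    sSum[GI] = cnt > 0 ? (float)cnt : 0.0f;
    sCnt[GI] = cnt > 0 ? 1 : 0;
    GroupMemoryBarrierWithGroupSync();

    // Tree reduction: halve the number of active threads each step.
    for (uint s = GROUP_SIZE / 2; s > 0; s >>= 1)
    {
        if (GI < s)
        {
            sMin[GI] = min(sMin[GI], sMin[GI + s]);
            sMax[GI] = max(sMax[GI], sMax[GI + s]);
            sSum[GI] += sSum[GI + s];
            sCnt[GI] += sCnt[GI + s];
        }
        GroupMemoryBarrierWithGroupSync();
    }

    // Thread 0 writes this group's partial result.
    if (GI == 0)
    {
        partialTable[Gid.x].minDepth = sMin[0];
        partialTable[Gid.x].maxDepth = sMax[0];
        partialTable[Gid.x].sum      = sSum[0];
        partialTable[Gid.x].count    = sCnt[0];
    }
}
```

A second, much smaller pass (or a CPU loop) then combines the per-group partials into the final statistics.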

OTHER TIPS

Thank you very much for your feedback. My question is answered.

In his response akhanubis shared a link to a PDF document describing the map-reduce problem on GPUs. Before I posted my question on Stack Overflow I searched the internet extensively – and I had already found this paper and read it twice!

Why did I still miss the point? A reduction time of 8 ms over a 4M array in the worst case seemed acceptable to me (having only 800,000 points). But I didn't realize that even the worst case in the demo was at least 100 times faster than my single-threaded approach, because it employs 128 thread groups.

I will use the concepts in the paper to implement a multithreaded version of my min-max-calculation.
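For completeness, the second pass of such a reduction stays cheap even single-threaded, because it only visits the per-group partials (about 3,125 for ~800,000 pixels at 256 threads per group). This is a hedged sketch; `partialTable`, `numPartials` (assumed to arrive via a constant buffer), and `ReducePass2` are hypothetical names, not part of the original code:

```hlsl
// Sketch: pass 2 combines the per-group partials from pass 1 into the
// final statistics. One thread is fine here -- the input is tiny.
[numthreads(1, 1, 1)]
void ReducePass2()
{
    int   minDepth = 64000;
    int   maxDepth = 0;
    float sum      = 0.0f;
    int   count    = 0;

    for (int i = 0; i < numPartials; ++i)   // numPartials = number of groups in pass 1
    {
        minDepth = min(minDepth, partialTable[i].minDepth);
        maxDepth = max(maxDepth, partialTable[i].maxDepth);
        sum     += partialTable[i].sum;
        count   += partialTable[i].count;
    }

    statsTable[0].minDepth = minDepth;
    statsTable[0].maxDepth = maxDepth;
    statsTable[0].avgDepth = sum / count;
}
```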

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow