Why is this compute shader so much slower than vertex shader?

Question

I’m just learning how to work with compute shaders, so I’m not an expert. Regarding your bone calculation I’m sure that the CS should work at least as fast as the VS. Intuition tells me that numthreads (64,1,1) is less efficient than something like numthreads (16,16,1). So you could give this approach a try:

Treat your linear buffer as if it had a quadratic layout, with x and y size being the same
Compute x/y-size as size = ceil (sqrt (numvertices))
Use dispatch(size / 16, size / 16) in your program and numthreads (16,16,1) in your hlsl file
Allocate a constant buffer where you copy your size and numvertices values
Instead of using id.x as index, you calculate your own (linear) index as int index = id.y * size +id.x), (maybe id.xy is also possible as index)
In most cases size * size will be greater than numvertices, so you’ll end up with more threads than vertices. You can block these extra threads by adding a condition in your hlsl function:
```
int index = id.y * size +id.x;
if (index < numvertices) { .. // your code follows
```

I hope that this approach speeds up your CS calculations.

================ EDIT ==================

My suggestion was based on my own timing tests. In order to verify my case I repeated these tests with more variances of the numthreads parameters. I calculate the mandelbrot set over 1034 x 827 = 855,118 pixels. Here the results:

numthreads       Dispatch      groups  threads/  total
  x   y    fps     x     y             group     threads

  4   4    240    259   207    53445     16     855118
  8   8    550    129   103    13361     64     855118
 16  16    600     65    52     3340    256     855118
 32  32    580     32    26      835   1024     855118
 64   1    550     16   827    13361     64     855118
256   1    460      4   827     3340    256     855118
512   1    370      2   827     1670    512     855118

As you can see, the sweet spot - numthreads(16,16,1) - creates the same #of thread groups (3340) as numthreads(256,1,1), but the performance is 30% better. Please note that the total thread count is (and must be) always the same! My GPU is a ATI 7790.

================ EDIT 2 ==================

In order to investigate deeper into your question about CS vs. VS speed I have re-viewed a very interesting channel 9 video (PDC09 presentation, held by Microsoft chief architect Chas Boyd about direct compute, see link below). In this presentation Boyd states that optimizing the thread layout (numthreads) can lead to twofold increase of throughput.

More interesting however is the part of his presentation (starting at minute 40) where he explains the correlation between UAVs and GPU memory layout (“Graphics vs. Compute I/O”). I don’t want to draw wrong conclusions from Boyds statements, but it seems at least possible, that Compute shaders bound via UAVs do have a lower memory bandwidth than other GPU shaders. If this were true we might have an explanation for the fact that UAVs can’t be bound to VS, for example (at least in version 11.0).

Since these memory access patterns also depend on hardware design, you should escalate your question directly to ATI / NVIDIA engineers.

CONCLUSION

I have absorbed tons of information about CS usage, but there was not the slightest indication that CS could run the same algorithm slower than VS. If that is really the case you have detected something that matters for all folks who use direct compute.

link: http://channel9.msdn.com/Events/PDC/PDC09/P09-16