I’m just learning how to work with compute shaders, so I’m not an expert. Regarding your bone calculation, I’m sure that the CS should run at least as fast as the VS. Intuition tells me that `numthreads(64,1,1)` is less efficient than something like `numthreads(16,16,1)`.
So you could give this approach a try:
- Treat your linear buffer as if it had a quadratic layout, with the x and y sizes being the same.
- Compute the x/y size as `size = ceil(sqrt(numvertices))`.
- Call `Dispatch((size + 15) / 16, (size + 15) / 16, 1)` in your program (rounding up, so no vertices are missed when `size` is not a multiple of 16) and declare `numthreads(16,16,1)` in your HLSL file.
- Allocate a constant buffer where you copy your `size` and `numvertices` values.
- Instead of using `id.x` as the index, calculate your own (linear) index as `int index = id.y * size + id.x;` (maybe `id.xy` is also possible as an index).
- In most cases `size * size` will be greater than `numvertices`, so you’ll end up with more threads than vertices. You can block these extra threads by adding a condition in your HLSL function:

```hlsl
int index = id.y * size + id.x;
if (index < numvertices)
{
    // your code follows
}
```
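Putting the pieces above together, a minimal shader-side sketch might look like this (the constant-buffer layout, register slots, and buffer names are assumptions, not your actual code):

```hlsl
// Constant buffer holding the square-layout size and the real vertex count
cbuffer Params : register(b0)
{
    uint size;         // ceil(sqrt(numvertices))
    uint numvertices;  // actual number of vertices
};

StructuredBuffer<float4>   input  : register(t0);
RWStructuredBuffer<float4> output : register(u0);

[numthreads(16, 16, 1)]
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Map the 2D thread id back to a linear buffer index
    uint index = id.y * size + id.x;

    // Block the extra threads created by the padded square layout
    if (index < numvertices)
    {
        output[index] = input[index]; // your bone calculation goes here
    }
}
```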
I hope that this approach speeds up your CS calculations.
================ EDIT ==================
My suggestion was based on my own timing tests. To verify my case, I repeated these tests with more variations of the numthreads parameters. I calculate the Mandelbrot set over 1034 x 827 = 855,118 pixels. Here are the results:
| numthreads (x, y) | fps | Dispatch (x, y) | groups | threads/group | total threads |
|-------------------|-----|-----------------|--------|---------------|---------------|
| (4, 4)            | 240 | (259, 207)      | 53445  | 16            | 855118        |
| (8, 8)            | 550 | (129, 103)      | 13361  | 64            | 855118        |
| (16, 16)          | 600 | (65, 52)        | 3340   | 256           | 855118        |
| (32, 32)          | 580 | (32, 26)        | 835    | 1024          | 855118        |
| (64, 1)           | 550 | (16, 827)       | 13361  | 64            | 855118        |
| (256, 1)          | 460 | (4, 827)        | 3340   | 256           | 855118        |
| (512, 1)          | 370 | (2, 827)        | 1670   | 512           | 855118        |
As you can see, the sweet spot, numthreads(16,16,1), creates the same number of thread groups (3340) as numthreads(256,1,1), but the performance is 30% better. Please note that the total thread count is (and must be) always the same! My GPU is an ATI 7790.
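For context, the only shader-side difference between the rows above is the `numthreads` attribute; the benchmarked kernel has roughly this shape (the output texture name is a placeholder, and the hardcoded bounds match the 1034 x 827 test image):

```hlsl
RWTexture2D<float4> output : register(u0);

[numthreads(16, 16, 1)]   // varied per row of the table above
void CSMain(uint3 id : SV_DispatchThreadID)
{
    // Guard against padded threads at the right/bottom edges
    if (id.x < 1034 && id.y < 827)
    {
        // Mandelbrot iteration for pixel (id.x, id.y) goes here
        output[id.xy] = float4(0, 0, 0, 1);
    }
}
```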
================ EDIT 2 ==================
To investigate your question about CS vs. VS speed more deeply, I re-watched a very interesting Channel 9 video (a PDC09 presentation by Microsoft chief architect Chas Boyd about DirectCompute, see link below). In this presentation Boyd states that optimizing the thread layout (numthreads) can lead to a twofold increase in throughput.
More interesting, however, is the part of his presentation (starting at minute 40) where he explains the correlation between UAVs and GPU memory layout (“Graphics vs. Compute I/O”). I don’t want to draw wrong conclusions from Boyd’s statements, but it seems at least possible that compute shaders bound via UAVs have lower memory bandwidth than other GPU shaders. If this were true, we might have an explanation for the fact that UAVs can’t be bound to the VS, for example (at least in version 11.0).
Since these memory access patterns also depend on hardware design, you should escalate your question directly to ATI / NVIDIA engineers.
CONCLUSION
I have absorbed tons of information about CS usage, but there was not the slightest indication that a CS could run the same algorithm slower than a VS. If that is really the case, you have discovered something that matters to everyone who uses DirectCompute.