Each of the "numbered" sections is a single VLIW instruction, for example:
4   x: MOV   R2.x, 0.0f
    y: MOV   R3.y, 0.0f
    w: LSHL  ____, R2.y, 2
This is one instruction that uses three of the five available ALUs, namely "x", "y" and "w". It could additionally use "z" and "t", which would bring it to the maximum of five parallel operations, like:
4   x: MOV   R2.x, 0.0f
    y: MOV   R3.y, 0.0f
    z: MOV   R3.z, 0.0f
    w: LSHL  ____, R2.y, 2
    t: LSHL  ____, R2.z, 4
Still, this is a single VLIW instruction that 'feeds' all five ALU 'lanes' of a shader core, and the five operations are then executed in parallel in a single step.
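To illustrate this at the OpenCL source level: the following kernel is purely hypothetical (it is not the source of the dump above), but its four mutually independent scalar operations are exactly the kind of thing the compiler can pack into one such bundle, one per lane:

    __kernel void vliwFriendly(__global const float *in, __global float *out)
    {
        size_t i = get_global_id(0);
        /* Four mutually independent scalar operations: candidates for the
           "x", "y", "z" and "w" lanes of a single VLIW bundle. */
        float a = in[i] + 1.0f;
        float b = in[i] * 2.0f;
        float c = in[i] - 3.0f;
        float d = in[i] * in[i];
        out[i] = (a + b) + (c + d);
    }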
But what's 't'? Is that another vector unit?
Yes, "t" is a fifth scalar unit, dubbed "transcendental", which can be used to perform transcendental computations, like sin(x)
or cos(x)
. Besides that, it can also perform normal scalar operations, but it is limited in that not all scalar operations which are possible in "x" through "w" can also be performed in "t". Therefore, ideally each core can perform five scalar operations in a single step. It may be noteworthy that, unlike the SSE instructions on CPUs, those five units work independently: Each of them can perform its own operation in each step, whereas in SSE units only one operation can be applied in parallel to multiple data. This basically constitutes the difference between the SSE SIMD and the VLIW architecture.
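To make the contrast concrete, here is a small plain-C sketch with SSE intrinsics. The single SSE instruction necessarily applies the same operation to all four lanes; a VLIW bundle, as in the listings above, may issue a different operation on each lane in the same step:

    #include <xmmintrin.h>
    #include <stdio.h>

    int main(void)
    {
        __m128 a = _mm_set_ps(4.0f, 3.0f, 2.0f, 1.0f);
        __m128 b = _mm_set_ps(8.0f, 7.0f, 6.0f, 5.0f);

        /* SIMD: one instruction, one operation, applied to all four lanes.
           A VLIW bundle could instead issue, say, an add on one lane, a
           multiply on another and a shift on a third in the same step;
           no single SSE instruction can do that. */
        __m128 sum = _mm_add_ps(a, b);

        float r[4];
        _mm_storeu_ps(r, sum);
        printf("%f %f %f %f\n", r[0], r[1], r[2], r[3]);
        return 0;
    }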
Those 01 JUMP POP_CNT(1) ADDR(9) entries are apparently special instructions that don't actually perform operations on the ALUs; they cover things like fetching data from (off-chip) memory and control flow, JUMP itself being a control-flow instruction.
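For what it's worth, such control-flow instructions typically correspond to branches or loops in the kernel source. A trivial, hypothetical example: a kernel like the one below cannot be expressed purely as ALU bundles; the compiler has to emit control-flow instructions for the branch (though for a branch this small it may instead predicate it into a conditional move):

    __kernel void clampNegatives(__global float *data)
    {
        size_t i = get_global_id(0);
        /* This branch cannot live inside an ALU bundle; the compiler has
           to emit control-flow instructions (or predicate it away). */
        if (data[i] < 0.0f)
            data[i] = 0.0f;
    }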
To get an estimate of memory latencies, have a look at Appendix D - Device Parameters in AMD's OpenCL programming guide.
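The latencies themselves can only be looked up in the documentation, but the related memory sizes can be queried at runtime with clGetDeviceInfo. A small sketch (the helper function and its name are mine; 'device' is assumed to be a valid cl_device_id):

    #include <CL/cl.h>
    #include <stdio.h>

    /* Hypothetical helper printing a few memory-related device parameters. */
    static void printMemoryParameters(cl_device_id device)
    {
        cl_ulong globalSize = 0, localSize = 0, constantSize = 0;
        clGetDeviceInfo(device, CL_DEVICE_GLOBAL_MEM_SIZE,
                        sizeof(globalSize), &globalSize, NULL);
        clGetDeviceInfo(device, CL_DEVICE_LOCAL_MEM_SIZE,
                        sizeof(localSize), &localSize, NULL);
        clGetDeviceInfo(device, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
                        sizeof(constantSize), &constantSize, NULL);
        printf("global memory   : %llu bytes\n", (unsigned long long)globalSize);
        printf("local memory    : %llu bytes\n", (unsigned long long)localSize);
        printf("constant buffer : %llu bytes\n", (unsigned long long)constantSize);
    }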
__constant memory is not quite the same as __local memory: it has its own on-chip memory space, which is the same for all work items, and, as per the docs, it can be accessed about twice as fast as __local memory, because no coherency logic between work items is required.
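As an illustration, here is a sketch of a kernel where a small coefficient table naturally belongs in __constant memory, because every work item reads exactly the same values (kernel and names are hypothetical):

    __kernel void applyLinear(__global const float *in,
                              __global float *out,
                              __constant float *coeff)
    {
        size_t i = get_global_id(0);
        /* Every work item reads the very same addresses from 'coeff';
           on-chip constant memory serves such broadcasts efficiently. */
        out[i] = coeff[0] * in[i] + coeff[1];
    }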
Some sources on the internet state that (AMD) GPUs don't have cache memory and that LDS memory (the Local Data Share, which backs __local memory) should be used to emulate a cache explicitly. Some documents, though, do refer to L1 and L2 caches.
Either way, note that the GPU is very good at "hiding" memory latencies by switching execution contexts extremely quickly when one thread stalls waiting for data. Given enough parallel work to choose from, the GPU will almost always find a task that is ready to execute and can be swapped in for one that has to wait.
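Whichever of the two is true for a given chip, the idiom hinted at above stays the same: data that several work items reuse is staged in __local memory (the LDS) explicitly. A minimal sketch, assuming the local work size equals TILE and the global size is a multiple of it:

    #define TILE 64

    __kernel void stageInLds(__global const float *in, __global float *out)
    {
        __local float tile[TILE];      /* the explicitly managed "cache" */
        size_t gid = get_global_id(0);
        size_t lid = get_local_id(0);

        tile[lid] = in[gid];           /* one global read per work item */
        barrier(CLK_LOCAL_MEM_FENCE);  /* make the whole tile visible */

        /* Each work item now reads a neighboring element from fast LDS
           instead of issuing a second global memory access. */
        out[gid] = 0.5f * (tile[lid] + tile[(lid + 1) % TILE]);
    }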