Question

I'm trying to understand what machine code the OpenCL compiler produces in order to optimize it. Therefore I used the tool m2s-opencl-kc (from multi2sim) to offline-compile my *.cl file and keep the intermediate files (switch: -a), such as the *.isa file. This *.isa contains a "Disassembly" section, which seems to be what I'm looking for...

Note: My assembly knowledge is a bit "old". I wrote assembly for older CPUs such as the 386/486 and early Pentiums, so I have trouble reading vector instructions, although I have some theoretical knowledge of them.

[... OTHER STUFF ... ]
; --------  Disassembly --------------------
00 ALU_PUSH_BEFORE: ADDR(32) CNT(6) KCACHE0(CB2:0-15) KCACHE1(CB0:0-15)
      0  x: MOV         R2.x,  0.0f
         z: SETGT_INT   R0.z,  1,  KC0[0].y
         t: MULLO_INT   ____,  R1.x,  KC1[1].x
      1  w: ADD_INT     ____,  R0.x,  PS0
      2  y: ADD_INT     R2.y,  PV1.w,  KC1[6].x
      3  x: PREDE_INT   ____,  R0.z,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED
01 JUMP  POP_CNT(1) ADDR(9)
02 ALU: ADDR(38) CNT(5) KCACHE0(CB1:0-15)
      4  x: MOV         R2.x,  0.0f
         y: MOV         R3.y,  0.0f
         w: LSHL        ____,  R2.y,  2
      5  z: ADD_INT     R2.z,  KC0[0].x,  PV4.w
03 LOOP_DX10 i0 FAIL_JUMP_ADDR(8)
    04 ALU: ADDR(43) CNT(11) KCACHE0(CB2:0-15)
          6  y: ADD_INT     R3.y,  R3.y,  1
             w: LSHL        ____,  R3.y,  2
          7  x: SETGT_INT   R3.x,  KC0[0].y,  PV6.y
             z: ADD_INT     ____,  R2.z,  PV6.w
             w: ADD_INT     ____,  PV6.w,  8
          8  x: ASHR        R0.x,  PV7.w,  4
             y: LSHR        R0.y,  PV7.z,  2
             z: BFE_UINT    R0.z,  PV7.w,  0x00000002,  0x00000002
[... some more ... ]

What I'm wondering about is the meaning of the numbers and characters in front of the instructions. As I understand it, the compiler produced some "complex" instructions such as:

00 ALU_PUSH_BEFORE: ADDR(32) CNT(6) KCACHE0(CB2:0-15) KCACHE1(CB0:0-15)

(Question: Is that a so-called "Very Long Instruction Word"?)

And this "complex" instruction consists of multiple "simple" instructions as:

      0  x: MOV         R2.x,  0.0f
         z: SETGT_INT   R0.z,  1,  KC0[0].y
         t: MULLO_INT   ____,  R1.x,  KC1[1].x
      1  w: ADD_INT     ____,  R0.x,  PS0
      2  y: ADD_INT     R2.y,  PV1.w,  KC1[6].x
      3  x: PREDE_INT   ____,  R0.z,  0.0f      UPDATE_EXEC_MASK UPDATE_PRED

These "simple" instructions seem to be the instructions for each vector unit. The four vector units are referenced by x, y, z and w. But what's 't'? Is that another vector unit? I compiled it for a "Cypress" GPU...

Now, about the numbers: are these just like "line numbers"? With leading zeros: the serial number of a "complex" instruction? Without a leading zero: the serial number of a "simple" instruction?

I assume all "simple" instruction with the same serial can be "logically" executed in one cycle, if we assume there are no wait states for memory access. For example the following instructions (of the above complex instruction) are 'executed' in cycle 0:

      0  x: MOV         R2.x,  0.0f
         z: SETGT_INT   R0.z,  1,  KC0[0].y
         t: MULLO_INT   ____,  R1.x,  KC1[1].x

By "executed" I mean we have some kind of (e.g. 4-cycle) pipelining. This would mean the above instructions should start execution in cycle 0 and should have finished after cycle 3.

Question about pipelining

What happens if the next instruction (e.g. "1") reads register R2.x? Would it read the old value of R2.x (from before instruction "0"), or would instruction "1" be delayed until instruction "0" finishes? Or is this perhaps a "don't care" situation (producing undefined results) that the compiler has to ensure never happens?

Questions about memory access

I assume registers can be accessed during the data fetch cycle, without waiting. Memory accesses will need some extra cycles, depending on the kind of memory accessed:

  • The "__private" memory should be mostly mapped to registers.
  • __local memory (up to 64KB shared between work-items of same group): How many extra cycles do I have to expect in current GPUs?
  • __global memory: This should be the external DRAM of e.g. 256MB to x GB. How many extra cycles should I expect here? As far as I know, this memory is not cached on GPU devices.
  • __constant memory should be like __global memory, but is cached using __local memory

Is there any good tutorial for "ISA"?

Regards, Stefan


Solution

Each of the "numbered" sections is a single VLIW, for example:

  4  x: MOV         R2.x,  0.0f
     y: MOV         R3.y,  0.0f
     w: LSHL        ____,  R2.y,  2

This is one instruction which uses three of the available ALUs, namely "x", "y" and "w". It could also use "z" and "t", which would give the maximum of five parallel operations, like:

  4  x: MOV         R2.x,  0.0f
     y: MOV         R3.y,  0.0f
     z: MOV         R3.z,  0.0f
     w: LSHL        ____,  R2.y,  2
     t: LSHL        ____,  R2.z,  4

Still, this is a single VLIW instruction that 'feeds' all five ALU 'lanes' of a shader core, and the five operations are then executed in parallel in a single step.

But what's 't'? Is that another vector unit?

Yes, "t" is a fifth scalar unit, dubbed "transcendental", which can be used to perform transcendental computations, like sin(x) or cos(x). Besides that, it can also perform normal scalar operations, but it is limited in that not all scalar operations which are possible in "x" through "w" can also be performed in "t". Therefore, ideally each core can perform five scalar operations in a single step. It may be noteworthy that, unlike the SSE instructions on CPUs, those five units work independently: Each of them can perform its own operation in each step, whereas in SSE units only one operation can be applied in parallel to multiple data. This basically constitutes the difference between the SSE SIMD and the VLIW architecture.

Instructions like

01 JUMP  POP_CNT(1) ADDR(9)

are special instructions that don't actually perform operations on the ALUs, such as control-flow instructions or fetches from (off-chip) memory.

To get an estimate of memory latencies, have a look at Appendix D - Device Parameters in AMD's OpenCL programming guide.

__constant memory is not quite the same as __local memory: it has its own on-chip memory space, which is identical for all work items, and according to the docs it can be accessed about twice as fast as __local memory, because no coherency logic is required between work items.

Some sources on the internet state that the (AMD) GPUs don't have cache memory and that LDS memory should be used to explicitly emulate a cache. In some documents though there are references to L1 and L2 caches.

Either way, note that the GPU is very good at "hiding" memory latencies by switching execution 'contexts' extremely fast when one thread stalls waiting for data. Given enough parallel tasks to choose from, the GPU will usually find a task that is ready to execute and can be swapped in for one that has to wait.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow