How do ALUs execute instructions on an AMD GPU (VLIW)?

https://stackoverflow.com/questions/20793813

21-09-2022

Question

I have a question about OpenCL programming. As I understand it, a quarter of a wavefront can issue an instruction each clock cycle, so it takes four clock cycles to issue a whole wavefront. An instruction needs eight clock cycles to finish on the VLIW architecture, so scheduling another wavefront is the solution: with two wavefronts, that makes eight clock cycles. Wavefront A executes (4 clock cycles), then wavefront B executes (another 4 clock cycles). Once wavefront B has executed (8 clock cycles in total), wavefront A executes again with its next instruction.

The question is:

How does an ALU execute another instruction if all four ALUs in each processing element are already busy executing an earlier one?

For example: in cycle 1, work-items 0-15 begin executing the instruction "ADD". The first ALU in each processing element (16 PEs in total per SIMD / compute unit) calculates the "ADD" instruction.
The same happens in cycles 2, 3, and 4 for the rest of the wavefront (so now all 4 ALUs in each PE are busy executing the "ADD" instruction). In cycle 5, a quarter of wavefront 2 begins executing the instruction "SUBTRACT". How do the ALUs in a processing element calculate this instruction when they are still busy with the "ADD" instruction from the first wavefront? (Remember that the "ADD" issued for the first quarter-wavefront in cycle 1 is unfinished, since it takes 8 clock cycles.)

Update: the 8 clock cycles refer to the read-after-write latency.

Solution

As you have stated, it takes 4 clock cycles for a wavefront to be processed. The results of that instruction get sent to the registers but, because of the read-after-write latency, these results will only be available after 8 clock cycles. The important distinction here is that the ALUs finished their work in 4 cycles so they can go on processing other instructions. The register memory takes 8 cycles to do its job, i.e. store the new data and make it visible again.
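To make the timing concrete, here is a minimal sketch in Python of the simplified model described above. It is not how the hardware scheduler is actually implemented: the function `simulate` and the constants `ISSUE_CYCLES` and `RAW_LATENCY` are hypothetical names, and the model assumes that every instruction in a wavefront depends on the result of the previous one, that issuing an instruction occupies the ALUs for 4 cycles, and that its result becomes readable 8 cycles after issue starts.

```python
ISSUE_CYCLES = 4  # cycles the ALUs spend issuing one instruction for a whole wavefront
RAW_LATENCY = 8   # cycles from issue until the result can be read back (per the question)


def simulate(num_wavefronts, instrs_per_wavefront):
    """Issue chains of dependent instructions from several wavefronts.

    Returns (cycle at which the last issue finishes, cycles the ALUs sat idle).
    """
    ready_at = [0] * num_wavefronts  # earliest cycle each wavefront may issue again
    remaining = [instrs_per_wavefront] * num_wavefronts
    cycle = 0
    idle = 0
    while any(remaining):
        candidates = [w for w in range(num_wavefronts)
                      if remaining[w] and ready_at[w] <= cycle]
        if not candidates:
            # No wavefront has its operands ready yet: the ALUs wait.
            next_ready = min(ready_at[w] for w in range(num_wavefronts)
                             if remaining[w])
            idle += next_ready - cycle
            cycle = next_ready
            continue
        # Pick the wavefront that has been ready the longest.
        w = min(candidates, key=lambda i: ready_at[i])
        ready_at[w] = cycle + RAW_LATENCY  # result visible in registers later
        remaining[w] -= 1
        cycle += ISSUE_CYCLES              # ALUs are free again after 4 cycles


    return cycle, idle


print(simulate(1, 3))  # one wavefront: ALUs wait on the register latency
print(simulate(2, 3))  # two wavefronts: B issues while A's result is in flight
```

Under these assumptions, a single wavefront leaves the ALUs idle for 4 cycles between dependent instructions, while two interleaved wavefronts keep them busy on every cycle. The ALUs are never asked to work on two instructions at once: by the time wavefront B issues, the ALUs have finished with wavefront A, and only the register write-back is still outstanding.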

As a general note for all types of memory access, including registers: memory accesses are handled differently from normal arithmetic. The ALUs can continue executing instructions that don't depend on the result of a memory access while waiting for it to finish.

Licensed under: CC-BY-SA with attribution
