As you have stated, it takes 4 clock cycles for the ALUs to process an instruction for a wavefront. The results of that instruction are sent to the registers but, because of the read-after-write latency, they only become available after 8 clock cycles. The important distinction is that the ALUs finish their work in 4 cycles, so they are free to process other instructions. It is the register memory that takes 8 cycles to do its job, i.e. to store the new data and make it readable again.
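To make the timing concrete, here is a minimal sketch of a toy in-order model, not a description of the real hardware scheduler. It assumes the 8-cycle read-after-write latency is counted from the cycle an instruction starts, and that only one instruction occupies the ALUs at a time; the function name `schedule` and the register names are made up for illustration.

```python
ISSUE_CYCLES = 4  # cycles the ALUs are busy per wavefront instruction (from the text)
RAW_LATENCY = 8   # cycles until a written register is readable (assumed: counted from start)


def schedule(instructions):
    """Return the start cycle of each instruction.

    instructions: list of (dest_register, [source_registers]), issued in order.
    """
    ready_at = {}  # register name -> cycle at which its value is readable
    alu_free = 0   # first cycle at which the ALUs can accept a new instruction
    starts = []
    for dest, sources in instructions:
        # Wait for the ALUs and for every source register written earlier.
        start = max([alu_free] + [ready_at.get(s, 0) for s in sources])
        starts.append(start)
        alu_free = start + ISSUE_CYCLES
        ready_at[dest] = start + RAW_LATENCY
    return starts


# Each instruction reads the result of the previous one.
dependent = [("r0", []), ("r1", ["r0"]), ("r2", ["r1"])]
# No instruction reads a register written by another.
independent = [("r0", []), ("r1", []), ("r2", [])]

print(schedule(dependent))    # [0, 8, 16]
print(schedule(independent))  # [0, 4, 8]
```

In this model a chain of dependent instructions starts one instruction every 8 cycles, while independent instructions start every 4 cycles, because the ALUs are only busy for 4 of the 8 cycles.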
As a general note for all types of memory access, including registers: memory accesses are handled differently from normal arithmetic. While a memory access is in flight, the ALUs can continue executing instructions that don't depend on its result.
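The same idea can be sketched for a slower memory access. This is only a back-of-the-envelope model: `MEM_LATENCY` is a hypothetical figure picked for illustration, not a measured value for any device, and the cost of issuing the load itself is ignored.

```python
ALU_ISSUE = 4      # cycles per ALU instruction (from the text)
MEM_LATENCY = 100  # hypothetical load latency, chosen only for illustration


def stalled_cycles(independent_alu_ops):
    """Cycles the ALUs sit idle before the loaded value can be used.

    The load is issued at cycle 0; independent_alu_ops instructions that do
    not need the loaded value are executed while it is in flight.
    """
    load_done = MEM_LATENCY
    alu_done = independent_alu_ops * ALU_ISSUE
    # The dependent instruction can start once both the load and the
    # independent work have finished.
    first_use = max(load_done, alu_done)
    return first_use - alu_done


for n in (0, 10, 25):
    print(n, stalled_cycles(n))
# 0 100
# 10 60
# 25 0
```

With enough independent instructions available, the wait for the memory access is hidden completely; with none, the ALUs idle for the full latency.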