Question

I had a quick glance at the CUDA Programming Guide with regard to the -use_fast_math optimizations. Appendix C mentions divisions being converted to an intrinsic, but there is no mention of multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC will try to fuse multiplications and additions into FMAD operations (when the regular '*' and '+' operators are used) and that intrinsics are never merged into FMAD operations. But if my code is multiplication heavy, would there be a benefit in using an SP intrinsic with an explicit rounding mode, like __fmul_rn?

So there are two questions:

  1. Does the -use_fast_math option translate multiplications written with the '*' operator into SP intrinsics like __fmul_rn?

  2. Could there be a performance benefit in hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.
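To make the question concrete, here is a minimal sketch of the kind of kernel I mean (the kernel name and shapes are just placeholders):

    __global__ void scale(float *out, const float *in, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * k;              // plain '*' operator
            // versus hand-coding the intrinsic:
            // out[i] = __fmul_rn(in[i], k);
        }
    }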


Solution

"Standalone" single precision multiplication always compiles to hardware instructions ("intrinsics"). There is no other type of floating point multiplication instructions. The -use_fast_math option in nvcc has no effect on the floating point multiplication instructions emitted for compute capability 1.x targets. On compute 2.x and 3.x targets, it puts the compiler into a compatibility mode and all single precision multiplication instructions will be mul.ftz.f32 (flush to zero).

The floating point intrinsics you mention (__fmul_rn, __fmul_rz, __fmul_ru and __fmul_rd) only provide explicit control over the IEEE rounding behaviour of the multiply; as you note, they also exclude the operation from FMAD contraction. I don't believe there is any throughput difference between them on Fermi or Kepler GPUs, so there is no performance to be gained by hand-coding __fmul_rn.
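For reference, a sketch of the four rounding variants side by side; the FMAD comment simply restates what the programming guide already says about intrinsics never being contracted:

    __global__ void roundings(float *out, float a, float b, float c)
    {
        out[0] = __fmul_rn(a, b);   // round to nearest even (same as plain '*' by default)
        out[1] = __fmul_rz(a, b);   // round towards zero
        out[2] = __fmul_ru(a, b);   // round up (towards +infinity)
        out[3] = __fmul_rd(a, b);   // round down (towards -infinity)

        // a * b + c may be contracted into a single FMAD instruction,
        // but __fmul_rn(a, b) + c never is, so hand-coding the intrinsic
        // can cost an instruction rather than save one.
        out[4] = __fmul_rn(a, b) + c;
    }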
