Question

I had a quick glance at the CUDA Programming Guide with regard to the -use_fast_math optimizations. Appendix C mentions divisions being converted to an intrinsic, but there is no mention of multiplications. The reason I ask is that my kernel has a lot of multiplications. I am aware that NVCC will try to fuse multiplications and additions into FMAD operations (when the regular '*' and '+' operators are used) and that intrinsics are never merged into FMAD operations. But if my code is multiplication heavy, would there be a benefit in using an SP intrinsic with an explicit rounding mode, like __fmul_rn?

So there are two questions:

  1. Does the -use_fast_math option translate multiplications written with the '*' operator into SP intrinsics like __fmul_rn?

  2. Could there be a performance benefit in hand-coding multiplications to explicitly use __fmul_rn? An example or some numbers would help me understand.
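To make the question concrete, here is a minimal sketch of the kind of kernel I mean (the kernel name and shapes are just placeholders):

    __global__ void scale(float *out, const float *in, float k, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) {
            out[i] = in[i] * k;              // plain '*' operator
            // versus hand-coding the intrinsic:
            // out[i] = __fmul_rn(in[i], k);
        }
    }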


Solution

"Standalone" single precision multiplication always compiles to hardware instructions ("intrinsics"). There is no other type of floating point multiplication instructions. The -use_fast_math option in nvcc has no effect on the floating point multiplication instructions emitted for compute capability 1.x targets. On compute 2.x and 3.x targets, it puts the compiler into a compatibility mode and all single precision multiplication instructions will be mul.ftz.f32 (flush to zero).

The floating point intrinsics you mention (__fmul_rn, __fmul_rz, __fmul_ru and __fmul_rd) only provide explicit control over the IEEE rounding behaviour of the multiply; as you note, they also exclude the operation from FMAD contraction. I don't believe there is any throughput difference between them on Fermi or Kepler GPUs, so there is no performance to be gained by hand-coding __fmul_rn.
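For reference, a sketch of the four rounding variants side by side; the FMAD comment simply restates what the programming guide already says about intrinsics never being contracted:

    __global__ void roundings(float *out, float a, float b, float c)
    {
        out[0] = __fmul_rn(a, b);   // round to nearest even (same as plain '*' by default)
        out[1] = __fmul_rz(a, b);   // round towards zero
        out[2] = __fmul_ru(a, b);   // round up (towards +infinity)
        out[3] = __fmul_rd(a, b);   // round down (towards -infinity)

        // a * b + c may be contracted into a single FMAD instruction,
        // but __fmul_rn(a, b) + c never is, so hand-coding the intrinsic
        // can cost an instruction rather than save one.
        out[4] = __fmul_rn(a, b) + c;
    }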
