Question

I am using nvprof to measure achieved occupancy, and I am finding it reported as

Achieved Occupancy 0.344031 0.344031 0.344031

but using the occupancy calculator, I am finding 75%.

The results are:

Active Threads per Multiprocessor        1536
Active Warps per Multiprocessor            48
Active Thread Blocks per Multiprocessor     6
Occupancy of each Multiprocessor          75%

I am using 33 registers, 144 bytes of shared memory, and 256 threads/block on a device of compute capability 3.5.

EDIT:

Also, something I want to clarify. In http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN it states for

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage

So, if this is 0%, does it mean that I have no global memory loads in the kernel?


Solution

You need to understand that the occupancy calculator is providing the maximum theoretical occupancy that a given kernel can achieve, based only on the resource requirements of that kernel. It does not (and cannot) say anything about how much of that theoretical occupancy the code is capable of achieving.
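
As a rough illustration of where the 75% comes from, the sketch below reproduces the calculator's arithmetic for the launch configuration in the question. It assumes the compute capability 3.5 per-SM limits (64 resident warps, 65536 registers) and a per-warp register allocation granularity of 256 registers; the other limits (shared memory, block and thread slots) are not binding here, so they are omitted.

    #include <stdio.h>

    /* Back-of-the-envelope theoretical occupancy for the numbers in
       the question, using compute capability 3.5 per-SM limits. */
    int main(void)
    {
        const int max_warps_per_sm = 64;
        const int max_regs_per_sm  = 65536;
        const int reg_alloc_unit   = 256;   /* per-warp register granularity */

        const int regs_per_thread   = 33;
        const int threads_per_block = 256;
        const int warps_per_block   = threads_per_block / 32;      /* 8 */

        /* Registers per warp, rounded up to the allocation unit. */
        int regs_per_warp = regs_per_thread * 32;                  /* 1056 */
        regs_per_warp = (regs_per_warp + reg_alloc_unit - 1)
                        / reg_alloc_unit * reg_alloc_unit;         /* 1280 */

        int regs_per_block = regs_per_warp * warps_per_block;      /* 10240 */
        int blocks_per_sm  = max_regs_per_sm / regs_per_block;     /* 6 */
        int active_warps   = blocks_per_sm * warps_per_block;      /* 48 */

        printf("blocks/SM = %d, warps/SM = %d, occupancy = %.0f%%\n",
               blocks_per_sm, active_warps,
               100.0 * active_warps / max_warps_per_sm);           /* 75% */
        return 0;
    }

Registers are the binding limit: 6 blocks of 8 warps each gives 48 resident warps out of a possible 64, i.e. 75%, matching both the calculator output and the numbers you posted.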

The profiling tools, on the other hand, deduce the actual occupancy from measured profile counters. According to the profiler documentation, the achieved occupancy number you are asking about is calculated as

(active_warps / active_cycles) / MAX_WARPS_PER_SM

i.e. it samples the number of active warps on one or more SMs during the kernel run and calculates the actual occupancy from that.
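
Plugging your measured value into that formula makes the gap concrete (a minimal sketch, assuming MAX_WARPS_PER_SM = 64 for compute capability 3.5):

    #include <stdio.h>

    int main(void)
    {
        const int    max_warps_per_sm = 64;        /* cc 3.5 */
        const double achieved         = 0.344031;  /* nvprof achieved occupancy */
        const double theoretical      = 0.75;      /* occupancy calculator */

        /* Invert the metric: average warps active per cycle. */
        double avg_active_warps = achieved * max_warps_per_sm;     /* ~22 */

        printf("~%.0f warps active per cycle on average, "
               "versus %.0f resident (%.0f%% of theoretical)\n",
               avg_active_warps, theoretical * max_warps_per_sm,
               100.0 * achieved / theoretical);
        return 0;
    }

So although 48 warps can be resident, on average only about 22 warps were active per cycle over the kernel's execution, roughly 46% of the theoretical figure.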

There can be many reasons why a kernel doesn't achieve its theoretical occupancy, and (before you ask) no, I can't tell you why your kernel doesn't reach it. But the Visual Profiler can. If it is important to you, I suggest you look at the automated performance analysis features available in the CUDA 5/6 Visual Profiler as a way of better understanding the performance of your code.

It is also worth pointing out that occupancy should be treated as only a rough metric of potential code performance; high theoretical occupancy doesn't always translate into high performance. Instruction-level parallelism and latency minimisation strategies can also be very effective at reaching high levels of performance, even at low occupancy. There is a large body of work on this, most of it stemming from Vasily Volkov's seminal GTC 2010 paper.
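
To make the ILP point concrete, here is a purely illustrative kernel sketch in the spirit of Volkov's argument (the kernel and its names are hypothetical, not from the question): each thread handles several independent elements, so their memory latencies can overlap within a single thread rather than requiring more resident warps.

    #define ILP 4

    /* Illustrative only: each thread processes ILP independent elements,
       exposing instruction-level parallelism within one thread. */
    __global__ void scale_ilp(float *out, const float *in, float a, int n)
    {
        int base   = blockIdx.x * blockDim.x * ILP + threadIdx.x;
        int stride = blockDim.x;
        float v[ILP];

        /* ILP independent loads issued back-to-back; the hardware can
           overlap their latencies without extra resident warps. */
        #pragma unroll
        for (int i = 0; i < ILP; ++i) {
            int idx = base + i * stride;
            v[i] = (idx < n) ? in[idx] : 0.0f;
        }

        #pragma unroll
        for (int i = 0; i < ILP; ++i) {
            int idx = base + i * stride;
            if (idx < n)
                out[idx] = a * v[i];
        }
    }

With a layout like this the grid shrinks by a factor of ILP, and each thread keeps several independent operations in flight, which is the kind of trade-off Volkov's paper explores.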

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow