Question

I am using nvprof to measure achieved occupancy, and I am finding it reported as

Achieved Occupancy 0.344031 0.344031 0.344031

but using the occupancy calculator, I am finding 75%.

The results are:

Active Threads per Multiprocessor        1536
Active Warps per Multiprocessor            48
Active Thread Blocks per Multiprocessor     6
Occupancy of each Multiprocessor          75%

I am using 33 registers, 144 bytes of shared memory, and 256 threads/block on a device of compute capability 3.5.

EDIT:

Also, something I want to clarify. In http://docs.nvidia.com/cuda/profiler-users-guide/#axzz30pb9tBTN it states for

gld_efficiency

Ratio of requested global memory load throughput to required global memory load throughput expressed as percentage

So, if this is 0%, does it mean that I have no global memory loads in the kernel?


Solution

You need to understand that the occupancy calculator is providing the maximum theoretical occupancy that a given kernel can achieve, based only on the resource requirements of that kernel. It does not (and cannot) say anything about how much of that theoretical occupancy the code is capable of achieving.
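
As a rough illustration of where the 75% comes from, the sketch below reproduces the calculator's arithmetic for the launch configuration in the question. It assumes the compute capability 3.5 per-SM limits (64 resident warps, 65536 registers) and a per-warp register allocation granularity of 256 registers; the other limits (shared memory, block and thread slots) are not binding here, so they are omitted.

    #include <stdio.h>

    /* Back-of-the-envelope theoretical occupancy for the numbers in
       the question, using compute capability 3.5 per-SM limits. */
    int main(void)
    {
        const int max_warps_per_sm = 64;
        const int max_regs_per_sm  = 65536;
        const int reg_alloc_unit   = 256;   /* per-warp register granularity */

        const int regs_per_thread   = 33;
        const int threads_per_block = 256;
        const int warps_per_block   = threads_per_block / 32;      /* 8 */

        /* Registers per warp, rounded up to the allocation unit. */
        int regs_per_warp = regs_per_thread * 32;                  /* 1056 */
        regs_per_warp = (regs_per_warp + reg_alloc_unit - 1)
                        / reg_alloc_unit * reg_alloc_unit;         /* 1280 */

        int regs_per_block = regs_per_warp * warps_per_block;      /* 10240 */
        int blocks_per_sm  = max_regs_per_sm / regs_per_block;     /* 6 */
        int active_warps   = blocks_per_sm * warps_per_block;      /* 48 */

        printf("blocks/SM = %d, warps/SM = %d, occupancy = %.0f%%\n",
               blocks_per_sm, active_warps,
               100.0 * active_warps / max_warps_per_sm);           /* 75% */
        return 0;
    }

Registers are the binding limit: 6 blocks of 8 warps each gives 48 resident warps out of a possible 64, i.e. 75%, matching both the calculator output and the numbers you posted.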

The profiling tools, on the other hand, deduce the actual occupancy from measured profile counters. According to the profiler documentation, the achieved occupancy number you are asking about is calculated as

(active_warps / active_cycles) / MAX_WARPS_PER_SM

i.e. it samples the number of active warps on one or more SMs during the kernel run and calculates the actual occupancy from that.
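
Plugging your measured value into that formula makes the gap concrete (a minimal sketch, assuming MAX_WARPS_PER_SM = 64 for compute capability 3.5):

    #include <stdio.h>

    int main(void)
    {
        const int    max_warps_per_sm = 64;        /* cc 3.5 */
        const double achieved         = 0.344031;  /* nvprof achieved occupancy */
        const double theoretical      = 0.75;      /* occupancy calculator */

        /* Invert the metric: average warps active per cycle. */
        double avg_active_warps = achieved * max_warps_per_sm;     /* ~22 */

        printf("~%.0f warps active per cycle on average, "
               "versus %.0f resident (%.0f%% of theoretical)\n",
               avg_active_warps, theoretical * max_warps_per_sm,
               100.0 * achieved / theoretical);
        return 0;
    }

So although 48 warps can be resident, on average only about 22 warps were active per cycle over the kernel's execution, roughly 46% of the theoretical figure.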

There can be many reasons why a kernel doesn't achieve its theoretical occupancy, and (before you ask) no, I can't tell you why your kernel doesn't reach it. But the Visual Profiler can. If it is important to you, I suggest you look at the automated performance analysis features available in the CUDA 5/6 Visual Profiler as a way of better understanding the performance of your code.

It is also worth pointing out that occupancy should be treated as only a rough metric of potential code performance; high theoretical occupancy doesn't always translate into high performance. Instruction-level parallelism and latency minimisation strategies can also be very effective at reaching high levels of performance, even at low occupancy. There is a large body of work on this, most of it stemming from Vasily Volkov's seminal GTC 2010 paper.
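
To make the ILP point concrete, here is a purely illustrative kernel sketch in the spirit of Volkov's argument (the kernel and its names are hypothetical, not from the question): each thread handles several independent elements, so their memory latencies can overlap within a single thread rather than requiring more resident warps.

    #define ILP 4

    /* Illustrative only: each thread processes ILP independent elements,
       exposing instruction-level parallelism within one thread. */
    __global__ void scale_ilp(float *out, const float *in, float a, int n)
    {
        int base   = blockIdx.x * blockDim.x * ILP + threadIdx.x;
        int stride = blockDim.x;
        float v[ILP];

        /* ILP independent loads issued back-to-back; the hardware can
           overlap their latencies without extra resident warps. */
        #pragma unroll
        for (int i = 0; i < ILP; ++i) {
            int idx = base + i * stride;
            v[i] = (idx < n) ? in[idx] : 0.0f;
        }

        #pragma unroll
        for (int i = 0; i < ILP; ++i) {
            int idx = base + i * stride;
            if (idx < n)
                out[idx] = a * v[i];
        }
    }

With a layout like this the grid shrinks by a factor of ILP, and each thread keeps several independent operations in flight, which is the kind of trade-off Volkov's paper explores.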

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow