Question

I am really new to CUDA programming (I just started a few weeks ago) and I have an assignment to multiply large matrices (like 960x960) and measure the execution time overall and per GPU core. I looked at the CUDA samples that come with the Toolkit installation (more precisely, the matrixMul project in the 0_Simple folder) and altered the sample to multiply large matrices. The sample already measures the overall execution time, but my question is: how can I measure the execution time per GPU core? I am confused.

Also, less importantly: why does the kernel function in this example get called inside a for loop with 300 iterations?


Solution

Each CUDA device has multiple streaming multiprocessors (SMs). Each SM has multiple warp schedulers and multiple execution units. "CUDA cores" are execution units, not "cores" in the CPU sense, so I will avoid the term for the rest of this discussion.
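As background, the SM count and compute capability of a device can be queried at runtime with cudaGetDeviceProperties. A minimal sketch (mine, not part of the matrixMul sample):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    // multiProcessorCount is the number of SMs; the number of "CUDA cores"
    // per SM depends on the compute capability and is not reported directly.
    printf("%s: compute capability %d.%d, %d SMs\n",
           prop.name, prop.major, prop.minor, prop.multiProcessorCount);
    return 0;
}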

The NVIDIA profiling tools

  • CUDA command line profiler
  • nvprof command line profiler (new in CUDA 5.0)
  • Visual Profiler
  • Nsight VSE CUDA profiler

support collecting the duration and PM (performance monitor) counters of CUDA grid launches. A subset of the PM counters can be collected per SM.

Below I've provided the nvprof command lines for collecting these two pieces of information. Both examples run a debug build of the matrixMul sample on a GTX 480, which has 15 SMs.

COLLECTING GRID EXECUTION TIME

Each of the tools listed above has a simplified mode that collects the execution duration of each kernel grid launch. The graphical tools can display this on a timeline or in a table.

nvprof --print-gpu-trace matrixMul.exe
======== NVPROF is profiling matrixMul.exe...
======== Command: matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 480" with compute capability 2.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 39.40 GFlop/s, Time= 3.327 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
======== Profiling result:
     Start  Duration           Grid Size     Block Size     Regs*    SSMem*    DSMem*      Size  Throughput    Device   Context    Stream  Name
  267.83ms   71.30us                   -              -         -         -         -  409.60KB    5.74GB/s         0         1         2  [CUDA memcpy HtoD]
  272.72ms  139.20us                   -              -         -         -         -  819.20KB    5.88GB/s         0         1         2  [CUDA memcpy HtoD]
  272.86ms    3.33ms           (20 10 1)      (32 32 1)        20    8.19KB        0B         -           -         0         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  277.29ms    3.33ms           (20 10 1)      (32 32 1)        20    8.19KB        0B         -           -         0         1         2  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)

To collect the same information in the other tools:

  1. CUDA command line profiler - specify timestamps
  2. Visual Profiler - run generate timeline
  3. Nsight VSE - New Analysis Activity | Trace | Enable CUDA
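Independent of the profilers, the duration of a launch can also be measured from host code with CUDA events, which is essentially how the sample produces the Time= figure above. A minimal, self-contained sketch (the kernel and launch configuration here are placeholders, not the sample's):

#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 2.0f * out[i];   // stand-in for matrixMulCUDA
}

int main() {
    const int n = 1 << 20;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemset(d, 0, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);                         // queue start marker
    dummyKernel<<<(n + 255) / 256, 256>>>(d, n);    // the launch being timed
    cudaEventRecord(stop);                          // queue stop marker
    cudaEventSynchronize(stop);                     // wait until stop is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);         // elapsed GPU time in ms
    printf("kernel time: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}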

COLLECTING SM ACTIVITY

Your question states that you need the execution time per GPU core. This can mean per GPU (see above) or per SM. SM execution time can be collected using the per-SM PM counter active_cycles, which counts the number of cycles during which the SM had at least one active warp.

Each line of the profiling result below contains 15 values (one per SM).

nvprof --events active_cycles --aggregate-mode-off matrixMul.exe
======== NVPROF is profiling matrixMul.exe...
======== Command: matrixMul.exe
[Matrix Multiply Using CUDA] - Starting...
GPU Device 0: "GeForce GTX 480" with compute capability 2.0

MatrixA(320,320), MatrixB(640,320)
Computing result using CUDA Kernel...
done
Performance= 12.07 GFlop/s, Time= 10.860 msec, Size= 131072000 Ops, WorkgroupSize= 1024 threads/block
Checking computed result for correctness: OK

Note: For peak performance, please refer to the matrixMulCUBLAS example.
======== Profiling result:
    Device   Context    Stream, Event Name, Kernel, Values
         0         1         2, active_cycles, void matrixMulCUDA<int=32>(float*, float*, float*, int, int), 2001108 2001177 2000099 2002857 2152562 2153254 2001086 2153043 2001015 2001192 2000065 2154293 2000071 2000238 2154905
         0         1         2, active_cycles, void matrixMulCUDA<int=32>(float*, float*, float*, int, int), 2155340 2002145 2155289 2002374 2003336 2002498 2001865 2155503 2156271 2156429 2002108 2002836 2002461 2002695 2002098
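If a profiler is not available, a rough per-SM time can also be approximated inside the kernel using clock64() and the %smid special register. Note that this is my own sketch, not what the profiler does: it measures the span from the first block start to the last block end on each SM, which is not the same as active_cycles, and the 64-bit atomicMin/atomicMax it uses require compute capability 3.5 or newer (so it would not run on the GTX 480 above). The buffer names and the workload are placeholders:

#include <cstdio>
#include <cuda_runtime.h>

__device__ unsigned int smId() {
    unsigned int id;
    asm volatile("mov.u32 %0, %%smid;" : "=r"(id));  // SM this block is resident on
    return id;
}

// Records, per SM, the earliest block start and the latest block end (in cycles).
__global__ void timedKernel(unsigned long long *smStart,
                            unsigned long long *smEnd,
                            float *sink) {
    unsigned long long t0 = clock64();

    float acc = 0.0f;
    for (int i = 0; i < 100000; ++i)        // placeholder workload; a real kernel
        acc += sinf((float)i);              // body (e.g. matrix multiply) goes here

    unsigned long long t1 = clock64();

    if (threadIdx.x == 0) {
        unsigned int sm = smId();
        atomicMin(&smStart[sm], t0);        // 64-bit atomics: compute capability >= 3.5
        atomicMax(&smEnd[sm], t1);
    }
    if (acc == 123.456f) *sink = acc;       // keep the workload from being optimized out
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int nSm = prop.multiProcessorCount;

    unsigned long long *dStart, *dEnd;
    float *dSink;
    cudaMalloc(&dStart, nSm * sizeof(unsigned long long));
    cudaMalloc(&dEnd,   nSm * sizeof(unsigned long long));
    cudaMalloc(&dSink,  sizeof(float));
    cudaMemset(dStart, 0xFF, nSm * sizeof(unsigned long long)); // init to ULLONG_MAX
    cudaMemset(dEnd,   0x00, nSm * sizeof(unsigned long long)); // init to 0

    timedKernel<<<200, 256>>>(dStart, dEnd, dSink);
    cudaDeviceSynchronize();

    unsigned long long hStart[64], hEnd[64];                    // assumes <= 64 SMs
    cudaMemcpy(hStart, dStart, nSm * sizeof(unsigned long long), cudaMemcpyDeviceToHost);
    cudaMemcpy(hEnd,   dEnd,   nSm * sizeof(unsigned long long), cudaMemcpyDeviceToHost);

    // clockRate is in kHz, i.e. cycles per millisecond, so cycles/clockRate = ms.
    for (int i = 0; i < nSm; ++i)
        if (hEnd[i] > hStart[i])            // skip SMs that received no blocks
            printf("SM %2d: %.3f ms\n", i,
                   (hEnd[i] - hStart[i]) / (double)prop.clockRate);

    cudaFree(dStart); cudaFree(dEnd); cudaFree(dSink);
    return 0;
}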