How to read profiling results of CUDA Dynamic Parallelism with Nsight

Question

I can answer your profiler questions. I'll reference the last image in your question.

The "Compute" row shows a summary of all kernel executions happening in a context (in your case you seem to have only one context on the device, so the compute row is therefore showing all activity on the device).

Sub-rows within the Compute row are used to show concurrent kernel executions (note that this is true for both CDP and non-CDP applications). These sub-rows are created dynamically as necessary depending on how much concurrency there is. The placement of kernels within certain sub-rows does not indicate anything... they are just packed in using a heuristic that tries to minimize the number of needed sub-rows. The height of the sub-rows is also scaled to keep vertical space reasonable. In your case you have a lot of concurrency at some points so it looks like you have 30+ sub-rows, which means that as some points you have 30+ kernels executing simultaneously on the device.

The goal of the Compute row is to give an overview of how busy the GPU is over time. To see specifically what kernels are executing you need to explore the kernel rows in the timeline.

The kernel timelines are shown just below the Context timeline. There is one row for each host-launched kernel. Like the Compute row, a kernel row can have sub-rows if needed to show concurrent execution of that kernel. In the case of CDP applications, a kernel row can also have child kernel rows that represent the kernels launched from that kernel.

For your example I see a single host-launched kernel "nQue...". Note that there are 7 sub-rows for this kernel since at some points there are 7 instances of the kernel executing at the same time. For kernels that don't launch child kernels, the kernel execution is represented by a solid interval, showing the time that that instance of the kernel was executing on the GPU. For kernels that launch child kernels, the kernel execution can also include a hollow part at the end. The hollow part represents the time after the kernel has finished executing where it is waiting for child kernels to finish executing. The CDP execution model requires that a parent kernel not complete until all child kernels complete and this is what the hollow part is showing.

You can see the exact parent/child relationships by expanding a parent kernel timeline. Notice the "+" icon next to the "nQue..." timeline row. If you open that you will see all the child kernels launched by any of the "nQue..." kernels in the timeline row. Those child kernels can in turn be expanded if they launched children of there own.

If you want to see the "family tree" of a particular kernel there are two ways to do it. You can select a kernel and all ancestors and descendants will highlight in the kernel rows and the compute rows. You can see this in your image. Ctrl-select works as well to multi-select. If you select a single kernel and right-click you can select "Focus" in the menu and this will hide all the kernels except ancestors/descendants of that kernel. Use "Don't Focus" from that same menu to restore all the kernels.

To your specific question about the long executing kernel. It appears to be one of the "nQue..." instances launched from the host. I can't say why it is executing so long without knowing more about the code. You could select it and use the Focus feature to see exactly what child kernels it is launching and perhaps that would give some insight.