GPUs have hardware warp schedulers to which you, as a programmer, do not have access.
For the Fermi architecture, for example, the device has a GigaThread scheduler that distributes thread blocks across the Streaming Multiprocessors, and each Multiprocessor has a dual warp scheduler that dispatches warps to the execution cores. But all of this is transparent to the user.
What you can do to profile an individual instruction or a sequence of instructions is to use the NVTX (NVIDIA Tools Extension) tracing library, which lets you annotate parts of your code so that they can subsequently be identified in Parallel Nsight traces.
You can find some material on the NVTX library at
CUDA Pro Tip: Generate Custom Application Profile Timelines with NVTX
Optimizing Application Performance with CUDA Profiling Tools
and in Chapter 3 of the book "CUDA Application Design and Development" by Rob Farber.
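As a minimal sketch of the idea, the NVTX C API lets you bracket a region of host code with a named range that then appears on the profiler timeline. The kernel below and its name are placeholders for illustration only; the example assumes the CUDA toolkit's `nvToolsExt` library is available (link with `-lnvToolsExt`).

```cuda
// Minimal NVTX annotation sketch: wrap a kernel launch in a named range
// so it shows up as "dummyKernel launch" in the profiler timeline.
// dummyKernel is a placeholder kernel, not from the original post.
#include <nvToolsExt.h>
#include <cuda_runtime.h>

__global__ void dummyKernel(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int N = 1 << 20;
    float *d;
    cudaMalloc(&d, N * sizeof(float));

    nvtxRangePushA("dummyKernel launch");        // open a named range
    dummyKernel<<<(N + 255) / 256, 256>>>(d, N);
    cudaDeviceSynchronize();                     // include kernel time in the range
    nvtxRangePop();                              // close the range

    cudaFree(d);
    return 0;
}
```

Ranges opened with `nvtxRangePushA` can be nested, so you can annotate both a coarse phase of the application and the individual launches inside it.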
Concerning the use of NVTX, have a look at my question here: