Question

In order to accurately time my kernel execution, how many runs should I average over? I see large differences (around 20%) in the average running time between 30 runs and 500 runs. I suppose the GPU might be under-clocking itself to save power (it's a GTX 580 Ti). How do I disable this behavior? Can the required number of launches be calculated from the time taken by a single run?

Thanks!


Solution

If your code has variable execution paths (data-dependent, perhaps, and you're feeding it varying data), then nobody can really answer this for you.

If your code has a relatively constant execution path, I usually get pretty good results by timing things twice and throwing away the first set of results.

Various GPUs do have power-management features, but the first time you run a kernel, any relevant clocks will be promoted to their highest state, and they won't change in the short time (microseconds) it takes to run that kernel again for timing.
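
As a concrete illustration of that warm-up-then-time pattern, here is a minimal sketch using CUDA events; `myKernel`, its launch configuration, and the data it touches are stand-ins for whatever you are actually measuring:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel standing in for whatever you are actually timing.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

int main()
{
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    // First (warm-up) launch: absorbs one-time costs such as clock ramp-up
    // and lazy context/module initialization. Its time is discarded.
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaDeviceSynchronize();

    // Second launch is the one we actually time.
    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```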

Benchmarking traditionalists would tell you to run the code hundreds or thousands of times and average the results. I'm rarely interested in that level of precision. I can usually get a pretty good answer to how fast something is by timing the second run.

As an experiment, you might actually try plotting the timing of each individual run across 500 runs. That might give you much more insight than any answer on SO can provide. If you see a big spike at the beginning, rather than trying to average it out over a large number of runs, I'm usually more interested in discarding it, because it's not representative of the rest of the data.
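
One way to run that experiment (again only a sketch, with a trivial placeholder kernel) is to time every launch with CUDA events and write one row per run to a CSV that you can plot with whatever tool you prefer:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Trivial placeholder kernel; substitute the kernel you care about.
__global__ void myKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int runs = 500;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    FILE *f = fopen("timings.csv", "w");
    fprintf(f, "run,ms\n");

    for (int r = 0; r < runs; ++r) {
        cudaEventRecord(start);
        myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        fprintf(f, "%d,%f\n", r, ms);   // one row per run, ready to plot
    }

    fclose(f);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```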

Also, be aware that GPUs running under WDDM are just wacky in terms of timing. The OS actually manages a WDDM GPU to a much finer degree than is really desirable for computing tasks, so that may be a situation where you just have to give up and time lots of runs. You'll likely get much more consistent and predictable run-to-run results if you can run your GPU in TCC mode on Windows (which won't work with a GeForce GPU), or else on Linux without X running on that GPU. (X can be running; just keep it off the compute GPUs if you can.) In my opinion, timing is considerably more challenging under WDDM.
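
If you're not sure which driver model a Windows GPU is currently using, the CUDA runtime reports it via the `tccDriver` field of `cudaDeviceProp`; a quick check could look like this sketch:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);

    for (int dev = 0; dev < count; ++dev) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, dev);
        // tccDriver is 1 when the device is driven by the TCC driver
        // (Windows only); 0 means WDDM or a non-Windows platform.
        printf("device %d (%s): %s\n", dev, prop.name,
               prop.tccDriver ? "TCC" : "WDDM (or non-Windows)");
    }
    return 0;
}
```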
