The answer to your first "sub question" is no, this isn't the proper way to do it, because none of the functions you have written will be emitted by the compiler.
You can see more details in my answer to the question I linked to above, but the short version is that the compiler's dead code elimination will remove any code which doesn't contribute to a value that is written to memory. So your functions must return a value, and you must use that return value in such a way that the compiler can't deduce that the call to your device function is redundant and eliminate it.
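As a rough illustration of that point, here is a minimal sketch of a kernel that keeps the calls alive. The names `op_under_test`, `bench_kernel` and `run_once` are hypothetical stand-ins, not anything from your code. Each call feeds its result into the next one, the starting value comes from a runtime argument so it can't be folded at compile time, and the final value is stored to global memory:

```cuda
#include <cuda_runtime.h>

// Hypothetical device function standing in for the one you want to benchmark.
__device__ float op_under_test(float x) { return x * 0.5f + 1.0f; }

// Every call depends on the previous result, and the last result is stored
// to global memory, so the compiler cannot treat the calls as dead code.
__global__ void bench_kernel(float *out, float seed, int iters)
{
    float acc = seed + threadIdx.x;   // runtime input, unknown at compile time
    for (int i = 0; i < iters; ++i)
        acc = op_under_test(acc);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;
}

// Host helper (assumption: a single warp is enough to show the idea).
// Returns the value written by thread 0, or -1.0f if allocation fails.
float run_once(float seed, int iters)
{
    float *d_out = nullptr;
    float h_out = 0.0f;
    if (cudaMalloc(&d_out, 32 * sizeof(float)) != cudaSuccess) return -1.0f;
    bench_kernel<<<1, 32>>>(d_out, seed, iters);
    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(&h_out, d_out, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}
```

If the store to `out` is removed, or the loop result is never read, the compiler is free to drop the calls entirely, which you can confirm by inspecting the generated code with `cuobjdump -sass`.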
Beyond that, you need enough active warps per SM to amortise the instruction scheduling latency of the architecture, so that you measure the instruction throughput of your device functions rather than the latency of the instruction scheduler and pipeline.
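One way to get there, sketched under assumptions, is to ask the runtime how many blocks of the kernel can be resident on each SM and launch enough of them to fill the whole device, timing the launch with events. The kernel, the helper name `time_full_device` and the block size of 256 are illustrative choices, not something taken from your code:

```cuda
#include <cuda_runtime.h>

// Hypothetical device function standing in for the one you want to benchmark.
__device__ float op_under_test(float x) { return x * 0.5f + 1.0f; }

__global__ void bench_kernel(float *out, float seed, int iters)
{
    float acc = seed + threadIdx.x;
    for (int i = 0; i < iters; ++i)
        acc = op_under_test(acc);
    out[blockIdx.x * blockDim.x + threadIdx.x] = acc;  // keeps the calls alive
}

// Launches enough blocks to fill every SM and returns the elapsed time in
// milliseconds for one timed launch, or -1.0f on error. The grid size used
// is reported through grid_out.
float time_full_device(int iters, int *grid_out)
{
    const int threads = 256;          // assumed block size, tune as needed
    int device = 0, blocks_per_sm = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;
    // How many blocks of this kernel can be resident on one SM at once.
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocks_per_sm, bench_kernel, threads, 0) != cudaSuccess)
        return -1.0f;
    const int grid = blocks_per_sm * prop.multiProcessorCount;
    if (grid <= 0) return -1.0f;
    *grid_out = grid;

    float *d_out = nullptr;
    if (cudaMalloc(&d_out, sizeof(float) * grid * threads) != cudaSuccess)
        return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    bench_kernel<<<grid, threads>>>(d_out, 1.0f, iters);  // warm-up launch
    cudaEventRecord(start);
    bench_kernel<<<grid, threads>>>(d_out, 1.0f, iters);  // timed launch
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaError_t err = cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_out);
    return err == cudaSuccess ? ms : -1.0f;
}
```

Dividing `grid * threads * iters` by the elapsed time gives an estimate of calls per millisecond. Because each thread runs a single dependent chain, it is the number of resident warps that hides the pipeline latency here, so a launch with only a handful of warps would mostly measure latency instead.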