Question

I'm interested in comparing the CPU time of some code portions written in C++ versus Python (running on Linux). Will the following methods produce a "fair" comparison between the two?

Python

Using the resource module:

import resource
def cpu_time():
    return (resource.getrusage(resource.RUSAGE_SELF)[0] +  # time in user mode
            resource.getrusage(resource.RUSAGE_SELF)[1])   # time in system mode

which allows for timing like so:

def timefunc( func ):
    start=cpu_time()
    func()
    return (cpu_time()-start)

Then I test like:

def f():
    for i in range(int(1e6)):
        pass

avg = 0
for k in range(10):
    avg += timefunc( f ) / 10.0
print avg
=> 0.002199700000000071

C++

Using the ctime lib:

#include <ctime>
#include <iostream>

int main() {
    double avg = 0.0;
    int N = (int) 1e6;
    for (int k=0; k<10; k++) {
        clock_t start;
        start = clock();
        for (int i=0; i<N; i++) continue;
        avg += (double)(clock()-start) / 10.0 / CLOCKS_PER_SEC;
    }
    std::cout << avg << '\n';
    return 0;
}

which yields 0.002.

Concerns:

  1. I've read that C++'s clock() measures CPU time, which is what I'm after, but I can't seem to find whether it includes both user and system time.
  2. Results from C++ are much less precise. Why is that?
  3. The overall fairness of the comparison, as mentioned above.

Update

Updated the C++ code as per David's suggestion in the comments:

#include <sys/resource.h>
#include <iostream>

int main() {
    double avg = 0.0;
    int N = (int) 1e6;
    struct rusage usage;
    struct timeval ustart, ustop, sstart, sstop;

    for (int k=0; k<10; k++) {
        getrusage(RUSAGE_SELF, &usage);   // user/system times at the start of the trial
        ustart = usage.ru_utime;
        sstart = usage.ru_stime;

        for (int i=0; i<N; i++) continue;

        getrusage(RUSAGE_SELF, &usage);   // user/system times at the end of the trial
        ustop = usage.ru_utime;
        sstop = usage.ru_stime;

        avg += (
            (ustop.tv_sec+ustop.tv_usec/1e6+
            sstop.tv_sec+sstop.tv_usec/1e6)
            -
            (ustart.tv_sec+ustart.tv_usec/1e6+
            sstart.tv_sec+sstart.tv_usec/1e6)
        ) / 10.0; 
    }

    std::cout << avg << '\n';

    return 0;
}

Running:

g++ -O0 cpptimes.cpp ; ./a.out
=> 0.0020996
g++ -O1 cpptimes.cpp ; ./a.out
=> 0

So I suppose getrusage gets me a little bit better resolution, but I'm not sure how much I should read into it. Setting the optimization flag certainly makes a big difference.

Solution

The documentation says:

"Returns the approximate processor time used by the process since the beginning of an implementation-defined era related to the program's execution. To convert result value to seconds divide it by CLOCKS_PER_SEC."

That's pretty vague. CLOCKS_PER_SEC is set to 10^6, and the "approximate" stands for poor resolution, not that modern clocks tick over 1000 times faster and the results are rounded. That may not be a very technical term, but it is appropriate. The actual resolution everywhere I tested was about 100 Hz = 0.01 s, and it has been like that for years; note the date on http://www.guyrutenberg.com/2007/09/10/resolution-problems-in-clock/.
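
To check what clock() actually resolves to on a given machine, a quick probe along these lines (a sketch, not part of the original post) spins until the value ticks over and prints the step size:

#include <ctime>
#include <iostream>

int main() {
    // Spin until clock() ticks over and print the size of each step,
    // i.e. the effective granularity of clock() on this machine.
    for (int n = 0; n < 5; ++n) {
        std::clock_t t0 = std::clock();
        std::clock_t t1;
        while ((t1 = std::clock()) == t0) { /* busy wait */ }
        std::cout << "step: " << double(t1 - t0) / CLOCKS_PER_SEC << " s\n";
    }
    return 0;
}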

Then the doc follows with: "On POSIX-compatible systems, clock_gettime with clock id CLOCK_PROCESS_CPUTIME_ID offers better resolution."
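
A minimal sketch of that POSIX route, assuming Linux/glibc; the million-iteration loop and the volatile sink are just placeholders to give the clock something to measure:

#include <time.h>     // POSIX clock_gettime / clock_getres
#include <cstdio>

int main() {
    // Report the advertised resolution of the per-process CPU-time clock.
    timespec res;
    clock_getres(CLOCK_PROCESS_CPUTIME_ID, &res);
    std::printf("resolution: %ld ns\n", (long)res.tv_nsec);

    // Time a busy loop with CPU time rather than wall-clock time.
    timespec start, stop;
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &start);
    volatile long sink = 0;
    for (long i = 0; i < 1000000; ++i) sink += i;   // keeps the loop alive under -O2
    clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &stop);

    double cpu = (stop.tv_sec - start.tv_sec)
               + (stop.tv_nsec - start.tv_nsec) / 1e9;
    std::printf("cpu time: %.9f s\n", cpu);
    return 0;
}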

So:

  1. It's CPU time only, but 2 threads = 2×CPU time, because clock() keeps counting for every thread in the process. See the example on cppreference, and the sketch after this list.

  2. It is not suited for fine-grained measurements at all, as explained above. You were right at the edge of its accuracy.

  3. IMO measuring wall-clock time is the only sensible thing, but that's a rather personal opinion, especially with multithreaded applications and multiprocessing in general. Otherwise the results for system + user time should be similar anyway.
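
As a rough illustration of point 1, here is a sketch along the lines of the cppreference example: two busy threads accumulate process CPU time roughly twice as fast as the wall clock. The burn() helper and the one-second duration are arbitrary choices:

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

// Busy-wait for roughly `seconds` of wall-clock time.
static void burn(double seconds) {
    auto end = std::chrono::steady_clock::now()
             + std::chrono::duration<double>(seconds);
    volatile unsigned long sink = 0;
    while (std::chrono::steady_clock::now() < end) sink = sink + 1;
}

int main() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    std::thread t1(burn, 1.0), t2(burn, 1.0);
    t1.join();
    t2.join();

    double cpu  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    double wall = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - wall_start).count();

    std::cout << "wall: " << wall << " s, cpu: " << cpu << " s\n";
    return 0;
}

Build with something like g++ -O2 -pthread; the reported CPU time should come out near twice the wall time.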

EDIT: Regarding 3: this of course holds for computational tasks. If your process sleeps or otherwise gives execution back to the system, measuring CPU time might be more sensible. Also, regarding the comment that clock()'s resolution is, erm, bad: it is, but to be fair one could argue that you should not be measuring such short computations in the first place. IMO it's too bad, but if you measure times over a few seconds I guess it's fine. I would personally use other available tools.
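
And the gap in the other direction, again just a sketch: a process that spends its time sleeping racks up wall-clock time but almost no CPU time:

#include <chrono>
#include <ctime>
#include <iostream>
#include <thread>

int main() {
    auto wall_start = std::chrono::steady_clock::now();
    std::clock_t cpu_start = std::clock();

    // Give execution back to the system for a second.
    std::this_thread::sleep_for(std::chrono::seconds(1));

    double cpu  = double(std::clock() - cpu_start) / CLOCKS_PER_SEC;
    double wall = std::chrono::duration<double>(
                      std::chrono::steady_clock::now() - wall_start).count();

    // Expect wall to be about 1 s while cpu stays near 0.
    std::cout << "wall: " << wall << " s, cpu: " << cpu << " s\n";
    return 0;
}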

OTHER TIPS

"Setting the optimization flag certainly makes a big difference."

C++ is a language that begs to be compiled with optimizations enabled, particularly if the code in question uses containers and iterators from the C++ standard library. A simple ++iterator shrinks from a good-sized chain of function calls when compiled unoptimized to one or two assembly statements when optimization is enabled.

That said, I knew what the compiler would do to your test code. Any decent optimizing compiler will make that for (int i=0; i<N; i++) continue; loop vanish. It's the as-if rule at work. That loop does nothing, so the compiler is free to treat it as if it wasn't even there.

When I look at the CPU behavior of a suspect CPU hog, I write a simple driver (in a separate file) that calls the suspect function a number of times, sometimes a very large number of times. I compile the functionality to be tested with optimization enabled, but I compile the driver with optimization disabled. I don't want a too-smart optimizing compiler to see that those 100,000 calls to function_to_be_tested() can be pulled out of the loop and then further optimize the loop away.

There are a number of solid reasons for calling the test function many times between the single calls that start and stop the timer. This is why Python has the timeit module.
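
A sketch of that driver pattern in C++, where function_to_be_tested(), the file names, and the workload are all placeholders; the repetitions amortize both the timer's coarse resolution and the per-call overhead:

// work.cpp -- the code under test, compiled with optimization:
//   g++ -O2 -c work.cpp
#include <cmath>

double function_to_be_tested(double x) {
    return std::sqrt(x) * std::sin(x);   // placeholder workload
}

// driver.cpp -- the driver, compiled unoptimized so the loop survives:
//   g++ -O0 -c driver.cpp && g++ work.o driver.o -o bench
#include <ctime>
#include <iostream>

double function_to_be_tested(double x);  // defined in work.cpp

int main() {
    const int reps = 100000;
    volatile double sink = 0.0;          // keep the results observable

    std::clock_t start = std::clock();
    for (int i = 0; i < reps; ++i)
        sink = sink + function_to_be_tested(i * 0.001);
    std::clock_t stop = std::clock();

    double total = double(stop - start) / CLOCKS_PER_SEC;
    std::cout << "total: " << total << " s, per call: "
              << total / reps * 1e9 << " ns\n";
    return 0;
}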

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow