Question

Is increased CPU time (as reported by the time CLI command) indicative of inefficiency when hyperthreading is used (e.g. time spent in spinlocks or on cache misses), or is it possible that the CPU time is inflated by the odd nature of HT (e.g. the real cores being busy, so HT can't kick in)?

I have a quad-core i7, and I'm testing a trivially parallelizable part (image-to-palette remapping) of an OpenMP program, with no locks and no critical sections. All threads access a bit of read-only shared memory (a look-up table), but each writes only to its own memory.
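
The loop is essentially of this shape (a simplified sketch, not the real code; names are illustrative):

    /* Each thread reads the shared, read-only look-up table and
       writes only its own slice of the output. No locks, no
       critical sections. */
    void remap(const unsigned char *in, unsigned char *out,
               long n, const unsigned char lut[256])
    {
        #pragma omp parallel for schedule(static)
        for (long i = 0; i < n; i++)
            out[i] = lut[in[i]];
    }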

 cores  real (s)  CPU (s)
   1:     5.8       5.8
   2:     3.7       5.9
   3:     3.1       6.1
   4:     2.9       6.8
   5:     2.8       7.6
   6:     2.7       8.2
   7:     2.6       9.0
   8:     2.5       9.7

I'm concerned that the amount of CPU time used increases rapidly as the number of cores exceeds 1 or 2.

I imagine that in an ideal scenario CPU time wouldn't increase much (the same amount of work just gets distributed over multiple cores).

Does this mean there's 40% overhead spent on parallelizing the program?


Solution

It's quite possibly an artefact of how CPU time is measured. A trivial example: if you run a 100 MHz CPU and a 3 GHz CPU for one second each, each will report that it ran for one second. The second CPU might do 30 times more work, but it still reports just one second of CPU time.

With hyperthreading, a reasonable (if not quite accurate) model is that one core can run either one task at, let's say, 2000 MHz, or two tasks at, let's say, 1200 MHz each. Running two tasks, the core does only 60% of the work per thread, but 120% of the work for both threads together, a 20% improvement. But if the OS asks how many seconds of CPU time were used, the first case reports "1 second" after each second of real time, while the second reports "2 seconds".

So the reported CPU time goes up. As long as it less than doubles, overall performance has still improved.
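
One way to see this directly is to measure wall-clock time and process CPU time around the same parallel loop. A minimal sketch in C with OpenMP (illustrative, not the asker's program):

    #include <stdio.h>
    #include <time.h>
    #include <omp.h>

    int main(void)
    {
        double wall_start = omp_get_wtime();   /* wall-clock time */
        clock_t cpu_start = clock();           /* process CPU time, summed over threads */

        double sum = 0.0;
        #pragma omp parallel for reduction(+:sum)
        for (long i = 1; i < 100000000L; i++)
            sum += 1.0 / (double)i;            /* embarrassingly parallel busywork */

        double wall = omp_get_wtime() - wall_start;
        double cpu  = (double)(clock() - cpu_start) / CLOCKS_PER_SEC;

        /* With hyperthreading, expect wall time to fall while CPU time
           rises; the run is still a net win whenever the wall-time
           speed-up outpaces the CPU-time growth. */
        printf("wall: %.2f s  cpu: %.2f s  (sum = %f)\n", wall, cpu, sum);
        return 0;
    }

Run it with varying OMP_NUM_THREADS and you should see the same pattern as the table above: real time falls while CPU time climbs.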

OTHER TIPS

Quick question - are you running the genuine time program /usr/bin/time, or the bash built-in of the same name? I'm not sure that it matters; they look very similar.

Looking at your table of numbers, I sense that the processed data set (i.e. the input plus all the output data) is reasonably large overall (bigger than the L2 cache), and that the processing per data item is not that lengthy.

The numbers show a nearly linear improvement from 1 to 2 cores, but that is tailing off significantly by the time you're using 4 cores. The hyperthreaded cores are adding virtually nothing. This means that something shared is being contended for. Your program has free-running threads, so that something can only be memory (the L3 cache and main memory on the i7).

This sounds like a typical example of being I/O bound rather than compute bound, the I/O in this case being to/from the L3 cache and main memory. The L2 cache is 256 KB per core, so I'm guessing that the size of your input data plus one set of results and all intermediate arrays is bigger than 256 KB; even a modest 1024x768 RGB image is over 2 MB, for example.

Am I near the mark?

Generally speaking, when considering how many threads to use, you have to take shared cache and memory speeds and data set sizes into account. That can be a right bugger, because you have to work it out at run time, which is a lot of programming effort (unless your hardware configuration is fixed).
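
As a rough sketch of what "working it out at run time" could look like: on Linux/glibc the cache sizes can be queried with sysconf, so a program can cap its thread count when the combined working sets would overflow the shared L3. The _SC_LEVEL3_CACHE_SIZE query is a glibc extension, and the working-set figure here is made up for illustration:

    #include <stdio.h>
    #include <unistd.h>
    #include <omp.h>

    /* Pick a thread count such that the combined per-thread working
       sets still fit in the shared L3 cache. */
    static int pick_threads(long working_set_bytes)
    {
        long l3 = sysconf(_SC_LEVEL3_CACHE_SIZE);  /* glibc extension */
        int  hw = omp_get_max_threads();           /* logical CPUs, HT included */

        if (l3 <= 0)
            return hw;              /* no cache info; just use them all */

        long fit = l3 / working_set_bytes;
        if (fit < 1)  fit = 1;      /* always run at least one thread */
        if (fit > hw) fit = hw;     /* never exceed available CPUs */
        return (int)fit;
    }

    int main(void)
    {
        int n = pick_threads(2L * 1024 * 1024);  /* 2 MB per thread, illustrative */
        omp_set_num_threads(n);
        printf("using %d threads\n", n);
        return 0;
    }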

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow