Question

I need to observe the CPU time taken by a process on a multi-core, hyper-threaded machine. Suppose a Xeon, Opteron, etc.

Let's assume I have 4 cores, hyper-threaded, meaning 8 'virtual' cores. Let X be the program I want to run and observe how much CPU time it takes.

  • If I run process X on my CPU, I get CPU time A. Suppose A is more than 5 minutes.

  • If I run 8 copies of the same process X, I'll get CPU times B1, B2…, B8.

  • If I run 7 copies of the same process X, I'll get CPU times C1, C2…, C7.

  • If I run 4 copies of the same process X, I'll get CPU times D1, D2…, D4.

Questions:

  1. What's the relationship between numbers A, Bi, Ci, Di?

  2. Is A smaller than Bi? By how much? What about Ci, Di?

  3. Are the times Bi different from one another? What about Ci, Di?
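
For concreteness, here is a minimal sketch of how I could collect these CPU times on Linux. It is Unix-only (it relies on os.fork and os.wait4, which report per-child user and system CPU time), and the path ./X is just a placeholder for the program under test.

```python
#!/usr/bin/env python3
"""Run k concurrent copies of a command and report each copy's CPU time.

Minimal sketch, Unix/Linux only (relies on os.fork and os.wait4).
"./X" is a placeholder for the program under test.
"""
import os
import sys


def run_copies(cmd, k):
    """Start k copies of cmd and return {pid: CPU seconds (user + system)}."""
    pids = []
    for _ in range(k):
        pid = os.fork()
        if pid == 0:                 # child: replace this process with the workload
            os.execvp(cmd[0], cmd)
        pids.append(pid)

    cpu_times = {}
    for _ in pids:
        pid, _status, ru = os.wait4(-1, 0)        # reap any child with its rusage
        cpu_times[pid] = ru.ru_utime + ru.ru_stime
    return cpu_times


if __name__ == "__main__":
    k = int(sys.argv[1]) if len(sys.argv) > 1 else 8
    cmd = sys.argv[2:] or ["./X"]    # placeholder workload
    for pid, seconds in sorted(run_copies(cmd, k).items()):
        print(f"pid {pid}: {seconds:.2f} s of CPU time")
```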


Solution

What's the relationship between numbers A, Bi, Ci, Di?

Expect D1 = D2 = D3 = D4 = A*1, except if you have L2 cache issues (conflicts, misses, ...), in which case the factor will be slightly greater than 1.

Expect B1 = B2 = ... = B8 = A*1.3. The factor 1.3 may vary between 1.1 and 2 depending on your application (certain processor subunits are hyper-threaded, others are not). It was computed from similar statistics, which I give here using the notation of the question: D = 23 seconds and A = 18 seconds, according to a private forum. The unthreaded process did integer computations without input/output. The exact application was checking Adem coefficients in the motivic Steenrod algebra (I don't know what that is; the settings were (2n+e, n) with n = 20).

In the case of seven processes (the Cs), if you assign each process to a core (for instance with /usr/bin/htop on Linux), then one of the processes (C5, for example) will have the same execution time as A, and the others (in my example C1, C2, C3, C4, C6, C7) will have the same values as the Ds. If you do not assign the processes to cores, and your processes last long enough for the OS to balance them between the cores, they will converge to the mean of the Cs.
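
The pinning can also be scripted instead of done interactively in htop. A small sketch, assuming util-linux's taskset is installed; ./X again stands in for the program under test:

```python
#!/usr/bin/env python3
"""Pin each copy of the workload to its own logical core before starting it.

Sketch assuming util-linux's `taskset` is available (the text above sets
affinity interactively in htop instead). "./X" is a placeholder workload.
"""
import subprocess

cores = range(7)   # the seven-copy "C" scenario: one logical core per copy
procs = [subprocess.Popen(["taskset", "-c", str(core), "./X"]) for core in cores]
for proc in procs:
    proc.wait()
```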

Are the times Bi different from one another? What about Ci, Di?

That depends on your OS scheduler and on its configuration. Also, the CPU percentage shown by /bin/top on Linux is misleading: it will show nearly 100% for A, the Bs, the Cs and the Ds.
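
If you want the CPU time actually consumed rather than top's percentage, one option (a sketch, Linux only) is to read the utime and stime fields of /proc/<pid>/stat and convert clock ticks to seconds:

```python
#!/usr/bin/env python3
"""Report the CPU time consumed so far by a running process (Linux only).

Sketch: reads utime/stime from /proc/<pid>/stat instead of trusting the
percentage displayed by top.
"""
import os
import sys


def cpu_seconds(pid):
    with open(f"/proc/{pid}/stat") as f:
        stat = f.read()
    # The comm field sits in parentheses and may contain spaces; skip past it.
    fields = stat[stat.rfind(")") + 2:].split()
    utime_ticks = int(fields[11])    # 14th field of /proc/<pid>/stat: user time
    stime_ticks = int(fields[12])    # 15th field: system time
    return (utime_ticks + stime_ticks) / os.sysconf("SC_CLK_TCK")


if __name__ == "__main__":
    pid = int(sys.argv[1])
    print(f"pid {pid}: {cpu_seconds(pid):.2f} s of CPU time so far")
```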

To assess performance, don't forget /usr/bin/nettop (and variants nethogs, nmon, iftop, iptraf), iotop (and variants iostat, latencytop), collectl (+colmux), and sar (+sag, +sadf).

Other tips

As of 2021, there can be large variations when running multiple experiments; differences of over 50% are possible.

Two gold standards:

  • Run in single-core mode.
  • Disable hyper-threading (a sketch for checking its current state follows below).
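
Whether hyper-threading (SMT) is currently enabled can be checked from sysfs on reasonably recent Linux kernels; actually disabling it (writing "off" to .../smt/control, or in the BIOS) requires root. A small sketch that only reads the state:

```python
#!/usr/bin/env python3
"""Report whether simultaneous multithreading (hyper-threading) is enabled.

Sketch assuming a Linux kernel that exposes /sys/devices/system/cpu/smt/.
Disabling SMT (writing "off" to smt/control) needs root, so this only reads.
"""
from pathlib import Path

smt = Path("/sys/devices/system/cpu/smt")
if smt.is_dir():
    print("SMT active :", (smt / "active").read_text().strip())   # "1" or "0"
    print("SMT control:", (smt / "control").read_text().strip())  # e.g. "on", "off"
else:
    print("This kernel does not expose SMT information under", smt)
```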

To detect the issue:

  • Run the same algorithm multiple times.

In theory, the following could be used when running experiments:

  • Run each experiment k times.

However, this is incomplete when comparing running times, since one group of k runs could execute under conditions that are not comparable with those of another group of k runs.

To alleviate that:

  • Run each experiment k times.
  • Randomize the order of the experiments (a sketch follows below).
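
A sketch of that protocol, where the experiment names and the ./run_experiment driver script are placeholders:

```python
#!/usr/bin/env python3
"""Run each experiment k times, in a randomized global order.

Sketch of the protocol above; the experiment names and the
"./run_experiment" driver script are placeholders.
"""
import random
import subprocess

experiments = ["algo_a", "algo_b", "algo_c"]   # placeholder experiment names
k = 2                                          # repetitions per experiment

runs = [(name, rep) for name in experiments for rep in range(k)]
random.shuffle(runs)                           # interleave repetitions randomly

for name, rep in runs:
    print(f"running {name}, repetition {rep}")
    subprocess.run(["./run_experiment", name], check=True)
```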

For publication purposes, that is not enough, but it can be useful for fast turnaround, even with k = 2.

H/T: discussion in the Slack workspace of the planning community, related to the ICAPS conference: https://www.icaps-conference.org
