質問

I spent this morning trying to find out how to determine which processor id is the hyper-threaded core, but without luck.

I wish to find out this information and use set_affinity() to bind a process to hyper-threaded thread or non-hyper-threaded thread to profile its performance.

役に立ちましたか?

解決

I discovered the simply trick to do what I need.

cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list

If the first number is equal to the CPU number (0 in this example) then it's a real core, if not it is a hyperthreading core.

Real core example:

# cat /sys/devices/system/cpu/cpu1/topology/thread_siblings_list
1,13

Hyperthreading core example

# cat /sys/devices/system/cpu/cpu13/topology/thread_siblings_list
1,13

The output of the second example is exactly the same as the first one. However we are checking cpu13, and the first number is 1, so CPU 13 this is an hyperthreading core.

他のヒント

HT is symmetric (in terms of basic resources, the system-mode may be asymmetric).

So, if the HT is turned on, large resources of Physical core will be shared between two threads. Some additional hardware is turned on to save state of both threads. Both threads have symmetric access to physical core.

There is a difference between HT-disabled core and HT-enabled core; but no difference between 1st half of HT-enabled core and 2nd half of HT-enabled core.

At single moment of time, one HT-thread may use more resources than other, but this resource balancing is dynamic. CPU will balance threads as it can and as it wants if both threads want to use the same resource. You can only do a rep nop or pause in one thread to let CPU give more resources to other thread.

I wish to find out this information and use set_affinity() to bind a process to hyper-threaded thread or non-hyper-threaded thread to profile its performance.

Okay, you actually can measure performance without knowing a fact. Just do a profile when the only thread in system is binded to CPU0; and repeat it when it is binded to CPU1. I think, the results will be almost the same (OS can generate noise if it binds some interrupts to CPU0; so try to lower number of interrupts when do testing and try to use CPU2 and CPU3 if you have such).

PS

Agner (he is the Guru in x86) recommends to use even-numbered cores in the case when you want not to use HT, but it is enabled in BIOS:

If hyperthreading is detected then lock the process to use the even-numbered logical processors only. This will make one of the two threads in each processor core idle so that there is no contention for resources.

PPS About New-reincarnation HT (not a P4 one, but Nehalem and Sandy) - based on Agner's research on microarchitecture

The new bottlenecks that require attention in the Sandy Bridge are the following: ... 5. Sharing of resources between threads. Many of the critical resources are shared between the two threads of a core when hyperthreading is on. It may be wise to turn off hyperthreading when multiple threads depend on the same execution resources.

...

A half-way solution was introduced in the NetBurst and again in the Nehalem and Sandy Bridge with the so-called hyperthreading technology. The hyperthreading processor has two logical processors sharing the same execution core. The advantage of this is limited if the two threads compete for the same resources, but hyperthreading can be quite advantageous if the performance is limited by something else, such as memory access.

...

Both Intel and AMD are making hybrid solutions where some or all of the execution units are shared between two processor cores (hyperthreading in Intel terminology).

PPPS: Intel Optimization book lists resource sharing in second-generation HT: (page 93, this list is for nehalem, but there is no changes of this list in Sandy section)

Deeper buffering and enhanced resource sharing/partition policies:

  • — Replicated resource for HT operation: register state, renamed return stack buffer, large-page ITLB //comment by me: there are 2 sets of this HW
  • — Partitioned resources for HT operation: load buffers, store buffers, re-order buffers, small-page ITLB are statically allocated between two logical processors. // comment by me: there is single set of this HW; it is statically splitted between two HT-virtual cores in two halfs
  • — Competitively-shared resource during HT operation: the reservation station, cache hierarchy, fill buffers, both DTLB0 and STLB. // comment: Single set, but divided not in half. CPU will dynamically redivide resources.
  • — Alternating during HT operation: front-end operation generally alternates between two logical processors to ensure fairness. // comment: there is single Frontend (instruction decoder), so threads will be decoded in order: 1, 2, 1, 2.
  • — HT unaware resources: execution units. // comment: there are actual hw devices which will do computations, memory accesses. There is only single set. If one of threads is capable of using a lot of execution units and if it has a low number of memory waits, it will consume all exec units and second thread performance will be low (but HT will switch sometimes to second thread. How often??? ). If both threads are not heavy-optimized and/or have memory waits, execution units will be splitted between two threads.

There are also pictures at page 112 (Figure 2-13), which shows that both logical cores are symmetric.

The performance potential due to HT Technology is due to:

  • • The fact that operating systems and user programs can schedule processes or threads to execute simultaneously on the logical processors in each physical processor
  • • The ability to use on-chip execution resources at a higher level than when only a single thread is consuming the execution resources; higher level of resource utilization can lead to higher system throughput

Although instructions originating from two programs or two threads execute simultaneously and not necessarily in program order in the execution core and memory hierarchy, the front end and back end contain several selection points to select between instructions from the two logical processors. All selection points alternate between the two logical processors unless one logical processor cannot make use of a pipeline stage. In this case, the other logical processor has full use of every cycle of the pipeline stage. Reasons why a logical processor may not use a pipeline stage include cache misses, branch mispredictions, and instruction dependencies.

I'm surprised nobody has mentioned lscpu yet. Here's an example on a single-socket system with four physical cores and hyper-threading enabled:

$ lscpu -p
# The following is the parsable format, which can be fed to other
# programs. Each different item in every column has an unique ID
# starting from zero.
# CPU,Core,Socket,Node,,L1d,L1i,L2,L3
0,0,0,0,,0,0,0,0
1,1,0,0,,1,1,1,0
2,2,0,0,,2,2,2,0
3,3,0,0,,3,3,3,0
4,0,0,0,,0,0,0,0
5,1,0,0,,1,1,1,0
6,2,0,0,,2,2,2,0
7,3,0,0,,3,3,3,0

The output explains how to interpret the table of IDs; logical CPU IDs with the same Core ID are siblings.

There is universal (Linux/Windows) and portable HW topology detector (cores, HT, cacahes, south bridges and disk/net connection locality) - hwloc by OpenMPI project. You may use it, because linux may use different HT core numbering rules, and we can't know will it be even/odd or y and y+8 nubering rule.

Home page of hwloc: http://www.open-mpi.org/projects/hwloc/

Download page: http://www.open-mpi.org/software/hwloc/v1.10/

Description:

The Portable Hardware Locality (hwloc) software package provides a portable abstraction (across OS, versions, architectures, ...) of the hierarchical topology of modern architectures, including NUMA memory nodes, sockets, shared caches, cores and simultaneous multithreading. It also gathers various system attributes such as cache and memory information as well as the locality of I/O devices such as network interfaces, InfiniBand HCAs or GPUs. It primarily aims at helping applications with gathering information about modern computing hardware so as to exploit it accordingly and efficiently.

It has lstopo command to get hw topology in graphic form like

 ubuntu$ sudo apt-get hwloc
 ubuntu$ lstopo

lstopo from hwloc (OpenMPI) - output example

or in text form:

 ubuntu$ sudo apt-get hwloc-nox
 ubuntu$ lstopo --of console

We can see physical cores as Core L#x each having two logical cores PU L#y and PU L#y+8.

Machine (16GB)
  Socket L#0 + L3 L#0 (4096KB)
    L2 L#0 (1024KB) + L1 L#0 (16KB) + Core L#0
      PU L#0 (P#0)
      PU L#1 (P#8)
    L2 L#1 (1024KB) + L1 L#1 (16KB) + Core L#1
      PU L#2 (P#4)
      PU L#3 (P#12)
  Socket L#1 + L3 L#1 (4096KB)
    L2 L#2 (1024KB) + L1 L#2 (16KB) + Core L#2
      PU L#4 (P#1)
      PU L#5 (P#9)
    L2 L#3 (1024KB) + L1 L#3 (16KB) + Core L#3
      PU L#6 (P#5)
      PU L#7 (P#13)
  Socket L#2 + L3 L#2 (4096KB)
    L2 L#4 (1024KB) + L1 L#4 (16KB) + Core L#4
      PU L#8 (P#2)
      PU L#9 (P#10)
    L2 L#5 (1024KB) + L1 L#5 (16KB) + Core L#5
      PU L#10 (P#6)
      PU L#11 (P#14)
  Socket L#3 + L3 L#3 (4096KB)
    L2 L#6 (1024KB) + L1 L#6 (16KB) + Core L#6
      PU L#12 (P#3)
      PU L#13 (P#11)
    L2 L#7 (1024KB) + L1 L#7 (16KB) + Core L#7
      PU L#14 (P#7)
      PU L#15 (P#15)

Simple way to get hyperthreading siblings of cpu cores in bash:

cat $(find /sys/devices/system/cpu -regex ".*cpu[0-9]+/topology/thread_siblings_list") | sort -n | uniq

There's also lscpu -e which will give relevant core and cpu info:

CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ    MINMHZ
0   0    0      0    0:0:0:0       yes    4100.0000 400.0000
1   0    0      1    1:1:1:0       yes    4100.0000 400.0000
2   0    0      2    2:2:2:0       yes    4100.0000 400.0000
3   0    0      3    3:3:3:0       yes    4100.0000 400.0000
4   0    0      0    0:0:0:0       yes    4100.0000 400.0000
5   0    0      1    1:1:1:0       yes    4100.0000 400.0000
6   0    0      2    2:2:2:0       yes    4100.0000 400.0000
7   0    0      3    3:3:3:0       yes    4100.0000 400.0000

I tried to verify the information by comparing the temperature of the core and load on the HT core.

enter image description here

ライセンス: CC-BY-SA帰属
所属していません StackOverflow
scroll top