Question

I'm benchmarking software which executes 4x faster on an Intel i7-2670QM than my serial version when using all 8 of my 'logical' threads. I would like some community feedback on my reading of the benchmarking results.

When I use 4 threads on 4 cores I get a speedup of 4x; the entire algorithm is executed in parallel. This seems logical to me since Amdahl's law predicts it. Windows Task Manager tells me I'm using 50% of the CPU.

However, if I execute the same software on all 8 threads, I once again get a speedup of 4x, not a speedup of 8x.

If I have understood this correctly: my CPU has 4 cores with a frequency of 2.2 GHz each, but the frequency is divided into 1.1 GHz when applied to 8 'logical' threads, and the same goes for the rest of the components, such as the cache memory? If this is true, then why does Task Manager claim only 50% of my CPU is being used?

#define NumberOfFiles 8
...
char startLetter = 'a';
#pragma omp parallel for shared(startLetter)
for (int f = 0; f < NumberOfFiles; f++) {
    ...
}

I am not including disk I/O in the timing. I am only interested in the time an STL call (std::sort) takes, not the disk I/O.


Solution

An i7-2670QM processor has 4 cores, but it can run 8 threads in parallel. That is, it has only 4 processing units (cores) but hardware support for running 8 threads in parallel. This means that a maximum of four jobs run on the cores at any instant; if one of the jobs stalls, for example due to a memory access, another thread can very quickly start executing on the freed core with very little penalty. Read more on Hyper-Threading. In reality there are few scenarios where Hyper-Threading gives a large performance gain, although more modern processors handle it better than older ones.

Your benchmark showed that the code is CPU bound, i.e. there were few stalls in the pipeline that would have given Hyper-Threading an advantage. 50% CPU is correct, as the 4 cores are working and the 4 extra logical processors are not doing anything. Turn off Hyper-Threading in the BIOS and you will see 100% CPU.

Other tips

This is a quick summary of Hyper-Threading:

Thread switching is slow: execution has to stop, a bunch of register values are copied out to memory, the new thread's values are copied from memory back into the CPU, and only then can execution resume with the new thread.

This is where your 4 virtual cores come in. You have 4 cores, that is it, but Hyper-Threading allows the CPU to keep 2 threads resident on a single core.

Only 1 thread can execute at a time; however, when 1 thread needs to stop for a memory access, disk access, or anything else that is going to take some time, the core can switch in the other thread and run it for a bit. On old processors, the core essentially sat idle during such waits.

So your quad core has 4 cores, each of which can do 1 thing at a time, but each can have a 2nd job on standby for the moment the first has to wait on another part of the computer.

If your task has a lot of memory usage and a lot of CPU usage, you should see a slight decrease in total execution time; but if you are almost entirely CPU bound, you will be better off sticking with just 4 threads.

The important piece of information to understand here is the difference between physical and logical threads.
If you have 4 physical cores on your CPU, you have the physical resources to execute 4 distinct threads of execution in parallel. So, if your threads do not have data contention, you can normally measure a 4x performance increase compared to the speed of a single thread.
I'm also assuming that the OS (or you :)) sets the thread affinity correctly, so each thread runs on its own physical core.
When you enable HT (Hyper-Threading) on your CPU, the core frequency is not modified. :)
What happens is that part of the hardware pipeline (inside the core and around it: uncore, cache, etc.) is duplicated, but part of it is still shared between the logical threads. That's the reason why you do not measure an 8x performance increase. In my experience, enabling all logical cores yields a 1.5x to 1.7x performance improvement per physical core, depending on the code you are executing, cache usage (remember that the L1 cache is shared between the two logical cores of one physical core, for instance), thread affinity, and so on and so forth. Hope this helps.

Some actual numbers:

CPU-intensive task on my i7 (adding the numbers from 1 to 1000000000 into an int variable, 16 times), averaged over 8 tests:

Summary, threads/ticks:

1/26414
4/8923
8/6659
12/6592
16/6719
64/6811
128/6778

Note that in the 'using X threads' lines in the reports below, X is one greater than the number of threads available to do the tasks: one thread submits the tasks and waits on a countdown-latch event for their completion; it processes none of the CPU-heavy tasks and uses no CPU.

8 tests,
16 tasks,
counting to 1000000000,
using 2 threads:
Ticks: 26286
Ticks: 26380
Ticks: 26317
Ticks: 26474
Ticks: 26442
Ticks: 26426
Ticks: 26474
Ticks: 26520
Average: 26414 ms

8 tests,
16 tasks,
counting to 1000000000,
using 5 threads:
Ticks: 8799
Ticks: 9157
Ticks: 8829
Ticks: 9002
Ticks: 9173
Ticks: 8720
Ticks: 8830
Ticks: 8876
Average: 8923 ms

8 tests,
16 tasks,
counting to 1000000000,
using 9 threads:
Ticks: 6615
Ticks: 6583
Ticks: 6630
Ticks: 6599
Ticks: 6521
Ticks: 6895
Ticks: 6848
Ticks: 6583
Average: 6659 ms

8 tests,
16 tasks,
counting to 1000000000,
using 13 threads:
Ticks: 6661
Ticks: 6599
Ticks: 6552
Ticks: 6630
Ticks: 6583
Ticks: 6583
Ticks: 6568
Ticks: 6567
Average: 6592 ms

8 tests,
16 tasks,
counting to 1000000000,
using 17 threads:
Ticks: 6739
Ticks: 6864
Ticks: 6599
Ticks: 6693
Ticks: 6676
Ticks: 6864
Ticks: 6646
Ticks: 6677
Average: 6719 ms

8 tests,
16 tasks,
counting to 1000000000,
using 65 threads:
Ticks: 7223
Ticks: 6552
Ticks: 6879
Ticks: 6677
Ticks: 6833
Ticks: 6786
Ticks: 6739
Ticks: 6802
Average: 6811 ms

8 tests,
16 tasks,
counting to 1000000000,
using 129 threads:
Ticks: 6771
Ticks: 6677
Ticks: 6755
Ticks: 6692
Ticks: 6864
Ticks: 6817
Ticks: 6849
Ticks: 6801
Average: 6778 ms

HT is called SMT (Simultaneous MultiThreading) or HTT (Hyper-Threading Technology) in most BIOSes. The efficiency of HT depends on the so-called compute-to-fetch ratio, that is, how many in-core (register/cache) operations your code performs before it fetches from or stores to the slow main memory or I/O memory. For highly cache-efficient, CPU-bound code, HT gives almost no noticeable performance increase. For more memory-bound code, HT can really benefit execution thanks to so-called "latency hiding". That's why most non-x86 server CPUs provide 4 (e.g. IBM POWER7) to 8 (e.g. UltraSPARC T4) hardware threads per core. These CPUs are usually used in database and transaction-processing systems, where many concurrent memory-bound requests are serviced at once.

By the way, Amdahl's law states that the upper limit of the parallel speedup is one over the serial fraction of the code. Usually the serial fraction increases with the number of processing elements if there is communication or other synchronisation between the threads (possibly hidden in the runtime), although sometimes cache effects can lead to superlinear speedup, and sometimes cache thrashing can reduce performance drastically.

License: CC-BY-SA with attribution
Not affiliated with Stack Overflow