Question

Background

I have an EP (Embarrassingly Parallel) C application running four threads on my laptop, which contains an Intel i5 M 480 running at 2.67GHz. This CPU has two hyperthreaded cores.

The four threads execute the same code on different subsets of data. The code and data have no problems fitting in a few cache lines (fit entirely in L1 with room to spare). The code contains no divisions, is essentially CPU-bound, uses all available registers and does a few memory accesses (outside L1) to write results on completion of the sequence.

The compiler is mingw64 4.8.1, i.e. fairly recent. The best basic optimization level appears to be -O1, which results in four threads that complete faster than two. -O2 and higher run slower (two threads complete faster than four, but slower than -O1), as does -Os. On average, every thread does 3.37 million sequences per second, which comes out to about 780 clock cycles each. On average, every sequence performs 25.5 sub-operations, or one per 30.6 cycles.

So what two hyperthreads do in parallel in 30.6 cycles, one thread will do sequentially in 35-40 cycles, or 17.5-20 cycles each.
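The arithmetic behind these figures can be checked with a small sketch. It takes the figures quoted above (about 780 cycles per sequence, 25.5 sub-operations per sequence, 35-40 cycles for one thread to do two sub-operations) as given; the function names are made up for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* Cycles spent per sub-operation by one thread. */
double cycles_per_subop(double cycles_per_seq, double subops_per_seq)
{
    assert(subops_per_seq > 0.0);
    return cycles_per_seq / subops_per_seq;
}

/* Per-core throughput ratio: two hyperthreads versus one thread alone.
   With hyperthreading, one core finishes two sub-operations in every
   window of ht_cycles_per_subop cycles. */
double ht_gain(double ht_cycles_per_subop, double solo_cycles_per_subop)
{
    assert(ht_cycles_per_subop > 0.0);
    double core_cycles_per_subop = ht_cycles_per_subop / 2.0;
    return solo_cycles_per_subop / core_cycles_per_subop;
}

/* Print the estimate for the figures quoted in the question. */
void print_ht_estimate(void)
{
    double ht = cycles_per_subop(780.0, 25.5);
    printf("cycles per sub-operation with HT: %.1f\n", ht);
    printf("HT gain per core: %.2fx to %.2fx\n",
           ht_gain(ht, 17.5), ht_gain(ht, 20.0));
}
```

Under these assumptions, hyperthreading yields roughly 1.14x to 1.31x the throughput of a single thread per core, which is the margin any change in code generation would have to improve on.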

Where I am

I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources.

These switches work fairly well (when compiling module by module)

-O1 -m64 -mthreads -g -Wall -c -fschedule-insns

as do these when compiling one module which #includes all the others

-O1 -m64 -mthreads -fschedule-insns -march=native -g -Wall -c -fwhole-program

There is no discernible performance difference between the two.

Question

Has anyone experimented with this and achieved good results?


Solution

You say "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources." That's rather misguided.

Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.

What you want is to maximise the percentage of resources that are used. Efficient code will use these resources more efficiently in the first place, and adding hyperthreading can only help. You won't get that much of a speedup through hyperthreading, but that is because you already got the speedup in single-threaded code by making it more efficient. If you want bragging rights that hyperthreading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.

Now, if your code is limited by latencies, it can execute quite a few useless instructions without penalty. With hyperthreading, these useless instructions have a real cost, because they take execution resources the other thread could have used. So for hyperthreading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single-threaded code.

Other tips

You could try locking each thread to a core using processor affinity. I've heard this can improve efficiency by 15%-50% with some code. The saving comes from less state changing in the caches etc. when a context switch happens. This will work better on a machine that is running only your app.
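A minimal sketch of pinning the calling thread, assuming the Win32 API that mingw64 targets. `SetThreadAffinityMask` and `GetCurrentThread` are real Win32 calls; the helper names `single_cpu_mask` and `pin_current_thread` are made up for illustration, and the non-Windows branch is a deliberate stub.

```c
#include <assert.h>
#include <stdint.h>
#ifdef _WIN32
#include <windows.h>
#endif

/* Bit mask selecting a single logical CPU (bit n = logical CPU n). */
uint64_t single_cpu_mask(unsigned cpu)
{
    assert(cpu < 64);
    return (uint64_t)1 << cpu;
}

/* Pin the calling thread to one logical CPU.
   Returns 0 on success, -1 on failure or on an unsupported platform. */
int pin_current_thread(unsigned cpu)
{
#ifdef _WIN32
    /* SetThreadAffinityMask returns the previous mask, or 0 on failure. */
    DWORD_PTR previous = SetThreadAffinityMask(
        GetCurrentThread(), (DWORD_PTR)single_cpu_mask(cpu));
    return previous != 0 ? 0 : -1;
#else
    (void)cpu;
    return -1; /* this sketch only covers the Win32 case */
#endif
}
```

Each worker would call `pin_current_thread(i)` with its own index at startup. Which logical CPU numbers share a physical core depends on the system, so it is worth checking the topology (for example with `GetLogicalProcessorInformation`) before deciding which indices to pair.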

It's possible that hyperthreading is counterproductive here. It often is with computationally intensive loads.

I would try to:

  • disable it at BIOS level and run two threads
  • optimize using the SSE/AVX vector extensions, possibly even by hand

Explanation: HT is useful because hardware threads get scheduled more efficiently than software threads. However, there is an overhead in both. Scheduling 2 threads is more lightweight than scheduling 4, and if your code is already "dense", I'd try to go for "denser" execution, optimizing the execution on 2 pipelines as much as possible.

It's clear that if you optimize less, it scales better, but it will hardly be faster. So if you are looking for more scalability, this answer is not for you; but if you are looking for more speed, give it a try.

As others have already stated, there is no general solution when optimizing; otherwise that solution would already be built into the compilers.
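To give an idea of what hand-written SSE looks like, here is a minimal sketch using SSE2 intrinsics, which are part of the x86-64 baseline and therefore available under -m64. The element-wise add is a stand-in only; the question does not show the actual inner loop, and the function name is made up for illustration.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 intrinsics */
#include <stddef.h>
#include <stdint.h>

/* dst[i] = a[i] + b[i] for n 32-bit integers, four lanes per instruction. */
void add_i32_sse2(int32_t *dst, const int32_t *a, const int32_t *b, size_t n)
{
    assert(dst != NULL && a != NULL && b != NULL);
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        /* Unaligned loads and stores, so the arrays need no special alignment. */
        __m128i va = _mm_loadu_si128((const __m128i *)(a + i));
        __m128i vb = _mm_loadu_si128((const __m128i *)(b + i));
        _mm_storeu_si128((__m128i *)(dst + i), _mm_add_epi32(va, vb));
    }
    for (; i < n; ++i) /* scalar tail for the remaining 0-3 elements */
        dst[i] = a[i] + b[i];
}
```

Whether this helps depends on whether the real sub-operations can be expressed as independent lanes; code that is already register-bound may need restructuring before it vectorizes well.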

You could download an OpenCL or CUDA toolkit and implement a version for your graphics card. You may be able to speed it up 100-fold with little effort.

Licensed under: CC-BY-SA with attribution