Best gcc optimization switches for hyperthreading

Question 1

You say: "I think what I need is generated code which isn't so dense/efficient that the two hyperthreads constantly collide over the local CPU's resources." That's rather misguided.

Your CPU has a certain amount of resources. Code will be able to use some of the resources, but usually not all. Hyperthreading means you have two threads capable of using the resources, so a higher percentage of these resources will be used.

What you want is to maximise the percentage of resources that are used. Efficient code uses these resources better in the first place, and adding hyperthreading can only help. You won't get that much of a speedup from hyperthreading, but that is because you already got the speedup in the single-threaded code by making it more efficient. If you want bragging rights that hyperthreading gave you a big speedup, sure, start with inefficient code. If you want maximum speed, start with efficient code.

Now, if your code was limited by latencies, it could execute quite a few useless instructions without penalty, because they ran while the CPU was waiting anyway. With hyperthreading, these useless instructions have a real cost: they take execution resources that the other hyperthread could have used. So for hyperthreading, you want to minimise the number of instructions, especially those that were hidden by latencies and had no visible cost in single-threaded code.

Question 2

You could try locking each thread to a core using processor affinity. I've heard this can give a 15%-50% improvement in efficiency with some code. The saving comes from context switches: a thread that always resumes on the same core finds more of its data still in that core's caches. This works better on a machine that is running only your app.
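As a rough illustration of the affinity idea, here is a minimal sketch that assumes Linux with glibc, where `sched_setaffinity` with a pid of 0 applies to the calling thread. The helper names `pin_self_to_cpu` and `first_allowed_cpu` are made up for this example. Which logical CPU numbers are hyperthread siblings of the same core depends on the machine, so the CPU you pass in has to be chosen from your own topology.

```c
#define _GNU_SOURCE   /* needed for cpu_set_t and the CPU_* macros on glibc */
#include <sched.h>

/* Hypothetical helper: restrict the calling thread to one logical CPU.
 * Returns 0 on success, -1 on failure (errno is set by the kernel call). */
int pin_self_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Hypothetical helper: lowest-numbered logical CPU the calling thread
 * is currently allowed to run on, or -1 if the mask cannot be read. */
int first_allowed_cpu(void)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    if (sched_getaffinity(0, sizeof(set), &set) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; cpu++)
        if (CPU_ISSET(cpu, &set))
            return cpu;
    return -1;
}
```

Each worker thread would call `pin_self_to_cpu` once at startup with its own CPU number. Whether this helps at all should be measured on the target machine.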

Question 3

Hyperthreading can be counterproductive, and it often is with computationally intensive loads.

I would try the following:

disable it at the BIOS level and run two threads
optimize using the SSE/AVX vector extensions, possibly even by hand

Explanation: HT is useful because hardware threads get scheduled more efficiently than software threads. However, there is overhead in both. Scheduling 2 threads is more lightweight than scheduling 4, and if your code is already "dense", I'd go for "denser" execution, optimizing the execution on 2 pipelines as much as possible.
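To give an idea of what using the vector extensions by hand looks like, here is a small sketch with SSE intrinsics from `<xmmintrin.h>`. The function `saxpy` and its signature are chosen for this example only, and the code falls back to a plain scalar loop when the compiler is not targeting SSE. gcc can often vectorize a loop like this on its own at `-O3`, so the hand-written form is worth keeping only if it measures faster.

```c
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE intrinsics: __m128, _mm_*_ps */
#endif

/* Illustrative function: y[i] = a * x[i] + y[i] for i in [0, n). */
void saxpy(float a, const float *x, float *y, size_t n)
{
    size_t i = 0;
#if defined(__SSE__)
    const __m128 va = _mm_set1_ps(a);            /* broadcast a to 4 lanes */
    for (; i + 4 <= n; i += 4) {
        __m128 vx = _mm_loadu_ps(x + i);         /* unaligned load of 4 floats */
        __m128 vy = _mm_loadu_ps(y + i);
        _mm_storeu_ps(y + i, _mm_add_ps(_mm_mul_ps(va, vx), vy));
    }
#endif
    /* scalar tail, or the whole loop when SSE is not available */
    for (; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```

The unaligned load and store variants are used so that the caller does not have to guarantee 16-byte alignment.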

Clearly, if you optimize less, the code scales better, but it will hardly be faster. So if you are looking for more scalability, this answer is not for you; if you are looking for more speed, give it a try.

As others have already stated, there is no general solution when optimizing; otherwise that solution would already be built into the compilers.

Question 4

You could download an OpenCL or CUDA toolkit and implement a version for your graphics card. You may be able to speed it up 100-fold with little effort.