Question

I have a dual-core machine with 4 logical processors thanks to hyper-threading. I am running a SHA-1 pre-image brute-force test in C#. In each thread I basically have a for loop that computes a SHA-1 hash and then compares it to the hash I am looking for. I made sure that all threads execute in complete separation; no memory is shared between them, except one variable, long count, which I increment in each thread using:

System.Threading.Interlocked.Increment(ref count);

I get about 1 million SHA-1/s with 2 threads and 1.3 million SHA-1/s with 4 threads. I fail to see why I get a 30% bonus from HT in this case. Both cores should already be busy doing their work, so increasing the number of threads beyond 2 should not give me any benefit. Can anyone explain why?
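For reference, a minimal sketch of the setup described above. The thread count, iteration count, and target hash are hypothetical placeholders; the actual comparison logic is elided.

```csharp
using System;
using System.Security.Cryptography;
using System.Text;
using System.Threading;

class BruteForceTest
{
    public static long count = 0;

    // Each worker iterates over its own candidate range; no state is
    // shared between threads except the Interlocked counter.
    static void Worker(int threadId, int iterations, byte[] target)
    {
        using (var sha1 = SHA1.Create())
        {
            for (int i = 0; i < iterations; i++)
            {
                byte[] candidate = Encoding.ASCII.GetBytes($"{threadId}-{i}");
                byte[] hash = sha1.ComputeHash(candidate);
                // Compare hash against the target pre-image hash here.
                Interlocked.Increment(ref count);
            }
        }
    }

    public static void Main()
    {
        int threads = 4;              // hypothetical: matches 4 logical processors
        int perThread = 100_000;      // hypothetical iteration budget
        byte[] target = new byte[20]; // placeholder target hash

        var workers = new Thread[threads];
        for (int t = 0; t < threads; t++)
        {
            int id = t; // capture a stable copy of the loop variable
            workers[t] = new Thread(() => Worker(id, perThread, target));
            workers[t].Start();
        }
        foreach (var w in workers) w.Join();

        Console.WriteLine(count); // 400000 (threads * perThread)
    }
}
```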

Was it helpful?

Solution

Hyper-threading effectively gives you more cores for integer operations: it allows two sets of integer operations to run in parallel on a single physical core. As far as I'm aware it doesn't help floating-point operations, but SHA-1 is presumably composed primarily of integer operations, hence the speed-up.

It's not as good as having 4 real physical cores, of course - but it does allow for a bit more parallelism.
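One way to see this effect is to measure throughput at different thread counts on the same workload. The sketch below is an illustrative harness, not the asker's original code; the iteration count and input size are arbitrary choices.

```csharp
using System;
using System.Diagnostics;
using System.Security.Cryptography;
using System.Threading;

class HtBenchmark
{
    // Measure total SHA-1 hashes per second for a given thread count.
    // Each thread hashes its own private buffer, so no memory is shared.
    public static double MeasureHashesPerSecond(int threadCount, int iterationsPerThread)
    {
        var threads = new Thread[threadCount];
        var sw = Stopwatch.StartNew();
        for (int t = 0; t < threadCount; t++)
        {
            threads[t] = new Thread(() =>
            {
                byte[] input = new byte[64]; // arbitrary fixed-size input
                using (var sha1 = SHA1.Create())
                {
                    for (int i = 0; i < iterationsPerThread; i++)
                        sha1.ComputeHash(input);
                }
            });
            threads[t].Start();
        }
        foreach (var th in threads) th.Join();
        sw.Stop();
        return threadCount * iterationsPerThread / sw.Elapsed.TotalSeconds;
    }

    public static void Main()
    {
        // On a 2-core/4-thread machine, expect a jump from 1 to 2 threads
        // and a smaller bump from 2 to 4, as described above.
        foreach (int n in new[] { 1, 2, 4 })
            Console.WriteLine($"{n} threads: {MeasureHashesPerSecond(n, 200_000):F0} sha1/s");
    }
}
```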

Other tips

Disable HT in the BIOS and run the test again with 2 threads. HT gives a small speedup mainly when one virtual core is executing integer instructions while the other is executing instructions that use the FPU registers.

SMT/hyper-threading allows multiple threads (usually two) to execute on the same physical core: typically one thread runs until it encounters a stall, at which point the core switches to the thread that was waiting.

Stalls happen, mostly on cache misses. Even if you are not traversing the same memory, there is no guarantee that the memory will already be in the cache (accessing it then induces a stall), or that it will not map to the same cache line that another thread is mapping memory to.

Thus, two threads will almost always benefit from SMT/hyper-threading, unless the data they traverse is already present in the cache. That is actually an unusual scenario: an algorithm would typically need to prefetch its data, avoid using more than the cache can hold, and avoid overwriting memory that other threads are trying to keep cached, all of which requires knowledge of the other threads on the core. That is not usually possible, because the OS abstracts it away.

Most algorithms are not tuned to that extent, particularly since it is usually only console-exclusive games, or other hardware-exclusive applications, that can guarantee a certain minimum spec for the cache and, more importantly, have intimate knowledge of the other threads running concurrently on the same core. This is also one of the major reasons larger caches benefit modern CPU performance.

License: CC-BY-SA with attribution
Not affiliated with StackOverflow