Question

I have a program that uses OpenMP to obtain a large speedup on a dual-CPU server with 32 cores in total. The input parameters I'm using don't allow the CPUs to be fully loaded.

Today a couple of cores were 100% loaded by another program. When I launched my program, it was terribly slow even though the CPU load was, as usual, quite high (~2500%). I removed the parallel directives and noticed some performance improvement.

Could this be due to limited memory bandwidth? How can I investigate the issue further and ultimately improve my code?

Solution

Memory access is not necessarily what degrades performance. If you use static scheduling (often the default), each loop is divided into chunks that are assigned to threads up front. If one of those threads is bound to a core that is already busy, it finishes its chunk late and every other thread waits for it at the end of the loop, which can dramatically slow down the run. If you are running in an environment where you are not guaranteed to be the only user of the resources, you may get better performance with dynamic scheduling.
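As a minimal sketch of what dynamic scheduling looks like in the source, the loop below requests it explicitly with a schedule clause. The function names work and sum_dynamic, the chunk size of 64, and the loop body are all hypothetical placeholders rather than anything from the original program.

```c
/* Hypothetical per-iteration workload; stands in for the real loop body. */
static double work(long i)
{
    double x = 0.0;
    for (int k = 0; k < 1000; ++k)
        x += (double)((i + k) % 7);
    return x;
}

/* Sum work(i) over [0, n). With schedule(dynamic, 64), a thread that becomes
   idle grabs the next chunk of 64 iterations, so a thread stuck on a busy
   core processes fewer chunks instead of holding up the whole loop.
   The chunk size 64 is an arbitrary assumption and should be tuned. */
double sum_dynamic(long n)
{
    double total = 0.0;
    #pragma omp parallel for schedule(dynamic, 64) reduction(+:total)
    for (long i = 0; i < n; ++i)
        total += work(i);
    return total;
}
```

Compile with OpenMP enabled (for example -fopenmp with GCC). Dynamic scheduling adds some bookkeeping per chunk, so a very small chunk size can cost more than it saves when iterations are cheap.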

If you did not hard-code a scheduling type (note that OMP_SCHEDULE only takes effect for loops declared with schedule(runtime)), run your program with

OMP_SCHEDULE=dynamic  ./my_program

and see if it helps.
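For the environment variable to have an effect, the loop has to defer its scheduling decision to run time. A minimal sketch, assuming a hypothetical function scale_runtime that is not part of the original program:

```c
/* Multiply every element of a[0..n) by factor. schedule(runtime) tells
   OpenMP to take the schedule kind (and optional chunk size) from the
   OMP_SCHEDULE environment variable instead of fixing it at compile time. */
void scale_runtime(double *a, long n, double factor)
{
    #pragma omp parallel for schedule(runtime)
    for (long i = 0; i < n; ++i)
        a[i] *= factor;
}
```

With a loop written this way, different schedules can be compared without recompiling, for example by setting OMP_SCHEDULE to "static", "dynamic,64", or "guided" before launching the program.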

License: CC-BY-SA with attribution
Not affiliated with StackOverflow