Briefly, my problem:
I have a computer with two sockets, each holding an AMD Opteron 6272, and 64 GB of RAM.
When I run one multithreaded program on all 32 cores, it is 15% slower than when I run two programs, each on one 16-core socket.
How do I make the one-program version as fast as the two-program version?
More details:
I have a large number of tasks and want to fully load all 32 cores of the system.
So I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test ideal, I copy these groups 32 times and distribute the tasks between the 32 cores using Intel TBB's parallel_for loop.
I use pthread_setaffinity_np to ensure that the system does not move my threads between cores, and that all cores are used consecutively.
I use mlockall(MCL_FUTURE) to ensure that the system does not move my memory between sockets.
So the code looks like this:
    void operator()(const blocked_range<size_t> &range) const
    {
        for (unsigned int i = range.begin(); i != range.end(); ++i) {
            pthread_t I = pthread_self();
            int s;
            cpu_set_t cpuset;
            pthread_t thread = I;
            CPU_ZERO(&cpuset);
            CPU_SET(threadNumberToCpuMap[i], &cpuset);
            s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
            mlockall(MCL_FUTURE); // lock virtual memory to stay at physical address where it was allocated
            TaskManager manager;
            for (int j = 0; j < fNTasksPerThr; j++) {
                manager.SetData( &(InpData->fInput[j]) );
                manager.Run();
            }
        }
    }
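The loop above stores the result of pthread_setaffinity_np in s but never looks at it, so a failed pin would go unnoticed. Below is a minimal sketch of how the pinning could be checked, assuming Linux with glibc. The helper names firstAllowedCpu, pinCurrentThreadToCpu and currentThreadIsPinnedTo are hypothetical and are not part of my program.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cerrno>

// Hypothetical helper: lowest CPU id the calling thread is currently
// allowed to run on, or -1 if the affinity mask cannot be read.
int firstAllowedCpu()
{
    cpu_set_t allowed;
    CPU_ZERO(&allowed);
    if (sched_getaffinity(0, sizeof(cpu_set_t), &allowed) != 0)
        return -1;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu)
        if (CPU_ISSET(cpu, &allowed))
            return cpu;
    return -1;
}

// Hypothetical helper: pin the calling thread to one logical CPU.
// Returns 0 on success, otherwise the error number reported by
// pthread_setaffinity_np (EINVAL for an out-of-range id).
int pinCurrentThreadToCpu(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return EINVAL;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    CPU_SET(cpu, &cpuset);
    return pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
}

// Hypothetical helper: true if the calling thread's affinity mask
// contains exactly one CPU and that CPU is `cpu`.
bool currentThreadIsPinnedTo(int cpu)
{
    if (cpu < 0 || cpu >= CPU_SETSIZE)
        return false;
    cpu_set_t cpuset;
    CPU_ZERO(&cpuset);
    if (pthread_getaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset) != 0)
        return false;
    return CPU_COUNT(&cpuset) == 1 && CPU_ISSET(cpu, &cpuset);
}
```

Inside the loop body this would be used as `if (pinCurrentThreadToCpu(threadNumberToCpuMap[i]) != 0) { /* report */ }`, which at least rules out a silent affinity failure as the cause of the slowdown.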
Only the computing time matters to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the measurements.
    void operator()(const blocked_range<size_t> &range) const
    {
        for (unsigned int i = range.begin(); i != range.end(); ++i) {
            pthread_t I = pthread_self();
            int s;
            cpu_set_t cpuset;
            pthread_t thread = I;
            CPU_ZERO(&cpuset);
            CPU_SET(threadNumberToCpuMap[i], &cpuset);
            s = pthread_setaffinity_np(thread, sizeof(cpu_set_t), &cpuset);
            mlockall(MCL_FUTURE); // lock virtual memory to stay at physical address where it was allocated
            InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
            for (int j = 0; j < fNTasksPerThr; j++) {
                InpData[i].fInput[j] = InpDataPerThread.fInput[j];
            }
        }
    }
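The idea behind this preparation step is that each pinned thread allocates and writes its own copy of the input, so that the pages are first touched by the thread that will later read them. Here is a TBB-free sketch of the same idea with std::thread, assuming Linux. The function prepareInputPerThread and the stand-in ProgramInputData struct are hypothetical; my real ProgramInputData is different.

```cpp
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <thread>
#include <vector>

// Stand-in for the real input type (assumption, for illustration only).
struct ProgramInputData {
    double payload[16];
};

// Hypothetical helper: start one worker per entry of threadNumberToCpuMap.
// Each worker pins itself to its CPU and only then allocates and fills its
// private copy of `source`, so allocation and first write happen on the
// pinned thread.
std::vector<std::vector<ProgramInputData>> prepareInputPerThread(
    const std::vector<ProgramInputData>& source,
    const std::vector<int>& threadNumberToCpuMap)
{
    std::vector<std::vector<ProgramInputData>> perThread(threadNumberToCpuMap.size());
    std::vector<std::thread> workers;
    for (std::size_t i = 0; i < threadNumberToCpuMap.size(); ++i) {
        workers.emplace_back([&source, &threadNumberToCpuMap, &perThread, i] {
            const int cpu = threadNumberToCpuMap[i];
            if (cpu >= 0 && cpu < CPU_SETSIZE) {
                cpu_set_t cpuset;
                CPU_ZERO(&cpuset);
                CPU_SET(cpu, &cpuset);
                // The return value is ignored only to keep the sketch short.
                pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
            }
            // Each worker writes only to its own slot, so no locking is needed.
            perThread[i].assign(source.begin(), source.end());
        });
    }
    for (std::thread& worker : workers)
        worker.join();
    return perThread;
}
```

The compute loop would then have thread i read only perThread[i], which is what the 32 copies in my test are meant to achieve.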
Now I run all of this on 32 cores and see a speed of ~1600 tasks per second.
Then I create two versions of the program and, with taskset and pthread, ensure that the first runs on the 16 cores of the first socket and the second on the 16 cores of the second socket. I run them side by side using the shell's & operator:
    program1 & program2 &
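Both the taskset split and threadNumberToCpuMap depend on knowing which CPU numbers belong to which socket. One way to check this, sketched below under the assumption that Linux sysfs is mounted at /sys, is to read each CPU's physical_package_id. The function names cpusBySocket and buildThreadToCpuMap and the maxCpus limit are hypothetical.

```cpp
#include <fstream>
#include <map>
#include <string>
#include <vector>

// Hypothetical helper: group logical CPU ids by physical package (socket)
// id, as reported by Linux sysfs. CPUs whose topology file cannot be read
// are skipped, so the result is empty where sysfs is unavailable.
std::map<int, std::vector<int>> cpusBySocket(int maxCpus = 1024)
{
    std::map<int, std::vector<int>> sockets;
    for (int cpu = 0; cpu < maxCpus; ++cpu) {
        std::ifstream file("/sys/devices/system/cpu/cpu" + std::to_string(cpu) +
                           "/topology/physical_package_id");
        int package = -1;
        if (!(file >> package))
            continue; // CPU not present, or no topology information
        sockets[package].push_back(cpu);
    }
    return sockets;
}

// Hypothetical helper: flatten the grouping so that thread numbers fill
// one socket completely before moving on to the next one.
std::vector<int> buildThreadToCpuMap(const std::map<int, std::vector<int>>& sockets)
{
    std::vector<int> threadToCpu;
    for (const auto& entry : sockets)
        threadToCpu.insert(threadToCpu.end(), entry.second.begin(), entry.second.end());
    return threadToCpu;
}
```

With such a map, threads 0 to 15 land on one socket and threads 16 to 31 on the other, matching the split used for the two separate programs.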
Each of these programs achieves a speed of ~900 tasks/s. In total this is more than 1800 tasks/s, which is 15% more than the one-program version.
What am I missing?
I suspect the problem may be in the libraries, which I load into the memory of the master thread only. Can this be a problem? Can I copy the library data so that it is available independently on both sockets?