Question

Briefly, my problem:

I have a machine with two AMD Opteron 6272 sockets and 64 GB of RAM.

When I run one multithreaded program on all 32 cores, it is about 15% slower than when I run two instances of the program, each pinned to the 16 cores of one socket.

How do I make the one-program version as fast as the two-program version?


More details:

I have a large number of tasks and want to fully load all 32 cores of the system, so I pack the tasks into groups of 1000. Such a group needs about 120 MB of input data and takes about 10 seconds to complete on one core. To make the test uniform I copy these groups 32 times and distribute the tasks between the 32 cores using Intel TBB's parallel_for loop, roughly as sketched below.
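
The driving call is not shown in the question; a minimal sketch of what it might look like (the TaskProcessor name and group count here are illustrative assumptions, not from the original post):

  #include <tbb/blocked_range.h>
  #include <tbb/parallel_for.h>

  using tbb::blocked_range;

  // One range index per group of 1000 tasks; grain size 1 so that every
  // group is handed to a worker thread as a whole.
  const size_t nGroups = 32;
  tbb::parallel_for(blocked_range<size_t>(0, nGroups, 1),
                    TaskProcessor(/* input data, core map, ... */));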

I use pthread_setaffinity_np to ensure that the system does not make my threads jump between cores, and to ensure that all cores are used consecutively.

I use mlockall(MCL_FUTURE) to ensure that the system does not move my memory between sockets.

So the code looks like this:

  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i){

      // Pin the current TBB worker thread to the core assigned to this
      // range index, so the scheduler cannot migrate it between cores.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      (void)s; // the return code should be checked in production code

      mlockall(MCL_FUTURE); // lock virtual memory so it stays at the physical address where it was allocated

      TaskManager manager;
      for (int j = 0; j < fNTasksPerThr; j++){
        manager.SetData( &(InpData->fInput[j]) );
        manager.Run();
      }
    }
  }

Only the computation time matters to me, so I prepare the input data in a separate parallel_for loop and do not include the preparation time in the measurements.

  void operator()(const blocked_range<size_t> &range) const
  {
    for (size_t i = range.begin(); i != range.end(); ++i){

      // Pin the thread before allocating, so the pages below are first
      // touched (and thus placed) on this thread's socket.
      cpu_set_t cpuset;
      CPU_ZERO(&cpuset);
      CPU_SET(threadNumberToCpuMap[i], &cpuset);
      int s = pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &cpuset);
      (void)s; // the return code should be checked in production code

      mlockall(MCL_FUTURE); // lock virtual memory so it stays at the physical address where it was allocated

      // Give each thread its own private copy of the input data
      InpData[i].fInput = new ProgramInputData[fNTasksPerThr];
      for(int j = 0; j < fNTasksPerThr; j++){
        InpData[i].fInput[j] = InpDataPerThread.fInput[j];
      }
    }
  }

When I run all of this on 32 cores, I see a throughput of ~1600 tasks per second.

Then I build two copies of the program and use taskset and pthread_setaffinity_np to ensure that the first runs on the 16 cores of the first socket and the second on the 16 cores of the second socket. I run them side by side simply with & in the shell:

program1 & program2 &
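
The exact launch line is not in the original post; assuming cores 0-15 sit on the first socket and 16-31 on the second (the real node-to-core mapping should be verified with numactl --hardware), it would look something like:

  taskset -c 0-15  ./program1 &
  taskset -c 16-31 ./program2 &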

Each of these programs achieves a throughput of ~900 tasks/s, i.e. >1800 tasks/s in total, which is about 15% more than the one-program version.

What do I miss?

I suspect the problem may be in the libraries, which I load into the memory of the master thread only. Can this be a problem? Can I copy the library data so that it is available independently on both sockets?


Solution

I would guess that it's STL/Boost memory allocation that is spreading the memory for your collections etc. across NUMA nodes, because those allocators are not NUMA-aware and you have threads of the program running on each node.

Custom allocators for all of the STL/Boost containers that you use might help (but is likely a huge job), for example along the lines sketched below.
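
The answer does not include code; a minimal sketch of what such an allocator could look like, assuming libnuma is available (all names here are illustrative, not from the original answer):

  #include <numa.h>    // libnuma; link with -lnuma, check numa_available() first
  #include <cstddef>
  #include <new>

  // Minimal C++ allocator that takes memory from the NUMA node of the
  // calling thread.
  template <typename T>
  struct NumaLocalAllocator {
    using value_type = T;

    NumaLocalAllocator() = default;
    template <typename U>
    NumaLocalAllocator(const NumaLocalAllocator<U>&) {}

    T* allocate(std::size_t n) {
      // numa_alloc_local places the pages on the node the thread runs on,
      // so pin the thread (as in the question) before allocating.
      void* p = numa_alloc_local(n * sizeof(T));
      if (!p) throw std::bad_alloc();
      return static_cast<T*>(p);
    }
    void deallocate(T* p, std::size_t n) { numa_free(p, n * sizeof(T)); }
  };

  template <typename T, typename U>
  bool operator==(const NumaLocalAllocator<T>&, const NumaLocalAllocator<U>&) { return true; }
  template <typename T, typename U>
  bool operator!=(const NumaLocalAllocator<T>&, const NumaLocalAllocator<U>&) { return false; }

  // Usage, after the thread has been pinned to its core:
  //   std::vector<double, NumaLocalAllocator<double>> v(1000);

Note that under Linux's default first-touch policy, plain new already places pages on the local node when the allocating thread is pinned; numa_alloc_local just makes the intent explicit.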

OTHER TIPS

You might be suffering a bad case of false sharing of cache: http://en.wikipedia.org/wiki/False_sharing
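
Not part of the original answer, but a common fix for false sharing is to pad or align per-thread data to cache-line boundaries, for example:

  // Each counter gets its own 64-byte cache line (64 bytes is the typical
  // x86 line size), so writes by one core do not invalidate the lines that
  // the other cores are using.
  struct alignas(64) PaddedCounter {
    long value = 0;
  };

  PaddedCounter counters[32];  // one per thread; no two share a cache line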

Your threads probably share access to the same data structure through the blocked_range reference. If speed is all you need, you might want to pass a copy to each thread. If your data is too large to fit on the call stack, you could dynamically allocate a copy of each range in different cache segments (i.e. just make sure they are far enough apart).

Or maybe I need to see the rest of the code to understand what you are doing better.

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow