Question

I benchmarked a Java program on a 16-core NUMA machine running Red Hat Linux. I measured the throughput of a Java DatagramSocket (for UDP) in terms of how many 64-byte packets it was able to receive and send per second. The program consisted of a single socket and n threads listening on that socket. When a packet arrived, a thread would copy the payload into a byte[] array, create a new DatagramPacket with that array, and send it straight back to where it came from. Think of it as a ping at the UDP layer.
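For reference, here is a minimal sketch of that receive/echo loop. The class name, port constant, and buffer handling are my own illustration, not taken from the original benchmark:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class UdpEchoBenchmark {
    static final int PORT = 5683;        // hypothetical port; any free port works
    static final int PAYLOAD_SIZE = 64;  // 64-byte packets, as in the benchmark

    public static void main(String[] args) throws Exception {
        int nThreads = args.length > 0 ? Integer.parseInt(args[0]) : 1;
        final DatagramSocket socket = new DatagramSocket(PORT); // single shared socket

        for (int i = 0; i < nThreads; i++) {
            new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[PAYLOAD_SIZE];
                    DatagramPacket request = new DatagramPacket(buf, buf.length);
                    try {
                        while (true) {
                            request.setLength(buf.length);    // reset length before each receive
                            socket.receive(request);          // blocks until a packet arrives
                            byte[] payload = new byte[request.getLength()];
                            System.arraycopy(request.getData(), 0, payload, 0, request.getLength());
                            DatagramPacket reply = new DatagramPacket(
                                    payload, payload.length,
                                    request.getAddress(), request.getPort());
                            socket.send(reply);               // echo straight back to the sender
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}
```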

  1. I found that the Java DatagramSocket achieves a significantly lower throughput when using more than one thread, e.g. two or four. If I use only one thread to listen on the socket, I achieve a throughput of 122,000 packets per second, while with more than one thread I achieve only 65,000 packets per second. Now, I understand that a thread might be executed on any core of the NUMA machine and that memory accesses become expensive if the memory has to travel from one node to another. However, if I have two threads, only one should be executed on the “wrong” core, while the other should still achieve a very high throughput. Another possible explanation is a synchronization problem in the DatagramSocket, but these are only guesses. Does anybody have good insight into what the real explanation is?

  2. I found that executing this program multiple times in parallel on multiple ports achieves a higher overall throughput. I started the program with one thread four times, and each instance used a socket on a separate port (5683, 5684, 5685 and 5686). The combined throughput of the four programs was 370,000 packets per second. In summary, using more than one thread on the same port decreases the throughput, while using more than one port with one thread each increases it. How can this be explained? (A sketch of the multi-port setup follows this list.)
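In the experiment these were four separate JVM processes; the sketch below approximates the same setup inside a single JVM, with one single-threaded echo loop bound to each port. This is again my own illustrative code, not the original benchmark:

```java
import java.net.DatagramPacket;
import java.net.DatagramSocket;

public class MultiPortEcho {
    public static void main(String[] args) throws Exception {
        int[] ports = {5683, 5684, 5685, 5686};                      // ports from the second experiment
        for (final int port : ports) {
            final DatagramSocket socket = new DatagramSocket(port);  // one socket per port
            new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[64];
                    DatagramPacket packet = new DatagramPacket(buf, buf.length);
                    try {
                        while (true) {
                            packet.setLength(buf.length);            // reset before each receive
                            socket.receive(packet);
                            // send the received payload straight back to its origin
                            socket.send(new DatagramPacket(packet.getData(), packet.getLength(),
                                    packet.getAddress(), packet.getPort()));
                        }
                    } catch (Exception e) {
                        e.printStackTrace();
                    }
                }
            }).start();
        }
    }
}
```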

System specifications:

Hardware: 16 cores on 2 AMD Opteron(TM) Processor 6212 processors, organized in 4 nodes with 32 GB RAM each. Frequency: 1.4 GHz, 2048 KB cache.

node distances:
node   0   1   2   3
  0:  10  16  16  16
  1:  16  10  16  16
  2:  16  16  10  16
  3:  16  16  16  10

The OS is Red Hat Enterprise Linux Workstation release 6.4 (Santiago) with kernel version 2.6.32-358.14.1.el6.x86_64. The Java version is "1.7.0_09", Java(TM) SE Runtime Environment (build 1.7.0_09-b05), Java HotSpot(TM) 64-Bit Server VM (build 23.5-b02, mixed mode), and I used the -XX:+UseNUMA flag. Server and client are connected over 10 Gbit Ethernet.


Solution

In general, you are most efficient when using only one thread. Making work parallel inevitably introduces cost; the gain in throughput only comes when the additional amount of work you can do in parallel outweighs that cost.

Now, Amdahl's law describes the theoretical gain in throughput in relation to how much of your work can be parallelized and how much cannot. For example, if only 50% of your task is parallelizable, you can get at most a 2x increase in throughput regardless of how many threads you throw at the problem. Note that the chart usually shown alongside Amdahl's law ignores the cost of adding threads. In reality, native OS threads do add quite a bit of cost, especially when many of them are trying to access a shared resource.
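To make that concrete, here is a small, self-contained calculation of Amdahl's law, speedup(n) = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n the number of threads. With p = 0.5 the speedup never exceeds 2x, no matter how many threads are used:

```java
public class AmdahlExample {
    // Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n)
    static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        double p = 0.5; // only 50% of the task can run in parallel
        for (int n : new int[] {1, 2, 4, 16, 1024}) {
            System.out.printf("threads=%d  speedup=%.3f%n", n, speedup(p, n));
        }
        // The speedup approaches but never exceeds 1 / (1 - p) = 2x,
        // and this ignores any per-thread overhead.
    }
}
```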

In your case, when you used only one socket, most of the work was not parallelizable. Hence a single thread gave superior performance, and adding threads made it worse because of the cost they introduced. In your second experiment, you increased the amount of work that could be done in parallel by using more than one socket, so throughput went up despite the overhead of the extra echo loops.
