Problems with lost packets across jgroups channels on EC2

Question 1

Any ideas why some of the packets are not arriving at their destination?

This has been an interesting problem to track down. It seems to affect certain EC2 instances much more than others. The problem is around large packets being sent between EC2 instances via UDP.

The cache synchronization code was sending a large ~300k message to the remote server that gets fragmented (using FRAG2) into 4 packets of 60k (the default size) and 1 packet of 43k which are sent to the remote box. Because of some networking limitation, the remote box only receives the last (5th) 43k message. The 60k messages just never arrive. This seems to happen only between certain pairs of hosts -- other pairs can communicate fine with large packet sizes. That it's not universal is what took so long for me to isolate the diagnose the issue.

I initially thought this was a UDP receiver window size issue and tried to adjust it (sysctl -w net.core.rmem_max=10240000) but this did not help. A look at the tcpdump showed the that 60k packets were just not arriving at the remote host. Only the 43k packets was.

The solution was to decrease the frag size down to 16k (32k may have been fine but we were being conservative). There is some internal AWS limit to the packet sizes as they travel around Amazon's virtual network that is filtering large UDP packets above maybe 50k. The default Jgroups fragment size (60k) is to big IMO and probably should be decreased to 32k or something.

We submitted a ticket on this with Amazon and they acknowledged the issue but the general response was that it was difficult for them to fix. We had tweaked the fragment sizes and were working so the ticket was closed. To quote from the ticket:

From: Amazon Web Services

This is an update for case XXXXXXXXX. We are currently limited to packet sizes of 32k and below on Amazon EC2 and can confirm the issues you are facing for larger packet sizes. We are investigating a solution to this limitation. Please let us know if you can keep your packet sizes below this level, or if this is severe problem blocking your ability to operate.

We are actively looking into increasing the packet size along with other platform improvements, and apologize for this inconvenience.

Couple of other comments about EC2. First, we've see TTL's of >8 necessary for hosts in the same availability zone. If you are using multicast, make sure your TTL is set to 128 or something. We thought this initially was the problem but ultimately it was not.

Hope this helps others.

Question 2

Without adding any element to the answer, I would like to add an alternative way of detecting the same issue.

I'm not a tcpdump expert, then I analysed the issue with debugging and logging.

In our case, a message was split in a number of smaller packets (given the frag_size parameter of FRAG2). Some of them (not necessarily the last one) were randomly not transmitted: typically, packets 1 to 19 were transmitted correctly, 21 was transmitted but 20 was missing.

This was followed by a large number of round-trips between the 2 instances:

The client would be missing packet #20, it acknowledges again #19 and asks for 20; the server would send #20 which is requested explicitely and #21 which has not been acknowledged

The client missing #20 would receive #21 (but not #20), re-acknowledge #19, re-ask #20 and so on for a time from 1 second to 50 seconds.

At the end, the client which is missing #20 generally completes (even if #20 has never been received) without saying anything.