Question

I have a Java application that is writing to the network. It writes messages of around 764 bytes, +/- 5 bytes. A pcap shows that the stream is getting IP fragmented, and we can't explain this.

Linux 2.6.18-238.1.1.el5

A strace shows:

(strace -vvvv -f -tt -o strace.out -e trace=network -p $PID)

1: 2045  12:48:23.984173 sendto(45, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0) = 764
2: 15206 12:48:23.984706 sendto(131, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0 <unfinished ...>
3: 2046  12:48:23.984811 sendto(46, "\0\0\0\0\0\0\2\374\0\0\0\0\0\3\n\0\0\0\0\3upd\365myData"..., 764, 0, NULL, 0 <unfinished ...>
4: 15206 12:48:23.984893 <... sendto resumed> ) = 764
5: 2046  12:48:23.984948 <... sendto resumed> ) = 764

When I capture the traffic, I am seeing packets larger than the MTU, which is causing fragmentation.

4809   5.848987 10.0.0.2 -> 10.0.0.5 TCP 40656 > taiclock [ACK] Seq=325501 Ack=1 Win=46 Len=1448 TSV=344627654 TSER=270108068        # First Fragment
4810   5.848991 10.0.0.5 -> 10.0.0.2 TCP taiclock > 40656 [ACK] Seq=1 Ack=326949 Win=12287 Len=0 TSV=270108081 TSER=344627643       # TCP ack
4811   5.849037 10.0.0.2 -> 10.0.0.5 TCP 40656 > taiclock [PSH, ACK] Seq=326949 Ack=1 Win=46 Len=82 TSV=344627654 TSER=270108081    # Second Frag

Questions:

1) It appears the server is trying to batch the two sendto() calls into one IP packet, which is larger than the MTU and is therefore getting fragmented. Why?

2) Looking at the strace output for PID 2046, is the figure after the equals sign on the <... sendto resumed> line a total for what was sent? I.e. were 764 bytes sent in total across lines 3 and 5, or are 764 bytes being sent per line?

3) Are there any options I can pass to strace to log all of the sendto() output? I can't seem to find anything.


Solution

To answer your questions, in order:

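1) It is perfectly normal for multiple send calls to be coalesced when using TCP: it is a stream protocol, so it does not preserve user-level send boundaries in any way. I don't see any evidence of IP fragmentation (which would be bad) in your trace, just of TCP segmentation (which is completely normal). Each of the two data packets carries its own TCP header and sequence number: the second starts at Seq=326949, which is 325501 + 1448, exactly where the first one ended. A non-first IP fragment carries no TCP header at all, so it could not show up with a sequence number like that. Assuming the usual 1500-byte Ethernet MTU, the 1448-byte payload is exactly a full-sized segment with the TCP timestamp option in use (the TSV/TSER fields): 1500 - 20 (IP header) - 20 (TCP header) - 12 (timestamp option) = 1448.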
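Because TCP hands the receiver a byte stream, any message boundaries have to be recovered by the application. The sketch below is a hypothetical illustration, not the asker's code: it assumes each message begins with an 8-byte big-endian length that counts the whole message, header included, and the class name FrameReaderDemo is made up. It writes two 764-byte messages into one stream, delivers the bytes as a 1448-byte chunk followed by the remainder (mimicking the capture), and still reads back two intact messages because DataInputStream.readFully() keeps reading until it has the requested number of bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.SequenceInputStream;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class FrameReaderDemo {
    // Hypothetical framing: 8-byte big-endian total length (header included), then the body.
    static final int HEADER_BYTES = 8;

    static byte[] frame(byte[] body) throws IOException {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(buf);
        out.writeLong(HEADER_BYTES + body.length); // DataOutputStream writes big-endian
        out.write(body);
        out.flush();
        return buf.toByteArray();
    }

    // Reads whole messages regardless of how the bytes were chunked on the way in.
    static List<byte[]> readMessages(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        List<byte[]> messages = new ArrayList<>();
        while (true) {
            long total;
            try {
                total = data.readLong();
            } catch (EOFException end) {
                return messages; // stream ended (this sketch does not flag a truncated header)
            }
            byte[] body = new byte[(int) (total - HEADER_BYTES)];
            data.readFully(body); // keeps reading across chunk boundaries
            messages.add(body);
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] body = new byte[764 - HEADER_BYTES];
        Arrays.fill(body, (byte) 'x');
        ByteArrayOutputStream stream = new ByteArrayOutputStream();
        stream.write(frame(body)); // first send
        stream.write(frame(body)); // second send
        byte[] wire = stream.toByteArray(); // 1528 bytes, nothing marks where one send ended

        // Deliver the bytes the way the capture shows: one 1448-byte segment, then the rest.
        InputStream first = new ByteArrayInputStream(wire, 0, 1448);
        InputStream rest = new ByteArrayInputStream(wire, 1448, wire.length - 1448);
        List<byte[]> messages = readMessages(new SequenceInputStream(first, rest));

        // Prints "1528 bytes -> 2 messages of 764 bytes"
        System.out.println(wire.length + " bytes -> " + messages.size() + " messages of "
                + (messages.get(0).length + HEADER_BYTES) + " bytes");
    }
}
```

The receiver never sees where one sendto() ended and the next began; only the length field lets it split the stream back into messages.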

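2) Yes, 764 is the total for that one call. Lines 3 and 5 are the start and the end of the same sendto() by PID 2046 (strace split it because PID 15206's call was logged in between), and the number after the equals sign is the value that system call returned when it resumed, i.e. the number of bytes the kernel accepted for sending.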
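The same 764 also appears to be encoded at the start of the payload. strace prints non-printable bytes as octal escapes, so the first eight bytes \0\0\0\0\0\0\2\374 are 00 00 00 00 00 00 02 FC in hex; read as a big-endian 64-bit integer that is 2 × 256 + 252 = 764, matching both the length argument and the return value. Treating those bytes as a length header is an inference from the trace, not something the question confirms. A minimal Java check (the class name StracePrefix is hypothetical):

```java
import java.nio.ByteBuffer;

public class StracePrefix {
    // First eight bytes of each sendto() buffer, transcribed from strace's octal escapes
    // (Java octal literals: 02 == 2, 0374 == 252).
    static final byte[] PREFIX = {0, 0, 0, 0, 0, 0, 02, (byte) 0374};

    static long prefixValue() {
        // ByteBuffer defaults to big-endian (network) byte order.
        return ByteBuffer.wrap(PREFIX).getLong();
    }

    public static void main(String[] args) {
        long value = prefixValue();
        int returned = 764; // the "= 764" strace printed after "<... sendto resumed>"
        // Prints "prefix = 764, sendto returned 764, equal: true"
        System.out.println("prefix = " + value + ", sendto returned " + returned
                + ", equal: " + (value == returned));
    }
}
```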

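3) You can use "-e write=all", or "-e write=" followed by a comma-separated list of file descriptors (for example "-e write=45,46" for two of the sockets in your trace), to get strace to print a full hex and ASCII dump of the written data.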

Licensed under: CC-BY-SA with attribution