Question

I have noticed a problem where the Completed event of a SocketAsyncEventArgs (SAEA) seems to stop firing. The same SAEA can fire properly and be returned to the pool several times, but eventually all instances stop firing, and because the code that returns them to the pool lives in the event handler, the pool empties.

The following circumstances are also apparently true:

1) It seems to occur only when a server-side socket sends data out to one of the connected clients. When the same class connects as a client, it doesn't appear to malfunction.

2) It seems to occur under high load. The thread count creeps up until eventually the error happens.

3) A test rig under similar stress never seems to malfunction. (The failing system handles only 20 messages per second, while the test rig has been proven up to 20,000.)

I'm not going to be able to paste the rather complicated code, but here is a description of my code:

1) The main inspiration is this: http://vadmyst.blogspot.ch/2008/05/sample-code-for-tcp-server-using.html. It shows how to hook up a completion port using an event, how to receive variable-length messages over the TCP connection, and so on.

2) I have a single large byte buffer; each SAEA is assigned its own non-overlapping segment of it.

3) I have an object pool of SAEAs, based on a BlockingCollection. Taking from the pool throws if it stays empty for too long.

4) As a server, I keep a collection of the sockets returned from AcceptAsync, indexed by the client's endpoint. A single process can run one instance as a server alongside multiple instances as clients (forming a web). They share the data buffer and the pool of SAEAs.
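The pool described in (3) might look roughly like this. This is a minimal sketch under my own assumptions (the class name, member names, and timeout handling are all hypothetical, not the actual code):

```csharp
using System;
using System.Collections.Concurrent;
using System.Net.Sockets;

// Hypothetical SAEA pool: a BlockingCollection that throws when a take
// blocks for too long, i.e. when SAEAs have leaked because Completed
// handlers (which return them) stopped firing.
public sealed class SaeaPool
{
    private readonly BlockingCollection<SocketAsyncEventArgs> _pool =
        new BlockingCollection<SocketAsyncEventArgs>();
    private readonly TimeSpan _takeTimeout;

    public SaeaPool(TimeSpan takeTimeout) => _takeTimeout = takeTimeout;

    // Called from the Completed event handler to recycle an SAEA.
    public void Return(SocketAsyncEventArgs args) => _pool.Add(args);

    public SocketAsyncEventArgs Take()
    {
        // TryTake blocks up to the timeout; an empty pool for that long
        // means completions have stalled somewhere.
        if (!_pool.TryTake(out SocketAsyncEventArgs args, _takeTimeout))
            throw new TimeoutException(
                "SAEA pool empty too long; completions may have stopped firing.");
        return args;
    }
}
```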

I realise this is hard to explain; I've been debugging it for a day and a night. I'm just hoping someone has heard of this or has useful questions or suggestions.

At the moment I suspect some sort of thread exhaustion, leaving the SAEAs unable to invoke their completions. Alternatively, some sort of problem with the outgoing buffer.


Solution

So, another day of debugging and finally I have an explanation.

1) The SAEAs were not firing the Completed event because they could not send any more data. Wireshark revealed the cause: the receiving side's TCP window had dropped to zero (TCP ZeroWindow).

2) The TCP window was closing because the networking layer passed each message up the stack through an event that took too long to complete, i.e. there is no producer/consumer boundary between the network layer and the UI. The receive path therefore had to wait for a screen draw before it could acknowledge more data.

3) The event that took too long was a screen redraw in a GUI event handler. The test rig was a console application (one that merely summarized incoming messages), which is why it didn't cause a problem even at much higher load. Redrawing the screen on every message is not normal practice; it was happening only because the project isn't quite done yet, and the redraw rate would have been throttled later.
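The redraw throttling mentioned in (3) could be sketched like this: coalesce incoming messages and repaint at a fixed rate instead of once per message. All names here are illustrative, not from the actual project:

```csharp
using System;
using System.Threading;

// Hypothetical throttle: the network handler only increments a counter,
// and a timer repaints at a fixed interval, covering all messages that
// arrived since the last repaint.
public sealed class RedrawThrottle
{
    private int _pending;            // messages since the last repaint
    private readonly Timer _timer;

    public RedrawThrottle(Action<int> repaint, TimeSpan interval)
    {
        _timer = new Timer(_ =>
        {
            int n = Interlocked.Exchange(ref _pending, 0);
            if (n > 0) repaint(n);   // one repaint covers n messages
        }, null, interval, interval);
    }

    // Cheap and non-blocking; safe to call from the network event handler.
    public void OnMessage() => Interlocked.Increment(ref _pending);
}
```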

4) The short-term fix is simply to make sure no GUI work holds up the network event handlers. A more robust solution is to introduce a producer/consumer queue at the network layer.
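The producer/consumer fix in (4) could look roughly like this sketch: a bounded queue between the network layer and the UI, so the Completed handler returns immediately and the TCP receive buffer keeps draining. The class and member names are assumptions for illustration:

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading.Tasks;

// Hypothetical pump decoupling the socket callbacks from the slow UI.
public sealed class MessagePump
{
    // Bounded so that a slow consumer applies back-pressure here,
    // in memory, rather than stalling the socket's receive window.
    private readonly BlockingCollection<byte[]> _queue =
        new BlockingCollection<byte[]>(boundedCapacity: 10_000);

    public MessagePump(Action<byte[]> consume)
    {
        // Single consumer task: the UI work runs here (or is marshalled
        // to the UI thread on its own schedule), never inside the
        // SAEA Completed callback.
        Task.Run(() =>
        {
            foreach (byte[] msg in _queue.GetConsumingEnumerable())
                consume(msg);
        });
    }

    // Called from the Completed handler: enqueue and return at once.
    public void Produce(byte[] message) => _queue.Add(message);
}
```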

Licensed under: CC-BY-SA with attribution