Question

I have a server written in C++ running on Ubuntu 10.04, currently in production, which exhibits a weird bug.

Context :

Each client connecting to the server has one socket and two threads:

  • 1 thread for writing to the socket,
  • 1 thread for reading from the socket.

The socket is configured via ::setsockopt with SO_RCVTIMEO of 10 seconds.

Each ::send on the socket has the MSG_NOSIGNAL flag set (each ::recvfrom does too, but that should have no impact).

Bug :

I have some evidence (though I am not 100% sure) that the following scenario occurs on rare occasions:

  • ::recvfrom is called and blocks until either data is present or the timeout is reached
  • ::send is called, the write on the socket fails, and it returns an EPIPE (Broken pipe) error
  • Bug: ::recvfrom is still blocked and never returns, somehow ignoring the SO_RCVTIMEO option

Does the above scenario make sense to you?

Metrics :

The bug happens approximately once a week. During a week, there are approximately:

  • 20K sockets used
  • 30M ::recvfrom calls
  • 60M ::send calls

Should I use the timeout feature of ::select instead? (This assumes that its timeout implementation differs from the SO_RCVTIMEO one.)

Thanks a lot for any ideas on this matter!

No correct solution

Licensed under: CC-BY-SA with attribution