mysql_ping hangs with Amazon RDS

Question

I can't find a citation in the documentation, but my experience suggests that the EC2 networking infrastructure implements stateful packet inspection (SPI) and will "forget" that a TCP connection is valid after a few minutes of absolute idleness, causing the behavior you describe. This does not appear to be limited strictly to "EC2 instances": it would include RDS, and likely any other AWS service that runs on virtual machines provisioned per customer, if not all of AWS.

The machines on both ends of the connection may be convinced that the connection is still there, but the network will not allow traffic to pass between them. In an SPI environment, TCP sessions are not discovered; they are created, and they can only be created when the network sees the connection from the very beginning (SYN, SYN/ACK, ACK). I originally encountered this issue with MySQL servers in EC2 (not RDS), but I would be very surprised if the underlying cause were not the same.

There are two possible approaches to work around this.

If your PHP machine runs Linux, configure the kernel to keep the connections alive at layer 4. This change will be invisible to you: the keepalives won't change the value in the Time column of SHOW PROCESSLIST for connections in Sleep, because they don't reset the amount of time the connection has been idle at layer 7. They should, however, avoid the timeouts from the AWS infrastructure, provided the libraries managing the MySQL connections set the socket options correctly to take advantage of them.

http://tldp.org/HOWTO/TCP-Keepalive-HOWTO/usingkeepalive.html explains how to set this up live, and how to make it persistent across reboots.
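
To illustrate what "setting the socket options" means here: on Linux the kernel keepalive settings only apply to sockets that have SO_KEEPALIVE turned on, and the kernel default of 7200 seconds before the first probe is far longer than the few minutes of idleness described above. The following is only a minimal Python sketch of the socket-level side, not what any PHP MySQL library actually does. The function name enable_keepalive and the values 120/30/4 are made up for illustration, and the three TCP_KEEP* options are per-socket overrides that are not available on every platform.

```python
import socket


def enable_keepalive(sock, idle=120, interval=30, count=4):
    """Turn on TCP keepalive probes for a single socket.

    idle, interval and count are illustrative values only; pick numbers
    comfortably below the network's idle timeout.
    """
    # Without SO_KEEPALIVE the kernel never sends probes on this socket.
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

    # Per-socket overrides of the kernel-wide keepalive settings.
    # These constants are platform-specific, so skip any that are missing.
    overrides = (
        ("TCP_KEEPIDLE", idle),       # seconds of idleness before the first probe
        ("TCP_KEEPINTVL", interval),  # seconds between unanswered probes
        ("TCP_KEEPCNT", count),       # unanswered probes before the socket is dropped
    )
    for name, value in overrides:
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("SO_KEEPALIVE enabled:", s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0)
    s.close()
```

If the client library enables SO_KEEPALIVE but does not set the per-socket overrides, the kernel-wide values described in the HOWTO are the ones that apply.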

Failing that, the other option is to force MySQL to close the connection sooner than the network timeout, so that the PHP machine will immediately recognize that it's trying to talk on a closed socket. It may sound counter-intuitive to shorten a timeout rather than lengthening it, but a shorter timeout should cause your ping test to fail very quickly if a session has been idle too long, which also (essentially) "solves" the problem, assuming sanity in the PHP client library. Once your application is busier, the connections will presumably seldom be idle long enough to reach the timeout.

MySQL Server has two different idle timeout settings: wait_timeout (for non-interactive sessions, i.e., connections from code, like PHP) and interactive_timeout (for query browsers and the command line client). The server only knows the difference because the client library notifies it which type of connection it is establishing. Assuming your client library sets this up correctly, wait_timeout is the one you're looking for. Setting it to a value below 900 should resolve the issue if changing the TCP keepalive settings in the Linux kernel doesn't. Note, though, that only connections made after the change will be affected; connections already established when the change is made will keep running with the value they started with, which defaults to 8 hours (28800 seconds). Both settings are configurable in the RDS Parameter Group for your instance.
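
Because only new connections pick up a changed value, it can be worth checking what a given connection is actually running with. This is a rough Python sketch of such a check, not part of any particular driver: session_wait_timeout is a hypothetical helper, and it assumes a DB-API 2.0 style cursor that returns rows as tuples (a dictionary-style cursor would need a different lookup).

```python
def session_wait_timeout(cursor):
    """Return the wait_timeout, in seconds, in effect for this connection.

    `cursor` is assumed to be a DB-API 2.0 cursor on an open MySQL
    connection that returns each row as a tuple.
    """
    cursor.execute("SHOW SESSION VARIABLES LIKE 'wait_timeout'")
    row = cursor.fetchone()
    if row is None:
        raise RuntimeError("server did not report wait_timeout")
    # The row is (Variable_name, Value); the value may arrive as a string.
    return int(row[1])
```

A connection opened before the parameter group change would still report the old value here, while one opened afterwards should report the new one.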

There are hints of similar behavior in the AWS docs here, along with the Windows registry settings that need to be adjusted to change TCP keepalives if you're running the PHP server on Windows instead of Linux, as I assumed above. Even though the article is specifically about Redshift and connections from outside EC2, it still seems to validate the underlying issue discussed above.