For a while now I've been combatting persistent, obnoxious, random disconnects on a few Linux hosts at the office. The main symptom was suddenly getting "Write failed: broken pipe" on the client side, with no indications of anything abnormal about the disconnect on the server side. It seemed to be happening whether my session was active or not; keeping things busy with "top" or otherwise didn't make any difference.
I made the usual inquiries to see if somehow the machine was exhausting its resources, but it was well below its limits on file descriptors, memory, and TCP memory. There was nothing abnormal in netstat, the interface counters (no ierrs or oerrs), or netstat -s. The disconnects happened at random times, uncorrelated with any spikes in CPU or memory consumption. I went so far as to watch cron, just to see if some misbehaving script could be doing a kill -9 on my shell or my sshd process. No dice.
So I cranked up the LogLevel in sshd_config to "DEBUG3", and found that for reasons unknown, sshd won't report particularly useful bits of information like "Read error from remote host" and "Connection timed out" unless you have debugging cranked up. That might be a bit important, OpenSSH guys... maybe "info" would be a better level for that message?
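For reference, it's a one-line change in sshd_config (followed by a restart of sshd):

```
# /etc/ssh/sshd_config
LogLevel DEBUG3
```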
This seemed fairly strange because it was happening while plenty of interaction was in progress. Certainly my connections were not timing out in any amount of time that seemed normal.
Running tcpdump on both ends of the connection soon revealed something interesting, however: every 5 seconds, a TCP packet was exchanged between server and client, and the last activity before each disconnect was a burst of several packets at 1-second intervals.
Yes, it's good old TCP KeepAlive, which someone had set up quite aggressively on this particular host. Our Linux installation has three variables controlling KeepAlives: net.ipv4.tcp_keepalive_time, net.ipv4.tcp_keepalive_intvl, and net.ipv4.tcp_keepalive_probes.
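You can read all three straight out of /proc without root (or via the sysctl command, if you prefer):

```shell
# Current keepalive settings on this box; values are in seconds,
# except probes, which is a count.
cat /proc/sys/net/ipv4/tcp_keepalive_time
cat /proc/sys/net/ipv4/tcp_keepalive_intvl
cat /proc/sys/net/ipv4/tcp_keepalive_probes
```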
If a connection has been idle for at least tcp_keepalive_time seconds, the network stack will send up to tcp_keepalive_probes probes, one every tcp_keepalive_intvl seconds. If none of those probes is acknowledged by the time the last probe's interval expires, the connection is broken with -- you guessed it -- that "Connection timed out" error.
On this machine, keepalive_time was set to 5 seconds, probes was set to 2, and keepalive_intvl was set to 1. So after every 5 seconds of inactivity, the stack would send out up to 2 probes spaced 1 second apart. And if it didn't hear back from either of those 2 probes within 1 second, the connection would get killed. You can imagine it wouldn't take very much network congestion to lose 2 packets in 2 seconds, especially if WiFi got involved anywhere along the line. I suspect being on a VPN and connecting to a VM only made it more error-prone.
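Putting the three knobs together, the worst case from last activity to a dropped connection is roughly tcp_keepalive_time plus tcp_keepalive_probes times tcp_keepalive_intvl. A back-of-the-envelope sketch (the function name is mine, just to make the arithmetic concrete):

```python
def keepalive_deadline(time_s: int, probes: int, intvl_s: int) -> int:
    """Rough worst-case seconds of silence before the stack gives up
    on the peer: the idle threshold, plus one interval per unanswered
    probe."""
    return time_s + probes * intvl_s

# This host's settings: 5s idle, 2 probes, 1s apart.
print(keepalive_deadline(5, 2, 1))  # 7 -- two lost packets and you're gone
```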
In the case of this particular host, a quick disconnect is actually desirable, but I think we were a little too aggressive in how quick. Changing the number of probes from 2 to 5 has allowed the machine to be much more tolerant of short-term, transient network glitches while still failing fast in the event that something really has gone wrong with the network.
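The change itself is a single sysctl. Applied immediately with "sysctl -w net.ipv4.tcp_keepalive_probes=5", and persisted across reboots with a line in /etc/sysctl.conf:

```
# /etc/sysctl.conf -- tolerate a few lost probes before declaring the peer dead
net.ipv4.tcp_keepalive_probes = 5
```

(Reload the file with "sysctl -p" after editing.)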