What a mess... But ok, here is a short list of things to test:
- dfs.client.failover.connection.retries.on.timeouts: default 0, try a value between 2 and 5
- dfs.client.failover.connection.retries: default 0, try a value between 2 and 5
- dfs.client.failover.max.attempts: default 15, try a value above 15 but below 50
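As a rough sketch, these would go in the client-side hdfs-site.xml. The values 3, 3 and 30 are just arbitrary picks inside the ranges above, meant as a starting point to test, not as recommendations:

```xml
<!-- hdfs-site.xml on the client; values are illustrative starting points only -->
<property>
  <name>dfs.client.failover.connection.retries.on.timeouts</name>
  <value>3</value>
</property>
<property>
  <name>dfs.client.failover.connection.retries</name>
  <value>3</value>
</property>
<property>
  <name>dfs.client.failover.max.attempts</name>
  <value>30</value>
</property>
```

Change one setting at a time so you can tell which one actually helps.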
If there is latency inside your Hadoop cluster too, consider the Rack Awareness feature and assign a unique rack ID to each node; this tells Hadoop that all your nodes are distant from one another.
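One way to do this is a topology script referenced by net.topology.script.file.name in core-site.xml: Hadoop calls it with a batch of hostnames or IPs as arguments and expects one rack path per argument on stdout. Below is a minimal sketch that puts every node in its own rack; the `/rack-<host>` naming and the `rack_for` helper are my own assumptions, so adapt them to your setup:

```python
#!/usr/bin/env python3
"""Topology script sketch: maps every host to its own rack."""
import sys


def rack_for(host):
    # Hypothetical naming scheme: derive a unique rack path from the
    # host name or IP, replacing dots so the result is one path segment.
    return "/rack-" + host.replace(".", "-")


def main(hosts):
    # Print one rack path per argument, in the same order, separated by spaces.
    print(" ".join(rack_for(h) for h in hosts))


if __name__ == "__main__":
    main(sys.argv[1:])
```

The script has to be executable by the user running the NameNode, and the NameNode needs a restart to pick up the new property.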
More info here: http://hadoop.apache.org/docs/r2.2.0/hadoop-project-dist/hadoop-hdfs/hdfs-default.xml