I believe that ZooKeeper is the problem. A ZooKeeper ensemble should have 2N+1 instances, which lets it tolerate N failed nodes, because a majority of the instances (N+1) must stay up to keep quorum. With only 2 instances, the majority is still 2, the same as for a 3-node ensemble (2*1+1=3), so no failure can be tolerated. If either of your 2 ZooKeeper nodes goes down, the whole ensemble is down.
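The quorum arithmetic above can be sketched in a few lines of Python. This is only an illustration of the majority rule, not ZooKeeper code; the function names `quorum_size` and `tolerated_failures` are made up for this example.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the ensemble (hypothetical helper)."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many instances can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare small ensemble sizes side by side.
    for n in (1, 2, 3, 4, 5):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under this rule an ensemble of 2 tolerates no failures at all, and an ensemble of 4 tolerates no more than an ensemble of 3, which is why odd sizes are the usual choice.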
To achieve high availability, it is recommended to deploy an independent ZooKeeper ensemble with at least 3 instances on 3 different machines, so that no single machine is a single point of failure (SPoF).