Question

I have 3 ZooKeeper nodes. They were working fine, but after I restarted them with ./zkServer.sh restart, ZooKeeper did not come back up.

When I check the ZooKeeper status, it returns:

./zkServer.sh status
JMX enabled by default
Using config: /opt/zookeeper/bin/../conf/zoo.cfg
Error contacting service. It is probably not running.

My zoo.cfg is:

dataDir=/var/lib/zookeeperdata/3
clientPort=2181
initLimit=50
tickTime=2000
syncLimit=10
maxClientCnxns=100000
server.1=IP1 value:2888:3888
server.2=IP2 value:2889:3889
server.3=127.0.0.1:2890:3890

The behavior is unstable: if I restart the 3 ZooKeeper nodes again in two hours or tomorrow, they may see each other and work fine. This has happened to me before.

zookeeper log:

2014-05-14 15:22:34,236 [myid:3] - INFO  [main:NIOServerCnxnFactory@94] - binding to port 0.0.0.0/0.0.0.0:2181
2014-05-14 15:22:34,282 [myid:3] - INFO  [main:QuorumPeer@913] - tickTime set to 2000
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@933] - minSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@944] - maxSessionTimeout set to -1
2014-05-14 15:22:34,283 [myid:3] - INFO  [main:QuorumPeer@959] - initLimit set to 50
2014-05-14 15:22:34,356 [myid:3] - INFO  [main:FileSnap@83] - Reading snapshot /var/lib/zookeeperdata/3/version-2/snapshot.f100000001
2014-05-14 15:22:43,387 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@197] - Accepted socket connection from /127.0.0.1:50923
2014-05-14 15:22:43,396 [myid:3] - INFO  [Thread-1:QuorumCnxManager$Listener@486] - My election bind port: 0.0.0.0/0.0.0.0:3890
2014-05-14 15:22:43,404 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@354] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2014-05-14 15:22:43,404 [myid:3] - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1001] - Closed socket connection for client /127.0.0.1:50923 (no session established for client)
2014-05-14 15:22:43,427 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@670] - LOOKING
2014-05-14 15:22:43,429 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@740] - New election. My id =  3, proposed zxid=0xf100000001
2014-05-14 15:22:48,438 [myid:3] - WARN  [WorkerSender[myid=3]:QuorumCnxManager@368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.toSend(QuorumCnxManager.java:327)
  at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.process(FastLeaderElection.java:393)
  at org.apache.zookeeper.server.quorum.FastLeaderElection$Messenger$WorkerSender.run(FastLeaderElection.java:365)
  at java.lang.Thread.run(Thread.java:662)
2014-05-14 15:22:53,440 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumCnxManager@368] - Cannot open channel to 1 at election address /54.76.10.81:3888
java.net.SocketTimeoutException: connect timed out
  at java.net.PlainSocketImpl.socketConnect(Native Method)
  at java.net.PlainSocketImpl.doConnect(PlainSocketImpl.java:351)
  at java.net.PlainSocketImpl.connectToAddress(PlainSocketImpl.java:213)
  at java.net.PlainSocketImpl.connect(PlainSocketImpl.java:200)
  at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:366)
  at java.net.Socket.connect(Socket.java:529)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:354)
  at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:388)

I searched a lot for this but did not find anything useful, so I hope someone can help me.

Thanks

Solution 2

I fixed it by changing the IP 127.0.0.1 to the internal IP of the Amazon node. After making this change on all three nodes and restarting, the problem did not happen again. I hope this answer helps someone with the same problem.
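
For anyone who wants to check their own configuration for the same thing, here is a minimal sketch that scans the server.N lines of a zoo.cfg and flags entries that point at a loopback address. The function names and the 10.0.0.x addresses in the sample are made up for illustration; only the 127.0.0.1 entry and the ports come from the question. It handles literal IPv4 addresses and "localhost" only, and it does not resolve hostnames.

```python
import ipaddress
import re

# Matches lines such as "server.3=127.0.0.1:2890:3890"
SERVER_LINE = re.compile(r"^server\.(\d+)=([^:]+):(\d+):(\d+)")


def parse_servers(cfg_text):
    """Return {server_id: (host, quorum_port, election_port)} from zoo.cfg text."""
    servers = {}
    for line in cfg_text.splitlines():
        m = SERVER_LINE.match(line.strip())
        if m:
            sid, host, qport, eport = m.groups()
            servers[int(sid)] = (host.strip(), int(qport), int(eport))
    return servers


def loopback_entries(servers):
    """Return the server ids whose host is 'localhost' or a loopback IP."""
    flagged = []
    for sid, (host, _, _) in sorted(servers.items()):
        if host == "localhost":
            flagged.append(sid)
            continue
        try:
            if ipaddress.ip_address(host).is_loopback:
                flagged.append(sid)
        except ValueError:
            pass  # a hostname rather than a literal IP; not checked here
    return flagged


# Hypothetical sample: the 10.0.0.x addresses are placeholders, not real values.
sample = """\
server.1=10.0.0.11:2888:3888
server.2=10.0.0.12:2889:3889
server.3=127.0.0.1:2890:3890
"""
print(loopback_entries(parse_servers(sample)))
```

Running the same check against the zoo.cfg on every node makes it easy to spot a node whose own entry still says 127.0.0.1.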

OTHER TIPS

I've seen behavior like this as well. A ZK configuration that's been running fine will sometimes simply fail to restart. When this happens I've tried the following:

1) Look at the logs for all of the servers; often one will list an error.
2) Stop all servers and restart.
3) Stop all servers and restart the servers one at a time.
4) Verify that each server's myid file exists, has correct permissions, and has the right value.
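
Since the log in the question shows "Cannot open channel to 1 at election address" with a connect timeout, it can also help to test from each node whether the other nodes' election ports are reachable at all. The following is a rough sketch using only the Python standard library; can_connect and unreachable_peers are hypothetical helper names, and the demo uses a local listener as a stand-in for a peer so that it runs anywhere.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers refused connections and timeouts
        return False


def unreachable_peers(peers, timeout=5.0):
    """peers maps server id -> (host, election_port); return the ids that fail."""
    return [sid for sid, (host, port) in sorted(peers.items())
            if not can_connect(host, port, timeout)]


# Self-contained demo: a local listener stands in for a healthy peer.
# In practice, pass the hosts and election ports from the server.N lines.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
open_port = listener.getsockname()[1]
print(unreachable_peers({1: ("127.0.0.1", open_port)}, timeout=1.0))
listener.close()
```

A peer that shows up as unreachable from one node but not from another usually points at addressing or firewall rules rather than at ZooKeeper itself.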

I've used clusterssh to open windows to each of the servers so that the restarts can happen at the same time, and then I've tailed all of the server logs. Keep in mind that during restart the ZK cluster is doing a lot: both starting each server and electing a leader. I've had times when the cluster seemed to fail and then, after a few more minutes, sorted itself out.

There is a great tool called zktop that I've used for monitoring ZK.

Make sure you have set the correct dataDir in each node's configuration. Also put a myid file in the dataDir containing a number between 1 and 255, unique to each node. I think that will resolve the issue.
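
The myid check can be scripted along these lines. This is only a sketch: read_myid and myid_matches are hypothetical names, the 1-255 range is the one given above, and the demo writes a throwaway myid file into a temporary directory instead of touching a real dataDir. A file with wrong permissions would surface as a PermissionError from open().

```python
import os
import tempfile


def read_myid(data_dir):
    """Return the integer stored in <data_dir>/myid, or raise ValueError."""
    path = os.path.join(data_dir, "myid")
    if not os.path.isfile(path):
        raise ValueError(f"{path} does not exist")
    with open(path) as f:
        text = f.read().strip()
    try:
        myid = int(text)
    except ValueError:
        raise ValueError(f"{path} does not contain an integer: {text!r}")
    if not 1 <= myid <= 255:
        raise ValueError(f"myid {myid} is outside the range 1-255")
    return myid


def myid_matches(data_dir, configured_ids):
    """True if the myid file names one of the server.N ids from zoo.cfg."""
    return read_myid(data_dir) in configured_ids


# Demo against a temporary directory standing in for dataDir.
with tempfile.TemporaryDirectory() as d:
    with open(os.path.join(d, "myid"), "w") as f:
        f.write("3\n")
    print(read_myid(d), myid_matches(d, {1, 2, 3}))
```

On a real node, data_dir would be the dataDir value from zoo.cfg (for example /var/lib/zookeeperdata/3 in the question) and configured_ids the set of N values from the server.N lines.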

Licensed under: CC-BY-SA with attribution