How does detection of terminated nodes work in Erlang? How does net_ticktime influence node liveness checking in Erlang?

https://stackoverflow.com//questions/24061270

26-12-2019

Question

I set the net_ticktime value to 600 seconds:

net_kernel:set_net_ticktime(600)

The Erlang documentation for net_ticktime = TickTime says:

Specifies the net_kernel tick time. TickTime is given in seconds. Once every TickTime/4 second, all connected nodes are ticked (if anything else has been written to a node) and if nothing has been received from another node within the last four (4) tick times that node is considered to be down. This ensures that nodes which are not responding, for reasons such as hardware errors, are considered to be down.

The time T, in which a node that is not responding is detected:

MinT < T < MaxT where:

MinT = TickTime - TickTime / 4
MaxT = TickTime + TickTime / 4

TickTime is by default 60 (seconds). Thus, 45 < T < 75 seconds.

Note: Normally, a terminating node is detected immediately.

My problem: my TickTime is 600 seconds, so 450 seconds (7.5 minutes) < T < 750 seconds (12.5 minutes). However, even though I set net_ticktime to 600 on all distributed nodes, when a node fails (e.g. when I close its Erlang shell) the other nodes are notified immediately, not within the window that the tick time definition predicts.
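The window arithmetic can be checked with a small helper. This is only a sketch of the formula quoted from the documentation; the module name ticktime_window and the function bounds/1 are made up for illustration and are not part of OTP.

```erlang
-module(ticktime_window).
-export([bounds/1]).

%% Hypothetical helper (not an OTP function).
%% Returns {MinT, MaxT} in seconds for a given TickTime, using
%% MinT = TickTime - TickTime / 4 and MaxT = TickTime + TickTime / 4.
%% The results are floats because '/' is floating-point division in Erlang.
bounds(TickTime) when is_integer(TickTime), TickTime > 0 ->
    Quarter = TickTime / 4,
    {TickTime - Quarter, TickTime + Quarter}.
```

For example, ticktime_window:bounds(60) gives {45.0, 75.0} and ticktime_window:bounds(600) gives {450.0, 750.0}, matching the figures above.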

The documentation does note that a terminating node is normally detected immediately, but I could not find an explanation (in the Erlang documentation, an Erlang e-book, or other Erlang sources) of how this immediate detection works in distributed Erlang. Are nodes in a distributed environment pinged periodically at intervals shorter than net_ticktime, or does the terminating node send some kind of message to the other nodes before it terminates? If it does send a message, are there scenarios in which a terminating node cannot send it and must be pinged to determine whether it is still alive?

The Erlang documentation also notes that distributed Erlang does not scale well to clusters larger than 100 nodes, because every node keeps links to all other nodes in the cluster. Is the algorithm for checking node liveness (pinging, announcing termination) modified as the cluster grows?

Solution

When two Erlang nodes connect, a TCP connection is made between them. The failure you are inducing (closing the shell) causes the underlying OS to close that connection, which notifies the other node almost immediately, without waiting for a tick.

The network tick is used to detect a connection to a distant node that appears to be up but is not actually passing traffic, such as may occur when a network event isolates a node.
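To see which of the two mechanisms reported a node as down, a process can subscribe to node status messages and ask for the reason. The following is a minimal sketch, assuming the nodedown_reason option of net_kernel:monitor_nodes/2; the module name node_watch and the function describe/1 are hypothetical names chosen for this example.

```erlang
-module(node_watch).
-export([start/0, describe/1]).

%% Spawn a process that prints every nodedown event with its reason.
start() ->
    spawn(fun() ->
        %% With an option list, messages arrive as {nodeup | nodedown, Node, InfoList}.
        ok = net_kernel:monitor_nodes(true, [nodedown_reason]),
        loop()
    end).

loop() ->
    receive
        {nodedown, Node, Info} ->
            Reason = proplists:get_value(nodedown_reason, Info),
            io:format("~p down: ~s~n", [Node, describe(Reason)]),
            loop();
        {nodeup, _Node, _Info} ->
            loop()
    end.

%% Hypothetical helper: turn a nodedown reason into readable text.
describe(connection_closed) -> "connection closed by the peer";
describe(net_tick_timeout)  -> "no traffic within the tick window";
describe(Other)             -> io_lib:format("other reason: ~p", [Other]).
```

Calling node_watch:start() on one node and then closing the shell of a connected node should produce a connection_closed report, while a silently dropped link would be expected to show net_tick_timeout only after the tick window has elapsed.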

If you want to simulate a failure that would require a tick to detect, use a firewall to block the traffic on the connection created when the nodes first ping.

Licensed under: CC-BY-SA with attribution
