How do you programmatically configure hazelcast for the multicast discovery mechanism?

https://stackoverflow.com/questions/20385973

29-08-2022

Question

Details:

The documentation only supplies an example for TCP/IP and is out-of-date: it uses Config.setPort(), which no longer exists.

My configuration looks like this, but discovery does not work (i.e. I get the output "Members: 1"):

Config cfg = new Config();                  
NetworkConfig network = cfg.getNetworkConfig();
network.setPort(PORT_NUMBER);

JoinConfig join = network.getJoin();
join.getTcpIpConfig().setEnabled(false);
join.getAwsConfig().setEnabled(false);
join.getMulticastConfig().setEnabled(true);

join.getMulticastConfig().setMulticastGroup(MULTICAST_ADDRESS);
join.getMulticastConfig().setMulticastPort(PORT_NUMBER);
join.getMulticastConfig().setMulticastTimeoutSeconds(200);

HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
System.out.println("Members: "+instance.getCluster().getMembers().size());

Update 1, taking asimarslan's answer into account

If I fiddle with the MulticastTimeout, I either get "Members: 1" or

Dec 05, 2013 8:50:42 PM com.hazelcast.nio.ReadHandler
WARNING: [192.168.0.9]:4446 [dev] hz._hzInstance_1_dev.IO.thread-in-0 Closing socket to endpoint Address[192.168.0.7]:4446, Cause:java.io.EOFException: Remote socket closed!
Dec 05, 2013 8:57:24 PM com.hazelcast.instance.Node
SEVERE: [192.168.0.9]:4446 [dev] Could not join cluster, shutting down!
com.hazelcast.core.HazelcastException: Failed to join in 300 seconds!

Update 2, taking pveentjer's answer about using tcp/ip into account

If I change the configuration to the following, I still only get 1 member:

Config cfg = new Config();                  
NetworkConfig network = cfg.getNetworkConfig();
network.setPort(PORT_NUMBER);

JoinConfig join = network.getJoin();

join.getMulticastConfig().setEnabled(false);
join.getTcpIpConfig().addMember("192.168.0.1").addMember("192.168.0.2").
addMember("192.168.0.3").addMember("192.168.0.4").
addMember("192.168.0.5").addMember("192.168.0.6").
addMember("192.168.0.7").addMember("192.168.0.8").
addMember("192.168.0.9").addMember("192.168.0.10").
addMember("192.168.0.11").setRequiredMember(null).setEnabled(true);

//this sets the allowed connections to the cluster? necessary for multicast, too?
network.getInterfaces().setEnabled(true).addInterface("192.168.0.*");

HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);
System.out.println("debug: joined via "+join+" with "+instance.getCluster()
.getMembers().size()+" members.");

More precisely, this run produces the output

debug: joined via JoinConfig{multicastConfig=MulticastConfig [enabled=false, multicastGroup=224.2.2.3, multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, trustedInterfaces=[]], tcpIpConfig=TcpIpConfig [enabled=true, connectionTimeoutSeconds=5, members=[192.168.0.1, 192.168.0.2, 192.168.0.3, 192.168.0.4, 192.168.0.5, 192.168.0.6, 192.168.0.7, 192.168.0.8, 192.168.0.9, 192.168.0.10, 192.168.0.11], requiredMember=null], awsConfig=AwsConfig{enabled=false, region='us-east-1', securityGroupName='null', tagKey='null', tagValue='null', hostHeader='ec2.amazonaws.com', connectionTimeoutSeconds=5}} with 1 members.

My non-hazelcast-implementation is using UDP multicasts and works fine. So can a firewall really be the problem?

Update 3, taking pveentjer's answer about checking the network into account

Since I do not have permissions for iptables or to install iperf, I am using com.hazelcast.examples.TestApp to check whether my network is working, as described in Getting Started With Hazelcast in Chapter 2, Section "Showing Off Straight Away":

I call java -cp hazelcast-3.1.2.jar com.hazelcast.examples.TestApp on 192.168.0.1 and get the output

...Dec 10, 2013 11:31:21 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Prefer IPv4 stack is true.
Dec 10, 2013 11:31:21 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Picked Address[192.168.0.1]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
Dec 10, 2013 11:31:22 PM com.hazelcast.system
INFO: [192.168.0.1]:5701 [dev] Hazelcast Community Edition 3.1.2 (20131120) starting at Address[192.168.0.1]:5701
Dec 10, 2013 11:31:22 PM com.hazelcast.system
INFO: [192.168.0.1]:5701 [dev] Copyright (C) 2008-2013 Hazelcast.com
Dec 10, 2013 11:31:22 PM com.hazelcast.instance.Node
INFO: [192.168.0.1]:5701 [dev] Creating MulticastJoiner
Dec 10, 2013 11:31:22 PM com.hazelcast.core.LifecycleService
INFO: [192.168.0.1]:5701 [dev] Address[192.168.0.1]:5701 is STARTING
Dec 10, 2013 11:31:24 PM com.hazelcast.cluster.MulticastJoiner
INFO: [192.168.0.1]:5701 [dev] 

Members [1] {
    Member [192.168.0.1]:5701 this
}

Dec 10, 2013 11:31:24 PM com.hazelcast.core.LifecycleService
INFO: [192.168.0.1]:5701 [dev] Address[192.168.0.1]:5701 is STARTED

I then call java -cp hazelcast-3.1.2.jar com.hazelcast.examples.TestApp on 192.168.0.2 and get the output

...Dec 10, 2013 9:50:22 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Prefer IPv4 stack is true.
Dec 10, 2013 9:50:22 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Picked Address[192.168.0.2]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
Dec 10, 2013 9:50:23 PM com.hazelcast.system
INFO: [192.168.0.2]:5701 [dev] Hazelcast Community Edition 3.1.2 (20131120) starting at Address[192.168.0.2]:5701
Dec 10, 2013 9:50:23 PM com.hazelcast.system
INFO: [192.168.0.2]:5701 [dev] Copyright (C) 2008-2013 Hazelcast.com
Dec 10, 2013 9:50:23 PM com.hazelcast.instance.Node
INFO: [192.168.0.2]:5701 [dev] Creating MulticastJoiner
Dec 10, 2013 9:50:23 PM com.hazelcast.core.LifecycleService
INFO: [192.168.0.2]:5701 [dev] Address[192.168.0.2]:5701 is STARTING
Dec 10, 2013 9:50:23 PM com.hazelcast.nio.SocketConnector
INFO: [192.168.0.2]:5701 [dev] Connecting to /192.168.0.1:5701, timeout: 0, bind-any: true
Dec 10, 2013 9:50:23 PM com.hazelcast.nio.TcpIpConnectionManager
INFO: [192.168.0.2]:5701 [dev] 38476 accepted socket connection from /192.168.0.1:5701
Dec 10, 2013 9:50:28 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.0.2]:5701 [dev] 

Members [2] {
    Member [192.168.0.1]:5701
    Member [192.168.0.2]:5701 this
}

Dec 10, 2013 9:50:30 PM com.hazelcast.core.LifecycleService
INFO: [192.168.0.2]:5701 [dev] Address[192.168.0.2]:5701 is STARTED

So multicast discovery is generally working on my cluster, right? Is 5701 also the port for discovery? Is 38476 in the last output an ID or a port?

Joining still does not work for my own code with programmatic configuration :(

Update 4, taking pveentjer's answer about using the default configuration into account

The modified TestApp gives the output

joinConfig{multicastConfig=MulticastConfig [enabled=true, multicastGroup=224.2.2.3, 
multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, 
trustedInterfaces=[]], tcpIpConfig=TcpIpConfig [enabled=false, 
connectionTimeoutSeconds=5, members=[], requiredMember=null], 
awsConfig=AwsConfig{enabled=false, region='us-east-1', securityGroupName='null', 
tagKey='null', tagValue='null', hostHeader='ec2.amazonaws.com', connectionTimeoutSeconds=5}}

and does detect other members after a couple of seconds (if all instances are started at the same time, each one first lists only itself as a member), whereas

myProgram gives the output

joined via JoinConfig{multicastConfig=MulticastConfig [enabled=true, multicastGroup=224.2.2.3, multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, trustedInterfaces=[]], tcpIpConfig=TcpIpConfig [enabled=false, connectionTimeoutSeconds=5, members=[], requiredMember=null], awsConfig=AwsConfig{enabled=false, region='us-east-1', securityGroupName='null', tagKey='null', tagValue='null', hostHeader='ec2.amazonaws.com', connectionTimeoutSeconds=5}} with 1 members.

and does not detect members within its runtime of about 1 minute (I am counting the members about every 5 seconds).

BUT if at least one instance of TestApp runs concurrently on the cluster, all TestApp instances and all myProgram instances are detected and my program works fine. In case I start TestApp once and then myProgram twice in parallel, TestApp gives the following output:

java -cp ~/CaseStudy/jtorx-1.10.0-beta8/lib/hazelcast-3.1.2.jar:. TestApp
Dec 12, 2013 12:02:15 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Prefer IPv4 stack is true.
Dec 12, 2013 12:02:15 PM com.hazelcast.instance.DefaultAddressPicker
INFO: Picked Address[192.168.180.240]:5701, using socket ServerSocket[addr=/0:0:0:0:0:0:0:0,localport=5701], bind any local is true
Dec 12, 2013 12:02:15 PM com.hazelcast.system
INFO: [192.168.180.240]:5701 [dev] Hazelcast Community Edition 3.1.2 (20131120) starting at Address[192.168.180.240]:5701
Dec 12, 2013 12:02:15 PM com.hazelcast.system
INFO: [192.168.180.240]:5701 [dev] Copyright (C) 2008-2013 Hazelcast.com
Dec 12, 2013 12:02:15 PM com.hazelcast.instance.Node
INFO: [192.168.180.240]:5701 [dev] Creating MulticastJoiner
Dec 12, 2013 12:02:15 PM com.hazelcast.core.LifecycleService
INFO: [192.168.180.240]:5701 [dev] Address[192.168.180.240]:5701 is STARTING
Dec 12, 2013 12:02:21 PM com.hazelcast.cluster.MulticastJoiner
INFO: [192.168.180.240]:5701 [dev] 


Members [1] {
    Member [192.168.180.240]:5701 this
}

Dec 12, 2013 12:02:22 PM com.hazelcast.core.LifecycleService
INFO: [192.168.180.240]:5701 [dev] Address[192.168.180.240]:5701 is STARTED
Dec 12, 2013 12:02:22 PM com.hazelcast.management.ManagementCenterService
INFO: [192.168.180.240]:5701 [dev] Hazelcast will connect to Management Center on address: http://localhost:8080/mancenter-3.1.2/
Join: JoinConfig{multicastConfig=MulticastConfig [enabled=true, multicastGroup=224.2.2.3, multicastPort=54327, multicastTimeToLive=32, multicastTimeoutSeconds=2, trustedInterfaces=[]], tcpIpConfig=TcpIpConfig [enabled=false, connectionTimeoutSeconds=5, members=[], requiredMember=null], awsConfig=AwsConfig{enabled=false, region='us-east-1', securityGroupName='null', tagKey='null', tagValue='null', hostHeader='ec2.amazonaws.com', connectionTimeoutSeconds=5}}
Dec 12, 2013 12:02:22 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] Initializing cluster partition table first arrangement...
hazelcast[default] > Dec 12, 2013 12:03:27 PM com.hazelcast.nio.SocketAcceptor
INFO: [192.168.180.240]:5701 [dev] Accepting socket connection from /192.168.0.8:38764
Dec 12, 2013 12:03:27 PM com.hazelcast.nio.TcpIpConnectionManager
INFO: [192.168.180.240]:5701 [dev] 5701 accepted socket connection from /192.168.0.8:38764
Dec 12, 2013 12:03:27 PM com.hazelcast.nio.SocketAcceptor
INFO: [192.168.180.240]:5701 [dev] Accepting socket connection from /192.168.0.7:54436
Dec 12, 2013 12:03:27 PM com.hazelcast.nio.TcpIpConnectionManager
INFO: [192.168.180.240]:5701 [dev] 5701 accepted socket connection from /192.168.0.7:54436
Dec 12, 2013 12:03:32 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] Re-partitioning cluster data... Migration queue size: 181
Dec 12, 2013 12:03:32 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.180.240]:5701 [dev] 

Members [3] {
    Member [192.168.180.240]:5701 this
    Member [192.168.0.8]:5701
    Member [192.168.0.7]:5701
}

Dec 12, 2013 12:03:43 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] Re-partitioning cluster data... Migration queue size: 181
Dec 12, 2013 12:03:45 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] All migration tasks has been completed, queues are empty.
Dec 12, 2013 12:03:46 PM com.hazelcast.nio.TcpIpConnection
INFO: [192.168.180.240]:5701 [dev] Connection [Address[192.168.0.8]:5701] lost. Reason: Socket explicitly closed
Dec 12, 2013 12:03:46 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.180.240]:5701 [dev] Removing Member [192.168.0.8]:5701
Dec 12, 2013 12:03:46 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.180.240]:5701 [dev] 

Members [2] {
    Member [192.168.180.240]:5701 this
    Member [192.168.0.7]:5701
}

Dec 12, 2013 12:03:48 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] Partition balance is ok, no need to re-partition cluster data... 
Dec 12, 2013 12:03:48 PM com.hazelcast.nio.TcpIpConnection
INFO: [192.168.180.240]:5701 [dev] Connection [Address[192.168.0.7]:5701] lost. Reason: Socket explicitly closed
Dec 12, 2013 12:03:48 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.180.240]:5701 [dev] Removing Member [192.168.0.7]:5701
Dec 12, 2013 12:03:48 PM com.hazelcast.cluster.ClusterService
INFO: [192.168.180.240]:5701 [dev] 

Members [1] {
    Member [192.168.180.240]:5701 this
}

Dec 12, 2013 12:03:48 PM com.hazelcast.partition.PartitionService
INFO: [192.168.180.240]:5701 [dev] Partition balance is ok, no need to re-partition cluster data...

The only difference I see in TestApp's configuration is

config.getManagementCenterConfig().setEnabled(true);
config.getManagementCenterConfig().setUrl("http://localhost:8080/mancenter-"+version);

for(int k=1;k<= LOAD_EXECUTORS_COUNT;k++){
    config.addExecutorConfig(new ExecutorConfig("e"+k).setPoolSize(k));
}

so, in a desperate attempt, I added it to myProgram, too. But it does not solve the problem: each instance still detects only itself as a member during the whole run.

Update about how long myProgram runs

Could it be that the program is not running long enough (as pveentjer put it)?

My experiments seem to confirm this: If the time t between Hazelcast.newHazelcastInstance(cfg); and initializing cleanUp() (i.e. no longer communicating via hazelcast and no longer checking the number of members) is

less than 30 seconds, no communication and members: 1
more than 30 seconds: all members are found and communication happens (which weirdly seems to be happening for much longer than t - 30 seconds).

Is 30 seconds a realistic time span that a Hazelcast cluster needs, or is there something strange going on? Here is a log from 4 instances of myProgram running concurrently (the windows in which instance 1 and instance 3 look for Hazelcast members overlap by about 30 seconds):

instance 1: 2013-12-19T12:39:16.553+0100 LOG 0 (START) engine started
looking for members between 2013-12-19T12:39:21.973+0100 and 2013-12-19T12:40:27.863+0100
2013-12-19T12:40:28.205+0100 LOG 35 (Torx-Explorer) Model  SymToSim is about to exit

instance 2: 2013-12-19T12:39:16.592+0100 LOG 0 (START) engine started
looking for members between 2013-12-19T12:39:22.192+0100 and 2013-12-19T12:39:28.429+0100
2013-12-19T12:39:28.711+0100 LOG 52 (Torx-Explorer) Model  SymToSim is about to exit

instance 3: 2013-12-19T12:39:16.593+0100 LOG 0 (START) engine started
looking for members between 2013-12-19T12:39:22.145+0100 and 2013-12-19T12:39:52.425+0100
2013-12-19T12:39:52.639+0100 LOG 54 (Torx-Explorer) Model  SymToSim is about to exit

instance 4: 2013-12-19T12:39:16.885+0100 LOG 0 (START) engine started
looking for members between 2013-12-19T12:39:21.478+0100 and 2013-12-19T12:39:35.980+0100
2013-12-19T12:39:36.024+0100 LOG 34 (Torx-Explorer) Model  SymToSim is about to exit
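
To make the 30-second threshold easier to check against this log, the lookup windows can be turned into durations. This is only a small sketch using the JDK's java.time classes; the class name LookupWindows and the method windowMillis are made up for illustration, and the timestamps are copied from the four "looking for members between ... and ..." lines.

```java
import java.time.Duration;
import java.time.OffsetDateTime;
import java.time.format.DateTimeFormatter;

/** Hypothetical helper: computes how long each instance spent looking for members. */
final class LookupWindows {

    // Matches the log's timestamp layout, e.g. 2013-12-19T12:39:21.973+0100
    private static final DateTimeFormatter LOG_TIME =
            DateTimeFormatter.ofPattern("yyyy-MM-dd'T'HH:mm:ss.SSSZ");

    /** Length of the window [from, to] in milliseconds. */
    static long windowMillis(String from, String to) {
        return Duration.between(
                OffsetDateTime.parse(from, LOG_TIME),
                OffsetDateTime.parse(to, LOG_TIME)).toMillis();
    }

    public static void main(String[] args) {
        String[][] windows = {
            {"2013-12-19T12:39:21.973+0100", "2013-12-19T12:40:27.863+0100"}, // instance 1
            {"2013-12-19T12:39:22.192+0100", "2013-12-19T12:39:28.429+0100"}, // instance 2
            {"2013-12-19T12:39:22.145+0100", "2013-12-19T12:39:52.425+0100"}, // instance 3
            {"2013-12-19T12:39:21.478+0100", "2013-12-19T12:39:35.980+0100"}, // instance 4
        };
        for (int i = 0; i < windows.length; i++) {
            long millis = windowMillis(windows[i][0], windows[i][1]);
            System.out.println("instance " + (i + 1) + ": " + millis + " ms");
        }
    }
}
```

For this log the windows come out at roughly 65.9 s, 6.2 s, 30.3 s and 14.5 s, so only instances 1 and 3 reach the 30-second mark.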

How do I best start my actual distributed algorithm only after enough members are present in the hazelcast cluster? Can I set hazelcast.initial.min.cluster.size programmatically? https://groups.google.com/forum/#!topic/hazelcast/sa-lmpEDa6A sounds like this would block Hazelcast.newHazelcastInstance(cfg); until the initial.min.cluster.size is reached. Correct? How synchronously (within which time span) will the different instances unblock?

Solution

The problem apparently is that the cluster starts (and stops) without waiting until enough members have joined. You can set the hazelcast.initial.min.cluster.size property to prevent this from happening.

You can set 'hazelcast.initial.min.cluster.size' programmatically using:

Config config = new Config(); 
config.setProperty("hazelcast.initial.min.cluster.size","3");
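
If you would rather wait in your own code (for example to add a timeout), you can also poll the member count before starting the distributed work. The following is a minimal sketch and not part of the Hazelcast API: ClusterSizeGate and awaitMinMembers are hypothetical names, and the IntSupplier stands in for a call such as instance.getCluster().getMembers().size() from the question. It uses only the JDK, so it runs without Hazelcast on the classpath.

```java
import java.util.function.IntSupplier;

/** Hypothetical helper: waits until a member-count source reports enough members. */
final class ClusterSizeGate {

    /**
     * Polls memberCount until it reports at least minMembers or the timeout expires.
     *
     * @param memberCount assumption: with Hazelcast this would wrap
     *                    instance.getCluster().getMembers().size()
     * @return true if enough members were seen in time, false on timeout or interrupt
     */
    static boolean awaitMinMembers(IntSupplier memberCount, int minMembers,
                                   long timeoutMillis, long pollMillis) {
        long deadline = System.nanoTime() + timeoutMillis * 1_000_000L;
        while (true) {
            if (memberCount.getAsInt() >= minMembers) {
                return true;
            }
            if (System.nanoTime() - deadline >= 0) {
                return false; // timed out before the cluster was large enough
            }
            try {
                Thread.sleep(pollMillis);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt(); // keep the interrupt visible to the caller
                return false;
            }
        }
    }

    public static void main(String[] args) {
        // Stand-in for a real cluster: pretends one more member joins on every poll.
        int[] simulated = {0};
        boolean ready = awaitMinMembers(() -> ++simulated[0], 3, 5_000, 10);
        System.out.println("cluster ready: " + ready);
    }
}
```

In the real program you would start the algorithm only when the call returns true, and treat false as "not enough members joined in time".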

OTHER TIPS

Your configuration is correct, but you have set a very long multicast timeout of 200 seconds, whereas the default is 2 seconds. Setting a smaller value will solve it.

From Hazelcast Java API Doc: MulticastConfig.html#setMulticastTimeoutSeconds(int)

Specifies the time in seconds that a node should wait for a valid multicast response from another node running in the network before declaring itself as master node and creating its own cluster. This applies only to the startup of nodes where no master has been assigned yet. If you specify a high value, e.g. 60 seconds, it means until a master is selected, each node is going to wait 60 seconds before continuing, so be careful with providing a high value. If the value is set too low, it might be that nodes are giving up too early and will create their own cluster.

It seems you are using TCP/IP clustering, so that is good. Try the following (from the Hazelcast book):

If you are making use of iptables, the following rule can be added to allow for outbound traffic from ports 33000-31000:

iptables -A OUTPUT -p TCP --dport 33000:31000 -m state --state NEW -j ACCEPT

and to control incoming traffic from any address to port 5701:

iptables -A INPUT -p tcp -d 0/0 -s 0/0 --dport 5701 -j ACCEPT

and to allow incoming multicast traffic:

iptables -A INPUT -m pkttype --pkt-type multicast -j ACCEPT

Connectivity test

If you are having trouble because machines won't join a cluster, you might check the network connectivity between the 2 machines. You can use a tool called iperf for that. On one machine you execute:

iperf -s -p 5701

This means that you are listening at port 5701.

At the other machine you execute the following command:

iperf -c 192.168.1.107 -d -p 5701

Where you replace '192.168.1.107' by the ip address of your first machine. If you run the command and you get output like this:

------------------------------------------------------------
Server listening on TCP port 5701
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
------------------------------------------------------------
Client connecting to 192.168.1.107, TCP port 5701
TCP window size: 59.4 KByte (default)
------------------------------------------------------------
[  5] local 192.168.1.105 port 40524 connected with 192.168.1.107 port 5701
[  4] local 192.168.1.105 port 5701 connected with 192.168.1.107 port 33641
[ ID] Interval       Transfer     Bandwidth
[  4]  0.0-10.2 sec  55.8 MBytes  45.7 Mbits/sec
[  5]  0.0-10.3 sec  6.25 MBytes  5.07 Mbits/sec

then you know the 2 machines can connect to each other. However, if you are seeing something like this:

Server listening on TCP port 5701
TCP window size: 85.3 KByte (default)
------------------------------------------------------------
connect failed: No route to host

Then you know that you might have a network connection problem on your hands.
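
If you cannot install iperf, a rough substitute for the TCP part of this check can be written with the JDK alone. This is a sketch under assumptions, not a replacement for iperf: TcpProbe and canConnect are hypothetical names, the default address 192.168.1.107 is just the example address used above, and something (for example a running Hazelcast member) must already be listening on the target port. It only tells you whether a TCP connection can be opened, not how much bandwidth is available.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

/** Hypothetical helper: checks whether a TCP connection to host:port can be opened. */
final class TcpProbe {

    /** Returns true if a connection was established within timeoutMillis. */
    static boolean canConnect(String host, int port, int timeoutMillis) {
        try (Socket socket = new Socket()) {
            socket.connect(new InetSocketAddress(host, port), timeoutMillis);
            return true;
        } catch (IOException e) {
            // Covers refused connections, timeouts, unknown hosts and "No route to host".
            return false;
        }
    }

    public static void main(String[] args) {
        // Defaults mirror the example above; pass host and port as arguments to override.
        String host = args.length > 0 ? args[0] : "192.168.1.107";
        int port = args.length > 1 ? Integer.parseInt(args[1]) : 5701;
        boolean ok = canConnect(host, port, 3_000);
        System.out.println(host + ":" + port + (ok ? " is reachable" : " is NOT reachable"));
    }
}
```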

It looks like Hazelcast uses multicast address 224.2.2.3 on UDP port 54327 (by default) for discovery, and then port 5701 for TCP communication. Opening UDP port 54327 in the firewall fixed discovery for me. (I had also opened TCP port 5701 but that was not sufficient.)
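
To check whether multicast datagrams for that group and port actually get from one machine to another, you could use something like the following JDK-only sketch. MulticastProbe and its methods are hypothetical names, the group, port and time-to-live are the defaults shown in the JoinConfig output in the question, and the probe is best run while no Hazelcast members are up, since it sends a plain text datagram to the same group that Hazelcast listens on.

```java
import java.io.IOException;
import java.net.DatagramPacket;
import java.net.InetAddress;
import java.net.InetSocketAddress;
import java.net.MulticastSocket;
import java.net.SocketTimeoutException;
import java.net.UnknownHostException;
import java.nio.charset.StandardCharsets;

/** Hypothetical probe: sends or receives one datagram on Hazelcast's default multicast group. */
final class MulticastProbe {
    static final String GROUP = "224.2.2.3"; // default multicastGroup from the question's output
    static final int PORT = 54327;           // default multicastPort from the question's output

    /** True if the given literal address is an IP multicast address. */
    static boolean isMulticastGroup(String address) {
        try {
            return InetAddress.getByName(address).isMulticastAddress();
        } catch (UnknownHostException e) {
            return false;
        }
    }

    /** Sends one text datagram to the group. */
    static void send(String message) throws IOException {
        try (MulticastSocket socket = new MulticastSocket()) {
            socket.setTimeToLive(32); // same value as multicastTimeToLive in the question's output
            byte[] data = message.getBytes(StandardCharsets.UTF_8);
            socket.send(new DatagramPacket(data, data.length, InetAddress.getByName(GROUP), PORT));
        }
    }

    /** Waits for one datagram on the group; returns its text, or null on timeout. */
    static String receive(int timeoutMillis) throws IOException {
        InetSocketAddress group = new InetSocketAddress(InetAddress.getByName(GROUP), PORT);
        try (MulticastSocket socket = new MulticastSocket(PORT)) {
            socket.joinGroup(group, null); // null: let the JDK pick the default interface
            socket.setSoTimeout(timeoutMillis);
            DatagramPacket packet = new DatagramPacket(new byte[512], 512);
            try {
                socket.receive(packet);
                return new String(packet.getData(), 0, packet.getLength(), StandardCharsets.UTF_8);
            } catch (SocketTimeoutException e) {
                return null;
            }
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length > 0 && args[0].equals("send")) {
            send("hello from probe");
            System.out.println("sent one datagram to " + GROUP + ":" + PORT);
        } else {
            String message = receive(30_000);
            System.out.println(message == null ? "nothing received within 30 s" : "received: " + message);
        }
    }
}
```

Start it without arguments on one machine and with the argument send on another. If the receiver times out although both machines are on the same network, a firewall rule for UDP port 54327 is a likely suspect.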

Can you try with tcp/ip cluster first to make sure that everything else is fine? Once you have confirmed that there is no problem, try multicast. It could also be a firewall issue btw.

So it appears that multicast is working on your network, which is good.

Could you try it with the following settings:

Config cfg = new Config();                  
NetworkConfig network = cfg.getNetworkConfig();

JoinConfig join = network.getJoin();
join.getTcpIpConfig().setEnabled(false);
join.getAwsConfig().setEnabled(false);
join.getMulticastConfig().setEnabled(true);

HazelcastInstance instance = Hazelcast.newHazelcastInstance(cfg);

As you can see, I removed all the customization.

Can you try to create your Hazelcast instance like this:

Config cfg = new Config();                  
HazelcastInstance hz = Hazelcast.newHazelcastInstance(cfg);

The managementcenter stuff and the creation of the executors are not relevant (I added that code in the testapp, so I'm 100% sure about that).

Then you should have exactly the same network configuration as the TestApp.

Licensed under: CC-BY-SA with attribution
