Infinispan/JGroups cluster connection failure when TCPPING.initialHosts contains multiple hosts

StackOverflow https://stackoverflow.com/questions/14321639

  •  15-01-2022

Question

I'm trying to configure Infinispan to use the TCP transport.

If I put the list of all potential cluster nodes in TCPPING.initialHosts, the cluster doesn't connect at all. There are about 15 potential nodes; most of them are dead, and usually only 2 or 3 are alive.

However, if I put a list of only 2-3 hosts in TCPPING.initialHosts, the cluster is created successfully.

What am I doing wrong?

Update: As far as I can see in the log and in the stack trace, connections between the live nodes are created and the nodes exchange some messages. However, the cluster is still not formed.

Update: Here's the code that creates the JChannel:

    JChannel ch = new JChannel(false);
    ProtocolStack stack = new ProtocolStack();
    ch.setProtocolStack(stack);

    // TCPPING is responsible for discovery
    TCPPING tcpping = new TCPPING();
    List<IpAddress> initial_hosts = ... // get lists of hosts, list can be quite big
    tcpping.setInitialHosts(initial_hosts);
    tcpping.setErgonomics(false);
    tcpping.setPortRange(0);
    tcpping.setNumInitialMembers(3);

    TCP tcp = new TCP();
    tcp.setBindAddress(InetAddress.getByName(server.getHostName()));
    tcp.setBindPort(server.getPort());
    tcp.setThreadPoolMaxThreads(30);
    tcp.setOOBThreadPoolMaxThreads(30);

    NAKACK nakack = new NAKACK();
    nakack.setUseMcastXmit(false);
    nakack.setDiscardDeliveredMsgs(false);

    MERGE2 merge = new MERGE2();

    RSVP rsvp = new RSVP();
    rsvp.setValue("timeout", 60 * 1000);
    rsvp.setValue("resend_interval", 500);
    rsvp.setValue("ack_on_delivery", false);

    stack
        .addProtocol(tcp)
        .addProtocol(tcpping)
        .addProtocol(merge)
        .addProtocol(new FD_SOCK())
        .addProtocol(new FD())
        .addProtocol(new VERIFY_SUSPECT())
        .addProtocol(nakack)
        .addProtocol(new UNICAST2())
        .addProtocol(new STABLE())
        .addProtocol(new GMS())
        .addProtocol(new UFC())
        .addProtocol(new MFC())
        .addProtocol(new FRAG2())
        .addProtocol(rsvp);
    stack.init();

    return ch;

Solution

Perhaps the discovery phase takes too long: JGroups tries to establish connections to all 15 hosts, and only 2-3 of them are alive. I suggest setting TCP.sock_conn_timeout to a low value (200?) so that a connection attempt to a host that is down returns after 200 ms at most. GMS.join_timeout may also need to be increased, and TCPPING.timeout as well. Both should be higher than the longest discovery phase.
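To make the "higher than the longest discovery phase" advice concrete, here is a rough back-of-the-envelope sketch in plain Java (no JGroups calls). It assumes the worst case, where connection attempts to the dead hosts happen one after another and each one blocks for the full socket connect timeout; whether JGroups really probes sequentially depends on the version and is not confirmed here. The class name, method names, and the 3000 ms candidate value are hypothetical and only for illustration.

```java
/** Rough timeout budget for TCPPING discovery when most initial hosts are down. */
final class DiscoveryTimeoutBudget {

    /**
     * Upper bound on the time spent waiting for unreachable hosts.
     * ASSUMPTION: attempts are sequential and each dead host costs one
     * full connect timeout (e.g. the value of TCP.sock_conn_timeout).
     */
    static long worstCaseDiscoveryMillis(int totalHosts, int aliveHosts, long sockConnTimeoutMillis) {
        if (aliveHosts < 0 || aliveHosts > totalHosts) {
            throw new IllegalArgumentException("aliveHosts must be between 0 and totalHosts");
        }
        if (sockConnTimeoutMillis < 0) {
            throw new IllegalArgumentException("sockConnTimeoutMillis must not be negative");
        }
        long deadHosts = totalHosts - aliveHosts;
        return deadHosts * sockConnTimeoutMillis;
    }

    /** True if a timeout (e.g. GMS.join_timeout or TCPPING.timeout) exceeds the worst case. */
    static boolean coversDiscovery(long timeoutMillis, long worstCaseMillis) {
        return timeoutMillis > worstCaseMillis;
    }

    public static void main(String[] args) {
        // Numbers from the question: about 15 hosts listed, about 3 alive,
        // and the 200 ms connect timeout suggested in the answer.
        long worstCase = worstCaseDiscoveryMillis(15, 3, 200);
        System.out.println("Worst-case discovery time: " + worstCase + " ms");

        // Hypothetical candidate value for GMS.join_timeout / TCPPING.timeout.
        long candidateTimeout = 3000;
        System.out.println("Candidate timeout of " + candidateTimeout + " ms is sufficient: "
                + coversDiscovery(candidateTimeout, worstCase));
    }
}
```

With 12 dead hosts and a 200 ms connect timeout, the bound works out to 12 × 200 = 2400 ms, so under this assumption the join and discovery timeouts would need to be somewhat above that. The same arithmetic shows why a connect timeout of a few seconds per dead host can push discovery well past a short join timeout.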

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow