Question

I use a compiled version of ATS 4.1.2 on Debian Wheezy for distributed caching. Both nodes that I am trying to cluster reside on the same VLAN and have the same proxy.config.proxy_name value. However, most of the time ATS is not able to discover the other node, and manually adding that node's IP to cluster.config (which is auto-populated and not supposed to be edited by hand) produces:

    root@fe4:/opt/trafficserver/etc# grep -i illegal /opt/trafficserver/var/log/trafficserver/*
    /opt/trafficserver/var/log/trafficserver/diags.log:[Feb 21 18:00:37.714] Server {0x2b99c1e29700} NOTE: Illegal cluster connection from 10.65.130.31
    /opt/trafficserver/var/log/trafficserver/diags.log:[Feb 21 18:35:59.686] Server {0x2b99c1e29700} NOTE: Illegal cluster connection from 10.65.130.31

However, the second host, 10.65.130.31, is able to cluster with a different server using the same proxy.config.proxy_name. So it is unclear which nodes will actually be able to join the cluster.

Any help is greatly appreciated.


Solution

After hours of troubleshooting, I found that this was caused by a flap on the bond interface. At some point the active slave of the bond on one server switched to eth1, which was connected to a different physical switch, whereas on the other server it remained eth0. As a result the two boxes ended up on two different physical switches, even though they are in the same VLAN, IP range, subnet, and broadcast domain. I identified this by analyzing a tcpdump capture on the bond interface, which showed no broadcast or multicast traffic to the problem node at all. The bond interface status on the two servers looked like this:

*server1:*
    root@cdn-fe4:# cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth0
====snip====
*server2:*
    root@fe7:/opt/trafficserver/etc# cat /proc/net/bonding/bond0
    Ethernet Channel Bonding Driver: v3.7.1 (April 27, 2011)

    Bonding Mode: fault-tolerance (active-backup)
    Primary Slave: None
    Currently Active Slave: eth1
====snip====
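
The comparison above can be scripted rather than read by eye. The following is a minimal sketch, not part of the original troubleshooting: `active_slave` is a hypothetical helper name, and it assumes the status file has the same `Currently Active Slave:` line shown in the outputs above.

```shell
#!/bin/sh
# Sketch: report which slave is currently active on a bonding interface.
# active_slave is a hypothetical helper, not an ATS or kernel tool.

active_slave() {
    # $1 = path to a bonding status file; defaults to bond0 as in the output above.
    status_file=${1:-/proc/net/bonding/bond0}
    # The line looks like "Currently Active Slave: eth0";
    # split on ": " and print the interface name, then stop.
    awk -F': ' '/^[[:space:]]*Currently Active Slave:/ { print $2; exit }' "$status_file"
}
```

Running `active_slave` on each node and comparing the results shows whether the bonds have drifted onto different slaves. Matching names do not by themselves prove that both nodes are on the same physical switch; that still depends on how each NIC is cabled.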

I tested again after breaking the bond and manually configuring ATS to cluster via eth0, and this time it worked.
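
The answer does not say which setting was changed. One plausible way to pin clustering to eth0 is the `proxy.config.cluster.ethernet_interface` entry in records.config, and the sketch below assumes that is the setting involved. `set_cluster_iface` is a hypothetical helper, and the path to records.config has to be supplied by the caller.

```shell
#!/bin/sh
# Sketch: point ATS clustering at one interface by editing records.config.
# Assumption: the relevant setting is proxy.config.cluster.ethernet_interface.
# set_cluster_iface is a hypothetical helper, not an ATS command.

set_cluster_iface() {
    # $1 = path to records.config, $2 = interface name (for example eth0)
    conf=$1
    iface=$2
    key='proxy.config.cluster.ethernet_interface'
    if grep -q "^CONFIG $key " "$conf"; then
        # Setting already present: replace its value in place, keeping a .bak copy.
        sed -i.bak "s|^CONFIG $key .*|CONFIG $key STRING $iface|" "$conf"
    else
        # Setting not present yet: append it.
        printf 'CONFIG %s STRING %s\n' "$key" "$iface" >> "$conf"
    fi
}
```

For example, `set_cluster_iface /path/to/records.config eth0`, followed by a restart of Traffic Server so the cluster code picks up the new interface. It is safest to try this on a copy of the file first.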

Licensed under: CC-BY-SA with attribution