Question

I am using ldirectord to load balance two IIS servers. The ldirectord.cf looks like this:

autoreload = yes
quiescent = yes
checkinterval = 1
negotiatetimeout = 2
emailalertfreq = 60
emailalert = Konstantin.Boyanov@mysite.com
failurecount = 1

virtual = 172.22.9.100:80
    checktimeout = 1
    checktype = negotiate
    protocol = tcp
    real = 172.22.1.133:80 masq 2048
    real = 172.22.1.134:80 masq 2048
    request = "alive.htm"
    receive = "I am not a zombie"
    scheduler = wrr

The load balancing is working fine, the real servers are visible, etc. Nevertheless, I am encountering a problem with a simple test:

  1. I open some connections from a client browser (IE 8) to the sites that are hosted on the real servers
  2. I change the weight of the real server which serves the above connections to 0 and leave only the other real server active (see the example ipvsadm command after this list)
  3. I reload the pages to regenerate the connections
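
For illustration, one way to perform step 2 by hand on the director would be the following; this is only a sketch with the addresses taken from the configuration above, and ldirectord may of course adjust the weight again when its next health check runs:

# ipvsadm -e -t 172.22.9.100:80 -r 172.22.1.133:80 -m -w 0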

What I am seeing with ipvsadm -Ln is that the connections are still on the "dead" server. I have to wait up to one minute (I suppose some TCP timeout on the browser side) for them to transfer to the "living" server. And if during this minute I keep pressing the reload button, the connections stay on the "dead" server and their TCP timeout counter gets restarted.
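
(As an aside, the individual connection entries, including their states and expiry timers, can be listed with the standard connection-table option of ipvsadm:

# ipvsadm -Lcn

Each line shows the protocol, the remaining expiry time, the connection state such as ESTABLISHED or TIME_WAIT, and the client, virtual and destination addresses.)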

So my question is: Is there a way to tell the load balancer in NAT mode to terminate / redirect existing connections to a dead server immediately (or close to immediately)?

It seems to me a blunder that a reload on the client side can make a connection become a "zombie", i.e. stay bound to a dead real server, although persistence is not used and the other server is ready and available.

The only thing I found that affects this timeout is the KeepAliveTimeout setting on the Windows machine running the IE 8 which I use for the tests. When I changed it from the default value of 60 seconds to 30 seconds, the connections could be transferred after 30 seconds. It seems very odd to me that a client setting can affect the operation of a network component such as the load balancer.
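
For reference, this is the usual Internet Explorer keep-alive registry setting; assuming the standard location and that the value is given in milliseconds, the change can be made like this (a sketch, not a copy of my actual test machine):

reg add "HKCU\Software\Microsoft\Windows\CurrentVersion\Internet Settings" /v KeepAliveTimeout /t REG_DWORD /d 30000 /f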

And another thing: what is the column named "InActConn" in the output of ipvsadm used for? Which connections are considered inactive?

Also, in the output of ipvsadm I see a couple of connections in the state TIME_WAIT. What are these for?

Any insight and suggestions are highly appreciated!

Cheers, Konstantin

P.S: Here is some more information about the configuration:

# uname -a
Linux 3.0.58-0.6.2-default #1 SMP Fri Jan 25 08:31:01 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lb-mysite.com wrr
  -> spwfe001.mysite.com:h Masq    10     0          0
  -> spwfe002.mysite.com:h Masq    10     0          0

# iptables -t nat -L
Chain PREROUTING (policy ACCEPT)
target     prot opt source               destination

Chain INPUT (policy ACCEPT)
target     prot opt source               destination

Chain OUTPUT (policy ACCEPT)
target     prot opt source               destination

Chain POSTROUTING (policy ACCEPT)
target     prot opt source               destination
SNAT       all  --  anywhere             anywhere            to:172.22.9.100
SNAT       all  --  anywhere             anywhere            to:172.22.1.130


# ip a
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 16436 qdisc noqueue state UNKNOWN
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
    inet 127.0.0.1/8 brd 127.255.255.255 scope host lo
    inet 127.0.0.2/8 brd 127.255.255.255 scope host secondary lo
2: eth0: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN         qlen 1000
    link/ether 00:50:56:a5:77:ae brd ff:ff:ff:ff:ff:ff
    inet 192.168.8.216/22 brd 192.168.11.255 scope global eth0
3: eth1: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN         qlen 1000
    link/ether 00:50:56:a5:77:af brd ff:ff:ff:ff:ff:ff
    inet 172.22.9.100/22 brd 172.22.11.255 scope global eth1:1
    inet 172.22.8.213/22 brd 172.22.11.255 scope global secondary eth1
4: eth2: <BROADCAST,MULTICAST,UP,LOWER_UP> mtu 1500 qdisc pfifo_fast state UNKNOWN qlen 1000
    link/ether 00:50:56:a5:77:b0 brd ff:ff:ff:ff:ff:ff
    inet 172.22.1.130/24 brd 172.22.1.255 scope global eth2


# cat /proc/sys/net/ipv4/ip_forward
1
# cat /proc/sys/net/ipv4/vs/conntrack
1
# cat /proc/sys/net/ipv4/vs/expire_nodest_conn
1
# cat /proc/sys/net/ipv4/vs/expire_quiescent_template
1

Solution

First off - you can't test by changing the weight to 0. You have to delete the entry from the IPVS table completely to simulate a failed server.
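
As an illustration only (addresses taken from the configuration in the question), removing a real server from the IPVS table by hand would look like this:

# ipvsadm -d -t 172.22.9.100:80 -r 172.22.1.133:80

You normally do not run this yourself when ldirectord manages the table - it is effectively what ldirectord does for a failed real server once quiescent is set to no.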

You have told ldirectord to keep dead servers in the table with quiescent = yes. You need to change that to quiescent = no, which will rip the entry out of the table instead of just setting its weight to 0.
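
In the ldirectord.cf from the question that is a one-line change in the global section:

quiescent = no

Since autoreload = yes is already set, ldirectord should pick the change up without a restart.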

It looks like you do have the following values set correctly:

expire_nodest_conn - BOOLEAN
expire_quiescent_template - BOOLEAN

Explanation here: https://www.kernel.org/doc/Documentation/networking/ipvs-sysctl.txt
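
For completeness, both values can also be set at runtime via sysctl (the question already shows them enabled through /proc):

# sysctl -w net.ipv4.vs.expire_nodest_conn=1
# sysctl -w net.ipv4.vs.expire_quiescent_template=1

With expire_nodest_conn enabled, a connection whose destination has been removed from the table is expired as soon as the next packet for it arrives, instead of packets being silently dropped until the connection times out.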

Licensed under: CC-BY-SA with attribution
Not affiliated with StackOverflow