4. LVS: Ipvsadm and Schedulers

ipvsadm is the user-space (command line) interface to LVS. The scheduler is the part of the ipvs kernel code which decides which realserver will get the next new connection.

There are patches available for ipvsadm.

4.1. Using ipvsadm

You use ipvsadm from the command line (or in rc files) to set up:

  • services/servers that the director directs (e.g. http goes to all realservers, while ftp goes only to one of the realservers).
  • weighting given to each realserver - useful if some servers are faster than others.

    Horms 30 Nov 2004

    The weights are integers, but sometimes they are assigned to an atomic_t, so they can only be 24 bits, i.e. values from 0 to (2^24 - 1) should work.

  • scheduling algorithm

You can also use ipvsadm to do the following (a combined sketch follows the list):

  • add services: add a service with weight >0
  • shutdown (or quiesce) services: set the weight to 0.

    This allows current connections to continue, until they disconnect or expire, but will not allow new connections. When there are no connections remaining, you can bring down the service/realserver.

  • delete services: this stops traffic for the service (the connection will hang), but the entry in the connection table is not deleted till it times out. This allows deletion, followed shortly thereafter by adding back the service, to not affect established (but quiescent) connections.
  • Once you have a working LVS, save the ipvsadm settings with ipvsadm-save

    ipvsadm-save > ipvsadm.sav
    

    and then after a reboot, restore the ipvsadm settings with ipvsadm-restore

    ipvsadm-restore < ipvsadm.sav
    

    Both of these commands can be part of an ipvsadm init script.

  • list version of ip_vs (here 0.9.4, with a hash table size of 4096)
    director:/etc/lvs# ipvsadm
    IP Virtual Server version 0.9.4 (size=4096)
    
  • list version of ipvsadm (here 1.20)
    director:/etc/lvs# ipvsadm --version
    ipvsadm v1.20 2001/09/18 (compiled with popt and IPVS v0.9.4)
    
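Putting the operations above together, here's a minimal sketch of building a service by hand (the VIP 192.168.1.110 and RIP 192.168.1.11 are invented addresses; -g selects LVS-DR forwarding, use -m for LVS-NAT):

ipvsadm -A -t 192.168.1.110:80 -s rr                        #add the virtual service
ipvsadm -a -t 192.168.1.110:80 -r 192.168.1.11:80 -g -w 1   #add a realserver, weight 1
ipvsadm -e -t 192.168.1.110:80 -r 192.168.1.11:80 -g -w 0   #quiesce the realserver
ipvsadm -d -t 192.168.1.110:80 -r 192.168.1.11:80           #delete the realserver
ipvsadm -L -n                                               #list the current table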

4.2. Memory Requirements

On the director, the entries for each connection are stored in a hash table (the number of buckets is set when compiling ipvs into the kernel). Each entry takes 128 bytes. Even large numbers of connections will use only a small amount of memory.

We would like to use LVS in a system with 700Mbit/s of traffic flowing through it. The number of concurrent connections is about 420,000. Our main purpose for using LVS is to direct port 80 requests to a number of squid servers (~80 servers). I have read the performance documents and I wonder whether I can handle this much traffic with 2 x 3.2GHz Xeon and 4GB of RAM.

Ratz 22 Nov 2006

If you use LVS-DR and your squid caches have a moderate hit rate, the amount of RAM you'll need to load balance 420'000 connections is:

420000 x 128 x [RTTmin up to RTTmin+maxIdleTime] [bytes]

This means that with 4GB and the standard 3/1GB split of the 2.6 kernel (your Xeon CPU is 32bit only, with 64bit EM64T), taking the 3GB as 3000000000 bytes, you will be able to serve half a million parallel connections, each connection lasting at most 3000000000/(500000*128) [secs] = 46.875 secs.
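A quick back-of-the-envelope check of these numbers (a sketch only; the 420,000 connections come from the question above and the 128 bytes/entry from the previous section):

CONNS=420000
echo "$((CONNS * 128)) bytes"    #~54MB for the connection entries themselves

so the connection table itself is a small fraction of the 3GB; the rest of the arithmetic above is about how long each entry can be allowed to live.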

4.3. sysctl documentation

The sysctls for ipvs will be documented in Documentation/networking/ipvs-sysctl.txt for 2.6.18 (hopefully). It is derived from http://www.linuxvirtualserver.org/docs/sysctl.html v1.4.

Graeme Fowler graeme (at) graemef (dot) net 08 Mar 2007

A couple of times recently people have posted to the keepalived list or the LVS list about different issues which were resolved by toggling sysctls (most recently expire_quiescent_template - see Section 28.6). This got me thinking: these sysctls are pretty important, and not everyone knows what to do with them (or how to change them) since the recommended ways to modify them can vary between distributions. So, why not give ipvsadm the capability to modify appropriate sysctls found in /proc/sys/net/ipv4/vs/? The more I thought about it, the more I considered that the easiest way to do so would be to use a "generic" option along the lines of the e2fsprogs style "-O option,option,option=value" with "^option" as a negation for booleans. So you'd be able to say:

ipvsadm -O expire_quiescent_template,expire_nodest_conn
ipvsadm -O expire_nodest_conn
ipvsadm -O drop_packet=1,drop_entry=1,expire_nodest_conn

By making the option handler "generic" like this, as other sysctls arrive as the kernel develops they can simply be toggled or changed as necessary; in all cases, where no corresponding sysctl exists, an error is thrown to that effect. In my mind it makes ipvsadm more of a "one stop shop" for the various settings - not only will it manage the virtual and real servers, but more of the underlying infrastructure too.

Ratz 08 Mar 2007

I tend to agree, however people who want to set up LVS do need to know Linux to the level of also understanding sysctl variables and their meaning. I've always hoped that with having them in the ipvsadm man page, the problem would be solved. I know of only one application that modifies sysctls, and that is the broken pluto of Free/OpenSwan :).

You still have to know what the options mean, correct? I favour a different approach: make LVS really user friendly, in that you provide users with a tool that takes away the low level configuration, just like in linux-ha or commercial load balancers. It's not so difficult to write this, it's just that someone has to sit down and do it.

You still need to absolutely know the semantics of these settings. So what is the real gain between

   ipvsadm -O expire_quiescent_template=1

and

   echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template
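(For reference, a sketch of how these ipvs sysctls are normally toggled without any new ipvsadm option; expire_quiescent_template is used here only as an example:)

sysctl -w net.ipv4.vs.expire_quiescent_template=1           #set at runtime
echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template    #equivalent
echo "net.ipv4.vs.expire_quiescent_template = 1" >> /etc/sysctl.conf   #survive reboots (distribution permitting)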

Horms

Of late I've been thinking of the idea of enabling LVS to be configured via netlink rather than the current /proc + get/setsockopt fun. I think this was Ratz's idea, but it seems like a good one to me, as it should allow a lot more flexibility in the user-space to kernel communication, which has always been a problem from the point of view of backwards compatibility. So I have kind of been thinking of ipvsadm2 or ipvsadm-nl.

Ratz

I've already started once with the conversion of IPVS to netlink and I've named the new ipvsadm ipvsctl :). I've attached my work (actually the part I could find right now ... I know on some of my dozens of crashed Laptop harddisks there should be more) in this email, so it doesn't get lost and you don't have to duplicate it. This would also allow us to easily implement the missing features and enable us to move more towards netfilter-friendliness.

4.4. Compile a version of ipvsadm that matches your ipvs

Compile and install ipvsadm on the director using the supplied Makefile. You can optionally compile ipvsadm with popt libraries, which allows ipvsadm to handle more complicated arguments on the command line. If your libpopt.a is too old, your ipvsadm will segv. (I'm using the dynamic libpopt and don't have this problem).

Since you compile ipvs and ipvsadm independently, and you cannot compile ipvsadm until you have patched the kernel headers, a common mistake is to compile the kernel and reboot, forgetting to compile/install ipvsadm.

Unfortunately there is only rudimentary version detection code in ipvs/ipvsadm. If you have a mismatched ipvs/ipvsadm pair, many times there won't be problems, as any particular version of ipvsadm will work with a wide range of patched kernels. Usually with 2.2.x kernels, if the ipvs/ipvsadm versions mismatch, you'll get weird, non-obvious errors about not being able to install your LVS. Other possibilities are that the output of ipvsadm -L will have IPs that are clearly not IPs (or not the IPs you put in) and ports that are all wrong. It will look something like this

[root@infra /root]# ipvsadm
IP Virtual Server version 1.0.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  C0A864D8:0050 rr
  -> 01000000:0000      Masq    0      0          0

rather than

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:ssh rr
  -> RS2.mack.net:ssh             Route   1      0          0

There was a change in the /proc file system for ipvs around 2.2.14 which caused problems for anyone with a mismatched ipvsadm/ipvs. The ipvsadm from one kernel series (2.2/2.4) does not recognise the ipvs kernel patches from the other series (the kernel appears not to be patched for ipvs).

The later 2.2.x ipvsadms know the minimum version of ipvs that they'll run on, and will complain about a mismatch. They don't know the maximum version (which will presumably be produced some time in the future) that they will run on. This protects you against the unlikely event of installing a new 2.2.x version of ipvsadm on an older version of ipvs, but will not protect you against the more likely scenario where you forget to compile ipvsadm after building your kernel. The ipvsadm maintainers are aware of the problem. Fixing it will break the current code and they're waiting for the next code revision which breaks backward compatibility.

If you didn't even apply the kernel patches for ipvs, then ipvsadm will complain about missing modules and exit (i.e. you can't even do `ipvsadm -h`).

4.4.1. Other compile problems

Ty Beede tybeede (at) metrolist (dot) net

On a slackware 4.0 machine I went to compile ipvsadm and it gave me an error indicating that the iphdr type was undefined when it saw the reference in ip_fw.h. I added

#include <linux/ip.h>

in ipvsadm.c, which is where the iphdr structure is needed, and everything went ok.

Doug Bagley doug (at) deja (dot) com

The reason that it fails "out of the box" is that fwp_iph's type definition (struct iphdr) was #ifdef'd out in <linux/ip_fw.h> (and not included anywhere else), since the symbol __KERNEL__ was undefined. Including <linux/ip.h> before <linux/ip_fw.h> in the .c file did the trick.

4.5. put realservers in /etc/hosts

(from a note by Horms 26 Jul 2002)

ipvsadm by default outputs the names of the realservers rather than their IPs. The director then needs name resolution. If you don't have it, ipvsadm will take a long time (up to a minute) to return, as it waits for name resolution to time out. The only IPs that the director needs to resolve are those of the realservers. DNS is slow. To prevent the director from needing DNS, put the names of the realservers in /etc/hosts. This lookup is quicker than DNS and you won't need to open a route from the director to a nameserver.
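A sketch of the sort of /etc/hosts entries meant here, using the example hostnames from this HOWTO (the addresses are invented):

# /etc/hosts on the director
192.168.1.11    RS1.mack.net  RS1
192.168.1.12    RS2.mack.net  RS2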

Or you could use `ipvsadm -n` which outputs the IPs of the realservers instead.

4.6. RR and LC schedulers

On receiving a connect request from a client, the director assigns a realserver to the client based on a "schedule". The scheduler type is set with ipvsadm. The schedulers available are

  • round robin (rr), weighted round robin (wrr) - new connections are assigned to each realserver in turn
  • least connection (lc), weighted least connection (wlc) - new connections go to the realserver with the least number of connections. This is not necessarily the least busy realserver, but it is a step in that direction.

    Note
    Doug Bagley doug (at) deja (dot) com points out that *lc schedulers will not work properly if a particular realserver is used in two different LVSs.

    Willy Tarreau (in http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing) says that *lc is suited for very long sessions, but not for webservers where the load varies on a short time scale.

  • persistent connection
  • LBLC: a persistent memory algorithm (locality-based least connection)
  • DH: destination hash
  • SH: source hash

    Again from Willy: this is used when you want a client to always appear on the same realserver (e.g. a shopping cart, or database). The SH scheduler has not been much used in LVS, possibly because no-one knew the syntax for a long time and couldn't get it to work. Most shopping cart type servers are using persistence, which has many undesirable side effects.

The original schedulers are rr, and lc (and their weighted versions). Any of these will do for a test setup. In particular, round robin will cycle connections to each realserver in turn, allowing you to check that all realservers are functioning in the LVS. The rr,wrr,lc,wlc schedulers should all work similarly when the director is directing identical realservers with identical services. The lc scheduler will better handle situations where machines are brought down and up again (see thundering herd problem). If the realservers are offering different services and some have clients connected for a long time while others are connected for a short time, or some are compute bound, while others are network bound, then none of the schedulers will do a good job of distributing the load between the realservers. LVS doesn't have any load monitoring of the realservers. Figuring out a way of doing this that will work for a range of different types of services isn't simple (see load and failure monitoring).
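The scheduler is selected per virtual service when the service is added (or edited) with ipvsadm; a sketch, with an invented VIP:

ipvsadm -A -t 192.168.1.110:80 -s rr    #round robin, fine for a test setup
ipvsadm -E -t 192.168.1.110:80 -s wlc   #switch the existing service to weighted least connection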

Note

Ratz Nov 2006

After almost 10 years of my involvement with load balancers, I have to admit that no customer _ever_ truly asked or cared about the scheduling algorithm :). This is academia for the rest of the world.

4.7. Netmask for VIP

You set up the RIPs, DIP and other networking with whatever netmask you choose. For the VIP

  • For LVS-DR, LVS-Tun: the netmask for the VIP on the director and realservers must be /32.
  • For LVS-NAT: the netmask can be /32 or the netmask of the RIPs/DIP.

You will need to setup the routing for the VIP to match the netmask. For more details, see the chapters for each forwarding method.

Horms 12 Aug 2004

The real story is that the netmask works a little differently on lo than on other interfaces. On lo the interface will answer to _all_ addresses covered by the netmask. This is how 127.0.0.1/8 on lo ends up answering 127.0.0.0-127.255.255.255. So if you add 172.16.4.222/16 to eth0 then it will answer 172.16.4.222 and only 172.16.4.222. But if you add the same thing to lo then it will answer 172.16.0.0-172.16.255.255. So you need to use 172.16.4.222/32 instead.

To clarify -

ifconfig eth0:0 192.168.10.10 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Add 192.168.10.10 to eth0

ifconfig lo:0 192.168.10.10 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Add 192.168.10.0 - 192.168.10.255 to lo

ifconfig lo:0 192.168.10.0 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Same as above, add 192.168.10.0 - 192.168.10.255 to lo

ifconfig lo:0 192.168.10.10 netmask 255.255.255.255 broadcast 192.168.10.10 up
   -> Add 192.168.10.10 to lo
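If you prefer iproute2 to ifconfig, the usual way to get the single-address behaviour recommended above is to add the VIP as a /32 (a sketch; 192.168.10.10 is the invented VIP from the examples):

ip addr add 192.168.10.10/32 dev lo     #realserver (LVS-DR): only the VIP itself answers
ip addr show dev lo                     #verify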

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/21

On all platforms apart from windows you want 255.255.255.255 for the loopback. On windows you can get away with 255.255.255.0 IF you use a priority 254 80% of the time. 255.255.255.255 can be used if you mod the registry... But we've found that 255.0.0.0 will work better 99% of the time because windows by default uses the smallest subnet first for routing and a class A will never be used instead of a class C.

4.8. LBLC, DH schedulers

The LBLC code (by Wensong) and the DH scheduler (by Wensong, inspired by code submitted by Thomas Proell proellt (at) gmx (dot) de) are designed for web caching realservers (e.g. squids). For normal LVS services (eg ftp, http), the content offered by each realserver is the same and it doesn't matter which realserver the client is connected to. For a web cache, after the first fetch has been made, the web caches have different content. As more pages are fetched, the contents of the web caches will diverge. Since the web caches will be setup as peers, they can communicate by ICP (internet caching protocol) to find the cache(s) with the required page. This is faster than fetching the page from the original webserver. However, it would be better after the first fetch of a page from http://www.foo.com/*, for all subsequent clients wanting a page from http://www.foo.com/ to be connected to that realserver.

The original method for handling this was to make connections to the realservers persistent, so that all fetches from a client went to the same realserver.

The -dh (destination hash) algorithm makes a hash from the target IP, and all requests to that IP will be sent to the same realserver. This means that content from a URL will not be retrieved multiple times from the remote server. The realservers (e.g. squids in this case) will each be retrieving content from different URLs.

Wensong Zhang wensong (at) gnuchina (dot) org 16 Feb 2001

Please see "man ipvsadm" for short description of DH and SH schedulers. I think some examples to use those two schedulers.

Example: cache cluster shared by several load balancers.

                Internet
                    |
                    |------ cache array
                    |
                    |-----------------------------
                       |                |
                       DH               DH
                       |                |
                     Access            Access
                     Network1          Network2

The DH scheduler lets the two load balancers redirect requests destined for the same IP address to the same cache server. If a cache server is dead or overloaded, the load balancer can use the cache_bypass feature to send requests to the origin server directly. (Make sure that the cache servers are added to the two load balancers in the same order.)
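A rough sketch of the ipvsadm side of such a setup (a transparent cache cluster also needs interception/marking rules and squid configuration appropriate to your network; the fwmark value and cache addresses are invented):

#run the same commands, in the same order, on both load balancers
iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1  #mark web traffic passing through
ipvsadm -A -f 1 -s dh                     #balance the marked traffic by destination hash
ipvsadm -a -f 1 -r 192.168.2.1 -g -w 1    #cache 1
ipvsadm -a -f 1 -r 192.168.2.2 -g -w 1    #cache 2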

Diego Woitasen 12 Aug 2003

The scheduling algorithms that use the dest IP for selecting the realserver (like DH, LBLC, LBLCR) are only applicable to transparent proxying, this being the only application where the dest IP can vary.

Wensong Zhang wensong (at) linux-vs (dot) org 12 Aug 2003

Yes, you are almost right. LBLC and LBLCR are written for transparent proxy clusters only. DH can be used for transparent proxy clusters and in other clusters needing a static mapping.

Note
Here follows a set of exchanges between a Chinese poster and Wensong, in English, that I didn't follow at all. Apparently it was clear to Wensong.

If lblc uses dh, then is lblc = dh + lc?

Wensong Zhang 09 Mar 2004

Maybe lblc = dh + wlc.

/*
 * The lblc algorithm is as follows (pseudo code):
 *
 *       if cachenode[dest_ip] is null then
 *               n, cachenode[dest_ip] <- {weighted least-conn node};
 *       else
 *               n <- cachenode[dest_ip];
 *               if (n is dead) OR
 *                  (n.conns>n.weight AND
 *                   there is a node m with m.conns<m.weight/2) then
 *                 n, cachenode[dest_ip] <- {weighted least-conn node};
 *
 *       return n;
 *
 */

The difference between lblc and lblcr is that cachenode[dest_ip] in lblc is a server, and cachenode[dest_ip] in lblcr is a server set.

In lblc, when the server is overloaded, LVS uses wlc and allocates a server at less than half load, i.e. it allocates the weighted least-connection server to the IP address. Does this mean that after this reallocation for the IP address, it will not return to the previous server?

No, it will not in most cases. The only possibility is that the current map expires after not being used for six minutes, and the previous server happens to be the one with the least connections when the next access to that IP address comes.

4.8.1. scheduling squids

The usual problem with squids behind a scheduler that isn't cache friendly is that fetches are slow. In this case the website sees hits coming from several different RIPs. Some websites detect this and won't even serve you the pages.

Palmer J.D.F. J (dot) D (dot) F (dot) Palmer (at) Swansea (dot) ac (dot) uk 18 Mar 2002

I tried https and online banking sites (e.g. www.hsbc.co.uk). It seems that this site and undoubtedly many other secure sites don't like to see connections split across several IP addresses as happens with my cluster. Different parts of the pages are requested by different realservers, and hence different IP addresses.

It gives an error saying... "...For your security, we have disconnected you from internet banking due to a period of inactivity..."

I have had caching issues with HSBC before, they seem to be a bit more stringent than other sites. If I send the requests through one of the squids on its own it works fine, so I can only assume it's because it is seeing fragmented requests; maybe there is a keepalive component that is requested. How do I combat this? Is this what persistence does, or is there a way of making the realservers appear to all have the same IP address?

Joe

change -rr (or whatever you're running) to -dh.

Lars

Use a different scheduler, like lblc or lblcr.

Harry Yen hyen1 (at) yahoo (dot) com 16 April 2002

What is the purpose of using LVS with Squid for an https site? HTTPS-based material typically is not cacheable. I don't understand why you need Squid at all.

Once a request reaches a Squid and incurs a cache miss, the forwarded request will have the Squid IP as the source address. So you need to find a way to make sure all connections from the same client IP go to the same Squid farm. Then when they incur cache misses, they will wind up via LVS persistence at the same realserver.

The reason https is sent to the squids is because it's much easier to send all browser traffic to the squids and then let them handle it. The only way I seemed to be able to get this to work (i.e. access the bank site) was to set persistence (360 seconds) and use lblc scheduling. The current output of ipvsadm is this... I am a tad concerned at the apparent lack of load balancing.

TCP  wwwcache-vip.swan.ac.uk:squi lblc persistent 360
  -> squidfarm1.swan.ac.uk:squid  Route   1      202        1045
  -> squidfarm2.swan.ac.uk:squid  Route   1      14         8

I have sorted it by using persistence; I couldn't get any of the dedicated squid schedulers to work properly. I'm currently running wlc and 360s persistence. It seems to be holding up really well. Still watching it with eagle eyes though.

The -dh scheduler was written expressly to handle squids. Jezz tried it and didn't get it to work satisfactorily but found that persistence worked. We don't understand this yet.

Jakub Suchy jakub (at) rtfm (dot) cz 2005/02/23

The round-robin algorithm is not usable for squid. Some servers (banks etc.) check the client's IP address and terminate its connection if it changes. When you use the source-hashing algorithm, IPVS checks the client against its local table and always forwards the connection to the same squid realserver, so the client always accesses the web through the same squid. Source-hashing can become unbalanced when you have few clients and one of them uses squid more frequently than the others. With more clients, it's statistically balanced.
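A sketch of the source-hashing setup Jakub describes (addresses invented; 3128 assumed as the squid port; the high weight is deliberate - see the note on SH weights in Section 4.10.2):

ipvsadm -A -t 10.0.0.80:3128 -s sh
ipvsadm -a -t 10.0.0.80:3128 -r 192.168.2.1:3128 -g -w 100
ipvsadm -a -t 10.0.0.80:3128 -r 192.168.2.2:3128 -g -w 100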

4.8.2. DH persistence

Just as you can use the SH scheduler to achieve persistence (affinity), you can also use DH. We couldn't work out why this LVS wasn't scheduling the way expected, but the DH scheduler fixed it.

Steve Haneman stevehaneman (at) yahoo (dot) com 22 Oct 2008

I'm using ipvsadm to load balance 2 security web servers, so I'm using 3 boxes. The web servers have 100 IPs each in the 192.168.253.x and 192.168.252.x ranges.

I'm running a load test through the lb to the web servers from 20 unique IPs. I'm finding that 4% of the time the sessions are not sticky. A user/password web POST goes to one server and a followup POST ends up at the other server. There is less than 1 second between the POSTs. Each transaction (login POST, data POST, logout) is completed with the IP it started with.

Jeff Tchang jeff (dot) tchang (at) gmail (dot) com

Not sure if this might help but have you tried using different scheduling algorithms? In particular maybe destination hashing?

Steve

That fixed it. I changed from rr to dh and I'm seeing all goodness.

4.9. LVS with mark tracking: fwmark patches for multiple firewalls/gateways

If the LVS is protected by multiple firewall boxes and each firewall is doing connection tracking, then packets arriving and leaving the LVS from the same connection will need to pass through the same firewall box or else they won't be seen to be part of the same connection and will be dropped. An initial attempt to handle the firewall problem was sent in by Henrik Nordstrom, who is involved with developing web caches (squids).

This code isn't a scheduler, but it's in here awaiting further developments of code from Julian, because it addresses similar problems to the SH scheduler in the next section.

Julian 13 Jan 2002

Unfortunately Henrik's patch breaks the LVS fwmark code. Multiple gateway setups can be solved with routing and a solution is planned for LVS. Until then it would be best to contact Henrik, hno (at) marasystems (dot) com for his patch.

Here's Henrik's patch and here's some history.

Henrik Nordstrom hno (at) marasystems (dot) com 13 Jan 2002

My use of the MARK is for routing purposes of return traffic only, not at all related to the scheduling onto the farm. This is to solve complex routing problems arising at the borders between networks where it is impractical to know the full routing of all clients. One example of what I do is like this:

I have a box connected to three networks (firewall, including LVS-NAT load balancing capabilities for published services)

  • a - Internet
  • b - DMZ, where the farm members are
  • c - Large intranet

For simplicity both Internet and intranet users connect to the same LVS IP addresses. Both networks 'a' and 'c' are complex, and maintaining a complete and correct routing table covering one of the networks (i.e. the 'c' network above) borders on the impossible and is error prone, as the use of addresses changes over time.

To simplify routing decisions I simply want return traffic to be routed back the same way as from where the request was received. This covers 99.99% of all routing needed in such a situation, regardless of the complexity of the networks on the two (or more) sides, without the need for any explicit routing entries. To do this I MARK the session when received using netfilter, giving it a routing mark indicating which path the session was received from. My small patch modifies LVS to memorize this mark in the LVS session, and then restore it on return traffic received FROM the realservers. This allows me to route the return traffic from the farm members to the correct client connection using iproute fwmark based routing rules.
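The netfilter/iproute2 half of what Henrik describes is standard fwmark routing (the LVS mark-memory itself needs his patch); a sketch with invented interfaces, marks and gateways:

iptables -t mangle -A PREROUTING -i eth0 -j MARK --set-mark 1   #came in from the Internet side
iptables -t mangle -A PREROUTING -i eth2 -j MARK --set-mark 3   #came in from the intranet side

ip rule add fwmark 1 table 101                                  #return traffic marked 1 ...
ip route add default via 10.0.0.1 dev eth0 table 101            #... goes back out the Internet side
ip rule add fwmark 3 table 103
ip route add default via 172.16.0.1 dev eth2 table 103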

As farm distribution algorithms I use different ones depending on the type of application. The MARK I only use for routing of return traffic. I also have a similar patch for Netfilter connection tracking (and NAT), for the same purpose of routing return traffic. If interested, search for CONNMARK in the netfilter-devel archives. The two combined allow me to make multihomed boxes which do not need to know the networks on any of the sides in detail, besides their own IP addresses and suitable gateways to reach further into the networks.

Another use of the connection MARK memory feature is a device connected to multiple customer networks with overlapping IP addresses, for example two customers both using 192.168.1.X addresses. In such a case making a standard routing table becomes impossible, as the customers are not uniquely identified by their IP addresses. The MARK memory however deals with such routing with ease, since it does not care about the detailed addressing as long as it is possible to identify the two customer paths somehow, i.e. by the interface the request was originally received on, the source MAC of the router who sent us the request, or anything uniquely identifying the request as coming from a specific path.

The two problems above (not wanting to know the IP routing, or not being able to IP route) are not mutually exclusive. If you have one then the other is quite likely to occur.

Here's Henrik's announcement and the replies.

Henrik Nordstrom 14 Feb 2001

Here is a small patch to make LVS keep the MARK, and have return traffic inherit the mark.

We use this for routing purposes on a multihomed LVS server, to have return traffic routed back the same way as from where it was received. What we do is that we set the mark in the iptables mangle chain depending on source interface, and in the routing table use this mark to have return traffic routed back in the same (opposite) direction.

The patch also moves the priority of the LVS INPUT hook back in front of the iptables filter hook, this to be able to filter the traffic not picked up by LVS but matching its service definitions. We are not (yet) interested in filtering traffic to the virtual servers, but very interested in filtering what traffic reaches the Linux LVS-box itself.

Julian - who uses NFC_ALTERED ?

Netfilter. The packet is accepted by the hook but altered (mark changed).

Julian - Give us an example (with dummy addresses) for setup that require such fwmark assignments.

For a start you need an LVS setup with more than one real interface receiving client traffic for this to be of any use. Some clients (due to routing outside the LVS server) come in on one interface, other clients on another interface. In this setup you might not want to have an equally complex routing table on the actual LVS server itself.

Regarding iptables/ipvs I currently "only" have three main issues.

  • As the "INPUT" traffic bypasses most normal routes, the iptables conntrack will get quite confused by return traffic..
  • Sessions will be tracked twice. Both by iptables conntrack and by IPVS.
  • There is no obvious choice whether the IPVS LOCAL_IN hook should be placed before or after the iptables filter hook. Having it after enables the use of many fancy iptables options, but instead requires one to have rules in iptables for allowing ipvs traffic, and any mismatches (either in rulesets or IPVS operation) will cause the packets to actually hit the IP interface of the LVS server, which in most cases is not what was intended.

4.10. SH scheduler

Using the SH (source hash) scheduler, the realserver is selected using a hash of the CIP. Thus all connect requests from a particular client will go to the same realserver. Scheduling based on the client IP, should solve some of the problems that currently require persistence (i.e. having a client always go to the same realserver).

Other than the few comments here, no-one has used the -sh scheduler. The SH scheduler was originally intended for directors with multiple firewalls, with the balancing based on hashes of the MAC address of the firewall and this is how it was written up. Since no-one was balancing on the MAC address of the firewall, the SH scheduler lay dormant for many years, till someone on the mailing list figured out that it could do other things too.

It turns out that address hashing is a standard method of keeping the client on the same server in a load balanced server setup. Willy Tarreau (in http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing) discusses address hashing (in the section "selecting the best server") to prevent loss of session data with SSL connections in loadbalanced servers, by keeping the client on the same server.

Here's Wensong's announcement:

Wensong Zhang wensong (at) gnuchina (dot) org 16 Feb 2001

Please see "man ipvsadm" for short description of DH and SH schedulers. I think some examples to use those two schedulers. Example: Firewall Load Balancing

                      |-- FW1 --|
  Internet ----- SH --|         |-- DH -- Protected Network
                      |-- FW2 --|

Make sure that the firewall boxes are added to the load balancers in the same order. Then, when the request packets of a session are sent to a firewall, e.g. FW1, DH can forward the response packets from the protected network through FW1 too. However, I don't have enough hardware to test this setup myself. Please let me know if any of you make it work for you. :)
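A sketch of the two directors in this diagram (fwmarks and firewall addresses invented; the iptables rules that set the fwmarks are omitted; note that the firewalls are listed in the same order on both boxes):

#outside director: hash on the client (source) address
ipvsadm -A -f 10 -s sh
ipvsadm -a -f 10 -r 192.168.100.1 -g   #FW1
ipvsadm -a -f 10 -r 192.168.100.2 -g   #FW2

#inside director: replies have the client as destination, so hash on the destination address
ipvsadm -A -f 20 -s dh
ipvsadm -a -f 20 -r 192.168.200.1 -g   #FW1
ipvsadm -a -f 20 -r 192.168.200.2 -g   #FW2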

For initial discussions on the -dh and -sh scheduler see on the mailing list under "some info for DH and SH schedulers" and "LVS with mark tracking".

4.10.1. Testing the SH scheduler

The SH scheduler schedules by client IP. Thus if you test from one client only, all connections will go to the first realserver in the ipvsadm table.

Rached Ben Mustapha rached (at) alinka (dot) com 15 May 2002

It seems there is a problem with the SH scheduler and local node feature. I configured my LVS director (node-102) in direct routing mode on a 2.4.18 linux kernel with ipvs 1.0.2. The realservers are set up accordingly.

root@node-102# ipvsadm --version
ipvsadm v1.20 2001/11/04 (compiled with popt and IPVS v1.0.2)
			
root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0

With this configuration, it's ok. Connections from different clients are load-balanced on both servers. Now I add the director:

root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 127.0.0.1                    Route   1      0          0
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0

All new connections, whatever the client's IP, go to the director. And with this config:

root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0
  -> 127.0.0.1

Now all new connections, whatever the client's IP, go to node-103. So it seems that with the localnode feature, the scheduler always chooses the first entry in the redirection rules.

Wensong

There is no problem with the SH scheduler and the localnode feature.

I reproduced your setup. I issued requests from two different clients; all the requests were sent to the localnode. Then, issuing requests from a third client, the requests were sent to the other server. Please see the result.

[root@dolphin /root]# ipvsadm -ln
IP Virtual Server version 1.0.2 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 sh
  -> 172.26.20.72:80              Route   1      2          0
  -> 127.0.0.1:80                 Local   1      0          858

I don't know how many clients are used in your test. You know that the SH scheduler is a static mapping algorithm (based on the source IP address of clients). It is quite possible that two or more client IP addresses are mapped to the same server.

4.10.2. "weight" is really the number of connections for the SH scheduler

Con Tassios ct (at) swin (dot) edu (dot) au 7 Jun 2006

Source hashing with a weight of 1 (the default value used with the other schedulers) will result in the realserver being considered overloaded when the number of connections is greater than 2, as the output of ipvsadm shows. You should increase the weight. The weight, when used with SH and DH, has a different meaning than in most, if not all, of the other standard LVS scheduling methods, although this doesn't appear to be mentioned in the man page for ipvsadm.

From ip_vs_sh.c

The sh algorithm is to select server by the hash key of source IP
address. The pseudo code is as follows:

      n <- servernode[src_ip];
      if (n is dead) OR
         (n is overloaded, such as n.conns>2*n.weight) then
                return NULL;

      return n;
Note
Joe: if the weight is 1, then return NULL if the number of connections > 2. If the number of connections is twice the weight, don't allow any more connections.

Martijn Grendelman martijn (at) grendelman (dot) net 09 Jun 2006

That would explain the things I saw. In the meantime, I went back to a configuration using the SH scheduler now with a weight for both real servers of 200, instead of 1, and things seem to run fine.

martijn@tweety:~> rr ipvsadm -L
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  212.204.230.98:www sh
  -> tweety.sipo.nl:www           Local   200    25         44
  -> daffy.sipo.nl:www            Route   200    12         27
TCP  212.204.230.98:https sh persistent 360
  -> tweety.sipo.nl:https         Local   100    0          0
  -> daffy.sipo.nl:https          Route   100    0          0
Joe: Since the SH scheduler sends a client's packets to the same realserver, I had thought that it should completely replace persistence. However you're using persistence with SH, so apparently SH doesn't handle keeping the client on the realserver as I expect. So why are you using persistence?

Ehr.. no reason, I guess. It's still there from when I used RR scheduling and I guess I forgot to remove it. I don't think it is actually useful.

4.10.3. SH failout

Shutting down an SH realserver by changing the weight to 0 (as is done for the other schedulers) still allows connection requests to be sent to that realserver (you'll get a failed connection if the realserver is down). This seems to be a result of the different meaning of weight for the SH scheduler. No new sources will be allowed to initiate connections, but all connections from known sources will still be forwarded, and all known sources will be allowed to initiate connections. To stop connection requests being forwarded to a realserver, you have to remove the realserver from the ipvsadm table. You may have to break current connections to do this :-(

Thomas Pedoussaut thomas (at) pedoussaut (dot) com 16 Oct 2008

Weight 0 means no more new connections, but existing ones (in the SH sense) will still hit the realserver. You need to have it properly removed.

4.11. What is an ActiveConn/InActConn (Active/Inactive) connection?

The output of ipvsadm lists connections, either as

  • ActiveConn - in ESTABLISHED state
  • InActConn - any other state

With LVS-NAT, the director sees all the packets between the client and the realserver, so always knows the state of tcp connections and the listing from ipvsadm is accurate. However for LVS-DR, LVS-Tun, the director does not see the packets from the realserver to the client. Termination of the tcp connection occurs by one of the ends sending a FIN (see W. Richard Stevens, TCP/IP Illustrated Vol 1, ch 18, 1994, pub Addison Wesley) followed by reply ACK from the other end. Then the other end sends its FIN, followed by an ACK from the first machine. If the realserver initiates termination of the connection, the director will only be able to infer that this has happened from seeing the ACK from the client. In either case the director has to infer that the connection has closed from partial information and uses its own table of timeouts to declare that the connection has terminated. Thus the count in the InActConn column for LVS-DR, LVS-Tun is inferred rather than real.

Entries in the ActiveConn column come from

  • service with an established connection. Examples of services which hold connections in the ESTABLISHED state for long enough to see with ipvsadm are telnet and ftp (port 21).

Entries in the InActConn column come from

  • Normal operation

    • Services like http (in non-persistent, i.e. HTTP/1.0, mode) or ftp-data (port 20) which close the connections as soon as the hit/data (html page, or gif etc.) has been retrieved (<1 sec). You're unlikely to see anything in the ActiveConn column with these LVS'ed services. You'll see an entry in the InActConn column until the connection times out. If you're getting 1000 connections/sec and it takes 60 secs for the connection to time out (the normal timeout), then you'll have 60,000 InActConns. This number of InActConn is quite normal. If you are running an e-commerce site with 300 secs of persistence, you'll have 300,000 InActConn entries. Each entry takes 128 bytes (300,000 entries is about 40M of memory; make sure you have enough RAM for your application). The number of ActiveConn might be very small.
  • Pathological Conditions (i.e. your LVS is not setup properly)

    • identd delayed connections: The 3 way handshake to establish a connection takes only 3 exchanges of packets (i.e. it's quick on any normal network) and you won't be quick enough with ipvsadm to see the connection in the states before it becomes ESTABLISHED. However if the service on the realserver is under authd/identd, you'll see an InActConn entry during the delay period.

    • Incorrect routing (usually the wrong default gw for the realservers):

      In this case the 3 way handshake will never complete, the connection will hang, and there'll be an entry in the InActConn column.

Usually the number of InActConn will be larger or very much larger than the number of ActiveConn.

Here's an LVS-DR LVS, set up for ftp, telnet and http, after telnetting from the client (the client command line is at the telnet prompt).

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      0          0
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      1          0
-> RS1.mack.net:telnet              Route   1      0          0

showing the ESTABLISHED telnet connection (here to realserver RS2).

Here's the output of netstat -an | grep (appropriate IP) for the client and the realserver, showing that the connection is in the ESTABLISHED state.

client:# netstat -an | grep VIP
tcp        0      0 client:1229      VIP:23           ESTABLISHED
realserver:# netstat -an | grep CIP
tcp        0      0 VIP:23           client:1229      ESTABLISHED

Here's immediately after the client logs out from the telnet session.

director# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      0          0
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      0          0
-> RS1.mack.net:telnet              Route   1      0          0

client:# netstat -an | grep VIP
#ie nothing, the client has closed the connection

#the realserver has closed the session in response
#to the client's request to close out the session.
#The telnet server has entered the TIME_WAIT state.
realserver:/home/ftp/pub# netstat -an | grep 254
tcp        0      0 VIP:23        CIP:1236      TIME_WAIT

#a minute later, the entry for the connection at the realserver is gone.

Here's the output after ftp'ing from the client and logging in, but before running any commands (like `dir` or `get filename`).

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      1          1
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      0          0
-> RS1.mack.net:telnet              Route   1      0          0

client:# netstat -an | grep VIP
tcp        0      0 CIP:1230      VIP:21        TIME_WAIT
tcp        0      0 CIP:1233      VIP:21        ESTABLISHED

realserver:# netstat -an | grep 254
tcp        0      0 VIP:21        CIP:1233      ESTABLISHED

The client opens 2 connections to the ftpd and leaves one open (the ftp prompt). The other connection, used to transfer the user/passwd information, is closed down after the login. The entry in the ipvsadm table corresponding to the TIME_WAIT state at the realserver is listed as InActConn. If nothing else is done at the client's ftp prompt, the connection will expire in 900 secs. Here's the realserver during this 900 secs.

realserver:# netstat -an | grep CIP
tcp        0      0 VIP:21        CIP:1233      ESTABLISHED
realserver:# netstat -an | grep CIP
tcp        0     57 VIP:21        CIP:1233      FIN_WAIT1
realserver:# netstat -an | grep CIP
#ie nothing, the connection has dropped

#if you then go to the client, you'll find it has timed out.
ftp> dir
421 Timeout (900 seconds): closing control connection.

http 1.0 connections are closed immediately after retrieving the URL (i.e. you won't see any ActiveConn in the ipvsadm table immediately after the URL has been fetched). Here are the outputs after retrieving a webpage from the LVS.

director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          1
-> RS1.mack.net:www                 Route   1      0          0

client:~# netstat -an | grep VIP

RS2:/home/ftp/pub# netstat -an | grep CIP
tcp        0      0 VIP:80        CIP:1238      TIME_WAIT

4.11.1. Programmatically access ActiveConn, InActConn

I want to get the active and inactive connections of one virtual service in my program.

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 12 Feb 2003

You have two options here:

  • Read the /proc/net/ip_vs file and parse it for the required numbers (a sketch follows this list)
  • Use libipvs (distributed with ipvsadm) to read the tables directly. Take a look in ipvsadm.c for how this is done.
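A sketch of the first option: the realserver lines in /proc/net/ip_vs carry the Forward, Weight, ActiveConn and InActConn columns, so something like this prints the counts (column layout assumed from the listings in this section; check your own /proc/net/ip_vs first):

awk '$1 == "->" && $2 ~ /[0-9A-F]+:[0-9A-F]+/ {print $2, "ActiveConn=" $5, "InActConn=" $6}' /proc/net/ip_vs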

4.11.2. ActiveConn/InActConn different for 2.4.x/2.6.x kernels

Ratz 15 Oct 2006

IPVS has changed significantly between 2.4 and 2.6 with regard to the ratio of active/inactive connections. We've seen that in our rrdtool/MRTG graphs as well. In the 2.6.x kernels, at least for the (w)lc schedulers, the RS calculation is done differently. On top of that, the TCP stack has changed tunables and your hardware also behaves differently. The LVS state transition timeouts are different between 2.4.x and 2.6.x kernels, IIRC, and so, for example if you're using LVS-DR, the active to inactive connection transition takes more time, thus yielding a potentially higher number of sessions in state ActiveConn.

4.11.3. from the mailing list

Ty Beede wrote:

I am curious about the implementation of the inactconns and activeconns variables in the lvs source code.

Julian

        Info about LVS <= 0.9.7
TCP
        active:         all connections in ESTABLISHED state
        inactive:       all connections not in ESTABLISHED state
UDP
        active:         0 (none) (LVS <= 0.9.7)
        inactive:       all (LVS <= 0.9.7)

        active + inactive = all

Look in this table for the used timeouts for each protocol/state:

/usr/src/linux/net/ipv4/ip_masq.c, masq_timeout_table

For LVS-Tun and LVS-DR the TCP states are changed checking only the TCP flags from the incoming packets. For these methods UDP entries can expire (5 minutes?) if only the realserver sends packets and there are no packets from the client.

For info about the TCP states: /usr/src/linux/net/ipv4/tcp.c, rfc793.txt

Jean-francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

I've done some testing (netmon) on this and here are my observations:

1. A connection becomes active when LVS sees the ACK flag in the TCP header coming into the cluster, i.e. when the socket gets established on the realserver.

2. A connection becomes inactive when LVS sees the ACK-FIN flag in the TCP header coming into the cluster. This does NOT correspond to the socket closing on the realserver.

Example with my Apache Web server.

Client  	<---> Server

A client requests an object from the web server on port 80:

SYN REQUEST     ---->
SYN ACK 	<----
ACK             ----> *** ActiveConn=1 and 1 ESTABLISHED socket on realserver.
HTTP get        ----> *** The client request the object
HTTP response   <---- *** The server sends the object
APACHE closes the socket : *** ActiveConn=1 and 0 ESTABLISHED socket on realserver
The CLIENT receives the object. (took 15 seconds in my test)
ACK-FIN         ----> *** ActiveConn=0 and 0 ESTABLISHED socket on realserver

Conclusion: ActiveConn is the number of active CLIENT connections... not connections on the server, in the case of short transmissions (like objects on a web page). It's hard to calculate a server's capacity based on this because slower clients make ActiveConn greater than what the server is really processing. You won't be able to reproduce that effect on a LAN, because the client receives the segment too fast.

In the LVS mailing list, many people explained that the correct way to balance the connections is to use monitoring software. The weights must be evaluated using values from the realserver. In LVS-DR and LVS-Tun, the Director can be easily fooled with invalid packets for some period and this can be enough to unbalance the cluster when using "*lc" schedulers.

I reproduced the effect by connecting at 9600 bps and getting a 100k gif from Apache, while monitoring established sockets on port 80 on the realserver and ipvsadm on the cluster.

Julian

You are probably using LVS-DR or LVS-Tun in your test, right? Using these methods, the LVS is changing the TCP state based on the incoming packets, i.e. from the clients. This is the reason that the Director can't see the FIN packet from the realserver. This is the reason that LVS can be easily SYN flooded, even flooded with an ACK following the SYN packet. The LVS can't change the TCP state according to the state in the realserver. This is possible only for VS/NAT mode. So, in some situations you can have invalid entries in ESTABLISHED state which do not correspond to connections in the realserver, since the realserver effectively ignores these SYN packets using cookies. VS/NAT looks like the better solution against SYN flood attacks. Of course, the ESTABLISHED timeout can be changed to 5 minutes for example. Currently, the max timeout interval (excluding the ESTABLISHED state) is 2 minutes. If you think that you can serve the clients using a smaller timeout for the ESTABLISHED state, when under an "ACK after SYN" attack, you can change it with ipchains. You don't need to change it under 2 minutes in LVS 0.9.7. In the latest LVS version SYN+FIN switches the state to TIME_WAIT, which can't be controlled using ipchains. In other cases, you can change the timeout for the ESTABLISHED and FIN-WAIT states. But you can change it only down to 1 minute. If this doesn't help, buy 2GB RAM or more for the Director.

One thing that can be done, though this may be paranoia:

change the INPUT_ONLY table:

from:
           FIN
        SR ---> TW
to:
           FIN
        SR ---> FW

OK, this is an incorrect interpretation of the TCP states, but it is a hack which allows the min state timeout to be 1 minute. Now using ipchains we can set the timeout for all TCP states to 1 minute.

If this is changed you can now set the ESTABLISHED and FIN-WAIT timeouts down to 1 minute. In the current LVS version the min effective timeout for the ESTABLISHED and FIN-WAIT states is 2 minutes.

Jean-Francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

I'm using DR on the cluster with 2 realservers. I'm trying to control the number of connections to achieve this:

The cluster in normal mode balances requests across the 2 realservers. If the realservers reach a point where they can't serve clients fast enough, a new entry with a weight of 10000 is entered in LVS to send the overflow locally to a web server with a static web page saying "we're too busy". It's a cgi that intercepts 'deep links' into our site and returns a predefined page. 600 seconds of persistence ensures that already connected clients stay on the server where they began to browse. The client only has to hit refresh until the number of ActiveConns (I hoped) on the realservers gets lower and the overflow entry gets deleted.

Got the idea... Load balancing with overflow control.

Julian

Good idea. But LVS can't help you. When the clients are redirected to the Director they stay there for 600 seconds.

But when we activate the local redirection of requests due to overflow, ActiveConn continues to grow in LVS, while InActConn decreases as expected. So the load on the realserver gets OK... but LVS doesn't see it and doesn't let new clients in. (It takes 12 minutes before ActiveConn decreases enough to reopen the site.)

I need a way, a value to check, that says the server is overloaded, to begin redirecting locally, and the opposite.

I know that seems a little complicated....

Julian

What about trying to:

  • use a persistent timeout of 1 second for the virtual service.

    If you have one entry for this client, all entries from this client go to the same realserver. I haven't tested it, but maybe a client will load the whole web page. If the server is overloaded the next web page will be "we're too busy".

  • switch the weight for the Director between 0 and 10000. Don't delete the Director as realserver.

    Weight 0 means "No new connections to the server". You have to play with the weight for the Director, for example:

  • if your realservers are loaded near 99% set the weight to 10000

  • if your realservers are loaded below 95% set the weight to 0

Jean-Francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

Will a weight of 0 redirect traffic to the other realservers (does persistence remain)?

Julian

If the persistent timeout is small, I think.

I can't get rid of the 600 seconds of persistence because we run a transactional engine, i.e. if a client begins on a realserver, he must complete the transaction on that server or get an error (transactional contexts are stored locally).

Such a timeout can't help to redirect the clients back to the realservers.

You can check the free ram or the cpu idle time on the realservers. This way you can correctly set the weights for the realservers and switch the weight for the Director.

These recommendations could be completely wrong; I've never tested them. If they don't help, try setting httpd.conf:MaxClients to some reasonable value. Why not put the Director in as a realserver permanently? With 3 realservers it's better.

Jean

Those are already optimized; the bottleneck is when 1500 clients try our site in less than 5 minutes...

One of our people has suggested that the realservers check their own state (via the TCP-in-use count given by sockstat) and command the director to redirect traffic when needed.

Can you explain in more detail why the number of ActiveConn on the realserver continues to grow while redirecting traffic locally with a weight of 10000 (with InActConn on that realserver decreasing normally)?

Julian

Only the new clients are redirected to the Director at this moment. Where do the active connections continue to grow, on the realservers or on the Director (weight=10000)?

4.11.4. How is ActiveConn/InActConn calculated?

Joe, 14 May 2001

According to the ipvsadm man page, for "lc" scheduling, the new connections are assigned according to the number of "active connections". Is this the same as "ActiveConn" in the output of ipvsadm? If the number of "active connections" used to determine the scheduling is "ActiveConn", then for services which don't maintain connections (e.g. http or UDP services), the scheduler won't have much information, just "0" for all realservers?

Julian, 14 May and 23 May

It is a counter and it is incremented when a new connection is created. The formula is:

active connections = ActConn * K + InActConn

where K can be 32 to 50 (I don't remember the last used value), so it is not only the active conns (which would break UDP).

Is "active connections" incremented if the client re-uses a port?

No, the reused connections are not counted.
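
You can get a rough feel for this weighted sum from the director with a one-liner like the following (K=32 is an assumption taken from the low end of the range Julian mentions; the awk fields match the ipvsadm -Ln layout shown elsewhere in this HOWTO):

director:/etc/lvs# ipvsadm -Ln | awk -v K=32 '$1 == "->" && $5 ~ /^[0-9]+$/ {print $2, $5 * K + $6}'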

4.11.5. ActiveConn is a guess for LVS-DR

For LVS-DR, the director doesn't see the return packets and uses tables of timeouts to guess a likely state of the service at the realserver. For the same reason you can't do stateful filtering on the director for LVS-DR controlled packets.

barrywong barrywong (at) sina (dot) com 30 Aug 2008

I'm using DR wlc persistent 120

# ipvsadm -Ln
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn

TCP  xx.xx.xx.xx:80 wlc persistent 120
  -> xx.xx.xx.x1:80             Route   1      6459       7057
  -> xx.xx.xx.x2:80            Route   1      6446       4766

# netstat -n
realserver ESTABLISHED
xx.xx.xx.x1:80 4210 ESTABLISHED
xx.xx.xx.x2:80 4483 ESTABLISHED

The realserver's ESTABLISHED connection count is less than the LVS ActiveConn count. Why is this?

Thomas Pedoussaut thomas (at) pedoussaut (dot) com

I guess your issue is that the persistence is low compared to your usage. I've had similar numbers with a mysql setup. Basically, there were hundreds of very-long-lasting connections that weren't doing much traffic, sometimes pausing for hours. They would disappear from the LVS status but still be visible on the client and the server as CONNECTED.

It's not really a big issue. Usually server affinity makes the resuming packets get directed to the same server, so the connection can still be used. If that's not the case, there is enough code on the client side to re-establish a new connection if that one fails. You'll still have to face a problem with the server-side connections that will be lingering in a limbo state; I would consider setting some sort of timeout on that side. I'm not 100% sure, but your realservers are running squid on port 80, correct? If so, please have a look at http://www.squid-cache.org/Versions/v3/3.0/cfgman/read_timeout.html and probably shorten it (or extend your LVS persistence to that value with ipvsadm --set).

4.12. FAQ: ipvsadm shows entries in InActConn, but none in ActiveConn, connection hangs. What's wrong?

The usual mistake is to have the default gw for the realservers set incorrectly.

  • LVS-NAT: the default gw must be the director. There cannot be any other path from the realservers to the client, except through the director.
  • LVS-DR, LVS-Tun: the default gw cannot be the director - use some local router.

Setting up an LVS by hand is tedious. You can use the configure script which will trap most errors in setup.
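
For reference, the realserver side of this FAQ looks something like the following (the addresses are placeholders: 10.1.1.1 stands for the director's inside (DIP) address in LVS-NAT, and 192.168.1.254 for an ordinary local router in LVS-DR/LVS-Tun):

# on an LVS-NAT realserver: the director must be the only path back to the client
realserver# route add default gw 10.1.1.1

# on an LVS-DR or LVS-Tun realserver: use a local router, not the director
realserver# route add default gw 192.168.1.254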

4.13. FAQ: initial connection is delayed, but once connected everything is fine. What's wrong?

Usually you have problems with authd/identd. The simplest thing is to stop your service from calling the identd server on the client (i.e. disconnect your service from identd).

4.14. unbalanced realservers: does rr and lc weighting equally distribute the load? - clients reusing ports

(also see polygraph in the performance section.)

I ran the polygraph simple.pg test on a LVS-NAT LVS with 4 realservers using rr scheduling. Since the responses from the realservers should average out, I would have expected the number of connections and the load average to be equally distributed over the realservers.

Here's the output of ipvsadm shortly after the number of connections had reached steady state (about 5 mins).

IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> RS4.mack.net:polygraph         Masq    1      0          883
  -> RS3.mack.net:polygraph         Masq    1      0          924
  -> RS2.mack.net:polygraph         Masq    1      0          1186
  -> RS1.mack.net:polygraph         Masq    1      0          982

The servers were identical hardware. I expect (but am not sure) that the utils/software on the machines is identical (I set up RS3,RS4 about 6 months after RS1,RS2). RS2 was running 2.2.19, while the other 3 machines were running 2.4.3 kernels. The number of connections (all in TIME_WAIT) at the realservers was different for each (otherwise apparently identical) realserver and was in the range 450-500 for the 2.4.3 machines and 1000 for the 2.2.19 machine (measured with netstat -an | grep $polygraph_port | wc) and varied about 10% over a long period.

This run had been done immediately after another run and InActConn had not been allowed to drop to 0. Here I repeated this run, after first waiting for InActConn to drop to 0

IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> RS4.mack.net:polygraph         Masq    1      0          994
  -> RS3.mack.net:polygraph         Masq    1      0          994
  -> RS2.mack.net:polygraph         Masq    1      0          994
  -> RS1.mack.net:polygraph         Masq    1      1          992
TCP  lvs2.mack.net:netpipe rr

RS2 (the 2.2.19 machine) had 900 connections in TIME_WAIT while the other (2.4.3) machines were 400-600. RS2 was also delivering about 50% more hits to the client.

Repeating the run using "lc" scheduling, the InActConn remains constant.

IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph lc
  -> RS4.mack.net:polygraph         Masq    1      0          994
  -> RS3.mack.net:polygraph         Masq    1      0          994
  -> RS2.mack.net:polygraph         Masq    1      0          994
  -> RS1.mack.net:polygraph         Masq    1      0          993

The number of connections (all in TIME_WAIT) at the realservers did not change.

I've been running the polygraph simple.pg test over the weekend using rr scheduling on what (AFAIK) are 4 identical realservers in a LVS-NAT LVS. There are no ActiveConn and a large number of InActConn. Presumably the client makes a new connection for each request.

Julian (I think)

The implicit persistence of TCP connection reuse can cause such side effects even for RR. When the setup includes a small number of hosts and the request rate is high enough for the client's ports to be reused, LVS detects the existing connections and new connections are not created. This is why you can see some of the realservers not being used at all, even with a method like RR.

the client is using ports from 1025-4999 (has about 2000 open at one time) and it's not going above the 4999 barrier. ipvsadm shows a constant InActConn of 990-995 for all realservers, but the number of connections on each of the realservers (netstat -an) ranges from 400-900.

So if the client is reusing ports (I thought you always incremented the port by 1 till you got to 64k and then it rolled over again), LVS won't create a new entry in the hash table if the old one hasn't expired?

Yes, it seems you have (5000-1024) connections that never expire in LVS.

Presumably because the director doesn't know the number of connections at the realservers (it only has the number of entries in its tables), and because even apparently identical realservers aren't identical (the hardware here is the same, but I set them up at different times, presumably not all the files and time outs are the same), the throughput of different realservers may not be the same.

The apparent imbalance in the number of InActConn, then, is a combination of some clients reusing ports and the director's method of estimating the number of connections, which makes assumptions about TIME_WAIT on the realserver. A better estimate of the number of connections at the realservers would have been to look at the number of ESTABLISHED and TIME_WAIT connections on the realservers, but I didn't think of this at the time I did the above tests.
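
Such an estimate is easy to get on each realserver; for example (the port is a placeholder for your service port):

realserver# netstat -an | grep ':80 ' | awk '{print $6}' | sort | uniq -c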

This imbalance, then, isn't anything we regard as a big enough problem to find a fix for.

4.15. Changing weights with ipvsadm

When setting up a service, you set the weight with a command like (default weight is 1).

director:/etc/lvs# ipvsadm -a -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE $FORWARDING -w 1

If you set the weight for the service to "0", then no new connections will be made to that service (see also man ipvsadm, about the -w option).
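
For example, to quiesce a realserver that is already in the table (same placeholder variables as above; depending on your ipvsadm version you may need to repeat the forwarding flag with -e):

director:/etc/lvs# ipvsadm -e -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE $FORWARDING -w 0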

Lars Marowsky-Bree lmb (at) suse (dot) de 11 May 2001

Setting weight = 0 means that no further connections will be assigned to the machine, but current ones remain established. This allows you to take a realserver out of service smoothly, e.g. for maintenance.

Removing the server hard cuts all active connections. This is the correct response to a monitoring failure, so that clients receive immediate notice that the server they are connected to died so they can reconnect.

Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 11 May 2001

Is there a way to clear some entries in the ipvs tables? If a server reboots or crashes, the connection entries remain in the ipvsadm table. Is there a way to remove some entries manually? I have tried to remove the realserver from the service (with ipvsadm -d ....), but the entries are still there.

Joe

After a service (or realserver) failure, some agent external to LVS will run ipvsadm to delete the entry for the service. Once this is done no new connections can be made to that service, but the entries are kept in the table till they timeout. (If the service is still up, you can delete the entries and then re-add the service and the client will not have been disconnected). You can't "remove" those entries, you can only change the timeout values.

Any clients connected through those entries to the failed service(s) will find their connection hung or deranged in some way. We can't do anything about that. The client will have to disconnect and make a new connection. For http where the client makes a new connection almost every page fetch, this is not a problem. Someone connected to a database may find their screen has frozen.

If you are going to set the weight for a realserver, you first need to know the state of the LVS. If the service is not already in the ipvsadm table, you add (-a) it. If the service is already in the ipvsadm table, you edit (-e) it. There is no command to just set the weight regardless of the state. A patch exists to do this (from Horms) but Wensong doesn't want to include it. Scripts which dynamically add, delete or change weights on services have to know the state of the LVS before making any changes, or else trap errors from running the wrong command.
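
A minimal sketch of such a state check (the VIP, port, realserver and weight are placeholders; the virtual service itself is assumed to exist already, and -g stands in for whatever forwarding method you use):

#!/bin/sh
# set a realserver's weight, adding the entry first if it isn't in the table
VIP=192.168.1.110; PORT=80; RS=192.168.1.2; WEIGHT=2
if ipvsadm -Ln -t $VIP:$PORT 2>/dev/null | grep -q " $RS:$PORT "; then
    ipvsadm -e -t $VIP:$PORT -r $RS:$PORT -g -w $WEIGHT
else
    ipvsadm -a -t $VIP:$PORT -r $RS:$PORT -g -w $WEIGHT
fi

ipvsadm returns an error if you -a an existing entry or -e a missing one, which is why the check comes first.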

4.16. Setting initial weights

If your hardware is all the same, you set them all to the same weight (1:1:1... or 1000:1000:1000; it's the ratio that's important, not the value). What if you have a bunch of different hardware and you don't know what weight to set for each?

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 13 Nov 2008

Run them all on the same weight for a while and see how they get loaded, then decide if you need to play with the weights or just add more servers. In a layer 4 balanced cluster I normally recommend no greater than 40% utilisation as a rule of thumb to cope with peaks in demand.

Graeme

Ensure you start with (for example) 100/100/100, or 1000/1000/1000. It's easier to juggle the weights with those values than 1/1/1 !

4.17. Dynamically changing realserver weights

The law of large numbers suggests that realservers with large numbers of clients should have nearly the same load. In practice, realservers can have widely different loads or numbers of connections; presumably this happens with services where a small number of clients can saturate a realserver. LVSs serving static html pages should have even loads on the realservers.

Leonard Soetedjo Leonard (at) axs (dot) com (dot) sg

From reading the HOWTO, mon doesn't handle dynamically changing the realserver weights. Is it advisable to create a monitoring program that changes the weighting of the realservers? The program would check the workers' load, memory etc. and reduce or increase the weight in the director based on that information.

Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uki 09 Dec 2002

Personally I think it adds complication you shouldn't require... As your servers are running the same app, they should respond in roughly the same way to traffic. If you have a very fast server, reduce its weight. If you have some very slow pages, e.g. a global search, then why not set up another VIP to make sure that all requests to search.mydomain.com are evenly distributed... or restricted to a couple of servers (so they don't impact everyone else).

But obviously it all depends on how your app works; with mine it's database performance that is the problem... Time to look at load balancing the DB! Does anyone have any experience of doing this with MS SQL Server and/or PostgreSQL? I'm thinking about running the session/sales stuff off the MS SQL box, and all the read-only content from several read-only PostgreSQL DBs... due to licencing costs... :-(

OTOH, someone recently spoke on the list about a monitoring tool which could use plugins to monitor the realservers. (Joe - see feedbackd.)

Lars Marowsky-Bree lmb (at) suse (dot) de 17 Mar 2003

keep in mind that loadavg is a poor indication of real resource utilization, but it might be enough. loadavg needs to be at least normalized via the number of CPUs.

Andres Tello criptos (at) aullox (dot) com 17 Mar 2003

I use: ram*speed/1000 to calculate the weight
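
As a rough illustration of this rule of thumb, run on each realserver (the /proc field names are those of x86 Linux; the division by 1000 just scales the result into a reasonable weight range):

#!/bin/sh
# weight = RAM(MB) * CPU(MHz) / 1000
ram_mb=`awk '/^MemTotal/ {print int($2/1024)}' /proc/meminfo`
cpu_mhz=`awk -F: '/^cpu MHz/ {print int($2); exit}' /proc/cpuinfo`
echo $(( ram_mb * cpu_mhz / 1000 ))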

Joe

dsh (http://www.netfort.gr.jp/~dancer/software/dsh.html, machine gone Sep 2004) is good for running commands (via rsh, ssh) on multiple machines (machines listed in a file). I'm using it here to monitor temperatures on multiple machines.

also see procstatd

Bruno Bonfils

The loads can become unbalanced even if the realservers are identical. Customers can read different pages. Some of them may have heavy php/sql usage, which implies a higher load average than a simple static html file.

Rylan W. Hazelton rylan (at) curiouslabs (dot) com 17 Mar 2003

I still find large differences in the load average of the realservers. WLC has no idea what else (non-httpd) might be happening on a server. Maybe I am compiling something for some reason, or I have a large cron job. It would be nice if LVS could redirect load accordingly.

4.18. feedbackd

Joe, Mar 2003

Jeremy's feedbackd code and HOWTO (feedbackd) is now released.

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 09 Dec 2002:

As I've said earlier (check out the thread starting at http://www.in-addr.de/pipermail/lvs-users/2002-November/007264.html ), the software sends server load information to the director to be inserted into the ipvs weighting tables.

I'm busy writing up the benchmarking results at the moment, and I'll post a link to the paper (and software) soon. In summary: when the simulation cluster is (intentionally) unbalanced, the feedback software successfully evens the load between all servers.

Jeremy at one stage had plans to merge his code with Alexandre's code, but (Aug 2004) he's not doing anything about it at the moment (he has a real job now).

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 04 Feb 2005

Everything's available at feedbackd (http://ozlabs.org/~jk/projects/feedbackd/).

Michal Kurowski mkur (at) gazeta (dot) pl 19 Jan 2007

I want to distribute the load based on criteria such as:

  • disk usage
  • OS load average
  • CPU usage
  • custom hooks into my own software
  • You Name It (TM)

feedbackd has CPU monitoring only by default. It also has a perl plugin that's supposed to let you quickly code something relevant to you. That's a perfect idea, except the original perl plugin is not fully functional, because it breaks some rules regarding linking to C-based modules.

I wrote feedbackd-agent.patch that solves the problem (it's against the latest release - 0.4 - and I've sent Jeremy a copy).

4.19. lvs-kiss

Per Andreas Buer perbu (at) linpro (dot) no 14 Dec 2002

I ran into the same problem this summer. I was setting up a load-balanced SMTP cluster, and I wanted to distribute the incoming connections based on the number of e-mails in the mail servers' queues. We ended up making our own program to do this. Later, I made the thing a bit more generic and released it. You might want to check it out:

http://www.linpro.no/projects/lvs-kiss/

lvs-kiss distributes incoming connections based on some numerical value - as long as you are able to quantify it, it can be used. It can also time certain tests in order to estimate the load of the realservers.

4.20. connection threshold

Note
according to my dictionary, the spelling is threshold and not the more logical threshhold as found in many webpages (see google: "threshhold dictionary").

If the realservers are limited in the number of connections they can support, you can use the connection threshold in ipvsadm (in ip_vs 1.1.0). See the Changelog and man ipvsadm. This functionality is derived from Ratz's original patches.

Matt Burleigh

Is there a stock Red Hat kernel with a new enough version of ip_vs to include connection thresholds?

Ratz 19 Dec 2002: I've done the original patch for 2.2.x kernels but I've never ported it to 2.4.x kernels. I don't know if RH has done so. In the newest LVS release for 2.5.x kernels the same concept is there, so with a bit of coding (maybe even luck) you could use that.

ratz ratz (at) tac (dot) ch 2001-01-29

This patch on top of ipvs-1.0.3-2.2.18 adds support for threshold settings per realserver for all schedulers that have the -w option.

Note
As of Jun 2003, patches are available for 2.4 kernels. All patches are on Ratz's LVS page. Patches are in active development (i.e. you'll be helping with the debugging), look at the mailing list for updates.

Horms 30 Aug 2004

LVS in 2.6 has its own connection limiting code. There isn't a whole lot to it. Just get ipvsadm for 2.6 and take a look in the man page. It has details on how the connection thresholds can be set. It's pretty straightforward as I recall.
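
For example (with the 2.6/1.1.x ipvsadm the options are -x/--u-threshold and -y/--l-threshold; the addresses and limits below are placeholders):

director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 192.168.100.3:80 -g -w 3 -x 1000 -y 800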

Anon

Is there any way to limit connections per IP through IPVS, to mimic the netfilter connection limit module ipt_connlimit? I see ipvsadm's threshold option, but it does totals per server.

Ratz 15 Feb 2007

how exactly is the threshold option (per RS) different to the ipt_connlimit regarding the service -> RS pool mapping? If you need source IP limiting you're better off using QoS anyway.

4.20.1. Description/Purpose

I was always thinking about how a kernel-based implementation of connection limitation per realserver would work and how it could be implemented, so while waiting in the hospital for an x-ray I had enough time to write up a little dirty hack as a proof of concept. It works as follows: I added new entries to the ip_vs_dest() struct, u_thresh and l_thresh, in ip_vs.*, and modified ipvsadm to add the two new options, x and y. A typical setup would be:

director:/etc/lvs# ipvsadm -A -t 192.168.100.100:80 -s wlc
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 192.168.100.3:80 -w 3 -x 1145 -y 923
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 192.168.100.3:80 -w 2 -x 982 -y 677
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 127.0.0.1:80 -w 1 -x 100 -y 50

So this means: as soon as (dest->inactconns + dest->activeconns) exceeds the x value, the weight of this server is set to zero. As soon as the connections drop below the lower threshold (y), the weight is set back to its initial value. What is it good for? Well, I don't know exactly - imagine for yourself - but first of all this is a proposal and I wanted to ask for a discussion about possibly including such a feature, or a derived one, in the main code (of course after fixing the race conditions and bugs and cleaning up the code); and second, from many talks with customers I found out that such a feature is needed, because commercial load balancers also have it, and managers always like to have a nice comparison of all features to decide which product they take. Doing all this in userspace is unfortunately just not atomic enough.

Anyway, if anybody else thinks that such a feature might be worth including, we can talk about it. If you look at the code, it wouldn't break anything and just adds two lousy CPU cycles for checking u_thresh. The feature can easily be disabled by setting u_thresh to zero or by not initializing it at all.

Well, I'm open to discussion and flames. I have it running in production :) but with a special SLA. I implemented a "last server of resort" which works like this: if all RS of a service are down (a healthcheck took them out or the threshold check set the weight to zero), my userspace tool automagically invokes the last server of resort, a tiny httpd with a static page saying that the service is currently unavailable. This is also useful if you want to do maintenance on the realservers.

I have already implemented a dozen such setups and they all work pretty well.

How will we defend against DDoS (distributed DoS)?

I'm using a packetfilter, and in special zones a firewall after the packetfilter ;) No, seriously, I personally don't think LVS should take too much of a part in securing the realservers. That's just another part of the firewall setup.

The problem is that LVS has a different view of the realserver load. The director sees one number of connections, the realserver sees another. Under attack we observe a big gap between the active/inactive counters and the threshold values in use. In this case we just exclude all realservers. This is the reason I prefer the more informed approach of using agents.

Using the number of active or inactive connections to assign a new weight is _very_ dangerous.

I know, but OTOH, if you set a threshold and my code takes the server out because of a well-crafted DDoS attack, I think that is still better than allowing the DDoS through and maybe killing the realserver's http listener.

No, we have two choices:

- use SYN cookies and much memory for open requests, accept more valid requests

- don't use SYN cookies, drop the requests exceeding the backlog length, drop many valid requests but the realservers are not overloaded

In both cases the listeners don't see requests until the handshake is completed (Linux).

BTW, what if you enable the defense strategies of the loadbalancer? I've done some tests and I was able to flood the realservers by sending forged SYNs and timeshifted SYN-ACKs with the expected seq-nr. It was impossible to work on the realservers unless of course I enabled the TCP_SYNCOOKIES.

Yes, nobody claims the defense strategies guard the real servers. This is not their goal. They keep the director with more free memory and nothing more :) Only drop_packet can control the request rate but only for the new requests.

I then enabled my patch and after the connections exceeded the threshold, the kernel took the server out temporarily by setting the weight to 0. In that way the server was usable and I could work on the server.

Yes, but the clients can't work; you exclude all servers in this case, because the LVS spreads the requests over all servers and the rain becomes a deluge :)

In theory, the number of connections is related to the load, but that is only true in an ideal world. The inactive counter can reach very high values when we are under attack. Even the WLC method loads the realservers proportionately, but they are never excluded from operation.

True, but as I already said, I think LVS shouldn't replace a firewall. I normally have a properly configured router, then a packetfilter, then a firewall or even another, stateful, packetfilter. See, the patch itself is not even mandatory. In a normal setup my code is not even touched (except the ``if'' :).

I have some thoughts about limiting the traffic per connection but this idea must be analyzed.

Hmm, I just want to limit the amount of concurrent connections per realserver and in the future maybe per service. This saved me quite some lines of code in my userspace healthchecking daemon.

Yes, you vote for moving some features from user to the kernel space. We must find the right balance: what can be done in LVS and what must be implemented in the user space tools.

The other alternatives are to use the Netfilter's "limit" target or QoS to limit the traffic to the realservers.

But then you have to add quite a lot of code. The limit target has no idea about the LVS tables. How should this work, e.g. if you would like to rate-limit the number of connections to a realserver?

Maybe we can limit the SYN rate. Of course, that doesn't cover all cases, so my thought was to limit the packet rate for all states or per connection; I'm not sure, this is an open topic. It is easy to open a connection through the director (especially in LVS-DR) and then flood that connection with packets. This is one of the cases where LVS can really guard the realservers from packet floods. If we combine this with the other kind of attacks, the distributed ones, we have better control. Of course, some QoS implementations may cover such problems, I'm not sure. And this can be a simple implementation; of course, nobody wants to reinvent the wheel :)

Let's analyze the problem. If we move new connections away from an "overloaded" realserver and redirect them to the other realservers, we will overload them too.

No, unless you use an old machine. This is maybe a requirement of an e-commerce application. They have some servers, and if the servers are overloaded (taken out by my userspace healthchecking daemon because the response time is too high or the application daemon is no longer listening on the port) they will be taken out. Now I have found that by setting thresholds I could reduce the downtime of a flooded server significantly. In case all servers were taken out or their weights were set to 0, the userspace application sets up a temporary new realserver (either a local route or another server) that has nothing else to do than push a static webpage saying that the service is currently unavailable due to high server load or a DDoS attack or whatever. Put this page behind a TUX 2.0 and try to overflow it. If you can, apply the zero-copy patches of DaveM. No way will you find a link fast enough (88MBit/s of requests!!) to saturate the server.

Yes, I know that this is a working solution. But see, you exclude all realservers :) You are giving up. My idea is to find a state where we can drop some of the requests and keep the realservers busy but responsive. This can be a difficult task, but not when we have help from our agents. We expect that many valid requests will be dropped, but if we keep the realservers in good health we can handle some valid requests, because nobody knows when the flood will stop. The link is busy but it carries valid requests, and the service does not see the invalid ones.

IMO, the problem is that there are more connection requests than the cluster can handle. Solutions that try to move the traffic between the realservers can only cause more problems. If at the same time we set the weights to 0, this leads to more delay in the processing. Maybe it is more useful to start by reducing the weights, but this again returns us to the theory of smart cluster software.

So we can't get out of this situation without dropping requests. There is more traffic than the cluster can serve.

The requests themselves are not what matters; we care how much load they introduce, and we report this load to the director. It can be expressed, for example, as one value (weight) for the real host, applied to all real services running on that host. We don't need to generate 10 weights for the 10 real services running on our real host. And we change the weight every 2 seconds, for example. We need two syscalls (lseek and read) to get most of the values from the /proc fs, though maybe from 2-3 files. This is on Linux, of course. I'm not sure how this behaves under attack. We will see :)

Obviously yes, but if you also include the practical problem of SLAs with customers and guaranteed downtime per month, I still have to say that for my deployment, in case of a DDoS, I do better with my patch plus the LVS defense strategies enabled than without.

If there is no cluster software to keep the realservers equally loaded, some of them can go offline too early.

The scheduler should keep them equally loaded IMO even in case of let's say 70% forged packets. Again, if you don't like to set a threshold, leave it. The patch is open enough. If you like to set it, set it, maybe set it very high. It's up to you.

The only problem we have with this scheme is the ipvsadm binary. It must be changed (the user structure in the kernel :)). The last change dates from 0.9.10 and that is a long time :) But you know what a change in the user structures means :)

The cluster software can take on the role of monitoring the load instead of relying on the connection counters. I agree, changing the weights and deciding how much traffic to drop can be expressed with a complex formula. But I can see it only as a complete solution: balance the load, drop the excess requests, and serve as many requests as possible. Even the drop_packet strategy can help here; we can explicitly enable it, specifying the proper drop rate. We don't need to use it only to defend the LVS box but also to drop the excess traffic. But someone has to control the drop rate :) If there is no excess traffic, what problems can we expect? Only bad load balancing :)

The easiest way to control the LVS is from user space and to leave in LVS only the basic needed support. This allows us to have more ways to control LVS.

4.21. Flushing connection table

Shinichi Kido shin (at) kidohome (dot) ddo (dot) jp

I want to reset all the connection table (the output list by ipvsadm -lc command) immediately without waiting for the expire time for all the connection.

Stefan Schlosser castla (at) grmmbl (dot) org 04 Jun 2004

you may want to try these patches:

http://grmmbl.org/code/ipvs-flushconn.diff
http://grmmbl.org/code/ipvsadm-flushconn.diff

and use ipvsadm -F

Horms horms (at) verge (dot) net (dot) au 04 Jun 2004

Another alternative, if you have LVS compiled as a module, is to reload it. This will clear everything.

ipvsadm -C
# remove the scheduler module(s) in use, then ip_vs itself
rmmod ip_vs_wlc
# ... any other ip_vs_* modules ...
rmmod ip_vs
modprobe ip_vs

Then again, Shinichi-san, why do you want to clear the connection table? It might be useful for testing. But I am not sure what it would be useful for in production.

Joe

In general it's not good for a server to accept a connection and then unilaterally break it. You should let the connections expire. If you don't want any new connections, just set weight=0.

4.22. Thundering herd problem, Slow start code for realserver(s) coming on line

Despite what you may have read in the mailing list and possibly in earlier versions of this HOWTO, there is no slow start for realservers coming on line (I thought it was in the code from the very early days). If you bring a new realserver on-line with *lc scheduling, the new machine, having less connections (i.e. none) will get all the new connections. This will likely stress the new realserver.

As Jan Klopper points out (11 Mar 2006), you don't get the thundering herd problem with round robin scheduling. In this case the number of connections will even out when the old connections expire (for http this may only be a few minutes). It would be simple enough to bring up a new realserver with all rules being rr, then after 5mins change over to lc (if you want lc).

Horms says (off-line Dec 2006) that it's simple enough to use the in-kernel timers to handle this problem; he just hasn't done it. Some early patches to handle the problem received zero response, so he dropped it.

Christopher Seawood cls (at) aureate (dot) com

LVS seems to work great until a server goes down (this is where mon comes in). Here are a couple of things to keep in mind. If you're using the Weighted Round-Robin scheduler, then LVS will still attempt to hit the server once it goes down. If you're using the Least Connections scheduler, then all new connections will be directed to the down server because it has 0 connections. You'd think using mon would fix these problems, but not in all cases.

Adding mon to the LC setup didn't help matters much. I took one of three servers out of the loop and waited for mon to drop the entry. That worked great. When I started the server back up, mon added the entry. During that time, the 2 running servers had gathered about 1000 connections apiece. When the third server came back up, it immediately received all of the new connections. It kept receiving all of the connections until it had an equal number of connections with the other servers (which by this time... a minute or so later... had fallen to ~700). By this time, the 3rd server had been restarted due to triggering a high-load sensor that also monitors the machine (a necessary evil, or so I'm told). At this point, I dropped back to using WRR as I could envision the cycle repeating itself indefinitely.

Horms has addressed this problem with a patch.

Horms horms (at) verge (dot) net (dot) au 23 Feb 2004

Here is a patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/thundering_herd.diff) that implements slow start for the WLC scheduler. This is designed to address the problem where a realserver is added to the pool and soon inundated with connections. This is sometimes referred to as the thundering herd problem and was recently the topic of a thread on this list "Easing a Server into Rotation". http://marc.theaimsgroup.com/?l=linux-virtual-server&m=107591805721441&w=2

The patch has two basic parts.

  • ip_vs_ctl.c:

    When the weight of a realserver is modified (or a realserver is added), set the IP_VS_DEST_F_WEIGHT_INC or IP_VS_DEST_F_WEIGHT_DEC flag as appropriate and put the size of the change in dest.slow_start_data.

    This information is intended to act as hints for scheduler modules to implement slow start. The scheduler modules may completely ignore this information without any side effects.

  • ip_vs_wlc.c:

    If IP_VS_DEST_F_WEIGHT_DEC is set then the flag is zeroed - slow start does not come into effect for weight decreases.

    If IP_VS_DEST_F_WEIGHT_INC is set then a handicap is calculated. The flag is then zeroed.

    The handicap is stored in dest.slow_start_data, along with a scaling factor to allow gradual decay which is stored in dest.slow_start_data2. The handicap effectively makes the realserver appear to have more connections than it does, thus decreasing the number of connections that the wlc scheduler will allocate to it. This handicap is decayed over time.

Limited debugging information is available by setting /proc/sys/net/ipv4/vs/debug_level to 1 (or greater).

This will show the size of the handicap when it is calculated and show a message when the handicap is fully decayed.
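
For example (these sysctls only exist if the kernel was built with CONFIG_IP_VS_DEBUG):

director:/etc/lvs# echo 1 > /proc/sys/net/ipv4/vs/debug_level
director:/etc/lvs# dmesg | tail      # the handicap messages appear in the kernel log
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/vs/debug_level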

4.23. Handling kernel version dependant files e.g. System.map and ipvsadm

If you boot with several different versions of the kernel (particularly switching between 2.2.x and 2.4.x), and you have executables or directories with contents that need to match the kernel version (e.g. System.map, ipvsadm, /usr/src/linux, /usr/src/ipvs), then you need some mechanism for making sure that the appropriate executable or directory is brought into scope.

Note: klogd is supposed to read files like /boot/System.map-<kernel_version>, allowing you to have several kernels in / (or /boot). However, this doesn't solve the problem for general executables like ipvsadm.

If you have the wrong version of System.map you'll get errors when running some commands (e.g. `ps` or `top`)

Warning: /usr/src/linux/System.map has an incorrect kernel version.

If your ip_vs and ipvsadm versions don't match, then ipvsadm will give invalid numbers for IPs and ports, or it will tell you that you don't have ip_vs installed.

As with most problems in computing, this can be solved with an extra layer of indirection. I name my kernel versions in /usr/src like

director:/etc/lvs# ls -alF /usr/src | grep 2.19
lrwxrwxrwx   1 root     root           25 Sep 18  2001 linux-1.0.7-2.2.19 -> linux-1.0.7-2.2.19-module/
drwxr-xr-x  15 root     root         4096 Jun 21  2001 linux-1.0.7-2.2.19-kernel/
drwxr-xr-x  15 root     root         4096 Aug  8  2001 linux-1.0.7-2.2.19-module/
lrwxrwxrwx   1 root     root           18 Oct 21  2001 linux -> linux-1.0.7-2.2.19

Here I have two versions of ip_vs-1.0.7 for the 2.2.19 kernel, one built as a kernel module and the other built into the kernel (you will probably only have one version of ip_vs for any kernel). I select the one I want to use by making a link from linux-1.0.7-2.2.19 (I do this by hand). If you do this for each kernel version, then the /usr/src directory will have several links

director:/etc/lvs# ls -alFrt /usr/src | grep lrw
lrwxrwxrwx   1 root     root           25 Sep 18  2001 linux-1.0.7-2.2.19 -> linux-1.0.7-2.2.19-module/
lrwxrwxrwx   1 root     root           38 Sep 18  2001 linux-0.9.2-2.4.7 -> linux-0.9.2-2.4.7-module-hidden-shared/
lrwxrwxrwx   1 root     root           39 Sep 18  2001 linux-0.9.3-2.4.9 -> linux-0.9.3-2.4.9-module-forward-shared/
lrwxrwxrwx   1 root     root           17 Sep 19  2001 linux-2.4.9 -> linux-0.9.3-2.4.9/
lrwxrwxrwx   1 root     root           40 Oct 11  2001 linux-0.9.4-2.4.12 -> linux-0.9.4-2.4.12-module-forward-shared/
lrwxrwxrwx   1 root     root           18 Oct 21  2001 linux -> linux-0.9.4-2.4.12/

The last entry, the link from linux, is made by rc.system_map (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/rc.system_map). At boot time rc.system_map checks the kernel version (here booted with 2.4.12) and links linux to linux-0.9.4-2.4.12. If you label ipvsadm, /usr/src/ip_vs and System.map in a similar way, then rc.system_map will link the correct versions for you.
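
A minimal sketch of the relinking idea (a stand-in for rc.system_map, not the script itself; it assumes you have named your version-specific copies with a -<kernel version> suffix, e.g. /sbin/ipvsadm-2.4.12):

#!/bin/sh
# point the kernel-version-dependent files at the ones matching the running kernel
KVER=`uname -r`
ln -sfn /boot/System.map-$KVER /boot/System.map
ln -sfn /sbin/ipvsadm-$KVER    /sbin/ipvsadm
ln -sfn /usr/src/linux-$KVER   /usr/src/linux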

ipvsadm versions match ipvs versions and not kernel versions, but kernel versions are close enough that this scheme works.

4.24. Limiting number of clients connecting to LVS

This comes up occasionally, and Ratz has developed scheduler code that will handle overloads for realservers (see writing a scheduler and discussion of schedulers with memthresh in Section 34.13).

The idea is that after a certain number of connections, the client is sent to an overload machine with a page saying "hold on, we'll be back in a sec". This is useful if you have an SLA saying that no client connect request will be refused, but you have to handle 100,000 people trying to buy Rolling Stones concert tickets from your site, who all connect within 30secs of the opening time. Ratz' code may even be in the LVS code sometime. Until it is, ask Ratz about it.

NewsFlash: Ratz' code is out

Roberto Nibali ratz (at) tac (dot) ch 23 Oct 2003

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=105914647925320&w=2

Horms

The LVS 1.1.x code for the 2.6 kernel allows you to set connection limits using ipvsadm. This is documented in the ipvsadm man page that comes with the 1.1.x code. The limits are currently not available in the 1.0.x code for the 2.4 kernel. However I suspect that a backport would not be that difficult.

Diego Woitasen, Oct 22, 2003

But what sets the IP_VS_DEST_F_OVERLOAD flag in struct ip_vs_dst?

Horms 23 Oct 2003

This relates to LVS's internal handling of connection thresholds for realservers which is available in the 1.1.x tree for the 2.6.x kernel (also see connection threshold).

IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS code when the high and low thresholds are passed for a realserver. If a scheduler honours this flag then it should not allocate new connections to realservers with this flag set. As far as I can see all the supplied schedulers honour this flag. But if a scheduler did not then it would just be ignored. That is real servers would have new connections allocated regardless of if IP_VS_DEST_F_OVERLOAD is set or not. It would be as if no connection thresholds had been set.

Note that if persistence is in use, then subsequent connections from a given client to the same realserver within the persistence timeout are not scheduled as such. Thus additional connections of this nature can be allocated to a realserver even if it has been marked IP_VS_DEST_F_OVERLOAD. This, IMHO, is desirable behaviour.

OK, I saw that IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS, but I can't find where the thresholds are set. As far as I can see, these thresholds are always set to zero in userspace. Is this right?

No. In the kernel the thresholds are set by code in ip_vs_ctl.c (I guess, as that is where all other configuration from user-space is handled). If you get the version of ipvsadm that comes with the LVS source that supports IP_VS_DEST_F_OVERLOAD, then it has command line options to set the thresholds.

Steve Hill

A number of the schedulers seem to use an is_overloaded() function that limits the number of connections to twice the server's weight.

Ratz 03 Aug 2004

For the sake of discussion I'll be referring to the 2.4.x kernel. We would be talking about this piece of jewelry:

static inline int is_overloaded(struct ip_vs_dest *dest)
{
        if (atomic_read(&dest->activeconns) > atomic_read(&dest->weight)*2) {
                return 1;
        }
        return 0;
}

I'm a bit unsure about the semantics of this is_overloaded() regarding its mathematical background. Wensong, what was the reason to arbitrarily use twice the weight as the overload criterion?

  • dest->activeconns has such a short life span that it hardly represents or reflects the current RS load in any way I could imagine.
  • 2.4.x and 2.6.x differ in what they consider an overloaded destination. IP_VS_DEST_F_OVERLOAD is set when ip_vs_dest_totalconns exceeds the upper threshold limit, and totalconns means currently active + inactive connections, which is also kind of unfortunate. And yes, there is some more code I haven't mentioned yet.
I'm using the dh scheduler to balance across 3 machines - once the connections exceed twice the weight, it refuses new connections from IP addresses that aren't currently persistent.

Which kernel? And 2.4.x and 2.6.x contain similar (although not sync'd ... sigh) code regarding this feature:

2.4.24 sorry
ratz@webphish:/usr/src/linux-2.6.8-rc2> grep -r is_overloaded *
net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_sh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblcr.c:            if (!dest || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_dh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblc.c:                 || is_overloaded(dest, svc)) {


ratz@webphish:/usr/src/linux-2.4.27-rc4> grep -r is_overloaded *
net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_sh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblcr.c:            if (!dest || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblc.c:                 || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_dh.c:           || is_overloaded(dest)) {

Asymmetric coding :-)

This in itself isn't really a problem, but I can't find this behaviour actually documented anywhere - all the documentation refers to the weights as being "relative to the other hosts" which means there should be no difference between me setting the weights on all hosts to 5 or setting them all to 500. Multiplying the weight by 2 seems very arbitrary, although in itself there is no real problem (as far as I can tell) with limiting the connections like that so long as it's documented.

This is correct. I'm a bit unsure as to what your exact problem is, but a kernel version would already help, although I believe you're using a 2.4.x kernel. The is_overloaded() function was originally designed to be used only by the threshold limitation feature, which is only present as a shaky backport from 2.6.x. I don't quite understand the is_overloaded() function in the ip_vs_dh scheduler; OTOH, I really haven't been using it so far.

I have 3 squid servers and had set them all to a weight of 5 (since they are all identical machines and the docs said that the weights are relative to each other). What I found was that once there were >10 concurrent connections, any new host that tried to make a connection (i.e. any host that isn't "persisting") would have its connection rejected outright. After some reading through the code I discovered the is_overloaded condition, which was failing in the case of >10 connections, so I have increased all the weights to 5000 (to all intents and purposes unlimited), which has solved the problem.

Oddly there is another LVS server with a similar configuration which isn't showing this behaviour, but I cannot find any significant difference in the configuration to account for it.

The primary use for LVS in this case is failover in the event of one of the servers failing, although load balancing is a good side effect. I'm using ldirectord to monitor the realservers and adjust the LVS settings in response to an outage. At the moment, for some reason, it doesn't seem to be doing any load balancing (something I am looking into) - it is just using a single server, although if that server is taken down it does fail over correctly to one of the other servers.

Sorry, I've just realised I've been exceptionally stupid about this bit - I should be using the SH scheduler instead of DH.

One problem is that activeconns doesn't really define the connections, and the implementations for 2.4 and 2.6 differ significantly. Also, is_overloaded should be reserved for another purpose, the threshold limitation feature.

Joe: now back into prehistory -

Milind Patil mpatil (at) iqs (dot) co (dot) in 24 Sep 2001

I want to limit the number of users accessing the LVS services at any given time. How can I do it?

Julian

  • for non-NAT cluster (maybe stupid but interesting)

    Maybe an array of policers, for example 1024 policers, or a user-defined value that is a power of 2. Each client hits one of the policers based on its IP/port. This is mostly a job for QoS ingress, even for the distributed attack, but maybe something can be done for LVS? Maybe we should develop a QoS ingress module? The key could be derived from CIP and CPORT, perhaps something similar to SFQ but without queueing. It could be implemented as a patch to the normal policer but with one extra argument: the real number of policers. This extended policer could then look into the TCP/UDP packets to redirect each packet to one of the real policers.

  • for NAT only

    Run SFQ qdisc on your external interface(s). It seems this is not a solution for DR method. Of course, one can run SFQ on its uplink router.

  • Linux 2.4 only

    iptables has support for limiting traffic, but I'm not sure whether it is useful for your requirements. I assume you want to set a limit on each one of these 1024 aggregated flows.

Wenzhuo Zhang

Is anybody actually using the ingress policer for anti-DoS? I tried it several days ago using the script in the iproute2 package: iproute2/examples/SYN-DoS.rate.limit. I've tested it against different 2.2 kernels (2.2.19-7.0.8, a redhat kernel; 2.2.19; 2.2.20preX; with all QoS-related functions either compiled into the kernel or as modules) and different versions of iproute2. In all cases, tc fails to install the ingress qdisc policer:

root@panda:~# tc qdisc add dev eth0 handle ffff: ingress
RTNETLINK answers: No such file or directory
root@panda:~# /tmp/tc qdisc add dev eth0 handle ffff: ingress
RTNETLINK answers: No such file or directory

Julian

For 2.2, you need the ds-8 package, at Package for Differentiated Services on Linux. Compile tc by setting TC_CONFIG_DIFFSERV=y in Config. The right command is:

	tc qdisc add dev eth0 ingress

Ratz

The 2.2.x version is not supported anymore. The advanced routing documentation says to only use 2.4. For 2.4, ingress is in the kernel but it is still unusable for more than one device (look in linux-netdev for reference).

James Bourne james (at) sublime (dot) com (dot) au 25 Jul 2003

I was after some samples or practical suggestion in regard to Rate Limiting and Dynamically Denying Services to abusers on a per VIP basis. I have had a look at the sections in the HOWTO on "Limiting number of clients connecting to LVS" and

http://www.linuxvirtualserver.org/docs/defense.html.

Ratz

This is a defense mechanism which is always unfair. You don't want that from what I can read.

Specifically, we are running web based competition entries (e.g. type in your three bar codes) out of our LVS cluster and want to limit those who might construct "bots" to auto-enter. The application is structured so that you have to click through multiple pages and enter a value that is represented in a dynamically generated PNG.

I would like to:

  1. rate limit on each VIP (we can potentially do this at the firewall)
  2. ban a source ip if it goes beyond a certain number "requests-per-time-interval"
  3. dynamically take a vip offline if it goes beyond a certain number of "requests-per-time-interval"
  4. toss all "illegal requests" - eg. codered, nimda etc.

Perhaps a combination of iptables, QoS, SNORT etc. would do the job??

Roberto Nibali 25 Jul 2003

1. rate limit on each VIP (we can potentially do this at the firewall)

Hmm, you might need to use QoS or probably better would be to write a scheduler which uses the rate estimator in IPVS.

2. ban a source ip if it goes beyond a certain number "requests-per-time-interval"

A scheduler could do that for you, although I do not think this is a good idea.

3. dynamically take a vip offline if it goes beyond a certain number of "requests-per-time-interval"

Quiescing the service should be enough; you don't want to put a penalty on other people, you simply want to stay within your maximum requests-per-time-interval rate.

4. toss all "illegal requests" - eg. codered, nimda etc.

This has nothing to do with LVS ;).

QoS is certainly suitable for 1). For 2) and 3) I think you would need to write a scheduler.

Max Sturtz

I know that iptables can block connections if they exceed a specified number of connections per second (from anywhere). The question is, is anybody doing this on a per-client basis, so that if any particular IP is sending us more than a specified number of connections per second, they get blocked but all other clients can keep going?

ratz 01 Dec 2003

Using iptables is a very bad approach to handling such problems. You have no information on whether the IP which is making those request attempts at a high rate is malicious or friendly. If it's malicious (IP spoofing) you may end up blocking an existing friendly IP.

Several times per week we experience a traffic storm. LVS handles it just fine, but the web servers get loaded up really badly, and pretty soon our site is all but unusable. We're looking for tools we could use to analyze this (we use Webalizer for our web logs, but it can't tell us who's talking to us in any given time frame...).

Could you describe your overloaded system with some metrics or could you determine the upper connection threshold where your RS are still working fine? You could dump the LVS masquerading table from time to time and grep for connection templates.

I see 4 approaches (in no particular order) to this problem:

  • LVS tcp defense mechanism, best described in http://www.linux-vs.org/docs/defense.html
  • L7 load balancer which inspects HTTP content, best described in http://www.linux-vs.org/software/ktcpvs/ktcpvs.html or in the package Readme.
  • Use of the per RS threshold limitation patch I wrote (see limiting client).
  • Use the feedbackd (http://www.redfishsoftware.com.au/projects/feedbackd/) architecture to signal the director of network anomalies based on certain metrics gained on the RSs.

4.25. Who is connecting to my LVS?

On the realservers you can look with `netstat -an`. With LVS, the director also has information.

malalon (at) poczta (dot) onet (dot) pl 18 Oct 2001

How do I know who is connecting to my LVS?

Julian

  • Linux 2.2: netstat -Mn (or /proc/net/ip_masquerade)

  • Linux 2.4: ipvsadm -Lcn (or /proc/net/ip_vs_conn)
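
On a 2.4 or later director you can get a quick tally of which client IPs hold the most connection entries (the awk field assumes the ipvsadm -Lcn column layout, which may shift between versions):

director:/etc/lvs# ipvsadm -Lcn | awk 'NR>2 {split($4, a, ":"); print a[1]}' | sort | uniq -c | sort -rn | head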

4.26. experimental scheduling code

This section is a bit out of date now. See the new schedulers by Thomas Prouell for web caches and by Henrik Norstrom for firewalls in the ipvsadm and schedulers section. Ratz ratz (at) tac (dot) ch has produced a scheduler which will keep activity on a particular realserver below a fixed level.

For the next piece of code, write to Ty or grab the code off the list server.

Ty Beede tybeede (at) metrolist (dot) net 23 Feb 2000

This is a hack to the ip_vs_wlc.c scheduling algorithm. It is currently implemented in a quick, ad hoc fashion. Its purpose is to support limiting the total number of connections to a realserver. Currently it is implemented using the weight value as the upper limit on the number of activeconns (connections in an established TCP state). This is a very simple implementation and only took a few minutes after reading through the source. I would like, however, to develop it further.

Due to its simple nature it will not function in several types of environments: those based on connectionless protocols (UDP; this uses the inactconns variable to keep track of things - simply change the activeconns variable in the weight check to inactconns for UDP), and it may cause complications when persistence is used. The current algorithm simply checks that weight > activeconns before including a server in the standard wlc scheduling. This works for my environment, but could be changed to perhaps (weight * 50) > (activeconns * 50) + inactconns to include the inactconns but make the activeconns more important in the decision.

Currently the greatest weight value a user may specify is approximately 65000, independent of this modification. As long as the user keeps the weight values correct for the total number of connections and, most importantly, in proportion to one another, things should function as expected.

In the event that the cluster is full - all realservers have maxed out - some overflow control might be necessary, or the client's end will hang. I haven't tested this idea, but it could simply be implemented by specifying the overflow server last, after the realservers, using the ipvsadm tool. This would work because as each realserver is added using ipvsadm it is put on a list, with the last one added being last on the list. The scheduling algorithm traverses this list linearly from start to finish, and if it finds that all servers are maxed out, then the last one will be the overflow and will be the only one to send traffic to.

Anyway, this is just a little hack; read the code and it should make sense. It has been included as an attachment. If you would like to test this, simply replace the old ip_vs_wlc.c scheduling file in /usr/src/linux/net/ipv4 with this one. Compile it in and set the weight on the realservers to the maximum number of connections in an established TCP state, or modify the source to your liking.

From: Ty Beede tybeede (at) metrolist (dot) net 28 Feb 2000

I wrote a little patch and posted it a few days ago... I indicated that overflow might be accomplished by adding the overflow server to the LVS last. This statement is completely, off-the-wall wrong. I'm not really sure why I thought that would work, but it won't: first of all, the linked list adds each new instance of a realserver to the start of the realserver list, not the end like I thought. Also, it would be impossible to distinguish the overflow server from the realservers in the case where not all the realservers were busy. I don't know where I got that idea from, but I'm going to blame it on my "bushy eyed youth". In response to the need for overflow support, I'm thinking about implementing "priority groups" in the lvs code. This would logically group the realservers into different groups, where a higher-priority group would fill up before those with a lower priority. If anybody could comment on this it would be nice to hear what the rest of you think about overflow code.

4.26.1. Theoretical issues in developing better scheduling algorithms

Julian

It seems to me it would be useful in some cases to use the total number of connections to a realserver in the load balancing calculation, in the case where the realserver participates in servicing a number of different VIPs.

Wensong

Yeah, it is true. Sometimes we need to trade off between simplicity/performance and functionality. Let me think more about this, and probably about maximum connection scheduling together with it too. For a rather big server cluster, there may be a dedicated load balancer for web traffic and another load balancer for mail traffic; then the two load balancers may need to exchange status periodically, which is rather complicated.

Yes, if a realserver is used by two or more directors, the "lc" method is useless.

Actually, I just thought that dynamic weight adaptation according to periodic load feedback from each server might solve all the above problems.

Joe - this is part of a greater problem with LVS: we don't have good monitoring tools and we don't have a lot of information on the varying loads that realservers have, in order to develop strategies for informed load regulation. See load and failure monitoring.

Julian

From my experience with realservers for web, the only useful parameters for the realserver load are:

  • cpu idle time

If you use realservers with equal CPUs (MHz), the cpu idle time in percent can be used. Otherwise the MHz must be included in the expression for the weight.

  • free ram

Depending on the web load, the right expression must be used, combining the cpu idle time and the free ram.

  • free swap

It is very bad if the web server is swapping.

The easiest parameter to get, the load average, is always <5, so it can't be used for weights in this case. Maybe for SMTP? The sendmail guys use only the load average when evaluating the load :)

So, the monitoring software must send these parameters to all directors. But even now, each of the directors uses these weights to create connections proportionally. So it is useful for these load parameters to be updated at short intervals, and they must be averaged over each period. It is very bad to use the current (instantaneous) value of a parameter to evaluate the weight in the director. For example, it is very useful to use something like "average cpu idle time over the last 10 seconds" and to broadcast this value to the director every 10 seconds. If the cpu idle time is 0, the free ram must be used. It depends on which resource hits zero first: the cpu idle time or the free ram. The weight must be changed only slightly :)
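
Purely as an illustration of the kind of expression Julian is describing (averaged inputs, cpu idle time as the main term, free ram as the fallback once idle hits zero, backing off hard when the server starts swapping), a monitoring agent might compute the weight it reports to the director roughly as below. All names, constants and thresholds here are invented for the example.

/*
 * Sketch only: turn averaged realserver figures into an ipvsadm weight.
 * The inputs are assumed to be averages over the reporting interval
 * (e.g. the last 10 seconds), not instantaneous readings.
 */
#include <stdio.h>

int compute_weight(int avg_idle_pct, int cpu_mhz, int free_ram_mb,
                   int free_swap_mb)
{
	int weight;

	if (free_swap_mb < 16)		/* swapping web server: back off hard */
		return 1;

	if (avg_idle_pct > 0)		/* CPU is the scarce resource */
		weight = (avg_idle_pct * cpu_mhz) / 1000;
	else				/* idle is 0: fall back to free RAM */
		weight = free_ram_mb / 64;

	if (weight < 1)
		weight = 1;		/* weight 0 would quiesce the server */
	return weight;
}

int main(void)
{
	/* a 2000MHz realserver, 40% idle, plenty of RAM and swap */
	printf("weight %d\n", compute_weight(40, 2000, 512, 1024));
	return 0;
}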

The "*lc" algorithms help for simple setups, eg. with one director and for some of the services, eg http, https. It is difficult even for ftp and smtp to use these schedulers. When the requests are very different, the only valid information is the load in the realserver.

Another useful parameter is the network traffic (for ftp). But again, all these parameters must be used by the director to build the weight using a complex expression.

I think a complex weight for the realserver based on connection number (lc) is not useful, due to the different load from each of the services. Maybe for the "wlc" scheduling method? I know that users want LVS to do everything, but load balancing is a very complex job. If you handle web traffic you can be happy with any of the current scheduling methods. I haven't tried to balance ftp traffic, but I don't expect much help from the *lc methods. The realserver can be loaded, for example, if you build a new Linux kernel while the server is in the cluster :) That's a very easy way to start swapping if your load is near 100%.

4.27. Ratz's primer on writing your own scheduler

Roberto Nibali ratz (at) tac (dot) ch 10 Jul 2003

The whole setup roughly works as follows:

struct ip_vs_scheduler {
	struct list_head        n_list;   /* d-linked list head */
	char			*name;    /* scheduler name */
	atomic_t                refcnt;   /* reference counter */
 	struct module		*module;  /* THIS_MODULE/NULL */

	/* scheduler initializing service */
	int (*init_service)(struct ip_vs_service *svc);
	/* scheduling service finish */
	int (*done_service)(struct ip_vs_service *svc);
	/* scheduler updating service */
	int (*update_service)(struct ip_vs_service *svc);

	/* selecting a server from the given service */
	struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc,
				       struct iphdr *iph);
};

Each scheduler {rr,lc,...} has to register itself by initialising an ip_vs_scheduler struct object. As you can see, it contains, among other data types, 4 function pointers:

int (*init_service)(struct ip_vs_service *svc)
int (*done_service)(struct ip_vs_service *svc)
int (*update_service)(struct ip_vs_service *svc)
struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc,struct iphdr *iph)

Each scheduler will need to provide a callback function for each of those prototypes with its own specific implementation.

Let's have a look at ip_vs_wrr.c:

We start with the __init function, which is kernel-specific. It defines ip_vs_wrr_init(), which in turn calls the required register_ip_vs_scheduler(&ip_vs_wrr_scheduler). You can see the ip_vs_wrr_scheduler structure definition just above the __init function. There you will note the following:

static struct ip_vs_scheduler ip_vs_wrr_scheduler = {
         {0},                    /* n_list */
         "wrr",                  /* name */
         ATOMIC_INIT(0),         /* refcnt */
         THIS_MODULE,            /* this module */
         ip_vs_wrr_init_svc,     /* service initializer */
         ip_vs_wrr_done_svc,     /* service done */
         ip_vs_wrr_update_svc,   /* service updater */
         ip_vs_wrr_schedule,     /* select a server from the destination list */
};

This is exactly the scheduler-specific instantiation of the struct ip_vs_scheduler prototype defined in ip_vs.h. Reading this you can see that the last four "names" map to the functions to be called accordingly.

So in case of the wrr scheduler, what does the init_service (mapped to the ip_vs_wrr_init_svc function) do?

It generates a mark object (used for list chain traversal and mark point) which gets filled up with initial values, such as the maximum weight and the gcd weight. This is a very intelligent thing to do, because if you do not do this, you will need to compute those values every time the scheduler needs to schedule a new incoming request.
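
In outline, the precomputation amounts to one walk over the weights, caching their maximum and gcd. The sketch below works on a plain array of weights rather than the kernel's destination list (ip_vs_wrr.c itself of course walks the struct ip_vs_dest entries), so the names are illustrative only:

/* greatest common divisor of two weights */
int gcd(int a, int b)
{
	while (b) {
		int t = a % b;
		a = b;
		b = t;
	}
	return a;
}

/* the cached values the wrr "mark" carries around */
struct wrr_mark {
	int max_weight;
	int gcd_weight;
};

/* one pass over the weights: cache max and gcd so that schedule()
 * never has to recompute them per incoming connection */
void precompute_mark(struct wrr_mark *m, const int *w, int n)
{
	int i;

	m->max_weight = 0;
	m->gcd_weight = 0;
	for (i = 0; i < n; i++) {
		if (w[i] <= 0)
			continue;	/* quiesced servers don't count */
		if (w[i] > m->max_weight)
			m->max_weight = w[i];
		m->gcd_weight = m->gcd_weight ? gcd(m->gcd_weight, w[i])
					      : w[i];
	}
}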

This caching also requires a second callback. Why? Imagine someone decides to update the weights of one or more servers from user space. This would mean that the initially computed values are no longer valid.

What can be done about it? We could compute those values every time the scheduler needs to schedule a destination, but that's exactly what we don't want. So in comes the update_service prototype (mapped to the ip_vs_wrr_update_svc function).

As you can easily see the ip_vs_wrr_update_svc function will do part of what we did for the init_service: it will compute the new max weight and the new gcd weight, so the world is saved again. The update_service callback will be called upon a user space ioctl call (you can read about this in the previous chapter of this marvellous developer guide :)).

The ip_vs_wrr_schedule function provides us with the core functionality of finding an appropriate destination (realserver) when a new incoming connection is hitting our cluster. Here you could write your own algorithm. You only need to either return NULL (if no realserver can be found) or a destination which is of type: struct ip_vs_dest.

The last function callback is the ip_vs_wrr_done_svc function which kfree()'s the initially kmalloc()'d mark variable.

This short tour-de-scheduler should give you enough information to write your own scheduler, at least in theory :).
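
Putting the pieces together, a do-nothing scheduler (it just returns the first realserver in the list) might look roughly like the sketch below. Everything named ip_vs_foo_* is invented; the struct layout and the schedule() prototype follow the version shown above, and newer 2.6 kernels pass a struct sk_buff to schedule() instead of the iphdr, so compare against an in-tree scheduler such as ip_vs_rr.c for your kernel (and check where your tree keeps ip_vs.h) before building on any of this.

/*
 * ip_vs_foo.c - sketch of a trivial scheduler: always pick the first
 * realserver on the service's destination list.
 */
#include <linux/module.h>
#include <linux/kernel.h>
#include <linux/list.h>
#include <linux/ip.h>
#include <net/ip_vs.h>

static int ip_vs_foo_init_svc(struct ip_vs_service *svc)
{
	return 0;			/* nothing to precompute or allocate */
}

static int ip_vs_foo_done_svc(struct ip_vs_service *svc)
{
	return 0;			/* nothing to free */
}

static int ip_vs_foo_update_svc(struct ip_vs_service *svc)
{
	return 0;			/* weights changed: nothing cached here */
}

static struct ip_vs_dest *
ip_vs_foo_schedule(struct ip_vs_service *svc, struct iphdr *iph)
{
	/* a real scheduler would also skip quiesced (weight 0) or
	 * overloaded destinations here */
	if (list_empty(&svc->destinations))
		return NULL;		/* no realserver available */

	return list_entry(svc->destinations.next, struct ip_vs_dest, n_list);
}

static struct ip_vs_scheduler ip_vs_foo_scheduler = {
	{0},				/* n_list */
	"foo",				/* name */
	ATOMIC_INIT(0),			/* refcnt */
	THIS_MODULE,			/* this module */
	ip_vs_foo_init_svc,		/* service initializer */
	ip_vs_foo_done_svc,		/* service done */
	ip_vs_foo_update_svc,		/* service updater */
	ip_vs_foo_schedule,		/* select a server */
};

static int __init ip_vs_foo_init(void)
{
	INIT_LIST_HEAD(&ip_vs_foo_scheduler.n_list);
	return register_ip_vs_scheduler(&ip_vs_foo_scheduler);
}

static void __exit ip_vs_foo_cleanup(void)
{
	unregister_ip_vs_scheduler(&ip_vs_foo_scheduler);
}

module_init(ip_vs_foo_init);
module_exit(ip_vs_foo_cleanup);
MODULE_LICENSE("GPL");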

unknown

I'd like to write a user defined scheduler to guide the load dispatching

Ratz 12 Aug 2004

Check out feedbackd and see if it is missing something you need. I know that this is not what you wanted to hear, but providing a generic API for user space daemons to interact directly with a generic scheduler is definitely out of scope. One problem is that the process of balancing incoming network load is not an atomic operation. It can take minutes, hours, days, weeks until you get an equalised load on your servers. Having a user space daemon doing premature scheduler updates at short intervals only asks for trouble regarding peak load bouncing.

4.28. changing ip_vs behaviour with sysctl flags in /proc

You can change the behaviour of ip_vs by pushing bits in the /proc filesystem. This gives finer control of ip_vs than is available with ipvsadm. For ordinary use, you don't need to worry about the sysctl, since sensible defaults have been installed.

Here's a list of the current sysctls at http://www.linuxvirtualserver.org/docs/sysctl.html . Note that older kernels will not have all of these sysctls (test for the existence of the appropriate file in /proc first). These sysctls are mainly used for bringing down persistent services.
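
For example, a helper that wants to flip one of these flags can test for the file before writing to it. A minimal sketch (the sysctl name and value are just an example):

/*
 * Sketch: set an ip_vs sysctl via /proc, but only if this kernel has it.
 * Roughly equivalent to: test -f $FILE && echo 1 > $FILE
 */
#include <stdio.h>
#include <unistd.h>

int set_vs_sysctl(const char *name, const char *value)
{
	char path[256];
	FILE *f;

	snprintf(path, sizeof(path), "/proc/sys/net/ipv4/vs/%s", name);
	if (access(path, F_OK) != 0)
		return -1;	/* older kernel: this sysctl doesn't exist */

	f = fopen(path, "w");
	if (!f)
		return -1;	/* not root, or /proc not mounted */
	fprintf(f, "%s\n", value);
	fclose(f);
	return 0;
}

int main(void)
{
	return set_vs_sysctl("expire_quiescent_template", "1") ? 1 : 0;
}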

Some, but not all, of the sysctls are documented in ipvsadm(8).

(Thanks to Kit Gerrits Dec 2008). There's info on ip_vs() sysctls at ipvs-sysctl.txt (http://www.mjmwired.net/kernel/Documentation/networking/ipvs-sysctl.txt).

Horms horms (at ) verge (dot) net (dot) au 11 Dec 2003

I am still strongly of the opinion that the sysctl variables should be documented in the ipvsadm man page, as they are strongly tied to its behaviour. At the moment we are in a situation where some are documented in ipvsadm(8), while all are documented in sysctl.html, yet there is no reference to sysctl.html in ipvsadm(8). My preference is to merge all the information in sysctl.html into ipvsadm(8), or perhaps a separate man page. If this is not acceptable then I would advocate removing all of the sysctl information from ipvsadm(8) and replacing it with a reference to sysctl.html. Though to be honest, why half the information on configuring LVS should be in ipvsadm(8) and the other half in sysctl.html is beyond me.

4.29. Counters in ipvsadm

Rutger van Oosten r (dot) v (dot) oosten (at) benq-eu (dot) com 09 Oct 2003

When I run

ipvsadm -l --stats

it shows connections, packets and bytes in and out for the virtual services and for the realservers. One would expect that the traffic for the service is the sum of the traffic to the servers - but it is not; the numbers don't add up at all, whereas in

ipvsadm -l --rate

they do (approximately, not exactly for the bytes per second ones). For example (LVS-NAT, two realservers, one http virtual service):

# ipvsadm --version
ipvsadm v1.21 2002/11/12 (compiled with popt and IPVS v1.0.10)

# ipvsadm -l --stats
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port               Conns   InPkts  OutPkts  InBytes OutBytes
  -> RemoteAddress:Port
TCP  vip:http                      4239091 31977546 29470876    3692M 26647M
  -> www02:http                    3911835 29405279 26900679    3407M 24292M
  -> www01:http                    3395953 25407180 23257431    2931M 20957M

# ipvsadm -l --rate
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port                 CPS    InPPS   OutPPS    InBPS OutBPS
  -> RemoteAddress:Port
TCP  vip:http                           45      348      314    41739 285599
  -> www02:http                         35      252      216    30416 197101
  -> www01:http                         10       96       98    11323 88497

Is this a bug, or am I just missing something?

Wensong 12 Oct 2003

It's quite possible that the conns/bytes/packets statistics of a virtual service are not the sum of the conns/bytes/packets counters of its realservers, because some realservers may have been removed permanently. The connection rate of a virtual service is the sum of the connection rates of its realservers, because it is an instantaneous metric.

In the output of your ipvsadm -l --stats, the counters of the virtual service are less than the sum of the counters of its realservers. I guess that your virtual service must have been removed after it ran for a while, and then created again later. In the current implementation, if realservers are deleted, they are not removed permanently but are put in the trash, because established connections still refer to them; a server can be looked up in the trash when it is added back to a service. When a virtual service is created, it always has its counters set to zero, but the realservers can be picked up from the trash, and they keep their past counters. We probably need to zero the counters of realservers if the service is a new one. Anyway, you can cat /proc/net/ip_vs_stats; those counters over all IPVS services are larger than or equal to the sum over the realservers.

You are right - after the weekly reboot last night the numbers do add up. The realservers have been removed and added in the meantime, but the virtual services have stayed in place and the numbers are still correct. So that must be it. Mystery solved, thank you :-)

4.30. Exact Counters

Guy Waugh gwaugh (at) scu (dot) edu (dot) au 2005/20/05

The ipvsadm_exact.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvsadm_exact.patch) contains a diff for the addition of an '-x' or '--exact' command-line switch to ipvsadm (version 1.19.2.2). The idea behind the new option is to allow users to specify that large numbers printed with the '--stats' or '--rate' options are in machine-readable bytes, rather than in 'human-readable' form (e.g. kilobytes, megabytes). I needed this to get stats from LVS readable by nagios (http://www.nagios.org/).

4.31. Scheduling TCP/UDP/SCTP/TCP splicing

Note

LVS does not schedule SCTP (although people ask about it occasionally).

SCTP is a connection oriented protocol (like TCP, but not like UDP). Delivery is reliable (like TCP), but packets are not necessarily delivered in order. The Linux-2.6 kernel supports SCTP (see the Linux Kernel sctp project http://lksctp.sourceforge.net/). For an article on SCTP see Better networking with SCTP (http://www-128.ibm.com/developernetworks/linux/library/l-sctp/?ca=dgr-lnwx01SCTP). One of the features of SCTP is multistreaming: control and data streams are separate streams within an association. To do the same thing with TCP, you need separate ports (e.g. ftp uses 20 for data, 21 for commands) or you put both into one connection (e.g. http). If you use one port then a command will be blocked till queued data is serviced. If multiple (redundant) routes are available, failover is transparent to the application. (Thus the requirement that packets not necessarily be delivered in order.) SIP can use SCTP (I only know about SIP using UDP).

Note

Ratz 20 Feb 2006

There is a remotely similar approach in the TCP splicing code for LVS (http://www.linuxvirtualserver.org/software/tcpsp/). It's only a small subset of SCTP.

With TCP, scheduling needs to know the number of current connections to each realserver, before assigning a realserver for the next connection. The length of time for a connection can be short (retrieving a page by http) or long (an extended telnet session).

For UDP there is no "connection". LVS uses a timeout (about 3 minutes for 2.4.x kernels) and any UDP packets from a client within the timeout period will be rescheduled to the same realserver. On a short time scale (ca. the timeout), there will be no load balancing of UDP services (e.g. as was found for ntp); all requests will go to the same realserver. On a long time scale (>>timeout) loadbalancing will occur.

Here's the official LVS definition of a UDP "connection"

Wensong Zhang wensong (at) iinchina (dot) net 2000-04-26

For UDP datagrams, we create entries for state with the timeout of IP_MASQ_S_UDP (5*60*HZ) by default. Consequently all UDP datagrams from the same source to the same destination will be sent to the same realserver. Therefore, we call the data communication between a client's socket and a server's socket a "connection", for both TCP and UDP.

Julian Anastasov 2000-07-28

For UDP we can in principle remove the implicit persistence for the UDP connections and thus select a different realserver for each packet. My thought was to implement a new feature: schedule each UDP packet to a new realserver, i.e. something like timeout=~0 as a service flag for UDP.

LVS has been tested with only a few UDP services:

So far only DNS has worked well (but then DNS is already fine in a cluster setup without LVS). ntpd is already self loadbalancing and doesn't need to be run under LVS. xdmcp dies if left idle for long periods (several hours). UDP services are not commonly used with LVS and we don't yet know whether the problems are with LVS or with the service running under LVS.

Han Bing hb (at) quickhot (dot) com 29 Dec 2002

I am developing several game servers using UDP which I would like to use with LVS. LVS supports UDP "connections"; does persistence work for UDP too?

For example, I have 3 games; every game has 3 servers (9 servers in 3 groups in total). All game1 servers listen on udp port 10000, game2 servers listen on udp port 10001, and game3 servers listen on udp port 10002. When the client sends a udp datagram to game1 (to VIP:10000), can the lvs director auto-select one server from the 3 game1 servers and forward it to that server, AND keep the persistence of this "UDP connection" when it receives the following datagrams from the same CIP?

Joe: not sure what may happen here. People have had problems with LVS/udp (e.g. with ntp). These problems should go away with persistent udp, but no-one has tried it and it didn't even occur to me. The behaviour of LVS with persistent udp may be undefined for all I know. I would make sure that the setup worked with ntp before trying a game setup.

4.32. patch: machine readable error codes from ipvsadm

Computers can talk to each other and read from and write to other programs. You shouldn't have to get a person to sit at the console to parse the output of a program. Here's a patch to make the output of ipvsadm machine readable.

Padraig Brady padraig (at) antefacto (dot) com 07 Nov 2001

This 1-line patch is useful for me and I don't think it will break anything. It's against ipvsadm-0.8.2 and makes ipvsadm return a specific error code.

--- //ipvs-0.8.2/ipvs/ipvsadm/ipvsadm.c	Fri Jun 22 16:03:08 2001
+++ ipvsadm.c	Wed Nov  7 16:29:11 2001
@@ -938,6 +938,7 @@
         result = setsockopt(sockfd, IPPROTO_IP, op,
                             (char *)&urule, sizeof(urule));
         if (result) {
+                result = errno; /* return to caller */
                 perror("setsockopt failed");

                 /*


4.33. patch: stateless ipvsadm - add/edit patch

Commands like ifconfig are idempotent, i.e. they tell the machine to assume a certain state without reference to the previous state. You can repeat the same command without errors (e.g. put IP=xxx onto eth0). Not all commands are idempotent - some require you to know the state of the machine first. ipvsadm is not idempotent: if a VIP:port entry already exists, then you will get an error on attempting to enter it. Whether you make a command idempotent or not will depend on the nature of the command.

The problem with ipvsadm is that, not being idempotent, it isn't easily scriptable and hence is awkward to use for automated control of an LVS:

  • If no entry exists:

    you must add the entry with the -a option

  • If the entry exists:

    you must edit the entry with the -e option.

You will get an error if you use the wrong command. Two solutions are:

  • parse the output of ipvsadm to see if the entry you are about to make already is present
  • try both commands, see which one succeeds, and have the script figure out whether the resulting error (if any) was expected (see the sketch below).
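
Purely as an illustration of the second workaround (the service and realserver addresses are made up), a wrapper can attempt the add and fall back to the edit when ipvsadm exits non-zero:

/*
 * Sketch of the "try both" workaround: attempt to add the realserver,
 * and if ipvsadm complains (non-zero exit status), try to edit it instead.
 */
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
	const char *add  = "ipvsadm -a -t 192.168.1.110:80 -r 192.168.1.11 -g -w 1";
	const char *edit = "ipvsadm -e -t 192.168.1.110:80 -r 192.168.1.11 -g -w 1";

	if (system(add) != 0) {
		fprintf(stderr, "add failed, trying edit\n");
		if (system(edit) != 0) {
			fprintf(stderr, "edit failed too\n");
			return 1;
		}
	}
	return 0;
}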

This is a pain and is quite unnecessary. What is needed is a version of ipvs that accepts valid entries without giving an error. Here's the ipvs-0.9.0_add_edit.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.0_add_edit.patch) patch by Horms against ipvs-0.9.0. It modifies several ipvs files, including ipvsadm.

4.34. patch: fwmark name-number translation table

ipvsadm allows entry of the fwmark only as a number. In some cases, it would be more convenient to enter/display the fwmark as a name, e.g. an e-commerce site serving multiple customers (i.e. VIPs) which links http and https by a fwmark. The output of ipvsadm would then list the fwmark as "bills_business", "fred_inc" rather than "14","15"...

Horms has written a ipvs-0.9.5.fwmarks-file.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.5.fwmarks-file.patch) which allows the use of a string fwmark as well as the default method of an integer fwmark, using a table in /etc that looks like the /etc/hosts table.

Horms horms (at) vergenet (dot) net Nov 14 2001

while we were at OLS in June, Joe suggested that we have a file to associate names with firewall marks. I have attached a patch that enables ipvsadm to read names for firewall marks from /etc/fwmarks. This file is intended to be analogous to /etc/hosts (or files in /etc/iproute2/).

The patch to the man page explains the format more fully, but briefly, the format is "fwmark name...", newline delimited.

e.g.

1 a_name
2 another_name yet_another_name

Which leads to

director:/etc/lvs# ipvsadm -A -f a_name

4.35. ip_vs_conn.pl

Note
You can also run `ipvsadm -lcn` to do the same thing.
#!/usr/bin/perl
#-----------------------------------------------
#Date: Wed, 07 Aug 2002 20:14:25 -0300
#From: Jeff Warnica <noc (at) mediadial (dot) com>

#Here is an /proc/net/ip_vs_conn hex mode to integer script.
#If its given an ipaddress as an argument on the commandline,
#it will show only lines with that ipaddress in it.
#-------------------------------------------------

if (@ARGV) {
        $mask = $ARGV[0];
}
open(DATA, "/proc/net/ip_vs_conn");

$format = "%8s %-17s %-5s %-17s %-5s %-17s %-5s %-17s %-20s\n";
printf $format, "Protocol", "From IP", "FPort", "To IP", "TPort", "Dest IP", "DPort", "State", "Expires";
while(<DATA>){
        chop;
        ($proto, $fromip, $fromport, $toip, $toport, $destip, $destport,
         $state, $expiry) = split();
        next unless $fromip;
        next if ($proto =~ /Pro/);

        $fromip = hex2ip($fromip);
        $toip   = hex2ip($toip);
        $destip = hex2ip($destip);

        $fromport = hex($fromport);
        $toport   = hex($toport);
        $destport = hex($destport);

        if ( ($fromip =~ /$mask/) || ($toip =~ /$mask/) ||
             ($destip =~ /$mask/) || (!($mask)) ) {
                printf $format, $proto, $fromip, $fromport, $toip,
                       $toport, $destip, $destport, $state, $expiry;
        }
}


sub hex2ip {
        my $input = shift;

        $first  = substr($input,0,2);
        $second = substr($input,2,2);
        $third  = substr($input,4,2);
        $fourth = substr($input,6,2);

        $first  = hex($first);
        $second = hex($second);
        $third  = hex($third);
        $fourth = hex($fourth);

        return "$first.$second.$third.$fourth";
}

#---------------------------------------------------------------

4.36. Luca's php monitoring script

Luca Maranzano liuk001 (at) gmail (dot) com 12 Oct 2005

I've written a simple php script luca.php to monitor the status of an LVS server. To use it, configure sudo so that the Apache user can run /sbin/ipvsadm as root without a password prompt. The CSS is derived from the phpinfo() page.

Note
I had this as a link to the file on my machine, so you could just download it if you wanted it. However the file has a sudo in it, which caused havoc with my security tests of files. I've deleted the file and instead have it as text here in the HOWTO. Hopefully no-one will try to execute the HOWTO.
<?
// Simple script to monitor LVS
// --liuk -at- linux.it
//
// extract vars...
$p1=$_GET['resolve_dns'];
$p2=$_GET['refresh_int'];
if($p2=="" || $p2<9) { $p2="10"; };
?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"DTD/xhtml1-transitional.dtd">
<html><head>
<meta http-equiv="refresh" content="<? echo $p2; ?>">
<style type="text/css"><!--
body {background-color: #ffffff; color: #000000;}
body, td, th, h1, h2 {font-family: sans-serif;}
pre {margin: 0px; font-family: courier;}
a:link {color: #000099; text-decoration: none; background-color: #ffffff;}
a:hover {text-decoration: underline;}
table {border-collapse: collapse;}
.center {text-align: center;}
.center table { margin-left: auto; margin-right: auto; text-align: left;}
.center th { text-align: center !important; }
td, th { border: 1px solid #000000; font-size: 75%; vertical-align: baseline;}
h1 {font-size: 150%;}
h2 {font-size: 125%;}
.p {text-align: left;}
.e {background-color: #ccccff; font-weight: bold; color: #000000;}
.h {background-color: #9999cc; font-weight: bold; color: #000000;}
.v {background-color: #cccccc; color: #000000;}
.vv {background-color: #cccccc; color: #000000; font-family: courier; }
i {color: #666666; background-color: #cccccc;}
img {float: right; border: 0px;}
hr {width: 600px; background-color: #cccccc; border: 0px; height: 1px;
color: #000000;}
//--></style>
<title>Local Director Monitor</title></head>
<body><div class="center">
<table border="0" cellpadding="3" width="600">
<tr class="h"><td>
<h1 class="p">Local Director Monitor v. 1.0</h1>
</td></tr>
</table><br />
<table border="0" cellpadding="3" width="800">

<tr><td class="e">Monitor Options: </td>
<td class="v">
<form method="GET" action="">
<? if(isset($resolve_dns)) { $curr_dns="$p1"; } else { $curr_dns="1"; }; ?>
<input type="checkbox" name="resolve_dns" value="<?echo $curr_dns;?>"
<? if($p1) { echo "checked=\"checked\""; } else { $dns_flag=" -n "; }; ?>
> Resolve DNS Names - Refresh every
<input type="text" name="refresh_int" size="2" value="<? echo $p2; ?>">
seconds -
<input type="submit" value="Update">
</form>
</td></tr>

<tr><td class="e">Active node: </td>
<td class="vv"><? passthru("hostname"); ?></td></tr>

<tr><td class="e">Status: </td>
<td class="vv"><pre><? $cmd="sudo /sbin/ipvsadm -L ".$dns_flag;
passthru($cmd); ?></pre></td></tr>

<tr><td class="e">Statistics: </td>
<td class="vv"><pre><? $cmd="sudo /sbin/ipvsadm -L --stats
".$dns_flag; passthru($cmd); ?></pre></td></tr>

<tr><td class="e">Active<br>connections: </td>
<td class="vv"><pre><? $cmd="sudo /sbin/ipvsadm -L -c ".$dns_flag;
passthru($cmd); ?></pre></td></tr>

<tr><td class="e">Rate<br>statistics: </td>
<td class="vv"><pre><? $cmd="sudo /sbin/ipvsadm -L --rate ".$dns_flag;
passthru($cmd); ?></pre></td></tr>

<tr><td class="e">Sync<br>daemon: </td>
<td class="vv"><pre><? passthru("sudo /sbin/ipvsadm -L --daemon");
?></pre></td></tr>

</table><br />

</div></body></html>

Jeremy Kerr jk (at) ozlabs (dot) org 12 Oct 2005

<? $cmd="sudo /sbin/ipvsadm -L ". $dns_flag; passthru($cmd); ?>

Whoa.

If you use this script with register_globals set (and assuming you've set it up so that the sudo works), you've got a remote *root* vulnerability right there, e.g. http://example.com/script.php?resolve_dns=1&dns_flag=;sudo+rm+-rf+/, which will do `rm -rf /` as root.

you may want to ensure your variables are clean beforehand, and avoid the sudo completely (maybe use a helper process?)

malcolm (at) loadbalancer (dot) org Oct 12 2005

That's why PHP no longer has register_globals on by default! And also why you lock down access to your admin pages by source ip. My code has this vulnerability, but I'm not sure a helper app would be any more secure (sudo is a helper app).

liuk001 (at) gmail (dot) com Oct 12 2005

Jeremy, this is a good point. I wrote it as a quick and dirty hack without security in mind. It is used on the internal net by trusted users who indeed have root access to the servers ;-) However, sudo is configured to allow only /sbin/ipvsadm to be run by the www-data user, so I think that /bin/rm could not be executed.

Graeme Fowler graeme (at) graemef (dot) net 12 Oct 2005

...as all the relevant values are produced in /proc/net/ip_vs[_app,_conn,_stats], then why not just write something to process those values instead? They're globally readable and don't need any helper apps to view them at all.

Yes, you'd be re-inventing a small part of ipvsadm's functionality. The security improvements alone are worth it; the fact that the overhead of running sudo and then ipvsadm is removed by just doing an open() on a /proc file might be worth it in situations where you may have many users running your web app.

Sure, you need to decode the hex values to make them "nice". Unless you have the sort of users who read hex encoding all the time :)
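
Along those lines, here is a sketch of such a reader in C. It assumes the same /proc/net/ip_vs_conn column layout that the Perl script in Section 4.35 decodes (protocol, hex client/virtual/destination address and port pairs, state, expiry) and needs no sudo or ipvsadm at all:

/*
 * Sketch: dump /proc/net/ip_vs_conn with the hex addresses decoded.
 */
#include <stdio.h>

static void print_ip(unsigned int ip)
{
	/* the first pair of hex digits is the first octet, as in the
	 * Perl script's hex2ip() */
	printf("%u.%u.%u.%u", (ip >> 24) & 0xff, (ip >> 16) & 0xff,
	       (ip >> 8) & 0xff, ip & 0xff);
}

int main(void)
{
	FILE *f = fopen("/proc/net/ip_vs_conn", "r");
	char line[256], proto[16], state[16], expires[16];
	unsigned int cip, vip, rip, cport, vport, rport;

	if (!f) {
		perror("/proc/net/ip_vs_conn");
		return 1;
	}
	while (fgets(line, sizeof(line), f)) {
		if (sscanf(line, "%15s %8x %4x %8x %4x %8x %4x %15s %15s",
			   proto, &cip, &cport, &vip, &vport, &rip, &rport,
			   state, expires) != 9)
			continue;	/* header line */
		printf("%-4s ", proto);
		print_ip(cip); printf(":%u -> ", cport);
		print_ip(vip); printf(":%u -> ", vport);
		print_ip(rip); printf(":%u  %s %s\n", rport, state, expires);
	}
	fclose(f);
	return 0;
}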

4.37. ipvsadm set option

anon

what /proc values does ipvsadm --set modify? Something in /proc/sys/net/ipv4/vs?

Ratz

The current proc-fs entries are a read-only representation of what can be set regarding state timeouts. For IPVS related connection entries, ipvsadm --set sets:

  • The time we spend in TCP ESTABLISHED before transition
  • The time we spend in TCP FIN_WAIT before transition
  • The time we spend for UDP in general

Andreas 05 Feb 2006

Where can I get the currently set values of ipvsadm --set foo bar baz?

Ratz

You can't :) IP_VS_SO_GET_TIMEOUTS is not implemented in ipvsadm or I'm blind. Also the proc-fs related entries for this are not exported. I've written a patch to re-instate the proper settings in proc-fs, however this is only in 2.4.x kernels. Julian has recently proposed a very granular timeout framework, however none of us has had the time nor impulse to implement it. For our (work) customers I needed the ability to instrument all the IPVS related timeout values in DoS and non-DoS mode. The ipvsadm --set option should be obsoleted, since it only covers the timeout settings partially and there is no --get method.

I did not find a way to read them out: I grepped through /proc/sys/foo and /proc/net foo and was not able to see the numbers I had set before. This was on kernel 2.4.30 at least.

Correct. The standard 2.4.x kernel only exports the DoS timers for some (to me at least) unknown reason. I suggest that we re-instate (I'll send a patch or a link to the patch tomorrow) these timer settings until we can come up with Julian's framework. It's imperative that we can set those transition timers, since their default values are dangerous regarding >L4 DoS. One example is HTTP/1.1 slow start if the web servers are mis-configured (wrong MaxKeepAlive and its timeout settings).

This brings me further to the question if the changes of lvs in recent 2.6 development are being backported to 2.4?

Currently I would consider it the other way 'round. 2.6.x has mostly stalled feature-wise and 2.4.x is missing the crucial per-RS threshold limitation feature. I've backported it and also improved it quite a bit (service pool) and so we're currently completely out of sync :). I'll try to put some more effort into working on the 2.6.x kernel, however since it's still too unstable for us, our main focus remains on the 2.4.x kernel.

[And before you ask: No, we don't have the time (money wise) to invest into bug-hunting and reporting problems regarding 2.6.x kernels on high-end server machines. And in private environment 2.6.x works really well on my laptops and other machines, so there's really not much to report ;).]

On top of that, LVS does not use the classic TCP timers from the stack, since it only forwards TCP connections. The IPVS timers are needed so we can maintain the LVS state table regarding expirations in the various modes, mostly LVS-DR.

Browsing through the code recently I realised that the state transition code in ip_vs_conn:vs_tcp_state() is very simple, probably too simple. If we use Julian's forward_shared patch (which I consider a great invention, BTW) one would assume that IPVS timeouts are closely tied to the actual TCP flow. However, this is not the case because, from what I've read and understood, the IPVS state transitions are done without memory, so it's wild guessing :). I might have a closer look at this because it just seems sub-optimal. Also the notion of active versus inactive connections stemming from this simple view of the TCP flow is questionable, especially the way some schedulers depend on and weight them.

So, if I set an lvs tcp timeout of about 2h 12 min, lvs would never drop a tcp connection unless a client is really "unreachable"?

The timeout is bound to the connection entry in the IPVS lookup table, so we know where to forward incoming packets belonging to a specific TCP flow. A TCP connection is never dropped (or not dropped) by LVS as such; only specific packets pertaining to a TCP connection are.

After 2h Linux sends tcp keepalive probes several times, so there are some bytes sent through the connection.

Nope, IPVS does not work as a proxy regarding TCP connections. It's a TCP flow redirector.

lvs will (re)set the internal timer for this connection to the keepalive time I set with --set.

Kind of ... only the bits of the state transition table which are affected by the three settings. It might not be enough to keep persistency for your TCP connection.

Or does it recognize that the bytes sent are only probes without a valid answer and thus drop the connection?

The director does not send keepalive probes.

Will we eventually get timeout parameters _per service_ instead of global ones?

Julian has proposed the following framework: http://www.ssi.bg/~ja/tmp/tt.txt

So if you want to test, the only thing you have to do is fire up your editor of choice :). Ok, honestly, I don't know when this will be done because it's quite some work and most of us developers here are pretty busy with other daily activities. So unless there is a glaring issue with the timers as implemented, chances are slim that this gets implemented. Of course I could fly down to Julian's place over the weekend and we could implement it together; want to sponsor it? ;).

There are a lot of TCP timers in the Linux kernel and they all have different semantics. There is the TCP timeout timer for sockets belonging to locally initiated connections, then there is a TCP timeout for the connection tracking table, which on my desktop system for example has the following settings:

/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_close:10
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_close_wait:60
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_established:432000
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_fin_wait:120
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_last_ack:30
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_syn_recv:60
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_syn_sent:120
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait:120

And of course we have the IPVS TCP settings, which look as follows (if they weren't disabled in the core :)):

/proc/sys/net/ipv4/vs/tcp_timeout_established:900
/proc/sys/net/ipv4/vs/tcp_timeout_syn_sent:120
/proc/sys/net/ipv4/vs/tcp_timeout_syn_recv:60
/proc/sys/net/ipv4/vs/tcp_timeout_:900
[...]

unless you enabled tcp_defense, which changes those timers again. And then of course we have other in-kernel timers, which influence those timers mentioned above.

However, the aforementioned timers for packet filtering, NAPT and load balancing are meant as a means to map expected real TCP flow timeouts. Since there is no socket (as in an endpoint) involved when doing either netfilter or IPVS, you have to guess what the TCP flow in-between (where your machine is "standing") is doing, so you can continue to forward, rewrite, mangle, whatever, the flow, _without_ disturbing it. The timers are used for table mapping timeouts of TCP states. If we didn't have them, mappings would stay in the kernel forever and eventually we'd run out of memory. If we have them wrong, it might happen that a connection is aborted prematurely by our host, for example yielding those infamous ssh hangs when connecting through a packet filter.

The tcp keepalive timer setting you've mentioned, on the other hand, is per socket, and as such only has an influence on locally created or terminated sockets. A quick skim of socket(2) and socket(7) reveals:

   [socket(2) excerpt]
        The communications protocols which implement a SOCK_STREAM
        ensure that data is not lost or duplicated.  If a piece of
        data  for  which the peer protocol has buffer space cannot
        be successfully transmitted within a reasonable length  of
        time,  then the connection is considered to be dead.  When
        SO_KEEPALIVE is enabled on the socket the protocol  checks
        in  a  protocol-specific  manner if the other end is still
        alive.

   [socket(7) excerpt]
        These socket options can be set by using setsockopt(2) and
        read with getsockopt(2)  with  the  socket  level  set  to
        SOL_SOCKET for all sockets:

        SO_KEEPALIVE
        Enable sending of keep-alive messages on connection-oriented 
        sockets. Expects a integer boolean flag.

4.38. ipvsadm error messages

ipvsadm's error messages are low level and give the user little indication of what they've done wrong. These error messages were written in the early days of LVS, when getting ipvsadm to work was a feat in itself. Unfortunately the messages have not been updated, and enough use of ipvsadm is now scripted that we don't run into the messages much anymore. As people post error messages and what they mean, I'll put them here.

Brian Sheets bsheets (at) singlefin (dot) net 7 Jan 2007

ipvsadm -d -t 10.200.8.1:25 -r 10.200.8.100
Service not defined

What am I doing wrong? The syntax looks correct to me.

Graeme Fowler graeme (at) graemef (dot) net 07 Jan 2007

Do you have a service defined on VIP 10.200.8.1 port 25? Make sure you're not getting your real and virtual servers mixed up.

Brian

Yup, I had the real and virtuals reversed..

Joe

You should be able to delete a realserver for a service that isn't declared, with only a notice rather than an error, at least in my thinking. However that battle was lost back in the early days.

4.39. ipvsadm fast update bug with smp

Kees Hoekzema kees (at) tweakers (dot) net 11 Jul 2007

I'm applying weight changes to a 64bit 2-way SMP director quite rapidly (not quite twice a second, but close) and getting a frozen director, which needs a cold reset.

I have found a bit more information from my debugging, and it seems that Horms already knows about it (http://marc.info/?l=linux-netdev&m=118040107213444&w=2). I recompiled the ip_vs() modules with a bit more debugging and every time my system crashed I had the same debug output:

Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 910
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 913
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 916
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 918
Jul 12 15:43:15 atropos kernel: Leave: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 919
Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 910

The code after line 910 reads:

while (atomic_read(&svc->usecnt) > 1) {};

Every other busy lock in the code reads:

IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);

Which is basically the same, except for the cpu_relax(). At the moment I am testing my server with the cpu_relax() code in the ip_vs_edit_dest function, and so far it has not crashed and is directing traffic quite a bit longer than was previously possible. The only differences between this server and the old server (which didn't have any problems) are:

  • - SMP (4 cores) vs Single core
  • - 64 bits vs 32 bits
  • - 2.6.21.5 vs 2.6.20.4 (but I do not see any changes in ip_vs_ctl.c)

In my first mail I blamed the 64/32 bit difference, but right now I'm thinking more of an SMP issue; unfortunately I lack the kernel hacking skills to say why, or why that cpu_relax() helps so much in the while loop. Hopefully Horms understands it better than I do ;)
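
For reference, IP_VS_WAIT_WHILE is (at least in the 2.6 kernels of that era) essentially the same busy loop with a cpu_relax() in the body, roughly:

/* approximately what the ip_vs headers define: */
#define IP_VS_WAIT_WHILE(expr)	while (expr) { cpu_relax(); }

cpu_relax() is a hint to the CPU (on x86 it compiles to a pause instruction), which makes a real difference in a tight spin like this on SMP and hyperthreaded processors.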

--- linux-2.6.22.1/net/ipv4/ipvs/ip_vs_ctl.c    2007-07-12
19:41:27.000000000 +0200
+++ old/net/ipv4/ipvs/ip_vs_ctl.c       2007-07-10 20:56:30.000000000 +0200
@@ -909,8 +909,8 @@
        write_lock_bh(&__ip_vs_svc_lock);

        /* Wait until all other svc users go away */
-       IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);
-
+       while (atomic_read(&svc->usecnt) > 1) {};
+
        /* call the update_service, because server weight may be changed */
        svc->scheduler->update_service(svc);

Joe

I know this is separate from the problem, but according to feedback and control theory you should be making adjustments on a timescale that damps the transients. What a transient is here is not obvious - the timescale of a tcpip connection? the time it takes to change the load by 10%? 50%? I don't know, but every couple of seconds would seem to be a lot shorter than either of these two time scales. Do you find your setup has problems when you do adjustments on a longer timescale?

4.40. Problems when no scheduler

Note
What happens if you have the service entered with ipvsadm -A, but have no realservers (no ipvsadm -a) to accept the forwarded packets? We haven't quite figured out what to do about this yet. Till we can get a better idea about what's going on, we're not going to do anything.

Siim Poder windo (at) p6drad-teel (dot) net 07 May 2008

We've had LVS machines dying a couple of times when the service is using the wrr scheduler and keepalived pulls all real servers from behind the service IP.

The symptoms are that there are a lot of (thousands, apparently for every packet?) messages in syslog: ip_vs_wrr_schedule(): no available servers

After which the machine hangs. I don't recall if I've had to boot it manually or if it reboots by itself.

Also, I'm not sure if it is that message that is killing the machine, but the problem hasn't occurred with other schedulers (that don't print such a message). We use wrr the most though.

I think we should either remove the message or ratelimit it (unless the bug is somewhere else). I tested the patch and it seems to be ok, but as I'm unable to reproduce the hanging/crashing in a test environment, I can't verify whether it actually helps.

Something close to this was added to mainline by someone already. But the problem seems to persist (just without the messages). It seems to appear with any scheduler (at least wrr, wlc and rr). However I have been unable to reproduce this by either connection or packet rate in a test environment. It's probably not just the missing realservers, but something relatively infrequent that gets triggered only after there are no realservers. The LVS goes down within about 5-15 minutes of the realservers going missing, IIRC.

I tried generating many connections (and played with ttl/fragmentation a little), but couldn't trigger the bug. Maybe it has to do with clients sending some ICMP messages (which would probably be rare enough)?

This still gets triggered in our live environment for high connection rate services (if the servers fail for any reason and keepalived kicks them out). We have put sorry servers into the keepalived configuration to avoid the whole LVS going down for now (sorry_server 127.0.0.1:666), so there is a workaround for us.

--- linux-2.6.24/net/ipv4/ipvs/ip_vs_wrr.c      2008-01-24 22:58:37.000000000 +0000
+++ linux-2.6.24-ipvs_patches/net/ipv4/ipvs/ip_vs_wrr.c 2008-05-06 16:17:17.790662800 +0000
@@ -169,7 +169,7 @@
                                 */
                                if (mark->cw == 0) {
                                        mark->cl = &svc->destinations;
-                                       IP_VS_INFO("ip_vs_wrr_schedule(): "
+                                       IP_VS_DBG_RL("ip_vs_wrr_schedule(): "
                                                   "no available servers\n");
                                        dest = NULL;
                                        goto out;

Horms 29 Dec 2008

We're doing nothing till we figure out what's really going on. The important problem seems to be that LVS dies sometimes. But unfortunately that can't be fixed right now, because nobody knows how to do so, despite Siim's efforts to find the cause of the problem.

With regards to making wrr like the other schedulers, I'd actually be much more inclined to do the reverse - make all the other schedulers display a rate-limited warning when they don't have any real servers available. Perhaps something like the patch against 2.6.28 below.

In any case, I think that the warning message and the LVS dying issues are separate, except that it seems likely that the warning message will help to lead us to the cause of the "LVS dies" bug.

Note
Joe: The patches to make all schedulers display a warning will be in kernel 2.6.29.