LVS-HOWTO

Joseph Mack, jmack (at) wm7d (dot) net
v2009.03, Mar 2009, released under GPL.
Copyright 1999-2009 Joseph Mack

Install, testing and running of a Linux Virtual Server with 2.2.x, 2.4.x, 2.6.x kernels

Search the LVS documentation with htdig.
Search the various mailing list archives. Hank Leininger's searchable mailing list archive has moved. It's now at http://marc.info/?l=linux-virtual-server&w=2.

LVS: Introduction This LVS-HOWTO is posted to the LVS-HOWTO homepage, http://www.austintek.com/LVS/LVS-HOWTO/ about once a month (although I do miss occasional months). Some of the material is from my own testing and I've tried to make it into a coherent story. Much of the material is from the lvs-users mailing list and is listed chronologically (sometimes forward and sometimes backwards in time) and will thus look like a blog.

Thanks Contributions to this HOWTO came from the mailing list and are attributed to the poster (with e-mail address). Postings may have been edited to fit into the flow of the HOWTO. The LVS logo (Tux with 3 lighter shaded penguins behind him representing a director and 3 realservers) is by Mike Douglas spike (at) bayside (dot) net. The LVS homepage is running on a machine donated by Horms. (Until Jul 2002, we used a machine donated by Voxel.) The LVS mailing list is hosted by Lars in Germany lmb (at) suse (dot) de

About the HOWTO

Purpose

To enable you to understand how a Linux Virtual Server (LVS) works. The LVS-mini-HOWTO (http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html) tells you how to set up and install an LVS without understanding how the LVS works.

The material here covers directors and realservers with 2.2, 2.4 and 2.6 kernels. The original material was written for 2.2.x kernels and ipchains. Not all material has been updated for 2.4.x kernels and iptables.

The layout of this HOWTO is almost flat - you go to the section you want information on. You aren't supposed to read it from start to finish. Within any section, newer information may be combined with older information that says different things. I just don't have time to edit everything - I'll be glad if you straighten me out. The only information one level up is
- how LVS works (from this HOWTO or from documentation on the website, e.g. Wensong's early documents)
- setting up an LVS in the LVS-mini-HOWTO.html.

The code for 2.0.x kernels still works fine and was used on production systems when 2.0.x kernels were current, but is not being developed further.

For 2.2 kernels, the Linux kernel networking code was rewritten, producing for us . This changes the installation of LVS from a simple process that can be done by almost anyone, to a thought provoking, head scratching exercise, which requires detailed understanding of the workings of LVS. For 2.0 and 2.2, LVS is stand-alone code, based on ip_masquerading and doesn't integrate well with other uses of ip_masquerading.

For 2.4 kernels, LVS was rewritten as much as possible to be a netfilter module, to allow it to fit into and be visible to other netfilter modules. Unfortunately the fit isn't perfect, but cooperation with netfilter does work in most cases. If ip_vs() was a real netfilter module, it would be really slow. (The original LVS-NAT code had problems when using your director as a firewall; see the , but much of this has been fixed - Feb 2006.)
Being a netfilter module, the latency and throughput are slightly worse for 2.4 LVS than for the 2.2 code. However with modern CPUs running at 800MHz, the bottleneck now is network throughput rather than LVS throughput (you only need a small number of realservers to saturate 100Mbps ethernet). In general ipvsadm commands and services have not changed between kernels.

HOWTO source is xml The HOWTO was originally written in sgml. It is now xml. The char '&' found in C source code has to be written as &amp; in sgml. If you swiped patches from the sgml rather than the html rendering, you would get code that needed to be edited to turn each &amp; back into &. Now that the HOWTO is in xml, this munging is not needed. Although I've tried to remove all munged ampersands, I expect some will persist for a while. Ampersands in URLs still have to be munged.

e-mail addresses in the HOWTO are spam protected Well we hope so anyhow. An article on spambots describes robots which ignore the robots.txt file and scan for e-mail addresses in readable files on websites. The author suggests removing any 'mailto:' strings and spam protecting e-mail addresses, by changing them from machine-readable to human-readable format. If you have a better scheme than implemented here, (and I can do it with vi) let me know. (May 2002): BTW, 160 people have contributed to the HOWTO (as judged by unique e-mail addresses).

Links die frequently There are links to 180 urls in this HOWTO (May 2002), which came from postings to the LVS mailing list. If people move/rename/delete/change their webpages/links once a year, then I'm going to have to track down 15 websites each month. If a site is gone and it isn't in google, I'm not going to be able to find it.

Nomenclature/Abbreviations If you use these terms when you mail us, we'll know what you're talking about.

Preferred names
- IPVS, ipvs, ip_vs: the code that patches the Linux kernel on the director.
- LVS, Linux virtual server: the director + realservers. Together these machines are the virtual server, which appears as one machine to the client(s).
- director: the node that runs the ipvs code. Clients connect to the director. The director forwards packets to the realservers. The director is nothing but an IP router with special rules that make the LVS work.
- realservers: the hosts that have the services. The realservers handle the requests from the clients.
- client: the host or user level process that connects to the VIP on the director.
- forwarding method (currently LVS-NAT, LVS-DR, LVS-Tun): the director is a router with somewhat different rules for forwarding packets than a normal router. The forwarding method determines how the director sends packets from the client to the realservers.
- scheduling: the algorithm the director uses to select a realserver to service a new connection request from a client.

synonyms Please use the first term in these lines. The other words are valid but less precise (or are redundant).
- director: load balancer, dispatcher, redirector.
- realserver: servers, realservers, real-servers.
- LVS: the whole cluster, the (linux) virtual server (LVS).

virtual services, scheduling groups Here's the ipvsadm output of an LVS serving telnet and squid.

      -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:squid rr
      -> rs1.mack.net:squid        Route   1      0          0
      -> rs2.mack.net:squid        Route   1      0          0
      -> rs3.mack.net:squid        Route   1      0          0
    TCP  lvs.mack.net:telnet rr
      -> rs1.mack.net:telnet       Route   1      0          0
      -> rs2.mack.net:telnet       Route   1      0          0

In the above LVS, there are two virtual services, telnet and squid. There are also two virtual servers; a virtual server for telnet (which has 2 realservers) and a virtual server for squid (which has 3 realservers). This is what the client sees; two services (and two servers). Connections to each virtual server are scheduled (here by "rr", round robin), to the realservers which belong to the scheduling group. Here the scheduling group for telnet is rs1,rs2. The scheduling group for squid is rs1,rs2,rs3. Connections to the telnet virtual server are scheduled independently of connections to the squid virtual server. The above nomenclature can be extended for .
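The independence of the two scheduling groups can be illustrated with a small Python sketch. This is only a toy model of round robin under my own assumptions (the class name RoundRobinDirector and its method are made up for illustration); it is not the ip_vs code and it ignores weights and connection counts.

```python
from itertools import cycle

# Scheduling groups taken from the ipvsadm listing above.
SCHEDULING_GROUPS = {
    "squid": ["rs1", "rs2", "rs3"],
    "telnet": ["rs1", "rs2"],
}


class RoundRobinDirector:
    """Toy director (hypothetical): one round-robin cursor per virtual service."""

    def __init__(self, groups):
        # Each virtual service cycles through its own scheduling group only.
        self._cursors = {service: cycle(servers) for service, servers in groups.items()}

    def schedule(self, service):
        """Return the realserver that gets the next new connection for service."""
        return next(self._cursors[service])


director = RoundRobinDirector(SCHEDULING_GROUPS)
telnet_picks = [director.schedule("telnet") for _ in range(3)]
squid_picks = [director.schedule("squid") for _ in range(4)]
print(telnet_picks)
print(squid_picks)
```

Scheduling three telnet connections does not move the squid cursor: the first squid connection still goes to rs1.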

scheduling instance, scheduled unit, virtual connection We don't have a good name for this. Suggestions welcome. (We also don't talk much about this concept on the mailing list, so we've done without a name.) The director needs to know how to schedule packets from the client to the realservers. The smallest unit for LVS is a TCP/IP connection, i.e. all packets that are part of a single TCP/IP session from a client will be sent to the same realserver. For a tcp virtual service, each tcp connection is scheduled separately, with the first tcp connection going to one realserver, and the next tcp connection going to whichever realserver the scheduler picks next. Here the virtual connection is the same as the tcp connection. For a persistent service, all tcp connections from a client that are separated by less than the timeout period are regarded as belonging to the same virtual connection and are scheduled to the same realserver. For udp, there is no such thing as a connection or session, and all packets from the client within a timeout period are scheduled to the same realserver. (People aren't using LVS for udp services a whole lot.) The virtual connection then is all udp packets from a client within a certain time period.
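The timeout-based case (connections or udp packets from one client that arrive less than a timeout apart go to the same realserver) can be sketched in Python as follows. This is a toy model under stated assumptions, not how the director implements it: the class name, the 300 second timeout and the use of round robin for new virtual connections are all made up for illustration.

```python
from itertools import cycle


class VirtualConnectionScheduler:
    """Toy model (hypothetical): group a client's traffic by a timeout."""

    def __init__(self, realservers, timeout):
        self._next_server = cycle(realservers)  # round robin for new virtual connections
        self._timeout = timeout                 # seconds; value chosen by the caller
        self._table = {}                        # client IP -> (realserver, time last seen)

    def schedule(self, client_ip, now):
        """Return the realserver for traffic from client_ip arriving at time now."""
        entry = self._table.get(client_ip)
        if entry is not None and now - entry[1] < self._timeout:
            server = entry[0]                   # same virtual connection
        else:
            server = next(self._next_server)    # new virtual connection
        self._table[client_ip] = (server, now)  # refresh the time last seen
        return server


sched = VirtualConnectionScheduler(["rs1", "rs2", "rs3"], timeout=300)
first = sched.schedule("CIP1", now=0)
again = sched.schedule("CIP1", now=100)    # 100s after the last one: same realserver
other = sched.schedule("CIP2", now=150)    # a different client is scheduled afresh
late = sched.schedule("CIP1", now=1000)    # 900s of silence: a new virtual connection
print(first, again, other, late)
```

Setting the timeout to 0 makes every connection its own virtual connection, which is the plain tcp case described above.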

backend (multi-tier) servers The realservers sometimes are frontends to other backend servers. The client does not connect to these backend servers and they are not in the ipvsadm table. e.g.
- a realserver may run a web application. The web application in turn connects to a database on another backend server.
- a webcaching realserver (e.g. a squid). The squid connects to backend webserver(s).
These backend servers are set up separately from the LVS.

the term "the server" is ambiguous People sometimes call the director or the realservers, "the server". Since the whole LVS appears as a server to the client and since the realservers are also serving services, the term "server" is ambiguous. Do not use the term "the server" or "the lvs server" when talking about LVS. Most often you are referring to the "director" or the "realservers". Sometimes (e.g. when talking about throughput) you are talking about the (whole) virtual server. I use "realserver" as I despair of finding a reference to a "realserver" in a webpage using the search keys "real" and "server". Horms and I (for reasons that neither of us can remember) have been pushing the term "real-server" for about a year, on the mailing list, and no-one has adopted it. We're going back to "realserver".

names of IPs/networks in an LVS The router has traditionally not been considered part of the LVS, because often you do not have control over the router. However if you're a paying customer, then the ISP will be glad to set up the router according to your specifications. If you have access to the router, it can solve and can install filter rules.

Here are the names we use for the various IPs. If you use them when asking questions on the mailing list, we'll be able to answer your questions more easily.

The VIP and DIP are set up as secondary IPs (i.e. there is another primary IP on that NIC), so they can be moved to another duplicate director following director failover. For initial setup with a single director, setting up the VIP and DIP as secondary IPs will make the transition to a failover setup easier. For a two director LVS (where directors failover), the IPs on the DRIP network are The DIP will be on the same NIC as PIP on bootup and will move to the same NIC as SIP on director failover. We don't seem to need a name for the primary IP on the outside of the director - no-one ever talks about it.

We don't often need to explicitly name the networks in an LVS, but here are some suggestions
- DRIP network: the network containing the DIP and RIPs. (OK you come up with a better name.)
- network facing the internet or the outside network: the network on the director which receives packets from the outside world. This shouldn't be called the VIP network, as the VIP is also in the DRIP network (but not replying to arp calls) on the realservers in LVS-DR and LVS-Tun.

Minimal knowledge required The mailing list and HOWTOs cover information specific to LVS. The rest you have to handle yourself. All of us knew nothing about computers when we first started, we learnt it, and you can too (we're not saying it's easy). If you can't setup a simple LVS from the LVS-mini-HOWTO, without breaking into a major sweat (or being able to tell us what's wrong with the instructions), then you need to do some more homework. (Also see Help! My LVS doesn't work.)

Ratz ratz (at) tac (dot) ch To be able to set up/maintain an LVS, you need to
- know how to patch and compile a kernel
- know the basics of shell-scripting
- have intermediate knowledge of TCP/IP
- have read the man-page, the online-documentation and LVS-HOWTO (this document) (and the LVS-mini-HOWTO)
- know basic system administration (e.g. iptables; syslog; find, compile, install code from source files; use cpan to find perl modules).

Free Technical Help All of the people on the LVS mailing list are replying for free in their spare time. The best we can do is to give solutions to technical problems on setting up and running LVS. I give about 15secs to a posting to decide if I've got something useful to say. The posting has to indicate that the person has analysed the problem to a stage where an answer exists. If _they_ can't describe the problem, there's no point in replying - they won't understand the answer.

Please don't e-mail me privately with general questions (feel free to cc: me if you want). The mailing list will archive your question and the answer(s) which can be retrieved later. Other people may have more interesting, relevant or useful comments than I will. If you are writing to me in the hopes of avoiding the humiliation of publicly showing your ignorance on the mailing list, it's not going to happen. We've had too many good ideas from "ignorant" people to let this happen. If your question has been answered many times before and it's in the HOWTO and the archives, you'll be told to read the HOWTO, that's all.

To get technical help:
- Read the docs on the website, the HOWTOs, and search the mailing list archives. The HOWTO (at the top) has a link to a search engine of all known LVS documentation. It will probably return several webpages. You'll have to find the entry from there. The LVS-mini-HOWTO shows you how to set up a simple 3 node (client, director, realserver) LVS without you needing to understand a whole lot about how an LVS works.
- After you've done a search of the docs, then post to the mailing list.
- updates/problems/bugs - post to the mailing list.

Jakub Suchy jakub (at) rtfm (dot) cz 13 Jan 2005 Please read: smart questions (http://www.catb.org/~esr/faqs/smart-questions.html) before asking questions.

Please only post relevant lines of a debug dump. If you post the whole dump, because you don't understand it, then it will fill up the archive machine and everyone's mail box. If we need the whole debug, we'll ask for it and you can send it to us off-list.

Problem people 1 It's hard to believe, but we get postings like

recompiling the kernel is hard (or I don't read HOWTOs), can't you guys cut me some slack and just tell me what to do?

I expect the people who post these statements don't read this HOWTO, so I may be wasting my time, but - No. The people on the mailing list answer questions for free, and have other important things to do, like keeping up with /. and checking our e-mail. When we're at home, we drink beer and watch Gilligan's Island re-runs.

Problem people 2

can anybody tell me how to setup a windows realserver? thank you very much! I'm in a hurry.

robert (dot) gehr (at) web2cad (dot) de I can't think of anyone who has set up lvs in a hurry :-)

Problem People 3: People using RedHat LVS RedHat have LVS in their standard distribution kernel. This gives people the idea that they can set up LVS from their standard RedHat distribution just by clicking on a few buttons or running some scripts. From reading the postings to the mailing list, it's more difficult than doing it our way. You still have to understand LVS and then afterwards, you have to figure out what RedHat did to it. One of the major wastes of time and source of aggravation for me personally on the LVS mailing list, is postings from people using RedHat LVS who assume that it's the same as LVS, and who post as if they're using our setup methods. Just saying that you're using a RedHat distribution doesn't tell us anything, since you can set up LVS our way in RedHat. Things you need to know before you post -
- There are reasons for wanting to set up LVS in a standard RedHat distribution (e.g. RedHat is "approved" in your location whereas "Linux" isn't). There is information in this HOWTO and in the various links from here which show you how to set up RedHat LVS.
- We have a method of setting up LVS which works for all distributions (including RedHat). We are not interested in learning, understanding, debugging, supporting or fixing a setup method that only works for one distribution.
- RedHat don't talk to us about what they do and while they may monitor the LVS mailing list, rarely (only about once a year, that I can tell) do they reply to people having problems with RedHat LVS. It appears that RedHat does not think their version of LVS worthy of much support and I agree with them.
- If you set up LVS the RedHat way, you still need an understanding of how an LVS works and is set up (just like everyone else), before posting to the mailing list.
If you are setting up with RedHat and want help with it, make sure that you describe what you've done, that you're using the RedHat files and how you've set it up, otherwise we'll assume that you're setting up using our methods.

Why you may not get an answer
- No-one knows. The took a long time to figure out. Since no-one else had seen the problem, we didn't know at first if it was a problem with LVS. It wasn't till 6 months later, when someone else had the same symptom, and found that it only occurred when the ftp helper module was loaded, that we could do something. I once needed to do something with iproute2 that I spent about 3 weeks trying to figure out. No-one on the list knew the answer. I had to post off-line to someone who could figure it out for me.
- We may not have a useful answer. If you post saying "I want to build an LVS with (list of hardware); do you think it will work?", all we can say is "probably". Often when questions like this come up, there are people who are happy to share their experiences, so there's no harm in posting such a question. In general the people who've been working with LVS for years will expect you to have read the docs and know what LVS does before you post. In the time I allot for a reply, I don't have time to figure out whether in your case LVS is best for you - you should pay a consultant to do this if you can't do it yourself.
- Your question may not be well posed. We are reading the postings in our spare time. You will get at most 30secs of attention before we figure out whether we can help you, an answer will take a bit of thinking, or we can't help you. If you have a long posting in which you haven't figured out which parts are causing the problem and which parts are working, then we aren't going to try to figure it out either. Post the minimum setup that will produce the problem.
- It's obvious that you haven't read the HOWTO.

Edit your posts! (top, bottom and in-line posting) Please edit the posting you're replying to, leaving only the parts relevant to your reply. We don't need to see material from previous posts irrelevant to the current posting, and the disk archive doesn't either. Reply in-line, i.e. following each statement by the poster. Here's a posting on the subject from one of the kernel mailing lists. Greg KH greg (at) kroah (dot) com 16 Nov 2005

After you've Got Technical Help In most cases when a problem is solved, there's enough info on the mailing list to see how it worked and we can write it up here for the next people. Occasionally, we get a posting "I've worked it out. Thanks for the help." When this happens we have no idea what the solution was and will have to reinvent it for the next person. If you've got help from the unpaid people on the mailing list, who've given their spare time to help you, when they could instead have been watching Gilligan's Island reruns, please write it up for the HOWTO. When I write to people asking for their solution I don't want to hear that you're busy and have a job. We're busy, have jobs, kids, homework to do and tax forms to fill in and we stopped what we were doing to help you. Here's a template.
- what you wanted to do
- why/how it didn't work
- what you needed to do to get it to work
- how the solution works

Paid technical help We occasionally get requests for people to do an install. The listing is a service to people looking for paid technical help (installs or anything else) and does not imply that I (Joe) or anyone connected to the LVS project endorse the services of the listees. If you want to know more about them, check their postings to the LVS mailing list. Entries will be listed at no cost, in approximate order of the date I receive/post them. Entries will be listed for at least a year (HOWTOs come out at erratic intervals and new entries will be added/old entries deleted whenever the next HOWTO comes out). If you want to be listed again next year, send me an e-mail in a year. I'm too busy to keep much of an eye on what goes in here and your entry may stay longer than a year. If you really want people to know who you are, don't rely on this entry - make sure google knows about you. To be listed, send me off-list

your URL (e.g. <a href="http://www.foo.org">The Foo LVS service centre</a>) and/or e-mail, then a blurb of up to 80 chars, e.g. "We do it all", optionally including your location.

This will be minimum maintenance - I'm just going to mouse swipe your e-mail (i.e. don't plan on changing your URL in the year).

People available for paid technical help.
- Oct 2007: http://www.dotnoc.com - solutions for hosting. sales@dotnoc.com. Linux load balancing and networking specialists.
- Oct 2007: Loadbalancer.org Ltd (http://www.loadbalancer.org/) - Specialise in high availability load balancers based on LVS. Happy for customers to have full access to the OS and source code and offer 24*7 support. However we don't do consultancy on home brew implementations. UK and USA offices.
- Oct 2007: http://www.netdigix.com - Linux solutions for business. contact@netdigix.com. We specialize in Linux networking and setup of LVS for hosting and mission critical infrastructures. Canada:British Columbia:Lower Mainland:Vancouver

Mailing list: subscribing, unsubscribing, searching Thanks to Hank Leininger for the mailing list archive which is searchable not only by subject and author, but also by strings in the body. Hank's resource has been of great help in assembling this HOWTO. The mailing list is available for further questions. A single mailing list handles developers, new users and old users and has about 0-20 postings a day. You don't have to join the mailing list to read the archives. If you want to post questions, then you have to join. If you aren't subscribed and you post (or you post from an unsubscribed address), you'll get a reply saying that your posting is "awaiting moderator approval". It isn't; because of the volume of spam, we no longer review these messages - they're deleted.

Mailing list: posting to Please send e-mail with straight ascii (not html) and turn line-wrap on (some mails come with each paragraph on a single long line). If you're stuck with posting from a Windows machine or Lotus notes, or using Lookout, where each paragraph is sent as one line:

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 09 Jul 2004

    Global Settings -> Internet Message Format -> Default (or the one used) -> Advanced -> word wrap

as shown in fixing outlook (http://www.lemis.com/email/fixing-outlook.html), especially in word wrap (http://www.lemis.com/email/exchsrvr-wordwrap.gif), but this is a very old version of exchange.

Please don't turn on your vacation message, intended only for your work mates, for messages from a list. The LVS mailing list doesn't want to know.

Dan Moljar Aug 2004 For Lotus Notes: The client is not configured correctly. In the 'Out of Office' enable dialog under the 'Exceptions' tab, there is a check box for 'Do not reply to Internet Addresses'. Check it. The server shouldn't do it to begin with, but you can make the client stop.

There are always new ideas and questions being posted on the mailing list. We don't expect this to stop. There are many complexities to LVS and we don't expect new people to understand any more about LVS than we did when we started. No-one is expected to know/understand everything in the docs but your questions will be better received, if you've done your homework, if you have set up the test configurations here, have at least perused this HOWTO (yes we know it's big), and have looked at the mail archives.

We can't help you if you just tell us that you've read the documents and your LVS doesn't work. To you, all problems look the same ("it doesn't work"). To help you, we need more information. We at least need the forwarding method, the service(s) being forwarded, the number of networks and the output of ipvsadm in the problem state.

Before you come up on the mailing list -
- Read the LVS-HOWTO (this document) and the LVS-mini-HOWTO.
- Set up a simple LVS (3 nodes: client, director, realserver) with LVS-DR or LVS-NAT forwarding, with the service telnet using the instructions in the LVS-mini-HOWTO. You should be able to do this starting from a freshly downloaded kernel from ftp.kernel.org and the LVS patches (ipvs and the hidden patch if you have 2.4.x realservers).
- Don't set up first with http, with filter rules, with firewalls, with complicated file systems (e.g. coda, nfs) or network accelerators - debug all these nifty things after you have LVS working with telnet and with no filter rules.
- Do use standard compilers (gcc-2.95.3), tools and utilities (ifconfig or iproute2). Do not use non-standard tools particular to a distribution designed to capture market share (e.g. ifup).
- If you are using one of the packages that can be used with LVS (e.g. heartbeat from the Linux HA project http://www.henge.com/~alanr/ha, or piranha from Redhat), again we may know what the problem is, but they need the feedback that you can't get it to work, not us. Many of us are on each others' mailing lists and we try to help when we can, but the best people to handle the problem are the developers for each package.
- Consult the LVS mailing list archives.
- Use our jargon as best you can. The machine names will be client, director, realserver1, realserver2... IPs are CIP, VIP, RIP, DIP. If you do this, we won't have to translate "susanne" and "annie" to their functional names as we scan your posting.
- We need to know your kernel (e.g. 2.2.14) and the ip_vs patch that was applied to it (e.g. 0.9.11), and whether you are using LVS-DR, LVS-NAT or LVS-Tun.
- Tell us what you did, what you expected, what you got, and why that's a problem.

If you don't understand your problem well, here's a suggested submission format from Roberto Nibali ratz (at) tac (dot) ch

- System information, such as kernel, tools and their versions. Example:

    hog:~ # ipvsadm -L -n | head -1
    IP Virtual Server version 1.0.2 (size=4096)
    hog:~ # ipvsadm -h | head -1
    ipvsadm v1.13 2000/12/17 (compiled with popt and IPVS v1.0.2)

- Short description and maybe sketch of what you intended to set up. Example for LVS-DR:

      -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  192.168.1.10:80 wlc
      -> 192.168.1.13:80           Route   0      0          0
      -> 192.168.1.12:80           Route   0      0          0
      -> 192.168.1.11:80           Route   0      0          0

- The output from ifconfig from all machines (abbreviated, just need the IP, netmask etc), and the output from netstat -rn.
- What doesn't work. Show some output from tcpdump, ipchains/ip_tables, ipvsadm and kernlog. Later we may ask you for a more detailed configuration like routing table, OS-version or interface setup on some machines used in your setup.
- Tell us what you expected. Example:

    /proc/sys/net/ipv4/vs/debug_level && tail -f /var/log/kernlog
    tcpdump -n -e -i eth0 tcp port 80
    route -n
    netstat -an
    ifconfig -a

tcpdump listings are difficult to read. If you post one, please change the IPs to VIP, CIP, RIP1..n, DIP etc. Since you'll likely be on a switched network, tcpdump will only see packets to that NIC. Tell us which machine (director, realserver...) and the NIC (if there are two NICs on the machine) that it was run on.
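Relabelling the IPs in a tcpdump listing by hand is tedious, so here is a minimal Python sketch of one way to do it. Everything in it is an assumption for illustration: the function name label_ips, the role table (VIP and RIPs borrowed from the LVS-DR example above, plus a made-up CIP) and the sample line, which is hand-written in the style of tcpdump -n output rather than captured from a real run.

```python
import re

# Hypothetical address-to-role table; fill in your own IPs.
ROLES = {
    "192.168.1.10": "VIP",
    "192.168.1.11": "RIP1",
    "192.168.1.12": "RIP2",
    "192.168.1.13": "RIP3",
    "192.168.1.254": "CIP",
}

# A whole dotted quad, not preceded by a digit or dot and not followed by a
# digit, so that 192.168.1.1 is never matched inside 192.168.1.11.
DOTTED_QUAD = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?!\d)")


def label_ips(text, roles):
    """Replace every IP found in roles with its functional name; leave others alone."""
    return DOTTED_QUAD.sub(lambda m: roles.get(m.group(1), m.group(1)), text)


# tcpdump -n appends the port to the address as a fifth dotted field.
sample = "192.168.1.254.1025 > 192.168.1.10.80: S"
labelled = label_ips(sample, ROLES)
print(labelled)
```

The port numbers survive the substitution, so the sample line becomes "CIP.1025 > VIP.80: S". Addresses that are not in the table are left unchanged, which makes it easy to spot a machine you forgot to name.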

Bug Fixes It's wonderful to get an unsolicited bug fix. Please let us know what it does and why it's better than the current file. A new version of a file without any information about what it does, or what it fixes isn't much use to us.

Other load balancing solutions, GPL, opensource and commercial

Open Source and GPL solutions

Malcolm lists (at) loadbalancer (dot) org 23 Nov 2006 Willy Tarreau's written a nice article on load balancing, Making applications scalable with Load Balancing (http://1wt.eu/articles/2006_lb/), that covers layer 4 and layer 7 options. I still don't think layer 7 can ever give high availability, but it's a good read.

Ratz 23 Nov 2006 A very nice and to the point introduction. Willy, besides being a nice person and a good friend, is an excellent engineer with a lot of expertise in highly available, high performance and secure web services, networking and packet filtering and much more. It would be nice to have Willy contributing to/on this list as well :).

Malcolm lists (at) loadbalancer (dot) org 01 Feb 2007 HAProxy http://haproxy.1wt.eu is a tcp proxy (fast) but flexible enough to do cookie insertion and SNAT etc.

from lvs (at) spiderhosting (dot) com a list of load balancers

Brent Cook busterb (at) mail (dot) utexas (dot) edu 28 Mar 2002

There's the http://www.bsdshell.net/ HighUpTime (HUT) project (link dead Apr 2003). It's FreeBSD.

The HUT author, Sebastian Petit spe (at) selectbourse (dot) net has joined the LVS mailing list.

For L7 Switching see the DRWS project. Dec 2006: Alexandre Cassen, the author of keepalived, has written an L7 Switch at http://www.linux-l7sw.org.

BSD load balancing: Roberto Nibali ratz (at) tac (dot) ch 05 Nov 2003 As already mentioned by others, LVS will not work on FreeBSD as director due to the kernel part. Using FreeBSD on the RS is of course ok. The BSD folks have not shown bigger interest in adopting the LVS idea or parts of the code yet. If you're interested in load balancing and HA Solutions under FreeBSD, you could check out following links:

Gavin Henry ghenry (at) suretecsystems (dot) com 13 Jun 2005

ClusterIP by Harald Welte. What is the list's view on it?

Gavin Henry ghenry (at) suretecsystems (dot) com 13 Jun 2005 The man page for more recent versions of iptables says:

CLUSTERIP: This module allows you to configure a simple cluster of nodes that share a certain IP and MAC address without an explicit load balancer in front of them

Horms Been there, done that. Works, but is it necessary? LVS with up to 16 directors active (http://www.ultramonkey.org/papers/active_active/)

A set of postings on /. 2 Mar 2009 at Best Solution for HA and Network Loadbalancing (http://tech.slashdot.org/article.pl?sid=09/03/02/0231241) lists the following
- HAproxy
- Distributor
- Crossroads
- Pen
- Balance/BalanceNG
- Pound

List of Commercial Solutions Cahya Wirawan cwirawan (at) email (dot) archlab (dot) tuwien (dot) ac (dot) at 19 Feb 2004

I'm implementing proxy, smtp and webserver with LVS as local node, and I have tested it and it's running fine, but because someone from management section thinks that such an implementation is easy (just run setup.exe and everything is installed and ready to use), he pushed me to move the setup into production, and create another one as soon as possible. I want to tell him that such an implementation is not a trivial thing and needs time to set up and to test before we go into production. I want to show him a list of companies who have such complete solutions, so he can see the cost. Then he can understand that high availability and load balancing is not easy to set up, and will cost a lot of money if we buy a complete solution.

Vendors just rub their hands with glee on finding management like this - see my review of the book "The IBM Way" (http://www.austintek.com/book_reviews/the_ibm_way.html) for how IBM handles the situation.

Peter Mueller Prices at this level are negotiable. Who knows what you could pay?
- http://www.cisco.com/ - the old man on the LB-gig.
- http://www.f5.com/f5products/bigip/LB520/ - the second old man in the LB gig.
- http://www.suse.com/us/business/products/server/ - Suse has always been a big player in the Linux-HA world.
- http://www.redhat.com/software/rhel/purchase/ - they have clustering based on LVS, not sure about price. At this point you have to buy enterprise edition (or http://www.whiteboxlinux.com) to use the clustering software.
- http://www.ibm.com/ - always an option...
- http://www.dell.com/ - moving up in the datacenter world. I see lots of Dells now..
- http://www.ebay.com/ - see how much the gear is worth on the open market.
- http://www.linuxvirtualserver.org/ - $0.
There's plenty of people on list who can help you and your boss feel more comfortable with your setup. I'm sure if you posted something some people would be willing to help make you sleep better at night. BTW, you know about the http://www.ultramonkey.org/ and http://www.keepalived.org/ projects, right?

Radware Joe Oct 2005: From a presentation by Radware (http://www.radware.com/) given to North Carolina Systems Administrators (NCSA) (http://www.ncsysadmin.org/) on 10 Oct 2005. Unfortunately I was the guy getting the pizzas for the meeting, so I missed most of the talk (which I wanted to hear). Radware is used by Ebay and Accuweather. Radware has a NAT loadbalancing director that appears to function similarly to an LVS-NAT director. The servers can have private IPs. Radware's loadbalancing director is only a small part of their offering. Radware have boxes that filter based on packet content (looking for viruses) that sit in the flow of packets (possibly before the director, possibly after - didn't find this out). They have boxes which just handle SYN floods. They use SYN cookies and do a statistical analysis of the packets, letting some through to see which machines reply to the SYN-ACKs. Radware has a gui to control the loadbalancer, which can do things like shutting down some of the backend servers for new connections at some time in the future (e.g. at 10pm later that night), so that by 8am next morning these machines have few or no connections and can be taken offline for servicing. Much of their hardware is ASIC based. Health checking seems to be done from the director, and checks are made through to 3rd-Tier components of the backend servers (e.g. database machines behind the webservers that the client doesn't directly connect to). Each local NAT'ed load balancing setup is itself a member of a distributed DNS-based load balancer. So www.foo.net might have a loadbalanced set of servers in different sites e.g. London, New York, San Francisco and Tokyo. Each local setup has an authoritative nameserver for www.foo.net. The way it works is: a client in Scotland asks for the IP of www.foo.net; the client's nameserver doesn't know the IP and asks a rootserver for the machine authoritative for foo.net.
The rootserver has a list of 4 authoritative nameservers for foo.net and selects the next nameserver by round robin. If the next one in its list is in New York, it tells the client's nameserver to go query the nameserver in New York. The New York nameserver for foo.net measures the packet latency to the client's nameserver and then returns the VIP for www.foo.net associated with the New York installation of www.foo.net. The latency is propagated to the other foo.net nameservers (in Tokyo, London and San Francisco). Sometime later, after the client's nameserver has flushed the IP entry for www.foo.net from its cache, another (or the same) client using the same nameserver asks for the IP of www.foo.net again and this time the rootserver will possibly send the request to another of the sites (say London). The London machine already knows the latency from New York to the client (without knowing where the client is), and sees that its latency to the client is lower than the latency from New York to the client, and returns the IP of its copy of www.foo.net. The London nameserver also updates the latency tables at the other sites (New York, San Francisco and Tokyo). If the next nameserver request from the client site is sent to Tokyo, then the Tokyo machine updates the latency tables in all the other nameservers, and knowing that the latency is lowest to the London nameserver, returns the IP of www.foo.net in London. In this way the four nameservers accumulate the latencies to all nameservers in the world. This works provided that the latencies don't change a lot with time of day (or throughput). Presumably you could store successive latencies and pick the shortest as reflecting the true network distance. The amount of memory required to do this must be small - there can't be more than a million nameservers, can there? 1 million 8 bit latencies is not much to store in memory.
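The latency-sharing scheme described above can be sketched in a few lines of Python. This is only a toy model of the idea as I understood it from the talk, not Radware's implementation: the class and function names, the site names, the addresses (documentation ranges) and the latency values are all made up for illustration.

```python
from typing import Dict, Optional

# Hypothetical sites and their VIPs (documentation-range addresses).
SITE_VIPS = {
    "new_york": "192.0.2.10",
    "london": "198.51.100.10",
    "tokyo": "203.0.113.10",
}


class LatencyTable:
    """Latencies (ms) from each site to each client nameserver, shared by all sites."""

    def __init__(self) -> None:
        # client nameserver IP -> {site name -> latency in ms}
        self._table: Dict[str, Dict[str, int]] = {}

    def record(self, client_ns: str, site: str, latency_ms: int) -> None:
        # Clamp to 8 bits, matching the "8 bit latencies" memory estimate.
        self._table.setdefault(client_ns, {})[site] = min(latency_ms, 255)

    def best_site(self, client_ns: str) -> Optional[str]:
        known = self._table.get(client_ns)
        if not known:
            return None
        return min(known, key=known.get)


def answer_query(table: LatencyTable, site: str, client_ns: str, measured_ms: int) -> str:
    """The queried site measures, shares its latency, then returns the closest known VIP."""
    table.record(client_ns, site, measured_ms)
    return SITE_VIPS[table.best_site(client_ns)]


table = LatencyTable()
first = answer_query(table, "new_york", "10.1.1.53", 90)   # only New York is known
second = answer_query(table, "london", "10.1.1.53", 20)    # London is closer
third = answer_query(table, "tokyo", "10.1.1.53", 250)     # Tokyo hands out London's VIP
print(first, second, third)
```

At one byte per entry, a million client nameservers cost each site about 1MB for its own column of the table, which agrees with the estimate that the memory needed is small.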
Although I didn't get to ask how it works, if a client winds up at a more distant site (network wise), then http redirects will send the client to a closer site. Radware SSL accelerators: When I commented to the speaker that the main reason to use SSL accelerators is financial, i.e. to only have one copy of the certificate, rather than one on each realserver, they said "it's also for certificate management". Presumably some sites have large numbers of certificates. (They didn't disagree with my statement.) The SSL accelerators in the Radware design don't sit between the director and the realservers (or in front of the director i.e. between the client and the director), but sit at the same level as the other realservers. The https request is balanced by the director to an accelerator, which decrypts the packets and sends the decrypted packet back to the director for loadbalancing as http traffic. Since the director is a NAT balancer, the return http traffic from the http servers goes back through the director, and then recursively back to the SSL accelerator, then back to the director as https traffic and then back to the client. Being able to have the SSL accelerator as a realserver in LVS would require the realservers to be a client of the director, something that we can do for LVS-NAT, but not for LVS-DR. This is not a capability that we've paid much attention to for LVS. If you need a realserver to be in the path in both inward and outward directions (like an SSL accelerator) then you will have to use LVS-NAT. Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 12 Oct 2005 Note that we removed our Radware appliance to use LVS instead. Load Balancing using DNS is _evil_, especially with mobile internet and all those misconfigured operator gateways. Most mobile gateways are written in Java, and I'm probably the only one who has read the java.security file.
Just have a look at the ugly stuff you can find in it and the unbelievably silly explanation given: For security reasons! Guys! Well. So we removed radware. Note that we had other problems with radware. The DNS cache of the clients was one, the response time of the DNS was another. Several technical issues when you reach some traffic limits were the last. Henrik Holst

still, geographic load balancing would be very nice to have and I cannot figure out another way to do it than involve DNS round-robin.

Francois

Round-Robin DNS could work if

1. You have enough clients
2. Clients are using DNS as expected
3. Clients are dealing with TTL
4. Client DNS caches or provider DNS are honouring DNS TTL
5. All your sites are always up and working (you can't use a DNS solution for failover)

My clients are mobile phones, basically points 1 to 4 are not OK :). And I have to deal with multiple sources for the same client (the transaction begins in the gallery gateway and continues in the standard surf gateway, and I have to use fwmarks to keep the session)... We used RadWare to try to load-balance between our two peers. It clearly was not working. Unfortunately, I don't have all the details.

Horms

If you want to distribute traffic between hosts that have fast, reliable links, like a LAN, then LVS is a good option. No, an excellent option. If you want to distribute traffic between geographically separated hosts, then you don't want something like LVS that channels packets through a single location then to another. Something DNS based is probably the way to go - though round robin is not nearly smart enough for my liking. In practice, if you do have geographically distributed sites, then each site should probably be an LVS cluster. So essentially you end up using two techniques to solve different parts of the same problem. I wrote quite a lot of this on supersparrow.org once upon a time, it's still there if people want to read/play/enhance/. (links through ).

User's view of Radware, F5 bak bak (at) picklefactory (dot) org 09 Jan 2007 I've used Radware, F5, and HP SAs as an admin. My 2-minute executive overview take: Radware is great for a switch-like, low-key experience. They're relatively cheap for hardware load balancers. You get extra functionality like SSL and link balancing with extra bits of hardware. Sometimes they can be pretty hard to troubleshoot. If you want global balancing/failover, that's part of all their "AS" type switches. F5 is the other 'big name' option. These boxes are more like Brocade switches: it's running embedded Linux in there, and if you want to run tcpdump, you can. You get extra functionality by buying a 'bigger' box and then paying F5 for more licensing. If you want to do global balancing/failover, you have to get one of their DNS devices. If you have money to wave around, I've found both Radware and F5 are more than happy to give you a demo unit for 2-4 weeks.

Books on LVS Karl Kopper has tackled this. Writing a book on a moving target like LVS is a difficult proposition - certainly more than I was prepared to take on. The book is available at your usual suppliers. I'm loath to mention the names of internet booksellers who require your e-mail address as part of your purchase, so that they can spam you later. I've been buying my books by phone at a marginally higher price since realising their business practices. However recently (Jul 2004) I've discovered disposable e-mail addresses e.g. the free service from Jetable.org (http://www.jetable.org/). They have a google-like (i.e. simple) interface. You give them your e-mail address, the required lifetime of the address (1-8days), and click. Up comes an e-mail address (test by sending a message to it) that you can give to your internet vendor, and mail will be forwarded to you for the period selected. After that time, no more mail will get to you. I've been using jetable since Jul 2004 (now Sep 2004) and have not got any spam from Jetable or from internet vendors.

LVS in the news "Wired" Magazine in Jun 2004 has a small article about LVS, illustrating the multinational cooperative nature of GPL software development. The page is here (http://www.wired.com/wired/archive/12.06/images/atlas_software.pdf), or a local copy of the article on this server.

Software/Information/HOWTOs useful/related to LVS Ultra Monkey is LVS and HA combined. tong tong (at) csusb (dot) net 25 Jun 2003 Here's a step-by-step guide for setting up an LVS system with heartbeat (http://www.cula.net/cluster). This guide was published a year ago and we've only just heard about it. The author has never popped up on the mailing list to say hello. from lvs (at) spiderhosting (dot) com Super Sparrow Global Load Balancing using BGP routing information. Ratz is documenting the 2.6 headers and calls with doxygen (http://www.drugphish.ch/~ratz/IPVS/index.html) whenever he has reason to fiddle with a piece of code (i.e. the documentation isn't exhaustive, at least yet). From ratz, there's a write up on load imbalance with persistence and sticky bits at our friends at M$. From ratz, Zero copy patches to the kernel to speed up network throughput, Dave Miller's patches, Rik van Riel's vm-patches and more of Rik van Riel's patches at http://www.linux-mm.org/ (link dead Dec 2003). The Zero copy patches may not work with LVS and may not work with netfilter either (from john (at) antefacto (dot) com). From Michael Brown michael_e_brown (at) dell (dot) com, the TUX kernel level webserver. Dustin Puryear dustin (at) puryear-it (dot) com gave a talk on LVS at LISA 2003. The tutorial is available at: LVS: Load Balancing and High Availability for Free (http://www.puryear-it.com/publications.htm#6).

LVS: What is an LVS? Can I use an LVS? A Linux Virtual Server (LVS) is a cluster of servers which appears to be one server to an outside client. This apparent single server is called here a "virtual server". The individual servers (realservers) are under the control of a director (or load balancer), which runs a Linux kernel patched to include the ipvs code. The ipvs code running on the director is the essential feature of LVS. Other user level code is used to manage the LVS (set rules for services handled, handle failover). The director is basically a layer 4 router with a modified set of routing rules (i.e. connections do not originate or terminate on the director, it doesn't send ACKs etc, it's just a router). When a new connection is requested from a client to a service provided by the LVS (e.g. httpd), the director will choose a realserver for the client. From then on, all packets from the client will go through the director to that particular realserver. The association between the client and the realserver will last for only the life of the tcp connection (or udp exchange). For the next tcp connection, the director will choose a new realserver (which may or may not be the same as the first realserver). Thus a web browser connecting to an LVS serving a webpage consisting of several hits (images, html page), may get each hit from a separate realserver. Since the director will send the client to an arbitrary realserver, the services must be either read only (e.g. web services) or if read/write (e.g. an on-line shopping cart) some mechanism external to LVS must be provided for propagating the writes to the other realservers on a timescale appropriate for the service (i.e. purchase of an item must decrement the stock on all other nodes before the next client attempts to purchase the same item). At best LVS is read mostly.
If you just want one of several nodes to be up at any one time, and the other node(s) to become active on failure of the primary node, then you don't need LVS: you need a high availability setup e.g. Linux-HA (heartbeat), vrrp or carp. If you want independent servers at different locations, then you want a geographically distributed server like Supersparrow. Here are some

Management of the LVS is through the user space utility ipvsadm, which is used to add/remove realservers/services and to handle failout. LVS itself does not detect failure conditions; these are detected by external agents, which then update the state of the LVS through ipvsadm.
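The per-connection behaviour described in this section (a realserver is chosen when a connection starts, and every later packet of that connection goes to the same realserver) can be illustrated with a toy model. This is a minimal sketch in Python, not the ipvs code: the `Director` class, the realserver names, the addresses and the round-robin choice are assumptions made for the example.

```python
import itertools
from typing import Dict, List, Tuple

# (client IP, client port, VIP, VIP port) identifies one connection.
ConnKey = Tuple[str, int, str, int]


class Director:
    """Toy director: picks a realserver per connection, then sticks to it."""

    def __init__(self, realservers: List[str]) -> None:
        self._next_realserver = itertools.cycle(realservers)  # round robin
        self._connections: Dict[ConnKey, str] = {}

    def forward(self, key: ConnKey) -> str:
        # First packet of a connection: choose a realserver and remember it.
        if key not in self._connections:
            self._connections[key] = next(self._next_realserver)
        # Later packets of the same connection: reuse the remembered choice.
        return self._connections[key]

    def close(self, key: ConnKey) -> None:
        # The association lasts only for the life of the connection.
        self._connections.pop(key, None)


director = Director(["rs1", "rs2"])
html_fetch = ("198.51.100.7", 40001, "192.0.2.1", 80)
image_fetch = ("198.51.100.7", 40002, "192.0.2.1", 80)

print(director.forward(html_fetch))   # first packet of the connection
print(director.forward(html_fetch))   # same connection, same realserver
print(director.forward(image_fetch))  # new connection from the same client
```

The two fetches come from the same client but are separate tcp connections, so they can land on different realservers, which is why the content on the realservers has to be the same.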

What is a VIP? The director presents an IP called the Virtual IP (VIP) to clients. (When using , VIPs are aggregated into groups of IPs, but the same principles apply as for a single IP). When a client connects to the VIP, the director forwards the client's packets to one particular realserver for the duration of the client's connection to the LVS. This connection is chosen and managed by the director. The realservers serve services (eg ftp, http, dns, telnet, nntp, smtp) such as are found in /etc/services or inetd.conf. The LVS presents one IP on the director (the virtual IP, VIP) to clients. Peter Martin p (dot) martin (at) ies (dot) uk (dot) com and John Cronin jsc3 (at) havoc (dot) gtf (dot) org 05 Jul 2001

The VIP is the address which you want to load balance i.e. the address of your website. The VIP is usually a secondary address so that the VIP can be swapped between two directors on failover (the VIP used to be an alias, e.g. eth0:1). The VIP is the IP address of the "service", not the IP address of any of the particular systems used in providing the service (ie the director and the realservers). The VIP can be moved from one director to another backup director if a fault is detected (typically this is done by using mon and heartbeat, or something similar). The director can have multiple VIPs. Each VIP can have one or more services associated with it e.g. you could have HTTP/HTTPS balanced using one VIP, and FTP service (or whatever) balanced using another VIP, and calls to these VIPs can be answered by the same or different realservers. Groups of VIPs and/or ports can be set up with . The realservers have to be configured to work with the VIPs on the director (this includes handling the ). There can be issues if you are using cookies or https, or anything else that expects the realserver fulfilling the requests to have some connection state information. This is also addressed on the LVS persistence page

Where do you use an LVS?

- For higher throughput. The cost of increasing throughput by adding realservers in an LVS increases linearly, whereas the cost of increased throughput by buying a larger single machine increases faster than linearly.
- For redundancy. Individual machines can be switched out of the LVS, upgraded and brought back on line without interruption of service to the clients. Machines can be moved to a new site and brought on line one at a time while machines are removed from the old site, without interruption of service to the clients.
- For adaptability. If the throughput is expected to change gradually (as a business builds up), or quickly (for an event), the number of servers can be increased (and then decreased) transparently to the clients.

Client/Server relationship is preserved in an LVS The client sees only one IP address and believes it is connecting to a single machine. The IPs of all servers are mapped to one IP (the VIP). The client is connected to only one machine at a time; however, subsequent connections will be assigned to a new and likely different machine. Servers at different IP addresses believe they are contacted directly by the client.

LVS director is an L4 switch In the computer bestiary, the director is a layer 4 (L4) switch. The director makes decisions at the IP layer and just sees a stream of packets going between the client and the realservers. In particular an L4 switch makes decisions based on the IP information in the headers of the packets. Here's a description of an L4 switch from Super Sparrow Global Load Balancer documentation

Layer 4 Switching: Determining the path of packets based on information available at layer 4 of the OSI 7 layer protocol stack. In the context of the Internet, this implies that the IP address and port are available as is the underlying protocol, TCP/IP or UDP/IP. This is used to effect load balancing by keeping an affinity for a client to a particular server for the duration of a connection.

This is all fine except Nevo Hed nevo (at) aviancommunications (dot) com 13 Jun 2001

The IP layer is L3.

Alright, I lied. TCPIP is a 4 layer protocol and these layers do not map well onto the 7 layers of the OSI model. (As far as I can tell the 7 layer OSI model is only used to torture students in classes.) It seems that everyone has agreed to pretend that tcpip uses the OSI model and that tcpip devices like the LVS director should therefore be named according to the OSI model. Because of this, the name "L4 switch" really isn't correct, but we all use it anyhow. The director does not inspect the content of the packets and cannot make decisions based on the content of the packets (e.g. if the packet contains a cookie, the director doesn't know about it and doesn't care). The director doesn't know anything about the application generating the packets or what the application is doing. Because the director does not inspect the content of the packets (layer 7, L7) it is not capable of session management or providing service based on packet content. L7 capability would be a useful feature for LVS and perhaps this will be developed in the future (preliminary ktcpvs code is out - May 2001 - ). The director is basically a router, with routing tables set up for the LVS function. These tables allow the director to forward packets to realservers for services that are being LVS'ed. If http (port 80) is a service that is being LVS'ed then the director will forward those packets. The director does not have a socket listener on VIP:80 (i.e. netstat won't see a listener). John Cronin jsc3 (at) havoc (dot) gtf (dot) org (19 Oct 2000) calls these types of servers (i.e. lots of little boxes appearing to be one machine) "RAILS" (Redundant Arrays of Inexpensive Linux|Little|Lightweight|L* Servers). Lorn Kay lorn_kay (at) hotmail (dot) com calls them RAICs (C=computer), pronounced "rake".
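To make the "L4 switch" idea concrete, here is a small Python sketch of what such a device looks at: the protocol, the two IP addresses and the two ports in the packet headers, and nothing in the payload. It is an illustration only, not how ip_vs is written; the function names and the demo addresses are made up, and the demo packet is deliberately minimal (checksums and most TCP header fields are left at zero).

```python
import socket
import struct


def l4_key(packet: bytes):
    """Return (protocol, src IP, src port, dst IP, dst port) for an IPv4 TCP/UDP packet."""
    header_len = (packet[0] & 0x0F) * 4   # IHL field counts 32-bit words
    protocol = packet[9]                  # 6 = TCP, 17 = UDP
    src_ip = socket.inet_ntoa(packet[12:16])
    dst_ip = socket.inet_ntoa(packet[16:20])
    # TCP and UDP headers both start with source port, then destination port.
    src_port, dst_port = struct.unpack("!HH", packet[header_len:header_len + 4])
    return protocol, src_ip, src_port, dst_ip, dst_port


def build_demo_packet(payload: bytes) -> bytes:
    """Build a minimal IPv4+TCP packet for the demo (not a valid packet on the wire)."""
    tcp = struct.pack("!HH", 40001, 80) + bytes(16) + payload
    ip = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(tcp), 0, 0, 64, socket.IPPROTO_TCP, 0,
        socket.inet_aton("198.51.100.7"), socket.inet_aton("192.0.2.1"),
    )
    return ip + tcp


with_cookie = build_demo_packet(b"GET / HTTP/1.0\r\nCookie: id=1\r\n\r\n")
without_cookie = build_demo_packet(b"GET / HTTP/1.0\r\n\r\n")

print(l4_key(with_cookie))
# The cookie is in the payload, so it makes no difference to the L4 view.
print(l4_key(with_cookie) == l4_key(without_cookie))
```

Anything that needs the cookie (session management, content-based service) would have to parse the payload, which is the L7 job the director does not do.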

LVS forwards packets to realservers The director uses 3 different methods of forwarding.

- LVS-NAT based on network address translation (NAT)
- LVS-DR (direct routing) where the MAC addresses on the packet are changed and the packet forwarded to the realserver
- LVS-Tun (tunnelling) where the packet is IPIP encapsulated and forwarded to the realserver.

Some modification of the realserver's ifconfig and routing tables will be needed for LVS-DR and LVS-Tun forwarding. For LVS-NAT the realservers only need a functioning tcpip stack (i.e. the realserver can be a networked printer). LVS works with all services tested so far (single and 2 port services) except that LVS-DR and LVS-Tun cannot work with services that initiate connects from the realservers (so far; identd and rsh). The realservers can be identical, presenting the same service (eg http, ftp) working off file systems which are kept in sync for content. This type of LVS increases the number of clients able to be served. Or the realservers can be different, presenting a range of services from machines with different services or operating systems, enabling the virtual server to present a total set of services not available on any one server. The realservers can be local/remote, running Linux (any kernel) or other OS's. Some methods for setting up an LVS have fast packet handling (eg LVS-DR which is good for http and ftp) while others are easier to set up (eg transparent proxy) but have slower packet throughput. In the latter case, if the service is CPU or I/O bound, the slower packet throughput may not be a problem. For any one service (eg httpd at port 80) all the realservers must present identical content since the client could be connected to any one of them and over many connections/reconnections, will cycle through the realservers. Thus if the LVS is providing access to a farm of web, database, file or mail servers, all realservers must have identical files/content.
You cannot split up a database amongst the realservers and access pieces of it with LVS. The simplest LVS to set up involves clients doing read-only fetches (e.g. a webfarm). If the client is allowed to write to the LVS (e.g. database, mail farm), then some method is required so that data written on one realserver is transferred to other realservers before the client disconnects and reconnects again. This need not be all that fast (you can tell them that their mail won't be updated for 10mins), but the simplest (and most expensive) is for the mail farm to have a common file system for all servers. For a database, the realservers can be running database clients which connect to a single backend database, or else the realservers can be running independent database daemons which replicate their data.
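The difference between the three forwarding methods listed at the start of this section is easiest to see by looking at which fields of the packet the director touches. The following Python sketch is a simplified model under stated assumptions, not the kernel code: the `Packet` fields, the function names, the MAC labels and the addresses are invented for the example, ports are left out, and RIP and DIP stand for the realserver's and director's own IPs.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class Packet:
    dst_mac: str
    src_ip: str
    dst_ip: str
    inner: Optional["Packet"] = None  # set only for an IPIP-encapsulated packet


def forward_nat(pkt: Packet, rip: str, rs_mac: str) -> Packet:
    # LVS-NAT: the destination IP is rewritten from the VIP to the realserver's IP.
    return replace(pkt, dst_ip=rip, dst_mac=rs_mac)


def forward_dr(pkt: Packet, rs_mac: str) -> Packet:
    # LVS-DR: only the MAC address changes; the IP header still carries the VIP.
    return replace(pkt, dst_mac=rs_mac)


def forward_tun(pkt: Packet, dip: str, rip: str, next_hop_mac: str) -> Packet:
    # LVS-Tun: the untouched original packet is wrapped in a new IP header (IPIP).
    return Packet(dst_mac=next_hop_mac, src_ip=dip, dst_ip=rip, inner=pkt)


VIP, DIP, RIP = "192.0.2.1", "10.0.0.1", "10.0.0.2"
from_client = Packet(dst_mac="director-mac", src_ip="198.51.100.7", dst_ip=VIP)

nat = forward_nat(from_client, RIP, "realserver-mac")
dr = forward_dr(from_client, "realserver-mac")
tun = forward_tun(from_client, DIP, RIP, "next-hop-mac")

print(nat.dst_ip, dr.dst_ip, tun.dst_ip, tun.inner.dst_ip)
```

With LVS-DR and LVS-Tun the packet that reaches the realserver is still addressed to the VIP, which is why those two methods need changes to the realserver's interfaces and routing, while an LVS-NAT realserver just sees a packet addressed to its own IP.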

LVS runs on Linux and FreeBSD directors LVS was developed on Linux and historically uses a Linux director. The Intel and DEC Alpha versions of LVS are known to work. The LVS code doesn't have any Intel specific instructions and is expected to work on any machine that runs Linux. In Apr 2005, LVS was ported to FreeBSD by Li Wang dragonfly (at) linux-vs (dot) org 2005/04/16. The URL is: FreeBSD port of LVS (http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm). Here's a performance test on FreeBSD (version 0.4.0) (http://dragon.linux-vs.org/~dragonfly/software/doc/ipvs_freebsd/performance.html).

Code for LVS is different for each kernel series There are differences in the coding for LVS for the 2.0.x, 2.2.x, 2.4.x and 2.6.x kernels. Development of LVS on 2.0.36 kernels has stopped (May 99). Code for 2.6.x kernels is relatively new. The 2.0.x and 2.2.x code is based on the masquerading code. Even if you don't explicitly use ipchains (eg with LVS-DR or LVS-Tun), you will see masquerading entries with `ipchains -M -L` (or `netstat -M`). Code for 2.4.x kernels was rewritten to be compatible with the netfilter code (i.e. its entries will show up in netfilter tables). It is now production level code. Because of incompatibilities with LVS-NAT for 2.4.x LVS was in development mode (till about Jan 2001) for LVS-NAT.

kernels from 2.4.x series are SMP for kernel code 2.4.x kernels are SMP for kernel code as well as user space code, while 2.2.x kernels are only SMP for user space code. LVS is all kernel code. A dual CPU director running a 2.4.x kernel should be able to push packets at twice the rate of the same machine running a 2.2 kernel (if other resources on the director don't become limiting). (Also see the section on .)

OS for realservers You can have almost any OS on the realservers (all are expected to work, but we haven't tried them all yet). The realservers only need a tcpip stack - a networked printer can be a realserver.

LVS works on ethernet LVS works on ethernet. There are some limitations on using ATM. Firewire: (from the Beowulf mailing list - Donald Becker 5 Dec 2002): The firewire transport layer (IEEE1394) does run IP over FireWire. However firewire is designed for fixed size repeated frames (video or continuous disk block reads), but has overhead for other communication. Throughput is 400Mbps but worst case latency is high (msec range). Oracle has released GPL libraries for clustering Linux boxes over FireWire (http://www.ultraviolet.org/mail-archives/beowulf.2002/2977.html, link dead Dec 2003).

LVS works on IPv6 Seiji Tsuchiike tsuchiike (at) yggr-drasill (dot) com 02 Jun 2002

We just implemented IPv6 to lvs. We think that Basic Mechanism is same. (http://www.yggr-drasill.com/LVS6/documents.html. link dead Dec 2003, but Sep 2004 Horms says it's alive; Joe Dec 2006 it's alive).

LVS is continually being developed LVS is continually being developed and usually only the more recent kernel and kernel patches are supported. Usually development is incremental, but with the 2.2.14 kernels the entries in the /proc file system changed and all subsequent 2.2.x versions were incompatible with previous versions.

LVS is 64 bit Kenny Chamber

Has anybody here successfully setup lvs-director on sparc64 machine? I need to know which distro is OK for this.

Ratz 16 Dec 2004 Yes. Just recently. Debian is fine, I reckon Gentoo would do as well. INFO: It could be that your ipvsadm binary that comes to instrument the kernel tables for LVS is broken with regard to 64bit'ness. You then need to download the latest sources and recompile adding '-m64' to the CFLAGS. That's all, other than that it seems to work nicely. Btw: I took Debian testing, probably not too wise but on the other hand I needed more up to date tools. I wouldn't know of too many other Distros that have up to date Sparc64 support. Suse used to have, but they dropped support a while ago unfortunately. Justin Ossevoort justin (at) snt (dot) utwente (dot) nl 16 Dec 2004 Well our plain debian-sarge here did it just as painlessly as our x86 based machines. So as long as your distro has ipvs (and of course a sparc tree ;)) support you're in the green. liuah

I want to know whether LVS can work with 64-bit boxes. If I use LVS-DR, how can I apply the hidden patch to 64-bit linux, using kernel is 2.4.18?

ratz 29 Nov 2003 Yes. The only problem I see is if either the counters or the hashtable handling has some bug with 32/64-bit signedness and wrong shift operators. Just let us know if you experience flakiness on your director ;). The hidden patch for your kernel is: http://www.ssi.bg/~ja/hidden-2.4.19pre5-1.diff I hope you are aware of the fact that 2.4.18 is really buggy in many ways. I know that some 64-bit archs have been lagging behind in the 2.4.x tree but if I was you I would upgrade to a newer kernel. Peter Mueller pmueller (at) sidestep (dot) com 29 Nov 2003 The one for straight 2.4.18 is http://www.ssi.bg/~ja/hidden-2.4.5-1.diff. Since he said 2.4.18 I would suspect he's running Debian. If you want a Debian kernel with LVS+hidden use the Ultramonkey kernel (http://www.ultramonkey.org/). liuah liuah (at) langchaobj (dot) com (dot) cn 02 Dec 2003 The hidden patch compiles and runs on our 64-bit servers successfully.

Other documentation For more documentation, look at the LVS web site (eg a talk I gave on how LVS works on 2.0.36 kernel directors). Julian has written Netparse for which we don't have a lot of documentation yet. For those who want more understanding of netfilter/iptables etc, here are some starting places. These topics are also covered in many other places.

- Harald Welte (of the netfilter team) description of what happens to a packet under 2.4
- Harald Welte (of the netfilter team) conntrack HOWTO. Conntrack is used in filter rules as a way of accepting "related" packets, e.g. the data packets associated with an established ftp connection. Regular filter rules written for these data packets would accept ftp data packets (port 20) even if they were not in response to a PORT call from an already established ftp connection on port 21. In this case the filter rules would accept packets that are part of a DoS attack. Conntrack is CPU intensive and lowers throughput (see effect of conntrack on throughput). To disable conntrack you have to rmmod all the conntrack modules.
- the docs/FAQs/HOWTOs on the netfilter site
- Linux Network Administrators Guide

LVS is not simple to install, get going or keep running This is not a utility where you run ./configure && make && make check && make install, put a few values in a *.conf file and you're done. LVS rearranges the way IP works so that a router and server (here called director and realserver), reply to a client's IP packets as if they were one machine. You will spend many days, weeks, months figuring out how it works. LVS is a lifestyle, not a utility. That said, you should be able to get a simple LVS-NAT setup working in a few hours without really understanding a whole lot about what's going on (see the LVS-mini-HOWTO).

LVS Control (Failure, Thundering Herd, Sorry Servers) LVS is kernel code (ip_vs) and a user space controller (ipvsadm). When adding functionality to LVS (handling failover, bringing new machines on-line), we have to figure the best place to put it: in the kernel code or in the user space code? Such decisions are relevant if you can choose from two equally functional lots of code - we usually get what the coder wanted to implement. Current thinking is to make the kernel code just handle the switching and to have all control in user space. Should the be controlled by LVS or by an external user space program e.g. feedbackd. Currently there is both a kernel patch and a script to change from rr to lc shortly after adding a new realserver. An alternative (not implemented) would be a scheduler that is rr when there's a large difference in the number of connections to the different realservers and lc when the number of connections is similar. LVS supplies high throughput using multiple identically configured machines. You would like to be able to swap out machines for planned maintenance and to automatically handle node failure (high availability). The LVS itself does not provide high availability. The current thinking is that the software layer that provides high availability should be logically separate to the layer that it monitors. The writing of software that attempts to determine whether a machine is working, is somewhat of a black art. There are several packages used to help provide high availability for LVS and these are discussed in the High Availability LVS section. While it is relatively easy to monitor the functionality of the realservers, fail-out of directors is more difficult. An even greater problem is handling failure of nodes which are holding state information. There is a sorry server option in Gustavo

I'm trying to create a sorry server for clients that can't connect to my real servers (limited by u-threshold); ServerA - 100 conn, ServerB - 110 conn. When this limit is reached I want my clients to go to a lighttpd served page saying "come back later" I'm trying with weights and thresholds... but it's not working the way I thought.

Ratz 22 Nov 2006 I suspect the clients scheduled for the sorry server never return back to the cluster, right (only if you use persistency of course)?

That's right. I'm working on a project for an airline company. Sometimes they post some promotional tickets for a small period of time (only passengers that buy on the website can have them) and the load on the servers goes high.

I've written a patch for the 2.4 kernel series extending IPVS to support the concept of an atomically switching sorry server environment. Unfortunately I didn't have the time to port the work to 2.6 kernels yet (the threshold stuff is already in but a bit broken and the sorry server stuff needs some adjustments in the 2.6 kernel). If you run 2.4 on your LB, you could try out the patches posted to this list almost exactly one year ago: The fix to the kernel patch above: And the 3/4 cut-off fix: I personally believe that the sorry-server feature is a big missing piece of framework in IPVS, one that is implemented in all commercial HW load balancers. Horms

That is true, but it's also a piece that is trivially implemented in user-space, where higher-level monitoring is usually taking place anyway. Is there a strong argument for having it in the kernel?

ratz 14 Feb 2007 Yes, it won't work reliably in user space because of missing atomicity. Between the point where the user space daemon decides that it's time to switch over to the sorry-server pool and the actual switch in the kernel (made by modifying the corresponding service flag), there is a time frame of a couple of microseconds to milliseconds in which the kernel TCP stack will happily proceed with its normal tasks, including servicing more requests to the service that has already been selected for sorry-server forwarding. This can lead to broken (half-shown) page views on the customer's side inside their browser. In the field where I had to implement load balancing, this was simply not accepted, especially because it irritated our customer's clients and also because everybody knew that HW load balancers do it right (tm). YMMV and I still haven't sat down and forward ported my code to 2.6, but I first need interest from enough people before I start :).

I wrote the 2.4 server pool implementation for a ticket reseller company that probably had the same problems as your airline company: normal selling activities that don't need high end web servers, and then from time to time (in your case promotional tickets, in my case Christina Aguilera, U2 or Robbie Williams or World Soccer Championship tickets) peak selling where tickets need to be sold in the first 15 minutes with tens of thousands of requests per second, plus the illicit traffic generated by scripters trying to sanction the event. These peaks, however, do not warrant the acquisition of high-end servers, and on-demand servers cannot be organized/prepared so quickly.

I need to manually limit each server capacity and the remaining connections need to go to this sorry server.

That's exactly the purpose of my patch, plus you get to see how many connections (persistent as in session and active/passive connections) are forwarded to either the normal webservers (so long as they are within the u_thresh and l_thresh) or the overflow (sorry server) pool. As soon as one of the RS in the serving pool drops below l_thresh future connection requests are immediately sent to the service pool again.

We have tried F5 Big-IP for a while and it worked perfectly, but it is very expensive for us :(

Yep, about USD 20k-30k to have them in a HA-pair. So for the 2.4 kernel, I have a patch that has been tested extensively and has been running in production for one year now, having survived some hype events. I don't know if I will find time to sit down for a 2.6 version. Anyway, as has been suggested, you can also try the sorry server of keepalived, however I'm quite sure that this is not atomic (since keepalived is user space) and works more like:

    ... u_thresh then quiesce RS
    if RS.isQuiesced and RS.conns < l_thresh then {
        if sorry server active then remove sorry server
        set RS.weight to old RS.weight
    }
    if sum_weight of all RS == 0 then invoke sorry server with weight > 0
    }

If this is the case, it will not work for our use cases with high peak requests, since sessions are not switched to either service pool atomically. This will result in people being sent to the overflow pool even though they would have had a legitimate session, and others will get broken pages back, because in the midst of the page view the LB's user space process gets a scheduler call to update its FSM, so further requests (sent for HTTP 1.0, for example) will be broken. The browser hangs on your customer's side and your management gets the angry phone calls from the business users, to whom you had promised B2B access. This is roughly how I came around to implementing the server overflow (spillover server, sorry server) functionality for IPVS.
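The user space behaviour Ratz describes can be sketched as one pass of a polling loop. This is only an illustration, not keepalived's code: the beginning of the pseudocode is missing above, so the "conns > u_thresh" test is an assumption, and the names RealServer, effective_weight and poll_once are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class RealServer:
    name: str
    weight: int             # configured (non-quiesced) weight
    conns: int = 0          # current number of connections
    quiesced: bool = False  # True while the effective weight is 0


def effective_weight(rs):
    """A quiesced realserver accepts no new connections (weight 0)."""
    return 0 if rs.quiesced else rs.weight


def poll_once(servers, u_thresh, l_thresh, sorry_active):
    """One pass of a user space monitor; returns the new sorry server state."""
    for rs in servers:
        if not rs.quiesced and rs.conns > u_thresh:
            # Assumed comparison: over the upper threshold, stop new connections.
            rs.quiesced = True
        elif rs.quiesced and rs.conns < l_thresh:
            # Back under the lower threshold: restore weight, drop sorry server.
            rs.quiesced = False
            sorry_active = False
    if sum(effective_weight(rs) for rs in servers) == 0:
        # No realserver accepts new connections: bring in the sorry server.
        sorry_active = True
    return sorry_active
```

Between two calls of poll_once the kernel keeps scheduling with the old weights, which is the window Ratz objects to; a kernel implementation would make the pool switch as part of the scheduling decision itself.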

clients on realservers

Sometimes you want a client process that has nothing to do with LVS to connect to machines outside the LVS. The client could be fetching DNS, running telnet/ssh or sending mail for logging. None of these client calls are part of the service being balanced by LVS. This is covered elsewhere in this HOWTO.

Sometimes the LVS'ed service (e.g. http) will fire up a client process (e.g. filling in a webpage will result in the realserver writing to a database). People then want to loadbalance these client calls coming from the RIPs through a VIP on the same director. This is also covered elsewhere in this HOWTO.

LVS: Install, Configure, Setup

Installing from Source Code

Doing this from source code is now described in the LVS-mini-HOWTO. Two methods of setup are described:

Setup from the command line. This is fine for understanding what's going on, and if you only want a single type of setup. For LVSs which you're reconfiguring a lot, it's tedious and mistake prone. If it doesn't work, you will spend some time figuring out why.

From a configure script which sets up an LVS with a single director. This script is fine for initial setups: it's mistake proof (it will give you enough information about failures to figure out what might be wrong) and I used it for all my testing of LVS. Since it's not easily expandable to handle director failover, and other configuration tools handle this now, the configure script is not being developed anymore.

For production, where you need failover directors, you should use other setup tools or save your hand-built setup as a script (e.g. with ipvsadm-save).

Ultra Monkey

Ultra Monkey is a packaged set of binaries for LVS, including Linux-HA for director failover and ldirectord for realserver failover. It's written by Horms, one of the LVS developers. Ultra Monkey was used on many of the server setups sold by VA Linux and presumably made lots of money for them. Ultra Monkey has been around since 2000 and is mature and stable. Questions about Ultra Monkey are answered on the LVS mailing list. Ultra Monkey is mentioned in many places in the LVS-HOWTO.

Ben Hollingsworth ben (dot) hollingsworth (at) bryanlgh (dot) org 29 Jun 2007 There are step by step instructions on How to install Ultra Monkey LVS in a 2-Node HA/LB Setup on CentOS/RHEL4 (http://www.jedi.com/obiwan/technology/ultramonkey-rhel4.html).

Dan Thagard daniel (at) gehringgroup (dot) com 3 Jul 2007 I recently set up LVS using the Ultramonkey RPMs. The following is a (based on my understanding) complete howto for setting up CentOS 5 with LVS:

Generic CentOS 5 x64 Install on 2 PCs using Ultramonkey and Streamlined/HA topology with Apache

The following assumptions were made:

- Real Server names are ws01.testlab.local and ws02.testlab.local (replace these with the result from uname -n from each RS)
- Real Server IPs are 10.0.0.10/24 and 10.0.0.20/24, Gateway: 10.0.0.1
- Virtual IP: 10.0.0.100
- Username: tester

1. Power PC and insert CD during BIOS.
2. Boot to CD.
3. Hit 'Enter' for Graphical Installer.
4. You will be prompted to test the installation media. You may choose to test the media or skip the test (usually you can skip this step).
5. Click 'Next' to begin installation.
6. Select 'English' as installation language and click 'Next'.
7. Select 'U.S. English' as the keyboard configuration and click 'Next'.
8. Select 'Remove all partitions on selected drives and create default layout' and click 'Next'.
9. Configure the network settings for each adapter.
   a. Click 'Edit'.
      i. Uncheck Configure using DHCP
      ii. Input the IP Address and Netmask.
      iii. Click 'OK'.
   b. Input the Gateway and DNS and click 'Next'.
10. Select 'America/New York' and click 'Next'.
11. Enter the root password twice and click 'Next'.
12. Select the system packages.
   a. Check 'Desktop-Gnome', 'Server', 'Server-GUI', 'Clustering', 'Storage Clustering'
   b. Select 'Customize Now'
   c. Click 'Next'.
13. Configure the system packages.
   a. Expand and click 'Details' on Desktop Environments->GNOME Desktop Environment.
      i. Uncheck 'desktop-printing', 'dvd+rw tools', 'esc', 'gimp-print-utils', 'gnome-audio', 'gnome-backgrounds', 'gnome-mag', 'gnome-pilot', 'gnome-themes', 'gok', and 'nautilus-cd'
   b. Expand Servers.
      i. Uncheck 'DNS', 'Legacy Network Server', 'Mail Server', 'Network Servers', 'News', and 'Printing Support'
   c. Expand Base System.
      i. Uncheck 'Dialup Networking Support'
   d. Expand and click 'Details' on Base System->Base.
      i. Uncheck 'bluez-utils' and 'ccid'
   e. Click 'Next'
14. Click 'Next' to begin copying over the files.
15. Remove DVD and click 'Reboot' to reboot the machine after installation.
16. Set firewall to 'Disabled' and click 'Forward'. Click 'Yes' on pop-up.
17. Set SELinux to 'Disabled' and click 'Forward'.
18. Select the 'Network Time Protocol' tab, check 'Enable Network Time Protocol', and click 'Forward'.
19. Enter tester in the username field, 'Test User' in the Full name field, type in the password twice, and click 'Forward'.
20. Click 'Forward' to skip the audio test.
21. Click 'Finish' to complete the installation routine.
22. Login to the local system using the root username and password.
23. Edit the '/etc/group' file
   a. Locate the 'wheel' group and append the user 'tester' to it (i to insert, [ESC] to stop editing).
   b. Save the file and exit by typing ':wq'.
24. Leave the server, go to your PC and SSH into the server (e.g. with PuTTY)
25. Login as user 'tester'
26. Su to root
27. Install the dries yum repository by creating dries.repo in the /etc/yum.repos.d/ directory
28. Install the dries GPG key
29. Update your local packages and install some additional ones
30. Correct the release version in /etc/redhat-release
31. Download the Ultramonkey RPMs from http://www.ultramonkey.org (also grab perl-Mail-POP3Client, available from http://rpm.pbone.net/index.php3/stat/4/idpl/4508518/com/perl-Mail-POP3Client-2.17-1.el5.centos.noarch.rpm.html as of the time of this writing)
32. Install the arptables-noarp-addr and perl-Mail-POP3Client RPMs (change the cd path to wherever you downloaded Ultramonkey to)
33. Install Ultramonkey
34. Download the Ultramonkey config files that relate to your desired topology from http://www.ultramonkey.org into the /etc/ha.d/ directory and edit them to meet your desired configuration.
35. Set the permission on authkeys
36. Start the httpd server
37. Create alive.html in the /var/www/html folder (set this to whatever file you have set in the monitoring script)
38. Edit the /etc/hosts file to include the FQDN of all of the machines in your LVS (not strictly necessary, but it helps avoid problems)
39. Edit the /etc/sysconfig/network-scripts/ifcfg-lo file with your virtual IP
40. Edit the /etc/sysconfig/network-scripts/ifcfg-eth0 file to match (edit the IP address for each director/real server, change from eth0 to whatever active interface you are using)
41. Restart the network
42. Enable packet forwarding and arp ignore in the /etc/sysctl.conf file
43. Reparse the sysctl.conf file
44. Make sure all services are set to start at system boot.
45. Start the heartbeat service

Keepalived Keepalived is written by Alexandre Cassen Alexandre (dot) Cassen (at) free (dot) fr, and is based on vrrpd for director failover. Health checking for realservers is included. It has a lengthy but logical conf file and sets up an LVS for you. Alexandre released code for this in late 2001. There is a keepalived mailing list and Alexandre also monitors the LVS mailing list (May 2004, most of the postings have moved to the keepalived mailing list). The LVS-HOWTO has some information about Keepalived.

ipvsman(d)

Volker Jaenisch volker (dot) jaenisch (at) inqbus (dot) de 2007-07-04 http://sourceforge.net/projects/ipvsman/ ipvsman is a curses based GUI to the IPVS loadbalancer, written in python. ipvsmand is a monitoring instance of ipvs which maintains the desired state of the loadbalancing, as ldirectord or keepalived do. ipvsman/d now comes with tcp regular expression chat to check any tcp-service you can imagine. Sorry-Servers can be checked for their availability. Fedora 7 packages are contributed by Gerry Reno.

Alternate hardware: Soekris (and embedded hardware) Clint Byrum cbyrum (at) spamaps (dot) org 27 Sep 2004

I'd like to setup a two node Heartbeat/LVS load balancer using Soekris Net4801 machines. These have a 266Mhz Geode CPU, 3 Ethernet, and 128MB of RAM. The OS (probably LEAF) would live on a CF disk. If these are overkill, I'd also consider a Net4501, which has a 133Mhz CPU, 64MB RAM, and 3 ethernet. I'd need to balance about 300 HTTP requests per second, totaling about 150kB/sec, between two servers. I'm doing this now with the servers themselves (big dual P4 3.02 Ghz servers with lots and lots of RAM). This is proving problematic as failover and ARP hiding are just a major pain. I'd rather have a dedicated LVS setup. 1) anybody else doing this? 2) IIRC, using the DR method, CPU usage is not a real problem because reply traffic doesn't go through the LVS boxes, but there is some RAM overhead per connection. How much traffic do you guys think these should be able to handle?

Ratz 28 Sep 2004 The Net4801 machines are horribly slow, but enough for your purpose. The limiting factor on those boxes is almost always the cache size. I've waded through too many processor sheets of those Geode derivatives to give you specific details on your processor, but I would be surprised if it had more than 16kB of i-cache and d-cache each.

16k unified cache. :-/

Make sure that your I/O rate is as low as possible or the first thing to blow is your CF disk. I've worked with hundreds of those little boxes in all shapes, sizes and configurations. The biggest common mode failures were CF disk due to temperature problems and I/O pressure (MTTF was 23 days); other problems only showed up in really bad NICs locking up half of the time.

I haven't ever had an actual CF card blow on me. LEAF is made to live on readonly media, so it's not like it will be written to a lot.

Sorry, blow is exaggerated, I mean they simply fail because the cells only have limited write capacity. RO doesn't mean that there's no I/O going to your disk, as you correctly noted. If you plan on using them 24/7, I suggest you monitor your block I/O on your RO partitions using the values from /proc/partitions or the wonderful iostat tool. Then extrapolate from about 4 hours worth of samples, check your CF vendor specification on how many writes it can endure and see how long you can expect the thing to run.

I have to add that thermal issues were adding to our high failure rates. We wanted to ship those little nifty boxes to every branch of a big customer to build a big VPN network. Unfortunately the customer is in the automobile industry and this means that those boxes were put in the strangest places imaginable in garages, sometimes causing major heat congestion. Also, as is usual in this sector of industry, people are used to reliable hardware and so they don't care if at the end of a working day they simply shut down the power of the whole garage. Needless to say, this adds to the reduced lifetime of a CF. I then did a reliability analysis using the MGL (multiple greek letter, derived from the beta-factor model) model to calculate the average risk in terms of failure*consequence and we had to refrain from using those little nifty things. The costs of repair (detection of failure -> replacement of product) at a customer would exceed the income our service provided through a mesh of those boxes.
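The extrapolation Ratz suggests can be sketched as follows. All numbers in the usage note are hypothetical, the function name is made up for illustration, and the model assumes perfect wear levelling, so it gives an optimistic upper bound rather than a prediction.

```python
def estimate_cf_lifetime_days(sectors_written, sample_hours, sector_bytes,
                              capacity_bytes, endurance_cycles):
    """Extrapolate CF lifetime from a short sample of block writes.

    sectors_written  -- sectors written during the sample (e.g. from iostat)
    sample_hours     -- length of the sample in hours
    sector_bytes     -- bytes per sector
    capacity_bytes   -- size of the CF card
    endurance_cycles -- write cycles per cell from the vendor specification
    """
    if sectors_written == 0:
        return float("inf")  # nothing written, no wear from writes
    bytes_per_hour = sectors_written * sector_bytes / sample_hours
    # Assumption: writes are spread evenly over all cells (ideal wear levelling).
    total_writable_bytes = capacity_bytes * endurance_cycles
    return total_writable_bytes / bytes_per_hour / 24
```

For example, a 4 hour sample with 1,000,000 sectors of 512 bytes written to a 256MB card rated for 100,000 cycles gives roughly 8700 days under these assumptions. A card whose writes all hit the same few blocks would fail much earlier, which fits the short MTTF reported above.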

If these are overkill, I'd also consider a Net4501, which has a 133Mhz CPU, 64MB RAM, and 3 ethernet.

I'd go with the former ones, just to be sure ;).

Forgive me for being frank, but it sounds like you wouldn't go with either of them.

I don't know your business case so it's very difficult to give you a definite answer. I'm only giving you a (somewhat intimidating) experience report; someone might just as well give you a much better report.

I'd need to balance about 300 HTTP requests per second, totaling about 150kB/sec, between two servers.

So one can assume a typical request to your website is 512 Bytes, which is rather high, but not really an issue for LVS-DR.

I didn't clarify that. The 150kB/sec is outgoing. This isn't for all of the website, just the static images/html/css.

I'm doing this now with the servers themselves (big dual P4 3.02 Ghz servers with lots and lots of RAM). This is proving problematic as failover and ARP hiding are just a major pain. I'd rather have a dedicated LVS setup.

I'd have to agree to this.

1) anybody else doing this?

Maybe. Stupid questions: How often did you have to failover and how often did it work out of the box?

Maybe once every 2 or 3 months I'd need to do some maintenance and switch to the backup. Every time there was some problem with noarp not coming up or some weird routing issue with the IPs. Complexity bad. :)

So frankly speaking: your HA solution didn't work as expected ;).

2) IIRC, using the DR method, CPU usage is not a real problem because reply traffic doesn't go through the LVS boxes, but there is some RAM overhead per connection. How much traffic do you guys think these should be able to handle?

This is very difficult to say since these boxes also impose limits through their inefficient PCI buses, their rather broken NICs and the dramatically reduced cache. Also it would be interesting to know if you're planning on using persistency in your setup.

Persistency is not a requirement. Note that most of the time a client opens a connection once, and keeps it up as long as they're browsing with keepalives.

Yes, provided most clients use HTTP/1.1. And on an application level you don't need persistency. To give you a number to start with, I would say those boxes should be able (given your constraints) to sustain 5Mbit/s of traffic with about 2000pps (~350 Bytes/packet) and only consume 30 Mbyte of your precious RAM when running without persistency. This is if every packet of your 2000pps is a new client requesting a new connection to the LVS and will be inserted by the template at an average of 1 minute. As mentioned previously, your HW configuration is very hard to compare to actual benchmarks, so please take those numbers with a grain of salt.
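The per-request and per-packet sizes quoted in this exchange follow from simple division. A small sketch, assuming 1kB = 1024 bytes and 1Mbit = 10^6 bits (the posters do not say which convention they use); the function names are hypothetical.

```python
def bytes_per_request(bytes_per_sec, requests_per_sec):
    """Average size of one request at the given throughput."""
    return bytes_per_sec / requests_per_sec


def bytes_per_packet(bits_per_sec, packets_per_sec):
    """Average packet size implied by a bit rate and a packet rate."""
    return bits_per_sec / 8 / packets_per_sec


# 150kB/sec spread over 300 requests/sec
request_size = bytes_per_request(150 * 1024, 300)

# 5Mbit/s carried by 2000 packets/sec
packet_size = bytes_per_packet(5_000_000, 2000)
```

With these conventions request_size is 512 bytes, matching Ratz's figure, and packet_size is 312.5 bytes, in the neighbourhood of the "~350 Bytes/packet" he quotes.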

That's not encouraging. I need something fairly cheap, otherwise I might as well go down the commercial load balancer route.

Well, I have given you numbers which are (at a second look) rather low estimates ;). Technically, your system should be able to deliver 25000pps (yes, 25k) at a 50Mbit/s rate. You would then, if every packet was a new client, consume about all the memory of your system :). So I would place the performance of your machine somewhere in between those two numbers.

Bubba Parker sysadmin (at) citynetwireless (dot) net 27 Sep 2004 In my tests, the Soekris net4501, 4511, and 4521 were all able to route almost 20Mbps at wire-speed. I would suspect the 4801 to be in excess of 50Mbps, but remember, your Soekris board has 3 NICs, and what they don't tell you is that they all share the same interrupt, so performance degradation is exponential with many packets per second.

Ratz 28 Sep 2004 For all Geode based boards I've received more technical documentation than I was ever prepared to dive into. Most of the time you get a very accurate depiction of your hardware including south and north bridges, and there you can see that the interrupt lines are hardwired and require interrupt sharing. However this is not a problem since there are not a lot of devices on the bus anyway that would occupy it, and if you're really unhappy about the bus speed, use setpci to reduce latency for the NIC's IRQs. Newer kernels have excellent handling for shared IRQs btw. Did you measure exponential degradation? I know you get a pretty steep performance reduction once you push the pps too high but I never saw exponential behaviour.

Peter Mueller 2004-09-27

What about not using these Soekris's and just using those two beefy servers? e.g., http://www.ultramonkey.org/2.0.1/topologies/ha-overview.html or http://www.ultramonkey.org/2.0.1/topologies/sl-ha-lb-overview.html

Clint Byrum 27 Sep 2004 That's what I'm doing now. The setup works, but its complexity causes issues. Bringing up IPs over here, moving them from eth0 to lo over there, running noarpctl on that box. It's all very hard to keep track of. It's much simpler to just have two boxes running LVS, and not worry about what's on the servers. Simple things are generally easier to fix if they break. It took me quite a while to find a simple typo in a script on my current setup, because it was very non-obvious at what layer things were failing.

LVS on a CD: Malcolm Turnbull's ISO files Malcolm Turnbull Malcolm (dot) Turnbull (at) crocus (dot) co (dot) uk 03 Jun 2003, has released a Bootable ISO image of his Loadbalancer.org appliance software. The link was at http://www.loadbalancer.org/modules.php?name=Downloads&d_op=viewdownload&cid=2 but is now dead (Dec 2003). Checking the website (Apr 2004) I find that the code is available as a 30 day demo (http://www.loadbalancer.org/download.html, link dead Feb 2005). Here's the original blurb from Malcolm

The basic idea is creating an easy to use layer 4 switch appliance to compete with Coyote Point Equalizer/ CISCO local director... All my source code is GPL, but the ISO distribution contains files that are non-GPL to protect the work and allow vendors to licence the software. The ISO requires a license before you can legally use it in production. Burn it to CD and then use it to boot a spare server with pentium/celeron + ATAPI CD + 64MB RAM + 1 or 2 NICs+20GB HD Default setup is DR so just plug it straight into the same hub as your web servers and have a play.. Download the manuals for configuration info...

LVS: Ipvsadm and Schedulers

ipvsadm is the user code interface to LVS. The scheduler is the part of the ipvs kernel code which decides which realserver will get the next new connection.

There are patches for ipvsadm:

- machine readable error codes for ipvsadm
- stateless entry of ipvsadm commands

Mar 2004. There appears to have been a bug introduced in the wrr code. Presumably this will be fixed sometime in the main code, and presumably older versions of ipvs still work (but I don't know how far back you need to go, presumably to the 2.4 kernels). Here are some postings on the matter and links to a patch.

Jan Kasprzak kas (at) fi (dot) muni (dot) cz 2005/03/25

port unreachable after RS removal: I use IPVS with direct routing and wrr scheduler. The problem is that for some configurations I get "icmp port unreachable" when one of the real servers fails and is removed from the ip_vs tables. The smallest case where I can replicate the problem is the following:

    `-'
    Resolving virtual.service... 1.2.3.4
    Connecting to virtual.service[1.2.3.4]:80... failed: Connection refused.

I have verified by tcpdump that no traffic is sent to realserver2 after it is removed from the virtual.service pool. The ICMP "tcp port unreachable" is sent by the ipvs director. This appears to be a problem in the wrr scheduler. With wlc or rr it works as expected. The director is Fedora Core 3 with vanilla 2.6.11.3 kernel, but I have been experiencing this for a longer time.

Sent by: lvs-users-bounces@LinuxVirtualServer.org 2005/03/26 This is exactly the problem I described in my previous mails, and for which a patch is available from Wensong and/or Horms. Search the mailinglist archive for 'overload flag not resetting' which was my initial (wrong) diagnosis. See

Using ipvsadm

You use ipvsadm from the command line (or in rc files) to set up:

- services/servers that the director directs (e.g. http goes to all realservers, while ftp goes only to one of the realservers).
- weighting given to each realserver - useful if some servers are faster than others. Horms 30 Nov 2004 The weights are integers, but sometimes they are assigned to an atomic_t, so they can only be 24 bits, i.e. values from 0 to (2^24-1) should work.
- scheduling algorithm

You can also use ipvsadm to:

- add services: add a service with weight >0
- shut down (or quiesce) services: set the weight to 0. This allows current connections to continue until they disconnect or expire, but will not allow new connections. When there are no connections remaining, you can bring down the service/realserver.
- delete services: this stops traffic for the service (the connection will hang), but the entry in the connection table is not deleted till it times out. This allows deletion, followed shortly thereafter by adding back the service, to not affect established (but quiescent) connections.

Once you have a working LVS, save the ipvsadm settings with ipvsadm-save, redirecting its output to a file (e.g. ipvsadm.sav), and then after reboot restore the ipvsadm settings with ipvsadm-restore. Both of these commands can be part of an ipvsadm init script.

- list version of ip_vs (here 0.9.4, with a hash table size of 4096)
- list version of ipvsadm (here 1.20)
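As an illustration of quiescing by weight, the sketch below builds (but does not run) an ipvsadm edit command. It uses the ipvsadm options -e, -t, -r, -w and the forwarding flags -g, -m, -i; the helper names, the METHOD_FLAGS table and the addresses in the usage note are hypothetical. The forwarding method is passed explicitly because an edit without one falls back to ipvsadm's default method.

```python
MAX_WEIGHT = 2**24 - 1  # upper bound given by Horms above

# ipvsadm forwarding flags: direct routing, NAT (masquerading), tunnelling
METHOD_FLAGS = {"dr": "-g", "nat": "-m", "tun": "-i"}


def set_weight_cmd(vip_port, rip_port, method, weight):
    """Return the argv list for editing the weight of one realserver."""
    if not 0 <= weight <= MAX_WEIGHT:
        raise ValueError(f"weight must be in 0..{MAX_WEIGHT}")
    return ["ipvsadm", "-e", "-t", vip_port, "-r", rip_port,
            METHOD_FLAGS[method], "-w", str(weight)]


def quiesce_cmd(vip_port, rip_port, method):
    """Weight 0: existing connections continue, no new ones are scheduled."""
    return set_weight_cmd(vip_port, rip_port, method, 0)
```

For example, quiesce_cmd("192.168.1.110:80", "192.168.1.12:80", "dr") yields an argv list that could be handed to subprocess.run on a director; returning a list rather than a shell string avoids quoting problems.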

Memory Requirements

On the director, the entries for each connection are stored in a hash table (the number of buckets is set when compiling ip_vs). Each entry takes 128 bytes. Even large numbers of connections will use only a small amount of memory.

We would like to use LVS in a system where 700Mbit/s of traffic is flowing through it. The number of concurrent connections is about 420.000. Our main purpose for using LVS is to direct port 80 requests to a number of squid servers (~80 servers). I have read the performance documents and I just wonder whether I can handle this much traffic with 2x3.2 Xeon and 4GB of RAM.

Ratz 22 Nov 2006 If you use LVS-DR and your squid caches have a moderate hit rate, the amount of RAM you'll need to load balance 420'000 connections is: This means with 4GB and a standard 3/1GB split (your Xeon CPU is 32bit only with 64bit EMT) in the 2.6 kernel (I take it as 3000000000 Bytes), you will be able to serve half a million parallel connections, each connection lasting at most 3000000000/(500000*128) [secs] = 46.875 secs.
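The arithmetic behind these figures can be reproduced with the 128 bytes per entry stated at the top of this section. This is a sketch of the calculation only; the function names are hypothetical, and the second function simply repeats the division in Ratz's reply, which he quotes in seconds.

```python
ENTRY_BYTES = 128  # size of one connection table entry (from this section)


def table_bytes(connections, entry_bytes=ENTRY_BYTES):
    """Memory taken by the connection table for a number of connections."""
    return connections * entry_bytes


def memory_to_table_ratio(available_bytes, connections, entry_bytes=ENTRY_BYTES):
    """available memory / table size, the division used in Ratz's reply."""
    return available_bytes / table_bytes(connections, entry_bytes)


needed = table_bytes(420_000)                            # bytes for 420'000 connections
ratio = memory_to_table_ratio(3_000_000_000, 500_000)    # Ratz's 46.875
```

Here needed comes to 53,760,000 bytes (about 54MB), a small fraction of the 3GB assumed to be usable, and ratio is 46.875.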

sysctl documentation

The sysctl documentation for ipvs will be in Documentation/networking/ipvs-sysctl.txt for 2.6.18 (hopefully). It is derived from http://www.linuxvirtualserver.org/docs/sysctl.html v1.4.

Graeme Fowler graeme (at) graemef (dot) net 08 Mar 2007 A couple of times recently people have posted to the keepalived list or the LVS list about different issues which were resolved by toggling sysctls (most recently expire_quiescent_template). This got me thinking: these sysctls are pretty important, and not everyone knows what to do with them (or how to change them) since the recommended ways to modify them can vary between distributions. So, why not give ipvsadm the capability to modify appropriate sysctls found in /proc/sys/net/ipv4/vs/? The more I thought about it, the more I considered that the easiest way to do so would be to use a "generic" option along the lines of the e2fsprogs style "-O option,option,option=value" with "^option" as a negation for booleans. So you'd be able to say: By making the option handler "generic" like this, as other sysctls arrive as the kernel develops they can simply be toggled or changed as necessary; in all cases, where no corresponding sysctl exists, an error is thrown to that effect. In my mind it makes ipvsadm more of a "one stop shop" for the various settings - not only will it manage the virtual and real servers, but more of the underlying infrastructure too.

Ratz 08 Mar 2007 I tend to agree, however people that want to set up LVS do need to know Linux on the level of also understanding sysctl variables and their meaning. I've always hoped that with having them in the ipvsadm man page, the problem would be solved. I know of only one application that modifies sysctls, and this is the broken pluto of Free/OpenSwan :). You still have to know what the options mean, correct?
I favour a different approach more: make LVS really user friendly, in that you provide the users with a tool that takes away the low level configuration, just like in linux-ha or commercial load balancers. It's not so difficult to write this, it's just that someone has to sit down and do it. You still need to absolutely know the semantics of these settings. So what is the real gain over setting /proc/sys/net/ipv4/vs/expire_quiescent_template directly?

Horms Of late I've been thinking of the idea of enabling LVS to be configured via netlink rather than the current /proc + get/setsockopt fun. I think this was Ratz's idea, but it seems like a good one to me, as it should allow a lot more flexibility in the user-space to kernel communication, which has always been a problem from the point of view of backwards compatibility. So I have kind of been thinking of ipvsadm2 or ipvsadm-nl.

Ratz I've already started once with the conversion of IPVS to netlink and I've named the new ipvsadm ipvsctl :). I've attached my work (actually the part I could find right now ... I know on some of my dozens of crashed Laptop harddisks there should be more) in this email, so it doesn't get lost and you don't have to duplicate it. This would also allow us to easily implement the missing features and enable us to move more towards netfilter-friendliness.
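Graeme's proposed "-O option,option,option=value" syntax was never part of ipvsadm, so the following is only a sketch of how such a string might be mapped onto files under /proc/sys/net/ipv4/vs/. The function name and the name check are assumptions; the sketch parses the string and does not write to /proc.

```python
import re

SYSCTL_DIR = "/proc/sys/net/ipv4/vs"
_NAME = re.compile(r"[a-z0-9_]+")  # restrict names so no path can be injected


def parse_sysctl_options(spec):
    """Map 'opt,^opt,opt=value' to {sysctl path: value to write}."""
    settings = {}
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if item.startswith("^"):
            name, value = item[1:], "0"      # negated boolean
        elif "=" in item:
            name, value = item.split("=", 1)  # explicit value
        else:
            name, value = item, "1"           # plain boolean
        if not _NAME.fullmatch(name):
            raise ValueError(f"bad sysctl name: {name!r}")
        settings[f"{SYSCTL_DIR}/{name}"] = value
    return settings
```

A caller would then check that each path exists before writing the value, which corresponds to the error Graeme wants "where no corresponding sysctl exists". As Ratz points out, this changes how a value is set, not the need to understand what it means.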

Compile a version of ipvsadm that matches your ipvs

Compile and install ipvsadm on the director using the supplied Makefile. You can optionally compile ipvsadm with popt libraries, which allows ipvsadm to handle more complicated arguments on the command line. If your libpopt.a is too old, your ipvsadm will segv. (I'm using the dynamic libpopt and don't have this problem.)

Since you compile ipvs and ipvsadm independently, and you cannot compile ipvsadm until you have patched the kernel headers, a common mistake is to compile the kernel and reboot, forgetting to compile/install ipvsadm.

Unfortunately there is only rudimentary version detection code in ipvs/ipvsadm. If you have a mismatched ipvs/ipvsadm pair, many times there won't be problems, as any particular version of ipvsadm will work with a wide range of patched kernels. Usually with 2.2.x kernels, if the ipvs/ipvsadm versions mismatch, you'll get weird but non-obvious errors about not being able to install your LVS. Other possibilities are that the output of ipvsadm -L will have IPs that are clearly not IPs (or not the IPs you put in) and ports that are all wrong. It will look something like this

    RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP C0A864D8:0050 rr
    -> 01000000:0000 Masq 0 0 0

rather than

    RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP lvs2.mack.net:ssh rr
    -> RS2.mack.net:ssh Route 1 0 0

There was a change in the /proc file system for ipvs at about 2.2.14 which caused problems for anyone with a mismatched ipvsadm/ipvs. The ipvsadm from different kernel series (2.2/2.4) do not recognise the ipvs kernel patches from the other series (they appear to not be patched for ipvs).

The later 2.2.x ipvsadms know the minimum version of ipvs that they'll run on, and will complain about a mismatch. They don't know the maximum version (which will be produced presumably some time in the future) that they will run on. This protects you against the unlikely event of installing a new 2.2.x version of ipvsadm on an older version of ipvs, but will not protect you against the more likely scenario where you forget to compile ipvsadm after building your kernel.

The ipvsadm maintainers are aware of the problem. Fixing it will break the current code and they're waiting for the next code revision which breaks backward compatibility.

If you didn't even apply the kernel patches for ipvs, then ipvsadm will complain about missing modules and exit (i.e. you can't even do `ipvsadm -h`).
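To see what a hex field such as C0A864D8:0050 in mismatched output stands for, it can be decoded by hand. The sketch below assumes the address is 8 hex digits in network byte order and the port is 4 hex digits, which is how the example above reads; the function name is hypothetical.

```python
import socket


def decode_hex_endpoint(field):
    """Turn 'C0A864D8:0050' into a dotted quad IP and a decimal port."""
    addr_hex, port_hex = field.split(":")
    ip = socket.inet_ntoa(bytes.fromhex(addr_hex))  # 4 bytes, network order
    return ip, int(port_hex, 16)


vip = decode_hex_endpoint("C0A864D8:0050")
rip = decode_hex_endpoint("01000000:0000")
```

Under this assumption the virtual service decodes to 192.168.100.216 port 80, while the realserver entry decodes to 1.0.0.0 port 0, which is clearly not an address anyone configured and points to the ipvs/ipvsadm mismatch described above.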

Other compile problems Ty Beede tybeede (at) metrolist (dot) net

on a slackware 4.0 machine I went to compile ipvsadm and it gave me an error indicating that the iphdr type was undefined and it didn't like that when it saw ... to ip_fw.h I added ... in ipvsadm.c, which is where the iphdr structure is defined, and everything went ok

Doug Bagley doug (at) deja (dot) com

The reason that it fails "out of the box" is because fwp_iph's type definition (struct iphdr) was ... (and not included anywhere else) since the symbol __KERNEL__ was undefined. ... before ... in the .c file did the trick.

put realservers in /etc/hosts

(from a note by Horms 26 Jul 2002)

ipvsadm by default outputs the names of the realservers rather than the IPs. The director then needs name resolution. If you don't have it, ipvsadm will take a long time (up to a minute) to return, as it waits for name resolution to time out. The only IPs that the director needs to resolve are those of the realservers. DNS is slow. To prevent the director from needing DNS, put the names of the realservers in /etc/hosts. This lookup is quicker than DNS and you won't need to open a route from the director to a nameserver. Or you could use `ipvsadm -n`, which outputs the IPs of the realservers instead.

RR and LC schedulers

On receiving a connect request from a client, the director assigns a realserver to the client based on a "schedule". The scheduler type is set with ipvsadm. The schedulers available are

  - round robin (rr), weighted round robin (wrr) - new connections are assigned to each realserver in turn
  - least connected (lc), weighted least connection (wlc) - new connections go to the realserver with the least number of connections. This is not necessarily the least busy realserver, but is a step in that direction. Doug Bagley doug (at) deja (dot) com points out that *lc schedulers will not work properly if a particular realserver is used in two different LVSs. Willy Tarreau (in http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing) says that *lc is suited for very long sessions, but not for webservers where the load varies on a short time scale.
  - LBLC: a persistent memory algorithm
  - DH: destination hash
  - SH: source hash. Again from Willy: this is used when you want a client to always appear on the same realserver (e.g. a shopping cart, or database). The SH scheduler has not been much used in LVS, possibly because no-one knew the syntax for a long time and couldn't get it to work. Most shopping cart type servers are using persistence, which has many undesirable side effects.

The original schedulers are rr and lc (and their weighted versions). Any of these will do for a test setup. In particular, round robin will cycle connections to each realserver in turn, allowing you to check that all realservers are functioning in the LVS. The rr, wrr, lc and wlc schedulers should all work similarly when the director is directing identical realservers with identical services. The lc scheduler will better handle situations where machines are brought down and up again (see thundering herd problem).
If the realservers are offering different services and some have clients connected for a long time while others are connected for a short time, or some are compute bound, while others are network bound, then none of the schedulers will do a good job of distributing the load between the realservers. LVS doesn't have any load monitoring of the realservers. Figuring out a way of doing this that will work for a range of different types of services isn't simple (see load and failure monitoring). Ratz Nov 2006 After almost 10 years of my involvement with load balancers, I have to admit that no customer _ever_ truly asked or cared about the scheduling algorithm :). This is academia for the rest of the world.
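As a rough illustration of how the original schedulers differ, here is a sketch of round robin and (weighted) least connection selection. It is a simplification and not the kernel code: the `RealServer` class and both function names are invented for this example, and the load measure used here (connections divided by weight) is an assumption standing in for whatever the real schedulers compute.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class RealServer:
    name: str
    weight: int = 1
    conns: int = 0  # current connection count, as the director sees it


def round_robin(servers):
    """rr: hand out the realservers in turn, forever."""
    return cycle(servers)


def least_connection(servers):
    """lc/wlc-like rule: fewest connections per unit of weight wins.

    With all weights equal to 1 this behaves like lc.
    Servers with weight 0 are skipped.
    """
    usable = [s for s in servers if s.weight > 0]
    return min(usable, key=lambda s: s.conns / s.weight)


if __name__ == "__main__":
    farm = [RealServer("RS1", conns=3), RealServer("RS2", conns=1)]
    rr = round_robin(farm)
    print([next(rr).name for _ in range(3)])  # rr ignores the load
    print(least_connection(farm).name)        # lc looks at connections
```

Note that neither rule knows anything about CPU or network load on the realservers; both work only from what the director can count, which is why mixed workloads defeat them.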

Netmask for VIP

You setup the RIPs, DIP and other networking with whatever netmask you choose. For the VIP

  - For LVS-DR, LVS-Tun: the netmask for the VIP on the director and realservers must be /32.
  - For LVS-NAT: the netmask can be /32 or the netmask of the RIPs, DIP. You will need to setup the routing for the VIP to match the netmask.

For more details, see the chapters for each forwarding method.

Horms 12 Aug 2004

The real story is that the netmask works a little differently on lo to other interfaces. On lo the interface will answer to _all_ addresses covered by the netmask. This is how 127.0.0.1/8 on lo ends up answering 127.0.0.0-127.255.255.255. So if you add 172.16.4.222/16 to eth0 then it will answer 172.16.4.222 and only 172.16.4.222. But if you add the same thing to lo then it will answer 172.16.0.0-172.16.255.255. So you need to use 172.16.4.222/32 instead. To clarify -

 ... -> Add 192.168.10.10 to eth0
 ifconfig lo:0 192.168.10.10 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Add 192.168.10.0 - 192.168.10.255 to lo
 ifconfig lo:0 192.168.10.0 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Same as above, add 192.168.10.0 - 192.168.10.255 to lo
 ifconfig lo:0 192.168.10.10 netmask 255.255.255.255 broadcast 192.168.10.10 up
   -> Add 192.168.10.10 to lo

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/21

On all platforms apart from Windows you want 255.255.255.255 for the loopback. On Windows you can get away with 255.255.255.0 IF you use a priority 254 80% of the time. 255.255.255.255 can be used if you mod the registry... But we've found that 255.0.0.0 will work better 99% of the time because Windows by default uses the smallest subnet first for routing and a class A will never be used instead of a class C.
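Horms' rule can be checked with a few lines of arithmetic. The sketch below only models the behaviour he describes (lo answers for the whole network covered by the netmask, other interfaces answer only for the address itself); it does not touch any interface, and the function name `addresses_answered` is made up for this example.

```python
import ipaddress


def addresses_answered(cidr, on_loopback):
    """Return (first, last) address the interface answers for.

    Models the rule described above: on lo the whole network covered by
    the netmask is answered, on any other interface only the address.
    """
    iface = ipaddress.ip_interface(cidr)
    if on_loopback:
        net = iface.network
        return str(net.network_address), str(net.broadcast_address)
    return str(iface.ip), str(iface.ip)


if __name__ == "__main__":
    print(addresses_answered("172.16.4.222/16", on_loopback=False))
    print(addresses_answered("172.16.4.222/16", on_loopback=True))
    print(addresses_answered("172.16.4.222/32", on_loopback=True))
```

The /16 on lo covers 172.16.0.0 through 172.16.255.255, while the /32 covers only the VIP, which is why the VIP on lo needs the /32 netmask.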

LBLC, DH schedulers

The LBLC code (by Wensong) and the DH scheduler (by Wensong, inspired by code submitted by Thomas Proell proellt (at) gmx (dot) de) are designed for web caching realservers (e.g. squids). For normal LVS services (e.g. ftp, http), the content offered by each realserver is the same and it doesn't matter which realserver the client is connected to. For a web cache, after the first fetch has been made, the web caches have different content. As more pages are fetched, the contents of the web caches will diverge. Since the web caches will be setup as peers, they can communicate by ICP (internet caching protocol) to find the cache(s) with the required page. This is faster than fetching the page from the original webserver. However, it would be better, after the first fetch of a page from http://www.foo.com/*, for all subsequent clients wanting a page from http://www.foo.com/ to be connected to that realserver.

The original method for handling this was to make connections to the realservers persistent, so that all fetches from a client went to the same realserver.

The -dh (destination hash) algorithm makes a hash from the target IP and all requests to that IP will be sent to the same realserver. This means that content from a URL will not be retrieved multiple times from the remote server. The realservers (e.g. squids in this case) will each be retrieving content from different URLs.

Wensong Zhang wensong (at) gnuchina (dot) org 16 Feb 2001

Please see "man ipvsadm" for short description of DH and SH schedulers. I think some examples to use those two schedulers. Example: cache cluster shared by several load balancers. The DH scheduler can keep the two load balancers redirecting requests destined for the same IP address to the same cache server. If the server is dead or overloaded, the load balancer can use the cache_bypass feature to send requests to the original server directly. (Make sure that the cache servers are added in the two load balancers in the same order.)

Diego Woitasen 12 Aug 2003

The scheduling algorithms that use the dest IP for selecting the realserver to use (like DH, LBLC, LBLCR) are only applicable to transparent proxy, this being the only application where the dest IP could be variable.

Wensong Zhang wensong (at) linux-vs (dot) org 12 Aug 2003

Yes, you are almost right. LBLC and LBLCR are written for transparent proxy clusters only. DH can be used for transparent proxy clusters and can be used in other clusters needing static mapping.

Here follows a set of exchanges between a Chinese person and Wensong, that were in English, that I didn't follow at all. Apparently it was clear to Wensong.

If lblc uses dh, then is lblc = dh + lc?

Wensong Zhang 09 Mar 2004 Maybe lblc = dh + wlc. n.weight AND * there is a node m with m.conns The difference between lblc and lblcr is that cachenode[dest_ip] in lblc is a server, and cachenode[dest_ip] in lblcr is a server set.

In lblc the server has overloaded and lvs use wlc and allocate a server in half load of the server, Allocate the weighted least-connection server to IP address. Is this means after allocation for ip address, it will not return to past server ?

No, it will not in most cases. There is only one possible situation that the current map expires after it is not used for six minutes, and the past server is the one with least connections when next access to the ip address comes.
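My reading of the exchange above can be put into a short sketch. Everything here is an assumption built from the thread, not the kernel source: the names `LblcSketch`, `RealServer` and `wlc` are invented, the overload test (connections above the weight, with some other server below half its weight) is a guess at what the truncated pseudocode said, and the six minute expiry is taken from Wensong's last reply.

```python
import time
from dataclasses import dataclass

ENTRY_TIMEOUT = 360  # seconds; the "six minutes" mentioned above


@dataclass
class RealServer:
    name: str
    weight: int = 1
    conns: int = 0
    alive: bool = True


def wlc(servers):
    """Weighted least-connection pick among live servers."""
    usable = [s for s in servers if s.alive and s.weight > 0]
    return min(usable, key=lambda s: s.conns / s.weight)


class LblcSketch:
    """dest_ip -> one remembered server (lblcr would remember a set)."""

    def __init__(self, servers):
        self.servers = servers
        self.cachenode = {}  # dest_ip -> (server, time last used)

    def _should_move(self, server):
        # Assumed rule: server over its weight AND someone else under half load.
        overloaded = server.conns > server.weight
        spare = any(m.alive and m.conns < m.weight / 2 for m in self.servers)
        return overloaded and spare

    def schedule(self, dest_ip, now=None):
        now = time.time() if now is None else now
        entry = self.cachenode.get(dest_ip)
        if entry is not None and now - entry[1] > ENTRY_TIMEOUT:
            entry = None  # mapping unused for too long: forget it
        server = entry[0] if entry else None
        if server is None or not server.alive or self._should_move(server):
            server = wlc(self.servers)  # (re)allocate by wlc
        self.cachenode[dest_ip] = (server, now)
        return server
```

In this sketch a destination that has been moved to a new server stays there, and only returns to the old server if the mapping expires and the old server happens to win the wlc pick at the next request, which matches the answer given above.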

scheduling squids

The usual problem with squids not using a cache friendly scheduler is that fetches are slow. In this case the website is seeing hits from several different RIPs. Some websites detect this and won't even serve you the pages.

Palmer J.D.F. J (dot) D (dot) F (dot) Palmer (at) Swansea (dot) ac (dot) uk 18 Mar 2002

I tried https and online banking sites (e.g. www.hsbc.co.uk). It seems that this site and undoubtedly many other secure sites don't like to see connections split across several IP addresses as happens with my cluster. Different parts of the pages are requested by different realservers, and hence different IP addresses. It gives an error saying... "...For your security, we have disconnected you from internet banking due to a period of inactivity..." I have had caching issues with HSBC before, they seem to be a bit more stringent than other sites. If I send the requests through one of the squids on its own it works fine, so I can only assume it's because it is seeing fragmented requests, maybe there is a keepalive component that is requested. How do I combat this? Is this what persistence does or is there a way of making the realservers appear to all have the same IP address?

Joe

change -rr (or whatever you're running) to -dh.

Lars

Use a different scheduler, like lblc or lblcr.

Harry Yen hyen1 (at) yahoo (dot) com 16 April 2002

What is the purpose of using LVS with Squid to a https site? HTTPS based material typically is not cachable. I don't understand why you need Squid at all. Once a request reaches a Squid and incurs a cache miss, the forwarded request will have the Squid IP as the source address. So you need to find a way to make sure all connections from the same client IP go to the same Squid in the farm. Then when they incur cache misses, they will wind up via LVS persistence at the same realserver.

The reason https is sent to the squids is because it's much easier to send all browser traffic to the squids and then let them handle it. The only way I seemed to be able to get this to work (i.e. access the bank site) is to set a persistence (360 seconds), and use lblc scheduling. The current output of ipvsadm is this... I am a tad concerned at the apparent lack of load balancing.

 squidfarm1.swan.ac.uk:squid Route 1 202 1045
   -> squidfarm2.swan.ac.uk:squid Route 1 14 8

I have sorted it by using persistence, couldn't get any of the dedicated squid schedulers to work properly. I'm currently running wlc, and 360s persistence. Seems to be holding up really well. Still watching it with eagle eyes though.

The -dh scheduler was written expressly to handle squids. Jezz tried it and didn't get it to work satisfactorily but found that persistence worked. We don't understand this yet.

Jakub Suchy jakub (at) rtfm (dot) cz 2005/02/23

The round-robin algorithm is not usable for squid. Some servers (banks etc.) check the client's IP address and terminate the connection if it changes. When you use the source-hashing algorithm, IPVS checks the client against its local table and always forwards the connection to the same squid realserver, so the client always accesses the web through the same squid. Source-hashing can become unbalanced when you have few clients and one of them uses squid more frequently than the others. With more clients, it's statistically balanced.

DH persistence

Just as you can use the SH scheduler to achieve persistence (affinity), you can also use the DH scheduler. We couldn't work out why this LVS wasn't scheduling the way we expected, but the DH scheduler fixed it.

Steve Haneman stevehaneman (at) yahoo (dot) com 22 Oct 2008

I'm using ipvsadm to load balance 2 security web servers, so I'm using 3 boxes. The web servers have 100 IPs each in the 192.168.253.x and 192.168.252.x ranges. I'm running a load test through the lb to the web servers from 20 unique IPs. I'm finding that 4% of the time the sessions are not sticky. A user/password web POST goes to one server and a followup POST ends up at the other server. There is less than 1 second between the POSTs. Each transaction (login POST, data POST, logout) is completed with the IP it started with.

Jeff Tchang jeff (dot) tchang (at) gmail (dot) com Not sure if this might help but have you tried using different scheduling algorithms? In particular maybe destination hashing? Steve

That fixed it. I changed from rr to dh and I'm seeing all goodness.

LVS with mark tracking: fwmark patches for multiple firewalls/gateways

If the LVS is protected by multiple firewall boxes and each firewall is doing connection tracking, then packets arriving at and leaving the LVS from the same connection will need to pass through the same firewall box, or else they won't be seen to be part of the same connection and will be dropped. An initial attempt to handle the firewall problem was sent in by Henrik Nordstrom, who is involved with developing web caches (squids). This code isn't a scheduler, but it's in here awaiting further developments of code from Julian because it addresses similar problems to the SH scheduler in the next section.

Julian 13 Jan 2002

Unfortunately Henrik's patch breaks the LVS fwmark code. Multiple gateway setups can be solved with routing and a solution is planned for LVS. Until then it would be best to contact Henrik, hno (at) marasystems (dot) com, for his patch.

Here's Henrik's patch and here's some history. Henrik Nordstrom hno (at) marasystems (dot) com 13 Jan 2002

My use of the MARK is for routing purposes of return traffic only, not at all related to the scheduling onto the farm. This is to solve complex routing problems arising at borders between networks where it is impractical to know the full routing of all clients. One example of what I do is like this: I have a box connected to three networks (firewall, including LVS-NAT load balancing capabilities for published services)

  a - Internet
  b - DMZ, where the farm members are
  c - Large intranet

For simplicity both Internet and intranet users connect to the same LVS IP addresses. Both networks 'a' and 'c' are complex, and maintaining a complete and correct routing table covering one of the networks (i.e. the 'c' network in the above) is bordering on impossible and error prone, as the use of addresses changes over time. To simplify routing decisions I simply want return traffic to be routed back the same way as from where the request was received. This covers 99.99% of all routing needed in such a situation, regardless of the complexity of the networks on the two (or more) sides, without the need of any explicit routing entries.

To do this I MARK the session when received using netfilter, giving it a routing mark indicating which path the session was received from. My small patch modifies LVS to memorize this mark in the LVS session, and then restore it on return traffic received FROM the realservers. This allows me to route the return traffic from the farm members to the correct client connection using iproute fwmark based routing rules. As farm distribution algorithms I use different ones depending on the type of application. The MARK I only use for routing of return traffic.

I also have a similar patch for Netfilter connection tracking (and NAT), for the same purpose of routing return traffic. If interested search for CONNMARK in the netfilter-devel archives. The two combined allow me to make multihomed boxes which do not need to know the networks on any of the sides in detail, besides their own IP addresses and suitable gateways to reach further into the networks.

Another use of the connection MARK memory feature is a device connected to multiple customer networks with overlapping IP addresses, for example two customers both using 192.168.1.X addresses. In such a case making a standard routing table becomes impossible as the customers are not uniquely identified by their IP addresses. The MARK memory however deals with such routing with ease, since it does not care about the detailed addressing as long as it is possible to identify the two customer paths somehow, i.e. interface originally received on, source MAC of the router who sent us the request, or anything uniquely identifying the request as coming from a specific path.

The two problems above (not wanting to know the IP routing, or not being able to IP route) are not mutually exclusive. If you have one then the other is quite likely to occur.

Here's Henrik's announcement and the replies. Henrik Nordstrom 14 Feb 2001

Here is a small patch to make LVS keep the MARK, and have return traffic inherit the mark. We use this for routing purposes on a multihomed LVS server, to have return traffic routed back the same way as from where it was received. What we do is set the mark in the iptables mangle chain depending on source interface, and in the routing table use this mark to have return traffic routed back in the same (opposite) direction. The patch also moves the priority of the LVS INPUT hook back in front of the iptables filter hook, this to be able to filter the traffic not picked up by LVS but matching its service definitions. We are not (yet) interested in filtering traffic to the virtual servers, but very interested in filtering what traffic reaches the Linux LVS-box itself.

Julian - who uses NFC_ALTERED ?

Netfilter. The packet is accepted by the hook but altered (mark changed).

Julian - Give us an example (with dummy addresses) for setup that require such fwmark assignments.

For a start you need a LVS setup with more than one real interface receiving client traffic for this to be of any use. Some clients (due to routing outside the LVS server) come in on one interface, other clients on another interface. In this setup you might not want to have an equally complex routing table on the actual LVS server itself.

Regarding iptables/ipvs I currently "only" have three main issues.

  - As the "INPUT" traffic bypasses most normal routes, the iptables conntrack will get quite confused by return traffic.
  - Sessions will be tracked twice, both by iptables conntrack and by IPVS.
  - There is no obvious choice whether IPVS LOCAL_IN should be placed before or after the iptables filter hook. Having it after enables the use of many fancy iptables options, but instead requires one to have rules in iptables for allowing ipvs traffic, and any mismatches (either in rulesets or IPVS operation) will cause the packets to actually hit the IP interface of the LVS server, which in most cases is not what was intended.

SH scheduler Using the SH (source hash) scheduler, the realserver is selected using a hash of the CIP. Thus all connect requests from a particular client will go to the same realserver. Scheduling based on the client IP, should solve some of the problems that currently require persistence (i.e. having a client always go to the same realserver). Other than the few comments here, no-one has used the -sh scheduler. The SH scheduler was originally intended for directors with multiple firewalls, with the balancing based on hashes of the MAC address of the firewall and this is how it was written up. Since no-one was balancing on the MAC address of the firewall, the SH scheduler lay dormant for many years, till someone on the mailing list figured out that it could do other things too. It turns out that address hashing is a standard method of keeping the client on the same server in a load balanced server setup. Willy Tarreau (in http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing) discusses address hashing (in the section "selecting the best server") to prevent loss of session data with SSL connections in loadbalanced servers, by keeping the client on the same server. Here's Wensong's announcement: Wensong Zhang wensong (at) gnuchina (dot) org 16 Feb 2001 Please see "man ipvsadm" for short description of DH and SH schedulers. I think some examples to use those two schedulers. Example: Firewall Load Balancing Make sure that the firewall boxes are added in the load balancers in the same order. Then, request packets of a session are sent to a firewall, e.g. FW1, the DH can forward the response packets from protected network to the FW1 too. However, I don't have enough hardware to test this setup myself. Please let me know if any of you make it work for you. :) For initial discussions on the -dh and -sh scheduler see on the mailing list under "some info for DH and SH schedulers" and "LVS with mark tracking".
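To make the idea of address hashing concrete, here is a minimal sketch. The hash used (the client address as an integer, modulo the number of servers) is a stand-in chosen for readability and is not claimed to be what ipvs computes; `source_hash` is a name invented for this example. The same function applied to the destination IP instead of the client IP gives the DH behaviour.

```python
import ipaddress


def source_hash(client_ip, servers):
    """Map a client IP to one entry of `servers`.

    Illustrative hash only: IPv4 address as an integer, modulo the
    table length. The result depends on the ORDER of `servers`.
    """
    key = int(ipaddress.IPv4Address(client_ip))
    return servers[key % len(servers)]


if __name__ == "__main__":
    farm = ["RS1", "RS2"]
    for cip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
        print(cip, "->", source_hash(cip, farm))
    # Same client, same list, same answer every time:
    print(source_hash("10.0.0.1", farm) == source_hash("10.0.0.1", farm))
    # A different ordering of the same servers maps the client elsewhere:
    print(source_hash("10.0.0.1", ["RS2", "RS1"]))
```

The dependence on list order is the reason for Wensong's warning that the boxes must be added to every load balancer in the same order. The sketch also shows that a static mapping gives no guarantee of an even spread: with only a few clients, several of them can land on the same server.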

Testing the SH scheduler The SH scheduler schedules by client IP. Thus if you test from one client only, all connections will go to the first realserver in the ipvsadm table. Rached Ben Mustapha rached (at) alinka (dot) com 15 May 2002

It seems there is a problem with the SH scheduler and local node feature. I configured my LVS director (node-102) in direct routing mode on a 2.4.18 linux kernel with ipvs 1.0.2. The realservers are set up accordingly.

   RemoteAddress:Port Forward Weight ActiveConn InActConn
 TCP 10.0.0.50:23 sh
   -> 192.168.32.103:23 Route 1 0 0
   -> 192.168.32.101:23 Route 1 0 0

With this configuration, it's ok. Connections from different clients are load-balanced on both servers. Now I add the director:

   RemoteAddress:Port Forward Weight ActiveConn InActConn
 TCP 10.0.0.50:23 sh
   -> 127.0.0.1 Route 1 0 0
   -> 192.168.32.103:23 Route 1 0 0
   -> 192.168.32.101:23 Route 1 0 0

All new connections, whatever the client's IP, go to the director. And with this config:

   RemoteAddress:Port Forward Weight ActiveConn InActConn
 TCP 10.0.0.50:23 sh
   -> 192.168.32.103:23 Route 1 0 0
   -> 192.168.32.101:23 Route 1 0 0
   -> 127.0.0.1

Now all new connections, whatever the client's IP, go to node-103. So it seems that with the localnode feature, the scheduler always chooses the first entry in the redirection rules.

Wensong

There is no problem in the SH scheduler and localnode feature. I reproduced your setup. I issued the requests from two different clients, and all the requests were sent to the localnode. Then I issued the requests from a third client, and the requests were sent to the other server. Please see the result.

 ipvsadm -ln
 IP Virtual Server version 1.0.2 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port Forward Weight ActiveConn InActConn
 TCP 172.26.20.118:80 sh
   -> 172.26.20.72:80 Route 1 2 0
   -> 127.0.0.1:80 Local 1 0 858

I don't know how many clients are used in your test. You know that the SH scheduler is a static mapping algorithm (based on the source IP address of clients). It is quite possible that two or more client IP addresses are mapped to the same server.

"weight" is really the number of connections for the SH scheduler

Con Tassios ct (at) swin (dot) edu (dot) au 7 Jun 2006

Source hashing with a weight of 1 (the default value for the other schedulers) will result in the service being overloaded when the number of connections is greater than 2, as the output of ipvsadm shows. You should increase the weight. The weight when used with SH and DH has a different meaning than in most, if not all, of the other standard LVS scheduling methods, although this doesn't appear to be mentioned in the man page for ipvsadm.

 ... 2*n.weight) then return NULL; return n;

Joe: if weight is 1, then return NULL if number of connections > 2. If the number of connections exceeds twice the weight, don't allow any more connections.

Martijn Grendelman martijn (at) grendelman (dot) net 09 Jun 2006

That would explain the things I saw. In the meantime, I went back to a configuration using the SH scheduler, now with a weight for both real servers of 200 instead of 1, and things seem to run fine.

 ipvsadm -L
 IP Virtual Server version 1.0.10 (size=4096)
 Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port Forward Weight ActiveConn InActConn
 TCP 212.204.230.98:www sh
   -> tweety.sipo.nl:www Local 200 25 44
   -> daffy.sipo.nl:www Route 200 12 27
 TCP 212.204.230.98:https sh persistent 360
   -> tweety.sipo.nl:https Local 100 0 0
   -> daffy.sipo.nl:https Route 100 0 0

Joe: Since the SH scheduler sends a client's packets to the same realserver, I had thought that it should completely replace persistence. However you're using persistence with SH, so apparently SH doesn't handle keeping the client on the realserver as I expect. So why are you using persistence?

Ehr.. no reason, I guess. It's still there from when I used RR scheduling and I guess I forgot to remove it. I don't think it is actually useful.
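The overload rule that Con and Joe describe above can be written out as a small check. This is a sketch of their reading of the fragment, not the kernel code: the function name `sh_pick` and its arguments are invented, and which connections are counted (active only, or all) is left open here because the text does not say.

```python
def sh_pick(server, conns, weight, alive=True):
    """Return the hashed server, or None if it may not take the connection.

    Rule as read from the fragment quoted above: refuse when the server
    is dead or when conns > 2 * weight.
    """
    if not alive or conns > 2 * weight:
        return None
    return server


if __name__ == "__main__":
    # With the default weight of 1, a third connection is refused.
    print(sh_pick("RS1", conns=2, weight=1))
    print(sh_pick("RS1", conns=3, weight=1))
    # With a weight of 200, a few dozen connections are no problem.
    print(sh_pick("RS1", conns=37, weight=200))
```

Seen this way, the weight for SH acts as a cap (at twice its value) rather than as a share of the traffic, which is why raising it from 1 to 200 cured the refused connections.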

SH failout

Shutting down an SH realserver by changing the weight to 0 (as is done for the other schedulers) still allows connection requests to be sent to that realserver (you'll get a failed connection if the realserver is down). This seems to be a result of the different meaning of weight for the SH scheduler. No new sources will be allowed to initiate connections, but all connections from known sources will still be forwarded, and all known sources will be allowed to initiate connections. To stop connection requests being forwarded to a realserver, you have to remove the realserver from the ipvsadm table. You may have to break current connections to do this :-(

Thomas Pedoussaut thomas (at) pedoussaut (dot) com 16 Oct 2008

Weight to 0 means no more new connections, but existing ones (in the SH way) will still hit the real server. You need to have it properly removed.

What is an ActiveConn/InActConn (Active/Inactive) connnection? The output of ipsvadm lists connections, either as ActiveConn - in ESTABLISHED state InActConn - any other state With LVS-NAT, the director sees all the packets between the client and the realserver, so always knows the state of tcp connections and the listing from ipvsadm is accurate. However for LVS-DR, LVS-Tun, the director does not see the packets from the realserver to the client. Termination of the tcp connection occurs by one of the ends sending a FIN (see W. Richard Stevens, TCP/IP Illustrated Vol 1, ch 18, 1994, pub Addison Wesley) followed by reply ACK from the other end. Then the other end sends its FIN, followed by an ACK from the first machine. If the realserver initiates termination of the connection, the director will only be able to infer that this has happened from seeing the ACK from the client. In either case the director has to infer that the connection has closed from partial information and uses its own table of timeouts to declare that the connection has terminated. Thus the count in the InActConn column for LVS-DR, LVS-Tun is inferred rather than real. Entries in the ActiveConn column come from service with an established connection. Examples of services which hold connections in the ESTABLISHED state for long enough to see with ipvsadm are telnet and ftp (port 21). Entries in the InActConn column come from Normal operation Services like http (in non-persistent i.e. HTTP /1.0 mode) or ftp-data(port 20) which close the connections as soon as the hit/data (html page, or gif etc) has been retrieved (<1sec). You're unlikely to see anything in the ActiveConn column with these LVS'ed services. You'll see an entry in the InActConn column untill the connection times out. If you're getting 1000connections/sec and it takes 60secs for the connection to time out (the normal timeout), then you'll have 60,000 InActConns. This number of InActConn is quite normal. 
If you are running an e-commerce site with 300secs of persistence, you'll have 300,000 InActConn entries. Each entry takes 128bytes (300,000 entries is about 40M of memory, make sure you have enough RAM for your application). The number of ActiveConn might be very small. Pathological Conditions (i.e. your LVS is not setup properly) identd delayed connections: The 3 way handshake to establish a connection takes only 3 exchanges of packets (i.e. it's quick on any normal network) and you won't be quick enough with ipvsadm to see the connection in the states before it becomes ESTABLISHED. However if the service on the realserver is under , you'll see an InActConn entry during the delay period. Incorrect routing (usually the wrong default gw for the realservers): In this case the 3 way handshake will never complete, the connection will hang, and there'll be an entry in the InActConn column. Usually the number of InActConn will be larger or very much larger than the number of ActiveConn. Here's a LVS-DR LVS, setup for ftp, telnet and http, after telnetting from the client (the client command line is at the telnet prompt). RemoteAddress:Port Forward Weight ActiveConn InActConn TCP lvs2.mack.net:www rr -> RS2.mack.net:www Route 1 0 0 -> RS1.mack.net:www Route 1 0 0 TCP lvs2.mack.net:0 rr persistent 360 -> RS1.mack.net:0 Route 1 0 0 TCP lvs2.mack.net:telnet rr -> RS2.mack.net:telnet Route 1 1 0 -> RS1.mack.net:telnet Route 1 0 0 ]]> showing the ESTABLISHED telnet connection (here to realserver RS2). Here's the output of netstat -an | grep (appropriate IP) for the client and the realserver, showing that the connection is in the ESTABLISHED state. realserver:# netstat -an | grep CIP tcp 0 0 VIP:23 client:1229 ESTABLISHED Here's immediately after the client logs out from the telnet session. 
RemoteAddress:Port Forward Weight ActiveConn InActConn TCP lvs2.mack.net:www rr -> RS2.mack.net:www Route 1 0 0 -> RS1.mack.net:www Route 1 0 0 TCP lvs2.mack.net:0 rr persistent 360 -> RS1.mack.net:0 Route 1 0 0 TCP lvs2.mack.net:telnet rr -> RS2.mack.net:telnet Route 1 0 0 -> RS1.mack.net:telnet Route 1 0 0 client:# netstat -an | grep VIP #ie nothing, the client has closed the connection #the realserver has closed the session in response #to the client's request to close out the session. #The telnet server has entered the TIME_WAIT state. realserver:/home/ftp/pub# netstat -an | grep 254 tcp 0 0 VIP:23 CIP:1236 TIME_WAIT #a minute later, the entry for the connection at the realserver is gone. ]]> Here's the output after ftp'ing from the client and logging in, but before running any commands (like `dir` or `get filename`). RemoteAddress:Port Forward Weight ActiveConn InActConn TCP lvs2.mack.net:www rr -> RS2.mack.net:www Route 1 0 0 -> RS1.mack.net:www Route 1 0 0 TCP lvs2.mack.net:0 rr persistent 360 -> RS1.mack.net:0 Route 1 1 1 TCP lvs2.mack.net:telnet rr -> RS2.mack.net:telnet Route 1 0 0 -> RS1.mack.net:telnet Route 1 0 0 client:# netstat -an | grep VIP tcp 0 0 CIP:1230 VIP:21 TIME_WAIT tcp 0 0 CIP:1233 VIP:21 ESTABLISHED realserver:# netstat -an | grep 254 tcp 0 0 VIP:21 CIP:1233 ESTABLISHED ]]> The client opens 2 connections to the ftpd and leaves one open (the ftp prompt). The other connection, used to transfer the user/passwd information, is closed down after the login. The entry in the ipvsadm table corresponding to the TIME_WAIT state at the realserver is listed as InActConn. If nothing else is done at the client's ftp prompt, the connection will expire in 900 secs. Here's the realserver during this 900 secs. dir 421 Timeout (900 seconds): closing control connection. ]]> http 1.0 connections are closed immediately after retrieving the URL (i.e. you won't see any ActiveConn in the ipvsadm table immediately after the URL has been fetched). 
Here's the output after retrieving a webpage from the LVS.

           RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:www rr
      -> RS2.mack.net:www              Route   1      0          1
      -> RS1.mack.net:www              Route   1      0          0

    client:~# netstat -an | grep VIP

    RS2:/home/ftp/pub# netstat -an | grep CIP
    tcp 0 0 VIP:80 CIP:1238 TIME_WAIT

Programmatically access ActiveConn, InActConn

I want to get the active and inactive connections of one virtual service in my program.

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 12 Feb 2003

You have two options here:

- Read the /proc/net/ip_vs file and parse it for the required numbers.
- Use libipvs (distributed with ipvsadm) to read the tables directly. Take a look in ipvsadm.c for how this is done.
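The first option can be sketched as follows. This is only an illustration: it assumes the /proc/net/ip_vs layout of 2.4/2.6 kernels (IPv4 addresses and ports printed in hex, one "->" line per realserver), the sample text and addresses are made up, and the function names are hypothetical. Check the file on your own director before relying on it.

```python
from typing import Dict, List, Tuple


def hex_to_addr(hex_addr: str) -> str:
    """Convert 'C0A80164:0050' (hex IPv4:port) to '192.168.1.100:80'."""
    ip_hex, port_hex = hex_addr.split(":")
    octets = [str(int(ip_hex[i:i + 2], 16)) for i in range(0, 8, 2)]
    return ".".join(octets) + ":" + str(int(port_hex, 16))


def parse_ip_vs(text: str) -> Dict[str, List[Tuple[str, int, int]]]:
    """Map each TCP/UDP virtual service to (realserver, ActiveConn, InActConn)."""
    services: Dict[str, List[Tuple[str, int, int]]] = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] == "->":
            # Realserver line: -> addr:port Forward Weight ActiveConn InActConn
            # (the column header also starts with "->" but precedes any service)
            if current is not None:
                services[current].append(
                    (hex_to_addr(fields[1]), int(fields[4]), int(fields[5])))
        elif fields[0] in ("TCP", "UDP"):
            current = fields[0] + " " + hex_to_addr(fields[1])
            services[current] = []
        else:
            # Header lines and fwmark services are skipped in this sketch.
            current = None
    return services


# Made-up sample in the assumed format; on a director you would read
# open("/proc/net/ip_vs").read() instead.
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  C0A80164:0050 rr
  -> C0A8010B:0050      Route   1      3          7
  -> C0A8010C:0050      Route   1      2          5
"""

if __name__ == "__main__":
    for service, dests in parse_ip_vs(SAMPLE).items():
        print(service, dests)
```

Summing the second and third fields of each tuple gives the ActiveConn and InActConn totals for one virtual service.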

ActiveConn/InActConn different for 2.4.x/2.6.x kernels

Ratz 15 Oct 2006

IPVS has changed significantly between 2.4 and 2.6 with regard to the ratio of active/inactive connections. We've seen that in our rrdtool/MRTG graphs as well. In the 2.6.x kernels, at least for the (w)LC scheduler, the RS calculation is done differently. On top of that, the TCP stack has changed tunables and your hardware also behaves differently. The LVS state transition timeouts are different between 2.4.x and 2.6.x kernels, IIRC, and so, for example, if you're using LVS-DR, the transition from active connection to passive connection takes more time, thus yielding a potentially higher number of sessions in state ActiveConn.

from the mailing list Ty Beede wrote:

I am curious about the implementation of the inactconns and activeconns variables in the lvs source code.

Julian

Look in this table for the timeouts used for each protocol/state: /usr/src/linux/net/ipv4/ip_masq.c, masq_timeout_table

For LVS-Tun and LVS-DR the TCP states are changed by checking only the TCP flags from the incoming packets. For these methods UDP entries can expire (5 minutes?) if only the realserver sends packets and there are no packets from the client.

For info about the TCP states: /usr/src/linux/net/ipv4/tcp.c, rfc793.txt

Jean-francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

Done some testing (netmon) on this and here are my observations:

1. A connection becomes active when LVS sees the ACK flag in the TCP header incoming to the cluster, i.e. when the socket gets established on the realserver.
2. A connection becomes inactive when LVS sees the ACK-FIN flag in the TCP header incoming to the cluster. This does NOT correspond to the socket closing on the realserver.

Example with my Apache Web server.

                                 Server
    A client requests an object on the web server on port 80 :

    SYN REQUEST    ---->
    SYN ACK        <----
    ACK            ---->   *** ActiveConn=1 and 1 ESTABLISHED socket on realserver.
    HTTP get       ---->   *** The client requests the object
    HTTP response  <----   *** The server sends the object

    APACHE closes the socket : *** ActiveConn=1 and 0 ESTABLISHED socket on realserver

    The CLIENT receives the object. (took 15 seconds in my test)

    ACK-FIN        ---->   *** ActiveConn=0 and 0 ESTABLISHED socket on realserver

Conclusion: ActiveConn is the number of active CLIENT connections, not the number of connections on the server, in the case of short transmissions (like objects on a web page). It's hard to calculate a server's capacity based on this because slower clients make ActiveConn greater than what the server is really processing. You won't be able to reproduce that effect on a LAN, because the client receives the segment too fast.

In the LVS mailing list, many people explained that the correct way to balance the connections is to use monitoring software. The weights must be evaluated using values from the realserver. In LVS-DR and LVS-Tun, the Director can be easily fooled with invalid packets for some period and this can be enough to unbalance the cluster when using "*lc" schedulers.

I reproduced the effect by connecting at 9600 bps and getting a 100k gif from Apache, while monitoring established sockets on port 80 on the realserver and ipvsadm on the cluster.
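The trace above can be replayed with a toy model of a director that only sees packets coming from the client, as in LVS-DR and LVS-Tun. This is a sketch for illustration only: the state names and the transition table are simplified assumptions, not the actual ip_vs state table.

```python
# Simplified, assumed transitions driven only by flags seen from the client.
TRANSITIONS = {
    (None, "SYN"): "SYN_RECV",
    ("SYN_RECV", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "FIN"): "FIN_WAIT",
}


def counters(state):
    """Return (ActiveConn, InActConn) contributed by one connection entry."""
    if state is None:
        return (0, 0)
    return (1, 0) if state == "ESTABLISHED" else (0, 1)


def replay(client_packets):
    """Feed client-side packets to the model and record the counters."""
    state = None
    history = []
    for flags in client_packets:
        # Unknown (state, flags) pairs leave the state unchanged.
        state = TRANSITIONS.get((state, flags), state)
        history.append((flags, state) + counters(state))
    return history


if __name__ == "__main__":
    # SYN, handshake ACK, the HTTP GET (carries ACK), and finally the client's FIN.
    # The server closing its socket never appears here: the director does not see it.
    for row in replay(["SYN", "ACK", "ACK", "FIN"]):
        print(row)
```

In this model the entry is counted in ActiveConn from the handshake ACK until the client's FIN arrives, however early the realserver closed its own socket, which is the behaviour reported in the test.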

Julian

You are probably using LVS-DR or LVS-Tun in your test. Right? Using these methods, the LVS changes the TCP state based only on the incoming packets, i.e. those from the clients. The Director can't see the FIN packet from the realserver. This is also the reason that LVS can be easily SYN flooded, even flooded with ACK following the SYN packet. The LVS can't change the TCP state according to the state in the realserver. This is possible only for VS/NAT mode. So, in some situations you can have invalid entries in ESTABLISHED state which do not correspond to the connections in the realserver, which effectively ignores these SYN packets using cookies. VS/NAT looks like the better solution against SYN flood attacks. Of course, the ESTABLISHED timeout can be changed to 5 minutes for example. Currently, the max timeout interval (excluding the ESTABLISHED state) is 2 minutes.

If you think that you can serve the clients using a smaller timeout for the ESTABLISHED state, when under "ACK after SYN" attack, you can change it with ipchains. You don't need to change it under 2 minutes in LVS 0.9.7. In the last LVS version SYN+FIN switches the state to TIME_WAIT, which can't be controlled using ipchains. In other cases, you can change the timeout for the ESTABLISHED and FIN-WAIT states. But you can change it only down to 1 minute. If this doesn't help, buy 2GB RAM or more for the Director.

One thing that can be done, but this may be paranoia: change the INPUT_ONLY table:

    from:
            FIN
        SR ---> TW
    to:
            FIN
        SR ---> FW

OK, this is an incorrect interpretation of the TCP states, but it is a hack which allows the min state timeout to be 1 minute. Now using ipchains we can set the timeout for all TCP states to 1 minute. If this is changed you can now set ESTABLISHED and FIN-WAIT timeouts down to 1 minute. In the current LVS version the min effective timeout for the ESTABLISHED and FIN-WAIT states is 2 minutes.

Jean-Francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

I'm using DR on the cluster with 2 realservers. I'm trying to control the number of connections to achieve this: The cluster in normal mode balances requests on the 2 realservers. If the realservers reach a point where they can't serve clients fast enough, a new entry with a weight of 10000 is entered in LVS to send the overflow locally to a web server with a static web page saying "we're too busy". It's a cgi that intercepts 'deep links' in our site and returns a predefined page. A 600 second persistence ensures that already connected clients stay on the server they began to browse. The client only has to hit refresh until the number of ActiveConns (I hoped) on the realservers gets lower and the overflow entry gets deleted. Got the idea... Load balancing with overflow control.

Julian Good idea. But LVS can't help you. When the clients are redirected to the Director they stay there for 600 seconds.

But when we activate the local redirection of requests due to overflow, ActiveConn continues to grow in LVS, while InActConn decreases as expected. So the load on the realserver gets OK... but LVS doesn't see it and doesn't let new clients in (it takes 12 minutes before ActiveConn decreases enough to reopen the site). I need a way, a value to check, that says the server is overloaded, begin redirecting locally, and the opposite. I know that seems a little complicated....

Julian

What about trying to:

- use a persistent timeout of 1 second for the virtual service. If you have one entry for this client you have all entries from this client to the same realserver. I haven't tested it, but maybe a client will load the whole web page. If the server is overloaded the next web page will be "we're too busy".
- switch the weight for the Director between 0 and 10000. Don't delete the Director as realserver. Weight 0 means "No new connections to the server". You have to play with the weight for the Director, for example:
  - if your realservers are loaded near 99%, set the weight to 10000
  - if your realservers are loaded below 95%, set the weight to 0

Jean-Francois Nadeau jf (dot) nadeau (at) videotron (dot) ca

Will a weight of 0 redirect traffic to the other realservers (persistency remains ?)

Julian If the persistent timeout is small, I think.

I can't get rid of the 600 seconds persistence because we run a transactional engine, i.e. if a client begins on a realserver, he must complete the transaction on that server or get an error (transactional contexts are stored locally).

Such a timeout can't help to redirect the clients back to the realservers. You can check the free RAM or the CPU idle time for the realservers. In this way you can correctly set the weights for the realservers and switch the weight for the Director. These recommendations can be completely wrong. I've never tested them. If they don't help, try to set httpd.conf:MaxClients to some reasonable value. Why not put the Director in as a realserver permanently? With 3 realservers it is better.

Jean

Those are already optimized; the bottleneck is when 1500 clients try our site in less than 5 minutes..... One of ours has suggested that the realservers check their own state (via TCP in use given by sockstat) and command the director to redirect traffic when needed. Can you explain in more detail why the number of ActiveConn on the realserver continues to grow while redirecting traffic locally with a weight of 10000 (and InActConn on that realserver decreases normally)?

Julian Only the new clients are redirected to the Director at this moment. Where the active connections continue to grow, in the real servers or in the Director (weight=10000)?

How is ActiveConn/InActConn calculated?

Joe, 14 May 2001 according to the ipvsadm man page, for "lc" scheduling, the new connections are assigned according to the number of "active connections". Is this the same as "ActiveConn" in the output of ipvsadm? If the number of "active connections" used to determine the scheduling is "ActiveConn", then for services which don't maintain connections (e.g. http or UDP services), the scheduler won't have much information, just "0" for all realservers?

Julian, 14 May and 23 May

It is a counter and it is incremented when a new connection is created. The formula is:

    active connections = ActiveConn * K + InActConn

where K can be 32 to 50 (I don't remember the last used value), so it is not only the active conns (which would break UDP).

Is "active connections" incremented if the client re-uses a port?

No, the reused connections are not counted.
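A minimal sketch of a least-connection choice along the lines Julian describes, where active connections are weighted by a factor K and inactive ones still count. The exact expression (ActiveConn * K + InActConn), the value K=50 and the function names are assumptions for illustration; the real scheduler lives in the kernel and may differ between versions.

```python
def overhead(active, inactive, k=50):
    """Assumed per-realserver cost: active connections weigh K times more."""
    return active * k + inactive


def pick_least_loaded(servers, k=50):
    """servers maps a realserver name to (ActiveConn, InActConn)."""
    return min(servers, key=lambda name: overhead(*servers[name], k))


if __name__ == "__main__":
    # With http-like traffic ActiveConn is 0 everywhere, but InActConn
    # still gives the scheduler something to compare.
    print(pick_least_loaded({"RS1": (0, 900), "RS2": (0, 400)}))
    # One extra active connection outweighs a modest number of inactive ones.
    print(pick_least_loaded({"RS1": (10, 0), "RS2": (9, 100)}))
```

This shows why services that do not hold connections open are still balanced: the inactive entries contribute to the count even when every ActiveConn is 0.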

ActiveConn is a guess for LVS-DR For LVS-DR, the director doesn't see the return packets and uses tables of timeouts to guess a likely state of the service at the realserver. For the same reason you can't do stateful filtering on the director for LVS-DR controlled packets. barrywong barrywong (at) sina (dot) com 30 Aug 2008

I'm using DR wlc persistent 120

           RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  xx.xx.xx.xx:80 wlc persistent 120
      -> xx.xx.xx.x1:80                Route   1      6459       7057
      -> xx.xx.xx.x2:80                Route   1      6446       4766

    # netstat -n realserver ESTABLISHED
    xx.xx.xx.x1:80 4210 ESTABLISHED
    xx.xx.xx.x2:80 4483 ESTABLISHED

The number of ESTABLISHED connections on the realservers is less than the LVS ActiveConn number. Why is this?

Thomas Pedoussaut thomas (at) pedoussaut (dot) com

I guess your issue is that the persistence is low compared to your usage. I've had similar numbers with a mysql setup. Basically, there were hundreds of very long-lasting connections that weren't carrying much traffic, sometimes pausing for hours. They would disappear from the LVS status but still be visible on the client and the server as CONNECTED. It's not really a big issue. Usually server affinity makes the resuming packets go to the same server, so the connection can still be used. If that weren't the case, there is enough code on the client side to establish a new connection if that one were to fail. You'll still have to face a problem with the server side connections that will be lingering in a limbo state. I would consider setting some sort of timeout on that side. I'm not 100% sure, but your realservers are running squid on port 80, correct? If so, please have a look at http://www.squid-cache.org/Versions/v3/3.0/cfgman/read_timeout.html and probably shorten it (or extend your LVS persistence to that value with ipvsadm --set).

FAQ: ipvsadm shows entries in InActConn, but none in ActiveConn, connection hangs. What's wrong?

The usual mistake is to have the default gw for the realservers set incorrectly.

- LVS-NAT: the default gw must be the director. There cannot be any other path from the realservers to the client, except through the director.
- LVS-DR, LVS-Tun: the default gw cannot be the director - use some local router.

Setting up an LVS by hand is tedious. You can use the configure script which will trap most errors in setup.

FAQ: initial connection is delayed, but once connected everything is fine. What's wrong?

Usually you have problems with identd. The simplest thing is to stop your service from calling the identd server on the client (i.e. disconnect your service from identd).

unbalanced realservers: does rr and lc weighting equally distribute the load? - clients reusing ports

(also see the performance section.)

I ran the polygraph simple.pg test on a LVS-NAT LVS with 4 realservers using rr scheduling. Since the responses from the realservers should average out, I would have expected the number of connections and the load average to be equally distributed over the realservers. Here's the output of ipvsadm shortly after the number of connections had reached steady state (about 5 mins).

           RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:polygraph rr
      -> RS4.mack.net:polygraph        Masq    1      0          883
      -> RS3.mack.net:polygraph        Masq    1      0          924
      -> RS2.mack.net:polygraph        Masq    1      0          1186
      -> RS1.mack.net:polygraph        Masq    1      0          982

The servers were identical hardware. I expect (but am not sure) that the utils/software on the machines is identical (I set up RS3, RS4 about 6 months after RS1, RS2). RS2 was running 2.2.19, while the other 3 machines were running 2.4.3 kernels. The number of connections (all in TIME_WAIT) at the realservers was different for each (otherwise apparently identical) realserver and was in the range 450-500 for the 2.4.3 machines and 1000 for the 2.2.19 machine (measured with netstat -an | grep $polygraph_port | wc) and varied about 10% over a long period.

This run had been done immediately after another run and InActConn had not been allowed to drop to 0. Here I repeated this run, after first waiting for InActConn to drop to 0.

           RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:polygraph rr
      -> RS4.mack.net:polygraph        Masq    1      0          994
      -> RS3.mack.net:polygraph        Masq    1      0          994
      -> RS2.mack.net:polygraph        Masq    1      0          994
      -> RS1.mack.net:polygraph        Masq    1      1          992
    TCP  lvs2.mack.net:netpipe rr

RS2 (the 2.2.19 machine) had 900 connections in TIME_WAIT while the other (2.4.3) machines were 400-600. RS2 was also delivering about 50% more hits to the client.
Repeating the run using "lc" scheduling, the InActConn remains constant.

           RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:polygraph lc
      -> RS4.mack.net:polygraph        Masq    1      0          994
      -> RS3.mack.net:polygraph        Masq    1      0          994
      -> RS2.mack.net:polygraph        Masq    1      0          994
      -> RS1.mack.net:polygraph        Masq    1      0          993

The number of connections (all in TIME_WAIT) at the realservers did not change.

I've been running the polygraph simple.pg test over the weekend using rr scheduling on what (AFAIK) are 4 identical realservers in a LVS-NAT LVS. There are no ActiveConn and a large number of InActConn. Presumably the client makes a new connection for each request.

Julian (I think)

The implicit persistence of TCP connection reuse can cause such side effects even for RR. When the setup includes small number of hosts and the used rate is big enough to reuse the client's port, the LVS detects existing connections and new connections are not created. This is the reason you can see some of the rs not to be used at all, even for such method as RR.

the client is using ports from 1025-4999 (has about 2000 open at one time) and it's not going above the 4999 barrier. ipvsadm shows a constant InActConn of 990-995 for all realservers, but the number of connections on each of the realservers (netstat -an) ranges from 400-900. So if the client is reusing ports (I thought you always incremented the port by 1 till you got to 64k and then it rolled over again), LVS won't create a new entry in the hash table if the old one hasn't expired?

Yes, it seems you have (5000-1024) connections that never expire in LVS.
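A quick check of this arithmetic, using only the port range quoted above (1025-4999) and the 4 realservers of the test. The even split across realservers is an assumption that holds for rr when every client port maps to one long-lived entry.

```python
def entries_per_realserver(first_port, last_port, realservers):
    """Split one director entry per client port evenly over the realservers."""
    ports = last_port - first_port + 1
    base, remainder = divmod(ports, realservers)
    # 'remainder' realservers get one entry more than the others.
    return ports, [base + 1] * remainder + [base] * (realservers - remainder)


if __name__ == "__main__":
    total, split = entries_per_realserver(1025, 4999, 4)
    print(total)   # 3975
    print(split)   # [994, 994, 994, 993]
```

The result matches the InActConn columns in the ipvsadm listings (994, 994, 994 and 993, or 992 plus one ActiveConn), which supports the explanation that the director holds one non-expiring entry per reused client port.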

Presumably because the director doesn't know the number of connections at the realservers (it only has the number of entries in its tables), and because even apparently identical realservers aren't identical (the hardware here is the same, but I set them up at different times, presumably not all the files and time outs are the same), the throughput of different realservers may not be the same. The apparent unbalance in the number of InActConn then is a combination of some clients reusing ports and the director's method of estimating the number of connections, which makes assumptions about TIME_WAIT on the realserver. A better estimate of the number of connections at the realservers would have been to look at the number of ESTABLISHED and TIME_WAIT connections on the realservers, but I didn't think of this at the time when I did the above tests. The unbalance then isn't anything that we're regarding as a big enough problem to find a fix for.

Changing weights with ipvsadm When setting up a service, you set the weight with a command like (default weight is 1). If you set the weight for the service to "0", then no new connections will be made to that service (see also man ipvsadm, about the -w option).

Lars Marowsky-Bree lmb (at) suse (dot) de 11 May 2001

Setting weight = 0 means that no further connections will be assigned to the machine, but current ones remain established. This allows you to smoothly take a realserver out of service, e.g. for maintenance. Removing the server hard cuts all active connections. This is the correct response to a monitoring failure, so that clients receive immediate notice that the server they are connected to has died and they can reconnect.

Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 11 May 2001 Is there a way to clear some entries in the ipvs tables ? If a server reboots or crashes, the connection entries remains in the ipvsadm table. Is there a way to remove manually some entries? I have tried to remove the realserver from the service (with ipvsadm -d .... ), but the entries are still there.
Joe After a service (or realserver) failure, some agent external to LVS will run ipvsadm to delete the entry for the service. Once this is done no new connections can be made to that service, but the entries are kept in the table till they timeout. (If the service is still up, you can delete the entries and then re-add the service and the client will not have been disconnected). You can't "remove" those entries, you can only change the timeout values. Any clients connected through those entries to the failed service(s) will find their connection hung or deranged in some way. We can't do anything about that. The client will have to disconnect and make a new connection. For http where the client makes a new connection almost every page fetch, this is not a problem. Someone connected to a database may find their screen has frozen.

If you are going to set the weight of a connection, you need to first know the state of the LVS. If the service is not already in the ipvsadm table, you add (-a) it. If the service is already in the ipvsadm table, you edit (-e) it. There is no command to just set the weight no matter what the state. A patch exists to do this (from Horms) but Wensong doesn't want to include it. Scripts which dynamically add, delete or change weights on services will have to know the state of the LVS before making any changes, or else trap errors from running the wrong command.
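A script can decide between add and edit by looking at the current table first. The sketch below only builds the command line; it assumes a TCP service (-t), LVS-DR forwarding (-g) as the default, and the usual layout of `ipvsadm -L -n` output. The helper names and the sample addresses are hypothetical.

```python
def realserver_present(listing, vip_port, rip_port):
    """Check an `ipvsadm -L -n` style listing for rip_port under vip_port."""
    current = None
    for line in listing.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] == "->":
            if current == vip_port and fields[1] == rip_port:
                return True
        else:
            # Service line, e.g. "TCP 192.168.1.110:80 rr" (headers never match a VIP).
            current = fields[1]
    return False


def weight_command(listing, vip_port, rip_port, weight, forward="-g"):
    """Return the ipvsadm argument list: -e if the entry exists, else -a."""
    action = "-e" if realserver_present(listing, vip_port, rip_port) else "-a"
    return ["ipvsadm", action, "-t", vip_port, "-r", rip_port,
            forward, "-w", str(weight)]


# Made-up listing for illustration.
LISTING = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  192.168.1.110:80 rr
  -> 192.168.1.11:80              Route   1      0          0
"""

if __name__ == "__main__":
    print(weight_command(LISTING, "192.168.1.110:80", "192.168.1.11:80", 0))
    print(weight_command(LISTING, "192.168.1.110:80", "192.168.1.12:80", 1))
```

On a director the listing would come from running ipvsadm itself (for example with subprocess.run), and the returned list would be executed the same way. The table can change between the check and the command, so trapping errors is still needed.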

Setting initial weights

If your hardware is all the same, you set them all to the same weight (1:1:1.., or 1000:1000:1000; it's the ratio that's important, not the value). What if you have a bunch of different hardware and you don't know what weight to set for each?

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 13 Nov 2008

Run them all on the same weight for a while and see how they get loaded, then decide if you need to play with the weights or just add more servers. In a layer 4 balanced cluster I normally recommend no greater than 40% utilisation as a rule of thumb to cope with peaks in demand.

Graeme

Ensure you start with (for example) 100/100/100, or 1000/1000/1000. It's easier to juggle the weights with those values than 1/1/1 !

Dynamically changing realserver weights Some law of averaging large numbers predicts that realservers with large numbers of clients should have nearly the same load. In practice, realservers can have widely different loads or numbers of connections. Presumably this is for services where a small number of clients can saturate a realserver. LVS's serving static html pages should have even loads on the realservers. Leonard Soetedjo Leonard (at) axs (dot) com (dot) sg

From reading the howto, mon doesn't handle dynamically changing the realserver weights. Is it advisable to create a monitoring program that changes the weight of the realserver? The program will check the worker's load, memory etc. and reduce or increase the weight in the director based on that information.

Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uki 09 Dec 2002

Personally I think it adds complication you shouldn't require... As your servers are running the same app they should respond in roughly the same way to traffic.. If you have a very fast server reduce its weight. If you have some very slow pages.. i.e. Global Search... Then why not set up another VIP to make sure that all requests to search.mydomain.com are evenly distributed... or restricted to a couple of servers (so they don't impact everyone else..) But obviously it all depends on how your app works, with mine it's database performance that is the problem... Time to look at loadbalancing the DB! Does anyone have any experience of doing this with MS SQL server and/or PostGreSQL? I'm thinking about running the session / sales stuff off the MS SQL box, and all the readonly content from several readonly PostGreSQL DBs... Due to licencing costs... :-(. OTOH, someone recently spoke on the list about a monitoring tool which could use plugins to monitor the realservers. (Joe - see feedbackd.)

Lars Marowsky-Bree lmb (at) suse (dot) de 17 Mar 2003

Keep in mind that loadavg is a poor indication of real resource utilization, but it might be enough. loadavg needs to be at least normalized via the number of CPUs.

Andres Tello criptos (at) aullox (dot) com 17 Mar 2003

I use: ram*speed/1000 to calculate the weight.

Joe

dsh (http://www.netfort.gr.jp/~dancer/software/dsh.html, machine gone Sep 2004) is good for running commands (via rsh, ssh) on multiple machines (machines listed in a file). I'm using it here to monitor temperatures on multiple machines.

Bruno Bonfils

The loads can become unbalanced even if the realservers are identical. Customers can read different pages. Some of them may have heavy php/sql usage, which implies a higher load average than a simple static html file.

Rylan W. Hazelton rylan (at) curiouslabs (dot) com 17 Mar 2003

I still find large differences in the load average of the realservers. WLC has no idea what else (non httpd) might be happening on a server. Maybe I am compiling something for some reason, or I have a large cron job. It would be nice if LVS could redirect load accordingly.
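As an illustration of Lars' point about normalizing loadavg by the number of CPUs, here is a rough sketch of how an agent on a realserver might turn its load into a weight. The linear mapping, the 1..100 weight range and the function name are arbitrary assumptions; as noted above, loadavg is a poor indicator and a real agent would likely combine several measurements.

```python
import os


def weight_from_load(load1, cpus, max_weight=100, min_weight=1):
    """Map the 1-minute load average per CPU onto a weight (assumed linear)."""
    per_cpu = load1 / cpus
    # Idle machine gets max_weight; a load of 1.0 per CPU or more gets min_weight.
    scaled = max_weight * (1.0 - min(per_cpu, 1.0))
    return max(min_weight, int(round(scaled)))


if __name__ == "__main__":
    load1 = os.getloadavg()[0]      # 1-minute load average (Unix only)
    cpus = os.cpu_count() or 1
    print(weight_from_load(load1, cpus))
```

The weight never drops to 0 here, so a busy realserver still receives some connections; taking a machine out of service is left to the health checker.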

feedbackd

Joe, Mar 2003

Jeremy's feedbackd code and HOWTO (feedbackd) is now released.

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 09 Dec 2002:

As I've said earlier (check out the thread starting at http://www.in-addr.de/pipermail/lvs-users/2002-November/007264.html ), the software sends server load information to the director to be inserted into the ipvs weighting tables. I'm busy writing up the benchmarking results at the moment, and I'll post a link to the paper (and software) soon. In summary: when the simulation cluster is (intentionally) unbalanced, the feedback software successfully evens the load between all servers.

Jeremy at one stage had plans to merge his code with Alexandre's code, but (Aug 2004) he's not doing anything about it at the moment (he has a real job now).

Jeremy Kerr jeremy (at) redfishsoftware (dot) com (dot) au 04 Feb 2005

Everything's available at feedbackd (http://ozlabs.org/~jk/projects/feedbackd/).

Michal Kurowski mkur (at) gazeta (dot) pl 19 Jan 2007

I want to distribute the load based on criteria such as:

- disk usage
- OS load average
- CPU usage
- custom hooks into my own software
- You Name It (TM)

feedbackd has got CPU-monitoring only by default. It also has a perl plugin that's supposed to let you code something relevant to you quickly. That's a perfect idea except the original perl plugin is not fully functional, because it breaks some rules regarding linking to C-based modules. I wrote feedbackd-agent.patch that solves the problem (it's against the latest release - 0.4, and I've sent Jeremy a copy).

lvs-kiss

Per Andreas Buer perbu (at) linpro (dot) no 14 Dec 2002

I ran into the same problem this summer. I was setting up a loadbalanced SMTP cluster and I wanted to distribute the incoming connections based on the number of e-mails in the mail servers' queues. We ended up making our own program to do this. Later, I made the thing a bit more generic and released it. You might want to check it out: http://www.linpro.no/projects/lvs-kiss/

lvs-kiss distributes incoming connections based on some numerical value; as long as you are able to quantify it, it can be used. It can also time certain tests in order to determine the load of the realservers.

connection threshold according to my dictionary, the spelling is threshold and not the more logical threshhold as found in many webpages (see google: "threshhold dictionary"). If the realservers are limited in the number of connections they can support, you can use the connection threshold in ipvsadm (in ip_vs 1.1.0). See the Changelog and man ipvsadm. This functionality is derived from Ratz's original patches. Matt Burleigh

Is there a stock Red Hat kernel with a new enough version of ip_vs to include connection thresholds?

Ratz 19 Dec 2002:

I've done the original patch for 2.2.x kernels but I've never ported it to 2.4.x kernels. I don't know if RH has done so. In the newest LVS release for 2.5.x kernels the same concept is there, so with a bit of coding (maybe even luck) you could use that.

ratz ratz (at) tac (dot) ch 2001-01-29

This patch on top of ipvs-1.0.3-2.2.18 adds support for threshold settings per realserver for all schedulers that have the -w option.

As of Jun 2003, patches are available for 2.4 kernels. All patches are on Ratz's LVS page. Patches are in active development (i.e. you'll be helping with the debugging); look at the mailing list for updates.

Horms 30 Aug 2004

LVS in 2.6 has its own connection limiting code. There isn't a whole lot to it. Just get ipvsadm for 2.6 and take a look in the man page. It has details on how the connection thresholds can be set. It's pretty straightforward as I recall.

Anon

Is there any way to limit connections per IP through IPVS, to mimic the netfilter connection limit module ipt_connlimit? I see ipvsadm's threshold option, but it does totals per server.

Ratz 15 Feb 2007

How exactly is the threshold option (per RS) different from ipt_connlimit regarding the service -> RS pool mapping? If you need source IP limiting you're better off using QoS anyway.

Description/Purpose

I was always thinking of how a kernel based implementation of connection limitation per realserver would work and how it could be implemented, so while waiting in the hospital for the x-ray I had enough time to write up a little dirty hack as a proof of concept. It works as follows. I added three new entries to the ip_vs_dest() struct, u_thresh and l_thresh in ip_vs.* and I modified ipvsadm to add the two new options x and y. A typical setup would be:

So, this means, as soon as (dest->inactconns + dest->activeconns) exceeds the x value the weight of this server is set to zero. As soon as the connections drop below the lower threshold (y) the weight is set back to the initial value.

What is it good for? Yeah well, I don't know exactly, imagine yourself, but first of all this is a proposal and I wanted to ask for a discussion about a possible inclusion of such a feature or even a derived one into the main code (of course after fixing the race conditions and bugs and cleaning up the code) and second, I found out from tons of talks with customers that such a feature is needed, because commercial lbs also have this and managers always like to have a nice comparison of all features to decide which product they take. Doing all this in userspace is unfortunately just not atomic enough.

Anyway, if anybody else thinks that such a feature might be vital for inclusion we can talk about it. If you look at the code, it wouldn't break anything and just adds two lousy CPU cycles for checking if u_thresh is <0. This feature can easily be disabled by just setting u_thresh to zero or not even initializing it. Well, I'm open for discussion and flames. I have it running in production :) but with a special SLA.
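The upper/lower threshold behaviour described above can be sketched in userspace terms as follows. This is not Ratz's kernel patch: the class name is hypothetical, the field names u_thresh and l_thresh are borrowed from the description, and treating u_thresh <= 0 as "disabled" is an assumption based on the text.

```python
class ThresholdedDest:
    """One realserver entry with upper/lower connection thresholds."""

    def __init__(self, weight, u_thresh, l_thresh):
        self.initial_weight = weight
        self.weight = weight
        self.u_thresh = u_thresh
        self.l_thresh = l_thresh

    def update(self, activeconns, inactconns):
        """Apply the thresholds to the current counters; return the weight."""
        if self.u_thresh <= 0:
            return self.weight              # feature disabled
        total = activeconns + inactconns
        if total > self.u_thresh:
            self.weight = 0                 # quiesce: no new connections
        elif total < self.l_thresh:
            self.weight = self.initial_weight
        # Between the two thresholds the previous weight is kept (hysteresis).
        return self.weight


if __name__ == "__main__":
    dest = ThresholdedDest(weight=1, u_thresh=100, l_thresh=80)
    for active, inactive in [(50, 40), (60, 50), (50, 40), (40, 30)]:
        print(active + inactive, dest.update(active, inactive))
```

The gap between the two thresholds keeps the realserver from flapping in and out of service when the connection count hovers around a single limit.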
I implemented the last server of resort, which works like this: If all RS of a service are down (the healthcheck took them out or the threshold check set the weight to zero), my userspace tool automagically invokes the last server of resort, a tiny httpd with a static page saying that the service is currently unavailable. This is also useful if you want to do maintenance of the realservers. I have already implemented a dozen such setups and they all work pretty well.

How will we defend against DDoS (distributed DoS)?

I'm using a packetfilter and in special zones a firewall after the packetfilter ;) No seriously, I personally don't think the LVS should take too much part on securing the realservers It's just another part of the firewall setup.

The problem is that LVS has another view for the realserver load. The director sees one number of connections the realserver sees another one. And under attack we observe big gap between the active/inactive counters and the used threshold values. In this case we just exclude all realservers. This is the reason I prefer the more informed approach of using agents. Using the number of active or inactive connections to assign a new weight is _very_ dangerous.

I know, but OTOH, if you set a threshold and my code takes the server out because of a well formed DDoS attack, I think it is even better than if you allowed the DDoS and maybe killed the realserver's http-listener.
No, we have two choices:
- use SYN cookies and much memory for open requests, accept more valid requests
- don't use SYN cookies, drop the requests exceeding the backlog length, drop many valid requests but the realservers are not overloaded
In both cases the listeners don't see requests until the handshake is completed (Linux).
BTW, what if you enable the defense strategies of the loadbalancer? I've done some tests and I was able to flood the realservers by sending forged SYNs and timeshifted SYN-ACKs with the expected seq-nr. It was impossible to work on the realservers unless of course I enabled the TCP_SYNCOOKIES.

Yes, nobody claims the defense strategies guard the real servers. This is not their goal. They keep the director with more free memory and nothing more :) Only drop_packet can control the request rate but only for the new requests.

I then enabled my patch and after the connections exceeded the threshold, the kernel took the server out temporarily by setting the weight to 0. In that way the server was usable and I could work on the server.

Yes, but the clients can't work. You exclude all servers in this case, because LVS spreads the requests to all servers and the rain becomes a deluge :) In theory, the number of connections is related to the load, but this is true only when the world is ideal. The inactive counter can reach very high values when we are under attack. Even the WLC method loads the realservers proportionately, but they are never excluded from operation.

True, but as I already said, I think LVS shouldn't replace a fw. I normally have a router configured properly, then a packetfilter, then a firewall or even another, stateful, packetfilter. See, the patch itself is not even mandatory. In a normal setup, my code is not even touched (except the ``if'' :).
I have some thoughts about limiting the traffic per connection but this idea must be analyzed.
Hmm, I just want to limit the amount of concurrent connections per realserver and in the future maybe per service. This saved me quite some lines of code in my userspace healthchecking daemon.

Yes, you vote for moving some features from user to the kernel space. We must find the right balance: what can be done in LVS and what must be implemented in the user space tools. The other alternatives are to use the Netfilter's "limit" target or QoS to limit the traffic to the realservers.

But then you have to add quite some code. The limit target has no idea about LVS tables. How should this work, e.g. if you would like to rate limit the number of connections to a realserver?

Maybe we can limit the SYN rate. Of course, that does not cover all cases, so my thought was to limit the packet rate for all states or per connection; not sure, this is an open topic. It is easy to open a connection through the director (especially in LVS-DR) and then to flood this connection with packets. This is one of the cases where LVS can really guard the realservers from packet floods. If we combine this with the other kind of attacks, the distributed ones, we have better control. Of course, some QoS implementations can cover such problems, not sure. And this can be a simple implementation; of course, nobody wants to reinvent the wheel :) Let's analyze the problem. If we move new connections from an "overloaded" realserver and redirect them to the other realservers, we will overload them too.

No, unless you use an old machine. This is maybe a requirement of an e-commerce application. They have some servers, and if the servers are overloaded (detected by my user-space healthchecking daemon because the response time is too high or the application daemon is not listening on the port anymore) they will be taken out. Now I have found out that by setting thresholds I could reduce the downtime of a flooded server significantly. In case all servers were taken out or their weights were set to 0, the userspace application sets up a temporary new realserver (either a local route or another server) that has nothing else to do than push a static webpage saying that the service is currently unavailable due to high server load or DDoS attack or whatever. Put this page behind a TUX 2.0 and try to overflow it. If you can, apply the zero-copy patches of DaveM. No way you will find such a fast (88MBit/s requests!!) link to saturate the server.
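The "last server of resort" decision described above can be sketched in a few lines. This is only an illustration of the selection logic, not Ratz's daemon: the function name, the `(healthy, weight)` tuple layout and the fallback name are all made up for the example.

```python
def servers_to_use(realservers, fallback):
    """Pick the servers that should receive traffic.

    realservers: dict mapping a (hypothetical) server name to a
                 (healthy, weight) tuple as seen by the healthchecker.
    fallback:    name of the tiny "service unavailable" server.
    """
    active = [name for name, (healthy, weight) in realservers.items()
              if healthy and weight > 0]
    # Only when nothing usable is left do we fall back to the static page.
    return active if active else [fallback]
```

With one healthy realserver the fallback stays out of the way; it is used only once every realserver is either unhealthy or quiesced with weight 0.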

Yes, I know that this is a working solution. But see, you exclude all realservers :) You are giving up. My idea is for us to find a state where we can drop some of the requests and keep the realservers busy but responsive. This can be a difficult task, but not when we have help from our agents. We expect that many valid requests can be dropped, but if we keep the realservers in good health we can handle some valid requests, because nobody knows when the flood will stop. The link is busy but it contains valid requests. And the service does not see the invalid ones. IMO, the problem is that there are more connection requests than the cluster can handle. Solutions that try to move the traffic between the realservers can only cause more problems. If at the same time we set the weights to 0, this leads to more delay in the processing. It may be more useful to start by reducing the weights, but this again returns us to the theory of the smart cluster software. So, we can't exit from this situation without dropping requests. There is more traffic than can be served by the cluster.

The content of the requests does not matter; we care how much load they introduce, and we report this load to the director. It can look, for example, like one value (weight) for the real host that can be set for all real services running on this host. We don't need to generate 10 weights for the 10 real services running on our real host. And we change the weight every 2 seconds, for example. We need two syscalls (lseek and read) to get most of the values from the /proc fs, but maybe from 2-3 files. This is in Linux, of course. Not sure how this behaves under attack. We will see :)
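As a rough illustration of the agent idea above (one weight per real host, refreshed every couple of seconds with a seek and a read on a /proc file), here is a minimal sketch. It assumes the usual layout of the aggregate `cpu` line in /proc/stat; the function names and the mapping from idle fraction to weight are invented for the example and are not part of LVS.

```python
def parse_cpu_times(stat_text):
    """Return (idle, total) jiffies from the aggregate 'cpu' line of /proc/stat."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            # user nice system idle [iowait irq softirq steal]; later
            # fields (guest times) are already counted in user/nice.
            values = [int(v) for v in fields[1:9]]
            return values[3], sum(values)
    raise ValueError("no aggregate cpu line found")


def idle_fraction(prev, cur):
    """Fraction of time spent idle between two (idle, total) samples."""
    d_idle = cur[0] - prev[0]
    d_total = cur[1] - prev[1]
    return d_idle / d_total if d_total > 0 else 0.0


def host_weight(frac, max_weight=100):
    """Map an idle fraction to one weight for the whole real host (assumed mapping)."""
    return max(1, round(frac * max_weight))


def sample(stat_file):
    """Re-read an already open /proc/stat: one seek and one read per sample."""
    stat_file.seek(0)
    return parse_cpu_times(stat_file.read())
```

An agent would keep /proc/stat open, call `sample()` every 2 seconds or so, and send `host_weight(idle_fraction(previous, current))` to the director; how the value gets to the director is left out here.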

Obviously yes, but if you also include the practical problem of SLAs with customers and guaranteed downtime per month, I still have to say that for my deploition (is this the correct noun?) I do better with my patch and the LVS defense strategies enabled in case of a DDoS than without.

If there is no cluster software to keep the realservers equally loaded, some of them can go offline too early.

The scheduler should keep them equally loaded IMO even in case of let's say 70% forged packets. Again, if you don't like to set a threshold, leave it. The patch is open enough. If you like to set it, set it, maybe set it very high. It's up to you.

The only problem we have with this scheme is the ipvsadm binary. It must be changed (the user structure in the kernel :)) The last change dates from 0.9.10 and this is a long time ago :) But you know what a change in the user structures means :) The cluster software can take the role of monitoring the load instead of relying on the connection counters. I agree, changing the weights and deciding how much traffic to drop can be expressed with a complex formula. But I can see it only as a complete solution: to balance the load and to drop the excess requests, serving as many requests as possible. Even the drop_packet strategy can help here; we can explicitly enable it, specifying the proper drop rate. We don't need to use it only to defend the LVS box but also to drop the excess traffic. But someone has to control the drop rate :) If there is no excess traffic, what problems can we expect? Only those from bad load balancing :) The easiest way to control LVS is from user space, leaving only the basic needed support in LVS. This gives us more ways to control LVS.

Flushing connection table Shinichi Kido shin (at) kidohome (dot) ddo (dot) jp

I want to reset the whole connection table (the list output by the ipvsadm -lc command) immediately, without waiting for all the connections to expire.

Stefan Schlosser castla (at) grmmbl (dot) org 04 Jun 2004

you may want to try these patches: and use ipvsadm -F

Horms horms (at) verge (dot) net (dot) au 04 Jun 2004

Another alternative, if you have lvs compiled as a module, is to reload it. This will clear everything. Then again, Shinichi-san, why do you want to clear the connection table? It might be useful for testing. But I am not sure what it would be useful for in production.

Joe

In general it's not good for a server to accept a connection and then unilaterally break it. You should let the connections expire. If you don't want any new connections, just set weight=0.

Thundering herd problem, Slow start code for realserver(s) coming on line

Despite what you may have read in the mailing list and possibly in earlier versions of this HOWTO, there is no slow start for realservers coming on line (I thought it was in the code from the very early days). If you bring a new realserver on-line with *lc scheduling, the new machine, having fewer connections (i.e. none), will get all the new connections. This will likely stress the new realserver.

As Jan Klopper points out (11 Mar 2006), you don't get the thundering herd problem with round robin scheduling. In this case the number of connections will even out when the old connections expire (for http this may only be a few minutes). It would be simple enough to bring up a new realserver with all rules being rr, then after 5 mins change over to lc (if you want lc).

Horms says (off-line Dec 2006) that it's simple enough to use the in-kernel timers to handle this problem; he just hasn't done it. Some early patches to handle the problem received zero response, so he dropped it.

Christopher Seawood cls (at) aureate (dot) com

LVS seems to work great until a server goes down (this is where mon comes in). Here are a couple of things to keep in mind. If you're using the Weighted Round-Robin scheduler, then LVS will still attempt to hit the server once it goes down. If you're using the Least Connections scheduler, then all new connections will be directed to the down server because it has 0 connections. You'd think using mon would fix these problems, but not in all cases. Adding mon to the LC setup didn't help matters much. I took one of three servers out of the loop and waited for mon to drop the entry. That worked great. When I started the server back up, mon added the entry. During that time, the 2 running servers had gathered about 1000 connections apiece. When the third server came back up, it immediately received all of the new connections. It kept receiving all of the connections until it had an equal number of connections with the other servers (which by this time...a minute or so later...had fallen to ~700). By this time, the 3rd server had been restarted due to triggering a high load sensor also monitoring the machine (a necessary evil, or so I'm told). At this point, I dropped back to using WRR as I could envision the cycle repeating itself indefinitely.

Horms has fixed this problem with a patch to ipvsadm.

Horms horms (at) verge (dot) net (dot) au 23 Feb 2004

Here is a patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/thundering_herd.diff) that implements slow start for the WLC scheduler. This is designed to address the problem where a realserver is added to the pool and soon inundated with connections. This is sometimes referred to as the thundering herd problem and was recently the topic of a thread on this list, "Easing a Server into Rotation". http://marc.theaimsgroup.com/?l=linux-virtual-server&m=107591805721441&w=2

The patch has two basic parts.

ip_vs_ctl.c: When the weight of a realserver is modified (or a realserver is added), set the IP_VS_DEST_F_WEIGHT_INC or IP_VS_DEST_F_WEIGHT_DEC flag as appropriate and put the size of the change in dest.slow_start_data. This information is intended to act as hints for scheduler modules to implement slow start. The scheduler modules may completely ignore this information without any side effects.

ip_vs_wlc.c: If IP_VS_DEST_F_WEIGHT_DEC is set then the flag is zeroed; slow start does not come into effect for weight decreases. If IP_VS_DEST_F_WEIGHT_INC is set then a handicap is calculated. The flag is then zeroed. The handicap is stored in dest.slow_start_data, along with a scaling factor to allow gradual decay, which is stored in dest.slow_start_data2. The handicap effectively makes the realserver appear to have more connections than it does, thus decreasing the number of connections that the wlc scheduler will allocate to it.
This handicap is decayed over time. Limited debugging information is available by setting This will show the size of the handicap when it is calculated and show a message when the handicap is fully decayed.
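To make the handicap idea concrete, here is a toy model of a least-connection pick with a decaying handicap. It is not the code from the patch: the handicap formula (the average connection count of the existing servers), the decay factor and all names are assumptions chosen only to show the effect.

```python
class Dest:
    """Toy realserver entry; 'handicap' plays the role of slow_start_data."""

    def __init__(self, name, weight, conns=0):
        self.name = name
        self.weight = weight
        self.conns = conns
        self.handicap = 0.0


def start_slow(new, others):
    # Assumed formula: pretend the newcomer already carries the average
    # number of connections of the servers that were there before it.
    if others:
        new.handicap = sum(d.conns for d in others) / len(others)


def decay(dest, factor=0.5):
    # Called periodically; the handicap shrinks until it disappears.
    dest.handicap *= factor
    if dest.handicap < 1:
        dest.handicap = 0.0


def wlc_pick(dests):
    # Least (connections + handicap) per unit of weight wins; ties go to
    # the server listed first.
    candidates = [d for d in dests if d.weight > 0]
    return min(candidates, key=lambda d: (d.conns + d.handicap) / d.weight)
```

Without the handicap a freshly added server with 0 connections wins every pick; with it, the newcomer looks as busy as its peers at first and only gradually becomes the preferred choice as the handicap decays.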

Handling kernel version dependent files, e.g. System.map and ipvsadm

If you boot with several different versions of the kernel (particularly switching between 2.2.x and 2.4.x), and you have executables or directories with contents that need to match the kernel version (e.g. System.map, ipvsadm, /usr/src/linux, /usr/src/ipvs), then you need some mechanism for making sure that the appropriate executable or directory is brought into scope.

Note: klogd is supposed to read files like /boot/System.map-<kernel_version>, allowing you to have several kernels in / (or /boot). However this doesn't solve the problem for general executables like ipvsadm.

If you have the wrong version of System.map, you'll get errors when running some commands (e.g. `ps` or `top`). If ip_vs and ipvsadm don't match, then ipvsadm will give invalid numbers for IPs and ports, or it will tell you that you don't have ip_vs installed.

As with most problems in computing, this can be solved with an extra layer of indirection. I name my kernel versions in /usr/src like

linux-1.0.7-2.2.19-module/
drwxr-xr-x 15 root root 4096 Jun 21 2001 linux-1.0.7-2.2.19-kernel/
drwxr-xr-x 15 root root 4096 Aug 8 2001 linux-1.0.7-2.2.19-module/
lrwxrwxrwx 1 root root 18 Oct 21 2001 linux -> linux-1.0.7-2.2.19

Here I have two versions of ip_vs-1.0.7 for the 2.2.19 kernel, one built as a kernel module and the other built into the kernel (you will probably only have one version of ip_vs for any kernel). I select the one I want to use by making a link from linux-1.0.7-2.2.19 (I do this by hand).

If you do this for each kernel version, then the /usr/src directory will have several links

linux-1.0.7-2.2.19-module/
lrwxrwxrwx 1 root root 38 Sep 18 2001 linux-0.9.2-2.4.7 -> linux-0.9.2-2.4.7-module-hidden-shared/
lrwxrwxrwx 1 root root 39 Sep 18 2001 linux-0.9.3-2.4.9 -> linux-0.9.3-2.4.9-module-forward-shared/
lrwxrwxrwx 1 root root 17 Sep 19 2001 linux-2.4.9 -> linux-0.9.3-2.4.9/
lrwxrwxrwx 1 root root 40 Oct 11 2001 linux-0.9.4-2.4.12 -> linux-0.9.4-2.4.12-module-forward-shared/
lrwxrwxrwx 1 root root 18 Oct 21 2001 linux -> linux-0.9.4-2.4.12/

The last entry, the link from linux, is made by rc.system_map (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/rc.system_map). At boot time rc.system_map checks the kernel version (here booted with 2.4.12) and links linux to linux-0.9.4-2.4.12. If you label ipvsadm, /usr/src/ip_vs and System.map in a similar way, then rc.system_map will link the correct versions for you. ipvsadm versions match ipvs versions and not kernel versions, but kernel versions are close enough that this scheme works.
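The linking step can be sketched as follows. This is not the rc.system_map script itself (which is a shell rc file); it is a small illustration of the same idea, and the helper names and the "name ends in -<kernel version>" matching rule are assumptions based on the naming scheme shown above.

```python
import os


def pick_tree(kernel_release, names):
    """Return the entry whose name ends in '-<kernel_release>', or None.

    If several entries match, the lexically last one is returned; a real
    script would need a better tie-break rule.
    """
    suffix = "-" + kernel_release
    matches = sorted(n for n in names if n.rstrip("/").endswith(suffix))
    return matches[-1] if matches else None


def relink(src_dir, kernel_release, link_name="linux"):
    """Point src_dir/link_name at the tree matching the running kernel."""
    target = pick_tree(kernel_release, os.listdir(src_dir))
    if target is None:
        return None
    link = os.path.join(src_dir, link_name)
    if os.path.islink(link):
        os.remove(link)          # replace an old link, never a real directory
    os.symlink(target, link)     # raises if link_name exists and is not a link
    return target
```

At boot time something like `relink("/usr/src", os.uname().release)` would be run as root; the same function could be pointed at directories holding versioned copies of ipvsadm or System.map.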

Limiting number of clients connecting to LVS

This comes up occasionally, and Ratz has developed scheduler code that will handle overloads for realservers (see and discussion of schedulers with memthresh in ). The idea is that after a certain number of connections, the client is sent to an overload machine with a page saying "hold on, we'll be back in a sec". This is useful if you have an SLA saying that no client connect request will be refused, but you have to handle 100,000 people trying to buy Rolling Stones concert tickets from your site, who all connect within 30 secs of the opening time. Ratz' code may even be in the LVS code sometime. Until it is, ask Ratz about it.

NewsFlash: Ratz' code is out

Roberto Nibali ratz (at) tac (dot) ch 23 Oct 2003

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=105914647925320&w=2

Horms

The LVS 1.1.x code for the 2.6 kernel allows you to set connection limits using ipvsadm. This is documented in the ipvsadm man page that comes with the 1.1.x code. The limits are currently not available in the 1.0.x code for the 2.4 kernel. However I suspect that a backport would not be that difficult.

Diego Woitasen, Oct 22, 2003

but what sets the IP_VS_DEST_F_OVERLOAD in struct ip_vs_dest?

Horms 23 Oct 2003

This relates to LVS's internal handling of connection thresholds for realservers, which is available in the 1.1.x tree for the 2.6.x kernel (also see ). IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS code when the high and low thresholds are passed for a realserver. If a scheduler honours this flag then it should not allocate new connections to realservers with this flag set. As far as I can see all the supplied schedulers honour this flag. But if a scheduler did not, then the flag would just be ignored. That is, realservers would have new connections allocated regardless of whether IP_VS_DEST_F_OVERLOAD is set or not. It would be as if no connection thresholds had been set.

Note that if persistence is in use then subsequent connections to the same realserver for a given client within the persistence timeout are not scheduled as such. Thus additional connections of this nature can be allocated to a realserver even if it has been marked IP_VS_DEST_F_OVERLOAD. This, IMHO, is a desirable behaviour.
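The set/unset behaviour of the overload flag can be pictured as a simple hysteresis between an upper and a lower threshold. The sketch below is a simplified model written for this HOWTO, not the kernel code; the class and function names are invented, and the real code differs in detail (for instance in what it does when only an upper threshold is set).

```python
class LimitedDest:
    """Toy realserver with connection thresholds (0 means 'not set')."""

    def __init__(self, upper=0, lower=0):
        self.upper = upper
        self.lower = lower
        self.conns = 0            # active + inactive connections
        self.overloaded = False   # stands in for IP_VS_DEST_F_OVERLOAD


def conn_opened(dest):
    dest.conns += 1
    # Passing the upper threshold marks the server as overloaded.
    if dest.upper and dest.conns >= dest.upper:
        dest.overloaded = True


def conn_closed(dest):
    dest.conns -= 1
    # Assumed rule: release below the lower threshold, or below the
    # upper one if no lower threshold was configured.
    release_below = dest.lower if dest.lower else dest.upper
    if dest.overloaded and dest.conns < release_below:
        dest.overloaded = False


def may_schedule(dest):
    """What a scheduler honouring the flag would check before picking dest."""
    return not dest.overloaded
```

With upper=3 and lower=2 the server stops receiving new connections at 3 and is only offered new ones again once it has dropped to 1, which avoids flapping around a single limit.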

ok, I saw that IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS, but I can't find where the thresholds are set. As far as I can see, these thresholds are always set to zero in userspace. Is this right?

No. In the kernel the thresholds are set by code in ip_vs_ctl.c (I guess, as that is where all other configuration from user-space is handled). If you get the version of ipvsadm that comes with the LVS source that supports IP_VS_DEST_F_OVERLOAD, then it has command line options to set the thresholds.

Steve Hill

A number of the schedulers seem to use an is_overloaded() function that limits the number of connections to twice the server's weight.

Ratz 03 Aug 2004

For the sake of discussion I'll be referring to the 2.4.x kernel. We would be talking about this piece of jewelry:

static inline int is_overloaded(struct ip_vs_dest *dest)
{
        if (atomic_read(&dest->activeconns) > atomic_read(&dest->weight)*2) {
                return 1;
        }
        return 0;
}

I'm a bit unsure about the semantics of this is_overloaded regarding its mathematical background. Wensong, what was the reason for arbitrarily using twice the weight as the limit on activeconns in the overload criterion? dest->activeconns has such a short life span that it hardly represents or reflects the current RS load in any way I could imagine. 2.4.x and 2.6.x differ in when they consider a destination to be overloaded. IP_VS_DEST_F_OVERLOAD is set when ip_vs_dest_totalconns exceeds the upper threshold limit, and totalconns means currently active + inactive connections, which is also kind of unfortunate. And yes, there is some more code I haven't mentioned yet.

I'm using the dh scheduler to balance across 3 machines - once the connections exceed twice the weight, it refuses new connections from IP addresses that aren't currently persistent.

Which kernel? And 2.4.x and 2.6.x contain similar (although not sync'd ... sigh) code regarding this feature:

2.4.24 sorry

grep -r is_overloaded * net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest) net/ipv4/ipvs/ip_vs_sh.c: || is_overloaded(dest)) { net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc) net/ipv4/ipvs/ip_vs_lblcr.c: if (!dest || is_overloaded(dest, svc)) { net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest) net/ipv4/ipvs/ip_vs_dh.c: || is_overloaded(dest)) { net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc) net/ipv4/ipvs/ip_vs_lblc.c: || is_overloaded(dest, svc)) { ratz@webphish:/usr/src/linux-2.6.8-rc2> ratz@webphish:/usr/src/linux-2.4.27-rc4> grep -r is_overloaded * net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest) net/ipv4/ipvs/ip_vs_sh.c: || is_overloaded(dest)) { net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc) net/ipv4/ipvs/ip_vs_lblcr.c: if (!dest || is_overloaded(dest, svc)) { net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc) net/ipv4/ipvs/ip_vs_lblc.c: || is_overloaded(dest, svc)) { net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest) net/ipv4/ipvs/ip_vs_dh.c: || is_overloaded(dest)) { ]]> Assymetric coding :-)

This in itself isn't really a problem, but I can't find this behaviour actually documented anywhere - all the documentation refers to the weights as being "relative to the other hosts" which means there should be no difference between me setting the weights on all hosts to 5 or setting them all to 500. Multiplying the weight by 2 seems very arbitrary, although in itself there is no real problem (as far as I can tell) with limiting the connections like that so long as it's documented.

This is correct. I'm a bit unsure as to what your exact problem is, but a kernel version would already help, although I believe you're using a 2.4.x kernel. Normally the is_overloaded() function was designed to be used by the threshold limitation feature only, which is only present as a shaky backport from 2.6.x. I don't quite understand the is_overloaded() function in the ip_vs_dh scheduler; OTOH, I really haven't been using it so far.

I have 3 squid servers and had set them all to a weight of 5 (since they are all identical machines and the docs said that the weights are relative to each other). What I found was that once there were >10 concurrent connections, any new host that tried to make a connection (i.e. any host that isn't "persisting") would have its connection rejected outright. After some reading through the code I discovered the is_overloaded condition, which was failing in the case of > 10 connections, and so I have increased all the weights to 5000 (to all intents and purposes unlimited), which has solved the problem. Oddly there is another LVS server with a similar configuration which isn't showing this behaviour, but I cannot find any significant difference in the configuration to account for it. The primary use for LVS in this case is failover in the event of one of the servers failing, although load balancing is a good side effect. I'm using ldirectord to monitor the realservers and adjust the LVS settings in response to an outage. At the moment, for some reason, it doesn't seem to be doing any load balancing (something I am looking into); it is just using a single server, although if that server is taken down it does fail over correctly to one of the other servers. Sorry, I've just realised I've been exceptionally stupid about this bit - I should be using the SH scheduler instead of DH.
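The effect Steve describes follows directly from the test quoted by Ratz (activeconns > weight*2). A tiny model of that comparison, with plain integers standing in for the kernel's atomic counters, shows why weight 5 starts refusing at the 11th active connection while weight 5000 is unlimited for practical purposes:

```python
def is_overloaded(activeconns, weight):
    """Same comparison as the 2.4 dh/sh helper, on plain integers."""
    return activeconns > weight * 2


def max_active_before_overload(weight):
    """Largest activeconns value that is still accepted."""
    return weight * 2
```

So with this check in place the weight is not purely relative: all servers at 5 and all servers at 500 balance the same way, but they hit the overload cut-off at 10 and 1000 active connections respectively.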

One problem is that activeconns doesn't define connections, and the implementations for 2.4 and 2.6 differ significantly. Also is_overloaded should be reserved for another purpose, the threshold limitation feature.

Joe: now back into prehistory - Milind Patil mpatil (at) iqs (dot) co (dot) in 24 Sep 2001

I want to limit the number of users accessing the LVS services at any given time. How can I do it?

Julian

for non-NAT cluster (maybe stupid but interesting)
Maybe an array of policers, for example 1024 policers or a user-defined value, a power of 2. Each client hits one of the policers based on its IP/port. This is mostly a job for QoS ingress, even for the distributed attack, but maybe something can be done for LVS? Maybe we had better develop a QoS ingress module? The key could be derived from CIP and CPORT, maybe something similar to SFQ but without queueing. It could be implemented as a patch to the normal policer but with one argument: the real number of policers. Then this extended policer can look into the TCP/UDP packets to redirect each packet to one of the real policers.

for NAT only
Run SFQ qdisc on your external interface(s). It seems this is not a solution for the DR method. Of course, one can run SFQ on one's uplink router.

Linux 2.4 only
iptables has support to limit the traffic but I'm not sure whether it is useful for your requirements. I assume you want to set a limit on each one of these 1024 aggregated flows.

Wenzhuo Zhang

Is anybody actually using the ingress policer for anti-DoS? I tried it several days ago using the script in the iproute2 package: iproute2/examples/SYN-DoS.rate.limit. I've tested it against different 2.2 kernels (2.2.19-7.0.8 (the redhat kernel), 2.2.19, 2.2.20preX, with all QoS-related functions either compiled into the kernel or as modules) and different versions of iproute2. In all cases, tc fails to install the ingress qdisc policer:

Julian

For 2.2, you need the ds-8 package, at Package for Differentiated Services on Linux. Compile tc by setting TC_CONFIG_DIFFSERV=y in Config. The right command is:

Ratz

The 2.2.x version is not supported anymore. The advanced routing documentation says to only use 2.4. For 2.4, ingress is in the kernel but it is still unusable for more than one device (look in linux-netdev for reference).

James Bourne james (at) sublime (dot) com (dot) au 25 Jul 2003
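Julian's "array of policers" idea from the thread above can be modelled in user space as follows. This is only a sketch to show the data structure: each policer is a plain token bucket, the hash is CRC32 over "CIP:CPORT", and every name, rate and burst value is an assumption made for the example. It is not a kernel or tc implementation.

```python
import zlib


class Policer:
    """Token bucket: 'rate' tokens per second, at most 'burst' stored."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = float(burst)
        self.stamp = 0.0

    def allow(self, now):
        # Refill according to elapsed time, then try to spend one token.
        self.tokens = min(self.burst, self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PolicerArray:
    """Fixed, power-of-2 sized array of policers indexed by a client hash."""

    def __init__(self, count=1024, rate=10.0, burst=20):
        if count <= 0 or count & (count - 1):
            raise ValueError("count must be a power of 2")
        self.mask = count - 1
        self.policers = [Policer(rate, burst) for _ in range(count)]

    def index(self, cip, cport):
        # Key derived from client IP and port, as suggested in the thread.
        return zlib.crc32(f"{cip}:{cport}".encode()) & self.mask

    def allow(self, cip, cport, now):
        return self.policers[self.index(cip, cport)].allow(now)
```

Because the array has a fixed size, clients that hash to the same slot share one bucket, which is the trade-off the SFQ comparison hints at: bounded memory in exchange for occasional unfairness between colliding clients.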

I was after some samples or practical suggestion in regard to Rate Limiting and Dynamically Denying Services to abusers on a per VIP basis. I have had a look at the sections in the HOWTO on "Limiting number of clients connecting to LVS" and http://www.linuxvirtualserver.org/docs/defense.html.

Ratz This is a defense mechanism which is always unfair. You don't want that from what I can read.

Specifically, we are running web based competition entries (e.g. type in your three bar codes) out of our LVS cluster and want to limit those who might construct "bots" to auto-enter. The application is structured so that you have to click through multiple pages and enter a value that is represented in a dynamically generated PNG. I would like to:

1. rate limit on each VIP (we can potentially do this at the firewall)
2. ban a source ip if it goes beyond a certain number of "requests-per-time-interval"
3. dynamically take a vip offline if it goes beyond a certain number of "requests-per-time-interval"
4. toss all "illegal requests" - e.g. codered, nimda etc.

Perhaps a combination of iptables, QoS, SNORT etc. would do the job??

Roberto Nibali 25 Jul 2003

1. rate limit on each VIP (we can potentially do this at the firewall)

Hmm, you might need to use QoS or probably better would be to write a scheduler which uses the rate estimator in IPVS.

2. ban a source ip if it goes beyond a certain number "requests-per-time-interval"

A scheduler could do that for you, although I do not think this is a good idea.

3. dynamically take a vip offline if it goes beyond a certain number of "requests-per-time-interval"

Quiescing the service should be enough. You don't want to put a penalty on other people, you simply want to keep to your maximum request-per-time rate.

4. toss all "illegal requests" - eg. codered, nimda etc.

This has nothing to do with LVS ;). QoS is certainly suitable for 1). For 2) and 3) I think you would need to write a scheduler. Max Sturtz

I know that iptables can block connections if they exceed a specified number of connections per second (from anywhere). The question is, is anybody doing this on a per-client basis, so that if any particular IP is sending us more than a specified number of connections per second, they get blocked but all other clients can keep going?

ratz 01 Dec 2003

Using iptables is a very bad approach to handling such problems. You have no information on whether the IP which is making those request attempts at a high rate is malicious or friendly. If it's malicious (IP spoofing) you will block an existing friendly IP.

several times per week we experience a traffic storm. LVS handles it just fine, but the web-servers get loaded up really bad, and pretty soon our site is all but un-usable. We're looking for tools we could use to analyze this (we use Webalizer for our web-logs-- but it can't tell us who's talking to us in any given time-frame...)

Could you describe your overloaded system with some metrics, or could you determine the upper connection threshold where your RS are still working fine? You could dump the LVS masquerading table from time to time and grep for connection templates. I see 4 approaches (in no particular order) to this problem:

- LVS tcp defense mechanism, best described in http://www.linux-vs.org/docs/defense.html
- L7 load balancer which inspects HTTP content, best described in http://www.linux-vs.org/software/ktcpvs/ktcpvs.html or in the package Readme.
- Use of the per RS threshold limitation patch I wrote (see ).
- Use the feedbackd (http://www.redfishsoftware.com.au/projects/feedbackd/) architecture to signal the director of network anomalies based on certain metrics gathered on the RSs.

Who is connecting to my LVS?

On the realservers you can look with `netstat -an`. With LVS, the director also has information.

malalon (at) poczta (dot) onet (dot) pl 18 Oct 2001 How do I know who is connecting to my LVS?

Julian Linux 2.2: netstat -Mn (or /proc/net/ip_masquerade) Linux 2.4: ipvsadm -Lcn (or /proc/net/ip_vs_conn)
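To turn the connection listing into an answer to "who is talking to us right now", the output of `ipvsadm -Lcn` can be fed through a small counter. The sketch below assumes the usual column layout (pro, expire, state, source, virtual, destination); the function names are made up, and the addresses used in any test data are illustrative only.

```python
from collections import Counter


def clients_from_lcn(text):
    """Count connection entries per client IP in `ipvsadm -Lcn` output."""
    counts = Counter()
    for line in text.splitlines():
        fields = line.split()
        # Data lines look like: TCP 14:58 ESTABLISHED CIP:port VIP:port RIP:port
        # The two header lines do not start with a protocol name and are skipped.
        if len(fields) >= 6 and fields[0] in ("TCP", "UDP"):
            cip = fields[3].rsplit(":", 1)[0]
            counts[cip] += 1
    return counts


def top_talkers(text, n=10):
    """The n client IPs with the most entries, busiest first."""
    return clients_from_lcn(text).most_common(n)
```

On a director this could be driven by something like `subprocess.run(["ipvsadm", "-Lcn"], capture_output=True, text=True).stdout`, run as root from cron during a traffic storm.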

experimental scheduling code This section is a bit out of date now. See the new schedulers by Thomas Prouell for web caches and by Henrik Norstrom for firewalls. Ratz ratz (at) tac (dot) ch has produced a scheduler which will keep activity on a particular realserver below a fixed level. For this next code write to Ty or grab the code off the list server

Ty Beede tybeede (at) metrolist (dot) net 23 Feb 2000

This is a hack to the ip_vs_wlc.c scheduling algorithm. It is currently implemented in a quick, ad hoc fashion. Its purpose is to support limiting the total number of connections to a realserver. Currently it is implemented using the weight value as the upper limit on the number of activeconns (connections in an established TCP state). This is a very simple implementation and only took a few minutes after reading through the source. I would like, however, to develop it further. Due to its simple nature it will not function in several types of environments: those based on connectionless protocols (UDP uses the inactconns variable to keep track of things; simply change the activeconns variable in the weight check to inactconns for UDP), and it may cause complications when persistence is in use. The current algorithm simply checks that weight > activeconns before including a server in the standard wlc scheduling. This works for my environment, but could be changed to perhaps (weight * 50) > (activeconns * 50) + inactconns to include the inactconns but make the activeconns more important in the decision. Currently the greatest weight value a user may specify is approximately 65000, independent of this modification. As long as the user, most importantly, keeps the weight values correct for the total number of connections and in proportion to one another, things should function as expected. In the event that the cluster is full (all realservers have maxed out), overflow control might be necessary, or the client's end will hang. I haven't tested this idea but it could simply be implemented by specifying the overflow server last, after the realservers, using the ipvsadm tool. This will work because as each realserver is added using ipvsadm it is put on a list, with the last one added being last on the list. The scheduling algorithm traverses this list linearly from start to finish, and if it finds that all servers are maxed out, then the last one will be the overflow and that will be the only one to send traffic to. Anyway this is just a little hack; read the code and it should make sense. It has been included as an attachment. If you would like to test this, simply replace the old ip_vs_wlc.c scheduling file in /usr/src/linux/net/ipv4 with this one. Compile it in and set the weight on the realservers to the max number of connections in an established TCP state, or modify the source to your liking.

From: Ty Beede tybeede (at) metrolist (dot) net 28 Feb 2000

I wrote a little patch and posted it a few days ago... I indicated that overflow might be accomplished by adding the overflow server to the lvs last. This statement is completely off the wall wrong. I'm not really sure why I thought that would work but it won't. First of all, the linked list adds each new instance of a realserver to the start of the realservers list, not the end like I thought. Also it would be impossible to distinguish the overflow server from the realservers in the case that not all the realservers were busy. I don't know where I got that idea from but I'm going to blame it on my "bushy eyed youth". In response to needing overflow support I'm thinking about implementing "priority groups" in the lvs code. This would logically group the realservers into different groups, and the servers in a higher priority group would fill up before those in a lower priority group. If anybody could comment on this it would be nice to hear what the rest of you think about overflow code.
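The check Ty describes (only servers with weight > activeconns take part in the normal wlc choice) is easy to model outside the kernel. The sketch below is not his patch; it uses plain dictionaries with invented keys, and the wlc cost (activeconns*50 + inactconns, divided by weight) is taken from the expression mentioned in his post.

```python
def schedule(servers):
    """Pick a server name, or None if every server is at its limit.

    servers: list of dicts with (hypothetical) keys
             name, weight, activeconns, inactconns.
    """
    best = None
    best_load = None
    for s in servers:
        # The hack: weight doubles as the cap on established connections.
        if s["weight"] <= s["activeconns"]:
            continue
        # Usual wlc idea: lowest connection cost per unit of weight wins.
        load = (s["activeconns"] * 50 + s["inactconns"]) / s["weight"]
        if best is None or load < best_load:
            best, best_load = s, load
    return best["name"] if best else None
```

Returning None corresponds to the "cluster is full" case in the post, where some separate overflow mechanism would have to take over, since, as the follow-up explains, list order cannot be used to mark an overflow server.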

Theoretical issues in developing better scheduling algorithms Julian It seems to me it would be useful in some cases to use the total number of connections to a realserver in the load balancing calculation, in the case where the realserver participates in servicing a number of different VIPs.

Wensong Yeah, it is true. Sometimes we need a tradeoff between simplicity/performance and functionality. Let me think more about this, and probably maximum connection scheduling together too. For a rather big server cluster, there may be a dedicated load balancer for web traffic and another load balancer for mail traffic; then the two load balancers may need to exchange status periodically, which is rather complicated.

Yes, if a realserver is used from two or more directors the "lc" method is useless.

Actually, I just thought that dynamic weight adaption according to periodical load feedback of each server might solve all the above problems.

Joe - this is part of a greater problem with LVS: we don't have good monitoring tools and we don't have a lot of information on the varying loads that realservers have, in order to develop strategies for informed load regulation. See load and failure monitoring.

Julian From my experience with realservers for web, the only useful parameters for the realserver load are:

- cpu idle time: If you use realservers with equal CPUs (MHz), the cpu idle time in percent can be used. In other cases the MHz must be included in an expression for the weight.
- free ram: According to the web load, the right expression must be used, including the cpu idle time and the free ram.
- free swap: Very bad if the web server is swapping.

The easiest parameter to get, the Load Average, is always <5, so it can't be used for weights in this case. Maybe for SMTP? The sendmail guys use only the load average in sendmail when evaluating the load :)

So, the monitoring software must send these parameters to all directors. But even now each of the directors uses these weights to create connections proportionally. So it is useful for these load parameters to be updated at short intervals, and they must be averaged over this period. It is very bad to use the current value of a parameter to evaluate the weight in the director. For example, it is very useful to use something like "average value of the cpu idle time for the last 10 seconds" and to broadcast this value to the director every 10 seconds. If the cpu idle time is 0, the free ram must be used. It depends on which resource zeroed first: the cpu idle time or the free ram. The weight must be changed slightly :)

The "*lc" algorithms help for simple setups, e.g. with one director, and for some of the services, e.g. http, https. It is difficult even for ftp and smtp to use these schedulers. When the requests are very different, the only valid information is the load in the realserver. Another useful parameter is the network traffic (ftp). But again, all these parameters must be used by the director to build the weight using a complex expression. I think the complex weight for the realserver based on connection number (lc) is not useful due to the different load from each of the services. Maybe for the "wlc" scheduling method?

I know that the users want LVS to do everything, but load balancing is a very complex job. If you handle web traffic you can be happy with any of the current scheduling methods. I haven't tried to balance ftp traffic, but I don't expect much help from *lc methods. The realserver can be loaded, for example, if you build a new Linux kernel while the server is in the cluster :) This is a very easy way to switch to swap mode if your load is near 100%.
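Julian's suggestion of averaging a load parameter over a short window before turning it into a weight could look roughly like this. It is a sketch under stated assumptions: `LoadWindow` and `weight_from_load()` are hypothetical names, the window holds one sample per second, and the weight expression (a simple percentage scaled to `max_weight`, falling back to free RAM when the CPU idle time is 0) is an invented placeholder for the "complex expression" he mentions.

```python
from collections import deque
from typing import Deque, Optional


class LoadWindow:
    """Keep the last `size` samples of a load parameter (e.g. cpu idle %)."""

    def __init__(self, size: int = 10) -> None:
        self.samples: Deque[float] = deque(maxlen=size)

    def add(self, value: float) -> None:
        self.samples.append(value)

    def average(self) -> Optional[float]:
        if not self.samples:
            return None
        return sum(self.samples) / len(self.samples)


def weight_from_load(avg_cpu_idle_pct: float, avg_free_ram_pct: float,
                     max_weight: int = 100) -> int:
    """Hypothetical weight expression: use the averaged cpu idle time,
    or the free ram if the cpu idle time has dropped to 0."""
    resource = avg_cpu_idle_pct if avg_cpu_idle_pct > 0 else avg_free_ram_pct
    resource = min(max(resource, 0.0), 100.0)
    # Never return 0 here, so a busy realserver is de-emphasised, not quiesced.
    return max(1, round(max_weight * resource / 100.0))
```

A monitoring agent on each realserver would call `add()` once per sample and send `weight_from_load(...)` to every director once per window, rather than reporting instantaneous values.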

Ratz's primer on writing your own scheduler Roberto Nibali ratz (at) tac (dot) ch 10 Jul 2003

The whole setup roughly works as follows: each scheduler {rr,lc,...} has to register itself by initialising an ip_vs_scheduler struct object. As you can see it contains, among other data types, 4 function pointers:

Each scheduler will need to provide a callback function for those prototypes with its own specific implementation. Let's have a look at ip_vs_wrr.c: We start with the __init function, which is kernel specific. It defines ip_vs_wrr_init(), which in turn calls the required register_ip_vs_scheduler(&ip_vs_wrr_scheduler). You can see the ip_vs_wrr_scheduler structure definition just above the __init function. There you will note the following:

This now is exactly the scheduler-specific object instantiation of the struct ip_vs_scheduler prototype defined in ip_vs.h. Reading this you can see that the last four "names" map the function names to be called accordingly.

So in the case of the wrr scheduler, what does init_service (mapped to the ip_vs_wrr_init_svc function) do? It generates a mark object (used for list chain traversal and mark point) which gets filled up with initial values, such as the maximum weight and the gcd weight. This is a very intelligent thing to do, because if you do not do this, you will need to compute those values every time the scheduler needs to schedule a new incoming request.

The latter also requires a second callback. Why? Imagine someone decides to update the weights of one or more servers from user space. This would mean that the initially computed weights are not valid anymore. What can be done about it? We could compute those values every time the scheduler needs to schedule a destination, but that's exactly what we don't want. So the update_service prototype (mapped to the ip_vs_wrr_update_svc function) comes into play. As you can easily see, the ip_vs_wrr_update_svc function will do part of what we did for init_service: it will compute the new max weight and the new gcd weight, so the world is saved again. The update_service callback will be called upon a user space ioctl call (you can read about this in the previous chapter of this marvellous developer guide :)).

The ip_vs_wrr_schedule function provides us with the core functionality of finding an appropriate destination (realserver) when a new incoming connection is hitting our cluster. Here you could write your own algorithm. You only need to return either NULL (if no realserver can be found) or a destination of type struct ip_vs_dest.

The last function callback is the ip_vs_wrr_done_svc function, which kfree()'s the initially kmalloc()'d mark variable.

This short tour-de-scheduler should give you enough information to write your own scheduler, at least in theory :).

unknown
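The four callbacks described above can be mirrored in a small user-space model. This is a sketch in Python, not the kernel code in ip_vs_wrr.c: the `Dest` and `WrrScheduler` classes are hypothetical, only the callback names (init_service, update_service, schedule, done_service) come from the text, and the selection loop is the commonly published weighted round-robin algorithm, assumed here to match what the kernel does.

```python
from dataclasses import dataclass
from functools import reduce
from math import gcd
from typing import List, Optional


@dataclass
class Dest:
    name: str
    weight: int


class WrrScheduler:
    def init_service(self, dests: List[Dest]) -> None:
        # Equivalent of allocating the "mark": traversal position plus cached values.
        self.dests = dests
        self.index = -1
        self.current_weight = 0
        self.update_service()

    def update_service(self) -> None:
        # Recompute the cached max and gcd once, when weights change,
        # instead of on every scheduling decision.
        weights = [d.weight for d in self.dests]
        self.max_weight = max(weights, default=0)
        self.gcd_weight = reduce(gcd, weights, 0)

    def schedule(self) -> Optional[Dest]:
        if self.max_weight == 0:
            return None  # no usable realserver
        while True:
            self.index = (self.index + 1) % len(self.dests)
            if self.index == 0:
                self.current_weight -= self.gcd_weight
                if self.current_weight <= 0:
                    self.current_weight = self.max_weight
            dest = self.dests[self.index]
            if dest.weight >= self.current_weight:
                return dest

    def done_service(self) -> None:
        # Equivalent of freeing the mark.
        self.dests = []
        self.max_weight = 0
        self.gcd_weight = 0
```

With weights 4, 3 and 2 this model hands out destinations in the order A A B A B C A B C before repeating; after a weight change from "user space", calling `update_service()` is what keeps the cached values valid.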

I'd like to write a user defined scheduler to guide the load dispatching

Ratz 12 Aug 2004 Check out feedbackd and see if you miss something there. I know that this is not what you wanted to hear, but providing a generic API for user space daemons to interact directly with a generic scheduler is definitely out of scope. One problem is that the process of balancing incoming network load is not an atomic operation. It can take minutes, hours, days or weeks until you get an equalised load on your servers. Having a user space daemon doing premature scheduler updates at short time intervals only asks for trouble regarding peak load bouncing.

changing ip_vs behaviour with sysctl flags in /proc You can change the behaviour of ip_vs by pushing bits in the /proc filesystem. This gives finer control of ip_vs than is available with ipvsadm. For ordinary use, you don't need to worry about the sysctls, since sensible defaults have been installed. There's a list of the current sysctls at http://www.linuxvirtualserver.org/docs/sysctl.html . Note that older kernels will not have all of these sysctls (test for the existence of the appropriate file in /proc first). These sysctls are mainly used for . Some, but not all, of the sysctls are documented in ipvsadm(8) (Thanks to Kit Gerrits Dec 2008). There's info on ip_vs() sysctls at ipvs-sysctl.txt (http://www.mjmwired.net/kernel/Documentation/networking/ipvs-sysctl.txt).

Horms horms (at) verge (dot) net (dot) au 11 Dec 2003 I am still strongly of the opinion that the sysctl variables should be documented in the ipvsadm man page, as they are strongly tied to its behaviour. At the moment we are in a situation where some are documented in ipvsadm(8), while all are documented in sysctl.html. Yet there is no reference to sysctl.html in ipvsadm(8). My preference is to merge all the information in sysctl.html into ipvsadm(8) or perhaps a separate man page. If this is not acceptable then I would advocate removing all of the sysctl information from ipvsadm(8) and replacing it with a reference to sysctl.html. Though to be honest, why half the information on configuring LVS should be in ipvsadm(8) and the other half in sysctl.html is beyond me.

Counters in ipvsadm Rutger van Oosten r (dot) v (dot) oosten (at) benq-eu (dot) com 09 Oct 2003

When I run ipvsadm -l --stats it shows connections, packets and bytes in and out for the virtual services and for the realservers. One would expect that the traffic for the service is the sum of the traffic to the servers - but it is not, the numbers don't add up at all, whereas in ipvsadm -l --rate they do (approximately, not exactly for the bytes per second ones). For example (LVS-NAT, two realservers, one http virtual service):

      -> RemoteAddress:Port
    TCP  vip:http          4239091 31977546 29470876    3692M   26647M
      -> www02:http        3911835 29405279 26900679    3407M   24292M
      -> www01:http        3395953 25407180 23257431    2931M   20957M

    # ipvsadm -l --rate
    IP Virtual Server version 1.0.10 (size=4096)
    Prot LocalAddress:Port     CPS    InPPS   OutPPS    InBPS   OutBPS
      -> RemoteAddress:Port
    TCP  vip:http               45      348      314    41739   285599
      -> www02:http             35      252      216    30416   197101
      -> www01:http             10       96       98    11323    88497

Is this a bug, or am I just missing something?

Wensong 12 Oct 2003 It's quite possible that the conns/bytes/packets statistics of a virtual service are not the sum of the conns/bytes/packets counters of its realservers, because some realservers may have been removed permanently. The connection rate of a virtual service is the sum of the connection rates of its realservers, because it is an instantaneous metric. In the output of your ipvsadm -l --stats, the counters of the virtual service are less than the sum of the counters of its realservers. I guess that your virtual service must have been removed after it ran for a while, and then created again later. In the current implementation, if realservers are deleted, they are not removed permanently but are put in the trash, because established connections still refer to them; a server can be looked up in the trash when it is added back to a service. When a virtual service is created, it always has its counters set to zero, but the realservers can be picked up from the trash, and they keep their past counters. We probably need to zero the counters of realservers if the service is a new one. Anyway, you can do cat /proc/net/ip_vs_stats. The counters of all IPVS services are larger than or equal to the sum of the realservers' counters.

You are right - after the weekly reboot last night the numbers do add up. The realservers have been removed and added in the mean time, but the virtual services have stayed in place and the numbers are still correct. So that must be it. Mystery solved, thank you :-)

Exact Counters Guy Waugh gwaugh (at) scu (dot) edu (dot) au 2005/20/05 The ipvsadm_exact.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvsadm_exact.patch) contains a diff for the addition of an '-x' or '--exact' command-line switch to ipvsadm (version 1.19.2.2). The idea behind the new option is to allow users to specify that large numbers printed with the '--stats' or '--rate' options are shown as exact, machine-readable values, rather than in 'human-readable' form (e.g. kilobytes, megabytes). I needed this to get stats from LVS readable by nagios (http://www.nagios.org/).

Scheduling TCP/UDP/SCTP/TCP splicing/ LVS does not schedule SCTP (although people ask about it occasionally). SCTP is a connection oriented protocol (like TCP, but not like UDP). Delivery is reliable (like TCP), but packets are not necessarily delivered in order. The Linux-2.6 kernel supports SCTP (see The Linux Kernel sctp project http://lksctp.sourceforge.net/). For an article on SCTP see Better networking with SCTP (http://www-128.ibm.com/developernetworks/linux/library/l-sctp/?ca=dgr-lnwx01SCTP). One of the features of SCTP is multistreaming: control and data streams are separate streams within an association. To do the same thing with TCP, you need separate ports (e.g. ftp uses 20 for data, 21 for command) or you put both into one connection (e.g. http). If you use one port then a command will be blocked till queued data is serviced. If multiple (redundant) routes are available, failover is transparent to the application. (Thus the requirement that packets not necessarily be delivered in order.) SIP can use SCTP (I only know about SIP using UDP).

Ratz 20 Feb 2006 There is a remotely similar approach in the TCP splicing code for LVS (http://www.linuxvirtualserver.org/software/tcpsp/). It's only a small subset of SCTP.

With TCP, scheduling needs to know the number of current connections to each realserver before assigning a realserver for the next connection. The length of time for a connection can be short (retrieving a page by http) or long (an extended telnet session). For UDP there is no "connection". LVS uses a timeout (for 2.4.x kernels it is about 3 mins) and any UDP packets from a client within the timeout period will be rescheduled to the same realserver. On a short time scale (ca. timeout), there will be no load balancing of UDP services (e.g. as was found for ntp). All requests will go to the same realserver. On a long time scale (>>timeout) loadbalancing will occur.
Here's the official LVS definition of a UDP "connection" Wensong Zhang wensong (at) iinchina (dot) net 2000-04-26

For UDP datagrams, we create entries for state with the timeout of IP_MASQ_S_UDP (5*60*HZ) by default. Consequently all UDP datagrams from the same source to the same destination will be sent to the same realserver. Therefore, we call data communication between a client's socket and a server's socket a "connection", for both TCP and UDP.
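The UDP "connection" defined above behaves like a table entry that is refreshed by each datagram and expires after the timeout. A rough Python model follows; the `UdpConnTable` class is hypothetical, round-robin is assumed as the scheduler for new entries, and the 5*60 second timeout is taken from the IP_MASQ_S_UDP value quoted above.

```python
from typing import Dict, List, Tuple

UDP_TIMEOUT = 5 * 60  # seconds, from the IP_MASQ_S_UDP (5*60*HZ) default quoted above

Endpoint = Tuple[str, int]  # (address, port)


class UdpConnTable:
    def __init__(self, realservers: List[str], timeout: float = UDP_TIMEOUT) -> None:
        self.realservers = list(realservers)
        self.timeout = timeout
        self.entries: Dict[Tuple[Endpoint, Endpoint], Tuple[str, float]] = {}
        self.next_index = 0

    def forward(self, source: Endpoint, destination: Endpoint, now: float) -> str:
        """Return the realserver for a datagram seen at time `now` (seconds)."""
        key = (source, destination)
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            realserver = entry[0]  # live entry: stick to the same realserver
        else:
            # No entry, or it expired: schedule afresh (round-robin assumed).
            realserver = self.realservers[self.next_index % len(self.realservers)]
            self.next_index += 1
        # Every datagram restarts the timeout for this "connection".
        self.entries[key] = (realserver, now + self.timeout)
        return realserver
```

This also shows why a client that sends datagrams more often than the timeout never gets rebalanced: each packet pushes the expiry forward.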

Julian Anastasov 2000-07-28

For UDP we can in principle remove the implicit persistence for the UDP connections and thus select a different realserver for each packet. My thought was to implement a new feature: schedule each UDP packet to a new realserver, i.e. something like timeout=~0 for UDP as a service flag.

LVS has been tested with the following UDP services: DNS, ntp, xdmcp, radius. So far only DNS has worked well (but then DNS already works fine in a cluster setup without LVS). ntpd is already self loadbalancing and doesn't need to be run under LVS. xdmcp dies if left idle for long periods (several hours). UDP services are not commonly used with LVS and we don't yet know whether the problems are with LVS or with the service running under LVS.

Han Bing hb (at) quickhot (dot) com 29 Dec 2002

I am developing several game servers using UDP which I would like to use with LVS. LVS supports UDP "connection" persistence. Does persistence work for UDP too? For example, I have 3 games, and each game has 3 servers (9 servers in 3 groups in total). All game1 servers listen on UDP port 10000, game2 servers listen on UDP port 10001, and game3 servers listen on UDP port 10002. When the client sends a UDP datagram to game1 (to VIP:10000), can the LVS director auto-select one server from the 3 game1 servers and forward it to that server, AND keep the persistence of this "UDP connection" when it receives the following datagrams from the same CIP?

Joe: not sure what may happen here. People have had problems with LVS/udp (e.g. with ntp). These problems should go away with persistent udp, but no-one has tried it and it didn't even occur to me. The behaviour of LVS with persistent udp may be undefined for all I know. I would make sure that the setup worked with ntp before trying a game setup.

patch: machine readable error codes from ipvsadm Computers can talk to each other and read from and write to other programs. You shouldn't have to get a person to sit at the console to parse the output of a program. Here's a patch to make the output of ipvsadm machine readable.

Padraig Brady padraig (at) antefacto (dot) com 07 Nov 2001 This 1 line patch is useful for me and I don't think it will break anything. It's against ipvsadm-0.8.2 and returns a specific error code.

ipvsadm-0.8.2-returncode.diff:

    --- //ipvs-0.8.2/ipvs/ipvsadm/ipvsadm.c Fri Jun 22 16:03:08 2001
    +++ ipvsadm.c Wed Nov 7 16:29:11 2001
    @@ -938,6 +938,7 @@
         result = setsockopt(sockfd, IPPROTO_IP, op, (char *)&urule, sizeof(urule));
         if (result) {
    +        result = errno; /* return to caller */
             perror("setsockopt failed");
             /*

patch: stateless ipvsadm - add/edit patch Commands like ifconfig are idempotent, i.e. they tell the machine to assume a certain state without reference to the previous state. You can repeat the same command without errors (e.g. put IP=xxx onto eth0). Not all commands are idempotent - some require you to know the state of the machine first. ipvsadm is not idempotent: if a VIP:port entry already exists, then you will get an error on attempting to enter it. Whether you make a command idempotent or not will depend on the nature of the command.

The problem with ipvsadm is that it isn't easily scriptable and hence is awkward to use for automated control of an LVS:

- if no entry exists, you must add the entry with the -a option;
- if the entry exists, you must edit the entry with the -e option.

You will get an error if you use the wrong command. Two solutions are:

- parse the output of ipvsadm to see if the entry you are about to make is already present;
- try both commands, see which one runs, and then have the script figure out whether both the error and the non-error were valid.

This is a pain and is quite unnecessary. What is needed is a version of ipvs that accepts valid entries without giving an error. Here's the ipvs-0.9.0_add_edit.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.0_add_edit.patch) patch by Horms against ipvs-0.9.0. It modifies several ipvs files, including ipvsadm.
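Without the patch, the "try both commands" workaround can be wrapped in a script. The following Python sketch is only an illustration: `run_ipvsadm()` and `add_or_edit_realserver()` are hypothetical helper names, it assumes a TCP service (-t) with the default forwarding method, and it assumes ipvsadm exits non-zero when the chosen command does not apply.

```python
import subprocess
from typing import Callable, List


def run_ipvsadm(args: List[str]) -> int:
    """Run ipvsadm with the given arguments and return its exit status."""
    return subprocess.run(["ipvsadm", *args], capture_output=True).returncode


def add_or_edit_realserver(vip: str, rip: str, weight: int,
                           run: Callable[[List[str]], int] = run_ipvsadm) -> str:
    """Make 'realserver rip has this weight in service vip' true,
    whether or not the entry already exists."""
    common = ["-t", vip, "-r", rip, "-w", str(weight)]
    if run(["-a", *common]) == 0:   # entry did not exist: added
        return "added"
    if run(["-e", *common]) == 0:   # entry existed: edited
        return "edited"
    # Both failed: a real error (e.g. the virtual service is not defined).
    raise RuntimeError(f"ipvsadm could neither add nor edit {rip} in {vip}")
```

The `run` parameter exists so the logic can be exercised without root or a loaded ip_vs module. Note that this still cannot tell a harmless "already exists" failure of -a from other failures, which is the weakness the text complains about.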

patch: fwmark name-number translation table ipvsadm allows entry of fwmarks only as numbers. In some cases, it would be more convenient to enter/display the fwmark as a name; e.g. an e-commerce site serving multiple customers (i.e. VIPs) which links http and https by a fwmark. The output of ipvsadm would then list the fwmarks as "bills_business", "fred_inc" rather than "14", "15"... Horms has written an ipvs-0.9.5.fwmarks-file.patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.5.fwmarks-file.patch) which allows the use of a string fwmark as well as the default method of an integer fwmark, using a table in /etc that looks like the /etc/hosts table.

Horms horms (at) vergenet (dot) net Nov 14 2001

While we were at OLS in June, Joe suggested that we have a file to associate names with firewall marks. I have attached a patch that enables ipvsadm to read names for firewall marks from /etc/fwmarks. This file is intended to be analogous to /etc/hosts (or files in /etc/iproute2/). The patch to the man page explains the format more fully, but briefly the format is "fwmark name...", newline delimited.
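To make the "fwmark name..." format concrete, here is a small Python sketch of a parser for such a table. It is not taken from the patch: `parse_fwmarks()` is a hypothetical function, and treating `#` as a comment character is an assumption made by analogy with /etc/hosts.

```python
from typing import Dict, Tuple


def parse_fwmarks(text: str) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Parse lines of the form 'fwmark name...' into two lookup tables:
    name -> mark (all names) and mark -> first name (for display)."""
    by_name: Dict[str, int] = {}
    by_mark: Dict[int, str] = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()   # assumed comment syntax
        if not line:
            continue
        fields = line.split()
        mark = int(fields[0])                  # raises ValueError on a bad mark
        names = fields[1:]
        if not names:
            continue
        by_mark.setdefault(mark, names[0])
        for name in names:
            by_name[name] = mark
    return by_name, by_mark
```

A table containing `14 bills_business` and `15 fred_inc` would then let a tool accept "fred_inc" on input and print "bills_business" instead of 14 on output.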

ip_vs_conn.pl (you can also run `ipvsadm -lcn` to do the same thing)

    #!/usr/bin/perl
    # Convert the hex fields of /proc/net/ip_vs_conn to dotted-quad/decimal.
    # If given an IP address as an argument on the command line,
    # show only the lines with that IP address in them.
    #-------------------------------------------------
    use strict;
    use warnings;

    my $mask = @ARGV ? $ARGV[0] : '';

    open(my $data, '<', '/proc/net/ip_vs_conn')
        or die "cannot open /proc/net/ip_vs_conn: $!";

    my $format = "%8s %-17s %-5s %-17s %-5s %-17s %-5s %-17s %-20s\n";
    printf $format, "Protocol", "From IP", "FPort", "To IP", "TPort",
        "Dest IP", "DPort", "State", "Expires";

    while (<$data>) {
        chomp;
        my ($proto, $fromip, $fromport, $toip, $toport,
            $destip, $destport, $state, $expiry) = split;
        next unless $fromip;
        next if $proto =~ /Pro/;        # skip the header line

        $fromip   = hex2ip($fromip);
        $toip     = hex2ip($toip);
        $destip   = hex2ip($destip);
        $fromport = hex($fromport);
        $toport   = hex($toport);
        $destport = hex($destport);

        # \Q..\E: match the address literally (dots are not wildcards)
        if (!$mask || $fromip =~ /\Q$mask\E/ || $toip =~ /\Q$mask\E/
                   || $destip =~ /\Q$mask\E/) {
            printf $format, $proto, $fromip, $fromport, $toip, $toport,
                $destip, $destport, $state, $expiry;
        }
    }
    close($data);

    # "C0A80001" -> "192.168.0.1"
    sub hex2ip {
        my $input = shift;
        return join '.', map { hex(substr($input, $_, 2)) } (0, 2, 4, 6);
    }
    #---------------------------------------------------------------

Luca's php monitoring script Luca Maranzano liuk001 (at) gmail (dot) com 12 Oct 2005 I've written a simple php script luca.php to monitor the status of an LVS server. To use it, configure sudo to let the Apache user run /sbin/ipvsadm as root without a password prompt. The CSS is derived from the phpinfo() page.

Jeremy Kerr jk (at) ozlabs (dot) org 12 Oct 2005 Whoa. If you use this script with register_globals set (and assuming you've set it up so that the sudo works), you've got a remote *root* vulnerability right there, e.g. http://example.com/script.php?resolve_dns=1&dns_flag=;sudo+rm+-rf+/, which will do `rm -rf /` as root. You may want to ensure your variables are clean beforehand, and avoid the sudo completely (maybe use a helper process?)

malcolm (at) loadbalancer (dot) org Oct 12 2005 That's why PHP no longer has register globals on by default! And also why you lock down your admin ip address by source ip. My code has this vulnerability, but I'm not sure a helper app would be any more secure (sudo is a helper app.)

liuk001 (at) gmail (dot) com Oct 12 2005 Jeremy, this is a good point. I wrote it as a quick and dirty hack without security in mind. It is used on the internal net by trusted users who indeed have root access to the servers ;-) However, sudo is configured to run only /sbin/ipvsadm from the www-data user, so I think that /bin/rm could not be executed.

Graeme Fowler graeme (at) graemef (dot) net 12 Oct 2005 ...as all the relevant values are produced in /proc/net/ip_vs[_app,_conn,_stats], why not just write something to process those values instead? They're globally readable and don't need any helper apps to view them at all. Yes, you'd be re-inventing a small part of ipvsadm's functionality. The security improvements alone are worth it; the fact that the overhead of running sudo and then ipvsadm is removed by just doing an open() on a /proc file might be worth it in situations where you may have many users running your web app. Sure, you need to decode the hex values to make them "nice". Unless you have the sort of users who read hex encoding all the time :)
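Graeme's suggestion of reading the /proc files directly, with no sudo and no helper, might look like the following Python sketch. The function names are hypothetical, and the nine-column layout (Pro FromIP FPrt ToIP TPrt DestIP DPrt State Expires) is assumed from the Perl script earlier in this section; other kernels may print additional columns, which this sketch ignores.

```python
import ipaddress
from typing import Dict, List, Optional, Union

Conn = Dict[str, Union[str, int]]


def hex2ip(hex_addr: str) -> str:
    """'C0A80165' -> '192.168.1.101'"""
    return str(ipaddress.IPv4Address(int(hex_addr, 16)))


def parse_conn_line(line: str) -> Optional[Conn]:
    """Decode one line of /proc/net/ip_vs_conn; return None for the header
    or for anything that does not have the assumed nine columns."""
    fields = line.split()
    if len(fields) < 9 or fields[0] == "Pro":
        return None
    proto, fip, fport, tip, tport, dip, dport, state, expires = fields[:9]
    return {
        "proto": proto,
        "from": f"{hex2ip(fip)}:{int(fport, 16)}",
        "to": f"{hex2ip(tip)}:{int(tport, 16)}",
        "dest": f"{hex2ip(dip)}:{int(dport, 16)}",
        "state": state,
        "expires": int(expires),
    }


def read_connections(path: str = "/proc/net/ip_vs_conn") -> List[Conn]:
    with open(path) as f:
        return [conn for conn in map(parse_conn_line, f) if conn is not None]
```

Because nothing here runs a command built from user input, the injection problem Jeremy points out does not arise in this approach.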

ipvsadm set option anon

what /proc values does ipvsadm --set modify? Something in /proc/sys/net/ipv4/vs?

Ratz the current proc-fs entries are a read-only representation of what could be set regarding state timeouts. ipvsadm --set will set, for IPVS related connection entries:

- the time we spend in TCP ESTABLISHED before transition
- the time we spend in TCP FIN_WAIT before transition
- the time we spend for UDP in general

Andreas 05 Feb 2006

Where can I get the currently set values of ipvsadm --set foo bar baz?

Ratz You can't :) IP_VS_SO_GET_TIMEOUTS is not implemented in ipvsadm or I'm blind. Also the proc-fs related entries for this are not exported. I've written a patch to re-instate the proper settings in proc-fs, however this is only in 2.4.x kernels. Julian has recently proposed a very granular timeout framework, however none of us has had the time nor impulse to implement it. For our (work) customers I needed the ability to instrument all the IPVS related timeout values in DoS and non-DoS mode. The ipvsadm --set option should be obsoleted, since it only covers the timeout settings partially and there is no --get method.

I did not find a way to read them out. I grepped through /proc/sys/foo and /proc/net/foo and was not able to see the numbers I had set before. This was on kernel 2.4.30 at least.

Correct. The standard 2.4.x kernel only exports the DoS timers for some (to me at least) unknown reason. I suggest that we re-instate (I'll send a patch or a link to the patch tomorrow) these timer settings until we can come up with Julian's framework. It's imperative that we can set those transition timers, since their default values are dangerous regarding >L4 DoS. One example is HTTP/1.1 slow start if the web servers are mis-configured (wrong MaxKeepAlive and its timeout settings).

This brings me further to the question if the changes of lvs in recent 2.6 development are being backported to 2.4?

Currently I would consider it the other way 'round. 2.6.x has mostly stalled feature wise and 2.4.x is missing the crucial per RS threshold limitation feature. I've backported it and also improved it quite a bit (service pool) and so we're currently completely out of sync :). I'll try to put some more effort into working on the 2.6.x kernel, however since it's still too unstable for us, our main focus remains on the 2.4.x kernel. [And before you ask: No, we don't have the time (money wise) to invest in bug-hunting and reporting problems regarding 2.6.x kernels on high-end server machines. And in a private environment 2.6.x works really well on my laptops and other machines, so there's really not much to report ;).]

On top of that, LVS does not use the classic TCP timers from the stack, since it only forwards TCP connections. The IPVS timers are needed so we can maintain the LVS state table regarding expirations in various modes, mostly LVS-DR. Browsing through the code recently I realised that the state transition code in ip_vs_conn:vs_tcp_state() is very simple, probably too simple. If we use Julian's forward_shared patch (which I consider a great invention, BTW) one would assume that IPVS timeouts are more closely tied to the actual TCP flow. However, this is not the case because, from what I've read and understood, the IPVS state transitions are done without memory, so it's wild guessing :). I might have a closer look at this because it just seems sub-optimal. Also the notion of active versus inactive connections stemming from this simple view of TCP flow is questionable, especially the dependence and weighting of some schedulers.

So, if I set a lvs tcp timeout about 2h 12 min, lvs would never drop a tcp connection unless a client is really "unreachable":

The timeout is more bound to the connection entry in the IPVS lookup table, so we know where to forward incoming packets regarding a specific TCP flow. A TCP connection is never dropped or not dropped by LVS, only specific packets pertaining to a TCP connection.

After 2h Linux sends tcp keepalive probes several times, so there are some bytes sent through the connection.

Nope, IPVS does not work as a proxy regarding TCP connections. It's a TCP flow redirector.

lvs will (re)set the internal timer for this connection to the keepalive time I set with --set.

Kind of ... only the bits of the state transition table which are affected by the three settings. It might not be enough to keep persistency for your TCP connection.

Or does it recognize that the bytes sent are only probes without a valid answer and thus drop the connection?

There is no sending keepalive probes from the director.

Will we eventually get timeout parameters _per service_ instead of global ones?

Julian proposed the following framework: http://www.ssi.bg/~ja/tmp/tt.txt So if you want to test, the only thing you have to do is fire up your editor of choice :). OK, honestly, I don't know when this will be done because it's quite some work and most of us developers here are pretty busy with other daily activities. So unless there is a glaring issue regarding timers implemented as-is, chances are slim that this gets implemented. Of course I could fly down to Julian's place over the week-end and we could implement it together; want to sponsor it? ;).

There are a lot of TCP timers in the Linux kernel and they all have different semantic meanings. There is the TCP timeout timer for sockets related to locally initiated connections, then there is a TCP timeout for the connection tracking table, which on my desktop system for example has the following settings:

And of course we have the IPVS TCP settings, which look as follows (if they weren't disabled in the core :)):

unless you enabled tcp_defense, which changes those timers again. And then of course we have other in-kernel timers, which influence the timers mentioned above.

However, the aforementioned timers regarding packet filtering, NAPT and load balancing are meant as a means to map expected real TCP flow timeouts. Since there is no socket (as in an endpoint) involved when doing either netfilter or IPVS, you have to guess what the TCP flow in-between (where your machine is "standing") is doing, so you can continue to forward, rewrite, mangle, whatever, the flow, _without_ disturbing it. The timers are used for table mapping timeouts of TCP states. If we didn't have them, mappings would stay in the kernel forever and eventually we'd run out of memory. If we have them wrong, it might occur that a connection is aborted prematurely by our host, for example yielding those infamous ssh hangs when connecting through a packet filter.

The tcp keepalive timer setting you've mentioned, on the other hand, is per socket, and as such only has an influence on locally created or terminated sockets. A quick skim of socket(2) and socket(7) reveals:
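The per-socket nature of TCP keepalive can be illustrated with a short Python sketch. `make_keepalive_socket()` is a hypothetical helper; SO_KEEPALIVE is the standard socket option, while TCP_KEEPIDLE is only set where the platform exposes it (it is Linux-specific in this form). Nothing here touches the director: the option belongs to a socket on the client or the realserver.

```python
import socket


def make_keepalive_socket(idle_seconds: int = 7200) -> socket.socket:
    """Create a TCP socket with keepalive probes enabled.

    idle_seconds is how long the connection may sit idle before the first
    probe; 7200 (2 hours) matches the interval mentioned in the discussion.
    """
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # The per-socket idle time is not available on every platform.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_seconds)
    return sock
```

If an application behind an LVS needs its idle connections to survive, one option (an assumption, not something stated in the thread) is to set the idle time on its own sockets to less than the IPVS TCP timeout, so that probe packets refresh the director's connection entry.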

ipvsadm error messages ipvsadm's error messages are low level and give the user little indication of what they've done wrong. These error messages were written in the early days of LVS, when getting ipvsadm to work was a feat in itself. Unfortunately the messages have not been updated, and so much use of ipvsadm is now scripted that we don't run into the messages anymore. As people post error messages and what they mean, I'll put them here.

Brian Sheets bsheets (at) singlefin (dot) net 7 Jan 2007

What am I doing wrong? The syntax looks correct to me.

Graeme Fowler graeme (at) graemef (dot) net 07 Jan 2007 do you have a service defined on VIP 10.200.8.1 port 25? Make sure you're not getting your real and virtual servers mixed up. Brian

Yup, I had the real and virtuals reversed..

Joe You should be able to delete a realserver for a service that isn't declared, with only a notice rather than an error, at least in my thinking. However that battle was lost back in the early days.

ipvsadm fast update bug with smp Kees Hoekzema kees (at) tweakers (dot) net 11 Jul 2007 I'm applying weight changes to a 64bit 2-way SMP director quite rapidly (not quite twice a second, but close) and getting a frozen director, which needs a cold reset. I have found a bit more information from my debugging, and it seems that Horms already knows about it: (http://marc.info/?l=linux-netdev&m=118040107213444&w=2) I recompiled the ip_vs() modules with a bit more debug and everytime my system crashed I had the same debug output: The code after line 910 reads: usecnt) > 1) {}; ]]> Every other busy lock in the code reads: usecnt) > 1); ]]> Which basicly is the same except a cpu_relax(); At the moment I am testing my server with cpu_relax() code in the ip_vs_edit_dest function, and so far it has not crashed yet and is directing the traffic quite a bit longer than previously was possible. The only differences between this server and the old server (which didn't have any problems) are: - SMP (4 cores) vs Single core - 64 bits vs 32 bits - 2.6.21.5 vs 2.6.20.4 (but I do not see any changes in ip_vs_ctl.c) In my first mail I accused the 64/32 bits difference, but right now I'm more thinking of a SMP issue, but unfortunatly I lack the kernel hacking skills to say why, or why that cpu_relax() helps so much in the while loop. Hopefully Horms understands it better than I do ;) usecnt) > 1); - + while (atomic_read(&svc->usecnt) > 1) {}; + /* call the update_service, because server weight may be changed */ svc->scheduler->update_service(svc); ]]> Joe I know this is separate from the problem, but according to feedback and control theory you should be making adjustments on a timescale that damps the transients. What a transient is here is not obvious - the timescale of a tcpip connection, the time is takes to change the load by 10%?, 50%? I don't know, but every couple of seconds would seem to be a lot shorter than either of these two time scales. 
Do you find your setup has problems when you do adjustments on a longer timescale?
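One way to act on the damping suggestion is to move a weight only part of the way toward its target on each update, so that rapid changes are smoothed out. The following is a minimal sketch, not something from the mailing list thread: the function name damp_weight and the damping factor are hypothetical, and the ipvsadm line in the comment uses placeholder addresses.

```shell
#!/bin/sh
# damp_weight CURRENT TARGET FACTOR
# Prints a new weight that moves 1/FACTOR of the way from CURRENT to TARGET.
# FACTOR is a hypothetical tuning knob; larger values damp more strongly.
# Integer arithmetic only, so small differences (< FACTOR) leave the weight unchanged.
damp_weight() {
    current=$1
    target=$2
    factor=$3
    echo $(( current + (target - current) / factor ))
}

# Possible use on the director (placeholder VIP/RIP, run as root):
#   w=$(damp_weight 100 60 4)
#   ipvsadm -e -t VIP:80 -r RIP:80 -m -w "$w"
```

Because the step is a fraction of the remaining difference, repeated calls approach the target gradually instead of jumping to it; how large the factor should be depends on the timescale of your load, which is the open question above.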

Problems when no scheduler What happens if you have the service entered with ipvsadm -A, but have no realservers (no ipvsadm -a) to accept the forwarded packets? We haven't quite figured out what to do about this yet. Till we can get a better idea about what's going on, we're not going to do anything. Siim Poder windo (at) p6drad-teel (dot) net 07 May 2008

We've had LVS machines dying a couple of times when the service is using the wrr scheduler and keepalived pulls all real servers from behind the service IP. The symptoms are that there are a lot of (thousands, apparently one for every packet?) messages in syslog:

    ip_vs_wrr_schedule(): no available servers

After which the machine hangs. I don't recall if I've had to boot it manually or if it boots by itself. Also, I'm not sure if it is that message that is killing the machine, but the problem hasn't occurred with other schedulers (that don't print such a message). We use wrr the most though. I think we should either remove the message or ratelimit it (unless the bug is somewhere else). I tested the patch and it seems to be ok, but as I'm unable to reproduce the hanging/crashing in a test environment, I can't verify whether it actually helps.

Something close to this was added to mainline by someone already. But the problem seems to persist (just without the messages). It seems to appear with any scheduler (at least wrr, wlc and rr). However I have been unable to reproduce this by either connection rate or packet rate in a test environment. It's probably not just the missing real servers, but something relatively infrequent that gets triggered only after there are no real servers. The LVS goes down in about 5-15 minutes of missing real servers IIRC. I tried generating many connections (and played with ttl/fragmentation a little), but couldn't trigger the bug. Maybe it has to do with clients sending some ICMP messages (which would probably be rare enough)?

This still gets triggered in our live environment for high connection rate services (if the servers fail for any reason and keepalived kicks them out). We have put sorry servers into the keepalived configuration to avoid the whole LVS going down for now (sorry_server 127.0.0.1:666), so there is a workaround for us.
    ...cw == 0) {
          mark->cl = &svc->destinations;
    -     IP_VS_INFO("ip_vs_wrr_schedule(): "
    +     IP_VS_DBG_RL("ip_vs_wrr_schedule(): "
                  "no available servers\n");
          dest = NULL;
          goto out;

Horms 29 Dec 2008 We're doing nothing till we figure out what's really going on. The important problem seems to be that LVS dies sometimes. But unfortunately that can't be fixed right now, because nobody knows how to do so, despite Siim's efforts to find the cause of the problem. With regards to making wrr like the other schedulers, I'd actually be much more inclined to do the reverse: make all the other schedulers display a rate-limited warning when they don't have any real servers available. Perhaps something like the patch against 2.6.28 below. In any case, I think that the warning message and the LVS dying issues are separate, except that it seems likely that the warning message will help to lead us to the cause of the "LVS dies" bug. Joe: The patches to make all schedulers display a warning will be in kernel 2.6.29.

LVS: LVS-NAT

Introduction See also Julian's layer 4 LVS-NAT setup (http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt). LVS-NAT is based on Cisco's LocalDirector. This method was used for the first LVS. If you want to set up a test LVS, this requires no modification of the realservers and is still probably the simplest setup. In a commercial environment, the owners of servers are loath to change the configuration of a tested machine. When they want load balancing, they will clone their server and tell you to put your load balancer in front of their rack of servers. You will not be allowed near any of their servers, thank you very much. In this case you use LVS-NAT.

Ratz Wed, 15 Nov 2006 Most commercial load balancers are not set up anymore using the triangulation mode (Joe: triangulation == LVS-DR) (at least in the projects I've been involved). The load balancer is becoming more and more a router, using well-understood key technologies like VRRP and content processing.

With LVS-NAT, the incoming packets are rewritten by the director changing the dst_addr from the VIP to the address of one of the realservers and then forwarded to the realserver. The replies from the realserver are sent to the director where they are rewritten and returned to the client with the source address changed from the RIP to the VIP. Unlike the other two methods of forwarding used in an LVS (LVS-DR and LVS-Tun) the realserver only needs a functioning tcpip stack (eg a networked printer), i.e. the realserver can have any operating system and no modifications are made to the configuration of the realservers (except setting their route tables).

LVS-NAT bugs Sep 2006: Various problems have surfaced in the 2.6.x LVS-NAT code all relating to routing (netfilter) on the side of the director facing the internet. People using LVS-NAT on a director which isn't a firewall and which only has a single default gw, aren't having any problems. It seems the 2.4.x code was working correctly: Farid Sarwari had it working for IPSec at least. The source routing problem has been identified by three people, who've all submitted functionally equivalent patches. While we're delighted to have contributions from so many people, we regret that we weren't fast enough to recognise the problem and save the last two people all their work. One of the problems (we think) is that not many people are using LVS-NAT and when a weird problem is reported on the mailing list we say "well 1000's of people have been using LVS-NAT for years without this problem, this guy must not know what he's talking about". We're now taking the approach that maybe not too many people are using LVS-NAT. Here are the problems which have surfaced so far with LVS-NAT. They either have been solved or will be in a future release of LVS. Firewall incompatibility: You couldn't run a netfilter firewall on the outside of the director. This was solved by Ben North with the . These patches were taken over by Vinnie, and are now being maintained by Julian as part of the . Since the NFCT patches are benign when not being used, we hope that they will be incorporated into the ip_vs code for the kernel (when Horms gets time). Source routing: Outbound packets originating at the VIP are not injected into the routing table but are sent straight out the default gw. As a result the packets were not affected by iproute2 commands. 
This problem was found by Ken Brownfield, who submitted a patch for his relatively old kernel. Then Farid Sarwari, who couldn't get routing to work for his IPSec LVS, submitted another. Then David Black realised that Julian's NFCT patches handled the problem from the start. (see ). Horms is working on getting Julian's NFCT code into ip_vs. LVS-NAT ftp helper modules for active/passive ftp: We seem to get a disproportionate number of problems with ftp on LVS. This seems to be a combination of the small number of users, real bugs and inadequate documentation. (see ).

Example 1-NIC, 2 Network LVS-NAT (VIP and RIPs on different network) If the VIP and the RIPs are on the same network you need the Here the client is on the same network as the VIP (in a production LVS, the client will be coming in from an external network via a router). (The director can have 1 or 2 NICs - two NICs will allow higher throughput of packets, since the traffic on the realserver network will be separated from the traffic on the client network). here's the lvs.conf for this setup The VIP is the only IP known to the client. The RIPs here are on a different network to the VIP (although with only 1 NIC on the director, the VIP and the RIPs are on the same wire). In normal NAT, masquerading is the rewriting of packets originating behind the NAT box. With LVS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP) is rewritten by the director (becoming CIP->RIP). The action of the LVS director is called demasquerading. The demasqueraded packet is forwarded to the realserver. The reply packet (RIP->CIP) is generated by the realserver.

All packets sent from the LVS-NAT realserver to the client must go through the LVS-NAT director For LVS-NAT to work all packets from the realservers to the client must go through the director. Forgetting to set this up is the single most common cause of failure when setting up a LVS-NAT LVS. The original (and the simplest from the point of view of setup) way is to make the DIP (on the director) the default gw for the packets from the realserver. The documentation here all assumes you'll be using this method. (Any IP on the director will do, but in the case where you have two directors in active/backup failover, you have an IP that is moved to the active director and this is called the DIP). Any method of making the return packets go through the director will do. With the arrival of the tools, you can route packets according to any parameter in the packet header (e.g. src_addr, src_port, dest_addr..) Here's an example of ip rules on the realserver to route packets from the RIP to an IP on the director. This avoids having to route these packets via a default gw. Neil Prockter prockter (at) lse (dot) ac (dot) uk 30 Mar 2004

you can avoid using the director as the default gw by

    realserver# ... >> /etc/iproute2/rt_tables
    realserver# ip route add default via DIP table lvs
    realserver# ip rule add from RIP table lvs

For the IPs in Virtual Server via NAT (http://www.linuxvirtualserver.org/VS-NAT.html):

    ... >> /etc/iproute2/rt_tables
    ip route add default via 172.16.0.1 table lvs
    ip rule add from 172.16.0.2 table lvs

I do this with lvs and with Cisco CSS units
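The policy routing steps can be collected into a small helper. This is a sketch under stated assumptions rather than Neil's exact commands: the function name lvs_policy_route is hypothetical, and the table id 200 is an arbitrary unused number chosen for illustration (the table name lvs and the two addresses are the ones from the example). It only prints the commands, so you can inspect them before running anything as root on the realserver.

```shell
#!/bin/sh
# lvs_policy_route RIP DIP [TABLE_NAME] [TABLE_ID]
# Print (do not run) the commands that route replies from RIP via DIP
# using a separate routing table.
# TABLE_ID defaults to 200, an assumed free id; check /etc/iproute2/rt_tables first.
lvs_policy_route() {
    rip=$1
    dip=$2
    table_name=${3:-lvs}
    table_id=${4:-200}
    echo "echo $table_id $table_name >> /etc/iproute2/rt_tables"   # register the table name
    echo "ip route add default via $dip table $table_name"         # default route in that table
    echo "ip rule add from $rip table $table_name"                 # packets from RIP use the table
}

lvs_policy_route 172.16.0.2 172.16.0.1
```

Once the printed commands look right, they can be piped to sh on the realserver. Printing first keeps a typing mistake from cutting off a remote machine's routing.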

Here Neil is routing packets from RIP to 0/0 via DIP. You can be more restrictive and route packets from RIP:port (where port is the LVS'ed service) to 0/0 via DIP. Packets from RIP:other_ports can be routed via other rules. For a 2 NIC director (with different physical networks for the realservers and the clients), it is enough for the default gw of the realservers to be the director. For a 1 NIC, two network setup (where the two networks are using the same link layer), in addition, the realservers must only have routes to the director. For a 1 NIC, 1 network setup, ICMP redirects must be turned off on the director (see ) (the configure script does this for you).

In a normal server farm, the default gw of the realserver would be the router to the internet and the packet RIP->CIP would be sent directly to the client. In a LVS-NAT LVS, the default gw of the realservers must be the director. The director masquerades the packet from the realserver (rewrites it to VIP->CIP) and the client receives a rewritten packet with the expected source IP of the VIP. The packet must be routed via the director; there must be no other path to the client. A packet arriving at the client directly from the realserver, rather than going through the director, will not be seen as a reply to the client's request and the connection will hang.

If the director is not the default gw for the realservers, then if you use tcpdump on the director to watch an attempt to telnet from the client to the VIP (run tcpdump with `tcpdump port telnet`), you will see the request packet (CIP->VIP), the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP). You will not see the rewritten reply packet (VIP->CIP). (Remember if you have a switch on the realserver's network, rather than a hub, then each node only sees the packets to/from it. tcpdump won't see packets between other nodes on the same network.)
Part of the setup of LVS-NAT then is to make sure that the reply packet goes via the director, where it will be rewritten to have the addresses (VIP->CIP). In some cases (e.g. 1 net LVS-NAT) icmp redirects have to be turned off on the director so that the realserver doesn't get a redirect to forward packets directly to the client.

In a production system, a router would prevent a machine on the outside exchanging packets with machines on the RIP network. As well, the realservers will be on a private network (eg 192.168.x.x/24) and replies will not be routable. In a test setup (no router), these safeguards don't exist. All machines (client, director, realservers) are on the same piece of wire and if routing information is added to the hosts, the client can connect to the realservers independently of the LVS. This will stop LVS-NAT from working (your connection will hang), or it may appear to work (you'll be connecting directly to the realserver).

In a test setup, traceroute from the realserver to the client should go through the director (2 hops in the above diagram). The configure script will test that the director's gw is 2 hops from the realserver and that the route to the director's gw is via the director, preventing this error. (Thanks to James Treleaven jametrel (at) enoreo (dot) on (dot) ca 28 Feb 2002, for clarifying the write up on the ping tests here.)

In a test setup with the client connected directly to the director (in the setup above with 1 or 2 NICs, or the one NIC, one network LVS-NAT setup), you can ping between the client and realservers. However in production, with the client out on internet land, and the realservers with unroutable IPs, you should not be able to ping between the realservers and the client. The realservers should not know about any other network than their own (here 10.1.1.0). The connection from the realservers to the client is through ipchains (for 2.2.x kernels) and the LVS-NAT tables set up by the director.
In my first attempt at LVS-NAT setup, I had all machines on a 192.168.1.0 network and added a 10.1.1.0 private network for the realservers/director, without removing the 192.168.1.0 network on the realservers. All replies from the servers were routed onto the 192.168.1.0 network rather than back through LVS-NAT and the client didn't get any packets back. Here's the general setup I use for testing. The client (192.168.2.254) connects to the VIP on the director. (The VIP on the realserver is present only for LVS-DR and LVS-Tun.) For LVS-DR, the default gw for the realservers is 192.168.1.254. For LVS-NAT, the default gw for the realservers is 192.168.1.9. This setup works for both LVS-NAT and LVS-DR. Here's the routing table for one of the realservers as in the LVS-NAT setup. Here's a traceroute from the realserver to the client showing 2 hops. Note the traceroute from the client box to the realserver only has one hop. director icmp redirects are on, but the director doesn't issue a redirect (see ) because the packet RIP->CIP from the realserver emerges from a different NIC on the director than it arrived on (and with different source IP). The client machine doesn't send a redirect since it is not forwarding packets, it's the endpoint of the connection.
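Since a wrong default gw on the realserver is the usual reason LVS-NAT fails, it can be worth checking it mechanically. The sketch below is an illustration only: the function names default_gw and check_gw are hypothetical, and it assumes the routing table is read from the output of the iproute2 `ip route` command, where the default route is printed as `default via GATEWAY dev ...`. The address 192.168.1.9 is the LVS-NAT gateway from the test setup described above.

```shell
#!/bin/sh
# default_gw: read `ip route` output on stdin and print the gateway
# of the first default route, if there is one.
default_gw() {
    awk '$1 == "default" {
             for (i = 2; i < NF; i++)
                 if ($i == "via") { print $(i + 1); exit }
         }'
}

# check_gw DIP: succeed only if the default gw read from stdin is DIP.
check_gw() {
    [ "$(default_gw)" = "$1" ]
}

# On a realserver:
#   ip route | check_gw 192.168.1.9 || echo "default gw is not the director"
```

This only checks the default route. A leftover network route (like the 192.168.1.0 route in the story above) can still carry replies around the director, so a traceroute from the realserver to the client remains the more complete test.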

Run the configure script Use lvs_nat.conf as a template (sample here will setup LVS-NAT in the diagram above assuming the realservers are already on the network and using the DIP as the default gw). The output is a commented rc.lvs_nat file. Run the rc.lvs_nat file on the director and then the realservers (the script knows whether it is running on a director or realserver). The configure script will setup up masquerading, forwarding on the director and the default gw for the realservers.

Setting up demasquerading on the director; 2.4.x and 2.2.x The packets coming in from the client are being demasqueraded by the director. In 2.2.x you need to masquerade the replies. Here's the masquerading code in rc.lvs_nat, that runs on the director (produced by the configure script).

    ... > /proc/sys/net/ipv4/ip_forward
    echo "installing ipchain rules"
    /sbin/ipchains -A forward -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0
    #repeated for each realserver and service
    ..
    ..
    echo "ipchain rules "
    /sbin/ipchains -L

In this example, http is being masqueraded by the director, allowing the realserver to reply to the http requests that were demasqueraded by the director as part of the 2.2.x LVS code. In 2.4.x masquerading of LVS'ed services is done explicitly by the LVS code and no extra masquerading (by iptables) commands need be run.

rewriting, re-mapping, translating ports with LVS-NAT One of the features of LVS-NAT is that you can rewrite/re-map the ports. Thus the client can connect to VIP:http, while the realserver can be listening on some other port (!http). You set this up with ipvsadm. Here the client connects to VIP:http, the director rewrites the packet header so that dst_addr=RIP:9999 and forwards the packet to the realserver, where the httpd is listening on RIP:9999. For each realserver (i.e. each RIP) you can rewrite the ports differently: each realserver could have the httpd listening on its own particular port (e.g. RIP1:9999, RIP2:80, RIP3:xxxx). Although port re-mapping is not possible with LVS-DR or LVS-Tun, it's possible to use iptables to do (and LVS-Tun) on the realserver, producing the same result.
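As an illustration of per-realserver port re-mapping, the sketch below prints the ipvsadm commands for one virtual service whose realservers listen on different ports. The function name nat_service is hypothetical, the round robin scheduler is an arbitrary choice, and the second realserver address (10.1.1.3) is an assumed example; the VIP and first RIP are taken from the examples elsewhere in this section. Nothing is executed, so the output can be reviewed before it is run as root on the director.

```shell
#!/bin/sh
# nat_service VIP:PORT RIP:PORT [RIP:PORT ...]
# Print (do not run) ipvsadm commands for an LVS-NAT virtual service.
nat_service() {
    vip=$1
    shift
    echo "ipvsadm -A -t $vip -s rr"            # add the virtual service, round robin
    for rs in "$@"; do
        echo "ipvsadm -a -t $vip -r $rs -m"    # -m selects masquerading (LVS-NAT)
    done
}

# VIP:http is re-mapped to port 9999 on one realserver and left at 80 on the other.
nat_service 192.168.1.110:80 10.1.1.2:9999 10.1.1.3:80
```

The port given after -r is the port the realserver listens on, so the client always connects to VIP:80 regardless of where each httpd is actually bound.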

masquerade timeouts For the earlier versions of LVS-NAT (with 2.0.36 kernels) the timeouts were set by linux/include/net/ip_masq.h, the default values of masquerading timeouts are:

Julian's step-by-step check of a L4 LVS-NAT setup Julian has his latest fool-proof setup doc at Julian's software page. Here's the version, at the time I wrote this entry. good A.2 No => bad Some settings for the director: Linux 2.2/2.4: ipchains -A forward -s RIP -j MASQ Linux 2.4: iptables -t nat -A POSTROUTING -s RIP -j MASQUERADE Q.2 Traceroute to client goes through LVS box and reaches the client? traceroute -n -s RIP CLIENT_IP A.1 Yes => good A.2 No => bad same ipchains command as in Q.1 For client and server on same physical media use these in the director: echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects echo 0 > /proc/sys/net/ipv4/conf//send_redirects Q.3 Is the traffic forwarded from the LVS box, in both directions? For all interfaces on director: tcpdump -ln host CLIENT_IP The right sequence, i.e. the IP addresses and ports on each step (the reversed for the in->out direction are not shown): CLIENT | CIP:CPORT -> VIP:VPORT | || | \/ out | CIP:CPORT -> VIP:VPORT || LVS box \/ | CIP:CPORT -> RIP:RPORT in | || | \/ | CIP:CPORT -> RIP:RPORT + REAL SERVER A.1 Yes, in both directions => good (for Layer 4, probably not for L7) A.2 The packets from the realserver are dropped => bad: - rp_filter protection on the incoming interface, probably hit from local client (for more info on rp_filter, see the section on - firewall rules drop the replies A.3 The packets from the realservers leave the director unchanged - missing -j MASQ ipchains rule in the LVS box For client and server on same physical media: The packets simply does not reach the director. The real server is ICMP redirected to the client. 
In director: echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects echo 0 > /proc/sys/net/ipv4/conf//send_redirects A.4 All packets from the client are dropped - the requests are received on wrong interface with rp_filter protection - firewall rules drop the requests A.5 The client connections are refused or are served from service in the LVS box - client and LVS are on same host => not valid - the packets are not marked from the firewall and don't hit firewall mark based virtual service Q.4 Is the traffic replied from the realserver? For the outgoing interface on realserver: tcpdump -ln host CLIENT_IP A.1 Yes, SYN+ACK => good A.2 TCP RST => bad, No listening real service A.3 ICMP message => bad, Blocked from Firewall/No listening service A.4 The same request packet leaves the realserver => missing accept rules or RIP is not defined A.5 No reply => realserver problem: - the rp_filter protection drops the packets - the firewall drops the request packets - the firewall drops the replies A.6 Replies goes through another device or don't go to the LVS box =? bad - the route to the client is direct and so don't pass the LVS box, for example: - client on the LAN - client and realserver on same host - wrong route to the LVS box is used => use another Check the route: rs# ip route get CLIENT_IP from RIP The result: start the following tests rs# tcpdump -ln host CIP rs# traceroute -n -s RIP CIP lvs# tcpdump -ln host CIP client# tcpdump -ln host CIP For more deep problems use tcpdump -len, i.e. sometimes the link layer addresses help a bit. For FTP: VS-NAT in Linux 2.2 requires: - modprobe ip_masq_ftp (before 2.2.19) - modprobe ip_masq_ftp in_ports=21 (2.2.19+) VS-NAT in Linux 2.4 requires: - ip_vs_ftp VS-DR/TUN require persistent flag FTP reports with debug mode enabled are useful: # ftp ftp> debug ftp> open my.virtual.ftp.service ftp> ... 
ftp> dir ftp> passive ftp> dir There are reports that sometimes the status strings reported from the FTP realservers are not matched with the string constants encoded in the kernel FTP support. For example, Linux 2.2.19 matches "227 Entering Passive Mode (xxx,xxx,xxx,xxx,ppp,ppp)" Julian Anastasov

How LVS-NAT works director:/etc/lvs# ipvsadm does the following (if the service was http instead of telnet, the webserver on the realserver could be listening on port 8000 instead of 80) Turn on ip_forwarding (so that the packets can be forwarded to the realservers)

    ... > /proc/sys/net/ipv4/ip_forward

Example: client requests a connection to 192.168.1.110:23 director chooses realserver 10.1.1.2:23, updates connection tables, then The client gets back a packet with the source_address = VIP. For the verbally oriented... The request packet is sent to the VIP. The director looks up its tables and sends the connection to realserver1. The packet is rewritten with a new destination (in this case with the same port, but the port could be changed too) and sent to RIP1. The realserver replies, sending back a packet to the client. The default gw for the realserver is the director. The director accepts the packet and rewrites the packet to have source=VIP and sends the rewritten packet to the client. Why isn't the source of the incoming packet rewritten to be the DIP or VIP? Wensong

...changing the source of the packet to the VIP sounds good too, it doesn't require that default route rule, but requires additional code to handle it.

In LVS-NAT, how do packets get back to the client, or how does the director choose the VIP as the source_address for the outgoing packets? This was written for 2.0.x and 2.2.x kernel LVSs, which were based on the masquerading code. With 2.4.x, LVS is based on netfilter and there were initially some problems getting LVS-NAT to work with 2.4.x. What happens here for 2.4.x, I don't know. Joe In normal NAT, where a bunch of machines are sitting behind a NAT box, all outward going packets are given the IP on the outside of the NAT box. What if there are several IPs facing the outside world? For NAT it doesn't really matter as long as the same IP is used for all packets. The default value is usually the first interface address (eg eth0). With LVS-NAT you want the outgoing packets to have the source of the VIP (probably on eth0:1) rather than the IP on the main device on the director (eth0). With a single realserver LVS-NAT LVS serving telnet, the incoming packet does this,

    CIP:high_port -> VIP:telnet      #client sends a packet
    CIP:high_port -> RIP:telnet      #director demasquerades packet, forwards to realserver
    RIP:telnet    -> CIP:high_port   #realserver replies

The reply arrives on the director (being sent there because the director is the default gw for the realserver). To get the packet from the director to the client, you have to reverse the masquerading done by the LVS. To do this (in 2.2 kernels), on the director you add an ipchains rule. If the director has multiple IPs facing the outside world (eg eth0=192.168.2.1 the regular IP for the director and eth0:1=192.168.2.110 the VIP), the masquerading code has to choose the correct IP for the outgoing packet. Only the packet with src_addr=VIP will be accepted by the client. A packet with any other src_addr will be dropped. The normal default for masquerading (eth0) should not be used in this case. The required m_addr (masquerade address) is the VIP. Does LVS fiddle with the ipchains tables to do this?
Julian Anastasov ja (at) ssi (dot) bg 01 May 2001

No, ipchains only delivers packets to the masquerading code. It doesn't matter how the packets are selected in the ipchains rule. The m_addr (masqueraded_address) is assigned when the first packet is seen (the connect request from the client to the VIP). LVS sees the first packet in the LOCAL_IN chain when it comes from the client. LVS assigns the VIP as maddr. The MASQ code sees the first packet in the FORWARD chain when there is a -j MASQ target in the ipchains rule. The routing selects the m_addr. If the connection already exists the packets are masqueraded. The LVS can see packets in the FORWARD chain but they are for already created connections, so no m_addr is assigned and the packets are masqueraded with the address saved in the connections structure (the VIP) when it was created. There are 3 common cases: The connection is created as response to packet. The connection is created as response to packet to another connection. The connection is already created Case (1) can happen in the plain masquerading case where the in->out packets hit the masquerading rule. In this case when nobody recommends the s_addr for the packets going to the external side of the MASQ, the masq code uses the routing to select the m_addr for this new connection. This address is not always the DIP, it can be the preferred source address for the used route, for example, address from another device. Case (1) happens also for LVS but in this case we know: the client address/port (from the received datagram) the virtual server address/port (from the received datagram) the realserver address/port (from the LVS scheduler) But this is on out->in packet and we are talking about in->out packets Case (2) happens for related connections where the new connection can be created when all addresses and ports are known or when the protocol requires some wildcard address/port matching, for example, ftp. In this case we expect the first packet for the connection after some period of time. 
It seems you are interested in how case (3) works. The answer is that the NAT code remembers all these addresses and ports in a connection structure with these components:

    external address/port (LVS: client)
    masquerading address/port (LVS: virtual server)
    internal address/port (LVS: realserver)
    protocol
    etc

LVS and the masquerading code simply hook in the packet path and they perform the header/data mangling. In this process they use the information from the connection table(s). The rule is simple: when a packet belongs to an already established connection we must remember all addresses and ports and always use the same values when mangling the packet header. If we select different addresses or ports each time we simply break the connection. After the packet is mangled the routing is called to select the next hop. Of course, you can expect problems if there are fatal route changes. So, the short answer is: the LVS knows what m_addr to use when a packet from the realserver is received because the connection is already created and we know what addresses to use. Only in the masquerading case (where LVS is not involved) can connections be created and a masquerading address be selected without using a rule for this. In all other cases there is a rule that recommends what addresses are to be used at creation time. After creation the same values are used.

So make the VIP the primary IP on the outside of the director

Wayne wayne (at) compute-aid (dot) com 26 Apr 2000 Any web server behind the LVS box use LVS-NAT can initiate communication to the Internet. However, it is not using the farm IP address, rather it is using the masquerading IP address -- the actual IP address of the interface. Is there easy way to let the server in NAT mode to go out as the farm IP address?

Lars No. This is a limitation in the 2.2 masquerading code. It will always use the first address on the interface.

We tried and it works! We put VIP on eth0, and RIP on eth0:1 in NAT mode and it works fine. We just need to figure out how to do it during reboot, since this is done by playing with the ifconfig command. Once we swap them around, the outgoing IP address is the VIP address. But if the LVS box reboots, you have to redo it.

Joe: ! :-) I didn't realise you were in VS-NAT mode, therefore not having the VIP on the realservers. I thought you must be in VS-DR.

One Network LVS-NAT According to Malcolm Turnbull, this is called "One Arm NAT" in the commercial world (i.e. one nic and one network). The disadvantage of the 2 network LVS-NAT is that the realservers are not able to connect to machines in the network of the VIP. You couldn't make a LVS-NAT setup out of machines already on your LAN, which were also required for other purposes to stay on the LAN network. Here's a one network LVS-NAT LVS.

The problem: A return packet from the realserver (with address RIP->CIP) will be sent to the realserver's default gw (the director). What you want is for the director to accept the packet and to demasquerade it, sending it on to the client as a packet with address (VIP->CIP). With ICMP redirects on, the director will realise that there is a better route for this packet, i.e. directly from the realserver to the client and will send an ICMP redirect to the realserver, informing it of the better route. As a result, the realserver will send subsequent packets directly to the client and the reply packet will not be demasqueraded by the director. The client will get a reply from the RIP rather than the VIP and the connection will hang.

The cure: Thanks to michael_e_brown (at) dell (dot) com and Julian ja (at) ssi (dot) bg for help sorting this out. To get a LVS-NAT LVS to work on one network - On the director, turn off icmp redirects on the NIC that is the default gw for the realservers. (Note: eth0 may be eth1 etc, on your machine).

    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects

Make the director the default and only route for outgoing packets. You will probably have set the routing on the realserver up like this Note the route to 192.168.1.0/24.
This route allows the realserver to send packets to the client by just putting them out on eth0, where the client will pick them up directly (without being demasqueraded) and the LVS will not work. This route also allows the realservers to talk to each other directly i.e. without routing packets through the director. (As the admin, you might want to telnet from one realserver to another, or you might have ntp running, sending ntp packets between realservers.) Remove the route to 192.168.1.0/24. This will leave you with Now packets RIP->CIP have to go via the director and will be demasqueraded. The LVS-NAT LVS now works. If LVS is forwarding telnet, you can telnet from the client to the VIP and connect to the realserver. As a side effect, packets between the realservers are also routed via the director, rather than going directly (note: all packets now go via the director). (You can live with that.) You can ping from the client to the realserver. You can also connect _directly_ to services on the realserver _NOT_ being forwarded by LVS (in this case e.g. ftp). You can no longer connect directly to the realserver for services being forwarded by the LVS. (In the example here, telnet ports are not being rewritten by the LVS, i.e. telnet->telnet).

Here's tcpdump on the director. Since the network is switched the director can't see packets between the client and realserver. The client initiates telnet. `netstat -a` on the client shows a SYN_SENT from port 4121.

    ... > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 (DF)
    16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]

(repeats every second until I kill telnet on client) The director doesn't see the connect request from client->realserver. The first packet seen is the ack from the realserver, which will be forwarded via the director. The director will rewrite the ack to be from the director. The client will not accept an ack to port 4121 from director:telnet.
Julian 2001-01-12

The redirects are handled in net/ipv4/route.c:ip_route_input_slow(), i.e. from the routing and before reaching LVS (in LOCAL_IN). In that code, RTCF_NAT and RTCF_MASQ are flags used by the dumb NAT code, but the masquerading defined with ipchains -j MASQ does not set such (or some of these) flags. The result: the redirect is sent according to conf/{all,<device>}/send_redirects from ip_rt_send_redirect() and ip_forward() in net/ipv4/ip_forward.c.

So, the meaning is: if we are going to forward a packet and the in_dev is the same as the out_dev, we redirect the sender to the directly connected destination, which is on the same shared media. The ipchains code in the FORWARD chain is reached too late to avoid sending these redirects. They are already sent when the -j MASQ is detected.

If all/send_redirects is 1, every <device>/send_redirects is ignored. So, if we leave it at 1, redirects are sent. To stop them we need all=0 && <device>=0. default/send_redirects is the value that will be inherited by each new interface that is created.

The logical operation between conf/all/<var> and conf/<device>/<var> is different for each var. The operation used is specified in /usr/src/linux/include/linux/inetdevice.h. For send_redirects it is '||'. For others, for example conf/{all,<device>}/hidden, it is '&&'. So, for the two logical operations we have:

For &&:

    all   <device>   result
    ------------------------------
    0     0          0
    0     1          0
    1     0          0
    1     1          1

For ||:

    all   <device>   result
    ------------------------------
    0     0          0
    0     1          1
    1     0          1
    1     1          1

When a new interface is created we have two choices:

1. to set conf/default/<var> to the value that we want each newly created interface to inherit

2. to create the interface first and then to set the value before assigning the address:

    echo <val> > conf/eth0/<var>
    ifconfig eth0 192.168.0.1 up

but this is risky, especially for the tunnel devices, for example, if you want to play with var rp_filter.
For the other devices this is a safe method if there is no problem with the default value before assigning the IP address. The first method can be the safest one but you have to be very careful.
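The two truth tables Julian describes can be reproduced with a few lines of Python. This is only a sketch of the rule as stated in the posting ('||' for send_redirects, '&&' for hidden); the names `COMBINE`, `effective` and `truth_table` are made up for the illustration and are not kernel identifiers.

```python
# Sketch only: models the rule described in the posting, not the kernel macros.
# How conf/all/<var> and conf/<device>/<var> are combined for each variable.
COMBINE = {
    "send_redirects": lambda all_v, dev_v: all_v or dev_v,   # '||'
    "hidden": lambda all_v, dev_v: all_v and dev_v,          # '&&'
}


def effective(var, all_value, device_value):
    """Return the effective 0/1 setting of <var> for one interface."""
    return int(bool(COMBINE[var](all_value, device_value)))


def truth_table(var):
    """Rows of (all, device, result), in the usual 00, 01, 10, 11 order."""
    return [(a, d, effective(var, a, d)) for a in (0, 1) for d in (0, 1)]


if __name__ == "__main__":
    for var in COMBINE:
        print(var)
        for row in truth_table(var):
            print("  all=%d device=%d -> %d" % row)
```

For send_redirects the only row with result 0 is all=0, device=0, which is why both the all and the per-device entries have to be set to 0 on the director.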

One Network LVS from Joe Stump Joe Stump joe (at) joestump (dot) net 2002-09-04

Problem The problem is you have one network that has your realservers, directors, and clients all together on the same class C. For this example we will say they all sit on 192.168.1.*. Here is a simple layout. Everything looks like it should work just fine right? Wrong. The problem is that in reality all of these machines are able to talk to one another because they all reside on the same physical network. So here is the problem: clients outside of the internal network get expected output from the load balancer, but clients on the internal network hang when connecting to the load balancers.

Cause

So what is causing this problem? The routing tables on the directors and the realservers are causing your client to become confused and hang the connection. If you look at the routing table on your realserver you will notice that the gateway for your internal network is 0.0.0.0. Your director will have a similar route. These routes tell your directors and realservers that replies to machines on that network should be sent directly to those machines. So when a request comes to the director, the director routes it to the realserver, but the realserver sends the response directly back to the client instead of routing it back through the director as it should. The same thing happens when you try to connect via the director's outside IP from an internal client IP, only this time the director mistakenly sends directly to the internal client IP. The internal client is expecting the return packets from the director's external IP, not the director's internal IP.

Solution

The solution is simple. Delete the default routes on your directors and realservers to the internal network. Deleting that route on each machine should do the trick. One thing to note is that you will not be able to connect to these machines from the internal network once you have deleted these routes. You might just want to use the director as a terminal server, since you can connect from there to the realservers. Also, if your realservers connect to DBs and NFS servers on the internal network, you will have to add direct (host) routes to those hosts. I added these routes to a startup script, so it deletes my internal routes and adds the needed direct routes to my NFS and DB servers during startup.

One Network LVS-NAT with windows realservers

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 1 Aug 2008

Route configuration for Windows Server with one arm NAT mode: When a client on the same subnet as the real server tries to access the virtual server on the load balancer, the request will fail. The real server will try to use the local network to get back to the client rather than going through the load balancer and getting the correct network translation for the connection. To rectify this issue we need to add a route via the load balancer that takes priority over the Windows default routing rules. This is a simple case of adding a permanent route. (NB. Replace 192.168.1.0 with your local subnet address.)

The default route to the local network has a metric of 10, so this new route overrides all local traffic and forces it to go through the load balancer as required. Any local traffic (same subnet) is handled by this route and any external traffic is handled by the default route (which also points at the load balancer).

Malcolm's modification of the One Network LVS-NAT

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 1 Aug 2008

Forgot to add that this method (i.e. the windows realserver method above) works for Linux as well, avoiding the need to add a route for each host.

Route configuration for Linux with one arm NAT mode: When a client on the same subnet as the real server tries to access the virtual server on the load balancer, the request will fail. The real server will try to use the local network to get back to the client rather than going through the load balancer and getting the correct network translation for the connection. To rectify this issue we need to modify the local network route to a higher metric. (NB. Replace 192.168.1.0 with your local subnet address.) Then we need to make sure that local network access uses the load balancer as its default route. (NB. Replace 192.168.1.21 with your load balancer gateway.) Any local traffic (same subnet) is handled by this manual route and any external traffic is handled by the default route (which also points at the load balancer).

FIXME Joe: here's what I think Malcolm is saying. Malcolm's clients are on 0/0 and not on the same logical network as the realservers. (In the setup above needing the icmp redirects turned off, all machines, client, director, realservers are on the same network).

metric 0 is high priority: anything to 0/0 goes to the default gw.

metric 2000 is low priority: anything to 192.168.1.0/24 goes to eth0.

you don't need to turn off icmp redirects.

However the linux kernel ignores the metric. Only dynamic routing protocols (RIP, GATED) use the metric, and then only to decide between duplicate routes. Routes with (metric>16) are ignored by dynamic routing protocols. Presumably Linux, ignoring the metric, treats a route with metric=2000 the same as a route with metric=0. So although Malcolm's method works, we don't understand why at the moment.
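One possible explanation, offered as an assumption to check rather than as the answer to the FIXME: as far as I know, the Linux routing table does use the metric as a preference between routes to the same prefix (lowest value wins), after the usual longest-prefix match. The Python sketch below models only that selection rule. The route table in it is a guess at what Malcolm's realserver ends up with (his actual commands are not shown here), and `ROUTES` and `lookup` are hypothetical names.

```python
import ipaddress

# Hypothetical realserver routing table: (prefix, gateway, metric).
# gateway None means "directly connected, send out of eth0".
ROUTES = [
    ("192.168.1.0/24", None, 2000),          # local net, demoted to metric 2000
    ("192.168.1.0/24", "192.168.1.21", 0),   # local net via the load balancer
    ("0.0.0.0/0", "192.168.1.21", 0),        # default route via the load balancer
]


def lookup(dst, routes=ROUTES):
    """Pick the longest matching prefix; among equal prefixes, the lowest metric."""
    addr = ipaddress.ip_address(dst)
    candidates = []
    for prefix, gateway, metric in routes:
        net = ipaddress.ip_network(prefix)
        if addr in net:
            candidates.append((net.prefixlen, -metric, gateway))
    if not candidates:
        raise LookupError("no route to %s" % dst)
    # max() on (prefixlen, -metric) prefers long prefixes, then small metrics.
    return max(candidates, key=lambda c: (c[0], c[1]))[2]


if __name__ == "__main__":
    print(lookup("192.168.1.80"))   # a client on the local subnet
    print(lookup("203.0.113.5"))    # a client somewhere else
```

Under this model replies to a local client leave via 192.168.1.21 (the load balancer) because the metric 0 route beats the metric 2000 route to the same /24. If the kernel really ignored the metric, the choice between the two /24 routes would be arbitrary.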

One net LVS-NAT from the mailing list

Here's an untested solution from Julian for a one network LVS-NAT (I assume this is old, maybe 1999, because I don't have a date on it).

Put the client in the external logical network. This way the client, the director and the realserver(s) are on the same physical network but the client can't be on the masqueraded logical network. So, change the client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add through DIP (I don't see such an IP for the director). Why in your setup is DIP==VIP? If you add a DIP (166.84.192.33 for example) on the director, you can later add a path for 192.168.1.0/24 through 166.84.192.33. There is no need to use masquerading with 2 NICs. Just remove the client from the internal logical network used by the LVS cluster.

A different working solution comes from Ray Bellis rpb (at) community (dot) net (dot) uk, who has used a 2 NIC director to have the RIPs on the same logical network as the VIP (i.e. RIP and VIP numbers are from the same subnet), although they are in different physical networks. From his posting:

... the same *logical* subnet. I still have a dual-ethernet box acting as a director, and the VIP is installed as an alias interface on the external side of the director, even though the IP address it has is in fact assigned from the same subnet as the ...

re-mapping ports, rewriting is slow for 2.0, 2.2 kernels

For LVS-NAT, the packet headers are re-written (from the VIP to the RIP and back again). At no extra overhead, anything else in the header can be rewritten at the same time. LVS-NAT can rewrite the ports. Thus a request to port VIP:80 received on the director can be sent to RIP:8000 on the realserver.

In the 2.0.x and 2.2.x series of IPVS, rewriting the packet headers is slow on machines from that era (60usec/packet on a pentium classic) and limits the throughput of LVS-NAT (for 536 byte packets, this is 72Mbit/sec, or about 100BaseT). While LVS-NAT throughput does not scale well with the packet rate (after you run out of CPU), the advantage of LVS-NAT is that realservers can have any OS, no modifications are needed to the realserver to run it in an LVS, and the realserver can have services not found on Linux boxes.

For the other forwarding methods (LVS-DR and LVS-Tun), headers are not rewritten.

The LVS-NAT code for 2.4 is rewritten as a Netfilter module and is not detectably slower than LVS-DR or LVS-Tun. (The IPVS code for the early 2.4.x kernels in 2001 was buggy during the changeover, but that is all fixed now.)
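The 72Mbit/sec figure follows directly from the per-packet rewrite time. A minimal Python check of the arithmetic, assuming (as the text does) that the rewrite time is the only limit and that every packet is 536 bytes; the function names are made up for the illustration:

```python
def nat_throughput_mbit(packet_bytes, usec_per_packet):
    """Throughput in Mbit/sec if each packet costs usec_per_packet to rewrite.

    bits per microsecond is numerically the same as Mbit per second.
    """
    return packet_bytes * 8 / usec_per_packet


def packets_per_sec(usec_per_packet):
    """Packet rate the director can sustain at that rewrite cost."""
    return 1_000_000 / usec_per_packet


if __name__ == "__main__":
    print("%.1f Mbit/sec" % nat_throughput_mbit(536, 60))
    print("%.0f packets/sec" % packets_per_sec(60))
```

With 60usec/packet the director handles about 16,700 packets/sec, which for 536 byte packets is a little over 71Mbit/sec, in line with the rounded figure quoted.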

Two instances of demon running on realserver

from Horms, Jul 2005

With LVS-DR or LVS-Tun, the packet arrives on the realserver with dst_addr=VIP:port. Thus even if you set up two RIPs on the realserver you cannot have two instances of the service demon, because they would both have to be listening for VIP:port. With LVS-NAT, you could have either of

1. two RIPs (RIP1 and RIP2) on one realserver (both IPs could be on one NIC), with ipvsadm forwarding to both RIPs, an instance of the demon listening to RIP1 and another instance of the demon listening to RIP2.

2. one RIP on the realserver, with ipvsadm forwarding requests to two different ports. Thus one instance of the demon would listen to RIP:port1 and another would listen to RIP:port2.
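As an illustration of the two LVS-NAT layouts Horms describes, here is a small Python helper that prints the ipvsadm commands for each. It is a sketch: the addresses, ports and the `nat_commands` helper are hypothetical, and the commands are only built as strings (using the standard `-A`, `-a`, `-t`, `-r`, `-s` and `-m` options), not executed.

```python
def nat_commands(vip, vport, targets, scheduler="rr"):
    """Build ipvsadm commands for one virtual service forwarded by LVS-NAT (-m).

    targets is a list of (rip, rport) pairs, one per demon instance.
    """
    cmds = ["ipvsadm -A -t %s:%d -s %s" % (vip, vport, scheduler)]
    for rip, rport in targets:
        cmds.append("ipvsadm -a -t %s:%d -r %s:%d -m" % (vip, vport, rip, rport))
    return cmds


if __name__ == "__main__":
    # Layout 1: two RIPs on one realserver, same port on each.
    for cmd in nat_commands("192.168.2.110", 80,
                            [("192.168.1.11", 80), ("192.168.1.12", 80)]):
        print(cmd)
    # Layout 2: one RIP, two instances of the demon on different ports.
    for cmd in nat_commands("192.168.2.110", 80,
                            [("192.168.1.11", 8001), ("192.168.1.11", 8002)]):
        print(cmd)
```

The second layout depends on the port rewriting that only LVS-NAT does, which is why it has no LVS-DR or LVS-Tun equivalent.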

Performance of LVS-NAT Horms All things are relative. LVS-NAT is actually pretty fast. I have seen it do well over 600Mbit/s. But in theory LVS-DR is always going to be faster because it does less work. If you only have 100Mbit/s on your LAN then either will be fine. If you have gigabit then LVS-NAT will still probably be fine. Beyond that... I am not sure if anyone has tested that to see what will happen. In terms of number of connections, there is a limit with LVS-NAT that relates to the number of ports. But in practice you probably won't reach that limit anyway.

Performance of LVS-NAT, 2.0 and 2.2 kernels

With the slower machines around in the early days of LVS, the throughput of LVS-NAT was limited by the time taken by the director to rewrite a packet. The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT). Since the director is the limiting step, increasing the number of realservers does not increase the throughput. The performance page shows a slightly higher latency with LVS-NAT compared to LVS-DR or LVS-Tun, but the same maximum throughput. The load average on the director is high (>5) at maximum throughput, and the keyboard and mouse are quite sluggish. The same director box operating at the same throughput under LVS-DR or LVS-Tun has no perceptible load as measured by top or by mouse/keyboard responsiveness.

Performance of LVS-NAT, 2.4 kernels Wayne

NAT takes some CPU and memory copying. With a slower CPU, it will be slower.

The original posting

Julian Anastasov ja (at) ssi (dot) bg 19 Jul 2001

This is a myth from the 2.2 age. In 2.2 there are 2 input route calls for the out->in traffic and this reduces the performance. By default, in 2.2 (and 2.4 too) the data is not copied when the IP header is changed. Updating the checksum in the IP header does not cost too much time compared to the total packet handling time. To check the difference between the NAT and DR forwarding methods in the out->in direction you can use testlvs from http://www.ssi.bg/~ja/ to flood a 2.4 director in 2 setups: DR and NAT. My tests show that I can't see a visible difference. We are talking about 110,000 SYN packets/sec with 10 pseudo clients and the same cpu idle during the tests (there is not enough client power in my setup for a full test), 2 CPUx 866MHz, 2 100mbit internal i82557/i82558 NICs, switched hub:

    3 testlvs client hosts -> NIC1-LVS-NIC2 -> packets/sec

I use a small number of clients because I don't want to spend time in routing cache or LVS table lookups. Of course, NAT involves in->out traffic and this can halve the performance if the CPU or the PCI power is not enough to handle the traffic in both directions. This is the real reason the NAT method looks so slow in 2.4. IMO, the overhead from the TUN encapsulation or from the NAT process is negligible.

Here come the surprises:

The basic setup: 1 CPU PIII 866MHz, 2 NICs (1 IN and 1 OUT), LVS-NAT, SYN flood using testlvs with 10 pseudo clients, no ipchains rules. Kernels: 2.2.19 and 2.4.7pre7. The configurations tested were:

Linux 2.2 (with ipchains support, with modified demasq path to use one input routing call, something like LVS uses in 2.4 but without dst cache usage), without rules and with 3-4 ipchains rules

Linux 2.4 (with ipchains support)

Linux 2.4 (without ipchains support)

Linux 2.4, 2 CPU (with ipchains support)

Linux 2.4, 2 CPU (without ipchains support)

What I see is that:

modified 2.2 and 2.4 UP look equal on 80,000P/s

limits: 2.2=88,000P/s, 2.4=96,000P/s, i.e. 8% difference

1 and 2 CPU in 2.4 look equal 110,000->96,000 (100mbit or PCI bottleneck?), maybe we can't send more than 96,000P/s through a 100mbit NIC?

the ipchains rules can dramatically reduce the performance - from 88,000 to 55,000 P/s

2.4.7pre7 SMP shows too many context switches

DR and NAT show equal results for 2.4 UP 110,000->96,000P/s, 2-3% idle, so I can't claim that there is a NAT-specific overhead.

I performed other tests, testlvs with UDP flood. The packet rate is lower and the cpu idle time in the LVS box increased dramatically, but the client hosts show 0% cpu idle; maybe more testlvs client hosts are needed.

Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002

Many people think that the packet mangling is evil in the NAT processing. The picture is different: the NAT processing in 2.2 uses 2 input routing calls instead of 1 and this totally kills the forwarding of packets from/to many destinations. Such problems are mostly caused by the bad hash function used in the routing code and because the routing cache has a hard limit for entries. Of course, the NAT setups handle more traffic than the other forwarding methods (both the forward and reply directions), a good reason to avoid LVS-NAT with a low power director. In 2.4 the difference between the DR and NAT processing in the out->in direction can not be noticed (at least in my tests) because only one route call is used, for all methods.

Matthew S. Crocker Jul 26, 2001

DR is faster and less resource intensive but has issues with configuration because of the age old 'arp problem'.

Horms horms (at) vergenet (dot) net

LVS-NAT is still fast enough for many applications and is IMHO considerably easier to set up. While I think LVS-DR is great, I don't think people should be under the impression that LVS-NAT will intrinsically be a limiting factor to them.

Don Hinshaw dwh (at) openrecording (dot) com 04 Aug 2001 Cisco, Alteon and F5 solutions are all NAT based.
The real limiting factor as I understand it is the capacity of the netcard, which these three deal with by using gigabit interfaces. Julian Anastasov ja (at) ssi (dot) bg 05 Mar 2002 in discussion with Michael McConnell Note that I used a modified demasq path which uses one input route for NAT but it is wrong. It only proves that 2.2 can reach the same speed as 2.4 if there was use_dst analog in 2.2. Without such feature the difference is 8%. OTOH, there is a right way to implement one input route call as in 2.4 but it includes rewriting of the 2.2 input processing. Michael McConnell

From what I see here, it looks as though the 2.2 kernel handles a higher number of SYNs better than the 2.4 kernel. Am I to assume that for the 110,000 SYNs/sec in the 2.4 kernel, only 63,000 SYNs/sec were answered? The rest failed?

In this test 2.4 has firewall rules, while 2.2 has only ipchains enabled.

Is the 2.2 kernel better at answering a higher number of requests?

No. Note also that the testlvs test was only in one direction, no replies, only client->director->realserver

has anyone compared iptables/ipchains, via 2.2/2.4?

Here are my results. There is some magic in these tests: I don't know why netfilter shows such bad results in one place. Maybe someone can point me to the problem.
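On Julian's question above about whether 96,000 packets/sec is simply the limit of a 100mbit NIC, the Ethernet framing overhead can be worked out directly. The sketch below assumes the testlvs SYN packets carry no payload, so that each one travels in a minimum size (64 byte) Ethernet frame plus the 8 byte preamble and 12 byte inter-frame gap; that assumption about testlvs is mine, and the helper names are hypothetical.

```python
PREAMBLE_BYTES = 8          # preamble plus start-of-frame delimiter
INTERFRAME_GAP_BYTES = 12   # mandatory idle time between frames
MIN_FRAME_BYTES = 64        # smallest Ethernet frame, header and FCS included


def max_frames_per_sec(link_bits_per_sec, frame_bytes=MIN_FRAME_BYTES):
    """Upper bound on frames/sec for a given link speed and frame size."""
    frame_bytes = max(frame_bytes, MIN_FRAME_BYTES)   # short frames are padded
    bits_on_wire = (frame_bytes + PREAMBLE_BYTES + INTERFRAME_GAP_BYTES) * 8
    return link_bits_per_sec / bits_on_wire


def percent_difference(low, high):
    """Difference between two rates as a percentage of the higher one."""
    return 100.0 * (high - low) / high


if __name__ == "__main__":
    print("100mbit wire limit: %.0f frames/sec" % max_frames_per_sec(100_000_000))
    print("2.2 vs 2.4: %.1f%%" % percent_difference(88_000, 96_000))
```

Under these assumptions the wire limit is about 148,800 frames/sec, so 96,000P/s would be below what the link itself can carry, which points more towards the NIC, driver or PCI bus than the 100mbit medium. The second function reproduces the 8% difference between the 88,000 and 96,000 P/s limits.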

Various debugging techniques for routes

This originally described how I debugged setting up a one-net LVS-NAT LVS using the output of route. Since it is more about networking tools than LVS-NAT, it has been moved elsewhere in the HOWTO.

Connecting directly from the client to a service:port on an LVS-NAT realserver If you connect directly to the realserver in a LVS-NAT LVS, the reply packet will be routed through the director, which will attempt to masquerade it. This packet will not be part of an established connection and will be dropped by the director, which will issue an ICMP error. Paul Wouters paul (at) xtdnet (dot) nl 30 Nov 2001

I would like to reach all LVS'ed services on the realservers directly, i.e. without going through the LVS-NAT director, say from a local client not on the internet. Connecting from the client to a RIP should just completely bypass all the lvs code, but it seems that the lvs code is confused, and thinks a RIP->client answer should be part of its NAT structure. tcpdump running on the internal interface of the director shows a packet from the client received on the RIP; the RIP replies (the reply never reaches the client, the director drops it). The director then sends out a port unreachable:

Julian

The code that replies with an ICMP error can be removed but then you still have the problem of reusing connections. The local_client can select a port for a direct connection with the RIP, but if that port was used some seconds before for a CIP->VIP connection, it is possible for LVS to catch these replies as part of the previous connection. LVS does not inspect the TCP headers and does not accurately keep the TCP state. So, it is possible that LVS will not detect that the local_client and the realserver have established a new connection with the same addresses and ports that are still known as a NAT connection. Even stateful conntracking can't notice it because the local_clientIP->RIP packets are not subject to NAT processing. When LVS sees the replies from RIP to local_clientIP it will SNAT them and this will be fatal because the new connection is between the local_clientIP and RIP directly, not from CIP->VIP->RIP. The other thing is that the CIP does not even know that it connects from the same port to the same server. It thinks there are 2 connections from the same CPORT: to VIP and to RIP, so they can live even at the same time.

But a proper TCP/IP stack on a client will not re-use the same port that quickly, unless it is REALLY loaded with connections right? And a client won't (can't?) use the same source port to different destinations (VIP and RIP) right? So, the problem becomes almost theoretical?

This setup is dangerous. As for the ICMP replies, they are only for anti-DoS purposes and may be going to die soon. There is still not enough reason to remove that code (it was not first priority).

Or make it switchable as #ifdef or /proc sysctl?

Wensong

Just comment out the whole block, for example,

    protocol, iph->saddr, h.portp[0])) {
            /*
             * Notify the realserver: there is no existing
             * entry if it is not RST packet or not TCP packet.
             */
            if (!h.th->rst || iph->protocol != IPPROTO_TCP) {
                    icmp_send(skb, ICMP_DEST_UNREACH,
                              ICMP_PORT_UNREACH, 0);
                    kfree_skb(skb);
                    return NF_STOLEN;
            }
    }
    #endif

This works fine. Thanks

The topic came up again. Here's another similar reply.

I've set up a small LVS_NAT-based http load balancer but can't seem to connect to the realservers behind them via IP on port 80. Trying to connect directly to the realservers on port 80, though, translates everything correctly, but generates an ICMP port unreach.

Ben North ben (at) antefacto (dot) com 06 Dec 2001 The problem is that LVS takes an interest in all packets with a source IP:port of a Real Service's IP:port, as they're passing through the FORWARD block. This is of course necessary --- normally such packets would exist because of a connection between some client and the Virtual Service, mapped by LVS to some Real Service. The packets then have their source address altered so that they're addressed VIP:VPort -> CIP:CPort. However, if some route exists for a client to make connections directly to the Real Service, then the packets from the Real Service to the client will not be matched with any existing LVS connection (because there isn't one). At this point, the LVS NAT code will steal the packet and send the "Port unreachable" message you've observed back to the Real Server. A fix to the problem is to #ifdef out this code --- it's in ip_vs_out() in the file ip_vs_core.c.
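The behaviour described by Julian and Ben North can be mimicked with a toy model. The Python sketch below is not the ip_vs code; `NatDirector` and its methods are hypothetical names, and the model keeps only what the discussion needs: a table of NAT connections, a set of known real services, and the decision made for each packet arriving from a realserver.

```python
class NatDirector:
    """Toy model of how an LVS-NAT director treats packets from realservers."""

    def __init__(self, real_services):
        self.real_services = set(real_services)   # {(rip, rport), ...}
        self.connections = {}   # (cip, cport, rip, rport) -> (vip, vport)

    def new_connection(self, cip, cport, vip, vport, rip, rport):
        """Record a client connection to VIP:vport scheduled to RIP:rport."""
        self.connections[(cip, cport, rip, rport)] = (vip, vport)

    def from_realserver(self, rip, rport, cip, cport):
        """Decide what happens to a packet RIP:rport -> CIP:cport."""
        entry = self.connections.get((cip, cport, rip, rport))
        if entry is not None:
            vip, vport = entry
            return ("snat", vip, vport)            # rewritten to VIP:vport -> CIP
        if (rip, rport) in self.real_services:
            # Looks like LVS traffic but matches no connection.
            return ("icmp-port-unreachable", None, None)
        return ("forward", rip, rport)             # not an LVS'ed service


if __name__ == "__main__":
    director = NatDirector({("192.168.1.11", 23)})
    # Reply to a client that connected directly to RIP:telnet.
    print(director.from_realserver("192.168.1.11", 23, "192.168.1.80", 4121))
    # Reply from a service that is not forwarded by LVS (ftp).
    print(director.from_realserver("192.168.1.11", 21, "192.168.1.80", 4122))
```

The model also shows Julian's port reuse hazard: if an entry for the same client address and port is still in the table from an earlier CIP->VIP connection, the reply to a new direct connection is SNAT'ed to the VIP instead of being passed through, even with the ICMP code removed.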

A NAT router has no connections

A NAT router rewrites the source IP (and possibly the port) of packets coming from machines on the inside network. With an LVS-NAT director, the connection originates on the internet and terminates on the realserver (de-masquerading). The replies (from the realserver to the lvs client) are masqueraded. In both cases (NAT router, LVS-NAT director), to the machine on the internet, the connection appears to be coming from the box doing the NAT'ing. However the NAT box has no connection (e.g. as seen with netstat -an) to the box on the internet. It is just routing packets (and rewriting them).

Horms 17 May 2004

There is no connection as such. Or more specifically, the connection is routed, not terminated by the kernel. However, there is a proc entry that you can inspect to see the natted connections.

Thoughts on extending NAT

Tao Zhao taozhao (at) cs (dot) nyu (dot) edu 01 May 2002

LVS-NAT assumes that all servers are behind the director, so the director only needs to change the destination IP when a request comes in and forward that to the scheduled realserver. When the reply packets go through the director it will change the source IP. This limits the deployment of LVS using NAT: the director must be the outgoing gateway for all servers. I am wondering if I can change the code so that both source and destination IPs are changed in both directions. For example, CIP: client IP; DIP: director IP; SIP: server IP (public IPs);

    Director->Server: address pair (CIP, DIP) is changed to (DIP, SIP)
    Server->Director->Client: address pair (SIP, DIP) is changed to (DIP, CIP).

Lars

Not very efficient; but this can actually already be done by using the port-forwarding feature AFAIK, or by a userspace application level gateway. I doubt its efficiency, since the director would _still_ need to be in between all servers and the client both ways. Direct routing and/or tunneling make more sense. As well, clients do not know where the connection originally came from, making the logs on them nearly useless; filtering by client IP and establishing a session back to the client (i.e. ftp or some multimedia protocols) is also very difficult.

Wayne wayne (at) compute-aid (dot) com 01 May 2002

The client IP address is very important to marketing people for analyzing the traffic. Getting rid of the CIP leaves the web server with no way to log where the traffic is coming from, thus totally blinding the marketing people, which is very undesirable for many uses. Do you have to allocate a table for tracking these changes, too? That will further slow down the director.

Of course, the director needs to allocate a new port number and change the source port number to it when it forwards the packet to the server. Thus this local port number should be enough for the director to distinguish different connections. This way, there will be no limitation on where the servers are (the tunneling solution needs a change to the server: setting up tunneling).
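Tao Zhao's proposal amounts to the director keeping a table keyed by the local port it allocates. The Python sketch below illustrates that bookkeeping only; it is not LVS code (LVS-NAT does not work this way), and the `FullNatTable` class, its port range and the addresses are all assumptions made for the example.

```python
class FullNatTable:
    """Toy model of the proposed scheme: rewrite both addresses, track by port."""

    def __init__(self, dip, first_port=32768, last_port=61000):
        self.dip = dip
        self.next_port = first_port
        self.last_port = last_port
        self.by_local_port = {}   # local port -> (cip, cport, vport)

    def request(self, cip, cport, vport, sip, sport):
        """(CIP:cport -> DIP:vport) becomes (DIP:local_port -> SIP:sport)."""
        if self.next_port > self.last_port:
            raise RuntimeError("no free local ports on the director")
        local_port = self.next_port
        self.next_port += 1
        self.by_local_port[local_port] = (cip, cport, vport)
        return (self.dip, local_port, sip, sport)

    def reply(self, local_port):
        """(SIP:sport -> DIP:local_port) becomes (DIP:vport -> CIP:cport)."""
        cip, cport, vport = self.by_local_port[local_port]
        return (self.dip, vport, cip, cport)


if __name__ == "__main__":
    table = FullNatTable("192.0.2.1")
    to_server = table.request("198.51.100.7", 4000, 80, "203.0.113.9", 8080)
    print(to_server)
    print(table.reply(to_server[1]))
```

The sketch makes the objections in the thread concrete: the server only ever sees the DIP as the source of its connections, so the CIP never reaches its logs, and every connection costs the director a table entry and one of its local ports.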

Joe

I talked to Wensong about this in the early days of LVS, but I remember thinking that keeping track of the CIP would have been a lot of work. I think I mentioned it in the HOWTO for a while. However I'd be happy to use the code if someone else wrote it :-) Some commercial load balancers seem to have some NAT-like scheme where the packets can return directly to the CIP without going through the director. Does anyone know how it works? (Actually I don't know whether it's NAT-like or not, I think there's some scheme out there that isn't VS-DR which returns packets directly from the realservers to the clients - this is called "direct server return" in the commercial world).

Wayne wayne (at) compute-aid (dot) com

I think those are switch-like load balancers. They don't take any IP addresses. But I think it could be done even with NAT, as long as the server has two NICs, one talking to the load balancer, the other talking to the switch/hub in front of the load balancer. The load balancer has to change the packet so that it does not have its own IP in it, so there is no need to NAT back to the public packet. The server sets its default gateway on the other NIC to send the packets out.

Postings from the mailing list frederic (dot) defferrard (at) ansf (dot) alcatel (dot) fr

Would it be possible to use LVS-NAT to load-balance virtual-IPs to ssh-forwarded real-IPs? Ssh can also be used to create a local access point that is forwarded to a remote access point through the ssh protocol. For example you can use ssh to securely map a local access point to a remote POP server:

    local:ssh ~~~~~ ssh port forwarding ~~~~~ remote:ssh ==> remote:pop

And when you connect to local:localip you are transparently/securely connected to remote:pop. The main idea is to allow RS in different LANs, with RS that are non-Linux (precluding LVS-Tun). Example:

    VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
               \- VS:83 ---- ssh ---- RS:80

Wensong

You can use VPN (or CIPE) to map some external realservers into your private cluster network. If you use LVS-NAT, make sure the routing on the realserver is configured properly so that the response packets will go through the load balancer to the clients.

I think that it isn't necessary to have the default route through the load balancer when using ssh, because the RS address is the same as the VS address (with different ports).

With the NAT method, your example won't work because LVS/NAT treats the packets as local ones and forwards them to the upper layers without any change. However, your example gives me an idea: we can dynamically redirect port 80 to ports 81, 82 and 83 respectively for different connections, and then your example can work. However, the performance won't be good, because lots of work is done at the application level, and the overhead of copying from kernel to user-space is high. Another thought is that we might be able to set up LVS/DR with realservers in different LANs by using CIPE/VPN. For example, we use CIPE to establish tunnels from the load balancer to the realservers, and then add LVS-DR configuration commands for the realservers at the far ends of the tunnels. I haven't tested it. Please let me know the result if anyone tests this configuration.

Lucas 23 Apr 2004

Is it possible to use the cluster as a NAT router? What I'm saying is: I have a private LAN and I want to share my internet connection, doing NAT and firewall and QoS. The realservers are actually routers and don't serve any service. Is there a way to use the VIP as the private LAN gateway, or to pass the traffic through the director to the "real servers (real routers)" even when it is not destined to a specific port on the server?

Horms 21 May 2004 I think that should work, as long as you are only wanting to route IPv4 TCP, UDP and related ICMP. You probably want to use a fwmark virtual service so that you can forward all ports to the realservers (routers). That said I haven't tried it, so I can't be sure.

LVS-NAT source routing patch (Brownfield, Sawari and Black)

Mar 2006: This will be in the next release of LVS.

Ken Brownfield found that ipvs changes the routing of packets from the director to 0/0 (i.e. LVS-NAT or LVS-DR with the forward-shared patch). The packets from ipvs should use the routing table, but they don't. Ken had a director with two external NICs. He wanted the packets to return via the NIC they arrived on. When he tried LVS-NAT, with his own installed routing table (which works when tested with traceroute), the reply packets from ip_vs were sent to the default gw, apparently ignoring his routing table. It should be none of ip_vs's business where the packets are routed. Here's Ken's ip_vs_source_route.patch.gz patch.

Here's Ken's take on the matter

I need to support VIPs on the director that live on two separate external subnets. The default gateway is on ISP1_SUBNET/eth0, and I have source routes set up for eth1. If I perform an mtr/traceroute on the director bind()ed to the ISP2_IP interface, outgoing traceroutes traverse the proper ISP2_GW, and the same for the ISP1_IP interface. I'm pretty sure the source-route behavior is correct, since I can revert from the proper behavior by dropping table 136. For a single web service, I'm defining identical VIPs, one for each of the ISPs. Incoming packets come in via the proper gateway, but LVS always emits response packets through the default gateway, seemingly ignoring the source-route rules.

I've seen Henrick's general fwmark state tracking described. Reading this, it seems like this patch isn't exactly approved or even obviously available. And the article is from 2002. :) I'm also not sure why this seems like such a difficult problem. If LVS honored routes, there would be no complicated hacks required. Unless LVS overrides routes, in which case it might be nice to have a switch to turn off that optimization.
I understand that routes are a subset of the problem fixed by the patch, and I can see the value of the patch. But for the basic route case it seems odd for LVS to just dump all outgoing packets to the default gw. I mean, it could cache the routing table instead of just a single gw?

From what I can tell, the SH scheduler decides which realserver will receive an incoming request based on the external source IP in the request. I can see four problems with this.

The first is that I can't see how this will change the return route of the packet. I can see mapping incoming source routes to specific real servers with distinct gateways, but I can't see how it could affect an LVS-NAT setup.

The second is that a single client IP could go through either incoming VIP. Assuming SH was somehow changing outbound routing, it would distribute the outbound gateway randomly vs. correctly. I suppose this helps distribute traffic but I'm not really interested in perpetuating asymmetric routes.

The third is that I'd really like to use LVS as a load-balancer, not as a simple load splitter. wlc is pretty key.

The fourth is that using sh doesn't change outbound routes, I just tried it. :-)

The docs state "Multiple gateway setups can be solved with routing and a solution is planned for LVS." Which seems to imply that source routing is a fix but sort of not... :(

Scanning the NFCT patch and looking at the icmp handling, I'm pretty sure the problem is that ip_vs_out() is sending out the packet with a route calculated from the real server's IP. Since ip_vs_out() is reputedly only called for masq return traffic, I think this is just plain incorrect behavior. I pulled out the route_me_harder() mod and created the attached patch. My only concern would be performance, but it seems netfilter's NAT uses this.

First, I need to correct the stated provenance of this patch.
It is a small tweaked subset of an antefacto patch posted to integrate netfilter's connection tracking into LVS, not the NFCT patches as I said. Lots of Googling, not enough brain cells. This patch applies to v1.0.10, but appears to be portable to 2.6.

During a maintenance window this morning, I had the opportunity to test the patch. It was the first time I had ever loaded the patched module, and shockingly it worked perfectly -- outbound traffic from masq VIPs now follows source-routes and chooses the correct outbound gateway. No side effects so far, no obvious increased load.

I also poked around the 2.6 LVS source a bit to see if this issue had been resolved in later versions, and noticed uses of ip_route_output_key, but the source address was always set to 0 instead of something more specific. I'd say it might be worth a review of the LVS code to make sure source addresses are set usefully, and routes are recalculated where necessary.

In any case, if anyone has a similar problem with VIPs spanning multiple external IP spaces and gateways, this has been working like a charm for me under significant production load. So far. *knock*on*wood* I'll update if it crashes and/or burns.

Joe

any idea what would happen if there were multiple VIPs or the packets coming into the director from the outside world were arriving at the LVS code via a fwmark?

To my understanding, Henrick's fwmark patch allows LVS to route traffic based on fwmarks set by an admin in iptables/iproute2. I can imagine certain complex situations where this functionality could be useful and even crucial, but setup and maintenance of fwmarks requires specifically coded fwmark behavior in each of netfilter, iproute2, and ip_vs. Source routes are essentially a standard feature these days, and are critical for proper routing on gateways and routers (which is essentially what a director is in Masq mode). Having LVS properly observe the routing table is a "missing feature", I believe. The patch I created requires no changes for an admin to make (no fwmarks to set up in ip_vs, netfilter, *and* iproute2), basically just properly and transparently observing routes set by iproute2 (which the rest of the director's traffic already obeys). So short answer: Henrick's patch allows VIP routing based on fwmarks specifically created/handled by an admin for that purpose, whereas mine is a minor correction to existing code to properly recalculate the routes of outbound VS/NAT VIP traffic after mangling/masquerading of the source IP. A little end-result crossover, but really quite different. My (borrowed :) patch is essentially a one-liner, so the code complexity is very small and the behavior easily confirmable at a glance. The fwmark code is more invasive, seemingly. Technically, I could have used fwmarks, but until someone needs that specific functionality, I suspect proper source-routing covers 90% of the alternate use cases. And it's the cleaner, more specific solution to my problem. But that's just me. :) Your summary of SH matches my understanding -- it's hash-based persistence calculated from the client's source IP (vs destination in DH). It probably generates a good random, persistent distribution, which I can see being useful in a cluster environment where persistence is rewarded by caching/sessions/etc. 
WLC with persistence is probably a better bet for a load-balancer config, since it actually balances load. Without something like wackamole on the real servers, rr/sh/dh are happy to send traffic to dead servers, AFAICT. Ken Brownfield krb (at) irridia (dot) com 22 Mar 2006 I'm attaching ip_vs_source_route.patch.gz, which is the patch itself. It patches ip_vs_core.c, adding a function call at the end of ip_vs_out() that recalculates the route for an outgoing packet after mangling/masquerading has occurred. ip_vs_out(), according to the comments in the source (and my brief perusal of the code) is "used only for VS/NAT." There should be no effect on DR/TUN functionality as far as I can tell. This type of route recalc might be correct behavior in some TUN or DR circumstances, but I have no experience in a DR/TUN setup. So yes, I believe this patch is orthogonal to DR/TUN functionality and should be silent with regard to DR/TUN. The only concern a user should have after applying this patch is that they make sure they are aware of existing source routes before using the patch. Users may be unknowingly relying on the fact that LVS always routes traffic based on the real server's source IP instead of the VIP IP, and applying the patch could change the behavior of their system. I suspect that will be a very rare concern. As long as the source routes on the system are correct, where the source IP == the VIP IP, packets from LVS will be routed as the system itself routes packets. Routes confirmed with a traceroute (bound to a specific IP on the director) will no longer be ignored for traffic outbound from a NAT VIP. Joe: next Farid Sarwari stepped in Farid Sarwari fsarwari (at) exchangesolutions (dot) com 25 Jul 2006 I'm having some issues with IPVS and IPSec. When a HTTP client requests a page, I can see the traffic come all the way to the webserver (ws1,ws2). However, the return traffic gets to the load balancer but does not make it through the ipsec tunnel. 
When doing a tcpdump I can see that the packets get SNATed by ipvs. I know there is a problem with ipsec 2.6 and SNAT, and I've upgraded my kernel and iptables so now SNAT with iptables works. But it looks like ipvs is doing its own SNAT which doesn't pass through the ipsec tunnel.

    real=y.y.y.1:80 masq
    real=y.y.y.2:80 masq
    checktype=negotiate
    fallback=127.0.0.1:80 masq
    service=http
    request="/"
    receive=" "
    scheduler=wlc
    protocol=tcp
    ------------------
    ipvsadm -ln output:
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP x.x.x.x:80 wlc
      -> y.y.y.1:80 Masq 1 0 0
      -> y.y.y.1:80 Masq 1 0 0
    ------------------
    Software Version #s:
    ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)
    Linux Kernel 2.6.16
    iptables v1.3.5
    ldirectord version 1.131

The Brownfield patch is for an older version of ipvs. When I was applying the patch, Hunk#3 failed. I was able to apply the third hunk manually. When I compiled it, it gave errors for the code from the first hunk of the patch.

Finally got it to work! I can access load balanced pages through ipsec. Ken Brownfield's patch seems to have been for an older version of kernel/ipvs. If you look in the patch, there is a function called ip_vs_route_me_harder which is an exact copy of ip_route_me_harder from netfilter.c. I'm not sure what version of ipvs/kernel Brownfield's patch is for. I couldn't get ipvs to compile with his patch, so I just used his idea and copied the new code from the netfilter source code. I've modified his patch by copying the new ip_route_me_harder function from net/ipv4/netfilter.c (2.6.16).
Below is the patch for kernel 2.6.16 (kernel sources from FC4)

+#include

 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
@@ -516,6 +517,76 @@
     return NF_DROP;
 }

+/* This code stolen from net/ipv4/netfilter.c */
+
+int ip_vs_route_me_harder(struct sk_buff **pskb)
+{
+    struct iphdr *iph = (*pskb)->nh.iph;
+    struct rtable *rt;
+    struct flowi fl = {};
+    struct dst_entry *odst;
+    unsigned int hh_len;
+
+    /* some non-standard hacks like ipt_REJECT.c:send_reset() can cause
+     * packets with foreign saddr to appear on the NF_IP_LOCAL_OUT hook.
+     */
+    if (inet_addr_type(iph->saddr) == RTN_LOCAL) {
+        fl.nl_u.ip4_u.daddr = iph->daddr;
+        fl.nl_u.ip4_u.saddr = iph->saddr;
+        fl.nl_u.ip4_u.tos = RT_TOS(iph->tos);
+        fl.oif = (*pskb)->sk ? (*pskb)->sk->sk_bound_dev_if : 0;
+#ifdef CONFIG_IP_ROUTE_FWMARK
+        fl.nl_u.ip4_u.fwmark = (*pskb)->nfmark;
+#endif
+        if (ip_route_output_key(&rt, &fl) != 0)
+            return -1;
+
+        /* Drop old route. */
+        dst_release((*pskb)->dst);
+        (*pskb)->dst = &rt->u.dst;
+    } else {
+        /* non-local src, find valid iif to satisfy
+         * rp-filter when calling ip_route_input. */
+        fl.nl_u.ip4_u.daddr = iph->saddr;
+        if (ip_route_output_key(&rt, &fl) != 0)
+            return -1;
+
+        odst = (*pskb)->dst;
+        if (ip_route_input(*pskb, iph->daddr, iph->saddr,
+                           RT_TOS(iph->tos), rt->u.dst.dev) != 0) {
+            dst_release(&rt->u.dst);
+            return -1;
+        }
+        dst_release(&rt->u.dst);
+        dst_release(odst);
+    }
+
+    if ((*pskb)->dst->error)
+        return -1;
+
+#ifdef CONFIG_XFRM
+    if (!(IPCB(*pskb)->flags & IPSKB_XFRM_TRANSFORMED) &&
+        xfrm_decode_session(*pskb, &fl, AF_INET) == 0)
+        if (xfrm_lookup(&(*pskb)->dst, &fl, (*pskb)->sk, 0))
+            return -1;
+#endif
+
+    /* Change in oif may mean change in hh_len. */
+    hh_len = (*pskb)->dst->dev->hard_header_len;
+    if (skb_headroom(*pskb) < hh_len) {
+        struct sk_buff *nskb;
+
+        nskb = skb_realloc_headroom(*pskb, hh_len);
+        if (!nskb)
+            return -1;
+        if ((*pskb)->sk)
+            skb_set_owner_w(nskb, (*pskb)->sk);
+        kfree_skb(*pskb);
+        *pskb = nskb;
+    }
+
+    return 0;
+}

 /*
  * It is hooked before NF_IP_PRI_NAT_SRC at the NF_IP_POST_ROUTING
@@ -734,6 +805,7 @@
     struct ip_vs_protocol *pp;
     struct ip_vs_conn *cp;
     int ihl;
+    int retval;

     EnterFunction(11);
@@ -821,8 +893,20 @@
     skb->ipvs_property = 1;

-    LeaveFunction(11);
-    return NF_ACCEPT;
+    /* For policy routing, packets originating from this
+     * machine itself may be routed differently to packets
+     * passing through. We want this packet to be routed as
+     * if it came from this machine itself. So re-compute
+     * the routing information.
+     */
+    if (ip_vs_route_me_harder(pskb) == 0)
+        retval = NF_ACCEPT;
+    else
+        /* No route available; what can we do? */
+        retval = NF_DROP;
+
+    LeaveFunction(11);
+    return retval;

 drop:
     ip_vs_conn_put(cp);
------snip--------

Joe

Can you do IPSec with LVS-DR? (the director would only decrypt and the realservers encrypt)

I haven't tried it, but I don't see why it shouldn't work. It's probably easier to get working than LVS-NAT with IPSec :) You can think of IPSec as just another interface, except that with kernel 2.6 there is no more ipsec0 interface. So as long as routing is set up correctly, LVS-DR should work with IPSec.

so you have an ipsec0 interface and you can put an IP on it and route to/from it just like with eth0? Can you use iproute2 tools on ipsec0?

With Kernel 2.6 there is no more ipsec0 interface, but you can use iproute2 to alter the routing table. You wouldn't want to modify the routes to the tunnel because ipsec takes care of that, but you can modify routes for traffic that is coming through the tunnel destined for LVS-DR.

Ken Brownfield krb (at) irridia (dot) com 28 Jul 2006

At first glance, that's exactly what had to be ported, and I'm glad someone with enough 2.6 fu did it. Now, if someone could have it conditional on a proc/sysctl, it would seem like more of a no-brainer for inclusion. ;)

Joe: next David Black stepped in

David Black dave (at) jamsoft (dot) com 28 Jul 2006

I applied the following patch to a stock 2.6.17.7 kernel, and enabled the source routing hook via /proc/sys/net/ipv4/vs/snat_reroute:

http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.16-1.diff

LVS-NAT connections now appear to obey policy routing - yay! Referring to an older version of the NFCT patch, Ken Brownfield says in the LVS HOWTO: "I pulled out the route_me_harder() mod and created the attached patch." So the Brownfield patch is a derivative of the NFCT patch in the first place. And here's a comment from the NFCT patch I used: For a patched kernel, that functionality is enabled by /proc/sys/net/ipv4/vs/snat_reroute

Farid Sarwari fsarwari (at) exchangesolutions (dot) com 31 Jul 2006

The problem I was having with ipvs was that I couldn't access it through ipsec on kernel 2.6. I remember accessing ipvs through ipsec on 2.4 a few years ago and I don't remember running into this problem. Correct me if I'm wrong, but prior to kernel 2.6.16 SNAT (netfilter) didn't work properly with ipsec. When troubleshooting my problem it looked like the natting was happening after the routing decision had been made. This is why I was under the assumption that only code from kernel 2.6.16+ would fix my problem. If the NFCT patch works with ipsec, I would much rather use that.

Joe

If Julian's patch had been part of the kernel ipvs code, would anyone have had source routing/iproute2 problems with LVS-NAT?

Ken 9 Aug 2006 I don't believe so -- the source-routing behavior appears to be a (happy) side-effect of working NFCT functionality. I think the NFCT and source-routing patches' intentions are to supply a feature and a bug-fix, respectively, but NFCT is an "accidental" superset.
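
The source-route listings from Ken's posting did not survive in this copy. As a rough sketch only, the kind of iproute2 policy routing he describes might look like the following. Only table 136 and eth1 come from his posting; the addresses are documentation placeholders and the function name is made up for illustration. The script prints the commands rather than running them, so it can be tried without root.

```shell
#!/bin/sh
# Hypothetical values: only table 136 and eth1 appear in Ken's posting.
ISP2_IP="198.51.100.10"   # assumed VIP on the second ISP's subnet
ISP2_GW="198.51.100.1"    # assumed gateway of the second ISP
ISP2_DEV="eth1"
TABLE=136

# Print (do not run) the commands for a source route: packets whose
# source address is ISP2_IP look up their default route in table 136.
gen_source_route() {
    echo "ip route add default via $ISP2_GW dev $ISP2_DEV table $TABLE"
    echo "ip rule add from $ISP2_IP table $TABLE"
    echo "ip route flush cache"
}

gen_source_route
```

With an unpatched ip_vs, rules like these are obeyed by the director's own traffic (e.g. a traceroute bound to ISP2_IP) but, as described above, not by LVS-NAT replies.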

LVS-NAT FTP Recipe Stephen Milton smmilton (at) gmail (dot) com 12/17/05 This may be old hat to many of you on this list, but I had a lot of problems deciphering all the issues around FTP in load balanced NAT. So I wrote up the howto on how I got my configuration to work. I was specifically trying to setup for high availability, load balanced, FTP and HTTP with failover and firewalling on the load balancer nodes. Here is a permanent link to the article: load_balanced_ftp_server (http://sacrifunk.milton.com/b2evolution/blogs/index.php/2005/12/17/load_balanced_ftp_server)

LVS-NAT vhosts with apache Michael Green mishagreen (at) gmail (dot) com

Is it possible to make Apache's IP based vhosts work under LVS-NAT?

Graeme Fowler graeme (at) graemef (dot) net 14 Dec 2005 If, by that, you mean Apache vhosts whereby a single vhost lives on a single IP then the answer is definitely "yes", although it may seem counter-intuitive at first. If you're using IP based virtual hosting, you have a single IP address for *each and every* virtual host. In the 'classic' sense this means your server has one, two, a hundred, a thousand IP addresses configured (as aliases) on its' interface which faces the internet and a different vhost listens to each interface. In the clearest case of LVS-NAT, you'd have your public interface on the director handle the one, two, a hundred, a thousand _public_ IP addresses and present those to the internet (or your clients, be those as they are). Assuming you have N realservers, you then require N*(one, two, a hundred, a thousand) private IP addresses and you configure up (one, two, a hundred, a thousand) aliases per virtual server. You then setup LVS-NAT to take each specific public IP and NAT it inbound to N private IPs on the realservers. Still with me? Good. This is a network management nightmare. Imagine you had 256 Virtual IPs, each with 32 servers in a pool. You immediately need to manage an entire /19 worth of space behind your director. That's a lot of address space (8192 addresses to be precise) for you to be keeping up with, and it's a *lot* of entries in your ipvsadm table. There is, however, a trick you can use to massively simplify your addressing: Put all your IP based vhosts on the same IP but a *different port* on each realserver. Suddenly you go from 8192 realserver address (aliases) to, well, 32 address (aliases) with 256 ports in use on each one. Much easier to manage. 
For even more trickery you could probably make use of some of keepalived's config tricks to "pool" your realservers and make your configuration even more simple, but if you only have a small environment you may want to get used to using ipvsadm by hand first until you're happy with it.
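
Graeme's port trick can be sketched as a small generator for the ipvsadm table. This is an illustration under assumptions, not his configuration: the addresses, the base port 8000 and the function name are all made up, and wlc is simply the scheduler used elsewhere in this HOWTO. Each public VIP keeps port 80, while every realserver uses a single private IP with one port per vhost. The commands are printed, not executed.

```shell
#!/bin/sh
# Hypothetical addresses: two public VIPs and two realservers.
VIPS="192.0.2.10 192.0.2.11"
RIPS="10.0.0.1 10.0.0.2"
BASE_PORT=8000   # assumed: vhost number i listens on BASE_PORT+i

# Print one virtual service per VIP and one masqueraded (-m, LVS-NAT)
# realserver entry per RIP, remapping port 80 to the vhost's own port.
gen_vhost_rules() {
    i=0
    for vip in $VIPS; do
        port=$((BASE_PORT + i))
        echo "ipvsadm -A -t $vip:80 -s wlc"
        for rip in $RIPS; do
            echo "ipvsadm -a -t $vip:80 -r $rip:$port -m"
        done
        i=$((i + 1))
    done
}

gen_vhost_rules
```

For 256 VIPs and 32 realservers this still produces 256 * 32 realserver entries in the ipvsadm table, but only 32 private addresses need to be managed instead of 8192.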

LVS-NAT timeout problem Joe: here's a posting that hasn't been solved. It occured with LVS-NAT, but we don't know if it occurs with the other forwarding methods. Dmitri Skachkov dmitri (at) nominet (dot) org (dot) uk 21 Feb 2007 I should probably say in the beginning that the issue I'm going to describe is not directly related to the problem discussed on this list a while ago (http syn/ack not translated when ftp loadbalancing also enabled). We have several LVS/NAT installations which are managed by Keepalived. All of them are pretty much identical and exhibit the same issue. The setup is looking like this (a backup load balancer and a backup router are omitted) and is LVS/NAT standard: This setup is working fine most of the time except when a client sends a TCP SYN packet and then forgets about this connection. In this case a RealServer starts to send SYN/ACK packets until this connection on the server times out and it sends RST/ACK. The issue is that two last packets don't get translated because ipvs on the LoadBalancer already timed out this connection. 
Below is a tcpdump on LoadBalancer/eth0:

    213.248.224.116.43: S 1402601529:1402601529(0) win 512
    10:58:20.655335 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    10:58:24.031708 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    10:58:30.792336 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    10:58:44.303557 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    10:59:11.316010 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    11:00:05.330972 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    11:01:05.346329 IP 192.168.1.32.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312
    11:02:05.362233 IP 192.168.1.32.43 > 213.248.204.8.2113: R 1:1(0) ack 1 win 49312

In this example I simulated the situation by sending a SYN packet from my PC to the server and dropping all further packets. While the SYN/ACK packets were still being translated But once I see only this: Yes, I played with 'ipvsadm --set tcp tcpfin udp' and it doesn't have any effect on this issue. packets from RealServer belonging to this connection (from RealServer point of view) stop getting translated.

This is not a real problem but rather a nuisance for me. I just don't want packets with private IPs leaving the LoadBalancer. I can't block these packets with iptables since I believe ipvs does SNATing somewhere in the POSTROUTING chain and there is no way to put any other rules beyond this chain. I also can't modify the SYN_RECV timeout since there is no tcp_timeout_syn_recv entry in /proc/sys/net/ipv4/vs/ (this is a stock CentOS 4.3 kernel).

My question is: Is it possible to block untranslated packets from leaving the LoadBalancer without touching the RealServers and the Router? If it can help, here is additional info: later...
Graeme Fowler graeme (at) graemef (dot) net 25 Jun 2007 One of my "standard" (I use the term advisedly) configuration settings for LVS-NAT is to ensure that I have an SNAT rule for packets exiting the director towards clients. I make sure that if I have RS1 with VIP 1.2.3.4 and two realservers 192.0.0.1 and 192.0.0.2 that I have rules of the form: This means any packets escaping the LVS - for example where the LVS connection entry has timed out, but the realserver application session hasn't, will be SNATted to the right IP. It also means that any sessions originating from the realserver - CGI calls to other websites, PHP database connections to offboard servers, SSI includes, RSS inclusion, whatever - appear to come from the right source. It can help to track down abuse in the case of mass virtual hosting, and it prevents information leakage of the form you're seeing. Longer term, it looks like you need to make sure that the IP stack timeouts on the realservers match the LVS connection table timeouts on the director. Have a look at the "--set" option to ipvsadm, and check the corresponding sysctls in /proc/sys/net/ipv4/ - you may have to do a bit of deduction regarding backoff algorithms and retries to get the total time for (for example) a TCP three-way handshake timeout, like you're seeing. dmitri (at) nominet (dot) org (dot) uk 25 Jun 2007 If I remember correctly, the POSTROUTING solution didn't work for me as it seemed LVS stuff in kernel just ignored any postrouting tables for any ip packets under LVS control. Neither ipvsadm --set had any effect on this issue. Sorry for a short explanation but this is what I remember of top of my head and since in all other respects LVS is just working fine for us and so I have not looked often into it lately.
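
Graeme's SNAT rules were lost from this copy. As a guess at what "rules of the form" could look like (not his actual rules), here is a sketch using his example addresses. The external interface name eth0 and the function name are assumptions. The rules are printed rather than installed; note Dmitri's report above that POSTROUTING rules did not catch these packets on his director, so treat this as something to test, not a known fix.

```shell
#!/bin/sh
VIP="1.2.3.4"                  # VIP from Graeme's example
RIPS="192.0.0.1 192.0.0.2"     # realservers from Graeme's example
EXT_IF="eth0"                  # assumed: interface facing the clients

# Print one SNAT rule per realserver: anything leaving towards the
# clients with a realserver source address is rewritten to the VIP.
gen_snat_rules() {
    for rip in $RIPS; do
        echo "iptables -t nat -A POSTROUTING -o $EXT_IF -s $rip -j SNAT --to-source $VIP"
    done
}

gen_snat_rules
```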

LVS: The ARP Problem

The problem If you follow the instructions and setup the examples in the LVS-mini-HOWTO, then you don't need to know about the arp problem. If you're going to setup grander LVS's, then you'll need to understand the arp problem. Although this section comes early in the HOWTO, it has lots of pitfalls. You shouldn't be reading this unless you've at least setup a working LVS-NAT and LVS-DR LVS using the canned instructions in the LVS-mini-HOWTO. The LVS allows several machines to function as one machine. For LVS-DR and LVS-Tun, some trickery was needed to split the various handshakes involved in establishing and maintaining a tcpip connection, so that some parts of the handshake come from one machine and other parts from another machine. The worst problem, which ironically only happens with realservers running Linux (2.2 and later kernels), is the "arp problem". It's just as well we have the source code :-(. With LVS-DR and LVS-Tun, all the machines (director, realservers) in the LVS have an extra IP, the VIP. Here's a LVS-DR in a test setup where all machines and IPs are on the same physical network (i.e. are using the same link layer and can hear each other's broadcasts). When the client requests a connection to the VIP, it must connect to the VIP on the director and not to the VIP on the realservers. The director acts as a layer-4 IP router, accepting packets destined for the VIP and then sending them on to a realserver (where the real work is done and a reply is generated). For the LVS to function, when the client (or if present, the router) puts out the arp request "who has VIP, tell client", the client/router must receive the MAC address of the director (and not the MAC address of one of the realservers). After receiving the arp reply, the client will send the connect request to the director. (The director will update its ipvsadm tables to keep track of the connections that it's in charge of and then forward the connect request packet to the chosen realserver). 
If instead, the client gets the MAC address of one of the realservers, then the packets will be sent directly to that realserver, bypassing the LVS action of the director. If nothing is done to direct arp requests, by the router for the VIP, to the director, then in some setups, one particular realserver's MAC address will be in the client/router's arp table for the VIP and the client will only see one realserver. If the client's packets are consistently sent to the same realserver, then the client will have a normal session connected to that realserver. You can't count on this happening: in the middle of a tcpip session, the client/router might get the MAC address of another realserver as a result of an arp request, and that realserver will start getting packets for connections it knows nothing about (and will send tcp resets). (In my setup, the machine with the fastest CPU is in the client's arp table, suggesting that it's the first machine to reply that gets in. Horms and Steven Williams have written that they think it's the last machine to reply whose entry is in the client's arp table.) In other setups where the realservers are identical, the client will connect to different realservers each time the arp cache times out (see comment by Steven Williams elsewhere). If the director always gets its MAC address in the router arp table, then the LVS will work without any changes to the realservers (as happened in my case with a director with the fastest CPU in the LVS), although this is not a reliable solution for production. Getting the MAC address of the VIP on the director (instead of the MAC address of the VIP on the realservers) to the client when the client/router does an arp request is the key to solving the "arp problem". The traditional ways of handling the arp problem (as explained here) all require fiddling with the settings of the VIP on the realservers.
The assumption in the early days of LVS was that you wouldn't have access to the router (this being under the control of the IT department or your ISP, and you would have to go through a lot of bureaucracy to change the settings on the router). However if you're paying good money to an ISP to house your LVS, or your inhouse LVS is doing something useful for your establishment, then you should have no trouble in having the router set up the way you want. If you have access to the router (or can put one in front of your LVS - a low power linux box is just fine) and you can set it to route packets for the VIP only to the director(s) and not to the realservers, or you can use the arp filtering tools of iptables, and you understand what's been said above, then you've handled the arp problem and need read no further. For those who don't have access to the router, or who want to set up an LVS on one network, then read on...

The arp problem is handled in Linux 2.0.x kernels, as dummy0, tunl0, lo:0, are available on the realserver which don't reply to arp requests. For other OS's, the NOARP flag for ifconfig stops the VIP on the realservers from replying to arp requests. However with 2.2.x (and later) kernels, the devices which didn't reply to arp requests in 2.0.x, now reply to arp requests.

There is a "-arp" (NOARP) option for ifconfig which (according to the man pages) turns off replies to arp requests for that device, and an "arp" option which turns them back on again. Linux does not always honour this flag. You couldn't turn on replies to arp requests for the dummy0 devices in 2.0.36 kernels and you can't turn it off for tunl0 in 2.2.x kernels. eth0 behaves properly in 2.0.36 but in 2.2.x kernels it arps even when you tell it not to arp. This behaviour of not honouring the NOARP flag in the Linux 2.2.x kernels is not regarded as a "problem" by those writing the Linux TCPIP code and is not going to be "fixed".

Julian 22 May 2001

The flag is used to allow arp requests for the specified device. Although "lo" doesn't reply to arp requests, the requests for the VIP go through eth*, and so the NOARP flag is of no help to us. We can't drop the flag for eth.

Another wrinkle is that in 2.0.36 kernels, aliased devices (eg eth0:1) could be set up independently of the options on the primary (eth0) device. Thus eth0:1 behaved as if it were on a separate NIC and its arp'ing behaviour could be set independently of the primary interface. The settings of an aliased device belonged to the IP. With the 2.2.x kernels, the aliased devices are now just alternate names for each other: if you change an option (eg -arp) or up/down one alias (or the primary), the other aliases follow. With 2.2.x kernels, the settings of the aliased device belong to the primary device (there is only one device with several IPs).

When LVS was running on 2.0.36 machines, the VIP was usually configured as an alias (eg lo:0, tunl0) on the main ethernet device (eth0), allowing the nodes in an LVS to have only one NIC. With 2.2.x kernels, care is needed when only one NIC is used on the realserver (the usual case). On a realserver with eth0 carrying the RIP, and the realserver having only one NIC, eth0 must reply to arp requests (to receive packets), so eth0:1 carrying the VIP will reply to arp requests too, even if you ifconfig it with -arp. Thus if a realserver is running a 2.2.x kernel and has the VIP on an ip_alias, then the VIP on the realserver will reply to arp requests received from the router.

The use of ip_aliases is still allowed, but requires a "label" to be recognised by the new tools (iproute2 and ip_tables). The "label"ed IPs are now secondary IPs. For 2.2.x kernels and beyond the commands ifconfig and route should only be used with single NIC leaf nodes. You can still use them to set up a simple LVS, but for anything more complicated you will need to start using the iproute2 commands.

Put the VIP on the realserver's lo device

In the early days (2.0.x, 2.2.x) I seemed to be able to put the VIP on any device I liked. I don't know whether this is still possible with the newer kernels, but people have not been able to get their LVSes to work with arp_filter and arp_ignore unless they put the VIP on the realserver's lo device.

The Cure(s)

Several cures have been produced in an attempt to solve the arp problem. They involve either

- stopping the realservers from replying to arp requests for the VIP.
- hiding the VIP on the realservers so that they don't see the arp requests.
- priming the client/router in front of the director with the correct MAC address for the VIP.
- allowing the realserver to accept a packet with dst=VIP even though the realserver does not have a device with this IP (i.e. the host has nothing to reply to an arp request). This is implemented by transparent proxy or fwmark. For transparent proxy on the realserver, the director forwards the packets to the realserver's MAC address, so you don't need to route the packets yourself. For fwmark, you need to understand . There may be performance problems with transparent proxy at high packet rates. No one has tested transparent proxy at high throughput.
- stopping arp requests for the VIP getting to the realservers.

Note: For the 2.2 and 2.4 kernels, most of these cures involve applying a kernel patch to the realserver. The realserver patch is unrelated to the ipvs patch applied to the director.

horms 4 Aug 2005

For the record, the ARP problem is not about honoring the -arp flag or not. The problem lies in whether or not the OS regards the IP address as belonging to the interface, or as belonging to the host. Both are valid. Linux adopts the latter, which turns out to work really well in most situations. LVS is a rare case where it doesn't. This has been painful in the past, but since arp_ignore and arp_announce were added, it's quite easy now.

The following list of cures is a little confusing. If you're not using routing to stop packets for the VIP arriving at the realservers, then you'll have to stop the realservers replying to arp requests for the VIP. In this case you'll do one of the following on the realservers (Mar 2005, with help from Horms)

- the original method: use Julian's hidden patch on the realserver. You set the VIP on lo and then "hide" it. This method has been around since the arp problem first arose and has been well tested. For more on the hidden patch see Julian's page. This code is still being maintained, so if your setup scripts are for the hidden patch, you can continue to use it. Otherwise for new installations, you should use arp_announce.
- the next method: Maurizio's noarp module. This has the advantage that it does not require any patching of the realserver's kernel, is simple to set up and is the preferred method for many people. It has another advantage that you control the arp behaviour for the IP and not for the device.
- the new way, arp_announce: see arp_announce (http://www.ssi.bg/~ja/#arp_announce), which sets arp_ignore and arp_announce on the arping interfaces. This typically means eth0, but if you have eth1 as well, you need to set it there too. (If you have multiple NICs, eth0..ethn, you only need to fix the NIC that hears the arp requests.) Setting these parameters on lo has no effect as far as I understand from testing, reading the code and reading correspondence from Julian, i.e. you aren't interested in these settings. Make sure you don't bring up the ethernet device (say at bootup) before arp_ignore/arp_announce have been set up, or you will get a round of arp broadcasts from the NIC.

Horms

If the VIP is on eth0, and you don't want it advertised over ARP on eth1, then set:

This is different to the hidden approach where you put the VIP on lo and then hide lo. The arp_ignore approach has theoretical and aesthetic advantages.
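
The settings from Horms' posting are missing from this copy. The sketch below does not reconstruct them; it shows the combination commonly quoted for an LVS-DR realserver with the VIP on lo (arp_ignore=1, arp_announce=2), which is an assumption on my part, as are the interface list and the function name. It only prints sysctl.conf style lines.

```shell
#!/bin/sh
# Assumed: "all" plus the one NIC that hears the arp requests.
ARP_IFACES="all eth0"

# Print sysctl settings so that the listed interfaces answer arp
# requests only for addresses configured on the receiving interface
# (arp_ignore=1) and never announce the VIP as an arp source address
# (arp_announce=2).
gen_arp_sysctls() {
    for ifc in $ARP_IFACES; do
        echo "net.ipv4.conf.$ifc.arp_ignore = 1"
        echo "net.ipv4.conf.$ifc.arp_announce = 2"
    done
}

gen_arp_sysctls
```

The output could be appended to /etc/sysctl.conf and loaded with sysctl -p before the ethernet device or the VIP is brought up, in keeping with the ordering warning above.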

The Cure: 2.0 kernels - nothing needed There is no arp problem with 2.0 kernels on the realservers. On the realservers, configure the VIP on the lo device with the -noarp option as you would with any other Unix.

The Cure: 2.2.x kernels - many options The preferred method is "hidden"

The hidden patches

The "hidden" patches for kernel >=2.2.14 are now in the standard linux distribution (i.e. you can use the "hidden" feature with a standard kernel and don't have to patch the kernel on the realserver). The arp patches allow you to hide a device from arp requests, allowing the realserver to function in an LVS. The hidden patch hides the device (here lo) and any IPs that are on it. The -noarp flag in 2.0 kernels affects only the ip_alias (and not other IPs on the same device). These are different methods, but both stop the router/client from getting arp replies from the realserver for the VIP.

To hide devices from arp calls, on the realservers do

    # activate the hidden feature
    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    # to make lo:0 not arp, put lo here
    echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

then test that you've hidden the VIP (see Testing an interface for replies to arp requests).

There is a possible race condition in hiding the VIP - Kyle Sparger, 15 Feb 2001

I've found an interesting, but not totally unexpected race condition under DR in 2.2.x that I've managed to create when installing VIPs on a machine in DR mode. Basically, the cause is this:

    ifconfig dummy0 10.0.1.15
    echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden

You'll notice that there's going to be a small gap between the two which allows an ARP request to come in, and for the server to reply. And yes, it is big enough to be bitten by -- I've been bitten twice by it so far :)

Julian

On boot:

    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    # For each hidden interface (dummy, lo, tunl):
    modprobe dummy0
    ifconfig dummy0 0.0.0.0 up
    echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
    # Now set any other IP address

Kyle's suggestion

    echo 1 > /proc/sys/net/ipv4/conf/default/hidden
    ifconfig dummy0 10.0.1.15
    echo 0 > /proc/sys/net/ipv4/conf/default/hidden

The echo 0 command is in case I want to configure other interfaces later that I _do_ want responding to ARP requests. Technically, it's not necessary, I just find it useful in my particular setup.
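The ordering that closes the race can be sketched as a small shell function. This is only an illustration of the sequence in Julian's and Kyle's postings; the names run, bring_up_hidden_vip and DRY_RUN are made up for this sketch and are not part of LVS or the hidden patch. By default it only prints the commands it would run.

```shell
#!/bin/sh
# Sketch only: hide first, configure the address second.
# DRY_RUN, run and bring_up_hidden_vip are illustrative names.
DRY_RUN="${DRY_RUN:-1}"

run() {
    # With DRY_RUN=1 (the default here) print the command instead of running it.
    if [ "$DRY_RUN" = "1" ]; then
        printf '%s\n' "$*"
    else
        sh -c "$*"
    fi
}

bring_up_hidden_vip() {
    dev="$1"
    vip="$2"
    # Activate the hidden feature (first line of Julian's boot sequence).
    run "echo 1 > /proc/sys/net/ipv4/conf/all/hidden"
    # New interfaces now start out hidden, so there is no unhidden window.
    run "echo 1 > /proc/sys/net/ipv4/conf/default/hidden"
    # Only now give the device its address.
    run "ifconfig $dev $vip"
    # Restore the default so interfaces configured later answer ARP normally.
    run "echo 0 > /proc/sys/net/ipv4/conf/default/hidden"
}
```

Calling bring_up_hidden_vip dummy0 10.0.1.15 prints the four commands in order; setting DRY_RUN=0 (as root, on a realserver whose kernel has the hidden flag) would execute them.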

The Cure: Older 2.2 kernels (<2.2.12)

These are old and it would be better to upgrade (you won't get much help on the mailing list for these). However if you have them, you apply the arp patches to the kernel code of the 2.2.x realservers. These patches are separate from the ipvs patch applied to the kernel on the director. For kernels <2.2.12, Julian's patch is on the lvs website: http://www.linuxvirtualserver.org/arp_invisible-2213-2.diff

The patch by Stephen Williams is at http://www.linuxvirtualserver.org/sdw_fullarpfix.patch This patch is against a 2.2.5 kernel but can be applied to later kernels (tested to 2.2.13). The file appears to have DOS carriage control. Depending on what you get on your disk, you may have to convert the file to unix carriage control (with `tr -d '\015'`) (the unix line extension of '\' doesn't work in combination with DOS carriage control). The whitespace may not match your file so do If you are using you will need the forward_shared-hidden patch as well (needed only on the director, but can be applied to both director and realservers).

Put an extra NIC (eth1) on the realserver to carry the VIP Possible cards would be a discarded ISA card (WD80x3), or a cheap 100Mbit PCI card (eg Netgear FA310TX, $16 in USA in Nov 99) There is no traffic going through this NIC and it doesn't matter that it's an old slow card. The extra card is only required so that the realserver can have the VIP on the machine. With 2.2.x kernels you can't stop this device (eth1) from replying to arp requests, but if you don't connect the cable to it or don't put a route to it in the realserver's routing table, then the client won't be able to send it an arp request. To set this up with the configure script, enter eth1 as the device for the VIP on the realserver. Apparently, the 2nd NIC doesn't handle arp problem for 2.4 kernels. I tested the 2 NIC method of handling the arp problem with kernel 2.2.13. I haven't tried it with 2.4 kernels, but apparently it doesn't work. Julian and Ratz think it shouldn't work with 2.2.x kernels, but I haven't revisited the matter to see why we have come to different conclusions.

The Cure: 2.4.x kernels - arp_ignore/arp_announce The current (kernels starting 2.4.26) accepted method is arp_ignore/arp_announce. There are several ways of handling the arp problem for 2.4.x kernels. They all work, but some of them have been around longer and so have been used more and people on the mailing list are more familiar with them.

2.4 Hidden Patch

Julian's hidden patch has been around the longest and is well tested. Although included in the standard 2.2.x kernel, it is not being included in the 2.4.x kernels. You'll have to patch the kernel on the realservers. The preferred method for new installations is arp_ignore/arp_announce. For early 2.4.x kernels (e.g. x=0), the patch is available at http://www.linuxvirtualserver.org/hidden-2.3.41-1.diff. (This patches a part of the kernel that isn't being actively fiddled with, so hopefully the patch will work against later 2.4.x kernels too.) The 2.4.x "hidden" patch is included in ipvs-x.x.x/contrib/patches/hidden-x.x.x.diff Assuming you are patching 2.4.2 with the ipvs-0.2.5 files Then build the kernel (you can use the same options as for the 2.4 director kernel build). You activate the hidden feature as for 2.2 (see hidden). As to why the hidden patch is in the 2.2 kernels but not the 2.4 kernels see the mailing list archives or for the thread

2.4 arp_announce The 2.6.x arp_announce, arp_ignore code has been back ported to 2.4.26 (and later) kernels.

arp filtering

Julian has written an extension to the iproute2 tools, which filters arp packets. You can use this to handle the arp problem. See Julian's software page for more details. This method does not require patching of the 2.4 kernel on the realserver. Julian's arp filtering is not arptables.

Joe

Is arptables the extension to iptables that you wrote a while ago? arptables seems pretty simple. What are the problems with arptables that you've written arp_ignore and keep maintaining the hidden patch?

Julian 11 Jul 2004 Almost true, I'm not the arptables author. You're referring to the arprules/iparp functionality which is based on ip, not on iptables. Similar names. At that time there was no user space tool for the arptables changes in kernel (done by David Miller), now there is such tool (I didn't tried it), so the list of options to hide addresses in clusters is extended. arp_ignore was born day(s) after arp_announce. Both flags are easy to set default policy for playing with ARP requests and replies which was needed for years for stuff like interoperability with other ARP stacks (mostly for controlling the source address selection in ARP requests with arp_announce) or for hiding of addresses for cluster setups.

Maurizio Sartori's noarp module

Maurizio Sartori masar (at) masarlabs (dot) com 28 Nov 2002

On my site is a simple kernel module for Linux 2.4.x to solve the ARP Problem. You don't have to patch the kernel, but only compile, install and configure the 'noarp' module, which filters the arp replies for addresses on your loopback interface. I've tested it on Debian 'Sarge' and RedHat 8.

Maurizio later produced a patch for 2.6.

Sebastien Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 04 Jun 2003

Nobody seems to recall what a smart Italian guy named Maurizio Sartori did. Instead of the hidden patch, which requires a full kernel build, he's written a *module* called noarp, which is way more handy, as it
(1) requires only a one module build,
(2) doesn't require a kernel build,
(3) takes about 1 minute to install and get working,
(4) allows hiding IPs, not interfaces.
I'm using it in production and it works perfectly.

Joe

Can you hide the VIP on eth0:x and not hide the RIP on eth0? (I should know this, but I don't)

Jan Abraham jan_abraham (at) gmx (dot) net 31 Oct 2003

Yes, you can :) I used Maurizio Sartori's noarp module, suggested in your HOWTO in chapter 4.5.3. It can be controlled by IP, not by interface.

Ratz 17 Dec 2004

Julian's arp_ignore is the way to go, portable and ready for upgrades ;). Nothing against Maurizio by any means, but after years of fighting with the netdev people, Julian finally convinced the high priest of Linux networking to solve the arp problem once and for eternity. If you read the accompanying documentation on the arp_* sysctls you can pretty much figure out that nothing is impossible anymore ;).

Joe - I would have been happy if they'd left the arp behaviour as it was originally and as it is in all the other unices (except HPUX).

Todd Lyons tlyons (at) ivenue (dot) com 03 Feb 2005

Hi guys (and masar), I am trying to use your noarp module, but am hitting the limit of 16 entries. I need it to work for (at the moment) an additional 10 entries. I see in noarp.h: Is it going to create problems to pick this number up to 32 or 64? I've already done it and it seems to handle the problem. I don't want to create any memory leaks or overruns. Your code looks like it allocates memory based on that NOARP_MAX_IP, but my c is not good enough to know for sure if that will be a problem. Here's what happens on my system (RH 7.3 with 2.4.20-28.7smp kernel). You can see that it's failing on the 10 additional IP's after the initial 16. Please let me know if I can safely raise that number. I have a question about the man page I must be misunderstanding something very basic. I thought you didn't want real servers to arp at all for VIPs, no matter what interface the arp comes in on and no matter what interface is defined with the matching address. The only acceptable arp answer is for the RIP (implying local traffic or traffic that is not desired to be load balanced). But the above man page contradicts my ideas. So I'm a bit confused as to how exactly noarp is working.

Maurizio Sartori masar (at) MasarLabs (dot) com 04 Feb 2005

There should be no problem in incrementing NOARP_MAX_IP; all memory is allocated statically. The RIP in the 'add' command of noarpctl is used to suppress the selection of the VIP as the sender IP address in arp requests. If it is not suppressed, a request from the back-end host updates all arp cache entries on the local net for the VIP with the MAC of the back-end host. A way to generate a request of this type is, from a real server:

    nc -s $VIP somehost 80

extra NIC doesn't solve arp problem for 2.4 kernel realservers Jean Paul Piccato j (dot) piccato (at) studenti (dot) to (dot) it

I'm setting up a DR_LVS with a director and two servers... I've to handle the ARP problem so I've put two NIC on the two realservers...

Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002

This works maybe only with Linux 2.0. (Joe: see 2.2 kernels with extra NIC). For 2.2+ you need a specific kind of ARP control. In Linux 2.2+ the operation of adding an IP address involves the following 2 steps:
(1) Define a local IP address as a host property - remote hosts can talk to it through any device
(2) Define a network link route on the specified device - you can talk with other hosts from this local network only through this device

(1) allows the Linux 2.2+ box to send ARP replies through any device that received the request. Additionally, the user can provide some filtering by setting some device specific values: These are explained in /usr/src/linux/Documentation/networking/ip-sysctl.txt

The LVS setups depend mostly on the flags rp_filter, hidden, arp_filter, send_redirects. (for more info on kernel flags see the section on ). On problems, check them after learning what they mean and how they can kill your setup. By setting rp_filter or arp_filter on some device you can ignore the ARP requests (and the traffic if rp_filter is set) coming from addresses if we don't have a route to these addresses through the above-mentioned device. The send_redirects values must be checked for setups playing with NAT on one physical medium. Information on using the hidden patch is in hidden.txt
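Checking the four flags Julian names on every device can be done with a short loop over the conf directory. The sketch below is one possible way to do it; the function name show_arp_flags and its optional conf_root argument are invented here for illustration. Flags that a kernel does not provide (e.g. hidden on an unpatched 2.4 kernel) are silently skipped.

```shell
#!/bin/sh
# Sketch: print rp_filter, hidden, arp_filter and send_redirects per device.
# show_arp_flags and conf_root are illustrative names; conf_root defaults
# to the real /proc/sys/net/ipv4/conf.
show_arp_flags() {
    conf_root="${1:-/proc/sys/net/ipv4/conf}"
    for dir in "$conf_root"/*/; do
        # Skip the unexpanded pattern when conf_root has no subdirectories.
        [ -d "$dir" ] || continue
        dev="$(basename "$dir")"
        for flag in rp_filter hidden arp_filter send_redirects; do
            # A flag file is absent when the kernel lacks that feature.
            if [ -r "$dir$flag" ]; then
                printf '%s %s=%s\n' "$dev" "$flag" "$(cat "$dir$flag")"
            fi
        done
    done
}
```

Running show_arp_flags with no argument on a realserver lists one line per device and flag, which makes it easy to spot a device where arp_filter or rp_filter differs from the others.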

It seems that eth0 reply to the server instead of eth1

Any device can reply if the ARP probe is not filtered. See hidden.txt from the above URL Michael McConnell michaelm (at) eyeball (dot) com 10 Jun 2002

I currently have a system which has a Tyan 2515 Motherboard. This motherboard features a Dual Intel 82559 NIC. The problem I am face is that which using both ports of this dual interface network card (plugged into the same switch) I find that the second interface is answering arp requests (on rare occasions) that the first interface should be answering. I have used tcpdump and clearly seen eth1 answering arps requests that eth0 should be answering... how odd.... It's rare, but when it happens of course that address is offline. (Note: this only seems to happen on alias IP address, it has never happened on the primary interface) I am using the open source drivers provided with the 2.2.19 kernel, I'm wondering if the drivers provided by Intel would help this problem?

Roberto Nibali ratz (at) tac (dot) ch 11 Jun 2002 The drivers indeed can't make the difference but not because they are the same, but because the driver doesn't have anything to do with the arp/routing issue. Julian Classic problem of attaching multiple Linux interfaces to shared medium. You can set arp_filter on all your ARP devices or why not to restrict even the IP protocol by setting rp_filter. Such answering (of arp requests) can not be never a problem. If the Linux box answers via many interfaces then it is willing to accept traffic through these ifaces. Of course, the achieved failover when attaching two interfaces to same hub is not perfect because the remote LAN boxes will use the alive Linux interface but Linux routing still uses the first interface (even if it is failed on Layer 2) for the used subnet. If your goal is to restrict the talks for each subnet through one interface then you have to use arp_filter=1 but still to use rp_filter=0 to allow cross-subnet talks. One day rp_filter will be aware of the medium_id values for each interface and will allow the Linux box to interconnect multiple hubs securely (and still to use many interfaces to these hubs). By default Linux replies to ARP probes for any local IP address configured on any device no matter on what device the probe is received. Such probes look like "who-has TARGET tell SENDER". If the probe is answered later we can receive IP traffic from SENDER to TARGET destined to the TARGET's MAC address. When we have different subnets (network routes) configured on multiple interfaces attached to same hub sometimes we prefer (may be the reader can find good reason for this) the traffic to/from one subnet always to use one interface. In such cases replying through many interfaces is not desired. We have 2 options: arp_filter:

when then the flag will cause any probe received on interface DEV to be dropped if the route from TARGET to SENDER points to different interface. With the usual local networks in table main in the form "from 0/0 to local_net lookup main" we see that the TARGET is ignored. As result, we drop probes received from SENDER that comes from wrong interface. As result, if the route from TARGET to SENDER1 is via DEV1 and from TARGET to SENDER2 is via DEV2, then we will reply only through one device for each of the senders. Of course, the arp_filter relies on the routing and as result the bahavior depends on the used ip rules and routes. The above is a simple example for normal local networks. The arp_filter simply checks the route for the reversed addresses. It should point to the input device.

rp_filter:

The rp_filter flag (DEV/rp_filter or all/rp_filter) set to 1 has similar semantic. It has nearly the same function as arp_filter and can control the ARP for the same purposes: symmetric talks (in and out using same device) but it covers the IP traffic too. It is assumed that where ARP is received (replied more exactly) there the IP traffic will be accepted too. It has mostly security function and can defend against IP spoofing. It controls the reverse path protection: we accept traffic from SENDER to TARGET received on DEV only when the reverse path (from TARGET to SENDER) points to the input interface DEV. It is used usually for "external" interfaces.

How you can use it:

    echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_filter
    echo 1 > /proc/sys/net/ipv4/conf/eth1/arp_filter

Put the realservers on a different network to the VIP Setup routing tables so that the client cannot route to the realserver network (Lars' method). This method requires the director to not forward packets for the VIP (easy to implement if 2 NICS on the director). The reply packets from the realservers return to the client via a different router to the one attached to the director. Thus the director's router cannot send arp requests to the realservers.

On the client(router), route packets with dst_addr=VIP to the director You can hardwire the MAC address of the director as the MAC address of the VIP. You can do this with Here is my /etc/ethers file (on the client) This requires no extra NICs or patching of realservers. However in a production environment, redundant directors with heartbeat/failover may be required and some method (eg running send-arp) will be needed to change the static arp entry as the failover occurs. If multiple NICs are involved, it is possible that the above instruction will result in a route through the wrong NIC. In this case bring up the NIC of interest first and then run the above command. Alternately if the router has several NICs, use one for the director and another for the realservers. Route the VIP to the director.

Use transparent proxy allow the incoming packet to be accepted locally - Horms method. see the sections on , and its setup for LVS-DR and LVS-Tun. The configure script will set this up for you.

The Cure: 2.6.x kernels - arp_ignore/arp_announce

2.6 arp_announce Julian Anastasov ja (at) ssi (dot) bg 25 Feb 2004

2.4.26 and 2.6.4 come with 2 new device flags for tuning the ARP stack: arp_announce and arp_ignore. All IPVS-like setups can use arp_announce=2 and arp_ignore=1/2/3 to solve the "ARP problem" on realservers with DR/TUN setups. These flags are going to replace the "hidden" functionality which does not work well when directors are changing role between master/slave for a particular VIP. The risk is that other hosts can probe for VIP using unicast packets for which the hidden flag always replies. I'll continue to support the hidden flag for 2.4 and 2.6 to help existing setups but switching to the new device flags (or other solutions) is recommended.

Documentation is in the 2.6 kernel docs (linux/Documentation/networking/ip-sysctl.txt) (here from the 2.6.17 kernel). On the realservers the VIP will still be on lo (as for the hidden method). If the reply packets to the client are routed through eth0, then the arp announcements/requests are made through eth0 and you will apply the arp_ignore/arp_announce sysctls to eth0, not to lo (you cannot use arp_ignore/arp_announce on lo). As with all devices that reply to arp requests, you should stop the arp behaviour before bringing up the VIP, or else flush the arp tables on the router before using the LVS. Mag2589 Walla Feb 21, 2007

On my realservers I have them set up to listen to the virtual address on eth0:0 I need them to respond to arp on eth0 but I need them to ignore it on eth0:0 To do this would I enter the following line in my /etc/sysctl.conf file?

Horms

In a word: No. arp_ignore works only on physical interfaces. The old eth0:0 notation is a hang-over from the days of ip aliases, where in a round-about way you could establish virtual interfaces (sort of). These days an interface can have 0 or more addresses. The arp_ignore semantics apply to such addresses. If you really need fine-grained arp control, take a look at arptables, which is kind of like iptables for arp.
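As a rough illustration of the settings discussed in this section, the following sketch writes arp_ignore and arp_announce for one physical device. It assumes the VIP lives on lo and that eth0 is the NIC that hears the arp requests; the function name set_realserver_arp_flags and the optional conf_root argument are hypothetical. The value arp_ignore=1 is just one of the 1/2/3 choices Julian mentions.

```shell
#!/bin/sh
# Sketch: set the two ARP flags on the physical device, not on lo.
# set_realserver_arp_flags and conf_root are illustrative names.
set_realserver_arp_flags() {
    dev="$1"
    conf_root="${2:-/proc/sys/net/ipv4/conf}"
    if [ ! -d "$conf_root/$dev" ]; then
        echo "no such device under $conf_root: $dev" >&2
        return 1
    fi
    # Reply only when the target IP is configured on the receiving device,
    # so eth0 stays silent for a VIP that is configured on lo.
    echo 1 > "$conf_root/$dev/arp_ignore"
    # Always use the best local address for the target as the ARP source,
    # so the VIP is not announced as sender in outgoing arp requests.
    echo 2 > "$conf_root/$dev/arp_announce"
}
```

On a realserver this would be called as set_realserver_arp_flags eth0 (as root) before the VIP is added to lo.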

noarp v2.6 Masar masar (at) MasarLabs (dot) com 04 Mar 2004 noarp 2.0.0 (http://www.masarlabs.com) is now available. This is the port of noarp to the Linux 2.6.x kernel. For the 2.4.x kernel use noarp 1.x.x. I'm making separate packages of noarp for the two kernels, because the method for producing a module is different. If there is sufficient interest, I may produce a single package for both kernel versions.

arp ignore with Ubuntu Julien Cornuwel cornuwel (at) gmail (dot) com 29 Sep 2008

I'm trying to set up load balancing with IPVS on two Apache webservers. The loadbalancer and both apache servers are virtual machines running Ubuntu 8.04 server on VMware ESX 3.5. If the platform has been idle for some time (like when I came back to work this morning), I can point a browser to http://VIP and get pages from server1 or server2 alternately (I'm using rr during setup). But after about 5 seconds, I get nothing and the browser times out.

Here is my configuration on the load balancer : On webservers, I added the following to /etc/sysctl.conf (as suggested on http://www.linuxvirtualserver.org/VS-DRouting.html) : I rebooted them and then : Unless I did something stupid (if so, please tell me), it should work.

    echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
    echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
    echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

I'm quite sure my problem is not on the loadbalancer but on the webservers. I have one interface and tried the following with no more luck : On the first request, which works, I have an ESTABLISHED line. But after a few seconds, I get dozens of SYN_RECV. OK, so I definitely have an ARP problem with the above configuration. Any idea why the above commands don't work on Ubuntu?

I did a TCP dump on all 3 servers and here is what I see : On the webservers, when it works, I see outgoing IP packets with the LB's address as origin. When it doesn't, I just see nothing. About once per second, the LB sends ARP requests trying to find both webservers (on their real addresses); I never saw an ARP reply. On the load balancer, I see incoming requests from clients, and some "ICMP host lb not reachable" sent to the client when it doesn't work. The webservers should reply to ARP requests on their primary addresses, but they don't :(

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 07 Oct 2008 Well I think it's either that you applied "hidden" to eth0 on the webservers, or the LB has the VIP as a primary address. See if the ARPs were going out with VIP as source and if that's the case, try giving the LB a different primary address and make VIP an alias. Julien

Great ! That was it. Now that I have the VIP as an alias on LB, it works. Note to the documentation team : on Ubuntu 8.04, there is a trap with real servers. If you set arp_ignore/arp_announce configuration in /etc/sysctl.conf AND set the VIP on lo:0 in /etc/interfaces. It seems that the interface is brought up *before* the sysctl commands are passed. You have to set the VIP manually at the end of the boot process.
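One way to guard against this ordering trap is to refuse to add the VIP until the flags are in place. The sketch below is an assumption-laden illustration: the helper names flag_value, max_flag and arp_flags_ready are made up, and the check relies on the rule in the kernel's ip-sysctl.txt that the larger of the conf/all and conf/DEV values is the one used.

```shell
#!/bin/sh
# Sketch: succeed only if arp_ignore >= 1 and arp_announce == 2 are in
# effect for a device. All function names here are illustrative.
flag_value() {
    # Print the flag's value, or 0 if the file cannot be read.
    cat "$1" 2>/dev/null || echo 0
}

max_flag() {
    # Effective value: the larger of conf/all/<flag> and conf/<dev>/<flag>.
    root="$1"; dev="$2"; flag="$3"
    a="$(flag_value "$root/all/$flag")"
    d="$(flag_value "$root/$dev/$flag")"
    if [ "$a" -gt "$d" ]; then echo "$a"; else echo "$d"; fi
}

arp_flags_ready() {
    dev="$1"
    conf_root="${2:-/proc/sys/net/ipv4/conf}"
    ignore="$(max_flag "$conf_root" "$dev" arp_ignore)"
    announce="$(max_flag "$conf_root" "$dev" arp_announce)"
    [ "$ignore" -ge 1 ] && [ "$announce" -eq 2 ]
}
```

A late boot script could then run something like: arp_flags_ready eth0 && ifconfig lo:0 $VIP netmask 255.255.255.255 up, so that the VIP never comes up before the sysctls have been applied.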

arptables Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 11 Jul 2004

arptables is a method supported by Red Hat. The package, arptables_jf, is part of Advanced Server, but the src.rpm can be downloaded, rebuilt and used on Workstation since it has the same kernel support. The configuration is pretty straightforward: This service will start before the network is brought up. Note that you have to specify an explicit runlevel, since it stupidly won't start in single user by default.

Bandit Lazuli banditlazuli (at) yahoo (dot) com 13 Apr 2006

Our cluster of web frontends periodically exhibited a kind of Fatal Attraction behavior, where one host would suddenly be the recipient of all hits. Attempting to add new hosts to the existing cluster triggered this behavior in a consistent way. With something clear to fix, we installed the latest version of keepalived on the latest RHEL4 kernel. And lo, nothing changed. Add a new host, it became a Fatal Attractor within 6 minutes of operation (note that this is NOT the ; things were relatively well balanced for a minute or 6). Stranger yet, ipvsadm on the director revealed that the Attractor was getting NO hits. So it wasn't that the LVS was sending all hits to one machine.

You guessed it. The new machine was arping for the shared ip, and connections were coming directly to it. We had arptables set up as follows: And in desperation, started arptables at runlevel 1. This didn't help, because it wasn't responding to an inbound arp request, but was instead generating its OWN arp request, and broadcasting the response it made to itself. This could be seen with: file And then pawing through the file for the shared ip (name).

So there lies the smoking gun. Arptables was NOT working as advertised. So we added: This still did not do the trick; apparently arptables implicitly operates on the interface owning the ip (lo:1, in our case), if no interface is specified. That left eth0 leaking arps. Specifying the interface did the trick: And here is the whole filter: arps are now properly squelched, and fatal attractor behavior has vanished. I'm posting this because I longed for google to return such a message in response to many searches.

The arp problem is on the realserver's VIP not the RIP Cali Federico

I've configured an http service on the director as below:

      -> RemoteAddress:Port      Forward Weight ActiveConn InActConn
    TCP  194.153.172.249:80 wlc
      -> 194.153.172.222:80      Route   1      0          0

and I've installed the noarp module on the realserver as below: The problem I can see is that when requesting http://VIP/index.html from my PC (outside the LVS network) I can see the page provided by the realserver 2 or 3 times, after which I receive a "Page Cannot Be Displayed". The page remains unreachable for several minutes, after which I have the same behaviour again. Looking at the director's arp table, the HWaddress related to the realserver is "(incomplete)". After I set the correct realserver MAC address (using arp -s), the LVS works properly.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/19/05 You are no arp'ing your RIP which is not a good idea. It's just the VIP that needs NOARP on the realserver.

Testing an interface for replies to arp requests

To test that the VIP on the realservers (here lo:0) is hidden from arp requests: You test from a client machine on the same network segment as the NICs on the realservers. For your sanity, you could try this with one realserver at a time. You do _NOT_ have the director (with its pingable VIP) connected to the network (unplug it).

(1) Optional: on each realserver, to accumulate a list of the MAC addresses for each NIC.
(2) Find the MAC address for the realserver's VIP from the test client.
(3) Hide the lo interface on the realserver (see The hidden patches).
(4) Before the arp tables expire (15secs - 2mins depending on the OS), ping the VIP again from the test client. The realserver will still reply to the ping, since the MAC address for the VIP will still be in the arp table of the test client.
(5) Let the arp cache expire (wait 15sec - 2mins) or clear the arp cache of the test client.
(6) Ping the VIP. You should get no reply.
(7) Do this for all realservers, making sure you get no ping replies for the VIP.
(8) On the director (don't connect it to the network yet) find the MAC address of the VIP
(9) Connect the director to the network. Just to be sure, clear the arp entry for the VIP on the test client (arp -d VIP) and ping the VIP again. You should get ping replies. The test client's arp cache should have the MAC address of the director's NIC for the VIP.
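The final check in the procedure above (the test client's arp cache should hold the director's MAC for the VIP) can be scripted. The sketch below assumes the column layout printed by the Linux net-tools `arp -n` command (address, hardware type, hardware address) and lowercase MAC addresses; the function names mac_for_ip and vip_owner_ok are invented for this example.

```shell
#!/bin/sh
# Sketch: inspect `arp -n` style output read from stdin.
# mac_for_ip and vip_owner_ok are illustrative names.
mac_for_ip() {
    # Print the MAC cached for IP $1. Entries shown as "(incomplete)" have
    # no "ether" column and therefore print nothing.
    awk -v ip="$1" '$1 == ip && $2 == "ether" { print $3 }'
}

vip_owner_ok() {
    # Succeed only if the cached MAC for the VIP is the director's MAC.
    vip="$1"
    director_mac="$2"
    seen="$(mac_for_ip "$vip")"
    [ -n "$seen" ] && [ "$seen" = "$director_mac" ]
}
```

On the test client, after pinging the VIP, something like arp -n | vip_owner_ok $VIP $DIRECTOR_MAC returns success only when the VIP resolved to the director and not to one of the realservers.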

Normal machines, Solaris, Novell Server

The arp problem only occurs on Linux with kernels 2.2 and later. All other OS's honor the arp flag (Joe: HPUX does something weird with noarp, but I've forgotten what).

Mark de Vries

I need to do some DR to a Solaris 8 box... Anyone know how to set it to ignore arp requests? So far I have only done DR to linux and windows boxen...

Lasse Karstensen lkarsten (at) hyse (dot) org 21 Apr 2006

At least for Solaris 9, you can just create the file /etc/hostname.lo0:14 (14=some number). Inside of it you write where 1.2.3.4 is your vip-address. I'm pretty sure this also works in Solaris 8. The Solaris people here also mention that you can just use addif in /etc/hostname.lo0, if you rather fancy having everything in one file.

Mark de Vries markdv (dot) lvsuser (at) asphyx (dot) net 21 Apr 2006

Ah yes, the -arp option... Yeah works! Not that it matters because half way through the exercise I suddenly realized that the service has always used NAT for a reason; the real servers have the service on different ports than on the VIP... so DR is a no-go... sigh... And the whole reason we wanted to use DR in this first, or actually second, case was because half way through the first attempt at configuring it as NAT I suddenly realized that wouldn't work because in this case the realserver has an interface in the same network as the VIP is in (so there is a direct route to the client and packets won't get de-natted)... So short of someone pointing me to a source-based-routing-HOWTO-on-Solaris it's just not gonna work out...

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 17 Dec 2008 We recently had a customer using that funny thing called Novell Server.... I couldn't find anything in the LVS manual about Novell Server in DR mode but eventually figured out the following which works great: noarp

problems with switches

There are other places in the network with arp caches, like "smart" switches. These will bite you if you don't know about them.

frederic (dot) buche (at) equant (dot) com 29 Oct 2003

OK Julian, you are right. The problem came from my network-switch, which keeps in memory the MAC address of all machines. So it just relays the arp request to the concerned server, by using a unicast arp request. Just for a test, I have deleted the MAC entry on my switch. Then reproduce the same test than before ... and the hidden patch works well!

Carlos J. Ramos cjramos (at) genasys (dot) com 15 Dec 2003

We are using an HP Procurve Switch 2124 in a cluster using Heartbeat and Ldirectord as HA and balancing mechanisms. Previously we had similar working setups with a hub in the same location. Everything works fine, till we make a takeover on the directors. As the switch documentation says, the switch automatically learns MAC addresses and associates them with its ports, so that although heartbeat changes the IP address, the switch tries to use the same switch port. The situation remains for at least 1 hour... for this time the forwarding in the cluster does not work... and the realservers are unable to be reached from outside... We are assuming this is an arp caching problem, although we haven't eliminated other possible causes yet. Is there any way to force the switch to refresh its MAC address table? Is there any Linux tool that sends any kind of packet over the net forcing the ARP table to be updated?
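One tool commonly used for this is arping from the iputils package, whose -U option sends unsolicited (gratuitous) ARP to update neighbours' ARP caches; heartbeat's send-arp serves a similar purpose. The sketch below only builds and prints such a command line. It assumes the iputils flavour of arping (other arping implementations use different options), the function name announce_vip_cmd is made up, and whether this cures the ProCurve case reported above is not established by the thread.

```shell
#!/bin/sh
# Sketch: print an iputils arping command that announces the VIP from the
# director that has just taken over. announce_vip_cmd is an illustrative name.
announce_vip_cmd() {
    dev="$1"          # interface now carrying the VIP, e.g. eth0
    vip="$2"          # the VIP being taken over
    count="${3:-3}"   # number of announcements (assumed default for this sketch)
    # -U: unsolicited ARP mode, -c: packet count, -I: outgoing interface.
    printf 'arping -U -c %s -I %s %s\n' "$count" "$dev" "$vip"
}
```

For example, announce_vip_cmd eth0 192.168.1.110 prints the command to run as root on the new active director right after it has brought up the VIP.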

The ARP problem, the first inklings

History: The ARP behaviour changed between 2.0.x and 2.2.x kernels. Here's the original posting by Wensong and a reply from Alexey Kuznetsov (2.2 tcpip author)

Wensong Zhang wensong (at) iinchina (dot) net 24 Mar 1999

Today I upgraded the kernel to 2.2.3 with tunneling support on one of the realservers, and found a problem: the Linux 2.2.3 tunnel device answers ARP requests. Even if I used the NOARP options as follows: It still answers the ARP requests. This will seriously stop the virtual server via tunneling from working properly. In fact, the tunnel device shouldn't answer the ARP requests from the ethernet. I think it is a bug of linux/net/ipv4/ipip.c, which is now a clone of ip_gre.c, not the original tunneling code. If you are interested, you can test this yourself on kernel 2.2.3: choose a free IP address of your ethernet and configure it on the tunl0 device, then telnet to that IP address from another host; I guess you can. Finally, have a look at ipip.c, maybe you can debug it. :-) -- But, what is the IFF_NOARP flag of the tunnel device for?

kuznet (at) ms2 (dot) inr (dot) ac (dot) ru

IFF_NOARP means that ARP is not used by THIS device. On normal IPIP tunnels it does not make much sense, but may be used for example to turn on/off endpoint reachability detection. I do not see any reasons to disable answering ARP in such circumstances. Isolation of VPNs on adjacent segments is impossible at routing/arp level, it is just not well-defined behaviour. If the isolation is made with firewall policy rules, then it is clear that arp policy must be handled at this level too.

In kernel 2.0.x, the tunnel device doesn't answer ARP requests.

Yes.

Yeah, we can have link-local addresses that don't answer ARP requests in kernel 2.2.x. For example, we can configure all the hosts in a network with the following command: There will be no collision. The loopback alias interfaces don't answer ARP requests.

Are you sure? I am not. Please, test. BTW you take a risk by adding non-loopback addresses to the loopback device. They have the HIGHEST preference to be used as the router identifier, so that VPN addresses cannot be added to loopback at all.

No, it doesn't fail. I tested it with kernel 2.0.36, it worked.

It does not work under 2.2. To be honest, I am about to stop to understand you. You talk about 2.2, but all your tests are made for 2.0. 8)

A posting to the mailing list by Peter Kese explaining the "arp problem" (saved for posterity by Ted Pavlic, minor editing by Joe) peter (dot) kese (at) ijs (dot) si Before we start, let's assume we have the following network configuration for an LVS running LVS-DR. The virtualserver is the combination of the director and the realserver running LVS. Our goal is: The virtual server should respond to arp requests for both the VIP and the director IP. The realserver should respond to arp requests for the realserver IP but NOT the VIP. The gateway sends packets for the VIP to the director (the load balancer) no matter what.

Problem 1: Interface aliases The realserver and director need to have an interface with the VIP in order to respond to packets for the virtual server. A real interface is not needed; an IP alias will do just fine and this interface alias could be either eth0:0 or lo:0. On the 2.0 kernels, the ARP responding ability of an interface alias (eg eth0:0) could be enabled or disabled independently of the main (eth0) interface. If you wanted eth0:0 not to respond to ARP requests, you could simply say: Thus in the 2.0 kernels it is possible, on a realserver, to have the realserver IP (on eth0) respond to arp requests and for the VIP (on eth0:0) to not respond. In the 2.2 kernels this doesn't work any more. Whether an interface alias responds to ARP requests or not depends only on the way the real interface is configured. So if eth0 responds to ARP requests (which it normally will), eth0:0 carrying the VIP will also respond to ARP requests no matter what. This means an ethernet alias (eth0:0) is not permitted on realservers, because realservers should not respond to ARP requests for the VIP. On the other hand, loopback aliases never respond to ARP requests, which means that the loopback alias (lo:0) must not be used on the director for the VIP.
Problem 2: Loopback aliases I haven't done much checking on the loopback interface problem, but it seems that if an alias is used on a loopback interface (as is required for LVS-DR) on a realserver running kernel 2.2.x, the whole ARP gets screwed. It appears that loopback interfaces get special ARP treatment in the kernel, so I suggest avoiding loopback aliases altogether. The question now is: What kind of an interface can I use on realservers? As I already noted, the eth0:0 alias can not be used, because such aliases respond to ARP requests. lo:0 aliases can not be used, because they cause ARP problems too. In case of the tunneling VS configuration, the answer is trivial: tunl0. But to be honest, the tunl0 interface can also be used for direct routing. (Joe: the dummy device is OK too, at least for the 2.0.x kernels) With direct routing, the only thing we need an interface for is to let the kernel know we possess an additional IP address. This means we can set up any kind of an interface, as long as it doesn't respond to ARP requests. Instead of tunl0, you could also set up a ppp0, slip0, eth1 or whatever. I suggest setting up a tunl0:

Problem 3: Real server ARP requests. Suppose we have set up a virtual server as described at the beginning. All computers are running, but no requests have been made. Then the client sends a request to the VIP. When the packet arrives at the gateway, the gateway makes an ARP query for the VIP and the director responds. The gateway remembers the director's MAC address and sends the packet to the director. The director receives the packet, looks up its ipvsadm/LVS tables, chooses the realserver and forwards the packet to the realserver by the direct routing or tunneling method. The realserver receives the packet and generates a response packet with destination=client, source=VIP. (until now everything works correctly) When the realserver wants to send the response packet to the gateway, it finds out that it does not know the gateway's MAC address.
It sends an ARP request to the local network and asks for the gateway MAC address. This should look like: ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP) But in reality, the realserver asks something like: ARP, who has 192.168.1.1 (gw), tell 192.168.1.110 (VIP), because it takes the source address from the packet it wants to send. Here the problems come in. The gateway receives the packet and responds to it, which is correct. But at the same time, the gateway does a little optimization. It finds out that the realserver's MAC address is not listed in its ARP tables and adds the entry to the table, just in case it might need that address in the near future. The ARP request contained the VIP address and the realserver's MAC address, so from now on, the gateway will send all packets destined for the VIP to the realserver instead (due to the MAC address). This means all packets that follow will bypass the virtual server altogether and be answered by the realserver. If the realserver's ARP request had been: ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP) all this would not have happened. Therefore I have patched the 2.2 VS kernel in such a way that it composes ARP requests based on the address of the interface selected by the routing tables instead of the address taken from the packet itself. In order for the virtual server to work correctly, the realservers should have patched kernels as well, or at least copy the patched /usr/src/linux/net/ipv4/arp.c file to the realservers before compiling the kernels.

Conclusion Those were my experiences with ARP problems and the 2.2 kernel virtual server. I think it would be wise to add this letter to the web site and notify the network developers about our findings at some point in time. Here are some golden rules I stick to, when I do virtual server configuration:

Symptoms of realservers arp'ing - arp bouncing Stephen Williams sdw (at) lig (dot) net (Stephen wrote one of the patches that stop devices in 2.2.x kernels from replying to arp requests) If you don't use the patch you'll find that the 'active' box will bounce from machine to machine as each one sends an ARP reply that is heard last. Additionally you will get TCP resets as connections that were on one box suddenly start going to others. Very nasty and unusable.
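The two failure modes described so far (Stephen's bouncing, where the reply heard last wins, and Peter Kese's Problem 3, where the gateway learns the VIP from the sender fields of a realserver's ARP request) can be sketched with a toy neighbour table. This is only a model under stated assumptions: the "newest information wins" rule, the class and function names, and the MAC addresses are all made up for illustration, and real ARP caches have timers and states that are ignored here.

```python
class ArpCache:
    """Toy neighbour table: IP -> MAC, newest information wins (an assumption)."""

    def __init__(self):
        self.table = {}

    def learn(self, ip, mac):
        # Used for ARP replies and for the sender fields of ARP requests.
        self.table[ip] = mac


VIP = "192.168.1.110"
RIP = "192.168.1.11"
DIRECTOR_MAC = "00:00:00:00:00:0a"                       # made-up addresses
REALSERVER_MACS = ["00:00:00:00:00:0b", "00:00:00:00:00:0c"]


def bounce(reply_order):
    """MAC the gateway holds for the VIP after hearing replies in this order."""
    gateway = ArpCache()
    for mac in reply_order:
        gateway.learn(VIP, mac)
    return gateway.table[VIP]


def after_realserver_request(sender_ip, sender_mac):
    """Gateway's VIP entry after a realserver ARPs for the gateway's MAC."""
    gateway = ArpCache()
    gateway.learn(VIP, DIRECTOR_MAC)      # correct entry to begin with
    gateway.learn(sender_ip, sender_mac)  # gateway caches whoever asked
    return gateway.table[VIP]


if __name__ == "__main__":
    # Every machine holding the VIP answers; the last one heard is kept.
    print(bounce([DIRECTOR_MAC] + REALSERVER_MACS))
    # Unpatched realserver: "tell VIP" overwrites the director's entry.
    print(after_realserver_request(VIP, REALSERVER_MACS[0]))
    # Patched realserver: "tell RIP" leaves the VIP entry alone.
    print(after_realserver_request(RIP, REALSERVER_MACS[0]))
```

Whether the first or the last reply ends up in a real cache is not settled in this HOWTO, so treat the ordering rule as the variable to change when experimenting.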

Lars' Method (This is called Lars' method) Lars I have thought about how the ARP problem can occur at all with direct routing, because I never noticed it. Then it occurred to me that your VIP comes from the same subnet as the RIP of the LVS and also all the realservers share this media. To avoid the "ARP problem" in this case without adding a kernel patch or anything else, you can just add a direct route for the VIP using the RIP of the LVS as a gateway address on the router in front of the LVS. ("ip route VIP 255.255.255.255 real_ip" on a Cisco, or "route add -host VIP gw RIP" on Linux) Since I just used 2 ethernet cards and had the LVS act as gateway/firewall anyway, I never noticed the ARP problem. (We have 2 LVS in a standby configuration to eliminate the SPOF)

Static Routing to Director The arp problem is handled if the router in front of the director has a static route for the VIP to the director (i.e. packets for the VIP from the outside world are sent to the director and cannot get to the realservers). Wensong For the clients who reach the virtual server through the router, there is no problem if a static route for the VIP is added. However, for the clients who are in the network of the virtual server, the "ARP problem" will arise. The ARP responses compete, and the clients don't know whether to send the packets to the load balancer or to a realserver. In my point of view, the VIP address is shared by the director and realservers in LVS-Tun or LVS-DR: only the director sends ARP responses for the VIP, so that it accepts the request packets, and the realservers have the VIP but don't answer ARP for it, so that they can process packets destined for the VIP.

iproute2 arp on|off flag Joe, 21 May 2001

Was looking at the ip (i.e.iproute2) notes and it says Is this like the old -noarp flag for ifconfig?

Julian Anastasov ja (at) ssi (dot) bg 21 May 2001 This is the device ARP flag, same as ifconfig [-]arp. The flag is used to allow ARP packets for the specified device. It is correct that "lo" does not talk ARP, but you connect to the VIPs on "lo" through eth*, so the flag is of no help for LVS. We can't drop the flag for eth device. Andreas J. Koenig, 02 Jun 2001

kernel 2.4.5 has arp_filter

Julian Anastasov ja (at) ssi (dot) bg arp_filter does not solve the ARP problem for LVS This is a new proposal to control the ARP probes and replies based on the route flag "noarp". It will be discussed on the netdev mailing list and maybe something like this is going to be included in 2.4, maybe in 2.2 too, not sure. As you all know, the hidden feature is not being considered for 2.4. The net developers have the final word. I'll try to maintain the hidden flag in all future kernels, since this flag is more usable than the new feature and because the hidden flag has different semantics. And because there may be some user space tools that rely on it.

Is the arp behaviour of 2.2.x kernel a bug? Julian Anastasov is replying to correct an error in a previous version of the HOWTO where I state that the dummy0 device in 2.2.x kernels does not arp. Julian wrote one of the realserver patches which fix the "arp problem". Julian In fact, the documentation is incorrect. There is no difference, all devices are reported in the ARP replies: lo, tunl and dummy. So, only the ARP patch can solve the problem. This can be tested using this configuration with any device (before the patch is applied): On host A try: ping 192.168.0.3 Host B replies for 192.168.0.3 through 192.168.0.2 device So, the ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel. Stephen Williams (who wrote another of the patches to fix the arp problem).

Of course the ARP code in the kernel needs to be fixed so my filter code isn't needed. Still, I'm confused by this statement. The IFF_NOARP flag determines whether a device arp replies or not. What's wrong with honoring that? If you mean that arp replies should never be sent on another interface, that is what I currently believe to be correct.

Julian My understanding is that the 2.2.x ARP code is not buggy and there is no need for it to be "fixed". I must say that your patch works for the LVS folks but not for all linux users. IFF_NOARP means "Don't talk ARP on this device", from 'man ifconfig': [-]arp Enable or disable the use of the ARP protocol on this interface. So, where is the bug? The ARP code never talks through lo, dummy and tunl devices when they are set NOARP. It uses the eth (ARP) device. If you hide all NOARP interfaces from the ARP protocol, this is a bug. One example: Is it possible after your patch for Host B to access www.domain.com? How? Host A doesn't send replies for A.B.C.1 through eth0 after your patch. OK, maybe this is not fatal. Tell it to all kernel users. You hide all their NOARP interfaces. Maybe there are other examples where this is a problem too. Or maybe there is something wrong in this configuration? I want to say that this patch hurts all users if present in the kernel. On Nov 6 I posted one patch proposal to the linux-kernel list which adds the ability to hide interfaces from the ARP queries and replies. But the difference is that only specified interfaces are not replied for, not all NOARP interfaces. Its arp_invisible sysctl can be used by LVS folks to hide lo, tunl or dummy interfaces but this feature doesn't hurt all kernel users. I think this patch is more acceptable and can be included in the 2.2 kernel, maybe after some tuning. And I'm still expecting comments from the net folks and from all LVS users.

The device doesn't reply to arp requests, the kernel does. ARP requests/replies are thought of as coming from a device and people make statements like "the dummy device in 2.0.x kernels does not reply to arp requests while the same device in 2.2.x kernels does reply". It is the kernel that handles arp requests according to a set of rules and not the device. The code for the dummy device is the same in 2.0.x and 2.2.x kernels and is not responsible for the change in arp behaviour. (The RFC for ARP is at ftp://ftp.isi.edu/in-notes/std/std37.txt - also see rfc826 and rfc1122. The model system used there is 2 machines on a single ethernet. It doesn't shed any light on the implementation of ARP on multi-interface systems like LVS.)

Properties of devices for the VIP In a previous version of the HOWTO I stated that the dummy0 device did not arp in 2.2.x kernels and therefore could be used as the device for the VIP on an unpatched 2.2.13 realserver. Julian Anastasov replied that they did arp (see below for his posting and the ensuing discussions). I hadn't actually tested whether the dummy0 device arp'ed but had concluded that it wasn't arp'ing because I had a working LVS using the dummy0 interface for the VIP on unpatched 2.2.x realservers and because as everyone knows ;-) an LVS needs to have a non-arp'ing device on the VIP of the realservers. I had a LVS-DR LVS which worked with dummy0, lo:0 and tunl0 as the VIP device and which on further testing, I found also worked with eth0:1 or eth1 as the VIP device on 2.2.13 realservers. Whatever the arp'ing status of dummy0, lo:0 or tunl0, clearly eth1 replies to arp requests, so despite the conventional wisdom, it is possible to build an LVS with arp'ing VIP's on the realservers. On investigating why this LVS worked, I found that the MAC address for the VIP in the client's arp cache (# arp -a) was always the director. I assume this was because the director is 3-4x the speed of the other machines in the LVS and it replies to arp requests first for the VIP (another posting from Stephen WIlliams says that the address which replies last is stored in the arp cache - we'll figure out what's really going on here eventually). On another LVS where the realservers were all identical hardware with 2.2.13 unpatched kernels, one particular realserver always was the machine in the client's arp cache for the VIP (to check, delete entry for VIP with arp -d, then ping again, then look in arp cache). I found that I could get a working LVS using almost anything to hold the VIP on the realservers, including eth0:1 and eth1 (another NIC in the realserver). 
These devices carrying the VIP were pingable from the client and I could get the corresponding MAC addresses in the arp table of the client if the director was not set up with a VIP. When I set up a working LVS this way, I found each time that the MAC address for the VIP in the client's arp cache was the director's MAC address. For some reason that I don't know, whenever the client does an arp request for the VIP, it gets the director's MAC address. Possible reasons for the MAC address of the director always being associated with the VIP in my LVS - 1. I configure the director first and then the realservers. I don't make requests for a service till the realservers are set up. (Still I can't imagine the client asking for the MAC address of the VIP until it makes a connect request.) 2. The director is 3 times faster (CPU speed) than the next machine in the LVS and it always replies to arp requests first. 3. I was lucky. Since you can make a working LVS-DR LVS with the realserver VIP on an arp'ing eth0:1 device I decided that the relevant piece of information about arp'ing was (ta da!) * an LVS will work if the client always gets the MAC address of the director when it asks for the MAC address of the VIP * This provides an easy solution - you tell the client (or the router) the MAC address of the VIP with arp -s or arp -f. Here's my /etc/ethers After installing the MAC address of the DIP (director) as the MAC address of the VIP (lvs) in the arp table (arp -f /etc/ethers) I get: notice the "PERM" in the VIP entry on the client. Removing the permanent entry on eth0 director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0 If I edited /etc/ethers changing the MAC address of lvs to anything else, the LVS did not work anymore. So the arp information is coming from /etc/ethers rather than some uncontrolled variable I'm not aware of.
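The effect of the permanent entry can be sketched as follows. This is a minimal model, assuming that a permanent (static) entry is simply never overwritten by ARP traffic heard later; the class name and the realserver MAC are made up, while the director's MAC and the VIP are the ones used in the text.

```python
class PinnedArpCache:
    """Toy client/router ARP cache with permanent (static) entries."""

    def __init__(self):
        self.table = {}                      # ip -> (mac, permanent)

    def set_permanent(self, ip, mac):
        # Models a static entry such as one installed with arp -s or arp -f.
        self.table[ip] = (mac, True)

    def learn(self, ip, mac):
        # Models information picked up dynamically from ARP packets.
        entry = self.table.get(ip)
        if entry is not None and entry[1]:
            return                           # permanent entries are not replaced
        self.table[ip] = (mac, False)

    def lookup(self, ip):
        entry = self.table.get(ip)
        return entry[0] if entry else None


VIP = "192.168.1.110"
DIRECTOR_MAC = "00:A0:CC:55:7D:47"
REALSERVER_MAC = "00:00:00:00:00:0b"         # made-up address

if __name__ == "__main__":
    client = PinnedArpCache()
    client.set_permanent(VIP, DIRECTOR_MAC)
    client.learn(VIP, REALSERVER_MAC)        # an arp'ing realserver answers
    print(client.lookup(VIP))                # still the director's MAC
```

In this model it no longer matters which machine answers first or last, which is why an arp'ing device can hold the VIP on the realservers.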
I had thought that in an LVS with the VIP on realservers on an arping device that the VIP would hop from one machine to another (see the postings in the MISC section). Since naturally occurring LVS's with arping VIP's on realservers existed and worked well (mine), I set up an LVS by making a permanent entry for the VIP of the director in the arp cache of the client (router). This can be done by or There are 2 results of this: the realservers can have the VIP on an arp'ing device (eg eth0:1, eth1) - you don't need lo or dummy0, tunl0 for realservers with 2.0.36 and 2.2.x kernels. If two (or more) directors are set up in failover mode, the mechanism for changing the VIP from one to another is broken by making a permanent entry for the VIP on the director in the arp cache of the router. This is not a problem for a test setup to demonstrate an LVS but may be a problem in a high availability environment (a solution may be found in the meantime too). The normal method for changing directors (e.g. with heartbeat) includes a gratuitous arp. To force a gratuitous arp Julian

You can use Yuri Volobuev's send_arp.c from the 'fake' package or Alexey Kuznetsov's arping from its iputils package: fake - http://vergenet.net/linux/fake/ iputils - ftp://ftp.inr.ac.ru/ip-routing/iputils-ss991024.tar.gz iputils is also used for IPAT, IP address takeover
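To make concrete what a gratuitous ARP looks like on the wire, here is a sketch that only builds the frame as bytes and does not send anything. It assumes the announcement form of the packet (an ARP request in which the sender and target IP are both the VIP, sent to the Ethernet broadcast address); the function names are made up, and no claim is made that send_arp or arping emit exactly these bytes.

```python
import socket
import struct


def mac_to_bytes(mac):
    """Convert 'aa:bb:cc:dd:ee:ff' to 6 raw bytes."""
    return bytes(int(part, 16) for part in mac.split(":"))


def gratuitous_arp(vip, mac):
    """Build an Ethernet frame with an ARP request announcing vip at mac."""
    hw = mac_to_bytes(mac)
    ip = socket.inet_aton(vip)
    # Ethernet header: broadcast destination, our source, ethertype ARP.
    ether = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)
    # ARP header: Ethernet, IPv4, 6-byte MAC, 4-byte IP, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + ip                 # sender hardware and protocol address
    arp += b"\x00" * 6 + ip        # target hardware unset, target IP = VIP
    return ether + arp


if __name__ == "__main__":
    frame = gratuitous_arp("192.168.1.110", "00:A0:CC:55:7D:47")
    print(len(frame), frame.hex())
```

Neighbours that already hold an entry for the VIP are expected to refresh it from the sender fields, which is what moves the VIP to the new director's MAC after a takeover.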

If you're not sure if the network knows that the VIP has moved, try this. Graeme Fowler graeme (at) graemef (dot) net 13 Mar 2006 At failover, make the new live director run something along the lines of: Where $GW_IP is the IP address of your upstream router. It's not exactly gratuitous ARP but it does, in my experience, help to rapidly converge the systems which currently don't talk to each other. Also make absolutely sure that the VIP is being torn down on the failed director. If it isn't, and it still ARPs for it, you'll end up in all sorts of problems. To monitor this you could feasibly run arpwatch on both the directors' upstream interfaces. You should see the VIP flip-flop on failover. If you see it repeatedly flip-flop at regular intervals, you're not tearing down properly. Joe Dec 2003 There is also http://www.vergenet.net/~acassen/software/garp-0.1.1.tar.gz which has been available for over a year, without me even knowing about it. Here are some tests I did

Experiment 1: Result - arp'ing is independent of [-]arp Summary: the -arp/+arp option for ifconfig had no effect on any devices back to 2.0.36 kernels with net-tools 1.42. If a device normally arps then -arp had no effect; if it normally doesn't arp, then "arp" doesn't turn it on (data below). Method: IP=192.168.1.1/24 with VIP=192.168.1.110/32. The VIP was on dummy0. The test was to see if the VIP was pingable from another (external) machine on the 192.168.1.0/24 network or pingable from the machine itself (ie internally from the console). (I assume I had a route add -host for the VIP although I didn't record this). The test was done with ifconfig using arp or -arp (the output of ifconfig -a didn't change)

Experiment 2: Can the VIP be on a separate NIC? Summary: yes, as long as the NIC doesn't have a cable plugged into it. Method: same as above except VIP on eth1 (another NIC). One of the reasons a no_arp interface is used on the realserver is that it is not visible to the rest of the network.
Does the LVS work if the eth1 VIP on the realserver is not visible to the rest of the network? Conclusion: for 2.0.36 dummy0 doesn't arp, and eth1 does arp. The arp/-arp option to ifconfig has no effect on arp behaviour. LVS works with both dummy0 and eth1, I assume since the VIP need only be resolved as local on the realserver and does not need to be visible to the network.

Experiment 3: What devices and netmasks are necessary for a working LVS? Using the /etc/ethers approach for setting the MAC address of the VIP I then set up an LVS with a pair of realservers serving telnet. All IPs are 192.168.1.x, all machines have a route to 192.168.1.0 via eth0. There is no default route. with the following devices holding the VIP, tunl0, eth0:1, lo:0, dummy0, eth1. In each case there was no route entry for the VIP device and there was no cable connected to eth1 when it was used for the VIP. The table below shows whether the LVS worked. The VIP is installed with the results belong to 1 of 3 groups netmask of VIP=255.255.255.255 (normal LVS setup) netmask of VIP=255.255.255.0 (not normally used for LVS) It would seem that any device and any netmask can be used for the VIP on a 2.0.36 realserver for both LVS-Tun and LVS-DR. For 2.2.13 realserver, LVS-Tun, VIP on a tunl0 device only, any netmask (ie you need tunl0 on LVS-Tun with 2.2.x kernels) For LVS-DR then on Solaris/DEC/HP/NT... LVS can probably use a regular eth0 device rather than an lo:0 device (more work for Ratz to do :-). Does anyone know why the lo:0 device has to be /32 for LVS-DR on kernel 2.2.13 while the other devices can be /24? Jean-Francois Nadeau jna (at) microflex (dot) ca 6 Dec 99 In kernel 2.2.1x with a virtual interface on lo:0 and netmask of 255.255.255.0 that the interface no longer arps. Horms 29 Oct 2003 (4yrs later, presumably referring to the 2.4 kernels) brings up lo:110 (a virtual interface on the loopback device) for 192.168.1.110 with the broadcast and netmask as specified.
If you are using LVS-DR then the packets that arrive on the realservers have the destination IP address set to the VIP. So the realservers need some way of accepting this traffic as local. One way is to add an interface on the loopback device and hide it so it won't answer ARP requests. The netmask has to be 255.255.255.255 because the loopback interface will answer packets for _all_ hosts on any configured interface. So 192.168.1.110 with netmask of 255.255.255.0 will cause the machine to accept packets for _all_ addresses in the range 192.168.1.0 - 192.168.1.255, which is probably not what you want. Does anyone know why only the tunl0 device works for LVS-Tun on 2.2.x kernels?

Experiment 4: Effect of route entry for VIP and connection to VIP. The VIP normally has an entry in the routing table eg I found in Experiment 2 that a route entry was not necessary for the LVS to work when the realserver had the VIP on eth0:1. Since I had always used a route entry for the VIP I wanted to find out when it was needed. The same LVS was used as for Experiment 3. The variables were Conclusion 1: LVS works for both cases of route/no_route for the VIP for eth0:1 and eth1 (ie you don't need a route entry for the VIP on the realservers). Conclusion 2: having a network cable/no network cable does not affect whether the LVS works. Conclusion 3: for 2.0.36 kernels you can choose to have the VIP pingable from the outside world but not pingable by the local host by having it on eth1 with a cable connection (this seems weird and I can't think of any use for it just yet) or the reverse - pingable from the localhost but not by the external world by not having a cable connection. Using a host's routable IP as the target - the IP on eth0 say - you can make a host unpingable from the console if you down the lo. The host is still pingable from elsewhere on the net.

Topologies for LVS-DR and LVS-Tun LVS's

Traditional The conventional LVS-DR/LVS-Tun topology which allows maximum scalability has each realserver with its own default gateway (to a router). (In a routerless test setup, the client would be the default gateway for the realservers. In a setup which is not network bound, i.e. is disk- or compute-bound, only one router may be needed. The changes in topology/routing are made by changing the IP of the default gw for the realservers) Some method of handling the arp problem is needed here. The packets sent to the realservers from the director generate replies which go directly to the client. Failure messages (eg if a realserver is not available) do not get returned to the director, which cannot tell if a realserver has failed (see discussion of monitoring agents).

Director sees replies (from Julian Anastasov) This discussion led to Julian's . If the default gw for each realserver is changed to the DIP (see the Martian modification section) then The director has to handle the reply packets as well as the incoming packets, doubling the network load. The director sees all the reply packets. Connection failure can be detected (in principle). Here's the original posting by Horms horms (at) vergenet (dot) net Hi, I have been setting up a test network to benchmark IPVS, the topology is as follows. The question that I have is that the network I would really like to be testing is; .. other than using NAT, which has performance problems, is this possible? I tried this topology with direct routing and packets from the clients were multiplexed to the servers fine, but return packets from the servers to the client were not routed by the IPVS box. Lars Yes. The LVS box silently drops the return packets, since they have a src ip which is also bound as a local interface on the LVS. This is meant to be a simple anti-spoofing protection. (from Joe: The return packet from the realserver has src=VIP, dest=CIP. If this packet is routed via the director, which also has the VIP, the director will be receiving a packet from another machine with the src being one of its own IPs and the director will drop the packet.) You can enable logging these packets via /proc/sys/net/ipv4/conf/all/log_martians The only way around this with current Linux kernels is to disable the check in the kernel source or to use a separate box as the outward gateway. (Which is how DR is meant to be used for full performance) This is not a problem as such as it probably makes a lot of sense not to use an IPVS box as your gateway router, Actually it makes a lot of sense to do just that IMHO. Fewer points of failure, less hard- and software to duplicate in a failover configuration. Ray Bellis rpb (at) community (dot) net (dot) uk

It needs to be made more explicit in the documentation that LVS-DR will only work if you have a different return path.

Lars Marowsky-Bree lmb (at) teuto (dot) net ... or if you have a suitably patched kernel.

We spent several man days trying to get this to work before figuring out why the packets were being dropped, at which point we had no alternative but to use LVS-NAT instead.

I agree. We still assume too much knowledge on the network admin side.

FYI, we have our LVS system working now, with LVS redundancy achieved by running OSPF routing (gated) on the LVS-NAT servers and having the VIP within the same IP subnet as the RIPs so that IGP routing policies automatically determine which LVS router the packets arrive on.

Yes, that's one option. Even better than heartbeat and IPAT, if all your systems support running a routing protocol. (IPAT = IP address takeover, part of heartbeat) In essence, heartbeat and IPAT is nothing but reinventing a subset of the functionality of a hardened routing protocol like OSPF/RIPv2/EIGRP.
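Going back to the dropped return packets at the start of this thread, the anti-spoofing check Lars describes can be modelled in a few lines. This is a sketch only: the function names and the client address are made up, the DIP and VIP are the ones used elsewhere in this HOWTO, and the real check lives in the kernel's source validation rather than in anything this simple.

```python
DIP = "192.168.1.10"
VIP = "192.168.1.110"
CIP = "10.0.0.5"                  # made-up client address


def director_forwards(src, director_ips=(DIP, VIP)):
    """Toy anti-spoofing check: a packet arriving from another machine
    whose source is one of the director's own addresses is dropped."""
    return src not in director_ips


def reply_outcome(default_gw):
    """Fate of a realserver's LVS-DR reply (src=VIP, dst=CIP).

    default_gw is 'director' or 'router' (a separate outward gateway).
    """
    if default_gw == "director":
        return "forwarded" if director_forwards(VIP) else "dropped as martian"
    return "forwarded"


if __name__ == "__main__":
    for gw in ("router", "director"):
        print(gw, "->", reply_outcome(gw), "(reply for", CIP + ")")
```

The model shows why an unpatched LVS-DR setup needs a return path that does not pass through the director.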

On other schemes for director/realservers to exchange roles Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg has pointed out on the mailing list that the prototype LVS can be redrawn as and that any realserver is in a position to replace a failed director. No-one has bothered to write the code for this. It seems it's easier to have extra boxes in the director role (ready for failover) and others in the realserver role. It's easier to wheel in another box for a spare director than to configure realservers to do two jobs reliably. Julian The director and the backup are in a shared network for incoming traffic; the backup sniffs packets and changes its connection state the same as the director (because the director sees just the client-to-server half of the connection in LVS/TUN and LVS/DR), then drops the packets. It needs some investigation and probably lots of additional code too. ;-) Wensong Zhang wensong (at) iinchina (dot) net I don't even think so - the main trick is getting the kernel to sniff the packets, which is probably quite easy with a little messing around. Not sending the packets out again (which would confuse the realservers) is easy with an ipchains output rule which silently drops them. This doesn't work with a switch though, you need a shared network like a hub. However, I have been talking with Rusty about this. The problem is more general - HA shared-state firewalls are asked for all the time, so we want to do a generic thing for everything which builds upon Netfilter's state machine. This would not only cover LVS, but also masquerading and packet filtering in general. We intend to discuss this in greater detail at the Ottawa Linux Symposium at the latest. Julian You can see, the connections depend on the initial state and the realservers' realtime status. So another method is that when the director is down, the backup server sets up ipvs with the connections, but it seems too late. What do you think about this?
Wensong TCP/IP should be able to cope with a few seconds delay and lost packets. You want to heartbeat once per second and take over after 3-4s though - this usually means takeover is complete in <10s, which TCP/IP should swallow.

Why do all devices broadcast the arp replies John Reuning (10 Apr 2003)

Why are arp replies sent for all interfaces, regardless of which interface receives the arp request?

Julian Because Linux routing agrees that all these senders have access to this IP, so we give them access to valid link layer address. This behavior is usually observed on routers configured without source address validation enabled. As this is the default behavior specified in RFC1812 (rp_filter=0), Linux simply allows access to this IP on any interface.

arp is part of the transition from network layer to link layer, right? So why should an alias on lo, an interface that doesn't really generate network frames, trigger an arp reply. Do other unix tcp/ip

Note that these packets are not passed via the lo interface, also, we do not send ARP replies via lo, why we should care about the lo's NOARP flag?

I can't seem to make a Solaris 7 system generate arp replies for an lo alias.

The different systems have different policy for IP addresses configured on loopback device. Note that in Linux, this behavior has nothing to do with the lo interface, you can configure IP on eth1 and then again to see our ARP reply for it on eth0.

A discussion about the arp problem (Joe and Julian) Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg There is no difference between devices in 2.2.x, all devices are reported in the ARP replies: lo, tunl and dummy. This can be tested using this configuration with any device: On host A try: ping 192.168.0.3 Host B replies for 192.168.0.3 through 192.168.0.2 device The ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel. ARP problem, some rules: ARP responses all local IP addresses are replied: lo, eth, tunl*, dummy* but with some exceptions (see the next rules) 127.0.0.0/8(LOOPBACK) and 224.0.0.0/4(MULTICAST) are not replied there is one exception for the "lo" interface: it is possible the kernel to ignore the ARP request if the source IP is from the same net as the net used to configure "lo" alias. The specified network is treated as local. For example: realserver# ifconfig lo:0 192.168.1.1 netmask 255.255.255.0 broadcast 192.168.1.255 up "real" treats all packets with source addr from 192.168.1.0/24 which come from the other devices (eth0) as invalid, i.e. source address validation works in this case and the ARP request are not replied. The kernel thinks: "The incoming packet arrived with saddr=local_IP1 and daddr=local_IP2(VIP), so it is invalid". By this way the host from the LAN can't talk to the realserver if its lo alias is configured with netmask != 255.255.255.255 registers only 192.168.1.1 as local ip but: all 256 IPs are local. All IFF_LOOPBACK devices treat all IPs as local according to the used netmask.
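Julian's rules can be written down as a small decision function. This is a model of the rules as he states them, not of the kernel code: the function name and its arguments are made up, source validation is reduced to "the sender looks local", and the realserver address 192.168.1.11 is only an example.

```python
import ipaddress


def answers_arp(target, sender, eth_ips, lo_aliases):
    """Return True if the toy host replies to 'who-has target tell sender'.

    eth_ips    -- addresses on non-loopback devices (strings)
    lo_aliases -- 'address/netmask' strings configured on lo aliases
    """
    target = ipaddress.ip_address(target)
    sender = ipaddress.ip_address(sender)
    lo_nets = [ipaddress.ip_network(a, strict=False) for a in lo_aliases]
    local = {ipaddress.ip_address(ip) for ip in eth_ips}

    def is_local(ip):
        # A loopback alias makes every address under its netmask local.
        return ip in local or any(ip in net for net in lo_nets)

    if target.is_loopback or target.is_multicast:
        return False      # 127.0.0.0/8 and 224.0.0.0/4 are not replied
    if is_local(sender):
        return False      # source validation: "from me to me => drop"
    return is_local(target)


if __name__ == "__main__":
    vip, gw, rip = "192.168.1.110", "192.168.1.1", "192.168.1.11"
    # lo:0 with a /32 netmask: the unpatched host answers for the VIP.
    print(answers_arp(vip, gw, [rip], [vip + "/255.255.255.255"]))
    # lo:0 with a /24 netmask: the gateway itself now looks local.
    print(answers_arp(vip, gw, [rip], [vip + "/255.255.255.0"]))
```

With the /24 netmask the host stops answering for the VIP, but only because every LAN address, including the gateway's, is treated as local, which is why Julian notes that hosts on the LAN can then no longer talk to the realserver.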

Joe I assume IFF_LOOPBACK devices are lo, lo:0..n?

Yes, currently only lo is marked as loopback. It is used to mark whole subnets as local.

lo:0 is not marked as loopback?

lo:0 is just an IP address attached to the same device "lo". You can try "ifconfig lo:0 192.168.0.1 netmask 255.255.255.255" and display the interfaces using "ifconfig". There is a LOOPBACK flag for lo:0 which is inherited from the device "lo". In Linux 2.2 all aliases inherit the device flags. Only the IFF_UP flag is used to add/delete the aliases.

Joe Assume LVS-DR with VIP, RIPs all on the same /24 network on eth0 devices, realservers all have lo:0 with VIP/24 and have the standard 2.2.x kernel (no patches to hide interfaces). Router says "who has VIP", the arp request arrives at the realservers via eth0. Device lo:0 finds arp request which arrived on eth0 from router is on the same subnet as lo:0 and does not reply to the arp request.

Before deciding whether to answer the ARP request, the routing tables are checked, i.e. source validation of the packet is performed. If 192.168.0.2 asks "who-has 192.168.1.1 tell 192.168.1.2", the realserver assumes that this is an invalid packet, i.e. from one local IP to another local IP (from me to me => drop).
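The source validation Julian describes can be modelled in a few lines. This is only a sketch of the reasoning, not kernel code: the function names are made up for illustration, and it assumes (as stated above) that an lo alias marks its whole configured subnet as local, while an address on any other device is local only as a single IP.

```python
import ipaddress


def local_networks(lo_aliases, other_addrs):
    """Networks the host treats as local (hypothetical helper).

    lo aliases mark the whole configured subnet as local;
    addresses on other devices are local only as single IPs.
    """
    nets = [ipaddress.ip_interface(a).network for a in lo_aliases]
    for a in other_addrs:
        ip = ipaddress.ip_interface(a).ip
        nets.append(ipaddress.ip_network(f"{ip}/32"))
    return nets


def arp_request_answered(sender_ip, target_ip, nets):
    """Model of the two lookups in the 'local' table."""
    sender = ipaddress.ip_address(sender_ip)
    target = ipaddress.ip_address(target_ip)
    # 1. source validation: a request from the wire whose sender
    #    looks local is "from me to me" and is dropped
    if any(sender in n for n in nets):
        return False
    # 2. reply only if the requested address is local
    return any(target in n for n in nets)


# lo:0 configured with a /24 netmask: the router's request is dropped
wide = local_networks(["192.168.1.1/24"], ["192.168.1.8/24"])
dropped = arp_request_answered("192.168.1.254", "192.168.1.1", wide)

# lo:0 configured with a /32 netmask: the request passes validation
narrow = local_networks(["192.168.1.1/32"], ["192.168.1.8/24"])
answered = arp_request_answered("192.168.1.254", "192.168.1.1", narrow)
```

Note that in the /32 case the realserver does answer for the VIP, which is the arp problem itself; hiding the interface is still needed.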

Joe I notice that with the 2.2.x kernel, that lo:0 has to have netmask=255.255.255.255 to work, whereas with the 2.0.x kernels (where lo:0 doesn't reply to arp requests), that lo:0 can have the VIP on a 255.255.255.0 netmask and still work.

The rule is to use netmask 255.255.255.255 and to hide lo. ARP works in a different way in 2.2. It looks in the "local" table to validate the source of the ARP request and after that it looks up the same table to check whether the daddr of the ARP request is a local IP. ARP requests: - all local addresses can be used by the kernel to announce them as the source for the ARP request.

is it OK to say the kernel can (does?) use all local addresses as the source of ARP requests

It can and does. The realserver thinks that it can use any local IP address as saddr in the ARP request and the answer will be returned if this IP is unique in the LAN.

Joe do you mean "the realserver will receive a reply if the s_addr is unique in the LAN"?

The realserver will receive an answer if it uses the RIP as saddr in the ARP request, because the VIP (HIP) is hidden, or when using transparent proxy, because the VIP is not local. The realserver must know how to ask (using a unique IP) or the traffic for the asked IP (ROUTER) will be blocked. But the hidden addresses are not used because they are not unique (2.2.14) and the answer will be returned to the Director.

Joe do you mean "the non-hidden VIP on the director"?

Yes, when the realserver asks "who-has ROUTER tell VIP" the ARP reply is received by the Director and transmission from the realservers stops. The ROUTER sends everything destined to the VIP to the Director. This is true for all clients on the LAN too if they are not in this cluster (if they don't handle packets for VIP).

Joe I would have thought that the main device on each NIC, eg eth0, eth1 would have been used as the source address.

No, it is extracted from the outgoing datagram and if saddr is a local IP it is used. But if this is not a local IP, i.e. when using transparent proxy, or the address is marked as hidden, the main device IP is used.

Joe how is arping part of transparent proxy?

It is not. When VIP is not local IP address in the realserver this IP is not used from the ARP code. It is not in the "local" table. But TCP, UDP and ICMP use it via transparent proxy support. They are extracted from the outgoing packet.

Joe what is "They"? the source addresses? When you say "extracted", do you mean "removed from packet" or "looked at/detected"

The saddr from the data packet is used to build the ARP request. We tell the kernel that these addresses are not unique by setting <interface>/hidden=1 (starting with kernel 2.2.14). In this way the kernel selects the device's primary IP as the source of the ARP request.
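The selection rule in the last few answers can be written down as a small decision function. This is a sketch of the rule as Julian states it, not the kernel's code; the function name and the example addresses are assumptions for illustration.

```python
def arp_source(saddr, local_ips, hidden_ips, primary_ip):
    """Pick the sender address for an ARP request (illustrative model).

    saddr      - source address of the outgoing datagram
    local_ips  - addresses in the host's "local" table
    hidden_ips - local addresses on devices marked hidden
    primary_ip - primary IP of the outgoing device (the RIP on a realserver)
    """
    # a local, non-hidden source address is used as is
    if saddr in local_ips and saddr not in hidden_ips:
        return saddr
    # not local (transparent proxy) or hidden: fall back to the primary IP
    return primary_ip


VIP, RIP = "192.168.1.110", "192.168.1.2"

# VIP local and not hidden: the realserver asks "who-has ROUTER tell VIP"
unpatched = arp_source(VIP, {VIP, RIP}, set(), RIP)

# VIP on a hidden device: the realserver asks "who-has ROUTER tell RIP"
hidden = arp_source(VIP, {VIP, RIP}, {VIP}, RIP)

# transparent proxy: VIP is not in the local table at all
tproxy = arp_source(VIP, {RIP}, set(), RIP)
```

Only the first case sends an ARP request carrying the VIP, which is the case where the reply goes to the Director instead of the realserver.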

Joe the kernel can use any local address as s_addr but the code for hiding IPs from arp requests prevents the kernel from using hidden addresses as s_addr in an arp request?

Yes, the code to hide the addresses is already part of the source address autoselection (saddr in the ARP request in our case). We never autoselect hidden addresses, i.e. if the source address is not specified from the higher level. The code to hide interface:

Joe When you say "We expect it is uniq in the LAN" do you mean - we expect you've set up your network properly and that you don't have the same RIP on 2 realservers? :-)

The LVS administrator must ensure that the RIPs are unique; only the VIP is shared. We tell the kernel that the VIP addresses are not unique by setting <interface>/hidden=1 (2.2.14). In this way the kernel selects the device's primary IP as the source of the ARP request. We expect it is unique in the LAN.

So, the recommendation for using the "lo" interface in the realservers is: - use netmask 255.255.255.255 when configuring the lo alias. In this way source validation doesn't drop the incoming packets to this IP. LVS users usually define the net route through the eth interface, so we can talk to other hosts from this network, for example to send the packets to the client through the default gateway. It is not necessary to configure the alias with mask != 255.255.255.255.

So, the interfaces which can be used in the realservers to listen for VIP are: All these devices must be marked as hidden to solve the ARP problem when using Linux 2.2.

In the Director: there is no problem configuring the VIP even on an lo alias or dummy interface. If the interface is not marked as hidden this VIP is visible to all hosts on the LAN.

ATM/ethernet and router problems LVS has only been tested on ethernet. One person had an ATM setup which didn't work with LVS-DR as the ATM router expects packets from the VIP to have the same MAC address (in LVS-DR packets coming from the VIP could have the MAC address of any of the realservers). Apparently this is not easily fixable in the ATM world. It should be possible to use Julian's to make LVS-DR work on ATM, but the person with the ATM setup disappeared off the mailing list without us convincing him of the joy in having the first ATM LVS. Other people have found similar problems with ethernet -

Kyle Sparger ksparger (at) dialtoneinternet (dot) net I don't know if someone has gone over this, but here's a consideration I've come across when setting up LVS in DR mode: When the realservers reply, cisco routers (ours do, at least) will pick up on the fact that it's replying from a different MAC address, and will start arping soon thereafter. This is sub-optimal, as it causes a constant flood of arp requests on the network. Our solution has been to hardcode the MAC address into the router, but this can cause other issues, for example during failover. That can be worked around, as you can set the MAC address on most cards, but that in itself may cause other issues. Has anyone else experienced this? Has anyone else come up with a better solution than hardcoding it into the router?

It should be possible to have the reply packets from the VIP come from a virtual MAC address (such as created by vrrpd), in which case all replies coming to the same port in a router from the VIP will have the same MAC address. No-one seems to be interested in writing the code to do this.

Same IP on multiple NICs

Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 2002-04-16 I'm setting up with LVS-DR. To allow a node to be both a realserver and a backup director, I have eth0:2 being the VIP, because at this point, "backup-and-node" is the director. But when it's not, I still need VIP to be setup on lo:1 to use "backup-and-node" as a realserver. I end up with the following config : The problem is that when VIP is setup on both lo:1 and eth0:2, "backup-and-node" will not answer *any* ARP request for VIP, whereas it should via eth0 (as far as I understand the purpose of the hidden feature).

Julian The problem is that this setup is ambiguous. The kernel doesn't know what device you are using for primary and for secondary IPs. Device lo is a valid device for primary IPs. It is not allowed to define one IP both as primary and secondary one. Yes, lo is first in the device list and we search for hidden IP in _any_ device. We don't have a preferred device to start from. Yes, this is a limitation that nobody wants to fix. Someone will have to persuade me with a clear fix for this.

Joe I'm surprised you're allowed to have the same IP on two different devices. Is there a reason why you'd want to do this or is it just not forbidden and therefore allowed (I believe this is called the American philosophy).

Horms It is actually something you may want to do. Imagine you have a dialup server, 192.168.0.1, which sits on the 192.168.0.0/24 network. Now each dialup user is going to get their own IP address, but 192.168.0.0/24 is your server network, so these IP addresses are on a different network, let's say 10.0.7.0/24. Now when the dialup users come in, there is no need for the dialup-server to have an address on the 10.0.7.0/24 network, it is just a point to point link, so you can have for instance

   10.0.7.7 (ppp0) -------- (ppp0) 192.168.0.1 [dialup-server]

But the dialup-server already has 192.168.0.1 on eth0. Thus you have the same IP address on multiple interfaces. In fact it would have the same IP address on eth0 and each of the ppp interfaces.

LVS: LVS-DR LVS-DR is based on IBM's NetDispatcher. The NetDispatcher sits in front of a set of webservers, which appear as one webserver to the clients. The NetDispatcher served http for the Atlanta and the Sydney Olympic games and for the chess match between Kasparov and Deep Blue. When the packet CIP->VIP arrives at the director it is put into the OUTPUT chain as a layer 2 packet with dest = MAC address of the realserver. This bypasses the routing problem of a packet with dest = VIP, where the VIP is local to the director. When the packet arrives at the realserver, the realserver finds the packet addressed to an IP local to the realserver (the VIP).

LVS-DR example

Here's an example set of IPs for a LVS-DR setup. In this example, the RIPs are on the same network as the VIP (a one network LVS-DR). In this example, for (my) convenience, the servers are on the same network as the router connecting to the client and you have to handle the arp problem (I used the arp -f /etc/ethers approach).

                       CIP->VIP | |
                                v |
                                  |
                              ________
                             |        |
                             | router |  advertises route to VIP
                             |________|
                                  |
                             __________
                            |          |
                            |          | VIP=192.168.1.110 (eth0:1, arps)
                            | director |--- DIP=192.168.1.1 (eth0)
                            |__________|
                                  |
                                  |   ^
  MAC_DIP->MAC_RIP1(CIP->VIP) |   |   | VIP->CIP
                              v   |   |
                                  |
               -------------------------------------
               |                  |                  |
               |                  |                  |
        RIP1=192.168.1.2   RIP2=192.168.1.3   RIP3=192.168.1.4  (eth0)
        VIP=192.168.1.110  VIP=192.168.1.110  VIP=192.168.1.110 (all lo:0, non-arping)
         _____________      _____________      _____________
        |             |    |             |    |             |
        | CIP->VIP    |    |             |    |             |
        | VIP->CIP    |    |             |    |             |
        | realserver  |    | realserver  |    | realserver  |
        |_____________|    |_____________|    |_____________|

Here's the lvs_dr.conf file

Here's how you'd set up a two-network LVS-DR. Note that the router receives packets on port R from both the RIP and VIP, which are in different networks. Once you've solved the arp problem, the router will send packets to the VIP only on port D.
                       CIP->VIP | |
                                v |
                                  |
                              ________
                             |        | R
                             | router |-------------
                             |________|             |
                                  | D               |
                                  |                 |
  VIP=192.168.1.110 (eth0:1, arps)|                 |
                             __________             |
                            |          |            |
                            | director |            |
                            |__________|            |
              DIP=10.0.1.1 (eth1) |                 |
                                  |                 | ^
  MAC_DIP->MAC_RIP1(CIP->VIP) |   |                 | | VIP->CIP
                              v   |                 | |
                                  |                 |
               -------------------------------------
               |                  |                  |
               |                  |                  |
        RIP1=10.0.1.2      RIP2=10.0.1.3      RIP3=10.0.1.4     (eth0)
        VIP=192.168.1.110  VIP=192.168.1.110  VIP=192.168.1.110 (all lo:0, non-arping)
         ______________     _____________      _____________
        |              |   |             |    |             |
        |lo:  CIP->VIP |   |             |    |             |
        |eth0:VIP->CIP |   |             |    |             |
        | realserver   |   | realserver  |    | realserver  |
        |______________|   |_____________|    |_____________|

LVS-DR setup and testing is the same as LVS-Tun except that all machines within the LVS-DR (ie the director and realservers) must be on the same segment (be able to arp each other). This means that there must be no forwarding devices between them i.e. they are using the same piece of physical layer hardware ("wire"), eg RJ-45, coax, fibre (there can be hub(s) or switch(es) in this mix). Communication within the LVS is by link-layer, using MAC addresses rather than IPs. All machines in the LVS have the VIP: only the VIP on the director replies to arp requests, the VIP on the realservers must be on a non-arping device (eg lo:0, dummy).

The restrictions for LVS-DR are

- The client must be able to connect to the VIP on the director.
- Realservers and the director must be on the same segment (piece of wire) (they must be able to arp each other) as packets are sent by link-layer from the director to the realservers.
- The route from the realservers to the client _cannot_ go through the director, i.e. the director cannot be the default gw for the realservers. (Note: the client does not need to connect directly to the realservers for the LVS to function. The realservers could be behind a firewall, but the realservers must be able to send packets to the client).
The return packets, from the realservers to the client, go directly from the realservers to the client and _do_not_ go back through the director. For high throughput, each realserver can have its own router/connection to the client/internet and return packets need not go through the router feeding the director. For more info see e-mail postings about LVS-DR topologies in the section More on the arp problem and topologies of LVS-DR and LVS-Tun LVS's. To allow the director to be the default gw for the realservers (e.g. when the director is the firewall), see . Note for LVS-DR (and LVS-Tun), the services on the realservers are listening to the VIP. You can have the service listening to the RIP as well, but the LVS needs the service to be listening to the VIP. This is not an issue with services like telnet which listen to all local IPs (ie 0.0.0.0), but httpd is set up to listen to only the IPs that you tell it. Normally for LVS-DR, the client is on a different network to the director/server(s), and each realserver has its own route to the outside world. In the simple test case below, where all machines are on the 192.168.1.0 network, no routers are required, and the return packets, instead of going out (the router(s)) at the bottom of the diagram, would return to the client via the network device on 192.168.1.0 (presumably eth0).
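The point above about services listening on the VIP comes down to which address a server socket is bound to. Here is a minimal sketch using Python's standard socket module; the helper name is made up, and 127.0.0.1 stands in for the VIP only so that the example runs on any machine.

```python
import socket


def listen_on(ip, port=0):
    """Open a TCP listener bound to one address (port 0 = any free port)."""
    s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    s.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    s.bind((ip, port))
    s.listen(5)
    return s


# telnet-style: bound to 0.0.0.0, accepts connections to any local IP,
# so a VIP on lo:0 is covered without further configuration
wildcard = listen_on("0.0.0.0")
wildcard_addr = wildcard.getsockname()[0]
wildcard.close()

# httpd-style: bound to one named IP only; on a realserver this
# would have to be the VIP (127.0.0.1 is a stand-in here)
specific = listen_on("127.0.0.1")
specific_addr = specific.getsockname()[0]
specific.close()
```

A service bound only to the RIP would never see the LVS packets, since they arrive with dst=VIP.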

How LVS-DR works

Here's part of the rc.lvs_dr script which configures the realserver with RIP=192.168.1.8

There's no forwarding in the conventional sense for LVS-DR (ip_vs does the forwarding on the director of the LVS packets). You can have ip_forward set to ON if you need it for something else, but LVS-DR doesn't need it ON. If you don't have a good reason to have it ON, then for security turn it OFF. For more explanation see /proc/sys/net/ipv4/ip_forward

With LVS-DR, the target port numbers of incoming packets cannot be remapped (unlike LVS-NAT). A request to port 23 (telnet) on the VIP will be forwarded to port 23 on a realserver, thus the RIP entry for the realserver in ipvsadm has no accompanying port. You can however re-map ports with iptables (see ).

Here's the packet headers as the request is processed by the LVS.

For the verbally oriented...

- A packet arrives from the client for the VIP (CIP:3456->VIP:23).
- The director looks up its tables and decides to send the connection to realserver_1.
- The director arps for the MAC address of RIP1 and sends a link-layer packet to that MAC containing an IP datagram with CIP:3456->VIP:23. This is the same src:dst as the incoming packet and the tcpip layer sees this as a forwarded packet. To allow this packet to be sent to the realserver, it is not necessary for forwarding to be on in the director (it is turned off by default in 2.2.x, 2.4.x kernels - turning it on is handled by the configure script).
- The packet arrives at realserver_1. The realserver recovers the IP datagram, looks up its routing table, finds that the VIP (on an otherwise unused, non-arping and nonfunctional device) is local. I'm not sure what exactly happens next, but I believe the Linux tcpip stack then delivers the packet to the socket listeners, rather than to the device with the VIP, but I'm out of my depth now.
- The realserver now has a packet CIP:3456->VIP:23, processes it locally, constructs a reply, VIP:23->CIP:3456.
- The realserver looks up its routing table and sends the reply out its default gw to the internet (or client). The reply does not go through the director.

The role of LVS-DR is to allow the director to deliver a packet with dst=VIP (the only arp'ing VIP being on the director), not to itself, but to some machine that (as far as the director knows) doesn't have the VIP address at all. The only difference between LVS-DR and LVS-Tun is that instead of putting the IP datagram inside a link-layer packet with dst=MAC of the RIP, for LVS-Tun the IP datagram from the client CIP->VIP is put inside another IP datagram DIP->RIP.

The use of the non-arping lo:0 and tunl0 to hold the VIP for LVS-DR and LVS-Tun (respectively) is to allow the realserver's routing table to have an entry for a local device with IP=VIP _AND_ so that other machines can't see this IP (ie it doesn't reply to arp requests). There is nothing particularly loopback about the lo:0 device that is required to make LVS-DR work, any more than there is anything tunnelling about a tunl0 device. For 2.0.x kernels, a tunnel packet is de-capsulated because it is marked type=IPIP, and will be decapsulated if delivered to an lo device just as well as if delivered to a tunl device. The 2.2.x kernels are more particular and need a tunl device (see "Properties of devices for VIP").
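The difference between the two forwarding methods can be sketched with two toy data structures. This is a model of the headers only, under the description above; the class and function names are invented for illustration, and the MAC address is a placeholder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IPDatagram:
    src: str
    dst: str
    payload: object = None


@dataclass(frozen=True)
class Frame:
    dst_mac: str
    datagram: IPDatagram


def forward_dr(datagram, realserver_mac):
    """LVS-DR: the IP datagram is untouched, only the link-layer
    destination is set to the MAC address of the chosen realserver."""
    return Frame(dst_mac=realserver_mac, datagram=datagram)


def forward_tun(datagram, dip, rip):
    """LVS-Tun: the client's datagram becomes the payload of a
    new IP datagram DIP->RIP."""
    return IPDatagram(src=dip, dst=rip, payload=datagram)


request = IPDatagram(src="CIP", dst="VIP")

dr_frame = forward_dr(request, "00:11:22:33:44:55")   # placeholder MAC
tun_packet = forward_tun(request, dip="DIP", rip="RIP")
```

In both cases the realserver ends up holding the original CIP->VIP datagram, which is why it needs a local (non-arping) device with the VIP in order to accept it.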

Handling the arp problem for LVS-DR

VIP on lo:0 The VIP on the realservers must not reply to arp requests from the client (or from the router between the client and the director).

Realservers with Linux 2.2.x kernels The loopback device does not reply to arp requests by default on any OS except Linux with 2.2.x or 2.4.x kernels. On those kernels it replies even when you use -noarp with ifconfig. You may need to do something if you are running a realserver with a 2.2.x or 2.4.x kernel (see the ).

Lars' method This requires hiding the VIP on the realservers, by putting them on a separate network. Lars set this up first on LVS-Tun. Here it is for LVS-DR. The director has 2 NICs and the realservers are on a different network (10.1.1.0/24) to the VIP (192.168.1.0/24). All IPs reply to arps. The router/client cannot route to the realserver network and the RIPs do not need to be internet routable. Since the director has 2 NICs, in the lvs_dr.conf file, set the DIP to eth1.

Transparent Proxy (TP or Horms' method) - not having the VIP on the realserver at all. The subject of transparent proxy has its own section.

LVS-DR scales well Performance tests (75MHz pentium classics, on 100Mbps network) with LVS-DR on the performance page (http://www.linuxvirtualserver.org/Joseph.Mack/performance/single_realserver_performance.html) showed the rate limiting step for LVS-DR director forwarding packets to the realservers. LVS doesn't add any detectable latency or change the throughput of forwarding. There is little load on the director operating at high throughput in LVS-DR mode. Apparently little computation is involved in forwarding. In the early days of LVS, we expected the director in LVS-DR to be lightly loaded because it was receiving only small packets from the client (e.g. get index.html or get largefile.tar.gz) while the realservers were delivering the large files to the client via their router. We expected that the fan-out (number of realservers handled by a director) would be in the ratio of the filesize sent to the client compared to the requestsize from the client. It is true that the measured bandwidth coming in to the director is smaller than the output from the realservers. Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 06/06/2005

I have 38 realservers behind my director, incoming traffic (to director) goes up to 20Mb/s, outgoing (from realservers LVS-DR setup) up to 60Mb/s. I have about 1200 sites hosted. 36 virtual_server entries in keepalived.conf, 30 VIPs. There's no noticeable load on the poor PIII/700 director that's handling the traffic.

However we have since realised that network hardware is specified in packets/sec and not Mbps (see ) and that every outgoing packet from a realserver is matched by an incoming packet to the director (possibly just an <ack>). The director then is passing the same number of network packets as all the realservers together. Once the incoming network traffic to the director reaches 8000pps (for 100Mbps FE), the director is saturated. LVS-DR does get good fan-out (one director supporting many realservers) but the reason is not the one we originally thought. The good fan-out is because the director only has to handle network traffic, while a realserver may have to go to disk or to compute before it can produce its packets. The fan-out then is the ratio of time that the realservers need to produce the packet payload, compared to the time it takes to transmit them.
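As a rough check on the numbers above, here is the arithmetic as a short sketch. The 1500 byte packet size is my assumption (full-size ethernet payloads, ignoring framing overhead); it is not stated in the text, but it gives a figure close to the 8000pps quoted. The times passed to fan_out() are made-up values for illustration.

```python
def packets_per_second(link_bps, packet_bytes):
    """Packet rate that fills a link with packets of one size."""
    return link_bps / (packet_bytes * 8)


def fan_out(produce_time, transmit_time):
    """Ratio of the time a realserver needs to produce a packet's
    payload to the time needed to transmit it."""
    return produce_time / transmit_time


# 100Mbps fast ethernet, assumed 1500 byte packets
full_size_rate = packets_per_second(100_000_000, 1500)

# hypothetical: 10 time units to produce a payload, 1 to transmit it
example_fan_out = fan_out(10.0, 1.0)
```

With smaller packets the packet rate needed to fill the link is higher, which is why the limit is better stated in packets/sec than in Mbps.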

LVS-DR director as default gw for realservers, transparent proxy and Julian's martian and forward_shared patches

In the case where the director is the firewall for the realserver network, the director has to be the default gw for the realservers. The reply packet from the realserver to the client (VIP->CIP) then goes through the director (which has a device with IP=VIP). The director then is being asked to route a packet from outside, that has a src address that is on the director. Normally this is not allowed and such illegal packets are called martians. Here's from rfc1812 from Ken Chase math (at) velocet (dot) ca 14 May 2003, posting to the beowulf mailing list.

    name);
                    if (dev->hard_header_len) {
                            int i;
                            unsigned char *p = skb->mac.raw;
                            printk(KERN_WARNING "ll header: ");
                            for (i = 0; i < dev->hard_header_len; i++, p++) {
                                    printk("%02x", *p);
                                    if (i < (dev->hard_header_len - 1))
                                            printk(":");
                            }
                            printk("\n");
                    }
            }
    #endif
            goto e_inval;
    }

5.3.7 Martian Address Filtering

An IP source address is invalid if it is a special IP address, as defined in 4.2.2.11 or 5.3.7, or is not a unicast address.

An IP destination address is invalid if it is among those defined as illegal destinations in 4.2.3.1, or is a Class E address (except 255.255.255.255).

A router SHOULD NOT forward any packet that has an invalid IP source address or a source address on network 0. A router SHOULD NOT forward, except over a loopback interface, any packet that has a source address on network 127. A router MAY have a switch that allows the network manager to disable these checks. If such a switch is provided, it MUST default to performing the checks.

A router SHOULD NOT forward any packet that has an invalid IP destination address or a destination address on network 0. A router SHOULD NOT forward, except over a loopback interface, any packet that has a destination address on network 127. A router MAY have a switch that allows the network manager to disable these checks. If such a switch is provided, it MUST default to performing the checks.

If a router discards a packet because of these rules, it SHOULD log at least the IP source address, the IP destination address, and, if the problem was with the source address, the physical interface on which the packet was received and the Link Layer address of the host or router from which the packet was received.

Martian Filtering A packet that contains an invalid source or destination address is considered to be martian and discarded.

Horms horms (at) vergenet (dot) net

The problem is that with Direct routing the reply from the real server has the vip as the source address. As this is an address of one of the interfaces on the director it will drop it if you try and forward it through the director. It appears from experimentation with /proc/sys/net/ipv4/conf/*/rp_filter that at least on 2.2.14, there is no way to turn this behaviour off. (for more info on rp_filter see the .)

This type of packet is called a "source martian" and is dropped by the director. Martians can be logged by setting /proc/sys/net/ipv4/conf/all/log_martians to 1. There are 3 solutions to this; 2 by Julian and 1 by Horms.
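The source martian test can be modelled as below. This is a sketch of the rule as described here and in the rfc1812 excerpt, not the kernel's route lookup; the function name and the example addresses are assumptions for illustration.

```python
import ipaddress

NETWORK_ZERO = ipaddress.ip_network("0.0.0.0/8")


def is_source_martian(src, local_ips):
    """True if a router should refuse to forward a packet with this
    source address (illustrative model only)."""
    addr = ipaddress.ip_address(src)
    # rfc1812: network 0, network 127 and non-unicast sources are invalid
    if addr in NETWORK_ZERO or addr.is_loopback or addr.is_multicast:
        return True
    # the LVS-DR case: the packet claims to come from one of our own IPs
    return src in local_ips


director_ips = {"192.168.1.110", "192.168.1.1"}   # VIP and DIP

# reply from a realserver, VIP->CIP, arriving at the director
reply_is_martian = is_source_martian("192.168.1.110", director_ips)

# an ordinary packet from some other host
other_is_martian = is_source_martian("10.1.2.3", director_ips)
```

The first two solutions below avoid this test by not having the VIP on the director or by relaxing the test for selected interfaces.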

Director has 1 NIC, accepts packets via transparent proxy. If the director accepts packets for the VIP via transparent proxy, then the director doesn't have the VIP and the return packets are processed normally. (Note: transparent proxy only works on the director for 2.2.x kernels - update early 2003, patches are now available for 2.4.x kernels). Here's Julian's posting

Router: transparent proxy for VIP (or all served VIPs). The ISP must feed your Director with packets for your subnet 199.199.199.0/24. LVS-DR mode (Yes, LVS-DR, this is not a mistake). eth1: 199.199.199.2. default gw is ISP.

Real server(s): nothing special. VIP on hidden device or via transparent proxy. eth0: 199.199.199.3. default gateway is 199.199.199.2 (the Director)

This is a minimum required config. You can add internal subnets yourself using the same physical network (one NIC) or by adding additional NICs, etc. They are not needed for this test. Packets from the realservers with saddr=VIP will be forwarded from the director because VIP is not configured in the Director. We expect that this setup is faster than VS/NAT.

Julian's martian modification (forward_shared)

In normal LVS-DR, the packets returning from the realservers (which have a src_addr=VIP) are routed to anywhere but the director. Normally packets with src_addr=VIP are rejected as source martians on the director, because the director has the VIP as a local IP. The martian modification patch allows the director to be the default gw for packets from the realservers. Since packets with src_addr=VIP are now allowed from the realserver, spoofed packets from the outside, with src_addr=VIP, must be disallowed (you can use filter rules in combination with the name of the NIC connecting to the router - assuming it's a different NIC to the one connecting to the realservers).

The original name I gave this patch is "martian modification". Julian's name for it is "forward_shared". Both names are used in the HOWTO.

To download the patches and to read Julian's notes on using the director as a gateway for realserver in LVS-DR/Tun, see LVS director as gateway in Direct Routing and Tunnel Setups. (The dates on the files are the creation dates, not the last modified. Thus the file for the 2.4.26 kernel, current in May 2004, has a date in 2001.) (see earlier for an explanation of "source martians" .) Also see for an alternate solution to the martian problem.

The martian modification is currently (since Aug 2001) implemented with the hidden-forward_shared-xxx.diff patch. This patch has the hidden (for realservers) and forward_shared (for directors) patch and can be applied to both realservers and directors. (Remember for the director you need the ipvs patch too). The forward_shared patch will not be going into the kernel code (you'll always have to apply the patch) as some kernel people don't like the idea of allowing source martian packets.

This is a kernel patch, director has 2 NICs (doesn't work with one NIC), VIP is on outside NIC. After applying the patch, for a test, use the default values for */rp_filter(=0).
This allows realservers to send packets with saddr=VIP and daddr=client through the Director. If the patch is applied and external_eth/rp_filter is 0 (which is the default) the realservers can receive packets with saddr=any_director_ip and dst=any_RIP_or_VIP which is not very good. On the external net, set rp_filter=1 for better security.

Here's the test setup

192.168.1.1 is the normal router. For the test it was put on the director instead (as an alias). The director has 2 NICs, with forwarding=on (client and realservers can ping each other). Director runs linux-0.9.8-2.2.15pre9 unpatched or with Julian's patch. LVS is set up using the configure script, redirecting telnet, with rr scheduling to 3 realservers. The realservers were running 2.0.36 (1) or 2.2.14 (2). The arp problem was handled for the 2.2.14 realservers by permanently installing in the client's arp table, the MAC address of the NIC on the outside of the director, using the command arp -f /etc/ethers.

The director was booted 4 times, into unpatched, patched, unpatched and patched. After each reboot the lvs scripts were run on the director and the realservers, then the functioning of the LVS tested by telnet'ing multiple times from the client to the VIP. For the unpatched kernel, the client connection hung and inactive connections accumulated for each realserver. For the patched kernel, the client telnet'ed to the VIP connecting with each realserver in turn.

The configure script will set up the modified LVS-DR (and will warn you that you need the patch for this to work). Setup details are in the performance page

Martian modification performance Performance has similar latency to LVS-NAT but the load is low on the director at high throughput of LVS-DR (see the performance page).

questions mstockda (at) logicworks (dot) net

Which interfaces need forward_shared? the interface on the realserver lan _and_ the external side?

Julian Anastasov ja (at) ssi (dot) bg 15 Mar 2002 No, you just enabled the feature which works only for the already selected interfaces. Check it with You should enable forward_shared only for interfaces attached to internal mediums (hubs) and of course, only where it is needed.

cnf.forward_shared && ipv4_devconf.forward_shared)
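Julian's answer, together with the fragment "cnf.forward_shared && ipv4_devconf.forward_shared", says that the per-device flag and the global flag must both be set before an interface forwards shared addresses. Here is that relationship as a sketch; it models only the fragment as quoted, and the function names and interface flags are invented for illustration.

```python
def forward_shared_active(all_flag, dev_flag):
    """Both the global (all/) and the per-device flag must be set."""
    return bool(all_flag) and bool(dev_flag)


def interfaces_forwarding_shared(all_flag, dev_flags):
    """Names of the interfaces on which forward_shared takes effect."""
    return sorted(
        name for name, flag in dev_flags.items()
        if forward_shared_active(all_flag, flag)
    )


# eth0 faces the router, eth1 faces the realservers (assumed layout)
dev_flags = {"eth0": 0, "eth1": 1}

# global flag set: only the already selected interface is affected
enabled = interfaces_forwarding_shared(1, dev_flags)

# global flag clear: nothing is affected, whatever the device flags say
disabled = interfaces_forwarding_shared(0, dev_flags)
```

Under this reading, setting the flag on the external interface as well is what would open the director to spoofed packets with saddr=VIP.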

LVS-DR director as default gw by bridging: the difference between "transparent proxy" and "proxy arp" proxy arp and bridging were discussed in the early days of LVS as a way of allowing the director to be the default gw for LVS-DR. The subject came up again in a thread on another topic. Also see . Nicolas Chiappero Nicolas (dot) Chiappero (at) estat (dot) com 28 Jan 2003

Is "proxy arp" and "transparent proxy" the same thing?

Joe Both allow routing of packets in ways not allowed by the normal routing tables. TP allows a machine to accept (rather than forward) packets for which it is not the destination. This originally was written so that a local squid would accept packets destined for a remote httpd server. proxy arp allows a host to reply to arp requests, telling the requestor that it has an IP locally, when in fact the IP is on another machine. This is useful to alter routing (eg for transparent bridging).

Julian
- proxy ARP is used when the traffic should be routed at Layer 3 with the help from ARP. The packets reach the routing after the box answers ARP probes asking for foreign addresses.
- transparent proxy has mostly Layer 5-7 semantics, it is used to intercept traffic destined to foreign addresses and to deliver it to sockets.

- If so, I found a document (http://www.sjdjweis.com/linux/proxyarp/) explaining how to do proxy arp on a 2.4 kernel. Will this method be compatible with LVS as long as director would also be the default GW for realservers?

No. The spoofing checks performed from routing will drop the traffic.

Linux Bridging: Here, the traffic from realservers to the ROUTER passes only Layer 2, i.e. the routing is not reached and you avoid the spoofing checks.

forward shared: If you don't want Bridging or the link to the ROUTER is not ARP aware, then you can use solutions that avoid the spoofing checks for this traffic. One of them is the forward_shared flag (Solution 2). With the forward_shared patch applied and with eth1 as the private interface, in the forward_shared directory of the /proc filesystem you set

Why the forward_shared patch is not in the kernel Julian 16 Nov 2006

Pros:
- saves one extra patching

Cons:
- useful only for setups which share IPs
- very dangerous!!! That was the first concern by Alexey Kuznetsov. I see people blindly use echo 1 > all/VAR_NAME without considering what the relation is between all/VAR_NAME and DEV_NAME/VAR_NAME. I saw this many times. forward_shared should be applied only on trusted interfaces, and setting all/ to 1 opens the door for spoofing/loop attacks.
- it is another hack in routing. Not sure if all changes are entirely correct.

So, my opinion is 30% (below 50%) for inclusion. Maybe it is a good idea to have one diff with all the IPVS patches not included in mainline. Then IPVS users will have to patch only once. Now we don't even have this option linked from a visible place on the web.

Accepting packets on LVS-DR director by fwmarks Horms allows the director to accept packets by fwmark. There is no VIP required on the director.

security concerns: default gw(s) and routing with LVS-DR/LVS-Tun The material here came from a talk by Herbie Pearthree of IBM (posting 2000-10-10) and from a posting by TC Lewis (which I've lost). In normal IP communication between two hosts, the routing is symmetrical: each end of the link has an ethernet device with an IP and a route to the other machine. Packets are transmitted in pairs (an outgoing packet and a reply, often just an ACK). In LVS-DR or LVS-Tun the roles of the two machines are split between 3 machines. Here is a two network test setup, with the client in the position normally occupied by the router. In production, the client will have a public IP and connect via a router. (This is my test setup. A big trap in this setup is that services which make calls from the RIP, e.g. and rshd will work, but will fail in production, where the RIP will not be routable).

[Diagram: two network test setup. Realserver: RIP=192.168.1.2 (eth0), VIP=192.168.2.110 (lo:0, no_arp).]

Director's default gw The client sends a packet to the VIP on the director. In a normal exchange of packets between a pair of machines, the director would send a reply packet back to the client. With an LVS, the director's response instead is a packet to the MAC address of the RIP. Except for ICMP packets (which are only sent in error conditions), the VIP on the director never sends packets back to the client, it only sends packets to the realservers. A default gw for the director is not needed for the functioning of the LVS. Having a default gw would only allow the VIP on the director to reply to packets from the internet, such as port scans, creating a security hazard. The director doesn't need and shouldn't have a default gw.

There are pathological conditions when the VIP needs to reply to the client. If the realserver goes down, the director will issue ICMP "host unreachable" packets, till a new realserver is switched in by mon or ldirectord. (If you have a long lived tcp connection, e.g. with telnet or https, the new realserver will be getting packets for a connection which it doesn't know about, and it will issue a tcp reset. This reset will go out the default gw for the realserver and the client's session will hang or drop.)

If you're using the director for other functions (DNS, firewall...), then packets will need to return to the internet. If you wanted security, you could use the iproute2 tools to allow only the DNS replies to use the default route. An example of doing this is the routing used for realservers. Julian Anastasov ja (at) ssi (dot) bg 30 Aug 2001

It may be these ICMPs are not fatal if they are not sent. This is true when LVS is used in transparent proxy setups and particularly in 2.4 where there is no real transparent proxy support. There icmp_send() does not send any packets when there is no running squid. But maybe the original email sender wanted to use the LocalNode feature together with a DR setup, IIRC. I see that the configure script does not have configs for such setups with mixed forwarding methods. So, as you said, the users with more knowledge can select another way to build their setup. And they will know when they need a default gateway :) If the mtu is not matched between the router and the director, the director will need to send ICMP "fragmentation needed" packets back to the router. This is a bad setup.

You could enable default routing for icmp, but not tcp or udp from the VIP back to the router, by using iproute2.
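One possible way to do this, sketched under assumptions: the function below only prints the iproute2 commands, it does not run them. It relies on the `ipproto` selector of `ip rule`, which needs a much newer kernel and iproute2 than the ones this HOWTO was written for; on older systems another mechanism would be needed. The gateway address, the table number 100 and the function name are placeholders.

```shell
#!/bin/sh
# Print (do not run) iproute2 commands that give ICMP from the VIP its own
# default route, leaving tcp and udp from the VIP without one.
# Usage: icmp_only_default_route VIP GATEWAY [TABLE]
icmp_only_default_route() {
    vip=$1 gw=$2 table=${3:-100}
    # A separate routing table holding nothing but a default route.
    echo "ip route add default via $gw table $table"
    # Only ICMP sourced from the VIP is sent to that table.
    echo "ip rule add from $vip ipproto icmp lookup $table"
}

# VIP from the test setup; 192.168.2.254 is a made-up router address.
icmp_only_default_route 192.168.2.110 192.168.2.254
```

Printing the commands first lets you review them before pasting them into a root shell on the director.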

Realserver's default gw, route to realserver from router The realserver doesn't reply to the director, instead it sends its reply to the client. The realserver requires a default gw (here 192.168.1.154), but the client/router never replies to the realserver, the client/router sends its replies to the director. So the client/router doesn't need a route to the realserver network. To have one would be a security hazard. The realserver now can't ping its default gw (since there's no route for the reply packet), but the LVS still works. The flow of packets around the LVS-DR LVS is shown by the ascii arrows. When an attacker tries to access the nodes on the LVS, it can only connect to the LVS services on the director. It can't connect to the realserver network, as there is no routing to the realservers (even if it gets access to the router). Presumably the realservers are not accessible from the outside as they'll be on private networks anyhow. Note that for Julian's , the director will need a default gw.

routing to realserver from director you don't need routing from the server_gw to the realservers either, see route to realserver. If you are only using the link between the director and realserver for LVS-DR packets (i.e. you aren't telnet or ssh'ing from the realserver to the director for your admin, and you aren't copying logs from one machine to another), then you don't need an IP on the interface on the director which connects to the realserver(s). tc lewis tcl (at) bunzy (dot) net 12 Jul 2000 (paraphrased)

I would like to send packets from the LVS-DR director to the realservers by a separate interface (eth2), but not assign an IP to this interface. Normally I put a 192.168.100.x ip on eth2, but without it, route add -net 192.168.100.0 netmask 255.255.255.0 dev eth2 just gives me an error about eth2 not existing. I just want to save an extra IP. What i'm asking is: does the director's eth2 need an ip on 192.168.100.0/24, or can i just somehow add that route to that interface to tell the machine to send packets that way? With lvs, the realservers are never going to care about the director's interface ip, since there's no direct tcp/ip connections or anything there, but it looks like it still needs an ip anyway. If all that that interface is doing is forwarding outgoing packets from the director via the dr method, then i don't see why it needs an ip address.

Ted Pavlic tpavlic (at) netwalk (dot) com You basically want to do device routing. There's nothing special about this -- many routers do it... NT even does it. So does Linux. Your original route command should work as long as you've brought up eth2. Now tricking Linux into bringing up eth2 without an address might be the hard part. Try this: tc lewis tcl (at) bunzy (dot) net

then the route did work. I tried that before with a netmask but it didn't work.
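A sketch of the two steps discussed in this exchange: bring the interface up with address 0 (as Ted explains in the next posting), then add the device route from TC's question. The function only prints the commands; its name is made up for illustration.

```shell
#!/bin/sh
# Print (do not run) the commands for an IP-less NIC carrying a device route.
# Usage: ipless_nic_commands DEV NETWORK NETMASK
ipless_nic_commands() {
    dev=$1 net=$2 mask=$3
    # Bring the interface up with the "0" address so the kernel accepts it.
    echo "ifconfig $dev 0.0.0.0 up"
    # Device route: send packets for the network out of DEV, no gateway.
    echo "route add -net $net netmask $mask dev $dev"
}

ipless_nic_commands eth2 192.168.100.0 255.255.255.0
```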

Ted Pavlic tpavlic (at) netwalk (dot) com Remember that IP=0 actually is IP=0.0.0.0, which is another name for the default route. The reason why IP=0 is 0.0.0.0 ... Remember that each IP address is simply a 4-byte unsigned integer, right? Well... the easiest way to envision this is to imagine that an IP is just like a base-256 number. For example: 216.69.192.12 = 216*256^3 + 69*256^2 + 192*256^1 + 12*256^0, which is equal to 3628449804. So... telnet 216.69.192.12 25 is the same as: telnet 3628449804 25

0.0.0.0 is just a special system address which is the same as the default route. Making a route from 0.0.0.0 to some gateway will set your default route equal to that gateway. That's all "route add default gw ..." does. Don't believe me? Do a route -n. So when I told TC to put 0 on his IP-less NIC, I was just choosing a system IP that I knew would not ever need to be transmitted on. Linux wanted an IP to create the interface... so I gave it one -- the IP of the default gateway. Packets would never need to leave the system going to 0.0.0.0, and Linux has to listen to this address ANYWAY, so you might as well explicitly put it on an interface.

What would have also worked (and might have been a better idea) would be to put 127.0.0.1 on that interface. That is another system address that Linux will listen to anyway if loopback has been turned on... and it should never transmit anything away from itself with that as the destination address, so it's safe to put it on more than one interface. The only reason I chose 0 over 127.0.0.1 is because 0 is easy... It's small... It's quick. Whenever I want to telnet to my localhost's port blah I just telnet to 0, because I'm lazy.. (Linux sees 0, interprets 0.0.0.0, sees an address it listens to, and basically treats 0 like a loopback) Also you'll notice that if you give an interface 0.0.0.0 as an IP address and do an ifconfig to get stats on that interface, it will still retain no IP address. Another perquisite of using 0.0.0.0 in TC's particular situation. 
It may actually cause less confusion in the end.
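The base-256 arithmetic Ted describes can be checked with a few lines of shell. This is only an illustration; the function name `ip_to_int` is made up.

```shell
#!/bin/sh
# Convert a dotted-quad IPv4 address to its unsigned integer form by
# treating the four octets as the digits of a base-256 number.
ip_to_int() (
    # Runs in a subshell, so changing IFS does not leak to the caller.
    IFS=.
    set -- $1          # split "a.b.c.d" into $1 $2 $3 $4
    echo $(( $1 * 256 * 256 * 256 + $2 * 256 * 256 + $3 * 256 + $4 ))
)

ip_to_int 216.69.192.12    # the address from the posting: 3628449804
ip_to_int 0.0.0.0          # "IP=0"
```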

LVS-DR, LVS-Tun need rp_filter=0 This applies on the director for both LVS-DR and LVS-Tun.

Brandon Yap byap (at) xss (dot) com (dot) au 21 Feb 2004 I found the problem. rp_filter needed to be turned off on tunl0: echo 0 > /proc/sys/net/ipv4/conf/tunl0/rp_filter

Joe - 0 is the default value for rp_filter, as specified in RFC1812 (for more on RFC1812 see and ). From postings on the LVS mailing list, it seems that some of the market enhanced kernels (e.g. Debian) have changed the default. (They wouldn't make any money if their kernels behaved in the expected way ;-\ .) You need to file a bug report with the supplier of your kernel.

Ratz 13 Nov 2006 ... $i done

Ratz 21 Jan 2006 You would be referring to the following snippet in the RFC, right? 5.3.8 Source Address Validation

A router SHOULD IMPLEMENT the ability to filter traffic based on a comparison of the source address of a packet and the forwarding table for a logical interface on which the packet was received. If this filtering is enabled, the router MUST silently discard a packet if the interface on which the packet was received is not the interface on which a packet would be forwarded to reach the address contained in the source address. In simpler terms, if a router wouldn't route a packet containing this address through a particular interface, it shouldn't believe the address if it appears as a source address in a packet read from this interface. If this feature is implemented, it MUST be disabled by default.

So if I read this correctly, /proc/../conf/{all,default}/rp_filter must be off on a freshly booted kernel without any explicit user changes in any of the rc boot scripts. I've checked on a Debian installation of one of our customers: I have to assume these are the default settings, which then get set in /etc/init.d/networking via doopt() (completely brain-dead redundant information). Reading spoofprotect_rp_filter() in /etc/init.d/networking I have to assume that the person maintaining this piece of software has not understood the network related settings (besides showing horrible programming practice) in proc-fs under Linux: ... (This should be s/*/default/ to match at least the wrong comment) echo 1 > $f done return 0 else return 1 fi }

On top, good programming practice would be to explicitly set the other values you take for granted to 0, since an operator could have accidentally set some proc-fs values to test something and did not make it reboot-safe. Debian is and will remain a system for people with a lot of spare time. Folks: rp_filter has almost nothing to do with proper network security! If source validation has to be done, make sure you route properly. It's funny, Debian people would only need to have a look at SuSE or Red Hat to see how one can do the networking setup a tad bit better.

Jacob Coby jcoby (at) listingbook (dot) com 20 Feb 2004 Could you do me a favor, and turn rp_filter ON, and ping the VIP with both normal sized ping packets, and very large (>MTU). And then, turn rp_filter OFF and try it again? I'm thinking this is the reason I was having trouble getting lvs-tun to work with packets of size >MTU (see ). rp_filter is about the only /proc entry I didn't lookup and try fiddling with. From the adv-routing HOWTO (http://www.ibiblio.org/pub/Linux/docs/HOWTO/Adv-Routing-HOWTO)

".. if a packet arrived on the Linux router on eth1 claiming to come from the Office+ISP subnet, it would be dropped. Similarly, if a packet came from the Office subnet, claiming to be from somewhere outside your firewall, it would be dropped also."

I think LVS-TUN packets claim to be from the outside world, but come from the subnet, don't they? Joe: in test situations where both the director and realservers are on the same bench, tunnelled packets from the director to the realservers are from the same netmask. However in real life, the director and realservers can be on different continents and will be in different networks. The decapsulated packet is from the client. Guy Coates gmpc (at) sanger (dot) ac (dot) uk 03 Nov 2004

I'm running into problems using LVS-DR when using a private network to route traffic from the director to the realservers. eth0 on both machines are on the same segment, and eth1 on both machines are connected via a crossover cable. All client traffic comes in and out via the public network. If I route director->realserver traffic over eth0, everything works as it should. If I route director->realserver traffic via the private network, things don't. The director routes the incoming traffic correctly, but the realserver drops the packets on the floor. tcpdump on the realserver confirms that the director is correctly passing the packets to the realserver: ... 172.17.22.216.80: S 2236244704:2236244704(0) win 5840

However, the realserver does not pick up the packet. I'm using kernel 2.4.27+hidden arp patches on both realserver and director.

Unknown You're not running with one of the anti-spoofing controls switched on, are you? For the life of me I can't remember which sysctl this is (rp_filter?) but that would exhibit this type of behaviour if set.

Ahh yes, it looks as if Debian handily sets /proc/sys/net/ipv4/conf/default/rp_filter to 1 by default. Setting that to zero on the realserver makes everything spring into life.

Simon Detheridge simon (at) widgit (dot) com 30 Oct 2006 I had two LVS-Tun directors. One worked, one didn't. Yeah. I did a "cp -r /proc/sys ~/" on each machine, then a recursive diff on the results. Looks like somehow during the upgrading and ensuing tinkering, rp_filter got set to "1" on the backup director. Setting it to "0" seems to have made the bad behaviour go away. I thought I'd checked this, but must have only done it on the realservers.
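The postings above all come down to finding an rp_filter entry that is 1 and setting it to 0. A minimal sketch of that check follows; the function names are invented, and the conf directory is a parameter so the loop can be tried on a scratch directory before it is pointed at /proc. How the all/ entry and the per-device entry combine differs between kernel versions, so this sketch simply reports and zeroes every entry it finds.

```shell
#!/bin/sh
# Show every rp_filter entry under a conf directory.
# Usage: rp_filter_report [CONF_ROOT]
rp_filter_report() {
    conf_root=${1:-/proc/sys/net/ipv4/conf}
    for f in "$conf_root"/*/rp_filter; do
        [ -e "$f" ] || continue      # the glob matched nothing
        echo "$f = $(cat "$f")"
    done
}

# Write 0 to every rp_filter entry (all, default and each device).
# Usage: rp_filter_off [CONF_ROOT]
rp_filter_off() {
    conf_root=${1:-/proc/sys/net/ipv4/conf}
    for f in "$conf_root"/*/rp_filter; do
        [ -w "$f" ] || continue      # skip missing or read-only entries
        echo 0 > "$f"
    done
}

# Example (as root): rp_filter_report; rp_filter_off; rp_filter_report
```

Run the report on the director and on each realserver; as Simon found, it is easy to check one and forget the other.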

Director as client in LVS-DR The LVS-mini-HOWTO states that the lvs client cannot be on the director or any of the realservers, i.e. that you need an outside client. This restriction can be relaxed under some conditions.

Joshua Goodall joshua (at) myinternet (dot) com (dot) au 11 May 2004 I want to setup the situation where the director is one of the clients. It appears that LVS does not intercept the outbound packet when it originates on the director itself. This is with both fwmark and a configured VIP:port. I've also tried adding -j REDIRECT in the OUTPUT chain, to no avail. If I bring up the VIP on the director, I see the packet when tcpdumping localhost, but LVS doesn't grab it. Oddly, the packet is still on localhost even when the VIP is on eth0. It seems that ip_vs_in ignores the packet if the device is loopback_dev. Questions then: Why test for loopback_dev at all? Is this important, or is it just supposed to be an optimisation? Can we fool ip_vs to fill skb->dev with something other than &loopback_dev if the director is the client? I tried this patch (2.4.26)

- if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
+ if (skb->pkt_type != PACKET_HOST) {
IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n", skb->pkt_type, iph->protocol, ...

then added to the existing and now my fwmark-based LVS-DR director does the job for clients and for itself. To make LVS-NAT work, we'd also need to be able to choose the masqueraded source address, which would be a much longer diff. I didn't try LVS-Tun, but that would probably be workable like LVS-DR.

Julian So, now you can send packets in form DIP->VIP to real servers (LVS-DR method)? I'm wondering how your patched director accepts reply packets for the LVS'ed service from the realserver in the form VIP->DIP. Linux has source address validation and you can not disable it for packets with saddr=local_ip. 
I see that you can remove the limitation when sending packets but how do you accept the normal LVS replies from the realservers? Maybe you do not have the VIP configured as IP address?

Joshua There is no VIP. For regular (external) clients, I'm using fwmark + iproute2 to grab packets intended for the DIP; to capture locally sourced packets, I just put a -j REDIRECT into the OUTPUT chain of the nat table.

Julian There can be another problem in 2.4 (2.6 seems to handle this properly): ip_vs_skb_cow does not expect skbs to have valid skb->sk. Maybe the skb should be copied (skb_copy) instead of reallocating only its data. You can check for problems by using tcpdump -i lo. Make sure there are no crashes or any kind of memory leaks because I personally cannot test such a setup. For 2.6 you can also remove the skb->sk check.

from the mailing list tc lewis has NAT'ed out .

rewriting, re-mapping, translating ports with LVS-DR see

LVS: LVS-Tun

LVS-Tun Intro LVS-Tun is an LVS original. It is based on LVS-DR. The LVS code encapsulates the original packet (CIP->VIP) inside an ipip packet of DIP->RIP, which is then put into the OUTPUT chain, where it is routed to the realserver. (There is no tunl0 device on the director; ip_vs() does its own encapsulation and doesn't use the standard kernel ipip code. This possibly is the reason why PMTU on the director does not work for LVS-Tun - see .) The realserver receives the packet on a tunl0 device (see need tunl0 device) and decapsulates the ipip packet, revealing the original CIP->VIP packet. Initially only Linux could decapsulate IPIP packets, but FreeBSD and w2k can now do it too (hmm 2005, Microsoft has dropped support for IPIP).

If you want to try a test LVS-Tun setup on the bench, take a standard LVS-DR setup, change lo on the realservers to tunl0 (and handle the ARP problem on tunl0) and change the ipvsadm switch from -g to -i. If your clients are going to be sending large packets, you need to set the MTU (see for the ipip packet DIP->RIP). This can be done on the realserver with iptables (see ) or iproute2 (see ).

As with LVS-DR, the director doesn't know about the VIP on the realserver (it only knows about the RIP). Health checking of a service listening on the VIP on the realserver then must use a connection between the DIP and the RIP (if the daemon is listening on both the RIP and VIP, the service listening on the RIP can be a proxy for the service listening on the VIP).

LVS-Tun allows the realservers to be geographically remote from the director (this is the main point of LVS-Tun). If your realservers cannot do ipip decapsulation, you can still have geographically remote realservers using other techniques (see ). (see also Julian's LVS-Tun write up and postings to the mailing list).

LVS-Tun example setup Here's an example set of IPs for a LVS-Tun setup. For (my) convenience the servers are on the same network as the client. The only restrictions for LVS-Tun with remote hosts are that the client must be able to route to the director and that the realservers must be able to route to the client (the return packets to the client come directly from the realservers and do not go back through the director). Normally for LVS-Tun, the client is on a different network to the director/server(s), and each server has its own route to the outside world. In the simple test case below where all machines are on the 192.168.1.0 network there would be no default route for the servers, and routing for packets from the servers to the client would use the device on the 192.168.1.0 network (presumably eth0). In real life, the realservers would have their own router/connection to the internet and packets returning to the client would go through this router. In any case reply packets do not go back through the director.

[Diagram: LVS-Tun test setup on one network.
director: VIP=192.168.1.110 (eth0:1, arps), DIP=192.168.1.1 (eth0)
realservers: RIP1=192.168.1.2, RIP2=192.168.1.3, RIP3=192.168.1.4 (eth0); VIP=192.168.1.110 on each (all tunl0, non-arping)
packet flow: client to director CIP->VIP; director to realserver DIP->RIP(CIP->VIP); realserver to client VIP->CIP]

Here's a likely production setup (I haven't done this one myself). It assumes the realservers are on a different network to the DIP. Here x.x.x.? and y.y.y.? are public IPs. The 176 and 10 addresses are for communication between the different locations and will be assigned by the ISP.

[Diagram: LVS-Tun production setup.
director (behind the D-router): VIP=y.y.y.110 (eth0, arps), DIP=176.0.0.1 (eth1)
realservers: RIP1=10.0.0.1 (eth0), RIP2=10.0.0.2 (eth0); VIP=y.y.y.110 (tunl0) on each
packet flow: CIP->VIP through the D-router to the director; DIP->RIP1(CIP->VIP) through the R-router to the realserver (arriving as CIP->VIP on tunl0); VIP->CIP from the realserver's eth0 out through the C-Router
The R-router and C-Router do not advertise the VIP.]
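For the single-network test setup, the director side might be configured roughly as follows. The function only prints the ipvsadm commands. The `-i` switch selects ipip tunnelling; the round-robin scheduler (`-s rr`), the telnet port and the function name are choices made for this sketch, not taken from the original scripts.

```shell
#!/bin/sh
# Print (do not run) ipvsadm commands for an LVS-Tun virtual service.
# Usage: lvs_tun_director_commands VIP PORT RIP [RIP...]
lvs_tun_director_commands() {
    vip=$1 port=$2
    shift 2
    # Add the virtual service with a round-robin scheduler.
    echo "ipvsadm -A -t $vip:$port -s rr"
    # Add each realserver with -i (ipip tunnelling) rather than -g (LVS-DR).
    for rip in "$@"; do
        echo "ipvsadm -a -t $vip:$port -r $rip -i"
    done
}

lvs_tun_director_commands 192.168.1.110 23 192.168.1.2 192.168.1.3 192.168.1.4
```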

You need a tunl0 device tunl0 is a networking device like eth0, lo, and dummy0. In LVS-Tun, the tunl0 device holds the VIP, just as the lo device holds the VIP for LVS-DR. You need to build the tunl0 device into the Linux kernel (in networking options - IP:tunneling) - it is turned off by default. The tunnelling (ipip) can be built as a module, in which case you'll have to insmod ipip before you can use it, or you can build ipip directly into the kernel. With a kernel enabled for ipip, you should be able to see the unconfigured tunl0 device with ifconfig or with ip addr show (Feb 2004 - my ifconfig used to see the unconfigured tunl0, but it doesn't anymore.) Then you configure the tunl0 device (even if ifconfig can't see it), at which point the tunl0 device becomes visible to ifconfig. The VIP is a /32 addr, so the brd addr is the VIP, not x.x.x.255.
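A hedged sketch of putting the VIP on tunl0 on a realserver, printed rather than executed. It assumes ipip is built as a module and uses the iproute2 form with a /32 address and the VIP as broadcast, as described above. Handling of the ARP problem is not included, and the function name is made up.

```shell
#!/bin/sh
# Print (do not run) commands that put the VIP on tunl0 on a realserver.
# Usage: tunl0_commands VIP
tunl0_commands() {
    vip=$1
    # Load the ipip module (not needed if ipip is built into the kernel).
    echo "modprobe ipip"
    # /32 address, so the broadcast address is the VIP itself.
    echo "ip addr add $vip/32 brd $vip dev tunl0"
    echo "ip link set tunl0 up"
}

tunl0_commands 192.168.1.110
```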

the ARP problem with LVS-Tun If the realservers and director are on a different network (e.g. the realservers are geographically remote), then the router in front of the realservers will not be advertising routes to the VIP and you won't need to handle the ARP problem on the realservers. In effect you are using without having to do anything special. If the realservers are using the same router as the director you need to handle the ARP problem for the realservers (set tunl0 to not reply to arp queries). This networking is the same as for LVS-DR and you'd only do this to test LVS-Tun (there's no other reason to use LVS-Tun with the LVS-DR network). However all my LVS-Tun test cases used the same networking as for LVS-DR, i.e. the DIP and RIPs were on the same network and only one router (actually none, the client with 1 or 2 NICs, faced directly onto the director and realservers). In this case I had to handle the ARP problem for the realservers.

Reply packets appear to be spoofed Unlike LVS-DR, with LVS-Tun the realservers can be in a different location (and on a network remote from the director), where the director and realservers will be on different networks and the realservers will be on a network that does NOT contain the VIP. If this is the case, the realservers will be generating reply packets with VIP:port->CIP (where port is the LVS'ed service). Not being on the VIP network, the routers for the realservers will have to be programmed to accept outgoing packets with src_addr=VIP:port. Routers normally drop these packets as an anti-spoofing measure. If you aren't in control of the routers, you'll just have to inform the people who are, that packets from VIP:port are valid for your business. If they don't want to help you with your business, then you should find another provider who will. Mark Wadham mark (dot) wadham (at) areti (dot) net 30 Mar 2007 I believe we have located the source of the problem. Our load balancer is located in Manchester and the mail servers are located in London, and it appears that our upstream providers filter our traffic to prevent ip spoofing.

How LVS-Tun works Here's part of the rc.lvs_tun script which configures the realserver with RIP=192.168.1.8 There's no forwarding in the conventional sense for LVS-Tun. (You can have ip_forward, i.e. /proc/sys/net/ipv4/ip_forward, set to ON if you need it for something else, but LVS-Tun doesn't need it ON. If you don't have a good reason to have it ON, then for security turn it OFF). For more explanation see .

As with LVS-DR, for LVS-Tun, the target port numbers of incoming packets cannot be remapped. A request to port 23 on the VIP will be forwarded to port 23 on a realserver, thus no port number is used for setting up the IP of the realserver. However you can still remap ports external to LVS, using iptables.

Here's the packet headers as the request is processed by LVS-Tun. For the verbally oriented...
- A packet arrives at the director for the VIP.
- The director looks up its tables and decides to send the connection to realserver_1.
- The director encapsulates the request packet in an IPIP datagram with header DIP->RIP_1.
- The packet arrives at realserver_1, the realserver recovers the original IP datagram, looks up its routing table, finds that the VIP (on the non-arping tunl0) is local and processes the packet locally.
- A reply packet is generated with VIP:23->CIP:3456.
- The realserver looks up its routing table and finds that a packet to CIP goes out its default gw (not to the DIP).

The tunl0 device does not arp with 2.0.36 kernels, but does with 2.2.x (and later) kernels. Go look up the section on the ARP problem to see if you need to patch the kernel on the realserver. (Joe: since kernel 2.6.4 and 2.4.26, arp_ignore/arp_announce are the preferred way of handling the arp problem.)
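The header changes in the walk-through can be laid out with a short script. This is an illustrative trace only; it sends no packets. The client address 192.168.1.254 and the DIP 192.168.1.1 are assumed for the example, while the ports and the RIP come from the text above.

```shell
#!/bin/sh
# Print the IP headers seen at each stage of an LVS-Tun request.
# Usage: lvs_tun_trace CIP:PORT VIP:PORT DIP RIP
lvs_tun_trace() {
    cip=$1 vip=$2 dip=$3 rip=$4
    echo "1 client -> director:     $cip -> $vip"
    # The outer IPIP header carries only addresses; the original packet rides inside.
    echo "2 director -> realserver: $dip -> $rip [IPIP payload: $cip -> $vip]"
    echo "3 after decapsulation:    $cip -> $vip (VIP on tunl0 is local)"
    echo "4 realserver -> client:   $vip -> $cip (via the realserver's default gw)"
}

lvs_tun_trace 192.168.1.254:3456 192.168.1.110:23 192.168.1.1 192.168.1.8
```

Note that the reply in stage 4 never carries the DIP or RIP, which is why it can bypass the director.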

The RIP (not the tunl device) receives the ipip packet Joe

How does a packet get to a tunl device, which doesn't have a MAC address, from a remote machine?

Julian tunl, lo and dummy are used just to configure the VIP. We don't send any packets through these devices. The requests are delivered to the realservers using their RIP. The director asks only about their RIP from ipvsadm. Only the router/gateway asks about VIP, but only the director must reply. When the packet is received in the realserver it is delivered locally (not forwarded or dropped) due to configured VIP. This is the only role of these "dummy" interfaces: the kernel to treat the received packet as it is destined to our host (the realserver). Nothing more. No IPIP encapsulations (for tunl), no MAC address definitions, nothing more. When we answer the request we use eth0. The tunl/lo/dummy is not selected as device for the outgoing packets. We have routes for eth0 (default gateway) which we use for the outgoing traffic. This is for DROUTE and TUNNEL mode.

If two linux boxes (not in an LVS) are joined by an IPIP tunnel and there is no MAC address associated with the tunl0 devices at each end of the link, then how do the packets get from one machine to the other?

Julian The packets are encapsulated via IPIP and sent to the tunnel ends real IP where they are decapsulated again and appear on the tunl interface. You don't need a MAC address for point-to-point links, or logical interfaces like tunnels.

Configure LVS-Tun
- Edit the template lvs_tun.conf and run the configure script.
- Load the parameters into the director and then the realservers with the command (the script knows whether it is running on a realserver or the director). (Later put rc.lvs_tun in /etc/rc.d or /etc/init.d and put mon_xxx.cf in /etc/mon.)
- Check the output from ipvsadm, ifconfig -a and netstat -rn, to confirm that the services/IPs are correct.

set rp_filter correctly this is now in

FreeBSD and Solaris realservers with LVS-Tun maluyao ma(dot)luyao(at)gmail(dot)com 4 Apr 2007 see LVS-Tun on FreeBSD and Solaris realservers (http://kb.linuxvirtualserver.org/wiki/LVS/TUN_mode_with_FreeBSD_and_Solaris_realserver). Here's how to setup ipip encapsulation in FreeBSD. carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 20 Jun 2002

First, gifs must be supported in your kernel (enable "pseudo-device gif" in your kernel config). src_addr is the address of your NIC's interface while dest_addr is the remote side or the other end of the tunnel IP address. For example, if pc1 is one end of your tunnel and pc2 is the other end, then: You can also man gifconfig . I haven't tried using gif interfaces for IP-in-IP tunneling. I've only used them for IPv6 in IPv4 tunneling, but you can test it.

carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 30 Jun 2002

I'd just like to report that I got LVS-Tun working for a Linux(as director)-OpenBSD(as realserver). I am currently testing LVS so we could use it to loadbalance web service requests (http) over different sites (different IPs/different blocks) therefore LVS-Tun is required. I know FreeBSD implements tunneling but I've only used it for IPv6-in-IPv4 tunneling and I didn't quite understand how tunneling in Linux worked. For example, in linux to create a tunnel, you did this: on the director: no tunnel is created because ipvs does the encapsulation on the realserver: Basically, I understand that the tunl0 is identified with the remote tunnel end (VIP) but I don't understand the "route add" part since LVS-Tun only implements a one-way tunnel. That is, from the director to the realserver, tunneling from realserver-to-director is not required and seems useless. The realserver routes following its default router path direct to the client. So that's where I got stuck. "How do you say this in *BSD using the gif0 interface, the one I'm familiar with?"

In the end, this is the topology we'd like to implement:

[Diagram: one-way tunnel to a realserver (*BSD), connected to a realserver (local-NAT, *BSD).]

with the tunneled packet routed normally through its routers/gateways (edge routers or other) down to the realserver. My test setup looks like this: So what I did on the OpenBSD realserver is this, 10.10.8.1 is the default gateway for the private network. Notice that the tunnel endpoint is the DIP (not VIP like in Linux). This is because as I understand, the packet that arrives at the realserver (encapsulated by ipvs) has this format: where, D - director address, R - realserver address, C - client address, and V - VIP address. Decapsulation is done by the gif0 tunnel, after that it sees that the packet is destined to itself (VIP defined at its lo0 interface) and processes it normally with source IP= client IP. 
When I do "telnet VIP" from the client, I successfully enter 10.10.8.199 after the login.

Windows realservers with LVS-Tun support for ipip was removed from M$ after w2k. Paolo has a solution for using a spanned layer2 network. Johan Ronkainen jr (at) mpoli (dot) fi 10 Feb 2003 It's possible with w2k Server. You'll find necessary settings under Routing and Remote Access snap-in. First create new IP Tunnel under "Routing Interfaces", then select "New Interface" under IP Routing/General and put necessary settings there. You'll also need Loopback interface so w2k will handle packets itself and won't try to route them. Open Control Panel, click Add New Hardware, navigate thru dialogs and finally select Microsoft Loopback Adapter. If you want /32 network for loopback adapter you need to change it with regedit since GUI allows only /31. Network code itself is fine with /32 subnets. It's been a while since I did this. We load-balanced w2k Terminal Server clients to three servers. Two were on same building as clients and third one was in different city on separate subnet. Clients connected to LVS that forwarded 2/3 of connections to local servers using LVS/DR and 1/3 to remote location using LVS/Tun via IP-tunnel. Replies were routed directly to clients. This never went to full production and servers have been re-installed since so I can't check exact configs. It's not that hard. Just like LVS/Tun with Linux on LVS end. w2k part required bit trial and error but it's doable. Adam Hammouda AdamMH (at) aol (dot) com 02 September 2003

I'm wondering if anyone can help me with some lvs/ipvs configuration issues regarding Windows realservers and LVS-Tun. I have been able to set up LVS-Tun when all realservers are Linux-based; however, when Windows is thrown into the equation things start to get messy. I have created a new General IP Routing Tunnel (Interface) and set its local and remote addresses to the VIP and Director IP, respectively; and configured the Microsoft Loopback Adapter to use the VIP, and set its subnet mask to 255.0.0.0 as was recommended.

Chris Chris (at) baonline (dot) co (dot) uk 03 Sep 2003 We run lvs using tunneling (ipip) with 3 Windows 2000 realservers. The steps are something like:

1. Install a loopback adapter on each realserver. I think this is documented elsewhere to solve the ARP problem. But basically, go to add new hardware, network, Microsoft, loopback adapter. Assign this adapter the IP address of the cluster (VIP). - Sounds like you have already done this.
2. Go to Routing and Remote Access on each realserver. Enable it if needed. Don't let it configure itself automatically; that is asking for trouble.
3. Under Routing Interfaces, add a new IPTunnel.
4. Now, under IP Routing - General, add a new interface, selecting the IPTunnel that you have just created. For the 'Local address', specify the RIP of that realserver. For the 'Remote address', specify the DIP.
5. OK through all that and then reboot the realserver - it is M$ after all.

- Sounds like you have specified the wrong IP address.

Paolo Penzo paolo.penzo (at) bancatoscana (dot) it 26 Sep 2003

I'm using LVS on a geographical basis (DR and TUN) with both Linux and Windows 2k as realservers. Unfortunately we started to migrate Win 2k servers to Win 2003 and we discovered that IP-IP encapsulation is not supported anymore by MS servers (see http://support.microsoft.com/?id=280484), so LVS TUN configurations don't work anymore if you use Win 2k3 as a realserver. I'm thinking about how to overcome this problem by manually configuring IPSec tunnels or something similar... Help is welcome.

Joe: there was no answer

Realservers without ipip encapsulation

ipip encapsulation is used when the realservers are at a remote site. Methods of tunneling other than ipip exist (e.g. a VPN) if you need geographically remote realservers.

Richard Seabrook

Since Windows 2003 doesn't support IP-in-IP like 2000 did, what other alternatives are people using when real servers are remote from the directors?

Paolo Penzo paolo (dot) penzo (at) bancatoscana (dot) it 06 Dec 2006 We made a layer 2 network spanned across geographical sites and moved to DR balancing: everything is much easier to manage!

LVS-Tun has smaller MTU: PMTU is disabled - handling fragmentation

Ratz 28 Feb 2007 (after seeing a patch in the lkml regarding documentation for sysctls tcp_mtu_probing and tcp_base_mss.) I found this patch rather interesting, given that PMTU seems to be disabled by default on newer 2.6.x (x>19?) kernels. We need to keep an eye on this.

An ipip header is added when sending packets through a tunnel. Since the mtu is fixed (1500), the extra header reduces the allowed packet payload size. Packets >1480 sent from the director to the realserver in LVS-Tun will therefore require fragmenting. LVS (and Linux) doesn't have any special code to handle ipip fragmentation, so we should have expected LVS-Tun to fail when the client sent packets large enough to require fragmentation in the DIP->RIP hop. Either few people were using LVS-Tun in production, or clients were only sending small packets (e.g. HTTP GET), and we didn't realise for a long time that we had a problem lurking. Further below is Casey Zacek's solution for both w2k and Linux. Here is Julian's description of the problem.

Julian Feb 12 2007 The client will (Joe: should?) see a "fragmentation required" icmp packet from the director if the packet is bigger than our PMTU to the RS. This problem is still present and it is hard to fix (it's a bug): without handling ICMP errors for our IPIP packets, we will not lower the PMTU just by generating IPIP traffic. But other (non-ipip) protocols (packets to the RS) can learn a lower PMTU and update the cache. Then we can see this lower value in the routing cache and generate reply ICMPs when large packets come from the client. At least that's what I remember from before; I'm not sure if things have changed in 2.6 now. IPIP packets are between the DIP and RIP. These packets can hit the MTU limit in all hops between director and RS. The reply packets from RS to CLIENT take another path.
If a big packet from the RS to CLIENT hits a MTU limit, then our director will receive ICMP/FRAG_NEEDED from xxxHOP to VIP, which we tunnel in IPIP to RIP. Here is a simple picture showing the MTU for each hop:

    DHOP ---> DIRECTOR ---> RHOP ---> RS ---> CHOP ---> same CLIENT
    CIP->VIP    DIP->RIP    VIP->CIP

    CLIENT   - knows about DHOP and uses MTU=1500
    DHOP     - hop/router to director, knows MTU=1400 to director
    director - sees MTU=1300 to RS, knows (or doesn't know) about RS (MTU=1200)
    RHOP     - hop/router to realserver, knows about RS and uses MTU=1200
    RS       - connects to CHOP with MTU 1100
    CHOP     - hop/router to CLIENT, uses MTU 1000

The steps:

- Client sends 1500-byte TCP packet to DHOP (IP DF=1)
- DHOP returns ICMP/FRAG_NEEDED
- Client reduces the cached MTU and generates a new shorter 1400-byte packet
- ipvs() in Director receives the packet and if PMTU for RS is not configured to 1200, the packet still hits the 1300-byte limit. ICMP is replied (back to client). As the kernel/IPVS does not update PMTU cache based on ICMP for our IPIP packets, the PMTU for DIP->RIP has to be configured manually in the director (if it's going to be set at all there).
- Director encapsulates the CIP->VIP packet inside an ipip packet DIP->RIP. This packet has a header IPIP (there is no SYN, ACK, DATA, FIN) and is a one-off packet, and not part of a two-way DIP-RIP tcp connection. The decapsulated packet arriving at the RS (CIP->VIP) can't result in any packets going back to the director. Once decapsulated the RS forgets that DIP sent the packet, and the RS now sees the original packet from client. If the IPIP packet (DIP->RIP) is too big, the RS won't cause an icmp FRAG_NEEDED to be sent back to the DIP - the RS is not a router, only the RHOP will send an ICMP packet.
- CLIENT sends 1280-byte packet which IPVS tunnels into 20+1280 IPIP packet.
- RHOP generates ICMP/FRAG_NEEDED (src_addr=RHOP dst_addr=DIP). We don't accept this icmp packet and the icmp information is lost (the problem mentioned in URL).
I don't remember if this is a problem with ip_vs() or the Linux kernel. Maybe the kernel learns the PMTU from the first ICMP packet, but IPVS does not see the ICMP packet. I hope that for the 2nd packet from the CIP, IPVS will detect the lower MTU limit and will return ICMP immediately to the CIP. Maybe IPVS, looking for ICMP packets in LOCAL_IN, can see this first error, but it is hard to forward a similar message to the client. If instead the director knows about the 1200-byte limit, then any IPIP packet from the director will reach the RS without any ICMP replies. One way of doing this would be by setting the mtu for the route DIP->RIP. (This command sets a lower MTU for all packets, not just ipip packets.) If the RS generates a 1500-byte TCP reply packet (VIP->CIP), then CHOP will generate an ICMP reply to the VIP, which should arrive at the director if routed properly (this packet will likely traverse the internet using a path separate from the client-director-RS-client path). On arrival at the director, the director will use icmp.c:icmp_unreach() to learn the PMTU. ip_rt_frag_needed() will save the value in the routing cache. Since the director doesn't send packets to CHOP, the problem then is how this information is used. The kernel's ICMP protocol receiver parses the information and updates the PMTU in the cache, but fails to deliver it to the upper layers as happens when delivering errors to sockets. This time IPVS was the sender. That is why the LOCAL_IN hook exists, where IPVS can listen for these errors, but as I said, it is difficult to generate an ICMP error to send to the CIP. Another problem is what MTU to use between the RS and clients, but IPVS should properly forward (tunnel) any ICMP errors from hops between the RS and (before) the client to the RS (Joe: the director?). The client will never trigger an ICMP reply, which is generated only by routers.
CHOP replies to the packet VIP->CIP, so this ICMP packet comes to the VIP (director) and IPVS will select the appropriate connection in the ipvsadm table, and forward the ICMP information in an IPIP packet to the RIP (as happens for the regular TCP packets from the CIP). ip_vs() used to have forwarding of ICMP from the non-error class icmp packets (e.g. ICMP ECHO), but someone dropped it from 2.6 as an unused feature. I hope that is how IPIP setups work.
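The forward-path part of Julian's walk-through can be traced with a small Python model. This is only a sketch under simplifying assumptions: the client always shrinks its packet to whatever MTU an ICMP reports, ipvs reports its PMTU to the RS minus 20 bytes, and an ICMP from RHOP to the DIP is simply lost. The function name `probe_path()` and its parameters are made up for this illustration; the MTU values are the ones in the picture above.

```python
IPIP_OVERHEAD = 20  # bytes added by the outer IPv4 header

def probe_path(size, dhop_mtu, director_pmtu_to_rs, real_pmtu_to_rs, max_tries=10):
    """Follow a DF=1 packet from the client; return (final_size, delivered)."""
    for _ in range(max_tries):
        if size > dhop_mtu:
            # DHOP returns ICMP/FRAG_NEEDED; the client retries at the reported MTU
            size = dhop_mtu
            continue
        if size + IPIP_OVERHEAD > director_pmtu_to_rs:
            # the director itself generates the ICMP, reporting PMTU minus 20
            size = director_pmtu_to_rs - IPIP_OVERHEAD
            continue
        if size + IPIP_OVERHEAD > real_pmtu_to_rs:
            # RHOP sends ICMP to the DIP, nobody acts on it: the packet is lost
            return size, False
        return size, True
    return size, False

# Director only knows the 1300-byte limit of its own link
print(probe_path(1500, 1400, 1300, 1200))
# PMTU to the RS configured by hand to the real value of 1200
print(probe_path(1500, 1400, 1200, 1200))
```

In the first call the client ends up sending 1280-byte packets that become 1300-byte ipip packets and disappear at RHOP. In the second call the client is told to use 1180 bytes and the traffic gets through, which is the case for setting the PMTU for the route to the RIP by hand.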

MTU: early signs of problems

awysock (at) absoftware (dot) com 28 Nov 2003

I've set up two UM load balancers running LVS 1.0.10 and have them up and running. I'm using LVS-Tun since I rent my servers and my IP addresses are all over the place. My site deals with lots of photos, so my users are doing large POSTs along with large POSTs of text data. It seems that when the ethernet packet goes over the 1460 byte mark, some of the users fail while others (my own machines) work just fine. I have tried it on my Windows machine and my Mac and have no problem, but when somebody elsewhere on the net does the same thing they fail with a 404 or timeout error on their end. It's only some of the people; others are not having the problem. If they go directly to the server it works. So I'm guessing it's something between the LVS and the realservers. I have changed the MTU value for eth0 on the director to 1400. All that does for me is make more machines (all that I have tested) suffer from the same problem. Should the MTU value be changed at different places, i.e. both ends of the tunnel? I knew that our choice to use Windows 2000 would haunt me! Does anyone know how to change the MTU for an IP tunnel in Windows 2000?

Ratz, 1 Dec 2003 Enable PMTUDiscovery in w2k (http://insight.zdnet.co.uk/communications/networks/0,39020427,2123537-2,00.htm) and DrTCP (http://www.dslreports.com/drtcp) (Joe: presumably you want DRTCP019.exe, support for MTU set in w2k).

The MTU was originally set to 1500 on all machines. Most machines worked but some would not when posting large amounts of data. When I set the MTU for all interfaces on the director to 1400 and leave the MTU for the tunnel untouched at 1500, all machines fail. When I set the MTU for all interfaces on the director to 1400 and set the MTU for the tunnel to 1400, all machines fail. With the MTU for the tunnel set to 1400, I can set the MTU for the director to anywhere between 1420 and 1500 before it fails with all machines. The largest packet I can transmit on the ISP's network without it fragmenting is 1472, although they claim their MTU is 1500 (ping -f -l 1472 www.linux.org works but anything bigger does not). This makes no sense to me. The only way I can think this is correct is if: the maximum packet size (without a tunnel) between director and realservers is 1500; if the header for the IPIP tunnel is about 20 bytes, then the maximum packet size for packets within the tunnel is 1480; therefore, the MTU for the director must be at least 20 more than the MTU for the tunnel. So why does using 1400 everywhere make it all fail, but 1500 everywhere only fail on some machines? What can I set the MTU values to in order to guarantee it working with all clients? Most of our clients have no technical knowledge and this is becoming a nightmare!
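The 1472 figure in the posting is in fact consistent with an MTU of 1500: the size given to ping is the ICMP data only, and the 8-byte ICMP header and 20-byte IP header are added on top. The following Python lines just do that arithmetic (a sketch, assuming IPv4 headers without options; the function names are made up for this illustration).

```python
IP_HEADER = 20    # IPv4 header without options
ICMP_HEADER = 8   # ICMP echo header
IPIP_HEADER = 20  # outer IPv4 header added by ipip encapsulation

def largest_ping_payload(mtu):
    """Largest ping data size that fits in one unfragmented packet."""
    return mtu - IP_HEADER - ICMP_HEADER

def largest_tunneled_packet(mtu):
    """Largest inner IP packet that still fits after ipip encapsulation."""
    return mtu - IPIP_HEADER

print(largest_ping_payload(1500))                           # direct path
print(largest_tunneled_packet(1500))                        # inner packet limit
print(largest_ping_payload(largest_tunneled_packet(1500)))  # ping through tunnel
```

So a 1500-byte link carries a 1472-byte ping directly, a 1480-byte inner packet through the tunnel, and a 1452-byte ping through the tunnel.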

Horms 30 Nov 2003 Typically the MTU used is 1500 bytes. But when tunnels come into play this becomes slightly smaller because of the overhead for the tunnel. This should not be an issue, but in practice it often makes sense to manually set the MTU to the smaller value on the applicable interfaces.

ratz 01 Dec 2003 ... or the mtu of the tunnel's routing entity for that matter. This is faster and less intrusive than adjusting down the whole physical interface's mtu. I use it for boxes where I have dozens of VPN tunnels over a physical interface, but also non-tunneled traffic.

Joe: Ratz is saying to change the MTU not for the interface (which will affect all routes through that interface), but only for the route. Presumably the route is DIP->RIP (the packet on arrival at the RIP is decapsulated to the packet with dest_addr=VIP). (Feb 2007 - Ratz posted that he got the idea from off-line discussions with Julian. But Ratz gets the credit for telling us about it.)

Roberto Nibali ratz (at) drugphish (dot) ch 01 Jun 2004 You can set the mtu for a route to/from the VIP. You must of course pay attention to route selection, which can be investigated with ip rule/ip route or the shell tools I've written to display routing tables. So you might need to put the VIP route into a special routing table which gets parsed before the other routes. Also don't forget to flush the routing cache.

Joe: in principle this is easy to do, but no-one has done it yet. The ipip packet from the director to the realserver is DIP->RIP. Ideally you would only want to change the mtu for the ipip packets to the RIP (or to the RIP network), so that other packets to the RIP (e.g. logging, administration) have standard MTUs. As well, we aren't sure yet whether PMTU works, even if we do change the mtu for the DIP->RIP route (someone could look in the code). Here's how Ratz changes the MTU for the default route.

Ratz 05 Feb 2007 Here I add a default route to a new table and change the default mtu.
Basically you can use the "change" keyword in conjunction with the "mtu" selector on the specific route. Jacob Coby jcoby (at) listingbook (dot) com 01 Dec 2003 Decreasing the MTU with this bug only causes more problems; it causes the packets to fragment MORE often. When I had the issue, I could decrease the MTU to 200 bytes, and the connection would fail at a payload of ~160 (20b for the IP header, 20b for the IPIP header), even with non-tcp data, like ping. Julian 28 Nov 2003 try LVS with 2.4.23 as it contains a fix for packets longer than mtu. (and later) Julian Anastasov 24 Feb 2004, 29 May 2004 There is only one remaining problem related to LVS-TUN: there is no handling of ICMP errors being received on a local IP after being returned from somewhere in the path (DIP->RIP) coming back to the DIP and containing the reply to tunneled packet (e.g. a frag_needed message and carrying the first few bytes of the packet). We do not relay these messages, generated between the director and the realserver, back to the client. The correct target for the ICMP message depends: the director is sending 20 bytes more (the ipip overhead), and if this is causing the ICMP message, then the client need not receive the ICMP message in all cases. The client should only receive an ICMP message if the director detects a lower PMTU. While TCP and UDP handle ICMP errors, IPIP does not handle them well. The LVS-DR and LVS-NAT forwarding preserve the sender's IP in which case ICMP traffic from realservers (or hosts before realservers) is always returned to the client. But if LVS-Tun is used, the ICMP packets are not returned to the client. If the only traffic from the director to the LVS-Tun realservers is IPVS traffic, then the routing cache does not receive the PMTU info from ipip_err() and we don't learn the correct path MTU to the realserver. Then, on forwarding packets, the IPVS code cannot detect that the path has lower PMTU. But this is theory, not really tested. 
Maybe we can update the PMTU in the routing cache by listening to these ICMP errors in LOCAL_IN? Needs experiments and time for fixing, patches are welcome. There is no such thing as an MTU for ipip with IPVS. IPVS extends the packet with 20 bytes by prepending IPIP header and ignores the mtu. IPVS has its own encapsulation and uses the route to the RIP (you do not need to configure a tunl0 device on the director).
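To make the "IPVS extends the packet with 20 bytes by prepending IPIP header" statement concrete, here is a Python sketch of what such an encapsulation looks like on the wire. It is not the IPVS code: `ipip_encapsulate()` and `ipv4_checksum()` are names made up for this illustration, the addresses are hypothetical, and setting DF on the outer header is an assumption made for the example.

```python
import socket
import struct

IPPROTO_IPIP = 4  # protocol number of the outer header for IP-in-IP

def ipv4_checksum(header):
    """One's-complement sum over the 16-bit words of an IPv4 header."""
    total = 0
    for i in range(0, len(header), 2):
        total += (header[i] << 8) | header[i + 1]
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF

def ipip_encapsulate(inner, dip, rip, ttl=64):
    """Prepend a 20-byte outer IPv4 header (DIP->RIP) to an inner IP packet."""
    header = struct.pack(
        "!BBHHHBBH4s4s",
        0x45, 0, 20 + len(inner),  # version/IHL, TOS, new total length
        0, 0x4000,                 # identification, DF flag (assumed here)
        ttl, IPPROTO_IPIP, 0,      # TTL, protocol, checksum placeholder
        socket.inet_aton(dip), socket.inet_aton(rip),
    )
    checksum = ipv4_checksum(header)
    header = header[:10] + struct.pack("!H", checksum) + header[12:]
    return header + inner

inner = bytes(1480)  # stand-in for a 1480-byte CIP->VIP packet
outer = ipip_encapsulate(inner, "192.0.2.1", "192.0.2.2")
print(len(inner), len(outer))
```

A 1480-byte packet from the client becomes exactly 1500 bytes on the DIP->RIP hop, whatever MTU a tunl0 device might advertise. Anything larger no longer fits in a 1500-byte link.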

I would love to upgrade the Kernel (currently 2.4.20) but that is not an option as a quick fix at the moment. - Live environment and the like.

This time the fix is not in the IPVS code: (see the kernel bug list http://linux.bkbits.net:8080/linux-2.4/hist/net/ipv4/ip_output.c?nav=index.html|src/.|src/net|src/net/ipv4). The problem is that skb->nfcache is not copied on [re]fragmentation. Here's a posting and patch by Julian to the linux-netdev mailing list (http://marc.theaimsgroup.com/?l=linux-netdev&m=106589293316918&w=2). But we need to see your tcpdump output first because the PMTUD (path MTU discovery) is usually enabled. Joe

IPIP is a one-way channel (packets don't come back?) and PMTUD doesn't work?

Julian The director still can receive ICMP errors with the source somewhere between the LVS-Tun realserver and the director. Chris Paul

The problem is I can not reproduce the error. We only have a small number of non technical customers who are having trouble, but I can only go so far when it comes to asking them to debug our services.

Julian 01 Dec 2003 Then your problem is related to the client-director PMTU. I understand that it can be difficult to trace an unknown client, but do you have some kind of ICMP filtering between clients and the director? The problem came up again in May 2004 (when the current kernel is 2.4.26). Casey Zacek cz (at) neospire (dot) net 26 May 2004

The problem, as described by one of my customers, is this (the customer is running phpBB on 3 Linux/Apache servers with an LVS-Tun setup):
For very few users, when they post long posts (anything over a few lines) and hit submit, the browser appears to hang and finally it times out. Similar effects if they try and update their profiles. I even experienced this on my home computer. I use a proxy server sometimes and it showed the request being transmitted from my computer but ultimately no response was received from the site. Now, in most instances of this, we have found that the affected users are on broadband using a router of some type. I myself use a cable modem connected through a Linksys Router. When I experienced the issue, I was able to post from work, but not from home. I fiddled with my setup, thinking it was cookies or caching of some type and ultimately performed a firmware upgrade on my router. Suddenly the problem went away.
At the time, I was running kernel 2.4.25 (IPVS 1.0.10), but since upgraded to 2.4.26 (IPVS 1.0.11), then 2.6.6 (IPVS 1.2.0). I have asked the customer to retest it, but he'll have to talk to some of his users, from the sound of things, since he upgraded his router firmware. I'd love to chalk it up to "client router problems," but that probably won't be good enough for this customer. The customer's setup worked using a Riverstone smartswitch router running what equates to LVS-NAT, but it does not work with this LVS-Tun setup. With all three versions, I get a lot of these messages:

Julian This message means that the IPVS director is generating ICMP errors to request that the client reduce the packet size. Maybe these ICMP messages are filtered somewhere and do not reach the client. I have a step-by-step howto for TUN setups: http://www.ssi.bg/~ja/TUN-HOWTO.txt Joe: This URL doesn't directly address the mtu problem. It checks the encapsulation and routing. Joe

Why is the default MTU for ipip packets 1480, rather than 1500+overhead_for_ipip=1520? Is 1500 a hardware buffer size limit in the NICs? (i.e. hardware buffer=1500?)

Julian I don't know the origin of the 1500 limit. Maybe it is a balance between link sharing and protocol header overhead. There is a convention in IPv4 to reply with an ICMP error if a packet with the DF flag set reaches a smaller pipe (i.e. packet length > PMTU). If the DF flag is not set, the packet is fragmented into MTU-sized fragments. For an explanation of PMTU and the DF flag, see PMTU - Path MTU Discovery (http://www.netheaven.com/pmtu.html). There can be many problems related to MTU:

- no ICMP errors generated (from director or from other hosts between director and realservers)
- ICMP errors do not reach their destination (the client), e.g. filters dropping blindly any ICMP traffic
- ICMP errors generated from realservers (or from hosts between the director and realservers) not forwarded from director to client
- PMTU not updated in routing cache due to IP TOS changes after routing

Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004 The problem is caused by the linux kernel not taking into account the size of the ipip tunnel headers when sending traffic over an ipip tunnel. Basically, the MTU (the largest size packet that can be sent over a network) is normally 1500 bytes. With the IP header information this drops to 1492, so the largest size of packet that can be sent over an IP link before the packet gets split into multiple packets is 1492 bytes. When you use ipip tunneling, there is an additional header that takes the maximum transmission size through the link to something like 1480. Linux kernel 2.4.??? does not take into account this additional header and sets the mtu for the ipip tunnel to 1492. So if you send a packet that is between 1480 and 1492, it gets truncated rather than split into multiple packets. The ipip tunnel destination then waits to receive the rest of the packet, which never arrives. The result is the server never responds. When I was having this problem, it was a nightmare because you cannot guarantee it will fail.
It only fails when the packet size is very specific and the size of the header is also large. To fix this you can either upgrade to kernel 2.6.??? or change the MTU values on the director. I solved it by changing the MTU values, but it was nearly a year ago now and I can't remember exactly which ones I changed, i.e., the RIP on the director, the tunnel from the director, or the tunnel from the realserver.

Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004 You have to change the mtu value on the end of the IP tunnel that initiates the tunnel, i.e. the realserver (in this instance, a w2k box). This value should be close to the mtu value of the physical interface it is going through, but small enough to ensure there is enough space left for the ipip header. We use 1400 and have never had any reports of it failing. To do this you go to the registry and add a dword entry called MTU with the decimal value 1400 (safe) into reboot with w2k, XP, you can "Restart Networking"

Joe - see below for Casey Zacek's modification of this method. If the mtu is not set, you get lots of IPVS: ip_vs_tunnel_xmit(): frag needed messages logged to the console and connections hang. Joe

I would have thought you'd set the mtu at the director end. Presumably if any end of the segment has a reduced mtu, then both ends of the segment should be notified about it.

Julian This message means "fragmentation needed but DF flag set". ip_vs_tunnel_xmit() tries to prepend IPIP header, but notices that the resulting packet with DF flag set will exceed the PMTU(director->RS) limit, so it generates an ICMP error instead of xmit-ing the packet to the RS. Joe

What messages would you get if the icmp problem was about the link between the tunnel realserver and the director?

Messages of the same type, but we haven't handled this case yet. In the meantime, setting the proper PMTU in the director for the route to the realserver is a good idea.

Jacob Coby jcoby (at) listingbook (dot) com 27 May 2004 I could test for the problem reliably by using ping with packet_size>934 (934 and lower worked fine). Once I bumped it up over 934, I'd see Must Fragment (MF) ICMP messages being sent, and the ping request would have no response. As I lowered the MTU, the size of the ping that would cause the problem lowered in direct proportion. A 1500 MTU would cause a 935 byte ping to fail, a 1400 MTU would cause an 835 byte ping to fail, and so on. Any HTTP GET or POST over that 934 byte payload would cause the site to not respond.

Chris Paul

Where are you setting your mtu of 1400? You have to make sure that it is the mtu for data inside the tunnel. When I changed the mtu values, the only way I could reliably get it to change the size inside the tunnel rather than the whole tunnel packets was from the realserver not the director.

Jacob Coby I have no idea which MTU I was setting. I could get the problem to go away for one or two times, and then it would come back. It's been over a year since I messed with LVS-TUN, and I'm now running LVS-DR. Peter Mueller pmueller (at) sidestep (dot) com 27 May 2004 I've heard people in poptop use this hack. Maybe you can modify for your use in this situation. If it works I like this solution better than a change to the MTU on the interface. Joe: MSS is maximum segment size, i.e. the payload in the packet, rather than the packet size (which is set by mtu). Julian 3 Jun 2004 Ratz's work around should work, or you can hope that other traffic between director and RS will update the PMTU in the routing cache. Joe

if you put a tunl0 device on the director, would it receive the PMTU packets back from the realserver?

Yes, it can look into the ICMP errors that include IPIP header but the current version of ipip.c does not update the RIP's PMTU. Another option is IPVS to do it in LOCAL_IN. The VIP does not play here. The forwarded traffic in the director is routed to daddr=RIP (as for the other forwarding methods). Only the clients need a route to VIP. Joe

I was thinking to reduce the MTU for the CIP-VIP segment, then there would be no problem in the DIP-RIP segment. Is this a way of handling it?

This is another solution. Just keep PMTU(CIP->VIP) + 20 <= PMTU(DIP->RIP). I'm not sure you can do it for every client. Maybe it can be in the default route :) To use Ratz's work-around, you set the PMTU for packets going to the RIP (via eth0 on the director, there being no tunl0 devices on the director). If it is set to 1500 then you do not need such a route, as IPVS reports the PMTU reduced by 20 (here 1480) when generating the ICMP error to the client. So, if the PMTU to the RIP is X, or the RS sends an ICMP error to the director notifying it of PMTU=X, then IPVS will report PMTU=(X-20) to the client. OTOH, maybe it is not so difficult to check in LOCAL_IN for any FRAG_NEEDED errors and, if they reduce the PMTU for the RIP, update the routing cache. We need to investigate whether we can easily find that such an error is for one of our TUN RIPs.

What you can do on the director when using LVS-TUN:

- If the PMTU to the RIP is lower than the outdev's MTU then you have to specify it in a special route to the RIP. If the PMTU is 1500 you do not need such a route; IPVS automatically reports MTU 1480 to all clients that send packets >1480.
- There is no chance for PMTUD to work when the ICMP errors are dropped between director and client. This will happen for any PMTU to the RIP. Run tcpdump and check for any received or generated ICMP errors.
- The PMTU is not updated in the routing cache if the director receives ICMP_FRAG_NEEDED. This is easy to detect and to solve. The good news is that you can detect it from any client: send a large file and tcpdump for ICMP errors coming from realservers to the director. If this is the case (PMTU to the RIP is lower than the outdev's MTU) then you can try to specify the pmtu in a special route to the RIP.

Once the director knows the right PMTU to the RIP, it will report it to every client that violates it. There is no need for IPVS to relay the ICMP error coming from the RIP to the client; we just know how to generate it on each request from the client.
The only benefit can be if ipip.c is patched to update the PMTU in the routing cache and to avoid creating special route to RIP. All other problems can be related to filtering of the ICMP errors generated from director and sent to client. Such places for filtering can be the netfilter in director or any router used to reach the client. It is enough to check that the ICMP errors generated in director reach the director's uplink GW. Then you hope that the client does not filter ICMP. Joe

what is the MTU doing in the output of ip addr show dev tunl0 when you have a tunl device on a machine? I can set it (can't I?). Is the mtu meaningless, ignored, what?

It is ignored for IPVS traffic, IPVS has its own encapsulation and uses the route to RIP (you do not need to configure tunl0 in director). The tunl0 device is usually needed to receive IPIP packets, so in normal cases you do not need such interface in director even when using TUN realservers. The PMTU setting must be for the route to RIP. Such setting (and special route to daddr=RIP) can be needed only if PMTU to RIP is less than the outdev MTU.
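Julian's rule of thumb (a special route to the RIP is only needed when the PMTU to the RIP is below the outgoing device's MTU, and clients are told the PMTU minus 20) can be written down as a few lines of Python. This is a sketch of the rule as stated in this section, not of the kernel code; `director_pmtu_plan()` is a name made up for this illustration, and the reported value assumes the director has been given the right PMTU.

```python
IPIP_OVERHEAD = 20  # bytes prepended by IPVS for LVS-Tun

def director_pmtu_plan(outdev_mtu, pmtu_to_rip):
    """Return (special_route_needed, mtu_reported_to_clients)."""
    special_route_needed = pmtu_to_rip < outdev_mtu
    # the director can never send more than its own device allows
    effective_pmtu = min(outdev_mtu, pmtu_to_rip)
    return special_route_needed, effective_pmtu - IPIP_OVERHEAD

print(director_pmtu_plan(1500, 1500))  # plain ethernet all the way to the RS
print(director_pmtu_plan(1500, 1400))  # a narrower hop between DIP and RIP
```

In the first case no route is needed and clients sending packets >1480 are told to use 1480. In the second case the route to the RIP has to carry the 1400-byte PMTU, after which clients are told to use 1380.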

So with regular ipip tunneling (not ipvs) you only need the tunl0 device on the receiving end? The only reason you need a tunl0 device on the transmitting end is to handle the packets that reply?

For regular ipip purposes tunl0 can be used both for send and for receive. IPVS simply knows how to create ipip packets without using the ipip code.

tunl mtu solved: Setting the MTU by MSS with iptables on the realserver

Casey's solution is run on the realserver. Presumably a similar solution could be found for the director. Ratz's method of setting the mtu for the route rather than the interface runs on the director.

Casey Zacek cz (at) neospire (dot) net 2005/03/11 I've emailed about this before, and nothing ever came up which really worked. The real problem I've always had is that I've never had a means of duplicating it (possibly because I didn't fully understand the problem; I can probably duplicate it at will now), and my customers have eventually just either accepted it and moved on or changed to an LVS-NAT environment. I finally came across someone whose home network was set up in such a way as to experience the "problem", so I decided to figure it out once and for all and hopefully end all the confusion. Attached is a piece of PHP (lvs-tun-test.php) that'll duplicate the problem. The "submit" query will time out if you are experiencing the problem.

Matthew Boehm matthew (at) matthewboehm (dot) com 6 Jan 2007 (and Casey). With IE6/7: when you submit the POST, the page just reloads (Matthew) or hangs/times out with no data posted (Casey). With Firefox/Netscape: you get a "Bad Request" page.

big POST test

alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
(the filler line above is repeated 40 times in the test page to make the POST body large)

Cut here -----------------------------------------------------------

In order to force yourself to experience the problem, you need to forcefully ignore ICMP fragmentation-needed packets. I am able to do that on my home network with a simple iptables rule on my firewall:

Now, I browse the lvs-tun-test.php through LVS-Tun, and click submit, and it just hangs and times out. tcpdump shows the expected results. Then I change the MTU on the loopback interface on the realserver (it's a w2k box) using regedit, then disable and re-enable the loopback adapter via the network properties, then click submit again. Poof, it works.

tcpdump is my friend. I started out running tcpdump on the director:

    VIP.80: S [tcp sum ok] 3288780265:3288780265(0) win 65535
    23:13:52.810423 IP (tos 0x0, ttl 116, id 26415, offset 0, flags [DF], length: 40) CIP.60964 > VIP.80: . [tcp sum ok] 3288780266:3288780266(0) ack 2303765635 win 65535
    23:13:52.813943 IP (tos 0x0, ttl 116, id 26416, offset 0, flags [DF], length: 602) CIP.60964 > VIP.80: P [tcp sum ok] 0:562(562) ack 1 win 65535
    23:13:52.820802 IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 562:2014(1452) ack 1 win 65535
    23:13:52.820887 IP (tos 0xc0, ttl 64, id 25185, offset 0, flags [none], length: 576) VIP > CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 562:2014(1452) ack 1 win 65535
    23:13:52.827175 IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 2014:3466(1452) ack 90 win 65446
    23:13:52.827251 IP (tos 0xc0, ttl 64, id 25186, offset 0, flags [none], length: 576) VIP > CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 2014:3466(1452) ack 90 win 65446
    23:13:52.833420 IP (tos 0x0, ttl 116, id 26420, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 3466:4918(1452) ack 90 win 65446

The pattern of a TCP [DF] packet CIP->VIP (packet length 1492 -- too big) followed by IPVS's ICMP response continues until the request eventually times out. This message is generated every time one of the ICMP responses is sent:

    IPVS: ip_vs_tunnel_xmit(): frag needed

The problem comes when the ICMP host-unreachable (change MTU) packets are ignored/dropped and not acted upon by the client. This is a more common situation than I thought would be the case.

A few hours of debugging later, I realized that the SYN+ACK packet, the response from the real server to continue the connection handshake, is missing from this trace. Duh. I moved my tcpdumping to a tap in the network that I knew would get all of the traffic. The SYN+ACK packet sets the MSS (max segment size -- the data segment size for the packets for this connection) to 1452, just as the client machine requests (the first packet in the earlier trace). Duh!

I had read all the stuff on the URL above, and the posting by Chris Paul comes closest to describing the solution:

In reality, it's not "the end of the IP tunnel that initiates the tunnel" because the tunnel interface on the w2k box doesn't initiate anything -- it only receives forwarded traffic from the director. What he really means is "the interface on the real server that is handshaking the TCP connection with the client."

The goal is to get the client to send smaller packets so that they'll make it on to the realserver. So, we have to change the MSS that gets sent back from realserver to client. That is, set the MTU on the loopback interface on the w2k box. The solution is to do exactly what Chris Paul said, except change from: to:

After all, if you set an MTU in the IP tunnel interface this way, it won't be there after you reboot, I've found. Oh, and 1480 is the magic number. 1400 is safe, but 1480 works. Any higher than that, and it doesn't work as desired.

So I went to investigate how to do the same thing on my Linux real servers, only to find that the tunl0 interface, which is the connection endpoint for Linux realservers, already has an MTU of 1480. I don't know when that got fixed, but I guess I won't worry about it.

(later) I was wrong; here's the fix for Linux realservers:

Tested, tcpdumped, works. Now I have no more 'IPVS: ip_vs_tunnel_xmit(): frag needed' messages. (At least for now. We'll see if I'm wrong tomorrow.)

Chris Paul, 11 Mar 2005

Isn't this fixed in kernel 2.6 anyway?

Casey Zacek cz (at) neospire (dot) net

I really don't think it's possible to fix this on the director (and my directors are running 2.6.11 anyway -- and it's not fixed there). The closest way I could think of was to ignore the DF flag in the incoming TCP packets and just fragment them anyway.

Casey Zacek cz (at) neospire (dot) net 2005/04/12

It's not fixed in 2.6; I still need the iptables rule to set the MSS.

Joe: we don't know why this works.

Julian Feb 2007

Huh, I don't know why; maybe because there is such a limit somewhere in the path from RS to client. The path from RS to client is no different between realservers in DR or TUN mode; they both send normal replies from VIP to CIP, and no IPIP is involved there. Maybe there is a problem with a CHOP that cannot route ICMP to the VIP properly.

jarol1@seznam.cz J (dot) Libak (at) sh (dot) cvut (dot) cz 07 Dec 2006

Today I ran into an MTU problem with LVS-Tun. Small packets were forwarded to real servers without problems, but the bigger ones weren't and TCP retransmissions occurred. I noticed the problem disappeared when I switched to LVS-DR, so this gave me a hint as to where the problem might be. MTU 1480 had to be set on the outgoing interface of the realservers, with tunl0 having the standard 1500. Directors have 1500 on all interfaces. This way the TCP SYN+ACK advertised an MSS matching the correct MTU and the client no longer sent big packets that were discarded on the director. The IP header is 20 bytes long, so 1480 is the maximum value that works.

Casey Zacek cz (at) neospire (dot) net 31 Aug 2007

For some reason that I cannot remember, I have switched off of this iptables method in favor of using some advanced routing to take care of the MSS setting.

(Joe: Ratz says that the MTU should be set for the route and not for the device, since not all routes/packet types to/from a device need an altered MTU.)

I wish I had shared this with the group when I started it, because I can't remember why I'm doing it this way now. Still on the real servers, I use routing like so:

This assumes the VIP is in a class C network. So, for example, say

    VIP is 10.2.2.38
    VIP_NETWORK is 10.2.2.0
    VIP_NETWORK_GATEWAY is 10.2.2.1 (probably)

The number 42 is just a number I chose when I started this.

Sameer Garg sameer (dot) garg (at) gmail (dot) com 6 Sep 2007

By trial and error I was able to find a workaround for this:

I am still not sure why I need to make the change on the director, because technically during the three-way handshake the real server should tell the client that the MSS is 1400. I have tried it without making the changes on the director but it doesn't work.
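The iptables rules referred to in the postings above are not shown in this section. As a rough sketch only (the chains, the VIP and the MSS value are assumptions for illustration, not the rules Casey or Sameer actually used), rules of the two kinds discussed could look like the commands printed below: one that drops ICMP fragmentation-needed on a test firewall to reproduce the hang, and one that rewrites the MSS in the SYN+ACK leaving a Linux realserver from the VIP. The script only prints the commands so they can be inspected before anything is run as root.

```shell
#!/bin/sh
# Sketch only: prints iptables commands, it does not run them.
# The VIP, the MSS and the choice of chains are illustrative assumptions.

# Reproduce the problem on a test firewall: drop ICMP "fragmentation needed"
# before it reaches the client. FORWARD is assumed because the firewall
# sits between the client and the LVS.
drop_fragneeded_rule() {
    echo "iptables -A FORWARD -p icmp --icmp-type fragmentation-needed -j DROP"
}

# Possible fix on a Linux realserver: set the MSS in SYN+ACK packets that
# leave with the VIP as source address.
# $1 = VIP, $2 = MSS to advertise
set_mss_rule() {
    echo "iptables -t mangle -A OUTPUT -s $1 -p tcp --tcp-flags SYN,ACK SYN,ACK -j TCPMSS --set-mss $2"
}

drop_fragneeded_rule
set_mss_rule 10.2.2.38 1440
```

Piping the output to `sh` as root would apply the rules; whether OUTPUT is the right chain depends on where the reply leaves the realserver.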

Setting the MTU by route

This works on the realserver, but not on the director. We don't know why it doesn't work on the director and we're not really sure why it works on the realserver either. With Casey having a suitable test setup, we asked him to test setting the MTU by route using Julian's suggestion of

Casey Zacek cz (at) neospire (dot) net 14 Feb 2007

Nope. Doesn't work. Here's tcpdump running on the realserver showing the first packet back to the client, which negotiates the MSS for the connection.

    ENDUSER.1276: S [tcp sum ok] 2051800163:2051800163(0) \
        ack 1809535240 win 5840

That "mss 1460" needs to be "mss 1440". That's the secret magic key to the universe. I got some of these when I blocked icmp-type fragmentation-needed to my workstation, with logging:

And my page request just waited and waited (Firefox 2.0). When I flushed the icmp-type fragmentation-needed DROP rules and submitted the page again, it went through instantly. I also tried with

This also did not work.

Julian

To tell if this is a PMTU problem (rather than we haven't figured out the correct ip route command), one should check all steps with tcpdump in all boxes, icmp, tcp.

Now, I can make it work if I do this on the real server:

    RS# ip route add table 42 to default via DEFAULTGW advmss 1440
    RS# ip rule add from VIP to default table 42 priority 42

So, at least it doesn't require iptables. Also, this doesn't cover any client machine that is not reached via the default route. Instead you'd need something more like this:

In most cases, though, these two routes and one rule will cover it. I think I prefer using iproute to using iptables, as iptables tends to be more volatile in my environments.
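The numbers 1480 and 1440 that recur in these postings follow from header arithmetic. A minimal sketch, assuming IPv4 on a 1500-byte Ethernet link, no IP or TCP options, and that IPIP encapsulation adds exactly one 20-byte outer IP header (the function names are made up for this example):

```shell
#!/bin/sh
# Assumptions: IPv4, 20-byte IP header, 20-byte TCP header, no options,
# and IPIP tunnelling adds one extra 20-byte IP header.
IPIP_OVERHEAD=20
IP_HEADER=20
TCP_HEADER=20

# Largest client packet that still fits once the director wraps it in IPIP.
# $1 = MTU of the link between director and realserver
tunnel_mtu() {
    echo $(( $1 - IPIP_OVERHEAD ))
}

# MSS the realserver should advertise so that full-size client segments
# fit inside the tunnel.
tunnel_mss() {
    echo $(( $(tunnel_mtu "$1") - IP_HEADER - TCP_HEADER ))
}

echo "tunnel MTU: $(tunnel_mtu 1500)"   # prints "tunnel MTU: 1480"
echo "advmss:     $(tunnel_mss 1500)"   # prints "advmss:     1440"
```

This matches the "mss 1460 needs to be mss 1440" remark: 1460 is what a plain 1500-byte interface advertises, and 1440 is 20 bytes less to leave room for the outer header.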

rewriting, re-mapping, translating ports with LVS-Tun see

LVS: LocalNode

We rarely hear of anyone using this to make a director function as a normal realserver. However more specialised roles have been found for localnode.

2008: plenty of people are using it now, particularly the .

using the director as a "sorry server" (e.g. when all realservers are overloaded and you want to display a "please come back later" message).

With localnode, the director machine can be a realserver too. This is convenient when only a small number of machines are available as servers.

To use localnode, with ipvsadm you add a realserver with IP 127.0.0.1 (or any local IP on your director). You then set up the service to listen to the VIP on the director, so that when the service replies to the client, the src_addr of the reply packets is the VIP. The client is not connecting to a service on 127.0.0.1 (or a local IP on the director), despite ipvsadm installing a service with RIP=127.0.0.1.

Some services, e.g. telnet, listen on all IPs on the machine and you won't have to do anything special for them; they will already be listening on the VIP. Other services, e.g. http, sshd, have to be specifically configured to listen to each IP. Configuring the service to listen to an IP which is not the VIP is the most common mistake of people reporting problems with setting up LocalNode.

LocalNode operates independently of the NAT, TUN or DR modules (i.e. you can have LocalNode running on a director that is forwarding packets to realservers by any of the forwarding methods).

Horms 04 Mar 2003

From memory, this is what is going to happen: The connection will come in for the VIP. LVS will pick this up and send it to the realserver (which happens to be a local address on the director, e.g. 192.168.0.1). As this address is a local IP address, the packet will be sent directly to the local port without any modification. That is, the destination IP address will still be the VIP, not 192.168.0.1. So I am guessing that an application that is only bound to 192.168.0.1 will not get this connection.
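Since the most common LocalNode mistake is a service bound to the wrong address, a quick check of the listening sockets can help. The following is a small sketch, not an LVS tool: the function name is invented, it reads "address:port" pairs on standard input, and it only recognises the IPv4 wildcard forms `0.0.0.0` and `*`.

```shell
#!/bin/sh
# Reads listening sockets as "address:port", one per line, on stdin.
# Succeeds if at least one of them would accept a connection to VIP:PORT,
# i.e. it is bound to the VIP itself or to an IPv4 wildcard address.
# $1 = VIP, $2 = port
listens_on_vip() {
    vip=$1
    port=$2
    while read -r sock; do
        case $sock in
            "$vip:$port"|"0.0.0.0:$port"|"*:$port") return 0 ;;
        esac
    done
    return 1
}

# A demon bound only to 192.168.0.1 does not serve VIP 192.168.1.110.
if printf '%s\n' '192.168.0.1:80' '127.0.0.1:25' | listens_on_vip 192.168.1.110 80; then
    echo "a service is listening for the VIP"
else
    echo "nothing is listening for the VIP"
fi
```

On a live director the input could come from something like `ss -ltn | awk 'NR>1 {print $4}'`, assuming the local address is in the fourth column of your version's output.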

Two LocalNode Servers

We've only had the ability to have one service in LocalNode, till Horms made this proposal. Let us know if it works.

Horms 5 Jun 2007

If you want to use LVS to have two local services on the director, wouldn't an easy way be to bind the processes to 127.0.0.1 and 127.0.0.2 respectively and set them up as the realservers in LVS?

Two Box LVS

It's possible to have a fully failover LVS with just two boxes. The machine which is acting as director is also acting as a realserver using localnode. The second box is a normal realserver. The two boxes run failover code to allow them to swap roles as directors. The two box LVS is the minimal setup for an LVS with both director and realserver functions protected by failover.

An example two box LVS setup can be found at http://www.ultramonkey.org/2.0.1/topologies/sl-ha-lb-eg.html. UltraMonkey uses LVS, so this setup should be applicable to anyone else using LVS.

Salvatore D. Tepedino sal (at) tepedino (dot) org 21 Jan 2004

I've set one up before and it works well. Here's a page, http://www.ultramonkey.org/2.0.1/topologies/sl-ha-lb-overview.html, that explains how it's done. You do not have to use the ultramonkey packages if you don't want to. I didn't and it worked fine.

In practice, having the director also function as a realserver complicates failover. The realserver, which had a connection on VIP:port, will have to release it before it can function as the director, which only forwards connections on VIP:port (but doesn't accept them). If, after failover, the new active director is still listening on the LVS'ed port, it won't be able to forward connections.

Karl Kopper karl (at) gardengrown (dot) org 22 Jan 2004

At failover time, the open sockets on the backup Director may survive when the backup Director acquires the (now arp-able) VIP (of course the localnode connections to the primary director are dropped anyway), but that's not going to happen at failback time automatically. You may be able to rig something up with ipvsadm using the --start-daemon master/backup, but it is not supported "out-of-the-box" with Heartbeat+ldirectord. (I think this might be easier on the 2.6 kernel btw). Perhaps what you want to achieve is only possible with dedicated Directors not using LocalNode mode.

The "Two Box LVS" is only suitable for low loads and is more difficult to manage than a standard (non localnode) LVS.

Horms horms (at) verge (dot) net (dot) au 23 Jan 2004

The only thing that you really need to consider is capacity. If you have 2 nodes and one goes down, then will that be sufficient until you can bring the failed node back up again? If so go for it. Obviously the more nodes you have the more capacity you have - though this also depends on the capacity of each node.

My thinking is that for smallish sites having the linux director as a machine which is also a realserver is fine. The overhead in being a linux director is typically much smaller than that of a realserver. But once you start pushing a lot of traffic you really want a dedicated pair of linux directors.

Also once you have a bunch of nodes it is probably easier to manage things if you know that these servers are realservers and those ones are linux directors, and spec out the hardware as appropriate for each task - e.g. linux directors don't need much in the way of storage, just CPU and memory.

Horms horms (at) verge (dot) net (dot) au 26 Aug 2003

The discussion revolves around using LVS where Linux Directors are also realservers. To complicate matters more there are usually two such Linux Directors that may be active or standby at any point in time, but will be Real Servers as long as they are available.

The key problem that I think you have is that unless you are using a fwmark virtual service then the VIP on the _active_ Linux Director must be on an interface that will answer ARP requests.

To complicate things, this setup really requires the use of LVS-DR and thus, unless you use an iptables redirect of some sort, the VIP needs to be on an interface that will not answer ARP on all the realservers. In this setup that means the stand-by Linux Director.

Thus when using this type of setup with the constraints outlined above, when a Linux Director goes from stand-by to active then the VIP must go from being on an interface that does not answer ARP to an interface that does answer ARP. The opposite is true if a Linux Director goes from being active to stand-by.

In the example on ultramonkey.org the fail-over is controlled by heartbeat (as opposed to Keepalived which I believe you are using). As part of the fail-over process heartbeat can move an interface from lo:0 to ethX:Y and reverse this change as need be. This fits the requirement above. Unfortunately I don't think that Keepalived does this, though I would imagine that it would be trivial to implement.

Another option would be to change the hidden status of lo as fail-over occurs. This should be easily scriptable.

There are some more options too: Use a fwmark service and be rid of your VIP on an interface altogether. Unfortunately this probably won't solve your problem though, as you really need one VIP in there somewhere. Or instead of using hidden interfaces just use an iptables redirect rule. I have heard good reports of people getting this to work on redhat kernels. I still haven't had time to chase up whether this works on stock kernels or not (sorry, I know it has been a while).

(For other postings on this thread see the mailing list archive http://marc.theaimsgroup.com/?l=linux-virtual-server&m=103612116901768&w=2.)

Two Box LVS: both directors have active ipvsadm entries

The normal way to run the Two Box LVS is with no ipvsadm entries on the backup director. However keepalived does have ipvsadm entries, and a non-arp'ing VIP. If the backup director has ipvsadm entries then even though it's not receiving packets directly from the internet, a connection request can be forwarded from the active director. The backup director will attempt to loadbalance this request, which could be sent back to the active director. You are now in a loop. Here's the story of the discovery of the problem and the fix by Graeme.

Martijn Grendelman martijn (at) pocos (dot) nl 19 Dec 2007

I have a quite straightforward LVS-DR setup: two machines, both running a webserver on port 80, one of them directing traffic to either the local node or the other machine. I am using the 'sh' scheduler, as I have been for ages.

For a while now, directing traffic to the other machine (not the director) hasn't worked, BUT ONLY on a specific VIP:PORT combination. During my tests, the LVS setup is as follows:

    rr ipvsadm -L -n
    IP Virtual Server version 1.2.1 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  213.207.104.20:80 sh
      -> 213.207.104.11:80            Route   500    0          0
    TCP  213.207.104.20:81 sh
      -> 213.207.104.11:81            Route   500    0          0
    TCP  213.207.104.50:80 sh
      -> 213.207.104.11:80            Route   500    0          0

Note that all references to the local node have been temporarily removed.

Now, the service defined second (port 81) works. The third one, port 80 but a different VIP, works too. The first one, the one that I need, does not. When I connect to 213.207.104.20:80, I see some kind of SYN storm on both the director and the real server on 213.207.104.11:

Mostly identical lines (and nothing else) keep appearing at a high rate, even after I kill the connection on the client. This stops only after I remove the service from LVS.

Graeme

You likely have a pair of "battling" directors. Consider this:

Client sends SYN to director1. Director1 sends it on to director2, being the other realserver - so far this is your scenario.

(Joe: now only if director2 is active): Director2's LVS catches the packet and sends it back to director1 for service, but director1 already sent that connection to director2, so sends the packet back.

What happens now is that the second paragraph happens ad nauseam, until your ethernet between the machines is full up of the same SYN packet, performance degrades, and the directors fall over under the load (eventually).

Martijn

Indeed, the other realserver, being the backup director, had its LVS rules loaded. After clearing the LVS table on this machine, everything worked like ever before. In the past, the machines only had LVS active if they were actually the active director, but at some point in time, I figured I could just leave it active, because the stand-by director didn't get any requests anyway. But of course, in this DR fashion, that is not true.

Graeme I found it eventually, on the keepalived-devel@lists.sourceforge.net list - I'll post it verbatim below. This *should* allow you, with some modifications, to sort out your problem and keep an active/active master/backup (by this I mean with IPVS loaded and configured on both directors).

Client sends packet to VIP. Director1 (Master) has VIP on external interface, and has LVS to catch packets and load balance them. Director1 uses LVS-DR to route the packet either to itself (which is fine), or to Director2 (Backup).

There's the problem... In the case of keepalived, Director2 *also* has a VIP, and has ipvsadm rules configured to forward packets regardless of the VRRP mode (MASTER, BACKUP or FAULT). This makes for faster failover but leads directly to this problem/solution. In the backup director keepalived moves the VIP from the VRRP interface to lo, which is configured to not reply to arp requests.

In the basic case, 50% of the packets being forwarded by Director1 to Director2 *will now get sent back to Director1* by the LVS-DR configuration. Because Director1's LVS has already sent traffic for that connection to Director2, it forwards the traffic to Director2 again.

Time passes, friend. Your servers collapse under the weight of the amplifying traffic on their intermediate or backend (or frontend, if you're on a one-net setup) network.

The solution? A real nice easy one - use iptables to set a MARK on the incoming traffic - something like:

Then configure keepalived to match traffic on the MARK value instead of the VIP/PORT combination, like so:

...and so on for the other MARK values you define in your iptables setup.

This works perfectly where you have more than one interface and are routing inter-director traffic via a "backend". In the case of a single NIC on each box, you need a modified rule to NOT apply the mark value to packets sourced from the "other" director:

On node1 create an iptables rule of the form:

    -t mangle -I PREROUTING -d $VIP -p tcp -m tcp --dport $VPORT -m mac \
        ! --mac-source $MAC_NODE2 -j MARK --set-mark 0x6

where $MAC_NODE2 is node2's MAC address as seen by node1. Do a similar trick on node2:

    -t mangle -I PREROUTING -d $VIP -p tcp -m tcp --dport $VPORT -m mac \
        ! --mac-source $MAC_NODE1 -j MARK --set-mark 0x7

where $MAC_NODE1 is node1's MAC address as seen by node2.

Change your keepalived.conf so that it uses fwmarks.

    node1: virtual_server fwmark 6 {
    node2: virtual_server fwmark 7 {

The problem came up again, before the solution went into the HOWTO. Graeme directed Thomas to the original posting in the archive.

Thomas Pedoussaut thomas (at) pedoussaut (dot) com 15 Apr 2008

I have a very light infrastructure, with 2 servers acting as directors AND real servers. I came across the packet storm problem where, when the MASTER forwards a connection to the real server on the BACKUP (via DR), the BACKUP treats it as a VIP connection to be loadbalanced rather than a real server connection to process, and decides to load balance it back to the MASTER.

I'm sure there is a way to do it, maybe with iptables. I'm looking for a schema explaining how a packet coming in on an interface traverses the various layers (ipvs, netfilter, routing) so I could figure out how to do it.

Luckily I have 2 physical interfaces, one public and one private, so if a packet arrives on the private interface for the VIP, it's a DR from the MASTER, and if it comes in on the public one, it's pre-loadbalance traffic.

Another option would be to be sure that the tables are in sync between the 2 machines so the BACKUP knows that the connection has to be directed locally. I have tried to set up that feature, but it doesn't seem to sync really.

Joe: Here's my explanation of Graeme's problem, in case you haven't got it yet. The problem only occurs if ipvsadm rules are loaded on the backup director (having the rules loaded makes failover simpler).

Here's the two NIC version of the problem

    [diagram lost; surviving labels: "MAC of RIP2" normal packet, "MAC of RIP1<-CIP" spurious packet]

The client sends a connect request to VIP:port.

The active director picks the realserver to forward the request to and, having two choices, delivers it either to the localnode or to the MAC address of eth1 on the other realserver (with RIP2). In a normal LVS, the packet would be accepted by the demon listening on the VIP.

If the packet is delivered to eth1/RIP2, it will not be delivered to the demon listening on the VIP, but will first be processed by ip_vs().

There's a 50% chance the packet will be forwarded to the localnode, which will generate a normal reply to the client. Although the client gets the expected response, we don't want the packet to go through ip_vs(). We want it delivered directly to the demon.

There's a 50% chance that the packet will be returned by ip_vs() to the active director at eth1/RIP1 (rinse, lather, repeat).

We don't want packets to the VIP coming in on eth1 to be processed by ip_vs() on the backup director (here acting as a realserver). We want the packets to be delivered to the demon. We do want packets to the VIP coming in on eth0 to be processed by ip_vs().

Solution: fwmark packets coming in on eth0, to VIP:80. Load balance on the fwmark. Packets for the VIP coming from 0/0 to eth0 will be load balanced. Packets for the VIP coming in on eth1 will not be load balanced and will be delivered to the demon.

Here's the 1 NIC version of the problem

    MAC of eth0 on backup         normal packet
    MAC of eth0 on active<-CIP    spurious packet

            VIP
             |
             |----------------
             |               |
         eth0 VIP        eth0 VIP
          _______         _______
         |       |       |       |
         | active|       | backup|
         |_______|       |_______|

The client sends a connect request to VIP:port.

The active director picks the realserver to forward the request to and, having two choices, delivers it either to the localnode or to the MAC address of eth0 on the backup director (functioning as a realserver). In a normal LVS, the packet would be accepted by the demon listening on the VIP.

If the packet is delivered to eth0 on the backup director, it will not be delivered to the demon listening on the VIP, but will first be processed by ip_vs().

There's a 50% chance the packet will be forwarded to the localnode, which will generate a normal reply to the client. Although to the client the LVS is functioning correctly, we want the packet delivered directly to the demon and not to go through ip_vs().

There's a 50% chance that the packet will be returned to the MAC of the active director at eth0/VIP.

We don't want packets to the VIP on the backup director (realserver) coming from the MAC of eth0 on the active director to be processed by ip_vs(). We do want packets to the VIP coming in on eth0, from anywhere else but the MAC of the other director, to be processed by ip_vs().

Solution: fwmark packets coming in on eth0, to VIP:80, but not those coming from the MAC of the other director. Load balance on the fwmark. Packets to VIP:port from 0/0 will be loadbalanced. Packets to VIP:port from the other MAC address will not be loadbalanced and will be delivered directly to the demon.

Joe: It occurs to me that ipvsadm doesn't have the -i eth0 or -o eth0 options that the other netfilter commands have. Does a packet arriving on LOCAL_IN know which NIC it came in on?

Horms 29 Dec 2008

It would be possible, and I believe that the information is available in LOCAL_IN. But there are a lot of different filters that can be applied through netfilter. And rather than adding them all to ipvsadm, I think it makes a lot more sense to just make use of fwmark.
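The two solutions above can be written out as commands. This is a sketch under assumptions: the function names, addresses, interface name and mark values are made up for illustration, round robin is chosen arbitrarily, and the commands are only printed, not run. The `-f` option of ipvsadm defines a virtual service on a firewall mark instead of VIP:port.

```shell
#!/bin/sh
# Sketch only: prints commands instead of running them.
# Addresses, interface names and mark values are illustrative assumptions.

# Two-NIC case: mark traffic for VIP:PORT only when it arrives on the
# public interface, so packets forwarded by the peer over the private
# interface stay unmarked and go straight to the local demon.
# $1 = VIP, $2 = port, $3 = public interface, $4 = mark
mark_rule_two_nic() {
    echo "iptables -t mangle -I PREROUTING -i $3 -d $1 -p tcp -m tcp --dport $2 -j MARK --set-mark $4"
}

# One-NIC case: mark traffic for VIP:PORT unless it was sent by the peer
# director (identified by its MAC address).
# $1 = VIP, $2 = port, $3 = MAC of the peer director, $4 = mark
mark_rule_one_nic() {
    echo "iptables -t mangle -I PREROUTING -d $1 -p tcp -m tcp --dport $2 -m mac ! --mac-source $3 -j MARK --set-mark $4"
}

# Virtual service keyed on the mark (-f), LVS-DR (-g) to each realserver.
# $1 = mark, remaining arguments = realserver IPs
fwmark_service() {
    mark=$1
    shift
    echo "ipvsadm -A -f $mark -s rr"
    for rip in "$@"; do
        echo "ipvsadm -a -f $mark -r $rip -g"
    done
}

mark_rule_two_nic 192.168.1.110 80 eth0 6
mark_rule_one_nic 192.168.1.110 80 00:11:22:33:44:55 6
fwmark_service 6 127.0.0.1 192.168.2.2
```

With keepalived the ipvsadm part would instead be expressed as a `virtual_server fwmark 6` block, as in the posting above.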

Testing LocalNode

If you want to explore installing localnode by hand, try this. First make sure scheduling is turned on at the director (this command adds round robin scheduling and direct routing)

With an httpd listening on the VIP (192.168.1.110:80) of the director (192.168.1.1) AND with _no_ entries in the ipvsadm table, the director appears as a normal non-LVS node and you can connect to this service at 192.168.1.110:80 from an outside client. If you then add an external realserver to the ipvsadm table in the normal manner with

then connecting to 192.168.1.110:80 will display the webpage at the realserver 192.168.1.2:80 and not the director. This is easier to see if the pages are different (e.g. put the real IP of each machine at the top of the webpage).

Now comes the LocalNode part - you can now add the director back into the ipvsadm table with

(or replace 127.0.0.1 by another IP on the director)

Note, the port is the same for LocalNode. LocalNode is independent of the LVS mode (LVS-NAT/Tun/DR) that you are using for the other IP:ports.

Shift-reloading the webpage at 192.168.1.110:80 will alternately display the webpages at the server 192.168.1.2 and the director at 192.168.1.1 (if the scheduling is unweighted round robin). If you remove the (external) server with

you will connect to the LVS only at the director's port. The ipvsadm table will then look like

    director:/etc/lvs# ipvsadm
    Remote Addr Weight ActiveConns TotalConns ...
    TCP 192.168.1.110:80 ==> 127.0.0.1 2 3 3

From the client, you cannot tell whether you are connecting directly to the 192.168.1.110:80 socket or through the LVS code.
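The individual commands for this walkthrough are not shown above. The following sketch gives one plausible sequence using current ipvsadm options (`-A`/`-a`/`-d`, `-t`, `-r`, `-s rr`, `-g`); it is an assumption about what the steps look like today, not the original listing, and the syntax of the ipvsadm version the text was written for may differ. By default the script only prints the commands; the `APPLY` switch and the `run` helper are inventions of this example.

```shell
#!/bin/sh
# Dry run by default: commands are printed. Set APPLY=1 to execute (as root).
run() {
    if [ "${APPLY:-0}" = 1 ]; then
        "$@"
    else
        echo "$*"
    fi
}

VIP=192.168.1.110:80

localnode_walkthrough() {
    # virtual service on the VIP with round robin scheduling
    run ipvsadm -A -t "$VIP" -s rr
    # external realserver, forwarded by direct routing
    run ipvsadm -a -t "$VIP" -r 192.168.1.2 -g
    # LocalNode: the director itself as a realserver
    run ipvsadm -a -t "$VIP" -r 127.0.0.1
    # remove the external realserver again, leaving only LocalNode
    run ipvsadm -d -t "$VIP" -r 192.168.1.2
    # list the resulting table
    run ipvsadm -L -n
}

localnode_walkthrough
```

Running each step separately and reloading the web page in between shows the behaviour described in the text.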

Localnode on the backup director With dual directors in active/backup mode, some people are interested in running services in localnode, so that the backup director can function as a normal realserver rather than just sit idle. This should be do-able. There will be extra complexity in setting up the scripts to do this, so make sure that robustness is not compromised. The cost of another server is small compared to the penalties for downtime if you have tight SLAs. Jan Klopper janklopper (at) gmail (dot) com 2005/03/02 I have 2 directors running hearthbeat and 3 realservers to process the requests. I use LVS-DR and want the balancers to also be realservers. Both directors are setup with localnode to serve requests when they are the active director, but when they are the inactive director, it is idle. If I add the VIP with noarp to the director, hearthbeat would not be able to setup the VIP when it becomes the active director. Is there any way to tell hearthbeat to toggle the noarp switch on the load balancers instead of adding/removing the VIP? Ideal sollution would be like this: secondary loadbalancer carries the VIP with noarp (trough noarp2.0/noarpctl) and can thus be used to process querys like any realserver. If the primary loadbalancer fails, the secondary loadbalancer would disable the noarp program, and thus start arping for the VIP, becoming the load balancer, using the local node feature to continue processing requests. If the primary load balancer comes back up, it either takes the role as secondary server (and adds the VIP with noarp to become a realserver), or becomes the primary load balancer agian, which would trigger the secondary load balancer to add the noarp patch again, (which would make it behave like a realserver again) I figured we could just do the following: replace the line that says, ifconfig eth0 add VIP netmask ... with: noarpctl del VIP RIP. And the other way around: replace the line: ifconfig eth0:0 del VIP netmask ... 
with noarpctl add VIP RIP. The only point I don't know for sure is: will the new server begin replying to arp requests as soon as noarp has been deleted? Joe yes. However the arp caches for the other nodes will still have the old MAC address for the VIP and these take about 90secs to expire. Until the arp cache expires and the node makes another arp request, the node will have the wrong MAC address. Heartbeat handles this situation by sending 5 gratuitous arps (arp broadcasts) using send_arp just to make sure everyone on the net knows the new MAC address for the VIP. Graeme Fowler graeme (at) graemef (dot) net (addressing the issue that complexity is not a problem in practice) I've got a 3-node DNS system using LVS-DR, where all 3 nodes are directors and realservers simultaneously. I'm using keepalived to manage it all and do the failover, with a single script running when keepalived transitions between MASTER and BACKUP or FAULT and back again. It uses iptables to add an fwmark on the incoming requests, then uses the fwmark check for the LVS. Basic configuration is as follows:

    lvs_id DNS02
    }
    static_routes {
        # backend managment LAN
        1.2.0.0/16 via 1.2.0.126 dev eth0
    }
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ! VRRP synchronisation
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    vrrp_sync_group SYNC1 {
        group {
            DNS_OUT
            GW_IN
        }
    }
    vrrp_instance DNS_1 {
        state MASTER
        interface eth0
        track_interface {
            eth1
        }
        lvs_sync_daemon_interface eth0
        virtual_router_id 111
        priority 100
        advert_int 5
        smtp_alert
        virtual_ipaddress {
            5.6.7.1 dev eth1
            5.6.7.2 dev eth1
        }
        virtual_ipaddress_excluded {
            5.6.7.8 dev eth1
            5.6.7.9 dev eth1
        }
        virtual_routes {
        }
        notify_master "/usr/local/bin/transitions MASTER"
        notify_backup "/usr/local/bin/transitions BACKUP"
        notify_fault "/usr/local/bin/transitions FAULT"
    }
    vrrp_instance GW_IN {
        state MASTER
        garp_master_delay 10
        interface eth0
        track_interface {
            eth0
        }
        lvs_sync_interface eth0
        virtual_router_id 11
        priority 100
        advert_int 5
        smtp_alert
        virtual_ipaddress {
            1.2.0.125 dev eth0
        }
        virtual_routes {
        }
    }
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    ! DNS TCP
    !!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
    virtual_server fwmark 5 {
        smtp_alert
        delay_loop 30
        lb_algo wlc
        lb_kind DR
        persistence_timeout 0
        protocol TCP
        real_server 1.2.0.2 53 {
            weight 10
            inhibit_on_failure
            TCP_CHECK {
                connect_timeout 10
                connect_port 53
            }
            MISC_CHECK {
                misc_path "/usr/bin/dig @1.2.0.2 -p 53 known_zone soa"
                misc_timeout 10
            }
        }

The configuration is completed by the /usr/local/bin/transitions script, the list of addresses in /etc/resolver_ips and (amongst other things) settings in /etc/sysctl.conf. So we have a single MASTER and two BACKUP directors in normal operation, where the MASTER has "resolver" IP addresses on its "external" NIC, and the BACKUP directors have them on the loopback adapter. Upon failover, the transitions script moves them from loopback to NIC or vice-versa. The DNS server processes themselves are serving in excess of 880000 zones using the DLZ patch to BIND so startup times for the cluster as a whole are really very short (it can be cold-started in a matter of minutes). 
In practice the system can cope with many thousands of queries per minute without breaking a sweat, and fails over from server to server without a problem. You might think that this is an unmanageable methodology and is impossible to understand, but I think it works rather well :)
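As an illustration of the failover step described above (resolver IPs on the external NIC when MASTER, on loopback when BACKUP or FAULT), here is a minimal dry-run sketch of what a notify script of this kind might do. It is not Graeme's transitions script. The function name transition, the use of "ip addr", the /32 prefix and the choice of 5.6.7.8 and 5.6.7.9 as the resolver addresses are all assumptions made for the example, and the fwmark handling is left out.

```shell
#!/bin/sh
# Hypothetical stand-in for a keepalived notify script (dry run: prints only).
RESOLVER_IPS="5.6.7.8 5.6.7.9"   # assumed contents of /etc/resolver_ips
EXT_DEV=eth1                     # "external" NIC, as in the keepalived config

transition() {
    case "$1" in
        MASTER)       from=lo;       to=$EXT_DEV ;;   # become director
        BACKUP|FAULT) from=$EXT_DEV; to=lo ;;         # behave as a realserver
        *) echo "usage: transitions MASTER|BACKUP|FAULT" >&2; return 1 ;;
    esac
    for ip in $RESOLVER_IPS; do
        # move each resolver address from one device to the other
        echo "ip addr del $ip/32 dev $from"
        echo "ip addr add $ip/32 dev $to"
    done
}

transition "${1:-MASTER}"
```

On a BACKUP node the addresses sit on lo, so the node can answer for them as an LVS-DR realserver without replying to arp requests for them (given the usual arp sysctl settings).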

rewriting, re-mapping, translating ports with Localnode see For an alternate take on mapping ports for LocalNode see .

LVS: You can't map (or rewrite) ports with LVS-DR, LVS-Tun or localnode (but you can with iptables) The LVS-NAT director rewrites the dst_addr in the header of the packet coming from the client from the VIP to the RIP. The reply packet from the realserver has the src_addr in the header rewritten restoring the RIP src_addr to the VIP. Once you've incurred the cost of disassembling the packet, it is trivial to rewrite the dst_port at the same time. So LVS-NAT can rewrite (or map) the ports. Thus the client could send a packet to VIP:80 but when it arrives at the realserver, the packet will be going to RIP:1080. On the other hand LVS-DR, LVS-Tun and Localnode just forward the packet to the target, with no disassembly of the packet header. Thus you cannot remap (rewrite) the destination port in LVS. However you can use iptables to rewrite the ports. The examples below can be used for any of LVS-DR, LVS-Tun and Localnode.

You can't rewrite ports with localnode (but you can with iptables) Paul Monaghan wrote: Okay, perhaps what I am trying to do can't work for whatever reason, but here it is. I've set up ipvs rules as follows:

    10.0.0.1:8001 Local 1 1 0
    -> 10.0.0.1:8000 Local 1 0 0

Wensong

For the LocalNode feature, the load balancer just lets the packets pass to the upper layer (up to service daemon) to process the request. So, the port number of the local service must be equal to that of the virtual service, otherwise it won't work. Maybe I should add some code to check whether the port number of local service is equal to that of virtual service, if not, reject it to avoid such an error.

i.e. you can't use localnode with LVS-NAT and have it rewrite ports. You can't have a request coming to port VIP:80 and have it serviced by a demon listening on 127.0.0.1:81. If you want to do this, you could try rewriting the port with iptables instead, so that requests to 192.168.1.12:80 will go to 192.168.1.12:81 (more info is available). Pablo Ares paresd (at) airtel (dot) net 19 Jul 2004

I have a configuration with only two machines that act both as directors and realservers (Localnode). With a Localnode configuration you can't do port redirection/rewrite, whatever the forwarding method (DR, TUN or NAT). I need port redirection because I want to offer a Virtual HTTP Service on port 80, and map this service to two realservers running Tomcat on port 8080 with an unprivileged account. I tried an iptables DNAT rule in the PREROUTING chain. This rule works well for the traffic that is mapped to the local realserver, but the traffic that goes to the other realserver returns to the client with source port 8080 (which causes the client to reset the TCP connection). I tested this configuration with LVS/NAT and with LVS/DR with the forward_shared (source martians) patch. Is it possible to do port redirection in a Localnode environment?

ratz If I understand you correctly, the other RS is a physically different machine, right? You need someone to do a port mapping for you on your back-path. First idea: eth0[director/node1]eth1 -----> eth0[node2] Two NAT rules:

    iptables -t nat -A PREROUTING -i eth0 -p tcp -d $VIP --dport 80 \
        -j DNAT --to $VIP:8080
    iptables -t nat -A POSTROUTING -o eth0 -p tcp -d $CIP -s $RIP \
        --sport 8080 -j SNAT --to-source $IP_of_eth0:80

The "problem" is that netfilter maintains a template table which is used to look up the n-tuple corresponding to your initial connection attempt which was port-redirected. Of course the source port of the outgoing packet is then not known, which gives you little to no option of back mapping the port. What you could do is have a tcp forwarding tool on a local socket on node2 which redirects traffic to the local socket on port 8080. There are other possibilities, however I'm not sure if I understand your current setup correctly.

The problem is solved. I applied the NFCT patch (http://www.ssi.bg/~ja/nfct/) and added a second iptables rule in the POSTROUTING chain.

Mikel Ruiz Echeverria Jun 07, 2005

I would like to balance the service over two real instances, all running on the same machine. I have tried two versions of ldirectord.cf, but when I start ldirectord I get an error. It seems that the IPs of the realservers cannot be the same as the director's. Must the director and realservers run on different machines to work properly with LVS?

Horms Unfortunately, what you are trying to do is not possible, and here is why: When you set up a real server that is on the same machine as LVS, it uses a special forwarding mechanism called Local. It uses this regardless of whether you asked for Masq, Route or Tun. You can't ask for it, it just knows if the address is local and sets it. You can however observe it using ipvsadm -L. The reason for this is that running packets that are going to be delivered to a local process through Masq, Route or Tun has overhead and in most cases makes very little sense. However, the downside is that the Local forwarding mechanism (like Route and Tun, but unlike Masq) does not allow port-mapping. That is, your port 8090 packets will stay as port 8090 packets. So in a nutshell IPVS translates your configuration into one where both realserver entries have the same address and port, which obviously isn't going to work because you have a duplicate entry, and that is what the error message you are getting is trying to say. Well, that's what it should be trying to say; it looks like there might be a bit of a bug in ldirectord somewhere, but that doesn't change the fact that IPVS can't do what you want to do. I believe an easy solution to this problem would be to deliver the packets to different addresses rather than different ports; something like that might just work. A longer term solution would be to fix up the way the Local delivery mechanism works. But this would likely be quite tricky, and would certainly increase its current complexity - it's basically a no-op at the moment.

rewriting, re-mapping, translating ports with iptables in LVS-DR With LVS-NAT you can rewrite ports. However, LVS-DR and LVS-Tun just forward the packets to the realserver without rewriting the ports. You could write code to rewrite the ports before the packet left the director, if you wanted to. However, the replies from the realserver go directly to the client and do not return through the director. For the client to receive a reply from the correct source port, the realserver would have to rewrite the ports for the reply. There doesn't seem to be enough demand for rewritten ports with LVS-DR or LVS-Tun for anyone to have bothered to write the code. You can still enter a port for the realserver with ipvsadm -r in the same way that you can for LVS-NAT. However with LVS-DR and LVS-Tun, the port is silently ignored. This leads people to mistakenly think that they can rewrite the ports with LVS-DR. It would be better if ipvsadm disallowed a port with the -r option, or at least gave a warning and exited with a non-zero error code. Horms will have a fix out for the next releases of ipvsadm.

Horms horms (at) verge (dot) net (dot) au 21 May 2004 ipvsadm has code to change the port if it doesn't match. Below the service on the realserver is entered as 10.0.0.3:200 but is added to the ipvsadm table as 10.0.0.3:100. However no warning is generated.

    RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP 10.0.0.1:100 wlc
    -> 10.0.0.3:100 Route 1 0 0
    -> 10.0.0.2:100 Route 1 0 0
    -> 10.0.0.1:100 Route 1 0 0

I don't think it is a good idea to disallow specifying the port. It may break some people's scripts, and seems unnecessary. Here's a possible patch. Now when you enter a mismatched port (e.g. the same 10.0.0.3:200 service on the realserver) you get a warning.

    RemoteAddress:Port Forward Weight ActiveConn InActConn
    TCP 10.0.0.1:100 wlc
    -> 10.0.0.3:100 Route 1 0 0
    -> 10.0.0.2:100 Route 1 0 0
    -> 10.0.0.1:100 Route 1 0 0

You can still rewrite the ports at the realserver with iptables allowing the realserver to listen on another port.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 04 Mar 2004 If you need to rewrite ports for LVS-DR or LVS-Tun, just use an iptables rule on the realserver. Here the client connects to VIP:80 on the director and the realserver is listening on VIP:9999. It works for me for tomcat standalone servers.
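A rule of the kind described here might look like the sketch below, run on the realserver. It is a dry run that prints the rule rather than installing it. The VIP value and the function name port_rewrite_rule are made up for the example, and using DNAT (rather than, say, REDIRECT) is an assumption; the rule actually posted may have been different.

```shell
#!/bin/sh
# Dry run: prints an iptables rule for the realserver instead of installing it.
VIP=192.168.1.110   # hypothetical VIP, configured on the realserver as usual for LVS-DR

port_rewrite_rule() {   # usage: port_rewrite_rule VIP client_port daemon_port
    # Packets arriving for VIP:client_port are re-addressed to VIP:daemon_port
    # before local delivery; connection tracking restores the original port
    # on the replies, so the client still sees VIP:client_port.
    echo "iptables -t nat -A PREROUTING -p tcp -d $1 --dport $2 -j DNAT --to-destination $1:$3"
}

port_rewrite_rule "$VIP" 80 9999
```

The director is left untouched: it forwards VIP:80 as before and never sees the port the demon really listens on.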

can't port map with LVS Ryan P Linn Oct 01, 2003

I'm currently using a setup where I have individual webservers which are using port based virtual hosts in apache. For instance, I have ports 5678 and 5679 which map to ports 80 and 443 on a virtual host. I'm currently using a commercial solution to schedule these hosts and keep them persistent together, however I'm hoping to switch these over to my LVS-DR box. It appears that the fwmark group is what I would want to use to keep people going to both ports persistent, but from the documentation it didn't appear that you could do port mapping while doing fwmarks. I was wondering if anyone had done this and if they could share how they made it work. This would be for a shopping cart type application where switching between port "80" and "443" is necessary for security, but because the application uses php sessions it has to go back to the same server each time. It appears very easy to do if they were actually listening on ports 80 and 443, but since they're not I'm very confused about the correct way to configure this.

Horms horms (at) verge (dot) net (dot) au 02 Oct 2003 The short answer is that you can't, using LVS. But I wonder if it might be possible to change the destination port using netfilter before or after the packets hit LVS. Alternatively it would be possible to modify LVS to do this; the main issue in my mind would be working out a sane way to configure it.

LVS: Non-LVS clients on Realservers This HOWTO is a little disorganised here. Read the section on too.

always NAT out clients through VIP This section (Jan 2007) is a collection of material that previously has been scattered throughout the HOWTO, including in the old sections on 3-Tier LVS's and authd. In its simplest form, an LVS is a highly available server. Realservers are servers only: they reply directly to the client and don't need to connect to other machines to do so. This model serves well for telnet (used for testing) and the widely deployed http. With http as the most often deployed service, this model lasted a surprisingly long time. It wasn't long before we found that realservers were required to do more than just serve: client processes on the realserver made calls, often back to the LVS client (the CIP). The first client we found was identd, which connects to the CIP. We didn't know what to do with this client and since it wasn't needed and could be turned off, we did just that, solving the immediate problem. We assumed we had a one-off problem that we wouldn't see again and we didn't see any bigger picture. The write-up on identd is long, not because anyone needs to understand it in any depth, but because it was a big problem with LVS in the early days and we put some effort into figuring it out. Next, for LVS's running a web based database, a client process on the realserver connects to the database machine (a 3-Tier setup). The database was running on a machine under our control and the connection was local and was easily handled. We thought we'd handled another special case. Now and again an administrator would want access to the outside world from a realserver, or a script would need to pull from the internet (sometimes requiring access by a DNS client running on the realserver). These cases were handled by NAT'ing out the connection through some convenient machine (often the director). Again these were treated as another special case and implemented slightly differently for LVS-NAT/DR/Tun. 
These connections came from the primary IP on the outside of the director and not the VIP. By the time we figured this out, no-one was running identd anymore and the identd case was not revisited. It took a while for the next step; Francois JEANMOUGIN realised that you could NAT out the connection through the VIP. This solution wasn't often needed, since you could usually pull data or resolve hostnames, no matter what IP you used to make the call. We forgot about this trick and Graeme Fowler had to reinvent it. (We still hadn't "got it".) The ftp-data connection, in the standard two port service, requires similar handling, but for LVS-NAT ftp has its own helper, while for LVS-DR/Tun ftp is handled by persistence. Again we regarded this as another special case. However some server processes on the realserver also make calls to the internet, e.g. an MTA which receives e-mail on the VIP, and which forwards the e-mail, must forward it from the VIP. When there are multiple VIPs, each with its instance of the server process, client calls from each instance of the server process must be NAT'ed out through the appropriate VIP. David M was the first to describe a working multiple VIP/multiple client setup, which showed the generalisation that we'd been missing: clients running on the realservers, which are calling on behalf of a server process listening on the VIP (or RIP for LVS-NAT), have to call from the VIP. Thus an MTA on the realservers listening on the VIP, when it connects to another MTA, has to connect from the VIP. In an LVS'ed DNS, when named connects to other machines, these calls must come from the VIP. In contrast, the client call for name resolution for the MTA client doesn't have to come from the VIP, since the name resolution is not being LVS'ed. The connect request from a database client running on the realserver, which accesses a database on the LAN, doesn't have to come from the VIP, since the database call is not being LVS'ed. 
If you're unsure as to whether the call needs to come from the VIP, think of the standalone server; which IP does the client call need to come from? After seeing David's solution, I scanned for unsolved problems on the mailing list, to find postings about server setups that worked on a standalone server, but which didn't work in an LVS. These setups were behind a director using NAT rules, where the client process emerged with src_addr!=VIP, but which required src_addr=VIP. (No, we didn't fix the problem; presumably the poster(s) went to a commercial solution.) The lesson from this is to NAT your realserver client processes from the VIP, unless you're sure that it's not needed. The rest of this section is just amplification of this statement. If you understand David M's posting on masquerading through multiple VIPs then you're done here.

Masquerading clients on realservers to the outside world (SNAT) Sometimes a client process on the realserver will need to contact the outside world, e.g. (1) the LVS'ed server process may need to run a client process to connect to another computer, e.g. to access a database, or to initiate an smtp connection to the next MTA in the chain; (2) the LVS'ed process may make a callback to a process running on the LVS client (e.g. the ftp-data port with ftp); (3) a process independent of the LVS'ed service may need to periodically connect to an outside computer, e.g. ftp to upload logs, or DNS (the realserver knows the CIP already, so this won't be for the LVS'ed service). Clients on realservers can call from the RIP or VIP. By default, clients will call from the RIP, since it is the primary IP on the realserver. Often the client of the LVS or an outside machine will expect the call to come from the VIP, which is handled by NAT'ing the call. If the LVS has multiple VIPs, then the call must come from the correct VIP.

RIP: Clients like telnet call from the RIP, as do the clients of some callbacks e.g. rshd. Some services, e.g. MTAs which receive e-mail on the VIP, will initiate sending e-mail from the RIP, this being the primary IP on the NIC. Usually the RIP is a private IP and will not be routable. If the resources needed by the client are local, e.g. a local nameserver with its own connection to the internet, or a database server, then a non-routable RIP is fine. If you need to route packets from a routable IP, you could make the RIPs routable, but from the security point of view you don't want to make your realservers publicly accessible, so making the RIP routable is not generally a good idea.

VIP: Clients which are associated with a service listening on the VIP and which make callbacks from the VIP to the LVS client. The instance of this that we know about is passive ftp. The general solution for callbacks from the VIP is to write a helper module for the director. 
If you don't have one, then you're stuck - in this case look at the sections on the individual services for attempts at solutions. A possible solution is to use persistence with port=0 as can be done for ftp (port=0 forwards all ports, increasing security problems, and should be avoided if at all possible).

To handle calls from the RIP, you can NAT the connections out through any available box: for LVS-NAT, the director is available; for LVS-DR/LVS-Tun both the director and the default gw box are possibilities (although you may not have access to the default gw box). In the case of LVS-NAT, the director is already the default gw for packets from the RIP (since you need to route the replies from the LVS'ed service through the director). In the case of LVS-DR/LVS-Tun, the default gw for packets from the VIP is through a router that is not the director; the default gw for packets from the RIP is not part of the LVS setup, but will probably also be the same router box. In this case, the packets from the RIP will need to be routed instead to the director (you can use the iproute2 tools for this). If you don't do anything special, the NAT'ed requests will come from the primary IP on the outside of the director (the VIP is usually a secondary IP, so that it can be moved on failover). Below we show how to make the call from the director's VIP. In the case of LVS-DR/LVS-Tun, the VIP on the outside of the director doesn't send any packets, and doesn't need a route (see routing for LVS-DR). If you NAT out through the VIP on an LVS-DR or LVS-Tun director, then you will need to put in a default gw for packets from the VIP (you normally don't have a default route for packets from the VIP for LVS-DR or LVS-Tun).

Masquerading clients on LVS-NAT realservers Here's the command to run on a 2.2.x director to allow realserver1 to telnet to the outside world.

    director: #ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0

With LVS-NAT and a single director, the VIP will be the primary IP on the outside of the director and the packets will have src_addr=VIP. Otherwise the packets will come from an IP which is not the VIP. You may have to turn off icmp redirects, if you have a one network LVS-NAT.

    director: #echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    director: #echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects

After running these commands you can telnet from the realservers. You can do this even if telnet is an LVS'ed service, since the telnet client and demon running on the realserver operate independently of each other. Here are the IP:port pairs seen by `netstat -an` on each machine, first for a client on the internet telnet'ing to an LVS forwarding by LVS-NAT (VIP:23 - CIP:1041->RIP:23), then for the realserver connecting by masquerading through the director to the telnetd on the LVS client. The masqueraded connection to the LVS client comes from the primary IP of the director (here the DIP) and not from the VIP, which in this setup is an alias (secondary IP) of the DIP. The masqueraded ports can be seen on the director. For both connections, the director doesn't have connections to any of its ports. In the case of LVS, the director is just forwarding packets like a router. In the masquerading case, the director rewrites the headers before forwarding the packets like a router. Connections from clients start at high_port=1024. The masqueraded ports start at port=61000 (not 1024) (at least for kernel 2.2.x). The port number increments for each new connection in both cases. In the case where a machine is both connecting to the outside world (using ports starting at 1024) and masquerading connections from other machines (using ports starting at 61000), there is no port collision detection. 
This can be a problem if the machine is masquerading a large number of connections and the port range has been increased. The masqueraded ports start at (64k-4k)=61440 for 2.2.x kernels. 2.4.x kernels can use all ports for masquerading. Peter Klapprodt peter (dot) klapprodt (at) ewido (dot) net 21 Jul 2005

Any ideas on how to get internet access working on the real servers (i.e. clients unrelated to the LVS services) using LVS-NAT? I've read something about virtual_routes in keepalived but couldn't find any detailed instructions yet.

graeme (at) graemef (dot) net ..in exactly the same way you would for an ordinary masqueraded network: the realservers use the active director as their default gateway; on the director, turn on forwarding (echo 1 > /proc/sys/net/ipv4/ip_forward); on the director, set up masquerading for the realserver network with an ACCEPT rule for network/netmask and

    iptables -t nat -A POSTROUTING -s network/netmask -j MASQUERADE

and that's it! Any packet which returns to the director which is not hooked by LVS as part of an active connection will fall through to the nat POSTROUTING chain and get masqueraded. PMilanese (at) nypl (dot) org 22 Jul 2005 Do not use the static interface assignment for the gateway. Use the virtual (dynamic) interface (the DIP). If the directors fail over you need the gateway to move with the active director.
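The director-side setup described here can be sketched as follows (dry run, prints the commands only). The realserver network 10.0.0.0/24 and the function name masq_setup are placeholders invented for the example; the ACCEPT rule from the posting is not reproduced.

```shell
#!/bin/sh
# Dry run: prints the director-side commands rather than executing them.
RIP_NET=10.0.0.0/24   # hypothetical realserver network

masq_setup() {   # usage: masq_setup network/netmask
    # let the director forward packets between its interfaces
    echo "echo 1 > /proc/sys/net/ipv4/ip_forward"
    # masquerade anything from the realserver network that LVS does not claim
    echo "iptables -t nat -A POSTROUTING -s $1 -j MASQUERADE"
}

masq_setup "$RIP_NET"
```

On the realservers the only change is the default route, which should point at the DIP that moves with the active director rather than at a director's fixed address.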

Masquerading clients on LVS-DR realservers The realserver in LVS-DR has two IPs, the RIP and the VIP. The LVS'ed services are running on the VIP. Packets from LVS'ed services, returning from the realserver, have src_addr=VIP. The RIP is not directly involved in the LVS. Services may be running on the RIP too, e.g. telnetd which listens to 0.0.0.0, but services running on the RIP are of no interest to an LVS-DR. The director only needs the RIP to determine the target MAC address to forward packets from the clients destined for the VIP. Thus you are free to do whatever you like with the RIP without affecting the LVS. Usually the RIP is on a private IP (e.g. 192.168.x.x) so as to not require an extra IP, and to shield the realserver from the internet. It would be unusual to run non-LVS'ed services on the realservers, as the RIP would have to be a public IP and the realservers would have to be firewalled. However it is reasonable to run clients on the realservers. A client session (e.g. telnet) initiated from the RIP would have to be NAT'ed out to the outside world. The NAT box could be the router or the director. Here's how to set up with the director doing the NAT'ing (the router setup would be the same).

Send client packets (src_addr=RIP) to the director and LVS packets (src_addr=VIP) to the router This is not possible with the standard destination-based route command. You need the policy routing tools from iproute2. Here's Julian's recipe (25 Sep 2000) for setting up NAT for clients on realservers in an LVS-DR LVS. For the realserver(s), send all packets from the RIP network (RIPN) to the DIP (an IP on the director in the RIPN). The director has to listen on the DIP (if it doesn't already), must not send ICMP redirects from the DIP ethernet device, and has to masquerade (all) packets from the RIPN.

    director: #echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    director: #echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
    # for 2.2 kernels, all services
    director: #ipchains -A forward -s RIPN/24 -j MASQ
    # for 2.2 kernels, telnet only
    director: #ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0
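The realserver half of this recipe (source-based routing with iproute2) might look like the sketch below. It is a dry run that prints the commands. The network and DIP values, the table number 100 and the function name policy_routes are assumptions for the example, not Julian's exact lines.

```shell
#!/bin/sh
# Dry run: prints the iproute2 commands for a realserver instead of running them.
RIPN=192.168.1.0/24   # hypothetical RIP network
DIP=192.168.1.9       # hypothetical director IP inside the RIPN
TABLE=100             # arbitrary, otherwise unused routing table number

policy_routes() {
    # packets whose source address is in the RIP network consult table $TABLE ...
    echo "ip rule add from $RIPN table $TABLE"
    # ... whose only entry is a default route via the director
    echo "ip route add default via $DIP table $TABLE"
}

policy_routes
```

Packets with src_addr=VIP do not match the rule, so they keep using the main routing table and leave through the router as before.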

add a default route for packets from the primary IP on the outside of the director For LVS-DR, no default gw is needed for packets from the primary IP on the outside of the director or from the VIP (which will be an alias/secondary IP). For security reasons then none is installed. To allow masquerading of clients on the realservers, a default route will be needed for packets from the primary IP on the outside of the director (but not for packets from the VIP). If you want to test this out first, just put in a default route for the director using the route command. If you like it you can add the more restrictive routes with iproute2 later.

Masquerading clients on LVS-Tun realservers The director is on a different network (possibly in a different location), you don't have a two way ipip connection back to the director (although you can add one), and you don't have a route from the RIP to the DIP (although you can add this too). If you handle these problems, then you can use the director to NAT out connections from the realservers. However it would probably be simpler to NAT out through the local router.

Masquerading clients through the VIP on the director The recipes above for masquerading clients, have the packets coming out from the primary IP on the outside of the director. This will not usually be the VIP, which is a secondary IP (so that it can be moved easily on failover). Here we show how to masquerade out from the VIP.

Masquerading through a single VIP Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 19 Aug 2004

When masquerading clients on realservers out through the director, how do I make the src_addr=VIP?

"C. R. Oldham" cro (at) ncacasi (dot) org 25 Aug 2004 You can do this with policy-based routing in the 2.6 series of kernels. On my Debian realservers I have the routing rules in /etc/network/interfaces and a table "lvs" in iproute2/rt_tables. It took me a long time and lots of googling to figure this out but it works great. Francois JEANMOUGIN Just use SNAT! It is pretty simple. The VIP does not have to be up on the system; the rule just stays there unused. In case of a director switch, even if vrrp adds the VIP as a secondary (or alias) interface, the outgoing packets will have the VIP as the source address. Using iptables with the SNAT method lets you use vrrp for director failover without any other configuration or scripts. Tested and approved (my VIP is a secondary interface again on the directors). I think you can use several SNAT rules if you want to mix several natted virtual_servers, using a -s (IIRC) option (that part I didn't test). P.S.: Yes, I find the "--to" option confusing too. Joe - It took a long time for someone to realise how to make the packets come from the VIP, rather than the primary IP on the outside of the director. The same problem came up again, but I'd forgotten that it had been solved, so it was invented again. Kristoffer Egefelt

If I send a mail from a realserver to my gmail account, the outgoing packets have the primary IP of the director as src_addr. I would like the packets to come instead from the VIP.

Graeme Fowler graeme (at) graemef (dot) net 22 May 2006 You want a machine (the realserver) behind a masquerading server (the director) to appear to have a fixed IP address when making outbound connections to the internet. Simply have a SNAT rule on your director's external interface (assumed here to be eth0) such that packets going out from the realserver get mapped to the VIP. I've used this many times to do a many-to-one mapping for realservers so that when they initiate external connections, they appear to come from the same IP. Since this is outbound data from a high port on the VIP, and not from a port controlled by ipvsadm, the ip_vs code on the director will ignore the reply packets, and they will be reverse SNAT'ed and passed to the realserver. This works for outbound communication from the realservers; it's extremely unlikely that they'll use a well-known (and often privileged) service port as the source for a new TCP session to somewhere external. In context, an example mail server cluster will generally have one or more of ports 25, 465 and 587 bound to the VIP on the external side of the director. No well-written MTA will initiate a connection to an external host using those ports as source. The same goes for webservers, DB servers and a whole host of others. That means the LVS doesn't have to be considered, as the netfilter conntrack code will work perfectly well. There is, however, an exception - DNS servers can be configured to use UDP/53 as a source port for queries; in my experience explicitly turning this off means a tiny proportion of queries will fail. Leaving it turned on behind a director means that, well, anything could happen... so making use of a forwarder here is a good solution. Besides, in DNS operation having a query come from a reversible IP which maps to a forward name lookup is less important than it is for web or email connections. 
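A SNAT rule of the kind Graeme describes might be sketched like this (dry run, prints the rule only). The realserver network, the VIP value and the function name snat_rule are invented for the example; the rule has the same shape as the POSTROUTING rules in David M's posting on multiple VIPs.

```shell
#!/bin/sh
# Dry run: prints the director-side SNAT rule rather than installing it.
EXT_IF=eth0           # external interface of the director (as assumed in the post)
RS_NET=10.0.0.0/24    # hypothetical realserver network
VIP=192.0.2.10        # hypothetical VIP

snat_rule() {   # usage: snat_rule out_interface source_net_or_ip vip
    # connections initiated by the realservers leave the director with src_addr=VIP
    echo "iptables -t nat -A POSTROUTING -o $1 -s $2 -j SNAT --to-source $3"
}

snat_rule "$EXT_IF" "$RS_NET" "$VIP"
```

With several VIPs, one such rule per RIP (or per group of RIPs) selects the VIP that each client call leaves from.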
Brad Dameron brad (at) seatab (dot) com 19 May 2006

You can use iptables to push packets from certain realservers out certain IPs. Here is my /etc/init.d/ipvs_firewall startup script. This script also allows your realservers to connect to the outside world through the LVS server. This is a SuSE start script, so it will need to be modified a little to work with RedHat, etc.

Chris Newland chrisn (at) allipo (dot) com 11 Jul 2006

I use LVS-NAT and SNAT with an iptables rule. My realservers only have non-routable IP addresses (10.0.0.*). The realservers can all connect to servers on the internet and when they do, the IP source address is that of the director.

Masquerading through multiple VIPs

David M northridgeaustin (at) gmail (dot) com 14 Dec 2006

We have an LVS-NAT which works fine for other services (e.g. http). We also run sendmail under LVS. The MTA listens for connections on the RIP (and works fine), but when it initiates a connection (which it does from the RIP), this occurs independently of the LVS. Outgoing connections from RIPs get routed out the default gateway for LVS-NAT, where they're NAT'ed by iptables rules on the director.

We have three sendmail realservers, each with 30 private (172.16.0.0/24) RIPs, each RIP with an instance of sendmail (30 instances/realserver; 90 private RIPs total). On the director, there are 30 public VIPs which are balanced across the three realservers. On each realserver then, MTA connections can be initiated from 30 RIPs, and all are sent to the same default gateway (the DIP). The director needs to know through which VIP the connection needs to be NAT'ed out. The director then needs 90 rules (one for each RIP). We have three realservers (RS1, RS2, RS3), and we are associating RIPs with VIPs.
Here's the subset for VIP_01

    $RIP_RS1_VIP_01 --> $VIP_01
    $RIP_RS2_VIP_01 --> $VIP_01
    $RIP_RS3_VIP_01 --> $VIP_01

    #iptables rules
    $IPT -t nat -A POSTROUTING -s $RIP_RS1_VIP_01 -o $EXT_INTER -j SNAT --to-source $VIP_01
    $IPT -t nat -A POSTROUTING -s $RIP_RS2_VIP_01 -o $EXT_INTER -j SNAT --to-source $VIP_01
    $IPT -t nat -A POSTROUTING -s $RIP_RS3_VIP_01 -o $EXT_INTER -j SNAT --to-source $VIP_01

Rob ipvsuser (at) itsbeen (dot) sent (dot) com 15 Dec 2006

Well, the way I set things up is different (possibly better). My goal is to make it easy to config/manage/troubleshoot, secure, fast and low load on the director(s):

    I use OpenBSD and pf to separate public and private IP spaces.
    I use LVS-DR for all the lvs work (not sure if you can do this or if you need to use nat for some other reason).

By separating the NATing from the load balancing it seems to simplify the configuration of both and I feel it is easier to write pf rules than iptables (YMMV). In pf for each of the 30 email servers you need 2 rules:

    px.py.pz.1
    Incoming: rdr pass on $ext_if inet proto tcp from any to px.py.pz.1 port 25 -> 172.16.1.1 port 25

The above will send incoming connections to the correct VIP and keep the outgoing connections/replies coming from the correct public IP.

For the LVS config: no special routing is set up on the director or realservers; all machines have the OpenBSD firewall as their gateway. There is low load on the director since it is DR. Then to cheat on the arp issue, I hardcode the MAC address of the director into the arp table on the OpenBSD firewall for each of the VIPs (and run arpwatch and set the Linux machines' arp sysconfig params).

One of the cool things you can do with a setup like this is use the excellent table handling in pf.
I have about 85,000 IPs that I know are spammers and I don't want them using any resources on my MTA boxes, so I redirect all of them to OpenBSD's spamd, which tarpits them at extremely low cost:

    persist file "/etc/spammers.txt" {}
    rdr pass on $ext_if inet proto tcp from {} to any port 25 -> 127.0.0.1 port 8027

This means that the MTA boxes can service real mail more quickly since slots are not being used by spammers. I do similar things for bogons http://www.cymru.com/Bogons/ and ssh brute force attackers. I haven't found a reasonable way to work with any sizable tables in iptables.
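David M's 90 director rules (one per RIP) follow a regular pattern, so they could be generated by a loop rather than written by hand. The sketch below is not from the posting. The address scheme is purely hypothetical: VIP n is 192.0.2.n, and the RIP on realserver r paired with VIP n is 172.16.0.((r-1)*30+n). The function name gen_snat_rules is also made up. It prints the rules instead of running them.

```shell
# Dry-run sketch: print the 90 SNAT rules (3 realservers x 30 VIPs).
# Hypothetical address scheme, for illustration only:
#   VIP n                 -> 192.0.2.n
#   RIP on RS r for VIP n -> 172.16.0.((r-1)*30 + n)
EXT_INTER=${EXT_INTER:-eth0}        # director's external interface (assumed)

gen_snat_rules() {
    vip=1
    while [ "$vip" -le 30 ]; do
        rs=1
        while [ "$rs" -le 3 ]; do
            rip="172.16.0.$(( (rs - 1) * 30 + vip ))"
            echo "iptables -t nat -A POSTROUTING -s $rip -o $EXT_INTER -j SNAT --to-source 192.0.2.$vip"
            rs=$((rs + 1))
        done
        vip=$((vip + 1))
    done
}

gen_snat_rules
```

Printing the rules first makes it easy to diff the generated set against a hand-written one before loading it.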

3-Tier LVS

However some services need resources on other machines, e.g. DNS, databases. A squid realserver gets its content from machines on the internet and to do this, the squid demon will run a client process which makes a connection from RIP to 0/0:80. These client packets need to be routed and to do so the RIP must first be on a public IP (or at least routable locally). Sorting out the routing requirements for setting up a 3-Tier LVS was prompted by Jezz Palmer (Mar 2002), who found that his squid didn't work when set up by the configure script, but did when he put in a default route for the squid realserver. Jezz ran the tcpdumps, and ran and debugged the scripts for me.

Routes needed for 3-Tier LVS

Figuring out the iptables and iproute2 commands was helped by Horms, Ratz, Julian and Peter Mueller. Here is the standard LVS-DR test setup with 2-NIC director and only 1 realserver. The router for the realservers has the LVS client. The routes necessary for a normal LVS are in lower case (e.g. from 0/0 to VIP). Note (see discussion of routes for LVS-DR) that there is no route for packets from the VIP on the director (to anywhere) and no routes for packets from the SERVER_GW to RIP,VIP on the realserver. In UPPER CASE are the routes which need to be added to turn the LVS into a 3-Tier LVS (e.g. FROM 0/0:PORT to RIP) where "PORT" is the port for the client running on the RIP. Note that the gw for 0/0:PORT (here SERVER_GW) can be another router - it does not have to be the SERVER_GW. Note also that the dst_addr does not have to be 0/0 - a more restrictive dst_addr could be used if the IPs of the 3rd tier machines are known ahead of time (e.g. DNS servers, database servers).

In the original LVS-DR setup (1999, or configure scripts up to version 0.8.x) the routes for the realserver were

In LVSs set up by the configure script 0.9.x, packets from the VIP are sent to the default gw. Packets from the RIP to 0/0 are sent via the DIP (where they are filtered, i.e. DROP'ed or REJECT'ed).

In LVSs set up by the configure script v 0.10.x and later, selected packets from the RIP are sent to the 3_TIER_GW (which may be the same as the SERVER_GW).

Setting up routes using iptables and iproute2

The problem then becomes one of routing packets from RIP to 0/0:80 (if the realserver is a squid) while making sure that packets from RIP to any other ports on 0/0 are DROP'ed or REJECT'ed. For 2.2 kernels running ipchains there is no way of doing this, and all packets to 0/0 have to be routed. For 2.4 kernels, iptables allows marking (fwmark) by dport (or sport). After marking, packets can be routed by iproute2. The configure script (v 0.10.x or later) will set this up for you. (May 2002, it's being tested as we speak, coming Real Soon Now). Here's a standalone version of the code in the configure script that marks the packets.

Francois flafolie (at) aic (dot) fr Apr 26 2007

It seems I need the following rules to make my setup work. The iprules have to have the VIP, and not the RIP, as the ip source address. Here's the original problem I posted. I have installed and configured keepalived (v1.1.13).

      -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
    TCP  10.0.23.100:http wlc persistent 600
      -> 192.168.15.11:http     Masq    100    0          0
    TCP  10.0.22.171:ftp wlc persistent 600
      -> 192.168.15.10:ftp      Masq    100    0          0

I'm trying to manage different services on different VLANs on my loadbalancer. The problem is I can configure only one default route on my loadbalancer. For example, if my default route is 10.0.23.1, the request and reply for http (vlan 10.0.23.0) both go through the right vlan. But for ftp, the request will be on the right vlan (10.0.22.0) while the reply is on vlan 10.0.23.0 (my firewall authorizes that for tests) and not 10.0.22.0. I have tried to define some iprules on my loadbalancer to say that if the source ip address is 192.168.15.10, packets should be forwarded to the 10.0.22.0 network, but it doesn't seem to work. LVS apparently doesn't leave the routing decisions to the operating system after its own operations... Here are my iprules : I also tried that but no more effect :
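The configure script's marking code is not reproduced in this copy, and the following is not that code (nor Francois's iprules). It is only a minimal sketch of the mark-then-route idea described at the start of this section, for a 2.4 or later realserver: mark locally generated packets from the RIP to a chosen dport, then send marked packets to a separate routing table whose default route is the 3-Tier gateway. The function name three_tier_rules, the addresses, the mark value 1 and the table number 100 are all assumptions. The commands are printed, not executed.

```shell
# Dry-run sketch: print the realserver commands that route only RIP -> 0/0:PORT
# through a chosen gateway. All values passed in below are hypothetical.
three_tier_rules() {
    # $1 = RIP, $2 = dport, $3 = 3-Tier gateway, $4 = fwmark, $5 = routing table number
    # 1. mark locally generated packets from the RIP to the chosen port
    echo "iptables -t mangle -A OUTPUT -s $1 -p tcp --dport $2 -j MARK --set-mark $4"
    # 2. send marked packets to their own routing table
    echo "ip rule add fwmark $4 table $5"
    # 3. give that table a default route via the 3-Tier gateway
    echo "ip route add default via $3 table $5"
}

three_tier_rules 192.168.1.11 80 192.168.1.254 1 100
```

Packets from the RIP to other ports are unmarked, so they keep following the main routing table (and whatever filtering is applied there).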

from the mailing list TC Lewis has running on the realserver. He is using NAT through the director rather than routing the packets directly as is described here. An LVS-DR director normally does not have a default route and this would have to be added to NAT packets through the director. You may be able to NAT through the router instead.

LVS: LVS clients on Realservers This HOWTO is a little disorganised here. Read the section on too.

Do you really need LVS clients on the realserver in a 3-Tier setup? Thomas Champagne 10 Apr 2007

There are two services on each server: Apache and Mysql. Each service has its own IP and a VIP address. The problem: accessing the services from a remote client (outside the cluster) via the VIP is OK. But when the client is in the cluster, it always connects to the local machine.

People coming to this mailing list are always trying to balance the 3rd-tier (in your case, mysql). If this was easy to do, that would be one thing, but with the current design of LVS, it's next to impossible. The first connection (here to apache) is balanced, so the connection to the 3rd-tier (here to mysql) will be (at least reasonably) balanced. So you have the balanced apache on your realserver connect to the local mysql. To have a valid realserver, both apache and mysql have to be up.

Maybe people think then that, running two services, there's twice the chance of the realserver going down and for the same hardware their 99% uptime realserver is now a 98% uptime realserver. So they have to be prepared for apache_1 to connect to mysql_2. That would be true if the only failures on the machine were the demons dying and if they died independently. I don't run a production internet site, so I don't have any numbers on failures in those situations, but it's not often that demons for no reason at all just die or stop answering. Most failures seem to be disks and fans dying, memory chips going bad resulting in corrupt files being written, loss of network connectivity to the outside world (the backhoe problem) and, surprisingly, routers dying. Rarely does the demon die. In which case requiring two demons to have a functioning realserver may not change the downtime a whole lot. There are many other demons running on the realserver which are part of unix, and which are required for a running machine, so you actually need maybe 10-20 demons for a functioning realserver, in which case an extra one (mysql) isn't going to make a whole lot of difference.

But let's say a functional realserver will have twice the downtime because it requires two functioning demons. Well that's high availability life when you have a service that requires multiple demons. You have to fail out the realserver when either service goes down. That's all.
Last exchange I had on this subject, the person didn't have any technical reason why they needed to balance the 3rd-tier. They just wanted it. So I haven't been convinced that you must have a balanced 3rd tier.

Realserver as LVS client in LVS-NAT The LVS-mini-HOWTO states that the lvs client cannot be on the realservers, i.e. that you need an outside client. This restriction can be relaxed under some conditions.

Jacob Reif's solution

This came from a posting by Jacob Reif Jacob (dot) Rief (at) Tiscover (dot) com 25 Apr 2003. It is common to run multiple websites (Jacob has 100s) on the same IP, using name based http to differentiate the websites. Sometimes webdesigners use some kind of include-function to include content from one website into another, by means of server-side-includes using http-subrequests (see http://www.php.net/manual/en/function.require.php). The include requires a client process running on the webserver, to make a request to a different website on the same IP. If the website is running on an LVS, then the realservers need to be able to make a request to the VIP.

For LVS-DR and LVS-Tun this is no problem: the realserver has the VIP (and the services presented on that IP), so requests by http clients running on the realserver to the VIP will be answered locally.

For LVS-NAT, the services are all running on the RIP (remember, there is no IP with the VIP on realservers for LVS-NAT). Here's what happens when the client on the realserver requests a page at VIP:80

    realserver_1 makes a request to VIP:80, which goes to the director.
    The director demasquerades (rewrites) dst_addr from VIP to RIP_2.
    realserver_2 then services the request and fires off a reply packet with src_addr=RIP_2, dst_addr=RIP_1.
    This goes to realserver_1 directly (rather than being masqueraded through the director), but realserver_1 refuses the packet because it expected a reply from VIP and not from RIP_2.

Here are the current attempts at solutions to the problem, or you can go straight to Jacob's solution

    Using the /etc/hosts solution of Ted Pavlic for , doesn't work as there are 100s of domain-names registered (rather than just one) onto the same IP-address.
    Julian's solution removes the local routing (as done for one network LVS-NAT) and forces every packet to pass through the director.
The director therefore masquerades (rewrites) src_addr=RIP_2 to VIP and realserver_1 accepts the reply. This puts extra netload onto the director.

              +-------------+
              |             |
              |  director   |
              +-------------+
                |^       |^
             ans||    req||ans
                v|req    v|
    +-------------+   +-------------+
    |             |   |             |
    | Realserver  |   | Realserver  |
    |  = client   |   |  = server   |
    +-------------+   +-------------+

Jacob's solution: The solution proposed here does not put that extra load onto the director. However each realserver always contacts itself (which isn't a problem). Put the following entry into each realserver. Now the realservers can access the httpd on RIP as if it were on VIP.
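The entry Jacob refers to is missing from this copy. One way to get the behaviour he describes (a realserver reaching its own httpd on the RIP when it asks for the VIP) is a DNAT rule in the realserver's nat OUTPUT chain. The sketch below is an assumption about the kind of rule meant, not a quotation of his posting. The function name self_dnat_rule and the addresses are made up, and the rule is printed rather than run.

```shell
# Dry-run sketch: print a realserver rule that rewrites locally generated
# requests for VIP:port so they go to the realserver's own RIP:port.
# Addresses passed in below are hypothetical examples.
self_dnat_rule() {
    # $1 = VIP, $2 = this realserver's RIP, $3 = service port
    echo "iptables -t nat -A OUTPUT -p tcp -d $1 --dport $3 -j DNAT --to-destination $2:$3"
}

self_dnat_rule 192.0.2.10 10.0.0.11 80
```

Each realserver would use its own RIP as the second argument, so each one ends up talking to itself, as the text above describes.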

Carlos Lozano's solution

Carlos Lozano clozano (at) andago (dot) com 02 Jul 2004

We have a machine that must be both a client and director. The two problems to solve are

    ipvs doesn't handle loopback packets
    the return packets are handled by ip_vs_in, and not by ip_vs_out.

I have written an ip_vs_core.c.diff (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ip_vs_core.c.diff) patch for 2.4.26 using IPVS-NAT. It works correctly in my testcase. The schema is:

    IPVS:443 --> Local:443 ---> IPVS:80 ---> RealServer

The problem happens when Local:443 goes to localIPVS:80, because the packet is discarded by the next lines in ip_vs_core.c:

    ... skb->pkt_type != PACKET_HOST) || skb->dev == &loopback_dev) {
            IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
                      skb->pkt_type, iph->protocol, NIPQUAD(iph->daddr));
            return NF_ACCEPT;
    }

Ratz

Why do you need this? Seems like a replication of mod_proxy/mod_rewrite. Your patch obviously makes it work but I wonder if such a functionality is really needed.

We are using it like an ssl accelerator. The first ipvs (443) sends the request to localhost:443 or to a different director, and the second ipvs (80) distributes the traffic to the realservers.

    IPVS:443 --> Local:443 --> IPVS:80 --> RealServer1
             |-> Director2:443         |-> RealServer2

In the first case, it is a scheme "external machine client+director", but in the second case it is a "client+director in the same machine". This part of the patch only solves the output packet; the return is handled by the second part of the patch (which is really a bad hack).

For a mini-HOWTO on using this patch see . Matt Venn has tested it: it works using the local IP of the director, but not 127.0.0.1.

Graeme Fowler's proposals, Rob Wilson's help and Judd Bourgeois' modification

Graeme came up with the original idea, Rob Wilson proposed a solution that didn't quite work, Graeme fixed it and then Judd saw an easier solution for the case of only one VIP. I've somewhat mashed the history in my write-up (sorry).

Graeme Fowler is looking for a solution for realservers that can't use iptables

Graeme Fowler graeme (at) graemef (dot) net 11 Feb 2005

After a long day spent tracing packets through the LVS and netfilter trail whilst trying to do cleverness with policy routing using the iproute2 package, I can condense quite a lot of reading (and trial, mainly followed by error!) down as follows:

    1. FastNAT, as provided by the iproute2 package, is incompatible with the netfilter conntrack module. As most LVS-NAT systems are also doing masquerading or SNAT for outbound connections from the realservers, the conntrack module is loaded automagically - thus FastNAT via policy routing simply won't work.

    2. Try as you might to do SNAT, it has to be done in the 'nat POSTROUTING' chain - and the packets being processed via LVS don't traverse this chain, because they're hooked right out of the nat POSTROUTING table and are processed by ip_vs_port_routing instead, which then plonks them back on the wire magically without further processing. So SNAT won't work either.

    3. Using fwmarks seems inconclusive, because ultimately (in my case at least) I want to SNAT the packets in some way, and point (2) above precludes that.

    4. "Internal" VIPs. This one just came to me so please feel free to try it, I'm away from my development lab and it might prove to be a complete lemon anyway! Here's the idea: on the director, for every "external" VIP configuration which faces the clients (say VIP1) another VIP - iVIP1 - is also configured with identical realservers but attached to the _internal_ interface. The principal difference is that this VIP uses LVS-DR, because - for obvious reasons - the realservers can respond directly to each other. The only complicated bit is setting up a netfilter rule to do DNAT as the packets arrive - trap all packets destined for VIP1 and DNAT them to iVIP1. Ensure VIP1 is a loopback alias on your realservers as per normal DR configuration, and in theory at least the realservers should then be able to talk to each other as clients of a VIP.

Conclusions: mixing policy routing and LVS sounds like a great idea, and probably is if you're using LVS-DR or LVS-TUN. Just with LVS-NAT, it's a no-go (for me at the moment, anyway).

Graeme Fowler graeme (at) graemef (dot) net 2005/03/11 Solved... was Re: LVS-NAT: realserver as client (new thread, same subject!)

I've solved it - in as far as a proof of concept goes in testing. It's yet to be used under load though; however I can't see any specific problems ahead once I move it into production. The solution of type "4" above involves a "classic" LVS-NAT cluster as follows. Nomenclature after DIP/RIP/VIP classification is "e" for external (ie. public address space), "i" for internal (ie. RFC1918 address space) and numbers to delimit machines.

In normal (or "classic" as referred to above) LVS-NAT, the director has a virtual server configured on VIP1e to NAT requests into RIP1 and RIP2. Under these circumstances, as discussed at great length in several threads in Jan/Feb (and many times before), a request from a realserver to a VIP will not work, because:

    RIP1 SYN -> VIP1e
    RIP1 SYN -> RIP2 (or RIP1, doesn't matter)
    RIP2 ACK -> RIP1

at this point the connection never completes because the ACK comes from an unexpected source (RIP2 rather than VIP1e), so RIP1 drops the packet and continues sending SYN packets until the application times out. We need a way to "catch" this part of the connection and make sure that the packets don't get dropped.
As it turns out, the hypothesis I put forward a month ago works well (rather to my surprise!), and involves both netfilter (iptables) to mangle the "client" packets with an fwmark, and the use of LVS-DR to process them. What I now have (simplified somewhat, this assumes a single service is being load balanced in a very small cluster): on the director, an iptables rule marks the packets from the realservers with $MARKVALUE, and we need a corresponding entry in the LVS tables for this. I'm using keepalived to manage it; yours may be different, but in a nutshell you need a virtual server on $MARKVALUE rather than an IP, using LVS-DR, pointing back to RIP1 and RIP2. Instead of me spamming configs, here's the ipvsadm -Ln output:

      -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM  92 wlc
      -> $RIP1:$PORT            Route   100    0          0
      -> $RIP2:$PORT            Route   100    0          0
    (empty connection table right now)

...and believe it or not, that's it. Obviously the more VIPs you have, the more complex it gets but it's all about repeating the appropriate config with different RIP/VIP/mark values. For ease of use I make the hexadecimal mark value match the last octet of the IP address on the VIP; it makes for easier reading when tracking stats and so on.

I've not addressed any random ARP problems yet because they haven't yet occurred in testing; and one major bonus point is that if a connection is attempted from (ooh, let's say, without giving too much away) a server-side include on a virtual host on a realserver to another virtualhost on the same VIP, then it'll get handled locally as long as Apache (in my case) is configured appropriately.

An interesting, and useful, side-effect of this scheme is that when a realserver wants to connect to a VIP which it is handling, it'll connect to itself - which greatly reduces the amount of traffic traversing the RS -> Director -> RS network and means that the amount of actual load-balancing is reduced too.

Rob Wilson rewilson () gmail ! com 2005-08-09

We have an LVS server for testing which is handling 2 VIPs through LVS-NAT (using keepalived). Each of the VIPs currently points to 1 realserver - it's a one realserver LVS - just in testing phase at the moment. Both realservers are on the same internal network.

    VIP1 -> Realserver1
    VIP2 -> Realserver2

We'd now like Realserver2 to be able to connect to Realserver1 via VIP1. I was able to accomplish this following the solution provided by Graeme Fowler: http://www.in-addr.de/pipermail/lvs-users/2005-March/013517.html However, external connections to VIP1 no longer work while that solution is in place. Dropping the lo:0 interface assigned to VIP1 on Realserver1 fixes this, but then breaks Realserver2 from connecting.

Graeme Fowler graeme () graemef ! net 2005-08-10

Are you doing your testing from clients on the same LAN as the VIP, by any chance? Have you set the netmask on the lo:0 VIP address on the realservers to 255.255.255.255? I can see that making it a /24 mask - 255.255.255.0 - might result in the realservers thinking that the client is actually local to them, thus dropping the packets.

Rob Wilson rewilson () gmail ! com 2005-08-10

That's exactly it. I was hoping it was something daft I misconfigured, so.. wish granted :) It works perfectly now. Thanks for your help (and coming up with the idea in the first place!).

Judd Bourgeois simishag (at) gmail (dot) com 19 Jan 2006

I am running LVS-NAT, where the director has two NICs (and two networks). The VIP is on the inside of the director (in the RIP network) (Joe - this functions as a two network LVS-NAT). Some of my web sites proxy to "themselves" within a page (proxy, PRPC, includes, etc.). The symptom is that the proxy functionality breaks. The realserver does a DNS lookup for the remote site, gets back the VIP, and hangs waiting for a response.
Previously I solved this problem by putting the site names and 127.0.0.1 in /etc/hosts (as mentioned in this section and in ), but after reading the FAQ more carefully tonight, I solved it by simply adding the VIP as a dummy interface on all of the realservers. This appears to be addressed in Graeme's solution, but he runs an extra iptables command on the director. Is this really necessary? Won't any packets originating on the real servers and destined for the VIP be handled by the dummy interface on the real server, without being put on the wire? It all appears to work fine and has the added nice effect of forcing each realserver to proxy to itself when necessary. Graeme Fowler graeme (at) graemef (dot) net 1/20/06 What you've suggested is the "single VIP" case of the above idea. It worked for me, it seems to have worked for Rob Wilson, so casting aside the fact that you might have multiple VIPs frontending multiple realserver clusters (as is my case) I can't see any reason why you shouldn't just go for it. Judd Bourgeois simishag (at) gmail (dot) com 20 Jan 2006 Right. In fact, after reading your solution again, I think your solution is the more useful general case, where there may be an arbitrary number of VIPs, RIPs, and groupings of real servers (which I don't need right now, but I've realized I will need it down the road). I have some Alteons that call these real server groups, not sure what the LVS equivalent is, but here's a short illustration. Assume 1 director, 3 VIPs, 4 RIPs on 4 real servers. Assume we have real server groups (RG) RG1 (RIP1-2), RG2 (RIP3-4), RG3 (RIP1-4). VIP1 goes to RG1, VIP2 goes to RG2, VIP3 goes to RG3. In my solution, servers in RG1 can simply put VIP1 and VIP3 on dummy interfaces, but for proxy requests they will only be able to talk to themselves. They will not be able to talk to VIP2. All servers should be able to talk to VIP3. Your solution solves this by using fwmark. 
This is a fairly common problem with NAT in general that I have to deal with a lot. Basically, the NAT box will not apply NAT rules for traffic originating and terminating on the NAT box. I recall that one workaround for this is to use the OUTPUT chain; I can't find the rules at present but it seemed to work ok.

Ratz 21 Jan 2006

There is no LVS equivalent of "real server groups". But I think Alteon (Nortel) only has this feature for administrative reasons, so you can assign a group by its identifier to a VIP. What I would love to see with LVS is the VSR approach and a proper and working implementation of VRRP or CARP. I've just recently set up a 2208 switch using one VSR and 2 VIRs, doing failover when either the link or the DGW is not reachable anymore. The sexy thing about this setup is that you don't need to fiddle around with arp problems and you don't need to have NAT, so balancing schedulers can get meaningful L7 information.

Alteon's groups are just an administrative layer with an identifier. We could add such a layer in ipvsadm and the IPVS code, however what benefit do you see in such an approach? One problem I see with the Alteon approach is that if you add a RS to a group, to my knowledge it can only pertain to one RG. This is a bit suboptimal if you want to use RS as spillover servers on top of their normal functionality. Regarding your example, I'd like to say that RG1 is a spillover group for RG3. You can specify (IIRC) a spare server of each RG in AltheOS, however not cross-RG wise. Correct me if I'm wrong, please.

Judd

Graeme's solution solves this by using fwmark.

Yes, fwmark solves almost all problems Graeme Fowler graeme (at) graemef (dot) net 21 Jan 2006 Judd doesn't need fwmark, because in a single VIP LVS-NAT, with that VIP assigned locally on the realservers on a dummy interface (or loopback alias), the realservers will always answer requests for the VIP locally. In a two-VIP case (the simplest multiple), if you have two "groups" [0] of realservers, then the director becomes involved by virtue of it being the default gateway for the realservers. At the point the director gets involved you need some way of determining which interface your traffic is on, and segregation via fwmark seems the most elegant way to achieve this (given the known and predictable failure of realservers as clients in LVS-NAT). I know I struggled for months before realising that I could, in effect, combine the use of NAT via an external interface for my real clients, and DR via an internal interface for my "realservers as clients". [0] I use the word groups in quotes and advisedly, since it appears that Alteon use that in their setup terminology from previous posts.
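Graeme's director-side rule did not survive in this copy, so the following is only a sketch assembled from his description in this section: mark packets arriving from the realservers for the VIP, serve that mark with an LVS-DR virtual service, and put the VIP on each realserver's loopback with a 255.255.255.255 netmask. The function names, interface name and addresses are assumptions. The mark value 92 and the wlc scheduler are taken from his ipvsadm output. Commands are printed, not executed.

```shell
# Dry-run sketch of "NAT for real clients, DR via fwmark for realservers as clients".
# All names and addresses passed in below are hypothetical.
director_rules() {
    int_if=$1; rip_net=$2; vip=$3; port=$4; mark=$5
    shift 5
    # mark packets arriving on the internal interface from the realservers for VIP:port
    echo "iptables -t mangle -A PREROUTING -i $int_if -p tcp -s $rip_net -d $vip --dport $port -j MARK --set-mark $mark"
    # fwmark-based virtual service, forwarded by direct routing (-g) to each RIP
    echo "ipvsadm -A -f $mark -s wlc"
    for rip in "$@"; do
        echo "ipvsadm -a -f $mark -r $rip -g -w 100"
    done
}

realserver_rules() {
    # VIP on the loopback with a /32 mask, so clients on the LAN are not treated as local
    echo "ip addr add $1/32 dev lo label lo:0"
}

director_rules eth1 10.0.0.0/24 192.0.2.92 80 92 10.0.0.11 10.0.0.12
realserver_rules 192.0.2.92
```

For the single-VIP case Judd describes, only the realserver_rules part would be needed, since requests to the VIP are then answered locally.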

Realserver as LVS client in LVS-DR The topic came up with a posting about an LVS of httpd which generated mail (presumably a webmail LVS). The poster reasonably wanted the e-mail to be balanced by the same director. The problem is that the mail is being sent to the VIP (on the director) from a machine (the realserver) which also has the VIP. The mail will be accepted locally on the realserver, rather than being sent to the director to be load balanced. If you don't attempt to load balance the mail requests, then if there are enough requests, then statistically (over a long enough period) the http traffic will be balanced and the mail coming from each realserver will be approximately balanced. This posting started an off-line discussion with Horms and Ludo about ways to have clients on the director and on the realservers. The outcome was an idea by Ludo, which no-one has got to work yet (Horms tried something similar a while ago and couldn't get it to work either) and a proposal by Julian, which seems likely to work. Dan kasper37 (at) speakeasy (dot) net 1 Oct 2005

Is there a way to connect from one of the realservers hosting web to the VIP:smtp service? The problem is that telnet to VIP:smtp from one of the web realservers is going to try to connect to smtp locally. I'm actually talking about any virtual service in general. Here's what we've got so far (brace yourself): These commands are run on the realserver (for the sake of brevity I only included the commands for one realserver, but imagine these being run on all realservers with the correct RIPs substituted for y.y.y.16). With these rules, packets are being output to the network as hoped, but the problem is that the /source/ address is x.x.x.70 instead of the realserver's RIP. If there was a way to force the kernel to send the request from the realserver's RIP, this may actually work. Any ideas?

Ludo Stellingwerff ludo (at) protactive (dot) nl Oct 2 2005

Dan, try it like this (without your routing table hacks):

    --dport 25 - -j ROUTE --oif eth0

I assume you have fixed the ARP problems. Therefore the above ROUTE target should work. If it doesn't I'll have to think of a solution using the "ip rule" command in combination with firewall marking.

Joe (off-list)

Can either of you think of a situation where it would be useful to have the director also be a client?

Ludo Besides testing purposes, I can only think of two reasons: Flexibility - for all those situations we can't guess because of lack of imagination In my line of work: When you combine the LVS-director with a proxy server. Most companies (including my own) seem to want to integrate all difficult routing problems on the company's gateway/router/firewall. (My job is making this possible in a user-friendly manner, interfaces:) One of the things many firewalls do is providing proxy-services to the internal users. If you integrate LVS-director services on this firewall, proxyusers should be able to access these services too. Thus the director functions as a client. Joe

The case of allowing the realserver to be a client seems more useful. The posting today on lvs-users was of a 3-Tier site where the LVS'ed httpd sends mail. The poster wants to LVS the smtp too, but the realservers connect to the local VIP:smtp.

Right, this seems even more common.

Is there any routing that's done down in the depths of LOCAL_IN? What happens to a packet destined for a local IP? Does it appear in the routing diagram, or does it just never get out? Can you fwmark a packet to dst=LOCAL_IP:smtp and get it out somehow?

Ludo

For email this question completely depends on the way the client-software presents its email. There are generally two possibilities:

    Using a smtp-client
    Using a local postdrop (presenting mail to the local MTA without using network traffic)

Most local services will use a local postdrop; these can't be pulled out of the local machine easily. But client software using the SMTP protocol will normally connect to localhost:25. To answer your question: traffic to the localhost is always done via the loopback network device (dev lo). This can be fwmarked and rerouted without too much of a problem.

    --dport 25 - -j ROUTE --oif eth0

local_ip is the VIP not the RIP. I'm using it as the -d (destination) of the connection. The part behind the -j might need some more thinking/testing, but this should work. If ROUTE doesn't work, I can think of some more complex solutions to get the packet to the director, through fwmarking and using Policy Routing.

Let's follow a packet:

    The SMTP-client (on the realserver) wants to send an email.
    The SMTP-client sends a SYN-packet to VIP:25. The src_addr of this smtp-client should be the RIP of the realserver that handled the original http request. Maybe you'll need to enforce the fact that the src_addr of this realserver's packet is the RIP:

        -p tcp --dport 25 -j SNAT --to-source

    This packet gets sent through the loopback device (because the VIP is local to the realserver when using LVS/DR).
    The PREROUTING nat rule above matches, stealing the packet from the normal routing.
    The packet is sent directly through the output function of dev eth0. I'm not sure if this and the next step work correctly. If not, I'll have to go to the Policy Routing solution.
    This output driver asks for the MAC address of the VIP through ARP requests.
    The realserver doesn't answer because the ARP problem is solved.
    The director does answer, so the packet is sent to the director, which balances the packet.
The packets are then RIP -> VIP:smtp. The packet goes to the VIP on the director and then you'll get a reply packet from the MTA: VIP:smtp -> RIP. Normally you don't want the RIP to be routable on the internet, but in this example the VIP host(s) do know where this RIP is, because it is in the same local network.

Let's make this a complete example. To make this setup work, the realservers use the loopback trick to prevent arp problems: the VIP is on their loopback devices and the RIP is on device eth0 of the realservers. Just a normal LVS/DR setup will do. Now you want Realserver1 to connect to the balanced service smtp (port 25) on the director. Using the following two iptables rules should do the trick:

Or you might want to give a more generic solution to the realserver's connection to the director:

And for Realserver2:

Any client on the realservers that wants to connect to the VIP's services can work now. To test, you can try "telnet 1.2.3.4 25" from one of the realservers. The MTA should react with something like:

Joe: Let's say there's no way to do it with iptables. Is it possible to write a piece of code that does what we want outside of iproute2/iptables?

Ratz: Basically, what you want is to trick an RS FIB into handling a marked packet with scope local as if its realm were scope global, and then route it out on the interface, only to wait for it to come back with src mac == mac of RIP, dest mac == mac of RIP, src IP = RIP, dest IP = VIP. My 5 minutes of thinking on the problem suggest that it's unsolvable with conventional methods without causing major breakage in the FIB of the routing cache.

Julian Anastasov ja (at) ssi (dot) bg 3 Nov 2005

One can try the "loop" flag (send-to-self) feature at routing level: http://www.ssi.bg/~ja/#loop. There is a text file that explains its usage. This patch changes the way packets to local IPs are routed. The trick is that it is done only for outgoing routes. If applied to the RS, it can establish a connection from RIP to VIP on the assumption that the packets are looped via a crossover cable or hub. In our case they will pass through the director and come back to the RS with daddr=VIP. At least, this is the theory, only for the DR method. Not tested.

The incoming connection is served as usual; the loop patch allows it to come with saddr=RIP. One can try it after making sure the director will not drop the packet due to rp_filter checks if a packet from the RIP comes in on the wrong interface. If the director has a single interface then there is no problem.

The RS can look like this (VIP and RIP on different eth devices):

    eth0: VIP (for DR the ARP problem should be solved with solutions that work for eth devices), for traffic from director to RS (RIP->VIP)
    eth1: RIP, for outgoing traffic (RIP->VIP)

Such RS boxes will have loop=1 on eth0 and eth1 and should be protected by a firewall, because there is a risk that they accept unwanted UDP packets from RIP to VIP from the world. This should be easy if reverse path checks are done only in the border firewall and not in the director and RSs.

I'm not pushing it for inclusion as it is one big hack, but Anton Blanchard is very active in this, and DaveM once said he will review it, so he knows about it.
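Julian's precondition (the director must not drop the looped packet because of rp_filter) can be checked before trying the patch. The following is a minimal sketch, not part of Julian's patch: the function name and the optional directory argument are assumptions, added so the function can be tried against a scratch directory instead of the live /proc tree.

```shell
#!/bin/sh
# rp_filter_enabled [CONF_ROOT]
# Print "<interface> <value>" for every interface whose rp_filter setting
# is non-zero. CONF_ROOT defaults to the kernel's per-interface IPv4
# settings directory; the override is a hypothetical convenience for testing.
rp_filter_enabled() {
    conf_root="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_root"/*/rp_filter; do
        [ -r "$f" ] || continue          # skip if the glob matched nothing
        val=$(cat "$f")
        if [ "$val" != "0" ]; then
            echo "$(basename "$(dirname "$f")") $val"
        fi
    done
}
```

Running `rp_filter_enabled` with no argument on the director lists the interfaces (including the "all" and "default" entries) that still have reverse path filtering turned on.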

Ratz: What about medium_id issues? Does it work with bonding interfaces?

Maybe it should work. medium_id does not play a role here: loop works just after fib_lookup and accepts packets from a local source IP. For other traffic, loop=1 does not modify behaviour.

LVS: Non Linux Realservers

This list of realservers is from Ratz (Joe: from about 2003). About the only thing he hasn't tried yet is Plan 9. Remember to set netmask=/32 for the VIP on LVS-DR and LVS-Tun (for LVS-NAT you can set up with any netmask you like). If you are running non-Linux unix realservers, you can usually handle the arp problem by configuring the device carrying the VIP with the -arp switch.

    Solaris: 2.5.1, 2.6, 2.7
    Linux (of course): 2.0.36, 2.2.9, 2.2.10, 2.2.12
    FreeBSD: 3.1, 3.2, 3.3
    NT (although Webserver would crash): 4.0 no SP
    IRIX: 6.5 (Indigo2)
    HPUX: 11

HPUX arps even if you tell it not to. You'll need to handle the arp problem some other way.

Ratz's code to set up non-Linux realservers is now in the configure script (http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html#configure_script). This part of the script has not been well tested (if you find that it doesn't set up your non-Linux unix box properly, please contact me - Joe). (In the 3yrs the configure script has been out, I've not heard of anyone using this part of the code, so there seems no point in maintaining it.)

Here's the original info from Ratz for realservers with non-Linux OS's. On some Unixes you have to plumb the interface before assigning an IP. The plumb instruction is not included here.

                 : ifconfig lo0 alias <VIP> netmask 0xffffffff -arp up
    #ifconfig -a : lo0: flags=80c9mtu 16837
    #              inet 127.0.0.1 netmask 0xff000000
    #              inet <VIP> netmask 0xffffffff

    #uname       : IRIX
    #uname -r    : 6.5
    #            : ifconfig lo0 alias <VIP> netmask 0xffffffff -arp up
    #ifconfig -a : lo0: flags=18c9
    #              inet 127.0.0.1 netmask 0xff000000
    #              inet <VIP> netmask 0xffffffff

    #uname       : SunOS
    #uname -r    : 5.7
    #            : ifconfig lo0:1 <VIP> netmask 255.255.255.255 up
    #ifconfig -a : lo0: flags=849mtu 8232
    #              inet 127.0.0.1 netmask ff000000
    #              lo0:1 flags=849mtu 8232
    #              inet <VIP> netmask ffffffff

    #uname       : HP-UX
    #uname -r    : B.11.00
    #            : ifconfig lan1:1 10.10.10.10 netmask 0xffffff00 -arp up
    #ifconfig -a : lan0: flags=842
    #              inet netmask ffffff00
    #              lan0:1: flags=8c2
    #              inet netmask ffffff00

Ratz 16 Apr 2001

In most cases (when using the NOARP option) you need alias support. Some Unices have no support for aliased interfaces, or only limited support, for example QNX, Aegis or Amoeba. Others have interface flag inheritance problems, like HP-UX, where it is impossible to give an aliased interface a different flag vector from the underlying physical interface (as happens with Linux 2.2 and 2.4 - Joe). So for HP-UX you need a special setup, because the standard setup shown for DR will NOT work. I've used most Unices as realservers and was negatively astonished by all the implementation variations among the different Unix flavours. This may have resulted from unclear statements in the RFCs.

Gregory Boehnlein

I'm going to be working with a bunch of Solaris 9 boxes in the near future, and I would like to add them to my existing LVS cluster. Does anyone have information on what/how Solaris 9 can be used as the Real Servers in an LVS-DR cluster? On linux, I implement the hidden-arp patch. How is this accomplished on Solaris boxen?

Roberto Nibali ratz (at) drugphish (dot) ch 11 Aug 2003

Solaris doesn't have this issue ;).

Chris Kennedy ckennedy (at) iland (dot) net

The thing I have found out is that on Solaris 2.6, and probably other versions of Solaris, you have to do some magic to get the loopback alias set up. You must run the following commands one at a time:

    ifconfig lo0:1 <VIP>
    ifconfig lo0:1 netmask 255.255.255.255
    ifconfig lo0:1 up

This works well and is actually a pointopoint link like ppp, which must be the way Solaris defines aliases to the lo interface. It will not let you do this all at once, just one step at a time, or you have to start over from scratch on the interface.

Ramon Kagan rkagan (at) YorkU (dot) ca 05 Jun 2002

Just in case anybody is interested. You can do the following on lo0:1 or, for paranoid people like me, hme1.

    plumb ifconfig ifconfig ifconfig netmask 255.255.255.255 ifconfig up

This is from the FAQ but I'm adding that this doesn't have to be on lo0.

For LVS-NAT, anything with a tcpip stack can be a realserver, even a printer. For LVS-DR, you either must hide the realserver from arp requests, or have it not answer arp requests. HPUX and Linux (kernel > 2.0) are the only Unices that don't honour the -noarp flag. For LVS-Tun, you need a realserver that decapsulates IPIP packets. Windows used to do this but doesn't anymore.

Loopback interface on Windows/Microsoft/NT/W2K Windows is not handled by the configure script (http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html#configure_script). According to Horms you don't need anything special to handle the ARP problem; presumably Windows honours the noarp flag. Instructions for setting up windows realservers is one of the more common questions on the mailing list. This must be a difficult part of the HOWTO to find ;-\. There seem to be many ways of doing it. Here are some of the answers. Setting the metric to 254 seems to be critical. Wensong's original recipe for setting up the lo device on a NT realserver.

If you don't have the MS Loopback Adapter Driver installed on your NT boxes, enter the Network Control Panel, click the Adapter section, click to add a new adapter, and select the MS Loopback Adapter. Your NT cdrom is needed here. Then add your VIP (Virtual IP) address on the MS Loopback Adapter; do not enter a gateway address on the Loopback Adapter. Since the netmask 255.255.255.255 is considered invalid in M$ NT, just accept the default netmask, then enter the MS-DOS prompt and remove the extra routing entry. This makes packets destined for this network go through the other network interface, not the MS Loopback interface. As I remember, setting its netmask to 255.0.0.0 also works.

Jerome Richard jrichard (at) virtual-net (dot) fr

On Windows NT Server, you just have to install a network adapter called "MS Loopback" (Provided on the Windows NT CDROM in new network section) and then you setup the VIP on this interface.

o1004g o1004g (at) nbuster (dot) com

    1. Click Start, point to Settings, click Control Panel, and then double-click Add/Remove Hardware.
    2. Click Add/Troubleshoot a device, and then click Next.
    3. Click Add a new device, and then click Next.
    4. Click No, I want to select the hardware from a list, and then click Next.
    5. Click Network adapters, and then click Next.
    6. In the Manufacturers box, click Microsoft.
    7. In the Network Adapter box, click Microsoft Loopback Adapter, and then click Next.
    8. Click Finish.

robert.gehr (at) web2cad (dot) de; 24 Oct 2001

The MS-Loopback adapter is a virtual device under Windows that does not answer any arp requests. It should be on a Server Edition CD of WinNT/2000. Install it and assign it the appropriate IP address. Because MS will not let you assign an "x.x.x.x/32" netmask to the MS-Loopback adapter, you will end up having two routes pointing into the same net. Let's say your RIP is 10.10.10.10 and your MS-Loopback VIP is 10.10.10.11. You will have two routes in your routing table, both pointing to the 10.0.0.0 net. You delete the route that is bound to the VIP on the MS-Loopback adapter with a command like

Johan Ronkainen jr (at) mpoli (dot) fi

True with Windows NT4. However, with Win2000 you can just configure a high metric value for the loopback interface. I tried this about a year ago with metric value 254 without problems. I think you could change the netmask to /32 with regedit. I haven't tried that with WinNT4/2000 though. I've used that trick with Win98SE and it worked.

Brent Knotts brent (dot) knotts (at) mylocalgov (dot) com 28 Jan 2003

Concerning Windows 2000 realservers: It is possible to change the subnet mask to 255.255.255.255 in the registry, and it works fine for my DR setup. This is easier than deleting the extra route to the local interface after each boot. In Windows 2000, the interfaces are found in: HKEY_LOCAL_MACHINE\System\CurrentControlSet\Services\Tcpip\Parameters\Interfaces Find the interface with the appropriate IP address and change the subnet mask. Rebooting is not necessary, but you should bring the interface down and up.

Foreign Languages for MS Sebastien Sebastien.Bonnet (at) experian (dot) fg 12 Apr 2002

In French, the "MS Loopback Adapter" is called "carte de bouclage".

Malcolm Turnbull

I've set up a simple LVS DR loadbalancer with 4 IIS web servers (win2K) behind it. I've setup a local loopback adapter on each 2K server and set the priority to 254. The loadbalancer works fine... But seems to confuse Windows Networking, SMB Network Browsing no longer seems to work. (Required by RoboCopy).

Martijn Klingens mklingens (at) ism (dot) nl 10 Sep 2002. Subject: Re: LVS DR setup with NT2K servers

We're using an almost similar setup, just that we're currently using a simple xcopy and want to migrate to DFS. That doesn't change the basics though. Things that I encountered: Windows always adds a route to the subnet on the loopback adapter. If that subnet is also available on the normal LAN your routing table will get confused. In our case we migrated sites with existing IPs to LVS so we had to pick an entire class C subnet as mask. Solution: manually delete the route to the /24 on the loopback interface. Disabling the file and print services on the loopback interface usually also helps. Check the default gateways and the DNS on the various interfaces. If they are not identical you may find very strange behaviour (although there are cases where it's useful to have them differ, this is not that common). Hope this helps. If not, please specify a bit more about your setup. If you have problems on the realservers, are they able to resolve another realserver using dns and/or ping another realserver? Oh, and you mentioned network browsing, which is not entirely the same as opening a share on a computer with a well-known name. Do you really need the browser? We usually have the computer browser service stopped anyway so an intruder isn't instantly able to see all other w2k hosts on the net. A bit moot, since DNS provides the same data, but it doesn't really hurt either.

lstep (at) banquise (dot) org 11 Aug 2004

I'm trying to understand why you set the metric of the localhost interface to 254 (according to the doc found at LVS with HA for Win2k Terminal Servers, http://wiki.linuxquestions.org/wiki/LVS_with_HA_for_Win2k_Terminal_Servers) when configuring a W2K realserver's loopback in Direct Routing mode. If I don't do this, what problem(s) will I get?

Malcolm Turnbull malcolm (at) loadbalancer (dot) org

NT can be a pain in the backside when its routing table gets confused :-) In my experience setting the metric to 254 solves 95% of issues.

Ratz

I'm sorry but this doesn't really convince me. Where can I read about this? What does NT really do when you set such a high metric?

Using the registry to set the mask to 255.255.255.255 does the other 5%. It seems to happen more often in NT style domains or on servers with multiple NICs. Any NT realserver with a problem is easy to spot, as it won't work in DR full stop, and the routing table will be incorrect.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/15

We have a small .net based tool to set up the loopback adapter for DR mode on Windows 2000/2003 web servers. The idea was to install the loopback, set up as many VIPs as required and then set the netmask to 255.255.255.255. But from testing we've found that Windows will sometimes stop responding to the load balancer's RIP (for health checks), whereas if you use a netmask of 255.0.0.0, Windows ignores this route because it looks for the smallest subnet in the routing table first, i.e. your 255.255.255.0 (or whatever you're using for your RIP). The small utility requires the .net framework 1.1. When you start it, click on the red warning 'no adapter found' to install the loopback. Then add as many VIPs as you want and click save. Slightly pointless for anyone who knows what they are doing, but someone may find it useful. You can download binary and source (http://www.loadbalancer.org/download/nt/). NB. The one marked BETA will set the netmask to 255.255.255.255 (if that's what you want).

Graeme Fowler graeme (at) graemef (dot) net 06/11/2005

Adding a WIN NT machine to an LVS setup: if you go to "add new hardware" and "select from list", you should find it listed under Network Adapters - Microsoft. Install the "Microsoft Loopback Adapter" network device, and assign it the relevant address(es). IIRC it doesn't ARP at all, and I've used it in production for a number of SLB setups - not behind LVS though; these were Cisco IOS or ExtremeWare setups, using either NAT or DR.

Stuart Walmsley wrote: 2007-05-16

The fact that all the real servers are Windows based and will not take a 255.255.255.255 mask on the loopback stops this from working.
Graeme Fowler graeme (at) graemef (dot) net 16 May 2007

No it doesn't - set the realserver loopback interface address as per normal, then edit the registry to change the netmask from 255.255.255.0 to 255.255.255.255. Unfortunately I forget exactly where the key is buried, but I suspect it will be somewhere within:

An alternative is to set the netmask to 255.0.0.0, since a more specific route for the local net will already exist and will be favoured over it; this can however cause unexpected problems for inter-real-server communication.

Graeme Fowler graeme (at) graemef (dot) net 16 May 2007

When connecting from Windows LVS-NAT realservers to the VIP on the director using http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html#clients_on_LVS-NAT_realserver_contacting_services_on_VIP the realservers need a /32 netmask. The same registry edit applies (or the 255.0.0.0 netmask alternative, with the same caveat about inter-real-server communication).
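Both Malcolm's and Graeme's remarks about the 255.0.0.0 netmask rest on the most-specific-route rule: when two routes cover a destination, the one with the longer prefix wins. The sketch below illustrates that rule in general; it is not a model of the Windows routing code, and the route names, the `pick_route` helper and the example addresses outside 10.10.10.0/24 are assumptions. It relies on the shell doing 64-bit arithmetic, as dash and bash do.

```shell
#!/bin/sh
# ip2int A.B.C.D : print the address as a 32-bit integer.
# Runs in a subshell so the IFS change does not leak to the caller.
ip2int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# pick_route DEST "NET/LEN NAME" ...
# Print the NAME of the matching route with the longest prefix, or "none".
pick_route() {
    dest=$(ip2int "$1"); shift
    best_len=-1
    best_name=none
    for route in "$@"; do
        spec=${route%% *}
        name=${route#* }
        net=$(ip2int "${spec%/*}")
        len=${spec#*/}
        mask=$(( (4294967295 << (32 - len)) & 4294967295 ))
        if [ $(( dest & mask )) -eq $(( net & mask )) ] && [ "$len" -gt "$best_len" ]; then
            best_len=$len
            best_name=$name
        fi
    done
    echo "$best_name"
}

# Loopback adapter with netmask 255.0.0.0, LAN NIC with 255.255.255.0.
pick_route 10.10.10.20 "10.0.0.0/8 loopback" "10.10.10.0/24 lan"   # prints lan
pick_route 10.20.0.5   "10.0.0.0/8 loopback" "10.10.10.0/24 lan"   # prints loopback
```

The second call shows the side effect Graeme warns about: anything in 10.0.0.0/8 that is not on the local /24 is matched by the loopback route.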

Windows Server 2008

Malcolm has figured this out: Direct Server Return on Windows 2008 using loopback adapter (http://www.loadbalancer.org/blog/direct-server-return-on-windows-2008-using-loopback-adpter/). Malcolm found you can't handle it by adding a 2nd NIC (as you could in the early days of Linux), as Windows disables the NIC if there's no link.

Mac OS X (and Solaris) Malcolm Turnbull

Has anyone used Mac OS X for a real server in LVS/DR ?

Jerome RICHARD jrichard (at) virtual-net (dot) fr 20 Nov 2002 If you wish to configure multiple IP on Mac OS X, you should use the following command : There is more information in the AppleCare Documents Roberto Nibali ratz (at) drugphish (dot) ch 20 Nov 2002 Yes, I gave a presentation about a year ago and I remember that one of the attendees had a new shiny G4 with Mac OS X. It worked flawlessly.

Is it OK to use the loopback interface ?

You can but you don't have to. OS X is BSDish, so you need to use BSDish syntax: Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uk 22 Nov 2002

Got it working:

Solaris:

    netmask 255.255.255.255 up

Mac OS X:

Windows servers in Active Directory Domain Joe: I'm not a windows guy, so I don't understand this. I've just spliced it out of the ml as best as I can. Please send corrections. David Dyer-Bennet dd-b (at) dd-b (dot) net 10 Oct 2008

We're running into a problem with Windows boxes being on a private LAN inside the LVS: they can't join the domain (apparently Active Directory has to be able to initiate connections to the system), and now that's starting to interfere with their deployment of what they call "tcp" protocol, since it authenticates service users (obviously they're not talking about the real tcp protocol; Microsoft must be working *really* hard to obfuscate things in this area!). I'm not a Windows guy, but according to our Windows IT team, a computer can't be part of a Windows domain unless the domain controller can initiate a connection to it. So these hidden servers can't be in our corporate domain. It's not an issue with additional services, it's the base domain membership.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 10 Oct 2008

Normally you wouldn't want load balanced servers to be in an active directory domain... but if it is a requirement then either use Direct Routing (but make sure DNS is set to manual in both active directory and on the real servers, otherwise active directory stupidly registers the loopback adapter address :-0), or you can try single network NAT and make sure that:

Joe: for more details see

On the load balancer: in order for one arm NAT to work correctly, you must modify the firewall script on the load balancers to disable ICMP redirects:

    echo "0" >/proc/sys/net/ipv4/conf/all/send_redirects
    echo "0" >/proc/sys/net/ipv4/conf/default/send_redirects
    echo "0" >/proc/sys/net/ipv4/conf/eth0/send_redirects
    echo "0" >/proc/sys/net/ipv4/conf/eth1/send_redirects
    echo "0" >/proc/sys/net/ipv4/conf/eth2/send_redirects

Make sure that these lines are active by removing the # at the start of each echo command. Then configure the routing on the Windows real servers.

Route configuration for Windows Server with one arm NAT mode

When a client on the same subnet as the real server tries to access the virtual server on the load balancer, the request will fail. The real server will try to use the local network to get back to the client rather than going through the load balancer and getting the correct network translation for the connection. To rectify this, we need to add a route to the load balancer that takes priority over the Windows default routing rules. This is a simple case of adding a permanent route:

    route add -p 192.168.1.0 mask 255.255.255.0 metric 1

NB. Replace 192.168.1.0 with your local subnet address. The default route to the local network has a metric of 10, so this new route overrides all local traffic and forces it to go through the load balancer as required.
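Returning to the load balancer side: the per-interface send_redirects lines in the firewall script can also be written as a loop, so that interfaces beyond eth0..eth2 are covered as well. This is a sketch under assumptions, not part of Malcolm's script; the function name and the optional directory argument (which lets it be tried on a scratch directory rather than the live /proc tree) are hypothetical.

```shell
#!/bin/sh
# disable_send_redirects [CONF_ROOT]
# Write 0 to every send_redirects entry (all, default and each interface).
# CONF_ROOT defaults to the kernel's per-interface IPv4 settings directory.
disable_send_redirects() {
    conf_root="${1:-/proc/sys/net/ipv4/conf}"
    for f in "$conf_root"/*/send_redirects; do
        [ -w "$f" ] || continue          # skip missing or read-only entries
        echo 0 > "$f"
    done
}
```

On a director this would be called as root with no argument, from the same firewall script that otherwise carries the individual echo lines.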
Any local traffic (same subnet) is handled by this route and any external traffic is handled by the default route (which also points at the load balancer). I'm not sure what happens if the active directory is on a routed network, but I think it will still work. Please let me know.

Graeme 10 Oct 2008

Hrm... whether other domain member servers need to reach the realservers you're talking about depends on the management tools you're using. I certainly haven't ever come across a situation where the domain controllers initiate connections to member servers without being asked to (for example by someone running a computer management application to control a service on the realservers). If this were me, I'd put a domain controller into the "private" LAN, with firewall holes to the main AD domain controllers. That way firewall restrictions should force the local systems to use the local DC (or DCs, for better resilience), which can then do all the fancy AD replication back to the other DCs. Not ideal, but it *might* work.

David Dyer-Bennet dd-b (at) dd-b (dot) net

The benefits I saw initially for load balancing windows boxes in an active directory domain are: The people maintaining it can use their normal windows logons, meaning I don't have to maintain a parallel set of accounts, and the Windows software people don't have to remember yet more passwords. Access to these systems can be controlled through the normal windows active directory mechanisms. In addition to that, apparently when users are connecting from in-house applications on Windows boxes, it's easy for the windows people to extend authentication through the Web Service connect to supply authentication for the database and information services being accessed by the request. Or so they say. I hadn't thought that would be an issue, and it may not be in the end still. My desktop system is part of the corporate domain. So are the desktops of the people doing Windows development. Why would making a server part of the domain be any more dangerous than that? And that's standard anywhere that does Windows development.

Graeme You're personally fairly unlikely to run code as a system account, especially when developing - you're more likely to run it as yourself. Of course, many developers and sysadmins make themselves admins on their own machines (makes installing software just *so* much more convenient than doing "runas") so the security arguments in those cases are slightly damaged anyway :) Allowing arbitrary code (think of the mass of .NET examples out there) to be executed under the IIS framework is a dangerous game, especially (as is often the case) when it's being executed by a user with elevated privileges (like the Network Service user which IIRC is the default user for IIS code execution). This is, of course, a massive Catch-22 for hosting operations, and is the reason why app pools came along in IIS6 which allowed almost complete segregation of execution environments which themselves ran as non-privileged users. Much tidier than it used to be. In your environment you might not be exposing the web servers to that nasty Intertubes thingmy, which makes security all the easier to manage.

LVS: identd/authd

What is authd/identd?

(Jun 2002: authd clients are invoked on the realserver by services running under tcpwrappers and connect to a machine on the internet, in this case the client, before the service can complete the request. Initially we thought this was a problem unique to authd. However, we now see it as an example of an often occurring situation. See the section for more details.)

If the initial connection to your service (telnet, ftp, sendmail...) is delayed by 10secs..5mins, but after you connect everything is fine, then you have problems with identd. You can avoid reading this section by turning off identd on your realservers.

identd is a demon run under inetd. Other services running on the server can use identd to ask the client machine for the identity of the user making the request. When a request arrives at a server for such a service (e.g. telnet, sendmail), the auth client will connect from a high port to client:auth asking "who is the owner of the process requesting this service". If the client's authd replies with a username@nodename, the reply will optionally be logged on the server (e.g. to syslog) and the connection request will be handed back to telnetd (or whichever service). If the reply is "root@nodename", or some null reply, or there is no authd on the client, then the server's authd will wait for a timeout before allowing the connection. The delay is about 10secs for Slackware and 2mins for RedHat7.0. There is no checking of the validity of the reply, and since the reply is under the control of the client machine, the reply username@nodename could be bogus.

The authd is a security feature. However it doesn't gain the server very much (you don't know who has made the connect request, only what they told you), while clients that fail are delayed. This may only be a nuisance for people telnet'ing in (provided they understand what's happening), but it will bring mail delivery to a crawl.
If you setup an LVS with realservers that have services running inside identd, you will have to deal with identd. Any service in inetd running under tcpwrappers (probably just about every service, if tcpwrappers is installed) and sendmail (see section on sendmail) use it. Since problems with identd affect many aspects of an LVS, there are references to identd in several places in this document.

authd/identd and other 3-Tier clients A lot of time and effort was put into figuring out how to handle clients running on the realservers. The best solution we came up with was to turn off authd on the realservers. At the time the authd problem appeared to be a one-off problem and we dismissed it as just one-of-those-things. Later we realised that other demons running on the realservers invoke client processes, e.g. rshd, passive ftp. Still we didn't see the whole picture. It now turns out that there is a general class of services (demons) running on the realservers which invoke client processes as part of constructing a reply to the client. These demons require you to run the LVS as a 3-Tier LVS. If you allow packets from the RIP to be routable, then it's easy for the client to connect to 0/0. The problem before was that we did not allow the RIP to be routable.

symptoms of the identd problem

There are two parts to identd on your realservers:

    - Identd runs on your realservers. This isn't a problem for LVS. Identd on the realservers is for clients on your realservers connecting to services on remote machines. These clients will be connecting from the RIP and not the VIP. You aren't using this identd when setting up an LVS. However, if you telnet from your realservers for some other reason, you'll need to think about what this identd is doing.
    - Your LVS'ed services (e.g. sendmail, or services running inside tcpwrappers) may ask the identd client on your server to connect to the identd on the client machine and ask for the identity of the person connecting to the service on the realserver. You don't want this. In general there is no way, in an LVS, for the reply from the client to return to the realserver.

The problem is in the second part, i.e. when the LVS'ed service on the realserver asks the ident client to connect to the identd on the client. (If this is confusing, remember machines can be clients and servers at the same time.)

Here's an example telnet connection through a director to a realserver where telnetd is running inside tcpwrappers. tcpwrappers uses the ident client on the remote host (the one with the telnetd) to connect to the identd on the local (telnet client) host.
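To make the callback concrete: the ident protocol (RFC 1413) query that the realserver's ident client sends to port 113 on the telnet client is a single line naming the two ports of the connection being asked about, the port on the machine running identd first, then the port on the machine asking. The sketch below only formats that line; the function name is an assumption, and the port 1038 is taken from the tcpdump in this section (client2.1038 > lvs.telnet).

```shell
#!/bin/sh
# ident_query PORT_ON_IDENTD_HOST PORT_ON_QUERYING_HOST
# Print the one-line ident query, terminated by CR LF as the protocol requires.
ident_query() {
    printf '%s, %s\r\n' "$1" "$2"
}

# The telnet client connected from its port 1038 to the telnet port (23),
# so the realserver would ask the client's identd about "1038, 23".
ident_query 1038 23
```

In an LVS it is the answer to this query (or the SYN,ACK for the connection carrying it) that has no way back to the realserver that asked.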

comp.os.linux.security FAQ on identd

swan_daniel (at) hotmail (dot) com v0.1 - Last updated: April 20, 2000

4.5) What is Identd? Can I disable it?

Identd identifies the username of a process owning a specific TCP/IP connection. It is usually run via inetd and listens on port 113. Identd should not be used as a method of authentication - anyone with root access can alter their identd response. Indeed, on many systems (such as FreeBSD and Windows) even a non-privileged user can specify whatever identd response they want.

The protocol is most useful on multiuser systems as a method of tracking down problem users. If one of your users is causing problems on another system, that system's admin can inform you of the username of the specific user causing problems, saving you a lot of legwork.

Should you run identd? That's really a judgement call. On systems with many users, the benefits could be great, but it doesn't serve any particular purpose on a single user box. Not running identd may limit your ability to connect to certain servers - many IRC and some FTP servers don't allow, or severely restrict, non-identd'd connections, for example. However, running it means leaving a service open to the outside world, with all the security risks that entails.

Another thing to consider is that identd can allow attackers to find out valuable information about your system, such as whether a certain service is running as root, the operating system you are running, and the usernames of your users. Consider running identd with the -n flag, which sends userid numbers instead of usernames. See the identd manpage and /etc/identd.conf for more information about the available options.

You can block access to identd by shutting it off entirely (usually done via inetd, see section on disabling services), or by using tcpwrappers and/or firewalling software to disable/restrict access.
If you need identd enabled in order to connect to a certain server, you might want to consider allowing access to it only from that server. If you do choose to firewall the identd port, strongly consider using a reject policy rather than deny. Using deny may greatly increase the time it takes you to connect to servers that utilize identd, as they will wait for a response of some type before allowing you to connect.
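The FAQ's advice to reject rather than deny can be sketched with iptables as below. This is an illustration, not taken from the FAQ (which predates iptables and speaks of reject/deny policies): the `run` and `reject_ident` helper names and the DRY_RUN switch are assumptions. With DRY_RUN left at its default the command is only printed.

```shell
#!/bin/sh
# run CMD ARGS... : print the command when DRY_RUN is 1 (the default),
# otherwise execute it. Hypothetical helper so the sketch is safe to try.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Answer incoming ident probes (tcp port 113) with a TCP reset instead of
# silently dropping them, so the probing server does not wait for a timeout.
reject_ident() {
    run iptables -A INPUT -p tcp --dport 113 -j REJECT --reject-with tcp-reset
}

reject_ident
```

Setting DRY_RUN=0 and running as root would install the rule; a DROP target in its place reproduces the long connection delays described in this section.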

Russ Nelson on identd

Russ Nelson (he wrote the Clarkson packet drivers for DOS; he was the 1980's version of Donald Becker) says that the only possible role for identd is to keep track of client activity at the client end. He says that your firewall should reject, not drop, identd requests. Russ also has some links to sites that don't allow links from other sites. When you go to his site, please click on those links.

Why identd is a problem for LVS The problem is that the identd/authd client makes a callback from the RIP (for LVS-NAT) or the VIP (LVS-DR, LVS-Tun) and LVS doesn't handle clients on realservers. For the simple case where clients call from the RIP on a NAT'ed realserver, see the section on running clients on realservers. There the client is independent of the LVS.

The case of clients on the realservers making callbacks triggered by an LVS client's requests to an LVS'ed service is more difficult, as the result has to get back to the LVS'ed service. Normally the director in an LVS responds to connect requests by handing them to an arbitrary realserver. The corollary of this is that replies to a request initiated on a realserver to the outside world will not return to the realserver unless something is done to handle them. (The only solutions we have are those in the section on running clients on realservers.)

Replies from the client which is connecting to the LVS, arriving at the director, are not connect requests and do not belong to an established connection. They will be dropped.

Even if the director could forward these replies to a realserver, they could go to any realserver, and not necessarily to the realserver which originated the request.

The result is that the client request will hang or time out.
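The reasoning above can be sketched as a toy model of the director's decision. This is not the ip_vs code: the class, its method and the return strings are hypothetical names for illustration, and the port numbers are borrowed from the telnet example in the next section. It only shows why a reply to a realserver-initiated ident lookup matches neither a virtual service nor an existing connection entry.

```python
class ToyDirector:
    """Minimal model: a set of virtual services plus a connection table."""

    def __init__(self, services):
        self.services = set(services)   # {(vip, port), ...}
        self.connections = set()        # {(src, sport, dst, dport), ...}

    def handle(self, src, sport, dst, dport, is_connect_request):
        key = (src, sport, dst, dport)
        if key in self.connections:
            return "forward to the realserver already chosen"
        if is_connect_request and (dst, dport) in self.services:
            self.connections.add(key)
            return "schedule to a realserver"
        # Neither a connect request for a service nor a known connection.
        return "not forwarded"


if __name__ == "__main__":
    director = ToyDirector(services=[("VIP", 23)])

    # The LVS client connects to the telnet service: scheduled as usual.
    print(director.handle("client2", 1038, "VIP", 23, is_connect_request=True))

    # The realserver's ident lookup left from VIP:1377 without passing
    # through the director (LVS-DR). The client's answer from port 113
    # arrives at the director as a reply, not as a connect request.
    print(director.handle("client2", 113, "VIP", 1377, is_connect_request=False))
```

In the client-side trace in the next section, the director answers these packets with a reset, which is consistent with them never reaching a realserver.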

tcpdumps of connections delayed by identd Here's the tcpdump of a client telnet'ing to an LVS-DR LVS. Telnet on the realserver is running inside tcpwrappers; the client and realservers cannot connect directly, i.e. they have no routing to each other.

seen from client:

lvs.telnet: S 1170880662:1170880662(0) win 32120 (DF) [tos 0x10]
12:56:05.427949 client2.1038 > lvs.telnet: . ack 416490630 win 32120 (DF) [tos 0x10]
12:56:05.431752 client2.1038 > lvs.telnet: P 0:27(27) ack 1 win 32120 (DF) [tos 0x10]

client replying to realserver's auth request

12:56:05.465152 client2.auth > lvs.1377: S 1159930752:1159930752(0) ack 417813448 win 32120 (DF)
12:56:05.465405 lvs.1377 > client2.auth: R 417813448:417813448(0) win 0
12:56:08.464671 client2.auth > lvs.1377: S 1162930275:1162930275(0) ack 417813448 win 32120 (DF)
12:56:08.464901 lvs.1377 > client2.auth: R 417813448:417813448(0) win 0

6 second delay then trying again

12:56:14.466048 client2.auth > lvs.1377: S 1168931649:1168931649(0) ack 417813448 win 32120 (DF)
12:56:14.466275 lvs.1377 > client2.auth: R 417813448:417813448(0) win 0

client login to LVS

12:56:15.501272 client2.1038 > lvs.telnet: . ack 13 win 32120 (DF) [tos 0x10]
12:56:15.503946 client2.1038 > lvs.telnet: P 27:125(98) ack 52 win 32120 (DF) [tos 0x10]
12:56:15.509024 client2.1038 > lvs.telnet: P 125:128(3) ack 55 win 32120 (DF) [tos 0x10]
12:56:15.538816 client2.1038 > lvs.telnet: P 128:131(3) ack 88 win 32120 (DF) [tos 0x10]
12:56:15.551836 client2.1038 > lvs.telnet: . ack 90 win 32120 (DF) [tos 0x10]
12:56:15.571837 client2.1038 > lvs.telnet: . ack 106 win 32120 (DF) [tos 0x10]

Here's what it looks like on the realserver (this is a different connection from the above sample, so the times are not the same).

lvs.telnet: S 1605709966:1605709966(0) win 32120 (DF) [tos 0x10]
12:50:58.051263 lvs.telnet > client2.1040: S 862075007:862075007(0) ack 1605709967 win 32120 (DF)
12:50:58.051661 client2.1040 > lvs.telnet: . ack 1 win 32120 (DF) [tos 0x10]
12:50:58.052819 client2.1040 > lvs.telnet: P 1:28(27) ack 1 win 32120 (DF) [tos 0x10]
12:50:58.053036 lvs.telnet > client2.1040: . ack 28 win 32120 (DF)

realserver initiates auth request from VIP to client:auth

12:50:58.088510 lvs.1379 > client2.auth: S 852509908:852509908(0) win 32120 (DF)
12:51:01.083659 lvs.1379 > client2.auth: S 852509908:852509908(0) win 32120 (DF)

realserver waits for timeout (about 8secs), sends final request to client:auth

12:51:07.083164 lvs.1379 > client2.auth: S 852509908:852509908(0) win 32120 (DF)

telnet replies from realserver continue, login occurs

12:51:08.117727 lvs.telnet > client2.1040: P 1:13(12) ack 28 win 32120 (DF)
12:51:08.118142 client2.1040 > lvs.telnet: . ack 13 win 32120 (DF) [tos 0x10]
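The size of the delay can be read straight off the realserver-side timestamps. The short sketch below does the subtraction; the helper name is hypothetical, and the timestamps are copied from the realserver trace (the three auth SYNs and the first telnet reply that follows them).

```python
from datetime import datetime

FMT = "%H:%M:%S.%f"


def seconds_between(earlier, later):
    """Elapsed seconds between two tcpdump timestamps on the same day."""
    delta = datetime.strptime(later, FMT) - datetime.strptime(earlier, FMT)
    return delta.total_seconds()


# SYNs sent by the realserver from VIP:1379 to the client's auth port.
auth_syns = ["12:50:58.088510", "12:51:01.083659", "12:51:07.083164"]
# First telnet data packet sent after the ident lookup is abandoned.
telnet_resumes = "12:51:08.117727"

if __name__ == "__main__":
    print("first retry after ", seconds_between(auth_syns[0], auth_syns[1]), "s")
    print("second retry after", seconds_between(auth_syns[1], auth_syns[2]), "s")
    print("login delayed by  ", seconds_between(auth_syns[0], telnet_resumes), "s")
```

The retries come roughly 3 and 6 seconds apart, and the telnet session is held up for about 10 seconds in total while the unanswered lookup runs its course.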

There are solutions to the identd problem in some cases

Director is LVS-NAT In an LVS, authd on the realserver will be able to connect to the client in either of these cases:

LVS-NAT, and the realservers are on public IPs (not likely, since you usually hide the realservers from public view and they'll be on 192.168.x.x or 10.x.x.x networks).

LVS-NAT, and high ports are nat'ed out with a command like

You usually don't want to blanket masquerade all ports. You really only want to masquerade ports that are being LVS'ed (so you can still get to the other services), in which case, for each service being LVS'ed, you would use ipchains rules like

Since the auth client (on your telnet server) is connecting from a high port on the server, a better ipchains rule which will allow auth to work when the realservers are on private IPs.

LVS-DR, LVS-Tun, 2.2.x kernel directors There is no solution for LVS-DR for 2.2.x directors. The auth client on the realserver initiates the connection from the VIP. There is no way for a packet from VIP:high port to get a reply through the LVS, because the incoming packet from the client on the internet

is destined for a non-LVS'ed high port,
is not a connect request,
is not associated with an established connection.

The reply from the LVS client will be dropped.

LVS-DR, LVS-Tun, 2.4.x kernel directors Transparent proxy in 2.4 is different to 2.2 (see section on identd with 2.4 TP). You should be able to masquerade the identd client's request on the realserver.

Turn off tcpwrappers One cure is to turn off tcpwrappers. inetd.conf will have a line like change this to and re-HUP inetd.

using iptables to handle identd Graeme Fowler graeme (at) graemef (dot) net 19 Dec 2008 Make sure you REJECT rather than DROP ident lookups on the director, or even better configure the realservers to REJECT them in the OUTPUT chain on the outgoing interface. If they get DROPped, then the calling process will exhibit the exact hangup you're seeing. This is very, very common in SMTP systems using ident lookups with badly configured firewalls. David Merhar merhar (at) arlut (dot) utexas (dot) edu 19 Dec 2008 Nice. This about does the trick on the realservers: This reduces the wait to 3 seconds as opposed to 30 seconds. However it also increases the delay of connecting to the RIP from 0 to 3 seconds.
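The difference between a rejected and a silently dropped lookup can be felt from the caller's side with a few lines of socket code. This is only an illustration under stated assumptions: a loopback port with no listener stands in for a firewall that REJECTs (the connect attempt fails immediately), and the helper names are hypothetical. A DROPped lookup would instead leave the caller waiting until its own timeout expires, which is the hang described above.

```python
import socket
import time


def unused_local_port():
    """Ask the kernel for a free loopback port, then release it so nothing listens there."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.bind(("127.0.0.1", 0))
        return s.getsockname()[1]


def time_connect(host, port, timeout):
    """Try to connect and report the outcome plus how long the caller waited."""
    start = time.monotonic()
    try:
        with socket.create_connection((host, port), timeout=timeout):
            outcome = "connected"
    except ConnectionRefusedError:
        outcome = "refused"      # what a rejected lookup looks like to the caller
    except socket.timeout:
        outcome = "timed out"    # what a dropped lookup looks like to the caller
    return outcome, time.monotonic() - start


if __name__ == "__main__":
    outcome, elapsed = time_connect("127.0.0.1", unused_local_port(), timeout=3.0)
    print(outcome, f"after {elapsed:.3f}s")
```

A daemon doing an ident lookup behaves like time_connect(): the sooner it is told "no", the sooner it gets on with the real connection.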

Identd and smtp/pop/qmail (This is from the early days of the mailing list when the problem first came up)

Problem: In the case of identd, the smtpd on a realserver says to identd "give me the name of the owner of the process on IP:port that is asking me to accept mail". If identd thinks it is running on the RIP rather than the VIP and, as is most likely, the RIP is not routable from the outside world, mail on the realserver will hang. If identd is running on the VIP, then replies will probably return to another realserver and mail will still hang.

The converse case, of sending mail from the LVS, has the smtp server out in internetland asking the LVS for the name of the owner of the process running on the VIP sending him mail. If identd is clustered, then the request will in all probability go to another realserver. This seems equally intractable at the moment.

Originally the problem was raised by ckennedy (at) iland (dot) net Subject: SMTP, POP3 using Qmail and Ident, also using Solaris as realservers

I have setup a virtual server using the Linux 2.2 patch and 3 Sun Ultras as the actual servers. It has crashed twice, though possibly from running bind on the Virtual Server, since it was right when I started it up (bind) that the virtual server would crash. The major problem I am having is a timeout for Ident requests on POP3 and SMTP ports which seem to be confused. When looking at the problem with tcpdump on the virtual server and the real servers the vserver seems to do the following:

vserver.net.smtp: . ack 2764990963 win 8760
13:41:48.636030 10.0.0.1.4658 > vserver.net.smtp: . ack 1 win 8760
13:41:48.658875 10.0.0.1.auth > vserver.net.48981: R 0:0(0) ack 2765099549 win 0 <<<<<<
13:41:52.143790 10.0.0.1.auth > vserver.net.48981: R 0:0(0) ack 1 win 0 <<<<<<
13:41:58.144210 10.0.0.1.auth > vserver.net.48981: R 0:0(0) ack 1 win 0 <<<<<<

The Ident, or auth port on the client machine trying to connect back to the vserver is where it will pause for about 10-15 seconds then connect just fine. I believe this may be qmail specific since a server running sendmail will not have this problem and ident seems to be used by qmail more than it or something.

No-one answered, then months later...

Ted I currently have HTTP loadbalanced just fine with the LinuxDirector. I've setup SMTP in the same fashion, and I don't have as much luck.

Lars Check if your system (tcpwrapper or sendmail) is doing a NS lookup before accepting the connection or trying to connect to ident.

ckennedy (at) iland (dot) net Subject: Re: SMTP -- very slow connection

I had the same problem with the Direct Routing and SMTP and POP3. It looked like a problem with the Ident lookup to the server by the client, it was what always was occurring during that time out period. I saw this while doing tcpdumps on the virtual server where the client would just keep asking for Ident lookups to the Virtual IP address which are from the client port 113 to a random port above 1023 on the virtual server. I can see how this is tricky with the direct routing method since this traffic should be sent on to the realserver but is not. I sort of gave up on Direct Routing for now since this looks pretty hard to fix if it really is Ident and the client requirements getting in the way.

Ted I'm connecting directly to the IP. But just to be sure, I'll add an entry to the nameserver for that particular IP -- in both forward and reverse lookups...........done..... And it still does the same thing. :( Understand that the thirty seconds are *AFTER* the connection... Telnet connects, gives me the escape character, and sits. If it was a nameservice thing, I'd imagine it'd sit before it connected. I'd actually be happier if it wasn't connecting. :) Then I'd know there was definitely something I needed to fix between me and the real machine. But when it connects and THEN has trouble... I'm lost. :(

Michael Baird mike (at) tc3net (dot) com Sounds like an issue with ident lookups, you probably aren't clustering IDENT, you can 1) cluster identd, or edit your sendmail.cf file and set the value

Ted Great idea. That was it. I turned off ident in sendmail and things worked fine. However, I don't want to turn off ident in sendmail, and I figure other things might want ident too... so I want to cluster ident. Clustering ident didn't help. I clustered tcp port 113 to both servers. (I even tried "loadbalancing" 25 and 113 just to ONE server -- that way it'd always hit the same server)... And that didn't work. I got the same results -- telnet to port 25... connect... thirty seconds... and then sendmail would enter command mode. Any ideas? Do I have to loadbalance anything else besides tcp 113 for identd to work? Why is identd run with smtp (any other reason other than wanting to know who is sending me the mail?) Do you have to turn identd off in smtp to get LVS smtp to work? Has anyone LVS'ed identd? (I'd imagine you wouldn't necessarily get the ident from the same machine running the process for which you want the ident)

LVS: Variants on LVS: Local Nodes (One Box LVS) LVS Variants: LVS was originally based on the masquerading code of Linux-2.0. The director had the VIP, to let the router know where to send packets for the virtual server. There was no port for the service listening on the director's VIP, so ip_vs() forced acceptance of the packet by hooking into LOCAL_IN. Having the VIP on the director and the ip_vs() hook in LOCAL_IN is a historical design rather than a technical restriction. If other clues can be given to the router, to route packets, then the VIP is not needed on the director. Current thinking is that ip_vs() should be moved to the FORWARD chain to allow the director to function more like a router. This (and the following) section(s) show attempts at alternate designs for LVS, e.g. to move ip_vs() out of LOCAL_IN.

This is NOT LocalNode. It's experimental code from Horms to allow running realservers on the director. This allows you to test LVS on one box.

Viktors Rotanovs 1 Sep 2006 Is it possible to do port redirection using iptables _after_ director on localnode? I've changed NF_IP_LOCAL_IN to NF_IP_PRE_ROUTING at ip_vs_in_ops in ip_vs_core.c, and now it bypasses NAT, but I'm not a kernel hacker and I don't know which priority should be set and if it's possible to solve the problem that way.

Siim Poder windo (at) p6drad-teel (dot) net 01 Sep 2006 If the LVS grabs a packet, you can't do any NAT on it any more. The packet is as good as lost for those purposes (currently it seems so, at least). However, LVS does its own NAT. Is there a reason why you have to first let LVS do its own NAT and then have iptables NAT again? Couldn't you just have the right LVS real servers (with the right ports) in the first place (using fwmarks, if that source address is important)?

Joe Although LVS has always hooked into LOCAL_IN, it could hook in anywhere and perhaps it would be good to write this into the ip_vs code. Both Horms and Ludo have fiddled around here with no apparent ill-effect.
Horms 2 Sep 2006 Here is my take on this problem Local Nodes, msg00102 Local Nodes, msg00113. Here's the local_nodes.patch. If people tested it and gave feedback we could merge it into the kernel :) Dave Whitla 20 Jun 2005

I am trying to load balance to two "real" servers which are actually listening on virtual IPs on the load-balancing host. Why would I want to do this? To build a test environment for a web application which usually runs on an IPVS cluster. The purpose of the test environment is to test for databasecache contention issues before we deploy to our production cluster. The catch is I must make the test environment (lvs director + 2 x application server instances) run on one physical host (each developer's development machine). The man page for ipvsadm makes specific mention of forwarding to realservers which are in fact running on local interfaces stating that the load balancing "forward" mechanism specified for a virtual service is completely ignored if the kernel detects that a real server's IP is actually a local interface. The "Local Node" page describes a configuration in which I could load balance between a "real" server running on the director's virtual service IP and a real server running on another host. This does not solve my problem however as I must bind each instance of my application to a different IP address on the same physical box. You may be thinking "Why not run the two instances on different ports on the same IP (the virtual service IP)?". Sadly the application is not a simple web-site, and source code and deployment container dependencies on certain port numbers exist. eg RMI-IIOP listeners. Does anyone know of some config or kernel hack, patch or whatever which might make my ipvs present forwarded packets to the virtual interfaces as though they had appeared on the wire so that my forward directives are not ignored and the packets are not simply presented to the TCP stack for the virtual service IP? I guess this is like NAT to local destination addresses (as opposed to NAT of locally originated connections which is supported in the kernel).

Horms This is a pretty interesting problem that crops up all the time. I have often wondered how hard it would be to make nat work locally (not that LVS-Tun and LVS-DR don't/can't support portmapping anyway). The patch linked above is for 2.6.12 and allows nat to work locally by:

Not marking local real-servers as local
Passing nat-mangled packets down to local processes instead of passing them out onto the network
Reversing nat in the LOCAL_IN chain

Please note that this completely breaks NAT for non-Local hosts. It could be cleaned up and made so it doesn't do that. But I thought I'd throw it onto the list before putting any more time into it.

Horms 21 Jun 2005 The patch is my second attempt. This should automatically switch local real-servers to Local unless the requested forwarding method is Masq and the real port differs from the virtual port. That is, if you want to do portmapping on a local service it will use Masq, otherwise it will use Local. It seems to work, but there are probably a few gotchas in there and I haven't tested a whole lot.
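The selection rule Horms describes for his second attempt can be written down as a small decision function. This is a paraphrase of his sentence, not code taken from the patch: the function name, its arguments and the method strings are hypothetical, and the port numbers in the usage lines are made up for the example.

```python
def effective_forwarding(requested, real_is_local, virtual_port, real_port):
    """Return the forwarding method actually used for one realserver.

    requested      -- method asked for, e.g. "Masq", "Route", "Tunnel"
    real_is_local  -- True if the realserver IP is on the director itself
    """
    if not real_is_local:
        return requested                 # remote realservers are unaffected
    if requested == "Masq" and real_port != virtual_port:
        return "Masq"                    # port mapping to a local service
    return "Local"                       # everything else on a local IP


if __name__ == "__main__":
    # Local service on the same port: handled as Local.
    print(effective_forwarding("Masq", True, 80, 80))
    # Local service on a different port: NAT is kept so the port can be rewritten.
    print(effective_forwarding("Masq", True, 80, 8080))
    # Realserver on another host: the requested method stands.
    print(effective_forwarding("Route", False, 80, 80))
```

Under this rule, Dave Whitla's two local application instances would only stay on Masq if they listened on a port different from the virtual service's port.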

LVS: Variants on LVS: Peter Warasin's ip_vs() in PREROUTING Peter Warasin peter (at) endian (dot) com 11 Sep 2007 I made some modifications to the lvs specific kernel code, which now lead to a kernel oops. Could someone give me some pointers about how to find the bug? I am not very familiar with the kernel code, so maybe I missed some simple tricks which experienced people know and I don't. Basically I altered the lvs code in order to make it catch packets within the PREROUTING chain instead of the INPUT chain. My setup works, but sometimes I have a kernel oops. I think somewhere it lacks some sort of spinlock, but I don't really know where to begin in order to find where it must be inserted.

My setup: Kernel is RHEL 2.6.55.0.2.EL. I have 2 LVS directors (master, backup), which at the same time are real servers running squid. They are configured as LVS-GW; the real servers have ip addresses on a different subnet than the VIP. The backup has the correct route for both subnets and a default gateway pointing to the master. I use keepalived, which configures LVS in order to have the correct rules configured on the master and no rules on the backup whenever the master is up. Connections to port 80 from behind the master going to "the outside" should be transparently intercepted, balanced by lvs and passed to the respective squid, which does the rest.

With the standard LVS this setup is not possible, for 2 reasons:

I must mark packets within the PREROUTING chain in the mangle table in order to pass them to LVS, but LVS intercepts only packets coming into the INPUT chain, which forwarded packets will never pass.

Once I managed to intercept the packets with LVS, both realservers need to DNAT the packets in order to redirect them to squid, which runs on port 8080. But packets which come in on Local cannot be NAT'ed because LVS sends them directly to the wire.
I solved those problems this way:

packet_xmit = ip_vs_null_xmit;
+		cp->packet_xmit = ip_vs_loop_xmit;
 		break;

 	case IP_VS_CONN_F_BYPASS:
--- linux-2.6.9/net/ipv4/ipvs/ip_vs_xmit.c.orig 2007-08-01 19:28:52.000000000 +0200
+++ linux-2.6.9/net/ipv4/ipvs/ip_vs_xmit.c 2007-08-03 16:47:16.000000000 +0200
@@ -24,6 +24,8 @@
 #include /* for ip_route_output */
 #include
 #include
+#include
+#include

 #include

@@ -141,12 +143,47 @@
 ip_vs_null_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
 		struct ip_vs_protocol *pp)
 {
+	IP_VS_DBG(10, "NULL transmitter called\n");
 	/* we do not touch skb and do not need pskb ptr */
 	return NF_ACCEPT;
 }


 /*
+ * LOOP transmitter (reinject on NF_IP_PRE_ROUTING)
+ */
+int
+ip_vs_loop_xmit(struct sk_buff *skb, struct ip_vs_conn *cp,
+		struct ip_vs_protocol *pp)
+{
+
+	struct ip_conntrack *ct;
+	enum ip_conntrack_info ctinfo;
+	struct ip_nat_info *info;
+
+	IP_VS_DBG(5, "LOOP transmitter called\n");
+	if (skb->nfcache & NFC_IPVS_PROPERTY) {
+		IP_VS_DBG(10, "Already passed LVS. Receive it normally\n");
+		return NF_ACCEPT;
+	}
+
+	IP_VS_DBG(10, "Retransmit to IP_PRE_ROUTING hook starting with priority NF_IP_PRI_MANGLE\n");
+	nf_reset_debug(skb);
+	skb->nfcache |= NFC_IPVS_PROPERTY;
+	skb->ip_summed = CHECKSUM_NONE;
+
+	ct = ip_conntrack_get(skb, &ctinfo);
+	if (ct && (ctinfo == IP_CT_NEW)) {
+		info = &ct->nat.info;
+		info->initialized = 0;
+	}
+	NF_HOOK_THRESH(PF_INET, NF_IP_PRE_ROUTING, skb, skb->dev,
+		       NULL, ip_rercv, NF_IP_PRI_MANGLE);
+	return NF_STOLEN;
+}
+
+
+/*
  * Bypass transmitter
  * Let packets bypass the destination when the destination is not
  * available, it may be only used in transparent cache cluster.
--- linux-2.6.9/net/ipv4/ip_input.c.orig 2007-08-01 19:29:54.000000000 +0200
+++ linux-2.6.9/net/ipv4/ip_input.c 2007-08-01 19:32:42.000000000 +0200
@@ -355,6 +355,14 @@
 }

 /*
+ * Retransmit packet
+ */
+int ip_rercv(struct sk_buff *skb)
+{
+	return ip_rcv_finish(skb);
+}
+
+/*
  * Main IP Receive routine.
  */
 int ip_rcv(struct sk_buff *skb, struct net_device *dev, struct packet_type *pt)
@@ -429,4 +437,5 @@
 }

 EXPORT_SYMBOL(ip_rcv);
+EXPORT_SYMBOL(ip_rercv);
 EXPORT_SYMBOL(ip_statistics);

The patch causes incoming packets which should go to Local to be retransmitted through the netfilter hooks starting at NF_IP_PRI_MANGLE, instead of being transmitted directly with ip_vs_null_xmit. This way I can remove the mark within the mangle table in order to pass it through LVS twice and then simply DNAT it. (Please ask if you would like to have the detailed iptables/ipvsadm rules.)

The setup works. But sometimes I have a kernel oops (fatal exception in interrupt; ip_rcv, ip_rcv_finish is involved). I tried to narrow down the problem by removing patch nr 2, but the problem still exists. So the problem must be with the 1st patch. But what could cause this? I simply let LVS catch packets within the PREROUTING chain instead of the INPUT chain. That seems not too different to me. I think somewhere it lacks some sort of spinlock, but I don't really know where to begin in order to find where it must be inserted.

Horms 13 Sep 2007 As Joe mentioned in a subsequent email, being able to move LVS from one chain to another is something that we are interested in. In particular I am of the belief that the FORWARD chain would be a much more logical home than LOCAL_IN, as in some ways it would allow LVS to act more like a router than a proxy (not that it is a proxy, but it kind of behaves like one in some ways because of its home on LOCAL_IN). As I recall, I did try moving the code to the FORWARD chain a long time ago. I believe that the change was very similar to the LOCAL_IN to PRE_ROUTING snippet that you have below. I'm not sure that I ever posted the change, as I never tested it thoroughly. So perhaps it too broke occasionally.
In any case, this was a long time ago, and the kernel has changed significantly since then, so any testing done at that time wouldn't really hold water now (incidentally 2.6.9 is also pretty old, though I'm not sure what patches RHEL includes to modernise it).

As for debugging your problem: providing the oops message - if any - might help. Hopefully there is a stack trace in there and that should start to point to where the problem is. Some portions of the locking semantics of LVS are non-trivial and I have a deep suspicion that there are some races in there anyway. By which I mean, don't be surprised if things get a bit hairy as you are tracing through what is going on. If your kernel is compiled with IP_VS_DEBUG then you can enable and adjust the verbosity of the debugging messages that LVS produces by tweaking /proc/sys/net/ipv4/vs/debug_level as documented in Documentation/networking/ipvs-sysctl.txt in the kernel source tree.

Also, if you are doing development work, I do recommend considering using a more up to date kernel. Perhaps the latest rc kernel, currently 2.6.23-rc6. I'm not suggesting that you necessarily drop this into production. But for development work, it is much easier to work with the kernel guys if you are on the same page as them.

LVS-J: Ludo's reinJect Forwarder: using the director as a gateway to load balance connections to the internet

Introduction We haven't had a new forwarder for quite a while (the last one was either Localnode or LVS-Tun, way back in the early days). An LVS director should be able to balance packets through multiple paths to the internet, except that it has to accept the replies as well. Ludo Stellingwerff ludo (at) protactive (dot) nl has hacked the ipvs code to do just that. A writeup of the state of the art in routing multiple internet connections over different paths is in . Handling failure in multipath routing is still difficult - see Julian's detection code. As for all the forwarders, failover of the realservers is handled externally to ip_vs. Anyone who sets up this forwarder will at least need to be aware of the problem of handling failure of any of the routes.

Ludo Stellingwerff ludo (at) protactive (dot) nl 29 Jul 2005 Here are the vs_reinject_patches.tar.bz2 against 2.6.11. This is a minimum implementation to provide support for using LVS as a loadbalancer for internet gateways. Here's the physical layout with example IPs. The director is the gateway for the private network. The director load balances two separate internet connections (eth1, eth2) through the realservers, which are the real gateways. Any return traffic will pass through the director on the way back (side note: you'll need to switch off the reverse path filter on the director). The code consists of two parts: director code called "reinJect" and an iptables target called "LVS". I call my code "Multipath routing through LVS", because the kernel term for multiple internet connections is multipath routing. Patches are included to allow ipvsadm to set up the reinJect forwarding.

reinJect setup with ipvsadm Here's how you set it up. It's similar to the setup of the other forwarders, but using the -j forwarder option. There are a few extra wrinkles, as explained below.

The target LVS: sending packets with dst_addr=0/0 to ip_vs Since the director is being asked to process packets with dst_addr=0/0, some method of getting the director to process the packets is needed.

Originally LVS was written to process single IP, single port services (e.g. a website at VIP:80). Since the packet was destined for an IP on the original server (that was replaced by an LVS), and this packet was routed to LOCAL_IN, LVS was written to hook into LOCAL_IN. In an LVS this IP became the VIP on the outside of the director and a non-arping VIP on the realservers.

Later, for 2.2.x kernels, transparent proxy allowed LVS to be extended so that it would accept packets to a wide range of IPs, e.g. squids which process packets for 0:80. With the arrival of the 2.4 series of kernels, transparent proxy no longer worked for LVS (but still worked for squids) as the dst_addr of the packet was rewritten. LVS was back to working on only individual IPs.

Fwmark allowed grouping of a small number of IPs to be seen as one service, and was adapted to fwmark packets for 0.0.0.0. Still, methods outside LVS were required to allow the packets to be accepted locally, which fwmark didn't do. The problem was that LVS required the packets to traverse LOCAL_IN and you couldn't put the IP=0.0.0.0 on the director (or realservers), which would direct these packets to LOCAL_IN. These IPs were on machines out in internetland. This was solved by Julian with two lines of iproute2 code (see ). Now everyone was happy again, but for LVS to work, the packets must traverse LOCAL_IN.

There has been some talk of moving the LVS hooks from LOCAL_IN to another part of the routing table e.g. . In fact you can hook LVS anywhere after PRE_ROUTING, except that to make such a change would require much testing. Hopefully not too much would break in the rest of the code, but still you would have to allow time to fix it all. No-one has wanted to take on the job.
Ludo solves the problem of LVS processing packets with dst_addr=0.0.0.0 by putting the hook for his forwarder into the FORWARD chain (which is traversed by packets to 0.0.0.0). Conceivably LVS could be rewritten so that all forwarders for packets to 0.0.0.0 hook into the FORWARD chain. Here are the instructions that send the packet to LVS. The marked packets jump to the target LVS, which sends the packet to the normal ip_vs code.

setting up LVS-J forwarding The setup of the -j forwarder is the same as for the other forwarders. Normally you would like to use persistence: e.g. accessing websites with cookies based on sourceip, using https, ssh. The problem is not which gateway you use, but which source address you seem to come from (here either 200.200.10.1 or 200.200.20.1).

The ip_vs code does the normal things with the packet - if it's a new connection, it sets up the templates etc, and if it's a current connection, it figures out which realserver to send the packet to. The reinject code puts the packet back in the mangle chain, doing a form of direct routing: it returns the packet to the place where target LVS was called, but with a new MAC address and destination device (here eth1 or eth2 on the director). The packet will have the source MAC address of the public interface on the outside of the director and the destination MAC address of the gateway/realserver.

The ip_vs code doesn't set destination MAC addresses, but leaves that to the outgoing device driver. LVS/DR (and LVS/Reinject) changes a field in a kernel structure called SKB (skb->dst). This field is the next IP address the packet will go to. The corresponding MAC address will be determined by the link-layer driver. LVS only operates on the network layer (IP addresses).

Reinjection is only effective if ip_vs is called on the FORWARD path. If you try to reinject at the LOCAL_IN path, it won't work. The normal ip_vs function is called after the choice between local delivery and forwarding. If I only change the skb->dst at this point, it will not redo this choice and the packet will continue to be locally delivered. That is the reason why LVS-DR calls the ethernet output function directly.
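Why persistence matters here can be shown with a toy scheduler: a remote site that ties a session to the source IP must keep seeing the same public address, so a LAN client has to stay on one gateway while its session is alive. The sketch below is an assumption-laden illustration, not the ip_vs template code: the class, the plain round-robin choice, the timeout value and the LAN client addresses are all hypothetical; only the two gateway source addresses come from the text above.

```python
import itertools


class PersistentGatewayPicker:
    """Round-robin over gateways, but pin each client IP to its first choice."""

    def __init__(self, gateways, timeout):
        self._round_robin = itertools.cycle(gateways)
        self._timeout = timeout
        self._templates = {}     # client_ip -> (gateway, expiry time)

    def pick(self, client_ip, now):
        entry = self._templates.get(client_ip)
        if entry is not None and entry[1] > now:
            gateway = entry[0]                    # still pinned
        else:
            gateway = next(self._round_robin)     # new or expired client
        # Every use refreshes the pin.
        self._templates[client_ip] = (gateway, now + self._timeout)
        return gateway


if __name__ == "__main__":
    picker = PersistentGatewayPicker(["200.200.10.1", "200.200.20.1"], timeout=300)
    print(picker.pick("192.168.1.10", now=0))    # first client
    print(picker.pick("192.168.1.11", now=1))    # second client, other gateway
    print(picker.pick("192.168.1.10", now=2))    # first client again, same gateway
```

Without the pin, the third call would move on to the next gateway in the rotation, and an https or ssh peer keyed on source IP would see the client change address mid-session.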

SNAT'ing the output Packets emerging from the director would have src_addr=CIP (a private IP). Ludo fixes this by SNAT'ing the src_addr. SNAT'ing is only necessary when the director is between a private and a public network. The forwarder will also work when the director is between two public networks (with no SNAT required). In most cases SNAT is required.

LVS-J discussion by Ludo The code could have achieved the same result using direct routing, but then couldn't SNAT the private addresses to public addresses. Therefore the code introduced the reinject director, which only changes the routing decision and then returns to normal routing with NF_ACCEPT on the hook. The packet will then go on normally, but with a new route. This effect is similar to the iptables ROUTE target, but with the added features of LVS (caching, persistence). The scheduler is non-intrusive: it only changes 1 field in the skb and then returns to the normal network stack, at the same point where "ip_vs_in()" was called.

The code provides a Netfilter Target called "LVS" which can be used in the mangle table on the FORWARD hook. When used, this target calls "ip_vs_in()" directly, providing the routing capability of lvs.

Packet coming in from the LAN has a non-local destination; normal Linux routing will call ip_forward.

Netfilter Forwarding hook is called.

In the mangle table the following two targets will be called:

-d 0.0.0.0/0 -m state - --state NEW -j MARK --set-mark 1
#iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS

The LVS target will call ip_vs_in(), with a scheduler on fwmark 1, using the new "reinject" director:

#ipvsadm -a -f 1 -j -r

The reinject director will make sure the packet continues traversing the mangle table, but with a new Nexthop (skb->dst).

The packet will continue through the normal network stack, through POSTROUTING, etc.

The packet will be sent towards the internet through the selected gateway (specified as next hop).

The provided patches have three unfinished drawbacks:

I'm not sure if the kernel patches compile correctly when used as modules. Therefore I force the LVS subsystem to "inbuild/yes" when selecting the iptables target.

I didn't check the usage of the iptables target for the FORWARDING hook. It is still possible to select this target in PREROUTING, even though this is ineffective; it will not work.
It's against 2.6.11 (which is old already :)

Hopefully I'll find time to clean that up sometime next week. Or maybe someone else has time/energy to clean them up?

Linux networking is very flexible when it comes to routing. It is possible to use several internet connections through one router. The process of selecting from these multiple default routes is called multipath routing. One of the remaining problems for multipath routing under Linux is the lack of flexibility in the scheduling of these multiple default routes. The normal multipath routing only provides a weight factor, but no further setup parameters. It is a basic form of load-balancing, but nothing fancy. Another problem is that multipath routing is only supported on default routes, not on any route with more than one possible gateway.

The lvs_reinjection patches are designed to provide the full effectiveness of the LVS schedulers for deciding which route a given packet will take. Contrary to normal LVS setups, it provides the possibility to schedule any traffic through the router. With normal LVS the scheduled service is a local service on the director which is then transferred to one or more realservers. The solution provided through these patches can select any traffic passing the director and then force this traffic through a nexthop/gateway. The fwmark can be set anywhere in the networking stack, using iptables. Then you'll need to tell the network stack to send the traffic through the LVS subsystem. This is done through the use of a new iptables target called LVS. The purpose of this target is to call the entry function of LVS.

Basically you can then use any of the LVS functions, any director available. But of the three standard forwarders, none is very effective for the internet loadbalancing case. The LVS/NAT director will mangle the headers of the packets, therefore losing the final destination information.
The LVS/TUN director will try to set up a tunnel to the realservers (in this case: the gateways), but most gateways don't provide such a tunnel capability. Only LVS-DR provides the required behaviour: it will send the packets unmodified to the correct gateway. But the LVS/DR director does this by bypassing all local routing on the router, and sends the packet directly through the ethernet driver's output function. This means further services, like SNAT, masquerading, etc. cannot be applied to these packets. For this problem the patches introduce a new director, called LVS/Reinject. This is a very simple director, similar to LVS/DR. But instead of sending the packet directly out the ethernet device, this director leaves that to the normal network stack. It just returns to where LVS was called in the first place.

You can't use the LVS-J forwarder in normal LVS setups, where LVS hooks into LOCAL_IN. Returning the packet to LOCAL_IN would mean that delivery of the packet would be attempted locally. Here the code allows LVS to hook into the FORWARD chain. Normally LVS services require the traffic to the VIP to be delivered locally on the director. Just before this traffic is delivered to a local process the LVS subsystem will be called. If the LVS subsystem accepts the traffic for one of its services, it will steal the traffic from the local delivery path and send it through one of its directors to the realservers (bypassing standard routing). With the iptables target it is possible to call the LVS subsystem at any time you like. This can be on the local delivery path, but also on the forwarding path. If you call the "LVS" target and the traffic is selected by one of the LVS services, it will be stolen from the normal flow and delivered back there again. Registering LVS with the FORWARD hook fixes the problem of requiring the dst_addr to be a VIP local to the director. But I also wanted to prevent LVS from stealing the packet.
I wanted the traffic to stay in the network stack, and continue on its normal path. The only service I needed from LVS was the ability to select one of the available gateways. These are the realservers in LVS terminology, but here they aren't the endpoint of the connection, just the next hop to the internet. So I combined the ability to send any traffic to LVS with the ability to reinject the packets into the network flow at the same place where LVS stole them (basically returning from the LVS entry function with IPT_CONTINUE instead of NF_STOLEN).

Horms I had a brief look over the patches and they seem ok to me, except that I am not clear on the motivation of the following hooks. Doesn't this mean that ip_vs_in is registered in three separate places? Is this actually what you need?

Ludo Stellingwerff Aug 04, 2005

Yes, as I try to redirect forwarded traffic (with addresses not local to the director), I need to hook into NF_FORWARD. Ideally this should be a separate ip_vs_forward_in function, but these patches are a proof of concept. This new ip_vs_forward_in function should be limited to matching fwmarks. The packet flow is: incoming packet -> PRE_ROUTING -> FORWARD -> ip_vs_in (returning NF_ACCEPT, after changing skb->dst) -> POSTROUTING -> outgoing packet. I'm also looking at the possibility of using the iptables REDIRECT target to get rid of the forwarding hook and use the normal ip_vs_in, but I'm not yet sure this will not mangle the original packet (it should not lose the original destination data). At the least, reinject would then have to be changed to return NF_STOLEN on the INPUT hook, and call ip_forward() to get the packet on its way again. The flow for the packet will then become: incoming packet -> PRE_ROUTING (REDIRECT) -> INPUT -> ip_vs_in (returning NF_STOLEN, sending packet to ip_forward()) -> FORWARD -> POSTROUTING -> outgoing packet.

Horms 9 Aug 2005 If you can get that working, that would be nice. Though I have often wondered about just moving ip_vs_in() to FORWARD and being done with it. I've tried it briefly in the past to good effect.

LVS: Services: general, setup, debugging new services If you just want to find out about setting up a particular service that we already have figured out (e.g. all single port read-only services, some multiport services) then just go to that section. This section is if you are having trouble setting up a service, or want to know more about how services are setup.

Single port services are simple Single port tcp services all use the same format:
the realserver listens on a known port (e.g. port 23 for telnet)
the client initiates the connection by sending a SYN from a high port (port number > 1024) to the VIP:known_port (e.g. VIP:23)
the director selects the next realserver to service the request from its scheduling table, allocates a new entry in its hash (ipvsadm) table, and forwards the SYN packet to the realserver.
the realserver replies to the client. For LVS-DR and LVS-Tun, the default gw for the realserver is not the director: the reply packet goes directly to the client. For LVS-NAT, the default gw for the realserver is the director: the reply packet is sent to the director, where it is masqueraded and then sent to the client.
A similar chain of events is involved in pulling down the tcp connection.

In principle, setting up a service on an LVS is simple - you run the service on the realserver and forward the packets from the director. A simple service to set up on LVS is telnet: the client types a string of characters and the server returns a string of characters, making it the service of choice for debugging an LVS. In practice some services interact more with their environment. ftp needs two ports. With http, the server needs to know its name (it will have the IP of the realserver, but will need to proclaim to the client that it has the name associated with the VIP). https is not listening to an IP, but to requests to a nodename. This section shows the steps needed to get the common single-port services working. A separate section shows how to set up multi-port services like ftp or e-commerce sites. When trying something new on an LVS, always have telnet running as an LVS'ed service. If something is not working with your service, check how telnet is doing.
Telnet has these advantages:
telnetd listens on 0.0.0.0 on the realserver (at least under inetd)
the exchange between the client and server is simple and well documented
the connection is non-persistent (new sessions initiated from a client will make a new connection with the LVS)
the exchange is unencrypted and in ascii (you can follow it with tcpdump)
the telnet client is available on most OS's
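The connection setup steps for a single-port service can be illustrated with a toy model. This is only a sketch for reasoning about the flow: the class name, the addresses and the choice of round-robin scheduling are assumptions made for the example, and the real ip_vs code keys its connection table on more fields and handles timeouts and connection teardown.

```python
from itertools import cycle


class ToyDirector:
    """Toy model of a director for one single-port TCP service."""

    def __init__(self, vip, port, realservers):
        self.service = (vip, port)
        self._scheduler = cycle(realservers)  # round-robin over realservers
        self.connections = {}  # (client_ip, client_port) -> realserver

    def forward(self, client_ip, client_port, dst_ip, dst_port, syn=False):
        """Return the realserver a packet is forwarded to, or None."""
        if (dst_ip, dst_port) != self.service:
            return None  # not addressed to the virtual service
        key = (client_ip, client_port)
        if key not in self.connections:
            if not syn:
                return None  # unknown connection and not a SYN
            # New connection: schedule a realserver and remember the choice.
            self.connections[key] = next(self._scheduler)
        return self.connections[key]


# Hypothetical addresses: one VIP, two realservers, two telnet clients.
director = ToyDirector("192.168.1.110", 23, ["10.1.1.2", "10.1.1.3"])
first = director.forward("198.51.100.7", 40001, "192.168.1.110", 23, syn=True)
again = director.forward("198.51.100.7", 40001, "192.168.1.110", 23)
second = director.forward("198.51.100.8", 40002, "192.168.1.110", 23, syn=True)
print(first, again, second)  # 10.1.1.2 10.1.1.2 10.1.1.3
```

Later packets of an existing connection go to the realserver chosen for its SYN, while a new connection gets the next realserver from the scheduler.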

setting up a (new) service When setting up your LVS, you should first test that your realservers are working correctly. Make sure that you can connect to each realserver from a test client, then put the realservers behind the director. Putting the realservers into an LVS changes the networking. For testing the realservers separately LVS-DR, LVS-Tun: Have the VIP on lo:n or tunl0:n with the service listening to the VIP. You'll need some way of from the test client. Alternately you can put the VIP on eth0 and move it back to the local device afterwards. LVS-NAT: The service will be listening on the RIP. In production the client will be connecting to the VIP, so name resolution may be required mapping the RIP to the name of the VIP. If you need this see . The LVS director behaves as a router (with slightly different rules). Thus when setting up an LVS on a new service, the client-server semantics are maintained the client thinks it is connecting directly to a server the realserver thinks it is being contacted directly by the client Example: nfs over LVS, realserver exports its disk, client mounts a disk from LVS (this example taken from performance data for single realserver LVS), realserver:/etc/exportfs (realserver exports disk to client, here a host called client2) The client mounts the disk from the VIP. Here's client2:/etc/fstab (client mounts disk from machine with an /etc/hosts entry of VIP=lvs). The client makes requests to VIP:nfs. The director must forward these packets to the realservers. Here's the conf file for the director.

services must be setup for forwarding type The services must be set up to listen on the correct IP. With telnet, this is easy (telnetd listens on 0.0.0.0 under inetd), but most other services need to be configured to listen on a particular IP. For LVS-NAT, the packets will arrive with dst_addr=RIP, i.e. the service will be listening to (and replying from) the RIP of the realserver. When the realserver replies, the name of the machine returned will be that of the realserver (the RIP), but the src_addr will be rewritten by the director to be the VIP. If the name of the realserver is part of its service (as with name based http) then the client will associate the VIP with this name. The realserver then will need to associate the RIP with this name. You could put an entry for the RIP into /etc/hosts linking it to this name. With LVS-DR and LVS-Tun the packets will arrive with dst_addr=VIP, i.e. the service will be listening to (and replying from) an IP which is NOT the IP of the realserver. Configuring the httpd to listen to the RIP rather than the VIP is a common cause of problems for people setting up http/https. All realservers will need to think that they have the hostname associated with the VIP. In both cases, in production, you will need to make the name of the machine given by the realserver be the name associated with the VIP. Note: if the realserver is Linux 2.4 and is accepting packets by transparent proxy, then see the section on transparent proxy for the IP the service should listen on.
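As a memory aid, the rule for which address a service on the realserver should bind to can be written as a small lookup. The function name and the example addresses are hypothetical, and the sketch ignores the transparent proxy case mentioned in the note.

```python
def service_listen_ip(forwarding, vip, rip):
    """Return the IP a realserver's service should listen on.

    LVS-NAT: packets arrive with dst_addr=RIP, so listen on the RIP.
    LVS-DR / LVS-Tun: packets arrive with dst_addr=VIP, so listen on the VIP.
    (Realservers accepting packets by transparent proxy are not covered here.)
    """
    method = forwarding.strip().upper()
    if method == "LVS-NAT":
        return rip
    if method in ("LVS-DR", "LVS-TUN"):
        return vip
    raise ValueError(f"unknown forwarding method: {forwarding!r}")


# Hypothetical addresses for illustration only.
VIP, RIP = "192.168.1.110", "10.1.1.2"
for method in ("LVS-NAT", "LVS-DR", "LVS-Tun"):
    print(method, "->", service_listen_ip(method, VIP, RIP))
```

A service that listens on 0.0.0.0, like telnetd under inetd, works in every case because it accepts connections on any local address.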

Realservers present the same content: Synchronising (filesharing) content and config files, backing up realservers Realservers should have identical files/content for any particular service (since the client can be connected to any of them). This is not a problem for slowly changing sites (e.g. ftp servers), where the changes can be made by hand, but sites serving webpages have to be changed daily or hourly. Some semi-automated method is needed to stage the content in a place where it is reviewed and then moved to the realservers. For a database LVS, the changes have to be propagated in seconds. In an e-commerce site you have to either keep the client on the same machine when they transfer from http to https (using persistence), which may be difficult if they do some comparative shopping or have lunch in the middle of filling their shopping cart, or propagate the information (e.g. a cookie) to the other realservers. Here are comments from the mailing list. Wensong

If you just have two servers, it might be easy to use rsync to synchronize the backup server, and put the rsync job in the crontab on the primary. See http://rsync.samba.org/ for rsync. If you have a big cluster, you might be interested in Coda, a fault-tolerant distributed filesystem. See the Coda website for more information.
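A minimal sketch of the two-server rsync job, assuming rsync runs over ssh from the primary. The hostname, directories and 15 minute schedule are made up for the example; the helper only builds the command and the crontab line, it does not run anything.

```python
import shlex


def rsync_command(src_dir, backup_host, dest_dir):
    """Build an rsync-over-ssh command mirroring src_dir to the backup server."""
    return [
        "rsync", "-a", "--delete", "-e", "ssh",
        src_dir.rstrip("/") + "/",            # trailing slash: copy the contents
        f"{backup_host}:{dest_dir.rstrip('/')}/",
    ]


def crontab_line(minutes, command):
    """Format a crontab entry that runs the command every `minutes` minutes."""
    return f"*/{minutes} * * * * {shlex.join(command)}"


# Hypothetical host and paths.
cmd = rsync_command("/var/www/htdocs", "backup.example.com", "/var/www/htdocs")
print(crontab_line(15, cmd))
```

Note that --delete removes files on the backup that no longer exist on the primary, so the copy is a mirror rather than a backup you can roll back to.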

Joe

from comments on the mailing list, Coda now (Aug 2001) seems to be a usable project. I don't know what has happened to the sister project Intermezzo. May 2004. It appears that development has stopped on both Coda and Intermezzo. I think the problem was too difficult. Jan 2006. Coda appears to be back in development.

J Saunders 27 Sep 1999

I plan to start a frequently updated web site (potentially every minute or so).

Baeseman, Cliff Cliff (dot) Baeseman (at) greenheck (dot) com

I use mirror to do this. I created a ftp point on the director. All nodes run against the director ftp directory and update the local webs. It runs very fast and very solid. upload to a single point and the web propagates across the nodes.

Paul Baker pbaker (at) where2getit (dot) com 23 Jul 2001 (and another posting on 18 Jul 2002 announcing v0.9.2.2)

PFARS Project on SourceForge I have just finished committing the latest revisions to the PFARS project CVS up on SourceForge. PFARS, pronounced 'farce', is the "PFARS For Automatic Replication System." PFARS is currently used to handle server replication for Where2GetIt.com's LVS cluster. It has been in the production environment for over 2 months so we are pretty confident of the code's stability. We decided to open source this program under the GPL to give back to the community that provided us with so many other great FREE tools that we could not run our business without (especially LVS). It is written in Perl and uses rsync over SSH to replicate server file systems. It also natively supports Debian package replication. Although the current version number is 0.8.1 it's not quite ready for release yet. It is seriously lacking in documentation and there is no installation procedure yet. Also in the future we would like to add support for RPM based linux distros, many more replication stages, and support for restarting server processes when certain files are updated. If anyone would like to contribute to this project in any way, do not be afraid to email me directly or join the development mailing list at pfars-devel@lists.sourceforge.net. Please visit the project page at http://sourceforge.net/projects/pfars/ and check it out. You will need to check it out from CVS as there are no files released yet. Any feedback will be greatly appreciated.

Joe (May 2004): the last code entry for pfars was Sep 2002. The project appears to have stopped development. Zachariah Mully

I am having a debate with one of my software developers about how to most efficiently sync content between realservers in an LVS system. The situation is this... Our content production software that we'll be putting into active use soon will enable our marketing folks to insert the advertising into our newsletters without the tech and launch teams getting involved (this isn't necessarily a good thing, but I'm willing to remain open minded ;). This will require that the images they put into newsletters be synced between all the webservers... The problem though is that the web/app servers running this software are load-balanced so I'll never know which server the images are being copied to. Obviously loading the images into the database backend and then out to the servers would be one method, but the software guys are convinced that there is a way to do it with rsync. I've looked over the documentation for rsync and I don't see anyway to set up cron jobs on the servers to run an rsync job that will look at the other servers content, compare it and then either upload or download content to that server. Perhaps I am missing an obvious way of doing this, so can anyone give me some advice as to the best way of pursuing this problem?

Bjoern Metzdorf bm (at) turtle-entertainment (dot) de 19 Jul 2001

You have at least 3 possibilities:
You let them upload to all RIPs (uploading to each realserver).
You let them upload to a testserver, and after some checks you use rsync to put the images onto the RIPs.
You let them upload to one defined RIP instead of the VIP and rsync from there (no need for a testserver).

Stuart Fox stuart (at) fotango (dot) com 19 Jul 2001

nfs mount one directory for upload and serve the images from there. Write a small perl/bash script to monitor both upload directories remotely then rsync the differences when detected.

Don Hinshaw dwh (at) openrecording (dot) com 19 Jul 2001

You can use rsync, rsync over ssh or scp. You can also use partition syncing with a network/distributed filesystem such as Coda or OpenAFS or DRBD (DRBD is still too experimental for me). Such a setup creates partitions which are mirrored in real-time. I.e., changes to one reflect on them all. We use a common NFS share on a RAID array. In our particular setup, users connect to a "staging" server and make changes to the content on the RAID. As soon as they do this, the real-servers are immediately serving the changed content. The staging server will accept FTP uploads from authenticated users, but none of the real-servers will accept any FTP uploads. No content is kept locally on the real-servers so they never need be synced, except for config changes like adding a new vhost to Apache. jik (at) foxinternet (dot) net 19 Jul 2001
If you put the conf directory on the NFS mount along with htdocs then you only need to edit one file, then ssh to each server and "apachectl graceful"
Don Hinshaw dwh (at) openrecording (dot) com 20 Jul 2001
Um, no. We're serving a lot of: <VirtualHost x.x.x.x> and the IP is different for each machine. In fact the conf files for all the real-servers are stored in an NFS mounted dir. We have a script that manages the separate configs for each real-server.
I'm currently building a cluster for a client that uses a pair of NFS servers which will use OpenAFS to keep synced, then use linux-ha to make sure that one of them is always available. One thing to note about such a system is that the synced partitions are not "backups" of each other. This is really a "meme" (way of thinking about something). The distinction is simply that you cannot rollback changes made to a synced filesystem (because the change is made to them both), whereby with a backup you can rollback. So, if a user deletes a file, you must reload from backup. I mention this because many people that I've talked to think that if you have a synced filesystem, then you have backups. What I'm wondering is why you would want to do this at all. From your description, your marketing people are creating newsletters with embedded advertising. If they are embedding a call for a banner (called a creative in adspeak) then normally that call would grab the creative/clickthrough link from the ad server not the web servers. For tracking the results of the advertising, this is a superior solution. Any decent ad server will have an interface that the marketing dept. can access without touching the servers at all.

Marc Grimme grimme (at) comoonics (dot) atix (dot) de 20 Jul 2001

Depending on how much data you need to sync, you could think about using a Cluster Filesystem, so that any node in the LVS cluster can concurrently access the same physical data. Have a look at GFS. We have a clustered webserver with 3 to 5 nodes with GFS underneath and it works pretty stably. If you are sure which server the latest data is uploaded to, rsync is no problem. If not, I would consider using a Network or Cluster Filesystem. That should save a lot of scripting work and is more storage efficient.

jgomez (at) autocity (dot) com

We are using rsync as a server. We are using a function that uploads the contents to the server and syncs the uploaded file to the other servers. The list of servers we have to sync is kept in a file. When a file is uploaded, the server reads this file and syncs to all the other servers.
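The fan-out step of such a scheme might look like the following sketch. The server list format (one hostname per line, with # comments), the rsync daemon module name "web" and the paths are all assumptions for illustration, since the original listing is not shown; the code only builds the rsync commands.

```python
def parse_server_list(text):
    """Parse a server list: one hostname per line, '#' starts a comment."""
    servers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            servers.append(line)
    return servers


def sync_commands(relative_path, docroot, servers, local_host, module="web"):
    """Build one rsync command per remote server for a freshly uploaded file.

    Uses rsync daemon syntax (host::module/path). The module name is a
    hypothetical entry in each server's rsyncd.conf, and the target
    directory is assumed to exist already on the receiving side.
    """
    source = f"{docroot.rstrip('/')}/{relative_path}"
    return [
        ["rsync", "-a", source, f"{host}::{module}/{relative_path}"]
        for host in servers
        if host != local_host  # do not sync to ourselves
    ]


server_file = "web1\nweb2  # this machine\n\nweb3\n"
servers = parse_server_list(server_file)
for cmd in sync_commands("images/ad.png", "/var/www/htdocs", servers, "web2"):
    print(" ".join(cmd))
```

Each realserver runs the same code with its own hostname as local_host, so it does not matter which server the load balancer sent the upload to.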

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 May 2002

Working machines have local disks for qmail queue and /etc /boot which are EXT3. Working data (/home, /usr, /shared, /webspace) lives on a Network Appliance Netfiler. I really can't say enough about the NetApps they are simply awesome. You pay a chunk of money but it is money well spent.

Andres Tello Abrego C.A.K." criptos (at) aullox (dot) com 06 Sep 2002

Using the KISS principle: the username and password collection must be centralized for control. There is only one place where you update, change and remove passwords; then, with a little help from scp, the trick is done. Just copy your password collection file over a secure connection and you are in sync. We even developed a web-based "cluster" admin app. The principle of operation was that one server is the "first one", which then uses small programs triggered by an ssh execution command or attached to a port using the inetd super server, and you are done.

"Matthew S. Crocker" matthew (at) crocker (dot) com 07 Sep 2002

Instead of using NIS, or NIS+ I use LDAP for all my customer information records. I store, Radius, Qmail, DHCP, DNS, and Apache Virtual Host information in my LDAP server. We have a couple LDAP slaves and have all servers query the LDAP servers for info. Radius, Qmail are real time, everything else is updated via a script. To replicate NIS functions in LDAP check out www.padl.com. They have a schema and migration tool set

Jerker Nyberg jerker (at) update (dot) uu (dot) se 08 Sep 2002

I used to take information from the customer database and store shadow/passwd/groups/httpd.conf/aliases/virtusertable/etc in two high-availability MySQL databases (on the same machines that run LVS), then every 30 minutes or as appropriate generate the files on the realservers. The "source of all information" for us was the customer database (also MySQL), which we can modify with our own customized python/GTK clients or which the customers can modify (indirectly) via a web interface. One of the ideas with this was to move the focus from what is stored on the servers to what is in the customer database. In that way it is easy to inactivate accounts if customers don't pay their fees etc. If the real-servers go down, they can all be reinstalled with a kickstart installation including the scripts that generate the configuration files. I found it easier with "pull" instead of "push" for the configuration files. Local files (with databases "db" instead of linear files in /etc/nsswitch.conf - this began to make a difference with more than 10k users) in my experience always seemed to be faster than any networked nameservices (LDAP, NIS etc) even if you use nscd to cache as much as you can.
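The "pull" approach, where each realserver regenerates its files from the central records, could be sketched as below. The record fields and the "active" flag are hypothetical stand-ins for rows fetched from the customer database; the output follows the standard seven-field passwd format.

```python
def render_passwd(accounts):
    """Render passwd-format lines from account records, skipping inactive ones."""
    lines = []
    for acct in sorted(accounts, key=lambda a: a["uid"]):
        if not acct.get("active", True):
            continue  # e.g. customer has not paid: leave the account out
        lines.append(
            "{name}:x:{uid}:{gid}:{gecos}:{home}:{shell}".format(**acct)
        )
    return "".join(line + "\n" for line in lines)


# Hypothetical rows, as they might come back from the customer database.
accounts = [
    {"name": "bob", "uid": 1002, "gid": 100, "gecos": "Bob",
     "home": "/home/bob", "shell": "/bin/false", "active": False},
    {"name": "alice", "uid": 1001, "gid": 100, "gecos": "Alice",
     "home": "/home/alice", "shell": "/bin/bash", "active": True},
]
print(render_passwd(accounts), end="")
```

Because the file is derived entirely from the database, a reinstalled realserver only needs the generating script to get back to the current state.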

James Ogley james.ogley (at) pinnacle (dot) co (dot) uk 14 Aug 2002

We have an internal 'staging' server that our web designers upload content to. A shell script we run as a daemon then rsyncs the content across the cluster members. In addition, we have an externally facing FTP server that external customers upload their content to. The above mentioned shell script rsyncs that content to itself as part of its operation.

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 Aug 2002

We use keepalived for the cluster/LVS monitoring/management. We use SCP to move the keepalived.conf file around to all the servers. For content it is all stored on a Netfiler F720. The next upgrade will replace the F720 with a cluster of Netfilers (F85C ??). The realservers NFS mount the content (Webdirs, Maildir) Put new content on the NFS server, every machine sees it We run SMTP,POP3,IMAP. We'll be adding HTTP, HTTPS and FTP in the next few weeks. Our LVS is LVS-NAT, 2 directors, 4 realservers, Cisco 3548 switch.

Do you know of any new technology or propriety solutions that need an open source implementation?

I would like to see a Netfiler type appliance open sourced. I know I can go with a linux box but I'm just not sure about the performance. I want a fiber channel based NFS server with complete journaling, 30 second reboots, failover and snapshots. I think a linux box with EXT3 or ReiserFS comes close but you don't get snapshots and I'm not sure how the failover would work.

Doug Schasteen wrote: What does the keepalived vrrp do exactly? What are you using MySQL for? Because if you are running scripts or web programs, then don't you need to specify an IP in your connection strings? I'm just wondering how that works, because if all your connection strings are set to a certain IP and then that IP goes down, how does it know to fail over to the second machine? The only thing I can think of is your second machine takes over that IP somehow.

keepalived is an awesome tool, bundled with VRRP it allows for machine failover. Basically you set it up like this. Machine A: Machine B: The MySQL Master is set up to replicate with the MySQL Slave (192.168.1.20). The SQL client apps connect to MySQL on 192.168.1.20. The IP address 192.168.1.20 will only exist on the machine which keepalived determines to be the active MASTER. If something causes that machine to crash or if the backup machine stops receiving VRRP announcements from the master it will enable the IP address and send out arps for the IP. The clients will connect to the same IP address all the time. That IP address can be on either machine. I use keepalived to fail over my LVS servers but it can be used to fail over any group of machines.
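How long the backup waits before taking over the IP can be estimated from the VRRP specification. The sketch below uses the VRRPv2 (RFC 3768) timer formulas; it is a model of the protocol's timing, not of keepalived's code, and the priority and timestamps are example values.

```python
def master_down_interval(priority, advert_interval=1.0):
    """Seconds a VRRPv2 backup waits without advertisements (RFC 3768).

    Master_Down_Interval = 3 * Advertisement_Interval + Skew_Time,
    where Skew_Time = (256 - Priority) / 256.
    """
    if not 1 <= priority <= 254:
        raise ValueError("backup priority must be between 1 and 254")
    skew_time = (256 - priority) / 256
    return 3 * advert_interval + skew_time


def backup_should_take_over(now, last_advert_seen, priority, advert_interval=1.0):
    """True once the master has been silent longer than the down interval."""
    silent_for = now - last_advert_seen
    return silent_for > master_down_interval(priority, advert_interval)


print(master_down_interval(100))                 # 3.609375
print(backup_should_take_over(11.0, 10.0, 100))  # False: master seen 1s ago
print(backup_should_take_over(14.0, 10.0, 100))  # True: silent for 4s
```

With the default one second advertisement interval, a backup takes over roughly three to four seconds after the master goes silent; a higher priority backup has a smaller skew and so wins the race against lower priority ones.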

I was planning on tackling this issue by writing (rewriting) all of my scripts/programs to include one file that does the mysql connections. Then I only have to change one file when I want to change where my mysql connections go. And then maybe I'll add a failover connection inside of that include file, like an "if the first connection didn't work, try the backup server". The problem with that is that if for some odd reason the first connection doesn't work (perhaps I rebooted the machine), it will put them on the backup server and updates will be made to the backup server. Any updates made to the backup mysql server while I'm rebooting the main mysql server will probably be lost. I can maybe add two-way replication for when something like this happens (but not use two-way replication all the time, because I've heard that has problems.)

Have the slave server dump transaction logs so you can manually replicate the data back over when you recover the master server.
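The "try the first server, then the backup" include file discussed above might be sketched like this at the TCP level. The function name is hypothetical and a real version would hand the chosen address to the MySQL client library; as the poster notes, any writes that land on the backup while the main server is down still have to be carried back afterwards.

```python
import socket


def connect_with_failover(servers, timeout=2.0):
    """Try each (host, port) in order; return (socket, address) for the first
    one that accepts a TCP connection. Raises the last error if all fail."""
    last_error = None
    for address in servers:
        try:
            conn = socket.create_connection(address, timeout=timeout)
            return conn, address
        except OSError as err:
            last_error = err  # refused, unreachable or timed out: try the next
    if last_error is None:
        raise ValueError("no servers given")
    raise last_error
```

A caller would pass the main server first and the backup second, for example connect_with_failover([("db1.example.com", 3306), ("db2.example.com", 3306)]) with hypothetical hostnames. Moving a service IP with keepalived avoids this client-side logic altogether, because the clients always connect to the same address.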

Ramon Kagan rkagan (at) YorkU (dot) ca 14 Aug 2002

We use lvs with keepalived for High Availability. All our servers are identical in setup, and use NFS to a cluster of Netapp filers (two F840s). Our setup uses LVS-DR: since we push very close to the 100Mbit/s range, NAT seemed to have too much overhead. Services that we run are web (http, https), web mail and mail delivery. Pop and imap are soon to be added to this list. New content is put onto the filers, thus all nfs clients pick it up immediately. For our MySQL setup I have a single "MySQL" machine. I set up MySQL to listen on the designated port and have set up strict rules in MySQL for authentication and access (see the mysql.user and mysql.db tables). For redundancy I have a second machine running as a replication slave against the MySQL machine. I'm using keepalived's vrrp framework to force failover when problems arise (hasn't happened yet, knocking on wood really hard). I have tested this in a development environment and it seems to work nicely. I found that with both machines running 100 Full Duplex, our MySQL server can complete over 10,000 write transactions per second and the latency between databases is on average 0.0019 seconds (yes, under 2 thousandths of a second!). I will admit that these are pretty strong machines (Dual P3 1.4 with 2 GB memory, 100% SCSI based), but I have seen similar performance on a P3 600 with 512 MB memory, IDE based, still 100 Full Duplex though.

VRRP - virtual router redundancy protocol. Keepalived is software written by Alexandre Cassen. What it can do is:
1. Health checking of realservers, allowing removal of unresponsive ones from an lvs table (auto control of the lvs table).
2. Heartbeat between two lvs boxes so that if one fails the other takes over.
So, with these two you can create a High Availability (HA) cluster. Using "2" only, you can set up a service, like MySQL replication, and run just the heartbeat (in this case the VRRP framework) without the health checking or LVS framework.
Then if the master node goes down the slave node can run a script to convert the slave database into a master database. So, if you have a master system, say dbmast.ex.com, and a slave dbslav.ex.com, you create a service IP db.ex.com. All clients talk through db.ex.com. On startup, dbmast.ex.com arps out gratuitously that it is db.ex.com. On failure, dbslav.ex.com would arp out to take over the service IP. This way, clients need not know of any changes. Go to www.keepalived.org for more info. If you have any further questions, there is a mailing list, and I have helped in the past with the documentation.

nick garratt nick-lvs (at) wordwork (dot) co (dot) za 14 Aug 2002

transferring of content is currently done via tar over ssh: (beware path length!) it's useful for transferring entire hierarchies, preserving perms, symlinks and whatnot, but we'll probably migrate to rsync. before new content is deployed it is transferred to an intermediate deploy server from the dev server where it is thoroughly tested/abused. none of these machines forms part of the cluster per se. the content is transferred to the remote access server in the cluster after testing. from this machine it is transferred to each of the web servers in turn using the same mechanism described above. any content which must be shared (rw) by all web servers (client templates, ftp dirs) is NFS served. content is database driven dynamic content (apart from obvious exceptions) providing both web site facilities and an http/s get/post and xml api.
cluster manager : lvs using fwmark nat for (public) http/s, ftp, smtp, dns and fwmark dr (internal clustered services) for http. mon.
www1 - wwwn : php, apache. all run msession and an ftpd although only one is IT at any point
db : pair of mysql servers in master/slave.
loghost : responsible for misc services (ntp master for cluster, syslogd - remote logging, tftpd for log archiving, log analysis ...)
remote : dns for the world (also fwmark) and secondary nameserver for cluster, ssh

Neulinger, Nathan nneul (at) umr (dot) edu 26 Apr 2004

We use AFS/OpenAFS as our backend storage for all regular user and web data. (Mail and databases are separate.) About 3 TB total capacity, of which about 1.9 used. We have 3500+ clients, many of which are user desktops - thus unsuitable for nfs. NFS is suitable for tightly controlled server clusters, but not really for export to clients that may or may not be friendly.

Joe 26 Apr 2004

If it's a readonly site, then have only static pages on the realservers and rsync them from a staging machine (which may have dynamic html). If you need randomly different content from dynamic pages, generate different versions of them every few minutes and push different ones to each realserver. You want the httpd to be fetching as much from the disk cache as possible.

John Reuning john (at) metalab (dot) unc (dot) edu 26 Apr 2004

Squid might be a good solution for caching static pages on real servers. For php caching, you can use turck mmcache. It works well most of the time, but is sometimes flaky with poorly-written php code. We actually do what Joe suggested, manually rsync'ing content to a cache directory on the realservers. Alias directives are added to map the cached content to the proper url location. Our shared storage is nfs, and we, too, have lots of user-maintained files. However, we've targeted directories that don't change often (theme and layout directories for CMS applications, for example). Rsync is efficient, and we run it every hour or so. There's one performance problem that's not solved by this, though. Apache performs lots of stat() calls when serving pages (checking for .htaccess files). The stat calls are made before the content is served and go to the nfs servers. Under a high traffic load, the stat calls bog down our nfs servers despite the content being cached on the real servers.
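To see why the stat() traffic adds up, here is a sketch that lists the .htaccess files Apache looks for when serving a single file, assuming AllowOverride permits .htaccess along the whole path. The function name and the example path are hypothetical.

```python
from pathlib import PurePosixPath


def htaccess_candidates(file_path):
    """List the .htaccess files checked for one served file, root first.

    Assumes .htaccess processing is enabled for every directory on the path.
    """
    directory = PurePosixPath(file_path).parent
    chain = [directory, *directory.parents]  # deepest directory first
    return [str(d / ".htaccess") for d in reversed(chain)]


for path in htaccess_candidates("/www/htdocs/example/index.html"):
    print(path)
# /.htaccess
# /www/.htaccess
# /www/htdocs/.htaccess
# /www/htdocs/example/.htaccess
```

Every one of these lookups that lands on an NFS mounted directory is a round trip to the NFS server, even when the page content itself comes from a local cache. Setting AllowOverride None for directories that don't need .htaccess files stops Apache from looking for them there.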

Graeme Fowler keepalived (at) graemef (dot) net 28 May 2004 There are loads of ways to synchronise realserver content, or the whole filesystem in case of realserver disk failure:
Have a "management station" which can do pubkey SSH logins to the managed machines, run scripts, push software and so on.
Create a disk image of a server you're happy with and have it network boot using syslinux/pxelinux.
Utilise HP's open source OpenSSI project (an OSS cluster management system). I've found the OpenSSI concepts to be hugely useful in theory, if not in practice.
Use a "virtual" machine system such as UML. You can then keep a "dumb" system running the virtual machine, make changes to the image offline, copy it to the "dumb" system and reboot the virtual machine instead.
J. Simonetti jeroens (at) office (dot) netland (dot) nl 28 May 2004 I also found systemimager (http://www.systemimager.org/) myself which sounds promising as well.

cfengine for synchronising files

cfengine is designed to control and propagate config files to large numbers of machines. Presumably it could be used to synchronise realserver content files as well.

Has anyone successfully rolled out a cluster of realservers using coda? (The main reason would be the replication of config files (Apache/qmail) across all realservers.) Is this doable? Or am I better off using rsync?

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no 27 Jun 2003 I recommend using rdist or cfengine which are designed to distribute configuration files, not to make exact copies of a directory (structure). Both rdist and cfengine are more configurable than rsync.

File Systems for (really big) Clusters: Lustre, Panasas

People have discussed CODA as a filesystem for synchronisation. In clusters, NFS has a lot of overhead and failed mounts result in hung systems. Here's a posting from the beowulf mailing list.

Bari Ari bari (at) onelabs (dot) com 26 Sep 2002

NFS is dead for clusters. We are targeting three possible systems, each having a different set of advantages:

- panasas (http://www.panasas.com)
- lustre (http://www.lustre.org)
- v9fs (http://v9fs.sourceforge.net)

File Systems for Clusters: Samba waits for a commit and is slow, NFS fills buffers and is fast

(from the TriLUG mailing list)

John Broome

I have a RH 9 machine that is acting as a fileserver for a completely windows network (98 and 2000). The users mentioned that the file transfers seemed slow. Some testing showed that samba was moving data much slower than NFS. NFS was using pretty much the entire speed potential of the network, whereas SMB was about half that, or less. There is no indication on the server that CPU, HDD, or memory is the problem. When tested off site with different hardware and a different OS (Ubuntu 5.04), the same problem popped up - SMB dragging along, NFS cranking. Since this is a mostly windows network we can't really use NFS instead of samba.

Tanner Lovelace clubjuggler (at) gmail (dot) com 06/20/2005

A quick search for "samba tuning" brings up this quote from http://www.oreilly.com/catalog/samba/chapter/book/appb_01.html

"If you run Samba on a well-tuned NFS server, Samba may perform rather badly."

If you follow the link at the bottom of the page (http://www.oreilly.com/catalog/samba/chapter/book/appb_02.html) it has suggestions for things in samba to tune.

Jason Tower jason (at) cerient (dot) net

I was testing transfer speed using my t42 with ubuntu. File transfers using nfs would occur at wire speed (12.5 MB/s) while the exact same file transferred using smb (mounted with smbmount) would only be about 5.5 MB/s. However, when I booted into windows on the t42 I could copy the file (with smb of course) at nearly wire speed. So it seems that at least part of the perceived problem has something to do with the smb *client*, not the server. cpu utilization and iowait were not even close to being a bottleneck, so I'm not sure where the slowdown is occurring or why.

Jon Carnes jonc (at) nc (dot) rr (dot) com 06/20/2005

I wrote this up for TriLUG about 5 years ago...
We tested various forms of file transfer including NFS and Samba and - if I remember correctly - we found Samba (version 3) to be about 1/3 the speed of NFS (version 2). The problem was that the Samba process waited for a commit before negotiating for the next data transfer whereas NFS filled a buffer and continuously pushed that buffer out. Obviously if you're running from a buffer out of RAM you can run at network speeds (or as fast as your internal bus and cpu can go). Microsoft's implementation of SMB pumps the data to be moved into a buffer and works similarly, so it's almost as fast as using NFS (though it does some other weirdness that always makes it a bit slower than NFS...) NFS v3 had a toggle that also defaulted to waiting for a commit from the remote hard drive before sending more data - that moved files around just slightly faster than Samba (it crawled.)
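The difference Jon describes (wait for a commit after every block, versus keep pushing from a buffer) can be illustrated with a toy throughput model. This is only a sketch: the block size and commit latency below are assumed numbers chosen for illustration, not measurements of Samba or NFS, and the model ignores protocol overhead entirely.

```python
def stop_and_wait_throughput(block_bytes: float, link_bytes_per_s: float,
                             commit_latency_s: float) -> float:
    """Bytes/s when every block must be confirmed as committed before the next is sent."""
    time_per_block = block_bytes / link_bytes_per_s + commit_latency_s
    return block_bytes / time_per_block


def buffered_throughput(link_bytes_per_s: float) -> float:
    """Bytes/s when the sender keeps the link full from a RAM buffer."""
    return link_bytes_per_s


if __name__ == "__main__":
    link = 12.5e6        # wire speed quoted above for a 100Mbit link
    block = 64 * 1024    # assumed transfer block size
    commit = 0.005       # assumed 5 ms wait for the remote commit
    waiting = stop_and_wait_throughput(block, link, commit)
    print(f"wait-for-commit: {waiting / 1e6:.2f} MB/s")
    print(f"buffered:        {buffered_throughput(link) / 1e6:.2f} MB/s")
```

The point of the model is only that any per-block wait is paid on every block, so the sender can never reach wire speed, while a sender that streams from a buffer is limited by the link alone.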

Discussion on distributed filesystems

This was a long thread in which everyone discussed their knowledge of the matter. I've added subsequent postings at the end.

If clients are reading from the realservers (e.g. webpages), then it's simple to have the same content on each machine and push content, say, once a day. If clients are writing to disks, you have a different problem: propagating the writes to all machines. In this case you may want an (expensive) central fileserver (look for NetApp elsewhere in this HOWTO for happy users, e.g. NetApp for mailservers).

Graham D. Purcocks

What, if any, Distributed Filesystems have any LVS users tried/use with any success?

Joe

We hear little about distributed file systems with LVS. Intermezzo is supposed to be the successor to CODA, but we don't hear much more here on the LVS mailing list about Intermezzo than we do about CODA. Distributed file systems are a subject of great interest to others (e.g. beowulfs) and you will probably find better info elsewhere.

The simple (to set up) distributed filesystems (e.g. PVFS) are unreliable, i.e. if one machine dies, you lose the whole file system. This is not a problem for beowulfs, since the calculation has to be restarted if a compute node dies, and jobs can be checkpointed.

Reliable distributed filesystems require some effort to set up. GFS looks like it would take months and much money to set up. A talk at OLS_2003 described Lustre (http://www.lustre.com), a file system for 1024 node clusters that is in deployment. It sounds simpler to set up than GFS, but I expect it will still be work. Lustre expects the layer of hardware (disks) below it to be reliable (all disks are RAID).

Rather than depending on the filesystem to distribute state/content, for an LVS where clients write infrequently (if at all), state/content can be maintained on a failover pair of machines which push content to the realservers.

Graham Purcocks grahamp (at) wsieurope (dot) com 04 Nov 2003

My thoughts are:

- NFS is fine if the content is not changing often. As it is a single point of failure, you have to have a backup and do all the failover and synchronizing stuff mentioned in other emails.
- With a distributed file system, this is not the case and it 'should' sort itself out as servers go offline. This system is needed if you have dynamic content which changes often; in that case rsync will not cut it.

John Barrett jbarrett (at) qx (dot) net 04 Nov 2003

nfs directory naming has not been an issue for me in the least -- I always mount nfs volumes as /nfs[n] with subdirs in each volume for specific data -- then symlink from where the data should be to the nfs volume -- same thing you will have to do with coda -- in either case the key is planning your nfs/coda setup in advance so that you don't have issues with directory layouts, by keeping the mountpoints and symlinks consistent across all the machines.

I'm not currently doing replicated database -- I'm relying on raid5+hotswap and frequent database backups to another server using mysqldump+bacula. Based on my reading, mysql+nfs is not a very good idea in any case -- mysql has file locking support, but it really slows things down because locking granularity is at the file level (or was the last time I looked into it -- over a year ago -- please check if there have been any improvements).

Based on my current understanding of the art with mysql, your best bet is to use mysql replication and either

- have a failover server used primarily for read only until the read/write server fails (if you need additional query capacity) (ldirectord solution), or
- do strict failover (heartbeat solution), only one server active at a time, both writing replication logs, with the inactive reading the logs from the active whenever both machines are up (some jumping through hoops needed to get mysql to start up both ways -- possibly different my.cnf files based on which server is active).

With either setup, the worst case scenario is one machine goes down, then the other goes down some period of time later after getting many updates, then the out of sync server comes up first without all those updates. (Interesting thought just now -- using coda to store the replication logs and replicating the coda volume on both the database servers and a 3rd server for additional protection; the 3rd server may or may not run mysql, your choice if you want to do a "tell me 3 times" setup -- then you just have to keep a file flag that tells which server was most recently active, then any server becoming active can easily check if it needs to merge the replication logs -- but we are going way beyond anything I have ever actually done here, pure theory based on old research.) In either case you are going to have to very carefully test that your replication config recovers gracefully from failover/failback scenarios.

My current cluster design concept that I'm planning to build next week might give you some ideas on where to go (everything running ultramonkey kernels, in my case, on RH9, with software raid1 ide boot drives, and 3ware raid5 for main data storage):

- M 1 -- midrange system with large ide raid on 3ware controller, raid5+hotswap, coda SCM, bacula network backup software, configured to backup directly to disk for first stage, then 2nd stage backup the bacula volumes on disk to tape (allows for fast restores from disk, and protects the raid array in case of catastrophic failure)
- M 2 and 3 -- heartbeat+ldirectord load balancers in LVS-DR mode -- anything that needs load balancing, web, mysql, etc, goes through here (if you are doing mysql with a read/write + multiple read only servers, the read only access goes through here, write access goes direct to the mysql server and you will need heartbeat on those servers to control possession of the mysql write ip, and of course your scripts will need to open separate mysql connections for reading and writing)
- M 4 -- mysql server, ide raid + hotswap -- I'm not doing replication, but we already discussed that one :)
- then as many web/application servers as you need to do the job :) each one is also a replica coda server for the webspace volume, giving replicated protection of the webspace and accessibility even if the scm and all the other webservers are down -- you may want multiple dedicated coda servers if your webspace volume is large enough that having a replicate on each webserver would be prohibitively expensive

If you use replicated mysql servers, this script may provide a starting point for a much simplified LVS-DR implementation: http://dlabs.4t2.com/pro/dexter/dexter-info.html -- the script is designed to layer additional functionality on top of a single mysql server, but the quick glance that I took shows it could be extended to route queries to multiple servers based on the type of query, i.e. read requests to the local mysql instance, write requests to a mysql instance on another server.

- set up 1 mysql master server; it will not be part of the mysql cluster, its sole task is to handle insert/update and replicate those requests to the slave servers.
- set up any number of mysql replicated slaves -- they should bind to a VIP on the "lo" interface, and the kernel should have the hidden ip patch (ultramonkey kernels for instance)
- modify dexter to intercept insert/update requests and redirect them to the master server (will mean keeping 2 mysql sessions open -- one to the master, one to the local slave instance) -- if the master isn't there, fail the update -- install the modified script on all the slave servers
- set up the VIP on an ldirectord box and add all the slaves as targets -- since mysql connections can be long running, I suggest using LeastConnection balancing

Now clients can connect to the VIP, the slaves handle all read accesses, and the master server handles all writes.

The only possible issue with this setup is allowing for propagation delays after insert/update -- i.e. you won't be able to read back the new data instantly at the slaves -- it may take a second or 2 before the slaves have the data -- your code can loop querying the database to see if the update has completed, if absolutely necessary.

You still have a single point of failure for database updates, but your database is always backed up on the slaves, and because there is only one point of update, it's very difficult for the slaves to get out of sync, and read access is very unlikely to fail -- also you have none of the problems with database locking, as NFS is still not recommended based on the lists that I scanned to get up to date on the issues.

Lastly -- the master CAN be a read server if you wish (my setup above assumes it is not) given that the update load is not so much that the master gets overloaded -- if you have high update frequency, then let the slaves handle all the reads, and the master only updates.

Ariyo Nugroho ariyo (at) ebdesk (dot) com 31 Oct 2003

After successfully setting up LVS-NAT for telnet, http, and ftp services, I'm now going to do it with databases. The HOWTO says that implementing such a configuration needs a distributed filesystem. There are so many names mentioned in the HOWTO that I'm confused. The first name I noticed from the LVS page is Coda. But then I found that Peter Braam has stopped working in the coda team. He initiated another one, Intermezzo. Many articles state that this new filesystem is much more promising than Coda. Unfortunately, I can't find out whether Intermezzo has become stable or not. So, does anyone have experience with any distributed filesystems? Which one would you recommend?

Joe

The HOWTO says that you need some way of distributing the writes to all the realservers. This can be done at the application level or at the filesystem level. At the time we first considered running databases on LVS, neither distributed filesystems nor replication was easily available on Linux. Pushing writes would have to be done via a DBI/DBD interface or similar. I expected that distributed filesystems were just around the corner (intermezzo, CODA), but they never arrived. In the meantime mysql has implemented replication, and pushing the writes now seems best done at the application level.

Ratz in his postings to the mailing list has shown how most problems that involve maintaining state across the realservers can and should be solved at the application level. It only seems reasonable to approach the database write problem the same way.

If someone comes up with a bulletproof, easy to maintain, reliable distributed filesystem, then all of this will be thrown out the window and we'll all go to distributed filesystems. However, with the effort that's been put into distributed filesystems and the lack of progress that's been made, relative to the progress from modifying the application, I think it's going to be a while before anyone has another go at distributed filesystems for Linux.

I'm sure the HOWTO says that LVS databases can use distributed filesystems. However LVS'ed databases don't require (==need) distributed filesystems.

Karl Kopper karl (at) gardengrown (dot) org 31 Oct 2003

Did I mention I have had a good experience with NFS? In my experience NFS is rock solid on the newer 2.4 Linux kernels. What lets me say this is the fact that since late February the company where I work has been using Linux NFS clients in an LVS cluster as the main business system without a single problem.
This is a system that processes half a billion dollars a year in orders and prints hundreds of documents (warehouse pick sheets for example) every day (using the rock-solid LPRng system btw). NFS has not failed once. The existing in-house order processing applications did not have to be rewritten to run on the cluster over NFS because they use fcntl locking. This cluster is all in one data center. I would sooner quit my job than be forced into implementing a file system for a mission critical application that had to do lock operations to guarantee data integrity over a WAN. I don't care how good the file system code is. Joe

This went over my head. Are you using locking with nfs or not?

We are doing locking with NFS (the fcntl calls from the application cause the Linux kernel to use statd/lockd to talk to the NAS server). I just wouldn't do locking over a WAN. I'm just talking about doing any type of lock operation over a WAN (for example an NFS client that connects to the NAS box over the WAN). Not really distributed content stored on direct attached disk on each node (though that would have to be even worse for locking over a WAN). Joe

I missed the WAN bit in the original posting. So you were referring to some sort of file system (distributed?, e.g. AFS) spread over a WAN? I had forgotten that some distributed file systems aren't local only. Let's see if we understand each other.

Where you're coming from: You don't like file locking onto a box on another network over possibly non-dedicated links. You're happy with NFS because you have the disks local (in some arrangement I don't know about yet) on a network that others can't get to.

Where I'm coming from: I think of distributed file systems as being used on machines like beowulfs (or an LVS) where the disks are on machines on a separate and dedicated network that is not accessible to clients (all jobs are submitted to a master node, and the client never sees the filesystem behind it). The people running the beowulf have complete control over the file system(s) and network behind the master node.

Karl

The problem is a lack of a consistent definition of the term "distributed file system." Here is the configuration I'm referring to:

(diagram: realservers LVS RS 2 and LVS RS 3 each point to a single box labelled "NAS Server (NFS Server)")

Each LVS RS has /var, /usr, etc. on a local disk drive, but shared data is placed on the NFS-mounted file system. (They are just ordinary NFS clients to the NAS box.) Lock arbitration is performed by the NAS box.

I think the terms "distributed file system" and "cluster file system" suffer from the same problem of vagueness of definition. A while back Alan Cox had this to say about the term cluster file system (CFS):

It seems to mean about three different things

1. A clusterwide view of the file store implemented by any unspecified means - i.e. an application view point.
2. A filesystem which supports operation of a cluster
3. A filesystem with multiple systems accessing a single shared file system on shared storage

Meaning #3 can be really confusing because a 'cluster file system' in that sense is actually exactly what you don't want for many cluster setups, especially those with little active shared storage. For example if you are doing database failover you can do I/O fencing and mount/umount of a more normal fs.

John Barrett jbarrett (at) qx (dot) net 01 Nov 2003

I've just recently installed Coda, and must say that I'm less impressed with it compared to NFS for a number of reasons. There is a server setup I will be doing in a few weeks where Coda will be the only choice, so don't think that I'm being completely against Coda; I just feel the range of areas where it is usable is fairly narrow.
NFS is already in the linux kernel, as is Coda, but the in-kernel Coda is an older version (5.3.xx vs current 6.0.xx; make sure your user-space code is right for your kernel module).

NFS -- setting up the client and server is a no-brainer; webmin handles both if you don't want to get your keyboard dirty, and even if you hand-edit, it's no problem. No such luck with Coda.

Coda's main strength is replicated servers. But you can do the same with NFS if you are willing to accept some delay before changes propagate to the other NFS servers (i.e. rsyncing the nfs servers every so often, using heartbeat failover to bring the backup NFS server online as needed) -- if your files on disc are frequently changing and replicated servers must be in sync, Coda is the better choice.

Coda's main weakness IMHO is the hoops you have to jump through when you make changes to the server system -- you have to kill and restart the client daemon on each client machine to get changes to take. (Adding a new server or replicate, creating or deleting volumes, etc -- experiment a lot on non-production systems, get the production setup right the 1st time.)

Coda does much more in the way of local caching than NFS, and the cache size is configurable... make the cache as large as the distributed filespace and it is possible to continue to operate if all the servers are down, and any changes will be committed when one or more of the servers come back online (presuming all the files needed are in the cache).

Coda does not use system uid/gid for its files -- it maintains its own user/pass database, and you must login/acquire a ticket before accessing Coda volumes -- NFS runs off the existing uid/gid system; all you have to do is keep the passwd/group files on all the machines in sync for key users, or set up NIS+.

In closing, I feel that I had a bad experience with Coda, but I won't hesitate to try again when I have more time to dig into the detail. I was under a lot of time pressure on this latest job, so I went with NFS just to get the system online NOW :)

Todd Lyons tlyons (at) ivenue (dot) com 16 Feb 2006

Here are some questions to ask to compare which avenue you would like to take for a cluster filesystem:

1. In a cluster filesystem, how many real servers can go down at one time and leave the virtual raid array still operational?
2. In a NAS box, how many drives can die simultaneously and the system is still operational?
3. In a cluster filesystem, what happens if half of the switch ports just all of a sudden die, or the whole switch, or some corruption happens in your switch and several ports suddenly segment themselves into their own VLAN?
4. For a NAS box, ask #3. (Hint: a good NAS like the FAS270c with twin heads will do a complete takeover if necessary, so 3 of the 4 ethernet ports can lose connectivity. As long as one is still connected, that one head can serve the load of both heads.)
5. In a NAS box, does it have multiple power supplies? Multiple ethernet ports? Multiple parity drives? Multiple spare drives?
6. How much money do you have available to spend?
7. Are you counting the amount of time it will take you to keep a cluster filesystem running and stable as compared to the relatively troublefree NAS boxes (assuming you don't undersize it)?
8. In terms of complete disaster recovery:
   - How long does it take to backup a complete cluster fs?
   - How long does it take to backup a NAS?
   - How long would it take to rebuild and restore a cluster if half of the machines died all at once? If all of the machines died at once?
   - How long would it take to rebuild and restore a NAS box if half of the drives died all at once? If all of the drives died at once? If both power supplies died at the same time?

If you think that you'll never see any of these situations, you might be lucky and never will, but my general experience with computer hardware is that things run very smoothly for a very long time, and then something hiccups hard and takes a bit of work to recover from. I've also heard mention that "Murphy was an optimist." :-)

If you can't tell, I'm of the opinion that a NAS will do more for you than a cluster filesystem, but keep in mind that's also my comfort zone. If I was daily into the inner workings of a cluster fs production system, I might feel differently.

load balancing and scheduling based on the content of the packet: Cookies, URL, file requested, session headers Mar 2002: questions on these topics have come up in the context of , , or cookie based routing (see ). I've tried to collect it all here. Make sure you read these other sections if you are implementing cookies or URL rewriting. LVS being an L4 switch does not have access to the content of packets, only the packet headers. So LVS doesn't know the URL inside the packet. Even if it did, LVS would need an understanding of the http protocol to inspect cookies or rewrite URL headers. Often people think that an L7 Switch is the answer here. However an L7 switch is slow and expensive. If you have what looks to be an L7 problem, you should see if there is a solution at the L4 level first (see ). The short answer is that you can't use LVS to load balance based on the content of the packet. In the case of http, there are other tools which understand the content of http packets and you can use those.

Forwarding an httpd request based on file name (mod_proxy, mod_rewrite) Sean, 25 Dec 2000

I need to forward requests using the Direct Routing method to a server. However, I determine which server to send the request to depending on the file requested in the HTTP GET, not based on its load.

Michael E Brown

Use LVS to balance the load among several servers set up to reverse proxy your realservers; set up the proxy servers to load balance to realservers based upon content.

Atif Ghaffar atif (at) 4unet (dot) net

On the LVS servers you can run apache with mod_proxy compiled in, then redirect traffic with it using ProxyPass. Or you can use mod_rewrite; in that case, your realservers should be reachable from the net. There is also a transparent proxy module for apache.

Yan Chan ychan (at) ureach (dot) com 19 Aug 2003

My ipvs is set up right and everything. I set port 80 on the VIP to forward to my real web servers. I then set port 90 to forward to another web server with different stuff in it. My problem is that when I access the page on port 80, www.abc.com, the web page shows fine. To access the page on port 90, I have to type www.abc.com:90. As you can see, it doesn't look elegant. Is there a way to change it so that www.abc.com/ipvs is equal to www.abc.com:90, like the rewrite rule in apache? I tried using apache on the load balancer, but it doesn't seem to work.

Stephen Walker swalker (at) walkertek (dot) com 19 Aug 2003 This is how I set up my reverse proxy in apache: The ProxyPass rule says everything that goes to /perl is forwarded to port 8080, the Rewrite rules take care of scripts ending in .pl or .cgi. Obviously you need to have the rewrite and proxy modules running on your apache server. Randy Paries 4 Feb 2004

I want to loadbalance based on file names. I want

Dave Lowenstein

You could try apache's mod_rewrite.

Horms

Actually I think mod_proxy would probably be the right choice, though this could be combined with mod_rewrite. Actually, it would be pretty trivial, methinks.

Viperman Aug 23, 2003

I'm successfully using LVS with a reverse proxy configuration in apache, everything working just fine. I just faced a problem, when a user is trying to read the client IP address using PHP with the $_SERVER['REMOTE_ADDR'] variable. My RIP is showing up in place of the CIP.

Horms horms (at) verge (dot) net (dot) au 24 Aug 2003 I believe that when proxies are involved you need to check variables other than REMOTE_ADDR, which will generally be the IP address of the proxy, as this is acting as an end-user of sorts. See http://www.php.net/getenv In any case the problem should not be caused by LVS as it does not change the source IP address nor the HTTP headers (or body).
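As an illustration of "checking variables other than REMOTE_ADDR", here is a sketch in Python of the usual approach, assuming the reverse proxy adds an X-Forwarded-For request header and assuming a WSGI-style environment dictionary (the PHP equivalent would read the corresponding $_SERVER entry). The proxy address in TRUSTED_PROXIES is hypothetical. The header is only honoured when the connection really comes from a known proxy, since clients can forge it.

```python
from typing import Iterable, Mapping

# Hypothetical address of the reverse proxy sitting in front of this server.
TRUSTED_PROXIES = frozenset({"192.168.1.10"})


def client_ip(environ: Mapping[str, str],
              trusted: Iterable[str] = TRUSTED_PROXIES) -> str:
    """Return the best guess at the original client IP (the CIP)."""
    trusted = set(trusted)
    remote = environ.get("REMOTE_ADDR", "")
    forwarded = environ.get("HTTP_X_FORWARDED_FOR", "")
    if remote in trusted and forwarded:
        # The header is a comma-separated chain, oldest hop first. Walk it
        # from the right and return the first hop that is not one of ours.
        hops = [hop.strip() for hop in forwarded.split(",") if hop.strip()]
        for hop in reversed(hops):
            if hop not in trusted:
                return hop
    # Direct connection, or nothing usable in the header.
    return remote


if __name__ == "__main__":
    env = {"REMOTE_ADDR": "192.168.1.10", "HTTP_X_FORWARDED_FOR": "203.0.113.7"}
    print(client_ip(env))
```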

Rewriting URL headers alex (at) short (dot) net

We have 4 distinct sites, all being virtual hosted and load balanced by a single VIP. NameVirtualHost in apache picks the right content dependent on the host header. This works great for those distinct sites and their corresponding domains. Problem is that we now have 25 domains to point to distinct site 1 and 25 domains pointing to distinct site 2. Right now the nameserver entries are www CNAME www.domain.com for all of those domains, i.e. distinct sites. I want

    www.a.com
    www.a2.com -> www.a.com
    etc
    .
    .
    www.b1.com -> www.b.com

I'd rather not fill my httpd conf with all these domains. I was either hoping that LVS can do some host header modifications or I'll have to make 4 VIPs and have each site use a distinct external ip.

Jacob Coby jcoby (at) listingbook (dot) com 19 Feb 2003

LVS sits too low to handle munging the HTTP headers. Check out ServerAlias (http://httpd.apache.org/docs-2.0/mod/core.html#serveralias); I think it will do pretty much exactly what you want it to do with a minimum of fuss. It's available in 1.3.x and 2.x; I only link to the 2.0 docs since they look better. You could even stick the aliases in another file (a_aliases.conf or whatever) and include that into the VirtualHost section of Site A. The only real "problem" with this setup is that you have to bump apache (you can use 'apachectl graceful' if you don't have SSL) every time you add a new alias or aliases.

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no Thu, 20 Feb 2003 18:07:17 +0100

If the hosts have almost identical configuration (in apache) you can use dynamic virtual hosting.
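Jacob's idea of keeping the aliases in a separate include file lends itself to generating that file from a plain list of domains. The sketch below is one hypothetical way to do it: the domain names are made up, and the file name a_aliases.conf is simply the example name from the posting above.

```python
from typing import Iterable


def alias_include(domains: Iterable[str]) -> str:
    """Return the text of an include file with one ServerAlias line per domain."""
    # sorted(set(...)) drops duplicates and keeps the output stable between
    # runs, so a diff of the generated file shows only real changes.
    return "".join(f"ServerAlias {domain}\n" for domain in sorted(set(domains)))


if __name__ == "__main__":
    site_a = ["www.a2.com", "www.a1.com", "www.a2.com"]  # hypothetical domains
    # Printing only; a real script would write this to a_aliases.conf and
    # then run 'apachectl graceful' as described above.
    print(alias_include(site_a), end="")
```

The generated file would then be pulled into the VirtualHost section of Site A with an Include directive, leaving the main httpd configuration untouched when aliases are added.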

URL parsing unknown Is there any way to do URL parsing for http requests (ie send cgi-bin requests to one server group, static to another group?)

John Cronin jsc3 (at) havoc (dot) gtf (dot) org 13 Dec 2000 Probably the best way to do this is to do it in the html code itself; make all the cgis hrefs to cgi.yourdomain.com. Similarly, you can make images hrefs to image.yourdomain.com. You then set these up as additional virtual servers, in addition to your www virtual server. That is going to be a lot easier than parsing URLs; this is how they have done it at some of the places I have done consulting for; some of those places were using Extreme Networks load balancers, or Resonate, or something like that, using dozens of Sun and Linux servers, in multiple hosting facilities.

Horms

What you are after is a layer-7 switch, that is, something that can inspect HTTP packets and make decisions based on that information. You can use squid to do this; there are other options. A post was made to this list about doing this a while back. Try hunting through the archives. LVS on the other hand is a layer-4 switch; the only information that it has available to it is IP address, port and protocol (TCP/IP or UDP/IP). It cannot inspect the data segment and see, or even understand, that the request is an HTTP request, let alone that the URL requested is /cgi-bin or whatever. There has been talk of doing this, but to be honest it is a different problem to that which LVS solves, and arguably should live in user space rather than kernel space, as a _lot_ more processing is required.
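To make the layer-4 versus layer-7 distinction concrete, here is a sketch of the decision a layer-7 device (e.g. a reverse proxy) has to make. It must have received and parsed the HTTP request line before it can choose a server group, which is exactly the information LVS never looks at. The pool names and addresses are hypothetical, and this is not how squid or any particular product implements it.

```python
from typing import Dict, List

# Hypothetical server groups keyed by URL prefix.
POOLS: Dict[str, List[str]] = {
    "/cgi-bin/": ["app1:8080", "app2:8080"],
    "/": ["static1:80", "static2:80"],
}


def request_path(raw_request: bytes) -> str:
    """Extract the request target from the first line of an HTTP/1.x request."""
    request_line = raw_request.split(b"\r\n", 1)[0].decode("latin-1")
    _method, target, _version = request_line.split(" ")
    return target


def choose_pool(path: str, pools: Dict[str, List[str]] = POOLS) -> List[str]:
    """Pick the server group whose prefix is the longest match for path."""
    prefix = max((p for p in pools if path.startswith(p)), key=len)
    return pools[prefix]


if __name__ == "__main__":
    raw = b"GET /cgi-bin/search.pl HTTP/1.0\r\nHost: www.example.com\r\n\r\n"
    print(choose_pool(request_path(raw)))
```

A layer-4 scheduler, by contrast, has to commit to a realserver using only addresses, ports and protocol, before any of these bytes have arrived.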

session headers Torvald Baade Bringsvor Dec 05, 2002

We have a setup with two reverse proxies, two frontends and two backend application servers. Usually we just use LVS to switch between the proxies, and then establish a direct mapping from each of the reverse proxies onto an application server. But now I am wondering if it is possible to use LVS to switch between the two frontends and the two backends as well, in other words to cluster the frontends and backends. Regular persistence doesn't work here, because (as far as the backends are concerned) all the traffic comes from just two addresses. It would be really nice to be able to inspect the session ID (which is contained in the http header of the requests) and route the request based on that. But is it possible? Has anybody done this?

Horms 10 Mar 2003 Unfortunately LVS does not have access to the session header so this information cannot be used for load balancing.

Scheduling by packet content

Horms 07 May 2004

It would be nice to be able to use KTCPVS-like schedulers that make use of L7 information inside LVS, but there are major problems. In terms of TCP the problem is that LVS wants to schedule the connection when the first packet arrives. However, to get L7 information you need the three-way handshake to have completed. I guess this would be possible if LVS itself handled the three-way handshake, and then buffered up the packets in the established state until enough L7 information had been collected to schedule a given connection. But I suspect this really would be quite painful.

Timeouts for TCP/UDP connections to services

Sometime in 2008: LVS, as originally written, would time out connections between client and server, independently of whether the connection itself wanted to be timed out. The timeout was short enough that people setting up a new LVS would find their sessions disconnected, without knowing why. Expect that sometime soon the timeouts will be changed to match those in netfilter. This is part of an off-line discussion on why the timeouts shouldn't be set to infinity.

Graeme Fowler graeme (at) graemef (dot) net 23 Dec 2008

Although it is (theoretically, at least, and I'll have to check the TCP RFC for this) possible to write an app which does the three-way handshake and then holds the connection open without exchanging any further packets for a long time (for some value of "long"), in the case of ip_vs this could result in resource starvation on highly loaded systems.

ip_vs is essentially a man in the middle (albeit a nice one) which has some knowledge about the transactions in progress at TCP level. If we set the timeout to 0/infinity, then under some conditions (say broken networks, BGP peerings dropping, or - ooh, topical - multiple undersea fibres being severed) the director will be left with a table stuffed full of "tracked" connections (I use the term carefully, noting the similarity to netfilter's conntrack modules) which never go away unless a FIN/RST turns up. If that/those packet/s never arrive, the director will have an ever-increasing number of connections in the table.

Under circumstances where there's a high turnover of connections (100,000/sec, for example, which is deliberately high for illustrative purposes) even 0.0001% of connections getting into that state would add up. Assuming a well-managed, not-interfered-with director, that would give us 3153600 "stalled" connections in a year of uptime.
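The figure of 3153600 follows directly from the numbers in the posting; a short worked calculation, using exact fractions to avoid floating point rounding and assuming a 365 day year:

```python
from fractions import Fraction

SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 for a non-leap year


def stalled_connections(conn_per_sec: int, stalled_percent: Fraction,
                        seconds: int = SECONDS_PER_YEAR) -> Fraction:
    """Connections left in the table if stalled_percent of them never close."""
    return conn_per_sec * (stalled_percent / 100) * seconds


if __name__ == "__main__":
    # 100,000 connections/sec, of which 0.0001% (1/10000 of a percent) stall:
    # that is 0.1 stalled connections per second.
    total = stalled_connections(100_000, Fraction(1, 10_000))
    print(int(total))  # 3153600
```

With no timeout, none of these entries is ever reclaimed, so the table grows by roughly one entry every ten seconds for as long as the director stays up.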
OK, so that may be far-fetched for some people *but* it's perfectly plausible in terms of embedded systems. And embedded systems often don't have much RAM - and that's the killer factor here. Low RAM means little space for the hash table, and if the hash table fills up we stop routing (I presume, unless it FIFOs entries out or does some other non-time-related scavenging). I see Horms commented similarly, but without numbers... now let me look at RFC 793 and what it says...

"The timeout, if present, permits the caller to set up a timeout for all data submitted to TCP. If data is not successfully delivered to the destination within the timeout period, the TCP will abort the connection. The present global default is five minutes."

It doesn't specify upper or lower bounds, so it looks like infinity is (technically) possible. Note however that lots of firewall devices, NAT boxes and so on will drop them anyway - Cisco PIX and ASA, Checkpoint devices have a 24 hour default; netfilter's conntrack modules have a default 5 day TCP session timeout for established connections. For more netfilter goodness, see: /proc/sys/net/netfilter/ There are lots of sysctl goodies in there! Why not, as ip_vs is linked so closely to netfilter, make use of the same sysctls? Horms I'm not sure how it would work in practice - perhaps some people would want to tune LVS and netfilter separately - but at the very least we could use the same default. Feb 2003: The timeout information is now in man(8) ipvsadm.

2.2 kernels Julian, 28 Jun 00 LVS uses the default timeouts for idle connections set by MASQ for (EST/FIN/UDP) of 15,2,5 mins (set in /usr/src/linux/include/net/ip_masq.h). These values are fine for ftp or http, but if you have people sitting on an LVS telnet connection, they won't like the 15min timeout. You can't read these timeout values, but you can set them with ipchains. The format is Wensong Aug 2002

for 2.4 kernels

example: sets the timeout for established connections to 10hrs. The value "0" leaves the current timeout unchanged, in this case FIN and UDP.
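As a hedged illustration of the 10 hour example for 2.4 kernels: the ipvsadm(8) man page documents `--set tcp tcpfin udp`, with all three values in seconds and 0 meaning "leave unchanged". The Python helper names below are hypothetical; the sketch only builds the argument list and leaves running it (as root, on the director) to the caller.

```python
import subprocess


def ipvsadm_set_timeouts(tcp=0, tcpfin=0, udp=0):
    """Build the argv for `ipvsadm --set tcp tcpfin udp`.

    All values are in seconds; 0 leaves that timeout unchanged.
    """
    for value in (tcp, tcpfin, udp):
        if not isinstance(value, int) or value < 0:
            raise ValueError("timeouts must be non-negative integers (seconds)")
    return ["ipvsadm", "--set", str(tcp), str(tcpfin), str(udp)]


def apply_timeouts(tcp=0, tcpfin=0, udp=0):
    """Run the command. Needs root and ipvsadm installed on the director."""
    subprocess.run(ipvsadm_set_timeouts(tcp, tcpfin, udp), check=True)


# 10 hours for established TCP connections; FIN and UDP timeouts untouched.
print(" ".join(ipvsadm_set_timeouts(tcp=10 * 60 * 60)))
```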

2.4 kernels Julian Anastasov ja (at) ssi (dot) bg 31 Aug 2001 The timeout is set by ipvsadm. Although this feature has been around for a while, it didn't work till ipvsadm 1.20 (possibly 1.19) and ipvs-0.9.4 (thanks to Brent Cook for finding this bug). The default timeout is 15 min for the LVS connections in established (EST) state. For any NAT-ed client connections, ask iptables. To set the tcp timeout to 10hrs, while leaving tcpfin and udp timeouts unchanged, do Brent Cook busterb (at) mail (dot) utexas (dot) edu 31 Aug 2001

I found the relevant code in the kernel to modify this behavior in 2.4 kernels without using ipchains. I got this info from http://www.cs.princeton.edu/~jns/security/iptables/iptables_conntrack.html In /usr/src/linux/net/ipv4/netfilter/ip_conntrack_proto_tcp.c , change TCP_CONNTRACK_TIME_WAIT to however long you need to wait before a tcp connection timeout. Does anyone foresee a problem with other tcp connections as a result of this? A regular tcp program will probably close the connection anyway. In the general case you cannot change the settings at the client. If you have access to the client, you can arrange for the client to send keepalive packets often enough to reset the timer above and keep the connection open.

Kyle Sparger ksparger (at) dialtone (dot) com 5 Oct 2001 You can address this from the client side by reducing the tcp keepalive transmission intervals. Under Linux, reduce it to 5 minutes: echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time where '300' is the number of seconds. I find this useful in all manner of situations where the OS times out connections.
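The /proc setting is system-wide and only affects sockets that have asked for keepalives. Where you control the client program, the same effect can be requested per socket. The sketch below assumes a Linux client (the TCP_KEEPIDLE family of options is not available everywhere, hence the guard); the function name and the interval and count defaults are illustrative.

```python
import socket


def enable_keepalive(sock, idle=300, interval=60, count=5):
    """Ask the kernel to send keepalive probes on an otherwise idle socket.

    idle:     seconds of silence before the first probe (300 = 5 minutes)
    interval: seconds between probes
    count:    unanswered probes before the connection is dropped
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Per-socket overrides of the tcp_keepalive_* sysctls (Linux).
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, count)
    return sock
```

The idle time should be comfortably shorter than the director's timeout for established connections, so that each probe resets the timer before the entry expires.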

udp flush bug in early 2.6 kernels Ashish Jain

The load balancer works fine the way I expected except for one thing: Sometimes after a heavy load surge (500 UDP packets per sec. on the same connection), the output of ipvsadm -l -c shows active UDP connections even if there is no traffic and I stop sending any UDP packets. In ipvsadm, the default UDP connection timeout is set to 300 sec (I did not change that. I did not even use the persistent flag). I expect these connections to go away after 300 seconds. But after the 300 sec timer expires, it gets reset to 60 sec and these active UDP connections stay forever. What could be the reason? I have noticed from the ip_vs code (ip_vs_conn.c) that there are 2 functions implemented to expire a connection: ip_vs_conn_expire (This one resets the timer to 60*HZ if the reference count for this connection is greater than 1 or there is an error deleting the connection from the hash table) ip_vs_conn_expire_now (Deletes the connection immediately) The function called after the timer expires is ip_vs_conn_expire and not the second one. Why is this so?

Horms 27 Oct 2005 This is a bug, I believe it was fixed in 2.6.13.4

How can I turn on debugging for ip_vs?

You need to enable IP_VS_DEBUG at compile time, and then fiddle the debug proc value in /proc/sys/net/ipv4/vs at run time.
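A small sketch of the run-time half of this. It assumes the knob is the `debug_level` file in that directory (the name used in the kernel's IPVS sysctl documentation) and that it is present only on kernels built with IP_VS_DEBUG. The helper names are made up for illustration; the path argument lets you try the code against an ordinary file first.

```python
from pathlib import Path

# Assumed location; absent unless the kernel was built with IP_VS_DEBUG.
DEBUG_LEVEL = Path("/proc/sys/net/ipv4/vs/debug_level")


def get_debug_level(path=DEBUG_LEVEL):
    """Return the current debug level, or None if the file does not exist."""
    try:
        return int(Path(path).read_text().strip())
    except FileNotFoundError:
        return None


def set_debug_level(level, path=DEBUG_LEVEL):
    """Write a new debug level (needs root when pointed at /proc)."""
    if level < 0:
        raise ValueError("debug level must be >= 0")
    Path(path).write_text(f"{int(level)}\n")
```

If `get_debug_level()` returns None on the director, the running kernel was most likely built without IP_VS_DEBUG.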

name resolution on realservers: running name resolution friendly daemons on realservers Unless the realserver is in a LVS, it is just sending packets from the VIP to the CIP and doesn't need name resolution. The realservers however run services other than just the LVS services, e.g. smtp to mail logs and cron output. The smtpd should only need /etc/hosts to send mail locally. I upgraded from sendmail to postfix on one of my realservers to find that I could no longer mail to or from the upgraded machine. The problem is that postfix (http://www.postfix.org/) requires DNS for name resolution, thus requiring a nameserver. Postfix couldn't deliver mail on my realserver, as there was no resolution for the private realserver names (which have private IPs). Thus to run postfix, you also need to run DNS. This adds the security complications of punching a hole in your filter and routing rules to get to port 53 on other machines. Postfix works fine on a machine delivering mail to users on the internet and where the hostname is publicly known, but doesn't work for machines on a private network with no DNS running. A little later, I found that you can turn off DNS lookups in postfix (see postfix faq, look for "resolv"). However this doesn't get you much - now you need to do everything via /etc/hosts. You can't use /etc/hosts for local machines and /etc/resolv.conf for the occasional machine that's not on the local network. With (the earlier versions of) sendmail, you can have a nameserver in /etc/resolv.conf where it will only be used for hosts not in /etc/hosts. You don't want to be running postfix on a realserver just for local mail. If you run sendmail only for local mail delivery, then you only need an /etc/hosts file. Jan 2004: just installed sendmail-8.12.10. It's been "improved" and now requires DNS for all addresses, including private addresses. It doesn't look at /etc/hosts. 
You can turn off DNS lookups by telling sendmail to look at /etc/service.switch (ignoring the already available /etc/host.conf and nsswitch.conf), where you can tell it to look at /etc/hosts but now it will not use DNS. This is a step backwards for sendmail. From the TriLUG mailing list: Tanner Lovelace clubjuggler (at) gmail (dot) com 26 Apr 2007

Postfix is a mail transport agent and therefore by design does not look up A records. Instead it looks up MX records. Note that /etc/hosts does not contain MX records, so it is therefore appropriate that postfix not look there. However, it is possible to make postfix look for both A records and use /etc/hosts. This postfix config line will make postfix use /etc/hosts: For more information about this see this URL: http://www.postfix-jp.info/origdocs/QandA-en.html#4.10

The facilities for name resolution on Linux are problematic. I had assumed that when an application asked for name resolution, that local facilities (libresolv.so?) handled the request (gethostbyname, gethostbyaddr?) using whatever resources were available (in /etc/nsswitch.conf /etc/host.conf) and the application accepted the result without knowing how the name was resolved (e.g. whether in /etc/hosts, NIS or DNS). This isn't what happens - the application has to do it all. If you watch ping (under strace, with the output captured in strace.out) deliver a packet to a remote machine whose name is not in /etc/hosts, you see ping access a series of files under /etc, in order, and then finally connect to the dns port (53) on the first nameserver in /etc/resolv.conf. It turns out there is no "resolving facility". The application has to work its way through all these files and then handle the name resolution itself. Ping appears to be using a "resolving facility", but only because it goes through all the files in /etc searching in the order you'd expect for a "resolving facility". Applications can ignore these files and do whatever they want. nslookup doesn't look at /etc/hosts, but goes straight to DNS. Postfix also connects directly to DNS. I'm a little disappointed to find that every application writer has to handle name resolution themselves. For postings on the topic where other people have been similarly disabused of their ignorance on name resolution, do a google search on /etc/hosts, /etc/host.conf, /etc/resolv.conf, gethostbyname, nslookup and throw in "postfix" for more info on the postfix part of the problem, e.g. the neohapsis archives for postfix and openbsd. For info on setting up /etc/hosts, /etc/nsswitch.conf, /etc/host.conf see network HOWTO (http://www.tldp.org/HOWTO/Net-HOWTO/). For an MTA on realservers with private RIPs, neither postfix nor sendmail is suitable (although you may have to use them). An older version of sendmail should be fine. 
From the TriLUG mailing list: Tanner Lovelace clubjuggler (at) gmail (dot) com 26 Apr 2007

nslookup is a tool that previously came with the name server (and is now deprecated in favor of dig, which is also a dns testing tool). nslookup was written specifically to test DNS resolution and therefore it is perfectly valid that it not check local files. You have to take the context of what the application is looking for. The /etc/hosts file only provides names and IP addresses. Postfix, by default, isn't looking for that. It's looking for MX records. The programs nslookup, dig, and host are all tools written to test and debug the DNS system. It would be wrong for them to look in /etc/hosts, since it is not part of the DNS system. For most applications, though, that only look for IP addresses (A records) or hostnames (PTR records), looking in /etc/hosts is appropriate, and in fact, this is what the gethostbyname and gethostbyaddr library calls do. Anything that uses the gethostbyname library call does follow what is in /etc/nsswitch.conf. For instance, with this line in nsswitch.conf If I run ping www.trilug.org and examine what files it opens I get this: Note that it does go to /etc/hosts first, as specified by nsswitch.conf. If I then change the line in nsswitch.conf to be this instead: and rerun the same test I get this: Note that it does not look in /etc/hosts. It isn't ping that's searching these files, it's gethostbyname in the C library calling into libnss_*. The libnss libraries are the resolver. I agree that there should be tools other than DNS debugging tools. Kevin suggested probably the best one: This will correctly use the linux name resolving functions and follow what has been set up in nsswitch.conf.
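To see the point about the C library resolver from a program's side, here is a minimal Python sketch. `socket.getaddrinfo` hands the lookup to the system resolver, so on a glibc system it follows the hosts: line in /etc/nsswitch.conf just as ping does, unlike nslookup or dig. The function name `resolve` is illustrative.

```python
import socket


def resolve(name):
    """Resolve a name the way ordinary applications do.

    The lookup goes through the C library (getaddrinfo), so /etc/hosts,
    DNS, NIS etc. are consulted in whatever order nsswitch.conf specifies.
    Returns a sorted list of address strings, or [] if the name is unknown.
    """
    try:
        infos = socket.getaddrinfo(name, None)
    except socket.gaierror:
        return []
    return sorted({info[4][0] for info in infos})


print(resolve("localhost"))
```

On a realserver with no DNS, private names listed in /etc/hosts should still resolve this way as long as nsswitch.conf includes "files" for hosts.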

TriLUG mailing list: jason (at) monsterjam (dot) org 26 Apr 2007 A very easy way to do what you want is to install dnsmasq. It will allow you to treat your /etc/hosts as dns entries to your server AND clients. I think the -b flag will do what you want.

Debugging new services At some stage you'll be trying to LVS a service that works just fine when you connect directly to the realserver, but doesn't work when you connect to the same service through the director. There is something about the service that you've taken for granted or may not even be aware of, and that assumption doesn't hold when the two-way tcpip connection is spread amongst 3 machines (adding the director). It will probably be because the service (1) uses multiple ports. The multiport problem can be solved by persistence to port 0. This isn't a particularly subtle approach, but will at least get your service working. (2) requires multiple rounds of tcpip connections. This can be solved by persistence to the service's port, when all connections will go to the same realserver. (3) writes to the realserver (see Filesystems for realserver content). (4) does something we don't know about, or you've just plain messed up. In this last case you'll need brute force. Run tcpdump on the client and server (i.e. without a director, and not using an LVS) to see the packet exchanges when the service is working. Then connect up the LVS (make sure that it's working by testing say telnet as an LVS'ed service), then run tcpdump on the client, director (both NICs if a 2 NIC director) and the realservers. This will be tedious. If you know the service's protocol (test with just the client and server, i.e. with no LVS), you can work your way through the connection with phatcat. For example sessions of phatcat with a 2 port protocol, see the section on .

"broken" services:servlets and j2ee Here, Ratz is replying to a poster about his problems LVS'ing a website with servlets. The servelets are writing content to different realservers from the same client. This is normally handled by persistence. Roberto Nibali ratz (at) tac (dot) ch 10 Jul 2003 You have a broken :) service which you would like to load balance with persistence. Joe

How do you handle broken services? How would you design the service not to be broken?

I don't know how I shall answer this question. I hope his service is not broken. But how do you understand his problem? It should be solved by setting up a port 0 service, right? Sometimes things have to be done in a more complex way. The only thing that they shouldn't do is migrate sessions within the realserver pool for CPU load sharing, as is not uncommon for servlet technology. If they do this, they need to have a common pool for allocated resources (and thus have the process migration overhead) which then again defeats the purpose of inter-node CPU load balancing in the first place. But we do not know what exactly he's trying to come up with. Another possibility would be to set up persistent fwmark pools consisting of a mark for incoming to service:80 and the same mark for incoming to service:defined_portrange. With the System.Properties in JDK you can set the port range which will be allocated and thus you can pretty much restrict the service fwmark pool. You then of course load balance the fwmark pool. I only told him to use port 0 because he doesn't know the application, so most likely he wouldn't know the dynamically opened ports of this application either and therefore would not be able to restrict it accordingly. Then again he could do something like:

Joao Clemente jpcl (at) rnl (dot) ist (dot) utl (dot) pt 24 Jun 2003 I've been talking with the developer of the j2ee app I'm trying to cluster, and I guess I'll have a hard time with this feature: After a user interaction with the web server, the user will download an applet. That applet will communicate with a service (on a well-known port) that was started on the server (at that time). So, I see this problem coming: No matter what persistence rules I set up here, as I have 2 different ports (80 and xxx) I see no way to say "when user interacts with server, set persistence rule for yyy time that maps user:80 to node1:80 AND ALSO user:whatever to node1:xxx" Besides that, I also have another question: That service that is listening on the server node will then give the connection to another instance, that will control the connection from there on (there is a pool of instances waiting to take over). Will lvs route those connections, that it doesn't even know of? I'm not sure, but this mechanism seems something similar to a passive-ftp connection... Maybe someone knows an lvs-friendly tip to make things work. Btw, this applet and the connection are used to allow server->browser communication without using http refresh/polling.

http logs, error logs The logs from the various realservers need to be merged. From the postings below, at least in the period 2001-3, using a common nfs filesystem doesn't work and no-one knows whether this is a locking problem fixable by NFS-v3.0 or not. The way to go seems to be mergelog. Emmanuel Anne emanne (at) absysteme (dot) fr ..the problem about the logs. Apparently the best is to have each web server process its log file on a local disk, and then to make stats on all the files for the same period... It can become quite complex to handle, is there not a way to have only one log file for all the servers? Joe - (this is quite old i.e. 2000 or older and hasn't been tested).

log to a common nfs mounted disk? I don't know whether you can have httpds running on separate machines writing to the same file. I notice (using truss on Solaris) that apache does write locking on files while it is running. Possibly it write-locks the log files. Normally multiple forked httpds are running. Presumably each of them writes to the log files and presumably each of them locks the log files for writing.

Webstream Technical Support mmusgrove (at) webstream (dot) net 18 May 2001 I've got 1 host and 2 realservers running apache (ver 1.3.12-25). The 2nd server NFS exports a directory called /logs. The 1st acts as a client and mounts that drive. I have apache on the 1st card saving the access_log file for each site into that directory as access1.log. The 2nd server saves it as access2.log in the same directory. Our stats program on another server looks for *.log files in that directory. The problem is that whenever I access a site (basically browse through all the pages of a site), the 2nd card adds the access info into the access2.log file and everything is fine. The 1st card saves it to the access1.log file for a few seconds, then all of a sudden the file size goes down to 0 and it's empty. Alois Treindl alois (at) astro (dot) ch

I am running a similar system, but with Linux 2.4.4 which has NFS version 3, which is supposed to have safe locking. Earlier NFS versions are said to have buggy file locking, and as Apache must lock the access_log for each entry, this might be the cause of your problem. I have chosen not to use a shared access_log between the realservers, i.e. not sharing it via NFS. I share the documents directory and a lot else via NFS between all realservers, but not the logfiles. I use remote syslog logging to collect all access logs on one server. On server w1, which holds the collective access_log and error_log, I have in /etc/syslog.conf the entry: on all other servers, I have an entry which sends the messages to w1: On all servers, I have in httpd.conf the entry: and the utility http_logger, which sends the log messages to w1, contains: ) { chomp; syslog($PRIORITY,$_); } closelog; I also do error_log logging to the same central server. This is even easier, because Apache allows you to configure in httpd.conf: On all realservers, except w1, these log entries are sent to w1 by the syslog.conf given above. I think it is superior to using NFS. The access_log entries of course contain additional fields in front of the Apache log lines, which originate from the syslogd daemon. It is also essential that the realservers are well synchronized, so that the log entries appear in correct timestamp sequence.
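Only the tail of the Perl http_logger survives above (read a line, chomp it, pass it to syslog). A rough Python equivalent of that loop, offered as a sketch rather than a reconstruction of the original script, might look like this. The ident "httpd" and the facility local0 are assumptions, and `forward` and `main` are illustrative names.

```python
import sys
import syslog


def forward(lines, send):
    """Strip the trailing newline from each log line and pass it to send().

    Returns the number of lines forwarded.
    """
    count = 0
    for line in lines:
        send(line.rstrip("\n"))
        count += 1
    return count


def main():
    # Ident and facility are assumptions; match them to your syslog.conf.
    syslog.openlog("httpd", syslog.LOG_NDELAY, syslog.LOG_LOCAL0)
    try:
        forward(sys.stdin, lambda msg: syslog.syslog(syslog.LOG_INFO, msg))
    finally:
        syslog.closelog()

# To use this as a piped logger, call main() at the end of the script and
# point an Apache piped-log directive at it.
```

The local syslogd then decides, from its own configuration, whether the lines are written locally or sent on to the central log host.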

I have a shared directory setup and both Real Servers have their own access_log files that are put into that directory (access1.log and access2.log...i do it this way so the Stats server can grab both files and only use 1 license), so i dont think its a file locking issue at all. Each apache server is writing to its own separate access log file, it's just that they happen to be in the same shared directory. How would httpd daemon on server A know to LOCK the access log from server B. Alois

Why do you think it is NOT a file locking problem? On each realserver, you have a lot of httpd daemons running, and to write into the same file without interfering, they will have to use file locking, to get exclusive access. On each server, you do not have just one httpd daemon, but many forked copies. All these processes on ONE server need to write to the SAME logfile. For this shared write access, they use file locking. If this file sits on an NFS server, and NFS file locking is buggy (which I only know as rumor, not as experience), then it might well be the cause of your problem. Why don't you keep your access_log local on each server, and rotate them frequently, to collect them on one server (merge-sorted by date/time), and then use your Stats server on it? If you use separate log files anyway, I cannot see the need to create them on NFS. Nothing prevents you from rotating them every 6 hours, and you will probably not need more current stats.

So the log files HAVE to be on a local disk or else one may run into such a problem as I am having now? Alois

I don't know. I have only read that NFS file locking before NFS 3.0 is broken. It is not a problem related to LVS. You may want to read http://httpd.apache.org/docs/mod/core.html#lockfile

Thanks but I've seen that before. Each server saves that lock file to its own local directory. Anyone have a quick and dirty script to merge-sort by date/time the combined apache logs? Martin Hierling mad (at) cc (dot) fh-lippe (dot) de

try mergelog

Alois

assuming that all files contain only entries from the same month, I think you can try:

Arnaud Brugnon arnaud (dot) brugnon (at) 24pmteam (dot) com

We successfully use mergelog (you can find it on freshmeat or SourceForge) for merging logs (gz or not) from our cluster nodes. We use a simple perl script for downloading them to a single machine.

Juri Haberland list-linux.lvs.users (at) spoiled (dot) org Jul 13 2001 I'm looking for a script that merges and sorts the access_log files of my three realservers running apache. The logs can be up to 500MB if combined. Michael Stiller ms (at) 2scale (dot) net Jul 13 2001

You want to look at mod_log_spread

Stuart Fox stuart (at) fotango (dot) com

cat one log to the end of the other, sort the result, then you can run webalizer on it. That's what I use, doesn't take more than about 30 seconds. If you can copy the logs from your realservers to another box and run sort there, it seems to be better. Heck, here's the whole (sanitized) script

    ${ROOT}/logs/$DATE.log"
    su - some_account -c "$SSH some_account@real.server2 \"cat /usr/local/apache/logs/access.log\" >> ${ROOT}/logs/$DATE.log"
    ##
    ## Second sort the contents in date order
    ##
    sort -t - -k 3 ${ROOT}/logs/$DATE.log > ${ROOT}/logs/access.log
    ##
    ## Third run webalizer on the sorted files
    ## Just set webalizer to dump the files in ${ROOT}/logs
    /usr/local/bin/webalizer -c /usr/stats/conf/webalizer.conf
    ##
    ## Fourth remove all the crud
    ## You still got the originals on the realservers
    find ${ROOT} -name "*.log"|xargs -r rm -v
    ##
    ## Fifth tar up all the files for transport to somewhere else
    cd ${ROOT}/logs && tar cfI ${DATE}.tar.bz2 *.png *.tab *.html && chown some_account.some_account ${DATE}.tar.bz2
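The cat-then-sort approach compares the date as a plain string, which is only safe while all entries fall within the same month. As an illustration of what tools like mergelog do instead (this is not mergelog's code), here is a sketch that parses the timestamp of each Apache common/combined log line and merges already-ordered per-server logs as a stream. The function names are made up, and the sketch assumes each input is itself in time order and that Python is running in its default C locale, so that month names like "Jul" parse.

```python
import heapq
import re
from datetime import datetime

# Timestamp field of the common/combined log format,
# e.g. [13/Jul/2001:10:00:00 +0200]
STAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def stamp(line):
    """Return the timezone-aware timestamp of one access_log line."""
    match = STAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*streams):
    """Lazily merge several time-ordered logs into one time-ordered stream.

    Each stream is any iterable of lines (an open file, a list, ...).
    Only one pending line per stream is held in memory.
    """
    return heapq.merge(*streams, key=stamp)
```

With open file objects as the streams, the merged lines can be written straight to the file that webalizer or awstats reads.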

Stuart Fox stuart (at) fotango (dot) com

Ok scrub my last post, i just tested mergelog. On a 2 x 400mb log it took 40 seconds, my script did it in 245 seconds.

Juri Haberland list-linux.lvs.users@spoiled.org Ok, thanks to you all very much! That was quick and successful :-) I tried mergelog, but I had some difficulties to compile it on Solaris 2.7 until I found that I was missing GNU make... But now: Happy happy, joy joy! karkoma abambala (at) genasys (dot) es

Another possibility... http://www.xach.com/multisort/

Stuart Fox stuart (at) fotango (dot) com

mergelog seems to be 33% faster than multisort using exactly the same file

Julien 7 Jan 2003

Does s/b know a way to merge apache error logs? Mergelog and Multisort only merge Access logs.

Jacob Coby 1 Dec 2003 Concatenate them into error_log.all, or to sort by date: cat error_log* | sort -r > error_log.all

ratz 01 Dec 2003 This will not sort entries by date. Imagine the following two (fictive, but syntactically and semantically correct) error_logs: Your pipeline does not sort them in a correct way (entries by date) at all. IMHO it's not so easy to script ;). To me the best solution is still to either write all error_logs into the same file or to configure httpd.conf in a way that the logs are sent via the syslog() interface. Then you use syslog-ng to do all the needed logic, data handling, merging, correlation and event triggering.

Jacob If you're trying to stitch error logs from separate sources, then yes, it becomes less trivial, and you're better off going with a scripting language to do the stitching.

Guy Waugh

Would this work to sort the error logs... (use -r if you want the order reversed) I don't understand why, but when I do this, it sorts on the fourth field as well (the time)...
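As an aside on the "scripting language" route suggested in this thread: a sketch in Python that sorts error_log lines on the parsed timestamp instead of relying on sort(1) and locale settings. It assumes the classic error_log layout where each line starts with a bracketed date such as [Mon Dec 01 23:59:59 2003], and that Python runs in its default C locale so English day and month names parse. Continuation lines without a timestamp are not handled. The function names are illustrative.

```python
import re
from datetime import datetime

# Leading "[Mon Dec 01 23:59:59 2003]" field of a classic error_log line.
STAMP = re.compile(r"^\[([^\]]+)\]")


def error_stamp(line):
    """Return the timestamp at the start of one error_log line."""
    match = STAMP.match(line)
    if match is None:
        raise ValueError("line does not start with a timestamp: %r" % line)
    return datetime.strptime(match.group(1), "%a %b %d %H:%M:%S %Y")


def sort_error_lines(lines):
    """Return the lines ordered by time (stable for equal timestamps)."""
    return sorted(lines, key=error_stamp)
```

A plain alphabetical sort would put "Mon" before "Sun" before "Tue" regardless of the actual dates, which is the kind of mis-ordering ratz describes.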

ratz 02 Dec 2003 It doesn't work with my sort or (more correctly) with my LC_TIME settings. If you want to sort with the '... -k xM ...' you need an appropriate LC_TIME entry or it will not work. A possible one is: But this must be handwaved according to locale(5) and then compiled with localedef(1). Lucky you, if you have a charmap which matches the apache log files output ;). Also read the info page on sort to see the difference between '-k 3n' and '-k 3,3n'.

laurie.baker (at) bt (dot) com 08 Jan 2003 Take a look at logresolvemerge.pl. While I currently only use it myself for access logs, I believe you can configure it for whatever you require.

Joe: the people on the Beowulf mailing list use Swatch to parse logs collected on a centralised logserver.

Mikkel Kruse Johnsen mkj (dot) its (at) cbs (dot) dk 19 May 2003 You can use spreadlogd for logging the activity, so that all your web frontends send their logs to one server. www.spread.org, http://www.backhand.org/mod_log_spread/. There are tools for migrating logs under: http://awstats.sourceforge.net/docs/awstats_tools.html

Joe Stump joe (at) joestump (dot) net 22 Dec 2005 I know of two ways to merge logs... (1) A centralized logging server (i.e. syslog). Have Apache log to that and then parse the logs from there. (2) Use rsync or scp to sync the logs to a central server, cat them into one larger file and then parse them from there. I'm currently doing #2, but plan on moving to #1 pretty soon. The reason for the change is that it's a lot more streamlined than my current setup and it keeps logs in a single location instead of N locations where N is the number of nodes. I currently use NFS for pretty much everything. All of my DocRoots and configuration files (apache/php) are on the NFS server. I then aggregate the logs onto the NFS server and compile them with awstats on another server (don't ask).
Dan Trainor dan (at) id-confirm (dot) com 22 Dec 2005 I have a few very high traffic sites, and I've found that it would sometimes take AWStats so long to read the logs in one pass that I set it up to rotate and parse logs up to six times a day. I would imagine that, with a heavy LVS setup with many realservers, you may face the same problem. Perhaps the awstats author at some point will create a tool which will merge all the gathered data into one single file or database. This way, the realservers could process their own logs, so you would not put all the load on one main processing server, and not face the kind of problem I mentioned.

Todd Lyons tlyons (at) ivenue (dot) com 22 Dec 2005 You tell syslog or syslog-ng to log to a remote network source instead of or in addition to a local file on each of the real servers, then on the central logging server configure it to listen for incoming network log info and tell it where to put it. Here's a syslog-ng master server config:

Tomas Ruprich xruprich (at) akela (dot) mendelu (dot) cz 22 Dec 2005 Well, I set this up about a month ago... In the apache configuration file on each application server I have this line: and then on the log server I have syslog-ng installed, with these configuration lines: awstats is a very good idea, I use it too. Here's the syslog configuration on the application servers. I think it's quite simple, but just for completeness... For /etc/syslog.conf:

Lemaire, Olivier olivier (dot) lemaire (at) siemens (dot) com 22 Dec 2005 Mergelog is your friend (http://mergelog.sourceforge.net/), after rsyncing your files to a larger server. A centralised logging server is probably overkill unless you need your logs up-to-date at the last second.
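The remote-syslog setups described by Todd Lyons and Tomas Ruprich send each log line over the network to one collector. The configuration lines themselves are missing from this copy, so the following is only a sketch of the sending side, using Python's standard `logging.handlers.SysLogHandler` over UDP. The facility local0, the logger name, and the helper name `remote_logger` are assumptions, and the collector must be configured to accept network syslog.

```python
import logging
import logging.handlers


def remote_logger(host, port=514, name="access"):
    """Return a logger whose records are sent to a syslog daemon over UDP."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    handler = logging.handlers.SysLogHandler(
        address=(host, port),
        # local0 is an assumption; use whatever the collector filters on.
        facility=logging.handlers.SysLogHandler.LOG_LOCAL0,
    )
    logger.addHandler(handler)
    logger.propagate = False  # keep the lines out of the root logger
    return logger
```

Each call such as `log.info(line)` goes out as one UDP datagram, which is consistent with lines from different hosts arriving interleaved but whole. UDP gives no delivery guarantee, so a line can occasionally be lost under load.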
Graeme Fowler graeme (at) graemef (dot) net 22 Dec 2005 mod_log_mysql might help you out too: http://bitbrook.de/software/mod_log_mysql/ ...or its Apache 1 cousin, mod_log_sql: http://www.outoforder.cc/projects/apache/mod_log_sql/ Note however that for a big and/or very busy cluster you need to be very, very careful with your database design and the setup of your servers. At work a colleague recently ran this up across 40 Apache servers and knocked the ass out of the MySQL logging server, jamming it up with 1000 persistent client connections. That was bad operational design on our part, but still something worth remembering. Performance-wise it seems to do well as all the queries are inserts, and it's obviously possible to make use of MySQL table replication to amalgamate several collected tables onto one host for post-processing. As a theory it's definitely got legs, we just have to find out how many in practice now!

Joe

when sending logs to a central server, are there any problems with streams becoming intermixed, so that you get nonsense?

Todd Lyons tlyons (at) ivenue (dot) com 23 Dec 2005 No one host's line will be interrupted by another host's line. The lines from different hosts will be intermingled, but they will be complete. Here's an example:

    , delay=00:00:01, pri=30500, stat=Blocked by SpamAssassin
    Dec 23 11:48:09 smtp1 sm-mta[10021]: jBNJm8o0010021: from=,size=0, class=0, nrcpts=0, proto=SMTP, daemon=MTA, relay=[82.131.161.22]
    Dec 23 11:48:09 smtp2 spamd[26407]: prefork: child states: IIIII
    Dec 23 11:48:09 smtp1 sm-mta[9872]: jBNJlQPs009872: ... User unknown
    Dec 23 11:48:09 smtp1 sm-mta[10010]: jBNJm3SU010010: ruleset=check_rcpt, arg1=, relay=24-176-185-20.dhcp.reno.nv.charter.com [24.176.185.20] (may be forged), reject=550 5.7.1 ... Relaying denied. IP name possibly forged [24.176.185.20]

If you have a central log server, what do you do if it dies? How do you do failover with this solution?

Joe Stumpf Well you could always load balance your log traffic on an LVS setup with redundant NetApp's, but really why on God's green Earth would you do that? It's log traffic. I've never heard of a place where log traffic ever justified redundant servers. Johan van den Berg vdberj (at) unisa (dot) ac (dot) za 11 Jan 2006 The following works like a charm. I have 4 nodes, 1 fileserver, and 1 lvs machine... The nodes are high usage, and log everything to syslog using "| logger..." syntax in apache, and syslog forwards to @fileserver. Fileserver uses the - option before a filename to only sync the access log when needed to file, so that the access logs don't cause too much filesystem access on the fileserver. I am though concerned that every once in a while, I've seen a line or two go missing if I push too much into syslog at one time. Graeme Fowler graeme (at) graemef (dot) net03 May 2006 (in reply to a query about logging from a two realserver LVS). Use mod_log_spread (http://www.backhand.org/mod_log_spread/). This makes use of the multicast spread toolkit to allow you to log messages to remote servers. The mechanics of it I leave to you as they aren't hugely simple. Alternatively, make Apache log to a remote syslog host which combines the logs for you. This could easily be *both* of your realservers logging to each other, and again I leave the mechanics of it to you. Note that this will not scale up or out very far, but for a two-node solution it's perfect IMO. I'm saying that having each realserver act as a logging host for all the rest won't scale. Beyond a pair, having a dedicated syslog host (or indeed more than one, for robust logging) is the way forward, as you say. Lasse Karstensen lkarsten (at) hyse (dot) org4 May 2006 I tried compiling mod_log_spread a few months ago, even found a apache2 patch on some mailling list, but without luck. Anyone succeeded using it with apache2? The project seems abandoned. Too bad, we used it before and were mostly happy. 
We're using the syslog solution. We're having 10-12 realservers now, with some moderate amounts of traffic. Graeme Fowler graeme (at) graemef (dot) net 04 May 2006 I used mod_log_spread with apache2, but I no longer have access to either the slightly modified codebase or the production results. It worked well, however, shared between 8 servers with a pair of collectors. There was a slight lag to log processing as each listener piped the log arrivals through a Perl script (running in a "while (<>" loop), did $magic with them, and put them in the right users' logfiles. Clint Byrum cbyrum (at) spamaps (dot) org 04 May 2006 We used mod_log_spread for about 6 months at Adicio, Inc. We were pumping 4-8 million hits per day, reaching rates of 600-1000 hits per second at times, across 8 servers through it. I even submitted a few patches and we paid the author, George Schlossnagle, to enhance it to be more robust for us. We gave up on it ultimately, as the underlying toolkit, spread, wasn't scaling for us. We'd have a little network blip on one machine, and the whole ring would stop working for 5 minutes. Server loads would go up retrying, and retransmitting, error logs would be flying. It was a real mess. Ultimately we needed to do some tuning that involved recompiling the spread daemon. I gained a deep understanding of the spread protocol, and decided it was far too complex for this purpose. We've gone to a system now where logs are written locally using the program 'cronolog' and once per hour they are collected via NFS export. It works pretty well, though it was nice to have one big log file. Dan Trainor dan (at) id-confirm (dot) com05 May 2006 What we've done in the past is also used mod_log_mysql. While not the most efficient way of logging (sure, your setup may differ) - this did allow us to be flexible as to where we wanted logs sent. We would dump this log nightly and export it back into one on disk, then run our stats against it. 
We later modified our stats system to read directly from the database, which worked out quite nicely. That's a lot of work, though. I guess if you're looking for something simple, that's not the answer for you. However, it's food for thought. Daniel Ballenger lpmusix (at) gmail (dot) com 5 May 2006 I just recently ran a benchmark on mysql on a machine for my company... With MySQL I was pushing >1000 inserts per second. This was on a Quad 700MHz (Compaq DL580) box with 1GB of RAM and 4 9.1GB drives in RAID 5. I'm sure with faster disks I'd be able to push that box even harder with mysql. But of course, I've yet to hit that many queries per second with it in production :). Graeme Fowler graeme (at) graemef (dot) net 06 May 2006 In testing, we found an interesting limit - MySQL 4.x seems to have a hard limit of 1000 client connections, and it can't be raised. As every single Apache child process opens a connection to the server to log accesses, this means that (for example) 5 Apache servers with MaxClients set to more than 200 can block the MySQL server. In the same tests we found that the server doesn't recover from this, so it stops Apache from working while each child waits for its logging connection to close. MySQL 5.x did not show this behaviour.
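Graeme's arithmetic above can be checked with a few lines. This is only a sketch: the function names are made up for illustration, and the limit of 1000 is the figure he reports for MySQL 4.x, not something read from a server.

```python
def logging_connections(apache_servers, max_clients):
    """Worst case: every Apache child holds one MySQL logging connection."""
    return apache_servers * max_clients


def can_block_mysql(apache_servers, max_clients, mysql_limit=1000):
    """True if the Apache children could ask for more connections than
    the limit reported above (1000 is the figure quoted for MySQL 4.x)."""
    return logging_connections(apache_servers, max_clients) > mysql_limit


if __name__ == "__main__":
    for max_clients in (150, 200, 201, 256):
        total = logging_connections(5, max_clients)
        print(max_clients, total, can_block_mysql(5, max_clients))
```

With 5 servers, MaxClients of exactly 200 reaches the limit without passing it, which matches the "more than 200" in the posting.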

LVS: Services: single-port

ftp, tcp 21 This is a multi-port service and is covered in the section of multi-port services.

ssh, sftp, scp, tcp 22 Surprisingly (considering that it negotiates a secure connection) nothing special is needed here either. You do not need persistent port/client connection for this. sshd is a standard one-port tcp connection. The director will time out an idle tcp connection (e.g. ssh, telnet) in 15 minutes, independently of any settings on the client or server. You will want to change the tcp idle timeouts. ssh also has its own timeouts, so you can get "Connection reset by peer" independently of LVS. linuxxpert linuxxpert (at) gmail (dot) com 19 Dec 2008 Make sure you have "TCPKeepAlive yes" in your sshd_config file. If TCPKeepAlive is already yes, then add "ClientAliveInterval" in your sshd_config. man sshd_config: ClientAliveInterval

Sets a timeout interval in seconds after which if no data has been received from the client, sshd will send a message through the encrypted channel to request a response from the client. The default is 0, indicating that these messages will not be sent to the client. This option applies to protocol version 2 only.
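As a rough check of the advice above, the following sketch reads sshd_config text and tests whether ClientAliveInterval is set to something shorter than the director's idle timeout, so that traffic crosses the director before its connection entry expires. The helper names are hypothetical, the 15 minute default is the director timeout quoted earlier in this section, and the parser assumes the simple "Keyword value" form with an integer number of seconds.

```python
DIRECTOR_TCP_TIMEOUT = 15 * 60  # seconds; the director idle timeout quoted above


def client_alive_interval(config_text):
    """Return ClientAliveInterval in seconds from sshd_config text.

    Returns 0 (the documented default) when the keyword is absent.
    Assumes whitespace-separated "Keyword value" lines with an integer value.
    """
    for raw in config_text.splitlines():
        line = raw.split("#", 1)[0].strip()   # drop comments
        if not line:
            continue
        parts = line.split(None, 1)
        # sshd_config keywords are case-insensitive; the first value wins
        if len(parts) == 2 and parts[0].lower() == "clientaliveinterval":
            return int(parts[1].strip())
    return 0


def survives_director_timeout(interval, director_timeout=DIRECTOR_TCP_TIMEOUT):
    """True if sshd probes the client more often than the director expires entries."""
    return 0 < interval < director_timeout


if __name__ == "__main__":
    sample = "TCPKeepAlive yes\nClientAliveInterval 300\n"
    print(survives_director_timeout(client_alive_interval(sample)))
```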

sftp and scp are also single port services on port 22. The current (Jul 2001) implementations of ssh (openssh-2.x/openssl-0.9.x) all use ssh protocol 2. Almost all the webpages/HOWTOs available are about sshd protocol 1. None of this protocol 1 information is of any use for protocol 2 - everything from setting up the keys onward is different. Erase everything from your protocol 1 installations (the old ssh binaries, /etc/ssh*), do not try to be backwards compatible, avoid the old HOWTOs and go buy (sometime in 2002: looks like new information is arriving on webpages.) anonymous

How can I configure rsh/ssh to enable copying of files between 2 machines without human intervention?

Brent Cook busterb (at) mail (dot) utexas (dot) edu 09 Jul 2002 Here are some articles on OpenSSH key management. My realservers aren't connected to the outside world so I've set up ssh to allow root to log in with no passwd. If you compile ssh with the default --prefix, where it is installed in /usr/local/bin, you will have to change the default $PATH for sshd too (I suggest using --prefix=/usr so you don't have to bother with this.) Here's my sshd_config - the docs on passwordless root logins were not helpful.

keys for realservers running sshd If you install sshd and generate the host keys for the realservers using the default settings, you'll get a working LVS'ed sshd. However you should be aware of what you've done. The default sshd listens to 0.0.0.0 and you will have generated host keys for a machine whose name corresponds to the RIP (and not the VIP). Since the client will be displaying a prompt with the name of the realserver (rather than the name associated with the VIP) this will work just fine. However the client will get a different realserver each connection (which is OK too) and will accumulate keys for each realserver. If instead you want the client to be presented with one virtual machine, you will need each machine to have its hostname being the name associated with the VIP, the sshd will have to listen to the VIP (if LVS-DR, LVS-Tun) and the hostkeys will have to be generated for the name of the VIP.

ssh zombie processes The user must exit cleanly from their ssh session, or the realserver will be left running the ssh invoked process at high load average. The problem is that what the user thinks is a clean exit and what the sshd thinks is a clean exit may be different things. (There is a similar problem on a webserver, which is running a process invoked by a cgi script, when the client disconnects by clicking to another page or hitting "stop"). Shivaji Navale

On our realservers, after the users have logged out of their ssh session, zombie processes run at high load average in the background. We resort to killing the zombie processes. The ssh connection to the director doesn't get closed even after ctrl-D. From netstat, the connection is still active.

Dave Jagoda dj (at) opsware (dot) com 23 Nov 2003 Does this sound like what's going on? http://www.openssh.org/faq.html#3.10, ssh hangs on exit.

persistence with sshd You do not need persistence with ssh, but you can use it. Piero Calucci calucci (at) sissa (dot) it 30 Mar 2004

I use ipvs to load balance ssh and I do use persistence (and my users are happy with it -- in fact they asked me to do so). With this setup when they open multiple sessions they are guaranteed to reach always the same realserver, so they can see all their processes and local /tmp files.

If you use persistence, be aware of the effects. jeremy (at) xxedgexx (dot) com

I'm using ipvs to balance ssh connections but for some reason ipvs is only using one real server and persists to use that server until I delete its arp entry from the ipvs machine and remove the virtual loopback on the realserver. Also, I noticed that connections do not register with this behavior.

Wensong do you use the persistent port for your VIP:22? If so, the default timeout of persistent port is 360 seconds, once the ssh session finishes, it takes 360 seconds to expire the persistent session. (In ipvs-0.9.1, you can flexibly set the timeout for the persistent port.) There is no need to use persistent port for ssh service, because the RSA keys are exchanged in each ssh session, and each session is not related.

telnet, tcp 23 Simple one port service. Use telnet (or ) for initial testing of your LVS. It is a simpler client/service than http (it is not persistent) and a connection shows up as an ActiveConn in the ipvsadm output. You can fabricate a fake service on the realservers with (Also note the director timeout problem, explained in the ssh section).

smtp, tcp 25; pop3, tcp 110; imap tcp/udp 143 (imap2), 220(imap3). Also sendmail, qmail, postfix, and mailfarms. (for non LVS solutions to high throughput, high availability mailers, see tutorial by Damir Horvat).

The many reader, single writer problem with smtp For mail which is being passed through, LVS is a good solution. If the realserver is the final delivery target, then the mail will arrive randomly at any one of the realservers and write to the different filesystems. This is the many reader/many writer problem that LVS has. Since you probably want your mail to arrive at one place only, the only way of handling this right now is to have the /home directory nfs mounted on all the realservers from a backend fileserver which is not part of the LVS. (an nfs.monitor is in the works.) Each realserver will have to be configured to accept mail for the virtual server DNS name (say lvs.domain.com). Rio rio (at) forestoflives (dot) com 21 Jun 2007 We're using the proprietary mail server software (surgemail) which has built-in 2-way mirroring that updates each other within milliseconds of a change. (Joe: can't imagine it would be that hard to add that feature to a GPL'ed smtpd.)

mailbox formats: mbox, maildir Joe - if mail arrives on different realservers, then people have tried merging/synching the files on the different machines. Only one of the file formats, maildir, is worth attempting. Even that hasn't worked out too well and it seems that most successful LVS mail farms are using centralised file servers. Todd Lyons tlyons (at) ivenue (dot) com 13 Jan 2006 mbox: one file for all messages (like the /var/spool/mail/$USER mailbox). maildir: one file per message. I don't know that I'd feel all that comfortable about syncing either one: mbox - This could end up with a corrupted mbox file since it's all one big long file. maildir - Guaranteed that you won't end up with any filename conflicts since the hostname is used in the filename, but there are some shared files, such as the quota or subscribed files. Those files could easily get out of sync. Graeme Fowler graeme (at) graemef (dot) net 13 Jan 2006 Indeed they can (says someone long since burnt, not exactly by rsync, but by some other methods of attempting to sync mailboxes...). I now use NetApp filers on which my maildirs live, and I have multiple IMAP/POP (and Horde/IMP webmail) servers living behind a pair of LVS directors. Using maildir format [0], I export the mailboxes via NFS to the frontend servers and it works a treat - generally because maildir saves files using filenames derived from the hostname, so the frontend servers don't hit race conditions when trying to manipulate files. If you need really (and I mean _really_) high NFS transaction rates and fault-tolerance, I can't recommend the NetApp kit highly enough. If you don't, then other options - cheaper ones! - are (for example) high-end Intel-based IBM, Dell, HP/Compaq etc. servers with large hardware RAID arrays for redundancy. You'd have to assess the various benchmarks according to your means, but there's a price point for every pocket somewhere.
[0] for the inevitable extra question: Exim v4.x for the (E)SMTP server, Courier IMAP server for IMAP and POP; all backended with MySQL for mailbox/address lookups and POP/IMAP/SMTP authentication.

Maintaining user passwds on the realservers

Gabriel Neagoe Gabriel (dot) Neagoe (at) snt (dot) ro for syncing the passwords - IF THE ENVIRONMENT IS SAFE- you could use NIS or rdist

identd (auth) problems with MTAs You will not be explicitly configuring identd in an LVS. However identd is used by sendmail and tcpwrappers and will cause problems. Services running on realservers can't use identd when running on an LVS (see identd and sendmail). Running identd as an LVS service doesn't fix this.

sendmail in sendmail.cf set the value Also see Why do connections to the smtp port take such a long time?

qmail Martin Lichtin lichtin (at) bivio (dot) com If invoked with tcp-env in inetd.conf - use the -R option. If spawned using svc and DJB's daemontools packages - tcpserver is the recommended method of running qmail, where you use the -R option for tcpserver

-R: Do not attempt to obtain $TCPREMOTEINFO from the remote host. To avoid loops, you must use this option for servers on TCP ports 53 and 113.

exim Michael Stiller ms (at) 2scale (dot) net 26 Jun 2003 in exim.conf set the timeout for identd/auth to zero Remember to reHUP exim.

postfix Note: postfix is not name resolution friendly. This will not be a problem if smtp is an LVS'ed service, but will be a problem if you use it for local delivery too.

testing an LVS'ed sendmail Here's an LVS-DR with sendmail listening on the VIP on the realserver. Notice that the reply from the MTA includes the RIP allowing you to identify the realserver. Connect from the client to VIP:smtp check that you can access each realserver in turn (here the realserver with RIP=192.168.1.12 was accessed).

pop3, tcp 110 pop3 - as for smtp. The mail agents must see the same /home file system, so /home should be mounted on all realservers from a single file server. Using qmail - Abe Schwartz sloween (at) hotmail (dot) com

Has anyone used LVS to balance POP3 traffic in conjunction with qmail?

Wayne wayne (at) compute-aid (dot) com 13 Feb 2002 We used LVS to balance POP3 and Qmail without any problem. Mike McLean mikem (at) redhat (dot) com 21 Apr 2004 Generally RW activity like with POP does not work well with LVS. When such things are required, one reasonable solution is to have a three tier setup where the lvs director load balances across multiple servers (in your case pop/imap), which access their data via a highly available shared filesystem/database. Of course this requires more machines and more hardware. Kelly Corbin kcorbin (at) theiqgroup (dot) com 23 Aug 2004

Pop-Before-SMTP and LVS I just added pop and smtp services to my load balancers, and I wanted to know if there was a way to tie the two connections together somehow. I use pop-before-smtp to allow my users to send mail, but sometimes they reconnect to a different server for smtp than the one they pop-ed into. Most of the time it's OK because they have pop-ed to both of the servers in the cluster but every now and then I have a user with an issue. Is there a way to have it keep track of the IP and always send those SMTP and POP connections to the same realserver?

Josh Marshall josh (at) worldhosting (dot) org You don't need to have the smtp and pop connections on the same realserver. We've got three mail servers here with pop-before-smtp running... I just point the daemons at the same file on an nfs share and it all just works. The pop-before-smtp daemons can handle all writing to the same file just fine. Joe You can link the two services with fwmark (and possibly persistence, depending on the time between connections). LVS is not really designed for writing to the realservers. To do so, you have to serialise the writes and then propagate the writes to the other realservers. This is a bit of a kludge, but is the way everyone handles writes to an LVS. The other point of an LVS is load balancing. What you're proposing is to turn off the loadbalancing and have the content on each of the realservers different. You need to be able to failout a realserver. What you're proposing can be done, but it isn't an LVS anymore

imap tcp/udp 143 (imap2),220 (imap3) Ramon Kagan rkagan (at) YorkU (dot) CA 20 Nov 2002

Previously we had been using DNS shuffle records to spread the load of IMAP connections across 3 machines. Having used LVS-DR for web traffic for a long time now (about 2.5 years) we decided to bring our IMAP service into the pool. The imap servers have been set up to retain idle connections for 24 hours (not my choice but the current setup anyway). The lvs servers have been set up to have a persistence of 8 hours. What has been happening lately is that users get the following type of error: "The server has disconnected, may have gone down or may be a network problem, try connecting again". So, users don't have to reauthenticate per se, they just need to reclick on whatever button, so that the client they are using reauthenticates them. The timeout for this is as low as ten minutes. However the imap daemon is kept alive, and the client is shipped the PID of the imapd to reconnect directly. This is supposed to circumvent the necessity of reauthentication. In the case where the connection is lost, the reauthentication actually starts a new imapd process.

Joe there is an LVS timeout and another timeout (tcp-keepalive) associated with tcp connections which affects telnet, ssh sessions. You need to reset both (see timeouts).

The solution: On the linux imap servers the tcp-keepalive value in /proc/sys/net/ipv4/tcp_keepalive_time is set to 7200. This must be matched by the ipvsadm tcp timeout. So the options are to do one of: use ipvsadm --set 7200 0 0 on the lvs server, or change the value in /proc/sys/net/ipv4/tcp_keepalive_time on the imap servers to match the tcp timeout on the lvs server.

Ramon Kagan rkagan (at) YorkU (dot) CA 27 Nov 2002

Further update on the solution: although this works there is one catch. If a person logs out and comes back within the 7200s as stated in the previous email, they continue to get the message because the realserver and the director don't match again. We will be lowering the value to somewhere between 5 and 15 minutes (300-900) to address this type of usage.

Torsten Rosenberger rosenberger (at) taoweb (dot) at 16 Sep 2003

i want to build a mail cluster with cyrus-imapd but i don't know how to handle the mailbox database on LVS.

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 16 Sep 2003 we use HP ServiceGuard for clustering (12 Cyrus instances running on three physical servers, storage on HP VA7410 connected through SAN), Perdition as a proxy (connects user to correct instance, also acts as an SSL accelerator, runs on two physical servers) and LVS (keepalived, active-active on two physical servers). the other route is to use Cyrus Murder. e.g. shared folders will probably work better with Murder. administration is easier, too. we have to connect to correct instance. at the time we chose our architecture, Murder seemed unfinished, and we were worried about single point of failure. I'm afraid I haven't followed its development closely.

Thoughts about sendmail/pop (another variation on the many reader/many writer problem) loc (at) indochinanet (dot) com wrote: I need this to convince my boss that LVS is the solution for very scalable and high available mail/POP server. Rob Thomas rob (at) rpi (dot) net (dot) au

This is about the hardest clustering thing you'll ever do. Because of the constant read/write accesses you -will- have problems with locking, and file corruption. The 'best' way to do this is (IMHO):

1. NetCache Filer as the NFS disk server.
2. Several SMTP clients using NFS v3 to the NFS server.
3. Several POP/IMAP clients using NFS v3 to the NFS server.
4. At least one dedicated machine for sending mail out (smarthost).
5. LinuxDirector box in front of 2 and 3 firing requests off.

Now, items 1, 2 -and- 3 can be replaced by Linux boxes, but NFS v3 is still in Alpha on linux. I -believe- that NetBSD (FreeBSD? One of them) has a fully functional NFS v3 implementation, so you can use that. The reason why I emphasize NFSv3 is that it -finally- has 'real' locking support. You -must- have atomic locks to the file server, otherwise you -will- get corruption. And it's not something that'll happen occasionally.

Picture this: Whilst [client] is reading mail (via [pop3 server]), [external host] sends an email to his mailbox. The pop3 client has a file handle on the mail spool, and suddenly data is appended to this. Now the problem is, the pop3 client has a copy of (what it thinks) is the mail spool in memory, and when the user deletes a file, the mail that's just been received will be deleted, because the pop3 client doesn't know about it. This is actually rather a simplification, as just about every pop3 client understands this, and will let go of the file handle. But the same thing will happen if a message comes in -whilst the pop3d is deleting mail-.

    You can lock the file -->
    Consider it locked <--
    File is locked -->
    Consider it locked <--
    Ooh, I can't lock it -->

The issue with NFS v1 and v2 is that whilst it has locking support, it's not atomic. NFS v3 can do this:

    Ooh, I can't lock it -->

That's why you want NFSv3. Plus, it's faster, and it works over TCP, rather than UDP 8-)
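The lost-mail sequence Rob describes can be replayed on a plain Python list standing in for an mbox spool. This is a toy model under stated assumptions, not how any real pop3d or NFS lock manager is written: the function names are invented, and the "lock" is modelled simply by making the delivery wait until the pop3 session has written its copy back.

```python
def pop3_delete_without_lock(spool, to_delete, incoming):
    """pop3d works from an in-memory copy while a delivery is appended."""
    snapshot = list(spool)        # pop3d reads the spool into memory
    spool.append(incoming)        # another machine appends new mail meanwhile
    snapshot.remove(to_delete)    # user deletes one message
    spool[:] = snapshot           # pop3d writes its stale copy back
    return spool


def pop3_delete_with_lock(spool, to_delete, incoming):
    """Same events, but the delivery cannot proceed while pop3d holds the lock."""
    waiting = []                  # deliveries that found the spool locked
    snapshot = list(spool)        # pop3d takes the lock and reads the spool
    waiting.append(incoming)      # delivery is refused the lock and waits
    snapshot.remove(to_delete)
    spool[:] = snapshot           # pop3d writes back and releases the lock
    spool.extend(waiting)         # the waiting delivery now appends safely
    return spool


if __name__ == "__main__":
    print(pop3_delete_without_lock(["msg1", "msg2"], "msg1", "new mail"))
    print(pop3_delete_with_lock(["msg1", "msg2"], "msg1", "new mail"))
```

In the first case the new message vanishes when the stale copy is written back; in the second it survives because the append is serialised behind the lock.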

This is about the hardest clustering thing you'll ever do. Stefan Stefanov sstefanov (at) orbitel (dot) bg

I think this might be not-so-hardly achieved with CODA and Qmail. Coda (http://www.coda.cs.cmu.edu) allows "clustering" of file system space. Qmail's (http://www.qmail.org) default mailbox format is Maildir, which is very lock safe format (even on NFS without lockd). (I haven't implemented this, it's just a suggestion.)
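Why maildir avoids the locking problem can be sketched as follows: each message goes into its own file whose name includes the delivering host, is written under tmp/ and only then moved into new/. The helper names are hypothetical, the name format is the common time.pid.hostname pattern, and rename is used here as a simplification of the delivery step.

```python
import os
import tempfile
import time


def make_maildir(path):
    """Create the three standard maildir subdirectories."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(path, sub), exist_ok=True)


def maildir_deliver(maildir, hostname, pid, message, now=None):
    """Write one message to its own file; no lock on a shared spool is needed."""
    now = int(time.time()) if now is None else now
    name = f"{now}.{pid}.{hostname}"       # unique per host, process and second
    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)
    with open(tmp_path, "x") as f:         # "x" refuses to overwrite an existing file
        f.write(message)
    os.rename(tmp_path, new_path)          # readers only ever see complete files
    return new_path


if __name__ == "__main__":
    root = tempfile.mkdtemp()
    make_maildir(root)
    # two realservers delivering in the same second with the same pid
    maildir_deliver(root, "rs1", 4242, "hello from rs1\n", now=1000)
    maildir_deliver(root, "rs2", 4242, "hello from rs2\n", now=1000)
    print(sorted(os.listdir(os.path.join(root, "new"))))
```

Because the hostname is part of the name, two realservers writing to the same NFS-mounted maildir do not collide even when the time and pid happen to match.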

pop3/LVS-DR by Fabien Fabien fabien (at) oeilpouroeil (dot) org 03 Oct 2002

I successfully installed and tested an LVS-DR in a small network (1 director and 2 realservers) with http and pop3 load balancing/high availability, also using some hints from the ultramonkey project (whose team I thank too for the very good howto :)). About the pop3 LVS, here is what I did, and if someone feels like correcting me or suggesting/advising something better I will be grateful! I used on both realservers: - the smtp daemon postfix with the VDA patch to handle maildir. Here LVS doesn't handle smtp, I just use the dns multiple MX feature. - the lightweight pop3 daemon tpop3d ( http://www.ex-parrot.com/~chris/tpop3d/ ) which can manage maildir. The incoming mails are stored on the realservers in maildir type accounts following dns MX scheduling, and so are stored quite randomly on each realserver. To synchronize both realservers so that pop3 accounts are correct when checked, I use rsync and especially the drsync.pl rsync wrapper ( http://hacks.dlux.hu/drsync/ ) which keeps track of a given filelist between rsync synchronizations (here the filelist is all the maildirs content). At the moment, using crontab, drsync is run on each realserver every minute and synchronizes the content of the other realserver with its own. It seems to work with a dozen pop3 accounts and hundreds of mails sent (no loss).

Fabien fabien (at) oeilpouroeil (dot) org 03 Oct 2002

I have an LVS-DR (1 director and 2 realservers) with http and pop3 load balancing/high availability (using some hints from ultramonkey project). On both realservers I have the smtp daemon postfix with the VDA patch to handle maildir. (LVS doesn't handle smtp, I just use the dns multiple MX feature.) the lightweight pop3 daemon tpop3d which can manage maildir format. The incoming mails are stored on random realservers in maildir format following dns MX scheduling. To synchronize both realservers (so that pop3 accounts are correct when checked), I use rsync and especially the drsync.pl rsync wrapper, which keeps track of a given filelist between rsync synchronizations (here the filelist is the maildir files). drsync is run on each realserver every minute (cron) synchronizing the content with the other realserver(s). So far it works with a dozen pop3 accounts and hundreds of mail sent with no loss.

strange mail problem trietz trietz (at) t-ipnet (dot) net 18 Oct 2006 I'm using LVS-NAT for a simple rr-loadbalancing between two sendmail realservers. I set up a director with 3 NICs, one for the external connection (eth0) and the other two (eth1 and eth2) for connecting my realservers over crosspatch cable. The director has two ip addresses on the external interface. Here is the output from ipvsadm-save: The packets initiated by the realservers are SNATed with iptables on the director successfully. My problem: loadbalancing works fine, but I see a lot of the reply packets from the realserver leaving the director on interface eth0 with the internal ips 192.168.0.1 and 192.168.0.2. After testing I assume it is a timeout problem.

1. Client creates a smtp connection over the director to the realserver, which works perfectly.
2. Connection hangs for a while and times out.
3. Realserver tries to close the connection, but the director doesn't "SNAT" the packet.

It looks like the director forgot the connection, because the timeout from the realserver is longer than the timeout from the director. My solution: Patch my kernel sources with the ipvs_nfct patch. Activate conntrack via /proc/sys/net/ipv4/vs/conntrack. Add the following iptables rule on the director (eth1,2 are on the DRIP network, eth0 faces the internet): Horms' explanation (off-list): Basically the real-server and end user have a connection open. It idles for a long time. So long that the connection entry on the linux director expires. Then the real-server sends a packet over the connection, which the linux director doesn't recognise and sends out to the ether without unnatting it. Solution? Well other than the one he suggests, changing the timeouts would help, assuming his analysis is correct. Joe - this timeout problem came up earlier.
Dominik Klein dk (at) in-telegence (dot) net 15 May 2006 So let's just say we have a simple setup:

    switch -> director -> switch -> realserver

The client establishes a connection, sends some data, whatever. Then it does nothing for say 10 minutes. After that it tries to reuse the still established connection and it works just fine. Then it does nothing for say 16 minutes and tries again to use the still established connection. In the meantime, the default timeout for the connection table (default 15 minutes) runs out and so this connection is not valid on the director. So the director replies with RST on the PSH packet from the client and the connection breaks for the client. The realserver does not know anything about the reset on the director, so it still considers the connection established. The client opens a new connection, but the old one will still be considered established on the realserver. That's what made my MySQL server hit the max_connection limit and reject any new clients. I will try to set the timeout higher, as it can easily happen that my clients do nothing for a few hours at night, which will - sooner or later - hit the max_connection limit again.
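Dominik's sequence (10 minutes idle works, 16 minutes idle gets a reset) can be replayed with a toy model of the director's connection table. This is only an illustration of the behaviour he describes, not the ip_vs code: the class and method names are invented, and the 15 minute default is the figure from his posting.

```python
class DirectorConnTable:
    """Toy model: an entry is dropped once it has been idle longer than timeout."""

    def __init__(self, timeout=15 * 60):
        self.timeout = timeout
        self.last_seen = {}       # connection id -> time of last packet (seconds)

    def packet(self, conn, now, syn=False):
        last = self.last_seen.get(conn)
        expired = last is None or now - last > self.timeout
        if expired and not syn:
            # no valid entry and not a new connection: the client gets a reset
            self.last_seen.pop(conn, None)
            return "RST"
        self.last_seen[conn] = now    # forward and refresh the idle timer
        return "forwarded"


if __name__ == "__main__":
    table = DirectorConnTable()
    conn = ("client", 40000, "VIP", 3306)
    print(table.packet(conn, now=0, syn=True))       # connection set up
    print(table.packet(conn, now=10 * 60))           # idle 10 minutes
    print(table.packet(conn, now=26 * 60))           # idle a further 16 minutes
```

The realserver never sees the reset, which is why in the posting the stale connections pile up there until max_connection is reached.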

Mail Farms

Designing a Mail Farm Peter Mueller pmueller (at) sidestep (dot) com 10 May 2001 what open source mail programs have you guys used for SMTP mail farm with LVS? I'm thinking about Qmail or Sendmail? Michael Brown Michael_E_Brown (at) Dell (dot) com, Joe and Greg Cope gjjc (at) rubberplant (dot) freeserve (dot) co (dot) uk 10 May 2001

You can do load balancing against multiple mail servers without LVS. Use multiple MX records to load balance, and mailing list management software (Mailman, maybe?). DNS responds with all MX records for a request. The MTA should then choose one at random from those with the same priority. (A caching DNS server will also return all MX records.) You don't get persistent use of one MX record. If the chosen MX record points to a machine that's down, the MTA will choose another MX record.
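The MX behaviour described above (lowest preference first, random choice among equal preferences, fall through to the next record when a host is down) can be sketched like this. The function names and the example.com hosts are made up for illustration, and is_up stands in for whatever connection attempt a real MTA would make.

```python
import random


def delivery_order(mx_records, rng=random):
    """mx_records is a list of (preference, host) pairs.

    Hosts are ordered by ascending preference and shuffled within
    each group of equal preference.
    """
    by_pref = {}
    for pref, host in mx_records:
        by_pref.setdefault(pref, []).append(host)
    order = []
    for pref in sorted(by_pref):
        hosts = list(by_pref[pref])
        rng.shuffle(hosts)            # spreads load across equal-priority hosts
        order.extend(hosts)
    return order


def pick_mx(mx_records, is_up, rng=random):
    """Return the first reachable host in delivery order, or None to spool and retry."""
    for host in delivery_order(mx_records, rng):
        if is_up(host):
            return host
    return None


if __name__ == "__main__":
    records = [(15, "mx1.example.com"), (15, "mx2.example.com"),
               (100, "backup.example.com")]
    down = {"mx1.example.com"}
    print(pick_mx(records, lambda h: h not in down))
```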

Wensong I think that central load balancing is more efficient in resource utilization than randomly picking up servers by clients, basic queuing theory can prove this. For example, if there are two mail servers grouped by multiple DNS MX records, it is quite possible that a mail server of load near to 1 still receiving new connections (QoS is bad here), in the mean while the other mail server just has load 0.1. If the central load balancing can keep the load of two server around 0.7 respectively, the resource utilization and QoS is better than that of the above case. :) Michael Brown Michael_E_Brown (at) Dell (dot) com 15 May 2001

I agree, but... :-) You can configure most mail programs to start refusing connections when load rises above a certain limit. The protocol itself has built-in redundancy and error-recovery. Connections will automatically fail-over to the secondary server when the primary refuses connections. Mail will _automatically_ spool on the sender's side if the server experiences temporary outage. Mail service is a special case. The protocol/RFC itself specified application-level load balancing, no extra software required. Central load balancer adds complexity/layers that can fail. I maintain that mail serving (smtp only, pop/imap is another case entirely) is a special case that does not need the extra complexity of LVS. Basic Queuing theory aside, the protocol itself specifies load-balancing, failover, and error-recovery which has been proven with years of real-world use. LVS is great for protocols that do not have the built-in protocol-level load-balancing and error recovery that SMTP inherently has (HTTP being a great example). All I am saying is use the right tool for the job.

Note this discussion applies to mail which is being forwarded by the MTA. The final target machine has the single-writer, many-reader problem as before (which is fine if it's a single node), (i.e. don't run the leaf node as an LVS). Joe

How would someone like AOL handle the mail farm problem? How do users get to their mail? Does everyone in AOL get their mail off one machine (or replicated copies of it) or is each person directed to one of many smaller machines to get their mail?

Michael Brown Tough question... AOL has a system of inbound mail relays to receive all their users' mail. Take a look:

    > set type=mx
    > aol.com
    Server: ausdhcprr501.us.dell.com
    Address: 143.166.227.254

    aol.com preference = 15, mail exchanger = mailin-03.mx.aol.com
    aol.com preference = 15, mail exchanger = mailin-04.mx.aol.com
    aol.com preference = 15, mail exchanger = mailin-01.mx.aol.com
    aol.com preference = 15, mail exchanger = mailin-02.mx.aol.com
    aol.com nameserver = dns-01.ns.aol.com
    aol.com nameserver = dns-02.ns.aol.com
    mailin-03.mx.aol.com internet address = 152.163.224.88
    mailin-03.mx.aol.com internet address = 64.12.136.153
    mailin-03.mx.aol.com internet address = 205.188.156.186
    mailin-04.mx.aol.com internet address = 152.163.224.122
    mailin-04.mx.aol.com internet address = 205.188.158.25
    mailin-04.mx.aol.com internet address = 205.188.156.249
    mailin-01.mx.aol.com internet address = 152.163.224.26
    mailin-01.mx.aol.com internet address = 64.12.136.57
    mailin-01.mx.aol.com internet address = 205.188.156.122
    mailin-01.mx.aol.com internet address = 205.188.157.25
    mailin-02.mx.aol.com internet address = 64.12.136.89
    mailin-02.mx.aol.com internet address = 205.188.156.154
    mailin-02.mx.aol.com internet address = 64.12.136.121
    dns-01.ns.aol.com internet address = 152.163.159.232
    dns-02.ns.aol.com internet address = 205.188.157.232

So that is on the receive side. On the actual user reading their mail side, things are much different. AOL doesn't use normal SMTP mail. They have their own proprietary system, which interfaces to the normal internet SMTP system through gateways. I don't know how AOL does their internal, proprietary stuff, but I would guess it would be a massively distributed system. Basically, you can break down your mail-farm problem into two, possibly three, areas. Items 1 and 3 can normally be hosted on the same set of machines, but it is important to realize that these are separate functions, and can be split up, if need be.
For item #1, the listing above showing what AOL does is probably a good example of how to set up a super-high-traffic mail gateway system. I normally prefer to add one more layer of protection on top of this: a super low-priority MX at an offsite location (example: aol.com preference = 100, mail exchanger = disaster-recovery.offsite.aol.com).

For item #2, that is going to be a site policy, and can be handled many different ways depending on what mail software you use (imap, pop, etc). The good IMAP software has LDAP integration. This means you can separate groups of users onto separate IMAP servers. The mail client can then get the correct server from LDAP and contact it with standard protocols (IMAP/POP/etc).

For item #3, you will solve this differently depending on what software you have for #2. If the client software wants to send mail directly to a smart gateway, you are probably going to DNS round-robin between several hosts. If the client expects its server (from #2) to handle sending email, then things will be handled differently.

Wenzhuo Zhang wenzhuo (at) zhmail (dot) com

Here's an article on paralleling mail servers by Derek Balling.

Shain Miley 25 May 2001

I am planning on setting up an LVS IMAP cluster. I read some posts that talk about file locking problems with NFS that might cause mailbox corruption. Do you think NFS will do the trick or is there a better (faster, journaling) file system out there that will work in a production environment.

Matthew S. Crocker matthew (at) crocker (dot) com 25 May 2001 NFS will do the trick but you will have locking problems if you use mbox format e-mail. You *must* use MailDir instead of mbox to avoid the locking issues. You can also use GFS (www.globalfilesystem.org) which has a fault tolerant shared disk solution. Don Hinshaw dwh (at) openrecording (dot) com I do this. I use Qmail as it stores the email in Maildir format, which uses one file per message as opposed to mbox which keeps all messages in a single file. On a cluster this is an advantage since one server may have a file locked for writing while another is trying to write. Since they are locking two different files it eases the problems with NFS file locking. Courier also supports Maildir format as I believe does Postfix. I use Qmail+(many patches) for SMTP, Vpopmail for a single UID mail sandbox (shell accounts my ass, not on this rig), and Courier-Imap. Vpopmail is configured to store userinfo in MySQL and Courier-Imap auths out of Vpopmail's tables. Joe:

I've always had the creeps about pop and imap sending clear text passwds. How do you handle passwds?

It's a non-issue on that particular system, which is a webmail server. There is no pop, just imapd and it's configured to allow connections only from localhost. The webmail is configured to connect to imapd on localhost. No outside connections allowed. But, this is another reason that I started using Vpopmail. Since it is a mail sandbox that runs under a single UID, email users don't get a shell account, so even if their passwords are sniffed, it only gets the cracker a look into that user's mailbox, nothing more. At least on our system. If a cracker grabs someone's passwd and then finds that the user uses the same passwd on every account they have, there's not much I can do about that. On systems where users do have an ftp or shell login, I make sure that their login is not the same as their email login and I also gen random passwords for all such accounts, and disallow the users changing it. I'm negotiating a commercial contract to host webmail for a company (that you would recognize if I weren't prohibited by NDA from disclosing the name), and if it goes through then I'll gen an SSL cert for that company and auth the webmail via SSL. You can also support SSL for regular pop or imap clients such as Netscape Messenger or MS Outlook or Outlook Express.

Everything is installed in /var/qmail/* and that /var/qmail/ is an NFS v3 export from a RAID server. All servers connect to a dedicated MySQL server that also stores its databases on another NFS share from the RAID. Also each server mounts /www from the RAID. Each realserver runs all services: smtpd, imapd, httpd and dns. I use TWIG as a webmail imap client, which is configured to connect to imapd on localhost (each server does this). Incoming smtp, httpd and dns requests are load balanced, but not imapd, since they are local connections on each server. Each server stores its logs locally, then they are combined with a cron script and moved to the RAID.
It's been working very well in a devel environment for over a year (round-robin dns, not lvs). I've recently begun the project to rebuild the system and scale it up into a commercially viable system, which is quite a task since most of the software packages are at least a year old, and I'll be using a pair of LVS directors instead of the RRDNS.

Matthew Croker

Users will also be using some sort of webmail (IMP/HORDE) to get their mail when they are off site...other than that standard Eudora/Netscape will be used for retrieval.

I settled on TWIG mainly because of its vhost support. With Vpopmail, I can execute ... and add that domain to dns and begin adding users and serving it up. I had to tweak TWIG just a bit to get it to properly deal with the "user@domain" style login that Vpopmail requires, but otherwise it works great. Each vhost domain can have its own config file, but there is only one copy of TWIG in /www. TWIG uses MySQL, and though it doesn't require it, I also create a separate database for each vhost domain. IMP's development runs along at about Mach 0.00000000004 and I got tired of waiting for them to get a working addressbook. That plus it doesn't vhost all that well. SquirrelMail is very nice, but again not much vhost support. Plus TWIG includes the kitchen sink. Email, contacts, schedule, notes, todo, bookmarks and even USENET (which I don't use), each module can be enabled/disabled in the config file for that domain, and it's got a very complete advanced security module (which I also don't use). It's all PHP and, using mod_gzip, is pretty fast. I tested the APC optimizer for PHP, but every time I made a change to a script I had to reload Apache. Not very handy for a devel system, but it did add some noticeable speed increases, until I unloaded it.

(Joe - I've lost track of who is who here)

The realservers would need access to both the users' home directories as well as the /var/mail directory.
I am not too familiar with the actual locking problems...I understand the basics but I also hear that NFS V3 was supposed to fix some of the locking issues with V2...I also saw some links to GFS,AFS,etc not too sure how they would work... For those of you that missed the importance of using Maildir format... Alexandra Alvarado

I need to implement a cluster mail server with 2 computers smtp, 2 computers pop3 and 2 computers (NAS) failover for storage. The idea is to have duplicated copies of the mail (/var/spool/mail and /var/spool/mqueue) in the nas using online replication.

Matthew Crocker matthew (at) crocker (dot) com 20 May 2003

Do not use mbox mailbox format. mbox does not work well over NFS with multiple hosts writing to the same file. You should use Maildir format. Just about every SMTP server can handle Maildir. I'm running 4 mail servers. All mail servers have SMTP, POP3, IMAP, SMTPS, IMAP-SSL, POP3-SSL running on them. I'm using qmail, qmail-ldap, and Courier-IMAP. I have a Network Appliance NetFiler F720 200gig NFS server for my Maildir storage. I have 3 OpenLDAP servers set up with 1 master and 2 slaves. The mail servers only talk with the slaves. All mailAddress information is held in the LDAP directory. You need to centralize the storage of mail using NAS and centralize the storage of account information using either LDAP or MySQL.

How do you plan on failing over the NAS? NFS with soft mounts should handle it pretty well. Set up 2 Linux boxes, one as an active, one as a passive NFS server. Both connected via Fiber Channel to a chunk of drives. The passive machine *must* have a way to STOMITH (Shoot The Other Machine In The Head) the active server if it crashes. You do not want to have both machines mounting the same drive space at the same time. Very bad things will happen....

Soo.. Set up an EXT3 filesystem on the FC drives. Mount it on the active Linux box, export it over NFS with a virtual IP address. If the active server fails you need to remove power from it using a remote power switch. You need to be able to guarantee that it won't come back to life and start writing to the filesystem again. Clean the FC filesystem, remount it and export it over NFS on the same virtual IP addresses. keepalived can handle the VIP stuff with VRRP. I think it can also launch an external script during a failover to handle the shooting, cleaning, mounting, exporting of the filesystem. EXT3 cleans pretty quickly. The SMTP/POP3 servers will be very unhappy to see their NFS server disappear so you will need to recover quickly.
Processes will probably pile up in 'D'isk wait status on all of the machines. load will go through the roof. After the NFS server comes back online the hung processes should recover and finish up. Adaptec makes a very nice 12 disk rack mounted RAID controller that has U160 SCSI going to the disks and 1 gig FiberChannel going to the servers. Redundant power, Redundant RAID controllers, Redundant FC loops going to each server. It is called the DuraStor 7320S. Plan on about $10kUS + drives for this type of system. Network Appliance make an amazing box with complete High Availability failover of clustered data. You can expect to pay $200KUS for a complete clustered solution with 300GB usable storage. Fully redundant. Pretty much shoot a shotgun at it and not go down or lose data. EXT3 running on Logical Volume Manager (LVM) can handle journalling and snapshots Making the servers/services redundant is easy. Making your NAS/SAN redundant is expensive. I'm looking into the Adaptec external RAID controller/drive array setup with 2 Linux boxes for my NAS. I've been running my NetApp for 3 years and have *NEVER* had it crash. It really is an amazing box. The problem is it is only one box and I don't have $200k to make it a cluster. I think I can do a pretty good job for about $20k with the Adaptec box, a bunch of Seagate drives and a couple Linux boxes. You could also look into distributed filesystems like GFC, Coda ... but I don't feel confident enough in them to handle production data just yet.
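The advice repeated in this thread, to use Maildir rather than mbox on NFS, rests on one idea: every message becomes its own uniquely named file, written in tmp/ and then renamed into new/, so two realservers never need to lock the same file. The following is a minimal Python sketch of that delivery step, assuming a hypothetical helper name `maildir_deliver`. It is not the code used by qmail or Courier, and the filename layout is only one plausible way to get uniqueness.

```python
import itertools
import os
import socket
import time

_delivery_counter = itertools.count()


def maildir_deliver(maildir, message_bytes):
    """Store one message as its own file: write into tmp/, then rename into new/."""
    for sub in ("tmp", "new", "cur"):
        os.makedirs(os.path.join(maildir, sub), exist_ok=True)

    # Time, pid, a per-process counter and the hostname make the name unique
    # even when several realservers deliver into the same NFS directory.
    host = socket.gethostname().replace("/", "_").replace(":", "_")
    name = "%d.P%dQ%d.%s" % (time.time(), os.getpid(), next(_delivery_counter), host)

    tmp_path = os.path.join(maildir, "tmp", name)
    new_path = os.path.join(maildir, "new", name)

    with open(tmp_path, "xb") as handle:   # "x": fail rather than overwrite
        handle.write(message_bytes)
        handle.flush()
        os.fsync(handle.fileno())

    os.rename(tmp_path, new_path)           # readers only ever see complete files
    return new_path
```

For real use, Python's standard `mailbox.Maildir` class already implements this layout; the sketch only makes the tmp-then-rename step visible.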

Just a quick note: About a week ago I tried compiling a kernel that had been patched by SGI for XFS. The kernel (2.4.2) compiled fine, but choked once the LVS patches had been applied. Not having a lot of time to play around with it, I simply moved to 2.4.4+lvs 0.9 and decided not to bother with XFS on the director boxes.

also I thought about samba and only found one post from last year where someone was going to try it but there was no more info there.

Well, there's how I do it. I've tried damned near every combination of GPL software available over about the last 2 years to finally arrive at my current setup. Now if I could just load balance MySQL...

Greg Cope

MySQL connections / data transfer work much faster (20% ish) when on localhost - so how about running mysql on each host, which is a select-only system, and each localhost uses replication to a master DB that is used for inserts and updates?

Ultimately I think I'll have to. After I get done rebuilding the system to use kernel 2.4 and LVS and everything is stabilized, then I'll be looking very hard at just this sort of thing. Joe, 04 Jun 2001 SMTP servers need access to DNS for reverse name lookup. If they are LVS'ed in a LVS-DR setup, won't this be a problem? Matthew S. Crocker matthew (at) crocker (dot) com

You only need to make sure you have the proper forward and reverse lookup set. Inbound mail to an SMTP server gets load balanced by the LVS but it still sees the original from IP of the sender and can do reverse lookups as normal. Outbound mail from an SMTP server makes connections from its real IP address which can be NAT'd by a firewall or not. That IP address can also be reverse looked up.
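To make the reverse lookup concrete: the SMTP server takes the source IP of the connection (which LVS leaves untouched) and asks DNS for the PTR record under in-addr.arpa. The short Python sketch below only builds the name that would be queried; it does not contact a nameserver. The address 192.0.2.10 is a documentation address used as a stand-in, and `ptr_name` is a made-up helper name.

```python
import ipaddress


def ptr_name(ip):
    """Return the DNS name queried for a reverse (PTR) lookup of `ip`."""
    return ipaddress.ip_address(ip).reverse_pointer


if __name__ == "__main__":
    # Stand-in client address; a real server would use the peer address
    # of the incoming SMTP connection.
    print(ptr_name("192.0.2.10"))   # 10.2.0.192.in-addr.arpa
```

The realserver still needs a route to a nameserver to resolve that name, which is the point raised in the follow-up question.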

normally the realservers in a LVS-DR setup have private IPs for the RIPs and hence they can't receive replies from calls made to external name servers. I would also assume that people would write filter rules to only allow packets in and out of the realservers that belong to the services listed in the director's ipvsadm tables. I take it that your LVS'ed SMTP servers can access external DNS servers, either by NAT through the director, or in the case of LVS-DR by having public IPs and making calls from those IPs to external nameservers via the default gw of the realservers?

We currently have our realservers with public IP addresses.

Bowie Bailey Bowie_Bailey (at) buc (dot) com

You can also do this by NAT through a firewall or router. I am not doing SMTP, but my entire LVS setup (VIPs and all) is private. I give the VIPs a static conduit through the firewall for external access. The realservers can access the internet via NAT, the same as any computer on the network.

Adail Oliveira

I want to build e-mail cluster using lvs, anybody have experience with this?

kwijibo (at) zianet (dot) com

It is pretty much the same as setting up an LVS for any other service. The biggest problem you will probably have is figuring out how to handle storage for the mailboxes. My experience is that it works great. I am not sure how I would handle our mail load without it.

Todd Lyons tlyons (at) ivenue (dot) com 2005/27/05

Agreed. We use a NetApp for our central NFS server, 2 http machines for webmail, 2 imap machines, and 2 smtp machines. We have a 2 node load balancer with failover that balances the 3 protocols listed above (as well as other webservers). The 6 machines serving mail are dual P4 2.8 GHz with 1Gig RAM boxen, the load balancers are old P3 700 boxes that only do load balancing. We're just a small system though compared to many who do it. We estimate we could scale up to about 10-20 realservers for each service before we start to get throughput problems with the NetApp.

Graeme Fowler graeme (at) graemef (dot) net 15 Dec 2005

My day job sees me working for an ISP; we have a number of mail systems where we use multiple frontend servers (some behind LVS, some using other methods) with NetApp Filer backends offering mail storage over NFS mounts to the real/frontend servers. They handle SMTP, POP, IMAP, Webmail (and other things not necessarily mail related) but store the data on common NFS mounts. Yes, there are well-documented issues with "mail on NFS" but that usually happens with shared SMTP server spools rather than IMAP/POP systems. We haven't had a problem yet; the filers are very reliable (and good if they do go wrong on rare occasions, too) and the platform scales out nicely.

Pierre Ancelot pierre (at) bostoncybertech (dot) com 15 Dec 2005

I tried to implement a load balance mail system with imap and imaps In this case, a user creating an imap folder will create it on only one node.... How do I update a mail received on one host to every other host? Using rsync would delete every mail received in the same time on other servers...

Scott J. Henson scotth () csee ! wvu ! edu 2005-12-15

Perdition or Courier-Imap. Both of them are/have imap proxies (or directors). So what you end up with is one (or more) front end proxies that send the person to the right machine. It looks something like this. Btw, we are moving to such a setup so I don't yet know how well it will work.

The connection hits your load balancers (just straight ip_vs). Then it hands off the connection to a pool of imap proxies. Then each imap proxy figures out which real mail warehouse to send the connection to (ldap is a good place to store this info because it too can be load balanced, the slaves anyway, and then there is no need to load balance the master). At that point the user does their thing and it's all stored on one server. But you have many mail boxes distributed across many mail warehouses.

There is an issue of backups and such for the mail warehouse, but this distributes the load so you can serve many more mail boxes than one server could. I think we are gonna go with RAID arrays and then hot spare mail warehouse mirrors. If the lead warehouse fails the backup comes online (through something like heartbeat). This should be more than enough redundancy for us, but you may want to look into other solutions if you want more, aka fiberchannel or some such.

Obviously this setup can get very complex but become very stable. Depending on your application you can throw money at it or not even have hot spares and trust in your RAID/backups.

Mark msalists () gmx ! net 2005-12-15

Maybe I am misinterpreting this, but it sounds like each mailbox is assigned to exactly one realserver? So you have distributed the mailboxes for load-balancing, but is there any redundancy if one of the boxes goes down?

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 15 Dec 2005

Nope, you're not misinterpreting anything. With mail you can't really have the type of redundancy you can have with say a webserver serving out content. The problem is that imap doesn't like to be distributed. The only way is to do it over something like iSCSI or fiberchannel or some other enterprise level storage medium. What we are doing is to distribute load and to provide some redundancy. If one mail box goes down then we bring up the hot spare, but most of our mail boxes still stay up. Also if one has a failed RAID, then we can move all the mail boxes off of it and bring it down, repair the raid, then bring it back up and move the mail boxes back. It really offers more flexibility and does increase the redundancy on a site level.

Todd Lyons tlyons () ivenue ! com 2005-12-16

Could always go with something like cyrus-imap with the murder extension, which is for an imap cluster. I've never set it up, don't know any details much beyond what I've stated here. But it's supposed to make the mail machines look like a cluster.

Kees Bos k.bos () zx ! nl 2005-12-16

Maybe you can put some kind of imapproxy in front of your imap servers. The imapproxy then has to know about the multiple imap servers and the imapproxy itself can be loadbalanced. I haven't used it myself, but perdition seems to do this: http://www.vergenet.net/linux/perdition/

Scott J. Henson scotth () csee ! wvu ! edu 2005-12-16

Yes, I forgot to mention that, but word from those who have already been there is that it's HARD. At least in my experience it's more trouble than it's worth.
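The proxy layer described in this thread boils down to a lookup: given the login name, find the one warehouse that holds the mailbox, and fall back to its hot spare if the primary is down. Here is a minimal Python sketch of that decision. It is not how perdition or Courier implement it; the dictionaries stand in for the LDAP directory mentioned above, and all user and host names are hypothetical.

```python
# Hypothetical directory contents; in the thread this mapping lives in LDAP.
MAILBOX_HOST = {
    "alice@example.org": "warehouse1.example.org",
    "bob@example.org": "warehouse2.example.org",
}

# Hypothetical hot spares, keyed by the primary warehouse they mirror.
HOT_SPARE = {"warehouse1.example.org": "warehouse1-spare.example.org"}


def backend_for(user, down=frozenset()):
    """Pick the warehouse holding `user`'s mailbox, or its spare if the primary is down."""
    primary = MAILBOX_HOST[user]          # KeyError for unknown users
    if primary not in down:
        return primary
    spare = HOT_SPARE.get(primary)
    if spare is None:
        raise LookupError(f"{primary} is down and has no hot spare")
    return spare
```

Because any proxy can answer this lookup, the proxies themselves can sit behind LVS as ordinary realservers; only the warehouses hold state.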

Commercial Mail Farm

Here's an example of a commercial mail farm using LVS. Commercial ventures are usually loath to tell us how they use LVS - seems they don't understand the spirit of GPL. Even after you're told the setup, you still need someone to get it going and keep it going, so I don't know what advantage they get out of keeping their setups secret. Michael Spiegle popped up on the mailing list and gave some info about a LVS'ed mail farm he'd set up for a customer, so I asked off-list if he'd mind giving us more details. So here it is. Thanks Michael.

Michael Spiegle mike (at) nauticaltech (dot) com 13 Nov 2006

The setup is VERY simple and straightforward. We have 17 mailservers in production right now. Here's some specs on our setup: Pair of LVS-NAT directors, each dual Xeon, 2GB RAM, dual 3com Fiber 1000SX NICs, failover uses keepalived. There is no firewall in front of the directors. The 17 mail servers are dual Xeon, 4GB RAM, dual 10K SCSI disks in RAID1 (software), dual onboard e1000. During peak time today (Monday has highest load), we were doing about 14K active connections at any given time (probably over 30K inactive connections). Actual bandwidth isn't terribly high... due to the nature of the service. The director pair also provides load balancing for a cluster of 30 Sun X1 servers running HTTPd, which has much fewer connections, but is a little more bandwidth intensive. I think we might be pushing 350mbit/sec at peak times (haven't seen MRTG graphs in a while). Our services allow local mailboxes and forwarding to external addresses. We have something in the neighborhood of 250K local mailboxes and 170K external forwards. The load on the director pair is practically non-existent (generally 0.0x). The ONLY time we have EVER come close to maxing out our LVS was during a DDOS attack. The NICs we use don't have interrupt coalescence in the driver, so we were actually running out of interrupts on the box which isn't the fault of LVS.
I don't recall any other metrics from the DDOS attack, but LVS would have swallowed it if we had interrupt coalescence. The mailservers however have some issues which need to be sorted out. They run anywhere from 2 to 15 as the load average. They use their local disks heavily for temporary incoming mail storage. We store all mail on a pair of NetApp FAS-940s which we appear to be pushing the limits on somehow (the excessive amount of NFS-wait we're hitting is driving up the load on the mailservers).

How does localmail get from the realservers to the NetApp?

The mailserver uses multiple daemons to accomplish mail tasks. When a piece of mail comes in, a message is created on the disk. Another daemon is dedicated to figuring out where the messages go (to an external forward, or to a local mailbox). If the message is going to an external forward, it is handled by an external-mail daemon. If the message is staying local, an internal-mail daemon puts it on the NetApp. When a customer connects via POP, the POP server looks up the location of the mailbox in memory, then goes straight to the NetApp to fetch the messages. All NetApps are mounted to the realservers via Linux NFS client.
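The daemon that decides where a spooled message goes can be pictured as a single lookup against the customer tables. The Python sketch below is only an illustration of the flow Michael describes, not his software: the table contents, the paths and the "reject" outcome for unknown recipients are all assumptions.

```python
# Hypothetical in-memory tables standing in for the site's customer data.
LOCAL_MAILBOXES = {"joe@example.com": "/netapp/vol0/j/joe"}
EXTERNAL_FORWARDS = {"sales@example.com": "sales-team@example.net"}


def route(recipient):
    """Decide which daemon handles a spooled message for `recipient`."""
    if recipient in EXTERNAL_FORWARDS:
        # Hand over to the external-mail daemon with the forward target.
        return ("external-mail", EXTERNAL_FORWARDS[recipient])
    if recipient in LOCAL_MAILBOXES:
        # Hand over to the internal-mail daemon with the NFS mailbox path.
        return ("internal-mail", LOCAL_MAILBOXES[recipient])
    return ("reject", None)   # assumption: unknown recipients are refused
```

Since the decision depends only on shared tables and not on which realserver received the message, any mailserver can take any connection.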

nfs used to be a real dog. It's impressive that nowadays you can run a network disk for a machine that's being pushed hard.

On any given server, we've got 300 items in dmesg regarding "couldn't contact NetApp". We have some issues to sort out regarding the NFS mount options we're using. Also, the NetApps have an issue where if a single mailbox hits the dirsize limits (I think its something on the order of 2 million messages in a single directory), the NetApp freaks out and pegs 100% CPU. During this time, load on the mailservers doubles because those mail processes/daemons can't talk to the NetApp and causes a pile-up of connections. It is a real "cluster" in the sense that customer data (email address to local mailbox mapping) is stored in a memory-resident database - therefore, access is VERY fast and any server can handle any number of connections. We have a pair of dual-xeon boxes running tinyDNS to provide caching DNS for the mail cluster. Previously, we had our default gateways on the mailservers set to the 2nd interface of the directors (which were also load balanced). This caused all cache DNS traffic to go through the LVS which resulted in a conntrack table of over 100K at peaktimes. Even though the DNS traffic was UDP, netfilter attempts to create a very basic connection status for the traffic. For example, if mailserver01 sends out a DNS request to dnscache01, netfilter will create a conntrack table entry and label it as unack'd until LVS sees a response from dnscache01. When it sees this response, it relabels the connection to ack'd. Once I realized what was going on, I re-architected the DNS layout slightly to allow the mailservers to directly communicate with the caching nameservers. Now, the LVS runs about 20-25K connections in conntrack. About the limitations of conntrack. Previously (year ago), we had an old pair of DNS caches, which were nothing more than single-proc P3 boxes @ 900Mhz. We NEVER had a problem with these boxes until we provisioned the new cache boxes (dual xeon). 
The old boxes were on the same VLAN as the mailservers, so traffic/connections through LVS weren't really a problem. The new DNS caches were to be "segregated" from other internal networks when we provisioned them, so the mailservers used the LVS as a gateway to reach them. I always had a feeling that our mail servers were "slower" with the new caches than the old ones. I dug deeper one day and found out that we were hitting conntrack limits on LVS from all these DNS queries. I alleviated that as I explained earlier by re-architecting the DNS cache layout, but I still don't feel it was quite up to snuff. I did notice that each of the new DNS caches had netfilter enabled in the kernel (unnecessary addition from our shoddy development team) and we were hitting conntrack limits on them as well. The moral of the story is that I never had these problems with the old servers, because I compiled the kernel WITHOUT netfilter. So in the old kernel, there are no conntrack limits to hit, and the kernel doesn't have to do lookups in a massive 100K table for each connection. If my math serves me right, 100K conntrack entries consume a little over 100MB of RAM. It's not very scientific, but I believe conntrack introduces enough latency to be noticeable at our level.

Since we run a "real" cluster, we can fail out any machine as we please. Any mailserver can handle any customer. We also have a separate LVS-DR web cluster based on linux x86. It runs the same memory-resident database to build apache virtualhost entries on the fly. Also, since each customer has their own IP (for SSL purposes), the realservers have about 65K IPs (250+ class Cs) bound to the loopback. Works beautifully.

How do you handle ?

Since I haven't had a chance to work on that particular system yet, I don't know. I can tell you however that the slave wouldn't have to ARP for all of those IPs in our particular setup. We have a pair of PIX firewalls in front of the LVS which do passthru to the LVS. The only thing the slave has to ARP for is the placeholder IP on the interface, and the firewall will figure out where to send the traffic. True, if the master firewall died, the slave would have to ARP for all the IPs... but we've found the firewalls to be quite reliable.

I'm hoping to be able to push LVS a little farther with a possible project in the upcoming month. I'm leaving this place I currently work at and am going to a company that does lots of media streaming. They're using netscalers to push 10gbit of bandwidth outbound in an asymmetrical-routing sort of way. Since they cost 80K/ea, I'm hoping I can convince them that LVS is just as good if not better for MUCH cheaper.

dns, tcp/udp 53 (and dhcpd server 67, dhcp client 68) For an article containing a section on loadbalancing by DNS, see http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing by Willy Tarreau. Another article about using round robin DNS to load balance services. For rotating/round robin DNS/DNS for geographically distributed load balancing, see DNS uses tcp and udp on port 53. It's a little more complicated than a regular single port service and is in the multiple port section at

http name and IP-based (with LVS-DR or LVS-Tun), tcp 80

http, whether name- or ip-based, is a simple one port service. Your httpd must be listening to the VIP which will be on lo:0 or tunl0:0. The httpd can be listening on the RIP too (on eth0) for mon, but for the LVS you need the httpd listening to the VIP. Thanks to Doug Bagley doug (at) deja (dot) com for getting this info on ip and name based http into the HOWTO.

Both ip-based and name-based webserving in an LVS are simple. In ip-based (HTTP/1.0) webserving, the client sends a request to a hostname which resolves to an IP (the VIP on the director). The director sends the request to the httpd on a realserver. The httpd looks up its httpd.conf to determine how to handle the request (e.g. which DOCUMENTROOT). In name-based (HTTP/1.1) webserving, the client passes the HOST: header to the httpd. The httpd looks up the httpd.conf file and directs the request to the appropriate DOCUMENTROOT. In this case all URLs on the webserver can have the same IP.

The difference between ip- and name-based web support is handled by the httpd running on the realservers. LVS operates at the IP level and has no knowledge of ip- or name-based httpd and has no need to know how the URLs are being handled. Here's the definitive word on ip-based and name-based web support. Here are some excerpts.

The original (HTTP/1.0) form of http was IP-based, i.e. the httpd accepted a call to an IP:port pair, e.g. 192.168.1.110:80. In the single server case, the machine name (www.foo.com) resolves to this IP and the httpd listens to calls to this IP. Here are the lines from httpd.conf:

    ServerName lvs.mack.net
    DocumentRoot /usr/local/etc/httpd/htdocs
    ServerAdmin root@chuck.mack.net
    ErrorLog logs/error_log
    TransferLog logs/access_log

To make an LVS with IP-based httpds, this IP is used as the VIP for the LVS and if you are using LVS-DR/LVS-Tun, then you set up multiple realservers, each with the httpd listening to the VIP (i.e. its own copy of the VIP). If you are running an LVS for 2 urls (www.foo.com, www.bar.com), then you have 2 VIPs on the LVS and the httpd on each realserver listens to 2 IPs. The problem with ip-based virtual hosts is that an IP is needed for each url and ISPs charge for IPs.

Doug Bagley doug (at) deja (dot) com

Name-based virtual hosting uses the HTTP/1.1 "Host:" header, which HTTP/1.1 clients send. This allows the server to know what host/domain the client thinks it is connecting to. A normal HTTP request line only has the request path in it, no hostname, hence the new header. IP-based virtual hosting works for older browsers that use HTTP/1.0 and don't send the "Host:" header, and requires the server to use a separate IP for each virtual domain. The httpd.conf file then has

    ServerName www.foo.com
    DocumentRoot /www.foo.com/
    ..
    ServerName www.bar.com
    DocumentRoot /www.bar.com/
    ..

DNS for both hostnames resolves to 192.168.1.110 and the httpd determines from the "Host:" header which hostname the connection is for. Old (HTTP/1.0) browsers will be served the webpages from the first VirtualHost in the httpd.conf. For LVS again nothing special has to be done. All the hostnames resolve to the VIP and on the realservers, VirtualHost directives are set up as if the machine was a standalone.
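The difference between the two styles can be sketched in a few lines of Python: ip-based serving keys the document root on the destination IP of the connection, while name-based serving keys it on the "Host:" header and falls back to the first virtual host when the header is missing or unknown. This is an illustration of the selection rule only, not Apache's implementation. The second address 192.168.1.111 is hypothetical; the hostnames and document roots follow the examples above.

```python
# Hypothetical tables mirroring the two httpd.conf styles discussed above.
IP_BASED = {
    "192.168.1.110": "/www.foo.com/",
    "192.168.1.111": "/www.bar.com/",   # second VIP, made up for illustration
}
NAME_BASED = [
    ("www.foo.com", "/www.foo.com/"),   # first entry doubles as the default
    ("www.bar.com", "/www.bar.com/"),
]


def docroot_ip_based(dest_ip):
    """ip-based: one IP per site, chosen by the address the client connected to."""
    return IP_BASED[dest_ip]


def docroot_name_based(host_header):
    """name-based: one IP for all sites, chosen by the Host: header."""
    if host_header:
        name = host_header.split(":")[0].lower()   # drop an optional :port
        for server_name, docroot in NAME_BASED:
            if server_name == name:
                return docroot
    return NAME_BASED[0][1]   # no or unknown Host: header gets the first vhost
```

Either way the choice is made entirely on the realserver, which is why the director needs no knowledge of it.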

Ted Pavlic pavlic (at) netwalk (dot) com. Note that in 2000, http://www.arin.net/announcements/ ARIN (look for "name based web hosting" announcements, the link changes occasionally, couldn't find it anymore - May 2002) announced that IP based webserving would be phased out in favor of name based webserving for ISPs who have more than 256 hosts. This will only require one IP for each webserver. (There are exceptions, ftp, ssl, frontpage...)

"/" terminated urls Noah Roberts wrote:

When I use urls like www.myserver.org/directory/ everything works fine. But if I don't have the ending / then my client attempts to find the realserver and ask it, and it uses the hostname that I have in /etc/hosts on the director which is to the internal LAN so it fails badly.

Scott Laird laird (at) internap (dot) com 02 Jul 2001 Assuming that you're using Apache, set the ServerName for the realserver to the virtual server name. When a user does a 'GET /some/directory', Apache returns a redirect to 'http://$servername/some/directory/'.
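Scott's explanation can be shown with a small Python sketch of how the redirect URL is put together. It only illustrates why the configured ServerName ends up in front of the client; the function name is made up, and `rs1.internal.lan` is a hypothetical internal realserver name.

```python
def directory_redirect(server_name, path):
    """Location sent when `path` names a directory but lacks the final '/'."""
    if path.endswith("/"):
        return None   # already slash-terminated, no redirect needed
    return f"http://{server_name}{path}/"


if __name__ == "__main__":
    # ServerName left at the realserver's internal name: the client is sent
    # to a host it cannot reach.
    print(directory_redirect("rs1.internal.lan", "/directory"))
    # ServerName set to the virtual server name: the redirect comes back
    # through the VIP.
    print(directory_redirect("www.myserver.org", "/directory"))
```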

http with LVS-NAT Summary: make sure the httpd on the realserver is listening on the RIP not the VIP (this is the opposite of what was needed for LVS-DR or LVS-Tun). (Remember, there is no VIP on the realserver with LVS-NAT). tc lewis had an (ip-based) non-working http LVS-NAT setup. The VIP was a routable IP, while the realservers were virtual hosts on the non-routable 192.168.1.0/24 network.

Michael Sparks michael (dot) sparks (at) mcc (dot) ac (dot) uk

What's happening is a consequence of using NAT. Your LVS is accepting packets for the VIP, and re-writing them to either 192.168.123.3 or 192.168.123.2. The packets therefore arrive at those two servers marked for address 192.168.123.2 or 192.168.123.3, not the VIP. As a result, when apache sees this: ... it notices that the packets are arriving on either 192.168.123.2 or 192.168.123.3 and not w1.bungalow.intra, hence your problem.

Solutions

1. If this is the only website being serviced by these two servers, change the config so the default doc root is the one you want.
2. If they're servicing many websites, map a realworld IP to an alias on the realservers and use that to do the work. IMO this is messy, and could cause you major headaches.
3. Use LVS-DR or LVS-Tun - that way the above config could be used without problems since the VS address is a local address as well. This'd be my choice.

Joe 10 May 2001 It just occurred to me that a realserver in an LVS-NAT LVS is listening on the RIP. The client is sending to the VIP. In an HTTP 1.1 or name based httpd, doesn't the server get a request with the URL (which will have the VIP) in the payload of the packet (where an L4 switch doesn't see it)? Won't the server be unhappy about this? This has come up before with name based services like https and for indexing of webpages. Does anyone know how to force an HTTP 1.1 connection (or to check whether the connection was HTTP 1.0 or 1.1) so we can check this?

Paul Baker pbaker (at) where2getit (dot) com 10 May 2001 The HTTP 1.1 request (and also 1.0 requests from any modern browser) contains a Host: header which specifies the hostname of the server. As long as the webservers on the realservers are aware that they are serving this hostname, there should be no issue with 1.1 vs 1.0 http requests.

so both virtualHost and servername should be the reverse dns of the VIP?

Yes. Your Servername should be the reverse dns of the VIP and you need to have a Virtualhost entry for it as well. In the event that you are serving more than one domain on that VIP, then you need to have a VirtualHost entry for each domain as well.

what if instead of the name of the VIP, I surf to the actual IP? There is no device with the VIP on the LVS-NAT realserver. Does there need to be one? Will an entry in /etc/hosts that maps the VIP to the public name do?

Ilker Gokhan IlkerG (at) sumerbank (dot) com (dot) tr If you write the URL with an IP address, such as http://123.123.123.123/, the Host: header is filled with this IP address, not the hostname. You can see it using any network monitor program (e.g. tcpdump).
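
The point about the Host: header can be illustrated with a small Python sketch. It only shows how a client derives the header from the URL the user typed; the function name host_header_for is made up for this example, and IPv6 literals are not handled.

```python
from urllib.parse import urlsplit

def host_header_for(url):
    """Return the Host: header value a client would send for this URL.

    Hypothetical helper for illustration; not part of any LVS tool.
    """
    parts = urlsplit(url)
    host = parts.hostname or ""
    # Clients normally omit the port when it is the default for the scheme.
    default_port = {"http": 80, "https": 443}.get(parts.scheme)
    if parts.port is not None and parts.port != default_port:
        return "%s:%d" % (host, parts.port)
    return host

if __name__ == "__main__":
    # An IP-literal URL puts the IP, not a name, into the Host: header.
    print(host_header_for("http://123.123.123.123/"))
    print(host_header_for("http://lvs.foobar.com/"))
```

A name based VirtualHost on the realserver therefore only matches if it is configured for whatever string arrives here, whether that is the public name or the bare VIP.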

httpd is stateless and normally closes connections

http is stateless, in that the httpd has no memory of your previous connections. Unlike other stateless protocols (NFS, ntp) a connection is made (it is tcp rather than udp based). However httpd will usually attempt to disconnect as soon as possible, in which case you will not see any entries in the ActiveConn column of the ipvsadm output. For HTTP/1.1, the browser/server can negotiate a persistent httpd connection.

If you look with ipvsadm to see the activity on an LVS serving httpd, you won't see much. A non-persistent httpd on the realserver closes the connection after sending the packets. Here's the output from ipvsadm, immediately after retrieving a gif filled webpage from a 2 realserver LVS.

    RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:www rr
      -> RS2.mack.net:www     Masq    1      2          12
      -> RS1.mack.net:www     Masq    1      1          11

The InActConn column shows the connections that transferred hits, have been closed and are in the FIN state waiting to timeout. You may see "0" in the InActConn column, leading you to think that you are not getting the packets via the LVS.

Roberto Nibali ratz@drugphish.ch 22 Dec 2003 If you want to see connections before they are closed, you should invoke ipvsadm with watch, or if you want it realtime (warning, eats a lot of CPU time): When a client connects, you'll see a positive integer in the ActiveConn column.
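
If you want to total the two counters rather than read them by eye, the realserver lines of the ipvsadm output are easy to pick apart. The following is a minimal Python sketch, assuming output shaped like the listing above (realserver lines start with "->" and end with weight, ActiveConn and InActConn). The function name parse_ipvsadm is invented for this example; it is not part of ipvsadm.

```python
def parse_ipvsadm(text):
    """Return a list of (realserver, forward, weight, active, inactive).

    Assumes realserver lines of the form
      -> host:port  Forward  Weight  ActiveConn  InActConn
    and silently skips headers and virtual service lines.
    """
    rows = []
    for line in text.splitlines():
        fields = line.split()
        if (len(fields) == 6 and fields[0] == "->"
                and all(f.isdigit() for f in fields[3:])):
            _, rs, fwd, weight, active, inactive = fields
            rows.append((rs, fwd, int(weight), int(active), int(inactive)))
    return rows

SAMPLE = """\
RemoteAddress:Port        Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> RS2.mack.net:www     Masq    1      2          12
  -> RS1.mack.net:www     Masq    1      1          11
"""

if __name__ == "__main__":
    for rs, fwd, weight, active, inactive in parse_ipvsadm(SAMPLE):
        print(rs, "active:", active, "inactive:", inactive)
```

Feeding it the text captured from ipvsadm at intervals gives a crude record of how quickly connections move from ActiveConn to InActConn.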

netscape/database/tcpip persistence (keepalives)

With the first version of the http protocol, HTTP/1.0, a client would request a hit/page from the httpd. After the transfer, the connection was dropped. It is expensive to set up a tcp connection just to transfer a small number of packets, when it is likely that the client will be making several more requests immediately afterwards (e.g. if the client downloads an html page which contains gifs, then after parsing the html page, the client will request the gifs). With HTTP/1.1, application level persistent connection became possible. The client/server pair negotiate to see if persistent connection is possible. The server uses an algorithm to determine when to drop the connection (KeepAliveTimeout, usually 15sec, or when it needs to recover file handles...). The client can drop the connection at any time without consulting the server (e.g. when it has got all the hits on a page). Persistence is only allowed when the file transfer size is known ahead of time (e.g. an html page or a gif). The output from a cgi script is of unknown size and it will be transferred by a regular (non-persistent) connection.

Persistent connection requires more resources from the server, as file handles can be open for much longer than the time needed for a tcpip transfer. The number of keepalive connections/client and the timeout are set in httpd.conf. Persistent connection with apache is called keepalive (http://www.auburn.edu/docs/apache/keepalive.html) and is described in http persistent connection.

With the introduction of lingering_close() to apache_v1.2, a bug in some browsers would hold open the connection forever (see Connections in FIN_WAIT_2 and Apache, http://www.auburn.edu/docs/apache/misc/fin_wait_2.html), leaving the output of netstat on the server filled with connections in FIN_WAIT_2 state. This required the addition of an RFC violating timeout for the FIN_WAIT_2 state to the server's tcpip stack.

Kees Hoekzema kees (at) tweakers (dot) net 17 Feb 2005 When using keepalive a client opens a connection to the cluster and that connection stays open (for as long as the client wants, or until a timeout occurs serverside). So the loadbalancer cannot (at normal LVS level) see whether it is a normal connection with just one large request from the server or a keepalive connection with lots of requests. As far as I know persistence has no influence on keepalive.

Jacob Coby

The Apache KeepAlive option is one of the first things to turn off when you start getting a lot of traffic.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 2005/02/18 This is a dangerous shortcut. Sometimes, opening/closing connections a hundred times a second could bring down your server. Turning off HTTP keepalive implies a good choice between the 3 available apache mpms. In my case, I turn it off for the banner server, but keep it on (only for a short time) on our product sites (which have lots of little images). Having keepalive on or off will not affect your LVS performance; LVS persistence (aka affinity) will.

Alois Treindl alois (at) astro (dot) ch 30 Apr 2001

when I reload a page on the client, the browser makes several http hits on the server for the graphics in the page. These hits are load balanced between the realservers. I presume this is normal for HTTP/1.0 protocol, though I would have expected Netscape 4.77 to use HTTP/1.1 with one connection for all parts of a page.

Joe Here's the output of ipvsadm after downloading a test page consisting of 80 different gifs (the html file has 80 lines of <img src="foo.gif">).

    RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:http rr
      -> RS2.mack.net:http    Route   1      2          0
      -> RS1.mack.net:http    Route   1      2          0

It would appear that the browser has made 4 connections which are left open. The client shows (netstat -an) 4 connections which are ESTABLISHED, while the realservers show 2 connections each in FIN_WAIT2. Presumably each connection was used to transfer an average of 20 requests. If the client-server pair were using persistent connection, I would expect only one connection to have been used.

Andreas J. Koenig andreas (dot) koenig (at) anima (dot) de 02 May 2001

Netscape just doesn't use a single connection, and not only Netscape. All major browsers fire mercilessly a whole lot of connections at the server. They just don't form a single line, they try to queue up on several ports simultaneously... ...and that is why you should never set KeepAliveTimeout to 15 unless you want to burn your money. You keep several gates open for a single user who doesn't use them most of the time while you lock others out.

Julian Hm, I think the browsers fetch the objects by creating 3-4 connections (not sure how many exactly). If there is a KeepAlive option in the httpd.conf you can expect a small number of inactive connections after the page download is completed. Without this option the client is forced to create new connections after each object is downloaded and the HTTP connections are not reused. The browsers reuse the connection but there is more than one connection. KeepAlive Off can be useful for banner serving but a short KeepAlive period has its advantages in some cases with long rtt where the connection setup costs time and because modern browsers are limited in the number of connections they open. Of course, the default period can be reduced but its value depends on the served content, whether the client is expected to open many connections for a short period or just one.

Peter Mueller pmueller (at) sidestep (dot) com 01 May 2001

I was searching around on the web and found the following relevant links..

Andreas J. Koenig andreas (dot) koenig (at) anima (dot) de 02 May 2001

If you have 5 servers with 15 secs KeepAliveTimeout, then you can serve 60*60*24*5/15 = 28800 requests per day

Joe: don't you actually have MaxClients=150 servers available and this can be increased to several thousand presumably?

Peter Mueller: I think a factor of 64000 is forgotten here (number of possible reply ports), plus the fact that most http connections do seem to terminate immediately, despite the KeepAlive.

Andreas (?)

Sure, and people do this and buy lots of RAM for them. But many of them servers are just in 'K' state, waiting for more data on these KeepAlive connections. Moreover, they do not compile the status module into their servers and never notice. Let's rewrite the above formula: MaxClients / KeepAliveTimeout denotes the number of requests that can be satisfied if all clients *send* a keepalive header (I think that's "Connection: keepalive") but *do not actually use* the kept-alive line. If they actually use the kept-alive line, you can serve more, of course. Try this: start apache with the -X flag, so it will not fork children and set the keepalivetimeout to 60. Then load a page from it with Netscape that contains many images. You will notice that many pictures arrive quickly and a few pictures arrive after a long, long, long, looooong time. When the browser parses the incoming HTML stream and sees the first IMG tag it will fire off the first IMG request. It will do likewise for the next IMG tag. At some point it will reach an IMG tag and be able to re-use an open keepalive connection. This is good and does save time. But if a whole second has passed after a keepalive request it becomes very unlikely that this connection will be re-used ever, so 15 seconds is braindead. One or two seconds is OK. In the above experiment my Netscape loaded 14 images immediately after the HTML page was loaded, but it took about a minute for each of the remaining 4 images which happened to be the first in the HTML stream.
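
The arithmetic in this exchange can be checked with a few lines of Python. This is only a sketch of the worst case Andreas describes, in which every server slot handles one request and then sits idle for the whole KeepAliveTimeout. The function name is made up for the example, and the 150 is the MaxClients figure Joe mentions.

```python
SECONDS_PER_DAY = 60 * 60 * 24

def idle_keepalive_requests_per_day(slots, keepalive_timeout_s):
    """Worst case: each slot serves one request, then idles until the
    keepalive timeout expires before it can serve the next client."""
    return slots * SECONDS_PER_DAY // keepalive_timeout_s

if __name__ == "__main__":
    print(idle_keepalive_requests_per_day(5, 15))    # 28800, Andreas's figure
    print(idle_keepalive_requests_per_day(150, 15))  # 864000 with MaxClients=150
    print(idle_keepalive_requests_per_day(150, 2))   # 6480000 with a 2 sec timeout
```

Real traffic does better than this, since clients usually close the connection themselves or reuse it, but the sketch shows why the timeout value matters as much as the number of slots.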

Joe Here's the output of ipvsadm after downloading the same 80 gif page with the -X option on apache (only one httpd is seen with ps, rather than the 5 I usually have).

    RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:http rr
      -> RS2.mack.net:http    Route   1      1          1
      -> RS1.mack.net:http    Route   1      0          2

The page shows a lot of loading at the status line, then stops, showing 100% of 30k. However the downloaded page is blank. A few seconds later the gifs are displayed. The client shows 4 connections in CLOSE_WAIT and the realservers each show 2 connections in FIN_WAIT2.

Paul J. Baker pbaker (at) where2getit (dot) com 02 May 2001

The KeepAliveTimeout value is NOT the connection time out. It says how long Apache will keep an active connection open waiting for a new request to come on the SAME connection after it has fulfilled a request. Setting this to 15 seconds does not mean apache cuts all connections after 15 seconds. I write server load-testing software so I have done quite a bit of research into the behaviour of each browser. If Netscape hits a page with a lot of images on it, it will usually open about 8 connections. It will use these 8 connections to download things as quickly as it can. If the server cuts each connection after 1 request is fulfilled, then the Netscape browser has to keep reconnecting. This costs a lot of time. KeepAlive is a GOOD THING. Netscape does close the connections when it is done with them, which will be well before the 15 seconds since the last request expire. Think of KeepAliveTimeout as being like an Idle Timeout in FTP. Imagine it being set to 15 seconds.

Ivan Pulleyn ivan (at) sixfold (dot) com 23 Jan 2003

To totally fragment the request, if using apache, 'KeepAlive off' option will disable HTTP keep-alive sessions. So a single browser load will have to connect() many times; once for the document, then again for each image on the page, style sheet, etc. Also, sending a pragma no-cache in the HTTP header would be a good idea to ensure the client actually reloads.

Sudden Changes in InActConn nigel (at) turbo10 (dot) com

This weekend the web service we run came under increased load --- about an extra 10,000,000 queries per day ---- when InActConn went from 200-300 to 2000+ in about 60 seconds and the LVS locks up.

Rob ipvsuser (at) itsbeen (dot) sent (dot) com 2005/03/13 I had a high number of inactive connections with apache set up to not use keepalive at all. After activating keepalive in apache (LVS was already persistent) the number of inactive connections went way down. So in my case at least, connections were set up, used for a single GET for a gif, button, jpeg, js script, or other page component, then the server closed the connection, only to open another for the next gif, etc.

You might be able to use something like multilog to watch a bunch of the logs at the same time to get an idea if the traffic looks like real people (get page 1, get page 1 images, get page 2, get page 2 images) or if it is random hammering from a dos attack. I wrote a small shell script that pulled the recent log entries, counted the hits per IP address for certain requests and then created an iptables rule on the director (or some machine in front of the director) to tarpit requests from that IP. This worked in my situation because we knew that certain URLs were only hit a small number of times during a legit use session (like a login page shouldn't be hit 957 times in an hour by the same external IP). This could help reduce the tide of requests if you are actually encountering a (d)dos. I ran it every 12 minutes or so.

If you are getting ddos'd, the tarpit function of iptables (http://www.securityfocus.com/infocus/1723) or the tarpit standalone can be a great help. Also, Felix and his company seem to have helped some large companies deal with high traffic ddos attacks (http://www.fefe.de/).

BTW, you might be interested in http://www.backhand.org/mod_log_spread/ for centralized and redundant logging. That way you can run different kinds of real time analysis with no extra load on the webservers or the normal logging hosts by just having an additional machine join/subscribe to the multicast spread group with the log data.
OK I can't find my script, but this was the start of it, it is hardly a shell script (but someone may find it useful): Add a "grep blah" command just before the awk '{print $2}' if you want just certain requests or other filtering.
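
Rob's script itself is not shown, so the following is not his code. It is a minimal Python sketch of the counting step he describes, under the assumption that the log is in Apache common log format (client address in the first field, request path in the seventh whitespace-separated field). The function name noisy_clients, the ip_field parameter and the thresholds are all invented for the example; turning the result into iptables rules is deliberately left out.

```python
from collections import Counter

def noisy_clients(log_lines, url_prefix, max_hits, ip_field=0):
    """Return the sorted client addresses that requested paths starting
    with url_prefix more than max_hits times.

    Assumes common log format:
      ip - - [date zone] "METHOD /path HTTP/x.y" status bytes
    so that the path is whitespace field 6 (counting from 0).
    """
    hits = Counter()
    for line in log_lines:
        fields = line.split()
        if len(fields) > 6 and fields[6].startswith(url_prefix):
            hits[fields[ip_field]] += 1
    return sorted(ip for ip, n in hits.items() if n > max_hits)

if __name__ == "__main__":
    import sys
    # Hypothetical usage: python noisy.py < access_log
    for ip in noisy_clients(sys.stdin, "/login", 100):
        print(ip)
```

If your log format has the client address in a different column, pass ip_field accordingly.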

dynamically generated images on web pages

On static webpages, all realservers serve identical content. A dynamically generated image is only present on the webserver that gets the request (and which generates the image). However the director will send the client's request for that image to any of the realservers and not necessarily to the realserver that generated the image. Solutions are

- generate the images in a shared directory/filesystem
- use fwmark to set up the LVS.

Both methods are described in the section using fwmark for dynamically generated images.

http: sanity checks, shutting down, indexing programs, htpasswd, apache proxy and reverse proxy to look at URL, mod_backhand, logging

People running webservers are interested in optimising throughput and often want to look at the content of packets. You can't do this with LVS, since LVS works at layer 4. However there are many ways of looking at the content of packets that are passed through an LVS to backend webservers. Material on reverse proxies is all through this HOWTO. I haven't worked out whether to pull it all together or leave it in the context it came up. As a start...

defn: forward and reverse proxy: adapted from Apache Overview HOWTO (http://www.tldp.org/HOWTO/Apache-Overview-HOWTO-2.html) and Running a Reverse Proxy with Apache (http://www.apacheweek.com/features/reverseproxies). See also http://en.wikipedia.org/wiki/Proxy_server and http://en.wikipedia.org/wiki/Reverse_proxy

A proxy is a program that performs requests on behalf of another program. The source and destination IPs on the packets do not change.

forward proxy: (the traditional http proxy), accepts requests from clients, contacts the remote http server and returns the response. An example is "squid". The main advantage of a squid is that it caches responses (it's a proxy cache). Thus a repeat request for a webpage will be returned more quickly, since the proxy cache will (usually) be closer to the client on the internet. Forward proxies are of interest because of their caching. That they cache by doing a proxy request is not of much interest to users.

reverse proxy: a webserver placed in front of other servers, providing a unified front end to the client, but offloading certain tasks, e.g. SSL processing, FTP from the backend webservers to other machines. The most common reason to run a reverse proxy is to enable controlled access from the internet to servers behind a firewall. Squid can also reverse proxy.

(Quite what is "reverse" about this sort of proxy, I don't know. Perhaps they needed a name to differentiate it from "forward". "reverse" is not a helpful name here. Both forward and reverse proxies have the same functionality: the forward proxy is at the client end, while the reverse proxy is at the server end.) In some sense, the LVS director functions as a reverse proxy for the realservers.

sanity checks

When first setting up, to check that your LVS is working...

- put something different on each realserver's web page (e.g. the string "realserver_1" at the top of the homepage).
- use rr for your scheduler.
- make sure you're rotating through the different web pages (each one is different) and look at the output of ipvsadm to see a new connection (probably InActConn).
- ping the VIP from a machine directly connected to the outside of the director. Then check the MAC address for the VIP with arp -a

replies coming from wrong VIP (check configs)

Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com 24 Feb 2007 (with help from Graeme Fowler) I have several VIPs. Regardless of the VIP the client connects to, they get a response from a different IP which never varies. I found out that everything was working the way it should with https, which further led me into debugging our apache setup rather than LVS. Apache didn't have the appropriate virtual hosts configured for all the VIPs. This is why I always saw the same IP address as source: the IP of the _default_ (first configured) apache virtual host.

Shutting down http You need to shut down httpd gracefully, by bringing the weight to 0 and letting connections drop, or you will not be able to bind to port 80 when you restart httpd. If you want to do on the fly modifications to your httpd, and keep all realservers in the same state, you may have problems. Thornton Prime thornton (at) jalan (dot) com 05 Jan 2001

I have been having some problems restarting apache on servers that are using LVS-NAT and was hoping someone had some insight or a workaround. Basically, when I make a configuration change to my webservers and I try to restart them (either with a complete shutdown or even just a graceful restart), Apache tries to close all the current connections and re-bind to the port. The problem is that invariably it takes several minutes for all the current connections to clear even if I kill apache, and the server won't start as long as any socket is open on port 80, even if it is in a 'CLOSING' state.

Michael E Brown wrote: Catch-22. I think the proper way to do something like this is to take the affected server out of the LVS table _before_ making any configuration changes to the machine. Wait until all connections are closed, then make your change and restart apache. You should run into less problems this way. After the server has restarted, then add it back into the pool.

I thought of that, but unfortunately I need to make sure that the servers in the cluster remain in a near identical state, so the reconfiguration time should be minimal.

Julian wrote Hm, I don't have such problems with Apache. I use the default configuration-time settings, may be with higher process limit only. Are you sure you use the latest 2.2 kernels in the realservers?

I'm guessing that my problem is that I am using LVS persistent connections, and combined with apache's lingering close this makes it difficult for apache to know the difference between a slow connection and a dead connection when it tries to close down, so the time it takes to clear some of the sockets approaches my LVS persistence time. I haven't tried turning off persistence, and I haven't tried re-compiling apache without lingering-close. This is a production cluster with rather heavy traffic and I don't have a test cluster to play with. In the end rebooting the machine has been faster than waiting for the ports to clear so I can restart apache, but this seems really dumb, and doesn't work well because then my cluster machines have different configuration states.

One reason for your servers to block is a very low value for the client number. You can build apache in this way: and then to increase MaxClients (up to the above limit). Try with different values. And don't play too much with the MinSpareServers and MaxSpareServers. Values near the default are preferred. Is your kernel compiled with higher value for the number of processes:

Is there any way anyone knows of to kill the sockets on the webserver other than simply wait for them to clear out or rebooting the machine? (I tried also taking the interface down and bringing it up again ... that didn't work either.) Is there any way to 'reset' the MASQ table on the LVS machine to force a reset?

No way! The masq follows the TCP protocol and it is transparent to both ends. The expiration timeouts in the LVS/MASQ box are high enough to allow the connection termination to complete. Do you remove the realservers from the LVS configuration before stopping the apaches? This can block the traffic and can delay the shutdown. It seems the fastest way to restart apache is apachectl graceful, of course, if you don't change anything in apachectl (in the httpd args).

Running indexing programs (e.g. htdig) on the LVS

(From Ted I think)

Setup: realservers are node1.foobar.com, node2.foobar.com... nodeN.foobar.com, director has VIP=lvs.foobar.com (all realservers appear as lvs.foobar.com to users).

Problem: if you run the indexing program on one of the (identical) realservers, the urls of the indexed files will refer to that realserver's own name (e.g. node1.foobar.com). These urls will be unusable by clients out in internetland since the realservers are not individually accessible by clients. If instead you run the indexing program from outside the LVS (as a user), you will get the correct urls for the files, but you will have to move/copy your index back to the realservers.

Solution (from Ted Pavlic, edited by Joe). On the indexing node, if you are using LVS-NAT add a non-arping device (e.g. lo:0, tunl0, ppp0, slip0 or dummy) with IP=VIP as if you were setting up LVS-DR (or LVS-Tun). With LVS-DR/LVS-Tun this device with the VIP is already set up. The VIP is associated in dns with the name lvs.foobar.com. To index, on the indexing node, start indexing from http://lvs.foobar.com and the realserver will index itself giving the URLs appropriate for the user in the index.

Alternately (for LVS-NAT), on the indexing node, add the following line to /etc/hosts. make sure your resolver looks to /etc/hosts before it looks to dns and then run your indexing program. This is a less general solution, since if the name of lvs.foobar.com was changed to lvs.bazbar.com, or if lvs.foobar.com is changed to be a CNAME, then you would have to edit all your hosts files. The solution with the VIP on every machine would be handled by dns.

There is no need to fool with anything unless you are running LVS-NAT.

htpasswd with http Noah Roberts wrote: If anyone has had success with htpasswords in an LVS cluster please tell me how you did it. Thornton Prime thornton (at) jalan (dot) com Fri, 06 Jul 2001

We have HTTP authentication working on dozens of sites through LVS with all sorts of different password storage from old fashioned htpasswd files to LDAP. LVS when working properly is pretty transparent to HTTP between the client and server.

apache proxy (reverse proxy) rather than LVS Tony Requist

We currently have a LVS configuration with 2 directors and a set of web servers using LVS-DR and keepalived between the directors (and a set of MySql servers also). This is all working well using the standard RR scheduling without persistence. We will be adding functionality that will be storing data on some but not all web servers. For this, we need to be able to route requests to specific web servers according to the URL. Ideally I would generate URLs like: and I could have a little code in LVS (or called from LVS) where I could decode KEY to find that the data is on server A, B and C -- then have LVS route to one of these three servers. I've looked through the HOWTO and searched around but I have not been able to find anything.

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 20 Jul 2005 I would personally use apache proxy statements on the servers that don't have the information. This will increase load slightly, but is probably the easiest. If you really want to go the LVS route, there are some issues, I believe. If my memory serves, the current version of LVS is a level 4 router and to do what you want, you would need a level 7 router. I have heard of some patches floating around to turn ip_vs into a level 7 router, but I've not seen them personally, nor tried them. L7 requires much more computation than L4. You don't want to do anything at L7 that you can handle any way at all at L4.

Todd Lyons tlyons (at) ivenue (dot) com 20 Jul 2005 This is application level, not network level. The better solution (IMHO) is to put two machines doing reverse proxy with the rules to send the requests to the correct server. Then have your load balancers balance among the two rproxies. A poor man's solution would be to put the reverse proxying on the webservers themselves. This is not really good for HA though, since you don't have redundancy if there is only one webserver serving a particular resource. If the reverse proxies have a different IP than the public IP of the webservers, then you have more options.

Andres Canada

When the cluster node gets the request it looks to apache configuration and whatever to serve this request. When a Director node receives a request for a special web application that is only in one cluster node (for example node35) there should be something inside it that send that request to a "special node" (not the next one if it's using round robin, but the node35).

Todd Lyons tlyons (at) ivenue (dot) com Dec 14 2005 You should consider setting up a reverse proxy. This is a machine that sits in front of your apache boxen, examines urls, sends the requests to various private apache servers, and sends the reply back. The outside world doesn't talk directly to the private apache servers. In our case, we have several different machines that handle different types of requests. We have 3 rp's sitting in front of them getting load balanced by two LVS boxen. The load balanced rp's receive the GET or POST from the outside world, examine it, send the request to the appropriate private machine, wait for the reply, and send that to the requesting client.

mod_backhand From Lars lmb (at) suse (dot) de mod_backhand, a method of balancing apache httpd servers that looks like ICP for web caches. Jessica Yang, 8 Oct 2004

Our application requires L7 load balancing because we use URL rewriting to keep the session info in the requested URLs, like this: http://ip/servlet/servletname;jsessionid =*****. Basically, we want the load balancer to deliver requests that have the same jsessionid to the same realserver. Looking through the LVS documentation, KTCPVS seems to be able to provide L7 load balancing, but I couldn't find any documentation about compiling, configuring, features and commands of KTCPVS. Does KTCPVS have the feature to distinguish the jsessionid in the requested URL and/or in the Cookie header? Does KTCPVS have to be bundled together with IPVS? And what is the process to make it work?

Wensong 09 Oct 2004 KTCPVS has a cookie-injection load balancing feature just as you described. You can use something like the following

Jacob Coby jcoby (at) listingbook (dot) com 08 Oct 2004 It almost sounds like you need to use a proxy instead of LVS to do the load balancing. If something in your jsessionid is unique to a server, it would be very simple to make a rewrite rule accomplish what you want.

Clint Byrum cbyrum (at) spamaps (dot) org 08 Oct 2004 mod_backhand (http://www.backhand.org/mod_backhand/). VERY nice load balancing proxy module for apache. It does require that your content be served from Apache 1.3 (Windows or Unix) though.

cheaney Chen
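
To make concrete what any L7 balancer or proxy has to do for Jessica's case, here is a minimal Python sketch. It is not KTCPVS configuration and not an Apache rewrite rule; it only illustrates the idea of pulling the jsessionid out of the URL and mapping it to the same realserver every time. The function names and the realserver names RS1 to RS3 are hypothetical, and hashing the session id is one possible policy, assumed here for illustration.

```python
import hashlib
import re

# Matches the servlet-style path parameter ";jsessionid=VALUE".
JSESSIONID = re.compile(r";jsessionid=([^?#;/]+)", re.IGNORECASE)

def session_id(url_path):
    """Return the jsessionid embedded in the URL path, or None."""
    m = JSESSIONID.search(url_path)
    return m.group(1) if m else None

def pick_realserver(url_path, realservers):
    """Map requests carrying the same jsessionid to the same realserver."""
    sid = session_id(url_path)
    if sid is None:
        # No session yet: any scheduling policy would do, take the first.
        return realservers[0]
    digest = hashlib.sha1(sid.encode("utf-8")).digest()
    return realservers[int.from_bytes(digest[:4], "big") % len(realservers)]

if __name__ == "__main__":
    servers = ["RS1", "RS2", "RS3"]  # hypothetical realserver names
    print(pick_realserver("/servlet/servletname;jsessionid=ABC123", servers))
    print(pick_realserver("/servlet/other;jsessionid=ABC123?x=1", servers))
```

Both requests print the same realserver because they carry the same session id. Note that a plain modulo mapping like this reshuffles sessions whenever the list of realservers changes, which is one reason real L7 balancers tend to use cookies or a session table instead.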

There are a lot of different kinds of SLB techniques, ex. DNS-based, Dispatcher-based(like LVS), and server-based , etc. And my question is, for a commercial web site (like yahoo or ...) how to do SLB. What methods are used to handle huge numbers of client's requests? Combination of SLB techniques above , or ... ?!

Clint Byrum cbyrum (at) spamaps (dot) org 06 Jan 2005 I've used LVS for frontend balancing, and backhand (www.backhand.org) for backend. In short, mod_backhand takes specific resource-intensive requests and proxies them to whichever servers are least busy. It works *VERY* well. We have a farm of cheap boxes serving lots of CPU intensive requests and every box has the same exact load average within 2-3%. It even allows persistence. Only downside is it requires Apache 1.3, but so far that hasn't been a problem for us. :)

Apache logging isplist (at) logicore (dot) net 2007-07-25

How can I exclude the logging from the LVS servers on apache? The constant checking for the host is creating VERY large log files.

Graeme Fowler graeme (at) graemef (dot) net 25 Jul 2007 This is really a question you should be asking on an Apache mailing list, but anyway... The easiest thing to do is to create a separate <VirtualHost blah> definition that simply logs to /dev/null:

    <VirtualHost blah>
      ServerName blah.test.domain
      CustomLog /dev/null combined
      ...other directives...
    </VirtualHost>

then configure whatever healthcheck/monitor app you're using to query the virtual host blah.test.domain by hostname. That differs so much between mon, keepalived and ldirectord that I'll leave that as an exercise for you. However, I have to say that even with a check interval of 1 second that would only give you 86400 lines per day. If you're using LB of any form I'd expect you to be generating that number of entries per hour, if not more.

HTTP 1.0 and 1.1 requests Joe: Does anyone know how to force an HTTP 1.1 connection? Patrick O'Rourke orourke (at) missioncriticallinux (dot) com:

httperf has an 'http-version' flag which will cause it to generate 1.0 or 1.1 requests.
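
If httperf is not to hand, the request line can also be written by hand, since the HTTP version is just text in the first line of the request. The following Python sketch builds such a request and optionally sends it over a plain socket. The function names are invented for this example. Note that the version in the server's status line is the server's own and need not match the version you sent.

```python
import socket

def build_request(host, path="/", version="1.1"):
    """Build a GET request with an explicit HTTP version in the request line."""
    if version not in ("1.0", "1.1"):
        raise ValueError("version must be '1.0' or '1.1'")
    lines = [
        "GET %s HTTP/%s" % (path, version),
        "Host: %s" % host,
        # Ask the server to close after one reply so the test stays simple.
        "Connection: close",
    ]
    return ("\r\n".join(lines) + "\r\n\r\n").encode("ascii")

def status_line(addr, host, version="1.1", port=80):
    """Send the request to addr (e.g. the VIP) and return the status line."""
    with socket.create_connection((addr, port), timeout=5) as s:
        s.sendall(build_request(host, "/", version))
        reply = s.recv(4096)
    return reply.split(b"\r\n", 1)[0].decode("latin-1")

if __name__ == "__main__":
    print(build_request("lvs.foobar.com", "/", "1.0").decode("ascii"))
```

Calling status_line() with the VIP as addr and the public name as host sends a request of the chosen version through the director.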

Large HTTP POST with LVS-Tun

If a client does a large (packet>MTU) POST through a tunnel device (i.e. LVS-Tun) the MTU will be exceeded. This is normally handled by the icmp layer, but linux kernels, at least up to 2.4.23, do not handle this properly for paths involving tunnel devices. see .
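
The size problem is simple arithmetic. The sketch below assumes that the tunnel is plain IPIP encapsulation, which adds one 20 byte IPv4 header (no options) to each packet, and that the path MTU is the usual ethernet 1500 bytes. Both numbers are assumptions for illustration, as are the function names.

```python
IPIP_OVERHEAD = 20  # bytes: one extra IPv4 header without options (assumed)

def max_inner_packet(path_mtu=1500):
    """Largest client packet that still fits after encapsulation."""
    return path_mtu - IPIP_OVERHEAD

def fits_in_tunnel(inner_packet_len, path_mtu=1500):
    """True if the encapsulated packet does not exceed the path MTU."""
    return inner_packet_len + IPIP_OVERHEAD <= path_mtu

if __name__ == "__main__":
    print(max_inner_packet())    # 1480
    print(fits_in_tunnel(1500))  # False: a full sized packet no longer fits
    print(fits_in_tunnel(1480))  # True
```

A client filling its packets to 1500 bytes, as it will during a large POST, produces packets that are too big once the director has wrapped them, which is when the icmp handling mentioned above comes into play.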

http keepalive - effect on InActConn Randy Paries rtparies (at) gmail (dot) com 07 Feb 2006

I just added a new realserver (local.lovejoy). It has many more InActConn than the other servers. It's newer hardware. Any ideas?

      RemoteAddress:Port       Forward Weight ActiveConn InActConn
    TCP  www.unitnet.com:http wlc persistent 1800 mask 255.255.255.0
      -> local.lovejoy:http    Route   1      113        5568
      -> local.krusty:http     Route   1      97         223
      -> local.flanders:http   Route   1      91         158
    TCP  www.unitnet.com:https wlc persistent 1800 mask 255.255.255.0
      -> local.flanders:https  Route   1      0          12

Graeme Fowler graeme (at) graemef (dot) net 7 Feb 2006 The newer machine is a newer OS, running a newer version of Apache and probably newer hardware (OK, those last two are assumptions) - I bet it's responding more quickly. The InActConn counter displays those connections in TIME_WAIT or related states, after a FIN packet has arrived to end the connection. If you run: You will probably see that lovejoy is handling rather more traffic than krusty and flanders.

Joe - this could have been the answer, but it wasn't. It was a timeout problem though.

Randy Paries

This ended up being the KeepAlive setting (or lack thereof) in httpd.conf. I changed it to KeepAlive On and the problem went away.

Fallback/Sorry pages with Apache Gustavo Mateus I want to customize a fallback server page for each of the 10 web sites (domains) running on 5 realservers. The way I imagine it can be done is setting lighttpd to respond to 10 different IPs: one IP on the fallback server for every virtual server that I have. Is there a way to avoid that?

prockter (at) lse (dot) ac (dot) uk 30 May 2007 The fallback web server can use virtual hosts just like any other web service, so you can provide all sorry pages (little mini sites with graphics and all) from a single server. Or you can use a cgi script which varies what it does based on the environment (which will include virtual host information). Very very ancient web browsers don't send enough information and to support them you will have to use IP based hosting, so if you want a single IP just provide a catch-all page for those few (if any) browsers. The information about which virtual host is wanted comes in the HTTP request from the user's browser, just as it does when they talk to the real service.

Testing http with apachebench (ab) Larry Ludwig ludes (at) yahoo (dot) com 12 Nov 2006 From testing it appears that my load balancing is working. Using apachebench (ab), about half of the connections fail. Sometimes the test doesn't even complete. I don't get these errors if I test directly against the server IP. Sometimes I get:

Turns out my "error" wasn't an error after all. Everything was working fine, except for the apachebench error. What happens is that apachebench (ab) stores a copy of the first downloaded web page and, if a later response doesn't match it, marks the request as an error. So if the pages on the load balanced servers are not EXACTLY the same, it will spew an error like the one I got. In our case the page listed the server name.
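Larry's explanation can be reproduced without a network. The sketch below is not ab's code: it only mimics the behaviour he describes, under the assumption (mine, and as far as I know how ab counts its "Length" failures) that a response is flagged when its length differs from the first response. The page template and the realserver names are made up.

```python
def count_length_failures(bodies):
    """Count responses whose length differs from the first response.

    This mimics, as an assumption, the comparison Larry describes for ab.
    """
    if not bodies:
        return 0
    expected = len(bodies[0])
    return sum(1 for body in bodies[1:] if len(body) != expected)


def page(server_name):
    # Hypothetical page that embeds the realserver's own name, as Larry's did.
    return f"<html><body>served by {server_name}</body></html>"


if __name__ == "__main__":
    # Two realservers behind round-robin scheduling, ten requests in total.
    responses = [page("rs1"), page("realserver2")] * 5
    print(count_length_failures(responses), "of", len(responses), "flagged")
```

With two realservers and round-robin, every second response comes from the "other" server, which matches the "about half of the connections fail" symptom. Note that a length-only comparison would not notice two pages of identical length with different content.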

Apache setup for DoS Willem de Groot willem (at) byte (dot) nl 18 Apr 2006

To my surprise, opening 150 tcp connections to a default apache installation is enough to effectively DoS it for a few minutes (until connections time out). This could be circumvented by using mod_throttle, mod_bwshare or mod_limitipconn, but imho a much better place to solve this is at the LVS loadbalancer, which already does source IP tracking for the "persistency" feature.

Ratz Only on a really badly configured web server or maybe a 486 machine :). Otherwise this does not hold. Every web server will handle at least 1000 concurrent TCP connections easily. After that you need some ulimit or epoll tweaking. Nope, these won't circumvent anything - you then just open a HTTP 1.1 channel and reload using GET / every MaxKeepAliveTimeout-1. Those modules will not help much IMHO. They only do QoS on established sockets. It's the wrong place to interfere. It's not a Layer 4 issue, it's a higher layer issue. Even if it wasn't, how would source IP tracking ever help? Check out HTTP 1.1 and pipelining. Read up on the timing configurations and so on. Only poorly-configured web servers will allow you to hold a socket after a simple TCP handshake without sending any data; you get a close on the socket for HTTP 1.1 configured web servers with low timeouts. You are right however, in that using such an approach of blocking TCP connections (_including_ data fetching) can tear down a lot of (even very well known) web sites. I actually started writing a paper on this last year, but never finished it. I wrote a proof-of-concept tool that would (after some scanning and timeout guessing) block a whole web site, if not properly configured. This was done using the CURL library. It simulates some sort of slow-start slashdot-effect.

Ken Brownfield krb (at) irridia (dot) com 18 Apr 2006 This 150 connection limit is the default MaxClients setting in Apache, which in practice should be adjusted as high as you can without Apache using more memory than you want it to (e.g., 80-100% of available RAM -- no need for the DoS to swap-kill your box, too). Each Apache process will use several megabytes (higher or lower depending on 32- or 64-bit platforms, add-on modules, etc) so this can't be set too high.
Disabling KeepAlives will drop your working process count by roughly an order of magnitude, and unless you're just serving static content it's generally worth disabling. But for your case of 150 idle connections, it doesn't help. Netfilter has various matching modules that can limit connections from and/or to specific IPs, for example: The reason DoS attacks are so successful (especially full-handshake attacks) is that something needs to be able to cache and handle incoming connections. And that is exactly where Apache is weakest -- the process model is terrible at handling a high number of simultaneous, quasi-idle connections. LVS has some DoS prevention settings which you should consider (drop_entry, drop_packet, secure_tcp) but they're generally only useful for SYN floods. A full handshake will be passed on through LVS to the application, and that is where the resources must be available. And given persistence, a single-IP attack will be extremely effective if you only have one (or few) real servers. Once a connection has been made to Apache, it will need to either relegate idle connections out of process (see Apache 2.2's new event MPM, not sure if it only works on idle keepalives) or limit based on IP with the modules you mention. This problem is difficult to solve completely, and I agree that solving it in Apache is the least powerful, least convenient, and highest overhead solution. Given Netfilter functionality (2.6 and later), the absence of throttles or connection limits in LVS isn't fatal. But I do feel that LVS could be made a more comprehensive system if it rolled in even basic connection throttling/limiting, plus a more closely integrated and maintained health checking system. And source-routing support. ;) There are commercial products available that implement heavy-duty DoS/ intrusion protection. They block the vast majority of simple attacks and are crucial for any large-scale public-facing services. 
But a good distributed full-handshake or even valid HTTP request DoS is almost impossible to fully block. I agree that the ~1,000 simultaneous connection count is indeed the general breaking point for select()- or poll()-based web servers (in my experience), and epoll() is a much better solution as you say. But Apache will not handle 1,000 simultaneous connections unless you have 4GB of RAM, you're on a 32-bit platform, and you have every feature turned off. And then only if you don't want any disk buffer/ cache. :) With typical application server support (e.g., mod_php), Apache will not reach 1000 processes without something like 8-16G of RAM. I've never been able to set MaxClients above 200... Copy-on-write only goes so far. Sorry for the tangent, but throttling/DoS prevention is especially important for any web/application server based on the process model. Graeme Fowler graeme (at) graemef (dot) net 18 Apr 2006 This is an application (Apache) configuration issue, not really a load balancing issue at all. A default Apache configuration shouldn't, ideally, be in production. The MaxClients setting is 150 (may be higher depending on distro and choice of MPM) for a reason, which is that not everyone has the same hardware and resource availability. It's better that you're given a limited resource version than one which immediately spins away and causes your server to expire due to lack of memory, for example. This is a problem which LVS itself can't help with, given that the concept of true feedback isn't implemented. If you spend the time to get to know your server, you'll find that you can sort out this sort of resource famine quite easily by tuning Apache, with the caveat that it will _always_ be possible to cause Apache (or any other webserver for that matter) to fall over by flooding it. Think about the infamous "Slashdot effect". You could, in theory, do some limiting with netfilter/iptables on the director, but that's OT for this list. 
To test, just use ApacheBench, which comes with Apache :) Ratz Too bad that apache only allows epoll for MPM event models. For the other interested readers, we're essentially talking about a feature which is best described here: http://www.kegel.com/c10k.html Now, as for the memory pressure mentioned below, I beg to differ a bit ... I have rarely hit the problems serving 800-1000 concurrent sessions on 32bit using a normal 2G/2G-split 2.4.x or 2.6.x kernel. As for memory/cache... Again, I believe that if you already hit the memory limits, you did something wrong in your configuration or setup :). mod_php or even mod_perl are memory hogs, but if you use a proper m:n threading model, I bet you can still serve a couple of hundred concurrent connections. I would argue that copy on write kills your performance because your application was not designed properly :). No pun intended, but I've more than once fixed rather broken web service architectures based on PHP or Servlets or JSP or ASP or -insert you favourite web service technology-. DoS prevention does not exist, this topic has been beaten to death already :). DoS mitigation, maybe yes. Maybe we should define throttling before continuing discussing its pro/cons. It could very well be that we agree on that. Most of our customer's httpd show RSS between 800KB and 2MB; some of them it's including mod_perl or mod_php. You can drop the process count if you set your timeouts correctly, or you implement proper state mapping using a reverse proxy and a cookie engine. With your iptables command, no wonder you have no memory left on your box :). You can't drop a certain amount of illegitimate _and_ legitimate connections when you're running on a strict SLA. QoS approaches based on window scaling help a bit more. Regarding throttling, I reckon you use the TC framework available in the Linux kernel since long before netfilter hit the tables. Commercial packages use Traffic contracts and the sorts, just like TC for example. 
Blocking or dropping is not acceptable, diverting or migrating is. The biggest issue on large-scale web services according to my experience is the detection of malicious requests.

Ken The mod_python and mod_php applications currently under my care are at 38-44MB resident on 64-bit. On a minimal 64-bit box, I'm seeing 6MB resident. I've honestly never seen an application, either CGI- or mod-based, use less than 2MB on 32-bit including the CGI, and most in the 15-45MB range. As you say, I think the application is a huge variable, but therein lies the weakness of the process model. Timeouts certainly help, but that's somewhat akin to saying that if you set your synrecv timeout low enough, the DoS won't hurt you. :) KeepAlives by their nature will increase the simultaneous connection count, but I apologize if I came across as advocating turning them off as a knee-jerk fix to connection-count problems. Whether they're beneficial or not (for scalability reasons) depends on whether you bottleneck on CPU or RAM first, and whether you're willing to scale wider to keep the behavior change due to keepalives. The iptables rule I gave was just an example, and 666 is my numeric "foo". I was just mentioning dropping packets, not advocating them. drop_packet and secure_tcp, set to 1, seem decent choices. If LVS is out of RAM, I think your SLA is doomed, only to be perhaps aided by these settings. Having them on all the time is indeed Bad. I had forgotten about TC, though I'm not sure it can throttle *connections* vs *throughput*. As for improving LVS: I had to completely rewrite the LVS alert module for mon, in addition to tweaking several of the other mon modules. Now, this was on a so-last-year 2.4 distro -- I haven't worked with LVS under 2.6 yet or more modern mon installs. I also wrote a simple CLI interface wrapper to ipvsadm, since editing the ipvsadm rules file isn't terribly operator-friendly and is prone to error for simple host/service disables.
I think all the parts are there for a Unix admin to complete an install. But for a health-checking, stateful-failover, user-friendly- interface setup, it's pretty piecemeal. And there's no L7 to my knowledge. There are some commercial alternatives (that will appeal to some admins for these reasons) that are likely inferior overall to LVS. I think the work lies most in integration, both of the documentation and testing, and perhaps patch integration. The primary parts of the commercial DoS systems I alluded to are the attack fingerprints and flood detection that intelligently blocks bad traffic, not good traffic. Nothing is 100%, but in terms of intelligent, low-false-positive malicious request / flood blocking, they do extremely well at blocking the bad stuff and passing the good stuff. Is it worth the bank that they charge, or the added points of failure? Depends on how big your company is I suppose. But I know of no direct OSS alternative -- or I'd use it! Ratz As for RSS - these seem to be my findings as well (contrary to what I stated earlier), after logging into various high volume web servers of our customers. In fact, I quickly set up an apache2 with some modules and this is the result: If I disable everything non-important except mod_php, I get following: A bare apache2 which only serves static content (not stripped or optimized) yields: However, copy on write does not occur for carefully designed application logic with shared data. So, normally even 40 rss does not hurt you. Stripping both perl and python to a minimal set of functionality helps further. I checked with various customers' CMS installations based on CGIs and they range between 1.8MB and 11MB RSS. Again, this does not hurt so long as the thread model is enabled. However, one has to be cautious regarding the thread pool settings and for the application handler (perl, python, ...) within the thread model of apache or else resource starvation or locking issues bite you in the butt. 
For Perl I believe the thread-related settings are: PerlInterpMax, PerlInterpMaxSpare. These however heavily interfere with the underlying apache threading model. If you only use LWPs (pre-2.6 kernel time) those settings are better not used or you get COW behaviour of the perl thread pool. For NPTL based setups, this yields much reduced memory constraints. I cannot post customer data for obvious reasons. I believe that 3 simple design techniques help reduce the weakness of the process model: (1) design your web service with shared sources in mind; (2) use caches and ram disks for your storage; (3) optimise your OS (most people don't know this). The last point sounds trivial but I've seen people running web servers on SuSE or RedHat using a preemptive kernel, NAPI and runlevel 5 with KDE or Gnome, d/i-notify and power management on! Basic debugging with valgrind, vmstat, slabtop could have shown them immediately why there was I/O starvation, memory pressure and heavy network latency. I didn't actually mean TCP timeouts, but KeepAlive timeouts for example. I don't buy the CPU bottleneck for web service applications. Yes, I have seen 36-CPU nodes go down to their knees by simply invoking a Servlet directly, but after fixing that application and moving to a multi-tier load balanced environment things smoothed down quite a bit. My experience is that CPU constraints for web services are a result of bad software engineering. Excellent technology and frameworks are available, people sometimes just don't know how to use them ;). For RAM, I'd have to agree that this is always a weakness in the system, but I reckon that a sane IT budget to implement and map your business into an e-business web service is certainly considering high enough expenses in buying hardware, including enough (fast and reliable) RAM. You should seriously consider giving advice regarding installing iptables/netfilter stuff on high-volume networked machines.
At least make sure you do not load the ip_conntrack module, or you're running out of memory in no time. I've seen badly configured web servers which had the ip_conntrack module loaded (collecting every incoming connection tuple and setting a timeout of a couple of hours) running out of memory within hours. The customer before this fix rebooted his box 3 times a day per cronjob ... go figure. In my 8+ years of LVS development and usage, I have never seen an LVS box run out of memory. I'd very much like to see such a site :). For TC: with the (not so very well documented) action classifier and the u32 filter plus a few classes you should get there. As for L7: ktcpvs is a start, not much tested in the wild though I believe. OTOH putting my load balancer consultancy hat on, I rarely see L7 load balancing needs, except maybe cookie based load balancing. I would very much like to see a simple, working and proper VRRP implementation or integration into LVS. This is what gives hardware load balancers USPs. We spend a considerable amount of time doing consultant work in banking or government environments (after all, what else is there in Switzerland :)), and there is the tendency to zero-tolerance regarding false-positives in blocking. Trying to explain why this happens nevertheless is sort of difficult at times.
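Much of the disagreement between Ken and Ratz in the thread above comes down to one multiplication: processes times resident size per process. The sketch below runs that arithmetic with the figures they quote. It is a deliberate simplification: it treats RSS as entirely private memory, so it ignores the page sharing (copy-on-write) that Ratz argues can make the real figure much lower. The function names and the 0.8 default fraction are illustrative.

```python
def ram_needed_mb(processes, rss_per_process_mb):
    """Worst-case RAM (MB) if every process really owns all of its RSS."""
    return processes * rss_per_process_mb


def max_clients_for(ram_mb, rss_per_process_mb, ram_fraction=0.8):
    """Largest MaxClients that fits in `ram_fraction` of RAM.

    `ram_fraction` follows Ken's "80-100% of available RAM" rule of thumb.
    """
    return int(ram_mb * ram_fraction // rss_per_process_mb)


if __name__ == "__main__":
    # Ken's range: 15-45 MB per process with mod_php/mod_python loaded.
    print(ram_needed_mb(1000, 15), "MB for 1000 processes at 15 MB each")
    print(max_clients_for(4096, 15), "processes fit in 80% of 4 GB at 15 MB each")
    # Ratz's range: 800 KB - 2 MB per process on his customers' servers.
    print(max_clients_for(2048, 2, ram_fraction=1.0), "processes fit in 2 GB at 2 MB each")
```

Under these assumptions Ken's numbers land near his "MaxClients around 200" experience on a 4 GB box, while Ratz's per-process figures are consistent with his claim of 800-1000 concurrent sessions on a 32-bit machine; the two are describing different per-process footprints rather than contradicting each other.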

squids, tcp 80, 3128 A squid realserver for the most part can be regarded as just another httpd realserver. Squids were one of the first big uses of LVS. There are some exceptions (see scheduling squids). In an LVS of squids, the content of each squid will be different. This breaks one of the assumptions of LVS, so you need to use an appropriate scheduler. see setups I haven't set up a squid LVS myself, but some people have found problems. Rafael Morales

Before I run the rc.lvs_dr script, my realserver can connect to the outside world, but after I run it, I lose the connection. The only difference in the route table is this:

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 16 Jan 2004

I had the same thing happen. You should not add any lo route. I don't know why. Here is how I configure my realservers for LVS-DR: I use noarpctl, and first of all I add the noarp entry. I add a lo:n interface using ifconfig (never use ifup, it seems to do some arp things, and operates badly on an already running cluster). I start the service. Then I configure the director via keepalived. To make it right in case of reboot, I add the ifcfg-lo:n part of the script to the redhat /etc/sysconfig/network-scripts directory: And then I have this init.d script I use for noarp:

Palmer J.D.F J (dot) D (dot) F (dot) Palmer (at) swansea (dot) ac (dot) uk Nov 05, 2001

With the use of IP-Tables etc on the directors can you route various URLs/IPs (such as ones requiring NTLM authentication like FrontPage servers etc) not to go through the caches, but just to be transparently routed to their destination.

Horms This can only be done if the URLs to be passed directly through can be identified by IP address and/or port. LVS only understands IP addresses and ports, whether it is TCP or UDP, and other spurious low level data that can be matched using ipchains/iptables. In particular LVS does _not_ understand HTML; it cannot differentiate between, for instance, http://blah/index.html and http://blah/index.asp. Rather, you would need to set up something along the lines of http://www.blah/index.html and http://asp.blah/index.asp, and have www.blah and asp.blah resolve to different IP addresses. Further to this you may want to take a look at Janet (http://wwwcache.ja.net/JanetService/PilotService.html), one of the first big uses of LVS with squids.

Jezz Palmer had to add a default route from his squid realservers to get them to work. Squid accessed machines on the internet. An alternate approach would be to use iproute2 to add routes only for the services required, and to not add a default route. Jezz J (dot) D (dot) F (dot) Palmer (at) swansea (dot) ac (dot) uk 10 Apr 2002

Here is a list of ports that squid accesses on the internet (outside world).
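Going back to Horms' point above that LVS sees only addresses and ports: the toy model below shows why two URLs on the same host are indistinguishable at layer 4 and why separate hostnames resolving to separate IPs fix that. Nothing here is LVS code; the lookup tables, IP addresses (from the 192.0.2.0/24 documentation range) and realserver names are all invented for illustration.

```python
from urllib.parse import urlsplit

# Hypothetical name resolution: www.blah and asp.blah get different VIPs.
DNS = {"www.blah": "192.0.2.10", "asp.blah": "192.0.2.11"}

# Virtual services keyed the only way a layer-4 balancer can key them.
VIRTUAL_SERVICES = {
    ("tcp", "192.0.2.10", 80): ["rs-html-1", "rs-html-2"],
    ("tcp", "192.0.2.11", 80): ["rs-asp-1"],
}


def l4_key(url):
    """Reduce a URL to what the director can see: protocol, dst IP, dst port."""
    parts = urlsplit(url)
    port = parts.port or 80
    return ("tcp", DNS[parts.hostname], port)  # the path never reaches layer 4


def realservers_for(url):
    return VIRTUAL_SERVICES[l4_key(url)]


if __name__ == "__main__":
    print(l4_key("http://www.blah/index.html") == l4_key("http://www.blah/index.asp"))
    print(realservers_for("http://asp.blah/index.asp"))
```

The first comparison is true because the path is discarded before the lookup; only the hostname (through its IP) and the port can select a different set of realservers.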

Setting up squids with fwmark on the director and transparent proxy on the realservers With squids you can't use a VIP to set up a virtual service - the requests you're interested in are all going to a port, port 80. Since the requests are being sent to an IP that's not on the director, you also need to force the director to accept the packets for local processing (see ). Both of these problems were handled in one blow in the 2.0 and 2.2 kernels by using the -j REDIRECT option to ipchains. This doesn't work for the 2.4 and beyond kernels, as the dst_addr of the packet is rewritten before the packet is delivered to the director. A possible (untested) solution is the method. The method used starting with the 2.4 kernels is to mark all packets to port 80 and schedule on the mark. The packets are accepted locally by iproute2 commands.

Con Tassios ct (at) swin (dot) edu (dot) au 13 Feb 2005 Transparent proxy with squid works well if you use fwmarks. I use it with the following LVS-DR configuration:

Directors: kernel 2.4.29, keepalived 1.1.7. Assuming 192.168.0.0/16 is the local network, mark all non local http packets with mark 1. Then configure LVS using fwmark 1 as the virtual service. Use these commands so the director will accept the packets

Realservers: standard RHEL kernel, squid, noarp: Configure the squid servers to handle transparency the normal way as described in the squid documentation.

bikrant (at) wlink (dot) com (dot) np Jun 24 2005

                              202.79.xx.230
                                    |
           |------------------------|------------------------|
           |                        |                        |
    eth0:202.79.xx.240/24   fxp0:202.79.xx.241/24     202.79.xx.235/24
       (gw: cisco)              (gw: cisco)              (gw: cisco)

The default route for all machines is the router. Forwarding is by LVS-DR.

Director /proc/sys/net/ipv4/ip_forward

Real server Configuration: FreeBSD 5.3 with squid configured by trans-proxy.

Cisco Router:
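Going back to Con's director rule above ("mark all non local http packets with mark 1"), the decision it makes per packet can be written out as a small predicate. This is only a model of the classification, not netfilter configuration. It assumes that "non local" means the destination address lies outside 192.168.0.0/16 and that "http" means TCP to port 80; the function and constant names are illustrative.

```python
import ipaddress

LOCAL_NET = ipaddress.ip_network("192.168.0.0/16")  # Con's local network
HTTP_PORT = 80
SQUID_MARK = 1  # the fwmark the virtual service is scheduled on


def fwmark_for(dst_ip, dst_port, proto="tcp"):
    """Return the mark a packet would carry, 0 meaning unmarked.

    Assumption: only TCP packets to port 80 whose destination is outside
    the local network are handed to the squid virtual service.
    """
    is_http = proto == "tcp" and dst_port == HTTP_PORT
    is_local = ipaddress.ip_address(dst_ip) in LOCAL_NET
    return SQUID_MARK if is_http and not is_local else 0


if __name__ == "__main__":
    print(fwmark_for("203.0.113.5", 80))   # outside web server: goes to the squids
    print(fwmark_for("192.168.1.5", 80))   # local web server: left alone
    print(fwmark_for("203.0.113.5", 443))  # not http: left alone
```

Scheduling on the mark rather than on a VIP is what lets one virtual service catch requests addressed to every web server on the internet.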

authd/identd, tcp 113 and tcpwrappers (tcpd) You do not explicitly set up authd (identd) as an LVS service. Some services (e.g. sendmail and services running inside tcpwrappers) initiate a call from the identd client on the realserver to the identd server on the client. With realservers on private networks (192.168.x.x) these identd clients will have non-routable src_addr'es and the LVS'ed service will have to wait for the call to time out. authd initiates calls from the realservers to the client, whereas LVS is designed for services which receive connect requests from clients. authd therefore no longer works under LVS, and this must be taken into account when running services that cooperate with authd. The inability of authd to work with LVS is important enough that there is a separate section on .

ntp, udp 123 ntp is a protocol for time synching machines. The protocol relies on long time averaging of data from multiple machines, statistics being kept separately for each machine. The protocol has its own loadbalancing and failure detection and handling mechanisms. If LVS is brought in to an ntp setup, then an ntp client machine would be balanced to several different servers over a long time period, negating the effort to average the data as if it were coming from one machine. ntp is probably not a good service to LVS.

Joe May 2002 I tried setting up ntp under LVS-DR and found that on the realserver, ntpd did not bind to the VIP (on lo:xxx). ntpd bound to 0.0.0.0, 127.0.0.1 and the RIP on eth0, but not to the VIP. Requests to the VIP:ntp from the client would receive a reply from RIP:ntp. The client does not accept these packets (reachable = 0). Attempts to fix this on the realserver, all of which produced reply packets which were _not_ accepted by the client, were:

(1) bring up ntpd while only lo was up. ntpd is bound to 0.0.0.0 and 127.0.0.1.

(2) bring up ntpd while the VIP was on lo:xxx and while eth0 was down: result ntpd bound to 0.0.0.0 and 127.0.0.1 but not to the VIP.

(3) put the VIP onto another ethernet card, e.g. eth1. Under these conditions, the realserver worked for LVS:telnet. However ntpd bound to eth0, lo and 0.0.0.0, but not to the VIP on eth1.

(4) put the VIP onto eth0 and the RIP onto eth1. ntpd now bound to the VIP, but with my hardware, only eth1 was connected to the network and I couldn't figure out how to route between the RIP on eth0 and the outside world.

For comparison, telnet also binds to 0.0.0.0 under the same circumstances, but LVS-DR telnet realservers return packets from the VIP rather than the RIP, allowing telnet to work in an LVS. The difference is that telnet is invoked by inetd and when a packet arrives on the VIP, a copy of telnetd is attached to the VIP on the realserver.
If you run ntpd under inetd using this line in inetd.conf and look in the logfiles, you'll see that for every ntp packet that arrives from the client, ntpd is invoked, rereads the /etc/ntp.drift file, finds that another ntpd is bound to port 123 and gets a signal_no_reset. However ntpd is bound to the VIP on the realserver and the client does get back packets from the VIP and starts accumulating data. Meanwhile on the realservers, 100's of copies of ntpd are running and 100's of entries appear in ntpq>peer. You can now kill all the ntpds but the first and have a working ntp realserver with ntpd listening on the VIP (lo:xxx). ntp is not designed to run under inetd. Postings on the news:comp.protocols.time.ntp about binding ntpd to select IPs indicate that the code for binding to an interface was written early in the history of ntpd and is due a rewrite. Invoking ntpd under inetd with the nowait option, produces similar results on the realserver, except that now the client does not get (accept) any packets. Tc lewis managed to get ntp working on LVS-DR realservers, after some wrestling with the routing tables. ntp was being NAT'ed to the outside world, rather than being LVS'ed (i.e. the realservers were time synching with outside ntp master machines). See also NAT clients under LVS-DR. Wayne wayne (at) compute-aid (dot) com 2000-04-24

I'm setting up an LVS for NTP (udp 123) using LVS-NAT. Two issues are: (1) rr seems to balance better than lc; (2) the balance is fine over a large time frame (sampling the NTP log every 5 minutes) but not when sampling the NTP log every second. Is round robin sending traffic to servers based on each request or based on a period of time? We have tested LVS with DNS, which is UDP based, too. What we are doing with this test is not for a heavy load issue, but rather to see if LVS can provide a fail-over mechanism for the services. By load balancing the servers, we can make two servers back up one: if the one fails, the service will not stop. If round-robin does this one request per server, it is pretty hard to explain what we saw in the server log, which indicated one server getting twice as many requests as the other two in some seconds, and a lot fewer in other seconds. Could you explain why we are seeing that?

Joe DNS occasionally issues tcp requests too, which might muddy the waters. I tested LVS on DNS about 6 months ago at moderate load (about 5/sec I believe) and it behaved well. I don't remember looking for exact balance and I doubt if I would have regarded a 50% imbalance a problem. I was just seeing that it worked and didn't lock up etc. If you have two ntp servers at the same stratum level and the rest of the machines are slaves, the whole setup will keep functioning if you pull the plug on one of the two servers. Will LVS give you anything more than that? Just realised that you must be running a busy site there. If I have only one client, after everything has settled down, it will only be making one request/1024secs. If I have 5 servers they'll only be getting requests every 5120 secs. Got any suggestions about simulating a busy network? run ntpdate in a loop on the client?

One client will not do. We set up a lot of clients. The reason is that the NTP server somehow remembers the client, and if a client asks too often, the server will not answer it for a period of time. I do not know who designed this in, but we found it out from our tests. We have close to 100 client computers in the test.

If you have 100 clients making requests every 1024 secs, that's only 1 request every 10 secs. You seem to be getting more than that. Even at startup with 1 request every 64 secs, that's only 2 requests/sec.

Make two to three requests per second per client. Some seconds later, the NTP servers will stop talking, but that is fine: you already got the benefit by then. The scheduling has been changed to round-robin now. It does work better, but it still has a problem on the micro scale.
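Joe's back-of-envelope rates in the exchange above can be checked with two one-line functions. This is a sketch that assumes every client polls at a fixed interval and that requests are spread evenly over the servers; the function names are illustrative.

```python
def aggregate_rate(clients, poll_interval_s):
    """Total requests per second arriving at the virtual service."""
    return clients / poll_interval_s


def per_server_interval(poll_interval_s, servers, clients=1):
    """Average seconds between requests seen by any one of `servers`."""
    return poll_interval_s * servers / clients


if __name__ == "__main__":
    settled = aggregate_rate(100, 1024)   # 100 clients, settled polling
    startup = aggregate_rate(100, 64)     # 100 clients, startup polling
    print(f"settled: one request every {1 / settled:.1f} s")
    print(f"startup: {startup:.2f} requests/s")
    print(f"one client, five servers: {per_server_interval(1024, 5):.0f} s per server")
```

At these rates each realserver sees so few packets that counts taken over one-second windows are dominated by chance, which may be part of why the balance looks uneven on the micro scale while looking fine over 5 minutes.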

https, tcp 443 http is an IP based protocol, while https is a name based protocol.

http: you can test an httpd from the console by configuring it to listen on the RIP of the realserver. Then when you bring up the LVS you can re-configure it to listen on the VIP.

https: requires a certificate with the official (DNS) name of the server as the client sees it (the DNS name of the LVS cluster which is associated with the VIP). The https on the realserver then must be set up as if it had the name of the LVS cluster. To do this, activate the VIP on a device on the realserver (it can be non-arping or arping - make sure there are no other machines with the VIP on the network or disconnect your realserver from the LVS), make sure that the realserver can resolve the DNS name of the LVS to the VIP (by dns or /etc/hosts), set up the certificate and conf file for https and start up the httpd. Check that a netscape client running on the realserver (so that it connects to the realserver's VIP and not to the arping VIP on the director) can connect to https://lvs.clustername.org. Since the certificate is for the URL and not for the IP, you only need 1 certificate for an LVS running LVS-DR serving https. Do this for all the realservers, then use ipvsadm on the director to forward https requests to each of the RIPs. The scheduling method for https must be persistent for keys to remain valid. When doing health checking of the https service, you can connect directly to the IP:port, e.g. see the code in https.monitor at http://ftp.kernel.org/pub/software/admin/mon/contrib/monitors/https/https.monitor

Jaroslav Libak jarol1 (at) seznam (dot) cz I run several apache ip based virtual servers on several RSs and test them using ldirectord via http only, even though they run https too. If https is configured properly it will work whenever http does.
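The two facts above (the certificate carries the cluster's DNS name, and a health check can connect straight to IP:port) combine as follows: connect to the realserver's address but ask the TLS layer to verify the certificate against the cluster name. The sketch below is not the https.monitor code referenced above; it is a stand-alone illustration using Python's standard ssl module. The function name, the HEAD request, and the rule that any 2xx or 3xx status counts as healthy are my assumptions.

```python
import socket
import ssl


def https_realserver_ok(rip, cluster_name, port=443, timeout=5.0):
    """Return True if the realserver at `rip` answers HTTPS for `cluster_name`.

    The TCP connection goes to the RIP; certificate verification and the
    Host header use the cluster's DNS name, as a real client would see it.
    """
    context = ssl.create_default_context()  # verifies chain and hostname
    request = f"HEAD / HTTP/1.0\r\nHost: {cluster_name}\r\n\r\n".encode("ascii")
    try:
        with socket.create_connection((rip, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=cluster_name) as tls:
                tls.sendall(request)
                status_line = tls.recv(1024).split(b"\r\n", 1)[0]
    except OSError:
        # Refused, timed out, TLS failure or certificate mismatch.
        return False
    fields = status_line.split()
    return (len(fields) >= 2
            and fields[0].startswith(b"HTTP/")
            and fields[1][:1] in (b"2", b"3"))


if __name__ == "__main__":
    # Hypothetical RIP; lvs.clustername.org is the example name used above.
    print(https_realserver_ok("192.168.1.11", "lvs.clustername.org"))
```

A check like this fails when the certificate does not match the cluster name, which a plain TCP connect or an http-only test (as in Jaroslav's setup) would not notice.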

When compiling in Apache.. What kind of certificate should I create for a real application with Thawte?

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr When you generate your CSR, use the CN (Common Name) of the DNS entry of your VIP. pb peterbaitz (at) yahoo (dot) com 18 Feb 2003 F5 and Foundry both DO NOT put SSL Cert processing on their load balancers, they offload it to SSL Accelerator boxes. So, don't let anyone tell you anything negative about LVS in this regard. The big boys don't recommend or do it either.

use reverse proxy to run https on localnode while other services are forwarded peterbaitz - 27 Jan 2003 Is it possible for SSL to be supported on the director rather than on the realserver(s)? Right now, the powers that be have brought up the question of placing a purchased Mirapoint email system behind a "free" load balancer (neglecting to consider that Mirapoint runs on FreeBSD, and that Piranha is a purchasable LVS product as well). Joe

I think you're saying that you want the director to forward port 80 and to accept port 443 locally, ie to not forward port 443. If this is the case then you add entries with ipvsadm for port 80 only. All other traffic sent to the VIP on other ports will be handled locally.

Matthew Crocker matthew (at) crocker (dot) com 27 Jan 2003

It can be done, in fact just about anything can be done these days. Whether it is a smart thing to do is another matter... What you are trying to do isn't really a function of LVS. You can set up Apache+SSL running in a reverse proxy configuration. That apache can be running on or in front of the LVS director. The apache can then make normal web connections to the internal machines, which can be run through the LVS director and load balanced. You can use keepalived or heartbeat to manage the high availability functions of your Apache/SSL proxy. You can use hardware based SSL engines to handle the encryption/decryption. This is all transparent to the functions of LVS. LVS is 'just' a smart IP packet router: you give it a packet and tell it how you want it handled. It can be configured to do a bunch of things. The ideal solution for the highest performance and greatest availability is to have 2 groups of directors, each group having N+1 machines running LVS. Have 1 group of Apache/SSL servers configured and 1 group of internal web servers. LVS group 1 load balances the inbound SSL traffic to one of the Apache/SSL servers. The apache servers make connections to the internal servers through LVS group 2. LVS group 2 load balances the internal HTTP traffic onto the realservers. To save money you could move Apache/SSL onto the LVS directors but that could hurt performance.

Matt Venn's director(NAT)+mod_proxy+mod_ssl+apache HOWTO (howto run https in localnode while forwarding other services, uses reverse proxy)

Matt Venn lvs (at) attvenn (dot) net Jul 6 2004

You might want to do this if you have highly specced director(s) that you don't want to waste, or not much SSL traffic. I use this setup to cache all images, and to do SSL acceleration for my realservers.

Requirements
- 2.4.26 kernel on the director
- Carlos Lozano's ipvsadm-1.21
- your preferred versions of apache and mod_ssl, mod_proxy

Method: patch ip_vs_core.c with Carlos' patch, configure the kernel for LVS, build kernel, install and reboot, compile and install ipvsadm-1.21.

Here are my config files for a small cluster with 1 director and 2 realservers. This config will do the SSL for traffic to editcluster.localnet, and load balance both https and http traffic to the 2 realservers.

ipvsadm rules for a setup which listens on the director 8080, and load balances the realservers on port 80.

apache: note that I have many virtual hosts, and then one domain for the SSL content.

for reverse proxy cache
CacheRoot "/tmp/proxy"
CacheSize 1000000

for SSL content
ServerName editcluster.localnet
SSLEngine On
ProxyPass / http://editcluster.safenet:8080/
ProxyPassReverse / http://editcluster.safenet:8080/

one of these for each virtual host
ServerName vhost1.localnet
ProxyPass / http://vhost1.safenet:8080/
ProxyPassReverse / http://vhost1.safenet:8080/

Then you need a properly configured apache on your realservers that is set up with virtual hosts for vhost1.safenet and editcluster.safenet, all on port 80.

https without persistence, how sessions work William Francis 29 Jul 2003 14:46:08

Is it possible to use LVS-DR with https without persistence?

James Bourne james (at) sublime (dot) com (dot) au 30 Jul 2003

It is possible. I made sure that the SSL certificate was available to each realserver/virtual host via an NFS mount. I use a single centralised httpd.conf file across all realservers. For example:

:443>
SSLEngine On
ServerName servername:443
DocumentRoot "/net/content/httpd/vhostname"
ServerAdmin email@domain.com
ErrorLog /net/logs/httpd/vhostname/ssl_error_log
TransferLog /net/logs/httpd/vhostname/ssl_access_log
CustomLog /net/logs/httpd/vhostname/ssl_request_log "%t %h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
SSLCertificateFile /net/conf/httpd/certs/vhostname.crt
SSLCertificateKeyFile /net/conf/httpd/certs/vhostname.key
SSLCipherSuite ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL
Options None
AllowOverride None
Order Allow,Deny
Allow from a.b.c.d/255.255.255.0 a.b.c.d/255.255.255.0

/net/logs, /net/conf and /net/content are all NFS mount points. The downside is that unless you have real signed certificates from Thawte etc. your browser may want to confirm the legitimacy of the certificate presented each time it hits a new realserver. This depends on the load balancing method used. Hence the use of persistence is good with https.

Horms

The other reason that persistence is a good idea relates to session resumption. This allows subsequent connections to be set up much faster if an end-user connects to the same realserver. Some Layer 4 Switching implementations allow persistence based on Session ID for this reason. LVS doesn't do this. And it is a bit hard to put into the current code (when I say a bit, I mean more or less impossible). For those who are interested, this is how Session IDs are used. A basic SSL/TLS connection has two main phases, the handshake phase and the data transfer phase. Typically the handshake occurs at the beginning of the connection and once it has finished data transfer takes place.
The handshake uses asymmetric (public key) cryptography, typically RSA, while the data transfer uses symmetric cryptography, typically something like DES3. When the session begins the public keys are generated. They are then used to securely transfer the keys that are generated for use with the symmetric cryptography that is used for the data transfer. In a nutshell the idea is to use slow asymmetric cryptography to share the keys required for fast symmetric cryptography, which is used to transfer the data. Unfortunately the handshake itself is quite slow. This hurts most for many short connections - as the handshake usually only occurs once, its percentage of the time for a connection diminishes the longer the connection lasts. To avoid this problem Session IDs may be used. This allows an end-user and realserver to identify each other using the Session ID that was issued by the realserver in a previous session. When this occurs an abbreviated handshake is used which avoids the more expensive parts of the handshake, thus making things faster. Note that using different realservers will not cause connections that try to use Session IDs to fail. They will just use the slower version of the handshake. Nicolas Niclausse Jul 30, 2003

Indeed, it will be MUCH slower. I've made a few benchmarks, and https with renegotiation is ~20 times slower. There is an alternative to persistence: you can share the session IDs on the realservers' side with distcache http://distcache.sourceforge.net/ .
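Horms's description of Session IDs can be turned into a toy model that counts how many full handshakes a client pays for under persistent and non-persistent scheduling. This is a minimal sketch in Python, assuming each realserver keeps its own private session cache (i.e. no distcache); the names and the six-connection workload are made up for illustration and say nothing about real timings.

```python
from itertools import cycle


def count_full_handshakes(connections, pick_server):
    """Count full (expensive) handshakes for a sequence of client connections.

    Each realserver remembers the clients it has issued a Session ID to.
    A connection that lands on a realserver holding the client's session
    uses the abbreviated handshake and is not counted.
    """
    caches = {}  # realserver name -> set of clients with a cached session
    full = 0
    for client in connections:
        server = pick_server(client)
        cache = caches.setdefault(server, set())
        if client not in cache:
            full += 1          # no cached session here: full handshake
            cache.add(client)  # this realserver now knows the client
    return full


servers = ["rs1", "rs2", "rs3"]
connections = ["clientA"] * 6  # one client opening six short connections

# Persistent scheduling: a client sticks to the realserver it was first given.
assigned = {}
def persistent_pick(client):
    return assigned.setdefault(client, servers[len(assigned) % len(servers)])

# Non-persistent round robin: every new connection goes to the next realserver.
rr = cycle(servers)
def round_robin_pick(client):
    return next(rr)

persistent_full = count_full_handshakes(connections, persistent_pick)
round_robin_full = count_full_handshakes(connections, round_robin_pick)
print("persistent:", persistent_full, "round robin:", round_robin_full)
```

In this model the connections still succeed without persistence, as Horms notes; the client simply pays for a full handshake once per realserver instead of once overall.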

Christian Wicke Jul 31, 2003

Is load balancing based on the session id extracted from the request possible?

Horms LVS works at layer 4 so fundamentally it doesn't have the capability to handle session ids.

You can have two IPs for an https domainname Cheong Tek Mun

Is it possible to have one domain name for https with two VIPs? For example, the DNS entry for the domain name test.com is 166.166.166.100. I have an LVS with these two VIPs: 166.166.166.100 and 166.166.166.101. Can I have https service on both VIPs?

Horms 21 Dec 2004 Yes. Joe - I assume test.com is listed in DNS as having two IPs

name based virtual hosts for https Dirk Vleugels dvl (at) 2scale (dot) net 05 Jul 2001

I want to host several https domains on a single LVS-DR cluster. The setup of http virtual hosts is straightforward, but what about https? The director needs to be known with several VIP's (or it would be impossible to select the correct server certificate).

Matthew S. Crocker matthew (at) crocker (dot) com SSL certs are labelled with the URL name but the SSL session is established before any HTTP requests. So, you can only have one SSL cert tied to an IP address. If you want to have a single host handle multiple SSL certs you need a separate IP for each cert. You also need to set up the director to handle all the IPs. Name based HTTP virtual hosts DO NOT WORK with SSL because the SSL cert is sent BEFORE the HTTP request, so the server won't know what cert to send. Horms has described how sessions are established. Martin Hierling mad (at) cc (dot) fh-lippe (dot) de You can't do Name Based VHosts, because the SSL Stuff is done before HTTP snaps in. So at the Beginning there is only the IP:Port and no www.domain.com. Look at Why can't I use SSL with name-based/non-IP-based virtual hosts? (here reproduced in its entirety).

The reason is very technical. Actually it's some sort of a chicken and egg problem: The SSL protocol layer stays below the HTTP protocol layer and encapsulates HTTP. When an SSL connection (HTTPS) is established Apache/mod_ssl has to negotiate the SSL protocol parameters with the client. For this mod_ssl has to consult the configuration of the virtual server (for instance it has to look for the cipher suite, the server certificate, etc.). But in order to dispatch to the correct virtual server Apache has to know the Host HTTP header field. For this the HTTP request header has to be read. This cannot be done before the SSL handshake is finished. But the information is already needed at the SSL handshake phase. Bingo!

Simone

I have done a configuration with LVS-DR and keepalived. I will use the server for an intranet application. I need to manage various intranet domains on the "job machine" and each domain has to be encrypted over ssl. Apache needs to use a different IP for each ssl certificate. What is the right way to implement about 10 ssl domains on the job machine?

Stephen Walker swalker (at) walkertek (dot) com 18 Aug 2003 You cannot use name-based virtual hosts in conjunction with a secure Web server. The SSL handshake occurs before the HTTP request identifies the appropriate name-based virtual host. Name-based virtual hosts only work with the non-secure Web server. Dirk

With LVS-NAT this would be no problem (targeting different ports on the RS's). But with direct routing I need different virtual IP's on the RS. The question: will the return traffic use the VIP-IP by default? Otherwise the client will notice the mismatch during the SSL handshake.

"Matthew S. Crocker" matthew (at) crocker (dot) com Yes, on the realservers you will have multiple dummy interfaces, on for each VIP. Apache will bind itself to each interface. The sockets for the SSL session are also bound to the interface. The machine will send packets from the IP address of the interface the packet leaves the machine on. So, it will work as expected. The clients will see packets from the IP address they connected to. Julian Anastasov ja (at) ssi (dot) bg

Is this correct?

James Ogley james (dot) ogley (at) pinnacle (dot) co (dot) uk The realservers also need a VIP for each https URL, as they need to be able to resolve that URL to themselves on a unique IP (this can be achieved with /etc/hosts of course) Joe

Are you saying that https needs its own IP:port rather than just IP?

Dirk Nope. A unique IP is sufficient. Apache has to decide which certificate to use _before_ seeing the 'Host' header in the HTTP request (SSL handshake comes first). A unique port is also sufficient to decide which virtual server is meant though (and via NAT easier to manage imho).

(I interpret Dirk as saying that the IP:port must be unique. Either the IP is unique and the port is the same for all urls, or the IP is common and there is a unique port for each url.)
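The rule that each certificate needs its own IP:port can be checked mechanically before you write the Apache config. The following Python sketch is only an illustration: the function name, hostnames and addresses are hypothetical, and it assumes one certificate per hostname.

```python
def find_certificate_conflicts(sites):
    """Return the IP:port endpoints claimed by more than one https hostname.

    sites is an iterable of (hostname, ip, port) tuples. Because the server
    must pick a certificate before it sees the Host header, two hostnames
    with different certificates cannot share an (ip, port) pair.
    """
    by_endpoint = {}
    for hostname, ip, port in sites:
        by_endpoint.setdefault((ip, port), []).append(hostname)
    return {ep: names for ep, names in by_endpoint.items() if len(names) > 1}


sites = [
    ("www.a.com", "10.0.0.1", 443),  # unique IP, standard port
    ("www.b.com", "10.0.0.2", 443),  # unique IP, standard port
    ("www.c.com", "10.0.0.2", 444),  # shared IP, unique port: still fine
    ("www.d.com", "10.0.0.1", 443),  # clashes with www.a.com
]
print(find_certificate_conflicts(sites))
```

An empty result means every hostname has an endpoint to itself, by unique IP, by unique port, or both.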

anon:

how would you run two virtual domains in apache with different certificates, but just one ip address?

Jacob Coby jcoby (at) listingbook (dot) com 26 Feb 2003 It is impossible to share an ip address across multiple https domains on the standard port. Why? Because the HTTP Host header is encapsulated inside the SSL session, and apache (or anything else) can't figure out which SSL cert to use until AFTER decoding the session. But, to decode the session, it must first send the cert to the client. Catch-22. To use multiple https domains, you'll have to differentiate them by IP and/or by port.

What if I use a SSL-hardware decoder box

I'm not sure what you are talking about, but I really don't think it will help. The problem is still the same: trying to serve up two different SSL certs based on a Host: header alone in the HTTP stream which is encapsulated by the SSL session which can only be verified by the correct SSL cert. The server _cannot_ get to this Host header without sending a SSL cert. Niraj Patel niraj (at) vipana (dot) com 20 Dec 2006

Since https uses name resolution to pull the SSL cert, would I also need something like the following: a dns entry for each virtual host that maps a fqdn like web.abc.com to each of the RIPs i.e. web.abc.com resolves to RIP1, RIP2, etc. an SSL certificate for web.abc.com that's installed on each RS.

Jaro jarol1 (at) seznam (dot) cz Dec 20 2006 Name resolution is used to discover the IP address, not to pull SSL certificates. The client initiates a TCP connection to the server IP address to receive the SSL certificate. SSL will also work if you connect to the IP directly in your browser (in the sense that encryption will take place). You don't need the DNS entries. ldirectord should be able to perform https checks to the IP directly. You will need the certificates. I run several apache ip based virtual servers on several RSs and test them using ldirectord via http only even though they run https too. If https is configured properly it will work whenever http does.
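A check "to the IP directly" can be sketched in Python with the standard ssl module. This is not how ldirectord or https.monitor do it; the function names are hypothetical. Certificate verification is switched off on purpose, since a certificate issued for the cluster's DNS name will not match a bare RIP; the check only confirms that a TLS handshake completes.

```python
import socket
import ssl


def make_check_context():
    """Build a TLS context for health checks that skips certificate checks."""
    ctx = ssl.create_default_context()
    # Order matters: hostname checking must be disabled before verification.
    ctx.check_hostname = False
    ctx.verify_mode = ssl.CERT_NONE
    return ctx


def https_alive(ip, port=443, timeout=3.0):
    """Return True if a TLS handshake with ip:port completes, else False."""
    try:
        with socket.create_connection((ip, port), timeout=timeout) as sock:
            with make_check_context().wrap_socket(sock) as tls:
                return tls.version() is not None
    except OSError:
        # Covers refused connections, timeouts and ssl.SSLError.
        return False
```

Call it once per realserver, e.g. `https_alive("192.168.1.11")` (a placeholder RIP), and remove the realserver from the ipvsadm table when it returns False.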

Obtaining certificates for https May 2006: Van Jacobson (of TCP/IP fame) (http://en.wikipedia.org/wiki/Van_Jacobson) has a talk on how they got from circuit switching to packet switching. Now he wants networking changed from point-to-point to allow fetching signed data BitTorrent style without having to specify the location. The problem with the current system is that only the connection is certified (you know who you've connected to via ssl/ssh, you don't know who originated the data/e-mail). If each webpage/piece of data was signed, then there'd be no more pharming, phishing or spam. He points out that obtaining certificates from Verisign is a single point of failure. He tells the story that in about 2004, someone (and they don't know who) obtained a root certificate in the name of Microsoft from Verisign. Better is a distributed (and presumably revokable) system e.g. like PKI (http://en.wikipedia.org/wiki/Public_key_infrastructure). Zachariah Mully zmully (at) smartbrief (dot) com 26 Aug 2002

Finally received the quote from Verisign for 128-bit SSL certs for our website, and I was blown away, $1595/yr! These guys must be making money hand over fist at these prices. They want $895 for one cert and license for one server and another $700 for each additional server in the cluster. This is only for one FQDN; by the end of next year, I'll need to secure three more domains hosted by these servers... Perhaps I heard wrong, but I had thought that I could simply get one cert for a domain (in this case www.smartbrief.com) and use it on all the servers hosting it in my LVS system, but the Verisign people said I needed to buy licenses for each server in the system! So I am wondering if Verisign is yanking my chain and if anyone has any recommendations for other Root CAs that have more reasonable pricing.

Joe (warning - rant follows)

I'll sell you one for $1500 or for $1 if you like. They're both the same ;-\ This is a rip-off because Verisign got their certificates into Netscape/IE back when it counted and no-one else bothered to do the same thing. It's the same monopoly that they had on domain names and they've just got greedy. When I needed to get a certificate, I looked up all the companies listed in my Netscape browser. Most didn't exist anymore or weren't offering certificates. The only two left were verisign and Thawte. Thawte was in South Africa and were half the price of Verisign. I wasn't sure how well a South African certificate would stand up in a US court. Thawte then bungled by setting the expiration of their certificates to be short enough that everyone with the current browsers of the time would not recognise Thawte certificates anymore. End of Thawte. Eventually Verisign bought out Thawte. No more competition. The webpage to get a certificate was an abomination a few years back. I can't imagine the dimwit who wrote it. No-one has stepped in to be an alternate RootCA, and I can't imagine why. I would expect EFF could do it, anyone could do it. You do need a bit of money and have to setup secure machine(s), have some way of keeping track of keys and making sure that the webbrowsers have them pre-installed. It appears to be more than anyone else wants to do, even with the price going through the roof at $1500 a pop. The browser people could help here by making newly approved RootCA certificates downloadable from the website for each browser, but it would appear that all are colluding with Verisign. As far as the website operation is concerned a self signed certificate is just as good as one from Verisign. The only problem is when the user gets the ominous message warning them that the signing authority of this certificate is not recognised. You could engage in a bit of user education here and tell them that Verisign's signature is no better than yours. 
Otherwise you're over a barrel that doesn't need to be there and no-one has stepped forward to fix the situation.

Doug Schasteen dschast (at) escindex (dot) com 26 Aug 2002

www.ssl.com sells certs but their prices aren't much better. $249 per domain. The real kick in the teeth is that you need a separate certificate for not only each domain, but also sub-domains. So I have to pay an additional $249 if I want to secure something like intranet.mywebsite.com or mail.mywebsite.com as opposed to just www.mywebsite.com. As far as I know, Thawte is still the cheapest, even though they are owned by Verisign.

nick garratt nick-lvs (at) wordwork (dot) co (dot) za 26 Aug 2002

If you're using NAT you'll just need the one cert for, as you correctly state, it's per FQDN. One cert works fine for my NATed cluster. Not too sure what the implications of DR would be in this context... All a CA is, is a trusted third party with the buy-in from the browser manufacturers. The tech is not rocket science either. It could be anyone; there's clearly an opportunity for another operator.

Zachariah Mully wrote: Thanks Joe, this is unfortunately exactly what I expected to hear. And yes, the ominous warning will definitely confuse and scare our brain dead users. Joe

They aren't really brain dead. They just don't understand what's going on and quite reasonably in that situation they are worried about their credit card number and what's going to happen to it. They have a right to know that their connection isn't being rerouted to some other entity and this fear is how Verisign is making their money. I've just had an offline exchange with someone who self signs and sends the client a pop-up explaining the situation. This appears to be for inhouse stuff. I don't know if this is going to work in the general case - I expect that you'll get a different reception if you are the Bank of London than if you are selling dubious services. You could try it initially and log the connections that don't follow through after getting the educational pop-up to see how many people are scared off.
As someone pointed out, one cert should work fine for many NAT'ed servers, anyone know if my DR config would change that?
The certificate is for a domainname. All realservers think they are running that domainname. For LVS-DR they all have the same IP (the VIP). For LVS-NAT they all have different IPs (the various RIPs) in which case you have to have a different /etc/hosts file for each realserver (see the HOWTO). In all cases the machines have the same domainname and can run the same certificate. (Hmm, it's been a while, I can't remember whether the RootCA asks you for your IP or not, so I don't remember if the IP is part of the cert). I can't imagine how Verisign is ever going to tell that you have multiple machines using the same cert. Perhaps you could NFS export the one copy of the cert to all realservers.

You do need a bit of money and have to setup secure machine(s), have some way of keeping track of keys and making sure that the webbrowsers have them pre-installed. Greg Woods woods (at) ucar (dot) edu 26 Aug 2002

The last part of this is the difficult part. We run our own RootCA here, because we were quoted a price from Verisign in excess of $50K per year for what we wanted to do. Then there is the ominous-looking spam that VeriSign sends that makes it sound like you will lose your domain name if you don't register it through them, so I won't do business with them anyway even if the price *has* come down. So we had little choice, and we've just had to guide our users through the scary dialog boxes to get them to accept our CA. Once that's done though, we can now use SSL with authentication to control viewing of our internal web pages. Works for us, but your mileage may vary. I do recall hearing a lot of cursing coming from the security administrator's office while they were trying to get the RootCA working, too. That can be rather tricky.

Eric Schwien fred (at) igtech (dot) fr 26 Aug 2002

Thawte is selling 1 year certs at 199 $ each. If you have a Load Balancer System, they ask you to buy additional "Licences" for the second, third, etc ... Real server. In fact, if you just copy and paste the original cert, all is working fine (ie, without additional "licences"), ... But you do not have the right for it. This is a new pricing scheme of Thawte, that still seems cheaper than offers you had! However, you still need one cert for each domain. Their Web Site is all new, quite long to read everything, but procedures are well explained.

"Chris A. Kalin" cak (at) netwurx (dot) net 26 Aug 2002

Nope, he was talking about 128 bit certs, which even from Thawte are $449/year.

Joe Cooper joe (at) swelltech (dot) com 26 Aug 2002

How about GeoTrust? Looks like $119/year for a 128-bit cert. Though some colo/hosting providers seem to be offering the same product for $49/year (RackShack.net, for example). Maybe only for their own customers, I don't know.

Bobby Johns bobbyj (at) freebie (dot) com 29 Aug 2002

If you're using DR you only need one cert. I'm running that way right now and it works flawlessly. Also, if you're interested in the nuts and bolts of making your own certificates, see Holt Sorensen's articles on SecurityFocus. Parts 1-4:

Here's what the readers at Slashdot have to say about why certificates are so expensive Malcolm Turnbull wrote

As far as I am aware Thawte do not require you to buy a separate cert for each realserver (just for each domain).

Simon Young simon-lvs (at) blackstar (dot) co (dot) uk 18 Feb 2003 Just for the record, here's the relevant section from the definitions section of Thawte's ssl certificate license agreement:

"Licensing Option" shall mean the specific licensing option on the enrollment screen that permits a subscriber to use of a Certificate on one physical Device and obtain additional Certificate licenses for each physical server that each Device manages, or where replicated Certificates may otherwise reside "Device" shall mean a network management tool, such as a server load balancer or , that routes electronic data from one point to single or multiple devices or servers.

And from section 4 of the agreement:

... You are also prohibited from using your Certificate on more than one server at a time, except where you have purchased the specific licensing option on the enrollment screen that permits the use of a Certificate on multiple servers (the Licensing Option). ...

So it looks like every realserver does indeed require a license - or at least you have to buy the 'multiple server' license option, which is more expensive than the single machine license. In addition:

In the event you purchase the Licensing Option, you hereby acknowledge and agree that ... you may not copy the Certificate on more than five (5) servers.

So a large number of realservers may need a multiple server license for every five machines. This could get expensive... In summary, you need a valid license for every copy of your certificate being used, whether it be single licenses for each, or multiple licenses for every 5 machines. anon

You only need a certificate per domain. You should be able to copy it to as many servers as you want. I had SSL IPs load balanced using LVS-TUN with two computers, using the same certificate, and nothing complained about the certificate.

Malcolm Turnbull Malcolm (at) loadbalancer (dot) org 22 Oct 2003 Me too, and it only took 2 years for me to realise that I was breaking the licence/law... Verisign et. al. have clauses in their contracts stating that you can't use a cert on more than one web server unless you pay for a multiple use licence... Joe Dec 2003, "Crypto" Steven Levy, Viking Pub, 2001, ISBN 0-670-85950-8. I enjoyed this book - it describes the people involved in producing cryptography for the masses: Diffie, Hellman, Rivest, Shamir, Adleman, Zimmermann, Chaum and Ellis (and many others), how they did it and how they had to fight the NSA, and the legislators to get their discoveries out into the public in a useful way. The real story of why RSA (or its descendants, e.g. Verisign) has the only certificate in Netscape is revealed in this book on p278. I will attempt to summarise

RSA owned or controlled all the patents needed for cryptography for the masses. They had licensed their patents for Lotus Notes and to Microsoft but were limited to 40 bits for the export version and 56 bits for the US version (Microsoft shipped 40 bit enabled code in all versions, to simplify maintenance). RSA was not making much money and attempts to put their patents to use were hobbled by the NSA, the US laws on cryptography, and by the lack of awareness amongst the general public and application writers that cryptography was useful (c.f. how hard it is to get wifi users to enable WEP). The Netscape team was assembled from the authors of Mosaic by Jim Clark, the just departed CEO of Silicon Graphics, who was casting about for a new idea for a start-up company.
" The idea was to develope an improved browser called the Navigator, along with software for servers that would allow businesses to go on-line. The one missing component was security. If companies were doing to sell products and make transactions over the internet, surely customers would demand protection. It was the perfect job for encryption technology. Fortunately Jim Clark knew someone in the field - Jim Bidzos (the business manager at RSA). By the time negotiations were completed, Netscape had a license for RSA and the company's help in developing a security standard for the Web: a public key-based protocol known as the Secure Sockets Layer. Netscape would build this into its software, ensuring that its estimated millions of users would automatically get the benefits of crypto as envisioned by Merkle, Diffie and Hellman, and implemented by Rivest, Shamir and Adleman. A click of the mouse would send Netscape users into crypto mode: a message would appear informing them that all information entered from that point was secure. Meanwhile, RSA's encryption and authentication would be running behind the scenes. Jim Bidzos drove his usual hard bargain with Netscape: in exchange for its algorithms, RSA was given 1% of the new company. In mid-1995, Netscape ran the most successful public offering in Wall Street's history, making RSA's share of the company worth over 20$M. "
The point of this quote is that the people who'd invented cryptography for the masses, had struggled against their own evil empire (NSA, the US Govt) and were now in a position to make good. Because of the patents, there was only one game in town, RSA, and when Netscape was casting around for crypto, RSA was it. Even if RSA had been staffed by GPL true believers, there weren't any other companies with root CA's (if there were, they would have had to license RSA's patents). So because of historical accident (the crypto algorithms were all patented and because the US Govt/NSA tried to keep the genie of crypto from getting out by sitting on people) RSA was the only company that had crypto at the dawn of the internet.

Self made certificates Matthias Krauss MKrauss (at) hitchhiker (dot) com 18 Aug 2003 You can find a nice explanation of how to configure self made certs and virtual addresses at: http://www.eclectica.ca/howto/ssl-cert-howto.php

SSL Accelerators and Load Balancers

Horms has described how sessions are established. For a description of the (commercial) Radware SSL accelerator setup see . This setup has the SSL accelerator as a realserver and the decrypted http traffic is fed back to the director for loadbalancing as http traffic.

Encrypted versions of services (e.g. https/http, imaps/imap, pops/pop, smtps/smtp) are available which require decryption of the client stream, with the plain text being fed to the regular demon. In the reply direction, the plain text must be re-encrypted. Decryption/Encryption are CPU intensive processes and also require the tracking of keys. Vendors have produced SSL accelerators (cards or stand-alone boxes), which do the de- and en-cryption, thus taking the load off the CPU in the server and allowing it to do other things. These accelerator cards are useful if
- you need to increase the capacity of your server(s) and don't want to buy a bigger server to handle the extra load from encryption/decryption
- you already have the biggest server you can buy
- you don't have the SSL enabled version of the demon

The cards (or boxes) usually have proprietary software. There are only 2 products which work with Linux (both based on the Broadcom chip?). Since these SSL accelerator boxes are not commodity items, they are always going to be more expensive than the equivalent extra computing power in more servers. The niche for SSL accelerators seems to be the suits, who, faced with choosing between a low cost solution which requires some understanding of technology and a high cost solution supported by an external vendor which requires no understanding of technology, will choose the high cost solution. The people with money understand money; they usually don't understand technology. They have little basis to judge the information coming from the technologically aware people they hire and whose job it is to advise them (i.e. the suits don't trust the people at the keyboard.)

Solutions for the technologically aware people are (this is not my area - anyone care to expand on this - Joe)
- add more servers and use a load balancer. The demon on the servers will have its own SSL code.
- use apache with mod_reverse_proxy
- use ssl engine

The dominant school of thought amongst LVS'ers is to add realservers running the SSL'ified demon. Kenton Smith

Do I terminate the SSL traffic at the director or the realserver? How do I handle the certs? If the traffic is terminated at the realserver, do I need a certificate for each realserver? Can I use a name-based cert using the domain name that goes with the virtual IP on the director, thus only requiring one certificate?

Joe (caveat: I haven't done SSL with LVS). Some rules when thinking about how to handle services on LVS:
- each realserver thinks it is being connected to directly by the client.
- each client thinks it is directly connected to a single box (the realserver).
- neither the client nor the realserver knows the director exists.

So - set up each realserver as if the client was directly connecting to it. Put a name based cert on each realserver and let the realserver handle the SSL de/encoding.

pb peterbaitz (at) yahoo (dot) com 15 Oct 2003 Where I work we use Piranha (Red Hat's spin of LVS) and regarding SSL, we let the realservers do the SSL work. No sense busying the director with processing the SSL, and even if you wanted to, you would look to SSL Accelerators, which we have not implemented, though we looked at the technology theoretically speaking - but you also get into what service(s) you are using SSL for, webmail, web sites, etc. Better to let the realservers handle the SSL... you can always add more realservers if SSL processing bogs them down by some fraction.

Horms I agree. And arguments that I have heard to the contrary are usually tedious at best. SSL is probably the most expensive thing that your cluster needs to do. Thus distributing it amongst the realservers makes the most sense, as you can scale that by just adding new machines. You can terminate the SSL connection at the director, perhaps using something like squid as a reverse proxy, but then the Linux Director has to do a _lot_ of work. You probably want to get a hardware crypto card if you are going down that road and have a reasonable amount of traffic. You should use the same certificate on each of the realservers. That way end-users will always see the same certificate for a given virtual service.

Can I use a name-based cert using the domain name that goes with the virtual IP on the director, thus only requiring one certificate?

I am not sure that I follow this. The name in the certificate needs to match the name that your end-users are connecting to. So if you have www.a.com, www.b.com and www.c.com then they can't use the same certificate. Certificates can have wildcards though, so you could use the same certificate for www1.a.com, www2.a.com and www3.a.com. On a related note, you have to have a different IP address or use a different port for each different certificate. There is no way to use name-based virtual services with certificates, as SSL has no facility for virtual hosting and thus there is no way for the SSL server to select between different certificates on the same IP/port.

Peter Mueller

If I wanted to use a hardware SSL decrypting device such as a card in my LVS-director boxes, how could I set this up in LVS? I see no problem getting 443 to decrypt, but how do people then forward this traffic to the realserver boxes? I like the idea of saving 20-30+ Thawte bills a month AND offloading a whole bunch of CPU for the one time cost of $500/card..

AFAIK at this time the only real way to do this is to use a user-space proxy of some sort. Once you have it in user space it is pretty straightforward, as long as the card is supported by openssl / provides the appropriate engine library for openssl. On the other hand, surely there is someone who isn't committing highway robbery to provide certificates. AFAIK the reason you offer above is the only reason to use an accelerator card in this situation. It is a technical solution to Thawte overcharging. A much better solution is to distribute load on the cluster; that is what it is there for.

Matthew Crocker matthew (at) crocker (dot) com 22 Oct 2003 There is more than one way to handle SSL traffic. This is how I do it:

I have 2 working machines (aka realservers) running Linux/Apache/SSL.
I have 1 /24 subnet (256 IP addresses) assigned to SSL serving.
I register 1 SSL certificate per SSL domain I host (www.abc.com, www.def.com, www.ghi.com).
I assign each domain to an IP address from the SSL pool using DNS (www.abc.com IN A 159.250.20.1, www.def.com IN A 159.250.20.2).
I use LVS-DR to load balance the connections to the 2 realservers.
I set up the realservers to handle every IP in the SSL pool.

In short, SSL certificates are branded with the domain name. The SSL protocol establishes security before any HTTP requests. The client web browser checks the domain it went to (location bar) against the domain in the certificate. If the domains do not match, the web browser complains to the user. SSL is still established. Because of this process you must use a separate IP address for each SSL certificate, so that Apache will know which SSL cert to use when establishing the connection. I use a hybrid LVS-NAT/LVS-DR setup with fwmarks and some static routes to handle my SSL traffic. Check a couple of months back in the logs where I detail how I do it. My realservers are not on the Internet. Only traffic in the SSL pool going to ports 80 and 443 is routed to the realservers.
Each realserver has a copy of all SSL certs (shared drive). If I need SSL decryption hardware I would place it in the realservers. Persistence is set on the LVS box for the connections. Ports 80 and 443 are bound together for persistence.

As part of his work, PB has been finding out about SSL accelerators. Here's his writeup (Mar 2003).

PB's SSL Accelerator write up

I've been working with SSL accelerators and have been thinking about using them with LVS. After having Foundry, F5 and Cisco in here for a technical review of their products, I find that LVS has the same basic load balancing functionality. F5 uses BSD Unix, Cisco runs on Linux with ASIC chips, and they have fancy GUIs and various additional functionality, but LVS has all the same basic stuff. Note: Red Hat is a Cisco customer, and uses Cisco's load balancing and SSL solution in-house (I assume), rather than their own Red Hat Advanced Server + Piranha + SSL solution, which would have helped all us RH Linux Piranha customers.

An SSL accelerator is a piece of software running either on a separate box which is inserted into your data stream, decrypting on the way to the server and then re-encrypting on the way back to the client, or on a card with its own processor that's inserted into the server to do the same thing.

In simpler times we put SSL certificates (Verisign, Thawte, etc.) on the realservers, and let the load balancer route the traffic to the realservers where the SSL decryption is done. https and the e-mail protocols smtps, imaps and pops are all SSL-encryption oriented. Additionally, many new applications today have joined the SSL bandwagon.

Now SSL decryption is CPU-intensive and adds actual load on your realservers. Solution? Add more realservers behind your load balancer, right? Yes and no. Adding additional realservers behind your load balancer to offset SSL decryption load is the intuitive solution, since that is the purpose of a load balancer: to allow many realservers and lower the load on each. However some folks today add what is generally called an "SSL Accelerator" solution to the mix, removing as many SSL certificates and as much SSL decryption from the realservers as possible, and pre-processing the SSL-encrypted data stream before it hits the realservers.
In short, SSL accelerators decrypt the data, then pass the decrypted data via clear-text (standard) protocols (like http, smtp, imap, pop) to the realservers. E.g. for https, in a standard server, the decrypted https is sent as clear text to the httpd on port 80. Same for smtps -> smtp, imaps -> imap, pops -> pop, etc. What happens to ssh? I think because ssh is using its own RSA keys (not SSL) on the server, the information is encrypted all the way.

The point of the SSL accelerator is to reduce the load on your server (which is now busy crypting, rather than serving pages), so that it can deliver more pages of data to clients without changing your server setup. However you could do the same thing by beefing up your server (or putting an array of realservers behind an LVS) without any change in software on your servers. The problem then is one of cost. The Ingrian and Sonicwall SSL boxes are in the multiple 10K$ range. The cards cost less, and other stand-alone units that support fewer protocols (like http/https only) cost a few thousand. Suits feel better spending money when it is available (i.e. an SSL accelerator array will ensure we don't need to spend more money on 1 or 2 or 3 more realservers in the future). So an SSL accelerator is aimed at non-technical but financially knowledgeable managers, rather than technically competent but financially naive computer people. (When faced with the choice of going to an L7 load-balancer or rewriting the application to be L4 friendly, the suits will go for the slow and expensive L7 load-balancer, while the programmers will re-write the application - the choice depends on what you have at hand that you understand.)

You can have an SSL accelerator without a load balancer, but in order to have an array of them, you want a load balancer. (Another choice is DNS round robin, which several people, including Horms, have found is not a good way to go.) There are two main ways this passing of the baton is done.
First, your load balancer can load balance all data streams to a few of these SSL accelerator units, which decrypt any SSL-encrypted data and themselves load balance all protocols to your realservers. Second, your load balancer can load balance just the SSL-encrypted data streams to several SSL accelerator units which, after they decrypt the data, pass it back to your load balancer as clear text, to be routed by your load balancer as non-encrypted data over standard protocols to your realservers.

SSL accelerators are available as stand-alone units (you would normally buy several and make an array of them) or as SSL accelerator CARDS which plug into your realservers to speed them up with regard to SSL decryption (which is yet another solution).

In principle you could separate out the SSL handling from a linux server and run it on a separate box to make a linux-only SSL accelerator box. The natural people to do this would be the mod_ssl people, but they are supporting linux compatible SSL hardware (e.g. cards that use the Broadcom chipset) via the "OpenSSL engine" feature (see the mod_ssl mailing list archives). This will allow you to build an SSL Accelerator Linux box or to beef up your realserver with a Broadcom card in your Linux LVS load balancer. Since the extra processing is now on a separate processor, in principle this should not add a lot of extra load to your realserver.

Ingrian and Sonicwall are a couple of fairly expensive SSL accelerator solutions which support all the protocols you need. Broadcom makes the card you can add to your realservers. There are other brands which make less expensive SSL accelerators that support only the https (web) protocol. The information from most vendors is vendor-speak, however Ingrian has some white papers. I personally called Ingrian and found them to be extremely TECHNICALLY helpful (more than I could understand myself). They seemed willing to help me even though I told them I use Piranha/LVS (and not Foundry, their partner).
I did not find any level of open-source style detail on SSL acceleration/decryption. Got all my info from the load balancer companies, Ingrian, and white papers.

I don't know how an SSL accelerator box/card works. Presumably several levels are involved. The box has to get the packets: in one of the two methods (called "one arm config") you have the load balancer route only SSL-bound protocols (i.e. https, imaps, smtps, etc.) to the SSL accelerator. For a card, it has to grab the packets off the PCI bus. (Anyone know how this is done?) The SSL accelerator has to keep track of session data etc. The end result is that the box/card takes the SSL-encrypted form of the data, decrypts it to clear text, and shoots it out like it was never encrypted.

Some SSL accelerators have load balancing built in (e.g. Ingrian), but not all. The standard "one arm config" does NOT require it. You use your own load balancer and send SSL-encrypted data to your SSL accelerators; they then output decrypted clear-text data, sending it back to the load balancer for routing to the realservers. Only Ingrian mentioned/recommended that they can also do load balancing.

I don't know how you would use an SSL accelerator with LVS-DR. The load balancer companies talk about the decrypted clear text going to the realservers, but I do not recall a discussion about it going back out again.

from the mailing list

Matthias Krauss MKrauss (at) hitchhiker (dot) com 10 Mar 2003 (severely editorialised by Joe) I had LVS-DR forwarding http (but not https). I hoped to decrypt the https packets on the director with the SSL accelerator card and then pass the decrypted packets to the realserver via the LVS, to save CPU time on the realservers. I simulated 10 concurrent requests and downloads of about 3 GB via the SSL accelerator; apache's CPU time went up to 30% on a 1GHz/512MB host.

when you had the accelerator card in front of the director, what does it do with traffic to other ports eg port 80? Does it just pass them through? Does it look like a router/bridge except for port 443?

For me it looked like some kind of proxy, dealing with the incoming ssl/443/encrypted traffic, decrypting it and passing http/80/decrypted traffic to the LVS. With tcpdump I saw that between the ssl rewrite engine and the VIP there was only regular http traffic. By the way, I didn't have an accelerator card for the apache box; I just used apache's rewrite and proxy pass modules for the decryption job.

Julian 11 Mar 2003 The SSL accelerator cards I know allow user space processes (usually many threads) to accelerate the handling of private keys. What I know is that the normal traffic is still encrypted and decrypted by the CPU(s). OTOH, decrypting the SSL data in directors is used mostly to modify the HTTP headers for cookie and scheduling purposes. For other applications it can be for another reason. Even the HTTP stream from the realservers is modified. So, I don't think LVS can be used here. Of course, it is possible to implement everything in such a way that the user space handling is avoided and all processing is moved into kernel space: queues for SSL async processing, HTTP protocol handling, cookies, just like ktcpvs works in kernel space to avoid memory copies. I don't follow the SSL forums, so I don't know what stage kernel-level SSL acceleration is at. My experience shows that the user-space model needs many threads just for private keys to keep the accelerator busy, and the rest is spent on CPU encryption and decryption of the data.

Joe: private keys are kept in threads rather than in a table?

Julian: If you have to use the following sequence for an incoming SSL connection (user space): The SSL_accept operation is the bottleneck for non-hardware SSL processing; SSL_accept handles the private key, which is very costly. The hw accel cards offload at least this processing (the engine is used internally by SSL_accept), but we continue to call SSL_read and SSL_write without using the hw engine. Considering the above sequence, we have two phases which repeat for every incoming connection:

wait for the card drivers to finish the private key operations. That means one thread waiting in blocked state for SSL_accept to finish.
do I/O for the connection (this includes data encryption and decryption and everything else).

What we want, while the card is busy with processing (it does not have any PCI I/O during this processing), is to use the CPU not for the idle kernel process but for encryption and decryption of other connections that are not waiting for the engine. So the goal is to keep the queue of the hwaccel busy with requests and to use the CPU at the same time for other processing. As a result, the accel reaches its designed limit of RSA keys/sec and we don't waste CPU in waiting for only one SSL_accept to return results.
Joe: if there was only one thread, the accelerator would be a bottle neck or it wouldn't work at all?
The CPU is idle waiting for SSL_accept (the card), then the card is idle waiting for the CPU to encrypt/decrypt data. The result could be 20% usage of the card and 30% usage of the CPU, and the idle process is happy: 70% CPU.
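
Julian's argument for many threads can be illustrated with a back-of-the-envelope model. The sketch below is not from the mailing list: the function names are hypothetical, and the per-connection times (`card_ms` for the private key operation inside SSL_accept, `cpu_ms` for encrypting and decrypting the data) are made-up inputs. It treats the card and the CPU as the two stages of a pipeline and compares one thread with enough threads to keep both stages busy.

```python
def serial_time(n, card_ms, cpu_ms):
    """One thread: each connection waits for the card, then uses the CPU.

    The card and the CPU are never busy at the same time.
    """
    return n * (card_ms + cpu_ms)


def overlapped_time(n, card_ms, cpu_ms):
    """Many threads: the card works on one connection while the CPU
    encrypts/decrypts data for another (an idealised two-stage pipeline).

    The first connection has to pass through both stages; after that
    the slower stage sets the pace.
    """
    if n == 0:
        return 0
    return card_ms + cpu_ms + (n - 1) * max(card_ms, cpu_ms)


def utilisation(busy_ms, total_ms):
    """Fraction of the elapsed time a resource spent doing work."""
    return busy_ms / total_ms
```

With the assumed values of 1000 connections, 2 ms of card time and 3 ms of CPU time each, the serial model takes 5000 ms (card busy 40% of the time) and the overlapped model takes 3002 ms. The absolute numbers mean nothing; the point is only that a single thread leaves each resource idle while the other works.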

I don't know if this is true for all cards. But even with an accelerator, using SSL costs 3-4 times more than plain HTTP. My opinion is that this game is useful only for cookie persistence. For other cases LVS can be used in the directors and the accelerators on the realservers - LVS forwards TCP(:SSL:HTTP) at L4, and the accel is used from user space as usual.

Joe: is this 3-4 times the number of CPU cycles?

Yes, handling of SSL-encrypted HTTP traffic with hwaccel is 3-4 times slower than handling the same HTTP traffic without using SSL. Of course, without hwaccel this difference could be 20 times, and that depends on the CPU model used. What I want to say is that it is better to leave this processing to the place where it is needed: if the SSL traffic needs to be decrypted for scheduling and persistence reasons then it should be done in the director, but if SSL is used only as a secure transport and not for the above reasons then it is better to buy one or more hwaccel cards for the realservers. Loading the director should be avoided if possible. As for any tricks to include LVS in the SSL processing, I don't know how that can be done without using kernel-space SSL hwaccel support. And even then, LVS cannot be used; maybe ktcpvs can perform URL switching and cookie management.

pb Mar 11, 2003

I wonder if a software engine could be written to accept data from any SSL service (https/smtps/imaps/pops) and let apache rewrite + proxy pass mod decrypt it, then get it sent back out the correct clear text port (http/smtp/imap/pop). It's all SSL encrypted the same way, so once decrypted just pass it to the right protocol. No?

Horms Yes this would be possible. But I fail to see why it would be desirable unless you are running a daemon that can't do SSL itself (not all demons are SSL enabled). Joe: see SSL'ifying demons.

unknown: Would an SMP system make a difference such that the OS and general I/O is not bogged down?

Horms Surely the problem is CPU and not I/O, so to that end one would expect that an SMP machine would help.

Ratz The simple solution that works in almost all cases is apache + mod_reverse_proxy. You can build highly scalable and secure transaction servers. No need for SSL accelerators, since after all, there are only 2 known products out there that work with linux. Also you can double-balance the requests with the same load balancer, before the reverse proxy and after the proxy, with LVS-DR and two NICs, which leads to a pretty nice HA setup. (The healthchecking scripts are a bit complex though.)

Horms Personally I agree that SSL accelerators are pointless. I don't really see that you can do much that can't be done with a reverse proxy or, better still, just handling the SSL on the realservers. When you think about it, SSL is probably at least as CPU intensive as whatever else the realservers are doing, so it makes sense to me that the load should be spread out. I can see some arguments relating to SSL session IDs and the like. But the only real way forward here is to move stuff into the kernel. I haven't given this much thought, but what Julian had to say on the matter makes a lot of sense.

(from a while back)

Jeremy Johnson jjohnson (at) real (dot) com 12 Oct 2000 I have been evaluating the Intel 7110 and the 7180 SSL E-commerce accelerators recently. I would prefer to use the 7110's (straight SSL accelerators) coupled with LVS instead of throwing out LVS and using the 7180 as the cluster director. Now, before anyone says "You can do that with LVS-Tun", I am aware that I can fix this problem with LVS-Tun, but I would like to see if there is a fix for LVS-DR that would let this work, as I prefer the performance of LVS-DR over LVS-Tun. Here is how I have the network setup

These 7110 SSL accelerators have 2 ports, one IN and one OUT. The problem is that when the request comes into the director and the director looks up the MAC address of the realserver, it is getting the MAC address of the first NIC in the 7110 Director, so when the packet headers are rewritten with the destination MAC address, the MAC address of a NIC in the 7110 is written instead of the MAC of the realserver. The effect is a black hole. The cluster works fine without the 7110 in between LVS and the realserver; as soon as I swap in the 7110, all traffic bound for the realserver with the 7110 in front of it is blackholed. Any ideas?

I could be wrong about what exactly is happening, but basically as soon as I swap in the 7110, LVS for the realserver dies. The same thing happens when I set the box in fail-through mode; the box is then acting as a straight piece of wire and I am having the same problem, so I believe it has something to do with the wrong MAC address. I am really hoping that someone has encountered something like this and has a fix. The obvious fix would be to be able to force the MAC addresses of the realservers on the director instead of looking them up.
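
Jeremy's diagnosis rests on how LVS-DR forwards: the director leaves the IP packet alone (the destination stays the VIP) and only rewrites the destination MAC address of the frame to whatever its lookup for the realserver's IP returns. The toy model below shows just that step. It is a sketch, not ipvs code: all names, IP addresses and MAC addresses are made up, and whether the 7110 really answers the lookup is Jeremy's guess, which the model takes as an assumption.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Frame:
    dst_mac: str  # rewritten by the director
    dst_ip: str   # stays the VIP all the way to the realserver in LVS-DR


def dr_forward(frame, rip, arp_table):
    """Forward as LVS-DR does: new destination MAC, untouched destination IP.

    arp_table maps an IP to the MAC that answered the lookup for it.
    """
    return replace(frame, dst_mac=arp_table[rip])


# Hypothetical addresses, for illustration only.
VIP = "192.168.1.110"
RIP = "192.168.1.11"
DIRECTOR_MAC = "00:00:00:00:00:01"
REALSERVER_MAC = "00:00:00:00:00:11"
ACCEL_MAC = "00:00:00:00:00:71"

incoming = Frame(dst_mac=DIRECTOR_MAC, dst_ip=VIP)

# Realserver directly on the wire: the lookup returns its own MAC.
direct = dr_forward(incoming, RIP, {RIP: REALSERVER_MAC})

# Assumed failure case: the lookup for the RIP returns the accelerator's NIC,
# so the frame is addressed to the accelerator, still carrying dst_ip = VIP.
via_accel = dr_forward(incoming, RIP, {RIP: ACCEL_MAC})
```

In the second case the frame goes to a device that is not the realserver while the IP destination is still the VIP, which is consistent with the black hole Jeremy reports.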

Terje Eggestad terje (dot) eggestad (at) scali (dot) no 13 Oct 2000 Since you're in an evaluation process, I offer another solution to the accelerator problem that will give you a cleaner LVS setup. I've tried a PCI accelerator card from RainBow. My experience was with Netscape Enterprise Server on HP, but RedHat flags that their version of Apache, Stronghold, has drivers for this card. The effect was nothing short of amazing. Before adding this card the CPU usage ratio between the web server and the CGI programs was 75:25. After adding this card it was in the neighbourhood of 10:90.

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 17 Aug 2005

I have been using LVS for a small farm for a couple of years now. I am interested in adding SSL hardware acceleration to my two LVS servers. It is my goal to maintain performance by offloading the SSL chores, and reduce the cost of certificate renewal by not applying certificates to my web servers. Can anyone offer advice from experience doing the same? I am using an LVS-NAT configuration currently and am happy with it. It has been suggested that I get a commercial product to do this (Big-IP from F5) which I am not absolutely opposed to, but if there is a good track record with adding SSL hardware acceleration to LVS then I will be happy to stick with what I've been using.

Peter Mueller pmueller (at) sidestep (dot) com Intel used to make a daisy-chain network device that would do this. A lot of companies still add an SSL card to a few servers, e.g. http://h18004.www1.hp.com/products/servers/security/axl600l/ or http://www.chipsign.com/modex_7000.htm. And then there are the accelerators on F5s and their like. I think the least disruptive way will be the add-on card in two servers and a :443 VIP containing only them.

kwijibo (at) zianet (dot) com Why are you worrying about offloading it? I would just buy some boxes with faster CPUs if speed is a concern. In the time it takes to ship the data back and forth over the bus to the SSL accelerator, your processor probably could have taken care of it. Especially if the algorithm uses all the bells and whistles today's CPUs have.

Joe: as Horms says, it's all about the number of certificates. This is not a problem that has a technical solution.

SSL'ifying demons

Richard Lyons posting to the qmail mailing list, 05 Feb 2003 To configure Qmail to use "pop3s" and "smtps", there are plenty of examples in the archives at http://www-archive.ornl.gov:8000/, but let me hit the high points. There are two ways to offer secure versions of the SMTP and POP services. In the first, the existing services on ports 25 and 110 can be enhanced with the STARTTLS and STLS extensions (RFCs 2487 and 2595), allowing clients to negotiate a secure connection. The other method is to provide services on different ports and wrap or forward connections on the new ports to the existing services.

STARTTLS for qmail-smtpd is done either with the starttls patch (look for starttls on http://www.qmail.org) or with Scott Gifford's TLS proxy, see http://www.suspectclass.com/~sgifford/stunnel-tlsproxy/stunnel-tlsproxy.html. As far as I know, the only implementation of STLS for qmail-pop3d is Scott's TLS proxy (see the above link). The starttls patch requires patching and installing a modified qmail-smtpd/qmail-remote but no other changes (apart from configuring certs, etc). The starttls patch also allows secure connections from your mailserver to others supporting RFC2487. The TLS proxy requires patching and installing a modified stunnel and changing your run scripts, but doesn't modify the current qmail install.

Creating new services can be done with stunnel (http://www.stunnel.org) or sslwrap (http://www.quiltaholic.com/rickk/sslwrap/). You can either configure a daemon to listen on the secure ports (465 and 995 for SSMTP and SPOP3) and forward the traffic to the normal services, or run a service on the ports that invokes the secure wrapper inline. A drawback of the first approach is that connections appear to be from 127.0.0.1, reducing the usefulness of tcpserver on the unencrypted port (I'm told there's a workaround for this on Linux machines).
New services require configuring stunnel and new run scripts but no changes to the existing installation. An example of stunnel-3 wrapped services can be found in http://www.ornl.gov/its/archives/mailing-lists/qmail/2003/01/msg01105.html. Jesse reports success with stunnel-4 wrapped services in http://www.ornl.gov/its/archives/mailing-lists/qmail/2002/09/msg00238.html. Finally, let me note that if your users want secure services, they should be using something like PGP/GPG and APOP.

Brad Taylor btaylor (at) Autotask (dot) com 8 Aug 2005

I'm running a web server and Squid in reverse proxy mode, terminating SSL on the Squid box and allowing HTTP to the realserver. I want to add two more realservers for a total of 3. Could I put an LVS into this mix in front of the Squid server, add IPs to Squid, and load balance the Squid IPs, allowing Squid to continue to terminate SSL? Traffic would move like this: HTTPS -> (LVS) -> HTTPS (1 of 3 Squid IPs) -> HTTP (1 of 3 realservers). Or would the SSL traffic at the LVS need to be decrypted first?

Joe 2005-08-08 Whether it's better to have the SSL decryption on the realserver or in a separate SSL accelerator isn't clear AFAIK. It's not like it comes up a lot and we've figured out what to do. Horms probably has the clearest point on the matter, which is not to have a separate SSL engine but to have each realserver do its own decrypting/encrypting.

Graeme Fowler graeme (at) graemef (dot) net 08 Aug 2005 I've built a similar system at work (can't go into too much detail about certain parts of it, sadly) but the essence is as follows: The Squids are acting in reverse proxy mode and I terminate the SSL connections on them via the frontend LVS, so they're all load balanced. Behind the scenes it is a bit more complicated, as certain vhosts are only present on certain groups of servers within the "cluster" and the Squids aren't necessarily aware where they might be, so the Squids use a custom-written redirector to do lookups against an appropriate directory and redirect their requests accordingly. In a position where you have a 1:1 map of squids/realservers you could in theory park a single server behind a single squid, but that doesn't give you much scalability. It does however mean that if a webserver fails then that failure gets cascaded up into the cluster more easily. Then again, if you're shrewd with your healthchecks you can combine tests for your SSL IP addresses and take them down if all the webservers fail. Also, don't overallocate IP addresses on the Squids. If I say "think TCP ports", you only need a single IP on your frontend NIC... but I'll leave you to work that one out! Remember that the LVS is effectively just a clever router; it isn't application-aware at all. That's what L7 kit is for :)

Horms Here is a description I wrote a while ago about SSL/LVS. In a nutshell, you probably want to use persistence and have the realservers handle the SSL decryption.
http://archive.linuxvirtualserver.org/html/lvs-users/2003-07/msg00184.html Actually, using the lblc scheduler, or something similar, might be another good solution to this problem.

Graeme ...this is OK if you have a set of servers onto which you can install multiple SSL certs and undergo the pain of potentially having an IP management nightmare:

1 "cluster" IP
2 realservers, 1 site -> 2 "cluster" IPs
...
10 realservers, 10 sites -> 100 "cluster" IPs

I think you can see where this is going! Very rapidly the IP address management becomes unwieldy. You can of course get around this by assigning different *ports* to each site on each server, but if you have the cert installed on all servers in the cluster this soon becomes difficult to manage too. Offloading onto some sort of SSL "proxy", accelerator, engine (call it what you will) means that you can then simply utilise the processing power of that (those) system(s) to do the SSL overhead and keep your webservers doing just that, serving pages. If each VIP:443 points to a different port on the "engine" you greatly simplify your address management too.

Horms I do not follow how having traffic arrive on the realservers in plain text or SSL changes the IP address situation you describe above.

Graeme Also, most commercial SSL certification authorities will charge you an additional fee to deploy a cert on more than one machine, so if you can reduce the number of "engine" servers you reduce your costs by quite some margin.

Horms This is the sole reason to use an SSL accelerator in my opinion. The fact is that SSL is likely to be the most expensive ($) operation your cluster is doing, and in many cases it is cheaper to buy some extra servers, and offload this processing to an LVS cluster, than to buy an SSL accelerator card.

Graeme You do however still need to use persistence, and potentially deploy some sort of "pseudo-persistence" in your engines too, to ensure that they are utilising the same backend server.
If you don't, you'll get all sorts of application-based session oddness occurring (unless you can share session states across the cluster). Horms If you have applications that need persistence, then of course you need persistence regardless of SSL or not. But if you are delivering SSL to the real servers then you probably want persistence, regardless of your applications, to allow SessionID to work, and thus lower the SSL processing cost.
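
The effect of persistence that Horms and Graeme are relying on can be shown with a small model: a client that comes back within the timeout is sent to the same realserver, so an SSL SessionID negotiated there can be reused. This is a simplified sketch and not the ipvs implementation. The class name, the round-robin choice for new clients and the rule that every connection refreshes the timeout are all assumptions made for the illustration.

```python
import itertools


class PersistentScheduler:
    """Toy model of persistence layered on a round-robin scheduler."""

    def __init__(self, realservers, timeout):
        self._rr = itertools.cycle(realservers)
        self._timeout = timeout
        self._table = {}  # client_ip -> (realserver, expiry_time)

    def schedule(self, client_ip, now):
        """Return the realserver for a new connection arriving at time `now`."""
        entry = self._table.get(client_ip)
        if entry is not None and now < entry[1]:
            server = entry[0]        # known client, entry still valid
        else:
            server = next(self._rr)  # new or expired client: pick afresh
        # Assumption: each new connection restarts the persistence timer.
        self._table[client_ip] = (server, now + self._timeout)
        return server


# Usage with made-up names: three realservers, 300 second timeout.
sched = PersistentScheduler(["rs1", "rs2", "rs3"], timeout=300)
first = sched.schedule("10.0.0.1", now=0)       # new client
other = sched.schedule("10.0.0.2", now=1)       # a different client
again = sched.schedule("10.0.0.1", now=100)     # same client, within timeout
later = sched.schedule("10.0.0.1", now=1000)    # same client, entry expired
```

Because the table is keyed on the client IP alone, connections from one client to port 80 and to port 443 would land on the same realserver in this model, which is the behaviour wanted when the two ports are grouped for persistence.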

SSL termination at localnode: patch by Carlos Lozano, Siim Poder and Malcolm Turnbull

This code was originally written by Carlos Lozano. Malcolm wanted it ported to the current kernel and offered to pay for the work. Siim Poder did the recoding. This patch will be in 2.6.28 (Dec 2008).

Malcolm I just paid Siim Poder for the work to convert Carlos Lozano's earlier patch to the latest kernel... And then luckily Horms thought it was a good idea... I think that Kemp and Barracuda did it as well but didn't feed it back to the community.

lists lists (at) loadbalancer (dot) org 01 May 2008

At the moment I can do SSL termination with Pound (http://www.apsis.ch/pound/), then hand off locally to HaProxy (http://haproxy.1wt.eu/) for cookie insertion and load balancing:

HaProxy -> Real Servers
x.x.x.10:443 -> x.x.x.10:80 -> Real Servers

But I'd like to do:

LVS -> Real Servers
x.x.x.10:443 -> x.x.x.10:80 -> Real Servers

But the Pound process on the director can't access the realservers via the local LVS set up at x.x.x.10:80. Is this the local node problem? I've tried in NAT and DR mode. Is there any way I can get LVS to pick up a local request, i.e. wget x.x.x.10:80 (from the local console) picks up data from a realserver? I found what I was after in the HOWTO!

Malcolm 17 Dec 2008 The patch is so that we can have a local server proxy, e.g. mod_proxy, pound, stunnel, haproxy, feed into an LVS server pool on the same server (Joe: i.e. the director). We use Pound (reverse proxy) to terminate HTTPS traffic and then pass it to an LVS VIP (masq/NAT) which then distributes to HTTP servers. This only works in LVS-NAT mode with the load balancer as the default gateway. Blog post here: LVS Local node patch for Linux 2.6.25, Centos 5 kernel build how-to (http://www.loadbalancer.org/blog/lvs-local-node-patch-for-linux-2625-centos-5-kernel-build-how-to/)

Joe: you don't need to do the patch/build. The code is in the kernel.

Siim Poder siim (at) p6drad-teel (dot) net 18 Dec 2008 afaik it doesn't require the director to be the default gw in general (for any LVS mode), as a director's IP will be the source IP of the packet. The added functionality is that the director can connect to one of its own VIPs and have the connections load balanced to a number of RSs. As Malcolm already mentioned, the main use case would be having an https proxy on the director accepting https and using a local VIP to LB to http RSs, instead of LBing https to https RSs (potentially using director resources more efficiently, in case the traffic is low enough). The reason it didn't work (before the patch) is that traditionally the ipvs code only tried to load balance packets that were coming IN to the director (from the network). This patch adds the possibility to load balance packets that are leaving the director - that is, packets originating from the director itself. Localnode only comes into play if you have two directors and a setup that load balances both incoming and outgoing connections. For only load balancing the outgoing connections (that is, just one director) you do not need to have a VIP:443; the https_proxy can just listen on your "normal every-day" IP.
The part that was not previously possible is the https_proxy connecting to VIP:80 on DIRECTOR1 itself and having the connections load balanced between RS1 and RS2. There is nothing special about the setup. You just have to connect to one of the VIPs from the director. Before the patch, the connection would get reset, but after the patch, it is load balanced to whichever real servers are behind the VIP. Something like this: For unpatched kernels, you would get connection refused; for patched ones you get connected to either 1.2:80 or 1.3:80.

r commands; rsh, rcp (and their ssh replacements), tcp 514

The r commands use multiport protocols. see .

lpd, tcp 515

Network printers have a functioning TCP/IP stack and can be used as realservers in an LVS-NAT setup to produce a print farm. I got the idea for this at Disneyworld in Florida. At the end of one of the roller coaster rides (Splash Mountain) cameras take your photo and if you buy your photo, it is printed out on one of about 6 colour printers. I watched the people collecting the print-outs - they didn't seem to know which printer the output was going to and looked at all of them to find each print. The same thing would happen with an LVS-NAT printer farm, since you can't pick the realserver that will get the job. Using LVS-NAT with lpd is written up in section 5.3 of the article on performance of single realserver LVSs.

Databases

Because of the large cost of distributed or parallel database servers, people have looked at using LVS with mysql or postgres. A read-only LVS of databases is simple to set up; you only need some mechanism to update the database (on all of the realservers) periodically. A read-write LVS of databases is more difficult, as a write to the database on one realserver has to be propagated to the databases on the other realservers. A read-mostly database (like a shopping cart) can be loadbalanced by LVS with the updates occurring outside the LVS. However a full read-write LVS database is not feasible at the moment.

Moving data around between realservers is external to LVS. If clients are writing, you need to do something at the back end to propagate the writes to the other nodes fast enough that the update appears atomic to other users reading the data. If you have a shopping cart, then other users don't need to know about a particular user's writes till that user commits to ordering the item (decrementing the stock), so you have some time to propagate the writes.

Currently most LVS databases are deployed in a multi-tier setup. The client connects to a web-server, which in turn is a client for the database; this web-server database client then connects to a single databased. In this arrangement the LVS should balance the webservers/database clients and not balance the database directly.

Production LVS databases, e.g. the service implemented by Ryan Hulsker RHulsker (at) ServiceIntelligence (dot) com (sample load data at http://www.secretshopnet.com/mrtg/), have the LVS users connect to database clients (perl scripts running under a webpage) on each realserver. These database clients connect to a single databased running on a backend machine that the LVS user can't directly access. The databased isn't being LVS'ed - instead the user connects to LVS'ed database clients on the realserver(s), which handle intermediate data processing, increasing your throughput.
Next, replication was added to mysqld and updates to one database could be propagated to another. Rather than using peer-to-peer replication, LVS users set up replication in a star, with a master that accepted the writes, which were propagated to slave databases on the realservers. Then mysqld acquired functionality similar to loadbalancing, removing any need to develop LVS for loadbalancing a database (e.g. - from Ricardo Kleeman 4 Aug 2006 - multi-master replication, which means you can have 2 servers as master and slave... so in that sense a pair should be usable for load balancing). (also see loadbalanced mysql cluster) Some of the earlier comments about mysqld below may no longer be relevant to current versions of mysqld.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 10 Dec 2008

Load balancing SQL server is no problem with LVS, but just like every other database it is pretty difficult to get your application to work with it unless it's a read-only copy. The normal configuration is a shared storage twin-head write cluster + load balanced read-only clusters.

Multiple Writer, Common Filesystem

summary: this doesn't work

An alternative to propagating updates to databases is to use a distributed filesystem, which then becomes responsible for the updates appearing on other machines (see Filesystems for realserver content and Synchronising realserver content). While this may work for a webserver, a databased is not happy about having agents outside its control writing to its data filesystem.

A similar early naive approach, having databaseds on each realserver accessing a common filesystem on a back-end server, fails. Tests with mysqld running on each of two realservers working off the same database files mounted from a backend machine showed that reads were OK, but writes from any realserver either weren't seen by the other mysqld or corrupted the database files. Presumably each mysqld thinks it owns the database files and keeps copies of locks and pointers. If another mysqld is updating the filesystem at the same time, then the first set of locks and pointers is invalid. Presumably any setup in which multiple (unlocked) databaseds were writing to one file system (whether NFS'ed, GFS'ed, coda, intermezzo...) would fail for the same reason.

In an early attempt to set up this sort of LVS, jake buchholz jake (at) execpc (dot) com set up an LVS'ed mysql database with a web interface. LVS was to serve http and each realserver was to connect to the mysqld running on itself. Jake wanted the mysql service to be lvs'ed as well and for each realserver to be a mysql client. The solution was to have 2 VIPs on the director, one for http and the other for mysqld. Each http realserver makes a mysql request to the mysqlVIP. In this case no realserver is allowed to have both a mysqld and an httpd. A single copy of the database is nfs'ed from a fileserver. This works for reads.

Parallel Databases Malcolm Cowe

What if the realservers were also configured to be part of an HA cluster like SGI Failsafe or MC/ServiceGuard (from HP)? Put the real servers into a highly available configuration (probably with shared storage), then use the director to load balance connections to the virtual IP addresses of the HA cluster. Then you have a system that parallelises the database and load balances the connections. This might require ORACLE Open Parallel Server (OPS) edition of the database, you'd have to check.

"John P. Looney" john (at) antefacto (dot) com 12 Apr 2002 My previous employer did try such a system, and had very little luck with it. In general, you are much better off having one writer, and multiple readers. Then, update the "read-only" databases from the write-database. We were using Oracle Parallel Server on HP-UX, and it crashed about once every two weeks, not always just under heavy load. Databases are by design single entities. Trying to find a technical solution to such a mathematical problem is asking for pain and suffering, over a prolonged period of time.

Linux Scalable Database project

May 2001: The LSD project does not seem to be active anymore. The replication feature of mysql is functionally equivalent.

The Linux Scalable Database (LSD) project http://lsdproject.sourceforge.net/ is working on code to serialise client writes so that they can be written to all realservers by an intermediate agent. Their code is experimental at the moment, but is a good prospect in the long term for setting up multiple databaseds and file systems on separate realservers.

Gaston's Apache/mysql setup Gaston Gorosterrazgoro (at) hostarg (dot) com (dot) ar Jun 12 2003 I solved my problems quite differently from what is seen in the HOWTO. So, someone may need this email in the near future: End of story. I have better security in the MySQL daemons (not accessible from clients), less load on the director machine (don't have to worry about MySQL, nor run MON yet), and I'm happy. :) Now it is the turn of Cyrus on Server2 and Server3.

Databases: mysql

How To Set Up A Load-Balanced MySQL Cluster There is a writeup (Mar 2006) at http://www.howtoforge.com/loadbalanced_mysql_cluster_debian by Falko Timme based on UltraMonkey. (we don't know this guy and he's never posted to the LVS mailing list - he seems to know what he's doing though.)

mysql replication MySQL (and most other databases) supports replication of databases. Ted Pavlic tpavlic (at) netwalk (dot) com 23 Mar 2001

When used with LVS, a replicated database is still a single database. The MySQL service is not load balanced. HOWEVER, it is possible to put some of your databases on one server and others on another. Replicate each SET of databases to the OTHER server and only access them from the other server when needed (at an application or at some fail-over level).
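A minimal sketch of the application-level choice described above: each set of databases has a home server, and the replicated copy on the other server is used only when the home server is down. The server names, database names and the `server_for` helper are hypothetical, chosen only for illustration.

```python
# Hypothetical layout: each set of databases has a home server and is
# replicated to the other server, which is used only when the home is down.
HOME = {"orders": "db1", "catalog": "db2"}
STANDBY = {"orders": "db2", "catalog": "db1"}


def server_for(database, alive):
    """Pick the server an application should use for `database`.

    `alive` is the set of servers currently believed to be up.
    """
    for candidate in (HOME[database], STANDBY[database]):
        if candidate in alive:
            return candidate
    raise RuntimeError(f"no server available for {database!r}")


print(server_for("orders", {"db1", "db2"}))  # home server is up
print(server_for("orders", {"db2"}))         # home is down, use the standby copy
```

The sketch only picks a server; how writes made on the standby copy get back to the home server afterwards is outside its scope.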

Doug Sisk sisk (at) coolpagehosting (dot) com 9 May 2001

An article on mysql's built in replication facility

Michael McConnell michaelm (at) eyeball (dot) com 13 Sep 2001

Can anyone see a down side or a reason why one could not have two Systems in a Failover Relationship running MySQL? The Database file would be synchronized between the two systems via a crontab and rsync. Can anyone see a reason why rsync would not work? I've on many occasions copied the entire mysql data directory to another system and started it up without a problem. I recognize that there are potential problems in that the rsync might take place while the master is writing and the sync will only have part of a table, but mysql's new table structure is supposed to account for this. If anything, a quick myismfix should resolve these problems.

Michael McConnell michaelm (at) eyeball (dot) com 2001-09-14

There are many fundamental problems with MySQL Replication. MySQL's Replication requires that two systems be set up with identical data sources and activated in a master/slave relationship. If the Master fails, all requests can be directed to the Slave. Unfortunately this Slave does not have a Slave, and the only way to give it a slave is to turn it off, synchronize its data with another system and then activate them in a Master/Slave relationship, resulting in serious downtime when databases are in excess of 6 gigs (-: This is the most important problem, but there are many many more, and I suggest people take a serious look at other options. Currently I use a method of syncing 3 systems using BinLogs.

Paul Baker pbaker (at) where2getit (dot) com

What is the downtime when you have to run myisamchk against the 6 gig database because rsync ran at exactly the same time as mysql was writing to the database files and now your sync'd image is corrupted? There is no reason you cannot set up the slave as a master in advance from the beginning. You just use the same database image as from the original master. When the master goes down, set up a new slave by simply copying the original master image over to the new slave, then point it to the old slave that was already set up to be a master. You wouldn't need to take the original slave down at all to bring up a new one. You would essentially be setting up a replication chain but only with the first 2 links active.

Michael McConnell michaelm (at) eyeball (dot) com

In the configuration I described using rsync, the myisamchk would take place on the slave system. I recognize the time involved would be very large, but this is only the slave. This configuration would be set up so an rsync between the master and slave takes place every 2 hours, and then the Slave would execute a myisamchk to ensure the data is ready for action. I recognize that approximately 2 hours worth of data could be lost, but I would probably use the MySQL BinLogs rotated at 15 minute intervals and stored on the slave to allow this to be manually merged in, and keep the data loss time down to only 15 minutes. Paul, you said that I could simply copy the Data from the Slave to a new Slave, but you must keep in mind that in order to do this MySQL requires that the Master and Slave data files be IDENTICAL. That means the Master must be turned off, the data copied to the slave, and then both systems activated, resulting in serious downtime.

Paul

You only have to make a copy of the data one time, when you initially set up your Master. However long it takes to do this is your downtime: your down time is essentially only how long it takes to copy your 6 gigs of data NOT across a network, but just on the same machine (which is far less than a myisamchk on the same data). Once that is done, at your leisure you can copy the 6 gigs to the slave while the master is still up and serving requests. You can then continue to make slave after slave after slave just by copying the original snapshot to each one. The master never has to be taken offline again.

Michael McConnell You explained that I can kill the MySQL on the Master, tar it up, copy the data to the Slave and activate it as the Slave. Unfortunately this is not how MySQL works. MySQL requires that the Master and Slave be identical data files, *IDENTICAL* that means the Master (tar file) cannot change before the Slave comes online. Paul

Well I suppose there was an extra step that I left out (because it doesn't affect the amount of downtime). The complete migration steps would be:

1. Modify the my.cnf file to turn on replication on the master server. This is done while the master mysql daemon is still running with the previous config in memory.
2. Shut down the mysql daemon via kill.
3. Tar up the data.
4. Start up the mysql daemon. This will activate replication on the master and cause it to start logging all changes for replication since the time of the snapshot in step 3. At this point downtime is only as long as it takes you to do steps 2, 3, and 4.
5. Copy the snapshot to a slave server and activate replication in the my.cnf on the slave server as both a master and a slave.
6. Start up the slave daemon. At this time the slave will connect to the master and catch up with any of the changes that took place since the snapshot.

So as you see, the data can change on the master before the slave comes online. The data just can't change between when you make the snapshot and when the master daemon comes up configured for replication as a master.
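The reasoning in the steps above can be sketched with a toy model. This is not MySQL code: the `Master` and `Slave` classes are hypothetical stand-ins, with a dict playing the part of the data files and a list playing the part of the binary log. It only illustrates why the master may keep changing after the snapshot, as long as logging starts at the moment the snapshot is taken.

```python
class Master:
    """Toy master: a dict of rows plus an append-only log of changes."""

    def __init__(self, rows):
        self.rows = dict(rows)
        self.binlog = []       # (key, value) changes since logging began
        self.logging = False   # off until the daemon restarts with the new config

    def write(self, key, value):
        self.rows[key] = value
        if self.logging:
            self.binlog.append((key, value))

    def snapshot_and_enable_logging(self):
        # Steps 2-4: the daemon is down while the copy is made, so no write
        # can fall between the snapshot and the start of logging.
        snapshot = dict(self.rows)
        self.logging = True
        return snapshot


class Slave:
    """Toy slave: starts from a snapshot and replays the master's log."""

    def __init__(self, snapshot):
        self.rows = dict(snapshot)
        self.applied = 0       # number of log entries already replayed

    def catch_up(self, master):
        for key, value in master.binlog[self.applied:]:
            self.rows[key] = value
        self.applied = len(master.binlog)


master = Master({"stock": 10})
snapshot = master.snapshot_and_enable_logging()
master.write("stock", 9)           # master changes before any slave exists
slave = Slave(snapshot)            # step 5: slave built from the old snapshot
slave.catch_up(master)             # step 6: replay what happened since
print(slave.rows == master.rows)
```

The same snapshot can seed further slaves later, since each one replays the log from the snapshot point.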

Michael McConnell

Paul you are correct. I've just done this experiment.

A(master) -> B(slave)
B(master) -> C(slave)

A died. Turn off C's database, tar it up, replicate the data to A, activate A as Slave to C. No data loss, and 0 downtime.

(there appears to have been an offline exchange in here.)

Michael McConnell michaelm (at) eyeball (dot) com

I've just completed the rsync deployment method; this works very well. I find this method vastly superior to both other methods we discussed. Using rsync allows me to use only 2 HOSTS and I can still provide 100% uptime. In the other method I need 3 systems to provide 100% uptime. In addition the rsync method is far easier to maintain and set up. I do recognize this is not *perfect*. I run the rsync every 20 minutes, and then I run myisamchk on the slave system immediately afterwards. I run myisamchk to scan only tables that have changed since the last check. Not all my tables change every 20 minutes. I will be timing operations and lowering this rsync interval to approximately 12 minutes. This method works very effectively for managing a 6 Gig Database that is changing approximately 400 megs of data a day. Keep in mind, there are no *real time* replication methods available for MySQL. Running with MySQL's built-in Replication commonly results (at least with a 6 gig / 400 megs changing) in as much as 1 hour of data inconsistency. The only way to get true real time is to use a shared storage array.

Paul Baker

MySQL builtin replication is supposed to be "realtime" AFAIK. It should only fall behind when the slave is doing selects against a database that causes changes to wait until the selects are complete. Unless you have a select that is taking an hour, there is no reason it should fall that far behind. Have you discussed your findings with the MySQL developers?

Michael McConnell

I do not see MySQL making any claims to Real-time. There are many situations where a high load will result in systems getting backed up, especially if your Slave system performs other operations. MySQL's built-in replication functions like so:

- Master writes updates to Binary Log
- Slave checks for Binary Updates
- Slave downloads new Bin Updates / installs

Alexander N. Spitzer aspitzer (at) 3plex (dot) com

how often are you running rsync? since this is not realtime replication, you run the risk of losing information if the master dies before the rsync has run (and new data has been added since the last rsync.)
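The size of that risk can be estimated with simple arithmetic. The sketch below uses the figures quoted in this thread (about 400 megs of changes a day, and sync intervals of 2 hours, 20 minutes and 12 minutes) and assumes, purely for illustration, that changes are spread evenly over the day; real traffic is burstier, so peak-hour exposure will be higher.

```python
def data_at_risk_mb(change_mb_per_day, sync_interval_min):
    """Changed data not yet copied if the master dies just before the
    next sync, assuming changes are spread evenly over 24 hours."""
    minutes_per_day = 24 * 60
    return change_mb_per_day * sync_interval_min / minutes_per_day


for minutes in (120, 20, 12):
    at_risk = data_at_risk_mb(400, minutes)
    print(f"{minutes:3d} min between syncs: up to {at_risk:.1f} MB unsynced")
```

Under these assumptions a 20 minute interval leaves a few megabytes exposed at worst; shortening the interval shrinks the window proportionally but never closes it.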

Don Hinshaw dwh (at) openrecording (dot) com

I have one client where we do this. At this moment (I just checked) their database is 279 megs. An rsync from one box to another across a local 100mbit connection takes 7-10 seconds if we run it at 15 minute intervals. If we run it at 5 minute intervals it takes <3 seconds. If we run it too often, we get an error of "unexpected EOF in read_timeout". I don't know what causes that error, and they are very happy with the current situation, so I haven't spent any time to find out why. I assume it has something to do with write caching or filesystem syncing, but that's just a wild guess with nothing to back it up. For all I know, it could be ssh related. We also do an rsync of their http content and httpd logs, which total approximately 30 gigs. We run this sync hourly, and it takes about 20 minutes.

Benjamin Lee benjaminlee (at) consultant (dot) com

For what it's worth, I have also been witness to the EOF error. I have also fingered ssh as the culprit.

John Cronin

What kind of CPU load are you seeing? rsync is more CPU intensive than most other replication methods, which is how it gains its bandwidth efficiency. CPU Load? How many files are you syncing up - a whole filesystem, or just a few key files? From your answer, I assume you are not seeing a significant CPU load.

Michael McConnell

I rsync 6 gigs worth of data, approximately 50 files (tables). Calculating checksums is really a very simple calculation; the CPU used to do this is 0% - 1% of a PIII 866 (care of vmstat). I believe all of these articles you have found relate to rsync servers acting as major FTP servers, for example ftp.linuxberg.org or ftp.cdrom.com.

Mark

We're using master/slave replication. The LVS balances the reads from the slaves and the master isn't in the load-balancing pool at all. The app knows that writes go to the master and reads go the VIP for the slaves. Aside from replication latency, this works very reliably.

Todd Lyons tlyons (at) ivenue (dot) com 26 Oct 2005 This works for us too. We have an inhouse DBI wrapper that knows to read from slaves and write to the master. Reads are directed to a master for a certain number of seconds until we are reasonably sure that the data has been replicated out.
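A minimal sketch of the routing rule such a wrapper applies, in Python rather than the Perl DBI wrapper the posters used. The class name, the addresses, the list of write verbs and the 5 second pin time are all assumptions made for illustration; the posters do not give their values.

```python
import time


class ReadWriteRouter:
    """Send writes to the master and reads to the slave VIP, except that
    reads go to the master for `pin_seconds` after this session last wrote."""

    WRITE_VERBS = ("INSERT", "UPDATE", "DELETE", "REPLACE")

    def __init__(self, master, slave_vip, pin_seconds=5.0, clock=time.monotonic):
        self.master = master
        self.slave_vip = slave_vip
        self.pin_seconds = pin_seconds
        self.clock = clock
        self.last_write = None   # time of this session's most recent write

    def target_for(self, sql):
        words = sql.split(None, 1)
        verb = words[0].upper() if words else ""
        now = self.clock()
        if verb in self.WRITE_VERBS:
            self.last_write = now
            return self.master
        recently_wrote = (self.last_write is not None
                          and now - self.last_write < self.pin_seconds)
        # Read from the master until the write has probably replicated out.
        return self.master if recently_wrote else self.slave_vip


router = ReadWriteRouter("master.example:3306", "slave-vip.example:3306")
print(router.target_for("SELECT * FROM cart"))      # no recent write
print(router.target_for("UPDATE cart SET qty = 2")) # write
print(router.target_for("SELECT * FROM cart"))      # read just after a write
```

Pinning reads for a fixed time is a guess at replication lag, not a guarantee; a slave that has fallen further behind will still return stale rows once the pin expires.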

this is exactly how i was doing it. The problem was that i expected the data on the slaves to be realtime - and the replication moved too slow. not to mention it broke a lot (not sure why) - anytime there was a query that failed it halted replication.

We've not seen this. Our replication has been very stable, and our system is roughly 80% read, 20% write. But we do as few reads as possible on the master, keeping all of the reads on the slaves, doing writes to the master.

in which case, i had to either take the master down, or put it in read-only mode, so i could copy the database over, and re-initialize replication.

Make a tarball of the databases that you replicate and keep that tarball someplace available for all slaves. Record the replication point and filename. Then when you have to restart a slave, blow away the db, extract the tarball, configure the info file to the correct data, then start mysql and it will populate itself from the bin files on the master. For us, we try to do it once per month, but I've seen 100 day bin file processing and it only took about 30 minutes to catch the database up. We have gig networking though, so your results may vary. We also have a script that connects and checks the replication status for each slave and spits out email warnings if the bin logs don't match. It doesn't check position because it frequently changes mid-script, sometimes more than once. Here's an example:
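A minimal sketch of a check along these lines (an illustration, not the poster's script). It assumes the caller has already run `SHOW MASTER STATUS` on the master and `SHOW SLAVE STATUS` on each slave and passes the rows in as dicts keyed by column name; the function name and the sample host and file names are hypothetical. Checking the two replication threads is an addition that goes beyond what the poster describes. As in the description above, the log position is deliberately not compared.

```python
def replication_warnings(master_status, slave_statuses):
    """Compare each slave's view of the master's bin log with the master's own.

    master_status:  dict holding the 'File' column of SHOW MASTER STATUS.
    slave_statuses: {slave_name: dict of SHOW SLAVE STATUS columns}.
    Returns a list of warning strings (empty if everything matches).
    """
    warnings = []
    for name, status in sorted(slave_statuses.items()):
        # Assumption beyond the poster's description: also flag stopped threads.
        for thread in ("Slave_IO_Running", "Slave_SQL_Running"):
            if status.get(thread) != "Yes":
                warnings.append(f"{name}: {thread} is {status.get(thread)!r}")
        slave_file = status.get("Master_Log_File")
        if slave_file != master_status["File"]:
            warnings.append(
                f"{name}: reading {slave_file!r}, "
                f"master is writing {master_status['File']!r}")
    return warnings


master = {"File": "bin.000042"}
slaves = {
    "slave1": {"Master_Log_File": "bin.000042",
               "Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes"},
    "slave2": {"Master_Log_File": "bin.000040",
               "Slave_IO_Running": "Yes", "Slave_SQL_Running": "No"},
}
for line in replication_warnings(master, slaves):
    print(line)
```

The returned list can be mailed as-is; an empty list means no mail needs to be sent.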

of course, this was on sub-par equipment, and maybe nowadays it runs better. i'm thinking that the NDBcluster solution might be better since it's designed for this more - replication still (afaik) sucks because when you need to re-sync it requires the server to be down or put in read-only, which is essentially downtime. The main problem I see with NDB is that it requires the entire database (or whatever part each node is responsible for) to be in memory.

We have had very stable results from mysql stable releases doing replication. We typically use the mysql built rpms instead of the vendor rpms. We had serious stability problems with NDB, but we were also trying a very early version. It is liable to be much more solid now.

not work with files the way normal InnoDB, MyIsam, etc engines work. And I think this is still the case in MySQL 5 (correct me if I'm wrong.) I don't know what happens if your storage node suddenly runs out of Memory.

It won't be until MySQL 5.1 that you'll be able to do file based clustering. Then load balancing becomes simple, for both reads and writes.

Dan Yocum yocum (at) fnal (dot) gov 10 Dec 2008

While not strictly on topic wrt LVS, with MySQL one can use multi-master, shared-nothing replication or MySQL clustering (yes, these are 2 separate methods for replicating data), with all nodes having read and write access. I have implemented the former with LVS load balancing the connections to the MySQL RSs with good success. I used these guides to set this up:

http://www.howtoforge.com/mysql_database_replication
http://www.howtoforge.com/loadbalanced_mysql_cluster_debian

And I wrote up the whole procedure for our (grid) authentication and authorization environment: https://docs.google.com/View?docID=ddszv68g_19d88pzv&revision=_latest

Clustering mysql

Matthias Mosimann matthias_mosimann (at) gmx (dot) net 2005/03/01

I also don't have experience with any of these solutions but... MySQL's solutions to make a mysql cluster are at MySQL cluster (http://www.mysql.com/products/cluster/). The book "High Performance MySQL Cluster" is available from O'Reilly. For people using PHP with a MySQL cluster see SQL relay (http://sqlrelay.sourceforge.net/). If you're using a java application maybe this page could help you: ObjectWeb: Clustered JDBC (http://c-jdbc.objectweb.org/)

Jan Klopper janklopper (at) gmail (dot) com 2005/03/02

Your best method (if you only do a lot of Select queries) would be to use one mysql master server which gets all the update/delete/insert queries and stores the main database. Then you could use a mysql master-slave setup to replicate this master database to a couple of slave servers on which you could do all of your selects. You could use Mon/ldirectord to make an LVS cluster out of these select servers just like any other type of servers. I'm not sure what you should do with the persistence though (especially if you use mysql_pconnect from within php/apache). I have the following setup:

- cheap webservers (LVS)
- cheap load balancers (VIP)
- expensive raid 10 mysql server (master)
- cheap fileserver + mysql slave (just for backup, not for queries)

Granted, if your database server dies, you're screwed, but with the current mysql master/slave/replication/cluster technologies you wouldn't be able to provide failover for update/insert/delete queries anyway. Making one of the slaves into a master, and thus changing its database, will give you a serious problem when your master comes back up again. You'd have to make the old master a slave, stop the current master (breaking the whole point of always being online) and replicate to it, then when it is up to date change its role to be master again.
Jan Klopper

So you can set up a mysql server on each of the nodes of your cluster (allowing all of the nodes to be load balanced mysql servers). What would be a better solution (afaik) is to place mysql on all the apache servers as well, since this would allow them to connect to their own database instead of using a network connection, decreasing the overhead for each query/connect. And if one of the webserver nodes went down, so would its mysql, so that's not a real problem. LVS will just not redirect any request to the machine at all. A non-clustered server would then be the ndbd server managing and updating all of the mysql servers on the nodes themselves. And thus you could use really inexpensive hardware for the mysql server, as you won't need UBER speed on that machine to handle all requests from all LVS nodes anymore.

Todd Lyons tlyons (at) ivenue (dot) com

We've always found that separating services out to different machines works better than running multiple services on each machine.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com (replying to Jan)

Well you still need a very powerful ndbd server, with high network capabilities, and you should use at least two data nodes.

Matthias Mosimann

Here's a thread in the mysql mailing list about clustering / loadbalancing mysql http://forums.mysql.com/read.php?25,14754,14754#msg-14754

Francois JEANMOUGIN

The idea is to have small MySQL nodes on the realservers and 2 or 3 data nodes (powerful machines) on the backend. The current state of the art for MySQL is to use an NDB cluster. You can use LVS to load-balance connections to your MySQL servers. You need a small management server to manage it.

Todd Lyons tlyons (at) ivenue (dot) com 2005/03/02

We worked for a short while with NDB clustering and experienced huge problems. It was probably more an issue with perl-DBD instead of NDB itself, but it was enough that we had to go back to standard replication.
Our current setup uses a homegrown DBI wrapper that does all reads from the slave and directs all writes to the master (and for a short time all reads in that session go to that master too). As read load escalates, just add another slave and put it on an LVS VIP.

Francois JEANMOUGIN

To check whether a MySQL server is alive or not, you have two solutions (assuming you choose Keepalived):

- Use a TCP_CHECK that will detect if the server accepts connections.
- Use the check_mysql from nagios-plugins embedded in a MISC_CHECK script.

Jan Klopper

But as soon as you start doing updates to the database, you will horribly screw up your data, because mysql can't propagate the changes made on slaves back to the master and so on. This will lead to inconsistencies in your database. If you only need to do loads (and I mean loads) of select queries with no changes, then you could very well use mysql slaves inside an LVS to run these queries.
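The first of the two checks Francois lists amounts to "does the server accept a TCP connection". A minimal sketch of that test, written so it could be wrapped in a MISC_CHECK script (Keepalived treats exit status 0 as healthy and 1 as failed); the function names, the 3 second timeout and the default port 3306 are assumptions for illustration.

```python
import socket


def mysql_port_open(host, port=3306, timeout=3.0):
    """Return True if host:port accepts a TCP connection within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Refused, unreachable, timed out or unresolvable: treat all as down.
        return False


def check_exit_status(host, port=3306):
    """Map the result onto the exit status a MISC_CHECK script would return."""
    return 0 if mysql_port_open(host, port) else 1
```

A wrapper script would end with `sys.exit(check_exit_status(host))`. Like TCP_CHECK, this only shows that something is listening; it does not show that the server can answer queries or that a slave is still replicating, which is what a check_mysql style test is for.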

Francois JEANMOUGIN

Are we talking about the same thing? I'm not talking about the master/slave or active-slave method using the old MySQL replication process. I'm talking about NDB cluster: http://dev.mysql.com/tech-resources/articles/mysql-cluster-for-two-servers.html, http://dev.mysql.com/doc/mysql/en/mysql-cluster-basics.html I'm currently evaluating this solution, which has passed the "case study" test. I will implement it at the end of this month. I would like to know if anyone here has any (bad or good) experience with such a MySQL setup.

Todd Lyons tlyons (at) ivenue (dot) com

We've had experience and it was bad. Things like 'DESCRIBE table' would give erratic results. It required a beta version of perl-DBD. Performance was good though. We had a 2 node cluster with plans to scale upwards. I think once supporting code stabilizes it will be a good solution. I like the way they keep things in memory because RAM is cheap. We are also evaluating the true clustering solution provided by Emic Networks. It is fantastic in my opinion, but it's not cheap. It requires custom configuration of your switches and routers to handle multicast MACs, a virtual gateway, and some VLAN'ing. It works well though and scales well. If your system has more than 50% reads, it will work well for you.

Jan Klopper (replying to Francois)

You are talking about pure failover only, right? I was talking about load balancing for mysql, which is pretty much undoable. Pure failover might be possible; just make sure the slave becomes the master on failover, and when the old master gets back up it starts being a slave and replicates from the new master.

Francois

No no no. Please read the links I sent, and add this one: http://dev.mysql.com/doc/mysql/en/mysql-cluster-overview.html I'm talking about a pure active-active cluster. This setup could be strengthened by LVS and a Keepalived MISC checker. In the figure above, you put LVS between Applications and SQL. About the link above, note that the mysqld server and ndbd server can be on the same machine.

Jan Klopper

Even normal master-slave replication still gives loads of problems, so anything cutting edge which might do the trick and allow for the slaves to be full database servers (and thus capable of handling updates/inserts/deletes) should not be trusted in my opinion. Granted, if they get the slaves to also be masters to each other, and they get it to play nice, then we're all set to have the best and fastest clustering database on the planet (except for oracle?).

Bosco Lau bosco (at) musehub (dot) com

Let's forget about the 'multi'-master scenario. It is basically an unsolvable problem for MySQL (at the moment). The only sensible multi-server setup for MySQL is 'single master' - 'multiple slave', preferably in 'star shape formation' i.e. each slave connected directly to the master. Also don't bother about 'daisy-chaining' the slaves; this creates a single point of failure. With the multiple slaves/LVS setup, you can just use MyISAM tables in the slaves' databases whereas you may have InnoDB tables in the master. This will make your SELECT performance even better. Since the slaves just replicate the data from the master, there is no need for transactional table types in the slave databases.

Jan Klopper

Multi-master setups are a huge problem in mysql. If you do use slaves for just the queries, use MyISAM tables (preferably stored completely in memory) to increase select speeds. I'm using mysql to run huge banner/advertisement systems, and thus I need at least one update/insert per view, or I should store them outside of the db, and thus I can't use normal slaves (because I'd still need to connect to the master to make the update). Star formation is the way to go, unless you have a huge amount of changes on the master which make you need a second cascade in there to load balance the updates and release the master from most of the slave replication queries.
I think this step could also be done with LVS (having a couple of slaves retrieve the updates from one master, balanced by LVS) and then use them as masters for a couple of other normal query slaves.

Matthias Mosimann matthias_mosimann (at) gmx (dot) net 2005/04/18

In a mysql cluster, data is stored in memory. If you store about 100 GB in your database you need about 100 GB of memory (ram + swap). Now it looks as if there will be a solution for that. I found a thread in the mysql-cluster mailing list and a link in the manual. Mysql 5.1 will be able to handle disk-based records in the cluster.

Nigel Hamilton nigel (at) turbo10 (dot) com 2005/04/18

The RAM requirements of the initial MySQL Cluster design make it unworkable for me, so a hybrid RAM + disk-based system is welcome news. Our writes may equal or exceed reads - I was planning on using "memcached" to help with handing off as many reads as possible to a distributed RAM bucket. But I'm still thinking of ways to distribute writes - ideally at the application level all I need is a database handle that connects to one big high speed amorphous distributed DB. My current short term solution to distributing writes is to use a MySQL heap/ram table on each node which acts as a bucket collecting writes, which is periodically emptied to the Master's disk.

Matthias Mosimann matthias_mosimann (at) gmx (dot) net

IIRC we had a discussion (Jan Klopper, Francois Jeanmougin and me) on the 2nd March this year on the same topic already. Here's the link: http://archive.linuxvirtualserver.org/html/lvs-users/2005-03/mail4.html

Francois Jeanmougin

We made some other tests on high availability/high performance MySQL: InnoDB has a bottleneck if you have more than 1024 simultaneous concurrent connections to the DB (we have about 5000). Ndb (the MySQL cluster) works quite well, but has a problem with DB opening (use database ;) which limits the scalability of the solution.
We didn't have time to test NDB properly in our environment. The solution seems to be good in terms of design: it is memory based and uses table partitioning (so it'll split the data across several servers). It has to be improved to be really usable and strong.

Troy Hakala troy (at) recipezaar (dot) com 25 Oct 2005 We're using master/slave replication. The LVS balances the reads from the slaves and the master isn't in the load-balancing pool at all. The app knows that writes go to the master and reads go to the VIP for the slaves. Aside from replication latency, this works very reliably. The problem then is failure of the master. The master is redundant in itself (RAID drives, redundant power) to minimize the risk. But yes, we are very read-intensive and keeping the master out of the read load-balancing further increases the master's lifespan.

We haven't tried master-master replication, but it's pretty easy and quick to turn a slave into a master. We prefer simplicity to complexity. Besides, we're not a bank and no one dies if the master db goes down for a few minutes every 5 years. :-) FWIW, our outages have been caused by bandwidth outages more than server hardware failure. There's supposed to be redundancy there too, but for some reason or another, the redundancy never kicks in. ;-)

Replication can be restarted easily: slave start after you fix the error, which is usually just a mysqlcheck on the table. You shouldn't have to take the master down at all. And if you use a replication check in LVS, LVS will take the db server out of rotation if it's not replicating, so it shouldn't even be noticed until you can get around to fixing it. Nagios is also recommended to let you know when a slave is no longer replicating.

The latency is on the order of a couple of seconds and it's easy to take care of in the app. In fact, it's only a problem if you cache db results for hours or days (we use memcached). So it's not much of a problem in reality if you account for it.
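The split Troy describes (writes to the master, reads to the VIP in front of the slaves, replication latency handled in the app) could be sketched roughly as below. This is only an illustration under stated assumptions: the two endpoint names are hypothetical, the verb list is not exhaustive, and the `read_your_writes` flag is an invented way of showing how an app might send a read to the master just after a write so that the user doesn't see stale data from a lagging slave.

```python
# Hypothetical endpoints: the master is outside the LVS pool,
# the read VIP is the LVS virtual service in front of the slaves.
MASTER_ENDPOINT = "db-master.example.internal:3306"
READ_VIP_ENDPOINT = "db-read-vip.example.internal:3306"

# Statements that change data and therefore must reach the master.
# (Illustrative, not a complete list of SQL write statements.)
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE",
               "CREATE", "ALTER", "DROP", "TRUNCATE"}


def pick_endpoint(sql, read_your_writes=False):
    """Return the endpoint a statement should be sent to.

    read_your_writes=True forces a read to the master, for the case
    where the client has just written and a slave may not have
    replicated the change yet.
    """
    stripped = sql.lstrip()
    verb = stripped.split(None, 1)[0].upper() if stripped else ""
    if verb in WRITE_VERBS or read_your_writes:
        return MASTER_ENDPOINT
    return READ_VIP_ENDPOINT
```

With this sketch, `pick_endpoint("SELECT * FROM recipes")` returns the read VIP, while an `INSERT`, or a `SELECT` issued with `read_your_writes=True`, goes to the master.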
NDB violates my simplicity+commodity ideals... it's complicated and requires very expensive (lots of RAM) servers. And, I think it *requires* SCSI drives for some reason (I thought it was memory-based)! Doesn't Yahoo use MySQL replication? If it's good enough for Yahoo, it's good enough for most people. :-)

If the master went down, we'd make a slave into a master manually. But it's never happened in real life, just in lab tests. A script could be written to do it, I guess, but it's not that simple. All the talk about redundancy and high-availability is a bit academic, IMO, unless it really is mission-critical -- and if it was, I'd be using Oracle and not MySQL so I could blame them. ;-) As I mentioned, none of this matters if bandwidth and electricity go out, which are less reliable in my experience than solid-state computer hardware.

mike mike503 (at) gmail (dot) com 25 Oct 2005

perhaps changing a slave to a master has changed from when I used it - anytime a slave died, it would not start replicating (or you wouldn't want to) unless it caught up to the master since it died. but to grab a snapshot of the master data, you would have to take the master down or flush tables with a read lock until it was done getting the data synced. then you could restart it... for example, i was using this on a forum - when someone posts a message: if that doesn't happen within milliseconds, the user is taken to a page that doesn't have their post show up yet.

Pierre Ancelot pierre (at) bostoncybertech (dot) com 26 Oct 2005 I run NDB and it works pretty well... Debian GNU/Linux Sarge, MySQL 4.1, 2 nodes and a management server.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 26 Oct 2005 Note that it is unable to handle more than about 1000rq/s. Our MySQL server (standalone MyISAM) goes up to 3000rq/s. InnoDB hangs at 1024rq/s.

Using Zope with databases

Karl Kopper karl (at) gardengrown (dot) org 10 Feb 2004 Here is the Zope database (ZODB) (http://www.zope.org/Products/ZEO/ZEOFactSheet). The "ZEO" client/server model was designed to make ZODB work in a cluster for read-mostly data. I'm using it, or about to use it, in production. Here is a quick write-up:

Zope (http://www.zope.org/) is a collection of Python programs that help display web content. Zope can run as a standalone web server or underneath Apache. Add-on features to Zope are called "products." One Zope product is called the Content Management Framework or CMF. The CMF product introduces the concept of users to the Zope content. A separate project called Plone (plone.org) later sprang up on top of the CMF. Plone and Zope let you build your own web pages using either the TAL or the deprecated DTML language that Zope interprets. Zope has connectors to external databases like Postgres and MySQL but it also comes with its own database called ZODB.

ZEO sits between the Zope scripts that you write and the ZODB. The ZEO client grabs all calls the local Zope server makes to the ZODB and tries to satisfy them using a local cache or by sending the request over the network to a ZEO server. The ZEO server then writes the data to local storage using the ZODB. Because ZEO clients talk to the ZEO server on an IP address, the ZEO server can be made highly available with Heartbeat (on a server pair outside the cluster), and each realserver in the LVS cluster can be a ZEO client. There are a few catches to this; the biggest is that all Zope servers should share an "instancehome". See the Zope web site for details.

Databases: Microsoft SQL server, tcp 1433 Malcolm Turnbull had a 3-Tier LVS with database clients running on the realserver webservers connecting back through the VIP on the director to other realservers running M$SQL. The requests never made it from the database clients to the database. Malcolm Turnbull malcolm (at) loadbalancer (dot) org 10 May 2004

Apparently if you use either a netbios name in an ASP db connection string, or if your web server has M$oft authentication turned on, IIS will use a challenge response handshake to log on to the SQL server even though you've specified a basic username and password. Anyway, it's working fine now. Here's the M$oft article (http://support.microsoft.com/default.aspx?scid=kb;EN-US;175671).

Databases: Oracle Bilia Gilad giladbi (at) rafael (dot) co (dot) il 03 Feb 2005

Hi, I configured Oracle Application Server 10g with LVS. All works fine except Oracle Portal. From the Oracle Portal cluster manual (http://www.tju.cn/docs/oas90400/core.904/b12115/networking.htm):

"The Parallel Page Engine (PPE) in Portal makes loopback connections to Oracle Application Server Web Cache for requesting page metadata information. In a default configuration, OracleAS Web Cache and the OracleAS Portal middle-tier are on the same machine and the loopback is local. When Oracle Application Server is front-ended by an LBR, all loopback requests from the PPE will start contacting OracleAS Web Cache through the LBR. Assume that the OracleAS Portal middle-tier and OracleAS Web Cache are on the same machine, or even on the same subnet. Then, without additional configuration, loopback requests result in network handshake issues during the socket connection calls."

"In order for loopbacks to happen successfully, you must set up a Network Address Translation (NAT) bounce back rule in the LBR, which essentially configures the LBR as a proxy for requests coming to it from inside the firewall. This causes the response to be sent back to the source address on the network, and then forwarded back to the client."

Does anyone know how I can make the bounce back rule in the LBR (load balancer)? Does this mean we have to work with NAT (we prefer DR)?

Con Tassios ct (at) swin (dot) edu (dot) au 03 Feb 2005 We run LVS-DR with Oracle portal and don't experience any problems. When the application server connects to the VIP the connection is internal to the server itself as each server is configured with the VIP on lo:1.

Databases: ldap, tcp/udp 389, tcp/udp 636

ldap, read only

Tim Mooney Tim (dot) Mooney (at) ndsu (dot) edu 10 Sep 2007 We've been load balancing OpenLDAP for years using LVS-DR. When clients can update LDAP, balancing becomes much more tricky. It's a pretty standard setup. The original setup was done by someone else, but openldap was the first service we used LVS for, before even http. We've been using LVS-DR with OpenLDAP for at least 5 years, probably closer to 7. For now it's only one port. Clients don't need to bind and can't retrieve anything that's sensitive, so we're only doing ldap (no ldaps). We have additional balanced services beyond LDAP, but the LDAP portion looks like:

    RemoteAddress:Port             Forward Weight ActiveConn InActConn
    TCP  vs2.ndsu.NoDak.edu:ldap lc
      -> obscured2.NoDak.edu:ldap  Route   1      16         982
      -> obscured1.NoDak.edu:ldap  Route   1      17         984

If you do an ldapsearch against our directory, you're getting our LVS-DR openldap. There's another organization co-located with the IT organization here at the university, and they've also been running LVS-DR in front of their openldap directory for nearly as long as we have. LDAP is a critical component of Hurderos, which we've been using since its inception. Hence the need for a highly-available LDAP.

There can be replication between ldap servers, like there is with mysql, but in our case we have a master repository (an Oracle database) that feeds adds/deletes/modifies directly to our two back-end LDAP servers (bypassing the LVS-DR director). The built-in replication of ldap has really matured. Once OpenLDAP 2.4 is out, I need to revisit what's possible with it.

ldap, read/write We haven't got ldap working in read/write mode with LVS yet.

Konrads (dot) Smelkovs (at) emicovero (dot) com 03 Jul 2003 I have an LVS-DR, wlc, with three realservers running openldap 2.0. I have noticed that when going through the loadbalancer the nodes do not always reply, while if issuing requests directly, I get a reply all the time. The service is pretty busy (about 1k connections at any given moment). Adding persistence does not help. If I understand it correctly, LDAP is a simple TCP service. Usually, after performing an initial connection and query it idles there and reuses the connection.

Joe Not knowing anything about ldap, I looked at some of the HOWTOs. They are all long on configuring and using ldap, but none describe the ldap protocol. From the LDAP HOWTO I see that ldap uses iterative calls to the slapd (somewhat like DNS), that can be sent to other slapds (I don't know how this would work in your case). As well, the slapd needs a 3rd tier database (e.g. dbm). If the clients are writing to the database, then you're going to have the many reader, many writer LVS problem. You're going to have to have only one copy of the database if clients are writing to the slapd.

My setup: I have a "master" LDAP server that performs all writes and it replicates to the other "slave" servers. If a write request is sent to a slave, it is referred to the master server. The client then follows this referral and makes a successful write. So in my case I have a stand-alone master, which is not load-balanced. Also, due to the specifics of the application (think authentication or such), it performs one or two exact search requests, like uid=konrads, and requests an attribute to be returned, e.g. userPassword.

presumably these are two consecutive tcpip connections? If so you'll need persistence. As to what else is missing I don't know. Just for sanity checks, you can use these machines, IPs to setup some conventional LVS'ed service eg telnet, httpd? You know the ldap realservers are working OK outside of the LVS (ie you can connect to them directly, possibly after fiddling IPs)? After that it's brute force eg tcpdump I'm afraid. If you know the protocol well enough you can debug also with .

nfs, udp 2049

It is possible with LVS to export directories from realservers to a client, making a high-capacity nfs fileserver (see performance data for single realserver LVS, near the end). This would be all fine and dandy, except that there is no way to fail-out the nfs service. The problem is that the filehandle the client is handed is derived from the location of the file on the disk and not from the filename (or directory). The filehandle is required by the NFS spec to be invariant under name change or a move to a new mount point, or to another directory. I've talked to the people who write the NFS code for Linux and they think these are all good specs and there's no way the specs are going to change. The effect this has for LVS is that if you have to failout a realserver (or shut it down for routine maintenance), the client's file presented from a different realserver will have a different filehandle and the filehandle the client has will be for a file that doesn't exist. The client will get a stale file handle, requiring a reboot. Although no-one has attempted it, a is possible.

NFS, the dawning of awareness of the fail-out problem

Matthew Enger menger (at) outblaze (dot) com 1 Sep 2000 I am looking into running two NFS servers, one as a backup, for serving lots of small text files, and am wondering what would be the best solution for doing this. Does anyone have any recommendations?

Joe LVS handles NFS just fine (see the performance document on the docs page of the lvs website). You have to handle the problem of writes yourself to keep both NFS servers in sync.

Jeremy Hansen jeremy (at) xxedgexx (dot) com 1 Sep 2000 Can nfs handle picking up a client machine in a failover situation. If my primary nfs server dies, and my secondary takes over the first, can nfs clients handle this? Something tells me nfs wouldn't be very happy in this situation.

Wayne wayne (at) compute-aid (dot) com 01 Sep 2000 It will not work for NFS. NFS works by the server handing an opaque handle to the client. The client then communicates with the server using that opaque handle. The normal NFS construction of that opaque handle involves a file system ID from that particular server, which will most likely differ from one server to another.

Joe this is a problem even for UDP NFS? (I know NFS is supposed to be stateless but isn't)

Wayne If the NFS servers are identical, there is a chance that it may work. However, if the file systems are not identical (from the file system ID point of view), it will not work, no matter if it is UDP or not. The statelessness only holds with respect to that particular server. BUT that is for the NFS implementations I have worked on before, based on Sun's original invention; it may not be true for other implementations.

Alan Cox alan (at) lxorguk (dot) ukuu (dot) org (dot) uk 1 Sep 2000 IFF your NFS servers are running some kind of duplexing protocol and handling write commits to both disk sets before acking them, then the protocol is good enough - for any sane performance you would want NFSv3. The implementations are another matter.

John Cronin jsc3 (at) havoc (dot) gtf (dot) org It works as long as the filesystems are identical. That means either readonly content dd'd to an identical disk on the failover machine, or dual ported disk storage with two hosts (or more) attached. When the failover happens, the backup system then mounts the disks and takes over. You need to be VERY sure that the primary system is NOT up and writing to the disk or you will have to go to backups after the disk gets corrupted (having two systems perform unsynchronized writes to the same filesystem is not a good idea). This is the way Sun handles it in their HA cluster. They go a step further by using Veritas Volume Manager, and Veritas has to forcibly import the disk group to the backup when a failover is done - the backup also sends a break or poweroff command to the primary via the serial terminal interface to make darn sure the primary is down. That said, I have seen three systems all mounting the same volume during some pathological testing I did on some systems at a Sun HA training course. The storage used in this situation was Sun D1000/A1000 (dual ported UltraSCSI) and Sun A5000 (Fiberchannel).

Joe Would the client get a stale file handle on realserver failure? Now that I think about it, I snooped on a thread with similar content recently. It would have been linux-ha or beowulf. I had assumed you could fail out an NFS server, since file servers like the Auspex boxes use NFS (I thought) and can failover. How they work at the low level I don't know.

Joe in principle this could be handled by generating the filehandle from the path/filename which would be the same on all systems?

Wayne It could be done like that. But if the hard drive SCSI ID is different, the system could assign a different file system ID to the file system, and you would end up with a stale handle.

failover of NFS Here we're asking one of the authors of the Linux NFS code.

Joseph Mack One of the problems with running NFS as an LVS'ed service (ie to make an LVS fileserver), that has come up on this mailing list is that a filehandle is generated from disk geometry and file location data. In general then the identical copies of the same file that are on different realservers will have different file handles. When a realserver is failed out (e.g. for maintenance) and the client is swapped over to a new machine (which he is not supposed to be able to detect), he will now have an invalid file handle. Is our understanding of the matter correct?

Dave Higgen dhiggen (at) valinux (dot) com 14 Nov 2000 In principle. The file handle actually contains a 'dev', indicating the filesystem, the inode number of the file, and a generation number used to avoid confusion if the file is deleted and the inode reused for another file. You could arrange things so that the secondary server has the same FS dev... but there is no guarantee that equivalent files will have the same inode number; (depends on order of file creation etc.) And finally the kicker is that the generation number on any given system will almost certainly be different on equivalent files, since it's created from a random seed.
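To make Dave's point concrete, here is a toy model of a file handle built from the three fields he names (dev, inode number, generation number). It is only a sketch: the real handle is opaque to the client and its byte layout is implementation-specific, so the three-integer packing, the example numbers and the `server_lookup` helper are all assumptions made for illustration.

```python
import struct
from collections import namedtuple

# The three fields Dave lists; a real handle may hold more.
FileHandle = namedtuple("FileHandle", "dev inode generation")


def pack_handle(fh):
    """Pack the fields into opaque bytes (assumed layout: 3 x uint32)."""
    return struct.pack("!III", fh.dev, fh.inode, fh.generation)


def server_lookup(handle, exported_files):
    """Return the path a server has for this handle, or None.

    None stands in for the server answering with a stale file handle.
    """
    return exported_files.get(handle)


# The "same" file on two realservers: same dev by arrangement, but a
# different inode (creation order) and generation (random seed).
foo_on_a = FileHandle(dev=0x0801, inode=12345, generation=0x5EED0001)
foo_on_b = FileHandle(dev=0x0801, inode=67890, generation=0x0BADF00D)

exports_a = {pack_handle(foo_on_a): "/export/foo"}
exports_b = {pack_handle(foo_on_b): "/export/foo"}

# The client got its handle from realserver A...
client_handle = pack_handle(foo_on_a)
```

In this model `server_lookup(client_handle, exports_a)` finds `/export/foo`, but after a failover `server_lookup(client_handle, exports_b)` returns `None`, even though realserver B holds a file with the same path and content.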

If so is it possible to generate a filehandle only on the path/name of the file say?

Well, as I explained, the file handle doesn't contain anything explicitly related to the pathname. (File handles aren't big enough for that; only 32 bytes in NFS2, up to 64 in NFS3.) Trying to change the way file handles are generated would be a MASSIVE redesign project in the NFS code, I'm afraid... In fact, you would really need some kind of "universal invariant file ID" which would have to be supported by the underlying local filesystem, so it would ramify heavily into other parts of the system too... NFS just doesn't lend itself to replication of 'live' filesystems in this manner. It was never a design consideration when it was being developed (over 15 years ago, now!) There HAVE been a number of heroic (and doomed!) efforts to do this kind of thing; for example, Auspex had a project called 'serverguard' a few years ago into which they poured millions in resources... and never got it working properly... :-( Sorry. Not the answer you were hoping for, I guess...

shared scsi solution for NFS

It seems that the code which calculates the filehandle is so entrenched in NFS that it can't be rewritten to allow disks with the same content (but not necessarily the same disk geometry) to act as failovers in NFS. The current way around this problem is for a reliable (e.g. RAID-5) disk to be on a shared scsi line. This way two machines can access the same disk. If one machine fails, then the other can supply the disk content. If the RAID-5 disk fails, then you're dead.

John Cronin jsc3 (at) havoc (dot) gtf (dot) org 08 Aug 2001

You should be able to do it with shared SCSI, in an active-passive failover configuration (one system at a time active, the second system standing by to takeover if the first one fails). Only by using something like GFS could both be active simultaneously, and I am not familiar enough with GFS to comment on how reliable it is. If you get the devices right, you can have a seamless NFS failover. If you don't, you may have to umount and remount the file systems to get rid of stale file handle problems. For SMB, users will have to reconnect in the event of a server failure, but that is still not bad.

Salvatore J. Guercio Jr sguercio (at) ccr (dot) buffalo (dot) edu 12 Jun 2003 Here is background on my LVS/NFS setup: I have 4 IBM x330 and an IBM FastT500 SAN. The project I am working on is to set up the SAN to serve 635GB worth of storage space to the Media Studies department of the University. Each x330 is connected to the SAN with GB fibre channel and 100MB to a Cisco 4006 Switch. The Switch has a GB connection to the media studies department where they have a lab of 4 Macs running MacOS X. They will mount the share and access media data on the share. Each client has a GB connection to the Cisco switch. The LVS comes in handy to increase the bandwidth of the SAN as well as to give us some redundancy. Since I have a shared SCSI solution, I will not run into the file handle problem, and I am hoping that using the IBM GPFS filesystem will help me with the lockd problem. Did you have to do anything as far as forwarding your portmap or any other ports to the realservers? Right now I am only forwarding udp port 2049.

Joe Just a sanity check here... Do you understand the problems of exporting NFS via LVS (see the HOWTO)? Are these problems going to affect you (if not, this would be useful for us to know)? I assume this setup will handle write locking etc. For my information, how is LVS going to help here? Is it increasing throughput to disk?

detecting failed NFS Don Hinshaw

Would a simple TCP_CONNECT be the right way to test NFS?

Horms If you are running NFS over TCP/IP then perhaps, but in my experience NFS deployments typically use NFS over UDP/IP. I'm suspecting a better test would be to issue some rpc calls to the NFS server and look at the response, if any. Something equivalent to what showmount can do might not be a bad start. Joe

how about exporting the directory to the director as well and doing an `ls` every so often

Malcolm Cowe malcolm_cowe (at) agilent (dot) com 7 Aug 2001 HP's MC/ServiceGuard cluster software monitors NFS using the "rpcinfo" command -- it can be quite sensitive to network congestion, but it is probably the best tool for monitoring this kind of service. The problem with listing an NFS exported directory is that when the service bombs, ls will wait for quite a while -- you don't want the director hanging because of an NFS query that's gone awry. Nathan Patrick np (at) sonic (dot) net 09 Aug 2001 Mon includes a small program which extends what "rpcinfo" can do (and shares some code!) Look for rpc.monitor.c in the mon package, available from kernel.org among other places. In a nutshell, you can check all RPC services or specific RPC services on a remote host to make sure that they respond (via the RPC null procedure). This, of course, implies that the portmapper is up and responding. From the README.rpc.monitor file:

This program is a monitor for RPC-based services such as the NFS protocol, NIS, and anything else that is based on the RPC protocol. Some general examples of RPC failures that this program can detect are: To test services, the monitor queries the portmapper for a listing of RPC programs and then optionally tests programs using the RPC null procedure. At Transmeta, we use:

Michael E Brown michael_e_brown (at) dell (dot) com 08 Aug 2001 how about rpcping? (Joe - rpcping is at nocol_xxx.tar.gz for those of us who didn't know rpcping existed.)
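The suggestions above (probe the RPC service rather than the TCP port, and never let the director hang on a dead server) can be combined in a small check script. The sketch below assumes `rpcinfo` is installed on the director and uses `rpcinfo -u <host> nfs`, which calls the null procedure of the nfs program over UDP. The 3 second timeout and the function names are arbitrary choices for illustration, not something taken from mon or MC/ServiceGuard.

```python
import subprocess
import sys


def nfs_probe_command(realserver):
    """Build an rpcinfo call that pings the nfs RPC program over UDP."""
    return ["rpcinfo", "-u", realserver, "nfs"]


def rpc_check(command, timeout_s=3.0):
    """Run a probe; True only if it exits 0 within timeout_s seconds.

    A probe that hangs (e.g. the NFS server has gone away) is killed
    and reported as a failure, so the caller never blocks for long.
    """
    try:
        result = subprocess.run(
            command,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False          # probe hung: treat the realserver as down
    except OSError:
        return False          # probe binary missing or not executable
    return result.returncode == 0


if __name__ == "__main__":
    host = sys.argv[1] if len(sys.argv) > 1 else "127.0.0.1"
    print(host, "up" if rpc_check(nfs_probe_command(host)) else "down")
```

The timeout is the point of the design: unlike an `ls` on an NFS-mounted directory, the probe is a separate process that can be killed, so a dead realserver costs the monitor a few seconds at most.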

NFS locking and persistence

Steven Lang Sep 26, 2001 The primary protocol I am interested in here is NFS. I have the director set up with DR with LC scheduling, no persistence, with UDP connections timing out after 5 seconds. I figured the time it would need to be accessing the same host would be when reading a file, so they are not all in contention for the same file, which seems to cost performance in GFS. That would all come in a series of accesses. So there is not much need to keep the traffic to the same host beyond 5 seconds.

Horms horms (at) vergenet (dot) net

I know this isn't the problem you are asking about, but I think there are some problems with your architecture. I spent far too much of last month delving into NFS - for reasons not related to load balancing - and here are some of the problems I see with your design. I hope they are useful.

As far as I can work out, you'll need the persistence to be much longer than 5s. NFS is stateless, but regardless, when a client connects to a server a small amount of state is established on both sides. The stateless aspect comes into play in that if either side times out, or a reboot is detected, then the client will attempt to reconnect to the server. If the client doesn't reconnect, but rather its packets end up on a different server because of load balancing, the server won't know anything about the client and nothing good will come of the situation. The solution to this is to ensure a client consistently talks to the same server, by setting a really long persistence.

There is also the small issue of locks. lockd should be running on the server and keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has. Of course, unless you can get the lockd on the different NFS servers to talk to each other, you are going to have a problem if different clients connected to different servers want to lock the same file. I think if you want to have any locking you're in trouble.
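The effect of the persistence timeout Horms is talking about can be seen in a simplified model of a director's persistence table. This is not the ip_vs implementation, just a sketch: round-robin stands in for the real scheduler, the clock is injectable so the behaviour can be stepped through, and the class and method names are invented for the example.

```python
import time


class PersistenceTable:
    """Toy model: remember which realserver each client IP was sent to."""

    def __init__(self, realservers, timeout_s, clock=time.monotonic):
        self.realservers = list(realservers)
        self.timeout_s = timeout_s
        self.clock = clock
        self.entries = {}      # client_ip -> (realserver, expiry time)
        self.next_index = 0    # round-robin stand-in for the scheduler

    def route(self, client_ip):
        """Return the realserver for this client, refreshing its entry."""
        now = self.clock()
        entry = self.entries.get(client_ip)
        if entry is not None and entry[1] > now:
            server = entry[0]  # entry still valid: stick to same server
        else:
            # no entry, or it expired: the scheduler picks afresh
            server = self.realservers[self.next_index % len(self.realservers)]
            self.next_index += 1
        self.entries[client_ip] = (server, now + self.timeout_s)
        return server
```

With a 5 second timeout, an NFS client that pauses for longer than 5 seconds between requests can be handed to a different realserver on its next packet, which is the situation Horms warns about. A much larger `timeout_s` keeps the client on one server across idle periods.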

I actually specifically tested for this. Now it may just be a linux thing, not common to all NFS platforms, but in my tests, when packets were sent to the server other than the one mounted, it happily serves up the requested data without even blinking. So whatever state is being saved (And I do know there is some minimal state) seems unimportant. It actually surprised me how seamlessly it all worked, as I was expecting the non-mounted server to reject the requests or something similar.

That is quite surprising, as the server should maintain state as to what clients have what mounted. There is also the small issue of locks. lockd should be running on the server and keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has.

This could indeed be an issue. Perhaps setting up persistence for locks. But I don't think it is all bad. Of course, I am basing this off several assumptions that I have not verified at the moment. I am assuming that NFS and GFS will Do The Right Thing. I am assuming the NFS lock daemon will coordinate locks with the underlying OS. I am also assuming that GFS will then coordinate the locking with the rest of the servers in the cluster. Now as I understand locking on UNIX, it is only advisory and not enforced by the kernel, the programs are supposed to police themselves. So in that case, as long as it talks to a lock daemon, and keeps talking to the same lock daemon, it should be okay, even if the lock daemon is not on the server it is talking to, right?

That should be correct; the locks are advisory. As long as the lock daemons are talking to the underlying file system (GFS) then the behaviour should be correct, regardless of which lock daemon a client talks to, as long as the client consistently talks to the same lock daemon for a given lock.

Of course, in the case of a failure, I don't know what will happen. I will definitely have to look at the whole locking thing in more detail to know if things work right or not, and maybe get the lock daemons to coordinate locking.

Given that lockd currently lives entirely in the kernel that is easier said than done.

Other Network file systems: replacements for NFS This is from the beowulf mailing list Ronald G Minnich rminnich (at) lanl (dot) gov 26 Sep 2002

NFS is dead for clusters. We are targeting three possible systems, each having a different set of advantages: panasas (http://www.panasas.com) lustre (http://www.lustre.org) v9fs (http://v9fs.sourceforge.net), from yours truly

nfs mounts: hard and soft This is not LVS exactly, but here's from a discussion on hard and soft mounts from the beowulf mailing list. Alvin Oga alvin (at) Maggie (dot) Linux-Consulting (dot) com 14 May 2003

if you export a raid subsystem via nfs... you will have problems

if you export it w/ hard mount, all machines sit and wait for the machine that went down to come back up

if you export it w/ soft mount, you can ^C out of any commands that try to access the machine that went down, and continue on your merry way as long as you didn't need its data

best way to get around NFS export problems - have 2 or 3 redundant systems of identical data ( a cluster of data ... like that of www.foo.com websites )

Greg Lindahl lindahl (at) keyresearch (dot) com 15 May 2003

There are soft and hard mounts, and there are interruptable mounts ("intr" -- check out "man nfs"). A hard mount will never time out. If you make it interruptable, then the user can choose to ^C. This is the safe option. A soft mount will time out, typically leaving you with data corruption for any file open at the time. This is the option you probably never want to use. By the way, if you use the automounter to make sure that unused NFS filesystems are not mounted, it can help quite a bit when you have a crash, depending on your usage pattern. Other potential problems include: inadequate power or airflow, high input air temperature, old BIOS on the 3ware card, old Linux kernel, etc.

linux nfs in general

Benjamin Lee ben (at) au (dot) atlas-associates (dot) com I am wondering what (currently) people consider a production quality platform to run an NFS server on? I am thinking maybe a BSD although the kernel NFS code in Linux is much more stable now, I've heard. The idea is to put together a RAID boxen which will serve web pages and mail spool via NFS. (Don't worry, I'm only mounting the mail spool once. ;-) It's not a large enterprise system. )
Ted Pavlic tpavlic (at) netwalk (dot) com 19 Sep 2000 While Linux does not provide very good NFS support and generally has problems with things like locking (major problems) and quotas, don't rule it out. Personally I have a system now which is very similar to the system you sound like you want to build. Until recently, it was a Linux server with an external RAID serving both mailspool and web pages to four realservers (and a couple of other servers -- application, news, etc.). Right now (for various reasons, most having to do with how nightly backups are handled) I have two Linux servers, both with external RAIDs. One handles mailspool, the other handles web pages. Both are configured so that if and when one Linux server dies completely, the other can pick up the other RAID and serve both RAIDs again. There will be some manual interaction, of course, but I have NOCs 24/7 and hopefully such a problem would occur when a tech was available to handle the switch. Such a system seems to work fine for me. (It's worked for a long time.) It's easy for me to administrate because everything's Linux. I run into problems here or there when a program has trouble locking and requires a local share, but I get around those problems... and overall it's not that bad.

stale file handles

Joe 16 Dec 2005 The client gets a stale file handle when the client has a file||directory open and the server stops serving the file||directory. This error is part of the protocol (i.e. it's not a bug or a problem, it's supposed to happen). The stale file handle doesn't happen at the server end (the server doesn't care if the client is totally hosed). df and mount will hang (or possibly return after a long timeout). The error goes away when the server comes back.

If the servers went down and the clients (realservers) stayed up, all with open file handles, the clients just have to wait till the servers come up again. The stale file handle will mess up the client till the server comes back. Since foo is on the same piece of disk real-estate, foo comes back with the same file handle when the server reappears.

An irrecoverable example: In this example, when /home/user/foo is recreated, it's on a new piece of disk real-estate and will have a different file handle. The client is hung and you can't umount /home/user (maybe you can with umount -f). If you can't umount /home/user, you will have to reboot the client.
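On the client, a stale handle shows up to programs as the error code ESTALE ("Stale file handle"). A script running on a realserver could tell this apart from an ordinary missing file roughly as follows. This is a sketch under assumptions: the `classify_path` name and its return strings are invented here, and on a hard mount with the server down the `stat` call may simply block rather than return an error, as described above for df and mount.

```python
import errno
import os


def classify_path(path, stat=os.stat):
    """Return "ok", "stale", "missing" or "error" for a path.

    The stat function is a parameter only so that the stale case can
    be exercised without a real NFS mount.
    """
    try:
        stat(path)
    except OSError as err:
        if err.errno == errno.ESTALE:
            return "stale"      # handle no longer valid on the server
        if err.errno == errno.ENOENT:
            return "missing"    # ordinary "no such file or directory"
        return "error"
    return "ok"
```

A monitoring script could use the "stale" result as the trigger for the umount/remount (or, failing that, reboot) described above.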

read-only NFS LVS Joseph L. Kaiser 7 Mar 2006

I have been tasked to mount a read-only NFS mounted software area to 500+ nodes. I wanted to do this with NFS and LVS, but after reading all the howtos with regard to NFS and LVS and reading all the email with regard to this in the archives (twice), it seems clear to me that this doesn't work.

Joe I haven't done any of this, I've just talked to people - so my knowledge is only theoretical. (ro) is a different kettle of fish. If you use identical disks (you only need identical geometry, so you could in principle use different model disks) and make the disks bitwise identical (e.g. made with dd), then clients will always get the same filehandle for the same file, no matter which realserver they connect to. For disk failure, make sure you have a few dd copies spare. I assume (but don't know) you won't be able to use RAID etc because of problems keeping the individual disks identical on the different realservers. You'll only be able to use single disks. If you have to update files, then you'll have to do it bitwise. I don't have any ideas on how to do this off the top of my head. Perhaps you could have two disks, one mounted and the other umounted, but having received the new dd copy. You would change the weight to 0 with ipvsadm, let the connections expire (about 3 mins), umount the current disk and then mount the new disk. Presumably you'll have some period in the changeover where the realservers are presenting different files. Presumably for speed, you should have as much memory in the realservers as possible, so that recently accessed files are still in memory.

However, I have a boss, and he wanted me to ask if turning off attribute caching (the noac mount option) would help the reliability of this service.

If the disks are (ro) then the attributes on the server will never change, in which case you want the client to cache the attributes forever.

He has seen with another NFS mounted filesystem that using the "noac" turns off caching and clients that sometimes get "Stale NFS file handle" will reread the file and succeed.

Is having the client not reboot only "sometimes" an acceptable spec? With "noac" will it now pass the tests in ?

DRBD DRBD is not a service in the network sense, but it is used as a file server. Neamtu Dan dlxneamtu(at)yahoo(dot)com

I have a test system made up of a director, 3 servers running ftp and http, and 2 storage computers from which the servers take their data. If I were to stick to one storage computer there would be no problem, but I want redundancy. So I've set up a working heartbeat on both the storage computers, but when the primary fails the servers can no longer use the data unless the servers umount and mount again on the Virtual IP address the heartbeat uses (I have tried mounting with nfs and samba so far). As I understand it, these 2 mounting methods will not work in case of failover because of different disk geometry. Do you know what I should do for nfs or samba to work in case of failover (at least for read only, if read-write on the storage is not possible)?

mike mike503 (at) gmail (dot) com 1 Apr 2006 DRBD+NFS works for me, pretty well too. check out linux-ha.org for HaNFS. Martijn Grendelman martijn (at) pocos (dot) nl 21 Mar 2007 I have recently set up a two-node cluster, both servers configured identically, both handling HTTP, HTTPS and FTP connections over LVS. Both machines are capable of playing the role of LVS director, but only one is active at once. Monitoring of real servers is done with Mon. Files are on a DRBD device, which is exported over NFS. Failover of LVS (VIP + rules), DRBD, NFS and MySQL (also on DRBD) is handled by Heartbeat. Works like a charm!! The information on http://www.linux-ha.org/DRBD/NFS was extremely useful.

LVS: Services: multi-port

Introduction While all single-port services use the same scheme (server listens, client connects), multi-port services each have their own scheme (ftp has two schemes, active and passive). For multi-port services, the initial connection is the standard single-port connection, but the setup of the 2nd (or more) port occurs through information sent in the payload of the connection to the first port. The director does not inspect the payload of packets and has no information about the subsequent connection(s) that the client and realserver are attempting to set up.

Approaches used to load balance multi-port services are

Use persistence to all ports at once on the realserver. (Persistence can also be set for a single port, but this is not used here.) This is a brute force approach. Once the initial connection is made from the client to the first port on the realserver, then any packet from any port on the client is forwarded to any port that the client requests on the realserver. This has been the approach historically used for ftp on LVS-DR or LVS-Tun. While it works, it is not secure, since any packets are allowed between the client and the LVS, and not just the packets required for the ftp transfer. For ftp, where no state is maintained on the realserver and where idle timeouts are just a matter of the client reconnecting, persistence is a satisfactory solution for LVS/ftp. It would be nice if we could do better than this, but currently this is the state of the art for LVS with ftp.

Use other code to inspect the payload of packets that are passed in the first port opened. Since this code must talk to ip_vs, it must run on the director. All packets in the first connection must then pass through the director, and so this approach will only work for LVS-NAT (or LVS-DR with Julian's forward shared patch). (For LVS-DR and LVS-Tun, the packets returning to the client from the realservers go directly to the client and not through the director.)
Code which inspects packets passing through the director to aid setup of other ports includes

Helper modules: ftp is the only multi-port service for which a helper module has been written (see the ftp helper modules section below).

fwmark: for ftp this requires the conntrack module ip_conntrack_ftp to look for packets which are RELATED.

listening on ports 80 and 443: This is not a multi-port tcpip protocol. A multi-port tcpip protocol requires one demon running on the realserver sending packets on two ports. For an e-commerce site, the connections are independent at the tcpip level and are serviced by different demons. For LVS, it is convenient to think of an e-commerce site as multi-port because, following the initial connection to port 80, you want the client's subsequent connection to 443 to go to the same realserver. This is handled by persistence or by persistent fwmark.

ftp general, active tcp 20,21; passive 21,high_port ftp is a 2 port service in both active and passive modes. For a description of active and passive ftp see Active FTP vs. Passive FTP, a Definitive Explanation on Slacksite. Also see RFC 1579 for passive ftp and RFC 959 for ftp (where ftp is referred to as just "ftp", but with the arrival of passive ftp, is now called "active ftp"). The usual resource for this sort of information, "TCP/IP Illustrated Vol 1", by W. Richard Stevens (Chapter 27 on FTP), only discusses what is now called active FTP. Useful links (from Ratz 30 Nov 2003) http://www.ssh.com/support/documentation/online/ssh/winhelp/32/Forwarding_FTP.html Forwarding ftp. (port forwarded ftp is not the same as or ). Because of the problems securing ftp, Ratz suggests that you use a single ftp server that is not part of your LVS and secure it separately.

ftp helper modules: ip_vs_ftp/ip_masq_ftp The ip_vs build produces the modules ip_masq_ftp (2.2.x) or ip_vs_ftp (2.4.x and later, written as a netfilter module). The ip_masq_ftp module is a patched version of the file which allowed ftp through a NAT box. The patch stopped the original function (at least in early versions of LVS) and is probably why it has a new name in 2.4.x kernels. The ip_vs_ftp module will autoload (Nov 2003) when ipvsadm is invoked - check that the module is loaded by running lsmod. The ipvs ftp helper module needed for LVS-NAT has resulted in a disproportionate number of problems on the LVS mailing list (presumably this will continue). In Dec 2006, Eric Robinson eric (dot) robinson (at) pmcipa (dot) com was the unwitting guinea pig in straightening some of this out. Problems include:

Few people are using LVS-NAT with ftp, so we wouldn't hear of any problems even if the helper module was completely broken. When we hear of a problem we don't know whether to believe the poster, since we haven't heard of a problem with ftp for ... oh you know, years.

Bugs have affected other services, and we get reports of problems with (say) http when no-one is using the LVS'ed ftp service and the poster doesn't tell us that they have LVS'ed ftp (and doesn't realise that it's relevant).

Different ftpd demons give different responses to calls from the client and listen on different ports. Unless you take appropriate action, ftp demons listening on non-standard ports stop working when put behind an LVS director.

The docs and functionality were out of step for quite a while. Tony Clarke sam (at) palamon (dot) ie found (Sep 2002) that the ftp helper module ip_masq_ftp had not been patched for LVS for 2.2.19 at least a year after its release. I was testing ftp with its default settings (without being terribly aware that I was using active ftp) and found that I didn't need the helper module. It took at least a year before anyone else (Wensong 17 Sep 2002) would agree with me.
The conventional wisdom from 2002-2006 was that the ftp helper module wasn't needed for active ftp. I thought the helper function for active ftp must have been in ip_vs. A possible explanation is Mark de Vries' comment immediately below, although not having the setup around any more I don't know for sure.

Mark de Vries markdv (dot) lvsuser (at) asphyx (dot) net 23 Dec 2006. ftp-clients don't care which IP the connection originates from.

Joe - the ftp-data connection then would originate on the RIP, rather than the VIP. With the ftp helper, the ftp-data connections would be nat'ed to src_addr=VIP. In my test setup, with no ftp helper and two private networks (which I routed locally), the packets src_addr=RIP:ftp-data would have been routed directly through the director to the CIP. Complicating matters, I don't remember whether the ftpd was listening to the VIP or 0.0.0.0.

Mark de Vries markdv (dot) lvsuser (at) asphyx (dot) net 23 Dec 2006 So that would suggest (to me) that you do need the ip_vs_ftp helper module, to do the src address translation in the active connection from server to client.

Horms 27 Dec 2006 I just skimmed through the code, and the helper seems to listen for both the PASV and PORT command. My FTP knowledge is a bit rusty, but I think the latter is for non-passive ftp, so yes it seems to be needed for both. The auto-loading is just a hack for the convenience of most people. Basically, in recent versions of ipvsadm, if you're setting up a virtual service on port 21, it guesses that there is a good chance that it is ftp and tries to load ip_vs_ftp. The ftp helper auto-load went in on 9 Oct 2003 - look at the date of your ipvsadm (due to a release procedure that is beyond my control, it seems that ipvsadm has been released multiple times with the version number of 1.24. Indeed, the version only seems to denote that it is the ipvsadm that works with the 2.6 kernels, or perhaps a revision of the ABI, rather than a release of the utility itself. Grrr. - i.e. the version number doesn't mean anything.) If you are using a port other than 21, then you will need to set the ports argument to the module when it is loaded. The default is 21. You can have up to IP_VS_APP_MAX_PORTS (8) ports. They are comma delimited. If the ftp helper module doesn't load, maybe you have an old version of ipvsadm? ftp is running on a port other than 21? The module couldn't be found by modprobe for some reason?

Eric: with the ftp helper loaded, the ftp-data packets arriving at the client have src_addr=VIP (the expected behaviour).

Joe - The 2.2.x ftp module is only available as a module (i.e. it can't be built into the kernel). Juri Haberland juri (at) koschikode (dot) com 30 Apr 2001

AFAIK the IP_MASQ_* parts can only be built as modules. They are automagically selected if you select CONFIG_IP_MASQUERADE.

Julian Anastasov May 01, 2001

Starting from 2.2.19 the following module parameter is required: Joe
I don't see this mentioned in /usr/src/linux/Documentation, ipvs-1.0.7-2.2.19/Changelog, google or dejanews. Is this an ip_vs feature or is it a new kernel feature?
ratz
I see info only in the source. This is a new 2.2.19 feature. It's /usr/src/linux/net/ipv4/ip_masq_ftp.c: And it is a new kernel feature, not LVS feature.

what are these modules for: from ipvsadm(8) (ipvs 0.2.11)

If a virtual service is to handle FTP connections then persistence must be set for the virtual service if Direct Routing or Tunnelling is used as the forwarding mechanism. If Masquerading is used in conjunction with an FTP service then persistence is not necessary, but the ip_vs_ftp kernel module must be used. This module may be manually inserted into the kernel using insmod(8).

The modules are NOT used for LVS-DR or LVS-Tun: in these cases persistence is used (or fwmarks version of persistence). Joe 23 May 2001: I run these rules on the director (without the ftp module) and ftp works fine Julian - these rules are risky. What happens with ICMP? It is not masqueraded. I hope there is a similar rule for ICMP. Joe Dec 2006 - We're a little more careful nat'ing out clients running on the realservers now. We'd at least make sure the packets came out with src_addr=VIP. Stephane Klein

I've tried to use your example to set up active and passive FTP. I can authenticate, but I can't list or send data. I can see a packet in the conntrack file with dport=20, but the ftp server's connection stays in SYN_SENT and gets no reply. ip_vs_ftp is loaded as a module; ip_nat_ftp and ip_conntrack_ftp are in the kernel. I used the iptables rules from your example in the HOWTO. I saw this article where you said it's necessary to patch the kernel to work with ip_nat_ftp (http://www.in-addr.de/pipermail/lvs-users/2004-June/011955.html). That patch is for kernel 2.6.5. Is this patch included in your NFCT patch or is it necessary to apply this patch?

Julian 29 Aug 2004 Yes, it is needed if you are loading ip_nat_ftp. I didn't receive any replies from the netfilter coreteam about this patch, so I just linked it to the web site: ip_nat_ftp-2.6.5-1.diff

There are problems with the helper module approach for ftp, since there is no agreement amongst ftpd code authors about the responses given. To help passive ftp, the ip_vs_ftp module looks for the response from the ftpd. Postings to the LVS mailing list (starting with a posting by Tom Cronin on LVS-NAT ftp) show that this response is not universal for ftpds. As well, Rutger van Oosten found that for passive ftp the ftpd must be set to listen on the correct IP.

For active ftp, the helper module expects ftp-data=20 (problems with vsftp) Mark de Vries found that his ftp LVS-NAT didn't work, the reason being that the ftp helper module wasn't forwarding the reply packets from the ftp-data port (usually port 20). On further exploration, Mark found that the ftpd (the GPL'ed vsftp) wasn't using the standard ftp-data port, but was using a high (>1024) port, thus allowing the ftpd to run with lower privileges (vsftp can be set up to run with the standard ftp-data port). Currently the ftp helper expects ftp-data=20. We're working on a fix for this. Here's the discussion so far.

Mark de Vries markdv (dot) lvsuser (at) asphyx (dot) net 25 Nov 2005 Problem found... The thing is that ip_vs(_ftp) seems to assume that the ftp-data connection will be initiated from port 20. Seems like a valid assumption... But unfortunately this is not always the case... the vsftpd I was testing with was configured with "connect_from_port_20=NO" by default. Once I switched to "=YES" active FTP worked fine. Otherwise I just used some SNAT rules on the director. So.... Now the question is: is this a vsftpd 'problem'? MUST ftp-data connections originate from port 20? Or should this assumption be relaxed? Apparently the iptables conntrack_ftp module does not assume it; connections from ports other than 20 are considered "RELATED". (I have not checked the src or debugged anything, I just observed that this type of connection is indeed matched by a "RELATED" rule in my own iptables setup.) I don't think adding an option --data-port="some_number" to the ftp helper would get us anywhere - the src port is not always the same. vsftpd (probably) just connects without binding to a specific port, just getting a random one in the ip_local_port_range... Is there anything against not matching on the src port like the ip_conntrack(_ftp) stuff, i.e. matching/finding the source port on the fly? vsftp has passive ftp (pasv_enable = YES).
A lot of clients will default to passive mode or fall back to it if active does not seem to be working, which is probably the main reason I've had relatively few complaints about active ftp not working. As far as I understand, the RFC leaves no room for a different src port for the data connection. It's not fixed at 20 but should be 1 below the control port, which is what ip_vs uses literally IIRC. ip_vs_ftp and ip_conntrack_ftp do much the same thing. The only difference is that in iptables you need an explicit rule to handle the connection entries created, whereas in ipvs they are always used. The real difference is only in the details of the connection entry they create. In ipvs there is the assumption/requirement that the connection will originate from port 20 (assuming the ftpd is listening on port 21). The ip_conntrack_ftp module (apparently) does not make this assumption. Taking the RFC as a guide, the assumption is of course valid.
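The two behaviours discussed above can be contrasted in a few lines of Python. This is only a model of the discussion, not the kernel code: the function names are made up for illustration, the strict check stands in for the ip_vs_ftp assumption that the data connection comes from the port one below the control port, and the lenient check stands in for the conntrack-style behaviour that Mark observed (any source port accepted).

```python
def default_data_port(control_port: int = 21) -> int:
    """Server-side default data port: the port adjacent to the control port (L-1)."""
    return control_port - 1


def strict_match(src_port: int, control_port: int = 21) -> bool:
    """Model of the assumption described above: data must come from control_port - 1."""
    return src_port == default_data_port(control_port)


def lenient_match(src_port: int, control_port: int = 21) -> bool:
    """Model of the behaviour Mark observed: any valid source port is accepted."""
    return 0 < src_port < 65536


if __name__ == "__main__":
    # An ftpd that connects from port 20 satisfies both checks.
    print(strict_match(20), lenient_match(20))
    # An ftpd that connects from an arbitrary high port (40123 is an example
    # value only) fails the strict check but passes the lenient one.
    print(strict_match(40123), lenient_match(40123))
```

With an ftpd on a non-standard control port such as 2121, the strict model expects the data connection from 2120.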

Graeme Fowler's checklist for ftp Graeme Fowler graeme (at) graemef (dot) net 23 Aug 2006 Ensure the LVS FTP helper is loaded. Make sure that you define (or make a note of) the range of ports your FTP server uses for data connections (this varies from server to server). Ensure that you will accept traffic to those ports on your director. If the packets are rejected by netfilter/iptables on the director, the FTP helper never sees them so the connections will almost never work.

LVS-NAT, 2.2.x director I found that ftp worked just fine without the module for 2.2.x (1.0.3-2.2.18 kernel). (see discussion following Mark de Vries comments in the ftp helper section above for a possible explanation.)

LVS-NAT, 2.4.x director For 2.4.x you can connect with ftp without any extra modules, but you can't "ls" the contents of the ftp directory. For that you need to load the ip_vs_ftp module. Without this module, your client's screen won't lock up, it just does nothing. If you then load the module, you can list the contents of the directory.

LVS-DR, LVS-Tun For LVS-DR, LVS-Tun active ftp needs persistence. Otherwise it does not work, with or without ip_masq_ftp loaded. You can login, but attempting to do a `ls` will lockup the client screen. Checking the realserver, shows connections on ports 20,21 to paired ports on the client.

ftp (active) - the classic command line ftp This is a 2 port service.

port 20 calls - data (files transferred in either direction, and the output of the listing from the ls command)

port 21 listens - commands (e.g. user, pass, ls)

Here's part of my /etc/services

To set up ftp with LVS, you schedule only port 21 for forwarding. While the realserver is listening on port 21, it calls the client from port 20 (i.e. it's not listening on port 20) rather than the client calling the realserver (through the director). You do not add entries for port 20 with ipvsadm. Port 20 is handled by persistence for LVS-DR and LVS-Tun. For active ftp with LVS-NAT, you don't need the ipvs ftp helper module (the ftp helper module is only needed for passive ftp, Wensong 17 Sep 2002) (however see ftp helper module).

session: active ftp (no LVS) Here's a standard non-LVS active ftp session using phatcat. The ftp "client" machine (192.168.1.254) connects to the ftp server machine "sneezy" (192.168.1.11). Since two ports are involved, phatcat is run from two windows, xterm_1, xterm_2.

xterm_1:

's unimplemented).
USER PORT STOR MSAM* RNTO NLST MKD CDUP
PASS PASV APPE MRSQ* ABOR SITE XMKD XCUP
ACCT* TYPE MLFL* MRCP* DELE SYST RMD STOU
SMNT* STRU MAIL* ALLO CWD STAT XRMD SIZE
REIN* MODE MSND* REST XCWD HELP PWD MDTM
QUIT RETR MSOM* RNFR LIST NOOP XPWD
214 Direct comments to ftp-bugs@sneezy.mack.net.
user ftp
331 Guest login ok, send your complete e-mail address as password.
pass mack
230 Guest login ok, access restrictions apply.

On the client, use netstat -an to find the highest unprivileged port in use (in this case port 1029).

xterm_2: tell the client to listen on the first unused port (here 1030).

xterm_1: tell the ftpserver to connect to client:1030 (192,168,1,254,4,6) (1030 = 256 x 4 + 6), and then list the contents of the directory.

xterm_2: receives the output of list. The ftpserver then closes the data connection from port 20 (i.e. you can't do a second listing over it).

xterm_1:

xterm_2: on the ftp client, initiate another listener (on the next unused port).

xterm_1: tell the ftp server to connect to client:1033 (1033 = 256 x 4 + 9), prepare for upload of an ascii file (type a), check the size of the file (size welcome.msg) about to be downloaded, then retrieve it (retr welcome.msg). (The ftp server will then close the connection from port 20.)

xterm_2: watch welcome.msg being delivered.

xterm_1: say goodbye (the data connection has closed, so you can't list using the same connection).
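The port arithmetic in the session above (1030 = 256 x 4 + 6, 1033 = 256 x 4 + 9) is how the PORT command encodes an address: four bytes of IP followed by the high and low bytes of the port. A small Python sketch of the encoding and decoding follows; the function names are made up for illustration.

```python
from typing import Tuple


def encode_port_arg(ip: str, port: int) -> str:
    """Build the h1,h2,h3,h4,p1,p2 argument of an ftp PORT command."""
    if not 0 < port < 65536:
        raise ValueError("port out of range")
    octets = ip.split(".")
    if len(octets) != 4 or not all(o.isdigit() and int(o) <= 255 for o in octets):
        raise ValueError("not a dotted-quad IPv4 address")
    high, low = divmod(port, 256)  # port = 256 * high + low
    return ",".join(octets + [str(high), str(low)])


def decode_port_arg(arg: str) -> Tuple[str, int]:
    """Turn h1,h2,h3,h4,p1,p2 back into (ip, port)."""
    fields = [int(f) for f in arg.split(",")]
    if len(fields) != 6 or not all(0 <= f <= 255 for f in fields):
        raise ValueError("expected six comma separated byte values")
    ip = ".".join(str(f) for f in fields[:4])
    return ip, fields[4] * 256 + fields[5]


if __name__ == "__main__":
    # The two listeners used in the session above.
    print("port " + encode_port_arg("192.168.1.254", 1030))
    print(decode_port_arg("192,168,1,254,4,9"))
```

Each of the six fields is a single byte, so neither port field can exceed 255.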

session: active ftp, one network LVS-DR with no persistence (this is NOT going to work) The example illustrates what happens with active ftp on LVS-DR without persistence (it is not going to work). Set up a working one network LVS-DR (i.e. all IPs are in the same network) and add rules to forward ftp. Here you are only running commands to forward port 21. You have not handled the data port 20 in any way.

RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP lvs.mack.net:ftp rr
-> sneezy.mack.net:ftp Route 1 0 0
-> bashfull.mack.net:ftp Route 1 0 0

Use phatcat (as above) to attempt to set up an ftp session with the VIP.

xterm_1: connect to VIP:ftp. With netstat -an on the realserver, note that the client is connected to VIP:21, not to RIP:21.

xterm_2: listen on the next available port.

xterm_1: tell the realserver to connect to client:1036, and then list the contents of /home/ftp. (The connection hangs for a while - eventually you'll get the 425 message.)

On the realserver, netstat -an shows

On the client, netstat -an shows that the client is listening, but not connecting. If you run tcpdump on the realserver when you run the list command, you'll see that the realserver is sending SYN packets from VIP:20->client:1036 but not receiving any replies. The problem is that the ACK from the client is sent to VIP:20, which is routed to the director, which has no forwarding rules for VIP:20. Even if the director had forwarding rules for VIP:20, it requires the first packet in a connection to be a SYN, to start the process of making an entry in the ipvsadm table for packets to port 20. Thus the director will reject the ACK from the client to VIP:20 and no connection will be made.

session: active ftp, one network LVS-DR with persistence This is the normal method of setting up LVS-DR for ftp.

RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP lvs.mack.net:ftp rr persistent 600
-> sneezy.mack.net:ftp Route 1 0 0
-> bashfull.mack.net:ftp Route 1 0 0
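The difference between the two LVS-DR sessions can be illustrated with a toy model of the director's tables. This is a sketch under loud assumptions and is not the ip_vs code: the class and method names are invented, a persistence template is keyed only on the client IP and matches any port (the "persistence to all ports" approach), scheduling is plain round robin, and timeouts are ignored.

```python
class ToyDirector:
    """Toy model: forwards packets for one VIP with an ftp (port 21) service."""

    def __init__(self, realservers, persistent=False):
        self.realservers = list(realservers)
        self.persistent = persistent
        self.connections = {}  # (client_ip, client_port, vport) -> realserver
        self.templates = {}    # client_ip -> realserver (persistence only)
        self._next = 0

    def _schedule(self):
        # Round robin over the realservers.
        rs = self.realservers[self._next % len(self.realservers)]
        self._next += 1
        return rs

    def handle(self, client_ip, client_port, vport, syn_only):
        """Return the realserver the packet is forwarded to, or None if dropped."""
        key = (client_ip, client_port, vport)
        if key in self.connections:
            return self.connections[key]
        if self.persistent and client_ip in self.templates:
            # Any packet from a known client follows the template, whatever the port.
            self.connections[key] = self.templates[client_ip]
            return self.connections[key]
        if vport == 21 and syn_only:
            # Only a SYN to the scheduled port starts a new connection.
            rs = self._schedule()
            self.connections[key] = rs
            if self.persistent:
                self.templates[client_ip] = rs
            return rs
        return None


if __name__ == "__main__":
    for persistent in (False, True):
        d = ToyDirector(["sneezy", "bashfull"], persistent=persistent)
        control = d.handle("client", 1035, 21, syn_only=True)
        # Reply from client:1036 to the realserver's SYN from VIP:20.
        data = d.handle("client", 1036, 20, syn_only=False)
        print("persistent" if persistent else "no persistence", control, data)
```

In the model, the client's reply to VIP:20 is dropped without persistence and reaches the same realserver as the control connection with persistence, which matches the behaviour described for the two sessions.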

ftp (passive) Passive ftp is used by netscape to get files from an ftp url like ftp://ftp.domain.com/pub/ . Here's an explanation of passive ftp from http://www.tm.net.my/learning/technotes/960513-36.html

If you can't open connections from Netscape Navigator through a firewall to ftp servers outside your site, then try configuring the firewall to allow outgoing connections on high-numbered ports. Usually, ftp'ing involves opening a connection to an ftp server and then accepting a connection from the ftp server back to your computer on a randomly-chosen high-numbered telnet port. The connection from your computer is called the "control" connection, and the one from the ftp server is known as the "data" connection. All commands you send and the ftp server's responses to those commands will go over the control connection, but any data sent back (such as "ls" directory lists or actual file data in either direction) will go over the data connection. However, this approach usually doesn't work through a firewall, which typically doesn't let any connections come in at all. In this case you might see your ftp connection appear to work, but then as soon as you do an "ls" or a "dir" or a "get", the connection will appear to hang. Netscape Navigator uses a different method, known as "PASV" ("passive ftp"), to retrieve files from an ftp site. This means it opens a control connection to the ftp server, tells the ftp server to expect a second connection, then opens the data connection to the ftp server itself on a randomly-chosen high-numbered port. This works with most firewalls, unless your firewall restricts outgoing connections on high-numbered ports too, in which case you're out of luck (and you should tell your sysadmins about this). "Passive FTP" is described as part of the ftp protocol specification in RFC 959 ("http://www.cis.ohio-state.edu/htbin/rfc/rfc959.html").

If you are setting up an LVS ftp farm, it is likely that users will retrieve files with a browser and you will need to set up the LVS to handle passive ftp. You will need the ftp helper module or (also see on the LVS website under documentation; persistence handling in LVS) or fwmark persistent connection for ftp. For passive ftp, the ftpd sets up a listener on a high port for the data transfer. The problem for LVS is that the IP for the listener is the RIP and not the VIP. Wenzhuo Zhang 1 May 2001

I've been using 2.2.19 on my dialup masquerading box for quite some time. It doesn't seem to me that the option is required, whether in PASV or PORT mode. We can actually get ftp to work in NAT mode without using the ip_masq_ftp module. The trick is to tell the real ftp servers to use the VIP as the passive address for connections from outside; e.g. in wu-ftpd, add the following lines to the /etc/ftpaccess:

passive address 127.0.0.1 127.0.0.0/8
passive address VIP 0.0.0.0/0

Of course, the ftp virtual service has to be persistent port 0.
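The effect Wenzhuo describes, a PASV reply that advertises the VIP rather than the RIP, can be shown on the reply string itself. The Python sketch below only substitutes the address and keeps the port. It illustrates the idea and is not how wu-ftpd or the helper module implement it; the function name and the VIP value are made up for the example.

```python
import re

# Six comma separated byte values in parentheses: (h1,h2,h3,h4,p1,p2)
_PASV_TUPLE = re.compile(r"\((\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})\)")


def advertise_vip(reply: str, vip: str) -> str:
    """Replace the IP in a 227 reply with the VIP, leaving the port untouched."""
    match = _PASV_TUPLE.search(reply)
    if not reply.startswith("227") or match is None:
        return reply  # not a PASV reply, pass it through unchanged
    p1, p2 = match.group(5), match.group(6)
    rewritten = "(" + ",".join(vip.split(".") + [p1, p2]) + ")"
    return reply[:match.start()] + rewritten + reply[match.end():]


if __name__ == "__main__":
    # 192.168.1.11 plays the RIP; 192.168.1.110 is an example VIP only.
    print(advertise_vip("227 Entering Passive Mode (192,168,1,11,4,9)", "192.168.1.110"))
```

Whichever side does the substitution, the client's data connection to VIP:port still has to reach the same realserver, which is why the virtual service has to be persistent.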

Alois Treindl, 3 May 2001

I found (with kernel 2.2.19) that I needed the command so that (passive mode) ftp from Netscape would work. without the in_ports=21 it did not work.

Julian Anastasov ja (at) ssi (dot) bg 03 May 2001 Yes, it seems this option is not useful for the active FTP transfers because if the data connection is not created while the client's PORT command is detected in the command stream, then it is created later when the internal realserver creates normal in->out connection to the client. So, it is not a fatal problem for active FTP to avoid this option. The only problem is that these two connections are independent and the command connection can die before the data connection, for long transfers. With the in_ports option used this can not happen. Joe - in previous HOWTOs I had a comment from Julian saying that the ftp helper was "recommended" for active ftp (presumably not required). Presumably this is what he's talking about. The fatal problems come for the passive transfers when the data connection from the client must hit the LVS service. For this, the ip_masq_ftp module must detect the 227 response from the realserver in the in->out packets and to open a hole for the client's data connection. And the "good" news is that this works only with in_ports/in_mark options used. Alois

on option so that I could configure on the server that it gives the VIP to clients making a PASV request; it always gives the realserver IP address in replies to such requests.

Bad ftpd :) It seems the following rules are valid: active ftp always works through stupid balancers (for external clients) that have minimum support for masquerading, with some drops in the command connection; passive ftp always works through stupid masq boxes (for internal clients). The passive ftp setup is useful because the data connection can be marked as a slave to the command connection and in this way avoid connection reconnects.

passive ftp client/server mismatch with LVS-NAT Jeremy Kusnetz:

although Julian says that all you need for ftp with LVS-NAT is the ip_masq_ftp module, it doesn't work for me (director 2.2.19-1.0.7 with ip_masq_ftp in_ports=21) my ftp client just hangs.

Julian The Netfilter guys use another approach when detecting the 227 message in Linux 2.4, i.e. they try to ignore the message and to use only the code (I'm not sure what is the final status of this handling there). But in Linux 2.2 the word "Entering" may be a requirement :( You have to select another FTPd, IMO. Jeremy Kusnetz JKusnetz (at) nrtc (dot) org 24 May 2001

It was my ftp server. When going into passive mode it said: instead of:
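Julian's point above, that a helper may insist on the word "Entering" while a more tolerant parser looks only at the 227 code, can be illustrated with two regular expressions in Python. This is a sketch of the idea, not the kernel's matching code. The second reply string is hypothetical and is not the reply Jeremy's server actually sent.

```python
import re
from typing import Optional, Pattern, Tuple

_TUPLE = r"(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"

# Strict: the exact conventional wording is required.
STRICT = re.compile(r"^227 Entering Passive Mode \(" + _TUPLE + r"\)")
# Lenient: only the 227 code and the six numbers matter.
LENIENT = re.compile(r"^227 .*?" + _TUPLE)


def parse_pasv(reply: str, pattern: Pattern[str]) -> Optional[Tuple[str, int]]:
    """Return (ip, port) from a PASV reply, or None if the pattern rejects it."""
    match = pattern.match(reply)
    if match is None:
        return None
    fields = [int(g) for g in match.groups()]
    if any(f > 255 for f in fields):
        return None
    return ".".join(str(f) for f in fields[:4]), fields[4] * 256 + fields[5]


if __name__ == "__main__":
    conventional = "227 Entering Passive Mode (192,168,1,11,4,9)"
    unusual = "227 Passive mode on (192,168,1,11,4,9)"  # hypothetical wording
    for reply in (conventional, unusual):
        print(reply, parse_pasv(reply, STRICT), parse_pasv(reply, LENIENT))
```

A helper that behaves like the strict pattern never learns the data port from the unusual reply, so no hole is opened for the client's data connection and the client hangs, as Jeremy saw.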

ftp helper bug(s) In early 2005 Johan van den Berg and Simon Schwendemann sent a report of a problem with LVS-NAT (2.4.x) where the ACK reply to a SYN would not be source-NAT'ed and so would emerge with src_addr=RIP and not src_addr=VIP (http://archive.linuxvirtualserver.org/html/lvs-users/2005-02/msg00299.html). Johan van den Berg switched to using LVS-DR. Even to figure this out took a while. Initially only one in 60 or so SYNs would have the problem. No-one had any idea what the problem was and cries for help were greeted by silence. Then Jari Takkala Jari (dot) Takkala (at) Q9 (dot) com 15 Aug 2005 found it only occurred when the LVS-NAT was forwarding ftp, but the problem occurred on all VIPs, not just the VIP that had the ftp service. With Jari's posting, other people started to recognise the problem too. Graeme Fowler graeme (at) graemef (dot) net 16 Aug 2005

This is very interesting; I have a number of clusters behind LVS-NAT and hadn't managed to observe myself that the one having problems - which I posted about sometime in the last year - is the only one of the whole lot which has ip_vs_ftp loaded. It's also a 2.4.x kernel, and can't be in-service upgraded.

Julian Anastasov ja (at) ssi (dot) bg 26 Aug 2005 I can not reproduce it, I tried with 2.4.32-pre3 as it contains some changes. Can you show your vs settings?: So, you don't have any iptables rules, fwmarking, NAT or linux ethernet bridging? Any extra patches for IPVS? From your explanation ip_vs_ftp leads to problems where SYN creates web connection, it is hashed in table, DNAT-ed to RS, then RS replies SYN+ACK which can not match the connection in table. It looks like this connection is not present (may be removed, do you see something in debug logs from the SYN to the SYN+ACK) or the hash table is damaged. Do you still think it is caused by ip_vs_ftp? About your tests, is the client IP on lan? Do you think this client IP has many connections to the director? Jari (data dumps omitted)

The client IP is not on the LAN. The problem occurs from any source IP trying to visit a load balanced VIP. Whenever we add the FTP service to ipvsadm, and begin load balancing to it, the problem begins to occur on all services. However, it is not consistent. Some outgoing SYN+ACK packets will get translated correctly for a certain period of time, then after awhile some packets will not be translated. I do not think it is load related. We have other load balancers built from the same image handling many more connections.

There were various discussions (under the title "LVS bugs") between Julian and Agostino di Salle a (dot) disalle (at) fineco (dot) it that you can find in the archives if you want to know more.

Julian As reported by some users, the ip_nat_ftp module causes some problems with other virtual services. ip_nat_ftp can keep ip_vs_conn_no_cport_cnt > 0 for the time it expects connections from unknown client ports. This is fatal for the persistence services, as the normal packets start to hit persistence templates instead of valid connections. Such packets are correctly forwarded to real servers but the reply packets do not see connections, as they are not created. As a result, the reply packets are not SNAT-ed by the IPVS code. It is enough to have a passive FTP connection that waits to learn its client port to trigger problems with non-ftp persistent services. The used VIPs do not matter. I tried to fix this problem with the following patch:

Linux 2.6.13: http://www.ssi.bg/~ja/tmp/ipvs-2.6/ct-2.6.13-1.diff
Linux 2.4.32-pre3: http://www.ssi.bg/~ja/tmp/ipvs-2.4/ct-2.4.32-pre3-1.diff

These patches do the following:

introduce IP_VS_CONN_F_TEMPLATE connection flag to mark the connection as template

create new connection lookup function just for templates: ip_vs_ct_in_get

make sure ip_vs_conn_in_get hits only connections with IP_VS_CONN_F_NO_CPORT flag set when s_port is 0. This way we avoid returning a template when looking for cport=0 (ftp)

There is a second patch that properly invalidates the templates, as Agostino di Salle noticed:

Linux 2.6.13: http://www.ssi.bg/~ja/tmp/ipvs-2.6/invct-2.6.13-1.diff
Linux 2.4.32-pre3: http://www.ssi.bg/~ja/tmp/ipvs-2.4/invct-2.4.32-pre3-1.diff

I performed simple tests, so please test these patches, for example, persistence+ip_nat_ftp; the ip_vs_sync code is changed too. If there is a better solution please speak before including them in next kernel releases.
I'm expecting confirmation from people with the problem that reply packets were not translated from IPVS. Jari Takkala Jari (dot) Takkala (at) Q9 (dot) com 9 Sep 2005

We applied these patches to a production load balancer on kernel 2.4.26. Our IPVS code is one version behind, however the patches applied cleanly. We began load balancing FTP last night, and so far everything is working properly. Thanks very much for your help!

The patches worked for Graeme Fowler too. Julian thinks this problem has been affecting people for a while.

Julian Anastasov ja (at) ssi (dot) bg 12 Sep 2005

Thanks to Graeme and to Jari for the tests. It seems the problems reported by many users over the last 2 years and more are now fixed.
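The lookup problem Julian describes can be modelled in a few lines. This is only a toy sketch of the idea, not the kernel code: the `Entry` class, the flag values and the list-based table are all made up for illustration, and only the names ip_vs_conn_in_get, IP_VS_CONN_F_TEMPLATE and IP_VS_CONN_F_NO_CPORT come from the posting. It assumes that persistence templates and "waiting for client port" ftp entries are both stored with cport=0, which is what makes the unpatched fallback lookup return a template.

```python
from dataclasses import dataclass

# Hypothetical flag values, standing in for the kernel flags named above.
TEMPLATE = 0x1   # IP_VS_CONN_F_TEMPLATE: entry is a persistence template
NO_CPORT = 0x2   # IP_VS_CONN_F_NO_CPORT: ftp entry whose client port is not yet known


@dataclass(frozen=True)
class Entry:
    caddr: str
    cport: int
    vaddr: str
    vport: int
    flags: int = 0


def _find(table, caddr, cport, vaddr, vport, patched):
    """Return the first entry matching the 4-tuple, or None."""
    for e in table:
        if (e.caddr, e.cport, e.vaddr, e.vport) != (caddr, cport, vaddr, vport):
            continue
        if patched:
            # Patched behaviour: a connection lookup never returns a template,
            # and a cport=0 lookup only matches entries flagged NO_CPORT.
            if e.flags & TEMPLATE:
                continue
            if (cport == 0) != bool(e.flags & NO_CPORT):
                continue
        return e
    return None


def conn_in_get(table, caddr, cport, vaddr, vport, patched):
    """Toy version of the connection lookup for an incoming packet."""
    entry = _find(table, caddr, cport, vaddr, vport, patched)
    no_cport_cnt = sum(1 for e in table if e.flags & NO_CPORT)
    if entry is None and no_cport_cnt > 0:
        # While any ftp entry is waiting for its client port, retry with cport=0.
        entry = _find(table, caddr, 0, vaddr, vport, patched)
    return entry


VIP = "192.0.2.1"  # example address, not from the posting
table = [
    Entry("10.1.1.1", 0, VIP, 80, TEMPLATE),     # persistence template for a web client
    Entry("10.2.2.2", 0, VIP, 40000, NO_CPORT),  # passive ftp data connection, port unknown
]
```

With this table, a new packet from 10.1.1.1:5555 to VIP:80 hits the template in the unpatched lookup (so no real connection is created and the reply is not translated), while the patched lookup returns nothing and a normal connection can be set up. The ftp entry is still found in both cases.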

ftp is difficult to secure

Roberto Nibali ratz (at) tac (dot) ch 06 May 2001

If you are trying to secure the LVS by using the LVS as a packetfilter, you will not have much success with the ftp protocol, because it is so open. You can do a lot to minimize full breaches. At the least, put the ftp daemon in a chroot environment. We have multiple choices if we want to narrow down the input ipchains rules on the front interface of the director:

- Use ftp via LVS (this is not actually a solution; we still need special input rules on the EXT_IF for 1024:65535).
- Use ftp without LVS but with SNAT (difficult to set up).
- Use the SuSE ftp proxy suite.
- Use a 2.4 kernel and ip_conntrack_ftp (I don't know much about this, ask Rusty).
- Don't use ftp at all (this is what we want).
- The ftpfs project. I haven't fully tested it and it's a very dangerous approach, but it is worth a look.

The biggest problem is with the ip_masq_ftp module. It should create an ip_fw entry in the masq_table for the PORT port. It doesn't do this, so we have to open the whole port range. For PASV we have to DNAT the range.

FTP is made up of two connections, the Control- and the Data- Connection.

ftp Control Connection

The client contacts the server's port 21 from an UNPRIV port. No trouble: a standard, plain, vanilla TCP connection, we all love it. Over this connection the client sends commands to the server. We will see examples later.

FTP Data Connection

"Data" can be either the content of a file (sent as e.g. the result of a "get" or "put" command) or the content of a directory listing (i.e. the result of an "ls" or "dir" command). The data connection is where the trouble starts. To transfer data, a second connection is opened. Usually the client opens this second connection to the server. But for active ftp, the server opens this second connection, using the well-known port 20 (called ftp-data) as source port. But which port on the client should it connect to?

The client announces the port via a "port" command over the control connection. This is nasty: ports are negotiated at the application level, where L4 switches like LVS cannot see what's going on.

For passive ftp, the server announces the port the client should connect to in its reply to the client's "pasv" command (this command starts passive FTP; active is the default). The client then opens the data connection to the server. The port that the server listens on is an unprivileged port (rather than a privileged port, as is normal for internet services). A passive ftp transfer then requires that connections be allowed between all of the roughly 64000 unprivileged ports (1024:65535) on both the client and the realservers, rather than to just one port. A passive ftp server is difficult to secure with packet filter rules.

If we have to protect a client, we would like to allow only passive ftp, because then we do not have to allow incoming connections. If we have to protect a server, we would like to allow only active ftp, because then we only have to allow the incoming control connection. This is a deadlock.

Example ftp sessions with <xref linkend="phatcat"/>

We need 2 xterms (x1, x2), phatcat and an ftp-server (here "ftpserver" 172.23.2.30).

First passive mode (because it is conceptually easier). The server replied with 6 numbers: 172,23,2,30 is the IP I have to connect to and (169*256+29=43293) is the port. In x2 I open a second connection with a second phatcat. Now in x1 (the control-connection) and in x2 the listing appears.

Active ftp. I use the same control-connection in x1 as above, but I want the server to open a connection. Therefore I first need a listener. I do it with phatcat in x2. Now I tell the server on the control connection to connect (2560=10*256+0). Now you see why I used port 2560. 172.23.2.8 is, of course, my own IP-address. And now, using x1, I ask for a directory-listing with the list command, and it appears in x2.

For completeness' sake, here is the full in/output. First the xterm 1: xterm 2:
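The port arithmetic used in both sessions above (high byte times 256 plus low byte) can be checked with a short sketch. The function names are made up for illustration, and the exact wording of the server's 227 reply is an assumption (servers vary); only the six comma-separated numbers matter.

```python
import re


def parse_pasv_reply(line):
    """Extract (ip, port) from the six numbers in a PASV reply."""
    m = re.search(r"\((\d+),(\d+),(\d+),(\d+),(\d+),(\d+)\)", line)
    if m is None:
        raise ValueError("no h1,h2,h3,h4,p1,p2 tuple in reply: %r" % line)
    nums = [int(x) for x in m.groups()]
    if any(n > 255 for n in nums):
        raise ValueError("each number must fit in one byte")
    ip = ".".join(str(n) for n in nums[:4])
    port = nums[4] * 256 + nums[5]   # e.g. 169*256 + 29 = 43293
    return ip, port


def port_command(ip, port):
    """Build the PORT command a client sends for active ftp."""
    octets = ip.split(".")
    if len(octets) != 4 or not 0 <= port <= 65535:
        raise ValueError("need a dotted-quad IP and a 16 bit port")
    return "PORT %s,%d,%d" % (",".join(octets), port // 256, port % 256)


# Assumed reply text; the numbers are the ones from the session above.
print(parse_pasv_reply("227 Entering Passive Mode (172,23,2,30,169,29)"))
print(port_command("172.23.2.8", 2560))
```

A director (or any packet filter) that wants to follow the data connection has to do this same parsing on the control connection, which is why ftp needs a helper module.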

mail on securing ftp Joe

I see that ftp is hard to make secure and your prime recommendation is to have an ftp server isolated from all other machines. Do you recommend that people not use ftp and say instead use http for LVSs that are delivering files? I don't like http for file download. At home (28k phone ppp link) if I do anything else over the line (like load a webpage) while doing a download, the download stalls and doesn't start up again. This is pain as a 10M file takes 2hrs and I have to start again.

Joe Cooper joe (at) swelltech (dot) com 07 May 2001 will solve that problem. is now available as part of the openssh packages I believe, but requires clients to have a recent version of openssh -- probably not what folks want if they have enough clients to justify an LVS cluster. I don't think LVS really has anything to do with whether someone should use ftp for security reasons or not. Securing ftp is a separate issue from securing LVS.

ftps (ssl based ftp), tcp 21, 22?

This is not ftp port forwarded through ssh (see ), nor is it . From Ratz, 30 Nov 2003, see http://www.stunnel.org/examples/ftp.html FTP+SSL, FTP+TLS. There are two deprecated methods of doing SSL+FTP. Make sure that what you're doing and talking about is http://www.ietf.org/rfc/rfc2228.txt RFC2228 ftps. The session starts by the client connecting to port 21 and issuing the "PROT P" command. Quite what happens after that I don't know (which ports, are the packets encrypted?).

Kai

I am using LVS/NAT with ssl based ftp. I can ftp via realserver by using either port mode or passive mode.

ratz 29 Nov 2003 Over the director, correct?

For security reasons SSL based ftp was required. After adding ssl based ftp auth to the realservers, the client computers cannot connect to the realserver with passive mode, but port mode works well.

IIRC you need to load balance port 22 too.

I think the problem is that the data which the ftp server sends to the client, which includes the server's passive port, is encrypted by ssl, so the LVS doesn't know which port should be translated and opened.

AFAICR this isn't the issue. The client receives the PASV command and then translates the PORT into a local ssh tunnel forward. So I think you have to also load balance port 22 TCP. You can use the port 0 feature :). Kai reposted this on 19 Feb 2004

I think the problem is, the data which ftp server sends to the client includes the server's passive port was crypted by SSL. So the LVS don't know which port should be translated and opened.

Horms 20 Feb 2004 Yes, that sounds likely. Try tracing the traffic using something like ngrep.

Does LVS support the SSL based FTP? If not, is there any solution?

If your guess is correct, then no. Well, not unless you get the linux director to handle the ssl and just talk plain-text to the real-servers, but then that isn't LVS.

dns, tcp/udp 53 (and dhcpd server 67, dhcp client 68)

(from the IPCHAINS-HOWTO) DNS doesn't always use UDP; if the reply from the server exceeds 512 bytes, the client uses a TCP connection to port number 53 to get the data. Usually this is for a zone transfer.

The name resolution process (LVS-HOWTO.services.general.html#name_resolution) is broken. It's possible for a client (resolver) to get a reply from a hung nameserver which it interprets as "no resolution for that name", rather than the client going on to the next nameserver in the list. This is a design flaw that will take some fixing (all the clients and all the nameservers must be fixed). DNS should have its own failover mechanism, but it doesn't. In the meantime, some other failover mechanism will have to present a perfect nameserver to the clients.

There is no consensus amongst people running LVS as to whether it's best to have named/bind LVS'ed or just to have a set of machines in a failover setup (Horms, 2 Oct 2006, is of the opinion that named can't be load balanced by nature of the protocol).

If you're running DNS in a failover setup, you might think that you could have one primary machine and a secondary machine and that on failover of the primary you could promote the secondary to be the primary. By design of DNS, there can only be one primary machine. The primary and secondary have different config files and it's not simple to programmatically switch the secondary into the primary role (it can be done, it just requires some thinking). A failover DNS setup then requires two machines with identical config files, one as the master and one as the backup. However if you have dhcpd running on the network, the primary name server machine will be updated continuously with the addresses from the dhcpd, which the backup primary will not get. On failover, you will lose name resolution for these addresses until they renew their leases.

If you are going to LVS named, and are running LVS-DR or LVS-Tun, as usual make sure your named is listening on the VIP (not the RIP).

dhcpd has its own failover/redundancy mechanism. You can't LVS a dhcpd server - it has a database of its leases and no other machine can have the same list. dhcpd can be set up with multiple dhcpd servers on the same network and they pass the updates to each other. Unfortunately it doesn't work - you get to a stage where one machine will mistakenly think that another machine is in charge of all the IPs and both machines refuse to answer requests. The problem has been posted to the dhcpd mailing list for several years without any answers from the dhcpd authors. The only thing to do when this happens is to kill all the dhcpd servers, erase the lease table files, touch new ones, and start the servers again. I went back to having only one dhcpd server and left the other one turned off, waiting as a backup.

This setup for LVS'ing named is from Ted Pavlic. Two (independent) connections, tcp and udp to port 53, are needed. Here is part of an lvs.conf file which has dns on two realservers.

If the LVS is run without mon, then any setup that allows the realservers to resolve names is fine (i.e. if you can sit at the console of each realserver and run nslookup, you're OK). If the LVS is run with mon (e.g. for production), then dns needs to be set up in a way that dns.monitor can tell if the LVS'ed form of dns is working. When dns.monitor tests a realserver for valid dns service, it first asks for the zone serial number from the authoritative (SOA) nameserver of the virtualserver's domain. This is compared with the serial number for the zone returned from the realserver. If these match then dns.monitor declares that the realserver's dns is working.

The simplest way of setting up an LVS dns server is for the realservers to be secondaries (writing their secondary zone info to local files, so that you can look at the date and contents of the files) and some other machine (e.g. the director) to be the authoritative nameserver. Any changes to the authoritative nameserver (say the director) will have to be propagated to the secondaries (here the realservers): delete the secondary's zone files and HUP named on the realservers. After the HUP, new files will be created on the secondary nameservers (the realservers) with the time of the HUP and with the new serial numbers. If the files on the secondary nameservers are not deleted before the HUP, then they will not be updated till the refresh/expire time in the zonefile and the secondary nameservers will appear to dns.monitor to not be working.

LVS is no better than DNS for the same number of working DNS servers. However if a DNS server fails...

Nick Burrett nick (at) dsvr (dot) net 20 Jan 2004

Consider a client with a resolv.conf with IPs (10.0.0.10 first, then 10.0.0.11): If 10.0.0.10 is taken offline, then the client application's speed at getting domains resolved is drastically reduced, because the resolver library will always query 10.0.0.10 before querying 10.0.0.11. Sticking DNS behind LVS alleviates this. Monitoring software will fail out the dead DNS realserver.

anon
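The serial number comparison that dns.monitor makes can be sketched as follows. This is not the dns.monitor code (which I haven't reproduced here), just the comparison step, assuming you already have the SOA record data from both nameservers in the usual text form (MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM). The function names and the example zone data are made up.

```python
def soa_serial(rdata):
    """Return the serial from SOA record data in presentation format.

    The seven whitespace-separated fields are:
    MNAME RNAME SERIAL REFRESH RETRY EXPIRE MINIMUM
    """
    fields = rdata.split()
    if len(fields) != 7:
        raise ValueError("expected 7 SOA fields, got %d" % len(fields))
    return int(fields[2])


def realserver_dns_ok(authoritative_rdata, realserver_rdata):
    """True if the realserver serves the same zone version as the master."""
    return soa_serial(authoritative_rdata) == soa_serial(realserver_rdata)


# Hypothetical zone data: the realserver is one change behind the master.
master = "ns.example.com. hostmaster.example.com. 2004012002 10800 3600 604800 86400"
stale = "ns.example.com. hostmaster.example.com. 2004012001 10800 3600 604800 86400"
print(realserver_dns_ok(master, master))
print(realserver_dns_ok(master, stale))
```

A secondary that is answering queries but has not yet refreshed its zone fails this test, so it is marked as not working even though its named is running.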

I'm planning to put my company's dns on lvs with ha.

Greg Woods woods (at) ucar (dot) edu 30 Aug 2002 Unless you have a really unusual situation, I think using LVS for DNS is massive overkill. There is no way that DNS load should overwhelm a single server. If it does, you probably are in dire need of some subdomains. What I do here is just use the heartbeat code so that the hot spare backup machine will take over if the primary goes down, and I do have a restart script that uses scp to move the data files that have been modified over to the backup machine. scp is called out of a script that will keep trying the scp until it succeeds, in case the backup machine is down at the time a change is made. This seems to work for us. I do use LVS for our mail system, but then, the mail system does anti-spam IP address blacklist checking, and virus scanning. That means the overhead of establishing a connection through LVS is small compared to the load on the server to process a connection. I don't think this is the case for DNS. Jeff Kilbride

Does anybody else agree that load balancing DNS servers with LVS is not worthwhile?

Peter Mueller pmueller (at) sidestep (dot) com 2005/04/18

Yes, for authoritative. ISC-bind has some kind of response-latency measurement built-in. For client side, LVS is useful. In the event of the first server in /etc/resolv.conf failing, there's a 2 second timeout that can be avoided.

If you're LVS'ing named, you may wind up with many VIPs on your director.
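The cost of a dead first nameserver that Nick and Peter describe is easy to put numbers on. Here is a rough sketch, assuming the resolver tries the nameservers strictly in order, waits one full timeout for each dead one and makes no retries. The 2 second timeout is Peter's figure; the 50 msec answer time and the function name are made up for illustration.

```python
def lookup_delay_ms(servers_up, timeout_ms=2000, answer_ms=50):
    """Time until the first answer, trying nameservers in resolv.conf order.

    servers_up is a list of booleans, one per nameserver entry.
    Returns None if no listed server answers.
    """
    delay = 0
    for up in servers_up:
        if up:
            return delay + answer_ms
        delay += timeout_ms   # wait out the dead server, then try the next one
    return None


print(lookup_delay_ms([True, True]))    # both nameservers working
print(lookup_delay_ms([False, True]))   # first nameserver down
print(lookup_delay_ms([True]))          # a single VIP with a working realserver behind it
```

Every lookup pays the timeout while the first server is down, whereas a client pointed at a VIP only ever sees the answer time, as long as the monitoring has removed the dead realserver.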

samba, udp 137, udp 138, tcp 139, tcp 445

The problems to be solved with setting up an LVS'ed samba are:

- it's peer-to-peer (rather than client-server)
- you have to authenticate users
- if clients can write to the samba'ed disks, you have to propagate the updates to the other realservers.

Fred Lacombe's LVS-Samba HOWTO

Lapin(c) lapin (at) linagora (dot) com 04 Mar 2004

Here is a draft for an LVS-Samba HOWTO (http://www.lapinux.org/howto/) that load balances Samba with LVS-NAT. There are still modifications to add and some tricks to point out, but all feedback will be helpful. I just tried to make the samba realservers invisible to each other with iptables rules. The only visible machines are an LDAP server and the director. It still has some (undocumented) drawbacks, but I can authenticate against 2 samba realservers and I can access shares on each of them (directly in their filesystem). Unsolved is the problem of sync for the shares: I've thought about a SAN, or some DRBD cross definition. I still have to solve this.

Joe: This is big news. I haven't read all of Fred's docs yet, or set one of these up, but Fred seems to have solved the many reader/ single writer problem by having a single LDAP database for all Samba servers and by having (or assuming) a single file system for the shares.

Will McDonald's setup

Will McDonald wmcdonald (at) gmail (dot) com 21 Mar 2006

We have a simple Samba share available on some systems sat behind a pair of LVSs. We have 2 directors in Active/Passive NATing through to two realservers running Heartbeat in Active/Passive. So only one of the realservers has the Heartbeat managed VIP the LVSs NAT through to at any one time. I know for our purposes the realservers could just sit on the same subnet as our other servers but this is an inherited setup and there are other reasonable reasons for it to be like this. Samba's not the realservers *primary* role, there are other services too. The reason they're Active/Passive is because DRBD devices can only be mounted on one node at any one time. The LVSs are running CentOS4 and the repackaged Ultramonkey packages out of the CentOS Extras repository

     192.168.25.10:445           Masq 1 2 0
TCP  192.168.24.45:139 rr persistent 600
  -> 192.168.25.10:139           Masq 1 0 0
UDP  192.168.24.45:137 rr persistent 600
  -> 192.168.25.10:137           Masq 1 0 0
UDP  192.168.24.45:138 rr persistent 600
  -> 192.168.25.10:138           Masq 1 0 0

The ldirectord.cf on the LVSs looks as follows... The back-end boxes are FC3 running Ultramonkey packages again, and DRBD for disk replication. Samba startup is handled from /etc/ha.d/haresources by simply including "smb" as a resource which starts/stops on failover. The smb.conf's very simple too... This has been pretty reliable but it's not high volume by any stretch of the imagination. Nor is it attached to a domain so I'm not sure how you'll get on with browser mastering etc.

early attempts to LVS samba

The topic of serving samba on an LVS was first raised by John Rodkey rodkey (at) wesmont (dot) edu who wondered if he could serve 300 w2k machines with samba/LVS. Not knowing much about SMB I put out a request for help on samba-technical (at) lists (dot) samba (dot) org. I got replies from several people, including the samba developer Chris Hertel, and from Ryan Fox (who had set up an LVS and who had even read the LVS-HOWTO). I also got a free 2hr phone tutorial from John Terpstra jht (at) samba (dot) org on 26 Oct 2001.

One of the big problems is that samba is peer-peer, while LVS works with server-client connections. Wensong has said on the mailing list that you can use samba in read-only mode over LVS, but this will not be of much use to a bunch of windows boxes. Apparently there's a lot of interest in the commercial world in highly available samba clustering and some effort has been put into making LVS work with samba, by people who don't come up on the LVS mailing list. No-one has succeeded and now it's generally thought that LVS is not the way to go.

Here's John's tutorial as I copied it down over the phone. Thanks John for your help and time.

Attitude adjustment zone for unix people.

cifs==smb (two names for the same thing). For more information on cifs/smb see ftp://ftp.samba.org/pub/samba/specs "samba ftp docs" (Sep 2002 link is dead). You'll need to go to a mirror site.

Microsoft's clustering service is derived from the DEC Wolfpack, which was originally a bulletproof, all things to all men, industrial strength, cluster and failout framework. Microsoft is using the part of Wolfpack that corresponds to Linux-HA.

Most communication between windows machines in setting up logins and finding resources (printers, disks, network) is between peers, rather than server/client as for unix. Any machine then will be able to find the resources on the network, whereas in unix, the clients have to find out by some mechanism external to the host (e.g. phone up the sysadmin). The same ports are used at both ends and communication is usually by broadcast (at least initially). Thus there is no distinction between a samba server and a samba client. One host may have the files the user wants and to unix people, this host would be the server. But in windows there are two peers: one machine has a file and the other machine may want it - the roles can in principle be reversed without any change in the setup.

Election amongst peers is used to determine who will have the role of knowing the location of other resources (e.g. becoming the domain node controller). Unlike unix, where you set up a machine deliberately to be a server by setting up daemons listening on a socket, with windows you cannot guarantee that a certain machine will assume a particular role. You can bias the election (e.g. machines which have been up longer have more weight), but you can't rig the election. If you have to bring a machine down, it's down and it's less likely to win any new elections (possibly for a long time).

In a LVS clustered samba setup (if such a thing could be made to exist), a long running client machine out on the internet might win the election and assume the role of locator service.

Communication between machines using IP is in fact encapsulating netware (if running Novell) or NetBEUI datagrams inside IP. Samba uses netbios over IP. To unix people the network is sometimes thought of as a hardware layer. To windows, the network is a netbios messaging layer. Two windows machines could be connected by several protocols (NetBEUI, netbios, tcpip) over the same piece of wire (ethernet). These connections are regarded as being separate and independent networks - i.e. they use different names for the machines at each end.

the kernel resources are a cloud which receives broadcast messages

Here is an application talking to the kernel in windows.

- WIN32API - all communication with the cloud is by broadcast. The appropriate box from the cloud will reply. There is no direct connection to drivers as in unix, where the kernel asks the disk driver to "open file X" (on behalf of the application).
- SMB API - nothing happens in windows without SMB being involved.
- locator - knows where resources (printers, disks, network connections) are
- redirector - sends services.
- resolver - uses SMB messages to find out where to go.
- netware - messes everything up, it's incompatible with the rest of the kernel (e.g. as you'll find if you try to connect by netware _and_ tcpip).

let's look a little more closely

SMB has to decide if a request is local or remote:

- local - passes the call to the local PC. Anything local has a name like /DEVICE/xxx
- remote - these have UNC names \\SERVER\share\path\filename

Netbios converts the SMB message to a netbios datagram and puts it on the wire as a NetBEUI or netware message. When running IP, SMB uses netbios over tcpip (not NetBEUI or netware). It uses 3 ports:

- udp 137 - netbios nameservice (==WINS). The WINS namespace is flat, rather than hierarchical like DNS.
- udp 138 - browse list (i.e. network neighbourhood)
- tcp 139 - persistent connection to the other machine: session traffic, printing, filesharing.

Every client has to be able to find the local master browser (domain master browser != local master browser). This could be any machine. The election is conducted by broadcast over udp 137,138 (the election can be biased, but the outcome cannot be forced/guaranteed). What we think of as the samba server may not win. Broadcast udp will not go over a router, so if the network is routed, then tcp unicast is used for the election (as well as udp broadcasts), telling the client to use the WINserver (which will be a samba machine or an NTWINserver).
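The local/remote decision described above (local names like /DEVICE/xxx versus UNC names like \\SERVER\share\path\filename) amounts to looking at how the name starts. Here's a minimal sketch of splitting a UNC name into its parts; the function name is made up, and this only illustrates the naming convention, not how windows or samba actually implement the redirector.

```python
def split_unc(name):
    """Split a UNC name into (server, share, path).

    Returns None for anything that does not start with two backslashes,
    i.e. a name that would be handled locally.
    """
    if not name.startswith("\\\\"):
        return None
    parts = name[2:].split("\\")
    if len(parts) < 2 or not parts[0] or not parts[1]:
        raise ValueError("UNC name needs at least \\\\SERVER\\share")
    server, share = parts[0], parts[1]
    path = "\\".join(parts[2:])   # may be empty if only the share is named
    return server, share, path


print(split_unc(r"\\SERVER\share\path\filename"))
print(split_unc("/DEVICE/xxx"))
```

For an LVS, the SERVER part is the interesting one: it's a netbios name that the client has to resolve (via udp 137) before it can open the tcp 139 session.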

connecting to the cluster, windows style When a new windows machine comes on the net (e.g. an smb client or our samba server), it needs to establish that it has a unique name. Name space is handled by contest. The machine udp broadcasts its name (e.g. JACK) 4 times at 200msec interval and asks "who is local master browser?". A samba server will announce that it is "JILL". If there is another machine of the same name already, it will send back a <NACK>. If there are no <NACK>s, the local master browser will accept the name. The client will register its name by udp broadcast (or possibly tcp unicast) with the WINserver, into the workgroup or domain. The user will then see something in "network neighborhood". The client machine will do a udp 138 unicast to the local master browser "give me browse list enumeration" (the local master browser has information from the domain master browser too). On a multisegmented, routed network, each segment has its own local master browser. One machine will be both a local master browser and a domain master browser. If the user clicks on a machine in "network neighbourhood" (and is using WINserver), the client machine will send a "name lookup request" (like a DNS request) - a netbios unicast request to udp 137 on the local master browser and get the IP of the machine. The client registers (includes services available) with the other machine. The client machine will then send a tcp 139 "session setup request", and then sets up a netbios connection over tcp to IPC$share on the machine. This setup involves an SMB "net_prot" (negotiate protocol) exchange to setup protocol(s) and establish whether the client can use long filename support and UC/lc letters. The client has connected with an empty username and passwd at this stage. The client now authenticates and receives back a list of printers, files and is given a persistent connection. The original (passwdless) connection is pulled down. 
After 10-15mins of inactivity, the client kernel may elect to drop its session (even if an application is in the middle of editing an open file on the remote machine). The application has no knowledge of this disconnect. When something happens in the application again (or you click on network neighbourhood etc), the session will be renegotiated. If the remote machine has gone down in the meantime, or the client is connected to our hypothetical samba LVS and is redirected to a new samba server (which doesn't know anything about the client's original connection), the user will get a message that the connection cannot be re-established and that the user will have to exit from the application (without saving the edits). Ha-ha, just kidding - that's what you should get - you'll actually get the BSOD.

Samba using a distributed filesystem on the realservers Kai Suchomel1 KAISUCH (at) de (dot) ibm (dot) com 12 Jun 2006 The Samba Service uses a SAN Filesystem, here GPFS. This File system is shared among all the Samba Services on the RS. When I connect to VIP and the SAN Filesystem, the client can connect to any realserver. When the RS fails, after doing a reconnect, the Client can access the SAN Filesystem over another RS.

xdmcp, X-window, udp 177 (xdmcp), tcp 6000 (and ssh X-forwarding) Multiple ports are involved here. However you don't have to LVS all the ports. As far as LVS is concerned, only port 177 needs to be LVS'ed. However you have to know about all the ports to get xdmcp to work, so it's in the multi-port services section. Not so long ago, a common practice was to serve all applications from a central server to a diskless X-terminal (like an NCD) which ran an X-server from ROM. User's files were backed up centrally. Upgrades/fixes to applications for 100s of clients was a matter of writing the new files to the single, large, reliable central server. The fixes appeared simultaneously for all clients. We've all realised the fundamental flaw in this setup and now we have the applications running on several 100 desktop machines, where upgrades and fixes can take weeks to propagate. The fix to the fix is to run thin clients on the desktops (no wait, didn't we already do that one).

X - attempt 1, connecting directly to X being served by the LVS

This method does not work.

X window is another client-server protocol. The X-client asks for a connection to the X-server by connecting to ports starting at 6000 on the X-server, and the server will start displaying X images on its display. If you don't think about it too much, it seems that X should work through LVS. However

Lidsa lidsa (at) legend (dot) com (dot) cn 24 Apr 2002

..but the realserver is the X-client and the X-server does not reside on the realserver. So I think it impossible for LVS to forward X-window.

If you connect from your LVS client to VIP:telnet on an LVS-DR (you will now be connected to one of the realservers) and start xclock on the realserver, you'll get the xclock image on the lvs client (provided that you also have a direct connection between the realserver and the client). If you look with netstat -an you'll find that RIP:1025 is ESTABLISHED with CIP:6000. Yes, the LVS client is the X-server - it is not the X-client. The realserver is the X-client. You can't use LVS to forward X-sessions.

X - attempt 2, connecting to LVS:xdmcp This method is most like the login from a diskless xterm. Severin Olloz showed that you can use an LVS to serve X-sessions by running xdmcpd on the realservers. Severin had problems initially with some logins locking up, but this apparently was due to a misconfiguration of one of his realservers. When I tried it, I was still able to login from the lvs client after leaving the connection idle for a few hours at the xdm login screen. After leaving the nodes idle overnight I couldn't get a login at the LVS client anymore. On one occasion xdm was running on the node corresponding to the login shown on the client. I restarted xdm on that node and could connect again. On another occasion xdm had died on one of the realservers and the client was just showing the background color for the X-window and a functional mouse, but no xdm login screen. The connections to port 6000 on the lvs client were also gone. I restarted xdm on all the realservers and restarted the client ("X :1 -query VIP") but did not get the xdm login screen. I could connect after running ipvsadm again. Presumably timeouts will need to be explored to make a working xdmcp LVS. Severin Olloz S (dot) Olloz (at) soid (dot) ch 30 Apr 2002

I have set up an LVS-DR X11-Server. The LVS client makes a XDMCP-Query with a command like this: and the director of the cluster sends the UDP packets on port 177 (XDMCP) to the realserver. The realserver accepts the request and opens a X11-session for the user. (Note: the realserver is opening a direct connection to CIP:600x - this is not under control of the LVS. The LVS client and the realserver must be able to exchange packets directly.) My ipvsadm table looks like this:

  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
UDP  VIP:xdmcp wlc persistent 360
  -> node1:xdmcp                 Local   100    0          0
  -> node2:xdmcp                 Route   100    0          0

The director is a realserver too, using localnode.
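If you want to watch the counters in a table like the one above from a script (e.g. to see which realserver a login went to), the layout is regular enough to parse. This is a rough sketch that only understands TCP/UDP service lines and "->" realserver lines as they appear in this section; the function name and the dictionary keys are made up, and other line types (such as fwmark services) are simply ignored.

```python
def parse_ipvsadm(text):
    """Parse ipvsadm listing text into a list of virtual services."""
    services = []
    for raw in text.splitlines():
        fields = raw.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP") and len(fields) >= 3:
            # Virtual service line: protocol, address:port, scheduler, options
            services.append({
                "proto": fields[0],
                "address": fields[1],
                "scheduler": fields[2],
                "options": fields[3:],
                "realservers": [],
            })
        elif (fields[0] == "->" and len(fields) == 6
              and fields[3].isdigit() and services):
            # Realserver line; the column header has "Weight" here and is skipped
            services[-1]["realservers"].append({
                "address": fields[1],
                "forward": fields[2],
                "weight": int(fields[3]),
                "active": int(fields[4]),
                "inactive": int(fields[5]),
            })
    return services


SAMPLE = """\
  -> RemoteAddress:Port  Forward Weight ActiveConn InActConn
UDP  VIP:xdmcp wlc persistent 360
  -> node1:xdmcp         Local   100    0          0
  -> node2:xdmcp         Route   100    0          0
"""

for service in parse_ipvsadm(SAMPLE):
    print(service["proto"], service["address"], service["scheduler"])
    for rs in service["realservers"]:
        print("   ", rs["address"], rs["forward"], rs["inactive"])
```

The "Local" forwarding method on node1 is how the localnode feature shows up in the table: the director itself is that realserver.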

Here's more details I discovered when I reproduced Severin's setup. For info on XDMCP, see the Linux XDMCP HOWTO and the many links provided therein. In this method, X-clients on the realservers connect directly to the X-server on the LVS client. The LVS is only used for xdmcp authentication. Once this step has been accomplished, the LVS steps out of the way and the X-session is between the realserver selected and the client. The client then must be able to send packets directly to the RIP on the realserver. In a normal LVS-DR, the RIP is not routable from the lvs client. The RIPs will have to be routable or public IPs. For a test, first connect directly from your lvs client box to a realserver (no director or LVS involved yet). Setup your xdm-config, Xaccess files on the realserver(s) as described in the XDMCP HOWTO and check the permissions of Xservers and Xsetup_0. Make sure xdm is running on the realserver (the XDMCP-HOWTO does this via the inittab file, but you can just fire it up from the command line for a test). Check that xdm is running Run the next command. If you don't have X running, it will be started for you. If your LVS client is displaying an X-window (i.e. you ran `startx`) then the client at the other end will overwrite your current X-session. The original window manager screen should disappear on your client box to be replaced by the xdm login from the realserver. If you just have a blank screen on the client, with a mouse X but no login box, check that xdm is running on the realserver. After you login, you get the window manager set in /etc/X11/xdm/Xsessions. In the default Xsession file, xsm is used which defaults (see man xsm) to running twm with smproxy and an xterm. This is pretty gruesome, so I substituted xsm with my window manager, fvwm2. Here's part of Xsession. From the console on the client, you can see the connections from the realserver back to the X-server on the client. 
Now set up the director to forward xdmcp/udp and connect to VIP:xdmcp. Note: I'm not using persistence, while Severin is. Non-persistence seems to work for an LVS of 4 realservers. Here's the output of ipvsadm after connecting RemoteAddress:Port Forward Weight ActiveConn InActConn UDP lvs.mack.net:xdmcp rr -> RS4.mack.net:xdmcp Route 1 0 1 -> RS3.mack.net:xdmcp Route 1 0 0 -> RS2.mack.net:xdmcp Route 1 0 0 -> RS1.mack.net:xdmcp Route 1 0 0 ]]> `netstat -an` doesn't show any connections to the VIP (it's udp afterall), but the connections to the X-server ports on the client are seen along with the entry for X1. Once you have logged in via xdm, the client and realserver are connected directly and LVS is not involved anymore. After you logout from the X-session at the client, and return to the XDMCP login screen, the connections to port 600x are gone. After exiting from an X-session the client will be presented with a new xdm login screen. Watching with tcpdump shows the following steps following the termination of an X-session. many packets are exchanged between RIP:high port and CIP:6001, presumably repainting the login screen. single udp packet from CIP:high_port to VIP:xdmcp is forwarded to realserver. (The director has no further role after this). 2.5 secs later, an exchange of two pairs of udp packets between RIP:xdmcp and CIP:high_port. the login screen appears on the LVS client and the InActConn counter is incremented in ipvsadm. if you wait here, the InActConn counter (ipvsadm) decrements in about 3mins. If you login and out before the timer decrements, you are returned to the same realserver. If you disconnect after the timeout, you are reconnected with the next realserver. Whether you wait or not, when you enter your name/passwd, no further packets are passed by udp/xdmcp. Instead, there is another flood of tcp packets from RIP:high_port to CIP:6001. It appears that xdmcp only presents the login screen and that login occurs via the X connection later. 
If both painting the login screen initially and (after the timeout on the director, about 3 mins) sending the name/passwd used xdmcp, then it's possible that the login data could be sent to a different host from the one that painted the screen. This apparently can't happen. (Joe, May 2004: I have no idea why I said that this "apparently can't happen".) You could set up an xterm farm with this using a bunch of diskless 486 PCs with 16M memory.

X - attempt 3, X-forwarding with ssh and connecting to LVS:sshd

In this method, you have your window manager running on your LVS client and you are displaying realserver X-clients on the X-server running on your LVS client. To set up, on the realserver, have the entry "X11Forwarding yes" in sshd_config (re-HUP sshd if necessary). On the client, have the entry "ForwardX11 yes" in ssh_config. If you like, as a test, ssh (with `ssh -v`) directly from the client box to the realserver (not to the VIP) as if you were doing a regular ssh login. After login, check that X-forwarding is turned on by looking at the DISPLAY variable. "realserver" is the name of the machine you have logged into (it might be "localhost"). The name you get will NOT be the name of the machine with the X-server that will be displaying the X (as would normally happen with non-forwarded X connections). Here your X-server is the LVS client. The $DISPLAY variable shows where X-clients running on the realserver will send their output. In this case "realserver:10.0" is a proxy X-server running on the remote machine, which will forward the X-calls to the X-server running on the LVS client machine (the output will not go to the realserver). If you now run `xclock` on the realserver, it will be displayed on the LVS client machine.

Next set up your director to forward ssh. For more info see the section on sshd. In particular make sure all the host keys on the realservers are identical. Connect to VIP:sshd. You should now be able to start X-clients apparently running on the VIP (but really running on the realserver).
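As a rough illustration of reading the DISPLAY value described above, here is a minimal Python sketch. It assumes the usual "host:display.screen" form, sshd's default X11DisplayOffset of 10, and the convention that display N listens on TCP port 6000+N. The function names are made up for this example, and the "forwarded" test is only a heuristic.

```python
import re

SSH_DISPLAY_OFFSET = 10   # sshd's default X11DisplayOffset (assumed unchanged)
X_BASE_PORT = 6000        # display :N conventionally listens on TCP port 6000+N


def parse_display(value):
    """Split a DISPLAY string 'host:display.screen' into (host, display, screen)."""
    match = re.fullmatch(r"([^:]*):(\d+)(?:\.(\d+))?", value)
    if match is None:
        raise ValueError(f"not a DISPLAY value: {value!r}")
    host, display, screen = match.groups()
    return host, int(display), int(screen or 0)


def looks_forwarded(value):
    """Heuristic only: a named host and a display number at or above the sshd offset."""
    host, display, _ = parse_display(value)
    return host != "" and display >= SSH_DISPLAY_OFFSET


def tcp_port(value):
    """TCP port on which the (proxy) X-server for this display listens."""
    return X_BASE_PORT + parse_display(value)[1]


if __name__ == "__main__":
    for value in ("realserver:10.0", ":0"):
        print(value, parse_display(value), looks_forwarded(value), tcp_port(value))
```

A local console session normally has a DISPLAY like ":0", so the heuristic separates it from the proxy display that sshd creates on the realserver.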

r commands; rsh, rcp, and their ssh replacements, tcp 513 (,514) and another connection

An example of using rsh to copy files is in performance data for single realserver LVS, Sect 5.2.

Caution: The matter of rsh came up in a private e-mail exchange. The person had found that rshd, operating as an LVS'ed service, initiated a call (rsh client request) to the rshd running on the LVS client. (See Stevens "Unix Network Programming" Chapter 14, which explains rsh.) This call will come from the RIP rather than the VIP. This requires rsh to be run under LVS-NAT, or else the realservers must be able to contact the client directly. Similar requests from the client and passive ftp on realservers cause problems for LVS.

David Lambe david (dot) lambe (at) netunlimited (dot) com Mon, 13 Nov 2000

I've recently completed "construction" of a LVS cluster consisting of 1 LVS and 3 realservers. Everything seems to work OK with the setup except for rcp. All it ever gives is "Permission Denied" when running rcp blahfile node2:/tmp/blahfile from a console on node1. Both rsh and rlogin function, BUT require the password to be entered twice.

Joe Sounds like you are running RedHat. You have to fix the pam files. The beowulf people have been through all of this. You can either recompile the r* executables without pam (my solution), or you can fiddle with the pam files. For suggestions, go to the beowulf mailing archives - you have to download the whole archive and grep through it. If you go to the beowulf site, you'll find people are moving to replace rsh etc with ssh etc on sites which could be attacked from outside (and turning off telnet, r* etc). For example setup files for ssh, see the section on sshd.

Streaming Media: RealNetworks, Quicktime, Windows Media Server, tcp/udp 554 (and other ports)

RealNetworks streaming protocols, tcp 554, many ports

Jerry Glomph Black black (at) real (dot) com August 25, 2000

RealNetworks' streaming protocols are PNM and RTSP.

PNM (TCP on port 7070, UDP from server -> player on ports 6970-7170): PNM was the original protocol in versions 1 through 5. It's now mostly legacy.

RTSP (TCP on port 554, similar UDP as above, but often on multiple ports): With the G2 release, we adopted the RTSP delivery standard.

The current version, RealPlayer 8, came out about two weeks ago. A free one is available to run on just about any platform in common use today. The Linux versions are great. There's also an HTTP/TCP-only fallback mode which is (usually) on port 8080. The server configuration can be altered to run on any port, but the above numbers are the customary, and almost universally-used, ones.

Mark Winter, a network/system engineer in my group, wrote up the following detailed recipe on how we do it with LVS: add IP binding in the G2 server config file. On the LVS side:

    ./ipvsadm -A -u :0 -p
    ./ipvsadm -A -t :554 -p
    ./ipvsadm -A -t :7070 -p
    ./ipvsadm -A -t :8080 -p
    ./ipvsadm -a -u :0 -r
    ./ipvsadm -a -t :554 -r
    ./ipvsadm -a -t :7070 -r
    ./ipvsadm -a -t :8080 -r

(Ted) I just wanted to add that if you use FWMARK, you might be able to make it a little simpler and not have to worry about forwarding EVERY UDP port.

    /32 7070 -p tcp -m 1
    ipchains -A input -d /32 554 -p tcp -m 1
    ipchains -A input -d /32 8080 -p tcp -m 1
    ipchains -A input -d /32 6970:7170 -p udp -m 1
    # Setup the LVS to listen to FWMARK1
    director:/etc/lvs# ipvsadm -A -f 1 -p
    # Setup the realserver
    director:/etc/lvs# ipvsadm -a -f 1 -r

Not only is this only six lines rather than eight, but now you've set up a persistent port grouping. You do not have to forward EVERY UDP port, and you're still free to set up non-persistent services (or other persistent services that are persistent based on other ports). When you want to remove a realserver, you now do not have to remove FOUR realservers, you just remove one. Same thing with adding. Plus, if you want to change what's forwarded to each realserver, you can do so with ipchains and not bother with taking the LVS up and down. ALSO... if you have an entire network of VIPs, you can set up ipchains rules which will forward the entire network automatically rather than each VIP one by one.

Jerry Glomph Black black (at) prognet (dot) com 07 Jun 2001 Following is a currently-operational configuration for LVS balancing of a set of 3 RealServers (or Real Servers, in LVS-terminology). It has been running at very high loads (thousands of simultaneous connections) for months, in addition to numerous conventional LVS setups for more familiar web load-balancing at massive loads.

Roberto Nibali ratz (at) tac (dot) ch 08 Jun 2001 There is no fwmark module, and the ip_vs module is loaded by ipvsadm now. Why do you need persistence?
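To make the port grouping in Ted's fwmark recipe concrete, here is a small Python model of the classification step. It is only a sketch of the idea, not how ipchains or ip_vs are implemented; the rule table copies the ports quoted above, while `fwmark_for`, `SERVICES` and the realserver names RS1..RS3 are hypothetical names for this example.

```python
# (protocol, first port, last port, mark), mirroring the ipchains lines above.
MARK_RULES = [
    ("tcp", 7070, 7070, 1),   # PNM
    ("tcp", 554, 554, 1),     # RTSP
    ("tcp", 8080, 8080, 1),   # HTTP fallback
    ("udp", 6970, 7170, 1),   # UDP streams from server to player
]

# One virtual service per mark; the realserver names are placeholders.
SERVICES = {1: ["RS1", "RS2", "RS3"]}


def fwmark_for(proto, dport, rules=MARK_RULES):
    """Return the mark of the first matching rule, or 0 for unmarked packets."""
    for rule_proto, low, high, mark in rules:
        if proto == rule_proto and low <= dport <= high:
            return mark
    return 0


def realservers_for(proto, dport, services=SERVICES):
    """Realservers eligible for this packet; empty if it belongs to no service."""
    return services.get(fwmark_for(proto, dport), [])


if __name__ == "__main__":
    for proto, port in (("tcp", 554), ("udp", 7000), ("tcp", 80)):
        print(proto, port, fwmark_for(proto, port), realservers_for(proto, port))
```

Because every marked port maps to the same service entry, taking a realserver out means editing one list rather than one list per port, which is the point Ted makes about removing one realserver instead of four.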

RealNetworks g2 server philz (at) testengeer (dot) com 3 Apr 2000

A realnetworks g2 server is the daemon that serves up real audio/video streams (http://real.com). I'm using LVS-Tun. When I tried to set up a realnetworks g2 server I could not get it to accept the connection (tcp port 7070). A telnet to port 7070 on the VIP yields a connection refused, while telnet to the realserver IP yields a "connect" (it also serves video and audio if you use the proper client).

Joe Is the service listening on the VIP (a common thing to forget when setting up LVS-DR or LVS-Tun)?

That's it. Success! Here is what has to be done: The real audio/video daemon must be configured to listen/respond to _BOTH_ the VIP and its RIP (see Configure->General Setup->IP Binding on the RealAdministrator web page). Both ports 7070 and 554 (PNAPort and RTSPPort respectively) must be redirected. You might have to do more ports for other features of the real audio/video daemon.
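The telnet test used in this exchange can be scripted. The following Python sketch simply attempts a TCP connection to each address/port pair and reports whether it was accepted; `port_open` and `check_service` are names invented here, and the addresses in the usage comment are placeholders for your own VIP and RIP.

```python
import socket


def port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port is accepted."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "connection refused", timeouts and unreachable hosts alike.
        return False


def check_service(addresses, ports):
    """Map each (address, port) pair to whether it accepts connections."""
    return {(addr, port): port_open(addr, port)
            for addr in addresses for port in ports}


# Hypothetical usage from a client machine (addresses are placeholders):
#   check_service(["192.0.2.10", "192.0.2.21"], [7070, 554])
```

A refused connection on the VIP combined with an accepted one on the RIP is the symptom described above: the daemon is bound to the RIP only.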

er OK. The daemon listening on the RIP never hears from anyone though ;-\

You actually need the RIP to respond so that you can manage/monitor it.

congratulations. You've got a realserver to be a RealServer. Is this the thing that costs $2995 with RedHat?

Nope. This is the free one that supports 25 sessions per server ;-)

What's on each of 7070 and 554? Is one video and the other audio? What do PNAPort and RTSPPort stand for? What happens if the client gets 7070 from one realserver and 554 from another? Did you have to link the 2 services with persistent connection?

Quicktime, tcp 554, many ports

First, a quicktime primer from Andy Wettstein:

It is similar to Real. 554 is rtsp, and there is an option on the quicktime server to stream over port 80 to avoid firewall problems. The ports 6970:7170 are what the client will actually send/receive the stream on (if not blocked by firewall rules, etc). The udp stuff is why you need persistence. The stream would try to switch between servers without persistence enabled (since udp is really connectionless).
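Why the udp streams need persistence can be seen in a toy model. The Python sketch below is a simplification, not the ip_vs code: it assumes a round-robin scheduler and models persistence as a per-client-IP template with no timeout. The class name and the realserver names are invented for the example.

```python
import itertools


class Director:
    """Toy round-robin director with optional per-client persistence."""

    def __init__(self, realservers, persistent):
        self._rr = itertools.cycle(realservers)
        self.persistent = persistent
        self.templates = {}  # client IP -> realserver (real templates also expire)

    def schedule(self, client_ip):
        """Pick a realserver for a new flow (TCP connection or UDP stream)."""
        if self.persistent and client_ip in self.templates:
            return self.templates[client_ip]
        server = next(self._rr)
        if self.persistent:
            self.templates[client_ip] = server
        return server


if __name__ == "__main__":
    # One client opens RTSP on 554/tcp, then two UDP streams in 6970:7170.
    flows = ["554/tcp", "6970/udp", "6971/udp"]
    for persistent in (False, True):
        director = Director(["RS1", "RS2"], persistent)
        print(persistent, [(f, director.schedule("10.0.0.5")) for f in flows])
```

Without persistence the control connection and the streams are spread over both realservers; with it, every flow from the client lands on the realserver that holds the session.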

Andy Wettstein awettstein (at) cait (dot) org 20 Dec 2002

I'm trying to set up the quicktime (darwin) streaming server through lvs. It kind of works, but it is very slow, much slower than just accessing the stream without going through lvs. I have set it up exactly the same as the Real rtsp examples. I am using lvs-dr with fwmark on ports. Here are the iptables commands I used: Then I added the lvs-dr like the examples: And I get this with ipvsadm:

    lead.web.cait.org:0 Route 1 0 1
    -> tin.web.cait.org:0 Route 1 1 1

I am also unable to access the stream on port 80 through lvs. If anyone has experience with quicktime please let me know if there is anything further that I need to do.

I figured it out. It needs persistence (or streaming movies will fail), i.e. the ipvsadm -A command needs a "-p". Here's the mon.cf and the qtss.alert

    /dev/null; then
            $IPTABLES -t mangle -A PREROUTING -i eth0 -p tcp -s 0.0.0.0/0 -d $VIRTUALSERVER --dport 80 -j MARK --set-mark $MARK
            $IPTABLES -t mangle -A PREROUTING -i eth0 -p tcp -s 0.0.0.0/0 -d $VIRTUALSERVER --dport 554 -j MARK --set-mark $MARK
            $IPTABLES -t mangle -A PREROUTING -i eth0 -p udp -s 0.0.0.0/0 -d $VIRTUALSERVER --dport 6970:7170 -j MARK --set-mark $MARK
        fi
        # set up the virtual server
        $IPVSADM -A -f $MARK -s $SCHEDULER -p
        # add the realserver
        $IPVSADM -a -f $MARK -w $WEIGHT -r $REALSERVER
    else
        # remove
        $IPVSADM -d -f $MARK -r $REALSERVER
    fi
    exit 0

Windows Media Server, tcp/udp 554, tcp 1755, udp 1024:5000

Mark Weaver mark (at) npsl (dot) co (dot) uk 23 Mar 2004 Here's how to set up Windows Media Server. This information is not easy to come across as I can't find a simple published document which lists what WMS actually does. There is also some attempt here at WMS9 support, but that's untested and is just based on what the player tries to do (the player connects more quickly, however, if you reject rather than drop those connection attempts, which I'm letting the server do).

Radius, udp 1645,1646 Francois Baligant 2000-05-10

We have a very weird problem load-balancing UDP-based RADIUS packets.

    195.74.212.26:16450 Route 1 0 0
    -> 195.74.212.34:16450 Route 1 0 0
    UDP 195.74.212.31:1646 wlc
    -> 195.74.212.26:1646 Route 1 0 106
    -> 195.74.212.10:1646 Route 1 0 106
    UDP 195.74.212.31:1645 wlc
    -> 195.74.212.26:1645 Route 1 0 1
    -> 195.74.212.10:1645 Route 1 0 0

I have a series of NAS (Network Access Servers) sending Authentication Requests to a single central Proxy Radius server (packets sometimes arrive at 5 packets/sec). This Proxy Radius Server then forwards the Authentication Requests to the load-balancer, which should normally dispatch them to several nodes for processing (check with DB etc.). We want to load-balance 3 ports: 1645 (authentication), 1646 (accounting) and 16450 (authentication for another kind of service). The rule for port 1646 loadbalances. However for rules 16450 and 1645, all UDP requests go to only one realserver. (Rule 16450 is not used at the moment; 1645 is. You can see the strange little "1" for 195.74.212.26.) What's weird is that 1646 works really fine but the 2 other rules just do not load-balance. Packets are always sent to the same host (in fact the first that was added to the VS IP).

Joe Someone had a similar sounding problem with udp ntp. All packets would go to one host and then after a little while to another. In the short term the load balancing was bad, but over the long term (>15mins) the loadbalancing was fine. The udp LVS code sends all udp packets from a client to one realserver until a timeout is reached, and then sends the next packets to another realserver. (See also Scheduling TCP/UDP.)

Julian Single Radius Server? Does that mean that all packets come from a single IP:port too? Don't forget that for UDP the autobind ports are not rotated. For TCP you have ports selected in the 1024..4999 range, but it is possible for all your client UDP packets to come from the same port on the client. This can be a good reason for them to be redirected to the same realserver if the UDP entry has not expired. Show a tcpdump session or try to set the UDP timeout to a small value. Any difference? How many clients (UDP sockets) do you have? If you have one, it can't be balanced. There is a persistency according to the default UDP timeout value.
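Julian's point can be illustrated with a toy connection table. The Python sketch below is a model under stated assumptions rather than the kernel code: entries are keyed on (client IP, client port), every packet refreshes the entry's timer, and new entries are handed out round-robin. The class and realserver names are hypothetical; the client addresses and ports are taken from the example in this section.

```python
class UdpDirector:
    """Toy model of UDP forwarding with a per-source connection entry."""

    def __init__(self, realservers, timeout):
        self.realservers = list(realservers)
        self.timeout = timeout   # seconds an idle entry survives
        self.next_index = 0
        self.entries = {}        # (client IP, client port) -> (realserver, expiry)

    def forward(self, now, client_ip, client_port):
        key = (client_ip, client_port)
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            server = entry[0]                      # live entry: same realserver
        else:
            server = self.realservers[self.next_index % len(self.realservers)]
            self.next_index += 1                   # expired or new: reschedule
        self.entries[key] = (server, now + self.timeout)  # packet refreshes timer
        return server


if __name__ == "__main__":
    director = UdpDirector(["RS1", "RS2"], timeout=300)
    # One source port, a packet every second: the entry never expires.
    print([director.forward(t, "195.74.193.40", 60774) for t in range(5)])
    # A new source port per request: each request is scheduled afresh.
    ports = (40190, 40223, 40346, 40381)
    print([director.forward(10, "195.74.193.40", p) for p in ports])
```

With a steady stream from a single IP:port the entry is refreshed before it can expire, so one realserver receives everything; this is why a single proxy with a fixed source port cannot be balanced.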

    195.74.212.31.1645: udp 244 (DF)
    14:06:36.277205 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)
    14:06:36.430549 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)
    14:06:36.430575 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)
    14:06:36.639869 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)
    14:06:36.639894 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)
    14:06:38.040246 195.74.193.40.60774 > 195.74.212.31.1645: udp 246 (DF)
    14:06:38.040276 195.74.193.40.60774 > 195.74.212.31.1645: udp 246 (DF)
    14:06:38.117694 195.74.193.40.60774 > 195.74.212.31.1645: udp 243 (DF)
    14:06:49.899222 195.74.193.40.40190 > 195.74.212.31.1646: udp 349 (DF)
    14:06:49.899256 195.74.193.40.40190 > 195.74.212.31.1646: udp 349 (DF)
    14:06:50.358085 195.74.193.40.40223 > 195.74.212.31.1646: udp 349 (DF)
    14:06:50.358114 195.74.193.40.40223 > 195.74.212.31.1646: udp 349 (DF)
    14:06:51.494628 195.74.193.40.40346 > 195.74.212.31.1646: udp 349 (DF)
    14:06:51.494656 195.74.193.40.40346 > 195.74.212.31.1646: udp 349 (DF)
    14:06:51.810022 195.74.193.40.40381 > 195.74.212.31.1646: udp 349 (DF)
    14:06:51.810051 195.74.193.40.40381 > 195.74.212.31.1646: udp 349 (DF)
    14:06:52.351541 195.74.193.40.40485 > 195.74.212.31.1646: udp 199 (DF)

I think you just helped me to understand what the problem was. Port 1645 is not loadbalancing. I will patch the radius to increase the port number for accounting request too.
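The pattern in the tcpdump output above (one source port for 1645, a fresh source port per request for 1646) can be pulled out mechanically. This Python sketch counts distinct source IP:port pairs per destination port; the regular expression assumes the "src.port > dst.port: udp" line format shown above, and the function name is made up for the example.

```python
import re
from collections import defaultdict

# Matches e.g. "195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)"
LINE = re.compile(
    r"(\d+\.\d+\.\d+\.\d+)\.(\d+) > (\d+\.\d+\.\d+\.\d+)\.(\d+): udp"
)


def source_ports_by_service(lines):
    """Return {destination port: number of distinct source IP:port pairs}."""
    sources = defaultdict(set)
    for line in lines:
        match = LINE.search(line)
        if match:
            src_ip, src_port, _dst_ip, dst_port = match.groups()
            sources[int(dst_port)].add((src_ip, int(src_port)))
    return {dport: len(seen) for dport, seen in sources.items()}


SAMPLE = [
    "14:06:36.277205 195.74.193.40.60774 > 195.74.212.31.1645: udp 244 (DF)",
    "14:06:38.040246 195.74.193.40.60774 > 195.74.212.31.1645: udp 246 (DF)",
    "14:06:49.899222 195.74.193.40.40190 > 195.74.212.31.1646: udp 349 (DF)",
    "14:06:50.358085 195.74.193.40.40223 > 195.74.212.31.1646: udp 349 (DF)",
    "14:06:51.494628 195.74.193.40.40346 > 195.74.212.31.1646: udp 349 (DF)",
]

if __name__ == "__main__":
    print(source_ports_by_service(SAMPLE))
```

A service whose count stays at 1 has a single connection entry on the director and so will stick to one realserver until that entry times out.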

LVS: Services that we haven't got to work with LVS yet You may get some hints at .

Kerberos

Kerberos is a secure authentication protocol. Many ports are involved, making it difficult to set up firewalls: e.g. Configuring your firewall to work with Kerberos. Further down this section, someone is ssh tunneling by LVS forwarding ssh to the realservers which use kerberos for authentication. In this case kerberos is just a login protocol that has nothing to do with LVS. There should only be one kerberos (ticket) server in your realm. You shouldn't LVS kerberos servers. Kerberos is a login protocol. There's no more reason to LVS kerberos than there is to LVS login. However you can have the client log in to kerberos'ed realservers.

several people

I'd be interested to know if LVS can be used and setup for silent login(no password prompting, i.e. using ticket forwarding) using ssh and kerberos.

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 07 Mar 2006 This is not a good idea. The kerberos replication system only permits one active admin server, so there is no opportunity for load balancing of the admin function. You shouldn't try to fool it by using LVS. You'll likely screw up the replication. What you should do instead is to list multiple Kerberos key distribution centers (KDCs) in your krb5.conf. Take a look at that file and notice the section under [realms]. All you need to do is list multiple kdc entries there. By specifying multiple KDC servers you will be getting the behavior you really want: in case you lose a server for a little while, things will just keep on chuggin' in your network. Don't get me wrong - LVS is a great tool, but it's not the answer to every problem of service redundancy.

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 02 Mar 2006 If you kerberize your host - that is to say, you have stuff like your ssh client using kerberos-compatible versions, and your authentication happening against kerberos - then kerberos works something like this: I sit down at my favorite Linux workstation and open a shell so I can ssh to some other host on my network. Assuming I have not done so recently, the host I'm trying to reach won't be able to verify me. Instead, both my local host and the target host will rely on an authentication server to generate a new encryption key and distribute it to both parties. This is the session key. A Kerberos ticket is used to distribute the session key, and includes info about me / my host that will be used by the target host to verify my connection. When the server passes this back to me, I forward it on to the target host as part of my authentication request. So, the ticket will be encrypted in a server key, which is known only by the server and the target host. Nifty, huh? The ticket-granting-ticket just extends this to make life a little easier. We assume that some period of time is an acceptable amount for ticket granting without requiring the user to type in a password every time a ticket is needed. In short, I authenticate myself once to the server, and it allows me to perform any number of permitted authentications during the allowed time period. The ports used are likely going to be 88 for the kdc and possibly 749 for the admin server.

Karen Shepelak shepelak (at) fnal (dot) gov 04 Feb 2005

We are trying to get kerberos to work with LVS. We're ssh tunnelling to the realservers in an LVS forwarding ssh. We can kerberos through the ssh tunnel when connecting directly to the realservers, but not when we ssh to VIP:ssh. We get the following errors.

Horms xauth isn't working. You should probably just turn off xforwarding in your sshd config rather than make xauth work. Your user doesn't have permission to access /afs/fnal.gov/files/home/room3/shepelak/.Info

RMI

Joe: multiport protocols are difficult to loadbalance under LVS.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) indexmultimedia (dot) com 03/10/2006 RMI is NOT a plain TCP protocol. It is a subprotocol that is similar to FTP: a transaction starts on the standard RMI port, and then there is a dynamic port negotiation. There is no way to make RMI load balanced, just as there is no way to make it go through a firewall... If you look at google:RMI+Firewall, you will find relevant documents about port negotiation and all the (bad) ways to handle this (bad) idea that is RMI.

Mr. CBoy cboy168 (at) gmail (dot) com 10 Mar 2006

Currently I am using LVS-NAT with 2 real JBoss servers. In my test environment I have 1 client that spawns hundreds of threads that will invoke methods through RMI. The flow is like this: The client sends a request to the VIP for a naming proxy, which gets forwarded to RS1. RS1 will then push the naming proxy back to the client, masqueraded by the director. The client will ask for the stub on the remote object so that it can invoke methods, and it will be returned. Now the client has the stub and can start to invoke methods, but here lies the problem. The stub came from RS1, and when the client calls a method on the stub, it gets load balanced to RS2 and becomes invalid. In a real world example, turning on persistence would remedy this; however in a test environment, I'm limited by hardware. Is there any way to get threads on the same client to talk to the same realserver? The client code is something simple. All the underlying connection details are handled by Sun's RMI, so I'm unable to keep 1 connection for each thread myself, and my only guess is to figure out a way for LVS to handle it.

LVS: UDP Services - unique problems

LVS has been able to schedule and forward UDP packets from the very beginning. Most of our users have been load balancing TCP services and we've glibly assumed that all our TCP knowledge and coding applied to UDP as well. However, while TCP connections are all alike (they start with SYN and end with FIN), a service using UDP has more flexibility. ntp requires that the client send packets to, and receive packets from, the same realserver forever. Some services only send one packet as the sum total of information transfer; there will never be any information transfer in the reverse direction. SIP is NAT'ed, and it's unlikely that SIP will work out of the box with LVS. A single scheduler is not going to work for all of these. Little progress has been made for UDP over LVS because few people are using UDP services with LVS. Most UDP services of interest (ntp, dns) have their own inbuilt loadbalancing and so there has been no pressing requirement for LVS/UDP, like there has been for LVS/TCP.

SIP (Session Initiation Protocol)

SIP is an all UDP protocol for VoIP (voice over IP) telephony. It has lots of ports and is a bit complicated for LVS (no-one has it working under LVS). Asterisk has its own load balancer (see below). Currently we suggest that you try that first.

Joao Filipe Placido, Jul 19, 2005

I have read some posts about a SIP module for lvs. Is this available? Does it do SIP dialog persistence?

Horms Unfortunately it is not available as the project was halted before it became ready for release. I'm not entirely sure what SIP dialog persistence is, but in a nutshell all my design did was to provide persistence based on the Caller-ID, rather than the client's address and port as LVS usually does.

Mike Machado mmachado (at) o1 (dot) com 20 Aug 2004

I have an LVS-DR setup with two realservers. My service is a voip application, SIP specifically. I am trying to balance requests between the two. I have all the LVS stuff set up, and was able to get the telnet test to work properly. With UDP though, there seems to be a problem. When the application is forming its reply, it uses the realserver IP as the source IP, instead of the VIP, as it does with the telnet test. I assume this is because UDP is stateless. I tried to SNAT the packets back to the correct IP, but you cannot SNAT locally generated packets. I was able to change my voip application to just BIND to the VIP, but due to the nature of this application, it needs to be able to communicate on both the VIP and the RIP. I just want reply packets to use the same source IP as the inbound packets. Has anyone come across this problem for UDP applications, along with a possible solution?

Julian What about using IP_PKTINFO in sendmsg, where "srcip" is your server IP used in each request packet as daddr? See e.g. this example (http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/0771.html) and thread (http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/index.html#0247).

Horms Fixing the application to send reply packets from the addresses that they were received on is the best solution IMHO. It is the way I have resolved this problem in the past. The alternative would be to use LVS-NAT instead.

Erik Versaevel erik (at) infopact (dot) nl 13 Jan 2005

I'm currently trying to create a loadbalancing SIP (voip protocol) cluster. For this to work, I need SIP messages from the same call (identifiable by the SIP Call-ID field) to get to the same realserver over and over again (so I need persistence based on the contents of the SIP Call-ID field). This would call for ktcpvs, as we need to process packets at layer 7. However that poses 2 new problems: the first is that SIP uses clear text UDP messages, not tcp, and the second is that there are no SIP modules for ktcpvs. Another option would be to mark SIP packets with iptables/netfilter based on the Call-ID, but I run into the same problem: there are no modules to accomplish this. I know that there are commercial products available which are able to do SIP session persistence based on Call-ID, the F5 Big-IP for example. The downside is that it costs around $10,000 for a single loadbalancer (which is a SPOF, so you need 2) and is a bit overkill, as I don't need multi-gigabit loadbalancing. High persistence won't work because reply packets from another SIP source might be balanced to the wrong server, i.e. packets from 192.168.0.2 might be balanced to realserver 1, which sends its reply directly to 192.168.0.3 (the end point for the call), but the replies from 192.168.0.3 might end up at another realserver. (Be aware I'm using direct routing because of NAT traversal.) Using TCP SIP would only solve half the problem (and cause some more). Answers to requests could still end up at the wrong server (but one would only have to write a module for that, and not a kudpvs) and not all clients support TCP based SIP.

Wensong Zhang wensong (at) linux-vs (dot) org 17 Jan 2005 We cannot use ktcpvs, because ktcpvs supports TCP only, and there are no SIP modules for TCP transport. The firewall marking doesn't solve the problem. However, we can write a special SIP UDP scheduling module for IPVS. It can detect the Call-ID from the UDP packet and send it to the SIP server according to the recorded Call-ID table. We assume that there are no UDP fragments. Some NAT boxes (such as early IOS versions of Cisco routers) may drop UDP fragments except the first one.

Erik Versaevel

Such a module would definitely solve the problem. A round robin with Call-ID persistence would be awesome. Currently a device which can do that costs around $10,000 each (times 2 for HA).
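The scheduling idea discussed in this exchange (round robin with Call-ID persistence) can be sketched in user space. The Python below is only an illustration of the table lookup, not the IPVS module Wensong proposes: it assumes each SIP message arrives in a single unfragmented datagram, accepts the full "Call-ID" header and its compact form "i", and never expires table entries. All names are invented for the example.

```python
import itertools
import re

# "Call-ID: value" or the compact form "i: value" at the start of a header line.
CALL_ID = re.compile(rb"^(?:Call-ID|i)[ \t]*:[ \t]*(\S+)",
                     re.IGNORECASE | re.MULTILINE)


def call_id(datagram):
    """Return the Call-ID of a SIP message, or None if no such header is found."""
    match = CALL_ID.search(datagram)
    return match.group(1).decode("ascii", "replace") if match else None


class CallIdScheduler:
    """Round robin for new calls; known Call-IDs stay on their realserver."""

    def __init__(self, realservers):
        self._rr = itertools.cycle(realservers)
        self.table = {}  # Call-ID -> realserver (a real module would expire these)

    def schedule(self, datagram):
        cid = call_id(datagram)
        if cid is None:
            return next(self._rr)  # not parseable as SIP: plain round robin
        if cid not in self.table:
            self.table[cid] = next(self._rr)
        return self.table[cid]


if __name__ == "__main__":
    scheduler = CallIdScheduler(["RS1", "RS2"])
    invite = b"INVITE sip:bob@example.com SIP/2.0\r\nCall-ID: abc123@host\r\n\r\n"
    reply = b"SIP/2.0 200 OK\r\nCall-ID: abc123@host\r\n\r\n"
    print(scheduler.schedule(invite), scheduler.schedule(reply))
```

Because the lookup key is the Call-ID rather than the source address, a reply arriving from the other end point of the call reaches the same realserver as the original request, which is the case that address-based persistence gets wrong.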

Malcolm Turnbull wrote:

If I wanted to test load balancing SIP using standard LVS UDP, is there an OpenSource or Commercial Free Server to test against?

Joe AFAIK you can do all of VOIP PBX on Linux with Asterisk. The handsets (phones) are more of a problem. I believe you can get a linux box with a sound card and mic/headphones to be a phone. There was a large purchase of VOIP phones a while ago, that someone got to run linux, but these have all gone. You'll probably need to buy a real VOIP phone to test with.

Curt Moore tgrman21 (at) gmail (dot) com 11 Feb 2005 I believe that Wensong has the right idea here. Although SIP does support TCP, the vast majority of SIP endpoints only support UDP. Asterisk can be used for a SIP application server but it's not geared/written to be a SIP proxy. For this you need something like SER, SIP Express Router. When correctly configured, SER acting as a SIP proxy does support the distributing of calls to Asterisk boxes acting as media/application servers. The issue becomes how to load balance/distribute calls to multiple SER boxes, based on SIP Call-ID, so that the same SIP Call-ID always goes to the same SER box. Once you've statefully routed a call, based on Call-ID, to a particular SER box, SER can take over and ensure that things go to the correct Asterisk media/application server or endpoint based on its routing configuration. The director nodes just need to be smart enough to send the right call to the right SIP proxy residing in the LVS cluster. As far as NAT goes, I've found through lots of experience that you'll never be able to penetrate every NAT implementation out there when it comes to SIP/UDP. You can only hope to get 99% of them, as many of them don't fully conform to the RFC.

Gerry Reno greno (at) verizon (dot) net 16 May 2008

Ok, I finished setting up some pbx (asterisk). Can I use LVS to load balance the call traffic between multiple pbx's? Or with SIP protocol is it necessary to use OpenSER?

Graeme You probably can, but given the nature of SIP - two transport protocols, multi-port, session based - it could get very complicated. You could definitely sort out the main ports - TCP/UDP port 5060 - trivially; but the follow-on complication is how you then track the session traffic, which can wander around all over the place (cf. the LVS FTP helper). I'd strongly recommend you have a good read of the Asterisk mailing list - it seems that there are several app-based load balancing schemes for Asterisk, and if they do what you need, I'd use them.

Morgan Fainberg morgan (at) mediatemple (dot) net 17 May 2008 In theory, you could use a FWM (firewall mark) setup and persistent connections. If you map the virtual server group to use the same FWM for the TCP (SIP uses TCP port 5060) and UDP (RTP is usually configured for UDP ports 16384-32767) datastreams, it should work. However, the application-based load-balancing in Asterisk does function fairly well and you might end up with a better solution. Typically, with load-balancing I find that the more complexity you add, the harder it is to debug when things go awry.

Gerry

I think the fwmark approach might work. And I like this since load-balancing with LVS is better for me because I have all my other services on it. I'm keeping all traffic going through the Asterisk box with canreinvite=no. canreinvite=yes would present a further scenario as the endpoints would then end up in direct communication for RTP. You'll have to excuse me if I've oversimplified this. I have not used fwmarks before. So let's see, I'm using keepalived so in the conf I guess I would have something like:

    -j MARK --set-mark 1 # route back to director

Morgan Those look reasonable; however, you will probably not want to separate the SIP and RTP traffic. It would make more sense to use two iptables rules that set the same firewall mark. You can set as many iptables rules as the system can handle to assign a given firewall mark. Any traffic (regardless of port/type) can be balanced with the FWM. FWM is (as you can see in the ipvsadm man-page) its own service type. Instead of specifying --tcp-service or --udp-service you specify --fwmark-service. Given that I use Keepalived rather than the other methods, it is slightly different from making direct calls with ipvsadm. In short, there is no need to have separate VIPs for SIP and RTP unless you have different servers handling SIP traffic. It would probably look something more like this:

I've not used FWM+NAT in a good long while. You probably don't need to set the firewall mark on the realservers, as the firewall mark (I believe) doesn't stay with the packet once it leaves the local networking stack (i.e. it is not sent out on the wire). So unless the system needs to do something specific with the firewall mark (e.g. an iprule to policy-route to the director), the firewall mark will not need to be set on the realserver. A DR configuration should work almost identically; however, I've not done UDP in a DR configuration (always NAT). A standard DR configuration ~should~ function for an Asterisk setup like this.

Gerry

Yes, of course, I need to keep the SIP and RTP together since I'm not using a separate SIP server. So now if we use ARA we should have a good extensible solution. To me this seems like it might be better than OpenSER because with OpenSER you have a SPOF whereas with keepalived/LVS you have a more robust solution. My setup is LVS-DR so I need to think about whether the direct return route is going to create any problems. Otherwise, the only thing lacking in this picture is that FreePBX does not support ARA :-(

later...Gerry Reno greno (at) verizon (dot) net 26 Dec 2008 Actually, I abandoned plans for LVS+Asterisk. We just beefed up our recovery techniques and made sure everyone knew what to do if Asterisk crashed or hung on us. Yes, we lose calls and it's a pain but we live with it right now.

UDP timeouts (SIP)

Benjamin Lawetz blawetz (at) teliphone (dot) ca 30 Jun 2005

I have a setup that load balances SIP UDP packets between 4 servers. Today one of my servers failed and mon removed it from the load-balancing, but some of the connections still remain and keep getting refreshed. I noticed something bizarre with the ipvsadm UDP timeout: it is set to 35s, but the 3 connections that "stay stuck" seem to have an expire of 60 seconds instead of 35. Even though the server has been removed, and the rest of the connections time out and get redirected to another server, those 3 just keep going to the failed server. Does anyone have any idea why these 3 connections are (so it seems) auto-refreshing every 60s on a server that doesn't exist? Is there any way to clear these?

Ipvsadm startup script:

    echo 1 > /proc/sys/net/ipv4/ip_forward
    /sbin/ipvsadm --set 0 0 35
    /sbin/ipvsadm -A -u 192.168.89.220:5060 -s rr
    /sbin/ipvsadm -a -u 192.168.89.220:5060 -r 192.168.89.231 -i -w 5
    /sbin/ipvsadm -a -u 192.168.89.220:5060 -r 192.168.89.232 -i -w 5
    /sbin/ipvsadm -a -u 192.168.89.220:5060 -r 192.168.89.233 -i -w 5
    /sbin/ipvsadm -a -u 192.168.89.220:5060 -r 192.168.89.234 -i -w 5

Before removal this is what I have:

    RemoteAddress:Port          Forward Weight ActiveConn InActConn
    UDP  192.168.89.220:5060 rr
      -> 192.168.89.233:5060    Tunnel  5      0          4
      -> 192.168.89.231:5060    Tunnel  5      0          3
      -> 192.168.89.232:5060    Tunnel  5      0          3
      -> 192.168.89.234:5060    Tunnel  5      0          6

Ipvsadm -L -n -c:

    IPVS connection entries
    pro expire state source             virtual             destination
    UDP 00:30  UDP   10.10.250.209:5060 192.168.89.220:5060 192.168.89.231:5060
    UDP 00:25  UDP   192.168.85.25:1035 192.168.89.220:5060 192.168.89.233:5060
    UDP 00:34  UDP   10.10.125.23:5060  192.168.89.220:5060 192.168.89.232:5060
    UDP 00:26  UDP   206.55.81.128:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:22  UDP   10.10.233.189:5060 192.168.89.220:5060 192.168.89.232:5060
    UDP 00:32  UDP   192.168.85.25:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:21  UDP   10.10.142.148:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:20  UDP   192.168.85.25:1036 192.168.89.220:5060 192.168.89.232:5060
    UDP 00:34  UDP   10.10.249.211:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:23  UDP   10.10.249.211:1024 192.168.89.220:5060 192.168.89.233:5060
    UDP 00:31  UDP   10.10.208.97:5060  192.168.89.220:5060 192.168.89.234:5060
    UDP 00:32  UDP   10.10.184.83:5060  192.168.89.220:5060 192.168.89.231:5060
    UDP 00:27  UDP   70.80.53.141:5060  192.168.89.220:5060 192.168.89.233:5060
    UDP 00:40  UDP   192.168.85.5:58040 192.168.89.220:5060 192.168.89.233:5060
    UDP 00:28  UDP   10.10.14.83:5060   192.168.89.220:5060 192.168.89.234:5060
    UDP 00:25  UDP   10.10.215.175:6084 192.168.89.220:5060 192.168.89.231:5060

After removal I have:

    RemoteAddress:Port          Forward Weight ActiveConn InActConn
    UDP  192.168.89.220:5060 rr
      -> 192.168.89.231:5060    Tunnel  5      0          3
      -> 192.168.89.232:5060    Tunnel  5      0          3
      -> 192.168.89.234:5060    Tunnel  5      0          6

Ipvsadm -L -n -c:

    IPVS connection entries
    pro expire state source             virtual             destination
    UDP 00:28  UDP   10.10.250.209:5060 192.168.89.220:5060 192.168.89.231:5060
    UDP 00:31  UDP   192.168.85.25:1035 192.168.89.220:5060 192.168.89.231:5060
    UDP 00:31  UDP   10.10.125.23:5060  192.168.89.220:5060 192.168.89.232:5060
    UDP 00:31  UDP   10.10.81.128:5060  192.168.89.220:5060 192.168.89.234:5060
    UDP 00:29  UDP   10.10.233.189:5060 192.168.89.220:5060 192.168.89.232:5060
    UDP 00:27  UDP   192.168.85.25:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:24  UDP   10.10.142.148:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:24  UDP   192.168.85.25:1036 192.168.89.220:5060 192.168.89.232:5060
    UDP 00:33  UDP   10.10.249.211:5060 192.168.89.220:5060 192.168.89.234:5060
    UDP 00:40  UDP   10.10.249.211:1024 192.168.89.220:5060 192.168.89.233:5060
    UDP 00:11  UDP   10.10.208.97:5060  192.168.89.220:5060 192.168.89.234:5060
    UDP 00:57  UDP   10.10.53.141:5060  192.168.89.220:5060 192.168.89.233:5060
    UDP 00:57  UDP   192.168.85.5:58040 192.168.89.220:5060 192.168.89.233:5060
    UDP 00:33  UDP   10.10.14.83:5060   192.168.89.220:5060 192.168.89.234:5060
    UDP 00:33  UDP   10.10.215.175:6084 192.168.89.220:5060 192.168.89.231:5060

Sorry to answer myself, but I have narrowed down the problem. I sniffed the traffic on the load balancer coming from one of the "stuck" IPs and I'm getting no traffic whatsoever. So basically, while getting no traffic coming in, having no destination to go to and having a timeout of 35 seconds, the connection entry counts down from 59 seconds to 0 seconds and loops back to 59 seconds again. Is there a way to remove these connections? Does anyone have any idea why they are looping when no traffic is coming in?

Horms Which kernel do you have? This sounds like a counter bug that was resolved recently. However I couldn't convince myself that it manifests in 2.4 (I looked at 2.4.27). http://archive.linuxvirtualserver.org/html/lvs-users/2005-05/msg00043.html Also, I may as well mention this as it has been floating around: http://oss.sgi.com/archives/netdev/2005-06/msg00564.html

Marcos Hack marcoshack (at) gmail (dot) com 20 Dec 2005

I'm using LVS to load balance SIP UDP connections, and when a virtual service fails on a real server the director doesn't clean up the UDP connections in the connection table. The solution seems to be just setting /proc/sys/net/ipv4/vs/expire_nodest_conn=1 and changing keepalived to remove the virtual service on failure instead of using "inhibit_on_failure" (set weight to 0).

Michael Pfeuffer wq5c (at) texas (dot) net 18 Nov 2008

I've got an LVS-NAT configuration that works for HTTP traffic, but SIP UDP traffic is not being load balanced (and all SIP packets come from the same source address) - it always goes to the 1st RIP on the list. The services are all configured the same except for the checktype.

Graeme The same source address would explain it. You could artificially reduce the UDP protocol timeout for testing (see man ipvsadm for info), but you can't make it less than 1 second. Also, because UDP is stateless, a "session" is viewed as traffic arriving from $host with $source_port to a given VIP/Port within the UDP timeout detailed above. This causes problems with SIP signalling data in some cases, because it has a tendency to be sourced from port 5060 to port 5060, and is quite regular. If you have a larger spread of clients, over time things will become roughly balanced according to your RS weights.
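Graeme's suggestion can be sketched with ipvsadm --set, which takes the tcp, tcpfin and udp timeouts in seconds (a value of 0 leaves that timeout unchanged). This is only a sketch and not part of the original posting: the 5 second value and the run helper are assumptions made for illustration, and the commands are printed rather than executed unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root and ipvsadm installed).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# ipvsadm --set takes three timeouts in seconds: tcp, tcpfin, udp.
# 0 keeps the current value, so only the UDP timeout is changed here.
set_udp_timeout() {
    if [ "$1" -lt 1 ]; then
        echo "UDP timeout must be at least 1 second" >&2
        return 1
    fi
    run ipvsadm --set 0 0 "$1"
}

set_udp_timeout 5          # hypothetical short value, for testing only
run ipvsadm -L --timeout   # list the tcp, tcpfin and udp timeouts
```

With a short timeout the connection entries for a client that always uses the same source port expire sooner, so the scheduler gets consulted more often; remember to restore the old value after testing.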

UDP timeouts (DNS)

The sysctl expire_quiescent_template seems to be useful in several situations. We don't find out till the end that Adrian is using persistence.

Adrian Chapela achapela (dot) rexistros (at) gmail (dot) com 05 Mar 2007

I have two directors in a high availability configuration. Everything is OK for TCP, but not for UDP. With UDP the load balancing is OK, but failover doesn't happen. If a client is "redirected" to one realserver, its connections are always sent to that server, even if the server goes down. I'm not using DNS's built-in redundancy; I use a failover mechanism with LVS to get high availability for my DNS servers. Today I changed the virtual server configuration for my DNS servers. Before, I had two configurations for the "same" service, one for UDP and another for TCP, since a DNS service can use either. I configured my DNS servers to answer queries on both protocols. When keepalived serves DNS over both UDP and TCP, failover is OK, but when I configure it to serve UDP only, failover doesn't happen. My UDP health check does the right thing (I think...). The checks correctly recognize when a server goes down and it no longer appears in the list, but packets are still sent to that server many minutes later, and then they go to "limbo" (/dev/null I think...). I don't know what is happening, but I had a similar problem in another situation, with a firewall built with Shorewall. I changed the rules in the firewall (one port for another) and the packets were still sent to the first port. I rebooted the machine and all was OK. This never happens with TCP.

Later - apparently after help from Graeme

The solution was to set /proc/sys/net/ipv4/vs/expire_nodest_conn=1. With this set, when a server is removed from the pool its 'established' connections are removed too; before, the connections waited for the protocol timeout, which for UDP is too long. The other important variable is /proc/sys/net/ipv4/vs/expire_quiescent_template. For now I don't use it.

Simon Pearce sp (at) http (dot) net 27 Nov 2006

I am running a DNS cluster (Gentoo) with two directors active/active and 4 realservers running powerdns. Each server has a 3GHz Pentium 4 and 1GB of RAM. I have about 250 VIPs. I could do it all with one VIP of course, but quite a few of our customers require their own DNS servers with their own IP address. A lot of them don't really need it, but it looks good to them. Every time the DNS cluster exceeds a certain limit some of the IP addresses stop working properly. The effect is that for certain domains you get a timeout when querying the cluster. Some of the transferred IPs seem to stop working or slow down to such an extent that other DNS servers stop querying us. Load average is 1-2. Even though queries don't get through the director (reply in 4000ms), the realservers answer direct requests. The only iptables rule is on the director, to masquerade outgoing calls to the internet.

Joe: Is the problem load or the number of IPs (if you can tell)? There is another problem with failover of large numbers of IPs, just in case you want to read more on the topic (it may not be related to your problem). http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.failover.html#1024_failover Can you set up ipvsadm with a single fwmark instead of all the IPs? That would shift the responsibility for handling all the IPs to iptables, rather than ipvsadm.

Graeme Fowler graeme (at) graemef (dot) net 27 Nov 2006

I know it was LVS-DR, and that it didn't have 250+ IP addresses, but the DNS system I built for my previous employer used LVS with keepalived.
The last time I had access to the statistics, it was running at something like 1200 queries/sec (which will have risen now by something like 25% if memory serves), 99% of which were UDP, without a glitch. However - as Joe mentioned - I built it to balance on fwmarks, not on TCP or UDP. Incoming packets were marked in the netfilter 'mangle' table according to protocol and port, and the LVS was then built up from the corresponding fwmarks. There was one network "race" we never got to the bottom of, which has affected the system once or twice since I left, where an unmarked packet somehow slipped through to the "inside" (i.e. realserver-facing rather than client-facing) LAN and then caused massive traffic amplification. That however isn't related in any way to the OP's problem.

Wayne wayne (at) compute-aid.com, who has a financial interest in (works for?) Webmux, posted that only Webmux has solved this problem, but didn't give any details (Simon has now solved it). At the moment this stands as an uncorroborated statement by Wayne. Presumably the 53-tcp/udp replies and calls to forwarders (from the RIP?) were being correctly nat'ed. A solution was posted by Graeme Fowler where the IPs are fwmark'ed and the LVS balances the fwmark, but this was for a smaller number of IPs and Graeme didn't know if it would work for 250 IPs.

Simon Pearce sp (at) http (dot) net 6 Apr 2007

Some of you on the list might remember my problem concerning our DNS cluster last year. http://archive.linuxvirtualserver.org/html/lvs-users/2006-11/msg00278.html These problems (DNS timeouts) have continued throughout this year and I have been desperately trying to find the solution. I have been following the mailing list and stumbled over the problems Adrian Chapela was having with his DNS setup, which brought me to the solution. ipvsadm -L --timeout showed that the default timeout for UDP packets was set to 500 seconds, which is way too long: the load balancers were waiting for 5 minutes to time out a UDP packet, and I get about 1500 queries a second. I changed the setting to 15 seconds last week and moved some of our old windows/bind DNS servers to the new linux DNS cluster. Before I changed the timeout settings I always received a call from our customers within two hours: "your DNS services are not responding correctly". The IPs that refused to answer would always change. I have 254 IPs. Some of the large German dialup providers would refuse to talk to us, which resulted in domains not being reachable. Our DNS cluster is authoritative for about 250000 domains, so you can imagine how many complaints I received. I was about to give up and scrap keepalived; I am so glad I did not. Changing the timeout value solved my problems and I am a happy man at the moment. Is there a way to set the timeout value permanently so it is saved after a reboot of the server? One last thing I would like to say is a big thank you to Graeme Fowler, Horms, Adrian Chapela and Alexandre Cassen for writing this great piece of software, and to anyone else on the list who contributed to help me finally find the solution. Thank you guys, you do a great job on the mailing list.

horms 8 May 2007

Glad to hear that you got to the bottom of your problem. I am a little concerned about the idea of reducing UDP timeouts significantly because, to be quite frank, UDP load-balancing is a bit of a hack. The problem lies in the connectionless nature of the protocol, so naturally LVS has a devil of a time tracking UDP "connections" - that is, a series of datagrams between a client and server that are really part of what would be a connection if TCP was being used. As UDP doesn't really have any state, all LVS can do to identify such "connections" is to set up affinity based on the source and destination ip and port tuples.
If my memory serves me correctly, DNS quite often originates from port 53, and so if you are getting lots of requests from the same DNS server then this affinity heuristic breaks down. The trouble is that if the timeout is significantly reduced, the probability of it breaking down the other way - in the case where that affinity is correct - increases. I'm not saying that you don't have a good case. Nor am I saying that changing the default timeout is off-limits. Just that what exactly is a good default timeout is a tricky question, because what works well in some cases will not work well in others, and vice versa. To some extent I wonder if the userspace tools should have the smarts to change the timeout if port 53 (DNS) is in use. Though that may be an even worse heuristic. I wonder if a better idea might be the one packet scheduling patches by Julian http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00214.html. Much to my surprise these aren't merged. Perhaps that's my fault. I should look into it. I also wonder, if the problem relates to connection entries for servers that have been quiesced, does setting expire_quiescent_template help?

    /proc/sys/net/ipv4/vs/expire_quiescent_template

Rudd, Michael Michael (dot) Rudd (at) tekelec (dot) com 8 May 2007

My 2 cents in dealing with DNS and your idea of the OPS feature. I have implemented the OPS feature in the 2.6 kernel and it's running well. Without that feature, we wound up having all the DNS queries from our DNS client sent to the same realserver. The problem we did run into, which I've gotten help from the community on, is that when using LVS-NAT, the source packet isn't SNAT'd. This is because LVS doesn't know that the outgoing packet is an LVS packet, so it just forwards it out. I fixed this with an iptables rule to SNAT it myself. Just an FYI if you ever choose to use OPS with LVS-NAT.

Horms Mmm, I guess OPS isn't quite the right solution to the DNS problem :(
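The two sysctls discussed in this section can be sketched as follows. The key names are the /proc/sys/net/ipv4/vs/ entries given above written in sysctl form; the run helper and the sysctl.conf lines are assumptions of this sketch, and the entries only exist once the ip_vs module is loaded. The postings do not say how Simon made his UDP timeout survive a reboot; one plausible approach is to put the ipvsadm --set command into a boot-time script such as Benjamin Lawetz's startup script shown earlier.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root and the ip_vs module).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Expire a connection entry as soon as a packet arrives for a realserver
# that is no longer available, instead of waiting for the protocol timeout.
enable_expire_nodest_conn() {
    run sysctl -w net.ipv4.vs.expire_nodest_conn=1
}

# Expire persistence templates that point to a quiesced (weight 0) realserver.
enable_expire_quiescent_template() {
    run sysctl -w net.ipv4.vs.expire_quiescent_template=1
}

# Lines that could be appended to /etc/sysctl.conf to keep the settings
# across reboots (assumes ip_vs is loaded before sysctl.conf is applied).
sysctl_conf_lines() {
    printf '%s\n' \
        'net.ipv4.vs.expire_nodest_conn = 1' \
        'net.ipv4.vs.expire_quiescent_template = 1'
}

enable_expire_nodest_conn
enable_expire_quiescent_template
sysctl_conf_lines
```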

Julian's One Packet Scheduler (OPS) for UDP, timeouts for DNS

Although UDP packets are connectionless and independent of each other, in an LVS consecutive packets from a client are sent to the same realserver, at least until a timeout or a packet count has been reached. This is required for services like NTP, where each realserver is offset in time by a small amount from the other realservers and the client must see the same offset for each packet to determine the time. The output of ipvsadm in this situation will not show a balanced LVS, particularly for a small number of client IPs. Julian has experimental LVS patches for a "one packet scheduler" which sends each of a client's UDP packets to a different realserver.

Ratz 17 Nov 2006 First off: OPS is not a scheduler :), it's a scheduler overrider, at best.

It's really a bit of a hack (but probably a hack that is needed), especially with regard to the non-visible part in user space. I have solved this in my previously submitted Server-Pool implementation, where several flags are exported to user space and displayed visibly. Now I remember that all this has already been discussed ... with viktor (at) inge-mark (dot) hr and Nicolas Baradakis has already ported stuff to 2.6.x kernels: Julian's reply: So Julian sees this patch in 2.6.x as well :). I've also found a thread where I put my concerns regarding OPS: The porting is basically all done, once we've put effort into Julian's and my concerns. User space is the issue here, and with it how Wensong believes it should look.

Horms As an option, I can't see any harm in this and I do appreciate that it is needed for some applications. Definitely not as default policy for UDP, because the semantic difference is rather big:

    srcIP:srcPort --> dstIP:dstPort --> call scheduler code & add template
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    srcIP:srcPort --> dstIP:dstPort --> read template entry for this hash
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    srcIP:srcPort --> dstIP:dstPort --> read template entry for this hash
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    [IP_VS_S_UDP]
    srcIP:srcPort --> dstIP:dstPort --> call scheduler code & add template
    srcIP:srcPort <-- dstIP:dstPort --> do nothing

    OPS
    ---
    srcIP:srcPort --> dstIP:dstPort --> call scheduler code (RR, LC, ...)
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    srcIP:srcPort --> dstIP:dstPort --> call scheduler code
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    srcIP:srcPort --> dstIP:dstPort --> call scheduler code
    srcIP:srcPort <-- dstIP:dstPort --> do nothing
    [IP_VS_S_UDP]
    srcIP:srcPort --> dstIP:dstPort --> call scheduler code
    srcIP:srcPort <-- dstIP:dstPort --> do nothing

The other question is, if it makes sense to restrict it to UDP, or give a choice with TCP as well?
Ratz This is a problem with people's perception and expectation regarding load balancing. Even though the wording refers to a balanced state, this all depends on the time frame you're looking at. Wouldn't it be more important to evaluate the (median) balanced load over a longer period on the RS to judge the fairness of a scheduler? I've had to explain this to so many customers already. There are situations where the OPS (I prefer EPS, for each packet scheduling) approach makes sense.

Julian Nov 18 2006 I don't know what the breakage is and who will solve it but I can only summarize the discussions:

    - RADIUS has responses, not only requests
    - DNS expects responses (sometimes many if the client retransmits)
    - there is a need for application aware scheduling (e.g. SIP)

SIP does not look like a good candidate for OPS in its current semantics. Everyone talks about OPS. But for what applications? Maybe that is why OPS is not included in the kernel - there is not enough demand, or understanding of the test case which OPS solved in one or two setups.

Last known 2.6 patch for OPS (parts): Threads of interest:

"Rudd, Michael" Michael (dot) Rudd (at) tekelec (dot) com 12 Apr 2007

With the OPS feature turned off, the source IP address is correctly SNATed to my VIP. With the OPS feature on and working correctly(which we need for our UDP service), the source IP address isn't correctly SNATed.

Julian 18 Apr 2007 OPS is implemented for setups where there is no reply for the original packet. DNS and RADIUS have replies, so they need something different (which is not done by OPS):

    - every original packet should pass the scheduling step
    - the reply (or replies) should go back properly; the hashed connection should keep the needed information (VIP, VPORT)

So, what you need is something different. OPS now works in this way:

    - schedule the original packet but don't hash the connection
    - when a reply packet is received it cannot find the connection and the reply packet is treated as a non-IPVS packet (the real server receives ICMP or the packet goes further). That is why such replies don't have the proper VIP:VPORT if passed to the output device.

Is anybody aware of the code for this? I assume it's related to no longer looking up the connection in the hash table with OPS, and thus not SNATing. Maybe an iptables rule could fix this?

You can use a rule but maybe you will not get the right VPORT every time. Not sure why you need OPS. It was created when someone needed to generate many requests from the same CIP:CPORT with the assumption that there are no replies. Only when many connections come from the same CIP:CPORT within the configured UDP timeout period does connection reuse prevent scheduling from being used for every packet. That is why OPS was needed: to schedule many packets (coming before expiration) from the same CIP:CPORT->VIP:VPORT to different real servers. Maybe what you need from OPS is impossible: when OPS is not used, if a reply is delayed, IPVS will wait until the configured UDP timeout has expired, but this value can be different from the timeout your clients are using. A difference of milliseconds can be fatal. What can happen is that a different request from the same CPORT will go to the same real server as long as the UDP timeout has not expired. There can be different situations:

    - clients can retransmit on some timeout (DNS, RADIUS)
    - nobody is instructed how many requests should be passed (and the same number of replies, if such an application mode is used) before removing the NAT connection explicitly before expiration, to allow the next request to be scheduled to a different real server.

So, the main problem is that it is not easy to balance a single CIP to many real servers if there are replies that can be delayed or when requests can be retransmitted. There is no way for IPVS to know when to forget one connection to allow scheduling for the next packet from the same CIP:CPORT. So, if the client expects replies then OPS should not be used. Instead, a short UDP timeout should be used, and one should be ready for a single CIP:CPORT to be scheduled to the same real server even if many distinct (from the application's point of view) requests are sent from the same socket.

So I send my DNS query to my VIP on my directors. It gets routed to a realserver on which I've attached the VIP to bond1.201:0. According to others I've talked to I shouldn't need an iptables rule, but I still don't see the packet go out with the source IP address of the VIP. I see the packet with the source IP of the actual realserver. It's possible it is a routing issue though, so I plan on digging deeper on that today.

For LVS-DR reply should be generated in real server with src=VIP. If you ask the question for LVS-NAT then with OPS you will need the iptables SNAT rule because IPVS does not recognize replies. But I have never tested such setup. Without OPS you don't need iptables SNAT rule, IPVS translates the source address.

Should I need an iptables rule at all for LVS-DR?

No, reply goes directly from real server to client.
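Julian's remark above, that LVS-NAT with OPS needs an iptables SNAT rule because IPVS no longer recognizes the replies, might be sketched like this. The rule that Michael Rudd actually used was not posted, so everything here is an assumption: the VIP, the realserver network, the port and the run helper are hypothetical. Note Julian's caveat that such a rule may not always give the right VPORT; this sketch leaves the source port untouched, which only matches if the realservers listen on the same port as the virtual service.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root and iptables).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

VIP=192.168.1.110      # hypothetical virtual IP
RS_NET=10.1.1.0/24     # hypothetical realserver network behind the director
PORT=53                # DNS, as in the thread

# On the director: rewrite the source address of UDP replies coming from the
# realservers so that clients see them arriving from the VIP.
snat_replies() {
    run iptables -t nat -A POSTROUTING -p udp -s "$RS_NET" --sport "$PORT" \
        -j SNAT --to-source "$VIP"
}

snat_replies
```

As Julian says, this is only relevant to LVS-NAT with OPS; without OPS, IPVS translates the source address itself, and with LVS-DR the realserver replies directly with src=VIP.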

icmp responses aren't generated by UDP timeouts on VIP-less directors Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 19 Jan 2007

I am using LVS director with no VIP for load balancing ipsec servers accessed by NAT'ed clients (udp 500/4500, fwmark method). When I remove a realserver (ipvsadm -d ...), its clients are not notified after their connections expire. I suspect that icmp responses are simply not generated on the director as they should be.

Julian Yes, icmp_send() has code that feeds ip_route_output*() with a non-local source address (the VIP that is not configured as an IP address on the director). The ICMP reply logic is implemented in a way that ensures the ICMP packet will use a local IP address, for example when an IP router wants to send a reply for a packet destined to the next hop (which looks like our case). The networking maintainers are still waiting for someone to split all callers of ip_route_output*() into those that require a local source address and those that don't. The goal is to move the check for a local source address out of ip_route_output to allow code such as NAT or IPVS to get an output route with a non-local source address (maybe there are other such uses). Every place should be audited and a check for a local IP should be added only if needed. The ICMP reply code is one such place that needs to send ICMP replies with a local address: the receiver should see who generates the error. So, for the problem in the original posting: IPVS users that need to send ICMP replies for VIPs should configure the VIPs on the director. I'm not sure there will be another solution. If one day ip_route_output does not validate the source address, maybe icmp_send can rely only on this check as before:

    saddr = iph->daddr;
    if (!(rt->rt_flags & RTCF_LOCAL))
        saddr = 0;

Then the director will send ICMP replies from VIPs, by using the local-delivery method to accept traffic for the VIP. I hope the problem can be solved in another way, e.g.

    - isakmp keep alive
    - longer UDP timeout
    - persistency

It is against UDP principles for users to expect reliability from the internet. See RFC1122, 4.1.1. Any support for ISAKMP keep alives in your devices?

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 13 Feb 2007 If you mean DPD (dead peer detection) - yes, it is supported (I use OpenSwan), but it does not work very well for me. In my case, several tunnels can use the same ISAKMP association, and only one of them is removed when the peer is assumed dead.
Other tunnels stay on, ignoring the ICMP port unreachable messages my patched director is sending, until they expire. My current workaround is not using DPD, but setting a short rekey period (15 mins or less).

Ratz 5 Feb 2007 It does not generally make sense for the director to send notification of the connection expiration, since the connection establishment was between the CIP and VIP. The director does not have any open sockets. However, I can understand the need for such a notification. If the director knew the endpoint's socket state, we wouldn't have the need for this opportunistic timeout handling currently present in IPVS. From my understanding nobody is required to send an ICMP message back, especially on the grounds of pulling a machine from the network. I would need to re-read the RFC and study Stevens' diagrams to make a supported statement.

Here's Janusz's description of the patch (which is being incorporated into LVS).

If a realserver goes down, then the director being the last hop is supposed to let the client know that the (virtual) service it was connected to doesn't exist anymore. Presumably shortly thereafter, failover will fix up the ipvsadm table and the client gets to pick a new service?

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 27 Mar 2007 Yes, that is it, but there's more too. It acts on a VIP-less director in two cases at least: In an overload case, it should allow using the VIP source address in icmp port unreachable packets which are sent by ip_vs_leave(), "called by ip_vs_in, when the virtual service is available, but no destination is available for a new connection" (quote from ip_vs_core.c).
As you said, but with some tricks: On my VIP-less fwmark LVS-DR director, I use it for sending icmp port unreachable packets after a UDP connection is removed when a realserver goes down, or when it just expires and a packet could be sent to a different realserver without the client's knowledge (for TCP, TCP_RESET is sent, which does not need this patch).

First of all, my LVS-DR traffic all goes through conntrack (see F5_snat http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html#F5_snat). Only the first packet of each connection is marked for a specific service, to get the IPVS connection entry established. Subsequent packets are just marked for a local routing rule, so they can be seen, matched against the existing connection and redirected by ip_vs_in(). If the connection is removed (expired, timed out after the realserver goes down, or immediately with the help of expire_nodest_conn), the packet goes through ip_vs_in() untouched (no matching connection nor service) and ends up in udp_rcv(), where the icmp port unreachable packet with the VIP source address is generated and sent to the client with the help of the patch in question.

Unfortunately, this does not remove the corresponding conntrack entry (unlike TCP_RESET), so if a client ignores icmp errors and keeps sending, all following packets go the same way and the director keeps responding with icmp errors. To solve this problem, I have moved ip_vs_in() before the INPUT filter hook (http://www.icnet.pl/download/ip_vs_core-input-before-filter.patch), set up netfilter rules that generate log entries for this case, and set up a user-space daemon that traces the log and removes conntrack entries for non-existent LVS connections (http://www.icnet.pl/download/delete_connection.sh).
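The last step Janusz describes, removing the conntrack entry of a client whose LVS connection no longer exists, could look something like the following. This is not his delete_connection.sh, whose contents are not reproduced here; it is a hypothetical sketch that assumes the conntrack command line tool from conntrack-tools, made-up addresses, and a run helper that only prints the command unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root and conntrack-tools).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Delete the conntrack entries for one client's UDP flow to the VIP.
# $1 = client IP (CIP), $2 = VIP, $3 = destination port of the service
delete_udp_conntrack() {
    run conntrack -D -p udp --orig-src "$1" --orig-dst "$2" --dport "$3"
}

# Hypothetical example: an ipsec NAT-T client (udp 4500) of a removed realserver.
delete_udp_conntrack 10.0.0.7 192.168.1.110 4500
```

A daemon following the netfilter log, as Janusz describes, would call something like delete_udp_conntrack with the addresses taken from each log entry.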

LVS: Routing and packet delivery to a director without a VIP (for fwmark and transparent proxy)

Introduction

This problem comes up if you are setting up a director without a VIP. You don't need a VIP on a director if you're forwarding packets by fwmark, or if machines are accepting packets by transparent proxy. (All the simple LVS setups use VIPs on the director. You can skip this chapter if you aren't using fwmark or transparent proxy.) LVS rearranges the tcpip connection that normally requires exchanges between 2 machines (client<->server) so that (for LVS-DR) the packets are moved in a circle between 3 machines (client->director->realserver->client). In the case of fwmark or transparent proxy, the packets from the client are sent to and accepted by a machine (the director) which doesn't have the IP of the dst_addr. The routing and tcpip connections must not break the tcpip RFCs and the client must still think it is connected to one machine.

Routing to and accepting packets by a VIP-less director

When setting up an LVS on a VIP, normal routing mechanisms will deliver the packet from the client (with dst_addr=VIP) to the director. Once the packet has arrived on the director, it will be accepted locally, as the director has the VIP on one of its ethernet devices. IPVS picks up the packet from there and forwards it to the realservers. When the director is set up to forward fwmark'ed packets by LVS, it (usually) will not have an IP that matches the dst_addr of the client's packets. The client doesn't know that the LVS is set up with fwmarks and sends requests to one (or several) VIP:ports. The fwmark on the director could be put onto packets that consist of an arbitrary grouping of IP:ports. The VIP:port that the client is connecting to is actually on the realservers (which are not replying to arp requests), but there is no VIP on the director, just some iptables rules marking the packets and ipvs forwarding the marked packets. Without a VIP on the director, normal routing mechanisms will not send the packets to the director. How does a packet get to the director (or realserver), which is accepting packets for the VIP, if the host doesn't have the VIP on an eth device to respond to arp requests? You have to intervene to make sure the packets get to the director.
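The "iptables rules marking the packet and ipvs forwarding the marked packets" can be sketched as below. This is an illustrative configuration only: the mark value, the VIP network, the ports, the realserver addresses and the run helper are all hypothetical, and the commands are printed rather than executed unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root, iptables and ipvsadm).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

FWMARK=1                          # hypothetical mark value
VIP_NET=192.168.1.0/24            # hypothetical block of VIPs
REALSERVERS="10.1.1.2 10.1.1.3"   # hypothetical RIPs

# Mark an arbitrary grouping of IP:ports (here http and https to any VIP
# in the block) with a single fwmark.
mark_packets() {
    for port in 80 443; do
        run iptables -t mangle -A PREROUTING -d "$VIP_NET" -p tcp \
            --dport "$port" -j MARK --set-mark "$FWMARK"
    done
}

# Build the virtual service on the fwmark rather than on a VIP:port,
# forwarding by LVS-DR (-g) with round robin scheduling.
fwmark_service() {
    run ipvsadm -A -f "$FWMARK" -s rr
    for rip in $REALSERVERS; do
        run ipvsadm -a -f "$FWMARK" -r "$rip" -g
    done
}

mark_packets
fwmark_service
```

Nothing in this sketch gets the packets to the director or accepted locally there, which is the problem the rest of this chapter deals with.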

Routing to the MAC address of the director

This solution is to configure the router by hand to forward packets with dst_addr=VIP to the director. These rules rely on forwarding the packet from the router (or test client) to a known MAC address on the director.

    - put an entry for the MAC address of the director in the arp table of the router (see /etc/ethers).
    - add a route to the router with route add -host ...

On the realservers in a LVS-DR LVS the routing is already handled for you. In LVS-DR when the director wants to send a packet to a realserver, it looks up the MAC address for the RIP and sends the packet with addresses CIP->VIP by linklayer to the MAC address of the realserver. Once the packet arrives at the realserver, processes listening to VIP:port pick up the packet.

These routing methods do not work well when director failover is required, as the router tables need to be programmatically updated with the identity of the new director. This requires code and you may not have access to the router tables. The original plan for vrrpd had a virtual MAC address assigned to an interface. This MAC address could then be moved to the new director on failover. Unfortunately Alexandre found this too difficult to code up due to lack of access to information about the hardware and the plan has not been implemented.

Normally, once a packet arrives on the director, it is either delivered locally (if the dst_addr is on the director) or else routed somewhere (after consulting routing tables). In the case of a director forwarding by fwmarks, there is neither a VIP for local delivery, nor routing rules for forwarding packets with dst_addr=VIP. If you send a packet to a director configured this way, it will send arp requests "who has VIP, tell director".
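On a Linux router (or test client) the two steps listed above might look like the following. The addresses, the MAC and the run helper are hypothetical, the text above does not give the full route command, and a commercial router will have its own syntax, so treat this only as a sketch of the idea.

```shell
#!/bin/sh
# Dry-run helper (an assumption of this sketch): print the command unless
# APPLY=1 is set, in which case run it (needs root and net-tools).
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

VIP=192.168.1.110                  # hypothetical VIP (not configured on the director)
DIP=192.168.1.9                    # hypothetical director IP
DIRECTOR_MAC=00:11:22:33:44:55     # hypothetical MAC address of the director

# Static arp entry: the router sends packets for the VIP to the director's
# MAC address without ever sending an arp request for the VIP.
static_arp_entry() {
    run arp -s "$VIP" "$DIRECTOR_MAC"
}

# Host route: the router forwards packets for the VIP via the director's IP.
host_route() {
    run route add -host "$VIP" gw "$DIP"
}

static_arp_entry
host_route
```

Either entry has to be changed by hand (or by script) when the director fails over, which is the weakness of this method described above.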

Julian's iproute2 solutions

Julian 09 Dec 2003 Currently, IPVS requires the traffic to VIPs to be locally delivered (LOCAL_IN). It means the VIPs should be local IPs or you have to use ip rules selecting table(s) with local route(s). IIRC, transparent proxy does not work for IPVS in standard 2.4 kernels as it is in DNAT form. This is a local route:

Realservers can have fwmark rules to process packets with dst_addr=VIP. With LVS-DR the routing problem is already handled; the director sends the packets to the interface with the MAC address of the RIP (LVS-Tun also routes packets to the realservers).

Several methods are available to enable local delivery of a packet which has arrived on a machine which does not have the IP of the dst_addr. If you are using a single VIP (or a small number of VIPs), you can put these IPs on the director on an ethernet device or on lo. If a range of addresses is required, an alias can be set to accept a network of addresses, without assigning an IP to the device.

Horms horms (at) vergenet (dot) net 10 Apr 2000

You can now ping anything in the 192.168.1.0/24 network from the console. Note: You can't ping any of those IPs from a remote host (i.e. after adding a route on the remote host to this network). If you put this network onto an eth0 alias (rather than an lo alias), it won't reply to pings from the console - presumably the ping replies in the lo:0 case are coming from 127.0.0.1.

For another example of routing to an interface without an IP, see routing to realservers from director in LVS-DR. Here's the iproute2 method of getting all packets delivered locally. This is now the preferred method of arranging for packets to be accepted locally.

Julian 7 Jul 2002

The local routes are used for transparent proxy, for example: http://marc.theaimsgroup.com/?l=linux-virtual-server&m=101674735204704&w=2 The recipe is

Joe: you might (should?) be able to route packets to the VIP this way, allowing packets for an IP which is not on the machine to be accepted by LVS. If you have just 1 VIP, you don't need these rules; you can just set up the VIP on the director in the normal manner.
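A sketch of what iproute2 local delivery can look like, in two variants. The first declares a whole network local without putting an IP on any device. The second sends only fwmark'ed packets to a separate routing table whose single entry is a local route for every destination. The network, mark value, table number and function names are assumptions for the example, and for the second variant the packets must already have been marked (e.g. with iptables). The commands are printed, not run.

```shell
# Dry run: the functions only print iproute2 commands.
# Network, mark and table number are example values.

# Variant 1: every address in NETWORK is accepted locally; no marks involved.
local_net_cmd() {         # usage: local_net_cmd NETWORK/PREFIX
    echo "ip route add local $1 dev lo"
}

# Variant 2: packets carrying MARK are looked up in TABLE, and TABLE
# says "everything is local", so marked packets reach LOCAL_IN.
fwmark_local_cmds() {     # usage: fwmark_local_cmds MARK TABLE
    echo "ip rule add fwmark $1 table $2"
    echo "ip route add local 0.0.0.0/0 dev lo table $2"
}

local_net_cmd 192.168.1.0/24
fwmark_local_cmds 1 100
```

lo is used as the device in both variants because it is always up. Unmarked traffic is untouched by the second variant, since the rule only matches packets that carry the mark.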

In many ways, having an IP on the director for no other reason than so that it can accept the packet, is a step backwards.

Joe 9 Jul 2002: is it possible/reasonable/sensible for ipvs, when forwarding fwmarks, to pick the marked packet up from the PREROUTING chain (or wherever it is)? Currently ipvs needs an IP or a functional equivalent of ipchains redirect to be able to get the packet.

Julian: Yes, it is useful for some cases. Such a feature is at the bottom of our TODOs but requires many changes, including breaking the routing func prototypes. So, for now the answer is no :)

(for 2.2.x kernels) you can use transparent proxy to accept packets for the VIP. Transparent proxy code was written for squids to allow them to accept packets destined for remote addresses. Only the transparent proxy code for kernel 2.2.x works for LVS. With this code the packet arrives with dst_addr=VIP and is picked up by ipvs. With the 2.4.x kernels, the packet arrives with dst_addr=127.0.0.1. This is fine for squids, but ipvs ignores the packet. It's unfortunate that by the time the netfilter people were informed that this was a problem for us, the new code was too entrenched and they didn't want to change it. Transparent proxy in the standard 2.4.x kernels, where most future development of LVS will be, does not work for LVS. RedHat have patched their kernel so this is fixed (Mike McLean mikec (at) redhat (dot) com 4 Dec 2002). A patch for the 2.4 kernel that restores the original transparent proxy functionality also exists.

Horms (from a thread on another topic): Use a fwmark service and be rid of your VIP on an interface altogether.

Joe 26 Aug 2003

getting rid of the VIP altogether is still a bit of a problem isn't it? there are solutions that apparently came originally from Julian (urls in the archive were then listed).

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=106020019020431&w=2 pretty straightforward and basically the way fwmarks work if you are using them for more than one IP address, which was the reason fwmarks were originally added to the LVS code. The route commands are needed because ipvs is called after routing takes place. I think that in the case of fwmarks it would be best to move the code to the prerouting stage to avoid the need for this, i.e. hook ip_vs_in into NF_IP_PRE_ROUTING instead of NF_IP_LOCAL_IN.

What will this get us? Will we still need the route command? Are you going to implement it, or are you just thinking out loud?

Horms: Yes, it will remove the requirement for traffic to be local. I made the change - it is one line - and very briefly tested it. It seemed to work quite well. But it is a change that will most likely have side effects so it warrants further thought and investigation.

Julian: Such a move can allow IPVS not to require local delivery. There will be some issues with properly identifying the direction of the packets but it is possible to implement. The problem is that we are stuck with the netfilter hooks. If we move out of the hooks or if we add some changes to the kernel we can do everything, including proper routing for inout packets (working with multiple ISPs), avoiding the LOCAL_IN->LOCAL_OUT problems that start to appear with 2.6, etc. Maybe we will need a ROUTING hook. IIRC, fwmark is present in PRE_ROUTING but such a move can create some compatibility problems; are we all ready for this?

Horms: In http://marc.theaimsgroup.com/?l=linux-virtual-server&m=106020171022117&w=2 the packets are delivered locally because of the "local" in

Again, this isn't really the way it was supposed to work AFAIR.

if/since Julian's routing method works, why do we need transparent proxy (if we ever did)?

Julian: People have alternatives; all these methods differ in some way and can be selected according to their behaviour. IPVS is liberal about the local delivery method. Note the main things:

fwmark is a common way to mark packets in the kernel, provided by the firewall functionality

IPVS can use this marking to know which packets are destined for the virtual service (fwmark-based servers)

the local delivery does not depend on fwmark, i.e. you can safely route locally without using fwmarks, e.g.

Note ANY_DEVICE: it does not matter until you try to play with the device state; lo is preferred as it is always up.

This routing method does not handle ICMP errors, because it is assumed the VIPs are not configured as local IPs. The current kernels have checks for local source IP when generating packets (icmp_send uses ip_route_output, which has such checks starting from 2.4) and if they are not configured as local IPs, then the reply is not generated. So, there are some corner cases where the kernel does not like our local delivery methods. If that is considered a problem, it is better for the VIP to be configured locally.

(Joe: If you're being secure about your LVS, you aren't going to have a route from the VIP on the director back to the outside world, and you won't be sending back icmp traffic anyhow.)

As for the original subject, the LVS directors cannot be realservers, clients and backup servers at the same time for the same virtual service. The VIP must be announced only from one director. If the backup director has the VIP configured then it cannot communicate with other hosts from the cluster. Also, the backup server must not create an ARP problem if the VIP is configured there.

Joe

Can you set up a squid then with this routing method, without using transparent proxy?

Julian: I thought people knew about/used such an alternative:

Maybe someone has experience with this method and can provide actual settings. It is useful in setups where the packet header must be preserved. IIRC, 2.4 TP breaks this rule.

Joe

Is this routing method a generalised way of accepting packets on the director when using fwmark with LVS?

Horms I was wondering that on the way home last night. I would suspect so. It has the potential to cover a lot of issues in a manner that is supported by stock kernels. That would be nice. But then again those issues may disappear if LVS was moved to prerouting.

Ludo's LVS target in iptables

The director needs to accept packets with dst_addr=0/0. I don't force the director to accept the packets as local packets. I modified the kernel to send all forwarded traffic into LVS. Thus I force the director to accept them as forwarded packets! This is the purpose of the "LVS" iptables target. No need for other fancy tricks, just match the traffic in iptables and use "-j LVS" as the target. This patch provides an iptables target called "LVS" which calls the entry function of LVS. Thus you can match the traffic to the VIP on the FORWARD hook of iptables:

-m state --state NEW -j MARK --set-mark 1
#iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS

This last line mimics the behaviour of packets going to the director directly; it will call the LVS functions as if the packet was on the local-delivery path.

Transparent proxy Q and A

Q. Some daemons listen only to specific IPs. What IP is the telnet/httpd listening to when it accepts a connection by transparent proxy?

A. It depends where you are when you make the connect request (this is for 2.2.x kernels).

Example: You are on the console of a host and add x.x.x.111:http by transparent proxy and set up the httpd to listen to x.x.x.111:80. You cannot ping x.x.x.111. To connect to x.x.x.111:http you need to (adding a route to eth0 does not work).

If you go to an outside machine, you still cannot ping x.x.x.111 and you cannot connect to x.x.x.111:http unless you make the target box the default gw or add a host route to x.x.x.111.

If you now go back to the console of the transparent proxy machine and change the httpd to listen to 127.0.0.1:http (and not to x.x.x.111:http) you can still connect to x.x.x.111:http even though nothing is listening to that IP:port (linux tcpip does local delivery to local IPs). (You can also connect to 127.0.0.1:http, but this is not concerned with transparent proxy.) Returning to the outside machine, you cannot connect to x.x.x.111:http.

The connections from the outside machine model connections to the director with the VIP by transparent proxy, while the connections from the console model the realserver which has a packet delivered from the director. On the realserver you could have your services listening to 127.0.0.1 rather than the VIP. You may run into DNS problems (see Running indexing programs) if the process listening to 127.0.0.1 doesn't know that it's answering to lvs.domain.org.

Other tricks For other examples of routing and accepting packets without IPs, look at the section on default gw for LVS-DR.

LVS: Fwmarks (firewall marks)

Introduction

fwmark nomenclature: Karl Kopper (Apr 2004) said that he thinks the correct term for this is "netfilter mark". A google search finds references to "netfilter mark" back to 2001, and with "fwmark" current at least to 2003. Both terms seem to be in use. The various netfilter HOWTOs don't say anything about new terminology. Horms (who wrote the fwmark code) doesn't know anything about a change in terminology, but thinks it's possible that fwmark is the implementation of netfilter marks. I asked Harald Welte about this at OLS_2004 and the explanation was as clear as day, except that I didn't write it down and now I've forgotten it (geez, sorry about this). It was a matter of nomenclature rather than logic: it was something like - the entity in the command line is called a mark while the method of marking packets is called fwmark. Whatever it is, you can use either term and people will know what you're talking about.

fwmark is a way of aggregating an arbitrary collection of VIP:port services into one virtual service (the entry made with ipvsadm -A). Thus a virtual service could be composed of multiple VIP:ports (e.g. VIP1:port1, VIP2:port2...VIPn:portn). This is useful if the client needs to connect to all of the VIP:port services together on one realserver. Common uses for fwmark are

to aggregate VIP:http and VIP:https, so that when a client fills their shopping cart on VIP:http and they move to VIP:https (to give their credit card information), they will stay on the same realserver.

with multi-port services like ftp (there are some wrinkles with ftp, since the 2nd port calls from the realserver rather than from the client - read the setup of ftp elsewhere in this section).

when the realserver is a squid. All traffic to port 80 (for all IPs) is aggregated with a fwmark.

A minor advantage is that a realserver can be added, removed and re-weighted with one ipvsadm command.
To enable fwmark, the packets coming into the director have to be labelled with a fwmark (the mark is a number attached to the packet's skb inside the kernel; the bytes of the packet itself are not changed). This is done with iptables (or ipchains). The fwmark is only associated with the packet while it stays in the skb of the machine which marked the packet (here the director). The fwmark is not on the packet when the packet is put out on the external network (i.e. the fwmark that is put on the packet when it is on the director is not on the packet when it arrives at the realserver). Once the component services are fwmark'ed, filtering (with iptables/ipchains) can be done on the fwmark, rather than on the individual IP:ports.

The original method for setting up an LVS used the VIP as the target for ipvsadm commands. Using the VIP as the target, it is possible for LVS to forward multiple services on the same VIP and to forward packets for several different VIPs. However this method does not scale well to large numbers of services or IPs. As well, the connections to each service are independent, unless persistence is invoked.

The more flexible fwmark method was introduced by Horms in Apr 2000. Ted Pavlic then showed how to use fwmarks to group arbitrary services. In this way connections to two otherwise independent services, e.g. http and https, will be linked as one service as far as ipvs is concerned and the client will stay on the same realserver for both services. The fwmarks method is more flexible and simpler to administer for large numbers of services than is the VIP method. fwmark is used to

group services together within a single LVS, e.g. group 1 - port 80,443 for an e-commerce site; group 2 - port 20,21 for an ftp server

group large numbers of VIPs together

Setting up an LVS on fwmarks rather than the VIP is now the method of choice for setups with multiple VIPs or a group of ports that need to be aggregated. fwmark can be used with all forwarding methods and should have no effect on performance (throughput, latency).
Fwmarks are numbers but can be translated into names using the fwmark name translation table patch. Some history from Horms (this has also been described in "Wired" Magazine).

The story starts with a trip from Silicon Valley, where I was working for VA Linux Systems, to a VA Linux Systems Professional Services customer site in Fort Lauderdale. It was mid-February 2000. I was called onsite to help sway the customer towards using LVS. The customer was interested in using LVS for a very large number of customers. Part of their requirement called for a very large number of virtual services to be configured. I suggested that we could simplify this by collecting the virtual services into contiguous network blocks and modifying LVS to recognise all addresses in a block as belonging to a virtual service. The customer seemed to like this idea. My original proposal and implementation was to allow virtual services based on netmasks. Wensong rejected this because of some potential performance issues. I distinctly remember working on the original implementation on a train trip from the Blue Mountains to Sydney's Central Station with my then girlfriend. By the time I had to change trains to go to Wynyard the code was working :) When I got back home to Silicon Valley I finished off the changes and emailed them to Wensong. That was on the 20th March. He wasn't particularly happy with some aspects of the change, particularly some performance overhead that my implementation introduced. I made some changes and sent him a new version. He suggested making the new code optional; I made that so too. We exchanged email and code for about a week. A few days later Julian came up with the idea of using a fwmark, a feature of the ip_masq code that had been around for a while, but wasn't heavily used. Wensong passed this on to me (30 Mar). Wensong clearly was not happy with my approach to the problem and suggested the implementation that he and Julian had hashed out.
The change involved using netfilter (iptables) to handle deciding which packets belong to a virtual service, rather than putting that logic into LVS itself - it was this portion of the code that Wensong was worried about the performance of. We talked this over a little bit over email and I implemented the idea. On the 6th of April I sent the new code to Wensong and Julian. On the 7th Wensong wrote back explaining a few changes he was going to make, mostly involving having the code always compiled in rather than making it an option, as there didn't appear to be any performance overhead in the new code. The new option, which by then was known as firewall mark virtual services, was included in IPVS 0.9.10 which was released on the 9th April. Minor fixes were made, mainly by Wensong, over the following few months and made it into subsequent releases. I wrote the kernel, ipvsadm and ldirectord changes and largely have maintained them ever since. It is of note that as a part of the work that came out of this customer the -R and -S options to ipvsadm were suggested and implemented by myself. These were released just before the inclusion of the fwmark code. This customer was also the impetus for putting together what is now known as Ultra Monkey. All in all quite an interesting outcome for a couple of days on site. Pleasingly I believe that the customer in question is using Ultra Monkey with the fwmark support in LVS.

A bio of Horms:

I am from Sydney, Australia. I have been involved in Linux for, well, a long time. My main area of expertise is High Availability and Load Balancing, though anything from email to routing is just fine by me. You can see a list of the projects I have worked on as well as the papers that I have presented at conferences on my web page (http://www.vergenet.net/linux/). I used to work at VA Research which became VA Linux Systems until they changed their business model and became VA Software. During that time I was based in Silicon Valley, New York City and Sydney (though not all at the same time :). I currently work for VA Linux Systems Japan, in Tokyo - which I should point out is majority owned by the Sumitomo Corporation and is independent of VA Software (USA) these days. I primarily work on the Ultra Monkey Project in conjunction with NTT Commware. http://www.ultramonkey.org/, http://www.vasoftware.com/, http://www.valinux.co.jp/, http://www.nttcom.co.jp/.

(Joe) I first saw Horms when he gave a talk at the 4th Annual Linux Expo at Duke University, Durham, NC in May 1998, on Creating Redundant Linux Servers (http://www.vergenet.net/linux/redundant_linux_paper/). Although I attended the talk and thought it pretty neat, it never occurred to me to introduce myself. Later when we both joined the LVS project, it took quite some time before I connected Horms on the LVS mailing list with the person who gave the presentation at the Linux Expo.

Sample configurations/topologies for fwmarks are at Ultramonkey.

ipvsadm syntax for fwmark

ipvsadm command ignores ports, fwmark can't translate ports You can enter a port number with a fwmark command with ipvsadm but it is ignored. Leonard Soetedjo

From the HOWTO, when using fwmark, I can set the port to be 0. Is this correct? Is it ok if I do that for a single port service such as telnet? for example Is the use of "0" not important? i.e. I can set to whatever I want?

Horms 17 Dec 2002: The LVS kernel code that handles fwmarks really doesn't care about ports at all. If you want a service to match on specific ports, then you should set up the iptables rules to only mark packets to that port or ports. nick garratt Mar 25, 2004

I'm experiencing issues with port translation using LVS-NAT and FWMARK. What I am trying to achieve is the following: we have a custom written SMPP service that accepts two connections (transmitter and receiver) from a client. We have run into problems with maximum threads per process and large numbers of binds. As an interim measure we are considering running multiple instances of the daemon on the same server. It is imperative that a user's two binds are routed to the same daemon instance. The user may connect to a port range so as to allow them to specify different receiver and transmitter ports according to their whim or the peculiarities of their client software, but the daemon instance will handle both connections on the same port. The intention is to group the VIP port range using FWMARK as we do with many other services and load balance them across the RIP service ports ensuring that:

VIP:1237 -> RIP:n
userIP:56790 -> VIP:1238 -> RIP:n

where n is the same port guaranteed by persistence.

Problem: FWMARK and LVS-NAT port translation does not seem to work at all. What actually happens is:

VIP:1237 -> RIP:1237
userIP:56790 -> VIP:1238 -> RIP:1238

which splits the binds across daemon instances.

Horms horms (at) verge (dot) net (dot) au 06 Apr 2004 Yes, port translation does not work with fwmarks, because there is no way for LVS to tell what the port translation should be. In a fwmark service the virtual service does not have a port (or address for that matter). So it can't know that it is accepting packets for, say, port 1237, and then use the realserver entry to translate that to port 1237 (not much of a translation) or 1238 (or anything else). It has to just assume that the port will be unchanged. It would be possible to modify LVS to allow this kind of translation to take place, but it isn't immediately obvious how this would be configured.

Another approach to the problem is to configure multiple virtual interfaces on my realserver, get the daemon instances to bind to specific IPs/same port ranges and handle as per normal i.e. no port translation: However I would prefer to keep down the number of IPs I need to failover.

I would suggest doing this. You shouldn't need to failover the IP addresses of your realservers anyway. Just use something like ldirectord to monitor their availability and manipulate the LVS table accordingly.

setting up routing and packet delivery to the director

If you are accepting packets by a fwmark rather than by the VIP, then (in principle) you don't need the VIP on the node with the fwmark rules (which could be either the director or realserver). To get a working LVS without configuring the VIP on a machine, you need to be able to deliver the packets to the machine concerned (arp now won't be able to find the machine with the VIP) and you have to arrange for the machine with the fwmark rules to accept the packet locally. The node normally only accepts packets for an address on the machine. Without the VIP, the node will forward the packet to somewhere else. It should be possible to arrange for LVS to accept the packet, and Julian has said this is possible, but he's working on other things right now. To do the examples below, you can either set up the fwmarks to mark only one IP (the VIP) and install the VIP on the director, or you can read the section on routing and delivery of packets and use one of the methods suggested there.

single-port service: telnet with fwmarks Assuming you already have setup the networks and default gw for the machines in your LVS, here's how you'd setup telnet without fwmarks (i.e. the "normal" method, using the VIP as the target for ipvsadm commands) on a two realserver LVS-DR. Here's how to do the same thing with fwmarks. You first mark the packets with ipchains or iptables.

ipchains for 2.2.x director

Here's the recipe for setting a fwmark with ipchains:

telnet

iptables for 2.4.x director Here's the recipe for setting a fwmark with iptables: The iptables parameters are taken from an example by Paul Schulz (http://www.foursticks.com.au/~pschulz/qos/pfifo.sample, link dead Jan 2003), which I found through google. First put a mark of value=1 on tcp packets which arrive from anywhere with dst_addr=VIP:telnet (the VIP is on eth1 in my setup). The fwmark is only associated with the packet while it is in the director skb (socket buffer). The packet which emerges from the director and is forwarded to the realserver is a normal (unmarked) packet. (You can't use the director's fwmark information when the packet arrives on the realserver to decide on how to handle the packet.)
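The marking step described above can be sketched as a single rule in the mangle table. The VIP 192.168.1.110 and the function name are assumptions for the example; eth1 and mark value 1 follow the text. The function only prints the rule.

```shell
# Dry run: prints an iptables rule that marks tcp packets for VIP:PORT
# arriving on IFACE. The VIP used in the call below is an example value.
mark_rule() {    # usage: mark_rule IFACE VIP PORT MARK
    echo "iptables -t mangle -A PREROUTING -i $1 -p tcp -d $2/32 --dport $3 -j MARK --set-mark $4"
}

# telnet (port 23) to the VIP, arriving on eth1, gets fwmark 1
mark_rule eth1 192.168.1.110 23 1
```

The rule sits in PREROUTING so that the mark is already on the packet by the time ipvs sees it. `iptables -t mangle -L PREROUTING -n -v` lists the rule together with its packet counters, which is a quick way to check that packets are being marked.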

install the LVS with ipvsadm

Here's the output of ipvsadm

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
FWM  1 rr
  -> RS2.mack.net:23           Route   1      0          0
  -> RS1.mack.net:23           Route   1      0          0

You can now telnet to the VIP. You'll get the expected round robin scheduling of your connections to RS2 and RS1.
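A table like the one above can be produced by ipvsadm commands along the following lines: one virtual service keyed on the fwmark instead of VIP:port, then one entry per realserver with LVS-DR forwarding (-g, shown as Route in the listing). The function name is an assumption and the commands are printed, not run.

```shell
# Dry run: prints ipvsadm commands for a round robin, fwmark-based
# LVS-DR service. The function name is made up for the example.
fwmark_dr_service() {    # usage: fwmark_dr_service MARK REALSERVER...
    mark=$1
    shift
    echo "ipvsadm -A -f $mark -s rr"
    for rs in "$@"; do
        echo "ipvsadm -a -f $mark -r $rs -g -w 1"
    done
}

fwmark_dr_service 1 RS1.mack.net RS2.mack.net
```

No port is given for the virtual service; which ports reach it is decided entirely by the rule that sets the mark.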

Grouping services: single group, active ftp(20,21)

The telnet example above could equally well be done using the VIP or a fwmark as the target for ipvsadm commands. The same is true for any one port service, where connections to services are made independently of each other. Sometimes we need to group services together, e.g. port 20,21 for an ftp server or port 80, 443 for an e-commerce site. With persistence, you can only make ports persistent singly (but you can make persistent as many or as few as you want; they will be persistent independently), or make all ports persistent at once (with the :0 option), in which case persistence of the ports will be linked. There is no way to make pairs (or groups) of ports persistent with the current persistence code. The current method for handling this links all ports on the VIP, and the director will forward connections to all ports, not just the two we are interested in. For security purposes, if persistence is used to group services, then connection requests to the other ports will have to be blocked. Although workable, it's an ugly solution.

For background on how the specifications for fwmarks were set to allow services to be grouped, see Appendix 1 for the initial discussion between Ted and the LVS developers (Horms and Julian), Appendix 2 where Ted let me know that he'd had it working, and Appendix 3 for Ted's announcement to the mailing list.

port grouping using VIP and persistence

Here's an example grouping ports 20,21 for ftp. This uses persistence and the VIP as the target for ipvsadm commands (this is the original, VIP way of setting up ftp). Here's the output of ipvsadm

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS1.mack.net:0            Route   1      0          0
  -> RS2.mack.net:0            Route   1      0          0

After the client has made the initial connection on port 21, then any subsequent connection on port 20 (within the 360sec timeout period) will go to the same realserver. The problem is that the director will forward connection requests made by the client to any port to the same realserver. If we have listeners on port 80 and 443 on the realserver, then these services will be linked to each other (which we may want), and they will also be linked to the ftp service (which we may not want). If you telnet to the VIP, this request will be forwarded to the realservers too (in production you'll have to block this).
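The VIP-with-persistence setup shown in this listing can be sketched as follows. Port 0 on the virtual service means "all ports" and is only accepted together with persistence (-p). The hostnames and the 360 second timeout are taken from the listing; the function name is an assumption, and the commands are printed rather than run.

```shell
# Dry run: prints ipvsadm commands for an all-ports (port 0) persistent
# service on a VIP, forwarded by LVS-DR.
vip_persistent_service() {    # usage: vip_persistent_service VIP TIMEOUT REALSERVER...
    vip=$1
    timeout=$2
    shift 2
    echo "ipvsadm -A -t ${vip}:0 -s rr -p $timeout"
    for rs in "$@"; do
        echo "ipvsadm -a -t ${vip}:0 -r $rs -g -w 1"
    done
}

vip_persistent_service lvs2.mack.net 360 RS1.mack.net RS2.mack.net
```

Because every port on the VIP is now forwarded, filter rules for the ports you do not want to expose have to be added separately.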

grouping with fwmarks

Here's how to set up an ftp server with fwmarks. First mark the packets of interest with ipchains or iptables (i.e. mark all tcp packets destined for VIP:ftp and VIP:ftp-data arriving on eth1).

ipchains for 2.2 director

ftp
- tcp ------ anywhere lvs2.mack.net any -> ftp-data

iptables for 2.4 director

install LVS with ipvsadm

Next set up ipvsadm to schedule packets marked with fwmark=1 to your realservers. You need persistence (here the timeout is set to 600secs). Here's the output of ipvsadm with two current connections to the LVS and 3 expiring ones. Note they are all to the same realserver, as expected for a persistent connection. Since forwarding is by LVS-NAT, the ip_vs_ftp module automatically loads.

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0            Route   1      2          3
  -> RS1.mack.net:0            Route   1      0          0

A netpipe test showed the same latency and throughput for a connection based on fwmark or based on VIP.

What happens now when you telnet from the client to the VIP? (pause to let you think.) The director is only forwarding packets with fwmark=1 to the LVS, so a telnet request to the VIP is accepted by the director and not forwarded to the realservers. If telnetd is running on the director, you'll get a login prompt from the director. In production you'll have to block this too (just like you had to when setting up on a VIP).

So what's the difference, you ask, between setting up an ftp server with persistence on the VIP on one hand (which requires you to block all other packets with iptables rules), and grouping 20,21 with fwmarks on the other (which requires exactly the same blocking of unwanted packets)? Not a lot. At the moment you're at least even.
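The fwmark version of the ftp setup comes down to giving both ftp ports the same mark and then creating one persistent service on that mark. The sketch below prints those commands; the VIP address and function name are assumptions, while eth1, fwmark 1 and the 600 second timeout follow the text.

```shell
# Dry run: prints the marking rules for ftp-data (20) and ftp (21) plus
# the persistent fwmark service that groups them.
ftp_fwmark_cmds() {    # usage: ftp_fwmark_cmds IFACE VIP MARK TIMEOUT
    for port in 20 21; do
        echo "iptables -t mangle -A PREROUTING -i $1 -p tcp -d $2/32 --dport $port -j MARK --set-mark $3"
    done
    echo "ipvsadm -A -f $3 -s rr -p $4"
}

ftp_fwmark_cmds eth1 192.168.1.110 1 600
```

Realservers are then added with `ipvsadm -a -f MARK -r ...` and the forwarding flag of your choice. Only ports 20 and 21 are tied together this way; a telnet to the VIP carries no mark and never reaches the virtual service.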

Lars Marowsky-Brée lmb (at) suse (dot) de 2000-05-11 When using the LVS box as a firewall/router, the fwmark technique is a perfectly adequate solution, which doesn't cost anything.

But look at the next example.

Grouping services: two groups, active ftp(20,21) and e-commerce(80,443) Setup 2 groups of services, group 1 - ftp(20,21), group 2 - ecommerce(80,443). First mark packets in 2 groups.

ipchains for 2.2 director

ftp
- tcp ------ anywhere lvs2.mack.net any -> ftp-data
- tcp ------ anywhere lvs2.mack.net any -> www
- tcp ------ anywhere lvs2.mack.net any -> https

iptables for 2.4 director

setup LVS to schedule (with persistence) 2 groups of packets Note: The ipvs code in Apr 2001 needed a patch to get the expected behaviour. This section describes the function of LVS before and after this patch. As a result of these tests, the patch will be applied to future releases. ipvs-1.0.7-2.2.19 is already patched (Apr 2001). The 2.4.3 series are not patched yet. To see if the code has been patched look in ipvs/Changelog for something like this

Julian changed persistent connection template for fwmark-based service from <CIP,VIP,RIP> to <CIP,FWMARK,RIP>, so that different fwmark-based services that share the same VIP can work correctly.

If your ipvs code is pre-patched, then you can skip down to the part where the behaviour after applying the patch is described. If your code isn't patched, you should just go get the patch and skip to the part where the expected behaviour is described.

unexpected behaviour

Here's what happened with the original code.

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
FWM  1 rr persistent 600
  -> RS2.mack.net:0            Route   1      0          0
  -> RS1.mack.net:0            Route   1      0          0
FWM  2 rr persistent 600
  -> RS2.mack.net:0            Route   1      0          0
  -> RS1.mack.net:0            Route   1      0          0

If you ftp and http to the VIP, you'd expect the ftp connections to go to fwmark 1 (presumably to the first realserver RS2) and the http connections to go to fwmark 2 (again presumably to RS2). With the director running 1.0.6-2.2.19 (ipvs/kernel version), all connections (ftp, http) go to group 1. With the director running 0.2.7-2.4.2, all connections go to group 2. Here's the output from ipvsadm for the 2.2.19 example immediately after downloading a webpage. You would expect the http InActConn to be associated with FWM 2.

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> RS2.mack.net:0            Route   1      0          2
  -> RS1.mack.net:0            Route   1      0          0
FWM  2 rr persistent 30
  -> RS2.mack.net:0            Route   1      0          0
  -> RS1.mack.net:0            Route   1      0          0

It appears (Apr 2001) that the ipvs code doesn't really follow the persistent fwmarks spec. When there is a collision between VIP space and fwmark space (e.g. in these examples, where all packets are going to the same VIP), then the VIP takes precedence and the two fwmark groups are not differentiated. The collision arises because there is only one set of templates for the connection tables.
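The collision can be pictured with a toy model of the lookup key for a persistence template. This is only an illustration of the idea in the Changelog entry (a template identified by CIP and VIP versus CIP and FWMARK); it is not the kernel's data structure, and the function name and addresses are made up.

```shell
# Toy model, not kernel code: build the key under which a client's
# persistence template would be found.
template_key() {    # usage: template_key old|new CIP VIP FWMARK
    case $1 in
        old) echo "$2,$3" ;;       # <CIP,VIP>: the fwmark plays no part
        new) echo "$2,FWM$4" ;;    # <CIP,FWMARK>: the VIP plays no part
        *)   return 1 ;;
    esac
}

cip=10.1.1.5
vip=192.168.1.110
for scheme in old new; do
    ftp=$(template_key "$scheme" "$cip" "$vip" 1)     # ftp group, fwmark 1
    http=$(template_key "$scheme" "$cip" "$vip" 2)    # http group, fwmark 2
    if [ "$ftp" = "$http" ]; then
        echo "$scheme: ftp and http share one template ($ftp)"
    else
        echo "$scheme: separate templates ($ftp and $http)"
    fi
done
```

With the old key, a client's second connection finds the template left by its first one, whichever fwmark group created it, so both groups appear to merge into one.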

expected behaviour Note: May 2001: the ipvs code now has the persistent-fwmark behaviour.

( The code to produce the expected behaviour requires a separate set of templates for fwmarks and VIP. The patch to do this is on Julian's patch page and has names like persistent-fwmark-0.2.8-2.4-1.diff, persistent-fwmark-1.0.5-2.2.18-1.diff. (Note: the 0.2.8 patch had DOS carriage control and wouldn't patch till I removed the ^M characters). (Note: as of ipvs-0.9.0, this patch has been applied to the source tree.) After patching the ip_vs code to produce the new ip_vs.o module (rmmod the old one first), you get the expected fwmark behaviour. )

Here's the output of ipvsadm after ftp'ing and http'ing from a client. Note that the ftp connection is to fwmark=1. The InActConn is the expiring connection from the http client to fwmark=2.

  -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
FWM  1 rr persistent 30
  -> RS2.mack.net:0            Route   1      1          0
  -> RS1.mack.net:0            Route   1      0          0
FWM  2 rr persistent 30
  -> RS2.mack.net:0            Route   1      0          1
  -> RS1.mack.net:0            Route   1      0          0

example Here's an example of using persistence granularity (from Ratz 3 Jan 2001). The -M 255.255.255.255 sets up /32 granularity. Here port 80 and port 443 are being linked by fwmarks.
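A sketch of how persistence granularity is set on a fwmark service: the -M option takes a netmask that is applied to the client address before the persistence lookup. The scheduler, timeout and function name are assumptions for the example; the command is printed, not run.

```shell
# Dry run: prints the ipvsadm command for a persistent fwmark service
# with a given persistence netmask.
granular_fwmark_service() {    # usage: granular_fwmark_service MARK TIMEOUT NETMASK
    echo "ipvsadm -A -f $1 -s wlc -p $2 -M $3"
}

granular_fwmark_service 1 333 255.255.255.255    # /32: each client IP on its own
granular_fwmark_service 1 333 255.255.255.0      # /24: clients in one class C together
```

The wider mask is the usual choice when clients arrive through a farm of proxies whose source address changes between connections.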

The original idea from Ted Pavlic

Ted Pavlic tpavlic (at) netwalk (dot) com 2000-10-08

Just another persistence option that you may or may not have thought of... LVS does support port-group sticky persistence. Before FWMARK support was added to LVS, the only types of persistence one could do were:

One port persistence (all queries to 80 return to the same realserver per CIP)

ALL port persistence (all queries to all ports return to the same RIP per CIP)

But now that FWMARK support exists in LVS, it is easy to create group-based sticky persistence. That is... It adds the option where:

Only these two ports (443 and 80) return to the same RIP per CIP

Meanwhile, another persistence table keeps track of 20, 21, and 1024:65535

Any other port is not persistent

Just have ipchains keep track of flagging the incoming packets with the correct port group identifier:

And have IPVS stop looking at IPs and start looking at FWMARKs:

ssl and cookies
Ted Pavlic tpavlic (at) netwalk (dot) com 2000-10-13

LVS DIRECTLY supports two types of persistence and INDIRECTLY supports another. If you are just asking how to make port 443 persistent so that those who receive a cookie on 443 will come back to the same realserver on 443, simply: Will setup persistence just for port 443.

However, say someone gets a cookie on port 80 and gives it back on port 443 -- in that case you want to have persistence between multiple ports. Using port 0 accomplishes this: In this setup, anyone who visits ANY service will continue to go back to the same realserver. So requests which come in on 80 or 443 will continue to come in to the same realserver regardless of port. This is an OK solution, but it makes all services persistent, which might mess up scheduling. That is, this is a decent solution but sometimes not extremely desirable.

If you want to simply group ports 80 and 443 together, you need to do something more intuitive. Use FWMARK... Now only ports 80 and 443 will be grouped together via persistence. Any other ipvsadm rules will be completely separate. This means that you can make 80 and 443 persistent in their own little "port group" and leave ports 25 and 110 (for example) not persistent. OR... You could group all the FTP ports together as well in a completely different persistence group... i.e. and again
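The three persistence styles Ted describes can be sketched as follows. This is a minimal sketch, not the commands from his posting: the VIP address, the 600 second timeout and the function name `persistence_rules` are assumptions made for illustration, and iptables (2.4 and later) is used where a 2.2 director would use ipchains. The function only prints the commands, so they can be reviewed before being run as root on the director. Realservers would then be added with `ipvsadm -a` in the usual way.

```shell
#!/bin/sh
# Sketch only: prints director commands, does not execute them.
VIP=192.168.2.110     # assumed VIP
TIMEOUT=600           # assumed persistence timeout (seconds)

# persistence_rules MODE   (hypothetical helper)
#   single : only port 443 is persistent
#   all    : port 0, every port on the VIP shares one persistence entry
#   group  : fwmark 1 groups ports 80 and 443; other ports are unaffected
persistence_rules() {
    case "$1" in
    single)
        echo "ipvsadm -A -t $VIP:443 -s rr -p $TIMEOUT" ;;
    all)
        echo "ipvsadm -A -t $VIP:0 -s rr -p $TIMEOUT" ;;
    group)
        echo "iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 80 -j MARK --set-mark 1"
        echo "iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 443 -j MARK --set-mark 1"
        echo "ipvsadm -A -f 1 -s rr -p $TIMEOUT" ;;
    *)
        echo "usage: persistence_rules single|all|group" >&2
        return 1 ;;
    esac
}

persistence_rules group
```

With the `group` style, a second fwmark (say 2) could be used in the same way for the ftp ports, giving two independent persistence groups on the same address.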

Wayne wrote Is there an easy way to relate servers on both port 80 and port 443 (with LVS-NAT)? Say I have two farms, each with the same three servers. One farm load balances HTTP requests and the other farm load balances HTTPS requests. To make sure that a user in persistent mode who is connected to an HTTP server always goes to the same server for HTTPS service, we would like to have some way to relate the services between the two farms. Is there an easy way to do it?

ratz ratz (at) tac (dot) ch 2001-01-03 Two possibilities to solve this with LVS:

- Use port 0 in your setup. (advantage: easy to set up and easy to understand)
- Use fwmark and group them together. (advantage: finer port granularity possible)

Example (1): Example (2):

passive ftp
You can setup passive ftp with the VIP as the target using persistence. This is not a particularly satisfactory solution, as connect requests to all ports will be forwarded. As well, if another service on the realserver fails (eg http), then all services have to be failed out together. Here's a solution to passive ftp from Ted Pavlic using fwmark. This allows setting up passive ftp independently of other services. Passive ftp listens on an unknown and unpredictable high port on the realserver. This is handled by forwarding requests to all high ports (it's still ugly, but at least this way, we can fail out ftp independently of other services).
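As a rough sketch of the idea (not the exact rules used in the test sessions), the ftp control port and all high ports on the VIP get the same fwmark, and one persistent virtual service is built on that mark. The VIP address and the function name `passive_ftp_rules` are assumptions; the scheduler, the 600 second timeout and the LVS-DR realservers follow the ipvsadm output shown in this section. The function prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch only: prints director commands, does not execute them.
VIP=192.168.2.110     # assumed VIP

passive_ftp_rules() {
    cat <<EOF
# ftp control connection
iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 21 -j MARK --set-mark 1
# passive data connections arrive on an unpredictable high port
iptables -t mangle -A PREROUTING -p tcp -d $VIP/32 --dport 1024:65535 -j MARK --set-mark 1
# one persistent service for the whole ftp port group (LVS-DR realservers)
ipvsadm -A -f 1 -s rr -p 600
ipvsadm -a -f 1 -r RS1.mack.net:0 -g -w 1
ipvsadm -a -f 1 -r RS2.mack.net:0 -g -w 1
EOF
}

passive_ftp_rules
```

Because http is not covered by mark 1, it can be given its own (non-persistent) service and failed out separately from ftp.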

test session with active ftp
Here's ftp setup in active mode, as a control.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      0          0
      -> RS1.mack.net:0    Route   1      0          0

Here's netstat -an on the client and the realserver (RS2) immediately after an ftp file transfer (with the client still connected). Only ports 20 and 21 are involved here. Here's the command line at the client during the active ftp transfer (all expected output).

    ftp> get tulip.c
    local: tulip.c remote: tulip.c
    200 PORT command successful.
    150 Opening BINARY mode data connection for tulip.c (104241 bytes).
    226 Transfer complete.
    104241 bytes received in 0.0232 secs (4.4e+03 Kbytes/sec)

The iptables rules on the director do not allow passive ftp connections. To test this, put the ftp client into passive mode.

    ftp> pass
    Passive mode on.
    ftp> dir
    227 Entering Passive Mode (192,168,2,110,4,72)
    ftp: connect: Connection refused
    ftp>

The connection is not allowed. To check that the system is still functioning, put the client back into active mode.

    ftp> pass
    Passive mode off.
    ftp> dir
    200 PORT command successful.
    150 Opening ASCII mode data connection for /bin/ls.
    total 155178
    .
    .
    -rw-r--r-- 1 root root 104241 Nov 10 1999 tulip.c
    226 Transfer complete.
    ftp>

test session with passive ftp
Here's the setup for passive ftp (2.4.x director) (you can leave ipvsadm untouched). Here's the command line from the ftp client still in active mode

    ftp> dir
    200 PORT command successful.

The session is hung, the server shows an established connection to port 21 and the client session has to be killed. Here's the passive session.

    ftp> pass
    Passive mode on.
    ftp> cd pub
    250 CWD command successful.
    ftp> dir *.c
    227 Entering Passive Mode (192,168,2,110,4,75)
    150 Opening ASCII mode data connection for /bin/ls.
    -rw-r--r-- 1 root root 104241 Nov 10 1999 tulip.c
    226 Transfer complete.
    ftp> mget *.c
    mget tulip.c? y
    227 Entering Passive Mode (192,168,2,110,4,78)
    150 Opening BINARY mode data connection for tulip.c (104241 bytes).
    226 Transfer complete.
    104241 bytes received in 0.0233 secs (4.4e+03 Kbytes/sec)
    ftp>

Here's the connections at the realserver immediately after the file transfer. There is the regular connection at the ftp port (21) and a connection timing out to a high port on the realserver. Here's the output from ipvsadm after connecting to the URL ftp://vip/ using a web-browser

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      1          5
      -> RS1.mack.net:0    Route   1      0          0

LVS with 2 groups: group 1 = ftp (active and passive), group 2 = http

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      0          0
      -> RS1.mack.net:0    Route   1      0          0
    FWM 2 rr
      -> RS1.mack.net:80   Route   1      0          0
      -> RS2.mack.net:80   Route   1      0          0

The client connected (in order) to ftp://VIP/ (passive ftp), to http://VIP/ and then by active (command line) ftp to the VIP. Here's the ipvsadm output.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      2          3
      -> RS1.mack.net:0    Route   1      0          0
    FWM 2 rr
      -> RS1.mack.net:80   Route   1      2          2
      -> RS2.mack.net:80   Route   1      4          0

Here's the connections showing on the realserver. The most recent ones are at the top of the list. The connection list shows (from the bottom, i.e. in the order of connection), passive ftp, http, and active ftp. The whole point of this setup is to make ftp and http, which belonged to one persistence group when setup on a VIP, into two groups. Now you can bring the httpd and the ftpd up and down independently (if you want to fail them out, to change the configuration or software).

fwmark with LVS-NAT (based on a posting by Horms on 14 Jul 2000)
Here we setup a LVS-NAT LVS on a 2.4.x director. (Note: With 2.4 LVS, the masquerading is setup by the ipvs code, i.e. you don't have to masquerade the packets back from the realservers). These examples assume that the VIP is on eth1 and your network is already setup (i.e. the realservers are using the director as the default gw etc). Mark packets for the VIP and setup the LVS for telnet. This first example is not going to get you anything you want.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr
      -> RS2.mack.net:23   Masq    1      0          0
      -> RS1.mack.net:23   Masq    1      0          0

You can connect with telnet to the VIP and you'll be forwarded to both realservers in the expected way. All packets from the client will be marked and processed by the ipvsadm rules. What happens if you attempt to connect to VIP:80 (pause to think)? Here's the answer. If you connect to VIP:80 using a browser for a client, it sits there showing the watch symbol for quite a while. What happened? The explanation is that you told the director to mark all packets from the client (i.e. to any port on the VIP), rewrite them to have dest_addr=RIP:telnet and forward the rewritten packets to the realserver. So when you connected to VIP:80, the packets were forwarded to RIP:23. Just to make sure that I'd interpreted this correctly, here's the first packets seen by tcpdump running on the client and the realserver during the connect attempts. (These are from different sessions, so the ports shown on the client are different.)

client: here the client is connecting to VIP:80 (lvs2.www)

    lvs2.www: S 2887976275:2887976275(0) win 5840 (DF) [tos 0x10]
    12:09:44.450453 lvs2.www > client2.1118: S 1441372470:1441372470(0) ack 2887976276 win 32120 (DF)
    12:09:44.450579 client2.1118 > lvs2.www: . ack 1 win 5840 (DF) [tos 0x10]

realserver (RS2): here the realserver is receiving packets to the RIP:23 (RS2.telnet)

    RS2.telnet: S 2722509719:2722509719(0) win 5840 (DF) [tos 0x10]
    11:44:28.319974 RS2.telnet > client2.1116: S 1283414485:1283414485(0) ack 2722509720 win 32120 (DF)
    11:44:28.320681 client2.1116 > RS2.telnet: . ack 1 win 5840 (DF) [tos 0x10]

If you want only telnet requests to be forwarded to the realservers, you should mark only packets for VIP:telnet. If you want both telnet and http forwarded then you should give them each their own mark. Here's how to setup LVS-NAT with fwmark for both telnet and http.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr
      -> RS2.mack.net:23   Masq    1      0          0
      -> RS1.mack.net:23   Masq    1      0          0
    FWM 2 rr
      -> RS2.mack.net:80   Masq    1      0          0
      -> RS1.mack.net:80   Masq    1      0          0

Here's the (expected) output of ipvsadm showing the client with 2 telnet sessions and having just downloaded a webpage from the LVS.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr
      -> RS2.mack.net:23   Masq    1      1          0
      -> RS1.mack.net:23   Masq    1      1          0
    FWM 2 rr
      -> RS2.mack.net:80   Masq    1      0          1
      -> RS1.mack.net:80   Masq    1      0          0
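A minimal sketch of rules that would give a table like the two-mark one above, with one mark per service so that http packets are never rewritten to the telnet port. The VIP address and the function name `nat_fwmark_rules` are assumptions; eth1, the rr scheduler, the realserver names, ports and Masq forwarding follow the text. The function prints the commands and does not run them.

```shell
#!/bin/sh
# Sketch only: prints director commands, does not execute them.
VIP=192.168.2.110     # assumed VIP
IFACE=eth1            # the text puts the VIP on eth1

nat_fwmark_rules() {
    mark=1
    for port in 23 80; do
        # mark only packets for this one service on the VIP
        echo "iptables -t mangle -A PREROUTING -i $IFACE -p tcp -d $VIP/32 --dport $port -j MARK --set-mark $mark"
        # virtual service keyed on the mark, LVS-NAT (-m) realservers
        echo "ipvsadm -A -f $mark -s rr"
        for rs in RS2.mack.net RS1.mack.net; do
            echo "ipvsadm -a -f $mark -r $rs:$port -m -w 1"
        done
        mark=$((mark + 1))
    done
}

nat_fwmark_rules
```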

collisions between fwmark and VIP rules
Since it's possible to write iptables rules that include many different types of packets, it's possible to write VIP and fwmark rules that would conflict by accepting the same packet. Here's a setup that would accept telnet by both VIP and fwmarks.

    RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP lvs2.mack.net:telnet rr
      -> RS2.mack.net:telnet    Route   1      0          0
      -> RS1.mack.net:telnet    Route   1      0          0
    FWM 1 rr
      -> RS2.mack.net:telnet    Route   1      0          0
      -> RS1.mack.net:telnet    Route   1      0          0

Here's the ipvsadm output after 4 telnet connections from a client

    RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP lvs2.mack.net:telnet rr
      -> RS2.mack.net:telnet    Route   1      2          0
      -> RS1.mack.net:telnet    Route   1      2          0
    FWM 1 rr
      -> RS2.mack.net:telnet    Route   1      0          0
      -> RS1.mack.net:telnet    Route   1      0          0

All connections go to the first (here VIP) entries. The same ipvsadm table and connection pattern results if you feed the VIP and fwmarks rules into ipvsadm in the reverse order. This behaviour is not part of the spec (yet). You might want to check the behaviour, if you are doing this sort of setup.

persistence granularity with fwmark

Introduction Persistence granularity was added to LVS by Lars lmb (at) suse (dot) de 1999-10-13

"This patch adds netmasks to persistent ports, so you can adjust the granularity of the templates. It should help solve the problems created with non-persistent cache clusters on the client side."

The problem being addressed is that some clients (eg AOL customers) connect to the internet via large proxy farms. The IP they present to the server will not necessarily be the same for different sessions (tcp connections), even though they remain connected to their proxy machine. Persistence granularity makes all clients from a network equivalent as far as persistence is concerned. Thus a client could appear as CIP=x.x.x.13 for their http connections, but CIP=x.x.x.14 for their https connections. With persistence granularity set to /24, all CIPs from the same class C network will be sent to the same realserver. With the default behaviour (i.e. persistence granularity is /32), all connections from the same CIP are sent to the one realserver, but connections from other CIPs in the same network are scheduled independently and may go to other realservers.

Persistence granularity is applied to the CIP and works the same whether you are using fwmark or the VIP to setup the LVS. You set the netmask (granularity) for persistence granularity with ipvsadm. If the LVS was setup with the following command, the persistence granularity is 255.255.255.0. Let's say a client from a class C network (e.g. with IP=100.100.100.2) connects to the LVS. If any other client connects from 100.100.100.0/24 they will also connect to the same realserver as long as the original client's entry in the persistence table has not expired (i.e. the first client is still connected, or disconnected < 333 secs ago).
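The grouping is just a bitwise AND of the client address with the netmask. The small shell function below illustrates that arithmetic; the name `cipnet` is made up for this sketch and is not part of ipvsadm. On the director the mask itself is given with the `-M` option of ipvsadm, e.g. something like `ipvsadm -A -f 1 -s rr -p 333 -M 255.255.255.0` (a plausible form only, with the fwmark and scheduler assumed).

```shell
#!/bin/sh
# cipnet CIP NETMASK   (hypothetical helper)
# Prints CIP & NETMASK in dotted-quad form. The body runs in a subshell
# so that changing IFS does not leak into the caller.
cipnet() (
    IFS=.
    set -- $1 $2      # split both dotted quads into $1..$8
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# Two clients in the same class C network collapse to one persistence key
cipnet 100.100.100.2  255.255.255.0     # prints 100.100.100.0
cipnet 100.100.100.77 255.255.255.0     # prints 100.100.100.0
# With the default /32 granularity every client keeps its own key
cipnet 100.100.100.2  255.255.255.255   # prints 100.100.100.2
```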

examples
Here's an example LVS-DR LVS set to mark packets for an IP on the outside of the director (this IP serves as the VIP in the usual LVS setup, but there's no such thing as a VIP with fwmarks) with --dport telnet. Persistence granularity is set to the default (-M 255.255.255.255). Two clients (192.168.2.254, 192.168.2.253) connect to the LVS. Each host connects to a different realserver, but multiple connects from each client go to the same realserver (i.e. client A always goes to realserver A; client B always goes to realserver B, at least till the persistence timeout clears). Here both clients have connected twice.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      2          0
      -> RS1.mack.net:0    Route   1      2          0

This is the connection pattern expected if the connections were based on the CIP/32 and fwmark (ie all clients are scheduled independently). Here's the same setup with persistence granularity set to /24.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600 mask 255.255.255.0
      -> RS2.mack.net:0    Route   1      0          0
      -> RS1.mack.net:0    Route   1      0          0

Here's what happens when the 2 clients, both of whom belong to the same CIP/24 persistence group, connect twice - all connections go to the same realserver.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600 mask 255.255.255.0
      -> RS2.mack.net:0    Route   1      4          0
      -> RS1.mack.net:0    Route   1      0          0

Discussion with Julian about persistence granularity Joe

I expect if you were using persistence with fwmark, then any connection requests arriving with the same fwmark will be treated as belonging to that persistence group. Presumably any combination of client IPs and/or networks could have been used to make the rules which marks the packets.

Julian Yes, it is for the same group but in one fwmark group there are many templates created. These templates are different for the client groups. The template looks like this:

    SERVICE(FWMARK/VIP):0 -> RIP:0

All ports are 0 for the fwmark-based services. So, for client 10.1.2.3/24 (24=persistent granularity) the template looks like this:

    VIP:0 -> RIP:0

LVS patched with the persistent_fwmark patch:

    FWMARK:0 -> RIP:0

So, the templates are created with CIP/GRAN in mind and the lookup uses CIPNET too. We use before creation and lookup.

so if I did only packets from 10.1.2.3 will have a fwmark on them, but the director would forward all packets from 10.1.2.0/24, even those without fwmarks?

The patched LVS will accept only the marked packets for this fwmark service, from the same /24 client subnet. If only one client IP sends packets that are marked then the real service will receive packets only from 10.1.2.3. The current LVS versions don't consider the service and all packets CIPNET -> VIP will be forwarded using the first created template for CIPNET:0->VIP:0, i.e. these packets will randomly hit one of the many services that accept packets for the same VIP (just like in your setup) and may then reach a wrong realserver.

The current LVS versions don't consider the service and all packets CIPNET -> VIP
but there is no VIP here, I'm using fwmark only. what does the -M 255.255.255.0 do in this case?

The current LVS versions (i.e. without the persistent_fwmark patch) assume the VIP is the iphdr->daddr, i.e. the destination address in the datagram, and this address is used to lookup/create the template.

how about your persistent-patch, which I've been working with?

The patch ignores this daddr when creating or looking for templates. Instead, the service fwmark value is used when the service is fwmark-based:

    CIPNET:0 -> FWMARK:0 -> RIP:0

The normal services use daddr as VIP when looking for or creating templates:

    CIPNET:0 -> daddr:0 -> RIP:0

The persistence is associated with the client address (CIP). The sequence is this:

- packet comes from CIP to VIP1
- fw marking, optional
- lookup for existing connection CIP:CPORT->VIP1:VPORT, if yes => forward, if not found:
- lookup service => fwmark 1, persistent
- try to select real service in context of the virtual service. Apply the persistence granularity to the client address netmask
- now lookup for template
    not patched: check for existing template CIPNET:0, VIP1:0
    patched: check for existing template CIPNET:0, 1(fwmark):0
- if there is a template, bind the new connection to the template's destination
- if there is no existing template, get one destination using the scheduler and bind it to the newly created template and the new connection. The created template is
    not patched: CIPNET:0, VIP1:0, DEST_RIP:0
    patched: CIPNET:0, 1(fwmark):0, DEST_RIP:0
- forward the packet
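The difference between the two lookups can be illustrated with a toy model of the template key. This is only an illustration of Julian's description, not the kernel code: the function names `cipnet` and `template_key`, the key format and the addresses are all made up for the sketch.

```shell
#!/bin/sh
# cipnet CIP NETMASK: print CIP & NETMASK (subshell keeps IFS change local)
cipnet() (
    IFS=.
    set -- $1 $2
    echo "$(( $1 & $5 )).$(( $2 & $6 )).$(( $3 & $7 )).$(( $4 & $8 ))"
)

# template_key MODE CIP NETMASK DADDR FWMARK   (hypothetical helper)
# MODE is "patched" or "unpatched"; FWMARK 0 means a normal VIP service.
template_key() {
    net=$(cipnet "$2" "$3")
    if [ "$1" = patched ] && [ "$5" -gt 0 ]; then
        echo "$net:0 -> fwmark $5:0"    # key on the service's fwmark
    else
        echo "$net:0 -> $4:0"           # key on the packet's daddr
    fi
}

# Two fwmark services that both accept packets for the same address:
template_key unpatched 10.1.2.3 255.255.255.0 192.168.2.110 1
template_key unpatched 10.1.2.3 255.255.255.0 192.168.2.110 2
template_key patched   10.1.2.3 255.255.255.0 192.168.2.110 1
template_key patched   10.1.2.3 255.255.255.0 192.168.2.110 2
```

In this model the two unpatched keys are identical, which is why packets for either service can be bound to whichever template was created first. The patched keys differ, so each fwmark service keeps its own templates.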

Persistence granularity was designed for people coming in from large proxy servers (eg AOL). With fwmarks, this can be handled by iptables rules.

Yes, the fact that we group the clients using this netmask is not related to the virtual service type: normal or fwmark-based. Yes, each different IP is treated as different client. When a netmask <32 is used, the group of addresses is treated as one client when applying the persistence rules. This is not related to the packet marking and virtual service type.

fwmark allows LVS-DR director to be default gw for realservers If a LVS-DR director is accepting packets by fwmarks, then it does not have a VIP. The director can then be the default gw for the realservers (see LVS-DR director is default gw for realservers).

fwmark simplifies configuration for large numbers of addresses If a fwmark rule accepts packets for a /24 network, then 254 IPs are configured in one instruction. The next sections are examples.

Example: firewall farm
Horms horms (at) vergenet (dot) net 2000-12-06 Assume that packets from our local network (192.168.0.0/23) are outgoing traffic. Mark all outgoing packets with fwmark 1 Where 192.168.1.7, 192.168.1.8 and 192.168.1.9 are your firewall boxen.

Example: LVS'ing a CIDR block Matthew S. Crocker wrote:

would like to put a CIDR block of addresses (/25) through my LVS server. Is there a way I can set one entry for a VIP range and then the load balancing will be handled over the entire range.

Horms horms (at) vergenet (dot) net 2001-01-13 Set up fwmark rules on the input chain to match incoming packets for the CIDR and mark them with a fwmark. e.g. Use the fwmark (1 in this case) as the virtual service.
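A sketch of what this can look like on a 2.4 director, where marking is done in the iptables mangle table rather than on the ipchains input chain. The /25 block, the realserver addresses and the function name `cidr_rules` are placeholders chosen for illustration; LVS-DR forwarding (`-g`) is also an assumption. The function prints the commands and does not run them.

```shell
#!/bin/sh
# Sketch only: prints director commands, does not execute them.
VIP_BLOCK=192.168.10.0/25     # hypothetical /25 of virtual addresses

# cidr_rules RIP...   (hypothetical helper)
cidr_rules() {
    # one rule marks packets for all addresses in the block
    echo "iptables -t mangle -A PREROUTING -d $VIP_BLOCK -j MARK --set-mark 1"
    # the fwmark, not an address, identifies the virtual service
    echo "ipvsadm -A -f 1 -s rr"
    for rip in "$@"; do
        echo "ipvsadm -a -f 1 -r $rip:0 -g -w 1"
    done
}

cidr_rules 192.168.1.11 192.168.1.12
```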

Miri Groentman, 11 Jul 2001 Is it possible to configure a range of ports rather than a single-port

Joe if you mean ports for services, yes, see fwmark in the HOWTO. You can also forward a range of IPs.

Example: forwarding based on client source IP client A (from 192.x.x.x) should go to realserver 1..3, and client B (from 10.x.x.x) should go to realserver 4..6. (Julian, 10-05-2000) Write fwmark rules based on the source IP of the packets. Then create two virtual services, one for each fwmark.

Example: load balancing multiple class C networks Ian Courtney wrote:

Basically here at our ISP, we tend to have 2-3 Class C's worth of hosting per server. We would like to move to the LVS, but I'm not exactly sure how I should be setting it up.

Chris chris (at) isg (dot) de 2001-01-15 You can use the fwmark option for the loadbalancing. The router should point to the director. Ian Courtney wrote back:

It didn't work until I aliased all 3 class C's to my director. Do I have to do this?

Julian Anastasov ja (at) ssi (dot) bg 2001-01-16 Yes, only the packets destined for local addresses/networks are accepted. The others are dropped or forwarded to another box.

the next project involves redoing our standard linux web space, which so far consists of about 8 webservers, each hosting at least 2 class C's worth of hosting. I somehow don't think Linux will take nicely to having 16 or more class C's aliased to it.

If possible use netmask <24. I assume you execute (replace with the right Class C nets): on the director and on each realserver and solve the arp problem using:

    echo 1 > /proc/sys/net/ipv4/conf/all/hidden
    echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

in the realservers. If you don't want to advertise these addresses using ARP to the Cisco LAN, you can execute the above two commands in the director too.

Example: proxy server Thomas Proell, 16 Aug 2000

How do you use fwmark if you want the director to accept packets for a wide range of addresses, for which it doesn't have IPs?

(Horms) Here's a setup I used... I have used 192.168/16, but these could be real addresses too. I have only put one proxy server in the diagram but I did test it with 2. Interestingly enough, if you add a proxy that just forwards traffic then it will end up going direct. This may be useful as a failback server if the proxy servers fail. The -m 1 means that IPVS will recognise packets matched by this filter as belonging to the virtual service as long as it sees the packets as local. -j REDIRECT 80 makes the packets appear as local. It is of note that the port you redirect to is _ignored_ because of the way IPVS works - packets using fwmark are sent to the port they arrived on. This means that packets will be sent to proxy servers as port 80 traffic. Note, this is where the redirection to port 8080 takes place.

Example: transparent web cache
Pongsit (at) yahoo (dot) samart (dot) co (dot) th May 08, 2000 If I would like to use LVS to balance 3 transparent proxies, is this how I do it?

Horms horms (at) vergenet (dot) net 2000-05-08 If you want to do transparent proxying then I would suggest a topology more along the lines of: Use ipchains to mark all outgoing port 80 traffic, other than from the 3 proxy servers, with firewall mark 1 (ipchains -m 1...). Set up an IPVS virtual service matching fwmark 1 (ipvsadm -A -f 1...). The proxy servers will need to be set up to recognise all port 80 traffic forwarded to them as local. This way all outgoing traffic hits the LVS box. If it is for port 80 and isn't from one of the proxy servers then it gets load balanced and forwarded to one of the proxy servers. You may want to consider a hot standby LVS/DR host to eliminate a single point of failure on your network. I haven't tested this but I think it should work.

Example: Multiply-connected router
(Joe: my initial -apparently incorrect- reaction was that routing protocols would handle this better.)

Martin Sk?tt martin (at) xenux (dot) dk 19 Jun 2001 I have several ADSL connections to the Internet (same ISP) and I want all the users on my network to be using them. I would like it to work in a way so that all the lines are utilised all the time and without assigning groups of users to specific gateways. What I want to do is assign one default gateway, the LVS box. Joe

..for doing it by LVS, you could set up a director to be a router and set it up as if it were in front of 4 squid boxes (you'll need the IPs of the other end of the ADSL link). There's an example proxy above.

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr

I tried that kind of setup some time ago. I have tested 4 different topologies:

- Using a dynamic routing protocol like BGP. With BGP you can put costs on your routing paths. To set up a multipath Internet connection using BGP, all the ISPs connected to your BGP setup must be asked to add BGP on their side. This setup is recommended by ISPs for corporate Internet use. It is mostly expensive due to the reconfiguration of the ISP's routers.
- Implementing a loadsharing topology like the one described in the "Linux 2.4 advanced routing HOWTO" section 9.5. Here you need to use the same ISP for all your Internet connections because your ISP must implement the symmetric config. This means that the ISP must support linux 2.4 loadsharing over multiple interfaces. This is rarely implemented by ISPs because it is much more attractive for them to implement a vendor's integration, which is more expensive. This is my feedback in France :/
- Setting up a router with multiple default gateways. That way you will loadbalance by TCP conversation. I have only implemented this on CISCO; you are limited to the maximum number of default gateways implemented (3 or 4 for CISCO).
- Implementing the solution described in the LVS HOWTO (above): loadbalancing a squid server pool, with each squid directly connected to one of your ADSL lines.

Personally, I prefer the LVS solution, which is much easier and is recommended because it is independent of the ISP configuration. I have tested that on a RTSP proxy pool.

httpd clients (browsers) Initially when testing you should use a non-persistent (in the netscape sense) client, e.g. telnet VIP 80, or lynx VIP. Or else revert to these if you don't understand what you're seeing with netscape.

Opera
Peter Mastren Peter (dot) Mastren (at) chron (dot) com 18 Dec 2001 For the past several weeks, we have experienced almost daily denial of service attacks/events on our www servers. A remote client somewhere has opened a number of TCP connections to LVS that have absolutely no traffic whatsoever, save a single keepalive packet every two minutes. I have seen as few as 3 and as many as 120 connections in the various incidents over the weeks. These open connections are counted in the algorithm LVS uses to schedule servers, so the server that has all these open connections receives proportionately fewer new connections, in most cases taking the target server completely out of rotation.

Yesterday, I noticed an event coming from 130.80.XXX.XXX, our firewall address. Three connections were being held open from a machine inside our network. The culprit was my own workstation. I killed my browser and the connections went away. I fired up my browser again and tried to retrace my steps to duplicate the situation. To make a long story short, it appears that Opera version 6.0 beta will leave a connection open to a server even after the window that was used for that connection has been closed. The only time the connection is closed is when Opera exits. I will submit a problem report to Opera. In the meantime, there could be hundreds if not thousands of beta Opera browsers out there that could lock up ports on our servers for hours or days or longer.

This morning I made a configuration change to LVS that seems to have solved the problem. The masquerading portion of the Linux kernel (using LVS-NAT) uses default times to keep connections open, one for TCP connections, one for closed TCP connections that have received a FIN, and one for UDP connections. These defaults are 15, 2, and 5 minutes respectively. I changed the TCP timeout from 15 minutes to 110 seconds, which is shorter than the two minute interval at which the keepalive packets occur, yet long enough for any imaginable connection to a web server. The change I made was:

Wensong Aug 2002, for 2.4 kernels
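As a hedged sketch of how such a timeout change can be expressed: on 2.2 kernels the masquerading timeouts are set with `ipchains -M -S tcp tcpfin udp`, and ipvsadm on 2.4 kernels has `--set tcp tcpfin udp`. In both, a value of 0 leaves that timeout unchanged. The helper name `timeout_cmd` is made up for this example, and the function only prints the command for the given kernel series.

```shell
#!/bin/sh
# timeout_cmd KERNEL TCP_SECONDS   (hypothetical helper)
# Prints the command that sets the TCP timeout and leaves the
# TCP-FIN and UDP timeouts unchanged (0 = keep current value).
timeout_cmd() {
    case "$1" in
    2.2) echo "ipchains -M -S $2 0 0" ;;
    2.4) echo "ipvsadm --set $2 0 0" ;;
    *)   echo "unsupported kernel series: $1" >&2
         return 1 ;;
    esac
}

timeout_cmd 2.2 110
timeout_cmd 2.4 110
```

110 seconds is the value from Peter's posting, chosen to be shorter than the two minute keepalive interval.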

Example: dynamically generated images in webpages One of the assumptions of setting up an LVS is that the content presented on the realservers is identical. This is required because the client can be sent to any of the realservers. This requirement is not handled if the client fills in a form which produces a gif on the realserver. Alois Treindl alois (at) astro (dot) ch 30 Apr 2001

If a page is created by a CGI and contains dynamically created GIFs, the requests for these gifs will land on a different realserver than the one where the cgi runs. Will I need persistence? I am running an astrology site; a typical request is to a CGI which creates an astrological drawing, based on some form data; this drawing is stored as a temporary GIF file on the server. A html page is output by the CGI which contains a reference to this GIF. The browser receives the html, and then requests the GIF file from the server. It will mostly hit a different server than the one who created the GIF. So either we make sure that the new client request for the GIF hits the same realserver which ran the CGI (i.e. have persistence) or we must create the GIF on a shared directory, so that each realserver sees it. I have not tested it yet (not ported the CGIs yet to the new LVS box) but I think things are not so simple. In a 'rr' scheduling configuration, for example, the scheduler could play dirty, depending on the number of http requests for the given page, and the number of realservers. Both could be incommensurable in a way that the http request for the GIF never reaches the same realserver as the one which ran the CGI request. I had already decided that I need shared directories between all realservers for our CGI environment which does computationally expensive things all the time. Some CGIs create also data files which are used by later CGIs. It is either shared directories for such files, or a shared database (which we also use). These temp files will be sitting in the RAM cache of the NFS server, so that only network bandwidth between the realservers and the NFS server is the limiting factor. This is why I give the NFS server 2 gb of RAM, the max it will physically take, and this is why I chose 2.2.19 as the kernel because it contains NFS-3, which is said to be faster than NFS-2.

(Joe) I tested it here on a page which generates a gif for the client. I found that I could never get the gif. Presumably after downloading the page containing the reference to the gif, the round robin scheduler sends the request for the gif to another realserver. Presumably even page counters will have this problem. Writing to a shared directory should work. Here's a solution with persistent fwmark using ip_tables to setup on a 2.4.x kernel. (Note: for page counters, this method will increment for each realserver, and not for the total page count over all the realservers as would happen with a shared directory.)

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      0          0
      -> RS1.mack.net:0    Route   1      0          0

Here's the output of ipvsadm after the successful generation and display of the dynamically generated gif. Note all connections went to one realserver.

    RemoteAddress:Port     Forward Weight ActiveConn InActConn
    FWM 1 rr persistent 600
      -> RS2.mack.net:0    Route   1      5          3
      -> RS1.mack.net:0    Route   1      0          0

Example: Balancing many IPs/services as one block
The simplest LVS balances requests to a VIP:port amongst a group of realservers. If you are servicing many VIPs, then few requests may be present for any particular IP at any time and a disproportionate number of requests will be sent to the first realserver. In this case you should balance all the different IPs as one group.

Josh Marcus josh (at) serve (dot) com 02 Oct 2001 I'm using LVS to serve a few thousand domains, but I don't see how I can setup LVS to load balance all of the domains as if they were all a single ip. In my ideal world, I would have a single entry *:80 that would forward all of our ips at port 80 to our set of realservers, and load balance all requests coming in. The way LVS is working for us now, the vast majority of all of our requests are going to the server that is for some reason being listed first. Only sites with heavy traffic get pushed along to the other servers.

Michael E Brown michael_e_brown (at) dell (dot) com

fwmarks

Example: Source controlled LVS - services and realserver customised by Client IP
In a LVS, you may want requests from a certain IP/netmask to be forwarded to one set of realservers/services (which may be a subset of the total realservers, or may be other dedicated realservers), while the rest of the requests are forwarded normally to the whole LVS. Or another way of putting it... You may want 2 (or more) LVSs setup on the one director, with one of the LVSs accepting only packets from an IP/netmask, while the rest of the requests go to the other LVS. Peter Mueller pmueller (at) sidestep (dot) com 18 Apr 2002

Source-controlled routing gives us a few advantages:

* When clients inside our company launch the client for our product (Sidestep), we want that client to redirect automatically to staging. Going to staging directly means it is easier to test code, etc. This is a small advantage and is merely the "proving grounds" or first step.
* One of our customers has a proxy server java-code caching problem (their client doesn't work) and we want to steer them to a server that won't have the problem. Unfortunately the customer is not technically competent, and we'd like to avoid changing anything at their end.
* It'd be nice to redirect our competitors/delinquent customers to a machine that had incorrect or out of date information. Surely most companies would think this is a cool feature!
* It is advantageous to have more control in case of mishap.

Julian supplied the recipe, which ends with a normal LVS setup for the remaining clients.

Here's the implementation for iptables. Armin.Haken Armin (dot) Haken (at) Sun (dot) COM 10 Feb 2006 How to forward packets based on source address

Using the fwmarks in iptables you can create ipvs rules to forward packets to particular realservers or groups of realservers based on source address or source network. I got this information out of a December 2002 post to the lvs-users group by Ratz.

Here is an example using LVS-NAT with 2 realservers with RIPs 10.1.1.1 and 10.1.1.2. The first realserver serves clients on the 10.0.1.X network and the 10.0.5.X network, while the other realserver serves clients on 10.0.2.X. The VIP is 10.0.1.1. Packets destined for server 1 get mark 1 and packets destined for server 2 get mark 2. The counters of matched rules can be listed with iptables, and ipvs forwards based on the marks.

The iptables rules also allow you to specify the protocol or interface of the packets you mark and you can use negations, specify port numbers, etc. If a packet matches several of the rules, the marks get overwritten, so the last matching rule determines the mark. For failover you could either configure multiple realservers per fwmark or put in a system that changes the marking rules or forwarding rules once a failed realserver is detected.
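The three steps in Armin's example (mark by source network, list the rule counters, forward by mark) can be sketched as follows. This is a reconstruction under assumptions, not Armin's original commands: the /24 netmasks for the "10.0.N.X" networks and the rr scheduler are guesses, while the addresses and the use of LVS-NAT come from the text. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the rules.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

VIP=10.0.1.1

mark_by_source() {
    # Clients on 10.0.1.X and 10.0.5.X get mark 1 (served by 10.1.1.1).
    run iptables -t mangle -A PREROUTING -s 10.0.1.0/24 -d "$VIP" -j MARK --set-mark 1
    run iptables -t mangle -A PREROUTING -s 10.0.5.0/24 -d "$VIP" -j MARK --set-mark 1
    # Clients on 10.0.2.X get mark 2 (served by 10.1.1.2).
    run iptables -t mangle -A PREROUTING -s 10.0.2.0/24 -d "$VIP" -j MARK --set-mark 2
}

show_counters() {
    # -v adds per-rule packet and byte counters, -n skips name lookups.
    run iptables -t mangle -L PREROUTING -v -n
}

forward_by_mark() {
    # One fwmark virtual service per mark, realservers by LVS-NAT (-m).
    run ipvsadm -A -f 1 -s rr
    run ipvsadm -a -f 1 -r 10.1.1.1 -m
    run ipvsadm -A -f 2 -s rr
    run ipvsadm -a -f 2 -r 10.1.1.2 -m
}

mark_by_source
show_counters
forward_by_mark
```

Because a later matching rule overwrites the mark, the order in which the marking rules are appended matters when source networks overlap.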

Appendix 1: Specifications for grouping of services with fwmarks Here are the discussions that resulted in the current specifications for handling of persistence with fwmarks in LVS. Ted Pavlic Jul 14, 2000

What I was asking about would be something like this: I have 1029 virtual servers -- that is I have 1029 hosts which need to be load balanced.

Horms horms (at) vergenet (dot) net 2000-07-14 (fwmark) has the advantage of simplifying the amount of _kernel_ configuration that has to be done, which is a big win, even if this is automated by a user space application. The basic idea is that this provides a means for LVS to have virtual services that have more than one host/port/protocol triplet. In your situation this means that you can have a single virtual service that handles many virtual IP addresses and all ports and protocols (UDP, TCP and ...).

You should take a look at ultramonkey (note from Joe, April 2001, UM is now 1.0.2, look for examples there). My understanding is that this is quite similar to how your LVS topology will be set up, though I understand you will be having more than one of these configured. Basically what happens is that you set up LVS to treat any packets carrying a given fwmark as a virtual service, configured like other LVS virtual services other than that no VIP is specified, e.g.

  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 rr
  -> 192.168.6.3:80              Masq    1      0          0
  -> 192.168.6.2:80              Masq    1      0          0

The other half of the equation is that ipchains is used to match incoming traffic for virtual IP addresses and mark them with fwmark 1. Say you have 8 contiguous class C's of virtual addresses beginning at 192.168.0.0/24. A single ipchains rule that matches the destination 192.168.0.0/21 and sets mark 1 covers all of them.

You also need to set up a silent interface so that the LVS box sees traffic for the VIPs as local. To do this use:

  echo 1 > /proc/sys/net/ipv4/conf/all/hidden
  echo 1 > /proc/sys/net/ipv4/conf/lo/hidden

Now, as long as 192.168.0.0/21 is routed to the LVS box, or more particularly the floating IP address of the LVS box brought up by heartbeat, traffic for the VIPs will be routed to the LVS box, the ipchains rules will mark it with fwmark 1 and LVS will see this fwmark and consider the traffic as destined for a virtual service. Ted Jul 14, 2000
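As a rough sketch of Horms' two halves, here is the marking rule and the fwmark virtual service. Horms' example used ipchains on a 2.2 kernel; the iptables form below is a substitute for 2.4 and later kernels, not his original command. The small helper only checks the arithmetic that 8 class C networks make a /21. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the rules.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Number of class C (/24) networks contained in a prefix of the given length.
class_c_count() {
    echo $(( 1 << (24 - $1) ))
}

mark_vips() {
    # 8 contiguous class C networks starting at 192.168.0.0/24 are 192.168.0.0/21.
    run iptables -t mangle -A PREROUTING -d 192.168.0.0/21 -j MARK --set-mark 1
}

fwmark_service() {
    # Virtual service with no VIP, only a mark; realservers by LVS-NAT (Masq).
    run ipvsadm -A -f 1 -s rr
    run ipvsadm -a -f 1 -r 192.168.6.3:80 -m -w 1
    run ipvsadm -a -f 1 -r 192.168.6.2:80 -m -w 1
}

class_c_count 21
mark_vips
fwmark_service
```

The marking rule has no protocol or port match, so TCP, UDP and ICMP traffic for the whole block is covered by the one rule.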

for me to enable persistent connections to every port using direct routing, would this work?

Horms Yes, that would work. The port in the "ipvsadm -a" commands is ignored if the realservers are being added to a fwmark service. Connections will be sent to the same port on the realserver that they were received on at the virtual server. So port 80 traffic will go to port 80, port 443 traffic will go to port 443 etc... As a caveat you should really make sure that your ipchains statements catch all traffic for the given addresses including ICMP traffic so ICMP traffic is handled correctly by LVS.

(Julian on catching ICMP traffic) IIRC, this is already not a requirement in the last LVS versions. If we look in skb->fwmark for ICMP packets it is impossible to use normal and fwmark virtual services to same VIP because we can't create such ipchains rules. The good news is that in 2.4 (0.0.3) the virtual service lookup (the fwmark field) is used only for the new connections. In 2.2 the service is looked up even for existing entries but we don't want to break the MASQ code entirely.

Ted Pavlic tpavlic (at) netwalk (dot) com 19 Jul 2000 When using fwmark to assign realservers to virtual servers, how is scheduling and persistence handled? In my particular example, I have 216.69.196.0/22 (i.e. 4 class C networks) all marked with a fwmark of 1. The ipvsadm setup is

  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
  -> nw01:0                      Route   1      0          0
  -> nw02:0                      Route   1      0          0

Say someone connects to 216.69.196.1 and the connection is assigned to nw01. At this point ipvsadm shows

  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
  -> nw01:0                      Route   1      1          0
  -> nw02:0                      Route   1      0          0

A new person connects to another IP in 216.69.196.0/22 (say 216.69.196.2). Will this new connection to 216.69.196.2 go to nw02 because it has the least number of TOTAL connections, or will it go to nw01 because for that PARTICULAR IP, both have 0 connections?
Now then say that the person who just connected to 216.69.196.1 makes a connection (within the 600 persistence seconds) to 216.69.196.3. Will this new connection go to nw01 because it's being persistent? Or will it go to either server depending on the number of connections? Here's what I think would be the best way to do things... If multiple IPs are marked with FWMARK 1, LVS should consider them all one entry in its active/inactive table. I don't believe that's how things are currently being handled. (Julian) The templates are not accounted in the active/inactive counters.

(Joe, almost a year later - Julian, what do you mean here?) (Julian 13 Apr 2001) Ted here thinks that the templates are accounted in the inactive/active counters. And before the persistent-fwmark patch we can have many templates for one fwmark-based service:

  CIPNET:0 -> VIP1:0 -> RIP_A:0
  CIPNET:0 -> VIP2:0 -> RIP_B:0

where VIP1 and VIP2 are marked with the same fwmark. Ted recommends these two templates to be replaced with one, i.e. just like in the persistent-fwmark patch:

  CIPNET:0 -> FWMARK:0 -> RIP1:0

We can't see the templates (which are normal connection entries with some reserved values in the connections structure fields) accounted in the inactive/active counters. The reason for this is that the inactive/active counters are used to represent the realserver load but our templates don't lead to any load in the realservers, we use them only to maintain the persistence.

When a service is marked persistent all connections from CIP to VIP go to the same RIP for the specified period. Even for the fwmark based services. This works for many independent VIPs. The other case is fwmark service covering a DNS name. I expect comments from users with SSL problems and persistent fwmark service. Is there a problem or maybe not? I agree, maybe both cases can be useful:

  1. CIP->VIP
  2. CIP->FWMARK

Any examples where/why (2) is needed? But switching the LVS code always to use (2) for the persistent fwmark services is possible.

(Ted) In my opinion, here are some pros and cons of case 2:

Pros: Improves scheduling, I think, and true load balancing. If someone is using [W]RR or [W]LC, the LVS box will actually look at the realservers as a whole rather than separate realserver entries for EACH VIP. Does that make sense? For example, in my particular configuration I have over one thousand VIPs which are load balanced onto four RIPs. When I configure the LVS server to use LC scheduling, I'd like it to look at how many TOTAL connections are being made to each RIP not how many connections are being made to each RIP PER VIP. I would like to load balance all one thousand VIPs as a WHOLE onto the four RIPs rather than load balance EACH VIP. That is, in some of my less active sites, most of their traffic will probably hit one VIP just because not much traffic will need to be load balanced. However, more active sites will hit both servers. The load will then not be distributed equally among the servers as one server will probably get not only the active traffic but also the less active traffic and the other server will only get the more active traffic (in the case of having two RIPs).

Cons: One person on the Internet will keep connecting to the same RIP for many different VIPs if persistence is turned on.
If this causes a problem, the LVS administrator can do one of two different things: 1) Rather than load balancing a fwmark template, go back to load balancing specific VIPs. The scheduling will then be unique for those particular VIPs. 2) Create multiple fwmark templates. The scheduling for each template will be unique. In my opinion if you group a bunch of IPs together by marking them with an fwmark, that you say that you want to load balance all of those COLLECTIVELY -- almost like load balancing one site. I'm just saying, are there any examples where CIP->FWMARK is not needed? As far as the LVS is concerned, if someone connects to a VIP marked with fwmark 1, it should treat it just like every other VIP marked with fwmark 1 -- as if they were all one VIP. But today on my LVS (where I have a ten minute persistence setup) I connected to one virtual server marked with fwmark 1 and got a certain real server. I then expected to connect to another virtual server also marked with fwmark 1 and get that same realserver. I did not, however. If what you're telling me is correct, the persistence should have connected me to the same realserver as long as I was connecting within that ten minute window. Now in this particular example -- connecting to DIFFERENT virtual servers -- it isn't so necessary for persistence to be carried through PER virtual server. I'm just worried that least connection scheduling and round-robin scheduling aren't working at the fwmark level -- I'm worried that they are working at the VIP level as if I had setup hundreds of explicit VIP rules inside IPVSADM.

Julian I hope this feature (2) will be implemented in the next LVS version (if Wensong don't see any problems). I.e. the templates can be changed to case (2) for the persistent fwmark services. For now we (I and Horms) don't see any problems after this change. Then connections from one client IP to different VIPs (from the same fwmark service) will go to the same realserver (only for the persistent fwmark services).

Do you see any reason why enabling CIP->FWMARK for all cases would be a bad thing? That is, not only using case 2 for persistent fwmark, but just whenever fwmark was used. Personally, I cannot ever foresee a scenario when a person would set up an fwmark for load balancing and want each VIP associated with that fwmark to act independently.

Web cluster for independent domains (VIPs). fwmark service is used only to reduce the amount of work for configuration.

I've always thought that the scheduling algorithms should look directly at the realservers rather than the realserver stats for each particular virtual server. That is, least connection scheduling would look at the total number of connections on a realserver, not just the connections from that particular VIP. Round-robin would go round-robin from realserver to realserver based on the last connection from ANY VIP to the realservers... However, before fwmark I realized that this would probably be very difficult to do, especially in cases where an LVS administrator was load balancing to a number of different realserver clusters that may overlap.

This is a job for the user space tools: WRR scheduling method + weights derived from the realserver load. Yes, one realserver can be loaded from:

* many directors
* many virtual services
* other processes not part of the real service

In this case the director's opinion (for each virtual service) about the realserver load is wrong. The only way to handle such a case properly is to use the WRR method. In the other cases WLC, LC and RR can do their job.
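A minimal sketch of the kind of user space tool Julian describes: the service is scheduled with WRR and a script rewrites each realserver's weight from its measured load. The weight formula (100 / (1 + load), never below 1) is purely hypothetical, as is the way the load figure reaches the director; the realserver names are borrowed from Ted's setup. The script prints the ipvsadm commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the changes.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# Hypothetical mapping from load to weight, in integer arithmetic.
# $1 is the load average times 100 (e.g. 250 for a load of 2.50).
weight_from_load() {
    w=$(( 10000 / (100 + $1) ))
    if [ "$w" -lt 1 ]; then w=1; fi   # weight 0 would quiesce the realserver
    echo "$w"
}

# update_weight <fwmark> <realserver> <load times 100>
update_weight() {
    run ipvsadm -e -f "$1" -r "$2" -g -w "$(weight_from_load "$3")"
}

run ipvsadm -A -f 1 -s wrr      # the fwmark service must use the wrr scheduler
update_weight 1 nw01 50         # lightly loaded realserver, larger weight
update_weight 1 nw02 300        # heavily loaded realserver, smaller weight
```

Since the weight is computed from the realserver's own load, it reflects work arriving from other directors, other virtual services and unrelated processes, which the director's connection counters cannot see.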

fwmark, to me, just by causing all VIPs marked with a particular fwmark to look like one big VIP makes it possible to do basically that which I just described. I don't see why anyone would not want such functionality with the fwmark services. If one did want such functionality, he would probably partition the VIPs associated with his fwmark into separate fwmarks or even explicit VIP entries anyway.

Yes. IMO, this can be a problem only for the balancing but I don't think so. The problems will come when one realserver dies and the client can't access any VIP part from the fwmark service for a period of time.

Appendix 2: Demonstration of grouping services with fwmarks Here's the original e-mail between Ted tpavlic (at) netwalk (dot) com 3 Aug 2000 and Joe. One of the things fwmarks let me do is make ports sticky by groups. Basically I set up ipchains rules that say all packets to ports 80/tcp and 443/tcp mark with a 1. All packets to ports 20/tcp and 21/tcp as well as 1024:65535/tcp mark with a 2. Voila... I just made ports stick by groups. I then go into IPVS and set up my realservers under FWMARK1 and FWMARK2. Ports 80 and 443 are now persistent as a group just as 20 and 21 and 1024:65535 are persistent as a group. If my HTTP goes down on one of my realservers, I do not have to take my FTP down as well. I only have to remove the realserver from the FWMARK1 group. It's great!
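Ted's two port groups can be sketched like this. Ted used ipchains; the rules below are an iptables substitute for 2.4 and later kernels, and the destination network 216.69.196.0/22 is borrowed from his earlier posting as an assumption about what was being marked. The last function shows taking one realserver out of the HTTP/HTTPS group only. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the rules.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

# mark_port_groups <destination network>
mark_port_groups() {
    # Group 1: HTTP and HTTPS.
    run iptables -t mangle -A PREROUTING -p tcp -d "$1" \
        -m multiport --dports 80,443 -j MARK --set-mark 1
    # Group 2: FTP data and control ports, plus the high ports for passive FTP.
    run iptables -t mangle -A PREROUTING -p tcp -d "$1" \
        -m multiport --dports 20,21 -j MARK --set-mark 2
    run iptables -t mangle -A PREROUTING -p tcp -d "$1" \
        --dport 1024:65535 -j MARK --set-mark 2
}

# remove_from_group <fwmark> <realserver>: leaves the other groups untouched.
remove_from_group() {
    run ipvsadm -d -f "$1" -r "$2"
}

mark_port_groups 216.69.196.0/22
remove_from_group 1 nw02        # HTTP died on nw02; FTP (fwmark 2) keeps running
```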

Joe most people don't program their own on-line transaction processing program and the point of an LVS is for the realservers to be running the same code as when they're stand alone.

My users run PHP scripts as well as ASPs that keep session information. That session information is unique per server and usually is stored in a local /tmp directory. Users are handed cookies which tie them to their session information. If they go to the wrong realserver, that session information won't exist and a number of things could go wrong. most of my realservers run a lot of services... HTTP, HTTPS, FTP, SMTP, POP3, IMAP, DNS, And when one of them went down (with persistence set up), I would have to take the entire realserver down. Several problems: *) One little thing goes down... POP3, for example. Now the load increases a great deal on all my other realservers... Perhaps causing the load to become so high that sendmail starts rejecting connections... and then THAT realserver also is taken COMPLETELY down... domino effect. If I could have just taken POP3 down off of that server, it would have been perfect. *) Say something horrible happens causing sendmail to go down on all the servers... or HTTP... or POP3... any one service -- just as long as it goes down on all servers. Rather than just causing that service to be affected, ALL of my services go down because every realserver was taken completely off-line until that ONE service is fixed. :( But I figured that those two problems wouldn't be that big of a deal... I could probably put such a system in production. Well -- I put such a system in production and those problems weren't that big of a deal... Except for a COUPLE of times when all services went down and caused a BIG hassle. So my superiors wanted something better -- needed in fact. So at first I came up with the interim idea of separating persistent services and non-persistent services by IP. All of my persistent services were basically on one supernet and all of my non-persistent services were on another subnet. Consequently, I could tie the one supernet to one FWMARK and the other subnet to another FWMARK. 
Now if a persistent service went down, it would bring down only all of the persistent services. Also, if a non-persistent service went down, it would only bring down all of the non-persistent services. This was definitely an interim solution because it required a lot more IPs than any one administrator should need, and it still was far from perfect. BUT... I started to realize that just as I could mark different supernets and subnets with different FWMARKs, I could go farther down the TCP/IP layers and mark things at their protocol and port level. That's where I realized that we COULD do persistence by port group just with a little help from ipchains.

Joe I asked Horms if there was any point in having multiple fwmarks. His only example was if you had duplicate sets of realservers. Eg the paying customers get the fast servers, while the people coming into the free site get the 486 with 16M.

Similar idea here... except rather than setting up your policies like:

  Paying customers -> fast server
  Free -> slow server

You have:

  SMTP -> a realserver
  POP3 -> another realserver
  HTTP/HTTPS -> yet another realserver
  FTP -> and another realserver

The key of it all is the fact that you can group by about any parameter that ipchains can see. If ipchains can segregate it, you can group it. Anything that ipchains can do IPVS can then add onto itself.

Joe Have you solved passive ftp without using persistence?

I really don't think there's any way to get around it... In order to get passive FTP to work, you need to make TCP port 21 persistent with every TCP port above 1024. I mean -- how else could you do it without putting some big brother software inside of LVS which would keep an eye on FTP and see what port it tells the end-user to connect to. Still, putting 21 and 1024:65535 together is a lot better than putting everything together. Personally I only plan on load balancing things in the <1024 range anyway, so I have no problem including that huge group above 1024. This is my setup

  FWMARK1 => HTTP/HTTPS (persistent)
  FWMARK2 => FTP (persistent)
  FWMARK3 => SMTP
  FWMARK4 => POP3
  FWMARK5 => DOMAIN
  FWMARK6 => IMAP
  FWMARK7 => ICMP (for kicks)

IP Virtual Server version 0.9.12 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
FWM  1 lc persistent 600
  -> nw04:0                      Route   1      58         121
  -> nw03:0                      Route   1      49         76
  -> nw02:0                      Route   1      60         98
  -> nw01:0                      Route   1      61         44
FWM  2 lc persistent 600
  -> nw04:0                      Route   1      0          2
  -> nw03:0                      Route   1      0          2
  -> nw02:0                      Route   1      1          13
  -> nw01:0                      Route   1      1          0
FWM  3 lc
  -> nw04:0                      Route   1      4          11
  -> nw03:0                      Route   1      4          12
  -> nw02:0                      Route   1      3          20
  -> nw01:0                      Route   1      3          16
FWM  4 lc
  -> nw04:0                      Route   1      3          54
  -> nw03:0                      Route   1      1          74
  -> nw02:0                      Route   1      3          51
  -> nw01:0                      Route   1      2          73
FWM  5 lc
  -> nw03:0                      Route   1      0          46
  -> nw01:0                      Route   1      0          44
  -> nw02:0                      Route   1      0          45
  -> nw04:0                      Route   1      0          45
FWM  6 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      1          0
  -> nw01:0                      Route   1      0          0
FWM  7 lc
  -> nw04:0                      Route   1      0          0
  -> nw03:0                      Route   1      0          0
  -> nw02:0                      Route   1      0          0
  -> nw01:0                      Route   1      0          0

Is this anything new?

Joe It's new to me and Horms didn't have any other ideas for multiple fwmarks 3 weeks ago, so I expect it will be new to him.

I've been thinking of ways of combining different programs which already exist out there to get L7 scheduling working. For example -- you have some program (sorta like policy routing but one more layer up) that filters packets at the application layer and does something to them... routes them to a particular IP... something like that... and then have ipchains mark each one of those packets with a particular mark... and have LVS work from there. You see -- using multiple fwmarks makes me think that you can do a lot more with LVS. We could probably borrow some of the ideas used for some of the dynamic routing protocols, like BGP or RIP. A master could advertise its IPVS hash table. If it didn't advertise within a given interval of time, other LVS's could take over. During the failover, rather than trading an IP like we were talking about, all LVSs could know which one is the active one and ICMP redirect to that LVS or something like that. Right now I'm routing every virtual server through the active LVS. This lets me do a lot of nifty things (for me at least): * Very little has to happen on the LVS during failovers. They basically just trade an IP. In fact -- I COULD do the failover right at the router before the LVS's -- just have it route to another IP. * I do not have to bring every IP up on my realservers -- I just have to bring the network that they're on up on a hidden loopback device. When you route an entire network to a loopback device, the loopback device answers every IP on that subnet automatically. So even with 1024+ IPs, I have to setup very few interfaces/aliases because a great deal of them are on the same subnet.

Appendix 3: Announcement of grouping services with fwmarks Ted Pavlic tpavlic (at) netwalk (dot) com 4 Aug 2000 Periodically the issue comes up regarding wanting to do persistence by groups of ports. Until now, an LVS administrator could make a single port persistent or all ports persistent.

Single port persistence was nice for quite a few things. However, things like HTTP and HTTPS caused complications with it. Someone who connected to a webpage on HTTP and started a session tied to them with a cookie would want to return to that same realserver when they went to the HTTPS version of that site. FTP would also cause a problem with single-port persistence as someone who wanted to use passive FTP wouldn't be guaranteed the same server when they returned on a random TCP port above 1024. There are other examples as well.

So the solution to these problems would be to make every port persistent. This works pretty well, but now anytime a user of a large network behind a firewall would connect to a realserver on ANY service, everyone behind that firewall would hit that same realserver. Plus, if an administrator wanted to stop scheduling a single service to a single realserver, he would have to take all services down on that single realserver. This causes many problems as well... especially if one small service dies on every realserver -- brings down every service on every realserver.

So there has been the need for persistence by port GROUPS. Rather than saying all ports are persistent, it would be nice to tell LVS to tie just 80/tcp and 443/tcp together or just 21/tcp and 1024:65535/tcp together. Before the wonderful FWMARK additions to LVS, this was not possible. But now that LVS listens to FWMARKs, it becomes possible to group ports together inside ipchains with different FWMARKs and then tell LVS to listen to those FWMARKs. For example, one can set up rules inside ipchains to do this...
  80/tcp, 443/tcp         --> FWMARK1
  21/tcp, 1024:65535/tcp  --> FWMARK2
  25/tcp                  --> FWMARK3
  110/tcp                 --> FWMARK4

Then inside LVS (assume on this setup all of these services are served by the same realserver cluster), say:

  FWMARK1 -> PERSISTENT -> real1, real2, real3, real4
  FWMARK2 -> PERSISTENT -> real1, real2, real3, real4
  FWMARK3 -> real1, real2, real3, real4
  FWMARK4 -> real1, real2, real3, real4

Not only have you now set up persistence by port groups, but you've also split your services back up into autonomous services that will not bring EVERY server down for the sake of persistence. If FTP goes down on real1, only FTP needs to stop being scheduled to real1.
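The LVS half of this layout, and the "stop scheduling FTP to real1" step, could look roughly like the sketch below. The wlc scheduler, the 600 second timeout and direct routing (-g) are assumptions, since the announcement only says PERSISTENT; setting a weight of 0 is one way to stop new connections being scheduled to a realserver while leaving it in the table. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the rules.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

setup_services() {
    # Each entry is mark:timeout, where a timeout of 0 means not persistent.
    for entry in 1:600 2:600 3:0 4:0; do
        mark=${entry%%:*}
        timeout=${entry##*:}
        if [ "$timeout" -gt 0 ]; then
            run ipvsadm -A -f "$mark" -s wlc -p "$timeout"
        else
            run ipvsadm -A -f "$mark" -s wlc
        fi
        for rs in real1 real2 real3 real4; do
            run ipvsadm -a -f "$mark" -r "$rs" -g -w 1
        done
    done
}

# stop_scheduling <fwmark> <realserver>: weight 0 means no new connections.
stop_scheduling() {
    run ipvsadm -e -f "$1" -r "$2" -g -w 0
}

setup_services
stop_scheduling 2 real1         # FTP is down on real1; fwmarks 1, 3 and 4 unaffected
```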

another explanation Ted Pavlic tpavlic (at) netwalk (dot) com 2000-09-15 Using fwmark, you can set up something which used to be a big desire in LVS, persistence by port groups. For example... Say you were serving HTTP and HTTPS. In this case, you would probably want calls to one HTTP server to end up hitting the same HTTPS server. This way session information and such would be accessible no matter how the end-user was accessing the website. Say you also wanted all forms of FTP to work... You would need persistence there, but not necessarily the same persistence as HTTP/HTTPS. And other protocols do not need to be persistent.

Back in the olden days before fwmark, to do any of this you would have to make ALL ports persistent. You couldn't simply say "Group 80 and 443 together and make them persistent and then make 21, 20, AND 1024:65535 persistent." If one service went down, you would have to bring down ALL services. Some sort of persistence by port groups would allow you to only need to take down whatever went down and the affected server could still serve other services. FWMARK allows you to do this by way of setting up multiple FWMARKs. That is -- you can use ipchains to say that:

  HTTP, HTTPS --> FWMARK1
  FTP         --> FWMARK2
  SMTP        --> FWMARK3
  POP         --> FWMARK4

Then in LVS, set up:

  FWMARK1 --> WLC Persistent 600
  FWMARK2 --> WLC Persistent 300
  FWMARK3 --> WLC
  FWMARK4 --> WLC

And if FTP went down, all you'd have to do is stop scheduling FTP rather than stop scheduling EVERYTHING. Also note that FWMARK makes setting up MASS VIPs really easy (of course because of recent ARIN policy changes, this probably won't be done much anymore). That is, if you wanted to load balance 1000 VIPs, it might be easy to set up one single rule in ipchains to cover them all, where it would be 1000 rules for EACH realserver in ipvsadm.
It makes me think that if there was a utility already out there that could sit on a director and figure out where name-based packets were going it might be able to mark each name-based host with a different FWMARK and pass that right back to LVS... Then LVS wouldn't have to worry about handling name-based stuff ITSELF. Of course the name-based challenge is even more challenging considering how much data needs to be looked at to figure out if a TCP stream is a name-based HTTP session going to specific name X.... but that's a completely other argument... Just food for thought.

fwmark examples from the mailing list Simone Sestini, September 23, 2003

I would like to use director and a backup director as a realserver too. I would like to run http and https on the backup director/realserver how can I configure more than one https domain for each server ? Apache need to use a unique IP for each https domain

Matthew Crocker matthew (at) crocker (dot) com 24 Sep 2003 Search some of the archives a bit. I handle my HTTPS servers with LVS-DR going through my LVS director. The actual web servers are not on the Internet. Here is what I do.

1. Set up keepalived/VRRP to handle a VIP failover between two LVS boxes.
2. Set up a static route in the upstream router for a /24 to the VIP IP address.
3. Set up netfilter/iptables to mark packets with dest = the /24 and dport = 443 and 80 with fwmark 0x1.
4. Set up LVS to load balance FWM 1 using LVS-DR to the realservers' internal IPs (192.168.x.y).
5. Set up the LVS servers (presumably directors - Joe) to treat packets with FWM 1 as local.
6. Set up the realservers to listen on each IP in the /24.
7. Set up apache with SSL certs for each /24 IP address.
8. Point DNS records for https servers to unique IPs in the /24.

This works great for me. Only packets in the /24 that are marked with the firewall mark actually hit the LVS server and/or the realservers. All other packets are not treated as local by the LVS server and will be routed to its default route, which will create a routing loop. If you ping/traceroute it will look broken, but if you telnet to port 80 on one of the IPs you will get an answer. This also eliminated any ARP issues because the realservers are not on the same LAN segment as the LVS directors and the router doesn't ARP for the /24 IPs anyway because of the static route. Most of the configs for steps 3,4,5 are in the archives from a couple of months ago.
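Steps 3, 4 and 5 of Matthew's list might look roughly like this on the director. None of these commands come from his posting: the /24 (192.0.2.0/24), the realserver addresses, the wlc scheduler and the routing table number 100 are all placeholders chosen for the sketch. Step 5 uses the policy-routing approach of sending marked packets to a table whose only route is a local one. The script prints the commands unless APPLY=1 is set.

```shell
#!/bin/sh
# Dry run by default: each command is printed instead of executed.
# Set APPLY=1 and run as root on the director to really apply the rules.
run() {
    if [ "${APPLY:-0}" = "1" ]; then "$@"; else echo "$@"; fi
}

NET=${NET:-192.0.2.0/24}        # hypothetical /24 that is routed to the VIP

# Step 3: mark http and https packets addressed to the /24.
mark_web() {
    run iptables -t mangle -A PREROUTING -p tcp -d "$NET" \
        -m multiport --dports 80,443 -j MARK --set-mark 0x1
}

# Step 4: balance the marked packets with LVS-DR to the internal RIPs.
lvs_dr() {
    run ipvsadm -A -f 1 -s wlc
    for rip in 192.168.1.11 192.168.1.12; do    # hypothetical realserver IPs
        run ipvsadm -a -f 1 -r "$rip" -g
    done
}

# Step 5: deliver marked packets locally so that ip_vs gets to see them.
accept_marked_locally() {
    run ip rule add prio 100 fwmark 1 table 100
    run ip route add local 0/0 dev lo table 100
}

mark_web
lvs_dr
accept_marked_locally
```

Packets to the /24 that do not carry the mark never match the rule, so they follow the main routing table, which is the behaviour Matthew describes for ping and traceroute.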

LVS: Transparent proxy (TP or Horms' method) Horms worked out that transparent proxy could be used in an LVS. Transparent proxy is a piece of Linux kernel code which allows a packet destined for an IP _not_ on the host, to be accepted locally, as if the IP was on the host. Transparent proxy is the mechanism by which you can make a director (or realserver) work without having the VIP configured on it.

Transparent proxy allows the realserver to solve the arp problem. The director sends the packets to the MAC address of the realserver; transparent proxy tells the realserver to accept the packet with dst_addr=VIP (even though this IP is not on the realserver); since there is no VIP on the realserver, it does not reply to arp queries for the VIP. Without the VIP on a machine, methods other than the normal IP routing are required to deliver packets with dst_addr=VIP.

A VIP-less way of setting up an LVS is fwmark. There an incoming packet is marked and the mark (rather than a VIP) is used to forward the packet. In the case of using a fwmark on a director, the packet still has to be accepted on the director. For this you need the VIP on the outside NIC or you need TP. It would be nice for an LVS if a packet with a fwmark that is in the ipvsadm table (i.e. this is a packet to be forwarded by ip_vs) could be accepted by the node without having to also put the VIP on the node (and without using TP), by a modification of the LVS code. In principle this is possible and Julian would write it, if he thought it was going to be used. At the moment I'm the only one asking for it.

Feb 2003: The TP implementation for stock 2.4.x kernels behaves differently from the 2.2 one. For 2.4 the packet is accepted locally with the primary IP of the NIC, rather than the VIP, as for 2.2 kernels. This makes 2.4 TP unusable for LVS directors, although it still works fine for web-caches (i.e. squids), its original purpose.
On talking to Harald Welte at the 2001 Ottawa Linux Symposium, I learned that there had been much discussion on the netfilter mailing lists as to whether to preserve the original behaviour. Since no-one (that they knew about) needed the original behaviour, that functionality was dropped. It seems too late to restore the functionality to netfilter now. Some of the functionality that LVS wants out of TP is available by other means and so the issue is probably moot now, and we're not going to ask the netfilter people to restore the original TP functionality for 2.4 kernels. It's possible to patch the code for LVS, but this would require someone to keep track of the netfilter code for each version of the kernel. RedHat has patched its kernels to restore the TP functionality and Ratz maintains patches for the standard kernel (see below). Most of the writeup for 2.4 kernels in this section was my efforts to find out what was happening with 2.4 TP. This section of the HOWTO will have to be rewritten when people start using the 2.4 TP patches.

Take-home lesson: TP only works for LVS (directors and realservers) on 2.0 and 2.2 kernels. For 2.4 (and higher) TP only works on realservers, where it lets LVS handle the arp problem.

Note: web caches (proxies) can operate in transparent mode, when they cache all IPs on the internet. In this mode, requests are received and transmitted without changing the port numbers (i.e. port 80 in and port 80 out). In a normal web cache, the clients are asked to reconfigure their browsers to use the proxy, some_IP:3128. It is difficult to get clients to do this, and the solution is transparent caching. This is more difficult to set up, but all clients will then use the cache. In the web caching world, transparent caching is often called "transparent proxy" because it is implemented with transparent proxy.
In the future, it is conceivable that transparent web caching will be implemented by another feature of the tcpip layer and it would be nice if functionality of transparent web caching had a name separate from the command that is used to implement it.

setting up routing and packet delivery to the director To use TP in an LVS, packets from the client have to be delivered to a machine which does not have the IP of the dst_addr of the client's packets (i.e. the VIP). Read the part of the section on routing and delivery concerned with routing packets to machines without the dst_addr.

General This is Horms' (horms (at) vergenet (dot) net) method (also called the transparent proxy or TP method). It uses the transparent proxy feature of ipchains to accept packets with dst=VIP on a host (director or realserver) which doesn't have that IP (e.g. the VIP) on a device. It can be used on the realservers (where it handles the Arp problem) or on the director to accept packets for the VIP. When used on the director, TP allows the director to be the default gw for LVS-DR.

Unfortunately the 2.2 and 2.4 versions of transparent proxy are as different as chalk and cheese in an LVS. Presumably the functionality has been maintained for transparent web caching but the effect on LVS has not been considered.

You can use transparent proxy for
- 2.2.x: director and realservers
- 2.4.x: realservers only (where it handles the Arp problem)

(Historical note from Horms:) From memory I was getting a cluster ready for a demo at Internet World, New York which was held in October 1999. The cluster was to demo all sorts of services that Linux could run that were relevant to ISPs. Apache, Sendmail, Squid, Bind and Radius I believe. As part of this I was playing with LVS-DR and spotted that the realservers couldn't accept traffic for the VIP. I had used Transparent Proxying in the past so I tried it and it worked. That cluster was pretty cool, it took me a week to put it together and it was an ISP in an albeit very large box.

Transparent proxy is only implemented in Linux.
- 2.2.x: you need IP masquerading, transparent proxying and IP firewalls turned on.
- 2.4.x: TP is a standard part of the kernel build, there is no separate TP option. In the netfilter options, there are the options under "Full NAT (NEW)" MASQUERADE, REDIRECT. I suspect you need all these.

Julian: Transparent proxy support calls ip_local_deliver from where the LVS code is reached.

One of the advantages of this method is that it is easy for a director and realserver to exchange roles in a failover setup.

How you use TP This is a demonstration of TP using 2 machines: a realserver (which will accept packets by TP) and a client (i.e. this is not an LVS).

On the realserver: ipv4 forwarding must be on (/proc/sys/net/ipv4/ip_forward).

You want your realserver to accept telnet requests on an IP that is not on the network (say 192.168.1.111). Here's the result of commands run at the server console before running the TP code, confirming that you can't ping or telnet to the IP.

So add a route and try again (lo works here, eth0 doesn't).

This shows that you can connect to the new IP from the localhost. No transparent proxy is involved yet. If you go to another machine on the same network and add a route to the new IP, raw sockets work between the client and server. However you can't ping (i.e. icmp doesn't work) or telnet to that IP from the other machine. Here's the output of tcpdump running on the target host

tip.mack.net.telnet: S 1088013012:1088013012(0) win 32120 (DF) [tos 0x10]
14:09:09.791205 realserver.mack.net > client.mack.net: icmp: time exceeded in-transit [tos 0xd0]

(Anyone have an explanation for this, apart from the fact that icmp is not working? Is the lack of icmp the only thing stopping the telnet connect?)

The route to 192.168.1.111 is not needed for the next part. Now add transparent proxy to the server to allow the realserver to accept connects to 192.168.1.111:telnet. This is the command for 2.2.x kernels

telnet => telnet
Chain forward (policy ACCEPT):
Chain output (policy ACCEPT):

redirecting any port at all In the normal functioning of an LVS, once the packet has been redirected, the director steps in and sends it to the realservers and the reply comes from the realservers. However you can use the REDIRECT to connect with a socket on a different port independently of the LVS function.

Joe, 4 Jun 2001 If I have 2 boxes (not part of an LVS) and on the server box I run then I can telnet to port 81 on the realserver box and have a normal telnet session. I watched with tcpdump on the server and all I see is a normal exchange of packets with dest-port=81. I thought with REDIRECT that the packet with dest-port=81 was delivered to the listener on realserverIP:telnet. How does the telnetd know to return a packet with source-port=telnet?

Julian

This is handled from the protocol, TCP in this case: The higher layer (telnet in this case) can obtain the two dest addr/ports by using getsockname(). In 2.4 this is handled additionally by using getsockopt(...SO_ORIGINAL_DST...) The netfilter mailing list contains examples on this issue. You can search for "getsockname"
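The two calls Julian mentions can be tried from user space. The following is a minimal sketch, not code from the mailing list: `loopback_demo()` and `original_dst()` are hypothetical helper names, the loopback connection stands in for a redirected telnet session, and `SO_ORIGINAL_DST` is given as the numeric value 80 from `<linux/netfilter_ipv4.h>` because Python's socket module does not export it. `getsockname()` reports the address the daemon is actually answering from; `original_dst()` only returns something useful on Linux for a connection that really passed through a netfilter NAT rule.

```python
import socket
import struct

# Value of SO_ORIGINAL_DST in <linux/netfilter_ipv4.h>; Linux only.
SO_ORIGINAL_DST = 80


def original_dst(conn):
    """Return (ip, port) that the client originally addressed.

    Only meaningful on Linux for a TCP connection that was rewritten by a
    netfilter NAT rule (e.g. REDIRECT); otherwise the kernel raises an error.
    """
    # The kernel returns a 16 byte sockaddr_in: family, port, address, padding.
    raw = conn.getsockopt(socket.IPPROTO_IP, SO_ORIGINAL_DST, 16)
    port, packed_ip = struct.unpack("!2xH4s8x", raw)
    return socket.inet_ntoa(packed_ip), port


def loopback_demo():
    """Accept one TCP connection on 127.0.0.1 and report both ends of it."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0: let the kernel pick a free port
    listener.listen(1)
    listen_addr = listener.getsockname()

    client = socket.create_connection(listen_addr)
    conn, _peer = listener.accept()

    result = {
        "listening_on": listen_addr,
        "local_end": conn.getsockname(),    # address the daemon replies from
        "remote_end": conn.getpeername(),   # the client's address and port
        "client_local": client.getsockname(),
    }
    for sock in (conn, client, listener):
        sock.close()
    return result


if __name__ == "__main__":
    for name, addr in loopback_demo().items():
        print(name, addr)
```

A daemon behind REDIRECT would call `getsockname()` on the accepted socket to learn the local address and port, and (on 2.4 and later) `original_dst()` to learn what the client asked for.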

For 2.4.x kernels

Chain POSTROUTING (policy ACCEPT)
target prot opt source destination
Chain OUTPUT (policy ACCEPT)
target prot opt source destination

You still can't ping the transparent proxy IP on the server from the client. The transparent proxy IP on the server will accept telnet connects but not requests to other services.

Conclusion: The new IP will only accept packets for the specified service. It won't ping and it won't accept packets for other services.
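For readers who want to try the 2.4.x demonstration, a REDIRECT rule in the nat table's PREROUTING chain is the usual way to accept one service on an address the host does not own. The sketch below only builds the argument list and prints it; nothing is executed. The function name `redirect_rule`, the path `/sbin/iptables` and the device `eth0` are assumptions for illustration, and the rule form is a common one rather than a confirmed copy of the command used in this HOWTO.

```python
def redirect_rule(ip, service, in_dev="eth0", iptables="/sbin/iptables"):
    """Build (but do not run) an iptables REDIRECT rule for one TCP service.

    ip      -- address to accept without configuring it on a device
    service -- port number, port range such as "0:65535", or a name
               from /etc/services such as "telnet"
    """
    return [
        iptables, "-t", "nat", "-A", "PREROUTING",
        "-i", in_dev,            # only packets arriving on this device
        "-d", ip,                # addressed to the IP we do not own
        "-p", "tcp",
        "--dport", str(service),
        "-j", "REDIRECT",        # deliver to the local socket instead
    ]


if __name__ == "__main__":
    # Dry run: print the command so it can be inspected before use as root.
    print(" ".join(redirect_rule("192.168.1.111", "telnet")))
```

Printing first makes it easy to check the rule before running it as root; the rule can be removed again by replacing `-A` with `-D`.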

The original 2.2 TP setup method Here's a script to run on 2.2.x realservers/directors to set up Horms' method. This is incorporated into the configure script.

# i.e. with the same gateway and IP as if it were in the LVS, but DO NOT
# put the VIP on the realserver. The realserver will only have its regular IP
# (called the RIP in the HOWTO).
#3. Edit "user configurable" stuff below
#4. Run this script
#-----------------------------------------------------
#user configurable stuff
IPCHAINS="/sbin/ipchains"
VIP="192.168.1.110"
#services can be represented by their name (in /etc/services) or a number
#SERVICES is a quoted list of space separated strings
# eg SERVICES="telnet"
# SERVICES="telnet 80"
# SERVICES="telnet http"
#Since the service is redirected to the local device,
#make sure you have SERVICE listening on 127.0.0.1
#
SERVICES="telnet http"
#
#----------------------------------------------------
#main:
#turn on IP forwarding (off by default in 2.2.x kernels)
echo "1" > /proc/sys/net/ipv4/ip_forward
#flush ipchains table
$IPCHAINS -F input
#install SERVICES
for SERVICE in $SERVICES
do
    {
    echo "redirecting ${VIP}:${SERVICE} to local:${SERVICE}"
    $IPCHAINS -A input -j REDIRECT $SERVICE -d $VIP $SERVICE -p tcp
    }
done
#list ipchain rules
$IPCHAINS -L input
#rc.horms----------------------------------------------

Here's the conf file for a LVS-DR LVS using TP on both the director and the realservers. This is for a 2.2.x kernel director. (For a 2.4.x director, the VIP device can't be TP - TP doesn't work on a 2.4.x director).

Here's the output from ipchains -L showing the redirects for just the 2.2.x director

telnet => telnet
REDIRECT tcp ------ anywhere lvs2.mack.net any -> telnet => telnet
REDIRECT tcp ------ anywhere lvs2.mack.net any -> www => www
REDIRECT tcp ------ anywhere lvs2.mack.net any -> www => www
Chain forward (policy ACCEPT):
Chain output (policy ACCEPT):

Transparent proxy for 2.4.x (and presumably 2.6.x) For 2.4.x kernels transparent proxy is built on netfilter and is installed with ip_tables (not ipchains as with 2.2.x kernels). You need ip_tables support in the kernel and the ip_tables module must be loaded. The ip_tables module is incompatible with the ipchains module (which in 2.4.x is available for compatibility with scripts written for 2.2.x kernels). If present, the ipchains module must be unloaded. You shouldn't be running ipchains on 2.4.x kernels anymore and you should have changed over to ip_tables. Unfortunately the transparent proxy that comes with 2.4 kernels does not work for LVS. The packet arrives locally with the IP of the NIC which accepts the packet, rather than with an unchanged IP (the VIP). This still allows a squid to work, but is useless for LVS. The netfilter people didn't realise that someone (i.e. LVS) had found a use for the original behaviour and it was dropped from the 2.4 code. Balazs Scheidler bazsi (at) balabit (dot) hu has written a netfilter patch which restores the original functionality of tproxy, for the firewall Zorp (note: no-one has tested it with LVS yet). Here is Balazs' 2.4 transparent proxy patches README. (In previous HOWTO's, I incorrectly attributed the patch to Ratz. My apologies to Balazs. Ratz has written a tproxy patch for LVS as part of his job, but he is not allowed to release the code - it seems I confused the two patches.) Mike McLean mikem (at) redhat (dot) com 04 Dec 2002

The patch for 2.4 kernels should be shipped by RedHat. If not please file a bug at bugzilla.redhat.com.

If RedHat is patched with Balazs' code, then it is possible that it has been tested with LVS (RedHat doesn't necessarily test their released code). (Dec 2002).

Nearly all the following section is me figuring out that TP for 2.4 doesn't work for LVS. It will have to be rewritten as Balazs' patches are incorporated into LVS. (Mar 2006, seems like no one is using them.)

The command for installing transparent proxy with iptables for 2.4.x came from looking in Daniel Kiracofe's drk (at) unxsoft (dot) com Transparent Proxy with Squid mini-HOWTO and guessing the likely command. It turns out to be (where $SERVICE = telnet, $SERVER_NET_DEVICE = eth0).

Here's the result of installing the VIP by transparent proxy on one of the realservers. This works fine for the realserver, allowing it to accept packets for the VIP without having the VIP on a device (e.g. lo, eth0).

With the problems of 2.4 kernel TP for the VIP on the director, people seem to have forgotten that TP will still allow the realserver to accept packets for the VIP, solving the Arp problem. Bill Omer rediscovered this a few years later.

Bill Omer bill (dot) omer (at) gmail (dot) com 2 Mar 2006

Here's my setup with all the nitty gritty. I'm using rhel3as, I have all of the lvs portions of the kernel compiled as modules. The director part is straightforward: The realserver(s) /proc/sys/net/ipv4/ip_forward

As far as ports 0:65535 goes, I know it's a security risk. It's as secure as the RIPs themselves. I plan on having about 30-40 thin clients boot up over the network (PXE, which I'd like to in time be lvs'd) to an xdm. After I get some stress testing done and pinpoint some more bugs here and there, I'll narrow down the port range to be a lil more compliant with rudimentary security measures. However, everything is being run over a local lan and nothing is exposed to the wild wild web.

If you do the same with TP on the director, setup for an LVS with (say) telnet forwarded in the ipvsadm tables, then the telnet connect request from the client is accepted by the director, rather than forwarded by ipvs to the realservers (tcpdump sees a normal telnet login to the director). Apparently ipchains is sending the packets to a place that ipvs can't get at them. Joe

I have got TP to work on a LVS-DR telnet 2.4 realserver with the command When I put the VIP onto the director this way, the LVS doesn't work. I connect to the director instead of the realservers. ipvsadm doesn't show any connections (active or otherwise) If I run the same command on the director, with ipvsadm blank (ie no LVS configured), then I connect to the director from the client (as expected) getting the director's telnet login. I presume that I'm coming in at the wrong place in the input chain of the director and ipvsadm is not seeing the packets?

Julian I haven't tried tproxy in 2.4 but in theory it can't work. The problem is that netfilter implements tproxy by mangling the destination address in the prerouting. LVS requires from the tproxy implementation only to deliver the packet locally and not to alter the header. So, I assume LVS detects the packets with daddr=local_addr and refuses to work.

Netfilter maintains a sockopt SO_ORIGINAL_DST that can be used from the user processes to obtain the original dest addr/port before they are mangled in the pre routing nat place. This can be used from the squids, for example, to obtain these original values.

If LVS wants to support this broken tproxy in netfilter we must make a lookup in netfilter to receive the original dst and then again to mangle (for 2nd time) the dst addr/port. IMO, this is very bad and requires LVS always to require netfilter nat because it will always depend on netfilter: LVS will be compiled to call netfilter functions from its modules. So, the only alternative remains to receive packets with advanced routing with fwmark rules.

There is one problem in 2.2 and 2.4 when the tproxy setups must return ICMP to the clients (they are internal in such setup), for example, when there is no realserver LVS returns ICMP DEST_UNREACH:PORT_UNREACH. In this case both kernels are mute and don't return the ICMP. icmp_send() drops it. I contacted Alexey Kuznetsov, the net maintainer, but he claims there are more such places that must be fixed and "ip route add table 100 local 0/0 dev lo" is not a good command to use. But in my tests I don't have any problems, only the problem with dropped ICMP replies from the director.

So, for TP, I'm not sure if we can support it in the director. Maybe it can work for the realservers and even when the packet is mangled I don't expect performance problems but who knows.
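Julian's argument can be illustrated with a toy model. This is not kernel code: the `Packet` class, the three functions and the director address 192.168.1.9 are hypothetical, chosen only to show why a virtual service keyed on the VIP stops matching once the destination address has been rewritten before the lookup.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    dport: int


def tp_2_2(pkt, nic_ip):
    """2.2 style TP: deliver locally, leave the header untouched."""
    return pkt


def redirect_2_4(pkt, nic_ip):
    """2.4 netfilter REDIRECT: dst is rewritten to the NIC's primary IP."""
    return replace(pkt, dst=nic_ip)


def ipvs_lookup(pkt, virtual_services):
    """Return the realservers for (dst, dport), or None if nothing matches."""
    return virtual_services.get((pkt.dst, pkt.dport))


if __name__ == "__main__":
    VIP, DIP = "192.168.1.110", "192.168.1.9"   # illustrative addresses
    services = {(VIP, 23): ["RS1", "RS2"]}       # telnet forwarded for the VIP
    request = Packet(src="192.168.1.254", dst=VIP, dport=23)

    print("2.2:", ipvs_lookup(tp_2_2(request, DIP), services))
    print("2.4:", ipvs_lookup(redirect_2_4(request, DIP), services))
```

In the model the 2.4 packet finds no virtual service and would be handed to the director's own telnetd, which matches the behaviour reported above (the client gets the director's login and ipvsadm shows no connections).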

Experiments showing that 2.4TP is different to 2.2TP These experiments were conducted with 2.2 or 2.4 kernel realservers accepting packets for the VIP by TP. I initially noticed that the connection to 2.4 realservers was not delayed by identd (which is running on my realservers). What was happening was that the realserver was accepting the packet at the RIP and generating the reply from the RIP, rather than the VIP. On my setup, the RIP is routable to the client and the client probably received the identd request directly from the realserver (I didn't figure out what was going on for a while after I did this. I originally thought this had something to do with identd). Here's the data showing that TP behaves differently for 2.2 and 2.4 kernels. If you want to skip ahead, the piece of information you need is that the IP of the packet when it arrives on the target machine by TP, is different for 2.2 and 2.4 TP. As we shall see, for 2.2.x the TP'ed packets arrive on the VIP, while for 2.4.x, the TP'ed packets arrive on the RIP.

Realserver, Linux 2.4.2 kernel accepting packets for VIP on lo:110, is delayed Here's the tcpdump on the realserver (RS2) for a telnet request delayed by authd (the normal result for LVS). Realserver 2.4.2 with Julian's hidden patch, director 0.2.5-2.4.1. The VIP on the realserver is on lo:110. Note: all packets on the realserver are originating and arriving on the VIP (lvs2) as expected for a LVS-DR LVS.

lvs2.telnet: S 461063207:461063207(0) win 32120 (DF) [tos 0x10]
21:04:46.611841 lvs2.telnet > client2.1174: S 3724125196:3724125196(0) ack 461063208 win 5792 (DF)
21:04:46.612272 client2.1174 > lvs2.telnet: . ack 1 win 32120 (DF) [tos 0x10]
21:04:46.613965 client2.1174 > lvs2.telnet: P 1:28(27) ack 1 win 32120 (DF) [tos 0x10]
21:04:46.614225 lvs2.telnet > client2.1174: . ack 28 win 5792 (DF)
realserver makes authd request to client
21:04:46.651500 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 (DF)
21:04:49.651162 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 (DF)
21:04:55.651924 lvs2.1061 > client2.auth: S 3738365114:3738365114(0) win 5840 (DF)
after delay of 10secs, telnet request continues
21:04:56.687334 lvs2.telnet > client2.1174: P 1:13(12) ack 28 win 5792 (DF)
21:04:56.687796 client2.1174 > lvs2.telnet: . ack 13 win 32120 (DF) [tos 0x10]

Realserver, Linux 2.4.2, accepting packets for VIP by TP, is not delayed Here's the tcpdump on the realserver (RS2) for a telnet request which connects immediately. This is not the normal result for LVS. Realserver 2.4.2 with Julian's hidden patch (not used), director 0.2.5-2.4.1. Packets on the VIP are being accepted by TP rather than on lo:0 (the only difference). Note: some packets on the realserver (RS2) are arriving and originating on the VIP (lvs2) and some on the RIP (RS2). In particular all telnet packets from the CIP are arriving on the RIP, while all telnet packets from the realserver are originating on the VIP. For authd, all packets to and from the realserver are using the RIP.

RS2.telnet: S 4245054245:4245054245(0) win 32120 (DF) [tos 0x10]
20:56:43.639209 lvs2.telnet > client2.1169: S 3234171121:3234171121(0) ack 4245054246 win 5792 (DF)
20:56:43.639654 client2.1169 > RS2.telnet: . ack 3234171122 win 32120 (DF) [tos 0x10]
20:56:43.641370 client2.1169 > RS2.telnet: P 0:27(27) ack 1 win 32120 (DF) [tos 0x10]
20:56:43.641740 lvs2.telnet > client2.1169: . ack 28 win 5792 (DF)
realserver makes authd request to client
20:56:43.690523 RS2.1057 > client2.auth: S 3231319041:3231319041(0) win 5840 (DF)
20:56:43.690785 client2.auth > RS2.1057: S 4243940839:4243940839(0) ack 3231319042 win 32120 (DF)
20:56:43.691125 RS2.1057 > client2.auth: . ack 1 win 5840 (DF)
20:56:43.692638 RS2.1057 > client2.auth: P 1:10(9) ack 1 win 5840 (DF)
20:56:43.692904 client2.auth > RS2.1057: . ack 10 win 32120 (DF)
20:56:43.797085 client2.auth > RS2.1057: P 1:30(29) ack 10 win 32120 (DF)
20:56:43.797453 client2.auth > RS2.1057: F 30:30(0) ack 10 win 32120 (DF)
20:56:43.798336 RS2.1057 > client2.auth: . ack 30 win 5840 (DF)
20:56:43.799519 RS2.1057 > client2.auth: F 10:10(0) ack 31 win 5840 (DF)
20:56:43.799738 client2.auth > RS2.1057: . ack 11 win 32120 (DF)
telnet connect continues, no delay
20:56:43.835153 lvs2.telnet > client2.1169: P 1:13(12) ack 28 win 5792 (DF)
20:56:43.835587 client2.1169 > RS2.telnet: . ack 13 win 32120 (DF) [tos 0x10]

Evidently TP on the realserver is making the realserver think that the packets arrived on the RIP, hence the authd call is made from the RIP. As it happens in my test setup, the client can connect directly to the RIP. (In a LVS-DR LVS, the client doesn't exchange packets with the RIP, so I haven't blocked this connection. In production, the router would not allow these packets to pass). Since the authd packets are between the RIP and CIP, the authd exchange can proceed to completion.

Realserver, Linux 2.2.14, accepting packets for VIP by TP, is delayed Here's the tcpdump on the realserver (RS2) for a telnet request delayed by authd (the normal result for LVS). Realserver 2.2.14, director 0.2.5-2.4.1. Packets on the VIP are being accepted by TP rather than on lo:0. Note: TP is different in 2.2 and 2.4 kernels. Unlike the case for the 2.4.2 realserver, the packets all arrive at the VIP (lvs2).

lvs2.telnet: S 707028448:707028448(0) win 32120 (DF) [tos 0x10]
22:16:23.407955 lvs2.telnet > client2.1177: S 3961823491:3961823491(0) ack 707028449 win 32120 (DF)
22:16:23.408385 client2.1177 > lvs2.telnet: . ack 1 win 32120 (DF) [tos 0x10]
22:16:23.410096 client2.1177 > lvs2.telnet: P 1:28(27) ack 1 win 32120 (DF) [tos 0x10]
22:16:23.410343 lvs2.telnet > client2.1177: . ack 28 win 32120 (DF)
authd request from realserver
22:16:23.446286 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 (DF)
22:16:26.445701 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 (DF)
22:16:32.446212 lvs2.1028 > client2.auth: S 3966896438:3966896438(0) win 32120 (DF)
after delay of 10secs, telnet proceeds
22:16:33.481936 lvs2.telnet > client2.1177: P 1:13(12) ack 28 win 32120 (DF)
22:16:33.482414 client2.1177 > lvs2.telnet: . ack 13 win 32120 (DF) [tos 0x10]

What IP are TP packets arriving on? Note: for TP, there is no VIP on the realservers as seen by ifconfig. Since telnetd on the realservers listens on 0.0.0.0, we can't tell which IP the packets have on the realserver after being TP'ed. tcpdump only tells you the src_addr after the packets have left the sending host.

Here's the setup for the test. The IP of the packets after arriving by TP was tested by varying the IP (localhost, RIP or VIP) that the httpd listens to on the realservers. At the same time the base address of the web page was changed to be the same as the IP that the httpd was listening to. The nodes on each network link can route to and ping each other (e.g. 192.168.1.254 and 192.168.1.12).

The results (LVS-DR LVS) are

For 2.2.x realservers
- the httpd can bind to the VIP, RIP and localhost.
- LVS client gets webpage if realserver is listening to RIP or VIP.
- LVS client does not get webpage if realserver is listening to localhost.

For 2.4.x realservers
- httpd can bind to the RIP and localhost.
- httpd cannot bind to the VIP.
- LVS client gets webpage if realserver is listening to RIP.
- LVS client does not get webpage if realserver is listening to localhost.

During tests, the browser says "connecting to VIP", then says "transferring from..."
- LVS-DR, VIP on TP, kernel 2.4.2, "transferring data from RIP"
- LVS-DR, VIP on TP, kernel 2.2.14, "transferring data from VIP" (or RIP)
- LVS-DR, VIP on lo:0, httpd listening to VIP, "transferring data from VIP"
- LVS-Tun, VIP on tunl0:0, httpd listening on VIP, "transferring from VIP"
- LVS-NAT, httpd listening on RIP, "transferring data from realserver1" (or realserver2)

Some of these connections are problematic. The client in a LVS-DR LVS isn't supposed to be getting packets from the RIP. What is happening is
- the httpd on the realserver is listening on the RIP
- the base address of the webpage is the RIP
- an incoming request from the client to the VIP will retrieve a webpage with references to gif etc that are at the RIP
- the client will then ask for the gifs from the RIP.
- in the above setup that I use for testing, the client does not request packets from the RIP.
- in the above setup, the client can connect to the RIP directly (this will not be allowed in a production server, either the router will prevent the connection, or the RIP will be a non-routable IP).
- the client retrieves the gifs, and the rest of the page, directly from the realserver

The way to prevent this is to remove the route on the client to the RIP network (e.g. see removing routes not needed for LVS-DR). Doing so when the httpd is listening to the RIP and the base address is the RIP causes the browser on the client to hang. This shows that the client is really retrieving packets directly from the RIP. Changing the base address of the webpage back to the VIP allows the webpage to be delivered to the client, showing that the client is now retrieving packets by making requests to the VIP via the director.

It would seem then that with 2.4 TP, the realserver is receiving packets on the RIP, rather than the VIP as it does with 2.2 TP. With a service listening to only 1 port (e.g. httpd) then
- the httpd has to listen on the RIP
- the addresses on the webpage have to be for the VIP

The client will then ask for the webpage at the VIP. The realserver will accept this request on the RIP and return a webpage full of references to the VIP (e.g. gifs). The client will then ask for the gifs from the VIP. The realserver will accept the requests on the RIP and return the gifs.

Take home lesson for setting up TP on realservers

2.2.x Let httpd listen on VIP or RIP, return pages with references to VIP

2.4.x Let httpd listen on RIP, return pages with references to VIP.
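If pages on a 2.4.x realserver were written with the RIP as their base address, the references have to be pointed back at the VIP. The following is only a sketch of one way to check or fix a page: `point_links_at_vip` is a hypothetical helper, and the addresses 192.168.1.12 (RIP) and 192.168.1.110 (VIP) are used purely for illustration. It rewrites absolute http/https URLs and leaves relative links alone.

```python
import re


def point_links_at_vip(html, rip, vip):
    """Rewrite absolute URLs that name the RIP so that they name the VIP.

    Port and path are kept. The negative lookahead stops 192.168.1.12
    from matching inside a longer address such as 192.168.1.120.
    """
    pattern = re.compile(r"(https?://)" + re.escape(rip) + r"(?!\d)")
    # A function is used for the replacement so the digits of the VIP are
    # never misread as part of a group reference.
    return pattern.sub(lambda m: m.group(1) + vip, html)


if __name__ == "__main__":
    page = '<img src="http://192.168.1.12/logo.gif"> <a href="next.html">next</a>'
    print(point_links_at_vip(page, "192.168.1.12", "192.168.1.110"))
```

Pages that use only relative links need no change, since the browser resolves them against the VIP it originally connected to.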

Handling identd requests from 2.4.x LVS-DR realservers using TP Since the identd request is coming from the RIP (rather than the VIP) on the realserver, you can use Julian's method for NAT'ing client requests from realservers.

Performance of Transparent Proxy Using transparent proxy instead of a regular ethernet device has slightly higher latency, but the same maximum throughput. For performance of transparent proxy compared to accepting packets on an ethernet device see the performance page. Transparent proxy requires reprocessing of incoming packets, and could have a similar speed penalty as LVS-NAT. However only the incoming packets are reprocessed. Initial results (before the performance tests above) were not encouraging. Doug Bagley doug (at) deja (dot) com

Subject: [lvs-users] chosen arp problem solution can apparently affect performance I was interested in seeing if the linux/ipchains workaround for the arp problem would perform just as well as the arp_invisible kernel patch. It is apparently much worse. I ran a test with one client running ab ("apache benchmark"), one director, and one realserver running Apache. They are all various levels of pentium desktop machines running 2.2.13. Using the arp_invisible patch/dummy0 interface, I get 226 HTTP requests/second. Using the ipchains redirect method, I get 70 requests per second. All other things remained the same during the test.
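To put Doug's two figures side by side, the request rates can be turned into time per request. This is only arithmetic on the numbers quoted above, and it assumes ab was issuing requests one at a time (the posting doesn't give the concurrency); with concurrent requests the per-request times would be larger, though the ratio would be unchanged.

```python
def per_request_ms(requests_per_second):
    """Average time per request in milliseconds, assuming serial requests."""
    return 1000.0 / requests_per_second


if __name__ == "__main__":
    arp_invisible = per_request_ms(226)   # arp_invisible patch / dummy0
    redirect = per_request_ms(70)         # ipchains REDIRECT method
    print("arp_invisible: %.2f ms/request" % arp_invisible)
    print("redirect:      %.2f ms/request" % redirect)
    print("extra cost:    %.2f ms/request" % (redirect - arp_invisible))
    print("slowdown:      %.2fx" % (226 / 70))
```

On these numbers the REDIRECT method is a little over three times slower per request in this one test on 2.2.13.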

See the performance page for discussion and sample graphs of hits/sec for http servers. Hits/sec can increase to high levels as the payload decreases in size. While large numbers for hits/sec may be impressive, they only indicate one aspect of a web server's performance. If large (> 1 packet) files are transferred/hit or computation is involved, then hits/sec is not a useful measure of web performance. Here's the current explanation for the increased latency of transparent proxy. Kyle Sparger ksparger (at) dialtoneinternet (dot) net

Logically, it's just a function of the way the redirect code operates.

TCP/IP -> Application -> TCP/IP -> Ethernet
With redirect: Ethernet -> TCP/IP -> Firewall/Redirect Code -> TCP/IP -> Application -> TCP/IP -> Ethernet

That would definitely explain the slowdown, since _every single packet_ received is going to go through these extra steps.

Other people are happy with TP Jerry Glomph Black black (at) real (dot) com Nov 99 (or thereabouts)

The revival of Horms' posting, which I overlooked a month ago, was a lifesaver for us. We had a monster load distribution problem, and spread 4 virtual IP numbers across 10 'real' boxes (running Roxen, a fantastic web platform). The ipchains-REDIRECT feature works perfectly, without any of that arp aggravation! A PII_450 held up just fine at 20 megabits/s of HTTP -REQUEST- TRAFFIC!

Here's Jerry 18 months later. Jerry Glomph Black black (at) prognet (dot) com 06 Jul 2001

The ipchains/iptables REDIRECT method (introduced to this list by Mr Horms a long time ago) works fine, we've used it in production in the past. However, at -very- high packet loads it is far less CPU-efficient than getting the ARP settings correctly working. The REDIRECT method was bogging down our LVS boxes during peak traffic, something which does not happen with doing it the 'right way' with LVS-DR and silent arp-less interfaces on the real servers.

The difference between REDIRECT and TPROXY Horms: REDIRECT works by changing the destination IP address to a local address so that it ends up in the LOCAL_IN chain. REDIRECT with 2.2 kernels was the original basis for "Horms' method". Joe Oct 03, 2003

is the original 2.4.x REDIRECT disaster fixed now?

TPROXY looks like it would work because it is completely different from REDIRECT and uses its own connection tracking. REDIRECT uses netfilter's internal connection tracking routines. Because of the way that LVS is implemented, these do not work for packets that are handled by LVS. Thus the connection tracking for REDIRECT does not work. Thus the return packets from the realservers are not modified and the connection fails. From my reading TPROXY uses its own connection tracking routines (though for what reason I am not sure). These routines probably aren't affected by LVS and thus TPROXY should work. N.B.: I have not verified this.

LVS: Transparent Bridging Here's a summary of bridging as it relates to LVS directors with help from Julian and Joe Cooper joe (at) swelltech (dot) com Apr 2002. We haven't done anything with transparent bridging in LVS yet, but the subject comes up in the mailing list often enough to warrant some info in the HOWTO.

The director sits between 2 networks, the realserver network and the outside world. Bridging has been proposed several times on the list for the director as a way of getting packets between the realservers and the outside world. Initially I thought that bridging could be used to send packets through the director to 0/0 from a realserver in LVS-DR, thus solving the problem now solved by the .

A bridge is a layer-2 device for connecting 2 physically separate networks. Being a layer-2 device, a bridge only looks at the MAC addresses on a packet. The bridge doesn't look at the IPs and has no information about routing at the IP level. Here's a 2 NIC bridge connecting 2 networks.

In one implementation (the transparent bridge) the bridge learns the network location of hosts (i.e. which NIC the host is attached to) by inspecting the source MAC addresses of packets. Since the bridge only inspects the MAC addresses on packets, the IPs on the hosts in network A and network B can belong to the same or different IP networks/netmasks.

A bridge can be used to separate traffic. If all hosts are on 192.168.1.0/24 but most of the packets are passed between 2 hosts, these two hosts can be put on one of the networks and the rest on the other network. At the same time the bridge connects separate networks, without adding route table entries on the hosts in the two networks. So bridging allows connection of different physical networks without requiring route entries (needed if a router had been used instead), but keeps the traffic off networks that don't need to hear it. Not being a router, the bridge is not seen by traceroute.
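The learning behaviour of a transparent bridge described above can be sketched in a few lines. This is a simplified model rather than the Linux bridge code: the class `LearningBridge`, the port names and the shortened MAC strings are all hypothetical, and entry ageing and spanning tree are left out.

```python
class LearningBridge:
    """Toy transparent bridge: learns which port each source MAC lives on."""

    def __init__(self, ports):
        self.ports = list(ports)
        self.table = {}  # MAC address -> port it was last seen on

    def receive(self, in_port, src_mac, dst_mac):
        """Learn the sender's location; return the ports the frame leaves on."""
        self.table[src_mac] = in_port
        out_port = self.table.get(dst_mac)
        if out_port is None:
            # Destination not learned yet: flood to every other port.
            return [p for p in self.ports if p != in_port]
        if out_port == in_port:
            # Both hosts are on the same segment: keep the frame off the other.
            return []
        return [out_port]


if __name__ == "__main__":
    bridge = LearningBridge(["eth0", "eth1"])
    print(bridge.receive("eth0", "aa", "bb"))  # bb unknown, frame is flooded
    print(bridge.receive("eth1", "bb", "aa"))  # aa was learned on eth0
    print(bridge.receive("eth0", "cc", "aa"))  # same segment, not forwarded
```

The last case is the traffic separation described in the text: two hosts on the same side of the bridge exchange frames without loading the other network.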
- About transparent bridging from the howstuffworks site http://www.howstuffworks.com/lan-switch4.htm and from cisco (including an explanation of the spanning tree algorithm) http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/transbdg.htm
- About bridging from the internet encyclopedia at freesoft http://www.freesoft.org/CIE/Topics/30.htm (site down 14 Sep 2004)
- transparent proxy with bridging from the Transparent Proxy with Linux and Squid mini-HOWTO http://www.tldp.org/HOWTO/mini/TransparentProxy-7.html

In Linux, the first bridges were implemented by proxy-arp; this is called pseudo-bridging. Proxy-arp works only for IPv4. http://www.tldp.org/HOWTO/Adv-Routing-HOWTO-16.html (from the Advanced Routing HOWTO). Here's some more on proxy-arp. http://www.sjdjweis.com/linux/proxyarp/

With proxy-arp running on the setup above, eth1 would be configured to reply to arp requests for an IP on network A, allowing packets from network B to be sent to that host on network A. With proxy-arp, packets are sent through the tcpip stack and routing tables on the bridge host. These packets can be filtered (iptables/ipchains), but also martian packets will be recognised and dropped.

Kate (aka John Looney) asked if bridging could be used to put a director in front of a functioning server, to make an LVS, without breaking service to the clients accessing that server.

John P. Looney 18 Apr 2002 The director is configured to listen for the IPs of the realservers at the internal side of the network bridge, and pass them on, with some intelligence. Note the router is not on the same physical network as the realservers, but is on the same logical network. So, when a connection comes in for RS1 (293.2.2.1), the router sends out an ARP request. The DR answers with its own MAC, and transparently forwards all

Julian Anastasov ja (at) ssi (dot) bg 18 Apr 2002 not with the director's MAC when bridging is used.

connections on to RS1/RS2/RS3 depending on its load balancing algorithm. The realservers still have their gateway set to be the router.

By using the bridging code you have to stick with the following rules:
(1) it is Layer 2, i.e. the decisions to do something with the packet are based entirely on the link layer protocol info (until you patch the code, of course)
(2) at link layer level we have broadcast/multicast/unicast addresses
(3) all received non-unicast frames are passed to the IP stack and to all other bridge ports
(4) all unicast frames are passed to the IP stack or to the appropriate bridge port according to the destination link layer address

- Linux IP does not accept packets destined to foreign lladdr (for ethernet, link layer address == MAC).
- Linux does not send ICMP replies for frames not destined to our lladdr (one of the reasons not to see ICMP errors against UDP broadcasts, for example, for missing listener)
- Linux TCP accepts frames destined only to our lladdr
- Linux does not forward packets not destined to our lladdr

To put this in the LVS space (we assume the Director is using bridging and is between the uplink router and the realservers):
- according to (3) we can't stop the broadcast ARPs reaching the realservers, nor stop the realservers replying to the requestor (the router), i.e. we can't avoid the ARP problem for DR and TUN methods.
- according to (4) the realservers can reach the router directly without disturbing the director's rp_filter. As a result, the bridging helps the director to pass the LVS-DR replies from the realservers to the uplink router
- the uplink router sees one LAN because the bridging code passes all frames preserving the original source link layer addresses

We can say that the default bridging behavior is not the desired one for all cases. There are some useful modes we can require from the bridging. For example, in one mode we can grab all IP packets (even packets destined to foreign lladdrs), feed them to the upper layers and rely on the proper routing rules for filtering, etc.
The bonus is that you don't need to place your IPs, routes, etc on the bridging interfaces, you don't need to implement firewalling specifically designed for the bridged ports, etc, etc.

Joe Cooper joe (at) swelltech (dot) com CONFIG_NET_DIVERT is the IP packet diverter that allows one to configure selective redirects from a bridged interface, so that it can then be REDIRECTED or whatever by the iptables rules. Benoit Locher wrote it and his homepage about the project is here: http://diverter.sourceforge.net/ It is a part of the official Linux trees (2.2.19+ and 2.4.10+) these days, so no patching is necessary, but you do need the divert-utils package to configure it if you're going to use it. It makes the Linux bridging code a lot cooler than your ordinary bridge.

Julian There is layer 2 software under the CONFIG_BRIDGE option (the currently discussed solution) http://bridge.sourceforge.net With bridging the realservers can send packets to the uplink router through the director's layer 2 bridge. So, yes, the packets are handled by the director but do not reach routing. The trick is that if the packets are destined to the director's MAC (which is always true for proxy ARP) then in both solutions the IP packet reaches routing. So, the director's IP should not be used as gateway. But the director can run Linux bridging and stay between the realserver(s) and the client(s)/uplink router. In this case the realservers don't know that when talking to the uplink router's MAC their packets go through the director's layer 2. If the DIP is used as GW in the realservers then even with bridging you have to use the forward_shared flag. If the uplink router's IP is used as GW then we can run bridging on the director if we want to split the segment. The trick is this: the realservers resolve their GW IP with ARP and later send the packets to the resulting MAC. If the GW IP is a director's IP then we receive the director's MAC and the traffic reaches routing. So, if we want to put the director physically between the uplink router and the realservers and to use the DR or TUN methods without the forward_shared flag, we can do it by using bridging and by using the uplink router's IP as GW in the realservers. The only thing that bridging gives us is that we can use the uplink router's IP as GW in the realservers. Bridging connects the two network segments.

The critical difference then is the gw configured on the realservers. If it is an IP/MAC on the director, then the director's routing tables will see the packet. If the gw is the router on the outside of the director, then the packet will be bridged without seeing the director's routing tables. Presumably this will solve the martian problem. (No-one has tried this out yet.)
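
The gw choice above can be pictured with a toy model. This is only a sketch of the rules as Julian states them, not the kernel's bridging code; the names `Frame` and `handle_frame` and the MAC addresses are made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Frame:
    src_mac: str
    dst_mac: str


def is_unicast(mac: str) -> bool:
    # Broadcast and multicast MACs have the low bit of the first octet set.
    return (int(mac.split(":")[0], 16) & 1) == 0


def handle_frame(frame: Frame, director_mac: str) -> str:
    """Toy model of what a bridging director does with a received frame."""
    if not is_unicast(frame.dst_mac):
        # Rule (3): non-unicast frames go to the IP stack and all other ports.
        return "ip-stack and all other ports"
    if frame.dst_mac == director_mac:
        # Addressed to the director itself, so the packet reaches routing.
        return "ip-stack"
    # Rule (4): unicast for another host is passed on at layer 2 only.
    return "bridged"


# Hypothetical MAC addresses, for illustration only.
DIRECTOR_MAC = "02:00:00:00:00:01"
ROUTER_MAC = "02:00:00:00:00:02"
RS_MAC = "02:00:00:00:00:03"

# Realserver gw = DIP: ARP returns the director's MAC.
print(handle_frame(Frame(RS_MAC, DIRECTOR_MAC), DIRECTOR_MAC))
# Realserver gw = uplink router: ARP returns the router's MAC.
print(handle_frame(Frame(RS_MAC, ROUTER_MAC), DIRECTOR_MAC))
```

With the DIP as gw the reply lands in the director's IP stack (and routing); with the uplink router as gw it is only bridged.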

LVS: Persistent Connection (Persistence, Affinity in cisco-speak)

Apr 2006: No-one has tried this, but it seems that the SH scheduler could replace persistence, without the failover problems of persistence. The SH scheduler schedules according to the client IP, meaning that all of a client's connection requests will be sent to the same RIP. The SH scheduler has been around for a while, but it seems that no-one has known what it did. One of the problems was that no-one knew how to use the weight parameter.

Sep 2002: Rewritten. All references to the LVS persistence used in kernels <2.2.12 have been dropped. (For another writeup on persistence, see LVS persistence page.)

For LVS, the term "persistence" has 2 meanings.
- "persistent connection": a term used for clients connecting to webservers and databases. For Apache this is called a "keepalive" and is set in httpd.conf.
- "persistent connection" as used in LVS. LVS persistence is closer to the concept of "affinity" as used by cisco.

The two types of persistence are quite different. Unfortunately, both features are persistent and can reasonably claim the name "persistent". This causes some confusion in nomenclature. LVS persistence could alternately be described as connection affinity or port affinity. LVS persistence directs all (tcpip) connection requests from the client to one particular realserver. Each new (tcpip) connection request from the client resets a timeout (time set by the -p option of ipvsadm). LVS persistence has been part of LVS for quite a while (first implementation by Pete Kese, when it was called pcc) and was added to handle ssl connections, squids and multiport connections like ftp (squids now have their own scheduler). (LVS) persistence is also used when the realserver must maintain state (i.e. when the client sends information to the realserver in shopping carts, or writing to an application such as a database, or the client must hold a cookie).
Persistence has the following effects
- Because the client is connected to one server for the session, applications like shopping carts and databases don't have to be rewritten when moving services to an LVS. The shopping cart can accumulate information just as it does when it runs on a standalone server. This is the original reason for the persistence feature.
- Because the client's data is only on one realserver, the data must be propagated to the other realservers before anyone else needing that information connects.
- There is no load balancing for the client, i.e. all connections are sent to one realserver. This is not a big problem, since a heavily loaded server will be handling thousands of connections at any one time, and whether one client has all its connections to one realserver or to many doesn't make much difference to anyone.
- The difficulty of handling failover is the worst side effect of using persistence. As initially implemented, if a realserver crashed, or was brought down for maintenance (by setting the weight to zero), the director would still send connections from the client to the same realserver, until the timeout expired. In the case of a crashed realserver, the client wouldn't get a connection and its session would be hung (or get a tcpip reset). In the case of bringing a machine down for maintenance, the administrator would have to wait till all the clients finished their session.

Because of the hung connections on realserver failure with persistent connection, in late 2004, Horms changed the behaviour on bringing down a persistent connection so that it was the same as for non-persistent connection, i.e. no new tcpip connections would be allowed. The problem now is that data written to the crashed realserver is lost. The client will make their next click expecting to move to the next screen, only to find that the realserver has no idea who they are.
Presumably the application will have to be written to behave sensibly with this client ("we're sorry, your connection is important to us - who are you and why are you connecting?"). At least this behaviour is better than a hung connection in the middle of a session. Most of the discussion below describes the old behaviour on bringing down a persistent service. You should understand the consequences of using persistence if you plan to use it in production. The ideal approach from a theoretical point of view is to rewrite the application so that data is propagated to all realservers immediately (or at least before the client initiates a new SSL session), allowing the LVS to run in non-persistent mode. Rewriting your application is difficult, but if you're in production with a secure (SSL) site, you're already spending money. Despite us using every opportunity to exhort people to rewrite their applications, we find that most people don't and continue to use persistence. Alternatives to persistence include ftp - move the ftp server out of the LVS - ftp is difficult to secure squids - use the dh scheduler, which is designed for squids and which was developed to overcome the deficiencies of using persistence for squids. See Note in the DH section: Jezz Palmer found that he got better performance for squids with persistence. using persistent to forward multiport services as found in an e-commerce site (i.e. 80, 443). in an L4 friendly manner (i.e. to save state on the realserver).

LVS persistence

LVS persistence makes a client connect to the same realserver for different tcpip connections. The LVS persistent connection is at the layer 4 protocol level. LVS persistence is rarely needed and has some pitfalls (as explained below). It's useful when state must be maintained on the realserver, e.g. for https key exchanges, where the session keys are held on the realserver and the client must always reconnect with that realserver to maintain the session. LVS persistence has two consequences
- A client making a new tcpip connection, within the timeout period (usually 5-10mins), will be sent to the same realserver as on the previous connection. The new tcp connection will reset the timer.
- A connect request made past the timeout period will be treated as a new connection and will be assigned a realserver by the scheduler.

The default timeout varies with LVS release, but is in the 300-600sec range. When implementing LVS persistence, there are problems in recognising a client as the same client returning for another connection. While the application can recognise a returning client by state information e.g. cookies (which we don't encourage, see below for better suggestions), at layer 4, where LVS operates, only the IPs and port numbers are available. If it's left to the application to recognise the client (e.g. by a cookie), it may be too late: the client may be on the wrong realserver and the ssl connection is refused. For LVS persistence, the client is recognised by its IP (CIP) or in recent versions of ip_vs, by CIP:dst_port (i.e. by the CIP and the port being forwarded by the LVS). If only the CIP is used to schedule persistence, then the entries in the output of ipvsadm will be of the form VIP:0 (i.e. with port=0), otherwise the output of ipvsadm will be of the form VIP:port. Recognising the client is simple enough for machines on static IPs, but people on dial-up links come up on a different IP for each dial-up session.
- If the phone line drops during a session the client will reappear with a different IP (but probably coming from the same class C network).
- If they are coming through a proxy (like AOL), they will come from different IPs (again probably in the same class C network) for different tcpip connections, within a single session (i.e. requests for hits for a web page may come from several IPs). (For more info see persistence granularity.)

The solution to this is to set a netmask (e.g. /24) for persistence and to accept any IPs in this netmask as the same client. The downside is that if a significant fraction of your clients are from AOL, they will appear to be a single client and will all be beating on one realserver, while the other realservers are near idle. For regular http, you don't care how many different IP(s) the client uses to request its hits for a single webpage and you don't need persistence.

When all ports (VIP:0) are scheduled to be persistent, then requests by a client for services on different ports (e.g. to VIP:telnet, to VIP:http) will go to the same realserver. This is useful when the client needs access to multiple ports to complete a session. Useful multi-port connections are
- 20,21 for active ftp
- 21 and a high port for passive ftp
- port 80,443 for an e-commerce site

A side effect is that once persistence is set for all ports, requests by the client to any port, not just the ones you think the client is interested in, will be forwarded to the realserver. (The client will get a "connection refused" if the realserver is not listening on the other forwarded ports.) For security (to stop port scans etc), you'll have to filter requests to the other ports. The ports won't necessarily be paired in the way you want e.g. in the (admittedly unlikely) event that you have an ftp and e-commerce setup on the same LVS, both ftp and e-commerce requests will go to the same realserver.
What you'd like is for the e-commerce (80,443) requests to be scheduled independently of the ftp (20,21) requests. In this way your ftp requests will go to one realserver while your requests to the e-commerce site will go to a different realserver. It's simpler administratively to have different services (ftp, http/https) on a different lvs. The all ports (VIP:0) approach is quite crude, and was a first attempt at bundling together connect requests for multiple services from a client. This side effect (of persistence activating all ports), does not arise if multiport services are forwarded by a persistent fwmark. To bundle services see fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html) - in particular persistence granularity with fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html#fwmark_persistence_granularity). Note: the persistence timeout is the elapsed time, between different tcpip connections, for the client to be recognised as a returning client. You still have the same idle timeout within a tcpip connection as for other services. Wensong Zhang wensong (at) gnuchina (dot) org 11 Jan 2001

The working principle of persistence in LVS is as follows: a persistent template is used to keep the persistence between the client and the server. On the first connection from a client, the LVS box will select a server according to the scheduling algorithm, then create a persistent template and the connection entry. The control of the connection entry is the template. Later connections from the client will be forwarded to the same server, as long as the template doesn't expire. The control of their connection entries is the template. If the template has controlled connections, it won't expire. If the template has no controlled connections, it expires in its own time.

malcolm lists (at) netpbx (dot) org

What's the maximum setting for the persistence timeout? The Docs say it's unlimited but I don't believe that :-).

Horms 25 Aug 2006 ipvsadm may have some other limit due to signedness issues and the like. But in the kernel it is stored as an unsigned int, which represents seconds. So any value between 0 and (2^32)-1 seconds is valid, which is potentially a rather long time.
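
To put "a rather long time" into numbers, here is the arithmetic for the kernel limit Horms quotes. It only converts (2^32)-1 seconds into days and years; it says nothing about any lower limit ipvsadm itself may impose.

```python
# Largest value Horms quotes for the kernel's timeout field, in seconds.
MAX_TIMEOUT_S = 2**32 - 1

days = MAX_TIMEOUT_S / 86400        # 86400 seconds in a day
years = days / 365.25               # average year length, including leap years

print(f"{MAX_TIMEOUT_S} s = {days:.0f} days = {years:.1f} years")
```

That is roughly 136 years, so for practical purposes the kernel side is indeed unlimited.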

Scheduling looks different under persistence

In a normal (non-persistent) LVS, if you connect to VIP:telnet with rr scheduling, you will connect to each realserver in turn. This is because the director is scheduling each tcpip connection as a separate item. When you logout of your telnet session and telnet to the VIP again, the director sees a new tcpip connection and schedules it round robin style i.e. to the next realserver in the ipvsadm table. However, if you then make the LVS persistent, the director schedules each CIP as a separate item. Repeated telnet tcpip connections (logins and logouts) to the VIP (within the persistence timeout period) will be regarded as the same scheduling item, since they are coming from the same client, and will all be sent to the same realserver. Even though rr scheduling is in effect, you will be connected to the same realserver. To test that the scheduler is round-robin'ing under persistence, you will need to login from several different clients (i.e. with different IPs), or after the persistence timeout has expired.

If two services are scheduled as persistent (here telnet, http), they are scheduled independently. Here I have only 1 client (so it isn't a good test) and I connect twice by telnet and then twice by http. Scheduling is within the blocks setup by the `ipvsadm -A` command (here starting at "TCP ..."). Here there are two blocks, scheduled separately.

RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr persistent 360
  -> RS2.mack.net:http         Route   1      0          2
  -> RS1.mack.net:http         Route   1      0          0
TCP  lvs.mack.net:telnet rr persistent 360
  -> RS2.mack.net:telnet       Route   1      0          2
  -> RS1.mack.net:telnet       Route   1      0          0

Doing the same test a bit later, I found all connections going to the other realserver.
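
The difference between the two modes can be seen in a small simulation. This is a toy model, not ip_vs code: the class `ToyDirector` is invented for illustration, the client is identified by CIP only, and the template timer is simply restarted on every connection (the real timer handling is more subtle).

```python
from itertools import cycle


class ToyDirector:
    """Toy model: round robin with optional persistence keyed on client IP."""

    def __init__(self, realservers, persistence_timeout=None):
        self._rr = cycle(realservers)
        self.timeout = persistence_timeout   # seconds; None = non-persistent
        self.templates = {}                  # CIP -> (realserver, expiry time)

    def connect(self, cip, now):
        if self.timeout is None:
            return next(self._rr)            # every connection is scheduled
        entry = self.templates.get(cip)
        if entry and now < entry[1]:
            rs = entry[0]                    # returning client: same realserver
        else:
            rs = next(self._rr)              # new or expired: ask the scheduler
        self.templates[cip] = (rs, now + self.timeout)
        return rs


plain = ToyDirector(["RS1", "RS2"])
print([plain.connect("10.0.0.1", t) for t in (0, 10, 20)])

sticky = ToyDirector(["RS1", "RS2"], persistence_timeout=360)
print([sticky.connect("10.0.0.1", t) for t in (0, 10, 20)])
print(sticky.connect("10.0.0.2", 30))        # a second client moves rr along
```

The non-persistent director alternates between realservers for the one client, while the persistent one keeps that client on its first realserver and only advances the round robin when a different client appears.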

Will the timeout variable set on persistent connection affect an open socket that's open for several days streaming data?
Horms 2005/02/22 No. The persistence timeout has no effect whatsoever on the timeout of open connections. They have their own timeouts which are generally in line with those of TCP.
Will another connection from the same client go to a different realserver while there's an open socket with streaming data?
Not if you use persistence. If you use persistence, and either there is a connection open, or the persistence timeout has not elapsed since the last connection was closed, then a subsequent connection from the same end-user will go to the same real-server. For those who care, this is all controlled by the expiry of connection entries and persistence templates by ip_vs_conn_expire().

Persistent and regular (non-persistent) services together on the same realserver.

If you setup both a non-persistent service (for testing, say telnet) and persistence on the same VIP, then all services will be persistent except telnet, which will be scheduled independently of the persistent services. In this case connections to VIP:telnet would be scheduled by rr (or whatever) and you would connect with all realservers in rotation, while connections to VIP:http will go to the same realserver. Example: If you setup a 2 realserver LVS-DR LVS with persistence, giving the ipvsadm output

RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0            Route   1      0          0
  -> RS1.mack.net:0            Route   1      0          0

then (as expected) a client can connect to any service on the realservers (always getting the same realserver). If you now add an entry for telnet to both realservers, (you can run these next instructions before or after the 3 lines immediately above) giving the ipvsadm output

RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0            Route   1      0          0
  -> RS1.mack.net:0            Route   1      0          0
TCP  lvs2.mack.net:telnet rr
  -> RS2.mack.net:telnet       Route   1      0          0
  -> RS1.mack.net:telnet       Route   1      0          0

the client will telnet to both realservers in turn as would be expected for an LVS serving only telnet, but all other services (i.e. !telnet) go to the same first realserver. All services but telnet are persistent. The director will make persistent all ports except those that are explicitly set as non-persistent. These two sets of ipvsadm commands do not overwrite each other. Persistent and non-persistent connections can be made at the same time. Julian

This is part of the LVS design. The templates used for persistence are not inspected when scheduling packets for non-persistent connections.
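
One way to picture why the telnet entry escapes persistence: the director looks for a virtual service matching the exact port first, and only falls back to the catch-all port 0 entry if there is none. The sketch below is a simplified model of that lookup; the `services` dictionary and `find_service` are illustrative names, not ip_vs data structures.

```python
def find_service(services, vip, port):
    """Prefer an exact VIP:port entry, else fall back to the VIP:0 entry."""
    return services.get((vip, port)) or services.get((vip, 0))


# The two-entry table from the example above (23 = telnet).
services = {
    ("lvs2.mack.net", 0): {"scheduler": "rr", "persistent": 360},
    ("lvs2.mack.net", 23): {"scheduler": "rr", "persistent": None},
}

print(find_service(services, "lvs2.mack.net", 23))   # the telnet entry
print(find_service(services, "lvs2.mack.net", 80))   # falls back to port 0
```

A telnet connection is matched by the non-persistent entry, so no persistence template is consulted for it; any other port is caught by the persistent VIP:0 entry.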

Examples:
- ftp (LVS-NAT): connections to both ftp ports for passive ftp are handled by the module ip_masq_ftp. You don't need to add persistence for ftp with LVS-NAT.
- ftp (LVS-DR or LVS-Tun): you need persistence on the realservers. Run the first set of commands above.
- ftp and http (LVS-NAT): persistence not needed (ip_masq_ftp handles the ftp ports for active and passive ftp).
- ftp and http (LVS-DR or LVS-Tun): persistence needed to handle the two port protocol ftp. If you just have one entry in the ipvsadm table (persistence to VIP:0) then a client connecting to the http service of the LVS will always get the same realserver (this may not be a great problem). If you want to make the http service non-persistent but leave all other services persistent, then add a non-persistent entry for http.
- http and https (all forwarding methods): Normally an https connection is made after the client has made selections on an http connection when data is stored on the realserver for the client. In this case the realserver should be made persistent for all services.

Note: making realserver connections persistent allows _all_ ports to be forwarded by the LVS to the realservers. An open, persistently connected realserver then is a security hazard. You should have filter rules on the director to block all services on the VIP except those you want forwarded to the realservers.

Tracing connections: where will the client connect next?

You can trace your system in the following way. For example:

RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 wlc persistent 360
  -> 172.26.20.91:80           Route   1      0          0
  -> 172.26.20.90:80           Route   1      0          0
TCP  172.26.20.118:23 wlc persistent 360
  -> 172.26.20.90:23           Route   1      0          0
  -> 172.26.20.91:23           Route   1      0          0

[root@kangaroo /root]# ipchains -L -M -n
IP masquerading entries
prot expire   source        destination    ports
TCP  02:46.79 172.26.20.90  172.26.20.222  23 (23) -> 0

Although there is no connection, the template isn't expired. So, new connections from the client 172.26.20.222 will be forwarded to the server 172.26.20.90.

For 2.4 kernels
This shows the state of the connection (ESTABLISHED, FIN_WAIT) and the time left till persistence timeout.

Bringing down persistent services. This is the behaviour before late 2004.

Clearing the table

If a client is connected (persistently) to a realserver and the ipvsadm table is cleared (ipvsadm -C) then the connection will hang. If you then reinstall the original ipvsadm rules for that service, the connection will work again (and you'll see the correct entries in ActiveConn and InActConn). Wensong (a bit below) explains why the code doesn't clear the entry, but only removes the pointer to the entry.

Ratz In new versions of ip_vs (look for it Sep 2002 or later) you can affect the behaviour of ip_vs towards the connection when the ipvsadm table is cleared with a sysctl. Details are in the sysctl document http://www.linux-vs.org/docs/sysctl.html. e.g.
- net.ipv4.vs.expire_nodest_conn=0 maintain entry in table (but silently drop any packets sent), allowing service to continue if the ipvsadm table entries are restored.
- net.ipv4.vs.expire_nodest_conn=1 expire the entry in table immediately and inform client that connection is closed. This is the expected behaviour by some people when running `ipvsadm -C`.

However if you have some client at the other end buying $1M of your software with his credit card, you want to be nice to them. The nice way of deleting a service is to set the weight to zero (when no new connections will be allowed to that realserver) and then wait for the current connections to disconnect/expire before deleting them (use some script to monitor the number of connections). Since the client can stay connected for hours (for some services) you can't predict when you'll be able to bring your server down.

time to clear quiescent persistent connections In a normal (non-persistent) tcp connection, after setting a service to weight=0, the ipvsadm connection (hash) table will clear FIN_WAIT time (with Linux, about 2 mins) after the last client disconnects. With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) time after the last client disconnects. This time defaults to about 5mins but can be much longer. Thus you cannot bring down a realserver offering a persistent service, till the persistence timeout has expired - clients who have connected in recently can still reconnect. Tim Cronin wrote:

if you're using pasv you need persistence.... if I change the weight of the .20 RIP to 0 and rerun the script my connections continue to go to that server even when I zeroed and cleared the table.

Julian 1 Nov 2002 Because the virtual service is marked persistent. In such case RSs with weight 0 can continue to accept _new_ conns.

Resetting timeout The persistence timeout is not reset to the original timeout on each new tcpip connection, it is incremented by TIME_WAIT. unknown (possibly Julian) Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection", it could be useful. Terry Green, 7 Feb 2003

Agree completely - however, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT

Julian Anastasov 2003-02-08 2:21:35 The persistence timeout is used only once: when the first connection from this client is established. The current meaning is the persistent time to cover period of time after the client appears for first time. It is extended if there are still active connections. Then there are 3 (or more) options:
(1) extend it again with the persistent time
(2) extend it with 2mins
(3) use the persistence time after the last connection from client terminates

The second option is implemented, as it was expected from other users :) A long time ago my opinion was that it is good the persistent time to be used when the last connection terminates (3 above). This can be a config option, if someone wants to implement it. unknown (Julian?) Maybe you see it 20 seconds after the 2-minute cycle is restarted. It is "reset" only when its timer expires, not when the controlled connections expire. Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I did another connect from the client, and the template record was reset again to 2 minutes (not the persistence setting again), suggesting the template record data structure had somehow had its persistence time reset from the original setting to 2 minutes.

Julian Well, then it is not set to 1:40 but to 2:00 as expected. Terry

Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection.

Julian True, your reading is accurate :) I now see why it was 1:40 Terry

My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_TIME_WAIT.

Julian You can do it in your source tree or to implement it for other users as config option. I don't know what the other people think. Ratz and another poster on 12 Aug 2004 like resetting the timeout to the persistence value.
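
The behaviour Terry and Julian converge on can be summarised in a few lines. This is a reading of the thread, not a copy of the kernel source: the class `Template` and its fields are invented for illustration, and `TIME_WAIT_S` stands in for the 2*60*HZ constant Terry mentions, expressed in seconds.

```python
TIME_WAIT_S = 2 * 60    # stand-in for IP_VS_S_TIME_WAIT, in seconds


class Template:
    """Toy persistence template with a single expiry timer."""

    def __init__(self, persistence_timeout, now=0):
        self.expires_at = now + persistence_timeout
        self.controlled = 0      # connections currently tied to this template
        self.alive = True

    def on_timer(self, now):
        """Called when the template's timer fires; returns True if it survives."""
        if self.controlled > 0:
            # Still controlling connections: re-arm for TIME_WAIT,
            # not for the configured persistence timeout.
            self.expires_at = now + TIME_WAIT_S
        else:
            self.alive = False   # nothing depends on it any more
        return self.alive


t = Template(persistence_timeout=360)
t.controlled = 1
print(t.on_timer(now=360), t.expires_at)   # survives, re-armed 2 minutes ahead
t.controlled = 0
print(t.on_timer(now=480))                 # expires for good
```

Changing the re-arm value to the persistence timeout would give option (1) in Julian's list, which is what Terry, Ratz and others said they would have expected.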

Persistence is independent of scheduler

The scheduler determines which realserver gets the next connection. With persistence, the same realserver gets the next connection. Horms 13 Sep 2004 Persistence operates independently of the scheduler. It does not matter if you use the RR, WLC, DH or any other type of scheduler, it always works the same way. That is, it looks up a persistence template and if it finds one, then it uses it, else it asks the scheduler what to do. In other words, if there was a connection from a given end-user, and the persistence timeout has not expired, subsequent connections from the same end-user (masked with the persistence netmask) will go to the same realserver. As this lookup occurs _before_ a call to the scheduler, it is not affected by quirks in any scheduler. Brett

I have an LVS director that uses wrr with 3600 of persistence for two realservers. I noticed that connections going through a firewall from my internal network tend to get locked into one of my realservers but usually doesn't go to the other realserver unless all of the connections have expired to the first realserver.

Ratz 10 Aug 2004 Correct.

From what I understand, LVS supports using the source IP for persistence, but I wasn't sure if it also used the source port.

No, it doesn't. The persistent template is created with the cport set to 0 globally. Horms The source IP address is used, but the source port is not. This is because successive connections from the same host will almost certainly have a different ephemeral source port. There is no parameter in LVS to change this behaviour. Though off the top of my head it would seem like a simple hack to alter this if you needed to for some reason.
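
To make the point about the source port concrete, here is a sketch of how a template key could be formed: the client address is masked with the persistence netmask and the client port is always 0. The function `template_key` and the tuple layout are assumptions for illustration, not the kernel's structure; only the "masked CIP, cport 0" idea comes from the replies above.

```python
import ipaddress


def template_key(cip, vip, vport, netmask="255.255.255.255"):
    """Key under which a persistence template is stored in this toy model."""
    net = ipaddress.ip_network(f"{cip}/{netmask}", strict=False)
    masked_cip = str(net.network_address)   # CIP masked with the netmask
    cport = 0                               # the client port is never used
    return (masked_cip, cport, vip, vport)


# Same host, different ephemeral ports: the key cannot tell them apart.
print(template_key("10.1.2.3", "192.168.1.110", 443))

# With a /24 netmask, two proxies in the same class C share one template.
a = template_key("10.1.2.3", "192.168.1.110", 443, "255.255.255.0")
b = template_key("10.1.2.200", "192.168.1.110", 443, "255.255.255.0")
print(a == b)
```

Because the key has no room for the real client port, every client behind one NAT or firewall address looks like a single client, which is the lock-in Brett observed.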

Would using a different scheduler or a kernel upgrade (with a new lvs version) work around this?

Horms Not likely. Ratz You would need to tweak ../net/ipv4/ipvs/ip_vs_core.c:ip_vs_sched_persist().

Forcing a break in a persistent connection: expire_quiescent_template - Horms sysctl for quiescing persistent connections Horms horms (at) verge (dot) net (dot) au 12 Apr 2004 Expire Quiescent Template. Here's the writeup.

This patch adds a proc entry to tell LVS to expire persistence templates for quiescent servers. As per the documentation patch below: expire_quiescent_template - BOOLEAN

This patch was written to allow loadbalancing of https, with failover. However it can be used to force a break of a persistent connection. With persistent connection and the weight of a realserver set to 0, any new connections will go to other realservers, but existing connections will stay till they timeout or the client disconnects an active session (whichever is longest). Experience on the mailing list shows that this could be a long time. Misconfigured clients stay connected forever. This patch forces the client's connection to break. The client probably will not be happy about this, but then you may not want to wait 24hrs to do maintenance either. Graeme Fowler graeme (at) graemef (dot) net 19 Jul 2007 /proc/sys/net/ipv4/vs/expire_quiescent_template ensures that when a realserver's weight is changed to 0 (ie. it is set to "quiescent"), rather than removed from the pool of realservers, existing persistent sessions on that realserver are expired from the persistence template. Where you have persistence set on a virtual service, setting the weight to 0 usually results in no new sessions being forwarded to that realserver *but* existing sessions will continue to be handled until they are closed. It's the "graceful" way of taking a server down for maintenance, for example. If you remove a realserver from the pool, with expire_quiescent_template set to 1, those sessions expire immediately. Additionally, setting expire_nodest_conn=1 also helps by removing persistent entries when a realserver is removed from the pool. Without this, persistent entries will hang until they timeout and get redirected to another realserver.
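
The effect of the sysctl on a returning client can be modelled as below. This is a sketch of the behaviour described above, assuming the realserver is still in the pool (only its weight has been set to 0); `pick_realserver` and `first_available` are invented names, and the real check lives in the ip_vs code, not here.

```python
def pick_realserver(template, weights, schedule, expire_quiescent_template):
    """template: realserver named by the client's template, or None.
    weights: {realserver: weight}; schedule: callable choosing a realserver."""
    if template is not None:
        quiescent = weights[template] == 0
        if not (quiescent and expire_quiescent_template):
            return template          # follow the template, even at weight 0
    return schedule(weights)         # no usable template: ask the scheduler


def first_available(weights):
    """Stand-in scheduler: the first realserver with a non-zero weight."""
    return next(rs for rs, w in weights.items() if w > 0)


weights = {"RS1": 0, "RS2": 1}       # RS1 has been quiesced
print(pick_realserver("RS1", weights, first_available, False))
print(pick_realserver("RS1", weights, first_available, True))
```

With the flag off, the returning client keeps landing on the quiesced RS1 until its template times out; with the flag on, the template is ignored and the client is rescheduled to RS2.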

Are there any problems caused by setting both expire_nodest_conn and expire_quiescent_template?

None that I can think of directly; however if a healthcheck fails because something goes awry (local intervening network conditions, transient load on the director, something like that) then a realserver could well be quiescent or removed briefly - in which case all established persistent sessions will be terminated. This may not be desirable in the case of a condition which is resolved in a few seconds. I guess careful tuning of healthchecks along with good network design would be the way to not trigger it, but that's outside the scope of this discussion :)

Kit Gerrits kitgerrits (at) gmail (dot) com 18 Dec 2008

I am setting up an LVS-NAT cluster for a bunch of webservers. I am using persistence with the expire_nodest_conn setting in my sysctl.conf. If I open a connection to the webserver VIP and then kill the webserver that served me (trac-test2), I would expect to get served by its brother (trac-test1). Instead, I get the following:

Even by hand I am getting nowhere:

telnet 10.100.77.250 80
Connecting To 10.100.77.250...Could not open connection to the host, on port 80: Connect failed

/var/log/messages reports:

Dec 18 16:56:12 lvs-test1 nanny[5261]: shutting down 192.168.201.22:80 due to connection failure

Joe: he's using nanny

RemoteAddress:Port                Forward Weight ActiveConn InActConn
TCP  trac-test-pub.rdc.local:http wlc persistent 300
  -> trac-test2.rdc.local:http    Masq    0      0          1
  -> trac-test1.rdc.local:http    Masq    500    0          0

Graeme Fowler graeme (at) graemef (dot) net 18 Dec 2008

Yep. That's a quiescent server, not a "nodest_conn" - the latter meaning a connection destined for a server no longer in the pool. You need to set the "expire_quiescent_template" sysctl to 1 instead.

Nicola Pero nicola (at) brainstorm (dot) co (dot) uk 25 Nov 2004

Has anyone been able to setup ldirectord to load balance two HTTPS servers with failover ? The two real HTTPS servers are stateless (except for the SSL info in the web servers); there are few concurrent users (up to 10), but instant switchover in case of failure is essential. Anyway, the problem we have is that when one of the two HTTPS servers goes down, the load balancer detects it but all clients connected to the server which is down keep being sent to it. Changing 'persistent', 'quiescent', timeouts etc didn't seem to have any effect on this! Our case is also complicated by the fact that in certain cases we might decide that a realserver should not be used even if HTTPS is still running fine on the server. That might happen if the application sitting behind the HTTPS has a problem. We've got a URL on the realserver which can be checked to know if the realserver is OK to be used or not. Checking those seems to be working fine! The problem is with the realserver being marked as down, and all requests still being sent to it! Keep in mind this is not a typical web farm, there are few concurrent users (most often 0 or 1), but it's critical that the web application is always available.

Malcolm Turnbull Nov 25, 2004

You definitely want quiescent=no (in ldirectord).

Horms 26 Nov 2004

Or use this patch. http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html The patch just makes persistent sessions behave sensibly(tm) when a realserver is made quiescent. This isn't specific to HTTPS at all, but I think it is the problem that the user is seeing. The other solution is not to make the realservers quiescent, and just remove them instead.

2.4 patch (http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html), 2.6 patch (http://article.gmane.org/gmane.linux.network/18906).

The existing behaviour. When a realserver is marked as quiescent (by setting the weight to zero) no additional connections will be allocated to that realserver by the scheduler (the LVS connection allocator, not the cpu scheduler, the packet scheduler, or your secretary). This works quite well, unless the scheduler is bypassed for some reason. As it happens this occurs only if a virtual service is marked as persistent and there is a persistence template in existence - that is, recently there was a prior connection from the same end-user. In this case the presence of the persistence template is sufficient for additional connections to be scheduled, despite the fact that the server is marked as quiescent. The connections do, however, have to be from an end-user (IP address/netmask) that was forwarded to the realserver in question within the persistence timeout.

My patch allows this behaviour to be changed, by expiring the templates when a realserver is marked as quiescent. Thus the scheduler gets called, and the behaviour is the same as for a non-persistent service, which is generally what people expect/want.

Joe

and just rip out the connections?

By removing a realserver you break all the connections and remove all the persistence templates. So no further connections are forwarded whatsoever. Actually, no further packets are forwarded. Unfortunately, this breaks connections that are in progress.
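The behaviour Horms describes in this thread (a persistence template bypassing the scheduler, and his patch expiring the template for a quiescent realserver) can be sketched as a toy model. This is not the ip_vs kernel code: the class ToyDirector, its fields, and the "highest weight wins" stand-in scheduler are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class ToyDirector:
    """Toy model of persistence templates; NOT the ip_vs implementation."""
    weights: dict                                   # realserver -> weight
    expire_quiescent_template: bool = False         # models the sysctl
    templates: dict = field(default_factory=dict)   # client IP -> realserver

    def schedule(self, client):
        """Choose a realserver for a new connection from `client`."""
        server = self.templates.get(client)
        if server is not None and server in self.weights:
            quiescent = self.weights[server] == 0
            if not (quiescent and self.expire_quiescent_template):
                # Template hit: the scheduler is bypassed, even at weight 0.
                return server
        # No usable template (none yet, realserver removed from the pool,
        # or template expired because the realserver is quiescent).
        self.templates.pop(client, None)
        candidates = [s for s, w in self.weights.items() if w > 0]
        if not candidates:
            raise RuntimeError("no realserver with weight > 0")
        # Stand-in scheduler: highest weight, first listed wins ties.
        chosen = max(candidates, key=self.weights.get)
        self.templates[client] = chosen
        return chosen
```

With two realservers of weight 1, a client is first sent to rs1. After rs1 is set to weight 0 the same client is still sent to rs1 while the flag is off, and is rescheduled to rs2 once the flag is on; a client with no template goes to rs2 in both cases.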

So what happens in the following case: You've filled your shopping cart under http, then you go to https to give your credit card info, which usually takes at least 3 webpages (fill in your credit card and shipping info, click send, get confirmation page, click accept, get final page for printing). Let's say while you're reviewing the confirmation page, the realserver goes down and the LVS removes it by running ipvsadm. The tcpip state of the client is ESTABLISHED and the client has the SSL session ID. The LVS has to cache the credit card info somewhere to make it available to the new realserver. When the user hits the accept button, the browser presumably is going to get a tcpip reset from the new realserver. Does the browser just handle it and attempt to make a new tcpip connection? From what you say above, the browser will find that its SSL session ID is invalid and it will do the long handshake. Once that happens the client will hopefully be SSL connected to a realserver that knows about the credit card transaction already underway.

In the situation you describe above the main factors in determining if it would work or not are: (1) how do the realservers store their data? If some sort of shared storage is used, say for example NFS, and the transaction is not in some half broken state, then it should be OK, though there might be a race in there. (2) Will the client's browser reconnect (either automatically or by the user hitting reload)? The answer is generally yes.

What I am trying to say is it really boils down to an interaction between the end-user's browser and the realserver's web application. The LVS magic in between neither hinders nor helps the situation, other than allowing the end-user to connect to a different realserver if/when a reload occurs. And the Session ID shouldn't really come into it, because if it is still valid, it will be used, and if it is invalid it will be discarded and a full handshake will be performed. Sure, it might take an extra few moments, and possibly the realserver might be a bit overloaded if a lot of reconnects of this nature occur simultaneously, but the success (or failure) of the SSL handshake should not be affected.
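The Session ID fallback that Horms describes in this thread can be written down as a small decision function. This is only a sketch of the logic in the postings, not an SSL implementation: the function name handshake_kind, the dictionary standing in for a server's session cache, and the string secrets are assumptions for illustration.

```python
def handshake_kind(offered_session_id, client_secret, server_cache):
    """Classify the handshake a server would perform for a reconnecting client.

    server_cache maps Session ID -> cached secret on ONE realserver.
    Returns "full", "abridged" or "failed".
    """
    if offered_session_id is None:
        return "full"        # client offered no Session ID
    if offered_session_id not in server_cache:
        return "full"        # expired, bogus, or cached on another machine
    if server_cache[offered_session_id] != client_secret:
        return "failed"      # known ID but the cached key material differs
    return "abridged"        # both sides hold matching cached information


# Two realservers with independent caches (hypothetical values).
rs1_cache = {"id-42": "secret-a"}
rs2_cache = {}
```

A client holding ("id-42", "secret-a") gets an abridged handshake from rs1 and simply falls back to the full handshake on rs2, which is why moving the client to another realserver costs time but does not break SSL.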

I had thought that the keys are in memory and you can't move the keys/session data from one machine to another.

That is not the case, let me elaborate. (I wrote an SSL implementation once so I know this one :) SSL makes use of public key encryption (e.g. RSA) and private key encryption (e.g. DES, AES) as well as a host of other techniques to make communications more secure. In a nutshell public key encryption - which is slow but does not require any prior agreement of keys - is used to negotiate a key that is used for private key encryption - which is fast, but requires a key to be negotiated. This key negotiation phase is part of the SSL handshake.

It turns out, particularly for small transfers as are typical on the web, that the public key encryption negotiation phase of the handshake is quite expensive. To alleviate this the server may (almost always will) give the client a Session ID during the course of the handshake. If the client reconnects it _may_ offer this Session ID and _if_ the server recognises it then an abridged version of the handshake is performed which relies on cryptographic information that both the client and server have cached.

Observe:

If the client does not offer a Session ID, the long handshake is performed.

If the server does not recognise the Session ID - perhaps because it has expired, perhaps because it is a different machine, perhaps because the Session ID is bogus - the long handshake is performed.

Also, if the client tries to guess a Session ID and guesses one that the server knows about, unless the cached key information it holds matches, the handshake will fail and the session will terminate. Guessing the cryptographic information is usually difficult at best, though it depends what cipher suite (combination of cryptographic algorithms) was used for the original session. Thus, DoS issues aside, guessing Session IDs is typically of little value.

It is the second point above that allows failover of SSL servers to work. The server is actually allowed to cache the Session ID for as long as it wants, including discarding it immediately. This is catered for by falling back to the long handshake if the Session ID is not matched.

Stephane Klein

I have an LVS using persistence. All is working well until I stop a real server. The director continues to send requests to the real server which was stopped. ipvsadm -Lcn confirms that the requests are still sent to the stopped real server.

Horms 2005/03/22

The problem here is that persistence still takes effect even after the real server is removed (I assume you have quiescent=1). You can change this behaviour by enabling /proc/sys/net/ipv4/vs/expire_quiescent_template. The effect of this is that the persistence templates are expired when a real server is made quiescent. And thus no additional connections will be directed to the real server in question.

Pitscheider, Oswald Oswald (dot) Pitscheider (at) siag (dot) it 5 Sep 2008

I have some trouble with an LVS on CentOS 5.1 with kernel 2.6.18-92.1.10.el5. When both real servers are up, everything works fine, but when I shut down one of them, the LVS blocks for a few minutes. After that time, the LVS seems to work well, but when I start the real server, every connection is routed to only one real server. My configuration includes: After restarting keepalived, everything works fine. When I set weight to 0 before the server goes down, I have no problems.

Graeme

Your problem is being caused because you're quiescing the realserver (i.e. setting weight to 0 by using "inhibit on failure") instead of removing it from the pool. When the weight is 0, clients with connections which have not yet reached a protocol timeout will reconnect to the same realserver to continue the connection - this is very common, for example, for a webserver with Keepalives turned on. If you remove "inhibit on failure" your LVS will run as you expect, but you may also need to set the sysctl: That ensures existing connections to realservers which have been removed from the pool are expired immediately.

Oswald

I've tried the LVS with these changes with a little success, but there is still the problem that if I remove a real server, requests to the server are answered very slowly. From the moment when the real server is removed from the pool, some requests have to wait seconds for an answer. After a minute, the LVS works as it should. I've tested the LVS using jmeter with 25 threads.

Graeme This is fairly predictable, from your configuration and from the way TCP works. Each realserver is checked every 20 seconds (delay_loop 20). If you stop Apache just as the check is done successfully, requests will stall for 20 seconds until the next check (because the server isn't responding). If a request arrives fractionally after the successful check, the server isn't responding, then the client will retry at the following intervals: Note however that it may take a short period for keepalived to do the server removal, which may overlap with retry 3 - and the next delay to retry 4 is another 24 seconds (3, 6, 12, 24 and so on) which takes you towards 45 seconds altogether. And depending on the way jmeter is configured, alongside your webserver config, this will mean a minimum of 20 seconds (and likely much longer) delay between you dropping the webserver and the clients recovering. It is perfectly permissible to bring down the delay_loop as much as you or your app server can tolerate. For fast failover you need a short delay. I would argue that for most web clients, 20 seconds is perfectly acceptable but that can depend entirely on what you're trying to achieve. Try "delay_loop 1" and see what you get. What you will get, possibly, are a lot of log entries - but you should get very fast recovery.
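The arithmetic in Graeme's reply can be checked with a few lines of Python. The 3 second first delay and the doubling pattern (3, 6, 12, 24) are taken from his posting; the function names and the idea of comparing retry times against the moment of removal are assumptions made for this sketch, and real clients may use different retry timers.

```python
from itertools import accumulate


def retry_times(first_delay=3, retries=4):
    """Seconds after the first attempt at which each retry is sent,
    assuming the delay doubles each time (3, 6, 12, 24, ...)."""
    delays = [first_delay * 2 ** i for i in range(retries)]
    return list(accumulate(delays))


def first_attempt_at_or_after(removal_time, first_delay=3, retries=6):
    """Time of the first attempt sent at or after the dead realserver is
    removed from the pool, i.e. the first attempt that can reach a live one.
    Returns None if the client gives up before that."""
    for t in [0] + retry_times(first_delay, retries):
        if t >= removal_time:
            return t
    return None
```

retry_times() gives [3, 9, 21, 45], which is where the "towards 45 seconds" figure comes from: if removal happens at 20 seconds the retry at 21 seconds succeeds, but if removal completes at 22 seconds the client waits until 45 seconds.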

what if a realserver holding a persistent (sticky) connection crashes

An explanation of the problem:

normal (non-persistent) connection to a service (e.g. httpd). If the server crashes while your tcpip connection is open, that connection will hang (it will eventually time out). The client will notice some icon showing that the browser is continuing to look for the page. The director will notice that the realserver has died and will remove the realserver from the ipvs table by first setting its weight to 0. This will stop any new connections, but allow current connections to continue (and eventually exit). Since the current connections are hung, the director will assume they have exited after the time of the tcp timeouts. Once the connection table for that realserver is empty, the entries for the realserver are removed from the ipvs table. Eventually the browser will time out or the user will reload, thus establishing a new tcpip connection, whereupon the LVS will connect the user to a working realserver (the dead one not being sent any new connections). The connection with the original realserver was lost (or hung). The persistence of the connection is the tcpip connection - any new tcpip connection will be sent to a new realserver. Clients are used to connections on the internet hanging and will not realise that a realserver died on them. The behaviour of ip_vs here gives a satisfactory behaviour as far as the client is concerned. If the service was telnet, the client would have a hung session and would have to close out their window and reconnect. This is not satisfactory, but there's no way to transfer a tcpip connection to a new machine.

persistent connection to a service (e.g. https) with -p 600 (10 mins timeout). Everything is the same as for the non-persistent connection, except the criteria for terminating the user session. If you have set a persistence timeout on the director of 10mins, then the director is saying "no matter what happens, I will connect this client to that realserver for all tcpip connection requests for the next 10mins (even if the realserver is dead)". The director is guaranteeing that the realserver will be up for the next 10mins and the persistence extends beyond any single tcpip connection to cover new tcpip connections in the timeout period. If the director sets weight=0 for a realserver (e.g. if it has crashed), then new tcpip connections from the client will still be sent to the same (dead) realserver. The behaviour of ipvs, which satisfactorily removes realservers when the granularity is a tcpip connection, doesn't work when the LVS session can cover many tcpip connections.

Ben Hollingsworth

OK, so I've got my setup nailed down pretty well. This is pair of squid web proxies on a 2-host LVS running UltraMonkey / HB 2.0.7-8 on RHEL4 (2.6.9). I'm struggling with one more thing, though. With quiescent=true, if I shut down squid on one box, connections from new hosts fail over to the other box just fine, but connections from persistent hosts keep going to the same, dead box. I realize this is as intended. If I set quiescent=false, all client communication with the dead box ceases immediately, which includes cutting off active connections at the knees. That's not an issue if the squid actually dies. However, most of our failovers will be due to my own planned maintenance. In that case, I'd like to allow existing connections (which may be lengthy downloads) to finish before sending new requests (even from persistent clients) to the live box. I can't find any way to do this without hacking the kernel to match a 2-yr-old patch that Horms published (assuming that even applies to my setup). Most of the info about this seems to have been written three years ago. Is there a way to make this work without a custom compile?

Adrian Chapela achapela (dot) rexistros (at) gmail (dot) com 12 Mar 2007

OK, to solve the problem there are two variables:

/proc/sys/net/ipv4/vs/expire_nodest_conn: to expire connections before the protocol timeout. This is to solve the problem when a server goes down. For example in the UDP protocol the protocol timeout is too high.

/proc/sys/net/ipv4/vs/expire_quiescent_template: I think this is the variable that will solve your problem. With this you time out your persistence template when a server goes down.

I don't know exactly what they do, but the first one solved my problems. You could run a test.

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 12 Mar 2007

I have one solution, but it works only in the case of a transparent proxy setup. Instead of persistence, use the lblc scheduler (without persistence). lblc itself gives you some kind of persistence of 6 minutes or more. If 6 minutes is not enough for you, please look here: http://kb.linuxvirtualserver.org/wiki/Talk:Locality-Based_Least-Connection_Scheduling

-----------------------------------------------------------------

The material below is older, from when the persistence code was at an earlier stage of development. None of this code exists anymore.

Ted Pavlic tpavlic_list (at) netwalk (dot) com

Is this a bug or a feature of the PCC scheduling... A person connects to the virtual server, gets direct routed to a machine. Before the time set to expire persistent connections, that real machine dies. mon sees that the machine died, and deletes the realserver entries until it comes back up. But now that same person tries to connect to the virtual server again, and PCC *STILL* schedules them for the non-existent real server that is currently down. Is that a feature? I mean -- I can see how it would be good for small outages... so that a machine could come back up really quick and keep serving its old requests... YET... For long outages those particular people will have no luck.

Wensong You can set the timeout of template masq entry into a small number now and the connection will expire soon. Or, I will add some codes to let each realserver entry keep a list of its template masq entries, remove those template masq entries if the realserver entry is deleted.

To me, this seems most sensible. Lowering the timeouts has other effects, affecting general session persistence...

I agree with this. This was what I was hoping for when I sent the original message. I figure, if the server the person was connecting to went down, any persistence wouldn't be that useful when the server came back up. There might be temporary files in existence on that server that don't exist on another server, but otherwise... FTP or SSL or anything like that -- it might as well be brought up anew on another server. Plus, any protocol that requires a persistent connection is probably one that the user will access frequently during one session. It makes more sense to bring that protocol up on another server than waiting for the old server to come back up -- will be more transparent to the user. (Even though they may have to completely re-connect once) So, yes, deleting the entry when a realserver goes down sounds like the best choice. I think you'll find most other load balancers do something similar to this. mike mike (at) bizittech (dot) com 28 Sep 2003

I am using LVS-DR to balance 4 MS servers. Due to the nature of the web application and the user behavior I had to set the connection timeout to 30 min.

Joe: he does not specify whether he is using persistence, nor which timeout this is. Presumably it is the persistence timeout.

In case of failure of one of the realservers users need to be forced to connect to a different server. That means the lvs tables need to be cleared of connections from clients to the failed box, so that any reconnect attempt will open a new connection to one of the functioning servers. I am using ldirectord to start up and monitor. I am using ldirectord to poll the realserver for the result of an asp page. In case of failure it turns the weight to 0 on the ipvs rule. No new connections will be sent to the dead realserver, but on every retry the clients still try to connect to the dead realserver until the timeout of that connection. This is the expected behaviour according to lvs documentation.

Joao Clemente jpcl (at) rnl (dot) ist (dot) utl (dot) pt

How do you delete the entry of the realserver?

Mike

Basically I'm using a similar rule to the one used to insert the virtual servers into lvs. It's something like this (I can't be 100% exact as I don't have access to my lvs box from home)

Matthew Crocker matthew (at) crocker (dot) com 28 Sep 2003

Don't set the weight to 0, remove the realserver from the LVS table when it fails. When you remove the realserver from the table you also remove the information from the persistence table. Setting the weight to 0 is normally used for orderly shutdown of a realserver for maintenance.

Joe: the entries for the current connections to the realserver stay in the ip_vs hash table until they time out, even though they are no longer displayed in the default output of ipvsadm. These current connections can't be used with a dead realserver.

Peter Nash peter.nash (at) changeworks (dot) co (dot) uk 29 Sep 2003

I'm using LVS-NAT with persistence controlled by ldirectord. I've found that the "quiescent=" line in ldirectord.cf controls the behaviour you are looking for. If "quiescent=no" then when a realserver fails its LVS entries are removed from the table and clients immediately fail over to an alternate server. If "quiescent=yes" then when a realserver fails its entries remain in the LVS tables but the weight is set to 0 and clients will continue to try to connect to that server until the persistence expires. The default setting (on my installation) was "yes" and I had to change this to get the behaviour I wanted.

Rommel, Florian Florian (dot) Rommel (at) quartal (dot) com 28 Sep 2003

In your ldirectord.cf, add this line at the top (above your virtual section). It deletes the server entry from the table automatically if the server fails. Once the server is back up it'll add it automatically. If that line is not set, the default is yes, which just sets the server to weight 0 and that leaves the connections persistent. I had to look for a while to find that little line.

Mike

Thanks Florian Rommel and peter nash. That was it.
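The difference between quiescent=yes and quiescent=no discussed in this thread comes down to whether the failed realserver is edited to weight 0 or deleted from the table. The sketch below builds the corresponding ipvsadm command lines using the standard -e (edit server), -d (delete server), -t, -r, -w and forwarding-method options. It is an illustration only: the function name failure_command is hypothetical, the addresses are placeholders, and this is not claimed to be what ldirectord executes internally.

```python
def failure_command(vip, rip, quiescent, forward_flag="-m"):
    """Return the ipvsadm argument list matching the chosen policy.

    vip, rip     : "address:port" strings for the TCP virtual service
                   and the failed realserver (placeholders here)
    quiescent    : True  -> keep the entry, set weight 0
                   False -> delete the entry from the table
    forward_flag : -m (NAT), -g (DR) or -i (tunnel); repeated on -e so the
                   forwarding method is stated explicitly
    """
    if quiescent:
        return ["ipvsadm", "-e", "-t", vip, "-r", rip,
                forward_flag, "-w", "0"]
    return ["ipvsadm", "-d", "-t", vip, "-r", rip]


# Placeholder addresses for illustration only.
quiesce = failure_command("192.0.2.10:80", "10.0.0.2:80", quiescent=True)
remove = failure_command("192.0.2.10:80", "10.0.0.2:80", quiescent=False)
```

The lists can be joined with spaces to inspect them, or passed to subprocess.run on a director (as root) once the addresses are replaced with real ones.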

vilsalio (at) eupmt (dot) es

I don't know how I can remove the persistence when one of my realservers crash, without waiting for expiration of the timeout.

ratz 27 Nov 2003

/proc/sys/net/ipv4/vs/expire_nodest_conn should do what you want.

Patrick Kormann pkormann (at) datacomm (dot) ch

I have the following problem: I have a direct routed 'cluster' of 4 proxies. My problem is that even if the proxy is taken out of the list of real servers, the persistent connection is still active, that means, that proxy is still used.

Andres Reiner

Now I found some strange behaviour using 'mon' for the high-availability. If a server goes down it is correctly removed from the routing table. BUT if a client did a request prior to the server's failure, it will still be directed to the failed server afterwards. I guess this got something to do with the persistent connection setting (which is used for the cold fusion applications/session variables). In my understanding the LVS should, if a routing entry is deleted, no longer direct clients to the failed server even if the persistent connection setting is used. Is there some option I missed or is it a bug ?

Wensong Zhang wrote:

No, you didn't miss anything and it is not a bug either. :) In the current design of LVS, the connection won't be drastically removed; instead packets are silently dropped once the destination of the connection is down, because monitoring software may mark the server temporarily down when the server is too busy or when the monitoring software makes an error. When the server is up again, the connection continues. If the server is not up for a while, then the client will time out. One thing is guaranteed: no new connections will be assigned to a server when it is down. When the client re-establishes the connection (e.g. presses reload/refresh in the browser), a new server will be assigned.

jacob (dot) rief (at) tis (dot) at wrote:

Unfortunately I have the same problem as Andres (see below) If I remove a realserver from a list of persistent virtual servers, this connection never times out. Not even after the specified timeout has been reached.

Wensong

The persistent template won't time out until all its connections time out. After all the connections from the same client expire, new connections can be assigned to one of the remaining servers. You can use "ipchains -M -L -n" (or netstat -M) to check the connection table (for 2.4.x use cat /proc/net/ip_conntrack).

Only if I unset persistency will the connection be redirected onto the remaining realservers. Now if I turn on persistency again, a previously attached client does not reconnect anymore - it seems as if LVS remembers such clients. It does not even help if I delete the whole virtual service and restore it immediately, in the hope of clearing the persistency tables. ; ipvsadm -A -t -p; ipvsadm -a -t -R And it also does not help closing the browser and restarting it. I run LVS in masquerading mode on a 2.2.13 kernel patched with ipvs-0.9.5. Wouldn't it be a nice feature to flush the persistent client connection table, and/or list all such connections?

Wensong There are several reasons that I didn't do it in the current code. One is that it is time-consuming to search a big table (maybe one million entries) to flush the connections destined for the dead server; the other is that the template won't expire until its connection expire, the client will be assigned to the same server as long as there is a connection not expired. Anyway, I will think about better way to solve this problem.

Load Balancing time constant is longer with persistence

(This is from a thread on 'Preference' instead of 'persistence' started by Martijn Klingens on 2002-10-08.)

Load balancing occurs with a time constant of the connection to the LVS. For a non-persistent connection like http, with FIN_WAIT=2mins, loads will balance on a time scale longer than 2mins. At shorter time scales, the loads will not be balanced. For persistence with a persistence timeout of 30mins, load balancing will require times greater than 30mins (like several hours). This problem is related to the imbalance caused by proxy farms (e.g. AOL).

The tcp NONE flag

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/26

What does the TCP flag NONE mean? When I make a connection through LVS to a real server and look in the connection table I normally get And everything including persistence works as expected But when I connect using a bit of javascript from IE(client side) I get :

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com

This is the way LVS manages persistence, by creating a NONE connection in the connection table.

And the first connection gets a 404 error, further refreshes work fine, and then persistence doesn't seem to work? Do connections with a status of NONE not get put in the persistence table? The javascript is refreshing a page from the server every 1 minute. If you set the javascript to go every 10mins you get far more 404 errors.

This looks strange. Your persistence timeout seems to be about 20min (the time to timeout is just after the string "TCP"), so 1min or 10min should be the same. I would suspect a problem in the js itself. Look at your server error_log.

This post (http://www.in-addr.de/pipermail/lvs-users/2005-February/013235.html) suggested dropping all TCP NONE packets as they weren't required.

Your servers do not receive NONE TCP connections; they are created locally and are just there for persistence management purposes.

Resetting the persistence timeout counter (persistence behaviour for short timeout values)

Terry Green tgreen (at) mitra (dot) com 2003-02-06

the LVS-HOWTO states:
With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) time after the last client disconnects.
This appears to be not quite true. (In the following tests I'm using Kernel 2.4.19 with patch 1.0.7) Testing/Observations - using a port 80 definition with 5 minute persistence (keepalived being used to do the configs).

RemoteAddress:Port        Forward Weight ActiveConn InActConn
TCP  devlivelink:http rr persistent 300
  -> devlivelink2:http    Route   1      0          0
  -> devlivelink1:http    Route   1      0          0
TCP  devlivelink:https rr persistent 300
  -> devlivelink2:https   Route   1      0          0
  -> devlivelink1:https   Route   1      0          0

I start a connection to the web server for purposes of downloading a large file (which will take more than 5 minutes). Every time I connect from the client, I see the connection template timeout reset to 5 minutes, as you would expect from the persistence timeout value (300sec). I've shortened the TCP timeouts for purposes of testing with ipvsadm --set 5 4 0.

However, if the template record is allowed to expire, it will be kept because there's still an active connection, but its time will be reset to the IP_VS_S_TIME_WAIT constant (defaulted to 2 minutes in ip_vs_conn.c) rather than to the persistence time set for this session. Further, the data structure for the connection template appears to have been corrupted, as any further connections from the client reset the template time to 2 minutes, instead of the original persistence time.

To verify this, I changed line 317 of ip_vs_conn.c and recompiled the kernel. Rerunning the tests, I see the connection template record being reset to 1:40 instead of 2:00. Here are the IPVS connection entries (output of ipvsadm -Lc) as time progresses.

Julian

Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection"; it could be useful.

Terry

Agree completely. However, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT.

Julian

The persistence timeout is used only once: when the first connection from this client is established. The current meaning is that the persistence time covers the period of time after the client appears for the first time. It is extended if there are still active connections. Then there are 3 (or more) options: (1) extend it again with the persistence time; (2) extend it with 2mins; (3) use the persistence time after the last connection from the client terminates. The second option is the one implemented, as we found it was the behaviour the users expected :) A long time ago my opinion was that it is better to use the persistence time when the last connection terminates (item 3 above). We could make this a config option if anyone wants it.

Maybe you see the value 20 seconds after the 2-minute cycle is restarted. It is "reset" only when its timer expires, not when the controlled connections expire.

Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I did another connect from the client, and the template record was reset again to 2 minutes (not the persistence setting again), suggesting the template record data structure had somehow had its persistence time reset from the original setting to 2 minutes. Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_S_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection. My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_S_TIME_WAIT.

True, your reading is accurate :) I now see why it was 1:40 Joe - other people have found this behaviour too chulmin2 (at) hotmail (dot) com 2003-02-11

I have set the persistence timeout to 30s. After I connected, I confirmed the settings. But after 30s the timeout returns to 2 mins.

Here's Terry's summary:

I observed the same behavior, and traced it down to the scenario where the template record times out with valid connection records still counting down. In this case, the template record is reset to 2 minutes (actually, to the value of the IP_VS_S_TIME_WAIT constant). When this happens, the data structure record representing the template connection also gets altered, because any further connections from the client reset the template record to 2 minutes (NOT the original session persistence time). The replies I got from Julian suggested that this behavior was intended (and thus, I would suggest, the documentation is slightly inaccurate). I didn't pursue it too far, as this only showed up when I was using really short persistence times for testing purposes. I don't expect it will happen too often or have too much impact when using a more practical session timeout time.
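The behaviour Terry describes can be written down as a small state model. The sketch below is only an illustration of the observations reported in this thread (2.4.19), not a transcription of ip_vs_conn.c; the class and attribute names are made up for the example, and TIME_WAIT_SECS stands in for IP_VS_S_TIME_WAIT.

```python
# Toy model of the persistence-template timer as observed in this thread.
# NOT kernel code: all names here are hypothetical and for illustration only.

TIME_WAIT_SECS = 120  # stands in for IP_VS_S_TIME_WAIT (2*60*HZ)


class PersistenceTemplate:
    def __init__(self, persistence_secs):
        # Value the timer is set to whenever a new connection arrives.
        self.reset_value = persistence_secs
        self.timer = persistence_secs
        self.active_connections = 0
        self.alive = True

    def new_connection(self):
        """A new connection from the client restarts the template timer."""
        self.active_connections += 1
        self.timer = self.reset_value

    def close_connection(self):
        self.active_connections -= 1

    def tick(self, secs):
        """Advance time by secs and handle template expiry."""
        self.timer -= secs
        if self.timer > 0:
            return
        if self.active_connections > 0:
            # Observed: the template is kept, restarted with TIME_WAIT, and
            # later connections also get TIME_WAIT, not the persistence time.
            self.reset_value = TIME_WAIT_SECS
            self.timer = TIME_WAIT_SECS
        else:
            self.alive = False


template = PersistenceTemplate(persistence_secs=300)
template.new_connection()   # long download starts, timer is 300
template.tick(300)          # template expires while the download is active
first_reset = template.timer
template.new_connection()   # another connection from the same client
second_reset = template.timer
print(first_reset, second_reset)
```

In this model both values come out as 120 rather than 300, which matches what Terry and chulmin2 report seeing with ipvsadm -Lc.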

Why you don't want persistence for your e-commerce site: why you should rewrite your application Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uk 18 Sep 2002 The main problem with using persistence for session variable tracking is that the only thing you are gaining by using LVS is increased performance. You are not getting any high availability, i.e. if your realserver falls over during a persistent SSL session, you lose your shopping basket (or whatever). Anyone using ASP/IIS will be well used to the service restarting all the time due to the 64MB ASP memory limit in IIS5 (wonder if they'll raise this in .net). My wife always leaves web sites open for things like holidays/hotels etc so that when I come home I can see it... Often as soon as I click anything I lose the session.. :-( Bad design. To save money on re-coding.. code it properly in the first place. Joe Stump joe (at) joestump (dot) net 22 Nov 2002 (replying to another thread)

What Joe is trying to get at here (and this would apply to you PHP session users out there as well) is that your realservers should have access to your session files. The simple solution is a shared drive (under windows) or an NFS mount (under *NIX). Other solutions include NetApps, SANs, etc. The problem Devendra was having is that the session files exist independently on each of the realservers. When a realserver dies or is taken out of the RAIC, all of the session files on that realserver are gone. If they were in a central location where all servers had access, they wouldn't die.

Joe

I've sat on https sites for more than 30mins of inactivity. Also I've had the modem line drop on me in the middle of filling in forms on badly written websites (eg registering a domainname) - when I come back, I have a new IP. I expect anyone who wants to do internet business to handle these problems seamlessly.

Roberto Nibali ratz (at) tac (dot) ch 10 Sep 2002 Exactly, and most of the time you've got non-technical stakeholders or managers in the back that will rip your head off if that happens.

Persistence only gets you so far here, since memory requirements limit you to the number of connections maintained.

Yes, memory and timeout constraints combined in a linear fashion.

Ratz's idea (in the HOWTO) is to redesign the application. He can do that. Not everyone can. He maintains state data on the servers with a database.

Everyone can, and the other people can work with Tomcat's internal state replication module to do that. But it's slow last time I tested it (1 year ago) and tends to have nasty locking issues.

Alternately, in php3 you could write the state information into the URL that the client moves to on the next click (this functions the same as cookies). Joe Dec 2003: I thought this was the solution for quite a while. However I now find that since all the data is encoded in a long string as part of the URL, the client can manipulate it, making the data at the client and server different. You do not want this to happen. If you can't rewrite the application, then you'll risk losing some customers and I would say that LVS is not for you. DoS problems are difficult for everybody. With persistence it's just worse.

DoS problems are not to be solved on the LVS box. Matthias Krauss 10 Sep 2002

what is the maximum timeout value

Joe There is no maximum value. However the connection underneath will timeout eventually, and you will start to use a lot of memory with a large number of connections. Julian

Note that setting 0 as RS weight is assumed as "stopped temporary". The existing connections continue to work. It is assumed that the RS weight is set to 0 some time before deleting the RS. By this way we give time for all connections/sessions to terminate gracefully. Sometimes weight 0 can be used from health checks as a step before deleting the RS. Such two-step realserver shutdown can avoid temporary unavailability of the realserver. Graceful stop. At least, the health checks can choose whether to stop the RS before deleting it.

Roberto Nibali ratz (at) tac (dot) ch 13 Sep 2002 It's also useful to introduce a service level window for maintenance work. If you have a service level agreement with only a few minutes downtime a year and you need to exchange the HD of one RS, you can quiesce that particular RS about 1 hour before the maintenance work, and if you have a reasonably low or most of the time even no active connection rate, you unplug the cable, shut down the server and fix the problem. Then you put it back in, set the weight>0 and off you go.

If the RS is deleted the traffic for existing conns is stopped (and if expire_nodest_conn sysctl var is set the conn entries are even deleted). Of course, if for some connections we don't see packets these conns can remain in the table until their timer is expired.
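The two-step shutdown (set the weight to 0, then delete) can be illustrated with a toy scheduler. This is a sketch under simplifying assumptions: plain round-robin, no persistence, and a flag that plays the role of the expire_nodest_conn sysctl. The Director class and its methods are hypothetical names for the example, not part of ipvs or ipvsadm.

```python
# Toy round-robin director illustrating "weight 0, then delete".
# Hypothetical names; this models the described behaviour, it is not ipvs code.


class Director:
    def __init__(self, weights):
        self.weights = dict(weights)   # realserver -> weight
        self.connections = {}          # client -> realserver
        self._next = 0                 # round-robin position

    def schedule(self, client):
        """Send a new connection to the next realserver with weight > 0."""
        servers = sorted(self.weights)
        for offset in range(len(servers)):
            rs = servers[(self._next + offset) % len(servers)]
            if self.weights[rs] > 0:
                self._next = (self._next + offset + 1) % len(servers)
                self.connections[client] = rs
                return rs
        raise RuntimeError("no realserver with weight > 0")

    def quiesce(self, rs):
        """Step 1: stop new connections; existing entries are left alone."""
        self.weights[rs] = 0

    def delete(self, rs, expire_nodest_conn=True):
        """Step 2: remove the realserver; optionally drop its entries too."""
        del self.weights[rs]
        if not expire_nodest_conn:
            return []
        dropped = [c for c, r in self.connections.items() if r == rs]
        for client in dropped:
            del self.connections[client]
        return dropped


director = Director({"rs1": 1, "rs2": 1})
director.schedule("client-a")           # lands on rs1
director.schedule("client-b")           # lands on rs2
director.quiesce("rs1")                 # client-a keeps its entry
director.schedule("client-c")           # new connections skip rs1
still_served = director.connections["client-a"]
dropped = director.delete("rs1")        # client-a had not finished, so it is dropped
print(still_served, dropped, director.connections)
```

In practice you would wait between the two steps until the quiesced realserver has no connections left, which is the point of the service window Ratz describes.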

Bobby Johns bobbyj (at) freebie (dot) com 13 Sep 2002 When you add in the persistence problem I suspect you're doing something that's a bad idea. I suspect the reason you need persistence (or think you do) is because you're storing state or session information locally on each web server. Although it may work, it's a weak design for a web app. If you want a high performance solution, use a common server with something like MySQL on it to hold the session or state information. If you're nervous about the single point of failure on the database box, add a replicated server behind it. Keeping state info on each web server is just a weak solution in a highly-available high-performance environment. Hardware is pretty cheap in comparison. I would suggest 2 LVS servers running HA between them, 2 or more web servers, and 2 session/state db servers running replicated. Bang for the buck, it's a good solution and gives you a pretty resilient, robust, and scalable system. The system you are trying to implement now will hammer 33% of your user sessions if you have a web server failure and ALL of them if you have an LVS server failure. With the proper monitoring and HA, no single machine failure will hammer your users in the system I suggest. For the price of 6 or 7 Linux server boxes, you have what people used to pay more than $100K for just a few years ago.

more about e-commerce sites: we used to think memory was the problem - it isn't The original idea of persistence was to allow for connections like https sessions. This solved the problem of keeping the client's connection on the same realserver. However it doesn't work well. The first problem is that it uses a lot of memory. The default timeout for LVS persistence is somewhere around 360secs, while the default timeout for a regular LVS connection via LVS-DR is TIME_WAIT (about 1 minute). This means that persistent connections will stay in the LVS connection table 6 times longer than regular connections. As a consequence the hash table (and memory requirements) will be 6 times larger for the same number of connections/sec. Make sure you have enough memory to hold the increased table size if you're using persistent connections. If the persistence is being used to hold state (e.g. shopping cart), then you must allow a long enough timeout for the client to surf to another site for a better price, make a cup of coffee, think about it and then go find their credit card. This is going to be much longer than any reasonable timeout for LVS persistence and the state information will have to be held on a disk somewhere on the realservers and you'll have to allow for the client to appear on a different realserver later with their credit card information. The next problem is that persistence doesn't allow for failover. The memory problem really isn't as bad as was originally thought. Here are some exchanges on the mailing list which talk about the real problems. Joe 18 Sep 2002

The conventional LVS wisdom is that it's not a good idea to build an LVS e-commerce website in which https is persistent for long periods. The initial idea was that a long timeout allows the customer to have a cup of coffee or surf to other websites while thinking about their on-line purchase.

Julian 18 Sep 2002 Yes, if your site uses persistence for HTTP/HTTPS then you better to use cookies (not LVS). If you don't care for the HTTPS persistence (any realserver can serve connections from one client "session") then you create normal service. In such case your care for the backend DB.

The problem with this approach is that the amount of memory use is expected to be large and the director will run out of memory. We've been telling people to rewrite their application so that state is maintained on the realservers, allowing the customer to take an indefinite time to complete their purchase. Currently 1G of memory costs about an hour of programmer's time (+ benefits, + office rental/heating/airconditioning/equipment + support staff). Since memory is cheap compared to the cost of rewriting your application, I was wondering if brute force might just be acceptable. I can't find any estimates of the numbers involved in the HOWTO, although similar situations have been discussed on the mailing list e.g. http://marc.theaimsgroup.com/?l=linux-virtual-server&m=99200010425473&w=2. There the calculation was done to see how long a director would hold up under a DoS. The answer was about 100secs for 128M memory and a 100Mbps link to the attacker doing a SYN flood. I'm not running one of these web sites and I don't know the real numbers here. Is amazon.com or ebay connected by 100Mbps to the outside world? What can you do with 1G of memory on the director? Each connection requires 128bytes. 1G/128 is 8M customers online at any one time. Assuming everyone buys something this is 1500 purchases/sec. You'd need the population of a large town just to handle shipping stuff at this rate. I doubt if any website at peak load has 8M simultaneous customers. However you only have 64k ports on each realserver to connect with customers, allowing only 64k customers/realserver.

Note that the port limit is only between two IPs. You still can reuse one port for many connections if the two connections don't have same ends (IP and port).

How much memory do you need on the director to handle a fully connected realserver? 64k x 128 = 8M Let's say there are 8 realservers. How much memory is needed on the director? 8 x 8M = 64M this is not a lot of memory. So the problem isn't memory but realserver ports AFAIK

No, you don't waste realserver ports for connections from client to the LVS. But using many sockets in realserver hurts. Memory for sockets is a problem, sometimes the sockets can reserve huge buffers for data.

What is the minimum throughput of customers assuming they all take 4000 sec (66 mins) to make their purchase? 8 x 64k/4000 = about 130 purchases/sec. You're still going to need to hire a few people to pack and ship all this stuff. If people only take 6mins for their purchase, you'll be shipping about 1450 packages/sec. Assuming you make $10/purchase at 130 purchases/sec, that's about $41G/yr. So with 64M of memory, 8 realservers, 4000sec persistence timeout, and a margin of $10/purchase I can make a profit of about $41G/yr. It seems memory is not the problem here, but realserver ports (or being able to ship all the items you sell). Let's look at another use of persistence - for squids (despite the arrival of the -DH scheduler, some people prefer persistence for squids). Here you aren't limited by shipping and handling of purchases. Instead you are just shipping packets to the various target httpd servers on the internet. You are still limited to 64k clients/realserver. Assume you make persistence = 256secs (any client who is idle for that time is not interested in performance). This means that the throughput/realserver is 256hits/sec. This isn't great. I don't know what throughput to expect out of a squid, but I suspect it's a lot more.
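The arithmetic in this exchange can be checked with a few lines of Python. This is a rough sketch: it takes 64k as 65536 and 128 bytes per connection entry, as quoted in the thread, and assumes every slot is occupied for the full hold time. The function names are made up for the example, and the results are order-of-magnitude estimates only.

```python
# Back-of-envelope figures for a director in front of fully connected realservers.
# Assumptions (from the thread): 64k connections per realserver, 128 bytes per entry.

CONNECTIONS_PER_REALSERVER = 64 * 1024
BYTES_PER_ENTRY = 128
SECONDS_PER_YEAR = 365 * 24 * 3600


def director_memory_bytes(realservers):
    """Memory for the connection table if every realserver is fully connected."""
    return realservers * CONNECTIONS_PER_REALSERVER * BYTES_PER_ENTRY


def completions_per_sec(realservers, hold_secs):
    """Turnover rate if every slot is held for hold_secs and then reused."""
    return realservers * CONNECTIONS_PER_REALSERVER / hold_secs


memory_mib = director_memory_bytes(8) // 2**20
slow_rate = completions_per_sec(8, 4000)      # customers who take 4000 s
fast_rate = completions_per_sec(8, 6 * 60)    # customers who take 6 min
squid_rate = completions_per_sec(1, 256)      # one squid, 256 s persistence
yearly_margin = 10 * slow_rate * SECONDS_PER_YEAR

print(f"director memory: {memory_mib} MiB")
print(f"purchases/sec at 4000 s: {slow_rate:.0f}")
print(f"purchases/sec at 6 min:  {fast_rate:.0f}")
print(f"hits/sec per squid:      {squid_rate:.0f}")
print(f"margin at $10/purchase:  ${yearly_margin / 1e9:.0f}G/yr")
```

Whatever the exact inputs, the connection table stays in the tens of megabytes, which is the point of the exchange: memory on the director is not the limiting factor.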

Ratz Well, it depends what you want to offer. If it's an online shop like amazon.com you certainly want to store the generated cookie or whatever it is on a central DB cluster where every RS can connect to and request for the ID if it doesn't already have one. The memory is a completely different layer. It's about software engineering and not about saving money. Yes, you can probably kill the problem temporary by adding more memory but a broken application framework remains a broken application framework. Plus, normally when you do build an e-commerce site, you have a customer that has outsourced this task to your company. So you do a C-requirement and a feasability study to provide the customer with a proper cost estimation. Now you build the application and it is built in a broken way so that you need to either fix it or add more RAM in our case. The big problem here is: you might have a strict SLA that doesn't permit this you change the C-requirements and thus you need a new test phase the customer gets upset because she spent big bucks on you It's lack of engineering and a typical situation of plain incompetence: When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do. But all this also depends on the situation. I don't think we can give people a generalised view of how things have to be done. One might argue that people come to this project because of monetary constraints and they sure do not care about the application if the problem is solved by putting more RAM into the director. I for example rather spend a few bucks on good hardware and a lot of RAM for the RS because they need to carry the execution weight of the application. The director is just a more or less intelligent router. pb (who has 1GB of memory and who wants to increase his persistence time to 60mins)

We handle 1 million messages a day, and 20,000+ webmail users, thus 125,000 messages per hour send/recv in an 8 hour work day. Would changing from 15 to 60min persistence on the LVS take up a lot of memory and processing (CPU/load) overhead? We're running 1GB of memory and dual Pentium III.

Malcolm Turnbull malcolm (at) loadbalanceri (dot) org 27 Apr 2004 I would think it would be fine, 1 GB should handle almost 8 million connections in the timeout period i.e. 60 mins (or 2mins with no persistence). Horms horms (at) verge (dot) net (dot) au 27 Apr 2004 I think Malcolm is on the money here. Keep in mind that each connection entry / persistence timeout consumes something like 128bytes (actually, it might be a bit bigger now, but it is still in that ball-park). You can do the maths (actually you should, my brain has already checked out for the day), but if you are getting 100 connections/s, for an hour, each from a unique host, then you are still only going to end up using about 45MB of memory for persistence entries. I doubt that will hurt you. I would also be surprised if you are getting connections from 360,000 unique hosts per hour :-) Joe for 4 realservers that's 40k messages/hr. I don't know how many tcp connections are required for a message transfer, but let's say it's 1. You have 15mins persistence, so 10k connections will be in existence at any one time. For memory for the ipvs hash table: At 128bytes/connection, that's 1.28M of memory for the ipvsadm hash table. You have quite a margin with memory. For disk and network I/O: Let's say the average e-mail is 10kB. Each realserver is processing (10k messages/(15*60) secs) * 10kB = 0.1MBytes/sec. Your disks and network also have large margins of safety.
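Horms' suggestion to "do the maths" amounts to multiplying the arrival rate of new clients by the persistence timeout and by the size of an entry. The sketch below does that; the 128 byte figure is the approximate value quoted above (it may be larger on newer kernels), and the function names are made up for the example.

```python
# Steady-state size of the persistence/connection table:
#   entries = (new clients per second) x (seconds each entry lives)
# BYTES_PER_ENTRY is the approximate figure quoted in the thread.

BYTES_PER_ENTRY = 128


def entries_in_table(new_clients_per_sec, timeout_secs):
    """Number of entries alive at once if each one lives for timeout_secs."""
    return new_clients_per_sec * timeout_secs


def table_bytes(new_clients_per_sec, timeout_secs):
    return entries_in_table(new_clients_per_sec, timeout_secs) * BYTES_PER_ENTRY


# Horms' case: 100 connections/s, each from a unique host, 60 min persistence.
horms_entries = entries_in_table(100, 60 * 60)
horms_bytes = table_bytes(100, 60 * 60)

# Joe's case: 10k entries in existence at any one time.
joe_bytes = 10_000 * BYTES_PER_ENTRY

# Malcolm's case: how many entries fit in 1 GiB.
capacity = 2**30 // BYTES_PER_ENTRY

print(horms_entries, horms_bytes / 1e6, joe_bytes / 1e6, capacity)
```

The three results correspond to the 360,000 hosts and roughly 45MB, the 1.28M, and the "almost 8 million connections" mentioned in the replies.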

The error just a couple random people are having with WEBMAIL is "invalid session ID" as though they lost their connection to the realserver (actually a "message director") they were on. But I don't know if it is the "message directors" fault, or LVS.

Have no idea, but I don't see any heavy load here. Are the clients timing out after 15mins and attempting to continue their session? Wouldn't the app/client know that the session has been closed and to go through the whole login procedure again? I don't know much about your app, I'm sorry. Nothing here addresses the issue of persistence timeout. This is determined by how long you allow the client to be disconnected before you propagate to all realservers the state changes in the realserver that occurred in the last connection.

persistence with windows realservers With unix realservers, we've been encouraging developers to rewrite the application, to save client state in a failover safe fashion (i.e. in a place accessible by all realservers). Previously you would ask the client to accept a cookie or save the client state on the realserver to which the client connects (but where the state will be lost if the realserver fails). Rewriting the application is possible with unix, which gives you access to the primitives and you can build the application any way you want (provided that you have enough time and you understand the primitives well enough). With Windows, you aren't given access to the primitives, but instead are given access to an API. If Windows has already coded up the function you want and you are happy to use that, then it's easy. If you want something else, you're SOL. devendra orion dev_orion (at) yahoo (dot) com 22 Nov 2002

I need to enable loadbalancing on our current director machine (LVS-NAT enabled). We currently have 3 realservers serving the same website, which need to be loadbalanced. The website is hosted on w2k and uses IIS session management (no cookies). The only problem is we need to keep this session alive for basically 8 hrs as our clients access the application continuously. How can I configure the loadbalancer to keep the connection persistent to the same server after successful client login?

Joe (giving the party line)

The best solution is not to use persistence, but to re-write the application, so that the state information is stored in a place accessible to all realservers. In this way, if one realserver fails, the session with the client can continue.

Alex Kramarov alex (at) incredimail (dot) com 22 Nov 2002

I continuously hear on this list suggestions to rewrite applications to use other session management means than the one that comes with IIS. As a Windows/Unix developer/administrator (you can mix and match any one of the 2 groups ;), I would really like to say that usually this is not that easy in the IIS environment, especially if you try to tell this to windows only developers, who don't know anything else than the MS way to do things. The best they can hope is to wait for the upcoming IIS 6 release, which includes session management that is meant for use in webfarms (db based), or to try some non microsoft (still proprietary) solutions that try to do the same, like the frameWERKS framework, which functions as a drop in replacement for the IIS session components. I am not saying this to start a MS war on the list, but only to say that when an ms inclined person hears that he should "re-write the application" - 95% chance that this will be his last try to use lvs for his solutions. On the other hand, saying that there is such and such a solution that can help him will probably be considered...

(Alex has given an explanation of IIS session management below). Tim Cronin tim (at) 13-colonies (dot) com 22 Nov 2002

We use IIS w/ sessions and LVS-NAT and use wlc w/ persistence. The persistence time must match the iis.session.timeout. We haven't had any problems, but we only have a 20min session.

messing with the ipvsadm table while your LVS is running This is an example of persistence with . Bowie Bailey

If I start a service with: and then change the persistence flag (setting the persistence granularity netmask to /24 with the -M option) to how does that affect the connections that have already been made?

Julian 30 Jul 2001 The connections are already established. But the persistence is broken, and after changing the netmask you can expect the next connections to be established to other realservers (not to the same one as before the change). (also see persistence netmask).

If IP address 1.2.3.4 was connected to RIP1 before I changed the persistence and then 1.2.3.5 tries to connect afterwards, would he be sent to RIP1, or would it be considered a new connection and possibly be sent to either server since the mask was 255.255.255.255 when the first connection happened?

New realserver will be selected. unknown

Let's say: I have 1000 http requests (A) through a firewall of a customer (so in fact all requests have the same Source IP for Loadbalancer, because of NAT) and then one request (B) from the Intranet and then again 1000 Request (C) from that firewall, what does LB do? I have three Realservers r1, r2, r3 (ppc with rr)

Ratz ratz (at) tac (dot) ch 12 Sep 1999 If C reaches the load balancer before all the 1000 requests of A expire, then the requests of C will be sent to r1, and the distribution is 2000:1:0. If all the requests of A expire, the requests of C will be forwarded to a server that is selected by a scheduler. BTW, persistent port is used to solve the connection affinity problem, but it may lead to dynamic load imbalance among servers. Jean-Francois Nadeau

I will use LVS to load balance web servers (Direct Routing and WRR algo). I use persistence with a big timeout (10 minutes). Many of our clients are behind big proxies and I fear this will unbalance our cluster because of the persistence timeout.

Wensong Persistent virtual services may lead to load imbalance among servers. Using some weight adaptation approaches may help avoid some servers being overloaded for a long time. When the server is overloaded, decrease its weight so that connections from new clients won't be sent to that server. When the server is underloaded, increase its weight.
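The imbalance described here, and the 2000:1:0 distribution in Ratz's reply above, can be reproduced with a toy model of round-robin scheduling plus persistence templates keyed on the client IP. This is a simplified sketch: the function name, the addresses and the request timings are invented for the example, and real ipvs tracks more state than a single expiry time per client.

```python
# Toy model: round-robin scheduling with per-client-IP persistence templates.
# Hypothetical names and addresses; not a description of ipvs internals.
from collections import Counter


def distribute(requests, realservers, persistence_secs):
    """requests: iterable of (time_secs, client_ip). Returns hits per realserver."""
    templates = {}   # client_ip -> (realserver, expiry_time)
    hits = Counter()
    rr_index = 0
    for now, client_ip in requests:
        entry = templates.get(client_ip)
        if entry is not None and entry[1] > now:
            rs = entry[0]                       # template still valid
        else:
            rs = realservers[rr_index % len(realservers)]
            rr_index += 1                       # scheduler picks the next server
        # Each connection restarts the template timer.
        templates[client_ip] = (rs, now + persistence_secs)
        hits[rs] += 1
    return hits


firewall, intranet = "192.0.2.1", "10.0.0.7"    # example addresses
requests = (
    [(i * 0.1, firewall) for i in range(1000)]              # A: via the firewall
    + [(100.0, intranet)]                                   # B: from the intranet
    + [(101.0 + i * 0.1, firewall) for i in range(1000)]    # C: firewall again
)
servers = ["r1", "r2", "r3"]

sticky = distribute(requests, servers, persistence_secs=600)
expired = distribute(requests, servers, persistence_secs=0.5)
print([sticky[s] for s in servers])    # C arrives before A's template expires
print([expired[s] for s in servers])   # A's template has expired before C arrives
```

With the long timeout everything from the firewall lands on r1; with the short one, C is rescheduled and goes to whichever server round-robin selects next.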

Can we alter directly /proc/net/ip_masquerade ?

No, it is not feasible, because directly modifying masq entries will break the established connection.

Persistence for multiport services Persistence was originally used to handle multiport services (e.g. ftp/ftp-data, http/https). While persistence is still the best method for with LVS-DR, LVS-Tun, http/https is better handled by persistence granularity with fwmark.

Proxy services, e.g. AOL Clients from AOL or T-online access the internet via proxies. Because of the way proxies can work, a client can come from one IP for one connection (eg port 80) and from another IP for the next connection (eg port 443) and will appear to be two different clients. Since there is no relation between the CIP and the identity of the client, LVS cannot loadbalance by CIP. Usually these two connections will come from the same /24 netmask. Lars wrote the persistence granularity patch for LVS, which allows LVS to loadbalance all clients from a netmask as one group. If you set the netmask for persistence to /24 (with the -M option to ipvsadm), all clients from the same class C network will be sent to the same realserver. This will mean that clients from AOL appear as a single (very active) client, and will likely take up all capacity on one realserver, leading to unbalance in load on the realservers. This is as good as we can do with LVS. Wensong If you want to build a persistent proxy cluster, you just need to set an LVS box at the front of all proxy servers, and use the persistent port option in the ipvsadm commands. BTW, you can have a look at wwwcache.ja.net/JanetServices/PilotServices.html "how to build a big JANET cache cluster using LVS" (link dead, May 2002). If you want to build a persistent web service but some proxy farms are non-persistent at client side, then you can use the persistent granularity so that clients can be grouped, for example if you use a 255.255.255.0 mask, the clients from the same /24 network will go to the same server. Jeremy Johnson jjohnson (at) real (dot) com

How does LVS handle a single client that uses multiple proxies? For instance with aol, when an aol user attempts to connect to a website, each request can come from a different proxy. So how (if at all) does LVS know that the requests are from the same client and bind them to the same server?

Joe If this is what aol does then each request will be independent and will not necessarily go to the same realserver. Previous discussions about aol have assumed that everyone from aol was coming out of the same IP (or same class C network). Currently this is handled by making the connection persistent and all connections from aol will go to one realserver. Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk

If an ISP (eg AOL) has a proxy array/farm then the requests are _likely_ to come from one of two possibilities: (1) a single subnet (if using an L4/L7 switch that rewrites ether frames, or using several NAT based L4/L7 switches); (2) a single IP (if using the common form of L4/L7 switch). The former can be handled using a subnet mask in the persistence settings, the latter is handled by normal persistence. *However* in the case of our proxy farm neither of these would work since we have 2 subnet ranges for our systems - 194.83.240/24 and 194.82.103/24, and an end user request may come out of each subnet, totally defeating the persistence idea... (in fact, dependent on our clients' configuration of their caches, the request could appear to come from the above two subnets or the above 2 subnets and about 1000 other ones as well). Unfortunately this problem is more common than might be obvious, due to the NLANR hierarchy, so whilst persistence on IP/subnet solves a large number of problems, it can't solve all of them.
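Michael's point about the two subnets can be seen by computing the key that a persistence netmask groups clients under. The sketch below uses Python's ipaddress module; persistence_key is a made-up helper, and the host addresses inside the two subnets are invented for the example. It illustrates only the netmask arithmetic, not how ipvs stores its templates.

```python
# Which clients fall into the same group for a given persistence netmask?
# persistence_key is a hypothetical helper; host addresses are examples.
import ipaddress


def persistence_key(client_ip, prefix_len=32):
    """Network address that a client is grouped under for this prefix length."""
    network = ipaddress.ip_network(f"{client_ip}/{prefix_len}", strict=False)
    return network.network_address


proxy_a = "194.83.240.10"   # two proxies in the first subnet
proxy_b = "194.83.240.77"
proxy_c = "194.82.103.10"   # a proxy in the second subnet

same_subnet = persistence_key(proxy_a, 24) == persistence_key(proxy_b, 24)
across_subnets = persistence_key(proxy_a, 24) == persistence_key(proxy_c, 24)
wide_mask = persistence_key(proxy_a, 15) == persistence_key(proxy_c, 15)

print(same_subnet, across_subnets, wide_mask)
```

A /24 mask groups the first two proxies but not the third. A mask wide enough to cover both subnets (a /15 here) would also sweep in every other client in 194.82.0.0 to 194.83.255.255, so widening the mask trades away load balancing for everyone in that range.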

Billy Quinn bquinn (at) ifleet (dot) com 05 Jun 2001

I've come to conclusion that I need an expensive (higher layer) load balancer node , which load balances port 80 (using persistence because of sessions) to 3 realservers which each run an apache web server, and tomcat servlet engine. Each of the 3 servers is independent and no tomcat load balancing occurs. This has worked great for about a year, while we only had to support certain IP address ranges. Now, however, we have to support clients using AOL and their proxy servers, which completely messes up the session handling in tomcat. In other words, one client comes from multiple different IP addresses based on which proxy server it comes through. It seems the thing to do is to adjust the persistence granularity. However, if I adjust the netmask, all of our internal network traffic will go to one server, which kind of defeats the purpose. What I'm concluding is, that I'll need to change the network architecture (since we are all on one subnet), or buy a load balancer which will look at the actual data in the packets (layer 7?).

Joe There have been comments by people dealing with this problem (not many), but they seem to be still able to use LVS. We don't hear of anyone who is having lots of trouble with this, but it could be because no-one on this list is dealing with AOL as a large slice of their work. If 1/3 of your customers are from AOL you could sacrifice one server to them, but it's not ideal. If all your customers are from AOL, I'd say we can't help you at the moment.

My concern with that would be anyone else doing proxying ... now or in the future . I would not be opposed to routing all of the AOL customers to one server for now though . I guess we could have to deal with each case of proxying individually. I wonder how many other ISP's do proxying like that

How many different proxy IPs do AOL customers arrive on the internet from? How many will appear from multiple IP's in the same session and how big is the subnet they come from? (/24?)

Good question, I'm not sure about that one. The customer that reported the problem seemed to be coming from about 2-4 different IP addresses (for the same session ).

If AOL customers come from at least 3 of these subnets and you have 3 servers, then you can use LVS as a balancer. Peter Mueller pmueller (at) sidestep (dot) com

Over here we also need layer-7 'intelligent' balancing with our apache/jakarta setup. We utilize two tiers of 'load-balancing'. One is the initial LVS-DR round-robin type setup, while the second layer is our own creation, layer-7. Currently we round-robin the first connection to one server, then that server calls a routine that will ask the second-tier layer-7 java monitor boxes which box to send the connection to. (If for some reason the second layer is down, standard round-robin occurs). We're about 50% done with migration from cisco LD (yuck!) to LVS-DR. After the migration is fully complete the goal is to have the two layers interacting more efficiently and hopefully merged into one 'layer' eventually.. for example, if we tell our java-monitor second-tier controllers to shutdown a server, the first tier will then mark the node out of service automatically. PS - we found the added layer-7 intelligent balancing to be about 30-50% (?) added effectiveness to cisco round robin LD.. I think the analogy of a hub versus a switch works fairly well here..

Chris Egolf cegolf (at) refinedsolutions (dot) net

We're having the exact same problem with WebSphere cookie-based sessions. I was testing this earlier today and I think I've solved this particular problem by using fwmarks. Basically, I'm setting everything from our internal network with one FWMARK and everything else with another. Then, I setup the ipvsadm rules with the default client persistence for our internal network(/32) and a class C netmask granularity (/24) for everything from the outside to deal w/ the AOL proxy farms. Here's the iptables script I'm using to set the marks: Then, I have the following rules setup for ipvsadm: FWMARK #1 doesn't have a persistent mask specified, so each client on the 10.3.4.0/24 network is seen as an individual client. FWMARK #2 packets are seen as a class C client network to deal with the AOL proxy farm problem. (for more on persistent netmask see the section in fwmark on fwmark persistence granularity). Like I said, I just did this today, and based on my limited testing, I think it works. I'm thinking about maybe setting a whole bunch of rules to deal w/ each of the published AOL cache-proxy server networks (http://webmaster.info.aol.com/index.cfm?article=15&sitenum=2), but I think that would be too much of an administrative nightmare if they change it. The ktcpvs project implements some level of layer-7 switching by matching URL patterns, but we need the same type of cookie based persistence for our WebSphere realservers. Hopefully, it won't be too long before that gets added.
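The classification Chris describes (one mark for the internal network with per-client persistence, another mark for everything else with /24 granularity) can be modelled as below. This is a sketch of the logic only, not his iptables or ipvsadm rules; the function names and the outside addresses are invented for the example, while the 10.3.4.0/24 network and the /32 and /24 granularities come from the post.

```python
# Model of two fwmarks with different persistence granularity.
# Hypothetical helper names; outside addresses are documentation examples.
import ipaddress

INTERNAL_NET = ipaddress.ip_network("10.3.4.0/24")   # internal network from the post
MARK_PREFIX = {1: 32, 2: 24}                         # fwmark -> persistence prefix length


def fwmark_for(client_ip):
    """Mark 1 for the internal network, mark 2 for everything else."""
    return 1 if ipaddress.ip_address(client_ip) in INTERNAL_NET else 2


def persistence_group(client_ip):
    """(fwmark, network) pair that decides which clients share a realserver."""
    mark = fwmark_for(client_ip)
    network = ipaddress.ip_network(f"{client_ip}/{MARK_PREFIX[mark]}", strict=False)
    return mark, str(network)


internal_1 = persistence_group("10.3.4.5")
internal_2 = persistence_group("10.3.4.6")
outside_1 = persistence_group("198.51.100.10")
outside_2 = persistence_group("198.51.100.200")

print(internal_1, internal_2)   # separate groups: balanced individually
print(outside_1, outside_2)     # same group: sent to the same realserver
```

Internal clients stay individually balanced, while outside clients that share a /24 (such as a proxy farm) are treated as one client.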

Matthias Krauss MKrauss (at) hitchhiker (dot) com 2003-01-30

I turned on our LVS and it didn't take long for the phone to ring with calls from AOL people. They are switching between proxies, with the result that the targeted web server is different each time - we need it persistent.

Lars The persistency netmask feature might help you, in exchange for lower granularity of the load balancing (but it shouldn't matter). However, all AOL users will then likely hit the same webserver. It just goes on to show that IP addresses are unsuitable to identify a single user ;-) Real fix would be to use layer7 switching based on the URL or a cookie, even; alternatively, you could make your application less dependent on persistence, for example by storing your session data in a global cache/db, which would also make it easier for you to preserve sessions when a single webserver fails.

I now have the persistency netmask feature up and it seems to work fine. All the sender networks are forwarded to 1 RIP and the load share on all RIPs is nearly equal. The AOL users are still complaining and I've got the impression that AOL has different netmasks on their proxies. I found a list at http://webmaster.info.aol.com/proxyinfo.html and used this info for my fwmarks. Here's my iptables list and for ipvsadm apply . Of course there is no balancing anymore for the above nets, but fortunately we don't have many AOL customers. Alternatively, we found an optional way by using P3P HTTP headers, which AOL offers/describes under http://webmaster.info.aol.com/headers.html. P3P is W3C's Platform for Privacy Preferences.

Here's another set of postings from Dec 2003, this time using fwmark to aggregate all the traffic from AOL. As with above, all connections from the proxy servers (i.e. all of AOL) will all go to one realserver. Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 30 Dec 2003

For AOL clients, I need to use persistent connections. AOL make the IPs rotate very fast using several /8 or /16. Here's the list of AOL IPs for their clients Can I handle this with fwmark?

Matthias Krauss MKrauss (at) hitchhiker (dot) com 30 Dec 2003 Here's the list of AOL proxies (http://webmaster.info.aol.com/proxyinfo.html). All listed AOL traffic is now going to VIP of machine with RealIP

I think you could concatenate some /21. Regarding both whois and AOL technical contact, I deduced : Here is my entry for the fwmark rule in keepalived.conf. Note the string "fwmark 1" which replaces "VIP port" as used in the standard setup. (Presumably "fwmark 1" is just a string which is passed to ipvsadm.) So, in one file, you have both your load balancing and your HA settings. Of course, you need to configure iptables by yourself. Also, it doesn't do anything on the realservers (but on the realservers, I just need to configure the VIPs and noarpctl things, which is easy). 172.16.1.4:80 Route 1 0 0

For an example of using fwmark with keepalived, see ./doc/samples/keepalived.conf.fwmark in the source directory. Casey Zacek cz (at) neospire (dot) net 2005/04/08 Here's the current aol proxy list (http://webmaster.info.aol.com/proxyinfo.html), in a more raw format, but it changes occasionally (I just had to update my list when I went looking for them): Joe: most of these are not class C

key exchanges (SSL) Persistence is required for SSL services, as keys are cached. Francis Corouge wrote:

I made an LVS-DR LVS. All services work well, but with IE 4.1 on a secured connection, pages are received randomly. When you make several requests, sometimes the page is displayed, but sometimes a popup error message is displayed: "An error occurred with the secured connection." I did not test with other versions of IE, but Netscape works fine. It works when I connect directly to the realserver (realserver disconnected from the LVS, and the VIP on the realserver allowed to arp).

Julian Is the https service created persistent, i.e. using ipvsadm -p? I assume the problem is in the way SSL is working: cached keys, etc. Without persistence configured, the SSL connections break when they hit another realserver. It may be in the way the bugs are encoded. It also depends on how the SSL requests are performed (which we don't know).

Notes from Peter Kese, who implemented the first persistence (pcc) (this is probably from 1999).

The PCC scheduling algorithm might produce some imbalance of load on realservers. This happens because the number of connections established by clients might vary a lot. (There are some large companies, for example, that use only one IP address for accessing the internet. Or think about what happens when a search engine comes to scan the web site in order to index the pages.) On the other hand, the PCC scheduler resolves some problems with certain protocols (e.g. FTP), so I think it is good to have it.

and a comment about load balancing using pcc/ssl (the problem: once someone comes in from aol.com to one of the realservers, all subsequent connections from aol.com will also go to the same server) - Lars (who about this time implemented persistence granularity, so this might be from 1999 too).

Let's examine what happens now with SSL sessions coming in from a big proxy, like AOL. Since they are all from the same host, they get forwarded to the same server - *thud*. Now, SSL carries a "session id" which identifies all requests from a browser. This can be used to separate the multiple SSL sessions, even if coming in from one big proxy, and load balance them.

(unknown)

SSL connections will not come from the same port, since the clients open many of them at once, just like with normal http. So would we be able to differentiate all the people coming from aol by the port number?

No. A client may open multiple SSL connections at once, which obviously will not come from the same port - but I think they will come in with the same SSL id.

But like I said: really hard to get working, and even harder to get right ;-)

Wensong No, not really! As far as I know, the PCC (Persistent Client Connection) scheduling in the LVS patch for kernel 2.2 can solve the connection affinity problem in SSL. When an SSL connection is made (encrypted with the server's public key), on port 443 for secure web servers and port 465 for secure mail servers, a key (session id) must be generated and exchanged between the server and the client. Later connections from the same client are granted by the server within the life span of the SSL key. So PCC scheduling can make sure that once the SSL "session id" is exchanged between the server and the client, later connections from the same client will be directed to the same server within the life span of the SSL key. However, I haven't tested it myself. I will download ApacheSSL and test it sometime. Anyone who has tested it or is going to test it, please let me know the result, whether it is good or bad. :-) (a bit later)

I tested LVS with servers running Apache-SSL. LVS uses the VS patch for kernel 2.2.9, and uses the PCC scheduling. It worked without any problem.

SSL is a little bit different. In use, the client will send a connection request to the server. The server will return a signed digital certificate. The client then authenticates the certificate using the digital signature and the public key of the CA. If the certificate is not authentic, the connection is dropped. If it is authentic, then the client sends a session key (such as a), encrypting the data using the server's public key. This ensures only the server can read it, since decrypting requires knowing the server's private key. The server sends its session key (such as b), encrypted with its private key; the client decrypts it with the server's public key and gets b. Since both the client and the server have a and b, they can generate the same session key based on a and b. Once they have the session key, they can use it to encrypt and decrypt data in communication. Since the data sent between the client and server is encrypted, it can't be read by anyone else. Since key exchange and generation is very time-consuming, for performance reasons, once the SSL session key is exchanged and generated in a TCP connection, other TCP connections between the client and the server can also use this session key within the life span of the key. So we have to make sure that connections from the same client are sent to the same server within the life span of the key. That's why the PCC scheduling is used here.

About longer timeouts felix k sheng felix (at) deasil (dot) com and Ted Pavlic

The PCC feature: can I set the permanent connection to something other than the default value? (I need to keep the client on the same server for 30 minutes at most.)

If people connecting to your application will contact your web server at least once every five minutes, setting that value to five minutes is fine. If you expect people to be idle for up to thirty minutes before contacting the server again, then feel free to change it to thirty minutes. Basically, remember that the clock is reset every time they contact the server again. Persistence lasts for as long as it's needed. It only dies after the number of seconds in that value passes without a connection from that address. So if you really want to change it to thirty minutes, check out ip_vs_pcc.h -- there should be a constant that defines how many seconds to keep the entry in the table. (I don't have access to a machine with IPVS on it at this location for me to give you anything more precise.)

I think this 30 minute idea is a web-specific timeout period. That is, default timeouts for cookies are 30 minutes, so many web sites use that value as the length of a given web "session". So if a user hits your site, stops and does nothing for 29 minutes, and then hits your site again, most places will consider that the same session - the same session cookies will still be in place. So it would probably be nice to have them going to the same server.
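The point that "the clock is reset every time they contact the server again" can be shown with a toy timer. This is a sketch only, not the ip_vs implementation; the class name `PersistenceTimer` and the times used are invented for the example.

```python
class PersistenceTimer:
    """Toy model of a persistence timeout; not the ip_vs implementation."""

    def __init__(self, timeout_s: int, first_contact_s: int = 0):
        self.timeout_s = timeout_s
        self.expires_at = first_contact_s + timeout_s

    def contact(self, now_s: int) -> bool:
        """Return True if the client is still persistent at now_s.

        A contact that arrives in time restarts the clock.
        """
        if now_s >= self.expires_at:
            return False
        self.expires_at = now_s + self.timeout_s
        return True


five_min = PersistenceTimer(timeout_s=300)
# A client that comes back every 4 minutes never expires...
steady = [five_min.contact(t) for t in (240, 480, 720)]
# ...but a single pause of 29 minutes outlasts a 5 minute timeout.
after_pause = five_min.contact(720 + 29 * 60)

# With a 30 minute timeout the same 29 minute pause is survived.
thirty_min = PersistenceTimer(timeout_s=1800)
survives_pause = thirty_min.contact(29 * 60)

print(steady, after_pause, survives_pause)
```

So the timeout only needs to cover the longest idle gap you want to bridge, not the total length of a visit.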

passive ftp and persistence Wensong Since there are many messages about the passive ftp problem and the sticky connection problem, I'd better send a separate message to make it clear. In LinuxDirector (by default), we have assumed that each network connection is independent of every other connection, so that each connection can be assigned to a server independently of any past, present or future assignments. However, there are times when two connections from the same client must be assigned to the same server, either for functional or for performance reasons.

FTP is an example of a functional requirement for connection affinity. The client establishes two connections to the server: one is a control connection (port 21) to exchange command information, the other is a data connection (usually port 20) that transfers bulk data. For active FTP, the client informs the server of the port that it listens on, and the data connection is initiated by the server from the server's port 20 to the client's port. LinuxDirector could examine the packets coming from clients for the port that the client listens on, and create an entry in the hash table for the coming data connection. But for passive FTP, the server tells the client the port that it listens on, and the client initiates the data connection by connecting to that port. For LVS-Tunneling and LVS-DRouting, LinuxDirector only sees the client-to-server half of the connection, so it is impossible for LinuxDirector to get the port from the packet that goes to the client directly.

SSL (Secure Socket Layer) is an example of a protocol that has connection affinity between a given client and a particular server. When an SSL connection is made, on port 443 for secure web servers and port 465 for secure mail servers, a key for the connection must be chosen and exchanged. Later connections from the same client are granted by the server within the life span of the SSL key.

Our current solution to client affinity is to add persistent client connection scheduling in LinuxDirector. In PCC scheduling, when a client first accesses the service, LinuxDirector will create a connection template between the given client and the selected server, then create an entry for the connection in the hash table. The template expires after a configurable time, and the template won't expire while it still has connections. Connections for any port from the client will be sent to the same server until the template expires. Although PCC scheduling may cause slight load imbalance among servers, it is a good solution to connection affinity. The configuration example of PCC scheduling is as follows: :0 -s pcc director:/etc/lvs# ipvsadm -a -t :0 -R

BTW, PCC should not be considered a scheduling algorithm in concept. It should be a feature of the virtual service port: the port is persistent or not. I will write some code later to let the user specify whether a port is persistent or not.

The Persistence Template (about port 0) Information about a persistent connection is stored in the Persistence Template. The essential difference between the Persistence Template and a hash table entry is that the source port from the client is marked as 0 (i.e. CIP:0). Thanks to Karl Kopper for off-line discussions which got me started here. Horms 06 May 2004 A persistence template is just like a connection entry. It uses the same data structure. It is stored in the same hash table. The only difference is that the source port is set to 0 (i.e. CIP:0) so that the data can be identified as a persistence template. This means that it will never match a hash-table lookup for a connection entry. And a connection entry will never match a lookup for a persistence template, which is made in the scheduling code. The purpose of a persistence template is, in a nutshell, to effect persistence. When a connection is started for a persistent virtual service, the persistence template is looked up. If it exists then it is used - that is, the connection will be forwarded to the same realserver as the previous corresponding connection. Otherwise the connection is scheduled, just like a connection for a non-persistent virtual service, and the persistence template is created. Like connection entries, persistence templates have timeouts. Actually, again, this is handled by the same code. The only difference is that for persistence templates, the timeout is set by the persistence timeout configured using ipvsadm, whereas for connection entries the timeout depends on the connection's state.

Let's say I make VIP:https persistent. There will be an entry in the hash table for VIP:https and another entry for VIP:0 in the persistence template.

No. When a connection from an end-user with CIP1 comes in, a persistence template will be created for CIP1:0 (a port that doesn't exist). A connection entry will also be made for CIP1:ephemeral_source_port (i.e. the real port >1023 that the client is coming from).

How does ipvsadm know to make only VIP:https persistent?

I am not sure that I understand what you are getting at here. When you configure a virtual service using ipvsadm you can mark it as persistent. This sets a flag in the kernel for the virtual service that is checked by the LVS scheduler, so it knows whether to treat the connection as persistent or not.

what happens if you use ipvsadm to enter persistence on all ports (ie VIP:0). Now you have a connection table entry with VIP:0 and a persistence template entry with CIP:0 and VIP:0?

I could check the code, but I don't really think there is a problem at all. The virtual service entry may have VIP:0, but the connection entry (and persistence template) that is created will have VIP:XXX, where XXX is the destination port of the connection. All the connection entries and persistence templates are stored in a hash table. To retrieve an entry from the hash table, first the hash key is generated (how is not particularly relevant here) and then that bucket is searched for the matching entry. A match checks various values including the source port. As the following property is true, a persistence template can never be confused with a connection entry:

    Persistence Template:  Source Port = 0
    Connection Entry:      0 < Source Port <= 2^16-1

In other words, if you are looking for a persistence template then your search will always be for something with Source Port = 0. But if you are looking for a connection entry you will always be looking for something with Source Port != 0. Thus there is no ambiguity, despite both types of entries using the same data structure and being stored in the same hash table.

    is the virtual service persistent?
      No  -> call Scheduler to allocate a RealServer for the Connection.
             use result from Scheduler to create Connection Entry
      Yes -> does Persistence Template exist?
               No  -> call Scheduler to allocate a RealServer for the Connection.
                      use result from Scheduler to create Persistence Template.
                      now Persistence Template exists.
               Yes -> nothing special to do.
             use Persistence Template to create Connection Entry
    forward packet using Connection Entry

Just think of persistence templates as special connection entries. These special entries affect which realserver subsequent connections from a given end user are allocated to, rather than affecting which server packets for a current connection are sent to. Timeouts just handle how long a connection entry or persistence template is valid for. Otherwise they would live in the kernel forever.

Guy Waugh, Nov 18, 2003

In my LVS-NAT system (IPVS-1.0.9 + ldirectord), I have an Oracle server on the inside (web-db1) that primarily services the two realservers within the LVS. However, I also have a webserver (www1) on the VIP side of the network whose apache processes make Oracle connections through to the Oracle server on the inside of the LVS. To allow this, I have the Oracle listener service (port 1521) as an LVS service, with persistence set to 25200 seconds (7 hours). I'm noticing a couple of different types of connections from www1 to the Oracle listener port on the VIP: one with a source port of 0, and one with a random source port, like so (the VIP is 'learn'): Connections with a source port of 0 take on the persistence of 25200 seconds (as I have specified in ldirectord.cf), but connections out of a non-zero source port take on a persistence of 15 minutes (900 seconds). I see from http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.persistent_connection.html that: For LVS persistence, the client is recognised by its IP (CIP) or in recent versions of ip_vs, by CIP:dst_port (i.e. by the CIP and the port being forwarded by the LVS). If only the CIP is used to schedule persistence, then the entries in the output of ipvsadm will be of the form VIP:0 (i.e. with port=0), otherwise the output of ipvsadm will be of the form VIP:port. Can anyone tell me why I get both types of connections (source port 0 and source port non-zero)? Perhaps the 'source port 0' connection is some sort of 'master' connection, and the 'source port non-zero' connections are some sort of 'slave' connections? What I'm really wondering is if it is possible to effectively make the persistence for this connection infinite? Perhaps I shouldn't use LVS to do this, but should use iptables instead...? 
The problem underlying all this is that some apache processes on www1 seem to lose their Oracle connection over time, so any client hitting www1 who happens to get serviced by an apache process that has lost its Oracle connection gets Oracle connection errors all over the page. I see from http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.services.single-port.html#tcpip_idle_timeout that one can set TCP idle timeouts for connections with ipvsadm - perhaps this is what I should be doing?

Horms 18 Nov 2003 I think that you are confusing the concepts of persistence and connection timeouts. Persistence affects which realserver LVS will choose for new connections. If persistence is in effect and the persistence timeout has not expired, then the same realserver will be used for subsequent connections from the same CIP. But in your case you only have one realserver, so persistence is a moot point. You are correct in asserting that the CIP:0 entry you see is a master entry. Actually, in the code it is referred to as a template. When a new connection comes in, LVS looks for VIP:Vport+CIP:0. If it is present then it will use the attached RIP:Rport. If not, it just chooses one of the available realservers as per the scheduling algorithm that is in effect. But again this is a moot point, as you only have one realserver. The CIP:0 entry does not actually represent a connection at all, just a template for creating new connections. Its timeout should be set to the persistence timeout each time the template is used to create a new connection. The other entries are the connections themselves. Their timeouts are set by the various timeouts that can be manipulated through /proc/sys/net/ipv4/vs/timeout_*. This is where the value of 900 seconds comes from. __It has nothing to do with persistence.__ As per the HOWTO entry you listed above, some of these values can also be manipulated using ipvsadm --set.
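The lookup Horms describes (look for the CIP:0 template first, fall back to the scheduler, then create the real connection entry) can be sketched as a toy model. This is an illustration only: the class `Director`, the round-robin stand-in scheduler and the names CIP1, VIP, RIP1 etc. are assumptions made for the example, timeouts are left out, and none of it is taken from the ip_vs source.

```python
import itertools


class Director:
    """Toy model: templates and connections share one table, as in the text."""

    def __init__(self, realservers, persistent):
        self.persistent = persistent
        self._scheduler = itertools.cycle(realservers)  # stand-in: round robin
        self.table = {}  # key: (client IP, client port, VIP, virtual port)

    def new_connection(self, cip, cport, vip, vport):
        # A real connection never has source port 0, so it cannot collide
        # with a template key.
        assert 0 < cport < 2 ** 16
        if self.persistent:
            template_key = (cip, 0, vip, vport)  # source port 0 marks a template
            if template_key not in self.table:
                self.table[template_key] = next(self._scheduler)
            realserver = self.table[template_key]
        else:
            realserver = next(self._scheduler)
        self.table[(cip, cport, vip, vport)] = realserver  # the connection entry
        return realserver


director = Director(["RIP1", "RIP2"], persistent=True)
first = director.new_connection("CIP1", 40001, "VIP", 443)
second = director.new_connection("CIP1", 40002, "VIP", 443)  # same client, new port
other = director.new_connection("CIP2", 40001, "VIP", 443)   # different client

templates = [key for key in director.table if key[1] == 0]
print(first, second, other, len(templates), len(director.table))
```

The second connection from CIP1 follows the template to the same realserver without consulting the scheduler, while CIP2 is scheduled afresh and gets its own template.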

persistent clients behind a proxy or nat box Jan Bruvoll 15 Aug 2005

I need to take a realserver out from one of my VS configs temporarily, and I have tried doing this by setting the weight of this particular realserver to 0. However, nothing is happening - the server is still receiving connections as before I made the adjustment. However, since I issued the command the number of active connections has actually increased, much in line with the distribution across the other servers in the same group, leading me to think that the config hasn't kicked in at all (the number of connections still seems very much like a weight of 5 is still active...) I've tried setting the persistence for this particular group to 5 seconds, without any noticeable effect.

Joe Once they're connected they're connected, it doesn't matter what the timeout is. Still the number of connections shouldn't increase after you've changed the weight to 0.

Hm ok - my reasoning for doing that is that these clients are relatively long-lived SSL-based connection from an in-house application to our server park - and that by setting the persistence to 5 seconds, only connections that "come back" within 5 seconds of disconnecting from this particular server (for whatever reason - Apache timeout, client disconnection, network problems, etc.) would be directed to the same server - if not, they would then hopefully be directed to one of the servers with weight>0

Horms This is a fairly simple problem that is unfortunately difficult to explain. Let me try: When you set a real server to be quiescent (weight=0), this means that no new connections will be allocated to that real server using the scheduler. However, if you have persistence in effect (which you do), and a new connection is received from an end-user that recently made a connection, then that connection will be allocated to the same real server as the previous connection. The trick is, this process by-passes the scheduler, and thus by-passes quiescence. So, for a persistent service a new connection is processed a bit like this: Obviously this is a bit of a problem, for the reason you describe in your email.

In implementation terms the problem is that when a connection for a persistent service is scheduled, a persistence template is created, with a timeout of the persistence timeout. This template is then used to select the real-server for subsequent connections from the same end-user. It stays in effect until its timeout expires. And its timeout is renewed every time a packet is received for an associated connection. Which means in the case of quiescence, as long as end-users that have active persistence templates keep connecting or sending packets within the persistence timeout, the real-server will keep having connections.

The solution to this is quite simple. The patch at the URL below, which has been included in recent kernel versions, adds expire_quiescent_template to /proc (see ). By default it is set to 0, which gives the behaviour described above, the historical behaviour of LVS (which I might add can be desirable in some situations). However, if you set it to 1, then connection templates associated with a quiesced real-server are expired at lookup time. Which, in a nutshell, means that the "if" condition above will always fall through to the "else" clause, and thus quiescence is not by-passed.

To effect this change, as root write 1 to /proc/sys/net/ipv4/vs/expire_quiescent_template. The change affects new connections immediately. Or, on systems that have sysctl, add the corresponding setting to /etc/sysctl.conf and reload it. This will also take effect immediately, and has the advantage that the change will be persistent across reboots. The patch is at http://archive.linuxvirtualserver.org/html/lvs-users/2004-02/msg00224.html. Jan

Let me just check if I understand this correctly (using our current set-up): Our original persistence was set to 360 seconds, intended to be longer than the expected recurring request frequency of our application, which checks with our server cluster every 300 seconds ("ish"). If I keep the original persistence, any client already known to the cluster requesting data from the cluster again -before- the 360 seconds expire of that particular client "id", will trigger a persistence counter reset for this client

Horms Yes. The request could be opening a fresh connection. Or it could be the end-user sending (for any LVS forwarding mechanism) or receiving (in the case of LVS-NAT) data for an existing connection. You can see the persistence templates, and the progress of their timeouts, in amongst the other connection entries if you run ipvsadm -Lcn. The persistence entries are the ones with a client port of 0. Jan

However, if the weight for a particular real-server is set to 0, no -new- clients should be allocated to this realserver

Horms Yes, where new clients are ones without a persistence template entry. Jan

and any clients not "coming back" within the 360 seconds should be removed from the persistence map, and any new requests from same clients after being removed should be allocated to one of the other realservers

Horms Yes. Though "going away" basically means no packets for existing connections, and no attempt to open a new connection. Jan

since writing, I tried resetting the persistence manually to 5 seconds, in order to try and flush the persistence "map" quicker. This hasn't had any perceivable effect, as the number of connections to this server as I am writing now, still reflects the original weight (some 18 hours after setting the weight to 0).

Horms OK, obviously waiting 18 hours for connections to flush is impractical. What you are seeing is probably the result of either a bug in LVS, or very enthusiastic end-users (I know they are just programmes, but hey) that send packets to or receive packets from the virtual service (and thus the real servers) at least once every 5 seconds. Some examination of what is happening should shed some light on this: watch -n 1 ipvsadm -Lcn Jan

how can it be that the number of active connections actually increases on the realserver whose weight is 0?

Horms This is quite possible if a known end-user (i.e. one that has a persistence template because it has sent or received data within the last 5 seconds, your timeout) opens a second connection, and no connections are closed. I'm not sure if this is actually what is happening; again, ipvsadm -Lcn may help to show what is going on. What you are seeing is a bit strange, and hopefully you can diagnose exactly what is going on. But please consider setting /proc/sys/net/ipv4/vs/expire_quiescent_template to 1, as it should give behaviour that better suits your needs. Jan

Dawning suspicion here - if a connection some time ago triggered the creation of a persistence template with the 360 second timeout, that template would actually stick around for as long as this client comes back to access the cluster - i.e. if I change the persistence of the virtual server to, say, 5 seconds, that would only apply to -new- connections from clients previously "unknown" to the cluster, and the already existing template could only expire if the client goes away for more than 360 seconds, the original timeout? Actually, I've given this some thought and I think I understand why this number can increase - it is IP address specific only, so if new clients appear from behind an IP that is currently "active", i.e. has a persistence template allocated to it, these would also be allocated to the already quiesced server. So, although I have assigned a weight of 0, the persistence templates wouldn't expire until all traffic subsides and goes away for longer than $persistence seconds after the last connection closed. Cumbersome, but at least I can understand what's going on.

Joe a bunch of different clients are coming out of a NAT box? like the AOL problem?

Well - similar, but not as acute a scale. In this case we're talking about a client talking HTTPS to our servers, and several users behind the same broadband connection (typically at home), or several users behind the same leased line at work, could be connecting - and thus get caught in the same "bucket".

Horms Good point, yes I am pretty sure that is how it works.
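The behaviour worked out in this thread (a quiesced realserver keeps receiving connections through existing templates, including from new users behind an already known IP, unless templates for quiesced servers are expired at lookup) can be sketched as follows. This is a toy model under stated assumptions: the class `Service`, its "highest weight wins" stand-in scheduler and the server names are invented, timeouts are omitted, and the boolean only imitates what the thread says about expire_quiescent_template.

```python
class Service:
    """Toy persistent virtual service; templates are keyed on client IP only."""

    def __init__(self, weights, expire_quiescent_template=False):
        self.weights = dict(weights)
        self.expire_quiescent_template = expire_quiescent_template
        self.templates = {}  # client IP -> realserver

    def _schedule(self):
        # Stand-in scheduler: only servers with weight > 0 are candidates.
        candidates = [rs for rs, weight in self.weights.items() if weight > 0]
        return max(candidates, key=lambda rs: self.weights[rs])

    def connect(self, client_ip):
        realserver = self.templates.get(client_ip)
        if (realserver is not None and self.expire_quiescent_template
                and self.weights[realserver] == 0):
            # Template points at a quiesced server: expire it at lookup time.
            del self.templates[client_ip]
            realserver = None
        if realserver is None:
            realserver = self._schedule()
            self.templates[client_ip] = realserver
        return realserver


results = {}
for flag in (False, True):
    service = Service({"rs1": 5, "rs2": 1}, expire_quiescent_template=flag)
    service.connect("203.0.113.9")   # first user behind a NAT box lands on rs1
    service.weights["rs1"] = 0       # administrator quiesces rs1
    # A second user behind the same NAT box has the same client IP.
    results[flag] = service.connect("203.0.113.9")

print(results)
```

With the flag off, the template bypasses the scheduler and the quiesced server keeps gaining connections; with it on, the next lookup falls through to the scheduler.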

While I am at it; this seems a little odd, given that I have never set anything but persistances of either 360 seconds or 5 seconds: How should i interpret that ~10 minutes expire timeout? (I have "worse" ones too, all the way up to close to 20 minutes)

Horms I'm not sure, but it's probably not a problem, as once the connection changes out of the ESTABLISHED state, a fresh timeout will be assigned.

Rogue clients hidden by persistence Leon Keijser errtu (at) gmx (dot) net 14 Dec 2005

This morning when I did an 'ipvsadm -ln' I saw something weird:

    RemoteAddress:Port       Forward Weight ActiveConn InActConn
    TCP 192.168.50.10:3389 wlc persistent 43200
    -> 192.168.50.12:3389    Route   1      912        0
    -> 192.168.50.15:3389    Route   1      22         0
    -> 192.168.50.13:3389    Route   1      22         0
    -> 192.168.50.16:3389    Route   1      21         0
    -> 192.168.50.11:3389    Route   1      20         1
    -> 192.168.50.18:3389    Route   1      21         0
    -> 192.168.50.17:3389    Route   1      624        0
    TCP 192.168.50.120:1494 wlc persistent 43200
    -> 192.168.50.121:1494   Route   1      2          0
    -> 192.168.50.122:1494   Route   1      3          0
    TCP 192.168.51.202:22 wlc
    -> 127.0.0.1:22          Local   1      0          0

912 and 624 connections? When I check on the realserver, everything seems normal. I have 8 clients on one server, 30 on another, but maybe this is because LVS is thinking it already has 900+ connections, and shouldn't route anyone to there anymore. The logfiles don't show anything abnormal either.

Graeme Fowler graeme (at) graemef (dot) net You're using persistence, which is probably a clue... What does ipvsadm -Lnc tell you? That'll list the connections out so you should be able to see which clients are causing you the problem. You can grep the output for "ESTABLISHED" and/or "NONE" to see the active and persistent entries respectively.
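Graeme's suggestion can be automated by counting entries per client IP. The sketch below parses text laid out like the connection listing from ipvsadm -Lnc (columns: pro, expire, state, source, virtual, destination); the sample lines are made up for illustration, and the function name `summarize` is hypothetical.

```python
from collections import Counter

# Made-up sample in the layout of the ipvsadm connection listing.
SAMPLE = """\
IPVS connection entries
pro expire state       source              virtual            destination
TCP 14:59  ESTABLISHED 192.168.50.201:1034 192.168.50.10:3389 192.168.50.12:3389
TCP 14:58  ESTABLISHED 192.168.50.201:1035 192.168.50.10:3389 192.168.50.12:3389
TCP 719:40 NONE        192.168.50.201:0    192.168.50.10:3389 192.168.50.12:3389
TCP 14:20  ESTABLISHED 192.168.50.77:2210  192.168.50.10:3389 192.168.50.15:3389
TCP 700:02 NONE        192.168.50.77:0     192.168.50.10:3389 192.168.50.15:3389
"""


def summarize(text):
    """Count ESTABLISHED connections and persistence templates per client IP."""
    active, templates = Counter(), Counter()
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 6 or fields[0] not in ("TCP", "UDP"):
            continue  # skip the two header lines and anything unexpected
        state, source = fields[2], fields[3]
        client_ip, _, client_port = source.rpartition(":")
        if client_port == "0":
            templates[client_ip] += 1   # persistence template (client port 0)
        elif state == "ESTABLISHED":
            active[client_ip] += 1
    return active, templates


active, templates = summarize(SAMPLE)
print(active.most_common(1), dict(templates))
```

On a live director the text would come from the command output instead of SAMPLE; a client IP with hundreds of ESTABLISHED entries, as in this thread, stands out at the top of `most_common()`.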

Yep. I saw 2 IP's that occur several times (okay, several hundred times)

bear in mind that they may not be *your* clients. This could in theory at least be caused by something rogue.

Unfortunately they are my clients. And they are the linux-based thin clients I deployed as a side project. Turned out that they were hardcoded to use one of our domain controllers (which died yesterday night), and kept trying to connect to the cluster.

I'd guess you have a machine (or more than one) in your client base which is broken in some way. Which way I'll leave to you to find, but as these are RDP connections and the most likely clients are Windows machines...

Found them. Fixed them.

Long (1 day) persistence to windows terminal servers

Here I summarise a lengthy exchange with Joseph T. Duncan duncan (at) engr (dot) orst (dot) edu starting at http://marc.theaimsgroup.com/?l=linux-virtual-server&m=115706140606154&w=2 on 30 Aug 2006. Joseph monitors his LVS. Joseph's realservers are windows boxes at a university which serve any windows app at all (e.g. statistics, word processing). Undergraduate students are mostly interactive, doing homework, while faculty and grad students start lengthy jobs and return later.

The user can end a session in three ways:

- logout of the realserver. The realserver then closes all of their applications, stops their desktop session from running and correctly does a FIN/ACK packet exchange, ending the session.

- 'disconnect' (a windows option) from the realserver. The realserver puts their session in a disconnected state. Their desktop and applications are still running. A FIN/ACK packet exchange happens, ending that tcp connection. After 10 minutes (adjustable) the realserver auto-logs off any 'disconnected' sessions, shuts the applications down and kills the desktop. This is called a clean disconnect.
    - If the client reconnects in this 10 minute window, the director should point them back at their disconnected session. The realserver will then give them their running desktop and applications back.
    - If the client reconnects after the 10 minute window, the director should balance them as a new session, since they no longer have a desktop+applications running on any realserver.

- the client shuts down and/or closes inappropriately. The realserver sits there and leaves the desktop and applications active. Now here is where weirdness happens.
    - In testing, if the client computer is a linux box and is still accessible, the realserver will close out the connection (broken connection tcp handshake attempt? handled correctly?).
    - In testing, if the client computer is any of the windows xp workstations my department maintains and is still accessible or becomes accessible later (reboot, powered down and then back on later, etc), the realserver will close out the connection (broken connection tcp handshake attempt? handled correctly?).
    - If the client computer is one of the weird 1/3 of my customers' home/office/whatever computers (ones with who knows what on them, or how they are configured), the desktop+apps just sit there running on the realserver, with no closing of the connection. There is an idle session autologout after x time setting on the realservers (here 1 day), after which the realserver kills off the active session (but idle connection) desktop+applications.

These clients are the troublesome ones, because if they login a few times, they will wind up with an active but idle desktop+application session on each realserver. Two bad things happen.

- Applications can be running full bore (think long batch type jobs) and use 100% of a cpu. User processes are limited to a single cpu on the realservers, with 4 cpus available per realserver. This isn't bad, but come finals week, each user could be eating up a cpu.
- Last write wins. If a client had something open in the orphaned session on a realserver, then gets a new desktop session on a different realserver, makes changes to a document and logs out correctly, and then the orphaned session dies/closes, they might lose work, or have their windows profile corrupted.

The terminal server session/application does not know whether a disconnect is clean or not. All it does is start recording idle time from the last keyboard/mouse input received. Applications running inside the terminal server session (e.g. m$ word) usually have no idea whether they're running on a desktop or a terminal server. I could make the active but idle timeout on the realservers much lower, but that would lead to unhappy professors that stay logged in overnight.

It's not easy to look for idle sessions (to kill them). I don't know how to test for an idle connection on the director. There is a windows management tool that reports idle time, but I am not aware of a mib/snmp way to export that information. There are some wireless labs behind a nat-proxy and they all come out on the same 10.x.x.x IP, so you can't test for multiple connections from the same IP.

Microsoft has a built in "network load balancing" based purely on network traffic between boxes. To make it more robust, they have something called "session directory" that upon authentication/identification will redirect an incoming connection to the appropriate box in the "network load balancing cluster". This style of load balancing is fine for certain applications, and is closer to an L7 approach, as your session location is determined by your login/id instead of your IP. This method relies on lots of bandwidth. Each box participating in a "network load balancing" setup receives a copy of everything and has to pick out what it's going to process (all machines participating get a fake slaved mac address and ip shared between them; is the shared mac address m$'s solution to the arp problem? I dunno).

However for terminal servers it is no good. I need to account for %free cpu and %free memory as important metrics. For example, with 1 realserver at 400% load (4xcpu@100%) and 30kbs network traffic with 4 users, and another realserver with 80% load and 900kbs traffic with 10 users, I would want new users to land on the lower cpu load server, as I am not really bandwidth bound till I get above 1gb/sec. The windows way does not take any cpu/memory metrics into account.

When there is no LVS (i.e. a single terminal server), if a user had a dirty exit, then either they were killed off at 1 day of idle time or, if they reconnected, they got their session back. Thus I need persistence of a little more than 1 day with LVS, to make sure they reconnect with their last realserver.
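The persistence timeout given to ipvsadm with -p is in seconds, so "a little more than 1 day" can be worked out as below. This is only a sketch: the VIP (192.168.1.100), the RDP port (3389), the wlc scheduler and the one hour margin are assumptions for illustration, not values from Joseph's setup. The script prints the command rather than running it.

```shell
#!/bin/sh
# Sketch: build an ipvsadm command whose persistence timeout slightly exceeds
# the realserver idle-logout time. Nothing is executed; the command is printed.

# persistence_secs IDLE_DAYS MARGIN_SECS -> timeout in seconds
persistence_secs() {
    echo $(( $1 * 86400 + $2 ))
}

# persistent_service VIP PORT IDLE_DAYS MARGIN_SECS -> ipvsadm command line
# (scheduler wlc is an assumption made for this example)
persistent_service() {
    echo "ipvsadm -A -t $1:$2 -s wlc -p $(persistence_secs "$3" "$4")"
}

# Hypothetical values: 1 day idle logout on the realservers plus 1 hour margin.
persistent_service 192.168.1.100 3389 1 3600
```

With these values the timeout comes to 86400 + 3600 = 90000 seconds. The margin only needs to cover the gap between the realserver's idle timer and the director's persistence timer.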

LVS: Running a firewall on the director: Interaction between LVS and netfilter (iptables).

May 2004: This chapter has been rewritten. Before the arrival of the Antefacto patches, it was not possible to run arbitrary iptables rules for ip_vs controlled packets on a director. Hence you couldn't run a firewall on the director and we told people to put their firewall on a separate box. Julian then took over writing the code and now it is possible to run a firewall on the director. The code is now called ipvs_nfct and is still beta, so keep us informed of how it works. In previous writeups, I misunderstood how the code worked, and made some incorrect statements. Hopefully this rewrite fixes the misinformation I propagated.

For one of many introductions to netfilter see The netfilter framework in Linux 2.4 (http://gnumonks.org/~laforge/presentations/netfilter-lk2000/netfilter.ps.gz).

According to Ratz (18 Apr 2006), NFCT causes a 20% throughput drop on a GbE inbound service.

Start with no filter rules

Although this chapter is about applying iptables rules to directors, be aware that you don't need filter rules to set up an LVS. Misconfiguring the filter rules may cause strange effects. Make sure for testing that you can turn your filter rules on and off. Here's a cautionary tale.

Sebastiaan Tesink maillist-lvs (at) virtualconcepts (dot) nl 14 Jul 2006

On one of our clusters we have problems with ipvs at the moment. Our cluster is built with 2 front-end failover ipvs-nodes (managed with ldirectord), with 3 Apache back-end nodes, handling both http and https. So all the traffic on a virtual ip on port 80 or 443 of the front-end servers is redirected to the backend webservers.

Two weeks ago, we were running a 2.6.8-2-686-smp Debian stable kernel, containing ipvs 1.2.0. We experienced weekly (6 to 8 days) server crashes, which caused the machines to hang completely without any log-information whatsoever. These crashes seemed to be related to IPVS, since all our servers have the exact same configuration, except for the additional ipvs-modules on the front-end servers. Additionally, the same Dell SC1425 servers are used for all servers. For this reason we upgraded our kernel to 2.6.16-2-686-smp (containing ipvs 1.2.1) on Debian stable, which we installed from backports (http://www.backports.org). There aren't any crashes on these machines anymore. However, there are two strange things we noticed since this upgrade.

First of all, the number of active connections has increased dramatically, from 1,200 with a 2.6.8-2-686-smp kernel, to well over 30,000 with the new kernel. We are handling the same amount of traffic.

      -> RemoteAddress:Port  Forward Weight ActiveConn InActConn
    TCP  XXX.net wlc persistent 120
      -> apache1:https       Route   10     2          0
      -> apache2:https       Route   10     25         0
      -> apache3:https       Route   10     14         0
    TCP  XXX.net wlc persistent 120
      -> apache1:www         Route   10     10928      13
      -> apache2:www         Route   10     11433      6
      -> apache3:www         Route   10     11764      10

We are using the following IPVS modules: ip_vs, ip_vs_rr, ip_vs_wlc.

Secondly, Internet Explorer users have been experiencing problems exactly since the upgrade to the new ipvs version. With Internet Explorer, an enormous number of tcp connections is opened when visiting a website. Users are experiencing high loads on their local machines, and crashing Internet Explorers. With any version of FireFox this is working fine by the way. Nevertheless, this started exactly with our IPVS upgrade.

Note: IE/IIS breaks tcpip rules to make loading fast (see What makes IE so fast http://grotto11.com/blog/slash.html?+1039831658).

Sebastiaan Tesink sebas (at) virtualconcepts (dot) nl 19 Jun 2007

The solution to this problem was relatively easy, but we only discovered it recently. Basically, this problem was caused by the firewall, which contained "state checks". We used to have an iptables rule that relied on these checks. While using IP_VS, this caused connections to be denied by the firewall. We therefore changed the rule, which leads to a more balanced view of the number of active versus inactive connections in the load balancer. Hopefully this is some useful information to get your documentation even better.
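One way of being able to turn the filter rules off and on again for testing is to save them, open up the filter table, and restore them afterwards. The following is a minimal sketch, not a recommendation from the posters above. The file location is an assumption, and the script only prints the commands (pipe its output to sh as root to apply them). Only the filter table is flushed. Rules in the nat and mangle tables are left in place.

```shell
#!/bin/sh
# Sketch: print the commands to save the current rules, open the filter
# table completely for an LVS test, and restore the saved rules afterwards.

RULES_FILE=${RULES_FILE:-/tmp/iptables.saved}   # hypothetical location

filter_off() {
    echo "iptables-save > $RULES_FILE"
    # open the policies first so a DROP policy cannot lock you out
    for chain in INPUT FORWARD OUTPUT; do
        echo "iptables -P $chain ACCEPT"
    done
    # flush the rules of the filter table (the default table)
    echo "iptables -F"
}

filter_on() {
    echo "iptables-restore < $RULES_FILE"
}

filter_off
filter_on
```

If the LVS misbehaves with the rules on and works with them off, the problem is in the rules rather than in ip_vs.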

Introduction

For 2.4.x kernels (and beyond), LVS was rewritten as a netfilter module, rather than as a piece of stand-alone kernel code. Despite initial expectations by Rusty Russell that LVS could be written as a loadable netfilter module, it turned out not to be possible to write LVS completely within the netfilter framework. As well, there was a minor performance penalty (presumably in latency) for LVS as a netfilter module compared to the original version. This penalty has mostly gone with rewrites of the code.

The problem was in connection tracking, which among other things allows the machine to determine if a packet belongs to a RELATED or ESTABLISHED connection. As well, connection tracking helps with multiport protocols like FTP-NAT. The ip_vs controlled packets take a different path through the routing code than do non-LVS packets. The netfilter connection tracking doesn't know about the ip_vs controlled packets. Even if it did know about them, netfilter conntrack was considered too slow to use for LVS. For LVS-DR and LVS-Tun, where the reply packets do not go through the director, netfilter is not able to do connection tracking on these packets at all.

The Antefacto patches were written to allow connection tracking of ip_vs controlled packets for LVS-NAT. Connection tracking for ip_vs with LVS-DR or LVS-Tun was not attempted. The ipvs_nfct code now allows conntrack for LVS-DR and LVS-Tun.

Julian

You can (and always have been able to) use firewall rules that match by device, proto, port or IP, without using ipvs_nfct.

Julian 16 Mar 2007

The NFCT patch is not a way to use iptables NAT rules, it just provides iptables -m state support for IPVS packets. snat_reroute is only for IPVS packets. I just added some information in HOWTO.txt (http://www.ssi.bg/~ja/nfct/HOWTO.txt). SNAT: translate source address. Reroute: call output routing for 2nd time (saddr=VIP), first was the normal input routing for saddr=RIP.

Path of an ip_vs controlled packet Horms horms (at) verge (dot) net (dot) au 19 May 2004 here is my understanding of the way that the Netfilter Hooks and LVS fit together.

Figure: Interaction of LVS with Netfilter. The location of LVS hooks into the netfilter framework on the director. Packets travel from left to right. A packet coming from the client enters on the left and exits on the right heading for the realserver. A reply from the realserver (in the case of LVS-NAT) enters on the left and exits on the right heading for the client. For normal LVS-DR and LVS-Tun operation, reply packets do not go through the director.

for incoming packets the path is:

    PREROUTING -> LOCAL_IN -> POSTROUTING

for outgoing packets (only LVS-NAT):

    PREROUTING -> FORWARD -> POSTROUTING

for incoming ICMP:

    PREROUTING -> FORWARD -> POSTROUTING

When the director receives a packet, it goes through PREROUTING where Routing decides that the packet is local (usually because of the presence of the VIP on a local interface). The packet is then sent to LOCAL_IN.

On the inbound direction, LVS hooks into LOCAL_IN. Modules register with a priority, the lowest priority getting to look at the packets first. LVS registers itself with a higher priority than iptables rules, and thus iptables will get the packet first and then LVS.

Mike McLean mikem (at) redhat (dot) com 21 Oct 2002 On the director, filter rules intercept packets before ip_vs sees them, otherwise would not work

If LVS gets the packet and decides to forward it to a realserver, the packet then magically ends up in the POSTROUTING chain. LVS does not look for ingress packets in the FORWARD chain. The only time that the FORWARD chain comes into play with LVS is for return packets from realservers when LVS-NAT is in use. This is where the packets get unNATed. Again LVS gets the packets after any iptables FORWARDing rules.

ip_vs_in attaches to the LOCAL_IN hook. For a packet to arrive at LOCAL_IN, the dst_addr has to be an IP on a local interface (any interface e.g. eth0). The result of this requirement is that you still need the VIP on the director, even when accepting packets for LVS by VIP-less methods like fwmark or transparent proxy. (It would be nice to remove the requirement for this otherwise non-functional VIP.)

There are ways around the requirement for a local IP, but they may create other problems as well.

- move ip_vs_in from the LOCAL_IN hook to the PREROUTING hook. I tried that briefly once and it seemed to work.
- play with routing rules to deliver the packet locally; Matt also suggested variants of this. I haven't tested either much. There is an oblique reference to this on http://www.linuxvirtualserver.org/docs/arp.html.

If packets for the VIP (on say eth0) started arriving on another interface (say eth1) due to dynamic routing, then LVS wouldn't care - the packet still arrives at LOCAL_IN. rp_filter would probably need to be disabled, and perhaps a few other routing tweaks, but fundamentally if the director could route the traffic (i.e. get the packets), then LVS could load balance it.
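As an illustration of delivering packets locally by routing rules, here is a commonly used form of the trick: mark the packets, send marked packets to their own routing table, and declare every destination in that table local. This is a sketch under assumptions and not necessarily what the posters used. The mark value (1), the table number (100) and the port 80 selector are made up for the example, and the script only prints the commands.

```shell
#!/bin/sh
# Sketch: print policy-routing commands that deliver fwmark'ed packets locally,
# so that they reach LOCAL_IN without the VIP being configured on an interface.

# local_delivery_cmds MARK TABLE  (both values are hypothetical)
local_delivery_cmds() {
    mark=$1
    table=$2
    # mark the packets to be balanced (example selector: tcp port 80)
    echo "iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark $mark"
    # send marked packets to a dedicated routing table
    echo "ip rule add fwmark $mark lookup $table"
    # in that table, treat every destination as local
    echo "ip route add local 0.0.0.0/0 dev lo table $table"
}

local_delivery_cmds 1 100
```

As the text notes for the other workarounds, this may create problems of its own (rp_filter is one place to look), so test it on a quiet machine first.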

how to filter with netfilter

netfilter has several families of rules, e.g. NAT and filter. The filter rules filter, but do not alter, packets. Not all iptables commands are filter rules. For some background on filtering with netfilter, see the Linux netfilter Hacking HOWTO (http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html).

Although iptables rules can be applied at any chain, the filter rules are only applied at the LOCAL_IN, FORWARD and LOCAL_OUT hooks (see Packet Selection: IP Tables http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html#ss3.2).

Horms: LOCAL_IN and LOCAL_OUT in the kernel correspond, more or less, to INPUT and OUTPUT in iptables.

Since any packet only traverses one of these three chains, there is one and only one place to filter it. This is a change from ipchains, where you could filter a packet in several chains. You can't filter in the other chains even if you want to.

Julian

By design you can filter only in these chains because iptable_filter registers only there: we have hooks (places in the kernel stack) where each table module attaches its chains with rules for packet matching. It is a tree with multiple levels of lists. Filtering never worked in the other chains. Table "filter" has only chains INPUT, FORWARD and OUTPUT, each used in the corresponding hook.

Horms

You can see how many packets and bytes a rule is affecting by listing the rules with their counters, e.g. with iptables -L -v -n.
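The per-rule counters are shown by the verbose listing of iptables. A small sketch of how they might be pulled out of that listing follows. The helper names are made up for the example, and rule_counters assumes that every rule in the chain has a target (a rule without -j leaves the target column blank and would be misread).

```shell
#!/bin/sh
# Sketch: list per-rule packet and byte counters for one filter chain.
# counters_cmd prints the iptables command; rule_counters reduces a verbose
# listing (read from stdin) to "packets bytes target" for each rule.

counters_cmd() {
    # -v adds the counters, -n skips DNS lookups, -x prints exact numbers
    echo "iptables -L $1 -v -n -x"
}

rule_counters() {
    # the first two lines of the listing are the chain header and column titles
    awk 'NR > 2 && NF >= 3 { print $1, $2, $3 }'
}

counters_cmd INPUT
```

As root, `iptables -L INPUT -v -n -x | rule_counters` gives one line per rule. A rule whose counters stay at zero while you exercise a service is not seeing the packets you think it is.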

ipvs_nfct, netfilter connection tracking for ipvs

For some information on netfilter connection tracking (nfct) see Linux netfilter Hacking HOWTO (http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html).

May 2004: Because ipvs changes the path of LVS controlled packets, netfilter is not able to connection track them. For LVS-NAT (and for LVS-DR/Tun when using the forward_shared flag), packets in both directions go through the director, so it is possible to write replacement conntrack code (e.g. ipvs_nfct). For LVS-DR and LVS-Tun, the return packets go directly from the realserver to the client and do not go through the director. You can infer the state of the connection using the same mechanism (timeouts) by which the LVS-DR/LVS-Tun director has always made decisions on the state of the connection (where connections are listed as ActiveConn and InActConn by ipvsadm). There ip_vs assumes that connections are set up and terminated in a normal manner. The first implementations of ipvs used the standard ipv4 timeouts to declare a state transition. More recent implementations allow for a private set of timeout values for the ip_vs controlled connections.

Joe - 29 Jan 2003

There were some incompatibilities between LVS and netfilter when running a director and firewall on the same box with 2.4.x?

Julian

Yes, LVS and Netfilter use their own (separate) connection tracking implementations. The situation hasn't changed since I explained it January 2002. If we are going to fix this, then changes in Netfilter are required too, mostly in the routing usage. LVS has some requirements for the connection state which are not present in netfilter. I don't think it is good to move LVS to Netfilter conntracking. And I still don't have enough time to think about such big changes for LVS.

May 2004: Julian has written the ipvs_nfct module, which among other things, fakes the connection tracking for LVS-DR and LVS-Tun. Julian's ipvs_nfct code for LVS for 2.4 and 2.6 kernels is on his Netfilter Connection Tracking Support page (http://www.ssi.bg/~ja/nfct/). This page is accessed from Julian's software page (http://www.ssi.bg/~ja/) through "Linux IPVS tools and extensions: IPVS Netfilter connection tracking support". Because of the small demand for this functionality, this code is one of Julian's lower priority projects. Julian's HOWTO that comes with his patches states that the return packets have to go through the director. In fact the patch works for all LVS forwarding methods, but really is only useful for LVSs in which the replies return through the director.

Julian

I've uploaded the 2.6 version of the NFCT patches, but they aren't tested. ipvs_nfct matches conntrack state for IPVS connections, e.g. NEW, ESTABLISHED, RELATED.

Julian doesn't have a setup to test most of his code. Any untested code from Julian that I've tried has worked first time.

The only problem is with netfilter's non-official tcp window tracking patch (which I haven't tested) that only works for NAT (I suspect). Maybe some of the checks done in the tcp window tracking patch require a bidirectional stream, so possibly it doesn't work for the unidirectional LVS-DR or LVS-TUN connections, i.e. when the director doesn't see the reply packets.

In all other cases (LVS-NAT or bidirectional DR/TUN with forward_shared=1):

- For LVS-NAT, ip_vs forwards packets in both directions in the director.
- For LVS-DR and LVS-TUN the replies are visible to the Netfilter firewall on the director if the director is the default gw for the realservers (for LVS-DR this requires the forward_shared patch or an equivalent). Even if these reply packets are not part of the ipvs stream, they are forwarded, since forward_shared=1.

For the Netfilter firewall, there is no difference in the forwarding method. In all cases the incoming traffic is handled in LOCAL_IN and the replies in FORWARD. IPVS always knows the conn state (NEW/RELATED/ESTABLISHED), it is simply exported to the netfilter conntracking. The patch works for cases when the replies don't go through the director, but in this case it is not very useful. The main purpose of the patch is to match reply packets. For the request packets, the conntrack entry is confirmed, which can speed up the packet handling (I hope). IPVS without NFCT drops the conntrack entry for each packet and allocates the conntrack entry again for the next packet. With ipvs_nfct, each skb comes with skb->nfct attached. ipvs_nfct preserves this NFCT struct, while the default IPVS drops it on skb free.

Joe

Your NFCT HOWTO with date 10 Apr 2004 (but internal date Sep 2003) says that NFCT works for LVS-NAT and forward shared LVS-DR etc, but doesn't say anything about LVS-DR, LVS-Tun

non-NAT methods if forward_shared flag is used

I assume NFCT provides perfect conntrack for LVS when the replies go through the director and hence you can use all iptables commands (e.g. ESTABLISHED). I assume RELATED will need helpers.

Yes, I'm just not sure for FTP for DR/TUN because ip_vs_ftp is not used (for LVS-DR, ftp requires persistence). IIRC, without creating NF expectations we can not expect to match FTP-DATA as RELATED.

For LVS-DR etc, your reply above seems to indicate that NFCT provides conntrack for LVS-DR too, however your NFCT HOWTO doesn't mention it.

Yes, it does not mention for any restrictions but the source has '- support for all forwarding methods, not only NAT'. It is still beta software. As yet there hasn't been a lot of interest in the code.

Is it OK to have a firewall on the director yet?

Yes, for simple ipchains-like rules. I can't be sure for more advanced filtering rules that play with conntrack specific data but if you use only device names, IPs, protos and ports it should work. ipvs works perfectly with netfilter as far as iptables filter rules are concerned.

Can the LVS director now (with ipvs_nfct) have any iptables command run on it, as if it were a regular linux router? Can I now expect to run any iptables command on an ipvs virtual service stream and have it work like on a normal linux box with a normal tcp/udp stream?

Maybe not because the IPVS packets do not use the same path through the network stack as other non-IPVS packets. I can not guarantee complete compatibility e.g you can do very wrong things with using some NAT rules, for example, DNAT. ipvs_nfct when added to ipvs gives you the ability to use -m state and nothing more. Now we can use -m state, with plain IPVS you can not use -m state.

what doesn't work with LVS-DR/Tun?

We support -m state for DR and TUN too. The only thing that doesn't work for DR and TUN is FTP-DATA. Stephane Klein 26 Aug 2004

I installed your ipvs-nfct-2.4.26-1.diff patch, I enabled the CONFIG_IP_VS_NFCT and recompiled the kernel. Here are my rules to enable http service:

    [...] ACCEPT"
    $IPTABLES -A RULE_2 -j ACCEPT
    $IPTABLES -A FORWARD -p tcp -m state --state R,E -j ACCEPT

Julian

All out->in traffic passes INPUT (not FORWARD as in netfilter), you can not allow only NEW packets. FORWARD is passed only for in->out traffic for NAT. Some iptables examples can be found at the NFCT page (http://www.ssi.bg/~ja/nfct/). You can also read about the netfilter hooks LVS uses on my LVS page (http://www.ssi.bg/~ja/LVS.txt).

"Vince W." listacct1 (at) lvwnet (dot) com 12 Feb 2005

How I successfully compiled 2.6.9 and .10 FC3 kernels with ip_vs_nfct patch and GCC 3.4.2

As a followup of sorts to my previous posts, "Error building 2.6.10 kernel with ip_vs_nfct patch - does anyone else get this?", I figured out what the problem was, and with some advice from Julian Anastasov, was successful getting kernels compiled with the ip_vs_nfct patch. Previously I would get an error message at the modules stage of the kernel build.

The system in question is Fedora Core 3, which sports version 3.4.2 of the GNU C compiler (and everything in the release is built with it). The kernel source I was using is the kernel-2.6.10-1.741_FC3.src.rpm. I had added the ip_vs_nfct and nat patches to the kernel build spec file, inserted the "CONFIG_IP_VS_NFCT=y" kernel config option line between the "CONFIG_IP_VS_FTP=m" and "CONFIG_IPV6=m" lines of each kernel arch/type *.config file, and built the kernel.

As it turns out, others have seen problems compiling code which contains external inline functions with GCC 3.4.2, not just people trying to use ip_vs_nfct. I found one such instance where the user documented that by removing "inline" from the function declaration, they were able to compile successfully. Since ip_conntrack_put is declared and exported as an inline-type function in include/linux/netfilter_ipv4/ip_conntrack.h, this seems to cause a problem for ip_vs_nfct making use of the function.

I asked Julian what he thought about this idea, and he suggested that "inline" may need to be removed from ip_conntrack_put's declaration in net/ipv4/netfilter/ip_conntrack_core.c also, since this is where the function is exported. Armed with this idea, I modified both ip_conntrack.h and ip_conntrack_core.c to remove "inline" from the function, and created a patch which I then added to the .spec and kernel build. The kernel compiled successfully and ran. My firewall script worked, and ip_vs_nfct did its job.

Unfortunately, I discovered several uptime hours later, when the box kernel panicked, that there is a known spinlock problem in the 2.6.10 kernel - somewhere in the filesystem/block device drivers code. I say known because comments exist in later iterations of the 2.6.10 kernel spec changelog which indicate that steps were taken to increase the verbosity of output when this specific kernel panic occurs. I do not know if it is an issue with the upstream 2.6.10 sources or not, but at least I knew it wasn't because of ip_vs_nfct.

At any rate, I have been successful building 2.6.9 FC3 kernels with Julian's 2.6.9 ip_vs_nfct patches, and am not seeing the spinlock "not syncing" kernel panics I saw with any Fedora Core 3 2.6.10 kernel .src.rpm sources I tried building with. The box running this kernel has been up for 3 days now, and it is functioning exactly as desired. ...which also means it's gonna be time to update my keepalived Stateful Firewall/LVS Director HOW-TO document from 2003 (http://www.lvwnet.com/vince/linux/Keepalived-LVS-NAT-Director-ProxyArp-Firewall-HOWTO.html) soon...

Specifically, I used the 2.6.9-1.681_FC3 .src.rpm file, including Julian's two patches (ip_vs_nfct and also the nat patch) and my own patch to remove "inline" from the ip_conntrack_put function. If anyone else is interested in building kernels using Julian's patches and you have GCC 3.4.2 (or newer, I'm sure...) you may be interested in this small patch to remove "inline" from ip_conntrack.h and ip_conntrack_core.c. I'll attach it to this post.

If anyone else cares to comment on whether removing "inline" from either/both of these places is good/bad, or what performance impact this may have, please do tell. But it is working well for me so far.

LVS-NAT netfilter conntrack example with ftp

Julian Anastasov ja (at) ssi (dot) bg 19 May 2004

For users of ipvs-nfct I would recommend the following rules. The example is for ftp by LVS-NAT to VIP=192.168.1.100. Access to all other ports on the VIP is denied.

    # enable netfilter connection tracking for ip_vs controlled packets
    echo 1 > /proc/sys/net/ipv4/vs/conntrack

    # module to correctly support connection expectations for FTP-DATA
    modprobe ip_conntrack_ftp

    # module to detect ports used for FTP-DATA
    # (May 2004, has a kernel bug which hasn't been fixed)
    # `modprobe ip_nat_ftp` is optional and ip_nat_ftp needs a fix:
    # http://marc.theaimsgroup.com/?l=linux-netdev&m=108220842129842&w=2
    # if ip_nat_ftp is used together with ipvs_nfct for FTP NAT.
    modprobe ip_nat_ftp

    # Restrict LOCAL_IN access
    # accept packets to dport 21 and related and established connections.
    # the related connections are determined by the ftp helper module
    # drop all other packets
    iptables -A INPUT -p tcp -d 192.168.1.100 --dport 21 -j ACCEPT
    iptables -A INPUT -p tcp -d 192.168.1.100 -m state --state RELATED,ESTABLISHED -j ACCEPT
    iptables -A INPUT -p tcp -d 192.168.1.100 -j DROP

    # Restrict FORWARD access
    # accept only related, established. drop all others
    iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
    iptables -A FORWARD -j DROP

Without NFCT support it is difficult to filter in INPUT for FTP-DATA packets, but for http/https, where the VPORT is known, it is not difficult. The same difficulties are in FORWARD for the NAT replies. This is where ipvs-nfct wins - you have only a small number of rules.

Traffic to the VIP is first filtered by iptables in INPUT and then scheduled from IPVS. IPVS in 2.6 causes the scheduled out->in packets (after any LVS-NAT translations) to appear in the LOCAL_OUT hook where they can be filtered again. IPVS works always after the filter rules. It is easy when you know the ports but for FTP-DATA it is not possible, you have to specify input devices and IPs for which you grant access for forwarding. You overcome such problems with ipvs-nfct.

Ratz 03 Aug 2004

LVS-NAT with the NFCT patch will work for 2.4.x and 2.6.x kernels regarding filtering, if you don't use fwmark.

LVS-DR will most probably not work with 2.6.8 and above kernels regarding filtering, since the tcp window tracking patch has been merged into the vanilla tree; however there is a relaxation sysctl that could revert the strict TCP window and sequence number checking to the loosely-knitted one (aka: non-existent) as previously found in vanilla Linux kernels.

tcpdump is LVS compatible You can use tcpdump to debug your iptables rules on your running director. tcpdump makes a copy of packets for its own use. tcpdump gets a copy of the packets before netfilter (on the way in) and after netfilter (on the way out). You should see all packets with tcpdump as if netfilter and LVS didn't exist. Joe 16 Mar 2001

I'm looking at packets after they've been accepted by TP and I'm using (among other things) tcpdump. Where in the netfilter chain does tcpdump look at incoming and outgoing packets? When they are put on/received from the wire? After the INPUT, before the OUTPUT chain...?

Julian Before/after any netfilter chains. Such programs hook at packet level before/after the IP stack just before/after the packet is received/must be sent from/by the device. They work for other protocols. tcpdump is a packet receiver just like the IP stack is in the network stack. If you are using twisted pair ethernet through a hub/switch, your NIC will only see the packets to/from it. Thus tcpdump running on the director will not see packets from the realserver to the client in LVS-DR. In the days when people used coax for ethernet, all machines saw all packets on a segment.
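Since tcpdump sees packets before netfilter on the way in and after it on the way out, watching both sides of the director is a quick way to see whether a packet was dropped by your rules or never arrived. A minimal sketch for an LVS-NAT director follows. The interface names (eth0 facing the clients, eth1 facing the realservers) and the addresses are assumptions, and the script only prints the tcpdump commands.

```shell
#!/bin/sh
# Sketch: print tcpdump commands for watching one LVS service on the director.
# Interface names and addresses are hypothetical. Nothing is executed.

# watch_cmd IFACE ADDR PORT -> tcpdump command for tcp traffic to/from ADDR:PORT
watch_cmd() {
    # -n: no name lookups, -i: interface to listen on
    echo "tcpdump -n -i $1 host $2 and tcp port $3"
}

# client side of the director: packets to and from the VIP
watch_cmd eth0 192.168.1.100 80
# realserver side: the same connections after translation to a RIP
watch_cmd eth1 10.1.1.2 80
```

A SYN that shows up on the client side but never on the realserver side was stopped on the director, by the filter rules or by ip_vs.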

Writing Filter Rules If you're writing your own rules, start off with a quiet machine, log all packets and then access one of the services. Write rules to accept the packets you want and keep logging the rest. Try another service... Deny all packets that you know aren't needed for your LVS. You can probably accept all packets that have both src_addr and dst_addr in the RIP network. As well machines might need access to outside services (e.g. ntp, dns). Realservers that are part of LVSs, will also require rules to allow them to access outside services. Joe (on changing from writing ipchains rules for ip_vs for 2.2, to writing iptables rules for ip_vs for 2.4)
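The "accept what you want and keep logging the rest" approach could look like the sketch below. The VIP and ports are made up, and the script prints the rules instead of installing them. LOG is a non-terminating target, so an unmatched packet is logged and then falls through to the chain policy.

```shell
#!/bin/sh
# Sketch: accept the services found so far and log everything else, so the
# kernel log shows which packets still need a rule. Commands are only printed.

# logging_ruleset VIP PORT... -> iptables commands
logging_ruleset() {
    vip=$1
    shift
    for port in "$@"; do
        echo "iptables -A INPUT -p tcp -d $vip --dport $port -j ACCEPT"
    done
    # whatever was not accepted above is logged; the chain policy decides its fate
    echo "iptables -A INPUT -j LOG --log-prefix \"INPUT unmatched: \""
    echo "iptables -A FORWARD -j LOG --log-prefix \"FORWARD unmatched: \""
}

# hypothetical VIP with http and https already identified
logging_ruleset 192.168.1.100 80 443
```

Each time the log shows a packet you want, add an ACCEPT rule for it above the LOG rule and access the next service.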

I see packets only in the INPUT and OUTPUT chains, but not in FORWARD or in lvs_rules chains.

Ratz 21 May 2001

If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:

    packets coming into the LVS travel:
    PRE_ROUTING -> LOCAL_IN(LVS in) -> POST_ROUTING

    packets leaving the LVS travel:
    PRE_ROUTING -> FORWARD(LVS out) -> POST_ROUTING

From the iptables howto:

COMPATIBILITY WITH IPCHAINS This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.

When writing filter rules (e.g. iptables), keep in mind that you should write the rules in trees. If a packet has to traverse many rule tests before it is accepted/rejected, then throughput will decrease. If many packets traverse a rule set, then you should attempt to shorten the path through the rules, possibly by breaking the rule set into several branches. Ratz has shown that you can have 200-500 rules in a branch before throughput is affected (see pf-speed-test.pdf).

"K.W." kathiw (at) erols (dot) com
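With iptables, a branch of the rule tree is a user-defined chain. The sketch below sends only tcp packets for the VIP into their own chain, so that all other packets are tested against one rule rather than the whole list. The chain name vip_tcp, the VIP and the ports are assumptions for the example, and the script prints the commands rather than running them.

```shell
#!/bin/sh
# Sketch: put the per-service rules for one VIP into a user-defined chain,
# making one branch of a rule tree. Commands are only printed.

# branch_rules VIP PORT... -> iptables commands
branch_rules() {
    vip=$1
    shift
    echo "iptables -N vip_tcp"
    # one test in INPUT decides whether a packet enters the branch at all
    echo "iptables -A INPUT -p tcp -d $vip -j vip_tcp"
    for port in "$@"; do
        echo "iptables -A vip_tcp -p tcp --dport $port -j ACCEPT"
    done
    # anything else addressed to the VIP is dropped at the end of the branch
    echo "iptables -A vip_tcp -j DROP"
}

branch_rules 192.168.1.100 80 443
```

With several VIPs, each gets its own chain, and the number of tests a packet goes through is roughly the number of VIPs plus the length of one branch, not the length of the whole rule set.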

can I run my ipchains firewall and LVS (piranha in this case) on the same box? It would seem that I cannot, since ipchains can't understand virtual interfaces such as eth0:1, etc.

Brian Edmonds bedmonds (at) antarcti (dot) ca 21 Feb 2001 I've not tried to use ipchains with alias interfaces, but I do use aliased IP addresses in my incoming rulesets, and it works exactly as I would expect it to. Julian

I'm not sure whether piranha already supports kernel 2.4, I have to check it. ipchains does not understand interface aliases even in Linux 2.2. Any setup that uses such aliases can be implemented without using them. I don't know for routing restrictions that require using aliases.

I have a full ipchains firewall script, which works (includes port forwarding), and a stripped-down ipchains script just for LVS, and they each work fine separately. When I merge them, I can't reach even just the firewall box. As I mentioned, I suspect this is because of the virtual interfaces required by LVS.

LVS does not require any (virtual) interfaces. LVS never checks the devices or any aliases. I'm also not sure what the port forwarding support in ipchains is. Is that the support provided by ipmasqadm: the portfw and mfw modules? If yes, they are not implemented (yet). And this support is not related to ipchains at all. Some good features have still not been ported from Linux 2.2 to 2.4, including all these useful autofw things. But you can use LVS in the places where you would use ipmasqadm portfw/mfw, though not for the autofw tricks. LVS can do the portfw job perfectly well and even extends it beyond NAT: there are the DR and TUN methods too.

Lorn Kay lorn_kay (at) hotmail (dot) com

I ran into a problem like this when adding firewall rules to my LVS ipchains script. The problem I had was due to the order of the rules. Remember that once a packet matches a rule in a chain it is kicked out of the chain. It doesn't matter if it is an ACCEPT or REJECT rule (packets may never reach your FWMARK rules, for example, if those rules do not come before your ACCEPT and REJECT tests). I am using virtual interfaces as well (e.g. eth1:1) but, as Julian points out, I had no reason to apply ipchains rules to a specific virtual interface (even with an ipchains script that is several hundred lines long!)

unknown

FWMARKing does not have to be a part of an ACCEPT rule. If you have a default DENY policy and then say: To maintain persistence between port 80 and 443 for https, for example, the packets will match on the ACCEPT rule, get kicked out of the input chain tests, and never get marked.
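
The ordering problem described in the two postings above can be illustrated with a toy model of a chain. This is a simplified sketch, not ipchains itself: the rule format (dicts with `match`, `mark` and `target` keys) and the function name `run_input_chain` are invented for the example, and it assumes that a rule which only marks lets the packet continue down the chain, while a rule with a target ends the walk.

```python
def run_input_chain(rules, packet, policy="DENY"):
    """Walk a simplified chain and return (verdict, fwmark) for the packet."""
    fwmark = None
    for rule in rules:
        if not rule["match"](packet):
            continue
        if "mark" in rule:
            # Marking alone does not decide the packet's fate.
            fwmark = rule["mark"]
        if "target" in rule:
            # First rule with a target wins; later rules are never tested.
            return rule["target"], fwmark
    return policy, fwmark


def is_web(packet):
    """Match traffic for port 80 or 443 (the http/https persistence case)."""
    return packet["dport"] in (80, 443)


accept_web = {"match": is_web, "target": "ACCEPT"}
mark_web = {"match": is_web, "mark": 1}
packet = {"dport": 443}

# ACCEPT rule first: the packet leaves the chain before it can be marked.
accept_first = run_input_chain([accept_web, mark_web], packet)
# Mark rule first: the packet is marked, then accepted.
mark_first = run_input_chain([mark_web, accept_web], packet)
print(accept_first, mark_first)
```

With the ACCEPT rule first, the packet is accepted but carries no mark, so a fwmark-based virtual service would never see it.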

The Antefacto Netfilter Connection Tracking patches The Problem: Because of the incompatibilities between netfilter and LVS, it is not possible (in general) to have iptables firewall rules running on the director (the firewall must be on a separate machine). The first code (the Antefacto patches) was written by Ben North for 2.4 kernels, when he worked for (the now defunct) Antefacto. The code was then (Jun 2003) taken over by Vinnie listacct1 (at) lvwnet (dot) com and is now (Apr 2004) being ported by Julian who calls them the "netfilter connection tracking" (nfct) patches. The original Antefacto patches allowed a firewall on directors in an LVS where the packets from the realservers return through the director (LVS-NAT, and LVS-DR with the forward_shared flag when the director is the default gw for the realservers). This restriction is required so that the director has full information about the connection (in LVS-DR and LVS-Tun, the director makes guesses about the state of the connection from timeout values). Documentation for the Antefacto code is at setting up an LVS-NAT Director (running keepalived) to function as a stateful firewall, which also happens to use proxy-arp. The code is at Antefacto patch for 2.4.19/1.0.7 and Antefacto patch for 2.4.20/1.0.8

The problem: director can't be firewall as well John P. Looney john (at) antefacto (dot) com Apr 12, 2002

We modified ip_vs to get it to play nicely with iptables on the same box, so you don't need a separate firewall/vpn box. The patches weren't accepted to the main branch, as the changes were considered non-mainstream, and they were made from the 0.8.2 version, which was a little old then. Have a look at http://www.in-addr.de/pipermail/lvs-users/2002-January/004585.html From memory (I didn't do the kernel work), the ip_vs connection tracking tables and the netfilter connection tracking tables were not always in sync. So, you couldn't statefully firewall an ip_vs service. There is a readme included somewhere in that thread. We've just done a product release. One of our aims is to reimplement these changes in the 1.0.x branch, if someone hasn't already done so. When that's done, we'll post those patches to the list also.

Here's the original posting from Ben North. Feb 2003: Antefacto no longer exists. You should be able to contact Ben at ben (at) redfrontdoor (dot) org.

We've been working with the LVS code for the past while, and we wanted to allow the use of Netfilter's connection-tracking ability with LVS-NAT connections. There was a post on the mailing list a couple of weeks ago asking about this, and my colleague Padraig Brady mentioned that we had developed a solution. I've now had time to clean up the patches, and I attach a README, and two patch files. One is for the Linux kernel, and one is to the LVS code itself. Any comments, get in touch. We have done a fair amount of testing (overnight runs with many tens of thousands of connections), with no problems. Many thanks for the great piece of work. Hope the patches are useful and will be considered for inclusion in future releases of LVS. I notice that 1.0.0 is going to arrive soon; the attached patches might be better applied to a Linux-kernel-style 1.1 "development" branch.

Vinnie listacct1 (at) lvwnet (dot) com 04 May 2003 Well I haven't tried to crash the firewall/Director or anything, but to sum it up, the firewall box is doing its job now just as well as it was before I started dinking around with LVS/IPVS. It is letting traffic come IN that I have IPVS virtual services for, and letting it be FORWARDED to the Real Servers. It's not getting in the way of IPVS connections in progress, nor does it appear to be letting traffic through which is NOT related to connections already in progress. Ratz

Guys, I hope you _do_ realize that not even netfilter has a properly working connection tracking. Without the tcp-window-tracking patch, netfilter allows you to send arbitrary packets through the stack. It's a well-known fact and even the netfilter homepage at some point mentioned it.

Point taken. But that's not an IPVS or Antefacto problem.

I take it that you didn't do any tests of the patch or netfilter in general with a packet generator (where you can modify every last bit of an skb).

No, I can't say that I have. Perhaps you would be willing to put some of that expertise you have to work?

And, to your interest, LVS _does_ have sort of connection state tracking.

I am aware of that. But the point about all of this (and the reason that the folks who actually wrote the Antefacto patch did so) is that IPVS works independently of netfilter's connection tracking. So Netfilter doesn't have a CLUE about all those connections going on (or not going on) to IPVS-based services and RealServers. But if you want your LVS Director to also be your main firewall, that means you have to be able to tell your firewall box, in ways that you can communicate your wishes with iptables commands, what kind of traffic you want to allow to go in/out of your LVS. But that's pretty hard to do since IPVS unmodified doesn't bother to let netfilter in on the loop of what it's doing. The antefacto patch allows netfilter and IPVS to communicate about all that traffic going through your LVS, so that at the iptables ruleset level, it is possible to write rules that work for your LVS. If netfilter's connection tracking is broken, then it's broken -- IPVS, Antefacto, or not.

the patches Following patches, Copyright (C) 2001--2002 Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland, released under GPL, and then Ben's documentation on the patch. First the patch to the kernel sources (http://www.austintek.com/WWW/LVS/LVS-HOWTO/HOWTO/files/antefacto_kernel.diff) And second, the patch to the LVS sources (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/antefacto_lvs.diff) (made against 0.8.2 so may need some clean-up).

Making LVS work with Netfilter's connection tracking The two attached patches modify the kernel and the ipvs modules in such a way that ipvs NAT connections are correctly tracked by the Netfilter connection-tracking code. This means that firewalling rules can be put in place to allow incoming connections to a virtual service, and then by allowing ESTABLISHED and RELATED packets to pass the FORWARD chain, we achieve stateful firewalling of these connections. For example, if director 4.3.2.1 is offering a virtual service on TCP port 8899, we can do and get the desired behaviour. Note that the second rule (the one in the FORWARD chain) covers all virtual services offered by the same director, so if another service is offered on port 9900, the complete set of rules required would be i.e., one rule in the INPUT chain is required per virtual service, but the rule in the FORWARD chain covers all virtual services. Patches required Patch to main kernel source There is a small change required to the kernel patch. The stock kernel patch which comes with the ip_vs distribution just adds a few EXPORT_SYMBOL()s to ksyms.c. For the Netfilter connection-tracking functionality, we need a bit more. The files affected, and reasons, are: ip_conntrack_core.c: init_conntrack(): Mark even more clearly that the newly-created connection-tracking entry is not in the hash tables. This change isn't strictly necessary but makes assertion-checking easier. ip_conntrack_standalone.c: Export the symbol __ip_conntrack_confirm(). I didn't really like the idea of exporting a symbol starting with double-underscore, but nothing too bad seems to have happened. The function seems to take care of reference-counting, so I think we're OK here. ip_nat_core.c: ip_nat_replace_in_hashes(): (new function) Exported wrapper round replace_in_hashes() which deals with the locking on ip_nat_lock. ip_nat_standalone.c: Export the new ip_nat_replace_in_hashes() function. 
ip_nat.h: Declare the new ip_nat_replace_in_hashes() function. More explanation below. Patch to ip_vs code ip_vs_app.c: skb_replace(): Copy debugging information across to the new skb, if debugging is enabled. This is a separate issue to the main connection-tracking patch, but was causing spurious warnings about which hooks a skb had passed through. ip_vs_conn.c: Include some netfilter header files. Declare a new function ip_vs_deal_with_conntrack(). ip_vs_nat_xmit(): Code to make sure that Netfilter's connection-tracking entry is correct. ip_vs_deal_with_conntrack(): (new function) The guts of the new functionality. Changes the data inside the Netfilter connection-tracking entry to match the actual packet flow. ip_vs_core.c: route_me_harder(): (new function) Copied from ip_nat_standalone.c. Code to re-make the routing decision for a packet, treated as locally-generated. ip_vs_out(): Separate from the connection-tracking code changes, don't send ICMP unreachable messages. This has been discussed on the list recently and I think the consensus was that this change is OK. The sysctl method would be better though, so ignore this bit. Also call route_me_harder() to decide whether the outbound packet needs to be routed differently now it is supposed to be coming from the director machine itself. ip_vs_in(): When checking if a packet might be trying to start a new connection, check that it has SYN but not ACK. Previously, the only check was that it had SYN set. If there is a new connection being attempted, check for consistency between Netfilter's connection-tracking table and LVS'. More explanation of this bit below. ip_vs_ftp.c: Include the Netfilter header files. Declare new function ip_vs_ftp_expect_callback(). ip_vs_ftp_out(): Once we have noticed that a passive data-transfer connection has been negotiated at application level, tell Netfilter to expect this connection and so treat it as RELATED. 
ip_vs_ftp_in(): Once we have noticed that an active data-transfer connection has been negotiated at application level, tell Netfilter to expect this connection and so treat it as RELATED. ip_vs_ftp_expect_callback(): (new function) When the RELATED packet arrives (for a data-transfer connection), update Netfilter's connection-tracking entry for the connection.

General connections (i.e. not FTP) Each entry in Netfilter's connection-tracking table has two tuples describing source and destination addresses and ports. One of these tuples is the ORIG tuple, and describes the addressing of packets travelling in the "original" direction, i.e., from the machine that initiated the connection to the machine that responded. The other is the REPLY tuple, which describes the addressing of packets travelling in the "reply" direction, i.e., from the responding machine to the initiating machine. Normally, the REPLY tuple is just the "inverse" of the ORIG tuple, i.e., has its source and destination reversed. But for LVS connections, this is not the case. This is what causes the problem when using the unmodified Netfilter code with IPVS connections. Actually, it's one of the things that causes trouble. The following is roughly what happens with the unmodified code for the start of a TCP connection to a virtual service. Suppose we have a client at CIP, a virtual service at VIP:VPORT on the director, and a Real Server at RIP:RPORT. Then the client sends a packet to the VIP:VPORT; say

CIP:CPORT -> VIP:VPORT

Netfilter on the director makes a note of this packet, and sets up a temporary connection-tracking entry with tuples as follows:

ORIG: CIP:CPORT -> VIP:VPORT
REPL: VIP:VPORT -> CIP:CPORT

(the "src-ip:src-port -> dest-ip:dest-port" notation is hopefully clear enough). We will call a connection-tracking entry a "CTE" from now on. LVS notices (in ip_vs_in(), called as part of the LOCAL_IN hook) that VIP:VPORT is something it's interested in, grabs the packet, re-writes it to be addressed

CIP:CPORT -> RIP:RPORT

and sends it on its way by means of ip_send(). As a result, the POST_ROUTING hook gets called, and ip_vs_post_routing() gets a look at the packet. It notices that the packet has been marked as belonging to LVS, and calls the (*okfn), sending the packet to the wire without further ado. When it has been transmitted, the reference count on the CTE falls to zero, and it is deleted. (This is a mild guess but I think is right.) 
Normally, CTEs avoid this fate because __ip_conntrack_confirm() is called for them, either via ip_confirm() as a late hook in LOCAL_IN, or through ip_refrag() called as a late hook in POST_ROUTING. "Confirming" the CTE involves linking it into some hash tables, and ensuring it isn't deleted. So this is the first problem --- the CTE is not "confirmed". Suppose we confirmed the connection. Then when the Real Server replies to this packet, it sends a packet addressed as

RIP:RPORT -> CIP:CPORT

to the director (because the Director is the router for such packets, as seen by the Real Server). Then the connection-tracking code in Netfilter on the director tries to look up the CTE for this packet, but can't find one. The CTE we /want/ it to match says

ORIG: CIP:CPORT -> VIP:VPORT
REPL: VIP:VPORT -> CIP:CPORT

with no mention of the RIP:RPORT. So this reply packet gets labelled as "NEW", whereas we wanted it to be labelled as "ESTABLISHED". So as well as confirming the CTE, we also need to alter the REPLY tuple so that it will match the

RIP:RPORT -> CIP:CPORT

packet the Real Server sends back. Then everything will work. These two things are what the ip_vs_deal_with_conntrack() function does. Luckily there is an ip_conntrack_alter_reply() function exported by Netfilter, which we can use. Then we can also call the newly-exported __ip_conntrack_confirm() to confirm the connection. (We need to do the reply altering first because __ip_conntrack_confirm()ing the CTE uses the addresses in the ORIG and REPLY tuples to place the CTE in the hash tables, and we want it placed based on the /new/ reply tuple.) There is a slight complication in that the NAT code in Netfilter gets confused if addressing tuples change, so we need to tell the NAT code to re-place the CTE in its hash tables. This is done with the newly-exported ip_nat_replace_in_hashes() function. The ip_vs_deal_with_conntrack() function is called from the ip_vs_nat_xmit() function, since this whole problem only applies to LVS-NAT. 
It is only called if the CTE is unconfirmed. Hacking round a possible race When testing this, we found that very occasionally there would be a problem when the Netfilter CTE timed out and was deleted. The code would fail an assertion: the CTE about to be deleted was not linked into the hash chain it claimed it was. This would happen after a few tens of thousands of connections from the same client to the same virtual service. We tracked this down to the above ip_vs_deal_with_conntrack() code being called for a CTE which already existed and was already confirmed. Doing this moved the CTE to a different hash chain and broke things. The only explanation I could come up with is that there is a race in the ip_vs code. The ip_vs code doesn't set up one timer per connection entry. Instead, it uses a kernel timer to do some work every second. I didn't look into this too deeply, but it looked like the following is a possibility. If the slow-timer code decides that a LVS connection should be expired, there seems to be a window where a packet can arrive and update that connection, meaning that it should no longer be expired. But it is anyway. There are more details; supplied on request. But if somebody who knows the timer code could check whether the above is a possibility, and fix it if so, that would be good. The workaround detects if the CTE is already confirmed, and deletes it and also drops the packet if so. Higher levels in the stack take care of retransmitting so nothing too drastic goes wrong. Later, we noticed the workaround being triggered much more often than we'd expect, and it turned out that incoming packets with the SYN and ACK bits both set were being treated as potentially starting new connections, whereas SYN/ACK packets are in fact a response to a connection initiated by the director itself. 
So we tightened the test to be

((h.th->syn && !h.th->ack) || (iph->protocol != IPPROTO_TCP))

instead of

(h.th->syn || (iph->protocol!=IPPROTO_TCP))

which is how it is in the original LVS code. This doesn't seem to have caused any nasty side effects. Note that this only happened when an FTP virtual service was configured, because of the code in ip_vs_service_get() which allows a "wild-card" match for incoming FTP data connections.
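
The tuple handling described in this section can be mimicked with a toy lookup table in Python. This is a rough sketch under stated assumptions, not the kernel code: the class `ConntrackTable` and the helpers `new_cte`, `alter_reply` and `is_new_connection_attempt` are names invented for the illustration, and a CTE is reduced to a dict holding its two tuples.

```python
class ConntrackTable:
    """Toy stand-in for the hash tables holding confirmed CTEs."""

    def __init__(self):
        self._hash = {}

    def confirm(self, cte):
        # Confirming links the CTE in under BOTH tuples, so the REPLY
        # tuple has to be correct before this is called.
        self._hash[cte["orig"]] = cte
        self._hash[cte["reply"]] = cte

    def state_of(self, src, dst):
        """Label a packet the way a stateful rule would see it."""
        return "ESTABLISHED" if (src, dst) in self._hash else "NEW"


def new_cte(src, dst):
    """Create a CTE whose REPLY tuple is the plain inverse of ORIG."""
    return {"orig": (src, dst), "reply": (dst, src)}


def alter_reply(cte, src, dst):
    """Replace the REPLY tuple (only valid before the CTE is confirmed)."""
    cte["reply"] = (src, dst)


def is_new_connection_attempt(is_tcp, syn, ack):
    """The tightened test: for TCP, SYN set and ACK clear."""
    return (syn and not ack) or not is_tcp


CLIENT, VIP, RIP = "CIP:CPORT", "VIP:VPORT", "RIP:RPORT"

# Unmodified behaviour, even if the CTE were confirmed.
unmodified = ConntrackTable()
unmodified.confirm(new_cte(CLIENT, VIP))
state_unmodified = unmodified.state_of(RIP, CLIENT)

# Patched behaviour: alter the REPLY tuple first, then confirm.
patched = ConntrackTable()
cte = new_cte(CLIENT, VIP)
alter_reply(cte, RIP, CLIENT)
patched.confirm(cte)
state_patched = patched.state_of(RIP, CLIENT)

print(state_unmodified, state_patched)
```

In the unmodified case no tuple mentions RIP:RPORT, so the realserver's reply is labelled NEW; once the REPLY tuple is altered before confirmation, the same packet is labelled ESTABLISHED.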

FTP connections The other main change is to the LVS FTP module. We add code to the two functions ip_vs_ftp_out() and ip_vs_ftp_in(), to deal with passive and active data transfers respectively. The basic idea is the same for both types of transfer. By keeping an eye on the actual traffic going between the client and the FTP server, we can tell when a data transfer is about to take place. For a passive transfer, the ip_vs_ftp module looks out for the string "227 Entering Passive Mode" followed by the address and port the server will listen on. For an active transfer, the client transmits the "PORT" command followed by the address and port the client will listen on. Once we have detected that a data transfer is about to take place, we add code to tell Netfilter's connection-tracking code to /expect/ the data connection. Then, packets belonging to the data connection will be labelled "RELATED" and can be allowed by firewall rules. There is an exported function ip_conntrack_expect_related(), which we call. The only difference between the set-up for passive and active transfers is that for passive transfers we don't know the port the client will connect from, so have to specify the source port as "don't care" by means of its mask. The ip_conntrack_expect_related() function allows us to specify a callback function; we use ip_vs_ftp_expect_callback() (new function in this patch). ip_vs_ftp_expect_callback() works out whether the new connection is for passive or active, modifies the REPLY tuple, and confirms the CTE. I've just noticed that I modify the reply tuple directly instead of calling ip_conntrack_alter_reply(). Can't see any good reason for this, so should probably change the code to use ip_conntrack_alter_reply() instead. Might not have time to test that change here, so will leave it alone for now. So to run a virtual FTP service, load the extra ip_vs_ftp module, but /not/ the ip_conntrack_ftp or ip_nat_ftp modules. 
It is very likely that the ip_vs_ftp module would not cooperate very well with those two modules, so if you want to run a non-virtual FTP service /and/ load-balance a virtual FTP service on the same machine, more work might be required. route_me_harder() We call this function to possibly re-route the packet, because we were using policy routing (iproute2). This allows routing decisions to depend on more than just the destination IP address of the packet. In particular, a routing decision can be influenced by the source IP address of the packet, and by the fact that the packet should be treated as originating with the local machine. The call to route_me_harder() re-makes the routing decision in light of the new state of the packet. It could be removed (or disabled via a sysctl) if the overhead was too annoying in an application which didn't require this extra flexibility. Additional #defines There are additional #defines available to add assertion-checking and various amounts of debugging to the output of the new code. #define BN_ASSERTIONS to include extra code which checks various things are as they should be. This adds a small amount of overhead (sorry, haven't measured it) but caught some problems in development. #define BN_DEBUG_FTP to emit diagnostic and tracing information from the modified ip_vs_ftp module. Again, was useful during development but probably not useful in production. #define BN_DEBUG_IPVS_CONN to emit diagnostic and tracing information from the new code which handles Netfilter's CTEs. Same comments apply: useful while I was working on it, but probably not in actual use.
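
For the FTP handling described earlier in this section, the step of spotting an upcoming data transfer on the control channel can be sketched in Python. This is an illustration only: the function names and the returned dict are invented here, and only the "227" reply and the "PORT" command mentioned above are handled. The address and port encoding (six comma-separated numbers, port = p1*256 + p2) is the standard FTP one.

```python
import re

# h1,h2,h3,h4,p1,p2 as used by both the PORT command and the 227 reply
_HOST_PORT = re.compile(
    r"(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3}),(\d{1,3})"
)


def parse_host_port(text):
    """Extract (ip, port) from the h1,h2,h3,h4,p1,p2 form, or None."""
    match = _HOST_PORT.search(text)
    if match is None:
        return None
    numbers = [int(group) for group in match.groups()]
    if any(number > 255 for number in numbers):
        return None
    ip = ".".join(str(number) for number in numbers[:4])
    port = numbers[4] * 256 + numbers[5]
    return ip, port


def expected_data_connection(line, from_server):
    """Describe the data connection a control-channel line announces."""
    if from_server and line.startswith("227 "):
        mode = "passive"   # client will connect to the server's address
    elif not from_server and line.upper().startswith("PORT "):
        mode = "active"    # server will connect to the client's address
    else:
        return None
    endpoint = parse_host_port(line[4:])
    if endpoint is None:
        return None
    return {
        "mode": mode,
        "dst_ip": endpoint[0],
        "dst_port": endpoint[1],
        # For passive transfers the client's source port is unknown,
        # so the expectation has to treat it as "don't care".
        "src_port_wildcard": mode == "passive",
    }


print(expected_data_connection("227 Entering Passive Mode (10,1,1,2,19,137)", True))
print(expected_data_connection("PORT 192,168,0,5,4,1", False))
```

In the patch, the equivalent information is what gets handed to ip_conntrack_expect_related() so that the data connection is labelled RELATED.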

The design of LVS as a netfilter module, pt1 Tao Zhao taozhao (at) cs (dot) nyu (dot) edu 11 Jul 2001

The source code of LVS adds ip_vs_in() to netfilter hook NF_IP_LOCAL_IN to change the destination of packets. As I understand, this hook is called AFTER routing decisions have been made. So how can it forward the packet to the new assigned destination without routing?

Henrik Nordstrom hno (at) marasystems (dot) com Instead of rewriting the packet inside the normal packet flow of Linux-2.4, IPVS accepts the packet and constructs a new one, routes it and sends it out. This approach does not make much sense for LVS-NAT within the netfilter framework, but fits quite well for the other modes. Julian LVS does not follow the netfilter recommendations. What happens if we don't change the destination (e.g. the DR and TUN methods, which don't change the IP header)? When such a packet hits the routing, the IP header fields are used for the routing decision. Netfilter can forward only by using NAT methods. LVS tries not to waste CPU cycles in the routing cache. You can see that there is an output routing call involved, but there is an optimization you can find even in TCP - the destination cache. The output routing call is avoided in most cases. This model is near the one achieved in Netfilter, i.e. to call the input routing function only once (2.2 calls it twice for DNAT). I'm now testing a patch for 2.2 (on top of LVS) that avoids the second input routing call and that can reroute the masqueraded traffic to the right gateway when many gateways are used and mostly when these gateways are on the same device. The tests will show how different the speed is between this patched LVS for 2.2 and the 2.4 one (one CPU of course). We decided to use the LOCAL_IN hook for many reasons. Maybe you can find more info on the LVS integration into the Netfilter framework by searching in the LVS mailing list archive for "netfilter". Julian 29 Oct 2001 IPVS uses only the netfilter's hooks. It uses its own connection tracking and NAT. You can see how LVS fits into the framework on the mailing list archive. Ratz

I see that the defense_level is triggered via a sysctl and invoked in the sltimer_handler as well as the *_dropentry. If we push those functions one level higher and introduce a metalayer that registers the defense_strategy, which would be selectable via sysctl and would currently contain update_defense_level, we would have the possibility to register other defense strategies like e.g. limiting threshold. Is this feasible? I mean instead of calling update_defense_level() and ip_vs_random_dropentry() in the sltimer_handler we just call the registered defense_strategy[sysctrl_read] function. In the existing case the defense_strategy[0]=update_defense_level() which also merges the ip_vs_dropentry. Do I make myself sound stupid? ;)

The different strategies work in different places and it is difficult to use one hook. The current implementation allows them to work together. But maybe there is another solution considering how LVS is called: to drop packets or to drop entries. There are not many places for such hooks, so maybe it is possible for something to be done. But first let's see what kind of other defense strategies will come.

Yes, the project got larger and gained more reputation than some of us initially thought. The code is very clear and stable, it's time to enhance it. The only very big problem that I see is that it looks like we're going to have two separate code paths: one patch for 2.2.x kernels and one for 2.4.x.

Yes, this is the reality. We can try to keep the things not to look different for the user space.

This would be a pain in the ass if we had two versions of ipvsadm. IMHO the userspace tools should recognize (compile-time) what kernel it is working with and therefore enable the featureset. This will of course bloat it up in future the more feature-differences we will have regarding 2.2.x and 2.4.x series.

Not possible, the sockopt are different in 2.4 Joe (I think)

Could you point me to a sketch where I could try to see how the control path for a packet looks like in kernel 2.4? I mean something like I would do for 2.2.x kernels:

Julian (I think) I hope there is a nice ascii diagram in the netfilter docs, but I hope the info below is more useful if you already know what each hook means.

    C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
    h     a    |input |    e    {Routing }   |Chain  |     |output |ACCEPT
    e     n    |Chain |    m    {Decision}   |_______| --->|Chain  |
    c     i    |______|    a     ~~~~~~~~        |     | ->|_______|
    k     t       |        s       |             |     | |     |
    s     y       |        q       |             v     | |     |
    u     |       v        e       v            DENY/  | |     v
    m     |     DENY/      r   Local Process   REJECT  | |   DENY/
    |     v    REJECT      a       |                   | |  REJECT
    |   DENY               d       --------------------- |
    v                      e -----------------------------
   DENY

Ratz (I think)

The biggest problem I see here is that maybe the user space daemons don't get enough scheduling time to be accurate enough.

That is definitely true. When the CPU(s) are busy transferring packets the processes can be delayed. So, the director better not spend many cycles in user space. This is the reason I prefer all these health checks to run in the realservers but this is not always good/possible.

No, considering the fact that not all RS are running Linux. We would need to port the healthchecks to every possible RS architecture.

Yes, this is a drawback. unknown (Ratz ?)

Tell me, which scheduler should I take? None of the existing ones gives me good enough results currently with persistency. We have to accept the fact, that 3-Tier application programmers don't know about loadbalancing or clustering, mostly using Java and this is just about the end of trying to load balance the application smoothly.

WRR + load informed cluster software. But I'm not sure in the case when persistency is on (it can do bad things).

I currently get some values via a daemon coded in perl on the RS, started via xinetd. The LB connects to the healthcheck port and gets some prepared results. He then puts this stuff into a db and starts calculating the next steps to reconfigure the LVS-cluster to smoothen the imbalance. The longer you let it run the more data you get and the fewer adjustments you have to make. I reckon some guy showing up on this list once had this idea in the direction of fuzzy logic. Hey Julian, maybe we should accept the fact that the wlc scheduler also isn't a very advanced one:

atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);

What would you think would change if we made this 50 dynamic?

Not sure :) I don't have results from experiments with wlc :) You can put it in /proc and to make different experiments, for example :) But warning, ip_vs_wlc can be module, check how lblc* register /proc vars.
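
To make the question about the factor 50 concrete, here is a small Python sketch of a wlc-style choice with the factor turned into a parameter. It is an illustration, not the ip_vs_wlc code: the names `wlc_overhead`, `wlc_pick` and `active_factor` are invented, realservers are plain dicts, and the sketch assumes the scheduler picks the realserver with the lowest overhead per unit of weight.

```python
def wlc_overhead(active, inactive, active_factor=50):
    """Connection overhead: active connections count active_factor times."""
    return active * active_factor + inactive


def wlc_pick(dests, active_factor=50):
    """Pick the realserver with the smallest overhead/weight ratio."""
    candidates = [dest for dest in dests if dest["weight"] > 0]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda dest: wlc_overhead(
            dest["active"], dest["inactive"], active_factor
        ) / dest["weight"],
    )


realservers = [
    {"name": "rs1", "weight": 1, "active": 2, "inactive": 100},
    {"name": "rs2", "weight": 1, "active": 3, "inactive": 0},
]

# With the usual factor of 50: rs1 -> 200, rs2 -> 150.
pick_50 = wlc_pick(realservers)["name"]
# With a factor of 200: rs1 -> 500, rs2 -> 600.
pick_200 = wlc_pick(realservers, active_factor=200)["name"]
print(pick_50, pick_200)
```

The same connection counts lead to a different choice depending on the factor, which is why exposing it (for example through /proc, as suggested above) would allow experiments.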

The design of LVS for Netfilter and Linux 2.4, pt2 The most recent version of Julian's writeup of LVS and Netfilter (NF) is on the LVS website. Here is the version available in Jun 2002.

TODO: - redesign LVS to work in setups with multiple default routes (this requires changes in the kernels, calling ip_route_input with different arguments). The end goal: one routing call in any direction (as before) but do correct routing in in->out direction. The problems: fwmark virtual services and the need for working at prerouting. Solution: hook at PREROUTING after the filter and do there connection creation (after QoS, fwmark setup). Hook at prerouting, listen for traffic for established connections and call ip_route_input with the right arguments (possibly in the routing chain). Goal: always pass one filter chain in each direction (FORWARD). The fwmark is used only for connection setup and then is ignored. hash twice the NAT connections in same table (at prerouting we can see both requests and replies), compare with cp->vaddr to detect the right direction - help from Netfilter to redesign the kernel hooks: ROUTING hook (used from netfilter's NAT, LVS-DR and in->out LVS-NAT) fixed ip_route_input to do source routing with the masquerade address as source (lsrc argument) more control over what to walk in the netfilter hooks? - different timeouts for each virtual server (more control over the connection timeouts) - Allow LVS to be used as NAT router/balancer for outgoing traffic

CURRENT STATE:

Running variants:
1. Only lvs - the fastest
2. lvs + ipfw NAT
3. lvs + iptables NAT

Where is LVS placed:

The chains:

The out->in LVS packets (for any forwarding method) walk:

LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING

    LOCAL_IN
        ip_vs_in -> ip_route_output/dst cache
            -> mark skb->nfcache with special bit value
            -> ip_send -> POST_ROUTING
    POST_ROUTING
        ip_vs_post_routing
            - check skb->nfcache and exit from the chain if our bit is set

The in->out LVS packets (for LVS/NAT) walk:

FORWARD -> POST_ROUTING

    FORWARD (check for related ICMP):
        ip_vs_forward_icmp -> local delivery -> mark skb->nfcache -> POST_ROUTING
    FORWARD
        ip_vs_out -> NAT -> mark skb->nfcache -> NF_ACCEPT
    POST_ROUTING
        ip_vs_post_routing
            - check skb->nfcache and exit from the chain if our bit is set

Why LVS is placed there:
- LVS creates connections after the packets are marked, i.e. after PRE_ROUTING:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use the skb->nfmark as a virtual service ID.
- LVS must be after PRE_ROUTING:FILTER+1:sch_ingress.c - QoS setups. This way the incoming traffic can be policed before reaching LVS.
- LVS creates connections after the input routing because the routing can decide to deliver locally packets that are marked or other packets specified with routing rules. Transparent proxying handled from the netfilter NAT code is not always a good solution.
- LVS needs to forward packets not looking in the IP header (direct routing method), so calling ip_route_input with arguments from the IP header only is not useful for LVS
- LVS is after any firewall rules in LOCAL_IN and FORWARD

Requirements for the PRE_ROUTING chain Sorry, we can't waste time here. The netfilter connection tracking can mangle packets here and we don't know at this time if a packet is for our virtual service (new connection) or for existing connection (needs lookup in the LVS connection table). We are sure that we can't make decisions whether to create new connections at this place but lookup for existing connections is possible under some conditions: the packets must be defragmented, etc. There are so many nice modules in this chain that can feed LVS with packets (probably modified)

Requirements for the LOCAL_IN chain

The conditions when sk_buff comes:
- ip_local_deliver() defragments the packets (ip_defrag) for us
- the incoming sk_buff can be non-linear
- when the incoming sk_buff comes only the read access is guaranteed

What we do:
- packets generated locally are not considered because there is no known forwarding method that can establish connection initiated from the director.
- only TCP, UDP and related to them ICMP packets are considered
- the protocol header must be present before making any work based on fields from the IP or protocol header.
- we detect here packets for the virtual services or packets for existing connections and then the transmitter function for the used forwarding method is called
- the NAT transmitter performs the following actions:

We try to optimize for most of the traffic we see: the normal traffic that is not bound to any application helper, i.e. where the data part (payload) of the packets is not written or even read at all. In this case, we change the addresses and the ports in the IP and protocol headers but we don't verify the checksums. We perform an incremental checksum update after the packet is mangled and rely on the realserver to perform the full check (headers and payload). If the connection is bound to an application helper (FTP for example) we always verify the checksum, on the assumption that the data is usually changed and the additional assumption that the traffic using application helpers is low. To perform such a check the whole payload must be present in the provided sk_buff. For this, we call functions to linearize the sk_buff data by assembling all its data fragments. Before the addresses and ports are changed we need write access to the packet data (headers and payload). This guarantees that any other readers see the packet data unchanged. The copy-on-write is performed by the linearization function for packets that had many fragments. For all other packets we must copy the packet data (headers and payload) if it is used by someone else (the sk_buff was cloned). The packets not bound to application helpers need such write access only for the first fragment, because for them only the IP and protocol headers are changed and we guarantee that these are in the first fragment. For the packets using application helpers the linearization is already done and we are sure that there is only one fragment. As a result, we need write access (copy if cloned) only for the first fragment. After the application helper is called to update the packet data we perform a full checksum calculation.
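The incremental checksum update mentioned above can be illustrated outside the kernel. The following is a minimal Python sketch, not the LVS code: it assumes the update rule of RFC 1624 (HC' = ~(~HC + ~m + m')), and the addresses, header field values and function names are made up for the illustration. It rewrites the destination address of an IPv4 header from a VIP to a RIP and shows that patching the old checksum gives the same result as recomputing it over the whole header.

```python
import socket
import struct


def ones_complement_sum(data: bytes) -> int:
    """16-bit one's-complement sum with end-around carry (RFC 1071)."""
    if len(data) % 2:
        data += b"\x00"
    total = sum(word for (word,) in struct.iter_unpack("!H", data))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return total


def full_checksum(data: bytes) -> int:
    """Checksum computed from scratch over every byte of data."""
    return ~ones_complement_sum(data) & 0xFFFF


def incremental_update(old_csum: int, old_field: bytes, new_field: bytes) -> int:
    """RFC 1624 eqn. 3: HC' = ~(~HC + ~m + m'), touching only the changed field."""
    total = ~old_csum & 0xFFFF
    for (word,) in struct.iter_unpack("!H", old_field):
        total += ~word & 0xFFFF          # subtract the old 16-bit words
    for (word,) in struct.iter_unpack("!H", new_field):
        total += word                    # add the new 16-bit words
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_ip_header(src: str, dst: str, csum: int = 0) -> bytes:
    """20-byte IPv4 header with illustrative field values (TCP, TTL 64)."""
    return struct.pack("!BBHHHBBH4s4s", 0x45, 0, 40, 0x1C46, 0x4000, 64, 6,
                       csum, socket.inet_aton(src), socket.inet_aton(dst))


# Hypothetical addresses: client, virtual IP, realserver IP.
CLIENT, VIP, RIP = "192.0.2.10", "203.0.113.5", "192.168.1.20"

old_csum = full_checksum(build_ip_header(CLIENT, VIP))
new_csum = incremental_update(old_csum, socket.inet_aton(VIP), socket.inet_aton(RIP))
print(new_csum == full_checksum(build_ip_header(CLIENT, RIP)))
```

The incremental form only reads the old checksum and the 4 bytes that changed, which is why it needs neither the payload nor a linear sk_buff.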

- the DR transmitter performs the following actions:

Nothing special; it may be the shortest function. The only action is to reroute the packet to the bound realserver. If the packet is fragmented then ip_send_check() should be called to refresh the checksum.

- the TUN transmitter performs the following actions:

Copies the packet if it is already referenced by someone else or when there is no space for the IPIP prefix header. The packet is rerouted to the realserver. If the packet is fragmented then ip_send_check() should be called to refresh the checksum in the old IP header.
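To make the "IPIP prefix header" concrete, here is a minimal Python sketch of IP-in-IP encapsulation. It is an illustration under assumptions, not the kernel's tunnel transmitter: the function names are hypothetical, the outer header uses fixed illustrative values (TTL 64, no options, ID 0), and the outer source is assumed to be the director. The inner packet is carried unchanged behind a new 20-byte header whose protocol field is 4 (IP-in-IP).

```python
import socket
import struct

IPPROTO_IPIP = 4  # IANA protocol number for IP-in-IP encapsulation
HEADER_FORMAT = "!BBHHHBBH4s4s"


def ip_checksum(header: bytes) -> int:
    """Internet checksum over an even-length IPv4 header."""
    total = sum(word for (word,) in struct.iter_unpack("!H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def build_outer_header(src: str, dst: str, inner_len: int, ttl: int = 64) -> bytes:
    """Build the 20-byte outer header that prefixes the original packet."""
    fields = [0x45, 0, 20 + inner_len, 0, 0, ttl, IPPROTO_IPIP, 0,
              socket.inet_aton(src), socket.inet_aton(dst)]
    fields[7] = ip_checksum(struct.pack(HEADER_FORMAT, *fields))
    return struct.pack(HEADER_FORMAT, *fields)


def ipip_encapsulate(inner_packet: bytes, director_ip: str, realserver_ip: str) -> bytes:
    """Prefix the untouched inner packet with the outer header."""
    return build_outer_header(director_ip, realserver_ip, len(inner_packet)) + inner_packet


# Stand-in bytes for the client's original packet; addresses are hypothetical.
inner = bytes(range(40))
packet = ipip_encapsulate(inner, "10.0.0.1", "10.0.0.2")
print(len(packet), packet[9])
```

Because the inner packet is not modified, the realserver sees the client's original headers once it strips the outer one.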

- if the packets must leave the box we send them to POST_ROUTING via ip_send and return NF_STOLEN. This means that we remove the packet from the LOCAL_IN chain before reaching priority LAST-1. The LocalNode feature just returns NF_ACCEPT without mangling the packet.

In this chain, if a packet is for an LVS connection (even a newly created one), LVS calls ip_route_output (or uses a destination cache), marks the packet as LVS property (sets a bit in skb->nfcache) and calls ip_send() to jump to the POST_ROUTING chain. There our ip_vs_post_routing hook must call the okfn for the packets with our special nfcache bit value (Is skb->nfcache used after the routing calls? We rely on the fact that it is not used) and return NF_STOLEN. One side effect: LVS can forward packets even when ip_forward=0, but only for the DR and TUN methods. For these methods the TTL is not decremented and the data checksum is not checked.
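The marking scheme described above can be modelled in a few lines. This is a toy Python model, not kernel code: the function names mirror the kernel functions only for readability, the Skb class is invented, and the bit value given to NFC_IPVS_PROPERTY is an assumption made for the illustration. The verdict numbers are the usual netfilter ones (NF_DROP 0, NF_ACCEPT 1, NF_STOLEN 2).

```python
from dataclasses import dataclass

# Netfilter verdicts.
NF_DROP, NF_ACCEPT, NF_STOLEN = 0, 1, 2

# Bit used to tag packets as LVS property; the value is an assumption.
NFC_IPVS_PROPERTY = 0x10000


@dataclass
class Skb:
    """Toy stand-in for an sk_buff: only the fields this sketch needs."""
    nfcache: int = 0
    sent: bool = False


def okfn(skb: Skb) -> None:
    """Stands for the function that finally transmits the packet."""
    skb.sent = True


def ip_vs_post_routing(skb: Skb) -> int:
    """POST_ROUTING hook: LVS packets skip the rest of the chain."""
    if not skb.nfcache & NFC_IPVS_PROPERTY:
        return NF_ACCEPT          # not ours, let the other hooks run
    okfn(skb)                     # send it ourselves
    return NF_STOLEN              # and tell netfilter to forget the packet


def ip_vs_in(skb: Skb, is_lvs_connection: bool) -> int:
    """LOCAL_IN hook: mark LVS packets and hand them to POST_ROUTING."""
    if not is_lvs_connection:
        return NF_ACCEPT          # ordinary local delivery
    skb.nfcache |= NFC_IPVS_PROPERTY
    ip_vs_post_routing(skb)       # stands for ip_send() reaching POST_ROUTING
    return NF_STOLEN


lvs_packet, other_packet = Skb(), Skb()
print(ip_vs_in(lvs_packet, True), ip_vs_in(other_packet, False))
```

In this model a packet that was never marked simply passes through the POST_ROUTING hook with NF_ACCEPT, which is the behaviour the text relies on for non-LVS traffic.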

Requirements for the FORWARD chain

LVS checks first for ICMP packets related to TCP or UDP connections. Such packets are handled as if they were received in the LOCAL_IN chain - they are locally delivered. This is used for transparent proxy setups. LVS looks in this chain for in->out packets, but only for the LVS/NAT method. In any case new connections are not created here; the lookup is for existing connections only. In this chain the ip_vs_out function can be called from several places:

- FORWARD:0 - the ipfw compat mode calls ip_vs_out between the forward firewall and the masquerading. In this way LVS can grab the outgoing packets for its connections and prevent them from being used by netfilter's NAT code.
- FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here twice if the ipfw compat module is used because ip_vs_out is called once from FORWARD:0 (fw_in) and again from pri=100, where LVS always registers the ip_vs_out function. We detect this second call by looking at the skb->nfcache bit value. If the bit is set we return NF_ACCEPT. In fact, the second ip_vs_out call is avoided if the first returns NF_STOLEN after calling the okfn function.

The actions we perform are the same as in the LOCAL_IN chain for the NAT transmitter, with the exception that we should call ip_defrag(). The other difference is that we have write access to the first fragment (it is not referenced by someone else) after ip_forward() calls skb_cow().

Requirements for the POST_ROUTING chain LVS marks the packets for debugging and they appear to come from LOCAL_OUT, but this chain is not traversed. The only thing LVS needs from the POST_ROUTING chain is the fragmentation code. But even the ICMP messages are generated and mangled ready for sending long before the POST_ROUTING chain: ip_send() does not call ip_fragment() for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the mtu is shorter. LVS makes MTU checks when accepting packets and selecting the output device. So, the ip_refrag POST_ROUTING hook is not used by LVS. The result is: LVS must hook POST_ROUTING first (maybe only after the ipfw compat filter) and return NF_STOLEN for its packets (detected by checking the special skb->nfcache bit value). The Netfilter hooks:

Example ip_tables filter scripts

A simple firewall installation script is gshield. This script below by Tim Cronin, was written before became available.

Tim Cronin tim (at) 13-colonies (dot) com 14 Feb 2003

    RemoteAddress:Port           Forward Weight ActiveConn InActConn
    TCP  xx.xx.xx.xx:http wlc persistent 1200
      -> 192.168.1.25:http       Masq    1      0          2
    TCP  xx.xx.xx.xx:http wlc persistent 1200
      -> 192.168.1.20:http       Masq    2      16         11
      -> 192.168.1.10:http       Masq    3      17         23

This script (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/tim_cronin.sh) has been running reliably in production for 6 months. The link at the top of the script is a good starting point to understand how it works. Note that the default config generates copious logs. The IP addresses have been changed to protect the innocent server. I had problems with the syn flag hence the section ignoring stuff going to the vips.

performance hit on director with iptables/netfilter Dant dan (at) id-confirm (dot) com

I set up an intrusion detection system (IDS) on the director and ran a simple test on the director, collecting the mean service time, HitRatio etc., which cost me two weeks. I found that snort does not affect the performance.

ratz 16 Nov 2005 Depends on the configuration; it's actually quite easy to kill Snort with an advanced ruleset and checks above L4.

as both snort and iptables use libpcap library to scratch packets,

iptables does not use libpcap, as far as I know, so I'm not sure I understand your question.

does it mean the iptables will not affect the director's performance ? or am I right before when using snort?

iptables is the user space part of netfilter, which is in the kernel. So no, iptables will not hurt performance, but netfilter certainly does, depending on the usage, amount of rules and your hardware configuration.

I'm running several highly loaded LVSs. These days I see so many malicious scans that I want to ban them all with portsentry. We are also troubled by DDoS :-/

Portsentry only mitigates the problem, it doesn't solve it. Also, it's not something that should be implemented on the LVS. Having a NIDS on the director is also a bit suboptimal, since an IDS should ideally not be detectable and should be in read-only mode. Either put a second box between the networks you need to sniff, preferably in bridge mode, or modify your network cables by removing the TX part, so only receiving is possible. Neither suggestion works well with a director. On the modifying-network-cables-for-IDS part: http://www.snort.org/docs/tap/

While we're touching on this subject here, what kind of a NIDS do people use inside an LVS setup, and how can it be implemented?

There's nothing special about LVS that would require a different approach to NIDS, so this is more a general question of how to deploy IDS; and this, I'm afraid, is subject to personal views. I don't know on which level you plan on deploying IDS, but a good starter is the Snort documentation corner, which can be found at: http://www.snort.org/docs/ Especially interesting is the IDS load balancer. I talked to Marty about load balancing traffic to multiple IDS nodes to share the load in 2001 I think, however I don't remember what our consensus was. Other than that you'd have to be a bit more specific. I'd be glad to help, although I left the IDS field 2-3 years ago. One of the reasons is that with the Basel II and the Sarbanes-Oxley acts (http://www.aicpa.org/sarbanes/index.asp) you can barely allow yourself anymore to "lose" data, which in the sense of IDS translates to either "false positives" or "false negatives". Since the two items mentioned are a general issue of IDS systems, which require highly skilled personnel, other means to achieve the demanded level of security quality management have to be found, for example: reliable logging and monitoring, on top of a well-thought-out and implemented security policy.

Long sessions through LVS DR director terminated by icmp-host-prohibited (ICMP type 3 code 10)

This problem was found in the normal operation of an LVS. The problem is with netfilter, not LVS. Netfilter is gratuitously sending icmp packets in an ESTABLISHED connection. We don't know what the problem is about (yet). It's here in case someone else finds it too.

Klaas Jan Wierenga k (dot) j (dot) wierenga (at) home (dot) nl 13 Mar 2007

I have a problem where sometimes some long standing mp3 streaming sessions over HTTP are terminated because the LVS-DR director sends an "ICMP type 3 code 10 - host unreachable" packet to the client (which is the source of the mp3 stream). When this happens the client stops sending packets for 15 minutes (the TCP idle session timeout of LVS?) before it reconnects on the same ports. The 15 minutes seems to be related to the connection timing out of the LVS connection table. When this happens the real servers are all fine, load is not heavy and ldirectord is able to perform its checks. In fact nothing shows in the ldirectord.log so the real servers are all available. This is quite a long post, I've tried to include all relevant details.

My setup: ISP router -> LVS Director -> Local switch -> Realserver[123] -> ISP router

Directions I've looked into so far and questions I've asked myself:

- enabled "quiescent=yes" to maybe not terminate existing connections, but this is not the problem I think because the real servers are all available when this happens.
- Where is this ICMP packet generated in linux/net/ipv4/ipvs/* source files? Answer: nowhere!, at least not with type 3 code 10
- Could it be that this ICMP packet is generated by some sort of denial-of-service defense code that I'm unaware of?
- Where is this specific ICMP packet (HOST_UNREACH_ANO) generated in the kernel? Answer: net/ipv4/netfilter/ipt_REJECT.c: send_unreach(*pskb, ICMP_HOST_ANO); So it appears that netfilter (iptables?) is sending it. Why?
This could be due to the firewall rule: But why is this sent on an existing, established and active connection? Or is there some TCP timeout because the director only sees incoming packets on the connection? Maybe this rings a bell with someone. Maybe the client is not behaving correctly by not continuing to send data after receiving ICMP host unreachable? TCP/IP Illustrated Vol1 [Stevens] says on page 317, 21.10 ICMP Errors: "A received host unreachable or network unreachable is effectively ignored, since these two errors are considered transient. ... It could be that an intermediate router has gone down and it can take the routing protocols a few minutes to stabilize on an alternative route. During this period either of these two ICMP errors can occur, but they must not abort the connection. Instead, TCP keeps trying to send the data that caused the error, although it may eventually time out."

Later...

A while ago I posted about a problem I was having with long mp3 streaming sessions which were terminated because the streaming LVS cluster (managed by me) was sending icmp-host-prohibited on an established connection to the client, which was causing the connection to be terminated. Initially I suspected the LVS director but after some investigation I found out that it never sends icmp-host-prohibited. The only other possibility was netfilter sending it. The relevant parts of my initial iptables were (/etc/sysconfig/iptables): After I changed the port 80 rule to the one below, effectively disabling connection tracking on port 80, the problem disappeared. Initially I made this iptables change on the LVS director, but then the realservers would sometimes send icmp-host-prohibited on established connections; only after also changing iptables on the realservers did the problem go away. It is still unclear to me why netfilter would decide to send icmp-host-unreachable on an established connection when connection tracking is active.
Maybe someone on the netfilter list can shed some light on this. Still later... did you ever find a better solution? Klaas Jan Wierenga k (dot) j (dot) wierenga (at) home (dot) nl 26 Jun 2007 Not really. It appears to be a netfilter problem because when I changed my firewall rules (/etc/sysconfig/iptables) to disable connection tracking, the problem went away.

stateful filtering: LVS-NAT

Laurentiu didn't know about Siim Poder's patch for stateful LVS-NAT filtering about 2 months previously FIXME: write up Siim's patch and link this to it.

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 10 Sep 2008

This is my solution to the problem I found of the last FIN-ACK being eaten by iptables. I had a simple LVS-NAT configuration where LVS lives on a gateway: The kernel is 2.6.20 with ipvsadm 1.24 (Fedora 5). The LVS does run iptables but fairly open: INPUT allows all incoming traffic for the VIP, OUTPUT allows NEW,ESTABLISHED,RELATED states and FORWARD is open (for what it's worth, I think ipvs does not go through there at all). So that worked fine. Until I noticed the real server has many connections in FIN_WAIT2 state. They have the same timeout as TIME_WAIT so I was gonna let it go, but then I looked at the client and all of them were in LAST_ACK state. The client kept resending FIN-ACKs, none of which made it to the server at all. On the LVS, ipvsadm -Lc shows connections in TIME_WAIT state so it did get them. Well, long story short, the OUTPUT chain blocked *only* that FIN-ACK packet for some odd reason. I was sure that ipvs is shortcircuiting iptables and bypassing OUTPUT, but I guess I misinterpreted the little map in the HOWTO. All the other packets matched the "NEW" rule. This one would probably have ended up as INVALID. I am now adding rules to allow all OUTPUT towards the RIPs, stateless.
VIP)

     0.016862 CIP -> VIP HTTP HEAD / HTTP/1.1
     0.017193 VIP -> CIP [ACK] Seq=1 Ack=117 Win=5888 Len=0
     0.021949 VIP -> CIP [TCP segment of a reassembled PDU]
     0.022173 VIP -> CIP [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
     0.034046 CIP -> VIP [ACK] Seq=117 Ack=195 Win=6912 Len=0
    *0.046042 CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0*

    the above packet does not make it, beyond here retransmits only

     0.250217 VIP -> CIP [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
     0.260110 CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0
     0.267333 CIP -> VIP [TCP Dup ACK 11#1] [ACK] Seq=118 Ack=196 Win=6912 Len=0 SLE=195 SRE=196
     0.705855 CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0

Coming out the other end towards RIP:

     0.016847 CIP -> RIP HTTP HEAD / HTTP/1.1
     0.017119 RIP -> CIP [ACK] Seq=1 Ack=117 Win=5888 Len=0
     0.021873 RIP -> CIP [TCP segment of a reassembled PDU]
     0.022115 RIP -> CIP [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
     0.034021 CIP -> RIP [ACK] Seq=117 Ack=195 Win=6912 Len=0

    only two retransmits seen:

     0.250147 RIP -> CIP [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
     0.267312 CIP -> RIP [TCP Previous segment lost] [ACK] Seq=118 Ack=196 Win=6912 Len=0 SLE=195 SRE=196

The current setup seems to work except for a minor annoyance - the netfilter conntrack table still has the connections, when I would have expected that to be almost empty, given that LVS steals the packets from nf. The connections display as UNREPLIED and originating on the RIP:80 so they aren't "real" but I'm curious which packets from the real server triggered them. A blanket ACCEPT rule on outgoing traffic doesn't seem very secure for a firewall, and in my case there's a firewall in front of the LVS. Outgoing FORWARDed traffic is not the one allowed though, it is the traffic originating on the LVS machine itself, the OUTPUT chain in the main table which is usually left open anyway.
Since then I have noticed the INPUT chain would have blocked the same packet in the same configuration, so both INPUT and OUTPUT need to have a stateless ACCEPT on that tcp port for the LVS to work.

stateful filtering: LVS-DR

Because the director cannot see the reply packets from the realserver, the standard netfilter stateful filtering can't be used with LVS-DR (or LVS-Tun).

Thomas Pedoussaut thomas (at) pedoussaut (dot) com 22 Apr 2008

For one of my dozen or so services (a straight TCP connection), the TCP FIN packets arriving at the load balancer are never passed to the real server. I turned on iptables logging and could see the FIN packets being dropped. I have no idea why the FINs are dropped and not the other packets. I obviously have the --state ESTABLISHED,RELATED -j ACCEPT in my iptables rules. I had a quick look at /proc/net/ip_conntrack before, during and after the connection but nothing specific to that connection seems to be inserted (the module is loaded and other traffic gets tracked). It even happens when I close the client connection within seconds of creation, so I don't think timeouts are involved. My issue is that the backend application doesn't deal with timeouts, so it never initiates the closing of the connection.

Later...

Basically, all packets (SYN and non-SYN) are allowed by the "--state NEW" iptables rule but not by the ESTABLISHED,RELATED rule, because the director never sees the replies from the real server and so never creates a conntrack entry for that connection. When a FIN packet arrives, it is not accepted as --state NEW because its FIN flag is set, and so that particular packet is dropped. So the solution is to change the iptables rule

LVS: Cluster friendly versions of applications that need to maintain state

rewriting your application/service

Complicated websites can be hard to run under LVS (e.g. websites with servlets). The problem is that the website has been written with the assumption that all functions on the website will always be together on the same machine. This was a reasonable assumption not so long ago, but now with customers wanting high availability (and possibly high throughput), this assumption is no longer valid. The assumption will be invalid if the client writes to the server (see ) or if the server maintains state information about the client (see ). Often people setting up an LVS hope that LVS can look inside the packets for content and direct the packet to an appropriate realserver. LVS can't do this. In all cases, the simplest, most robust solution is to rewrite the application to run in a high availability environment. Administratively this sort of proposal is not well received by management, who don't like to hear that their expensive web application was not properly written. Management will likely be more receptive to canned solutions from glib sales people, who will tell management that an L7 loadbalancer is a simple solution (it is, but it is also slow and expensive).

Roberto Nibali ratz (at) tac (dot) ch 19 Apr 2001

LVS is a Layer4 load balancer and can't do content based (L7) load balancing. You shouldn't try to solve this problem by changing the TCP Layer to provide a solution which should be handled by the Application Layer. You should never touch/tweak TCP settings outside the boundaries given in the various RFCs and their implementations. If your application passes a cookie to the client, these are the general approaches:

- Buy an L7 load balancer (and don't use LVS).
- Set a very high persistency timeout and hope it is higher than the period a client will wait to come back after he found his credit card, or look at other sites, or have a cup of coffee. This is not a good solution.
Increased persistency timeout increases the number of concurrent connections possible, which increases the amount of memory required to hold the connection table. With a persistency timeout of 30min and clients connecting at 500 connections/s you would need a memory pool of at least: 30*60*128*500/(1024*1024) = 109 MBytes. With the standard timeout of 300 seconds, you'd only need 109/6 = 18 Mbytes. Long persistency times are incompatible with the DoS defense strategies employed by secure_tcp.

- Have a 2-tier architecture where you have the application directly on the webserver itself and maybe helped by a database. The problem of the cookie storage is not solved however. You have to deal with the replication problem. Imagine the following setup:

                                 Web1/App -->
                               /              \
    Clients ----> director -> Web2/App ---> DB Server
                               \              /
                          ----> Web3/App -->

  Cookies are generated and stored locally on each WebX server. But if you have a persistency timeout of 300s (default LVS setting) and the client had his cup of coffee while entering his visa numbers, he would get to a new server. This new server would then ask the client to reauthenticate. There are solutions to this, e.g.

  - NFS export a dedicated cookie directory over the back-interfaces. Cookies are quickly distributed among the servers.
  - the application is written to handle cookie replication and propagation between the WebX servers (you have at least 299 seconds to replicate the cookie on all web servers. This should be enough even for distributing over a serial line and doing a crosscheck :) This does not work (well) for geographically distributed webservers.

- 3-Tier architecture

                       Web1 --
                     /         \
    Clients ----> LVS ----> Web2 ----> Application Server <---> DB Server
                     \         /
                  --> Web3 -->

  The cookies are generated by the application server and either stored there or on the database server. If a request comes in, the LVS assigns the request e.g. to Web1 and sets the persistency timeout.
Web1 does a short message exchange with the application server which generates the sessionID as a cookie and stores it. The webserver sends the cookie back and now we are safe. Again this whole procedure has t_pers_timeout (normally 300 seconds) in which to complete. Let's assume the client times out (has gone for a cup of coffee). When he comes back, normally on a Layer4 load balancer he will be forwarded to a new server (say Web2). The CGI script on Web2 does the same as happened originally on Web1: it generates a cookie as sessionID. But the application server will tell the script that there is already a cookie for this client and will pass it to Web2. In this way we have unlimited persistency based on cookies but limited persistency for TCP.

Advantages:

- set your own persistency timeout values
- TCP state timeout values are not changed.
- table lookup is faster
- it's cheaper than buying an L7 load balancer

Disadvantages:

- more complex setup, more hardware
- you have to write some software

If a separate database is running on each webserver, use replication to copy the cookie between servers. (You have 300 secs to do this). This was also mentioned by Ted Pavlic in connection with databases.
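The memory estimate given above for long persistency timeouts is easy to rerun with your own numbers. This small Python sketch only restates that arithmetic; the 128 bytes per entry is the figure used in the estimate, and the function name is made up for the illustration.

```python
ENTRY_BYTES = 128  # size of one connection table entry used in the estimate


def conn_table_mbytes(timeout_s: int, conns_per_s: int,
                      entry_bytes: int = ENTRY_BYTES) -> float:
    """Memory needed if every connection is held for the full timeout."""
    return timeout_s * conns_per_s * entry_bytes / (1024 * 1024)


# 30 minute persistency timeout versus the standard 300 seconds,
# both at 500 new connections per second.
long_timeout = conn_table_mbytes(30 * 60, 500)
default_timeout = conn_table_mbytes(300, 500)
print(int(long_timeout), int(default_timeout))
```

The requirement grows linearly with both the timeout and the connection rate, so doubling either one doubles the memory needed.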

Session Data, maintaining state in a cluster, from Andreas Koenig

Andreas J. Koenig andreas.koenig (at) anima (dot) de 2001-06-26

What are sessions?

When an application written on top of a stateless protocol like HTTP has a need for stateful transactions, it typically writes some data to disk between requests and retrieves these data again on the subsequent request. This mechanism is known as session handling. The session data typically get written to files or databases. Each followup-request sends some sort of token to the server so that the application can retrieve the correct file or correct record in the database.

The old-fashioned way to identify sessions

At the time when every webserver was a single PC, the session token identified a filename or a record in a database and everything was OK. When an application that relies on this mechanism is ported to a cluster environment, it stops working unless one degrades the cluster with a mechanism called persistence. Persistence is a quick and dirty way to get the old-fashioned token to work. It's not a very clever way though.

Why persistence is bad

Persistence counteracts two purposes of a cluster: easy maintenance by taking single machines out at any time, and optimized balancing between the members of a cluster. On top of that, persistence consumes memory on the load balancers.

How to do it better

Recall that there is a token being sent back and forth anyway, that identifies a filename or a database record. Extend this token to unambiguously point to the machine where the session data were created and install a session server on each host that delivers session data within the cluster to any of the peers on request. From that moment you can run your application truly distributed, and you can take single machines out for maintenance any time: you turn their weight to 0 and wait for maybe an hour or three, depending on how long you want your sessions to last.
You get better balancing, and you save memory on the balancer. Note, that unlike with a dedicated session server, you do not create a single point of failure with this method.
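The extended token described above can be sketched as follows. This is a toy in-process Python model under stated assumptions, not Andreas's implementation: the token layout ("host:random-id"), the Node class and its method names are all hypothetical, and a real deployment would replace the direct method call between nodes with a network request to the peer's session server.

```python
import secrets


def make_token(host: str) -> str:
    """Token = name of the host that created the session + a random id."""
    return f"{host}:{secrets.token_hex(16)}"


def origin_host(token: str) -> str:
    """Recover the creating host from the token (host names without ':')."""
    return token.partition(":")[0]


class Node:
    """One realserver: holds its own sessions and serves them to peers."""

    def __init__(self, name: str, cluster: dict):
        self.name = name
        self.cluster = cluster
        self.sessions = {}
        cluster[name] = self

    def create_session(self, data: dict) -> str:
        token = make_token(self.name)
        self.sessions[token] = data
        return token

    def serve_session(self, token: str):
        """Session server side: hand local session data to a peer."""
        return self.sessions.get(token)

    def get_session(self, token: str):
        """Application side: ask whichever host the token points at."""
        owner = self.cluster.get(origin_host(token))
        return owner.serve_session(token) if owner else None


cluster = {}
web1, web2 = Node("web1", cluster), Node("web2", cluster)
token = web1.create_session({"cart": ["book"]})
print(origin_host(token), web2.get_session(token))
```

With such a token the director can use any scheduler without persistence, since any realserver can find the session data from the token alone.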

Single Session Related to the concept of persistent connection (whether implemented with LVS persistence or any other method) is the concept of single session. The client must appear to have only one session, as if the server is one machine. You must be able to recognise the client when they make multiple connections and data written on one realserver must be visible on another realserver. Also see . K Kopper 7 Jun 2006 Let's say you are running Java applications (aka Java Threads) inside of a Java container (virtual machine) you should be able to tell the container itself how you want it to store session information (like a shopping cart). The method of storage can therefore automatically make the session information from one cluster node (real server) available to all cluster nodes via a file sharing technique, multicasting to all the nodes, or by storing data on a database server (on a backend HA pair). If you are five pages deep into a shopping cart, for example, and the real server crashes it won't be a problem if you land on a new real server with your next click of "submit" and it can pull up your session information. Check out 7-2 (page 64) of this document for the Oracle approach: http://download-west.oracle.com/otn_hosted_doc/ias/preview/web.1013/b14432.pdf Or for the JBOSS way using multicasting via JGroups: http://www.jgroups.org/javagroupsnew/docs/index.html Building an Oracle OC4J container that is highly available on the HA backend to store session information for a cluster works and seems like a good sound approach to me. The multicast way raises many doubts in my mind (especially if you need to lock the session information for any reason). Dan Baughman dan.baughman () gmail ! com 2006-05-09 Internally, lvs must make decisions about where to send what connection request, right? I want to implement a hash table to keep track of where previous connections from an ip went, and send them to the same server. Any sort of timeout is optional. 
Once an ip gets a server it can always get that same server. I had previously thought of giving the session a timeout, but now I'm leaning towards just having it maintain the hash forever, and I'll just restart the director daemon every night at 2 am (or never). Basically we are going to cluster some cold fusion servers and I don't want to pay the 10 grand Adobe wants for an enterprise license to do what we want. We have a lot of apps deployed that use a cookie stored on the client in conjunction with the user's ip to access server side session data. The effort to reimplement the apps we've deployed to access a db instead of the session data would be considerable. Looking at the persistent option, that seems to be exactly what I'm looking for.

K Kopper karl_kopper (at) yahoo (dot) com 6 Jun 2006

To share files on the real servers and ensure that all real servers see the same changes at the same time, a good NAS box or even a Linux NFS server built on top of a SAN (using Heartbeat to failover the NFS server service and IP address the real servers use to access it) works great. If you run "legacy" applications that perform POSIX-compliant locking you can use the instructions at http://linux-ha.org/HaNFS to build your own HA NFS solution with two NFS server boxes and a SAN (only one NFS server can mount the SAN disks at a time, but at failover time the backup server simply mounts the SAN disks and fails over the locking statd information). Of course purchasing a good HA NAS device has other benefits like non-volatile memory cache commits for faster write speed. If you are building an application from scratch then your best bet is probably to store data using a database and not the file system. The database can be made highly available behind the real servers on a Heartbeat pair (again with SAN disks wired up to both machines in the HA pair, but only one server mounting the SAN disks where the database resides at a time).
Heartbeat comes with a Filesystem script that helps with this failover job. If your applications store state/session information in SQL and can query back into the database at each request (a cookie, login id, etc.) then you will have a cluster that can tolerate the failure of a real server without losing session information--hopefully just a reload click on the web browser for all but the worst cases (like "in flight" transactions). With either of these solutions your applications do not have to be made cluster-aware. If you are developing something from scratch you could try something like Zope Enterprise Objects (ZEO) for Python, or in Java (JBOSS) there is JGroups to multicast information to all Java containers/threads, but then you'll have to re-solve the locking problem (something NFS and SQL have a long track record of doing safely). But you were just asking about file systems and I got off topic . . .

Christian Bronk chbr (at) webde (dot) de 02 Jun 2006

As long as you want AOL customers on your site, you will need a single session server for your cluster (any sort of database will do). Every request from AOL comes from a different proxy-IP and even setting a persistence-netmask will not fix that.

malcolm lists (at) netpbx (dot) org 02 Jun 2006

The SH scheduler gives exactly the same kind of response as persistence and it's layer 4 based on source hash... There are hundreds of session implementations for web servers, it's one of the first things web programmers should learn (i.e. INSERT INTO sessiontable.....) LVS doesn't do L7 because L7 should be done by your app (i.e. that's what L7 is for.)

Martijn Grendelman, 2 Jun 2006

I couldn't get the SH scheduler to work (at the time not understanding the weight parameter) and I set up an Msession server for "session clustering" and used the RR scheduler. This setup works perfectly and is still in use today.
However, since Msession is hopelessly outdated, and its successor (Mcache) doesn't seem to get off the ground, and I haven't found any workable (open source) alternatives, I would really like to have another look at LVS persistence of some sort.

Martijn Grendelman martijn () grendelman ! net 2006-05-10 12:09:28

A centralized session manager would be nice, but I for one haven't been able to find a decent solution for use with PHP. I don't know about other systems or APIs. Msession is dead (I still have it running, but any attempt to build a stable daemon on an up-to-date system failed time after time) and its successor (MCache) seems to have died before it even got to beta. Other projects I have looked at (http://www.vl-srm.net/, for example) are also dead, or just aren't suitable (Memcached). I use Msession for a shared hosting cluster. The main advantage is that Msession has a PHP extension, so it doesn't require any PHP client code. I need this, because I don't want to implement any dirty hacks based on "auto_prepend_file" or something like that, which I would need if I'd put my sessions in MySQL. Well, you can always buy Zend Platform, which features Session Clustering, but some of us don't have the $$$. Any suggestions for alternatives?

mike mike503 (at) gmail (dot) com 6 Jun 2006

IMHO storing data in blobs is a horrible idea. If you are coding an application, I'd suggest checking out MogileFS. If this is for general purpose web hosting, where you need a normal POSIX filesystem to access, then that won't do. But for applications, it seems like a great idea (and from what small amount I read about the Google FS, it actually has a couple of the same traits). As far as session management goes, a central session manager such as msession would work, or just roll your own off a database - it's simple in PHP (that is what I do) - then use DB failover/replication/etc. software to handle the DB clustering/failover.

mike mike503 () gmail !
com 2006-05-09 not sure how long you want to track the information, but you might be able to handle this with iptables and firewall marks. then you can group requests by any sort of iptables-configurable tracking (by port(s), ip ranges, etc...) - also i think there's a persistence configuration option in ldirectord (or is it keepalived, i always get them confused - or maybe both.) I don't understand the need though for session persistence like this; I'd expect a centralized session manager (msession for instance) or just using a central database for the information would suffice. that's how I've been doing it, not sure why everyone has all these unique requirements to make sure they can persist sessions across IP addresses and AOL proxies and such. seems overkill, I've never had a problem. mike mike503 (at) gmail (dot) com 10 May 2006 It's very simple to make your own which uses only a single database table in mysql. I used to use msession, but it had some overhead it seemed like, and a database-driven one was less "thick" - the other good thing about writing your own session handler is that you can call other things on session start or close, etc. I'd suggest using that (a mysql one) I'll try giving you mine though, also:
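The handler mike offered is not reproduced here. As a stand-in, a single-table session store of the kind he describes might look roughly like the following. This is only a minimal sketch in Python, not his PHP code: it uses sqlite3 in place of MySQL so that it runs anywhere, and the class name, table layout and 30 minute lifetime are all assumptions made for illustration.

```python
import json
import secrets
import sqlite3
import time


class SessionStore:
    """Single-table session store that every realserver can query (sketch)."""

    def __init__(self, conn, lifetime=1800):
        self.conn = conn
        self.lifetime = lifetime  # seconds of inactivity before a session expires
        conn.execute(
            "CREATE TABLE IF NOT EXISTS sessions ("
            "sid TEXT PRIMARY KEY, data TEXT NOT NULL, touched REAL NOT NULL)"
        )

    def create(self):
        """Start an empty session and return its id (sent to the client as a cookie)."""
        sid = secrets.token_hex(16)
        self.save(sid, {})
        return sid

    def save(self, sid, data, now=None):
        """Store the session data, replacing any earlier copy."""
        now = time.time() if now is None else now
        self.conn.execute(
            "INSERT OR REPLACE INTO sessions (sid, data, touched) VALUES (?, ?, ?)",
            (sid, json.dumps(data), now),
        )
        self.conn.commit()

    def load(self, sid, now=None):
        """Return the session data, or None if the id is unknown or expired."""
        now = time.time() if now is None else now
        row = self.conn.execute(
            "SELECT data, touched FROM sessions WHERE sid = ?", (sid,)
        ).fetchone()
        if row is None or now - row[1] > self.lifetime:
            return None
        return json.loads(row[0])
```

Because the client holds only the session id, any realserver that can reach the table can serve the next request, which is why this approach does not depend on LVS persistence.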

IIS session management: how it works

Alex Kramarov alex-lvs (at) incredimail (dot) com 22 Nov 2002

Microsoft's COM model is similar to the CORBA model. Generally, you have components, i.e. code that can be used from other applications. The concept is similar to using shared libraries, but still a little different. You can create an instance of the component and use it in a simple fashion with asp (the IIS scripting language). Every time a new user calls for an asp file, a new session component is created, and can be accessed through asp scripting. This component can store data like a perl hash (session(valuename) = value). The data is stored in the memory space of the IIS process. Each session has a unique identifier that is remembered along with the data, and this identifier is maintained during the session by a cookie. On subsequent access by a client, the server looks up the data stored for this session, and makes it available as members of the session component.

A simple sample, access authorization (this code goes at the top of the pages you would like to secure):

    "" Then
        'check some conditions.
        'if check successful
        Session("UserID") = "approved"
        'else do not allow to proceed showing the page
        response.write "authorization failed"
        response.end
    End If

There are components available that will replace the default session component with one that will store the session data in a shared db, and only minimal modifications to the code are required, if any. Generally the session component is an implicit component the server provides. You could use your own component that does the same thing, and the only thing you would have to do is to initialize an instance and give it the unique identifier of the user, like this (purely fictional code)

When writing code for MS servers, one almost never deals with files, since the interface provided by MS for that purpose is very cumbersome, and on Windows, file locking problems are a very severe issue.
With Unix, you can write a cgi that manipulates files, reads and writes to and from a dozen files while running. You would be crazy to do that on windows. All data you want to store can be more or less conveniently stored using the session object if it is per-user data, or in the application object (more or less the same idea), which retains data through the life of the application (from the start to the stop of the http service). All data is in memory, hence, it is fast. Long term data is always stored in databases.

I believe that the difference in perspective comes from the fact that in unix, you can have a bare bones system because of your security requirements, and then you want to write a small script that uses and stores some data, so you open some files and do that. On windows, you CANNOT HAVE a bare bones system. From the initial install, you already have some file based db structure (comparable to db3), and all the database connectivity libraries, which you cannot remove, unless you are a windows guru and you start deleting system libraries one by one. (You would be crazy to do that, since there is absolutely no documentation of what each of the thousands of library files, which are installed by default, does.) All these libraries are a security risk, as is proven by all the buffer overflow vulnerabilities. But since windows developers regard DB connectivity as a standard component of their OS, they use it. (This is a marketing strategy of MS, to sell their bloated MS sql server.)

Why doesn't the application keep its own state?

IIS assigns a unique identifier to be used in session management the first time a user accesses an asp file (even if you need it for only 1% of the pages on your site). This is completely transparent to the developer, and saves time in the development process. Writing apps where state is conserved manually (without sessions) is not as easy as it looks, and the mechanism provided by IIS is certainly convenient.
Coding using "the Microsoft Way" for IIS took me 4 hours to learn, going through microsoft developer network articles. It is simple if you don't stray from the dictated path, but the second you do stray, it's hard to push something not designed by MS into their framework, and people are afraid of that. IIS 6 includes the option to make the server session object store data in an odbc database, but it is still not released. 3rd party components that should do the task are commercially available, like the frameWerks session component, and it is pretty cheap, at $149. I also believe that not a lot of sites need and use several realservers to serve a single logical site, so this was never such an issue, till recently. Now microsoft has woken up to the fact and is writing their own implementation for IIS 6, which will undoubtedly require the use of the MS sql server.

Mark Weaver mark (at) npsl (dot) co (dot) uk 23 Nov 2002

The default ASP (= MS web scripting) sessions simply use a cookie and store session state in server memory. The session is a dictionary object, and you just store a bunch of key-value pairs. Since the standard session object stores data in memory, it is not a lot of use for /robust/ load balancing. .NET adds a component that stores session in a database. Such a component is pretty trivial to write, we have had one for a number of years. Storing on disk is a good option when there is no database, but since most of the sites that we have are pretty dynamic (i.e. most pages are generated from database calls), storing the session state in the DB is a good bet. I can probably release the source code for this if anyone is interested.

Maintaining state with persistence

You can setup persistence several ways

- Use port 0 (i.e. all ports) with the persistence feature (read the ipvsadm man page). All ports are persistent. A client, after connecting to a particular realserver for one service, will (within the timeout period) be connected to the same realserver for all services. This will allow intruders to forward packets for any ports to the realservers, so you will need to write filter rules that block all ports but the ones that you want serviced by the realservers.

- In practice only 1 or a small number of ports (e.g. http/https, smtp/pop) will ever be used in a persistent manner and you can set persistence for a particular port (e.g. https) while other services are not persistent. The client will (within the timeout period) be sent to the same realserver for the persistent port, while being serviced by all realservers for the other LVS'ed ports.

- For sophisticated setups (e.g. shopping carts where the client who has been filling his cart on :http, needs to give his credit card details on :https), you should use persistent fwmarks with the . fwmarks and persistent fwmarks scale well with large numbers of services and (once you understand fwmarks) make it easy to setup shopping cart LVSs.

Shopping cart applications have to maintain state. Usually state is maintained by sending the customer a cookie. These are intrusive and a security risk (I turn them off on my browser). If you're going to use cookies in your application, you should at least test that the client accepts them, otherwise the client will not be able to accumulate objects in their shopping cart. We encourage you to rewrite the application (see ) so that state is maintained on the realserver(s) in a way that is available to all realservers (e.g. on a replicated database) (see ). You have the time of the persistence timeout to make this information available to the other realservers.
Having told you that you can setup a shopping cart with persistent fwmarks, please read .

One of the problems with persistence is removing a service (e.g. you just want it removed or the realserver has crashed). Even after the weight has been set in ipvsadm to 0, the service is still in the ipvsadm table and will stay there till the end of the client's timeout period. If the realserver has crashed, the client's connection will hang. You would like to have preserved the client's state information in your database, and give the client a new realserver. This problem has now been addressed with the LVS sysctl (see and ). For older material on the topic see .

The following examples use telnet and http. You aren't likely to want to make these persistent in practice. They are used because the clients are simple to use in tests. You'll probably only want to make ftp or https persistent, but not much else.

Setup persistence on VIP, default persistence timeout (default timeout value varies a bit with ipvs versions, but it's about 10mins), port not specified (all ports made persistent). Telnet'ing to the VIP from one machine, you will always connect to the same realserver.

      -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:0 wlc persistent 360
      -> RS2.mack.net:0         Route   1      0          0
      -> RS1.mack.net:0         Route   1      0          0

Here's setting up with a specified persistence timeout (here 600secs), setting persistence granularity (the -M option) to a netmask of /24, and round robin scheduling. If you make the timeout > 15mins (900 sec), you'll also need to change the .

      -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:0 rr persistent 600 mask 255.255.255.0
      -> RS2.mack.net:0         Route   1      0          0
      -> RS1.mack.net:0         Route   1      0          0

Note: only a timeout value can follow "-p".
Thus you can have any of but you can't have

You can setup persistence by port

      -> RemoteAddress:Port     Forward Weight ActiveConn InActConn
    TCP  lvs.mack.net:http wlc persistent 360
      -> RS2.mack.net:http      Route   1      0          0
      -> RS1.mack.net:http      Route   1      0          0
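To make the effect of the persistence granularity (the -M netmask) concrete, here is a small Python model of how a netmask groups clients. It is a sketch only and not the ip_vs code: the function names are invented for illustration, the dictionary stands in for the director's persistence table, and timeouts are ignored.

```python
import ipaddress


def persistence_key(client_ip, netmask="255.255.255.255"):
    """Return the network address that clients are grouped under."""
    net = ipaddress.ip_network(f"{client_ip}/{netmask}", strict=False)
    return str(net.network_address)


def pick_realserver(client_ip, netmask, realservers, table):
    """Send every client that shares a persistence key to one realserver.

    `table` maps persistence key -> realserver (hypothetical stand-in for
    the director's persistence entries). New keys are handed out round robin.
    """
    key = persistence_key(client_ip, netmask)
    if key not in table:
        table[key] = realservers[len(table) % len(realservers)]
    return table[key]


if __name__ == "__main__":
    table = {}
    for ip in ("10.1.1.5", "10.1.1.77", "10.1.2.9"):
        print(ip, "->", pick_realserver(ip, "255.255.255.0", ["RS1", "RS2"], table))
```

With a /24 mask the first two clients share a key and land on the same realserver. This is the behaviour you want for clients behind a small proxy farm, and it is also the reason an office full of clients behind one network all arrive at a single realserver.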

How others maintain state

We spend a lot of time telling people not to use cookies to maintain state. I thought I should do a reality check, to see what people are using that's working.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 12 May 2004

I've been involved with a mid sized ecommerce company for about 4 years and we've had very few problems using cookies for state (stored in a single backend db.) It's fast and easy. If the odd customer doesn't have cookies turned on then they are no great loss. Putting the sessionid in the URL i.e. GET is ugly and slightly less secure. I guess you could POST it on every page but would that be slower than a cookie? (I think so)

Joe: Having session data in the URL allows the user (or man in the middle) to manipulate it. This is not secure. Joe 11 May 2004 14:23:02 -0400

With the security hazards of cookies, I have them turned off. Usually the application (e.g. mailman) runs a cookie test on me and tells me to turn on cookies. I've been trying to register for the Ottawa Linux Symposium for about 2 months, and they've been having trouble with their registration software. Finally they asked me if I had cookies turned on. I said "of course not". They don't do a cookie test and there's no notice on the web page that I need cookies turned on.

Jacob Coby jcoby (at) listingbook (dot) com 11 May 2004

We aren't an ecommerce site, but we do require some sort of login/authentication to use our site. We haven't worried about cookies either, at least, until the past 4 months or so. It seems that the latest and greatest anti-virus software and pop-up blockers disable cookies (among other things that they have no business doing). Rumor has it that new versions of IE will disable cookies by default. A good portion of IE6 users won't accept cookies unless your site publishes a P3P of some sort.

We get approx. 4000 people/day (9000 logins/day), and we were getting up to 10 cookie related problems a day to the helpdesk. I'd estimate that there were probably 2-10x more who had problems, but who never reported them. In 3 years of requiring cookies, we had only one nasty email about our requirement to have cookies enabled.

At any rate, we now use a different system to authenticate a user. We pass in a sid per page, and use cookies, IP address, browser ident, and other metrics to authenticate the user. Sensitive areas of the site (such as those requiring a credit card) also use SSL. All session data is stored in a single database, as a serialized PHP array. There can be up to 1/2 MB of session data, and part of the session data persists between logins, so it doesn't make sense for us to put session data in the cookie or to store it on the webservers. The sid + (cookie, IP, browser ident) is only used to authenticate the user. The session data itself stores all sorts of things, such as temporary user prefs, some of the things the user looked at, a bit of caching to cut down on subsequent db queries, things like that. Only part of that session data persists between logins, but it has to be stored somewhere between pages.

Our situation is a little different from your average e-commerce store. We can't just identify a shopping cart + items by a unique sid.
We need the session data to act as a ram drive of sorts for data that needs to be quickly accessed, multiple times per page. All of that temp data is stored in an array, and serialize()'d. PHP's serialize is pretty compact, but it still expands out. For example, a simple int value of the login timestamp looks like: The 1/2mb is the MAXIMUM we allow to store. Typical is more in the 3-10kb range. Average size over 49627 rows of session data is 3134b right now.
Joe: What's going to happen to your session data when IE6 disallows cookies?
It'll still work. The sid cookie is only one of several hints we can use to authenticate the user. Malcolm Turnbull malcolm (at) loadbalancer (dot) org 17 May 2004
If your page is formatted correctly with a PICS-Label IE6 will accept the cookie by default.
For reference info about PICS Labels see PICS Label Distribution Label Syntax and Communications Protocols v1.1 (http://www.w3.org/TR/REC-PICS-labels-961031). For a sample HOWTO see How To Label Your Pages with PICS Meta Tag (http://256.com/gray/docs/pics/). These labels appear to be part of the so far futile effort to filter webpages for children. Here's a webpage by the people fronting this effort: Information for webmasters. They want you to rate your own website (people with obnoxious content are always honest, right?). If the ICRA expect this approach to succeed, then why do we have spam? The politicians are no help of course. One of them has a bill to stop google from inserting advertisements into their gmail service, since this requires reading people's (private) e-mail. This bill would also stop programs from filtering web content. We have a long way to go.

POST is marginally slower than GET if you look at the HTTP spec. There is an additional request header per variable. GET is only *very slightly* less secure. POST and cookies are of equal security levels, and they're all trivial to send using command line tools.

Joe

Until recently I'd thought that putting the session data into the URL (rather than a cookie) was the way to go, till someone pointed out that the user could manipulate the URL. In that case, could the session id be put in a long enough string in the URL such that any attempt to alter it would result in an invalid string?

Jacob

There is an upper limit on the GET string IE will send. Somewhere around 1 or 2 kb. When it hits the limit, if you use javascript to submit the form, it'll error out. If you just use a plain HTML form submit, it'll just not work. For sites that will never hit that limit, passing in the session data would work. However, there should still be checks to authenticate the user, mostly to prevent problems when they share links with friends.

One solution to the user modifying the string is to pass in a public key:

Then you can authenticate that the session data hasn't been modified by checking your computed Kpu against the Kpu that was passed in from the GET/POST. If they match, the session data probably hasn't been modified. If they don't, there is a very good chance the data was either corrupted in transit or corrupted by the user. It's only as strong as the Kpr used and whatever the collision probability of the md5 algorithm is.
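Jacob's code for this check is not included above. One reading of his description is that Kpu is a hash computed from a server-side secret (Kpr) and the session data, sent along with the data and recomputed on each request. The Python sketch below follows that reading but is not his implementation: it swaps the plain md5 he mentions for HMAC-SHA256 from the standard library, and the key value and function names are made up for illustration.

```python
import hashlib
import hmac

SECRET_KEY = b"change-me"  # hypothetical server-side secret (the Kpr); never sent to the client


def sign(session_data: str, key: bytes = SECRET_KEY) -> str:
    """Return the check value (the Kpu) to pass along with the session data."""
    return hmac.new(key, session_data.encode("utf-8"), hashlib.sha256).hexdigest()


def verify(session_data: str, check: str, key: bytes = SECRET_KEY) -> bool:
    """True if the data still matches the check value that arrived with it."""
    expected = sign(session_data, key).encode("ascii")
    # compare_digest avoids leaking how many leading characters matched
    return hmac.compare_digest(expected, check.encode("utf-8"))


if __name__ == "__main__":
    data = "uid=42&cart=3"
    tag = sign(data)
    print(verify(data, tag))             # data as issued
    print(verify("uid=1&cart=3", tag))   # data edited by the user
```

The same check works whether the data and its check value travel in the URL, a POST body or a cookie, which is Horms's point below: none of the three is tamper-proof on its own.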

Horms 21 May 2004

Is there anything to stop a cookie or post from being manipulated by an end user? Sure, it might be marginally more difficult as you would probably need some (command line) tool, rather than just changing the URL directly in your given browser. But it should still be trivial. I rarely write web pages. But if I was to do something like this I would make the string in the URL a hash (md5, sha1 or something like that), which should make guessing a valid string rather difficult. I would do the same in a cookie or a post. I would imagine something like this is pretty common practice.

nick garratt nick-lvs (at) wordwork (dot) co (dot) za 12 May 2004

This discussion happens every so often on the list and, as always, I feel the need to mention the msession session service which we have been using very reliably for years from Mohawk Software (http://www.mohawksoft.com/devel/msession.html). It's light-weight, fast and depending on the scripting language you use (we're using php4) it is very easy to implement.

Matthew Smart msmart (at) smartsoftwareinc (dot) com 25 Jun 2007 The problem I have is that all clients from a given location get directed to the same realserver. Since the majority of clients are located in the same office, we are not getting a good load balance. I disabled persistence, moved sessions into mysql, and am relying on mysql's replication to ensure that all servers have the session data. We have state info stored server side in a PHP session. Client side there is a cookie that holds a session id only (no state). We are working on ways to replicate the server side session info across N real servers. I think relying on mysql will work in the short term. Just have to test it under load to see how it behaves. I can see an issue if mysql replication gets behind on a server, but that is not an LVS issue...

LVS: Squid Realservers (poor man's L7 switch)

One of the first uses for LVS was to increase throughput of webcaches. A 550MHz PIII director can handle 120Mbps throughput. A scheduler (-dh = destination hash) specially designed for webcaches is described in the section on . It is in LVS and is derived from code posted to the LVS mailing list by Thomas Proell (about Oct 2000).

This section was written by Andreas J. Koenig andreas (dot) koenig (at) anima (dot) de and was posted to the mailing list.

An often lamented shortcoming of LVS clusters is that the realservers have to be configured to work identically. Thus if you want to build up a service with many servers that need to be configured differently for some reason, you cannot take advantage of the powerful LVS. The following describes an LVS topology where not all servers in the pool of available servers are configured identically and where loadbalancing is content-based. The goal is achieved by combining the features of Squid and LVS. The workhorses are running Apache, but any HTTP server would do.

Terminology

Before we start we need to introduce a bit of Squid terminology. A redirector (http://www.squid-cache.org/Doc/FAQ/FAQ-15.html) is a director that examines the URL and request method of an HTTP request and is able to change the URL in any way it needs. An accelerator (http://www.squid-cache.org/Doc/FAQ/FAQ-20.html) plays the role of a buffer and cache. The accelerator handles a relatively large number of slow connections to the clients on the internet with a relatively small amount of memory. It passes requests through to any number of back-end servers. It can be configured to cache the results of the back-end servers according to the HTTP headers.

Preview

In the following example installation, we will realize this configuration (real IP addresses anonymized):

Note that a squid and a webserver can coexist in a single box, that's why we have put Squid2 and Webserver7 into a single machine. Note also that squids can cache webserver's output and thus reduce the work for them. We dedicate 24 GB disk to caching in Squid1 and 6 GB disk in Squid2. And finally note that several squids can exchange digest information about cached data if they want. We haven't yet configured for this.

Strictly speaking, a single squid can take the role of an LVS director, but only for HTTP. It's slower, but it works. By accessing one of the squids in our setup directly, this can be easily demonstrated.

Let's start assembling

I'd suggest that the first thing to do is to setup the four apaches on Webserver1..4. These servers are the workhorses for the whole cluster. They are not what LVS terminology calls realservers though. The realservers according to LVS are the Squids. We configure the apaches completely standard. The only deviation from a standard installation here is that we specify port 81 in the httpd.conf. Everything else is the default configuration file that comes with apache. In the choice of the port we are, of course, free to choose any port we like. It's an old habit of mine to select 81 if a squid is around to act as accelerator.

We finish this round of assembling with tests that only try to access Webserver1..4 on port 81 directly. For later testing, I recommend activating the printenv CGI program that comes with Apache:

This program shows us on which server the script is running (SERVER_ADDR) and which server appears as the requesting site (REMOTE_ADDR).

One squid

Next we should configure one Squid box. The second one will mostly be a replication of the first, so let's first nail that first one down. When we compile the squid 2.3-STABLE4, we already need to decide about compilation options. Personally I like the features associated with this configuration:

We can build and install squid with these settings. But before we start squid, we must go through a 2700 line configuration file and set lots of options. The following is a collection of diffs between the squid.conf.default and my squid.conf with comments in between.

Yes, we want this squid on port 80 because from outside it looks like a normal HTTP server.

In the demo installation I turned ICP off, but I'll turn it on again later. ICP is the protocol that the squids can use to exchange sibling information about what they have on their disks.

This is the memory reserved for holding cache data. We have 1 GB total physical memory and 24 GB disk cache. To manage the disk cache, squid needs about 150 MB of memory (estimate 6 MB per GB for an average object size of 13kB). Once you're running, you can use squid's statistics to find out *your* average object size. I usually leave 1/6 of the memory for the operating system, but at least 100 MB.

Please refer to squid's docs for these values. You do not need bigger disks, you need many disks to speed up squid. Join the squid mailing list to find out about the efficiency of filesystem tuning like "noatime" or Reiser FS.

This is the meat of our usage of squid. This program can be as simple as you want or as powerful as you want. It can be implemented in any language and it will be run within a pool of daemons. My program is written in perl and looks something like the following:

    #!/usr/bin/perl
    $| = 1;    # unbuffered output: squid waits for one reply line per request
    # squid sends one request per line: URL client-ip/fqdn ident method
    while (<STDIN>) {
        chomp;
        my ($url, $host, $ident, $method) = split;
        # backends named in the query string (h=6,7), else the default set
        my @redir = $url =~ /\bh=([\d,]+);?/
            ? split(/,/, $1)
            : (6, 7, 8, 9);    # last components of our IP numbers
        my $redir = $redir[int rand scalar @redir];
        $url =~ s/PLACEHOLDER:81/10.0.0.$redir\:81/i;
        print STDOUT "$url\n";
    }

This is ideal for testing, because it allows me to request a single backend server or a set of backend servers to choose from via the CGI querystring. A request like http://10.0.0.62/cgi-bin/printenv?h=6 will then be served by backend apache 10.0.0.6.

The more complex the redirector program is, the more processes should be allocated to run it.

For all of the above changes, please refer to the squid.conf.default.

As we are redirecting everything through the redirector, we can fill in anything we want. No real hostname, no real port is needed. The redirector program will have to know what we chose here.

If we want ICP working (and we said we would like to get it working), we need this turned on.

We're done with our first squid, we can start it and test it. If you send a request to this squid, one of the backend servers will answer according to the redirect policy of the redirector program.

Basically, at this point in time we have a fully working content based redirector. As already mentioned, we do not really need LVS to accomplish this. But the downside of this approach is:

- we are comparatively slow: squid is not famous for speed.
- we do not scale well: if the bottleneck is the squid, we want LVS to scale up.
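Since the redirector "can be implemented in any language", the same selection logic can be sketched in Python, which also makes it easy to test the URL rewriting without a running squid. This is an illustrative sketch and not Andreas's program: the function names are invented, the input format (URL first, then client, ident and method) is taken from the fields the perl code splits out, and unlike the perl version it ignores empty items in the h= list.

```python
import random
import re
import sys

DEFAULT_BACKENDS = ["6", "7", "8", "9"]  # last components of the backend IP numbers
H_PARAM = re.compile(r"\bh=([\d,]+);?")
PLACEHOLDER = re.compile(r"PLACEHOLDER:81", re.IGNORECASE)


def rewrite(line, rng=random):
    """Rewrite one redirector input line and return the new URL."""
    url = line.split()[0]
    match = H_PARAM.search(url)
    # backends requested in the query string; drop empty items such as in "h=6,"
    choices = [c for c in match.group(1).split(",") if c] if match else []
    backend = rng.choice(choices or DEFAULT_BACKENDS)
    return PLACEHOLDER.sub(f"10.0.0.{backend}:81", url, count=1)


def main():
    """Answer squid one line at a time on stdin/stdout."""
    for line in sys.stdin:
        if line.strip():
            # flush so that squid receives each answer immediately
            print(rewrite(line), flush=True)
```

To use it as a redirector, call main() at the bottom of the script. Persistence or health checks of the backends, if wanted, would have to be added inside rewrite().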

Another squid

So the next step in our demo is to build another squid. This is trivial given that we already have one. We just copy the whole configuration and adjust a few parameters if there are any differences in the hardware.

Combining pieces with LVS The rest of the story is to read the appropriate docs for LVS. I have used Horms's Ultra Monkey docs and there's nothing to be added for this kind of setup. Keep in mind that only the squids are to be known by the LVS box. They are the "realservers" in LVS terminology. The apache back end servers are only known to the squids' redirector program.

Problems

It has been said that LVS is fast and squid is slow, so people believe they must implement a level 7 switch in LVS to have it faster. This remains to be proved.

Squid is really slow compared to some of the HTTP servers that are tuned for speed. If you're serving static content with a kernel HTTP daemon, you definitely do not want to lose the speed by running it through a squid.

If you want persistent connections, you need to implement them in your redirector. If you want to take dead servers out of the pool, you must implement it in your redirector. If you have a complicated redirector, you need more of them and thus need more resources.

In the above setup, ldirectord monitors just the two squids. A failure of one of the apaches might go by unnoticed, so you need to do something about this.

If you do not have much cacheable data (e.g. SSL, things that need to expire immediately, or a high fraction of POST requests), the squid seems like a waste of resources. I'd say, in that case you just give it less disk space and memory.

Sites that prove unviewable through Squid are a real problem (Joe Cooper reports there is a stock ticker that doesn't work through squid). If you have contents that cannot be served through a squid, you're in big trouble--and as it seems, on your own.

LVS: Performance and Kernel Tuning

We are now (2006) in an era where the CPU is no longer rate determining in an LVS director.

Ratz 20 Feb 2006

The processor never is an issue regarding LVS unless your NIC is so badly designed that the CPU would need to take over the packet processing ;)

Performance Articles

(a non-LVS article on Configuring large scale unix clusters by Dan Kegel.)

The article performance data for a single realserver LVS shows how to test the network links with netpipe and how to determine the effects of LVS on latency and throughput. Note that the effect of the LVS code on system performance is the difference between the performance with the director box just forwarding packets as a router to the realservers and the director box acting as an LVS director.

If you find that the director is causing the throughput to decrease, you first have to determine if the slowdown is due to the hardware/OS or due to ip_vs. It is possible that the PCI bus or your network cards cannot handle the throughput when the box is just forwarding packets. Several articles have been published about the performance of LVS by people who did not differentiate the effects of the ip_vs code from slowdown caused by the hardware that the ip_vs code was running on.

Pat O'Rourke has done some performance tests on high end machines.

Padraig Brady padraig (at) antefacto (dot) com 29 May 2002, measured 60usec latency for normal forwarding and 80usec latency for forwarding as a director on his setup.

Ted Pavlic was running 4 realservers with 1016 (4 x 254) RIPs way back (1999?). Jeremy Kusnet (1 Oct 2002) is running a setup with 53 VIPs, 8 services/VIP, 6 realservers, (53*8*6 = 2688) RIPs.

unknown:

Has anyone on this list use LVS to load balance firewalls? If so, what kind of limitations did you see with regard to Mbps and kpps?

Peter Mueller pmueller (at) sidestep (dot) com 30 Dec 2004 Yes, see the list archives. The limit is in the PCI bus. If you are pushing the limit of the LVS PCI bus then it won't help to use LVS. Anyway, I have not gotten more than 100kpps unless using NAPI. Some people report getting up to 1.2mpps (!) a few years ago with intel gigabits, 64-bit 66mhz individual cards and buses. These figures are with 64 byte packets. There was a post on quagga archive recently about this, check there for more details.

Did you run into any issues with stateful connections and how many simultaneous connections did it handle?

I'm sure if you run iptables on a router it will drop your numbers, probably by a lot.

Estimating throughput: Rule of Thumb

People on the LVS mailing list have found that a 400MHz director will saturate a 100Mbps ethernet link. Somewhere else I read that you need 1Hz of CPU for every bps of I/O. So 100Mbps ethernet (two directions) will need a 200MHz machine. In the old days, I measured 50Mbps throughput with a 75MHz director, indicating that you need a 150MHz CPU to saturate a 100Mbps link.

Estimating throughput: 100Mbps FE is really 8000packets/sec ethernet

If you are just setting up an LVS to see if you can set one up, then you don't care about performance. When you want to put one on-line for other people to use, you'll want to know the expected performance. On the assumption that you have tuned/tweaked your farm of realservers and you know that they are capable of delivering data to clients at a total rate of bits/sec or packets/sec, you need to design a director capable of routing this number of requests and replies for the clients.

First some background information on networking hardware. At least for Linux (the OS I've measured, see performance data for single realserver LVS), a network rated at 100Mbps is not 100Mbps all the time. It's only 100Mbps when continuously carrying packets of mtu size (1500bytes). A packet with 1 bit of data takes as long to transmit as a full mtu sized packet. If your packets are <ack>s, or 1 character packets from your telnet editing session or requests for http pages and images, you'll barely reach 1Mbps on the same network. On the performance page, you'll notice that the hit rate increases as the size of the hit targets (in bytes) gets smaller. Hit rate is not necessarily a good indicator of network throughput.

A possible explanation for the ethernet rate being a function only of the packet rate is in an article on Gigabit Ethernet Jumbo Frames (look for the section "Local performance issues") (also see jumbo frames). Each packet requires an interrupt and the per-packet processing overhead sets the limit for TCP performance. This has resulted in a push for larger (jumbo) packets with Gigabit ethernet (e.g. 64kB, a whole NFS frame). The original problem was that the MTU size chosen for 100Mbps ethernet (1500bytes) is the same as for 10Mbps ethernet. This was so that packets traversing the different types of ethernet would not have to be fragmented and defragmented.
However the side effect was that the packets are too small for 100Mbps ethernet and you can double your ethernet throughput by doubling your packet size. Tcpip can't use the full 100Mbps of 100Mbps network hardware, as most packets are paired (data, ack; request, ack). A link carrying full mtu data packets and their corresponding <ack>s, will presumably be only carrying 50Mbps. A better measure of network capacity is the packet throughput. An estimate of the packet throughput comes from the network capacity (100Mbps)/mtu size (1500bytes x 8bits/byte) = 8333 packets/sec. Thinking of a network as 100Mbps rather than ca.8000packets/sec is a triumph of marketing. When offered the choice, everyone will buy network hardware rated at 100Mbps even though this capacity can't be used with your protocols, over another network which will run continuously at 8000packets/sec for all protocols. Only for applications like ftp will near full network capacity be reached (then you'll be only running at 50% of the rated capacity as half the packets are <ack>s). I notice (Jun 2005) that switch specifications e.g. Netgear FS516 (http://www.netgear.com/products/details/FS516.php) are quoted in packets/sec rather than bytes/sec. A netpipe test (realservers are 75MHz pentiums and can't saturate the 100Mbps network) shows that some packets must be "small". Julian's show_traffic script shows that for small packets (<128bytes), the throughput is constant at 1200packets/sec. As packets get bigger (up to mtu size), the packet throughput decreases to 700packets/sec, and then increases to 2600packets/sec for large packets. The constant throughput in packets/sec is a first order approximation of tcpip network throughput and is the best information we have to predict director performance. 
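The packet-rate estimate above can be written out as a few lines of arithmetic. This sketch uses the text's own first-order model, in which every packet occupies one full-MTU slot on the wire no matter how little data it carries; the function names are illustrative only.

```python
def packets_per_sec(link_bps, packet_bytes):
    """Packets/sec a link carries if every packet slot is packet_bytes long."""
    return link_bps / (packet_bytes * 8)


def useful_mbps(pps, data_fraction, data_bytes):
    """Payload rate in Mbps when only data_fraction of the slots carry data_bytes each."""
    return pps * data_fraction * data_bytes * 8 / 1e6


if __name__ == "__main__":
    pps = packets_per_sec(100e6, 1500)
    print(f"packet rate on 100Mbps with 1500 byte slots: {pps:.0f} packets/sec")
    # Half the packets are acks carrying no data (the ftp case in the text).
    print(f"full-MTU data with paired acks: {useful_mbps(pps, 0.5, 1500):.0f} Mbps")
    # Every packet carries a single byte (the telnet case in the text).
    print(f"1 byte per packet: {useful_mbps(pps, 1.0, 1):.3f} Mbps")
```

Under this model the link is equally "full" in all three cases; only the number of useful bits per packet changes.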
In the case where a client is in an exchange of small packets (<mtu size) with a realserver in a LVS-DR LVS, each of the links (client-director, director-realserver, realserver-client) would be saturated with packets, although the bps rate would be low. This is the typical case for non-persistent http when 7 packets are required for the setup and termination of the connection, 2 packets are required for data passing (eg the request GET /index.html and the reply) and an <ack> for each of these. Thus only 1 out of 11 packets is likely to be near mtu size, and throughput will be 10% of the rated bps throughput even though the network is saturated. The first thing to determine then is the rate at which the realservers are generating/receiving packets. If the realservers are network limited, i.e. the realservers are returning data in memory cache (eg a disk-less squid) and have 100Mbps connections, then each realserver will saturate a 100Mbps link. If the service on the realserver requires disk or CPU access, then each realserver will be using proportionately less of the network. If the realserver is generating images on demand (and hence is compute bound) then it may be using very little of the network and the director can be handling packets for another realserver. The forwarding method affects packet throughput. With LVS-NAT all packets go through the director in both directions. As well the LVS-NAT director has to rewrite incoming and reply packets for each realserver. This is a compute intensive process (but less so for 2.4 LVS-NAT). In a LVS-DR or LVS-Tun LVS, the incoming packets are just forwarded (requiring little intervention by the director's CPU) and replies from the realservers return to the client directly by a separate path (via the realserver's default gw) and aren't seen by the director. 
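The packet accounting for a non-persistent http connection can be tallied as follows. The packet counts are the ones given in the text; the assumption that every packet occupies an equal slot on the link is the text's first-order model, and the names are illustrative.

```python
# Packets in one non-persistent http connection, as counted in the text.
SETUP_AND_TEARDOWN = 7   # connection setup and termination
DATA = 2                 # the request (GET /index.html) and the reply
ACKS = 2                 # one <ack> for each data packet
NEAR_MTU = 1             # only the reply is likely to be near mtu size


def total_packets():
    """Total packets exchanged for one connection."""
    return SETUP_AND_TEARDOWN + DATA + ACKS


def link_efficiency():
    """Fraction of packet slots that carry a near-mtu payload."""
    return NEAR_MTU / total_packets()


if __name__ == "__main__":
    print(f"{total_packets()} packets per connection")
    print(f"about {link_efficiency():.0%} of the rated bps throughput")
```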
In a network limited LVS, for the same hardware, because there are separate paths for incoming and returning packets with LVS-DR and LVS-Tun, the maximum (packet) throughput is twice that of LVS-NAT. Because of the rewriting of packets in LVS-NAT, the load average on a LVS-NAT director will be higher than for a LVS-DR or LVS-Tun director managing twice the number of packets. In a network bound situation, a single realserver will saturate a director of similar hardware. This is a relatively unusual case for the LVSs deployed so far. However it's the situation where replies are from data in the memory cache on the realservers (eg squids). With a LVS-DR LVS, the realservers have their own connection to the internet, the rate limiting step is the NIC on the director which accepts packets (mostly <ack>s) from the clients. The incoming network is saturated for packets but is only carrying low bps traffic, while the realservers are sending full mtu sized packets out their default gw (presumably the full 100Mbps). The information needed to design your director then is the number of packets/sec your realserver farm is delivering. The director doesn't know what's in the packets (being an L4 switch) and doesn't care how big they are (1 byte of payload or full mtu size). If the realservers are network limited, then the director will need the same CPU and network capacity as the total of your realservers. If the realservers are not network limited, then the director will need correspondingly less capacity. If you have 7 network limited realservers with 100Mbps NICs, then they'll be generating an average of 7x8000 = 56k packets/sec. Assuming the packets arrive randomly the standard deviation for 1 second's worth of packets is +/- sqrt(56000) = ca. 240 (ie it's small compared to the rate of arrival of packets). You should be able to connect these realservers to a 1Gbps NIC via a switch, without saturating your outward link. 
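The sizing arithmetic for the 7 realserver example can be reproduced as below. Treating "packets arrive randomly" as Poisson arrivals (so that the standard deviation is the square root of the mean count) is an assumption made for this sketch, and the function names are illustrative.

```python
import math


def aggregate_packet_rate(realservers, pps_each):
    """Total packets/sec generated by a farm of network limited realservers."""
    return realservers * pps_each


def poisson_std(mean_count):
    """Standard deviation of a Poisson-distributed count with the given mean."""
    return math.sqrt(mean_count)


if __name__ == "__main__":
    rate = aggregate_packet_rate(7, 8000)
    std = poisson_std(rate)
    print(f"aggregate rate: {rate} packets/sec")
    print(f"std dev over 1 sec: {std:.0f} packets ({std / rate:.2%} of the mean)")
    # 7 realservers x 100Mbps is below the 1000Mbps of a GigE NIC.
    print(f"aggregate bandwidth: {7 * 100} Mbps")
```

The fluctuation is well under 1% of the mean, so the director can be sized on the average rate.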
If you are connected to the outside world by a slow connection (eg T1 line), then no matter how many 8000packet/sec realservers you have, you are only going to get 1.5Mbps throughput (or half that, since half the packets are <ack>s). Note: The carrying capacity of 100Mbps network of 8000packets/sec may only apply to tcpip exchanges. My 100Mbps network will carry 10,000 SYN packets/sec when tested with Julian's testlvs program. Wayne wayne (at) compute-aid (dot) com 03 Apr 2001

The performance page calculates the ack as 50% or so of the total packets. I think that might not be accurate, since in twisted-pair, full duplex mode, ack and request are travelling on two different pairs. Even in half duplex mode, the packets for the two directions are transmitted over two pairs, one for send, one for receive; only the card and driver determine whether they are handled in full duplex or half duplex mode. So the packets would be 8000 packets/sec all the time for the full duplex cards.

Joe: presumably for any particular connection, the various packets have to be sent in order and whether they are travelling over one or two pairs of cables would not matter. However multiple connections may be able to make use of both pairs of wires. Unfortunately we only can approximately predict the performance of an LVS director. Still the best estimates come from comparing with a similar machine. The performance page shows that a 133MHz pentium director can handle 50Mbps throughput. With 2.2 kernel LVS-NAT, the load average on the director is unusably high, but with LVS-DR, the director has a low load average. Statements on the website indicate that a 300MHz pentium LVS-DR director running a 2.2.x kernel can handle the traffic generated by a 100Mbps link to the clients. (A 550MHz PIII can direct 120Mbps.) Other statements indicate that single CPU high end (800MHz) directors cannot handle 1Gbps networks. Presumably multiple directors or SMP directors will be needed for Gbps networks. (also see the section on .) From: Jeffrey A Schoolcraft dream (at) dr3amscap3 (dot) com 7 Feb 2001

I'm curious if there are any known DR LVS bottlenecks? My company had the opportunity to put LVS to the test the day following the superbowl when we delivered 12TB of data in 1 day, and peaked at about 750Mbps. In doing this we had a couple of problems with LVS (I think they were with LVS). I was using the latest lvs for 2.2.18, and ldirectord to keep the machines in and out of LVS. The LVS servers were running redhat with an EEPro100. I had two clusters, web and video. The web cluster was a couple of 1U's with an acenic gig card, running 2.4.0, thttpd, with a somewhat performance tuned system (parts of the C10K). At peak our LVS got slammed with 40K active connections (so said ipvsadm). When we reached this number, or sometime before, LVS became inaccessible. I could however pull content directly from a server, just not through the LVS. LVS was running on a single proc p3, and load never went much above 3% the entire time, I could execute tasks on the LVS but http requests weren't getting passed along. A similar thing occurred with our video LVS. While our realservers aren't quite capable of handling the C10K, we did about 1500 a piece and maxed out at about 150Mbps per machine. I think this is primarily the modem users' fault. I think we would have pushed more bandwidth to a smaller number of high bandwidth users (of course). I know this volume of traffic choked LVS. What I'm wondering is, if there is anything I could do to prevent this. Until we got hit with too many connections (mostly modems I imagine) LVS performed superbly. I wonder if we could have better performance with a gig card, or some other algorithm (I started with wlc, but quickly changed to wrr because all the rr calculations should be done initially and never need to be done again unless we change weights, I thought this would save us). Another problem I had was with ldirectord and the test (negotiate, connect). 
It seemed like I needed some type of test to put the servers in initially, then too many connections happened so I wanted no test (off), but the servers would still drop out from ldirectord. That's a snowball type problem for my amount of traffic, one server gets bumped because it's got too many connections, and then the other servers get over-loaded, they'll get dropped too, then I'll have an LVS directing to localhost. So, if anyone has pushed DR LVS to the limits and has ideas to share on how to maximize its potential for given hardware, please let me know.

Jumbo frames All users of ethernet should understand the effects of MTU size on packet throughput and why you need jumbo frames. The problem is that the MTU=1500bytes was designed for the original implementation of ethernet at 3Mbps. The clock speed was upped to 10Mbps for commercial release, but the MTU was not changed, presumably not to change the required buffer size. When 100Mbps ethernet arrived, the MTU was maintained for backward compatibility on mixed 10/100 networks. The MTU is 30 times too small for 100Mbps and 300 times too small for 1Gbps ethernet. Joe 26 Apr 2002 (posting to the beowulf mailing list)

I know that jumbo frames increase throughput rate on GigE and was wondering if a similar thing is possible with regular FE.

Donald Becker becker (at) scyld (dot) com 26 Apr 2002 I used to track which FE NICs support oversized frames. Jumbo frames turned out to be so problematic that I've stopped maintaining the table.

The MTU of 1500 was chosen for 10Mbps ethernet and was kept for 100Mbps and 1Gbps ethernet for backwards compatibility on mixed networks. However MTU=1500 is too small for 100Mbps and 1Gbps ethernet. In Gbps ethernet, jumbo frames (ie a bigger MTU) are used to increase throughput.

Yup, 1500 bytes was chosen for interactive response on original Ethernet. (Note: originally Ethernet was 3Mbps, but commercial equipment started at 10Mbps.) The backwards compatibility issue is severe. The only way to automatically support jumbo frames is using the paged autonegotiation information, and there is no standard established for this. Jumbo frame *will* break equipment that isn't expecting oversized packets. If you detect a receive jabber (which is what a jumbo frame looks like), you are allowed (and _should_) disable your receiver for a period of time. The rationale is that a network with an on-going problem is likely to be generating flawed packets that shouldn't be interpreted as valid.

With netpipe I found that throughput on FE was approx linear with increasing MTU up to the max=1500bytes. I assume that there is no sharp corner at 1500 and if in principle larger frames could be sent, then throughput should also increase for FE. (Let's assume that the larger packets will never get off the LAN and will never need to be fragmented). I couldn't increase the MTU above 1500 with ifconfig or ip link. I found that the MTU seemed to be defined in and increased these by 1500, recompiled the kernel and net-tools and rebooted. I still can't install a device with MTU>1500. VLAN sends a packet larger than the standard MTU, having an extra 4 bytes of out of band data. The VLAN people have problems with larger MTUs. Here's their mailing list http://www.WANfear.com/pipermail/vlan/ where I found the following e-mails which indicate that the MTU is set in the NIC driver and that in some cases the MTU=1500 is coded into the hardware or is at least hard to change.

Most of the VLAN people don't initially understand the capability of the NICs, or why disabling Rx length checks is a Very Bad Idea. There are many modern NIC types that have explicit VLAN support, and VLAN should only be used with those NICs. (Generic clients do not require VLAN support.)

I don't know whether regular commodity switches (eg Netgear FS series) care about packet size, but I was going to try to send packets over a cross-over cable initially.

Hardware that isn't expecting to handle oversized frames might break in unexpected ways when Rx frame size checking is disabled. Breaking for every packet is fine. Occasionally corrupting packets as a counter rolls over might never be pinned on the NIC. The driver also comes into play. Most drivers are designed to receive packets into a single skbuff, assigned to a single descriptor. With jumbo frames the driver might need to be redesigned with multiple descriptors per packet. This adds complexity and might introduce new race conditions. Another aspect is that dynamic Tx FIFO threshold code is likely to be broken when the threshold size exceeds 2KB. This is a lurking failure -- it will not reveal itself until the PCI is very busy, then Boom... Most switches very much care about packet size. Consider what happens in store-and-forward mode. All of these issues can be fixed or addressed on a case-by-case basis. If you know the hardware you are using, and the symptoms of the potential problems, it's fine to use jumbo frames. But I would never ship a turn-key product or preconfigured software that used jumbo frames by default. It should always require expertise and explicit action for the end user to turn it on. Josip Loncaric josip (at) icase (dot) edu 29 Apr 2002

The backwards compatibility issue is severe Jumbo frames are great to reduce host frame processing overhead, but, unfortunately, we arrived at the same conclusion: jumbo frames and normal equipment do not mix well. If you have a separate network where all participants use jumbo frames, fine; otherwise, things get messy. Alteon (a key proponent of jumbo frames) has some suggestions: define a normal frame VLAN including everybody and a (smaller) jumbo frame VLAN; then use their ACEswitch 180 to automatically fragment UDP datagrams when routing from a jumbo frame VLAN to a non-jumbo frame VLAN (TCP is supposed to negotiate MTU for each connection, so it should not need this help). This sounds simple, but it requires support for 802.1Q VLAN tagging in the Linux kernel if a machine is to participate in both jumbo frame and non-jumbo frame VLANs. Moreover, in practice this mix is fragile for many reasons, as Donald Becker has explained... One of the problems I've seen involves UDP packets generated by NFS. When a large UDP packet (jumbo frame MTU=9000) is fragmented into 6 standard (MTU=1500) UDP packets, the receiver is likely to drop some of these 6 fragments because they are arriving too closely spaced in time. If even one fragment is dropped, the NFS has to resend that jumbo UDP packet, and the process can repeat. This results in a drastic NFS performance drop (almost 100:1 in our experience). To restore performance, you need significant interrupt mitigation on the receiver's NIC (e.g. receive all 6 packets before interrupting), but this can hurt MPI application performance. NFS-over-TCP may be another good solution (untested!). We got good gigabit ethernet bandwidth using jumbo frames (about 2-3 times better than normal frames using NICs with Alteon chipsets and the acenic driver), but in the end full compatibility with existing non-jumbo equipment won the argument: we went back to normal frames. 
The frame processing overhead does not seem as bad now that CPUs are so much faster (2GHz+), even with our gigabit ethernet, and particularly not with fast ethernet. However, if we had a separate jumbo-frame-only gigabit ethernet network, we'd stick to jumbo frames. Jumbo frames are simply a better solution for bulk data transfer, even with fast CPUs.

Network Latency Network latency in an LVS is determined by the internet and is beyond the control of the person setting up the LVS. In a beowulf, the network is local and latency is important. Here's a posting from the beowulf mailing list about latency and throughput for small packets on Gbps (GigE) ethernet. Richard Walsh rbw (at) ahpcrc (dot) org 07 Mar 2003 In the limit of a 1 byte message, the inverse of the latency is the worst-case bandwidth for repeatedly sending 1 byte. On a GigE system with a latency of say 50 usecs your worst case bandwidth is 20 KB/sec :-(. This is mostly a hardware number. If you add in other contributors to the latency things get worse. As message size shrinks latency eventually dominates the transfer time ... the larger the latency the sooner this happens. Under the heading of "everything is just another form of something else", the distinction between latency and bandwidth gets muddy as latency grows relative to message size. On the other hand, if you can manage your message sizes, keep the latency piece a small percentage of the message transit time, and have good bandwidth you may not care what the latency is. Pushing up data volumes per node imply larger surfaces to communicate which imply larger messages. These transfers can be hidden behind compute cycles. Of course, one has to worry about faster processors shrinking compute cycles.
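The relation between latency and bandwidth described in the posting can be put into numbers with a simple model in which the transfer time is the latency plus the time the message spends on the wire. That additive model and the function names are assumptions for this sketch; the 50usec GigE latency is the figure from the posting.

```python
def worst_case_bandwidth(latency_s, message_bytes=1):
    """Bytes/sec achieved when repeatedly sending one small message per latency period."""
    return message_bytes / latency_s


def transfer_time(message_bytes, latency_s, link_bps):
    """Latency plus wire time for one message (simple additive model)."""
    return latency_s + message_bytes * 8 / link_bps


def latency_share(message_bytes, latency_s, link_bps):
    """Fraction of the transfer time that is latency."""
    return latency_s / transfer_time(message_bytes, latency_s, link_bps)


if __name__ == "__main__":
    latency = 50e-6   # 50 usecs
    gige = 1e9        # 1Gbps
    print(f"1 byte messages: {worst_case_bandwidth(latency) / 1000:.0f} KB/sec")
    for size in (1, 1500, 1_000_000):
        share = latency_share(size, latency, gige)
        print(f"{size:>8} byte message: latency is {share:.1%} of the transfer time")
```

For a 1500 byte message on GigE the latency is still most of the transfer time; only for messages of the order of a megabyte does it become negligible.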

Mixture of 100Mbps and GigE ethernet Jeremy Kusnetz

We are planning on upgrading the network of our realservers to gigE to support a gigE connection to our NFS server. I need to have gigE on the realservers due to potential buffering issues losing NFS udp packets coming from the NFS server. Now that the realservers will be on gigE, I can see a potential of the realservers sending data to the director faster than the director's internal 100mb connection can handle and start buffering packets on the switch. Because of that I'm planning on putting a gigE interface on the internal connection of the director, but leaving a 100mb nic connecting the director to the outside routers. Now the director would be buffering data coming in at gigE speeds and sending out the data at 100mb speeds. Am I going to have any problems on the director doing this kind of buffering? I figure it could probably handle it better than the switch could. Am I right?

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 29 Mar 2004 TCP is your friend. Even if you had Jumbo frames enabled and large payloads it's extremely unlikely that TCP would fail to sufficiently throttle the delivery rate before you would bump into a hard buffer. TCP is decoupled from the underlying transports. While this is inefficient in obvious ways it is also precisely what protects us from situations like the one you describe. The down side is that TCP doesn't really get with the program until a problem BEGINS to happen. GigE on your director will reduce some of the switching burden, but ultimately, it's TCP's behavior which will throttle end-to-end traffic within the tolerable capacity of your infrastructure. If top performance is of concern you might consider traffic shaping on the realservers for egress traffic.

Unfortunately we have some UDP protocols we are load balancing, namely DNS and radius. I'm not too worried about TCP traffic, but I am worried about losing UDP packets. Maybe I should keep the interconnects between the realservers and the director at 100Mb like they are now, but do NFS on a separate network off of the gigE cards.

NICs and Switches, 100Mbps (FE) and 1Gbps (GigE) (Apr 2002) If you are going into production, you should test that your NICs and switch work well with the hardware in your node. Give it a good exercising with a netpipe test. (see the performance page where the netpipe test is used). Netpipe will determine the network latency, the maximum throughput and whether the hardware behaves properly under stress. Latency determines system speed for processes transferring small packets over a small number of hops (usually one hop), while maximum throughput determines your system behaviour for large numbers of MTU sized packets. The beowulfers have done the most to find out which network hardware is useful at high load. You can look through the whole beowulf mailing list archive after downloading it. Unfortunately there is no online keyword search like we have on the LVS mailing list (thanks to Hank). Beowulfers are interested in both latency and throughput. If your LVS is sending packets to clients on the internet, your latency will be determined by the internet. If your connection to your clients is through a T1 line, your maximum throughput will also be determined by the internet. The beowulfers use either 100Mbps FE or for high throughput, myrinet. They don't use 1Gbps ethernet as it doesn't scale and is expensive. Network performance is expensive. In a beowulf with myrinet, half the cost of the hardware is in the networking. The difference between a $100k beowulf and a $5M Cray, which has the same number, type and speed of commodity DEC Alpha CPUs, is the faster interconnects in the Cray. With a Cray supercomputer, you aren't buying fast CPUs (Cray doesn't make its CPUs anymore, they're using the same CPUs that you can buy in a desktop machine), what you're buying is the fast onboard networking between the CPUs.

100Mbps Martin Seigert at Simon Fraser U posted Benchmarks for various NICs to the beowulf mailing list. The conclusion was that for fast CPUs (i.e. 600MHz, which can saturate 100Mbps ethernet) the 3c95x and tulip cards were equivalent. From a posting on the beowulf mailing list: "the tulip is the ne2k of the late 90's". (I use them when I can - Joe). For slower CPUs (166MHz) which cannot saturate 100Mbps ethernet, the on-board processing on the 3Com cards allowed marginally better throughput. I use Netgear FA310TX (tulip), and eepro100 for single-port NICs. The related FA311 card seems to be Linux incompatible (postings to the beowulf mailing list), currently (Jul 2001) requiring a driver from Netgear (this was the original situation with the FA310 too). I also use a quad DLink DFE-570TX (tulip) on the director. I'm happy with all of them. The eepro100 has problems as Intel seems to change the hardware without notice and the linux driver writers have trouble handling all the versions of hardware. One kernel (2.2.18?) didn't work with the eepro100, and new kernels seem to have problems occasionally. I bought all of my eepro100's at once and presumably they are identical and presumably the bugs have been worked out for them. There have been a relatively large number of postings from people with eepro100 problems on the LVS mailing list. You should expect continuing problems with this card, which will be incrementally solved by kernel patches.

100Mbps switches These are now cheap. The parameters that determine the performance of a switch are:

backplane bandwidth: This is the total rate of packet or bit throughput that the switch can handle through all ports at once. If you have an 8 port 100Mbps switch, you need a backplane bandwidth of 400Mbps to allow all 4 pairs of ports to talk at the same time. A hub is just a switch that only allows one pair of ports to talk at a time.

cut through, store and forward: At low packet throughput a switch will "cut through", i.e. after decoding the dst_lladdr (the MAC address of the target) on the packet, it will switch the packet through to the appropriate port before the rest of the packet has arrived at the switch. This will ensure low latency for packet transfer. You can test switch latency by replacing the switch with a cross-over cable. When the packet throughput exceeds the backplane bandwidth, the switch can store packets till there is space on the link layer. This is called "store and forward". What you want to know is whether the switch does cut through and/or store and forward, and if it does both, when the change over occurs.

large packet handling: You want to know what the switch does with packets larger than the standard 1500byte MTU.

Unfortunately the suppliers of commodity switches, Netgear, 3COM and HP are less than forthcoming on their specs. (Apr 2002, Extreme Networks, quote backplane, which they call "switch fabric", bandwidth for their products, which include GigE switches. Sep 2002 - they've removed the webpage.) Netgear gives the most information (and has the cheapest switches) and I bought my switch from them, as a way of supporting their efforts here. Someone I know who buys a lot of switches said he had some Netgear switches arrive DOA. They were returned without problem, but manufacturers should be shipping working boxes. 
3COM gives less information than Netgear, while HP just wants to know whether you are using the device at home or at a business and they'll tell you which box you need. I started seeing advertisements for Dell switches in Sept 2001, at lower prices than for Netgear, but the Dell website didn't acknowledge that they existed and a contact at Dell couldn't find specs for them. The Dell switches appeared to be rebadged equipment from another networking company. None of these suppliers give the above necessary information - they aren't selling switches, they're selling "productivity enhancement solutions". I'm happy with my Netgear FS series switch, but then I haven't tested it with 100Mbps simultaneously on all ports. Since the price of a switch rises exponentially with its backplane bandwidth (required for more ports), an often suggested solution (which I haven't tested), is to divide your network into smaller groupings of computers, connected by 2 layers of switches (this will increase your network latency, since now two hops may be required).
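The backplane bandwidth needed for a switch can be worked out from the number of ports, following the accounting used in this section (one link rate per communicating pair of ports). The function names are illustrative, and a vendor may count backplane bandwidth differently (for instance counting both directions of a full duplex link), so treat this as a sketch.

```python
def backplane_needed_mbps(ports, port_mbps):
    """Backplane bandwidth needed for all pairs of ports to talk at once."""
    pairs = ports // 2
    return pairs * port_mbps


def simultaneous_pairs(backplane_mbps, port_mbps):
    """Number of port pairs that a given backplane can serve at full rate."""
    return int(backplane_mbps // port_mbps)


if __name__ == "__main__":
    print(f"8 port 100Mbps switch needs {backplane_needed_mbps(8, 100)} Mbps")
    # A hub behaves like a switch whose backplane serves one pair at a time.
    print(f"100Mbps backplane serves {simultaneous_pairs(100, 100)} pair")
```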

550MHz CPU saturates 100Mbps ethernet Martin Hamilton martin (at) net (dot) lut (dot) ac (dot) uk Nov 14 2001

we (JWCS) also use LVS on our home institutional caches. These are somewhat smaller scale, e.g. some 10m URLs/day at the moment for Loughborough's campus caches vs. 130m per day typically on the JANET caches. The good news is that LVS in tunnelling mode is happily load balancing peaks of 120MBit/s of traffic on a 550MHz PIII. Folk in ac.uk are welcome to contact us at support (at) wwwcache (dot) ja (dot) net for advice on setting up and operating local caches. I'm afraid we can only provide this to people on the JANET network, like UK Universities and Colleges.

1Gbps (GigE) NICs Here's a review of GigE over copper wire. All current (May 2002) GigE NICs support jumbo frames and are cheap (US$100). The best latency (SysConnect SK9821 NIC) is 50usecs. While this is nice, I'm getting 150usec on a pair of 100Mbps ethernet cards connected between a 133MHz pentium 1 and a dual 180MHz Pentium Pro at a fraction of the cost. The SysConnect card can deliver 900Mbps with jumbo frames. Here's the (http://www.scd.ucar.edu/nets/docs/reports/HighSpeed/ - link dead Jul 2004) NCAR High Performance Networking Tests which gives background info on fast networks (ATM and GigE). The main point for us is that the jumbo frame formats are proprietary and non-interoperable (we need a standard here).

1Gbps (GigE) switches While GigE NICs are cheap, the switches are still expensive. The suppliers of commodity GigE switches are even less forthcoming with their specs than they are for their 100Mbps switches. (Note: Apr 2002, I just found http://www.extremenetworks.com/products/products.asp which quote backplane, which they call "switch fabric", bandwidth for their products, which include GigE switches. Note: Oct 2002, link is dead.) It seems that the manufacturers would rather you figure out the specs yourself. Netgear is selling a 24 port GigE switch (according to a vendor), but all you can find on the Netgear website (Apr 2002) is a 100Mbps switch with 4 GigE ports. Some of the vendors want you to figure out the existence of their boxes too. Here's my estimate of commodity GigE switch specs: A 24 port cisco 6000 series GigE chassis (no added boards) costs US$60k, has a backplane bandwidth of 32Gbps and supports jumbo frames. The commodity switches (e.g. 24 port HP4108) cost US$6k and do not support jumbo frames. On the assumption that the backplane bandwidth is what you're paying for, then the spec on the HP box is 3Gbps i.e. only 3 pairs of ports on your 24 port box can be active at once. This box is not much more than a hub, something the manufacturers would not want you to know. You need jumbo frames and you can't have them. It is pointless going to GigE unless you have jumbo frames. These switch specs explain why GigE scales so badly and why beowulfers would rather stay with 100Mbps than spend any money on GigE. For some performance data see Gigabit ethernet TCP switch performance.
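The estimate of the commodity switch's backplane is a simple proportion, reproduced here. It rests entirely on the assumption stated in the text, that backplane bandwidth scales linearly with price; the prices and the 32Gbps figure are the ones quoted above, and the function names are illustrative.

```python
def backplane_from_price(ref_price, ref_backplane_gbps, price):
    """Backplane bandwidth (Gbps) if bandwidth is proportional to price."""
    return ref_backplane_gbps * price / ref_price


def active_pairs(backplane_gbps, port_gbps=1.0):
    """Pairs of ports that can be active at full rate on that backplane."""
    return int(backplane_gbps // port_gbps)


if __name__ == "__main__":
    est = backplane_from_price(60_000, 32, 6_000)
    print(f"estimated backplane: {est:.1f} Gbps")
    print(f"pairs active at once out of 12: {active_pairs(est)}")
```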

Ethernet,NIC Bonding This has not proven reliable or easy to setup, at least in the hands of the Beowulfers. Make sure your LVS is working properly before trying ethernet bonding. Craig Ward

Are there any known issues with bonding NICs and LVS? I've got a setup with 4 boxes, all 4 are web servers and 2 are directors. The VIP is being brought up on bond0:0 fine, but I can't ping this from any machine and it's not showing in the arp table on my windows client. Strangely, if I manually bring up an ip on bond0:0 that is NOT the VIP I can ping it fine. I'm wondering if any of the noarp rules on each director, used for when they are slave directors, are somehow "stuck" and it's not arping for the VIP whatever interface it's on?

Brad Hudson brad (dot) hudson (at) gmail (dot) com 5 Nov 2005 Here is how I use bonding:

$node has eth1 and eth3 bonded together on $fip as bond0
$fvip sits on bond0:0 and accepts all incoming requests for load balancing
$node has eth0 and eth2 bonded together on $bip as bond1
$bvip sits on bond1:0 and talks to all real servers where each $real has a gateway of $bvip

FYI: $node is also in failover cluster #1 and all real servers are in load balanced cluster #2

NIC problems - eepro100

counter overflows (This is from 1999 I think) linux with an eepro100 can't pass more than 2^31-1 packets. This may not still be a problem. Jerry Glomph Black black (at) real (dot) com Subject: 2-billion-packet bug?

I've seen several 2.2.12/2.2.13 machines lose their network connections after a long period of fine operation. Tonight our main LVS box fell off the net. I visited the box, it had not crashed at all. However, it was not communicating via its (Intel eepro100) ethernet port. The evil evidence: Check out the TX packets number! That's 2^31-1. Prior to the rollover, In-and-out packets were roughly equal. I think this has happened to non-LVS systems as well, on 2.2 kernels. ifconfigging eth0 down-and-up did nothing. A reboot (ugh) was necessary.

It's still happening 2yrs later. This time the counter stops, but the network is still functional. Hendrik Thiel thiel (at) falkag (dot) de 20 Nov 2001

using lvs with eepro100 cards (kernel 2.2.17) and encountered a TX packets value stopping at 2147483647 (2^31-1). That's what ifconfig tells ... the system still runs fine ... it seems to be an ifconfig bug. Check out the TX packets number! That's 2^31-1. Simon A. Boggis
Hmmm, I have a couple of eepro100-based linux routers - the one thats been up the longest is working fine (167 days, kernel 2.2.9) but the counters are jammed - for example, `ifconfig eth0' gives: BUT /proc/net/dev reports something more believable: Thats RX packets: 2177325200 and TX packets: 3474415357 compared to : 2147483647 from ifconfig eth0
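The jammed value is exactly the largest signed 32-bit integer, which fits the suspicion that the reporting tool, not the driver, is at fault. The following is a minimal sketch, not taken from the postings: it parses /proc/net/dev style text and shows what a reader that saturates at 2^31-1 would display. The sample text is hypothetical except for the two packet counts quoted above, the column layout is the one used by current kernels, and the saturating behaviour of ifconfig is an assumption.

```python
INT32_MAX = 2**31 - 1  # 2147483647, the value ifconfig was stuck at

# Hypothetical /proc/net/dev excerpt; only the two packet counts come from the posting.
SAMPLE = """\
Inter-|   Receive                                                |  Transmit
 face |bytes    packets errs drop fifo frame compressed multicast|bytes    packets errs drop fifo colls carrier compressed
  eth0: 1000 2177325200 0 0 0 0 0 0 2000 3474415357 0 0 0 0 0 0
"""

def packet_counters(text):
    """Return {interface: (rx_packets, tx_packets)} parsed from /proc/net/dev text."""
    counters = {}
    for line in text.splitlines()[2:]:          # skip the two header lines
        name, _, rest = line.partition(":")
        fields = rest.split()
        if len(fields) >= 10:
            # field 1 is RX packets, field 9 is TX packets
            counters[name.strip()] = (int(fields[1]), int(fields[9]))
    return counters

def saturate_int32(value):
    """Mimic a reader that clamps at the largest signed 32-bit value (assumption)."""
    return min(value, INT32_MAX)

rx, tx = packet_counters(SAMPLE)["eth0"]
print("kernel counters:", rx, tx)
print("clamped view:   ", saturate_int32(rx), saturate_int32(tx))
```

On a live machine the same function can be fed the contents of /proc/net/dev directly.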

From Purrer Wolfgang www (dot) purrer (dot) at 24 Apr 2003: If Donald Becker's drivers aren't helping, you can always get the drivers from Intel. (Joe - it's an appallingly designed site).

new drivers Andrey Nekrasov

Since I changed to kernel 2.4.x with the "arp hidden patch", I haven't had any of the problems I had before with the Intel EEPRO/100 NIC.

Julian 4 Feb 2002 This problem happens not only with LVS. Search the web or linux-kernel: http://marc.theaimsgroup.com/?t=100444264400003&r=1&w=2

Kernel 2.4.18 Peter Mueller pmueller (at) sidestep (dot) com 22 May 2002

2.4.18 has an eepro100 bug in it. Looking in dejanews, there is a slow down under some circumstances. You should use the driver from 2.4.17.

bonding with eepro100 Roberto Nibali ratz (at) drugphish (dot) ch 07 Nov 2003 The eepro100 never really worked well for me in conjunction with bonding. I also use the e100/e1000 drivers as suggested by Brian. Also note that the bonding architecture has gotten a major (actually huge) overhaul between the 2.4.21 and 2.4.23-preX phase. Among the changes are the possibility to set and change the MAC and the MTU in ALB/TLB modes, fixed arp monitoring, better 802.3ad support and proper locking. You might wanna play with some of the newer 2.4.23-pre kernels or at least with 2.4.22 if the problem persists, then again it's highly time-consuming to follow latest ("stable") development kernels currently.

NIC problems - tulip (Joe, Nov 2001 I don't know if this is still a problem, we haven't heard any more about it and haven't had any other tulip problems, unlike the eepro100.) John Connett jrc (at) art-render (dot) com 05 May 1999

Any suggestions as to how to narrow it down? I have an Intel EtherExpress PRO 100+ and a 3COM 3c905B which I could try instead of the KNE 100TX to see if that makes a difference. A tiny light at the end of the tunnel! Just tried an Intel EtherExpress PRO 100+ and it works! Unfortunately, the hardware is fixed for the application I am working on and has to use a single Kingston KNE 100TX NIC ... Some more information. The LocalNode problem has been observed with both the old style (21140-AF) and the new style (21143-PD) of Kingston KNE 100TX NIC. This suggests that there is a good chance that it will be seen with other "tulip" based NICs. It has been observed with both the "v0.90 10/20/98" and the "v0.91 4/14/99" versions of tulip.c. I have upgraded to vs-0.9 and the behaviour remains the same: the EtherExpress PRO 100+ works; the Kingston KNE 100TX doesn't work. It is somewhat surprising that the choice of NIC should have this impact on the LocalNode behaviour but work successfully on connections to slave servers. Any suggestions as to how I can identify the feature (or bug) in the tulip driver would be gratefully received. If it is a bug I will raise it on the tulip mailing list.

dual/quad ethernet cards, IRQ sharing problems

any recommendations?

Ratz 04 Dec 2002: We're using Adaptec Quadboards (Adaptec ANA-62044, 64-bit PCI) and they work like a charm. You can stick in 6 of those on an Intel Serverboard and have 24 NICs. We're however testing the new Intel Quadboards that will officially be available in Q1 2003. We chose Adaptec because in the past we've had bad experiences with badly broken DLink hardware. This mostly concerned their switches. But once a product sheds bad light on the decision you will hardly convince yourself that the rest of the product line works correctly, IMHO.

Are IRQ-sharing lockups a thing of the past?

P3-600's: I very well remember and we have delivered a few such packet filters without any major problems. ASUS P*B-*: never had a single problem. Ok, we use a 2.2.x kernel with some enhancements of mine (not IRQ routing related). There should be no problem.

the boards in this era that I used had this problem, and guides like Anandtech and tomshardware advised to configure the BIOS to have each PCI slot be a set IRQ.

Do _not_ do that! Linux will choose an IRQ for the PCI slot and depending on whether the board has SCSI or IDE the IRQ wired routing on the local APIC is different. Forcing an IRQ on a specific PCI slot makes ASUS boards with older firmware releases go bananas when assigning the IRQ routing, especially those with an onboard SCSI chip. There you have a reversed initialisation phase. Also if you're using the PCI-sharing option from the BIOS make sure to enable PCI-2.x compliancy and use an up-to-date BIOS release. And last but not least: All this doesn't work if you use Realtek-chipset based NICs. They are fundamentally flawed when it comes to IRQ sharing. Nowadays this is solved however and you can use this el-cheapo NIC. Nowadays you can look into the motherboard booklet and see the wiring. If you intend to put in an additional SCSI card you need to make sure that the routing is separated. In most 5 to 6 PCI-slot boards, you could for example select slot 1 and 2 for separation since they are not routed over the same chip. It depends on the bridge however. This all changes if you have an SMP board (how could it be any different of course :)). There you need to distinguish every single motherboard factorisation to know how to solve the eventual problem of misplaced IRQ sharing. It will very much depend on the PCI chipset support in the kernel (in Windows world this would be the busmaster driver).

Ok.. so IRQ sharing is FIXED in 99% of situations now? I can take 2 quad cards from different manufacturers and put them in the same box and they will work on the same IRQ (from the BIOS perspective)?

This is not said. All 4 ports of a single quadboard will have the same IRQ but if you put in a second quadboard from a different vendor your machine might just end up using a different IRQ. Interport IRQ routing on a single quadboard is almost always shared. Also you need to take into account that this can change if you enable local APIC on UP or APIC in general for SMP boards. There you will most probably end up with fewer shared IRQs. However you end up with bigger problems with certain Intel boards based on the 440GX+/440LX+ chipset.

So SCSI needs to be a separate IRQ from the rest? Don't share SCSI. What about Firewire or USB2 or ... ;)

I'm not that much into firewire and USB2, since having a packet filter or high traffic node with quadboards normally implies no need for any firewire or USB2 devices. YMMV.

Have you looked at the AMD bus at all?

If you're going SMP then yes, pay 200 bucks more and get a decent board with EMP and MCE support via console (UART). I have looked at the AMD boards but in our lab we've found them to be less ready to work properly with the rest of our hardware than Intel based boards. I wish support for AMD boards in Linux would be better but this is just a matter of time.

Flakey Switch Here a user tracked down a poor performance problem, to a possible flakey switch, when serving streaming media. Mark Weaver mark (at) npsl (dot) co (dot) uk 23 Mar 2004

The test client is using the Windows Media Load Simulator. This just makes a lot of connections and streams back data. The average stream only gets up to about 35Mbit. At this point, CPU usage on the director is ~20% (which would seem to indicate that I should be able to get a lot more out of it). CPU on the test box is at about 25% and on the media server at 4%. The problematic part is that the director begins dropping about 10% of externally originated packets at this level of load. I wouldn't say any machine involved is stressed here, but pinging the external IP of the director gives that huge loss. This noticeably affects say, SSH, on the director or TS to the media server. This is contrasted with pinging the external IP of the test box, which gives 0% loss. I would therefore conclude that this is an issue with the director, but I'm not sure what. My next guess would be to try swapping the VIA NIC for another 3com one, but could it really be that bad? I can't see it being an issue with the cisco switch (test box and director are both connected to it); the cisco router (same), or the d-link switch (not involved in ping to director), so I'm at a loss as to what else to conclude. I trundled myself down to the hosting centre to do some further testing. It turns out that when plugging a couple of test clients directly into the switch in front of the director, I can get 90Mbits sustained load out of it using around 40% CPU on the director, nearly 100% CPU on both test clients, 5% CPU on the WMS machine, and ~1.5k concurrent connections. Fantastic! The issue then, appears to be either the cisco switch or the cable connecting it; there is nothing else left to test. I did swap the NICs out for eepros (but the problem still persists when stressing through the cisco kit). Safe to say, I'm very, very impressed by this!

performance testing tools

Web Polygraph Dennis Kruyt, 9 Jan 2002

I am looking for software for testing my lvs webservers with persistent connections. With normal http benchmarking tools all requests come from one IP, but I want to simulate a few hundred connections from different IPs with persistent connections.

Joe Cooper joe (at) swelltech (dot) com 09 Jan 2002 Web Polygraph is a benchmarking framework originally designed for web proxies. It will generate thousands of IPs on the client box if you request them. It does not currently have a method to test existing URLs, as far as I know (it provides its own realserver(s) and content, so that data is two-sided). It currently works very well for stress-testing an LVS balancer, but for the realservers themselves it probably needs a pretty good amount of coding. The folks who developed it will add features for pretty good rates, particularly if the new features fit in with their future plans. It does have some unfortunate licensing restrictions, but is free to get, use and modify for your own internal purposes.

getmeter Alexandre Cassen Alexandre (dot) Casseni (at) wanadoo (dot) fr 13 Jan 2003 (2008, Alexandre is now at Alexandre (dot) Cassen (at) free (dot) fr) getmeter: Simple tool for emulating a multi-threaded web browser. The code works for a HTTP/1.1 webserver. The purpose of this tool is to monitor webserver response time. It implements HTTP/1.1 GET using a realtime multi-threaded design, dealing with a url pool to measure global page response time (page and first level elements response time). It connects to a webserver (HTTP or SSL); creates 2 multi-threaded persistent connections to this webserver; performs a GET HTTP/1.1 on the url specified; parses the html content returned and creates an element pool; performs GET HTTP/1.1 on each element; for each GET, measures the response time; and prints the global response time for the page requested. An extension can use MRTG or RRDTOOL to graph the output.
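The measurement loop described for getmeter can be sketched in a few lines. This is not getmeter's code: it is a simplified, single-threaded illustration without persistent connections, and the names ElementCollector, http_get and measure_page are hypothetical. It fetches a page, collects the first level elements (img, script and link tags), fetches each one, and returns the per-URL and global response times.

```python
import time
import urllib.request
from html.parser import HTMLParser
from urllib.parse import urljoin

class ElementCollector(HTMLParser):
    """Collect URLs of first-level page elements (images, scripts, linked files)."""
    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag in ("img", "script") and attrs.get("src"):
            self.urls.append(urljoin(self.base_url, attrs["src"]))
        elif tag == "link" and attrs.get("href"):
            self.urls.append(urljoin(self.base_url, attrs["href"]))

def http_get(url):
    """Fetch a URL and return the body as bytes."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return response.read()

def measure_page(url, fetch=http_get):
    """Return ({url: seconds}, total_seconds) for a page and its elements."""
    timings = {}
    start = time.perf_counter()
    body = fetch(url)
    timings[url] = time.perf_counter() - start

    collector = ElementCollector(url)
    collector.feed(body.decode("utf-8", errors="replace"))
    for element_url in collector.urls:
        start = time.perf_counter()
        fetch(element_url)
        timings[element_url] = time.perf_counter() - start
    return timings, sum(timings.values())
```

Calling measure_page() with the URL of a page served through the VIP gives one sample; the fetch argument exists so that a different transport (or a stub for testing) can be substituted.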

Max number of realservers

what is the maximum number of servers I can have behind the LVS without any risk of failure?

Horms horms (at) vergenet (dot) net 03 Jul 2001 LVS does not set artificial limits on the number of servers that you can have. The real limitations are the number of packets you can get through the box, the amount of memory you have to store connection information and in the case of LVS-NAT the number of ports available for masquerading. These limitations affect the number of concurrent connections you can handle and your maximum throughput. This indirectly affects how many servers you can have. (also see the section on port range limitations.)

FAQ: What are the minimum hardware requirements for a director? Enough for the machine to boot, i.e. 386 CPU, 8M memory, no hard disk, 10Mbps ethernet.

FAQ: How fast/big should my director be? There isn't a simple answer. The speed of the director is determined by the packet throughput from/to the clients and not by the number of realservers. From the mailing list, 300-500MHz UMP directors running 2.2.x kernels with the ipvs patch can handle 100Mbps throughput. We don't know what is needed for 1Gbps throughput, but postings on the mailing list show that top end UMP machines (eg 800MHz) can't handle it. For the complicated answer, see the rest of this section.

Horms 12 Feb 2004 If you only want to use LVS to load balance 100Mb/s Ethernet then any machine purchased in the last few years should easily be able to do that. End of conversation :-) If you want to go to 1Gb/s Ethernet then things get more interesting. At that point here are the things to watch out for:

PCI bus. Make sure your machines have a nice fast PCI bus. These days most machines seem to have 66Mhz/64bit or 100Mhz/64bit slots so you are fine. Back when 33Mhz/32bit was standard this was a bit more problematic.

NICs. Buy good NICs that have well maintained drivers.

UP not SMP. Unless you really need SMP on the machine for some reason, the locking overhead is greater than the gain of an extra CPU when using LVS. This is particularly true when handling small connections, where the TCP handshake becomes significant. (That was on 2.4, not sure about 2.6, though I assume that it still holds)

CPU. CPU isn't really much of an issue. If you can purchase a CPU these days that is too slow to run LVS, even up to 1Gb/s then I would be very surprised. Certainly anything over a 1GHz PII should be fine.

Memory. First understand that LVS has no internal limits on the number of connections it can handle, so you are only bound by your system resources. Here is the equation. For each connection you have in LVS's connection table you need about 128 bytes. Connections will stay in the table for 120 seconds after a connection is closed. So if your peak is, say 300 connections/s, then you need about 300*120*128 = 4608000 bytes, about 4.6Mbytes of memory for the connection table, which I think you will agree isn't much. If you are using persistence then an extra entry (template) will be created per end-user (masked by the persistence netmask) and these will stay around for the duration of the persistence timeout. You can do the maths there. But the bottom line is that unless you are expecting an extremely high number of connections, then you don't need much memory. Obviously you will need memory for other things like the OS, monitoring tools etc... But I think that 256MB of RAM should be more than enough.
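The memory estimate is easy to rework for other connection rates. The sketch below simply restates the equation from the posting (connections per second, times the 120 seconds an entry lingers, times about 128 bytes per entry); the function name is hypothetical and the per-entry size is the approximate figure quoted, not a measured value.

```python
def connection_table_bytes(conn_per_sec, linger_sec=120, bytes_per_entry=128):
    """Approximate memory needed for the LVS connection table at a given peak rate."""
    return conn_per_sec * linger_sec * bytes_per_entry

size = connection_table_bytes(300)
print(size)                     # 4608000
print(round(size / 2**20, 1))   # 4.4 (in units of 2^20 bytes)

# The same estimate at a much higher peak rate of 10000 connections/s
print(connection_table_bytes(10000))   # 153600000
```

Even at 10000 connections/s the table stays well under the 256MB suggested above, which supports the point that memory is rarely the limiting resource.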

SMP doesn't help, but 64 bit does LVS is kernel code. In particular the network code is kernel code. Kernel code is only SMP in 2.4.x kernels (user space SMP started in 2.0.x kernels). To take advantage of SMP for LVS you must be running a 2.4.x kernel. Horms (06 Apr 2003): SMP doesn't help for kernel code at high load - Some things that you may want to consider are: Using a non SMP kernel - there is actually more overhead in obtaining spinlocks than the advantage you get from multiple CPUs if you are only doing LVS. If you only have one processor you should definitely use a non-SMP kernel. You may also want to consider using NAPI with the ethernet driver and if you really want to use SMP then setting up affinity for the two NICs with different processors is probably a good idea. Dusten Splan Dusten (at) opinionsurveys (dot) com 12 May 2003

Is LVS smp aware? I have dual 1.1Ghz processors, dual 1000BaseT Ethernet, with 2.4.20 compiled as an smp and with the lvs sources compiled in, set up as a one nic one network DR unit with wrr and it is working very nicely. It is pushing about 50Mbps at peak and is only using about 15% on average of the processing power when doing a vmstat. Now here's the question - when doing a vmstat the numbers look like this (this is a snapshot when we are doing about 30Mbps): now if I look at top I get. When doing more traffic the load on cpu0 increases and nothing is happening on cpu1. My question is why am I not seeing this processor usage distributed over both processors. I know that on a Sun box the network card is stuck to a single processor and will not use the other processors. Here's a sample of what mpstat has to say about the whole thing.

Horms LVS really should utilise both CPUs. As you note the 2.4 kernels are multithreaded. LVS should take advantage of this. It definitely bears further investigation. The problem with performance is that in multithreading the kernel a lot of spinlocks were introduced. From the testing that I was involved in, it seems that the overhead in obtaining these locks is greater than the advantage of having access to a second CPU. That is in the case of using the box only as an LVS Linux Director. If you are doing lots of other things as well then this may not be the case. I would suggest that if you are building a machine that will act primarily as an LVS ldirectord, then a non-SMP kernel should give you the best performance. This however, does not answer, and is not particularly relevant to your problem. Sorry. Wensong The major LVS processing is run inside the softirqs in the kernel 2.4. The softirqs (even the same one) can run in parallel on two or more CPUs inside the kernel 2.4. So, the LVS in the kernel 2.4 should take advantage of SMP. We spent a lot of effort keeping the locking granularity of LVS small too. As for Dusten's problem, I am not sure why one CPU is 80% idle and the other is always 100% idle. From the mpstat output, almost all the interrupts go to the first CPU. Is it possible that 20% CPU cycles have been spent handling interrupts at the first CPU? Michael Brown michael_e_brown (at) dell (dot) com wrote on 26 Dec 2000

I've seen significant improvements using dual and quad processors with 2.4. Under 2.2 there are improvements but not astonishing ones. Things like 90% saturation of a Gig link using quad processors. 70% using dual processors and 55% using a single processor under 2.4.0test. I haven't had much of a chance to do a full comparison of 2.2 vs 2.4, but most evidence points to >100% improvement for network intensive tasks.

only one CPU can be in the kernel with 2.2. Since LVS is all kernel code, there is no benefit to LVS by using SMP with 2.2.x. Kernel 2.[3-4] can use multiple CPUs. While standard (300MHz pentium) directors can easily handle 100Mbps networks, they cannot handle an LVS at Gbps speeds. Either SMP directors with 2.4.x kernels or multiple directors (each with a separate VIP all pointing to the same realservers) are needed. Since LVS-NAT requires computation on the director (to rewrite the packets) not needed for LVS-DR and LVS-Tun, SMP would help throughput. Joe

If you're using LVS-NAT then you'll need a machine that can handle the full bandwidth of the expected connections. If this is T1, you won't need much of a machine. If it's 100Mbps you'll need more (I can saturate 100Mbps with a 75MHz machine). If you're running LVS-DR or LVS-Tun you'll need less horse power. Since most LVS is I/O I would suspect that SMP won't get you much. However if the director is doing other things too, then SMP might be useful

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg Yep, LVS in 2.2 can't use both CPUs. This is not a LVS limitation. It is already solved in the latest 2.3 kernels: softnet. If you are using the director as realserver too, SMP is recommended. Pat O'Rourke orourke (at) mclinux (dot) com 03 Jan 2000

In our performance tests we've been seeing an SMP director perform significantly worse than a uni-processor one (using the same hardware - only difference was booting an SMP kernel or uni-processor). We've been using a 2.2.17 kernel with the 1.0.2 LVS patch and bumped the send / recv socket buffer memory to 1mb for both the uni-processor and SMP scenarios. The director is an Intel based system with 550 mhz Pentium 3's. In some tests I've done with FTP, I have seen *significant* improvements using dual and quad processors using 2.4. Under 2.2, there are improvements, but not astonishing ones. Things like 90% saturation of a Gig link using quad processors, 70% using dual processors and 55% using a single processor under 2.4.0test. Really amazing improvements. Michael E Brown michael_e_brown (at) dell (dot) com 26 Dec 2000
What are the percentage differences on each processor configuration between 2.2 and 2.4? How does a 2.2 system compare to a 2.4 system on the same hardware?
I haven't had much of a chance to do a full comparison of 2.2 vs 2.4, but most of the evidence on tests that I have run points to a > 100% improvement for *network intensive* tasks. In our experiments we've been seeing an SMP director perform significantly worse than a uni-processor one (using the same hardware - only difference was booting an SMP kernel or uni-processor).

Kees Hoekzema kees (at) tweakers (dot) net 10 Jun 2008 Yes it is true that for LVS it does not really matter, but LVS is not always the only thing running on a system, in my case I also let it run a DNS server and it has a lot of iptables rules. At the moment we are doing around 60 mbit/s and my stats are: (a quadcore xeon was just a little bit more expensive than a singlecore cpu at the time we bought the servers). Also you can have different cpu's handle the interrupts: $ cat /proc/interrupts With the speed of the current cpu's however I think it does not really matter if you have a single or multi-core cpu, they all can handle gbit/s of data. 64 bit is helpful if you don't want stuff like /proc/net/dev overflowing (a 32-bit byte counter wraps after 4 GByte; at a 5 minute polling interval that is about 14 MByte/s, or roughly 115 Mbit/s of traffic, so if you get above that, and you try to use those statistics, the counter may overflow before you read it again)
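The counter overflow argument can be worked through numerically. A 32-bit byte counter wraps after 2^32 bytes, so the highest sustained rate that can be read unambiguously depends only on how often the counter is polled. The sketch below is an illustration with hypothetical function names; it also shows the usual modular subtraction that recovers the correct difference when the counter has wrapped at most once between two reads.

```python
def max_rate_before_wrap(counter_bits=32, poll_interval_sec=300):
    """Highest sustained rate in bits/s that a byte counter can record unambiguously."""
    return (2**counter_bits) * 8 / poll_interval_sec

def counter_delta(old, new, counter_bits=32):
    """Bytes transferred between two reads, assuming at most one wrap in between."""
    return (new - old) % (2**counter_bits)

print(round(max_rate_before_wrap() / 1e6, 1))                      # 114.5 (Mbit/s)
print(round(max_rate_before_wrap(poll_interval_sec=60) / 1e6, 1))  # 572.7 (Mbit/s)
print(counter_delta(4294967000, 500))                              # 796
```

Polling more often raises the usable rate, but on a gigabit link a 64-bit counter is the only choice that removes the problem rather than postponing it.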

Performance Hints from the Squid people There is information on the Squid site about tuning a squid box for performance. I've lost the original URL, but here's one about file descriptors and another (link dead Mar 2004 http://www.devshed.com/Server_Side/Administration/SQUID/) by Joe Cooper (occasional contributor to the LVS mailing list) that also addresses the FD_SETSIZE problem (i.e. not enough file descriptors). The squid performance information should apply to an LVS director. For a 100Mbps network, current PC hardware on a director can saturate a network without these optimizations. However current single processor hardware cannot saturate a 1Gbps network, and optimizations are helpful. The squid information is as good a place to start as any. Here's some more info Michael E Brown michael_e_brown (at) dell (dot) com 29 Dec 2000

How much memory do you have? How fast of network links? There are some kernel parameters you can tune in 2.2 that help out, and there are even more in 2.4. From the top of my head, /proc/sys/net/core/*mem* - tune to your memory spec. The defaults are not optimized for network throughput on large memory machines. 2.4 only /proc/sys/net/ipv4/*mem* For fast links, with multiple adapters (Two gig links, dual CPU) 2.4 has NIC-->CPU IRQ binding. That can really help also on heavily loaded links. For 2.2 I think I would go into your BIOS or RCU (if you have one) and hardcode all NIC adapters (Assuming identical/multiple NICS) to the same IRQ. You get some gain due to cache affinity, and one interrupt may service IRQs from multiple adapters in one go, on heavily loaded links. Think "Interrupt coalescing". Figure out how your adapter driver turns this on and do it. If you are using Intel Gig links, I can send you some info on how to tune it. Acenic Gig adapters are pretty well documented. For a really good tuning guide, go to spec.org, and look up the latest TUX benchmark results posted by Dell. Each benchmark posting has a full list of kernel parameters that were tuned. This will give you a good starting point from which to examine your configuration. The other obvious tuning recommendation: Pick a stable 2.4 kernel and use that. Any (untuned) 2.4 kernel will blow away 2.2 in a multiprocessor configuration. If I remember correctly 2.4.0test 10-11 are pretty stable.
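As an illustration of the NIC to CPU IRQ binding mentioned above, the following sketch parses /proc/interrupts style text, finds the IRQ of each NIC and prints the command that would pin it to one CPU. It assumes the binding referred to is the /proc/irq/<n>/smp_affinity interface, which takes a hexadecimal CPU bitmask. The sample text and the function names are hypothetical, and nothing is written to /proc; the commands are only printed.

```python
# Hypothetical /proc/interrupts excerpt for a two-CPU machine.
SAMPLE = """\
           CPU0       CPU1
  0:    1234567          0   IO-APIC-edge  timer
 24:    9876543         12   IO-APIC-level  eth0
 25:    4567890          7   IO-APIC-level  eth1
"""

def nic_irqs(interrupts_text, nic_prefix="eth"):
    """Return {irq: (device, [per-CPU interrupt counts])} for NIC lines."""
    lines = interrupts_text.splitlines()
    n_cpus = len(lines[0].split())           # header row: CPU0 CPU1 ...
    result = {}
    for line in lines[1:]:
        irq, _, rest = line.partition(":")
        fields = rest.split()
        if not irq.strip().isdigit() or len(fields) <= n_cpus:
            continue                          # skip NMI/ERR style summary lines
        device = fields[-1]
        if device.startswith(nic_prefix):
            result[int(irq)] = (device, [int(c) for c in fields[:n_cpus]])
    return result

def affinity_mask(cpu):
    """Hex bitmask selecting a single CPU, in the form smp_affinity expects."""
    return format(1 << cpu, "x")

irqs = nic_irqs(SAMPLE)
for cpu, irq in enumerate(sorted(irqs)):
    device, counts = irqs[irq]
    print(f"{device}: per-CPU counts {counts}")
    print(f"  echo {affinity_mask(cpu)} > /proc/irq/{irq}/smp_affinity")
```

With the sample above, eth0 would be bound to CPU0 and eth1 to CPU1, which is the arrangement Horms suggests for a two-NIC SMP director. Writing the masks requires root.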

Some information is on http://www.LinuxVirtualServer.org/lmb/LVS-Announce.html

Joe: here's a posting from the Beowulf mailing list, about increasing the number of file descriptors/sockets. It is similar to the postings on the squid webpages (mentioned above).

Yudong Tian yudong (at) hsb (dot) gsfc (dot) nasa (dot) gov 30 Sep 2003 The number of sockets a process can open is limited by the number of file descriptors (fds). Type "ulimit -n" under bash to get this number, which is usually 1024 by default. You can increase this number if you wish. Google "increase Linux file descriptors" and you will find many examples, like this one: http://support.zeus.com/faq/zws/v4/entries/os/linuxfd.html If you want to be really sure, you can compile and run the following c program to get the number, which is the output plus 3 (stdout, stdin and stderr): If you are running a TCP server and want to test how many clients that server can support, you can run the following perl program to test: new(PeerAddr => $ip, PeerPort => $port, Proto => "tcp", Timeout => 6, Type => SOCK_STREAM); } while ($socket[$i]); print "Max TCP connections supported for this client: ", $i-1, "\n"; ## end of program Of course for this test you have to make sure you have more fds to use than the server.

Brian Barrett brbarret (at) osl (dot) iu (dot) edu 02 Oct 2003 On linux, there is a default per-process limit of 1024 (hard and soft limits) file descriptors. You can see the per-process limit by running limit (csh/tcsh) or ulimit -n (sh). There is also a limit on the total number of file descriptors that the system can have open, which you can find by looking at /proc/sys/fs/file-max. On my home machine, the max file descriptor count is around 104K (the default), so that probably isn't a worry for you. There is the concept of a soft and hard limit for file descriptors. The soft limit is the "default limit", which is generally set to somewhere above the needs of most applications. The soft limit can be increased by a normal user application up to the hard limit.
As I said before, the defaults for the soft and hard limits on modern linux machines are the same, at 1024. You can adjust either limit by adding the appropriate lines in /etc/security/limits.conf (at least, that seems to be the file on both Red Hat and Debian). In theory, you could set the limit up to file-max, but that probably isn't a good idea. You really don't want to run your system out of file descriptors. There is one other concern you might want to think about. If you ever use any of the created file descriptors in a call to select(), you have to ensure all the select()ed file descriptors fit in an FD_SET. On Linux, the size of an FD_SET is hard-coded at 1024 (on most of the BSDs, Solaris, and Mac OS X, it can be altered at application compile time). So you may not want to ever set the soft limit above 1024. Some applications may expect that any file descriptor that was successfully created can be put into an FD_SET. If this isn't the case, well, life could get interesting. AlberT AlberT (at) SuperAlberT (dot) it 02 Oct 2003 from man setrlimit:

getrlimit and setrlimit get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure (the rlim argument to both getrlimit() and setrlimit()): The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value. The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()). RLIMIT_NOFILE Specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(), pipe(), dup(), etc.) to exceed this limit yield the error EMFILE.
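The soft and hard limits described above can be inspected and adjusted from a program as well as from the shell. The sketch below uses Python's standard resource module, which wraps getrlimit() and setrlimit(); the function name is hypothetical. It raises the soft limit on file descriptors to the hard limit, which an unprivileged process is allowed to do, and leaves the limit unchanged if the hard limit is unlimited or the kernel rejects the request.

```python
import resource

def raise_soft_nofile_to_hard():
    """Raise this process's soft fd limit to its hard limit; return the new soft limit."""
    soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard == resource.RLIM_INFINITY:
        return soft                      # no finite ceiling to raise to; leave unchanged
    try:
        resource.setrlimit(resource.RLIMIT_NOFILE, (hard, hard))
    except (ValueError, OSError):
        return soft                      # request rejected; keep the old limit
    return resource.getrlimit(resource.RLIMIT_NOFILE)[0]

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print("before: soft", soft, "hard", hard)
print("after:  soft", raise_soft_nofile_to_hard())
```

As Brian Barrett notes, a process that passes its descriptors to select() should think twice before going above 1024, since descriptors beyond the size of an FD_SET cannot be handled safely.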

realservers filling conntrack tables (LVS-DR) Wiboon Warasittichai wiboon (dot) w (at) psu (dot) ac (dot) th 08 Jun 2007 I set up 2 directors IP 192.168.96.11 (active/standby) with 4 real servers (squid) a week ago. I noticed in dmesg output in the realservers So I restarted iptables. Then, ip_conntrack went below the 65536 max. But within a day, it reached max ip_conntrack again. I checked with cat /proc/net/ip_conntrack | grep UNREPLIED which showed many lines with ESTABLISHED and UNREPLIED. I think it's because the squid realservers send the answer from the internet directly back to the client and then the client sends its FIN to the director, isn't it? Does IPVS/DR have any configuration to get rid of these ip_conntrack entries? Do I need to unload module ip_conntrack on all squid boxes? Graeme Fowler graeme (at) graemef (dot) net 08 Jun 2007 Ideally you have to unload the module. Why do you have the conntrack module loaded in the first place? An alternative method, if you absolutely must keep the conntrack rules in place, is to explicitly use the NOTRACK target on packets destined for the Squid service. On the realserver, as an example: The first line will remove tracking from packets destined for TCP port 3128 on the realserver.
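The rules from the original posting did not survive, so the following is a reconstruction under stated assumptions rather than Graeme's text. It assumes an iptables with the raw table and the NOTRACK target, one rule in PREROUTING for packets arriving for the service, and one in OUTPUT for the replies; the second rule in particular is a guess at what the missing second line did. The sketch only builds and prints the commands, and the function name is hypothetical.

```python
SQUID_PORT = 3128   # port named in the posting

def notrack_rules(port=SQUID_PORT):
    """Return iptables argument lists that exempt one TCP service from conntrack."""
    return [
        # packets arriving for the service on the realserver
        ["iptables", "-t", "raw", "-A", "PREROUTING",
         "-p", "tcp", "--dport", str(port), "-j", "NOTRACK"],
        # replies leaving the service (assumed counterpart of the first rule)
        ["iptables", "-t", "raw", "-A", "OUTPUT",
         "-p", "tcp", "--sport", str(port), "-j", "NOTRACK"],
    ]

for rule in notrack_rules():
    print(" ".join(rule))
```

To apply the rules, each list can be passed to subprocess.run() as root on the realserver. Existing entries remain in the conntrack table until they time out.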

Conntrack, effect on throughput I can't figure out if this should belong in the performance or the netfilter section - any suggestions? Conntrack is part of netfilter. It keeps track of connections and knows the relationship of a packet to existing connections. This enables filtering to reject or allow packets e.g. a packet appearing to be part of a passive ftp connection is only valid after a call to setup a passive ftp connection. For a tutorial on conntrack, see the links inside Iptables - How Does It Work? (http://www.sns.ias.edu/~jns/wp/2006/01/24/iptables-how-does-it-work/) by James Stephens.

Rodger Erickson rerickson (at) aventail (dot) com 17 Dec 2001 Does anyone have any comments they can make on the effect of conntrack on LVS performance? The LVS device I'm using also has to do some DNAT and SNAT, which require conntrack to be enabled.

Julian: We need to port the 2.2 masquerade to 2.4 :) LVS reused some code from 2.2 but much of it is removed and I'm not sure it can be added back easily. It would be better to redesign some parts of Netfilter for 2.5 or 2.7 :) You can use the ipchains compat module, but maybe it does not work for FTP and is broken in some places. The best approach might be to test the slowdown when using both LVS and conntracking and if it's not fast enough, buy faster hardware. It will take less time :) You can test the slowdown with some app or even with testlvs.

patrick edwards My lvs works with no problem. However within a matter of an hour or two my bandwidth drops to virtually nothing and the CPU load goes ballistic. I have a 100Mbit internal network, but at times I'm lucky to see 50Kps.

Christian Bronk chris (at) isg (dot) de 15 Jan 2002: For our 2.4 kernel test servers, it turned out that ipchains under kernel 2.4 does full connection-tracking and makes the system slow. Try to use iptables or the arp-patch instead.

Fabrice fabrice (at) urbanet (dot) ch 17 Dec 2001 When I ran testlvs, with conntrack enabled on the client machine (the one that runs testlvs), I had a mean of about 2000 SYN/s. When I removed those modules (there are many conntracks) I reached 54'000 SYN/s!

Julian Anastasov ja (at) ssi (dot) bg 22 Dec 2001: I performed these tests with 40K SYN/s incoming (director near its limits), LVS-NAT, 2.4.16, noapic, SYN flood from testlvs with -srcnum 32000 to 1 realserver.

only IPVS 0.9.8: able to send the 40K/sec (same as incoming), 3000 context switches/sec
after modprobe ip_conntrack hashsize=131072: 10-15% performance drop, 500 context switches/sec
after modprobe iptable_nat (no NAT rules): 5% performance drop, same number of context switches/sec
Additional test, -j DNAT instead of IPVS: Tragedy: 1000P/s out, director is overloaded.

I looked into the ip_conntrack hashing. It is not perfect for incoming traffic between two IPs, but note that testlvs uses different IPs, so after a little tuning, it seems that DNAT's problem is not the bad hash function. Maybe I'm missing some NF tuning parameters.

Andy Levine: Is it absolutely necessary to have IP Connection tracking turned on in the kernel if we are using LVS-DR? We are experiencing performance hits with the connection tracking code (especially on SMP boxes) and would like to take it out of our kernel.

Wensong 25 Dec 2002: If you have a performance problem you can remove it. LVS uses its own simple and fast connection tracking for performance reasons, instead of using netfilter connection tracking. So it will not affect LVS if the netfilter conntrack modules are not loaded. LVS/NAT should work too without the conntrack modules.

Don't use the preemptible/preemptable/preemptive kernels

(The different versions of the word in the title are for searching.)

Brian Jackson

Just as a little experiment, since I have enjoyed my preemptible/low latency patches, I decided to test my lvs cluster with the patches. The results were interesting.

Roberto Nibali ratz (at) tac (dot) ch 10 Sep 2002 Preemptible kernels don't buy you anything on a server; simply speaking, they're for a desktop machine where you'd like to listen to mp3s (non-skipping of course) and compile the kernel. Low latency for tcpip as needed for LVS is incompatible with the concept of preemptible kernels, in that the network stack runs in softirqs and gets worked around by the kernel scheduler. If your driver generates a lot of IRQs for RX buffer dehooking, the scheduler must be invoked to get those packets pushed into the TCP stack or you lose packets. As long as you don't run X and some number crunching software on the realservers, preemptible kernels hurt TCP/IP stack performance IMHO.

9.6Gbps served using LVS-DR with gridftp

Horms 24 Nov 2005 This information comes from Dan Yocum, slightly reformatted and forwarded with permission. Note that while the cluster was pushing 9.6Gbps, the linux director was doing a negligible amount of work, which seems to indicate that LVS could push a great deal more traffic given sufficient realservers and end-users.

On Mon, Nov 21, 2005 at 01:51:27PM -0600, Dan Yocum wrote: Just a quick update on the LVS-DR server I used for our bandwidth challenge last week at SuperComputing: The director saw an increase of around 120Kbps when we ran our bandwidth challenge tests. At times the aggregate bandwidth out of the 21 realservers was around 9.6Gbps, so the amount of traffic on the director was negligible. There were 41 clients grabbing the data from the servers; each machine ran a gridftp client with 16 parallel streams. Packet size was standard (1500), so no jumbo frames. The URL for the mrtg graphs is here: http://m-s-fcc-mrtg.fnal.gov/~netadmin/mrtg/mrtg-rrd.cgi/s-s-fcc1-server/s-s-fcc1-server_3_3.html The test occurred on Wed of last week in the last half of the day.

Sure, no problem. One of the BWC participants put a page up here if you're interested in the details of the other participants: http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2005/hiperf.html

LVS: Monitoring

CPU usage/load level on the director? Michael McConnell:

Top doesn't display CPU usage of ipchains or ipvsadm. vmstat doesn't display CPU usage of ipchains or ipvsadm.

Joe

ipchains and ipvsadm are user tools that configure the kernel. After you've run them, they go away and the kernel does its new thing (which you'll see in "system"). Unfortunately, for some reason that no-one has explained to me, "top/system" doesn't see everything. I can have an LVS-DR director which is running 50Mbps on a 100Mbps link and the load average doesn't get above 0.03 and system time is negligible. I would expect it to be higher.

Julian 10 Sep 2001 Yes, the column is named "%CPU", i.e. the CPU time spent on one process relative to all processes. As for the load average, it is based on the length of the queue of all processes in the running state (the number of processes except the current one). As we know, LVS does not interact with any processes except ipvsadm. So, in normal mode the LVS box just forwards packets without spending any CPU cycles on processes. This is the reason we want to see load average 0.00. OTOH, vmstat reads /proc/stat, which has the counters for all CPU times. Using the current value of jiffies (the kernel tick counter), user apps can see the system, the user and the idle CPU time. LVS is somewhere in the system time. For more accurate measurement of the CPU cycles in the kernel there are some kernel patches/tools that are exactly for this job - to see how much time the CPU spends in particular kernel functions.
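To see the "system" share that Julian mentions without vmstat, you can sample the cpu line of /proc/stat twice and take the difference. This is a minimal sketch, assuming the first four counters on that line are user, nice, system and idle (later kernels append more fields, which this sketch ignores); the function names are made up for this example.

```python
def parse_cpu_line(stat_text):
    """Return the aggregate user/nice/system/idle tick counters."""
    for line in stat_text.splitlines():
        fields = line.split()
        if fields and fields[0] == "cpu":
            user, nice, system, idle = (int(x) for x in fields[1:5])
            return {"user": user, "nice": nice, "system": system, "idle": idle}
    raise ValueError("no aggregate 'cpu' line found")


def system_percent(before, after):
    """Percentage of elapsed ticks spent in system time between two samples."""
    delta = {key: after[key] - before[key] for key in before}
    total = sum(delta.values())
    return 100.0 * delta["system"] / total if total else 0.0


if __name__ == "__main__":
    # Two made-up samples; on a director you would read /proc/stat twice,
    # a few seconds apart, while traffic is flowing.
    first = parse_cpu_line("cpu  100 0 50 850\ncpu0 100 0 50 850\n")
    second = parse_cpu_line("cpu  110 0 70 920\ncpu0 110 0 70 920\n")
    print("system: %.1f%%" % system_percent(first, second))
```

The load average would stay near 0.00 even when this number is not, since no process is waiting in the run queue.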

LVS throughput at the director with ipvsadm

The number of active/inactive connections is available from the output of ipvsadm.

Julian 22 May 2001 Conns is a counter and is incremented when a new connection is created. It is not incremented when a client re-uses a port to make a new connection (Joe, - the default with Linux).

      -> RemoteAddress:Port        Forward Weight ActiveConn InActConn
    TCP  lvs2.mack.net:0 rr persistent 360
      -> RS2.mack.net:0            Route   1      0          0
      -> RS1.mack.net:0            Route   1      0          0
    TCP  lvs2.mack.net:telnet rr
      -> RS2.mack.net:telnet       Route   1      0          0
      -> RS1.mack.net:telnet       Route   1      0          0

You can monitor connections with snmp.

Dennis Kruyt d (dot) kruyt (at) zx (dot) nl 30 Jun 2004

I use lvs-snmp (http://anakin.swiss-support.net/~romeo/lvs-snmp/) and cacti to graph the connections.

AJ Lemke

I am running a 2 node lvs-cluster and was wondering if the list could recommend a traffic monitoring program. My LVS is the frontend for a reverse proxy cache and I would like to know the traffic that each VIP is handling. I need to know the data rates on a per IP basis. I use mrtg at the switch level but I need to have more granularity, hence the need for a per IP basis.

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 11 Jul 2004 munin (http://www.linpro.no/projects/munin/) has a plugin for this. You can get the numbers you need with ipvsadm

      -> RemoteAddress:Port
    TCP  smtp.uio.no:smtp          1508879 38457326      0   10461M       0
      -> mail-mx6.uio.no:smtp       374117  9490846      0    2664M       0
      -> mail-mx3.uio.no:smtp       377646  9961956      0    2543M       0
      -> mail-mx2.uio.no:smtp       378502  9288837      0    2707M       0
      -> mail-mx1.uio.no:smtp       378614  9715687      0    2546M       0
    # ipvsadm -L -t smtp:smtp --rate
    Prot LocalAddress:Port             CPS    InPPS   OutPPS    InBPS   OutBPS
      -> RemoteAddress:Port
    TCP  smtp.uio.no:smtp                7       85        0    20480        0
      -> mail-mx6.uio.no:smtp            1       17        0     1126        0
      -> mail-mx3.uio.no:smtp            1       17        0     2023        0
      -> mail-mx2.uio.no:smtp            2       26        0     6681        0
      -> mail-mx1.uio.no:smtp            2       25        0    10650        0
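If you want to feed the --rate numbers to a grapher, the output is regular enough to parse. This is a sketch that turns the listing above into a list of services, each with its realservers. The function and dictionary key names are made up for this example, and it only handles the plain-number columns of --rate (not the K/M suffixes that the stats listing shows).

```python
RATE_COLUMNS = ("CPS", "InPPS", "OutPPS", "InBPS", "OutBPS")

# Rate listing copied from Kjetil's posting.
SAMPLE = """\
Prot LocalAddress:Port CPS InPPS OutPPS InBPS OutBPS
  -> RemoteAddress:Port
TCP  smtp.uio.no:smtp 7 85 0 20480 0
  -> mail-mx6.uio.no:smtp 1 17 0 1126 0
  -> mail-mx3.uio.no:smtp 1 17 0 2023 0
  -> mail-mx2.uio.no:smtp 2 26 0 6681 0
  -> mail-mx1.uio.no:smtp 2 25 0 10650 0
"""


def parse_rate_output(text):
    """Parse 'ipvsadm -L --rate' style text into services and realservers."""
    services = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) != 2 + len(RATE_COLUMNS):
            continue  # blank lines and the short '-> RemoteAddress:Port' header
        marker, address, *numbers = fields
        if not all(n.isdigit() for n in numbers):
            continue  # the 'Prot LocalAddress:Port CPS ...' column header
        rates = dict(zip(RATE_COLUMNS, (int(n) for n in numbers)))
        if marker == "->":
            if services:  # a realserver belongs to the service above it
                services[-1]["realservers"].append(
                    {"address": address, "rates": rates})
        else:
            services.append({"proto": marker, "address": address,
                             "rates": rates, "realservers": []})
    return services


if __name__ == "__main__":
    for svc in parse_rate_output(SAMPLE):
        print(svc["proto"], svc["address"], svc["rates"])
        for rs in svc["realservers"]:
            print("   ", rs["address"], rs["rates"]["InBPS"], "bytes/s in")
```

In Kjetil's listing the realserver InBPS values add up to the service total (20480), which makes a handy sanity check on the parse.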

Monitoring: LVS director throughput statistics from the /proc system (originally /proc/net/ip_vs_stats) Cyril Bouthors:

Where can I get the info originally in /proc/net/ip_vs_stats and removed since 0.9.4?

Wensong Zhang wensong (at) gnuchina (dot) org 20 Nov 2001 for global stats /proc/net/ip_vs_stats You can get per-service statistics by If you want to program to get statistics info, use libipvs. Here's the writeup that went with the original code.

Packet throughput (in 64-bit integers) is in /proc/net/ip_vs_stats or /proc/net/ip_masq/vs_stats. The counters are not resettable; you have to keep the previous reading and subtract. Output is in hexadecimal. Here are the statistics
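Keeping the previous reading and subtracting is easy to script. This is a minimal sketch that converts one row of hexadecimal totals into numbers and turns two readings into per-second rates. The column order (conns, incoming packets, outgoing packets, incoming bytes, outgoing bytes) and the 64-bit counter width are assumptions based on the description above, so check them against your own /proc file; the function names are made up.

```python
FIELDS = ("conns", "inpkts", "outpkts", "inbytes", "outbytes")


def parse_hex_totals(line):
    """Convert one whitespace-separated row of hex counters to integers."""
    values = [int(token, 16) for token in line.split()]
    if len(values) != len(FIELDS):
        raise ValueError("expected %d counters, got %d"
                         % (len(FIELDS), len(values)))
    return dict(zip(FIELDS, values))


def rates(previous, current, seconds, counter_bits=64):
    """Per-second rates from two readings taken 'seconds' apart.

    The modulo keeps the difference correct if a counter wrapped once
    between readings (counter_bits is an assumption, see above).
    """
    modulus = 1 << counter_bits
    return {name: ((current[name] - previous[name]) % modulus) / seconds
            for name in FIELDS}


if __name__ == "__main__":
    # Made-up readings, 10 seconds apart.
    before = parse_hex_totals("1A 5B 0 1234 0")
    after = parse_hex_totals("24 BF 0 2234 0")
    print(rates(before, after, 10))
```

Storing the previous reading (in a file, or in memory if the script keeps running) is up to you; nothing needs to be zeroed in the kernel.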

Joe

Can I zero out these counters if I want to get rates, or should I store the last count?

Ratz, May 2001 There was a recent (2 months ago) talk about zeroing in-kernel counters and I'm not so sure if all the kernel hacker gurus agreed, but: You must not zero a counter in the kernel! I didn't really understand the arguments for or against zeroing counters so I'm not a big help here, but if others agree we certainly can add this feature. It would be ipvsadm -Z as an analogy to ip{chains|tables}. BTW, we are proud of having 64-bit counters in the kernel :) Storing ... there are different approaches to this (in order of complexity):

Use a script that extracts the info and writes it flat to a file.

Use MRTG or rrdtool, since I reckon you wanted to use the stats to generate some graphics anyway. These tools handle the problem for you. MRTG requires SNMP, but you can have a slightly modified snmpd.conf and execute a script that parses /proc/net/ip_masq/vs_stats and writes it into a file. The advantage of this over the first one is that you can write the current number into one file and mrtg will know how to draw the graph. I give you an example: We have a customer named plx. Now he has only one service and 2 realservers. We extended the snmpd.conf with following lines: The scripts are awk scripts that get the information according to the service or the realserver. You can then do a table walk of the OID 1.3.6.1.4.1.2021.8 to see what your values are: Example output if everything is ok: Here you see that the total number of sessions on the load balancer serving about 8 customers is currently 292, and that customer plx has no connections so far.

Write a MIB for LVS stats.

MRTG family: Intro

There is a family of monitoring tools descended from MRTG. These now include RRDtool (a descendant of MRTG, written by the same author, Tobias Oetiker) and wrappers around RRDtool like lrrd (which have spawned their own family of programs, e.g. cricket, to monitor and graph just about anything you like). lrrdtool can/does use nagios.

Laurie Baker lvs (at) easytrans (dot) com 20 Jan 2004 Nagios is a monitoring tool previously known as Netsaint.

I've read the documentation for mrtg and several of its descendants and haven't been able to figure out how they work well enough to get them going. While the syntax of all of the commands is available, there is no global picture of how they are used to make a working set of programs. I saw Tobias give a talk at Usenix one year about MRTG and while I knew what it did, I didn't know how to set it up. Some people have got these packages going, presumably needing less documentation than I do. I'd like a worked example of how a single simple variable (e.g. the contents of /proc/loadavg) is sampled and plotted. The accompanying packages needed (e.g. SNMP, php, gd...) are not described. While a competent sysadmin will be able to work out what is missing from the output of the crashes, it would be better to prepare ahead of time for the packages needed, so that you can plan the time for the install and won't have to stop for lack of information that you could have handled ahead of time.

MRTG family: LVSGSP

This was the first attempt to produce a graphical monitoring tool for LVS. It doesn't seem to be under active development anymore (Apr 2004) and people are now using rrdtool (or ganglia which uses rrdtool) (see below). Alexandre has a new address Alexandre (dot) Cassen (at) free (dot) fr and moved his pages to Alexandre's open source code (http://www.lnxos.net/). The links below to Alexandre's pages are dead. You can find files here lvsgsp-0.0.4.tar.gz (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/lvsgsp-0.0.4.tar.gz). alex alshu (at) tut (dot) by 20 May 2008 has provided lvsgsp_newscripts.tar.bz2 (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/lvsgsp_newscripts.tar.bz2) that he uses with lvsgsp.

Alexandre Cassen alexandre (dot) cassen (at) wanadoo (dot) fr, the author of keepalived, has produced a package, LVSGSP, that runs with MRTG to output LVS status information. Currently active and inactive connections are plotted (html/png). The LVSGSP package includes directions for installing and a sample mrtg.cfg file for monitoring one service. The mrtg.cfg file can be expanded to multiple services.

A note from Alexandre
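For the single-variable case asked for above, the whole cycle with rrdtool is three commands: create a database once, update it on every sample, and graph it when you want a picture. This is a sketch that reads the 1 minute load from /proc/loadavg style text and builds the argument lists for those commands. The file names load.rrd and load.png, the data source name load1 and the 5 minute step are choices made for this example only, and the helper names are made up.

```python
def read_load1(loadavg_text):
    """First field of /proc/loadavg is the 1 minute load average."""
    return float(loadavg_text.split()[0])


def rrd_create_args(rrd_path, step=300):
    """Create a database: one GAUGE, one day of 5 minute averages."""
    return ["rrdtool", "create", rrd_path, "--step", str(step),
            "DS:load1:GAUGE:%d:0:U" % (2 * step),  # heartbeat = 2 steps
            "RRA:AVERAGE:0.5:1:288"]               # 288 x 5 min = 24 h


def rrd_update_args(rrd_path, value):
    """Store one sample; 'N' means 'timestamp it now'."""
    return ["rrdtool", "update", rrd_path, "N:%s" % value]


def rrd_graph_args(rrd_path, png_path):
    """Plot the last day as a single blue line."""
    return ["rrdtool", "graph", png_path, "--start", "-1d",
            "DEF:load=%s:load1:AVERAGE" % rrd_path,
            "LINE1:load#0000FF:1 min load"]


if __name__ == "__main__":
    sample = "0.03 0.01 0.00 1/57 1234\n"   # made-up /proc/loadavg contents
    load = read_load1(sample)
    # Each list can be handed to subprocess.run(args, check=True) on a
    # machine with rrdtool installed; here they are only printed.
    print(" ".join(rrd_create_args("load.rrd")))
    print(" ".join(rrd_update_args("load.rrd", load)))
    print(" ".join(rrd_graph_args("load.rrd", "load.png")))
```

In use, the create command is run once by hand and the update command goes into a cron job every 5 minutes. No SNMP, php or web server is needed for this much.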

Concerning the use of MRTG directly onto the director, we must take care of the computing CPU time monopolised by the MRTG graph generation. On a very overloaded director, the MRTG processing can degrade LVS performance.

MRTG Peter Nash peter (dot) nash (at) changeworks (dot) co (dot) uk 18 Nov 2003 I'm using a perl script to pull LVS statistics from my directors into MRTG using the ucd-snmp-lvs module. I'm sure this could be easily modified to work with RRDTool. I'm no perl programmer so I'm sure there are better ways to do this but it's been working for me for the last 3 months. Since my MRTG runs on a remote server (not the directors) using SNMP gives me the remote access I need. The main problem to overcome was that the "instance number" of a particular "real service" is dependent on the order in which the services are added to the IPVS table. If you are using something like ldirectord to add/remove services then this order can vary, so the script has to solve this problem. I also had a few problems getting the ucd-snmp-lvs module to compile with net-snmp on my RH8 directors but that was probably down to my lack of knowledge, I got there in the end! The MRTG call to the script is as follows (director names, SNMP community and IP addresses are "dummies"): This aggregates the results from both primary and backup director so it doesn't matter which one is "active". The script returns zeros if the requested service is not currently in the LVS table on the target director.

MRTG family: RRDtool

Salvatore D. Tepedino sal (at) tepedino (dot) org 21 Nov 2003 I posted the new version on my site: http://tepedino.org/lvs-rrd/. The new version has a lot of code cleanup, much more flexibility in the coloring, a command line arg so you can just graph traffic to one port (ie: just port 80 traffic), and the update script has been changed slightly to remove a redundant loop (Thanks Francois! If I do something that obviously silly again, you can smack me!) and to remove the need to specify what type of LVS yours is (Route, Masq, etc). Now it should collect data on all servers in the LVS. Next step is to figure out how to graph specific services (VIP/Port combinations instead of just specific ports)...

Jun 2006. tepedino.org is not on the internet. The last entry in the wayback machine is 10 Feb 2005. Leon Keijser e-mailed me lvs-rrd-v0.7.tar.gz (http://www.austintek.com/WWW/LVS/LVS-HOWTO/HOWTO/files/lvs-rrd-v0.7.tar.gz) which has a Changelog of Jan 2006.

Sebastian Vieira sebvieira (at) gmail (dot) com 10 Nov 2006 For those interested, the website of lvs-rrd is back up again at its usual address: http://tepedino.org/lvs-rrd/

Joe: I contacted Sal off-list, to find there'd been problems at the ISP. He's back, with the same e-mail address etc. v0.7 is still his latest code. If the server goes down again, you can contact him at sal (dot) tepedino (at) gmail (dot) com.

21 Jan 2004 This new version allows you to graph connections to a specific VIP or realserver or VIP port or RS Port or any combination of those via command line options. It also adds an option to flip the graph for people with more inactive than active connections. Also it can spit out an HTML page for the specific graphs it created so a simple one line php page (included) can run the script and display the output.

Joe: various people (including Francois Jeanmougin) have started sending patches to Salvatore.

17 Jan 2004 This new version allows you to graph connections to a specific VIP or realserver or VIP port or RS Port or any combination of those via command line options. It also adds an option to flip the graph for people with more inactive than active connections (you can have either the ActiveConn or InActConn plotted in the negative region below the X-axis). Also it can spit out an HTML page for the specific graphs it created so a simple one line php page (included) can run the script and display the output.

Joe - Jan 2004: lvs-rrd worked straight out of the box for me. You first install rrdtool from http://people.ee.ethz.ch/~oetiker/webtools/rrdtool/ with the standard ./configure ; make ; make install. The rrdtool executables are standard ELF files (not perl scripts as I thought).

    /lib/libm.so.6 (0x40017000)
    libc.so.6 => /lib/libc.so.6 (0x4003a000)
    /lib/ld-linux.so.2 => /lib/ld-linux.so.2 (0x40000000)

rrdtool has the libraries it needs (zlib, gd) so you don't need any recursive downloading. Then you follow Salvatore's "Setup" instructions and you'll soon have gifs showing the activity on your LVS. The filenames that Salvatore uses for his databases are derived from the ipvsadm (hex) information in /proc/net/ip_vs. Thus one of my rrd files is lvs.C0A8026E.0017.C0A8010C.0017.rrd representing VIP:port=192.168.2.110:23, RIP:port=192.168.1.12:23. You don't have to look at these files (they're binary rrd database files) and naming them this way was easier than outputting the IP in dotted quad with perl. Salvatore supplies utilities (which he grabbed off the internet) to convert the IP:ports between dotted quad and hex.

Tore Anderson tore (at) linpro (dot) no 07 Dec 2003 There is also LRRD. Plugins for monitoring ipvsadm output are already included; for a demonstration you could take a look at the "screenshot" pages at http://linpro.no/projects/lrrd/example/runbox.com/cujo.runbox.com.html

Joe: Tore is one of the lrrd developers.

After getting Salvatore's code running, I reviewed the lrrd docs and tutorials to realise that there never was any hope of me understanding them without outside help. The docs are written for data coming from snmp, and I assumed that snmp was the only way of getting data. As Salvatore's code shows, rrdtool can use data from anywhere: if you can retrieve/fetch/get your data in a script and send it as a parameter to rrdtool, then you can store and graph it with rrdtool.

taqu taqumd (at) gmail (dot) com 25 Nov 2008 Hi, I recently created another LVS graphing tool. These scripts parse ipvsadm -L -n --stats --exact and generate the following graphs: Project Home (http://sourceforge.jp/projects/lvs-stats) Screen Shot (http://sourceforge.jp/projects/lvs-stats/wiki/FrontPage)
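If you do want to read the lvs-rrd database filenames, the hex is easy to undo. This is a small sketch (not Salvatore's utility, just an illustration with a made-up function name) that decodes a name of the form lvs.VIP.port.RIP.port.rrd back to dotted quad and decimal ports.

```python
def decode_rrd_name(filename):
    """Return ((vip, vport), (rip, rport)) from an lvs-rrd style filename."""
    parts = filename.split(".")
    if len(parts) != 6 or parts[0] != "lvs" or parts[5] != "rrd":
        raise ValueError("not of the form lvs.<VIP>.<port>.<RIP>.<port>.rrd")

    def dotted_quad(hex_ip):
        number = int(hex_ip, 16)
        # Peel off the four bytes, most significant first.
        return ".".join(str((number >> shift) & 0xFF)
                        for shift in (24, 16, 8, 0))

    vip, vport, rip, rport = parts[1:5]
    return ((dotted_quad(vip), int(vport, 16)),
            (dotted_quad(rip), int(rport, 16)))


if __name__ == "__main__":
    print(decode_rrd_name("lvs.C0A8026E.0017.C0A8010C.0017.rrd"))
```

For the filename quoted above this gives 192.168.2.110:23 for the VIP and 192.168.1.12:23 for the realserver, matching the values in the text.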

MRTG family: cacti

cacti is another rrdtool based monitoring tool, which has been adapted for lvs.

Bruno Bonfils asyd (at) debian-fr (dot) org 26 Jan 2004 If any of you are running cacti to monitor an LVS cluster, you'll probably be interested in my xml data query and the associated template. Both are available at http://www.asyd.net/cacti/.

MRTG family: Ganglia (incl. INSTALL)

ganglia intro

Karl Kopper karl (at) gardengrown (dot) org 03 Dec 2003 Another cool tool for monitoring the Real Servers is Ganglia. (With version 2) you run gmond monitoring daemons on each RS and a single gmetad daemon to poll the gmonds on a server (that is running Apache) outside the cluster. Then with the Ganglia Web Frontend you get great color graphs that help you to find "hot spots". You can then write your own gmetric script to create your own custom graph for anything happening on the Real Servers (I suppose you could cull the Apache logs for "GET" operations--check out the Gmetric Script Repository). Incidentally, you can also add the gexec program to submit batch jobs (like cron jobs) to the least loaded realserver or to all nodes simultaneously.

ganglia is designed for beowulfs. It produces nice colored graphs which managers love and I'm sure lots of beowulfs have been sold because of it. However there is a catch 22 with using it. The compute nodes on a beowulf run synchronously, calculating a subset of a problem. At various points in a calculation, results from the compute nodes need to be merged and all compute nodes halt till the merge finishes. The merge cannot start till all nodes have finished their part of the calculation and if one node is delayed then all the other nodes have to wait. It is unlikely that the ganglia monitoring jobs will run synchronised to the timeslice on each compute node. Thus in a large beowulf (say 128 nodes), it will be likely that one of the compute nodes will have run a ganglia job and the other 127 will have to wait for this node to complete its subset of the main calculation. So while ganglia may produce nice graphs for managers, it is not compatible with large or heavily loaded beowulfs. None of this affects an LVS, where jobs on each realserver run independently. Ganglia should be a good monitoring tool for LVSs.

ganglia install

ganglia is a package for monitoring parameters on a set of nodes, forwarding the data to a display node where the data is displayed as a set of graphs. By default ganglia displays such things as load_average, memory usage, disk usage, network bandwidth. Instructions in the documentation show how to add graphs of your own parameters. The data on the display node is stored by rrdtool. The documentation was not clear to me and the installation took several attempts before I got a working setup. These notes are written on the 3rd iteration of an install. It's possible that I handled something in an earlier iteration that I forgot about. Ganglia has the ability to use gexec by Brent Chun, a tool to remotely execute commands on other nodes (like rsh and ssh). You can configure ganglia to run with or without gexec. Unfortunately I couldn't get gexec to run properly on Linux and on contacting the author (Mar 2004) I find that gexec was developed under another OS (*BSD ?) and, because of problems with the Linux pthread implementation, doesn't work on Linux. He's working on fixes.

Karl Kopper karl (at) gardengrown (dot) org 11 Apr 2004

Matt Massie of the Ganglia project tried to pull the gexec code into the new Ganglia distro but failed due to this pthreads problem as I understand it, but if you download the old gexec and authd packages directly from Brent's (old) web page I don't think they have the pthreads problem. Well, actually, there is a problem we've had with gexec when you try to run a command or script on all nodes (the -n 0 option) that we've never fully examined. The problem causes the -n 0 option to be so unreliable we don't use it. The "-n 1" option works fine for us (we use it for all production cron jobs to select the least loaded cluster node). For the moment you might be better off using the same ssh keys on all cluster nodes and writing a script (this is the way I like to do it now when I have to reliably run a command on all nodes). The great thing about gexec, though, is that it will run the command at the same time on all nodes--the ssh method has to step through each node one at a time (unless, I suppose, you background the commands in your script). Hmmm... There's an idea for a new script...

gexec has similar functionality to dancer's shell - dsh which uses ssh or rsh as transport layer. Using ssh as a transport layer has its own problems - you need passphrase-less login when using ssh for dsh, but you need passphrase enabled login for users starting their sessions.

There are 3 types of nodes in ganglia

monitored nodes: these will be your realservers and director(s) (i.e. all machines in the LVS). These nodes run gmond, the ganglia monitoring demon, which exchanges data with other monitored nodes by multicast broadcasts. gmond also exchanges data with the relay nodes.

relay nodes: these run gmetad. For large setups (e.g. 1024 nodes), gmetad collects data from gmond in tree fashion and feeds the data to the GUI node (which is also running gmond). gmetad like gmond exchanges data by multicast broadcasts. I didn't quite figure out what was going on here and since I only had a small LVS, I just ran gmetad on the GUI node. I assume if you had (say) 8 LVS's running and one GUI machine, that gmond would be running on all nodes and that gmetad would be running on at least one node that was guaranteed to be up in each LVS. For an LVS with a failover pair of directors, gmetad would run on both directors. I didn't figure out how to set up a gmetad node, if it wasn't also the GUI node. From gmetad.conf, it would appear that each gmetad keeps its own set of rrd database files (presumably these are duplicates of the set on the GUI node). Presumably you should keep the rrd database files in the same location as for the GUI node (for me in DocumentRoot/ganglia/rrds/), just to keep things simple, but I don't know. gmetad is not happy if you shut it down while gmond is running, so I modified the gmetad init file to first shut down gmond.

node with the GUI: this node collects the data with gmetad, stores it with rrdtool, and displays it in a webpage using the php files in gmetad-webfrontend. This machine requires apache (I used apache-2.x.x) and php4. On an LVS with a single director, the node with the GUI will likely be the director. In an LVS with an active/backup pair of directors, you would probably have both directors run gmetad and have the GUI running (with gmetad) on an administrative machine.

If you like using netstat -a and route rather than their -n counterparts, then you can add the following -

Ganglia is installed differently depending on the role of the machine in the data path.

machines being monitored: these run gmond. gmond is found in the ganglia-monitor-core. To run gmond you do not need rrdtool to be installed. However compilation of gmond requires /usr/lib/librrd.a and /usr/include/rrd.h. Unless you already have these available, you will first have to compile rrdtool on the monitored node. After compilation of rrdtool, you don't have to install it, just copy rrd.h and librrd.a to their target directory. To compile rrdtool, you need to have perl installed to produce the rrd manpages (I needed perl-5.8.0; perl-5.6.1 produced errors). I couldn't see any way in the Makefile of producing just librrd.a. A make lib; make lib_install option would be nice here. After installing librrd.a and rrd.h, do the default ganglia-monitor-core install: ./configure; make; make install. This will install /usr/bin/gmetric, /usr/bin/gstat and /usr/sbin/gmond. Set up the rc file gmond/gmond.init to start gmond on boot. Copy the default conf_file gmond/gmond.conf to /etc/ and although you will have to modify it shortly, for now don't mess with it. gmond does not need a conf file to start and will assume the values in the default conf file if the conf file doesn't exist. Now see if you can start gmond - you should see 8 copies in the ps table. There are several things that can go wrong at this stage, even if gmond starts. There is no log file for gmond. To figure out problems, you turn on debug in gmond.conf. After doing this, gmond will not detach and will send the debug output to the console. Do not leave debug on through a reboot, as the gmond rc file won't exit and the boot process will hang. gmond may not start. I got the debug message "gmond could not connect to multicast channel" when using an older (2.4.9) kernel, but not with a newer (2.4.20) kernel. If you have a multi-homed machine, gmond defaults to using eth1. If the other machines aren't multicast accessible via eth1, you won't know: gmond will happily broadcast out the wrong NIC, but will never hear anything back. If you watch the debug output, you will see messages about packets being sent out, but none about packets being received. When you've got the right NIC and gmond on other nodes are sending packets, you'll also see notices of packets being received. You should know which NIC you want the gmond packets to go out on, so set this now. If gmond is working properly, you should have 8 copies of gmond in the ps table. This node is ready to exchange information with other monitoring nodes. Leave /etc/gmond.conf for now. Here's netstat output for a monitored machine (realserver) running gmond

Not knowing much about multicast, I was surprised to find an IP:port in the output of netstat when the IP (239.2.11.71) was not configured on a NIC. The Multicast over TCP/IP HOWTO (http://www.ibiblio.org/pub/Linux/docs/HOWTO/other-formats/html_single/Multicast-HOWTO.html) only discusses multicast which needs to be routed (e.g. MBONE) and so all multicast IPs involved must be configured on NICs. Here's an explanation by Alexandre, who wrote Keepalived, which uses multicast in a similar fashion.

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr 11 Apr 2004

In mcast fashion, the Class D address is not configured on a NIC; you just join or leave the Class D address, the so-called mcast group. For mcast you can consider 2 different designs. Most common applications using multicast do it over UDP, but you can also create your own mcast protocol as VRRP or HSRP do; that way you are using mcast at the same layer as UDP without adding UDP overhead. Since mcast is not connection oriented, both designs, UDP or a pure RAW protocol, are allowed. This contrasts with the new SCTP protocol, which adds retransmission and a connection oriented design to a one-to-many design (called associations). So in mcast you must distinguish the sending and the receiving source. If using the UDP transport then you can bind sending/receiving points to a particular IP. In RAW fashion, you bind directly to the device. Keepalived/VRRP operates at RAW level, implementing its own protocol, and uses a pair of sending/receiving sockets on each interface that VRRP instances run on.
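Alexandre's point that you join a group, rather than configure its address on a NIC, can be seen in a few lines of socket code. This is a sketch of the UDP case only (not the RAW case that Keepalived uses). The group 239.2.11.71 is the one seen in the netstat output mentioned above; the port you pass to open_listener() is up to you, and the function names are made up for this example.

```python
import socket
import struct


def membership_request(group, interface_ip="0.0.0.0"):
    """Build the ip_mreq structure: group address + local interface address.

    0.0.0.0 lets the kernel pick the interface, which on a multi-homed
    machine may not be the NIC you want, so pass the NIC's own IP there.
    """
    return struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton(interface_ip))


def open_listener(group, port, interface_ip="0.0.0.0"):
    """Return a UDP socket that has joined the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # Joining is a socket option; no address is added to any NIC.
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, interface_ip))
    return sock


if __name__ == "__main__":
    request = membership_request("239.2.11.71")
    print(len(request), "byte membership request:", request.hex())
```

Calling open_listener("239.2.11.71", some_port) on a monitored node and then recvfrom() on the returned socket would show whether packets from the other nodes arrive on that NIC.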

machine with GUI: You should have apache/php4 installed. Compile/install rrdtool using all defaults (files will go in /usr/local/rrdtool-x.x.x/). Link rrdtool-x.x.x to rrdtool (so you can access rrdtool files from /usr/local/rrdtool/). Unless you want to do a custom configure for ganglia-monitor-core, also copy librrd.a to /usr/lib/ and rrd.h to /usr/include/ (as you did for the gmond nodes). Copy all the files from gmetad-webfrontend to DocumentRoot/ganglia/. Then mkdir DocumentRoot/ganglia/rrds/, the directory for the rrd database files. Edit DocumentRoot/ganglia/conf.php - some of the entries weren't obvious - here's some of my file: Add gmetad to the ganglia-monitor-core install by doing ./configure --with-gmetad; make; make install. You will get an extra file /usr/sbin/gmetad. Install gmetad/gmetad.initd as the init file and gmetad/gmetad.conf in /etc/. Start up gmetad when you should see 8 copies in the ps table. My install worked fine (after a bit of iterative fiddling with the conf files), so I don't know what you do if it doesn't work. By now the conf files need some attention and some of the entries in the two conf files must match up. match "name" in gmond.conf with "data_source" in gmetad.conf (e.g. "Bobs LVS cluster"). This string will be used as the name of a directory to store the rrd files, so don't put any fancy characters in here (like an apostrophe) - blanks in a directory name are already hard enough to deal with. "location": is a 3-D array to order the nodes for presentation in the "Physical View" page (a 3-D array is required for large clusters, where machines are located in 3-D, rather than in a single rack). If you don't specify location, then "Physical View" will give you its own reasonable view - a vertical stack of boxes summarising each node. If you do specify location, then each machine will be put in a Rack according to the first number. 
Machines with values 0,x,y will be listed as being in "Rack 0"; machines with 1,x,y will be listed in Rack 1 etc. The second dimension in the array determines the vertical position that ganglia puts the node in the rack. You can number the nodes according to their physical location (I have two beowulf master nodes in the middle of the rack, with 8 compute nodes above and 8 compute nodes below them), or logical location (the two directors can be on the top of the rack, with realservers below). You could have your directors in Rack 0, and your realservers in Rack 1. Nodes with higher number location will be placed on the "Physical View" page above nodes with lower numbers. Location 1,0,0 will be at the bottom of Rack 1, while location 1,15,0 will be above it. If you thought node 0 was going to be at the top of a Rack, then you're sadly mistaken (this order must be a Northern hemispherism). Presumably there is some connection between location and num_nodes, but I haven't figured it out; in some cases I've left the default value of num_nodes and in some cases I've put num_nodes=32 (larger than the actual number of nodes, in case of expansion). Only having a 1-D LVS, I didn't use the 3rd dimension (left it as 0). If two machines are given the same location, then only one of them will display in the summary on the "Physical View" page. trusted_hosts are only for data transfers between gmetad nodes (I think) - leave them as defaults. rrd_rootdir (which I set to DocumentRoot:/ganglia/rrds/) and setuid must match or gmetad will exit with error messages telling you to fix it. Restart gmetad and gmond (if they haven't been cleanly restarted yet). Here's netstat output for a GUI machine running both gmond and gmetad immediately after starting up the daemons. (The connections between localhost:highport and localhost:gmond come and go). Surf to http://my_url/ganglia. You should see a page with graphs of activity for your nodes. 
If you want the current information you have to Shift-reload, unlike with lvs-rrd, where the screen will automatically refresh every 5 mins or so. Presumably you can fiddle the ganglia code to accomplish this too (but I don't know where yet).

MRTG family: rrd images These images are to show that an LVS does balance the load (here number of connections) between the realservers. Salvatore D. Tepedino sal (at) tepedino (dot) org 25 Mar 2004.

LVS with 2 realservers, serving httpd, single day. rrd graph of connections to an LVS with 2 realservers, serving httpd, single day. More images are at Salvatore's lvs-rrd website (http://www.tepedino.org/lvs-rrd/). To get this graph, first you'd need to run the update script (included in the package) to generate the rrd files and start collecting the data. After a little while you can run the graphing script, which will see the rrd files and generate the graphs based off the data in them. The easiest way to use the script is to just extract it into the web root of your director (which I figured that a lot of people have as ultimate failovers if all their realservers go down), put it in the crontab (explained in the docs), wait a few minutes, then go to the index.php page and you should see the beginnings of a graph. The longer you let it (the cron job) run, the more data you've collected, the more data in your graphs. You don't need to know how to use RRD to use my script. The 'All: All RS: All:All' in the script just means "All VIPs:All Ports; RS: All Real servers: All ports". With the script you can select if you want to graph just connections to a specific VIP or RS, or VIP port or RS port, or any combination. Useful for large clusters. My script generates the rrd line necessary to generate the graphs (tepedino.org/lvs-rrd). If you run it in verbose mode, it will spit out the rrd command line it uses to generate the graphs. If you like, I can give you some help with RRD. It's not the most obvious thing in the world to learn, but I had a lot of time on my hands when I decided to learn it, so I got fairly decent at it.

LVS with 2 realservers, serving httpd, week, showing realserver failure. rrd graph of connections to an LVS with 2 realservers, serving httpd, week, realserver failure. Note the failure of realserver 216.82.75.205, between 0600-1200 on Thursday, with the other realserver picking up the load. Malcolm Turnbull malcolm (at) loadbalancer (dot) org 27 Mar 2004.

LVS with 4 realservers, serving httpd, single day. rrd graph of connections to an LVS with 4 realservers, serving httpd, single day. More images are available at loadbalancer.org (http://www.loadbalancer.org/lbadmin/stats/chart.php).

Karl Kopper karl (at) gardengrown (dot) org 2 Apr 2004 Here is an LVS serving telnet. The clients connect through to the realservers where they run their applications. Although the number of connections is balanced, the load on each realserver can be quite different. Here's the ipvsadm output taken at the end of the time period shown.

       RemoteAddress:Port   Forward Weight ActiveConn InActConn
    TCP  cluster:23 wrr
      -> clnode7:23         Route   1      53         1
      -> clnode8:23         Route   1      38         0
      -> clnode2:23         Route   1      46         1
      -> clnode10:23        Route   1      49         0
      -> clnode9:23         Route   1      49         0
      -> clnode6:23         Route   1      35         1
      -> clnode5:23         Route   1      33         0
      -> clnode4:23         Route   1      36         0
      -> clnode3:23         Route   1      40         0
      -> clnode1:23         Local   1      42         0
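To put a number on "balanced", here is a small sketch in Python that pulls the ActiveConn column out of the listing above and reports the spread between realservers. The parsing assumes the six-column realserver lines shown above; the function name is made up for the example.

```python
IPVSADM_OUTPUT = """\
TCP  cluster:23 wrr
  -> clnode7:23         Route   1      53         1
  -> clnode8:23         Route   1      38         0
  -> clnode2:23         Route   1      46         1
  -> clnode10:23        Route   1      49         0
  -> clnode9:23         Route   1      49         0
  -> clnode6:23         Route   1      35         1
  -> clnode5:23         Route   1      33         0
  -> clnode4:23         Route   1      36         0
  -> clnode3:23         Route   1      40         0
  -> clnode1:23         Local   1      42         0
"""


def active_conns(text):
    """Map each realserver (host:port) to its ActiveConn count."""
    conns = {}
    for line in text.splitlines():
        fields = line.split()
        # Realserver lines look like: -> host:port Forward Weight ActiveConn InActConn
        # The isdigit() check skips any header line that also starts with "->".
        if len(fields) == 6 and fields[0] == "->" and fields[4].isdigit():
            conns[fields[1]] = int(fields[4])
    return conns


conns = active_conns(IPVSADM_OUTPUT)
total = sum(conns.values())
mean = total / len(conns)
busiest = max(conns, key=conns.get)
quietest = min(conns, key=conns.get)

print(f"total active connections: {total}")
print(f"mean per realserver:      {mean:.1f}")
print(f"busiest:  {busiest} ({conns[busiest]})")
print(f"quietest: {quietest} ({conns[quietest]})")
```

For this listing the counts range from 33 to 53 around a mean of 42.1. This only measures connection counts; as Karl points out, the load average on each realserver can still differ widely.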

LVS with 10 realservers, serving telnet, load average for past hour, images of total cluster. Ganglia graph of connections to an LVS with 10 realservers, serving telnet, 1 hr. The graphs above show total one minute load average for the LVS cluster. When the load average on any individual box is greater than 1 (for uniprocessor systems) the icon for the realserver (the boxes at the bottom of the image) turns red.

LVS with 10 realservers, serving telnet, load average for past hour, for each realserver and the cluster manager. rrd graph of load on each of 10 realservers, serving telnet, 1 hr. The graphs above show the bottom-half of the Ganglia web page with the one minute load average (for each realserver) for the past hour. Note that the load average is quite different for each realserver. Also shown is the cluster node manager (outside the LVS), used by the realservers for authentication and print spooling. Magnus Nordseth magnus (at) ntnu (dot) no 05 Apr 2004

LVS with 3 quad processor realservers, serving https, single day, y-axis is cpu-idle (all idle = 400%). Ganglia graph of cpu-idle for LVS with 3 quadprocessor realservers, serving https, single day. The graph shows cpu-idle for three identical realservers running https. Each realserver has 4 cpu's, thus maximum idle cpu is 400%. The graph was created with in-house software.

Nagios Nagios is mentioned elsewhere in this HOWTO by various posters as a monitoring tool. anon I'm interested with LVS to do some load balancing with HTTP. I'm testing LVS with VMWare (i'm simulating two Windows 2003 real servers). Is there a way to do some load monitoring with windows realservers? I know the feedbackd project, but there's no win32 agent... If LVS cannot do Load monitoring I will use bigip or other proprietary solution that could handle load monitoring. Peter Mueller pmueller (at) sidestep (dot) com 11 Jul 2005 You can try using the Nagios windows agents and some shell scripts to accomplish your goals. Two Nagios Windows programs that I am aware of are: http://nagios-wsc.sourceforge.net/ and http://nsclient.ready2run.nl/

MIB/SNMP A MIB has been written for LVS by Romeo Benzoni rb (at) ssn (dot) tp (Nov 2001). It's available as code and documentation (http://anakin.swiss-support.net/~romeo/lvs-snmp/). (Joe: there's a 64 bit problem with the net-snmp-lvs module that's been fixed. Make sure you're using at least the 0.0.4-2 module.) The latest (Mar 2002) is at http://anakin.swiss-support.net/~romeo/lvs-snmp/ucd-snmp-lvs-module-0.0.2.tar.bz2 (Joe: this seems to have disappeared). (Joe: in case other links go dead): net-snmp-lvs-module-0.0.4.tar.gz, net-snmp-lvs-module-0.0.4-1.EL4.src.rpm, net-snmp-lvs-module-0.0.4-2.EL4.src.rpm. Malcolm Turnbull's download/SNMP (this directory includes net-snmp-5.3.0.1.tar.gz). Jack Neely (who provided the updates): net-snmp-lvs-module-0.0.4-1.EL4.src.rpm, and the updated version net-snmp-lvs-module-0.0.4-2.EL4.src.rpm (http://install.linux.ncsu.edu/pub/yum/CLS/CLSTools.EL4/srpms/net-snmp-lvs-module-0.0.4-2.EL4.src.rpm). Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 26 Dec 2008 The new version is in the 4-2 src rpm (for reference it's at http://archive.linuxvirtualserver.org/html/lvs-users/2008-10/msg00011.html). I have been using it with cacti for... well, since then. The source rpm says EL4 but it should compile on many rpm-based distros with "rpmbuild -bb $src.rpm". You can make a source tar.gz by running "rpmbuild -bp $src.rpm" and then archiving the prepared sources from where they were unpacked. Though, essentially, the "source" is the original net-snmp-lvs-module-0.0.4.tar.gz found in the HOWTO somewhere. The src rpm packages it with the ipvsadm sources plus a few patches and automates the patch, build and install procedure. from 4-1 to 4-2. This is the list of files in 4-2: Jack Neely jjneely (at) ncsu (dot) edu 1 Oct 2008

I've setup net-snmp-lvs in an attempt to move the monitoring of my lvs/keepalived balancer into Cacti. I've gotten most things working and even have an RPM that folks might be interested in. I've built this on RHEL4 and 5 i386 (I believe it needs a bit of extra work for 64 bit arches.) http://install.linux.ncsu.edu/pub/yum/CLS/CLSTools.EL4/srpms/net-snmp-lvs-module-0.0.4-1.EL4.src.rpm The values provided for the lvsServiceStatsInBytes attributes are not right. They don't seem to be bits or bytes. Cacti is graphing out my ssh service as doing terabytes of traffic per second. Running snmpwalk ; sleep 2 ; snmpwalk gives me the following: Running ipvsadm -L --stats for the same firewall mark prints out the in bytes service counter at 9737M.

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 02 Oct 2008 I'm not sure why it's like this, but you just need to swap the two 32-bit fields around to get to the real value (which incidentally is the same as shifting >>32 when they are all zeroes). Take for example 2670058611031409148 returned by SNMP for 2182G. In hex (easier to work with than binary) it is 250df429 000001fc, the real value is 000001fc 250df429 which is 2182465057833 or 2182G. I just used the rates instead for the time being (just remember to change from COUNTER to GAUGE so cacti knows how it works).

I was looking at this last night. The problem was that ASN_COUNTER64 expects a "struct counter64" and is given a "long long" (__u64) instead. The attached patch should correct this problem.

    > 32;
    +    val64.low = *val & 0xffffffff;
    +    snmp_set_var_typed_value(var, ASN_COUNTER64, (u_char *)&val64, sizeof(val64));
    +}
    +
     /** returns the first data point within the lvsServiceTable table data.
         Set the my_loop_context variable to the first data point structure
    @@ -271,10 +282,10 @@
         snmp_set_var_typed_value(var, ASN_COUNTER, (u_char *) &stats->outpkts, sizeof(stats->outpkts));
         break;
     case COLUMN_LVSSERVICESTATSINBYTES:
    -    snmp_set_var_typed_value(var, ASN_COUNTER64, (u_char *) &stats->inbytes, sizeof(stats->inbytes));
    +    netsnmp_set_var_counter64_value(var, &stats->inbytes);
         break;
     case COLUMN_LVSSERVICESTATSOUTBYTES:
    -    snmp_set_var_typed_value(var, ASN_COUNTER64, (u_char *)&stats->outbytes, sizeof(stats->outbytes));
    +    netsnmp_set_var_counter64_value(var, &stats->outbytes);
         break;
     case COLUMN_LVSSERVICERATECPS:
         snmp_set_var_typed_value(var, ASN_GAUGE, (u_char *) &stats->cps, sizeof(stats->cps));
    @@ -434,10 +445,10 @@
         snmp_set_var_typed_value(var, ASN_COUNTER, (u_char *) &stats->outpkts, sizeof(stats->outpkts));
         break;
     case COLUMN_LVSREALSTATSINBYTES:
    -    snmp_set_var_typed_value(var, ASN_COUNTER64, (u_char *) &stats->inbytes, sizeof(stats->inbytes));
    +    netsnmp_set_var_counter64_value(var, &stats->inbytes);
         break;
     case COLUMN_LVSREALSTATSOUTBYTES:
    -    snmp_set_var_typed_value(var, ASN_COUNTER64, (u_char *) &stats->outbytes, sizeof(stats->outbytes));
    +    netsnmp_set_var_counter64_value(var, &stats->outbytes);
         break;
     case COLUMN_LVSREALRATECPS:
         snmp_set_var_typed_value(var, ASN_GAUGE, (u_char *) &stats->cps, sizeof(stats->cps));

Jack: a src.rpm, net-snmp-lvs-module-0.0.4-2.EL4.src.rpm (http://install.linux.ncsu.edu/pub/yum/CLS/CLSTools.EL4/srpms/net-snmp-lvs-module-0.0.4-2.EL4.src.rpm)

Laurentiu I would also create the necessary config and reload snmpd on install and uninstall, so it will "just work" immediately after install, something like this.

    $RPM_BUILD_ROOT/etc/snmp/snmpd.local.conf
    %files
    ...
    %config(noreplace) /etc/snmp/snmpd.local.conf
    %post
    test $1 = 1 && chkconfig snmpd && service snmpd reload || true;
    %postun
    test $1 = 0 && chkconfig snmpd && service snmpd reload || true;

If you can't access the snmp information (get timeouts, or snmpwalk crashes)

Michael Moody michael (at) gsc (dot) cc 24 Sep 2008 Depending on distribution, check /var/log/messages for more info. I had a problem where snmpd was running as "other than root", preventing the libnetsnmplvs from being able to access the ipvs tables.

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 12 Dec 2008 Build the net-snmp-lvs-module rpm posted earlier and install it on your servers running snmpd, then you can access the LVS data via SNMP in cacti. You can whip up a quick graph using the snmp-oid template, or you can make a more elaborate graph template that can be attached to hosts. For example, the OIDs you need for bandwidth per service are: where "i" is the service index (1...number of LVS services defined). You can use snmpwalk to find all the fields.

Jack Neely jjneely (at) ncsu (dot) edu 17 Dec 2008 In a similar vein...I'm using Cacti to graph services based on firewall mark. 
When a fwmark is removed I tell Cacti to reindex the LVS SNMP data so each data source is still groking the right part of the SNMP tree. This does that...but (today at least) also edited the path to the RRD for each data source to use the previous data source's RRD. I then became unhappy as I watched Cacti trash all my pretty graphs. How do other folks handle this case when a service disappears from LVS with Cacti?
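The 32-bit word swap that Laurentiu describes above (for counter values returned by the unpatched module) can be checked with a few lines of Python. This is only a worked version of his arithmetic, using the number from his post; the function name is made up for the example.

```python
MASK32 = 0xFFFFFFFF


def swap_words(raw: int) -> int:
    """Swap the high and low 32-bit halves of a 64-bit counter value."""
    high = (raw >> 32) & MASK32
    low = raw & MASK32
    return (low << 32) | high


reported = 2670058611031409148    # value returned by SNMP (from the post above)
actual = swap_words(reported)

print(f"{reported:016x}")   # 250df429000001fc
print(f"{actual:016x}")     # 000001fc250df429
print(actual)               # 2182465057833, i.e. about 2182G
```

With the 0.0.4-2 module the two halves are delivered in the right order, so the swap is only needed for data from the older module.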

home brew MIB/SNMP Ratz The file linux/snmp.h represents the SNMP RFCs. IPVS is not specified in an RFC, so adding this has no chance I believe. If you want to generate your own MIB, use one of the reserved sub trees of the MIB DB for such projects and peruse m2c. If you really plan on writing one, get back to us so we can sort out the header to freeze the API. The simple approach we've been using for years: Prepare the values through cronjobs by calling ipvsadm or parsing proc-fs and write SNMP type values (u32, u64, char ...) into single files, e.g. /var/run/lvs_snmp/VIP1_act_conns.out. Configure snmpd.conf to read out those files using cat, e.g. exec VIP1_act_conns /bin/cat /var/run/lvs_snmp/VIP1_act_conns.out. Use snmpwalk and grep for VIP1_act_conns to get the OID and off you go monitoring those values. Repeat for all values you would like to poll. If you need up to date values (not recommended though) you can also directly call shell scripts using the exec directive. Joseph T. Duncan duncan (at) engr (dot) orst (dot) edu 21 Aug 2006 Presently I collect CPU, Memory in USE, and Network traffic statistics from my windows terminal server "real servers" via snmp. I toss this information into an rrd database for making pretty graphs along with usage parsed from lvs stats. Finally I take the CPU and Memory stats and use them to adjust my weight tables. My script duncan_main.pl for doing this is still in its infancy as I am getting stuff ready for this fall term, but it should be fun to see how it all works out. 28 Dec 2006: Here's an update lvs_weight.pl
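The cron-job half of Ratz's approach above might look something like the following Python sketch. The function names are made up for the example, and how you obtain the value (ipvsadm, proc-fs) is left out; the file name and the exec line follow the example Ratz gives.

```python
import os
import tempfile


def write_value(directory, name, value):
    """Write one value to <directory>/<name>.out and return the path.

    The value goes to a temporary file first and is then renamed into place,
    so snmpd never reads a half-written file.
    """
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, name + ".out")
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        f.write(f"{value}\n")
    os.chmod(tmp, 0o644)      # mkstemp creates 0600; snmpd may not run as root
    os.replace(tmp, path)     # atomic rename on the same filesystem
    return path


def exec_line(name, path):
    """Return the matching snmpd.conf line, in the form Ratz shows."""
    return f"exec {name} /bin/cat {path}"


# The line to add to snmpd.conf for the example value in the text:
print(exec_line("VIP1_act_conns", "/var/run/lvs_snmp/VIP1_act_conns.out"))
```

A cron job would call write_value("/var/run/lvs_snmp", "VIP1_act_conns", n) once per polling interval, with n taken from ipvsadm or proc-fs. The rename is a design choice, not part of Ratz's description; it just avoids snmpd reading an empty file while the cron job is rewriting it.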

Disks Monitoring disks is not directly an LVS problem; however, since disks are the most failure prone component of a computer, you need to have a plan to handle disk failure (I pre-emptively change out my disks at the end of their warranty period, even if they're not giving problems). Linux J. Jan 2004, p 74 has an article on the SMART tools for monitoring ATA and SCSI disks. Apparently for years now IDE and SCSI disks have been using the Self Monitoring, Analysis and Reporting Technology (SMART) standard to report low level errors (e.g. disk read errors; there are dozens of tests). This has been available in tools like Maxtor's PowerMax (for windows). (VAXes and Crays continuously monitor and report disk errors - I've never known why this wasn't available on other machines.) The current SMARTv2 spec has been around since Apr 1996. Apparently these SMART tools have been available on Linux for a while and run on mounted disks. The source code is at http://smartmontools.sourceforge.net/ There are two components: smartd, which reads a config file and runs in the background, monitoring your disks and writing to syslogd (and/or e-mailing you); and smartctl, which runs various checks from the command line and which you can run as a cron job to do an exhaustive (1hr long) check (e.g. on Sun morning at 1am).

Other output GUIs

procstatd A lightweight and simple web-based cluster monitoring tool designed for beowulfs: procstatd. The latest version was 1.3.4 (you'll have to look around on this page).

OSCE From Putchong Uthayopas pu (at) ku (dot) ac (dot) th a heavyweight (lots of bells and whistles) cluster monitoring tool, originally called KCAP which has a new incarnation as http://www.opensce.org/ Open Scalable Cluster Environment (link dead Jun 2003).

LVS: Details of LVS operation, Security, DoS

Top 20 security vulnerabilities See list of top 10 windows and top 10 unix vulnerabilities

Top 75 security tools from the people at nmap See Top 75 Security Tools survey of May 2003, by polling the nmap mailing list.

Network Testing with Aberrant Packets This is not exactly DoS, but is from a thread on another mailing list. Jeff The Riffer riffer (at) vaxer (dot) net 27 Feb 2007 We used several tools to generate aberrant behavior, rather than packet replays. One was Core Impact, which actually exploits known holes and installs agents. It can do TCP evasion techniques to a limited extent. For aberrant behavior, we found a nifty little open-source tool called isic, which lets you generate all sorts of abnormal traffic: it has binaries to generate abnormal ethernet, UDP, TCP, IP, and ICMP traffic. You can control percentages of the different abnormalities as well as volume of traffic. It's VERY noisy and aggressive stuff, but great for seeing if you can bring down a system. You can also use it to generate a packet storm while trying to sneak in through a more mundane attack and trick your IDS/IPS route. We had problems getting it compiled, but someone was able to find a Debian package for it. The Debian package was converted to RPM using Alien and the RPM worked great under SuSe 10.0. Other than that we just used NMap and Nessus to generate varying levels of traffic and alerts. Isic was very useful for us... NMap/Nessus are to test how good IPS are at detecting scanning. We were doing a very comprehensive test so we made no assumptions about capabilities of the products. NMap and Nessus by themselves would not be sufficient. The problem with replaying actual packet captures to test an IPS is that it will be for whatever IP addresses were in play when the capture was done, so that won't really work either. You can muck around with the .cap file and change the IPs and MAC addresses but it's an iffy solution. Core Impact is really great. But it's commercial and expensive, so most folks aren't going to have it. But Metasploit is free and can do many of the same things. Just not as easily.

Do I need security, really? Malcolm Turnbull

Assuming that you have an LVS loadbalancer running on a linux box and this box is behind a firewall so that only ports 80 and 443 are allowed from clients: do you really need to harden the loadbalancer firewall rules? What about SYN cookies?

Ratz 01 Oct 2002 Yes, always. DROP ALL, accept TCP 80/443 only. Especially if the packet filter in front and the LVS are running the same OS :) Nothing can prevent SYN flooding, you can only live better with it when you have SYN cookies enabled. With a wrongly set backlog queue size you still face a big penalty with SYN/RST attacks. See syncookies. Roberto Nibali ratz (at) tac (dot) ch 03 May 2001 It doesn't matter whether you're running an e-gov site or your mom's homepage. You have to secure it anyway, because the webserver is not the only machine on a net. A breach of the webserver will lead to a breach of the other systems too. Joe Sep 2005 As Marcus Ranum says (http://www.ranum.com/security/computer_security/editorials/dumb/) "Worms aren't smart enough to realize that your web site/home network isn't interesting". Joe 01 Oct 2002 Yes Virginia, you need security. There's the technical level. Can an intruder, who gets beyond the firewall, do any damage after getting access to the director, the realservers? If so, do you care? Maybe you do, maybe you don't - it will depend on what you have on those machines - if it's only publicly available (readonly) webpages, you're less concerned than if you have customer business information on it. Are there adjacent machines on the network, that have more sensitive data than yours, that could be attacked from your compromised machines? You don't want to be an intermediate site to an attack on an expensive setup in the next rack. You may think that with a hardened front end, backend machines need less protection. However a new exploit we haven't thought about may hop through a firewall and land on one of the less protected machines behind it. You should think about the damage that could occur if the attacker gained root access to any of your machines. 
LVS-DR is easier to protect in this case, as packets from the attacker on a realserver will be coming from the RIP, while packets from the LVS'ed service will be coming from the VIP. (If the realservers are in an LVS, then only the packets from the RIP on the realserver to the external 3-Tier services should have routes.) There should be no default gw on the realservers for packets from the RIPs. Packets from the RIPs to 0:0 should be dropped (and logged). The only allowed packets from the RIPs are those needed for internal networking between the machines in the LVS (e.g. local mailing of logs, updates to files). These packets will have dst_addr as another machine on the RIP network. On the director, there should be no default gw for packets from the VIP (see routing for LVS-DR). The only way an attacker with root can send packets to the outside world is by changing the routing tables (which you should be able to pick up). But security is more than a technical thing. How will your customers react if their website goes down, gets defaced or has credit card info stolen? You're going to have to explain that your actions were diligent and the break-in was beyond anything that you could be expected to handle. You then have to mollify them and make sure you keep the account. You'll also have to explain to potential customers why that humongous break-in that made it to the front page of the NY Times for a week, doesn't reflect poorly on you. If these people are non-technical (as most people with the money are) this may be difficult. It is just easier to make sure there never is a break-in. Of course there's no end to the things you can do in this department, so somewhere you'll have to decide what you're prepared to deal with upfront at the keyboard and what you're prepared to deal with at the backend, after a break-in, face-to-face with an unhappy customer, who is simultaneously dealing with his own unhappy customers. 
To your client, you aren't a genius with a keyboard who understands computers. Nope - you're the security guard they've hired to look after a warehouse of their widgets and if you let someone get off with them, they're not going to be terribly interested in your explanation of why or how it happened. I'd say the minimum for a production machine, exposed to the internet, is a set of rules on each machine (director and realservers) that only allows the packets needed for the LVS (by port, IP, proto) and drops the rest. Every packet to and from a machine must be inspected by a filter rule. Every rejected packet must be logged (at least till you find out where they're coming from). Routing should be designed to allow out only the packets you want to go out (outgoing packets are filtered by port and IP). If you're being bombarded with malicious traffic (spam, DoS), tcpdump is not a good diagnostic tool - you will not be able to decipher the deluge. Try snort. Here's an Introduction to Network-Based Intrusion Detection Systems Using Snort. Limit places where intruders can login (e.g. with xinetd). For maintenance, don't login over the same networks that the LVS traffic flows on (e.g. RIP network, outside network with VIP). For maintenance/admin, only allow connection by ssh through a separate ethernet card on a different set of wires and different network (backed up with filter rules and xinetd), or via the console.
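As an illustration of the "allow only what the LVS needs, log and drop the rest" minimum described above, here is a Python sketch that prints a set of iptables commands for incoming packets on one machine. The VIP, ports and log prefix are made-up values for the example, and the rules are only printed, not run. This is a starting point under those assumptions, not a complete ruleset: there are no rules here for the admin network, the loopback interface or outgoing packets.

```python
def filter_rules(vip, ports, proto="tcp", log_prefix="lvs-drop: "):
    """Return iptables commands that accept only proto/ports to the VIP.

    Everything else arriving on INPUT is logged and then dropped by the
    chain policy. The policy is set last so that the ACCEPT rules are
    already in place when it takes effect.
    """
    rules = []
    for port in ports:
        rules.append(
            f"iptables -A INPUT -p {proto} -d {vip} --dport {port} -j ACCEPT"
        )
    rules.append(f'iptables -A INPUT -j LOG --log-prefix "{log_prefix}"')
    rules.append("iptables -P INPUT DROP")
    return rules


# Hypothetical VIP serving http and https.
for rule in filter_rules("192.168.1.110", [80, 443]):
    print(rule)
```

Review the printed commands before running any of them, and add the ssh rule for the separate admin network first, or you will lock yourself out.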

What to do after a break-in, prevention strategies In the early 1990's, a break-in was unusual and, being a criminal act, some investigative body was notified (e.g. the police). This being a new type of crime, usually the investigators had no idea how to handle the situation. At my work a multi cpu refrigerator sized mail server was compromised and the investigators swooped in and seized the server; not just the disk(s) - the whole server, and wheeled it out the door. We were told that the server would be returned on completion of the investigations (and any trial if suspects were apprehended). On asking them when that might be, we were told that if they could not find any suspects, the machine would be returned when the case timed out, which would be 8yrs. This was a big lesson to all involved. The next break-in I was involved in, the machine wouldn't boot. I reformatted the disk and reinstalled the OS from CD and the user's files from backups. When the investigators arrived and asked to inspect the machine, I told them that the disk had been reformatted and offered them the last tape backup. They didn't want it. Subsequently I attended a talk by a programmer/lawyer/cyber-investigator from Washington DC, who worked for the US govt. He told us that after a break-in we must not touch the machine (it's tampering with evidence) and gave us a list of contacts. At question time, I told him the story about the server being wheeled out the door (which several people in the audience were familiar with), and my reformatting of the disk on the machine I handled, at which he grew visibly angry. I asked him if he expected us to call them if they were going to take our hardware away when they only needed to take a bitwise copy of the disks. After all, the police don't seize your house after a burglary. 
In his reply, it was clear that he recognised that such actions by the investigators weren't optimal, but as to what I should do next time, he only offered more of the standard party line and I decided that next time I wouldn't be calling anyone. Following a set of Unix/Linux break-ins (Apr 2004), Stanford U put out (link dead Jun 2004 http://securecomputing.stanford.edu/alerts/multiple-unix-6Apr2004.html) "Multiple Unix compromises on campus", describing their problems and offering links to further pages (such as information on rootkits). Unfortunately the current state of security is that much work is needed to get it and much of the prevention work seems to be applying patches. This is a lot of work and I can't imagine that it will prevent most break-ins. I personally find tiresome the practice of forced passwd expiration every 3 months on the 30 accounts I have in several administrative fiefdoms. I'm expected to keep them in my head when they are 8 char mixed uc/lc and contain at least 2 numbers. Who are they kidding? I tape them to the edge of my monitor. The article links to Steps for Recovering from a UNIX or NT System Compromise, a CERT/AusCERT paper. It seems to have been written by people who like being on committees and who want you to spend so much time securing your machine, that you'll never be able to use it. Unfortunately the people who make decisions about managing computers, who have never dealt with a break-in and know nothing about security, will cover their asses (arses) with a never ending round of patching. The result is that users have to suffer machines being rebooted from under them every 2 weeks (and losing all the user settings), the SA never does anything useful, but when the inevitable break-in occurs the manager can happily say to some committee "we did everything we could to prevent it". My only interaction with CERT was not good. 
One day (mid 1990s) I got vitriolic e-mail from a person announcing that he was one of the top AusCERT security experts, and that my map server (now at AZ_PROJ map server and producing about 10,000 maps/yr) was doing robotic attacks on his network. If I didn't cease and desist immediately, a dire fate would befall me. Now if there was some problem with my machine and I had come to the notice of CERT, I would expect a letter from CERT saying

I assumed I was talking to a crazed idiot. The next day I got an even more vitriolic e-mail from the same guy, promising me certain internet death if I didn't stop attacking his machines. Somewhere in here, he sent me logs, whose relevance to the problem was not obvious at the time. Then I got e-mail from a user of my map generator saying that it had stopped working for him and could I help. The map generator produced azimuthal equidistant projection maps in many formats, including an X-client which could popup an X-window map on your screen (there were instructions on setting xhost etc). The user was having problems getting the X-display of my maps going, when previously it had worked. AFAIK, no-one was using the X-client and I had forgotten it was there (everyone was generating gifs). Somehow (IPs?) I connected this user to the AusCERT expert. I told the user what to do and then sent off e-mail to the CERT expert, giving the url of my map generator and telling him to go look at what it did. This only inflamed the CERT security expert more and shortly thereafter I got e-mail from an even higher level AusCERT uber-expert telling me that I'd been listed as one of the biggest internet nasties of all time and that no-one was ever going to get a packet in or out of my machine ever again. I explained to the uber-expert what my machine was doing and that he was probably getting X-packets from my server and to go try it out for himself. There was silence for a couple of days, and then a rash of apologies from both CERT experts. There was no indication that they had learned anything from this or would change their methods next time. The fact that the top CERT experts in Australia don't know an X-packet from a hole in the wall and are prepared to declare internet death on someone without an investigation (courteous or otherwise), indicates that we shouldn't hold out much hope of CERT saving us from anything. If CERT can save us from CERT, we should be thankful.

More about syncookies anon

Humans usually do not establish SYN connections. It is more likely to be Nimda or other worms. If I can determine a threshold for the number of simultaneous SYN connections that Nimda usually creates, I will be able to drop packets from specific source IPs which meet the threshold.

Roberto Nibali ratz (at) tac (dot) ch 06 Aug 2003 Search google using my name and syncookies for more information on why syncookies have no measurable impact on reducing real DoS. If you can _really_ figure out a metric for mutually exclusive TCP/SYN patterns generated by existing worms and write it down in a mathematical formula which has lower false positive rate than any TCP/QoS "defense" mechanism using stochastic (timed) fairness approach, you need not worry about money anymore. In fact influential people in the Internet business might feel a sudden urge to talk to you! ;)

Can filter rules stop the intruder hopping to other machines? Malcolm Turnbull malcolm (dot) turnbull (at) crocus (dot) co (dot) uk 14 Feb 2003 Nope, if you're hacked they can just change your firewall rules... One of my clients got hacked and the only way they found out was because the hacker (possibly a script kiddy) tried to flush the iptables rules, thereby breaking all of the NAT rules and taking down the web site... How did he get in: broke into IIS through a common bug, installed a trojan, used SSH to get from the web server to the firewall .. etc etc... Even if you put the LVS behind a firewall (which I prefer) you still need to open port 80... Is it secure? Yes, I think so. Hackers tend to concentrate on the application (i.e. apache or IIS) these days; it's much easier.. One other gotcha.. If your fallback server is localhost you are obviously exposing your local apache installation! Nate Carlson natecars (at) real-time (dot) com

The firewall should be configured so untrusted hosts (e.g. the web server -- any box that isn't the box that people are expected to log in from) can't connect to the SSH port (or any other service) on the firewall.

Where filter rules act Joe - iptables (2.4 kernels) has no "iptables -C" to check your rules (at least not yet - one is promised).

Ratz If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:

packets coming from outside to the LVS do: LOCAL_IN(LVS in) -> POST_ROUTING
packets leaving the LVS travel: FORWARD(LVS out) -> POST_ROUTING

From the iptables howto: COMPATIBILITY WITH IPCHAINS This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.

Julian 2.4 director:

Packets coming into the director (out->in):
NAT: INPUT -> input routing -> local: LVS/DEMASQ -> input routing -> forwarding -> OUTPUT
DR/TUN: INPUT -> input routing -> local: LVS -> output routing -> OUTPUT

Packets leaving the LVS travel (in->out):
NAT only: INPUT -> input routing -> FORWARD (-j MASQ) -> LVS/MASQ -> OUTPUT

2.2 director: INPUT in 2.2 is similar to PRE_ROUTING in 2.4, i.e. INPUT, OUTPUT and FORWARD are the 2.2 firewall chains

Matthew S. Crocker matthew (at) crocker (dot) com 31 Aug 2001

How do I filter LVS? Does LVS grab the packets before or after iptables?

Julian LVS is designed to work after any kind of firewall rules. So, you can put your ipchains/iptables rules safely. If you are using iptables put them on LOCAL_IN, not on FORWARD. The LVS packets do not go through FORWARD. Although LVS is compatible with any kind of filter rule (i.e. ipchains, iptables), it has incompatibilities with netfilter i.e. you may not be able to have your firewall on the director. For more info see the . Joe If you are being attacked, it might be better to filter upstream (e.g. the router or your ISP), to prevent the LAN from being flooded.

/proc filesystem flags for ipv4, e.g. rp_filter You could wind up flipping a lot of these flags. Explanations are available in the obscure section of the Adv Routing HOWTO. In particular rp_filter and log_martians are used in . For more information on rp_filter see Reverse Path Filtering.

tcp timeout values, don't change them (at least yet) The tcp timeout values have their values for good reason (even if you don't know what they are), and realservers operating as an LVS must appear as normal tcp servers to the clients. Wayne, 19 Oct 2001

I have a question about the "IP_MASQ_S_FIN_TIMEOUT" values in "net/ipv4/ip_masq.c" for the 2.2.x kernel. What purpose is served by having the terminated masqueraded TCP connection entries remain in memory for the default timeout of 2 minutes? Why isn't the entry freed immediately?

Julian Anastasov ja (at) ssi (dot) bg 20 Oct 2001 Because the TCP connection is full-duplex. The internal-end sends FIN and waits for the FIN from external host. Then TIME_WAIT is entered.

Perhaps what I'm really asking is why there is an mFW state at all.

This state has timeout corresponding to the similar state in the internal end. The remote end is still sending data while the internal side is in FIN_WAIT state (after a shutdown). The remote end can claim that it is still in established state not seeing the FIN packet from internal side. In any case, the remote end has 2 minutes to reply. It can even work for longer time if each packet follows in these two minutes not allowing the timer to expire. It depends in what state is the internal end, FIN_WAIT1 or FIN_WAIT2. May be the socket in the internal end is already closed.

The only thing I can think of is if the other end of the TCP connection spontaneously issues a half close before the initiator sends his half close. Then it might be desirable to wait a while for the initiator to send his half close prior to disposing of the connection totally. What would be the consequences of using ipchains -M -S to set this value to, say, 1 second?

In any case, timeout values lower than those in the internal hosts are not recommended. If we drop the entry in LVS, the internal end still can retransmit its FIN packet. And the remote end has two minutes to flush its sending queue and to ack the FIN. IMO, you should claim that the timer in FIN_WAIT state should not be restarted on packets coming from remote end. Anything else is not good because it can drop valid packets that fit into the normal FIN_WAIT time range. Jaroslav Libak 28 Nov 2006

When I click refresh in firefox several times while viewing load balanced page, I get a FIN_WAIT connection for every refresh. So I set tcpfin parameter using ipvsadm to 15 seconds to get rid of them fast (is this ok btw?, it was like 2 minutes before which I think is way too long). What is worse, I get "established" connection on the backup director (running the syncd) for every refresh. I have read this is due to a simplification in the synchronization code. I'm using hash table size 2^20 (which doesn't limit the maximum number of values in it, it just sets the number of rows, then each row has a linked list). Doesn't it cause some slowdown in the LVS?

Horms 29 Nov 2006 There has long been a plan to allow the timeout values to be manipulated from user space. I think it actually was possible using /proc at some stage, but the code was removed for various (good) reasons. Then there was a plan to implement the feature by extending the sysctl interface. I suspect that this, or using sysfs, is currently the preferred option of the upstream kernel guys. A really worthwhile contribution to LVS would be to complete this code. I can find out from the upstream people what their preferred option for implementing this is if you are interested in having a crack at it. I don't imagine the code will be that hard. I understand that your concern is memory pressure on the slave in the case of a DoS attack. And it is true that the simplification in the synchronisation protocol can exacerbate that problem. However, by doing it this way the synchronisation traffic is actually reduced, including in the case of a DoS attack. So expanding it may actually just move the problem elsewhere. Keeping in mind that a connection entry is in the vicinity of 128 bytes, it is my opinion that unless you have an extremely small amount of memory available on the system to start with, DoSing the machine in this way is quite hard. I did try once, DoSing a box from itself, and basically the default timeouts were easily able to keep up with the DoS, and I think the total memory used never exceeded a few hundred Mb. I would be very surprised if increasing the hash table size would cause a slowdown, it does however increase the memory required for the array that forms the base of the hash - at 2^20 you are looking at order 2^20 = 1Mb for the size of that array. For larger values, like 32 (=4Gb), this starts to become ridiculous. Decreasing it can, in theory, cause a slowdown if you have a lot of connections. But in practice I don't think it does unless you make it very small. In short, 20 should be fine, though you can probably get the same performance with 16. 10 is probably a bit too small.

/proc file system settings for LVS: security and private copies of tcp timeouts for LVS connections (you can change these) In LVS-DR, the director only sees the packets from the client going to the realserver, but not the replies. After seeing a CLOSE, the director puts the connection into InActConn and uses its value of TIME_WAIT before assuming that the connection has dropped. (In fact the director has no idea of the connection state of the realserver, but these assumptions seem to work OK). In the earlier versions of LVS, the director used the standard tcpip timeouts for its estimates of the connection state of the realserver. In the newer versions of LVS (somewhere in 2.4.x), you can fiddle with a set of private copies of the timeout values which ip_vs uses for LVS connection tracking. Other ip_vs parameters (e.g. for security) can also be altered in /proc. Roberto Nibali ratz (at) tac (dot) ch 03 May 2001

The load balancer is basically as secure as Linux itself is. ipchains settings don't affect LVS functionality (unless by mistake you use the same mark for ipchains and ipvsadm). LVS itself has some built-in security, mainly to try to secure the realserver in case of a DoS attack. There are several parameters you might want to set in the proc-fs.

Ratz 10 Aug 2004 Those values below were used as kind of a defense mechanism in the ancient days. I believe these are to be replaced by the same parameters exported through the ip_conntrack module. Load ip_conntrack and walk the /proc/sys/net/ipv4/netfilter tree.

/proc/sys/net/ipv4/vs/amemthresh
/proc/sys/net/ipv4/vs/am_droprate
/proc/sys/net/ipv4/vs/drop_entry
/proc/sys/net/ipv4/vs/drop_packet
/proc/sys/net/ipv4/vs/secure_tcp
/proc/sys/net/ipv4/vs/debug_level

With debug_level you select the debug level (0: no debug output, >0: debug output in kernlog; the higher the number, the higher the verbosity).

The following are timeout settings. For more information see TCP/IP Illustrated Vol. I, R. Stevens.

/proc/sys/net/ipv4/vs/timeout_close - CLOSE
/proc/sys/net/ipv4/vs/timeout_closewait - CLOSE_WAIT
/proc/sys/net/ipv4/vs/timeout_established - ESTABLISHED
/proc/sys/net/ipv4/vs/timeout_finwait - FIN_WAIT
/proc/sys/net/ipv4/vs/timeout_icmp - ICMP
/proc/sys/net/ipv4/vs/timeout_lastack - LAST_ACK
/proc/sys/net/ipv4/vs/timeout_listen - LISTEN
/proc/sys/net/ipv4/vs/timeout_synack - SYN_ACK
/proc/sys/net/ipv4/vs/timeout_synrecv - SYN_RECEIVED
/proc/sys/net/ipv4/vs/timeout_synsent - SYN_SENT
/proc/sys/net/ipv4/vs/timeout_timewait - TIME_WAIT
/proc/sys/net/ipv4/vs/timeout_udp - UDP

You don't want your director replying to pings from the outside world.
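To see which of the entries listed above actually exist on a given director, a small script can walk the directory and print whatever it finds. This is only a sketch: the function name read_vs_sysctls is made up for illustration, and the script assumes nothing about which files are present (on some kernels the directory or the timeout_* files may be missing, in which case it simply returns less).

```python
from pathlib import Path

# Directory named in the text. It may be absent if ip_vs is not loaded,
# and the timeout_* files may be missing on some kernel versions.
VS_SYSCTL_DIR = Path("/proc/sys/net/ipv4/vs")


def read_vs_sysctls(base=VS_SYSCTL_DIR):
    """Return {file name: contents} for every readable file under base.

    read_vs_sysctls is a hypothetical helper, not part of ipvsadm or ip_vs.
    """
    base = Path(base)
    values = {}
    if not base.is_dir():
        return values  # nothing to report on this machine
    for entry in sorted(base.iterdir()):
        if not entry.is_file():
            continue
        try:
            values[entry.name] = entry.read_text().strip()
        except OSError:
            # Entry not readable by this user; skip it rather than fail.
            continue
    return values


if __name__ == "__main__":
    for name, value in read_vs_sysctls().items():
        print(f"{name} = {value}")
```

Reading is harmless; writing new values into these files changes the behaviour of a running director and is a separate decision.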

With the FIN timeout being about 1 min (2.2.x kernels), if most of your connections are non-persistent http (only taking 1 sec or so), most of your connections will be in the InActConn state. unknown

Will the info from loading ip_conntrack and walking the /proc/sys/net/ipv4/netfilter tree be used, along with the secure_tcp defense strategy described in the LVS DoS defense strategy document (http://www.linux-vs.org/docs/defense.html), to replace the timeouts mentioned?

Ratz 12 Aug 2004 I don't know (I've been out of the development loop for about a year) but I rather think not since they look kind of orthogonal to the existing netfilter timers which only got added about 6 months or so ago. One of the issues in fiddling with those timers is that they influence too much of the rest of the stack. I also don't think the documentation is up to date anymore, it should be adjusted to reflect the current state of operation. As it stands it only confuses people who don't want to or can't read the kernel code. If you're interested, check out following path: from there you set the TCP state transition table. If you have the secure_tcp sysctl set, the kernel will be dealing with the vs_tcp_states_dos state transition table, if you have it unset, it will be dealing with the normal vs_tcp_states table. The related timers for the state transitions are vs_timeout_table{_dos}. In former days you could influence those timers via proc-fs. Nowadays we seem to switch to the *_dos timer model under attack according to the comment in the code. But this is not correct. It should read that as soon as the sysctl for tcp_defense is set, we will also be using the *_dos table timers along with the vs_tcp_states_dos state transition table. Conclusion: The disabled proc-fs values have been replaced by a static hardcoded mapping of the timers for tcp_defense. I could imagine that not a lot of people really used to tweak those parameters anyway. Hendrik Thiel, 20 Mar 2001

we are using a lvs in NAT Mode and everything works fine ... Probably, the only Problem seems to be the huge number of (idle) Connection Entries. ipvsadm shows a lot of InActConn (more than 10000 entries per Realserver) entries. ipchains -M -L -n shows that these connections last 2 minutes. Is it possible to reduce the time to keep the Masquerading Table small? e.g. 10 seconds ...

Joe For 2.2 kernels, you can use netstat -M instead of ipchains -M -L. For 2.4.x kernels use cat /proc/net/ip_conntrack. Julian One entry occupies 128 bytes. 10k entries mean 1.28MB memory. This is not a lot of memory and may not be a problem. For 2.2, to reduce the number of entries in the ipchains table, you can reduce the timeout values. You can edit the TIME_WAIT, FIN_WAIT values in ip_masq.c, or enable the secure_tcp strategy and alter the proc values there. FIN_WAIT can also be changed with ipchains. It is not a good idea to change the tcpip timeouts (particularly to save 1M). With the later versions of ip_vs (2.4.x), the director has its own copies of the tcpip timeout values, and you can change them. Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 10 May 2004 If you are concerned about the number of InActConn, you can reduce the FIN_WAIT timeout in /proc/sys/net/ipv4/vs/timeout_finwait. For 2.6.x versions of ip_vs (May 2004), the timeouts have not been implemented yet. Julian 12 May 2004 IPVS for 2.6 has code to use different timeout tables but we forgot to implement it fully. The intention was to implement per protocol/service/app timeouts by adding some code to libipvs and the kernel. It is not preferred to export so many values via the /proc interface, so now it is disabled until someone decides to implement the above set/get controls. Only the timeout_* values in /proc are disabled, so now they do not exist in 2.6. All other sysctls remain.
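The numbers in this exchange are easy to check. The sketch below uses the 128 bytes per entry quoted by Julian and, as an assumption of mine, a steady arrival rate, so that the number of entries waiting in a timed state is simply rate times timeout (Little's law). The function names are made up for illustration.

```python
ENTRY_BYTES = 128  # per-entry size quoted in the text


def steady_state_entries(conn_per_sec, timeout_secs):
    """Entries sitting in a timed state, assuming a steady arrival rate."""
    return conn_per_sec * timeout_secs


def table_bytes(entries, entry_bytes=ENTRY_BYTES):
    """Memory taken by that many connection entries."""
    return entries * entry_bytes


if __name__ == "__main__":
    # 10k InActConn entries per realserver, as in the question.
    print(table_bytes(10_000))  # 1280000 bytes, i.e. 1.28 MB

    # If 10k entries build up over a 2 minute timeout, the implied rate is
    # 10000/120 connections per second. Under the steady-rate assumption,
    # a 10 second timeout would leave about this many entries:
    rate = 10_000 / 120
    print(round(steady_state_entries(rate, 10)))  # 833
```

The same arithmetic shows why most entries are inactive when connections are short: at one second of activity and a 60 second FIN timeout, 60 of every 61 entries are waiting to time out.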

timeouts the same for all services Alois Treindl alois (at) astro (dot) ch

I have LVS-NAT configured so that ssh VIP connects me to one particular realserver. I would like to keep this ssh connection permanent, (to observe the cluster during its operation). This ssh connection times out with inactivity as expected. Can this be changed, without affecting the timeout values of other LVS services? Alternately I can connect by ssh to each machine without using LVS.

Julian Anastasov ja (at) ssi (dot) bg 12 May 2001 Currently, there are only global timeout values which are not very useful for some boxes with mixed functions. The masquerading, the LVS and its virtual services use same timeout values. The problem is that there are too many timeouts. The solution would be to separate these timeouts, i.e. per virtual service timeouts, separated from the masquerading. According to the virtual service protocol it can serve the TCP EST and the UDP timeout role. So this can be one value that will be specified from the users. By this way the in->out ssh/telnet/whatever connections can use their own timeout (1/10/100 days) and the external clients will use the standard credit of 15 minutes. But may be it is too late for 2.2 to change this model. Is one user specified timeout value enough?

Director Connection Hash Table Because the 2.0.x implementation of ip_vs was in the masquerading code, this table used to be called the "IP masquerading table". Joe: A regular table has room for N entries, with an index of range N. A hash table is a table that has room for N entries, but stores entries for indices that have a range of M, where M>N. In the case of LVS, the connection hash table must store entries over the whole range of internet IPs, but only has (initially) 4096 (say) entries. Algorithms exist which allow adding and deleting entries in hash tables at speeds comparable to those in regular tables. from Peter Mueller: a general article on hashing (http://www.citi.umich.edu/projects/linux-scalability/reports/hash.html, site gone Sep 2004)

The director maintains a hash table of connections marked with <CIP, CPort, VIP, VPort, RIP, RPort> where

CIP: Client IP address
CPort: Client Port number
VIP: Virtual IP address
VPort: Virtual Port number
RIP: RealServer IP address
RPort: RealServer Port number.

The hash table speeds up the connection lookup and keeps state so that packets belonging to a connection from the client will be sent to the allocated realserver. If you are editing the .config by hand look for CONFIG_IP_MASQUERADE_VS_TAB_BITS. Do not even think of changing the LVS (hash) table size unless you know a lot more about ip_vs than we do. If you still want to change the hash table size, at least read everything here first. tao cui

In the output of ipvsadm what does the "size" mean?

Horms 24 Dec 2003 This refers to the number of hash buckets in the IPVS connection table. This is configured at compile time by setting CONFIG_IP_VS_TAB_BITS, the default is 12.

CONFIG_IP_VS_TAB_BITS = 12 -> size = 4096
CONFIG_IP_VS_TAB_BITS = 16 -> size = 65536

Note that this is the number of hash buckets, not the maximum number of connections. A bucket can contain zero or more connections. The maximum number of connections is only limited by the memory available. Janno de Wit

How can I see if connectiontable is full? `dmesg` gives no output.

Horms 07 Jan 2005 The connection table cannot become full. It is a hash table and you can continue to add entries until you run out of memory, at which time something very apparent should turn up in dmesg. Ratz The original poster actually has got a point :) So what about this: (partial diff shown for brevity - Joe)

offset) {
 sprintf(temp,
- "IP Virtual Server version %d.%d.%d (size=%d)",
+ "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
 NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
 len += sprintf(buf+len, "%-63s\n", temp);
 len += sprintf(buf+len, "%-63s\n",
@@ -1942,7 +1942,7 @@
 {
 char buf[64];
- sprintf(buf, "IP Virtual Server version %d.%d.%d (size=%d)",
+ sprintf(buf, "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
 NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
 if (*len < strlen(buf)+1) {
 ret = -EINVAL;

The default LVS hash table size (2^12 entries) originally meant 2^12 simultaneous connections. These early versions of ipvs would crash your machine if you allotted too much memory to this table. Julian 7 Jun 2001

This was because the resulting bzImage was too big. Users selected a value too big for the hash table and even the empty table (without linked connections) couldn't fit in the available memory.

This problem has been fixed in kernels>0.9.9 with the connection table being a linked list. Note: If you're looking for memory use with "top", it reports memory allocated, not memory you are using. No matter how much memory you have, Linux will eventually allocate all of it as you continue to run the machine and load programs. Each connection entry takes 128 bytes, so 2^12 connections require 512kbytes. Not all connections are active - some are waiting to timeout. As of ipvs-0.9.9 the hash table is different. Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

With CONFIG_IP_MASQUERADE_VS_TAB_BITS we specify not the max number of the entries (connections in your case) but the number of the rows in a hash table. This table has columns which are unlimited. You can set your table to 256 rows and to have 1,800,000 connections in 7000 columns average. But the lookup is slower. The lookup function chooses one row using hash function and starts to search all these 7000 entries for match. So, by increasing the number of rows we want to speedup the lookup. There is _no_ connection limit. It depends on the free memory. Try to tune the number of rows in this way that the columns will not exceed 16 (average), for example. It is not fatal if the columns are more (average) but if your CPU is fast enough this is not a problem. All entries are included in a table with (1 << IP_VS_TAB_BITS) rows and unlimited number of columns. 2^16 rows is enough. Currently, LVS 0.9.7 can eat all your memory for entries (using any number of rows). The memory checks are planned in the next LVS versions (are in 0.9.9?).
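The rows-and-columns picture above reduces to one division: the average column (chain) length is connections divided by rows. The sketch below reproduces the 256-row example and, as an illustration only, picks the smallest number of bits that keeps the average at or below a target. The function names and the 500,000 connection figure are hypothetical.

```python
def average_chain_length(connections, tab_bits):
    """Average entries per row for a table with 2**tab_bits rows."""
    return connections / (1 << tab_bits)


def min_bits_for_chain_length(connections, max_avg_chain=16):
    """Smallest tab_bits whose average chain is <= max_avg_chain.

    Assumes the hash function spreads connections evenly over the rows.
    """
    rows_needed = -(-connections // max_avg_chain)  # ceiling division
    return max(rows_needed - 1, 0).bit_length()     # 2**bits >= rows_needed


if __name__ == "__main__":
    # 256 rows (8 bits) holding 1,800,000 connections, as in the text.
    print(average_chain_length(1_800_000, 8))   # 7031.25

    # Hypothetical site with 500,000 simultaneous connections.
    bits = min_bits_for_chain_length(500_000)
    print(bits, average_chain_length(500_000, bits))
```

Nothing here limits the number of connections; it only estimates how long each lookup has to walk.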

Julian 7 Jun 2001

Here is the picture: the hash table is an array of double-linked list heads, i.e. struct list_head *ip_vs_conn_tab; In some versions ago ( < 0.9.9? ) it was a static array, i.e. struct list_head ip_vs_table[IP_VS_TAB_SIZE]; struct list_head is 8 bytes (d-linked list), the next and prev pointers In the second variant when IP_VS_TAB_SIZE is selected too high the kernel crashes on boot. Currently (the first variant), vmalloc(IP_VS_TAB_SIZE*sizeof(struct list_head)) is used to allocate the space for the empty hash table for connections. Once the table is created, more memory is allocated only for connections, not for the table itself. In any case, after boot, before any connections are created, the occupied memory for this empty table is IP_VS_TAB_SIZE*8 bytes. For 20 bits this is (2^20)*8 bytes=8MB. When we start to create connection they are enqueued in one of these 2^20 double-linked lists after evaluating a hash function. In the ideal case you can have one connection per row (a dream), so 2^20 connections. When I'm talking about columns, in this example we have 2^20 rows and average 1 column used. The *TAB_BITS define only the number of rows (the power of 2 is useful to mask the hash function result with the IP_VS_TAB_SIZE-1 instead of using '%' module operation). But this is not a limit for the number of connections. When the value is selected from the user, the real number of connections must be considered. For example, if you think your site can accept 1,000,000 simultaneous connections, you have to select such number of hash rows that will spread all connections in short rows. You can create these 1,000,000 conns with TAB_BITS=1 too but then all these connections will be linked in two rows and the lookup process will take too much time to walk 500,000 entries. This lookup is performed on each received packet. The selection of *TAB_BITS is entirely based on the recommendation to keep the d-linked lists short (less than 20, not 500,000). 
This will speed up the lookup dramatically. So, for our example of 1,000,000 we must select a table with 1,000,000/20 rows, i.e. 50,000 rows. In our case the min TAB_BITS value is 16 (2^16=65536 >= 50000). If we select 15 bits (32768 rows) we can expect 30 entries in one row (d-linked list) which increases the average time to access these connections. So, the TAB_BITS selection is a compromise between the memory that the empty table will use and the lookup speed in one table row. They are orthogonal. More rows => more memory => faster access. So, for 1,000,000 entries (which is a real limit for 128MB directors) you don't need more than 16 bits for the conn hash table. And the space occupied by such an empty table is 65536*8=512KBytes. Bits greater than 16 can speed up the lookup more but we waste too much memory. And usually we don't achieve 1,000,000 conns with 128MB directors, some memory is occupied for other things. The reason to move to a vmalloc-ed buffer is that a 65536-row table occupies 512KB and if the table is statically defined in the kernel the boot image is 512KB longer, which is obviously very bad. So, the new definition is a pointer (4 bytes instead of 512KB in the bzImage) to the vmalloc'ed area. Ratz's code adds limits per service while this sysctl can limit everything. Or it can be an additional strategy (oh, another one) vs/lowmem. The semantic can be "Don't allocate memory for new connections when the low memory threshold is reached". It can work for the masquerading connections too (2.2). This way we will reserve memory for the user space. Very dangerous option, though.

Joe what's dangerous about it?

One user process can allocate too much memory and cause the LVS to drop new connections because the lowmem threshold is reached. Maybe conn_limit is better, or something like this: if (conn_number > min_conn_limit && free_memory < lowmem_thresh) DROP_THIS_PACKET_FOR_NEW_CONN

why have a min_conn_limit in here? If you put more memory into the director, then you'll have to recompile your kernel. Is it because finding conn_number is cheaper than finding free_memory?

:) The above example with real numbers: if (conn_number > 500000 && free_memory < 10MB) DROP i.e. don't allow the user processes to use memory that LVS can use. But when there are "enough" LVS connections created we can consider reserving 10MB for the user space and start dropping new connections early, i.e. when there is less than 10MB free memory. If conn_number < 500000 LVS simply will hit the 0MB free memory point and the user space will be hurt because these processes allocated too much memory in this case. But obtaining the "free_memory" may cost CPU cycles. Maybe we can stick with a snapshot each second.
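The rule being discussed (drop a packet for a new connection only when conn_number is above min_conn_limit and free memory is below lowmem_thresh) and the once-a-second snapshot idea can be put together in a few lines. This is a user-space sketch of the proposal, not kernel code and not anything that exists in ip_vs; the class and function names are invented, and the way free memory is read is left to a caller-supplied function.

```python
import time

MB = 1024 * 1024


def should_drop_new_conn(conn_number, free_memory,
                         min_conn_limit=500_000, lowmem_thresh=10 * MB):
    """True when both conditions of the proposed rule hold."""
    return conn_number > min_conn_limit and free_memory < lowmem_thresh


class FreeMemorySnapshot:
    """Cache an expensive free-memory reading for max_age_secs seconds."""

    def __init__(self, read_free_memory, max_age_secs=1.0,
                 clock=time.monotonic):
        self._read = read_free_memory   # caller-supplied, returns bytes
        self._max_age = max_age_secs
        self._clock = clock
        self._value = None
        self._stamp = 0.0

    def get(self):
        now = self._clock()
        if self._value is None or now - self._stamp >= self._max_age:
            self._value = self._read()  # refresh at most once per interval
            self._stamp = now
        return self._value
```

A caller would test should_drop_new_conn(conn_number, snapshot.get()) for each packet that would create a new connection, paying for the memory reading only once per interval.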

The number of valid connections shouldn't change dramatically in 1 sec. However a DoS might still cause problems.

Yes, the problem is on SYN attack.

Ratz

max amount of concurrent connections: 3495. We assume having 4 realservers equally load balance, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.

what's the security problem?

SYN/RST flood. My patch will set the weight of the realserver to 0 in case the upper threshold is reached. But I do not test if the requesting traffic is malicious or not, so in case of SYN-flood it may be 99% of the packets causing the server to be taken out of service. In the end we have set all server to weight 0 and the load balancer is non-functional either. But you don't have the memory problem :)
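The failure mode described here is easy to reproduce on paper. The sketch below is not Ratz's patch; it is a toy model with invented names in which each realserver gets an equal share of the 3495 connection ceiling and has its weight set to 0 once it reaches that share, with no check of whether the traffic is malicious.

```python
def per_server_limit(max_conns, n_servers):
    """Equal share of the overall ceiling, rounded down."""
    return max_conns // n_servers


def apply_upper_threshold(servers, limit):
    """servers maps name -> (connections, weight).

    Returns a new mapping in which any server at or above the limit
    has weight 0 (taken out of service); others keep their weight.
    """
    return {
        name: (conns, 0 if conns >= limit else weight)
        for name, (conns, weight) in servers.items()
    }


if __name__ == "__main__":
    limit = per_server_limit(3495, 4)
    print(limit)  # 873

    # A flood that fills every realserver to its share quiesces them all.
    flooded = {f"rs{i}": (limit, 1) for i in range(1, 5)}
    print(apply_upper_threshold(flooded, limit))
```

With every weight at 0 no memory is exhausted, but the director has nothing left to schedule to, which is the security problem being pointed out.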

And it hasn't crashed either. Ratz

I kinda like it but as you said, there is the amem_thresh, my approach (which was not actually done because of this problem :) and now having a lowmem_thresh. I think this will end up in an orthogonal semantic for memory allocation. For example if you enable the amem_thresh the conn_number > min_conn_limit && free_memory < lowmem_thresh would never be the case. OTOH if you set the lowmem_thresh too low the amem_thresh is ineffective. My patch would suffer from this too.

Julian Anastasov ja (at) ssi (dot) bg 08 Jun 2001

lowmem_thresh is not related to amemthresh but when amemthresh < lowmem_thresh the strategies will never be activated. lowmem_thresh should be less than amemthresh. Then the strategies will try to keep the free memory in the lowmem_thresh:amemthresh range instead of the current range 0:amemthresh. Example (I hope you have memory to waste):

lowmem_thresh=16MB (think of it as reserved for user processes and kernel)
amemthresh=32MB (when the defense strategies trigger)
min_conn_limit=500000 (think of it as 60MB reserved for LVS connections)

So, the conn_number can grow far beyond min_conn_limit but only while lowmem_thresh is not reached. If conn_number < 500000 and free_memory < lowmem_thresh we will wait for the OOM killer to help us. So, we have 2 tuning parameters: the desired number of connections and some space reserved for user processes. And maybe this is difficult to tune; we don't know how the kernel prevents problems in VM before activating the killer, i.e. swapping, etc. And the cluster software can take some care when allocating memory.

Hayden Myers hayden (at) spinbox (dot) com 18 Mar 2002

There's also some info located in kernel help. I posted it below for convenience. Using a big ipvs hash table for virtual server will greatly reduce conflicts in the ipvs hash table when there are hundreds of thousands of active connections. Note the table size must be power of 2. The table size will be the value of 2 to the your input number power. For example, the default number is 12, so the table size is 4096. Don't input the number too small, otherwise you will lose performance on it. You can adapt the table size yourself, according to your virtual server application. It is good to set the table size not far less than the number of connections per second multiplying average lasting time of connection in the table. For example, your virtual server gets 200 connections per second, the connection lasts for 200 seconds in average in the masquerading table, the table size should be not far less than 200x200, it is good to set the table size 32768 (2**15). Another note that each connection occupies 128 bytes effectively and each hash entry uses 8 bytes, so you can estimate how much memory is needed for your box.
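The kernel help text gives a sizing rule (connections per second times average lifetime, then a power of two "not far less" than that) and two memory figures (128 bytes per connection, 8 bytes per hash entry). The sketch below works through its 200 x 200 example. Reading "not far less" as "the largest power of two that does not exceed the product" is my interpretation, and the function names are made up.

```python
ENTRY_BYTES = 128   # per connection, from the help text
BUCKET_BYTES = 8    # per hash entry (row), from the help text


def expected_entries(conn_per_sec, avg_lifetime_secs):
    """Connections expected to be in the table at any one time."""
    return conn_per_sec * avg_lifetime_secs


def bits_not_far_less(entries):
    """Largest tab_bits with 2**tab_bits <= entries (an interpretation)."""
    if entries < 1:
        raise ValueError("entries must be at least 1")
    return entries.bit_length() - 1


def memory_estimate_bytes(entries, tab_bits):
    """Connection entries plus the empty table of 2**tab_bits rows."""
    return entries * ENTRY_BYTES + (1 << tab_bits) * BUCKET_BYTES


if __name__ == "__main__":
    entries = expected_entries(200, 200)
    bits = bits_not_far_less(entries)
    print(entries, bits, 1 << bits)              # 40000 15 32768
    print(memory_estimate_bytes(entries, bits))  # 5382144
```

For this example the connections themselves (about 5MB) dominate; the table of rows adds only 256KB.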

Ratz: Leave the settings as a general rule. Some people still want to change the hash table size Daniel Burke 28 Jun 2002

In anticipation of our capacity requirements growing, we had decided it was necessary to increase the connection table size. The value it was at was 16; based on our calculations we needed to bump it to 26 to handle what we will be throwing at it.

Julian It is insane to use 26. That means 2^26 * space for 2 pointers. On x86 it takes 512MB just for allocating empty hash table with 2^26 d-linked lists. Refer to the HOWTO for calculating the best hash table size according to the number of connections. You can select the size (POWER) in this way: 2^POWER = AVERAGE_NUM_CONNS/10 The magic value 10 in this case is the average number of conns expected in one d-linked list, the lookup is slower for more conns. Example: POWER=16 => 65536 rows => 655360 conns, 10 on each row Joe - Wensong has stepped in to stop people from doing this anymore. Wensong Just added code that limits the input number from 8 to 20, in order to prevent this configuring problem from happening again. Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. I think Julian and/or Wensong wrote it. Does anyone of you still have that code? I'd say that the rehashing that would need to take place would consume more CPU cycles than a yet-to-be-proven gain from increasing the bucket size.

Horms horms (at) verge (dot) net (dot) au 17 Feb 2003 Agreed. To expand on this for the benefit of others. The hash table is just that. A hash. Each bucket in the hash can have multiple entries. The implementation is such that each bucket is a linked list of entries.

Exactly. And around 2000 we tuned the hash generation function to have the best balanced distribution over all buckets with a magic prime number which IIRC was pretty exactly the golden ratio if you divided 2^32 through that number.

ratz@zar:~ > echo "(2^32)/2654435761" | bc -l
1.61803399392945414737
ratz@zar:~ >

We took the wisdom from Chuck Lever's paper Linux Kernel Hash Table Behavior: Analysis and Improvements (http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf, site down Sep 2004) So once you have an evenly distributed hash table the search for an entry is almost best case for all entries.
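For readers who have not met this trick, the constant quoted above is the multiplier of a multiplicative hash: multiply the key by it, keep the low 32 bits, and use the top bits of the result as the row number. The sketch below shows only that general technique. It is not the hash function used in any particular ip_vs release, and the way the protocol, address and port are folded into one key is an assumption made for the example.

```python
import ipaddress

GOLDEN_PRIME = 2654435761  # the multiplier quoted in the text


def bucket_index(key, tab_bits):
    """Multiplicative hash of a 32-bit key into 2**tab_bits rows."""
    if not 1 <= tab_bits <= 32:
        raise ValueError("tab_bits must be between 1 and 32")
    product = (key * GOLDEN_PRIME) & 0xFFFFFFFF  # keep the low 32 bits
    return product >> (32 - tab_bits)            # top bits pick the row


def fold_connection(proto, addr, port):
    """Hypothetical way to fold a client's identity into one 32-bit key."""
    return (proto ^ int(ipaddress.IPv4Address(addr)) ^ port) & 0xFFFFFFFF


if __name__ == "__main__":
    print(2 ** 32 / GOLDEN_PRIME)  # close to the golden ratio
    key = fold_connection(6, "192.0.2.10", 40001)
    print(bucket_index(key, 12))   # a row number between 0 and 4095
```

Taking the top bits rather than masking the bottom ones is the usual choice for this kind of hash, because the high bits of the product depend on every bit of the key.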

Thus there is no limit on the number of entries the hash table can hold (other than the amount of memory available). The only advantage of increasing the hash table size is to reduce, statistically speaking, the number of entries in each bucket.

Which gains you almost nothing. The search is still ... left as an exercise to all CS students here :)

And thus the amount of hash table traversal. To be honest I think the gain is probably minimal even in extreme circumstances.

Especially since computing the hash value for an entry takes, most of the time, almost as long as walking the linked list.

Anecdotally, a colleague of mine tested reordering the linked lists, so that the most recently used entry was moved to the front.

Interesting approach. However I guess that the cache line probably already had this entry. It would be interesting to see the amount of TLB flushes done by the kernel depending on the amount of traffic and hash table size.

He then pushed a little bit of traffic through the LVS box (>700Mbit/s). We didn't really see any improvement, which makes me think that hash table performance isn't a particular bottleneck in the current code.

Jernberg, Bernt wrote:

my throughput is 2Gb/s

The tests that I was involved with, at >700Mb/s of traffic, used a hash table with 17 bits. I am not entirely sure how that number was derived as I did not do the tests myself. But it would probably be a good start for you. You are probably going to see a bigger difference by compiling a non-SMP kernel and eliminating spinlocks than you will by twiddling the hash-table bits. I believe that by having an SMP kernel, the overhead of spinlocks is significant under high load. (For more on SMP/UMP with LVS, see comments by Horms on .)

Jernberg, Bernt

I have deployed it at a customer site which offers ftp, http and rsync services. They calculated that they will need 2^21 entries in the hash if it is supported.

Ratz

Let me see (4 secs session coherency and 1/8 of the traffic are valid SYN requests matching the template):

  echo "l(4*2*1024*1024*1024/8)" | bc -l
  20.79441541679835928251

So yes, this would roughly be 21 bits. But now I ask you to read the nice explanation of Horms in this thread on why you do not need to increase the bucket size of the hash table to be able to hold 2**21 entries. You can perfectly well use 17 bits which would give you a linked list depth (provided we have an equilibrium in distribution over the buckets):

  echo "2^21/2^17" | bc -l
  16.00000000000000000000

So lousy 16 entries for one bucket when using 17 bit. This is bloody _nothing_. Let's take the worst case: You'd have maybe 32 entries which is still _nothing_. Your CPU doesn't even fully awake to find an entry in this list :). The amount of RAM you need to hold 2^21 templates entries for a session time of 4 seconds is roughly:

  echo "4*(128*2^21)/1024/1024" | bc -l
  1024.00000000000000000000

1GB. So you're on the safe end. However if you plan on using persistency you'd run out of memory pretty soonish.

Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. We used it to tweak the hash function.

Julian, 17 Feb 2003 hashlvs I created a script for easy testing. Currently, there are 2 hash functions for tests. I don't remember for what LVS version the hash_*.c files were created. usage: get copy of the conn table output (use real data) and feed it to the scripts by specifying the used hash table size (in bits) and the desired hash function method. The result is the expected access time in pseudo units. Bigger access time means slower lookup.

Hash table connection timeouts

How long are the connection entries held for ? (Column 8 of /proc/net/ip_masquerade ?)

Julian

The default timeout value for a TCP session is 15 minutes, for a TCP session after receiving FIN 2 minutes, and for a UDP session 5 minutes. You can use ipchains -M -S tcp tcpfin udp to set your own time values.

If we assume a clunky set of web servers being balanced that take 3s to serve an object, then if the connection entries are dropped immediately then we can balance about 20 million web requests per minute with 128M RAM. If however the connection entries are kept for a longer time period this puts a limit on the balancer.

Yeah, it is true.

e.g. (assuming column 8 is the thing I'm after!)

Actually, the column 8 is the delta value in sequence numbers. The timeout value is in column 10.

i.e. Held for about 2.3 hours, which would limit a 128Mb machine to balance about 10.4 million requests per day. (Which is definitely on the low side knowing our throughput...)
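The two estimates in this exchange follow from one relation: the number of entries that fit in memory divided by how long each entry is held gives the sustainable rate of new connections. The sketch below reproduces the figures, assuming 128 bytes per entry (as quoted elsewhere in this HOWTO) and reading "128M" as 128 * 10^6 bytes; the function names are made up for illustration.

```python
ENTRY_BYTES = 128  # per-connection entry size quoted in the HOWTO


def max_entries(ram_bytes: int, entry_bytes: int = ENTRY_BYTES) -> int:
    """How many connection entries fit in the given amount of memory."""
    return ram_bytes // entry_bytes


def requests_per(entries: int, hold_seconds: float, window_seconds: float) -> float:
    """Connections that can be accepted per window if each entry is held
    for hold_seconds before it is released."""
    return entries / hold_seconds * window_seconds


if __name__ == "__main__":
    entries = max_entries(128 * 10**6)
    # Entries dropped as soon as the 3 s request finishes: per minute.
    print(round(requests_per(entries, 3, 60)))
    # Entries held for about 2.3 hours: per day.
    print(round(requests_per(entries, 2.3 * 3600, 24 * 3600)))
```

With these assumptions the first figure comes out at about 20 million per minute and the second at about 10.4 million per day, matching the numbers in the question.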

Horms horms (at) vergenet (dot) net

When a connection is received by an IPVS server and forwarded (by whatever means) to a back-end server, at what stage is this connection entered into the IPVS table? Is it before or as the packet is sent to the back-end server, or delayed until after the 3 way handshake is complete?

Lars

The first packet is when the connection is assigned to a realserver, thus it must be entered into the table then, otherwise the 3 way handshake would likely hit 3 different realservers.

unknown

It has been alleged that IBM's Net Director waits until the completion of the three way handshake to avoid the table being filled up in the case of a SYN flood. To my mind the existing SYN flood protection in Linux should protect the IPVS table in any case and the connection needs to be in the IPVS table to enable the 3 way handshake to be completed.

Wensong

There is state management in connection entries in the IPVS table. A connection in different states has different timeout values, for example, the timeout of the SYN_RECV state is 1 minute, the timeout of the ESTABLISHED state is 15 minutes (the default). Each connection entry occupies 128 bytes of effective memory. Supposing that there is 128 Mbytes of free memory, the box can have 1 million connection entries. A SYN flood of over 16,667 packets/second can make the box run out of memory, and the syn-flooding attacker probably needs a T3 link or more to perform the attack. It is difficult to syn-flood an IPVS box. It would be much more difficult to attack a box with more memory.

I assume that the timeout is tunable, though reducing the timeout could have implications for prematurely dropping connections. Is there a possibility of implementing random SYN drops if too many SYN are received as I believe is implemented in the kernel TCP stack.

Yup, I should have implemented random early drop of SYN entries a long time ago, as Alan Cox suggested. Actually, it would be simple to add this feature into the existing IPVS code, because the slow timer handler is activated every second to collect stale entries. I just need to add some code to that handler: if over 90% (or 95%) of memory is used, run drop_random_entry to randomly traverse 10% (or 5%) of the entries and drop the SYN-RCV entries among them.

A second, related question: if a packet is forwarded to a server, and this server has failed and is subsequently removed from the available pool using something like ldirectord, is there a window where the packet can be retransmitted to a second server? This would only really work if the packet was a new connection.

Yes, it is true. If the primary load balancer fails over, all the established connections will be lost after the backup takes over. We probably need to investigate how to exchange the state (connection entries) periodically between the primary and the backup without too much performance degradation.
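The idea Wensong outlines here can be sketched in a few lines. This is only an illustration of the proposal as described (check memory use, sample a fraction of the entries, drop the sampled ones still in SYN-RCV); the table layout, the state names and the function signature are assumptions, not the IPVS implementation.

```python
import random

SYN_RECV = "SYN_RECV"        # hypothetical state labels
ESTABLISHED = "ESTABLISHED"


def drop_random_entries(table, used_fraction, threshold=0.90,
                        sample_fraction=0.10, rng=random):
    """Random early drop, run from a periodic timer.

    table maps a connection key to its state. If memory use is above
    threshold, look at a random sample of the entries and delete the
    sampled ones that are still in SYN_RECV. Returns the number dropped.
    """
    if used_fraction <= threshold or not table:
        return 0
    sample_size = max(1, round(len(table) * sample_fraction))
    victims = [key for key in rng.sample(list(table), sample_size)
               if table[key] == SYN_RECV]
    for key in victims:
        del table[key]
    return len(victims)


if __name__ == "__main__":
    conns = {i: (SYN_RECV if i % 2 else ESTABLISHED) for i in range(1000)}
    dropped = drop_random_entries(conns, used_fraction=0.95)
    print("dropped", dropped, "of 100 sampled; table now holds", len(conns))
```

Established connections are never touched, so under a SYN flood the sampled entries are mostly half-open ones and memory is reclaimed without tearing down working sessions.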

If persistent connections are being used and a client is cached but doesn't have any active connections does this count as a connection as far as load balancing, particularly lc and wlc is concerned. I am thinking no. This being the case, is the memory requirement for each client that is cached but has no connections 128bytes as per the memory required for a connection.

The reason that the existing code uses one template and creates different entries for different connections from the same client is to manage the state of different connections from the same client, and it is easy to add seamlessly into the existing IP Masquerading code. If only one template were used for all the connections from the same client, then when the box receives a RST packet it would be impossible to identify which connection it belongs to.

We use a hash table to record an established network connection. How do we know that the data transmission on a connection is over, and when should we delete it from the hash table?

Julian Anastasov ja (at) ssi (dot) bg 24 Dec 2000

OK, here we'll analyze the LVS and mostly the MASQ transition tables from net/ipv4/ip_masq.c. LVS support adds some extensions to the original MASQ code but the handling is the same. First, we have three protocols handled: TCP, UDP and ICMP. The first one (TCP) has many states with different timeout values, most of them set to reasonable values corresponding to the recommendations from some TCP related rfc* documents. For UDP and ICMP there are other timeout values that try to keep both ends connected for a reasonable time without creating many connection entries for each packet.

There are some rules that keep things working:

- When a packet is received for an existing connection or when a new connection is created, a timer is started/restarted for this connection. The timeout used is selected according to the connection state. If a packet is received for this connection (from either end) the timer is restarted again (possibly after a state change). If no packet is received during the selected period, the masq_expire() function is called to try to release the connection entry. It is possible for masq_expire() to restart the timer again for this connection if it is used by other entries. This is the case for the templates used to implement the persistent timeout. They occupy one entry with the timer set to the value of the persistent time interval. There are other cases, mostly used by the MASQ code, where helper connections are used and masq_expire() can't release the expired connection because it is used by others.

- According to the direction of the packet we distinguish two cases: INPUT, where the packet comes in the demasq direction (from the world), and OUTPUT, where the packet comes from an internal host in the masq direction.

What does "masq direction" mean for packets that are not translated using NAT (masquerading), for example, for Direct Routing or Tunneling?
The short answer is: there is no masq direction for these two forwarding methods. It is explained in the LVS docs. In short, we have packets in both directions when NAT is used and packets only in one direction (INPUT) when DR or TUN are used. The packets are not demasqueraded for the DR and TUN methods. LVS just hooks the LOCAL_IN chain, as the MASQ code is privileged in Linux 2.2 to inspect the incoming traffic when the routing decides that the traffic must be delivered locally. After some hacking, the demasquerading is avoided for these two methods, of course, after some changes in the packet and in its next destination - the realservers. Don't forget that without LVS or MASQ rules, these packets hit the local socket listeners.

How are the connection states changed? Let's analyze for example the masq_tcp_states table (we analyze the TCP states here, UDP and ICMP are trivial). The columns specify the current state. The rows explain the TCP flag used to select the next TCP state and its timeout. The TCP flag is selected by masq_tcp_state_idx(). This function analyzes the TCP header and decides which flag (if many are set) is meaningful for the transition. The row (flag index) in the state table is returned. masq_tcp_state() is called to change ms->state according to the current ms->state and the TCP flag, by looking in the transition table. The transition table is selected according to the packet direction: INPUT, OUTPUT. This helps us to react differently when the packets come from different directions. This is explained later, but in short the transitions are separated in such a way (between INPUT and OUTPUT) that transitions to states with longer timeouts are avoided when they are caused by packets coming from the world. Everyone understands the reason for this: the world can flood us with many packets that can eat all the memory in our box. This is the reason for this complex scheme of states and transitions.
The ideal case is when there are no different timeouts for the different states and when we use one timeout value for all TCP states as in UDP and ICMP. Why not one for all these protocols? The world is not ideal. We try to give more time for the established connections and if they are active (i.e. they don't expire in the 15 mins we give them by default) they can live forever (at least to the next kernel crash^H^H^H^H^Hupgrade).

How does LVS extend this scheme? For the DR and TUN methods we have packets coming from the world only. We don't use the OUTPUT table to select the next state (the director doesn't see packets returning from the internal hosts). We need to relax our INPUT rules and to switch to the state required by the external hosts :( We can't derive our transitions from the trusted internal hosts. We change the state only based on the packets coming from the clients. When we use the INPUT_ONLY table (for DR and TUN) the LVS expects a SYN packet and then an ACK packet from the client to enter the established state. The director enters the established state after a two packet sequence from the client without knowing what happens in the realserver, which can drop the packets (if they are invalid) or establish a connection. When an attacker sends SYN and ACK packets to flood an LVS-DR or LVS-Tun director, many connections enter the established state. Each established connection will allocate resources (memory) for 15 mins by default. If the attacker uses many different source addresses for this attack the director will run out of memory. For these two methods LVS introduces one more transition table: the INPUT_ONLY table, which is used for the connections created for the DR and TUN forwarding methods. The main goal: don't enter established state too easily - make it harder.

Oh, maybe you're just reading the TCP specifications. There are sequence numbers that both ends attach to each TCP packet.
And you don't see the masq or LVS code trying to filter the packets according to the sequence numbers. This can be fatal for some connections as the attacker can cause a state change by hitting a connection with a RST packet, for example (ES->CL). The only info needed for this kind of attack is the source and destination IP addresses and ports. Such attacks are possible but not always fatal for the active connections. The MASQ code tries to avoid such attacks by selecting minimal timeouts that are enough for the active connections to resurrect. For example, if the connection is hit by a TCP RST packet from an attacker, this connection has 10 seconds to give evidence of its existence by passing an ACK packet through the masq box.

To make things more complex and harder for the attacker to block a masq box with many established connections, LVS extends the NAT mode (INPUT and OUTPUT tables) by introducing internal server driven state transitions: the secure_tcp defense strategy. When enabled, the TCP flags in the client's packet can't trigger switching to established state without acknowledgement from the internal end of this connection. secure_tcp changes the transition tables and the state timeouts to achieve this goal. The mechanism is simple: keep the connection in SR state with a timeout of 10 seconds instead of the default 60 seconds used when secure_tcp is not enabled. This trick depends on the different defense power in the realservers. If they don't implement SYN cookies and so sometimes don't send SYN+ACK (because the incoming SYN is dropped from their full backlog queue), the connection expires in LVS after 10 seconds. This action assumes that this is a connection created by an attacker, since one SYN packet is not followed by another, as part of the retransmissions provided by the client's TCP stack. We give 10 seconds to the realserver to reply with SYN+ACK (even 2 are enough).
If the realserver implements SYN cookies the SYN+ACK reply follows the SYN request immediately. But if there are no SYN cookies implemented the SYN requests are dropped when the backlog queue length is exceeded. So secure_tcp is by default useful for realservers that don't implement SYN cookies. In this case the LVS expires the connections in SYN state in a short time, releasing the memory resources allocated for them. In any case, secure_tcp does not allow switching to established state by looking in the client's packets. We expect an ACK from the realserver to allow this transition to EST state.

The main goal of the defense strategies is to keep the LVS box with more free memory for other connections. The defense for the realservers can be built in the realservers. But maybe I'll propose to Wensong to add a per-connection packet rate limit. This will help against attacks that create a small number of connections but send many packets and in this way load the realservers dramatically. Maybe two values: a rate limit for all incoming packets and a rate limit per connection.

The good news is that all these timeout values can be changed in the LVS setup, but only when the secure_tcp strategy is enabled. An SR timeout of 2 seconds is a good value for LVS clusters when realservers don't implement SYN cookies: if there is no SYN+ACK from the realserver then drop the entry at the director.

The bad news is, of course, for the DR and TUN methods. The director doesn't see the packets returning from the realservers, and LVS-DR and LVS-Tun forwarding can't use the internal server driven mechanism. There are other defense strategies that help when using these methods. All these defense strategies keep the director with memory free for more new connections. There is no known way to pass only valid requests to the internal servers. This is because the realservers don't provide information to the director and we don't know which packet is dropped or accepted by the socket listener.
We can know this only by receiving an ACK packet from the internal server when the three-way handshake is completed and the client is identified by the internal server as a valid client, not as a spoofed one. This is possible only for the NAT method.

ksparger (at) dialtoneinternet (dot) net (29 Jan 2001) rephrases this by saying that LVS-NAT is layer-3 aware. For example, NAT can 'see' if a realserver responds to a packet it's been sent or not, since it's watching all of the traffic anyway. If the server doesn't respond within a certain period of time, the director can automatically route that packet to another server. LVS doesn't support this right now, but NAT would be the more likely candidate to support it in the future, as NAT understands all of the IP layer concepts, and DR doesn't necessarily.

Julian

Someone must put back the realserver when it is alive. This sounds like a user space job. The traffic will not start until we send requests. Do we have to send L4 probes to the realserver (from user space) or probe it with requests (LVS from kernel space)?
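The table-driven scheme Julian describes in this section (pick a table by packet direction, look up the next state from the current state and the meaningful TCP flag, then restart the timer with that state's timeout) can be sketched as below. The timeouts are the figures quoted in this HOWTO; the tables are a heavily simplified guess containing only the transitions mentioned in the text, and the state names and the step() helper are hypothetical. The real masq_tcp_states tables have more states and rows.

```python
# Timeouts in seconds, from the figures quoted in the text.
TIMEOUT = {"SYN_RECV": 60, "ESTABLISHED": 15 * 60, "FIN_WAIT": 2 * 60, "CLOSE": 10}
SECURE_TCP_TIMEOUT = dict(TIMEOUT, SYN_RECV=10)  # secure_tcp shortens SR

# (current state, meaningful TCP flag) -> next state. Simplified assumption.
INPUT_ONLY = {  # LVS-DR / LVS-Tun: the director sees client packets only
    ("NONE", "SYN"): "SYN_RECV",
    ("SYN_RECV", "ACK"): "ESTABLISHED",
    ("ESTABLISHED", "FIN"): "FIN_WAIT",
    ("ESTABLISHED", "RST"): "CLOSE",
}
SECURE_INPUT = {  # LVS-NAT with secure_tcp: a client ACK alone is not enough
    ("NONE", "SYN"): "SYN_RECV",
    ("ESTABLISHED", "FIN"): "FIN_WAIT",
    ("ESTABLISHED", "RST"): "CLOSE",
}
SECURE_OUTPUT = {  # packets coming back from the realserver
    ("SYN_RECV", "ACK"): "ESTABLISHED",
}


def step(state, flag, table, timeouts):
    """Return (next_state, timeout). A pair missing from the table keeps the
    current state; the timer is restarted with that state's timeout."""
    next_state = table.get((state, flag), state)
    return next_state, timeouts.get(next_state, 0)


if __name__ == "__main__":
    # DR/TUN: two client packets are enough to hold memory for 15 minutes.
    state, _ = step("NONE", "SYN", INPUT_ONLY, TIMEOUT)
    print(step(state, "ACK", INPUT_ONLY, TIMEOUT))
    # NAT + secure_tcp: the client's ACK leaves the entry in SYN_RECV (10 s)...
    print(step("SYN_RECV", "ACK", SECURE_INPUT, SECURE_TCP_TIMEOUT))
    # ...until the realserver acknowledges.
    print(step("SYN_RECV", "ACK", SECURE_OUTPUT, SECURE_TCP_TIMEOUT))
```

The point of splitting the tables by direction shows up in the last two calls: the same flag from the untrusted side and from the trusted side leads to different states and therefore to very different memory lifetimes.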

Hash Table DoS

A posting (Jun 2003) on Slashdot links to a paper on Denial of Service via Algorithmic Complexity Attacks. The article shows how to mount a DoS by attacking hash tables. Access to entries in hash tables for most algorithms is different for the average case (randomly sorted data where access time might be O(n log n)) and for the worst case (all in reverse order, where access time could be O(n^2)). Programmers hope that real life data is not pathological. If the attacker knows the hash algorithm (i.e. has the source code), then they may be able to construct a worst case dataset for the hashing algorithm, which will bring the server to its knees. The paper discusses constructing attacks in which all data is entered into one bucket of the hash table. In the case of LVS, the hash table contents are (CIP:port, proto, VIP:port). The client only has a small number of variables (the port and proto they are sending from) from which to mount an attack, the others being fixed (CIP, VIP:port).

Horms horms (at) verge (dot) net (dot) au 05 Jun 2003

Here is my take on this: LVS uses the following hash

  hash = (proto XOR CIP XOR (CIP >> IP_VS_CONN_TAB_BITS) XOR port) & IP_VS_CONN_TAB_MASK

  where:
  proto = protocol (TCP=6, UDP=17)
  CIP = source/client IP address (host byte order)
  port = source port (host byte order)
  IP_VS_CONN_TAB_BITS defaults to 8
  IP_VS_CONN_TAB_MASK is (1 << IP_VS_CONN_TAB_BITS) - 1, thus the default is 0xff

(from here '^' will mean power)

The only inputs a user/client can affect are CIP and port. I would say that it is quite easy for someone to set things up so that they consistently hit the same bucket. For instance by connecting from the same IP address with different ports from the set (port % IP_VS_CONN_TAB_MASK) = n (though we observe that each end-user only has 2^(16-IP_VS_CONN_TAB_BITS) = 256 such ports). To go beyond that, the client would need to use multiple source IP addresses.
The effect is that instead of n connections going into 2^IP_VS_CONN_TAB_BITS different buckets they will go into one bucket. Thus LVS will have to do on average n/2 traversals instead of n/2^(IP_VS_CONN_TAB_BITS+1) traversals. The real effect is to amplify traversal times by 2^IP_VS_CONN_TAB_BITS. Though it is worth remembering that the larger IP_VS_CONN_TAB_BITS is, the lower 2^(16-IP_VS_CONN_TAB_BITS) becomes, and thus the greater the number of source IP addresses required becomes. Though if it was a UDP based attack this might not be an issue as the source IP could be spoofed. This could become a problem if n became very large. But how large? Traversal is actually pretty fast. So I think that n would need to be quite large indeed to have a noticeable effect on LVS and larger still to seriously degrade performance. Though I could be wrong. As for solutions, it is a bit tricky AFAIK. Perhaps using some component of the skb which is static for a connection, but not directly influenced by end users. But that may well open up a whole new can of worms.

Ratz 05 Jun 2003

Theoretically we're susceptible to this sort of attack. Check out the devastating effects of running Julian's testlvs with certain parameters. You can still enable the LVS DoS defense strategies though (see ).
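Horms's observation can be checked with a small sketch. The hash below follows the shape of the expression in his post, with the part that is cut off in this copy filled in from my understanding of ip_vs_conn_hashkey() in the 2.4 IPVS code, so treat the exact formula as an assumption. The client address is made up and the table size of 8 bits is the value Horms uses.

```python
IP_VS_CONN_TAB_BITS = 8                               # value used in the post
IP_VS_CONN_TAB_MASK = (1 << IP_VS_CONN_TAB_BITS) - 1  # 0xff


def conn_hashkey(proto: int, cip: int, port: int,
                 bits: int = IP_VS_CONN_TAB_BITS) -> int:
    """XOR-fold protocol, client address and client port into a bucket index
    (assumed form of the IPVS connection hash)."""
    mask = (1 << bits) - 1
    return (proto ^ cip ^ (cip >> bits) ^ port) & mask


def colliding_ports(proto: int, cip: int, target_bucket: int,
                    bits: int = IP_VS_CONN_TAB_BITS):
    """All source ports that send this client's connections to one bucket."""
    return [p for p in range(1 << 16)
            if conn_hashkey(proto, cip, p, bits) == target_bucket]


if __name__ == "__main__":
    cip = 0xC0A80117  # hypothetical client 192.168.1.23, host byte order
    ports = colliding_ports(6, cip, target_bucket=0)
    print(len(ports), "ports from one address hit bucket 0, e.g.", ports[:4])
    n = len(ports)
    print("average traversals, attack :", n / 2)
    print("average traversals, spread :", n / 2 ** (IP_VS_CONN_TAB_BITS + 1))
```

For a fixed protocol and address only the low bits of the port decide the bucket, which is why one address yields 2^(16-bits) colliding ports and why a larger table forces the attacker to use more source addresses.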

testing hash code

Julian, 14 Jun 2003

Maybe it is time to change the hash function used for the table with connections. Today I played with some data from my 2.2 director and fixed the tools that measure different hash functions. I tested the default LVS function, one that uses 2654435761 and the Jenkins hash that is present in recent 2.4 and 2.5 kernels. We need some help from the math perspective. Here are some tools for testing the hash functions. Look in Julian's LVS page for test.txt, which contains short instructions for testing, and ipvs-1.0.9-hash1.diff, test hash code for 2.4.21. I created a test patch against ipvs-1.0.9 (not tested). This is an attempt to introduce randomness on IPVS load. As for the tests with the different hash functions you can see my results and of course try them yourself. My conclusion is that 2654435761 is better and faster but I hope we will see other results.

Hash table size, director will crash when it runs out of memory.

Yasser Nabi IP Virtual Server version 0.9.0 (size=16777216)

Julian Anastasov ja (at) ssi (dot) bg 25 May 2001 Too much, it takes 128MB for the table only. Use 16 bits for example.

Is this a hidden/undocumented problem with IPVS or it's just an observation of memory waste ? (we use 18 bits in production)

If the box has 128MB and the bits are 24 the kernel crash is mandatory, sooner or later. And this is a good reason for the virtual service not to be hit. Expect funny things to happen on a box with low memory.

I forgot that not everyone uses 256MB or more RAM on directors :)

Yes, 256MB in real situation is ~1,500,000 connections, 128 bytes each, 64MB for other things ... until someone experiments with SYN attack

However, for me it makes sense to use up to 66% of total memory for LVS, especially on high-traffic directors (on the idea that the directors don't run all the desktop garbage that comes with most distros).

If the used bits are 24, an empty hash table is 128MB. For the remaining 128MB you can allocate 1048576 entries, 128 bytes each ... after the kernel has killed all processes. Some calcs considering the magic value 16 as average bucket length and for 256MB memory:

  For 17 bits: 2^17=131072 => 1MB for empty hash table
               131072*16=2097152 entries=256MB for connections
  For 18 bits: 2^18=262144 => 2MB for empty hash table

For each MB of hash table we lose space for 8192 entries but we speed up the lookup. So, even for 1GB directors, 19 or 20 is the recommended value. Anything above is a waste of memory for the hash table. In 128MB we can put 1048576 entries. In the 24-bit case they are allocated for d-linked list heads.

Joe 6 Jun 2001

What happens after the table fills up? Does ipvs handle new connect requests gracefully (i.e. drops them and doesn't crash)?

Julian

The table has a fixed number of rows and an unlimited number of columns (d-lists where the connection structures are enqueued). The number of connections allocated depends on the free memory. Once there is no memory to allocate a connection structure, the connection requests will be dropped. Expect crashes maybe at another place (usually user space) :) I'm not sure what the kernel will decide in this situation but don't rely on the fact that some processes will not be killed. There is constant network activity and a need for memory for packets (floods/bursts). And the reason the defense strategies exist is to free memory for new connections by removing the stalled ones. The defense strategy can be automatically activated on a memory threshold. Killing the cluster software on memory pressure is not good. So, the memory can be controlled, for example, by setting drop_entry to 1 and tuning amemthresh. On floods it can be increased. It depends on the network speed too: 100/1000Mbit. Thresholds of 16 or 32 megabytes can be used in such situations, of course, when there are more memory chips.
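The table-size arithmetic Julian gives in this section can be reproduced with a few helper functions. This is a sketch under the figures quoted in the HOWTO: 8 bytes per bucket (two 4-byte pointers for a d-linked list head on 32-bit x86) and 128 bytes per connection entry. The function names are made up for illustration.

```python
MIB = 1 << 20
BUCKET_BYTES = 8    # two 4-byte pointers per list head (32-bit x86)
ENTRY_BYTES = 128   # per-connection entry size quoted in the HOWTO


def empty_table_bytes(bits: int) -> int:
    """Memory used by the hash table itself before any connection exists."""
    return (1 << bits) * BUCKET_BYTES


def entries_for(bits: int, avg_chain: int = 16) -> int:
    """Connections held when every bucket carries avg_chain entries."""
    return (1 << bits) * avg_chain


def bits_for(avg_conns: int, avg_chain: int = 10) -> int:
    """Smallest table size (in bits) keeping the average chain length at or
    below avg_chain, i.e. 2^bits >= avg_conns / avg_chain."""
    bits = 0
    while (1 << bits) * avg_chain < avg_conns:
        bits += 1
    return bits


if __name__ == "__main__":
    for bits in (17, 18, 20, 24, 26):
        print(bits, "bits:",
              empty_table_bytes(bits) // MIB, "MB empty table,",
              entries_for(bits) * ENTRY_BYTES // MIB, "MB for 16 entries per bucket")
    print("bits for 655360 connections at 10 per bucket:", bits_for(655360))
```

Running it shows why large values are wasteful: at 24 bits the empty table alone takes 128MB, while at 17 bits it takes 1MB and still reaches 2097152 connections at 16 entries per bucket.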

Roberto Nibali ratz (at) tac (dot) ch

The director never crashes because of exhaustion of memory. If it tries to allocate memory for a new entry into the table and kmalloc returns NULL, we return, or better, drop the packet being processed and generate a page fault. You could use my threshold limitation patch. You calculate how many connections you can sustain with your memory by dividing it by the 128 bytes of each connection entry, then divide by the number of realservers and set the limit accordingly. Example: 128MByte, persistency 300s: max number of concurrent connections: 3495. We assume having 4 realservers equally load balanced, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.
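Ratz's example numbers can be reproduced as follows. This is a sketch of the arithmetic only, assuming 128MByte means 128 * 2^20 bytes and that the per-realserver limit is the total divided evenly; the function name is hypothetical and this is not part of the threshold patch.

```python
def connection_limits(ram_bytes: int, persistence_s: int, realservers: int,
                      entry_bytes: int = 128):
    """Return (total, per_realserver) limits: the entries that fit in memory,
    spread over the persistence time, then split across the realservers."""
    total = ram_bytes // entry_bytes // persistence_s
    return total, total // realservers


if __name__ == "__main__":
    total, per_rs = connection_limits(128 * 2**20, persistence_s=300, realservers=4)
    print("upper limit:", total, "per realserver:", per_rs)
```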

Joe It would seem that we need a method of stopping the director hash table from using all memory whether as a result of a DoS attack or in normal service. Let's say you fill up RAM with the hash table and all user processes go to swap, then there will be problems - I don't know what, but it doesn't sound great - at a high number of connections I expect the user space processes will be needed too. I expect we need to leave a certain amount for user space processes and not allow the director to take more than a certain amount of memory. It would be nice if the director didn't crash when the number of connections got large. Presumably a director would be functioning only as a director and the amount of memory allocated to user space processes wouldn't change a whole lot (ie you'd know how much memory it needed).

The LVS code does not swap

Joe Feb 2001

With a sufficient number of connections, a director could start to swap out its tables (is this true?). In this case, throughput could slow to a crawl. I presume the kernel would have to retrieve parts of the table to find the realserver associated with incoming packets. I would think in this case it would be better to drop connect requests than to accept them.

Julian IMO, this is not true. LVS uses GFP_ATOMIC kind of allocations and as I know such allocations can't be swapped out.

If it's possible for LVS to cause the director to swap, is there some way to stop this?

You can try with testlvs whether the LVS uses swap. Start the kernel with LILO option mem=8M and with large swap area. Then check whether more than 8MB swap is used.

Other factors determining the number of connections

In earlier versions of LVS, you set the amount of memory for the tables (in bytes). Now you allocate a number of hashes, whose size can grow without limit, allowing an unlimited number of connections. Once the number of connections becomes sufficiently large, then other resources will become limiting.

- out of memory. The ipvs code doesn't handle this, presumably the director will crash (also see the threshold patch). Instead you handle this by brute force, adding enough memory to accept the maximum number of connections your setup will ever be asked to handle (e.g. under a DoS attack). This memory size can be determined by multiplying the rate at which your network connection can push connect requests to the director by the timeout values, which are set by FIN_WAIT or the persistence.

- out of ports. You can expand the number of ports to 65k, but eventually you'll reach the 65k port limit.

Port range: limitations, expanding port range on directors

Sometimes client processes on the realservers need to connect with machines on the internet (see clients on realservers).

Wayne wayne (at) compute-aid (dot) com Nov 5 2001

Say you have a web page that has to retrieve on-line ads from one of your advertisers (people who pay you for showing their ads). If you have 50,000 visitors on your site, you will open 50,000 connections between your web server and the ad server out there somewhere. The masquerade limit is 4,096 per pair of IP addresses, and 40,960 per LVS box. In our case, the realserver is behind the LVS-NAT director, which also functions as the firewall, so the realserver MUST use the director to reach the ad servers.

Usually the RIP is private (e.g. 192.168/16) and will have to be NAT'ed to the outside world. This can be done with LVS-NAT or LVS-DR by adding masquerading rules to the director's iptables/ipchains rules. (With LVS-DR, you also have to route the packets from the RIP - this routing is set up by default with the configure script.) Less often you want to use more ports on your LVS client machines.

Wang Haiguang

My client machine uses port numbers between 1024 and 4096. After reaching 4096, it loops back to 1024 and reuses the ports. I want to use more port numbers.

michael_e_brown (at) dell (dot) com 06 Feb 2001

  /proc/sys/net/ipv4/ip_local_port_range
  /usr/src/linux/Documentation/networking/ip-sysctl.txt

While normal client processes use ports in order starting at 1024, masqueraded ports start at 61440 (2^16-2^12) for 2.2.x kernels (see clients on realservers). The masquerading code does not check if other processes are requesting ports and thus port collisions could occur. It is assumed on a NAT box that no other processes are initiating connections (i.e. you aren't running a passive ftp server).

Horms 26 Jun 2007

I believe that there is a school of thought that source ports should be randomised to mitigate certain classes of security threats.

Horms 17 May 2004

I am a bit rusty on 2.2. But the restricted port range for NAT'ed connections with 2.2 sounds familiar. I seem to recall you could change the range, but it required changing a define in the kernel and recompiling. I think it was changed to a /proc value in 2.4.

For 2.4.x kernels, the restriction to only use high ports is removed. The NAT router uses ports starting at 1024.

Horms 17 May 2004

In 2.4 the ephemeral port range and the NAT port range are the same. If this is the case, which I guess it is, then it would seem likely there is some sort of collision detection. I took a look and this does not seem to be the case. I assume the kernel has some other way of handling this. But I am not sure at this moment. If you are interested try looking at tcp_v4_get_port() and tcp_unique_tuple(). I'm not convinced that Michael Brown's comment is correct at the moment, but I don't have the definitive answer either.

Wayne wayne (at) compute-aid (dot) com 14 May 2000

If running a load balancer tester, say the one from IXIA, to issue connections to 100 powerful web servers, would all the parameters in Julian's description need to be changed, or should it not be a problem to have very many connections from a single tester?

Julian There is no limit for the connections from the internal hosts. Currently, the masquerading allows one internal host to create 40960 TCP connections. But the limit of 4096 connections to one external service is still valid. If 10 internal hosts try to connect to one external service, each internal host can create 4096/10 => 409 connections. For UDP the problem is sometimes worse. It depends on the /proc/sys/net/ipv4/ip_masq_udp_dloose value. Joe

which is internal and which is external here? The client, the realservers?

This is plain masquerading, so internal and external refer to masquerading. These limits are not for the LVS connections, they are only for the 2.2 masquerading. (Diagram: Internal Servers, MADDR with masq ports 61000 to 65095, External Server:PORT.) When many internal clients try to connect to the same external real service, the total number of TCP connections from one MADDR to this remote service can be 4096 because the masq uses only 4096 masq ports by default. This is a normal TCP limit, we distinguish the TCP connections by the fact they use different ports, nothing more. And the masq code is restricted by default to use the above range of 4096 ports. In the whole masquerading table there is space for only 40960 TCP, 40960 UDP and 40960 ICMP connections. These values can be tuned by changing ip_masq.c:PORT_MASQ_MUL. For 2.4 masquerading, all ports can be used for the masqueraded connections. Wayne wayne (at) compute-aid (dot) com 1 Nov 2001

PORT_MASQ_MUL appears to serve only as a check to make sure the masquerading facility does not hog all the memory, and that actually things would still work no matter how large PORT_MASQ_MUL is, or even if the checks using it are disabled. Is this true?
Julian By multiplying this constant with the masq port range, you define the connection limit for each protocol. This is related to the memory used for masquerading. This is a real limit, but not for LVS connections, because they are usually not limited by port collisions, and LVS does not check this limit.
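
The limits quoted above for 2.2 masquerading can be checked with a few lines of arithmetic. This is only a sketch: the port range 61000 to 65095 and the 40960 table limit come from the text, while the value 10 for PORT_MASQ_MUL is an assumption inferred from 40960 / 4096 rather than read from ip_masq.c, and the function names are made up for illustration.

```python
# Figures quoted in the discussion above for 2.2 masquerading.
MASQ_PORT_LOW = 61000
MASQ_PORT_HIGH = 65095
PORT_MASQ_MUL = 10  # assumption: inferred from 40960 / 4096, not read from ip_masq.c


def masq_port_count(low=MASQ_PORT_LOW, high=MASQ_PORT_HIGH):
    """Number of masq ports in an inclusive port range."""
    return high - low + 1


def table_limit(port_count, mul=PORT_MASQ_MUL):
    """Per-protocol number of entries allowed in the masquerading table."""
    return port_count * mul


def per_host_share(port_count, internal_hosts):
    """Even split of the connections to one external service among internal hosts."""
    return port_count // internal_hosts


ports = masq_port_count()
print(ports, table_limit(ports), per_host_share(ports, 10))  # 4096 40960 409
```

Widening the masq port range or raising the multiplier raises the table limit proportionally, which is why Julian ties both to memory use.
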
What about using more than the 32k range? What is the maximum I could select? Peter Mueller pmueller (at) sidestep (dot) com
You should be able to use about 60k, i.e. 1024-61000. I hope you have lots of RAM :-)
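
To see how many local ports a given range offers, you can parse the two numbers held in /proc/sys/net/ipv4/ip_local_port_range and count them. The following is a minimal sketch; the helper names are made up, the 1024 to 4096 range is the one from Wang's posting, and 1024 to 61000 is used only as an example of a range of about 60k ports.

```python
def parse_port_range(text):
    """Parse the two whitespace-separated integers found in ip_local_port_range."""
    low, high = (int(field) for field in text.split())
    if not 0 < low <= high <= 65535:
        raise ValueError("invalid port range: %r" % text)
    return low, high


def port_count(low, high):
    """Number of ports in an inclusive range."""
    return high - low + 1


def read_local_port_range(path="/proc/sys/net/ipv4/ip_local_port_range"):
    """Read the current range from a Linux box (not called in this example)."""
    with open(path) as handle:
        return parse_port_range(handle.read())


print(port_count(*parse_port_range("1024\t4096")))  # 3073
print(port_count(*parse_port_range("1024 61000")))  # 59977
```
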

Julian continuing The PORT_MASQ_MUL value simply determines the recommended length of one row in the masq hash table for connections, but in fact it is involved in the above connection limits. Busy masq routers should increase this value, and maybe the 4096 masq port range too. This applies to squid servers behind a masq router. LVS uses another table without limits. For LVS setups the same TCP restrictions apply but for the external clients: the number of client connections to one VIP:VPORT is limited to the number of client ports used from the same client IP. The same restrictions apply to UDP. UDP has the same port ranges. But for UDP the 2.2 kernel can apply different restrictions. They are caused by some optimizations that try to create one UDP entry for many connections. The reason for this is the fact that one UDP client can connect to many UDP servers while this is not common for TCP. Joe

when you increase the port range, you need more memory. Is this only because you can have more connections and hence will need a bigger ipvsadm table?

Yes, the first need is for more masqueraded connections and they allocate memory. LVS uses a separate table and it is not limited. We distinguish LVS-NAT from masquerade. LVS-NAT (and any other method) does not allocate extra ports, even for other ranges. It shadows only the defined port. No other ports are involved until masquerading is used.

ipvs doesn't check port ranges and so collisions can occur with regular services (ftp was mentioned). I would have thought that a process needing to open an IP connection would ask the tcp code in the kernel for a connection and let that code handle the assignment of the port.

LVS does not allocate local ports. When the masquerade is used to help with some protocol, the masquerade performs the check (ftp for example). The port range has nothing to do with LVS. It helps the masquerading to create more connections because there is a fixed limit for each protocol. But sometimes LVS for 2.2 uses ip_masq_ftp, so maybe only then is this mport range used.

X-window connections are at 6000.. Will you be able to start an X-session if these ports are in use by a director masquerading out connections from the realservers?

If we put LVS (ipvsadm -A ) in front of this port 6000 then X sessions will be stopped. OTOH, masquerade does not select ports in this range, the default is above 61000. So, any FTP sessions will not disturb local ports, of course, if you don't increase the mport range to cover the well defined server ports such as X.

Director does not have any ports (connections) open for an LVS connection The director is just a router (admittedly with slightly different rules than the standard layer 3 router). There are no connections made (ports opened) between the client and the director, or between the realservers and the director. The director does keep track of the packets passing through that are for LVS'ed services (connection tracking) as part of updating the hash table and for the server state synch demon. Sebastien BONNET sebastien(dot) bonnet (at) experian (dot) fr 11 May 2004 There's no "open" connection on the director, just tracked connections. The clients are not "connected" to the load balancer. For the client, assuming a client always uses a different port for an outgoing connection, it can roughly initiate 65K connections. For the realserver, there's no real port limit for a daemon listening on a single port: it uses just one. The realserver can have connections to all ports on all IPs, i.e. 256*256*256*256*(65536-1024) connections (the realserver may run out of memory before it can make all these connections). If there was no file descriptor limit nor memory constraint, a server could handle way more than the current "port limit" (65K) simultaneous connections.
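
Sebastien's upper bound for a daemon listening on a single port is easy to evaluate. The sketch below just multiplies out the figure given in the text; the function name is made up, and the result is a theoretical tuple count, not something a real machine would reach.

```python
def listening_port_tuples(reserved_ports=1024):
    """Distinct (remote IP, remote port) pairs that can reach one listening port."""
    remote_addresses = 256 ** 4              # every possible IPv4 address
    remote_ports = 65536 - reserved_ports    # client ports above the reserved range
    return remote_addresses * remote_ports


print(listening_port_tuples())  # 277076930199552
```
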

apps starved for ports LVS Account, 27 Feb 2001

I'm trying to do some load testing of LVS using a reverse proxy cache server as the load balanced app. The error I get is from a load generating app.. Here is the error:

Julian Anastasov ja (at) ssi (dot) bg Broken app.

this goes on for a few hundred requests then I start getting:

App uses too many local ports.

This is when I can't telnet to port 80 any more... If I try to telnet to 10.0.0.80 80 I get this:

No more free local ports.

If I go directly to the web server OR if I go directly to the IP of the reverse proxy cache server, I don't get these errors.

Hm, there are free local ports now.

I'm using a load balancing app that I call this way: upping the local port range has helped tremendously

realserver running out of ports Here's a case where a realserver ran out of udp ports doing DNS lookups while serving http. Hendrik Thiel thiel (at) falkag (dot) de

I am using IP Virtual Server version 0.9.14 (size=4096). We have 6 Realservers each.
RemoteAddress:Port Forward Weight ActiveConn InActConn
-> server1:www Masq 1 68 12391
Today we reached a new peak (very fast, few minutes) 30Mbps, up from the normal 15Mbit/s. Afterwards the following kernel messages (dmesg) showed up...

Julian Anastasov ja (at) ssi (dot) bg 20 Nov 2001 (heavily edited by Joe) It seems you are flooding a single remote host with UDP requests from a realserver. Your service, www, is TCP and is not directly connected to these messages. You've reached the UDP limit per destination (4096), there are still free UDP ports on the realserver for other destinations. Hendrik

yes it's DNS, each realserver is a caching DNS.

Maximum number of NICs This is not really an LVS question, but people want to know. ntadmin (at) reachone (dot) com

We are nearing 255 virtual interfaces on the external side of our LVS system (Joe - presumably the number of VIPs). Can somebody tell me if this is going to be a hard limit or if we can go beyond 255 on one network card?

Roberto Nibali ratz (at) tac (dot) ch 17 Dec 2003 No problem (proof of concept below):
/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
763
# for ((i = 1; i < 255; i++)); do for ((j = 1; j < 4; j++)); do ip addr del 127.0.$j.$i/32 dev lo brd + label lo:$i-$j 1>/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
1
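
The count of 763 in Roberto's output is consistent with 254 x 3 loopback aliases plus the usual 127.0.0.1. The sketch below regenerates the command strings from the loop bounds of the delete loop shown above. The "add" form is an assumption: it mirrors the delete loop, since the start of the original add command is missing from the posting. The function name is made up, and nothing is executed; the commands are only built as strings.

```python
def alias_commands(action="del"):
    """Build the ip(8) command lines for the loop i in 1..254, j in 1..3."""
    commands = []
    for i in range(1, 255):
        for j in range(1, 4):
            commands.append(
                "ip addr %s 127.0.%d.%d/32 dev lo brd + label lo:%d-%d"
                % (action, j, i, i, j)
            )
    return commands


adds = alias_commands("add")  # assumed mirror image of the delete loop
print(adds[0])                # ip addr add 127.0.1.1/32 dev lo brd + label lo:1-1
print(len(adds) + 1)          # 763: the aliases plus the existing 127.0.0.1
```
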

DoS LVS is vulnerable to DoS by an attacker making repeated connection requests. Each connection requires 128 bytes of memory - eventually the director will run out of memory. This will take a while, but an attacker has plenty of time if you're asleep. As well, with LVS-DR and LVS-Tun, the director doesn't have access to the TCP/IP tables in the realserver(s) which show whether a connection has closed (see director hash table). The director can only guess that the connection has really closed, and does so using timeouts. Roberto Nibali ratzi (at) tac (dot) ch 10 Sep 2002

It's _impossible_ to differentiate between malicious and good traffic. End of story. But you can rate limit incoming SYNs within ingress policy. This was discussed about 2 years ago when the secure_tcp and drop_packet stuff was about to be introduced.

For information on DoS strategies for LVS see DoS page. Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 14 Feb 2001

If I am not misunderstanding something, the variable /proc/sys/net/ipv4/vs/timeout_established gives the time a TCP connection can be idle and after that the entry corresponding to this connection is cleared. My problem is that it seems that sometimes it's not the case. For example I have a system (2.2.16 and ipvs 0.9.15) with /proc/sys/net/ipv4/vs/timeout_established = 480, but the entries are created with a real timeout of 120.

Julian Anastasov ja (at) ssi (dot) bg Read The secure_tcp defense strategy where the timeouts are explained. They are valid for the defense strategies only. For TCP EST state you need to read the ipchains man page. For more explanation of the secure_tcp strategy also see the explanation of the director's hash table.

when I play with ipchains -M -S > [value] 0 0 the variable /proc/sys/net/ipv4/vs/timeout_established is modified even when /proc/sys/net/ipv4/vs/secure_tcp is set to 0, so I'm not using the secure TCP defense. The "real" timeout is of course set to [value] when a new TCP connection appears. So should I understand that timeout_established, timeout_udp,... are always modified by "ipchains -M -S ...", whether or not I use the secure TCP defense, but if secure_tcp is set to 0, other variables give the timeouts to use? If so, are these variables accessible, or how can I check their values?

ipchains -M -S modifies the two TCP and the UDP timeouts in the two secure_tcp modes: off and on. So, ipchains changes the three timeout_XXX vars. When you change the timeout_* vars you change them for secure_tcp=on only. Think of the timeouts as two sets, one for each of the two secure_tcp modes: on and off. ipchains changes the 3 vars in both sets. While secure_tcp is off, changing timeout_* does not affect the connection timeouts. They are used when secure_tcp is on. Joe: ipchains 0 value 0, where value=10, does not change the timeout values or number of entries seen in InActConn or seen with netstat -M, or ipchains -M -L -n. LVS has its own tcpip state table, when in secure_tcp mode. carl.huang

what are the vs_tcp_states[ ] and vs_tcp_states_dos[ ] elements in the ip_vs_conn structure for?

Roberto Nibali ratz (at) tac (dot) ch 16 Apr 2001 The vs_tcp_states[] table is the modified state transition table for the TCP state machine. The vs_tcp_states_dos[] is a yet again modified state table in case we are under attack and secure_tcp is enabled. It is tighter but not conforming to the RFC anymore. Let's take an example how you can read it: The elements 'sXX' mean state XX, so for example, sFW means TCP state FIN_WAIT, sSR means TCP state SYN_RECV and so on. Now the table describes the state transition of the TCP state machine from one TCP state to another one after a state event occurred. For example: Take row 2 starting with sES and ending with sCL. At the first, commentary row, you see the incoming TCP flags (syn,fin,ack,rst) which are important for the state transition. So the rest is easy. Let's say, you're in row 2 and get a fin so you go from sES to sCW, which should be conforming to RFC and Stevens. Short illustration: It was some months ago last year when Wensong, Julian and I discussed a security enhancement for the TCP state transition and after some heavy discussion they implemented it. So the second table vs_tcp_states_dos[] was born. (look in the mailing list in early 2000).
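
The way of reading the table that Roberto describes amounts to a lookup keyed by the current state and the incoming TCP flag. The toy below is not the kernel's vs_tcp_states[] table: it contains only the single transition named in the text (sES plus fin gives sCW), and every other entry is deliberately left out rather than guessed. The names FLAGS, TRANSITIONS and next_state are made up for illustration.

```python
# Incoming TCP flags, in the order given in the table's comment row.
FLAGS = ("syn", "fin", "ack", "rst")

# Toy fragment: only the sES/fin entry is taken from the text above.
# The real table has an entry for every state and flag; those are not reproduced here.
TRANSITIONS = {
    "sES": {"fin": "sCW"},
}


def next_state(state, flag):
    """Return the state reached from `state` on `flag`, or None if not in this fragment."""
    if flag not in FLAGS:
        raise ValueError("unknown TCP flag: %r" % flag)
    return TRANSITIONS.get(state, {}).get(flag)


print(next_state("sES", "fin"))  # sCW
```
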

DoS, from the mailing list

Malicious attacks (SYN floods) LVS has been tested with a 100Mbit/sec syn-flooding attack by Alan Cox and Wensong. Each connection requires 128 bytes. A machine with 128M of free memory could hold 1M concurrent connections. An average connection lasts 300secs. Connections which just receive the syn packet are expired in 30secs (starting with ipvs 0.8). An attacker would have to initiate 3k connections/sec (600Mbps) to maintain the memory at the 128M mark and would require several T3 lines to keep up the attack.
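
The memory figures above follow from simple division. The sketch below uses the 128 bytes per connection and the two lifetimes from the text; the function names are made up, and the "rate needed to keep the table full" is taken as entries divided by lifetime, which is an assumption about how the quoted 3k connections/sec was derived. It does not attempt to reproduce the Mbps figure.

```python
BYTES_PER_ENTRY = 128  # memory per connection entry, from the text


def max_entries(free_bytes):
    """Connection entries that fit in the given amount of free memory."""
    return free_bytes // BYTES_PER_ENTRY


def sustain_rate(entries, lifetime_secs):
    """New connections per second needed to keep `entries` alive (assumed model)."""
    return entries / lifetime_secs


entries = max_entries(128 * 1024 * 1024)
print(entries)                            # 1048576, i.e. about 1M
print(round(sustain_rate(entries, 300)))  # about 3.5k/sec with 300 sec lifetimes
print(round(sustain_rate(entries, 30)))   # about 35k/sec if entries expire in 30 secs
```

On this model the 3k connections/sec in the text corresponds to the 300 sec average lifetime; with the 30 sec SYN expiry an attacker sending only SYNs would need roughly ten times that rate.
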

testing DoS joern maier 22 Nov 2000

I've got a problem protecting my LVS from SYN-flood attacks. Somehow the drop_entry mechanism seems not to work. Doing a SYN-flood with 3 clients to my LVS ( 1 director + 3 RS ) the system gets unreachable. A single realserver under the same attack by those clients stays alive.

Julian You can't SYN flood the director with only 3 clients. You need more clients (or as an alternative, you can download testlvs from the LVS web site). What does ipvsadm -Ln show under attack? How do you activate drop_entry? What does cat drop_entry show?

all realservers have tcp_syncookies enabled (1), tcp_max_syn_backlog=128, and the director has the drop_entry var set to 1 (echo 1 > drop_entry). Before compiling the kernel, I set the table size to 2^20. My director has 256 MB and no other applications running.

You don't need such a large table, really. Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 04 Nov 2004

I'm currently facing a ddos syn-flood attack against my cluster. Fortunately, those guys do not have enough machines to flood all my servers and the service is still up and running. They seem to use spoofed source IPs (as usual) so I can't even know where it comes from. Anyway, It is now 24 hours they are playing like that, and I would like to stop it. Do you have an idea? Don't tell me that I have to use iptables to reduce the syn rate, I can't :). I have a lot of mobile clients, and the wap gateways can send me a lot of valid syns.

Jacob Coby jcoby (at) listingbook (dot) com 04 Nov 2004 You can try turning on tcp_syncookies (see http://www.mail-archive.com/focus-linux@securityfocus.com/msg00185.html) by setting /proc/sys/net/ipv4/tcp_syncookies to 1. I forgot to mention that I've had tcp_syncookies enabled on individual systems for about 3 years now with no problems. I've had it enabled on every machine in an LVS-DR cluster for 6 months with no problems.

With testlvs and two clients, my LVS gets to a denial of service, although cat drop_entry shows me a "1".

run testlvs with 100,000 source addresses.

during the flooding attack the connection values stay around this size. Using the SYN-flood tool I tried before, ipvsadm shows me about ten times as many connections as with your tool. I took a look at the packets; both are quite similar. They differ only in the window size (testlvs has 0, the other tool uses a size of 65534) and sequence numbers (o.k., checksum as well). I am activating drop_entry like this: I switch on my computer (director), start Linux with the LVS kernel and set drop_entry.

Julian Maybe you need to tune amemthresh. 1024 pages (4MB) is too low a value. How much memory does "free" show under attack? You can try with 1/8 RAM size for example. The main goal of these defense strategies is to keep free memory in the director, nothing more. The defense strategies are activated according to the free memory size. The packet rate is not considered. joern maier

That all sounds good to me, but what I'm really wondering about is why the drop_entry variable still has a value of 1. I thought it has to be 2 when my system is under attack? To me it looks like LVS does not even think it's under attack and therefore does not use the drop_entry mechanism.

You are right. You forgot to specify when LVS should think it is under attack. drop_entry switches automatically from 1 to 2 when the free memory reaches amemthresh. Do you know that your free memory is below 4MB? See defense strategies. So, 1,000,000 entries created by the other tool use 128MB memory. You have 256MB :) To reduce the amount of memory the kernel sees, boot with mem=128MB (in lilo) or set amemthresh to 32768 or run testlvs with more source addresses (2,000,000). I'm not sure if the last will help if the other tool you use does not limit the number of spoofed addresses. But don't run testlvs with less than -srcnum 2000000. If the setup allows a rate > 33,333 packets/sec, LVS can create 2,000,000 entries that expire after 60 seconds (the SYN_RECV timeout). Better not to use the -random option in testlvs for this test. So, you can test with such large values but make sure you tune amemthresh in production with the best value for your director. The default value is not very useful. You can test whether 1/8 is a good value (8192 for 4K page size). Sameer Garg sameer (dot) garg (at) gmail (dot) com 15 Apr 2008

We have been experiencing D/DoS on http. The LVS is unaffected by the D/DoS but the realservers are suffering. Besides the D/DoS, the LVS is currently handling 5 subdomains and approximately 10QPS. We are using an LVS-Tun configuration. Due to our distributed setup and service provider limitations we can't put in a perimeter firewall, so we are thinking of stopping them at or before the LVS. At the director I have tuned the route flush and route garbage collection variables but that is all I could figure out. After reading the howto and the mailing list I have concluded that it is possible to use iptables with LVS-DR and LVS-NAT. Is it advisable to put iptables on the director in an LVS-Tun setup? Unrelated question: Anybody using an opensource firewall (iptables/pf) in production for 100M connections?

Michael Schwartzkopff misch (at) multinet (dot) de 15 Apr 2008 Yes. iptables is even necessary if you make LVS decisions based on the mangle table. I haven't seen any 100M setups in production, but it should be possible. Perhaps this helps: http://lists.sans.org/pipermail/unisog/2005-August/025040.html Bgs bgs (at) bgs (dot) hu We use an lvs+netfilter solution with ~Gbit/sec traffic and DDoS attacks above a gigabit. We had DDoS attacks in the 60k-100k bot range. You can handle these with a reasonable level of service, but if you want your users to experience just small hiccups, you need a mitigator on the outermost layer with good feedback from your system into the mitigator blacklist.
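
The amemthresh values quoted in the testing discussion above are counts of memory pages, so they are easy to convert by hand or with a small helper. The sketch below assumes 4K pages, as the text does; the function names are made up for illustration.

```python
PAGE_SIZE = 4096  # bytes; the text assumes a 4K page size


def pages_to_bytes(pages):
    """Memory represented by an amemthresh value given in pages."""
    return pages * PAGE_SIZE


def amemthresh_for_fraction(ram_bytes, denominator=8):
    """amemthresh (in pages) corresponding to 1/denominator of RAM."""
    return ram_bytes // denominator // PAGE_SIZE


MB = 1024 * 1024
print(pages_to_bytes(1024) // MB)         # 4   (the 1024 pages Julian calls too low)
print(pages_to_bytes(32768) // MB)        # 128 (the amemthresh suggested for the test)
print(amemthresh_for_fraction(256 * MB))  # 8192 (1/8 of a 256MB director)
```
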

on the design of the DoS preventer Alan Cox alan (at) lxorguk (dot) ukuu (dot) org (dot) uk The biggest problem with load balancing, when you need to do this sort of trickery (and it's one the existing load balancing patches seem to have), is that if you store per connection state then a synflood will take out your box (if you run out of ram) or run a delightfully efficient DoS attack, if you don't. The moment you factor time into your state tables you are basically well and truly screwed. Lars Marowsky-Bree lmb (at) teuto (dot) net 8 Jun 1999

This can be solved with a hashtable, where you take the source ip as the key and look up the server to direct the request. Since the hash table is fixed size, we can do with fixed resources. Given a proper hash function, this scheme is _ideal_ for basic round-robin and weighted round-robin with fixed weights and we should look at implementing this. Keeping state when it is not necessary _is_ a bug. We are screwed however and can't do this if we want to do least-connections, dynamic load-based balancing, add servers at a later time etc and still deliver sticky connections (ie that connections from client A will stay on server B until a timeout happens or server B dies). Basically, since we _need_ to keep state on a per-client basis for this we can be screwed easily by bombarding us with a randomized source IP. Now - for all but the most simple load balancing, we NEED to keep state. So, we need to weasel our way out of this mess somehow. One approach would be to integrate SYN cookies into the load-balancer itself and only pass on the connection if the TCP handshake succeeded. Now, there are a few obvious problems with this: It is a very complex task. And, it still screws us in the case of a UDP flood. "The easy way out" for TCP connections is to do this stuff in user space - a load-balancing proxy, which connects to the backend servers. Problems with this are that it isn't transparent to the backend servers anymore (all connections come from the IP of the loadbalancer), it does not scale as well (no direct routing approach etc possible), and we still did not solve UDP. I propose the following: We continue to maintain state like we always did. But when we hit, let's say, 32,000 simultaneous connections, we go into "overload" mode - that is, all new connections are mapped using the hash table like Alan proposed, but we still search the stateful database first.
There are a few problems with this too: It is not as fast as the pure hash table, since we need to look into the stateful database before consulting the hashtable. If weights change during overload mode, sticky connections can't be easily guaranteed (I thus propose to NOT change weights during overload mode, or at least ignore the changes with regard to the hashing). However, these are disadvantages which only happen under attack. At the moment, we would simply crash, so it IS an improvement. It is a fully transparent approach and works with UDP too. The effort to implement this is acceptable. (if it were userspace I would give it a try sometime;) And if we implement this scheme for fixed loadbalancing, which someone else definitely should, reusing the code here might not be that much of a problem.
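
Lars's proposal can be sketched in a few lines: look in the stateful table first, and once the table holds the agreed number of entries, map new clients by hashing the source IP without storing anything. This is a toy model in userspace, not LVS code. The class and method names are made up, round-robin stands in for whatever scheduler is used in normal mode, and CRC32 stands in for "a proper hash function".

```python
import zlib


class OverloadBalancer:
    """Stateful scheduling with a stateless hash fallback once the table is full."""

    def __init__(self, servers, max_entries=32000):
        self.servers = list(servers)
        self.max_entries = max_entries  # threshold for "overload" mode
        self.table = {}                 # client IP -> server (the stateful database)
        self._next = 0                  # round-robin position for normal mode

    def _hashed(self, client_ip):
        # Fixed-resource mapping: same client IP always gives the same server.
        index = zlib.crc32(client_ip.encode("ascii")) % len(self.servers)
        return self.servers[index]

    def pick(self, client_ip):
        server = self.table.get(client_ip)  # always search the stateful table first
        if server is not None:
            return server
        if len(self.table) >= self.max_entries:
            return self._hashed(client_ip)  # overload mode: keep no new state
        server = self.servers[self._next % len(self.servers)]
        self._next += 1
        self.table[client_ip] = server
        return server


lb = OverloadBalancer(["RS1", "RS2"], max_entries=2)
print(lb.pick("10.0.0.1"), lb.pick("10.0.0.2"), len(lb.table))  # RS1 RS2 2
print(lb.pick("10.0.0.3") == lb.pick("10.0.0.3"), len(lb.table))  # True 2
```

Clients already in the table stay sticky during overload, and memory use stops growing, which are the two properties the proposal is after. As Lars notes, changing the server list or weights while in overload mode would change the hash mapping.
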

Timeout in MASQ tables Michael McConnell michaelm (at) eyeball (dot) com 08 Oct 2001 the command returns a list of masqueraded connections, i.e. 4052 TCP 01:38.08 10.1.1.41 21.1.112.43 80 (80) -> 4053 TCP 00:25.09 10.1.1.11 20.170.180.17 80 (80) -> 4430 If ipchains (kernel 2.2) has been set with a 10hr TCP timeout, these connections remain (and populate the ipvsadm table) for 10 hours. Does anyone have any suggestions as to how to purge this table manually? If I run out of ports, I get a DoS (2 hr timeout, 30,000 TCP connections...DoS)

Peter Mueller If you alter /proc/net/ip_masquerade, it will break the established connection. Isn't that what you want to do?

No matter what I do I can not seem to reset, clear or modify this manually.

if you do not like the prospect of altering directly perhaps try a shell script:

Setting this value only affects *NEW* connections; connections already set are unaffected. Julian Anastasov ja (at) ssi (dot) bg Without timeout values specific to each LVS virtual service and another for the masqueraded connections, it is difficult to play such games. It seems only one timeout needs to be separated, the TCP EST timeout. The reason that such support is not in 2.2 is because nobody wants to touch the user structures. IMO, it can be added for 2.4 if there are enough free options in ipvsadm but it also depends on some implementation details. If you worry about the free memory you can use some of the LVS DoS defense strategies, e.g. drop_entry.

BIG IP SYN Check and Dynamic Reaping unknown

Is there anything like the BIG-IP syn check (http://www.f5.com/solutions/tech/security/) to prevent DoS?

Ratz 12 Aug 2004 For the RS or the director or both? I think you are referring to those two marketing features: SYN Check: One type of DoS attack is known as a SYN flood in which an attack is made for the purpose of exhausting a system's resources leaving it unable to establish legitimate connections. The BIG-IP system's SYN Check feature works to alleviate SYN flooding by sending cookies to the requesting client on the server's behalf and by not recording state information for connections that have not completed the initial TCP handshake. This unique feature ensures that servers only process legitimate connections and the BIG-IP SYN queue is not exhausted, and normal TCP communication can continue. The SYN Check feature complements the BIG-IP Dynamic Reaping feature, in that while the Dynamic Reaping handles established connection flooding, SYN Check addresses embryonic connection flooding to prevent the SYN queue from becoming exhausted. Dynamic Reaping - The BIG-IP software contains two global settings that provide the ability to reap connections adaptively. Used to prevent denial-of-service (DoS) attacks, enterprises can specify a low-watermark threshold and a high-watermark threshold. The low-watermark threshold determines at what point adaptive reaping becomes more aggressive. The high-watermark threshold determines when non-established connections through the BIG-IP product will no longer be allowed. The value of this variable represents a percentage of memory utilization. Once memory utilization has reached this mark, connections are disallowed until the available memory has been reduced to the low-watermark threshold. SYN Check can be enabled on the RS for all major Unices. For the rest we have to accept the fact that LVS is a software load balancer and does not have the possibilities a hardware load balancer has with regard to doing SYN cookies for other servers. Also limiting the backlog queue on a per socket basis definitely helps.
Dynamic Reaping could very well be brought into conjunction with our TCP DoS defense mechanism. Read about it at: LVS DoS defense (http://www.linux-vs.org/docs/defense.html). I have tested F5 BigIP load balancers and I was able to flood the RS just as well as using LVS. SYN flooding cannot be prevented, it can only be rate limited. It's a flaw in the TCP protocol which we'll have to live with. There are a couple of defense mechanisms but none of them can really distinguish between malicious TCP/SYN and friendly TCP/SYN.
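
Rate limiting of SYNs, as mentioned above, is commonly modelled as a token bucket: a steady refill rate plus a burst allowance. The sketch below is only an illustration of that idea in userspace. It is not LVS or netfilter code, the class name and parameters are made up, and, as Ratz points out, such a limiter drops friendly and malicious SYNs alike once the bucket is empty.

```python
class TokenBucket:
    """Allow on average `rate` events per second, with bursts of up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = float(rate)
        self.burst = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.last = 0.0             # time of the previous call, in seconds

    def allow(self, now):
        """Return True if an event arriving at time `now` may pass."""
        elapsed = max(0.0, now - self.last)
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False


bucket = TokenBucket(rate=10, burst=5)
passed = sum(bucket.allow(0.0) for _ in range(8))
print(passed)             # 5: the burst passes, the remaining 3 SYNs are dropped
print(bucket.allow(1.0))  # True: the bucket has refilled a second later
```
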

Testing DoS Strategies with testlvs: Creating large numbers of InActConn

testlvs testlvs (by Julian ja (at) ssi (dot) bg) is available on Julian's software page. It sends a stream of SYN packets (SYN flood) from a range of addresses (default starting at 10.0.0.1) simulating connect requests from many clients. Running testlvs from a client will occupy most of the resources of your director and the director's screen/mouse/keyboard will/may lock up for the period of the test. To run testlvs, I export the testlvs directory (from my director) to the realservers and the client and run everything off this exported directory.

Fabrice fabrice (at) urbanet (dot) ch 11 Dec 2001 I can reach 60K SYN/s with a mean of about 54K using a PIII 500MHz client. The load on the LVS-NAT director was high (always 100% system usage, and a switch between the ttys takes about 3-5 seconds). That poor box couldn't handle the load and wasn't able to send back packets (maybe only 10 per second). This means that the DoS was successful, but it only works during the flood; it won't break any services (thanks to Syn_Cookies).

Julian Anastasov ja (at) ssi (dot) bg If you want to measure a maximal possible rate use -srcnum 10 or another small number to avoid beating the routing cache in the director. If you need to test the defense strategies you need large value in -srcnum. The default is too small for this, it avoids errors.

I think the only way to prevent the DoS in this case is to upgrade the LVS box hardware :)

Not always. LVS does not protect the realservers. The result can be an output pipe saturated with replies to the DoS attack. You should try some ingress rate limiting, independent of LVS. Of course, your hardware should not be blocked by such attacks; you need a faster MB+CPU if you care about such problems.

I looked with vmstat 1 and 10, as Julian recommended. Shouldn't the number of interrupts with "vmstat 10" be 10 times more than with "vmstat 1"?

No, they should be equal, up to 5% are good, they show that the process scheduling really works. If you are under attack and you can't handle it then the snapshots from vmstat 1 are delayed and the results differ too much from the results provided for longer time interval.

I got with vmstat 1: interrupts = ca. 400'000, cpu sys = 100 and with vmstat 10: interrupts = ca. 60'000, cpu sys = 100

Your director reached its limits. You should try to flood it with slower client(s). When you see that the input packet rate is equal to the rate of successfully forwarded packets (received on the realserver), stop slowing down the attack. You have reached the maximal packet rate it is possible to deliver to the realservers. On NAT you should consider the replies; they are not measured with testlvs tests. They will need maybe the same CPU power.
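
Julian's rule of thumb (the vmstat 1 and vmstat 10 figures should agree to within about 5%) can be written as a one-line check. This is a sketch: the function name is made up, and taking the longer interval as the baseline for the relative difference is an assumption.

```python
def rates_agree(short_rate, long_rate, tolerance=0.05):
    """True if two per-second rates differ by no more than `tolerance` of the long-interval rate."""
    if long_rate <= 0:
        raise ValueError("long_rate must be positive")
    return abs(short_rate - long_rate) / long_rate <= tolerance


# Fabrice's interrupt figures from vmstat 1 and vmstat 10 under attack.
print(rates_agree(400000, 60000))  # False: the director cannot keep up
```
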

configure realserver The realserver is configured to reject packets with src_address 10.0.0.0/8. Here's my modified version of Julian's show_traffic.sh which is run on the realserver to measure throughput. Start this on the realserver before running testlvs on the client. For your interest you can look on the realserver terminal to see what's happening during a test.

configure director I used LVS-NAT on a 2.4.2 director, with netpipe (port 5002) as the service on two realservers. You won't be using netpipe for this test, i.e. you won't need a netpipe server on the realserver. You just need a port that you can set up an LVS on, and netpipe is in my /etc/services, so the port shows up as a name rather than a number. Here's my director
RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP lvs2.mack.net:netpipe rr
-> RS2.mack.net:netpipe Masq 1 0 0
-> RS1.mack.net:netpipe Masq 1 0 0

run testlvs from client run testlvs (I used v0.1) on the client. Here testlvs is sending 256 packets from 254 addresses (the default) in the 10.0.0.0 network. (My setup handles 10,000 packets/sec. 256 packets appears to be instantaneous.) When the run has finished, go to the director
RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP lvs2.mack.net:netpipe rr
-> RS2.mack.net:netpipe Masq 1 0 127
-> RS1.mack.net:netpipe Masq 1 0 127
(If you are running a 2.2.x director, you can get more information from ipchains -M -L -n, or netstat -M. For 2.4.x use cat /proc/net/ip_conntrack.) This output shows 254 connections that have closed and are waiting to time out. A minute or so later, the InActConn will have cleared (on my machine, it's 50secs). If you send the same number of packets (256) from 1000 different addresses (or 1000 packets to 256 addresses), you'll get the same result in the output of ipvsadm (not shown). In all cases, you've made 254 connections. If you send 1000 packets from 1000 addresses, you'd expect 1000 connections. Here's the total number of InActConn as a function of the number of packets (connection attempts). Results are for 3 consecutive runs, allowing the connections to time out in between. The numbers are not particularly consistent between runs (aren't computers deterministic?). Sometimes the blinking lights on the switch stopped during a test, possibly a result of the tcp race condition (see the performance page). You don't get 1000 InActConn with 1000 packets (connection attempts). We don't know why this is.

Julian I'm not sure what's going on. In my tests there are dropped packets too. They are dropped before reaching the director, maybe from the input device queue or from the routing cache. We have to check it.

InActConn with drop_entry defense strategy

Repeating the control experiment above, but using the drop_entry strategy (see the DoS strategies for more information).

director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_entry

The drop_entry strategy drops 1/32 of the entries every second, so the number of InActConn decreases linearly during the timeout period, rather than dropping suddenly at the end of the timeout period.

InActConn with drop_packet defense strategy

Repeating the control experiment above, but using the drop_packet strategy (see the DoS strategies for more information).

director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_packet

The drop_packet=3 strategy will drop 1/10 of the packets before sending them to the realserver. The connections will all time out at the same time (as for the control experiment, about 1min), unlike for the drop_entry strategy. With the variability of the InActConn number, it is hard to see the drop_packet defense working here.

InActConn with secure_tcp defense strategy

Repeating the control experiment above, but using the secure_tcp strategy (see the DoS strategies for more information). The strategy is turned on through /proc/sys/net/ipv4/vs/secure_tcp. The SYN_RECV timeout below is the suggested value for LVS-NAT.

director:/etc/lvs# echo "10" >/proc/sys/net/ipv4/vs/timeout_synrecv

This strategy drops the InActConn from the ipvsadm table after 10secs.

maximum number of InActConn

If you want to get the maximum number of InActConn, you need to run the test for longer than the FIN timeout period (here 50secs). 2M packets is enough here. You also want as many different source addresses as possible. Since testlvs is connecting from the 10.0.0.0/8 network, you could have 254^3=16M connections. Since only 2M packets can be passed before connections start to time out, and the director connection table then reaches a steady state with new connections arriving and old connections timing out, there is no point in sending packets from more than 2M source addresses.

Note: you can view the contents of the connection table with

2.2
    netstat -Mn
    cat /proc/net/ip_masquerade
2.4
    cat /proc/net/ip_vs_conn

Here's the InActConn with various defense strategies. The InActConn is the maximum number reachable; the srcnum and packets are the numbers needed to saturate the director. The time of the test must exceed the timeouts. InActConn was determined by running a command like this and then adding the (two) entries in the InActConn column from the output of ipvsadm.
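As a complement to adding up the InActConn column by hand, the connection table itself can be summarised. The following is a minimal sketch, not part of the original test procedure: it assumes the 2.4-style layout of /proc/net/ip_vs_conn (one header line, then rows whose 8th whitespace-separated field is the connection state). The function name count_conn_states is made up for illustration.

```shell
#!/bin/sh
# Count entries in an IPVS connection table dump, grouped by state.
# Assumption: one header line, then rows whose 8th field is the state,
# as in /proc/net/ip_vs_conn on a 2.4 or later director.
count_conn_states() {
    # $1: file to read (defaults to the live table on a director)
    awk 'NR > 1 { n[$8]++ }
         END { for (s in n) print s, n[s] }' "${1:-/proc/net/ip_vs_conn}" | sort
}
```

On a director, running count_conn_states with no argument reads the live table; giving it a saved copy of the table lets you compare runs after the fact.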

Is the number of InActConn a problem?

edited from Julian

The memory used is 128 bytes/connection and 60k connections will tie up 7M of memory. LVS does not use system sockets. LVS has its own connection table. The limit is the amount of memory you have - virtually unlimited. The masq table (by default 40960 connections per protocol) is a separate table and is used only for LVS/NAT FTP or for normal MASQ connections.
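Julian's arithmetic can be checked with a couple of lines of shell. This is only a sketch: the 128 bytes per entry is the figure quoted above and may differ for other kernel versions, and the function names are made up for illustration.

```shell
#!/bin/sh
# Memory held by the IPVS connection table, using the quoted figure of
# 128 bytes per connection entry (an assumption for other kernels).
conn_table_bytes() {
    # $1: number of connection entries
    echo $(( $1 * 128 ))
}

conn_table_kib() {
    # same figure, expressed in KiB
    echo $(( $1 * 128 / 1024 ))
}
```

For 60000 entries this gives 7680000 bytes (7500 KiB), which is the "7M" quoted.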

However the director was quite busy during the testlvs test. Attempts to connect to other LVS'ed services (not shown in the above ipvsadm table) failed. Netpipe tests run at the same time from the client's IP (in the 192.168.1.0/24 network) stopped, but resumed at the expected rate after the testlvs run completed (i.e. before the InActConn count dropped to 0).

port starved machines

Matthijs van der Klip matthijs (dot) van (dot) der (dot) klip (at) nos (dot) nl 10 Nov 2001 used a fast (Origin 200) single client to generate between 3000 and 3500 hits/connections per second to his LVS'ed web cluster. No matter how many/few realservers were in the cluster, he could only get 65k connections.

Julian

You are missing one reason for this problem: the fact that your client(s) create connections from a limited number of addresses and ports. Try to work out from how many different client saddr/sport pairs you hit the LVS cluster. IMO, you reach this limit. I'm not sure how many test client hosts you are using. If there is only one client host, then there is a limit of 65536 TCP ports per src IP addr. Each connection has an expiration time according to its proto state. When the rate is high enough not to allow the old entries to expire, you reach a situation where the connections are reused, i.e. the connection number shown by ipvsadm -L does not increase.
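Julian's argument can be put into numbers. The sketch below compares the number of entries a test would create if every connection came from a fresh address:port pair (rate x timeout) with the number of pairs the test clients can actually supply (65536 per source address). The function names and the 60 second timeout used in the example are assumptions for illustration; Matthijs's actual timeouts are not given here.

```shell
#!/bin/sh
# Entries the director would hold at steady state if every new connection
# used a fresh source address:port pair: rate x timeout.
entries_wanted() {
    # $1: new connections per second, $2: entry timeout in seconds
    echo $(( $1 * $2 ))
}

# Upper bound on distinct entries towards one VIP:port from the test
# clients: 65536 source ports per source address.
entries_possible() {
    # $1: number of client source addresses
    echo $(( $1 * 65536 ))
}

# Exit status 0 if the clients run out of source ports before entries expire.
is_port_starved() {
    # $1: rate, $2: timeout, $3: number of client addresses
    [ "$(entries_wanted "$1" "$2")" -gt "$(entries_possible "$3")" ]
}
```

With 3000 connections/sec and a hypothetical 60 second timeout, entries_wanted gives 180000, more than one client address can provide, so is_port_starved 3000 60 1 succeeds; with three client addresses it does not.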

Debugging LVS

new way

Set /proc/sys/net/ipv4/vs/debug_level to x, where 0<x<9.

old way (may still work - haven't tested it) Is there any way to debug/watch the path between the director and the realserver?

Wensong

Below the entry CONFIG_IP_MASQUERADE_VS_WLC=m in /usr/src/linux/.config, add the line CONFIG_IP_VS_DEBUG=y. This switch affects ip_vs.h and ip_vs.c. Do a make clean in /usr/src/linux/net/ipv4 and rebuild the kernel and modules.

(Other switches you will find in the code are IP_VS_ERR, IP_VS_DBG and IP_VS_INFO.)

Look in syslog/messages for the output. The actual location of the output is determined by /etc/syslog.conf. For instance, an entry there can send kernel messages to /usr/adm/kern (re-HUP syslogd if you change /etc/syslog.conf). Here's the output when LVS is first set up with ipvsadm (Note: CONFIG_IP_VS_DEBUG is not a debug level output, so you don't need to add a debug entry to your syslog.conf file).

Finally check whether packets are forwarded successfully through direct routing. (You can also use tcpdump to watch packets between machines.)

Ratz ratz (at) tac (dot) ch

Recent lvs versions have extensive debugging, which can be enabled either to get more information about exactly what's going on, or to help you understand the process of packet handling within the director's kernel. Be sure to have compiled in debug support for LVS (CONFIG_IP_VS_DEBUG=yes in .config). You enable debugging by setting /proc/sys/net/ipv4/vs/debug_level to DEBUG_LEVEL, where DEBUG_LEVEL is between 0 and 10. Then do a tail -f /var/log/kernlog and watch the output flying by while connecting to the VIP from a CIP. If you want to disable debug messages in kernlog, set /proc/sys/net/ipv4/vs/debug_level back to 0.

If you run tcpdump on the director and see a lot of packets with the same ISN and only SYN and then RST, then either you haven't handled the , or (most likely) you're trying to connect directly to the VIP from within the cluster itself.
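To look for the SYN/RST pattern described above without wading through all the other traffic, tcpdump can be given a filter on the TCP flags. This is a sketch only: the helper name syn_rst_filter, the interface eth0 and the address 192.168.1.110 are placeholders, not values from Ratz's posting.

```shell
#!/bin/sh
# Build a tcpdump filter that matches only TCP segments to or from the VIP
# which have the SYN or RST flag set.
syn_rst_filter() {
    # $1: the VIP
    echo "host $1 and tcp[tcpflags] & (tcp-syn|tcp-rst) != 0"
}

# On the director (needs root). -n: no name lookups; -S: print absolute
# sequence numbers, so repeated ISNs are easy to spot.
#   tcpdump -n -S -i eth0 "$(syn_rst_filter 192.168.1.110)"
```

The filter is built by a function so that the same expression can be reused for different VIPs.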

realserver content: filesystem or database? (the many reader, single writer problem)

The client can be assigned to any realserver. One of the assumptions of LVS is that all realservers have the same content. This assumption is easy to fulfill for services like http, where the administrator updates the files on all realservers when needed. For services like mail or databases, the client writes to storage on one realserver. The other realservers do not see the updates unless something intervenes. Various tricks are described elsewhere here for mailservers and databases. These require the realservers to write to common storage (for mail the mailspool is nfs mounted; for databases, the LVS client connects to a database client on each realserver and these database clients write to a single databased on a backend machine, or the databased's on each realserver are capable of replicating).

One solution is to have a file system which can propagate changes to other realservers. We have mentioned gfs and coda in several places in this HOWTO as holding out hope for the future. People now have these working.

Wensong Zhang wensong (at) gnuchina (dot) org 05 May 2001

It seems to me that Coda is becoming quite stable. I have run coda-5.3.13 with the root volume replicated on two coda file servers for nearly two months, and I haven't yet met a problem that needed manual maintenance. BTW, I just use it for testing purposes; it is not running on a production site.

Mark Hlawatschek hlawatschek (at) atix (dot) de 2001-05-04

we've had good experiences with the use of GFS. We've used LVS with the GFS for about one year in older versions and it worked quite stably. We successfully demonstrated the solution with a newer version of GFS (4.0) at the CEBit 2001. Several domains (i.e. http://www.atix.org) will be served by the new configuration next week.

Mark's slides from his talk in German at DECUS in Berlin (2001) are available.

K Kopper karl_kopper (at) yahoo (dot) com 6 Jun 2006

To share files on the real servers and ensure that all real servers see the same changes at the same time, a good NAS box or even a Linux NFS server built on top of a SAN (using Heartbeat to fail over the NFS server service and the IP address the real servers use to access it) works great. If you run "legacy" applications that perform POSIX-compliant locking you can use the instructions at http://linux-ha.org/HaNFS to build your own HA NFS solution with two NFS server boxes and a SAN (only one NFS server can mount the SAN disks at a time, but at failover time the backup server simply mounts the SAN disks and fails over the locking statd information). Of course purchasing a good HA NAS device has other benefits like non-volatile memory cache commits for faster write speed.

If you are building an application from scratch then your best bet is probably to store data using a database and not the file system. The database can be made highly available behind the real servers on a Heartbeat pair (again with SAN disks wired up to both machines in the HA pair, but only one server mounting the SAN disks where the database resides at a time). Heartbeat comes with a Filesystem script that helps with this failover job. If your applications store state/session information in SQL and can query back into the database at each request (a cookie, login id, etc.) then you will have a cluster that can tolerate the failure of a real server without losing session information--hopefully just a reload click on the web browser for all but the worst cases (like "in flight" transactions).

With either of these solutions your applications do not have to be made cluster-aware. If you are developing something from scratch you could try something like Zope Enterprise Objects (ZEO) for Python, or in Java (JBOSS) there is JGroups to multicast information to all Java containers/threads, but then you'll have to re-solve the locking problem (something NFS and SQL have a long track record of doing safely). But you were just asking about file systems and I got off topic . . .

Brad Dameron brad (at) seatab (dot) com 7 Jun 2006

I am using RedHat GFS with my SAN to do file sharing. Works great.

It's not clear what the following poster is doing. He may just be smb exporting a filesystem to the realserver. Here's more info on LVS'ing .

Kai Suchomel1 KAISUCH (at) de (dot) ibm (dot) com 12 Jun 2006

The Samba Service is responsible for sharing a SAN Filesystem, here GPFS. This filesystem is shared among all the Samba Services on the RS, so that when the client connects to the VIP and the SAN Filesystem, it is transparent to the client on which RS the connection is established. When the RS fails, after doing a reconnect, the client can access the SAN Filesystem over another RS. I am trying to implement HA for Samba Filesharing.

Joe

What happens to the state stored on the domain server (or whatever the LVS appears to be to the windows clients), when the RS goes down? Are you copying files between the LVS and the windows clients? What are your windows clients using the LVS for? So you have a single SAN exporting files to multiple realservers? Why do you do this? Is this faster than having the SAN export the files directly? Or are the realservers doing something else as well? (or doesn't the SAN export files to windows machines?)

Kai

The Realservers are responsible for the Filesystem; here GPFS is used. GPFS is IBM's Cluster Filesystem.

Joe: I guess we'll hear more later.

Development: Supporting IPSec on LVS

See Julian's notes on developing code for IPSec over LVS. Farid Sawari has it working with 2.4 and 2.6 LVS-NAT.

LVS: ICMP

ICMP is an IP protocol for sending error messages between 2 hosts at the end of a segment. ICMP messages are not propagated beyond the two hosts involved in the error condition.

Sometime in 2000, code was added for LVS to handle ICMP. For LVS'ed services, the director handles ICMP redirects and MTU discovery, delivering them to the correct realserver. ICMP packets for non-LVS'ed services are delivered locally.

Setups where packets are not defragmented properly are difficult to diagnose: only large packets are affected and the setup will work for much of the time, but clients will see their connection hanging. The realservers can have large numbers of connections hung in FIN_WAIT. We see this when packets are enlarged by encapsulation and then decapsulated before arriving at their destination, midway in their passage through the network, and the en/de-capsulating mechanism doesn't send icmp need_defrag packets.

LVS-Tun: ipip encapsulation reduces the packet payload, sometimes requiring fragmentation. This is not handled properly by Linux. See .

Weird Hardware: see .
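The payload reduction from ipip encapsulation can be worked out from the header sizes. The sketch below assumes IPv4 and TCP headers without options (20 bytes each) and an 8 byte ICMP echo header; the function names are made up for illustration.

```shell
#!/bin/sh
# Header sizes in bytes, assuming IPv4 without options and TCP without options.
IP_HDR=20
TCP_HDR=20
ICMP_HDR=8

# Largest inner packet that still fits in one ipip-encapsulated frame:
# the tunnel adds one outer IP header.
ipip_inner_mtu() {
    # $1: MTU of the link carrying the tunnel
    echo $(( $1 - IP_HDR ))
}

# Largest TCP payload (MSS) that fits in a packet of the given MTU.
tcp_mss() {
    echo $(( $1 - IP_HDR - TCP_HDR ))
}

# ICMP echo payload size that exactly fills a packet of the given MTU.
ping_payload() {
    echo $(( $1 - IP_HDR - ICMP_HDR ))
}
```

On a 1500 byte ethernet the inner packet can be at most 1480 bytes, leaving a TCP payload of 1440 bytes rather than 1460. With the Linux iputils ping you can probe a path with fragmentation prohibited, e.g. ping -M do -s "$(ping_payload 1500)" followed by the host to test; if the path MTU is smaller and ICMP is not being blocked, ping reports that the message is too long.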

MTU discovery and ICMP handling joern maier 13 Dec 2000

What happens if an ICMP "host unreachable" message is sent to the director because a client went down? Are the entries removed from the connection table?

Julian Anastasov ja (at) ssi (dot) bg Wed, 13 Dec 2000 No

Are the messages forwarded to the Realservers ?

Julian 13 Dec 2000

Yes, the embedded TCP or UDP datagram is inspected and this information is used to forward the ICMP message to the right realserver. All other messages that are not related to existing connections are accepted locally.

Eric Mehlhaff mehlhaff (at) cryptic (dot) com passed on more info

Theoretically, path-mtu-discovery happens on every new tcp connection. In most cases the default path MTU is fine. It's the weird cases (ethernet LAN connections with low MTU WAN connections) that point out broken path-MTU discovery. E.g. for a while I had my home LAN (MTU 1500) hooked up via a modem connection that I had set the MTU to 500 for. The minimum MTU in this case was the 500 for my home, but there were many broken web sites I could not see because they had blocked out the ICMP-must-fragment packets on their servers. One can also see the effects of broken path mtu discovery on FDDI local networks. Anyway, here's some good web pages about it:

What happens if a realserver is connected to a client which is no longer reachable? ICMP replies go back to the VIP and will not necessarily be forwarded to the correct realserver.

Jivko Velev jiko (at) tremor (dot) net

Assume that we have TCP connections...and a realserver is trying to respond to the client, but it cannot reach it (the client is down, the route doesn't exist anymore, the intermediate gateway is congested). In these cases your VIP will receive ICMP packets: dest unreachable, source quench and friends. If you don't route these packets to the correct realserver you will affect the performance of the LVS. For example the realserver will continue to resend packets to the client because they are not acknowledged, and gateways will continue to send ICMP packets back to the VIP for every packet they drop. The TCP stack will drop these kinds of connections after its timeouts expire, but if the director forwards the ICMP packets to the appropriate realserver, this will happen a little earlier, and will avoid overloading the director with ICMP stuff.

When you receive an ICMP packet it contains the full IP header of the packet that caused this ICMP to be generated, plus at least the first 64 bits (8 bytes) of its data, so you can assume that you have the TCP/UDP port numbers too. So it is possible to implement "Persistence rules" for ICMP packets.

Summary: This problem was handled in kernel 2.2.12 and earlier by having the configure script turn off icmp redirects in the kernel (through the proc interface). For 2.2.13 the ipvs patch handles this. The configure script knows which kernel you are using on the director and does the Right Thing (TM).

Joe: from a posting I picked off Dejanews by Barry Margolin

the criteria for sending a redirect are:

    The packet is being forwarded out the same physical interface that it was received from,
    The IP source address in the packet is on the same Logical IP (sub)network as the next-hop IP address,
    The packet does not contain an IP source route option.

Routers ignore redirects and shouldn't even be receiving them in the first place, because redirects should only be sent if the source address and the preferred router address are in the same subnet. If the traffic is going through an intermediary router, that shouldn't be the case. The only time a router should get redirects is if it's originating the connections (e.g. you do a "telnet" from the router's exec), but not when it's doing normal traffic forwarding.

unknown Well, remember that ICMP redirects are just bandages to cover routing problems. No one really should be routing that way. ICMP redirects are easily spoofed, so many systems ignore them. Otherwise they risk having their connectivity being disconnected on whim. Also, many systems no longer send ICMP redirects because some people actually want to pass traffic through an intervening system! I don't know how FreeBSD ships these days, but I suggest that it should ship with ignore ICMP redirects as the default.

LVS code only needs to handle icmp redirects for LVS-NAT and not for LVS-DR and LVS-Tun

Julian: 12 Jan 2001

Only for LVS-NAT do the packets from the realservers hit the forward chain, i.e. the outgoing packets. LVS-DR and LVS-Tun receive packets only to LOCAL_IN, i.e. the FORWARD chain, where the redirect is sent, is skipped. The incoming packets for LVS/NAT use ip_route_input() for the forwarding, so they can hit the FORWARD chain too and generate ICMP redirects after the packet is translated. So, the problem always exists for LVS/NAT, for packets in both directions, because after the packets are translated we always use ip_forward to send the packets to both ends.

I'm not sure, but maybe the old LVS versions used ip_route_input() to forward the DR traffic to the realservers. But this was not true for the TUN method. This call to ip_route_input() can generate ICMP redirects and maybe you are right that for the old LVS versions this is a problem for DR. Looking in the Changelog it seems this change occurred in LVS version 0.9.4, near Linux 2.2.13. So, in the HOWTO there is something that is true: there is no ICMP redirect problem for LVS/DR starting from Linux 2.2.13 :) But the problem remains for LVS/NAT even in the latest kernel. This change in LVS was not made to solve the ICMP redirect problem. Yes, the problem is solved for DR, but the goal was to speed up the forwarding for the DR method by skipping the forward chain. When the forward chain is skipped the ICMP redirect is not sent.

ICMP redirects and LVS: (Joe and Wensong)

The test setups shown in this HOWTO for LVS-DR and LVS-Tun have the client, director and realservers on the same network. In production the client will connect via a router from a remote network (and for LVS-Tun the realservers could be remote and all on separate networks).
The client forwards the packet for VIP to the director; the director receives the packet on eth0 (eth0:1 is an alias of eth0), then forwards the packet to the realserver through eth0. The director will think that the packet came and left through the same interface without any change, so an icmp redirect is sent to the client to tell it to send the packets for VIP directly to the RIP.

However, when all machines are on the same network, the client is not a router and is directly connected to the director; it ignores the icmp redirect message and the LVS works properly.

If there is a router between the client and the director, and it listens to icmp redirects, the director will send an icmp redirect to the router to make it send the packets for VIP to the realserver directly. The router will handle this icmp redirect message and change its routing table, and then LVS/DR won't work. The symptom is that once the load balancer sends an ICMP redirect to the router, the router changes its route for the VIP to point to the realserver, and then the whole LVS stops working. Since you did your test with the LVS client in the same network as the load balancer and the servers, it doesn't need to pass through a router to reach the LVS, and you won't see this symptom. :)

Only when LVS/DR is used, and there is only one interface both to receive packets for VIP and to connect to the realservers, is there a need to suppress the ICMP redirects of the interface.

Joe

ICMP redirects are turned on in 2.2 kernels by default. The configure script turns off icmp redirects on the director by writing 0 to /proc/sys/net/ipv4/conf/eth0/send_redirects.
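Done by hand, turning off redirects could look like the following sketch. It is not the configure script's code. It assumes a current kernel, where send_redirects for an interface stays enabled if either the conf/all value or the per-interface value is set, so both are written. The function name and the optional second argument (an alternative conf directory, useful for trying the function out away from /proc) are made up for illustration.

```shell
#!/bin/sh
# Turn off ICMP redirect generation for one interface.
# The kernel ORs conf/all with the per-interface value for send_redirects,
# so both have to be 0 before redirects stop being sent.
disable_send_redirects() {
    # $1: interface name, $2: conf directory (defaults to the live one)
    conf_dir="${2:-/proc/sys/net/ipv4/conf}"
    for scope in all "$1"; do
        echo 0 > "$conf_dir/$scope/send_redirects" || return 1
    done
}

# On the director (needs root):
#   disable_send_redirects eth0
```

The same settings can be made persistent with the sysctl names net.ipv4.conf.all.send_redirects and net.ipv4.conf.eth0.send_redirects.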

(Wensong) In the reverse direction, replies coming back from the realserver to the client

tunlhost1=======tunlhost2 --> director ------->|

After the first response packet arrives from the realserver at tunlhost2, tunlhost2 will try to send the packet through the tunnel. If the packet is too big, then tunlhost2 will send an ICMP packet to the VIP to fragment the packet. In the previous versions of ipvs, the director wouldn't forward the ICMP packet to (any) realserver. With 2.2.13, code has been added to handle the icmp redirects and make the director forward icmp packets to the corresponding servers.

If the client has two connections to the LVS (say telnet and http) each to 2 different realservers and the client goes down, the director gets 2 ICMP_DEST_UNREACH packets. The director knows from the CIP:port which realserver to send the icmp packet to?

Wensong Zhang 21 Jan 2000

The director has handled ICMP packets for virtual services for a long time; please check the ChangeLog of the code.

ChangeLog for 0.9.3-2.2.13 The incoming ICMP packets for virtual services will be forwarded to the right realservers, and outgoing ICMP packets from virtual services will be altered and send out correctly. This is important for error and control notification between clients and servers, such as the MTU discovery.

Joe

If a realserver goes down after the connection is established, will the client get a dest_unreachable from the director?

No. Here is a design issue. If the director sends an ICMP_DEST_UNREACH immediately, all transferred data for the established connection will be lost, and the client needs to establish a new connection. Instead, we would rather wait for the connection to time out; if the realserver recovers from being temporarily down (such as an overloaded state) before the connection expires, then the connection can continue. If the realserver doesn't recover before the connection expires, then an ICMP_DEST_UNREACH is sent to the client.

If the client goes down after the connection is established, where do the dest_unreachable icmp packets generated by the last router go?

If the client is unreachable, some router will generate an ICMP_DEST_UNREACH packet and send it to the VIP; the director will then forward the ICMP packet to the realserver.

Since icmp packets are udp, are the icmp packets routed through the director independently of the services that are being LVS'ed? I.e. if the director is only forwarding port 80/tcp from CIP to a particular RIP, does the LVS code which handles the icmp forward all icmp packets from the CIP to that RIP? What if the client has a telnet session to one realserver and http to another realserver?

It doesn't matter, because the header of the original packet is encapsulated in the icmp packet. It is easy to identify which connection is the icmp packet for.

ICMP checksum errors (This problem pops up in the mailing list occasionally, e.g. Ted Pavlic on 2000-08-01.) Jerry Glomph Black

The kernel debug log (dmesg) occasionally gets bursts of messages of the following form on the LVS box: What is this, is it a serious problem, and how to deal with it?

Joe

I looked in dejanews. No-one there knows either and people there are wondering if they are being attacked too. It appears in non-LVS situations, so it probably isn't an LVS problem. The posters don't know the identity of the sending node.

Wensong

I don't think it is a serious problem. If these messages are generated, the ICMP packets must have failed their checksum. Maybe the ICMP packets from 199.108.9.188 are malformed for some unknown reason.

Here are some other reports

Hendrik Thiel thiel (at) falkag (dot) de 18 Jun 2001

I noticed this in dmesg and messages: Is this lvs specific (using nat) ? or can this be an attack?

Alois Treindl alois (at) astro (dot) ch

I see those too not as many as you but every few hours a bunch.

Juri Haberland juri (at) koschikode (dot) com

From time to time I see them also on a firewall masquerading the company's net. I always assumed it is a corrupted ICMP packet... Who knows...

ICMP Timeouts Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 14 Feb 2001

What is the usefulness of the ICMP packets that are sent when new packets arrive for a TCP connection that has timed out in the LVS box? I understand obviously for UDP but I don't see their role for a TCP connection...

Julian

I assume your question is about the reply after ip_vs_lookup_real_service. It is used to remove the open request in SYN_RECV state in the realserver. LVS replies for more states, and maybe some OSes report them as soft errors (Linux), while others report them as hard errors, who knows.

It's about ICMP packets from a LVS-NAT director to the client. For example, a client accesses a TCP virtual service and then stops sending data for a long time, enough for the LVS entry to expire. When the client tries to send new data over this same TCP connection, the LVS box sends ICMP (port unreachable) packets to the client. For a TCP connection, how do these ICMP packets "influence" the client? It will stop sending packets to this expired (for the LVS box...) TCP connection only after its own timeouts, won't it?

By default TCP replies RST to the client when there is no existing socket. LVS does not keep info for already expired connections and so we can only reply with an ICMP rather than sending a TCP RST. (If we implement TCP RST replies, we could reply TCP RST instead of ICMP). What does the client do with this ICMP packet? By default, the application does not listen for ICMP errors and they are reported as soft errors after a TCP timeout and according to the TCP state. Linux at least allows the application to listen for such ICMP replies. The application can register for these ICMP errors and detect them immediately as they are received by the socket. It is not clear whether it is a good idea to accept such information from untrusted sources. ICMP errors are reported immediately for some TCP (SYN) states.

PMTUD (path MTU discovery) see

Long sessions through LVS DR director terminated by icmp-host-prohibited (ICMP type 3 code 10)

Klaas's posting of a bug. We don't know what it's about yet.

Klaas Jan Wierenga k.j.wierenga (at) home (dot) nl 26 Mar 2007

I have a problem where sometimes some long standing mp3 streaming sessions over HTTP are terminated, because the LVS-DR director sends an ICMP type 3 code 10 - host unreachable (icmp-host-prohibited) packet to the client (the source of the mp3 stream). When this happens the client stops sending packets for 15 minutes (the TCP idle session timeout of LVS?).

Initially I suspected the LVS director, but after some investigation I found out that it never sends icmp-host-prohibited (it's not in the linux/net/ipv4/ipvs/* source files). The only other possibility was netfilter sending it (found in net/ipv4/netfilter/ipt_REJECT.c: send_unreach(*pskb, ICMP_HOST_ANO)). But why is this sent on an existing, established and active connection? The relevant parts of my initial iptables were (/etc/sysconfig/iptables):

After I changed the port 80 rule to the one below, effectively disabling connection tracking on port 80, the problem disappeared.

Initially I made this iptables change on the LVS director, but then the realservers would sometimes send icmp-host-prohibited on established connections; only after also changing iptables on the realservers did the problem go away. It is still unclear to me why netfilter would decide to send icmp-host-unreachable on an established connection when connection tracking is active. Maybe someone on the netfilter list can shed some light on this.

later on following up: 26 Jun 2007

I never figured it out. It appears to be a netfilter problem because when I changed my firewall rules (/etc/sysconfig/iptables) to disable connection tracking, the problem went away.

LVS: High Availability, Failover protection

Introduction

In a production system you want to be able to do planned maintenance: remove, upgrade, add or replace nodes, without interruption of service to the client. Machines may crash, so a mechanism for automatically handling this is required too. Redundancy of services on the realservers is one of the useful features of LVS. One machine/service can be removed from the functioning virtual server for upgrade or moving of the machine and can be brought back on line later without interruption of service to the client.

The most common problem found is loss of network access or extreme slowdown (or DoS). Hardware failure or an OS crash (on unix) is less likely. Spinning media (disks) fail near the end of the warranty period (in my experience) - you should replace your disks preemptively.

The director(s) don't need hard disks. I've run my director from 30M of files (including perl and full glibc) pulled from my Slackware distribution. Presumably a mini Linux distribution would be even smaller. You should be able to boot off a floppy/cdrom/flash disk and load all files onto a small ramdisk. Logging information (e.g. for security) can be mailed/scp'ed at intervals to a remote machine via a NIC used for monitoring (note: not one of the NIC's used to connect to the outside world or to the realservers). Reconfiguring services on the fly with ipvsadm will not interrupt current sessions. You can reasonably expect your director to stay up for a long time without crashing, and it will not need to be brought down for servicing any more than any other diskless router.

An alternative to flash memory is a cdrom

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 May 2002

My LVS servers are currently EXT2 but I'm either going to go with a diskless server using netboot or a CD based server. Our LVS is becoming our firewall (using NAT) and I'd rather have it stay bullet proof. CD based if it gets compromised I just reboot it.

The LVS code itself does not provide high availability. Other software is used in conjunction with LVS to provide high availability (i.e. to switch out a failed realserver/service or a failed director). Several families of tools are available to automatically handle failout for LVS. Conceptually they are a separate layer to LVS. Some set up LVS and the monitoring layer separately. Others will set up LVS for you, and administratively the two layers are not separable.

Here's an article on the high cost of delivering high uptime computer service by Steve Levin. The author says that NASA runs on three 9's (99.9%) reliability. For this level of reliability, the system has to handle all faults without human intervention.

There are two types of failures with an LVS.

director failure

This is handled by having a redundant director available. Director failover is handled in the Ultra Monkey Project by heartbeat. Other code used for failover is vrrpd in keepalived. The director maintains session information (client IP, realserver IP, realserver port), and on failover this information must be available on the new director. On simple failover, where a new director is just swapped in, in place of the old one, the session information is not transferred to the new director and the client will lose their session. Transferring this information is handled by the . The keepalived project by Alexandre Cassen works with both Linux-HA and LVS. keepalived watches the health of services. It also controls failover of directors using vrrpd.

realserver failure, or failure of a service on a realserver

This is relatively simple to handle (compared to director failover). An agent running on the director monitors the services on the realservers. If a service goes down, that service is removed from the ipvsadm table. When the service comes back up, the service is added back to the ipvsadm table. There is no separate handling of realserver failure.
If the server catches on fire (a concern of Mattieu Marc marc (dot) mathieu (at) metcelo (dot) com), the agent on the director will just remove that realserver's services from the ipvsadm table as they go down.

For LVS-DR, you cannot monitor a service running on the VIP on the realserver from the director (since the director also has the VIP). Instead you arrange for the service to bind to both the VIP and the RIP (or to 0.0.0.0) and test the health of the service bound to the RIP, as a proxy for the service running on the VIP.

You can monitor a tcp service by connecting to the ip:port. Testing of udp services (e.g. DNS) is a little more problematic. The DNS monitor that comes with Mon does a functional test on the realserver, asking it to retrieve a known DNS entry.

Tim Hasson tim (at) aidasystems (dot) com 27 Jan 2004

The attached patch (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/hasson_dns.patch) gets around the problem with ldirectord not doing any udp checks, by using Net::DNS to test if the DNS server is resolving. You cannot simply do a udp connect check, as udp is connectionless :) That is why ldirectord will always keep all realservers on any udp service, regardless of the service status. So, you just basically install Net::DNS from cpan, and apply the attached patch to /usr/sbin/ldirectord. You can change www.test.com in the patch, or in ldirectord after you have applied the patch, if you need to specify an internal domain or something else. The patch applied cleanly to several ldirectord versions, including the latest from ultramonkey, heartbeat-ldirectord-1.0.4.rpm (I believe it was ldirectord 1.76).

The configure script monitors services with Mon. Setting up mon is covered in . The configure script will set up mon for you. Mon was the first tool used with LVS to handle failover. It does not handle director failover. In the Ultra Monkey Project, service failure is monitored by ldirectord.
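The monitoring loop described above (probe the service on the RIP, then remove or re-add the realserver in the ipvsadm table) can be sketched in a few lines of Python. This is only an illustration of the idea, not the code used by mon, ldirectord or keepalived. The function names, the example addresses and the choice of `-g` (LVS-DR forwarding) are assumptions; the sketch only builds the ipvsadm command lines and leaves running them to the caller.

```python
import socket


def tcp_service_up(rip, port, timeout=3.0):
    """Return True if a TCP connection to rip:port succeeds within timeout."""
    try:
        with socket.create_connection((rip, port), timeout=timeout):
            return True
    except OSError:
        return False


def ipvsadm_command(vip, rip, port, up):
    """Build the ipvsadm argument list that adds or removes one realserver.

    -a / -d add or delete a realserver, -t names the TCP virtual service,
    -r names the realserver, -g selects direct routing (assumed here).
    """
    service = f"{vip}:{port}"
    server = f"{rip}:{port}"
    if up:
        return ["ipvsadm", "-a", "-t", service, "-r", server, "-g"]
    return ["ipvsadm", "-d", "-t", service, "-r", server]


def reconcile(vip, port, realservers, in_table, check=tcp_service_up):
    """Probe every realserver and return the commands needed to update the table.

    in_table is the set of RIPs currently in the ipvsadm table (hypothetical
    bookkeeping kept by the monitoring agent); it is updated in place.
    """
    commands = []
    for rip in realservers:
        alive = check(rip, port)
        if alive and rip not in in_table:
            commands.append(ipvsadm_command(vip, rip, port, up=True))
            in_table.add(rip)
        elif not alive and rip in in_table:
            commands.append(ipvsadm_command(vip, rip, port, up=False))
            in_table.discard(rip)
    return commands
```

An agent would call reconcile() every few seconds and pass each returned list to something like subprocess.run(). Note that the probe goes to the RIP, not the VIP, for the LVS-DR reason given above, and that a plain connect test says nothing useful about a udp service.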
For service failure on the realserver or director failure (without the server state synch demon), the client's session with the realserver will be lost. This is no different to what would happen if you were using a single server instead of an LVS. With LVS and failover however, the client will be presented with a new connection when they initiate a reconnect. Since only one of several realservers failed, only some of the clients will experience loss of connection, unlike the single server case where all clients lose their connection. In the case of http, the client will not even realise that the server/service has failed, since they get a new connection when clicking on a link. For session oriented connections (e.g. https, telnet) all unsaved data and session information will be lost.

If you have a separate firewall, it doesn't have to be Linux

Clint Byrum cbyrum (at) spamaps (dot) org 2005/22/05

Honestly, as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com.

Carp is available for Linux too.

anon

The part I do not understand is how to have a LVS cluster failover without using HA. Since HA is limited to two nodes?

There are several packages available to do failover for LVS. Some of them overlap in functionality and some of them are for different purposes. The LVS can have any number of realservers. Failover of realservers occurs by changing the ipvsadm table on the director.

Director failover occurs by transferring the VIP to the backup director, bringing down the primary director, and by using the backup copy of the connection table (put there by the synch demon) on the backup director. Once you've moved the VIP, the network needs to know that the VIP is associated with a new MAC address. To handle this, you can use Yuri Volobuev's send_arp, distributed with the Linux-HA package (make sure you understand how arp works: see ).

Director failover and realserver failover are logically separate, occur independently and are done by different pieces of code, e.g. MON only handles realserver failover. Since both functionalities are required in a production LVS, some packages have them both. When configuring these packages you must remember that the director failover parts are logically separate from the realserver failover parts.

Both keepalived and Linux-HA handle director failover and monitor the state of service(s) on the realserver. Keepalived has both functionalities in the same piece of code and uses one configure script. Linux-HA uses ldirectord to handle realserver failover. I think now that you set up Linux-HA/ldirectord with one configure script (not sure).
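To make the "new MAC address for the VIP" step concrete, here is a minimal Python sketch that builds the kind of gratuitous ARP frame a takeover announcement uses: an ARP request, broadcast on the LAN, in which the sender IP and the target IP are both the VIP. It is not the source of send_arp; the function name and the example addresses are made up for illustration, and the sketch only constructs the bytes.

```python
import socket
import struct


def gratuitous_arp(vip, mac):
    """Return an Ethernet frame announcing that vip is now reachable at mac."""
    hw = bytes.fromhex(mac.replace(":", ""))
    if len(hw) != 6:
        raise ValueError("MAC address must be 6 bytes")
    ip = socket.inet_aton(vip)

    # Ethernet header: broadcast destination, our MAC as source, ethertype ARP.
    ether = b"\xff" * 6 + hw + struct.pack("!H", 0x0806)

    # ARP header: hardware type Ethernet (1), protocol IPv4 (0x0800),
    # hardware length 6, protocol length 4, opcode 1 (request).
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    arp += hw + ip            # sender MAC and sender IP (the VIP)
    arp += b"\x00" * 6 + ip   # target MAC left empty, target IP is also the VIP
    return ether + arp


frame = gratuitous_arp("192.168.1.110", "00:11:22:33:44:55")
print(len(frame), frame.hex())
```

Machines that already hold an arp entry for the VIP overwrite it with the sender MAC from this frame. Actually transmitting the frame needs a raw packet socket and root privileges, which is the job send_arp does for you.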

Single Point of Failure (SPOF) - you can't protect against everything Redundancy is a method to handle failure in unreliable components. As a way of checking for unreliable components the concept of a "single point of failure" (spof) is used. However some components are much more reliable than others (e.g. a piece of multicored ethernet cable). You can safely not replicate them. Other components are much more expensive than others: it's expensive to replicate them. You are not looking for a fail-proof setup: you are looking for a setup which has a failure rate and cost that the customer can live with. Mark Junk

I want to set up an LVS cluster firewall but I have only one ethernet cable from my ISP. So my question is: how can I achieve this without introducing a single point of failure? Essentially I need to plug one cable into two boxes, splitting at x.

Joe a hub/switch. They have low failure rates.

Yeah that would be a single point of failure though

Clint Byrum cbyrum (at) spamaps (dot) org 27 Oct 2004

You're already dealing with the cable from your ISP failing, the ultra redundant power going down, a meteor hitting the building, and their NOC tech setting fire to the routers. IMHO, if you really want to eliminate all SPOFs, you have to go multisite. At some point while dealing with the problems of going multisite, it just becomes ridiculous, and you have to ask yourself what your clientele really need in terms of uptime.

Sebastien BRIZE sebastien (dot) brize (at) libertysurf (dot) fr

A simple but expensive way is to use a couple of Routing Switches (L3/L4) and double-attachment switches, using RSTP (Rapid Spanning Tree Protocol) for the switch attachment and MRP (Metro Ring Protocol) (or even RSTP) between both Routing Switches. RS1 and RS2 may be in different sites, and each piece of equipment may have two power supplies. This is much more expensive than a cable though.

Dana Price d (dot) price (at) rutgers (dot) edu

I've got an Ultramonkey 3.0 LB-DR setup, with two directors. I have heartbeat running over eth0 and a crossover on eth1. Since both heartbeat links have to fail for a failover to occur, I'm concerned that something like a bad nic, cable, or switch will bring my web service down (say eth0 fails but the crossover eth1 is still up). Is there any way to define two heartbeat links in ha.cf but to have it failover if a designated one dies? That way the directors can still maintain state over the second link and I'd avoid the split-brained cluster that comes with only 1 HB link.

Joe this may be possible and someone else can give you the answer, but I'll talk about something else... There's only so many things you can worry about, so you pick the ones that are most likely to go.

The most likely problem is your network connection will go down - this is usually out of your control.

Next is mechanical things like disks and fans, or connectors not making good contact. This is the problem you have to deal with (- see below). Make sure you have ready-to-go copies of your disks, just sitting on the shelf next to the machine. You can update them by putting them in an external USB case and plugging them in somewhere, whenever you change your machine. Disks are really cheap compared to the cost of the labor of replacing them, or the cost of downtime. As well, pre-emptively swap out disks at their warranty date.

Possibly you have unreliable power. Where I live in the US, I get a 1 sec power bump once a week, when the power company must be changing the power feed with a mechanical switch. You need a UPS. Such things are unheard of in more advanced parts of the world, like Europe, where you can have a machine up for 400 days on the regular power without any interruptions and UPS are not needed at all.

I've never had a NIC just fail. I (accidentally) kicked the BNC connector on one and it died. I killed another with electrostatic shock by _not_ touching the computer case before putting my fingers near the empty RJ-45 socket. That's it - NICs generally don't die and neither do switches. The tcpip stack never locks up, unless the whole OS is hosed and that doesn't happen a whole lot with Linux and if it does, then heartbeat is gone too.

The connectors/cables to a NIC are another thing. Make sure your cables are multistranded and not a single strand for each wire. Flexing of single strand wire at the connector leads to cracks that show up as intermittent connections. Single strand has become the default since the .com boom, but they're only tolerated in the commodity market where people would rather save 1% cost than have a reliable connection. Nowhere else in the electronic industry are they used. There's probably not too much problem if the cables are just laid out and plugged in and left there without movement till the computer is junked, but if you're rearranging your cables frequently, use multicored cables.

Heartbeat has been used with LVS for years and we haven't had anyone come up with a split brain yet. (Maybe it happens and people don't think it worth mentioning.) I would say that a pair of NICs with a single crossover cable is probably the most reliable part of your set up. I wouldn't bother making it redundant.

Stateful Failover

Anywhere that state information is required for continued LVS functioning, failover will have to transfer the state information to the backup machine. An LVS can have (some or all of) the following state information.

director: ip_vs connection table (displayed with ipvsadm)

i.e. which client is connected to which realserver. The server state synch demon will transfer this information to a backup director. If this information is not transferred, the client will lose their virtual service. For http, this is not a problem, as the client will get a new connection by hitting "refresh" on the browser.

realserver: ssl session keys

When setting up https as a service under LVS, https is set up with persistence, so that the multiple tcp connections required for an ssl session will all go to the same realserver. On realserver failover, these session keys are lost and the client has to renegotiate the ssl connection. Presumably other persistent information, which is much more important (e.g. shopping cart or database), is being stored on the LVS in a failover safe manner. Compared to loss of the customer data, loss of session keys is not a big deal and we are not working on a solution for this.

realserver: persistent data

e.g. shopping cart on e-commerce sites. To allow customers to make purchases over an arbitrarily long period and for their session to survive failover of the realserver to which they are connected, their database information needs to be preserved in a place where any realserver can get to it. Originally this was done with cookies (see the section on ), but these are intrusive. Cookies can be stolen or poisoned and many people turn them off (clients shouldn't be allowing non-trusted machines to write anything on their computer). All customer state information should instead be stored at the LVS site (see the section on ). If you store persistent data on the virtual server, you must write your application to survive failover of the realserver and long timeouts. (The customer should be able to bring up information about vacations and leave it on the screen for the spouse to inspect when they come home. The spouse should be able to click to the next piece of information without the application crashing.)

tcp state: filter rules on the director and/or realserver

This information is one level lower on the ISO network diagram than is the ipvsadm connection information. Any particular client can make many tcp connections to a realserver.

The director is a router (admittedly with slightly different rules than the normal routers) and as such just forwards packets. On failover, a director configured with no filter rules can be replaced with an identically configured backup with no interruption of service to the client. There will be a time in the middle of the changeover where no packets are being transmitted (and possibly icmp packets are being generated), but in general once the new director is online, the connection between client and realserver should continue with no break in established tcp connections between the client and the realserver.

If the director has only stateless filter rules, then the director still appears as a stateless router and director failover will occur without interruption of service.

With iptables, a router (e.g. an LVS director) can monitor the tcp state of a connection (e.g. NEW, RELATED, ESTABLISHED). If stateful filter rules are in place (e.g. only accept packets from ESTABLISHED connections) then after failover, the new director will be presented with packets from tcp connections that are ESTABLISHED, but of which it has no record. The new director will REJECT/DROP these packets. Harald Welte (of netfilter) is in the process of writing code for stateful failover of netfilter.

Ratz, 01 Jun 2004

He has actually done it (http://cvs.netfilter.org/netfilter-ha/ link dead Feb 2005) and we can expect it to surface in the user world in a couple of months for beta testing. Each new rule has to find its place in the existing rule set, resulting in an n^2 loading of rules. It can take seconds for 50,000 rules to load. This is also being worked on with the new pkttables, ct/nf-netlink and whatever else the nf guys come up with. For large rule sets, use Hipac (http://www.hipac.org/).
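The failure mode described above for stateful filter rules (the backup director drops packets from connections it never saw being set up) can be shown with a toy model in Python. This is not how netfilter or ct_sync are implemented; the class, its methods and the addresses are hypothetical and only model the idea "accept a mid-stream packet only if the connection is in my table".

```python
class StatefulFilter:
    """Toy firewall that only passes packets of connections it has tracked."""

    def __init__(self):
        self.tracked = set()

    def handle(self, conn, syn=False):
        """conn is a (client_ip, client_port, server_ip, server_port) tuple."""
        if syn:
            # A new connection is recorded when its first packet is seen.
            self.tracked.add(conn)
            return "ACCEPT"
        # Mid-stream packets pass only if the connection is already known.
        return "ACCEPT" if conn in self.tracked else "DROP"

    def export_state(self):
        """Copy of the connection table, as a state-sync mechanism would send it."""
        return set(self.tracked)

    def import_state(self, state):
        """Merge a connection table received from the other director."""
        self.tracked |= state


conn = ("203.0.113.7", 40000, "192.168.1.110", 80)
master, cold_backup, synced_backup = StatefulFilter(), StatefulFilter(), StatefulFilter()

master.handle(conn, syn=True)                 # client connects through the master
synced_backup.import_state(master.export_state())

print(cold_backup.handle(conn))               # backup without state
print(synced_backup.handle(conn))             # backup that received the state
```

The cold backup drops the established connection while the synced one lets it through, which is why stateful rules on a director need some form of state replication before failover is transparent.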

Even though a highly available form of stateful netfilter is now available, it really doesn't affect LVS, because LVS controlled packets do not traverse the netfilter framework in the normal manner and iptables is not aware of all of the transfers of packets. LVS does its own connection tracking. Until early 2004, this was not particularly complete, but Julian has beta grade code out now which should satisfy most users (see ). This code allows stateful tracking of LVS controlled packets.

Failover of tcp connection state is already handled by the server state synch demon. Note that the synch demon only cares whether a connection is ESTABLISHED and only copies connections which are ESTABLISHED to the backup director. Connections in FIN_WAIT etc will timeout on their own and the backup director doesn't need to know about these states on becoming the master director.

The situation on the realserver is a little different. If a realserver fails, for most services, there is no way to transfer the connection to a backup realserver and the connection to the client is lost anyway. In this case stateful filter rules on the director will not cause any extra problems with failover.

Summary: stateful filter rules are allowed on the realservers anytime you like. Stateful rules are allowed on the director only if you use Julian's NFCT patches.

octane indice octane (at) alinto (dot) com 13 Apr 2006

Do you know if you can do something like carp+pfsync with linux+ipvs. My goal is to have two director/firewall machines: a master and a backup, Both sharing the same IP: VIP I can handle the LVS part easily with keepalived and a VRRP method and same ruleset but it means that all the connections tracked by the firewall rules are lost when master comes down. I first want a firewall with failover. _Then_ if it works, I would add director on top of it. I want to use it under linux. So the carp/pfsync solution is not available.. The question is: the sync daemon is helpful with me to synchronize firewalls state or not? I read then http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.server_state_sync_demon.html
but I saw:"Note that the feature of connection synchronization is under experiment now, and there is some performance penalty when connection synchronization, because a highly loaded load balancer may need to multicast a lot of connection information. If the daemon is not started, the performance will not be affected. " and from: http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.failover.html Honestly, as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com. Note Carp is available for Linux too. " yes carp is available for linux but not pfsync which is what I need.
I asked Julian about http://www.ssi.bg/~ja/nfct/ "Does it means that master firewall will updates backup firewall with its conntrack state?". His answer was "No". Seems to be that there is no way to use a cluster firewall with conntrack replication under linux.

Joe no-one has posted that they've done it. Any protocol that updates state information onto a backup machine is going to have overhead. pfsync updates the firewall state (I believe) on the backup, but not the ipvs connection table. Even with carp, you still have to transfer the ipvs table. The ipvs synch state demon only keeps track of the ipvs controlled connections, not the firewall state - it won't help you. Ratz 20 Apr 2006 IPVS has not much to do with firewalling, you can achieve CARP+pfsync like setups using VRRP+ctsync under Linux. Does ctsync not work? I know that you've also asked in the nf-failover ml. It's sort of maintained (there have been a couple of patches to ct_sync this year already) and it sort of works for the handful of people that actually use it. It had problems with tcp window tracking the last time I tried it but Krisztian and Harald are certainly more than happy to fix a couple of issues related to ctsync problems. People send in patches to ct_sync regularly to netfilter-devel and some even maintain out of tree kernel patches: http://vvv.barbarossa.name/files/ct_sync/ Please try out the available software and if this does not work, complain at netfilter-dev ml ;).

Director failure What happens if the director dies? The usual solution is duplicate director(s) with one active and one inactive. If the active director fails, then it is switched out. Although everyone seems to want reliable service, in most cases people are using the redundant directors in order to maintain service through periods of planned maintenance rather than to handle boxes which just fail at random times. At least no-one has admitted to a real director failure in a production system. Matthew Crocker matthew (at) crocker (dot) com 23 May 2002

I have a production LVS server running with 3 realservers handling SMTP, POP3, IMAP for our QMAIL server. We process about a million inbound connections a day. I've never had the primary LVS server crash but I have shut it down on purpose (yanked the power cord) to test the fail over. Everything worked perfectly. We use QMAIL-LDAP for our mail server and Courier-IMAP for the IMAP server. QMAIL saves mail in Maildir format on our NFS server (Network Appliance F720) as a single qmail user. All aliases, passwords, mail quota information is stored in LDAP (openldap.org). The cluster is load balanced using LVS currently with Direct Routing but I'm going to switch to NAT very soon.

Bradley McLean bradlist (at) bradm (dot) net 23 May 2002

We run a pair of load balancers in front of 5 real http/https webservers, using keepalived. In earlier versions of LVS, a memory leak problem caused a failover to occur about once every three days (might have been 0.9.8 with keepalived 0.4.9 + local patches). We're on 1.0.2 and 0.5.6 now, with no problems, except that we don't quite have an auto failback mechanism that works correctly. We preserve connections quite nicely during the failover from the master to the backup, however once in that state, if the master comes back up, it takes over without capturing the connection states from the backup. I believe that Alexandre is close to solving this if he hasn't already; frankly we've been concentrating on other pieces of our infrastructure, and since we've had no failures since we upgraded versions, we haven't been keeping up. We're relatively small, serving up between .5 and 2.5 T1's worth of traffic. The balancers are built from Dell 2350s with 600Mhz PIII and 128MB, with DE570TX quad tulip cards in each. We run NAT, with an external interface that provides a non-routable IP address (there's a separate firewall up front before the web cluster), an internal interface to our web servers, and internal interface to our admin / backup network, and an interface on a crossover cable to the other balancer used for connection sync data. We could consolidate some of these, but since NICs are cheap, it keeps everything conceptually simple and easy to sniff to prove it's clean.

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no 23 May 2002

We have been running lvs in a production site for about 8 months now, with functional failover for the last 3. We use keepalived for failover and healthchecking. The setup consists of 4 realservers (dell 2550, dual pIII 933, 1Gb RAM) and 2 directors (pII 400). The main director has only been down for maintenance or demonstration purposes. The site has about 2 million hits per day, and the servers are pushing between 20 and 70 Gb of data each day.

Automatic detection of failure in unreliable devices by other unreliable devices is not a simple problem. Currently director failure in an LVS is handled by code from the Linux HA (High Availability) project. (Alexandre Cassen is working on code based on vrrpd, which will also handle director failover). The Linux HA solution is to have two directors and to run a heartbeat between them. One director defaults to being the operational director and the other takes over when heartbeat detects that the default director has died.

[Diagram: only the realserver part survives in this copy; the upper part is not recoverable.]

                          |
         ------------------------------------
         |                |                 |
     RIP1, VIP        RIP2, VIP         RIP3, VIP
   ______________   ______________   ______________
  |              | |              | |              |
  | realserver1  | | realserver2  | | realserver3  |
  |______________| |______________| |______________|

LVS is one of the major uses for the Linux HA code and several of the Linux HA developers monitor the LVS mailing list. Setup problems can be answered on the LVS mailing list. For more detailed issues on the working of Linux HA, you'll be directed to join the Linux HA mailing list.

Fake, heartbeat and mon are available at the Linux High Availability site.

There are several overlapping families of code being developed by the Linux HA project and the developers seem to contribute to each other's code. The two main branches of Linux HA used for LVS are UltraMonkey and vrrpd/keepalived. Both of these have their own documentation and are not covered in this HOWTO.
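The takeover decision made by the backup director can be sketched as a simple timer: remember when the peer was last heard from, and become active once that is longer ago than a configured dead time. The Python below is an illustration only and is not how heartbeat is implemented; the class name and the `deadtime` parameter are assumptions chosen for the example, and the clock is injectable so the logic can be exercised without waiting.

```python
import time


class HeartbeatMonitor:
    """Backup-side view of the peer: take over if no heartbeat for deadtime seconds."""

    def __init__(self, deadtime=10.0, clock=time.monotonic):
        self.deadtime = deadtime
        self.clock = clock
        self.last_seen = clock()
        self.active = False  # the backup starts out passive

    def beat(self):
        """Record that a heartbeat from the peer has just arrived."""
        self.last_seen = self.clock()

    def poll(self):
        """Return True once this node should be (or already is) the active director."""
        if not self.active and self.clock() - self.last_seen > self.deadtime:
            self.active = True  # here the real system would acquire the VIP
        return self.active
```

The hard part hinted at in the first sentence of this section is that a silent peer and a broken heartbeat link look the same to this timer. That is why more than one heartbeat path is commonly used, and why a false positive leads to two directors fighting over the VIP.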

UltraMonkey and Linux-HA

The UltraMonkey project, written by Horms, is a packaged version of LVS combined with Linux HA to give director failover. It uses LVS-DR and is designed to load balance on a LAN. UltraMonkey uses Heartbeat from the Linux-HA project for failover and ldirectord to monitor the realservers. Alternatively the has been written by Peter Mueller. This is functionally equivalent to the Ultra Monkey code.

Two box HA LVS

Doug Sisk sisk (at) coolpagehosting (dot) com 19 Apr 2001 Is it possible to create a two server LVS with fault tolerance? It looks straight forward with 4 servers ( 2 Real and 2 Directors), but can it be done with just two boxes, i.e. directors, each director being a realserver for the other director and a realserver running localnode for itself?

Horms Take a look at www.ultramonkey.org, that should give you all the bits you need to make it happen. You will need to configure heartbeat on each box, and then LVS (ldirectord) on each box to have two realservers: the other box, and localhost.

heartbeat and connection state synch demon

Michael Cunningham m (dot) cunningham (at) xpedite (dot) com I have heartbeat running between two LVS directors. It is working great. It can fail back and forth without issues. Now I would like to set up connection state synchronization between the two directors. But I have two problems/questions. Can I run the multicast connection sync over my 100 mbit private lan link which is being used by heartbeat? How can I set up heartbeat to always run.. on the current master.. and on the current slave at all times? The master can run a script when it starts up/obtains resources but I don't see any way for the slave to run a script when it starts up or releases resources.

Lars Marowsky-Bree lmb (at) suse (dot) de 02 Feb 2002 The slave runs the resource scripts with the "stop" action when the resources are released, so you could add it in there; anything you want to run before the startup of heartbeat is separate from that and obviously beyond the control of heartbeat. You are seeing the result of heartbeat's rather limited resource manager, I am afraid.

serial connection problems with Linux-HA "Radomski, Mike" Mike (dot) Radomski (at) itec (dot) mail (dot) suny (dot) edu 13 Mar 2002

I am experiencing a strange problem with my LVS+Heartbeat cluster. I have two systems both running ipvsadm and heartbeat(serial and x-over Ethernet). Every 10 hours I get a cpu spike (load of 1.1) on the primary system and then a few minutes later I get the same spike on the secondary system. The system sustains a load of ~1 for about 20 minutes and then returns to ~0. Neither top nor ps are showing the active process causing the spike. The spike lasts for about 20 minutes and then everything is fine. The ipvsadm piece still redirects and load balances with no viewable performance problem. Is there any thing I can do to track this problem down?

Lars A load of 1.1 doesn't mean a CPU spike; it might simply mean that there is a zombie process for some reason. ps should show this (a process in D or Z state); ps fax will show you a process tree so you can figure out where it came from.
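If ps is inconvenient, for example when you want to log the state every minute while waiting for the next spike, the same information can be read from /proc. The Python sketch below lists processes in D or Z state together with their parent pid. It assumes a Linux /proc filesystem; the function names are made up for this example. On Linux, processes in uninterruptible sleep (D) are counted in the load average even when they use no CPU.

```python
import os


def parse_stat(line):
    """Return (pid, comm, state, ppid) from the text of /proc/<pid>/stat."""
    # The command name is in parentheses and may itself contain spaces or
    # parentheses, so split on the first "(" and the last ")".
    left = line.index("(")
    right = line.rindex(")")
    pid = int(line[:left])
    comm = line[left + 1:right]
    fields = line[right + 2:].split()
    return pid, comm, fields[0], int(fields[1])


def stuck_processes(proc="/proc"):
    """Return (pid, ppid, state, comm) for every process in D or Z state."""
    found = []
    for entry in os.listdir(proc):
        if not entry.isdigit():
            continue
        try:
            with open(os.path.join(proc, entry, "stat")) as f:
                pid, comm, state, ppid = parse_stat(f.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited while we were looking at it
        if state in ("D", "Z"):
            found.append((pid, ppid, state, comm))
    return sorted(found)
```

Calling stuck_processes() from a cron job and appending the result to a log gives a record of which processes, and which parents, were around during a load spike.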

It has been over 10 hours since the last sustained spike. I remember when setting up the heartbeat, the serial connection was very slow and intermittent. If I cat'ed information to the serial port, it would take about 30 seconds to reach the other end of the null modem cable. As per a suggestion on Google Groups, I tried to set the serial port with the following: This worked both with simple serial communication and heartbeat. But I found in the dmesg and the logs the following: After shutting off the serial heartbeat, the over all load dropped about .02. I have not seen the sustained spike since.

Lars

It is something in the kernel, because that is the only thing which is not accounted for by anything else. There are other options: if you boot the kernel with "profile=2", you can use readprofile to compare the patterns for 5 minutes during the events and outside. Remember to use "readprofile -r" to reset the profiling data when doing so, so that the counters are clean, and do NOT run top during that time, because top traverses the /proc file system every second or so, which greatly obscures the profiling results.

Paul Baker pbaker (at) where2getit (dot) com 14 Mar 2002

I had major reliability issues when I tried using the serial connection with heartbeat. I attributed it to poor chipset design from Intel. My load balancers are 1U Celeron systems that use that crappy i810 chipset. Pretty much whenever there was any load on the server (such as during rsync replication between the master and slave loadbalancers), the serial connection would completely time out, which would cause complete havoc on my lvs. The slave would then think the master was down, and start to bring itself up as the director. Let me tell you from experience, it really sucks when you have two directors fighting over arp for the ip addresses of the lvs. So I just decided to bite the bullet and switch from serial heartbeats to udp. I haven't had a problem since.

Keepalived and Vrrpd

Alexandre Cassen alexandre (dot) cassen (at) wandadoo (dot) fr, the co-author of keepalived (http://keepalived.sourceforge.net) and the author of LVSGSP, has produced keepalived, which sets up an LVS and monitors the health of the services on the realservers, and a vrrpd demon for LVS, which enables director failover. You build one executable, keepalived, which has (optionally one or both of) the vrrpd and keepalived functions. If you just want failover between two nodes, you only need the vrrpd part of the build. (Notes here produced from discussions with Alexandre.)

Keepalived will

  - set up an LVS from scratch (services, forwarding method, scheduler, realservers)
  - monitor the services on the realservers and failout dead services on the realservers
  - failover machines (for LVS, this will be the directors)

There are examples of using keepalived/vrrpd in a HOWTO for LVS-NAT and another HOWTO for LVS-NAT with the director being patched with the . to act as a firewall as well.

The options available for the keepalived.conf are available in doc/keepalived.conf.SYNOPSIS. Sample keepalived.conf files are in ./doc/samples/keepalived.conf.* in the source directory. An elementary set of manpages is available.

The functionality for the vrrpd failover is similar to that for . vrrpd adds IP(s) to ethernet card(s) with ip when the machine is in the master state, and removes them when it is in the backup state. On bringing up the IP(s), vrrpd sends a gratuitous arp for the new location of the IP, flushing the arp tables of other machines on the network. This procedure leaves the arp tables unchanged for the other (unmoving) IP(s) on the same interface.

There is some confusion about patents connected to VRRP. Here is some info.
FreeBSD has CARP (http://pf4freebsd.love2party.net/carp.html), the Common Address Redundancy Protocol, written to head off possible problems with cisco claims that its patents on Hot Standby Router Protocol (HSRP) cover the same technical areas as VRRP (http://software.newsforge.com/software/04/04/13/1842214.shtml). Alexandre has contacted cisco about this. CARP has been ported to Linux (http://www.ucarp.org/).

Alexandre Cassen acassen (at) freebox (dot) fr 9 Mar 2006

CARP is close to VRRP - they have the same Finite State Machine (FSM). The patent on VRRP is not applicable to Keepalived, since I made some assumptions that make the implementation not as rfc compliant as other implementations. The VRRP patent for linux implementation is not a problem. The CARP code, except the use of hash instead of IP address, and some other cosmetic stuff, is VRRP like. VRRP is an IETF standard. IMHO, what is important for such a protocol is not re-inventing the FSM (by writing CARP), but stacking components around to make it useful (like sync_group, ....). VRRP adoption is already made and if CARP doesn't bring new innovative concepts, this will slow down adoption.

S.Mehdi Sheikhalishahi 2005-05-21

Is there any comparison between Load Balancing and HA Solutions? What's the best for a firewall?

Clint Byrum cbyrum (at) spamaps (dot) org 2005/22/05 as good as LVS is for real server load balancing, for firewalls I like OpenBSD with CARP and pfsync. CARP+pfsync provides easy, scalable load balancing and HA for firewalls. pf, the OpenBSD firewall, is very well written and nicely designed. Give it a look, www.openbsd.com. Alexandre 31 Dec 2003

Gratuitous ARP is well supported by routing equipment. Only one packet is lost during takeover.

In earlier versions of vrrpd, the vrrpd fabricated a software ethernet device on the outside of the director (for the VIP) and another for the inside of the director (for the DIP), each with a MAC address from the private range of MAC addresses (i.e. one that will not be found on any manufactured NIC). When a director failed, vrrpd would re-create the ethernet devices, with the original IP and MACs, on the backup director. Other machines would not have any changes in their arp tables (the IP would move to another port on a switch/hub though) and would continue to route packets to the same MAC address. Unfortunately this didn't work out.

Alexandre 31 Dec 2003

We discussed this with Julian, and Jamal. The previous code didn't handle the VMAC cleanly. It consisted of changing the interface MAC address inside the kernel to fake the needed one... This is not clean and not scalable since this restricted us to only one VMAC per interface (multiple VMACs were not supported). Later Julian produced his parp netlink patch that offers an arp reply from a VMAC. This did not work, as all traffic stayed with the interface MAC. Later, on the netdev ML, we discussed this with Julian and Jamal, and the best solution was to provide a patch to the ingress/egress code to support these VMAC operations. This code hasn't been written.

keepalived listens on raw:0.0.0.0:112, so you can include in /etc/services. Here's part of the output of netstat -a after starting keepalived on a machine with two instances of vrrpd (one for each interface) (there are no other machines running vrrpd on the network). After starting keepalived on another director, note that one of the vrrpds has received some packets.

Although netstat only shows that vrrpd is bound to 0.0.0.0:vrrpd, if you are wondering how to write your filter rules, vrrpd is only bound to the NIC specified in keepalived.conf. VRRP advertisements are sent/received on this protocol socket, using multiplexing. The src_addr of the multicast packet is the primary IP of the interface. Multicast permits you to alter the src_addr (with mcast_src_ip) if you want to hide the primary IP. If you do this, the socket will still be bound to 0.0.0.0 according to netstat (Here is further info on .)

Alexandre Dec 2003

VRRP is interface specific (like HSRP and other hot standby protocols) and uses a socket pair for sending/receiving adverts. The sockets are bound to the specified interface. When you configure a VRRP instance on interface eth0, VRRP will create a raw vrrp-proto socket and bind the socket to interface eth0 (using the bindtodevice kernel call). Then it joins the VRRP multicast group. So this socket will receive VRRP adverts only on eth0. The same thing is done for the sending socket: the vrrp-proto sending socket is bound to the interface the VRRP instance belongs to.

Additionally, if you have more than one VRRP instance on the same NIC (for an active/active setup) then they will share the same socket. The VRRP code will then demux the incoming VRRP adverts, performing a hash lookup on the VRID header field of the incoming advert. It performs an O(1) lookup (hash index based on the VRRP VRID field). If you run IPSEC-AH VRRP and normal VRRP on the same interface then the code will create 2 sockets, one for each protocol (51 and 112). The rest is the same: demux on a shared socket according to the incoming VRRP VRID field.

VRRP is based on sending adverts over multicast (the advert interval is determined by advert_int in keepalived.conf, normally configured for 1 sec). This is an election protocol. The master is the one with the highest priority. When the master crashes, an election is held to find the VRRP master with the next highest priority.
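The demux and election steps described above can be sketched in a few lines of Python. This is only an illustration, not keepalived's code: the `Advert` class and the function names are hypothetical, and the tie-break on the higher source IP comes from the VRRP RFC, not from the discussion above.

```python
from dataclasses import dataclass
from ipaddress import IPv4Address
from typing import Dict, Iterable, Optional


@dataclass(frozen=True)
class Advert:
    """The few fields of a VRRP advertisement that matter for this sketch."""
    vrid: int       # virtual router ID from the advert header
    priority: int   # higher wins the election
    src_ip: str     # primary IP of the sending interface


def demux(instances: Dict[int, str], advert: Advert) -> Optional[str]:
    """Find the local instance for an advert by VRID (a dict is a hash lookup)."""
    return instances.get(advert.vrid)


def elect_master(adverts: Iterable[Advert]) -> Advert:
    """Pick the winner among adverts for ONE VRID.

    Highest priority wins; equal priorities fall back to the higher
    source IP (an assumption taken from the VRRP RFC).
    """
    return max(adverts, key=lambda a: (a.priority, IPv4Address(a.src_ip)))
```

With two instances sharing one socket, `demux({51: "VI_1", 52: "VI_2"}, advert)` returns the name of the instance that should process the advert, or `None` if the VRID is not configured locally.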

keepalived has the same split brain problem as heartbeat. heartbeat tries to beat this by having multiple communication channels. vrrpd only has one channel.

There isn't a keepalived status, so you can't programmatically determine the state of any machine. You can look for the moveable IP with ip addr show. You can also inspect the logs (look for "BACKUP" || "MASTER"). Unfortunately, as with any failover setup, failover is not guaranteed in the case of a sick machine. If one machine is in an error state, e.g. vrrpd dies on the master machine, the logs will show the last entry as MASTER (but it will be an old entry), while another machine which takes over the master role will have a (current) entry as MASTER in the logs. Presumably you could use notify_master, notify_backup and notify_fault to touch files which you could inspect later to determine the state of the machines. This will have problems too in error states. Inspection of IP(s) will also be meaningless in an error condition. The current mechanism for handling machines in a dubious state is to programmatically power cycle them (a process called STONITH). Hopefully the good machine reboots the sick machine.

There is no way to force a master-backup transition (e.g. for testing). However, you can relink keepalived.conf to a file with a lower priority and re-HUP keepalived. You can force a master-fault transition by downing the interface.

vrrpd only works with PCI ethernet cards (all of which have an MII transceiver) but not with ISA ethernet cards (which don't). I have a machine with 3 ethernet cards, 1 PCI and 2 (identical) ISA. vrrpd works with the PCI card and one of the ISA cards (making transitions on failover), but not with the last ISA card (eth2), which vrrpd detects as being in a FAULT condition (but vrrpd doesn't execute the "to_fault" script).

Alexandre Cassen, 18 Jan 2004

In fact, ISA cards don't support MII since they don't have an MII transceiver. So media link detection will not work on your 3c509.

Joe

I don't understand why the option "state MASTER|BACKUP" exists, since whatever value I use there is overridden by the election which occurs about 3 secs after vrrp comes up. It doesn't help with the split brain problem (not much does).

In fact this is just a kind of speed bootstrap strategy... But you are right, an election follows anyway. And then you can have a node configured for BACKUP with a priority higher than a node configured for MASTER.

OK, I will set all states to MASTER and let them have an election.

Behaviour on killing vrrpd

If I kill keepalived, I would like the machine to run the scripts in "to_backup". At the moment I'm running the "to_backup" scripts in the rc.keepalived stop init file before it kills vrrpd. After vrrpd is shut down, a vrrpd on another machine will become master and I would like the machine where vrrpd has been shutdown to be in the BACKUP configuration (not have the movable IPs or point to the wrong default gw). I can't think of a reason why vrrpd should leave the machine in one state or another when it exits. Has the behaviour I see been chosen after some thought or is it just how it works right now?

Currently I have not implemented 'administration state forcing' that overrides the running vrrp FSM. I hope I will find time for this.

logs

I would like the logs to show not only the state of vrrpd on that machine, but following an election or transition, I would like to know which other machines were involved and what state they are in. At the moment the logs don't tell me whether a machine became master because it won the election or it didn't find any other machines. Is it possible to have more logging info like this in a future version of keepalived? I'd like to be able to look at a log file and see what state a machine thinks it is in and what state it thinks all the other machines are in.

In fact this is a VRRP spec. I mean, during an election, if a node receives a higher prio advert then it will transit to the backup state. Since this is a multicast design, the master node will not hear the remote 'old master' transition, since it has won the election. The VRRP spec doesn't support a kind of LSA database like OSPF provides (each node knowing the state of the others). I spoke with the IETF working group about this last year but this feature didn't receive much echo :). But this can be nice.... I like this :) a kind of admin command line requesting neighbors, ... this can be useful.

Padraig Brady padraig (at) antefacto (dot) com 22 Nov 2001

Haven't Cisco got patents on vrrpd? What's the legal situation if someone wanted to deploy this?

Michael McConnell michaelm (at) eyeball (dot) com

no - ftp://ftp.isi.edu/in-notes/rfc2338.txt

Andre

In short yes : http://www.ietf.org/ietf/IPR//VRRP-CISCO, IBM too : http://www.ietf.org/ietf/IPR/NAT-VRRP-IBM. In fact there are 2 patents (http://www.foo.be/vrrp/ link dead Feb 2005):

CISCO - http://www.delphion.com/details?pn=US06108300__
Nortel Network - http://www.delphion.com/details?pn=EP01006702A3

When you read these papers you can't find any OpenSource restriction... All that I can see is the commercial product implementation... I plan to post a message to the IETF mailing list to present the LVS work on VRRP and to enlarge the debate on OpenSource implementation and eventual licence...

9 Jan 2002 answer from Robert Barr, CISCO Systems

Cisco will not assert any patent claims against anyone for an implementation of IETF standard for VRRP unless a patent claim is asserted against Cisco, in which event Cisco reserves the right to assert patent claims defensively. I cannot answer for IBM, but I suspect their answer will be different.

Using keepalived to failover routers

VRRP is a router failover protocol and vrrpd is the demon that implements it. While keepalived uses it to fail over LVS, vrrpd can be used independently of LVS to fail over a pair of routers.

Graeme Fowler graeme (at) graemef (dot) net 11 Sep 2007

config for the ACTIVE router looks like: ...the corresponding config for the BACKUP looks like: i.e. it differs in the "weight" stanza for the VRRP definition (90 instead of 100) and there are cosmetic differences to the name. The "check_running" script is simply a wrapper round: if the result code ($?) is 0, it exits with 0. If not, it exits with 1. If it exits with 1, the weight of the VRRP announcement is pulled down by 20 - this makes sure that the critical process on this machine is up, and if it isn't then we play a smaller part in the VRRP adverts (these are derived from a pair of frontend mail servers).
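The arithmetic behind the check script can be sketched as follows. This is a minimal illustration under stated assumptions: the base values 100 and 90 and the penalty of 20 come from the text above, the function names are hypothetical, and whether the value Graeme calls "weight" corresponds to the instance priority in keepalived.conf is a guess.

```python
def effective_value(base: int, check_exit_code: int, penalty: int = 20) -> int:
    """Value advertised by a router: the base, lowered when the check fails."""
    return base if check_exit_code == 0 else base - penalty


def winner(active_exit: int, backup_exit: int,
           active_base: int = 100, backup_base: int = 90) -> str:
    """Return which router advertises the higher value and so becomes master."""
    active = effective_value(active_base, active_exit)
    backup = effective_value(backup_base, backup_exit)
    return "ACTIVE" if active >= backup else "BACKUP"
```

When the critical process dies on the ACTIVE router its value drops from 100 to 80, which is below the BACKUP router's 90, so the BACKUP router takes over. If the process is down on both, the ACTIVE router (80 against 70) keeps the master role.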

monitoring/failover messages should stay internal to LVS

The LVS server state synch demon, vrrpd and heartbeat need to send messages between the backup and active director. You can send these over the RIP network, or via a dedicated network, but you shouldn't send these packets through the NIC that faces the internet (the one that has the VIP). The reasons are:
 - It allows outside people to hack your boxes.
 - The LVS (director, realservers) must appear as a single (highly available) server. The clients must not be able to tell that the server is composed of several machines working together. The clients must not be able to see heartbeat packets.

Parsing problems with vrrpd config file (Apr 2006, from several people). The parser is a little touchy. This format is in the manpage and doesn't work. This works. This doesn't But this one does work.

Two instances of vrrpd

It is possible to have two independent instances of vrrpd handling two VIPs, which can migrate independently between directors.

Alexandre

You can have 2 VRRP VIPs active on different routers... then you have a VRRP configuration with 2 instances. On Director1, one instance with VIP1 in a MASTER state and one instance for VIP2 in a BACKUP state. => symmetry for Director2. Both instances are on the same interface, with a different router_id.

Alex alshu (at) tut (dot) by

see http://keepalived.org/pdf/LVS-HA-using-VRRPv2.pdf http://keepalived.org/pdf/UserGuide.pdf

Graeme Fowler graeme@graemef (dot) net 27 Apr 2006

The router_id needs to be the same on each director for each vrrp_instance. That value is sent out in the advertisement and is necessary for the pair of directors to synchronise. You only need different priorities on your primary and failover director. You can start up both as MASTER or BACKUP and let them decide, according to priority, what does what. Just make sure the router_id values are the same for each instance.
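The rules in the two postings above (same ID for an instance on both directors, different IDs for different instances, different priorities on the two directors) can be checked mechanically. The sketch below is hypothetical: the dictionary layout and the names `vrid` and `priority` are made up for illustration and are not keepalived's configuration syntax.

```python
from typing import Dict, List

Instance = Dict[str, int]   # e.g. {"vrid": 51, "priority": 100}


def check_pair(d1: Dict[str, Instance], d2: Dict[str, Instance]) -> List[str]:
    """Return a list of problems found in a two-director configuration."""
    problems = []
    for name in sorted(set(d1) | set(d2)):
        a, b = d1.get(name), d2.get(name)
        if a is None or b is None:
            problems.append(f"{name}: defined on only one director")
            continue
        if a["vrid"] != b["vrid"]:
            problems.append(f"{name}: ID differs between directors")
        if a["priority"] == b["priority"]:
            problems.append(f"{name}: priorities are equal")
    for label, director in (("director1", d1), ("director2", d2)):
        ids = [inst["vrid"] for inst in director.values()]
        if len(ids) != len(set(ids)):
            problems.append(f"{label}: two instances share an ID")
    return problems
```

For the symmetric setup described above, Director1 would carry VIP1 with the higher priority and VIP2 with the lower one, and Director2 the reverse; `check_pair` then returns an empty list.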

HA MySQL

Dominik Klein klein (dot) dominik (at) web (dot) de 25 Apr 2006

Goal: My goal is a HA MySQL database. As the MySQL cluster storage engine lacks several important features (e.g. foreign keys), I cannot use a MySQL cluster. So now I use MySQL replication in a master-to-master setup. As my clients are able to re-connect after a connection loss, but cannot connect to a different IP on connection loss, a VIP setup is the goal. So my clients only know the VIP(s), not the real IPs of the MySQL servers.

Setup: I have two machines. Each machine runs keepalived and MySQL. Each machine has 2 NICs: eth0 going to the switch, eth1 connecting SRV1 and SRV2. My setup looks like this: Clients connect through the switch, replication is done over the direct gigabit connection between SRV1 and SRV2.

Virtual Services: I need two VIPs, as I want write-queries to go to SRV1 and read-queries to go to SRV2 - just as in a normal replication setup, for loadbalancing purposes. Note that it is not keepalived or LVS that does the loadbalancing here, as each virtual service only has one realserver and one sorry-server! "Loadbalancing" is just the software that writes to the database connecting to one server and the software that reads from the database connecting to another server. So this is basically the "localhost" feature, plus one sorry-server per virtual service.

Failover: If one of the eth0 network connections fails, the VIP moves to the other director, but connections still get directed to the same MySQL server. So the MySQL loadbalancing still works. If MySQL fails on one machine, connections are redirected to the other server's eth1-IP (10.250.250.2[01]).
In order to be able to route that back over the director it came from, there are ip-rules on each server:

  /tmp/rt_tables
  cat /etc/iproute2/rt_tables >> /tmp/rt_tables
  cp /tmp/rt_tables /etc/iproute2/rt_tables
  rcnetwork restart
  ip rule add from 10.250.250.20 table mysqlrouting
  ip route add default via 10.250.250.21 dev eth1 table mysqlrouting

  ------------------------------
  - SVR2 ip rules and routing: -
  ------------------------------
  cat /etc/iproute2/rt_tables
  2 mysqlrouting
  ...
  ip rule show
  ...
  32765: from 10.250.250.20 lookup mysqlrouting
  ...
  ip route show table mysqlrouting
  default via 10.250.250.20 dev eth1

  Setup-steps for this:
  echo "2 mysqlrouting" > /tmp/rt_tables
  cat /etc/iproute2/rt_tables >> /tmp/rt_tables
  cp /tmp/rt_tables /etc/iproute2/rt_tables
  rcnetwork restart
  ip rule add from 10.250.250.21 table mysqlrouting
  ip route add default via 10.250.250.20 dev eth1 table mysqlrouting

Configuration files: As MySQL requires some specific configuration, I will briefly post the relevant parts, but not go into detail here, because it is actually OT for this list. Read the MySQL documentation for further detail if you do not understand the configuration parts below: http://dev.mysql.com/doc/refman/5.0/en/replication.html

On failover, there is no connection-sync, so every client has to re-connect. Connection-sync is imho not possible in this setup, as the real-servers are different on SRV1 and SRV2.

This example is on VIP1: If MySQL fails on SRV1, SRV2 will be used. When SRV1 comes back up, keepalived will immediately switch back to SRV1. This will send clients to a mysql server that may not have up-to-date data. As I could not find a way to define any delay before the real-server is added back in, I wrote a MySQL startscript I'll post (below). It blocks port 3306 on the loop interface and so lets the healthcheck fail for the local mysql, starts mysql, waits for the replication to get new data from SRV2 and then unblocks port 3306, so the healthcheck can pass again.
After that successful healthcheck, the real-server is inserted by keepalived and clients should have up-to-date data. Another thing one has to be aware of in such a setup is the fact that the client will not get anything from SRV2 when SRV1 had crashed and comes back up. So socket connections to SRV2 will still be "thought of" as "OK". In order to tell clients they are not OK, I additionally set wait_timeout 30 in my.cnf
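The ordering of the start script described above (block the port, start MySQL, wait for replication, unblock the port) can be sketched as below. Dominik's actual script is not reproduced here, so this is an illustration only: the four callables are hypothetical hooks that you would implement yourself (e.g. with a firewall rule and a replication status query), and the timeout is an added assumption, not part of the original description.

```python
import time
from typing import Callable


def start_behind_block(block_port: Callable[[], None],
                       start_mysql: Callable[[], None],
                       replication_caught_up: Callable[[], bool],
                       unblock_port: Callable[[], None],
                       poll_seconds: float = 1.0,
                       timeout_seconds: float = 300.0,
                       sleep: Callable[[float], None] = time.sleep,
                       clock: Callable[[], float] = time.monotonic) -> bool:
    """Start MySQL while the healthcheck is forced to fail.

    Returns True once the port has been unblocked, or False if replication
    did not catch up in time (the port then stays blocked, so the
    healthcheck keeps failing and clients stay on the other server).
    """
    block_port()                      # healthcheck on 3306 now fails
    start_mysql()
    deadline = clock() + timeout_seconds
    while not replication_caught_up():
        if clock() >= deadline:
            return False
        sleep(poll_seconds)
    unblock_port()                    # healthcheck can pass again
    return True
```

Passing the steps in as callables keeps the ordering logic separate from the system-specific commands, so the sequence can be exercised without a running MySQL.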

Failover of large numbers (say 1024) of VIPs

The problem: If you have a large number of VIPs, they can take a while to failover. I got an e-mail from someone with 200 VIPs whose setup takes 2-3mins to bring up the 200 VIPs with ifconfig on the director which is assuming the master role (and to take them down on the one which is assuming the secondary role). I couldn't imagine Horms putting up with this, so I did some tests to confirm the problem and e-mailed him.

Ratz

No-one should be using ifconfig anymore. You're lucky if the ip address gets set up in the first place when you're in a hurry. So ip addr add... is the key or, in linux-ha parlance, IPaddr2.

Christian Bronk chbr (at) webde (dot) de 27 Feb 2006

When they still use ifconfig they have to serialize the startups, because with ifconfig you must have different alias interfaces. If they then send two arp broadcasts with send_arp (with a delay of 1s) the complete server takeover will last 3min. To make this faster they have to rewrite their code to use ip from the iproute2 package and try to start up the IPs in parallel.

Joe

ISPs are quite restrictive about allocating blocks of IPs, so if you have a large number of IPs, they're likely all in the same block. As a test I ran a loop of 4*252 addresses. Here are the times for 1008 addresses on a 200MHz machine (a bit slower than current production directors). Putting the processes into the background doesn't help a lot. Presumably forking is as expensive as send_arp.

Time, sec to bring 1008 IPs up and down on a 200MHz machine

  job type                  time, sec
  ip addr                   12
  ip addr &                 10
  ip addr; send_arp         30
  ip addr && send_arp &     28
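The loop used for the timings above is missing from this copy of the HOWTO. As a rough guess at its shape, the following Python sketch generates the `ip addr` commands for 4 blocks of 252 addresses. The starting network 192.168.0.0, the /32 mask and the device eth0 are assumptions chosen for illustration; the function only builds the command strings and does not run them.

```python
from ipaddress import IPv4Address
from typing import List


def vip_commands(first_network: str = "192.168.0.0", blocks: int = 4,
                 hosts_per_block: int = 252, dev: str = "eth0",
                 action: str = "add") -> List[str]:
    """Build one `ip addr add|del` command per VIP (nothing is executed)."""
    base = int(IPv4Address(first_network))
    commands = []
    for block in range(blocks):                    # consecutive /24 blocks
        for host in range(1, hosts_per_block + 1):
            addr = IPv4Address(base + block * 256 + host)
            commands.append(f"ip addr {action} {addr}/32 dev {dev}")
    return commands
```

`vip_commands()` gives the 1008 commands for bringing the addresses up and `vip_commands(action="del")` the matching set for taking them down; feeding them to a shell under `time` would be one way to repeat the measurement.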

There are two solutions here:
 - William Olson's dynamic routing
 - Horms' method, where he only fails over 1 VIP with heartbeat and lets static routing handle the other VIPs by routing through the failover VIP.

William Olson ntadmin (at) reachone (dot) com 24 Feb 2006

In our previous load balancer configs (scripts, then later ldirectord + HA) we experienced the same time lags during failover situations (e.g. stopping heartbeat on the master). Our systems were 700+MHz Dell servers with at least 512MB ram. They operated in a Master/Backup pair (NAT), each with 2 nics (one for the external and one for the internal network). The haresources file was used to start and stop ospfd and run IPaddr2 for each of the at least 200 VIPs. You could literally count seconds between each VIP going up/send_arp and the next. We have consequently switched to keepalived, which has alleviated this problem.

During a failover, while tailing the messages file, you could watch each successive ip addr and send_arp (IPaddr2). Consequently, when a failover would happen, all ips would be brought down on the former master almost instantaneously and slooowly come back up on the backup, now master, director. It seemed to me that the issue was being caused by the time it took to actually execute the scripts in the haresources file, as using ip addr and send_arp directly gave time results that were very quick on these same systems.

We're running ospfd on the directors and router(s). It was an original requirement sent down by our network admin to have dynamic routing on all internal routers. These days, it just seems better to go with what has been working rather than to redesign the whole system. We could probably be just as well off without the ospfd part of the picture; however, it's working now and true to specification, so it's pretty easy for us to troubleshoot. ospfd seemed like overkill to me when I was originally designing the system, however the dictates of the net admin overrode my input.
Now we're operating with an acceptable failover time so I'm inclined to stay with ospfd. It's now working with ospfd running on the directors (always running regardless of director state) and routers, with keepalived managing the lvs and failover on the directors. Initial tests of the new keepalived systems are resulting in 15sec or less failover times, independent of the number of IP addresses.

Joe: dynamic routing can take 30-90sec to find new routes: see .

Horms

The thing that takes time is heartbeat sending out the gratuitous arp. If you combine this idea with a fwmark virtual host (bunching all VIPs into one fwmark), then everything should be quite fast to failover. But even if you don't use a fwmark virtual host, things should be quite snappy up to 1000 addresses or so. (I made that number up :) The route command uses the format for setting a default route, except you route a smaller set of addresses to an alternate place. Or in other words: Of course you can have a bunch of these statements, if the addresses don't fit nicely into one address block. Though it's better to try and avoid having one per address.

Joe

Do you mean "bunch all the VIPs into a single fwmark"? If so, that was my first thought for a solution, but you can't route from outside to a fwmark; you still need all the IPs to have arp'ed and the router to know where to send the packets. Or did you mean something else?

Horms

"something else". :-) fwmark is half the solution. It allows LVS to efficiently handle a large number of virtual services. But it's only half the solution because you still need to get the packets to the linux director. The other half of the solution is routing (Note: the halves can be used by themselves if need be). (For more info on the routing used, see )

With routing you don't need to ARP each individual VIP. You just need to make sure that each box on the local network knows that the VIPs go via the address that is being managed by heartbeat (or other means like keepalived).

Here is an example. Let's say the network that the ldirectord lives on is 10.130.0.0/24. And let's say you want 1024 virtual IP addresses, say the block 10.192.0.0/22 (10.192.0.0-10.192.3.255). All you need to do is give heartbeat an address inside 10.130.0.0/24 to manage, say 10.130.0.192, and tell the gateway to route 10.192.0.0/22 via 10.130.0.192.

Any host in the network will send packets for 10.192.0.0/22 to the gateway, which will in turn redirect them to 10.130.0.192, which is the linux-director, and all is well. Any host outside the network will also end up sending packets via the gateway, and it will duly forward them to 10.130.0.192; again the linux-director gets them and all is well.

When a failover occurs, as long as 10.130.0.192 is handled using gratuitous arp (or whatever), then the gateway will know about it, and packets for all of 10.192.0.0/22 will end up on the new linux-director.

If the local network happened to include all the VIPs, say because you had 10.0.0.0/8 on your LAN, then each host would need to know to send 10.192.0.0/22 via 10.130.0.192, which is a bit of a hassle, but still not a particularly big deal.
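The address arithmetic in Horms' example can be checked with Python's ipaddress module. This is a sketch under the assumption that the 1024 VIPs form the single block 10.192.0.0/22; the function name is hypothetical, and it only builds the text of a route command for the gateway rather than installing anything.

```python
from ipaddress import ip_address, ip_network


def route_for_block(vip_block: str, via: str, local_net: str) -> str:
    """Build the route statement that sends a whole VIP block to one address."""
    block = ip_network(vip_block)
    hop = ip_address(via)
    if hop not in ip_network(local_net):
        raise ValueError("the managed address must be on the local network")
    if hop in block:
        raise ValueError("the managed address must not be one of the VIPs")
    return f"ip route add {block} via {hop}"
```

`route_for_block("10.192.0.0/22", "10.130.0.192", "10.130.0.0/24")` gives `ip route add 10.192.0.0/22 via 10.130.0.192`, and `ip_network("10.192.0.0/22").num_addresses` confirms that one /22 covers exactly 1024 addresses. Only 10.130.0.192 has to move on failover; the route itself never changes.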

Some vrrpd setup instructions

Alexandre CASSEN alexandre (dot) cassen (at) wanadoo (dot) fr 29 Jul 2002 (Alexandre is now at Alexandre (dot) Cassen (at) free (dot) fr)

Here is a detailed setup for LVS-HA using a VRRP setup.

1. Topology description

In a "standard" design, when you are playing with a LVS/NAT setup, you need 2 IP classes. Consider the following sketch : So you have 2 classes defining both of your LVS-Box segments : 192.168.100.x for the WAN segment and 192.168.200.x for the LAN segment. For the LVS loadbalancing, we want to define a VIP 192.168.100.253 that will loadbalance traffic on both 192.168.200.2 and 192.168.200.3. For the LVS-Box HA we want to use a VRRP setup with a floating IP to handle director takeover. When playing with LVS-NAT and VRRP, you need 2 VRRP instances, one for the WAN segment and one for the LAN segment. To make the routing path consistent we need to define a VRRP synchronization group between both VRRP instances, to be sure that both VRRP instances will have the same state all the time.

2. VRRP Configuration description

This configuration will set IP 192.168.100.253 on eth0 and 192.168.200.253 on eth1

3. LVS Configuration description

In order to use HA, we use the VRRP VIP as the LVS VIP, so the LVS configuration will be :

    sorry_server 192.168.200.254 80
    real_server 192.168.200.2 80 {
        weight 1
        HTTP_GET {
            url {
                path /testurl3/test.jsp
                digest 640205b7b0fc66c1ea91c463fac6334d
            }
            connect_timeout 3
            nb_get_retry 3
            delay_before_retry 3
        }
    }
    real_server 192.168.200.3 80 {
        weight 1
        TCP_CHECK {
            connect_timeout 3
            # By default connection port is service port
        }
    }
  }

=> VRRP IP 192.168.100.253 will loadbalance traffic to both realservers.

4. Realservers Configuration description

And finally, the only thing missing in our configuration is the realservers' default gateway... This is why we define a VRRP instance for the LAN segment. So the realservers' default gateway MUST be : VRRP VIP LAN segment = 192.168.200.253

5. Keepalived sum-up Configuration

6. Keepalived sum-up Configuration on BACKUP node

7. LVS-HA scenario

Now run all this on both of your directors and simulate a crash, for example by unplugging the wire on LVS1 eth0. Detecting this trouble, VRRP will take over the eth0 instance on LVS2 and sync the eth1 instance on LVS2. So all traffic will run through LVS2. This is a typical active/passive scenario. If you want to extend this configuration to an active/active configuration, then you need to add MASTER VRRP instances on your LVS2. An active/active configuration consists of a segmentation of the realserver pool. This means that you create 2 realserver pools (in the same IP range) but with a different default gateway, which will be the new VRRP LAN VIP. => This part will be described in more depth in the documents I will write soon :)

Filter rules for vrrpd broadcasts

If you want to filter (allow) the vrrpd broadcasts, here's the recipe

Sebastien BONNET sebastien (dot) bonnet (at) experian (dot) fr

It's PROTOCOL 112 (vrrp), not PORT 112. You also need protocol igmp (don't ask why). You have to allow both incoming and outgoing adverts : To be more precise, a tcpdump shows the multicast address is 224.0.0.18 if you want to be more restrictive. Don't forget to allow the traffic needed by keepalived to test your real servers. In my case, it looks like this :

Noc Phibee

For the VRRP protocol, how should I configure shorewall? When my group changes state I want to restart Shorewall. I have used the notify_*: When my MASTER is dead, the BACKUP changes state (good), but when the MASTER is alive again and gets the VIP, it runs the same script (restart of shorewall). Does anyone have an idea why it doesn't immediately change state?

Graeme Fowler graeme (at) graemef (dot) net 23 Aug 2006

You must allow packets from/to network 224.0.0.0/8. If you want to control this a bit more accurately, define mcast_src_ip in your keepalived.conf for each defined vrrp_instance, and set your filters accordingly.

Firstly it looks like the Master is receiving the announcements from the Backup. This is good. The Backup is also receiving packets from the Master, which is also good - this is why the Backup flip-flops from BACKUP to MASTER to BACKUP state continuously. However - something else is happening here, and I expect it's your Shorewall config. Ignoring the Master machine for a moment, let me put forward a possible reason:

The Backup machine starts up, brings up keepalived, and goes into BACKUP state. Shorewall is dropping packets at this point, so the Backup machine goes to MASTER state, does things to Shorewall with the notify script, and starts to accept packets. It then receives an advertisement from the Master director, so it switches to BACKUP state, changes the Shorewall config back, misses an advertisement, switches to MASTER, changes the firewall, misses an advertisement, etc etc.

Assuming this is correct, there are several things you need to do:
 - Make sure the Shorewall config isn't dropping the packets you want (see the suggestions above).
 - Put your notify* script actions into your vrrp_sync_group block instead of the vrrp_instance. That way it'll only fire once, when the group changes state, rather than one being fired off for every instance state change *and* the group.

Graeme Fowler graeme (at) graemef (dot) net 09 Oct 2006

You need to explicitly accept multicast for this to work. You can make it more accurate by setting the appropriate config option in your keepalived config (mcast_src_ip) to set the multicast source address, and then having a corresponding rule to let that in.
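The filter rules quoted in this section did not survive in this copy, so the following Python sketch only illustrates what such rules might look like; it is not Sebastien's or Graeme's configuration. It builds plain iptables commands for protocol 112 and the multicast group 224.0.0.18 mentioned above, plus IGMP (IP protocol 2). The interface name and the choice of the INPUT and OUTPUT chains are assumptions, and Shorewall users would express the same thing in Shorewall's own rule files.

```python
from ipaddress import ip_address
from typing import List

VRRP_PROTO = 112            # IP protocol number, not a port
IGMP_PROTO = 2
VRRP_GROUP = "224.0.0.18"   # multicast group seen in the tcpdump


def vrrp_rules(iface: str) -> List[str]:
    """Build iptables commands that accept VRRP adverts in both directions."""
    return [
        f"iptables -A INPUT -i {iface} -p {VRRP_PROTO} -d {VRRP_GROUP} -j ACCEPT",
        f"iptables -A OUTPUT -o {iface} -p {VRRP_PROTO} -d {VRRP_GROUP} -j ACCEPT",
        f"iptables -A INPUT -i {iface} -p {IGMP_PROTO} -j ACCEPT",
        f"iptables -A OUTPUT -o {iface} -p {IGMP_PROTO} -j ACCEPT",
    ]


def is_vrrp_advert(protocol: int, dst: str) -> bool:
    """True for a packet that the VRRP rules above are meant to match."""
    return protocol == VRRP_PROTO and ip_address(dst) == ip_address(VRRP_GROUP)
```

If mcast_src_ip is set in keepalived.conf, a source match on that address could be added to the incoming rule to tighten it further.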

Vinnie's comparison between ldirectord/heartbeat and keepalived/vrrpd

Vinnie listacct1 (at) lvwnet (dot) com 26 Apr 2003

I set up both ldirectord and keepalived to try them out and see which I liked better. The goal was to have a redundant pair of LVS(-NAT) directors which would also serve as the primary firewall/gateway for our external connection and DMZ hosts. My firewall scripts use heavy stateful inspection, iproute2 ip utils to add/remove ip's and routes, and proxy arp. I did not want to lose any of the features of the firewall setup if at all possible. These are my observations, but as others have said, both are being heavily developed by their authors, and anything I say here could be obsolete in a short time. I think they are both great packages.

LDIRECTORD/HEARTBEAT

Using ldirectord, you need heartbeat to handle the failover/redundancy capabilities. The modular approach is a good idea, and they have developed resource monitoring which goes well beyond just checking if a realserver responds to a connection/request on a certain port/service. I set up ldirectord first, on a single director, since this would get the high availability of my realservers working. I used my firewall script to add the VIP's to the director (with iproute2 ip commands), and added a section to the script to create the virtual services with ipvsadm commands. (I knew this wouldn't be necessary when/if I got heartbeat set up). Ldirectord works quite well, and it apparently has the ability to do a basic UDP check (since stopping named on one of the realservers causes ldirectord to remove that RS from the 53/udp virtual service until named is started back up).

I was disappointed to see that heartbeat (particularly the ipfail part) is still written to use ifconfig and old-way ip "aliases" (ie, eth0:0, eth0:1, etc.) to have multiple IP's on an interface. 2.2.x and ipchains is a bit "passe" - 2.4.x and iptables has been stable/production for quite some time now.
iptables does not like interface names with ":" in them, so you are imposing pretty stiff limits on what kinds of firewall rule sets you can write if you use old-style ip aliases. Reading the mailing list archives, it looks like some users have started submitting patches that will cause heartbeat to use the iproute utils to set up the interfaces instead, but this had not been incorporated into the latest-available beta (at that time, 1st half of Apr. '03), and I was not sure if the modifications were comprehensive in scope, or just addressing certain aspects. (My programming skills are pretty weak). This pretty much nixed any further looking into of heartbeat for me, so I started looking at keepalived.

(Joe, Dec 2003: Linux-HA has been rewritten to use the Policy Routing tools.)

KEEPALIVED

Keepalived is an all-in-one package, which is written in C. It uses 2.4-native netlink functions to set up interfaces, IP addresses, and routes on 2.4 boxes, so it is no big deal to have multiple IP's on one interface. You can use iptables commands which match a single interface to cover all the IP's on the interface, or you can add a -d one.of.my.vips to make that rule match a single VIP, subnet, etc. Keepalived uses VRRPv2 to handle director failover, it's really nice. When failover happens, the new master sends gratuitous arps out on the network, so virtual services experience basically no interruption (especially since keepalived also supports IPVS connection synchronization between directors). There is currently an issue with HEAVY syslog activity when a pair of directors are running (it logs the election process on both directors) but Alexandre is working on that.
If you're running proxy-arp on the director, you can use keepalived's ability to run scripts when a machine becomes master to send arping unsolicited arps for the other hosts in your DMZ and your ISP's gateway, so that the other hosts in your DMZ with routable IP's on their interfaces (which only need the director as a router/firewall) are also updated with the new master's MAC addresses. Keepalived doesn't send gratuitous arp's out for IP's it didn't take over, so this is needed for your DMZ hosts to see the ISP's gateway, and also for the ISP's gateway to be apprised of the new (remember, proxy arp!) MAC address of those other DMZ hosts. (I am currently working on a HOW-TO for this, which will apparently be added to the keepalived online documentation, and also our website). Keepalived currently does not support UDP connection check, but it is on Alexandre's to-do list. Also another feature (I haven't looked at yet) of keepalived is the virtual_server_group capability, which allows you to group virtual services together and have their health check pass/fail determined by a single connection check - good for example if you have a stack of IP-based apache virtual hosts on a realserver. You probably don't really need to check each virtual host's IP, and you don't want to flood the realserver with health checks. I think they're both really great packages. If heartbeat were updated to use iproute2 utils, instead of ifconfig and interface aliases to have multiple IP's per interface, it would be much more viable for people running strong iptables firewall rulesets, such as those who wish to use the director as a firewall/gateway. Me personally, I'm going to keep running keepalived. It also has a lower CPU overhead.

Saru: All directors active at the same time

saru is a proof-of-principle piece of code by Horms. It was written for one kernel and has not been maintained through the subsequent kernels. Horms has written code allowing all directors to be active at the same time, and an improved syncd. The code is at http://www.ultramonkey.org/papers/active_active/.

Horms 18 Feb 2004

Saru means monkey in Japanese. It's the work I did in Japan on Ultra Monkey... well some of it anyway. The kanji is in the original paper I wrote for Saru (http://www.ultramonkey.org/papers/active_active/active_active.shtml). Here are the google links for looking up the meaning of "saru" in Japanese

Horms horms (at) verge (dot) net (dot) au 16 Feb 2004

Saru has nothing to do with connection synchronisation, which is what syncd does. Saru provides a mechanism to allow you to have Active-Active Linux Directors. Syncd (and other synchronisation daemons) synchronise connections, allowing them to continue even if the Linux Director that is handling them fails (assuming that there is another Linux Director available for them to fail-over to). If you are using Saru then you probably want to use connection synchronisation, but the reverse is not necessarily true. This paper (http://www.ultramonkey.org/papers/lvs_jan_2004/) should cover how Connection Synchronisation and Active-Active work together. It briefly covers the relevant parts of Alexandre's syncd patches and how they interact with Saru (as I understand the extensions anyway).

Here's a short explanation of Saru's working.

Francisco Gimeno kikov (at) kikov (dot) org 11 Nov 2006

The directors:
- All director nodes know about each other.
- Each director has an ID (for example, the MAC or the IP).
- Each director node can build a sorted list based on that ID.
- Heartbeat runs everywhere, so the list is dynamic.
- The "view" should be the same on each node (i.e. all nodes should have the same list).
- They should have a virtual-MAC.

Those requirements could be satisfied with a broadcast sync protocol (it could be similar to WCCP, for example).

For each arriving packet:
- Make a HASH of the parameters you want to keep the __affinity__ to (like src IP, dst IP, ports, ...).
- Calculate (HASH % number_of_nodes) (% := modulo).
- If that value is the position in the list of the node processing the packet, the packet is accepted; if not, it is discarded.

Every packet goes to every director, so one of the most important things here is that no director may put the virtual-MAC on the wire, as every director has to receive the packet. ARP replies for the VIP should give the virtual-MAC, but should be sent from a bogus MAC. With that, whatever is responsible for routing packets to the VIP will send the packets to the virtual-MAC. As the (L2) switch doesn't know the physical port associated with the virtual-MAC, it sends the packet to all the active ports that don't have a MAC associated with them, which are the directors'. If you use a hub, you won't have this kind of problem (who owns a hub nowadays?).

Horms 13 Nov 2006

It is only active-active for the linux-directors, and it's not really supposed to be active-active for a given connection, just for a given virtual service. So different connections for the same virtual service may be handled by different linux-directors. The real trick is that it isn't a trick at all. LVS doesn't terminate connections, it just forwards packets like a router. So it needs to know very little about the state of TCP (or other) connections. In fact, all it really needs to know is already handled by the ipvs_sync code, and that is mainly just a matter of associating connections with real servers. Or in other words, the tuple enduser:port,virtual:port,real:port.

Ratz

I've read it now and I must say that you've pulled a nice trick :). I can envision that this technique works very well in the range of 1-2 Gbit/s for up to 4 or so directors.
For higher throughput netfilter and the time delta between saru updating the quorum and the effective rule being in place synchronised on all nodes might exceed the packet arrival interval. We/I could do a calculation if you're interested, based on packet size and arrival on n-Gbit/s switched network. You're setting rules for ESTABLISHED in your code to accept packets by lookup of the netfilter connection tracking and while the kernel 2.4 does not care much about window size and other TCP related settings, 2.6 will simply drop the in-flight TCP connection that is suddenly sent to a new host. There are two solutions to overcome this problem for 2.6 kernel. One is fiddling ip_conntrack_tcp_be_liberal, ip_conntrack_tcp_loose and sometimes ip_conntrack_tcp_max_retrans and the other is checking out Pablo's work on netfilter connection tracking synchronisation.
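The hash-and-modulo selection that Francisco describes above can be sketched in a few lines. This is only an illustration of the idea, not Saru's code: the choice of SHA-1, the key format, the function names and the example addresses are all assumptions made for the sketch.

```python
import hashlib

def owner_index(src_ip, src_port, dst_ip, dst_port, n_directors):
    """Position in the sorted director list that should accept this packet."""
    # Hash only the fields whose affinity should be preserved (assumed: the 4-tuple).
    key = f"{src_ip}:{src_port}>{dst_ip}:{dst_port}".encode()
    digest = hashlib.sha1(key).digest()
    return int.from_bytes(digest[:4], "big") % n_directors

def accepts(my_id, director_ids, src_ip, src_port, dst_ip, dst_port):
    """True if the director with ID my_id should handle the packet."""
    ordered = sorted(director_ids)  # every node must build the same list
    idx = owner_index(src_ip, src_port, dst_ip, dst_port, len(ordered))
    return ordered[idx] == my_id

# Hypothetical director IDs (here: IP addresses used as strings) and one packet.
directors = ["192.168.1.12", "192.168.1.10", "192.168.1.11"]
packet = ("10.1.2.3", 40000, "192.168.1.110", 80)

# Every director sees the packet; exactly one of them accepts it.
takers = [d for d in directors if accepts(d, directors, *packet)]
print(takers)
```

Because the decision depends only on the packet and on the shared sorted list, all packets of one connection land on the same director for as long as the list doesn't change. When a director joins or leaves, the modulus changes and connections can move to a different director, which is why connection synchronisation is wanted alongside.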

Active/Active by multipath: random musings Siim Poder windo (at) p6drad-teel (dot) net 09 May 2008

Is anyone running active-active(-active...) LVS setups? Is it saru or something else? It seems it should be possible to do it without saru by having a multipath-route-capable router upstream dividing the traffic between all the directors. If you can manage the routes with OSPF, it should be possible to have active-active directors using more common protocols and software, with the added benefit of each director only receiving the packets it ends up handling.

Joe few of us have access to routers to try such experiments. So it's possible this would work, but few people would be able to implement it. Graeme I'm not (yet) but I may be doing so sometime this year. Current thinking is (per your email) to run Quagga, Zebra, or ospfd (or something else) on the directors themselves which will announce the VIPs into the local network. The local network devices will then work out the best path for traffic flowing through them; OSPF is designed around a cost and hop-count model so having multiple routes originated in different places should mean "closest network wins" from a client perspective - although this is untested! The interesting part is how you make sure the traffic returns to the clients. In the case of -DR this isn't really a problem, but using -NAT could be difficult if the realservers can receive, and return, traffic to either director. By extension, you could split a cluster into two halves and have each in a different physical and logical location using this model, but that adds complexity at the backend if you're sharing file data and/or databases. What that would give you is, for example, proper geographical resilience such that if one location loses power, hey? Who cares? We'll just talk to the other one instead! And as an added bonus, if you have multiple equal-cost paths to directors in the same location, OSPF can be made to (crudely, in some cases) load-balance between them. This gives a bit more even-handed loading but does add an extra thing to go wrong.

Server Load Balancing Registration Protocol

William V. Wollman wwollman (at) mitre (dot) org

We have worked with LVS to implement something we call the Server Load Balancing Registration Protocol. It allows a server to plug into a network with the LVS and register its services with the LVS. The LVS is then automatically configured to balance the registered services. We did not modify LVS directly but created a Java program that processes the realserver registration messages and then automatically configures the LVS. The realserver requires the installation of a Java program to build the registration messages and register its services. One benefit is plug-and-play SLB with minimal administration. Another benefit is controlled configuration management. A paper is available for download that describes the work we completed in a bit more detail. If interested, please download and read the article at http://www.mitre.org/work/tech_papers/tech_papers_03/wollman_balancing/index.html.
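To make the idea concrete, here is a rough sketch of what "process a registration message, then configure the LVS" could look like. It is not the MITRE implementation (which is in Java and whose message format isn't described here): the dictionary layout, the function name and the addresses are all invented for illustration. Only the ipvsadm options used (-A, -a, -t, -u, -s, -r, -m) are real.

```python
def ipvsadm_commands(vip, registration, known_services, scheduler="rr"):
    """Turn one (hypothetical) registration message into ipvsadm command lines.

    known_services is a set of virtual services already created; it is
    updated in place so a service is only added with -A once.
    """
    cmds = []
    for svc in registration["services"]:
        flag = "-t" if svc["proto"] == "tcp" else "-u"
        virtual = f"{vip}:{svc['port']}"
        if (flag, virtual) not in known_services:
            # First realserver offering this service: create the virtual service.
            cmds.append(f"ipvsadm -A {flag} {virtual} -s {scheduler}")
            known_services.add((flag, virtual))
        real = f"{registration['rip']}:{svc['port']}"
        # -m selects LVS-NAT (masquerading) forwarding; an assumption for this sketch.
        cmds.append(f"ipvsadm -a {flag} {virtual} -r {real} -m")
    return cmds

known = set()
first = ipvsadm_commands(
    "192.168.1.110",
    {"rip": "10.1.1.2", "services": [{"proto": "tcp", "port": 80}]},
    known,
)
second = ipvsadm_commands(
    "192.168.1.110",
    {"rip": "10.1.1.3", "services": [{"proto": "tcp", "port": 80}]},
    known,
)
print("\n".join(first + second))
```

The sketch only builds the command lines; a real registration daemon would also have to authenticate the realserver and remove it again when it stops registering.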

using iproute2 to keep demons running during failover, while link is down On failover, when the backup machine assumes the active role, it must bring up IP(s) and possibly start demons listening on those IP(s). With the tools e.g. iproute2, the backup machine can have the IP configured with the demons listening, but with the link state down. On failover you just change the linkstate to "up". Roberto Nibali ratz (at) drugphishi (dot) ch 24 Dec 2003 This will keep all assigned IP addresses for the interface. This will remove (flush) all IP addresses from the interface. However Which is completely broken! With ifconfig you have no means to distinguish between flushing IP addresses and setting a link state of a physical interface. There's a huge difference routing wise. In the case of setting the physical link layer to down, you do _not_ disable routing table entries. In the case of flushing an IP address you _also_ remove its routing table entry which can be annoying from a setup point of view and definitely irritating from a security viewpoint. The reason why it is important to have two states of interface setup can for example be found in the security business. You set the link state to down, set up all packet filter rules and then configure all IP addresses and rules and routes. Then you start local daemons (and they will start even if they need to bind and listen to non-local IP addresses because the IP addresses and the routing is complete) _and_ after that you open your gates by setting the link state to up.
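The ordering Ratz describes (link down, addresses configured, daemons started, then link up on failover) can be sketched as two lists of iproute2 commands. The interface name and VIP are placeholders, and the helper names are invented; the sketch only prints the commands rather than running them, since they need root.

```python
def standby_commands(dev, addresses):
    """Commands for the backup machine: addresses configured, link held down."""
    cmds = [f"ip link set dev {dev} down"]
    # The addresses exist on the interface, so daemons can bind to them
    # even though no traffic can arrive yet.
    cmds += [f"ip addr add {addr} dev {dev}" for addr in addresses]
    return cmds

def failover_commands(dev):
    """Commands to run when the backup machine takes over."""
    return [f"ip link set dev {dev} up"]

# Hypothetical interface and VIP.
standby = standby_commands("eth0", ["192.168.1.110/24"])
takeover = failover_commands("eth0")
for line in standby + ["# ... start daemons listening on the VIP ..."] + takeover:
    print(line)
```

The point of the split is that the takeover step is a single command that doesn't touch addresses or daemons.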

LVS: Dynamic Routing, multiple gateways, realservers in multiple LVSs, dead gateway detection Normally multiple routes are handled by routers. However you may not have access to the router tables for administrative reasons or because someone wants to protect their turf (they don't want someone not in their department poisoning their router tables). Here we describe setting up multiple routes and how they can be used in an LVS.

Setting up multiple gateways: Realservers shared between two LVSs: ip route append

If realservers are supplying services through two directors, then the realservers need two default routes (one through each director). This is allowed by the TCP/IP RFCs but rarely implemented. You cannot add a 2nd default route with ip route add; you'll get an error saying that the route already exists. Instead you use the command ip route append. This was worked out by Posko (Malalon) posko P (dot) Osko (at) elka (dot) pw (dot) edu (dot) pl 17 Apr 2002

I used ip append at home when I was testing source address routing for RealServers. But when I started working with my setup I found that I can't set up two default routes for different addresses in one routing table (in Linux it's by default table 'main' where all normal routes are stored) because only one default route works at the same time (the first added to route table). So I decided to create separate route tables (named 201 and 202 in my setup) containing default route for each alias using the following command: and route packets with source address from 192.168.1.2 according to this table (201)

Here are the details from Pawel Osko, Warsaw University of Technology, Faculty of Electronics and Information Technology.

You can create two (or more) LVS-NAT directors using the . The simplest setup is one RS working with two DIRs:

The first step is to create a working setup with one DIR and one RS. In my setup I'm using one NIC, two networks LVS-NAT. Example (my) setup:

(You can use the Configure Script to set it up.) Now test it. If everything is OK, set up the second DIR, and change the settings on the RS:

and test it. Now you know for sure that your DIRs are set up properly and your RS can work with both of them.

Step 2. Keep the directors working. Delete the addresses on the network interface on the RS (using the `ip addr del` command for example). Add two addresses to the NIC (I'm using eth0):

Check if everything is OK:

Each address will work with a different DIR. Now you must make packets from eth0:1 go to DIR1 and from eth0:2 to DIR2. Source routing will be used to do this. Create rules for each IP:

where 201 and 202 are names of tables. Add default routes for each IP:

You are done! Now all packets from 192.168.1.2 will go through DIR1 and packets from 192.168.2.2 through DIR2. New RSs can be added now; simply follow the instructions in Step 2 for the new IPs. You can also have more DIRs; just add more IPs on the RS. I set up LVS-NAT with four DIRs working with four RSs, using mon to detect RS failures, and everything works perfectly (at last!).
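As a rough sketch of the source-routing step, the rule and route for each realserver address can be generated as below. The realserver addresses (192.168.1.2, 192.168.2.2) and table names (201, 202) come from the text; the director addresses 192.168.1.1 and 192.168.2.1 are assumptions made for this example, as is the helper name.

```python
def source_routing_commands(entries):
    """entries: list of (source_ip, director_ip, table) tuples."""
    cmds = []
    for src, gateway, table in entries:
        # Packets with this source address consult their own routing table...
        cmds.append(f"ip rule add from {src} table {table}")
        # ...whose only entry is a default route through the matching director.
        cmds.append(f"ip route add default via {gateway} table {table}")
    return cmds

entries = [
    ("192.168.1.2", "192.168.1.1", 201),  # RS address 1 -> DIR1 (DIR1 IP assumed)
    ("192.168.2.2", "192.168.2.1", 202),  # RS address 2 -> DIR2 (DIR2 IP assumed)
]
commands = source_routing_commands(entries)
print("\n".join(commands))
```

Keeping each default route in its own table avoids the problem described earlier, where only the first default route in table main is ever used.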

Connecting from clients through multiple parallel links: the dead gateway problem This is not an LVS problem, just a normal routing problem. You can have multiple default gateways in Linux. The problem is knowing when one of them has died. The "Connected" site has a discussion of dead gateway detection (http://www.freesoft.org/CIE/RFC/1122/56.htm, site gone 14 Sep 2004) derived from the RFCs. The points raised are active probes (e.g. pings) are expensive, scale poorly and MUST NOT be used continuously to check the status of a first-hop gateway. When pings must be used, they MUST only be used when traffic is being sent to the gateway. other layers (above and below the IP layer) SHOULD be able to give advice to the routing layer, when positive (gateway OK) or negative (gateway dead) information is available. Dead gateway detection is covered in some detail in RFC-816 [IP:11]. Experience to date has not produced a complete algorithm which is totally satisfactory, though it has identified several forbidden paths and promising techniques. In case you're wondering, what they're really saying is that dead gateway detection was not built into the protocol and no satisfactory solution for its absence has been found. Ratz 22 Jan 2006 According to RFC816 and RFC1122 there are multiple ways to perform DGD, however I've only seen about 3 of those in the wild: Link-layer information that reliably detects and reports host failures (e.g., ARPANET Destination Dead messages) should be used as negative advice. An ICMP Redirect message from a particular gateway should be used as positive advice about that gateway. Packets arriving from a particular link-layer address are evidence that the system at this address is alive. However, turning this information into advice about gateways requires mapping the link-layer address into an IP address, and then checking that IP address against the gateways pointed to by the route cache. This is probably prohibitively inefficient. 
The Alteon switch does media detection and could also listen to special L2 PDU packets, including advertisements. Media detection under Linux is an often discussed and to date not resolved issue. For about 2 months starting last November, a couple of people on netdev have been working on proper link state propagation in the core kernel, the result will be seen in 2.6.17 ;). Other than that I suggest you use non-cheap but excellently supported NICs, like e1000 and check the media state using ethtool or write a netlink listener. You are allowed to ping, but only if nothing else works for you (3.3.1.4): Active probes such as "pinging" (i.e., using an ICMP Echo Request/Reply exchange) are expensive and scale poorly. In particular, hosts MUST NOT actively check the status of a first-hop gateway by simply pinging the gateway continuously. Even when it is the only effective way to verify a gateway's status, pinging MUST be used only when traffic is being sent to the gateway and when there is no other positive indication to suggest that the gateway is functioning. To avoid pinging, the layers above and/or below the Internet layer SHOULD be able to give "advice" on the status of route cache entries when either positive (gateway OK) or negative (gateway dead) information is available. Multiple routes to the internet is discussed in Routing for multiple uplinks/providers (http://lartc.org/howto/lartc.rpdb.multiple-links.html) and Multiple Connections to the Internet (http://linux-ip.net/html/adv-multi-internet.html). Julian (immediately below) has a dead gateway detection mechanism and a working setup with dead gateway detection is shown at Nano-Howto to use more than one independent Internet connection. by Christoph Simon (http://www.ssi.bg/~ja/nano.txt). The author warns

The setup of all this is not a question of 5 minutes
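The RFC 1122 rule quoted above (ping only when traffic is being sent to the gateway and there is no other positive indication that it is alive) amounts to a small decision function. This is a sketch of the rule only, not of any existing dead gateway detection code; the 30 second advice lifetime and the names are assumptions.

```python
def should_ping_gateway(traffic_being_sent, seconds_since_positive_advice,
                        advice_lifetime=30.0):
    """Decide whether an active probe of the first-hop gateway is permitted."""
    if not traffic_being_sent:
        # Never probe an idle gateway: continuous pinging is forbidden.
        return False
    # Positive advice (e.g. from TCP making progress, or an ICMP redirect)
    # that is still fresh means no probe is needed.
    return seconds_since_positive_advice > advice_lifetime

print(should_ping_gateway(False, 600.0))  # idle gateway
print(should_ping_gateway(True, 2.0))     # recent positive advice
print(should_ping_gateway(True, 120.0))   # sending, but no recent advice
```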

Logu lvslog (at) yahoo (dot) com 5 Oct

I have two ISDN internet connections from two different ISPs. I am going to put an lvs_nat between the users and these two links so as to load-balance the bandwidth.

Julian

You can use Linux's multipath feature: You can add my dead gateway detection extension (for now only against 2.2). This way you will be able to fully utilize both lines for masquerading. Without this patch you will not be able to select different public IPs for each ISP. They are named "Alternative routes". Of course, in any case the management is not an easy task. It needs understanding.

anon

I currently have multiple adsl modems that connects to the internet.

Alexandre Cassen alexandre (dot) cassen (at) wanadoo (dot) fr 11 Apr 2003

This is a routing design problem, commonly solved by load-balancing the default route at the routing level (netlink). You add 2 default gateways with the same weight to provide outbound load balancing. Since current Linux kernel routing suffers from a lack of dead gateway detection, you will need to apply Julian's "dead gateway detection" patch.
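A multipath default route with two equally weighted gateways, as Alexandre suggests, is a single ip route command with one nexthop per link. The sketch below just assembles that command; the gateway addresses (taken from the documentation ranges) and interface names are placeholders, and the helper name is invented.

```python
def multipath_default(gateways):
    """gateways: list of (gateway_ip, device, weight) tuples."""
    hops = " ".join(
        f"nexthop via {gw} dev {dev} weight {weight}"
        for gw, dev, weight in gateways
    )
    return f"ip route add default scope global {hops}"

# Two uplinks with the same weight share the outbound traffic.
cmd = multipath_default([
    ("192.0.2.1", "eth1", 1),     # ISP 1 gateway (placeholder)
    ("198.51.100.1", "eth2", 1),  # ISP 2 gateway (placeholder)
])
print(cmd)
```

This only balances the routes. It does nothing about a gateway that has died, which is the part the dead gateway detection patch addresses.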

Dynamic Routing to handle loss of routing in directors

Here I show how to use dynamic routing to handle routing following failure of the link from a director to its default gw. The director with the failed default route gets its new routing information from the adjacent director, which is assumed to have a functional route to the outside world.

After I got this to work, I found out that you don't do dynamic routing if the interfaces on two machines are in the same networks as shown here, as happens with duplicate directors (or routers). In the case of common networks, alternate routes are (usually) handled by multiple static routes with different weights e.g. Routing for multiple uplinks/providers in the Linux Advanced Routing and Traffic Control HOWTO. This section then is not exactly central to LVS failover and unless you have some other reason to read about dynamic routing, you may want to skip this section. This was my first attempt at dynamic routing. Even if you use dynamic routing, I won't be surprised if there are better ways of doing it. Suggestions welcome.

You use dynamic routing only if the hosts are connected to non-common networks, as here, where host_1 is not connected to network_C, while host_2 (which is connected to and can communicate with host_1) is. Dynamic routing would be used by host_2 to send information about network_C to host_1 (etc).

I had previously been handling routing failure with scripts. Script driven failover (where, as well, you have to reconfigure demons to listen to the moveable IP and the router has to think that it has a new name) requires the scripts to run in pairs (to_up on one machine, and to_down on the other). The scripts have to be synchronized and have to run to completion on both machines. If one machine becomes deranged and loses track of its state, then the scripts won't fail over cleanly.
You should be able to down/crash/wedge any single NIC/route/disk/demon in a failover router pair without losing routing, no matter what. I found that my scripts would often result in some hung state. Perhaps better scripts would have handled it, but this would indicate that functional scripts are difficult to write.

I was looking for other ways of handling routing failure, when John Reuning posted on the mailing list that he was using zebra. I had not even managed to figure out how to set up the .conf file last time I tried (several years ago), as I found the docs inscrutable (some sections were blank). Here's the posting from John Reuning, which showed me how simple it was to configure zebra and which started me off with dynamic routing.

John Reuning john (at) metalab (dot) unc (dot) edu 17 Feb 2004 I've included the .conf files below. I didn't do anything crazy coming up with this stuff. There were sample config files in the source code, and I just copied what I needed. To make snmp work, these need to go in the snmpd config: The one quirk I remember is that one of the daemons needs to start before the other. If zebra isn't running when bgpd starts up, it freaks out. bgpd.conf zebra.conf

I thought it would be better to handle the failover with hardened and well tested demon(s) running on each machine, that maintain communication, and know what to do when one machine is in an arbitrary fault state. These demons then would run the minimum depth of the more fragile, dependent scripts.

Zebra is a GPL package containing the common dynamic routing demons (ripd, bgpd, ospfd). Zebra runs on many platforms and uses a command syntax close to that of cisco IOS (i.e. you can use the cisco documentation if you're stuck).

Useful documentation I found:

- A review on Zebra by Mike Metzger (link dead Mar 2004, http://www.unixreview.com/documents/s=1350/urm001a/). An introduction to setting up Zebra. This didn't give me enough information to get going, but did tell me that someone understood it and gave me hope that I would too.
- Build a network router on Linux by Dominique Cimafranca and Rex Young (http://www-106.ibm.com/developerworks/linux/library/l-emu/) - a slightly more advanced introduction to Zebra. This, together with the config files supplied by John Reuning (below), contained enough information for me to get Zebra to do something.
- The Zebra documentation (http://www.zebra.org/) (this seems to be complete now - a few years back, whole sections were blank).
- Cisco documentation (http://www.cisco.com/pcgi-bin/Support/browse/index.pl?i=Technologies&f=770). After bootstrapping from the Cimafranca and Young article, I could use the articles here e.g. Routing Information Protocol (RIP) (link dead Mar 2004, http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/rip.html), Using the Border Gateway Protocol for Interdomain Routing (link dead Mar 2004, http://www.cisco.com/univercd/cc/td/doc/cisintwk/ics/icsbgp4.html).

I was helped by Tom Brosnan and Steve Buchanan, networking people at my job. After I got this working, I found Dynamic routing - OSPF and BGP (http://www.lartc.org/howto/lartc.dynamic-routing.html) in the Linux Advanced Routing and Traffic Control HOWTO.
As with most computer documentation, you already have to understand the topic in order to be able to read it. Much documentation about dynamic routing concerns the differences between RIP, BGP, OSPF, and goes into details about convergence, horizons... You don't need any of that right now. All you need to know is that these 3 protocols move routing information from one machine to another and that the syntax of the commands for them is much the same. For moving routing information within an AS, you use rip (the original protocol) or ospf (the newer protocol). For moving routing information between different ASs, you need bgp (I think). To the LVS client, as far as routing is concerned, an LVS appears to be a single leaf node. For an LVS with one director, all routing is to the director and the LVS really is a single leaf node. When multiple directors are involved, and the VIP hops between directors on failover, the inbound routing can be handled at the arp level (the director uses send-arp to update the location of the VIP). For outbound routing (i.e. packets from the VIP on the director to 0/0), dynamic routing protocols can be used. One place that dynamic routing could be used in an LVS, is following failure of the link to 0/0, a director does a failover and no longer having a route to 0/0, has to route packets through the other director (see diagram below). I wanted the setup to run a router failover pair. If you are using this to maintain outbound routing for an LVS, you will only need this for LVS-NAT. For LVS-DR and LVS-Tun, for security, there should be no route from the VIP on the director to 0/0 - see . Normally with dynamic routing, the routers (here, the two directors) are in contact with upstream routers (running a dynamic routing protocol), who feed routing information to them. The link state of the network (up||down) can be inferred from the presence (or absence) of the routing advertisements. 
With routing advertisements exchanged at 30-60 sec intervals, it will take ripd about 3 mins to timeout a dead link. bgp is a little faster and takes about a minute to timeout a dead link. In the general case, you may not be able to get dynamic routing information from upstream. Some organisations are big and inflexible, there maybe turf battles, and the IT department will worry about getting bogus advertisements from you. Where I live, network link failure (or routing failure, which may appear as a link failure) is the most common problem when maintaining service. Other problems, e.g. power failures occur more often, but these can be handled by UPSs; disk failures, which you have to plan for, are handled by and pre-emptive disk replacement of working disks as they approach their warrantee expiration. Assuming that the two routers (directors) are both functional, then failover after a routing/link failure has to handle two problems detection of link/routing failure A setup is needed that works without link (or routing) information from upstream machines. In the absence of packets from an upstream machine, link (or routing) failure detection is difficult. I will assume that this is being handled by the failover demon (keepalived/vrrpd or Linux-HA). reconfiguring the default gw The director with a failed route to the outside, has to route via the other director. Here's some info about the differences between routing via tables (i.e. how you set up a leaf node) and routing with dynamic routing protocols (i.e. on a router) leaf nodes: automatically route to networks on interfaces. All other packets are sent to a default gw. The machine's view of the network is fixed and it knows that it is at the edge of the internet. routers: advertise networks and IPs. Other routers pick up the messages and figure out the routing. 
All routers see themselves in the middle of the internet, with no idea where they are in it, or how big the internet is (the Ptolemeic view of the network). In particular, RIP and OSPF routers don't know about edges of ASs or the existence of other ASs. You don't explicitely set routes, rather you list the IPs of the neighbors and then let the routing demons figure out the topology. Except for border routers, the other routers (running RIP or OSPF) don't know about an AS. What you need to have your own AS, depends on your clout and size in the networking world. If you're a big governmental agency with offices throughout the country, have thousands of networked computers, route all your intra-agency's packets through clouds (leased lines, where packets don't go onto the internet) and route all your packets to the internet over multiple redundant links, via local ISPs at each site, then you'll have your own AS. Big ISPs will have an AS. If you're a small organisation, you'll have static links to your provider and you won't have an AS. Small dial-up companies with just a few machines handling the traffic have static links and don't even get routing advertisements from their ISPs. Businesses usually are dealing with computers or networks, but not both. If you're in a business that uses computers (e.g. you're an applications programmer), then you won't have an AS. If you even ask the question "what do I need to have an AS?", then you aren't in the network business and you won't have one. An AS is connected through border routers (usually two or more for redundancy) to an ISP which is connected to the internet backbone. The border routers act as a default gw for the routers inside the AS (and do so by the instruction "default route originate" in their .conf file) and appear as a "route of last resort" in the routing tables of the inside machines. If you want your own (private) AS, then use the private AS numbers 64512-65535 (the AS equivalent of 192.168.x.x IPs). 
Advertisements for these ASs are not propagated. After convergence, the routers within an AS will know the routes in the AS and will know which machines to use as their default gw (gateway of last resort). Here's the setup for the demonstration with two routers (directors), working as an active/backup pair, running a dynamic routing protocol. dummy0 is configured with an IP in the Cimafranca paper, partly for their demonstration. This IP allows you to ping the node from the outside, as long as at least one hardware NIC is up. Supposedly this IP is a convenience to be able to identify a host (although I didn't have any need for it). dummy0 is chosen as it is the interface least likely to go down. cisco routers use lo for this IP, but apparently the convention with Linux is to use dummy0. The IPs on each dummy0 interface are in different networks. If they are in the same network, you can't route to the IP on dummy0 on adjacent machines. Here is the network during normal functioning Here is the network immediately following link loss to the backup director's default gw. the backup director has no default gw. the active director has a default gw. The job of the dynamic routing demon is to let the backup director know where the default gw is. Here is the network after re-establishing the default route for the backup master node. This takes about 25 secs with RIP. You install the iproute2 tools for zebra to work and and the CLI commands must be policy routing commands. There are two series of network tools available with Linux ifconfig, route; these are the old style commands. ip addr show, ip route show from the iproute2 policy routing tools. The routes/IPs added by rip/zebra are added by the iproute2 tools. The two series of commands are incompatible. IPs (or routes) added by iproute2 may not be visible to ifconfig (or route). Routes added by ip route add may be visible to route but aren't capable of being deleted by route. 
All IP and route commands from the command line must use the iproute2 tools. If you like names rather than port numbers, add these to /etc/services zebra.conf here's my zebra init script, ripd init script, bgpd init script, ospfd init script. Now use the zebra shell (vtysh or telnet localhost zebra) to install an IP on dummy0 (following the instructions of Cimafranca and Young). these instructions will only work if dummy0 is in zebra.conf the different prompts for bash, zebra, zebra in "enable" mode, zebra in "configure terminal" mode, zebra in "configure interface" mode. You can add the IP for dummy0 into zebra.conf with an editor instead. You could also add the IP on bootup, but by adding the information to the .conf file, the IP will only be present after you start up zebra. enable Password: zebra# configure terminal zebra(config)# interface dummy0 zebra(config-if)# ip address 10.0.1.1/24 zebra(config-if)# quit zebra(config)# write Configuration saved to /etc/zebra/zebra.conf zebra(config)# end zebra# show run Current configuration: ! hostname zebra password zebra enable password zebra log file /var/log/zebra.log ! interface lo ! interface dummy0 ip address 10.0.1.1/24 ! interface tunl0 ! interface eth0 ! interface eth1 ! ! line vty ! end zebra# quit Connection closed by foreign host. director:/etc/zebra# cat zebra.conf ! ! Zebra configuration saved from vty ! 2004/02/24 17:51:02 ! hostname zebra password zebra enable password zebra log file /var/log/zebra.log ! interface lo ! interface dummy0 ip address 10.0.1.1/24 ! interface tunl0 ! interface eth0 ! interface eth1 ! ! line vty ! ]]> Next time you start up zebra, the new zebra.conf script will add the IP to dummy0 and the src route (as if you'd run ip addr add 10.0.1.1/24 dev dummy0 brd + from the command line). Start up zebra on the second director and add an IP to dummy0 there (you can copy the zebra.conf file here to the other director and change the IP for dummy0). Now you're going to start ripd. 
Here's my ripd.conf

Here I add networks to the conf file from the zebra interface (you could use an editor on the conf file too).

enable
Password:
ripd# configure terminal
ripd(config)# router rip
ripd(config-router)# network 10.0.1.0/24
ripd(config-router)# network 192.168.1.0/24
ripd(config-router)# write
Configuration saved to /etc/zebra/ripd.conf
ripd(config-router)# show run
ripd(config-router)# show run

Current configuration:
!
hostname ripd
password zebra
enable password zebra
log file /var/log/ripd.log
!
interface lo
!
interface dummy0
!
interface tunl0
!
interface eth0
!
interface eth1
!
router rip
 network 10.0.1.0/24
 network 192.168.1.0/24
 network eth0
 network eth1
!
line vty
!
end
ripd(config-router)# quit
ripd(config)# exit
ripd# exit
Connection closed by foreign host.
director:/etc/zebra#

Here's the ripd.conf I used for the demo.

Make sure both routers have default routes. Activate debugging in zebra (so you will see notices of rip updates on the screen) and then show the routes

enable
Password:
zebra# debug zebra packet
zebra# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
       B - BGP, > - selected route, * - FIB route

K>* 0.0.0.0/0 via 192.168.1.253, eth1
R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:07:44
C>* 10.0.2.0/24 is directly connected, dummy0
K * 127.0.0.0/8 is directly connected, lo
C>* 127.0.0.0/8 is directly connected, lo
K * 192.168.1.0/24 is directly connected, eth1
C>* 192.168.1.0/24 is directly connected, eth1
K * 192.168.2.0/24 is directly connected, eth0
C>* 192.168.2.0/24 is directly connected, eth0

The output shows that the backup router has a default route added by the kernel (at the CLI above) and a route to 10.0.1.0 added by RIP, which enables routing to 10.0.1.1 on the other machine. (A similar view will be seen by running ip route show at the CLI.) The [120/2] indicates the administrative weight of the route [120] and the number of hops [2].
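The route lines above have a regular shape, so a script can pick them apart if you want to watch for the default route changing. Here is a minimal sketch in Python, assuming the line format shown in the listing above (code letter, optional > and * flags, prefix, optional [weight/hops], optional gateway). The names parse_route and default_route are made up for this example and are not part of zebra.

```python
import re

# Shape of one route line in zebra's "show ip route" output, as seen above:
#   R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:07:44
ROUTE_RE = re.compile(
    r"^(?P<code>[KCSROB])(?P<selected>>?)\s*(?P<fib>\*?)\s+"
    r"(?P<prefix>\d+\.\d+\.\d+\.\d+/\d+)"
    r"(?:\s+\[(?P<distance>\d+)/(?P<metric>\d+)\])?"
    r"(?:\s+via\s+(?P<gateway>\d+\.\d+\.\d+\.\d+))?"
)


def parse_route(line):
    """Return a dict describing one route line, or None for any other line."""
    m = ROUTE_RE.match(line.strip())
    if m is None:
        return None
    d = m.groupdict()
    return {
        "code": d["code"],                    # K, C, S, R, O or B
        "selected": d["selected"] == ">",     # '>' flag
        "fib": d["fib"] == "*",               # '*' flag
        "prefix": d["prefix"],
        "distance": int(d["distance"]) if d["distance"] else None,  # the 120 in [120/2]
        "metric": int(d["metric"]) if d["metric"] else None,        # the 2 in [120/2]
        "gateway": d["gateway"],              # None for directly connected routes
    }


def default_route(lines):
    """Return the selected default route from a list of output lines, if any."""
    for line in lines:
        route = parse_route(line)
        if route and route["prefix"] == "0.0.0.0/0" and route["selected"]:
            return route
    return None


if __name__ == "__main__":
    sample = [
        "K>* 0.0.0.0/0 via 192.168.1.253, eth1",
        "R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:07:44",
        "C>* 10.0.2.0/24 is directly connected, dummy0",
    ]
    print(default_route(sample))
```

Comparing the "code" field of the default route before and after a failover tells you whether it was installed by the kernel (K) or by RIP (R).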
Then do the following in order

- From another window, remove the default route at the command prompt (ip route del default via 192.168.1.253)

- in the zebra window above, up arrow and rerun the show ip route to show that the default route has gone.

- selected route, * - FIB route

R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:18:01
C>* 10.0.2.0/24 is directly connected, dummy0
K * 127.0.0.0/8 is directly connected, lo
C>* 127.0.0.0/8 is directly connected, lo
K * 192.168.1.0/24 is directly connected, eth1
C>* 192.168.1.0/24 is directly connected, eth1
K * 192.168.2.0/24 is directly connected, eth0
C>* 192.168.2.0/24 is directly connected, eth0

- watch in the zebra window for a RIP update

- up arrow in the zebra window and rerun the show ip route to show the new default route.

- selected route, * - FIB route

R>* 0.0.0.0/0 [120/2] via 192.168.1.254, eth1, 00:00:03
R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:18:31
C>* 10.0.2.0/24 is directly connected, dummy0
K * 127.0.0.0/8 is directly connected, lo
C>* 127.0.0.0/8 is directly connected, lo
K * 192.168.1.0/24 is directly connected, eth1
C>* 192.168.1.0/24 is directly connected, eth1
K * 192.168.2.0/24 is directly connected, eth0
C>* 192.168.2.0/24 is directly connected, eth0

The new (x.x.x.254 rather than x.x.x.253) default gw is now installed and this time it's installed by RIP (rather than the kernel).

Here's the view of the routing as shown from the CLI

This time the default route is installed by zebra.

You can time the route failover: At 18:31 (min:sec since executing), the new route has been up for 00:03 seconds. The failover occurred at 18:01, showing that the new route took ((31-3)-1)=27 seconds to appear after failover.

The new default gw is the other director's default gw. I had initially hoped that the new default gw would be an IP on the active director, and that ICMP redirects would handle re-routing to the active director's default gw.
However this didn't work, although I thought it would for a while. Here's what happened. If you activate the line in ripd.conf on just the active director, the active director, having a default route of its own, will advertise that it is a default route. If you then do the failover, the default route on the backup director will be an IP on the active director. (I thought I was home at this stage.) Since you want to do this symmetrically, you activate the same line in ripd.conf on the backup director. The problem (from talking to Steve Buchanan) is that the backup director, if it's been told to advertise that it is the default route, is not going to accept an advertisement from anyone else (like the active director) declaring that they are the default gw instead. After activating the option default-information originate, then on failure of the link, the backup master node will not accept the RIP update of a default route and will not show a default route.

With dynamic routing then, after failover, the default route for the backup router is the default route of the active router, and not an IP on the active router. Functionally these achieve the same result if there are no other problems with the routing on the backup router.
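The failover time worked out earlier from the show ip route timers (27 seconds) can be checked with a few lines of Python. This is only a sketch of the arithmetic: it assumes, as the text does, that the age of the RIP route to 10.0.1.0/24 keeps running and so can be used as a clock. The function names are made up for the example.

```python
def to_seconds(age):
    """Convert a zebra route age of the form 'HH:MM:SS' to seconds."""
    hours, minutes, seconds = (int(part) for part in age.split(":"))
    return hours * 3600 + minutes * 60 + seconds


def failover_delay(ref_age_at_failure, ref_age_now, new_route_age):
    """Seconds between losing the default route and the new one appearing.

    ref_age_at_failure: age of the reference route when the default was deleted
    ref_age_now:        age of the reference route in the later listing
    new_route_age:      age of the new default route in the later listing
    """
    appeared = to_seconds(ref_age_now) - to_seconds(new_route_age)
    return appeared - to_seconds(ref_age_at_failure)


if __name__ == "__main__":
    # Values read from the two listings: 00:18:01, then 00:18:31 and 00:00:03.
    print(failover_delay("00:18:01", "00:18:31", "00:00:03"))  # 27
```

The result agrees with the "about 25 secs with RIP" quoted for this setup.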

Dynamic routing with gated: An LVS that connects to the outside world through two networks

Patrick LeBoutillier patl (at) fusemail (dot) com 26 May 2004

Here is a "recipe" for creating LVS clusters with machines that support redundant networking. Our production environment is fully redundant at the network level (each machine has two network interfaces, each connected to a different network). All machines are connected to both these networks and data can come from either network. On each machine, services run on a local network address and gated announces the route for these networks via both network interfaces. My task was to create an LVS cluster of 2 such machines (each a potential director and realserver as well).

The network setup:

gated setup: Have gated announce (and accept) the following routes:

Machine 1:
- announce 192.168.20.1/32
- accept routes from 192.168.10.2 and 192.168.11.2

Machine 2:
- announce 192.168.21.1/32
- accept routes from 192.168.10.1 and 192.168.11.1

These routes will be used by ldirectord to monitor the realservers.

Recipe

Install UltraMonkey as usual, but:

- Make sure to configure ping nodes in both networks. A "ping node" is a pingable IP that is used by the heartbeat ipfail plugin, to determine if a director has lost network connectivity. The "ping node" terminology is defined at Getting Started with Linux-HA (heartbeat) (http://linux-ha.org/download/GettingStarted.html).
- Create the virtual IP alias as 192.168.30.1
- A virtual service definition in ldirectord.cf should look something like this:

In a normal setup, heartbeat manages the virtual IP alias and brings it up on the active director. If I understand correctly, an arp request is then sent, making the other machines in the local network aware that the active director is now the machine to be reached for the virtual IP. In this setup we will tell heartbeat to leave the virtual IP alias alone and have it tell gated to announce the route for the 192.168.30.1/32 network instead.
Therefore ONLY the active director will announce the routes to reach the virtual IP network.

Change your haresources line to something like this:

Place the following (or equivalent) code in a file called /etc/ha.d/resource.d/gated-toggle:

$CFG
        daemon $gated -f $CFG
    else
        daemon $gated
    fi
    RETVAL=$?
    [ $RETVAL -eq 0 ] && touch /var/lock/subsys/gated
    echo
    return $RETVAL
}

stop() {
    # Stop daemons.
    action $"Stopping $prog" $gdc stop
    RETVAL=$?
    if [ $RETVAL -eq 0 ] ; then
        rm -f /var/lock/subsys/gated
    fi
    return $RETVAL
}

# See how we were called.
case "$1" in
  start)
    stop
    start "/etc/gated-heartbeat.conf"
    ;;
  stop)
    stop
    start
    ;;
  *)
    echo $"Usage: $0 {start|stop}"
    exit 1
esac

exit $RETVAL
-------->8--------

What this script does is:
- On resource acquisition: Copy the gated configuration file (/etc/gated.conf) to another file (/etc/gated-heartbeat.conf), activate the route for the virtual IP network and restart gated using the new file.
- On resource loss: Restart gated using the original configuration.

gated must always be running and must start at boot time using the non-active (default) config. Modify /etc/gated.conf accordingly.

Here is the /etc/gated.conf file for machine 1:

The gated-toggle script will look for all lines ending with "# heartbeat-toggle" and turn them on (or off) depending on the cluster state.

I suspect you could do something similar with zebra or some other routing software, as long as you can restart it with a different config or (even better) change its config dynamically (maybe you can dynamically change the config for gated, but I'm not aware of this).
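The toggling step can be sketched as follows. This is not Patrick's script. It is a rough Python illustration that assumes a line marked "# heartbeat-toggle" is switched off by commenting it out with a leading # and switched on by removing that #. The function name toggle and the placeholder config lines are hypothetical and are not real gated syntax.

```python
MARK = "# heartbeat-toggle"


def toggle(config_text, active):
    """Return config_text with every line ending in MARK switched on or off.

    Assumption for this sketch: "off" means the line carries a leading '#',
    "on" means it does not. All other lines are passed through untouched.
    """
    out = []
    for line in config_text.splitlines():
        if line.rstrip().endswith(MARK):
            body = line.lstrip()
            indent = line[: len(line) - len(body)]
            # Drop an existing leading comment marker so the result is the
            # same whatever state the line was in before.
            if body.startswith("#"):
                body = body[1:].lstrip()
            line = indent + (body if active else "# " + body)
        out.append(line)
    return "\n".join(out) + "\n"


if __name__ == "__main__":
    # Placeholder lines only; the real gated.conf statements are not shown here.
    standby = "    # route-for-vip # heartbeat-toggle\nother-setting\n"
    print(toggle(standby, active=True))
```

On resource acquisition you would write toggle(text, True) to the second config file and restart the routing daemon with it; on resource loss you would restart with the original file, as the script above does.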

flapping stemming from convergence time for spanning tree Shaun McCullagh shaun (dot) mccullagh (dot) marviq (dot) com 27 May 2004

I've encountered some flapping problems with Keepalived v1.1.1 (on RH Linux 7.3 Kernel 2.4.18-5) when used with Cisco 2948 and C3548-XL switches. Both Master and Backup PCs use 3COM 905C NICs. As an experiment I tried ifconfig eth2 down on the Backup system to check it recovered from a FAULT state. The system went into FAULT state as expected, but when I did ifconfig eth2 up, keepalived initially went to BACKUP state, then started oscillating between MASTER and BACKUP state. I fixed the problem by increasing the advert_int to 35 seconds (on both Master and Backup system). The problem with this is that when Keepalived is started the VIPs obviously take much longer to start than if the advert_int is set to 5 seconds. I'd be grateful for suggestions as to what to investigate, as I'd quite like to set the advert_int back to 5 seconds.

Graeme Fowler keepalived (at) graemef (dot) net 27 May 2004

Hard set your switch speed/duplex settings for those ports, and use "mii-tool" (assuming it will support your cards) to do the same at the server end. Cisco switches take up to 30 seconds to complete their autonegotiation - if they're hard set, they don't.

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 27 May 2004

It's not auto-negotiation which takes time, it's the spanning tree algorithm. It's required to wait for 30 seconds to discover loops in the topology (nodes will only announce their presence so often). You can turn this off, with the configuration option spanning-tree portfast, if you're certain the port will never be used to connect to switches.

Graeme Fowler keepalived (at) graemef (dot) net 28 May 2004

Whoops! My mistake; indeed it is the spanning tree algorithm. I also ensure that I have "spanning-tree portfast" set on interfaces which I know will always be connected to hosts rather than switches (or in fact where I know that the port may connect to a switch which is not spanning tree capable). One point of note though is that I have on occasion been bitten by interfaces which continually autonegotiate - whilst connectivity seems OK, the interface itself flaps wildly every few seconds. Hence the comments about hard-setting port speeds :)
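One way to see why advert_int=35 hid the problem while advert_int=5 did not is to compare the time a VRRP backup waits before declaring the master dead with the 30 seconds the switch port spends in spanning tree. The sketch below is an interpretation, not something stated in the thread: it assumes keepalived follows the VRRPv2 timing rule from RFC 3768 (master down interval = 3 x advert interval + skew, skew = (256 - priority)/256), and the priority of 100 is a made-up value since the postings don't give one.

```python
def master_down_interval(advert_int, priority):
    """VRRPv2 (RFC 3768) master down interval in seconds."""
    skew_time = (256 - priority) / 256.0
    return 3 * advert_int + skew_time


def survives_port_delay(advert_int, priority, port_delay=30.0):
    """True if the backup keeps waiting longer than the port stays blocked.

    port_delay is the time (seconds) during which the switch port forwards
    nothing after link-up; 30 s is the spanning tree figure quoted above.
    """
    return master_down_interval(advert_int, priority) > port_delay


if __name__ == "__main__":
    for advert_int in (5, 35):
        print(advert_int,
              master_down_interval(advert_int, 100),
              survives_port_delay(advert_int, 100))
```

With a 5 second interval the backup gives up on the master after roughly 15 seconds, while its port is still blocked, which is consistent with the flapping reported. With portfast set on the port, the delay disappears and the short interval can be kept.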

LVS: Server State Sync Demon, syncd (saving the director's connection state on failover)

Intro

For seamless director failover, all connection state information from the failed director should be transferred/available to the new director. This is a similar problem to backing up a hot database. This problem had been discussed many times on the mailing list without any code being produced. Grabbing the bull by the horns, Ratz and Julian convened the Bulgarian Summit meeting in March 2001 where a design was set for a server state sync demon (look for links to photos of them working on the design).

Release Notice In ipvs-0.9.2 Wensong released a sync demon.

Wensong Zhang wensong (at) gnuchina (dot) org 20 Jun 2001

The ipvs-0.9.2 tar ball is available on the LVS website. The major change is the new connection synchronization feature.

Added the feature of connection synchronization from the primary load balancer to the backup load balancers through multicast. The ipvs syncmaster daemon is started inside the kernel on the primary load balancer, and it multicasts the queue of connection states that need synchronization. The ipvs syncbackup daemon is started inside the kernel too on the backup load balancers, and it accepts multicast messages and creates the corresponding connections.

Here are simple instructions to use connection synchronization.

On the primary load balancer, run

On the backup load balancers, run

To stop the daemon, run

Note that the feature of connection synchronization is experimental for now, and there is some performance penalty when connections are being synchronized, because a highly loaded load balancer may need to multicast a lot of connection information. If the daemon is not started, the performance will not be affected.

There aren't a lot of people using the server state sync demon yet, so we don't have much experience with it yet.

Alexandre Cassen alexandre (dot) cassen (at) wanadoo (dot) fr 9 Jul 2001

Using ipvsadm you start the sync daemon on the master director. It then sends adverts to the backup servers using multicast: 224.0.0.81. You need to start the ipvsadm sync daemon on the backup servers too... The master multicasts messages to the backup load balancers in the following format. I have planned to add an ICV like in IPSEC-AH (with anti-replay and all strong data exchange format) but I'm still very busy.
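To check from user space whether anything is being multicast to 224.0.0.81, you can join the group and wait for a datagram. The following is a minimal Python sketch and makes two assumptions that are not in the postings: that the sync messages are UDP on port 8848, and that a user-space socket on the listening host is allowed to join the group alongside any kernel sync daemon. The function names are made up for the example.

```python
import socket

SYNC_GROUP = "224.0.0.81"   # multicast group named in the posting above
SYNC_PORT = 8848            # assumption: UDP port used for the sync messages


def membership_request(group, local_ip="0.0.0.0"):
    """Build the 8-byte ip_mreq value expected by IP_ADD_MEMBERSHIP."""
    return socket.inet_aton(group) + socket.inet_aton(local_ip)


def open_listener(group=SYNC_GROUP, port=SYNC_PORT, local_ip="0.0.0.0"):
    """Open a UDP socket bound to the port and joined to the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP,
                    membership_request(group, local_ip))
    return sock


def watch(seconds=10.0):
    """Wait for one datagram; return a short description, or None on timeout."""
    sock = open_listener()
    sock.settimeout(seconds)
    try:
        data, (source, _port) = sock.recvfrom(65535)
        return "%d bytes from %s" % (len(data), source)
    except socket.timeout:
        return None
    finally:
        sock.close()


if __name__ == "__main__":
    print(watch())
```

A None result does not by itself mean the daemon is broken: a master with no connections to report has nothing to send.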

There is now a sync demon write up. From Lars Marowski-Bree lmb (at) suse (dot) de: If you're using the -sh and -dh schedulers then there is no state information to transfer ;-) If you're just setting up and have no connections and are checking your setup, then the sync demon has no data to transfer and is silent (i.e. it appears not to be working). Sean Knox sean (dot) knox (at) sbcglobal (dot) net 2003-02-21

I've just installed ipvsadm and LVS on a new Debian 3.0 server. Load balancing works fine, however, connection synchronization doesn't; I confirmed this via tcpdump (no sync information being multicast). The problem is that the sync. daemon won't transmit unless it actually has data (connection states) to send out. If your IPVS state table is empty (i.e. no connections), then you won't see any sync data being sent out. I guessed this was the case after seeing this entry in the kern.log:

combined effort of Sean Knox and Bernt Jernberg, 25 Feb 2003

On the backup director, the connection table is listed by running ipvsadm -Lcn. The backup director has no connections, so ipvsadm -L will be empty (ipvsadm -L is only relevant on the master).

Carles Xavier Munyoz Baldo Oct 02, 2003

I'm setting up a high availability LVS director with RedHat 8.0 (kernel-2.4.20-20.8), ipvs-1.0.9 and keepalived. I'm running LVS with connection synchronisation enabled. When the master director faults, the backup director takes its role and all established connections work without interruption. GREAT !!!!! :-) The problem is, what to do when the failed master is recovered? How may I copy the connection table of the backup director to the master director?

Horms horms (at) verge (dot) net (dot) au 03 Oct 2003

There are various ways to do this, here is the one I would suggest. Read this post and patch by Alexandre Cassen. The patch was put into LVS 1.1.X but not 1.0.X so if you want that behaviour and you are using LVS 1.0.X (i.e. 2.4.X kernel) you will need to patch the code yourself. You may want this patch too which fixes a small bug in Alexandre's code.

in_pkts);

- if (ip_vs_sync_state == IP_VS_STATE_MASTER &&
+ if (ip_vs_sync_state & IP_VS_STATE_MASTER &&
      (cp->protocol != IPPROTO_TCP ||
       cp->state == IP_VS_S_ESTABLISHED) &&
      (atomic_read(&cp->in_pkts) % 50 == sysctl_ip_vs_sync_threshold))

With this patch, when I bring back the master director, the backup director will notify it of the new connections but, what happens with the current connections established in the backup director? Are they notified to the recovered master director? When the master director takes the VIP, these established connections will stop.

No. If you wait a short time before the master takes over the VIP then the connections will have been synchronised. Alternatively when the old master comes back up make it the stand-by. Presumably some time will pass before another failover occurs and synchronisation should have plenty of time to occur. If you are using heartbeat then this is called nice_failback.

Is there any way to copy all the connections table from the backup director to the master director when it gets recovered from a previous fault?

No. It would be possible to add some sort of dump request. But I don't think this would be wise as if you have a lot of connections this could take a while and thus impact load balancing - if you have a lot of connections the linux director is probably already very busy.

Expiration of Connection in Backup Director ong cheechye Mar 14, 2003

I'm running Piranha over IPVS (ipvs-1.0.4.patch). I notice that the connection expire time in the primary director is much longer than the backup director's (seen from ipvsadm -lc below). So when the director fails over to the backup, the connection in the backup director might have already expired and been removed. Thus the connection would not fail over. Is this the right thing to happen? How does ipvs determine the expiry time of a connection?

In Primary Director

In Backup Director

Horms

Ok, this is a little confusing. The expiry values are used in different ways on the primary and the backup, so it doesn't really matter that they aren't the same. Essentially what is going to happen is that even if the connection expires in the backup, as soon as some more packets arrive on the primary, the connection will be updated on the backup and it will re-appear in the backup's connection table.

Bonnet, Mar 14, 2003

You missed Ong's point. Imagine the following:
- a telnet connection is established thru primary director
- connection sync'ed to backup
- no more telnet packets for a while
- connection removed from backup (but still on primary)
- primary failure
- backup taking over
- incoming packet for telnet session

Ooops! The backup director doesn't know this connection, whereas a few seconds before the primary director did know it!

Horms

I guess that I did miss the point. Of course once the connection times out on the backup director failover of that connection is not going to work. This timeout can be modified by changing IP_VS_SYNC_CONN_TIMEOUT in ip_vs_sync.c. The default is 3 minutes. It could be a /proc entry, but it isn't at the moment.

ong cheechye

Ok. So connection will not failover if it has been idling

Horms Yes, if the connection has been idling for longer than the timeout on the backup (3 minutes) it will not failover. Sebastian Vieira sebvieira (at) gmail (dot) com 15 Sep 2006

When I do a failover to the other node ipvsadm -l shows zero connections. But client connections _are_ being sync'd. When the original active director was down, I could connect through the backup director without any problem. If I do a failback it shows all connections again. It's not a big issue since LVS works as it should, but it would be handy though. Is there something that can be done about this?

Joe

I'm not sure what the expected behaviour is here, but the syncd only transfers enough information for the backup director to take over as the active director - i.e. only the connection table and the active connections (connections in FIN_WAIT etc aren't transferred). When you go back to the original director, you're probably seeing its original connection count before failover.

Monty Ree Jun 04, 2007

I have a two director LVS-DR system like below.

when I execute like below.

here, 14:22 means expire time, right? at LVS1, after some seconds, above packets (connection entry) disappear but LVS2 does not.

Horms 4 Jun 2007

Yes, 14:22 is the expiration time. If you are using connection synchronisation and LVS1 is the master and LVS2 is the backup, here is a rough sketch of what is likely to be going on.

On LVS1 a connection is established and there is a series of packets going between the real-server and the end-user via the master linux-director. During this time the packets are tracked using a connection entry in the ESTABLISHED state. This has a time out of 15:00 minutes which is refreshed each time a packet is received. For the connection above that would seem to indicate that it has been idle for 38 seconds. When the connection is shut down by either the real-server or the end-user, the connection entry moves into the CLOSE state, with a much shorter timeout. If no more packets are received (which is usually the case; typically, there will only be more packets if some arrive out of order), then the connection entry disappears pretty quickly. So far so good.

On LVS2 things are a bit different. It doesn't see the packets sent by the real-server and the end-user. Rather it receives connection information via the lvs synchronisation protocol. These messages are sent out by LVS1. They are sent out once a connection has seen 3 packets (the 3-way handshake is complete) and then every 50th packet. There is also a 2s delay loop in there, but that's not that important.
It is this synchronisation information that produces the entries that you see on LVS2. And they may be a little out of date. But it's not really that important, because they are just there in case fail-over occurs, so that LVS2 will be able to forward packets for the connections that were synchronized. The precise details of the timeouts and the state of these synchronised connections are not that important, because while LVS2 is acting as a backup, they aren't used. So other than consuming a very small amount of memory, they do no harm. And, if failover occurs, then the entries that are used at that time are updated and follow the rules that LVS1 was previously following. Just think of them as templates with a timeout, rather than full fledged connection entries.
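The two rules described in these threads (a sync message after the 3rd packet and then every 50th, and a 3 minute lifetime for the entry on the backup) can be put together in a small model. This is a simplified sketch in Python, not the kernel logic: it ignores the 2s delay loop and the protocol/state checks, and the function names are made up. The modulo test follows the patch fragment quoted earlier (in_pkts % 50 == sysctl_ip_vs_sync_threshold, with the threshold taken to be 3).

```python
SYNC_THRESHOLD = 3        # first sync after this many packets
SYNC_PERIOD = 50          # then one sync every 50 packets
BACKUP_TIMEOUT = 3 * 60   # seconds an entry lives on the backup without refresh


def triggers_sync(in_pkts):
    """True if the packet with this running count causes a sync message."""
    return in_pkts % SYNC_PERIOD == SYNC_THRESHOLD


def survives_failover(packet_times, failover_time):
    """Would the backup still know the connection at failover_time?

    packet_times: ascending arrival times (seconds) of the connection's
    packets on the primary director.
    """
    last_sync = None
    for count, when in enumerate(packet_times, start=1):
        if when > failover_time:
            break
        if triggers_sync(count):
            last_sync = when
    if last_sync is None:
        return False          # never synchronised at all
    return (failover_time - last_sync) < BACKUP_TIMEOUT


if __name__ == "__main__":
    # Bonnet's telnet case: a short burst of packets, then a long idle period.
    burst = list(range(10))                  # 10 packets, one per second
    print(survives_failover(burst, 100))     # failover while entry is fresh
    print(survives_failover(burst, 300))     # failover after 3 idle minutes
```

The second call models the idle telnet session: the primary would still hold the connection (15 minute timeout), but the backup's copy has already gone.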

Syncd boxes must have the same time If the directors are being time updated by ntpdate from cron, rather than ntpd, then after a power down, they may not be in time synchronisation and won't accept messages from each other. Nicklas

After a total power outage today I'm having trouble getting the backup LVS node to function properly. The backup node keeps transitioning to MASTER state. I'm using keepalived. Things have worked flawlessly before the power outage. The two LVS machines are also firewalls with appropriate rules applied to pass through VRRP messages and such.

Graeme Fowler graeme (at) graemef (dot) net 06 Nov 2008 Are you 100% sure the firewall rules or a network misconfiguration aren't getting in the way? The most common flaw that causes this is a rule or route on the nominal master preventing it sending announcements, so the slave keeps transitioning. It's either that, or your system clocks are out of sync with each other. If you're using ntpdate from cron, that time is your problem. Can you run a local NTP daemon on the directors which is configured against two upstream time servers and only permits local clock slew? That's what I do. The daemon approach means the time is slewed slowly, rather than skewing several tens of seconds at a time. Also, make sure the hardware clocks are sufficiently close together on both systems (hwclock). If not, get the system times close and then do "hwclock --systohc". Joe: we didn't hear back, so we don't know the resolution of this problem.

LVS and syncd do not use conntrack

May 2004: ipvs now uses conntrack. see .

Horms 22 Oct 2003

IPVS does not use conntrack! You can examine the state of connections, including those synchronised to a backup, using ipvsadm -L -c -n

Carles Xavier Munyoz Baldo Oct 22, 2003

But, which kind of connections will it synchronise? All the connections passing through the FORWARD chain or only the connections directed to the realservers farm?

All of the connections that have been forwarded to realservers by LVS.

I'm building a high availability firewall using the ipvs sync daemons to synchronise the MASTER FW network connections with the BACKUP FW. Is this possible with ipvs or must I use another high availability software solution for linux?

Not really, unless you run all the connections through LVS.

Connection Synchronisation (TCP Fail-Over)

Wensong's original implementation of a synch demon would not failback after failover. This code was modified by Julian so that failover could be followed by failback. It was still a master/slave arrangement, where the sync demon on the active director broadcasts connection table information to the sync demon on the backup director. However, if you have an ipvsadm hash table on all directors (i.e. have the virtual services setup on all directors with ipvsadm), including the backup directors, then it's possible to have an LVS where each director broadcasts the connection information for the virtual services it is handling and receives connection data from the other directors for the virtual services they are handling. You then have a peer-to-peer (p2p) synch demon.

For the synch demon, you only need to keep track of the connections in ESTABLISHED state. The connections in TIME_WAIT etc. will disappear on their own and you don't need the other directors to take over a TIME_WAIT state on failover. If you can arrange for all directors to have the same ipvsadm hash table, and for some other mechanism to allow all directors to have the same (arping) VIP, then you have a fully failover proof set of p2p directors. For detailed discussion on the design of such a synchd see Horms' paper on Connection Synchronisation (TCP Fail-Over). Horms combines this with where all directors are active.

Horms has virtualised the synchd function of LVS, by writing hooks into LVS, which allows a synchd to be loaded as a module. Horms has rewritten Wensong's implementation as a module and his p2p code as a module and moved the user space controls into ipvsadm. In principle then, anyone can write a synchd module and register it with ipvs. Unfortunately this code has not been accepted into LVS and is being maintained separately by Horms.
This is a bit of a nightmare to maintain and track with each version of LVS, so if you're going to use Horms' code, be prepared for patching and spelunking kernel code.

Horms horms (at) verge (dot) net (dot) au 28 Apr 2004

In 2.6 the code has been enhanced by Julian to allow a peer-to-peer setup. The current 2.6 code allows you to run a master or a backup sync daemon. If you want p2p then you just run both - previously you could only run one or the other. This is configured through ipvsadm. For 2.4, I think Julian has the main patch on his web site. However there was a minor patch contributed by me, that is required for his patch to work. I am not sure if he incorporated that. It is in the mailing list archive.

I abstracted the sync daemon, which involved moving a bunch of code around and adding an extra layer. This did not change the functionality of the sync daemon at all. I modified the LVS core code, so that the sync code is implemented by a series of hooks. The idea is that people could implement these hooks however they liked. This could be done in a kernel module that registered itself with LVS when it is inserted into the kernel, or instructed to do so from userspace.

Then I implemented a version of these hooks that implemented Wensong's synch demon. These are registered when LVS is initialised. So you get the existing behaviour.

Then I implemented another module to handle synchronisation differently. It communicates with a user space daemon. When you insert this module it registers itself as the sync hooks, unregistering the default hooks. The reverse is true when you remove the module. This is explained at some length here. Implementation of Connection Synchronisation

Using it is explained in the man page. To start the sync daemon you run

depending on if you want a master or backup sync daemon. Using Julian's patches you run both on each node to get the p2p behaviour. To stop the sync daemon run

horms 24 Apr 2007

LVSSyncDaemonSwap is a script and its function is described in the comment at the top of it. Briefly: prior to 2.4.27 only the master or backup sync daemon could run but not both.
So when a failover occurs the new active node needs to stop the backup sync daemon and start the master one. The reverse action needs to occur when a node is running as standby. The purpose of LVSSyncDaemonSwap is to make this switch. If you have a newer kernel, just run both daemons on boot. Incidentally, the sync daemon doesn't have a way to flush connections, so I recommend setting autofailback to off. As of 2.4.27 the master and backup sync daemons can run simultaneously, which is recommended, and thus if you have a newer kernel you don't need LVSSyncDaemonSwap at all.

Graeme Fowler graeme (at) graemef (dot) net 09 Jun 2007

Look at your process list and check that you have two processes running:

If you do, you don't need LVSSyncDaemonSwap.

(back to Horms)

On my extremely long and ever expanding todo list, I plan to put up my patches. I have made a start by updating them and getting them all together in my workspace. One of the problems is that the patches tend to conflict with each other, or rely on each other (thankfully not both). I quite like Rusty's kernel patches page; I might go for something like that. But to be honest, I fear the support burden. The patches that I have made available in the past have resulted in much work for me.

Usually I just send my patches to the mailing list and forget about them. But in the case of some of the more substantial work I have done, they can be found at:

Julian's code resolves most of the major problems that I saw in the original (current default 2.4) synchronisation code. So, from a user point of view, since his code is in the ipvs package, it should be easier to set up.

My user-space daemon is called ip_vs_user_sync_simple. As the name suggests, the code is quite simple; hopefully people can extend and customise it. You can find it in http://www.ultramonkey.org/download/conn_sync/

Saru and the sync demon are independent code, but you should have a synch demon if you're using multiple directors.
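The rule Horms gives (one daemon at a time before 2.4.27, both from then on) can be written down as a small decision helper. This is only a sketch in Python: the function names are hypothetical, and treating every 2.6 kernel as "newer than 2.4.27" is an assumption, since the text isn't certain which 2.6 release first carried the change.

```python
import re

BOTH_DAEMONS_SINCE = (2, 4, 27)   # per the posting above


def parse_kernel(version):
    """Turn a string such as '2.4.26-ipvs' into the tuple (2, 4, 26)."""
    match = re.match(r"(\d+)\.(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError("unrecognised kernel version: %r" % (version,))
    return tuple(int(part) for part in match.groups())


def sync_daemons(version, role):
    """List the sync daemons a director should run.

    role is 'active' or 'standby'. On kernels that can run both daemons the
    role makes no difference, so no swap is needed at failover.
    """
    if role not in ("active", "standby"):
        raise ValueError("role must be 'active' or 'standby'")
    if parse_kernel(version) >= BOTH_DAEMONS_SINCE:
        return ["master", "backup"]
    return ["master"] if role == "active" else ["backup"]


if __name__ == "__main__":
    for version in ("2.4.26", "2.4.27", "2.6.10"):
        for role in ("active", "standby"):
            print(version, role, sync_daemons(version, role))
```

When the answer differs between the active and standby roles, something like LVSSyncDaemonSwap is needed at failover; when it doesn't, both daemons can simply be started at boot.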

There are several different development branches of the syncd. Here Horms is trying to get me straight on who wrote what, and which functionality is in which branch. Horms 27 May 2004

Wensong's implementation is in the kernel. So everyone who has a recent (since 2001) version of LVS has it.

Alexandre wrote some patches (independently of me) that address most of my concerns. This code has, to the best of my knowledge, been put into IPVS in the 2.6 kernel. Though as of which version I am not sure, possibly 2.6.0. The patches, however, have not been merged into 2.4. There are two patches from Alexandre, I will address each of them in turn. Though I have used neither of them extensively. I believe both of these patches apply against the kernel, which means Wensong's implementation of the synchronisation code.

linux-2.4.26-ipvs_syncd.patch.gz (This has been merged into 2.6.something)

This adds two features
- You can run a master and backup sync daemon on the same host. This means that if you have two (or more) linux directors they can act as a master and backup. That is they can both send and receive synchronisation traffic - though typically one machine will only do one of these at a time. This means that regardless of which machine is the active linux director connections will be synchronised and if a failover occurs those connections can continue. This addresses the problem that I discuss here (amongst other places) http://www.ultramonkey.org/papers/conn_sync/conn_sync.shtml#master_slave_problem
- It adds a SyncID field to the packet. This allows multiple LVS clusters to use synchronisation on the same multicast UDP port. Just set each cluster with a different SyncID and it will ignore packets whose SyncID doesn't match.

linux-2.6.4-ipvs_syncd_icv.patch.gz (This is only available for 2.6, presumably it relies on the patch above, which is in 2.6)

This patch allows the synchronisation packets to be signed. At a quick glance, this is done using an HMAC digest and a shared secret. This should be both secure and fast. (Actually I have done the same thing but am not able to release the code).
This protects against parties unknown injecting packets and possibly causing havoc on in the connection tables. This is pretty imporant if your linux directors will accept sync packets from parties unknown. Keeping in mind that UDP can be pretty easy to spoof, using packet filters to guard against this can be problematic (though not impossible). N.B: I didn't actually read Alexandre's paper, but as I mentioned I have implemented almost the same thing. So I am quite familiar with the concepts. The stem code which you should patch is in the lvs tree, which means in the kernel. The 2.6 code is a bit more advanced than the 2.4 code. Because it has the first of Alexandre's patches, and possibly the second, I have not checked. The two main lots of patches are from Alexandre (discussed immediately above) and mine (further up the page). My patches move a lot of the code to user-space. However, they are quite invasive, and probably not a whole lot better than using the core code with Alexandre's patches if you just want a functional syncd. If you want to hack then my approach should be better, as you can do a lot of the work in userspace.

"dingji" dingji (at) broadeasy (dot) com 2005-02-03

According to Manual-8 there should be three files to configure the sync-daemon, but I can only find the last one on my system. According to Connection Synchronisation by Horms, there seems to be a patch for this. But what's the difference between ip_vs_user_sync_simple and the sync-daemon within ipvsadm? And why did it seem to work without the two files? Were they set to default values?

Peter Mueller

conn_sync (http://www.ultramonkey.org/papers/conn_sync/conn_sync.shtml), and a more in-depth example of HOWTO-doit: lvs_jan_2004.pdf (http://www.ultramonkey.org/papers/lvs_jan_2004/stuff/lvs_jan_2004.pdf).

Horms 04 Feb 2005

ip_vs_user_sync_simple is userspace; the ipvsadm controlled daemon is in the kernel. I had worked on the synchronisation code for a customer, and we decided to make a userspace daemon to match up with some requirements for that customer. In a nutshell, it's easier to write user-space code than kernel code. It also addressed a number of concerns I had with the in-kernel code at the time. However, the kernel code problems have all been fixed now. So unless you desperately want to do some hacking, the in-kernel, ipvsadm-controlled daemon is my recommendation.

Sebastian Vieira sebvieira (at) gmail (dot) com

I've noticed that upon a failover, not all connections are sync'd (via syncdaemons) to the backup director. I read in the docs for ultramonkey (not using it ... I think ... but that was the only source I could find) something about /proc/sys/net/ipv4/vs/sync_threshold and how to manipulate this. If I understand things correctly, sync_threshold has 2 values. By default they are 3 and 50, meaning that after 3 packets the connection will be initially synchronised. After that, each 50 sent packets will cause the connection to be synchronised. That is, if i understand it correctly :)

Ratz 10 Nov 2006 I think this is a sound understanding of the mechanism.

Now I recall reading somewhere that there is a certain timeout involved. I mean that if no packets are sent for a certain time, the connection will not be synchronised. I don't know if this is true, but this could be the reason.

Yes, the "templates" are sent once but within the interval specified.
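The rule Sebastian describes can be sketched in a few lines of shell. This is an illustration of the posting, not the kernel source: the function names are made up, and the modulo test (packet count modulo the second value equals the first value) is my assumption about how a 2.6 kernel applies the two numbers. The defaults 3 and 50 are the ones quoted above.

```sh
#!/bin/sh
# Sketch of the sync_threshold rule as described in the posting above.
# sync_due and sync_points are hypothetical helper names.

# sync_due PKTS [THRESHOLD] [PERIOD]
# Succeeds (exit 0) when a connection that has seen PKTS packets would be
# queued for synchronisation. ASSUMPTION: pkts % period == threshold.
sync_due() {
    pkts=$1
    threshold=${2:-3}
    period=${3:-50}
    [ $(( pkts % period )) -eq "$threshold" ]
}

# sync_points [THRESHOLD] [PERIOD]
# Print the packet counts between 1 and 120 at which a sync would happen.
sync_points() {
    n=1
    out=""
    while [ "$n" -le 120 ]; do
        if sync_due "$n" "$@"; then
            out="$out $n"
        fi
        n=$(( n + 1 ))
    done
    echo "${out# }"
}

sync_points          # the defaults quoted above: 3 and 50
sync_points 1 10     # a hypothetical, more aggressive setting
```

With the defaults, a connection that never reaches its third packet is never sent to the backup, which would fit Sebastian's observation that not every connection survives a failover. On a director the live values can be read with cat /proc/sys/net/ipv4/vs/sync_threshold.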

The syncd produces broadcast traffic

If the syncd sends its traffic over the RIP network and it's been a while since you set the LVS up, you might forget that it sends broadcasts.

Dan Brown danb (at) zu (dot) com 19 Apr 2006

I've been watching errant traffic via tcpdump trying to track some unrelated problems and have noticed there is a lot of broadcast traffic coming from the active director. The traffic all looks like this:

    224.0.0.81.8848: UDP, length 28

According to some archive posts, this is how Apache session information is shared. I haven't dug deeper into the tcp traffic to figure out if this is true. These broadcasts occur every 2-6 seconds and aren't on a consistent schedule. I have a dedicated set of interfaces for heartbeat information (which I thought also shared the session information) and it looks like this:

    09:38:14.093733 IP 10.0.0.1.32847 > 10.0.0.2.ha-cluster: UDP, length 159
    09:38:14.319831 IP 10.0.0.2.32807 > 10.0.0.1.ha-cluster: UDP, length 158
    09:38:15.095778 IP 10.0.0.1.32847 > 10.0.0.2.ha-cluster: UDP, length 159
    09:38:15.317917 IP 10.0.0.2.32807 > 10.0.0.1.ha-cluster: UDP, length 158

I get a pair of broadcasts once per second. I expect this as it is configured that way. The broadcast info to 224.0.0.81.8848 is not configured in ha.cf, and neither director has mcast settings on any device. So what is the information being broadcast on the external internet device? I shouldn't be seeing _ANY_ broadcast packets over the external interface as far as I'm concerned.

Graeme Fowler graeme (at) graemef (dot) net

This is the LVS synchronisation daemon pushing state information from the master to the backup director (and it is in fact multicast, not broadcast, see http://www.iana.org/assignments/multicast-addresses). You should have an ipvs_syncmaster process on your master, and an ipvs_syncbackup process on the backup. This gives you the stateful failover which is so desirable upon director failure. It is possible to put this traffic onto a separate interface (like your heartbeat network) to save it being sent out to all the machines on the frontend network, but how that's configured depends on which application you use to manage your LVS.

ipvsadm: --mcast-interface interface
keepalived: lvs_sync_daemon_interface option in the VRRP instance section
ldirectord: seems not to have the option in the CVS version I'm looking at (Id: ldirectord,v 1.136 2006/04/05 02:12:24 horms) but can be driven alongside ipvsadm anyway quite happily, providing you don't stomp on the functionality provided by ldirectord.

Horms 25 Aug 2006

The data is completely unsecured. Anyone who joins the multicast group (opens a socket) can get the packets, though they probably aren't that interesting. What is interesting is that they can also inject packets, to say flood the connection table with entries. I've never crafted an attack, but I'm pretty sure the scope is ample. I worked on some code a few years ago to move part of the synchronisation into user-space, and secure it using a signature and a shared secret. Now that crypto-api is in the kernel (and has been for years) it should be easy enough to move this code into the kernel, which is a less invasive change. I wonder if there is any interest in this, as there certainly wasn't when I worked on it before.
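Graeme's ipvsadm option can be tried out along the following lines. This is a sketch under assumptions: eth1 stands for a hypothetical private (heartbeat) NIC, the run and start_syncd helpers are made up for the example, and everything is a dry run unless DRY_RUN=0 is set. Note that a posting further on reports errors with --mcast-interface on a 2.4.19 kernel, so check the result on your own kernel.

```sh
#!/bin/sh
# Dry run by default: commands are printed, not executed.
# Set DRY_RUN=0 (as root, on a director) to really run them.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_syncd ROLE IFACE
# ROLE is "master" or "backup"; IFACE is the NIC that should carry the
# multicast sync traffic.
start_syncd() {
    case "$1" in
        master|backup) ;;
        *) echo "usage: start_syncd master|backup IFACE" >&2; return 1 ;;
    esac
    run ipvsadm --start-daemon "$1" --mcast-interface "$2"
}

start_syncd master eth1    # on the active director
start_syncd backup eth1    # on the standby director

# Watch the multicast group and port seen in Dan's tcpdump output, one
# interface at a time, to confirm where the sync packets really go.
run tcpdump -n -i eth1 host 224.0.0.81 and udp port 8848
```

Moving the traffic to a private interface also narrows the exposure Horms describes, since fewer hosts can join the group or inject packets, but it does not authenticate anything.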

from the mailing list Dave Augustus davea (at) support (dot) kcm (dot) org 14 Nov 2002

We are currently building an LVS and are close to deployment. All machines have 2 NICs: 1 public and 1 private. I want to use the private one for the connection sync daemon. There are 2 directors (master and backup) and 4 realservers. All directors and realservers use the same kernel: 2.4.19, ipvsadm v1.21 2002/07/09 (compiled with popt and IPVS v1.0.6). When I specify on Master Director: and on the backup Director: No connection information is available on the backup using these settings. Also, my message log for the Master Director lists: "kernel: IPVS: ip_vs_send_async error." However, when I change the mcast-interface to eth0, the connection sync works fine and no errors are reported.

... Now for my belated reply (I just deployed an 8 server LVS): The workaround I came up with was simply dropping the --mcast statement altogether. The traffic is now handled on the eth0 interface by default. I didn't want to do it this way. The bug seemed to crop up whenever I specified ANY interface.

tuliol (at) sybatech (dot) com Jun 20, 2004

Currently I have connection synchronization working between directors. The setup is configured by manually running the commands:

    ipvsadm --start-daemon master    #Master Linux director
    ipvsadm --start-daemon backup    #Slave Linux director

My question is: Is there a way to automatically start those 2 processes by putting a setting in the ldirectord config file or somewhere else?

Horms

Ldirectord is one option. However it only really makes sense if you are running it as a stand-alone daemon, as you want the synchronisation daemons running all the time. If you are using something like heartbeat to start and stop ldirectord, then you don't want the synchronisation daemons handled there, or else the synchronisation daemon won't run when the ldirectord resource is relinquished on a node, and this really isn't what you want. ipvsadm has an init script. You should be able to use that to start and stop the daemons.

James Bromberger jbromberger (at) fotango (dot) com

We run ldirectord on both master and standby hosts ALL the time. We also run ipvsadm --start-daemon master and ipvsadm --start-daemon backup on BOTH hosts all the time (2.6 kernel). The ONLY thing that heartbeat is doing is bringing the service IP addresses up and down. That way ldirectord doesn't need 10 seconds or so to check services before they come into service: everything is instant. Maybe this is wrong, but it works really well. Our failover time is less than half a second, and all state is retained. Fail back can happen immediately as well.

unknown: who asked about the changes needed to the heartbeat scripts for syncd

Peter Mueller pmueller (at) sidestep (dot) com 29 Nov 2004

I am using a very simple script for this purpose, called through heartbeat/haresources. The quick summary seems to be go with a solution similar to mine for 2.4, and use a "slave and master on both servers" solution for 2.6 kernels. See the full thread (http://marc.theaimsgroup.com/?l=linux-virtual-server&m=108924839319403&w=2) for all the details.

Joe - which I think refers to this

Well here's what I am going to be doing in my next cluster going live very soon now (sm). If anyone sees anything silly please let me know. Note that if you copy my config you will have to change /etc/ha.d/update's SSH line to the proper host, and reverse it for the backup host. You will also probably want an ssh-public key login setup, although I'm not certain I'm going to do this.

    #!/bin/sh
    # script to set the sync state properly on both LVS servers.
    case "$1" in
    start)
        /sbin/ipvsadm --stop-daemon
        /sbin/ipvsadm --start-daemon master
        ;;
    stop)
        /sbin/ipvsadm --stop-daemon
        /sbin/ipvsadm --start-daemon backup
        ;;
    esac
    exit 0

    IPaddr::ip.goes.here.here \
    ...
    ldirectord::ldirectord.cf \
    lvsstate.sh

    #!/bin/sh
    # script for updating ldirectord nicely. created 06/05/2001 PM
    # first, backup ldirectord.cf in case someone messes up later.
    cp -f /etc/ha.d/conf/ldirectord.cf /etc/ha.d/conf/backup.of.ldirectord.cf
    # next, scp the ldirectord.cf file over to the other director
    # the two LVS servers will have to have public-key acceptance for this to work.
    scp /etc/ha.d/conf/ldirectord.cf lvs2-priv:/etc/ha.d/conf/.
    # make sure the state is set properly for the active server
    ssh rwclvs2-priv /etc/ha.d/lvsstate.sh stop
    /etc/ha.d/lvsstate.sh start
    # reload ldirectord
    killall -HUP ldirectord
    # give it a few seconds to allow the config to set
    sleep 10
    # now display configuration
    ipvsadm -L
    ipvsadm -L --daemon

Sebastiaan Veldhuisen Jun 03, 2005

We just implemented a 2nd director for our HA LVS environment and we want to do connection synchronization between the master and the backup director through ipvsadm.

1) Should the backup server list connections it's received?

Horms No

2) If not, how do I verify that it's updated its internal tables?

ipvsadm -L -c -n. The connections will show up in the connection table in the Established state.

3) Does it work if I always run both a master and a slave sync daemon at the same time on both directors, even if ipvsadm is only running on the master?

In more recent versions of the kernel, yes.
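The "both daemons on both directors" arrangement, together with the checks Horms mentions, could look like the sketch below. It assumes a kernel recent enough to run master and backup daemons side by side (2.6, per the postings above) and an ipvsadm that accepts --syncid. The SyncID value 1 and the helper names are hypothetical, and the script only prints the commands unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry run by default: commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# start_both_syncd SYNCID
# Run on every director. Each director both sends and receives sync
# traffic; directors of one cluster must share the same SYNCID.
start_both_syncd() {
    run ipvsadm --start-daemon master --syncid "$1"
    run ipvsadm --start-daemon backup --syncid "$1"
}

# show_sync_state
# First command lists the running sync daemons, the second dumps the
# connection table, where synchronised connections should appear.
show_sync_state() {
    run ipvsadm -L --daemon
    run ipvsadm -L -c -n
}

start_both_syncd 1
show_sync_state
```

Because nothing has to be switched at failover time, heartbeat only needs to move the service addresses, which is the arrangement James Bromberger describes.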

Bug (fixed) in syncd: mixed endianness on directors

Hopefully this is in the standard ipvs release now. Although Justin says that this is an endianness problem, it appears to be a 32/64 bit problem.

Justin Ossevoort justin (at) snt (dot) utwente (dot) nl 30 Sep 2004

There was a small bug in the ip_vs_sync.c code that made it impossible for 2 servers of different endianness to sync with each other (e.g. a sparc (big endian) and an i386 (little endian) based system). The problem was in the message size. All other data seems to be correctly rearranged to network byte order except for this one (probably because the size is used from the moment the data is being gathered to the moment it is sent). This caused "IPVS: bogus message" messages in my dmesg. This patch fixes this problem by converting the m->size to network byte order at the last moment before sending it, and changing it back to host order right before the message is processed. The patch is made against the Linux kernel version 2.6.8.1.

@@ -279,6 +280,9 @@
 	char *p;
 	int i;
 
+	/* Convert size back to host byte order */
+	m->size = ntohs(m->size);
+
 	if (buflen != m->size) {
 		IP_VS_ERR("bogus message\n");
 		return;
@@ -569,6 +573,23 @@
 	return len;
 }
 
+static void
+ip_vs_send_sync_msg(struct socket *sock, struct ip_vs_sync_buff *sb)
+{
+	int msize;
+	struct ip_vs_sync_mesg *m;
+
+	m = sb->mesg;
+	msize = m->size;
+
+	/* Put size in network byte order */
+	m->size = htons(m->size);
+
+	if (ip_vs_send_async(sock, (char *)m, msize) != msize)
+		IP_VS_ERR("ip_vs_send_async error\n");
+
+	ip_vs_sync_buff_release(sb);
+}
 
 static int
 ip_vs_receive(struct socket *sock, char *buffer, const size_t buflen)
@@ -605,7 +626,6 @@
 {
 	struct socket *sock;
 	struct ip_vs_sync_buff *sb;
-	struct ip_vs_sync_mesg *m;
 
 	/* create the sending multicast socket */
 	sock = make_send_sock();
@@ -618,20 +638,12 @@
 	for (;;) {
 		while ((sb=sb_dequeue())) {
-			m = sb->mesg;
-			if (ip_vs_send_async(sock, (char *)m,
-					     m->size) != m->size)
-				IP_VS_ERR("ip_vs_send_async error\n");
-			ip_vs_sync_buff_release(sb);
+			ip_vs_send_sync_msg(sock, sb);
 		}
 
 		/* check if entries stay in curr_sb for 2 seconds */
 		if ((sb = get_curr_sync_buff(2*HZ))) {
-			m = sb->mesg;
-			if (ip_vs_send_async(sock, (char *)m,
-					     m->size) != m->size)
-				IP_VS_ERR("ip_vs_send_async error\n");
-			ip_vs_sync_buff_release(sb);
+			ip_vs_send_sync_msg(sock, sb);
 		}
 
 		if (stop_master_sync)

Special credit goes to: Byte Internetdiensten (my current employer) for supplying the testbed that triggered this bug and the time sponsored to fix it.
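To see why an unconverted size field produces "bogus message", it may help to work one value through by hand. The sketch below follows Justin's endianness explanation. The swap16 helper and the message length of 280 bytes are made up for illustration; they are not taken from the kernel code.

```sh
#!/bin/sh
# swap16 N
# Print the value a 16-bit field holding N takes on when its two bytes
# are read in the opposite byte order (what happens when one side skips
# the htons()/ntohs() conversion and the hosts differ in endianness).
swap16() {
    echo $(( (($1 & 0xff) << 8) | (($1 >> 8) & 0xff) ))
}

size=280    # hypothetical sync message length in bytes (0x0118)

# The receiver compares the received buffer length (280) against the size
# field. Read in the wrong order the field no longer matches, so the
# message is rejected.
echo "sent as $size, misread as $(swap16 "$size")"

# Converting on send and converting back on receive restores the value,
# which is what the htons()/ntohs() pair in the patch does.
echo "round trip: $(swap16 "$(swap16 "$size")")"
```

Two directors of the same endianness never see the problem, because both read the field the same (unconverted) way.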

LVS: Realserver failure handled by Mon

Introduction

Don't even think about doing this till you've got your LVS working properly. If you want the LVS to survive a server or director failure, you can add software to do this after you have the LVS working. For production systems, failover may be useful or required.

An agent external to the ipvs code on the director is used to monitor the services. LVS itself can't monitor the services as LVS is just a packet switcher. If a realserver fails, the director doesn't see the failure, the client does. For the standard LVS-Tun and LVS-DR setups (ie receiving packets by an ethernet device and not by TP), the reply packets from the realserver go to its default gw and don't go through the director, so the LVS can't detect failure even if it wants to. For some of the mailings concerning why the LVS does not monitor itself and why an external agent (eg mon) is used instead, see the postings on external agents.

In a failure protected LVS, if the realserver at the end of a connection fails, the client will lose their connection to the LVS, and the client will have to start with a new connection, as would happen on a regular single server. With a failure protected LVS, the failed realserver will be switched out of the LVS and a working new server will be made available to you transparently (the client will connect to one of the still working servers, or possibly a new server if one is brought on-line).

If the service is http, losing the connection is not a problem for the client: they'll get a new connection next time they click on a link/reload. For services which maintain a connection, losing the connection will be a problem.

ratz ratz (at) tac (dot) ch 16 Nov 2000

This is very nasty for persistent setups in an e-commerce environment. Take for example a simple e-com site providing some items to buy. You can browse and view all their goodies. At a certain point you want to buy something. It is common nowadays that people can buy over the Internet with a credit card. Obviously this is done with, e.g., SSL. SSL needs persistence enabled in the LVS configuration. Imagine having 1000 users (conn ESTABLISHED) who are entering their VISA information when the database server crashes and the healthcheck takes out the server; or even more simply, when the server/service itself crashes. All already established connections (they have a persistent template in the kernel space) are lost and these 1000 users have to reauthenticate. How does this look from the point of view of a client who has no idea about the technology behind a certain site?

Here the functioning and setup of "mon" is described. In the Ultra Monkey version of LVS, ldirectord fills the same role. (I haven't compared ldirectord and mon. I used mon because it was available at the time, while ldirectord was either not available or I didn't know about it.) The configure script will set up mon to monitor the services on the realservers.

Get "mon" and "fping" from http://www.kernel.org/software/mon/ (I'm using mon-0.38.20). (From ian.martins (at) mail (dot) tju (dot) edu: comment out line 222 in fping.c if compiling under glibc.)

Get the perl package "Period" from CPAN (ftp://ftp.cpan.org).

To use the fping and telnet monitors, you'll need the tcp_scan binary which can be built from satan. The standard version of satan needs patches to compile on Linux. The patched version is at ftp://sunsite.unc.edu/pub/Linux/system/network/admin
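What an external agent such as mon or ldirectord does can be reduced to "probe the service, then add or remove the realserver with ipvsadm". The sketch below shows only that core step. It is not mon's configuration or code. The addresses, the helper names, the LVS-DR forwarding flag (-g) and the stand-in probes true and false are all assumptions for the example, and commands are printed rather than run unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry run by default: commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# set_realserver up|down VIP:PORT RIP:PORT
# Add the realserver to, or delete it from, the virtual service.
set_realserver() {
    case "$1" in
        up)   run ipvsadm -a -t "$2" -r "$3" -g -w 1 ;;
        down) run ipvsadm -d -t "$2" -r "$3" ;;
        *)    return 1 ;;
    esac
}

# monitor_once VIP:PORT RIP:PORT PROBE...
# PROBE is any command that exits 0 when the service is healthy
# (a real agent would fetch a page, open a TCP connection, etc.).
monitor_once() {
    vip=$1
    rip=$2
    shift 2
    if "$@" >/dev/null 2>&1; then
        set_realserver up "$vip" "$rip"
    else
        set_realserver down "$vip" "$rip"
    fi
}

# "true" and "false" stand in for a passing and a failing probe.
monitor_once 192.168.1.110:80 192.168.1.11:80 true
monitor_once 192.168.1.110:80 192.168.1.12:80 false
```

A real agent also remembers the last known state, so that it only calls ipvsadm when a realserver changes from up to down or back; ipvsadm reports an error if asked to add a server that is already present.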

ethernet NIC failure, and channel bonding

There was a lengthy thread on using multiple NICs to handle NIC failure. Software/hardware to handle such failures is common for unices which run expensive servers (e.g. Solaris) but is less common in Linux. Beowulfs can use multiple NICs to increase throughput by bonding them together (channel bonding), but redundancy/HA is not important for beowulfs - if a machine fails, it is fixed on the spot. There is no easy way to un-bond NICs - you have to reboot the computer :-(

Michael McConnell michaelm (at) eyeball (dot) com 06 Aug 2001

I want to take advantage of dual NICS in the realserver to provide redundancy. Unfortunately the default gw issue comes up.

Michael E Brown michael_e_brown (at) dell (dot) com 09 Aug 2001

Yes, this is a generally available feature with most modern NICs. It is called various things: channel bonding, trunking, link aggregation. There is a native Linux driver that implements this feature in a generic way. It is called the bonding driver. It works with any NIC. Look in drivers/net/bond*. Each NIC vendor also has a proprietary version that works with only their NIC. I gave URLs for Intel's product, iANS. Broadcom and 3Com also have this feature. I believe there is a standard for this: 802.1q. [The IEEE link aggregation standard is 802.3ad; 802.1q covers VLAN tagging.]
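For the native bonding driver Michael mentions, a failover-only (active-backup) bond can be put together on a current kernel with iproute2 roughly as follows. This is a sketch, not what the posters ran in 2001: the interface names, the address, the 100 ms link-monitoring interval and the make_bond helper are assumptions, and the commands are only printed unless DRY_RUN=0 is set.

```sh
#!/bin/sh
# Dry run by default: commands are printed, not executed.
run() {
    if [ "${DRY_RUN:-1}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

# make_bond BOND ADDR/PREFIX NIC...
# Create BOND in active-backup mode (one NIC carries traffic, the others
# take over on link failure), enslave each NIC, then bring the bond up.
make_bond() {
    bond=$1
    addr=$2
    shift 2
    run ip link add "$bond" type bond mode active-backup miimon 100
    for nic in "$@"; do
        run ip link set "$nic" down        # a NIC must be down to be enslaved
        run ip link set "$nic" master "$bond"
    done
    run ip link set "$bond" up
    run ip addr add "$addr" dev "$bond"
}

make_bond bond0 192.168.1.11/24 eth0 eth1
```

Active-backup mode needs no cooperation from the switch, so the two NICs can go to different switches. The aggregating modes discussed in this thread generally need switch support. On current kernels "ip link delete bond0" removes the bond again without a reboot.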

John Cronin

It would be nice if it could work across multiple switches, so if a single switch failed, you would not lose connectivity (I think the adaptive failover can do this, but that does not improve bandwidth).

Jake Garver garver (at) valkyrie (dot) net 08 Aug 2001

No it wouldn't be nice because it would put a tremendous burden on the link connecting the switches. If you are lucky, this link is 1Gb/sec, much slower than back planes which are 10Gb/sec and up. In general, you don't want to "load balance" your switches. Keep as much as you can on the same back plane.
So, are there any Cisco Fast EtherChannel experts out there? Can FEC run across multiple switches, or at least across multiple Catalyst blades? I guess I can go look it up, but if somebody already knows, I don't mind saving myself the trouble.
Fast EtherChannel cannot run across multiple switches. A colleague spent weeks of our time proving that. In short, each switch will see a distinct link, for a total of two, but your server will think it has one big one. The switches will not talk to each other to bond the two links and you don't want them to for the reason I stated above. Over multiple blades, that depends on your switch. Do a "show port capabilities" to find out; it will list the ports that can be grouped into an FEC group.

Michael E Brown michael_e_brown (at) dell (dot) com

If you want HA, have one machine (machine A) with bonded channels connected to switch A, and have another machine (machine B) with bonded channels connected to switch B. If you want to go super-paranoid, and have money to burn on links that won't be used during normal operations: have one machine (machine A) with bonded channels connected to switch A, and have backup bonded channels to switch B. Have software that detects failure of all bonded channels to switch A and fails over your IP to switch B (still on machine A). Have another machine (B), with two sets of bonded channels connected to switch C and switch D. lather, rinse, repeat. On Solaris, IP failover to backup link is called IP Multipathing, IIRC. New feature of Solaris 8. Various HA softwares for Linux, notably Steeleye Lifekeeper and possibly LinuxFailsafe, support this as well.

John Cronin

For the scenario described above (two systems), in many cases machine A is active and machine B is a passive failover, in which case you have already burned some money on an entire system (with bonded channels, no less) that won't be used during normal operations. Considering I can get four (two for each system) SMC EtherPower dual port cards for about $250, including shipping, or four Zynx Netblaster quad cards for about $820, if I shop around carefully (or $1000 for Intel Dual Port Server adapters or $1600 for Adaptec/Cogent ANA-6944 quad cards, if a name brand is important), the cost seems less significant when viewed in this light (not to mention the cost of two Cisco switches that can do FEC too).

Back to channel bonding (John Cronin)

I presume it's not doable.
I think "not doable" is an incorrect statement - "not done" would be more precise. For the most part, beowulf is about performance, not HA. I know that Intel NICs can use their own channel aggregation or Cisco Fast-EtherChannel to aggregate bandwidth AND provide redundancy. Unfortunately, these features are only available on the closed-source Microsoft and Novell platforms. http://www.intel.com/network/connectivity/solutions/server_bottlenecks/config_1.htm
Having 2 NICs on a machine with one being spare, is relatively new. No-one has implemented a protocol for redundancy AFAIK.
I assume that you mean both of these statements to apply to Linux and LVS only. Sun has had trunking for years, but IP multipathing is the way to go now as it is easier to set up. You do get some bandwidth improvements for OUTBOUND connections only, on a per connection basis, but the main feature is redundancy. Look in http://docs.sun.com/ for IP, multipathing, trunking. Sun also has had Network Adapter Fail-Over groups (NAFO groups) in Sun Cluster 2.X for years, and in Sun Cluster 3.0. Veritas Cluster Server has an IPmultiNIC resource that provides similar functionality. Both of these allow for a failed NIC to be more or less seamlessly replaced by another NIC. I would be surprised if IBM HACMP has not had a similar feature for quite some time. In most cases these solutions do not provide improved bandwidth.
The next question then is how often does a box fail in such a way that only 1 NIC fails and everything else keeps working? I would expect this to be an unusual failure mode and not worth protecting against. You might be better off channel bonding your 2 NICs and using the higher throughput (unless you're compute bound).
I would agree, with one exception. If you have the resources to implement redundant network paths farther out into your infrastructure, then having redundant NICs is much more likely to lead to improved availability. For example if you have two NICs, which are plugged into two different switches, which are in turn plugged into two different routers, then you start to get some real benefit. It is more complicated to set up (HA isn't easy most of the time), but with the dropping prices of switches and routers, and the increased need for HA in many environments, this is not as uncommon as it might sound, at least not in the ISP and hosting arena. I am not trying to slam LVS and Linux HA products - to the contrary; I am trying to inspire some talented soul to write a multipathing NIC device driver we can all benefit from. ;) I make my living doing work on Sun boxes, but I use Linux on my Dell Inspiron 8000 laptop (my primary workstation, actually - it's a very capable system). I would recommend Linux solutions in many situations, but in most cases my employers won't bite, as they prefer vendor supported solutions in virtually every instance, while complaining about the official vendor support.

For channel bonding, both NICs on the host have the same IP and MAC address. You need to split the cabling for the two lots of NICs, so you don't have address collisions - you'll need two switches.
John Cronin You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux). Intel adapters also have their own channel aggregation scheme, and I think the Intel switches could also facilitate this scheme, but Intel is getting out of the switch business. Unfortunately, none of the advanced Intel NIC features are available using Linux (it would be nice to have the hardware IPsec support on their newest adapters, for example).
Michael E Brown michael_e_brown (at) dell (dot) com
Depends on which kind of bonding you do. Fast EtherChannel depends on all of the NICs being connected to the same switch. You have to configure the switch for trunking. Most of the standardized trunking methods I have seen require you to configure the switch and have all your NICs connected to the same switch.

You either need multiple switches, or switches that understand and are willing participants in the channel aggregation method being used. Cisco makes switches that do Fast EtherChannel, and Intel makes adapters that understand this protocol (but again, not currently using Linux).

Michael E Brown michael_e_brown (at) dell (dot) com

Not true. You can download the iANS software from Intel. Not open source, but that is different from "not available". Look in http://isearch.intel.com for ians+linux. Also, if you want channel bonding without intel proprietary drivers, see

Michael McConnell

This definitely does it! It creates this excellent kernel module, which contains ALL. I just managed to get this running on a Tyan 2515 Motherboard that has two onboard Intel NICs. I've just tested failover mode, works *PERFECT*, not even a single packet dropped! I'm gonna try out adaptive load balancing next, and I'll let you know how I make out. ftp://download.intel.com/df-support/2895/eng/ians-1.3.34.tar.gz

Michael E Brown michael_e_brown (at) dell (dot) com

Broadcom also has proprietary channel bonding drivers for linux. The problem is getting access to this driver. I could not find any driver downloads from their website. It is possible that only OEMs have this driver. Dell factory installs this driver for RedHat 7.0 (and will be on 7.1, 7.2). You might want to e-mail Broadcom and ask. Also
Broadcom also has an SSL offload card which is coming out and it has open source drivers for linux. http://www.broadcom.com/products/5820.html You need the openssl library and kernel. The next release of Red Hat linux will have this support integrated in. The Broadcom folks are working closely with the OpenSSL team to get their userspace integrated directly into 0.9.7. Red Hat has backported this functionality into their 0.9.6 release. If you look at Red Hat's latest public beta, all the support is there and is working. Since there aren't docs yet, the "bcm5820" rpm is the one you want to install to enable everything. Install this RPM, and it contains an init script that enables and disables the OpenSSL "engine" support as appropriate. Engine is the new OpenSSL feature that enables hardware offload.