1. LVS: Introduction

This LVS-HOWTO is posted to the LVS-HOWTO homepage, http://www.austintek.com/LVS/LVS-HOWTO/ about once a month (although I do miss occasional months).

Some of the material is from my own testing and I've tried to make it into a coherent story. Much of the material is from the lvs-users mailing list and is listed chronologically (sometimes forward and sometimes backwards in time) and will thus look like a blog.

1.1. Thanks

Contributions to this HOWTO came from the mailing list and are attributed to the poster (with e-mail address). Postings may have been edited to fit into the flow of the HOWTO.

The LVS logo (Tux with 3 lighter shaded penguins behind him representing a director and 3 realservers) is by Mike Douglas spike (at) bayside (dot) net

LVS homepage is running on a machine donated by Horms. (Until Jul 2002, we used a machine donated by Voxel).

LVS mailing list is hosted by Lars in Germany lmb (at) suse (dot) de

1.2. About the HOWTO

1.2.1. Purpose

To enable you to understand how a Linux Virtual Server (LVS) works.

The LVS-mini-HOWTO (http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html) tells you how to setup and install an LVS without understanding how the LVS works.

The material here covers directors and realservers with 2.2, 2.4 and 2.6 kernels.

Note

The original material was written for 2.2.x kernels and ipchains. Not all material has been updated for 2.4.x kernels and iptables.

The layout of this HOWTO is almost flat - you go to the section you want information on. You aren't supposed to read it from start to finish. Within any section, newer information may be combined with older information that says different things. I just don't have time to edit everything - I'll be glad if you straighten me out. The only information one level up is

  • how LVS works (from this HOWTO or from documentation on the website, e.g. Wensong's early documents)
  • setting up an LVS in the LVS-mini-HOWTO.html.

The code for 2.0.x kernels still works fine and was used on production systems when 2.0.x kernels were current, but is not being developed further. For 2.2 kernels, the Linux kernel networking code was rewritten, producing for us The Arp Problem. This changed the installation of LVS from a simple process that could be done by almost anyone, to a thought provoking, head scratching exercise, which requires detailed understanding of the workings of LVS. For 2.0 and 2.2, LVS is stand-alone code, based on ip_masquerading, and doesn't integrate well with other uses of ip_masquerading.

For 2.4 kernels, LVS was rewritten as much as possible to be a netfilter module, to allow it to fit into and be visible to other netfilter modules. Unfortunately the fit isn't perfect, but cooperation with netfilter does work in most cases. If ip_vs() were a real netfilter module, it would be really slow. (The original LVS-NAT code had problems when using your director as a firewall; see Running a firewall on the director. Much of this has been fixed - Feb 2006.) Because it hooks into netfilter, the latency and throughput are slightly worse for 2.4 LVS than for the 2.2 code. However with modern CPUs running at 800MHz, the bottleneck now is network throughput rather than LVS throughput (you only need a small number of realservers to saturate 100Mbps ethernet).

In general ipvsadm commands and services have not changed between kernels.

1.2.2. HOWTO source is xml

The HOWTO was originally written in sgml. It is now xml. The char '&' found in C source code has to be written as &amp; in sgml. If you swiped patches from the sgml rather than the html rendering, you would get code that needed to be edited to fix the &amp;. Now that the HOWTO is in xml, this munging is not needed. Although I've tried to remove all munged ampersands, I expect some will persist for a while. Ampersands in URLs still have to be munged.
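If you do run into a munged ampersand in code copied from an old sgml copy, a filter along these lines will undo it. This is only a sketch: the function name `unmunge_amp` is made up for the example, and it only handles the `&amp;` entity.

```shell
# Hypothetical helper: turn the sgml entity "&amp;" back into a literal "&".
# Reads stdin, writes stdout; other entities (e.g. "&lt;") are left untouched.
unmunge_amp() {
    # In the replacement, "\&" is a literal ampersand (a bare "&" would mean "the match").
    sed 's/&amp;/\&/g'
}

# Example: a line of C as it would have appeared in the sgml source.
printf '%s\n' 'if (a &amp;&amp; b) flags &amp;= mask;' | unmunge_amp
```

The example line comes out as `if (a && b) flags &= mask;`.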

1.2.3. e-mail addresses in the HOWTO are spam protected

Well we hope so anyhow.

An article on spambots describes robots which ignore the robots.txt file and scan for e-mail addresses in readable files on websites. The author suggests removing any 'mailto:' strings and spam protecting e-mail addresses, by changing them from machine-readable to human-readable format. If you have a better scheme than implemented here, (and I can do it with vi) let me know.
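For what it's worth, the rewriting used in this HOWTO can be sketched as a few sed substitutions. The function name `protect_addr` is invented for the example, and it assumes it is given a bare address (or a `mailto:` link target), not a whole page of text.

```shell
# Hypothetical helper: rewrite an e-mail address into the
# "user (at) host (dot) dom" form used throughout this HOWTO.
# usage: protect_addr ADDRESS
protect_addr() {
    printf '%s\n' "$1" | sed \
        -e 's/^mailto://' \
        -e 's/\./ (dot) /g' \
        -e 's/@/ (at) /'
}

protect_addr 'mailto:robert.gehr@web2cad.de'
```

The last two substitutions should also work as `:s` commands inside vi, as long as you restrict them to the address and not the whole line.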

(May 2002): BTW, 160 people have contributed to the HOWTO (as judged by unique e-mail addresses).

1.2.4. Links die frequently

There are links to 180 urls in this HOWTO (May 2002), which came from postings to the LVS mailing list. If people move/rename/delete/change their webpages/links once a year, then I'm going to have to track down 15 websites each month. If a site is gone and it isn't in google, I'm not going to be able to find it.

1.3. Nomenclature/Abbreviations

If you use these terms when you mail us, we'll know what you're talking about.

1.3.1. Preferred names

  • IPVS, ipvs, ip_vs: the code that patches the linux kernel on the director.
  • LVS, linux virtual server: This is the director + realservers. Together these machines are the virtual server, which appears as one machine to the client(s).
  • director: the node that runs the ipvs code. Clients connect to the director. The director forwards packets to the realservers. The director is nothing but an IP router with special rules that make the LVS work.
  • realservers: the hosts that have the services. The realservers handle the requests from the clients.
  • client: the host or user level process that connects to the VIP on the director
  • forwarding method (currently LVS-NAT, LVS-DR, LVS-Tun). The director is a router with somewhat different rules for forwarding packets than a normal router. The forwarding method determines how the director sends packets from the client to the realservers.
  • scheduling (ipvsadm and schedulers) - the algorithm the director uses to select a realserver to service a new connection request from a client.

1.3.2. synonyms

Please use the first term in these lines. The other words are valid but less precise (or are redundant).

  • director: load balancer, dispatcher, redirector.
  • realserver: servers, realservers, real-servers.
  • LVS: the whole cluster, the (linux) virtual server (LVS)

1.3.3. virtual services, scheduling groups

Here's the ipvsadm output of an LVS serving telnet and squid.

director:/etc/rc.d# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:squid rr
  -> rs1.mack.net:squid        Route   1      0          0
  -> rs2.mack.net:squid        Route   1      0          0
  -> rs3.mack.net:squid        Route   1      0          0
TCP  lvs.mack.net:telnet rr
  -> rs1.mack.net:telnet       Route   1      0          0
  -> rs2.mack.net:telnet       Route   1      0          0

In the above LVS, there are two virtual services, telnet and squid. There are also two virtual servers; a virtual server for telnet (which has 2 realservers) and a virtual server for squid (which has 3 realservers). This is what the client sees; two services (and two servers).

Connections to each virtual server are scheduled (here by "rr", round robin), to the realservers which belong to the scheduling group. Here the scheduling group for telnet is rs1,rs2. The scheduling group for squid is rs1,rs2,rs3. Connections to the telnet virtual server are scheduled independently of connections to the squid virtual server.
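Commands of roughly the following shape would build a table like the one above. This is a sketch, not a record of how that director was actually set up: `emit_service` is a made-up helper that only prints the ipvsadm commands, so you can read them before running anything as root.

```shell
# Hypothetical helper: print (do not run) the ipvsadm commands
# for one virtual service and its scheduling group.
# usage: emit_service VIP SERVICE REALSERVER...
emit_service() {
    vip=$1
    svc=$2
    shift 2
    # -A adds a virtual service, -t marks it as TCP, -s picks the scheduler (rr)
    echo "ipvsadm -A -t $vip:$svc -s rr"
    for rs in "$@"; do
        # -a adds a realserver, -g is direct routing ("Route" in the listing),
        # -w sets the weight
        echo "ipvsadm -a -t $vip:$svc -r $rs:$svc -g -w 1"
    done
}

emit_service lvs.mack.net squid  rs1.mack.net rs2.mack.net rs3.mack.net
emit_service lvs.mack.net telnet rs1.mack.net rs2.mack.net
```

Each call produces one virtual service and one scheduling group, which is why the two services are scheduled separately.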

The above nomenclature can be extended for firewall mark (fwmark).

1.3.4. scheduling instance, scheduled unit, virtual connection

We don't have a good name for this. Suggestions welcome. (We also don't talk much about this concept on the mailing list, so we've done without a name).

The director needs to know how to schedule packets from the client to the realservers. The smallest unit for LVS is a tcpip connection, i.e. all packets that are part of a single tcpip session from a client will be sent to the same realserver. For a tcp virtual service, each tcp connection is scheduled separately, with the first tcp connection going to one realserver, and the next tcp connection going to the next realserver assigned a connection from the scheduler. The virtual connection is the same as the tcp connection.
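As a toy illustration of connections being handed out one at a time, here is round robin in a few lines of shell. It is only a model of the idea: the kernel scheduler is not written like this, and the sketch ignores weights. The name `rr_assign` is invented for the example.

```shell
# Hypothetical model of round robin: print which realserver each of
# N successive new tcp connections would be given.
# usage: rr_assign N REALSERVER...
rr_assign() {
    n=$1
    shift
    [ "$#" -gt 0 ] || return 1      # need at least one realserver
    i=0
    while [ "$i" -lt "$n" ]; do
        idx=$(( i % $# + 1 ))       # position of the next realserver, 1-based
        eval "rs=\${$idx}"          # fetch that positional parameter
        echo "connection $((i + 1)) -> $rs"
        i=$((i + 1))
    done
}

rr_assign 5 rs1 rs2
```

With two realservers, connections 1, 3 and 5 go to rs1 and connections 2 and 4 go to rs2.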

For a persistent connection all tcp connections that are separated by less than the timeout period are regarded as belonging to the same virtual connection and are scheduled to the same realserver.

For udp, there is no such thing as a connection or session and all packets from the client within a timeout period are scheduled to the same realserver. (People aren't using LVS for udp services a whole lot). The virtual connection then is all udp packets from a client within a certain time period.
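The timeout rule for persistence and udp can be modelled in a few lines of awk. This is a simplified model of the description above, not the kernel's timer code. The function name and the sample arrival times (in seconds) are made up.

```shell
# Hypothetical model: read arrival times (seconds, one per line, ascending)
# for one client and number the virtual connections they fall into.
# A gap of TIMEOUT seconds or more starts a new virtual connection.
# usage: group_by_timeout TIMEOUT < times
group_by_timeout() {
    awk -v t="$1" '
        NR == 1 || $1 - last >= t { group++ }
        { print $1, "virtual-connection", group; last = $1 }
    '
}

printf '%s\n' 0 100 250 900 | group_by_timeout 300
```

Here the arrivals at 0, 100 and 250 seconds belong to virtual connection 1 and would go to the same realserver. The arrival at 900 seconds comes after a gap longer than the timeout and is scheduled afresh.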

1.3.5. backend (multi-tier) servers

The realservers sometimes are frontends to other backend servers. The client does not connect to these backend servers and they are not in the ipvsadm table.

e.g.

  • a realserver may run a web application. The web application in turn connects to a database on another backend server.
  • a webcaching realserver (e.g. a squid). The squid connects to backend webserver(s).

These backend servers are setup separately from the LVS.

1.3.6. the term "the server" is ambiguous

People sometimes call the director or the realservers, "the server". Since the whole LVS appears as a server to the client and since the realservers are also serving services, the term "server" is ambiguous. Do not use the term "the server" or "the lvs server" when talking about LVS. Most often you are referring to the "director" or the "realservers". Sometimes (e.g. when talking about throughput) you are talking about the (whole) virtual server.

I use "realserver" as I despair of finding a reference to a "realserver" in a webpage using the search keys "real" and "server". Horms and I (for reasons that neither of us can remember) have been pushing the term "real-server" for about a year, on the mailing list, and no-one has adopted it. We're going back to "realserver".

1.3.7. names of IPs/networks in an LVS

                        ________
                       |        |
                       | client | (local or on internet)
                       |________|
                          CIP
                           |
--                      (router)
                          DGW
                           | outside network
                           |
L                         VIP
i                      ____|_____
n                     |          | (director can have 1 or 2 NICs)
u                     | director |
x                     |__________|
                      DIP (and PIP)
V                          |
i                          | DRIP network
r         ----------------------------------
t         |                |               |
u         |                |               |
a        RIP1             RIP2            RIP3
l    _____________   _____________   _____________
    |             | |             | |             |
S   | realserver1 | | realserver2 | | realserver3 |
e   |_____________| |_____________| |_____________|
r
v
e
r
---

The router has traditionally not been considered part of the LVS, because often you do not have control over the router. However if you're a paying customer, then the ISP will be glad to set up the router according to your specifications. If you have access to the router, it can solve The Arp Problem and can install filter rules.

Here are the names we use for the various IPs. If you use them when asking questions on the mailing list, we'll be able to answer your questions more easily.

client IP     = CIP
virtual IP    = VIP (the IP on the director that the client connects to)
director IP   = DIP - the IP on the director in the DIP/RIP (DRIP) network
   (this is the realserver gateway for LVS-NAT)
realserver IP = RIP (and RIP1, RIP2...) the IP on the realserver
director GW   = DGW - the director's gw (only needed for LVS-NAT)
   (this can be the realserver gateway for LVS-DR and LVS-Tun)

The VIP and DIP are setup as secondary IPs, (i.e. there is another primary IP on that NIC), so they can be moved to another duplicate director following director failover. For initial setup with a single director, setting up the VIP and DIP as secondary IPs will make the transition to a failover setup easier.

For a two director LVS (where directors failover), the IPs on the DRIP network are

primary director IP	= PIP (the director which will be the master on bootup)
secondary director IP	= SIP (the director which will be the backup on bootup)

The DIP will be on the same NIC as PIP on bootup and will move to the same NIC as SIP on director failover.
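Here is a sketch of what moving the DIP amounts to with iproute2. The address, device and label are assumptions for illustration, and the helper names are invented. The commands are only printed unless you deliberately clear `DRYRUN`. In a real failover setup the failover software does this for you, and the ARP caches of neighbouring machines also need refreshing, which this sketch does not do.

```shell
# DRYRUN defaults to "echo", so the commands are only printed.
# To apply them for real you would run as root with DRYRUN set to empty.
DRYRUN=${DRYRUN-echo}

# usage: take_ip ADDR/PREFIX DEV LABEL   (bring the address up on this director)
take_ip() {
    $DRYRUN ip addr add "$1" dev "$2" label "$3"
}

# usage: release_ip ADDR/PREFIX DEV      (drop the address on this director)
release_ip() {
    $DRYRUN ip addr del "$1" dev "$2"
}

# On the director that loses the master role (assumed DIP 192.168.1.2 on eth1):
release_ip 192.168.1.2/24 eth1
# On the director that takes over:
take_ip 192.168.1.2/24 eth1 eth1:1
```

With iproute2, an address added in a subnet that already has an address on that device is marked as secondary.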

We don't seem to need a name for the primary IP on the outside of the director - no-one ever talks about it.

We don't often need to explicitly name the networks in an LVS, but here are some suggestions

  • DRIP network: the network containing the DIP and RIPs. (OK you come up with a better name.)

  • network facing the internet or the outside network: the network on the director which receives packets from the outside world. This shouldn't be called the VIP network, as the VIP is also in the DRIP network (but not replying to arp calls) on the realservers in LVS-DR and LVS-Tun.

1.4. Minimal knowledge required

The mailing list and HOWTOs cover information specific to LVS. The rest you have to handle yourself. All of us knew nothing about computers when we first started, we learnt it, and you can too (we're not saying it's easy). If you can't setup a simple LVS from the LVS-mini-HOWTO, without breaking into a major sweat (or being able to tell us what's wrong with the instructions), then you need to do some more homework. (Also see Help! My LVS doesn't work.)

Ratz ratz (at) tac (dot) ch

To be able to setup/maintain an LVS, you need to

  • know how to patch and compile a kernel
  • know the basics of shell-scripting
  • have intermediate knowledge of TCP/IP
  • have read the man-page, the online-documentation and LVS-HOWTO (this document) (and the LVS-mini-HOWTO)
  • know basic system administration (e.g. iptables; syslog; find, compile, install code from source files; use cpan to find perl modules).

1.5. Free Technical Help

All of the people on the LVS mailing list are replying for free in their spare time. The best we can do is to give solutions to technical problems on setting up and running LVS. I give about 15secs to a posting to decide if I've got something useful to say. The posting has to indicate that the person has analysed the problem to a stage where an answer exists. If _they_ can't describe the problem, there's no point in replying - they won't understand the answer.

Please don't e-mail me privately with general questions (feel free to cc: me if you want). The mailing list will archive your question and the answer(s) which can be retrieved later. Other people may have more interesting, relevant or useful comments than I will. If you are writing to me in the hopes of avoiding the humiliation of publicly showing your ignorance on the mailing list, it's not going to happen. We've had too many good ideas from "ignorant" people to let this happen. If your question has been answered many times before and it's in the HOWTO and the archives, you'll be told to read the HOWTO, that's all.

To get technical help:

  • Read the docs on the website, the HOWTOs, and search the mailing list archives. The HOWTO (at the top) has a link to a search engine of all known LVS documentation. It will probably return several webpages. You'll have to find the entry from there.
  • The LVS-mini-HOWTO shows you how to setup a simple 3 node (client, director, realserver) LVS without you needing to understand a whole lot about how an LVS works.
  • after you've done a search of the docs, then post to the mailing list.
  • updates/problems/bugs - post to the mailing list

Jakub Suchy jakub (at) rtfm (dot) cz 13 Jan 2005

Please read: smart questions (http://www.catb.org/~esr/faqs/smart-questions.html) before asking questions.

Please only post relevant lines of a debug dump. If you post the whole dump, because you don't understand it, then it will fill up the archive machine and everyone's mail box. If we need the whole debug, we'll ask for it and you can send it to us off-list.

1.5.1. Problem people 1

It's hard to believe, but we get postings like

recompiling the kernel is hard (or I don't read HOWTOs), can't you guys cut me some slack and just tell me what to do?

I expect the people who post these statements don't read this HOWTO, so I may be wasting my time, but - No. The people on the mailing list answer questions for free, and have other important things to do, like keeping up with /. and checking our e-mail. When we're at home, we drink beer and watch Gilligan's Island re-runs.

1.5.2. Problem people 2

can anybody tell me how to setup a windows realserver? thank you very much! I'm in a hurry.

robert (dot) gehr (at) web2cad (dot) de

I can't think of anyone who has set up lvs in a hurry :-)

1.5.3. Problem People 3: People using RedHat LVS

RedHat have LVS in their standard distribution kernel. This gives people the idea that they can setup LVS from their standard RedHat distribution just by clicking on a few buttons or running some scripts. From reading the postings to the mailing list, it's more difficult than doing it our way. You still have to understand LVS and then afterwards, you have to figure out what RedHat did to it. One of the major wastes of time and source of aggravation for me personally on the LVS mailing list, is postings from people using RedHat LVS who assume that it's the same as LVS, and who post as if they're using our setup methods. Just saying that you're using a RedHat distribution doesn't tell us anything, since you can setup LVS our way in RedHat. Things you need to know before you post -

  • There are reasons for wanting to setup LVS in a standard RedHat distribution (e.g. RedHat is "approved" in your location whereas "Linux" isn't).
  • There is information in this HOWTO (PB's Nutshell) and in the various links from here which show you how to setup RedHat LVS.
  • We have a method of setting up LVS which works for all distributions (including RedHat). We are not interested in learning, understanding, debugging, supporting or fixing a setup method that only works for one distribution.
  • RedHat don't talk to us about what they do and while they may monitor the LVS mailing list, rarely (only about once a year, that I can tell) do they reply to people having problems with RedHat LVS. It appears that RedHat does not think their version of LVS worthy of much support and I agree with them.
  • If you setup LVS the RedHat way, you still need an understanding of how an LVS works and is setup (just like everyone else), before posting to the mailing list.

If you are setting up with RedHat and want help with it, make sure that you describe what you've done, that you're using the RedHat files and how you've set it up, otherwise we'll assume that you're setting up using our methods.

1.5.4. Why you may not get an answer

  • no-one knows.

    The LVS-NAT ftp helper bug took a long time to figure out. Since no-one else had seen the problem, we didn't know at first if it was a problem with LVS. It wasn't till 6 months later, when someone else had the same symptom, and found that it only occurred when the ftp helper module was loaded, that we could do something.

    I once needed to do something with iproute2 that I spent about 3 weeks trying to figure out. No-one on the list knew the answer. I had to post off-line to someone who could figure it out for me.

  • We may not have a useful answer.

    If you post saying "I want to build an LVS with (list of hardware); do you think it will work?", all we can say is "probably".

    Often when questions like this come up, there are people who are happy to share their experiences, so there's no harm in posting such a question. In general the people who've been working with LVS for years will expect you to have read the docs and know what LVS does before you post. In the time I allot for a reply, I don't have time to figure out whether in your case LVS is best for you - you should pay a consultant to do this if you can't do it yourself.

  • Your question may not be well posed.

    We are reading the postings in our spare time. You will get at most 30secs of attention before we figure out whether we can help you, an answer will take a bit of thinking, or we can't help you.

    If you have a long posting in which you haven't figured out which parts are causing the problem and which parts are working, then we aren't going to try to figure it out either. Post the minimum setup that will produce the problem.

  • It's obvious that you haven't read the HOWTO.

1.5.5. Edit your posts! (top, bottom and in-line posting)

Please edit the posting you're replying to, leaving only the parts relevant to your reply. We don't need to see material from previous posts irrelevant to the current posting, and the disk archive doesn't either.

Reply in-line, i.e. following each statement by the poster. Here's a posting on the subject from one of the kernel mailing lists.

Greg KH greg (at) kroah (dot) com 16 Nov 2005

A: http://en.wikipedia.org/wiki/Top_post
Q: Where do I find info about this thing called top-posting?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
A: No.
Q: Should I include quotations after my reply?

1.6. After you've Got Technical Help

In most cases when a problem is solved, there's enough info on the mailing list to see how it worked and we can write it up here for the next people. Occasionally, we get a posting "I've worked it out. Thanks for the help." When this happens we have no idea what the solution was and will have to reinvent it for the next person.

If you've got help from the unpaid people on the mailing list, who've given their spare time to help you, when they could instead have been watching Gilligan's Island reruns, please write it up for the HOWTO. When I write to people asking for their solution I don't want to hear that you're busy and have a job. We're busy, have jobs, kids, homework to do and tax forms to fill in and we stopped what we were doing to help you. Here's a template.

  • what you wanted to do
  • why/how it didn't work
  • what you needed to do to get it to work
  • how the solution works

1.7. Paid technical help

Note

We occasionally get requests for people to do an install. The listing is a service to people looking for paid technical help (installs or anything else) and does not imply that I (Joe) or anyone connected to the LVS project endorse the services of the listees. If you want to know more about them, check their postings to the LVS mailing list.

Entries will be listed at no cost, in approximate order of the date I receive/post them. Entries will be listed for at least a year (HOWTOs come out at erratic intervals and new entries will be added/old entries deleted whenever the next HOWTO comes out). If you want to be listed again next year, send me an e-mail in a year. I'm too busy to keep much of an eye on what goes in here and your entry may stay longer than a year. If you really want people to know who you are, don't rely on this entry - make sure google knows about you.

To be listed, send me off-list

your URL (e.g. <a href="http://www.foo.org">The Foo LVS service centre</a>) and/or e-mail, then a blurb of up to 80 chars e.g. "We do it all", optionally including your location.

this will be minimum maintenance - I'm just going to mouse swipe your e-mail (i.e. don't plan on changing your URL in the year).

People available for paid technical help.

  • Oct 2007: http://www.dotnoc.com - solutions for hosting sales@dotnoc.com. Linux load balancing and networking specialists

  • Oct 2007: Loadbalancer.org Ltd (http://www.loadbalancer.org/) - Specialise in high availability load balancers based on LVS. Happy for customers to have full access to the OS and source code and offer 24*7 support. However we don't do consultancy on home brew implementations. UK and USA offices.

  • Oct 2007: http://www.netdigix.com Linux solutions for business. contact@netdigix.com. We specialize in Linux networking and setup of LVS for hosting and mission critical infrastructures. Canada:British Columbia:Lower Mainland:Vancouver

1.8. Mailing list: subscribing, unsubscribing, searching

Thanks to Hank Leininger for the mailing list archive which is searchable not only by subject and author, but also by strings in the body. Hank's resource has been of great help in assembling this HOWTO.

The mailing list is available for further questions. A single mailing list handles developers, new users and old users and has about 0-20 postings a day. You don't have to join the mailing list to read the archives. If you want to post questions, then you have to join. If you aren't subscribed and you post (or you post from an unsubscribed address), you'll get a reply saying that your posting is "awaiting moderator approval". It isn't; because of the volume of spam, we no longer review these messages - they're deleted.

1.9. Mailing list: posting to

Please send e-mail with straight ascii (not html) and turn line-wrap on (some mails come with each paragraph on a single long line).
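If your mailer won't wrap lines for you, you can wrap the text yourself before pasting it in. Here is a minimal sketch using the standard `fold` utility. The 72 column width and the name `wrap_mail` are my choices for the example, not a list rule.

```shell
# Hypothetical helper: wrap stdin at 72 columns, breaking at spaces
# where possible (-s) rather than in the middle of words.
wrap_mail() {
    fold -s -w 72
}

# Example: one long paragraph sent as a single line gets broken up.
long='The director forwards packets from the client to the realservers and the realservers reply to the client, which sees a single virtual server.'
printf '%s\n' "$long" | wrap_mail
```

A whole draft kept in a file can be fed through the helper the same way before you paste it into the mailer.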

If you're stuck with posting from a Windows machine or Lotus notes, or using Lookout, where each paragraph is sent as one line:

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 09 Jul 2004

System manager -> Global Settings -> Internet Message Format -> Default (or
the one used) -> Advanced -> word wrap 

as shown in fixing outlook (http://www.lemis.com/email/fixing-outlook.html), especially in word wrap (http://www.lemis.com/email/exchsrvr-wordwrap.gif), but this is for a very old version of exchange.

Please don't turn on your vacation message, intended only for your work mates, for messages from a list. e.g.

I will be out of the office starting  07/30/2004 and will not return until 08/03/2004.

The LVS mailing list doesn't want to know.

Dan Moljar Aug 2004

For Lotus Notes: The client is not configured correctly. In the 'Out of Office' enable dialog under the 'Exceptions' tab, there is a check box for 'Do not reply to Internet Addresses'. Check it. The server shouldn't do it to begin with, but you can make the client stop.

There are always new ideas and questions being posted on the mailing list. We don't expect this to stop. There are many complexities to LVS and we don't expect new people to understand any more about LVS than we did when we started. No-one is expected to know/understand everything in the docs but your questions will be better received, if you've done your homework, if you have setup the test configurations here, have at least perused this HOWTO (yes we know it's big), and have looked at the mail archives. We can't help you if you just tell us that you've read the documents and your LVS doesn't work. To you, all problems look the same ("it doesn't work"). To help you, we need more information. We at least need the forwarding method, the service(s) being forwarded, the number of networks and the output of ipvsadm in the problem state.

Before you come up on the mailing list -

  • Read the LVS-HOWTO (this document) and the LVS-mini-HOWTO
  • Set up a simple LVS (3 nodes: client, director, realserver) with LVS-DR or LVS-NAT forwarding, with the service telnet using the instructions in the LVS-mini-HOWTO. You should be able to do this starting from a freshly downloaded kernel from ftp.kernel.org and the LVS patches (ipvs and the hidden patch if you have 2.4.x realservers).

    Don't setup first with http, with filter rules, with firewalls, with complicated file systems (e.g. coda, nfs) or network accelerators - debug all these nifty things after you have LVS working with telnet and with no filter rules.

    Do use standard compilers (gcc-2.95.3), tools and utilities (ifconfig or iproute2).

    Do not use non-standard tools particular to a distribution designed to capture market share (e.g. ifup).

  • If you are using one of the packages that can be used with LVS (e.g. heartbeat from the Linux HA project http://www.henge.com/~alanr/ha, or piranha from Redhat), again we may know what the problem is, but they need the feedback that you can't get it to work, not us. Many of us are on each others' mailing lists and we try to help when we can, but the best people to handle the problem are the developers for each package.
  • Consult the LVS mailing list archives.
  • Use our jargon as best you can. The machine names will be client, director, realserver1, realserver2... IPs are CIP, VIP, RIP, DIP. If you do this, we won't have to translate "susanne" and "annie" to their functional names as we scan your posting.
  • we need to know your kernel (e.g. 2.2.14) and the ip_vs patch that was applied to it (eg 0.9.11), whether you are using LVS-DR, LVS-NAT or LVS-Tun. Tell us
    • what you did
    • what you expected
    • what you got and why that's a problem
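The version information asked for above can be gathered in one go. Here is a minimal sketch. `lvs_report` is an invented name, and the ipvsadm listing will normally only work when run as root on the director.

```shell
# Hypothetical helper: print the version facts a posting should include.
lvs_report() {
    echo "kernel: $(uname -r)"
    if command -v ipvsadm >/dev/null 2>&1; then
        # The first line of the listing carries the IPVS version.
        echo "ipvs:   $(ipvsadm -L -n 2>/dev/null | head -n 1)"
    else
        echo "ipvs:   ipvsadm not found in PATH"
    fi
}

lvs_report
```

You still have to say which forwarding method you're using and what you did, expected and got; no script can do that for you.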

If you don't understand your problem well, here's a suggested submission format from Roberto Nibali ratz (at) tac (dot) ch

  1. System information, such as kernel, tools and their versions.

    Example:

    hog:~ # uname -a
    Linux hog 2.2.18 #2 Sun Dec 24 15:27:49 CET 2000 i686 unknown
    
    hog:~ # ipvsadm -L -n | head -1
    IP Virtual Server version 1.0.2 (size=4096)
    
    hog:~ # ipvsadm -h | head -1
    ipvsadm v1.13 2000/12/17 (compiled with popt and IPVS v1.0.2)
    
  2. Short description and maybe sketch of what you intended to setup.

    Example for LVS-DR:

    	o Using LVS-DR, gatewaying method.
    	o Load balancing port 80 (http) non-persistent.
    	o Network Setup:
    
                            ________
                           |        |
                           | client |
                           |________|
                               | CIP
                               |
                            (router)
                               |
                               | GEP
                     (packetfilter, firewall)
                               | GIP
                               |       __________
                               |  DIP |          |
                               +------+ director |
                               |  VIP |__________|
                               |
             +-----------------+----------------+
             |                 |                |
         RIP1, VIP         RIP2, VIP        RIP3, VIP
        ____________      ____________    ____________
       |            |    |            |  |            |
       |realserver1 |    |realserver2 |  |realserver3 |
       |____________|    |____________|  |____________|
    
    
    	CIP  = 212.23.34.83
    	GEP  = 81.23.10.2	(external gateway, eth0)
    	GIP  = 192.168.1.1	(internal gateway, eth1, masq or NAT)
    	DIP  = 192.168.1.2	(eth0:1, or eth1:1)
    	VIP1 = 192.168.1.110	(director: eth0:110, realserver: lo0:110)
    	RIP1 = 192.168.1.11
    	RIP2 = 192.168.1.12
    	RIP3 = 192.168.1.13
    	DGW  = 192.168.1.1	(GIP for all realserver)
    
    	o ipvsadm -L -n
    
    hog:~ # ipvsadm -L -n
    IP Virtual Server version 1.0.2 (size=4096)
    Prot LocalAddress:Port Scheduler Flags
      -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
    TCP  192.168.1.10:80 wlc
      -> 192.168.1.13:80             Route   0      0          0
      -> 192.168.1.12:80             Route   0      0          0
      -> 192.168.1.11:80             Route   0      0          0
    

    The output from ifconfig from all machines (abbreviated, just need the IP, netmask etc), and the output from netstat -rn.

  3. What doesn't work. Show some output from tcpdump, ipchains/ip_tables, ipvsadm and kernlog. Later we may ask you for a more detailed configuration like routing table, OS-version or interface setup on some machines used in your setup. Tell us what you expected. Example:
    ipchains -L -M -n (2.2.x) or cat /proc/net/ip_conntrack (2.4.x)
    echo 9 > /proc/sys/net/ipv4/vs/debug_level && tail -f /var/log/kernlog
    tcpdump -n -e -i eth0 tcp port 80
    route -n
    netstat -an
    ifconfig -a
    

    tcpdump listings are difficult to read. If you post one, please change the IPs to VIP, CIP, RIP1..n, DIP etc. Since you'll likely be on a switched network, tcpdump will only see packets to that NIC. Tell us which machine (director, realserver...) and the NIC (if there are two NICs on the machine) that it was run on.
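    Relabelling the addresses by hand is tedious and easy to get wrong. The following is a minimal sketch of a filter that does it, assuming the addresses from the example setup above. The LABELS table and the function names relabel() and relabel_lines() are made up for this illustration and are not part of any LVS tool. It relies on tcpdump -n printing the port after the address as a fifth dotted field (e.g. 192.168.1.110.80).

```python
import re

# Address-to-label map taken from the example setup above;
# replace with the addresses of your own LVS (assumption: IPv4 only).
LABELS = {
    "212.23.34.83": "CIP",
    "192.168.1.2": "DIP",
    "192.168.1.110": "VIP",
    "192.168.1.11": "RIP1",
    "192.168.1.12": "RIP2",
    "192.168.1.13": "RIP3",
}

# A dotted quad that does not start in the middle of another number.
# tcpdump -n appends the port as ".80", so a trailing dot is allowed.
IPV4 = re.compile(r"(?<![\d.])(\d{1,3}(?:\.\d{1,3}){3})(?!\d)")


def relabel(line, labels=LABELS):
    """Replace every known address in one line; unknown addresses are kept."""
    return IPV4.sub(lambda m: labels.get(m.group(1), m.group(1)), line)


def relabel_lines(lines, labels=LABELS):
    """Relabel an iterable of lines, e.g. sys.stdin or an open capture file."""
    for line in lines:
        yield relabel(line, labels)
```

    To use it as a filter, feed it sys.stdin (e.g. sys.stdout.writelines(relabel_lines(sys.stdin))) and pipe the tcpdump output through the script. Check the result before posting, since any address missing from the table is left unchanged.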

1.10. Bug Fixes

It's wonderful to get an unsolicited bug fix. Please let us know what it does and why it's better than the current file. A new version of a file without any information about what it does or what it fixes isn't much use to us.

1.11. Other load balancing solutions, GPL, opensource and commercial

1.11.1. Open Source and GPL solutions

Malcolm lists (at) loadbalancer (dot) org 23 Nov 2006

Willy Tarreau's written a nice article on load balancing, Making applications scalable with Load Balancing (http://1wt.eu/articles/2006_lb/), that covers layer 4 and layer 7 options. I still don't think layer 7 can ever give high availability, but it's a good read.

Ratz 23 Nov 2006

A very nice and to the point introduction. Willy, besides being a nice person and a good friend, is an excellent engineer with a lot of expertise in highly available, high performance and secure web services, networking and packet filtering and much more. It would be nice to have Willy contributing to/on this list as well :).

Malcolm lists (at) loadbalancer (dot) org 01 Feb 2007

HAProxy (http://haproxy.1wt.eu) is a fast TCP proxy, but flexible enough to do cookie insertion, SNAT etc.

from lvs (at) spiderhosting (dot) com a list of load balancers

Brent Cook busterb (at) mail (dot) utexas (dot) edu 28 Mar 2002

There's the HighUpTime (HUT) project at http://www.bsdshell.net/ (link dead Apr 2003). It's for FreeBSD.

The HUT author, Sebastian Petit spe (at) selectbourse (dot) net has joined the LVS mailing list.

For L7 Switching see the DRWS project.

Dec 2006: Alexandre Cassen, the author of Section 3.3, has written an L7 Switch at http://www.linux-l7sw.org.

BSD load balancing:

Roberto Nibali ratz (at) tac (dot) ch 05 Nov 2003

As already mentioned by others, LVS will not work on FreeBSD as director due to the kernel part. Using FreeBSD on the RS is of course ok. The BSD folks have not shown bigger interest in adopting the LVS idea or parts of the code yet. If you're interested in load balancing and HA Solutions under FreeBSD, you could check out following links:

http://www.bsdshell.net/hut_fvrrpd.html
http://www.backhand.org/wackamole/
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/isp/2003-05/0026.html
http://redundancy.redundancy.org/fbsd_lb.html

Gavin Henry ghenry (at) suretecsystems (dot) com 13 Jun 2005

ClusterIP by Harald Welte. What is the list's view on it?

Gavin Henry ghenry (at) suretecsystems (dot) com 13 Jun 2005

The man page for more recent versions of iptables says:

CLUSTERIP: This module allows you to configure a simple cluster of nodes that share a certain IP and MAC address without an explicit load balancer in front of them.

Horms

Been there, done that. Works, but is it necessary? LVS can run with up to 16 directors active (http://www.ultramonkey.org/papers/active_active/).

A set of postings on /. 2 Mar 2009 at Best Solution for HA and Network Loadbalancing (http://tech.slashdot.org/article.pl?sid=09/03/02/0231241) lists the following

1.11.2. List of Commercial Solutions

Cahya Wirawan cwirawan (at) email (dot) archlab (dot) tuwien (dot) ac (dot) at 19 Feb 2004

I'm implementing proxy, smtp and webserver with LVS as local node. I have tested it and it's running fine, but someone from the management section thinks that such an implementation is easy (just run setup.exe and everything is installed and ready to use), so he pushed me to move the setup into production, and to create another one as soon as possible. I want to tell him that such an implementation is not a trivial thing and needs time to set up and to test before we go into production. I want to show him a list of companies who have such complete solutions, so he can see the cost. Then he can understand that high availability and load balancing is not easy to set up, and will cost a lot of money if we buy a complete solution.

Vendors just rub their hands with glee on finding management like this - see my review of the book "The IBM Way" (http://www.austintek.com/book_reviews/the_ibm_way.html) for how IBM handles the situation.

Peter Mueller

Prices at this level are negotiable. Who knows what you could pay?

  • http://www.cisco.com/ - the old man on the LB-gig.
  • http://www.f5.com/f5products/bigip/LB520/ - the second old man in the LB gig.
  • http://www.suse.com/us/business/products/server/ - Suse has always been a big player in the Linux-HA world.
  • http://www.redhat.com/software/rhel/purchase/ - they have clustering based on LVS, not sure about price. At this point you have to buy enterprise edition (or http://www.whiteboxlinux.com) to use the clustering software.
  • http://www.ibm.com/ - always an option...
  • http://www.dell.com/ - moving up in the datacenter world. I see lots of Dells now..
  • http://www.ebay.com/ - see how much the gear is worth on the open market.
  • http://www.linuxvirtualserver.org/ - $0.

    There's plenty of people on list who can help you and your boss feel more comfortable with your setup. I'm sure if you posted something some people would be willing to help make you sleep better at night. BTW, you know about the http://www.ultramonkey.org/ and http://www.keepalived.org/ projects, right?

1.11.3. Radware

Joe Oct 2005: From a presentation by Radware (http://www.radware.com/) given to North Carolina Systems Administrators (NCSA) (http://www.ncsysadmin.org/) on 10 Oct 2005. Unfortunately I was the guy getting the pizzas for the meeting, so I missed most of the talk (which I wanted to hear).

Radware is used by Ebay and Accuweather. Radware has a NAT loadbalancing director that appears to function similarly to an LVS-NAT director. The servers can have private IPs.

Radware's loadbalancing director is only a small part of their offering. Radware have boxes that filter based on packet content (looking for viruses) that sit in the flow of packets (possibly before the director, possibly after - didn't find this out). They have boxes which just handle SYN floods. They use SYN cookies and do a statistical analysis of the packets, letting some through to see which machines reply to the SYN-ACKs. Radware has a gui to control the loadbalancer, which can do things like shutting down some of the backend servers for new connections at some time in the future (e.g. at 10pm later that night), so that by 8am next morning these machines have few or no connections and can be taken offline for servicing. Much of their hardware is ASIC based.

Health checking seems to be done from the director, and checks are made through to 3rd-Tier components of the backend servers (e.g. database machines behind the webservers that the client doesn't directly connect to).

Each local NAT'ed load balancing setup is itself a member of a distributed DNS-based load balancer. So www.foo.net might have a loadbalanced set of servers at different sites, e.g. London, New York, San Francisco and Tokyo. Each local setup has an authoritative nameserver for www.foo.net.

The way it works is

  • client in Scotland asks for the IP of www.foo.net
  • the client's nameserver doesn't know the IP and asks a rootserver for the machine authoritative for foo.net.
  • The rootserver has a list of 4 authoritative nameservers for foo.net and selects the next nameserver by round robin. If the next one in its list is in New York, it tells the client's nameserver to go query the nameserver in New York.
  • The New York nameserver for foo.net measures the packet latency to the client's nameserver and then returns the VIP for www.foo.net associated with the New York installation of www.foo.net. The latency is propagated to the other foo.net nameservers (in Tokyo, London and San Francisco).
  • Sometime later after the client's nameserver has flushed the IP entry for www.foo.net from its cache, another (or the same) client using the same nameserver asks for the IP of www.foo.net again and this time the rootserver will possibly send the request to another of the sites (say London). The London machine already knows the latency from New York to the client (without knowing where the client is), and sees that its latency to the client is lower than the latency from New York to the client, and returns the IP of its copy of www.foo.net. The London nameserver also updates the latency tables at the other sites (New York, San Francisco and Tokyo).
  • If the next nameserver request from the client site is sent to Tokyo, then the Tokyo machine updates the latency tables in all the other nameservers, and knowing that the latency is lowest to the London nameserver, returns the IP of www.foo.net in London.
  • In this way the four nameservers accumulate the latencies to all nameservers in the world. This works provided that the latencies don't change a lot with time of day (or throughput). Presumably you could store successive latencies and pick the shortest as reflecting the true network distance. The amount of memory required to do this must be small - there can't be more than a million nameservers, can there? 1 million 8 bit latencies is not much to store in memory.
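
The steps above can be sketched in a few lines. This is only an illustration of the scheme as I understood it from the talk, not Radware's implementation: the class SharedLatencyTable, the function answer_query(), the site names, the VIP labels and the latencies are all made up. The sketch assumes every site sees every other site's measurements immediately, and it keeps only the most recent latency per site.

```python
class SharedLatencyTable:
    """Latencies (ms) from each site to each client nameserver.

    Assumption: one table stands in for the copies that the sites
    keep in sync by propagating their measurements to each other.
    """

    def __init__(self):
        self._table = {}  # client nameserver -> {site: latency in ms}

    def record(self, client_ns, site, latency_ms):
        """Store (or overwrite) the latency a site measured to a client nameserver."""
        self._table.setdefault(client_ns, {})[site] = latency_ms

    def best_site(self, client_ns):
        """Return the site with the lowest latency recorded so far."""
        known = self._table[client_ns]
        return min(known, key=known.get)


def answer_query(table, vips, site, client_ns, measured_ms):
    """The nameserver at `site` was asked for www.foo.net by `client_ns`.

    It records its own measured latency, then hands out the VIP of
    whichever site is currently known to be closest to that nameserver.
    """
    table.record(client_ns, site, measured_ms)
    return vips[table.best_site(client_ns)]


# Hypothetical sites, VIP labels and latencies, following the story above.
vips = {"New York": "VIP-NY", "London": "VIP-LON", "Tokyo": "VIP-TYO"}
table = SharedLatencyTable()
first = answer_query(table, vips, "New York", "ns.scotland", 90)
second = answer_query(table, vips, "London", "ns.scotland", 20)
third = answer_query(table, vips, "Tokyo", "ns.scotland", 250)
```

Here the first query can only be answered with the New York VIP, since no other latency is known yet. The second and third are both answered with the London VIP, even though the third query arrived at Tokyo.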

Although I didn't get to ask how it works, if a client winds up at a more distant site (network wise), then http redirects will send the client to a closer site.

Radware SSL accelerators:

When I commented to the speaker that the main reason to use SSL accelerators is financial, i.e. to only have one copy of the certificate, rather than one on each realserver, they said "it's also for certificate management". Presumably some sites have large numbers of certificates. (They didn't disagree with my statement.)

The SSL accelerators in the Radware design don't sit between the director and the realservers (or in front of the director i.e. between the client and the director), but sit at the same level as the other realservers. The https request is balanced by the director to an accelerator, which decrypts the packets and sends the decrypted packets back to the director for loadbalancing as http traffic. Since the director is a NAT balancer, the return http traffic from the http servers goes back through the director, then back to the SSL accelerator, then back to the director as https traffic, and then back to the client.

Being able to have the SSL accelerator as a realserver in LVS would require the realservers to be a client of the director, something that we can do for LVS-NAT, but not for LVS-DR. This is not a capability that we've paid much attention to for LVS. If you need a realserver to be in the path in both inward and outward directions (like an SSL accelerator) then you will have to use LVS-NAT.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 12 Oct 2005

Note that we removed our Radware appliance to use LVS instead. Load Balancing using DNS is _evil_, especially with mobile internet and all those misconfigured operator gateways. Most mobile gateways are written in Java, and I'm probably the only one who has read the java.security file. Just have a look at this ugly stuff you can find in it and the unbelievably silly explanation given:

# The Java-level namelookup cache policy for successful lookups:
#
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
#
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set.
#
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.
#
#networkaddress.cache.ttl=-1

For security reasons! Guys! Well. So we removed Radware. Note that we had other problems with Radware. The DNS cache of the clients was one, the response time of the DNS was another. Several technical issues when you reach some traffic limits were the last.
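
The effect of the three settings described in the java.security comments can be seen with a small model. This is a sketch of the policy as the comments state it, not of the JVM's actual cache: the class LookupCache and the helper rotating_resolver() are made up for the illustration, and time is passed in as a plain number of seconds.

```python
import itertools


class LookupCache:
    """Name lookup cache following the policy in the comments above:
    negative ttl = cache forever, positive ttl = seconds to cache,
    zero = do not cache."""

    def __init__(self, resolve, ttl):
        self._resolve = resolve  # callable: name -> address (stands in for DNS)
        self._ttl = ttl
        self._entries = {}       # name -> (address, time it was cached)

    def lookup(self, name, now):
        if self._ttl != 0 and name in self._entries:
            address, cached_at = self._entries[name]
            # A negative ttl never expires; a positive one expires after ttl seconds.
            if self._ttl < 0 or now - cached_at < self._ttl:
                return address
        address = self._resolve(name)
        if self._ttl != 0:
            self._entries[name] = (address, now)
        return address


def rotating_resolver(addresses):
    """Hypothetical DNS-based balancer that hands out a different address on each query."""
    pool = itertools.cycle(addresses)
    return lambda name: next(pool)


# With ttl=-1 the gateway stays on the first address it was ever given.
forever = LookupCache(rotating_resolver(["VIP-NY", "VIP-LON"]), ttl=-1)
stuck = [forever.lookup("www.foo.net", now) for now in (0, 3600, 86400)]

# With ttl=30 the gateway asks again once the entry is 30 seconds old.
timed = LookupCache(rotating_resolver(["VIP-NY", "VIP-LON"]), ttl=30)
moving = [timed.lookup("www.foo.net", now) for now in (0, 10, 30)]
```

With the cache-forever setting, nothing the DNS-based balancer does after the first answer reaches the client, so the balancer cannot move that gateway's traffic to another site or away from a failed one.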

Henrik Holst

Still, geographic load balancing would be very nice to have and I cannot figure out another way to do it than to involve DNS round-robin.

Francois

Round-Robin DNS could work if

  • You have enough clients
  • Clients are using DNS as expected
  • Clients are dealing with TTL
  • Client DNS caches or provider DNS are honouring DNS TTL
  • All your sites are always up and working (you can't use a DNS solution for failover)

My clients are mobile phones, basically points 1 to 4 are not OK :). And I have to deal with multiple sources for the same client (the transaction begins in the gallery gateway and continues in the standard surf gateway, and I have to use fwmarks to keep the session)... We used RadWare to try to load-balance between our two peers. It clearly was not working. Unfortunately, I don't have all the details.

Horms

If you want to distribute traffic between hosts that have fast, reliable links, like a LAN, then LVS is a good option. No, an excellent option. If you want to distribute traffic between geographically separated hosts, then you don't want something like LVS that channels packets through a single location then to another. Something DNS based is probably the way to go - though round robin is not nearly smart enough for my liking. In practice, if you do have geographically distributed sites, then each site should probably be an LVS cluster. So essentially you end up using two techniques to solve different parts of the same problem.

I wrote quite a lot of this on supersparrow.org once upon a time, it's still there if people want to read/play/enhance. (links through Super Sparrow Project).

1.11.4. User's view of Radware, F5

bak bak (at) picklefactory (dot) org 09 Jan 2007

I've used Radware, F5, and HP SAs as an admin. My 2-minute executive overview take:

Radware is great for a switch-like, low-key experience. They're relatively cheap for hardware load balancers. You get extra functionality like SSL and link balancing with extra bits of hardware. Sometimes they can be pretty hard to troubleshoot. If you want global balancing/failover, that's part of all their "AS" type switches.

F5 is the other 'big name' option. These boxes are more like Brocade switches: it's running embedded Linux in there, and if you want to run tcpdump, you can. You get extra functionality by buying a 'bigger' box and then paying F5 for more licensing. If you want to do global balancing/failover, you have to get one of their DNS devices.

If you have money to wave around, I've found both Radware and F5 are more than happy to give you a demo unit for 2-4 weeks.

1.12. Books on LVS

Karl Kopper has tackled this. Writing a book on a moving target like LVS is a difficult proposition - certainly more than I was prepared to take on.

The Linux Enterprise Cluster
Karl Kopper
Pub: No Starch Press
ISBN 1593270364

The book is available at your usual suppliers.

I'm loath to mention the names of internet booksellers who require your e-mail address as part of your purchase, so that they can spam you later. I've been buying my books by phone at a marginally higher price since realising their business practices. However recently (Jul 2004) I've discovered disposable e-mail addresses e.g. the free service from Jetable.org (http://www.jetable.org/). They have a google-like (i.e. simple) interface. You give them your e-mail address, the required lifetime of the address (1-8days), and click. Up comes an e-mail address (test by sending a message to it) that you can give to your internet vendor, and mail will be forwarded to you for the period selected. After that time, no more mail will get to you. I've been using jetable since Jul 2004 (now Sep 2004) and have not got any spam from Jetable or from internet vendors.

1.13. LVS in the news

"Wired" Magazine in Jun 2004 has a small article about LVS, illustrating the multinational cooperative nature of GPL software development. The page is here (http://www.wired.com/wired/archive/12.06/images/atlas_software.pdf), or a local copy of the article on this server.

1.14. Software/Information/HOWTOs useful/related to LVS

Ultra Monkey is LVS and HA combined.

tong tong (at) csusb (dot) net 25 Jun 2003

Here's a step-by-step guide for setting up an LVS system with heartbeat (http://www.cula.net/cluster).

Note
This guide was published a year ago and we've only just heard about it. The author has never popped up on the mailing list to say hello.

from lvs (at) spiderhosting (dot) com Super Sparrow Global Load Balancing using BGP routing information.

Ratz is documenting the 2.6 headers and calls with doxygen (http://www.drugphish.ch/~ratz/IPVS/index.html) whenever he has reason to fiddle with a piece of code (i.e. the documentation isn't exhaustive, at least yet).

From ratz, there's a write up on load imbalance with persistence and sticky bits at our friends at M$.

From ratz, Zero copy patches to the kernel to speed up network throughput, Dave Miller's patches, Rik van Riel's vm-patches and more of Rik van Riel's patches at http://www.linux-mm.org/ (link dead Dec 2003). The Zero copy patches may not work with LVS and may not work with netfilter either (from john (at) antefacto (dot) com).

From Michael Brown michael_e_brown (at) dell (dot) com, the TUX kernel level webserver.

Dustin Puryear dustin (at) puryear-it (dot) com gave a talk on LVS at LISA 2003. The tutorial is available at: LVS: Load Balancing and High Availability for Free (http://www.puryear-it.com/publications.htm#6).