<?xml version="1.0"?>
<!DOCTYPE article PUBLIC "-//OASIS//DTD Docbook XML V4.1.2//EN"
"/usr/local/share/sgml/docbook/xml-dtd-4.1.2/docbookx.dtd">
<article>
	<articleinfo>
	<title>LVS-HOWTO</title>
	<author>
        	<firstname>Joseph</firstname>
        	<surname>Mack</surname>
        	<affiliation>
                	<orgname>jmack (at) wm7d (dot) net </orgname>
                	<orgdiv></orgdiv>
        	</affiliation>
	</author>
	<pubdate>v2009.03 Mar 2009, released under GPL.</pubdate>
	<copyright>
        	<year>1999</year>
        	<year>2000</year>
        	<year>2001</year>
        	<year>2002</year>
        	<year>2003</year>
        	<year>2004</year>
        	<year>2005</year>
        	<year>2006</year>
        	<year>2007</year>
        	<year>2008</year>
        	<year>2009</year>
        	<holder>Joseph Mack</holder>
	</copyright>
	<abstract>
	<para>
Installing, testing and running a Linux Virtual Server with 2.2.x, 2.4.x and 2.6.x kernels
	</para>
	<para>
<emphasis role="bold">search the LVS documentation</emphasis>
	</para>
	<itemizedlist>
		<listitem>
<ulink url="http://www.austintek.com/LVS/htdig/search/search.html">
search the LVS documentation</ulink> with htdig.
		</listitem>
		<listitem>
<ulink url="http://www.linuxvirtualserver.org/mailing.html">
search the various mailing list archives</ulink>
		</listitem>
	</itemizedlist>
	<para>
Hank Leininger's searchable mailing list archive has moved.
It's now at <ulink url="http://marc.info/?l=linux-virtual-server&amp;w=2">
http://marc.info/?l=linux-virtual-server&amp;w=2</ulink>.
	</para>
	</abstract>
	</articleinfo>
<section id="LVS-HOWTO.introduction" xreflabel="LVS Introduction">
<title>LVS: Introduction</title>
<para>
This LVS-HOWTO is posted to the LVS-HOWTO homepage,
<ulink url="http://www.austintek.com/LVS/LVS-HOWTO/">
http://www.austintek.com/LVS/LVS-HOWTO/
</ulink> about once a month (although I do miss occasional months).
</para>
<para>
Some of the material is from my own testing and I've tried to make
it into a coherent story.
Much of the material is from the lvs-users mailing list
and is listed chronologically 
(sometimes forward and sometimes backwards in time)
and will thus look like a blog.
</para>
	<section id="thanks">
	<title>Thanks</title>
	<para>
Contributions to this HOWTO came from the mailing list and are
attributed to the poster (with e-mail address). Postings may have
been edited to fit into the flow of the HOWTO.
	</para>
	<para>
The LVS logo (Tux with 3 lighter shaded penguins behind him
representing a director and 3 realservers) is by Mike Douglas <emphasis>spike (at) bayside (dot) net</emphasis>
	</para>
	<para>
<ulink url="http://www.linuxvirtualserver.org">LVS homepage</ulink> is running on
a machine donated by Horms. (Until Jul 2002, we used a machine donated by Voxel).
	</para>
	<para>
<ulink url="http://www.linuxvirtualserver.org">LVS mailing list</ulink> is hosted by
Lars in Germany <emphasis>lmb (at) suse (dot) de</emphasis>
	</para>
	</section>
	<section id="about">
	<title>About the HOWTO</title>
		<section id="purpose"><title>Purpose</title>
		<para>
To enable you to understand how a Linux Virtual Server (LVS) works.
		</para>
		<para id="mini-HOWTO">
The
<ulink url="http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html">LVS-mini-HOWTO</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html)
tells you how to set up and install an LVS without understanding how the LVS works.
		</para>
		<para>
The material here covers directors and realservers with 2.2, 2.4 and 2.6 kernels.
		</para>
		<note>
		<para>
The original material was written for 2.2.x kernels and ipchains. Not
all material has been updated for 2.4.x kernels and iptables.
		</para>
		</note>
		<para>
The layout of this HOWTO is almost flat -
you go to the section you want information on.
You aren't supposed to read it from start to finish.
Within any section, newer information may be combined
with older information that says different things.
I just don't have time to edit everything - I'll
be glad if you straighten me out.
The only information one level up is
		</para>
		<itemizedlist>
			<listitem>
how LVS works
(from <link linkend="LVS-HOWTO.what_is_an_LVS">this HOWTO</link> or
from documentation on the website, <emphasis>e.g.</emphasis>
Wensong's early documents)
			</listitem>
			<listitem>
setting up an LVS in the
<link linkend="mini-HOWTO">LVS-mini-HOWTO.html</link>.
			</listitem>
		</itemizedlist>
		<para>
The code for 2.0.x kernels still works fine and was used on production
systems when 2.0.x kernels were current, but is not being developed further.
For 2.2 kernels, the Linux kernel networking code was rewritten,
producing for us <xref linkend="LVS-HOWTO.arp_problem"/>.
This changes the installation of LVS from a simple
process that can be done by almost anyone,
to a thought provoking, head scratching exercise,
which requires detailed understanding of the workings of LVS.
For 2.0 and 2.2, LVS is stand-alone code, based on ip_masquerading and
doesn't integrate well with other uses of ip_masquerading.
For 2.4 kernels, LVS was rewritten as much as possible to be a netfilter module, 
to allow it to fit into and be visible to other netfilter modules.
Unfortunately the fit isn't perfect, but cooperation with netfilter does work in most cases.
If ip_vs() was a real netfilter module, it would be really slow.
(The original LVS-NAT code had problems when using your director as a firewall;
see the <xref linkend="LVS-HOWTO.filter_rules"/>, but much of this has been fixed - Feb 2006.)
Being a netfilter module, the latency and throughput are slightly worse
for 2.4 LVS than for the 2.2 code.
However, with modern CPUs running at 800MHz,
the bottleneck now is network throughput rather than LVS throughput
(you only need a small number of realservers to saturate 100Mbps ethernet).
		</para>
		<para>
In general <command>ipvsadm</command> commands and services have not changed between kernels.
		</para>
		</section>
		<section id="source_xml">
		<title>HOWTO source is xml</title>
		<para>
The HOWTO was originally written in sgml. It is now xml.
The char '&amp;' found in C source code
has to be written as &amp;amp; in sgml.
If you swiped patches from the sgml rather than the html rendering,
you would get code that needed to be edited to fix the &amp;.
Now that the HOWTO is in xml, this munging is not needed.
Although I've tried to remove all munged ampersands,
I expect some will persist for a while.
Ampersands in URLs still have to be munged.
		</para>
		</section>
		<section id="e-mail">
		<title>e-mail addresses in the HOWTO are spam protected</title>
		<para>
Well we hope so anyhow.
		</para>
		<para>
An article on <ulink url="http://www.neilgunton.com/spambot_trap/">spambots</ulink>
describes robots which ignore the robots.txt file and scan for e-mail addresses
in readable files on websites.
The author suggests removing any 'mailto:' strings and spam protecting e-mail addresses,
by changing them from machine-readable to human-readable format.
If you have a better scheme than the one implemented here (and I can do it with vi), let me know.
		</para>
		<para>
(May 2002): BTW, 160 people have contributed to the HOWTO
(as judged by unique e-mail addresses).
		</para>
		</section>
		<section id="links">
		<title>Links die frequently</title>
		<para>
There are links to 180 urls in this HOWTO (May 2002),
which came from postings to the LVS mailing list.
If people move/rename/delete/change their webpages/links once a year,
then I'm going to have to track down 15 websites each month.
If a site is gone and it isn't in google, I'm not going to be able to find it.
		</para>
		</section>
	</section>
	<section id="nomenclature">
	<title>Nomenclature/Abbreviations</title>
	<para>
If you use these terms when you mail us, we'll know what you're talking about.
	</para>
		<section id="preferred_names">
		<title>Preferred names</title>
		<itemizedlist>
			<listitem>
<emphasis>IPVS,ipvs,ip_vs</emphasis>
the code that patches the linux kernel on the <emphasis>director</emphasis>.
			</listitem>
			<listitem>
<emphasis>LVS, linux virtual server</emphasis>
This is the <emphasis>director</emphasis> + <emphasis>realservers</emphasis>.
Together these machines are the <emphasis>virtual server</emphasis>,
which appears as one machine to the <emphasis>client(s)</emphasis>.
			</listitem>
			<listitem>
<emphasis>director</emphasis>: the node that runs the <emphasis>ipvs</emphasis> code.
<emphasis>Clients</emphasis> <emphasis>connect</emphasis> to the <emphasis>director</emphasis>.
The <emphasis>director</emphasis> <emphasis>forwards</emphasis> packets to the realservers.
The <emphasis>director</emphasis> is nothing but an IP router with special rules
that make the <emphasis>LVS</emphasis> work.
			</listitem>
			<listitem>
<emphasis>realservers</emphasis>: the hosts that have the <emphasis>services</emphasis>.
The <emphasis>realservers</emphasis> handle the requests from the clients.
			</listitem>
			<listitem>
<emphasis>client</emphasis>: the host or user level process that connects to the <emphasis>VIP</emphasis>
on the <emphasis>director</emphasis>.
			</listitem>
			<listitem>
<emphasis>forwarding method</emphasis>
(currently <xref linkend="LVS-HOWTO.LVS-NAT"/>,
<xref linkend="LVS-HOWTO.LVS-DR"/>,
<xref linkend="LVS-HOWTO.LVS-Tun"/>).
The <emphasis>director</emphasis>
is a router with somewhat different rules for forwarding
packets than a normal router.
The <emphasis>forwarding method</emphasis>
determines how the <emphasis>director</emphasis>
sends packets from the <emphasis>client</emphasis>
to the <emphasis>realservers</emphasis>.
			</listitem>
			<listitem>
<emphasis>scheduling</emphasis> (<xref linkend="LVS-HOWTO.ipvsadm"/>) -
the algorithm the <emphasis>director</emphasis> uses to select a
<emphasis>realserver</emphasis> to service a new connection request
from a <emphasis>client</emphasis>.
			</listitem>
		</itemizedlist>
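		<para>
As a rough sketch of how these terms map onto <command>ipvsadm</command>:
the scheduler is chosen when a virtual service is created,
and the forwarding method is chosen for each realserver as it is added.
The addresses and port below are hypothetical placeholders, not values from a real setup.
		</para>
<programlisting><![CDATA[
# run on the director; VIP, RIP1 and port 80 are made-up example values
VIP=192.168.1.110
RIP1=10.1.1.2

# create a virtual service on the VIP, scheduled round robin (rr)
ipvsadm -A -t $VIP:80 -s rr

# add a realserver with ONE of the three forwarding methods
ipvsadm -a -t $VIP:80 -r $RIP1:80 -m      # LVS-NAT (masquerading)
# ipvsadm -a -t $VIP:80 -r $RIP1:80 -g    # LVS-DR  (direct routing)
# ipvsadm -a -t $VIP:80 -r $RIP1:80 -i    # LVS-Tun (ipip tunnelling)
]]></programlisting>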
		</section>
		<section id="synonyms">
		<title>synonyms</title>
		<para>
Please use the first term in these lines. The other words are valid but
less precise (or are redundant).
		</para>
		<itemizedlist>
			<listitem>
<emphasis>director</emphasis>: load balancer, dispatcher, redirector.
			</listitem>
			<listitem>
<emphasis>realserver</emphasis>: servers, realservers, real-servers.
			</listitem>
			<listitem>
<emphasis>LVS</emphasis>: the whole cluster, the (linux) virtual server (LVS)
			</listitem>
		</itemizedlist>
		</section>
		<section id="virtual_services">
		<title>virtual services, scheduling groups</title>
		<para>
Here's the <command>ipvsadm</command> output of an LVS serving telnet and squid.
		</para>
<programlisting><![CDATA[
director:/etc/rc.d# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:squid rr
  -> rs1.mack.net:squid        Route   1      0          0
  -> rs2.mack.net:squid        Route   1      0          0
  -> rs3.mack.net:squid        Route   1      0          0
TCP  lvs.mack.net:telnet rr
  -> rs1.mack.net:telnet       Route   1      0          0
  -> rs2.mack.net:telnet       Route   1      0          0
]]></programlisting>
		<para>
In the above LVS, there are two
<emphasis>virtual services</emphasis>, telnet and squid.
There are also two <emphasis>virtual servers</emphasis>;
a <emphasis>virtual server</emphasis> for telnet (which has 2 realservers)
and a <emphasis>virtual server</emphasis> for squid (which has 3 realservers).
This is what the client sees; two services (and two servers).
		</para>
		<para>
Connections to each
<emphasis>virtual server</emphasis> are <emphasis>scheduled</emphasis> (here by "rr", round robin),
to the realservers which belong to the <emphasis>scheduling group</emphasis>.
Here the <emphasis>scheduling group</emphasis> for telnet is rs1,rs2.
The <emphasis>scheduling group</emphasis> for squid is rs1,rs2,rs3.
Connections to the telnet <emphasis>virtual server</emphasis> are scheduled independently
of connections to the squid <emphasis>virtual server</emphasis>.
		</para>
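		<para>
For illustration, commands along the following lines would build a table like the one above.
This is a sketch rather than the commands actually used:
it assumes that telnet is on port 23 and squid on its usual default of 3128,
and that the "Route" entries in the listing mean the realservers were added with direct routing (<command>-g</command>).
		</para>
<programlisting><![CDATA[
# run on the director; host names are taken from the listing above,
# port numbers are assumptions (telnet=23, squid=3128)

# two virtual services, each scheduled round robin
ipvsadm -A -t lvs.mack.net:3128 -s rr
ipvsadm -A -t lvs.mack.net:23 -s rr

# scheduling group for squid: rs1, rs2, rs3
ipvsadm -a -t lvs.mack.net:3128 -r rs1.mack.net:3128 -g -w 1
ipvsadm -a -t lvs.mack.net:3128 -r rs2.mack.net:3128 -g -w 1
ipvsadm -a -t lvs.mack.net:3128 -r rs3.mack.net:3128 -g -w 1

# scheduling group for telnet: rs1, rs2
ipvsadm -a -t lvs.mack.net:23 -r rs1.mack.net:23 -g -w 1
ipvsadm -a -t lvs.mack.net:23 -r rs2.mack.net:23 -g -w 1
]]></programlisting>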
		<para>
The above nomenclature can be extended for <xref linkend="LVS-HOWTO.fwmark"/>.
		</para>
		</section>
		<section id="scheduled_unit">
		<title>scheduling instance, scheduled unit, virtual connection</title>
		<para>
We don't have a good name for this. Suggestions welcome.
(We also don't talk much about this concept on the mailing list,
so we've done without a name).
		</para>
		<para>
The director needs to know how to schedule packets from the client to
the realservers.
The smallest unit for LVS is a tcpip connection,
<emphasis>i.e.</emphasis> all
packets that are part of a single tcpip session from a client
will be sent to the same realserver.
For a tcp virtual service, each tcp connection is scheduled separately,
with the first tcp connection going to one realserver,
and the next tcp connection going to whichever realserver
the scheduler selects next.
The <emphasis>virtual connection</emphasis> is the same as the tcp connection.
		</para>
		<para>
For a <xref linkend="LVS-HOWTO.persistent_connection"/>
all tcp connections that are separated by less than the timeout period
are regarded as belonging to the same <emphasis>virtual connection</emphasis> and
are scheduled to the same realserver.
		</para>
		<para>
For udp, there is no such thing as a connection or session and
all packets from the client within a timeout period are scheduled to the
same realserver. (People aren't using LVS for udp services a whole lot).
The <emphasis>virtual connection</emphasis> then is all udp packets from a client
within a certain time period.
		</para>
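		<para>
The three cases above (plain tcp, persistent tcp and udp) might be set up roughly as follows.
The VIP, the ports and all the timeout values are hypothetical examples chosen for illustration.
		</para>
<programlisting><![CDATA[
# run on the director; the VIP, ports and timeouts are example values only
VIP=192.168.1.110

# ordinary tcp service: each tcp connection is scheduled separately
ipvsadm -A -t $VIP:23 -s rr

# persistent tcp service: connections from one client that are separated
# by less than 600 seconds are sent to the same realserver
ipvsadm -A -t $VIP:443 -s rr -p 600

# udp service: packets from a client are grouped by a timeout
ipvsadm -A -u $VIP:53 -s rr

# set the tcp, tcp-fin and udp timeouts (in seconds)
ipvsadm --set 900 120 300

# list the connection entries the director is currently tracking
ipvsadm -L -n -c
]]></programlisting>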
		</section>
		<section id="multi-tier_servers">
		<title>backend (multi-tier) servers</title>
		<para>
The <emphasis>realservers</emphasis> sometimes are frontends
to other <emphasis>backend</emphasis> servers.
The <emphasis>client</emphasis> does not connect
to these <emphasis>backend</emphasis> servers
and they are not in the <command>ipvsadm</command> table.
		</para>
		<para>
<emphasis>e.g.</emphasis>
		</para>
		<itemizedlist>
			<listitem>
a <emphasis>realserver</emphasis> may run a web application.
The web application in turn connects to a database
on another <emphasis>backend</emphasis> server.
			</listitem>
			<listitem>
a webcaching <emphasis>realserver</emphasis> (<emphasis>e.g.</emphasis> a squid).
The squid connects to <emphasis>backend</emphasis> webserver(s).
			</listitem>
		</itemizedlist>
		<para>
These <emphasis>backend</emphasis> servers are setup separately from the LVS.
		</para>
		</section>
		<section id="server_ambiguous"><title>the term "the server" is ambiguous</title>
		<para>
People sometimes call the <emphasis>director</emphasis> or the <emphasis>realservers</emphasis>,
"the server".
Since the whole LVS appears as a server to the <emphasis>client</emphasis>
and since the <emphasis>realservers</emphasis> are also serving services,
the term "server" is ambiguous.
Do not use the term "the server" or "the lvs server" when talking about LVS.
Most often you are referring to the "director" or the "realservers".
Sometimes (<emphasis>e.g.</emphasis> when talking about throughput)
you are talking about the (whole) virtual server.
		</para>
		<para>
I use "realserver" as I despair of finding a reference to a "realserver"
in a webpage using the search keys "real" and "server".
Horms and I (for reasons that neither of us can remember) have been
pushing the term "real-server" for about a year, on the mailing list,
and no-one has adopted it. We're going back to "realserver".
		</para>
		</section>
		<section id="names_of_IPs">
		<title>names of IPs/networks in an LVS</title>
		<para id="lvs-diagram">
		</para>
<programlisting><![CDATA[
                        ________
                       |        |
                       | client | (local or on internet)
                       |________|
                          CIP
                           |
--                      (router)
                          DGW
                           | outside network
                           |
L                         VIP
i                      ____|_____
n                     |          | (director can have 1 or 2 NICs)
u                     | director |
x                     |__________|
                      DIP (and PIP)
V                          |
i                          | DRIP network
r         ----------------------------------
t         |                |               |
u         |                |               |
a        RIP1             RIP2            RIP3
l    _____________   _____________   _____________
    |             | |             | |             |
S   | realserver1 | | realserver2 | | realserver3 |
e   |_____________| |_____________| |_____________|
r
v
e
r
---
]]></programlisting>
<para>
The router has traditionally not been considered part of the LVS, because
often you do not have control over the router. However if you're a paying
customer, then the ISP will be glad to set up the router according to
your specifications. If you have access to the router, it can solve
<xref linkend="LVS-HOWTO.arp_problem"/> and can install filter rules.
</para>
<para>
Here are the names we use for the various IPs.
If you use them when asking questions on the mailing list,
we'll be able to answer your questions more easily.
</para>
<programlisting><![CDATA[
client IP     = CIP
virtual IP    = VIP - the IP on the director that the client connects to
director IP   = DIP - the IP on the director in the DIP/RIP (DRIP) network
   (this is the realserver gateway for LVS-NAT)
realserver IP = RIP (and RIP1, RIP2...) the IP on the realserver
director GW   = DGW - the director's gw (only needed for LVS-NAT)
   (this can be the realserver gateway for LVS-DR and LVS-Tun)
]]></programlisting>
		<para>
The VIP and DIP are setup as secondary IPs,
(<emphasis>i.e.</emphasis>
there is another primary IP on that NIC),
so they can be moved to another duplicate director
following director failover.
For initial setup with a single director,
setting up the VIP and DIP as secondary IPs will make the
transition to a failover setup easier.
		</para>
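		<para>
A minimal sketch of adding the VIP and DIP as secondary IPs with <filename>iproute2</filename> follows.
The interface names, labels and addresses are assumptions;
it also assumes each NIC already has a primary IP in the same /24,
which is what makes the added address a secondary one.
		</para>
<programlisting><![CDATA[
# run on the director; eth0 faces the outside network, eth1 the DRIP network
# (interface names, labels and addresses are hypothetical)
ip addr add 192.168.1.110/24 dev eth0 label eth0:vip    # VIP
ip addr add 10.1.1.1/24 dev eth1 label eth1:dip         # DIP

# the new addresses should be listed as "secondary"
ip addr show dev eth0
ip addr show dev eth1

# on failover the outgoing director releases the addresses,
# and the other director adds them with the commands above
ip addr del 192.168.1.110/24 dev eth0
ip addr del 10.1.1.1/24 dev eth1
]]></programlisting>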
		<para>
For a two director LVS (where directors failover),
the IPs on the <link linkend="drip">DRIP network</link> are
		</para>
<programlisting><![CDATA[
primary director IP	= PIP (the director which will be the master on bootup)
secondary director IP	= SIP (the director which will be the backup on bootup)
]]></programlisting>
		<para>
The DIP will be on the same NIC as PIP on bootup and will move to the
same NIC as SIP on director failover.
		</para>
		<para>
We don't seem to need a name for the primary IP on the outside of the director
- no-one ever talks about it.
		</para>
		<para>
We don't often need to explicitly name the networks in an LVS, but
here are some suggestions
		</para>
		<itemizedlist>
			<listitem>
				<para id="drip">
<emphasis role="bold">DRIP network</emphasis>: the network containing the DIP
and RIPs. (OK you come up with a better name.)
				</para>
			</listitem>
			<listitem>
				<para>
<emphasis role="bold">network facing the internet</emphasis>
or the <emphasis role="bold">outside network</emphasis>: the network
on the director which receives packets from the outside world.
This shouldn't be called the VIP network,
as the VIP is also in the DRIP network (but not replying to arp calls)
on the realservers in LVS-DR and LVS-Tun.
				</para>
			</listitem>
		</itemizedlist>
		</section>
	</section>
	<section id="minimal_knowledge">
	<title>Minimal knowledge required</title>
	<para>
The mailing list and HOWTOs cover information specific to LVS.
The rest you have to handle yourself.
All of us knew nothing about computers when we first started,
we learnt it, and you can too (we're not saying it's easy).
If you can't setup a simple LVS from the
<link linkend="mini-HOWTO">LVS-mini-HOWTO</link>,
without breaking into a major sweat
(or being able to tell us what's wrong with the instructions),
then you need to do some more homework.
(Also see 
<ulink url="http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/#doesnt_work">Help! My LVS doesn't work</ulink>.)
	</para>
	<blockquote>
	<para>
Ratz <emphasis>ratz (at) tac (dot) ch</emphasis>
	</para>
	<para>
To be able to setup/maintain an LVS, you need to be able to
	</para>

	<itemizedlist>
		<listitem>
know how to patch and compile a kernel
		</listitem>
		<listitem>
the basics of shell-scripting
		</listitem>
		<listitem>
have intermediate knowledge of TCP/IP
		</listitem>
		<listitem>
have read the man-page, the online-documentation and LVS-HOWTO (this document)
(and the
<link linkend="mini-HOWTO">LVS-mini-HOWTO</link>)
		</listitem>
		<listitem>
know basic system administration (<emphasis>e.g.</emphasis> iptables; syslog; find, compile,
install code from source files; use cpan to find perl modules).
		</listitem>
	</itemizedlist>
	</blockquote>
	</section>
	<section id="getting_technical_help">
	<title>Free Technical Help</title>
	<para>
All of the people on the LVS mailing list are replying for free in 
their spare time. The best we can do is to give 
solutions to technical problems on setting up and running 
LVS. I give about 15secs to a posting to decide if I've got 
something useful to say. The posting has to indicate that 
the person has analysed the problem to a stage where an 
answer exists. If _they_ can't describe the problem, 
there's no point in replying - they won't understand the answer.
	</para>
	<para>
Please don't e-mail me privately with general questions
(feel free to cc: me if you want).
The mailing list will archive your question
and the answer(s) which can be retrieved later.
Other people may have more interesting,
relevant or useful comments than I will.
If you are writing to me in the hopes of avoiding the humiliation
of publicly showing your ignorance on the mailing list, it's not going to happen.
We've had too many good ideas from "ignorant" people to let this happen.
If your question has been answered many times before 
and it's in the HOWTO and the archives,
you'll be told to read the HOWTO, that's all.
	</para>
	<para>
To get technical help:
	</para>
	<itemizedlist>
		<listitem> 
Read the docs on the website, the HOWTOs, and search the mailing list archives. 
The HOWTO (at the top) has a link to a search engine of all known LVS documentation. 
It will probably return several webpages.
You'll have to find the entry from there.
		</listitem>
		<listitem> The
<link linkend="mini-HOWTO">
LVS-mini-HOWTO</link>
shows you how to setup a simple 3 node (client, director, realserver)
LVS without you needing to understand a whole lot about how an LVS works.
		</listitem>
		<listitem>
after you've done a search of the docs, then post to the mailing list.
		</listitem>
		<listitem> updates/problems/bugs - post to the mailing list 
		</listitem>
	</itemizedlist>
	<para>
Jakub Suchy <emphasis>jakub (at) rtfm (dot) cz</emphasis> 13 Jan 2005
	</para>
	<para>
Please read:
<ulink url="http://www.catb.org/~esr/faqs/smart-questions.html">smart questions</ulink>
(http://www.catb.org/~esr/faqs/smart-questions.html)
before asking questions.
	</para>
	<para>
Please only post relevant lines of a debug dump.
If you post the whole dump, because you don't understand it,
then it will fill up the archive machine and everyone's mail box. 
If we need the whole debug, we'll ask for it and you can send it to us off-list.
	</para>
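	<para>
As a suggestion only, commands like these usually produce output short enough to post.
The interface, VIP and port are hypothetical; substitute the ones for the service that is failing.
	</para>
<programlisting><![CDATA[
# run on the director; eth0, the VIP and port 23 are example values

# the ipvsadm table, with numeric addresses and ports
ipvsadm -L -n

# 20 packets for one service on the VIP, rather than a whole capture
tcpdump -n -i eth0 -c 20 host 192.168.1.110 and tcp port 23

# the last few kernel messages that mention IPVS
dmesg | grep -i ipvs | tail -n 20
]]></programlisting>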
		<section id="problem_people_1">
		<title>Problem people 1</title>
		<para>
It's hard to believe, but we get postings like
		</para>
		<blockquote>
recompiling the kernel is hard (or I don't read HOWTOs),
can't you guys cut me some slack and just tell me what to do?
		</blockquote>
		<para>
I expect the people who post these statements don't read this HOWTO,
so I may be wasting my time, but - No.
The people on the mailing list answer questions for free,
and have other important things to do, like keeping up with /. 
and checking our e-mail.
When we're at home, we drink beer and watch Gilligan's Island re-runs.
		</para>
		</section>
		<section id="problem_people_2">
		<title>Problem people 2</title>
		<blockquote>
can anybody tell me how to setup a windows realserver?
thank you very much! I'm in a  hurry.
		</blockquote>
		<para>
<emphasis>robert (dot) gehr (at) web2cad (dot) de</emphasis>
		</para>
		<para>
I can't think of anyone who has set up lvs in a hurry :-)
		</para>
		</section>
		<section id="RedHat">
		<title>Problem People 3: People using RedHat LVS</title>
		<para>
RedHat have LVS in their standard distribution kernel.
This gives people the idea that they can setup
LVS from their standard RedHat distribution just by clicking on a few
buttons or running some scripts.
From reading the postings to the mailing list,
it's more difficult than doing it our way.
You still have to understand LVS and then afterwards,
you have to figure out what RedHat did to it.
One of the major wastes of time
and source of aggravation for me personally on the LVS mailing list,
is postings from people using RedHat LVS who assume that it's the same as LVS,
and who post as if they're using our setup methods.
Just saying that you're using a RedHat distribution doesn't tell us anything,
since you can setup LVS our way in RedHat.
Things you need to know before you post -
		</para>
		<itemizedlist>
			<listitem>
There are reasons for wanting to setup LVS in a standard RedHat distribution
(<emphasis>e.g.</emphasis> RedHat is "approved" in your location whereas "Linux" isn't).
			</listitem>
			<listitem>
There is information in this HOWTO (<xref linkend="pbs_nutshell"/>)
and in the various links from here which show you how to setup RedHat LVS.
			</listitem>
			<listitem>
We have a method of setting up LVS which works for all distributions (including RedHat).
We are not interested in learning, understanding, debugging, supporting or fixing
a setup method that only works for one distribution.
			</listitem>
			<listitem>
RedHat don't talk to us about what they do and while
they may monitor the LVS mailing list,
rarely (only about once a year, that I can tell)
do they reply to people having problems with RedHat LVS.
It appears that RedHat does not think their version of LVS worthy of much support
and I agree with them.
			</listitem>
			<listitem>
If you setup LVS the RedHat way,
you still need an understanding
of how an LVS works and is setup (just like everyone else),
before posting to the mailing list.
			</listitem>
		</itemizedlist>
		<para>
If you are setting up with RedHat and want help with it,
make sure that you describe what you've done,
that you're using the RedHat files and how you've set it up,
otherwise we'll assume that you're setting up using our methods.
		</para>
		</section>
		<section id="why_you_may_not_get_an_answer">
		<title>Why you may not get an answer</title>
		<itemizedlist>
			<listitem>
				<para>
no-one knows.
				</para> 
				<para>
The <xref linkend="LVS-NAT_ftp_bug"/> took a long time to figure out. 
Since no-one else had seen the problem, we didn't know at first if it was a problem with LVS. 
It wasn't till 6 months later, when someone else had the same symptom,
and found that it only occurred when the ftp helper module was loaded, 
that we could do something.
				</para>
				<para>
I once needed to do something with <filename>iproute2</filename> 
that I spent about 3 weeks trying to figure out. 
No-one on the list knew the answer. 
I had to post off-line to someone who could figure it out for me.
				</para>
			</listitem>
			<listitem>
				<para>
We may not have a useful answer.
				</para>
				<para>
If you post saying "I want to build an LVS with (list of hardware);
do you think it will work?", all we can say is "probably".
				</para>
				<para>
Often when questions like this come up, 
there are people who are happy to share their experiences, 
so there's no harm in posting such a question.
In general the people who've been working with LVS for years will expect 
you to have read the docs and know what LVS does before you post. 
In the time I allot for a reply, 
I don't have time to figure out whether in your case LVS is best for you 
- you should pay a consultant to do this if you can't do it yourself.
				</para>
			</listitem>
			<listitem>
				<para>
Your question may not be well posed.
				</para>
				<para>
We are reading the postings in our spare time. 
You will get at most 30secs of attention before we figure out whether 
we can help you, an answer will take a bit of thinking, or we can't help you.
				</para>
				<para>
If you have a long posting in which you haven't figured out which parts
are causing the problem and which parts are working, then we aren't
going to try to figure it out either. 
Post the minimum setup that will produce the problem.
				</para>
			</listitem>
			<listitem>
It's obvious that you haven't read the HOWTO.
			</listitem>
		</itemizedlist>
		</section>
		<section id="edit_posts">
		<title>Edit your posts! (top, bottom and in-line posting)</title>
		<para>
Please edit the posting you're replying to, leaving only the parts relevant to your reply.
We don't need to see material from previous posts irrelevant to the current posting,
and the disk archive doesn't either.
		</para>
		<para>
Reply in-line, <emphasis>i.e.</emphasis> following each statement by the poster.
Here's a posting on the subject from one of the kernel mailing lists.
		</para>
		<para>
Greg KH <emphasis>greg (at) kroah (dot) com</emphasis> 16 Nov 2005 
		</para>
<programlisting><![CDATA[
A: http://en.wikipedia.org/wiki/Top_post
Q: Were do I find info about this thing called top-posting?
A: Because it messes up the order in which people normally read text.
Q: Why is top-posting such a bad thing?
A: Top-posting.
Q: What is the most annoying thing in e-mail?
A: No.
Q: Should I include quotations after my reply?
]]></programlisting>
		</section>
	</section>
	<section id="after_youve_got_help">
	<title>After you've Got Technical Help</title>
	<para>
In most cases when a problem is solved, there's enough info on the mailing list
to see how it worked and we can write it up here for the next people. 
Occasionally, we get a posting "I've worked it out. Thanks for the help."
When this happens we have no idea what the solution was 
and will have to reinvent it for the next person.
	</para>
	<para>
If you've got help from the unpaid people on the mailing list, 
who've given their spare time to help you, 
when they could instead have been watching Gilligan's Island reruns, 
please write it up for the HOWTO. 
When I write to people asking for their solution
I don't want to hear that you're busy and have a job. 
We're busy, have jobs, kids, homework to do and tax forms to fill in 
and we stopped what we were doing to help you. 
Here's a template.
	</para>
	<itemizedlist>
		<listitem>
what you wanted to do
		</listitem>
		<listitem>
why/how it didn't work
		</listitem>
		<listitem>
what you needed to do to get it to work
		</listitem>
		<listitem>
how the solution works
		</listitem>
	</itemizedlist>
	</section>
	<section id="paid_technical_help">
	<title>Paid technical help</title>
	<note>
		<para>
We occasionally get requests for people to do an install.
The listing is a service to people looking for paid technical help 
(installs or anything else)
and does not imply that I (Joe) or anyone connected
to the LVS project endorse the services of the listees.
If you want to know more about them, 
check their postings to the LVS mailing list.
		</para>
		<para>
Entries will be listed at no cost, in approximate order of the date I receive/post them. 
Entries will be listed for at least a year 
(HOWTOs come out at erratic intervals and new entries will be added/old entries 
deleted whenever the next HOWTO comes out). 
If you want to be listed again next year, send me an e-mail in a year. 
I'm too busy to keep much of an eye on what goes in here
and your entry may stay longer than a year.
If you really want people to know who you are, 
don't rely on this entry - make sure google knows about you.
		</para>
		<para>
To be listed, send me off-list
		</para>
		<blockquote>
your URL (<emphasis>e.g.</emphasis> &lt;http://www.foo.org&gt;The Foo LVS service centre&lt;/a&gt;) 
and/or e-mail, then a blurb of up to 80 chars <emphasis>e.g.</emphasis> "We do it all", 
optionally including your location.
		</blockquote>
		<para>
this will be minimum maintenance - I'm just going to mouse swipe your e-mail 
(<emphasis>i.e.</emphasis> don't plan on changing your URL in the year).
		</para>
		</note>
	<para>
People available for paid technical help.
	</para>
	<itemizedlist>
		<listitem>
			<para>
Oct 2007: http://www.dotnoc.com - solutions for hosting sales@dotnoc.com. Linux load balancing and networking specialists
			</para>
		</listitem>
		<listitem>
			<para>
Oct 2007: Loadbalancer.org Ltd (http://www.loadbalancer.org/) - Specialise in high
availability load balancers based on LVS. Happy for customers to have full
access to the OS and source code and offer 24*7 support. However we don't do
consultancy on home brew implementations. UK and USA offices.
			</para>
		</listitem>
		<listitem>
			<para>
Oct 2007: http://www.netdigix.com Linux solutions for business.
contact@netdigix.com.
We specialize in Linux networking and setup of LVS for hosting and mission
critical infrastructures.
Canada:British Columbia:Lower Mainland:Vancouver
			</para>
		</listitem>
	</itemizedlist>
	</section>
	<section id="subscribing">
	<title>Mailing list: subscribing, unsubscribing, searching </title>
	<para>
Thanks to Hank Leininger for the mailing list archive which is searchable not
only by subject and author, but also by strings in the body.
Hank's resource has been of great help in assembling this HOWTO.
	</para>
	<para>
The <ulink url="http://www.linuxvirtualserver.org/mailing.html">mailing list</ulink>
is available for further questions.
A single mailing list handles developers, new
users and old users and has about 0-20 postings a day.
You don't have to join the mailing list to read the archives.
If you want to post questions, then you have to join.
If you aren't subscribed and you post (or you post from
an unsubscribed address),
you'll get a reply saying that your posting is
"awaiting moderator approval".
It isn't; because of the volume of spam,
we no longer review these messages - they're deleted.
	</para>
	</section>
	<section id="problem_report">
	<title>
	Mailing list: posting to
	</title>
	<para>
Please send e-mail with straight ascii (not html)
and turn line-wrap on (some mails come with each
paragraph on a single long line).
	</para>
	<para>
If you're stuck with posting from a Windows machine
or Lotus Notes, or using Lookout, 
where each paragraph is sent as one line: 
	</para>
	<blockquote>
		<para>
Francois JEANMOUGIN <emphasis>Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com</emphasis> 09 Jul 2004
		</para>
<programlisting><![CDATA[
System manager -> Global Settings -> Internet Message Format -> Default (or
the one used) -> Advanced -> word wrap 
]]></programlisting>
		<para>
as shown in
<ulink url="http://www.lemis.com/email/fixing-outlook.html">
fixing outlook</ulink>
(http://www.lemis.com/email/fixing-outlook.html)
especially in
<ulink url="http://www.lemis.com/email/exchsrvr-wordwrap.gif">
word wrap</ulink>
(http://www.lemis.com/email/exchsrvr-wordwrap.gif),
but this is a very old version of Exchange.
		</para>
	</blockquote>
	<para>
Please don't let your vacation message, intended only for your workmates,
be sent in reply to messages from a list,
<emphasis>e.g.</emphasis>
	</para>
<programlisting><![CDATA[
I will be out of the office starting  07/30/2004 and will not return until 08/03/2004.
]]></programlisting>
	<para>
The LVS mailing list doesn't want to know.
	</para>
	<para>
Dan Moljar Aug 2004
	</para>
	<para>
For Lotus Notes:
The client is not configured correctly.
In the 'Out of Office' enable dialog under the 'Exceptions' tab, 
there is a check box for 'Do not reply to Internet Addresses'. 
Check it.
The server shouldn't do it to begin with, 
but you can make the client stop.
	</para>
	<para>
There are always new ideas and questions being posted on the mailing list.
We don't expect this to stop.
There are many complexities to LVS and we don't expect
new people to understand any more about LVS than we did when we started.
No-one is expected to know/understand everything
in the docs, but your questions will be better received
if you've done your homework,
if you have set up the test configurations here,
have at least perused this HOWTO (yes we know it's big),
and have looked at the
<ulink url="http://www.linuxvirtualserver.org/mailing.html">mail archives</ulink>.
We can't help you if you just tell us that you've read the documents and your LVS
doesn't work.
To you, all problems look the same ("it doesn't work").
To help you, we need more information.
We at least need the forwarding method,
the service(s) being forwarded, the number of networks
and the output of ipvsadm in the problem state.
	</para>
	<para>
Before you come up on the mailing list -
	</para>
	<itemizedlist>
		<listitem>
Read the LVS-HOWTO (this document) and the
<link linkend="mini-HOWTO">LVS-mini-HOWTO</link>
		</listitem>
		<listitem>
		<para>
Set up a simple LVS (3 nodes: client, director, realserver)
with LVS-DR or LVS-NAT forwarding,
with the service telnet using the instructions in the LVS-mini-HOWTO.
You should be able to do this starting from a
freshly downloaded kernel from ftp.kernel.org and the LVS patches
(ipvs and the hidden patch if you have 2.4.x realservers).
		</para>
		<para>
<emphasis role="bold">Don't</emphasis> 
set up first with http, with filter rules, with firewalls, with complicated
file systems (<emphasis>e.g.</emphasis> coda, nfs) or network accelerators
- debug all these nifty things after you have LVS working with telnet
and with no filter rules.
		</para>
		<para>
<emphasis role="bold">Do</emphasis> use standard compilers (gcc-2.95.3), tools
and utilities (<command>ifconfig</command> or <filename>iproute2</filename>).
		</para>
		<para>
<emphasis role="bold">Do not</emphasis> use non-standard tools particular to a distribution
designed to capture market share (<emphasis>e.g.</emphasis> <command>ifup</command>).
		</para>
		</listitem>
		<listitem>
If you are using one of the packages that can be used with LVS
(<emphasis>e.g.</emphasis>
heartbeat from the Linux HA project http://www.henge.com/&#126;alanr/ha,
or piranha from Redhat),
we may know what the problem is,
but it's their developers, not us, who need the feedback that you can't get it to work.
Many of us are on each others'
mailing lists and we try to help when we can,
but the best people to handle the problem are the developers for each package.
		</listitem>
		<listitem>
Consult the
<ulink url="http://marc.info/?l=linux-virtual-server&amp;r=1&amp;w=2">
LVS mailing list archives</ulink>.
		</listitem>
		<listitem>
Use our jargon as best you can.
The machine names will be client, director, realserver1, realserver2...
IPs are CIP, VIP, RIP, DIP.
If you do this,
we won't have to translate "susanne" and "annie" to their
functional names as we scan your posting.
		</listitem>
		<listitem>
We need to know your kernel
(<emphasis>e.g.</emphasis> 2.2.14),
the ip_vs patch that was applied to it (<emphasis>e.g.</emphasis> 0.9.11),
and whether you are using LVS-DR, LVS-NAT or LVS-Tun.
Tell us
		<itemizedlist>
			<listitem>
what you did
			</listitem>
			<listitem>
what you expected
			</listitem>
			<listitem>
what you got and why that's a problem
			</listitem>
		</itemizedlist>
		</listitem>
	</itemizedlist>
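	<para>
As a rough guide to what the simple test LVS above amounts to on the director,
here is a minimal sketch for LVS-DR with telnet.
The addresses are placeholders, and the IPVSADM variable is a dry-run switch
invented for this sketch (it is not part of <command>ipvsadm</command>).
The realservers still have to be set up as described in the LVS-mini-HOWTO.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch only: minimal LVS-DR telnet service on the director.
# VIP and RIPs are placeholder addresses; substitute your own.
VIP=192.168.1.110
RIP1=192.168.1.11
RIP2=192.168.1.12

# Hypothetical dry-run switch: by default the commands are only printed.
# Run as root with IPVSADM=ipvsadm to apply them.
IPVSADM=${IPVSADM:-echo ipvsadm}

lvs_telnet_rules() {
    # $IPVSADM is left unquoted on purpose so "echo ipvsadm" splits into two words
    $IPVSADM -A -t "$VIP:23" -s rr              # add virtual service, round robin
    $IPVSADM -a -t "$VIP:23" -r "$RIP1:23" -g   # -g = direct routing (-m for LVS-NAT)
    $IPVSADM -a -t "$VIP:23" -r "$RIP2:23" -g
}

lvs_telnet_rules
]]></programlisting>
	<para>
Run as is, the script only prints the three commands.
Once they have been applied, <command>ipvsadm</command> -L -n on the director
should list the service and both realservers.
	</para>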
	<para>
If you don't understand your problem well,
here's a suggested submission format from Roberto Nibali
<emphasis>ratz (at) tac (dot) ch</emphasis>
	</para>
	<orderedlist>
		<listitem>
System information, such as kernel, tools and their versions.
		<para>
Example:
		</para>

<programlisting><![CDATA[
hog:~ # uname -a
Linux hog 2.2.18 #2 Sun Dec 24 15:27:49 CET 2000 i686 unknown

hog:~ # ipvsadm -L -n | head -1
IP Virtual Server version 1.0.2 (size=4096)

hog:~ # ipvsadm -h | head -1
ipvsadm v1.13 2000/12/17 (compiled with popt and IPVS v1.0.2)
]]></programlisting>
		</listitem>
		<listitem>
Short description and maybe sketch of what you intended to set up.
		<para>
Example for LVS-DR:
		</para>

<programlisting><![CDATA[
	o Using LVS-DR, gatewaying method.
	o Load balancing port 80 (http) non-persistent.
	o Network Setup:

                        ________
                       |        |
                       | client |
                       |________|
			   | CIP
                           |
			(router)
			   |
			   | GEP
                 (packetfilter, firewall)
                           | GIP
                           |       __________
                           |  DIP |          |
                           +------+ director |
                           |  VIP |__________|
                           |
         +-----------------+----------------+
         |                 |                |
     RIP1, VIP         RIP2, VIP        RIP3, VIP
    ____________      ____________    ____________
   |            |    |            |  |            |
   |realserver1 |    |realserver2 |  |realserver3 |
   |____________|    |____________|  |____________|


	CIP  = 212.23.34.83
	GEP  = 81.23.10.2	(external gateway, eth0)
	GIP  = 192.168.1.1	(internal gateway, eth1, masq or NAT)
	DIP  = 192.168.1.2	(eth0:1, or eth1:1)
	VIP1 = 192.168.1.110	(director: eth0:110, realserver: lo0:110)
	RIP1 = 192.168.1.11
	RIP2 = 192.168.1.12
	RIP3 = 192.168.1.13
	DGW  = 192.168.1.1	(GIP for all realserver)

	o ipvsadm -L -n

hog:~ # ipvsadm -L -n
IP Virtual Server version 1.0.2 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  192.168.1.110:80 wlc
  -> 192.168.1.13:80             Route   0      0          0
  -> 192.168.1.12:80             Route   0      0          0
  -> 192.168.1.11:80             Route   0      0          0
]]></programlisting>
		<para>
Include the output of ifconfig from all machines (abbreviated; we just need the
IP, netmask etc.) and the output of netstat -rn.
		</para>
		</listitem>

		<listitem>
What doesn't work.
Show some output from
<command>tcpdump</command>,
<command>ipchains</command>/<command>iptables</command>,
<command>ipvsadm</command> and
<filename>kernlog</filename>.
Later we may ask you for a more detailed configuration like routing table,
OS-version or interface setup on some machines used in your setup.
Tell us what you expected. Example:

<programlisting><![CDATA[
ipchains -L -M -n (2.2.x) or cat /proc/net/ip_conntrack (2.4.x)
echo 9 > /proc/sys/net/ipv4/vs/debug_level && tail -f /var/log/kernlog
tcpdump -n -e -i eth0 tcp port 80
route -n
netstat -an
ifconfig -a
]]></programlisting>
		<para>
<command>tcpdump</command> listings are difficult to read.
If you post one, please change the IPs to VIP, CIP, RIP1..n, DIP etc.
Since you'll likely be on a switched network, <command>tcpdump</command>
will only see packets to that NIC. Tell us which machine (director, realserver...)
and the NIC (if there are two NICs on the machine) that it was run on.
		</para>
		</listitem>
	</orderedlist>
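	<para>
The information asked for above can be gathered in one pass.
The following is only a sketch: the names OUT, report, collect and anonymize
are made up for it, and the addresses are placeholders that you would replace
with your own before running it on each machine.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: collect the requested outputs into one file, with real
# addresses replaced by their LVS role names (VIP, RIP1, ...).
OUT=${OUT:-lvs-report.txt}

# Replace placeholder addresses with role names. An address is only
# replaced when it is not followed by another digit, so 192.168.1.11
# does not clobber 192.168.1.110 or a broadcast like 192.168.1.255.
anonymize() {
    sed -e 's/$/ /' \
        -e 's/192\.168\.1\.110\([^0-9]\)/VIP\1/g' \
        -e 's/192\.168\.1\.11\([^0-9]\)/RIP1\1/g' \
        -e 's/192\.168\.1\.12\([^0-9]\)/RIP2\1/g' \
        -e 's/192\.168\.1\.2\([^0-9]\)/DIP\1/g' \
        -e 's/ $//'
}

# Run one command if it is installed and label its output.
report() {
    echo "### $*"
    if command -v "$1" >/dev/null 2>&1; then
        "$@" 2>&1
    else
        echo "($1 not installed)"
    fi
    echo
}

collect() {
    report uname -a
    report ipvsadm -L -n
    report ifconfig -a
    report netstat -rn
    report route -n
}

collect | anonymize > "$OUT"
echo "wrote $OUT"
]]></programlisting>
	<para>
The same filter can be applied to a capture, <emphasis>e.g.</emphasis>
<command>tcpdump</command> -n -e -i eth0 tcp port 80 | anonymize.
Say which machine and NIC each file came from when you post it.
	</para>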
	</section>
	<section id="bug_fixes">
	<title>Bug Fixes</title>
	<para>
It's wonderful to get an unsolicited bug fix.
Please let us know what it does and why it's better than the current file.
A new version of a file without any information about what it does,
or what it fixes isn't much use to us.
	</para>
	</section>
	<section id="other_solutions">
	<title>Other load balancing solutions, GPL, opensource and commercial</title>
		<section id="open_source_solutions">
		<title>Open Source and GPL solutions</title>
		<para>
Malcolm <emphasis>lists (at) loadbalancer (dot) org</emphasis> 23 Nov 2006 
		</para>
		<para>
Willy Tarreau's written a nice article 
<ulink url="http://1wt.eu/articles/2006_lb/">
http://1wt.eu/articles/2006_lb/ - Making applications scalable with Load Balancing</ulink>
on load balancing that covers layer 4 and layer 7 options.
I still don't think layer 7 can ever give high availability... but it's a good read.
		</para>
		<para>
Ratz 23 Nov 2006 
		</para>
		<para>
A very nice and to the point introduction. 
Willy, besides being a nice person and a good friend, is an
excellent engineer with a lot of expertise in highly available, 
high performance and secure web
services, networking and packet filtering and much more. 
It would be nice to have Willy contributing
to/on this list as well :).
		</para>
		<para>
Malcolm <emphasis>lists (at) loadbalancer (dot) org</emphasis> 01 Feb 2007
		</para>
		<para>
HAProxy <ulink url="http://haproxy.1wt.eu/">http://haproxy.1wt.eu</ulink>
is a tcp proxy (fast) but flexible enough to do cookie insertion and SNAT etc.
		</para>
		<para>
from <emphasis>lvs (at) spiderhosting (dot) com</emphasis>
<ulink url="http://dmoz.org/Computers/Software/Internet/Site_Management/Load_Balancing/">a list of load balancers</ulink>
		</para>
		<para>
Brent Cook <emphasis>busterb (at) mail (dot) utexas (dot) edu</emphasis> 28 Mar 2002
		</para>
		<blockquote>
There's the http://www.bsdshell.net/ HighUpTime (HUT) project (link dead Apr 2003).
It's FreeBSD.
		</blockquote>
		<para>
The HUT author, Sebastian Petit
<emphasis>spe (at) selectbourse (dot) net</emphasis> has joined the LVS mailing list.
		</para>
		<para>
For L7 Switching see the <link linkend="DRWS">DRWS project</link>.
		</para>
		<para>
Dec 2006: Alexandre Cassen, the author of <xref linkend="setup_keepalived"/> has written an 
L7 Switch at <ulink url="http://www.linux-l7sw.org">http://www.linux-l7sw.org</ulink>.
		</para>
		<para>
BSD load balancing:
		</para>
		<para>
Roberto Nibali <emphasis>ratz (at) tac (dot) ch</emphasis> 05 Nov 2003
		</para>
		<para>
As already mentioned by others, LVS will not work on FreeBSD as director due to
the kernel part. Using FreeBSD on the RS is of course ok.
The BSD folks have not yet shown much interest in adopting the LVS idea or parts
of the code.
If you're interested in load balancing and HA solutions under FreeBSD, you could
check out the following links:
		</para>
<programlisting><![CDATA[
http://www.bsdshell.net/hut_fvrrpd.html
http://www.backhand.org/wackamole/
http://unix.derkeiler.com/Mailing-Lists/FreeBSD/isp/2003-05/0026.html
http://redundancy.redundancy.org/fbsd_lb.html
]]></programlisting>

		<para>
Gavin Henry <emphasis>ghenry (at) suretecsystems (dot) com</emphasis> 13 Jun 2005 
		</para>
		<blockquote>
<ulink url="http://geminis.dyndns.org/wordpress/index.php/2005/06/12/loadbalancer-less-clusters-on-linux/">ClusterIP</ulink> by Harald Welte.
What is the list's view on it?
		</blockquote>
		<para>
Gavin Henry 
<emphasis>ghenry (at) suretecsystems (dot) com</emphasis> 13 Jun 2005
		</para>
		<para>
The man page for more recent versions of iptables says:
		</para>
		<blockquote>
CLUSTERIP: This module allows you to configure a simple
cluster of nodes that share a certain IP and MAC address
without an explicit load balancer in front of them
		</blockquote>
		<para>
Horms 
		</para>
		<para>
Been there, done that. Works, but is it necessary?
<ulink url="http://www.ultramonkey.org/papers/active_active/">LVS with up to 16 directors active</ulink>
(http://www.ultramonkey.org/papers/active_active/)
		</para>
		<para>
A set of postings on /. 2 Mar 2009 at
<ulink url="http://tech.slashdot.org/article.pl?sid=09/03/02/0231241">Best Solution for HA and Network Loadbalancing</ulink>
(http://tech.slashdot.org/article.pl?sid=09/03/02/0231241) lists the following
		</para>
		<itemizedlist>
			<listitem>
<ulink url="http://haproxy.1wt.eu/">HAProxy</ulink>
			</listitem>
			<listitem> 
<ulink url="http://distributor.sourceforge.net/">Distributor</ulink> 
			</listitem>
			<listitem> 
<ulink url="http://crossroads.e-tunity.com/">Crossroads</ulink> 
			</listitem>
			<listitem> 
<ulink url="http://siag.nu/pen/">Pen</ulink>
			</listitem>
			<listitem> 
<ulink url="http://www.inlab.de/balance.html">Balance/BalanceNG</ulink> 
			</listitem>
			<listitem> 
<ulink url="http://www.apsis.ch/pound/index_html">Pound</ulink>
			</listitem>
		</itemizedlist>
		</section>
		<section id="commercial_solutions">
		<title>List of Commercial Solutions</title>
		<para>
Cahya Wirawan <emphasis>cwirawan (at) email (dot) archlab (dot) tuwien (dot) ac (dot) at</emphasis> 19 Feb 2004
		</para>
		<blockquote>
I'm implementing proxy, smtp and webserver with LVS as local node,
and I have tested it and it's running fine, but because
someone from management section thinks that such an implementation
is easy (just run setup.exe and everything is installed and ready to use),
he pushed me to move the setup into production, 
and create another one as soon as possible.
I want to tell him that such an implementation is not a trivial thing 
and needs time to setup and to test before we go into production.
I want to show him a list of companies who have such complete solutions, 
so he can see the cost. 
Then he can understand that high availability and load balancing is not easy to setup,
and will cost a lot of money if we buy a complete solution.
		</blockquote>
		<para>
Vendors just rub their hands with glee on finding management like this
- see my
<ulink url="http://www.austintek.com/book_reviews/the_ibm_way.html">
review of the book &quot;The IBM Way&quot;</ulink>
(http://www.austintek.com/book_reviews/the_ibm_way.html) for
how IBM handles the situation.
		</para>
		<para>
Peter Mueller
		</para>
		<para>
Prices at this level are negotiable.  Who knows what you could pay?
		</para>
		<itemizedlist>
			<listitem>
http://www.cisco.com/ - the old man on the LB-gig.
			</listitem>
			<listitem>
http://www.f5.com/f5products/bigip/LB520/  - the second old man in the LB
gig.
			</listitem>
			<listitem>
http://www.suse.com/us/business/products/server/ - Suse has always been a
big player in the Linux-HA world.
			</listitem>
			<listitem>
http://www.redhat.com/software/rhel/purchase/ - they have clustering based
on LVS, not sure about price.  At this point you have to buy enterprise
edition (or http://www.whiteboxlinux.com) to use the clustering software.
			</listitem>
			<listitem>
http://www.ibm.com/ - always an option...
			</listitem>
			<listitem>
http://www.dell.com/ - moving up in the datacenter world.  I see lots of
Dells now..
			</listitem>
			<listitem>
http://www.ebay.com/ - see how much the gear is worth on the open market.
			</listitem>
			<listitem>
				<para>
http://www.linuxvirtualserver.org/ - $0.
				</para>
				<para>
There's plenty of people on list
who can help you and your boss feel more comfortable with your setup.  I'm
sure if you posted something some people would be willing to help make you
sleep better at night.  BTW, you know about the http://www.ultramonkey.org/
and http://www.keepalived.org/ projects, right?
				</para>
			</listitem>
		</itemizedlist>
		</section>
		<section id="radware" xreflabel="Radware">
		<title>Radware</title>
		<para>
Joe Oct 2005:
From a presentation by 
<ulink url="http://www.radware.com/">Radware</ulink>
(http://www.radware.com/)
given to <ulink url="http://www.ncsysadmin.org/">North Carolina Systems Administrators (NCSA)</ulink>
(http://www.ncsysadmin.org/) on 10 Oct 2005.
Unfortunately I was the guy getting the pizzas for the
meeting, so I missed most of the talk (which I wanted
to hear).
		</para>
		<para>
Radware is used by Ebay and Accuweather.
Radware has a NAT loadbalancing director that appears to 
function similarly to an LVS-NAT director. The servers can 
have private IPs.
		</para>
		<para>
Radware's loadbalancing director is only a small part of 
their offering. Radware have boxes that filter based on 
packet content (looking for viruses) that sit in the flow of 
packets (possibly before the director, possibly after - didn't 
find this out). They have boxes which just handle SYN 
floods. They use SYN cookies and do a statistical analysis 
of the packets, letting some through to see which machines 
reply to the SYN-ACKs. Radware has a gui to control the 
loadbalancer, which can do things like closing some of 
the backend servers to new connections at some time in the future 
(<emphasis>e.g.</emphasis> at 10pm later that night), 
so that by 8am next morning these machines 
have few or no connections and can be taken offline for 
servicing. Much of their hardware is ASIC based.
		</para>
		<para>
Health checking seems to be done from the director, and 
checks are made through to 3rd-Tier components of the 
backend servers (<emphasis>e.g.</emphasis> database machines 
behind the webservers that the client doesn't directly 
connect to).
		</para>
		<para>
Each local NAT'ed load balancing setup is itself a member of 
a distributed DNS-based load balancer. So www.foo.net might 
have a loadbalanced set of servers in different sites, <emphasis>e.g.</emphasis> 
London, New York, San Francisco and Tokyo. Each local setup 
has an authoritative nameserver for www.foo.net.
		</para>
		<para>
The way it works is
		</para>
		<itemizedlist>
			<listitem>
client in Scotland asks for the IP of www.foo.net
			</listitem>
			<listitem>
the client's nameserver doesn't know the IP and asks a 
rootserver for the machine authoritative for foo.net.
			</listitem>
			<listitem>
The rootserver has a list of 4 authoritative nameservers 
for foo.net and selects the next nameserver by round robin. 
If the next one in its list is in New York, it tells the 
client's nameserver to go query the nameserver in New York.
			</listitem>
			<listitem>
The New York nameserver for foo.net measures the packet 
latency to the client's nameserver and then returns the VIP 
for www.foo.net associated with the New York 
installation of www.foo.net. The latency is propagated to 
the other foo.net nameservers (in Tokyo, London and San 
Francisco).
			</listitem>
			<listitem>
Sometime later after the client's nameserver has flushed 
the IP entry for www.foo.net from its cache, another (or the 
same) client using the same nameserver asks for the IP of 
www.foo.net again and this time the rootserver will 
possibly send the request to another of the sites (say 
London).  The London machine already knows the latency from 
New York to the client (without knowing where the client 
is), and sees that its latency to the client is lower than 
the latency from New York to the client, and returns the IP 
of its copy of www.foo.net. The London 
nameserver also updates the latency tables at the other 
sites (New York, San Francisco and Tokyo).
			</listitem>
			<listitem>
If the next nameserver request from the client site is 
sent to Tokyo, then the Tokyo machine updates the latency 
tables in all the other nameservers, and knowing that the 
latency is lowest to the London nameserver, returns the IP 
of www.foo.net in London.
			</listitem>
			<listitem>
In this way the four nameservers accumulate the latencies to 
all nameservers in the world. This works provided that the 
latencies don't change a lot with time of day (or 
throughput). Presumably you could store successive
latencies and pick the shortest as reflecting the
true network distance. The amount of memory required to do this must 
be small - there can't be more than a million nameservers, 
can there? 1 million 8 bit latencies is not much to store in 
memory.
			</listitem>
		</itemizedlist>
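		<para>
The selection step in the list above can be sketched as a lookup in a shared latency table.
This is only a toy model of the behaviour described in the talk, not Radware's implementation;
the table format, the site names, the nameserver name and the latencies are all invented.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# Toy latency table, assumed to be replicated to every site.
# Columns: client nameserver, site, measured latency in ms (all invented).
TABLE=${TABLE:-$(mktemp)}
cat > "$TABLE" <<'EOF'
ns.scotland.example newyork 95
ns.scotland.example london 12
ns.scotland.example tokyo 240
ns.osaka.example tokyo 8
ns.osaka.example london 230
EOF

# Print the site with the lowest recorded latency to the given
# client nameserver; print nothing if no site has measured it yet.
closest_site() {
    awk -v ns="$1" '
        $1 == ns && (best == "" || $3 + 0 < min) { min = $3 + 0; best = $2 }
        END { if (best != "") print best }
    ' "$TABLE"
}

closest_site ns.scotland.example
]]></programlisting>
		<para>
With this table, whichever site receives the query from ns.scotland.example
would answer with the VIP of the London installation.
A nameserver that no site has measured yet gets no entry,
in which case the queried site would presumably answer with its own VIP.
		</para>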
		<para>
Although I didn't get to ask how it works, if a client winds
up at a more distant site (network wise), then http redirects will send
the client to a closer site.
		</para>
		<para>
Radware SSL accelerators:
		</para>
		<para>
When I commented to the speaker that the main reason to use 
SSL accelerators is financial, <emphasis>i.e.</emphasis>
to only have one copy of the certificate, 
rather than one on each realserver, they said 
"it's also for certificate management". Presumably some 
sites have large numbers of certificates. (They didn't 
disagree with my statement.)
		</para>
		<para>
The SSL accelerators in the Radware design don't sit between 
the director and the realservers (or in front of the director 
<emphasis>i.e.</emphasis> between the client and the director), 
but sit at the same 
level as the other realservers. The https request is 
balanced by the director to an accelerator, which decrypts 
the packets and sends the decrypted packets back to the 
director for loadbalancing as http traffic. Since the 
director is a NAT balancer, the return http traffic from the 
http servers goes back through the director, and then 
recursively back to the SSL accelerator, then back to the 
director as https traffic and then back to the client.
		</para>
		<para>
Being able to have the SSL accelerator as a realserver in 
LVS would require the realservers to be a client of the 
director, something that we can do for LVS-NAT, but not for 
LVS-DR. This is not a capability that we've paid much attention
to for LVS. If you need a realserver to be in the path in both 
inward and outward directions (like an SSL accelerator) then 
you will have to use LVS-NAT.
		</para>
	<para> 
Francois JEANMOUGIN <emphasis>Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com</emphasis> 12 Oct 2005 
	</para>
	<para>
Note that we removed our Radware appliance to use LVS instead. Load Balancing
using DNS is _evil_, especially with mobile internet and all those
misconfigured operator gateways.
Most mobile gateways are written in Java, and I'm probably the only
one who has read the java.security file. Just have a look at this ugly stuff you
can find in it and the unbelievably silly explanation given:
	</para>
<programlisting><![CDATA[
# The Java-level namelookup cache policy for successful lookups:
#
# any negative value: caching forever
# any positive value: the number of seconds to cache an address for
# zero: do not cache
#
# default value is forever (FOREVER). For security reasons, this
# caching is made forever when a security manager is set.
#
# NOTE: setting this to anything other than the default value can have
#       serious security implications. Do not set it unless
#       you are sure you are not exposed to DNS spoofing attack.
#
#networkaddress.cache.ttl=-1
]]></programlisting>
	<para>
For security reasons! Guys! Well. So we removed radware. Note that we had
other problems with radware. The DNS cache of the clients was one, the response
time of the DNS was another. Several technical issues when you reach some
traffic limits were the last.
	</para>
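	<para>
For a JVM that you control, the comment quoted above says how to change this:
a positive value caches successful lookups for that many seconds.
A fragment for the java.security file might look like the following;
the value 60 is an arbitrary example, and the security warning in the quoted comment still applies.
	</para>
<programlisting><![CDATA[
# cache successful name lookups for 60 seconds instead of forever
networkaddress.cache.ttl=60
]]></programlisting>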
	<para>
Henrik Holst
	</para>
	<blockquote>
still, geographic load balancing would be very nice to have and I
cannot figure out another way to do it than involve DNS round-robin.
	</blockquote>
	<para> 
Francois  
	</para>
	<para>
Round-Robin DNS could work if
	</para>
	<itemizedlist>
		<listitem>
You have enough clients
		</listitem>
		<listitem>
Clients are using DNS as expected
		</listitem>
		<listitem>
Clients are dealing with TTL
		</listitem>
		<listitem>
Client DNS caches or provider DNS are honouring DNS TTL
		</listitem>
		<listitem>
All your sites are always up and working (you can't use a DNS solution for
failover)
		</listitem>
	</itemizedlist>
	<para>
My clients are mobile phones, basically points 1 to 4 are not OK :). And I
have to deal with multiple sources for the same client (the transaction begins
in the gallery gateway and continues in the standard surf gateway, and I have
to use fwmarks to keep the session)...
We used RadWare to try to load-balance between our two peers. It clearly was
not working. Unfortunately, I don't have all the details.
	</para>

	<para>
Horms
	</para>
	<para>
If you want to distribute traffic between hosts
that have fast, reliable links, like a LAN, then LVS is a good option.
No, an excellent option.
If you want to distribute traffic between geographically separated
hosts, then you don't want something like LVS that channels packets
through a single location then to another. Something DNS based is
probably the way to go - though round robin is not nearly smart
enough for my liking.
In practice, if you do have geographically distributed sites,
then each site should probably be an LVS cluster. So essentially
you end up using two techniques to solve different parts of the
same problem.
	</para>
	<para>
I wrote quite a lot of this on supersparrow.org once upon a time,
it's still there if people want to read/play/enhance.
(links through <xref linkend="supersparrow"/>).
	</para>
		</section>
		<section id="review_radware_F5">
		<title>User's view of Radware, F5</title>
		<para>
bak <emphasis>bak (at) picklefactory (dot) org</emphasis> 09 Jan 2007
		</para>
		<para>
I've used Radware, F5, and HP SAs as an admin.
My 2-minute executive overview take:
		</para>
		<para>
Radware is great for a switch-like, low-key experience.  They're relatively
cheap for hardware load balancers.  You get extra functionality like SSL and
link balancing with extra bits of hardware.  Sometimes they can be pretty
hard to troubleshoot.  If you want global balancing/failover, that's part of
all their "AS" type switches.
		</para>
		<para>
F5 is the other 'big name' option.  These boxes are more like Brocade
switches: it's running embedded Linux in there, and if you want to run
tcpdump, you can.  You get extra functionality by buying a 'bigger' box and
then paying F5 for more licensing.  If you want to do global
balancing/failover, you have to get one of their DNS devices.
		</para>
		<para>
If you have money to wave around, I've found both Radware and F5 are more
than happy to give you a demo unit for 2-4 weeks.
		</para>
		</section>
	</section>
	<section id="books">
	<title>Books on LVS</title>
	<para>
Karl Kopper has tackled this. 
Writing a book on a moving target like LVS is a difficult proposition - 
certainly more than I was prepared to take on. 
	</para>
<programlisting><![CDATA[
The Linux Enterprise Cluster
Karl Kopper
Pub: No Starch Press
ISBN 1593270364
]]></programlisting>
	<para>
The book is available at your usual suppliers.
	</para>
	<para>
I'm loath to mention the names of internet booksellers
who require your e-mail address as part of your purchase,
so that they can spam you later. 
I've been buying my books by phone at a marginally higher price 
since realising their business practices.
However recently (Jul 2004) I've discovered disposable e-mail addresses 
<emphasis>e.g.</emphasis> the free service from 
<ulink url="http://www.jetable.org/">Jetable.org</ulink>
(http://www.jetable.org/).
They have a google-like (<emphasis>i.e.</emphasis> simple) interface.
You give them your e-mail address, 
the required lifetime of the address
(1-8 days), and click. 
Up comes an e-mail address (test by sending a message to it)
that you can give to your internet vendor, 
and mail will be forwarded to you for the period selected. 
After that time, no more mail will get to you.
I've been using jetable since Jul 2004 (now Sep 2004)
and have not got any spam from Jetable or from internet vendors.
	</para>
	</section>
	<section id="LVS_in_the_news" xreflabel="LVS in the News">
	<title>LVS in the news</title>
	<para>
&quot;Wired&quot; Magazine in Jun 2004 has a small article about LVS, 
illustrating the multinational cooperative nature of GPL software development.
The page is 
<ulink url="http://www.wired.com/wired/archive/12.06/images/atlas_software.pdf">here</ulink>
(http://www.wired.com/wired/archive/12.06/images/atlas_software.pdf),
or a 
<ulink url="files/atlas_software.pdf">local copy of the article</ulink> on this server.
	</para>
	</section>
	<section id="related_info">
	<title>Software/Information/HOWTOs useful/related to LVS</title>
	<para>
<ulink url="http://www.ultramonkey.org/">Ultra Monkey</ulink>
is LVS and HA combined.
	</para>
	<para>
tong <emphasis>tong (at) csusb (dot) net</emphasis>
25 Jun 2003
	</para>
	<para>
Here's a step-by-step
<ulink url="http://www.cula.net/cluster/">
guide for setting up an LVS system with heartbeat</ulink>
(http://www.cula.net/cluster).
	</para>
	<note>
This guide was published a year ago and we've only just heard about it.
The author has never popped up on the mailing list to say hello.
	</note>
	<para>
from <emphasis>lvs (at) spiderhosting (dot) com</emphasis>
<ulink url="http://www.supersparrow.org/">Super Sparrow Global Load Balancing</ulink>
using BGP routing information.
	</para>
	<para>
Ratz is documenting the 
<ulink url="http://www.drugphish.ch/~ratz/IPVS/index.html">
2.6 headers and calls with doxygen</ulink>
(http://www.drugphish.ch/~ratz/IPVS/index.html)
whenever he has reason to fiddle with a piece of code 
(<emphasis>i.e.</emphasis> the documentation isn't exhaustive, at least yet). 
	</para>
	<para>
From ratz, there's a write up on load imbalance with persistence and sticky bits at our friends
at <ulink url="http://www.microsoft.com/technet/prodtechnol/windows2000serv/deploy/confeat/nlbovw.asp">M$</ulink>.
	</para>
	<para>
From ratz, Zero copy patches to the kernel to speed up network throughput,
<ulink url="ftp://ftp.kernel.org/pub/linux/kernel/people/davem">Dave Miller's patches</ulink>,
<ulink url="http://surriel.com/patches/">Rik van Riel's vm-patches</ulink> and
more of Rik van Riel's patches at
http://www.linux-mm.org/ (link dead Dec 2003).
The Zero
copy patches may not work with LVS and may not work with netfilter either 
(from <emphasis>john (at) antefacto (dot) com</emphasis>).
	</para>
	<para>
From Michael Brown <emphasis>michael_e_brown (at) dell (dot) com</emphasis>, the
<ulink url="http://www.redhat.com/software/">TUX kernel level webserver</ulink>.
	</para>
	<para>
Dustin Puryear <emphasis>dustin (at) puryear-it (dot) com</emphasis>
gave a talk on LVS at LISA 2003.
The tutorial is available at:
<ulink url="http://www.puryear-it.com/publications.htm#6">
LVS: Load Balancing and High Availability for Free</ulink>
(http://www.puryear-it.com/publications.htm#6).
	</para>
	</section>
</section>
<section id="LVS-HOWTO.what_is_an_LVS">
<title>LVS: What is an LVS? Can I use an LVS?</title>
<para>
A Linux Virtual Server (LVS) is a cluster of servers
which appears to be one server to an outside client.
This apparent single server is called here a "virtual server".
The individual servers (realservers) are under
the control of a director (or load balancer),
which runs a Linux kernel patched to include the <emphasis>ipvs</emphasis> code.
The ipvs code running on the director is
<emphasis>the</emphasis> essential feature of LVS.
Other user level code is used to manage the LVS
(set rules for services handled, handle failover).
The director is basically a layer 4 router 
with a modified set of routing rules
(<emphasis>i.e.</emphasis> connections do not originate
or terminate on the director, it doesn't send ACKs etc,
it's just a router).
</para>
<para>
When a new connection is requested from a client to a service provided by the LVS 
(<emphasis>e.g.</emphasis> httpd), 
the director will choose a realserver for the client.
From then, all packets from the client
will go through the director to that particular realserver.
The association between the client and the realserver will
last for only the life of the tcp connection (or udp exchange).
For the next tcp connection, the director will choose a new
realserver (which may or may not be the same as the first realserver).
Thus a web browser connecting to an LVS serving a webpage consisting
of several hits (images, html page), may get each hit from a separate
realserver.
</para>
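<para>
As a rough illustration of the per-connection behaviour described above,
here is a minimal Python sketch. It is a toy model, not the ipvs code:
the class name <emphasis>ToyDirector</emphasis>, the round-robin choice
and the addresses are all assumptions made for the example.
Each new connection is given a realserver, later packets of that connection
stay with it, and the association is dropped when the connection ends.
</para>
<programlisting><![CDATA[
from itertools import cycle

class ToyDirector:
    """Toy model of per-connection scheduling (not the real ipvs code)."""

    def __init__(self, realservers):
        self._next = cycle(realservers)   # round-robin, assumed for simplicity
        self._table = {}                  # connection -> realserver

    def packet(self, client_ip, client_port):
        """Return the realserver handling this client connection."""
        conn = (client_ip, client_port)
        if conn not in self._table:       # new connection: schedule it
            self._table[conn] = next(self._next)
        return self._table[conn]

    def close(self, client_ip, client_port):
        """Forget the association when the connection ends."""
        self._table.pop((client_ip, client_port), None)

director = ToyDirector(["RS1", "RS2", "RS3"])
# A browser fetching a page and two images over three connections
hits = [director.packet("192.0.2.10", port) for port in (40001, 40002, 40003)]
print(hits)   # ['RS1', 'RS2', 'RS3']
]]></programlisting>
<para>
Each hit comes from a different realserver here only because round-robin was assumed;
with another scheduler two hits could land on the same realserver.
</para>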
<para>
Since the director will send the client to an arbitrary realserver, 
the services must be either read only
(<emphasis>e.g.</emphasis> web services) or if read/write
(<emphasis>e.g.</emphasis> an on-line shopping cart)
some mechanism external to LVS must be provided for
propagating the writes to the other realservers on 
a timescale appropriate for the service 
(<emphasis>i.e.</emphasis> purchase of an item 
must decrement the stock on all other nodes before
the next client attempts to purchase the same item).
At best LVS is read mostly.
</para>
<para>
If you just want one of several nodes to be up at any one time, 
and the other node(s) to become active on failure of the primary node, 
then you don't need LVS: you need a high availability
setup <emphasis>e.g.</emphasis> Linux-HA (heartbeat), vrrp or carp.
</para>
<para>
If you want independent servers at different locations,
then you want a geographically distributed server like
<link linkend="supersparrow">Supersparrow</link>.
</para>
<para>
Here are some <xref linkend="rrd_images"/>
</para>
<para>
Management of the LVS is through the user space utility <xref linkend="LVS-HOWTO.ipvsadm"/>,
which is used to add/remove realservers/services and to handle failout.
LVS itself does not detect failure conditions; these are detected
by external agents, which then update the state of the LVS through <command>ipvsadm</command>.
</para>
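<para>
To make the division of labour concrete, here is a minimal sketch of what
such an external agent might do. It is an illustration only, not one of the
packages used with LVS: the function names, the addresses and the choice of
LVS-DR with weight 1 are assumptions. The agent checks the realservers itself
and then works out which <command>ipvsadm</command> calls
(<command>-d</command> to delete a realserver, <command>-a</command> to add one)
would bring the director's table up to date.
</para>
<programlisting><![CDATA[
import socket

VIP_SERVICE = "192.0.2.100:80"          # hypothetical virtual service

def is_up(host, port, timeout=2.0):
    """Crude health check: can we open a TCP connection?"""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def ipvsadm_updates(in_service, healthy, service=VIP_SERVICE):
    """Return ipvsadm argument lists that bring the LVS table in line
    with the health checks. Both arguments are sets of 'ip:port'."""
    cmds = []
    for rs in sorted(in_service - healthy):      # failed: remove from service
        cmds.append(["ipvsadm", "-d", "-t", service, "-r", rs])
    for rs in sorted(healthy - in_service):      # recovered: add back
        # -g (LVS-DR) and -w 1 are assumptions for this example
        cmds.append(["ipvsadm", "-a", "-t", service, "-r", rs, "-g", "-w", "1"])
    return cmds

cmds = ipvsadm_updates(in_service={"10.0.0.1:80", "10.0.0.2:80"},
                       healthy={"10.0.0.2:80", "10.0.0.3:80"})
for argv in cmds:
    print(" ".join(argv))    # a real agent would pass argv to subprocess.run()
]]></programlisting>
<para>
In a running agent the <emphasis>healthy</emphasis> set would be built by
calling something like <emphasis>is_up()</emphasis> for each realserver at
regular intervals; the example passes the sets in by hand so that it
can be run without a director.
</para>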
	<section id="what_is_a_VIP">
	<title>What is a VIP?</title>
	<para>
The director presents an IP called the Virtual IP (VIP) to clients.
(When using <xref linkend="LVS-HOWTO.fwmark"/>, VIPs are aggregated into
groups of IPs, but the same principles apply as for a single IP).
When a client connects to the VIP, the director forwards the client's
packets to one particular realserver for the
duration of the client's connection to the LVS. This connection is chosen
and managed by the director. The realservers serve services
(eg ftp, http, dns, telnet, nntp, smtp) such as are found in
/etc/services or inetd.conf. The LVS presents one IP on the director
(the virtual IP, VIP) to clients.
	</para>
	<para>
Peter Martin <emphasis>p (dot) martin (at) ies (dot) uk (dot) com</emphasis> and John Cronin <emphasis>jsc3 (at) havoc (dot) gtf (dot) org</emphasis> 05 Jul 2001
	</para>
	<blockquote>
	<para>
The VIP is the address which you want to load balance 
<emphasis>i.e.</emphasis> the address of your website. 
The VIP is usually a secondary address 
so that the VIP can be swapped between two directors on failover
(the VIP used to be an alias, <emphasis>e.g.</emphasis> eth0:1).
	</para>
	<para>
The VIP is the IP address
of the "service", not the IP address of any of the particular systems
used in providing the service (ie the director and the realservers).
	</para>
	<para>
The VIP can be moved from one director to another backup director
if a fault is detected
(typically this is done by using
<filename>mon</filename> and <filename>heartbeat</filename>,
or something similar).
The director can have <link linkend="multiple_VIPs">multiple VIPs</link>.
Each VIP can have one or more services associated with it
<emphasis>e.g.</emphasis> you could have HTTP/HTTPS
balanced using one VIP, and FTP service (or whatever) balanced using
another VIP, and calls to these VIPs can be answered by the same or different
realservers.
	</para>
	<para>
Groups of VIPs and/or ports can be setup with <xref linkend="LVS-HOWTO.fwmark"/>.
	</para>
	<para>
The realservers have to be configured to work with the VIPs on the director
(this includes handling the <xref linkend="LVS-HOWTO.arp_problem"/>).
	</para>
	<para>
There can be
<xref linkend="LVS-HOWTO.persistent_connection"/> issues,
if you are using cookies or https,
or anything else that expects the realserver fulfilling the requests
to have some connection state information.
This is also addressed on the
<ulink url="http://www.linuxvirtualserver.org/docs/persistence.html">
LVS persistence page
</ulink>.
	</para>
	</blockquote>
	</section>
	<section id="where_used">
	<title>Where do you use an LVS?</title>
	<itemizedlist>
		<listitem>
For higher throughput.
The cost of increasing throughput by adding
realservers in an LVS increases linearly,
whereas the cost of increased throughput by buying a larger single machine
increases faster than linearly
		</listitem>
		<listitem>
For redundancy. Individual machines can be switched out of the LVS,
upgraded and brought back on line without interruption of service to the clients.
Machines can be moved to a new site and brought on line one at a time while machines
are removed from the old site, without interruption of service to the clients.
		</listitem>
		<listitem>
For adaptability. If the throughput is expected to change gradually (as a
business builds up), or quickly (for an event), the number of servers can be
increased (and then decreased) transparently to the clients.
		</listitem>
	</itemizedlist>
	</section>
	<section id="client_server_relationship">
	<title>Client/Server relationship is preserved in an LVS</title>
	<itemizedlist>
		<listitem>
The client sees only one IP address and believes it is connecting
to a single machine. The IPs of all the servers are mapped to one IP (the VIP).
The client is connected to only one machine at a time,
but subsequent connections will be assigned to a new and likely
different machine.
		</listitem>
		<listitem>
servers at different IP addresses believe
they are contacted directly by the client.
		</listitem>
	</itemizedlist>
	</section>
	<section id="L4_switch">
	<title>LVS director is an L4 switch</title>
	<para>
In the computer bestiary, the director is a layer 4 (L4) switch.
The director makes decisions at the IP layer and just sees a stream
of packets going between the client and the realservers.
In particular an L4 switch makes decisions based on the IP information
in the headers of the packets.
	</para>
	<para id="supersparrow" xreflabel="Super Sparrow Project">
Here's a description of an L4 switch from
<ulink url="http://www.supersparrow.org/ss_paper/">Super Sparrow Global Load Balancer documentation</ulink>
	</para>
	<blockquote>
Layer 4 Switching: Determining the path of packets based on
information available at layer 4 of the OSI 7 layer protocol stack.
In the context of the Internet, this implies that the IP address
and port are available as is the underlying protocol, TCP/IP or UDP/IP.
This is used to effect load balancing by keeping an affinity
for a client to a particular server for the duration of a connection.
	</blockquote>
	<para>
This is all fine except
	</para>
	<para>
Nevo Hed <emphasis>nevo (at) aviancommunications (dot) com</emphasis> 13 Jun 2001
	</para>
	<blockquote>
The IP layer is L3.
	</blockquote>
	<para>
Alright, I lied.
TCPIP is a 4 layer protocol and these layers do not map well onto
the 7 layers of the OSI model.
(As far as I can tell the 7 layer OSI model is only used to torture
students in classes.)
It seems that everyone has agreed to pretend that tcpip
uses the OSI model and that tcpip devices like the LVS director
should therefore be named according to the OSI model.
Because of this, the name "L4 switch" really isn't correct,
but we all use it anyhow.
	</para>
	<para>
The director does not inspect the content of the packets and cannot
make decisions based on the content of the packets
(<emphasis>e.g.</emphasis> if the packet contains a <link linkend="cookie">cookie</link>,
the director doesn't know about it and doesn't care).
The director doesn't know anything about the application
generating the packets or what the application is doing.
Because the director does not inspect the content of the packets (layer 7, L7)
it is not capable of session management or providing
service based on packet content. L7 capability would be a useful
feature for LVS and perhaps this will be developed in the future
(preliminary ktcpvs code is out - May 2001 -
<xref linkend="LVS-HOWTO.L7_switch"/>).
	</para>
	<para>
The director is basically a router, with routing tables set up
for the LVS function.
These tables allow the director to forward packets to
realservers for services that are being LVS'ed.
If http (port 80) is a service that is being LVS'ed
then the director will forward those packets.
The director does not
have a socket listener on VIP:80 (i.e. netstat won't see a listener).
	</para>
	<para>
John Cronin <emphasis>jsc3 (at) havoc (dot) gtf (dot) org</emphasis> (19 Oct 2000)
calls these types of servers
(i.e. lots of little boxes appearing to be one machine) "RAILS"
(Redundant Arrays of Inexpensive Linux|Little|Lightweight|L* Servers).
Lorn Kay <emphasis>lorn_kay (at) hotmail (dot) com</emphasis> calls them RAICs (C=computer),
pronounced "rake".
	</para>
	</section>
	<section id="forward_packets">
	<title>LVS forwards packets to realservers</title>
	<para>
The director uses 3 different methods of forwarding.
	</para>
	<itemizedlist>
		<listitem>
LVS-NAT based on network address translation (NAT)
		</listitem>
		<listitem>
LVS-DR (direct routing) where the MAC addresses on the
packet are changed and the packet forwarded to the realserver
		</listitem>
		<listitem>
LVS-Tun (tunnelling) where the packet is IPIP encapsulated
and forwarded to the realserver.
		</listitem>
	</itemizedlist>
	<para>
Some modification of the realserver's ifconfig
and routing tables will be needed for LVS-DR and LVS-Tun forwarding.
For LVS-NAT the realservers only need a functioning tcpip stack (<emphasis>i.e.</emphasis>
the realserver can be a networked printer).
	</para>
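	<para>
The difference between the three forwarding methods is easier to see with
a toy packet. The following Python sketch is a deliberately simplified model
(the <emphasis>Packet</emphasis> fields, the <emphasis>forward()</emphasis>
function and all addresses are assumptions for the example; ports and
checksums are ignored): LVS-NAT rewrites the destination IP,
LVS-DR changes only the destination MAC address, and LVS-Tun
wraps the original packet in a new IP header.
	</para>
<programlisting><![CDATA[
from dataclasses import dataclass, replace
from typing import Optional

@dataclass(frozen=True)
class Packet:
    dst_mac: str
    src_ip: str
    dst_ip: str
    inner: Optional["Packet"] = None   # set only for an IPIP-encapsulated packet

def forward(pkt, method, dip, rip, rs_mac):
    """Return the packet as the director would send it on (simplified:
    the realserver is assumed to be the next hop in every case)."""
    if method == "NAT":    # rewrite the destination IP (VIP -> RIP)
        return replace(pkt, dst_ip=rip, dst_mac=rs_mac)
    if method == "DR":     # only the destination MAC changes; dst IP stays the VIP
        return replace(pkt, dst_mac=rs_mac)
    if method == "TUN":    # wrap the untouched packet in a new IP header (IPIP)
        return Packet(dst_mac=rs_mac, src_ip=dip, dst_ip=rip, inner=pkt)
    raise ValueError(f"unknown forwarding method: {method}")

# Hypothetical addresses: CIP 198.51.100.7, VIP 192.0.2.100, DIP 10.0.0.254, RIP 10.0.0.1
client_pkt = Packet(dst_mac="director-mac", src_ip="198.51.100.7", dst_ip="192.0.2.100")
for method in ("NAT", "DR", "TUN"):
    print(method, forward(client_pkt, method,
                          dip="10.0.0.254", rip="10.0.0.1", rs_mac="rs1-mac"))
]]></programlisting>
	<para>
In the LVS-DR and LVS-Tun cases the realserver receives a packet whose
destination is still the VIP, which is why those two methods need the
changes to the realserver's configuration mentioned above.
	</para>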
	<para>
LVS works with all services tested so far (single and 2 port services)
except that LVS-DR and LVS-Tun cannot work with services that
initiate connects from the realservers (so far; identd and rsh).
	</para>
	<para>
The realservers can be identical, presenting the same service
(eg http, ftp) working off file systems which are kept in sync
for content. This type of LVS increases the number of clients
able to be served. Or the realservers can be different, presenting a
range of services from machines with different services or
operating systems, enabling the virtual server to present a
total set of services not available on any one server. The
realservers can be local/remote, running Linux (any kernel)
or other OS's. Some methods for setting up an LVS have fast
packet handling (eg LVS-DR which is good for http and ftp)
while others are easier to setup (eg transparent proxy) but
have slower packet throughput. In the latter case, if the
service is CPU or I/O bound, the slower packet throughput
may not be a problem.
	</para>
	<para>
For any one service (eg httpd at port 80) all the realservers
must present identical content since the client could be connected
to any one of them and over many connections/reconnections, will
cycle through the realservers. Thus if the LVS is providing
access to a farm of web, database, file or mail servers, all
realservers must have identical files/content. You cannot split
up a database amongst the realservers and access pieces of it
with LVS.
	</para>
	<para>
The simplest LVS to setup involves clients doing read-only
fetches (<emphasis>e.g.</emphasis> a webfarm).
If the client is allowed to write to the LVS (<emphasis>e.g.</emphasis>
database, mail farm), then some method is required
so that data written on one realserver is
transferred to other realservers before the client disconnects
and reconnects again. This need not be all that fast (you
can tell them that their mail won't be updated for 10mins),
but the simplest (and most expensive) is for the mail farm
to have a common file system for all servers. For a database,
the realservers can be running database clients which connect
to a single backend database, or else the realservers can
be running independent database daemons which replicate their
data.
	</para>
	</section>
	<section id="any_linux">
	<title>LVS runs on Linux and FreeBSD directors</title>
	<para>
LVS was developed on Linux and historically uses a Linux director.
The Intel and Dec Alpha versions of LVS are known to work.
The LVS code doesn't have any Intel specific
instructions and is expected to work on any machine that runs Linux.
	</para>
	<para>
In Apr 2005, LVS was ported to FreeBSD by Li Wang
	</para>
	<para>
Li Wang <emphasis>dragonfly (at) linux-vs (dot) org</emphasis> 2005/04/16
	</para>
	<para>
The URL is:
<ulink url="http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm">FreeBSD port of LVS</ulink>
(http://dragon.linux-vs.org/~dragonfly/htm/lvs_freebsd.htm).
Here's a 
<ulink url="http://dragon.linux-vs.org/~dragonfly/software/doc/ipvs_freebsd/performance.html">
performance test on FreeBSD(version 0.4.0)</ulink>
(http://dragon.linux-vs.org/~dragonfly/software/doc/ipvs_freebsd/performance.html).
	</para>
	</section>
	<section id="lvs_different_for_different_kernels">
	<title>Code for LVS is different for each kernel series</title>
	<para>
There are differences in the coding for LVS for the 2.0.x, 2.2.x,
2.4.x and 2.6.x kernels.
Development of LVS on 2.0.36 kernels has stopped (May 99).
Code for 2.6.x kernels is relatively new.
	</para>
	<para>
The 2.0.x and 2.2.x code is based on the masquerading code. Even if you
don't explicitly use ipchains (eg with LVS-DR or LVS-Tun),
you will see masquerading entries with `ipchains -M -L` (or `netstat -M`).
	</para>
	<para>
Code for 2.4.x kernels was rewritten
to be compatible with the netfilter code (i.e. its entries will
show up in netfilter tables).
It is now production level code.
Because of incompatibilities affecting LVS-NAT, LVS for 2.4.x kernels
remained in development mode for LVS-NAT until about Jan 2001.
	</para>
	</section>
	<section id="2.4_SMP_kernel">
	<title>kernels from 2.4.x series are SMP for kernel code</title>
	<para>
2.4.x kernels are SMP for kernel code as well as user space
code, while 2.2.x kernels are only SMP for user space code.
LVS is all kernel code. A dual CPU director running a 2.4.x
kernel should be able to push packets at twice the rate
of the same machine running a 2.2 kernel (if other resources
on the director don't become limiting).
(Also see the section on <xref linkend="FAQ:smp_doesnt_help"/>.)
	</para>
	</section>
	<section id="realserver_OS">
	<title>OS for realservers</title>
	<para>
You can have almost any OS on the realservers (all are
expected to work, but we haven't tried them all yet).
The realservers only need a tcpip stack -
a networked printer can be a realserver.
	</para>
	</section>
	<section id="ethernet">
	<title>LVS works on ethernet</title>
	<para>
LVS works on
<ulink url="http://www.ethermanage.com/ethernet/ethernet.html">ethernet</ulink>.
	</para>
	<para>
	There are some limitations on using
<link linkend="ATM">ATM</link>.
	</para>
	<para>
Firewire: (from the Beowulf mailing list - Donald Becker 5 Dec 2002):
The firewire transport layer (IEEE1394) does run
<ulink url="http://developer.apple.com/firewire/IP_over_FireWire.html">
IP over FireWire</ulink>.
However firewire is designed for fixed size repeated frames (video or
continuous disk block reads), but has overhead for other communication.
Throughput is 400Mbps but worst case latency is high (msec range).
	</para>
	<para>
Oracle has released GPL libraries for clustering Linux boxes over FireWire
(http://www.ultraviolet.org/mail-archives/beowulf.2002/2977.html, link dead Dec 2003).
	</para>
	</section>
	<section id="ipv6">
	<title>LVS works on IPv6</title>
	<para>
Seiji Tsuchiike <emphasis>tsuchiike (at) yggr-drasill (dot) com</emphasis> 02 Jun 2002
	<blockquote>
We just implemented IPv6 to lvs.
We think that Basic Mechanism is same.
(http://www.yggr-drasill.com/LVS6/documents.html. link dead Dec 2003,
but Sep 2004 Horms says it's alive; Joe Dec 2006 it's alive).
	</blockquote>
	</para>
	</section>
	<section id="ipvs_continually_developed">
	<title>LVS is continually being developed</title>
	<para>
LVS is continually being developed and usually only the more recent kernel
and kernel patches are supported. Usually development is incremental,
but with the 2.2.14 kernels the entries in the /proc file system changed
and all subsequent 2.2.x versions were incompatible with previous versions.
	</para>
	</section>
	<section id="64_bit">
	<title>LVS is 64 bit</title>
	<para>
Kenny Chamber
	</para>
	<blockquote>
Has anybody here successfully setup lvs-director on sparc64 machine?
I need to know which distro is OK for this.
	</blockquote>
	<para>
Ratz 16 Dec 2004
	</para>
	<para>
Yes. Just recently.
Debian is fine, I reckon Gentoo would do as well.
	</para>
	<para>
INFO: It could be that your ipvsadm binary that comes to instrument the 
kernel tables for LVS is broken with regard to 64bit'ness. You then need 
to download the latest sources and recompile adding '-m64' to the 
CFLAGS. That's all, other than that it seems to work nicely.
	</para>
	<para>
Btw: I took Debian testing, probably not too wise but on the other hand 
I needed more up to date tools. I wouldn't know of too many other 
Distros that have up to date Sparc64 support. Suse used to have, but 
they dropped support a while ago unfortunately.
	</para>
	<para>
Justin Ossevoort <emphasis>justin (at) snt (dot) utwente (dot) nl</emphasis> 16 Dec 2004
	</para>
	<para>
Well our plain debian-sarge here did it just as painlessly as our x86
based machines. So as long as your distro has ipvs (and of course a
sparc tree ;)) support you're in the green.
	</para>
	<para>
liuah
	</para>
	<blockquote>
I want to know whether LVS can work with 64-bit boxes.
If I use LVS-DR, how can I apply the hidden patch to 64-bit linux,
using kernel is 2.4.18?
	</blockquote>
	<para>
ratz 29 Nov 2003
	</para>
	<para>
Yes.

The only problem I see is if either the counters or the hashtable
handling has some bug with 32/64-bit signedness and wrong shift
operators. Just let us know if you experience flakiness on your director ;).
The hidden patch for your kernel is:
http://www.ssi.bg/~ja/hidden-2.4.19pre5-1.diff
I hope you are aware of the fact that 2.4.18 is really buggy in many
ways. I know that some 64-bit archs have been lagging behind in the 2.4.x
tree but if I was you I would upgrade to a newer kernel.
	</para>
	<para>
Peter Mueller <emphasis>pmueller (at) sidestep (dot) com</emphasis> 29 Nov 2003
	</para>
	<para>
The one for straight 2.4.18 is http://www.ssi.bg/~ja/hidden-2.4.5-1.diff.
Since he said 2.4.18 I would suspect he's running Debian.  If you want a
Debian kernel with LVS+hidden use the
<ulink url="http://www.ultramonkey.org/">
Ultramonkey kernel</ulink>
(http://www.ultramonkey.org/).
	</para>
	<para>
liuah <emphasis>liuah (at) langchaobj (dot) com (dot) cn</emphasis> 02 Dec 2003
	</para>
	<para>
The hidden patch compiles and runs on our 64-bit servers successfully.
	</para>
	</section>
	<section id="other_documentation">
	<title>Other documentation</title>
	<para>
For more documentation, look at the LVS web site
(eg a talk I gave on how LVS works on
<ulink url="http://www.linuxvirtualserver.org/Joseph.Mack/linuxexpo99/linuxexpo2.html">2.0.36 kernel directors</ulink>)
	</para>
	<para>
Julian has written
<ulink url="http://www.ssi.bg/~ja/">Netparse</ulink>
for which we don't have a lot of documentation yet.
	</para>
	<para>
For those who want more understanding of netfilter/iptables etc, here
are some starting places. These topics are also covered in many
other places.
	</para>
	<itemizedlist>
		<listitem>
<ulink url="http://www.sunbeam.franken.de/projects/packetjourney/">
Harald Welte (of the netfilter team) description of what happens to a packet under 2.4
</ulink>
		</listitem>
		<listitem>
		<para>
<ulink url="http://www.sunbeam.franken.de/projects/conntrack+nat-HOWTO/">
Harald Welte (of the netfilter team) conntrack HOWTO</ulink>.
		</para>
		<para>
Conntrack is used in filter rules as a way of accepting "related" packets, 
<emphasis>e.g.</emphasis>
the data packets associated with an established ftp connection.
Regular filter rules written for these data packets
would accept ftp data packets (port 20) even if
they were not in response to a PORT call from an already
established ftp connection on port 21.
In this case the filter rules would accept packets that are part of a DoS attack.
		</para>
		<para>
Conntrack is CPU intensive and lowers throughput
(see <link linkend="conntrack">effect of conntrack on throughput</link>).
To disable conntrack you have to rmmod all the conntrack modules.
		</para>
		</listitem>
		<listitem>
<ulink url="http://www.netfilter.org">the docs/FAQs/HOWTOs on the netfilter site</ulink>

		</listitem>
		<listitem>
<ulink url="http://www.tldp.org/LDP/nag2/">Linux Network Administrators Guide</ulink>

		</listitem>
	</itemizedlist>
	</section>
	<section id="lvs_is_not_simple">
	<title>LVS is not simple to install, get going or keep running</title>
	<para>
This is not a utility where you run
<command>./configure &amp;&amp; make &amp;&amp; make check &amp;&amp; make install</command>,
put a few values in a <filename>*.conf</filename> file and you're done.
LVS rearranges the way IP works so that a router and server (here called director and realserver),
reply to a client's IP packets as if they were one machine.
You will spend many days, weeks, months figuring out how it works.
LVS is a lifestyle, not a utility.
	</para>
	<para>
That said, you should be able to get a simple LVS-NAT setup working in a few hours without
really understanding a whole lot about what's going on (see the
<link linkend="mini-HOWTO">LVS-mini-HOWTO</link>).
	</para>
	</section>
	<section id="lvs_control">
	<title>LVS Control (Failure, Thundering Herd, Sorry Servers)</title>
	<para>
LVS is kernel code (<filename>ip_vs</filename>) 
and a user space controller (<filename>ipvsadm</filename>).
When adding functionality to LVS (handling failover, bringing new machines on-line), 
we have to figure the best place to put it: in the kernel code or in the user space code?
Such decisions are relevant if you can choose from two equally functional lots of code -
we usually get what the coder wanted to implement.
	</para>
	<para>
Current thinking is to make the kernel code just handle the switching and
to have all control in user space.
	</para>
	<para>
Should the <xref linkend="thundering_herd"/>
be controlled by LVS or by an external user space program 
<emphasis>e.g.</emphasis> <link linkend="feedbackd">feedbackd</link>?
Currently there is both a kernel patch and a script
to change from rr to lc shortly after adding a new realserver.
An alternative (not implemented) would be a scheduler that 
is rr when there's a large difference in the number of connections
to the different realservers and lc when the number of connections
is similar.
	</para>
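	<para>
Since that alternative scheduler has not been implemented, the following
Python sketch is purely hypothetical. It only shows the decision rule:
the name <emphasis>make_hybrid_scheduler</emphasis> and the
<emphasis>max_spread</emphasis> knob (how large a difference in connection
counts counts as "large") are assumptions of mine, not part of LVS.
	</para>
<programlisting><![CDATA[
from itertools import count

def make_hybrid_scheduler(realservers, max_spread=10):
    """Hypothetical scheduler: rr while connection counts differ by more
    than max_spread, lc once they are similar. max_spread is an assumed knob."""
    turn = count()

    def pick(conns):
        """conns maps realserver -> current number of connections."""
        spread = max(conns.values()) - min(conns.values())
        if spread > max_spread:
            # rr: ignore load so a newly added (empty) realserver isn't swamped
            return realservers[next(turn) % len(realservers)]
        # lc: choose the realserver with the fewest connections
        return min(realservers, key=lambda rs: conns[rs])

    return pick

pick = make_hybrid_scheduler(["RS1", "RS2", "RS3"])
conns = {"RS1": 100, "RS2": 100, "RS3": 0}   # RS3 has just been added
first = [pick(conns) for _ in range(3)]
print(first)                                  # ['RS1', 'RS2', 'RS3']
print(pick({"RS1": 50, "RS2": 48, "RS3": 45}))  # counts are similar: lc picks RS3
]]></programlisting>
	<para>
With plain lc, all three of the first connections would have gone to the
empty RS3, which is the thundering herd problem the rule is meant to avoid.
	</para>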
	<para>
LVS supplies high throughput using multiple identically configured machines.
You would like to be able to swap out machines for planned maintenance and
to automatically handle node failure (high availability).
	</para>
	<para>
The LVS itself does not provide high availability.
The current thinking is that the software layer that provides high availability
should be logically separate to the layer that it monitors.
The writing of software that attempts to determine whether a machine is
working, is somewhat of a black art.
There are several packages used to help provide high availability for LVS
and these are discussed in the <link linkend="LVS-HOWTO.failover">High Availability LVS</link> section.
	</para>
	<para>
While it is relatively easy to monitor the functionality of the realservers,
fail-out of directors is more difficult.
An even greater problem is handling failure of nodes which are holding state information.
	</para>
	<note>
There is a sorry server option in <xref linkend="keepalived_vrrpd"/>
	</note>
	<para>
Gustavo
	</para>
	<blockquote>
I'm trying to create a sorry server for clients that can't connect to my
real servers (limited by u-threshold); ServerA  - 100 conn, ServerB - 110 conn.
When this limit is reached I want my clients to go to a lighttpd served
page saying "come back later"
I'm trying with weights and thresholds... but it's not working the way I thought.
	</blockquote>
	<para>
Ratz 22 Nov 2006
	</para>
	<para>
I suspect the clients scheduled for the sorry server never return back to
the cluster, right (only if you use persistency of course)?
	</para>
	<blockquote>
That's right.
I'm working on a project for an airline company.
Sometimes they post some promotional tickets for a small period of time
(only passengers who buy on the website can have them) and the servers
go high.
	</blockquote>
	<para>
I've written a patch for the 2.4 kernel series extending IPVS to
support the concept of an atomically switching sorry server environment.
Unfortunately I didn't have the time to port the work to 2.6 kernels yet
(the threshold stuff is already in but a bit broken and the sorry server
stuff needs some adjustments in the 2.6 kernel). If you run 2.4 on your LB,
you could try out the patches posted to this list almost exactly one year
ago:
	</para>
<programlisting><![CDATA[
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113225125532426&w=2
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113225142406014&w=2
]]></programlisting>
	<para>
The fix to the kernel patch above:
	</para>
<programlisting><![CDATA[
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=113802828120122&w=2
]]></programlisting>
	<para>
And the 3/4 cut-off fix:
	</para>
<programlisting><![CDATA[
http://www.in-addr.de/pipermail/lvs-users/2005-December/015806.html
]]></programlisting>
	<para>
I personally believe that the sorry-server feature is a big missing piece of
framework in IPVS, one that is implemented in all commercial HW load
balancers.
	</para>
	<para>
Horms
	</para>
	<blockquote>
That is true, but it's also a piece that is trivially implemented in
user-space, where higher-level monitoring is usually taking place anyway.
Is there a strong argument for having it in the kernel?
	</blockquote>
	<para>
ratz 14 Feb 2007
	</para>
	<para>
Yes, it won't work reliably in user space because of missing atomicity. From
the point the user space daemon decides that it's time to switch over to the
sorry-server pool to the actual switch in the kernel by modifying the
according service flag, there's a couple of us to ms time frame in which the
kernel TCP stack will happily proceed with its normal tasks, including
servicing more requests to the previously elected service for sorry-server
forwarding. This can lead to broken (half-shown) page views on the
customer's side inside their browser.
	</para>
	<para>
In the field I had to implement load balancing, this was simply not
accepted, especially because it irritated our customer's clients and also
because everybody knew that HW load balancers do it right (tm).
	</para>
	<para>
YMMV and I still didn't sit down and forward port my code to 2.6 but I first
need some interest by enough people before I start :).
	</para>
	<para>
I wrote the 2.4 server pool implementation for a ticket reseller company that
probably had the same problems as your airline company. Normal selling
activities not needing high end web servers and then from time to time (in
your case promotional tickets, in my case Christina Aguilera, U2 or Robbie
Williams or World Soccer Championship tickets) peak selling where tickets
need to be sold in the first 15 minutes having tens of thousands of requests
per second, plus the illicit traffic generated by scripters trying to
sanction the event. These peaks, however, do not warrant the acquisition of
high-end servers and on-demand servers cannot be organized/prepared so
quickly.
	</para>
	<blockquote>
I need to manually limit each server capacity and the remaining connections
need to go to this sorry server.
	</blockquote>
	<para>
That's exactly the purpose of my patch, plus you get to see how many
connections (persistent as in session and active/passive connections) are
forwarded to either the normal webservers (so long as they are within the
u_thresh and l_thresh) or the overflow (sorry server) pool. As soon as one
of the RS in the serving pool drops below l_thresh future connection
requests are immediately sent to the service pool again.
	</para>
	<blockquote>
We have tried F5 Big-IP for a while and it worked perfectly, but it is
very expensive for us :(
	</blockquote>
	<para>
Yep, about USD 20k-30k to have them in a HA-pair.
	</para>
	<para>
So for the 2.4 kernel, I have a patch that has been tested extensively and
has been running in production for one year now, having survived some hype events.
I don't know if I'll find time to sit down for a 2.6 version. Anyway, as has
been suggested, you can also try the sorry server of keepalived. However I'm
quite sure that this is not atomic (since keepalived runs in user space) and
works more like:
	</para>
<programlisting><![CDATA[
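while true {
  for all RS {
    # over the upper threshold: stop sending new connections to this RS
    if RS.conns > u_thresh then quiesce RS
    # drained below the lower threshold: put the RS back in service
    if RS.isQuiesced and RS.conns < l_thresh then {
      if sorry server active then remove sorry server
      set RS.weight to old RS.weight
    }
  }
  # every RS is quiesced: fall back to the sorry server
  if sum_weight of all RS == 0 then invoke sorry server with weight > 0
}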
]]></programlisting>
	<para>
If this is the case, it will not work for our use cases with high peak
requests, since sessions are not switched to one service pool or the other
atomically. Some people will be sent to the overflow
pool even though they have a legitimate session, and others will
get broken pages back, because in the midst of the page view the LB's user space
process gets a scheduler call to update its FSM, so further requests sent
over HTTP 1.0, for example, will be broken. The browser hangs on your customer's
side and your management gets the angry phone calls of the business users,
to whom you had promised B2B access.
	</para>
	<para>
This is roughly how I came around to implementing the server overflow
(spillover server, sorry server) functionality for IPVS.
	</para>
	</section>
	<section id="clients_on_realservers">
	<title>clients on realservers</title>
	<para>
Sometimes you want a client process that has nothing to do with LVS,
to connect to machines outside the LVS. 
The client could be fetching DNS, running telnet/ssh or sending mail for logging.
None of these client calls are part of the service being balanced by LVS. 
This is covered in <xref linkend="LVS-HOWTO.non-lvs_clients_on_realservers"/>. 
	</para>
	<para>
Sometimes the LVS'ed service (<emphasis>e.g.</emphasis> http)
will fire up a client process 
(<emphasis>e.g.</emphasis> filling in a webpage 
will result in the realserver writing to a database). 
People then want to loadbalance these client calls coming
from the RIPs through a VIP on the same director.
This is covered in 
<xref linkend="LVS-HOWTO.lvs_clients_on_realservers"/>. 
	</para>
	</section>
</section>
<section id="LVS-HOWTO.install">
<title>LVS: Install, Configure, Setup</title>
	<section id="from_source">
	<title>Installing from Source Code</title>
	<para>
Doing this from source code is now described in the
<ulink url="http://www.austintek.com/LVS/LVS-HOWTO/mini-HOWTO/LVS-mini-HOWTO.html">LVS-mini-HOWTO</ulink>.
Two methods of setup are described:
	</para>
	<itemizedlist>
		<listitem>
Setup from the command line.
This is fine to understand what's going on,
and if you only want to have a single type of setup.
For LVSs which you're reconfiguring a lot, it's tedious and mistake prone.
If it doesn't work, you will spend some time figuring out why.
		</listitem>
		<listitem>
From a configure script which sets up an LVS with a single director.
This script is fine for initial setups: it's mistake proof 
(will give you enough information about failures to figure
out what might be wrong) and I used it for all my testing of LVS.
Since it's not easily expandable to handle director failover
and other configuration tools handle this now,
the configure script is not being developed anymore.
For production, where you need failover directors, you
should use other setup tools or save your hand-built setup as a script
(<emphasis>e.g.</emphasis> with <command>ipvsadm-save</command>).
		</listitem>
	</itemizedlist>
	</section>
	<section id="setup_ultramonkey">
	<title>Ultra Monkey</title>
	<para>
<ulink url="http://www.ultramonkey.org">Ultra Monkey</ulink> is a packaged
set of binaries for LVS, including Linux-HA for director failover and
ldirectord for realserver failover.
It's written by Horms, one of the LVS developers.
Ultra Monkey was used on many of the server setups sold by VA Linux
and presumably made lots of money for them.
Ultra Monkey has been around since 2000 and is mature and stable.
Questions about Ultra Monkey are answered on the LVS mailing list.
Ultra Monkey is mentioned in many places in the LVS-HOWTO.
	</para>
	<para>
Ben Hollingsworth <emphasis>ben (dot) hollingsworth (at) bryanlgh (dot) org</emphasis> 29 Jun 2007
	</para>
	<para>
There are step-by-step instructions at 
<ulink url="http://www.jedi.com/obiwan/technology/ultramonkey-rhel4.html">
How to install Ultra Monkey LVS in a 2-Node HA/LB Setup on CentOS/RHEL4
</ulink>
(http://www.jedi.com/obiwan/technology/ultramonkey-rhel4.html).
	</para>
	<para>
Dan Thagard <emphasis>daniel (at) gehringgroup (dot) com</emphasis> 3 Jul 2007 
	</para>
	<para>
I recently set up LVS using the Ultramonkey RPMs.  
The following is (based on my understanding) a complete howto for setting up LVS on CentOS 5:
a generic CentOS 5 x64 install on 2 PCs using Ultramonkey and the Streamlined/HA topology with Apache.
The following assumptions were made:
	</para>
	<itemizedlist>
		<listitem>
Real Server names are ws01.testlab.local and ws02.testlab.local 
(replace these with the result from uname -n from each RS) 
		</listitem>
		<listitem>
Real Server IPs are 10.0.0.10/24 and 10.0.0.20/24
		</listitem>
		<listitem>
Gateway: 10.0.0.1
		</listitem>
		<listitem>
Virtual IP: 10.0.0.100
		</listitem>
		<listitem>
Username: tester
		</listitem>
	</itemizedlist>
	<orderedlist>
		<listitem>
      Power on the PC and insert the CD during the BIOS screen.
		</listitem>
		<listitem>
      Boot to CD.
		</listitem>
		<listitem>
      Hit 'Enter' for Graphical Installer.
		</listitem>
		<listitem>
      You will be prompted to test the installation media.  
You may choose to test the media or skip the test (usually you can skip this step).
		</listitem>
		<listitem>
      Click 'Next' to begin installation.
		</listitem>
		<listitem>
      Select 'English' as installation language and click 'Next'.
		</listitem>
		<listitem>
      Select 'U.S. English' as the keyboard configuration and click 'Next'.
		</listitem>
		<listitem>
      Select 'Remove all partitions on selected drives and create default layout' and click 'Next'.
		</listitem>
		<listitem>
		<para>
      Configure the network settings for each adapter.
		</para>
		<itemizedlist>
			<listitem>
a.      Click 'Edit'.
			</listitem>
			<listitem>
i.      Uncheck Configure using DHCP
			</listitem>
			<listitem>
ii.     Input the IP Address and Netmask.
			</listitem>
			<listitem>
iii.    Click 'OK'.
			</listitem>
			<listitem>
b.      Input the Gateway and DNS and click 'Next'.
			</listitem>
		</itemizedlist>
		</listitem>
		<listitem>
     Select 'America/New York' and click 'Next'.
		</listitem>
		<listitem>
     Enter the root password twice and click 'Next'.
		</listitem>
		<listitem>
			<para>
     Select the system packages.
			</para>
			<itemizedlist>
				<listitem>
a.      Check 'Desktop-Gnome', 'Server', 'Server-GUI', 'Clustering', 'Storage Clustering'
				</listitem>
				<listitem>
b.      Select 'Customize Now'
				</listitem>
				<listitem>
c.      Click 'Next'.
				</listitem>
			</itemizedlist>
		</listitem>
		<listitem>
			<para>
     Configure the system packages.
			</para>
			<itemizedlist>
				<listitem>
a.      Expand and click 'Details' on Desktop Environments->GNOME Desktop Environment.
				</listitem>
				<listitem>
i.      Uncheck 'desktop-printing', 'dvd+rw tools', 'esc', 'gimp-print-utils', 'gnome-audio', 'gnome-backgrounds', 'gnome-mag', 'gnome-pilot', 'gnome-themes', 'gok', and 'nautilus-cd'
				</listitem>
				<listitem>
b.      Expand Servers.
				</listitem>
				<listitem>
i.      Uncheck 'DNS', 'Legacy Network Server', 'Mail Server', 'Network Servers', 'News', and 'Printing Support'
				</listitem>
				<listitem>
c.      Expand Base System.
				</listitem>
				<listitem>
i.      Uncheck 'Dialup Networking Support'
				</listitem>
				<listitem>
d.      Expand and click 'Details' on Base System->Base.
				</listitem>
				<listitem>
i.      Uncheck 'bluez-utils' and 'ccid'
				</listitem>
				<listitem>
e.      Click 'Next'
				</listitem>
			</itemizedlist>
		</listitem>
		<listitem>
     Click 'Next' to begin copying over the files.
		</listitem>
		<listitem>
     Remove DVD and click 'Reboot' to reboot the machine after installation.
		</listitem>
		<listitem>
			<para>
     Set firewall to 'Disabled' and click 'Forward'.
			</para>
		<itemizedlist>
			<listitem>
     Click 'Yes' on pop-up.
			</listitem>
		</itemizedlist>
		</listitem>
		<listitem>
     Set SELinux to 'Disabled' and click 'Forward'.
		</listitem>
		<listitem>
     Select the 'Network Time Protocol' tab, check 'Enable Network Time Protocol', and click 'Forward'.
		</listitem>
		<listitem>
     Enter tester in the username field, 'Test User' in the Full name field, type in the password twice, and click 'Forward'.
		</listitem>
		<listitem>
     Click 'Forward' to skip the audio test.
		</listitem>
		<listitem>
     Click 'Finish' to complete the installation routine.
		</listitem>
		<listitem>
     Login to the local system using the root username and password.
		</listitem>
		<listitem>
			<para>
     Edit the '/etc/group' file
			</para>
<programlisting><![CDATA[
vi /etc/group
]]></programlisting>
		<itemizedlist>
			<listitem>
a.      Locate the user 'tester' and append 'wheel' (i to insert, [ESC] to stop editing).
			</listitem>
			<listitem>
b.      Save the file and exit by typing ':wq'.
			</listitem>
		</itemizedlist>
		</listitem>
		<listitem>
     Leave the server, go to your PC and SSH into the server (<emphasis>e.g.</emphasis> with PuTTY)
		</listitem>
		<listitem>
     Login as user 'tester'
		</listitem>
		<listitem>
		<para>
     Su to root
		</para>
<programlisting><![CDATA[
su -
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Install the dries yum repository by creating dries.repo in the /etc/yum.repos.d/ directory with the following contents
			</para>
<programlisting><![CDATA[
[/etc/yum.repos.d/dries.repo]
[dries]
name=Extra Fedora rpms dries - $releasever - $basearch
baseurl=http://ftp.belnet.be/packages/dries.ulyssis.org/redhat/el5/en/x86_64/dries/RPMS
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Install the dries GPG key
			</para>
<programlisting><![CDATA[
rpm --import http://dries.ulyssis.org/rpm/RPM-GPG-KEY.dries.txt
]]></programlisting>

		</listitem>
		<listitem>
			<para>
     Update your local packages and install some additional ones
			</para>
<programlisting><![CDATA[
yum update -y && yum -y install lynx libawt xorg-x11-deprecated-libs nx freenx arptables_jf httpd-devel
]]></programlisting>

		</listitem>
		<listitem>
			<para>
     Correct release version
			</para>
<programlisting><![CDATA[
mv /etc/redhat-release /etc/redhat-release.orig && \
echo "Red Hat Enterprise Linux Server release 5 (Tikanga)" > /etc/redhat-release
]]></programlisting>

		</listitem>
		<listitem>
     Download the Ultramonkey RPMs from http://www.ultramonkey.org (also grab perl-Mail-POP3Client, available from http://rpm.pbone.net/index.php3/stat/4/idpl/4508518/com/perl-Mail-POP3Client-2.17-1.el5.centos.noarch.rpm.html as of the time of this writing)
		</listitem>
		<listitem>
			<para>
     Install the <filename>arptables-noarp-addr</filename> and <filename>perl-Mail-POP3Client</filename> RPMs 
(change the cd path to wherever you downloaded Ultramonkey to)
			</para>
<programlisting><![CDATA[
cd /usr/local/src/Ultramonkey && rpm -Uvh arptables-noarp-addr-0.99.2-1.rh.el.um.1.noarch.rpm && \
rpm -Uvh perl-Mail-POP3Client-2.17-1.el5.centos.noarch.rpm
]]></programlisting>
		</listitem>
		<listitem>
			<para>
    Install Ultramonkey
			</para>
<programlisting><![CDATA[
yum install -y heartbeat*
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Download the Ultramonkey config files that relate to your desired topology from 
http://www.ultramonkey.org to the /etc/ha.d/ directory 
and edit them to meet your desired configuration.  
Examples as follows:
			</para>
<programlisting><![CDATA[
[/etc/ha.d/authkeys]
auth 2
2 sha1 Ultramonkey!

[/etc/ha.d/ha.cf]
logfacility     local0
mcast eth0 225.0.0.1 694 1 0
auto_failback off
node    ws01.testlab.local
node    ws02.testlab.local
ping 10.0.0.1
respawn hacluster /usr/lib64/heartbeat/ipfail
apiauth ipfail gid=haclient uid=hacluster

[/etc/ha.d/haresources]
ws01.testlab.local      \
        ldirectord::ldirectord.cf \
        LVSSyncDaemonSwap::master \
        IPaddr2::10.0.0.100/24/eth0/10.0.0.255

[/etc/ha.d/ldirectord.cf]
checktimeout=10
checkinterval=2
autoreload=yes
logfile="/var/log/ldirectord.log"
quiescent=no
# Virtual Service for HTTP
virtual=10.0.0.100:80
        fallback=127.0.0.1:80
        real=10.0.0.10:80 gate
        real=10.0.0.20:80 gate
        service=http
        request="alive.html"
        receive="I'm alive!"
        scheduler=wrr
        persistent=1800
        protocol=tcp
          checktype=negotiate
# Virtual Service for HTTPS
virtual=10.0.0.100:443
        fallback=127.0.0.1:443
        real=10.0.0.10:443 gate
        real=10.0.0.20:443 gate
        service=https
        request="alive.html"
        receive="I'm alive!"
        scheduler=wrr
        persistent=1800
        protocol=tcp
          checktype=negotiate
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Set the permission on authkeys
			</para>
<programlisting><![CDATA[
chmod 600 /etc/ha.d/authkeys
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Start the httpd server
			</para>
<programlisting><![CDATA[
httpd -k start
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Create <filename>alive.html</filename> in the <filename>/var/www/html</filename> 
folder with the following text (set this to whatever file you have set in the monitoring script)
			</para>
<programlisting><![CDATA[
I'm alive!
]]></programlisting>
			<para>
      Edit the <filename>/etc/hosts</filename> file to include the FQDN of all of the machines in your LVS 
(not strictly necessary, but it helps avoid problems)
			</para>
<programlisting><![CDATA[
# Do not remove the following line, or various programs
# that require network functionality will fail.
127.0.0.1               localhost.localdomain localhost
10.0.0.10               ws01.testlab.local      ws01
10.0.0.20               ws02.testlab.local      ws02
::1             localhost6.localdomain6 localhost6
]]></programlisting>

		</listitem>
		<listitem>
			<para>
     Edit the <filename>/etc/sysconfig/network-scripts/ifcfg-lo</filename> file with your virtual IP
			</para>
<programlisting><![CDATA[
DEVICE=lo
IPADDR=127.0.0.1
NETMASK=255.0.0.0
NETWORK=127.0.0.0
BROADCAST=127.255.255.255
ONBOOT=yes
NAME=loopback

DEVICE=lo:0
IPADDR=10.0.0.100
NETMASK=255.255.255.255
NETWORK=10.0.0.0
BROADCAST=10.0.0.255
ONBOOT=yes
NAME=loopback
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Edit the <filename>/etc/sysconfig/network-scripts/ifcfg-eth0</filename> file to match this 
(edit the IP address for each director/real server, change from <filename>eth0</filename> 
to whatever active interface you are using):
			</para>
<programlisting><![CDATA[
[/etc/sysconfig/network-scripts/ifcfg-eth0 on ws01]
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.0.0.10
NETMASK=255.255.255.0
GATEWAY=10.0.0.1

[/etc/sysconfig/network-scripts/ifcfg-eth0 on ws02]
DEVICE=eth0
ONBOOT=yes
BOOTPROTO=static
IPADDR=10.0.0.20
NETMASK=255.255.255.0
GATEWAY=10.0.0.1
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Restart the network
			</para>
<programlisting><![CDATA[
service network restart
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Enable packet forwarding and arp ignore in the /etc/sysctl.conf file
			</para>
<programlisting><![CDATA[
net.ipv4.ip_forward = 1
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Reparse the sysctl.conf file
			</para>
<programlisting><![CDATA[
/sbin/sysctl -p
]]></programlisting>
		</listitem>
		<listitem>
			<para>
     Make sure all services are set to start at system boot
(ldirectord is taken out of chkconfig because heartbeat starts it via haresources).
			</para>
<programlisting><![CDATA[
chkconfig httpd on && chkconfig --level 2345 heartbeat on && chkconfig --del ldirectord
]]></programlisting>
		</listitem>
		<listitem>
			<para>
 Start the heartbeat service
			</para>
<programlisting><![CDATA[
/etc/init.d/ldirectord stop && /etc/init.d/heartbeat start
]]></programlisting>
		</listitem>
	</orderedlist>
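	<para>
The negotiate check configured in ldirectord.cf above can be tried by hand before
heartbeat is started. The following is a minimal sketch, not ldirectord's own code:
the function names are made up for the example, it assumes curl is installed,
and it does a plain substring match, whereas ldirectord (as I understand it)
treats the receive string as a regular expression.
	</para>
<programlisting><![CDATA[
#!/bin/bash
# Hypothetical helper functions; they only imitate the request=/receive= check.

# Succeed if the page body contains the expected text.
#   $1 = body returned by the realserver, $2 = expected text
body_is_alive() {
    local body=$1 expected=$2
    printf '%s\n' "$body" | grep -qF -- "$expected"
}

# Fetch alive.html from one realserver and test it.
#   $1 = realserver IP (RIP); 10 seconds matches checktimeout=10 above
check_realserver() {
    local rip=$1
    body_is_alive "$(curl -s --max-time 10 "http://$rip/alive.html")" "I'm alive!"
}

# Example (run on the director):
#   check_realserver 10.0.0.10 && echo "ws01 answers the health check"
]]></programlisting>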
	</section>
	<section id="setup_keepalived">
	<title>Keepalived</title>
	<para>
<ulink url="http://keepalived.sourceforge.net">Keepalived</ulink>
is written by Alexandre Cassen <emphasis>Alexandre (dot) Cassen (at) free (dot) fr</emphasis>, 
and is based on vrrpd for director failover.
Health checking for realservers is included.
It has a lengthy but logical conf file and sets up an LVS for you.
Alexandre released code for this in late 2001.
There is a keepalived mailing list and Alexandre also monitors the LVS mailing list
(May 2004, most of the postings have moved to the keepalived mailing list).
The LVS-HOWTO has some information about
<ulink url="LVS-HOWTO.failover.html#keepalived_vrrpd">Keepalived</ulink>.
	</para>
	</section>
	<section id="ipvsadmd">
       	<title>ipvsman(d)</title>
	<para>
Volker Jaenisch <emphasis>volker (dot) jaenisch (at) inqbus (dot) de</emphasis> 2007-07-04
	</para>
	<para>
http://sourceforge.net/projects/ipvsman/
	</para>
	<para>
ipvsman is a curses based GUI to the IPVS loadbalancer, written in python.
ipvsmand is a monitoring daemon for ipvs that maintains the desired state
of the loadbalancing, as ldirectord or keepalived do.
	</para>
	<itemizedlist>
		<listitem>
 ipvsman/d now comes with tcp regular expression chat to check any
tcp-service you can imagine
		</listitem>
		<listitem>
Sorry-Servers can be checked for their availability.
		</listitem>
		<listitem>
Fedora 7 packages are contributed by Gerry Reno.
		</listitem>
	</itemizedlist>
	</section>
	<section id="soekris">
       	<title>Alternate hardware: Soekris (and embedded hardware)</title>
	<para>
Clint Byrum <emphasis>cbyrum (at) spamaps (dot) org</emphasis> 27 Sep 2004
	</para>
	<blockquote>
		<para>
I'd like to setup a two node Heartbeat/LVS load balancer using Soekris
Net4801 machines. These have a 266Mhz Geode CPU, 3 Ethernet, and 128MB
of RAM. The OS (probably LEAF) would live on a CF disk. If these are
overkill, I'd also consider a Net4501, which has a 133Mhz CPU, 64MB RAM,
and 3 ethernet.
		</para>
		<para>
I'd need to balance about 300 HTTP requests per second, totaling about
150kB/sec, between two servers. I'm doing this now with the servers
themselves (big dual P4 3.02 Ghz servers with lots and lots of RAM).
This is proving problematic as failover and ARP hiding are just a major
pain. I'd rather have a dedicated LVS setup.
		</para>
		<para>
1) anybody else doing this?
		</para>
		<para>
2) IIRC, using the DR method, CPU usage is not a real problem because
reply traffic doesn't go through the LVS boxes, but there is some RAM
overhead per connection. How much traffic do you guys think these should
be able to handle? 
		</para>
	</blockquote>
	<para>		
Ratz 28 Sep 2004 
	</para>
	<para>
The Net4801 machines are horribly slow, but good enough for your purpose. 
The limiting factor on those 
boxes is almost always the cache size. I've waded through too many 
processor sheets of those Geode derivatives to give you specific details 
on your processor, but I would be surprised if it had more than 16kB 
i/d-cache each.
	</para>
	<blockquote>
16k unified cache. :-/
	</blockquote>
	<para>
Make sure that your I/O rate is as low as possible or the first thing to 
blow is your CF disk. I've worked with hundreds of those little boxes in 
all shapes, sizes and configurations. The biggest common mode failures 
were CF disk due to temperature problems and I/O pressure (MTTF was 23 
days); other problems only showed up in really bad NICs locking up half 
of the time.
	</para>
	<blockquote>
I haven't ever had an actual CF card blow on me. LEAF is made to live on
readonly media.. so its not like it will be written to a lot.
	</blockquote>
	<para>
Sorry, blow is exaggerated, I mean they simply fail because they only 
have limited write capacity on the cells.
	</para>
	<para>
RO doesn't mean that there's no I/O going to your disk as you correctly 
noted. The problem is that if you plan on using them 24/7 I suggest you 
monitor your block I/O on your RO partitions using the values from 
/proc/partitions or the wonderful iostat tool. Then extrapolate about 4 
hours worth of samples, check your CF vendor specification on how many 
writes it can endure and see how long you can expect the thing to run.
	</para>
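	<para>
The extrapolation can be done with a few lines of shell arithmetic.
This is a rough sketch under simplifying assumptions: it treats the vendor figure
as a single total number of writes for the whole card, ignores wear levelling and
per-block limits, and the function name and the sample numbers are invented for the example.
	</para>
<programlisting><![CDATA[
#!/bin/bash
# Hypothetical estimate of CF lifetime from a sampled write rate.
#   $1 = writes observed during the sample window (e.g. from iostat)
#   $2 = length of the sample window in hours
#   $3 = total number of writes the vendor says the card endures
cf_lifetime_days() {
    local writes=$1 hours=$2 endurance=$3
    local per_day=$(( writes * 24 / hours ))   # extrapolate the sample to one day
    if [ "$per_day" -eq 0 ]; then
        echo unbounded                         # no writes seen in the sample
        return
    fi
    echo $(( endurance / per_day ))
}

# Example with made-up numbers: 12000 writes in 4 hours, 100 million writes endurance
#   cf_lifetime_days 12000 4 100000000
]]></programlisting>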
	<para>
I have to add that thermal issues were adding to our high failure rates. 
We wanted to ship those little nifty boxes to every branch of a big 
customer to do a big VPN network. Unfortunately the customer is in the 
automobile industry and this means that those boxes were put in the 
strangest places imaginable in garages, sometimes causing major heat 
congestion. Also as it is usual in this sector of industry people are 
used to reliable hardware and so they don't care if at the end of a 
working day they simply shut down the power of the whole garage. 
Needless to say that this adds up to the reduced lifetime of a CF.
	</para>
	<para>
I then did a reliability analysis using the MGL (multiple greek letter, 
derived from the beta-factor model) model to calculate the average risk 
in terms of failure*consequence and we had to refrain from using those 
little nifty things. The costs of repair (detection of failure -> 
replacement of product) at a customer would exceed the income our 
service provided through a mesh of those boxes.
	</para>
	<blockquote>
If these are
overkill, I'd also consider a Net4501, which has a 133Mhz CPU, 64MB RAM,
and 3 ethernet.
	</blockquote>
	<para>
I'd go with the former ones, just to be sure ;).
	</para>
	<blockquote>
Forgive me for being frank, but it sounds like you wouldn't go with
either of them.
	</blockquote>
	<para>
I don't know your business case so it's very difficult to give you a 
definite answer. I can only give you a (somewhat intimidating) experience 
report; someone else might just as well give you a much better report.
	</para>
	<blockquote>
I'd need to balance about 300 HTTP requests per second, totaling about
150kB/sec, between two servers.
	</blockquote>
	<para>
So one can assume a typical request to your website is 512 Bytes, which 
is rather high, but not really an issue for LVS-DR.
	</para>
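	<para>
The 512 Bytes figure follows from dividing the byte rate by the request rate
(taking 1kB as 1024 bytes, which is an assumption). A minimal sketch, with a
made-up function name:
	</para>
<programlisting><![CDATA[
#!/bin/bash
# Hypothetical helper: average bytes per request.
#   $1 = traffic in kB/sec, $2 = requests per second
bytes_per_request() {
    local kb_per_sec=$1 req_per_sec=$2
    echo $(( kb_per_sec * 1024 / req_per_sec ))
}

# Example with the numbers from the posting:
#   bytes_per_request 150 300
]]></programlisting>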
	<blockquote>
I didn't clarify that. The 150kB/sec is outgoing. This isn't for all of
the website, just the static images/html/css.
	</blockquote>
	<blockquote>
I'm doing this now with the servers
themselves (big dual P4 3.02 Ghz servers with lots and lots of RAM).
This is proving problematic as failover and ARP hiding are just a major
pain. I'd rather have a dedicated LVS setup.
	</blockquote>
	<para>
I'd have to agree with this.
	</para>
	<blockquote>
1) anybody else doing this?
	</blockquote>
	<para>
Maybe. Stupid questions: How often did you have to failover and how 
often did it work out of the box?
	</para>
	<blockquote>
Maybe once every 2 or 3 months I'd need to do some maintenance and
switch to the backup. Every time there was some problem with noarp not
coming up or some weird routing issue with the IPs. Complexity bad. :)
	</blockquote>
	<para>
So frankly speaking: your HA solution didn't work as expected ;).
	</para>
	<blockquote>
2) IIRC, using the DR method, CPU usage is not a real problem because
reply traffic doesn't go through the LVS boxes, but there is some RAM
overhead per connection. How much traffic do you guys think these should
be able to handle? 
	</blockquote>
	<para>
This is very difficult to say since these boxes also impose limits 
through their inefficient PCI buses, their rather broken NICs and the 
dramatically reduced cache. Also it would be interesting to know if 
you're planning on using persistency in your setup.
	</para>
	<blockquote>
Persistency is not a requirement. Note that most of the time a client
opens a connection once, and keeps it up as long as they're browsing
with keepalives.
	</blockquote>
	<para>
Yes, provided most clients use HTTP/1.1. Either way, on an application 
level you don't need persistency.
	</para>
	<para>
But to give you a number to start with, I would say those boxes should 
be able (given your constraints) to sustain 5Mbit/s of traffic with 
about 2000pps (~350 Bytes/packet) and only consume 30 Mbyte of your 
precious RAM when running without persistency. This is if every packet 
of your 2000pps is a new client requesting a new connection to the LVS 
and will be inserted by the template at an average of 1 Minute.
	</para>
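	<para>
The figures above imply a per-connection memory cost that can be worked out with
shell arithmetic. The sketch below assumes that "an average of 1 Minute" means each
connection entry lives for about 60 seconds and that 30 Mbyte means 30*1024*1024 bytes;
the function names are made up for the example.
	</para>
<programlisting><![CDATA[
#!/bin/bash
# Hypothetical helpers for a back-of-the-envelope connection table estimate.

# Entries alive at once = new connections per second * lifetime in seconds.
#   $1 = new connections per second, $2 = entry lifetime in seconds
table_entries() {
    echo $(( $1 * $2 ))
}

# Memory per entry implied by a total memory budget.
#   $1 = budget in Mbyte, $2 = number of entries
bytes_per_entry() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# Example with the numbers from the posting:
#   table_entries 2000 60
#   bytes_per_entry 30 120000
]]></programlisting>
	<para>
With the numbers quoted, 2000 new connections per second held for 60 seconds gives
120000 entries, and 30 Mbyte spread over them is about 262 bytes per entry.
	</para>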
	<para>
As mentioned previously, your HW configuration is very hard to compare to 
actual benchmarks, thus take those numbers with a grain of salt, please.
	</para>
	<blockquote>
Thats not encouraging. I need something fairly cheap.. otherwise I might
as well go down the commercial load balancer route. 
	</blockquote>
	<para>
Well, I have given you numbers which are (at a second look) rather low 
estimates ;). Technically, your system should be able to deliver 
25000pps (yes, 25k) at a 50Mbit/s rate. You would then, if every packet 
was a new client, consume about all the memory of your system :). So 
somewhere in between those two numbers I would place the performance of 
your machine.
	</para>
	<para>
Bubba Parker <emphasis>sysadmin (at) citynetwireless (dot) net</emphasis> 27 Sep 2004
	</para>
	<para>
In my tests, the Soekris net4501, 4511, and 4521 all were able to route almost 20Mbps at wire-speed.
I would suspect the 4801 to be in excess of 50Mbps, 
but remember, your Soekris board has 3 nics, 
but what they don't tell you is that they all share the same interrupt, 
so performance degradation is exponential with many packets per second.
	</para>
	<para>
Ratz 28 Sep 2004
	</para>
	<para>
For all Geode based boards I've received more technical documentation 
than I was ever prepared to dive in. Most of the time you get a very 
accurate depiction of your hardware including south and north bridges 
and there you can see that the interrupt lines are hardwired and require 
interrupt sharing.
	</para>
	<para>
However this is not a problem since there's not a lot of devices on the 
bus anyway that would occupy it and if you're really unhappy about the 
bus speed, use setpci to reduce latency for the NIC's IRQs.
	</para>
	<para>
Newer kernels have excellent handling for shared IRQs btw.
	</para>
	<para>
Did you measure exponential degradation? I know you get a pretty steep performance 
reduction once you push the pps too high but I never saw exponential 
behaviour.
	</para>


	<para>
Peter Mueller 2004-09-27 
	</para>
	<blockquote>
What about not using these Soekris's and just using those two beefy servers?
e.g.,  http://www.ultramonkey.org/2.0.1/topologies/ha-overview.html or
 http://www.ultramonkey.org/2.0.1/topologies/sl-ha-lb-overview.html
	</blockquote>
	<para>
Clint Byrum 27 Sep 2004
	</para>
	<para>
That's what I'm doing now. The setup works, but its complexity causes
issues. Bringing up IPs over here, moving them from eth0 to lo over
there, running noarpctl on that box. It's all very hard to keep track of.
It's much simpler to just have two boxes running LVS, and not worry about
what's on the servers.
	</para>
	<para>
Simple things are generally easier to fix if they break. It took me
quite a while to find a simple typo in a script on my current setup,
because it was very non-obvious at what layer things were failing.
	</para>
	</section>
	<section id="turnbull">
       	<title>LVS on a CD: Malcolm Turnbull's ISO files</title>
       	<para>
Malcolm Turnbull <emphasis>Malcolm (dot) Turnbull (at) crocus (dot) co (dot) uk</emphasis>
03 Jun 2003, has released a Bootable ISO image of his Loadbalancer.org appliance software.
The link was at
http://www.loadbalancer.org/modules.php?name=Downloads&amp;d_op=viewdownload&amp;cid=2
but is now dead (Dec 2003).
Checking the website (Apr 2004) I find that the code is available as a
30 day demo
(http://www.loadbalancer.org/download.html, link dead Feb 2005).
	</para>
	<para>
Here's the original blurb from Malcolm
	</para>
	<blockquote>
       		<para>
The basic idea is creating an easy to use layer 4 switch appliance to
compete with Coyote Point Equalizer/ CISCO local director...
All my source code is GPL, but the ISO distribution contains
files that are non-GPL to protect the work and allow vendors to licence the software.
The ISO requires a license before you can legally use it in production.
       		</para>
       		<para>
Burn it to CD and then use it to boot a spare server with
pentium/celeron + ATAPI CD + 64MB RAM + 1 or 2 NICs+20GB HD
       		</para>
<programlisting><![CDATA[
root password is : loadbalancer
ip address is : 10.0.0.21/255.255.0.0
web based login is : loadbalancer
web based password is : loadbalancer
]]></programlisting>
       		<para>
Default setup is DR so just plug it straight into the same hub as your
web servers and have a play..
Download the manuals for configuration info...
       		</para>
	</blockquote>
	</section>
</section>
<section id="LVS-HOWTO.ipvsadm" xreflabel="ipvsadm and schedulers">
<title>LVS: Ipvsadm and Schedulers</title>
<para>
<command>ipvsadm</command> is the user code interface to LVS. 
The scheduler is the part of the ipvs
kernel code which decides which realserver will get the next new connection.
</para>
<para>
There are patches for ipvsadm
</para>
<itemizedlist>
	<listitem>
<link linkend="Padraig">machine readable error codes for ipvsadm</link>
	</listitem>
	<listitem>
<link linkend="stateless_ipvsadm">stateless entry of <command>ipvsadm</command> commands</link>
	</listitem>
	<listitem>
		<para>
Mar 2004. A bug appears to have been introduced in the wrr code. 
Presumably this will be fixed sometime in the main code, and presumably
older versions of <filename>ipvs</filename> still work (but I don't 
know how far back you need to go, presumably to the 2.4 kernels).
Here are some postings on the matter and links to a patch.
		</para>
		<para>
Jan Kasprzak <emphasis>kas (at) fi (dot) muni (dot) cz</emphasis> 2005/03/25
		</para>
		<blockquote>
			<para>
port unreachable after RS removal:
			</para>
			<para>
   I use IPVS with direct routing and wrr scheduler. The problem is
that for some configurations I get "icmp port unreachable" when one of the
real servers fails and is removed from the ip_vs tables. 
The smallest case where I can
replicate the problem is the following:
			</para>
<programlisting><![CDATA[
ipvs# ipvsadm -A -t virtual.service:http -s wrr
ipvs# ipvsadm -a -t virtual.service:http -r realserver1:http -w 100
ipvs# ipvsadm -a -t virtual.service:http -r realserver2:http -w 1000

client$ wget -O - http://virtual.service/
[works as expected]

ipvs# ipvsadm -d -t virtual.service:http -r realserver2

client$ wget -O - http://virtual.service/
--14:46:29--  http://virtual.service/
           => `-'
Resolving virtual.service... 1.2.3.4
Connecting to virtual.service[1.2.3.4]:80... failed: Connection refused.
]]></programlisting>
			<para>
   I have verified by tcpdump that no traffic is sent to realserver2
after it is removed from the virtual.service pool. The ICMP "tcp port
unreachable" is sent by the ipvs director.
This appears to be a problem in the wrr scheduler. With wlc or rr
it works as expected.
The director is Fedora Core 3 with vanilla 2.6.11.3 kernel,
but I have been experiencing this for a longer time.
			</para>
		</blockquote>
		<para>
Sent by: lvs-users-bounces@LinuxVirtualServer.org 2005/03/26
		</para>
		<para>	
This is exactly the problem I described in my previous mails,
and for which a patch is available from Wensong and/or Horms.
Search the mailing list archive for 'overload flag not
resetting', which was my initial (wrong) diagnosis.
See
		</para>
<programlisting><![CDATA[
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=110604584821192&w=2

and the (initial) patch:
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=110749794000222&w=2
]]></programlisting>
	</listitem>
</itemizedlist>
	<section id="using_ipvsadm">
	<title>Using ipvsadm</title>
	<para>
You use <command>ipvsadm</command> from the command line (or in rc files) to set up:
	</para>
	<itemizedlist>
		<listitem>
services/servers that the director directs
(<emphasis>e.g.</emphasis> http goes to all realservers,
while ftp goes only to one of the realservers).
		</listitem>
		<listitem>
			<para>
weighting given to each realserver - useful if some servers
are faster than others.
			</para>
			<para>
Horms 30 Nov 2004
			</para>
			<para>
The weights are integers, but sometimes they are assigned to an
atomic_t, so they can only be 24 bits, <emphasis>i.e.</emphasis>
values from 0 to (2^24-1) should work.
			</para>
		</listitem>
		<listitem>
		<link linkend="scheduler">scheduling algorithm</link>
		</listitem>
	</itemizedlist>
	<para>
You can also use <command>ipvsadm</command> to
	</para>
	<itemizedlist>
		<listitem>
		add services: add a service with weight &gt;0
		</listitem>
		<listitem>
		shutdown (or quiesce) services: set the weight to 0.
		<para>
This allows current connections to continue,
until they disconnect or expire,
but will not allow new connections.
When there are no connections remaining,
you can bring down the service/realserver.
		</para>
		</listitem>
		<listitem>
delete services: this stops traffic for the service (the connection will hang),
but the entry in the connection table is not deleted till it times out.
This allows deletion, followed shortly thereafter by adding back the service,
to not affect established (but quiescent) connections.
		</listitem>
		<listitem>
			<para>
		Once you have a working LVS, save the
<command>ipvsadm</command> settings with <command>ipvsadm-save</command>
<programlisting><![CDATA[
$ ipvsadm-save > ipvsadm.sav
]]></programlisting>
			</para>
			<para>
		and then after reboot,
restore the <command>ipvsadm</command> settings with <command>ipvsadm-restore</command>
<programlisting><![CDATA[
$ ipvsadm-restore < ipvsadm.sav
]]></programlisting>
			</para>
			<para>
Both of these commands can be part of an <command>ipvsadm</command> init script.
			</para>
		</listitem>
		<listitem>
		list version of ip_vs (here 0.9.4, with a hash table size of 4096)
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
]]></programlisting>
		</listitem>
		<listitem>
		list version of <command>ipvsadm</command> (here 1.20)
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm --version
ipvsadm v1.20 2001/09/18 (compiled with popt and IPVS v0.9.4)
]]></programlisting>
		</listitem>
</itemizedlist>
	</section>
	<section id="memory_requirements">
	<title>Memory Requirements</title>
	<para>
On the director, the entries for each connection are stored in a hash table
(the number of buckets is set when compiling <filename>ipvs</filename>). Each
entry takes 128 bytes.
Even large numbers of connections will use only a small amount of memory.
	</para>
	<blockquote>
We would like to use LVS in a system with 700Mbit/s of traffic flowing
through it. The number of concurrent connections is about 420,000. Our main
purpose in using LVS is to direct port 80 requests to a number of squid
servers (~80 servers).
I have read the performance documents and wonder whether I can handle this much
traffic with a 2x3.2 Xeon and 4GB of RAM.
	</blockquote>
	<para>
Ratz 22 Nov 2006 
      	</para>
	<para>
If you use LVS-DR and your squid caches have a moderate hit rate, the amount
of RAM you'll need to load balance 420'000 connections is:
	</para>
<programlisting><![CDATA[
420000 x 128 x [RTTmin up to RTTmin+maxIdleTime] [bytes]
]]></programlisting>
	<para>
This means with 4GB and a standard 3/1GB split (your Xeon CPU is 32bit only
with 64bit EMT) in the 2.6 kernel (I take it as 3000000000 Bytes), you will
be able to serve half a million parallel connections, each connection
lasting at most 3000000000/(500000*128) [secs] = 46.875 secs.
	</para>
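	<para>
The arithmetic above can be checked with a short script.
This is a minimal sketch in Python and is not part of LVS;
the function names are made up for illustration, and the only inputs are
the 128 bytes per entry and the 3000000000 byte budget quoted above.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: function names are hypothetical, figures come from the text above.
ENTRY_BYTES = 128  # size of one connection table entry, as stated above


def table_bytes(connections, entry_bytes=ENTRY_BYTES):
    """Memory held by the entries for this many simultaneous connections."""
    return connections * entry_bytes


def ratz_quotient(budget_bytes, connections, entry_bytes=ENTRY_BYTES):
    """Ratz's quotient: memory budget / (connections * entry size)."""
    return budget_bytes / (connections * entry_bytes)


if __name__ == "__main__":
    print(table_bytes(420000))                  # bytes for 420000 entries
    print(ratz_quotient(3000000000, 500000))    # the figure Ratz quotes
```
]]></programlisting>
	<para>
For 420000 entries this gives 53760000 bytes (about 51 MiB),
which is a small fraction of the 4GB in the machine described.
	</para>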
	</section>
	<section id="sysctl documentation">
	<title>sysctl documentation</title>
	<para>
The <filename>sysctl</filename> documentation for ipvs will be in
<filename>Documentation/networking/ipvs-sysctl.txt</filename> for 2.6.18 (hopefully).
It is derived from http://www.linuxvirtualserver.org/docs/sysctl.html v1.4.
	</para>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis> 08 Mar 2007
	</para>
	<para>
A couple of times recently people have posted to the keepalived list or
the LVS list about different issues which were resolved by toggling
sysctls (most recently expire_quiescent_template - see <xref linkend="new_persistence"/>).
This got me thinking: these sysctls are pretty important, and not
everyone knows what to do with them (or how to change them) since the
recommended ways to modify them can vary between distributions.
So, why not give ipvsadm the capability to modify appropriate sysctls
found in <filename>/proc/sys/net/ipv4/vs/</filename>?
The more I thought about it, the more I considered that the easiest way
to do so would be to use a "generic" option along the lines of the
e2fsprogs style "-O option,option,option=value" with "^option" as a
negation for booleans.
So you'd be able to say:
	</para>
<programlisting><![CDATA[
ipvsadm -O expire_quiescent_template,expire_nodest_conn
ipvsadm -O expire_nodest_conn
ipvsadm -O drop_packet=1,drop_entry=1,expire_nodest_conn
]]></programlisting>
	<para>
By making the option handler "generic" like this, as other sysctls
arrive as the kernel develops they can simply be toggled or changed as
necessary; in all cases, where no corresponding sysctl exists, an error
is thrown to that effect.
In my mind it makes ipvsadm more of a "one stop shop" for the various
settings - not only will it manage the virtual and real servers, but more
of the underlying infrastructure too.
	</para>
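	<para>
To make the proposed syntax concrete, here is a minimal sketch in Python of how
such an option string might be parsed and applied.
Note that "-O" is only Graeme's proposal and is not an existing <command>ipvsadm</command> option;
the function names, and the choice of writing "1" or "0" for booleans, are assumptions
made for this illustration.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: "-O" is Graeme's proposal, not an existing ipvsadm option,
# and every function name here is hypothetical.
import os

SYSCTL_DIR = "/proc/sys/net/ipv4/vs"


def parse_options(spec):
    """Turn 'a,^b,c=3' into [('a', '1'), ('b', '0'), ('c', '3')]."""
    settings = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "=" in item:
            name, value = item.split("=", 1)
        elif item.startswith("^"):
            name, value = item[1:], "0"   # "^option" negates a boolean
        else:
            name, value = item, "1"       # a bare option switches it on
        settings.append((name, value))
    return settings


def apply_options(spec, base=SYSCTL_DIR):
    """Write each setting to its file, failing if the sysctl is unknown."""
    for name, value in parse_options(spec):
        path = os.path.join(base, name)
        if not os.path.exists(path):
            raise KeyError("no such sysctl: " + name)
        with open(path, "w") as handle:
            handle.write(value + "\n")
```
]]></programlisting>
	<para>
Because the parser knows nothing about individual sysctls, a new entry under
<filename>/proc/sys/net/ipv4/vs/</filename> would be usable without changing the code,
which is the point of the "generic" handler.
	</para>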
	<para>
Ratz  08 Mar 2007
	</para>
	<para>
I tend to agree, however people who want to set up LVS do need to know Linux
at the level of also understanding sysctl variables and their meaning. I've
always hoped that with having them in the <command>ipvsadm</command> man page, the problem
would be solved.
I know only of one application that modifies sysctls, and this is
the broken pluto of Free/OpenSwan :).
	</para>
	<para>
You still have to know what the options mean, correct? I favour a
different approach more: Make LVS really user friendly, in that you provide
the users with a tool that takes away the low level configuration, just like
in linux-ha or commercial load balancers. It's not so difficult to write
this, it's just that someone has to sit down and do it.
	</para>
	<para>
You still need to absolutely know the semantics of these settings. So what
is the real gain between
	</para>
<programlisting><![CDATA[
   ipvsadm -O expire_quiescent_template=1

and

   echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template
]]></programlisting>

	<para>
Horms
	</para>
	<para>
Of late I've been thinking of the idea of enabling LVS to be configured
via netlink rather than the current /proc + get/setsockopt fun.
I think this was Ratz's idea, but it seems like a good one to me,
as it should allow a lot more flexibility in the user-space to kernel
communication, which has always been a problem from the point
of backwards compatibility. So I have kind of been thinking of ipvsadm2
or ipvsadm-nl.
	</para>
	<para>
Ratz
	</para>
	<para>
I've already started once with the conversion of IPVS to netlink and I've
named the new ipvsadm ipvsctl :). I've attached my work (actually the part I
could find right now ... I know on some of my dozens of crashed Laptop
harddisks there should be more) in this email, so it doesn't get lost and
you don't have to duplicate it. This would also allow us to easily implement
the missing features and enable us to move more towards
netfilter-friendliness.
	</para>
	</section>
	<section id="ipvs_kernel_match">
	<title>Compile a version of ipvsadm that matches your ipvs</title>
	<para>
Compile and install <command>ipvsadm</command> on the
director using the supplied Makefile. You can optionally compile ipvsadm
with popt libraries, which allows <command>ipvsadm</command> to handle more complicated
arguments on the command line.
If your <filename>libpopt.a</filename> is too old, your <command>ipvsadm</command> will segv.
(I'm using the dynamic libpopt and don't have this problem).
	</para><para>
Since you compile <filename>ipvs</filename> and <command>ipvsadm</command> independently and
you cannot compile <command>ipvsadm</command> until you have patched the kernel headers,
a common mistake is to compile the kernel and reboot, forgetting to
compile/install ipvsadm.
	</para><para>
Unfortunately there is only rudimentary version detection code in ipvs/ipvsadm.
If you have a mismatched ipvs/<command>ipvsadm</command> pair,
many times there won't be problems, as any particular
version of <command>ipvsadm</command> will work with a wide range of patched kernels.
Usually with 2.2.x kernels,
if the ipvs/<command>ipvsadm</command> versions mismatch, you'll get weird but non-obvious
errors about not being able to install your LVS. Other possibilities are
that the output of <command>ipvsadm</command> -L will have IPs that are clearly not IPs (or
not the IPs you put in) and ports that are all wrong. It will look something
like this
	</para>
<programlisting><![CDATA[
[root@infra /root]# ipvsadm
IP Virtual Server version 1.0.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  C0A864D8:0050 rr
  -> 01000000:0000      Masq    0      0          0
]]></programlisting>
	<para>
rather than
	</para>
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:ssh rr
  -> RS2.mack.net:ssh             Route   1      0          0
]]></programlisting>
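	<para>
In the mismatched listing the address and port fields appear to be raw hexadecimal.
Assuming each field is a plain hexadecimal IPv4 address and port
(a guess from the sample, where 0050 is 80, the http port),
the following minimal Python sketch decodes them.
The helper name is made up for this illustration and is not part of <command>ipvsadm</command>.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: decode_endpoint is a hypothetical helper, not part of ipvsadm.
import ipaddress


def decode_endpoint(raw):
    """Convert a field such as 'C0A864D8:0050' to (dotted quad, port)."""
    addr_hex, port_hex = raw.split(":")
    address = ipaddress.IPv4Address(int(addr_hex, 16))
    return str(address), int(port_hex, 16)


if __name__ == "__main__":
    for field in ("C0A864D8:0050", "01000000:0000"):
        print(field, decode_endpoint(field))
```
]]></programlisting>
	<para>
With the sample above, C0A864D8:0050 decodes to 192.168.100.216 port 80,
while 01000000:0000 decodes to 1.0.0.0 port 0, which is clearly not a realserver.
	</para>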
	<para>
There was a change in the /proc file system for
ipvs about 2.2.14 which caused problems
for anyone with a mismatched ipvsadm/ipvs.
The <command>ipvsadm</command> from different kernel series (2.2/2.4) do not
recognise the ipvs kernel patches from the other series (they appear to
not be patched for ipvs).
	</para><para>
The later 2.2.x ipvsadms know the minimum version of ipvs that they'll run on,
and will complain about a mismatch.
They don't know the maximum version
(which will be produced presumably some time in the future)
that they will run on.
This protects you against the unlikely event of installing a new 2.2.x version of
<command>ipvsadm</command> on an older version of ipvs, but will not protect you against the
more likely scenario where you forget to compile <command>ipvsadm</command> after building your kernel.
The <command>ipvsadm</command> maintainers are aware of the problem.
Fixing it will break the current code and they're waiting
for the next code revision which breaks backward compatibility.
	</para><para>
If you didn't even apply the kernel patches for ipvs, then ipvsadm
will complain about missing modules and exit
(<emphasis>i.e.</emphasis> you can't even do `<command>ipvsadm -h</command>`).
	</para>
		<section id="other_compile_problems">
		<title>Other compile problems</title>
		<para>
Ty Beede <emphasis>tybeede (at) metrolist (dot) net</emphasis>
		</para>
		<blockquote>
		<para>
on a slackware 4.0 machine I went to compile <command>ipvsadm</command> and it gave
me an error indicating that the iphdr type was undefined and
it didn't like that when it saw
		</para>
		<para>
to <filename>ip_fw.h</filename> I added
		</para>
<programlisting><![CDATA[
#include <linux/ip.h>
]]></programlisting>
		<para>
in ipvsadm.c, which is where the iphdr
structure is defined and everything went ok
		</para>
		</blockquote>
		<para>
Doug Bagley <emphasis>doug (at) deja (dot) com</emphasis>
		</para>
		<para>
The reason that it fails "out of the box" is because fwp_iph's
type definition (struct iphdr) was
		</para>
<programlisting><![CDATA[
#ifdef'd out in <linux/ip_fw.h>
]]></programlisting>
		<para>
(and not included anywhere else) since the symbol __KERNEL__ was
undefined.
		</para>
<programlisting><![CDATA[
Including <linux/ip.h> before <linux/ip_fw.h>
]]></programlisting>
		<para>
in the .c file did the trick.
		</para>
		</section>
	</section>
	<section id="realservers_in_etc_hosts">
	<title>put realservers in /etc/hosts</title>
	<para>
(from a note by Horms 26 Jul 2002)
	</para>
	<para>
<filename>ipvsadm</filename> by default outputs the <emphasis>names</emphasis> 
of the realservers rather than the IPs.
The director then needs name resolution.
If you don't have it,
<command>ipvsadm</command> will take a long time (up to a minute) to return,
as it waits for name resolution to timeout.
The only IPs that the director needs to resolve are of the realservers.
DNS is slow. To prevent the director from needing DNS,
put the names of the realservers in <filename>/etc/hosts</filename>.
This lookup is quicker than DNS and you won't need
to open a route from the director to a nameserver.
	</para>
	<para>
Or you could use `<command>ipvsadm -n</command>` which outputs the IPs
of the realservers instead.
	</para>
	</section>
	<section id="scheduler">
	<title>RR and LC schedulers</title>
	<para>
On receiving a connect request from a client, the director
assigns a realserver to the client based on a &quot;schedule&quot;.
The scheduler type is set with <command>ipvsadm</command>.
The schedulers available are
	</para>
	<itemizedlist>
		<listitem>
		round robin (rr), weighted round robin (wrr) - new
connections are assigned to each realserver in turn
		</listitem>
		<listitem>
		<para>
		least connected (lc), weighted least connection (wlc) - new
connections go to the realserver with the fewest connections.
This is not necessarily the least busy realserver, but is a step in
that direction.
		</para>
		<para>
		<note>
Doug Bagley <emphasis>doug (at) deja (dot) com</emphasis>
points out that *lc schedulers will not work properly 
if a particular realserver is used in two different LVSs.
		</note>
		</para>
		<para>
Willy Tarreau (in 
<ulink url="http://1wt.eu/articles/2006_lb/">
http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing</ulink>)
says that *lc is suited for very long sessions, 
but not for webservers where the load varies on a short time scale.
		</para>
		</listitem>
		<listitem>
		<xref linkend="LVS-HOWTO.persistent_connection"/>
		</listitem>
		<listitem>
		LBLC: a persistent memory algorithm
		</listitem>
		<listitem>
		DH: destination hash
		</listitem>
		<listitem>
		<para>
		SH: source hash
		</para>
		<para>
Again from Willy: this is used when you want a client to always appear on the same
realserver (<emphasis>e.g.</emphasis> a shopping cart, or database). 
The SH scheduler has not been much used in LVS,
possibly because no-one knew the syntax for a long time and couldn't get it to work.
Most shopping cart type servers are using persistence, 
which has many undesirable side effects.
		</para>
		</listitem>
	</itemizedlist>
	<para>
The original schedulers are rr, and lc (and their weighted versions).
Any of these will do for a test setup. In particular,
round robin will cycle connections
to each realserver in turn, allowing you to check that all
realservers are functioning in the LVS.
The rr,wrr,lc,wlc schedulers should all work similarly when
the director is directing identical realservers with identical services.
The lc scheduler will better handle situations where machines
are brought down and up again
(see <link linkend="thundering_herd">thundering herd problem</link>).
If the realservers are offering different services and some have clients
connected for a long time while others are connected for a short time,
or some are compute bound, while others are network bound,
then none of the schedulers will do a good job of distributing
the load between the realservers.
LVS doesn't have any load monitoring of the realservers.
Figuring out a way of doing this that will work for a range of different types
of services isn't simple (see <link linkend="agent">load and failure monitoring</link>).
	</para>
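	<para>
The idea behind the original schedulers can be sketched in a few lines of Python.
This is an illustration of the selection rules only, not the kernel implementation;
the function names and the dictionaries holding connection counts and weights are assumptions.
The weighted variant skips realservers with weight 0, matching the quiescing behaviour
described for <command>ipvsadm</command>.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: illustrates the idea of rr, lc and wlc, not the kernel code.
import itertools


def round_robin(realservers):
    """Return an iterator that yields the realservers in turn, forever."""
    return itertools.cycle(realservers)


def least_connected(active):
    """Pick the realserver with the fewest connections.

    `active` maps realserver name -> current connection count.
    Ties go to the first entry in the mapping.
    """
    return min(active, key=active.get)


def weighted_least_connected(active, weight):
    """Pick the realserver with the lowest connections/weight ratio,
    ignoring realservers whose weight is 0 (quiesced)."""
    candidates = [rs for rs in active if weight[rs] > 0]
    return min(candidates, key=lambda rs: active[rs] / weight[rs])


if __name__ == "__main__":
    rr = round_robin(["RS1", "RS2", "RS3"])
    print([next(rr) for _ in range(5)])
    print(least_connected({"RS1": 5, "RS2": 2, "RS3": 2}))
```
]]></programlisting>
	<para>
None of these rules looks at the actual load on a realserver;
they only count connections, which is why they do poorly when connections differ
greatly in cost or duration.
	</para>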
	<note>
		<para>
Ratz Nov 2006
		</para>
		<para>
After almost 10 years of my involvement with load balancers, 
I have to admit that no customer _ever_ truly asked or cared 
about the scheduling algorithm :). 
This is academia for the rest of the world.
		</para>
	</note>
	</section>
	<section id="netmask_for_VIP">
	<title>Netmask for VIP</title>
	<para>
You set up the RIPs, DIP and other networking with whatever netmask you
choose. For the VIP
	</para>
	<itemizedlist>
		<listitem>
For LVS-DR, LVS-Tun: netmask for VIP on director, realservers must be /32.
		</listitem>
		<listitem>
For LVS-NAT: the netmask can be /32 or the netmask of the RIPs, DIP.
		</listitem>
	</itemizedlist>
	<para>
You will need to set up the routing for the VIP to match the netmask.
For more details, see the chapters for each forwarding method.
	</para>
	<para>
Horms 12 Aug 2004 
	</para>
	<para>
The real story is that the netmask works a little differently
on lo to other interfaces. On lo the interface will answer to
_all_ addresses covered by the netmask. This is how 127.0.0.1/8 on
lo ends up answering 127.0.0.0-127.255.255.255. So if
you add 172.16.4.222/16 to eth0 then it will answer 172.16.4.222 and
only 172.16.4.222. But if you add the same thing to lo then it 
will answer 172.16.0.0-172.16.255.255. So you need to use
172.16.4.222/32 instead.
	</para>
	<para>
To clarify -
	</para>
<programlisting><![CDATA[
ifconfig eth0:0 192.168.10.10 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Add 192.168.10.10 to eth0

ifconfig lo:0 192.168.10.10 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Add 192.168.10.0 - 192.168.10.255 to lo

ifconfig lo:0 192.168.10.0 netmask 255.255.255.0 broadcast 192.168.10.255 up
   -> Same as above, add 192.168.10.0 - 192.168.10.255 to lo

ifconfig lo:0 192.168.10.10 netmask 255.255.255.255 broadcast 192.168.10.10 up
   -> Add 192.168.10.10 to lo
]]></programlisting>
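	<para>
The rule Horms describes can be modelled with Python's <filename>ipaddress</filename> module.
This is a sketch of the rule as stated above, not a query of a real interface;
the function name and the <emphasis>on_loopback</emphasis> flag are made up for the illustration.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: models Horms' description; it does not inspect a real interface.
import ipaddress


def answered_range(cidr, on_loopback):
    """Return (first, last) address that an interface would answer for.

    On lo the whole network covered by the netmask is answered;
    on other interfaces only the configured address is.
    """
    iface = ipaddress.ip_interface(cidr)
    if on_loopback:
        net = iface.network
        return str(net[0]), str(net[-1])
    return str(iface.ip), str(iface.ip)


if __name__ == "__main__":
    print(answered_range("172.16.4.222/16", on_loopback=False))
    print(answered_range("172.16.4.222/16", on_loopback=True))
    print(answered_range("172.16.4.222/32", on_loopback=True))
```
]]></programlisting>
	<para>
Only the /32 form restricts lo to the single VIP, which is why the VIP on the realservers
is configured with netmask 255.255.255.255.
	</para>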

	<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 2005/04/21
	</para>
	<para>
On all platforms apart from Windows you want 255.255.255.255 for the
loopback.
On Windows you can get away with 255.255.255.0 if you use a priority of 254,
80% of the time.
255.255.255.255 can be used if you modify the registry...
But we've found that 255.0.0.0 works better, 99% of the time, because
Windows by default uses the smallest subnet first for routing,
and a class A will never be used instead of a class C.
	</para>
	</section>
	<section id="DH">
	<title>LBLC, DH schedulers</title>
	<para>
The LBLC code (by Wensong) and the DH scheduler
(by Wensong, inspired by code submitted by Thomas Proell
<emphasis>proellt (at) gmx (dot) de</emphasis>)
are designed for web caching realservers
(<emphasis>e.g.</emphasis> squids).
For normal LVS services (<emphasis>e.g.</emphasis> ftp, http), the content
offered by each realserver is the same and it doesn't
matter which realserver the client is connected to.
For a web cache, after the first fetch has been made,
the web caches have different content.
As more pages are fetched, the contents of the web caches will diverge.
Since the web caches will be setup as peers,
they can communicate by ICP (internet caching protocol)
to find the cache(s) with the required page.
This is faster than fetching the page from the original webserver.
However, it would be better after the first fetch of a page
from http://www.foo.com/*, for all subsequent clients wanting a
page from http://www.foo.com/ to be connected to that realserver.
	</para>
	<para>
The original method for handling this was to make
connections to the realservers persistent,
so that all fetches from a client went to the same realserver.
	</para>
	<para>
The -dh (destination hash) algorithm makes a hash from the target IP
and all requests to that IP will be sent to the same realserver.
This means that content from a URL will not be retrieved
multiple times from the remote server.
The realservers (<emphasis>e.g.</emphasis> squids in this case)
will each be retrieving content from different URLs.
	</para>
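	<para>
The principle of destination hashing can be shown with a toy example in Python.
The hash used here (the address as an integer, modulo the number of realservers)
is an assumption chosen for brevity and is not the function used by the kernel's dh scheduler;
the function name is also made up.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: a toy hash, not the function the kernel's dh scheduler uses.
import ipaddress


def dh_pick(dest_ip, realservers):
    """Map a destination IP to one realserver by its position in the list."""
    key = int(ipaddress.IPv4Address(dest_ip))
    return realservers[key % len(realservers)]


if __name__ == "__main__":
    squids = ["squid1", "squid2"]
    for ip in ("192.0.2.10", "192.0.2.11", "192.0.2.10"):
        print(ip, "->", dh_pick(ip, squids))
```
]]></programlisting>
	<para>
Every request for the same destination IP lands on the same realserver.
In this toy version the result depends on the position of each realserver in the list,
so two directors only agree if they list the realservers in the same order.
	</para>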
	<para>
Wensong Zhang <emphasis>wensong (at) gnuchina (dot) org</emphasis> 16 Feb 2001
	</para>
	<para>
Please see "man ipvsadm" for a short description of the DH and SH
schedulers. I think some examples of how to use those two schedulers would help.
	</para>
	<para>
Example:  cache cluster shared by several load balancers.
	</para>
<programlisting><![CDATA[
		Internet
		|
                |------cache array
                |
		|-----------------------------
		   |                |
		   DH               DH
		   |                |
		 Access            Access
                 Network1          Network2
]]></programlisting>
	<para>
The DH scheduler lets the two load balancers redirect requests
destined for the same IP address to the same cache server. If the server
is dead or overloaded, the load balancer can use the cache_bypass feature to
send requests to the original server directly. (Make sure that the cache
servers are added to the two load balancers in the same order.)
	</para>
	<para>
Diego Woitasen 12 Aug 2003
	</para>
	<blockquote>
The scheduling algorithms that use the dest IP for selecting
the realserver (like DH, LBLC, LBLCR) are only applicable to
transparent proxies, this being the only application where the dest IP
could be variable.
	</blockquote>
	<para>
Wensong Zhang <emphasis>wensong (at) linux-vs (dot) org</emphasis> 12 Aug 2003
	</para>
	<para>
Yes, you are almost right. LBLC and LBLCR are written for transparent
proxy clusters only. DH can be used for transparent proxy cluster and can
be used in other clusters needing static mapping.
	</para>
	<para>
	<note>
Here follows a set of exchanges between a Chinese person and Wensong,
that were in English, that I didn't follow at all. Apparently it was
clear to Wensong.
	</note>
	</para>
	<blockquote>
If lblc uses dh, then is lblc = dh + lc?
	</blockquote>
	<para>
Wensong Zhang 09 Mar 2004
	</para>
	<para>
Maybe lblc = dh + wlc.
	</para>
<programlisting><![CDATA[
/*
 * The lblc algorithm is as follows (pseudo code):
 *
 *       if cachenode[dest_ip] is null then
 *               n, cachenode[dest_ip] <- {weighted least-conn node};
 *       else
 *               n <- cachenode[dest_ip];
 *               if (n is dead) OR
 *                  (n.conns>n.weight AND
 *                   there is a node m with m.conns<m.weight/2) then
 *                 n, cachenode[dest_ip] <- {weighted least-conn node};
 *
 *       return n;
 *
 */
]]></programlisting>

	<para>
	The difference between lblc and lblcr is that cachenode[dest_ip] in
lblc is a server, and cachenode[dest_ip] in lblcr is a server set.
	</para>
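	<para>
The lblc pseudo code can be transcribed into runnable Python as follows.
This is a sketch for experimenting with the rule, not the kernel code:
the dictionary representation of a node (keys <emphasis>weight</emphasis>, <emphasis>conns</emphasis>,
<emphasis>alive</emphasis>), the function names, and the restriction of the half-load test
to live nodes are assumptions, and the expiry of unused map entries is not modelled.
	</para>
<programlisting><![CDATA[
```python
# Sketch only: a transcription of the lblc pseudo code, not kernel code.
# Nodes are plain dicts with the hypothetical keys: weight, conns, alive.


def wlc_node(nodes):
    """Weighted least-connection choice among live nodes with weight > 0."""
    live = [n for n in nodes if n["alive"] and n["weight"] > 0]
    return min(live, key=lambda n: n["conns"] / n["weight"])


def lblc_schedule(dest_ip, cachenode, nodes):
    """Return the node for dest_ip, updating the cachenode map as needed."""
    n = cachenode.get(dest_ip)
    if n is None:
        # first request for this destination: assign by wlc
        n = cachenode[dest_ip] = wlc_node(nodes)
    elif not n["alive"] or (
        n["conns"] > n["weight"]
        and any(m["conns"] < m["weight"] / 2 for m in nodes if m["alive"])
    ):
        # node is dead, or overloaded while another node is under half load
        n = cachenode[dest_ip] = wlc_node(nodes)
    return n


if __name__ == "__main__":
    a = {"weight": 2, "conns": 0, "alive": True}
    b = {"weight": 2, "conns": 1, "alive": True}
    cache = {}
    print(lblc_schedule("10.0.0.1", cache, [a, b]) is a)
```
]]></programlisting>
	<para>
Once a destination is mapped it stays with its node until that node dies or becomes
overloaded while another node is less than half loaded, which is what keeps requests
for one site on one cache.
	</para>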
	<blockquote>
In lblc the server has overloaded and lvs use wlc and allocate a server in
half load of the server,
Allocate the weighted least-connection server to IP address.
Does this mean that after allocation for the IP address, it will not return to the past
server?
	</blockquote>
	<para>
No, it will not in most cases. There is only one possible exception:
the current map expires after it has not been used for six minutes, and the past
server is the one with the least connections when the next access to the IP
address comes.
	</para>
		<section id="scheduling_squids">
		<title>scheduling <link linkend="squids">squids</link></title>
		<para>
The usual problem with squids not using a cache friendly scheduler
is that fetches are slow. In this case the website is sending hits
to several different RIPs. Some websites detect this and won't
even serve you the pages.
		</para>
		<para>
Palmer J.D.F. <emphasis>J (dot) D (dot) F (dot) Palmer (at) Swansea (dot) ac (dot) uk</emphasis> 18 Mar 2002
		</para>
		<blockquote>
		<para>
I tried https and online banking sites (<emphasis>e.g.</emphasis> www.hsbc.co.uk).
It seems that this site and undoubtedly many other secure sites don't like
to see connections split across several IP addresses as happens with my
cluster.
Different parts of the pages are requested by different realservers, and
hence different IP addresses.
		</para>
		<para>
It gives an error saying...
"...For your security, we have disconnected you from internet banking due to
a period of inactivity..."
		</para>
		<para>
I have had caching issues with HSBC before, they seem to be a bit more
stringent than other sites.
If I send the requests through one of the squids on its own it works fine,
so I can only assume it's because it is seeing fragmented requests, maybe
there is a keepalive component that is requested.
How do I combat this?  Is this what persistence does or is there a way of
making the realservers appear to all have the same IP address?
		</para>
		</blockquote>
		<para>
Joe
		</para>
		<para>
change -rr (or whatever you're running) to -dh.
		</para>
		<para>
Lars
		</para>
		<para>
Use a different scheduler, like lblc or lblcr.
		</para>
		<para>
Harry Yen <emphasis>hyen1 (at) yahoo (dot) com</emphasis> 16 April 2002
		</para>
		<para>
What is the purpose of using LVS with Squid to a https site?
HTTPs based material typically is not cachable.
I don't understand why you need Squid at all.
		</para>
		<para>
Once a request reaches a Squid and incurs a cache miss, the forwarded
request will have Squid IP as the source address.  So you need to find a
way to make sure all connections from the same client IP go to the
same Squid farm. Then when they incur cache misses, they will wind up
via LVS persistency to the same realserver.
		</para>
		<blockquote>
			<para>
The reason https is sent to the squids is because it's much easier to send
all browser traffic to the squids and then let them handle it.
The only way I seemed to be able to get this to work (i.e. access the bank
site) is to set a persistence (360 seconds), and using lblc scheduling.
The current output of <command>ipvsadm</command> is this... I am a tad concerned at the
apparent lack of load balancing.
			</para>
<programlisting><![CDATA[
TCP  wwwcache-vip.swan.ac.uk:squi lblc persistent 360
  -> squidfarm1.swan.ac.uk:squid  Route   1      202        1045
  -> squidfarm2.swan.ac.uk:squid  Route   1      14         8
]]></programlisting>
			<para>
I have sorted it by using persistence, couldn't get any of the dedicated
squid schedulers to work properly. I'm currently running wlc, and 360s
persistence.  Seems to be holding up really well.  Still watching it with
eagle eyes though.
			</para>
		</blockquote>
		<para>
The -dh scheduler was written expressly to handle squids.
Jezz tried it and didn't get it to work satisfactorily but found that
persistence worked. We don't understand this yet.
		</para>
		<para>
Jakub Suchy <emphasis>jakub (at) rtfm (dot) cz</emphasis> 2005/02/23
		</para>
		<para>
The round-robin algorithm is not usable for squid.
Some servers (banks etc.) check the client's IP address and terminate its
connection if it changes.
When you use the source-hashing algorithm, 
IPVS checks the client against its 
local table and forwards connection always to same squid real server, 
so the client always accesses the web through same squid.
Source-hashing can become unbalanced when you have few clients
and one of them uses squid more frequently than the others.
With more clients, it's statistically balanced.
		</para>
		</section>
		<section id="dh_persistence">
		<title>DH persistence</title>
		<para>
Just as you can use the <xref linkend="SH-scheduler"/> to achieve persistence (affinity), 
you can also use the DH.
We couldn't work out why this LVS wasn't scheduling the way it was expected, but
the DH scheduler fixed it.
		</para>
		<para>
Steve Haneman <emphasis>stevehaneman (at) yahoo (dot) com</emphasis> 22 Oct 2008 
		</para>
		<blockquote>
			<para>
I'm using ipvsadm to load balance 2 security web servers, so I'm using 3 boxes.
The web servers have 100 IPs each in the 192.168.253.x and 192.168.252.x ranges.
			</para>
			<para>
I'm running a load test through the lb to the web servers from 20 unique IPs.
I'm finding that 4% of the time the sessions are not sticky.  
A user/password web POST goes to one server and a followup POST ends up at the other server.
There is less than 1 second between the POSTs.  
Each transaction (login POST, data POST, logout) is completed with the IP it started with.
			</para>
		</blockquote>
		<para>
Jeff Tchang <emphasis>jeff (dot) tchang (at) gmail (dot) com</emphasis>
		</para>
		<para>
Not sure if this might help but have you tried using different
scheduling algorithms? In particular maybe destination hashing?
		</para>
		<para>
Steve
		</para>
		<blockquote>
That fixed it.  I changed from rr to dh and I'm seeing all goodness. 
		</blockquote>
		</section>
	</section>
	<section id="Henrik">
	<title>LVS with mark tracking: fwmark patches for multiple firewalls/gateways</title>
	<para>
If the LVS is protected by multiple firewall boxes and each
firewall is doing connection tracking, then packets arriving
and leaving the LVS from the same connection will need
to pass through the same firewall box or else they won't be
seen to be part of the same connection and will be dropped.
An initial attempt to handle the firewall problem was
sent in by Henrik Nordstrom, who is involved with developing
<ulink url="http://squid.sourceforge.net/hno/">web caches (squids)</ulink>.
	</para>
	<para>
This code isn't a scheduler, but it's in here awaiting further developments of code
from Julian because it addresses similar problems to the <link linkend="SH-scheduler">
SH scheduler</link> in the next section.
	</para>
	<para>
Julian 13 Jan 2002
	</para>
	<blockquote>
	Unfortunately Henrik's patch breaks the LVS fwmark code.
Multiple gateway setups can be solved with routing and a solution is planned for LVS.
Until then it would be best to contact
Henrik, <emphasis>hno (at) marasystems (dot) com</emphasis> for his patch.
	</blockquote>
	<para>
Here's Henrik's
<ulink url="files/ipvs-0.2.3-mark-track-v2.patch">patch</ulink> and here's some history.
	</para>
	<para>
Henrik Nordstrom <emphasis>hno (at) marasystems (dot) com</emphasis> 13 Jan 2002
	</para>
	<blockquote>
	<para>
My use of the MARK is for routing purposes of return traffic only, not at
all related to the scheduling onto the farm.
This to solve complex routing problems arising in borders between networks
where it is impractical to know full routing of all clients.
One example of what I do is like this:
	</para>
	<para>
I have a box connected to three networks (firewall, including LVS-NAT load
balancing capabilities for published services)
	</para>
	<itemizedlist>
		<listitem>
a - Internet
		</listitem>
		<listitem>
b - DMZ, where the farm members are
		</listitem>
		<listitem>
c - Large intranet
		</listitem>
	</itemizedlist>
	<para>
For simplicity both Internet and intranet users connect to the same
LVS IP addresses.
Both networks 'a' and 'c' are complex, and maintaining a complete and
correct routing table covering one of the networks (i.e. the 'c' network
in the above) borders on the impossible and is error prone as the use of
addresses changes over time.
	</para>
	<para>
To simplify routing decisions I simply want return traffic to be
routed back the same way as from where the request was received. This
covers 99.99% of all routing needed in such situation regardless of the
complexity of the networks on the two (or more) sides without the need of
any explicit routing entries. To do this I MARK the session when received
using netfilter, giving it a routing mark indicating which path the
session was received from. My small patch modifies LVS to memorize this
mark in the LVS session, and then restore it on return traffic received
FROM the realservers. This allows me to route the return traffic from the
farm members to the correct client connection using iproute fwmark based
routing rules.
	</para>
	<para>
As farm distribution algorithms I use different ones depending on the type
of application. The MARK I only use for routing of return traffic.
I also have a similar patch for Netfilter connection tracking (and NAT),
for the same purpose of routing return traffic. If interested search for
CONNMARK in the netfilter-devel archives.
The two combined allow me to make multihomed boxes which do not need to
know the networks on any of the sides in detail, besides their own IP
addresses and suitable gateways to reach further into the networks.
	</para>
	<para>
Another use of the connection MARK memory feature is a device connected to
multiple customer networks with overlapping IP addresses, for example two
customers both using 192.168.1.X addresses. In such case making a standard
routing table becomes impossible as the customers are not uniquely
identified by their IP addresses. The MARK memory however deals with such
routing with ease since it does not care about the detailed addressing as long
as it is possible to identify the two customer paths somehow, for example by the interface
originally received on, the source MAC of the router which sent us the request,
or anything uniquely identifying the request as coming from a specific
path.
	</para>
	<para>
The two problems above (not wanting to know the IP routing, or not being
able to IP route) are not mutually exclusive. If you have one then the
other is quite likely to occur.
	</para>
	</blockquote>
	<para>
Here's Henrik's announcement and the replies.
	</para>
	<para>
Henrik Nordstrom 14 Feb 2001
	</para>
	<blockquote>
		<para>
Here is a small patch to make LVS keep the MARK,
and have return traffic inherit the mark.
		</para>
		<para>
We use this for routing purposes on a multihomed LVS server, to have
return traffic routed back the same way as from where it was received.
What we do is that we set the mark in the iptables mangle chain
depending on source interface, and in the routing table use this mark to
have return traffic routed back in the same (opposite) direction.
		</para>
		<para>
The patch also moves the priority of the LVS INPUT hook back to in front of the
iptables filter hook, so as to be able to filter the traffic not picked
up by LVS but matching its service definitions. We are not
(yet) interested in filtering traffic to the virtual servers, but very
interested in filtering what traffic reaches the Linux LVS-box itself.
		</para>
	</blockquote>
	<para>
Julian - who uses NFC_ALTERED ?
	</para>
	<blockquote>
Netfilter. The packet is accepted by the hook but altered (mark changed).
	</blockquote>
	<para>
Julian -
Give us an example (with dummy addresses) for setup that require
such fwmark assignments.
	</para>
	<blockquote>
		<para>
For a start you need an LVS setup with more than one real interface receiving
client traffic for this to be of any use. Some clients (due to routing
outside the LVS server) come in on one interface, other clients on another
interface. In this setup you might not want to have an equally complex routing
table on the actual LVS server itself.
		</para>
		<para>
Regarding iptables/ipvs I currently "only" have three main issues.
		</para>
		<itemizedlist>
			<listitem>
As the "INPUT" traffic bypasses most normal routes, the iptables conntrack
will get quite confused by return traffic.
			</listitem>
			<listitem>
	Sessions will be tracked twice. Both by iptables conntrack and by IPVS.
			</listitem>
			<listitem>
There is no obvious choice whether IPVS LOCAL_IN should be placed before or after the
iptables filter hook. Having it after enables the use of many fancy iptables
options, but instead requires one to have rules in iptables for allowing ipvs
traffic, and any mismatches (either in rulesets or IPVS operation) will cause the
packets to actually hit the IP interface of the LVS server which in most cases is
not what was intended.
			</listitem>
		</itemizedlist>
	</blockquote>
	</section>
	<section id="SH-scheduler" xreflabel="SH scheduler">
	<title>SH scheduler</title>
	<para>
Using the SH (source hash) scheduler, 
the realserver is selected using a hash of the CIP.
Thus all connect requests from a particular client will go to the same realserver.
Scheduling based on the client IP
should solve some of the problems that currently require persistence
(<emphasis>i.e.</emphasis> having a client always go to the same realserver).
	</para>
	<para> 
Other than the few comments here, no-one has used the -sh scheduler.
The SH scheduler was originally intended for directors with multiple firewalls,
with the balancing based on hashes of the MAC address of the firewall
and this is how it was written up.
Since no-one was balancing on the MAC address of the firewall,
the SH scheduler lay dormant for many years, 
till someone on the mailing list figured out that it could do other things too. 
	</para>
	<para>
It turns out that address hashing is a standard method 
of keeping the client on the same
server in a load balanced server setup.
Willy Tarreau (in 
<ulink url="http://1wt.eu/articles/2006_lb/">
http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing</ulink>)
discusses address hashing (in the section "selecting the best server")
to prevent loss of session data with SSL connections in loadbalanced servers,
by keeping the client on the same server.
	</para>
	<para>
Here's Wensong's announcement:
	</para>
	<para>
Wensong Zhang <emphasis>wensong (at) gnuchina (dot) org</emphasis> 16 Feb 2001
	</para>
	<para>
Please see "man ipvsadm" for a short description of the DH and SH
schedulers. I think some examples of how to use those two schedulers would help.
Example: Firewall Load Balancing
	</para>
<programlisting><![CDATA[
                      |-- FW1 --|
  Internet ----- SH --|         |-- DH -- Protected Network
                      |-- FW2 --|
]]></programlisting>
	<para>
Make sure that the firewall boxes are added in the load balancers in the
same order. Then, if the request packets of a session are sent to a firewall,
<emphasis>e.g.</emphasis> FW1, the DH can forward the response packets
from the protected network to FW1 too.
However, I don't have enough hardware to test this setup myself. 
Please let me know if any of you make it work for you. :)
	</para>
	<para>
For initial discussions on the -dh and -sh scheduler see on the mailing list
under "some info for DH and SH schedulers" and "LVS with mark tracking".
	</para>
		<section id="testing_sh">
		<title>Testing the SH scheduler</title>
		<para>
The SH scheduler schedules by client IP.
Thus if you test from one client only,
all connections will go to a single realserver
in the <command>ipvsadm</command> table.
		</para>
		<para>
Rached Ben Mustapha <emphasis>rached (at) alinka (dot) com</emphasis> 15 May 2002
		</para>
		<blockquote>
			<para>
It seems there is a problem with the SH scheduler and local node feature.
I configured my LVS director (node-102) in direct routing mode on a 2.4.18 linux
kernel with ipvs 1.0.2. The realservers are set up accordingly.
			</para>
<programlisting><![CDATA[
root@node-102# ipvsadm --version
ipvsadm v1.20 2001/11/04 (compiled with popt and IPVS v1.0.2)

root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0
]]></programlisting>
			<para>
With this configuration, it's ok. Connections from different clients are
load-balanced on both servers.
Now I add the director:
			</para>
<programlisting><![CDATA[
root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 127.0.0.1                    Route   1      0          0
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0
]]></programlisting>
			<para>
All new connections, whatever the client's IP, go to the director.
And with this config:
			</para>
<programlisting><![CDATA[
root@node-102# ipvsadm -Ln
IP Virtual Server version 1.0.2 (size=65536)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.0.0.50:23 sh
  -> 192.168.32.103:23            Route   1      0          0
  -> 192.168.32.101:23            Route   1      0          0
  -> 127.0.0.1
]]></programlisting>
			<para>
Now all new connections, whatever the client's IP, go to node-103.
So it seems that with the localnode feature, the scheduler always chooses the
first entry in the redirection rules.
			</para>
		</blockquote>
		<para>
Wensong
		</para>
		<para>
There is no problem in SH scheduler and localnode feature.
		</para>
		<para>
I reproduced your setup. I issued
requests from two different clients, and all the requests were sent to the
localnode. Then I issued requests from a third client, and the requests
were sent to the other server. Please see the result.
		</para>
<programlisting><![CDATA[
[root@dolphin /root]# ipvsadm -ln
IP Virtual Server version 1.0.2 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 sh
  -> 172.26.20.72:80              Route   1      2          0
  -> 127.0.0.1:80                 Local   1      0          858
]]></programlisting>
		<para>
I don't know how many clients were used in your test. You know that the SH
scheduler is a static mapping algorithm (based on the source IP address
of the clients). It is quite possible that two or more client IP addresses are
mapped to the same server.
		</para>
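		<para>
The static mapping Wensong describes can be illustrated with a small Python model.
This is a sketch under assumptions, not the kernel's code:
the 256-bucket table, the multiplicative hash constant and the names
<emphasis>build_table</emphasis> and <emphasis>pick_realserver</emphasis>
are all chosen here for illustration.
It shows that a given client IP always maps to the same realserver,
and that with only a handful of test clients it is easy for all of them
to land on one realserver.
		</para>
<programlisting><![CDATA[
import ipaddress

TABLE_SIZE = 256  # assumed number of hash buckets (illustrative)

def build_table(realservers):
    """Fill the buckets by cycling through the realserver list in order."""
    return [realservers[i % len(realservers)] for i in range(TABLE_SIZE)]

def pick_realserver(table, client_ip):
    """Map a client IP to a bucket; the same IP always gives the same bucket."""
    addr = int(ipaddress.IPv4Address(client_ip))
    # Multiplicative hash; the constant is an assumption for this sketch.
    bucket = (addr * 2654435761) % TABLE_SIZE
    return table[bucket]

table = build_table(["192.168.32.103", "192.168.32.101"])
for cip in ("10.0.0.1", "10.0.0.2", "10.0.0.3"):
    print(cip, "->", pick_realserver(table, cip))
]]></programlisting>
		<para>
Because the table is filled in list order, the order of the entries matters
in this model, which is consistent with Wensong's advice to add the firewalls
to both load balancers in the same order.
		</para>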
		</section>
		<section id="sh_weight">
		<title>"weight" is really the number of connections for the SH scheduler</title>
		<para>
Con Tassios <emphasis>ct (at) swin (dot) edu (dot) au</emphasis> 7 Jun 2006 
		</para>
		<para>
Source hashing with a weight of 1 (the default value for the other schedulers)
will result in the service being overloaded when the
number of connections is greater than 2, as the output of ipvsadm shows.
You should increase the weight.
The weight, when used with SH and DH, has a different meaning than with most, if not
all, of the other standard LVS scheduling methods, although this doesn't appear
to be mentioned in the man page for ipvsadm.
		</para>
<programlisting><![CDATA[
From ip_vs_sh.c

The sh algorithm is to select server by the hash key of source IP
address. The pseudo code is as follows:

      n <- servernode[src_ip];
      if (n is dead) OR
         (n is overloaded, such as n.conns>2*n.weight) then
                return NULL;

      return n;
]]></programlisting>
		<note>
Joe: if the weight is 1, then NULL is returned when the number of connections is > 2.
Once the number of connections exceeds twice the weight, no more connections are allowed.
		</note>
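		<para>
The overload test in the pseudo code above can be written out as a few lines of Python.
This is only a model of the quoted pseudo code;
the <emphasis>RealServer</emphasis> class and the <emphasis>sh_select</emphasis>
function are hypothetical names used for illustration and are not part of LVS.
		</para>
<programlisting><![CDATA[
from dataclasses import dataclass
from typing import Optional

@dataclass
class RealServer:
    name: str
    weight: int
    conns: int = 0
    alive: bool = True

def sh_select(server: RealServer) -> Optional[RealServer]:
    """Return the hashed server, or None if it is dead or overloaded."""
    if not server.alive:
        return None
    if server.conns > 2 * server.weight:  # overload test from the pseudo code
        return None
    return server

# With weight 1 the third connection already trips the overload test.
print(sh_select(RealServer("rs1", weight=1, conns=2)) is not None)
print(sh_select(RealServer("rs1", weight=1, conns=3)) is not None)
]]></programlisting>
		<para>
With a weight of 200, as in Martijn's configuration below, the same test only
rejects a realserver once it holds more than 400 connections.
		</para>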
		<para>
Martijn Grendelman <emphasis>martijn (at) grendelman (dot) net</emphasis> 09 Jun 2006 
		</para> 
		<para>
That would explain the things I saw.
In the meantime, I went back to a configuration using the SH scheduler 
now with a weight for both real servers of 200, instead of 1, 
and things seem to run fine.
		</para>
<programlisting><![CDATA[
martijn@tweety:~> rr ipvsadm -L
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  212.204.230.98:www sh
  -> tweety.sipo.nl:www           Local   200    25         44
  -> daffy.sipo.nl:www            Route   200    12         27
TCP  212.204.230.98:https sh persistent 360
  -> tweety.sipo.nl:https         Local   100    0          0
  -> daffy.sipo.nl:https          Route   100    0          0
]]></programlisting>
		<blockquote>
Joe: Since the SH scheduler sends a client's packets to the same realserver, 
I had thought that it should completely replace persistence. 
However you're using persistence with SH, 
so apparently SH doesn't handle keeping the client on the realserver as I expect. 
So why are you using persistence?
		</blockquote>
		<para>
Ehr.. no reason, I guess. 
It's still there from when I used RR scheduling and I guess I forgot to remove it. 
I don't think it is actually useful.
		</para>
		</section>
		<section id="SH-failout">
		<title>SH failout</title>
		<para>
Shutting down an SH realserver by changing the weight to 0
(as is done for the other schedulers)
still allows connection requests to be sent to that realserver
(you'll get a failed connection if the realserver is down).
This seems to be a result of the different meaning of weight for the SH scheduler.
No new sources will be allowed to initiate connections, 
but all connections from known sources will still be forwarded,
and all known sources will be allowed to initiate connections.
To stop connection requests being forwarded to a realserver,
you have to remove the realserver from the ipvsadm table.
You may have to break current connections to do this :-( 
		</para>
		<para>
Thomas Pedoussaut <emphasis>thomas (at) pedoussaut (dot) com</emphasis> 16 Oct 2008
		</para>
		<para>
Setting the weight to 0 means no more new connections, but existing ones (in the SH
way) will still hit the realserver.
You need to have it properly removed.
		</para>
		</section>
	</section>
	<section id="ActiveConn" xreflabel="ActConn">
	<title>What is an ActiveConn/InActConn (Active/Inactive) connection?</title>
	<para>
The output of <command>ipvsadm</command> lists connections, either as
	</para>
	<itemizedlist>
		<listitem>
ActiveConn - in ESTABLISHED state
		</listitem>
		<listitem>
InActConn - any other state
		</listitem>
	</itemizedlist>
	<para>
With LVS-NAT,
the director sees all the packets between the client and the realserver,
so always knows the state of tcp connections and the listing from
<command>ipvsadm</command> is accurate.
However for LVS-DR, LVS-Tun, the director does not see the packets
from the realserver to the client.
Termination of the tcp connection occurs by one of the ends sending a FIN
(see W. Richard Stevens, TCP/IP Illustrated Vol 1, ch 18, 1994,
pub Addison Wesley) followed by reply ACK from the other end.
Then the other end sends its FIN, followed by an ACK from the first machine.
If the realserver initiates termination of the connection,
the director will only be able to infer that this has
happened from seeing the ACK from the client.
In either case the director has to infer that the connection
has closed from partial information and uses its own
table of timeouts to declare that the connection has terminated.
Thus the count in the InActConn column for LVS-DR, LVS-Tun is
inferred rather than real.
	</para>
	<para>
Entries in the ActiveConn column come from
	</para>
	<itemizedlist>
		<listitem>
service with an established connection.
Examples of services which hold connections in the ESTABLISHED state
for long enough to see with <command>ipvsadm</command> are telnet and ftp (port 21).
		</listitem>
	</itemizedlist>
	<para>
Entries in the InActConn column come from
	</para>
	<itemizedlist>
		<listitem>
		<para>
Normal operation
		</para>
		<itemizedlist>
			<listitem>
Services like http (in non-persistent <emphasis>i.e.</emphasis> HTTP/1.0 mode)
or ftp-data (port 20)
which close the connection as soon as the hit/data (html page, or gif etc)
has been retrieved (&lt;1 sec).
You're unlikely to see anything
in the ActiveConn column with these LVS'ed services.
You'll see an entry in the InActConn
column until the connection times out.
If you're getting 1000 connections/sec and
it takes 60 secs for the connection to time out (the normal timeout),
then you'll have 60,000 InActConns.
This number of InActConn is quite normal.
If you are running an e-commerce site with 300 secs of persistence,
you'll have 300,000 InActConn entries.
Each entry takes 128 bytes (300,000 entries is about 40MB of memory;
make sure you have enough RAM for your application).
The number of ActiveConn might be very small.
			</listitem>
		</itemizedlist>
		</listitem>
		<listitem>
		<para>
Pathological Conditions (<emphasis>i.e.</emphasis> your LVS is not set up properly)
		</para>
		<itemizedlist>
			<listitem>
			<para>
identd delayed connections:
The 3 way handshake to establish a connection takes
only 3 exchanges of packets (<emphasis>i.e.</emphasis> it's quick on any
normal network) and you won't be quick enough with ipvsadm
to see the connection in the states before it becomes ESTABLISHED.
However if the service on the realserver is under
<xref linkend="LVS-HOWTO.authd"/>, you'll see an InActConn entry
during the delay period.
			</para>
			</listitem>
			<listitem>
			<para>
Incorrect routing
(usually the wrong default gw for the realservers):
			</para>
			<para>
In this case the 3 way handshake will never complete, the connection will hang,
and there'll be an entry in the InActConn column.
			</para>
			</listitem>
		</itemizedlist>
		</listitem>
	</itemizedlist>
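	<para>
The arithmetic behind the InActConn figures quoted above
(connection rate times timeout, at 128 bytes per entry)
can be checked with a few lines of Python.
The function name <emphasis>inactconn_estimate</emphasis> is made up for this sketch;
the rate, timeouts and entry size are the ones given in the text.
	</para>
<programlisting><![CDATA[
ENTRY_BYTES = 128  # size of one connection entry, as quoted in the text

def inactconn_estimate(conns_per_sec, timeout_secs):
    """Entries waiting to time out, and the memory they occupy in bytes."""
    entries = conns_per_sec * timeout_secs
    return entries, entries * ENTRY_BYTES

for timeout in (60, 300):
    entries, nbytes = inactconn_estimate(1000, timeout)
    print(f"{timeout:>3}s timeout: {entries:,} entries, {nbytes / 1e6:.1f} MB")
]]></programlisting>
	<para>
For 300 secs this gives 300,000 entries and 38,400,000 bytes,
which is the "about 40M" mentioned above.
	</para>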
	<para>
Usually the number of InActConn will be larger or very much larger than the number
of ActiveConn.
	</para>
	<para>
Here's an LVS-DR LVS, set up for ftp, telnet and http,
after telnetting from the client
(the client command line is at the telnet prompt).
	</para>
<programlisting><![CDATA[
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      0          0
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      1          0
-> RS1.mack.net:telnet              Route   1      0          0
]]></programlisting>
	<para>
showing the ESTABLISHED telnet connection (here to realserver RS2).
	</para>
	<para>
Here's the output of netstat -an | grep (appropriate IP) for the client and the
realserver, showing that the connection is in the ESTABLISHED state.
	</para>
<programlisting><![CDATA[
client:# netstat -an | grep VIP
tcp        0      0 client:1229      VIP:23           ESTABLISHED

realserver:# netstat -an | grep CIP
tcp        0      0 VIP:23           client:1229      ESTABLISHED
]]></programlisting>
	<para>
Here's immediately after the client logs out from the telnet session.
	</para>
<programlisting><![CDATA[
director# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      0          0
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      0          0
-> RS1.mack.net:telnet              Route   1      0          0

client:# netstat -an | grep VIP
#ie nothing, the client has closed the connection

#the realserver has closed the session in response
#to the client's request to close out the session.
#The telnet server has entered the TIME_WAIT state.
realserver:/home/ftp/pub# netstat -an | grep 254
tcp        0      0 VIP:23        CIP:1236      TIME_WAIT

#a minute later, the entry for the connection at the realserver is gone.
]]></programlisting>
	<para>
Here's the output after ftp'ing from the client and logging in,
but before running any commands (like <command>dir</command> or <command>get filename</command>).
	</para>
<programlisting><![CDATA[
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          0
-> RS1.mack.net:www                 Route   1      0          0
TCP  lvs2.mack.net:0 rr persistent 360
-> RS1.mack.net:0                   Route   1      1          1
TCP  lvs2.mack.net:telnet rr
-> RS2.mack.net:telnet              Route   1      0          0
-> RS1.mack.net:telnet              Route   1      0          0

client:# netstat -an | grep VIP
tcp        0      0 CIP:1230      VIP:21        TIME_WAIT
tcp        0      0 CIP:1233      VIP:21        ESTABLISHED

realserver:# netstat -an | grep 254
tcp        0      0 VIP:21        CIP:1233      ESTABLISHED
]]></programlisting>
	<para>
The client opens 2 connections to the ftpd and leaves one open (the ftp prompt).
The other connection, used to transfer the user/passwd information,
is closed down after the login.
The entry in the <command>ipvsadm</command> table corresponding to the TIME_WAIT state
at the realserver is listed as InActConn.
If nothing else is done at the client's ftp prompt, the connection will
expire in 900 secs. Here's the realserver during this 900 secs.
	</para>
<programlisting><![CDATA[
realserver:# netstat -an | grep CIP
tcp        0      0 VIP:21        CIP:1233      ESTABLISHED
realserver:# netstat -an | grep CIP
tcp        0     57 VIP:21        CIP:1233      FIN_WAIT1
realserver:# netstat -an | grep CIP
#ie nothing, the connection has dropped

#if you then go to the client, you'll find it has timed out.
ftp> dir
421 Timeout (900 seconds): closing control connection.
]]></programlisting>
	<para>
HTTP/1.0 connections are closed immediately after retrieving the URL
(<emphasis>i.e.</emphasis> you won't see any ActiveConn in the <command>ipvsadm</command> table immediately
after the URL has been fetched).
Here's the output after retrieving a webpage from the LVS.
	</para>
<programlisting><![CDATA[
director:# ipvsadm
IP Virtual Server version 0.2.8 (size=4096)
Prot LocalAddress:Port Scheduler Flags
-> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
-> RS2.mack.net:www                 Route   1      0          1
-> RS1.mack.net:www                 Route   1      0          0

client:~# netstat -an | grep VIP

RS2:/home/ftp/pub# netstat -an | grep CIP
tcp        0      0 VIP:80        CIP:1238      TIME_WAIT
]]></programlisting>
		<section id="programmatically_accessing_activeconn">
		<title>Programmatically access ActiveConn, InActConn</title>
		<blockquote>
I want to get the active and inactive connections of one virtual
service in my program.
		</blockquote>
		<para>
Jeremy Kerr <emphasis>jeremy (at) redfishsoftware (dot) com (dot) au</emphasis> 12 Feb 2003
		</para>
		<para>
You have two options here:
		</para>
		<itemizedlist>
			<listitem>
Read the <filename>/proc/net/ip_vs</filename> file and parse it for the required numbers
			</listitem>
			<listitem>
Use <filename>libipvs</filename> (distributed with ipvsadm) to read the tables directly.
Take a look in ipvsadm.c for how this is done.
			</listitem>
		</itemizedlist>
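		<para>
The first option can be sketched in Python as follows.
The sample text below is an assumption about the layout of
<filename>/proc/net/ip_vs</filename> (IPv4 addresses and ports in hexadecimal,
counters in the last two columns); check it against the file on your own director,
since the layout can differ between kernel versions.
The addresses and counters in the sample are made up, and
<emphasis>parse_ip_vs</emphasis> and <emphasis>hex_to_addr</emphasis> are
hypothetical helper names.
		</para>
<programlisting><![CDATA[
import socket
import struct

# Assumed layout of /proc/net/ip_vs; values are invented for the example.
SAMPLE = """\
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port Forward Weight ActiveConn InActConn
TCP  0A000032:0017 sh
  -> C0A82067:0017      Route   1      3          5
  -> C0A82065:0017      Route   1      1          7
"""

def hex_to_addr(token):
    """Convert 'C0A82067:0017' to ('192.168.32.103', 23)."""
    ip_hex, port_hex = token.split(":")
    ip = socket.inet_ntoa(struct.pack("!I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)

def parse_ip_vs(text):
    """Return {(proto, vip, vport): [(rip, rport, active, inactive), ...]}."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        if fields[0] in ("TCP", "UDP"):
            current = (fields[0],) + hex_to_addr(fields[1])
            services[current] = []
        elif fields[0] == "->" and current is not None:
            rip, rport = hex_to_addr(fields[1])
            # The last two columns are ActiveConn and InActConn.
            services[current].append(
                (rip, rport, int(fields[-2]), int(fields[-1])))
    return services

for service, dests in parse_ip_vs(SAMPLE).items():
    print(service, dests)
]]></programlisting>
		<para>
On a director you would pass the contents of <filename>/proc/net/ip_vs</filename>
to the parser instead of the sample string.
This sketch only handles IPv4 TCP and UDP services; fwmark services are skipped.
		</para>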
		</section>
		<section>
		<title>ActiveConn/InActConn different for 2.4.x/2.6.x kernels</title>
		<para>
Ratz 15 Oct 2006 
		</para>
		<para>
IPVS has changed significantly between 2.4 and 2.6
with regard to the ratio of active/inactive connections.
We've seen that in our rrdtool/MRTG graphs as well.
In the 2.6.x kernels, at least for the (w)LC scheduler, the RS calculation is done differently.
On top of that, the TCP stack has changed tunables and your hardware also behaves differently.
IIRC the LVS state transition timeouts are different between 2.4.x and 2.6.x kernels, and so,
for example if you're using LVS-DR,
the active connection to passive connection transition takes more time,
thus yielding a potentially higher number of sessions in state ActiveConn.
		</para>
		</section>
		<section>
		<title>from the mailing list</title>
		<para>
Ty Beede wrote:
		</para>
		<blockquote>
I am curious about the implementation of the inactconns and
activeconns variables in the lvs source code.
		</blockquote>
		<para>
Julian
		</para>
<programlisting><![CDATA[
        Info about LVS <= 0.9.7
TCP
        active:         all connections in ESTABLISHED state
        inactive:       all connections not in ESTABLISHED state
UDP
        active:         0 (none) (LVS <= 0.9.7)
        inactive:       all (LVS <= 0.9.7)

        active + inactive = all
]]></programlisting>
		<para>
        Look in this table for the used timeouts for each
protocol/state:
		</para>
		<para>
/usr/src/linux/net/ipv4/ip_masq.c, masq_timeout_table
		</para>
		<para>
        For LVS-Tun and LVS-DR the TCP states are changed checking only
the TCP flags from the incoming packets. For these methods UDP entries can
expire (5 minutes?) if only the realservers send packets and there are
no packets from the client.
		</para>
		<para>
        For info about the TCP states: <filename>/usr/src/linux/net/ipv4/tcp.c</filename>,
<filename>rfc793.txt</filename>
		</para>
		<para>
Jean-francois Nadeau <emphasis>jf (dot) nadeau (at) videotron (dot) ca</emphasis>
		</para>
		<blockquote>
		<para>
I've done some testing (netmon) on this and here are my observations:
		</para>
		<para>
1. A connection becomes active when LVS sees the ACK flag in the TCP header
incoming to the cluster, <emphasis>i.e.</emphasis> when the socket gets established on the
realserver.
		</para>
		<para>
2. A connection becomes inactive when LVS sees the ACK-FIN flag in the TCP
header incoming to the cluster. This does NOT correspond to the socket
closing on the realserver.
		</para>
		<para>
Example with my Apache Web server.
		</para>
<programlisting><![CDATA[
Client  	<---> Server

A client requests an object on the web server on port 80 :

SYN REQUEST     ---->
SYN ACK 	<----
ACK             ----> *** ActiveConn=1 and 1 ESTABLISHED socket on realserver.
HTTP get        ----> *** The client requests the object
HTTP response   <---- *** The server sends the object
APACHE closes the socket : *** ActiveConn=1 and 0 ESTABLISHED socket on realserver
The CLIENT receives the object. (took 15 seconds in my test)
ACK-FIN         ----> *** ActiveConn=0 and 0 ESTABLISHED socket on realserver
]]></programlisting>
		<para>
Conclusion: ActiveConn is the number of active CLIENT connections,
not the number on the server, in the case of short transmissions (like objects on a web page).
It's hard to calculate a server's capacity based on this
because slower clients make ActiveConn greater than
what the server is really processing.
You won't be able to reproduce that effect on a LAN,
because the client receives the segment too fast.
		</para>
		<para>
In the LVS mailing list, many people explained that the correct way
to balance the connections is to use monitoring software.
The weights must be evaluated using values from the realserver.
In LVS-DR and LVS-Tun, the Director can be easily fooled
with invalid packets for some period
and this can be enough to unbalance the cluster when using "*lc" schedulers.
		</para>
		<para>
I reproduced the effect by connecting at 9600 bps
and getting a 100k gif from Apache,
while monitoring established sockets on port 80
on the realserver and <command>ipvsadm</command> on the cluster.
		</para>
		</blockquote>
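		<para>
Jean-Francois's two observations can be modelled in a few lines of Python.
This is only a toy model of what he reports for LVS-DR, not the LVS state machine:
the director is assumed to see only client-to-server packets,
and the function name <emphasis>director_view</emphasis> is made up.
It shows ActiveConn staying at 1 until the client's FIN arrives,
however early the realserver closed its socket.
		</para>
<programlisting><![CDATA[
def director_view(client_packets):
    """Return the ActiveConn value (0 or 1) after each client packet."""
    active = 0
    history = []
    for flags in client_packets:
        if "FIN" in flags:
            active = 0   # observation 2: the client's FIN makes it inactive
        elif "ACK" in flags:
            active = 1   # observation 1: the client's ACK makes it active
        # a bare SYN leaves the connection counted as inactive
        history.append(active)
    return history

# Only packets from the client; the server's SYN-ACK, data and FIN
# never pass through the director in this model.
packets = [
    {"SYN"},
    {"ACK"},            # end of the 3 way handshake
    {"ACK", "PSH"},     # HTTP get
    {"ACK"},            # slow client still acknowledging data
    {"FIN", "ACK"},     # client finally closes
]
print(director_view(packets))
]]></programlisting>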
		<para>
Julian
		</para>
		<para>
You are probably using LVS-DR or LVS-Tun in your test.
Right?
Using these methods, the LVS is changing the TCP state
based on the incoming packets, <emphasis>i.e.</emphasis> from the clients.
This is the reason that the Director can't see the FIN packet
from the realserver.
This is the reason that LVS can be easily SYN flooded,
even flooded with ACK following the SYN packet.
The LVS can't change the TCP state according
to the state in the realserver.
This is possible only for VS/NAT mode.
So, in some situations you can have invalid entries
in ESTABLISHED state which do not correspond to the connections
in the realserver, which effectively ignores these SYN packets using cookies.
VS/NAT looks like the better solution against SYN flood attacks.
Of course, the ESTABLISHED timeout can be changed to 5 minutes for example.
Currently, the max timeout interval
(excluding the ESTABLISHED state) is 2 minutes.
If you think that you can serve the clients using a
smaller timeout for the ESTABLISHED state,
when under "ACK after SYN" attack, you can change it with ipchains.
You don't need to change it under 2 minutes in LVS 0.9.7.
In the last LVS version SYN+FIN switches the state to TIME_WAIT,
which can't be controlled using ipchains.
In other cases,
you can change the timeout for the ESTABLISHED and FIN-WAIT states.
But you can change it only down to 1 minute.
If this doesn't help, buy 2GB RAM or more for the Director.
		</para>
		<para>
One thing that can be done, but this may be paranoia:
		</para>
		<para>
change the INPUT_ONLY table:
		</para>
<programlisting><![CDATA[
from:
           FIN
        SR ---> TW
to:
           FIN
        SR ---> FW
]]></programlisting>
		<para>
        OK, this is an incorrect interpretation of the TCP states
but this is a hack which allows the min state timeout to be
1 minute. Now using ipchains we can set the timeout for all
TCP states to 1 minute.
		</para>
		<para>
        If this is changed you can now set ESTABLISHED and
FIN-WAIT timeouts down to 1 minute. In current LVS version
the min effective timeout for ESTABLISHED and FINWAIT state
is 2 minutes.
		</para>
		<para>
Jean-Francois Nadeau <emphasis>jf (dot) nadeau (at) videotron (dot) ca</emphasis>
		</para>
		<blockquote>
		<para>
I'm using DR on the cluster with 2 realservers.  I'm trying to control the
number of connections to achieve this :
		</para><para>
The cluster in normal mode balances requests on the 2 realservers.
If the realservers reach a point where they can't serve clients fast
enough, a new entry with a weight of 10000 is entered in LVS to send the
overflow locally to a web server with a static web page saying "we're too busy".
It's a cgi that intercepts 'deep links' in our site and returns a predefined page.
A 600 second persistence ensures that already connected clients stay on the
server they began to browse.  The client only has to hit refresh until the
number of ActiveConns (I hoped) on the realservers gets lower and the overflow
entry gets deleted.
		</para><para>
Got the idea... Load balancing with overflow control.
		</para>
		</blockquote>
		<para>
Julian
		</para>
		<para>
        Good idea. But LVS can't help you. When the clients are
	redirected to the Director they stay there for 600 seconds.
		</para>
		<blockquote>
		<para>
But when we activate the local redirection of requests due to overflow,
ActiveConn continues to grow in LVS, while InActConn decreases as expected.
So the load on the realserver gets OK... but LVS doesn't see it and doesn't let
new clients in. (It takes 12 minutes before ActiveConn decreases enough to
reopen the site.)
		</para><para>
I need a way, a value to check, that says the server is
overloaded and to begin redirecting locally, and the opposite.
		</para><para>
I know that seems a little complicated....
		</para>
		</blockquote>
		<para>
Julian
		</para>
		<para>
       What about trying to:
		</para>
		<itemizedlist>
			<listitem>
			<para>
use persistent timeout 1 second for the virtual service.
			</para>
			<para>
        If you have one entry for this client, all entries
from this client go to the same realserver. I haven't tested it, but
maybe a client will load the whole web page. If the server is
overloaded, the next web page will be "we're too busy".
			</para>
			</listitem>
			<listitem>
			<para>
switch the weight for the Director between 0 and 10000. Don't
delete the Director as realserver.
			</para>
			<para>
        Weight 0 means "No new connections to the server". You
have to play with the weight for the Director, for example:
			</para>
			<itemizedlist>
			<listitem>
			<para>
if your realservers are loaded near 99% set the weight to 10000
			</para>
			</listitem>
			<listitem>
			<para>
if your realservers are loaded below 95% set the weight to 0
			</para>
			</listitem>
			</itemizedlist>
			</listitem>
		</itemizedlist>
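		<para>
(A minimal sketch of the weight switching suggested above, untested on a live LVS.
The function name <command>overflow_weight</command>, the way the load percentage
is obtained, and the use of 127.0.0.1 as the director's own overflow entry are all
assumptions made for illustration. Between the two limits the weight is left
alone, so the overflow entry doesn't flap.)
		</para>
<programlisting><![CDATA[
# overflow_weight LOAD_PERCENT CURRENT_WEIGHT
# prints the new weight for the director's overflow entry
overflow_weight() {
    if [ "$1" -ge 99 ]; then
        echo 10000          # realservers saturated: send new clients to the director
    elif [ "$1" -lt 95 ]; then
        echo 0              # realservers have room again: no new clients to the director
    else
        echo "$2"           # between the limits: leave the weight as it is
    fi
}

# applying it on the director (hypothetical variables $VIP, $load, $weight):
# ipvsadm -e -t $VIP:80 -r 127.0.0.1:80 -w "$(overflow_weight "$load" "$weight")"
]]></programlisting>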
		<para>
Jean-Francois Nadeau <emphasis>jf (dot) nadeau (at) videotron (dot) ca</emphasis>
		</para>
		<blockquote>
		<para>
Will a weight of 0 redirect traffic to the other realservers
(persistency remains ?)
		</para>
		</blockquote>
		<para>
Julian
		</para>
		<para>
If the persistent timeout is small, I think.
		</para>
		<blockquote>
			<para>
I can't get rid of the 600 seconds persistency 
because we run a transactional engine. 
<emphasis>i.e.</emphasis> if a client begins on a realserver, 
he must complete the transaction on that server or get an error 
(transactional contexts are stored locally).
			</para>
		</blockquote>
		<para>
Such a timeout can't help to redirect the clients back to the
realservers.
		</para>
		<para>
You can check the free ram or the cpu idle time for the
realservers. In this way you can correctly set the weights for
the realservers and switch the weight for the Director.
		</para>
		<para>
These recommendations can be completely wrong. I've never
tested them. If they don't help, try setting httpd.conf:MaxClients
to some reasonable value. Why not put the Director in as a
realserver permanently? With 3 realservers it is better.
		</para>
		<para>
Jean
		</para>
		<blockquote>
			<para>
	Those are already optimized; the bottleneck is when 1500 clients try our site
	in less than 5 minutes.....
			</para>
			<para>
	One of our people has suggested that the realservers check their own state (via
	TCP in use given by sockstat) and command the director to redirect traffic
	when needed.
			</para>
			<para>
	Can you explain in more detail why the number of ActiveConn on the realserver
	continues to grow while redirecting traffic locally with a weight of 10000 (and
	InActConn on that realserver decreases normally)?
			</para>
		</blockquote>
		<para>
Julian
		</para>
		<para>
Only the new clients are redirected to the Director at this
moment. Where the active connections continue to grow, in the real
servers or in the Director (weight=10000)?
		</para>
		</section>
		<section id="calculating_activeconn">
		<title>How is ActiveConn/InActConn calculated?</title>
		<para>
<blockquote><para>
Joe, 14 May 2001
			</para><para>
according to the <command>ipvsadm</command> man page, for "lc" scheduling, the
new connections are assigned according to the number of
"active connections". Is this the same as "ActiveConn" in the
output of ipvsadm?
If the number of "active connections" used to determine the
scheduling is "ActiveConn", then for services which don't
maintain connections (<emphasis>e.g.</emphasis> http or UDP services),
the scheduler won't have much information,
just "0" for all realservers?
</para></blockquote>
			</para><para>
Julian, 14 May and 23 May
			</para><para>
It is a counter and it is incremented when a new connection is
created. The formula is:
			</para><para>
<programlisting><![CDATA[
active connections = ActiveConn * K + InActConn
]]></programlisting>
			</para><para>
where K can be 32 to 50 (I don't remember the last used value),
so it is not only the active conns (which would break UDP).
			</para><para>
<blockquote><para>
Is "active connections" incremented if the client re-uses a port?
</para></blockquote>
			</para><para>
No, the reused connections are not counted.
		</para>
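		<para>
(A worked example of the formula above, as a sketch only. K=50 is an assumption
taken from the top of the range Julian quotes, and the function name
<command>lc_overhead</command> is made up for illustration. It shows why a service
with no long-lived connections still gives the scheduler something to work with:
the InActConn term is never thrown away.)
		</para>
<programlisting><![CDATA[
# lc_overhead ACTIVECONN INACTCONN [K]
# prints the value the "lc" scheduler would compare between realservers
lc_overhead() {
    echo $(( $1 * ${3:-50} + $2 ))
}

lc_overhead 10 900      # 10 active, 900 inactive, K=50
lc_overhead 0 994 32    # http-like load: no active conns, K=32
]]></programlisting>
		<para>
With these made-up numbers the first realserver scores 1400 and the second 994,
so the second would get the next connection.
		</para>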
		</section>
		<section id="ActiveConn_netstat_not_matching">
		<title>ActiveConn is a guess for LVS-DR</title>
		<para>
For LVS-DR, the director doesn't see the return packets 
and uses tables of timeouts to guess a likely state of the service at the realserver.
For the same reason you can't do stateful filtering on the director for LVS-DR controlled
packets.
		</para>
		<para>
barrywong <emphasis>barrywong (at) sina (dot) com</emphasis> 30 Aug 2008
		</para>
		<blockquote>
			<para>
I'm using  DR wlc persistent 120 
			</para>
<programlisting><![CDATA[
# ipvsadm -Ln
IP Virtual Server version 1.2.0 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn

TCP  xx.xx.xx.xx:80 wlc persistent 120
  -> xx.xx.xx.x1:80             Route   1      6459       7057
  -> xx.xx.xx.x2:80            Route   1      6446       4766

# netstat -n
realserver ESTABLISHED
xx.xx.xx.x1:80 4210 ESTABLISHED
xx.xx.xx.x2:80 4483 ESTABLISHED
]]></programlisting>
			<para>
The number of ESTABLISHED connections on the realservers is less than the LVS ActiveConn number.
Why is this?
			</para>
		</blockquote>
		<para>
Thomas Pedoussaut <emphasis>thomas (at) pedoussaut (dot) com</emphasis>
		</para>
		<para>   
I guess your issue is that the persistence is low compared to your usage.
I've had similar numbers with a mysql setup. Basically, there were
hundreds of very-long-lasting connections that weren't carrying much
traffic, sometimes pausing for hours. They would disappear from
the LVS status but still be visible on the client and the server as
CONNECTED.
		</para>
		<para>
It's not really a big issue. Usually server affinity makes the resuming
packets go to the same server, so the connection can still be
used. If that isn't the case, there is enough code on the client side to
establish a new connection if that one fails. You'll still have
to face a problem with the server side connections that will be
lingering in a limbo state. I would consider setting some sort of
timeout on that side. I'm not 100% sure, but your realservers are
running squid on port 80, correct?
If so, please have a look at
http://www.squid-cache.org/Versions/v3/3.0/cfgman/read_timeout.html and
probably shorten it (or extend your LVS persistence to that value with
ipvsadm --set )

		</para>
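		<para>
(A sketch of the commands involved, with placeholder values. Note that
<command>ipvsadm --set</command> takes the three connection timeouts
(tcp, tcpfin, udp, in seconds), while the persistence timeout is the argument
to <command>-p</command> on the virtual service. $VIP and the numbers below are
assumptions for illustration, not recommendations.)
		</para>
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm -L --timeout            # show the current tcp tcpfin udp timeouts
director:/etc/lvs# ipvsadm --set 900 120 300       # set tcp, tcpfin and udp timeouts (seconds)
director:/etc/lvs# ipvsadm -E -t $VIP:80 -s wlc -p 600   # change the persistence timeout of the service
director:/etc/lvs# ipvsadm -L -n -c                # list the connection entries and their expiry
]]></programlisting>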
		</section>
	</section>
	<section id="faq_entries_in_inactconn">
	<title>FAQ: ipvsadm shows entries in InActConn, but none in ActiveConn, connection hangs. What's wrong?</title>
	<para>
The usual mistake is to have the default gw for the realservers set incorrectly.
	</para>
	<itemizedlist>
		<listitem>
		LVS-NAT: the default gw <emphasis>must</emphasis> be the director.
There <emphasis>cannot</emphasis> be any other path from the realservers to the client,
except through the director.
		</listitem>
		<listitem>
LVS-DR, LVS-Tun: the default gw <emphasis>cannot</emphasis> be the director -
use some local router.
		</listitem>
	</itemizedlist>
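	<para>
The two rules above can be written as a quick check to run on a realserver.
This is a sketch only: the function name <command>check_gw</command> is made up,
and the director address you pass in is assumed to be the one the realserver
would route through (the DIP).
	</para>
<programlisting><![CDATA[
# check_gw MODE GATEWAY DIRECTOR_IP     (MODE is nat, dr or tun)
# prints "ok" or a one line diagnostic
check_gw() {
    case "$1" in
        nat)
            if [ "$2" = "$3" ]; then echo ok
            else echo "LVS-NAT: default gw must be the director ($3)"; fi ;;
        dr|tun)
            if [ "$2" != "$3" ]; then echo ok
            else echo "LVS-DR/LVS-Tun: default gw must not be the director"; fi ;;
        *)
            echo "unknown forwarding method: $1" ;;
    esac
}

# on a realserver with iproute2, the current default gw can be read with:
# gw=$(ip route show default | awk '/^default/ {print $3; exit}')
# check_gw nat "$gw" 192.168.1.9
]]></programlisting>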
	<para>
Setting up an LVS by hand is tedious.
You can use the configure script
which will trap most errors in setup.
	</para>
	</section>
	<section id="faq_initial_connection_delayed">
	<title>FAQ: initial connection is delayed, but once connected everything is fine. What's wrong?</title>
	<para>
Usually you have problems with <xref linkend="LVS-HOWTO.authd"/>.
Simplest thing is to stop your service from calling the identd server
on the client (<emphasis>i.e.</emphasis> disconnect your service from identd).
	</para></section>
	<section id="reusing_ports">
	<title>unbalanced realservers: does rr and lc weighting equally distribute the load? - clients reusing ports</title>
	<para>
(also see <xref linkend="polygraph"/> in the performance section.)
	</para>
	<para>
I ran the <ulink url="http://www.polygraph.org/">polygraph</ulink> simple.pg
test on a LVS-NAT LVS with 4 realservers using rr scheduling.
Since the responses from the realservers should average out,
I would have expected the number of connections and the load average
to be equally distributed over the realservers.
	</para>
	<para>
Here's the output of <command>ipvsadm</command> shortly after the number of connections
had reached steady state (about 5 mins).
	</para>
<programlisting><![CDATA[
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> RS4.mack.net:polygraph         Masq    1      0          883
  -> RS3.mack.net:polygraph         Masq    1      0          924
  -> RS2.mack.net:polygraph         Masq    1      0          1186
  -> RS1.mack.net:polygraph         Masq    1      0          982
]]></programlisting>
	<para>
The servers were identical hardware. I expect (but am not sure)
that the utils/software on the machines is identical (I set up
RS3,RS4 about 6 months after RS1,RS2).
RS2 was running 2.2.19,
while the other 3 machines were running 2.4.3 kernels.
The number of connections (all in TIME_WAIT) at the realservers
was different for each (otherwise apparently identical) realserver
and was in the range 450-500 for the 2.4.3 machines and 1000 for the
2.2.19 machine (measured with netstat -an | grep $polygraph_port |wc )
and varied about 10&percnt; over a long period.
	</para>
	<para>
This run had been done immediately after another run and InActConn
had not been allowed to drop to 0.
Here I repeated this run, after first waiting for InActConn to drop to 0
	</para>
<programlisting><![CDATA[
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph rr
  -> RS4.mack.net:polygraph         Masq    1      0          994
  -> RS3.mack.net:polygraph         Masq    1      0          994
  -> RS2.mack.net:polygraph         Masq    1      0          994
  -> RS1.mack.net:polygraph         Masq    1      1          992
TCP  lvs2.mack.net:netpipe rr
]]></programlisting>
	<para>
RS2 (the 2.2.19 machine) had 900 connections in TIME_WAIT while
the other (2.4.3) machines were 400-600. RS2 was also delivering
about 50&percnt; more hits to the client.
	</para>
	<para>
Repeating the run using &quot;lc&quot; scheduling, the InActConn remains
constant.
	</para>
<programlisting><![CDATA[
IP Virtual Server version 0.2.12 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:polygraph lc
  -> RS4.mack.net:polygraph         Masq    1      0          994
  -> RS3.mack.net:polygraph         Masq    1      0          994
  -> RS2.mack.net:polygraph         Masq    1      0          994
  -> RS1.mack.net:polygraph         Masq    1      0          993
]]></programlisting>
	<para>
The number of connections (all in TIME_WAIT)
at the realservers did not change.
	</para>
	<para>
I've been running the polygraph simple.pg test over the weekend
using rr scheduling on what (AFAIK) are 4 identical realservers
in a LVS-NAT LVS. There are no ActiveConn and a large number of
InActConn. Presumably the client makes a new connection for each
request.
	</para>
	<para>
Julian (I think)
	</para>
	<blockquote>
The implicit persistence of TCP connection reuse can cause
such side effects even for RR.
When the setup includes a small number of hosts and the request rate is
high enough for the client's ports to be reused, LVS detects the existing
connections and new connections are not created. This is the reason
you can see some of the realservers not being used at all, even for a method
such as RR.
	</blockquote>
	<para>
the client is using ports from 1025-4999 (has about 2000 open
at one time) and it's not going above the 4999 barrier. ipvsadm
shows a constant InActConn of 990-995 for all realservers,
but the number of connections on each of the realservers (netstat -an)
ranges from 400-900.
	</para>
	<para>
So if the client is reusing ports (I thought you always incremented
the port by 1 till you got to 64k and then it rolled over again),
LVS won't create a new entry in the hash table if the old one
hasn't expired?
	</para>
	<blockquote>
	Yes, it seems you have (5000-1024) connections that never expire
in LVS.
	</blockquote>
	<para>
Presumably because the director doesn't know the number of connections
at the realservers (it only has the number of entries in its tables),
and because even apparently identical realservers aren't identical
(the hardware here is the same, but I set them up at different times,
presumably not all the files and time outs are the same), the throughput
of different realservers may not be the same.
	</para>
	<para>
The apparent unbalance in the number of InActConn then is a combination
of some clients reusing ports and the director's method of estimating
the number of connections, which makes assumptions about TIME_WAIT on
the realserver.
A better estimate of the number of connections at the realservers would
have been to look at the number of ESTABLISHED and TIME_WAIT connections
on the realservers, but I didn't think of this at the time when I did
the above tests.
	</para>
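	<para>
For the record, here is a sketch of the measurement I should have made on each
realserver. It assumes Linux <command>netstat -an</command> output (state in the
6th column, local address in the 4th); the function name
<command>count_states</command> is made up for illustration.
	</para>
<programlisting><![CDATA[
# count_states PORT     (reads "netstat -an" style lines on stdin)
# prints the number of ESTABLISHED and TIME_WAIT tcp connections on local PORT
count_states() {
    awk -v port="$1" '
        $1 ~ /^tcp/ && $4 ~ (":" port "$") { n[$6]++ }
        END { printf "ESTABLISHED=%d TIME_WAIT=%d\n", n["ESTABLISHED"], n["TIME_WAIT"] }'
}

# on a realserver:
# netstat -an | count_states $polygraph_port
]]></programlisting>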
	<para>
The unbalance then isn't anything that we're regarding as a big enough
problem to find a fix for.
	</para>
	</section>
	<section id="changing_weights">
	<title>Changing weights with ipvsadm</title>
	<para>
When setting up a service, you set the weight with a command like
(default weight is 1).
			</para><para>
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm -a -t $VIP:$SERVICE -r $REALSERVER_NAME:$SERVICE $FORWARDING -w 1
]]></programlisting>
			</para><para>
If you set the weight for the service to &quot;0&quot;, then no new
connections will be made to that service 
(see also man ipvsadm, about the -w option).
			</para><para>
<blockquote><para>
Lars Marowsky-Bree <emphasis>lmb (at) suse (dot) de</emphasis> 11 May 2001
			</para><para>
Setting weight = 0 means that no further connections will be assigned to the
machine, but current ones remain established. This allows to smoothly take a
realserver out of service, <emphasis>i.e.</emphasis> for maintenance.
			</para><para>
Removing the server hard cuts all active connections. This is the correct
response to a monitoring failure, so that clients receive immediate notice
that the server they are connected to died so they can reconnect.
</para></blockquote>
			</para><para>
<blockquote><para>
Laurent Lefoll <emphasis>Laurent (dot) Lefoll (at) mobileway (dot) com</emphasis> 11 May 2001
			</para><para>
Is there a way to clear some entries in the ipvs tables ?
If a server reboots or crashes, the connection
entries remains in the <command>ipvsadm</command> table.
Is there a way to remove manually some entries? I have tried to remove
the realserver from the service (with <command>ipvsadm</command> -d .... ),
but the entries are still there.
			</para><para>
<blockquote><para>
Joe
			</para><para>
After a service (or realserver) failure, some agent external to LVS
will run <command>ipvsadm</command> to delete the entry for the
service. Once this is done no new connections can
be made to that service, but the entries are kept in
the table till they timeout.
(If the service is still up, you can delete the entries
and then re-add the service and the client will not
have been disconnected). You can't &quot;remove&quot; those
entries, you can only change the timeout values.
			</para><para>
Any clients connected through
those entries to the failed service(s) will find their
connection hung or deranged in some way. We can't do
anything about that. The client will have to
disconnect and make a new connection.
For http where the client makes
a new connection almost every page fetch, this is not
a problem. Someone connected to a database may find their
screen has frozen.
	</para></blockquote>
</para></blockquote>
			</para><para>
If you are going to set the weight of a service, you need
to first know the state of the LVS.
If the service is not already in the <command>ipvsadm</command> table, you add (-a) it.
If the service is already in the <command>ipvsadm</command> table, you edit (-e) it.
There is no command to just set the weight no matter what the state.
A patch exists to do this (from Horms) but Wensong doesn't want to
include it.
Scripts which dynamically add, delete or change weights on services
will have to know the state of the LVS before making any changes,
or else trap errors from running the wrong command.
	</para>
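	<para>
A minimal sketch of the "trap the error" approach: try the edit first and fall
back to an add if the edit fails. The function name <command>set_weight</command>
and the IPVSADM variable (there only so the command can be swapped out when
testing) are assumptions for illustration. The virtual service itself must
already exist, otherwise both commands fail.
	</para>
<programlisting><![CDATA[
# set_weight VIP:PORT REALSERVER:PORT FORWARDING WEIGHT
# FORWARDING is -g, -i or -m
set_weight() {
    ipvsadm_cmd=${IPVSADM:-ipvsadm}
    # edit the entry if it is already there, otherwise add it
    $ipvsadm_cmd -e -t "$1" -r "$2" "$3" -w "$4" 2>/dev/null ||
    $ipvsadm_cmd -a -t "$1" -r "$2" "$3" -w "$4"
}

# director:/etc/lvs# set_weight $VIP:$SERVICE $REALSERVER_NAME:$SERVICE -g 0
]]></programlisting>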
	</section>
	<section id="setting_initial_weights">
	<title>Setting initial weights</title>
	<para>
If your hardware is all the same, you set them all to the same weight (1:1:1.., or 1000:1000:1000, it's the
ratio that's important, not the value).
What if you have a bunch of different hardware and you don't know what weight to set for each?
	</para>
	<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 13 Nov 2008 
	</para>
	<para>
Run them all on the same weight for a while and see how they get
loaded, then decide if you need to play with the weights or just add
more servers.
In a layer 4 balanced cluster I normally recommend no greater than 40%
utilisation as a rule of thumb to cope with peaks in demand.
	</para>
	<para>
Graeme
	</para>
	<para>
Ensure you start with (for example) 100/100/100, or 1000/1000/1000. It's
easier to juggle the weights with those values than 1/1/1 !
	</para>
	</section>
	<section id="dynamically_changing_weights">
	<title>Dynamically changing realserver weights</title>
	<para>
Some law of averaging large numbers predicts that realservers with large
numbers of clients should have nearly the same load.
In practice, realservers can have widely different loads or numbers of connections.
Presumably this is for services where a small number of clients can saturate a realserver.
An LVS serving static html pages should have even loads on the realservers.
	</para>
	<para>
Leonard Soetedjo <emphasis>Leonard (at) axs (dot) com (dot) sg</emphasis>
	</para>
	<blockquote>
From reading the howto,
mon doesn't handle dynamically changing the realserver weights.
Is it advisable to create a monitoring program that changes the
weightage of the realserver?
The program will check the worker's load,
memory etc and reduce or increase the weight
in the director based on those information.
	</blockquote>
	<para>
Malcolm Turnbull <emphasis>Malcolm.Turnbull (at) crocus (dot) co (dot) uk</emphasis>
09 Dec 2002
	</para>
	<para>
Personally I think it adds complication you shouldn't require...
As your servers are running the same app they should respond in roughly
the same way to traffic..
If you have a very fast server reduce its weight.
If you have some very slow pages.. i.e. Global Search...
Then why not set up another VIP to make sure that all requests to
search.mydomain.com are evenly distributed... or restricted to a couple
of servers (so they don't impact everyone else..)
	</para>
	<para>
But obviously it all depends on how your app works,
with mine its database performance that is the problem...
Time to look at loadbalancing the DB !,
Does anyone have any experience of doing this with MS SQL server and or PostGreSQL ?
I'm thinking about running the session / sales stuff of the MS SQL box,
and all the readonly content from several readonly PostGreSQL DBs...
Due to licencing costs... :-(.
	</para>
	<para>
OTOH, someone recently spoke
on the list about a monitoring tool which could use plugins to monitor
the realservers.
(Joe - see <link linkend="feedbackd">feedbackd</link>.)
	</para>
	<para>
Lars Marowsky-Bree <emphasis>lmb (at) suse (dot) de</emphasis> 17 Mar 2003
	</para>
	<para>
keep in mind that loadavg is a poor indication of real resource
utilization, but it might be enough.
loadavg needs to be at least normalized via the number of CPUs.
	</para>
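	<para>
(A sketch of the normalization Lars mentions: divide the 1 minute load average
by the number of CPUs. The function name <command>normalized_load</command> is
made up, and reading /proc/loadavg and calling <command>nproc</command> assume
a Linux realserver.)
	</para>
<programlisting><![CDATA[
# normalized_load LOADAVG NCPU
# prints the load per CPU to two decimal places
normalized_load() {
    awk -v l="$1" -v n="$2" 'BEGIN { printf "%.2f\n", l / n }'
}

# on a Linux realserver:
# normalized_load "$(cut -d' ' -f1 /proc/loadavg)" "$(nproc)"
]]></programlisting>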
	<para>
Andres Tello <emphasis>criptos (at) aullox (dot) com</emphasis> 17 Mar 2003
	</para>
	<para>
I use: ram*speed/1000 to calculate the weight
	</para>
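	<para>
(Andres doesn't give units. As a worked example only, assuming ram in MB and
speed in MHz, his rule looks like this; the function name
<command>static_weight</command> is made up. Only the ratio between the
resulting weights matters.)
	</para>
<programlisting><![CDATA[
# static_weight RAM SPEED   (assumed units: MB and MHz)
static_weight() {
    echo $(( $1 * $2 / 1000 ))
}

static_weight 512 800       # an older box
static_weight 1024 2000     # a newer box
]]></programlisting>
	<para>
With these made-up machines the weights come out as 409 and 2048, so the newer
box would be offered about five times as many connections.
	</para>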

	<para>
Joe
	</para>
	<para>
dsh (http://www.netfort.gr.jp/~dancer/software/dsh.html, machine gone Sep 2004)
is good for running commands (via rsh, ssh) on multiple machines
(machines listed in a file).
I'm using it here to monitor temperatures on multiple machines.
	</para>
	<para>
also see
<xref linkend="procstatd"/>
	</para>
	<para>
Bruno Bonfils
	</para>
	<para>
The loads can become unbalanced even if the realservers are identical.
Customers can read different pages.
Some of them may have heavy php/sql usage,
which implies a higher load average than a simple static html file.
	</para>
	<para>
Rylan W. Hazelton <emphasis>rylan (at) curiouslabs (dot) com</emphasis> 17 Mar 2003
	</para>
	<para>
I still find large differences in loadaverage of the realservers.
WLC has no idea what else (non httpd) that might be happening on a
server.
Maybe I am compiling something for some reason, or I have a large cron.
It would be nice if LVS could redirect load accordingly.
	</para>
	</section>
	<section id="feedbackd">
	<title>feedbackd</title>
	<para>
Joe, Mar 2003
	</para>
	<para>
Jeremy's
<ulink url="http://www.redfishsoftware.com.au/projects/feedbackd/">feedbackd code and HOWTO</ulink>
is now released.
	</para>
	<para>
Jeremy Kerr <emphasis>jeremy (at) redfishsoftware (dot) com (dot) au</emphasis> 09 Dec 2002:
	</para>
	<para>
As I've said earlier (check out the thread starting at
http://www.in-addr.de/pipermail/lvs-users/2002-November/007264.html ), 
the software sends server load information to the director to be inserted into
the ipvs weighting tables.
	</para>
	<para>
I'm busy writing up the benchmarking results at the moment, and I'll post a
link to the paper (and software) soon. In summary: when the simulation
cluster is (intentionally) unbalanced, the feedback software successfully
evens the load between all servers.
	</para>
	<para>
Jeremy at one stage had plans to merge his code with Alexandre's code, but
(Aug 2004) he's not doing anything about it at the moment (he has a real job now).
	</para>
	<para>
Jeremy Kerr <emphasis>jeremy (at) redfishsoftware (dot) com (dot) au</emphasis> 04 Feb 2005
	</para>
	<para>
Everything's available at 
<ulink url="http://ozlabs.org/~jk/projects/feedbackd/">feedbackd</ulink>
(http://ozlabs.org/~jk/projects/feedbackd/).
	</para>

	<para>
Michal Kurowski <emphasis>mkur (at) gazeta (dot) pl</emphasis> 19 Jan 2007
	</para>
	<para>
I want to distribute the load based on criteria such as:
	</para>
	<itemizedlist>
		<listitem>
disk usage
		</listitem>
		<listitem>
OS load average
		</listitem>
		<listitem>
CPU usage
		</listitem>
		<listitem>
custom hooks into my own software
		</listitem>
		<listitem>
You Name It (TM)
		</listitem>
	</itemizedlist>
	<para>
<command>feedbackd</command> has got CPU-monitoring only by default.
It also has a perl plugin that's supposed to let you code something relevant to you quickly. 
That's a perfect idea except the original perl plugin is not fully functional, 
because it breaks some rules regarding linking to C-based modules.
	</para>
	<para>
I wrote <ulink url="files/feedbackd-agent.patch">feedbackd-agent.patch</ulink>
that solves the problem (it's against the latest release - 0.4, and I've sent Jeremy a copy). 
	</para>
	</section>
	<section id="lvs-kiss">
	<title>lvs-kiss</title>
	<para>
Per Andreas Buer <emphasis>perbu (at) linpro (dot) no</emphasis> 14 Dec 2002
	</para>
	<para>
I ran into the same problem this summer. I was setting up a loadbalanced
SMTP cluster - and I wanted to distribute the incoming connections
based on the number of e-mails in the mail servers' queues.
We ended up making our own program to do this. Later, I made the thing a
bit more generic and released it. You might want to check it out
	</para>
	<para>
http://www.linpro.no/projects/lvs-kiss/
	</para>
	<para>
lvs-kiss distributes incoming connections based on some numerical value
- as long as you are able to quantify it - it can be used. It can also
time certain tests in order to measure the load of the realservers.
	</para>
	</section>
	<section id="threshold" xreflabel="connection threshold">
	<title>connection threshold</title>
	<note>
according to my dictionary, the spelling is <emphasis role="bold">threshold</emphasis>
and not the more logical <filename>threshhold</filename> as found in many 
webpages (see google: "threshhold dictionary"). 
	</note>
	<para>
If the realservers are limited in the number of connections they can support,
you can use the connection threshold in <command>ipvsadm</command>
(in ip_vs 1.1.0). See the Changelog and man ipvsadm. This
functionality is derived from Ratz's original patches.
	</para>
	<para>
Matt Burleigh
	</para>
	<blockquote>
Is there a stock Red Hat kernel with a new enough version of ip_vs to
include connection thresholds?
	</blockquote>
	<para>Ratz 19 Dec 2002:
I've done the original patch for 2.2.x kernels but I've never ported
it to 2.4.x kernels. I don't know if RH has done so.
In the newest LVS release for 2.5.x kernels the same concept is there,
so with a bit of coding (maybe even luck) you could use that.
	</para>
	<para>
ratz <emphasis>ratz (at) tac (dot) ch</emphasis> 2001-01-29
	</para>
	<para>
This patch on top of ipvs-1.0.3-2.2.18 adds support for threshold
settings per realserver for all schedulers that have the -w option.
	</para>
	<note>
As of Jun 2003, patches are available for 2.4 kernels. All patches are on
<ulink url="http://www.drugphish.ch/patches/ratz/LVS/">Ratz's LVS page</ulink>.
Patches are in active development (<emphasis>i.e.</emphasis> you'll be helping
with the debugging), look at the mailing list for updates.
	</note>
	<para>
Horms 30 Aug 2004
	</para>
	<para>
LVS in 2.6 has its own connection limiting code.
There isn't a whole lot to it. Just get <command>ipvsadm</command> for 2.6 and take a look
in the man page. It has details on how the connection thresholds can be
set. It's pretty straightforward as I recall.
	</para>
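	<para>
(For 2.6 the options in the <command>ipvsadm</command> man page are
<command>-x</command> (--u-threshold) and <command>-y</command> (--l-threshold).
A sketch with placeholder values; $VIP, $REALSERVER and the numbers are
assumptions for illustration.)
	</para>
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm -a -t $VIP:80 -r $REALSERVER:80 -g -w 1 -x 1000 -y 800
director:/etc/lvs# ipvsadm -L -n --thresholds      # list the upper and lower thresholds
]]></programlisting>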
	<para>
Anon
	</para>
	<para>
Is there any way to
limit connections per IP through IPVS, to mimic the netfilter connection
limit module ipt_connlimit?
I see ipvsadm's threshold option, but it does totals per server.
	</para>
	<para>
Ratz 15 Feb 2007
	</para>
	<para>
how exactly is the threshold option (per RS)
different to the ipt_connlimit regarding the service -> RS pool mapping? If
you need source IP limiting you're better off using QoS anyway.
	</para>
		<section>
		<title>Description/Purpose</title>
		<para>
I was always thinking of how a kernel based implementation of
connection limitation per realserver would work and how it could
be implemented so while waiting in the hospital for the x-ray I
had enough time to write up some little dirty hack to show a
proof of concept. It works as follows. I added three new entries
to the ip_vs_dest() struct, u_thresh and l_thresh in ip_vs.* and
I modified the <command>ipvsadm</command> to add the two new options x and y.
A typical setup would be:
		</para>
<programlisting><![CDATA[
director:/etc/lvs# ipvsadm -A -t 192.168.100.100:80 -s wlc
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 192.168.100.2:80 -w 3 -x 1145 -y 923
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 192.168.100.3:80 -w 2 -x 982 -y 677
director:/etc/lvs# ipvsadm -a -t 192.168.100.100:80 -r 127.0.0.1:80 -w 1 -x 100 -y 50
]]></programlisting>
		<para>
So, this means, as soon as (dest->inactconns + dest->activeconns)
exceeds the x value the weight of this server is set to zero. As
soon as the connections drop below the lower threshold (y) the
weight is set back to the initial value.
What is it good for? Yeah well, I don't know exactly, imagine yourself,
but first of all this is a proposal and I wanted to ask for a discussion
about a possible inclusion of such a feature or even a derived one into
the main code (of course after fixing the race conditions and bugs and
cleaning up the code) and second, I found out from tons of talks with
customers that such a feature is needed, because commercial load balancers
also have this and managers always like to have a nice comparison of all
features to decide which product they take. Doing all this in
userspace is unfortunately just not atomic enough.
		</para>
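		<para>
(The rule Ratz describes, written out as a userspace sketch. This is not his
kernel patch: the function name <command>thresh_weight</command> and its
argument order are made up for illustration, and as he says, doing this in
userspace is not atomic. An upper threshold of 0 is treated as "disabled".)
		</para>
<programlisting><![CDATA[
# thresh_weight CONNS UPPER LOWER CURRENT_WEIGHT INITIAL_WEIGHT
# CONNS is inactconns + activeconns for the realserver
thresh_weight() {
    if [ "$2" -gt 0 ] && [ "$1" -gt "$2" ]; then
        echo 0              # above the upper threshold: no new connections
    elif [ "$4" -eq 0 ] && [ "$1" -lt "$3" ]; then
        echo "$5"           # quiesced and now below the lower threshold: restore
    else
        echo "$4"           # otherwise leave the weight alone
    fi
}

thresh_weight 1200 1145 923 3 3     # over -x 1145
thresh_weight 1000 1145 923 0 3     # draining, still above -y 923
thresh_weight  900 1145 923 0 3     # below -y 923
]]></programlisting>
		<para>
With the thresholds from the first realserver above, the three calls print
0, 0 and 3.
		</para>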
		<para>
Anyway, if anybody else thinks that such a feature might be vital
for inclusion we can talk about it. If you look at the code, it
wouldn't break anything and just add two lousy CPU cycles for checking
if u_thresh is &lt;0. This feature can easily be disabled by just
setting u_thresh to zero or not even initialize it.
		</para>
		<para>
Well, I'm open for discussion and flames. I have it running in
production :) but with a special SLA. I implemented the last
server of resort which works like this: If all RS of a service
are down (healthcheck took it out or threshold check set weight
to zero), my userspace tool automagically invokes the last
server of resort, a tiny httpd with a static page saying that
the service is currently unavailable. This is also useful if you
want to do maintenance of the realservers.
		</para>
		<para>
I have already implemented a dozen such setups and they all work
pretty well.
		</para>
		<para>
How will we defend against DDoS (distributed DoS)?
		</para>
		<blockquote>
I'm using a packetfilter and in special zones a firewall after the
packetfilter ;) No seriously, I personally don't think the LVS should
take too much part in securing the realservers. It's just another part
of the firewall setup.
		</blockquote>
		<para>
        The problem is that LVS has another view of the realserver
load. The director sees one number of connections, the realserver
sees another. And under attack we observe a big gap between the
active/inactive counters and the threshold values used. In this case
we just exclude all realservers. This is the reason I prefer the
more informed approach of using agents.
		</para><para>
Using the number of active
or inactive connections to assign a new weight is _very_ dangerous.
		</para>
		<blockquote>
			<para>
I know, but OTOH, if you set a threshold and my code takes the
server out, because of a well formatted DDoS attack, I think it
is even better than if you would allow the DDoS and maybe kill the
realservers http-listener.
			</para>
			<blockquote>
				<para>
        No, we have two choices:
				</para><para>
- use SYN cookies and much memory for open requests, accept more
valid requests
				</para><para>
- don't use SYN cookies, drop the requests exceeding the backlog length,
drop many valid requests but the realservers are not overloaded
				</para><para>
In both cases the listeners don't see requests until the handshake is
completed (Linux).
				</para>
			</blockquote>
			<para>
BTW, what if you enable the defense
strategies of the loadbalancer? I've done some tests and I was
able to flood the realservers by sending forged SYNs and timeshifted
SYN-ACKs with the expected seq-nr. It was impossible to work on
the realservers unless of course I enabled the TCP_SYNCOOKIES.
			</para>
		</blockquote>
		<para>
        Yes, nobody claims the defense strategies guard the real
servers. This is not their goal. They keep the director with more
free memory and nothing more :) Only drop_packet can control the
request rate but only for the new requests.
		</para>
		<blockquote>
I then enabled my patch and after the connections exceeded the
threshold, the kernel took the server out temporarily by setting
the weight to 0. In that way the server was usable and I could
work on the server.
		</blockquote>
		<para>
        Yes but the clients can't work, you exclude all servers
in this case because the LVS spreads the requests to all servers
and the rain becomes deluge :)
		</para><para>
In theory, the number of connections is related to the load, but
this is true only in an ideal world. The inactive counter can
reach very high values when we are under attack. Even the WLC
method loads the realservers proportionately, but they are never
excluded from operation.
		</para>
		<blockquote>
			<para>
True, but as I already said, I think LVS shouldn't replace a fw.
I normally have a router configured properly, then a packetfilter,
then a firewall or even another, but stateful, packetfilter. See,
the patch itself is not even mandatory. In a normal setup, my code
is not even touched (except the ``if'':).
			</para>
			<blockquote>
        I have some thoughts about limiting the traffic per
connection but this idea must be analyzed.
			</blockquote>
			<para>
Hmm, I just want to limit the amount of concurrent connections
per realserver and in the future maybe per service. This saved
me quite some lines of code in my userspace healthchecking
daemon.
			</para>
		</blockquote>
		<para>
        Yes, you vote for moving some features from user to the
kernel space. We must find the right balance: what can be done in
LVS and what must be implemented in the user space tools.
		</para><para>
The other alternatives
are to use the Netfilter's "limit" target or QoS to limit the
traffic to the realservers.
		</para>
		<blockquote>
But then you have to add quite some code. The limit target has
no idea about LVS tables. How should this work, f.e. if you
would like to rate limit the amount of connections to a realserver?
		</blockquote>
		<para>
        Maybe we can limit the SYN rate. Of course, that does not cover
all cases, so my thought was to limit the packet rate for all states
or per connection; I'm not sure, this is an open topic. It is easy to open
a connection through the director (especially in LVS-DR) and then
to flood this connection with packets. This is one of the cases where
LVS can really guard the realservers from packet floods. If we
combine this with the other kind of attacks, the distributed ones,
we have better control. Of course, some QoS implementations may
cover such problems, I'm not sure. And this can be a simple implementation;
of course, nobody wants to reinvent the wheel :)
		</para><para>
        Let's analyze the problem. If we move new connections from
"overloaded" realserver and redirect them to the other realservers we
will overload them too.
		</para>
		<blockquote>
No, unless you use an old machine. This is maybe a requirement of
an e-commerce application. They have some servers and if the servers
are overloaded (taken out by my user-space healthchecking daemon
because the response time is too high or the application daemon is
not listening on the port anymore) they will be taken out. Now I
have found out that by setting thresholds I could reduce the downtime
of a flooded server significantly. In case all servers were
taken out or their weights were set to 0, the userspace application
sets up a temporary new realserver (either a local route or another
server) that has nothing else to do than push a static webpage
saying that the service is currently unavailable due to high
server load or DDoS attack or whatever. Put this page behind a
TUX 2.0 and try to overflow it. If you can, apply the zero-copy
patches of DaveM. No way will you find a link fast enough (88MBit/s
of requests!!) to saturate the server.
		</blockquote>
		<para>
        Yes, I know that this is a working solution. But see, you
exclude all realservers :) You are giving up. My idea is to find
a state where we can drop some of the requests and keep the
realservers busy but responsive. This can be a difficult task but
not when we have help from our agents. We expect that many
valid requests will be dropped, but if we keep the realservers in
good health we can still handle some valid requests, because nobody knows
when the flood will stop. The link is busy but it contains valid
requests. And the service does not see the invalid ones.
		<para>
IMO, the problem is that there are
more connection requests than the cluster can handle. Solutions
that try to move the traffic between the realservers can only cause
more problems. If at the same time we set the weights to 0, this
leads to more delay in the processing. It may be more useful to
start by reducing the weights, but this again returns us to
the theory of the smart cluster software.
		</para>
		<para>
        So, we can't exit from this situation without dropping
requests. There is more traffic than the cluster can serve.
		</para>
		<para>
        The requests themselves are not meaningful to us; we care how much load they
introduce and we report this load to the director. It can look, for
example, like one value (weight) for the real host that can be set
for all real services running on this host. We don't need to generate
10 weights for the 10 real services running on our real host. And
we change the weight every 2 seconds, for example. We need two
syscalls (lseek and read) to get most of the values from the /proc fs,
though maybe from 2-3 files. This is in Linux, of course. Not sure
how this behaves under attack. We will see :)
		</para>
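		<para>
A minimal sketch of the calculation such an agent might do, in Python.
The file <filename>/proc/loadavg</filename> exists on Linux; the function
names, the linear mapping from load to weight and the maximum weight of 100
are assumptions made for illustration and are not part of any LVS tool.
		</para>
<programlisting><![CDATA[
def weight_from_load(load1, ncpus, max_weight=100):
    """Map a 1-minute load average to a weight (hypothetical linear formula)."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    # Fraction of CPU capacity in use, clamped to the range 0..1.
    busy = min(max(load1 / ncpus, 0.0), 1.0)
    # Never go below 1: a busy realserver is slowed down, not excluded.
    return max(1, round(max_weight * (1.0 - busy)))


def read_load1(path="/proc/loadavg"):
    """Return the first field of /proc/loadavg, the 1-minute load average."""
    with open(path) as f:
        return float(f.read().split()[0])


if __name__ == "__main__":
    for load in (0.0, 2.0, 8.0):
        print(load, weight_from_load(load, ncpus=4))
]]></programlisting>
		<para>
An agent on the realserver could call <command>read_load1()</command> every
2 seconds or so and report the resulting single weight to the director,
which then applies it to all services on that host.
		</para>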
		<blockquote>
Obviously yes, but if you also include the practical problem of
SLAs with customers and guaranteed downtime per month, I still have
to say that for my deploition (is this the correct noun?) I do
better with my patch and the LVS defense strategies enabled
in case of a DDoS than without.
		</blockquote>
		<para>
If there is no cluster software to keep the realservers equally
loaded, some of them can go offline too early.
		</para>
		<blockquote>
The scheduler should keep them equally loaded IMO even in case
of let's say 70% forged packets. Again, if you don't like to
set a threshold, leave it. The patch is open enough. If you like
to set it, set it, maybe set it very high. It's up to you.
		</blockquote>
		<para>
        The only problem we have with this scheme is the ipvsadm
binary. It must be changed (the user structure in the kernel :))
The last change dates from 0.9.10 and that was a long time ago :)
But you know what a change in the user structures means :)
		</para>
		<para>
        The cluster software can take on the role of monitoring the load
instead of relying on the connection counters. I agree, changing the
weights and deciding how much traffic to drop can be expressed
with a complex formula. But I can see it only as a complete solution:
balance the load, drop the excess requests and serve as many
requests as possible. Even the drop_packet strategy can help here:
we can explicitly enable it, specifying the proper drop rate. We don't
need to use it only to defend the LVS box but also to drop the excess
traffic. But someone has to control the drop rate :) If there is no
excess traffic, what problems can we expect? Only those from bad load
balancing :)
		</para>
		<para>
        The easiest way to control the LVS is
from user space and to leave in LVS only the basic needed support.
This allows us to have more ways to control LVS.
		</para>
		</section>
	</section>
	<section id="flushing_connection_table">
	<title>Flushing connection table</title>
	<para>
Shinichi Kido <emphasis>shin (at) kidohome (dot) ddo (dot) jp</emphasis>
 	</para>
	<blockquote>
I want to reset the whole connection
table (the list output by the <command>ipvsadm -lc</command> command)
immediately, without waiting for all the connections to expire.
	</blockquote>
	<para> 
Stefan Schlosser <emphasis>castla (at) grmmbl (dot) org</emphasis> 04 Jun 2004
	</para>
	<para>
you may want to try these patches:
	</para>
<programlisting><![CDATA[
http://grmmbl.org/code/ipvs-flushconn.diff
http://grmmbl.org/code/ipvsadm-flushconn.diff
]]></programlisting>
	<para>
and use <command>ipvsadm -F</command>
	</para>
	<para>
Horms <emphasis>horms (at) verge (dot) net (dot) au</emphasis> 04 Jun 2004 
	</para>
	<para>
Another alternative, if you have LVS compiled as a module, is
to reload it. This will clear everything.
	</para>
<programlisting><![CDATA[
ipvsadm -C
# remove the ipvs scheduler and other modules
# rmmod ip_vs_wlc
# ...
rmmod ip_vs 
modprobe ip_vs
]]></programlisting>
	<para>
Then again, Shinichi-san, why do you want to clear the connection table?
It might be useful for testing. But I am not sure what it would
be useful for in production.
	</para>
	<para>
Joe
	</para>
	<para>
In general it's not good for a server to accept a connection and then unilaterally
break it. You should let the connections expire. If you don't want any new
connections, just set weight=0.
	</para>
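	<para>
As a sketch of the weight=0 approach, the Python below only builds the
<command>ipvsadm</command> command line for quiescing one realserver.
The addresses are made-up examples and the helper name is hypothetical;
<command>-e</command>, <command>-t</command>, <command>-r</command>,
<command>-w</command> and the forwarding options
(<command>-g</command>, <command>-i</command>, <command>-m</command>)
are standard ipvsadm options.
	</para>
<programlisting><![CDATA[
def quiesce_command(vip, rip, port=80, forwarding="-g"):
    """Build the ipvsadm call that sets a realserver's weight to 0."""
    return [
        "ipvsadm", "-e",           # edit an existing realserver entry
        "-t", f"{vip}:{port}",     # the TCP virtual service
        "-r", f"{rip}:{port}",     # the realserver to quiesce
        forwarding,                # -g (DR), -i (TUN) or -m (NAT)
        "-w", "0",                 # weight 0: no new connections
    ]


if __name__ == "__main__":
    # Example addresses only; run the printed command as root on the director.
    print(" ".join(quiesce_command("192.168.1.110", "192.168.1.11")))
]]></programlisting>
	<para>
The forwarding option is passed explicitly because, as far as I know,
<command>ipvsadm</command> applies its default forwarding method when
the option is left out of an edit.
Existing connections then run to completion and expire normally.
	</para>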
	</section>
	<section id="thundering_herd" xreflabel="thundering herd problem">
	<title>Thundering herd problem, Slow start code for realserver(s) coming on line</title>
	<para>
Despite what you may have read in the mailing list 
and possibly in earlier versions of this HOWTO, 
there is no slow start for realservers coming on line
(I thought it was in the code from the very early days).
If you bring a new realserver on-line with *lc scheduling,
the new machine, having fewer connections (<emphasis>i.e.</emphasis> none),
will get all the new connections.
This will likely stress the new realserver.
	</para>
	<para>
As Jan Klopper points out (11 Mar 2006), you don't get the thundering herd
problem with round robin scheduling. 
In this case the number of connections will even out when the old
connections expire (for http this may only be a few minutes).
It would be simple enough to bring up a new realserver with all rules being rr,
then after 5mins change over to lc (if you want lc).
	</para>
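	<para>
The difference between the two schedulers can be seen in a toy simulation.
This Python sketch is not LVS code: it ignores weights and connection expiry
and only shows where new connections land when a third realserver joins
two that already hold 1000 connections each.
	</para>
<programlisting><![CDATA[
from itertools import cycle


def assign_lc(conns, new_connections):
    """Least-connection: each new connection goes to the least loaded server."""
    conns = dict(conns)
    for _ in range(new_connections):
        target = min(conns, key=conns.get)
        conns[target] += 1
    return conns


def assign_rr(conns, new_connections):
    """Round robin: servers take turns regardless of their current count."""
    conns = dict(conns)
    servers = cycle(sorted(conns))
    for _ in range(new_connections):
        conns[next(servers)] += 1
    return conns


if __name__ == "__main__":
    start = {"rs1": 1000, "rs2": 1000, "rs3": 0}  # rs3 has just come on-line
    print("lc:", assign_lc(start, 300))
    print("rr:", assign_rr(start, 300))
]]></programlisting>
	<para>
With lc all 300 new connections go to rs3; with rr each realserver gets 100.
	</para>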
	<para>
Horms says (off-line Dec 2006) that it's simple enough to use the 
in-kernel timers to handle this problem; he just hasn't done it.
Some early patches to handle the problem received zero response,
so he dropped it.
	</para>
	<para>
Christopher Seawood <emphasis>cls (at) aureate (dot) com</emphasis>
	</para>
	<para>
LVS seems to work great until a server goes down (this is where
<command>mon</command> comes in). Here's a couple of things to keep in mind.  If
you're using the Weighted Round-Robin scheduler, then LVS will
still attempt to hit the server once it goes down. If you're
using the Least Connections scheduler, then all new connections
will be directed to the down server because it has 0 connections.
You'd think using mon would fix these problems but not in all
cases.
	</para>
	<para>
Adding mon to the LC setup didn't help matters much. I took one
of three servers out of the loop and waited for mon to drop the
entry.  That worked great.  When I started the server back up,
mon added the entry.  During that time, the 2 running servers had
gathered about 1000 connections apiece.  When the third server
came back up, it immediately received all of the new connections.
It kept receiving all of the connections until it had an equal
number of connections with the other servers (which by this
time...a minute or so later...had fallen to ~700). By this time,
the 3rd server had been restarted after triggering a high
load sensor also monitoring the machine (a necessary evil or so
I'm told).  At this point, I dropped back to using WRR as I could
envision the cycle repeating itself indefinitely.
	</para>
	<para>
Horms has fixed this problem with a patch to <command>ipvsadm</command>.
	</para>
	<para>
Horms <emphasis>horms (at) verge (dot) net (dot) au</emphasis> 23 Feb 2004
	</para>
	<para>
Here is a 
<ulink url="files/thundering_herd.diff">patch</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/thundering_herd.diff)
that implements slow start for the WLC scheduler.
This is designed to address the problem where a realserver is added
to the pool and soon inundated with connections. This is sometimes
referred to as the thundering herd problem and was recently
the topic of a thread on this list "Easing a Server into Rotation".
http://marc.theaimsgroup.com/?l=linux-virtual-server&amp;m=107591805721441&amp;w=2
	</para>
	<para>
The patch has two basic parts.
	</para>
	<itemizedlist>
		<listitem>
		<para>
ip_vs_ctl.c:
		</para>
		<para>
When the weight of a realserver is modified (or a realserver is added),
set the IP_VS_DEST_F_WEIGHT_INC or IP_VS_DEST_F_WEIGHT_DEC flag
as appropriate and put the size of the change in dest.slow_start_data.
		</para>
		<para>
This information is intended to act as hints for scheduler
modules to implement slow start. The scheduler modules may
completely ignore this information without any side effects.
		</para>
		</listitem>
		<listitem>
		<para>
ip_vs_wlc.c:
		</para>
		<para>
If IP_VS_DEST_F_WEIGHT_DEC is set then the flag is zeroed -
slow start does not come into effect for weight decreases.
		</para>
		<para>
If IP_VS_DEST_F_WEIGHT_INC is set then a handicap is calculated.
The flag is then zeroed.
		</para>
		<para>
The handicap is stored in dest.slow_start_data, along with a scaling
factor to allow gradual decay which is stored in dest.slow_start_data2.
The handicap effectively makes the realserver appear to have
more connections than it does, thus decreasing the number of connections
that the wlc scheduler will allocate to it. This handicap is decayed
over time.
		</para>
		</listitem>
	</itemizedlist>
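	<para>
The effect of a decaying handicap can be sketched in a few lines of Python.
This is an illustration of the idea only, not the code in the patch:
the selection rule (connections plus handicap, divided by weight),
the initial handicap and the decay factor are all assumptions,
and the distinction between active and inactive connections is ignored.
	</para>
<programlisting><![CDATA[
def pick_wlc(servers):
    """Pick the server with the lowest (connections + handicap) / weight."""
    candidates = [s for s in servers if s["weight"] > 0]
    return min(candidates,
               key=lambda s: (s["conns"] + s["handicap"]) / s["weight"])


def decay(server, factor=0.5):
    """Shrink the handicap; integer truncation lets it reach zero."""
    server["handicap"] = int(server["handicap"] * factor)


if __name__ == "__main__":
    servers = [
        {"name": "rs1", "weight": 1, "conns": 1000, "handicap": 0},
        {"name": "rs2", "weight": 1, "conns": 1000, "handicap": 0},
        # New server: the handicap makes it look as busy as the others.
        {"name": "rs3", "weight": 1, "conns": 0, "handicap": 1000},
    ]
    for tick in range(4):
        for _ in range(100):
            pick_wlc(servers)["conns"] += 1
        decay(servers[2])
        print(tick, [(s["name"], s["conns"], s["handicap"]) for s in servers])
]]></programlisting>
	<para>
As the handicap shrinks, the new realserver's share of new connections
grows gradually instead of arriving all at once.
	</para>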
	<para>
Limited debugging information is available by setting
<filename>/proc/sys/net/ipv4/vs/debug_level</filename> to 1 (or greater),
<emphasis>e.g.</emphasis>
	</para>
<programlisting><![CDATA[
echo 1 > /proc/sys/net/ipv4/vs/debug_level
]]></programlisting>
	<para>
This will show the size of the handicap when it is calculated
and show a message when the handicap is fully decayed.
	</para>
	</section>
	<section id="files_kernel_version_dependant">
	<title>Handling kernel version dependent files <emphasis>e.g.</emphasis> System.map and ipvsadm</title>
	<para>
If you boot with several different versions of the kernel
(particularly switching between 2.2.x and 2.4.x), and
you have executables or directories with contents that
need to match the kernel version
(<emphasis>e.g.</emphasis> System.map, ipvsadm, /usr/src/linux, /usr/src/ipvs),
then you need some
mechanism for making sure that the appropriate executable
or directory is brought into scope.
	</para>
	<para>
Note: klogd is supposed to read files like /boot/System.map-&lt;kernel_version&gt;
allowing you to have several kernels in / (or /boot). However this doesn't
solve the problem for general executables like ipvsadm.
	</para>
	<para>
If you have the wrong version of System.map you'll get
errors when running some commands
(<emphasis>e.g.</emphasis> `ps` or `top`)
	</para>
<programlisting><![CDATA[
Warning: /usr/src/linux/System.map has an incorrect kernel version.
]]></programlisting>
	<para>
If ip_vs and <command>ipvsadm</command> don't match, then
<command>ipvsadm</command> will give invalid numbers for IPs and ports
or it will tell you that you don't have <filename>ip_vs</filename> installed.
	</para>
	<para>
As with most problems in computing, this can
be solved with an extra layer of indirection.
I name my kernel versions in /usr/src like
	</para>
<programlisting><![CDATA[
director:/etc/lvs# ls -alF /usr/src | grep 2.19
lrwxrwxrwx   1 root     root           25 Sep 18  2001 linux-1.0.7-2.2.19 -> linux-1.0.7-2.2.19-module/
drwxr-xr-x  15 root     root         4096 Jun 21  2001 linux-1.0.7-2.2.19-kernel/
drwxr-xr-x  15 root     root         4096 Aug  8  2001 linux-1.0.7-2.2.19-module/
lrwxrwxrwx   1 root     root           18 Oct 21  2001 linux -> linux-1.0.7-2.2.19
]]></programlisting>
	<para>
Here I have two versions of ip_vs-1.0.7 for the 2.2.19 kernel,
one built as a kernel module and the other built into the kernel
(you will probably only have one version of ip_vs for any kernel).
I select the one I want to use by making a link from linux-1.0.7-2.2.19
(I do this by hand).
If you do this for each kernel version, then the
/usr/src directory will have several links
	</para>
<programlisting><![CDATA[
director:/etc/lvs# ls -alFrt /usr/src | grep lrw
lrwxrwxrwx   1 root     root           25 Sep 18  2001 linux-1.0.7-2.2.19 -> linux-1.0.7-2.2.19-module/
lrwxrwxrwx   1 root     root           38 Sep 18  2001 linux-0.9.2-2.4.7 -> linux-0.9.2-2.4.7-module-hidden-shared/
lrwxrwxrwx   1 root     root           39 Sep 18  2001 linux-0.9.3-2.4.9 -> linux-0.9.3-2.4.9-module-forward-shared/
lrwxrwxrwx   1 root     root           17 Sep 19  2001 linux-2.4.9 -> linux-0.9.3-2.4.9/
lrwxrwxrwx   1 root     root           40 Oct 11  2001 linux-0.9.4-2.4.12 -> linux-0.9.4-2.4.12-module-forward-shared/
lrwxrwxrwx   1 root     root           18 Oct 21  2001 linux -> linux-0.9.4-2.4.12/
]]></programlisting>
	<para id="rc.system_map">
The last entry, the link from <filename>linux</filename> is
made by 
<ulink url="files/rc.system_map">rc.system_map</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/rc.system_map).
At boot time <filename>rc.system_map</filename> checks the
kernel version (here booted with 2.4.12) and links <filename>linux</filename>
to <filename>linux-0.9.4-2.4.12</filename>.
If you label ipvsadm, /usr/src/ip_vs and System.map
in a similar way, then <filename>rc.system_map</filename>
will link the correct versions for you.
	</para>
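	<para>
The selection step can be sketched as follows. This Python is not the
<filename>rc.system_map</filename> script itself, only an illustration
of matching the naming scheme shown above; the function name is hypothetical
and the sample names are taken from the listing.
	</para>
<programlisting><![CDATA[
import fnmatch


def pick_tree(kernel_version, entries):
    """Return the entry named linux-<ipvs version>-<kernel version>, or None."""
    pattern = f"linux-*-{kernel_version}"
    matches = sorted(e for e in entries if fnmatch.fnmatchcase(e, pattern))
    # If several ip_vs versions exist for one kernel, take the last by name.
    return matches[-1] if matches else None


if __name__ == "__main__":
    names = [
        "linux-1.0.7-2.2.19",
        "linux-0.9.2-2.4.7",
        "linux-0.9.3-2.4.9",
        "linux-2.4.9",
        "linux-0.9.4-2.4.12",
        "linux-0.9.4-2.4.12-module-forward-shared",
    ]
    print("linux ->", pick_tree("2.4.12", names))
]]></programlisting>
	<para>
A boot script would take the kernel version from <command>uname -r</command>
and then make the <filename>linux</filename> link point at the name returned.
	</para>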
	<para>
<filename>ipvsadm</filename>
versions match ipvs versions and not kernel versions,
but kernel versions are close enough that this scheme works.
	</para>
	</section>
	<section id="limiting_clients" xreflabel="limiting client">
	<title>Limiting number of clients connecting to LVS</title>
	<para>
This comes up occasionally, and Ratz has developed scheduler code that
will handle overloads for realservers
(see <xref linkend="writing_a_scheduler"/> and discussion of schedulers with
memthresh in <xref linkend="hash_table"/>).
	</para>
	<para>
The idea is that after a certain
number of connections, the client is sent to an overload machine with
a page saying "hold on, we'll be back in a sec".
This is useful if you have an SLA saying that no client connect request
will be refused, but you have to handle 100,000 people trying to buy
Rolling Stones concert tickets from your site, who all connect within
30secs of the opening time.
Ratz' code may even be in the LVS code sometime.
Until it is, ask Ratz about it.
	</para>
	<para>
NewsFlash: Ratz' code is out
	</para>
	<para>
Roberto Nibali <emphasis>ratz (at) tac (dot) ch</emphasis> 23 Oct 2003
	</para>
	<para>
http://marc.theaimsgroup.com/?l=linux-virtual-server&amp;m=105914647925320&amp;w=2
	</para>
	<para>
Horms
	</para>
	<para>
The LVS 1.1.x code for the 2.6 kernel allows you to set connection
limits using ipvsadm. This is documented in the ipvsadm man
page that comes with the 1.1.x code.
The limits are currently not available in the 1.0.x code
for the 2.4 kernel. However I suspect that a backport
would not be that difficult.
	</para>
	<para>
Diego Woitasen, Oct 22, 2003
	</para>
	<blockquote>
but what sets the IP_VS_DEST_F_OVERLOAD in struct ip_vs_dest?
	</blockquote>
	<para>
Horms 23 Oct 2003
	</para>
	<para>
This relates to LVS's internal handling
of connection thresholds for realservers which is available
in the 1.1.x tree for the 2.6.x kernel (also see 
<xref linkend="threshold"/>).
	</para>
	<para>
IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS code when
the high and low thresholds are passed for a realserver.
If a scheduler honours this flag then it should not allocate
new connections to realservers with this flag set. As far as
I can see all the supplied schedulers honour this flag. But if
a scheduler did not then it would just be ignored. That is, realservers
would have new connections allocated regardless of
if IP_VS_DEST_F_OVERLOAD is set or not. It would be as if
no connection thresholds had been set.
	</para>
	<para>
Note that if persistence is in use then subsequent connections
to the same realserver for a given client within the persistence
timeout are not scheduled as such. Thus additional connections
of this nature can be allocated to a realserver even if
it has been marked IP_VS_DEST_F_OVERLOAD. This, IMHO, is
a desirable behaviour.
	</para>
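	<para>
The high/low threshold behaviour described above amounts to hysteresis.
The Python below is a toy model, not the kernel code: the class and function
names are hypothetical and the exact comparisons used at each threshold
are assumptions.
	</para>
<programlisting><![CDATA[
class Dest:
    """Toy realserver with upper and lower connection thresholds."""

    def __init__(self, upper, lower):
        self.upper = upper
        self.lower = lower
        self.conns = 0
        self.overloaded = False

    def connect(self):
        self.conns += 1
        # Upper threshold reached: stop scheduling new connections here.
        if self.upper and self.conns >= self.upper:
            self.overloaded = True

    def disconnect(self):
        self.conns -= 1
        # Clear the flag only once the count drops below the lower threshold.
        if self.lower and self.conns < self.lower:
            self.overloaded = False


def schedule(dests):
    """Return the first destination not flagged as overloaded, else None."""
    for dest in dests:
        if not dest.overloaded:
            return dest
    return None


if __name__ == "__main__":
    rs = Dest(upper=3, lower=2)
    for _ in range(3):
        rs.connect()
    print("overloaded after 3 connections:", rs.overloaded)
    rs.disconnect()
    print("still overloaded at 2 connections:", rs.overloaded)
    rs.disconnect()
    print("overloaded at 1 connection:", rs.overloaded)
]]></programlisting>
	<para>
A scheduler that honours the flag behaves like <command>schedule()</command>;
one that does not would simply never look at <command>overloaded</command>.
	</para>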
	<blockquote>
ok, I saw that IP_VS_DEST_F_OVERLOAD is set and unset by the core LVS,
but I can't find where the thresholds are set. As far as I can see, these
thresholds are always set to zero in userspace.
Is this right?
	</blockquote>
	<para>
No.
In the kernel the thresholds are set by code in ip_vs_ctl.c
(I guess, as that is where all other configuration from user-space
is handled). If you get the version of ipvsadm that comes with
LVS source that supports IP_VS_DEST_F_OVERLOAD then
it has command line options to set the thresholds.
	</para>
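	<para>
For a version of <command>ipvsadm</command> that supports thresholds,
the options are <command>--u-threshold</command> (<command>-x</command>)
and <command>--l-threshold</command> (<command>-y</command>).
The Python sketch below only builds the command line; the addresses and
the helper name are made up, and the range check is the sketch's own
sanity check.
	</para>
<programlisting><![CDATA[
def threshold_command(vip, rip, upper, lower, port=80, forwarding="-g"):
    """Build an ipvsadm call that sets connection thresholds on a realserver."""
    if not 0 <= lower <= upper:
        raise ValueError("need 0 <= lower <= upper")
    return [
        "ipvsadm", "-e",
        "-t", f"{vip}:{port}",
        "-r", f"{rip}:{port}",
        forwarding,
        "--u-threshold", str(upper),
        "--l-threshold", str(lower),
    ]


if __name__ == "__main__":
    # Example values only; run the printed command as root on the director.
    print(" ".join(threshold_command("192.168.1.110", "192.168.1.11", 100, 80)))
]]></programlisting>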

	<para>
Steve Hill
	</para>
	<blockquote>
A number of the schedulers seem to use an is_overloaded() function that 
limits the number of connections to twice the server's weight.
	</blockquote>
	<para>
Ratz 03 Aug 2004
	</para>
	<para>
For the sake of discussion I'll be referring to the 2.4.x kernel. We 
would be talking about this piece of jewelry:
	</para>
<programlisting><![CDATA[
static inline int is_overloaded(struct ip_vs_dest *dest)
{
         if (atomic_read(&dest->activeconns) > atomic_read(&dest->weight)*2) {
                 return 1;
         }
         return 0;
}
]]></programlisting>
	<para>
I'm a bit unsure about the semantics of this is_overloaded
regarding its mathematical background. Wensong, what was the reason to
arbitrarily use twice the weight as the overload criterion for activeconns?
	</para>
	<itemizedlist>
		<listitem>
<filename>dest->activeconns</filename> has such a short life span, 
it hardly represents nor reflects the current RS load in any way I could imagine.
		</listitem>
		<listitem>
2.4.x and 2.6.x differ in when they consider a destination to be
overloaded. IP_VS_DEST_F_OVERLOAD is set when ip_vs_dest_totalconns
exceeds the upper threshold limit, and totalconns means currently
active + inactive connections, which is also kind of unfortunate. And
yes, there is some more code I haven't mentioned yet.
		</listitem>
	</itemizedlist>
	<blockquote>
I'm using 
the dh scheduler to balance across 3 machines - once the connections 
exceed twice the weight, it refuses new connections from IP addresses that 
aren't currently persistent.
	</blockquote>
	<para>
Which kernel?
And 2.4.x and 2.6.x contain 
similar (although not sync'd ... sigh) code regarding this feature:
	</para>
	<blockquote>
2.4.24 sorry
	</blockquote>
<programlisting><![CDATA[
ratz@webphish:/usr/src/linux-2.6.8-rc2> grep -r is_overloaded *
net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_sh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblcr.c:            if (!dest || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_dh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblc.c:                 || is_overloaded(dest, svc)) {


ratz@webphish:/usr/src/linux-2.4.27-rc4> grep -r is_overloaded *
net/ipv4/ipvs/ip_vs_sh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_sh.c:           || is_overloaded(dest)) {
net/ipv4/ipvs/ip_vs_lblcr.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblcr.c:            if (!dest || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_lblc.c:is_overloaded(struct ip_vs_dest *dest, struct ip_vs_service *svc)
net/ipv4/ipvs/ip_vs_lblc.c:                 || is_overloaded(dest, svc)) {
net/ipv4/ipvs/ip_vs_dh.c:static inline int is_overloaded(struct ip_vs_dest *dest)
net/ipv4/ipvs/ip_vs_dh.c:           || is_overloaded(dest)) {
]]></programlisting>
	<para>
Asymmetric coding :-)
	</para>
	<blockquote>
This in itself isn't really a problem, but I can't find this behaviour 
actually documented anywhere - all the documentation refers to the 
weights as being "relative to the other hosts" which means there should 
be no difference between me setting the weights on all hosts to 5 or 
setting them all to 500.
Multiplying the 
weight by 2 seems very arbitrary, although in itself there is no real 
problem (as far as I can tell) with limiting the connections like that so 
long as it's documented.
	</blockquote>
	<para>
This is correct. I'm a bit unsure as to what your exact problem is, but 
a kernel version would already help, although I believe you're using a 
2.4.x kernel. Normally the is_overloaded() function was designed to be 
used by the threshold limitation feature only which is only present as a 
shaky backport from 2.6.x. I don't quite understand the is_overloaded() 
function in the ip_vs_dh scheduler, OTOH, I really haven't been using it 
so far.
	</para>
	<blockquote>
		<para>
I have 3 squid servers and had set them all to a weight of 5 (since they 
are all identical machines and the docs said that the weights are 
relative to each other).  What I found was that once there were >10 
concurrent connections, any new host that tried to make a connection (i.e. 
any host that isn't "persisting") would have its connection rejected 
outright.  After some reading through the code I discovered the 
is_overloaded condition, which was failing in the case of > 10 connections 
and so I have increased all the weights to 5000 (to all intents and 
purposes unlimited) which has solved the problem.
		</para>
		<para>
Oddly there is another LVS server with a similar configuration which isn't 
showing this behaviour, but I cannot find any significant difference in 
the configuration to account for it.
		</para>
		<para>
The primary use for LVS in this case is failover in the event of one of 
the servers failing, although load balancing is a good side effect.  I'm 
using ldirectord to monitor the realservers and adjust the LVS 
settings in response to an outage.  For some reason it 
doesn't seem to be doing any load balancing at the moment (something I am 
looking into) - it is just using a single server, although if that server 
is taken down it does fail over correctly to one of the other servers.
		</para>
		<para>
Sorry, I've just realised I've been exceptionally stupid about this bit - 
I should be using the SH scheduler instead of DH.
		</para>
	</blockquote>
	<para>
One problem is that activeconns doesn't define connections and
the implementations for 2.4 and 2.6 differ significantly.
Also <filename>is_overloaded</filename> should be reserved
for another purpose, the threshold limitation feature.
	</para>
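	<para>
The effect Steve ran into follows directly from the test in
<filename>is_overloaded</filename>. Restated in Python for illustration
(the kernel code is the C shown earlier), the absolute value of the weight
sets a ceiling of twice the weight on active connections:
	</para>
<programlisting><![CDATA[
def is_overloaded(activeconns, weight):
    """Same comparison as the C function: activeconns > weight * 2."""
    return activeconns > weight * 2


if __name__ == "__main__":
    for weight in (5, 500, 5000):
        print(f"weight {weight}: overloaded once activeconns exceeds {weight * 2}")
]]></programlisting>
	<para>
So for the schedulers that call this function, weights of 5 and 5000 are
not equivalent even when all realservers have the same weight.
	</para>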
	<para>
Joe: now back into prehistory -
	</para>
	<para>
Milind Patil <emphasis>mpatil (at) iqs (dot) co (dot) in</emphasis> 24 Sep 2001
	</para>
	<blockquote>
I want to limit number of users accessing the LVS services at any given
time. How can I do it?
	</blockquote>
	<para>
Julian
	</para>
	<itemizedlist>
		<listitem>
		<para>
for non-NAT cluster (maybe stupid but interesting)
		</para>
		<para>
	Maybe an array of policers, for example 1024 policers or
a user-defined value, a power of 2. Each client hits one of the policers
based on its IP/port. This is mostly a job for QoS ingress, even for the
distributed attack, but maybe something can be done in LVS? Maybe we
had better develop a QoS ingress module? The key could be derived
from CIP and CPORT, maybe something similar to SFQ but without queueing.
It could be implemented as a patch to the normal policer but with
one argument: the real number of policers. Then this extended policer
can look into the TCP/UDP packets to redirect each packet to one of the
real policers.
		</para>
		</listitem>
		<listitem>
			<para>
	for NAT only
			</para>
			<para>
	Run SFQ qdisc on your external interface(s). It seems this is
not a solution for the DR method. Of course, one can run SFQ on the uplink
router.
			</para>
		</listitem>
		<listitem>
			<para>
Linux 2.4 only
			</para>
			<para>
	iptables has support to limit the traffic but I'm not sure
whether it is useful for your requirements. I assume you want to set
limit to each one of these 1024 aggregated flows.
			</para>
		</listitem>
	</itemizedlist>
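	<para>
Julian's array of policers can be sketched in user space as follows.
This Python is an illustration of the idea only, not a QoS module:
the token bucket parameters, the class names and the use of CRC32 as the hash
of CIP and CPORT are assumptions.
	</para>
<programlisting><![CDATA[
import zlib


class TokenBucket:
    """One policer: allows `rate` packets per second with bursts up to `burst`."""

    def __init__(self, rate, burst):
        self.rate = rate
        self.burst = burst
        self.tokens = burst
        self.stamp = 0.0

    def allow(self, now):
        # Refill for the time elapsed since the last packet, capped at burst.
        self.tokens = min(self.burst,
                          self.tokens + (now - self.stamp) * self.rate)
        self.stamp = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False


class PolicerArray:
    """A power-of-2 array of policers indexed by a hash of CIP and CPORT."""

    def __init__(self, size=1024, rate=10, burst=20):
        if size < 1 or size & (size - 1):
            raise ValueError("size must be a power of 2")
        self.mask = size - 1
        self.buckets = [TokenBucket(rate, burst) for _ in range(size)]

    def index(self, cip, cport):
        # Any stable hash of the client address and port would do.
        return zlib.crc32(f"{cip}:{cport}".encode()) & self.mask

    def allow(self, cip, cport, now):
        return self.buckets[self.index(cip, cport)].allow(now)


if __name__ == "__main__":
    policers = PolicerArray(size=1024, rate=1, burst=2)
    for t in (0.0, 0.0, 0.0, 1.0):
        print(t, policers.allow("10.1.1.5", 40001, now=t))
]]></programlisting>
	<para>
Clients that hash to the same policer share its budget, which is the
price paid for keeping the array at a fixed size.
	</para>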
	<para>
Wenzhuo Zhang
	</para>
	<para>
Is anybody actually using the ingress policer for anti-DoS? 
I tried it several days ago using the script in the iproute2
package: iproute2/examples/SYN-DoS.rate.limit.
I've tested it against different 2.2 kernels (2.2.19-7.0.8 (the Red Hat
kernel), 2.2.19 and 2.2.20preX, with all QoS related functions either
compiled into the kernel or as modules) and different versions of
iproute2. In all cases, tc fails to install the ingress qdisc policer:
	</para>
<programlisting><![CDATA[
root@panda:~# tc qdisc add dev eth0 handle ffff: ingress
RTNETLINK answers: No such file or directory
root@panda:~# /tmp/tc qdisc add dev eth0 handle ffff: ingress
RTNETLINK answers: No such file or directory
]]></programlisting>
	<para>
Julian
	</para>
	<para>
	For 2.2, you need the ds-8 package, at
<ulink url="http://diffserv.sourceforge.net/">
Package for Differentiated Services on Linux</ulink>.
Compile tc by setting TC_CONFIG_DIFFSERV=y in Config.
The right command is:
	</para>
<programlisting><![CDATA[
	tc qdisc add dev eth0 ingress
]]></programlisting>
	<para>
Ratz
	</para>
	<para>
		The 2.2.x version is not supported anymore. The
advanced routing documentation says to only use 2.4.
	For 2.4, ingress is in the kernel but it is still unusable for
more than one device (look in linux-netdev for reference).
	</para>
	<para>
James Bourne <emphasis>james (at) sublime (dot) com (dot) au</emphasis>
25 Jul 2003
	</para>
	<blockquote>
		<para>
I was after some samples or practical suggestion in regard to Rate Limiting
and Dynamically Denying Services to abusers on a per VIP basis.
I have had a look at the sections in the HOWTO on
"Limiting number of clients connecting to LVS"
and
		</para>
		<para>
<ulink url="http://www.linuxvirtualserver.org/docs/defense.html">
http://www.linuxvirtualserver.org/docs/defense.html</ulink>.
		</para>
	</blockquote>
	<para>
Ratz
	</para>
	<para>
This is a defense mechanism which is always unfair. You don't want that
from what I can read.
	</para>
	<blockquote>
		<para>
Specifically, we are running web based competition entries (<emphasis>e.g.</emphasis>
type in your three bar codes) out of our LVS cluster and want to limit those who might
construct "bots" to auto-enter. The application is structured so that you have
to click through multiple pages and enter a value that is represented in a
dynamically generated PNG.
		</para>
		<para>
I would like to:
		</para>
		<orderedlist>
			<listitem>
rate limit on each VIP (we can potentially do this at the firewall)
			</listitem>
			<listitem>
ban a source ip if it goes beyond a certain number "requests-per-time-interval"
			</listitem>
			<listitem>
dynamically take a vip offline if it goes beyond a certain number of
"requests-per-time-interval"
			</listitem>
			<listitem>
toss all "illegal requests" - eg. codered, nimda etc.
			</listitem>
		</orderedlist>
		<para>
Perhaps a combination of iptables, QoS, SNORT etc. would do the job??
		</para>
	</blockquote>
	<para>
Roberto Nibali 25 Jul 2003
	</para>
	<blockquote>
1. rate limit on each VIP (we can potentially do this at the firewall)
	</blockquote>
	<para>
Hmm, you might need to use QoS or probably better would be to write a
scheduler which uses the rate estimator in IPVS.
	</para>
	<blockquote>
2. ban a source ip if it goes beyond a certain number "requests-per-time-interval"
	</blockquote>
	<para>
A scheduler could do that for you, although I do not think this is a
good idea.
	</para>
	<blockquote>
3. dynamically take a vip offline if it goes beyond a certain number of
"requests-per-time-interval"
	</blockquote>
	<para>
Quiescing the service should be enough; you don't want to put a
penalty on other people, you simply want to keep your maximum
request-per-time rate.
	</para>
	<blockquote>
4. toss all "illegal requests" - eg. codered, nimda etc.
	</blockquote>
	<para>
This has nothing to do with LVS ;).
	</para>
	<para>
QoS is certainly suitable for 1). For 2) and 3) I think you would need
to write a scheduler.
	</para>

	<para>
Max Sturtz
	</para>
	<blockquote>
I know that iptables can block connections if they exceed a specified
number of connections per second (from anywhere).
The question is, is
anybody doing this on a per-client basis, so that if any particular IP is
sending us more than a specified number of connections per second, they
get blocked but all other clients can keep going?
	</blockquote>
	<para>
ratz 01 Dec 2003
	</para>
	<para>
Using <filename>iptables</filename> is a bad way to handle such problems.
You have no way of knowing whether the IP making requests
at a high rate is malicious or friendly. If it's malicious and the source
address is spoofed, you will end up blocking a friendly IP.
	</para>
	<blockquote>
several times per week we experience
a traffic storm.  LVS handles it just fine,
but the web-servers get loaded up really bad, and pretty soon our site is
all but un-usable.
We're looking for tools we could use to analyze this
(we use Webalizer for our web-logs-- but it can't tell us who's talking to
us in any given time-frame...)
	</blockquote>
	<para>
Could you describe your overloaded system with some metrics or could you
determine the upper connection threshold where your RS are still working
fine?
You could dump the LVS masquerading table from time to time and grep for
connection templates.
	</para>
	<para>
I see 4 approaches (in no particular order) to this problem:
	</para>
	<itemizedlist>
		<listitem>
LVS tcp defense mechanism, best described in
http://www.linux-vs.org/docs/defense.html
		</listitem>
		<listitem>
L7 load balancer which inspects HTTP content, best described in
http://www.linux-vs.org/software/ktcpvs/ktcpvs.html
    or in the package Readme.
		</listitem>
		<listitem>
Use of the per RS threshold limitation patch I wrote (see <xref linkend="limiting_clients"/>).
		</listitem>
		<listitem>
Use the 
<ulink url="http://www.redfishsoftware.com.au/projects/feedbackd/">feedbackd</ulink>
(http://www.redfishsoftware.com.au/projects/feedbackd/)
architecture to signal the director of network
anomalies based on certain metrics gained on the RSs.
		</listitem>
	</itemizedlist>
	</section>
	<section id="who_is_connecting">
	<title>Who is connecting to my LVS?</title>
	<para>
On the realservers you can look with <command>netstat -an</command>. With LVS, the
director also has information.
			</para><para>
<blockquote><para>
<emphasis>malalon (at) poczta (dot) onet (dot) pl</emphasis> 18 Oct 2001
			</para><para>
How do I know who is connecting to my LVS?
</para></blockquote>
			</para><para>
Julian
			</para><para>
<itemizedlist>
<listitem><para>Linux 2.2: netstat -Mn (or /proc/net/ip_masquerade)
</para></listitem><listitem><para>Linux 2.4: ipvsadm -Lcn (or /proc/net/ip_vs_conn)
</para></listitem></itemizedlist>
	</para>
	</section>
	<section id="experimental_schedulers">
	<title>experimental scheduling code</title><para>
This section is a bit out of date now. See
<xref linkend="LVS-HOWTO.ipvsadm"/> for the new schedulers by Thomas Prouell for web caches
and by Henrik Norstrom for firewalls. Ratz <emphasis>ratz (at) tac (dot) ch</emphasis> has produced
a scheduler which will keep activity on a particular realserver
below a fixed level.
			</para><para>
For this next code, write to Ty or grab the code off the list server.
			</para><para>
<blockquote><para>
Ty Beede <emphasis>tybeede (at) metrolist (dot) net</emphasis> 23 Feb 2000
			</para><para>
     This is a hack to the ip_vs_wlc.c scheduling algorithm. It is
 currently implemented in a quick, ad hoc fashion. Its purpose is to
 support limiting the total number of connections to a realserver.
 Currently it is implemented using the weight value as the upper limit
 on the number of activeconns (connections in an established TCP state).
 This is a very simple implementation and only took a few minutes after
 reading through the source. I would like, however, to develop it further.
			</para><para>
     Due to its simple nature it will not function in several types of
 environments: those based on connectionless protocols (UDP uses
 the inactconns variable to keep track of things; simply change the
 activeconns variable in the weight check to inactconns for UDP), and it may
 impose complications when persistence is implemented. The current
 algorithm simply checks that weight > activeconns before including
 a server in the standard wlc scheduling. This works for my environment,
 but could be changed to perhaps (weight * 50) > (activeconns * 50) + inactconns to
 include the inactconns but make the activeconns more important in the decision.
			</para><para>
     Currently the greatest weight value a user may specify is approximately
 65000, independent of this modification. As long as the user keeps
 the weight values correct, most importantly for the total number of connections
 and in proportion to one another, things should function as expected.
			</para><para>
     In the event that the cluster is full, i.e. all realservers have maxed out,
 then overflow control might be necessary, or the client's end will
 hang. I haven't tested this idea but it could simply be implemented by
 specifying the overflow server last, after the realservers, using
 the <command>ipvsadm</command> tool. This will work because as each realserver is added
 using <command>ipvsadm</command> it is put on a list, with the last one added being last
 on the list. The scheduling algorithm traverses this list linearly from
 start to finish and if it finds that all servers are maxed out, then the
 last one will be the overflow and that will be the only one to send traffic to.
			</para><para>
     Anyway this is just a little hack; read the code and it should make sense.
 It has been included as an attachment. If you would like to test this,
 simply replace the old ip_vs_wlc.c scheduling file in /usr/src/linux/net/ipv4
 with this one. Compile it in and set the weight on the realservers to the max
 number of connections in an established TCP state, or modify the source to
 your liking.
			</para><para>
 From: Ty Beede <emphasis>tybeede (at) metrolist (dot) net</emphasis> 28 Feb 2000
			</para><para>
 I wrote a little patch and posted it a few days ago... I indicated that
 overflow might be accomplished by adding the overflow server to the lvs last.
 This statement is completely off the wall wrong. I'm not really sure why I
 thought that would work but it won't. First of all the linked list adds
 each new instance of a realserver to the start of the realservers list,
 not the end like I thought. Also it would be impossible to distinguish
 the overflow server from the realservers in the case that not all the
 realservers were busy. I don't know where I got that idea from but I'm
 going to blame it on my "bushy eyed youth". In response to needing
 overflow support I'm thinking about implementing "priority groups" in
 the lvs code. This would logically group the realservers into different
 groups, where those in a higher priority group would fill up before those
 in a lower priority group. If anybody could comment on this it would be
 nice to hear what the rest of you think about overflow code.
</para></blockquote>
			</para>
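	<para>
The check Ty describes can be sketched in user space.
This is not Ty's patch (which isn't reproduced here) and it is not kernel code:
the <filename>struct rs</filename> type, the function names and the overhead
estimate (activeconns * 50 + inactconns, borrowed from the weighting mentioned
above) are assumptions made for illustration.
A server is skipped unless weight > activeconns; among the remaining servers
the one with the smallest overhead/weight ratio is picked.
	</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

/* Hypothetical user-space stand-in for a realserver entry. */
struct rs {
    const char *name;
    int weight;        /* doubles as the connection cap in this hack */
    int activeconns;   /* connections in an established TCP state */
    int inactconns;    /* all other tracked connections */
};

/* Overhead estimate; the factor 50 is borrowed from the quoted text. */
static long rs_overhead(const struct rs *s)
{
    return (long)s->activeconns * 50 + s->inactconns;
}

/*
 * Weighted least-connection with a cap: a server is only eligible while
 * weight > activeconns. Returns NULL when every server is full.
 */
struct rs *capped_wlc_pick(struct rs *servers, size_t n)
{
    struct rs *least = NULL;
    size_t i;

    for (i = 0; i < n; i++) {
        struct rs *s = &servers[i];

        if (s->weight <= 0 || s->weight <= s->activeconns)
            continue;   /* quiesced or at its connection limit */

        /* compare overhead/weight without division:
         * o_s / w_s < o_l / w_l  <=>  o_s * w_l < o_l * w_s */
        if (least == NULL ||
            rs_overhead(s) * least->weight < rs_overhead(least) * s->weight)
            least = s;
    }
    return least;
}
```
]]></programlisting>
	<para>
When every server is at its cap the function returns NULL.
That is the "cluster full" case for which Ty wanted overflow handling.
	</para>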
		<section id="developing_schedulers">
		<title>Theoretical issues in developing better scheduling algorithms</title>
		<para>
Julian
			</para><para>
It seems to me it would be useful in some cases to use the total number
of connections to a realserver in the load balancing calculation, in
the case where the realserver participates in servicing a number of
different VIPs.
			</para><para>
<blockquote><para>
Wensong
			</para><para>
Yeah, it is true. Sometimes, we need a tradeoff between
simplicity/performance and functionality. Let me think more about
this, and probably about maximum connection scheduling
too. For a rather big server cluster, there may be a dedicated load
balancer for web traffic and another load balancer for mail traffic;
then the two load balancers may need to exchange status periodically, which
is rather complicated.
</para></blockquote>
 Yes, if a realserver is used from two or more directors
 the "lc" method is useless.
<blockquote><para>
 Actually, I just thought that dynamic weight adaptation according to
 periodic load feedback from each server might solve all the above
 problems.
</para></blockquote>
			</para><para>
Joe - this is part of a greater problem with LVS, we don't have
 good monitoring tools and we don't have a lot of information on
 the varying loads that realservers have, in order to develop
 strategies for informed load regulation.
See <link linkend="agent">load and failure monitoring</link>.
<blockquote><para>
Julian
			</para><para>
From my experience with realservers for web, the only
 useful parameters for the realserver load are:
			</para><para>
<itemizedlist>
<listitem><para>cpu idle time
			</para><para>
                 If you use realservers with equal CPUs (MHz)
                 the cpu idle time in percent can be used.
                 In other cases the MHz must be included in
                 an expression for the weight.
			</para><para>
</para></listitem><listitem><para>free ram
			</para><para>
                 According to the web load the right expression
                 must be used including the cpu idle time
                 and the free ram.
			</para><para>
</para></listitem><listitem><para>free swap
			</para><para>
                 Very bad if the web is swapping.
			</para><para>
</para></listitem></itemizedlist>
			</para><para>
 The easiest parameter to get, the Load Average, is
 always &lt;5. So, it can't be used for weights in this case.
 Maybe for SMTP? The sendmail guys use only the load average
 in sendmail when evaluating the load :)
			</para><para>
			</para><para>
 So, the monitoring software must send these parameters
 to all directors. Even now each of the directors uses
 these weights to create connections proportionally. So,
 it is useful for these load parameters to be updated
 at short intervals, and they must be averaged over the
 interval. It is very bad to use the current (instantaneous) value
 of a parameter to evaluate the weight in the director. For example, it
 is very useful to use something like "average value of
 the cpu idle time for the last 10 seconds" and to broadcast
 this value to the director every 10 seconds. If the
 cpu idle time is 0, the free ram must be used. It depends
 on which resource hits zero first: the cpu idle time or the
 free ram. The weight must be changed slightly :)
			</para><para>
 The "*lc" algorithms help for simple setups, eg.
 with one director and for some of the services, eg http,
 https. It is difficult even for ftp and smtp to use these
 schedulers. When the requests are very different, the
 only valid information is the load in the realserver.
			</para><para>
 Another useful parameter is the network traffic (ftp).
 But again, all these parameters must be used from the director
 to build the weight using a complex expression.
			</para><para>
 I think the complex weight for the realserver
 based on connection number (lc) is not useful due to the
 different load from each of the services. Maybe for
 the "wlc" scheduling method? I know that the users
 want LVS to do everything but load balancing is
 a very complex job. If you handle web traffic you can be happy
 with any of the current scheduling methods. I haven't tried
 to balance ftp traffic but I don't expect much help from the *lc
 methods. The realserver can be loaded, for example, if you
 build a new Linux kernel while the server is in the cluster :)
 A very easy way to switch to swap mode if your load is near 100%.
</para></blockquote>
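			</para><para>
Julian's recipe (average a metric over the reporting interval, let the scarcer
resource decide, and change the weight only slightly) might look something like
the following. This is a sketch under stated assumptions, not code from any LVS
monitoring tool: the function names, the use of percentages for both inputs and
the <filename>max_step</filename> limit are all made up for illustration.
<programlisting><![CDATA[
```c
#include <stddef.h>

/* Mean of the samples collected during one reporting interval. */
double interval_mean(const double *samples, size_t n)
{
    double sum = 0.0;
    size_t i;

    if (n == 0)
        return 0.0;
    for (i = 0; i < n; i++)
        sum += samples[i];
    return sum / (double)n;
}

/*
 * Hypothetical weight expression: the scarcer of the two resources
 * (both given as percentages, 0..100) decides the target weight, and
 * the weight moves by at most max_step per update.
 */
int next_weight(int current, double cpu_idle_pct, double free_ram_pct,
                int max_step)
{
    double scarce = cpu_idle_pct < free_ram_pct ? cpu_idle_pct : free_ram_pct;
    int target = (int)(scarce + 0.5);   /* round to nearest */

    if (target < 1)
        target = 1;                     /* never quiesce by accident */
    if (target > current + max_step)
        return current + max_step;
    if (target < current - max_step)
        return current - max_step;
    return target;
}
```
]]></programlisting>
A monitoring agent would call interval_mean() on, say, ten one-second samples
and hand the result to next_weight() before reporting to the director.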
		</para></section>
	</section>
	<section id="writing_a_scheduler" xreflabel="writing a scheduler">
	<title>Ratz's primer on writing your own scheduler</title>
	<para>
Roberto Nibali <emphasis>ratz (at) tac (dot) ch</emphasis> 10 Jul 2003
	</para>

	<para>
The whole setup roughly works as follows:
	</para>
<programlisting><![CDATA[
struct ip_vs_scheduler {
	struct list_head        n_list;   /* d-linked list head */
	char			*name;    /* scheduler name */
	atomic_t                refcnt;   /* reference counter */
 	struct module		*module;  /* THIS_MODULE/NULL */

	/* scheduler initializing service */
	int (*init_service)(struct ip_vs_service *svc);
	/* scheduling service finish */
	int (*done_service)(struct ip_vs_service *svc);
	/* scheduler updating service */
	int (*update_service)(struct ip_vs_service *svc);

	/* selecting a server from the given service */
	struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc,
				       struct iphdr *iph);
};
]]></programlisting>

	<para>
Each scheduler {rr,lc,...} has to register itself by initialising an
ip_vs_scheduler struct object. As you can see, besides other data members it
contains 4 function pointers:
	</para>
<programlisting><![CDATA[
int (*init_service)(struct ip_vs_service *svc)
int (*done_service)(struct ip_vs_service *svc)
int (*update_service)(struct ip_vs_service *svc)
struct ip_vs_dest* (*schedule)(struct ip_vs_service *svc,struct iphdr *iph)
]]></programlisting>
	<para>
Each scheduler needs to provide a callback function for each of those prototypes
with its own specific implementation.
	</para>
	<para>
Let's have a look at ip_vs_wrr.c:
	</para>
	<para>
We start with the __init function which is kernel specific. It defines
ip_vs_wrr_init() which in turn calls the required
register_ip_vs_scheduler(&amp;ip_vs_wrr_scheduler). You can see the
ip_vs_wrr_scheduler structure definition just above the __init function. There
you will note the following:
	</para>
<programlisting><![CDATA[
static struct ip_vs_scheduler ip_vs_wrr_scheduler = {
         {0},                    /* n_list */
         "wrr",                  /* name */
         ATOMIC_INIT(0),         /* refcnt */
         THIS_MODULE,            /* this module */
         ip_vs_wrr_init_svc,     /* service initializer */
         ip_vs_wrr_done_svc,     /* service done */
         ip_vs_wrr_update_svc,   /* service updater */
         ip_vs_wrr_schedule,     /* select a server from the destination list */
};
]]></programlisting>
	<para>
This is the scheduler-specific instantiation of the struct
ip_vs_scheduler type defined in ip_vs.h. Reading this you can see that the
last four entries name the functions to be called for each callback.
	</para>
	<para>
So in case of the wrr scheduler, what does the init_service (mapped to the
ip_vs_wrr_init_svc function) do?
	</para>
	<para>
It generates a mark object (used for list chain traversal and mark point) which
gets filled up with initial values, such as the maximum weight and the gcd
weight. This is a very intelligent thing to do, because if you do not do this,
you will need to compute those values every time the scheduler needs to schedule
a new incoming request.
	</para>
	<para>
Caching those values also requires a second callback. Why? Imagine someone decides to
update the weights of one or more servers from user space. This would mean that
the initially computed values are not valid anymore.
	</para>
	<para>
What can be done about it? We could compute those values every time the
scheduler needs to schedule a destination but that's exactly what we don't want.
This is where the update_service prototype (mapped to the ip_vs_wrr_update_svc
function) comes into play.
	</para>
	<para>
As you can easily see the ip_vs_wrr_update_svc function will do part of what we
did for the init_service: it will compute the new max weight and the new gcd
weight, so the world is saved again. The update_service callback will be called
upon a user space ioctl call (you can read about this in the previous chapter of
this marvellous developer guide :)).
	</para>
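	<para>
The two cached values are cheap to recompute when weights change, but not
something you want to do for every packet. Here is a minimal user-space sketch
of that computation. The <filename>struct wrr_cache</filename> type and the
function names are hypothetical and are not the ones used in ip_vs_wrr.c;
treating a weight of 0 as "skip this server" is also an assumption.
	</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

/* Euclid's algorithm; gcd(a, 0) == a. */
static int gcd(int a, int b)
{
    while (b != 0) {
        int t = a % b;
        a = b;
        b = t;
    }
    return a;
}

/* Values a wrr-style scheduler would cache per service (hypothetical). */
struct wrr_cache {
    int max_weight;   /* largest weight among the realservers */
    int gcd_weight;   /* gcd of all non-zero weights, 1 if none */
};

/* Recompute the cache; call once at init and again on every weight change. */
struct wrr_cache wrr_recompute(const int *weights, size_t n)
{
    struct wrr_cache c = { 0, 0 };
    size_t i;

    for (i = 0; i < n; i++) {
        if (weights[i] <= 0)
            continue;             /* weight 0 means "do not schedule" */
        if (weights[i] > c.max_weight)
            c.max_weight = weights[i];
        c.gcd_weight = gcd(c.gcd_weight, weights[i]);
    }
    if (c.gcd_weight == 0)
        c.gcd_weight = 1;
    return c;
}
```
]]></programlisting>
	<para>
In this sketch both the init and the update callback would simply call
wrr_recompute() and store the result in the service's private data.
	</para>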
	<para>
The ip_vs_wrr_schedule function provides us with the core functionality of
finding an appropriate destination (realserver) when a new incoming connection
is hitting our cluster. Here you could write your own algorithm. You only need
to either return NULL (if no realserver can be found) or a destination which is
of type: struct ip_vs_dest.
	</para>
	<para>
The last function callback is the ip_vs_wrr_done_svc function which kfree()'s
the initially kmalloc()'d mark variable.
	</para>
	<para>
This short tour-de-scheduler should give you enough information to write your own
scheduler, at least in theory :).
	</para>
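	<para>
To see how the pieces fit together without building a kernel module, the
registration and callback pattern can be mimicked in user space. Everything
below is a mock: the struct and function names are invented, the
<filename>iphdr</filename> argument is dropped, and the round-robin scheduler
keeps a single global cursor, which a real scheduler would not do.
	</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <string.h>

struct dest { const char *name; int weight; };

struct service {
    struct dest *dests;
    size_t ndests;
    void *sched_data;          /* private state owned by the scheduler */
};

/* User-space analogue of struct ip_vs_scheduler (names are made up). */
struct scheduler {
    const char *name;
    int (*init_service)(struct service *svc);
    int (*done_service)(struct service *svc);
    int (*update_service)(struct service *svc);
    struct dest *(*schedule)(struct service *svc);
};

/* --- a trivial round-robin scheduler built on those callbacks --- */
static size_t rr_next;         /* one cursor only: enough for a demo */

static int rr_init(struct service *svc)
{
    rr_next = 0;
    svc->sched_data = &rr_next;
    return 0;
}

static int rr_done(struct service *svc)
{
    svc->sched_data = NULL;
    return 0;
}

static int rr_update(struct service *svc)
{
    *(size_t *)svc->sched_data = 0;   /* restart after a config change */
    return 0;
}

static struct dest *rr_schedule(struct service *svc)
{
    size_t *cursor = svc->sched_data;
    size_t tried;

    for (tried = 0; tried < svc->ndests; tried++) {
        struct dest *d = &svc->dests[*cursor];

        *cursor = (*cursor + 1) % svc->ndests;
        if (d->weight > 0)
            return d;          /* skip quiesced realservers */
    }
    return NULL;               /* nothing available */
}

struct scheduler rr_scheduler = {
    "rr", rr_init, rr_done, rr_update, rr_schedule
};

/* --- minimal registry, standing in for register_ip_vs_scheduler() --- */
#define MAX_SCHEDULERS 8
static struct scheduler *registry[MAX_SCHEDULERS];
static size_t nregistered;

int register_scheduler(struct scheduler *s)
{
    if (nregistered == MAX_SCHEDULERS)
        return -1;
    registry[nregistered++] = s;
    return 0;
}

struct scheduler *find_scheduler(const char *name)
{
    size_t i;

    for (i = 0; i < nregistered; i++)
        if (strcmp(registry[i]->name, name) == 0)
            return registry[i];
    return NULL;
}
```
]]></programlisting>
	<para>
A caller registers the scheduler once, looks it up by name when a service is
configured, calls init_service, and then calls schedule for each new connection.
	</para>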

	<para>
unknown
	</para>
	<blockquote>
I'd like to write a user defined scheduler
to guide the load dispatching
	</blockquote>
	<para>
Ratz 12 Aug 2004
	</para>
	<para>
Check out
<link linkend="feedbackd">feedbackd</link>
and see if you miss something there.
I know that
this is not what you wanted to hear but providing a generic API for
user space daemons to interact directly with a generic scheduler is
definitely out of scope. One problem is that the process of balancing
incoming network load is not an atomic operation. It can take minutes,
hours, days, weeks until you get an equalised load on your servers.
Having a user space daemon doing premature scheduler updates at short
intervals only asks for trouble with peak load bouncing.
	</para>
	</section>
	<section id="sysctl" xreflabel="sysctl">
	<title>changing ip_vs behaviour with sysctl flags in /proc</title>
	<para>
You can change the behaviour of ip_vs by pushing bits in the /proc filesystem.
This gives finer control of ip_vs than is available with <command>ipvsadm</command>.
For ordinary use, you don't need to worry about the <filename>sysctl</filename>,
since sensible defaults have been installed.
	</para>
	<para>
Here's a
<ulink url="http://www.linuxvirtualserver.org/docs/sysctl.html">
list of the current sysctls at
http://www.linuxvirtualserver.org/docs/sysctl.html
</ulink>.
Note that older kernels will not have all of these sysctls
(test for the existence of the appropriate file in /proc first).
These sysctls are mainly used for
<xref linkend="bringing_down_persistent_services"/>.
	</para>
	<para>
Some, but not all, of the sysctls are documented in ipvsadm(8).
	</para>
	<para>
(Thanks to Kit Gerrits Dec 2008). There's info on <filename>ip_vs()</filename> sysctls at
<ulink url="http://www.mjmwired.net/kernel/Documentation/networking/ipvs-sysctl.txt">ipvs-sysctl.txt</ulink>
(http://www.mjmwired.net/kernel/Documentation/networking/ipvs-sysctl.txt).
	</para>
	<para>
Horms <emphasis>horms (at ) verge (dot) net (dot) au</emphasis> 11 Dec 2003
	</para>
	<para>
I am still strongly of the opinion that the sysctl variables should
be documented in the ipvsadm man page 
as they are strongly tied to its behaviour.
At the moment we are in a situation where
some are documented in ipvsadm(8),
while all are documented in <filename>sysctl.html</filename>.
Yet there is no reference to sysctl.html in ipvsadm(8).
My preference is to merge all the information in sysctl.html into
ipvsadm(8) or perhaps a separate man page. If this is not acceptable
then I would advocate removing all of the sysctl information from
ipvsadm(8) and replacing it with a reference to sysctl.html.
Though to be honest, why half the information on configuring LVS
should be in ipvsadm(8) and the other half in sysctl.html
is beyond me.
	</para>
	</section>
	<section id="ipvsadm_counters">
	<title>Counters in ipvsadm</title>
	<para>
Rutger van Oosten <emphasis>r (dot) v (dot) oosten (at) benq-eu (dot) com</emphasis> 09 Oct 2003
	</para>
	<blockquote>
		<para>
When I run
		</para>
<programlisting><![CDATA[
ipvsadm -l --stats
]]></programlisting>
		<para>
it shows connections, packets and bytes in and
out for the virtual services and for the realservers. One would expect that
the traffic for the service is the sum of the traffic to the servers - but
it is not, the numbers don't add up at all, whereas in
		</para>
<programlisting><![CDATA[
ipvsadm -l --rate
]]></programlisting>
		<para>
they do (approximately, not exactly for the bytes per second ones).
For example (LVS-NAT, two realservers, one http virtual service):
		</para>
<programlisting><![CDATA[
# ipvsadm --version
ipvsadm v1.21 2002/11/12 (compiled with popt and IPVS v1.0.10)

# ipvsadm -l --stats
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port               Conns   InPkts  OutPkts  InBytes OutBytes
  -> RemoteAddress:Port
TCP  vip:http                      4239091 31977546 29470876    3692M 26647M
  -> www02:http                    3911835 29405279 26900679    3407M 24292M
  -> www01:http                    3395953 25407180 23257431    2931M 20957M

# ipvsadm -l --rate
IP Virtual Server version 1.0.10 (size=4096)
Prot LocalAddress:Port                 CPS    InPPS   OutPPS    InBPS OutBPS
  -> RemoteAddress:Port
TCP  vip:http                           45      348      314    41739 285599
  -> www02:http                         35      252      216    30416 197101
  -> www01:http                         10       96       98    11323 88497
]]></programlisting>
		<para>
Is this a bug, or am I just missing something?
		</para>
	</blockquote>
	<para>
Wensong 12 Oct 2003
	</para>
	<para>
It's quite possible that the conns/bytes/packets statistics of a virtual
service are not the sum of the conns/bytes/packets counters of its realservers,
because some realservers may have been removed permanently. The
connection rate of a virtual service is the sum of the connection rates of its
realservers, because it is an instantaneous metric.
	</para>
	<para>
In the output of your <command>ipvsadm -l --stats</command>, the counters of the virtual
service are less than the sum of the counters of its realservers. I guess
that your virtual service must have been removed after it had run for a while,
and then created again later. In the current implementation, if realservers
are deleted, they are not removed permanently, but are put in the
trash, because established connections still refer to them; a server can
be looked up in the trash when it is added back to a service. When a
virtual service is created, it always has its counters set to zero, but the
realservers can be picked up from the trash, and they keep their past counters.
We probably need to zero the counters of realservers if the service is a new
one. Anyway, you can do <command>cat /proc/net/ip_vs_stats</command>.
The counters of all
IPVS services are larger than or equal to the sum of the realservers' counters.
	</para>
	<blockquote>
You are right - after the weekly reboot last night the numbers do add up.
The realservers have been removed and added in the mean time, but the
virtual services have stayed in place and the numbers are still correct. So
that must be it.
Mystery solved, thank you :-)
	</blockquote>
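	<para>
Wensong's explanation can be reproduced with a toy model. This is not IPVS
code: the structs and function names are invented, and only the connection
counter is tracked. It shows how a re-created service starts from zero while
realservers revived from the trash keep their old totals, so the realserver sum
ends up larger than the service counter.
	</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

struct rs_stats  { unsigned long conns; int in_trash; };
struct svc_stats { unsigned long conns; };

/* A new connection is counted on both the service and the realserver. */
void count_conn(struct svc_stats *svc, struct rs_stats *rs)
{
    svc->conns++;
    rs->conns++;
}

/* Deleting a service parks its realservers in the trash, counters intact. */
void delete_service(struct svc_stats *svc, struct rs_stats *rs, size_t n)
{
    size_t i;

    (void)svc;                 /* the service object itself is discarded */
    for (i = 0; i < n; i++)
        rs[i].in_trash = 1;
}

/* Re-creating it starts the service at zero but revives the realservers. */
struct svc_stats recreate_service(struct rs_stats *rs, size_t n)
{
    struct svc_stats fresh = { 0 };
    size_t i;

    for (i = 0; i < n; i++)
        rs[i].in_trash = 0;    /* old counters come back with them */
    return fresh;
}

/* Sum of the realserver connection counters. */
unsigned long sum_rs(const struct rs_stats *rs, size_t n)
{
    unsigned long sum = 0;
    size_t i;

    for (i = 0; i < n; i++)
        sum += rs[i].conns;
    return sum;
}
```
]]></programlisting>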
	</section>
	<section id="exact_counters">
	<title>Exact Counters</title>
	<para>
Guy Waugh <emphasis>gwaugh (at) scu (dot) edu (dot) au</emphasis> 
20 May 2005
	</para>
	<para>
The 
<ulink url="http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvsadm_exact.patch">
ipvsadm_exact.patch 
</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvsadm_exact.patch)
contains a diff for the addition of an <filename>'-x'</filename> or 
<filename>'--exact'</filename> command-line switch to 
<command>ipvsadm</command> (version 1.19.2.2).
The idea behind the new option is to allow users to specify that large 
numbers printed with the <filename>'--stats'</filename> or <filename>'--rate'</filename> 
options are in machine readable bytes, 
rather than in 'human-readable' form (<emphasis>e.g.</emphasis> kilobytes, megabytes).
I needed this to get stats from LVS readable by 
<ulink url="http://www.nagios.org/">nagios</ulink>
(http://www.nagios.org/).
	</para>
	</section>
	<section id="TCP_UDP_scheduling" xreflabel="TCP UDP scheduling">
	<title>Scheduling TCP/UDP/SCTP/TCP splicing</title>
	<note>
		<para>
LVS does not schedule SCTP 
(although people ask about it occasionally).
		</para>
		<para>
SCTP is a connection oriented protocol (like TCP, but not like UDP).
Delivery is reliable (like TCP), but packets are not necessarily delivered in order.
The Linux-2.6 kernel supports SCTP 
(see <ulink url="http://lksctp.sourceforge.net/">The Linux Kernel sctp project</ulink>
http://lksctp.sourceforge.net/).
For an article on SCTP see
<ulink url="http://www-128.ibm.com/developernetworks/linux/library/l-sctp/?ca=dgr-lnwx01SCTP">Better networking with SCTP</ulink>
(http://www-128.ibm.com/developernetworks/linux/library/l-sctp/?ca=dgr-lnwx01SCTP).
One of the features of SCTP is multistreaming: 
control and data streams are separate streams within an association. 
With tcp to do the same thing, 
you need separate ports (<emphasis>e.g.</emphasis> ftp uses 20 for data, 21 for command) 
or you put both into one connection (<emphasis>e.g.</emphasis> http).
If you use one port then a command will be blocked till queued data is serviced.
If multiple (redundant) routes are available, failover is transparent to the application.
(Thus the requirement that packets not necessarily be delivered in order.)
<xref linkend="SIP"/> can use SCTP (I only know about SIP using UDP).
		</para>
	</note>
        <note>
		<para>
Ratz 20 Feb 2006
		</para>
		<para>
There is a remotely similar approach in the 
<ulink url="http://www.linuxvirtualserver.org/software/tcpsp/">TCP splicing code for LVS</ulink>
(http://www.linuxvirtualserver.org/software/tcpsp/).
It's only a small subset of SCTP.
		</para>
	</note>
	<para>
With TCP, scheduling needs to know the number of current connections to each realserver,
before assigning a realserver for the next connection. 
The length of time for a connection can be short 
(retrieving a page by http) or long (an extended telnet session).
	</para>
	<para>
For UDP there is no "connection". LVS uses a timeout (for 2.4.x kernels
it is about 3 min) and any UDP packets from a client within the timeout
period will be scheduled to the same realserver.
On a short time scale (<emphasis>ca.</emphasis> timeout), 
there will be no load balancing of UDP services
(<emphasis>e.g.</emphasis> as was found for <link linkend="ntp">ntp</link>).
All requests will go to the same realserver. On a long time scale
(&gt;&gt;timeout) loadbalancing will occur.
	</para>
	<para>
Here's the official LVS definition of a UDP "connection"
	</para>
	<para>
Wensong Zhang <emphasis>wensong (at) iinchina (dot) net</emphasis> 2000-04-26
	</para>
	<blockquote>
For UDP datagrams, we create entries for state with the timeout of
IP_MASQ_S_UDP (5*60*HZ) by default.
Consequently all UDP datagrams from the same
source to the same destination will be sent to the same realserver.
Therefore, we call data communication
between a client's socket and a server's socket a "connection",
for both TCP and UDP.
	</blockquote>
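	<para>
The effect of such a timeout-based "connection" can be modelled with a small
table keyed on the client address and port. This is a user-space illustration
only: the table layout, the names and the round-robin choice are assumptions,
and the 300 second timeout is just the 5*60 from the quote above (other kernels
use other values). A datagram arriving while an entry is live goes to the same
realserver and restarts the timer; once the entry has expired the client is
scheduled afresh.
	</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

#define UDP_TIMEOUT 300        /* seconds; 5*60 as in the quoted default */
#define TABLE_SIZE  16

struct udp_entry {
    unsigned int   cip;        /* client address */
    unsigned short cport;      /* client port */
    int  rs;                   /* index of the chosen realserver */
    long expires;              /* absolute time in seconds */
    int  used;
};

struct udp_table {
    struct udp_entry e[TABLE_SIZE];
    int next_rs;               /* round-robin cursor */
    int nrs;                   /* number of realservers, must be > 0 */
};

/*
 * Returns the realserver index for a datagram arriving at time `now`,
 * or -1 if the table is full. A live entry is reused and its timer
 * restarted; otherwise a realserver is picked round-robin.
 */
int udp_dispatch(struct udp_table *t, unsigned int cip,
                 unsigned short cport, long now)
{
    struct udp_entry *slot = NULL;
    size_t i;

    for (i = 0; i < TABLE_SIZE; i++) {
        struct udp_entry *e = &t->e[i];

        if (e->used && e->expires <= now)
            e->used = 0;                     /* timed out */
        if (e->used && e->cip == cip && e->cport == cport) {
            e->expires = now + UDP_TIMEOUT;  /* still the same "connection" */
            return e->rs;
        }
        if (!e->used && slot == NULL)
            slot = e;
    }
    if (slot == NULL)
        return -1;

    slot->cip = cip;
    slot->cport = cport;
    slot->rs = t->next_rs;
    slot->expires = now + UDP_TIMEOUT;
    slot->used = 1;
    t->next_rs = (t->next_rs + 1) % t->nrs;
    return slot->rs;
}
```
]]></programlisting>
	<para>
With a steady trickle of packets from one client the entry never expires, which
is why no load balancing is seen on time scales shorter than the timeout.
	</para>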
	<para>
Julian Anastasov 2000-07-28
	</para>
	<blockquote>
For UDP we can in principle remove the implicit
persistence for the UDP connections and thus select different real
server for each packet.
My thought was to implement a new feature:
schedule each UDP packet to new realserver.
<emphasis>i.e.</emphasis> something like timeout=~0 for UDP as service flag.
	</blockquote>
	<para>
LVS has been tested with the following UDP services,
	</para>
	<itemizedlist>
		<listitem>
<link linkend="DNS">DNS</link>
		</listitem>
		<listitem>
<link linkend="ntp">ntp</link>
		</listitem>
		<listitem>
<link linkend="xdmcp">xdmcp</link>
		</listitem>
		<listitem>
<link linkend="radius">radius</link>
		</listitem>
	</itemizedlist>
	<para>
So far only DNS has worked well
(but then DNS already works fine in a cluster setup without LVS).
ntpd is already self loadbalancing and doesn't need to be run under LVS.
xdmcp dies if left idle for long periods (several hours).
UDP services are not commonly used with LVS and we don't yet know
whether the problems are with LVS or with the service running under LVS.
	</para>
	<para>
Han Bing <emphasis>hb (at) quickhot (dot) com</emphasis> 29 Dec 2002
	</para>
	<blockquote>
	<para>
I am developing several game servers using UDP which I would like
to use with LVS.
LVS supports UDP "connection" persistence.
Does persistence work for UDP too?
	</para>
	<para>
	For example, I have 3 games, and every game has 3 servers (9 servers in 3
groups in total). All game1 servers listen on udp port 10000, game2
servers listen on udp port 10001, and game3 servers listen on udp port
10002. When the client sends a udp datagram to game1 (to VIP:10000), can
the lvs director auto-select one server from the 3 game1 servers and
forward it to that server, AND keep the persistence of this "UDP
connection" when it receives the following datagrams from the same CIP?
	</para>
	</blockquote>
	<para>
Joe: not sure what may happen here. People have had problems with LVS/udp
(<emphasis>e.g.</emphasis> with <xref linkend="ntp"/>).
These problems should go away with persistent udp, but
no-one has tried it and it didn't even occur to me.
The behaviour of LVS with persistent udp may be undefined
for all I know. I would make sure that the setup worked with
ntp before trying a game setup.
	</para>
	</section>
	<section id="Padraig">
	<title>patch: machine readable error codes from ipvsadm</title>
	<para>
Computers can talk to each other
and read from and write to other programs.
You shouldn't have to get a person to sit at the console
to parse the output of a program.
Here's a patch that makes <command>ipvsadm</command> return a machine readable error code.
	</para>
	<para>
Padraig Brady <emphasis>padraig (at) antefacto (dot) com</emphasis> 07 Nov 2001
	</para>
	<para>
This 1 line patch is useful for me and I don't think it will break anything.
It's against ipvsadm-0.8.2 and returns a specific error code.
	</para>
<programlisting><![CDATA[
--- //ipvs-0.8.2/ipvs/ipvsadm/ipvsadm.c	Fri Jun 22 16:03:08 2001
+++ ipvsadm.c	Wed Nov  7 16:29:11 2001
@@ -938,6 +938,7 @@
         result = setsockopt(sockfd, IPPROTO_IP, op,
                             (char *)&urule, sizeof(urule));
         if (result) {
+                result = errno; /* return to caller */
                 perror("setsockopt failed");

                 /*

]]></programlisting>
	</section>
	<section id="stateless_ipvsadm">
	<title>patch: stateless ipvsadm - add/edit patch</title>
	<para>
Commands like <command>ifconfig</command> are idempotent,
<emphasis>i.e.</emphasis> they tell the machine to assume a certain state
without reference to the previous state.
You can repeat the same command without errors 
(<emphasis>e.g.</emphasis> put IP=xxx onto eth0).
Not all commands are idempotent - some require you to know the state
of the machine first.
<filename>ipvsadm</filename> is not idempotent:
if a VIP:port entry already exists,
then you will get an error on attempting to enter it.
Whether you make a command idempotent or not, will depend on the 
nature of the command. 
	</para>
	<para>
The problem with <filename>ipvsadm</filename> is that it isn't scriptable
and hence can't be used for automated control of an LVS:
	</para>
	<itemizedlist>
		<listitem>
			<para>
If no entry exists:
			</para>
			<para>
you must <filename>add</filename> the entry with the <filename>-a</filename> option
			</para>
		</listitem>
		<listitem>
			<para>
If the entry exists:
			</para>
			<para>
you must <filename>edit</filename> the entry with the <filename>-e</filename> option.
			</para>
		</listitem>
	</itemizedlist>
	<para>
You will get an error if you use the wrong command. 
Two solutions are:
	</para>
	<itemizedlist>
		<listitem>
parse the output of <command>ipvsadm</command> to see if the entry you are about to
make already is present
		</listitem>
		<listitem>
try both commands and see which one runs,
and then have the script figure out whether both the error and
the non-error were valid.
		</listitem>
	</itemizedlist>
	<para>
This is a pain and is quite unnecessary.
What is needed is a version of ipvs that accepts valid entries without giving an error.
Here's the 
<ulink url="files/ipvs-0.9.0_add_edit.patch">ipvs-0.9.0_add_edit.patch</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.0_add_edit.patch)
patch by Horms against ipvs-0.9.0. It modifies
several ipvs files, including ipvsadm.
	</para>
	</section>
	<section id="fwmark_nametable">
	<title>patch: fwmark name-number translation table</title>
	<para>
<command>ipvsadm</command> allows entry of fwmark only as numbers.
In some cases, it would be more convenient
to enter/display the fwmark as a name;
<emphasis>e.g.</emphasis> an e-commerce site, serving multiple
customers (<emphasis>i.e.</emphasis> VIPs) and which is linking http and https by a fwmark.
The output of <command>ipvsadm</command> then would list the fwmark as "bills_business", "fred_inc"
rather than "14","15"...
	</para>
	<para>
Horms has written a 
<ulink url="files/ipvs-0.9.5.fwmarks-file.patch">
ipvs-0.9.5.fwmarks-file.patch</ulink>
(http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/ipvs-0.9.5.fwmarks-file.patch)
which allows the use of a string fwmark 
as well as the default method of an integer fwmark,
using a table in <filename>/etc</filename> 
that looks like the <filename>/etc/hosts</filename> table.
	</para>
	<para>
Horms <emphasis>horms (at) vergenet (dot) net</emphasis> Nov 14 2001
	</para>
	<blockquote>
		<para>
while we were at OLS in June, Joe suggested that we have a file to
associate names with firewall marks. I have attached a patch that enables
<filename>ipvsadm</filename> to read names for firewall marks from 
<filename>/etc/fwmarks</filename>. This file is
intended to be analogous to <filename>/etc/hosts</filename> 
(or files in <filename>/etc/iproute2/</filename>).
		</para>
		<para>
The patch to the man page explains the format more fully, but briefly the
format is "fwmark name..." newline delimited
		</para>
<programlisting><![CDATA[
e.g.

1 a_name
2 another_name yet_another_name

Which leads to

director:/etc/lvs# ipvsadm -A -f a_name
]]></programlisting>
	</blockquote>
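	<para>
To make the file format concrete, here is a small Python sketch of how such a
table could be read and used to resolve a name. It is not taken from the patch:
the function names are hypothetical, and treating "#" as a comment marker is an
assumption borrowed from <filename>/etc/hosts</filename>.
	</para>
<programlisting><![CDATA[
```python
def parse_fwmarks(text):
    """Return a dict mapping each name to its integer fwmark.

    Each non-blank line is "fwmark name...". Anything after "#" is
    ignored (an assumption, by analogy with /etc/hosts).
    """
    names = {}
    for line in text.splitlines():
        fields = line.split("#", 1)[0].split()
        if len(fields) < 2:
            continue                  # blank line or a mark with no names
        mark = int(fields[0])
        for name in fields[1:]:
            names[name] = mark
    return names


def resolve_fwmark(value, names):
    """Accept either a plain number or a name found in the table."""
    return int(value) if value.isdigit() else names[value]


table = parse_fwmarks("1 a_name\n2 another_name yet_another_name\n")
```
]]></programlisting>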
	</section>
	<section>
	<title>ip_vs_conn.pl</title>
	<note>
You can also run <command>`ipvsadm -lcn`</command> to do the same thing.
	</note>
<programlisting><![CDATA[
#!/usr/bin/perl
#-----------------------------------------------
#Date: Wed, 07 Aug 2002 20:14:25 -0300
#From: Jeff Warnica <noc (at) mediadial (dot) com>

#Here is a /proc/net/ip_vs_conn hex mode to integer script.
#If it's given an ipaddress as an argument on the commandline,
#it will show only lines with that ipaddress in it.
#-------------------------------------------------

use strict;
use warnings;

# optional filter: an IP address (or part of one) to match
my $mask = @ARGV ? $ARGV[0] : '';

open(my $data, '<', '/proc/net/ip_vs_conn')
        or die "can't open /proc/net/ip_vs_conn: $!\n";

my $format = "%8s %-17s %-5s %-17s %-5s %-17s %-5s %-17s %-20s\n";
printf $format, "Protocol", "From IP", "FPort", "To IP", "TPort",
        "Dest IP", "DPort", "State", "Expires";
while (<$data>) {
        chomp;
        my ($proto, $fromip, $fromport, $toip, $toport, $destip, $destport,
            $state, $expiry) = split;
        next unless $fromip;
        next if ($proto =~ /Pro/);      # skip the header line

        $fromip = hex2ip($fromip);
        $toip   = hex2ip($toip);
        $destip = hex2ip($destip);

        $fromport = hex($fromport);
        $toport   = hex($toport);
        $destport = hex($destport);

        # \Q..\E makes the dots in the address match literally
        if ( !$mask || grep { /\Q$mask\E/ } ($fromip, $toip, $destip) ) {
                printf $format, $proto, $fromip, $fromport, $toip,
                        $toport, $destip, $destport, $state, $expiry;
        }
}
close($data);

# convert 8 hex digits (e.g. C0A8016E) to a dotted quad (192.168.1.110)
sub hex2ip {
        my $input = shift;
        return join '.', map { hex(substr($input, $_, 2)) } (0, 2, 4, 6);
}

#---------------------------------------------------------------
]]></programlisting>
	</section>
	<section id="lucas_script">
	<title>Luca's php monitoring script</title>
	<para>
Luca Maranzano <emphasis>liuk001 (at) gmail (dot) com</emphasis> 12 Oct 2005
	</para>
	<para>
I've written a simple php script 
<ulink url="files/luca.php">luca.php</ulink>
to monitor the status of an LVS server.
To use it, configure sudo so that the Apache user can run
<command>/sbin/ipvsadm</command> as root without a password prompt.
The CSS is derived from the phpinfo() page.
	</para>

	<para>
Jeremy Kerr <emphasis>jk (at) ozlabs (dot) org</emphasis> 12 Oct 2005
	</para>
<programlisting><![CDATA[
<? $cmd="sudo /sbin/ipvsadm -L ". $dns_flag; passthru($cmd); ?>
]]></programlisting>
	<para>
Whoa.
	</para>
	<para>
If you use this script with register_globals set (and assuming you've
set it up so that the sudo works), you've got a remote *root*
vulnerability right there.
<emphasis>e.g.</emphasis>
http://example.com/script.php?resolve_dns=1&amp;dns_flag=;sudo+rm+-rf+/, 
which will do <command>`rm -rf /`</command> as root.
	</para>
	<para>
You may want to ensure your variables are clean beforehand, and avoid
the sudo completely (maybe use a helper process?)
	</para>
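	<para>
One way to keep the variables clean is to never paste request data into a
command string at all. The Python sketch below illustrates the idea; it is not
a rewrite of Luca's script. The request value only selects an entry from a fixed
table, and the command is passed as an argument list so no shell ever parses it.
The mode names and function names are hypothetical; -L and -n are ipvsadm's own options.
	</para>
<programlisting><![CDATA[
```python
import subprocess

# The only listing variants a web user may ask for (hypothetical names).
ALLOWED_FLAGS = {"resolve": [], "numeric": ["-n"]}


def build_list_command(mode):
    """Return an argv list for a read-only ipvsadm listing.

    `mode` comes from the request; anything not in ALLOWED_FLAGS is
    rejected instead of being passed on.
    """
    if mode not in ALLOWED_FLAGS:
        raise ValueError("unsupported mode: %r" % (mode,))
    return ["sudo", "/sbin/ipvsadm", "-L", *ALLOWED_FLAGS[mode]]


def list_services(mode):
    # An argv list with no shell means ";" and spaces are never interpreted.
    result = subprocess.run(build_list_command(mode),
                            capture_output=True, text=True, check=True)
    return result.stdout
```
]]></programlisting>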
	<para>
<emphasis>malcolm (at) loadbalancer (dot) org</emphasis> Oct 12 2005
	</para>
	<para>
That's why PHP no longer has register_globals enabled by default!
It's also why you restrict access to your admin IP address by source IP.
My code has this vulnerability, but I'm not sure a helper app would be 
any more secure (sudo is a helper app).
	</para>

	<para>
<emphasis>liuk001 (at) gmail (dot) com</emphasis> Oct 12 2005
	</para>
	<para>
Jeremy, this is a good point. I wrote it as a quick and dirty hack
without security in mind. It is used on the internal net from trusted
users who indeed have root access to the servers ;-)
However, sudo is configured to run only /sbin/ipvsadm from www-data
user, so I think that /bin/rm could not be executed.
	</para>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis> 12 Oct 2005 
	</para>
	<para>
...as all the relevant values are produced in 
<filename>/proc/net/ip_vs[_app,_conn,_stats]</filename>, 
then why not just write something to 
process those values instead? They're globally readable and don't need 
any helper apps to view them at all.
	</para>
	<para>
Yes, you'd be re-inventing a small part of ipvsadm's functionality, but the 
security improvements alone are worth it. Removing the overhead of 
running <command>sudo</command> and then <command>ipvsadm</command>, 
by just doing an <filename>open()</filename> on a 
<filename>/proc</filename> file, might also be worth it in situations where you may have many 
users running your web app.
	</para>
	<para>
Sure, you need to decode the hex values to make them "nice". Unless you 
have the sort of users who read hex encoding all the time :)
	</para>
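	<para>
A minimal Python sketch of that decoding step is below. It assumes the same
nine-column IPv4 layout that ip_vs_conn.pl above relies on; the function name
and the keys of the returned dictionary are made up for illustration.
	</para>
<programlisting><![CDATA[
```python
import ipaddress


def decode_conn_line(line):
    """Decode one data line of /proc/net/ip_vs_conn into readable fields.

    Assumed layout: protocol, three hex address/port pairs (client,
    virtual, destination), state, expiry in seconds.
    """
    pro, fip, fport, tip, tport, dip, dport, state, expires = line.split()[:9]

    def endpoint(hex_ip, hex_port):
        addr = ipaddress.IPv4Address(int(hex_ip, 16))
        return "%s:%d" % (addr, int(hex_port, 16))

    return {
        "protocol": pro,
        "client": endpoint(fip, fport),
        "virtual": endpoint(tip, tport),
        "destination": endpoint(dip, dport),
        "state": state,
        "expires": int(expires),
    }
```
]]></programlisting>
	<para>
A web page built on this needs no sudo rule and no external command at all.
	</para>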
	</section>
	<section id="ipvsadm_set">
	<title>ipvsadm set option</title>
	<para>anon</para>
	<blockquote>
what <filename>/proc</filename> values does <command>ipvsadm --set</command>
modify? Something in <filename>/proc/sys/net/ipv4/vs</filename>?
	</blockquote>
	<para>Ratz</para>
	<para>
The current proc-fs entries are a read-only representation of what 
could be set regarding state timeouts. <command>ipvsadm --set</command> will set,
for IPVS related connection entries:
	</para>
<itemizedlist>
	<listitem>	
The time we spend in TCP ESTABLISHED before transition
	</listitem>
	<listitem>
The time we spend in TCP FIN_WAIT before transition
	</listitem>
	<listitem>
The time we spend for UDP in general
	</listitem>
</itemizedlist>
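	<para>
The three values are given in that order (tcp, tcpfin, udp) and in seconds.
As a small illustration, a script might build the command like this;
the helper name is hypothetical, and the remark about 0 follows my reading of the
ipvsadm man page, so check it against your version.
	</para>
<programlisting><![CDATA[
```python
def build_set_command(tcp, tcpfin, udp):
    """Return the argv for `ipvsadm --set tcp tcpfin udp` (seconds).

    As I read the man page, a value of 0 leaves the corresponding
    timeout unchanged.
    """
    values = (tcp, tcpfin, udp)
    if any(not isinstance(v, int) or v < 0 for v in values):
        raise ValueError("timeouts must be non-negative integers")
    return ["ipvsadm", "--set", *(str(v) for v in values)]
```
]]></programlisting>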
	<para>
Andreas 05 Feb 2006
	</para>
	<blockquote>
Where can I get the currently set values of <command>ipvsadm --set foo bar baz</command>?
	</blockquote>
	<para>
Ratz
	</para>
	<para>
You can't :) IP_VS_SO_GET_TIMEOUTS is not implemented in <command>ipvsadm</command>
or I'm blind.
Also the <filename>proc-fs</filename> related entries for this are not exported. 
I've written a patch to re-instate the proper settings in <filename>proc-fs</filename>, 
however this is only in 2.4.x kernels. 
Julian has recently proposed a very granular timeout framework, 
however none of us has had the time nor impulse to implement it. 
For our (work) customers I needed the ability to instrument all 
the IPVS related timeout values in DoS and non-DoS mode. 
The <command>ipvsadm --set</command> option should be obsoleted, 
since it only covers the timeout settings partially 
and there is no <command>--get</command> method.
	</para>
	<blockquote>
I did not find a way to read them out. I grepped through /proc/sys/foo
and /proc/net/foo and was not able to see the numbers I set before. This
was on kernel 2.4.30 at least.
	</blockquote>
	<para>
Correct. The standard 2.4.x kernel only exports the DoS timers for some 
(to me at least) unknown reason. I suggest that we re-instate (I'll send 
a patch or a link to the patch tomorrow) these timer settings until we 
can come up with Julian's framework. It's imperative that we can set 
those transition timers, since their default values are dangerous 
regarding >L4 DoS. One example is HTTP/1.1 slow start if the web servers 
are mis-configured (wrong MaxKeepAlive and its timeout settings).
	</para>
	<blockquote>
This brings me further to the question if the changes of lvs in recent
2.6 development are being backported to 2.4?
	</blockquote>
	<para>
Currently I would consider it the other way 'round. 2.6.x has mostly 
stalled feature wise and 2.4.x is missing the crucial per RS threshold 
limitation feature. I've backported it and also improved it quite a bit 
(service pool) and so we're currently completely out of sync :). I'll 
try to put some more effort into working on the 2.6.x kernel, however 
since it's still too unstable for us, our main focus remains on the 2.4.x 
kernel.
	</para>
	<para>
[And before you ask: No, we don't have the time (money wise) to invest 
into bug-hunting and reporting problems regarding 2.6.x kernels on 
high-end server machines. And in a private environment 2.6.x works really 
well on my laptops and other machines, so there's really not much to 
report ;).]
	</para>
        <para>
On top of that LVS 
does not use the classic TCP timers from the stack since it only 
forwards TCP connections. The IPVS timers are needed so we can maintain 
the LVS state table regarding expirations in various modes, mostly LVS_DR.
	</para>
	<para>
Browsing through the code recently I realised that the state transition 
code in <filename>ip_vs_conn:vs_tcp_state()</filename> is very simple, probably too simple. 
If we use Julian's <filename>forward_shared patch</filename> 
(which I consider a great invention, BTW) 
one would assume that IPVS timeouts are more closely 
timed to the actual TCP flow. 
However, this is not the case because, 
from what I've read and understood, the IPVS state transitions are done 
without memory, so it's wild guessing :). I might have a closer look at 
this because it just seems sub-optimal. Also the notion of active versus 
inactive connections stemming from this simple view of TCP flow is 
questionable, especially the dependence and weighting of some schedulers.
	</para>
	<blockquote>
So, if I set an
lvs tcp timeout of about 2h 12 min, lvs would never drop a tcp connection
unless a client is really "unreachable".
	</blockquote>
	<para>
The timeout is more bound to the connection entry in the IPVS lookup 
table, so we know where to forward incoming packets regarding a specific 
TCP flow. LVS never drops (or keeps) a TCP connection as such; it only 
forwards or drops specific packets pertaining to a TCP connection.
	</para>
	<blockquote>
After 2h Linux sends tcp
keepalive probes several times, so there are some bytes sent through the
connection.
	</blockquote>
	<para>
Nope, IPVS does not work as a proxy regarding TCP connections. It's a 
TCP flow redirector.
	</para>
	<blockquote> 
lvs will (re)set the internal timer for this connection to
the keepalive time I set with <command>--set</command>.
	</blockquote>
	<para>
Kind of ... only the bits of the state transition table which are 
affected by the three settings. It might not be enough to keep 
persistency for your TCP connection.
	</para>
	<blockquote>
Or does it recognize that the
bytes sent are only probes without a valid answer and thus drop the
connection?
	</blockquote>
	<para>
The director does not send keepalive probes.
	</para>
	<blockquote>
Will we eventually get timeout parameters _per service_ instead of global ones? 
	</blockquote>
	<para>
Julian proposed the following framework: 
http://www.ssi.bg/~ja/tmp/tt.txt
	</para>
	<para>
So if you want to test, the only thing you have to do is fire up your 
editor of choice :). Ok, honestly, I don't know when this will be done 
because it's quite some work and most of us developers here are pretty 
busy with other daily activities. So unless there is a glaring issue 
regarding timers implemented as-is, chances are slim that this gets 
implemented. Of course I could fly down to Julian's place over the 
week-end and we could implement it together; want to sponsor it? ;).
	</para>
	<para>
There are a lot of TCP timers in the Linux kernel and they all have 
different semantics. There is the TCP timeout timer for sockets 
related to locally initiated connections, then there is a TCP timeout 
for the connection tracking table, which on my desktop system for 
example has the following settings:
	</para>
<programlisting><![CDATA[
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_close:10
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_close_wait:60
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_established:432000
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_fin_wait:120
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_last_ack:30
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_syn_recv:60
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_syn_sent:120
/proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_timeout_time_wait:120
]]></programlisting>
	<para>
And of course we have the IPVS TCP settings, which look as follows (if 
they weren't disabled in the core :)):
	</para>
<programlisting><![CDATA[
/proc/sys/net/ipv4/vs/tcp_timeout_established:900
/proc/sys/net/ipv4/vs/tcp_timeout_syn_sent:120
/proc/sys/net/ipv4/vs/tcp_timeout_syn_recv:60
/proc/sys/net/ipv4/vs/tcp_timeout_:900
[...]
]]></programlisting>
	<para>
unless you enabled tcp_defense, which changes those timers again. And 
then of course we have other in-kernel timers, which influence those 
timers mentioned above.
	</para>
	<para>
However, the aforementioned timers for packet filtering, NAPT and 
load balancing are meant as a means to map expected real TCP flow 
timeouts. Since there is no socket (as in an endpoint) involved when 
doing either netfilter or IPVS, you have to guess what the TCP flow 
in-between (where your machine is "standing") is doing, so you can 
continue to forward, rewrite, mangle, whatever, the flow, _without_ 
disturbing it. The timers are used for table mapping timeouts of TCP 
states. If we didn't have them, mappings would stay in the kernel 
forever and eventually we'd run out of memory. If we have them wrong, it 
might occur that a connection is aborted prematurely by our host, for 
example yielding those infamous ssh hangs when connecting through a 
packet filter.
	</para>
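	<para>
The role of these timers can be pictured with a toy table in Python.
This is only a model of the idea, not of the IPVS code: the class, its methods
and the timeout values are assumptions chosen for illustration.
Each packet re-arms the timer for the flow's current state, and a lookup after
the timer has run out finds no mapping.
	</para>
<programlisting><![CDATA[
```python
import time

# Illustrative per-state timeouts in seconds (assumed values).
TIMEOUTS = {"SYN_RECV": 60, "ESTABLISHED": 900, "FIN_WAIT": 120}


class FlowTable:
    """Toy table mapping a flow to (destination, state, expiry time)."""

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.entries = {}

    def update(self, flow, dest, state):
        # Every forwarded packet re-arms the timer for the current state.
        self.entries[flow] = (dest, state, self.clock() + TIMEOUTS[state])

    def lookup(self, flow):
        entry = self.entries.get(flow)
        if entry is None:
            return None
        dest, _state, expires = entry
        if self.clock() >= expires:
            del self.entries[flow]    # expired: free the memory
            return None               # later packets have no mapping
        return dest
```
]]></programlisting>
	<para>
With a timeout that is too short, an idle but healthy flow loses its entry,
which is the hang described above; with no timeout, entries are never freed.
	</para>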
	<para>
The tcp keepalive timer setting you've mentioned, on the other hand, is 
per socket. And as such only has an influence on locally created or 
terminated sockets. A quick skim of socket(2) and socket(7) reveals:
	</para>
<programlisting><![CDATA[
   [socket(2) excerpt]
        The communications protocols which implement a SOCK_STREAM
        ensure that data is not lost or duplicated.  If a piece of
        data  for  which the peer protocol has buffer space cannot
        be successfully transmitted within a reasonable length  of
        time,  then the connection is considered to be dead.  When
        SO_KEEPALIVE is enabled on the socket the protocol  checks
        in  a  protocol-specific  manner if the other end is still
        alive.

   [socket(7) excerpt]
        These socket options can be set by using setsockopt(2) and
        read with getsockopt(2)  with  the  socket  level  set  to
        SOL_SOCKET for all sockets:

        SO_KEEPALIVE
        Enable sending of keep-alive messages on connection-oriented 
        sockets. Expects a integer boolean flag.
]]></programlisting>
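	<para>
For comparison with the IPVS timers, this is what the per-socket setting looks
like from an application, here in Python. It only affects the socket it is
called on, on the machine that owns that socket; a director forwarding the
flow never sees this option.
	</para>
<programlisting><![CDATA[
```python
import socket

# Enable keepalive probes on one locally created TCP socket.
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)

# Read the flag back; a non-zero value means keepalive is enabled.
keepalive_on = sock.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE) != 0
sock.close()
```
]]></programlisting>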
	</section>
	<section id="ipvsadm_error_messages">
	<title>ipvsadm error messages</title>
	<para>
<command>ipvsadm</command>'s error messages are low level and give the
user little indication of what they've done wrong. 
These error messages were written in the early days of LVS
when getting <command>ipvsadm</command> to work was a feat in itself.
Unfortunately the messages have not been updated, and enough use of 
<command>ipvsadm</command> is now scripted that most of us rarely run into 
the messages anymore. 
As people post error messages and what they mean, I'll put them here.
	</para>
	<para>
Brian Sheets <emphasis>bsheets (at) singlefin (dot) net</emphasis> 7 Jan 2007 
	</para>
	<blockquote>
<programlisting><![CDATA[
ipvsadm -d -t 10.200.8.1:25 -r 10.200.8.100
Service not defined
]]></programlisting>
	<para>
What am I doing wrong? The syntax looks correct to me.
	</para>
	</blockquote>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis> 07 Jan 2007
	</para>
	<para>
Do you have a service defined on VIP 10.200.8.1 port 25? Make
sure you're not getting your real and virtual servers mixed up.
	</para>
	<para>
Brian
	</para>
	<blockquote>
Yup, I had the real and virtuals reversed.
	</blockquote>
	<para>
Joe
	</para>
	<para>
You should be able to delete a realserver for a service that isn't declared,
with only a notice rather than an error, at least in my thinking. However
that battle was lost back in the early days.
	</para>
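	<para>
A script can sidestep "Service not defined" by checking the listing before it
deletes. The Python sketch below parses text in the shape that
<command>ipvsadm -Ln</command> usually prints; the sample text is made up for
this example (with the virtual and real addresses the right way round), the
function name is hypothetical, and fwmark services are ignored.
	</para>
<programlisting><![CDATA[
```python
SAMPLE = """\
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  10.200.8.100:25 rr
  -> 10.200.8.1:25                Route   1      0          0
"""


def parse_listing(text):
    """Return {(proto, "vip:port"): ["rip:port", ...]} from a listing."""
    services = {}
    current = None
    for line in text.splitlines():
        fields = line.split()
        if len(fields) < 2:
            continue
        if fields[0] in ("TCP", "UDP"):
            current = (fields[0], fields[1])      # a virtual service line
            services[current] = []
        elif fields[0] == "->" and current is not None:
            services[current].append(fields[1])   # one of its realservers
    return services


services = parse_listing(SAMPLE)
```
]]></programlisting>
	<para>
Here ("TCP", "10.200.8.1:25") is not a key of the result, so a delete naming it
as the virtual service would be refused, while ("TCP", "10.200.8.100:25") is.
	</para>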
	</section>
	<section id="ipvsadm_smp_bug">
	<title>ipvsadm fast update bug with smp</title>
	<para>
Kees Hoekzema <emphasis>kees (at) tweakers (dot) net</emphasis> 11 Jul 2007 
	</para>
	<para>
I'm applying weight changes to a 64bit 2-way SMP director quite rapidly (not
quite twice a second, but close) and getting a frozen director, which 
needs a cold reset.
	</para>
	<para>
I have found a bit more information from my debugging, 
and it seems that Horms already knows about it
(http://marc.info/?l=linux-netdev&amp;m=118040107213444&amp;w=2).
I recompiled the 
<filename>ip_vs()</filename> modules with a bit more debug and every time my system
crashed I had the same debug output:
	</para>
<programlisting><![CDATA[
Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 910
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 913
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 916
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 918
Jul 12 15:43:15 atropos kernel: Leave: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 919
Jul 12 15:43:15 atropos kernel: Enter: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 885
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 886
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 891
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 897
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 906
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 908
Jul 12 15:43:15 atropos kernel: DEBUG: ip_vs_edit_dest,
net/ipv4/ipvs/ip_vs_ctl.c line 910
]]></programlisting>
	<para>
The code after line 910 reads:
	</para>
<programlisting><![CDATA[
while (atomic_read(&svc->usecnt) > 1) {};
]]></programlisting>
	<para>
Every other busy lock in the code reads:
	</para>
<programlisting><![CDATA[
IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);
]]></programlisting>
	<para>
Which is basically the same, except for a cpu_relax() call.
At the moment I am testing my server with cpu_relax() code in the
ip_vs_edit_dest function, and so far it has not crashed yet and is directing
the traffic quite a bit longer than previously was possible.
The only differences between this server and the old server (which didn't
have any problems) are:
	</para>
	<itemizedlist>
		<listitem>
SMP (4 cores) vs single core
		</listitem>
		<listitem>
64 bits vs 32 bits
		</listitem>
		<listitem>
2.6.21.5 vs 2.6.20.4 (but I do not see any changes in ip_vs_ctl.c)
		</listitem>
	</itemizedlist>
	<para>
In my first mail I blamed the 64/32 bits difference, but right now I'm more
inclined to think it's an SMP issue. Unfortunately I lack the kernel hacking skills
to say why, or why that cpu_relax() helps so much in the while loop.
Hopefully Horms understands it better than I do ;)
	</para>

<programlisting><![CDATA[
--- linux-2.6.22.1/net/ipv4/ipvs/ip_vs_ctl.c    2007-07-12 19:41:27.000000000 +0200
+++ old/net/ipv4/ipvs/ip_vs_ctl.c       2007-07-10 20:56:30.000000000 +0200
@@ -909,8 +909,8 @@
        write_lock_bh(&__ip_vs_svc_lock);

        /* Wait until all other svc users go away */
-       IP_VS_WAIT_WHILE(atomic_read(&svc->usecnt) > 1);
-
+       while (atomic_read(&svc->usecnt) > 1) {};
+
        /* call the update_service, because server weight may be changed */
        svc->scheduler->update_service(svc);
]]></programlisting>

	<para>
Joe
	</para>
	<para>
I know this is separate from the problem, but according to 
feedback and control theory you should be making adjustments 
on a timescale that damps the transients. What a transient 
is here is not obvious - the timescale of a tcpip 
connection, the time it takes to change the load by 10%?, 
50%? I don't know, but every couple of seconds would seem to 
be a lot shorter than either of these two time scales.
Do you find your setup has problems when you do adjustments 
on a longer timescale?
	</para>
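	<para>
One simple way to slow the adjustments down is to gate them in the script that
computes the weights. The Python sketch below only illustrates that idea:
the class name, the 10 second interval and the 10% threshold are all
assumptions, not measured values, and choosing them is exactly the open
question above.
	</para>
<programlisting><![CDATA[
```python
import time


class WeightDamper:
    """Pass on a weight change only if it is big enough and not too soon."""

    def __init__(self, min_interval=10.0, min_change=0.1,
                 clock=time.monotonic):
        self.min_interval = min_interval  # seconds between updates (assumed)
        self.min_change = min_change      # relative change worth acting on
        self.clock = clock
        self.applied = {}                 # realserver -> (weight, time)

    def should_apply(self, server, new_weight):
        now = self.clock()
        if server in self.applied:
            old, when = self.applied[server]
            if now - when < self.min_interval:
                return False              # too soon after the last update
            if abs(new_weight - old) < self.min_change * max(old, 1):
                return False              # change too small to matter
        self.applied[server] = (new_weight, now)
        return True
```
]]></programlisting>
	<para>
The caller would run <command>ipvsadm -e</command> only when
should_apply() returns true.
	</para>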
	</section>
	<section id="no_scheduler">
	<title>Problems when no scheduler</title>
	<note>
What happens if you have the service entered with <command>ipvsadm -A</command>, 
but have no realservers (no <command>ipvsadm -a</command>) to accept the forwarded packets?
We haven't quite figured out what to do about this yet.
Till we can get a better idea about what's going on,
we're not going to do anything.
	</note>
	<para>
Siim Poder <emphasis>windo (at) p6drad-teel (dot) net</emphasis> 07 May 2008
	</para>
	<blockquote>
		<para>
We've had LVS machines dying a couple of times when the service is using
the wrr scheduler and keepalived pulls all real servers from behind the
service IP.
		</para>
		<para>
The symptoms are that there are a lot of (thousands, apparently for
every packet?) messages in syslog:
ip_vs_wrr_schedule(): no available servers
		</para>
		<para>
After which the machine hangs. I don't recall if I've had to boot it
manually or if it boots by itself.
		</para>
		<para>
Also, I'm not sure if it is that message that is killing the machine,
but the problem hasn't occurred with other schedulers (that don't print
such a message). We use wrr the most though.
		</para>
		<para>
I think we should either remove the message or ratelimit it (unless the
bug is somewhere else). I tested the patch and it seems to be ok, but as
I'm unable to reproduce the hanging/crashing in test environment, I
can't verify whether it actually helps.
		</para>
		<para>
Something close to this was added to mainline by someone already. But
the problem seems to persist (just without the messages). It seems to
appear with any scheduler (at least wrr, wlc and rr). However I have
been unable to reproduce this by either connection or packet rate in a
test environment. It's probably not just the missing real servers, but
something relatively infrequent that gets triggered only after there are
no real servers. The LVS goes down in about 5-15 minutes of missing real
servers IIRC.
		</para>
		<para>
I tried generating many connections (and played with ttl/fragmentation a
little), but couldn't trigger the bug. Maybe it has to do with clients
sending some ICMP messages (which would probably be rare enough)?
		</para>
		<para>
This still gets triggered in our live env for high connection rate
services (if the servers fail for any reason and keepalived kicks them
out). We have put sorry servers into keepalived configuration to avoid
the whole LVS going down for now (sorry_server 127.0.0.1:666), so there
is a workaround for us.
		</para>
<programlisting><![CDATA[
--- linux-2.6.24/net/ipv4/ipvs/ip_vs_wrr.c      2008-01-24 22:58:37.000000000 +0000
+++ linux-2.6.24-ipvs_patches/net/ipv4/ipvs/ip_vs_wrr.c 2008-05-06 16:17:17.790662800 +0000
@@ -169,7 +169,7 @@
                                 */
                                if (mark->cw == 0) {
                                        mark->cl = &svc->destinations;
-                                       IP_VS_INFO("ip_vs_wrr_schedule(): "
+                                       IP_VS_DBG_RL("ip_vs_wrr_schedule(): "
                                                   "no available servers\n");
                                        dest = NULL;
                                        goto out;
]]></programlisting>
	</blockquote>
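	<para>
For readers unfamiliar with rate limiting, the idea behind replacing a plain
log call with a rate-limited one can be modelled in a few lines of Python.
This is a user-space illustration, not the kernel's implementation; the class
name and the default interval and burst are assumptions.
	</para>
<programlisting><![CDATA[
```python
import time


class RateLimitedLog:
    """Emit at most `burst` messages per `interval` seconds."""

    def __init__(self, emit=print, interval=5.0, burst=10,
                 clock=time.monotonic):
        self.emit = emit
        self.interval = interval
        self.burst = burst
        self.clock = clock
        self.window_start = clock()
        self.sent = 0
        self.suppressed = 0

    def log(self, message):
        now = self.clock()
        if now - self.window_start >= self.interval:
            # New window: report what was dropped, then start counting again.
            if self.suppressed:
                self.emit("%d messages suppressed" % self.suppressed)
            self.window_start = now
            self.sent = 0
            self.suppressed = 0
        if self.sent < self.burst:
            self.sent += 1
            self.emit(message)
            return True
        self.suppressed += 1
        return False
```
]]></programlisting>
	<para>
With one message per packet and no realservers, this turns thousands of
syslog lines into a handful per interval.
	</para>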
	<para>
Horms 29 Dec 2008
	</para>
	<para>
We're doing nothing till we figure out what's really going on.
The important problem seems to be that LVS dies sometimes. 
But unfortunately that can't be fixed right now, 
because nobody knows how to do so, 
despite Siim's efforts to find the cause of the problem.
	</para>
	<para>
With regards to making wrr like the other schedulers, I'd actually
be much more inclined to do the reverse - make all the other schedulers
display a rate-limited warning when they don't have any real servers
available. Perhaps something like the patch against 2.6.28 below.
	</para>
	<para>
In any case, I think that the warning message and the LVS dying issues
are separate, except that it seems likely that the warning message
will help to lead us to the cause of the "LVS dies" bug.
	</para>
	<note>
Joe: The patches to make all schedulers display a warning will be in kernel 2.6.29. 
	</note>
	</section>
</section>
<section id="LVS-HOWTO.LVS-NAT" xreflabel="LVS-NAT">
<title>LVS: LVS-NAT</title>
	<section id="lvs_nat_intro">
	<title>Introduction</title>
	<para>
		<note>
see also <ulink url="http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt">
Julian's layer 4 LVS-NAT setup</ulink>
(http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt).
		</note>
	</para>
	<para>
LVS-NAT is based on Cisco's LocalDirector.
	</para>
	<para>
This method was used for the first LVS. If you want to set up
a test LVS, this requires no modification of the realservers
and is still probably the simplest setup.
	</para>
	<para>
In a commercial environment, the owners of servers are
loath to change the configuration of a tested machine.
When they want load balancing, they will clone their
server and tell you to put your load balancer in front of their
rack of servers. 
You will not be allowed near any of their servers, thank you very much.
In this case you use LVS-NAT. 
	</para>
	<blockquote>
		<para>
Ratz Wed, 15 Nov 2006
		</para>
		<para>
Most commercial load balancers are not set up anymore 
using the triangulation mode (Joe: triangulation == LVS-DR)
(at least in the projects I've been involved in).
The load balancer is becoming more and more a router,
using well-understood key technologies like VRRP and content processing. 
		</para>
	</blockquote>
	<para>
With LVS-NAT, the incoming packets are rewritten by the director
changing the dst_addr from the VIP to the address of one of the realservers 
and then forwarded to the realserver.
The replies from the realserver are sent to the director where they
are rewritten and returned to the client with the source address 
changed from the RIP to the VIP.
	</para>
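	<para>
The two rewrites can be written down as a toy model in Python, using the
addresses from the example setup later in this section.
Ports, checksums and the connection table are left out, and the
<filename>Packet</filename> class and function names are made up for illustration.
	</para>
<programlisting><![CDATA[
```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str


VIP = "192.168.1.110"


def rewrite_inbound(pkt, rip):
    """Client -> VIP becomes client -> RIP; the source is untouched."""
    return replace(pkt, dst=rip)


def rewrite_outbound(pkt):
    """RIP -> client becomes VIP -> client; the destination is untouched."""
    return replace(pkt, src=VIP)


request = rewrite_inbound(Packet(src="192.168.1.254", dst=VIP), "10.1.1.2")
reply = rewrite_outbound(Packet(src="10.1.1.2", dst="192.168.1.254"))
```
]]></programlisting>
	<para>
Because the reply must pass back through the director to have its source
rewritten, the realservers need the director as their route to the client.
	</para>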
	<para>
Unlike the other two methods of forwarding used in an LVS (LVS-DR and LVS-Tun),
the realserver only needs a functioning TCP/IP stack (<emphasis>e.g.</emphasis> a networked printer),
<emphasis>i.e.</emphasis> the realserver can have any operating system and no modifications are
made to the configuration of the realservers (except setting their route tables).
	</para>
	</section>
	<section id="lvs_nat_problems">
	<title>LVS-NAT bugs</title>
	<para>
Sep 2006:
Various problems have surfaced in the 2.6.x LVS-NAT code
all relating to routing (netfilter)
on the side of the director facing the internet.
People using LVS-NAT on a director which isn't a firewall
and which only has a single default gw, aren't having any problems.
	</para>
	<para>
It seems the 2.4.x code was working correctly: 
Farid Sarwari had it working for IPSec at least.
The source routing problem has been identified by three people, 
who've all submitted functionally equivalent patches.
While we're delighted to have contributions from so many people,
we regret that we weren't fast enough to recognise the problem
and save the last two people all their work.
One of the problems (we think) is that not
many people are using LVS-NAT and when a weird problem
is reported on the mailing list we say "well 1000's of people
have been using LVS-NAT for years without this problem, 
this guy must not know what he's talking about".
We're now taking the approach that maybe not too many people
are using LVS-NAT.
	</para>
	<para>
Here are the problems which have surfaced so far with LVS-NAT.
They either have been solved or will be in a future release of LVS.
	</para>
	<itemizedlist>
		<listitem>
Firewall incompatibility: You couldn't run a netfilter firewall on the outside of the director. 
This was solved by Ben North with the <xref linkend="antefacto_patches"/>.
These patches were taken over by Vinnie, 
and are now being maintained by Julian as part of the <xref linkend="ipvs_nfct"/>.
Since the NFCT patches are benign when not being used, 
we hope that they will be incorporated into the ip_vs code for the kernel
(when Horms gets time).
		</listitem>
		<listitem>
Source routing: Outbound packets originating at the VIP were not injected
into the routing table but were sent straight out the default gw.
As a result the packets were not affected by <command>iproute2</command> commands.
This problem was found by Ken Brownfield, who submitted a patch for his relatively
old kernel; then Farid Sarwari, who couldn't get routing to work for his IPSec LVS, submitted another;
then David Black realised that Julian's NFCT patches handled the problem from the start
(see <xref linkend="brownfield"/>). 
Horms is working on getting Julian's NFCT code into ip_vs.
		</listitem>
		<listitem>
LVS-NAT ftp helper modules for active/passive ftp:
We seem to get a disproportionate number of problems with
ftp on LVS. This seems to be a combination of the small number
of users, real bugs and inadequate documentation.
(see <xref linkend="LVS-NAT_ftp_bug"/>).
		</listitem>
	</itemizedlist>
	</section>
	<section id="lvs_nat_one_network_two_nic">
	<title>Example 1-NIC, 2 Network LVS-NAT (VIP and RIPs on different network)</title>
	<note>
If the VIP and the RIPs are on the same network you need the <xref linkend="one_network"/>
	</note>
	<para>
Here the client is on the same network as the VIP
(in a production LVS, the client will be coming in from an
external network via a router).
(The director can have 1 or 2 NICs -
two NICs will allow higher throughput of packets, since
the traffic on the realserver network will be separated
from the traffic on the client network).
	</para>
<programlisting><![CDATA[
Machine                      IP
client                       CIP=192.168.1.254
director VIP                 VIP=192.168.1.110 (the IP for the LVS)
director internal interface  DIP=10.1.1.9 (director interface on the LVS-NAT network)
realserver1                  RIP1=10.1.1.2
realserver2                  RIP2=10.1.1.3
realserver3                  RIP3=10.1.1.4
.
.
realserverN                  RIPn=10.1.1.n+1
]]></programlisting>
<programlisting><![CDATA[
                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                        (router)
                           |
             __________    |
            |          |   |   VIP=192.168.1.110 (eth0:110)
            | director |---|
            |__________|   |   DIP=10.1.1.9 (eth0:9)
                           |
                           |
          -----------------------------------
          |                |                |
          |                |                |
   RIP1=10.1.1.2      RIP2=10.1.1.3   RIP3=10.1.1.4 (all eth0)
   _____________     _____________    _____________
  |             |   |             |  |             |
  | realserver  |   | realserver  |  | realserver  |
  |_____________|   |_____________|  |_____________|
]]></programlisting>
	<para>
Here's the <filename>lvs_nat.conf</filename> for this setup.
	</para>
<programlisting><![CDATA[
LVS_TYPE=VS_NAT
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.0 192.168.1.255
DIP=eth0 dip 10.1.1.0 255.255.255.0 10.1.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1:telnet realserver2:telnet realserver3:telnet
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=dip
#----------end lvs_nat.conf------------------------------------
]]></programlisting>
	<para>
The VIP is the only IP known to the client. The RIPs here are on
a different network to the VIP (although with only 1 NIC on
the director, the VIP and the RIPs are on the same wire).
	</para>
	<para>
In normal NAT, masquerading is the rewriting of packets
originating behind the NAT box.
With LVS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP)
is rewritten by the director (becoming CIP->RIP).
The action of the LVS director is called demasquerading.
The demasqueraded packet is forwarded to the realserver.
The reply packet (RIP->CIP) is generated by the realserver.
	</para>
	</section>
	<section id="NAT_default_gw">
	<title>All packets sent from the LVS-NAT realserver to the client must go through the LVS-NAT director</title>
	<para>
For LVS-NAT to work
	</para>
	<itemizedlist>
		<listitem>
<emphasis>all packets from the realservers to the client must go through the director.</emphasis>
		</listitem>
	</itemizedlist>
	<para>
Forgetting to set this up is the single most common cause of failure
when setting up a LVS-NAT LVS.
	</para>
	<para>
The original (and the simplest from the point of view of setup)
way is to make the DIP
(on the director) the default gw for the packets from the realserver.
The documentation here all assumes you'll be using this method.
(Any IP on the director will do, but in the case where you
have two directors in active/backup failover, you have an
IP that is moved to the active director and this is called the DIP).
Any method of making the return packets go through the
director will do.
With the arrival of the <xref linkend="LVS-HOWTO.policy_routing"/> tools,
you can route packets according to any parameter in the packet header
(<emphasis>e.g.</emphasis> src_addr, src_port, dest_addr).
Here's an example of ip rules on the realserver to
route packets from the RIP to an IP on the director.
This avoids having to route these packets via a default gw.
	</para>
	<para>
Neil Prockter <emphasis>prockter (at) lse (dot) ac (dot) uk</emphasis> 30 Mar 2004
	</para>
	<blockquote>
		<para>
you can avoid using the director as the default gw by
		</para>
<programlisting><![CDATA[
realserver# echo 80 lvs >> /etc/iproute2/rt_tables
realserver# ip route add default via <address on director, eg DIP> table lvs
realserver# ip rule add from <RIP> table lvs
]]></programlisting>
		<para>
For the IPs in the example at <ulink url="http://www.linuxvirtualserver.org/VS-NAT.html">
Virtual Server via NAT</ulink>
(http://www.linuxvirtualserver.org/VS-NAT.html), this becomes
		</para>
<programlisting><![CDATA[
echo 80 lvs >> /etc/iproute2/rt_tables
ip route add default via 172.16.0.1 table lvs
ip rule add from 172.16.0.2 table lvs
]]></programlisting>
		<para>
I do this with lvs and with cisco css units
		</para>
	</blockquote>
	<para>
Here Neil is routing packets from RIP to 0/0 via DIP.
You can be more restrictive and route packets from RIP:port
(where port is the LVS'ed service) to 0/0 via DIP.
Packets from RIP:other_ports can be routed via other rules.
	</para>
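	<para>
A minimal sketch of the more restrictive form, assuming the realserver
has <command>iptables</command>, that http (port 80) is the LVS'ed service,
and using RIP=10.1.1.2, DIP=10.1.1.9 and a firewall mark of 1 (the addresses
and the mark value are placeholders, not part of Neil's posting).
Older versions of <command>ip rule</command> cannot match on ports,
so the replies are marked with <command>iptables</command> first and
then routed by mark.
	</para>
<programlisting><![CDATA[
# assumed values: RIP=10.1.1.2, DIP=10.1.1.9, fwmark=1 (arbitrary)
realserver# echo 80 lvs >> /etc/iproute2/rt_tables
realserver# ip route add default via 10.1.1.9 table lvs
# mark only the replies from the LVS'ed service (RIP:http)
realserver# iptables -t mangle -A OUTPUT -p tcp -s 10.1.1.2 --sport 80 -j MARK --set-mark 1
# marked packets are looked up in table lvs, i.e. sent via the director
realserver# ip rule add fwmark 1 table lvs
]]></programlisting>
	<para>
Packets from other ports on the RIP carry no mark and follow the
realserver's main routing table.
	</para>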
	<para>
For a 2 NIC director (with different physical networks for the realservers and the
clients), it is enough for the default gw of the realservers to be the director.
For a 1 NIC, two network setup (where the two networks are using the
same link layer), in addition, the realservers must only have
routes to the director. For a 1 NIC, 1 network setup, ICMP redirects
must be turned off on the director (see <xref linkend="one_network"/>)
(the configure script does this for you).
	</para>
	<para>
In a normal server farm, the default gw of the realserver
would be the router to the internet and the packet RIP->CIP would
be sent directly to the client.
In a LVS-NAT LVS, the default gw of the realservers must be the director.
The director masquerades the packet from the
realserver (rewrites it to VIP->CIP) and the client
receives a rewritten packet with the expected source IP of the VIP.
	</para>
	<para>
		<note>
The packet must be routed via the director;
there must be no other path to the client.
A packet arriving at the client directly from the
realserver, rather than going through the director,
will not be seen as a reply to the client's
request and the connection will hang.
If the director is not the default gw for the realservers,
then if you use tcpdump on the director to watch an attempt
to telnet from the client to the VIP
(run tcpdump with `tcpdump port telnet`),
you will see the request packet (CIP->VIP),
the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP).
You will not see the rewritten reply packet (VIP->CIP).
(Remember that if you have a switch on the
realservers' network, rather than a hub, then each node
only sees the packets to/from itself. tcpdump won't see
packets passing between other nodes on the same network.)
		</note>
	</para>
	<para>
Part of the setup of LVS-NAT then is to make sure
that the reply packet goes via the director,
where it will be rewritten to have the addresses (VIP->CIP).
In some cases (<emphasis>e.g.</emphasis> <link linkend="one_network">1 net LVS-NAT</link>)
icmp redirects have to be turned off
on the director so that the realserver doesn't get a redirect to forward
packets directly to the client.
	</para>
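	<para>
As a sketch, the redirects can be turned off through
<command>sysctl</command>, assuming eth0 is the director's interface
facing the realservers (the interface name is an assumption; use your own).
The kernel sends redirects on an interface if either the
<filename>all</filename> or the per-interface value is set,
so both are cleared here.
	</para>
<programlisting><![CDATA[
# clear the global, default and per-interface settings
director# sysctl -w net.ipv4.conf.all.send_redirects=0
director# sysctl -w net.ipv4.conf.default.send_redirects=0
director# sysctl -w net.ipv4.conf.eth0.send_redirects=0
]]></programlisting>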
	<para>
In a production system, a router would prevent
a machine on the outside exchanging packets with machines on the RIP network.
As well, the realservers
will be on a private network (eg 192.168.x.x/24) and replies will not be routable.
	</para>
	<para>
In a test setup (no router), these safeguards don't exist.
All machines (client, director, realservers) are on the same piece of wire and
if routing information is added to the hosts,
the client can connect to the realservers independently of the LVS.
This will stop LVS-NAT from working (your connection will hang),
or it may appear to work (you'll be connecting directly to the realserver).
	</para>
	<para>
In a test setup, traceroute from the realserver to the client
should go through the director (2 hops in the above diagram).
The configure script
will test that the director's gw is 2 hops from
the realserver and that the route to the director's gw is via
the director, preventing this error.
	</para>
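	<para>
If you are not using the configure script, the same check can be
done by hand. This is only a sketch using the addresses of the
example above (CIP=192.168.1.254, DIP=10.1.1.9); substitute your own.
	</para>
<programlisting><![CDATA[
# ask the kernel which route it would use to reach the client;
# the answer should name the DIP as the next hop (via 10.1.1.9)
realserver# ip route get 192.168.1.254
# the first hop should be the director, the second the client
realserver# traceroute -n 192.168.1.254
]]></programlisting>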
	<para>
(Thanks to James Treleaven <emphasis>jametrel (at) enoreo (dot) on (dot) ca</emphasis> 28 Feb 2002, for
clarifying the write up on the ping tests here.)
	</para>
	<para>
In a test setup with the client connected directly to the director
(in the setup above with 1 or 2 NICs, or the
<link linkend="one_network">one NIC, one network LVS-NAT</link> setup), you can ping
between the client and realservers.
However in production, with the client out on internet land,
and the realservers with unroutable IPs,
you should not be able to ping between the realservers and the client.
The realservers should not know
about any other network than their own (here 10.1.1.0). The
connection from the realservers to the client is through
ipchains (for 2.2.x kernels) and LVS-NAT tables set up by the director.
	</para>
	<para>
In my first attempt at LVS-NAT setup, I had all machines on a
192.168.1.0 network and added a 10.1.1.0 private network for the
realservers/director, without removing the 192.168.1.0 network on
the realservers. All replies from the servers were routed onto
the 192.168.1.0 network rather than back through LVS-NAT and the
client didn't get any packets back.
	</para>
	<para>
Here's the general setup I use for testing.
The client (192.168.2.254) connects to the VIP on the director.
(The VIP on the realserver is present only for LVS-DR and LVS-Tun.)
For LVS-DR, the default gw for the realservers is 192.168.1.254.
For LVS-NAT, the default gw for the realservers is 192.168.1.9.
	</para>
<programlisting><![CDATA[
        ____________
       |            |192.168.1.254 (eth1)
       |  client    |----------------------
       |____________|                     |
     CIP=192.168.2.254 (eth0)             |
              |                           |
              |                           |
     VIP=192.168.2.110 (eth0)             |
        ____________                      |
       |            |                     |
       |  director  |                     |
       |____________|                     |
     DIP=192.168.1.9 (eth1, arps)         |
              |                           |
           (switch)------------------------
              |
     RIP=192.168.1.2 (eth0)
     VIP=192.168.2.110 (for LVS-DR, lo:0, no_arp)
        _____________
       |             |
       | realserver  |
       |_____________|
]]></programlisting>
	<para>
This setup works for both LVS-NAT and LVS-DR.
	</para>
	<para>
Here's the routing table for one of the realservers as in the LVS-NAT
setup.
	</para>
<programlisting><![CDATA[
realserver:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         192.168.1.9     0.0.0.0         UG       40 0          0 eth0
]]></programlisting>
	<para>
Here's a traceroute from the realserver to the client showing 2 hops.
	</para>
<programlisting><![CDATA[
traceroute to client2.mack.net (192.168.2.254), 30 hops max, 40 byte packets
 1  director.mack.net (192.168.1.9)  1.089 ms  1.046 ms  0.799 ms
 2  client2.mack.net (192.168.2.254)  1.019 ms  1.16 ms  1.135 ms
]]></programlisting>
	<para>
Note the traceroute from the client box to the realserver only has one hop.
	</para>
	<para>
ICMP redirects are on at the director, but the director doesn't issue
a redirect (see <xref linkend="icmp_redirects"/>)
because the packet RIP->CIP from the realserver emerges from a different
NIC on the director than it arrived on (and with different source IP).
The client machine doesn't send a redirect since it is not forwarding packets,
it's the endpoint of the connection.
	</para>
	</section>
	<section id="lvs_nat_configure_script">
	<title>Run the configure script</title>
	<para>
Use lvs_nat.conf as a template (the sample here will set up LVS-NAT in the
diagram above assuming the realservers are already on the network and using
the DIP as the default gw).
			</para><para>
<programlisting><![CDATA[
#--------------lvs_nat.conf----------------------
LVS_TYPE=VS_NAT
INITIAL_STATE=on

#director setup:
VIP=eth0:110 192.168.1.110 255.255.255.0 192.168.1.255
DIP=eth0:10 10.1.1.10 10.1.1.0 255.255.255.0 10.1.1.255

#Services on realservers:
#telnet to 10.1.1.2
SERVICE=t telnet wlc 10.1.1.2:telnet
#http to 10.1.1.2 (with weight 2), to a high port on 10.1.1.3 and to 10.1.1.4
SERVICE=t 80 wlc 10.1.1.2:http,2 10.1.1.3:8080 10.1.1.4

#realserver setup (nothing to be done for LVS-NAT)

#----------end lvs_nat.conf------------------------------------
]]></programlisting>
			</para><para>
The output is a commented rc.lvs_nat file.
Run the rc.lvs_nat file on the director and then the realservers
(the script knows whether it is running on a director or realserver).
			</para><para>
The configure script
will set up masquerading and forwarding on
the director and the default gw for the realservers.
	</para>
	</section>
	<section id="lvs_nat_demasquerading">
	<title>Setting up demasquerading on the director; 2.4.x and 2.2.x</title>
	<para>
The packets coming in from the client are being demasqueraded by the director.
	</para><para>
In 2.2.x you need to masquerade the replies.
Here's the masquerading code in rc.lvs_nat, that runs on the
director (produced by the configure script).
			</para><para>
<programlisting><![CDATA[
        echo "turning on masquerading "
        #setup masquerading
        echo "1" >/proc/sys/net/ipv4/ip_forward
        echo "installing ipchain rules"
        /sbin/ipchains -A forward -p tcp -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0
	#repeated for each realserver and service
	..
	..
        echo "ipchain rules "
        /sbin/ipchains -L
]]></programlisting>
			</para><para>
In this example, the http replies from the realserver are masqueraded
by the director. This allows the realserver to reply to the http requests
from the client, which are demasqueraded by the director as part of the 2.2.x LVS code.
			</para><para>
In 2.4.x masquerading of LVS'ed services is done explicitly
by the LVS code and no extra masquerading (by iptables)
commands need be run.
	</para>
	</section>
	<section id="re-mapping_ports_lvs_nat" xreflabel="Re-mapping ports with LVS-NAT">
	<title>rewriting, re-mapping, translating ports with LVS-NAT</title>
	<para>
One of the features of LVS-NAT is that you can rewrite/re-map the
ports. Thus the client can connect to VIP:http, while the realserver can
be listening on some other port (!http). You set this up with <command>ipvsadm</command>.
	</para>
	<para>
Here the client connects to VIP:http, the director rewrites the packet header so that
dst_addr=RIP:9999 and forwards the packet to the realserver,
where the httpd is listening on RIP:9999.
	</para>
<programlisting><![CDATA[
director:/# /sbin/ipvsadm -a -t VIP:http -r RIP:9999 -m -w 1
]]></programlisting>
	<para>
For each realserver (<emphasis>i.e.</emphasis> each RIP) you can
rewrite the ports differently: each realserver could have the httpd
listening on its own particular port (<emphasis>e.g.</emphasis> RIP1:9999, RIP2:80,
RIP3:xxxx).
	</para>
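	<para>
A sketch of such a setup with two realservers, using addresses
assumed from the earlier example (VIP=192.168.1.110, RIP1=10.1.1.2,
RIP2=10.1.1.3); the port numbers are only illustrative.
	</para>
<programlisting><![CDATA[
# virtual service on VIP:80, round robin
director:/# /sbin/ipvsadm -A -t 192.168.1.110:80 -s rr
# realserver1 has its httpd on port 9999, realserver2 on the standard port
director:/# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:9999 -m -w 1
director:/# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:80 -m -w 1
# list the table numerically to check the port mappings
director:/# /sbin/ipvsadm -L -n
]]></programlisting>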
	<para>
Although port re-mapping is not possible with LVS-DR or LVS-Tun, it's
possible to use <command>iptables</command> to do
<xref linkend="re-mapping_ports_lvs_dr"/> (and LVS-Tun) on the realserver,
producing the same result.
	</para>
	</section>
	<section id="masquerade_timeouts">
	<title>masquerade timeouts</title>
	<para>
For the earlier versions of LVS-NAT (with 2.0.36 kernels) the
timeouts were set in linux/include/net/ip_masq.h. The default
values of the masquerading timeouts are:
	</para><para>
<programlisting><![CDATA[
        #define MASQUERADE_EXPIRE_TCP 15*60*HZ
        #define MASQUERADE_EXPIRE_TCP_FIN 2*60*HZ
        #define MASQUERADE_EXPIRE_UDP 5*60*HZ
]]></programlisting>
	</para>
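	<para>
With later kernels the timeouts can be changed at run time rather than
by editing the header file. A sketch follows; the three values
(in seconds, for TCP, TCP after FIN, and UDP) are only illustrative
and are not a recommendation.
	</para>
<programlisting><![CDATA[
# 2.2.x kernels: the masquerading timeouts are set with ipchains
director# ipchains -M -S 900 120 300
# 2.4.x and later: the LVS timeouts are set with ipvsadm
director# ipvsadm --set 900 120 300
# show the current LVS timeouts
director# ipvsadm -L --timeout
]]></programlisting>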
	</section>
	<section id="lvs_nat_julians_setup">
	<title>Julian's step-by-step check of a L4 LVS-NAT setup</title>
	<para>
Julian has his latest fool-proof setup doc at
<ulink url="http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt">
Julian's software page</ulink>.
Here's the version, at the time I wrote this entry.
			</para><para>
<programlisting><![CDATA[
Q.1 Can the realserver ping client?

	rs# ping -n client

A.1 Yes => good
A.2 No => bad

	Some settings for the director:

	Linux 2.2/2.4:
	ipchains -A forward -s RIP -j MASQ

	Linux 2.4:
	iptables -t nat -A POSTROUTING -s RIP -j MASQUERADE

Q.2 Traceroute to client goes through LVS box and reaches the client?

	traceroute -n -s RIP CLIENT_IP

A.1 Yes => good
A.2 No => bad

	same ipchains command as in Q.1

	For client and server on same physical media use these
	in the director:

	echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
	echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects


Q.3 Is the traffic forwarded from the LVS box, in both directions?

	For all interfaces on director:
	tcpdump -ln host CLIENT_IP

	The right sequence, i.e. the IP addresses and ports on each
	step (the reversed for the in->out direction are not shown):

	CLIENT
	   | CIP:CPORT -> VIP:VPORT
	   |		||
	   |		\/
 out	   | CIP:CPORT -> VIP:VPORT
 ||	LVS box
 \/	   | CIP:CPORT -> RIP:RPORT
 in	   |		||
	   |		\/
	   | CIP:CPORT -> RIP:RPORT
	   +
	REAL SERVER

A.1 Yes, in both directions => good (for Layer 4, probably not for L7)
A.2 The packets from the realserver are dropped => bad:

	- rp_filter protection on the incoming interface, probably
	hit from local client (for more info on rp_filter, see
	the section on the proc filesystem in this HOWTO)
	- firewall rules drop the replies

A.3 The packets from the realservers leave the director unchanged

	- missing -j MASQ ipchains rule in the LVS box

	For client and server on same physical media:

	The packets simply does not reach the director. The real
	server is ICMP redirected to the client. In director:

	echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
	echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects

A.4 All packets from the client are dropped

	- the requests are received on wrong interface with rp_filter
	protection
	- firewall rules drop the requests

A.5 The client connections are refused or are served from service
in the LVS box

	- client and LVS are on same host => not valid
	- the packets are not marked from the firewall and don't hit
	firewall mark based virtual service

Q.4 Is the traffic replied from the realserver?

	For the outgoing interface on realserver:

	tcpdump -ln host CLIENT_IP

A.1 Yes, SYN+ACK => good
A.2 TCP RST => bad, No listening real service
A.3 ICMP message => bad, Blocked from Firewall/No listening service
A.4 The same request packet leaves the realserver => missing accept
	rules or RIP is not defined
A.5 No reply => realserver problem:

	- the rp_filter protection drops the packets
	- the firewall drops the request packets
	- the firewall drops the replies

A.6 Replies go through another device or don't go to the LVS
box => bad

	- the route to the client is direct and so don't pass the LVS
	box, for example:

		- client on the LAN
		- client and realserver on same host

	- wrong route to the LVS box is used => use another

	Check the route:

	rs# ip route get CLIENT_IP from RIP


The result: start the following tests

rs# tcpdump -ln host CIP
rs# traceroute -n -s RIP CIP
lvs# tcpdump -ln host CIP
client# tcpdump -ln host CIP


For more deep problems use tcpdump -len, i.e. sometimes the link layer
addresses help a bit.


For FTP:

	VS-NAT in Linux 2.2 requires:

	- modprobe ip_masq_ftp (before 2.2.19)
	- modprobe ip_masq_ftp in_ports=21 (2.2.19+)

	VS-NAT in Linux 2.4 requires:

	- ip_vs_ftp

	VS-DR/TUN require persistent flag


	FTP reports with debug mode enabled are useful:

	# ftp
	ftp> debug
	ftp> open my.virtual.ftp.service
	ftp> ...
	ftp> dir
	ftp> passive
	ftp> dir

	There are reports that sometimes the status strings reported
	from the FTP realservers are not matched with the string
	constants encoded in the kernel FTP support. For example,
	Linux 2.2.19 matches
	"227 Entering Passive Mode (xxx,xxx,xxx,xxx,ppp,ppp)"


Julian Anastasov
]]></programlisting>

	</para>
	</section>
	<section id="lvs_nat_how_it_works">
	<title>How LVS-NAT works</title>
	<para>
On the director, <command>ipvsadm</command> sets up the virtual service as follows.
	</para>
<programlisting><![CDATA[
#setup connection for telnet, using round robin
director:/etc/lvs# /sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
#connections to x.x.x.110:telnet are sent to
#                 realserver 10.1.1.2:telnet
#using LVS-NAT (the -m) with weight 1
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
#and to realserver 10.1.1.3
#using LVS-NAT with weight 2
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.3:23 -m -w 2
]]></programlisting>
	<para>
(if the service was http instead of telnet,
the webserver on the realserver could be listening on port 8000 instead of 80)
	</para>
	<para>
Turn on ip_forwarding (so that the packets can be forwarded to the realservers)
	</para>
<programlisting><![CDATA[
director:/etc/lvs# echo "1" > /proc/sys/net/ipv4/ip_forward
]]></programlisting>
	<para>
Example: client requests a connection to 192.168.1.110:23
	</para>
	<para>
director chooses realserver 10.1.1.2:23, updates connection tables, then
	</para>
<programlisting><![CDATA[
packet                source                        dest
incoming              CIP:3456                      VIP:23
inbound rewriting     CIP:3456                      RIP1:23
reply (routed to DIP) RIP1:23                       CIP:3456
outbound rewriting    VIP:23                        CIP:3456
]]></programlisting>

	<para>
The client gets back a packet with the source_address = VIP.
	</para>
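	<para>
To watch this happening, you can list the LVS tables on the director
while the client is connected. A sketch, assuming a 2.4.x or later
director with a version of <command>ipvsadm</command> that supports
the connection listing option.
	</para>
<programlisting><![CDATA[
# virtual services and their realservers (numeric output)
director:/etc/lvs# /sbin/ipvsadm -L -n
# current connection entries: client, virtual and realserver address:port
director:/etc/lvs# /sbin/ipvsadm -L -n -c
]]></programlisting>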
	<para>
For the verbally oriented...
	</para>
	<para>
The request packet is sent to the VIP. The director looks up its
tables and sends the connection to realserver1. The packet is
rewritten with a new destination (in this case with the same
port, but the port could be changed too) and sent to RIP1. The
realserver replies, sending back a packet to the client. The
default gw for the realserver is the director. The director
accepts the packet and rewrites the packet to have source=VIP and
sends the rewritten packet to the client.
	</para>
	<para>
Why isn't the source of the incoming packet rewritten to be the
DIP or VIP?
	</para>
	<para>
Wensong
	</para>
	<blockquote>
		<para>
...changing the source of the packet to the VIP sounds good too,
it doesn't require that default route rule, but requires additional
code to handle it.
		</para>
	</blockquote>
	</section>
	<section id="lvs_nat_src_addr_reply">
	<title>
In LVS-NAT, how do packets get back to the client, or how does the
director choose the VIP as the source_address for the outgoing packets?
	</title>
	<note>
	<para>
This was written for 2.0.x and 2.2.x kernel LVSs, which were based on
the masquerading code.
With 2.4.x, LVS is based on netfilter and there were initially some problems
getting LVS-NAT to work with 2.4.x.
What happens here for 2.4.x, I don't know.
	</para>
	</note>
	<para>
Joe
	</para>
	<para>
In normal NAT, where a bunch of machines are sitting
behind a NAT box, all outward going packets are given
the IP on the outside of the NAT box.
What if there are several IPs facing the outside world?
For NAT it doesn't really matter as long as the
same IP is used for all packets.
The default value is usually the first interface address (eg eth0).
With LVS-NAT you want the outgoing packets to have the
source of the VIP (probably on eth0:1) rather than the
IP on the main device on the director (eth0).
	</para>
	<para>
With a single realserver LVS-NAT LVS serving telnet,
the incoming packet does this,
	</para>
<programlisting><![CDATA[
CIP:high_port -> VIP:telnet     #client sends a packet
CIP:high_port -> RIP:telnet     #director demasquerades packet, forwards to realserver
RIP:telnet    -> CIP:high_port  #realserver replies
]]></programlisting>
	<para>
The reply arrives on the director (being sent
there because the director is the default
gw for the realserver).
To get the packet from the director to the
client, you have to reverse the masquerading
done by the LVS. To do this (in 2.2 kernels),
on the director
you add an ipchains rule
	</para>
<programlisting><![CDATA[
director:# ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0
]]></programlisting>
	<para>
If the director has multiple IPs facing the outside
world (eg eth0=192.168.2.1 the regular IP for the director
and eth0:1=192.168.2.110 the VIP), the masquerading code
has to choose the correct IP for the outgoing packet.
Only the packet with src_addr=VIP will be accepted by
the client. A packet with any other src_addr will be
dropped. The normal default for masquerading (eth0)
should not be used in this case. The required m_addr
(masquerade address) is the VIP.
	</para>
	<para>
Does LVS fiddle with the ipchains tables to do this?
	</para>
	<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 01 May 2001
	</para>
	<blockquote>
		<para>
        No, ipchains only delivers packets to the masquerading code.
It doesn't matter how the packets are selected in the ipchains
rule.
		</para>
		<para>
The m_addr (masqueraded_address)
is assigned when the first packet is seen (the connect
request from the client to the VIP).
LVS sees the first packet in the LOCAL_IN chain when it comes from
the client. LVS assigns the VIP as m_addr.
		</para>
		<para>
The MASQ code sees the first packet in the FORWARD chain when
there is a -j MASQ target in the ipchains rule. The routing selects
the m_addr. If the connection already exists the packets are masqueraded.
		</para>
		<para>
The LVS can see packets in the FORWARD chain but they are for already
created connections, so no m_addr is assigned and the packets are
masqueraded with the address saved in the connections structure (the
VIP) when it was created.
		</para>
		<para>
There are 3 common cases:
		</para>
		<orderedlist>
			<listitem>
The connection is created as response to packet.
			</listitem>
			<listitem>
The connection is created as response to packet to another connection.
			</listitem>
			<listitem>
The connection is already created
			</listitem>
		</orderedlist>
		<para>
Case (1) can happen in the plain masquerading case where the in->out
packets hit the masquerading rule. In this case when nobody recommends
the s_addr for the packets going to the external side of the MASQ, the
masq code uses the routing to select the m_addr for this new connection.
This address is not always the DIP, it can be the preferred source
address for the used route, for example, address from another device.
		</para>
		<para>
Case (1) happens also for LVS but in this case we know:
		</para>
		<itemizedlist>
			<listitem>
the client address/port (from the received datagram)
			</listitem>
			<listitem>
the virtual server address/port (from the received datagram)
			</listitem>
			<listitem>
the realserver address/port (from the LVS scheduler)
			</listitem>
		</itemizedlist>
		<para>
But this is on out->in packet and we are talking about in->out packets
		</para>
		<para>
Case (2) happens for related connections where the new connection can
be created when all addresses and ports are known or when the protocol
requires some wildcard address/port matching, for example, ftp. In
this case we expect the first packet for the connection after some
period of time.
		</para>
		<para>
It seems you are interested how case (3) works. The answer is that the
NAT code remembers all these addresses and ports in a connection
structure with these components
		</para>
		<itemizedlist>
			<listitem>
external address/port (LVS: client)
			</listitem>
			<listitem>
masquerading address/port (LVS: virtual server)
			</listitem>
			<listitem>
internal address/port (LVS: realserver)
			</listitem>
			<listitem>
protocol
			</listitem>
			<listitem>
etc
			</listitem>
		</itemizedlist>
		<para>
LVS and the masquerading code simply hook in the packet path
and they perform the header/data mangling. In this process they use the
information from the connection table(s). The rule is simple: when a
packet is already for established connection we must remember all
addresses and ports and always to use same values when mangling the
packet header. If we select each time different addresses or ports
we simply break the connection. After the packet is mangled the routing
is called to select the next hop. Of course, you can expect problems
if there are fatal route changes.
		</para>
		<para>
	So, the short answer is: the LVS knows what m_addr to use when
a packet from the realserver is received because the connection is
already created and we know what addresses to use. Only in the
masquerading case (where LVS is not involved) connections can be
created and a masquerading address to be selected without using rule
for this. In all other cases there is a rule that recommends what
addresses to be used at creation time. After creation the same values
are used.
		</para>
	</blockquote>
		<section id="VIP_is_primary_IP">
		<title>So make the VIP the primary IP on the outside of the director</title>
		<blockquote>
			<para>
Wayne <emphasis>wayne (at) compute-aid (dot) com</emphasis>
26 Apr 2000
			</para>
			<para>
Any web server behind the LVS box using LVS-NAT can
initiate communication to the Internet.  However, it is not
using the farm IP address,  rather it is using the
masquerading IP address -- the actual IP address of the
interface.  Is there an easy way to let the server in NAT mode
go out as the farm IP address?
			</para>
		</blockquote>
		<para>
Lars
		</para>
		<para>
No. This is a limitation in the 2.2 masquerading code. It will always use the
first address on the interface.
		</para>
		<blockquote>
			<para>
We tried and it works!  We put the VIP on eth0, and the RIP on eth0:1 in
NAT mode and it works fine.  We just need to figure out how to do it
at reboot, since this is done by playing with the ifconfig command.
Once we swap them around, the outgoing IP address is the VIP
address.  But if the LVS box reboots, you have to redo it.
			</para>
		</blockquote>

		<para>
Joe:
		</para>
		<para>
! :-)
I didn't realise you were in VS-NAT mode, therefore not having the
VIP on the realservers. I thought you must be in VS-DR.
		</para>
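		<para>
With netfilter (2.4.x and later kernels), the source address for
connections initiated by the realservers can be chosen explicitly,
rather than depending on the order of the addresses on the interface.
This is a sketch only, assuming eth0 is the outside interface of the
director, the realservers are on 10.1.1.0/24 and the VIP is 192.168.1.110;
it has not been tested against the setup described in the posting above.
Putting the rule in a startup script makes it survive a reboot.
		</para>
<programlisting><![CDATA[
# connections started by the realservers leave the director with src_addr=VIP
director# iptables -t nat -A POSTROUTING -o eth0 -s 10.1.1.0/24 -j SNAT --to-source 192.168.1.110
]]></programlisting>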
		</section>
	</section>
	<section id="one_network" xreflabel="One Network LVS-NAT">
	<title>One Network LVS-NAT</title>
	<note>
According to Malcolm Turnbull, this is called "One Arm NAT" in the commercial world
(<emphasis>i.e.</emphasis> one nic and one network)
	</note>
	<para>
The disadvantage of the 2 network LVS-NAT is that the realservers
are not able to connect to machines in the network of the VIP.
You couldn't make a LVS-NAT setup out of machines already on
your LAN, which were also required for other purposes to stay
on the LAN network.
	</para>
	<para>
Here's a one network LVS-NAT LVS.
	</para>
<programlisting><![CDATA[
                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                           |
             __________    |
            |          |   |   VIP=192.168.1.110 (eth0:110)
            | director |---|
            |__________|   |   DIP=192.168.1.9 (eth0:9)
                           |
                           |
          ------------------------------------
          |                |                 |
          |                |                 |
   RIP1=192.168.1.2   RIP2=192.168.1.3  RIP3=192.168.1.4 (all eth0)
    _____________      _____________     _____________
   |             |    |             |   |             |
   | realserver  |    | realserver  |   | realserver  |
   |_____________|    |_____________|   |_____________|
]]></programlisting>
	<para>
The problem:
	</para>
	<para>
A return packet from the realserver (with address RIP->CIP)
will be sent to the realserver's default gw (the director).
What you want is for the director to accept the packet and
to masquerade it, sending it on to the client as
a packet with address (VIP->CIP).
With ICMP redirects on, the director will realise that
there is a better route for this packet,
<emphasis>i.e.</emphasis> directly from the realserver
to the client and will send an ICMP redirect to the realserver,
informing it of the better route.
As a result, the realserver will send subsequent packets directly to the
client and the reply packet will not be masqueraded by the director.
The client will get a reply from the RIP rather than the VIP
and the connection will hang.
	</para>
	<para>
The cure:
	</para>
	<para>
Thanks to
<emphasis>michael_e_brown (at) dell (dot) com</emphasis>
and Julian <emphasis>ja (at) ssi (dot) bg</emphasis>
for help sorting this out.
	</para>
	<para>
To get a LVS-NAT LVS to work on one network -
	</para>
	<orderedlist>
		<listitem>
		<para>
On the director, turn off ICMP redirects on the NIC
that is the default gw for the realservers.
(Note: eth0 may be eth1 etc, on your machine).
		</para>
<programlisting><![CDATA[
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
]]></programlisting>
		</listitem>
		<listitem>
		<para>
Make the director the default and only route for outgoing packets.
		</para>
		<para>
You will probably have set the routing on the realserver up like this
		</para>
<programlisting><![CDATA[
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
]]></programlisting>
		<para>
Note the route to 192.168.1.0/24.
This route allows the realserver to send packets
to the client by just putting them out on eth0,
where the client will pick them up directly
(without being demasqueraded) and the LVS will not work.
This route also allows the realservers to talk to each
other directly <emphasis>i.e.</emphasis> without routing packets
through the director.
(As the admin, you might want to telnet
from one realserver to another, or you might have
ntp running, sending ntp packets between realservers.)
		</para>
		<para>
Remove the route to 192.168.1.0/24.
		</para>
<programlisting><![CDATA[
realserver:/etc/lvs# route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
]]></programlisting>
		<para>
This will leave you with
		</para>
<programlisting><![CDATA[
realserver:/etc/lvs# netstat -r
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
]]></programlisting>
		</listitem>
	</orderedlist>
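	<para>
A quick way to check the second step on a realserver is to look for a
directly connected route to the LAN in the numeric routing table.
The helper below is only a sketch: <command>has_onlink_route</command> is a made-up name,
and it assumes the column layout shown in the listings above
(Destination in column 1, Gateway in column 2, with 0.0.0.0 as the gateway of a direct route).
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Hypothetical check: read `route -n` (or `netstat -rn`) output on stdin and
# succeed (exit 0) if a directly connected route to network $1 is present.
has_onlink_route() {
    net=$1
    awk -v net="$net" '
        $1 == net && $2 == "0.0.0.0" { found = 1 }
        END { exit !found }
    '
}
]]></programlisting>
	<para>
For example, <command>route -n | has_onlink_route 192.168.1.0 &amp;&amp; echo "direct route still present"</command>
should print nothing once the route to 192.168.1.0/24 has been removed.
	</para>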
	<para>
Now packets RIP->CIP have to go via the director and will be demasqueraded.
The LVS-NAT LVS now works.
If LVS is forwarding telnet,
you can telnet from the client to the VIP and connect to the realserver.
As a side effect,
packets between the realservers are also routed via the director,
rather than going directly (note: all packets now go via the director).
(You can live with that.)
	</para>
	<para>
You can ping from the client to the realserver.
	</para>
	<para>
You can also connect <emphasis>directly</emphasis> to services on the realserver <emphasis>not</emphasis>
being forwarded by LVS (in this case <emphasis>e.g.</emphasis> ftp).
	</para>
	<para>
You can no longer connect directly to the realserver for services
being forwarded by the LVS. (In the example here, telnet ports are
not being rewritten by the LVS, <emphasis>i.e.</emphasis> telnet->telnet).
	</para>
<programlisting><![CDATA[
client:~# telnet realserver
Trying 192.168.1.11...
^C
(i.e. connection hangs)
]]></programlisting>
	<para>
Here's tcpdump on the director. Since the network is switched
the director can't see packets between the client and realserver.
The client initiates telnet. `netstat -a` on the client
shows a SYN_SENT from port 4121.
	</para>
<programlisting><![CDATA[
director:/etc/lvs# tcpdump
tcpdump: listening on eth0
16:37:04.655036 realserver.telnet > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 <mss 1460,sackOK,timestamp 111425176[|tcp]> (DF)
16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]
]]></programlisting>
	<para>
(repeats every second until I kill telnet on client)
	</para>
	<para>
The director doesn't see the connect request from client->realserver.
The first packet seen is the <command>ack</command> from the realserver,
which will be forwarded via the director.
The director will rewrite the <command>ack</command> to be from the director.
The client will not accept an <command>ack</command> to port 4121 from director:telnet.
	</para>
	<para>
Julian 2001-01-12
	</para>
	<para>
	The redirects are handled in net/ipv4/route.c:ip_route_input_slow(),
<emphasis>i.e.</emphasis> from the routing and before reaching LVS (in LOCAL_IN):
	</para>
<programlisting><![CDATA[
        if (out_dev == in_dev && err && !(flags&(RTCF_NAT|RTCF_MASQ)) &&
            (IN_DEV_SHARED_MEDIA(out_dev)
             || inet_addr_onlink(out_dev, saddr, FIB_RES_GW(res))))
                flags |= RTCF_DOREDIRECT;
]]></programlisting>
	<para>
	Here RTCF_NAT and RTCF_MASQ are flags used by the dumb NAT code,
but the masquerading defined with ipchains -j MASQ does not set
these flags. The result: the redirect is sent according to
the conf/{all,&lt;device&gt;}/send_redirects from ip_rt_send_redirect() and
ip_forward() from net/ipv4/ip_forward.c. So, the meaning is: if we are
going to forward a packet and the in_dev is the same as the out_dev, we redirect
the sender to the directly connected destination which is on the same
shared media. The ipchains code in the FORWARD chain is reached too late
to avoid sending these redirects. They have already been sent by the time the -j MASQ
is detected.
	</para>
	<para>
	If all/send_redirects is 1, every &lt;device&gt;/send_redirects
is ignored. So, if we leave it at 1, redirects are sent. To stop them we
need all=0 &amp;&amp; &lt;device&gt;=0. default/send_redirects is the value that will
be inherited by each new interface that is created.
	</para>
	<para>
	The logical operation between conf/all/&lt;var&gt; and
conf/&lt;device&gt;/&lt;var&gt; is different for each var. The used operation is
specified in <filename>/usr/src/linux/include/linux/inetdevice.h</filename>
	</para>
	<para>
	For send_redirects it is '||'. For others, for example for
conf/{all,&lt;device&gt;}/hidden, it is '&amp;&amp;'.
	</para>
	<para>
So, for the two logical operations we have:
	</para>
<programlisting><![CDATA[
For &&:

all	<dev>	result
------------------------------
0	0	0
0	1	0
1	0	0
1	1	1

For ||:

all	<dev>	result
------------------------------
0	0	0
0	1	1
1	0	1
1	1	1
]]></programlisting>
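	<para>
As a rough sketch of the '||' rule for send_redirects, the helper below
combines an "all" value and a per-device value the way the second table does,
and a second function reads the live values from /proc.
The function names are made up for this example and eth0 is an assumption;
nothing is written to /proc.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Hypothetical helper: combine conf/all and conf/<device> values for
# send_redirects with the '||' rule. Prints 1 if redirects would be
# sent, 0 otherwise.
effective_send_redirects() {
    all=$1
    dev=$2
    if [ "$all" -ne 0 ] || [ "$dev" -ne 0 ]; then
        echo 1
    else
        echo 0
    fi
}

# Read the current values for one interface (default eth0, an assumption)
# and report the combined result. This only reads /proc.
report_send_redirects() {
    ifname=${1:-eth0}
    base=/proc/sys/net/ipv4/conf
    all=$(cat "$base/all/send_redirects")
    dev=$(cat "$base/$ifname/send_redirects")
    echo "$ifname: all=$all dev=$dev effective=$(effective_send_redirects "$all" "$dev")"
}
]]></programlisting>
	<para>
On a director set up as in the list above, <command>report_send_redirects eth0</command>
would be expected to report effective=0.
	</para>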
	<para>
	When a new interface is created we have two choices:
	</para>
	<para>
1. to set conf/default/&lt;var&gt; to the value that we want each new
created interface to inherit
	</para>
	<para>
2. to create the interface in this way:
	</para>
<programlisting><![CDATA[
ifconfig eth0 0.0.0.0 up
]]></programlisting>
	<para>
and then to set the value before assigning the address:
	</para>
<programlisting><![CDATA[
echo <val> > conf/eth0/<var>
ifconfig eth0 192.168.0.1 up
]]></programlisting>
	<para>
but this is risky especially for the tunnel devices, for example, if
you want to play with var rp_filter.
	</para>
	<para>
	For the other devices this is a safe method if there is no
problem with the default value before assigning the IP address. The
first method can be the safest one but you have to be very careful.
	</para>
		<section id="stump">
		<title>One Network LVS from Joe Stump</title>

		<para>
		Joe Stump <emphasis>joe (at) joestump (dot) net</emphasis> 2002-09-04
		</para>

			<section><title>Problem</title>
			<para>
The problem is you have one network that has your realservers, directors, and
clients all together on the same class C. For this example we will say they all
sit on 192.168.1.*. Here is a simple layout.
			</para>
<programlisting><![CDATA[
                         ~~~~~~~~~~~~~
                         {  Internet }------------------------+
                         ~~~~~~~~~~~~~                        |
                              |                               | IP: 192.168.1.1
                              | External IP: 166.23.3.4       |
                              |                               |
                       +---------------+                   +---------+
                       |   Director    |-------------------| Gateway |
                       +---------------+                   +---------+
                              |                                  |
                              | Internal IP: 192.168.1.25        |
                              |                                  |
                   +----------+                                  |
                   |                                             |
                   | IP: 192.168.1.200                     +--------+
                   |                                       | Client |
           +---------------+                               +--------+
           |  Real Server  |                            IP: 192.168.1.34
           +---------------+
]]></programlisting>
			<para>
Everything looks like it should work just fine right? Wrong. The problem is
that in reality all of these machines are able to talk to one another because
they all reside on the same physical network. So here is the problem: clients
outside of the internal network get expected output from the load balancer, but
clients on the internal network hang when connecting to the load balancers.
			</para>
			</section>
			<section>
			<title>Cause</title>
			<para>
So what is causing this problem? The routing tables on the directors and the
realservers are causing your client to become confused and hang the connection.
If you look at your routing tables on your realserver you will notice that
the default gateway for your internal network is 0.0.0.0. Your director will
have a similar route. These routes tell your directors and realservers that
requests coming from machines on that network should be routed directly back
to that machine. So when a request comes to the director the director routes
it to the realserver, but the realserver sends the response directly back to
the client instead of routing it back through the director as it should. The
same thing happens when you try to connect via the director's outside IP from
an internal client IP, only this time the director mistakenly sends directly
to the internal client IP. The internal client IP is expecting the return
packets from the director's external IP, not the director's internal IP.
			</para>
			</section>
			<section>
			<title>Solution</title>
			<para>
The solution is simple. Delete the default routes on your directors and real
servers to the internal network.
			</para>
<programlisting><![CDATA[
route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
]]></programlisting>
			<para>
The above line should do the trick. One thing to note is that you will not
be able to connect to these machines once you have deleted these routes. You
might just want to use the director as a terminal server since you can connect
from there to the realservers.
			</para><para>

Also, if you have your realservers connect to DB's and NFS servers on the
internal network you will have to add direct routes to those hosts. You do
this by typing this:
			</para>

<programlisting><![CDATA[
route add -host $SERVER dev eth0
]]></programlisting>
			<para>

I added these routes to a startup script so it kills my internal routes and
adds the needed direct routes to my NFS and DB server during startup.
			</para>
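			<para>
A startup script along these lines might look like the sketch below.
It is not Joe Stump's script: the variable names, the function name and the
two host addresses are placeholders chosen for this example.
The function only prints the <command>route</command> commands so they can be reviewed first.
			</para>
<programlisting><![CDATA[
#!/bin/sh
# Illustrative values; replace with your own network and hosts.
LOCAL_NET=192.168.1.0
NETMASK=255.255.255.0
IFACE=eth0
DIRECT_HOSTS="192.168.1.50 192.168.1.51"   # hypothetical NFS and DB servers

# Print the commands that remove the route to the internal network and
# add direct host routes for the servers that must stay reachable.
one_net_routes() {
    echo "route del -net $LOCAL_NET netmask $NETMASK dev $IFACE"
    for host in $DIRECT_HOSTS; do
        echo "route add -host $host dev $IFACE"
    done
}
]]></programlisting>
			<para>
Running <command>one_net_routes | sh</command> as root would apply the printed commands.
			</para>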
			</section>
		</section>
		<section id="one_network_nat_with_windows_realservers">
		<title>One Network LVS-NAT with windows realservers</title>
		<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 1 Aug 2008 
		</para>
		<para>
Route configuration for Windows Server with one arm NAT mode:
		</para>
		<para>
When a client on the same subnet as the real server tries to access
the virtual server on the load balancer the request will fail. The
real server will try to use the local network to get back to the
client rather than going through the load balancer and getting the
correct network translation for the connection.
		</para>
		<para>
To rectify this issue we need to add a route to the load balancer
that takes priority over Windows default routing rules.
This is a simple case of adding a permanent route:
		</para>
<programlisting><![CDATA[
route add -p 192.168.1.0 mask 255.255.255.0 metric 1
]]></programlisting>
		<para>
(NB. Replace 192.168.1.0 with your local subnet address.)
The default route to the local network has a metric of 10, so this new
route overrides all local traffic and forces it to go through the load
balancer as required.
Any local traffic (same subnet) is handled by this route and any
external traffic is handled by the default route (which also points at
the load balancer).
		</para>
		</section>
		<section id="lvs_nat_one_network_malcolm">
		<title>Malcolm's modification of the One Network LVS-NAT</title>
		<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 1 Aug 2008 
		</para>
		<para>
Forgot to add that this method (<emphasis>i.e.</emphasis> the windows realserver method above)
works for Linux as well, avoiding the
need to add a route for each host.
		<para>
Route configuration for Linux with one arm NAT mode:
When a client on the same subnet as the real server tries to access
the virtual server on the load balancer the request will fail. The
real server will try to use the local network to get back to the
client rather than going through the load balancer and getting the
correct network translation for the connection.
To rectify this issue we need to modify the local network route to a
higher metric:
		</para>
<programlisting><![CDATA[
route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
route add -net 192.168.1.0 netmask 255.255.255.0 metric 2000 dev eth0
]]></programlisting>
		<para>
NB. Replace 192.168.1.0 with your local subnet address.
Then we need to make sure that local network access uses the load
balancer as its default route:
		</para>
<programlisting><![CDATA[
route add -net 192.168.1.0 netmask 255.255.255.0 gateway 192.168.1.21 metric 0 dev eth0
]]></programlisting>
		<para>
NB. Replace 192.168.1.21 with your load balancer gateway.
Any local traffic (same subnet) is handled by this manual route and
any external traffic is handled by the default route (which also
points at the load balancer).
		</para>
		<note>
FIXME 
Joe: here's what I think Malcolm is saying.
Malcolm's clients are on 0/0 and not on the same logical network as the realservers.
(In the setup above needing the icmp redirects turned off, all machines, client, director, realservers
are on the same network).
		<itemizedlist>
			<listitem>
metric 0 is high priority. anything to 0/0 goes to default gw
			</listitem>
			<listitem>
metric 2000 is low priority. anything to 192.168.1.0/24 goes to eth0.
			</listitem>
			<listitem>	
you don't need to turn off icmp redirects
			</listitem>
		</itemizedlist>
However
		<itemizedlist>
			<listitem>
the linux kernel ignores the metric 
			</listitem>
			<listitem>
only dynamic routing protocols (RIP, GATED) use metric,
and then only to decide between duplicate routes.
			</listitem>
			<listitem>
routes with (metric&gt;16) are ignored by dynamic routing protocols.
Presumably Linux ignoring the metric, treats a route with metric=2000
the same as a route with metric=0.
			</listitem>
		</itemizedlist>
So although Malcolm's method works, we don't understand why at the moment.
		</note>
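		<para>
One possible reading, offered as an assumption and not as a resolution of the FIXME above:
if, among routes to the same prefix, the route with the lowest metric is preferred,
then the metric 0 route through the load balancer would be chosen for local traffic
ahead of the metric 2000 direct route.
The sketch below is a toy model of that selection rule (longest prefix first, then lowest metric).
It is not kernel code, and <command>pick_route</command> and the route labels are made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# Toy model: each input line is "<prefix-length> <metric> <label>" for a
# route that already matches the destination. Prefer the longest prefix,
# and among equal prefixes the lowest metric. Prints the chosen label.
pick_route() {
    sort -k1,1nr -k2,2n | head -n 1 | awk '{ print $3 }'
}

# The three routes a realserver would have after Malcolm's commands,
# as seen by a packet addressed to a client on 192.168.1.0/24.
printf '%s\n' \
    "24 2000 direct-eth0" \
    "24 0 via-load-balancer" \
    "0 0 default-gw" | pick_route
]]></programlisting>
		<para>
Under this model the example prints via-load-balancer.
		</para>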
		</section>
		<section id="lvs_nat_one_network_mailing_list">
		<title>One net LVS-NAT from the mailing list</title>
		<para>
Here's an untested solution from Julian for a one network LVS-NAT
(I assume this is old, maybe 1999, because I don't have a date on it).
		</para><para>
Put the client in the external logical network. This way the
client, the director and the realserver(s) are on same physical network
but the client can't be on the masqueraded logical network. So, change the
client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add
through DIP (I don't see such IP for the Director). Why in your setup
DIP==VIP ? If you add DIP (166.84.192.33 for example) in the director you
can later add path for 192.168.1.0/24 through 166.84.192.33. There is no
need to use masquerading with 2 NICs. Just remove the client from the
internal logical network used by the LVS cluster.
		</para><para>
A different working solution from Ray Bellis <emphasis>rpb (at) community (dot) net (dot) uk</emphasis>
		</para><para>
the same *logical* subnet.
I still have a dual-ethernet box acting as a director, and the VIP is
installed as an alias interface on the external side of the director, even
though the IP address it has is in fact assigned from the same subnet as the
		</para><para>
Ray Bellis <emphasis>rpb (at) community (dot) net (dot) uk</emphasis> has used a 2 NIC director
to have the RIPs on the same logical network as the VIP
(<emphasis>i.e.</emphasis> RIP and VIP numbers are from the same subnet), although
they are in different physical networks.
		</para>
		</section>
	</section>
	<section id="lvs_nat_rewriting_slow">
	<title>re-mapping ports, rewriting is slow for 2.0, 2.2 kernels</title>
	<para>
	For LVS-NAT, the packet headers are re-written
(from the VIP to the RIP and back again).
At no extra overhead, anything else in the header can
be rewritten at the same time. LVS-NAT can rewrite the ports.
Thus a request to port VIP:80 received on the director
can be sent to RIP:8000 on the realserver.
	</para>
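	<para>
The port re-mapping above could be set up with <command>ipvsadm</command> roughly as follows.
This is a sketch: the function name is made up, the addresses are example values
taken from the diagram earlier in this chapter, and the rr scheduler is an arbitrary choice.
The function prints the commands and does not run them.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Print the ipvsadm commands that forward VIP:vport to RIP:rport using
# LVS-NAT (-m, masquerading). Arguments: vip vport rip rport
nat_port_remap() {
    vip=$1; vport=$2; rip=$3; rport=$4
    echo "ipvsadm -A -t $vip:$vport -s rr"
    echo "ipvsadm -a -t $vip:$vport -r $rip:$rport -m"
}

# VIP:80 on the director is sent to RIP:8000 on the realserver.
nat_port_remap 192.168.1.110 80 192.168.1.2 8000
]]></programlisting>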
	<para>
In the 2.0.x and 2.2.x series of IPVS, rewriting the packet
headers is slow on machines from that era
(60usec/packet on a pentium classic)
and limits the throughput of LVS-NAT (for 536byte packets, this is
72Mbit/sec or about 100BaseT). While LVS-NAT throughput does not
scale well with the packet rate (after you run out of CPU), the advantage of
LVS-NAT is that realservers can have any OS, no modifications are
needed to the realserver to run it in an LVS, and the realserver
can have services not found on Linux boxes.
	</para>
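	<para>
The throughput figure can be checked with a little arithmetic: if the director
spends a fixed time rewriting each packet, the ceiling is the packet size divided by that time.
A minimal sketch using shell integer arithmetic follows; the function name is made up
and framing overhead is ignored.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Throughput ceiling in Mbit/sec when each packet costs a fixed rewrite
# time. Bits per microsecond is numerically the same as Mbit/sec.
# Arguments: packet size in bytes, rewrite time in microseconds.
nat_ceiling_mbit() {
    bytes=$1
    usec=$2
    echo $(( bytes * 8 / usec ))
}

nat_ceiling_mbit 536 60    # prints 71 (integer division of 4288/60)
]]></programlisting>
	<para>
The result is in line with the roughly 72Mbit/sec quoted above.
	</para>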
	<note>
	<para>
For <xref linkend="LVS-HOWTO.localnode"/>,
headers are <emphasis>not</emphasis> rewritten.
	</para>
	</note>
	<para>
The LVS-NAT code for 2.4 is rewritten as a Netfilter module
and is not detectably slower than LVS-DR or LVS-Tun.
(The IPVS code for the early 2.4.x kernels in 2001 was buggy
during the changeover, but that is all fixed now.)
	</para>
	</section>
	<section id="two_instances_on_realserver">
	<title>Two instances of demon running on realserver</title>
	<para>
from Horms, Jul 2005
	</para>
	<para>
With LVS-DR or LVS-Tun, the packet arrives on the realserver with
dst_addr=VIP:port. 
Thus even if you set up two RIPs on the realserver you cannot have
two instances of the service demon, because they would both have
to be listening for VIP:port.
With LVS-NAT, you could 
	</para>
	<itemizedlist>
		<listitem>
have two RIPs (RIP1 and RIP2) on one realserver (both IPs could be on one NIC),
<command>ipvsadm</command> forwarding to both RIPs,  
with an instance of the demon listening to RIP1 
and another instance of the demon listening to RIP2.
		</listitem>
		<listitem>
one RIP on the realserver, but have <command>ipvsadm</command> 
forward requests to two different ports. 
Thus one instance of the demon would listen to RIP:port1 
and another would listen to RIP:port2.
		</listitem>
	</itemizedlist>
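	<para>
The two LVS-NAT options above might be expressed with <command>ipvsadm</command> as in the sketch below.
The function names are made up for this example, the rr scheduler is an arbitrary choice,
and the functions only print the commands.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Option 1: two RIPs on one realserver, one demon instance per RIP.
# Arguments: vip port rip1 rip2
two_rips_cmds() {
    vip=$1; port=$2; rip1=$3; rip2=$4
    echo "ipvsadm -A -t $vip:$port -s rr"
    echo "ipvsadm -a -t $vip:$port -r $rip1:$port -m"
    echo "ipvsadm -a -t $vip:$port -r $rip2:$port -m"
}

# Option 2: one RIP, one demon instance per port.
# Arguments: vip vport rip port1 port2
two_ports_cmds() {
    vip=$1; vport=$2; rip=$3; port1=$4; port2=$5
    echo "ipvsadm -A -t $vip:$vport -s rr"
    echo "ipvsadm -a -t $vip:$vport -r $rip:$port1 -m"
    echo "ipvsadm -a -t $vip:$vport -r $rip:$port2 -m"
}
]]></programlisting>
	<para>
For example, <command>two_ports_cmds 192.168.1.110 80 192.168.1.2 8000 8001</command>
prints the commands for one realserver with demons on ports 8000 and 8001.
	</para>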
	</section>
	<section id="lvs_nat_performance">
	<title>Performance of LVS-NAT</title>
	<para>
Horms
	</para>
	<para>
All things are relative. LVS-NAT is actually pretty fast.
I have seen it do well over 600Mbit/s. But in theory LVS-DR
is always going to be faster because it does less work.
If you only have 100Mbit/s on your LAN then either will be fine.
If you have gigabit then LVS-NAT will still probably be fine.
Beyond that... I am not sure if anyone has tested that to see
what will happen.
In terms of number of connections, there is a limit
with LVS-NAT that relates to the number of ports.
But in practice you probably won't reach that limit anyway.
	</para>
		<section id="lvs_nat_performance_2.0_2.2">
		<title>Performance of LVS-NAT, 2.0 and 2.2 kernels</title>
		<para>
With the slower machines around in the early days of LVS,
the throughput of LVS-NAT was limited by the time taken by the
director to rewrite a packet.
The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT).
Since the director is the limiting step,
increasing the number of realservers does not increase the throughput.
		</para>
		<para>
The
<ulink
url="http://www.linuxvirtualserver.org/Joseph.Mack/performance/single_realserver_performance.html">
performance page
</ulink>
shows a slightly higher latency with LVS-NAT compared to LVS-DR or LVS-Tun,
but the same maximum throughput. The load average on the director is high (>5)
at maximum throughput, and the keyboard and mouse are quite sluggish.
The same director box operating at the same throughput under LVS-DR or
LVS-Tun has no perceptible load as measured by <command>top</command>
or by mouse/keyboard responsiveness.
		</para>
		</section>
		<section id="2.4_NAT">
		<title>Performance of LVS-NAT, 2.4 kernels</title>
		<para>
Wayne
		</para>
		<blockquote>
NAT takes some CPU and memory copying. With a slower CPU, it will
be slower.
		</blockquote>
		<para>
<ulink url="http://marc.theaimsgroup.com/?l=linux-virtual-server&amp;m=99554689108488&amp;w=2">
the original posting</ulink>
		</para>
		<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 19 Jul 2001
		</para>
		<para>
	This is a myth from the 2.2 age. In 2.2 there are 2 input
route calls for the out->in traffic and this reduces the performance.
By default, in 2.2 (and 2.4 too) the data is not copied when the IP
header is changed. Updating the checksum in the IP header does not
cost too much time compared to the total packet handling time.
		</para>
		<para>
	To check the difference between the NAT and DR forwarding
method in out->in direction you can use <link linkend="testlvs">testlvs</link> from
http://www.ssi.bg/~ja/ and to flood a 2.4 director in 2 setups: DR and NAT.
My tests show that I can't see a visible difference.
We are talking about 110,000 SYN packets/sec with 10 pseudo clients and
same cpu idle during the tests (there is not enough client power in my
setup for full test), 2 CPUx 866MHz, 2 100mbit internal i82557/i82558
NICs, switched hub:
		</para><para>
3 testlvs client hosts -> NIC1-LVS-NIC2 -> packets/sec.
		</para><para>
	I use small number of clients because I don't want to spend time
in routing cache or LVS table lookups.
		</para><para>
	Of course, the NAT involves in->out traffic and this can halve
the performance if the CPU or the PCI power is not enough to handle
the traffic in both directions. This is the real reason the NAT method
looks so slow in 2.4. IMO, the overhead from the TUN encapsulation
or from the NAT process is negligible.
		</para><para>
	Here come the surprises:
		</para><para>
The basic setup: 1 CPU PIII 866MHz, 2 NICs (1 IN and 1 OUT), LVS-NAT,
SYN flood using testlvs with 10 pseudo clients, no ipchains rules.
Kernels: 2.2.19 and 2.4.7pre7.
		</para>
		<itemizedlist>
			<listitem>
				<para>
Linux 2.2 (with ipchains support, with modified demasq path to use
one input routing call, something like LVS uses in 2.4 but without dst
cache usage):
				</para>
<programlisting><![CDATA[
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 99% (strange)
In 110,000 SYNs/sec, Out 88,000 SYNs/sec, CPU idle: 0%
]]></programlisting>
			</listitem>
			<listitem>
				<para>
Linux 2.4 (with ipchains support): with 3-4 ipchains rules:
				</para>
<programlisting><![CDATA[
In 80,000 SYNs/sec, Out 55,000 SYNs/sec, CPU idle: 0%
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 0%
In 110,000 SYNs/sec, Out 63,000 SYNs/sec (strange), CPU idle: 0%
]]></programlisting>
			</listitem>
			<listitem>
				<para>
Linux 2.4 (without ipchains support):
				</para>
<programlisting><![CDATA[
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 20%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 2%
]]></programlisting>
			</listitem>
			<listitem>
				<para>
Linux 2.4, 2 CPU (with ipchains support):
				</para>
<programlisting><![CDATA[
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 30%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 0%
]]></programlisting>
			</listitem>
			<listitem>
				<para>
Linux 2.4, 2 CPU (without ipchains support):
				</para>
<programlisting><![CDATA[
In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 45%
In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 15%, 30000 ctxswitches/sec
]]></programlisting>
			</listitem>
		</itemizedlist>
			<para>
What I see is that:
			</para>
		<itemizedlist>
			<listitem>
				<para>
modified 2.2 and 2.4 UP look equal on 80,000P/s
				</para><para>
limits: 2.2=88,000P/s, 2.4=96,000P/s, i.e. 8% difference
				</para>
			</listitem>
			<listitem>
				<para>
1 and 2 CPU in 2.4 look equal 110,000->96,000 (100mbit or PCI bottleneck?),
maybe we can't send more than 96,000P/s through 100mbit NIC?
				</para>
			</listitem>
			<listitem>
the ipchains rules can dramatically reduce the performance - from
88,000 to 55,000 P/s
			</listitem>
			<listitem>
2.4.7pre7 SMP shows too many context switches
			</listitem>
			<listitem>
DR and NAT show equal results for 2.4 UP
110,000->96,000P/s, 2-3% idle,
so I can't claim that there is a NAT-specific overhead.
			</listitem>
		</itemizedlist>
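		<para>
The percentages behind these observations can be recomputed from the packet rates
in the listings. The helper below is a small sketch with a made-up name; it uses
integer arithmetic, so results are rounded down.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# Percentage by which rate $2 falls short of rate $1 (rounded down).
shortfall_pct() {
    echo $(( ($1 - $2) * 100 / $1 ))
}

shortfall_pct 96000 88000     # 2.4 limit vs modified 2.2 limit: prints 8
shortfall_pct 88000 55000     # effect of the ipchains rules: prints 37
shortfall_pct 110000 96000    # offered vs forwarded SYNs on 2.4: prints 12
]]></programlisting>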
		<para>
I performed other tests, testlvs with UDP flood. The packet rate
is lower, the cpu idle time in the LVS box increased dramatically
but the client hosts show 0% cpu idle; maybe more testlvs client
hosts are needed.
		</para>
		<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 16 Jan 2002
		</para><para>
Many people think that the packet mangling is evil
in the NAT processing. The picture is different: the NAT processing in
2.2 uses 2 input routing calls instead of 1 and this totally kills
the forwarding of packets from/to many destinations. Such problems
are mostly caused from the bad hash function used in the routing code
and because the routing cache has hard limit for entries.
Of course, the NAT setups handle more traffic than the other forwarding
methods (both the forward and reply directions),
a good reason to avoid LVS-NAT with a low power director.
In 2.4 the difference
between the DR and NAT processing in out->in direction can not be
noticed (at least in my tests) because only one route call is used,
for all methods.
		</para>
		<para>
Matthew S. Crocker Jul 26, 2001
		</para><para>
DR is faster, less resource intensive but has issues with configuration
because of the age old 'arp problem'
		</para>
		<para>
Horms <emphasis>horms (at) vergenet (dot) net</emphasis>
		</para>
		<para>
LVS-NAT is still fast enough for many applications and
is IMHO considerably easier to set up. While I think LVS-DR is great
I don't think people should be under the impression that LVS-NAT
will intrinsically be a limiting factor to them.
		</para>
		<para>
Don Hinshaw <emphasis>dwh (at) openrecording (dot) com</emphasis> 04 Aug 2001
		</para>
		<para>
Cisco, Alteon and F5 solutions are all NAT based. The real limiting
factor as I understand it is the capacity of the netcard, which these three
deal with by using gigabit interfaces.
		</para>
		<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 05 Mar 2002 in discussion with Michael McConnell
		</para>
		<para>
	Note that I used a modified demasq path which uses
one input route for NAT but it is wrong. It only proves that
2.2 can reach the same speed as 2.4 if there was use_dst
analog in 2.2. Without such feature the difference is 8%.
OTOH, there is a right way to implement one input route call
as in 2.4 but it includes rewriting of the 2.2 input processing.
		</para>
		<para>
Michael McConnell
		</para>
		<blockquote>
From what I see here, it looks as though the 2.2 kernel handles a higher
number of SYN's better than the 2.4 kernel. Am I to assume that for the
110,000SYNs/sec in the 2.4 kernel, only 63,000 SYNs/sec were answered? The
rest failed?
		</blockquote>
		<para>
	In this test 2.4 has firewall rules, while 2.2 has only
ipchains enabled.
		</para>
		<blockquote>
Is the 2.2 kernel better at answering a higher number of requests?
		</blockquote>
		<para>
	No. Note also that the testlvs test was only in one
direction, no replies, only client->director->realserver
		</para>
		<blockquote>
has anyone compared iptables/ipchains, via 2.2/2.4?
		</blockquote>
		<para>
Here are my
<ulink url="http://marc.theaimsgroup.com/?l=linux-virtual-server&amp;m=100903333532449&amp;w=2">
results</ulink>.
There is some magic in these tests; at one place I don't know
why netfilter shows such bad results. Maybe someone can point me
to the problem.
		</para>
		</section>
	</section>
	<section id="debugging_routes">
	<title>Various debugging techniques for routes</title>
	<para>
This originally described how I debugged setting up a one-net LVS-NAT LVS
using the output of route. Since it is more about networking tools
than LVS-NAT it has been moved to the section on
<xref linkend="LVS-HOWTO.policy_routing"/>.
	</para>
	</section>
	<section id="lvs_nat_connecting_to_realserver">
	<title>Connecting directly from the client to a service:port on an LVS-NAT realserver</title>
	<para>
If you connect directly to the realserver in a LVS-NAT LVS, the reply
packet will be routed through the director, which will attempt to
masquerade it. This packet will not be part of an established connection
and will be dropped by the director, which will issue an ICMP error.
	</para>
	<para>
Paul Wouters <emphasis>paul (at) xtdnet (dot) nl</emphasis> 30 Nov 2001
	</para>
	<blockquote>
		<para>
I would like to reach all LVS'ed services on the realservers directly,
<emphasis>i.e.</emphasis> without going through the LVS-NAT director, say from
a local client not on the internet.
		</para>
		<para>
Connecting from client to a RIP
should just completely bypass all the lvs code, but it seems that the lvs
code is confused, and thinks a RIP->client answer should be part of its
NAT structure.
		</para>
		<para>
tcpdump running on internal interface of the director shows
a packet from the client received on the RIP;
the RIP replies (never reaches the client, the director drops it).
The director then sends out a port unreachable:
		</para>
	</blockquote>
	<para>
Julian
	</para>
	<para>
	The code that replies with an ICMP error can be removed but then
you still have the problem of reusing connections.
The local_client can select a port for direct connection
with the RIP but if that port was used
some seconds before for a CIP->VIP connection,
it is possible that LVS will catch these replies as part
of the previous connection.
LVS does not inspect the TCP headers and does not accurately keep the TCP state.
So, it is possible that LVS will not detect that the local_client
and the realserver have established a new connection with
the same addresses and ports that are still known as a NAT connection.
Even stateful conntracking can't notice
it because the local_clientIP->RIP packets are not subject to NAT processing. When
LVS sees the replies from RIP to local_clientIP it will SNAT them and this will
be fatal because the new connection is between the local_clientIP and RIP directly,
not from CIP->VIP->RIP. The other thing is that CIP even does not know
that it connects from same port to same server. It thinks there
are 2 connections from same CPORT: to VIP and to RIP, so they can live
even at the same time.
	</para>
	<blockquote>
But a proper TCP/IP stack on a client will not re-use the same port
that quickly, unless it is REALLY loaded with connections right?
And a client won't (can't?) use the same source port to different
destinations (VIP and RIP) right?
So, the problem becomes almost theoretical?
	</blockquote>
	<para>
	This setup is dangerous. As for the ICMP replies, they
are only for anti-DoS purposes, but maybe they are going to die soon.
There is still not enough reason to remove that code (it was not
a first priority).
	</para>
	<blockquote>
Or make it switchable as an #ifdef or a /proc sysctl?
	</blockquote>
	<para>
Wensong
	</para>
	<para>
Just comment out the whole block, for example,
	</para>
<programlisting><![CDATA[
#if 0
                if (ip_vs_lookup_real_service(iph->protocol,
                                              iph->saddr, h.portp[0])) {
                        /*
                         * Notify the realserver: there is no existing
                         * entry if it is not RST packet or not TCP packet.
                         */
                        if (!h.th->rst || iph->protocol != IPPROTO_TCP) {
                                icmp_send(skb, ICMP_DEST_UNREACH,
                                          ICMP_PORT_UNREACH, 0);
                                kfree_skb(skb);
                                return NF_STOLEN;
                        }
                }
#endif
]]></programlisting>
	<blockquote>
This works fine. Thanks
	</blockquote>
	<para>
The topic came up again. Here's another similar reply.
	</para>
	<blockquote>
I've set up a small LVS-NAT-based http load
balancer but can't seem to connect to the realservers
behind them via IP on port 80.
Trying to connect directly to the realservers
on port 80, though, translates everything correctly,
but generates an ICMP port unreach.
	</blockquote>
	<para>
Ben North <emphasis>ben (at) antefacto (dot) com</emphasis> 06 Dec 2001
	</para>
	<para>
The problem is that LVS takes an interest in all packets with a
source IP:port of a Real Service's IP:port, as they're passing
through the FORWARD block.  This is of course necessary ---
normally such packets would exist because of a connection
between some client and the Virtual Service, mapped by LVS to
some Real Service.  The packets then have their source address
altered so that they're addressed VIP:VPort -> CIP:CPort.
	</para>
	<para>
However, if some route exists for a client to make connections
directly to the Real Service, then the packets from the Real
Service to the client will not be matched with any existing LVS
connection (because there isn't one).  At this point, the LVS
NAT code will steal the packet and send the "Port unreachable"
message you've observed back to the Real Server.  A fix to the
problem is to #ifdef out this code --- it's in ip_vs_out() in
the file ip_vs_core.c.
	</para>
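	<para>
The behaviour described in this section can be illustrated with a small
user-space model. The Python sketch below is not kernel code: the names
ip_vs_out_model, Packet, connections and real_services, and the example
addresses, are hypothetical and only mirror the test in the block that Wensong
suggests commenting out. It shows the two failure cases: a reply with no LVS
connection entry is stolen and answered with an ICMP port unreachable, and a
reply that happens to match a stale entry is masqueraded (Julian's port reuse
problem).
	</para>
<programlisting><![CDATA[
from typing import NamedTuple, Tuple

Addr = Tuple[str, int]  # (ip, port)


class Packet(NamedTuple):
    proto: str          # "tcp" or "udp"
    src: Addr
    dst: Addr
    rst: bool = False   # TCP RST flag


def ip_vs_out_model(pkt, connections, real_services):
    """Toy model of the director's decision for a packet leaving a realserver.

    connections   -- set of (proto, realserver addr, client addr) entries
                     that LVS created for scheduled connections
    real_services -- set of (proto, realserver addr) known to LVS
    Returns (verdict, icmp_error_or_None).
    """
    if (pkt.proto, pkt.src, pkt.dst) in connections:
        # The reply matches a known CIP->VIP connection: masquerade it.
        return ("SNAT to VIP", None)
    if (pkt.proto, pkt.src) in real_services:
        # Same condition as the quoted block: everything except a TCP RST
        # is stolen and answered with an ICMP port unreachable.
        if pkt.proto != "tcp" or not pkt.rst:
            return ("NF_STOLEN", "ICMP port unreachable")
    return ("NF_ACCEPT", None)


RIP = ("192.168.1.2", 80)                # example realserver service
LOCAL_CLIENT = ("192.168.2.5", 40000)    # example client reaching the RIP directly
real_services = {("tcp", RIP)}
reply = Packet("tcp", src=RIP, dst=LOCAL_CLIENT)

# Direct connection, no LVS entry: the reply never reaches the client.
print(ip_vs_out_model(reply, set(), real_services))

# A stale entry left by an earlier CIP->VIP connection from the same client
# address and port captures the reply and rewrites its source to the VIP.
stale = {("tcp", RIP, LOCAL_CLIENT)}
print(ip_vs_out_model(reply, stale, real_services))
]]></programlisting>
	<para>
Removing the ICMP block only changes the first case. The second case is the
reason Julian calls this setup dangerous.
	</para>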
	</section>
	<section id="lvs_nat_has_no_connections">
	<title>A NAT router has no connections</title>
	<para>
A NAT router rewrites the source IP (and possibly the port)
of packets coming from machines on the inside network.
With an LVS-NAT director, the connection originates on
the internet and terminates on the realserver (de-masquerading).
The replies (from the realserver to the LVS client) are
masqueraded. In both cases (NAT router, LVS-NAT director),
to the machine on the internet, the connection appears
to be coming from the box doing the NAT'ing. However
the NAT box has no connection (<emphasis>e.g.</emphasis>
with <command>netstat -an</command>) to the box on the
internet. It is just routing packets (and rewriting them). 
	</para>
	<para>
Horms 17 May 2004
	</para>
	<para>
There is no connection as such. Or more specifically,
the connection is routed, not terminated by the kernel. 
However, there is a proc entry, that you can inspect, 
to see the natted connections.
	</para>
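	<para>
Horms does not name the proc entry. On 2.4 and 2.6 directors the IPVS
connection table is exposed as <filename>/proc/net/ip_vs_conn</filename>,
and the sketch below assumes that this is the entry meant. The column order
(Pro FromIP FPrt ToIP TPrt DestIP DPrt State Expires, with addresses and ports
in hexadecimal) is also an assumption; check the header line of the file on
your own kernel. The function name and the sample line are made up for
illustration.
	</para>
<programlisting><![CDATA[
import socket


def parse_ip_vs_conn_line(line):
    """Decode one IPv4 data line of /proc/net/ip_vs_conn.

    Assumed columns: Pro FromIP FPrt ToIP TPrt DestIP DPrt State Expires.
    """
    fields = line.split()

    def addr(ip_hex, port_hex):
        # Addresses are assumed to be printed as 8 hex digits, ports as 4.
        return (socket.inet_ntoa(bytes.fromhex(ip_hex)), int(port_hex, 16))

    return {
        "proto": fields[0],
        "client": addr(fields[1], fields[2]),
        "virtual": addr(fields[3], fields[4]),
        "real": addr(fields[5], fields[6]),
        "state": fields[7],
        "expires": int(fields[8]),
    }


# Made-up sample line, not captured from a real director.
sample = "TCP C6336407 9C40 CB00710A 0050 0A0A0A0A 0050 ESTABLISHED     899"
print(parse_ip_vs_conn_line(sample))
]]></programlisting>
	<para>
The same table is printed in readable form by
<command>ipvsadm -L -n -c</command>. Neither view will appear in
<command>netstat -an</command>, since the director does not terminate the connection.
	</para>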
	</section>
	<section id="lvs_net_extending">
	<title>Thoughts on extending NAT</title>
	<para>
<blockquote><para>Tao Zhao <emphasis>taozhao (at) cs (dot) nyu (dot) edu</emphasis> 01 May 2002
			</para><para>
LVS-NAT assumes that all servers are behind the
director, so the director only needs to change the destination IP when a
request comes in and forward it to the scheduled realserver. When the
reply packets go through the director it will change the source IP. This
limits the deployment of LVS using NAT: the director must be the outgoing
gateway for all servers.
			</para><para>
I am wondering if I can change the code so that both the source and
destination IPs are changed in both directions. For example,
CIP: client IP;
DIP: director IP;
SIP: server IP (public IPs);
			</para><para>
<programlisting><![CDATA[
Client->Director->Server: address pair (CIP, DIP) is changed to (DIP, SIP)
Server->Director->Client: address pair (SIP, DIP) is changed to (DIP, CIP).
]]></programlisting>
</para></blockquote>
			</para><para>
Lars
			</para><para>
Not very efficient; but this can actually already be done by using the
port-forwarding feature AFAIK, or by a userspace application level gateway. I
doubt its efficiency, since the director would _still_ need to be between
all servers and the client in both directions.
Direct routing and/or tunneling make more sense.
Also, the realservers do not
know where the connection originally came from, making the logs on them nearly
useless. Filtering by client IP and establishing a session back to the
client (i.e. ftp or some multimedia protocols) is also very difficult.
			</para><para>
Wayne <emphasis>wayne (at) compute-aid (dot) com</emphasis> 01 May 2002
			</para><para>
The client IP address is very important to marketing people for
analyzing the traffic.  Getting rid of the CIP leaves the web server
with no way to log where the traffic is coming from, totally
blinding the marketing people, which is very undesirable for many
uses.
Do you have to allocate a table for tracking these changes, too?
That will further slow down the director.
			</para><para>
<blockquote><para>
Of course, the director needs to allocate a new port number and change the
source port number to it when it forwards the packet to the server. Thus
this local port number should be enough for the director to distinguish
different connections.
This way, there will be no limitation on where the servers are (the tunneling
solution requires a change on the servers: setting up tunneling).
</para></blockquote>
			</para><para>
Joe
			</para><para>
I talked to Wensong about this in the early days of LVS, but I remember
thinking that keeping track of the CIP would have been a lot of work.
I think I mentioned it in the HOWTO for a while.
However I'd be happy to use the code if someone else wrote it :-)
			</para><para>
Some commercial load balancers seem to have some NAT-like scheme where
the packets can return directly to the CIP without going through the director.
Does anyone know how it works? (Actually I don't know whether it's NAT-like
or not, I think there's some scheme out there that isn't VS-DR which
returns packets directly from the realservers to the clients - this
is called "direct server return" in the commercial world).
			</para><para>
Wayne <emphasis>wayne (at) compute-aid (dot) com</emphasis>
			</para><para>
I think those are switch-like load balancers.  They don't take any IP
addresses. But I think it could be done even with NAT, as long as
the server has two NICs, one talking to the load balancer and the other
talking to the switch/hub in front of the load balancer.  The load balancer
has to rewrite the packet so that it does not have its own IP in it, so there is no
need to NAT the replies back on the way out.  The server sets its default
gateway on the other NIC to send the packets out.
	</para>
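	<para>
Tao Zhao's proposal earlier in this section (rewriting both the source and the
destination address, and allocating a local port on the director per
connection) can be sketched as a small table-driven model. This is only an
illustration under stated assumptions: the class FullNatDirector, its method
names, the round-robin choice of server and the example addresses are all
hypothetical, and there is no expiry of entries and no handling of port exhaustion.
	</para>
<programlisting><![CDATA[
import itertools


class FullNatDirector:
    """Toy model of the proposed scheme: rewrite source AND destination."""

    def __init__(self, dip, vport, servers, first_port=1024):
        self.dip = dip
        self.vport = vport
        self._servers = itertools.cycle(servers)  # round robin, for simplicity
        self._next_port = first_port
        self._by_client = {}   # client (ip, port) -> (local port, server)
        self._by_port = {}     # local port -> client (ip, port)

    def inbound(self, client):
        """Client->Director->Server: (CIP, DIP) becomes (DIP, SIP)."""
        if client not in self._by_client:
            lport = self._next_port
            self._next_port += 1
            self._by_client[client] = (lport, next(self._servers))
            self._by_port[lport] = client
        lport, server = self._by_client[client]
        return (self.dip, lport), server         # new (source, destination)

    def outbound(self, lport):
        """Server->Director->Client: (SIP, DIP) becomes (DIP, CIP)."""
        client = self._by_port[lport]
        return (self.dip, self.vport), client    # new (source, destination)


director = FullNatDirector("192.0.2.1", 80,
                           [("198.51.100.10", 80), ("198.51.100.11", 80)])
src, dst = director.inbound(("203.0.113.7", 40000))
print(src, dst)                   # what the server sees: only the director
print(director.outbound(src[1]))  # the reply, restored for the client
]]></programlisting>
	<para>
The model makes both objections visible: the server only ever sees the DIP as
the source (Lars's and Wayne's logging complaint), and the director has to keep
an extra per-connection table keyed on the local port (Wayne's performance concern).
	</para>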
	</section>
	<section id="lvs_nat_mailing_list">
	<title>Postings from the mailing list</title>
	<para>
<emphasis>frederic (dot) defferrard (at) ansf (dot) alcatel (dot) fr</emphasis>
	</para>
	<blockquote>
		<para>
 Would it be possible to use LVS-NAT to load-balance virtual-IPs to
 ssh-forwarded real-IPs?
 Ssh can also be used to create a local access that is forwarded to a
 remote access through the ssh protocol. For example, you can use ssh to
 securely map a local access to a remote POP server:
		</para>
<programlisting><![CDATA[
 local:localport ==> local:ssh ~~~~~ ssh port forwarding ~~~~~ remote:ssh ==> remote:pop
]]></programlisting>
		<para>
 And when you connect to local:localport you are transparently/securely
 connected to remote:pop.
 The main idea is to allow RSs in different LANs,
 including RSs that are non-Linux (precluding LVS-Tun).
 Example:
		</para>
<programlisting><![CDATA[
                                - VS:81 ---- ssh ---- RS:80
                               /
INTERNET - - - - > VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
                               \
                                - VS:83 ---- ssh ---- RS:80
]]></programlisting>
	</blockquote>
	<para>
Wensong
	</para>
	<para>
 You can use VPN (or CIPE) to map some external realservers into
 your private cluster network. If you use LVS-NAT, make sure the
 routing on the realserver is configured properly so that the
 response packets will go through the load balancer to the clients.
	</para>
	<blockquote>
 I think that it isn't necessary to have the load balancer as the default
 router when using ssh, because the RS address is then the same as the
 VS address (with different ports).
	</blockquote>
	<para>
 With the NAT method, your example won't work because LVS/NAT
 treats the packets as local ones and forwards them to the upper layers without
 any change.
	</para>
	<para>
 However, your example gives me an idea: if we dynamically redirect
 port 80 to ports 81, 82 and 83 respectively for different
 connections, then your example can work. However, the performance
 won't be good, because a lot of work is done at the application
 level, and the overhead of copying from kernel to user-space is high.
	</para>
	<para>
 Another thought is that we might be able to set up LVS/DR with
 realservers in different LANs by using CIPE/VPN. For example, we
 use CIPE to establish tunnels from the load balancer to the realservers
 like
	</para>
<programlisting><![CDATA[
                     10.0.0.1================10.0.1.1 realserver1
                     10.0.0.2================10.0.1.2 realserver2
   --- Load Balancer 10.0.0.3================10.0.1.3 realserver3
                     10.0.0.4================10.0.1.4 realserver4
                     10.0.0.5================10.0.1.5 realserver5
]]></programlisting>
	<para>
 Then, you can add LVS-DR configuration commands as:
	</para>
<programlisting><![CDATA[
         ipvsadm -A -t VIP:www
         ipvsadm -a -t VIP:www -r 10.0.1.1 -g
         ipvsadm -a -t VIP:www -r 10.0.1.2 -g
         ipvsadm -a -t VIP:www -r 10.0.1.3 -g
         ipvsadm -a -t VIP:www -r 10.0.1.4 -g
         ipvsadm -a -t VIP:www -r 10.0.1.5 -g
 ]]></programlisting>
	<para>
I haven't tested it. Please let me know the result if anyone tests
this configuration.
	</para>
	<para>
Lucas 23 Apr 2004
	</para>
	<blockquote>
 Is it possible to use the cluster as a NAT router?
 What I'm saying is: I have a private LAN and I want to share my internet
 connection, doing NAT, firewalling and QoS. The realservers are actually
 routers and don't serve any service. Is there a way to use the VIP as the
 private LAN gateway, or to pass the traffic through the director to the "real
 servers (real routers)" even when it is not destined for a specific port on the
 server?
	</blockquote>
	<para>
Horms 21 May 2004
	</para>
	<para>
I think that should work, as long as you are only wanting to route IPv4
TCP, UDP and related ICMP.  You probably want to use a fwmark virtual
service so that you can forward all ports to the realservers (routers).
That said I haven't tried it, so I can't be sure.
	</para>
	</section>
	<section id="brownfield" xreflabel="source routing patches">
	<title>LVS-NAT source routing patch (Brownfield, Sarwari and Black)</title>
	<note>
Mar 2006: This will be in the next release of LVS.
	</note>
	<para>
Ken Brownfield found that ipvs overrides the routing of packets sent from the director to 0/0
(<emphasis>i.e.</emphasis> with LVS-NAT, or with LVS-DR and the forward-shared patch).
The packets from ipvs should use the routing table, but they don't.
Ken had a director with two external NICs. He wanted
the packets to return via the NIC on which they arrived. When he tried LVS-NAT
with his own routing table installed (which worked when tested
with traceroute), the reply packets from ip_vs were sent to the default gw,
apparently ignoring his routing table.
It should be none of ip_vs's business where the packets are routed.
	</para>
        <para>
Here's Ken's 
<ulink url="http:files/ip_vs_source_route.patch.gz">
ip_vs_source_route.patch.gz</ulink> patch.
	</para>
	<para>
Here's Ken's take on the matter
	</para>
	<para>
I need to support VIPs on the director that live on two  
separate external subnets:
	</para>
<programlisting><![CDATA[
       |        |
  eth0 |   eth1 |         eth0 = ISP1_IP on ISP1_SUBNET
----------------------    eth1 = ISP2_IP on ISP2_SUBNET
|      Director      |
----------------------
   internal |
            |
]]></programlisting>
	<para>
The default gateway is on ISP1_SUBNET/eth0, and I have source routes  
set up as follows for eth1:
	</para>
<programlisting><![CDATA[
# cat /etc/SuSE-release
SuSE Linux 9.0 (i586)
VERSION = 9.0

# uname -a
Linux lvs0 2.4.21-303-smp4G #1 SMP Tue Dec 6 12:33:10 UTC 2005 i686 i686 i386 GNU/Linux

# ip -V
ip utility, iproute2-ss020116

# ip rule list
0:      from all lookup local
32765:  from ISP2_SUBNET lookup 136
32766:  from all lookup main
32767:  from all lookup default

# ip route show table 136
ISP2_SUBNET dev eth1  scope link  src ISP2_IP
default via ISP2_GW dev eth1
]]></programlisting>
	<para>
If I perform an mtr/traceroute on the director bind()ed to the
ISP2_IP interface, outgoing traceroutes traverse the proper ISP2_GW,
and the same for the ISP1_IP interface.  I'm pretty sure the
source-route behavior is correct, since I can revert from the proper
behavior by dropping table 136.
	</para>
	<para>
For a single web service, I'm defining identical VIPs but for each of  
the ISPs:
	</para>
<programlisting><![CDATA[
  -A -t ISP1_VIP:80 -s wlc
  -a -t ISP1_VIP:80 -r 10.10.10.10:80 -m -w 1000
  -a -t ISP1_VIP:80 -r 10.10.10.11:80 -m -w 800
  -A -t ISP2_VIP:80 -s wlc
  -a -t ISP2_VIP:80 -r 10.10.10.10:80 -m -w 1000
  -a -t ISP2_VIP:80 -r 10.10.10.11:80 -m -w 800
]]></programlisting>
	<para>
Incoming packets come in via the proper gateway, but LVS always emits  
response packets through the default gateway, seemingly ignoring the  
source-route rules.  
	</para>
	<para>
I've seen Henrick's general fwmark state tracking described.  
Reading this, it seems like  
this patch isn't exactly approved or even obviously available.  And  
the article is from 2002. :)
	</para>
	<para>
I'm also not sure why this seems like such a difficult problem.  If  
LVS honored routes, there would be no complicated hacks required.   
Unless LVS overrides routes, in which case it might be nice to have a  
switch to turn off that optimization.
	</para>
	<para>
I understand that routes are a subset of the problem fixed by the  
patch, and I can see the value of the patch.  But for the basic route  
case it seems odd for LVS to just dump all outgoing packets to the  
default gw.  I mean, it could cache the routing table instead of just  
a single gw?
	</para>
	<para>
From what I can tell, the SH scheduler decides which realserver will receive an  
incoming request based on the external source IP in the request.  I  
can see four problems with this.
	</para>
	<itemizedlist>
		<listitem>
The first is that I can't see how this will change the return route
of the packet.  I can see mapping incoming source routes to specific
real servers with distinct gateways, but I can't see how it could
affect an LVS-NAT setup.
		</listitem>
		<listitem>
The second is that a single client IP could go through either  
incoming VIP.  Assuming SH was somehow changing outbound routing, it  
would distribute the outbound gateway randomly vs. correctly.  I  
suppose this helps distribute traffic but I'm not really interested  
in perpetuating asymmetric routes.
		</listitem>
		<listitem>
The third is that I'd really like to use LVS as a load-balancer, not  
as a simple load splitter.  wlc is pretty key.
		</listitem>
		<listitem>
The fourth is that using sh doesn't change outbound routes, I just  
tried it. :-)
		</listitem>
	</itemizedlist>
	<para>
The docs state "Multiple gateway setups can be solved with routing  
and a solution is planned for LVS."  Which seems to imply that source  
routing is a fix but sort of not... :(
	</para>
	<para>
Scanning the NFCT patch and looking at the icmp handling, I'm pretty  
sure the problem is that ip_vs_out() is sending out the packet with a  
route calculated from the real server's IP.  Since ip_vs_out() is  
reputedly only called for masq return traffic, I think this is just  
plain incorrect behavior.
	</para>
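	<para>
Ken's diagnosis can be illustrated with a toy model of policy routing. The
sketch below reproduces only the two relevant rules from his
<command>ip rule list</command> output (the local table is left out) and does
a lookup twice: once keyed on the realserver's address, as he believes
ip_vs_out() was doing, and once keyed on the VIP, as happens when the route is
recalculated after masquerading. The function route_lookup is hypothetical,
and 203.0.113.0/24, 203.0.113.10 and 198.51.100.7 are example addresses
standing in for ISP2_SUBNET, ISP2_VIP and a client.
	</para>
<programlisting><![CDATA[
import ipaddress

# (priority, source prefix or None for "from all", table name)
RULES = [
    (32765, "203.0.113.0/24", "136"),   # from ISP2_SUBNET lookup 136
    (32766, None, "main"),              # from all lookup main
]

# table name -> list of (destination prefix, gateway or None, device)
TABLES = {
    "136": [("203.0.113.0/24", None, "eth1"),
            ("0.0.0.0/0", "ISP2_GW", "eth1")],
    "main": [("0.0.0.0/0", "ISP1_GW", "eth0")],
}


def route_lookup(src, dst):
    """Return (gateway, device) for a packet from src to dst, or None."""
    src_ip = ipaddress.ip_address(src)
    dst_ip = ipaddress.ip_address(dst)
    # Rules are tried in order of priority; the first table with a match wins.
    for _prio, prefix, table in sorted(RULES, key=lambda rule: rule[0]):
        if prefix is not None and src_ip not in ipaddress.ip_network(prefix):
            continue
        matches = [(ipaddress.ip_network(net), gw, dev)
                   for net, gw, dev in TABLES[table]
                   if dst_ip in ipaddress.ip_network(net)]
        if matches:
            # Inside a table the longest matching prefix wins.
            _net, gw, dev = max(matches, key=lambda m: m[0].prefixlen)
            return gw, dev
    return None


CLIENT = "198.51.100.7"
RIP = "10.10.10.10"         # source address before masquerading
ISP2_VIP = "203.0.113.10"   # source address after masquerading

print(route_lookup(RIP, CLIENT))        # route keyed on the realserver's IP
print(route_lookup(ISP2_VIP, CLIENT))   # route recalculated after SNAT
]]></programlisting>
	<para>
In this model the first lookup falls through to the main table and its default
gateway, while the second is caught by the source rule and leaves via eth1,
which matches the symptom Ken reports and the behaviour he wants.
	</para>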
	<para>
I pulled out the route_me_harder() mod and created the attached  
patch.  My only concern would be performance, but it seems  
netfilter's NAT uses this.
	</para>
	<para>
First, I need to correct the stated provenance of this patch.  It is  
a small tweaked subset of an antefacto patch posted to integrate  
netfilter's connection tracking into LVS, not the NFCT patches as I  
said.  Lots of Googling, not enough brain cells.  This patch applies  
to v1.0.10, but appears to be portable to 2.6.
	</para>
	<para>
During a maintenance window this morning, I had the opportunity to  
test the patch.
	</para>
	<para>
This was the first time I had ever loaded the patched module, and shockingly it
worked perfectly -- outbound traffic from masq VIPs now follows
source-routes and chooses the correct outbound gateway.  No side
effects so far, no obvious increased load.
	</para>
	<para>
I also poked around the 2.6 LVS source a bit to see if this issue had  
been resolved in later versions, and noticed uses of  
ip_route_output_key, but the source address was always set to 0  
instead of something more specific.  I'd say it might be worth a  
review of the LVS code to make sure source addresses are set  
usefully, and routes are recalculated where necessary.
	</para>
	<para>
In any case, if anyone has a similar problem with VIPs spanning  
multiple external IP spaces and gateways, this has been working like  
a charm for me in significant production load.  So far.   
*knock*on*wood*  I'll update if it crashes and/or burns.
	</para>
	<para>
Joe
	</para>
	<blockquote>
 	any idea what would happen if there were multiple 
VIPs or the packets coming into the director from the 
outside world were arriving at the LVS code via a fwmark?
	</blockquote>
	<para>
To my understanding, Henrick's fwmark patch allows LVS to route traffic  
based on fwmarks set by an admin in iptables/iproute2.  I can imagine  
certain complex situations where this functionality could be useful  
and even crucial, but setup and maintenance of fwmarks requires  
specifically coded fwmark behavior in each of netfilter, iproute2,  
and ip_vs.
	</para>
	<para>
Source routes are essentially a standard feature these days, and are  
critical for proper routing on gateways and routers (which is  
essentially what a director is in Masq mode).  Having LVS properly  
observe the routing table is a "missing feature", I believe.  The  
patch I created requires no changes for an admin to make (no fwmarks  
to set up in ip_vs, netfilter, *and* iproute2), basically just  
properly and transparently observing routes set by iproute2 (which  
the rest of the director's traffic already obeys).
	</para>
	<para>
So short answer: Henrick's patch allows VIP routing based on fwmarks  
specifically created/handled by an admin for that purpose, whereas  
mine is a minor correction to existing code to properly recalculate  
the routes of outbound VS/NAT VIP traffic after mangling/masquerading  
of the source IP.  A little end-result crossover, but really quite  
different.  My (borrowed :) patch is essentially a one-liner, so the  
code complexity is very small and the behavior easily confirmable at  
a glance.  The fwmark code is more invasive, seemingly.
	</para>
	<para>
Technically, I could have used fwmarks, but until someone needs that  
specific functionality, I suspect proper source-routing covers 90% of  
the alternate use cases.  And it's the cleaner, more specific  
solution to my problem.  But that's just me. :)
	</para>
	<para>
Your summary of SH matches my understanding -- it's hash-based  
persistence calculated from the client's source IP (vs destination in  
DH).  It probably generates a good random, persistent distribution,  
which I can see being useful in a cluster environment where  
persistence is rewarded by caching/sessions/etc.  WLC with  
persistence is probably a better bet for a load-balancer config,  
since it actually balances load.  Without something like wackamole on  
the real servers, rr/sh/dh are happy to send traffic to dead servers,  
AFAICT.
	</para>
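	<para>
As a rough illustration of the source-hashing behaviour Ken describes, the
sketch below maps a client IP to a bucket in a fixed table of realservers. It
is not taken from the ipvs source: the table size, the multiplicative hash
constant, the round-robin filling of the table (which ignores weights) and the
function names are all assumptions made for this example.
	</para>
<programlisting><![CDATA[
import ipaddress

TAB_SIZE = 256  # number of hash buckets (an assumption for this sketch)


def build_table(servers):
    """Assign the buckets to the servers in turn (weights are ignored)."""
    return [servers[i % len(servers)] for i in range(TAB_SIZE)]


def source_hash(client_ip):
    """Multiplicative hash of the client's IPv4 address into a bucket index."""
    addr = int(ipaddress.IPv4Address(client_ip))
    # Keep the top 8 bits of the 32-bit product as the bucket number.
    return ((addr * 2654435761) & 0xFFFFFFFF) >> 24


def schedule(table, client_ip):
    """Pick the realserver for a client; no load or health information is used."""
    return table[source_hash(client_ip)]


servers = ["10.10.10.10", "10.10.10.11"]
table = build_table(servers)
first = schedule(table, "198.51.100.7")
again = schedule(table, "198.51.100.7")
print(first, again)   # the same client always lands on the same realserver
]]></programlisting>
	<para>
Nothing in this lookup depends on connection counts or on whether the chosen
realserver is alive, which is the point Ken makes about rr/sh/dh compared with wlc.
	</para>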
	<para>
Ken Brownfield <emphasis>krb (at) irridia (dot) com</emphasis> 22 Mar 2006
	</para>
	<para>
I'm attaching <filename>ip_vs_source_route.patch.gz</filename>, which is the patch itself.
It patches <filename>ip_vs_core.c</filename>, adding a function call at the end of  
<filename>ip_vs_out()</filename> that recalculates the route for an outgoing packet after  
mangling/masquerading has occurred.
	</para>
	<para>
<filename>ip_vs_out()</filename>, according to the comments in the source (and my brief  
perusal of the code) is "used only for VS/NAT."  There should be no  
effect on DR/TUN functionality as far as I can tell.  This type of  
route recalc might be correct behavior in some TUN or DR  
circumstances, but I have no experience in a DR/TUN setup.
So yes, I believe this patch is orthogonal to DR/TUN functionality  
and should be silent with regard to DR/TUN.
	</para>
	<para>
The only concern a user should have after applying this patch is that  
they make sure they are aware of existing source routes before using  
the patch.  Users may be unknowingly relying on the fact that LVS  
always routes traffic based on the real server's source IP instead of  
the VIP IP, and applying the patch could change the behavior of their  
system.  I suspect that will be a very rare concern.
	</para>
	<para>
As long as the source routes on the system are correct, where the  
source IP == the VIP IP, packets from LVS will be routed as the  
system itself routes packets.  Routes confirmed with a traceroute  
(bound to a specific IP on the director) will no longer be ignored  
for traffic outbound from a NAT VIP.
	</para>
	<para>
Joe: next Farid Sarwari stepped in
	</para>
	<para id="sarwari" xreflabel="IPSec">
Farid Sarwari <emphasis>fsarwari (at) exchangesolutions (dot) com</emphasis> 25 Jul 2006 
	</para>
	<para>
I'm having some issues with IPVS and IPSec. When an HTTP client requests
a page, I can see the traffic come all the way to the webserver
(ws1,ws2). However, the return traffic gets to the load balancer but
does not make it through the ipsec tunnel. When doing a tcpdump I can
see that the packets get SNATed by ipvs. I know there is a problem with
ipsec2.6 and SNAT, and I've upgraded my kernel and iptables so now SNAT
with iptables works. But it looks like ipvs is doing its own SNAT which
doesn't pass through the ipsec tunnel.  
	</para>
<programlisting><![CDATA[
My setup:


                      HTTP Clients
                       -------
                         |
                          \ -- Ipsec tunnel
                          /
                         |            
                  +------------+
                  |LoadBalancer|
                  |  ipsec2.6  |  
                  |   ipvs     |
                  +------------+
                         |
                        /\
                       /  \
                      /    \
                 +-----+  +-----+
                 | ws1 |  | ws2 |
                 +-----+  +-----+


Ldirector.conf:
virtual=x.x.x.x:80 #<public ip>
        real=y.y.y.1:80 masq
        real=y.y.y.2:80 masq
        checktype=negotiate
        fallback=127.0.0.1:80 masq
        service=http
        request="/"
        receive=" "
        scheduler=wlc
        protocol=tcp

------------------

ipvsadm -ln output:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  x.x.x.x:80 wlc
  -> y.y.y.1:80            Masq    1      0          0
  -> y.y.y.1:80            Masq    1      0          0

------------------

Software Version #s:
ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)
Linux Kernel 2.6.16
iptables v1.3.5
ldirectord  version 1.131
]]></programlisting>

	<para>
The Brownfield patch is for an older version of ipvs. When I was applying the
patch, Hunk#3 failed. I was able to apply the third hunk manually. When
I compile, it gives errors for the code from the first hunk of the patch.
	</para>
	<para>
Finally got it to work! I can access load balanced pages through ipsec.
Ken Brownfield's patch seemed to have been for an older version of
kernel/ipvs. 
If you look in the patch, there is a function called ip_vs_route_me_harder
which is an exact copy of ip_route_me_harder from netfilter.c.
I'm not sure what version of ipvs/kernel Brownfield's patch is for. I
couldn't get ipvs to compile with his patch, so I just used his idea and
copied the new code from the netfilter source code.
I've modified his patch by copying the new ip_route_me_harder function
from net/ipv4/netfilter.c (2.6.16).
Below is the patch for kernel 2.6.16 (kernel sources from FC4).
	</para>
<programlisting><![CDATA[
IPVS Version:     $Id: ip_vs_core.c,v 1.34 2003/05/10 03:05:23 wensong Exp



------snip--------
--- ip_vs_core.c.orig   2006-03-20 00:53:29.000000000 -0500
+++ ip_vs_core.c        2006-07-27 14:31:14.000000000 -0400
@@ -43,6 +43,7 @@

 #include <net/ip_vs.h>

+#include <net/xfrm.h>

 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
@@ -516,6 +517,76 @@
        return NF_DROP;
 }

+/* This code stolen from net/ipv4/netfilter.c */
+
+int ip_vs_route_me_harder(struct sk_buff **pskb)
+{
+        struct iphdr *iph = (*pskb)->nh.iph;
+        struct rtable *rt;
+        struct flowi fl = {};
+        struct dst_entry *odst;
+        unsigned int hh_len;
+
+        /* some non-standard hacks like ipt_REJECT.c:send_reset() can cause
+         * packets with foreign saddr to appear on the NF_IP_LOCAL_OUT hook.
+         */
+        if (inet_addr_type(iph->saddr) == RTN_LOCAL) {
+                fl.nl_u.ip4_u.daddr = iph->daddr;
+                fl.nl_u.ip4_u.saddr = iph->saddr;
+                fl.nl_u.ip4_u.tos = RT_TOS(iph->tos);
+                fl.oif = (*pskb)->sk ? (*pskb)->sk->sk_bound_dev_if : 0;
+#ifdef CONFIG_IP_ROUTE_FWMARK
+                fl.nl_u.ip4_u.fwmark = (*pskb)->nfmark;
+#endif
+                if (ip_route_output_key(&rt, &fl) != 0)
+                        return -1;
+
+                /* Drop old route. */
+                dst_release((*pskb)->dst);
+                (*pskb)->dst = &rt->u.dst;
+        } else {
+                /* non-local src, find valid iif to satisfy
+                 * rp-filter when calling ip_route_input. */
+                fl.nl_u.ip4_u.daddr = iph->saddr;
+                if (ip_route_output_key(&rt, &fl) != 0)
+                        return -1;
+
+                odst = (*pskb)->dst;
+                if (ip_route_input(*pskb, iph->daddr, iph->saddr,
+                                   RT_TOS(iph->tos), rt->u.dst.dev) != 0) {
+                        dst_release(&rt->u.dst);
+                        return -1;
+                }
+                dst_release(&rt->u.dst);
+                dst_release(odst);
+        }
+
+        if ((*pskb)->dst->error)
+                return -1;
+
+#ifdef CONFIG_XFRM
+        if (!(IPCB(*pskb)->flags & IPSKB_XFRM_TRANSFORMED) &&
+            xfrm_decode_session(*pskb, &fl, AF_INET) == 0)
+                if (xfrm_lookup(&(*pskb)->dst, &fl, (*pskb)->sk, 0))
+                        return -1;
+#endif
+
+        /* Change in oif may mean change in hh_len. */
+        hh_len = (*pskb)->dst->dev->hard_header_len;
+        if (skb_headroom(*pskb) < hh_len) {
+                struct sk_buff *nskb;
+
+                nskb = skb_realloc_headroom(*pskb, hh_len);
+                if (!nskb)
+                        return -1;
+                if ((*pskb)->sk)
+                        skb_set_owner_w(nskb, (*pskb)->sk);
+                kfree_skb(*pskb);
+                *pskb = nskb;
+        }
+
+        return 0;
+}

 /*
  *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_IP_POST_ROUTING
@@ -734,6 +805,7 @@
        struct ip_vs_protocol *pp;
        struct ip_vs_conn *cp;
        int ihl;
+       int retval;

        EnterFunction(11);

@@ -821,8 +893,20 @@

        skb->ipvs_property = 1;

-       LeaveFunction(11);
-       return NF_ACCEPT;
+       /* For policy routing, packets originating from this
+        * machine itself may be routed differently to packets
+        * passing through.  We want this packet to be routed as
+        * if it came from this machine itself.  So re-compute
+        * the routing information.
+        */
+       if (ip_vs_route_me_harder(pskb) == 0)
+               retval = NF_ACCEPT;
+       else
+               /* No route available; what can we do? */
+               retval = NF_DROP;
+
+       LeaveFunction(11);
+       return retval;

   drop:
        ip_vs_conn_put(cp);
------snip--------
]]></programlisting>

	<para>
Joe
	</para>
	<blockquote>
Can you do IPSec with LVS-DR? (the director would only decrypt and the
realservers encrypt)
	</blockquote>
	<para>
I haven't tried it, but I don't see why it shouldn't work. It's probably
easier to get working than LVS-NAT with IPSec :)
You can think of IPSec as just another interface, except that with kernel
2.6 there is no more ipsec0 interface. So as long as routing is set up
correctly, LVS-DR should work with IPSec.
	</para>
	<blockquote>
so you have an ipsec0 interface and you can put an IP on it and route
to/from it just like with eth0? Can you use iproute2 tools on ipsec0?
	</blockquote>
	<para>
With Kernel 2.6 there is no more ipsec0 interface, but you can use
iproute2 to alter the routing table. You wouldn't want to modify the
routes to the tunnel because ipsec takes care of that, but you can
modify routes for traffic that is coming through the tunnel destined for
LVS-DR.
	</para>
	<para>
Ken Brownfield <emphasis>krb (at) irridia (dot) com</emphasis> 28 Jul 2006 
	</para>
	<para>
At first glance, that's exactly what had to be ported, and I'm glad someone
with enough 2.6 fu did it.
Now, if someone could have it conditional on a proc/sysctl, it would seem
like more of a no-brainer for inclusion. ;)
	</para>
	<para>
Joe: next David Black stepped in
	</para>
	<para>
David Black <emphasis>dave (at) jamsoft (dot) com</emphasis> 28 Jul 2006
	</para>
	<para>
I applied the following patch to a stock 2.6.17.7 kernel, and enabled
the source routing hook via /proc/sys/net/ipv4/vs/snat_reroute:
<ulink url="http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.16-1.diff">http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.16-1.diff</ulink>.
LVS-NAT connections now appear to obey policy routing - yay!
	</para>
	<para>
Referring to an older version of the NFCT patch, Ken Brownfield says in
the LVS HOWTO: "I pulled out the route_me_harder() mod and created the
attached patch."  So the Brownfield patch is a derivative of the NFCT
patch in the first place.
	</para>
	<para>
And here's a comment from the NFCT patch I used:
	</para>
<programlisting><![CDATA[
/* For policy routing, packets originating from this
 * machine itself may be routed differently to packets
 * passing through.  We want this packet to be routed as
 * if it came from this machine itself.  So re-compute
 * the routing information.
]]></programlisting>
	<para>
For a patched kernel, that functionality is enabled by
	</para>
<programlisting><![CDATA[
echo 1 > /proc/sys/net/ipv4/vs/snat_reroute
]]></programlisting>
	<para>
Farid Sarwari <emphasis>fsarwari (at) exchangesolutions (dot) com</emphasis> 31 Jul 2006 
	</para>
	<para>
The problem I was having with ipvs was that I couldn't access it through
ipsec on kernel 2.6. I remember accessing ipvs through ipsec on 2.4 a few
years ago and I don't remember running into this problem.
Correct me if I'm wrong, but prior to kernel 2.6.16 SNAT (netfilter) didn't
work properly with ipsec. When troubleshooting my problem it looked like
the natting was happening after the routing decision had been made. This
is why I was under the assumption that only code from kernel 2.6.16+
would fix my problem.
If the NFCT patch works with ipsec, I would much rather use that.
	</para>
	<para>
Joe
	</para>
	<blockquote>
If Julian's patch had been part of the kernel ipvs code,
would anyone have had source routing/iproute2 problems with LVS-NAT?
	</blockquote>
	<para>
Ken 9 Aug 2006
	</para>
	<para>
I don't believe so -- the source-routing behavior appears to be a (happy)
side-effect of working NFCT functionality.  I think the NFCT and
source-routing patches' intentions are to supply a feature and a bug-fix,
respectively, but NFCT is an "accidental" superset.
	</para>
	</section>
	<section id="lvs_nat_ftp">
	<title>LVS-NAT FTP Recipe</title>
	<para>
Stephen Milton <emphasis>smmilton (at) gmail (dot) com</emphasis> 17 Dec 2005
	</para>
	<para>
This may be old hat to many of you on this list, but I had a lot of 
problems deciphering all the issues around FTP in load balanced NAT.  So I 
wrote up the howto on how I got my configuration to work.  I was 
specifically trying to setup for high availability, load balanced, FTP and 
HTTP with failover and firewalling on the load balancer nodes.  Here is a 
permanent link to the article: 
<ulink url="http://sacrifunk.milton.com/b2evolution/blogs/index.php/2005/12/17/load_balanced_ftp_server">
load_balanced_ftp_server</ulink>
(http://sacrifunk.milton.com/b2evolution/blogs/index.php/2005/12/17/load_balanced_ftp_server)
	</para>
	</section>
	<section id="lvs_nat_vhosts">
	<title>LVS-NAT vhosts with apache</title>
	<para>Michael Green <emphasis>mishagreen (at) gmail (dot) com</emphasis>
	</para>
	<blockquote>
Is it possible to make Apache's IP based vhosts work under LVS-NAT?
	</blockquote>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis>
14 Dec 2005
	</para>
	<para>
If, by that, you mean Apache vhosts whereby a single vhost lives on a 
single IP then the answer is definitely "yes", although it may seem 
counter-intuitive at first.
	</para>
	<para>
If you're using IP based virtual hosting, you have a single IP address 
for *each and every* virtual host. In the 'classic' sense this means 
your server has one, two, a hundred, a thousand IP addresses configured 
(as aliases) on its interface which faces the internet and a different 
vhost listens to each interface.
	</para>
	<para>
In the clearest case of LVS-NAT, you'd have your public interface on 
the director handle the one, two, a hundred, a thousand _public_ IP 
addresses and present those to the internet (or your clients, be those 
as they are).
Assuming you have N realservers, you then require N*(one, two, a 
hundred, a thousand) private IP addresses and you configure (one, 
two, a hundred, a thousand) aliases per realserver. You then set up 
LVS-NAT to take each specific public IP and NAT it inbound to N private 
IPs on the realservers.
	</para>
	<para>
Still with me? Good.
	</para>
	<para>
This is a network management nightmare. Imagine you had 256 Virtual 
IPs, each with 32 servers in a pool. You immediately need to manage an 
entire /19 worth of space behind your director. That's a lot of address 
space (8192 addresses to be precise) for you to be keeping up with, and 
it's a *lot* of entries in your ipvsadm table.
	</para>
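	<para>
The arithmetic can be checked with a few lines of shell. This is only a sketch:
the function name <command>prefix_for</command> is made up for the illustration,
and network and broadcast addresses are ignored, as they are in the figure above.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# prefix_for N: print the shortest IPv4 prefix length whose block holds N addresses
prefix_for() {
    bits=0
    size=1
    while [ "$size" -lt "$1" ]; do
        size=$((size * 2))
        bits=$((bits + 1))
    done
    echo $((32 - bits))
}

total=$((256 * 32))     # 256 VIPs, each with 32 realservers
echo "$total private addresses need a /$(prefix_for "$total")"
]]></programlisting>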
	<para>
There is, however, a trick you can use to massively simplify your addressing:
	</para>
	<para>
Put all your IP based vhosts on the same IP but a *different port* on 
each realserver. Suddenly you go from 8192 realserver addresses (aliases) 
to, well, 32 addresses (aliases) with 256 ports in use on each one. Much 
easier to manage.
	</para>
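	<para>
A minimal sketch of the port trick, assuming two public VIPs, three realservers
and a base port of 8000 (all of these values, and the function name
<command>gen_rules</command>, are made up for the illustration). Each VIP:80 is
forwarded by LVS-NAT (<command>-m</command>) to its own port on the single RIP
of every realserver. The script only prints the <command>ipvsadm</command>
commands, so they can be reviewed before being run on the director. The web
server on each realserver would also have to listen on one port per vhost.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Hypothetical addresses: two public VIPs, three realservers.
VIPS="192.0.2.10 192.0.2.11"
RIPS="10.0.0.1 10.0.0.2 10.0.0.3"
BASE_PORT=8000

# gen_rules: print one virtual service per VIP. The first VIP maps to port
# BASE_PORT+1 on every realserver, the second to BASE_PORT+2, and so on.
gen_rules() {
    n=0
    for vip in $VIPS; do
        n=$((n + 1))
        port=$((BASE_PORT + n))
        echo "ipvsadm -A -t $vip:80 -s rr"
        for rip in $RIPS; do
            echo "ipvsadm -a -t $vip:80 -r $rip:$port -m"
        done
    done
}

gen_rules       # review the output, then pipe it to sh on the director
]]></programlisting>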
	<para>
For even more trickery you could probably make use of some of 
keepalived's config tricks to "pool" your realservers and make your 
configuration even more simple, but if you only have a small 
environment you may want to get used to using ipvsadm by hand first 
until you're happy with it.
	</para>
	</section>
	<section id="LVS-NAT_timeout_problem">
	<title>LVS-NAT timeout problem</title>
	<para>
Joe: here's a posting that hasn't been solved. It occurred with LVS-NAT,
but we don't know if it occurs with the other forwarding methods.
	</para>
	<para>
Dmitri Skachkov <emphasis>dmitri (at) nominet (dot) org (dot) uk</emphasis> 21 Feb 2007
	</para>
	<para>
I should probably say in the beginning that the issue I'm going to describe
is not directly related to the problem discussed on this list
a while ago (http syn/ack not translated when ftp loadbalancing also enabled).
We have several LVS/NAT installations which are managed by Keepalived.
All of them are pretty much identical and exhibit the same issue.
The setup looks like this (a backup load balancer and a backup
router are omitted) and is standard LVS/NAT:
	</para>
<programlisting><![CDATA[

        !-----------------!
        !                 !
        !     Internet    !
        !                 !
        !-----------------!
                 !
                 !
        !-----------------!
        !                 !
        !     Router      !
        !                 !
        !-----------------!
                 !
                 !
        !-----------------!
        !      eth0       !
        !                 !
        !  LoadBalancer   !
        !                 !
        !      eth1       !
        !-----------------!
                 !
                 !192.168.1.0/24
    ------------------------
    !       !       !      !
  !---!                  !---!
  !RS1!     .........    !RSN!
  !---!                  !---!
]]></programlisting>
	<para>
This setup works fine most of the time, except when a client sends a TCP SYN
packet and then forgets about the connection. In this case the RealServer keeps
sending SYN/ACK packets until the connection times out on the server, which then sends RST/ACK.
The issue is that the last two packets don't get translated because ipvs on the
LoadBalancer has already timed out the connection. Below is a tcpdump on LoadBalancer/eth0:
	</para>

<programlisting><![CDATA[
10:58:20.655059 IP 213.248.204.8.2113 > 213.248.224.116.43: S 1402601529:1402601529(0) win 512
10:58:20.655335 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:24.031708 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:30.792336 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:44.303557 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:59:11.316010 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:00:05.330972 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:01:05.346329 IP 192.168.1.32.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:02:05.362233 IP 192.168.1.32.43 > 213.248.204.8.2113: R 1:1(0) ack 1 win 49312
]]></programlisting>
	<para>
In this example I simulated the situation by sending a SYN packet from my PC
to the server and dropping all further packets.
While the SYN/ACK packets were still being translated, I saw:
	</para>
<programlisting><![CDATA[
director# ipvsadm -lnc
TCP 28:12  NONE        213.248.204.8:0    213.248.224.116:43 192.168.1.32:43
TCP 00:57  SYN_RECV    213.248.204.8:2113 213.248.224.116:43 192.168.1.32:43
]]></programlisting>
	<para>
But once I see only this:
	</para>
<programlisting><![CDATA[
TCP 27:02  NONE        213.248.204.8:0    213.248.224.116:43 192.168.1.32:43
]]></programlisting>
	<para>
packets from the RealServer belonging to this connection (from the RealServer's 
point of view) stop getting translated.
	</para>
	<para>
Yes, I played with 'ipvsadm --set tcp tcpfin udp' and it doesn't
have any effect on this issue.
	</para>
	<para>
This is not a real problem but rather a nuisance for me. I just don't want 
packets with private IPs leaving the LoadBalancer. 
I can't block these packets with iptables since I believe ipvs does the SNATing
somewhere in the POSTROUTING chain and there is no way to put any other rules beyond this chain.
I also can't modify the SYN_RECV timeout since there is no tcp_timeout_syn_recv entry
in <filename>/proc/sys/net/ipv4/vs/</filename> (this is a stock CentOS 4.3 kernel).
My question is: is it possible to block untranslated packets from 
leaving the LoadBalancer without touching the RealServers and the Router?
	</para>
	<para>
If it can help, here is additional info:
	</para>
<programlisting><![CDATA[
# uname -a
Linux lb1 2.6.9-34.ELsmp #1 SMP Thu Mar 9 06:23:23 GMT 2006 x86_64 x86_64 x86_64 GNU/Linux
# ipvsadm --help
ipvsadm v1.24 2003/06/07 (compiled with getopt_long and IPVS v1.2.0)
]]></programlisting>
	<para>
later...
	</para>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis> 25 Jun 2007
	</para>
	<para>
One of my "standard" (I use the term advisedly) configuration settings
for LVS-NAT is to ensure that I have an SNAT rule for packets exiting
the director towards clients.
I make sure that if I have RS1 with VIP 1.2.3.4 and two realservers
192.0.0.1 and 192.0.0.2 that I have rules of the form:
	</para>
<programlisting><![CDATA[
-t nat -A POSTROUTING -o eth0 -s 192.0.0.1 -j SNAT --to-source $VIP
-t nat -A POSTROUTING -o eth0 -s 192.0.0.2 -j SNAT --to-source $VIP 
]]></programlisting>
	<para>
This means any packets escaping the LVS - for example where the LVS
connection entry has timed out but the realserver application session
hasn't - will be SNATted to the right IP.
	</para>
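	<para>
Written out as complete commands, the rule fragments above might look like the
following sketch. It reuses the VIP and realserver addresses from the posting
and assumes that eth0 is the director's client-facing interface; the function
name <command>snat_rules</command> is made up. The script only prints the
commands. Whether such rules ever see packets that ipvs has handled depends on
the kernel, so test before relying on them.
	</para>
<programlisting><![CDATA[
#!/bin/sh
VIP=1.2.3.4                         # VIP from the posting
REALSERVERS="192.0.0.1 192.0.0.2"   # realserver addresses from the posting
OUT_IF=eth0                         # assumed client-facing interface

# snat_rules: print one SNAT rule per realserver
snat_rules() {
    for rip in $REALSERVERS; do
        echo "iptables -t nat -A POSTROUTING -o $OUT_IF -s $rip -j SNAT --to-source $VIP"
    done
}

snat_rules      # review the output, then pipe it to sh on the director
]]></programlisting>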
	<para>
It also means that any sessions originating from the realserver - CGI
calls to other websites, PHP database connections to offboard servers,
SSI includes, RSS inclusion, whatever - appear to come from the right
source. It can help to track down abuse in the case of mass virtual
hosting, and it prevents information leakage of the form you're seeing.
	</para>
	<para>
Longer term, it looks like you need to make sure that the IP stack
timeouts on the realservers match the LVS connection table timeouts on
the director. Have a look at the "--set" option to ipvsadm, and check
the corresponding sysctls in <filename>/proc/sys/net/ipv4/</filename> - you may have to do a
bit of deduction regarding backoff algorithms and retries to get the total
time for (for example) a TCP three-way handshake timeout, like you're
seeing.
	</para>
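	<para>
A sketch of that deduction, applied to the tcpdump in the posting above. It
assumes that the director's SYN_RECV timeout is 60 seconds (a guess suggested
by the 00:57 in the <command>ipvsadm -lnc</command> listing, not a documented
value for that kernel) and that each packet the director translates restarts
the timer. The function name <command>gaps</command> is made up for the
illustration.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# gaps LIMIT: read HH:MM:SS.ffffff timestamps (one per line) on stdin and
# print the gap to the previous line, flagging gaps longer than LIMIT seconds.
gaps() {
    awk -v limit="$1" '
    {
        split($1, t, ":")
        now = t[1] * 3600 + t[2] * 60 + t[3]
        if (NR > 1) {
            gap = now - prev
            printf "%s %.3f %s\n", $1, gap, (gap > limit ? "expired" : "ok")
        }
        prev = now
    }'
}

# Timestamps of the realserver's SYN/ACKs and final RST from the tcpdump.
gaps 60 <<'EOF'
10:58:20.655335
10:58:24.031708
10:58:30.792336
10:58:44.303557
10:59:11.316010
11:00:05.330972
11:01:05.346329
11:02:05.362233
EOF
]]></programlisting>
	<para>
Only the last two gaps are longer than 60 seconds, and they belong to the two
packets that left the director with the realserver's private address. Under
these assumptions the director's timeout would have to be longer than the
longest retransmission interval on the realserver (or that interval shorter)
for every packet to be translated.
	</para>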
	<para>
<emphasis>dmitri (at) nominet (dot) org (dot) uk</emphasis> 25 Jun 2007
	</para>
	<para>
If I remember correctly, the POSTROUTING solution didn't work for me, as it seemed that the
LVS code in the kernel just ignored any POSTROUTING rules for IP packets under LVS control. 
Nor did <command>ipvsadm --set</command> have any effect on this issue. 
Sorry for the short explanation, but this is what I remember off the top of my head; in all other 
respects LVS is working fine for us, so I have not looked into it lately.
	</para>
	</section>
</section>
<section id="LVS-HOWTO.arp_problem" xreflabel="The Arp Problem">
<title>LVS: The ARP Problem</title>
	<section id="the_problem">
	<title>The problem</title>
	<para>
If you follow the instructions and setup the examples
in the <link linkend="mini-HOWTO">LVS-mini-HOWTO</link>,
then you don't need to know about the arp problem.
If you're going to setup grander LVS's, 
then you'll need to understand the arp problem.
	</para>
	<para>
Although this section comes early in the HOWTO, it has lots of pitfalls.
You shouldn't be reading this unless you've at least setup a working
LVS-NAT and LVS-DR LVS using the canned instructions in the
<link linkend="mini-HOWTO">LVS-mini-HOWTO</link>.
	</para>
	<para>
The LVS allows several machines to function as one machine.
For LVS-DR and LVS-Tun, some trickery
was needed to split the various handshakes 
involved in establishing and maintaining a tcpip connection,
so that some parts of the handshake come from one
machine and other parts from another machine.
The worst problem, which ironically only
happens with realservers running Linux (2.2 and later kernels),
is the "arp problem".
It's just as well we have the source code :-(.
	</para>
	<para>
With LVS-DR and LVS-Tun, all the machines (director, realservers)
in the LVS have an extra IP, the VIP. Here's a LVS-DR in a test setup where
all machines and IPs are on the same physical network
(<emphasis>i.e.</emphasis> are using the same link layer
and can hear each other's broadcasts).
	</para>
<programlisting><![CDATA[

                      ________
                     |        |
                     | client |
                     |________|
 	                 |
                         |
                      (router)
                         |
                         |
                         |       __________
                         |  DIP |          |
                         |------| director |
                         |  VIP |__________|
                         |
                         |
                         |
       ------------------------------------
       |                 |                |
       |                 |                |
   RIP1, VIP         RIP2, VIP        RIP3, VIP
 ______________    ______________    ______________
|              |  |              |  |              |
| realserver1  |  | realserver2  |  | realserver3  |
|______________|  |______________|  |______________|


]]></programlisting>
	<para>
When the client requests a connection to the VIP, it must connect
to the VIP on the director and not to the VIP on the realservers.
	</para>
	<para>
The director acts as a layer-4 IP router,
accepting packets destined for the VIP and then sending them on to a realserver
(where the real work is done and a reply is generated).
For the LVS to function, when the client (or if present, the router)
puts out the arp request "who has VIP, tell client",
the client/router must receive the MAC address
of the director (and not the MAC address of one of the realservers).
After receiving the arp reply,
the client will send the connect request to the director.
(The director will update its <command>ipvsadm</command> tables
to keep track of the connections that it's in charge of and
then forward the connect request packet to the chosen realserver).
	</para>
	<para>
If instead, the client gets the MAC address of one of the realservers,
then the packets will be sent directly to that realserver,
bypassing the LVS action of the director.
If nothing is done to direct arp requests, by the router for the VIP,
to the director (<xref linkend="arp_bouncing"/>), then in some setups,
one particular realserver's MAC address will be in the client/router's
arp table for the VIP and the client will only see one realserver.
If the client's packets are consistently sent to the same realserver,
then the client will have a normal session connected to that realserver.
You can't count on this happening: in the middle of a tcpip session,
the client/router might get the MAC address of another realserver
as a result of an arp request, and that realserver will start getting
packets for connections it knows nothing about
(and will send tcp resets).
(In my setup, the machine with the fastest CPU is
in the client's arp table, suggesting that it's the first machine to reply
that gets in. Horms and Steven Williams have written that they think
it's the last machine to reply whose entry is in the client's arp table.)
In other setups where the realservers are identical,
the client will connect to different realservers each time the
arp cache times out (see comment by Steven Williams elsewhere).
If the director always gets its MAC address in the router arp table,
then the LVS will work without any changes to the realservers
(as happened in my case with a director with the fastest CPU in the LVS),
although this is not a reliable solution for production.
	</para>
	<para>
Getting the MAC address of the VIP on the director
(instead of the MAC address of the VIP on the realservers) to the
client when the client/router does an arp request
is the key to solving the "arp problem".
	</para>
	<para>
The traditional ways of handling the arp problem (as explained here)
all require fiddling with the settings of the VIP on the realservers.
The assumption in the early days of LVS was that you wouldn't
have access to the router (this being under the control
of the IT department or your ISP and you would have to go through
a lot of bureaucracy to change the settings on the router).
However if you're paying good money to an ISP to house your
LVS, or your inhouse LVS is doing something useful for your
establishment, then you should have no trouble in having the
router setup the way you want.
	</para>
	<para>
If you have access to the router (or can put one in front of your
LVS - a low power linux box is just fine) and you can set it
to route packets for the VIP only to the director(s) and not
to the realservers, or you can use the arp filtering tools
of <filename>iptables</filename>, and you understand what's been said above,
then you've handled the arp problem and need read no further.
	</para>
	<para>
For those who don't have access to the router, or who want
to setup an LVS on one network, then read on...
	</para>
	<para>
There is no arp problem with Linux 2.0.x kernels,
as devices which don't reply to arp requests
(dummy0, tunl0, lo:0) are available on the realserver.
For other OS's, the NOARP flag for ifconfig stops the VIP
on the realservers from replying to arp requests.
	</para><para>
However with 2.2.x (and later) kernels,
the devices which didn't reply to arp requests in 2.0.x,
now reply to arp requests.
There is a "-arp" (NOARP) option for ifconfig which (according
to the man pages) turns off replies to arp requests for that
device, and an "arp" option which turns them back on again.
Linux does not always honour this flag. You couldn't turn on replies
to arp requests for the dummy0 devices in 2.0.36 kernels and
you can't turn it off for tunl0 in 2.2.x kernels. eth0 behaves
properly in 2.0.36 but in 2.2.x kernels it arps even when you
tell it not to arp. This behaviour of not honouring the NOARP
flag in the Linux 2.2.x kernels
<link linkend="first_inklings">is not regarded as a &quot;problem&quot;</link>
by those writing the Linux TCPIP code and is not going to be &quot;fixed&quot;.
	</para>
	<para>
Julian 22 May 2001
	</para>
	<blockquote>
		<para>
The flag is used to allow arp requests for the specified device.
Although "lo" doesn't reply to arp requests, the requests for the
VIP go through eth*, and so the NOARP flag is of no help to us.
We can't drop the flag for eth.
		</para>
	</blockquote>
	<para id="julian_alias">
Another wrinkle is that in 2.0.36 kernels, aliased devices
(eg eth0:1) could be set up independently of the options on
the primary (eth0) device. Thus eth0:1 behaved as if it were
on a separate NIC and its arp'ing behaviour could be set
independently of the primary interface. The settings of
an aliased device belonged to the IP. With the 2.2.x
kernels, the aliased devices are now just alternate names for each
other: you change an option (eg -arp) or up/down of one
alias (or primary) the other aliases follow. With 2.2.x
kernels, the settings of the aliased device belong to the
primary device (there is only one device with several
IPs).
	</para>
	<para>
When LVS was running on 2.0.36 machines, the VIP was usually
configured as an alias (eg lo:0, tunl0) on the main ethernet
device (eth0), allowing the nodes in an LVS to have only one
NIC.
	</para>
	<para>
With 2.2.x kernels,
care is needed when only one NIC is used
on the realserver (the usual case).
On a realserver with only one NIC, eth0 carries the RIP
and must reply to arp requests (to receive packets).
In that case eth0:1 carrying the
VIP will reply to arp requests too, even if you ifconfig it
with -arp. Thus if a realserver is running a 2.2.x kernel
and has the VIP on an ip_alias, then the VIP on the realserver
will reply to arp requests received from the router.
	</para>
	<para>
The use of ip_aliases is still allowed,
but requires a "label" to be recognised by
the new <xref linkend="LVS-HOWTO.policy_routing"/>
tools (iproute2 and ip_tables).
The "label"ed IPs are now secondary IPs.
	</para>
	<para>
For 2.2.x kernels and beyond the commands <command>ifconfig</command>
and <command>route</command> should only be used with single NIC leaf nodes. 
You can still use them to set up a simple LVS,
but for anything more complicated you will need to start using 
the iproute2 commands.
	</para>
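	<para>
As a minimal sketch of the iproute2 way of adding the VIP as a labelled
secondary IP: the VIP, the label and the function name
<command>add_vip</command> are made up for the illustration, and with DRY_RUN
set the command is only printed (running it for real needs root).
	</para>
<programlisting><![CDATA[
#!/bin/sh
# add_vip VIP DEV LABEL: add VIP with a /32 mask to DEV, labelled DEV:LABEL.
# With DRY_RUN set, the command is printed instead of being run.
add_vip() {
    ${DRY_RUN:+echo} ip addr add "$1/32" dev "$2" label "$2:$3"
}

# hypothetical VIP; lo:110 is an arbitrary label
DRY_RUN=1 add_vip 192.168.1.110 lo 110

# to inspect the result after a real run:
#   ip addr show dev lo
]]></programlisting>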
	</section>
	<section id="vip_lo">
	<title>Put the VIP on the realserver's lo device</title>
	<para>
In the early days (2.0.x, 2.2.x) I seemed to be able to put
the VIP on any device I liked. I don't know whether this
is still possible with the newer kernels, but people
have not been able to get their LVSes to work with
arp_filter and arp_ignore unless they put the VIP on the
realserver's lo device. 
	</para>
	</section>
	<section id="the_cure">
	<title>The Cure(s)</title>
	<para>
Several cures have been produced in an attempt to solve the arp problem. They involve either
	</para>
	<itemizedlist>
		<listitem>
stopping the realservers from replying to arp requests for
the VIP.
		</listitem>
		<listitem>
hiding the VIP on the realservers so that they don't see
the arp requests.
		</listitem>
		<listitem>
priming the client/router in front of the director with the
correct MAC address for the VIP.
		</listitem>
		<listitem>
allowing the realserver to accept a packet with dst=VIP even
though the realserver does not have a device with this IP
(<emphasis>i.e.</emphasis> the host has nothing to reply to an arp request).
This is implemented by 
<xref linkend="LVS-HOWTO.transparent_proxy"/> or 
<xref linkend="LVS-HOWTO.fwmark"/>fwmark.
For transparent proxy on the realserver, the director forwards
the packets to the realserver's MAC address, 
so you don't need to route the packets yourself.
For fwmark, you need to understand 
<xref linkend="LVS-HOWTO.routing_to_VIP-less_director"/>.
There may be performance problems with transparent proxy 
<xref linkend="TP_performance"/> at high packet rates.
No-one has tested <xref linkend="2_6_arp_announce"/> 
against transparent proxy at high throughput.
		</listitem>
		<listitem>
stopping arp requests for the VIP getting to the realservers.
		</listitem>
	</itemizedlist>
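	<para>
The third cure in the list (priming the router with the director's MAC address)
can be sketched with iproute2 on a Linux router. The VIP, MAC address, interface
and the function name <command>pin_vip</command> are all made up for the
illustration, and with DRY_RUN set the command is only printed. A permanent
entry does not follow the VIP if it moves to a backup director, so it would have
to be changed as part of any failover.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# pin_vip VIP MAC DEV: make a permanent neighbour (arp) entry for the VIP.
# With DRY_RUN set, the command is printed instead of being run.
pin_vip() {
    ${DRY_RUN:+echo} ip neigh replace "$1" lladdr "$2" nud permanent dev "$3"
}

# hypothetical VIP, director MAC address and router interface
DRY_RUN=1 pin_vip 192.168.1.110 00:11:22:33:44:55 eth1
]]></programlisting>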
	<para>
Note: For the 2.2 and 2.4 kernels, 
most of these cures involve applying a kernel patch to the realserver.
The realserver patch is unrelated to the ipvs patch applied to the director.
	</para>
	<para>
horms 4 Aug 2005
	</para>
	<para>
For the record, the ARP problem is not about honoring the 
<filename>-arp</filename> flag or not.
The problem lies in whether or not the OS regards the IP address
as belonging to the interface, or as belonging to the host.
Both are valid. 
Linux adopts the latter, which turns out to work really well in most situations. 
LVS is a rare case where it doesn't. 
This has been painful in the past, 
but since <filename>arp_ignore</filename> and <filename>arp_announce</filename> were added, 
it's quite easy now.
	</para>
	<para>
The following list of cures is a little confusing. 
If you're not using routing to stop packets for the VIP arriving at the realservers,
then you'll have to stop the realservers replying to arp requests for the VIP.
In this case you'll do one of the following on the realservers
(Mar 2005, with help from Horms)
	</para>
	<itemizedlist>
		<listitem>
			<para>
the original method: 
use Julian's <filename>hidden</filename> patch on the realserver. 
You set the VIP on lo and then "hide" it. 
This method has been around since the arp problem first arose
and has been well tested.
For more on the <link linkend="hidden">hidden patch</link> see <ulink url="http://www.ssi.bg/~ja/hidden.txt">Julian's page</ulink>.
This code is still being maintained, so if your setup scripts
are for the hidden patch, you can continue to use it. 
Otherwise for new installations, you should use arp_ignore/arp_announce.
			</para>
		</listitem>
		<listitem>
			<para>
the next method: Maurizio's <link linkend="sartori">noarp module</link>. 
This has the advantage that it does not require any patching of the realserver's kernel, 
is simple to setup and is the preferred method for many people. 
It has another advantage that you control the arp behaviour for the IP 
and not for the device.
			</para>
		</listitem>
		<listitem>
			<para>
the new way <link linkend="2_4_arp_announce">arp_announce</link>: see
<ulink url="http://www.ssi.bg/~ja/#arp_announce">arp_announce</ulink>
(http://www.ssi.bg/~ja/#arp_announce) 
which sets <filename>arp_ignore</filename> and <filename>arp_announce</filename> 
on the arping interfaces.
This typically means eth0, but if you have eth1 as well, you need to set it there too.
(If you have multiple NICs; eth0..ethn, you only need fix the NIC that hears the arp requests.)
Setting these parameters on <filename>lo</filename> 
has no effect as far as I understand from testing, 
reading the code and reading correspondence from Julian,
<emphasis>i.e.</emphasis> you <emphasis role="bold">aren't</emphasis> 
interested in these settings. 
			</para>
			<note>
Make sure you don't bring up the ethernet device (say at bootup) 
before arp_ignore/arp_announce have been setup, 
or you will get a round of arp broadcasts from the NIC.
			</note>
<programlisting><![CDATA[
# ipvs settings for realservers:
# set these on the interface that hears the arp requests (here eth0)
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth0.arp_announce = 2
]]></programlisting>
			<para>
Horms
			</para>
			<para>
If the VIP is on eth0, and you don't want it advertised over ARP on
eth1, then set:
			</para>
<programlisting><![CDATA[
net.ipv4.conf.eth1.arp_ignore = 1
net.ipv4.conf.eth1.arp_announce = 2
]]></programlisting>
			<para>
This is different to the hidden approach where you put the VIP on lo 
and then hide lo.
			</para>
			<para>
The <filename>arp_ignore</filename> approach has 
<link linkend="ratz_arp_announce">theoretical and aesthetic advantages</link>.
			</para>
		</listitem>
	</itemizedlist>
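	<para>
For the arp_ignore/arp_announce method, the same settings can be made at run
time with <command>sysctl</command>. A minimal sketch, assuming that eth0 is the
interface that hears the arp requests and using a made-up VIP; the function name
<command>setup_realserver</command> is hypothetical and with DRY_RUN set the
commands are only printed. The sysctls are set before the VIP is added to lo, so
that the realserver never answers arp requests for the VIP.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# setup_realserver VIP ARP_DEV: set the arp sysctls on ARP_DEV, then add the VIP to lo.
# With DRY_RUN set, the commands are printed instead of being run.
setup_realserver() {
    ${DRY_RUN:+echo} sysctl -w "net.ipv4.conf.$2.arp_ignore=1"
    ${DRY_RUN:+echo} sysctl -w "net.ipv4.conf.$2.arp_announce=2"
    ${DRY_RUN:+echo} ip addr add "$1/32" dev lo
}

# hypothetical VIP; eth0 is assumed to be the arping interface
DRY_RUN=1 setup_realserver 192.168.1.110 eth0
]]></programlisting>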
	</section>
	<section id="2.0_arp">
	<title>The Cure: 2.0 kernels - nothing needed</title>
	<para>
	There is no arp problem with 2.0 kernels on the realservers.
On the realservers, configure the VIP on the lo device
with the <command>-arp</command> option as you would with any other Unix.
	</para>
	</section>
	<section id="2.2_arp">
	<title>The Cure: 2.2.x kernels - many options</title>
	<para>
The preferred method is "hidden"
	</para>
		<section id="hidden" xreflabel="hidden">
		<title>The hidden patches</title>
		<para>
The &quot;hidden&quot; patches for kernel &gt;=2.2.14
are now in the standard linux distribution
(<emphasis>i.e.</emphasis> you can use the &quot;hidden&quot;
feature with a standard kernel and
don't have to patch the kernel on the realserver).
The arp patches allow you to hide a device from arp requests,
allowing the realserver to function in an LVS.
		</para>
		<note>
The hidden patch hides the device (here the lo) (and any IPs that are on it).
The <command>-arp</command> flag in 2.0 kernels
affects only the ip_alias (and not other IPs on the same device).
These are different methods, but both stop the
router/client from getting arp replies from the realserver
for the VIP.
		</note>
		<para>
To hide devices from arp calls, on the realservers do
		</para>
<programlisting><![CDATA[
       #to activate the hidden feature
       echo 1 > /proc/sys/net/ipv4/conf/all/hidden
       #to make lo:0 not arp, put lo here
       echo 1 > /proc/sys/net/ipv4/conf/<interface_name>/hidden
]]></programlisting>
		<para>
then test that you've hidden the VIP (<xref linkend="testing_for_arp"/>).
		</para>
		<para>
There is a possible race condition in hiding the VIP -
		</para>
		<para>
Kyle Sparger, 15 Feb 2001
		</para>
		<blockquote>
			<para>
I've found an interesting, but not totally unexpected race condition
under DR in 2.2.x that I've managed to create when installing VIP's on a
machine in DR mode.
Basically, the cause is this:
			</para>
<programlisting><![CDATA[
ifconfig dummy0 10.0.1.15
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
]]></programlisting>
			<para>
You'll notice that there's going to be a small gap between the two which
allows an ARP request to come in, and for the server to reply.  And yes,
it is big enough to be bitten by -- I've been bitten twice by it so far :)
			</para>
		</blockquote>
		<para>
Julian
		</para>
		<para>
On boot:
		</para>
<programlisting><![CDATA[
echo 1 > /proc/sys/net/ipv4/conf/all/hidden
# For each hidden interface (dummy, lo, tunl):
modprobe dummy0
ifconfig dummy0 0.0.0.0 up
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
# Now set any other IP address
]]></programlisting>

		<para>
Kyle's suggestion
		</para>
		<blockquote>
<programlisting><![CDATA[
echo 1 > /proc/sys/net/ipv4/conf/default/hidden
ifconfig dummy0 10.0.1.15
echo 0 > /proc/sys/net/ipv4/conf/default/hidden
]]></programlisting>
		<para>
The echo 0 command is in case I want to configure other
interfaces later that I _do_ want responding to ARP requests.
Technically, it's not necessary, I just find it useful in my particular setup.
		</para>
		</blockquote>

		</section>
		<section id="older_2.2">
		<title>The Cure: Older 2.2 kernels (&lt;2.2.12)</title>
		<para>
These are old and it would be better to upgrade (you won't get much
help on the mailing list for these).
However if you have them,
you apply the arp patches to the kernel code of the 2.2.x realservers.
These patches are separate from the ipvs patch applied to the kernel on
the director.
		</para>
		<para>
For kernels &lt;2.2.12, Julian's patch is on the lvs website.
		</para>
		<para>
http://www.linuxvirtualserver.org/arp_invisible-2213-2.diff
		</para>
		<para>
The patch by Stephen Williams is at
		</para>
		<para>
http://www.linuxvirtualserver.org/sdw_fullarpfix.patch
		</para>
		<para>
This patch is against a 2.2.5 kernel but can be applied to later kernels
(tested to 2.2.13). The file appears to have DOS carriage control.
Depending what you get on your disk, you may have to convert the file
to unix carriage control (with `tr -d '\015'`) (the unix line extension
of '\' doesn't work in combination with DOS carriage control).
		</para>
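		<para>
A minimal sketch of the conversion. The function name
<command>strip_cr</command> and the name of the output file are made up for the
illustration.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# strip_cr IN OUT: copy IN to OUT with carriage returns (octal 015) removed.
strip_cr() {
    tr -d '\015' < "$1" > "$2"
}

# usage:
#   strip_cr sdw_fullarpfix.patch sdw_fullarpfix.unix.patch
]]></programlisting>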
		<para>
The whitespace may not match your file so do
		</para>
<programlisting><![CDATA[
$ cd /usr/src/linux
$ patch -p1 -l < sdw_fullarpfix.patch
]]></programlisting>
		<para>
If you are using <xref linkend="martian_modification"/> you will
need the forward_shared-hidden patch as well (needed only on the director,
but can be applied to both director and realservers).
		</para>
		</section>
		<section id="extra_nic">
		<title>Put an extra NIC (eth1) on the realserver to carry the VIP</title>
		<para>
Possible cards would be a discarded ISA card (WD80x3), or a cheap 100Mbit PCI
card (eg Netgear FA310TX, $16 in USA in Nov 99). There is no traffic going
through this NIC and it doesn't matter that it's an old slow card. The extra
card is only required so that the realserver can have the VIP on the machine.
With 2.2.x kernels you can't stop this device (eth1) from replying to arp
requests, but if you don't connect the cable to it or don't put a route to
it in the realserver's routing table, then the client won't be able to send
it an arp request.
		</para>
		<para>
To set this up with the configure script,
enter eth1 as the device for the VIP on the realserver.
		</para>
		<note>
			<para>
Apparently, the 2nd NIC doesn't handle the arp problem for 2.4 kernels.
			</para>
			<para>
I tested the 2 NIC method of handling the arp problem with kernel 2.2.13.
I haven't tried it with 2.4 kernels, but apparently it doesn't work.
Julian and Ratz think it shouldn't work with 2.2.x kernels, but I haven't
revisited the matter to see why we have come to different conclusions.
			</para>
		</note>
		</section>
	</section>
	<section id="2.4_arp">
	<title>The Cure: 2.4.x kernels - arp_ignore/arp_announce</title>
	<para>
The currently accepted method (for kernels 2.4.26 and later) is
<link linkend="2_4_arp_announce">arp_ignore/arp_announce</link>.
	</para>
	<para>
There are several ways of handling the arp problem for 2.4.x kernels.
They all work, but some of them have been around longer and so have
been used more and people on the mailing list are more familiar with
them.
	</para>
		<section id="2.4_hidden">
		<title>2.4 Hidden Patch</title>
		<para>
Julian's hidden patch has been around the longest and is well tested.
Although included in the standard 2.2.x kernel, it is not being included in
the 2.4.x kernels. You'll have to patch the kernel on the realservers.
The preferred method for new installations is 
<link linkend="2_4_arp_announce">arp_ignore/arp_announce</link>.
		</para>
		<para>
For early 2.4.x kernels (eg x=0), the patch is available at
http://www.linuxvirtualserver.org/hidden-2.3.41-1.diff.
(This patches a part of the kernel that isn't being actively fiddled with,
so hopefully the patch will work against later 2.4.x kernels too.)
		</para>
		<para>
The 2.4.x &quot;hidden&quot; patch is 
included in ipvs-x.x.x/contrib/patches/hidden-x.x.x.diff
		</para>
		<para>
Assuming you are patching 2.4.2 with the ipvs-0.2.5 files
		</para>
<programlisting><![CDATA[
cd /usr/src/linux
patch -p1 <../ipvs-0.2.5/contrib/patches/hidden-2.4.2-1.diff
]]></programlisting>
		<para>
Then build the kernel (can use same options as for the 2.4 director kernel build).
		</para>
		<para>
You activate the hidden feature as for 2.2 (see <link linkend="hidden">hidden</link>).
		</para>
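		<para>
A minimal sketch of the activation step, assuming the VIP will live on
<filename>lo</filename> and that the patched kernel exposes the
<filename>hidden</filename> flag under <filename>/proc/sys/net/ipv4/conf</filename>.
The function name and the <filename>CONF_ROOT</filename> override are
made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# CONF_ROOT is a hypothetical override so the function can be tried against
# a scratch directory; on a real realserver leave it at the default.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# hide_device DEV: turn on the hidden flag globally and for DEV (e.g. lo).
# Returns 1 without changing anything if the flag files are missing,
# which usually means the hidden patch has not been applied.
hide_device() {
    dev=$1
    if [ ! -e "$CONF_ROOT/all/hidden" ] || [ ! -e "$CONF_ROOT/$dev/hidden" ]; then
        echo "no hidden flag for $dev: is the hidden patch applied?" >&2
        return 1
    fi
    echo 1 > "$CONF_ROOT/all/hidden"
    echo 1 > "$CONF_ROOT/$dev/hidden"
}

# Usage (as root, before the VIP is configured on lo):
#   hide_device lo
]]></programlisting>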
		<para>
As to why the hidden patch is in the 2.2 kernels but not the 2.4 kernels, see
<ulink url="http://marc.theaimsgroup.com/?l=linux-kernel&amp;m=98032243112274&amp;w=2">
the mailing list archives</ulink> or
<ulink url="http://marc.theaimsgroup.com/?t=98019795800013&amp;w=2&amp;r=1">the thread</ulink>.
		</para>
		</section>
		<section id="2_4_arp_announce">
		<title>2.4 arp_announce</title>
		<para>
The 2.6.x <link linkend="2_6_arp_announce">arp_announce, arp_ignore</link>
code has been backported to 2.4.26 (and later) kernels.
		</para>
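		<para>
A quick way to see which of these flags a realserver kernel actually has
is to look for them under <filename>/proc</filename>.
This is only a sketch; the function name and the
<filename>CONF_ROOT</filename> override are made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# CONF_ROOT is a hypothetical override for trying this out of place.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# arp_method: print which ARP-hiding flags this kernel exposes:
# "arp_ignore/arp_announce", "hidden", or "none".
arp_method() {
    if [ -e "$CONF_ROOT/all/arp_ignore" ] && [ -e "$CONF_ROOT/all/arp_announce" ]; then
        echo "arp_ignore/arp_announce"
    elif [ -e "$CONF_ROOT/all/hidden" ]; then
        echo "hidden"
    else
        echo "none"
    fi
}

# Usage:
#   arp_method
]]></programlisting>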
		</section>
		<section id="arp_filtering">
		<title>arp filtering</title>
		<para>
Julian has written an extension to the iproute2 tools,
which filters arp packets.
You can use this to handle the arp problem.
See
<ulink url="http://www.ssi.bg/~ja/iparp.txt">Julian's software page</ulink>
for more details.
This method does not require patching of the 2.4 kernel on the realserver.
		</para>
		<note>
Julian's arp filtering is not <xref linkend="arptables"/>.
		</note>
		<para>
Joe
		</para>
		<blockquote>
Is <filename>arptables</filename> the extension to <filename>iptables</filename> 
that you wrote a while ago?
<filename>arptables</filename> seems pretty simple. 
What are the problems with <filename>arptables</filename> 
that you've written arp_ignore and keep maintaining the hidden patch?
		</blockquote>
		<para>
Julian 11 Jul 2004 
		</para>
		<para>
	Almost true, I'm not the <filename>arptables</filename> author.
You're referring to the arprules/iparp functionality which is
based on <command>ip</command>, not on <filename>iptables</filename>. 
Similar names.
		</para>
		<para>
At that time there was no user space tool for the arptables
changes in the kernel (done by David Miller). Now there is such a tool (I haven't
tried it), so the list of options for hiding addresses in clusters has been
extended.
		</para>
		<para>
	arp_ignore was born day(s) after arp_announce. Both flags are
an easy way to set a default policy for ARP requests and replies,
something that was needed for years for things like interoperability with
other ARP stacks (mostly for controlling the source address selection
in ARP requests with arp_announce) or for hiding addresses in
cluster setups.
		</para>
		</section>
		<section id="sartori">
		<title>Maurizio Sartori's noarp module</title>
		<para>
Maurizio Sartori <emphasis>masar (at) masarlabs (dot) com</emphasis> 28 Nov 2002
		</para>
		<para>
On <ulink url="http://www.masarlabs.com">my site</ulink>
is a simple kernel module for Linux 2.4.x to solve the ARP Problem.
You don't have to patch the kernel; you only have to
compile, install and configure the 'noarp' module,
which lets you use your loopback interface while filtering its arp
replies.
I've tested it on Debian 'Sarge' and RedHat 8.
		</para>
		<note>
Maurizio later produced a patch for 2.6.
		</note>
		<para>
Sebastien Bonnet <emphasis>Sebastien (dot) Bonnet (at) experian (dot) fr</emphasis>
04 Jun 2003
		</para>
		<para>
Nobody seems to recall what a smart Italian guy named Maurizio Sartori did.
Instead of the hidden patch, which requires a full kernel build,
he's written a *module* called noarp, way more handy, as
		</para>
		<orderedlist>
			<listitem>
it requires only a single module build, not a kernel
build, and takes about 1 minute to install and get working.
			</listitem>
			<listitem>
it allows hiding IPs, not interfaces.
			</listitem>
		</orderedlist>
		<para>
I'm using it in production and it works perfectly.
		</para>
		<para>
Joe
		</para>
		<blockquote>
Can you hide the VIP on eth0:x and not hide the RIP on eth0?
(I should know this, but I don't)
		</blockquote>
		<para>
Jan Abraham <emphasis>jan_abraham (at) gmx (dot) net</emphasis>
31 Oct 2003
		</para>
		<para>
Yes, you can :)
I used Maurizio Sartori's noarp module, suggested in your HOWTO in
chapter 4.5.3. It can be controlled by IP, not by interface.
		</para>
		<para id="ratz_arp_announce">
Ratz 17 Dec 2004
		</para>
		<para>
Julian's arp_ignore is the way to go, portable and ready for upgrades ;). 
Nothing against Maurizio by any means, but after years of fighting with the netdevs, Julian 
finally convinced the high priest of Linux networking to solve the arp problem 
once and for eternity. If you read the accompanying documentation on the arp_* 
sysctls you can pretty much figure out that nothing is impossible anymore ;).
		</para>
		<para>
Joe - I would have been happy if they'd left the arp behaviour as it was originally
and as it is in all the other unices (except HPUX).
		</para>
		<para>
Todd Lyons <emphasis>tlyons (at) ivenue (dot) com</emphasis> 03 Feb 2005 
		</para>
		<blockquote>
			<para>
Hi guys (and masar), I am trying to use your noarp module, but am
hitting the limit of 16 entries.  I need it to work for (at the moment)
an additional 10 entries. I see in <filename>noarp.h</filename>:
			</para>
<programlisting><![CDATA[
#define NOARP_MAX_IP            (16)
]]></programlisting>
			<para>
Is it going to create problems to bump this number up to 32 or 64?  
I've already done it and it seems to handle the problem. I
don't want to create any memory leaks or overruns.  Your code looks like
it allocates memory based on that NOARP_MAX_IP, but my C is not good
enough to know for sure if that will be a problem.  Here's what happens
on my system (RH 7.3 with 2.4.20-28.7smp kernel).  You can see that it's
failing on the 10 additional IPs after the initial 16.  Please let me
know if I can safely raise that number.
			</para>
<programlisting><![CDATA[
[root@rproxy1a init.d]# /etc/init.d/noarp start
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
[root@rproxy1a init.d]# /etc/init.d/noarp status
64.14.201.41 10.10.10.160 0 0 0
64.14.201.151 10.10.10.160 0 0 0
64.14.201.161 10.10.10.160 0 0 0
64.14.201.162 10.10.10.160 0 0 0
64.14.201.163 10.10.10.160 0 0 0
64.14.201.164 10.10.10.160 0 0 0
64.14.201.165 10.10.10.160 0 0 0
64.14.201.166 10.10.10.160 0 0 0
64.14.201.167 10.10.10.160 0 0 0
64.14.201.168 10.10.10.160 0 0 0
64.14.201.169 10.10.10.160 0 0 0
64.14.201.175 10.10.10.160 0 0 0
64.14.201.153 10.10.10.160 0 0 0
64.14.201.178 10.10.10.160 0 0 0
64.14.201.170 10.10.10.160 0 0 0
64.14.201.171 10.10.10.160 0 0 0

[root@rproxy1a network-scripts]# ls ifcfg-lo:*
ifcfg-lo:0   ifcfg-lo:13  ifcfg-lo:18  ifcfg-lo:22  ifcfg-lo:4 ifcfg-lo:9
ifcfg-lo:1   ifcfg-lo:14  ifcfg-lo:19  ifcfg-lo:23  ifcfg-lo:5
ifcfg-lo:10  ifcfg-lo:15  ifcfg-lo:2   ifcfg-lo:24  ifcfg-lo:6
ifcfg-lo:11  ifcfg-lo:16  ifcfg-lo:20  ifcfg-lo:25  ifcfg-lo:7
ifcfg-lo:12  ifcfg-lo:17  ifcfg-lo:21  ifcfg-lo:3   ifcfg-lo:8
]]></programlisting>
			<para>
I have a question about the man page 
			</para>
<programlisting><![CDATA[
NOARPCTL COMMANDS
       add    Adds a new Virtual IP to  the  list.  Requires  two
              arguments: the VIP is the address to hide, the Real
              IP (RIP) is a real address of this host to use when
              ARP query are made that would use VIP.
]]></programlisting>
			<para>
I must be misunderstanding something very basic.  I thought you didn't
want real servers to arp at all for VIPs, no matter what interface the
arp comes in on and no matter what interface is defined with the
matching address.  The only acceptable arp answer is for the RIP
(implying local traffic or traffic that is not desired to be load
balanced).  But the above man page contradicts my ideas.  So I'm a bit
confused as to how exactly noarp is working.
			</para>
		</blockquote>
		<para>
Maurizio Sartori <emphasis>masar (at) MasarLabs (dot) com</emphasis> 04 Feb 2005
		</para>
		<para>
There should be no problem incrementing NOARP_MAX_IP;
all memory is allocated statically.
The RIP in the 'add' command of noarpctl is
used to suppress the selection of the VIP as the sender 
IP address in arp requests. 
If this is not suppressed, an arp request from the back-end host updates
all arp cache entries on the local net for the VIP with the MAC of
the back-end host.
		</para>
		<para>
A way to generate a request of this type is, from a real server:
		</para>
<programlisting><![CDATA[
#> nc -s $VIP somehost 80
]]></programlisting>
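		<para>
Putting the two posts together, registering a list of VIPs might look like
the following sketch. It uses only the <command>noarpctl add VIP RIP</command>
form quoted from the man page above; the function name, the
<filename>NOARPCTL</filename> override and the way the limit is checked
are made up for this example, and <filename>NOARP_MAX_IP</filename> here must be
kept in step by hand with the value compiled into the module.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# NOARPCTL lets you substitute "echo noarpctl" for a dry run.
# NOARP_MAX_IP should match the value in noarp.h (16 as shipped).
NOARPCTL=${NOARPCTL:-/usr/local/sbin/noarpctl}
NOARP_MAX_IP=${NOARP_MAX_IP:-16}

# hide_vips RIP VIP...: register each VIP with noarp, giving RIP as the
# address to use in its place. Refuses to start if the list would overflow
# the module's table, instead of failing part way through.
hide_vips() {
    rip=$1
    shift
    if [ "$#" -gt "$NOARP_MAX_IP" ]; then
        echo "$# VIPs, but noarp was built for at most $NOARP_MAX_IP" >&2
        return 1
    fi
    for vip in "$@"; do
        $NOARPCTL add "$vip" "$rip" || return 1
    done
}

# Usage:
#   hide_vips 10.10.10.160 64.14.201.41 64.14.201.151
]]></programlisting>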
		</section>
		<section id="2.4_2NIC">
		<title>extra NIC doesn't solve arp problem for 2.4 kernel realservers</title>
		<para>
Jean Paul Piccato <emphasis>j (dot) piccato (at) studenti (dot) to (dot) it</emphasis>
		</para>
		<blockquote>
I'm setting up a DR_LVS with a director and two servers...
I've to handle the ARP problem so I've put two NIC on the two
realservers...
		</blockquote>
		<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 16 Jan 2002
		</para>
		<para>
This works maybe only with Linux 2.0.
(Joe: see <link linkend="extra_nic">2.2 kernels with extra NIC</link>).
For 2.2+ you need <ulink url="http://www.ssi.bg/~ja/#hidden">
a specific kind of ARP control</ulink>.
In Linux 2.2+ the operation of adding an IP address involves
the following 2 steps:
		</para>
		<orderedlist>
			<listitem>
Define a local IP address as a host property - remote hosts can
talk to it through any device
			</listitem>
			<listitem>
Define network link route on the specified device - you can talk
with other hosts from this local network only through this device
			</listitem>
		</orderedlist>
		<para>
(1) allows the Linux 2.2+ box to send ARP replies
through any device that received the request. Additionally,
the user can provide some filtering by setting some device
specific values:
		</para>

<programlisting><![CDATA[
/proc/sys/net/ipv4/conf/*/<FLAG>
]]></programlisting>
		<para>
These are explained in /usr/src/linux/Documentation/networking/ip-sysctl.txt
		</para>
		<para>
The LVS setups depend mostly on the FLAGs rp_filter, hidden, arp_filter,
send_redirects.
(for more info on kernel flags see the section on
<xref linkend="proc_filesystem"/>).
If you have problems, check these flags, after learning what they
mean and how they can kill your setup.
		</para>
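		<para>
A small sketch for checking these flags on every device at once.
The function name and the <filename>CONF_ROOT</filename> override are
made up for this example; the flag names are the ones listed above.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# CONF_ROOT is a hypothetical override for trying this out of place.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# show_lvs_flags: print "device flag=value" for the flags LVS setups depend
# on. Flags this kernel does not have (e.g. hidden on an unpatched kernel)
# are silently skipped.
show_lvs_flags() {
    for dir in "$CONF_ROOT"/*/; do
        dev=$(basename "$dir")
        for flag in rp_filter hidden arp_filter send_redirects; do
            if [ -r "$dir$flag" ]; then
                echo "$dev $flag=$(cat "$dir$flag")"
            fi
        done
    done
}

# Usage:
#   show_lvs_flags
]]></programlisting>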
		<para>
By setting rp_filter or arp_filter on a device you can
ignore ARP requests (and the traffic too, if rp_filter is set)
coming from addresses to which we don't have a route
through that device.
		</para>
		<para>
The send_redirects values must be checked for setups playing
with NAT on one physical medium.
		</para>
		<para>
Information on using the hidden patch is in hidden.txt
		</para>
		<blockquote>
It seems that eth0 reply to the server instead of eth1
		</blockquote>
		<para>
Any device can reply if the ARP probe is not filtered.
See hidden.txt from the above URL
		</para>
		<para>
Michael McConnell <emphasis>michaelm (at) eyeball (dot) com</emphasis> 10 Jun 2002
		</para>
		<blockquote>
			<para>
I currently have a system which has a Tyan 2515 Motherboard. This
motherboard features a Dual Intel 82559 NIC.
			</para>
			<para>
The problem I am facing is that when using both ports of this dual
interface network card (plugged into the same switch) I find that the
second interface is answering arp requests (on rare occasions) that the
first interface should be answering.
I have used tcpdump and clearly seen eth1 answering arp requests that
eth0 should be answering... how odd.... It's rare, but when it happens
of course that address is offline. (Note: this only seems to happen on
alias IP addresses, it has never happened on the primary interface)
			</para>
			<para>
I am using the open source drivers provided with the 2.2.19 kernel, I'm
wondering if the drivers provided by Intel would help this problem?
			</para>
		</blockquote>
		<para>
Roberto Nibali <emphasis>ratz (at) tac (dot) ch</emphasis> 11 Jun 2002
		</para>
		<para>
The drivers indeed can't make a difference, not because they are the same, 
but because the driver doesn't have anything to do with the arp/routing issue.
		</para>
		<para>
Julian
		</para>
		<para>
Classic problem of attaching multiple Linux interfaces to a
shared medium. You can set arp_filter on all your ARP devices, or
even restrict the IP protocol as well by setting rp_filter.
		</para>
		<para>
	Such answering (of arp requests) can never be a problem. If the Linux box
answers via many interfaces then it is willing to accept traffic
through these ifaces. Of course, the failover achieved by attaching
two interfaces to the same hub is not perfect, because the remote LAN boxes
will use the live Linux interface but Linux routing still uses the first
interface (even if it has failed at Layer 2) for the subnet.
If your goal is to restrict the talks for each subnet to one
interface then you have to use arp_filter=1, but still use rp_filter=0
to allow cross-subnet talks. One day rp_filter will be aware of the
medium_id values for each interface and will allow the Linux box
to interconnect multiple hubs securely (and still use many interfaces
to these hubs).
		</para>
		<para>
	By default Linux replies to ARP probes for any local
IP address configured on any device no matter on what device the
probe is received. Such probes look like "who-has TARGET tell SENDER".
If the probe is answered later we can receive IP traffic from
SENDER to TARGET destined to the TARGET's MAC address.
		</para>
		<para>
	When we have different subnets (network routes) configured
on multiple interfaces attached to the same hub, we sometimes prefer (maybe
the reader can find a good reason for this) the traffic to/from one
subnet to always use one interface. In such cases replying through
many interfaces is not desired. We have 2 options:
		</para>
		<para>
arp_filter:
		</para>
		<blockquote>
			<para>
when
			</para>
<programlisting><![CDATA[
/proc/sys/net/ipv4/conf/DEV/arp_filter is set to 1
or
/proc/sys/net/ipv4/conf/all/arp_filter is set to 1
]]></programlisting>
			<para>
then the flag will cause any probe received on interface
DEV to be dropped if the route from TARGET to SENDER points
to a different interface. With the usual local networks in
table main in the form "from 0/0 to local_net lookup main"
we see that the TARGET is ignored. As a result, we drop
probes received from SENDER that come in on the wrong
interface. So if the route from TARGET to
SENDER1 is via DEV1 and from TARGET to SENDER2 is
via DEV2, then we will reply only through one device
for each of the senders. Of course, arp_filter
relies on the routing, and so the behavior
depends on the ip rules and routes in use. The above
is a simple example for normal local networks.
arp_filter simply checks the route for the reversed
addresses. It should point to the input device.
			</para>
		</blockquote>
		<para>
rp_filter:
		</para>
		<blockquote>
	The rp_filter flag (DEV/rp_filter or all/rp_filter)
	set to 1 has similar semantics. It has nearly the same
	function as arp_filter and can control ARP for
	the same purpose: symmetric talks (in and out using the
	same device), but it covers the IP traffic too. It is
	assumed that where ARP is received (replied to, more
	exactly) the IP traffic will be accepted too.
	It is mostly a security function and can defend
	against IP spoofing. It controls the reverse path
	protection: we accept traffic from SENDER to TARGET
	received on DEV only when the reverse path (from
	TARGET to SENDER) points to the input interface
	DEV. It is usually used for "external" interfaces.
		</blockquote>
		<para>
	How you can use it:
		</para>
<programlisting><![CDATA[
ifconfig eth0 192.168.1.2
ifconfig eth1 192.168.2.2

echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_filter
echo 1 > /proc/sys/net/ipv4/conf/eth1/arp_filter
]]></programlisting>
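		<para>
Julian's advice for keeping each subnet on one interface
(arp_filter=1, with rp_filter left at 0 so cross-subnet talks still work)
could be applied to several devices with something like this sketch.
The function name and the <filename>CONF_ROOT</filename> override are
made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# CONF_ROOT is a hypothetical override for trying this out of place.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# one_subnet_per_device DEV...: for each DEV, answer ARP only when the
# reverse route points back at DEV (arp_filter=1), while leaving reverse
# path filtering of IP traffic off (rp_filter=0).
one_subnet_per_device() {
    for dev in "$@"; do
        echo 1 > "$CONF_ROOT/$dev/arp_filter" || return 1
        echo 0 > "$CONF_ROOT/$dev/rp_filter" || return 1
    done
}

# Usage (as root):
#   one_subnet_per_device eth0 eth1
]]></programlisting>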
		</section>
		<section id="different_network">
		<title>Put the realservers on a different network to the VIP</title>
		<para>
Setup routing tables so that the client cannot route
to the realserver network <link linkend="Lars_method">(Lars' method)</link>.
This method requires the director to not forward
packets for the VIP (easy to implement if 2 NICS on the director).
The reply packets from the realservers return to the client
via a different router to the one attached to the director.
Thus the director's router cannot send arp requests to the
realservers.
		</para>
		</section>
		<section id="ethers">
		<title>
On the client(router), route packets with dst_addr=VIP to the director
		</title>
		<para>
You can hardwire the MAC address of the director
as the MAC address of the VIP. You can do this with
		</para>
<programlisting><![CDATA[
# arp -s lvs.mack.net 00:80:C8:CA:A7:E4

or

# arp -f /etc/ethers
]]></programlisting>
		<para>
Here is my /etc/ethers file (on the client)
		</para>
<programlisting><![CDATA[
lvs.mack.net 00:80:C8:CA:A7:E4
]]></programlisting>
		<para>
This requires no extra NICs or patching of realservers. However in a production
environment, redundant directors with heartbeat/failover may be required and
some method (eg running send-arp) will be needed to change the static arp entry
as the failover occurs. If multiple NICs are involved, it is possible that
the above instruction will result in a route through the wrong NIC. In this
case bring up the NIC of interest first and then run the above command.
		</para>
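		<para>
A sketch of a helper that a failover script on the client/router could call
to set the static entry. The function name and the
<filename>ARP</filename> override are made up for this example, and it
assumes your <command>arp -s</command> replaces an existing static entry
(if it doesn't, delete the old entry with <command>arp -d</command> first).
		</para>
<programlisting><![CDATA[
#!/bin/sh
# ARP is overridable for this sketch only (e.g. ARP="echo arp" for a dry run).
ARP=${ARP:-arp}

# pin_vip VIP MAC: install a static ARP entry mapping VIP to the MAC of the
# currently active director. Rejects anything not shaped like a MAC address.
pin_vip() {
    vip=$1
    mac=$2
    if ! echo "$mac" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'; then
        echo "not a MAC address: $mac" >&2
        return 1
    fi
    $ARP -s "$vip" "$mac"
}

# Usage (as root on the client/router):
#   pin_vip lvs.mack.net 00:80:C8:CA:A7:E4
]]></programlisting>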
		<para>
Alternately if the router has several NICs, use one for the director and
another for the realservers. Route the VIP to the director.
		</para>
		</section>
		<section id="horms_method" xreflabel="Horms method">
		<title>Use transparent proxy to allow the incoming packet to be accepted locally - Horms method</title>
		<para>
See the sections on <xref linkend="LVS-HOWTO.transparent_proxy"/>,
and its setup for LVS-DR and LVS-Tun.
The configure script will set this up for you.
		</para>
		</section>
	</section>
	<section id="2.6_arp">
	<title>The Cure: 2.6.x kernels - arp_ignore/arp_announce</title>
		<section id="2_6_arp_announce" xreflabel="2.6 arp announce">
		<title>2.6 arp_announce</title>
	        <para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 25 Feb 2004
       		</para>
		<blockquote>
2.4.26 and 2.6.4 come with 2 new device flags for tuning the ARP stack:
<filename>arp_announce</filename> and <filename>arp_ignore</filename>.
All IPVS-like setups can use arp_announce=2 and arp_ignore=1/2/3 
to solve the "ARP problem" on realservers with DR/TUN setups. 
These flags are going to replace the "hidden"
functionality which does not work well when directors are
changing role between master/slave for a particular VIP.
The risk is that other hosts can probe for VIP using unicast packets
for which the hidden flag always replies. I'll continue to support the
hidden flag for 2.4 and 2.6 to help existing setups but switching
to the new device flags (or other solutions) is recommended.
		</blockquote>
		<para>
Documentation is in the
<ulink url="file:/usr/src/linux/Documentation/networking/ip-sysctl.txt">
2.6 kernel docs</ulink>
(linux/Documentation/networking/ip-sysctl.txt) (here from the 2.6.17 kernel).
		</para>
<programlisting><![CDATA[
arp_announce - INTEGER
	Define different restriction levels for announcing the local
	source IP address from IP packets in ARP requests sent on
	interface:
	0 - (default) Use any local address, configured on any interface
	1 - Try to avoid local addresses that are not in the target's
	subnet for this interface. This mode is useful when target
	hosts reachable via this interface require the source IP
	address in ARP requests to be part of their logical network
	configured on the receiving interface. When we generate the
	request we will check all our subnets that include the
	target IP and will preserve the source address if it is from
	such subnet. If there is no such subnet we select source
	address according to the rules for level 2.
	2 - Always use the best local address for this target.
	In this mode we ignore the source address in the IP packet
	and try to select local address that we prefer for talks with
	the target host. Such local address is selected by looking
	for primary IP addresses on all our subnets on the outgoing
	interface that include the target IP address. If no suitable
	local address is found we select the first local address
	we have on the outgoing interface or on all other interfaces,
	with the hope we will receive reply for our request and
	even sometimes no matter the source IP address we announce.

	The max value from conf/{all,interface}/arp_announce is used.

	Increasing the restriction level gives more chance for
	receiving answer from the resolved target while decreasing
	the level announces more valid sender's information.

arp_ignore - INTEGER
	Define different modes for sending replies in response to
	received ARP requests that resolve local target IP addresses:
	0 - (default): reply for any local target IP address, configured
	on any interface
	1 - reply only if the target IP address is local address
	configured on the incoming interface
	2 - reply only if the target IP address is local address
	configured on the incoming interface and both with the
	sender's IP address are part from same subnet on this interface
	3 - do not reply for local addresses configured with scope host,
	only resolutions for global and link addresses are replied
	4-7 - reserved
	8 - do not reply for all local addresses

	The max value from conf/{all,interface}/arp_ignore is used
	when ARP request is received on the {interface}
]]></programlisting>
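		<para>
Because the kernel uses the larger of the <filename>all</filename> and
per-interface values, the number you see in one file isn't necessarily
the one in effect. Here's a sketch that computes the effective value;
the function name and the <filename>CONF_ROOT</filename> override are
made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# CONF_ROOT is a hypothetical override for trying this out of place.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# effective_arp_flag DEV FLAG: print the value the kernel acts on for FLAG
# (arp_ignore or arp_announce) on DEV, i.e. the larger of conf/all/FLAG
# and conf/DEV/FLAG.
effective_arp_flag() {
    dev=$1
    flag=$2
    all_val=$(cat "$CONF_ROOT/all/$flag") || return 1
    dev_val=$(cat "$CONF_ROOT/$dev/$flag") || return 1
    if [ "$all_val" -gt "$dev_val" ]; then
        echo "$all_val"
    else
        echo "$dev_val"
    fi
}

# Usage:
#   effective_arp_flag eth0 arp_ignore
]]></programlisting>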
		<para>
On the realservers the VIP will still be on <filename>lo</filename> (as for the hidden method).
If the reply packets to the client are routed through <filename>eth0</filename>, 
then the arp announcements/requests are made through <filename>eth0</filename> and you
will apply the <filename>arp_ignore</filename>/<filename>arp_announce</filename> sysctls to 
<filename>eth0</filename>, 
not to <filename>lo</filename> 
(you cannot use <filename>arp_ignore</filename>/<filename>arp_announce</filename> on <filename>lo</filename>).
		</para>
<programlisting><![CDATA[
# /etc/sysctl.conf
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2
]]></programlisting>
		<para>
As with all devices that reply to arp requests, 
you should stop the arp behaviour before bringing up the VIP,
or else flush the arp tables on the router before using the LVS.
		</para>
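		<para>
A sketch of doing this by hand in the right order: set the flags first, then
bring up the VIP. It assumes the VIP goes on <filename>lo:0</filename> with a
/32 netmask; the function names and the <filename>DRY_RUN</filename>
switch are made up for this example.
		</para>
<programlisting><![CDATA[
#!/bin/sh
# run: execute a command, or just print it when DRY_RUN is set
# (a convenience invented for this sketch).
run() {
    if [ -n "${DRY_RUN:-}" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# realserver_vip_up DEV VIP: stop DEV from answering or announcing ARP for
# addresses it does not own, and only then put the VIP on lo:0.
realserver_vip_up() {
    dev=$1
    vip=$2
    run sysctl -w "net.ipv4.conf.$dev.arp_ignore=1" &&
    run sysctl -w "net.ipv4.conf.$dev.arp_announce=2" &&
    run sysctl -w "net.ipv4.conf.all.arp_ignore=1" &&
    run sysctl -w "net.ipv4.conf.all.arp_announce=2" &&
    run ifconfig lo:0 "$vip" netmask 255.255.255.255 up
}

# Usage (as root; set DRY_RUN=1 first to only print the commands):
#   realserver_vip_up eth0 "$VIP"
]]></programlisting>
		<para>
Chaining the commands with &amp;&amp; means the VIP is never brought up if
one of the sysctl settings fails.
		</para>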
		<para>
Mag2589 Walla Feb 21, 2007
		</para>
		<blockquote>
			<para>
On my realservers I have them set up to listen to the virtual address
on eth0:0. I need them to respond to arp on eth0 but I need them to
ignore it on eth0:0. To do this would I enter the following line in my
<filename>/etc/sysctl.conf</filename> file?
			</para>
<programlisting><![CDATA[
net.ipv4.conf.eth0:0.arp_ignore = 8
]]></programlisting>
			</blockquote>
		<para>
Horms
		</para>
		<para>
In a word: No
		</para>
		<para>
<filename>arp_ignore</filename> works only on physical interfaces. 
The old <filename>eth0:0</filename> notation is
a hang-over from the days of ip aliases, where in a round-about way you
could establish virtual interfaces (sort of). These days an interface
can have 0 or more addresses.  The <filename>arp_ignore</filename> 
semantics apply to such addresses.
		</para>
		<para>
If you really
need fine-grained arp control, take a look at <xref linkend="arptables"/>, 
which is kind of like iptables for arp.
		</para>
		</section>
		<section id="noarp_2.6">
		<title>noarp v2.6</title>
		<para>
Masar <emphasis>masar (at) MasarLabs (dot) com</emphasis> 04 Mar 2004
		</para>
		<para>
<ulink url="http://www.masarlabs.com/">noarp 2.0.0</ulink>
(http://www.masarlabs.com) is now available.
This is the port of <filename>noarp</filename> to the Linux 2.6.x kernel.
For the 2.4.x kernel use <filename>noarp 1.x.x</filename>.
I'm making separate packages of <filename>noarp</filename>
for the two kernels,
because the method for producing a module is different.
If there is sufficient interest,
I may produce a single package for both kernel versions.
		</para>
		</section>
		<section id="arp_ignore_ubuntu">
		<title>arp ignore with Ubuntu</title>
		<para>
Julien Cornuwel <emphasis>cornuwel (at) gmail (dot) com</emphasis> 29 Sep 2008 
		</para>
		<blockquote>
			<para>
I'm trying to set up load balancing with IPVS on two Apache webservers.
The loadbalancer and both apache servers are virtual machines running Ubuntu
8.04 server on VMware ESX 3.5.
			</para>
			<para>
If the platform has been idle for some time (like when I came back to work
this morning), I can point a browser to http://VIP and get pages from
server1 or server2 alternately (I'm using rr during setup). But after
about 5 seconds, I get nothing and the browser times out.
			</para>
			<para>
Here is my configuration on the load balancer :
			</para>
<programlisting><![CDATA[
ipvsadm -A -t $VIP:80 -s rr
ipvsadm -a -t $VIP:80 -r $RIP1:80 -g
ipvsadm -a -t $VIP:80 -r $RIP2:80 -g
]]></programlisting>
			<para>
On webservers, I added the following to /etc/sysctl.conf (as suggested on
http://www.linuxvirtualserver.org/VS-DRouting.html) :
			</para>
<programlisting><![CDATA[
net.ipv4.conf.all.hidden = 1
net.ipv4.conf.lo.hidden = 1
]]></programlisting>
			<para>
I rebooted them and then :
			</para>
<programlisting><![CDATA[
ifconfig lo:0 $VIP netmask 255.255.255.255 up
]]></programlisting>
			<para>
Unless I did something stupid (if so, please tell me), it should work. 
			</para>
<programlisting><![CDATA[
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
]]></programlisting>
			<para>
I'm quite sure my problem is not on the loadbalancer but on webservers.
			</para>
			<para>
I have one interface and tried the following with no more luck :
			</para>
<programlisting><![CDATA[
net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.eth0.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
net.ipv4.conf.eth0.arp_announce=2
]]></programlisting>
			<para>
On the first request, which works, I have an ESTABLISHED line.
But after a few seconds, I get dozens of SYN_RECV.
OK, so I definitely have an ARP problem, with the above configuration.
Any idea why the above commands don't work on Ubuntu?
			</para>
			<para>
I did a TCP dump on all 3 servers and here is what I see :
			</para>
			<itemizedlist>
				<listitem>
On webservers, when it works, I see outgoing IP packets with the LB's
address as origin. When it doesn't, I just see nothing. About once per
second, the LB sends ARP requests trying to find both webservers (on their
real addresses), I never saw an ARP reply.
				</listitem>
				<listitem>
On the load balancer, I see incoming requests from clients, and some
"ICMP host lb" not reachable sent to the client when it doesn't work.
				</listitem>
			</itemizedlist>
			<para>
Webservers should reply to ARP requests on their primary addresses, but they
don't :(
			</para>
		</blockquote>
		<para> 
Laurentiu C. Badea (L.C.) <emphasis>lc (at) waat (dot) com</emphasis> 07 Oct 2008
		</para>
		<para>
Well I think it's either that you applied "hidden" to eth0 on the 
webservers, or the LB has the VIP as a primary address. See if the ARPs 
were going out with VIP as source and if that's the case, try giving the 
LB a different primary address and make VIP an alias.
		</para>
		<para>
Julien
		</para>
		<blockquote>
Great! That was it. Now that I have the VIP as an alias on the LB, it works.
Note to the documentation team: on Ubuntu 8.04, there is a trap with realservers.
If you set the arp_ignore/arp_announce configuration in /etc/sysctl.conf AND set
the VIP on lo:0 in /etc/network/interfaces, it seems that the interface is brought
up *before* the sysctl settings are applied. You have to set the VIP
manually at the end of the boot process.
		</blockquote>
		</section>
	</section>
	<section id="arptables" xreflabel="arptables">
	<title>arptables</title>
        <para>
Kjetil Torgrim Homme <emphasis>kjetilho (at) ifi (dot) uio (dot) no</emphasis>
11 Jul 2004
        </para>
        <blockquote>
                <para>
arptables is a method supported by Red Hat.
The package, <filename>arptables_jf</filename>,
is part of Advanced Server, but the src.rpm can be downloaded,
rebuilt and used on Workstation since it has the same kernel support.
The configuration is pretty straightforward:
                </para>
<programlisting><![CDATA[
# arptables -A IN -d $VIP -j DROP
# arptables -A OUT -s $VIP -j mangle --mangle-ip-s $RIP
# service arptables_jf save
# chkconfig --add arptables_jf
# chkconfig --level 12345 arptables_jf on
]]></programlisting>
                <para>
This service will start before the network is brought up.  Note that you
have to specify an explicit runlevel, since it stupidly won't start in
single user by default.
                </para>
        </blockquote>
	<para>
Bandit Lazuli <emphasis>banditlazuli (at) yahoo (dot) com</emphasis> 13 Apr 2006 
	</para>
	<para>
Our cluster of web frontends periodically exhibited a kind of Fatal
Attraction behavior, where one host would suddenly be the recipient of
all hits. Attempting to add new hosts to the existing cluster
triggered this behavior in a consistent way. With something clear to
fix, we installed the latest version of keepalived on the latest RHEL4
kernel.
 	</para>
	<para>
And lo, nothing changed. 
Add a new host, it became a Fatal Attractor within 6 minutes of operation 
(note that this is NOT the <xref linkend="thundering_herd"/>; 
things were relatively well balanced for a minute or 6).
	</para>
	<para> 
Stranger yet, ipvsadm on the director revealed that the Attractor was
getting NO hits. So it wasn't that the LVS was sending all hits to one
machine. You guessed it. The new machine was arping for the shared ip,
and connections were coming directly to it.
We had arptables set up as follows:
 	</para>
<programlisting><![CDATA[
*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
COMMIT
]]></programlisting>
	<para> 
And in desperation, started arptables at runlevel 1. This didn't help,
because it wasn't responding to an inbound arp request, but was
instead generating its OWN arp request, and broadcasting the response
it made to itself.
This could be seen with:
	</para> 
<programlisting><![CDATA[
tcpdump -i any arp > file
]]></programlisting>
	<para> 
And then pawing through the file for the shared ip (name). So there
lies the smoking gun. Arptables was NOT working as advertised. So we
added:
	</para>

<programlisting><![CDATA[
-A OUT -d 192.168.0.12 -j mangle --mangle-ip-s 192.168.0.104
]]></programlisting>

	<para> 
This still did not do the trick; apparently arptables implicitly
operates on the interface owning the ip (lo:1, in our case), if no
interface is specified. That left eth0 leaking arps.
Specifying the interface did the trick:
	</para>

<programlisting><![CDATA[
-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104
]]></programlisting>
	<para>
And here is the whole filter:
 	</para>
<programlisting><![CDATA[
*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104
COMMIT
]]></programlisting>
	<para>
Arps are now properly squelched, and the fatal attractor behavior has vanished.
I'm posting this because I longed for Google to return such a message in
response to many searches.
	</para>
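	<para>
Rather than pawing through the capture by hand, the tcpdump output can be
filtered for ARP requests that announce the VIP as their sender.
This is only a sketch: the helper name <command>leaked_arps</command> is
made up for illustration, and it assumes tcpdump's usual text format in which
an ARP request contains "tell <emphasis>sender-ip</emphasis>".
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: print only ARP requests whose sender ("tell") address is the VIP.
# Reads tcpdump text output on stdin; $1 is the VIP.
leaked_arps() {
	# escape the dots so that grep treats the VIP literally
	vip_re=$(printf '%s' "$1" | sed 's/\./\\./g')
	grep -E "tell ${vip_re}(,| |\$)"
}

# usage on a realserver:
#   tcpdump -l -n -i any arp | leaked_arps 192.168.0.12
]]></programlisting>
	<para>
Any line printed means the realserver is still sending ARP requests with the
VIP as source; once the OUT rule is in effect, nothing should be printed.
	</para>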
	</section>
	<section id="vip_not_rip" xreflabel="vip not rip">
	<title>The arp problem is on the realserver's VIP not the RIP</title>
	<para>
Cali Federico
	</para>
	<blockquote>
		<para>
I've configured an http service on director as below:
		</para>
<programlisting><![CDATA[
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  194.153.172.249:80 wlc
  -> 194.153.172.222:80             Route   1      0          0
]]></programlisting>
		<para>
and I've installed the noarp module on the realserver as below:
		</para>
<programlisting><![CDATA[
[root@cautha2 root]# noarpctl list
194.153.172.249 194.153.172.222 12 0 3
]]></programlisting>
		<para>
The problem is that when I request http://VIP/index.html from my
PC (outside the LVS network), I see the page provided by the realserver 2
or 3 times; after that I receive a "Page Cannot Displayed" error.
The page remains unreachable for several minutes, after which the same
behaviour repeats.
Looking at the director's arp table, the HWaddress related
to the realserver is "(incomplete)".
After I set the correct realserver MAC address (using arp -s), the LVS
works properly.
		</para>
	</blockquote>
	<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 19 May 2005
	</para>
	<para>
You are noarp'ing your RIP, which is not a good idea. It's just the VIP
that needs NOARP on the realserver.
	</para>
	</section>
	<section id="testing_for_arp" xreflabel="testing for arp">
	<title>Testing an interface for replies to arp requests</title>
	<para>
To test that the VIP on the realservers
(here lo:0) is hidden from arp requests,
test from a client machine on the same network segment as
the NICs on the realservers.
For your sanity, try this with one realserver at a time.
Do _NOT_ have the director (with its pingable VIP) connected to the network
(unplug it).
	</para>
	<itemizedlist>
		<listitem>
			<para>
Optional: on each realserver, collect the MAC addresses
of each NIC.
			</para>
<programlisting><![CDATA[
realserver: # ping VIP
realserver: # arp -a	# look for the MAC address of the VIP
realserver: # ifconfig	# should show the same MAC address
]]></programlisting>
		</listitem>
		<listitem>
			<para>
find the MAC address for the realserver's VIP from the test client.
			</para>
<programlisting><![CDATA[
client: # ping VIP
or ping the broadcast address
client: # ping 192.168.1.255	#for a VIP in the 192.168.1.0/24 network
then
client: # arp -a	# look for the MAC address of the VIP on the realserver.
			# if you have several realservers on-line,
			# it could be the MAC address of the NIC on any of the realservers
]]></programlisting>
		</listitem>
		<listitem>
			<para>
Hide the lo interface on the realserver (<xref linkend="hidden"/>).
Before the arp tables expire (15secs - 2mins depending
on the OS), ping the VIP again from the test client.
The realserver will still reply to the ping,
since the MAC address for the VIP will still be in the arp table of the test client.
			</para>
<programlisting><![CDATA[
client: # ping VIP
]]></programlisting>
		</listitem>
		<listitem>
			<para>
let the arp cache expire (wait 15sec - 2mins) or clear the arp cache of the test client.
			</para>
<programlisting><![CDATA[
client: # sleep 120
or
client: # arp -d VIP	# delete/flush/clear the entry for the VIP
then
client: # arp -a	# show that the arp entry for the VIP is gone
]]></programlisting>
		</listitem>
		<listitem>
			<para>
ping the VIP. You should get no reply.
			</para>
<programlisting><![CDATA[
client: # ping VIP
]]></programlisting>
		</listitem>
		<listitem>
			<para>
Repeat for all realservers, making sure you get no ping replies for the VIP.
			</para>
		</listitem>
		<listitem>
			<para>
On the director (don't connect it to the network yet)
find the MAC address of the VIP
			</para>
<programlisting><![CDATA[
director: # ping VIP		#VIP can be an IP or a resolvable name
and/or
director: # ifconfig		#look for MAC address of NIC with VIP
then
director: # cat /proc/net/arp	#shows list of IP-MAC address pairs
or
director: # arp -a 		#shows FQDN as well
]]></programlisting>
		</listitem>
		<listitem>
			<para>
Connect the director to the network.
Just to be sure, clear the arp entry for the VIP on the test client
(<command>arp -d VIP</command>) and ping the VIP again.
You should get ping replies.
The test client's arp cache should have the MAC address of the director's NIC
for the VIP.
			</para>
		</listitem>
	</itemizedlist>
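	<para>
The ping-then-inspect-the-cache steps above can also be done with an ARP-level
probe, which avoids waiting for the client's arp cache to expire.
The sketch below assumes the <command>arping</command> from the iputils package
is installed on the test client (other arping implementations use different
options); the interface name and VIP are placeholders.
	</para>
<programlisting><![CDATA[
# Sketch: run on the test client.
VIP=192.168.1.110	# assumption: replace with your VIP
arping -c 3 -I eth0 "$VIP"
# Each reply lists the MAC address that answered for the VIP.
# With the director unplugged and the VIP hidden on the realservers,
# there should be no replies at all.
]]></programlisting>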
	</section>
	<section id="normal_realservers">
	<title>Normal machines, Solaris, Novell Server</title>
	<para>
The arp problem only occurs on Linux with kernels 2.2 and later.
All other OS's honor the arp flag
(Joe: HPUX does something weird with <filename>noarp</filename>, but I've forgotten what).
	</para>
	<para>
Mark de Vries
	</para>
	<blockquote>
I need to do some DR to a Solaris 8 box... Anyone know how to set it to
ignore arp requests? So far I have only done DR to linux and windows
boxen...
	</blockquote>
	<para>
Lasse Karstensen <emphasis>lkarsten (at) hyse (dot) org</emphasis> 21 Apr 2006
	</para>
	<para>
At least for Solaris 9, you can just create the file
<filename>/etc/hostname.lo0:14</filename> (14=some number)
	</para>
	<para>
Inside of it you write
	</para>
<programlisting><![CDATA[
"""
plumb 1.2.3.4 -arp netmask 255.255.255.255 up
"""
]]></programlisting>
	<para>
where 1.2.3.4 is your vip-address.
I'm pretty sure this also works in Solaris 8. 
The Solaris-people here also mention that you can just use
addif in <filename>/etc/hostname.lo0</filename>, 
if you rather fancy having everything in one file.
	</para>
	<para>
Mark de Vries <emphasis>markdv (dot) lvsuser (at) asphyx (dot) net</emphasis> 21 Apr 2006
	</para>
	<blockquote>
		<para>
Ah yes, the -arp option... Yeah works! Not that it matters because half
way through the exercise I suddenly realized that the service has always
used NAT for a reason; the real servers have the service on different
ports than on the VIP... so DR is a no-go... sigh...
		</para>
		<para>
And the whole reason we wanted to use dr in this first, or actually
second, case was because half way through the first attempt at configuring
it as NAT I suddenly realized that wouldn't work because in this case the
realserver has an interface in the same network as the VIP is in (so there
is a direct route to the client and packets won't get de-natted)... So
short of someone pointing me to a source-based-routing-HOWTO-on-Solaris
it's just not gonna work out...
		</para>
	</blockquote>
	<para>
Malcolm Turnbull <emphasis>malcolm (at) loadbalancer (dot) org</emphasis> 17 Dec 2008
	</para>
	<para>
We recently had a customer using that funny thing called Novell Server....
I couldn't find anything in the LVS manual about Novell Server in DR
mode but eventually figured out the following which works great:
	</para>
<programlisting><![CDATA[
add secondary ipaddress <ipaddress> noarp
]]></programlisting>
	</section>
	<section id="switches_that_bight">
	<title>problems with switches</title>
	<para>
There are other places in the network with arp caches,
like "smart" switches.
These will bite you if you don't know about them.
	</para>
	<para>
<emphasis>frederic (dot) buche (at) equant (dot) com</emphasis> 29 Oct 2003
	</para>
	<blockquote>
		<para>
OK Julian, you are right.
The problem came from my network-switch,
which keeps in memory the MAC address of all machines.
So it just relays the arp request to the concerned server,
by using a unicast arp request.
		</para>
		<para>
Just for a test, I deleted the MAC entry on my switch. Then I repeated
the same test as before ... and the hidden patch works well!
		</para>
	</blockquote>
	<para>
Carlos J. Ramos <emphasis>cjramos (at) genasys (dot) com</emphasis> 15 Dec 2003
	</para>
	<para>
We are using an HP Procurve Switch 2124 in a
cluster using Heartbeat and Ldirectord as HA and balancing mechanisms.
Previously we had similar working setups with a hub in the
same location.
Everything works fine, until we do a takeover between directors. As
the switch documentation says, the switch automatically learns MAC
addresses and associates them with its ports, so that although heartbeat
moves the IP address, the switch tries to use the same switch port.
The situation remains for at least 1 hour... during this time forwarding in
the cluster does not work... and the realservers cannot be reached
from outside... We are assuming this is an arp caching problem,
although we haven't eliminated other possible causes yet.
	</para>
	<para>
Is there any way to force the switch to refresh its MAC address table? Is
there any Linux tool that sends some kind of packet over the net, forcing
the ARP table to be updated?
	</para>
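	<para>
One tool that sends such a packet is <command>arping</command> from the
iputils package, which has an unsolicited (gratuitous) ARP mode.
The following is a sketch only, not a reply from the mailing list:
the interface name and VIP are assumptions, and whether a particular switch
or router updates its tables in response depends on the device.
	</para>
<programlisting><![CDATA[
# Sketch: run on the newly active director right after the takeover.
VIP=192.168.1.110	# assumption: replace with your VIP
arping -U -c 3 -I eth0 "$VIP"	# -U: unsolicited ARP to update neighbours' ARP caches
]]></programlisting>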
	</section>
	<section id="first_inklings">
	<title>The ARP problem, the first inklings</title>
	<para>
History: The ARP behaviour changed between 2.0.x and 2.2.x kernels.
Here's the original posting by Wensong
and a reply from Alexey Kuznetsov (2.2 tcpip author).
	</para>
	<para>
Wensong Zhang <emphasis>wensong (at) iinchina (dot) net</emphasis>  24 Mar 1999
	</para>
	<para>
Today I upgraded the kernel to 2.2.3 with tunneling support on
one of the realservers, and found a problem: the Linux 2.2.3
tunnel device answers ARP requests, even if I use the NOARP
option as follows:
	</para>
<programlisting><![CDATA[
realserver:# ifconfig tunl0 172.26.20.110 -arp netmask 255.255.255.255 broadcast 172.26.20.110
]]></programlisting>
	<para>
It still answers the ARP requests. This will seriously interfere
with virtual server via tunneling.  In fact, the tunnel
device shouldn't answer ARP requests from the ethernet. I
think it is a bug in linux/net/ipv4/ipip.c, which is now a clone
of ip_gre.c, not the original tunneling code.
	</para>
	<para>
If you are interested, you can test yourself on kernel 2.2.3,
choose a free IP address of your ethernet and configure it on the
tunl0 device, then telnet to that IP address from other host, I
guess you can. Finally, have a look at the ipip.c, maybe you can
debug it. :-) --
	</para>
	<para>
But, what is the IFF_NOARP flag of the tunnel device for?
	</para>
	<para>
<emphasis>kuznet (at) ms2 (dot) inr (dot) ac (dot) ru</emphasis>
	</para>
	<blockquote>
		<para>
IFF_NOARP means that ARP is not used by THIS device.
On normal IPIP tunnels it does not make much of sense, but may be
used for example to turn on/off endpoint reachability detection.
		</para>
		<para>
I do not see any reasons to disable answering ARP in such
circumstances. Isolation of VPNs on adjacent segments is impossible
at routing/arp level, it is just not well-defined behaviour.
		</para>
		<para>
If the isolation is made with firewall policy rules, then
it is clear that arp policy must be handled at this level too.
		</para>
	</blockquote>
	<para>
In kernel 2.0.x, the tunnel device doesn't answer ARP requests.
	</para>
	<blockquote>
Yes.
	</blockquote>
	<para>
Yeah, we can have link-local addresses that don't answer ARP requests in
kernel 2.2.x. For example, we can configure all the hosts in a network
with the following command:
	</para>
<programlisting><![CDATA[
ifconfig lo:0 192.168.0.10 up
]]></programlisting>
	<para>
There will be no collision. The loopback alias interfaces don't answer ARP
requests.
	</para>
	<blockquote>
		<para>
Are you sure? I am not. Please, test.
		</para>
		<para>
BTW you risk adding non-loopback addresses on loopback device.
They have the HIGHEST preference to be used as router identifier,
so that VPN addresses cannot be added to loopback at all.
		</para>
	</blockquote>
	<para>
No, it doesn't fail. I tested it with kernel 2.0.36, it worked.
	</para>
	<blockquote>
It does not work under 2.2. To be honest, I am about to stop to understand
you. You talk about 2.2, but all your tests are made for 2.0. 8)
	</blockquote>
	</section>
	<section id="Kese">
	<title>A posting to the mailinglist by Peter Kese explaining the "arp problem"</title>
	<para>
(saved for posterity by Ted Pavlic, minor editing by Joe)
	</para>
	<para>
<emphasis>peter (dot) kese (at) ijs (dot) si</emphasis>
	</para>
	<para>
Before we start, let's assume we have the following network
configuration for an LVS running LVS-DR.
	</para>
<programlisting><![CDATA[
client		10.10.10.10

gw		192.168.1.1

director	192.168.1.10 	IP for admin (director IP)
        	192.168.1.110 	VIP (responds to arp requests)

realserver	192.168.1.11 	IP to which each service is listening (realserver IP)
		192.168.1.110 	VIP (DOES NOT respond to arp requests)
]]></programlisting>
	<para>
The virtualserver is the combination of the director and
the realserver running LVS.
	</para>
	<para>
Our goal is:
	</para>
	<orderedlist>
		<listitem>
Virtual server should respond to arp requests for both
the VIP and the director IP.
		</listitem>
		<listitem>
		The realserver should respond to arp requests for the
realserver IP but NOT the VIP.
		</listitem>
		<listitem>
The gateway sends packets for the VIP to the director
no matter what.
		</listitem>
	</orderedlist>
	<para>
Problem 1: Interface aliases
	</para>
	<para>
Realserver and director need to have an interface with the VIP in
order to respond to packets for virtual server. A real interface
is not needed, an IP alias will do just fine and this interface
alias could be either eth0:0 or lo:0.
	</para>
	<para>
On the 2.0 kernels, the ARP responding ability of an interface
alias (eg eth0:0) could either be enabled or disabled
independently of the main (eth0) interface.  If you wanted eth0:0
not to respond to ARP requests, you could simply say:
	</para>
<programlisting><![CDATA[
        ifconfig eth0:0 192.168.1.2 -arp up
]]></programlisting>
	<para>
Thus in the 2.0 kernels it is possible, on a realserver, to have
the realserver IP (on eth0) respond to arp requests and for the
VIP (on eth0:0) to not respond.
	</para>
	<para>
In the 2.2 kernels this doesn't work any more. Whether an
interface alias responds to ARP requests or not depends only on
the way the real interface is configured.  So if eth0 responds to
ARP requests (which it normally will), eth0:0 carrying the VIP
will also respond to ARP requests no matter what.
	</para>
	<para>
This means an ethernet alias (eth0:0) is not permitted on
realservers, because realservers should not respond to ARP requests
for the VIP.
	</para>
	<para>
On the other hand, loopback aliases never respond to ARP requests,
which means that the loopback alias (lo:0) must not be used on
the director for the VIP.
	</para>
	<para>
Problem 2: Loopback aliases
	</para>
	<para>
I haven't done much checking on loopback interface problem, but
it seems that if an alias is used on a loopback interface (as is
required for LVS-DR) on a realserver running kernel 2.2.x, the
whole ARP gets screwed.
	</para>
	<para>
It appears that loopback interfaces get special ARP treatment in
the kernel, so I suggest avoiding the loopback aliases altogether.
	</para>
	<para>
The question now is: What kind of an interface can I use on real
servers?
	</para>
	<para>
As I already noted, eth0:0 alias can not be used, because such
aliases respond to ARP requests. lo:0 aliases can not be used,
because they cause ARP problems too.
	</para>
	<para>
In case of tunneling VS configuration, the answer is trivial:
tunl0. But to be honest, tunl0 interface can also be used for
direct routing.
	</para>
	<para>
(Joe: the dummy device is OK too, at least for the 2.0.x kernels)
	</para>
	<para>
With direct routing, the only thing we need an interface for is
to let the kernel know we possess an additional IP address. This
means we can set up any kind of interface, as long as it
doesn't respond to ARP requests. Instead of tunl0, you could also
set up a ppp0, slip0, eth1 or whatever. I suggest setting up a
tunl0:
	</para>
<programlisting><![CDATA[
        ifconfig tunl0 192.168.1.2 -arp up
]]></programlisting>
	<para>
Problem 3: Real server ARP requests.
	</para>
	<para>
Suppose we have set up a virtual server as described at the
beginning. All computers are running, but no requests have been
made.
	</para>
	<para>
Then the client sends a request to the VIP.
	</para>
	<para>
When the packet arrives at the gateway, the gateway makes an ARP
query for the VIP and the director responds. The gateway remembers
the director's MAC address and sends the packet to the director.
The director receives the packet, looks up its ipvsadm/LVS tables,
chooses a realserver and forwards the packet to the
realserver by the direct routing or tunneling method.
	</para>
	<para>
Real server receives the packet and generates a response packet
with destination=client, source=VIP.
	</para>
	<para>
(until now everything works correctly)
	</para>
	<para>
When realserver wants to send the response packet to the
gateway, it finds out, that it does not know the gateway's MAC
address.
	</para>
	<para>
It sends an ARP request to the local network and asks for the
gateway MAC address. This should look like:
	</para>
	<para>
        ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)
	</para>
	<para>
But in reality, realserver asks something like:
	</para>
	<para>
        ARP, who has 192.168.1.1 (gw), tell 192.168.1.110 (VIP),
	</para>
	<para>
because it takes the source address from the packet it wants to
send.
	</para>
	<para>
Here the problems come in.
	</para>
	<para>
The gateway receives the packet and responds to it, which is correct.
But at the same time, the gateway does a little optimization. It
finds that the realserver's MAC address is not listed in its
ARP tables and adds the entry to the table, just in case it
might need that address in the near future.
	</para>
	<para>
The ARP request contained the VIP address and the realserver's
MAC address, so from now on, the gateway will send all packets
destined for the VIP to the realserver instead (due to the MAC
address). This means all packets that follow will bypass the
virtual server altogether and be answered by the realserver.
	</para>
	<para>
If the realserver's ARP request had been:
	</para>
	<para>
        ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)
	</para>
	<para>
all this would not have happened. Therefore I have patched the
2.2 VS kernel in such a way, that it composes ARP requests based
on the address of the interface selected by the routing tables
instead of the address taken from the packet itself.
	</para>
	<para>
In order for virtual server to work correctly, the realservers
should have patched kernels as well, or at least copy the patched
/usr/src/linux/net/ipv4/arp.c file to the realservers before
compiling the kernels.
	</para>
	<para>
Conclusion
	</para>
	<para>
Those were my experiences with ARP problems and the 2.2 kernel
virtual server.
	</para>
	<para>
I think it would be wise to add this letter to the web site and
notify the network developers about our findings at some point in
time.
	</para>
	<para>
Here are some golden rules I stick to, when I do virtual server
configuration:
	</para>
<programlisting><![CDATA[
Rule 1:
        Do not use lo:0 alias on the director.
        Use eth0:0 alias instead.

Rule 2:
        Avoid using lo:0 alias, not even on realservers.
        Use tunl0 or some other simulated interface
        on realservers instead. (Joe: use dummy0)


Rule 3:
        Apply the VS patch to kernels on realservers.
]]></programlisting>
	</section>
	<section id="arp_bouncing" xreflabel="arp bouncing">
	<title>arp bouncing</title>
	<para>
symptoms of realservers arp'ing - arp bouncing
	</para>
	<para>
Stephen Williams <emphasis>sdw (at) lig (dot) net</emphasis> (Stephen wrote one of
the patches that stop devices in 2.2.x kernels from replying
to arp requests)
	</para>
	<para>
If you don't use the patch you'll find that the 'active' box will
bounce from machine to machine as each one sends an ARP reply
that is heard last. Additionally you will get TCP resets as
connections that were on one box suddenly start going to others.
Very nasty and unusable.
	</para>
	</section>
	<section id="lars_method">
	<title>Lars' Method</title>
	<para>
(This is called <link linkend="Lars_method">Lars' method</link>)
	</para>
	<para>
Lars
	</para>
	<para>
I have thought about how the ARP problem can occur at all with
direct routing, because I never noticed it. Then it occurred to me
that your VIP comes from the same subnet as the RIP of
the LVS and also all the realservers share this media.
	</para>
	<para>
To avoid the "ARP problem" in this case without adding a kernel
patch or anything else, you can just add a direct route for the
VIP using the RIP of the LVS as a gateway address on the
router in front of the LVS. ("ip route VIP 255.255.255.255
real_ip" on a Cisco, or "route add -host VIP gw RIP" on Linux)
	</para>
	<para>
Since I just used 2 ethernet cards and had the LVS act as
gateway/firewall anyway, I never noticed the ARP problem. (We
have 2 LVS in a standby configuration to eliminate the SPOF)
	</para>
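	<para>
On a Linux router with iproute2, the host route Lars describes could be
written as in the sketch below. The addresses are placeholders taken from the
example layout used elsewhere in this section, and
<command>DIRECTOR_IP</command> is a name made up here for the director's own
(non-VIP) address, which is what Lars calls the "RIP of the LVS".
	</para>
<programlisting><![CDATA[
# Sketch: run on the Linux router in front of the LVS.
VIP=192.168.1.110		# assumption: replace with your VIP
DIRECTOR_IP=192.168.1.10	# assumption: the director's own address
ip route add "$VIP/32" via "$DIRECTOR_IP"
ip route get "$VIP"		# check which next hop the router will now use
]]></programlisting>
	<para>
With this route in place the router resolves the director's address by ARP
and never asks for the VIP itself, so it does not matter to the router which
machines would answer an ARP request for the VIP.
	</para>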
	</section>
	<section id="static_routing">
	<title>Static Routing to Director</title>
	<para>
The arp problem is handled if the router in front of the director
has a static route for the VIP to the director (<emphasis>i.e.</emphasis>
packets for the VIP from the outside world are sent to the director
and cannot get to the realservers).
	</para>
	<para>
Wensong
	</para>
	<para>
For the clients who reach the virtual server through the router,
there is no problem if a static route for VIP is added.
	</para>
	<para>
However, for clients who are in the same network as the virtual
server, the "ARP problem" will arise. There is a fight over the ARP
response, and the clients don't know whether to send packets to the load
balancer or to the realserver.
	</para>
	<para>
In my view, the VIP address is shared by the director
and realservers in LVS-Tun or LVS-DR. Only the director
sends ARP responses for the VIP, so that it accepts the request packets;
the realservers have the VIP but don't respond to ARP for it, so that they
can process packets destined for the VIP.
	</para>
	</section>
	<section id="iproute2_NOARP">
	<title>iproute2 arp on|off flag</title>
	<para>
Joe, 21 May 2001
	</para>
	<blockquote>
		<para>
Was looking at the ip (<emphasis>i.e.</emphasis> iproute2) notes and it says
		</para>
<programlisting><![CDATA[
ip arp on|off

--change NOARP flag on the device

NB. This operation is not allowed if the device is in state UP.
Though neither ip utility nor kernel check for this condition, you can
get unpredictable results changing the flag while the device is running.
]]></programlisting>
		<para>
Is this like the old -arp flag for ifconfig?
		</para>
	</blockquote>
	<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis> 21 May 2001
	</para>
	<para>
	This is the device ARP flag, same as ifconfig [-]arp.
The flag is used to allow ARP packets for the specified device.
It is correct that "lo" does not talk ARP, but you connect to
the VIPs on "lo" through eth*, so the flag is of no help for LVS.
We can't drop the flag for eth device.
	</para>
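	<para>
For reference, the flag can be toggled and inspected as in the sketch below.
The device name <command>tunl0</command> is only an example, and, as Julian
explains, setting the flag does not by itself solve the arp problem for LVS.
The device is taken down first because the iproute2 notes warn against
changing the flag on a running device.
	</para>
<programlisting><![CDATA[
# Sketch: toggle the device ARP flag with iproute2.
ip link set dev tunl0 down
ip link set dev tunl0 arp off	# sets the NOARP flag (same as ifconfig tunl0 -arp)
ip link set dev tunl0 up
ip link show dev tunl0		# the flags field should now include NOARP
]]></programlisting>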
	<para>
Andreas J. Koenig, 02 Jun 2001
	</para>
	<blockquote>
kernel 2.4.5 has arp_filter
	</blockquote>
	<para>
Julian Anastasov <emphasis>ja (at) ssi (dot) bg</emphasis>
	</para>
	<para>
arp_filter does not solve the ARP problem for LVS
	</para>
	<para>
	This is a new proposal to control the ARP probes and replies
based on route flag "noarp". It will be discussed on the netdev mailing
list and maybe something like this is going to be included in 2.4,
maybe in 2.2 too, not sure. As you all know, the hidden feature is
not being considered for 2.4. The net developers have the final word. I'll
try to maintain the hidden flag in all future kernels, since this flag is
more usable than the new feature, because the hidden flag has different
semantics, and because there may be some user space tools that rely
on it.
	</para>
	</section>
	<section id="arp_2.2">
	<title>Is the arp behaviour of 2.2.x kernel a bug?</title>
	<note>
		<para>
Julian Anastasov is replying to correct an
error in a previous version of the HOWTO
where I state that the dummy0 device in
2.2.x kernels does not arp. Julian wrote
one of the realserver patches which
fix the "arp problem".
		</para>
	</note>
	<para>
Julian
	</para>
	<para>
         In fact, the documentation is incorrect. There is no difference,
 all devices are reported in the ARP replies: lo, tunl and dummy. So, only
 the ARP patch can solve the problem. This can be tested using this
 configuration with any device (before the patch applied):
	</para>
<programlisting><![CDATA[
Host A:
         eth:x 192.168.0.1

Host B:
         eth:x 192.168.0.2
         lo, dummy, tunl: 192.168.0.3
]]></programlisting>
	<para>
On host A try: ping 192.168.0.3
	</para>
	<para>
Host B replies for 192.168.0.3 through 192.168.0.2 device
	</para>
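	<para>
Julian's test can be run from host A roughly as follows. This is a sketch
using the addresses from his layout; the expected result in the comment is the
one he describes for an unpatched 2.2.x kernel on host B.
	</para>
<programlisting><![CDATA[
# Sketch: run on host A.
ping -c 1 192.168.0.3
arp -n 192.168.0.3	# on an unpatched 2.2.x host B, the MAC address shown
			# is that of host B's eth device (the one holding 192.168.0.2)
]]></programlisting>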
	<para>
         So, the ARP problem means: "All local interfaces are reported"
until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP
to hide the interface are incorrect. I don't expect them in the kernel.
	</para>
	<para>
Stephen Williams (who wrote another of the patches to
fix the arp problem).
	</para>
	<blockquote>
		<para>
Of course the ARP code in the kernel needs to be fixed so my filter code isn't
needed.  Still, I'm confused by this statement.  The IFF_NOARP flag determines
whether a device arp replies or not.  What's wrong with honoring that?
		</para>
		<para>
If you mean that arp replies should never be sent on another interface, that is
what I currently believe to be correct.
		</para>
	</blockquote>
	<para>
Julian
	</para>
	<para>
         My understanding is that 2.2.x ARP code is not buggy and
 there is no need to be "fixed". I must say that your patch is
 working for the LVS folks but not for all linux users.
	</para>
	<para>
         IFF_NOARP means "Don't talk ARP on this device",
 from the 'man ifconfig':
	</para>
	<para>
 [-]arp  Enable or disable the use of the ARP protocol on
 this interface.
	</para>
	<para>
         So, where is the bug ? The ARP code never talks through
 lo, dummy and tunl devices when they are set NOARP. It uses
 eth (ARP) device.
 If You hide all NOARP interfaces from the ARP protocol
 this is a bug. One example:
	</para>
<programlisting><![CDATA[
 +--------+ppp0                          +------+
 | Host A |------------ppp link----------|ROUTER|------ The World
 +--------+A.B.C.1 (www.domain.com)      +------+
   |eth0
   |A.B.C.2
   |
   |A.B.C.3
 +--------+
 | Host B |
 +--------+
]]></programlisting>
	<para>
 Is it possible after your patch for Host B to access www.domain.com?
 How? Host A doesn't send replies for A.B.C.1 through eth0 after
 your patch. OK, maybe this is not fatal. Tell it to all kernel
 users. You hide all their NOARP interfaces. Maybe there are other
 examples where this is a problem too. Or maybe there is something
 wrong in this configuration?
	</para>
	<para>
         I want to say that this patch hurts all users if present
 in the kernel. On Nov 6 I posted one patch proposal to the
 linux-kernel list which adds the ability to hide interfaces
 from the ARP queries and replies. But the difference is that
 only specified interfaces are hidden, not all NOARP
 interfaces. Its arp_invisible sysctl can be used by LVS
 folks to hide lo, tunl or dummy interfaces but this feature
 doesn't hurt all kernel users. I think this patch is more
 acceptable and can be included in the 2.2 kernel, maybe after
 some tuning. And I'm still expecting comments from the net
 folks and from all LVS users.
	</para>
	</section>
	<section id="arp" xreflabel="the kernel replies to arp requests">
	<title>The device doesn't reply to arp requests, the kernel does.</title>
	<para>
ARP requests/replies are thought of as coming from a device
and people make statements like
	</para>
	<para>
"the dummy device in 2.0.x kernels does not reply to arp
requests while the same device in 2.2.x kernels does reply".
	</para>
	<para>
It is the kernel that handles arp requests according to a
set of rules and not the device. The code for the dummy
device is the same in 2.0.x and 2.2.x kernels and is
not responsible for the change in arp behaviour.
	</para>
	<para>
(The RFC for ARP is at ftp://ftp.isi.edu/in-notes/std/std37.txt.
- also see rfc826 and rfc1122. The model system used there is 2
machines on a single ethernet. It doesn't shed any light on the
implementation of ARP on multi-interface systems like LVS.)
	</para>
	</section>
	<section id="vip_devices" xreflabel="vip devices">
	<title>Properties of devices for the VIP</title>
	<para>
In a previous version of the HOWTO I stated that the dummy0
device did not arp in 2.2.x kernels and therefore could be
used as the device for the VIP on an unpatched 2.2.13 realserver.
Julian Anastasov replied that it did arp (see <xref linkend="arp_2.2"/>
for his posting and the ensuing discussion).
	</para>
	<para>
I hadn't actually tested whether the dummy0 device arp'ed
but had concluded that it wasn't arp'ing because I had a
working LVS using the dummy0 interface for the VIP on
unpatched 2.2.x realservers and because as everyone
knows ;-) an LVS needs to have a non-arp'ing device on
the VIP of the realservers.
	</para>
	<para>
I had a LVS-DR LVS which worked with dummy0, lo:0 and tunl0
as the VIP device and which on further testing, I found
also worked with eth0:1 or eth1 as the VIP device on
2.2.13 realservers. Whatever the arp'ing status of dummy0,
lo:0 or tunl0, clearly eth1 replies to arp requests,
so despite the conventional wisdom, it is possible
to build an LVS with arp'ing VIP's on the realservers.
	</para>
	<para>
On investigating why this LVS worked, I found that the
MAC address for the VIP in the client's arp cache (# arp -a)
was always the director. I assume this was
because the director is 3-4x the speed of the other
machines in the LVS and it replies to arp requests first
for the VIP (another posting from Stephen Williams
says that the address which replies last is stored in the
arp cache - we'll figure out what's really going on here
eventually). On another LVS where the realservers were all
identical hardware with 2.2.13 unpatched kernels, one
particular realserver always was the machine in the client's
arp cache for the VIP (to check, delete entry for VIP
with arp -d, then ping again, then look in arp cache).
	</para>
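	<para>
The check in parentheses above can be sketched as a small shell function.
This is only an illustration: the function name <command>mac_for_ip</command>
is hypothetical, the VIP is the one used in the examples in this section,
and the parsing assumes the net-tools <command>arp -a</command> or
<command>arp -an</command> output format.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: find which MAC address the client has cached for the VIP.
# VIP is an assumption taken from the examples in this section.
VIP=${VIP:-192.168.1.110}

# Print the MAC address recorded for IP $1 in "arp -a" style output
# read from stdin. Prints nothing if the entry is missing or incomplete.
mac_for_ip() {
    awk -v ip="($1)" '$2 == ip && $4 != "<incomplete>" { print $4 }'
}

# Usage on the client (as root):
#   arp -d "$VIP"                  # drop the cached entry
#   ping -c 1 "$VIP" >/dev/null    # force a fresh arp request
#   arp -an | mac_for_ip "$VIP"    # compare with the director's MAC
]]></programlisting>
	<para>
If the address printed is not the MAC address of the director, then
some other machine won the arp race for the VIP.
	</para>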
	<para>
	I found that I could get a working LVS using almost
anything to hold the VIP on the realservers, including eth0:1
and eth1 (another NIC in the realserver). These devices carrying
the VIP were pingable from the client and I could get the
corresponding MAC addresses in the arp table of the client
if the director was not setup with a VIP. When I setup a
working LVS this way, I found each time that the MAC
address for the VIP in the client's arp cache was the
director's MAC address. For some reason, that I don't know,
whenever the client does an arp request for the VIP, it gets
the director's MAC address.
	</para>
	<para>
Possible reasons for the MAC address of the director always
being associated with the VIP in my LVS -
	</para>
	<orderedlist>
		<listitem>
I configure the director first and then the realservers.
I don't make requests for a service till the realservers
are set up. (Still I can't imagine the client
asking for the MAC address of the VIP until it makes a connect
request.)
		</listitem>
		<listitem>
The director is 3 times faster (CPU speed) than the next
machine in the LVS and it always replies to arp requests first.
		</listitem>
		<listitem>
I was lucky.
		</listitem>
	</orderedlist>
	<para>
Since you can make a working LVS-DR LVS with the realserver VIP
on an arp'ing eth0:1 device I decided that the relevant piece
of information about arp'ing was (ta da!)
	</para>
	<para>
* <emphasis>an LVS will work if the client always gets the MAC address
of the director when it asks for the MAC address of the VIP</emphasis> *
	</para>
	<para>
This provides an easy solution - you tell the client (or the router) the
MAC address of the VIP with <command>arp -s</command> or <command>arp -f</command>.
	</para>
	<para>
here's my <filename>/etc/ethers</filename>
	</para>
<programlisting><![CDATA[
lvs.mack.net 00:A0:CC:55:7D:47
]]></programlisting>
	<para>
After installing the MAC address of the DIP (director) as
the MAC address of the VIP (lvs) in the arp table
(<command>arp -f /etc/ethers</command>) I get
	</para>
<programlisting><![CDATA[
client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0
]]></programlisting>
	<para>
notice the "PERM" in the VIP entry on the client.
	</para>
	<para>
removing the permanent entry
	</para>
<programlisting><![CDATA[
client:/usr/src/temp/lvs# arp -d lvs.mack.net
client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at <incomplete> on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0
]]></programlisting>
	<para>
If I edited <filename>/etc/ethers</filename> changing the MAC address of lvs to
anything else, the LVS did not work anymore. So the arp
information is coming from <filename>/etc/ethers</filename> rather than some
uncontrolled variable I'm not aware of.
	</para>
	<para>
I had thought that in an LVS with the VIP on realservers
on an arping device that the VIP would hop from one machine
to another (see the postings in the MISC section). Since
naturally occurring LVS's with arping VIP's on realservers
existed and worked well (mine), I set up an LVS
by making a permanent entry for the VIP of the director
in the arp cache of the client (router). This can be done by
	</para>
<programlisting><![CDATA[
$ arp -f /etc/ethers
]]></programlisting>
	<para>
or
	</para>
<programlisting><![CDATA[
$ arp -s 192.168.1.110 MAC_ADDRESS
]]></programlisting>
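	<para>
One way to fill in MAC_ADDRESS without retyping it is to read it from
the director's own entry in the client's arp cache. The following is a
sketch only: the function name <command>pin_vip_cmd</command> is
hypothetical, the addresses are those of the examples in this section,
and it assumes the director's real IP is already in the arp cache
(<emphasis>e.g.</emphasis> after pinging it). The function prints the
command rather than running it, so you can look at it first.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: build the "arp -s" command that pins the VIP to the director.
# Reads "arp -a" style output on stdin.
#   $1 = VIP, $2 = real IP of the director (both are assumptions here)
# Exits non-zero if the director has no complete entry in the input.
pin_vip_cmd() {
    awk -v vip="$1" -v dip="($2)" '
        $2 == dip && $4 ~ /^[0-9A-Fa-f:]+$/ { print "arp -s " vip " " $4; found = 1 }
        END { exit !found }'
}

# Usage on the client or router (as root):
#   ping -c 1 192.168.1.10 >/dev/null
#   arp -an | pin_vip_cmd 192.168.1.110 192.168.1.10
# then run the printed command.
]]></programlisting>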
	<para>
There are 2 results of this
	</para>
	<orderedlist>
		<listitem>
The realservers can have the VIP on an
arp'ing device (e.g. eth0:1, eth1);
you don't need lo, dummy0 or tunl0
for realservers with 2.0.36 and 2.2.x kernels.
		</listitem>
		<listitem>
If two (or more) directors are set up in failover mode, the
mechanism for changing the VIP from one to another is
broken by making a permanent entry for the VIP on the director
in the arp cache of the router. This is not a problem for a test
setup to demonstrate an LVS but may be a problem in a high
availability environment (a solution may be found in the meantime
too).
		</listitem>
	</orderedlist>
	<para>
The normal method for changing directors
(<emphasis>e.g.</emphasis> with heartbeat) includes
a gratuitous arp. To force a gratuitous arp:
	</para>
	<para>
Julian
	</para>
	<blockquote>
		<para>
You can use Yuri Volobuev's send_arp.c from the 'fake' package or
Alexey Kuznetsov's arping from its iputils package:
		</para>
		<itemizedlist>
			<listitem>
fake - http://vergenet.net/linux/fake/
			</listitem>
			<listitem>
iputils - ftp://ftp.inr.ac.ru/ip-routing/iputils-ss991024.tar.gz
iputils is also used for IPAT, IP address takeover
			</listitem>
		</itemizedlist>
	</blockquote>
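	<para>
As a sketch of the second option, assuming the iputils
<command>arping</command> (where <command>-U</command> sends unsolicited
ARP to update the neighbours' arp caches; check <command>man arping</command>
on your system, since other arping implementations differ). The function
name <command>garp_cmd</command>, the VIP and the interface are
assumptions for illustration. The function only prints the command.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: announce that the VIP now lives on this director.
# VIP and DEV are assumptions; set them for your own network.
VIP=${VIP:-192.168.1.110}
DEV=${DEV:-eth0}

# Print the arping command for VIP $1 on interface $2.
#   -U  unsolicited (gratuitous) ARP, updates the neighbours' arp caches
#   -c  number of packets to send
#   -I  interface to send from
garp_cmd() {
    echo "arping -U -c 3 -I $2 $1"
}

# Usage on the new active director (as root), after the VIP is up:
#   eval "$(garp_cmd "$VIP" "$DEV")"
]]></programlisting>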
	<para>
If you're not sure if the network knows that the VIP has moved, try this.
	</para>
	<para>
Graeme Fowler <emphasis>graeme (at) graemef (dot) net</emphasis> 13 Mar 2006
	</para>
	<para>
At failover, make the new live director run something along the lines of:
	</para>
<programlisting><![CDATA[
/sbin/ping -c5 -I $VIP $GW_IP
]]></programlisting>
	<para>
Where $GW_IP is the IP address of your upstream router. It's not exactly 
gratuitous ARP but it does, in my experience, help to rapidly converge 
the systems which currently don't talk to each other.
	</para>
	<para>
Also make absolutely sure that the VIP is being torn down on the 
failed director. If it isn't, and it still ARPs for it, you'll end up in 
all sorts of problems.
	</para>
	<para>
To monitor this you could feasibly run <command>arpwatch</command> 
on both the directors' upstream interfaces. 
You should see the VIP flip-flop on failover. If 
you see it repeatedly flip-flop at regular intervals, you're not tearing 
down properly.
	</para>
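	<para>
If <command>arpwatch</command> isn't available, a rough substitute is
to sample the arp cache entry for the VIP on some other machine and
count how often the MAC address changes. This is only a sketch and not
what arpwatch does internally; the function name
<command>count_flips</command> and the log file name are hypothetical.
	</para>
<programlisting><![CDATA[
#!/bin/sh
# Sketch: count how often the MAC address changes in a list of
# observations (one MAC per line on stdin, oldest first).
count_flips() {
    awk 'NR > 1 && $1 != prev { flips++ }
         { prev = $1 }
         END { print flips + 0 }'
}

# Usage on a client or monitoring host, with VIP set to your VIP:
#   while sleep 10; do
#       ping -c 1 "$VIP" >/dev/null
#       arp -an | awk -v ip="($VIP)" '$2 == ip { print $4 }'
#   done > vip_macs.log
# and later:
#   count_flips < vip_macs.log
]]></programlisting>
	<para>
One change at failover is expected. A count that keeps growing suggests
that the failed director is still holding the VIP.
	</para>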
	<para>
Joe Dec 2003
	</para>
	<para>
There is also <ulink url="http://www.vergenet.net/~acassen/software/garp-0.1.1.tar.gz">
http://www.vergenet.net/~acassen/software/garp-0.1.1.tar.gz</ulink>
which has been available for over a year, without me even knowing about it.
	</para>
	<para>
Here's some tests I did
	</para>
<programlisting><![CDATA[
LVS equipment: 2.2.13 client, and 0.9.4/2.2.13 director.
2 realservers
a) 2.0.36 kernel, libc5, gcc-2.7.2.3, net-tools 1.42.
b) 2.2.13 kernel, glibc, gcc-2.95,    net-tools 1.52
]]></programlisting>
	<para>
Experiment 1: Result - arp'ing is independent of [-]arp
	</para>
	<para>
Summary: the -arp/+arp option for ifconfig had no effect
on any devices back to 2.0.36 kernels with net-tools 1.42.
If a device normally arps, then -arp had no effect; if it normally
doesn't arp, then "arp" doesn't turn it on (data below).
	</para>
	<para>
Method:
IP=192.168.1.1/24 with VIP=192.168.1.110/32. The VIP was on
dummy0. The test was to see if the VIP was pingable from
another (external) machine on the 192.168.1.0/24 network
or pingable from the machine itself (ie internally from
the console). (I assume I had a route add -host for the
VIP although I didn't record this). The test was done with
ifconfig using arp or -arp (the output of ifconfig -a
didn't change)
	</para>
<programlisting><![CDATA[
                 -----2.0.36------- -----2.2.13------
ping from        internal  external internal external
VIP device
dummy   ARP        +         -        +        +
        NOARP      +         -        +        +
        down       -         -        -        - (control)
]]></programlisting>
	<para>
Experiment 2: Can the VIP be on a separate NIC?
	</para>
	<para>
Summary: yes, as long as the NIC doesn't have a cable
plugged into it.
	</para>
	<para>
Method:
same as above except VIP on eth1 (another NIC).
	</para>
<programlisting><![CDATA[
                 -----2.0.36-------
ping from        internal  external
VIP device
eth1 has cable connected to 192.168.1.0 network
eth1    ARP        +         +
        NOARP      +         +

eth1 cable to network removed
eth1    ARP        +         -
        NOARP      +         -
        works as realserver in LVS - yes
]]></programlisting>
	<para>
One of the reasons a no_arp interface is used on the
realserver is that it is not visible to the rest of the
network. Does the LVS work if the eth1 VIP on the realserver
is not visible to the rest of the network?
	</para>
	<para>
Conclusion: for 2.0.36, dummy0 doesn't arp and eth1 does arp.
The arp/-arp option to ifconfig has no effect on arp behaviour.
LVS works with both dummy0 and eth1, I assume since VIP need
only be resolved as local on the realserver and does not
need to be visible to the network.
	</para>
	<para>
Experiment 3: What devices and netmasks are necessary for
a working LVS?
	</para>
	<para>
Using the /etc/ethers approach for setting the MAC address of the
VIP, I then set up an LVS with a pair of realservers serving telnet.
All IPs are 192.168.1.x, all machines have a route to 192.168.1.0
via eth0. There is no default route.
	</para>
<programlisting><![CDATA[
1. 2.0.36, libc5, gcc 2.7.2.3, net-tools 1.42
2. 2.2.13, glibc-2.1.2, gcc-2.95, net-tools 1.52
]]></programlisting>
	<para>
with the following devices holding the VIP, tunl0, eth0:1, lo:0, dummy0,
eth1. In each case there was no route entry for the VIP device and
there was no cable connected to eth1 when it was used for the VIP.
The table below shows whether the LVS worked. The VIP is installed with
	</para>
<programlisting><![CDATA[
# either (normal LVS setup)
NETMASK="255.255.255.255"; BROADCAST="192.168.1.110"
# or (not normally used for LVS)
#NETMASK="255.255.255.0";  BROADCAST="192.168.1.255"
ifconfig $DEVICE 192.168.1.110 netmask $NETMASK broadcast $BROADCAST
]]></programlisting>
	<para>
the results belong to 1 of 3 groups
	</para>
<programlisting><![CDATA[
+ works fine
- doesn't work
  (at $ prompt on client get
  "unable to connect to remote host.  Protocol not available"
  then client returns to regular unix $ prompt)
hang - client hangs, realserver cannot access network anymore,
  have to run rc.inet1 from console prompt on realserver to
  start network again.
]]></programlisting>
	<para>
netmask of VIP=255.255.255.255 (normal LVS setup)
	</para>
<programlisting><![CDATA[
LVS type  -----VS-Tun------     ----VS-DR------
kernel    2.0.36     2.2.13     2.0.36   2.2.13

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           -         +         +
dummy0     +           -         +         +
eth1       +           -         +         +
]]></programlisting>
	<para>
netmask of VIP=255.255.255.0 (not normally used for LVS)
	</para>
<programlisting><![CDATA[
VIP on
tunl0      +