6. LVS: The ARP Problem

6.1. The problem

If you follow the instructions and set up the examples in the LVS-mini-HOWTO, then you don't need to know about the arp problem. If you're going to set up grander LVS's, then you'll need to understand the arp problem.

Although this section comes early in the HOWTO, it has lots of pitfalls. You shouldn't be reading this unless you've at least set up a working LVS-NAT and LVS-DR LVS using the canned instructions in the LVS-mini-HOWTO.

The LVS allows several machines to function as one machine. For LVS-DR and LVS-Tun, some trickery was needed to split the various handshakes involved in establishing and maintaining a tcpip connection, so that some parts of the handshake come from one machine and other parts from another machine. The worst problem, which ironically only happens with realservers running Linux (2.2 and later kernels), is the "arp problem". It's just as well we have the source code :-(.

With LVS-DR and LVS-Tun, all the machines (director, realservers) in the LVS have an extra IP, the VIP. Here's a LVS-DR in a test setup where all machines and IPs are on the same physical network (i.e. are using the same link layer and can hear each other's broadcasts).


                      ________
                     |        |
                     | client |
                     |________|
                         |
                         |
                      (router)
                         |
                         |
                         |       __________
                         |  DIP |          |
                         |------| director |
                         |  VIP |__________|
                         |
                         |
                         |
       ------------------------------------
       |                 |                |
       |                 |                |
   RIP1, VIP         RIP2, VIP        RIP3, VIP
 ______________    ______________    ______________
|              |  |              |  |              |
| realserver1  |  | realserver2  |  | realserver3  |
|______________|  |______________|  |______________|


When the client requests a connection to the VIP, it must connect to the VIP on the director and not to the VIP on the realservers.

The director acts as a layer-4 IP router, accepting packets destined for the VIP and then sending them on to a realserver (where the real work is done and a reply is generated). For the LVS to function, when the client (or if present, the router) puts out the arp request "who has VIP, tell client", the client/router must receive the MAC address of the director (and not the MAC address of one of the realservers). After receiving the arp reply, the client will send the connect request to the director. (The director will update its ipvsadm tables to keep track of the connections that it's in charge of and then forward the connect request packet to the chosen realserver).

If instead the client gets the MAC address of one of the realservers, then the packets will be sent directly to that realserver, bypassing the LVS action of the director. If nothing is done to direct the router's arp requests for the VIP to the director (arp bouncing), then in some setups, one particular realserver's MAC address will be in the client/router's arp table for the VIP and the client will only see one realserver. If the client's packets are consistently sent to the same realserver, then the client will have a normal session connected to that realserver. You can't count on this happening: in the middle of a tcpip session, the client/router might get the MAC address of another realserver as a result of an arp request, and the client will start getting packets for connections it knows nothing about (and the realserver will send tcp resets). (In my setup, the machine with the fastest CPU is in the client's arp table, suggesting that it's the first machine to reply that gets in. Horms and Stephen Williams have written that they think it's the last machine to reply whose entry is in the client's arp table.) In other setups where the realservers are identical, the client will connect to different realservers each time the arp cache times out (see comment by Stephen Williams elsewhere). If the director always gets its MAC address in the router's arp table, then the LVS will work without any changes to the realservers (as happened in my case with a director with the fastest CPU in the LVS), although this is not a reliable solution for production.

Getting the MAC address of the VIP on the director (instead of the MAC address of the VIP on the realservers) to the client when the client/router does an arp request is the key to solving the "arp problem".
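One way to see which machine currently owns the VIP from the client's (or router's) point of view is to compare the cached MAC address with the director's known MAC address. The following is only a sketch: it assumes the line format printed by the iproute2 command `ip neigh show` (IP dev IFACE lladdr MAC STATE), and the function names are made up for illustration.

```shell
#!/bin/sh
# Print the MAC address cached for IP $1, reading "ip neigh show" style
# lines on stdin (assumed format: IP dev IFACE lladdr MAC STATE).
mac_for_ip() {
    awk -v ip="$1" '
        $1 == ip {
            for (i = 2; i < NF; i++)
                if ($i == "lladdr") { print tolower($(i + 1)); exit }
        }'
}

# Succeed only if the neighbour table on stdin maps VIP $1 to the
# director MAC $2 (the comparison ignores case).
vip_points_at_director() {
    want=$(printf '%s' "$2" | tr 'A-F' 'a-f')
    got=$(mac_for_ip "$1")
    [ -n "$got" ] && [ "$got" = "$want" ]
}
```

On the client you might run `ip neigh show | vip_points_at_director <VIP> <director MAC>` after pinging the VIP; a failure means some other machine (probably a realserver) answered the arp request.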

The traditional ways of handling the arp problem (as explained here) all require fiddling with the settings of the VIP on the realservers. The assumption in the early days of LVS was that you wouldn't have access to the router (this being under the control of the IT department or your ISP and you would have to go through a lot of bureaucracy to change the settings on the router). However if you're paying good money to an ISP to house your LVS, or your inhouse LVS is doing something useful for your establishment, then you should have no trouble in having the router set up the way you want.

If you have access to the router (or can put one in front of your LVS - a low power linux box is just fine) and you can set it to route packets for the VIP only to the director(s) and not to the realservers, or you can use the arp filtering tools of iptables, and you understand what's been said above, then you've handled the arp problem and need read no further.

For those who don't have access to the router, or who want to setup an LVS on one network, then read on...

The arp problem does not arise with Linux 2.0.x kernels, because devices which don't reply to arp requests (dummy0, tunl0, lo:0) are available on the realserver. For other OS's, the NOARP flag for ifconfig stops the VIP on the realservers from replying to arp requests.

However with 2.2.x (and later) kernels, the devices which didn't reply to arp requests in 2.0.x, now reply to arp requests. There is a "-arp" (NOARP) option for ifconfig which (according to the man pages) turns off replies to arp requests for that device, and an "arp" option which turns them back on again. Linux does not always honour this flag. You couldn't turn on replies to arp requests for the dummy0 devices in 2.0.36 kernels and you can't turn it off for tunl0 in 2.2.x kernels. eth0 behaves properly in 2.0.36 but in 2.2.x kernels it arps even when you tell it not to arp. This behaviour of not honouring the NOARP flag in the Linux 2.2.x kernels is not regarded as a "problem" by those writing the Linux TCPIP code and is not going to be "fixed".

Julian 22 May 2001

The flag is used to allow arp requests for the specified device. Although "lo" doesn't reply to arp requests, the requests for the VIP go through eth*, and so the NOARP flag is of no help to us. We can't drop the flag for eth.

Another wrinkle is that in 2.0.36 kernels, aliased devices (eg eth0:1) could be set up independently of the options on the primary (eth0) device. Thus eth0:1 behaved as if it were on a separate NIC and its arp'ing behaviour could be set independently of the primary interface. The settings of an aliased device belonged to the IP. With the 2.2.x kernels, the aliased devices are now just alternate names for each other: if you change an option (eg -arp) or the up/down state of one alias (or the primary), the other aliases follow. With 2.2.x kernels, the settings of the aliased device belong to the primary device (there is only one device with several IPs).

When LVS was running on 2.0.36 machines, the VIP was usually configured as an alias (eg lo:0, tunl0) on the main ethernet device (eth0), allowing the nodes in an LVS to have only one NIC.

With 2.2.x kernels, care is needed when only one NIC is used on the realserver (the usual case). On a realserver with only one NIC, where eth0 carries the RIP, eth0 must reply to arp requests (to receive packets), so eth0:1 carrying the VIP will reply to arp requests too, even if you ifconfig it with -arp. Thus if a realserver is running a 2.2.x kernel and has the VIP on an ip_alias, then the VIP on the realserver will reply to arp requests received from the router.

The use of ip_aliases is still allowed, but requires a "label" to be recognised by the new Policy Routing tools (iproute2 and ip_tables). The "label"ed IPs are now secondary IPs.

For 2.2.x kernels and beyond the commands ifconfig and route should only be used with single NIC leaf nodes. You can still use them to set up a simple LVS, but for anything more complicated you will need to start using the iproute2 commands.

6.2. Put the VIP on the realserver's lo device

In the early days (2.0.x, 2.2.x) I seemed to be able to put the VIP on any device I liked. I don't know whether this is still possible with the newer kernels, but people have not been able to get their LVSes to work with arp_filter and arp_ignore unless they put the VIP on the realserver's lo device.
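With the iproute2 tools, putting the VIP on lo might look like the sketch below. It only prints the command so that you can review it before running it as root. The /32 mask follows the usual LVS practice of giving the VIP on lo a host netmask, and the label lo:0 is an assumption, kept only so that ifconfig still shows the address.

```shell
#!/bin/sh
# Print (do not run) the iproute2 command that puts VIP $1 on lo.
# $2 is an optional label; "lo:0" is an illustrative default.
vip_on_lo_cmd() {
    vip=$1
    label=${2:-lo:0}
    printf 'ip addr add %s/32 dev lo label %s\n' "$vip" "$label"
}
```

For example `vip_on_lo_cmd 192.168.1.110` prints the line to run on each realserver (the address is an example only).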

6.3. The Cure(s)

Several cures have been produced in an attempt to solve the arp problem. They involve either

  • stopping the realservers from replying to arp requests for the VIP.
  • hiding the VIP on the realservers so that they don't see the arp requests.
  • priming the client/router in front of the director with the correct MAC address for the VIP.
  • allowing the realserver to accept a packet with dst=VIP even though the realserver does not have a device with this IP (i.e. the host has nothing to reply to an arp request). This is implemented by Transparent Proxy (Horms' method) or firewall mark (fwmark). For transparent proxy on the realserver, the director forwards the packets to the realserver's MAC address, so you don't need to route the packets yourself. For fwmark, you need to understand routing to a director without a VIP. There may be performance problems with transparent proxy at high packet rates. No one has tested 2.6 arp_announce against transparent proxy at high throughput.
  • stopping arp requests for the VIP getting to the realservers.

Note: For the 2.2 and 2.4 kernels, most of these cures involve applying a kernel patch to the realserver. The realserver patch is unrelated to the ipvs patch applied to the director.

horms 4 Aug 2005

For the record, the ARP problem is not about honoring the -arp flag or not. The problem lies in whether or not the OS regards the IP address as belonging to the interface, or as belonging to the host. Both are valid. Linux adopts the latter, which turns out to work really well in most situations. LVS is a rare case where it doesn't. This has been painful in the past, but since arp_ignore and arp_announce were added, it's quite easy now.

The following list of cures is a little confusing. If you're not using routing to stop packets for the VIP arriving at the realservers, then you'll have to stop the realservers replying to arp requests for the VIP. In this case you'll do one of the following on the realservers (Mar 2005, with help from Horms)

  • the original method: use Julian's hidden patch on the realserver. You set the VIP on lo and then "hide" it. This method has been around since the arp problem first arose and has been well tested. For more on the hidden patch see Julian's page. This code is still being maintained, so if your setup scripts are for the hidden patch, you can continue to use it. Otherwise, for new installations, you should use arp_announce.

  • the next method: Maurizio's noarp module. This has the advantage that it does not require any patching of the realserver's kernel, is simple to setup and is the preferred method for many people. It has another advantage that you control the arp behaviour for the IP and not for the device.

  • the new way arp_announce: see arp_announce (http://www.ssi.bg/~ja/#arp_announce) which sets arp_ignore and arp_announce on the arping interfaces. This typically means eth0, but if you have eth1 as well, you need to set it there too. (If you have multiple NICs, eth0..ethn, you only need to fix the NIC that hears the arp requests.) Setting these parameters on lo has no effect as far as I understand from testing, reading the code and reading correspondence from Julian, i.e. you aren't interested in these settings on lo.

    Note
    Make sure you don't bring up the ethernet device (say at bootup) before arp_ignore/arp_announce have been set up, or you will get a round of arp broadcasts from the NIC.
    # ipvs settings for realservers:
    net.ipv4.conf.lo.arp_ignore = 1
    net.ipv4.conf.lo.arp_announce = 2
    

    Horms

    If the VIP is on eth0, and you don't want it advertised over ARP on eth1, then set:

    net.ipv4.conf.eth1.arp_ignore = 1
    net.ipv4.conf.eth1.arp_announce = 2
    

    This is different to the hidden approach where you put the VIP on lo and then hide lo.

    The arp_ignore approach has theoretical and aesthetic advantages.
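As a sketch of the arp_ignore/arp_announce method above, the helper below prints sysctl.conf lines for each interface that hears arp requests. The function name is made up, and the interface names passed to it are whatever your realserver actually uses; nothing is written to the system.

```shell
#!/bin/sh
# Print sysctl.conf lines that set arp_ignore=1 and arp_announce=2 on
# every interface named on the command line (the arping interfaces).
arp_sysctl_lines() {
    for dev in "$@"; do
        printf 'net.ipv4.conf.%s.arp_ignore = 1\n' "$dev"
        printf 'net.ipv4.conf.%s.arp_announce = 2\n' "$dev"
    done
}
```

You could append the output of `arp_sysctl_lines eth0 eth1` to /etc/sysctl.conf and load it with `sysctl -p` before the interfaces are brought up, in keeping with the note above.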

6.4. The Cure: 2.0 kernels - nothing needed

There is no arp problem with 2.0 kernels on the realservers. On the realservers, configure the VIP on the lo device with the -arp (NOARP) option as you would with any other Unix.

6.5. The Cure: 2.2.x kernels - many options

The preferred method is "hidden"

6.5.1. The hidden patches

The "hidden" patches for kernel >=2.2.14 are now in the standard linux distribution (i.e. you can use the "hidden" feature with a standard kernel and don't have to patch the kernel on the realserver). The arp patches allow you to hide a device from arp requests, allowing the realserver to function in an LVS.

Note
The hidden patch hides the device (here the lo) (and any IPs that are on it). The -noarp flag in 2.0 kernels affects only the ip_alias (and not other IPs on the same device). These are different methods, but both stop the router/client from getting arp replies from the realserver for the VIP.

To hide devices from arp calls, on the realservers do

       #to activate the hidden feature
       echo 1 > /proc/sys/net/ipv4/conf/all/hidden
       #to make lo:0 not arp, put lo here
       echo 1 > /proc/sys/net/ipv4/conf/<interface_name>/hidden

then test that you've hidden the VIP (testing for arp).
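A rough way to test this is to watch arp replies on the client or router and list which MAC addresses answer for the VIP. The sketch below reads tcpdump text output on stdin. It relies only on the tokens "IP is-at MAC" appearing in reply lines, which is an assumption about your tcpdump version, and the function name is hypothetical.

```shell
#!/bin/sh
# List the distinct MAC addresses that answered arp for IP $1.
# Reads tcpdump text output on stdin; only the tokens "IP is-at MAC"
# are used, the rest of each line is ignored.
arp_responders() {
    awk -v ip="$1" '
        {
            for (i = 2; i < NF; i++)
                if ($i == "is-at" && $(i - 1) == ip) {
                    mac = tolower($(i + 1))
                    sub(/,$/, "", mac)
                    if (!(mac in seen)) { seen[mac] = 1; print mac }
                }
        }'
}
```

Something like `tcpdump -l -n -i eth0 arp | arp_responders <VIP>` while another window pings the VIP should then show only the director's MAC address once the realservers are hidden.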

There is a possible race condition in hiding the VIP -

Kyle Sparger, 15 Feb 2001

I've found an interesting, but not totally unexpected race condition under DR in 2.2.x that I've managed to create when installing VIP's on a machine in DR mode. Basically, the cause is this:

ifconfig dummy0 10.0.1.15
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden

You'll notice that there's going to be a small gap between the two which allows an ARP request to come in, and for the server to reply. And yes, it is big enough to be bitten by -- I've been bitten twice by it so far :)

Julian

On boot:

echo 1 > /proc/sys/net/ipv4/conf/all/hidden
# For each hidden interface (dummy, lo, tunl):
modprobe dummy0
ifconfig dummy0 0.0.0.0 up
echo 1 > /proc/sys/net/ipv4/conf/dummy0/hidden
# Now set any other IP address

Kyle's suggestion

echo 1 > /proc/sys/net/ipv4/conf/default/hidden
ifconfig dummy0 10.0.1.15
echo 0 > /proc/sys/net/ipv4/conf/default/hidden

The echo 0 command is in case I want to configure other interfaces later that I _do_ want responding to ARP requests. Technically, it's not necessary, I just find it useful in my particular setup.
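The ordering in Julian's boot sequence (hide first, assign the VIP last) can be captured in a small helper. This is a sketch only: it prints the commands rather than running them, the function name is made up, and the /proc entries exist only on kernels that have the hidden feature.

```shell
#!/bin/sh
# Print, in a race-free order, the commands that hide device $1 and
# only then give it the VIP $2. Nothing is executed.
hidden_setup_cmds() {
    dev=$1
    vip=$2
    printf 'echo 1 > /proc/sys/net/ipv4/conf/all/hidden\n'
    printf 'ifconfig %s 0.0.0.0 up\n' "$dev"
    printf 'echo 1 > /proc/sys/net/ipv4/conf/%s/hidden\n' "$dev"
    printf 'ifconfig %s %s\n' "$dev" "$vip"
}
```

Because the device is already hidden when the last command assigns the address, there is no window in which the realserver can answer an arp request for the VIP.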

6.5.2. The Cure: Older 2.2 kernels (<2.2.12)

These are old and it would be better to upgrade (you won't get much help on the mailing list for these). However if you have them, you apply the arp patches to the kernel code of the 2.2.x realservers. These patches are separate from the ipvs patch applied to the kernel on the director.

For kernels <2.2.12, Julian's patch is on the lvs website.

http://www.linuxvirtualserver.org/arp_invisible-2213-2.diff

The patch by Stephen Williams is at

http://www.linuxvirtualserver.org/sdw_fullarpfix.patch

This patch is against a 2.2.5 kernel but can be applied to later kernels (tested to 2.2.13). The file appears to have DOS carriage control. Depending what you get on your disk, you may have to convert the file to unix carriage control (with `tr -d '\015'`) (the unix line extension of '\' doesn't work in combination with DOS carriage control).

The whitespace may not match your file so do

$ cd /usr/src/linux
$ patch -p1 -l < sdw_fullarpfix.patch
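If you are not sure whether your copy of the patch has DOS line endings, helpers along these lines can check and convert it before patching. This is a sketch with made-up function names; it uses only the `tr -d '\015'` conversion mentioned above.

```shell
#!/bin/sh
# Copy stdin to stdout with DOS carriage returns (octal 015) removed.
strip_cr() {
    tr -d '\015'
}

# Succeed if the file $1 contains at least one carriage return.
has_cr() {
    [ $(tr -cd '\015' < "$1" | wc -c) -gt 0 ]
}
```

For instance `has_cr sdw_fullarpfix.patch && strip_cr < sdw_fullarpfix.patch > sdw_fullarpfix.unix.patch`. With GNU patch you can then add `--dry-run` to the patch command to see whether the converted file applies before touching the source tree.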

If you are using martian modification you will need the forward_shared-hidden patch as well (needed only on the director, but can be applied to both director and realservers).

6.5.3. Put an extra NIC (eth1) on the realserver to carry the VIP

Possible cards would be a discarded ISA card (WD80x3), or a cheap 100Mbit PCI card (eg Netgear FA310TX, $16 in USA in Nov 99). There is no traffic going through this NIC and it doesn't matter that it's an old slow card. The extra card is only required so that the realserver can have the VIP on the machine. With 2.2.x kernels you can't stop this device (eth1) from replying to arp requests, but if you don't connect the cable to it or don't put a route to it in the realserver's routing table, then the client won't be able to send it an arp request.

To set this up with the configure script, enter eth1 as the device for the VIP on the realserver.

Note

Apparently, the 2nd NIC doesn't handle arp problem for 2.4 kernels.

I tested the 2 NIC method of handling the arp problem with kernel 2.2.13. I haven't tried it with 2.4 kernels, but apparently it doesn't work. Julian and Ratz think it shouldn't work with 2.2.x kernels, but I haven't revisited the matter to see why we have come to different conclusions.

6.6. The Cure: 2.4.x kernels - arp_ignore/arp_announce

The current (kernels starting 2.4.26) accepted method is arp_ignore/arp_announce.

There are several ways of handling the arp problem for 2.4.x kernels. They all work, but some of them have been around longer and so have been used more and people on the mailing list are more familiar with them.

6.6.1. 2.4 Hidden Patch

Julian's hidden patch has been around the longest and is well tested. Although included in the standard 2.2.x kernel, it is not being included in the 2.4.x kernels. You'll have to patch the kernel on the realservers. The preferred method for new installations is arp_ignore/arp_announce.

For early 2.4.x kernels (eg x=0), the patch is available at http://www.linuxvirtualserver.org/hidden-2.3.41-1.diff. (This patches a part of the kernel that isn't being actively fiddled with, so hopefully the patch will work against later 2.4.x kernels too.)

The 2.4.x "hidden" patch is included in ipvs-x.x.x/contrib/patches/hidden-x.x.x.diff

Assuming you are patching 2.4.2 with the ipvs-0.2.5 files

cd /usr/src/linux
patch -p1 <../ipvs-0.2.5/contrib/patches/hidden-2.4.2-1.diff

Then build the kernel (can use same options as for the 2.4 director kernel build).

You activate the hidden feature as for 2.2 (see hidden).

As to why the hidden patch is in the 2.2 kernels but not the 2.4 kernels, see the mailing list archives.

6.6.2. 2.4 arp_announce

The 2.6.x arp_announce, arp_ignore code has been back ported to 2.4.26 (and later) kernels.
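Rather than comparing kernel version numbers, a realserver setup script can simply check whether the running kernel exposes the two flags. The sketch below does that; the function name and the optional argument (an alternative root directory, handy for testing) are illustrative assumptions.

```shell
#!/bin/sh
# Succeed if the running kernel exposes both arp_ignore and arp_announce.
# $1 optionally overrides the /proc/sys root.
has_arp_flags() {
    root=${1:-/proc/sys}
    [ -e "$root/net/ipv4/conf/all/arp_ignore" ] &&
        [ -e "$root/net/ipv4/conf/all/arp_announce" ]
}
```

A script could then fall back to one of the other cures (hidden, noarp) when `has_arp_flags` fails.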

6.6.3. arp filtering

Julian has written an extension to the iproute2 tools, which filters arp packets. You can use this to handle the arp problem. See Julian's software page for more details. This method does not require patching of the 2.4 kernel on the realserver.

Note
Julian's arp filtering is not arptables.

Joe

Is arptables the extension to iptables that you wrote a while ago? arptables seems pretty simple. What are the problems with arptables that you've written arp_ignore and keep maintaining the hidden patch?

Julian 11 Jul 2004

Almost true, I'm not the arptables author. You're referring to the arprules/iparp functionality which is based on ip, not on iptables. Similar names.

At that time there was no user space tool for the arptables changes in kernel (done by David Miller), now there is such tool (I didn't tried it), so the list of options to hide addresses in clusters is extended.

arp_ignore was born day(s) after arp_announce. Both flags are easy to set default policy for playing with ARP requests and replies which was needed for years for stuff like interoperability with other ARP stacks (mostly for controlling the source address selection in ARP requests with arp_announce) or for hiding of addresses for cluster setups.

6.6.4. Maurizio Sartori's noarp module

Maurizio Sartori masar (at) masarlabs (dot) com 28 Nov 2002

On my site is a simple kernel module for Linux 2.4.x to solve the ARP Problem. You don't have to patch the kernel but only to compile, install and configure the 'noarp' module, to use your loopback interface filtering its arp reply. I've tested it on Debian 'Sarge' and RedHat 8.

Note
Maurizio later produced a patch for 2.6.

Sebastien Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 04 Jun 2003

Nobody seems to recall what a smart Italian guy named Maurizio Sartori did. Instead of the hidden patch, which requires a full kernel build, he's written a *module* called noarp, way more handy, as

  1. it requires only a one module build, doesn't require a kernel build, takes about 1 minute to install and get working.
  2. it allows hiding IPs, not interfaces.

I'm using it in production and it works perfectly.

Joe

Can you hide the VIP on eth0:x and not hide the RIP on eth0? (I should know this, but I don't)

Jan Abraham jan_abraham (at) gmx (dot) net 31 Oct 2003

Yes, you can :) I used Maurizio Sartori's noarp module, suggested in your HOWTO in chapter 4.5.3. It can be controlled by IP, not by interface.

Ratz 17 Dec 2004

Julian's arp_ignore is the way to go, portable and ready for upgrades ;). Nothing against Maurizio by all means, but after years of fighting with the netdev's Julian finally convinced the high priest of Linux networking to solve the arp problem once and for eternity. If you read the accompanying documentation on the arp_* sysctls you can pretty much figure out that nothing is impossible anymore ;).

Joe - I would have been happy if they'd left the arp behaviour as it was originally and as it is in all the other unices (except HPUX).

Todd Lyons tlyons (at) ivenue (dot) com 03 Feb 2005

Hi guys (and masar), I am trying to use your noarp module, but am hitting the limit of 16 entries. I need it to work for (at the moment) an additional 10 entries. I see in noarp.h:

#define NOARP_MAX_IP            (16)

Is it going to create problems to pick this number up to 32 or 64? I've already done it and it seems to handle the problem. I don't want to create any memory leaks or overruns. Your code looks like it allocates memory based on that NOARP_MAX_IP, but my c is not good enough to know for sure if that will be a problem. Here's what happens on my system (RH 7.3 with 2.4.20-28.7smp kernel). You can see that it's failing on the 10 additional IP's after the initial 16. Please let me know if I can safely raise that number.

[root@rproxy1a init.d]# /etc/init.d/noarp start
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
/usr/local/sbin/noarpctl: ioctl error: No space left on device
[root@rproxy1a init.d]# /etc/init.d/noarp status
64.14.201.41 10.10.10.160 0 0 0
64.14.201.151 10.10.10.160 0 0 0
64.14.201.161 10.10.10.160 0 0 0
64.14.201.162 10.10.10.160 0 0 0
64.14.201.163 10.10.10.160 0 0 0
64.14.201.164 10.10.10.160 0 0 0
64.14.201.165 10.10.10.160 0 0 0
64.14.201.166 10.10.10.160 0 0 0
64.14.201.167 10.10.10.160 0 0 0
64.14.201.168 10.10.10.160 0 0 0
64.14.201.169 10.10.10.160 0 0 0
64.14.201.175 10.10.10.160 0 0 0
64.14.201.153 10.10.10.160 0 0 0
64.14.201.178 10.10.10.160 0 0 0
64.14.201.170 10.10.10.160 0 0 0
64.14.201.171 10.10.10.160 0 0 0

[root@rproxy1a network-scripts]# ls ifcfg-lo:*
ifcfg-lo:0   ifcfg-lo:13  ifcfg-lo:18  ifcfg-lo:22  ifcfg-lo:4 ifcfg-lo:9
ifcfg-lo:1   ifcfg-lo:14  ifcfg-lo:19  ifcfg-lo:23  ifcfg-lo:5
ifcfg-lo:10  ifcfg-lo:15  ifcfg-lo:2   ifcfg-lo:24  ifcfg-lo:6
ifcfg-lo:11  ifcfg-lo:16  ifcfg-lo:20  ifcfg-lo:25  ifcfg-lo:7
ifcfg-lo:12  ifcfg-lo:17  ifcfg-lo:21  ifcfg-lo:3   ifcfg-lo:8

I have a question about the man page

NOARPCTL COMMANDS
       add    Adds a new Virtual IP to  the  list.  Requires  two
              arguments: the VIP is the address to hide, the Real
              IP (RIP) is a real address of this host to use when
              ARP query are made that would use VIP.

I must be misunderstanding something very basic. I thought you didn't want real servers to arp at all for VIPs, no matter what interface the arp comes in on and no matter what interface is defined with the matching address. The only acceptable arp answer is for the RIP (implying local traffic or traffic that is not desired to be load balanced). But the above man page contradicts my ideas. So I'm a bit confused as to how exactly noarp is working.

Maurizio Sartori masar (at) MasarLabs (dot) com 04 Feb 2005

There should be no problem incrementing NOARP_MAX_IP; all memory is allocated statically. The RIP in the 'add' command of noarpctl is used to suppress the selection of the VIP as the sender IP address in arp requests. If not suppressed, the back-end host's request updates all arp cache entries on the local net for the VIP with the MAC of the back-end host.

A way to generate a request of this type is, from a real server:

#> nc -s $VIP somehost 80
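Pulling the thread above together, a start-up script could generate the `noarpctl add VIP RIP` calls and refuse to continue when there are more VIPs than the module's table holds. This is a sketch: the function name is made up, the argument order follows the man page excerpt quoted above, and the default limit of 16 is the NOARP_MAX_IP value from noarp.h (override it if you rebuilt the module with a larger value).

```shell
#!/bin/sh
# Table size; 16 is the NOARP_MAX_IP value quoted from noarp.h above.
NOARP_MAX_IP=${NOARP_MAX_IP:-16}

# Print one "noarpctl add VIP RIP" line per VIP.
# $1 is the RIP, the remaining arguments are the VIPs to hide.
# Fails without printing commands if the VIPs would not fit.
noarp_add_cmds() {
    rip=$1
    shift
    if [ "$#" -gt "$NOARP_MAX_IP" ]; then
        echo "noarp: $# VIPs exceed NOARP_MAX_IP=$NOARP_MAX_IP" >&2
        return 1
    fi
    for vip in "$@"; do
        printf 'noarpctl add %s %s\n' "$vip" "$rip"
    done
}
```

Checking the count up front avoids ending up with a partly filled table like the one in Todd's output, where the first 16 VIPs were hidden and the remaining ones were not.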

6.6.5. extra NIC doesn't solve arp problem for 2.4 kernel realservers

Jean Paul Piccato j (dot) piccato (at) studenti (dot) to (dot) it

I'm setting up a DR_LVS with a director and two servers... I've to handle the ARP problem so I've put two NIC on the two realservers...

Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002

This works maybe only with Linux 2.0. (Joe: see 2.2 kernels with extra NIC). For 2.2+ you need a specific kind of ARP control. In Linux 2.2+ the operation of adding IP address involves the following 2 steps:

  1. Define a local IP address as a host property - remote hosts can talk to it through any device
  2. Define network link route on the specified device - you can talk with other hosts from this local network only through this device

(1) allows the Linux 2.2+ box to send ARP replies through any device that received the request. Additionally, the user can provide some filtering by setting some device specific values:

/proc/sys/net/ipv4/conf/*/<FLAG>

These are explained in /usr/src/linux/Documentation/networking/ip-sysctl.txt

The LVS setups depend mostly on the FLAGs rp_filter, hidden, arp_filter, send_redirects. (for more info on kernel flags see the section on /proc filesystem flags). On problems, check them after learning what they mean and how they can kill your setup.

By setting rp_filter or arp_filter on some device you can ignore the ARP requests (and the traffic if rp_filter is set) coming from addresses if we don't have a route to these addresses through the mentioned above device.

The send_redirects values must be checked for setups playing with NAT on one physical medium.

Information on using the hidden patch is in hidden.txt

It seems that eth0 reply to the server instead of eth1

Any device can reply if the ARP probe is not filtered. See hidden.txt from the above URL

Michael McConnell michaelm (at) eyeball (dot) com 10 Jun 2002

I currently have a system which has a Tyan 2515 Motherboard. This motherboard features a Dual Intel 82559 NIC.

The problem I face is that when using both ports of this dual interface network card (plugged into the same switch) I find that the second interface is answering arp requests (on rare occasions) that the first interface should be answering. I have used tcpdump and clearly seen eth1 answering arp requests that eth0 should be answering... how odd.... It's rare, but when it happens of course that address is offline. (Note: this only seems to happen on alias IP addresses, it has never happened on the primary interface)

I am using the open source drivers provided with the 2.2.19 kernel, I'm wondering if the drivers provided by Intel would help this problem?

Roberto Nibali ratz (at) tac (dot) ch 11 Jun 2002

The drivers indeed can't make the difference but not because they are the same, but because the driver doesn't have anything to do with the arp/routing issue.

Julian

Classic problem of attaching multiple Linux interfaces to shared medium. You can set arp_filter on all your ARP devices or why not to restrict even the IP protocol by setting rp_filter.

Such answering (of arp requests) can not be never a problem. If the Linux box answers via many interfaces then it is willing to accept traffic through these ifaces. Of course, the achieved failover when attaching two interfaces to same hub is not perfect because the remote LAN boxes will use the alive Linux interface but Linux routing still uses the first interface (even if it is failed on Layer 2) for the used subnet. If your goal is to restrict the talks for each subnet through one interface then you have to use arp_filter=1 but still to use rp_filter=0 to allow cross-subnet talks. One day rp_filter will be aware of the medium_id values for each interface and will allow the Linux box to interconnect multiple hubs securely (and still to use many interfaces to these hubs).

By default Linux replies to ARP probes for any local IP address configured on any device no matter on what device the probe is received. Such probes look like "who-has TARGET tell SENDER". If the probe is answered later we can receive IP traffic from SENDER to TARGET destined to the TARGET's MAC address.

When we have different subnets (network routes) configured on multiple interfaces attached to same hub sometimes we prefer (may be the reader can find good reason for this) the traffic to/from one subnet always to use one interface. In such cases replying through many interfaces is not desired. We have 2 options:

arp_filter:

when

/proc/sys/net/ipv4/conf/DEV/arp_filter is set to 1
or
/proc/sys/net/ipv4/conf/all/arp_filter is set to 1

then the flag will cause any probe received on interface DEV to be dropped if the route from TARGET to SENDER points to different interface. With the usual local networks in table main in the form "from 0/0 to local_net lookup main" we see that the TARGET is ignored. As result, we drop probes received from SENDER that comes from wrong interface. As result, if the route from TARGET to SENDER1 is via DEV1 and from TARGET to SENDER2 is via DEV2, then we will reply only through one device for each of the senders. Of course, the arp_filter relies on the routing and as result the behavior depends on the used ip rules and routes. The above is a simple example for normal local networks. The arp_filter simply checks the route for the reversed addresses. It should point to the input device.

rp_filter:

The rp_filter flag (DEV/rp_filter or all/rp_filter) set to 1 has similar semantic. It has nearly the same function as arp_filter and can control the ARP for the same purposes: symmetric talks (in and out using same device) but it covers the IP traffic too. It is assumed that where ARP is received (replied more exactly) there the IP traffic will be accepted too. It has mostly security function and can defend against IP spoofing. It controls the reverse path protection: we accept traffic from SENDER to TARGET received on DEV only when the reverse path (from TARGET to SENDER) points to the input interface DEV. It is used usually for "external" interfaces.

How you can use it:

ifconfig eth0 192.168.1.2
ifconfig eth1 192.168.2.2

echo 1 > /proc/sys/net/ipv4/conf/eth0/arp_filter
echo 1 > /proc/sys/net/ipv4/conf/eth1/arp_filter
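
The echo commands above are lost at reboot. As a minimal sketch, the same settings can be written as sysctl.conf lines, since each /proc/sys/net/ipv4/conf/DEV/arp_filter file corresponds to the key net.ipv4.conf.DEV.arp_filter. The helper name arp_filter_lines is made up for this example, and eth0/eth1 are just the interfaces used above.

```shell
#!/bin/sh
# Sketch: print sysctl.conf lines that enable arp_filter on the named
# interfaces. arp_filter_lines is a hypothetical helper name; nothing is
# written to /proc or /etc here, the lines are only printed.
arp_filter_lines() {
    for dev in "$@"; do
        # /proc/sys/net/ipv4/conf/DEV/arp_filter <-> net.ipv4.conf.DEV.arp_filter
        printf 'net.ipv4.conf.%s.arp_filter = 1\n' "$dev"
    done
}

arp_filter_lines eth0 eth1
```

The printed lines could then be appended to /etc/sysctl.conf by hand after checking them.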

6.6.6. Put the realservers on a different network to the VIP

Setup routing tables so that the client cannot route to the realserver network (Lars' method). This method requires the director to not forward packets for the VIP (easy to implement if 2 NICS on the director). The reply packets from the realservers return to the client via a different router to the one attached to the director. Thus the director's router cannot send arp requests to the realservers.

6.6.7. On the client(router), route packets with dst_addr=VIP to the director

You can hardwire the MAC address of the director as the MAC address of the VIP. You can do this with

#arp -s lvs.mack.net 00:80:C8:CA:A7:E4

or

arp -f /etc/ethers.

Here is my /etc/ethers file (on the client)

lvs.mack.net 00:80:C8:CA:A7:E4

This requires no extra NICs or patching of realservers. However in a production environment, redundant directors with heartbeat/failover may be required and some method (eg running send-arp) will be needed to change the static arp entry as the failover occurs. If multiple NICs are involved, it is possible that the above instruction will result in a route through the wrong NIC. In this case bring up the NIC of interest first and then run the above command.

Alternately if the router has several NICs, use one for the director and another for the realservers. Route the VIP to the director.

6.6.8. Use transparent proxy to allow the incoming packet to be accepted locally - Horms' method.

See the sections on Transparent Proxy (Horms' method), and its setup for LVS-DR and LVS-Tun. The configure script will set this up for you.

6.7. The Cure: 2.6.x kernels - arp_ignore/arp_announce

6.7.1. 2.6 arp_announce

Julian Anastasov ja (at) ssi (dot) bg 25 Feb 2004

2.4.26 and 2.6.4 come with 2 new device flags for tuning the ARP stack: arp_announce and arp_ignore. All IPVS-like setups can use arp_announce=2 and arp_ignore=1/2/3 to solve the "ARP problem" on realservers with DR/TUN setups. These flags are going to replace the "hidden" functionality which does not work well when directors are changing role between master/slave for a particular VIP. The risk is that other hosts can probe for VIP using unicast packets for which the hidden flag always replies. I'll continue to support the hidden flag for 2.4 and 2.6 to help existing setups but switching to the new device flags (or other solutions) is recommended.

Documentation is in the 2.6 kernel docs (linux/Documentation/networking/ip-sysctl.txt) (here from the 2.6.17 kernel).

arp_announce - INTEGER
	Define different restriction levels for announcing the local
	source IP address from IP packets in ARP requests sent on
	interface:
	0 - (default) Use any local address, configured on any interface
	1 - Try to avoid local addresses that are not in the target's
	subnet for this interface. This mode is useful when target
	hosts reachable via this interface require the source IP
	address in ARP requests to be part of their logical network
	configured on the receiving interface. When we generate the
	request we will check all our subnets that include the
	target IP and will preserve the source address if it is from
	such subnet. If there is no such subnet we select source
	address according to the rules for level 2.
	2 - Always use the best local address for this target.
	In this mode we ignore the source address in the IP packet
	and try to select local address that we prefer for talks with
	the target host. Such local address is selected by looking
	for primary IP addresses on all our subnets on the outgoing
	interface that include the target IP address. If no suitable
	local address is found we select the first local address
	we have on the outgoing interface or on all other interfaces,
	with the hope we will receive reply for our request and
	even sometimes no matter the source IP address we announce.

	The max value from conf/{all,interface}/arp_announce is used.

	Increasing the restriction level gives more chance for
	receiving answer from the resolved target while decreasing
	the level announces more valid sender's information.

arp_ignore - INTEGER
	Define different modes for sending replies in response to
	received ARP requests that resolve local target IP addresses:
	0 - (default): reply for any local target IP address, configured
	on any interface
	1 - reply only if the target IP address is local address
	configured on the incoming interface
	2 - reply only if the target IP address is local address
	configured on the incoming interface and both with the
	sender's IP address are part from same subnet on this interface
	3 - do not reply for local addresses configured with scope host,
	only resolutions for global and link addresses are replied
	4-7 - reserved
	8 - do not reply for all local addresses

	The max value from conf/{all,interface}/arp_ignore is used
	when ARP request is received on the {interface}

On the realservers the VIP will still be on lo (as for the hidden method). If the reply packets to the client are routed through eth0, then the arp announcements/requests are made through eth0 and you will apply the arp_ignore/arp_announce sysctls to eth0, not to lo (you cannot use arp_ignore/arp_announce on lo).

/etc/sysctl.conf
net.ipv4.conf.eth0.arp_ignore = 1
net.ipv4.conf.eth0.arp_announce = 2
net.ipv4.conf.all.arp_ignore = 1
net.ipv4.conf.all.arp_announce = 2

As with all devices that reply to arp requests, you should stop the arp behaviour before bringing up the VIP, or else flush the arp tables on the router before using the LVS.
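
Because the kernel uses the larger of the "all" and per-interface values, it can be handy to check what is actually in effect for an interface. The following is a minimal sketch of that rule as quoted from ip-sysctl.txt above; the names effective_arp_flag, max_of and CONF_ROOT are invented for this example, and CONF_ROOT only exists so the function can be pointed at a test directory instead of /proc.

```shell
#!/bin/sh
# Sketch: compute the arp_ignore/arp_announce value the kernel would use
# for an interface, following the documented rule
# "the max value from conf/{all,interface}/... is used".
# effective_arp_flag, max_of and CONF_ROOT are names invented for this example.
CONF_ROOT=${CONF_ROOT:-/proc/sys/net/ipv4/conf}

# max_of A B: print the larger of two integers
max_of() {
    if [ "$1" -ge "$2" ]; then echo "$1"; else echo "$2"; fi
}

# effective_arp_flag FLAG DEV: e.g. effective_arp_flag arp_ignore eth0
effective_arp_flag() {
    all_val=$(cat "$CONF_ROOT/all/$1") || return 1
    dev_val=$(cat "$CONF_ROOT/$2/$1") || return 1
    max_of "$all_val" "$dev_val"
}
```

With the sysctl.conf settings shown above, effective_arp_flag arp_ignore eth0 would print 1 and effective_arp_flag arp_announce eth0 would print 2.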

Mag2589 Walla Feb 21, 2007

On my realservers I have them set up to listen to the virtual address on eth0:0. I need them to respond to arp on eth0 but to ignore it on eth0:0. To do this, would I enter the following line in my /etc/sysctl.conf file?

net.ipv4.conf.eth0:0.arp_ignore = 8

Horms

In a word: No

arp_ignore works only on physical interfaces. The old eth0:0 notation is a hang-over from the days of ip aliases, where in a round-about way you could establish virtual interfaces (sort of). These days an interface can have 0 or more addresses. The arp_ignore semantics apply to such addresses.

If you really need fine-grained arp control, take a look at arptables, which is kind of like iptables for arp.

6.7.2. noarp v2.6

Masar masar (at) MasarLabs (dot) com 04 Mar 2004

noarp 2.0.0 (http://www.masarlabs.com) is now available. This is the port of noarp to the Linux 2.6.x kernel. For the 2.4.x kernel use noarp 1.x.x. I'm making separate packages of noarp for the two kernels, because the method for producing a module is different. If there is sufficient interest, I may produce a single package for both kernel versions.

6.7.3. arp ignore with Ubuntu

Julien Cornuwel cornuwel (at) gmail (dot) com 29 Sep 2008

I'm trying to set up load balancing with IPVS on two Apache webservers. The loadbalancer and both apache servers are virtual machines running Ubuntu 8.04 server on VMware ESX 3.5.

If the platform has been idle for some time (like when I came back to work this morning), I can point a browser to http://VIP and get pages from server1 or server2 alternatively (I'm using rr during setup). But after about 5 seconds, I get nothing and the browser times out.

Here is my configuration on the load balancer :

ipvsadm -A -t $VIP:80 -s rr
ipvsadm -a -t $VIP:80 -r $RIP1:80 -g
ipvsadm -a -t $VIP:80 -r $RIP2:80 -g

On webservers, I added the following to /etc/sysctl.conf (as suggested on http://www.linuxvirtualserver.org/VS-DRouting.html) :

net.ipv4.conf.all.hidden = 1
net.ipv4.conf.lo.hidden = 1

I rebooted them and then :

ifconfig lo:0 $VIP netmask 255.255.255.255 up

Unless I did something stupid (if so, please tell me), it should work.

echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce

I'm quite sure my problem is not on the loadbalancer but on webservers.

I have one interface and tried the following with no more luck :

net.ipv4.conf.all.arp_ignore=1
net.ipv4.conf.eth0.arp_ignore=1
net.ipv4.conf.all.arp_announce=2
net.ipv4.conf.eth0.arp_announce=2

On the first request, which works, I have an ESTABLISHED line. But after a few seconds, I get dozens of SYN_RECV. OK, so I definitely have an ARP problem with the above configuration. Any idea why the above commands don't work on Ubuntu?

I did a TCP dump on all 3 servers and here is what I see :

  • On webservers, when it works, I see outgoing IP packets with the LB's address as origin. When it doesn't, I see nothing. About once per second, the LB sends ARP requests trying to find both webservers (on their real addresses); I never saw an ARP reply.
  • On the load balancer, I see incoming requests from clients, and some "ICMP host lb not reachable" sent to the client when it doesn't work.

Webservers should reply to ARP requests on their primary addresses, but they don't :(

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 07 Oct 2008

Well I think it's either that you applied "hidden" to eth0 on the webservers, or the LB has the VIP as a primary address. See if the ARPs were going out with VIP as source and if that's the case, try giving the LB a different primary address and make VIP an alias.

Julien

Great! That was it. Now that I have the VIP as an alias on the LB, it works. Note to the documentation team: on Ubuntu 8.04, there is a trap with realservers. If you set the arp_ignore/arp_announce configuration in /etc/sysctl.conf AND set the VIP on lo:0 in /etc/network/interfaces, it seems that the interface is brought up *before* the sysctl settings are applied. You have to set the VIP manually at the end of the boot process.
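
The ordering Julien describes (ARP sysctls first, VIP last) can be sketched as a small helper that prints the commands in the safe order. This is only an illustration: realserver_vip_commands is a hypothetical name, the address 192.168.1.110 is a placeholder VIP, and the flag values are the ones from the sysctl.conf example in 6.7.1.

```shell
#!/bin/sh
# Sketch: emit the realserver setup commands in the order that avoids the
# boot-time trap described above: ARP sysctls first, VIP on lo:0 last.
# realserver_vip_commands is a hypothetical helper; it only prints the
# commands, so the output can be reviewed before being run as root
# (for example from a late boot script).
realserver_vip_commands() {
    vip=$1
    dev=${2:-eth0}
    for scope in all "$dev"; do
        echo "sysctl -w net.ipv4.conf.$scope.arp_ignore=1"
        echo "sysctl -w net.ipv4.conf.$scope.arp_announce=2"
    done
    # Only now bring up the VIP, so it is never answered for in ARP.
    echo "ifconfig lo:0 $vip netmask 255.255.255.255 up"
}

realserver_vip_commands 192.168.1.110 eth0
```

After checking the output, it could be piped to sh as root at the end of the boot process.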

6.8. arptables

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 11 Jul 2004

arptables is a method supported by Red Hat. The package, arptables_jf, is part of Advanced Server, but the src.rpm can be downloaded, rebuilt and used on Workstation since it has the same kernel support. The configuration is pretty straightforward:

# arptables -A IN -d $VIP -j DROP
# arptables -A OUT -s $VIP -j mangle --mangle-ip-s $RIP
# service arptables_jf save
# chkconfig --add arptables_jf
# chkconfig --levels 12345 arptables_jf on

This service will start before the network is brought up. Note that you have to specify an explicit runlevel, since it stupidly won't start in single user by default.

Bandit Lazuli banditlazuli (at) yahoo (dot) com 13 Apr 2006

Our cluster of web frontends periodically exhibited a kind of Fatal Attraction behavior, where one host would suddenly be the recipient of all hits. Attempting to add new hosts to the existing cluster triggered this behavior in a consistent way. With something clear to fix, we installed the latest version of keepalived on the latest RHEL4 kernel.

And lo, nothing changed. Add a new host, it became a Fatal Attractor within 6 minutes of operation (note that this is NOT the thundering herd problem; things were relatively well balanced for a minute or 6).

Stranger yet, ipvsadm on the director revealed that the Attractor was getting NO hits. So it wasn't that the LVS was sending all hits to one machine. You guessed it. The new machine was arping for the shared ip, and connections were coming directly to it. We had arptables set up as follows:

*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
COMMIT

And in desperation, we started arptables at runlevel 1. This didn't help, because the machine wasn't responding to an inbound arp request, but was instead generating its OWN arp request, and broadcasting the response it made to itself. This could be seen with:

tcpdump -i any arp > file

And then pawing through the file for the shared ip (name). So there lies the smoking gun. Arptables was NOT working as advertised. So we added:

-A OUT -d 192.168.0.12 -j mangle --mangle-ip-s 192.168.0.104

This still did not do the trick; apparently arptables implicitly operates on the interface owning the ip (lo:1, in our case), if no interface is specified. That left eth0 leaking arps. Specifying the interface did the trick:

-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104

And here is the whole filter:

*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d 192.168.0.12 -j DROP
-A OUT -s 192.168.0.12 -o eth0 -j mangle --mangle-ip-s 192.168.0.104
COMMIT

arps are now properly squelched, and fatal attractor behavior has vanished. I'm posting this because I longed for google to return such a message in response to many searches.
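
For several realservers, each with its own RIP, the filter above can be generated rather than typed. A minimal sketch, assuming the same rule layout as the working filter in this posting; gen_arptables_filter is a hypothetical helper name, and where the output is saved for the arptables_jf service is left to the reader.

```shell
#!/bin/sh
# Sketch: print an arptables_jf rule file in the format shown above for a
# given VIP, RIP and outgoing interface. gen_arptables_filter is a
# hypothetical helper name; it only prints, it does not load any rules.
gen_arptables_filter() {
    vip=$1
    rip=$2
    dev=$3
    cat <<EOF
*filter
:IN ACCEPT [0:0]
:OUT ACCEPT [0:0]
-A IN -d $vip -j DROP
-A OUT -s $vip -o $dev -j mangle --mangle-ip-s $rip
COMMIT
EOF
}

# Reproduces the filter from the posting above.
gen_arptables_filter 192.168.0.12 192.168.0.104 eth0
```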

6.9. The arp problem is on the realserver's VIP not the RIP

Cali Federico

I've configured an http service on director as below:

Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  194.153.172.249:80 wlc
  -> 194.153.172.222:80             Route   1      0          0

and I've installed the noarp module on the realserver as below:

[root@cautha2 root]# noarpctl list
194.153.172.249 194.153.172.222 12 0 3

The problem I can see is that when I request http://VIP/index.html from my PC (outside the LVS network), I can see the page provided by the realserver 2 or 3 times; after that I receive a "Page Cannot Be Displayed" error. The page remains unreachable for several minutes, after which I have the same behaviour again. Looking at the director's arp table, the HWaddress related to the realserver is "(incomplete)". After I set (using arp -s) the correct realserver MAC address, the LVS works properly.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/19/05

You are noarp'ing your RIP, which is not a good idea. It's just the VIP that needs NOARP on the realserver.

6.10. Testing an interface for replies to arp requests

To test that the VIP on the realservers (here lo:0) is hidden from arp requests: You test from a client machine on the same network segment as the NICs on the realservers. For your sanity, you could try this with one realserver at a time. You do _NOT_ have the director (with its pingable VIP) connected to the network (unplug it).

  • Optional: on each realserver, accumulate a list of the MAC addresses for each NIC.

    realserver: # ping VIP
    realserver: # arp -a	# look for the MAC address of the VIP
    realserver: # ifconfig	# should show the same MAC address
    
  • find the MAC address for the realserver's VIP from the test client.

    client: # ping VIP
    or ping the broadcast address
    client: # ping 192.168.1.255	#for a VIP in the 192.168.1.0/24 network
    then
    client: # arp -a	# look for the MAC address of the VIP on the realserver.
    			# if you have several realservers on-line,
    			# it could be the MAC address of the NIC on any of the realservers
    
  • Hide the lo interface on the realserver (hidden). Before the arp tables expire (15secs - 2mins depending on the OS), ping the VIP again from the test client. The realserver will still reply to the ping, since the MAC address for the VIP will still be in the arp table of the test client.

    client: # ping VIP
    
  • let the arp cache expire (wait 15sec - 2mins) or clear the arp cache of the test client.

    client: # sleep 120
    or
    client: # arp -d VIP	# delete/flush/clear the entry for the VIP
    then
    client: # arp -a	# show that the arp entry for the VIP is gone
    
  • ping the VIP. You should get no reply.

    client: # ping VIP
    
  • Do for all realservers, making sure you get no ping replies for the VIP.

  • On the director (don't connect it to the network yet) find the MAC address of the VIP

    director: # ping VIP		#VIP can be an IP or a resolvable name
    and/or
    director: # ifconfig		#look for MAC address of NIC with VIP
    then
    director: cat /proc/net/arp	#shows list of IP-MAC address pairs
    or
    director: arp -a 		#shows FQDN as well
    
  • Connect the director to the network. Just to be sure, clear the arp entry for the VIP on the test client (arp -d VIP) and ping the VIP again. You should get ping replies. The test client's arp cache should have the MAC address of the director's NIC for the VIP.
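
The "look for the MAC address of the VIP" steps above can be sketched as a lookup against the /proc/net/arp table on the test client. This is a rough illustration, assuming the usual column layout of that file (IP address, HW type, Flags, HW address, Mask, Device); mac_for_ip, same_mac and ARP_TABLE are names made up for the example.

```shell
#!/bin/sh
# Sketch: look up the MAC address recorded for an IP in a table in the
# /proc/net/arp layout (columns: IP address, HW type, Flags, HW address,
# Mask, Device). mac_for_ip, same_mac and ARP_TABLE are hypothetical names.
ARP_TABLE=${ARP_TABLE:-/proc/net/arp}

# mac_for_ip IP: print the first HW address listed for IP;
# return non-zero if the table has no entry for it.
mac_for_ip() {
    awk -v ip="$1" '
        NR > 1 && $1 == ip { print $4; found = 1; exit }
        END { exit !found }
    ' "$ARP_TABLE"
}

# same_mac IP MAC: succeed if the table maps IP to MAC (case-insensitive).
same_mac() {
    got=$(mac_for_ip "$1") || return 1
    [ "$(echo "$got" | tr 'A-F' 'a-f')" = "$(echo "$2" | tr 'A-F' 'a-f')" ]
}
```

In the last step of the test, same_mac VIP followed by the MAC of the director's NIC should succeed on the test client; if it matches a realserver's MAC instead, the VIP is not hidden.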

6.11. Normal machines, Solaris, Novell Server

The arp problem only occurs on Linux with kernels 2.2 and later. All other OS's honor the arp flag (Joe: HPUX does something weird with noarp, but I've forgotten what).

Mark de Vries

I need to do some DR to a Solaris 8 box... Anyone know how to set it to ignore arp requests? So far I have only done DR to linux and windows boxen...

Lasse Karstensen lkarsten (at) hyse (dot) org 21 Apr 2006

At least for Solaris 9, you can just create the file /etc/hostname.lo0:14 (14=some number)

Inside of it you write

"""
plumb 1.2.3.4 -arp netmask 255.255.255.255 up
"""

where 1.2.3.4 is your vip-address. I'm pretty sure this also works in Solaris 8. The Solaris-people here also mention that you can just use addif in /etc/hostname.lo0, if you rather fancy having everything in one file.

Mark de Vries markdv (dot) lvsuser (at) asphyx (dot) net 21 Apr 2006

Ah yes, the -arp option... Yeah, works! Not that it matters, because halfway through the exercise I suddenly realized that the service has always used NAT for a reason; the realservers have the service on different ports than on the VIP... so DR is a no-go... sigh...

And the whole reason we wanted to use dr in this first, or actually second, case was because half way through the first attempt at configuring it as NAT I suddenly realized that wouldn't work because in this case the realserver has an interface in the same network as the VIP is in (so there is a direct route to the client and packets won't get de-natted)... So short of someone pointing me to a source-based-routing-HOWTO-on-Solaris it's just not gonna work out...

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 17 Dec 2008

We recently had a customer using that funny thing called Novell Server.... I couldn't find anything in the LVS manual about Novell Server in DR mode but eventually figured out the following which works great:

add secondary ipaddress <ipaddress> noarp

6.12. problems with switches

There are other places in the network with arp caches, like "smart" switches. These will bite you if you don't know about them.

frederic (dot) buche (at) equant (dot) com 29 Oct 2003

OK Julian, you are right. The problem came from my network-switch, which keeps in memory the MAC address of all machines. So it just relays the arp request to the concerned server, by using a unicast arp request.

Just for a test, I deleted the MAC entry on my switch, then repeated the same test as before ... and the hidden patch works well!

Carlos J. Ramos cjramos (at) genasys (dot) com 15 Dec 2003

We are using an HP Procurve Switch 2124 in a cluster using Heartbeat and Ldirectord as HA and balancing mechanisms. Previously we had similar working setups with a hub in the same location. Everything works fine until we do a takeover on the directors. As the switch documentation says, the switch automatically learns MAC addresses and associates them with its ports, so that although heartbeat moves the IP address, the switch tries to use the same switch port. The situation remains for at least 1 hour... during this time forwarding in the cluster does not work... and the realservers cannot be reached from outside... We are assuming this is an arp caching problem, although we haven't eliminated other possible causes yet.

Is there any way to force the switch to refresh its MAC address table? Is there any Linux tool that sends some kind of packet over the net to force the ARP table to be updated?
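
One commonly used way to make neighbours update their ARP caches after a takeover is to send unsolicited (gratuitous) ARP for the VIP from the new director, which is the idea behind the send-arp approach mentioned in 6.6.7. The following is a minimal sketch, assuming the iputils version of arping is installed (its -U option selects unsolicited ARP mode, -I the interface, -c the packet count). The helper announce_vip_cmd is a made-up name, the address is a placeholder, and whether this is enough to refresh a particular switch's MAC table has not been verified here.

```shell
#!/bin/sh
# Sketch: print an arping command that sends unsolicited (gratuitous) ARP
# for the VIP from the director that has just taken over.
# Assumes the iputils arping (-U unsolicited mode, -I interface, -c count);
# announce_vip_cmd is a hypothetical helper and only prints the command.
announce_vip_cmd() {
    vip=$1
    dev=${2:-eth0}
    count=${3:-3}
    echo "arping -U -I $dev -c $count $vip"
}

announce_vip_cmd 192.168.1.110 eth0
```

The printed command would be run as root on the new active director once it holds the VIP.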

6.13. The ARP problem, the first inklings

History: The ARP behaviour changed between 2.0.x and 2.2.x kernels. Here's the original posting by Wensong and a reply from Alexey Kuznetsov (2.2 tcpip author)

Wensong Zhang wensong (at) iinchina (dot) net 24 Mar 1999

Today I upgraded the kernel to 2.2.3 with tunneling support on one of the realservers, and found a problem: the Linux 2.2.3 tunnel device answers ARP requests, even if I use the NOARP option as follows:

realserver:# ifconfig tunl0 172.26.20.110 -arp netmask 255.255.255.255 broadcast 172.26.20.110

It still answers the ARP requests. This will seriously affect the proper working of virtual server via tunneling. In fact, the tunnel device shouldn't answer ARP requests from the ethernet. I think it is a bug in linux/net/ipv4/ipip.c, which is now a clone of ip_gre.c, not the original tunneling code.

If you are interested, you can test it yourself on kernel 2.2.3: choose a free IP address on your ethernet and configure it on the tunl0 device, then telnet to that IP address from another host. I guess you can. Finally, have a look at ipip.c, maybe you can debug it. :-)

But, what is the IFF_NOARP flag of the tunnel device for?

kuznet (at) ms2 (dot) inr (dot) ac (dot) ru

IFF_NOARP means that ARP is not used by THIS device. On normal IPIP tunnels it does not make much of sense, but may be used for example to turn on/off endpoint reachability detection.

I do not see any reasons to disable answering ARP in such circumstances. Isolation of VPNs on adjacent segments is impossible at routing/arp level, it is just not well-defined behaviour.

If the isolation is made with firewall policy rules, then it is clear that arp policy must be handled at this level too.

In kernel 2.0.x, the tunnel device doesn't answer ARP requests.

Yes.

Yeah, we can have link-local addresses that don't answer ARP requests in kernel 2.2.x. For example, we can configure all the hosts in a network with the following command:

ifconfig lo:0 192.168.0.10 up

There will be no collision. The loopback alias interfaces don't answer ARP requests.

Are you sure? I am not. Please, test.

BTW you risk adding non-loopback addresses on loopback device. They have the HIGHEST preference to be used as router identifier. so that VPN addresses cannot be added to loopback at all.

No, it doesn't fail. I tested it with kernel 2.0.36, it worked.

It does not work under 2.2. To be honest, I am about to stop to understand you. You talk about 2.2, but all your tests are made for 2.0. 8)

6.14. A posting to the mailinglist by Peter Kese explaining the "arp problem"

(saved for posterity by Ted Pavlic, minor editing by Joe)

peter (dot) kese (at) ijs (dot) si

Before we start, let's assume we have following network configuration for an LVS running LVS-DR.

client		10.10.10.10

gw		192.168.1.1

director	192.168.1.10 	IP for admin (director IP)
        	192.168.1.110 	VIP (responds to arp requests)

realserver	192.168.1.11 	IP to which each service is listening (realserver IP)
		192.168.1.110 	VIP (DOES NOT respond to arp requests)

The virtualserver is the combination of the director and the realserver running LVS.

Our goal is:

  1. Virtual server should respond to arp requests for both the VIP and the director IP.
  2. The realserver should respond to arp requests for the realserver IP but NOT the VIP.
  3. Gateway sends packets for the VIP to the director (the load balancer) no matter what.

Problem 1: Interface aliases

Realserver and director need to have an interface with the VIP in order to respond to packets for virtual server. A real interface is not needed, an IP alias will do just fine and this interface alias could be either eth0:0 or lo:0.

On the 2.0 kernels, the ARP responding ability of an interface alias (eg eth0:0) could be enabled or disabled independently of the main (eth0) interface. If you wanted eth0:0 not to respond to ARP requests, you could simply say:

        ifconfig eth0:0 192.168.1.2 -arp up

Thus in the 2.0 kernels it is possible, on a realserver, to have the realserver IP (on eth0) respond to arp requests and for the VIP (on eth0:0) to not respond.

In the 2.2 kernels this doesn't work any more. Whether an interface alias responds to ARP requests or not depends only on the way the real interface is configured. So if eth0 responds to ARP requests (which it normally will), eth0:0 carrying the VIP will also respond to ARP requests no matter what.

This means an ethernet alias (eth0:0) is not permitted on realservers, because realservers should not respond to ARP requests for the VIP.

On the other hand, loopback aliases never respond to ARP requests, which means that the loopback alias (lo:0) must not be used on the director for the VIP.

Problem 2: Loopback aliases

I haven't done much checking on loopback interface problem, but it seems that if an alias is used on a loopback interface (as is required for LVS-DR) on a realserver running kernel 2.2.x, the whole ARP gets screwed.

It appears that loopback interfaces get special ARP treatment in the kernel, so I suggest avoiding loopback aliases altogether.

The question now is: What kind of an interface can I use on real servers?

As I already noted, eth0:0 alias can not be used, because such aliases respond to ARP requests. lo:0 aliases can not be used, because they make ARP problems too.

In case of tunneling VS configuration, the answer is trivial: tunl0. But to be honest, tunl0 interface can also be used for direct routing.

(Joe: the dummy device is OK too, at least for the 2.0.x kernels)

With direct routing, the only thing we need an interface for is to let the kernel know we possess an additional IP address. This means we can set up any kind of interface, as long as it doesn't respond to ARP requests. Instead of tunl0, you could also set up a ppp0, slip0, eth1 or whatever. I suggest setting up a tunl0:

        ifconfig tunl0 192.168.1.2 -arp up

Problem 3: Real server ARP requests.

Suppose we have set up a virtual server as described at the beginning. All computers are running, but no requests have been made.

Then the client sends a request to the VIP.

When the packet arrives at the gateway, the gateway makes an ARP query for the VIP and the director responds. The gateway remembers the director's MAC address and sends the packet to the director. The director receives the packet, looks up its ipvsadm/LVS tables, chooses the realserver and forwards the packet to the realserver by the direct routing or tunneling method.

Real server receives the packet and generates a response packet with destination=client, source=VIP.

(until now everything works correctly)

When realserver wants to send the response packet to the gateway, it finds out, that it does not know the gateway's MAC address.

It sends an ARP request to the local network and asks for the gateway MAC address. This should look like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

But in reality, realserver asks something like:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.110 (VIP),

because it takes the source address from the packet it wants to send.

Here the problems come in.

The gateway receives the packet and responds to it, which is correct. But at the same time, the gateway does a little optimization. It finds that the realserver's MAC address is not listed in its ARP tables and adds the entry into the table, just in case it might need that address in the near future.

The ARP request contained the VIP address and the realserver's MAC address, so from now on, the gateway will send all packets destined for the VIP to the realserver instead (due to the MAC address). This means all packets that follow will bypass the virtual server as a whole and be answered by the realserver.

If the realserver's ARP request would be:

ARP, who has 192.168.1.1 (gw), tell 192.168.1.11 (realserver IP)

all this would not have happened. Therefore I have patched the 2.2 VS kernel in such a way, that it composes ARP requests based on the address of the interface selected by the routing tables instead of the address taken from the packet itself.

In order for virtual server to work correctly, the realservers should have patched kernels as well, or at least copy the patched /usr/src/linux/net/ipv4/arp.c file to the realservers before compiling the kernels.
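
The faulty request described in Problem 3 ("tell VIP" instead of "tell realserver IP") can be looked for in a packet capture. A minimal sketch, assuming tcpdump prints ARP requests in text form with "who-has TARGET tell SENDER" as in the examples above; vip_arp_requests is a hypothetical helper name and reads the capture text on stdin.

```shell
#!/bin/sh
# Sketch: scan tcpdump text output for ARP requests that announce the VIP
# as sender ("tell VIP"), the pattern described in Problem 3.
# Assumes tcpdump -n lines of the form
#   ... ARP, Request who-has 192.168.1.1 tell 192.168.1.110, length 28
# vip_arp_requests is a hypothetical helper name.
vip_arp_requests() {
    vip=$1
    # Escape the dots so they match literally, and require a comma or
    # end of line after the address so that a VIP of 192.168.1.11 does
    # not also match 192.168.1.110.
    pat=$(printf '%s' "$vip" | sed 's/\./\\./g')
    grep -E "who-has .* tell ${pat}(,|\$)"
}
```

On a realserver this might be used as: tcpdump -l -n -i eth0 arp | vip_arp_requests 192.168.1.110. Any line printed means the realserver is still announcing the VIP in its own ARP requests.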

Conclusion

Those were my experiences with ARP problems and the 2.2 kernel virtual server.

I think it would be wise to add this letter to the web site and notify the network developers about our findings at some point in time.

Here are some golden rules I stick to, when I do virtual server configuration:

Rule 1:
        Do not use lo:0 alias on the director.
        Use eth0:0 alias instead.

Rule 2:
        Avoid using lo:0 alias, not even on realservers.
        Use tunl0 or some other simulated interface
        on realservers instead. (Joe: use dummy0)


Rule 3:
        Apply the VS patch to kernels on realservers.

6.15. arp bouncing

symptoms of realservers arp'ing - arp bouncing

Stephen Williams sdw (at) lig (dot) net (Stephen wrote one of the patches that stop devices in 2.2.x kernels from replying to arp requests)

If you don't use the patch you'll find that the 'active' box will bounce from machine to machine as each one sends an ARP reply that is heard last. Additionally you will get TCP Reset's as connections that were on one box suddenly start going to others. Very nasty and unusable.

6.16. Lars' Method

(This is called Lars' method)

Lars

I have thought about how the ARP problem can occur at all with direct routing, because I never noticed it. Then it occurred to me that your VIP comes from the same subnet as the RIP of the LVS and also all the realservers share this media.

To avoid the "ARP problem" in this case without adding a kernel patch or anything else, you can just add a direct route for the VIP using the RIP of the LVS as a gateway address on the router in front of the LVS. ("ip route VIP 255.255.255.255 real_ip" on a Cisco, or "route add -host VIP gw RIP" on Linux)

Since I just used 2 ethernet cards and had the LVS act as gateway/firewall anyway, I never noticed the ARP problem. (We have 2 LVS in a standby configuration to eliminate the SPOF)

6.17. Static Routing to Director

The arp problem is handled if the router in front of the director has a static route for the VIP to the director (i.e. packets for the VIP from the outside world are sent to the director and cannot get to the realservers).

Wensong

For the clients who reach the virtual server through the router, there is no problem if a static route for VIP is added.

However, for the clients who are in the network of the virtual server, the "ARP problem" will arise. There is a fight over the ARP response, and the clients don't know whether to send the packets to the load balancer or to the realserver.

From my point of view, the VIP address is shared by the director and realservers in LVS-Tun or LVS-DR. Only the director sends ARP responses for the VIP, in order to accept request packets; the realservers have the VIP but don't respond, so that they can still process packets destined for the VIP.

6.18. iproute2 arp on|off flag

Joe, 21 May 2001

Was looking at the ip (i.e.iproute2) notes and it says

ip arp on|off

--change NOARP flag on the device

NB. This operation is not allowed if the device is in state UP.
Though neither ip utility nor kernel check for this condition, you can
get unpredictable results changing the flag while the device is running.

Is this like the old -noarp flag for ifconfig?

Julian Anastasov ja (at) ssi (dot) bg 21 May 2001

This is the device ARP flag, same as ifconfig [-]arp. The flag is used to allow ARP packets for the specified device. It is correct that "lo" does not talk ARP, but you connect to the VIPs on "lo" through eth*, so the flag is of no help for LVS. We can't drop the flag for eth device.

Andreas J. Koenig, 02 Jun 2001

kernel 2.4.5 has arp_filter

Julian Anastasov ja (at) ssi (dot) bg

arp_filter does not solve the ARP problem for LVS

This is a new proposal to control the ARP probes and replies based on the route flag "noarp". It will be discussed on the netdev mailing list and maybe something like this is going to be included in 2.4, maybe in 2.2 too, not sure. As you all know, the hidden feature is not being considered for 2.4. The net developers have the final word. I'll try to maintain the hidden flag in all future kernels as long as this flag is more usable than the new feature, and because the hidden flag has different semantics. And because there may be some user space tools that rely on it.

6.19. Is the arp behaviour of 2.2.x kernel a bug?

Note

Julian Anastasov is replying to correct an error in a previous version of the HOWTO, where I stated that the dummy0 device in 2.2.x kernels does not arp. Julian wrote one of the realserver patches which fix the "arp problem".

Julian

In fact, the documentation is incorrect. There is no difference, all devices are reported in the ARP replies: lo, tunl and dummy. So, only the ARP patch can solve the problem. This can be tested using this configuration with any device (before the patch is applied):

Host A:
         eth:x 192.168.0.1

Host B:
         eth:x 192.168.0.2
         lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through 192.168.0.2 device

So, the ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.

Stephen Williams (who wrote another of the patches to fix the arp problem).

Of course the ARP code in the kernel needs to be fixed so my filter code isn't needed. Still, I'm confused by this statement. The IFF_NOARP flag determines whether a device arp replies or not. What's wrong with honoring that?

If you mean that arp replies should never be sent on another interface, that is what I currently believe to be correct.

Julian

My understanding is that 2.2.x ARP code is not buggy and there is no need to be "fixed". I must say that your patch is working for the LVS folks but not for all linux users.

IFF_NOARP means "Don't talk ARP on this device", from the 'man ifconfig':

[-]arp Enable or disable the use of the ARP protocol on this interface.

So, where is the bug? The ARP code never talks through lo, dummy and tunl devices when they are set NOARP. It uses the eth (ARP) device. If you hide all NOARP interfaces from the ARP protocol, this is a bug. One example:

 +--------+ppp0                          +------+
 | Host A |------------ppp link----------|ROUTER|------ The World
 +--------+A.B.C.1 (www.domain.com)      +------+
   |eth0
   |A.B.C.2
   |
   |A.B.C.3
 +--------+
 | Host B |
 +--------+

Is it possible, after your patch, for Host B to access www.domain.com? How? Host A doesn't send replies for A.B.C.1 through eth0 after your patch. OK, maybe this is not fatal. Tell that to all kernel users. You hide all their NOARP interfaces. Maybe there are other examples where this is a problem too. Or maybe there is something wrong in this configuration?

I want to say that this patch hurts all users if it is present in the kernel. On Nov 6 I posted a patch proposal to the linux-kernel list which adds the ability to hide interfaces from ARP queries and replies. The difference is that replies are suppressed only for the specified interfaces, not for all NOARP interfaces. Its arp_invisible sysctl can be used by LVS folks to hide lo, tunl or dummy interfaces, but this feature doesn't hurt all kernel users. I think this patch is more acceptable and can be included in the 2.2 kernel, maybe after some tuning. I'm still expecting comments from the net folks and from all LVS users.
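The three behaviours argued over in this exchange can be compared side by side in a small model. This is a sketch only, not kernel code: the policy names, the `Addr` class and the choice of dummy0 (set NOARP) for the VIP are assumptions made for illustration.

```python
from dataclasses import dataclass

@dataclass
class Addr:
    ip: str
    device: str
    noarp: bool = False   # the device has IFF_NOARP set
    hidden: bool = False  # the admin explicitly marked the device as hidden

def answers_arp(addrs, target_ip, policy):
    """Would the host answer an ARP request for target_ip that arrives on eth0?"""
    for a in addrs:
        if a.ip != target_ip:
            continue
        if policy == "stock-2.2":         # all local addresses are reported
            return True
        if policy == "hide-all-noarp":    # patches keyed on IFF_NOARP
            return not a.noarp
        if policy == "hide-marked-only":  # per-interface hiding
            return not a.hidden
        raise ValueError(policy)
    return False  # target_ip is not local

# Host A from the diagram above: ppp0 is a NOARP device.
host_a = [Addr("A.B.C.1", "ppp0", noarp=True), Addr("A.B.C.2", "eth0")]
# A realserver with the VIP on dummy0, assumed to be set NOARP and marked hidden.
realserver = [Addr("RIP", "eth0"), Addr("VIP", "dummy0", noarp=True, hidden=True)]

for policy in ("stock-2.2", "hide-all-noarp", "hide-marked-only"):
    print(policy,
          answers_arp(host_a, "A.B.C.1", policy),
          answers_arp(realserver, "VIP", policy))
```

Only the per-interface policy hides the VIP on the realserver while still letting Host A answer for A.B.C.1, which is the distinction Julian is drawing.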

6.20. The device doesn't reply to arp requests, the kernel does.

ARP requests/replies are thought of as coming from a device and people make statements like

"the dummy device in 2.0.x kernels does not reply to arp requests while the same device in 2.2.x kernels does reply".

It is the kernel that handles arp requests according to a set of rules and not the device. The code for the dummy device is the same in 2.0.x and 2.2.x kernels and is not responsible for the change in arp behaviour.

(The RFC for ARP is at ftp://ftp.isi.edu/in-notes/std/std37.txt; also see RFC 826 and RFC 1122. The model system used there is 2 machines on a single ethernet. It doesn't shed any light on the implementation of ARP on multi-interface systems like LVS.)

6.21. Properties of devices for the VIP

In a previous version of the HOWTO I stated that the dummy0 device did not arp in 2.2.x kernels and therefore could be used as the device for the VIP on an unpatched 2.2.13 realserver. Julian Anastasov replied that they did arp (see below for his posting and the ensuing discussions).

I hadn't actually tested whether the dummy0 device arp'ed but had concluded that it wasn't arp'ing because I had a working LVS using the dummy0 interface for the VIP on unpatched 2.2.x realservers and because as everyone knows ;-) an LVS needs to have a non-arp'ing device on the VIP of the realservers.

I had a LVS-DR LVS which worked with dummy0, lo:0 and tunl0 as the VIP device and which on further testing, I found also worked with eth0:1 or eth1 as the VIP device on 2.2.13 realservers. Whatever the arp'ing status of dummy0, lo:0 or tunl0, clearly eth1 replies to arp requests, so despite the conventional wisdom, it is possible to build an LVS with arp'ing VIP's on the realservers.

On investigating why this LVS worked, I found that the MAC address for the VIP in the client's arp cache (# arp -a) was always the director's. I assume this was because the director is 3-4x the speed of the other machines in the LVS and it replies first to arp requests for the VIP (another posting from Stephen Williams says that the address which replies last is stored in the arp cache - we'll figure out what's really going on here eventually). On another LVS where the realservers were all identical hardware with 2.2.13 unpatched kernels, one particular realserver was always the machine in the client's arp cache for the VIP (to check, delete the entry for the VIP with arp -d, then ping again, then look in the arp cache).

I found that I could get a working LVS using almost anything to hold the VIP on the realservers, including eth0:1 and eth1 (another NIC in the realserver). These devices carrying the VIP were pingable from the client and I could get the corresponding MAC addresses in the arp table of the client if the director was not setup with a VIP. When I setup a working LVS this way, I found each time that the MAC address for the VIP in the client's arp cache was the director's MAC address. For some reason, that I don't know, whenever the client does an arp request for the VIP, it gets the director's MAC address.

Possible reasons for the MAC address of the director always being associated with the VIP in my LVS -

  1. I configure the director first and then the realservers. I don't make requests for a service till the realservers are setup. (Still I can't imagine the client asking for the MAC address of the VIP until it makes a connect request.)
  2. The director is 3 times faster (CPU speed) than the next machine in the LVS and it always replies to arp requests first.
  3. I was lucky.
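Whether the fastest or the slowest replier ends up in the client's arp cache is left open above. The two possibilities can be compared with a toy model. This is a sketch only: the policy names and the arrival times are invented, and the MAC addresses are the director's and realserver1's from the arp -a listings in this section.

```python
def cached_mac(replies, policy):
    """replies: list of (arrival_time_ms, mac). Return the MAC left in the client's cache."""
    if not replies:
        return None
    ordered = sorted(replies)  # order the replies by arrival time
    if policy == "first-wins":
        return ordered[0][1]
    if policy == "last-wins":
        return ordered[-1][1]
    raise ValueError(policy)

DIRECTOR = "00:A0:CC:55:7D:47"
REALSERVER1 = "00:90:27:66:CE:EB"

# The faster director answers first; the times are invented for illustration.
replies = [(0.3, DIRECTOR), (1.1, REALSERVER1)]

print(cached_mac(replies, "first-wins"))
print(cached_mac(replies, "last-wins"))
```

Under "first-wins" a fast director would explain the observation; under "last-wins" it would not, so the speed explanation depends on which rule the client's kernel actually follows.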

Since you can make a working LVS-DR LVS with the realserver VIP on an arp'ing eth0:1 device I decided that the relevant piece of information about arp'ing was (ta da!)

* an LVS will work if the client always gets the MAC address of the director when it asks for the MAC address of the VIP *

This provides an easy solution - you tell the client (or the router) the MAC address of the VIP with arp -s or arp -f.

here's my /etc/ethers

lvs.mack.net 00:A0:CC:55:7D:47

After installing the MAC address of the DIP (director) as the MAC address of the VIP (lvs) in the arp table (arp -f /etc/ethers) I get

client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

notice the "PERM" in the VIP entry on the client.

removing the permanent entry

client:/usr/src/temp/lvs# arp -d lvs.mack.net
client:/usr/src/temp/lvs# arp -a
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at <incomplete> on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0

If I edited /etc/ethers changing the MAC address of lvs to anything else, the LVS did not work anymore. So the arp information is coming from /etc/ethers rather than some uncontrolled variable I'm not aware of.
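The check made by eye above (does the VIP have the director's MAC address, and is the entry permanent?) can be scripted. The following is a sketch that assumes the `arp -a` line format shown in the listings above; other versions of net-tools may print the lines differently, and the function names are hypothetical.

```python
import re

# Matches lines such as:
#   lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
#   lvs.mack.net (192.168.1.110) at <incomplete> on eth0
ARP_LINE = re.compile(
    r"^(?P<host>\S+) \((?P<ip>[\d.]+)\) at (?P<mac>\S+)"
    r"(?: \[(?P<hw>\w+)\])?(?P<perm> PERM)? on (?P<dev>\S+)$"
)

def parse_arp(text):
    """Parse `arp -a` output into {ip: {"host", "mac", "permanent"}}."""
    table = {}
    for line in text.strip().splitlines():
        m = ARP_LINE.match(line.strip())
        if m:
            mac = m.group("mac")
            table[m.group("ip")] = {
                "host": m.group("host"),
                "mac": None if mac == "<incomplete>" else mac,
                "permanent": m.group("perm") is not None,
            }
    return table

def vip_uses_director_mac(table, vip, dip):
    """True if the VIP entry exists and carries the same MAC as the DIP entry."""
    vip_mac = table.get(vip, {}).get("mac")
    return vip_mac is not None and vip_mac == table.get(dip, {}).get("mac")

SAMPLE = """\
realserver1.mack.net (192.168.1.1) at 00:90:27:66:CE:EB [ether] on eth0
lvs.mack.net (192.168.1.110) at 00:A0:CC:55:7D:47 [ether] PERM on eth0
director.mack.net (192.168.1.10) at 00:A0:CC:55:7D:47 [ether] on eth0
"""

table = parse_arp(SAMPLE)
print(vip_uses_director_mac(table, "192.168.1.110", "192.168.1.10"))
print(table["192.168.1.110"]["permanent"])
```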

I had thought that in an LVS with the VIP on realservers on an arping device that the VIP would hop from one machine to another (see the postings in the MISC section). Since naturally occurring LVS's with arping VIP's on realservers existed and worked well (mine), I set up an LVS by making a permanent entry for the VIP of the director in the arp cache of the client (router). This can be done by

$ arp -f /etc/ethers
or
$ arp -s 192.168.1.110 MAC_ADDRESS

There are 2 results of this

  1. the realservers can have the VIP on an arp'ing device (eg eth0:1, eth1) - you don't need lo or dummy0, tunl0 for realservers with 2.0.36 and 2.2.x kernels.
  2. If two (or more) directors are setup in failover mode, the mechanism for changing the VIP from one to another is broken by making a permanent entry for the VIP on the director in the arp cache of the router. This is not a problem for a test setup to demonstrate an LVS but may be a problem in a high availability environment (a solution may be found in the meantime too).

The normal method for changing directors (e.g. with heartbeat) includes a gratuitous arp. To force a gratuitous arp

Julian

You can use Yuri Volobuev's send_arp.c from the 'fake' package or Alexey Kuznetsov's arping from his iputils package:

  • fake - http://vergenet.net/linux/fake/
  • iputils - ftp://ftp.inr.ac.ru/ip-routing/iputils-ss991024.tar.gz iputils is also used for IPAT, IP address takeover
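For readers who want to see what goes on the wire, here is a sketch of one common form of gratuitous arp: a broadcast ARP request in which the sender IP and the target IP are both the VIP. It only builds the 42-byte frame and does not send it (sending would need a raw socket and root). The function name is hypothetical, and this is not a description of what send_arp or arping emit; the MAC address and VIP are the ones from the arp -a listings above.

```python
import socket
import struct

def gratuitous_arp_frame(mac, ip):
    """Build an Ethernet frame carrying a gratuitous ARP request for ip."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination, source, EtherType 0x0806 (ARP)
    ethernet = broadcast + hw + struct.pack("!H", 0x0806)
    arp = struct.pack(
        "!HHBBH6s4s6s4s",
        1,            # hardware type: Ethernet
        0x0800,       # protocol type: IPv4
        6, 4,         # hardware and protocol address lengths
        1,            # opcode: request
        hw, addr,     # sender MAC and sender IP
        b"\x00" * 6,  # target MAC: unknown
        addr,         # target IP: same as the sender IP
    )
    return ethernet + arp

frame = gratuitous_arp_frame("00:A0:CC:55:7D:47", "192.168.1.110")
print(len(frame), frame.hex())
```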

If you're not sure if the network knows that the VIP has moved, try this.

Graeme Fowler graeme (at) graemef (dot) net 13 Mar 2006

At failover, make the new live director run something along the lines of:

/sbin/ping -c5 -I $VIP $GW_IP

Where $GW_IP is the IP address of your upstream router. It's not exactly gratuitous ARP but it does, in my experience, help to rapidly converge the systems which currently don't talk to each other.

Also make absolutely sure that the VIP is being torn down on the failed director. If it isn't, and it still ARPs for it, you'll end up in all sorts of problems.

To monitor this you could feasibly run arpwatch on both the directors' upstream interfaces. You should see the VIP flip-flop on failover. If you see it repeatedly flip-flop at regular intervals, you're not tearing down properly.
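The difference between a clean failover and a VIP that was not torn down can be expressed as a count of MAC changes. This is a sketch over a hypothetical list of (time, MAC) observations for the VIP; it does not parse real arpwatch output, and the MAC addresses are made up.

```python
def count_flips(observations):
    """observations: time-ordered list of (seconds, mac) seen for the VIP.

    Returns how many times the MAC address changed.
    """
    flips = 0
    previous = None
    for _, mac in observations:
        if previous is not None and mac != previous:
            flips += 1
        previous = mac
    return flips

A = "00:00:00:00:00:0a"  # made-up MAC of the failed director
B = "00:00:00:00:00:0b"  # made-up MAC of the new live director

clean_failover = [(0, A), (10, A), (20, B), (30, B)]
not_torn_down = [(0, A), (10, B), (20, A), (30, B)]

print(count_flips(clean_failover))
print(count_flips(not_torn_down))
```

One change is what a failover should look like; a count that keeps growing suggests both directors still answer for the VIP.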

Joe Dec 2003

There is also http://www.vergenet.net/~acassen/software/garp-0.1.1.tar.gz which has been available for over a year, without me even knowing about it.

Here's some tests I did

LVS equipment: 2.2.13 client, and 0.9.4/2.2.13 director.
2 realservers
a) 2.0.36 kernel, libc5, gcc-2.7.2.3, net-tools 1.42.
b) 2.2.13 kernel, glibc, gcc-2.95,    net-tools 1.52

Experiment 1: Result - arp'ing is independent of [-]arp

Summary: the -arp/+arp option for ifconfig had no effect on any devices back to 2.0.36 kernels with net-tools 1.42. If a device normally arps then -arp has no effect; if it normally doesn't arp, then "arp" doesn't turn it on (data below).

Method: IP=192.168.1.1/24 with VIP=192.168.1.110/32. The VIP was on dummy0. The test was to see if the VIP was pingable from another (external) machine on the 192.168.1.0/24 network or pingable from the machine itself (ie internally from the console). (I assume I had a route add -host for the VIP although I didn't record this). The test was done with ifconfig using arp or -arp (the output of ifconfig -a didn't change)

                 -----2.0.36-------  -----2.2.13-------
ping from        internal  external  internal  external
VIP device
dummy   ARP        +         -         +         +
        NOARP      +         -         +         +
        down       -         -         -         -   (control)

Experiment 2: Can the VIP be on a separate NIC?

Summary: yes, as long as the NIC doesn't have a cable plugged into it.

Method: same as above except VIP on eth1 (another NIC).

                 -----2.0.36-------
ping from        internal  external
VIP device
eth1 has cable connected to 192.168.1.0 network
eth1    ARP        +         +
        NOARP      +         +

eth1 cable to network removed
eth1    ARP        +         -
        NOARP      +         -
        works as realserver in LVS - yes

One of the reasons a no_arp interface is used on the realserver is that it is not visible to the rest of the network. Does the LVS work if the eth1 VIP on the realserver is not visible to the rest of the network?

Conclusion: for 2.0.36 dummy0 doesn't arp, and eth1 does arp. the arp/-arp option to ifconfig has no effect on arp behaviour. LVS works with both dummy0 and eth1, I assume since VIP need only be resolved as local on the realserver and does not need to be visible to the network.

Experiment 3: What devices and netmasks are necessary for a working LVS?

Using the /etc/ethers approach for setting the MAC address of the VIP I then set up an LVS with a pair of realservers serving telnet. All IPs are 192.168.1.x, all machines have a route to 192.168.1.0 via eth0. There is no default route.

1. 2.0.36, libc5, gcc 2.7.2.3, net-tools 1.42
2. 2.2.13, glibc-2.1.2, gcc-2.95, net-tools 1.52

with the following devices holding the VIP, tunl0, eth0:1, lo:0, dummy0, eth1. In each case there was no route entry for the VIP device and there was no cable connected to eth1 when it was used for the VIP. The table below shows whether the LVS worked. The VIP is installed with

ifconfig $DEVICE 192.168.1.110 netmask $NETMASK broadcast $BROADCAST
with $NETMASK="255.255.255.255" $BROADCAST="192.168.1.110"
or   $NETMASK="255.255.255.0"   $BROADCAST="192.168.1.255"

the results belong to 1 of 3 groups

+ works fine
- doesn't work
  (at $ prompt on client get
  "unable to connect to remote host.  Protocol not available"
  then client returns to regular unix $ prompt)
hangs - client hangs, realserver cannot access network anymore,
  have to run rc.inet1 from console prompt on realserver to
  start network again.

netmask of VIP=255.255.255.255 (normal LVS setup)

LVS type  -----VS-Tun------     ----VS-DR------
kernel    2.0.36     2.2.13     2.0.36   2.2.13

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           -         +         +
dummy0     +           -         +         +
eth1       +           -         +         +

netmask of VIP=255.255.255.0 (not normally used for LVS)

VIP on
tunl0      +           +         +         +
eth0:1     +           -         +         +
lo:0       +           hangs     +         hangs
dummy0     +           -         +         +
eth1       +           -         +         +

It would seem that any device and any netmask can be used for the VIP on a 2.0.36 realserver for both LVS-Tun and LVS-DR.

For 2.2.13 realserver, LVS-Tun, VIP on a tunl0 device only, any netmask (ie you need tunl0 on LVS-Tun with 2.2.x kernels)

LVS-DR,  lo:0 device netmask /32 only
       all other devices any netmask
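The conclusions above can be read off the two tables mechanically. The following sketch copies the results into a dictionary and lists the devices that worked for a given LVS type, kernel and netmask; the names `RESULTS`, `COLUMNS` and `working_devices` are just for illustration.

```python
# Results copied from the two tables above: "+" works, "-" doesn't work, "hangs".
# Column order matches the table headers.
COLUMNS = [("tun", "2.0.36"), ("tun", "2.2.13"), ("dr", "2.0.36"), ("dr", "2.2.13")]
RESULTS = {
    "/32": {
        "tunl0":  ["+", "+", "+", "+"],
        "eth0:1": ["+", "-", "+", "+"],
        "lo:0":   ["+", "-", "+", "+"],
        "dummy0": ["+", "-", "+", "+"],
        "eth1":   ["+", "-", "+", "+"],
    },
    "/24": {
        "tunl0":  ["+", "+", "+", "+"],
        "eth0:1": ["+", "-", "+", "+"],
        "lo:0":   ["+", "hangs", "+", "hangs"],
        "dummy0": ["+", "-", "+", "+"],
        "eth1":   ["+", "-", "+", "+"],
    },
}

def working_devices(lvs_type, kernel, netmask):
    """Devices for which the LVS worked in the given configuration."""
    col = COLUMNS.index((lvs_type, kernel))
    return [dev for dev, row in RESULTS[netmask].items() if row[col] == "+"]

print(working_devices("tun", "2.2.13", "/32"))  # only tunl0
print(working_devices("dr", "2.2.13", "/24"))   # everything except lo:0
```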

For LVS-DR then on Solaris/DEC/HP/NT... LVS can probably use a regular eth0 device rather than an lo:0 device (more work for Ratz to do :-).

Does anyone know why the lo:0 device has to be /32 for LVS-DR on kernel 2.2.13 while the other devices can be /24?

Jean-Francois Nadeau jna (at) microflex (dot) ca 6 Dec 99

In kernel 2.2.1x, with a virtual interface on lo:0 and a netmask of 255.255.255.0, the interface no longer arps.

Horms 29 Oct 2003 (4yrs later, presumably referring to the 2.4 kernels)

/sbin/ifconfig lo:110 192.168.1.110 broadcast 192.168.1.110 netmask 255.255.255.255

brings up lo:110 (a virtual interface on the loopback device) for 192.168.1.110 with the broadcast and netmask as specified. If you are using LVS-DR then the packets that arrive on the realservers have the destination IP address set to the VIP. So the realservers need some way of accepting this traffic as local. One way is to add an interface on the loopback device and hide it so it won't answer ARP requests. The netmask has to be 255.255.255.255 because the loopback interface will answer packets for _all_ hosts on any configured interface. So 192.168.1.110 with netmask of 255.255.255.0 will cause the machine to accept packets for _all_ addresses in the range 192.168.1.0 - 192.168.1.255, which is probably not what you want.
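The effect of the netmask can be checked with a few lines of Python using the standard ipaddress module. This only illustrates the arithmetic (how many addresses each netmask covers); it does not query a kernel, and the function name is hypothetical.

```python
import ipaddress

def covered_by_loopback_alias(ip, netmask):
    """Network that a loopback alias with this address and netmask would treat as local."""
    return ipaddress.ip_network(f"{ip}/{netmask}", strict=False)

wide = covered_by_loopback_alias("192.168.1.110", "255.255.255.0")
narrow = covered_by_loopback_alias("192.168.1.110", "255.255.255.255")

print(wide, wide.num_addresses)      # 192.168.1.0/24 256
print(narrow, narrow.num_addresses)  # 192.168.1.110/32 1
print(ipaddress.ip_address("192.168.1.1") in wide)    # True: another host's IP is covered
print(ipaddress.ip_address("192.168.1.1") in narrow)  # False
```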

Does anyone know why only the tunl0 device works for LVS-Tun on 2.2.x kernels?

Experiment 4: Effect of route entry for VIP and connection to VIP. The VIP normally has an entry in the routing table eg

route add -host 192.168.1.110 $DEVICE

I found in Experiment 2 that a route entry was not necessary for the LVS to work when the realserver had the VIP on eth0:1. Since I had always used a route entry for the VIP I wanted to find out when it was needed. The same LVS was used as for Experiment 3. The variables were

1) a route entry/no route entry for VIP/32
2) for eth1 whether the NIC was connected to the network by a cable.

kernel            ------2.0.36-------     -------2.2.13-------
VIP               eth1 eth1_nc eth0:1     eth1  eth1_nc eth0:1

no route
   LVS             +     +      +          +      +       +
   ping internal   -     -      -          +      +       +
   ping external   +     -      +          +      +       +

route
   LVS             +     +      +          +      +       +
   ping internal   +     +      +          +      +       +
   ping external   +     -      +          +      +       +

Conclusion 1: LVS works in both cases (route/no route for the VIP) for eth0:1 and eth1 (i.e. you don't need a route entry for the VIP on the realservers).

Conclusion 2: having a network cable/no network cable does not affect whether the LVS works.

Conclusion 3: for 2.0.36 kernels you can choose to have the VIP pingable from the outside world but not pingable by the local host by having it on eth1 with a cable connection (this seems weird and I can't think of any use for it just yet) or the reverse - pingable from the localhost but not by the external world by not having a cable connection.

Note
using a host's routable IP as the target - the IP on eth0 say - you can make a host unpingable from the console if you down the lo. The host is still pingable from elsewhere on the net.

6.22. Topologies for LVS-DR and LVS-Tun LVS's

6.22.1. Traditional

The conventional LVS-DR/LVS-Tun topology, which allows maximum scalability, has each realserver with its own default gateway (to a router). (In a routerless test setup, the client would be the default gateway for the realservers. In a setup which is not network bound, i.e. is disk- or compute-bound, only one router may be needed. The changes in topology/routing are made by changing the IP of the default gw for the realservers.)

Some method of handling the arp problem is needed here.

The packets sent to the realservers from the director generate replies which go directly to the client. Failure messages (e.g. if a realserver is not available) do not get returned to the director, which cannot tell if a realserver has failed (see discussion of monitoring agents).

                       -------------clients-----------------------
                       |                         |       |       |
                    (router)                  (router)(router)(router)
                       |                         |       |       |
         __________    |                         |       |       |
        |          |   |    VIP                  |       |       |
        | director |---     DIP                  |       |       |
        |__________|   |                         |       |       |
                       |                         |       |       |
                       |                         |       |       |
        ---------------------------------        |       |       |
        |              |                |        |       |       |
        |              |                |        |       |       |
       RIP1           RIP2             RIP3      |       |       |
       VIP            VIP              VIP       |       |       |
 _____________   _____________   _____________   |       |       |
|             | |             | |             |  |       |       |
| realserver  | | realserver  | | realserver  |  |       |       |
|_____________| |_____________| |_____________|  |       |       |
        |              |                |        |       |       |
        |              |                ----------       |       |
        |              -----------------------------------       |
        ----------------------------------------------------------

6.22.2. Director sees replies

(from Julian Anastasov)

Note
This discussion led to Julian's martian modification.

If the default gw for each realserver is changed to the DIP (see the Martian modification section) then

  • The director has to handle the reply packets as well as the incoming packets, doubling the network load.
  • The director sees all the reply packets. Connection failure can be detected (in principle).

                        clients
                           |
                         router
                           |
             __________    |
            |          |   |    VIP
            | director |---     DIP
            |__________|   |
                           |
                           |
          ------------------------------------
          |                |                 |
          |                |                 |
         RIP1             RIP2              RIP3
         VIP              VIP               VIP
   _____________     _____________     _____________
  |             |   |             |   |             |
  | realserver  |   | realserver  |   | realserver  |
  |_____________|   |_____________|   |_____________|

Here's the original posting by Horms horms (at) vergenet (dot) net

Hi, I have been setting up a test network to benchmark IPVS, the topology is as follows.

       node-1      node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-3 (router)
                     |                   server-net
      ------+--------+----------+---     192.168.1.0/24
            |        |          |
         node-2    node-4     node-5
         (IPVS)   (server)   (server)

The question that I have is that the network I would really like to be testing is;

       node-1       node-6     node-7
       (client)   (client)   (client)
           |         |          |        client-net
  ---------+---------+----------+------ 192.168.2.0/24
                     |
                   node-2 (IPVS)
                     |                   server-net
      ---------+-----+----+---------     192.168.1.0/24
               |          |
             node-4     node-5
            (server)   (server)

.. other than using NAT, which has performance problems, is this possible? I tried this topology with direct routing and packets from the clients were multiplexed to the servers fine, but return packets from the servers to the client were not routed by the IPVS box.

Lars

Yes. The LVS box silently drops the return packets, since they have a src ip which is also bound as a local interface on the LVS. This is meant to be a simple anti-spoofing protection.

from Joe:

Note
The return packet from the realserver has src=VIP, dest=CIP. If this packet is routed via the director, which also has the VIP, the director receives a packet from another machine whose src is one of its own IPs, and the director drops the packet.

You can enable logging these packets via

echo 1 >/proc/sys/net/ipv4/conf/all/log_martians
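The anti-spoofing check described above can be modelled in a few lines. This is a sketch of the rule as stated in this section, not the kernel's implementation; the function name is hypothetical, the VIP and DIP are the addresses used in the arp -a listings earlier, and the CIP is invented.

```python
def director_accepts(src_ip, local_ips):
    """Model of the check: a packet arriving from the wire whose source
    address is one of the receiving machine's own addresses is dropped."""
    return src_ip not in local_ips

VIP = "192.168.1.110"
DIP = "192.168.1.10"
CIP = "192.168.2.1"   # invented client address

director_ips = {VIP, DIP}

# Request from the client (src=CIP, dst=VIP): accepted.
print(director_accepts(CIP, director_ips))
# Reply from a realserver routed back through the director (src=VIP, dst=CIP):
# dropped, because the director owns the VIP as well.
print(director_accepts(VIP, director_ips))
```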

The only way around this with current Linux kernels is to disable the check in the kernel source or to use a separate box as the outward gateway (which is how DR is meant to be used for full performance). This is not a problem as such, as it probably makes a lot of sense not to use an IPVS box as your gateway router. Actually it makes a lot of sense to do just that IMHO. Fewer points of failure, less hard- and software to duplicate in a failover configuration.

Ray Bellis rpb (at) community (dot) net (dot) uk

It needs to be made more explicit in the documentation that LVS-DR will only work if you have a different return path.

Lars Marowsky-Bree lmb (at) teuto (dot) net

... or if you have a suitably patched kernel.

We spent several man days trying to get this to work before figuring out why the packets were being dropped, at which point we had no alternative but to use LVS-NAT instead.

I agree. We still assume too much knowledge on the network admin side.

FYI, we have our LVS system working now, with LVS redundancy achieved by running OSPF routing (gated) on the LVS-NAT servers and having the VIP within the same IP subnet as the RIPs so that IGP routing policies automatically determine which LVS router the packets arrive on.

Yes, that's one option. Even better than heartbeat and IPAT, if all your systems support running a routing protocol. (IPAT = IP address takeover, part of heartbeat) In essence, heartbeat and IPAT is nothing but reinventing a subset of the functionality of a hardened routing protocol like OSPF/RIPv2/EIGRP.

6.22.3. On other schemes for director/realservers to exchange roles

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg has pointed out on the mailing list that the prototype LVS can be redrawn as

                        ________
                       |        |
                       | client |
                       |________|
			   |
                           |
                        (router)
                           |
			   |
         ------------------------------------
         |                 |                |
         |                 |                |
      DIP, VIP         RIP1, VIP        RIP2, VIP
    ____________    ______________    ______________
   |            |  |              |  |              |
   |  director  |  | realserver1  |  | realserver2  |
   |____________|  |______________|  |______________|

and that any realserver is in a position to replace a failed director. No-one has bothered to write the code for this. It seems it's easier to have extra boxes in the director role (ready for failover) and others in the realserver role. It's easier to wheel in another box for a spare director than to configure realservers to do two jobs reliably.

Julian

The director and the backup are on a shared network for incoming traffic. The backup sniffs the packets and changes its connection state the same way as the director (the director only sees the client-to-server half of the connection in LVS/TUN and LVS/DR), then drops the packets. It needs some investigation and probably lots of additional code too. ;-)

Wensong Zhang wensong (at) iinchina (dot) net

I don't even think so - the main trick is getting the kernel to sniff the packets, which is probably quite easy with a little messing around. Not sending the packets out again (which would confuse the realservers) is easy with an ipchains output rule which silently drops them.

This doesn't work with a switch though, you need a shared network like a hub.

However, I have been talking with Rusty about this. The problem is more general - HA shared-state firewalls are asked for all the time, so we want to do a generic thing for everything which builds upon Netfilter's state machine. This would not only cover LVS, but also masquerading and packet filtering in general. We intend to discuss this in greater detail at the Ottawa Linux Symposium latest.

Julian

As you can see, the connections depend on the initial state and on the realservers' realtime status. So another method is that when the director is down, the backup server sets up ipvs with the connections, but that seems too late. What do you think about this?

Wensong

TCP/IP should be able to cope with a few seconds delay and lost packets. You want to heartbeat once per second and take over after 3-4s though - this usually means takeover is complete in <10s, which TCP/IP should swallow.
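The timing argument can be written out as simple arithmetic. This is a sketch under stated assumptions: the function names are hypothetical, and the 2 seconds allowed for the takeover work itself (address takeover, gratuitous arp) is an invented figure, not a measurement.

```python
def should_take_over(last_heartbeat, now, deadtime=3.0):
    """True once no heartbeat has been seen for `deadtime` seconds."""
    return (now - last_heartbeat) >= deadtime

def worst_case_outage(interval=1.0, deadtime=3.0, takeover_work=2.0):
    """Rough upper bound on the outage seen by clients, in seconds.

    The director may die right after sending a heartbeat, the backup may
    only notice on a check up to one `interval` after `deadtime` expires,
    and the takeover itself takes `takeover_work` seconds (an assumption).
    """
    return deadtime + interval + takeover_work

print(should_take_over(last_heartbeat=100.0, now=102.0))  # False: 2s of silence
print(should_take_over(last_heartbeat=100.0, now=103.5))  # True: 3.5s of silence
print(worst_case_outage(deadtime=4.0))                    # 7.0
```

With a 1 second heartbeat and a 3-4 second deadtime the bound stays under the 10 seconds mentioned above, provided the takeover work really is short.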

6.23. Why do all devices broadcast the arp replies

John Reuning (10 Apr 2003)

Why are arp replies sent for all interfaces, regardless of which interface receives the arp request?

Julian

Because Linux routing agrees that all these senders have access to this IP, so we give them access to valid link layer address. This behavior is usually observed on routers configured without source address validation enabled. As this is the default behavior specified in RFC1812 (rp_filter=0), Linux simply allows access to this IP on any interface.

arp is part of the transition from network layer to link layer, right? So why should an alias on lo, an interface that doesn't really generate network frames, trigger an arp reply. Do other unix tcp/ip

Note that these packets are not passed via the lo interface, and we do not send ARP replies via lo, so why should we care about lo's NOARP flag?

I can't seem to make a Solaris 7 system generate arp replies for an lo alias.

Different systems have different policies for IP addresses configured on the loopback device. Note that in Linux this behavior has nothing to do with the lo interface: you can configure an IP on eth1 and you will again see our ARP reply for it on eth0.

6.24. A discussion about the arp problem

(Joe and Julian)

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

There is no difference between devices in 2.2.x, all devices are reported in the ARP replies: lo, tunl and dummy. This can be tested using this configuration with any device:


Host A:
        eth:x 192.168.0.1

Host B:
        eth:x 192.168.0.2
        lo, dummy, tunl: 192.168.0.3

On host A try: ping 192.168.0.3

Host B replies for 192.168.0.3 through 192.168.0.2 device

The ARP problem means: "All local interfaces are reported" until the ARP patch is used. In fact, all ARP patches which use IFF_NOARP to hide the interface are incorrect. I don't expect them in the kernel.

ARP problem, some rules:

ARP responses

  • all local IP addresses are replied: lo, eth, tunl*, dummy* but with some exceptions (see the next rules)
  • 127.0.0.0/8(LOOPBACK) and 224.0.0.0/4(MULTICAST) are not replied
  • there is one exception for the "lo" interface: the kernel may ignore the ARP request if the source IP is from the same net as the net used to configure the "lo" alias. The specified network is treated as local.

For example:

realserver# ifconfig lo:0 192.168.1.1 netmask 255.255.255.0 broadcast 192.168.1.255 up

The realserver treats all packets with a source address from 192.168.1.0/24 which arrive on the other devices (eth0) as invalid, i.e. source address validation applies in this case and the ARP requests are not answered. The kernel thinks: "The incoming packet arrived with saddr=local_IP1 and daddr=local_IP2(VIP), so it is invalid". In this way a host on the LAN can't talk to the realserver if its lo alias is configured with netmask != 255.255.255.255

        ifconfig dummy0 192.168.1.1 netmask 255.255.255.255

registers only 192.168.1.1 as local ip but:

        ifconfig lo:0 192.168.1.1 netmask 255.255.255.0

all 256 IPs are local. All IFF_LOOPBACK devices treat all IPs as local according to the used netmask.
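The two rules above (a loopback alias makes its whole netmask local, and an ARP request whose source is local is dropped) can be combined in a small model of an unpatched 2.2.x realserver. This is a sketch of the behaviour as described in this discussion, not the kernel's code; the function names and the realserver's eth0 address 192.168.1.12 are invented.

```python
import ipaddress

def local_networks(lo_aliases, other_ips):
    """Loopback aliases make their whole network local; other devices only one address."""
    nets = [ipaddress.ip_network(cidr, strict=False) for cidr in lo_aliases]
    nets += [ipaddress.ip_network(f"{ip}/32") for ip in other_ips]
    return nets

def arp_request_is_answered(sender_ip, target_ip, nets):
    """Decide whether 'who-has target_ip tell sender_ip' gets a reply."""
    sender = ipaddress.ip_address(sender_ip)
    target = ipaddress.ip_address(target_ip)
    if any(sender in n for n in nets):
        return False                      # source validation: "from me to me", drop
    return any(target in n for n in nets)  # all local addresses are reported

# lo:0 configured with a /24 netmask, as in the first ifconfig example
wide = local_networks(["192.168.1.1/255.255.255.0"], ["192.168.1.12"])
# lo:0 configured with a /32 netmask
narrow = local_networks(["192.168.1.1/255.255.255.255"], ["192.168.1.12"])

print(arp_request_is_answered("192.168.1.2", "192.168.1.1", wide))    # False: dropped
print(arp_request_is_answered("192.168.1.2", "192.168.1.1", narrow))  # True: the arp problem
```

The /24 alias appears to cure the arp problem only because it also cuts the realserver off from every host on that network.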

Joe

I assume IFF_LOOPBACK devices are lo, lo:0..n?

Yes, currently only lo is marked as loopback. It is used to mark whole subnets as local.

lo:0 is not marked as loopback?

lo:0 is just an IP address attached to the same device "lo". You can try "ifconfig lo:0 192.168.0.1 netmask 255.255.255.255" and display the interfaces using "ifconfig". There is a LOOPBACK flag for lo:0 which is inherited from the device "lo". In Linux 2.2 all aliases inherit the device flags. Only the IFF_UP flag is used to add/delete the aliases.

Joe

Assume LVS-DR with VIP, RIPs all on the same /24 network on eth0 devices, realservers all have lo:0 with VIP/24 and have the standard 2.2.x kernel (no patches to hide interfaces). Router says "who has VIP", the arp request arrives at the realservers via eth0. Device lo:0 finds arp request which arrived on eth0 from router is on the same subnet as lo:0 and does not reply to the arp request.

Before deciding whether to answer the ARP request, the routing tables are checked, i.e. source validation of the packet is performed. If 192.168.1.2 asks "who-has 192.168.1.1 tell 192.168.1.2", the realserver assumes that this is an invalid packet, i.e. from one local IP to another local IP (from me to me => drop).

Joe

I notice that with the 2.2.x kernel lo:0 has to have netmask=255.255.255.255 to work, whereas with the 2.0.x kernels (where lo:0 doesn't reply to arp requests) lo:0 can have the VIP on a 255.255.255.0 netmask and still work.

The rule is to use netmask 255.255.255.255 and to hide lo. ARP works in a different way in 2.2. It looks up the "local" table to validate the source of the ARP request and after that it looks up the same table to check whether the daddr of the ARP request is a local IP.
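
The two lookups can be sketched as follows. This is a toy model in Python under stated assumptions, not the kernel's code: the function name arp_reply_22 and the addresses (RIP 192.168.1.2, VIP 192.168.1.1, router 192.168.1.254) are invented for the example, and the hidden flag is deliberately left out.

```python
import ipaddress

def arp_reply_22(local_nets, saddr, daddr):
    """Toy model of the 2.2 ARP decision described above (not kernel code).

    local_nets: networks in the 'local' table, e.g. ["192.168.1.1/32"].
    Step 1: source validation; a request whose source looks local is dropped.
    Step 2: reply only if the requested address (daddr) is local.
    """
    table = [ipaddress.ip_network(n, strict=False) for n in local_nets]

    def is_local(addr):
        return any(ipaddress.ip_address(addr) in net for net in table)

    if is_local(saddr):
        return "drop"      # from me to me
    if is_local(daddr):
        return "reply"
    return "ignore"        # target is not one of our addresses

router, vip = "192.168.1.254", "192.168.1.1"
# RIP on eth0, VIP on lo:0 with netmask 255.255.255.0
print(arp_reply_22(["192.168.1.2/32", "192.168.1.1/24"], router, vip))  # drop
# RIP on eth0, VIP on lo:0 with netmask 255.255.255.255
print(arp_reply_22(["192.168.1.2/32", "192.168.1.1/32"], router, vip))  # reply
```

The second case is the arp problem itself: with the /32 netmask the realserver does answer for the VIP unless the device is also hidden.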

ARP requests: - all local addresses can be used by the kernel to announce them as the source for the ARP request.

is it OK to say

the kernel can (does?) use all local addresses as the source of ARP requests

It can and does. The realserver thinks that it can use any local IP address as saddr in the ARP request and that the answer will be returned if this IP is unique in the LAN.

Joe

do you mean "the realserver will receive a reply if the s_addr is unique in the LAN"?

The realserver will receive an answer if it uses the RIP as saddr in the ARP request, either because the VIP (HIP) is hidden or, when using transparent proxy, because the VIP is not local. The realserver must know how to ask (using a unique IP) or the traffic for the requested IP (ROUTER) will be blocked.

But the hidden addresses are not used, because they are not unique (2.2.14) and the answer would be returned to the Director.

Joe

do you mean "the non-hidden VIP on the director"?

Yes, when the realserver asks "who-has ROUTER tell VIP" the ARP reply is received by the Director and transmission from the realservers stops. The ROUTER sends everything destined for the VIP to the Director. This is true for all clients on the LAN too if they are not in this cluster (if they don't handle packets for the VIP).

Joe

I would have thought that the main device on each NIC, eg eth0, eth1 would have been used as the source address.

No, it is taken from the outgoing datagram, and if the saddr is a local IP it is used. But if it is not a local IP (i.e. when using transparent proxy), or the address is marked as hidden, the main device IP is used.

Joe

how is arping part of transparent proxy?

It is not. When the VIP is not a local IP address on the realserver, this IP is not used by the ARP code. It is not in the "local" table. But TCP, UDP and ICMP use it via transparent proxy support.

They are extracted from the outgoing packet.

Joe

what is "They"? the source addresses? When you say "extracted", do you mean "removed from packet" or "looked at/detected"

The saddr from the data packet is used to build the ARP request.

We tell the kernel that these addresses are not unique by setting <interface>/hidden=1 (starting with kernel 2.2.14). In this way the kernel selects the device's primary IP as the source of the ARP request.

Joe

the kernel can use any local address as s_addr but the code for hiding IPs from arp requests prevents the kernel from using hidden addresses as s_addr in an arp request?

Yes, the code to hide the addresses is already part of the source address autoselection (saddr in the ARP request in our case). We never autoselect hidden addresses, i.e. when the source address is not specified by the higher level. The code to hide an interface:

- doesn't reply to ARP requests for hidden local addresses
- doesn't select hidden local addresses as the source of an ARP request
- doesn't autoselect hidden local addresses at the IP level
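
The choice of source address for an ARP request, as described in this exchange, can be sketched like this. It is a toy model in Python, not the kernel's code; the function name arp_request_source and the example addresses are assumptions made for illustration.

```python
def arp_request_source(packet_saddr, local_ips, hidden_ips, primary_ip):
    """Pick the source IP for an ARP request (toy model, not kernel code).

    packet_saddr: saddr of the outgoing datagram that triggered the ARP.
    local_ips:    addresses in the 'local' table of this host.
    hidden_ips:   local addresses that sit on devices marked hidden.
    primary_ip:   primary IP of the outgoing device (e.g. the RIP).
    """
    if packet_saddr in local_ips and packet_saddr not in hidden_ips:
        return packet_saddr
    # Not local (transparent proxy) or hidden: fall back to the primary IP.
    return primary_ip

vip, rip = "192.168.1.110", "192.168.1.2"  # hypothetical addresses

# VIP local and not hidden: "who-has ROUTER tell VIP", reply goes to the Director
print(arp_request_source(vip, {vip, rip}, set(), rip))  # 192.168.1.110
# VIP hidden: "who-has ROUTER tell RIP"
print(arp_request_source(vip, {vip, rip}, {vip}, rip))  # 192.168.1.2
# VIP not local (transparent proxy): RIP again
print(arp_request_source(vip, {rip}, set(), rip))       # 192.168.1.2
```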

Joe

When you say "We expect it is uniq in the LAN" do you mean - we expect you've set up your network properly and that you don't have the same RIP on 2 realservers? :-)

The LVS administrator must ensure that the RIPs are unique; only the VIP is shared. We tell the kernel that the VIP addresses are not unique by setting <interface>/hidden=1 (2.2.14). In this way the kernel selects the device's primary IP as the source of the ARP request. We expect it to be unique in the LAN.

So, the recommendation for using the "lo" interface in the real servers is:

- use netmask 255.255.255.255 when configuring the lo alias. In this way source validation doesn't drop the incoming packets to this IP. LVS users usually define the net route through the eth interface, so we can talk to other hosts on this network, for example to send packets to the client through the default gateway. There is no need to configure the alias with mask != 255.255.255.255

So, the interfaces which can be used in the realservers to listen for VIP are:

- lo aliases with netmask 255.255.255.255
- tunl*
- dummy*

All these devices must be marked as hidden to solve the ARP problem when using Linux 2.2.
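
As a rough illustration of the recommendation, the following Python sketch only builds and prints the shell commands for a realserver; it does not run them. The function name realserver_vip_commands and the example VIP are made up. The /proc paths follow the <interface>/hidden naming used above; setting conf/all/hidden as well as the per-device flag, and setting the flags before bringing the address up, are assumptions of this sketch and apply only to kernels that have the hidden flag (2.2.14 and later).

```python
def realserver_vip_commands(vip, device="lo:0"):
    """Return shell commands to put a hidden VIP on a realserver.

    Nothing is executed; the caller decides what to do with the strings.
    The hidden flag belongs to the underlying device, so an alias such
    as lo:0 maps to the device lo.
    """
    base = device.split(":")[0]
    return [
        # Assumption: the global switch must be on for per-device flags to apply.
        "echo 1 > /proc/sys/net/ipv4/conf/all/hidden",
        f"echo 1 > /proc/sys/net/ipv4/conf/{base}/hidden",
        # A host netmask keeps source validation from dropping LAN traffic.
        f"ifconfig {device} {vip} netmask 255.255.255.255 up",
    ]

for command in realserver_vip_commands("192.168.1.110", "lo:0"):
    print(command)
```

The same call with device="tunl0" or device="dummy0" covers the other two kinds of interface in the list.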

In the Director: there is no problem configuring the VIP even on a lo alias or dummy interface. If the interface is not marked as hidden, this VIP is visible to all hosts on the LAN.

6.25. ATM/ethernet and router problems

LVS has only been tested on ethernet. One person had an ATM setup which didn't work with LVS-DR as the ATM router expects packets from the VIP to have the same MAC address (in LVS-DR packets coming from the VIP could have the MAC address of any of the realservers). Apparently this is not easily fixable in the ATM world. It should be possible to use Julian's martian modification to make LVS-DR work on ATM, but the person with the ATM setup disappeared off the mailing list without us convincing him of the joy in having the first ATM LVS.

Other people have found similar problems with ethernet -

Kyle Sparger ksparger (at) dialtoneinternet (dot) net

I don't know if someone has gone over this, but here's a consideration I've come across when setting up LVS in DR mode:

When the realservers reply, cisco routers (ours do, at least) will pick up on the fact that it's replying from a different MAC address, and will start arping soon thereafter. This is sub-optimal, as it causes a constant flood of arp requests on the network. Our solution has been to hardcode the MAC address into the router, but this can cause other issues, for example during failover. That can be worked around, as you can set the MAC address on most cards, but that in itself may cause other issues.

Has anyone else experienced this? Has anyone else come up with a better solution than hardcoding it into the router?

It should be possible to have the reply packets from the VIP come from a virtual MAC address (such as created by vrrpd), in which case all replies coming to the same port in a router from the VIP will have the same MAC address. No-one seems to be interested in writing the code to do this.

6.26. Same IP on multiple NICs

Bonnet Sebastien (dot) Bonnet (at) experian (dot) fr 2002-04-16

I'm setting up with LVS-DR. To allow a node to be both a realserver and a backup director, I have eth0:2 being the VIP, because at this point, "backup-and-node" is the director. But when it's not, I still need VIP to be setup on lo:1 to use "backup-and-node" as a realserver. I end up with the following config :

[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/all/hidden
1
[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/lo/hidden
1
[root@backup-and-node root]# cat /proc/sys/net/ipv4/conf/eth0/hidden
0

[root@backup-and-node root]# ifconfig
eth0    Link encap:Ethernet  HWaddr 00:40:05:5C:C2:04
        inet addr:172.22.48.208  Bcast:172.22.63.255  Mask:255.255.240.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

eth0:2  Link encap:Ethernet  HWaddr 00:40:05:5C:C2:04
        inet addr:172.22.48.212  Bcast:172.22.63.255  Mask:255.255.240.0
        UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1

lo      Link encap:Local Loopback
        inet addr:127.0.0.1  Mask:255.0.0.0
        UP LOOPBACK RUNNING  MTU:16436  Metric:1

lo:1    Link encap:Local Loopback
        inet addr:172.22.48.212  Mask:255.255.255.255
        UP LOOPBACK RUNNING  MTU:16436  Metric:1

The problem is that when VIP is setup on both lo:1 and eth0:2, "backup-and-node" will not answer *any* ARP request for VIP, whereas it should via eth0 (as far as I understand the purpose of the hidden feature).

Julian

The problem is that this setup is ambiguous. The kernel doesn't know which device you are using for primary and which for secondary IPs. Device lo is a valid device for primary IPs. It is not allowed to define one IP as both a primary and a secondary one. Yes, lo is first in the device list and we search for a hidden IP in _any_ device. We don't have a preferred device to start from. Yes, this is a limitation that nobody wants to fix. Someone will have to persuade me with a clear fix for this.
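
Julian's explanation can be illustrated with a toy model in Python. This is not the kernel's code: the function name answers_arp_for and the tuple layout are invented for the sketch, which models only the statement that a hidden IP is searched for in any device, without regard to the device the request arrived on.

```python
def answers_arp_for(ip, devices):
    """Toy model of the limitation described above (not kernel code).

    devices: list of (name, hidden, [addresses]) in device-list order.
    An address found on any hidden device is treated as hidden,
    whichever device the ARP request arrived on.
    """
    configured = any(ip in addrs for _, _, addrs in devices)
    hidden = any(is_hidden and ip in addrs for _, is_hidden, addrs in devices)
    return configured and not hidden

vip, rip = "172.22.48.212", "172.22.48.208"

# VIP on both lo:1 (lo is hidden) and eth0:2 (eth0 is not hidden)
both = [("lo", True, ["127.0.0.1", vip]), ("eth0", False, [rip, vip])]
print(answers_arp_for(vip, both))           # False: no ARP reply for the VIP
print(answers_arp_for(rip, both))           # True

# VIP only on eth0:2, as on an active director
director_only = [("lo", True, ["127.0.0.1"]), ("eth0", False, [rip, vip])]
print(answers_arp_for(vip, director_only))  # True
```

In this model the node answers ARP for the VIP only after the VIP has been removed from lo:1, which matches the behaviour Sebastien reports.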

Joe

I'm surprised you're allowed to have the same IP on two different devices. Is there a reason why you'd want to do this, or is it just not forbidden and therefore allowed (I believe this is called the American philosophy)?

Horms

It is actually something you may want to do. Imagine you have a dialup server, 192.168.0.1, which sits on the 192.168.0.0/24 network. Now each dialup user is going to get their own IP address, but 192.168.0.0/24 is your server network, so these IP addresses are on a different network, let's say 10.0.7.0/24. Now when the dialup users come in, there is no need for the dialup-server to have an address on the 10.0.7.0/24 network, as it is just a point-to-point link, so you can have, for instance:

[client]<-------->[dialup-server]
10.0.7.7          192.168.0.1
ppp0              ppp0

But the dialup-server already has 192.168.0.1 on eth0. Thus you have the same IP address on multiple interfaces. In fact it would have the same IP address on eth0 and each of the ppp interfaces.