7. LVS: LVS-DR

LVS-DR is based on IBM's NetDispatcher. The NetDispatcher sits in front of a set of webservers, which appear as one webserver to the clients. The NetDispatcher served http for the Atlanta and the Sydney Olympic games and for the chess match between Kasparov and Deep Blue.

When the packet CIP->VIP arrives at the director, it is put into the OUTPUT chain as a layer 2 packet with dest = MAC address of the realserver. This bypasses the routing problem of a packet with dest = VIP, where the VIP is local to the director. When the packet arrives at the realserver, the realserver finds that it is addressed to an IP which is local to it (the VIP) and accepts it.

7.1. LVS-DR example

Here's an example set of IPs for an LVS-DR setup. In this example, the RIPs are on the same network as the VIP (a one-network LVS-DR). For (my) convenience, the realservers are on the same network as the router connecting to the client, so you have to handle the arp problem (I used the arp -f /etc/ethers approach).

Host                         IP
client                       CIP=192.168.1.254
director                     DIP=192.168.1.1
virtual IP (VIP)             VIP=192.168.1.110 (arpable, IP clients connect to)
realserver1                  RIP1=192.168.1.2, VIP=192.168.1.110 (lo:0, not arpable)
realserver2                  RIP2=192.168.1.3, VIP=192.168.1.110 (lo:0, not arpable)
realserver3                  RIP3=192.168.1.4, VIP=192.168.1.110 (lo:0, not arpable)
.
.
realserver-n                 192.168.1.n+1

#lvs_dr.conf
LVS_TYPE=VS_DR
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.255 192.168.1.110
DIP=eth0 dip 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1 realserver2 realserver3
SERVER_VIP_DEVICE=lo:0
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=client
#----------end lvs_dr.conf------------------------------------

                           ________
                          |        |
                          | client |
                          |________|
                          CIP=192.168.1.254
                              |
                   CIP->VIP | |
                            v |
                              |
                           ________
                          |        |
                          | router | advertises route to VIP
                          |________|
                              |
                __________    |
               |          |   |    VIP=192.168.1.110 (eth0:1, arps)
               | director |---     DIP=192.168.1.1 (eth0)
               |__________|   |
                              |  ^
MAC_DIP->MAC_RIP1(CIP->VIP) | |  |  VIP->CIP
                            v |
                              |
             -------------------------------------
             |                |                  |
             |                |                  |
      RIP1=192.168.1.2  RIP2=192.168.1.3  RIP3=192.168.1.4 (eth0)
      VIP=192.168.1.110 VIP=192.168.1.110 VIP=192.168.1.110 (all lo:0, non-arping)
      _____________     _____________      _____________
     |             |   |             |    |             |
     |   CIP->VIP  |   |             |    |             |
     |   VIP->CIP  |   |             |    |             |
     | realserver  |   | realserver  |    | realserver  |
     |_____________|   |_____________|    |_____________|

Here's another example lvs_dr.conf file (with a different set of IPs):

#--------------------lvs_dr.conf
LVS_TYPE=VS_DR
INITIAL_STATE=on

#director setup
VIP=eth0:12 192.168.1.110 255.255.255.255 192.168.1.110
DIP=eth0 192.168.1.10 192.168.1.0 255.255.255.0 192.168.1.255
#service setup, one service at a time
SERVICE=t telnet rr 192.168.1.1 192.168.1.8 127.0.0.1

#realserver setup
SERVER_LVS_DEVICE=lo0:1
SERVER_NET_DEVICE=eth0

#----------end lvs_dr.conf------------------------------------
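If you're setting this up by hand instead of with the configure script, the director side of the one-network example above corresponds roughly to the following commands (a sketch only: the device alias eth0:110, netmask and weights are taken from the conf files above, and your ipvsadm version may differ slightly).

#director: bring up the VIP (this one arps) and add the telnet service
director:/etc/lvs# ifconfig eth0:110 192.168.1.110 netmask 255.255.255.255 broadcast 192.168.1.110 up
director:/etc/lvs# ipvsadm -A -t 192.168.1.110:23 -s rr
director:/etc/lvs# ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.2 -g -w 1
director:/etc/lvs# ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.3 -g -w 1
director:/etc/lvs# ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.4 -g -w 1

The matching realserver side (the non-arping VIP on lo:0) is sketched in the next section.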

Here's how you'd set up a two-network LVS-DR. Note that the router receives packets on port R from both the RIP and VIP, which are in different networks. Once you've solved the arp problem, the router will send packets to the VIP only on port D.

                           ________
                          |        |
                          | client |
                          |________|
                          CIP=192.168.1.254
                              |
                   CIP->VIP | |
                            v |
                              |
                           ________
                          |        | R
                          | router |------------- 
                          |________|             | 
                              | D                |
                              |                  |
                 VIP=192.168.1.110 (eth0:1, arps)|
                          __________             |
                         |          |            |
                         | director |            |
                         |__________|            |
                        DIP=10.0.1.1 (eth1)      | 
                              |                  |  ^             
MAC_DIP->MAC_RIP1(CIP->VIP) | |                  |  |  VIP->CIP 
                            v |                  |
                              |                  |
             -------------------------------------
             |                |                  |
             |                |                  |
      RIP1=10.0.1.2     RIP2=10.0.1.3     RIP3=10.0.1.4 (eth0)
      VIP=192.168.1.110 VIP=192.168.1.110 VIP=192.168.1.110 (all lo:0, non-arping)
     ______________     _____________      _____________
    |              |   |             |    |             |
    |lo:  CIP->VIP |   |             |    |             |
    |eth0:VIP->CIP |   |             |    |             |
    | realserver   |   | realserver  |    | realserver  |
    |______________|   |_____________|    |_____________|

LVS-DR setup and testing is the same as for LVS-Tun, except that all machines within the LVS-DR (ie the director and realservers) must be on the same segment (be able to arp each other). This means that there must be no forwarding devices between them, i.e. they are using the same piece of transport layer hardware ("wire"), eg RJ-45, coax, fibre (there can be hub(s) or switch(es) in the mix). Communication within the LVS is by link-layer, using MAC addresses rather than IPs. All machines in the LVS have the VIP: only the VIP on the director replies to arp requests; the VIP on the realservers must be on a non-arping device (eg lo:0, dummy).

The restrictions for LVS-DR are

  • The client must be able to connect to the VIP on the director
  • Realservers and the director must be on the same segment (piece of wire; they must be able to arp each other), as packets are sent by link-layer from the director to the realservers.
  • The route from the realservers to the client _cannot_ go through the director, i.e. the director cannot be the default gw for the realservers. (Note: the client does not connect directly to the realservers for the LVS to function. The realservers could be behind a firewall, but they must be able to send packets to the client.) The return packets, from the realservers to the client, go directly from the realservers to the client and _do_not_ go back through the director. For high throughput, each realserver can have its own router/connection to the client/internet, and the return packets need not go through the router feeding the director.

For more info see e-mail postings about LVS-DR topologies in the section More on the arp problem and topologies of LVS-DR and LVS-Tun LVS's.

To allow the director to be the default gw for the realservers (e.g. when the director is the firewall), see martian modification.

Note that for LVS-DR (and LVS-Tun), the services on the realservers are listening on the VIP. You can have the service listening on the RIP as well, but the LVS needs the service to be listening on the VIP. This is not an issue with services like telnet which listen on all local IPs (ie 0.0.0.0), but httpd is set up to listen only on the IPs that you tell it to.
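For example, with Apache httpd you'd make sure the VIP appears in the Listen directives on each realserver. A sketch only, assuming Apache and the IPs from the example above:

#httpd.conf on a realserver: listen on the VIP (listening on the RIP too is optional)
Listen 192.168.1.110:80
Listen 192.168.1.2:80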

Normally for LVS-DR, the client is on a different network to the director/server(s), and each realserver has its own route to the outside world. In the simple test case below, where all machines are on the 192.168.1.0 network, no routers are required, and the return packets, instead of going out (the router(s)) at the bottom of the diagram, would return to the client via the network device on 192.168.1.0 (presumably eth0).

7.2. How LVS-DR works

Here's part of the rc.lvs_dr script, run on the director, which sets up forwarding of telnet to the realserver with RIP=192.168.1.6

#setup servers for telnet, LVS-DR
director:/etc/lvs# /sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
director:/etc/lvs# echo "adding service 23 to realserver 192.168.1.6 "
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -R 192.168.1.6 -g -w 1
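
The matching realserver side isn't shown in this fragment. On a realserver it typically looks something like the sketch below (the hidden flags exist only on 2.2.x kernels, or 2.4.x kernels with the hidden patch applied; see the arp problem section below):

#setup realserver for telnet, LVS-DR
realserver:~# ifconfig lo:0 192.168.1.110 netmask 255.255.255.255 broadcast 192.168.1.110 up
#stop the realserver answering arp requests for the VIP (hidden patch)
realserver:~# echo 1 >/proc/sys/net/ipv4/conf/all/hidden
realserver:~# echo 1 >/proc/sys/net/ipv4/conf/lo/hidden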

There's no forwarding in the conventional sense for LVS-DR (ip_vs does the forwarding of the LVS packets on the director). You can have ip_forward set to ON if you need it for something else, but LVS-DR doesn't need it ON. If you don't have a good reason to have it ON, then for security turn it OFF. For more explanation see the design of ipvs for netfilter.

#set ip_forward OFF for lvs-dr director (1 on, 0 off)
cat       /proc/sys/net/ipv4/ip_forward
echo "0" >/proc/sys/net/ipv4/ip_forward

With LVS-DR, the target port numbers of incoming packets cannot be remapped (unlike LVS-NAT). A request to port 23 (telnet) on the VIP will be forwarded to port 23 on a realserver, thus the RIP entry for the realserver in ipvsadm has no accompanying port. You can however re-map ports with iptables (see Re-mapping ports with LVS-DR).
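As a taste of what the re-mapping looks like, it is done on the realserver along these lines (a sketch only, assuming iptables on a 2.4.x realserver and a daemon actually listening on port 8080; see the re-mapping section for the real details):

#on the realserver: packets forwarded by the director to VIP:80 are redirected to local port 8080
realserver:~# iptables -t nat -A PREROUTING -d 192.168.1.110 -p tcp --dport 80 -j REDIRECT --to-ports 8080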

Here's the packet headers as the request is processed by the LVS.

packet                  source        dest         data
1. request from client  CIP:3456      VIP:23       -
2. ipvsadm table:
   director chooses server=RIP1, creates link-layer packet
                        MAC of DIP    MAC of RIP1  IP datagram
                                                   source=CIP:3456,
                                                   dest=VIP:23,
                                                   data= -
3. realserver recovers IP datagram
                        CIP:3456      VIP:23       -
4. realserver looks up routing table, finds VIP is local,
   processes request locally, generates reply
                        VIP:23        CIP:3456     "login:"

5. packet leaves realserver via its default gw, not via DIP.

For the verbally oriented...

A packet arrives from the client for the VIP (CIP:3456->VIP:23). The director looks up its tables and decides to send the connection to realserver_1. The director arps for the MAC address of RIP1 and sends a link-layer packet to that MAC containing an IP datagram with CIP:3456->VIP:23. This is the same src:dst as the incoming packet, so the tcpip layer sees it as a forwarded packet. It is not necessary for ip forwarding to be on in the director for this packet to be sent to the realserver (ip_forward is off by default in 2.2.x and 2.4.x kernels; the configure script handles this setting for you).

The packet arrives at realserver_1. The realserver recovers the IP datagram, looks up its routing table, finds that the VIP (on an otherwise unused, non-arping and nonfunctional device) is local.

I'm not sure what exactly happens next, but I believe the Linux tcpip stack then delivers the packet to the socket listeners, rather than to the device with the VIP, but I'm out of my depth now.

The realserver now has a packet CIP:3456->VIP:23, processes it locally, constructs a reply, VIP:23->CIP:3456. The realserver looks up its routing table and sends the reply out its default gw to the internet (or client). The reply does not go through the director.

The role of LVS-DR is to allow the director to deliver a packet with dst=VIP (the only arp'ing VIP being on the director), not to itself, but to some machine that (as far as the director knows) doesn't have the VIP address at all. The only difference between LVS-DR and LVS-Tun is that instead of putting the IP datagram inside a link-layer packet with dst=MAC of the RIP, for LVS-Tun the IP datagram from the client (CIP->VIP) is put inside another IP datagram (DIP->RIP).

The use of the non-arping lo:0 and tunl0 to hold the VIP for LVS-DR and LVS-Tun (respectively) is to allow the realserver's routing table to have an entry for a local device with IP=VIP _AND_ so that other machines can't see this IP (ie it doesn't reply to arp requests). There is nothing particularly loopback about the lo:0 device that is required to make LVS-DR work, any more than there is anything tunnelling about a tunl0 device. For 2.0.x kernels, a tunnel packet is decapsulated because it is marked type=IPIP, and will be decapsulated if delivered to an lo device just as well as if delivered to a tunl device. The 2.2.x kernels are more particular and need a tunl device (see "Properties of devices for VIP").
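
A quick sanity check you can run on a realserver (the exact output varies with the iproute2/net-tools version; the point is that the kernel must regard the VIP as a local address):

#the VIP should show up as a local route (via the lo device) on the realserver
realserver:~# ip route get 192.168.1.110
realserver:~# ifconfig lo:0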

7.3. Handling the arp problem for LVS-DR

7.3.1. VIP on lo:0

The VIP on the realservers must not reply to arp requests from the client (or from the router between the client and the director).

7.3.1.1. Realservers with Linux 2.2.x kernels

On most OSs the loopback device does not reply to arp requests. On Linux 2.2.x and 2.4.x kernels, however, the machine will answer arp requests for an IP held on lo (even when you use -noarp with ifconfig), so you need to do something about this if you are running a realserver with a 2.2.x or 2.4.x kernel (see The Arp Problem).
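
The arp -f /etc/ethers approach mentioned earlier amounts to installing a permanent arp entry for the VIP on the client (or on the router in front of the director), pointing at the MAC address of the director's outside NIC. A sketch (the MAC address here is made up):

#on the client/router: force the VIP to resolve to the director's MAC
client:~# arp -s 192.168.1.110 00:50:56:ab:cd:ef
#or keep such entries in a file and load them all with
client:~# arp -f /etc/ethers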

7.3.2. Lars' method

This requires hiding the VIP on the realservers, by putting them on a separate network.

Lars set this up first on LVS-Tun. Here it is for LVS-DR. The director has 2 NICs and the realservers are on a different network (10.1.1.0/24) to the VIP (192.168.1.0/24). All IPs reply to arps. The router/client cannot route to the realserver network and the RIPs do not need to be internet routable. Since the director has 2 NICs, in the lvs_dr.conf file, set the DIP to eth1.

                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                        (router)
                           |
                 VIP=192.168.1.110 (eth0, arps)
                      __________
                     |          |
                     | director |
                     |__________|
                     DIP=10.1.1.1 (eth1, arps)
                           |
                           |
          -------------------------------------
          |                |                  |
          |                |                  |
   RIP1=10.1.1.2     RIP2=10.1.1.3     RIP3=10.1.1.4 (eth0)
   VIP=192.168.1.110 VIP=192.168.1.110 VIP=192.168.1.110 (all lo:0, can arp)
   _____________     _____________      _____________
  |             |   |             |    |             |
  | realserver  |   | realserver  |    | realserver  |
  |_____________|   |_____________|    |_____________|
          |                |                  |
      (router)          (router)           (router)
          |                |                  |
          ----------------------------------------------> to client
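
In the lvs_dr.conf format used earlier, the director part of Lars' setup would look something like this sketch (the field layout is copied from the example conf files above; the IPs are the ones in the diagram):

#director setup for Lars' method: VIP on the outside NIC, DIP on the realserver network
VIP=eth0 192.168.1.110 255.255.255.255 192.168.1.110
DIP=eth1 10.1.1.1 10.1.1.0 255.255.255.0 10.1.1.255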

7.3.3. Transparent Proxy (TP or Horms' method) - not having the VIP on the realserver at all.

The subject of Transparent Proxy (Horms' method) has its own section.

7.4. LVS-DR scales well

Performance tests (75MHz pentium classics, on a 100Mbps network) with LVS-DR on the performance page (http://www.linuxvirtualserver.org/Joseph.Mack/performance/single_realserver_performance.html) showed the rate limiting step for LVS-DR to be the director forwarding packets to the realservers. LVS doesn't add any detectable latency or change the throughput of forwarding, and there is little load on the director operating at high throughput in LVS-DR mode. Apparently little computation is involved in forwarding.

In the early days of LVS, we expected the director in LVS-DR to be lightly loaded because it was receiving only small packets from the client (e.g. get index.html or get largefile.tar.gz) while the realservers were delivering the large files to the client via their router. We expected that the fan-out (number of realservers handled by a director) would be in the ratio of the filesize sent to the client compared to the requestsize from the client. It is true that the measured bandwidth coming in to the director is smaller than the output from the realservers.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 06/06/2005

I have 38 realservers behind my director, incoming traffic (to the director) goes up to 20Mb/s, outgoing (from the realservers, LVS-DR setup) up to 60Mb/s. I have about 1200 sites hosted. 36 virtual_server entries in keepalived.conf, 30 VIPs. There's no noticeable load on the poor PIII/700 director that's handling the traffic.

However we have since realised that network hardware is specified in packets/sec and not Mbps (see 8000pps) and that every outgoing packet from a realserver is matched by an incoming packet to the director (possibly just an <ack>). The director then is passing the same number of network packets as all the realservers together. Once the incoming network traffic to the director reaches 8000pps (for 100Mbps FE), the director is saturated. LVS-DR does get good fan-out (one director supporting many realservers) but the reason is not the one we originally thought. The good fan-out is because the director only has to handle network traffic, while a realserver may have to go to disk or to compute before it can produce its packets. The fan-out then is the ratio of time that the realservers need to produce the packet payload, compared to the time it takes to transmit them.

7.5. LVS-DR director as default gw for realservers, transparent proxy and Julian's martian and forward_shared patches

In the case where the director is the firewall for the realserver network, the director has to be the default gw for the realservers. The reply packet from the realserver to the client (VIP->CIP) then goes through the director (which has a device with IP=VIP). The director then is being asked to route a packet from outside, that has a src address that is on the director. Normally this is not allowed and such illegal packets are called martians.

Here's the relevant kernel code and the text from rfc1812, posted by Ken Chase math (at) velocet (dot) ca 14 May 2003 to the beowulf mailing list.

1609 martian_source:
1610
1611         rt_cache_stat[smp_processor_id()].in_martian_src++;
1612 #ifdef CONFIG_IP_ROUTE_VERBOSE
1613         if (IN_DEV_LOG_MARTIANS(in_dev) && net_ratelimit()) {
1614                 /*
1615                  *      RFC1812 recommendation, if source is martian,
1616                  *      the only hint is MAC header.
1617                  */
1618                 printk(KERN_WARNING "martian source %u.%u.%u.%u from "
1619                         "%u.%u.%u.%u, on dev %s\n",
1620                         NIPQUAD(daddr), NIPQUAD(saddr), dev->name);
1621                 if (dev->hard_header_len) {
1622                         int i;
1623                         unsigned char *p = skb->mac.raw;
1624                         printk(KERN_WARNING "ll header: ");
1625                         for (i = 0; i < dev->hard_header_len; i++, p++) {
1626                                 printk("%02x", *p);
1627                                 if (i < (dev->hard_header_len - 1))
1628                                         printk(":");
1629                         }
1630                         printk("\n");
1631                 }
1632         }
1633 #endif
1634         goto e_inval;
1635 }
1636


5.3.7 Martian Address Filtering

   An IP source address is invalid if it is a special IP address, as
   defined in 4.2.2.11 or 5.3.7, or is not a unicast address.

   An IP destination address is invalid if it is among those defined as
   illegal destinations in 4.2.3.1, or is a Class E address (except
   255.255.255.255).

   A router SHOULD NOT forward any packet that has an invalid IP source
   address or a source address on network 0.  A router SHOULD NOT
   forward, except over a loopback interface, any packet that has a
   source address on network 127.  A router MAY have a switch that
   allows the network manager to disable these checks.  If such a switch
   is provided, it MUST default to performing the checks.

   A router SHOULD NOT forward any packet that has an invalid IP
   destination address or a destination address on network 0.  A router
   SHOULD NOT forward, except over a loopback interface, any packet that
   has a destination address on network 127.  A router MAY have a switch
   that allows the network manager to disable these checks.  If such a
   switch is provided, it MUST default to performing the checks.

   If a router discards a packet because of these rules, it SHOULD log
   at least the IP source address, the IP destination address, and, if
   the problem was with the source address, the physical interface on
   which the packet was received and the Link Layer address of the host
   or router from which the packet was received.

  Martian Filtering
        A packet that contains an invalid source or destination address
        is considered to be martian and discarded.

Horms horms (at) vergenet (dot) net

The problem is that with Direct routing the reply from the real server has the vip as the source address. As this is an address of one of the interfaces on the director it will drop it if you try and forward it through the director. It appears from experimentation with /proc/sys/net/ipv4/conf/*/rp_filter that at least on 2.2.14, there is no way to turn this behaviour off. (for more info on rp_filter see the /proc filesystem flags.)

This type of packet is called a "source martian" and is dropped by the director. martians can be logged with

# echo 1 >/proc/sys/net/ipv4/conf/all/log_martians

There are 3 solutions to this; 2 by Julian and 1 by Horms.

7.5.1. Director has 1 NIC, accepts packets via transparent proxy.

If the director accepts packets for the VIP via transparent proxy, then the director doesn't have the VIP and the return packets are processed normally. (Note: transparent proxy only works on the director for 2.2.x kernels - update early 2003, patches are now available for 2.4.x kernels).

Here's Julian's posting

                Clients
                   |
                  ISP
                   |eth0/ppp0/...
                Router/Firewall/Director (LVS box)
                   |eth1
        +----------+------------+
        |eth0                   |eth0
        Real 1                  Real2

Router/Director (the LVS box): transparent proxy for the VIP (or all served VIPs). The ISP must feed your Director with packets for your subnet 199.199.199.0/24. LVS-DR mode (yes, LVS-DR, this is not a mistake). eth1: 199.199.199.2. Default gw is the ISP.

Real server(s): nothing special. VIP on hidden device or via transparent proxy. eth0: 199.199.199.3. default gateway is 199.199.199.2 (the Director)

This is a minimum required config. You can add internal subnets yourself using the same physical network (one NIC) or by adding additional NICs, etc. They are not needed for this test.

Packets from the realservers with saddr=VIP will be forwarded from the director because VIP is not configured in the Director. We expect that this setup is faster than VS/NAT.

7.5.2. Julian's martian modification (forward_shared)

In normal LVS-DR, the packets returning from the realservers (which have src_addr=VIP) are routed anywhere but through the director. Normally packets with src_addr=VIP are rejected as source martians on the director, because the director has the VIP as a local IP. The martian modification patch allows the director to be the default gw for packets from the realservers. Since packets with src_addr=VIP are now allowed from the realservers, spoofed packets from the outside, with src_addr=VIP, must be disallowed (you can use filter rules in combination with the name of the NIC connecting to the router - assuming it's a different NIC to the one connecting to the realservers).

The original name I gave this patch is "martian modification". Julian's name for it is "forward_shared". Both names are used in the HOWTO.

To download the patches and to read Julian's notes on using the director as a gateway for realserver in LVS-DR/Tun, see LVS director as gateway in Direct Routing and Tunnel Setups. (The dates on the files are the creation dates, not the last modified. Thus the file for the 2.4.26 kernel, current in May 2004, has a date in 2001.)

Also see Transparent Bridging for an alternate solution to the martian problem.

The martian modification is currently (since Aug 2001) implemented with the hidden-forward_shared-xxx.diff patch. This patch contains both the hidden (for realservers) and forward_shared (for directors) patches and can be applied to both realservers and directors. (Remember that for the director you need the ipvs patch too.) The forward_shared patch will not be going into the kernel code (you'll always have to apply the patch) as some kernel people don't like the idea of allowing source martian packets.

This is a kernel patch, director has 2 NICs (doesn't work with one NIC), VIP is on outside NIC.

After applying the patch, for a test, use the default values for */rp_filter(=0). This allows realservers to send packets with saddr=VIP and daddr=client through the Director.

If the patch is applied and external_eth/rp_filter is 0 (which is the default) the realservers can receive packets with saddr=any_director_ip and dst=any_RIP_or_VIP which is not very good. On the external net, set rp_filter=1 for better security.
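
With (say) eth0 facing the router and eth1 facing the realservers, the settings on a patched director look like this sketch (the interface names are an assumption; the forward_shared entries only exist once the patch is applied):

#allow the director to forward packets from the realservers with saddr=VIP (a local IP)
director:~# echo 1 >/proc/sys/net/ipv4/conf/all/forward_shared
director:~# echo 1 >/proc/sys/net/ipv4/conf/eth1/forward_shared
#as noted above, rp_filter can then be turned on for the external net (see the /proc filesystem flags section)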

Here's the test setup

             ____________
            |            |
            |  client    |
            |____________|
                  |
                  |  192.168.2.0/24
             _____|______
            |            |
            |  director  | LVS-DR director has 2 NICs
            |____________|
                  | eth0    192.168.1.9
                  | eth0:12 192.168.1.1
                  |
                  |  192.168.1.0/24
             _____|____________________
            |
            |
       _____|__________
      |                |
      | realserver(s)  | default gw=192.168.1.1
      |________________|

192.168.1.1 is the normal router. For the test it was put on the director instead (as an alias). The director has 2 NICs, with forwarding=on (client and realservers can ping each other).

Director runs linux-0.9.8-2.2.15pre9 unpatched or with Julian's patch. LVS is setup using the configure script, redirecting telnet, with rr scheduling to 3 realservers. The realservers were running 2.0.36 (1) or 2.2.14 (2). The arp problem was handled for the 2.2.14 realservers by permanently installing in the client's arp table, the MAC address of the NIC on the outside of the director, using the command arp -f /etc/ethers.

The director was booted 4 times, into unpatched, patched, unpatched and patched. After each reboot the lvs scripts were run on the director and the realservers, then the functioning of the LVS tested by telnet'ing multiple times from the client to the VIP.

For the unpatched kernel, the client connection hung and inactive connections accumulated for each realserver. For the patched kernel, the client telnet'ed to the VIP, connecting with each realserver in turn.

The configure script will set up the modified LVS-DR (and will warn you that you need the patch for this to work). Setup details are on the performance page.

7.5.2.1. Martian modification performance

With the martian modification, the latency is similar to LVS-NAT, but the load on the director stays low at high throughput, as with normal LVS-DR (see the performance page).

7.5.2.2. questions

mstockda (at) logicworks (dot) net

Which interfaces need forward_shared? The interface on the realserver LAN _and_ the external side?

Julian Anastasov ja (at) ssi (dot) bg 15 Mar 2002

No, you just enabled the feature which works only for the already selected interfaces. Check it with

ip route get from VIP to 1.2.3.4 iif CHECK_ALL_INTERFACES_HERE

You should enable forward_shared only for interfaces attached to internal media (hubs) and, of course, only where it is needed.

#define IN_DEV_FORWARD_SHARED(in_dev)   ((in_dev)->cnf.forward_shared && ipv4_devconf.forward_shared)

7.5.3. LVS-DR director as default gw by bridging: the difference between "transparent proxy" and "proxy arp"

proxy arp and bridging were discussed in the early days of LVS as a way of allowing the director to be the default gw for LVS-DR. The subject came up again in a thread on another topic. Also see Transparent Bridging.

Nicolas Chiappero Nicolas (dot) Chiappero (at) estat (dot) com 28 Jan 2003

Are "proxy arp" and "transparent proxy" the same thing?

Joe

Both allow routing of packets in ways not allowed by the normal routing tables. TP allows a machine to accept (rather than forward) packets for which it is not the destination. This originally was written so that a local squid would accept packets destined for a remote httpd server.

Proxy arp allows a host to reply to arp requests, telling the requestor that it has an IP locally, when in fact the IP is on another machine. This is useful for altering routing (eg for transparent bridging).

Julian

- proxy ARP is used when the traffic should be routed at Layer 3 with help from ARP. The packets reach the routing after the box answers ARP probes asking for foreign addresses.

- transparent proxy has mostly Layer 5-7 semantics; it is used to intercept traffic destined to foreign addresses and to deliver it to sockets.

- If so, I found a document (http://www.sjdjweis.com/linux/proxyarp/) explaining how to do proxy arp on a 2.4 kernel. Will this method be compatible with LVS as long as director would also be the default GW for realservers?

No. The spoofing checks performed from routing will drop the traffic.

  • Linux Bridging

    Here, the traffic from realservers to the ROUTER passes only Layer 2, i.e. the routing is not reached and you avoid the spoofing checks.

  • forward shared

    If you don't want Bridging or the link to the ROUTER is not ARP aware, then you can use solutions that avoid the spoofing checks for this traffic. One of them is the forward_shared flag (Solution 2).

    With the forward_shared patch applied and with eth1 as the private interface, under /proc/sys/net/ipv4/conf/ you set

    all/forward_shared = 1
    eth1/forward_shared = 1
    

7.5.4. Why the forward_shared patch is not in the kernel

Julian 16 Nov 2006

Pros:

  • saves one extra patching

Cons:

  • useful only for setups which share IPs
  • very dangerous!!! That was the first concern by Alexey Kuznetsov. I see people blindly use echo 1 > all/VAR_NAME without considering what is the relation between all/VAR_NAME and DEV_NAME/VAR_NAME. I saw this many times. forward_shared should be applied only on trusted interfaces and setting 1 to all/ opens the door for spoofing/loop attacks.
  • it is another hack in routing. Not sure if all changes are entirely correct.

So, my opinion is 30% (below 50%) for inclusion. Maybe it is a good idea to have one diff with all the IPVS patches not included in mainline. Then the IPVS users would have to patch only once. Right now we don't even have this option linked from a visible place on the web.

7.6. Accepting packets on LVS-DR director by fwmarks

Horms' firewall mark (fwmark) method allows the director to accept packets by fwmark. No VIP is required on the director.
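
A fwmark-based virtual service looks something like the sketch below (the mark value, port and RIP are arbitrary; the ip rule/route lines, which make the kernel deliver the marked packets locally so that ip_vs sees them, are the same trick used in the "Director as client" section further down; see the fwmark section for the full story). The router in front of the director still has to be told to send packets for the VIP to the director, e.g. with a static route.

#mark incoming packets for the (un-configured) VIP and load balance on the mark
director:~# iptables -t mangle -A PREROUTING -d 192.168.1.110 -p tcp --dport 80 -j MARK --set-mark 1
#deliver marked packets locally, even though no device on the director holds the VIP
director:~# ip rule add prio 100 fwmark 1 table 100
director:~# ip route add local 0/0 dev lo table 100
#the virtual service is declared by fwmark rather than by VIP:port
director:~# ipvsadm -A -f 1 -s rr
director:~# ipvsadm -a -f 1 -r 192.168.1.2 -g -w 1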

7.7. security concerns: default gw(s) and routing with LVS-DR/LVS-Tun

The material here came from a talk by Herbie Pearthree of IBM (posting 2000-10-10) and from a posting by TC Lewis (which I've lost).

In normal IP communication between two hosts, the routing is symmetrical: each end of the link has an ethernet device with an IP and a route to the other machine. Packets are transmitted in pairs (an outgoing packet and a reply, often just an ACK).

In LVS-DR or LVS-Tun the roles of the two machines are split between 3 machines. Here is a two network test setup, with the client in the position normally occupied by the router. In production, the client will have a public IP and connect via a router. (This is my test setup. A big trap in this setup is that services which make calls from the RIP, e.g. authd/identd and rshd will work, but will fail in production, where the RIP will not be routable).

        ____________
       |            |192.168.1.254 (eth0)
       |  client    |----------------------
       |____________|          <-         |
     CIP=192.168.2.254 (eth1)             |
|             |                           |
V             |                           |
     VIP=192.168.2.110 (eth0)             |
        ____________                      |
       |            |                     |
       |  director  |                     |
       |____________|                     |
     DIP=192.168.1.1 (eth1, arps)         |
|             |                           |
V             |----------------------------
              |		        ->
     RIP=192.168.1.2 (eth0)
     VIP=192.168.2.110 (lo:0, no_arp)
        _____________
       |             |
       | realserver  |
       |_____________|

7.7.1. Director's default gw

The client sends a packet to the VIP on the director. In a normal exchange of packets between a pair of machines, the director would send a reply packet back to the client. With an LVS, the director's response instead is a packet to the MAC address of the RIP. Except for ICMP packets (which are only sent in error conditions), the VIP on the director never sends packets back to the client; it only sends packets to the realservers. A default gw for the director is not needed for the functioning of the LVS. Having a default gw would only allow the director to reply to packets arriving from the internet on the VIP (such as port scans), creating a security hazard. The director doesn't need, and shouldn't have, a default gw.

There are pathological conditions when the VIP needs to reply to the client. If the realserver goes down, the director will issue ICMP "host unreachable" packets, till a new realserver is switched in by mon or ldirectord. (If you have a long lived tcp connection, eg with telnet or https, the new realserver will be getting packets for a connection which it doesn't know about, and it will issue a tcp reset. This reset will go out the default gw for the realserver and the client's session will hang or drop.)

If you're using the director for other functions (DNS, firewall...), then packets will need to return to the internet. If you wanted security, you could use the iproute2 tools to allow only the DNS replies to use the default route. An example of doing this is the routing used for 3-Tier LVS realservers.
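
A sketch of the idea, using source-based policy routing with iproute2 (the IPs are hypothetical; it assumes the daemon you want to let out, e.g. the DNS server, is bound to a specific local IP on the director):

#only packets sourced from the IP the daemon is bound to (here 192.168.2.1) get a default route;
#everything else on the director still has no route back to the internet
director:~# ip rule add prio 100 from 192.168.2.1 table 200
director:~# ip route add default via 192.168.2.254 table 200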

Julian Anastasov ja (at) ssi (dot) bg> 30 Aug 2001

It may be that these ICMPs are not fatal if they are not sent. This is true when LVS is used in transparent proxy setups, and particularly in 2.4 where there is no real transparent proxy support. There icmp_send() does not send any packets when there is no running squid.

But maybe the original email sender wanted to use the LocalNode feature together with a DR setup, IIRC. I see that the configure script does not have configs for such setups with mixed forwarding methods. So, as you said, the users with more knowledge can select another way to build their setup. And they will know when they need a default gateway :)

If the mtu is not matched between the router and the director, the director will need to send ICMP "fragmentation needed" packets back to the router. This is a bad setup.

You could enable default routing for icmp, but not tcp or udp from the VIP back to the router, by using iproute2.

7.7.2. Realserver's default gw, route to realserver from router

The realserver doesn't reply to the director; instead it sends its reply to the client. The realserver requires a default gw (here 192.168.1.254, the client's eth0), but the client/router never replies to the realserver; it sends its replies to the director. So the client/router doesn't need a route to the realserver network, and to have one would be a security hazard. The realserver now can't ping its default gw (since there's no route for the reply packet), but the LVS still works.

The flow of packets around the LVS-DR LVS is shown by the ascii arrows.

When an attacker tries to access the nodes in the LVS, they can only connect to the LVS'ed services on the director. They can't connect to the realserver network, as there is no routing to the realservers (even if they get access to the router). Presumably the realservers are not accessible from the outside anyway, as they'll be on private networks.

Note that for Julian's martian modification, the director will need a default gw.

7.8. routing to realserver from director

Note
you don't need routing from the server_gw to the realservers either, see route to realserver.

If you are only using the link between the director and realserver for LVS-DR packets (i.e. you aren't telnet or ssh'ing from the realserver to the director for your admin, and you aren't copying logs from one machine to another), then you don't need an IP on the interface on the director which connects to the realserver(s).

tc lewis tcl (at) bunzy (dot) net 12 Jul 2000 (paraphrased)

I would like to send packets from the LVS-DR director to the realservers by a separate interface (eth2), but not assign an IP to this interface. Normally I put a 192.168.100.x ip on eth2, but without it, route add -net 192.168.100.0 netmask 255.255.255.0 dev eth2 just gives me an error about eth2 not existing. I just want to save an extra IP.

What i'm asking is: does the director's eth2 need an ip on 192.168.100.0/24, or can i just somehow add that route to that interface to tell the machine to send packets that way? With lvs, the realservers are never going to care about the director's interface ip, since there's no direct tcp/ip connections or anything there, but it looks like it still needs an ip anyway.

If all that that interface is doing is forwarding outgoing packets from the director via the dr method, then i don't see why it needs an ip address.

Ted Pavlic tpavlic (at) netwalk (dot) com

You basically want to do device routing. There's nothing special about this -- many routers do it... NT even does it. So does Linux. Your original route command should work

route add -net 192.168.100.0 netmask 255.255.255.0 dev eth2

as long as you've brought up eth2. Now tricking Linux into bringing up eth2 without an address might be the hard part. Try this:

ifconfig eth2 0.0.0.0 up

or

ifconfig eth2 0 up

tc lewis tcl (at) bunzy (dot) net

ifconfig eth0 0.0.0.0 up

then the route did work. I tried that before with a netmask but it didn't work.

Ted Pavlic tpavlic (at) netwalk (dot) com

Remember that IP=0 actually is IP=0.0.0.0, which is another name for the default route.

The reason why IP=0 is 0.0.0.0 ... Remember that each IP address is simply a 4-byte unsigned integer, right? Well... the easiest way to envision this is to imagine that an IP is just like a base-256 number. For example:

216.69.192.12 (my mail server) would be:

12 +
192 * 256 +
69  * 256 * 256 +
216 * 256 * 256 * 256

Which is equal to 3628449804. So...

telnet 216.69.192.12 25

is the same as:

telnet 3628449804 25

0.0.0.0 is just a special system address which is the same as the default route. Making a route from 0.0.0.0 to some gateway will set your default route equal to that gateway. That's all "route add default gw ..." does. Don't believe me? Do a route -n.

So when I told TC to put 0 on his IP-less NIC, I was just choosing a system IP that I knew would not ever need to be transmitted on. Linux wanted an IP to create the interface... so I gave it one -- the IP of the default gateway. Packets would never need to leave the system going to 0.0.0.0, and Linux has to listen to this address ANYWAY, so you might as well explicitly put it on an interface.

What would have also worked (and might have been a better idea) would be to put 127.0.0.1 on that interface. That is another system address that Linux will listen to anyway if loopback has been turned on... and it should never transmit anything away from itself with that as the destination address, so it's safe to put it on more than one interface.

The only reason I chose 0 over 127.0.0.1 is because 0 is easy... It's small... It's quick. Whenever I want to telnet to my localhost's port blah I just do a:

telnet 0 blah

because I'm lazy.. (Linux sees 0, interprets 0.0.0.0, sees an address it listens to, and basically treats 0 like a loopback)

Also you'll notice that if you give an interface 0.0.0.0 as an IP address and do an ifconfig to get stats on that interface, it will still show no IP address. Another perquisite of using 0.0.0.0 in TC's particular situation. It may actually cause less confusion in the end.
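
Putting the pieces of this thread together, the IP-less interface setup amounts to something like this (commands taken from the postings above; the device and network are from tc's example):

#bring the interface up without a usable IP, then add the device route to the realserver network
director:~# ifconfig eth2 0.0.0.0 up
director:~# route add -net 192.168.100.0 netmask 255.255.255.0 dev eth2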

7.9. LVS-DR, LVS-Tun need rp_filter=0

Note
This applies on the director for both LVS-DR and LVS-Tun

Brandon Yap byap (at) xss (dot) com (dot) au 21 Feb 2004

I found the problem. rp_filter needed to be turned off on tunl0.

echo 0 >/proc/sys/net/ipv4/conf/tunl0/rp_filter
Note
Joe - 0 is the default value for rp_filter, as specified in RFC1812 (for more on RFC1812 see broadcast_arp_replies and julians_martian_modification). From postings on the LVS mailing list, it seems that some of the market enhanced kernels (e.g. Debian) have changed the default. (They wouldn't make any money if their kernels behaved in the expected way ;-\ .) You need to file a bug report with the supplier of your kernel.

Ratz 13 Nov 2006

for i in /proc/sys/net/ipv4/conf/*/rp_filter
do
   echo "setting $i to 0"
   echo 0 > $i
done

Ratz 21 Jan 2006

You would be referring to following snippet in the RFC, right?

5.3.8 Source Address Validation

A router SHOULD IMPLEMENT the ability to filter traffic based on a comparison of the source address of a packet and the forwarding table for a logical interface on which the packet was received. If this filtering is enabled, the router MUST silently discard a packet if the interface on which the packet was received is not the interface on which a packet would be forwarded to reach the address contained in the source address. In simpler terms, if a router wouldn't route a packet containing this address through a particular interface, it shouldn't believe the address if it appears as a source address in a packet read from this interface.

If this feature is implemented, it MUST be disabled by default.

So if I read this correctly, /proc/../conf/{all,default}/rp_filter must be off on a freshly booted kernel without any explicit user changes in any of the rc boot scripts.

I've checked on a Debian installation of one of our customers:

sf-lb:~ # cat /etc/network/options
ip_forward=yes
spoofprotect=yes
syncookies=no
sf-lb:~ # uname -a
Linux sf-lb 2.4.27 #1 Sat Oct 16 17:14:21 CEST 2004 sparc64 GNU/Linux
sf-lb:~ # cat /etc/debian_version
testing/unstable
sf-lb:~ #

I have to assume these are the default settings, which then in /etc/init.d/networking get set over doopt() (completely brain-dead redundant information).

Reading spoofprotect_rp_filter() in /etc/init.d/networking I have to assume that the person maintaining this piece of software has not understood the network related settings (besides showing horrible programming practice) in proc-fs under Linux:

spoofprotect_rp_filter () {
     # This is the best method: turn on Source Address Verification and get
     # spoof protection on all current and future interfaces.

     if [ -e /proc/sys/net/ipv4/conf/all/rp_filter ]; then
         for f in /proc/sys/net/ipv4/conf/*/rp_filter; do

--> This should be s/*/default/ to match at least the wrong comment

             echo 1 > $f
         done
         return 0
     else
         return 1
     fi
}

On top, good programming practice would be to explicitly set the other values you take for granted to 0, since an operator could have accidentally set some proc-fs values to test something and did not make it reboot-safe.

Debian is and will remain a system for people with a lot of spare time. Folks: rp_filter has almost nothing to do with proper network security! If source validation has to be done, make sure you route properly.

It's funny, Debian people would only need to have a look at SuSE or Red Hat to see how one can do the networking setup a tad bit better.

Jacob Coby jcoby (at) listingbook (dot) com 20 Feb 2004

Could you do me a favor, and turn rp_filter ON, and ping the VIP with both normal sized ping packets, and very large (>MTU). And then, turn rp_filter OFF and try it again? I'm thinking this is the reason I was having trouble getting lvs-tun to work with packets of size >MTU (see MTU). rp_filter is about the only /proc entry I didn't lookup and try fiddling with.

from the adv-routing HOWTO (http://www.ibiblio.org/pub/Linux/docs/HOWTO/Adv-Routing-HOWTO)

".. if a packet arrived on the Linux router on eth1 claiming to come from the Office+ISP subnet, it would be dropped. Similarly, if a packet came from the Office subnet, claiming to be from somewhere outside your firewall, it would be dropped also."

I think LVS-TUN packets claim to be from the outside world, but come from the subnet, don't they?

Joe: in test situations where both the director and realservers are on the same bench, tunnelled packets from the director to the realservers are from the same netmask. However in real life, the director and realservers can be on different continents and will be in different networks. The decapsulated packet is from the client.

Guy Coates gmpc (at) sanger (dot) ac (dot) uk 03 Nov 2004

I'm running into problems using LVS-DR when using a private network to route traffic from the director to the realservers.

director
Public   IP :   172.17.22.215   (eth0)
Public VIP  :   172.17.22.216 (eth0:0)
Private IP  :   10.4.1.2 (eth1)

realserver

Public  IP: 172.17.22.214 (eth0)
Private IP:	10.4.1.1   (eth1)
VIP       :     172.17.22.216 (lo:0)

eth0 on both machines are on the same segment, and eth1 on both machines are connected via a crossover cable. All client traffic comes in and out via the public network.

If I route director->realserver traffic over eth0, everything works as it should.

ipvsadm -A -t 172.17.22.216:80
ipvsadm -a -t 172.17.22.216:80 -r 172.17.22.214 -g

director:~# ipvsadm -L -c
IPVS connection entries
pro expire state       source           virtual         destination
TCP 14:49  ESTABLISHED 172.25.1.32:37143  172.17.22.216:80  172.17.22.214:80

If I route director->realserver traffic via the private network, things don't. The director routes the incoming traffic correctly, but the realserver drops the packets on the floor.

ipvsadm -A -t 172.17.22.216:80
ipvsadm -a -t 172.17.22.216:80  -r 10.4.1.1 -g

director:~# ipvsadm -L  -c -n
IPVS connection entries
pro expire state       source             virtual            destination
TCP 00:36  SYN_RECV    172.25.1.32:37154  172.17.22.216:80   10.4.1.1:80

tcpdump on the realserver confirms that the director is correctly passing the packets to the realserver:

realserver:~# tcpdump -i eth1 port 80 -p -n

12:25:30.922232 IP 172.25.1.32.37159 > 172.17.22.216.80:
S 2236244704:2236244704(0) win 5840
<mss 1460,sackOK,timestamp 172541305 0,nop,wscale 0>

However, the realserver does not pick up the packet. I'm using kernel 2.4.27+hidden arp patches on both realserver and director.

Unknown

You're not running with one of the anti-spoofing controls switched on, are you? For the life of me I can't remember which sysctl this is (rp_filter?) but that would exhibit this type of behaviour if set.

Ahh yes, it looks as if Debian handily sets /proc/sys/net/ipv4/conf/default/rp_filter to 1 by default. Setting that to zero on the realserver makes everything spring into life.

Simon Detheridge simon (at) widgit (dot) com 30 Oct 2006

I had two LVS-Tun directors. One worked, one didn't. Yeah. I did a "cp -r /proc/sys ~/" on each machine, then a recursive diff on the results. Looks like somehow during the upgrading and ensuing tinkering, rp_filter got set to "1" on the backup director. Setting it to "0" seems to have made the bad behaviour go away. I thought I'd checked this, but must have only done it on the realservers.

7.10. Director as client in LVS-DR

The LVS-mini-HOWTO states that the lvs client cannot be on the director or any of the realservers, i.e. that you need an outside client. This restriction can be relaxed under some conditions.

Joshua Goodall joshua (at) myinternet (dot) com (dot) au 11 May 2004

I want to setup the situation where the director is one of the clients. It appears that LVS does not intercept the outbound packet when it originates on the director itself. This is with both fwmark and a configured VIP:port. I've also tried adding -j REDIRECT in the OUTPUT chain, to no avail. If I bring up the VIP on the director, I see the packet when tcpdumping localhost, but LVS doesn't grab it. Oddly, the packet is still on localhost even when the VIP is on eth0. It seems that ip_vs_in ignores the packet if the device is loopback_dev.

Questions then:

  • Why test for loopback_dev at all? Is this important, or is it just supposed to be an optimisation?
  • Can we fool ip_vs to fill skb->dev with something other than &loopback_dev if the director is the client?

I tried this patch (2.4.26)

diff -u -p -r1.1.1.1 ip_vs_core.c
--- ip_vs_core.c	19 Apr 2004 04:54:41 -0000	1.1.1.1
+++ ip_vs_core.c	11 May 2004 13:03:34 -0000
@@ -1036,7 +1036,7 @@ static unsigned int ip_vs_in(unsigned in
 	 *	Big tappo: only PACKET_HOST (nor loopback neither mcasts)
 	 *	... don't know why 1st test DOES NOT include 2nd (?)
 	 */
-	if (skb->pkt_type != PACKET_HOST || skb->dev == &loopback_dev) {
+	if (skb->pkt_type != PACKET_HOST) {
 		IP_VS_DBG(12, "packet type=%d proto=%d daddr=%d.%d.%d.%d ignored\n",
 			  skb->pkt_type,
 			  iph->protocol,

then added

iptables -t mangle -A OUTPUT -p tcp -s 0/0 -d $VIP --dport $VIPP -j MARK --set-mark 2

to the existing

ip rule add prio 100 fwmark 2 table 100
ip route add local 0/0 dev lo table 100

and now my fwmark-based LVS-DR director does the job for clients and for itself. To make LVS-NAT work, we'd also need to be able to choose the masqueraded source address, which would be a much longer diff. I didn't try LVS-Tun, but that would probably be workable like LVS-DR.

Julian

So, now you can send packets in form DIP->VIP to real servers (LVS-DR method)? I'm wondering how your patched director accepts reply packets for the LVS'ed service from the realserver in the form VIP->DIP. Linux has source address validation and you can not disable it for packets with saddr=local_ip. I see that you can remove the limitation when sending packets but how do you accept the normal LVS replies from the realservers? Maybe you do not have the VIP configured as IP address?

Joshua

There is no VIP. For regular (external) clients, I'm using fwmark + iproute2 to grab packets intended for the DIP; to capture locally sourced packets, I just put a -j REDIRECT into the OUTPUT chain of the nat table.

Julian

There can be another problem in 2.4 (2.6 seems to handle this properly): ip_vs_skb_cow does not expect skbs to have a valid skb->sk. Maybe the skb should be copied (skb_copy) instead of reallocating only its data. You can check for problems by using tcpdump -i lo. Make sure there are no crashes or any kind of memory leaks, because I personally cannot test such a setup. For 2.6 you can also remove the skb->sk check.

7.11. from the mailing list

tc lewis has NAT'ed out the ntp clients on LVS-DR realservers.

7.12. rewriting, re-mapping, translating ports with LVS-DR

see Re-mapping ports in LVS-DR with iptables