18. LVS-J: Ludo's reinJect Forwarder: using the director as a gateway to load balance connections to the internet

18.1. Introduction

We haven't had a new forwarder for quite a while (the last one was either Localnode or LVS-Tun, way back in the early days).

An LVS director should be able to balance packets through multiple paths to the internet, except that it has to accept the replies as well. Ludo Stellingwerff ludo (at) protactive (dot) nl has hacked the ipvs code to do just that. A writeup of the state of the art in routing multiple internet connections over different paths is in Dynamic Routing. Handling failure in multipath routing is still difficult - see Julian's dead gateway detection code. As for all the forwarders, failover of the realservers is handled externally to ip_vs. Anyone who sets up this forwarder will at least need to be aware of the problem of handling failure of any of the routes.
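Failover of the gateways therefore has to be handled by something outside ip_vs. As an illustration only (this is not part of Ludo's patches), a crude health check could ping each gateway and use ipvsadm to pull a dead gateway out of the fwmark service, putting it back when it recovers; the addresses and fwmark below are the example values used throughout this section.

#!/bin/sh
# crude gateway health check - a sketch, not part of the reinJect patches
# gateways and fwmark are the example values used in this section
GATEWAYS="200.200.10.254 200.200.20.254"
while true; do
    for gw in $GATEWAYS; do
        if ping -c 2 -W 2 $gw >/dev/null 2>&1; then
            # gateway answers: make sure it is in the fwmark=1 service
            # (errors about an already existing destination are ignored)
            ipvsadm -a -f 1 -r $gw -j 2>/dev/null
        else
            # gateway dead: stop scheduling new connections through it
            ipvsadm -d -f 1 -r $gw 2>/dev/null
        fi
    done
    sleep 5
done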

Ludo Stellingwerff ludo (at) protactive (dot) nl 29 Jul 2005

Here are the vs_reinject_patches.tar.bz2 against 2.6.11. This is a minimal implementation to provide support for using LVS as a load balancer for internet gateways.

Here's the physical layout with example IPs.

-
|                   LAN  192.168.1.0/24
private IPs         |
|                   eth0 192.168.1.254
-                   |
                 Director
-                |      |
|   200.200.10.1 eth1   eth2 200.200.20.1
|                |      |
public IPs       |      |
|                |      |
|           (modem, router, wan)
| 200.200.10.254 gw_1   gw_2 200.200.20.254
|                |      |
v                ISP1   ISP2

The director is the gateway for the private network. The director load balances two separate internet connections (eth1, eth2) through the realservers, which are the real gateways. Any return traffic will pass back through the director (side note: you'll need to switch off the reverse path filter on the director).
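A minimal sketch of switching off the reverse path filter on the director (interface names as in the layout above):

# allow asymmetric paths: don't drop replies arriving on an interface the
# kernel would not itself have chosen to reach the source
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth1/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth2/rp_filter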

The code consists of two parts: forwarding code for the director, called "reinJect", and an iptables target called "LVS". I call my code "Multipath routing through LVS", because the kernel term for multiple internet connections is multipath routing. Patches are included to allow ipvsadm to set up the reinJect forwarding.

18.2. reinJect setup with ipvsadm

Here's how you set it up. It's similar to the setup of the other forwarders, but using the -j forwarder option. There are a few extra wrinkles, as explained below.

iptables -A FORWARD -t mangle -s 192.168.1.0/24 -d 0.0.0.0/0 -m state --state NEW -j MARK --set-mark 1
iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS
ipvsadm -A -f 1 -p # for -p see below
ipvsadm -a -f 1 -r 200.200.10.254 -j
ipvsadm -a -f 1 -r 200.200.20.254 -j
iptables -A POSTROUTING -t nat -o eth1 -m mark --mark 1 -j SNAT --to-source 200.200.10.1
iptables -A POSTROUTING -t nat -o eth2 -m mark --mark 1 -j SNAT --to-source 200.200.20.1

18.3. The target LVS: sending packets with dst_addr=0/0 to ip_vs

Since the director is being asked to process packets with dst_addr=0/0, some method of getting the director to accept those packets is needed. Originally LVS was written to process single IP, single port services (e.g. a website at VIP:80). Since the packet was destined for an IP on the original server (the one replaced by the LVS), and this packet was routed to LOCAL_IN, LVS was written to hook into LOCAL_IN. In an LVS this IP became the VIP on the outside of the director and a non-arping VIP on the realservers.

Later, for 2.2.x kernels, transparent proxy (Horm's method, see Transparent Proxy) allowed LVS to be extended so that it would accept packets to a wide range of IPs, e.g. squids which process packets for 0:80. With the arrival of the 2.4 series of kernels, transparent proxy no longer worked for LVS (but still worked for squids), as the dst_addr of the packet was rewritten, and LVS was back to working only on individual IPs. A firewall mark (fwmark) allowed a small number of IPs to be grouped and seen as one service, and was adapted to fwmark packets for 0.0.0.0. Still, methods outside LVS were required to have the packets accepted locally, which fwmark didn't do: LVS required the packets to traverse LOCAL_IN, and you couldn't put the IP 0.0.0.0 on the director (or realservers) to direct these packets to LOCAL_IN - those IPs were on machines out in internetland. This was solved by Julian with two lines of iproute2 code (see routing to a director without a VIP). Now everyone was happy again, but for LVS to work, the packets must still traverse LOCAL_IN.
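For reference, Julian's two lines amount to something like the following (fwmark and table number are example values; this is the historical LOCAL_IN approach, not needed for the reinJect forwarder described below):

# deliver fwmark'ed packets locally even though dst_addr is not a local IP
ip rule add prio 100 fwmark 1 table 100
ip route add local 0/0 dev lo table 100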

There has been some talk of moving the LVS hooks from LOCAL_IN to another netfilter hook, e.g. LVS hook in PRE_ROUTING. In fact you can hook LVS anywhere after PRE_ROUTING, but making such a change would require much testing. Hopefully not too much would break in the rest of the code, but you would still have to allow time to fix it all. No-one has wanted to take on the job.

Ludo solves the problem of LVS processing packets with dst_addr=0.0.0.0 by putting the hook for his forwarder into the FORWARD chain (which is traversed by packets to 0.0.0.0). Conceivably LVS could be rewritten so that all forwarders for packets to 0.0.0.0 hook into the FORWARD chain. Here are the instructions that send the packet to LVS.

#mark new connections to 0/0 with fwmark=1
#after the SYN packet passes through, the routing cache will have an entry for that route.
#the kernel keeps a cache of used routes.
#subsequent packets in the connection will be forwarded by the routing cache table.
iptables -A FORWARD -t mangle -s 192.168.1.0/24 -d 0.0.0.0/0 -m state --state NEW -j MARK --set-mark 1
#send all packets with fwmark==1 to the chain LVS
#the chain LVS is setup by the patches and sends the packet to the ip_vs code.
iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS

The marked packets jump to the target LVS, which hands them to the normal ip_vs code.

18.4. setting up LVS-J forwarding

The setup of the -j forwarder is the same as for the other forwarders.

ipvsadm -A -f 1 -p  # default persistence timeout (300s)
ipvsadm -a -f 1 -r 200.200.10.254 -j
ipvsadm -a -f 1 -r 200.200.20.254 -j

Normally you will want persistence, e.g. for websites whose cookies are tied to the source IP, for https, or for ssh. The problem is not which gateway is used, but which source address you appear to come from (here either 200.200.10.1 or 200.200.20.1): without persistence, successive connections from the same client could leave through different gateways and so present different source addresses.
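If the default persistence timeout is too short for your sessions, you can set it explicitly (in seconds); a small sketch, editing the fwmark=1 service set up above:

# keep a client on the same gateway (and hence the same SNAT'ed source address)
# for 30 minutes of idle time
ipvsadm -E -f 1 -p 1800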

The ip_vs code does the normal things with the packet: if it's a new connection, it sets up the templates etc.; if it's an existing connection, it figures out which realserver to send the packet to.

The reinject code puts the packet back into the mangle chain, doing a form of direct routing: it returns the packet to the place where the target LVS was called, but with a new next hop and destination device (here eth1 or eth2 on the director).

Note

The packet will have the source MAC address of the public interface on the outside of the director and the destination MAC address of the gateway/realserver. The ip_vs code doesn't set destination MAC addresses, but leaves that to the outgoing device driver. LVS-DR (and LVS reinJect) changes one field in the kernel packet structure, the skb (skb->dst); this field holds the next IP address the packet will go to. The corresponding MAC address is determined by the link-layer driver. LVS only operates at the network layer (IP addresses).

Reinjection is only effective if ip_vs is called on the FORWARD path. If you try to reinject at the LOCAL_IN path, it won't work: the normal ip_vs function is called after the choice between local delivery and forwarding, so if I only change skb->dst at that point, the kernel will not redo this choice and the packet will continue to be locally delivered. That is the reason why LVS-DR calls the ethernet output function directly.

18.5. SNAT'ing the output

Packets emerging from the director would have src_addr=CIP (a private IP). Ludo fixes this by SNAT'ing the src_addr.

iptables -A POSTROUTING -t nat -o eth1 -m mark --mark 1 -j SNAT --to-source 200.200.10.1
iptables -A POSTROUTING -t nat -o eth2 -m mark --mark 1 -j SNAT --to-source 200.200.20.1

SNAT'ing is only necessary when the director is between a private and a public network. The setup also works when the director is between two public networks, in which case no SNAT is required. In most cases, though, SNAT is required.

18.6. LVS-J discussion by Ludo

The code could have achieved the same result using direct routing, but then it couldn't SNAT the private addresses to public addresses. Therefore the code introduces the reinject forwarder, which only changes the routing decision and then returns to normal routing with NF_ACCEPT on the hook. The packet will then go on normally, but with a new route.

This effect is similar to the iptables ROUTE target, but with the added features of LVS (caching, persistence).
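For comparison, assuming the patch-o-matic ROUTE target is available in your kernel, a static equivalent with a single fixed gateway (and none of LVS's scheduling, caching or persistence) would look roughly like:

# rough sketch (assumes the patch-o-matic ROUTE target and its --gw option):
# everything with fwmark 1 leaves via one fixed gateway, no load balancing
iptables -A FORWARD -t mangle -m mark --mark 1 -j ROUTE --gw 200.200.10.254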

The scheduler is non-intrusive: it only changes one field in the skb and then returns to the normal network stack, at the same point where ip_vs_in() was called. The code provides a netfilter target called "LVS" which can be used in the mangle table on the FORWARD hook. When used, this target calls ip_vs_in() directly, providing the routing capability of LVS.

  • A packet coming in from the LAN has a non-local destination; normal Linux routing calls ip_forward.
  • The netfilter FORWARD hook is called.
  • In the mangle table the following two rules are hit:

    #iptables -A FORWARD -t mangle -s <lan-ip> -d 0.0.0.0/0 -m state --state NEW -j MARK --set-mark 1
    #iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS
    
  • The LVS target will call ip_vs_in(), with a scheduler on fwmark 1, using the new "reinject" forwarder:

    #ipvsadm -A -f 1 -p
    #ipvsadm -a -f 1 -j -r <gateway1>
    #ipvsadm -a -f 1 -j -r <gateway2>
    
  • The reinject forwarder will make sure the packet continues traversing the mangle table, but with a new next hop (skb->dst).
  • The packet will continue through the normal network stack, through POSTROUTING, etc.
  • The packet will be sent towards the internet through the selected gateway (specified as the next hop).

The provided patches still have three outstanding drawbacks:

  • I'm not sure if the kernel patches compile correctly when used as modules. Therefore I force the LVS subsystem to be built in ("y") when the iptables target is selected.
  • I didn't add a check restricting the iptables target to the FORWARD hook. It is still possible to select this target in PREROUTING, even though it will not work there.
  • It's against 2.6.11 (which is old already :). Hopefully I'll find time to clean that up somewhere next week. Or maybe someone else has the time/energy to clean the patches up?

Linux networking is very flexible when it comes to routing. It is possible to use several internet connections through one router. The process of selecting among these multiple default routes is called multipath routing. One of the remaining problems for multipath routing under Linux is the lack of flexibility in the scheduling over these multiple default routes: normal multipath routing only provides a weight factor, but no further setup parameters. It is a basic form of load balancing, but nothing fancy. Another problem is that multipath routing is only supported on default routes, not on any route with more than one possible gateway.
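For comparison, the standard multipath default route for the example layout above would be set up with iproute2 roughly like this; the per-nexthop weight is the only knob:

# plain Linux multipath routing: one default route with two weighted nexthops
ip route replace default \
    nexthop via 200.200.10.254 dev eth1 weight 1 \
    nexthop via 200.200.20.254 dev eth2 weight 1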

The lvs_reinjection patches are designed to provide the full power of the LVS schedulers for deciding which route a given packet will take. In contrast to normal LVS setups, they make it possible to schedule any traffic passing through the router. With normal LVS the scheduled service is a local service on the director which is then transferred to one or more realservers. The solution provided through these patches can select any traffic passing the director and then force this traffic through a chosen next hop/gateway.

The fwmark can be set anywhere in the networking stack, using iptables. Then you need to tell the network stack to send the marked traffic through the LVS subsystem. This is done with a new iptables target called LVS, whose purpose is to call the entry function of LVS.
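For instance, to balance only new outgoing web traffic over the gateways and leave everything else on the normal default route (a sketch reusing fwmark 1 from the example setup):

# only new outgoing http/https connections are marked and handed to LVS;
# unmarked traffic never enters the LVS subsystem
iptables -A FORWARD -t mangle -s 192.168.1.0/24 -p tcp -m multiport --dports 80,443 -m state --state NEW -j MARK --set-mark 1
iptables -A FORWARD -t mangle -m mark --mark 1 -j LVS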

Basically you can then use any of the LVS functions, with any forwarder available. But of the three standard forwarders, none is very effective for the internet load balancing case. The LVS-NAT forwarder will mangle the headers of the packets, thereby losing the final destination information. The LVS-Tun forwarder will try to set up a tunnel to the realservers (in this case the gateways), but most gateways don't provide such a tunnel capability. Only LVS-DR provides the required behaviour: it will send the packets unmodified to the correct gateway. But LVS-DR does this by bypassing all local routing on the router and sending the packet directly through the ethernet driver's output function. This means further services, like SNAT, masquerading, etc., cannot be applied to these packets.

For this problem the patches introduce a new forwarder, called reinJect. This is a very simple forwarder, similar to LVS-DR. But instead of sending the packet directly out the ethernet device, this forwarder leaves that to the normal network stack. It just returns to where LVS was called in the first place. You can't use the LVS-J forwarder in normal LVS setups, where LVS hooks into LOCAL_IN: returning the packet to LOCAL_IN would mean that the kernel tries to deliver it locally. Here the code allows LVS to hook into the FORWARD chain instead.

Normally LVS services require the traffic to the VIP to be delivered locally on the director. Just before this traffic would be delivered to a local process, the LVS subsystem is called. If the LVS subsystem accepts the traffic for one of its services, it steals the traffic from the local delivery path and sends it through one of its forwarders to the realservers (bypassing standard routing). With the iptables target it is possible to call the LVS subsystem at any point you like: this can be on the local delivery path, but also on the forwarding path. If you call the "LVS" target and the traffic is selected by one of the LVS services, it will be stolen from the normal flow and then delivered back there again.

Registering LVS with the FORWARD hook fixes the problem of requiring the dst_addr to be a VIP local to the director. But I also wanted to prevent LVS from stealing the packet: I wanted the traffic to stay in the network stack and continue on its normal path. The only service I needed from LVS was the ability to select one of the available gateways. These are the realservers in LVS terminology, but here they aren't the endpoint of the connection, just the next hop to the internet. So I combined the ability to send any traffic to LVS with the ability to reinject the packets into the network flow at the same place where LVS stole them (basically returning from the LVS entry function with IPT_CONTINUE instead of NF_STOLEN).

Horms

I had a brief look over the patches and they seem ok to me, except that I am not clear on the motivation of the following hooks. Doesn't this mean that ip_vs_in is registered in three separate places? Is this actually what you need?

Ludo Stellingwerff Aug 04, 2005

Yes, as I try to redirect forwarded traffic (with addresses not local to the director), I need to hook into NF_FORWARD. Ideally this would be a separate ip_vs_forward_in function, but these patches are a proof of concept. This new ip_vs_forward_in function should be limited to matching fwmarks.

The packet flow is: incoming packet -> PRE_ROUTING -> FORWARD -> ip_vs_in (returning NF_ACCEPT, after changing skb->dst) -> POSTROUTING -> outgoing packet.

I'm also looking at the possibility of using the iptables REDIRECT target to get rid of the forwarding hook and use the normal ip_vs_in, but I'm not yet sure this will not mangle the original packet (it should not lose the original destination data). At least reinject should then be changed to return NF_STOLEN on the INPUT hook, and call ip_forward() to get the packet on its way again.

The flow for the packet will then become: incoming packet -> PRE_ROUTING (REDIRECT) -> INPUT -> ip_vs_in (returning NF_STOLEN, sending the packet to ip_forward()) -> FORWARD -> POSTROUTING -> outgoing packet.

Horms 9 Aug 2005

If you can get that working, that would be nice. Though I have often wondered about just moving ip_vs_in() to FORWARD and being done with it. I've tried it briefly in the past to good effect.