48. LVS: Loadbalancing with unmodified realservers


The requirement to modify the realservers is one of the differences between LVS and commercial loadbalancers. LVS requires the following modifications -

  • LVS-NAT: default gw has to point to DIP
  • LVS-DR: requires non-arp'ing VIP; default gw cannot be DIP (easy to do)
  • LVS-Tun: requires VIP (which can arp); IPIP decapsulation; default gw cannot be DIP (easy)
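For LVS-DR, the non-arp'ing VIP is typically the only change needed on a 2.6 realserver. A minimal sketch (the VIP address is an example only; 2.4 kernels need the "hidden" patch or Horm's method instead):

```shell
# Stop the realserver answering ARP for the VIP (2.6 kernel sysctls),
# then bring the VIP up on loopback so the realserver accepts packets
# addressed to it. 192.168.1.110 is an example VIP.
sysctl -w net.ipv4.conf.all.arp_ignore=1
sysctl -w net.ipv4.conf.all.arp_announce=2
ip addr add 192.168.1.110/32 dev lo
```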

In the commercial world, you initially have one server. Later you want more reliability or throughput and decide you need a loadbalancer in front of several copies of the server. At this stage the configuration of the server is cast in stone by the bureaucrats. The only thing you're allowed to do is make multiple copies of the server (all with the same IP) and go find a loadbalancer that will work with this setup. Flexible bureaucrats will allow a different IP on each realserver, but that is the limit (you aren't allowed to change the default gw). Of course the people who set these requirements have the money to make their wishes come true.

Alex Rousskov runs a Polygraph Cacheoff (http://www.web-polygraph.org/) which tests loadbalancers. I wanted to enter an LVS director (he supplies the realservers). One of the requirements was that the realservers not be modified. This left us with LVS-NAT on a 2.0 or 2.2 kernel, which back then was a lot slower than LVS-DR, so I didn't enter.

This section discusses what (little) we know about commercial loadbalancers and what can be done to make LVS work with minimally modified realservers.

48.1. F5-SNAT

This is musings from the mailing list about a type of functionality we don't have in LVS.

Paulo F. Andrade pfca (at) mega (dot) ist (dot) utl (dot) pt 11 Jul 2006

What I want is the following:

  • for inbound connections I want packets with CIP->VIP translated to DIP->RIP
  • for outbound connections (the responses from the real servers) I want packets with RIP->DIP translated to VIP->CIP

LVS-NAT only does DNAT, meaning CIP->VIP changes to CIP->RIP and the response from RIP->CIP to VIP->CIP. The problem is that after LVS changes the VIP to the RIP for inbound connections, the packets don't seem to traverse the POSTROUTING chain to get SNAT'ed. Is there a workaround for this?
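For comparison, here is what the requested full NAT looks like in plain netfilter, without any IPVS scheduling (VIP, DIP and RIP are placeholder shell variables with example addresses). With LVS-NAT the packet is grabbed in the INPUT chain, so the POSTROUTING SNAT rule in this sketch is exactly the step that never fires:

```shell
# Example addresses only; this sketch balances nothing - it just shows
# the two translations the poster wants combined with LVS scheduling.
VIP=192.168.1.110 DIP=10.0.0.1 RIP=10.0.0.2
# DNAT in PREROUTING rewrites CIP->VIP to CIP->RIP ...
iptables -t nat -A PREROUTING  -d $VIP -p tcp --dport 80 \
        -j DNAT --to-destination $RIP
# ... and SNAT in POSTROUTING would then rewrite it to DIP->RIP.
iptables -t nat -A POSTROUTING -d $RIP -p tcp --dport 80 \
        -j SNAT --to-source $DIP
```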

Graeme Fowler graeme (at) graemef (dot) net

Surely what you're asking for is a proxy rather than a director?

malcolm lists (at) netpbx (dot) org 13 Jul 2006

I think this is what F5 calls SNAT mode (which confuses LVS people). It's really nifty and flexible and I don't see why it can't be done at layer 4 with LVS... but LVS would need to be moved to the FORWARD chain rather than the INPUT one. It's pretty similar to LVS-NAT, so it's not really a proxy (it's not looking at the packet contents). I think it would be a massive improvement. I'd even consider sponsoring someone to do it.

F5 will also check the response of the real server and if it fails re-send the commands from the cache to another server... nice but definitely layer7 proxy stuff...

You want this because:

  • Doesn't require changes to the real servers. (not even the default gateway)
  • Works across subnets (and possibly WANs)

I think Kemp technologies have managed to do it with their LVS implementation... I get a lot of customers who have their real servers so locked down they can't modify them at all.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 4 Dec 2008

It would be nice if you could do full SNAT in LVS, and even better transparent SNAT using tproxy (not sure if that's even possible at layer 4) (Joe: it isn't)... but that's a big ask for the developers. However HAProxy will easily fit your current requirements - have you looked at that?

Joe - Here's an example of someone wanting to use F5-SNAT

Hoffman, Jon Jon (at) Hoffman (at) acs-inc (dot) com 11 Oct 2006

I have two networks that are physically located in different places (let's say city X and city Y). In city X we have our web servers, run by our team there. In city Y we have our loadbalancer, which we are trying to set up as a demo to show how LVS works. We cannot set the default gateway of our web servers to be the loadbalancer, because we are testing LVS and cannot take our web servers out of production to test a new loadbalancer, and we want to see the load balancing working with our present servers. What is happening is that our client makes a request to our director, the director sends the request to our web server, and the web server responds directly back to the client, which has no idea why that server is sending it the packet.

It does not make sense to me why I cannot masquerade the request to the real server. For example, to really strip things down, say I have the following


The client makes a request to the director, which then makes the request to one of the realservers, but (according to my tcpdump) the request appears to come from the client, so the realserver tries to send the reply directly back to the client. Is there a way to make the request to the realserver appear to come from the director, so that the realserver sends the reply back to the director (without changing the default gw on the realserver) rather than to the client? It just seems like there should be a way to do this.

Malcolm lists (at) loadbalancer (dot) org 11 Oct 2006

Unfortunately the answer is no. Packets can't be SNAT'd after being LVS'd. In my limited understanding this is because LVS bypasses netfilter after it has grabbed the packets from the INPUT chain.

Joe - ip_vs would have to be in the FORWARD chain for this to work.

Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com 24 Feb 2007

The SNAT rule is not working without the NFCT patch - this is why I got my hands on the patch in the first place. I have scenarios like this:

CLIENT -> VIP[with_public_ip_1] -> A_REAL_SERVER[private_ip_1]

A_REAL_SERVER[private_ip_1] -> VIP[with_public_ip_1] -> CLIENT


CLIENT -> VIP[with_public_ip_2] -> A_REAL_SERVER[private_ip_2]

A_REAL_SERVER[private_ip_2] -> VIP[with_public_ip_2] -> CLIENT

I'm not sure if I'm being clear here, but in simple words: the same public IP address that the client uses to connect to the LVS should be used as the source IP in the response to the client. I have multiple public IP addresses that I need to source NAT. The firewall is on the same box as the director. Any pointers?

Julian 24 Feb 2007

Aha, I see why you are using snat_reroute. But I want to note the following things:

  • you need to set snat_reroute only if you have ip rules with source address, where packets from VIP1 and VIP2 don't go to the same nexthop. If you have only one possible gateway, then the kernel has already attached this GW to the packet at routing time, so there is no need to waste CPU trying to reroute it somewhere else by VIP when there is no alternative gateway.
  • you don't need iptables SNAT rules to SNAT the traffic, because netfilter will not reroute it. Netfilter simply does not bind to the nexthop for NAT connections. Also, you cannot expect IPVS packets to reach netfilter in POST_ROUTING; the SNAT rule will not see them.

OK, but what do you see; what is the real problem? Are packets dropped and don't reach the uplink router, or are they not routed properly when you have 2 or more uplinks? Do you have source-based ip rules?
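Joe - in kernels that later gained IPVS SNAT support in mainline, the toggle Julian describes is exposed as a sysctl; check that your kernel carries it before relying on this:

```shell
# Disable rerouting of LVS-NAT responses by VIP. Only useful with
# multiple uplinks / source-based ip rules, as Julian explains above.
sysctl -w net.ipv4.vs.snat_reroute=0
```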

Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com

I am still unable to SNAT traffic leaving the box. I'm running the director and firewall on the same box. This is how I do SNAT:

iptables -t nat -A POSTROUTING -o eth0 -j SNAT --to-source

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 23 Feb 2007

Nicklas, if you mean masquerading of LVS-DR client IPs on their way to the realservers, you can try my approach described below.

When I tried Julian's patch several months ago (I am not sure if this has changed), I found it not suitable for use on a director that would also do masquerading (SNAT) of client IPs. I have learned that when Julian says "SNAT" he means processing of packets coming from an LVS-NAT driven realserver (OUT direction) while forwarding them to clients. IN direction packets never pass through the netfilter nat POSTROUTING hook (nor conntrack POSTROUTING); they are sent out directly by ip_vs_out() with an optional ip_vs_conntrack_confirm().

Some time ago I set up an LVS-DR based internet gateway that not only accepts connections from the internet to a VIP and redirects them to several RIPs (typical IPVS usage), but also redirects connections from intranet clients to the internet through several DSL/FrameRelay links, or rather their respective routers acting as realservers (similar to the LVS driven transparent cache cluster case). As I have no control over these routers (they are managed by their respective providers), I have to do masquerading (or SNAT) on the director itself to avoid putting several more boxes in between. To achieve this functionality, I started with a "hardware" method of sending IN packets back to the director via several vlans set up over 2 additional network interfaces connected with a crossover cable. Then I created a small patch that affects the processing of LVS-DR packets only (and bypass as well), so they are not caught by ip_vs_out() and just travel through all the netfilter POSTROUTING hooks, including nat and conntrack. This solution works for me as expected.

In my opinion, Julian's patch is particularly suitable for LVS-NAT, where any other approach would probably not work at all. Furthermore, it looks to me as if Julian's way (or maybe any way) of connection tracking may not be applicable to LVS-Tun, where packets leaving the director are encapsulated before they reach ip_vs_out(). But for LVS-DR there are probably at least two good ways: Julian's, but without masquerading, and my own, which I have used successfully for several months now.

My patch applies cleanly against debian linux-source-2.6.18-3 version 2.6.18-7 and is also available at http://www.icnet.pl/download/ip_vs_dr-conntrack.patch

Signed-off-by: Janusz Krzysztofik <jkrzyszt@tis.icnet.pl>
--- linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_core.c.orig    
2006-06-18 03:49:35.000000000 +0200
+++ linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_core.c 
2006-10-21 21:38:20.000000000 +0200
@@ -672,6 +672,9 @@ static int ip_vs_out_icmp(struct sk_buff
 	if (!cp)
 		return NF_ACCEPT;

+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_DROUTE)
+		return NF_ACCEPT;
+
 	verdict = NF_DROP;

 	if (IP_VS_FWD_METHOD(cp) != 0) {
@@ -801,6 +804,9 @@ ip_vs_out(unsigned int hooknum, struct s
 		return NF_ACCEPT;
 	}

+	if (IP_VS_FWD_METHOD(cp) == IP_VS_CONN_F_DROUTE)
+		return NF_ACCEPT;
+
 	IP_VS_DBG_PKT(11, pp, skb, 0, "Outgoing packet");

 	if (!ip_vs_make_skb_writable(pskb, ihl))
--- linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_xmit.c.orig    
2006-06-18 03:49:35.000000000 +0200
+++ linux-source-2.6.17-2-e49_9.200610211740/net/ipv4/ipvs/ip_vs_xmit.c 
2006-10-21 21:22:56.000000000 +0200
@@ -127,7 +127,6 @@ ip_vs_dst_reset(struct ip_vs_dest *dest)

 #define IP_VS_XMIT(skb, rt)                            \
 do {                                                   \
-	(skb)->ipvs_property = 1;                       \
 	(skb)->ip_summed = CHECKSUM_NONE;               \
 	NF_HOOK(PF_INET, NF_IP_LOCAL_OUT, (skb), NULL,  \
 		(rt)->u.dst.dev, dst_output);           \
@@ -278,6 +277,7 @@ ip_vs_nat_xmit(struct sk_buff *skb, stru
 	/* Another hack: avoid icmp_send in ip_fragment */
 	skb->local_df = 1;

+	skb->ipvs_property = 1;
 	IP_VS_XMIT(skb, rt);

@@ -411,6 +411,7 @@ ip_vs_tunnel_xmit(struct sk_buff *skb, s
 	/* Another hack: avoid icmp_send in ip_fragment */
 	skb->local_df = 1;

+	skb->ipvs_property = 1;
 	IP_VS_XMIT(skb, rt);

@@ -542,6 +543,7 @@ ip_vs_icmp_xmit(struct sk_buff *skb, str
 	/* Another hack: avoid icmp_send in ip_fragment */
 	skb->local_df = 1;

+	skb->ipvs_property = 1;
 	IP_VS_XMIT(skb, rt);

 	rc = NF_STOLEN;

48.2. NetScaler

Bill Omer bill (dot) omer (at) gmail (dot) com 22 May 2007

A NetScaler is a hardware loadbalancer, a commercial product. A NetFiler is a network attached storage device which can (and does) sit behind a NetScaler. The client connects to the VIP. The director (a NetScaler) then modifies the packet headers. The first packet is MASQ'd to the RIP, with src=CIP. The realserver then replies directly to the client, with src_addr=RIP. The client then replies directly to the realserver which the NetScaler assigned on the initial connection. From what I understand, the NetScaler modifies the packet headers and all connections avoid the director after the initial packet is modified.

Philip M disordr (at) gmail (dot) com 22 May 2007

The netscalers (a commercial loadbalancer from Citrix) are actually pretty cool devices. I like them, although lots of my colleagues don't. However they are expensive, and for small remote offices where we set up a few local services (LDAP/DNS/HTTP), setting up a pair of Netscalers is quite expensive. I therefore turned to LVS to see if it could be a practical replacement for the Netscalers. It turns out that LVS does some pretty cool stuff, but not everything that the Netscalers do. Additionally, you need to run other packages to take care of the things that came by default with the netscaler, such as director failover (I'm using heartbeat) and service monitoring/failover (I'm using ldirectord).

In my environment the default gateway for all of my realservers points not to the loadbalancer, but to another router. This prevents LVS-NAT from working in our environment. So far as I can tell, there's no way to do LVS-NAT if your realservers do not point to the loadbalancer as the default gw. This functionality works on the Netscalers because essentially the Netscaler is proxying the connection. LVS-NAT does DNAT only; we require source and destination NAT (proxying).

Main features of the Netscaler that we needed in a replacement are:

  • Direct routing form of SLB. (netscalers call this DSR, direct server return).
  • the concept of "backup vips". If all the local services (realservers) in an office fail, we need the loadbalancer to fail over to a remote site. As such, my co-worker and I are using HAProxy in addition to LVS, to handle the failover mode in the ldirectord setup.
  • ability to properly route to a remote site

Well, we use lvs+heartbeat+ldirectord+haproxy to replace the netscalers in our small remote offices. The failover VIP in our ldirectord/haproxy setup fails over to a remote netscaler VIP. Our netscalers sit in a "1 armed" (one network) mode: they hang off the switch and only load balanced traffic goes through them.

48.3. Using MASQ with REDIRECT to accept packet on realserver to replace a NetScaler

Bill Omer bill (dot) omer (at) gmail (dot) com 22 May 2007

The goal of my project was to develop something in-house to replace the NetScalers. In my configuration, on an extremely large network, I'm using LVS-DR to load balance web and app servers. However the realservers do not have the VIP. To get the realservers to accept packets for the VIP, you need to follow the procedures outlined in Transparent Proxy (Horm's method), particularly TP_2.4_problems. The loadbalanced service is listening on the RIP (not the VIP) (note the nat table option). I'm using a 2.4 kernel on the realservers.

realserver:# /sbin/iptables -t nat -A PREROUTING -d $VIP -p tcp --dport 0:65535  -j REDIRECT
realserver:# echo 1 > /proc/sys/net/ipv4/ip_forward
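The director side of this setup is ordinary LVS-DR; a sketch with ipvsadm (the VIP, RIPs and the wlc scheduler are example choices, not taken from Bill's configuration):

```shell
VIP=192.168.1.110
# Add the virtual service with the weighted-least-connection scheduler,
# then add the realservers in DR ("gatewaying", -g) mode.
ipvsadm -A -t $VIP:80 -s wlc
ipvsadm -a -t $VIP:80 -r 10.0.0.2:80 -g -w 1
ipvsadm -a -t $VIP:80 -r 10.0.0.3:80 -g -w 1
```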

I've been using it to handle thousands of concurrent connections with many services (including telnet and ssh) without issue.

The only downside to using my LVS-DR solution vs. the NetScalers right now is the inability to add a realserver without *any* modifications whatsoever to the OS. Granted, I have automated the process of adding the single iptables rule to be as painless as possible, but with a NetScaler you don't have to do anything at all: no tun interface, no routing through the director, nothing. This is the only remaining downside to using LVS in my organization.

48.4. Using HAProxy with LVS to substitute for the remote server failover of a NetScaler

Philip M disordr (at) gmail (dot) com 22 May 2007

The main deficiency in LVS is that it can't forward to another (remote) server when the realservers fail. We handled that with HAProxy. Our setup is

  • 2 Directors
  • 2 Real servers
  • 1 remote backup site

We do LVS-DR for local services. We have the VIP configured on the directors as well as on dummy interfaces on the realservers. If both realservers fail, in the ldirectord configs we have a "failover IP" set. This failover IP is another (LocalNode) IP on the director. The LocalNode IP is listened to by the HAProxy userspace daemon, which proxies the request to the remote server (a Netscaler backup listening on the VIP). The director is the default gw for the remote machine; the reply comes back to HAProxy, which returns the packet to the CIP.
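A minimal sketch of the HAProxy side of such a setup (the listener name, addresses and port are invented for illustration; Philip's actual configuration isn't shown on the list):

```shell
# HAProxy listens on the LocalNode "failover IP" on the director and
# proxies TCP to the remote backup VIP. Written from a heredoc so the
# whole fragment is visible; a real config also needs global/defaults
# sections and timeouts.
cat > /etc/haproxy/haproxy.cfg <<'EOF'
listen failover-web
    bind 192.168.1.99:80                  # LocalNode failover IP
    mode tcp
    server remote-backup 203.0.113.10:80  # remote netscaler VIP
EOF
```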

It's complicated, but it works, except that HAProxy doesn't do UDP. :( So now we're going to investigate dnsproxy. Ugh...