5. LVS: LVS-NAT

5.1. Introduction

Note
see also Julian's layer 4 LVS-NAT setup (http://www.ssi.bg/~ja/L4-NAT-HOWTO.txt).

LVS-NAT is based on cisco's LocalDirector.

This method was used for the first LVS. If you want to set up a test LVS, this requires no modification of the realservers and is still probably the simplest setup.

In a commercial environment, the owners of servers are loath to change the configuration of a tested machine. When they want load balancing, they will clone their server and tell you to put your load balancer in front of their rack of servers. You will not be allowed near any of their servers, thank you very much. In this case you use LVS-NAT.

Ratz Wed, 15 Nov 2006

Most commercial load balancers are no longer set up using triangulation mode (Joe: triangulation == LVS-DR) (at least in the projects I've been involved in). The load balancer is becoming more and more a router, using well-understood key technologies like VRRP and content processing.

With LVS-NAT, the incoming packets are rewritten by the director changing the dst_addr from the VIP to the address of one of the realservers and then forwarded to the realserver. The replies from the realserver are sent to the director where they are rewritten and returned to the client with the source address changed from the RIP to the VIP.

Unlike the other two methods of forwarding used in an LVS (LVS-DR and LVS-Tun), the realserver only needs a functioning tcp/ip stack (e.g. a networked printer), i.e. the realserver can have any operating system and no modifications are made to the configuration of the realservers (except setting their route tables).

5.2. LVS-NAT bugs

Sep 2006: Various problems have surfaced in the 2.6.x LVS-NAT code all relating to routing (netfilter) on the side of the director facing the internet. People using LVS-NAT on a director which isn't a firewall and which only has a single default gw, aren't having any problems.

It seems the 2.4.x code was working correctly: Farid Sarwari had it working for IPSec at least. The source routing problem has been identified by three people, who've all submitted functionally equivalent patches. While we're delighted to have contributions from so many people, we regret that we weren't fast enough to recognise the problem and save the last two people all their work. One of the problems (we think) is that not many people are using LVS-NAT and when a weird problem is reported on the mailing list we say "well 1000's of people have been using LVS-NAT for years without this problem, this guy must not know what he's talking about". We're now taking the approach that maybe not too many people are using LVS-NAT.

Here are the problems which have surfaced so far with LVS-NAT. They either have been solved or will be in a future release of LVS.

  • Firewall incompatibility: You couldn't run a netfilter firewall on the outside of the director. This was solved by Ben North with the Antefacto patches. These patches were taken over by Vinnie, and are now being maintained by Julian as part of the ipvs netfilter connection tracking module, ipvs_nfct. Since the NFCT patches are benign when not being used, we hope that they will be incorporated into the ip_vs code for the kernel (when Horms gets time).
  • Source routing: Outbound packets originating at the VIP are not injected into the routing table but are sent straight out the default gw. As a result the packets were not affected by iproute2 commands. This problem was found by Ken Brownfield who submitted a patch for his relatively old kernel, then Farid Sarwari who couldn't get routing to work for his IPSec LVS submitted another, then David Black realised that Julian's NFCT patches handled the problem from the start. (see source routing patches). Horms is working on getting Julian's NFCT code into ip_vs.
  • LVS-NAT ftp helper modules for active/passive ftp: We seem to get a disproportionate number of problems with ftp on LVS. This seems to be a combination of the small number of users, real bugs and inadequate documentation. (see LVS-NAT ftp helper bug).

5.3. Example 1-NIC, 2 Network LVS-NAT (VIP and RIPs on different network)

Note
If the VIP and the RIPs are on the same network you need the One Network LVS-NAT

Here the client is on the same network as the VIP (in a production LVS, the client will be coming in from an external network via a router). (The director can have 1 or 2 NICs - two NICs will allow higher throughput of packets, since the traffic on the realserver network will be separated from the traffic on the client network).

Machine                      IP
client                       CIP=192.168.1.254
director VIP                 VIP=192.168.1.110 (the IP for the LVS)
director internal interface  DIP=10.1.1.9 (director interface on the LVS-NAT network)
realserver1                  RIP1=10.1.1.2
realserver2                  RIP2=10.1.1.3
realserver3                  RIP3=10.1.1.4
.
.
realserverN                  RIPn=10.1.1.n+1

                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                        (router)
                           |
             __________    |
            |          |   |   VIP=192.168.1.110 (eth0:110)
            | director |---|
            |__________|   |   DIP=10.1.1.9 (eth0:9)
                           |
                           |
          -----------------------------------
          |                |                |
          |                |                |
   RIP1=10.1.1.2      RIP2=10.1.1.3   RIP3=10.1.1.4 (all eth0)
   _____________     _____________    _____________
  |             |   |             |  |             |
  | realserver  |   | realserver  |  | realserver  |
  |_____________|   |_____________|  |_____________|

Here's the lvs_nat.conf for this setup:

LVS_TYPE=VS_NAT
INITIAL_STATE=on
VIP=eth0:110 lvs 255.255.255.0 192.168.1.255
DIP=eth0 dip 10.1.1.0 255.255.255.0 10.1.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1:telnet realserver2:telnet realserver3:telnet
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=dip
#----------end lvs_nat.conf------------------------------------

The VIP is the only IP known to the client. The RIPs here are on a different network to the VIP (although with only 1 NIC on the director, the VIP and the RIPs are on the same wire).

In normal NAT, masquerading is the rewriting of packets originating behind the NAT box. With LVS-NAT, the incoming packet (src=CIP,dst=VIP, abbreviated to CIP->VIP) is rewritten by the director (becoming CIP->RIP). The action of the LVS director is called demasquerading. The demasqueraded packet is forwarded to the realserver. The reply packet (RIP->CIP) is generated by the realserver.

5.4. All packets sent from the LVS-NAT realserver to the client must go through the LVS-NAT director

For LVS-NAT to work

  • all packets from the realservers to the client must go through the director.

Forgetting to set this up is the single most common cause of failure when setting up a LVS-NAT LVS.

The original (and the simplest from the point of view of setup) way is to make the DIP (on the director) the default gw for the packets from the realserver. The documentation here all assumes you'll be using this method. (Any IP on the director will do, but in the case where you have two directors in active/backup failover, you have an IP that is moved to the active director and this is called the DIP.)

Any method of making the return packets go through the director will do. With the arrival of the Policy Routing tools, you can route packets according to any parameter in the packet header (e.g. src_addr, src_port, dest_addr ...). Here's an example of ip rules on the realserver to route packets from the RIP to an IP on the director. This avoids having to route these packets via a default gw.

Neil Prockter prockter (at) lse (dot) ac (dot) uk 30 Mar 2004

you can avoid using the director as the default gw by

realserver# echo 80 lvs >> /etc/iproute2/rt_tables
realserver# ip route add default <address on director, eg DIP> table lvs
realserver# ip rule add from <RIP> table lvs

For the IPs used in the example at Virtual Server via NAT (http://www.linuxvirtualserver.org/VS-NAT.html), the commands would be:

echo 80 lvs >> /etc/iproute2/rt_tables
ip route add default 172.16.0.1 table lvs
ip rule add from 172.16.0.2 table lvs

I do this with LVS and with cisco CSS units.

Here Neil is routing packets from RIP to 0/0 via DIP. You can be more restrictive and route packets from RIP:port (where port is the LVS'ed service) to 0/0 via DIP. Packets from RIP:other_ports can be routed via other rules.
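Here's a sketch of that more restrictive approach (untested here; the 2.4.x iptables MARK target is assumed, and the RIP=10.1.1.2, DIP=10.1.1.9, service port 80, mark value 1 and table name "lvs" are just example values matching the diagram above). Since older versions of ip rule can't select on source port directly, the replies from the LVS'ed service are first marked and the mark is then used to choose the routing table:

#mark replies leaving the LVS'ed service (RIP:80)
realserver# iptables -t mangle -A OUTPUT -p tcp -s 10.1.1.2 --sport 80 -j MARK --set-mark 1
#route marked packets via the director; everything else uses the main table
realserver# echo 80 lvs >> /etc/iproute2/rt_tables
realserver# ip route add default via 10.1.1.9 table lvs
realserver# ip rule add fwmark 1 table lvs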

For a 2 NIC director (with different physical networks for the realservers and the clients), it is enough for the default gw of the realservers to be the director. For a 1 NIC, two network setup (where the two networks are using the same link layer), in addition, the realservers must only have routes to the director. For a 1 NIC, 1 network setup, ICMP redirects must be turned off on the director (see One Network LVS-NAT) (the configure script does this for you).

In a normal server farm, the default gw of the realserver would be the router to the internet and the packet RIP->CIP would be sent directly to the client. In a LVS-NAT LVS, the default gw of the realservers must be the director. The director masquerades the packet from the realserver (rewrites it to VIP->CIP) and the client receives a rewritten packet with the expected source IP of the VIP.

Note
the packet must be routed via the director, there must be no other path to the client. A packet arriving at the client directly from the realserver, rather than going through the director, will not be seen as a reply to the client's request and the connection will hang. If the director is not the default gw for the realservers, then if you use tcpdump on the director to watch an attempt to telnet from the client to the VIP (run tcpdump with `tcpdump port telnet`), you will see the request packet (CIP->VIP), the rewritten packet (CIP->RIP) and the reply packet (RIP->CIP). You will not see the rewritten reply packet (VIP->CIP). (Remember if you have a switch on the realserver's network, rather than a hub, then each node only sees the packets to/from it. tcpdump won't see packets between other nodes on the same network.)

Part of the setup of LVS-NAT then is to make sure that the reply packet goes via the director, where it will be rewritten to have the addresses (VIP->CIP). In some cases (e.g. 1 net LVS-NAT) icmp redirects have to be turned off on the director so that the realserver doesn't get a redirect to forward packets directly to the client.

In a production system, a router would prevent a machine on the outside exchanging packets with machines on the RIP network. As well, the realservers will be on a private network (eg 192.168.x.x/24) and replies will not be routable.

In a test setup (no router), these safeguards don't exist. All machines (client, director, realservers) are on the same piece of wire and if routing information is added to the hosts, the client can connect to the realservers independently of the LVS. This will stop LVS-NAT from working (your connection will hang), or it may appear to work (you'll be connecting directly to the realserver).

In a test setup, traceroute from the realserver to the client should go through the director (2 hops in the above diagram). The configure script will test that the director's gw is 2 hops from the realserver and that the route to the director's gw is via the director, preventing this error.

(Thanks to James Treleaven jametrel (at) enoreo (dot) on (dot) ca 28 Feb 2002, for clarifying the write up on the ping tests here.)

In a test setup with the client connected directly to the director (in the setup above with 1 or 2 NICs, or the one NIC, one network LVS-NAT setup), you can ping between the client and realservers. However in production, with the client out on internet land, and the realservers with unroutable IPs, you should not be able to ping between the realservers and the client. The realservers should not know about any other network than their own (here 10.1.1.0). The connection from the realservers to the client is through ipchains (for 2.2.x kernels) and LVS-NAT tables setup by the director.

In my first attempt at LVS-NAT setup, I had all machines on a 192.168.1.0 network and added a 10.1.1.0 private network for the realservers/director, without removing the 192.168.1.0 network on the realservers. All replies from the servers were routed onto the 192.168.1.0 network rather than back through LVS-NAT and the client didn't get any packets back.

Here's the general setup I use for testing. The client (192.168.2.254) connects to the VIP on the director. (The VIP on the realserver is present only for LVS-DR and LVS-Tun.) For LVS-DR, the default gw for the realservers is 192.168.1.254. For LVS-NAT, the default gw for the realservers is 192.168.1.9.

        ____________
       |            |192.168.1.254 (eth1)
       |  client    |----------------------
       |____________|                     |
     CIP=192.168.2.254 (eth0)             |
              |                           |
              |                           |
     VIP=192.168.2.110 (eth0)             |
        ____________                      |
       |            |                     |
       |  director  |                     |
       |____________|                     |
     DIP=192.168.1.9 (eth1, arps)         |
              |                           |
           (switch)------------------------
              |
     RIP=192.168.1.2 (eth0)
     VIP=192.168.2.110 (for LVS-DR, lo:0, no_arp)
        _____________
       |             |
       | realserver  |
       |_____________|

This setup works for both LVS-NAT and LVS-DR.

Here's the routing table for one of the realservers as in the LVS-NAT setup.

realserver:# netstat -rn
Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
192.168.1.0     0.0.0.0         255.255.255.0   U        40 0          0 eth0
127.0.0.0       0.0.0.0         255.0.0.0       U        40 0          0 lo
0.0.0.0         192.168.1.9     0.0.0.0         UG       40 0          0 eth0

Here's a traceroute from the realserver to the client showing 2 hops.

traceroute to client2.mack.net (192.168.2.254), 30 hops max, 40 byte packets
 1  director.mack.net (192.168.1.9)  1.089 ms  1.046 ms  0.799 ms
 2  client2.mack.net (192.168.2.254)  1.019 ms  1.16 ms  1.135 ms

Note the traceroute from the client box to the realserver only has one hop.

Director icmp redirects are on, but the director doesn't issue a redirect (see icmp redirects) because the packet RIP->CIP from the realserver emerges from a different NIC on the director than the one it arrived on (and with a different source IP). The client machine doesn't send a redirect since it is not forwarding packets; it's the endpoint of the connection.

5.5. Run the configure script

Use lvs_nat.conf as a template (the sample here will set up the LVS-NAT in the diagram above, assuming the realservers are already on the network and using the DIP as their default gw).

#--------------lvs_nat.conf----------------------
LVS_TYPE=VS_NAT
INITIAL_STATE=on

#director setup:
VIP=eth0:110 192.168.1.110 255.255.255.0 192.168.1.255
DIP=eth0:10 10.1.1.10 10.1.1.0 255.255.255.0 10.1.1.255

#Services on realservers:
#telnet to 10.1.1.2
SERVICE=t telnet wlc 10.1.1.2:telnet
#http to 10.1.1.2 (with weight 2), to a high port on 10.1.1.3, and to 10.1.1.4
SERVICE=t 80 wlc 10.1.1.2:http,2 10.1.1.3:8080 10.1.1.4

#realserver setup (nothing to be done for LVS-NAT)

#----------end lvs_nat.conf------------------------------------

The output is a commented rc.lvs_nat file. Run the rc.lvs_nat file on the director and then the realservers (the script knows whether it is running on a director or realserver).

The configure script will set up masquerading and forwarding on the director, and the default gw for the realservers.
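For a 2.4.x director, the commands in rc.lvs_nat boil down to something like the following sketch (a hand-rolled equivalent, not the script's actual output; the IPs are taken from the conf file above):

#on the director: enable forwarding and add the LVS-NAT services
director# echo "1" > /proc/sys/net/ipv4/ip_forward
director# /sbin/ipvsadm -A -t 192.168.1.110:23 -s wlc
director# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
director# /sbin/ipvsadm -A -t 192.168.1.110:80 -s wlc
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.2:80 -m -w 2
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.3:8080 -m -w 1
director# /sbin/ipvsadm -a -t 192.168.1.110:80 -r 10.1.1.4:80 -m -w 1

#on each realserver: make the DIP the default gw so replies return through the director
realserver# route add default gw 10.1.1.10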

5.6. Setting up demasquerading on the director; 2.4.x and 2.2.x

The packets coming in from the client are being demasqueraded by the director.

In 2.2.x you need to masquerade the replies. Here's the masquerading code in rc.lvs_nat, that runs on the director (produced by the configure script).

        echo "turning on masquerading "
        #setup masquerading
        echo "1" >/proc/sys/net/ipv4/ip_forward
        echo "installing ipchain rules"
        /sbin/ipchains -A forward -j MASQ -s 10.1.1.2 http -d 0.0.0.0/0
	#repeated for each realserver and service
	..
	..
        echo "ipchain rules "
        /sbin/ipchains -L

In this example, replies from the http service on the realserver are masqueraded by the director, allowing the realserver to answer the requests from the client, which are demasqueraded by the director as part of the 2.2.x LVS code.

In 2.4.x, masquerading of LVS'ed services is done explicitly by the LVS code and no extra masquerading (iptables) commands need be run.

5.7. rewriting, re-mapping, translating ports with LVS-NAT

One of the features of LVS-NAT is that you can rewrite/re-map the ports. Thus the client can connect to VIP:http, while the realserver can be listening on some other port (!http). You set this up with ipvsadm.

Here the client connects to VIP:http, the director rewrites the packet header so that dst_addr=RIP:9999 and forwards the packet to the realserver, where the httpd is listening on RIP:9999.

director:/# /sbin/ipvsadm -a -t VIP:http -r RIP:9999 -m -w 1

For each realserver (i.e. each RIP) you can rewrite the ports differently: each realserver could have the httpd listening on its own particular port (e.g. RIP1:9999, RIP2:80, RIP3:xxxx).

Although port re-mapping is not possible with LVS-DR or LVS-Tun, it's possible to use iptables to do Re-mapping ports with LVS-DR (and LVS-Tun) on the realserver, producing the same result.
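For comparison, here's the sort of rule used on an LVS-DR/LVS-Tun realserver to get the same effect (a sketch only; VIP, 80 and 8080 are example values). The packet arrives addressed to VIP:80 and is redirected locally to the port the demon is really listening on:

realserver# iptables -t nat -A PREROUTING -p tcp -d VIP --dport 80 -j REDIRECT --to-ports 8080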

5.8. masquerade timeouts

For the earlier versions of LVS-NAT (with 2.0.36 kernels) the timeouts were set by linux/include/net/ip_masq.h, the default values of masquerading timeouts are:

        #define MASQUERADE_EXPIRE_TCP 15*16*HZ
        #define MASQUERADE_EXPIRE_TCP_FIN 2*16*HZ
        #define MASQUERADE_EXPIRE_UDP 5*16*HZ
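With the 2.4.x/2.6.x ip_vs code the connection timeouts are no longer compile-time constants and can be changed at run time with ipvsadm (values are in seconds; 900/120/300 are just example values, and a reasonably recent ipvsadm is assumed):

#show the current tcp/tcpfin/udp timeouts
director# ipvsadm -L --timeout
#set tcp=900, tcpfin=120, udp=300
director# ipvsadm --set 900 120 300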

5.9. Julian's step-by-step check of a L4 LVS-NAT setup

Julian has his latest fool-proof setup doc at Julian's software page. Here's the version, at the time I wrote this entry.

Q.1 Can the realserver ping client?

	rs# ping -n client

A.1 Yes => good
A.2 No => bad

	Some settings for the director:

	Linux 2.2/2.4:
	ipchains -A forward -s RIP -j MASQ

	Linux 2.4:
	iptables -t nat -A POSTROUTING -s RIP -j MASQUERADE

Q.2 Traceroute to client goes through LVS box and reaches the client?

	traceroute -n -s RIP CLIENT_IP

A.1 Yes => good
A.2 No => bad

	same ipchains command as in Q.1

	For client and server on same physical media use these
	in the director:

	echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
	echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects


Q.3 Is the traffic forwarded from the LVS box, in both directions?

	For all interfaces on director:
	tcpdump -ln host CLIENT_IP

	The right sequence, i.e. the IP addresses and ports at each
	step (the reverse packets, for the in->out direction, are not shown):

	CLIENT
	   | CIP:CPORT -> VIP:VPORT
	   |		||
	   |		\/
 out	   | CIP:CPORT -> VIP:VPORT
 ||	LVS box
 \/	   | CIP:CPORT -> RIP:RPORT
 in	   |		||
	   |		\/
	   | CIP:CPORT -> RIP:RPORT
	   +
	REAL SERVER

A.1 Yes, in both directions => good (for Layer 4, probably not for L7)
A.2 The packets from the realserver are dropped => bad:

	- rp_filter protection on the incoming interface, probably
	hit from a local client (for more info on rp_filter, see
	the section on the proc filesystem)
	- firewall rules drop the replies

A.3 The packets from the realservers leave the director unchanged

	- missing -j MASQ ipchains rule in the LVS box

	For client and server on same physical media:

	The packets simply do not reach the director. The
	realserver is ICMP redirected to the client. In the director:

	echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
	echo 0 > /proc/sys/net/ipv4/conf/<DEV>/send_redirects

A.4 All packets from the client are dropped

	- the requests are received on the wrong interface with rp_filter
	protection
	- firewall rules drop the requests

A.5 The client connections are refused or are served from a service
in the LVS box

	- client and LVS are on the same host => not valid
	- the packets are not marked by the firewall and don't hit the
	firewall mark based virtual service

Q.4 Is the traffic replied from the realserver?

	For the outgoing interface on realserver:

	tcpdump -ln host CLIENT_IP

A.1 Yes, SYN+ACK => good
A.2 TCP RST => bad, No listening real service
A.3 ICMP message => bad, Blocked from Firewall/No listening service
A.4 The same request packet leaves the realserver => missing accept
	rules or RIP is not defined
A.5 No reply => realserver problem:

	- the rp_filter protection drops the packets
	- the firewall drops the request packets
	- the firewall drops the replies

A.6 Replies go through another device or don't go to the LVS
box => bad

	- the route to the client is direct and so doesn't pass through
	the LVS box, for example:

		- client on the LAN
		- client and realserver on same host

	- wrong route to the LVS box is used => use another

	Check the route:

	rs# ip route get CLIENT_IP from RIP


The result: start the following tests

rs# tcpdump -ln host CIP
rs# traceroute -n -s RIP CIP
lvs# tcpdump -ln host CIP
client# tcpdump -ln host CIP


For deeper problems use tcpdump -len; sometimes the link layer
addresses help a bit.


For FTP:

	VS-NAT in Linux 2.2 requires:

	- modprobe ip_masq_ftp (before 2.2.19)
	- modprobe ip_masq_ftp in_ports=21 (2.2.19+)

	VS-NAT in Linux 2.4 requires:

	- ip_vs_ftp

	VS-DR/TUN require persistent flag


	FTP reports with debug mode enabled are useful:

	# ftp
	ftp> debug
	ftp> open my.virtual.ftp.service
	ftp> ...
	ftp> dir
	ftp> passive
	ftp> dir

	There are reports that sometimes the status strings reported
	from the FTP realservers are not matched with the string
	constants encoded in the kernel FTP support. For example,
	Linux 2.2.19 matches
	"227 Entering Passive Mode (xxx,xxx,xxx,xxx,ppp,ppp)"


Julian Anastasov

5.10. How LVS-NAT works

On the director, ipvsadm does the following:

#setup connection for telnet, using round robin
director:/etc/lvs# /sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
#connections to x.x.x.110:telnet are sent to
#                 realserver 10.1.1.2:telnet
#using LVS-NAT (the -m) with weight 1
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.2:23 -m -w 1
#and to realserver 10.1.1.3
#using LVS-NAT with weight 2
director:/etc/lvs# /sbin/ipvsadm -a -t 192.168.1.110:23 -r 10.1.1.3:23 -m -w 2

(if the service was http instead of telnet, the webserver on the realserver could be listening on port 8000 instead of 80)

Turn on ip_forwarding (so that the packets can be forwarded to the realservers)

director:/etc/lvs# echo "1" > /proc/sys/net/ipv4/ip_forward

Example: client requests a connection to 192.168.1.110:23

director chooses realserver 10.1.1.2:23, updates connection tables, then

packet                source                        dest
incoming              CIP:3456                      VIP:23
inbound rewriting     CIP:3456                      RIP1:23
reply (routed to DIP) RIP1:23                       CIP:3456
outbound rewriting    VIP:23                        CIP:3456

The client gets back a packet with the source_address = VIP.

For the verbally oriented...

The request packet is sent to the VIP. The director looks up its tables and sends the connection to realserver1. The packet is rewritten with a new destination (in this case with the same port, but the port could be changed too) and sent to RIP1. The realserver replies, sending back a packet to the client. The default gw for the realserver is the director. The director accepts the packet and rewrites the packet to have source=VIP and sends the rewritten packet to the client.

Why isn't the source of the incoming packet rewritten to be the DIP or VIP?

Wensong

...changing the source of the packet to the VIP sounds good too, it doesn't require that default route rule, but requires additional code to handle it.

5.11. In LVS-NAT, how do packets get back to the client, or how does the director choose the VIP as the source_address for the outgoing packets?

Note

This was written for the 2.0.x and 2.2.x kernel LVSs, which were based on the masquerading code. With 2.4.x, LVS is based on netfilter and there were initially some problems getting LVS-NAT to work with 2.4.x. What happens here for 2.4.x, I don't know.

Joe

In normal NAT, where a bunch of machines are sitting behind a NAT box, all outward going packets are given the IP on the outside of the NAT box. What if there are several IPs facing the outside world? For NAT it doesn't really matter as long as the same IP is used for all packets. The default value is usually the first interface address (eg eth0). With LVS-NAT you want the outgoing packets to have the source of the VIP (probably on eth0:1) rather than the IP on the main device on the director (eth0).

With a single realserver LVS-NAT LVS serving telnet, the incoming packet does this,

CIP:high_port -> VIP:telnet     #client sends a packet
CIP:high_port -> RIP:telnet     #director demasquerades packet, forwards to realserver
RIP:telnet    -> CIP:high_port  #realserver replies

The reply arrives on the director (being sent there because the director is the default gw for the realserver). To get the packet from the director to the client, you have to reverse the masquerading done by the LVS. To do this (in 2.2 kernels), on the director you add an ipchains rule

director:# ipchains -A forward -p tcp -j MASQ -s realserver1 telnet -d 0.0.0.0/0

If the director has multiple IPs facing the outside world (eg eth0=192.168.2.1 the regular IP for the director and eth0:1=192.168.2.110 the VIP), the masquerading code has to choose the correct IP for the outgoing packet. Only the packet with src_addr=VIP will be accepted by the client. A packet with any other src_addr will be dropped. The normal default for masquerading (eth0) should not be used in this case. The required m_addr (masquerade address) is the VIP.

Does LVS fiddle with the ipchains tables to do this?

Julian Anastasov ja (at) ssi (dot) bg 01 May 2001

No, ipchains only delivers packets to the masquerading code. It doesn't matter how the packets are selected in the ipchains rule.

The m_addr (masqueraded_address) is assigned when the first packet is seen (the connect request from the client to the VIP). LVS sees the first packet in the LOCAL_IN chain when it comes from the client. LVS assigns the VIP as the m_addr.

The MASQ code sees the first packet in the FORWARD chain when there is a -j MASQ target in the ipchains rule. The routing selects the m_addr. If the connection already exists the packets are masqueraded.

The LVS can see packets in the FORWARD chain but they are for already created connections, so no m_addr is assigned and the packets are masqueraded with the address saved in the connections structure (the VIP) when it was created.

There are 3 common cases:

  1. The connection is created as response to packet.
  2. The connection is created as response to packet to another connection.
  3. The connection is already created

Case (1) can happen in the plain masquerading case where the in->out packets hit the masquerading rule. In this case, when nothing specifies the s_addr for the packets going to the external side of the MASQ, the masq code uses the routing to select the m_addr for this new connection. This address is not always the DIP; it can be the preferred source address for the used route, for example, an address from another device.

Case (1) happens also for LVS but in this case we know:

  • the client address/port (from the received datagram)
  • the virtual server address/port (from the received datagram)
  • the realserver address/port (from the LVS scheduler)

But this is an out->in packet and we are talking about in->out packets.

Case (2) happens for related connections where the new connection can be created when all addresses and ports are known or when the protocol requires some wildcard address/port matching, for example, ftp. In this case we expect the first packet for the connection after some period of time.

It seems you are interested in how case (3) works. The answer is that the NAT code remembers all these addresses and ports in a connection structure with these components:

  • external address/port (LVS: client)
  • masquerading address/port (LVS: virtual server)
  • internal address/port (LVS: realserver)
  • protocol
  • etc

LVS and the masquerading code simply hook into the packet path and they perform the header/data mangling. In this process they use the information from the connection table(s). The rule is simple: when a packet is for an already established connection we must remember all addresses and ports and always use the same values when mangling the packet header. If we select different addresses or ports each time, we simply break the connection. After the packet is mangled the routing is called to select the next hop. Of course, you can expect problems if there are fatal route changes.

So, the short answer is: the LVS knows what m_addr to use when a packet from the realserver is received because the connection is already created and we know what addresses to use. Only in the plain masquerading case (where LVS is not involved) can connections be created and a masquerading address selected without a rule for this. In all other cases there is a rule that determines what addresses are to be used at creation time. After creation the same values are used.

5.11.1. So make the VIP the primary IP on the outside of the director

Wayne wayne (at) compute-aid (dot) com 26 Apr 2000

Any web server behind the LVS box using LVS-NAT can initiate communication to the Internet. However, it is not using the farm IP address, rather it is using the masquerading IP address -- the actual IP address of the interface. Is there an easy way to let the server in NAT mode go out as the farm IP address?

Lars

No. This is a limitation in the 2.2 masquerading code. It will always use the first address on the interface.

We tried it and it works! We put the VIP on eth0, and the RIP on eth0:1 in NAT mode and it works fine. We just need to figure out how to do it during reboot, since this is done by playing with the ifconfig command. Once we swap them around, the outgoing IP address is the VIP address. But if the LVS box reboots, you just have to redo it again.

Joe:

! :-) I didn't realise you were in VS-NAT mode, therefore not having the VIP on the realservers. I thought you must be in VS-DR.
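Here's roughly what Wayne describes, as a sketch for a boot script (untested; the example IPs are the ones from the section above, VIP=192.168.2.110 and the director's own eth0 address 192.168.2.1):

#bring the VIP up first so it becomes the primary (first) address on eth0
director# ifconfig eth0 192.168.2.110 netmask 255.255.255.0 up
#then add the director's own address as an alias
director# ifconfig eth0:1 192.168.2.1 netmask 255.255.255.0 up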

5.12. One Network LVS-NAT

Note
According to Malcolm Turnbull, this is called "One Arm NAT" in the commercial world (i.e. one NIC and one network)

The disadvantage of the 2 network LVS-NAT is that the realservers are not able to connect to machines in the network of the VIP. You couldn't make a LVS-NAT setup out of machines already on your LAN, which were also required for other purposes to stay on the LAN network.

Here's a one network LVS-NAT LVS.

                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
                           |
             __________    |
            |          |   |   VIP=192.168.1.110 (eth0:110)
            | director |---|
            |__________|   |   DIP=192.168.1.9 (eth0:9)
                           |
                           |
          ------------------------------------
          |                |                 |
          |                |                 |
   RIP1=192.168.1.2   RIP2=192.168.1.3  RIP3=192.168.1.4 (all eth0)
    _____________      _____________     _____________
   |             |    |             |   |             |
   | realserver  |    | realserver  |   | realserver  |
   |_____________|    |_____________|   |_____________|

The problem:

A return packet from the realserver (with address RIP->CIP) will be sent to the realserver's default gw (the director). What you want is for the director to accept the packet and to demasquerade it, sending it on to the client as a packet with address (VIP->CIP). With ICMP redirects on, the director will realise that there is a better route for this packet, i.e. directly from the realserver to the client and will send an ICMP redirect to the realserver, informing it of the better route. As a result, the realserver will send subsequent packets directly to the client and the reply packet will not be demasqueraded by the director. The client will get a reply from the RIP rather than the VIP and the connection will hang.

The cure:

Thanks to michael_e_brown (at) dell (dot) com and Julian ja (at) ssi (dot) bg for help sorting this out.

To get a LVS-NAT LVS to work on one network -

  1. On the director, turn off icmp redirects on the NIC that is the default gw for the realservers. (Note: eth0 may be eth1 etc, on your machine).

    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/all/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/default/send_redirects
    director:/etc/lvs# echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects
    
  2. Make the director the default and only route for outgoing packets.

    You will probably have set the routing on the realserver up like this

    realserver:/etc/lvs# netstat -r
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    192.168.1.0     0.0.0.0         255.255.255.0   U         0 0          0 eth0
    127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
    0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
    

    Note the route to 192.168.1.0/24. This route allows the realserver to send packets to the client by just putting them out on eth0, where the client will pick them up directly (without being demasqueraded) and the LVS will not work. This route also allows the realservers to talk to each other directly i.e. without routing packets through the director. (As the admin, you might want to telnet from one realserver to another, or you might have ntp running, sending ntp packets between realservers.)

    Remove the route to 192.168.1.0/24.

    realserver:/etc/lvs# route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
    

    This will leave you with

    realserver:/etc/lvs# netstat -r
    Kernel IP routing table
    Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
    127.0.0.0       0.0.0.0         255.0.0.0       U         0 0          0 lo
    0.0.0.0         director        0.0.0.0         UG        0 0          0 eth0
    

Now packets RIP->CIP have to go via the director and will be demasqueraded. The LVS-NAT LVS now works. If LVS is forwarding telnet, you can telnet from the client to the VIP and connect to the realserver. As a side effect, packets between the realservers are also routed via the director, rather than going directly (note: all packets now go via the director). (You can live with that.)

You can ping from the client to the realserver.

You can also connect _directly_ to services on the realserver _NOT_ being forwarded by LVS (in this case e.g. ftp).

You can no longer connect directly to the realserver for services being forwarded by the LVS. (In the example here, telnet ports are not being rewritten by the LVS, i.e. telnet->telnet).

client:~# telnet realserver
Trying 192.168.1.11...
^C
(i.e. connection hangs)

Here's tcpdump on the director. Since the network is switched the director can't see packets between the client and realserver. The client initiates telnet. `netstat -a` on the client shows a SYN_SENT from port 4121.

director:/etc/lvs# tcpdump
tcpdump: listening on eth0
16:37:04.655036 realserver.telnet > client.4121: S 354934654:354934654(0) ack 1183118745 win 32120 <mss 1460,sackOK,timestamp 111425176[|tcp]> (DF)
16:37:04.655284 director > realserver: icmp: client tcp port 4121 unreachable [tos 0xc0]

(repeats every second until I kill telnet on client)

The director doesn't see the connect request from client->realserver. The first packet seen is the SYN-ACK from the realserver, which will be forwarded via the director. The director will rewrite the SYN-ACK to be from the director. The client will not accept a SYN-ACK to port 4121 from director:telnet.

Julian 2001-01-12

The redirects are handled in net/ipv4/route.c:ip_route_input_slow(), i.e. from the routing and before reaching LVS (in LOCAL_IN):

        if (out_dev == in_dev && err && !(flags&(RTCF_NAT|RTCF_MASQ)) &&
            (IN_DEV_SHARED_MEDIA(out_dev)
             || inet_addr_onlink(out_dev, saddr, FIB_RES_GW(res))))
                flags |= RTCF_DOREDIRECT;

Here RTCF_NAT and RTCF_MASQ are flags used by the dumb nat code, but masquerading defined with ipchains -j MASQ does not set these flags (or some of them). The result: the redirect is sent according to conf/{all,<device>}/send_redirects from ip_rt_send_redirect() and ip_forward() from net/ipv4/ip_forward.c. So, the meaning is: if we are going to forward a packet and the in_dev is the same as the out_dev, we redirect the sender to the directly connected destination which is on the same shared media. The ipchains code in the FORWARD chain is reached too late to avoid sending these redirects. They are already sent by the time the -j MASQ is detected.

If all/send_redirects is 1, every <device>/send_redirects is ignored. So, if we leave it at 1, redirects are sent. To stop them we need all=0 && <device>=0. default/send_redirects is the value that will be inherited by each new interface that is created.

The logical operation between conf/all/<var> and conf/<device>/<var> is different for each var. The used operation is specified in /usr/src/linux/include/linux/inetdevice.h

For send_redirects it is '||'. For others, for example conf/{all,<device>}/hidden, it is '&&'.

So, for the two logical operations we have:

For &&:

all	<dev>	result
------------------------------
0	0	0
0	1	0
1	0	0
1	1	1

For ||:

all	<dev>	result
------------------------------
0	0	0
0	1	1
1	0	1
1	1	1

When a new interface is created we have two choices:

1. to set conf/default/<var> to the value that we want each new created interface to inherit

2. to create the interface in this way:

ifconfig eth0 0.0.0.0 up

and then to set the value before assigning the address:

echo <val> > conf/eth0/<var>
ifconfig eth0 192.168.0.1 up

but this is risky, especially for the tunnel devices, for example if you want to play with the rp_filter var.

For the other devices this is a safe method if there is no problem with the default value before assigning the IP address. The first method can be the safest one but you have to be very careful.

5.12.1. One Network LVS from Joe Stump

Joe Stump joe (at) joestump (dot) net 2002-09-04

5.12.1.1. Problem

The problem is you have one network that has your realservers, directors, and clients all together on the same class C. For this example we will say they all sit on 192.168.1.*. Here is a simple layout.

                         ~~~~~~~~~~~~~
                         {  Internet }------------------------+
                         ~~~~~~~~~~~~~                        |
                              |                               | IP: 192.168.1.1
                              | External IP: 166.23.3.4       |
                              |                               |
                       +---------------+                   +---------+
                       |   Director    |-------------------| Gateway |
                       +---------------+                   +---------+
                              |                                  |
                              | Internal IP: 192.168.1.25        |
                              |                                  |
                   +----------+                                  |
                   |                                             |
                   | IP: 192.168.1.200                     +--------+
                   |                                       | Client |
           +---------------+                               +--------+
           |  Real Server  |                            IP: 192.168.1.34
           +---------------+

Everything looks like it should work just fine right? Wrong. The problem is that in reality all of these machines are able to talk to one another because they all reside on the same physical network. So here is the problem: clients outside of the internal network get expected output from the load balancer, but clients on the internal network hang when connecting to the load balancers.

5.12.1.2. Cause

So what is causing this problem? The routing tables on the directors and the realservers are causing your client to become confused and hang the connection. If you look at the routing tables on your realserver you will notice that the default gateway for your internal network is 0.0.0.0. Your director will have a similar route. These routes tell your directors and realservers that requests coming from machines on that network should be routed directly back to that machine. So when a request comes to the director, the director routes it to the realserver, but the realserver sends the response directly back to the client instead of routing it back through the director as it should. The same thing happens when you try to connect via the director's outside IP from an internal client IP, only this time the director mistakenly sends directly to the internal client IP. The internal client IP is expecting the return packets from the director's external IP, not the director's internal IP.

5.12.1.3. Solution

The solution is simple. Delete the default routes on your directors and real servers to the internal network.

route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0

The above line should do the trick. One thing to note is that you will not be able to connect to these machines once you have deleted these routes. You might just want to use the director as a terminal server since you can connect from there to the realservers.

Also, if you have your realservers connect to DBs and NFS servers on the internal network you will have to add direct routes to those hosts. You do this by typing:

route add -host $SERVER dev eth0

I added these routes to a startup script so it kills my internal routes and adds the needed direct routes to my NFS and DB server during startup.
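Something along these lines (a sketch only; the NFS/DB addresses 192.168.1.50 and 192.168.1.51 are made up):

#!/bin/sh
#kill the route to the internal network
route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
#add direct routes to the hosts the realserver still needs to reach directly
route add -host 192.168.1.50 dev eth0    #NFS server
route add -host 192.168.1.51 dev eth0    #DB server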

5.12.2. One Network LVS-NAT with windows realservers

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 1 Aug 2008

Route configuration for Windows Server with one arm NAT mode:

When a client on the same subnet as the real server tries to access the virtual server on the load balancer the request will fail. The real server will try to use the local network to get back to the client rather than going through the load balancer and getting the correct network translation for the connection.

To rectify this issue we need to add a route to the load balancer that takes priority over Windows' default routing rules. This is a simple case of adding a permanent route:

route add -p 192.168.1.0 mask 255.255.255.0 <load balancer IP> metric 1

(NB. Replace 192.168.1.0 with your local subnet address and <load balancer IP> with the load balancer's address on that subnet.) The default route to the local network has a metric of 10, so this new route overrides all local traffic and forces it to go through the load balancer as required. Any local traffic (same subnet) is handled by this route and any external traffic is handled by the default route (which also points at the load balancer).

5.12.3. Malcolm's modification of the One Network LVS-NAT

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 1 Aug 2008

Forgot to add that this method (i.e. the windows realserver method above) works for Linux as well, avoiding the need to add a route for each host.

Route configuration for Linux with one arm NAT mode: When a client on the same subnet as the real server tries to access the virtual server on the load balancer the request will fail. The real server will try to use the local network to get back to the client rather than going through the load balancer and getting the correct network translation for the connection. To rectify this issue we need to modify the local network route to a higher metric:

route del -net 192.168.1.0 netmask 255.255.255.0 dev eth0
route add -net 192.168.1.0 netmask 255.255.255.0 metric 2000 dev eth0

NB. Replace 192.168.1.0 with your local subnet address. Then we need to make sure that local network access uses the load balancer as its default route:

route add -net 192.168.1.0 netmask 255.255.255.0 gw 192.168.1.21 metric 0 dev eth0

NB. Replace 192.168.1.21 with your load balancer gateway Any local traffic (same subnet) is handled by this manual route and any external traffic is handled by the default route (which also points at the load balancer).

Note
FIXME Joe: here's what I think Malcolm is saying. Malcolm's clients are on 0/0 and not on the same logical network as the realservers. (In the setup above needing the icmp redirects turned off, all machines, client, director, realservers are on the same network).
  • metric 0 is high priority. anything to 0/0 goes to default gw
  • metric 2000 is low priority. anything to 192.168.1.0/24 goes to eth0.
  • you don't need to turn off icmp redirects
However
  • the linux kernel ignores the metric
  • only dynamic routing protocols (RIP, GATED) use metric, and then only to decide between duplicate routes.
  • routes with (metric>16) are ignored by dynamic routing protocols. Presumably Linux ignoring the metric, treats a route with metric=2000 the same as a route with metric=0.
So although Malcolm's method works, we don't understand why at the moment.

5.12.4. One net LVS-NAT with static routes, with small number of fixed clients, from Eric Robinson

From: "Robinson, Eric" eric (dot) robinson (at) psmnv (dot) com 21 Mar 2009

The One Network LVS-NAT solution, which relies on disabling redirects at the load balancer and removing the local LAN routes from the RealServers, causes all network traffic between RealServers to pass through the director (as is required). This is undesirable in environments where inter-RS traffic is high, as with clustering and data replication.

If someone only has a few clients that need to access RealServers on their own subnet through a director, then static routes on each client and RS seems to be a better approach. I just added explicit routes from each RS to my client machine through the director and did the same on the client. I figured I could live with some redirect traffic. It worked fine, and as a side effect I noticed that no redirects were being sent by the director machine anyway. Running sniffers simultaneously on the client, the director, and the server, I observed no ICMP redirects being sent or received. The traces also showed that requests and responses were in fact passing through the director and being properly NATed.
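As a sketch of Eric's approach (the addresses here are made up: realserver RIP=192.168.1.2, client=192.168.1.40, director's IP on that network=192.168.1.9):

#on each realserver: reach this client via the director, not directly
realserver# route add -host 192.168.1.40 gw 192.168.1.9
#on the client: reach the realserver via the director
client# route add -host 192.168.1.2 gw 192.168.1.9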

5.12.5. One net LVS-NAT from the mailing list

Here's an untested solution from Julian for a one network LVS-NAT (I assume this is old, maybe 1999, because I don't have a date on it).

Put the client on the external logical network. This way the client, the director and the realserver(s) are on the same physical network but the client can't be on the masqueraded logical network. So, change the client from 192.168.1.80 to 166.84.192.80 (or something else). Don't add it through the DIP (I don't see such an IP for the director). Why in your setup is DIP==VIP? If you add a DIP (166.84.192.33 for example) to the director you can later add a path for 192.168.1.0/24 through 166.84.192.33. There is no need to use masquerading with 2 NICs. Just remove the client from the internal logical network used by the LVS cluster.

A different working solution from Ray Bellis rpb (at) community (dot) net (dot) uk

The RIPs and the VIP are on the same *logical* subnet. I still have a dual-ethernet box acting as a director, and the VIP is installed as an alias interface on the external side of the director, even though the IP address it has is in fact assigned from the same subnet as the RIPs.
Ray Bellis rpb (at) community (dot) net (dot) uk has used a 2 NIC director to have the RIPs on the same logical network as the VIP (ie RIP and VIP numbers are from the same subnet), although they are in different physical networks.

5.13. re-mapping ports, rewriting is slow for 2.0, 2.2 kernels

For LVS-NAT, the packet headers are re-written (from the VIP to the RIP and back again). At no extra overhead, anything else in the header can be rewritten at the same time. In particular, LVS-NAT can rewrite the ports: thus a request to port VIP:80 received on the director can be sent to RIP:8000 on the realserver.

In the 2.0.x and 2.2.x series of IPVS, rewriting the packet headers is slow on machines from that era (60usec/packet on a pentium classic) and limits the throughput of LVS-NAT (for 536byte packets, this is 72Mbit/sec or about 100BaseT). While LVS-NAT throughput does not scale well with the packet rate (after you run out of CPU), the advantage of LVS-NAT is that realservers can have any OS, no modifications are needed to the realserver to run it in an LVS, and the realserver can have services not found on Linux boxes.

Note

For LocalNode, headers are not rewritten.

The LVS-NAT code for 2.4 is rewritten as a netfilter module and is not detectably slower than LVS-DR or LVS-Tun. (The IPVS code for the early 2.4.x kernels in 2001 was buggy during the changeover, but that is all fixed now.)

5.14. Two instances of demon running on realserver

from Horms, Jul 2005

With LVS-DR or LVS-Tun, the packet arrives on the realserver with dst_addr=VIP:port. Thus even if you set up two RIPs on the realserver you cannot have two instances of the service demon, because they would both have to be listening on VIP:port. With LVS-NAT, you could (see the ipvsadm sketch after this list)

  • have two RIPs (RIP1 and RIP2) on one realserver (both IPs could be on one NIC), ipvsadm forwarding to both RIPs, with an instance of the demon listening to RIP1 and another instance of the demon listening to RIP2.
  • one RIP on the realserver, but have ipvsadm forward requests to two different ports. Thus one instance of the demon would listen to RIP:port1 and another would listen to RIP:port2.
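A sketch of both arrangements with ipvsadm (VIP, RIP, RIP1, RIP2 and the ports here are placeholders, not tested values):

director# ipvsadm -A -t VIP:80 -s rr
#two RIPs on the one realserver, a demon instance listening on each
director# ipvsadm -a -t VIP:80 -r RIP1:80 -m
director# ipvsadm -a -t VIP:80 -r RIP2:80 -m
#or: one RIP, two demon instances on different ports
director# ipvsadm -a -t VIP:80 -r RIP:8001 -m
director# ipvsadm -a -t VIP:80 -r RIP:8002 -m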

5.15. Performance of LVS-NAT

Horms

All things are relative. LVS-NAT is actually pretty fast. I have seen it do well over 600Mbit/s. But in theory LVS-DR is always going to be faster because it does less work. If you only have 100Mbit/s on your LAN then either will be fine. If you have gigabit then LVS-NAT will still probably be fine. Beyond that... I am not sure if anyone has tested that to see what will happen. In terms of number of connections, there is a limit with LVS-NAT that relates to the number of ports. But in practice you probably won't reach that limit anyway.

5.15.1. Performance of LVS-NAT, 2.0 and 2.2 kernels

With the slower machines around in the early days of LVS, the throughput of LVS-NAT was limited by the time taken by the director to rewrite a packet. The limit for a pentium classic 75MHz is about 80Mbit/sec (100baseT). Since the director is the limiting step, increasing the number of realservers does not increase the throughput.

The performance page shows a slightly higher latency with LVS-NAT compared to LVS-DR or LVS-Tun, but the same maximum throughput. The load average on the director is high (>5) at maximum throughput, and the keyboard and mouse are quite sluggish. The same director box operating at the same throughput under LVS-DR or LVS-Tun has no perceptible load as measured by top or by mouse/keyboard responsiveness.

5.15.2. Performance of LVS-NAT, 2.4 kernels

Wayne

NAT takes some CPU and memory copying. With a slower CPU, it will be slower.

The original posting:

Julian Anastasov ja (at) ssi (dot) bg 19 Jul 2001

This is a myth from the 2.2 age. In 2.2 there are 2 input route calls for the out->in traffic and this reduces the performance. By default, in 2.2 (and 2.4 too) the data is not copied when the IP header is changed. Updating the checksum in the IP header does not cost too much time compared to the total packet handling time.

To check the difference between the NAT and DR forwarding method in out->in direction you can use testlvs from http://www.ssi.bg/~ja/ and to flood a 2.4 director in 2 setups: DR and NAT. My tests show that I can't see a visible difference. We are talking about 110,000 SYN packets/sec with 10 pseudo clients and same cpu idle during the tests (there is not enough client power in my setup for full test), 2 CPUx 866MHz, 2 100mbit internal i82557/i82558 NICs, switched hub:

3 testlvs client hosts -> NIC1-LVS-NIC2 -> packets/sec.

I use a small number of clients because I don't want to spend time in routing cache or LVS table lookups.

Of course, NAT also involves the in->out traffic and this can halve the performance if the CPU or the PCI bandwidth is not enough to handle the traffic in both directions. This is the real reason the NAT method looks so slow in 2.4. IMO, the overhead from the TUN encapsulation or from the NAT process is negligible.

Here come the surprises:

The basic setup: 1 CPU PIII 866MHz, 2 NICs (1 IN and 1 OUT), LVS-NAT, SYN flood using testlvs with 10 pseudo clients, no ipchains rules. Kernels: 2.2.19 and 2.4.7pre7.

  • Linux 2.2 (with ipchains support, with modified demasq path to use one input routing call, something like LVS uses in 2.4 but without dst cache usage):

    In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 99% (strange)
    In 110,000 SYNs/sec, Out 88,000 SYNs/sec, CPU idle: 0%
    
  • Linux 2.4 (with ipchains support): with 3-4 ipchains rules:

    In 80,000 SYNs/sec, Out 55,000 SYNs/sec, CPU idle: 0%
    In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 0%
    In 110,000 SYNs/sec, Out 63,000 SYNs/sec (strange), CPU idle: 0%
    
  • Linux 2.4 (without ipchains support):

    In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 20%
    In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 2%
    
  • Linux 2.4, 2 CPU (with ipchains support):

    In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 30%
    In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 0%
    
  • Linux 2.4, 2 CPU (without ipchains support):

    In 80,000 SYNs/sec, Out 80,000 SYNs/sec, CPU idle: 45%
    In 110,000 SYNs/sec, Out 96,000 SYNs/sec, CPU idle: 15%, 30000 ctxswitches/sec
    

What I see is that:

  • modified 2.2 and 2.4 UP look equal on 80,000P/s

    limits: 2.2=88,000P/s, 2.4=96,000P/s, i.e. 8% difference

  • 1 and 2 CPU in 2.4 look equal 110,000->96,000 (100mbit or PCI bottleneck?), maybe we can't send more than 96,000P/s through a 100mbit NIC?

  • the ipchains rules can dramatically reduce the performance - from 88,000 to 55,000 P/s
  • 2.4.7pre7 SMP shows too many context switches
  • DR and NAT show equal results for 2.4 UP 110,000->96,000P/s, 2-3% idle, so I can't claim that there is a NAT-specific overhead.

I performed other tests, testlvs with UDP flood. The packet rate is lower, the cpu idle time in the LVS box increased dramatically but the client hosts show 0% cpu idle; maybe more testlvs client hosts are needed.

Julian Anastasov ja (at) ssi (dot) bg 16 Jan 2002

Many people think that the packet mangling is the evil part of NAT processing. The picture is different: the NAT processing in 2.2 uses 2 input routing calls instead of 1 and this totally kills the forwarding of packets from/to many destinations. Such problems are mostly caused by the bad hash function used in the routing code and because the routing cache has a hard limit on entries. Of course, NAT setups handle more traffic than the other forwarding methods (both the forward and reply directions), a good reason to avoid LVS-NAT with a low power director. In 2.4 the difference between the DR and NAT processing in the out->in direction cannot be noticed (at least in my tests) because only one route call is used, for all methods.

Matthew S. Crocker Jul 26, 2001

DR is faster, less resource intensive but has issues with configuration because of the age old 'arp problem'

Horms horms (at) vergenet (dot) net

LVS-NAT is still fast enough for many applications and is IMHO considerably easier to set up. While I think LVS-DR is great, I don't think people should be under the impression that LVS-NAT will intrinsically be a limiting factor for them.

Don Hinshaw dwh (at) openrecording (dot) com 04 Aug 2001

Cisco, Alteon and F5 solutions are all NAT based. The real limiting factor as I understand it is the capacity of the netcard, which these three deal with by using gigabit interfaces.

Julian Anastasov ja (at) ssi (dot) bg 05 Mar 2002 in discussion with Michael McConnell

Note that I used a modified demasq path which uses one input routing call for NAT, but it is not correct. It only proves that 2.2 could reach the same speed as 2.4 if there were a use_dst analog in 2.2. Without such a feature the difference is 8%. OTOH, there is a right way to implement a single input routing call as in 2.4, but it requires rewriting the 2.2 input processing.

Michael McConnell

From what I see here, it looks as though the 2.2 kernel handles a higher number of SYNs better than the 2.4 kernel. Am I to assume that for the 110,000 SYNs/sec test on the 2.4 kernel, only 63,000 SYNs/sec were answered and the rest failed?

In this test 2.4 has firewall rules, while 2.2 has only ipchains enabled.

Is the 2.2 kernel better at answering a higher number of requests?

No. Note also that the testlvs test was only in one direction, no replies, only client->director->realserver

has anyone compared iptables/ipchains, via 2.2/2.4?

Here are my results. There is some magic in these tests; at one point I don't know why netfilter shows such bad results. Maybe someone can point me to the problem.

5.16. Various debugging techniques for routes

This originally described how I debugged setting up a one-net LVS-NAT LVS using the output of route. Since it is more about networking tools than LVS-NAT it has been moved to the section on Policy Routing.

5.17. Connecting directly from the client to a service:port on an LVS-NAT realserver

If you connect directly to the realserver in a LVS-NAT LVS, the reply packet will be routed through the director, which will attempt to masquerade it. This packet will not be part of an established connection and will be dropped by the director, which will issue an ICMP error.

Paul Wouters paul (at) xtdnet (dot) nl 30 Nov 2001

I would like to reach all LVS'ed services on the realservers directly, i.e. without going through the LVS-NAT director, say from a local client not on the internet.

Connecting from a client to a RIP should completely bypass all the lvs code, but it seems the lvs code is confused and thinks a RIP->client answer should be part of its NAT structure.

tcpdump running on the internal interface of the director shows a packet from the client received on the RIP; the RIP replies (the reply never reaches the client; the director drops it). The director then sends out a port unreachable.

Julian

The code that replies with an ICMP error can be removed, but then you still have the problem of reusing connections. The local_client can select a port for a direct connection to the RIP, but if that port was used some seconds before for a CIP->VIP connection, it is possible for LVS to catch these replies as part of the previous connection. LVS does not inspect the TCP headers and does not accurately keep the TCP state. So it is possible that LVS will not detect that the local_client and the realserver have established a new connection with the same addresses and ports that are still known as a NAT connection. Even stateful conntracking can't notice it, because the local_clientIP->RIP packets are not subject to NAT processing. When LVS sees the replies from the RIP to the local_clientIP it will SNAT them, and this is fatal because the new connection is between the local_clientIP and the RIP directly, not CIP->VIP->RIP. The other thing is that the CIP does not even know that it is connecting from the same port to the same server. It thinks there are 2 connections from the same CPORT: one to the VIP and one to the RIP, so they can even exist at the same time.

But a proper TCP/IP stack on a client will not re-use the same port that quickly, unless it is REALLY loaded with connections, right? And a client won't (can't?) use the same source port to different destinations (VIP and RIP), right? So the problem becomes almost theoretical?

This setup is dangerous. As for the ICMP replies, they are only for anti-DoS purposes, but maybe they are going to die soon. There is still not enough reason to remove that code (it was not a first priority).

Or make it switchable as an #ifdef or /proc sysctl?

Wensong

Just comment out the whole block, for example,

#if 0
                if (ip_vs_lookup_real_service(iph->protocol,
                                              iph->saddr, h.portp[0])) {
                        /*
                         * Notify the realserver: there is no existing
                         * entry if it is not RST packet or not TCP packet.
                         */
                        if (!h.th->rst || iph->protocol != IPPROTO_TCP) {
                                icmp_send(skb, ICMP_DEST_UNREACH,
                                          ICMP_PORT_UNREACH, 0);
                                kfree_skb(skb);
                                return NF_STOLEN;
                        }
                }
#endif
This works fine. Thanks

The topic came up again. Here's another similar reply.

I've set up a small LVS-NAT based http load balancer but can't seem to connect to the realservers behind it via IP on port 80. Trying to connect directly to the realservers on port 80 translates everything correctly, but generates an ICMP port unreach.

Ben North ben (at) antefacto (dot) com 06 Dec 2001

The problem is that LVS takes an interest in all packets with a source IP:port of a Real Service's IP:port, as they're passing through the FORWARD block. This is of course necessary --- normally such packets would exist because of a connection between some client and the Virtual Service, mapped by LVS to some Real Service. The packets then have their source address altered so that they're addressed VIP:VPort -> CIP:CPort.

However, if some route exists for a client to make connections directly to the Real Service, then the packets from the Real Service to the client will not be matched with any existing LVS connection (because there isn't one). At this point, the LVS NAT code will steal the packet and send the "Port unreachable" message you've observed back to the Real Server. A fix to the problem is to #ifdef out this code --- it's in ip_vs_out() in the file ip_vs_core.c.

5.18. A NAT router has no connections

A NAT router rewrites the source IP (and possibly the port) of packets coming from machines on the inside network. With an LVS-NAT director, the connection originates on the internet and terminates on the realserver (de-masquerading). The replies (from the realserver to the lvs client) are masqueraded. In both cases (NAT router, LVS-NAT director), to the machine on the internet, the connection appears to be coming from the box doing the NAT'ing. However the NAT box has no connection (e.g. with netstat -an) to the box on the internet. It is just routing packets (and rewriting them).

Horms 17 May 2004

There is no connection as such. Or more specifically, the connection is routed, not terminated, by the kernel. However, there is a proc entry that you can inspect to see the NAT'ed connections.
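For example, a minimal check on the director (assuming ipvsadm and the ip_vs proc interface are installed):

  # list the LVS connection table directly
  cat /proc/net/ip_vs_conn
  # or the same information via ipvsadm
  ipvsadm -L -n -c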

5.19. Thoughts on extending NAT

Tao Zhao taozhao (at) cs (dot) nyu (dot) edu 01 May 2002 LVS-NAT assumes that all servers are behind the director, so the director only needs to change the destination IP when a request comes in and forward it to the scheduled realserver. When the reply packets go through the director it changes the source IP. This limits the deployment of LVS using NAT: the director must be the outgoing gateway for all servers.

I am wondering if I can change the code so that both source and destination IPs are changed in both directions. For example, CIP: client IP; DIP: director IP; SIP: server IP (public IPs);

Client->Director->Server: address pair (CIP, DIP) is changed to (DIP, SIP)
Server->Director->Client: address pair (SIP, DIP) is changed to (DIP, CIP).

Lars

Not very efficient; but this can actually already be done by using the port-forwarding feature AFAIK, or by a userspace application level gateway. I doubt its efficiency, since the director would _still_ need to be in between all servers and the client both ways. Direct routing and/or tunneling make more sense. Also, the realservers would not know where the connection originally came from, making their logs nearly useless; filtering by client IP and establishing a session back to the client (e.g. ftp or some multimedia protocols) is also very difficult.

Wayne wayne (at) compute-aid (dot) com 01 May 2002

The client IP address is very important for traffic analysis by marketing people. Getting rid of the CIP means the web server has no way to log where the traffic is coming from, leaving the marketing people totally blind; that is very undesirable for many uses. Wouldn't you also have to allocate a table for tracking these changes? That would further slow down the director.

Of course, the director needs to allocate a new port number and change the source port number to it when it forwards the packet to the server. This local port number should then be enough for the director to distinguish different connections. This way there will be no limitation on where the servers are (the tunneling solution requires changing the server to set up tunneling).
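As a rough illustration only (this is plain netfilter on a standalone box, not LVS code, and the addresses are made up), the double rewrite being proposed is essentially what a DNAT rule plus an SNAT rule already do for a single server:

  # client -> director: rewrite the destination from the director's public IP to the server
  iptables -t nat -A PREROUTING -d 192.0.2.1 -p tcp --dport 80 \
           -j DNAT --to-destination 203.0.113.10:80
  # director -> server: rewrite the source to the director's own address,
  # so the server replies to the director (and never sees the CIP)
  iptables -t nat -A POSTROUTING -d 203.0.113.10 -p tcp --dport 80 \
           -j SNAT --to-source 192.0.2.1

This also shows Wayne's objection directly: the server only ever sees the director's address, so the client IP is lost from its logs.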

Joe

I talked to Wensong about this in the early days of LVS, but I remember thinking that keeping track of the CIP would have been a lot of work. I think I mentioned it in the HOWTO for a while. However I'd be happy to use the code if someone else wrote it :-)

Some commercial load balancers seem to have some NAT-like scheme where the packets can return directly to the CIP without going through the director. Does anyone know how it works? (Actually I don't know whether it's NAT-like or not, I think there's some scheme out there that isn't VS-DR which returns packets directly from the realservers to the clients - this is called "direct server return" in the commercial world).

Wayne wayne (at) compute-aid (dot) com

I think those are switch-like load balancers; they don't take any IP addresses. But I think it could be done even with NAT, as long as the server has two NICs: one talking to the load balancer and the other talking to the switch/hub in front of the load balancer. The load balancer has to rewrite the packet so that it doesn't contain the load balancer's own IP, so there is no need to NAT back to the public packet. The server sets its default gateway out the other NIC to send the packets out.

5.20. Postings from the mailing list

frederic (dot) defferrard (at) ansf (dot) alcatel (dot) fr

Would it be possible to use LVS-NAT to load-balance virtual IPs to ssh-forwarded real IPs? Ssh can be used to create a local access that is forwarded to a remote access through the ssh protocol. For example, you can use ssh to securely map a local access to a remote POP server:

 local:localport ==> local:ssh ~~~~~ ssh port forwarding ~~~~~ remote:ssh ==> remote:pop

When you connect to local:localport you are transparently/securely connected to remote:pop. The main idea is to allow realservers in different LANs, including realservers that are non-Linux (precluding LVS-Tun). Example:

                                - VS:81 ---- ssh ---- RS:80
                               /
INTERNET - - - - > VS:80 (NAT)-- VS:82 ---- ssh ---- RS:80
                               \
                                - VS:83 ---- ssh ---- RS:80
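The ssh legs in the diagram could be set up with ordinary ssh port forwarding; a hedged sketch (hostnames, user and ports are hypothetical, and note Wensong's comment below on why plain LVS-NAT to local ports won't forward these):

  # on the director: local port 81 is forwarded over ssh to port 80 on RS1
  ssh -f -N -L 81:127.0.0.1:80 user@realserver1
  # local port 82 to port 80 on RS2, and so on
  ssh -f -N -L 82:127.0.0.1:80 user@realserver2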

Wensong

you can use a VPN (or CIPE) to map some external realservers into your private cluster network. If you use LVS-NAT, make sure the routing on the realservers is configured properly so that the response packets go through the load balancer to the clients.
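On such an external realserver the routing change is the usual LVS-NAT one; a minimal sketch (192.168.1.254 stands in for the director's address as seen through the VPN/CIPE link):

  # on the realserver: send replies back through the director
  ip route del default
  ip route add default via 192.168.1.254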

I think it isn't necessary to have the default route point to the load balancer when using ssh, because the RS address is the same as the VS address (different ports).

With the NAT method, your example won't work because LVS-NAT treats the packets as local ones and forwards them to the upper layers without any change.

However, your example gives me an idea: we could dynamically redirect port 80 to ports 81, 82 and 83 respectively for different connections, and then your example would work. However, the performance won't be good, because a lot of work is done at the application level, and the overhead of copying between kernel and user space is high.

Another thought is that we might be able to set up LVS-DR with realservers in different LANs by using CIPE/VPN tunnels. For example, we use CIPE to establish tunnels from the load balancer to the realservers like this:

                     10.0.0.1================10.0.1.1 realserver1
                     10.0.0.2================10.0.1.2 realserver2
   --- Load Balancer 10.0.0.3================10.0.1.3 realserver3
                     10.0.0.4================10.0.1.4 realserver4
                     10.0.0.5================10.0.1.5 realserver5

Then, you can add LVS-DR configuration commands as:

         ipvsadm -A -t VIP:www
         ipvsadm -a -t VIP:www -r 10.0.1.1 -g
         ipvsadm -a -t VIP:www -r 10.0.1.2 -g
         ipvsadm -a -t VIP:www -r 10.0.1.3 -g
         ipvsadm -a -t VIP:www -r 10.0.1.4 -g
         ipvsadm -a -t VIP:www -r 10.0.1.5 -g
 

I haven't tested it. Please let me know the result if anyone tests this configuration.

Lucas 23 Apr 2004

Is it possible to use the cluster as a NAT router? What I'm saying is: I have a private LAN and I want to share my internet connection, doing NAT, firewalling and QoS. The realservers are actually routers and don't serve any service. Is there a way to use the VIP as the private LAN gateway, or to pass traffic through the director to the "realservers" (real routers) even when it is not destined to a specific port on the server?

Horms 21 May 2004

I think that should work, as long as you only want to route IPv4 TCP, UDP and related ICMP. You probably want to use a fwmark virtual service so that you can forward all ports to the realservers (routers). That said, I haven't tried it, so I can't be sure.
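A hedged sketch of such a fwmark service (the mark value and addresses are made up): mark everything arriving for the public address, then create the virtual service on the mark rather than on an address:port pair.

  # on the director: mark all traffic for the public address
  iptables -t mangle -A PREROUTING -d 192.0.2.1 -j MARK --set-mark 1
  # load balance on fwmark 1 and NAT to the "realservers" (here, routers)
  ipvsadm -A -f 1 -s rr
  ipvsadm -a -f 1 -r 192.168.1.1 -m
  ipvsadm -a -f 1 -r 192.168.1.2 -m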

5.21. LVS-NAT source routing patch (Brownfield, Sawari and Black)

Note
Mar 2006: This will be in the next release of LVS.

Ken Brownfield found that ipvs changes the routing of packets from the director to 0/0 (i.e. LVS-NAT or LVS-DR with the forward-shared patch). The packets from ipvs should use the routing table, but they don't. Ken had a director with two external NICs and wanted the packets to return via the NIC on which they arrived. When he tried LVS-NAT with his own installed routing table (which worked when tested with traceroute), the reply packets from ip_vs were sent to the default gw, apparently ignoring his routing table. It should be none of ip_vs's business where the packets are routed.

Here's Ken's ip_vs_source_route.patch.gz patch.

Here's Ken's take on the matter

I need to support VIPs on the director that live on two separate external subnets:

       |        |
  eth0 |   eth1 |         eth0 = ISP1_IP on ISP1_SUBNET
----------------------    eth1 = ISP2_IP on ISP2_SUBNET
|      Director      |
----------------------
   internal |
            |

The default gateway is on ISP1_SUBNET/eth0, and I have source routes set up as follows for eth1:

# cat /etc/SuSE-release
SuSE Linux 9.0 (i586)
VERSION = 9.0

# uname -a
Linux lvs0 2.4.21-303-smp4G #1 SMP Tue Dec 6 12:33:10 UTC 2005 i686  
i686 i386 GNU/Linux

# ip -V
ip utility, iproute2-ss020116

# ip rule list
0:      from all lookup local
32765:  from ISP2_SUBNET lookup 136
32766:  from all lookup main
32767:  from all lookup default

# ip route show table 136
ISP2_SUBNET dev eth1  scope link  src ISP2_IP
default via ISP2_GW dev eth1

If I perform an mtr/traceroute on the director bind()ed to the ISP2_IP interface, outgoing traceroutes traverse the proper ISP2_GW, and the same for the ISP1_IP interface. I'm pretty sure the source-route behavior is correct, since I can revert from the proper behavior by dropping table 136.

For a single web service, I'm defining identical VIPs but for each of the ISPs:

  -A -t ISP1_VIP:80 -s wlc
  -a -t ISP1_VIP:80 -r 10.10.10.10:80 -m -w 1000
  -a -t ISP1_VIP:80 -r 10.10.10.11:80 -m -w 800
  -A -t ISP2_VIP:80 -s wlc
  -a -t ISP2_VIP:80 -r 10.10.10.10:80 -m -w 1000
  -a -t ISP2_VIP:80 -r 10.10.10.11:80 -m -w 800

Incoming packets come in via the proper gateway, but LVS always emits response packets through the default gateway, seemingly ignoring the source-route rules.

I've seen Henrick's general fwmark state tracking described. Reading this, it seems like this patch isn't exactly approved or even obviously available. And the article is from 2002. :)

I'm also not sure why this seems like such a difficult problem. If LVS honored routes, there would be no complicated hacks required. Unless LVS overrides routes, in which case it might be nice to have a switch to turn off that optimization.

I understand that routes are a subset of the problem fixed by the patch, and I can see the value of the patch. But for the basic route case it seems odd for LVS to just dump all outgoing packets to the default gw. I mean, it could cache the routing table instead of just a single gw?

From what I can tell, the SH scheduler decides which realserver will receive an incoming request based on the external source IP in the request. I can see four problems with this.

  • The first is that I can't see how this will change the return route of the packet. I can see mapping incoming source routes to specific realservers with distinct gateways, but I can't see how it could affect an LVS-NAT setup.
  • The second is that a single client IP could go through either incoming VIP. Assuming SH was somehow changing outbound routing, it would distribute the outbound gateway randomly vs. correctly. I suppose this helps distribute traffic but I'm not really interested in perpetuating asymmetric routes.
  • The third is that I'd really like to use LVS as a load-balancer, not as a simple load splitter. wlc is pretty key.
  • The fourth is that using sh doesn't change outbound routes, I just tried it. :-)

The docs state "Multiple gateway setups can be solved with routing and a solution is planned for LVS." Which seems to imply that source routing is a fix but sort of not... :(

Scanning the NFCT patch and looking at the icmp handling, I'm pretty sure the problem is that ip_vs_out() is sending out the packet with a route calculated from the real server's IP. Since ip_vs_out() is reputedly only called for masq return traffic, I think this is just plain incorrect behavior.

I pulled out the route_me_harder() mod and created the attached patch. My only concern would be performance, but it seems netfilter's NAT uses this.

First, I need to correct the stated provenance of this patch. It is a small tweaked subset of an antefacto patch posted to integrate netfilter's connection tracking into LVS, not the NFCT patches as I said. Lots of Googling, not enough brain cells. This patch applies to v1.0.10, but appears to be portable to 2.6.

During a maintenance window this morning, I had the opportunity to test the patch.

The first time I ever loaded the patched module, it shockingly worked perfectly -- outbound traffic from masq VIPs now follows source routes and chooses the correct outbound gateway. No side effects so far, no obvious increase in load.

I also poked around the 2.6 LVS source a bit to see if this issue had been resolved in later versions, and noticed uses of ip_route_output_key, but the source address was always set to 0 instead of something more specific. I'd say it might be worth a review of the LVS code to make sure source addresses are set usefully, and routes are recalculated where necessary.

In any case, if anyone has a similar problem with VIPs spanning multiple external IP spaces and gateways, this has been working like a charm for me in significant production load. So far. *knock*on*wood* I'll update if it crashes and/or burns.

Joe

any idea what would happen if there were multiple VIPs or the packets coming into the director from the outside world were arriving at the LVS code via a fwmark?

To my understanding, Henrick's fwmark patch allows LVS to route traffic based on fwmarks set by an admin in iptables/iproute2. I can imagine certain complex situations where this functionality could be useful and even crucial, but setup and maintenance of fwmarks requires specifically coded fwmark behavior in each of netfilter, iproute2, and ip_vs.

Source routes are essentially a standard feature these days, and are critical for proper routing on gateways and routers (which is essentially what a director is in Masq mode). Having LVS properly observe the routing table is a "missing feature", I believe. The patch I created requires no changes for an admin to make (no fwmarks to set up in ip_vs, netfilter, *and* iproute2), basically just properly and transparently observing routes set by iproute2 (which the rest of the director's traffic already obeys).

So short answer: Henrick's patch allows VIP routing based on fwmarks specifically created/handled by an admin for that purpose, whereas mine is a minor correction to existing code to properly recalculate the routes of outbound VS/NAT VIP traffic after mangling/masquerading of the source IP. A little end-result crossover, but really quite different. My (borrowed :) patch is essentially a one-liner, so the code complexity is very small and the behavior easily confirmable at a glance. The fwmark code is more invasive, seemingly.

Technically, I could have used fwmarks, but until someone needs that specific functionality, I suspect proper source-routing covers 90% of the alternate use cases. And it's the cleaner, more specific solution to my problem. But that's just me. :)

Your summary of SH matches my understanding -- it's hash-based persistence calculated from the client's source IP (vs destination in DH). It probably generates a good random, persistent distribution, which I can see being useful in a cluster environment where persistence is rewarded by caching/sessions/etc. WLC with persistence is probably a better bet for a load-balancer config, since it actually balances load. Without something like wackamole on the real servers, rr/sh/dh are happy to send traffic to dead servers, AFAICT.

Ken Brownfield krb (at) irridia (dot) com 22 Mar 2006

I'm attaching ip_vs_source_route.patch.gz, which is the patch itself. It patches ip_vs_core.c, adding a function call at the end of ip_vs_out() that recalculates the route for an outgoing packet after mangling/masquerading has occurred.

ip_vs_out(), according to the comments in the source (and my brief perusal of the code) is "used only for VS/NAT." There should be no effect on DR/TUN functionality as far as I can tell. This type of route recalc might be correct behavior in some TUN or DR circumstances, but I have no experience in a DR/TUN setup. So yes, I believe this patch is orthogonal to DR/TUN functionality and should be silent with regard to DR/TUN.

The only concern a user should have after applying this patch is that they make sure they are aware of existing source routes before using the patch. Users may be unknowingly relying on the fact that LVS always routes traffic based on the real server's source IP instead of the VIP IP, and applying the patch could change the behavior of their system. I suspect that will be a very rare concern.

As long as the source routes on the system are correct, where the source IP == the VIP IP, packets from LVS will be routed as the system itself routes packets. Routes confirmed with a traceroute (bound to a specific IP on the director) will no longer be ignored for traffic outbound from a NAT VIP.
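In other words, with the patch applied the only requirement is an ordinary source rule for the VIP itself; a sketch using Ken's names (ISP2_VIP, ISP2_GW and table 136 are placeholders from his setup above):

  # route anything sourced from the second VIP out through ISP2's gateway
  ip rule add from ISP2_VIP table 136
  ip route add default via ISP2_GW dev eth1 table 136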

Joe: next Farid Sarwari stepped in

Farid Sarwari fsarwari (at) exchangesolutions (dot) com 25 Jul 2006

I'm having some issues with IPVS and IPSec. When an HTTP client requests a page, I can see the traffic come all the way to the webserver (ws1, ws2). However, the return traffic gets to the load balancer but does not make it through the ipsec tunnel. A tcpdump shows that the packets get SNATed by ipvs. I know there is a problem with ipsec 2.6 and SNAT, and I've upgraded my kernel and iptables so that SNAT with iptables now works. But it looks like ipvs is doing its own SNAT, which doesn't pass through the ipsec tunnel.

My setup:


                      HTTP Clients
                       -------
                         |
                          \ -- Ipsec tunnel
                          /
                         |            
                  +------------+
                  |LoadBalancer|
                  |  ipsec2.6  |  
                  |   ipvs     |
                  +------------+
                         |
                        /\
                       /  \
                      /    \
                 +-----+  +-----+
                 | ws1 |  | ws2 |
                 +-----+  +-----+


Ldirector.conf:
virtual=x.x.x.x:80 #<public ip>
        real=y.y.y.1:80 masq
        real=y.y.y.2:80 masq
        checktype=negotiate
        fallback=127.0.0.1:80 masq
        service=http
        request="/"
        receive=" "
        scheduler=wlc
        protocol=tcp

------------------

ipvsadm -ln output:
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  x.x.x.x:80 wlc
  -> y.y.y.1:80            Masq    1      0          0
  -> y.y.y.2:80            Masq    1      0          0

------------------

Software Version #s:
ipvsadm v1.24 2003/06/07 (compiled with popt and IPVS v1.2.0)
Linux Kernel 2.6.16
iptables v1.3.5
ldirectord  version 1.131

The Brownfield patch is for an older version of ipvs. When I was applying the patch, hunk #3 failed. I was able to apply the third hunk manually, but when I compile, it gives errors for the code from the first hunk of the patch.

Finally got it to work! I can access load-balanced pages through ipsec. Ken Brownfield's patch seems to have been for an older version of the kernel/ipvs. If you look in the patch, there is a function called ip_vs_route_me_harder which is an exact copy of ip_route_me_harder from netfilter.c. I'm not sure what version of ipvs/kernel Brownfield's patch is for. I couldn't get ipvs to compile with his patch, so I just used his idea and copied the new code from the netfilter source: I've modified his patch by copying the newer ip_route_me_harder function from net/ipv4/netfilter.c (2.6.16). Below is the patch for kernel 2.6.16 (kernel sources from FC4).

IPVS Version:     $Id: ip_vs_core.c,v 1.34 2003/05/10 03:05:23 wensong Exp



------snip--------
--- ip_vs_core.c.orig   2006-03-20 00:53:29.000000000 -0500
+++ ip_vs_core.c        2006-07-27 14:31:14.000000000 -0400
@@ -43,6 +43,7 @@

 #include <net/ip_vs.h>

+#include <net/xfrm.h>

 EXPORT_SYMBOL(register_ip_vs_scheduler);
 EXPORT_SYMBOL(unregister_ip_vs_scheduler);
@@ -516,6 +517,76 @@
        return NF_DROP;
 }

+/* This code stolen from net/ipv4/netfilter.c */
+
+int ip_vs_route_me_harder(struct sk_buff **pskb)
+{
+        struct iphdr *iph = (*pskb)->nh.iph;
+        struct rtable *rt;
+        struct flowi fl = {};
+        struct dst_entry *odst;
+        unsigned int hh_len;
+
+        /* some non-standard hacks like ipt_REJECT.c:send_reset() can cause
+         * packets with foreign saddr to appear on the NF_IP_LOCAL_OUT hook.
+         */
+        if (inet_addr_type(iph->saddr) == RTN_LOCAL) {
+                fl.nl_u.ip4_u.daddr = iph->daddr;
+                fl.nl_u.ip4_u.saddr = iph->saddr;
+                fl.nl_u.ip4_u.tos = RT_TOS(iph->tos);
+                fl.oif = (*pskb)->sk ? (*pskb)->sk->sk_bound_dev_if : 0;
+#ifdef CONFIG_IP_ROUTE_FWMARK
+                fl.nl_u.ip4_u.fwmark = (*pskb)->nfmark;
+#endif
+                if (ip_route_output_key(&rt, &fl) != 0)
+                        return -1;
+
+                /* Drop old route. */
+                dst_release((*pskb)->dst);
+                (*pskb)->dst = &rt->u.dst;
+        } else {
+                /* non-local src, find valid iif to satisfy
+                 * rp-filter when calling ip_route_input. */
+                fl.nl_u.ip4_u.daddr = iph->saddr;
+                if (ip_route_output_key(&rt, &fl) != 0)
+                        return -1;
+
+                odst = (*pskb)->dst;
+                if (ip_route_input(*pskb, iph->daddr, iph->saddr,
+                                   RT_TOS(iph->tos), rt->u.dst.dev) != 0) {
+                        dst_release(&rt->u.dst);
+                        return -1;
+                }
+                dst_release(&rt->u.dst);
+                dst_release(odst);
+        }
+
+        if ((*pskb)->dst->error)
+                return -1;
+
+#ifdef CONFIG_XFRM
+        if (!(IPCB(*pskb)->flags & IPSKB_XFRM_TRANSFORMED) &&
+            xfrm_decode_session(*pskb, &fl, AF_INET) == 0)
+                if (xfrm_lookup(&(*pskb)->dst, &fl, (*pskb)->sk, 0))
+                        return -1;
+#endif
+
+        /* Change in oif may mean change in hh_len. */
+        hh_len = (*pskb)->dst->dev->hard_header_len;
+        if (skb_headroom(*pskb) < hh_len) {
+                struct sk_buff *nskb;
+
+                nskb = skb_realloc_headroom(*pskb, hh_len);
+                if (!nskb)
+                        return -1;
+                if ((*pskb)->sk)
+                        skb_set_owner_w(nskb, (*pskb)->sk);
+                kfree_skb(*pskb);
+                *pskb = nskb;
+        }
+
+        return 0;
+}

 /*
  *      It is hooked before NF_IP_PRI_NAT_SRC at the NF_IP_POST_ROUTING
@@ -734,6 +805,7 @@
        struct ip_vs_protocol *pp;
        struct ip_vs_conn *cp;
        int ihl;
+       int retval;

        EnterFunction(11);

@@ -821,8 +893,20 @@

        skb->ipvs_property = 1;

-       LeaveFunction(11);
-       return NF_ACCEPT;
+       /* For policy routing, packets originating from this
+        * machine itself may be routed differently to packets
+        * passing through.  We want this packet to be routed as
+        * if it came from this machine itself.  So re-compute
+        * the routing information.
+        */
+       if (ip_vs_route_me_harder(pskb) == 0)
+               retval = NF_ACCEPT;
+       else
+               /* No route available; what can we do? */
+               retval = NF_DROP;
+
+       LeaveFunction(11);
+       return retval;

   drop:
        ip_vs_conn_put(cp);
------snip--------

Joe

Can you do IPSec with LVS-DR? (the director would only decrypt and the realservers encrypt)

I haven't tried it, but I don't see why it shouldn't work. It's probably easier to get working than LVS-NAT with IPSec :) You can think of IPsec as just another interface, except that with kernel 2.6 there is no more ipsec0 interface. So as long as routing is set up correctly, LVS-DR should work with IPSec.

so you have an ipsec0 interface and you can put an IP on it and route to/from it just like with eth0? Can you use iproute2 tools on ipsec0?

With Kernel 2.6 there is no more ipsec0 interface, but you can use iproute2 to alter the routing table. You wouldn't want to modify the routes to the tunnel because ipsec takes care of that, but you can modify routes for traffic that is coming through the tunnel destined for LVS-DR.

Ken Brownfield krb (at) irridia (dot) com 28 Jul 2006

At first glance, that's exactly what had to be ported, and I'm glad someone with enough 2.6 fu did it. Now, if someone could have it conditional on a proc/sysctl, it would seem like more of a no-brainer for inclusion. ;)

Joe: next David Black stepped in

David Black dave (at) jamsoft (dot) com 28 Jul 2006

I applied the following patch (http://www.ssi.bg/~ja/nfct/ipvs-nfct-2.6.16-1.diff) to a stock 2.6.17.7 kernel and enabled the source routing hook via /proc/sys/net/ipv4/vs/snat_reroute. LVS-NAT connections now appear to obey policy routing - yay!

Referring to an older version of the NFCT patch, Ken Brownfield says in the LVS HOWTO: "I pulled out the route_me_harder() mod and created the attached patch." So the Brownfield patch is a derivative of the NFCT patch in the first place.

And here's a comment from the NFCT patch I used:

/* For policy routing, packets originating from this
 * machine itself may be routed differently to packets
 * passing through.  We want this packet to be routed as
 * if it came from this machine itself.  So re-compute
 * the routing information.
 */

For a patched kernel, that functionality is enabled by

echo 1 > /proc/sys/net/ipv4/vs/snat_reroute

Farid Sarwari fsarwari (at) exchangesolutions (dot) com 31 Jul 2006

The problem I was having with ipvs was that I couldn't access it through ipsec on kernel 2.6. I remember accessing ipvs through ipsec on 2.4 a few years ago and I don't remember running into this problem. Correct me if I'm wrong, but prior to kernel 2.6.16 SNAT (netfilter) didn't work properly with ipsec. When troubleshooting my problem it looked like the NATing was happening after the routing decision had been made. This is why I was under the assumption that only code from kernel 2.6.16+ would fix my problem. If the NFCT patch works with ipsec, I would much rather use that.

Joe

If Julian's patch had been part of the kernel ipvs code, would anyone have had source routing/iproute2 problems with LVS-NAT?

Ken 9 Aug 2006

I don't believe so -- the source-routing behavior appears to be a (happy) side-effect of working NFCT functionality. I think the NFCT and source-routing patches' intentions are to supply a feature and a bug-fix, respectively, but NFCT is an "accidental" superset.

5.22. LVS-NAT FTP Recipe

Stephen Milton smmilton (at) gmail (dot) com 12/17/05

This may be old hat to many of you on this list, but I had a lot of problems deciphering all the issues around FTP in load balanced NAT. So I wrote up the howto on how I got my configuration to work. I was specifically trying to setup for high availability, load balanced, FTP and HTTP with failover and firewalling on the load balancer nodes. Here is a permanent link to the article: load_balanced_ftp_server (http://sacrifunk.milton.com/b2evolution/blogs/index.php/2005/12/17/load_balanced_ftp_server)

5.23. LVS-NAT vhosts with apache

Michael Green mishagreen (at) gmail (dot) com

Is it possible to make Apache's IP based vhosts work under LVS-NAT?

Graeme Fowler graeme (at) graemef (dot) net 14 Dec 2005

If, by that, you mean Apache vhosts whereby a single vhost lives on a single IP then the answer is definitely "yes", although it may seem counter-intuitive at first.

If you're using IP based virtual hosting, you have a single IP address for *each and every* virtual host. In the 'classic' sense this means your server has one, two, a hundred, a thousand IP addresses configured (as aliases) on its internet-facing interface, and a different vhost listens on each one.

In the clearest case of LVS-NAT, you'd have your public interface on the director handle the one, two, a hundred, a thousand _public_ IP addresses and present those to the internet (or your clients, be those as they are). Assuming you have N realservers, you then require N*(one, two, a hundred, a thousand) private IP addresses and you configure up (one, two, a hundred, a thousand) aliases per virtual server. You then setup LVS-NAT to take each specific public IP and NAT it inbound to N private IPs on the realservers.

Still with me? Good.

This is a network management nightmare. Imagine you had 256 Virtual IPs, each with 32 servers in a pool. You immediately need to manage an entire /19 worth of space behind your director. That's a lot of address space (8192 addresses to be precise) for you to be keeping up with, and it's a *lot* of entries in your ipvsadm table.

There is, however, a trick you can use to massively simplify your addressing:

Put all your IP based vhosts on the same IP but a *different port* on each realserver. Suddenly you go from 8192 realserver addresses (aliases) to, well, 32 addresses (aliases) with 256 ports in use on each one. Much easier to manage.
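A hedged sketch of that trick with two public vhost VIPs and a single realserver (all addresses and ports are made up); LVS-NAT can rewrite the port as well as the address:

  # vhost 1: public 192.0.2.1:80 -> realserver port 8001
  ipvsadm -A -t 192.0.2.1:80 -s wlc
  ipvsadm -a -t 192.0.2.1:80 -r 192.168.1.10:8001 -m
  # vhost 2: public 192.0.2.2:80 -> the same realserver, port 8002
  ipvsadm -A -t 192.0.2.2:80 -s wlc
  ipvsadm -a -t 192.0.2.2:80 -r 192.168.1.10:8002 -m

Apache on the realserver then gets one Listen directive (and one vhost) per port.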

For even more trickery you could probably make use of some of keepalived's config tricks to "pool" your realservers and make your configuration even more simple, but if you only have a small environment you may want to get used to using ipvsadm by hand first until you're happy with it.

5.24. LVS-NAT timeout problem

Joe: here's a posting that hasn't been solved. It occurred with LVS-NAT, but we don't know if it occurs with the other forwarding methods.

Dmitri Skachkov dmitri (at) nominet (dot) org (dot) uk 21 Feb 2007

I should probably say at the beginning that the issue I'm going to describe is not directly related to the problem discussed on this list a while ago (http syn/ack not translated when ftp loadbalancing is also enabled). We have several LVS/NAT installations which are managed by Keepalived. All of them are pretty much identical and exhibit the same issue. The setup looks like this (a backup load balancer and a backup router are omitted) and is standard LVS/NAT:


        !-----------------!
        !                 !
        !     Internet    !
        !                 !
        !-----------------!
                 !
                 !
        !-----------------!
        !                 !
        !     Router      !
        !                 !
        !-----------------!
                 !
                 !
        !-----------------!
        !      eth0       !
        !                 !
        !  LoadBalancer   !
        !                 !
        !      eth1       !
        !-----------------!
                 !
                 !192.168.1.0/24
    ------------------------
    !       !       !      !
  !---!                  !---!
  !RS1!     .........    !RSN!
  !---!                  !---!

This setup works fine most of the time, except when a client sends a TCP SYN packet and then forgets about the connection. In that case the RealServer keeps sending SYN/ACK packets until the connection on the server times out and it sends a RST/ACK. The issue is that the last two packets don't get translated, because ipvs on the LoadBalancer has already timed out the connection. Below is a tcpdump on LoadBalancer/eth0:

10:58:20.655059 IP 213.248.204.8.2113 > 213.248.224.116.43: S 1402601529:1402601529(0) win 512
10:58:20.655335 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:24.031708 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:30.792336 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:58:44.303557 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
10:59:11.316010 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:00:05.330972 IP 213.248.224.116.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:01:05.346329 IP 192.168.1.32.43 > 213.248.204.8.2113: S 443218720:443218720(0) ack 1402601530 win 49312 <mss 1460>
11:02:05.362233 IP 192.168.1.32.43 > 213.248.204.8.2113: R 1:1(0) ack 1 win 49312

In this example I simulated the situation by sending a SYN packet from my PC to the server and dropping all further packets. While the SYN/ACK packets were still being translated I saw:

director# ipvsadm -lnc
TCP 28:12  NONE        213.248.204.8:0    213.248.224.116:43 192.168.1.32:43
TCP 00:57  SYN_RECV    213.248.204.8:2113 213.248.224.116:43 192.168.1.32:43

But once the SYN_RECV entry has expired I see only this:

TCP 27:02  NONE        213.248.204.8:0    213.248.224.116:43 192.168.1.32:43

Yes, I played with 'ipvsadm --set tcp tcpfin udp' and it doesn't have any effect on this issue.

At that point, packets from the RealServer belonging to this connection (from the RealServer's point of view) stop getting translated. This is not a real problem but rather a nuisance for me; I just don't want packets with private IPs leaving the LoadBalancer. I can't block these packets with iptables, since I believe ipvs does the SNATing somewhere in the POSTROUTING chain and there is no way to put any other rules after that chain. I also can't modify the SYN_RECV timeout, since there is no tcp_timeout_syn_recv entry in /proc/sys/net/ipv4/vs/ (this is a stock CentOS 4.3 kernel). My question is: is it possible to block untranslated packets from leaving the LoadBalancer without touching the RealServers and the Router?

If it can help, here is additional info:

# uname -a
Linux lb1 2.6.9-34.ELsmp #1 SMP Thu Mar 9 06:23:23 GMT 2006 x86_64 x86_64 x86_64 GNU/Linux
# ipvsadm --help
ipvsadm v1.24 2003/06/07 (compiled with getopt_long and IPVS v1.2.0)

later...

Graeme Fowler graeme (at) graemef (dot) net 25 Jun 2007

One of my "standard" (I use the term advisedly) configuration settings for LVS-NAT is to ensure that I have an SNAT rule for packets exiting the director towards clients. If I have a virtual service with VIP 1.2.3.4 and two realservers 192.0.0.1 and 192.0.0.2, I make sure I have rules of the form:

-t nat -A POSTROUTING -o eth0 -s 192.0.0.1 -j SNAT --to-source $VIP
-t nat -A POSTROUTING -o eth0 -s 192.0.0.2 -j SNAT --to-source $VIP 

This means any packets escaping the LVS - for example where the LVS connection entry has timed out but the realserver application session hasn't - will be SNATted to the right IP.

It also means that any sessions originating from the realserver - CGI calls to other websites, PHP database connections to offboard servers, SSI includes, RSS inclusion, whatever - appear to come from the right source. It can help to track down abuse in the case of mass virtual hosting, and it prevents information leakage of the form you're seeing.

Longer term, it looks like you need to make sure that the IP stack timeouts on the realservers match the LVS connection table timeouts on the director. Have a look at the "--set" option to ipvsadm, and check the corresponding sysctls in /proc/sys/net/ipv4/ - you may have to do a bit of deduction regarding backoff algorithms and retries to get the total time for (for example) a TCP three-way handshake timeout, like you're seeing.
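For example (the values are illustrative only), the LVS-side timeouts and the realserver's SYN/ACK retry behaviour can be inspected and adjusted like this:

  # on the director: tcp, tcpfin and udp timeouts, in seconds
  ipvsadm --set 900 120 300
  # on the realserver: how many times a SYN/ACK is retransmitted
  sysctl net.ipv4.tcp_synack_retries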

dmitri (at) nominet (dot) org (dot) uk 25 Jun 2007

If I remember correctly, the POSTROUTING solution didn't work for me, as it seemed the LVS code in the kernel just ignored any POSTROUTING rules for packets under LVS control. Nor did ipvsadm --set have any effect on this issue. Sorry for the short explanation, but this is what I remember off the top of my head; since in all other respects LVS is working fine for us, I haven't looked into it much lately.