8. LVS: LVS-Tun

8.1. LVS-Tun Intro

LVS-Tun is an LVS original. It is based on LVS-DR. The LVS code encapsulates the original packet (CIP->VIP) inside an ipip packet of DIP->RIP, which is then put into the OUTPUT chain, where it is routed to the realserver. (There is no tunl0 device on the director; ip_vs() does its own encapsulation and doesn't use the standard kernel ipip code. This possibly is the reason why PMTU on the director does not work for LVS-Tun - see MTU.) The realserver receives the packet on a tunl0 device (see need tunl0 device) and decapsulates the ipip packet, revealing the original CIP->VIP packet.

Initially only Linux could decapsulate IPIP packets. FreeBSD and w2k were later able to do it too (2005: Microsoft has since dropped support for IPIP).

If you want to try a test LVS-Tun setup on the bench, take a standard LVS-DR setup (see LVS-DR example), change lo on the realservers to tunl0 (and handle the ARP problem on tunl0) and change the ipvsadm switch from -g to -i. If your clients are going to be sending large packets, you need to set the MTU (see MTU for the ipip packet DIP->RIP). This can be done on the realserver with iptables (see tunl MTU solved) or iproute2 (see setting the MTU by route).

As with LVS-DR, the director doesn't know about the VIP on the realserver (it only knows about the RIP). Health checking of a service listening on the VIP on the realserver must then use a connection between the DIP and the RIP (if the daemon is listening on both the RIP and the VIP, the service listening on the RIP can be a proxy for the service listening on the VIP).

LVS-Tun allows the realservers to be geographically remote from the director (this is the main point of LVS-Tun). If your realservers cannot do ipip decapsulation, you can still have geographically remote realservers using other techniques (see non tunnelling realservers).

(see also Julian's LVS-Tun write up and postings to the mailing list).

8.2. LVS-Tun example setup

Here's an example set of IPs for an LVS-Tun setup. For (my) convenience the servers are on the same network as the client. The only restrictions for LVS-Tun with remote hosts are that the client must be able to route to the director and that the realservers must be able to route to the client (the return packets to the client come directly from the realservers and do not go back through the director).

Normally for LVS-Tun, the client is on a different network to the director/server(s), and each server has its own route to the outside world. In the simple test case below where all machines are on the 192.168.1.0 network there would be no default route for the servers, and routing for packets from the servers to the client would use the device on the 192.168.1.0 network (presumably eth0). In real life, the realservers would have their own router/connection to the internet and packets returning to the client would go through this router. In any case reply packets do not go back through the director.

Machine                      IP
client                       CIP=192.168.1.254
director                     DIP=192.168.1.1
                             VIP=192.168.1.110 (arps, IP clients connect to)
realserver-1                 RIP1=192.168.1.2, VIP (tunl0, non-arping, 192.168.1.110)
realserver-2                 RIP2=192.168.1.3, VIP (tunl0, non-arping, 192.168.1.110)
realserver-3                 RIP3=192.168.1.4, VIP (tunl0, non-arping, 192.168.1.110)
.
.
realserver-n                 RIPn=192.168.1.n+1, VIP (tunl0, non-arping, 192.168.1.110)

#lvs_tun.conf
LVS_TYPE=VS_TUN
INITIAL_STATE=on
VIP=eth0:110 192.168.1.110 255.255.255.255 192.168.1.110
DIP=eth0 192.168.1.1 192.168.1.0 255.255.255.0 192.168.1.255
DIRECTOR_DEFAULT_GW=client
SERVICE=t telnet rr realserver1 realserver2
SERVER_VIP_DEVICE=tunl0
SERVER_NET_DEVICE=eth0
SERVER_DEFAULT_GW=client
#----------end lvs_tun.conf------------------------------------

                        ________
                       |        |
                       | client |
                       |________|
                       CIP=192.168.1.254
                           |
             CIP->VIP |    |   ^  
                      v    |   | VIP->CIP
                           |
       VIP=192.168.1.110   |
       (eth0:1, arps)      |
         __________        |
        |          |       |
        | director |-------
        |__________|       |
       DIP=192.168.1.1     |
       (eth0)              |
                           |
   DIP->RIP(CIP->VIP) |    |
                      v    
          -------------------------------------
          |                |                  |
          |                |                  |
   RIP1=192.168.1.2  RIP2=192.168.1.3  RIP3=192.168.1.4 (eth0)
   VIP=192.168.1.110 VIP=192.168.1.110 VIP=192.168.1.110 (all tunl0,non-arping)
    _____________     _____________     _____________
   |             |   |             |   |             |
   | realserver  |   | realserver  |   | realserver  |
   |_____________|   |_____________|   |_____________|

Here's a likely production setup (I haven't done this one myself). It assumes the realservers are on a different network to the DIP. Here x.x.x.? and y.y.y.? are public IPs. The 176 and 10 addresses are for communication between the different locations and will be assigned by the ISP.

                       ________
                      |        |
                      | client |
                      |________|
                      CIP=x.x.x.1
                          |
            CIP->VIP |    |---------------------------------
                     v    |                                 |
                      __________                            | 
                     |          |                           |
                     | D-router |                           |
                     |__________|                           |
                          |                                 |
            CIP->VIP |    |                                 |
                     v    |                                 |
                          |                                 |
                VIP=y.y.y.110(eth0, arps)                   |
                      __________                            |
                     |          |                           |
                     | director |                           |
                     |__________|                           |
                DIP=176.0.0.1 (eth1)                        |
                          |                              ^  |
  DIP->RIP1(CIP->VIP) |   |                     VIP->CIP |  |
                      v   |                                 |
                      __________                       __________
                     |          |                     |          | 
                     | R-router |  R,C-Router do not  | C-Router |
                     |__________|   advertise VIP     |__________|
                          |                                 |
                          |                              ^  |
  DIP->RIP1(CIP->VIP) |   |                     VIP->CIP |  |
                      v   |                                 |
                          |                                 |
         ----------------------------------------------------
         |                           |
         |                           |
  RIP1=10.0.0.1(eth0)       RIP2=10.0.0.2(eth0)
  VIP=y.y.y.110(tunl0)      VIP=y.y.y.110(tunl0)
         |                           |
  _________________         ___________________     
 |                 |       |                   | 
 | realserver      |       | realserver        |
 | tunl0: CIP->VIP |       |                   |
 | eth0:  VIP->CIP |       |                   | 
 |_________________|       |___________________|

8.3. You need a tunl0 device

Note
tunl0 is a networking device like eth0, lo, and dummy0.

In LVS-Tun, the tunl0 device holds the VIP, just as the lo device holds the VIP for LVS-DR. You need to build the tunl0 device into the Linux kernel (in networking options - IP:tunneling); it is turned off by default. The tunnelling (ipip) can be built as a module, in which case you'll have to insmod ipip before you can use it, or you can build ipip directly into the kernel. With a kernel enabled for ipip, you should be able to see the unconfigured tunl0 device with ifconfig or with ip addr show (Feb 2004 - my ifconfig used to see the unconfigured tunl0, but it doesn't anymore).

Then you configure the tunl0 device (even if ifconfig can't see it).

ifconfig tunl0 192.168.1.110 netmask 255.255.255.255 broadcast 192.168.1.110

(after this the tunl0 device becomes visible to ifconfig)

or

ip addr add dev tunl0 192.168.1.110/32 brd 192.168.1.110
Note
the VIP is a /32 addr, so the brd addr is the VIP, not x.x.x.255.
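
The realserver-side steps above can be sketched as follows, assuming ipip was built as a module and using the VIP 192.168.1.110 from the example setup. The helper name tunl0_setup_cmds is hypothetical (it is not part of LVS), and it only prints the commands so they can be reviewed before being run as root.

```shell
#!/bin/sh
# Print (do not run) the commands that bring up the VIP on tunl0.
# tunl0_setup_cmds is a hypothetical helper name, not part of LVS.
tunl0_setup_cmds() {
    vip="$1"
    # load the ipip driver; this creates the tunl0 device
    echo "modprobe ipip"
    # the VIP is a /32, so the broadcast address is the VIP itself
    echo "ip addr add dev tunl0 ${vip}/32 brd ${vip}"
    echo "ip link set dev tunl0 up"
    # confirm that the address is on the device
    echo "ip addr show dev tunl0"
}

tunl0_setup_cmds 192.168.1.110
```

To apply the commands rather than just list them, the output can be piped to sh by root once it looks right.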

8.4. the ARP problem with LVS-Tun

If the realservers and director are on different networks (e.g. the realservers are geographically remote), then the router in front of the realservers will not be advertising routes to the VIP and you won't need to handle the ARP problem on the realservers. In effect you are using Lars' method without having to do anything special.

If the realservers are using the same router as the director you need to handle the ARP problem for the realservers (set tunl0 to not reply to arp queries). This networking is the same as for LVS-DR and you'd only do this to test LVS-Tun. (there's no other reason to use LVS-Tun with the LVS-DR network). However all my LVS-Tun test cases used the same networking as for LVS-DR, i.e. the DIP and RIPs were on the same network and only one router (actually none, the client with 1 or 2 NICs, faced directly onto the director and realservers). In this case I had to handle the ARP problem for the realservers.

8.5. Reply packets appear to be spoofed

Unlike LVS-DR, with LVS-Tun the realservers can be in a different location (and on a network remote from the director), where the director and realservers will be on different networks and the realservers will be on a network that does NOT contain the VIP. If this is the case, the realservers will be generating reply packets with VIP:port->CIP (where port is the LVS'ed service). Not being on the VIP network, the routers for the realservers will have to be programmed to accept outgoing packets with src_addr=VIP:port. Routers normally drop these packets as an anti-spoofing measure. If you aren't in control of the routers, you'll just have to inform the people who are, that packets from VIP:port are valid for your business. If they don't want to help you with your business, then you should find another provider who will.

Mark Wadham mark (dot) wadham (at) areti (dot) net 30 Mar 2007

I believe we have located the source of the problem. Our load balancer is located in Manchester and the mail servers are located in London, and it appears that our upstream providers filter our traffic to prevent ip spoofing.

8.6. How LVS-Tun works

Here's part of the rc.lvs_tun script, run on the director, which adds the realserver with RIP=192.168.1.2 to the telnet service

#setup servers for telnet
/sbin/ipvsadm -A -t 192.168.1.110:23 -s rr
/sbin/ipvsadm -a -t 192.168.1.110:23 -r 192.168.1.2 -i -w 1

There's no forwarding in the conventional sense for LVS-Tun. (You can have ip_forward set to ON if you need it for something else, but LVS-Tun doesn't need it ON. If you don't have a good reason to have it ON, then for security turn it OFF). For more explanation see design of ipvs for netfilter.

#set ip_forward OFF for lvs-tun director (1 on, 0 off)
cat       /proc/sys/net/ipv4/ip_forward
echo "0" >/proc/sys/net/ipv4/ip_forward

As with LVS-DR, for LVS-Tun, the target port numbers of incoming packets cannot be remapped. A request to port 23 on the VIP will be forwarded to port 23 on a realserver, thus no port number is used for setting up the IP of the realserver. However you can still remap ports external to LVS, using iptables (see Re-mapping ports with LVS-Tun).

Here are the packet headers as the request is processed by LVS-Tun.

packet                  source        dest         data
1. request from client  CIP:3456      VIP:23       -
2. ipvsadm table:
   director chooses server=RIP1, encapsulates into IPIP packet
                        DIP           RIP1         IP datagram
                                                   source=CIP:3456,
                                                   dest=VIP:23,
                                                   data= -
3. realserver recovers IP datagram
                        CIP:3456      VIP:23       -
4. realserver looks up routing table, finds VIP is local,
   processes request locally, generates reply
                        VIP:23        CIP:3456     "login: "

5. packet leaves realserver via default gw, not via DIP.

For the verbally oriented...

A packet arrives at the director for the VIP. The director looks up its tables and decides to send the connection to realserver_1. The director encapsulates the request packet in an IPIP datagram with header DIP->RIP_1. The packet arrives at realserver_1, the realserver recovers the original IP datagram, looks up its routing table, finds that the VIP (on the non-arping tunl0) is local and processes the packet locally. A reply packet is generated with VIP:23->CIP:3456. The realserver looks up its routing table and finds that a packet to CIP goes out its default gw (not to the DIP).

The tunl0 device does not arp with 2.0.36 kernels, but does with 2.2.x (and later) kernels. Go look up the section on The ARP Problem to see if you need to patch the kernel on the realserver. (Joe: since kernel 2.6.4 and 2.4.26, arp_ignore/arp_announce are the preferred way of handling the arp problem.)
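
For kernels with arp_ignore/arp_announce, a minimal sketch of the realserver settings is shown below. The values 1 and 2 are the ones commonly quoted for LVS-DR and are an assumption here; check them against the ARP section before relying on them. The helper name arp_hide_settings is hypothetical. It prints sysctl.conf-style lines for "all" and for the device holding the VIP, because the kernel uses the larger of the "all" and per-interface values.

```shell
#!/bin/sh
# Print sysctl.conf-style settings that stop a realserver answering
# ARP requests for a VIP held on a non-ARPing device.
# arp_hide_settings is a hypothetical helper name.
arp_hide_settings() {
    dev="$1"
    for scope in all "$dev"; do
        # arp_ignore=1: reply only if the target IP is configured
        # on the interface the ARP request arrived on
        echo "net.ipv4.conf.${scope}.arp_ignore = 1"
        # arp_announce=2: always use the best local source address
        # for the target when sending ARP requests
        echo "net.ipv4.conf.${scope}.arp_announce = 2"
    done
}

arp_hide_settings tunl0
```

The printed lines can be appended to /etc/sysctl.conf and loaded with sysctl -p.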

8.7. The RIP (not the tunl device) receives the ipip packet

Joe

How does a packet get to a tunl device, which doesn't have a MAC address, from a remote machine?

Julian

tunl, lo and dummy are used just to configure the VIP. We don't send any packets through these devices. The requests are delivered to the realservers using their RIP. The director asks only about their RIP from ipvsadm. Only the router/gateway asks about VIP, but only the director must reply. When the packet is received in the realserver it is delivered locally (not forwarded or dropped) due to configured VIP. This is the only role of these "dummy" interfaces: the kernel to treat the received packet as it is destined to our host (the realserver). Nothing more. No IPIP encapsulations (for tunl), no MAC address definitions, nothing more. When we answer the request we use eth0. The tunl/lo/dummy is not selected as device for the outgoing packets. We have routes for eth0 (default gateway) which we use for the outgoing traffic. This is for DROUTE and TUNNEL mode.

If two linux boxes (not in an LVS) are joined by an IPIP tunnel and there is no MAC address associated with the tunl0 devices at each end of the link, then how do the packets get from one machine to the other?

Julian

The packets are encapsulated via IPIP and sent to the tunnel ends real IP where they are decapsulated again and appear on the tunl interface. You don't need a MAC address for point-to-point links, or logical interfaces like tunnels.

8.8. Configure LVS-Tun

Edit the template lvs_tun.conf and run the configure script

$ ./configure_lvs.pl lvs_tun.conf

Load the parameters into the director and then the realservers with the command

$ . ./etc/rc.d/rc.lvs_tun

(the script knows whether it is running on a realserver or the director).

(later put rc.lvs_tun in /etc/rc.d or /etc/init.d and put mon_xxx.cf in /etc/mon)

Check the output from ipvsadm, ifconfig -a and netstat -rn to confirm that the services/IPs are correct.

8.9. set rp_filter correctly

This material is now in the section "turn off rp_filter".

8.10. FreeBSD and Solaris realservers with LVS-Tun

maluyao ma(dot)luyao(at)gmail(dot)com 4 Apr 2007

see LVS-Tun on FreeBSD and Solaris realservers (http://kb.linuxvirtualserver.org/wiki/LVS/TUN_mode_with_FreeBSD_and_Solaris_realserver).

Here's how to set up ipip encapsulation in FreeBSD.

carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 20 Jun 2002

First, gifs must be supported in your kernel (enable "pseudo-device gif" in your kernel config).

src_addr is the address of your NIC's interface while dest_addr is the remote side or the other end of the tunnel IP address. For example, if pc1 is one end of your tunnel and pc2 is the other end, then:

if in pc1, you have:
	xl0: 1.1.1.1

	gif0:

if in pc2, you have:
	de0: 1.1.2.1

	gif0:

on pc1, do the following:

	pc1# ifconfig gif0 1.1.1.1 1.1.2.1

on pc2, do the following:

	pc2# ifconfig gif0 1.1.2.1 1.1.1.1

You can also man gifconfig .

I haven't tried using gif interfaces for IP-in-IP tunneling. I've only used them for IPv6 in IPv4 tunneling, but you can test it.

carla quiblat carlaq (at) asti (dot) dost (dot) gov (dot) ph 30 Jun 2002

I'd just like to report that I got LVS-Tun working for a Linux(as director)-OpenBSD(as realserver). I am currently testing LVS so we could use it to loadbalance web service requests (http) over different sites (different IPs/different blocks) therefore LVS-Tun is required.

I know FreeBSD implements tunneling but I've only used it for IPv6-in-IPv4 tunneling and I didn't quite understand how tunneling in Linux worked. For example, in linux to create a tunnel, you did this:

  • on the director: no tunnel is created because ipvs does the encapsulation

  • on the realserver:

    	ifconfig tunl0 172.26.20.110 netmask 255.255.255.255 broadcast 172.26.20.110
    	route add -host 172.26.20.110 dev tunl0
    

Basically, I understand that the tunl0 is identified with the remote tunnel end (VIP) but I don't understand the "route add" part since LVS-Tun only implements a one-way tunnel. That is, from the director to the realserver, tunneling from realserver-to-director is not required and seems useless. The realserver routes following its default router path direct to the client. So that's where I got stuck. "How do you say this in *BSD using the gif0 interface, the one I'm familiar with?" In the end, this is the topology we'd like to implement:

               --------
              | client |
               --------
                  |
                  |
               Internet
                  |
                  |
              LVS director, Linux
                  |
                  |  ______________
              -------...tunnel.....-->(one-way-tunnel)realserver, *BSD
              |      --------------
              |
           realserver(local-NAT), *BSD

with the tunneled packet routed normally through its routers/gateways (edge routers or other) down to the realserver.

My test setup looks like this:

[ client with a live IP ] -------gw------eth0(10.10.8.98, DIP) [director]
                                     |   eth0:110 (VIP)
                                     |
                                     |___fxp0(10.10.8.199,RIP)[realserver]

So what I did on the OpenBSD realserver is this,

	ifconfig fxp0 10.10.8.199 netmask 255.255.255.0 up
	route add default 10.10.8.1
	ifconfig gif0 tunnel 10.10.8.199 10.10.8.98
	ifconfig lo0 _VIP netmask 255.255.255.255

10.10.8.1 is the default gateway for the private network. Notice that the tunnel endpoint is the DIP (not VIP like in Linux). This is because as I understand, the packet that arrives at the realserver (encapsulated by ipvs) has this format:

	[D|R|C|V|...payload....]

where, D - director address, R - realserver address, C - client address, and V - VIP address. Decapsulation is done by the gif0 tunnel, after that it sees that the packet is destined to itself (VIP defined at its lo0 interface) and processes it normally with source IP= client IP.

When I do "telnet VIP" from the client, I successfully enter 10.10.8.199 after the login.

8.11. Windows realservers with LVS-Tun

Note
support for ipip was removed from M$ after w2k. Paolo has a solution for non tunnelling realservers using a spanned layer2 network.

Johan Ronkainen jr (at) mpoli (dot) fi 10 Feb 2003

It's possible with w2k Server. You'll find necessary settings under Routing and Remote Access snap-in. First create new IP Tunnel under "Routing Interfaces", then select "New Interface" under IP Routing/General and put necessary settings there.

You'll also need Loopback interface so w2k will handle packets itself and won't try to route them. Open Control Panel, click Add New Hardware, navigate thru dialogs and finally select Microsoft Loopback Adapter. If you want /32 network for loopback adapter you need to change it with regedit since GUI allows only /31. Network code itself is fine with /32 subnets.

It's been a while since I did this. We load-balanced w2k Terminal Server clients to three servers. Two were on same building as clients and third one was in different city on separate subnet. Clients connected to LVS that forwarded 2/3 of connections to local servers using LVS/DR and 1/3 to remote location using LVS/Tun via IP-tunnel. Replies were routed directly to clients.

This never went to full production and servers have been re-installed since so I can't check exact configs. It's not that hard. Just like LVS/Tun with Linux on LVS end. w2k part required bit trial and error but it's doable.

Adam Hammouda AdamMH (at) aol (dot) com 02 September 2003

I'm wondering if anyone can help me with some lvs/ipvs configuration issues regarding Windows realservers and LVS-Tunneling. I have been able to set up LVS-Tun when all realservers are Linux-based, however when Windows is thrown into the equation things start to get messy. I have

  • Created a new General IP Routing Tunnel (Interface) and set its local and remote addresses to the VIP and Director IP, respectively.
  • configured the Microsoft Loopback Adapter to use the VIP, and set its subnet mask to 255.0.0.0 as was recommended.

Chris Chris (at) baonline (dot) co (dot) uk 03 Sep 2003

We run lvs using tunneling (ipip) with 3 windows 2000 realservers. The steps are something like:

  1. Install a loopback adapter on each realserver. I think this is documented elsewhere to solve the ARP problem. But basically, go to add new hardware, network, microsoft, loopback adapter. Assign this adapter the IP address of the cluster, (VIP). - Sounds like you have already done this
  2. Go to routing and remote access on each realserver. Enable if needed. Don't let it configure automatically; that is asking for trouble. Under routing interfaces, add a new IPTunnel. Now, under IP Routing - General, add a new interface, selecting the IPTunnel that you have just created. For the 'Local address', specify the RIP of that realserver. For the Remote Address, specify the DIP. OK through all that and then reboot the realserver (it is M$ after all). - Sounds like you have specified the wrong IP address

Paolo Penzo paolo.penzo (at) bancatoscana (dot) it 26 Sep 2003

I'm using LVS on a geographical basis (DR and TUN) with both Linux and Windows 2k as realservers. Unfortunately we started to migrate Win 2k servers to Win 2003 and we discovered that IP-IP encapsulation is not supported anymore by MS servers (see http://support.microsoft.com/?id=280484) so LVS TUN configurations don't work anymore if you use win 2k3 as realserver. I'm thinking about how to overcome this problem by manually configuring IPSec tunnels or something similar... Help is welcome.

Joe: there was no answer

8.12. Realservers without ipip encapsulation

ipip encapsulation is used when the realservers are at a remote site. Methods of tunneling other than ipip exist (e.g. a VPN) if you need geographically remote realservers.

Richard Seabrook

Since Windows 2003 doesn't support IP-in-IP like 2000 did, what other alternatives are people using when real servers are remote from the directors?

Paolo Penzo paolo (dot) penzo (at) bancatoscana (dot) it 06 Dec 2006

We made a layer2 network spanned across geographical sites and moved to DR balancing: everything is much easier to manage!

8.13. LVS-Tun has smaller MTU: PMTU is disabled - handling fragmentation

Note

Ratz 28 Feb 2007 (after seeing a patch in the lkml regarding documentation for sysctls tcp_mtu_probing and tcp_base_mss.)

I found this patch rather interesting, given that PMTU seems to be disabled by default on newer 2.6.x (x>19?) kernels. We need to keep an eye on this.

An ipip header is added when sending packets through a tunnel. Since the MTU is fixed (1500), the extra 20-byte header reduces the allowed packet payload size. This will require fragmenting of packets larger than 1480 bytes sent from the director to the realserver in LVS-Tun. LVS (and Linux) doesn't have any special code to handle ipip fragmentation, so we should have expected LVS-Tun to fail when the client sent packets large enough to require fragmentation in the DIP->RIP hop. Either few people were using LVS-Tun in production, or clients were only sending small packets (e.g. HTTP GET) and we didn't realise for a long time that we had a problem lurking. Further below is Casey Zacek's solution for both w2k and linux. Here is Julian's description of the problem.
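
The size arithmetic above can be sketched in a few lines of shell. The sketch assumes the outer IPIP header is a plain 20-byte IPv4 header with no IP options; the function names are illustrative only.

```shell
#!/bin/sh
# IPIP adds one outer IPv4 header (20 bytes without IP options).
IPIP_OVERHEAD=20

# Largest CIP->VIP packet that fits in one DIP->RIP ipip packet
# for a given link MTU. ipip_inner_max is a hypothetical helper name.
ipip_inner_max() {
    echo $(( $1 - IPIP_OVERHEAD ))
}

# Does a client packet of size $1 need fragmenting on a link of MTU $2?
needs_frag() {
    if [ "$1" -gt "$(ipip_inner_max "$2")" ]; then
        echo yes
    else
        echo no
    fi
}

ipip_inner_max 1500      # prints 1480
needs_frag 1500 1500     # prints yes
needs_frag 1400 1500     # prints no
```

A client sending full 1500-byte packets therefore always exceeds what the DIP->RIP hop can carry over a 1500-byte link, while short requests such as an HTTP GET never do.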

Julian Feb 12 2007

The client will (Joe: should?) see a "fragmentation required" icmp packet from the director, if the packet is bigger than our PMTU to RS.

Note
this problem is still present and it is hard to fix (it's a bug):
http://marc.theaimsgroup.com/?l=linux-virtual-server&m=107757685230840&w=2

Without handling ICMP errors for our IPIP packets, we will not lower the PMTU just by generating IPIP traffic. But other (non ipip) protocols (packets to RS) can learn lower PMTU and update the cache. Then we can see this lower value in the routing cache and generate reply ICMPs when large packets come from the client. At least that's what I remember from before; I'm not sure if things have changed in 2.6 now.

IPIP packets are between the DIP and RIP. These packets can hit the MTU limit in all hops between director and RS. The reply packets from RS to CLIENT are another path. If a big packet from the RS to CLIENT hits a MTU limit, then our director will receive ICMP/FRAG_NEEDED from xxxHOP to VIP, which we tunnel in IPIP to RIP. Here is a simple picture showing the MTU for each hop:

				VIP<-CIP
                         ------------------------------
                        |                              |
       1500      1400   v      1300      1200    1100  ^   1000
CLIENT ---> DHOP ---> DIRECTOR ---> RHOP ---> RS ---> CHOP ---> same CLIENT
          CIP->VIP                DIP->RIP          VIP->CIP

CLIENT   - knows about DHOP and uses MTU=1500
DHOP     - hop/router to director, knows MTU=1400 to director
director - sees MTU=1300 to RS, knows (or doesn't know) about RS (MTU=1200)
RHOP     - hop/router to realserver, knows about RS and uses MTU=1200
RS       - connects to CHOP with MTU 1100
CHOP     - hop/router to CLIENT, uses MTU 1000

The steps:

  1. Client sends 1500-byte TCP packet to DHOP (IP DF=1)
  2. DHOP returns ICMP/FRAG_NEEDED
  3. Client reduces the cached MTU and generates a new shorter 1400-byte packet
  4. ipvs() in Director receives the packet and if PMTU for RS is not configured to 1200, the packet still hits the 1300-byte limit. ICMP is replied (back to client). As the kernel/IPVS does not update PMTU cache based on ICMP for our IPIP packets, the PMTU for DIP->RIP has to be configured manually in the director (if it's going to be set at all there). Director encapsulates the CIP->VIP packet inside an ipip packet DIP->RIP. This packet has a header IPIP (there is no SYN, ACK, DATA, FIN) and is a one-off packet, and not part of a two-way DIP-RIP tcp connection.
  5. The decapsulated packet arriving at the RS (CIP->VIP) can't result in any packets going back to the director. Once decapsulated the RS forgets that DIP sent the packet, and the RS now sees the original packet from client. If the IPIP packet (DIP->RIP) is too big, the RS won't cause an icmp FRAG_NEEDED to be sent back to the DIP - the RS is not a router, only the RHOP will send an ICMP packet.
  6. CLIENT sends 1280-byte packet which IPVS tunnels into 20+1280 IPIP packet.
  7. RHOP generates ICMP/FRAG_NEEDED (src_addr=RHOP dst_addr=DIP). We don't accept this icmp packet and the icmp information is lost (the problem mentioned in URL). I don't remember if this is a problem with ip_vs() or the Linux kernel. Maybe the kernel learns the PMTU from the first ICMP packet, but IPVS does not see the ICMP packet. I hope for the 2nd packet from CIP IPVS will detect the lower MTU limit and will return ICMP immediately to CIP. Maybe IPVS looking for ICMP packets in LOCAL_IN can see this first error, but it is hard to forward similar message to the client.

If instead the director knows about the 1200-byte limit, then any IPIP packet from director will reach RS without any ICMP replies. One way of doing this would be by setting the mtu for the route DIP->RIP. (This command sets a lower MTU for all packets, not just ipip packets.)

director# ip route add RIP via RHOP dev DEV src DIP mtu 1200

If the RS generates a 1500-byte TCP reply packet (VIP->CIP), then CHOP will generate ICMP reply to the VIP, that should come in director, if routed properly (this packet will likely traverse the internet using a path separate to the client-director-RS-client path). On arrival at the director, the director will use icmp.c:icmp_unreach() to learn the PMTU. ip_rt_frag_needed() will save the value in the routing cache. Since the director doesn't send packets to CHOP, the problem then is how this information is used. The kernel's ICMP protocol receiver parses the information, updates PMTU in cache, but fails to deliver it to the upper layers as happens when delivering errors to sockets. This time IPVS was the sender. That is why the LOCAL_IN hook exists, where IPVS can listen for these errors, but as I said, it is difficult to generate ICMP error to send to the CIP.

Another problem is what MTU to use between RS and clients, but IPVS should properly forward (tunnel) any ICMP errors from hops between RS and (before) client to the RS (Joe: the director?). The client will never trigger an ICMP reply, which is generated only by routers. CHOP replies to the packet VIP->CIP, so this ICMP packet comes to the VIP (director) and IPVS will select the appropriate connection in the ipvsadm table, and forward the ICMP information in an IPIP packet to the RIP (as happens for the regular TCP packets from the CIP). ip_vs() used to have forwarding of ICMP from the non-error class icmp packets (e.g. ICMP ECHO), but someone dropped it from 2.6 as an unused feature.

I hope that is how IPIP setups work.

8.14. MTU: early signs of problems

awysock (at) absoftware (dot) com 28 Nov 2003

I've set up two UM Load balancers running LVS 1.0.10 and have them up and running. I'm using LVS-Tun since I rent my servers and my IP addresses are all over the place. My site deals with lots of photos, so my users are doing large POSTS along with large POSTS of Text data. It seems when the ethernet packet goes over the 1460 byte mark only some of the users fail; others (my own machines) work just fine. I have tried it on my Windows machine and my Mac and I have no problem, but when somebody elsewhere on the net does the same function they fail with a 404 or timeout error on their end. It's only some of the people, others are not having the problems. If they go directly to the server it works. So I'm guessing it's something between the LVS and the Real Servers.

I have changed the MTU value for eth0 on the director to 1400. All that does for me is make more machines (all that I have tested) suffer from the same problem. Should the MTU value be changed at different places? i.e. both ends of the tunnel? I knew that our choice to use Windows 2000, would haunt me! Does anyone know how to change the MTU for an IP tunnel in Windows 2000?

Ratz, 1 Dec 2003

Enable PMTUDiscovery in w2k (http://insight.zdnet.co.uk/communications/networks/0,39020427,2123537-2,00.htm) and DrTCP (http://www.dslreports.com/drtcp) (Joe: presumably you want DRTCP019.exe, support for MTU set in w2k).

The MTU was originally set to 1500 on all machines. Most machines worked but some would not when posting large amounts of data.

  • When I set the MTU for all interfaces on the director to 1400 and leave the MTU for the tunnel untouched at 1500, all machines would fail.
  • When I set the MTU for all interfaces on the director to 1400 and set the MTU for the tunnel at 1400, all machines would fail.
  • With the MTU for the tunnel set to 1400. I can set the MTU for the director to anywhere between 1420 - 1500 before it fails with all machines.
  • The largest packet I can transmit on the ISP's network without it fragmenting is 1472 although they claim their MTU is 1500. (ping -f -l 1472 www.linux.org works but anything bigger does not)

This makes no sense to me. The only way I can think this is correct is if: Maximum packet size (without a tunnel) between director and realservers is 1500. If the header for IPIP tunnel is about 20 bytes, then the maximum packet size for packets within the tunnel is 1480. Therefore, the MTU for the director must be at least 20 more than the MTU for the tunnel. So why does using 1400 everywhere make it all fail, but 1500 everywhere only fail on some machines?

What can I set the MTU values to in order to guarantee it working with all clients? Most of our clients have no technical knowledge and this is becoming a nightmare!
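The numbers in this posting can be checked with a little arithmetic. The sketch below is only a worked example, assuming an IPv4 header of 20 bytes with no options, an ICMP echo header of 8 bytes and an IPIP overhead of 20 bytes (one extra IPv4 header); the function names are made up for illustration. It suggests that a largest unfragmented ping payload of 1472 bytes is what you would expect from a link MTU of 1500, and that 1480 is the largest client packet that fits through the tunnel on such a link.

```shell
#!/bin/sh
# Header sizes in bytes (assumption: IPv4 without options).
IP_HEADER=20
ICMP_HEADER=8
IPIP_OVERHEAD=20   # the extra outer IPv4 header added by ipip encapsulation

# Largest ping payload that fits in one packet on a link with MTU $1.
max_ping_payload() {
    echo $(( $1 - IP_HEADER - ICMP_HEADER ))
}

# Largest original (inner) packet that still fits after ipip
# encapsulation on a path with MTU $1.
max_inner_packet() {
    echo $(( $1 - IPIP_OVERHEAD ))
}

# Example:
#   max_ping_payload 1500
#   max_inner_packet 1500
```

The ping payload size (the argument to -l on Windows or -s on Linux) does not include the IP and ICMP headers, which is why the payload limit is 28 bytes below the MTU.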

Horms 30 Nov 2003

typically the MTU used is 1500 bytes. But when tunnels come into play then this becomes slightly smaller because of the overhead for the tunnel. This should not be an issue but in practice it often makes sense to manually set the MTU to the smaller value on applicable interfaces.

ratz 01 Dec 2003

... or the mtu of the tunnel's routing entity for that matter. This is faster and less intrusive than adjusting down the whole physical interface's mtu. I use it for boxes where I have dozens of VPN tunnels over a physical interface, but also non-tunneled traffic.

Joe

Note
Ratz is saying to change the MTU not for the interface (which will affect all routes through that interface), but only for the route. Presumably the route is DIP->RIP (the packet on arrival at the RIP is decapsulated to the packet with dest_addr=VIP). (Feb 2007 - Ratz posted that he got the idea from off-line discussions with Julian. But Ratz gets the credit for telling us about it.)

Roberto Nibali ratz (at) drugphish (dot) ch 01 Jun 2004

You can set the mtu for a route to/from the VIP. You must of course pay attention to route selection which can be investigated with ip rule/ip route or the shell tools I've written to display routing tables. So you might need to put the VIP route into a special routing table which gets parsed before the other routes. Also don't forget to flush the routing cache.

Joe: in principle this is easy to do, but no-one has done it yet. The ipip packet from the director to the realserver is DIP->RIP. Ideally you would only want to change the mtu for the ipip packets to the RIP (or to the RIP network), so that other packets to the RIP (e.g. logging, administration) have standard MTUs. As well we aren't sure yet whether PMTU works, even if we do change the mtu for the DIP->RIP (someone could look in the code). Here's how Ratz changes the MTU for the default route.

Ratz 05 Feb 2007

Here I add a default route to a new table and change the default mtu. Basically you can use the "change" keyword in conjunction with the "mtu" selector on the specific route.

root@laphish2:~# ip route help
Usage: ip route { list | flush } SELECTOR
       ip route get ADDRESS [ from ADDRESS iif STRING ]
                            [ oif STRING ]  [ tos TOS ]
       ip route { add | del | change | append | replace | monitor } ROUTE
SELECTOR := [ root PREFIX ] [ match PREFIX ] [ exact PREFIX ]
            [ table TABLE_ID ] [ proto RTPROTO ]
            [ type TYPE ] [ scope SCOPE ]
ROUTE := NODE_SPEC [ INFO_SPEC ]
NODE_SPEC := [ TYPE ] PREFIX [ tos TOS ]
             [ table TABLE_ID ] [ proto RTPROTO ]
             [ scope SCOPE ] [ metric METRIC ]
             [ mpath MP_ALGO ]
INFO_SPEC := NH OPTIONS FLAGS [ nexthop NH ]...
NH := [ via ADDRESS ] [ dev STRING ] [ weight NUMBER ] NHFLAGS
OPTIONS := FLAGS [ mtu NUMBER ] [ advmss NUMBER ]
           [ rtt NUMBER ] [ rttvar NUMBER ]
           [ window NUMBER] [ cwnd NUMBER ] [ ssthresh NUMBER ]
           [ realms REALM ]
TYPE := [ unicast | local | broadcast | multicast | throw |
          unreachable | prohibit | blackhole | nat ]
TABLE_ID := [ local | main | default | all | NUMBER ]
SCOPE := [ host | link | global | NUMBER ]
FLAGS := [ equalize ]
MP_ALGO := { rr | drr | random | wrandom }
NHFLAGS := [ onlink | pervasive ]
RTPROTO := [ kernel | boot | static | NUMBER ]

root@laphish2:~# ip route show
192.168.1.0/24 dev eth1  proto kernel  scope link  src 192.168.1.32
default via 192.168.1.1 dev eth1
root@laphish2:~# ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default
root@laphish2:~# ip rule add from 10.0.0.0/16 table 33 prio 100
root@laphish2:~# ip rule show
0:      from all lookup local
100:    from 10.0.0.0/16 lookup 33
32766:  from all lookup main
32767:  from all lookup default
root@laphish2:~# ip route add default via 192.168.1.1 dev eth1 table 33
root@laphish2:~# ip route show table 33
default via 192.168.1.1 dev eth1
root@laphish2:~# ip route change default via 192.168.1.1 mtu 1000 table 33
root@laphish2:~# ip route show table 33
default via 192.168.1.1 dev eth1  mtu 1000

Clean up:

root@laphish2:~# ip route flush table 33
root@laphish2:~# ip rule del prio 100
root@laphish2:~# ip rule show
0:      from all lookup local
32766:  from all lookup main
32767:  from all lookup default

Jacob Coby jcoby (at) listingbook (dot) com 01 Dec 2003

Decreasing the MTU with this bug only causes more problems; it causes the packets to fragment MORE often. When I had the issue, I could decrease the MTU to 200 bytes, and the connection would fail at a payload of ~160 (20b for the IP header, 20b for the IPIP header), even with non-tcp data, like ping.

Julian 28 Nov 2003

try LVS with 2.4.23 as it contains a fix for packets longer than mtu.

(and later) Julian Anastasov 24 Feb 2004, 29 May 2004

There is only one remaining problem related to LVS-TUN: there is no handling of ICMP errors being received on a local IP after being returned from somewhere in the path (DIP->RIP) coming back to the DIP and containing the reply to tunneled packet (e.g. a frag_needed message and carrying the first few bytes of the packet). We do not relay these messages, generated between the director and the realserver, back to the client. The correct target for the ICMP message depends: the director is sending 20 bytes more (the ipip overhead), and if this is causing the ICMP message, then the client need not receive the ICMP message in all cases. The client should only receive an ICMP message if the director detects a lower PMTU. While TCP and UDP handle ICMP errors, IPIP does not handle them well. The LVS-DR and LVS-NAT forwarding preserve the sender's IP in which case ICMP traffic from realservers (or hosts before realservers) is always returned to the client. But if LVS-Tun is used, the ICMP packets are not returned to the client.

If the only traffic from the director to the LVS-Tun realservers is IPVS traffic, then the routing cache does not receive the PMTU info from ipip_err() and we don't learn the correct path MTU to the realserver. Then, on forwarding packets, the IPVS code cannot detect that the path has lower PMTU. But this is theory, not really tested. Maybe we can update the PMTU in the routing cache by listening to these ICMP errors in LOCAL_IN? Needs experiments and time for fixing, patches are welcome.

There is no such thing as an MTU for ipip with IPVS. IPVS extends the packet with 20 bytes by prepending IPIP header and ignores the mtu. IPVS has its own encapsulation and uses the route to the RIP (you do not need to configure a tunl0 device on the director).

I would love to upgrade the Kernel (currently 2.4.20) but that is not an option as a quick fix at the moment. - Live environment and the like.

This time the fix is not in the IPVS code: (see the kernel bug list http://linux.bkbits.net:8080/linux-2.4/hist/net/ipv4/ip_output.c?nav=index.html|src/.|src/net|src/net/ipv4). The problem is that skb->nfcache is not copied on [re]fragmentation. Here's a posting and patch by Julian to the linux-netdev mailing list (http://marc.theaimsgroup.com/?l=linux-netdev&m=106589293316918&w=2).

But we need to see your tcpdump output first because the PMTUD (path MTU discovery) is usually enabled.

Joe

IPIP is a one-way channel (packets don't come back?) and PMTUD doesn't work?

Julian

The director still can receive ICMP errors with the source somewhere between the LVS-Tun realserver and the director.

Chris Paul

The problem is I can not reproduce the error. We only have a small number of non technical customers who are having trouble, but I can only go so far when it comes to asking them to debug our services.

Julian 01 Dec 2003

Then your problem is related to the client-director PMTU. I understand that it can be difficult to trace an unknown client, but do you have some kind of ICMP filtering between clients and the director?

The problem came up again in May 2004 (when the current kernel was 2.4.26).

Casey Zacek cz (at) neospire (dot) net 26 May 2004

The problem, as described by one of my customers, is this (the customer is running phpBB on 3 Linux/Apache servers with an LVS-Tun setup):

For very few users, when they post long posts (anything over a few lines) and hit submit, the browser appears to hang and finally it times out. Similar effects if they try and update their profiles. I even experienced this on my home computer. I use a proxy server sometimes and it showed the request being transmitted from my computer but ultimately no response was received from the site. Now, in most instances of this, we have found that the affected users are on broadband using a router of some type. I myself use a cable modem connected through a Linksys Router. When I experienced the issue, I was able to post from work, but not from home. I fiddled with my setup, thinking it was cookies or caching of some type and ultimately performed a firmware upgrade on my router. Suddenly the problem went away.

At the time, I was running kernel 2.4.25 (IPVS 1.0.10), but since upgraded to 2.4.26 (IPVS 1.0.11), then 2.6.6 (IPVS 1.2.0). I have asked the customer to retest it, but he'll have to talk to some of his users, from the sound of things, since he upgraded his router firmware. I'd love to chalk it up to "client router problems," but that probably won't be good enough for this customer. The customer's setup worked using a Riverstone smartswitch router running what equates to LVS-NAT, but it does not work with this LVS-Tun setup.

With all three versions, I get a lot of these messages:

IPVS: ip_vs_tunnel_xmit(): frag needed

Julian

This message means that the IPVS director is generating ICMP errors to request that the client reduce the packet size. Maybe these ICMP messages are filtered somewhere and do not reach the client.

I have a step-by-step howto for TUN setups: http://www.ssi.bg/~ja/TUN-HOWTO.txt

Note
Joe: This URL doesn't directly address the mtu problem. It checks the encapsulation and routing.

Joe

Why is the default MTU for ipip packets 1480, rather than 1500+overhead_for_ipip=1520? Is 1500 a hardware buffer size limit in the NICs? (i.e. hardware buffer=1500?)

Julian

I don't know the origin of the 1500 limit. Maybe it is a balance between link sharing and protocol header overhead.

There is a convention in IPv4 to reply with an ICMP error if a packet with DF flag set reaches a smaller pipe (i.e. packet length > PMTU). If the DF flag is not set, the packet is fragmented into MTU-sized fragments.

Note
For an explanation of PMTU and the DF flag, see PMTU - Path MTU Discovery (http://www.netheaven.com/pmtu.html).

There can be many problems related to MTU:

  • no ICMP errors generated (from director or from other hosts between director and realservers)
  • ICMP errors do not reach their destination (the client), e.g. filters dropping blindly any ICMP traffic
  • ICMP errors generated from realservers (or from hosts between the director and realservers) not forwarded from director to client
  • PMTU not updated in routing cache due to IP TOS changes after routing

Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004

The problem is caused by the linux kernel not taking into account the size of the ipip tunnel headers when sending traffic over an ipip tunnel.

Basically, the MTU (the largest size packet that can be sent over a network) is normally 1500 bytes. With the IP header information this drops to 1492, so the largest size of packet that can be sent over an IP link, before the packet gets split into multiple packets, is 1492 bytes. When you use ipip tunneling, there is an additional header that takes the maximum transmission size through the link to something like 1480. Linux kernel 2.4.??? does not take into account this additional header and sets the mtu for the ipip tunnel to 1492. So if you send a packet that is between 1480 and 1492, it gets truncated rather than split into multiple packets. The ipip tunnel destination then waits to receive the rest of the packet, which never arrives. The result is the server never responds.

When I was having this problem, it was a nightmare because you can not guarantee it will fail. It only fails when the packet size is very specific and the size of the header is also large. To fix this you can either:

  • Upgrade to kernel 2.6.???
  • Change the MTU values on the director.

I solved it by changing the MTU values, but it was nearly a year ago now and I can't remember exactly which ones I changed, i.e., the RIP on the director, the tunnel from the director, or the tunnel from the realserver.

Chris Paul Chris (at) baonline (dot) co (dot) uk 27 May 2004

You have to change the mtu value on the end of the IP tunnel that initiates the tunnel i.e. the realserver (in this instance, a w2k box). This value should be close to the mtu value of the physical interface it is going through, but small enough to ensure there is enough space left for the ipip header. We use 1400 and have never had any reports of it failing. To do this you go to the registry and add a dword entry called MTU with the decimal value 1400 (safe) into

hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel}

reboot

Note
with w2k, XP, you can "Restart Networking"
Note
Joe - see tunl MTU solved for Casey Zacek's modification of this method.

If the mtu is not set, you get lots of IPVS: ip_vs_tunnel_xmit(): frag needed messages logged to the console and connections hang.

Joe

I would have thought you'd set the mtu at the director end. Presumably if any end of the segment has a reduced mtu, then both ends of the segment should be notified about it.

Julian

This message means "fragmentation needed but DF flag set". ip_vs_tunnel_xmit() tries to prepend IPIP header, but notices that the resulting packet with DF flag set will exceed the PMTU(director->RS) limit, so it generates an ICMP error instead of xmit-ing the packet to the RS.
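The decision Julian describes can be modelled in a few lines. This is only a sketch of the logic as described in the text, not the kernel code; the function name and its three arguments (packet length, DF flag as 0 or 1, PMTU of the route to the RIP) are invented for illustration, and the 20 byte overhead assumes an outer IPv4 header without options.

```shell
#!/bin/sh
IPIP_OVERHEAD=20   # assumption: outer IPv4 header without options

# tunnel_xmit_decision PACKET_LEN DF ROUTE_PMTU
# Prints what the director is described as doing with a client packet:
#   "forward"                  the packet is encapsulated and sent to the RS
#   "icmp-frag-needed mtu=N"   an ICMP error is returned to the client
tunnel_xmit_decision() {
    pkt_len=$1
    df=$2
    route_pmtu=$3
    inner_mtu=$(( route_pmtu - IPIP_OVERHEAD ))
    if [ "$pkt_len" -gt "$inner_mtu" ] && [ "$df" -eq 1 ]; then
        # Encapsulated packet would exceed the PMTU and may not be fragmented.
        echo "icmp-frag-needed mtu=$inner_mtu"
    else
        # Fits, or DF is clear so the packet may be fragmented on the way.
        echo "forward"
    fi
}

# Example:
#   tunnel_xmit_decision 1492 1 1500
```

With a 1500 byte PMTU to the realserver, a 1492 byte client packet with DF set triggers the error and the client is told to use an MTU of 1480.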

Joe

What messages would you get if the icmp problem was about the link between the tunnel realserver and the director?

Messages of the same type, but we haven't handled this case yet. In the meantime, setting the proper PMTU in the director for the route to the realserver is a good idea.

Jacob Coby jcoby (at) listingbook (dot) com 27 May 2004

I could test for the problem reliably by using ping with packet_size>934 (934 and lower worked fine). Once I bumped it up over 934, I'd see Must Fragment (MF) ICMP messages being sent, and the ping request would have no response. As I lowered the MTU, the size of the ping that would cause the problem lowered in direct proportion. A 1500 MTU would cause a 935 byte ping to fail, a 1400 MTU would cause a 835 byte ping to fail, and so on. Any HTTP GET or POST over that 934 byte payload would cause the site to not respond.
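Jacob's test (find the ping size at which replies stop) can be automated. The following is a minimal sketch, assuming the failure is monotonic (every size up to some limit works and every larger size fails). The function names and the TARGET variable are made up for this example; the ping_df probe assumes the Linux iputils ping, where -M do sets the DF flag.

```shell
#!/bin/sh
# max_payload LOW HIGH PROBE
# Binary-search the largest size in LOW..HIGH for which "PROBE size"
# succeeds. Prints -1 if no size in the range works.
max_payload() {
    lo=$1
    hi=$2
    probe=$3
    best=-1
    while [ "$lo" -le "$hi" ]; do
        mid=$(( (lo + hi) / 2 ))
        if "$probe" "$mid"; then
            best=$mid              # this size works, try larger
            lo=$(( mid + 1 ))
        else
            hi=$(( mid - 1 ))      # this size fails, try smaller
        fi
    done
    echo "$best"
}

# Probe for a real network: one echo request of the given payload size
# with DF set. TARGET is a hypothetical variable naming the host to test
# (e.g. the VIP).
ping_df() {
    ping -M do -c 1 -W 2 -s "$1" "${TARGET:-127.0.0.1}" >/dev/null 2>&1
}

# Example (run from a client):
#   TARGET=VIP max_payload 0 1472 ping_df
```

Keeping the probe as a separate function means the search can be tried out with a fake probe before pointing it at a live service.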

Chris Paul

Where are you setting your mtu of 1400? You have to make sure that it is the mtu for data inside the tunnel. When I changed the mtu values, the only way I could reliably get it to change the size inside the tunnel rather than the whole tunnel packets was from the realserver not the director.

Jacob Coby

I have no idea which MTU I was setting. I could get the problem to go away for one or two times, and then it would come back. It's been over a year since I messed with LVS-TUN, and I'm now running LVS-DR.

Peter Mueller pmueller (at) sidestep (dot) com 27 May 2004

I've heard people in poptop use this hack. Maybe you can modify for your use in this situation. If it works I like this solution better than a change to the MTU on the interface.

iptables -A FORWARD -p tcp --tcp-flags SYN,RST SYN -j TCPMSS --set-mss 1300
Note
Joe: MSS is maximum segment size, i.e. the payload in the packet, rather than the packet size (which is set by mtu).

Julian 3 Jun 2004

Ratz's work around should work, or you can hope that other traffic between director and RS will update the PMTU in the routing cache.

Joe

if you put a tunl0 device on the director, would it receive the PMTU packets back from the realserver?

Yes, it can look into the ICMP errors that include IPIP header but the current version of ipip.c does not update the RIP's PMTU. Another option is IPVS to do it in LOCAL_IN. The VIP does not play here. The forwarded traffic in the director is routed to daddr=RIP (as for the other forwarding methods). Only the clients need a route to VIP.

Joe

I was thinking to reduce the MTU for the CIP-VIP segment, then there would be no problem in the DIP-RIP segment. Is this a way of handling it?

This is another solution. Just keep PMTU(CIP->VIP) <= PMTU(DIP->RIP) - 20 (the ipip header adds 20 bytes to each packet from the client). I'm not sure you can do it for every client. Maybe it can be in the default route :)

To use Ratz's work-around, you set the PMTU for packets going to the RIP (via eth0 on the director, there being no tunl0 devices on the director). If it is set to 1500 then you do not need such a route, as IPVS reports the PMTU reduced by 20 (here 1480) when generating the ICMP error to the client. So, if the PMTU to the RIP is X, or the RS sends an ICMP error to the director notifying it of PMTU=X, then IPVS will report PMTU=(X-20) to the client.

OTOH, may be it is not so difficult to check in LOCAL_IN for any FRAG_NEEDED errors and if they reduce the PMTU for RIP we can update the routing cache. Need to investigate whether we can easily find that such error is for one of our TUN RIPs.

What you can do on the director when using LVS-TUN:

  • if PMTU to RIP is lower than outdev's MTU then you have to specify it in special route to RIP. If the PMTU is 1500 you do not need such route, IPVS automatically reports MTU 1480 to all clients that send packets>1480. There is no chance for PMTUD to work when the ICMP errors are dropped between director and client. It will happen for any used PMTU to RIP.
  • run tcpdump and check for any received or generated ICMP errors

    The PMTU is not updated in the routing cache if director receives ICMP_FRAG_NEEDED. This is easy to detect and to solve. The good news is that you can detect it from any client: send a large file and tcpdump for ICMP errors coming from realservers to director. If this is the case (PMTU to RIP is lower than outdev's MTU) then you can try to specify pmtu in special route to RIP. Once the director knows the right PMTU to RIP then it will report it to every client that violates it. There is no need for IPVS to relay the ICMP error coming from RIP to the client, we just know how to generate it on each request from client. The only benefit can be if ipip.c is patched to update the PMTU in the routing cache and to avoid creating special route to RIP.

  • All other problems can be related to filtering of the ICMP errors generated from director and sent to client. Such places for filtering can be the netfilter in director or any router used to reach the client. It is enough to check that the ICMP errors generated in director reach the director's uplink GW. Then you hope that the client does not filter ICMP.
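The "special route to RIP" in the first item could look something like the sketch below. It is untested against a live LVS-Tun director, and the RIP, device, gateway and PMTU values are hypothetical. The helper only prints the command (a dry run) so that it can be inspected before being run as root; it uses the mtu option and the replace keyword listed in the ip route help output shown earlier.

```shell
#!/bin/sh
# rip_route_cmd RIP DEV PMTU [GATEWAY]
# Print (do not execute) an iproute2 command that pins the path MTU for
# the host route to one realserver. All argument values are examples.
rip_route_cmd() {
    rip=$1
    dev=$2
    pmtu=$3
    gw=${4:-}
    if [ -n "$gw" ]; then
        # RIP is reached through a gateway
        echo "ip route replace $rip/32 via $gw dev $dev mtu $pmtu"
    else
        # RIP is on a directly attached network
        echo "ip route replace $rip/32 dev $dev mtu $pmtu"
    fi
}

# Example:
#   rip_route_cmd 10.1.1.2 eth0 1400 192.168.1.1
```

After running the printed command, ip route get RIP shows which route (and mtu) the director will use for that realserver. A /32 route affects all traffic to the RIP, not only the ipip packets.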

Joe

what is the MTU doing in the output of ip addr show dev tunl0 when you have a tunl device on a machine? I can set it (can't I?). Is the mtu meaningless, ignored, what?

It is ignored for IPVS traffic, IPVS has its own encapsulation and uses the route to RIP (you do not need to configure tunl0 in director). The tunl0 device is usually needed to receive IPIP packets, so in normal cases you do not need such interface in director even when using TUN realservers. The PMTU setting must be for the route to RIP. Such setting (and special route to daddr=RIP) can be needed only if PMTU to RIP is less than the outdev MTU.

So with regular ipip tunneling (not ipvs) you only need the tunl0 device on the receiving end? The only reason you need a tunl0 device on the transmitting end is to handle the packets that reply?

For regular ipip purposes tunl0 can be used both for send and for receive. IPVS simply knows how to create ipip packets without using the ipip code.

8.15. tunl mtu solved: Setting the MTU by MSS with iptables on the realserver

Note
Casey's solution is run on the realserver. Presumably a similar solution could be found for the director. Ratz's method of setting the mtu for the route rather than the interface runs on the director.

Casey Zacek cz (at) neospire (dot) net 2005/03/11

I've emailed about this before, and nothing we ever came up with really worked. The real problem I've always had is that I've never had a means for duplicating it (possibly because I didn't fully understand the problem -- I can probably duplicate it at will now), and my customers have eventually just either accepted it and moved on or changed to an LVS-NAT environment. I finally came across someone whose home network was set up in such a way as to experience the "problem", so I decided to figure it out once and for all and hopefully end all the confusion. Attached is a piece of PHP (lvs-tun-test.php) that'll duplicate the problem. The "submit" query will timeout if you are experiencing the problem.

Matthew Boehm matthew (at) matthewboehm (dot) com 6 Jan 2007 (and Casey).

Note

With IE6/7: When you submit the POST, the page just reloads (Matthew) or hangs/times out with no data posted (Casey).

With Firefox/Netscape: You get a "Bad Request" page.

Cut here --- lvs-tun-test.php --------------------------------------
<html>
<head>
	<title>big POST test</title>
</head>
<body>
<?php echo htmlspecialchars(isset($_POST['test']) ? $_POST['test'] : ''); ?>
<form action="lvs-tun-test.php" method="POST">
<textarea name="test" cols="100" rows="10">
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
alskdjfdaslkfjdslkjadsflkdsjfalsdkjfdsalkfjasdlkfjasdflkjadsfkljadsfkljasdflkasdjflksdjf
</textarea>
<input type="submit">
</form>
</body>
</html>
Cut here -----------------------------------------------------------

In order to force yourself to experience the problem, you need to forcefully ignore icmp fragmentation-needed packets. I am able to do that on my home network with a simple iptables rule on my firewall:

iptables -I FORWARD -p icmp --icmp-type fragmentation-needed -j DROP

Now, I browse the lvs-tun-test.php through LVS-Tun, and click submit, and it just hangs and times out. tcpdump shows the expected results. Then I change the MTU on the loopback interface on the realserver (It's a w2k box) using regedit, then disable and re-enable the loopback adapter via the network properties, then click submit again. Poof, it works.

tcpdump is my friend. I started out running tcpdump on the director:

23:13:52.804610 IP (tos 0x0,  ttl 116, id 26413, offset 0, flags [DF],   length: 48)   CIP.60964 > VIP.80: S [tcp sum ok] 3288780265:3288780265(0) win 65535 <mss 1452,nop,nop,sackOK>
23:13:52.810423 IP (tos 0x0,  ttl 116, id 26415, offset 0, flags [DF],   length: 40)   CIP.60964 > VIP.80: . [tcp sum ok] 3288780266:3288780266(0) ack 2303765635 win 65535
23:13:52.813943 IP (tos 0x0,  ttl 116, id 26416, offset 0, flags [DF],   length: 602)  CIP.60964 > VIP.80: P [tcp sum ok] 0:562(562) ack 1 win 65535
23:13:52.820802 IP (tos 0x0,  ttl 116, id 26417, offset 0, flags [DF],   length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 562:2014(1452) ack 1 win 65535
23:13:52.820887 IP (tos 0xc0, ttl  64, id 25185, offset 0, flags [none], length: 576)  VIP >       CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26417, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 562:2014(1452) ack 1 win 65535
23:13:52.827175 IP (tos 0x0,  ttl 116, id 26419, offset 0, flags [DF],   length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 2014:3466(1452) ack 90 win 65446
23:13:52.827251 IP (tos 0xc0, ttl  64, id 25186, offset 0, flags [none], length: 576)  VIP >       CIP: icmp 556: VIP unreachable - need to frag (mtu 1480) for IP (tos 0x0, ttl 116, id 26419, offset 0, flags [DF], length: 1492) CIP.60964 > VIP.80: . 2014:3466(1452) ack 90 win 65446
23:13:52.833420 IP (tos 0x0,  ttl 116, id 26420, offset 0, flags [DF],   length: 1492) CIP.60964 > VIP.80: . [tcp sum ok] 3466:4918(1452) ack 90 win 65446

The sequence of a TCP [DF] packet CIP->VIP (packet length 1492, too big) followed by IPVS's ICMP response repeats until the request eventually times out. This message is logged every time one of the ICMP responses is sent:

IPVS: ip_vs_tunnel_xmit(): frag needed

The problem comes when the ICMP host-unreachable (change MTU) packets are ignored/dropped and not acted-upon by the client. This is a more common situation than I thought would be the case.

A few hours of debugging later, I realized that the SYN+ACK packet, the response from the real server to continue the connection handshake, is missing. Duh. I moved my tcpdumping to a tap in the network that I knew would get all of the traffic. The SYN+ACK packet establishes the MSS (max segment size -- the data segment size for the packets for this connection) to 1452, just as the client machine requests (the first packet in the earlier trace).

Duh! I had read all the stuff on the URL above, and the posting by Chris Paul comes closest to describing the solution:

In reality, it's not "the end of the IP tunnel that initiates the tunnel" because the tunnel interface on the w2k box doesn't initiate anything -- it only receives forwarded traffic from the director. What he really means is "the interface on the real server that is handshaking the TCP connection with the client." The goal is to get the client to send smaller packets so that they'll make it on to the realserver.

CLIENT sends SYN to DIRECTOR
DIRECTOR encapsulates SYN packet in IPIP tunnel; sends to REALSERVER
REALSERVER receives SYN packet on LOOPBACK interface
REALSERVER sends SYNACK to CLIENT from LOOPBACK interface w/ MSS=1452
CLIENT sends ACK to DIRECTOR, on to REALSERVER
REALSERVER responds to CLIENT from LOOPBACK
repeat until dead

So, we have to change that MSS that gets sent back from realserver to client. That is, set the MTU on the loopback interface on the w2k box. The solution is to do exactly what Chris Paul said, except change from:

hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of ip tunnel}

to:

hklm\system\currentcontrolset\services\tcpip\parameters\interfaces\{guid of MS Loopback Adapter}

After all, if you set an MTU in the IP tunnel interface this way, it won't be there after you reboot, I've found. Oh, and 1480 is the magic number. 1400 is safe, but 1480 works. Any higher than that, and it doesn't work as desired.

So I went to investigate how to do the same thing on my Linux real servers, only to find that the tunl0 interface, which is the connection endpoint for Linux realservers, already has an MTU of 1480. I don't know when that got fixed, but I guess I won't worry about it.

(later) I was wrong; here's the fix for Linux realservers:

iptables -A OUTPUT -s VIRTUAL-IP -p tcp -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j TCPMSS --set-mss 1440

Tested, tcpdumped, works. Now I have no more 'IPVS: ip_vs_tunnel_xmit(): frag needed' messages. (At least for now. We'll see if I'm wrong tomorrow.)
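The value 1440 is consistent with simple header arithmetic. The sketch below is a worked example only, assuming 20 byte IPv4 and TCP headers (TCP options come out of the MSS, so they do not add to the packet length) and a 1500 byte path between director and realserver; the function names are invented for illustration.

```shell
#!/bin/sh
# Header sizes in bytes (assumption: no IP options).
IP_HEADER=20
TCP_HEADER=20
IPIP_OVERHEAD=20

# Largest client packet (CIP->VIP) for an advertised MSS of $1.
packet_len_for_mss() {
    echo $(( $1 + IP_HEADER + TCP_HEADER ))
}

# Size of that packet after the director wraps it in ipip (DIP->RIP).
encapsulated_len_for_mss() {
    echo $(( $1 + IP_HEADER + TCP_HEADER + IPIP_OVERHEAD ))
}

# fits_tunnel MSS PATH_MTU: succeeds if a full-sized segment still fits
# on the director->realserver path after encapsulation.
fits_tunnel() {
    [ "$(encapsulated_len_for_mss "$1")" -le "$2" ]
}

# Example:
#   packet_len_for_mss 1452         (the MSS seen in the trace)
#   encapsulated_len_for_mss 1440   (the MSS set by the iptables rule)
```

An MSS of 1452 gives 1492 byte client packets, which is the length that triggers the "need to frag (mtu 1480)" replies in the trace. An MSS of 1440 gives 1480 byte packets, which are exactly 1500 bytes once encapsulated.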

Chris Paul, 11 Mar 2005

Isn't this fixed in Kernel 2.6 anyway?

Casey Zacek cz (at) neospire (dot) net

I really don't think it's possible to fix this on the director (and my directors are running 2.6.11 anyway -- and it's not fixed there). The closest way I could think of was to ignore the DF flag in the incoming TCP packets and just fragment them anyway.

Casey Zacek cz (at) neospire (dot) net 2005/04/12

It's not fixed in 2.6; I still need the iptables rule to set the mss

# iptables -A OUTPUT -s VIRTUAL-IP -p tcp -m tcp --tcp-flags SYN,RST,ACK SYN,ACK -j TCPMSS --set-mss 1440
Note
Joe: we don't know why this works

Julian Feb 2007

Huh, I don't know why. Maybe there is such a limit somewhere in the path from the RS to the client. The path from RS to client does not differ between realservers in DR and TUN mode: both send a normal reply from VIP to CIP, and no IPIP is involved there. Maybe the problem is a CHOP that cannot route ICMP to the VIP properly.

jarol1@seznam.cz J (dot) Libak (at) sh (dot) cvut (dot) cz 07 Dec 2006

Today I ran into an MTU problem with LVS-Tun. Small packets were forwarded to the realservers without problems, but the bigger ones weren't, and TCP retransmissions occurred. I noticed the problem disappeared when I switched to LVS-DR, which gave me a hint as to where the problem might be. MTU 1480 had to be set on the outgoing interface of the realservers, with tunl0 keeping the standard 1500. Directors have 1500 on all interfaces. This way the TCP SYN-ACK carried an MSS derived from the correct MTU, and the client no longer sent big packets that were discarded on the director. The outer IP header added by the tunnel is 20 bytes long, so 1480 is the maximum value that works.
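The numbers quoted in this thread (1480 for the MTU, 1440 for the MSS) follow from simple header arithmetic. Here is a minimal sketch of that arithmetic, assuming IPv4 with no IP or TCP options; the function and variable names are illustrative and do not come from the original posts.

```shell
#!/bin/sh
# Sketch: derive the tunnel MTU and the TCP MSS to advertise for LVS-Tun.
# Assumptions (not from the original posts): IPv4, no IP or TCP options.

IPIP_OVERHEAD=20   # outer IPv4 header added by the director's encapsulation
IP_HEADER=20       # inner IPv4 header (no options)
TCP_HEADER=20      # TCP header (no options)

# tunnel_mtu LINK_MTU: largest inner packet that still fits after encapsulation
tunnel_mtu() {
    echo $(( $1 - IPIP_OVERHEAD ))
}

# tunnel_mss LINK_MTU: largest TCP payload the client should send
tunnel_mss() {
    echo $(( $(tunnel_mtu "$1") - IP_HEADER - TCP_HEADER ))
}

echo "tunnel MTU: $(tunnel_mtu 1500)"   # prints "tunnel MTU: 1480"
echo "MSS:        $(tunnel_mss 1500)"   # prints "MSS:        1440"
```

With a 1500 byte link this gives the 1480 that works as an MTU and the 1440 used in the TCPMSS rule above. A smaller path MTU between director and realserver would need correspondingly smaller values.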

Casey Zacek cz (at) neospire (dot) net 31 Aug 2007

For some reason that I cannot remember, I have switched off of this iptables method in favor of using some advanced routing to take care of the MSS setting.

(Joe: Ratz says that the MTU should be set for the route and not for the device, since not all routes/packet types to/from a device need an altered MTU.)

I wish I had shared this with the group when I started it, because I can't remember why I'm doing it this way now. Still on the real servers, I use routing like so:

This assumes the VIP is in a class C network

 ip route flush table 42
 ip route add table 42 to VIP_NETWORK/24 dev eth0 advmss 1440
 ip route add table 42 to default via VIP_NETWORK_GATEWAY advmss 1440
 ip rule add from VIP table 42 priority 42
 ip route flush cache

So, for example, say VIP is 10.2.2.38, VIP_NETWORK is 10.2.2.0, and VIP_NETWORK_GATEWAY is 10.2.2.1 (probably):

 ip route flush table 42
 ip route add table 42 to 10.2.2.0/24 dev eth0 advmss 1440
 ip route add table 42 to default via 10.2.2.1 advmss 1440
 ip rule add from 10.2.2.38 table 42 priority 42
 ip route flush cache

The number 42 is just a number I chose when I started this.
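The same five commands can be generated for any VIP. This is a sketch only: the function name advmss_rules and its argument order are made up for illustration and are not part of iproute2. It prints the commands rather than running them, so they can be reviewed first, and the defaults (eth0, 1440, table 42) are taken from the example above.

```shell
#!/bin/sh
# Sketch: print (not run) the policy-routing commands for one VIP.
# Dry run by design: review the output, then pipe it to "sh" as root.
# The function name and its arguments are illustrative, not part of iproute2.

# advmss_rules VIP NETWORK/PREFIX GATEWAY [DEV] [MSS] [TABLE]
advmss_rules() {
    vip=$1 net=$2 gw=$3 dev=${4:-eth0} mss=${5:-1440} table=${6:-42}
    cat <<EOF
ip route flush table $table
ip route add table $table to $net dev $dev advmss $mss
ip route add table $table to default via $gw advmss $mss
ip rule add from $vip table $table priority $table
ip route flush cache
EOF
}

# Example values from the text above
advmss_rules 10.2.2.38 10.2.2.0/24 10.2.2.1
```

Printing instead of executing keeps the sketch harmless on a machine that is not a realserver. As in the original, the table number doubles as the rule priority.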

Sameer Garg sameer (dot) garg (at) gmail (dot) com 6 Sep 2007

By trial and error I was able to find a workaround for this:

On the director I did the following
# ip route add REAL_SERVER_IP via DIRECTOR_GATEWAY dev eth0 advmss 1400

On the Real Server
# ip route change default via REAL_SERVER_GATEWAY dev eth0 advmss 1400

I am still not sure why I need to make the change on the director, because technically, during the three-way handshake, the real server should tell the client that the MSS is 1400. I have tried it without making the change on the director, but it doesn't work.

8.16. Setting the MTU by route

Note
This works on the realserver, but not on the director. We don't know why it doesn't work on the director and we're not really sure why it works on the realserver either.

With Casey having a suitable test setup, we asked him to test setting the MTU by route using Julian's suggestion of

director# ip route add RIP via RHOP dev DEV src DIP mtu 1440

Casey Zacek cz (at) neospire (dot) net 14 Feb 2007

Nope. Doesn't work. Here's tcpdump running on the realserver showing the first packet back to the client, which negotiates the MSS for the connection.

21:52:37.819770 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], length: 48) \
   VIP.80 > ENDUSER.1276: S [tcp sum ok] 2051800163:2051800163(0) \
   ack 1809535240 win 5840 <mss 1460,nop,nop,sackOK>

That "mss 1460" needs to be "mss 1440". That's the secret magic key to the universe.
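When repeating this test, the MSS can be pulled out of the tcpdump text automatically. This is a rough sketch, assuming tcpdump prints the option as "mss NNNN" as in the capture above; the helper names mss_of and mss_ok are hypothetical, and the 1440 limit is the value from this thread.

```shell
#!/bin/sh
# Sketch: pull the advertised MSS out of tcpdump text output and compare it
# with the value the tunnel can carry. Helper names are illustrative.

# mss_of LINE: print the number following "mss" in a tcpdump options field
mss_of() {
    printf '%s\n' "$1" | sed -n 's/.*mss \([0-9][0-9]*\).*/\1/p'
}

# mss_ok LINE [LIMIT]: succeed if an MSS was found and is at most LIMIT
mss_ok() {
    mss=$(mss_of "$1")
    [ -n "$mss" ] && [ "$mss" -le "${2:-1440}" ]
}

# Last line of the SYN-ACK shown above
line='ack 1809535240 win 5840 <mss 1460,nop,nop,sackOK>'
if mss_ok "$line"; then
    echo "MSS $(mss_of "$line") fits the tunnel"
else
    echo "MSS $(mss_of "$line") is too large for the tunnel"
fi
```

For the capture above this reports that 1460 is too large. After a working fix, the same check on a fresh SYN-ACK should find 1440 or less.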

I got some of these when I blocked icmp-type fragmentation-needed to my workstation, with logging:

IN=eth0 OUT= MAC=00:18:8b:74:d1:98:00:06:5b:3a:9f:0b:08:00 \
   SRC=66.111.105.216 DST=10.3.3.10 LEN=576 TOS=0x00 PREC=0x00 TTL=62 ID=32755 \
   PROTO=ICMP TYPE=3 CODE=4 [SRC=10.3.3.10 DST=66.111.105.216 LEN=1500 TOS=0x00 \
   PREC=0x00 TTL=62 ID=42460 DF PROTO=TCP SPT=45445 DPT=80 WINDOW=114 RES=0x00 ACK URGP=0 ] MTU=1420

And my page request just waited and waited (Firefox 2.0). When I flushed the icmp-type fragmentation-needed DROP rules and submitted the page again, it went through instantly. I also tried with

director# ip route add RIP via RHOP dev DEV src DIP mtu lock 1440
                                                        ^^^^

This also did not work.

Julian

To tell if this is a PMTU problem (rather than us not having figured out the correct ip route command), one should check every step with tcpdump on all boxes, for both ICMP and TCP.

Now, I can make it work if I do this on the real server:

RS# ip route add table 42 to default via DEFAULTGW advmss 1440
RS# ip rule add from VIP to default table 42 priority 42

So, at least it doesn't require iptables. Also, this doesn't cover any client machine that is not reached via the default route. Instead you'd need something more like this:

RS# ip route add table 42 to LOCALNET/xx dev LOCALDEV advmss 1440
RS# ip route add table 42 to default via DEFAULTGW advmss 1440
 .. more entries for any static or other routes ..
RS# ip rule add from VIP table 42 priority 42

In most cases, though, these two routes and one rule will cover it. I think I prefer using iproute to using iptables, as iptables tends to be more volatile in my environments.

8.17. rewriting, re-mapping, translating ports with LVS-Tun

see Re-mapping ports in LVS-DR with iptables