29. LVS: Running a firewall on the director: Interaction between LVS and netfilter (iptables).

Note

May 2004: This chapter has been rewritten. Before the arrival of the Antefacto patches, it was not possible to run arbitrary iptables rules for ip_vs controlled packets on a director. Hence you couldn't run a firewall on the director and we told people to put their firewall on a separate box. Julian then took over writing the code and now it is possible to run a firewall on the director. The code is now called the ipvs netfilter connection tracking module, ipvs_nfct, and is still beta, so keep us informed of how it works. In previous writeups, I misunderstood how the code worked and made some incorrect statements. Hopefully this rewrite fixes the misinformation I propagated.

For one of many introductions to netfilter see The netfilter framework in Linux 2.4 (http://gnumonks.org/~laforge/presentations/netfilter-lk2000/netfilter.ps.gz).

According to Ratz (18 Apr 2006), NFCT causes a 20% throughput drop on a GbE inbound service.

29.1. Start with no filter rules

Although this chapter is about applying iptables rules to directors, be aware that you don't need filter rules to set up an LVS. Misconfiguring the filter rules may cause strange effects. Make sure for testing that you can turn your filter rules on and off. Here's a cautionary tale.

Sebastiaan Tesink maillist-lvs (at) virtualconcepts (dot) nl 14 Jul 2006

On one of our clusters we have problems with ipvs at the moment. Our cluster is built with 2 front-end failover ipvs-nodes (managed with ldirectord) and 3 Apache back-end nodes, handling both http and https. So all the traffic on a virtual ip on port 80 or 443 of the front-end servers is redirected to the backend webservers.

Two weeks ago, we were running a 2.6.8-2-686-smp Debian stable kernel, containing ipvs 1.2.0. We experienced weekly (6 to 8 days) server crashes, which caused the machines to hang completely without any log-information whatsoever. These crashes seemed to be related to IPVS, since all our servers have the exact same configuration, except for the additional ipvs-modules on the front-end servers. Additionally, the same Dell SC1425 servers are used for all servers.

For this reason we upgraded our kernel to 2.6.16-2-686-smp (containing ipvs 1.2.1) on Debian stable, which we installed from backports (http://www.backports.org). There aren't any crashes on these machines anymore. However, there are two strange things we noticed since this upgrade. First of all, the number of active connections has increased dramatically, from 1,200 with a 2.6.8-2-686-smp kernel, to well over 30,000 with the new kernel. We are handling the same amount of traffic.

# ipvsadm -L
IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  XXX.net wlc persistent 120
  -> apache1:https                  Route   10     2          0
  -> apache2:https                  Route   10     25         0
  -> apache3:https                  Route   10     14         0
TCP  XXX.net wlc persistent 120
  -> apache1:www                    Route   10     10928      13
  -> apache2:www                    Route   10     11433      6
  -> apache3:www                    Route   10     11764      10

We are using the following IPVS modules: ip_vs ip_vs_rr ip_vs_wlc

Secondly, Internet Explorer users are experiencing problems exactly since the upgrade to the new ipvs version. With Internet Explorer, an enormous number of tcp connections is opened when visiting a website. Users are experiencing high loads on their local machines, and crashing Internet Explorers. With any version of Firefox this works fine, by the way. Nevertheless, this started exactly since our IPVS upgrade.

Note
IE/IIS breaks tcp/ip rules to make loading fast (see What makes IE so fast, http://grotto11.com/blog/slash.html?+1039831658).

Sebastiaan Tesink sebas (at) virtualconcepts (dot) nl 19 Jun 2007

The solution to this problem was relatively easy, but we only discovered it recently. Basically, the problem was caused by the firewall, which contained "state checks". We used to have the following iptables rule:

iptables -A INPUT -m state --state NEW -p tcp --dport 80 -j ACCEPT

While using IP_VS, this caused connections to be denied by the firewall. Therefore we changed this to:

iptables -A INPUT -p tcp --dport 80 -j ACCEPT

which leads to a more balanced view on the number of active versus inactive connections in the load balancer. Hopefully this is some useful information to get your documentation even better.

29.2. Introduction

For 2.4.x kernels (and beyond), LVS was rewritten as a netfilter module, rather than as a piece of stand-alone kernel code. Despite initial expectations by Rusty Russell that LVS could be written as a loadable netfilter module, it turned out not to be possible to write LVS completely within the netfilter framework. As well, there was a minor performance penalty (presumably in latency) for LVS as a netfilter module compared to the original version. This penalty has mostly gone with rewrites of the code.

The problem was in connection tracking, which among other things allows the machine to determine whether a packet belongs to a RELATED or ESTABLISHED connection. Connection tracking also helps with multiport protocols like FTP under NAT. The ip_vs controlled packets take a different path through the routing code than non-LVS packets do. The netfilter connection tracking doesn't know about the ip_vs controlled packets, and even if it did, netfilter conntrack was considered too slow to use for LVS.

For LVS-DR and LVS-Tun, where the reply packets do not go through the director, netfilter is not able to do connection tracking on these packets at all. The Antefacto patches were written to allow connection tracking of ip_vs controlled packets for LVS-NAT; connection tracking for ip_vs with LVS-DR or LVS-Tun was not attempted. The ipvs_nfct code now allows conntrack for LVS-DR and LVS-Tun.

Julian

You can (and always have been able to) use firewall rules that match by device, proto, port or IP, without using ipvs netfilter connection tracking module, ipvs_nfct.

Julian 16 Mar 2007

The NFCT patch is not a way to use iptables NAT rules, it just provides iptables -m state support for IPVS packets.

snat_reroute is only for IPVS packets. I just added some information in HOWTO.txt (http://www.ssi.bg/~ja/nfct/HOWTO.txt). SNAT: translate source address. Reroute: call output routing for 2nd time (saddr=VIP), first was the normal input routing for saddr=RIP.

29.3. Netfilter hooks and LVS: the path/route of an ip_vs controlled packet

Horms horms (at) verge (dot) net (dot) au 19 May 2004

Here is my understanding of the way that the netfilter hooks and LVS fit together.

Figure 1. Interaction of LVS with Netfilter

interactions between netfilter and LVS.

The location of LVS hooks into the netfilter framework on the director.

Packets travel from left to right. A packet coming from the client enters on the left and exits on the right heading for the realserver. A reply from the realserver (in the case of LVS-NAT) enters on the left and exits on the right heading for the client. For normal LVS-DR and LVS-Tun operation (see the martian modification), reply packets do not go through the director.

  • for incoming packets the path is:
    PREROUTING -> LOCAL_IN -> POSTROUTING
    
  • for outgoing packets (only LVS-NAT):
    PREROUTING -> FORWARD -> POSTROUTING
    
  • for incoming ICMP:
    PREROUTING -> FORWARD -> POSTROUTING
    

When the director receives a packet, it goes through PREROUTING, where routing decides that the packet is local (usually because of the presence of the VIP on a local interface). The packet is then sent to LOCAL_IN, the hook where LVS attaches for inbound packets. Modules register at a hook with a priority number; the lowest number gets to look at the packets first. LVS registers with a higher priority number than the iptables filter rules, so iptables gets the packet first and then LVS.

Mike McLean mikem (at) redhat (dot) com 21 Oct 2002

On the director, filter rules intercept packets before ip_vs sees them, otherwise firewall mark (fwmark) would not work

If LVS gets the packet and decides to forward it to a realserver, the packet then magically ends up in the POSTROUTING chain.

LVS does not look for ingress packets in the FORWARD chain. The only time that the FORWARD chain comes into play with LVS is for return packets from realservers when LVS-NAT is in use. This is where the packets get unNATed. Again LVS gets the packets after any iptables FORWARDing rules.

ip_vs_in attaches to the LOCAL_IN hook. For a packet to arrive at LOCAL_IN, the dst_addr has to be an IP on a local interface (any interface e.g. eth0). The result of this requirement (that dst_addr is an IP on a local interface) is that you still need the VIP on the director, when accepting packets for LVS by VIP-less methods like fwmark or transparent proxy. (It would be nice to remove the requirement for this otherwise non-functional VIP.)

There are ways around the requirement for a local IP, but they may create other problems as well.

  • move ip_vs_in from the LOCAL_IN hook to the PREROUTING hook. I tried that briefly once and it seemed to work.
  • play with routing rules to deliver the packet locally, e.g. VIP-less routing. Matt wrote:

    http://marc.theaimsgroup.com/?l=linux-virtual-server&m=104265930914084&w=2
    
    ip rule add prio 100 fwmark 1 table 100
    ip route add local 0/0 dev lo table 100
    

    or perhaps

    VIP=10.0.0.1/32
    ip rule add prio 100 to $VIP table 100
    ip route add local 0/0 dev lo table 100
    

    But I haven't tested either much. There is an oblique reference to this on http://www.linuxvirtualserver.org/docs/arp.html.

If packets for the VIP (on say eth0) started arriving on another interface (say eth1) due to dynamic routing, then LVS wouldn't care - the packet still arrives at LOCAL_IN. rp_filter would probably need to be disabled, and perhaps a few other routing tweaks made, but fundamentally if the director can route the traffic (i.e. get the packets), then LVS can load balance it.
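
A minimal sketch of the rp_filter tweak mentioned above, assuming the packets for the VIP start arriving on eth1 (the interface name is an assumption):

# turn off reverse path filtering so packets for the VIP arriving on an
# unexpected interface aren't silently dropped
echo 0 > /proc/sys/net/ipv4/conf/all/rp_filter
echo 0 > /proc/sys/net/ipv4/conf/eth1/rp_filter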

29.4. How to filter with netfilter

Note
netfilter has several families of rules, e.g. NAT and filter. The filter rules filter, but do not alter, packets. Not all iptables commands are filter rules.

For some background on filtering with netfilter, see the Linux netfilter Hacking HOWTO (http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html). Although iptables rules can be attached at any hook, the filter rules are only applied at LOCAL_IN, FORWARD and LOCAL_OUT (see Packet Selection: IP Tables http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO-3.html#ss3.2).

Horms: LOCAL_IN and LOCAL_OUT in the kernel correspond, more or less, to INPUT and OUTPUT in iptables.

Since any packet traverses only one of these three chains, there is one and only one place to filter it. This is a change from ipchains, where a packet could be filtered in several chains. You can't filter in the other chains even if you want to.
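
As a sketch of what this means in practice (the addresses and ports here are only examples), each rule goes in the one chain that its packets actually traverse:

# locally delivered packets (including those accepted for ip_vs) - INPUT only
iptables -A INPUT -p tcp --dport 80 -j ACCEPT
# packets routed through the box (e.g. LVS-NAT replies from the realservers) - FORWARD only
iptables -A FORWARD -p tcp --sport 80 -j ACCEPT
# packets generated by the director itself - OUTPUT only
iptables -A OUTPUT -p udp --dport 53 -j ACCEPT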

Julian

by design you can filter only in these chains because iptable_filter registers only there:

+ Netfilter hooks
  + LOCAL_IN Tables with different priority
    + filter
      + Chains
        - INPUT (and other chains used with -j CHAIN_NAME)
    + mangle
      + Chains
        - INPUT

we have hooks (places in the kernel stack) where each table module attaches its chains with rules for packet matching. It is a tree with multiple levels of lists. Filtering never worked in the other chains. Table "filter" has only chains INPUT, FORWARD and OUTPUT, each used in the corresponding hook.

Horms

You can see how many packets and bytes a rule is affecting by running

iptables -t nat -L -v -n
iptables -L -v -n

Here's a diagram from Linux Firewalls using iptables (http://www.linuxhomenetworking.com/wiki/index.php/Quick_HOWTO_:_Ch14_:_Linux_Firewalls_Using_iptables), with LOCAL_IN relabelled as INPUT etc.

Figure 2. Diagram from "Linux Firewalls Using iptables"


Diagram from Linux Firewalls using iptables

29.5. ipvs_nfct, netfilter connection tracking for ipvs

For some information on netfilter connection tracking (nfct) see Linux netfilter Hacking HOWTO (http://www.netfilter.org/documentation/HOWTO/netfilter-hacking-HOWTO.html).

May 2004: Because ipvs changes the path of LVS controlled packets, netfilter is not able to connection track them. For LVS-NAT (and for LVS-DR/Tun when using the forward_shared flag for the martian modification), packets in both directions go through the director, so it is possible to write replacement conntrack code (e.g. Antefacto patches).

For LVS-DR and LVS-Tun, the return packets go directly from the realserver to the client and do not go through the director. You can infer the state of the connection using the same mechanism (timeouts) by which the LVS-DR/LVS-Tun director has always made decisions on the state of the connection (connections are listed as ActiveConn and InActConn by ipvsadm). Here ip_vs assumes that connections are set up and terminated in a normal manner. The first implementations of ipvs used the standard ipv4 timeouts to declare a state transition. More recent implementations allow for a private set of timeout values for the ip_vs controlled connections (see tcp timeouts and the private copies of these timeouts for ipvs).
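
You can inspect and change these ipvs timeouts with ipvsadm; a sketch (the values shown are just examples, not recommendations):

ipvsadm -L --timeout          # show the current tcp/tcpfin/udp timeouts
ipvsadm --set 900 120 300     # set the tcp, tcpfin and udp timeouts (seconds)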

Joe - 29 Jan 2003

There were some incompatibilities between LVS and netfilter when running a director and firewall on the same box with 2.4.x?

Julian

Yes, LVS and Netfilter use their own (separate) connection tracking implementations. The situation hasn't changed since I explained it in January 2002. If we are going to fix this, then changes in Netfilter are required too, mostly in the routing usage. LVS has some requirements for the connection state which are not present in netfilter. I don't think it is good to move LVS to Netfilter conntracking. And I still don't have enough time to think about such big changes for LVS.

May 2004: Julian has written the ipvs_nfct module, which among other things, fakes the connection tracking for LVS-DR and LVS-Tun.

Julian's ipvs_nfct code for LVS for 2.4 and 2.6 kernels is on his Netfilter Connection Tracking Support page (http://www.ssi.bg/~ja/nfct/). This page is reached from Julian's software page (http://www.ssi.bg/~ja/) through "Linux IPVS tools and extensions: IPVS Netfilter connection tracking support". Because of the small demand for this functionality, this code is one of Julian's lower priority projects.

Julian's HOWTO that comes with his patches states that the return packets have to go through the director. In fact the patch works for all LVS forwarding methods, but really is only useful for LVS's in which the replies return through the director.

Julian

I've uploaded the 2.6 version of the NFCT patches, but they aren't tested. ipvs_nfct matches conntrack state for IPVS connections, e.g. NEW, ESTABLISHED, RELATED.

Note
Julian doesn't have a setup to test most of his code. Any untested code from Julian that I've tried has worked first time.

The only problem is with netfilter's non-official tcp window tracking patch (which I haven't tested), which (I suspect) only works for NAT. Maybe some of the checks done in the tcp window tracking patch require a bidirectional stream, so possibly it doesn't work for the unidirectional LVS-DR or LVS-TUN connections, i.e. when the director doesn't see the reply packets. In all other cases (LVS-NAT or bidirectional DR/TUN with forward_shared=1) it should work.

For LVS-NAT, ip_vs forwards packets in both directions in the director. For LVS-DR and LVS-TUN the replies are visible to the Netfilter firewall on the director if it is the default gw for the realservers (for LVS-DR this requires the julians_martian_modification or forward_shared patch). Even though these reply packets are not part of the ipvs stream, they are forwarded, since forward_shared=1. For the Netfilter firewall, there is no difference between the forwarding methods. In all cases the incoming traffic is handled in LOCAL_IN and the replies in FORWARD.

IPVS always knows the conn state (NEW/RELATED/ESTABLISHED), it is simply exported to the netfilter conntracking.

The patch works for cases when the replies don't go through the director, but in this case it is not very useful. The main purpose of the patch is to match reply packets. For the request packets, the conntrack entry is confirmed, which can speedup the packet handling (I hope). IPVS without NFCT drops the conntrack entry for each packet and allocates the conntrack entry again for the next packet.

With ipvs_nfct, each skb comes with skb->nfct attached. ipvs-nfct preserves this NFCT struct, while the default IPVS drops it on skb free.

Joe

Your NFCT HOWTO with date 10 Apr 2004 (but internal date Sep 2003) says that NFCT works for LVS-NAT and forward shared LVS-DR etc, but doesn't say anything about LVS-DR, LVS-Tun

non-NAT methods if forward_shared flag is used

I assume NFCT provides perfect conntrack for LVS when the replies go through the director and hence you can use all iptables commands (e.g. ESTABLISHED). I assume RELATED will need helpers.

Yes, I'm just not sure for FTP for DR/TUN because ip_vs_ftp is not used (for LVS-DR, ftp requires persistence). IIRC, without creating NF expectations we can not expect to match FTP-DATA as RELATED.

For LVS-DR etc, your reply above seems to indicate that NFCT provides conntrack for LVS-DR too, however your NFCT HOWTO doesn't mention it.

Yes, it does not mention any restrictions, but the source has '- support for all forwarding methods, not only NAT'. It is still beta software. As yet there hasn't been a lot of interest in the code.

Is it OK to have a firewall on the director yet?

Yes, for simple ipchains-like rules. I can't be sure about more advanced filtering rules that play with conntrack-specific data, but if you use only device names, IPs, protos and ports it should work. ipvs works perfectly with netfilter as far as iptables filter rules are concerned.
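
A sketch of the sort of rules Julian means - matching only on device, IP, protocol and port, with no conntrack state (the VIP and device name are assumptions):

iptables -A INPUT -i eth0 -p tcp -d 192.168.1.100 --dport 80 -j ACCEPT
iptables -A INPUT -i eth0 -p tcp -d 192.168.1.100 --dport 443 -j ACCEPT
iptables -A INPUT -i eth0 -d 192.168.1.100 -j DROP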

can the LVS director now (with ipvs_nfct) have any iptables command run on it, as if it were a regular linux router? Can I now expect to run any iptables command on an ipvs virtual service stream and have it work like on a normal linux box with a normal tcp/udp stream?

Maybe not, because the IPVS packets do not take the same path through the network stack as non-IPVS packets. I can not guarantee complete compatibility, e.g. you can do very wrong things using some NAT rules, for example DNAT. ipvs_nfct, when added to ipvs, gives you the ability to use -m state and nothing more. Now we can use -m state; with plain IPVS you can not.

what doesn't work with LVS-DR/Tun?

We support -m state for DR and TUN too. The only thing that doesn't work for DR and TUN is FTP-DATA.

Stephane Klein 26 Aug 2004

I installed your ipvs-nfct-2.4.26-1.diff patch, I enabled the CONFIG_IP_VS_NFCT and recompiled the kernel. Here are my rules to enable http service:

$IPTABLES -A INPUT -i eth1 -p tcp -m multiport -d $VIP --destination-port 80,21 -m state --state NEW -j RULE_2
$IPTABLES -A RULE_2 -j LOG --log-level info --log-prefix "RULE 2 --> ACCEPT"
$IPTABLES -A RULE_2 -j ACCEPT
$IPTABLES -A FORWARD -p tcp -m state --state RELATED,ESTABLISHED -j ACCEPT

Julian

All out->in traffic passes INPUT (not FORWARD as in netfilter), so you can not allow only NEW packets there. FORWARD is passed only for in->out traffic for NAT. Some iptables examples can be found on the NFCT page (http://www.ssi.bg/~ja/nfct/). You can also read about the netfilter hooks LVS uses here: my LVS page (http://www.ssi.bg/~ja/LVS.txt).
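
A sketch of Stephane's rules rearranged along the lines Julian describes (reusing his $IPTABLES and $VIP variables; this is one possible reading, not Julian's own ruleset): allow the whole out->in stream in INPUT, and leave FORWARD for the NAT replies:

$IPTABLES -A INPUT -i eth1 -p tcp -m multiport -d $VIP --destination-port 80,21 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
$IPTABLES -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT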

"Vince W." listacct1 (at) lvwnet (dot) com 12 Feb 2005

How I successfully compiled 2.6.9 and 2.6.10 FC3 kernels with the ip_vs_nfct patch and GCC 3.4.2

As a followup of sorts to my previous posts, "Error building 2.6.10 kernel with ip_vs_nfct patch - does anyone else get this?", I figured out what the problem was, and with some advice from Julian Anastasov, was successful getting kernels compiled with the ip_vs_nfct patch. This was the error message I would get previously, at the modules stage of the kernel build:

 CC [M]  net/ipv4/ipvs/ip_vs_proto_ah.o
 CC [M]  net/ipv4/ipvs/ip_vs_nfct.o

net/ipv4/ipvs/ip_vs_nfct.c: In function `ip_vs_nfct_conn_drop':
include/linux/netfilter_ipv4/ip_conntrack.h:248: sorry, unimplemented: 
inlining failed in call to 'ip_conntrack_put': function body not available
net/ipv4/ipvs/ip_vs_nfct.c:385: sorry, unimplemented: called from here
make[3]: *** [net/ipv4/ipvs/ip_vs_nfct.o] Error 1
make[2]: *** [net/ipv4/ipvs] Error 2

The system in question is Fedora Core 3, which sports version 3.4.2 of the GNU C compiler (and everything in the release is built with it). The kernel source I was using is the kernel-2.6.10-1.741_FC3.src.rpm. I had added the ip_vs_nfct and nat patches to the kernel build spec file, inserted the "CONFIG_IP_VS_NFCT=y" kernel config option line between the "CONFIG_IP_VS_FTP=m" and "CONFIG_IPV6=m" lines of each kernel arch/type *.config file, and built the kernel.

As it turns out, others have seen problems compiling code which contain external inline functions with GCC 3.4.2, not just people trying to use ip_vs_nfct. I found one such instance where the user documented that by removing "inline" from the function declaration, they were able to compile successfully.

Since ip_conntrack_put is declared and exported as an inline-type function in include/linux/netfilter_ipv4/ip_conntrack.h, this seems to cause a problem for ip_vs_nfct making use of the function. I asked Julian what he thought about this idea, and he suggested that "inline" may need to be removed from ip_conntrack_put's declaration in net/ipv4/netfilter/ip_conntrack_core.c also, since this is where the function is exported. Armed with this idea, I modified both ip_conntrack.h and ip_conntrack_core.c to remove "inline" from the function, and created a patch which I then added to the .spec and kernel build.

The kernel compiled successfully and ran. My firewall script worked, and ip_vs_nfct did its job. Unfortunately, I discovered several hours of uptime later, when the box kernel panicked, that there is a known spinlock problem in the 2.6.10 kernel - somewhere in the filesystem/block device drivers code. I say known because comments exist in later iterations of the 2.6.10 kernel spec changelog which indicate that steps were taken to increase the verbosity of output when this specific kernel panic occurs. I do not know if it is an issue with the upstream 2.6.10 sources or not, but at least I knew it wasn't because of ip_vs_nfct.

At any rate, I have been successful building 2.6.9 FC3 kernels with Julian's 2.6.9 ip_vs_nfct patches, and not seeing the spinlock "not syncing" kernel panics I saw building with any Fedora Core 3 2.6.10 kernel .src.rpm sources I tried building with. It's been up for 3 days now on the box running this kernel, and it is functioning exactly as desired. ...which also means it's gonna be time to update my keepalived Stateful Firewall/LVS Director HOW-TO document from 2003 (http://www.lvwnet.com/vince/linux/Keepalived-LVS-NAT-Director-ProxyArp-Firewall-HOWTO.html) soon...

Specifically, I used the 2.6.9-1.681_FC3 .src.rpm file, including Julian's two patches (ip_vs_nfct and also the nat patch) and the one shown below to remove "inline" from the ip_conntrack_put function.

If anyone else is interested in building kernels using Julian's patches and you have GCC 3.4.2 (or newer, I'm sure...) you may be interested in this small patch to remove "inline" from ip_conntrack.h and ip_conntrack_core.c. I'll attach it to this post, and also post the text of it here:

diff -urN ../linux-2.6.10/include/linux/netfilter_ipv4/ip_conntrack.h ./include/linux/netfilter_ipv4/ip_conntrack.h
--- ../linux-2.6.10/include/linux/netfilter_ipv4/ip_conntrack.h	2004-12-24 16:35:28.000000000 -0500
+++ ./include/linux/netfilter_ipv4/ip_conntrack.h	2005-02-07 06:48:57.260570933 -0500
@@ -245,7 +245,9 @@
 }
 
 /* decrement reference count on a conntrack */
-extern inline void ip_conntrack_put(struct ip_conntrack *ct);
+/* vince: try this without inline:
+extern inline void ip_conntrack_put(struct ip_conntrack *ct); */
+extern void ip_conntrack_put(struct ip_conntrack *ct);
 
 /* find unconfirmed expectation based on tuple */
 struct ip_conntrack_expect *
diff -urN ../linux-2.6.10/net/ipv4/netfilter/ip_conntrack_core.c ./net/ipv4/netfilter/ip_conntrack_core.c
--- ../linux-2.6.10/net/ipv4/netfilter/ip_conntrack_core.c	2004-12-24 16:33:47.000000000 -0500
+++ ./net/ipv4/netfilter/ip_conntrack_core.c	2005-02-07 06:48:16.702522768 -0500
@@ -77,7 +77,9 @@
 
 DEFINE_PER_CPU(struct ip_conntrack_stat, ip_conntrack_stat);
 
-inline void
+/* vince: try this without inline:
+inline void */
+void
 ip_conntrack_put(struct ip_conntrack *ct)
 {
 	IP_NF_ASSERT(ct);

If anyone else cares to comment on whether removing "inline" from either/both of these places is good/bad, or what performance impact this may have, please do tell. But it is working well for me so far.

29.6. LVS-NAT netfilter conntrack example with ftp

Julian Anastasov ja (at) ssi (dot) bg 19 May 2004

For users of ipvs-nfct I would recommend the following rules. The example is for ftp by LVS-NAT to VIP=192.168.1.100. Access to all other ports on the VIP is denied.

# turn on conntrack and load helper modules
echo 1 > /proc/sys/net/ipv4/vs/conntrack

# module to correctly support connection expectations for FTP-DATA
modprobe ip_conntrack_ftp
# module to detect ports used for FTP-DATA
# (May 2004, has a kernel bug which hasn't been fixed)
# `modprobe ip_nat_ftp` is optional and ip_nat_ftp needs a fix:
# http://marc.theaimsgroup.com/?l=linux-netdev&m=108220842129842&w=2
# if ip_nat_ftp is used together with ipvs_nfct for FTP NAT.
modprobe ip_nat_ftp
# Restrict LOCAL_IN access
# accept packets to dport 21 and related and established connections.
# the related connections are determined by the ftp helper module
# drop all other packets
iptables -A INPUT -p tcp -d 192.168.1.100 --dport 21 -j ACCEPT
iptables -A INPUT -p tcp -d 192.168.1.100 -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A INPUT -p tcp -d 192.168.1.100 -j DROP

# Restrict FORWARD access
# accept only related, established. drop all others
iptables -A FORWARD -m state --state RELATED,ESTABLISHED -j ACCEPT
iptables -A FORWARD -j DROP

Without NFCT support it is difficult to filter in INPUT for FTP-DATA packets, but for http/https, where the VPORT is known, it is not difficult. The same difficulties exist in FORWARD for the NAT replies. This is where ipvs-nfct wins - you need only a small number of rules.

Traffic to the VIP is first filtered by iptables in INPUT and then scheduled by IPVS. IPVS in 2.6 causes the scheduled out->in packets (after any LVS-NAT translations) to appear in the LOCAL_OUT hook where they can be filtered again.

IPVS always works after the filter rules. Filtering is easy when you know the ports, but for FTP-DATA it is not possible; you have to specify the input devices and IPs for which you grant access for forwarding. You can overcome such problems with ipvs-nfct.

Ratz 03 Aug 2004

LVS-NAT with the NFCT patch will work for 2.4.x and 2.6.x kernels regarding filtering, if you don't use fwmark

LVS-DR will most probably not work with 2.6.8 and above kernels regarding filtering, since the tcp window tracking patch has been merged into the vanilla tree; however there is a relaxation sysctl that can revert the strict TCP window and sequence number checking to the loosely-knitted one (aka: non-existent) as previously found in vanilla Linux kernels.
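
A sketch of using such a relaxation sysctl; the exact name depends on your kernel, and ip_conntrack_tcp_be_liberal is an assumption here - check what your 2.6 kernel actually provides:

# accept packets that fall outside the tracked TCP window instead of
# treating them as INVALID
echo 1 > /proc/sys/net/ipv4/netfilter/ip_conntrack_tcp_be_liberal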

29.7. tcpdump is LVS compatible

You can use tcpdump to debug your iptables rules on your running director. tcpdump makes a copy of packets for its own use. tcpdump gets a copy of the packets before netfilter (on the way in) and after netfilter (on the way out). You should see all packets with tcpdump as if netfilter and LVS didn't exist.
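
For example, on the director you might watch the client's packets to the VIP while switching your rules on and off (the VIP, interface and port are assumptions):

tcpdump -n -i eth0 host 192.168.1.100 and tcp port 80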

Joe 16 Mar 2001

I'm looking at packets after they've been accepted by TP and I'm using (among other things) tcpdump.

Where in the netfilter chain does tcpdump look at incoming and outgoing packets? When they are put on/received from the wire? After the INPUT, before the OUTPUT chain...?

Julian

Before/after any netfilter chains. Such programs hook in at the packet level, before/after the IP stack, just after the packet is received by the device or just before it is sent by the device. They work for other protocols too. tcpdump is a packet receiver in the network stack, just like the IP stack is.

Note

If you are using twisted pair ethernet through a switch, your NIC will only see the packets to/from it. Thus tcpdump running on the director will not see packets from the realserver to the client in LVS-DR. In the days when people used coax (or hubs) for ethernet, all machines saw all packets on a segment.

29.8. Writing Filter Rules

If you're writing your own rules, start off with a quiet machine, log all packets and then access one of the services. Write rules to accept the packets you want and keep logging the rest. Try another service... Deny all packets that you know aren't needed for your LVS.

You can probably accept all packets that have both src_addr and dst_addr in the RIP network. As well, machines might need access to outside services (e.g. ntp, dns). Realservers that are part of 3-Tier LVSs will also require rules to allow them to access outside services.
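
A sketch of the approach described above, for a director (the VIP, RIP network and service are assumptions; adjust to your setup):

# accept what you know the LVS needs, log everything else while refining the rules
iptables -P INPUT DROP
iptables -A INPUT -p tcp -d 192.168.1.100 --dport 80 -j ACCEPT     # the virtual service
iptables -A INPUT -s 10.1.1.0/24 -d 10.1.1.0/24 -j ACCEPT          # RIP network traffic
iptables -A INPUT -p udp --sport 53 -j ACCEPT                      # dns replies to the director
iptables -A INPUT -j LOG --log-prefix "LVS unmatched: "            # log, then the policy drops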

Joe (on changing from writing ipchains rules for ip_vs for 2.2, to writing iptables rules for ip_vs for 2.4)

I see packets only in the INPUT and OUTPUT chains, but not in FORWARD or in lvs_rules chains.

Ratz 21 May 2001

If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:

packets coming from outside to the LVS do:

        PRE_ROUTING -> LOCAL_IN(LVS in) -> POST_ROUTING

packets leaving the LVS travel:

        PRE_ROUTING -> FORWARD(LVS out) -> POST_ROUTING

From the iptables howto:

COMPATIBILITY WITH IPCHAINS

This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.

When writing filter rules (e.g. iptables), keep in mind

  • write the rules in trees. If a packet has to traverse many rule tests before it is accepted/rejected, then throughput will decrease. If many packets traverse a rule set, then you should attempt to shorten the path through the rules, possibly by breaking the rule set into several branches (see the sketch after this list).
  • Ratz has shown that you can have 200-500 rules in a branch before throughput is affected (see pf-speed-test.pdf).
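
A sketch of breaking a rule set into branches with user-defined chains (the chain names, VIP and ports are made up for illustration):

# classify once in INPUT, then each packet only traverses its own branch
iptables -N web-rules
iptables -N mail-rules
iptables -A INPUT -p tcp -d 192.168.1.100 --dport 80 -j web-rules
iptables -A INPUT -p tcp -d 192.168.1.100 --dport 25 -j mail-rules
iptables -A web-rules -j ACCEPT        # ... the many web tests go here
iptables -A mail-rules -j ACCEPT       # ... the many mail tests go here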

"K.W." kathiw (at) erols (dot) com

can I run my ipchains firewall and LVS (piranha in this case) on the same box? It would seem that I cannot, since ipchains can't understand virtual interfaces such as eth0:1, etc.

Brian Edmonds bedmonds (at) antarcti (dot) ca 21 Feb 2001

I've not tried to use ipchains with alias interfaces, but I do use aliased IP addresses in my incoming rulesets, and it works exactly as I would expect it to.

Julian

I'm not sure whether piranha already supports kernel 2.4, I have to check it. ipchains does not understand interface aliases even in Linux 2.2. Any setup that uses such aliases can be implemented without using them. I don't know of any routing restrictions that require using aliases.

I have a full ipchains firewall script, which works (includes port forwarding), and a stripped-down ipchains script just for LVS, and they each work fine separately. When I merge them, I can't reach even just the firewall box. As I mentioned, I suspect this is because of the virtual interfaces required by LVS.

LVS does not require any (virtual) interfaces. LVS never checks the devices nor any aliases. I'm not sure what the port forwarding support in ipchains is either. Is that the support provided by ipmasqadm: the portfw and mfw modules? If yes, they are not implemented (yet), and this support is not related to ipchains at all. Some good features are still not ported from Linux 2.2 to 2.4, including all these useful autofw things. But you can use LVS in the places where you would use ipmasqadm portfw/mfw, though not for the autofw tricks. LVS can perfectly do the portfw job and even extend it beyond the NAT support: there are DR and TUN methods too.

Lorn Kay lorn_kay (at) hotmail (dot) com

I ran into a problem like this when adding firewall rules to my LVS ipchains script. The problem I had was due to the order of the rules.

Remember that once a packet matches a rule in a chain it is kicked out of the chain - it doesn't matter if it is an ACCEPT or REJECT rule (packets may never get to your FWMARK rules, for example, if they do not come before your ACCEPT and REJECT tests).

I am using virtual interfaces as well (eg, eth1:1) but, as Julian points out, I had no reason to apply ipchains rules to a specific virtual interface (even with an ipchains script that is several hundred lines long!)

unknown

FWMARKing does not have to be a part of an ACCEPT rule. If you have a default DENY policy and then say:

/sbin/ipchains -A input -d $VIP -j ACCEPT
/sbin/ipchains -A input -d $VIP 80 -p tcp -m 3
/sbin/ipchains -A input -d $VIP 443 -p tcp -m 3

to maintain persistence between port 80 and 443 for https, for example, then the packets will match the ACCEPT rule first, get kicked out of the input chain tests, and never get marked.
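
A sketch of the same rules with the order fixed, so packets are marked before they reach the ACCEPT rule (a marking rule without -j does not terminate the chain, so the packet carries on to the ACCEPT):

/sbin/ipchains -A input -d $VIP 80 -p tcp -m 3
/sbin/ipchains -A input -d $VIP 443 -p tcp -m 3
/sbin/ipchains -A input -d $VIP -j ACCEPT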

29.9. The Antefacto Netfilter Connection Tracking patches

The Problem: Because of the incompatibilities between netfilter and LVS, it is not possible (in general) to have iptables firewall rules running on the director (the firewall must be on a separate machine).

The first code (the Antefacto patches) was written by Ben North for 2.4 kernels, when he worked for the (now defunct) Antefacto. The code was then (Jun 2003) taken over by Vinnie listacct1 (at) lvwnet (at) com and is now (Apr 2004) being ported by Julian, who calls it the "netfilter connection tracking" (nfct) patches.

The original Antefacto patches allowed a firewall on directors in an LVS where the packets from the realservers return through the director (LVS-NAT, and LVS-DR with the forward_shared flag when the director is the default gw for the realservers). This restriction is required so that the director has full information about the connection (in LVS-DR and LVS-Tun, the director makes guesses about the state of the connection from timeout values).

Documentation for the Antefacto code is at setting up an LVS-NAT Director (running keepalived) to function as a stateful firewall, which also happens to use proxy-arp. The code is at Antefacto patch for 2.4.19/1.0.7 and Antefacto patch for 2.4.20/1.0.8

29.9.1. The problem: director can't be a firewall as well

John P. Looney john (at) antefacto (dot) com Apr 12, 2002

We modified ip_vs to get it to play nicely with iptables on the same box, so you don't need a separate firewall/vpn box.

The patches weren't accepted into the main branch, as the changes were considered non-mainstream, and they were made against the 0.8.2 version, which was a little old even then. Have a look at:

http://www.in-addr.de/pipermail/lvs-users/2002-January/004585.html

From memory (I didn't do the kernel work), the ip_vs connection tracking tables and the netfilter connection tracking tables were not always in synch. So, you couldn't statefully firewall an ip_vs service. There is a readme included somewhere in that thread.

We've just done a product release. One of our aims is to reimplement these changes in the 1.0.x branch, if someone hasn't already done so. When that's done, we'll post those patches to the list also.

Here's the original posting from Ben North.

Ben North             Software Engineer
a n t e f a c t o     t: +353 1 8586008
www.antefacto.com     f: +353 1 8586014
181 Parnell Street - Dublin 1 - Ireland

Note
Feb 2003: Antefacto no longer exists. You should be able to contact Ben at ben (at) redfrontdoor (dot) org.

We've been working with the LVS code for the past while, and we wanted to allow the use of Netfilter's connection-tracking ability with LVS-NAT connections. There was a post on the mailing list a couple of weeks ago asking about this, and my colleague Padraig Brady mentioned that we had developed a solution.

I've now had time to clean up the patches, and I attach a README, and two patch files. One is for the Linux kernel, and one is to the LVS code itself. Any comments, get in touch. We have done a fair amount of testing (overnight runs with many tens of thousands of connections), with no problems.

Many thanks for the great piece of work. Hope the patches are useful and will be considered for inclusion in future releases of LVS. I notice that 1.0.0 is going to arrive soon; the attached patches might be better applied to a Linux-kernel-style 1.1 "development" branch.

Vinnie listacct1 (at) lvwnet (dot) com 04 May 2003

Well I haven't tried to crash the firewall/Director or anything, but to sum it up, the firewall box is doing its job now just as well as it was before I started dinking around with LVS/IPVS. It is letting traffic come IN that I have IPVS virtual services for, and letting it be FORWARDED to the Real Servers. It's not getting in the way of IPVS connections in progress, nor does it appear to be letting traffic through which is NOT related to connections already in progress.

Ratz

Guys, I hope you _do_ realize that not even netfilter has a properly working connection tracking. Without the tcp-window-tracking patch, netfilter allows you to send arbitrary packets through the stack. It's a well-known fact and even the netfilter homepage at some point mentioned it.

Point taken. But that's not an IPVS or Antefacto problem.

I take it that you didn't do any tests of the patch or netfilter in general with a packet generator (where you can modify every last bit of an skb).

No, I can't say that I have. Perhaps you would be willing to put some of that expertise you have to work?

And, to your interest, LVS _does_ have sort of connection state tracking.

I am aware of that. But the point about all of this (and the reason that the folks who actually wrote the Antefacto patch did so) is that IPVS works independently of netfilter's connection tracking. So Netfilter doesn't have a CLUE about all those connections going on (or not going on) to IPVS-based services and RealServers.

But if you want your LVS Director to also be your main firewall, that means you have to be able to tell your firewall box, in ways that you can communicate your wishes with iptables commands, what kind of traffic you want to allow to go in/out of your LVS. But that's pretty hard to do since IPVS unmodified doesn't bother to let netfilter in on the loop of what it's doing.

The antefacto patch allows netfilter and IPVS to communicate about all that traffic going through your LVS, so that at the iptables ruleset level, it is possible to write rules that work for your LVS.

If netfilter's connection tracking is broken, then it's broken -- IPVS, Antefacto, or not.

29.9.2. The patches

The following patches are Copyright (C) 2001--2002 Antefacto Ltd, 181 Parnell St, Dublin 1, Ireland, and are released under the GPL. Ben's documentation on the patch follows.

First the patch to the kernel sources (http://www.austintek.com/WWW/LVS/LVS-HOWTO/HOWTO/files/antefacto_kernel.diff)

And second, the patch to the LVS sources (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/antefacto_lvs.diff) (made against 0.8.2 so may need some clean-up).

29.9.3. Making LVS work with Netfilter's connection tracking

The two attached patches modify the kernel and the ipvs modules in such a way that ipvs NAT connections are correctly tracked by the Netfilter connection-tracking code. This means that firewalling rules can be put in place to allow incoming connections to a virtual service, and then by allowing ESTABLISHED and RELATED packets to pass the FORWARD chain, we achieve stateful firewalling of these connections.

For example, if director 4.3.2.1 is offering a virtual service on TCP port 8899, we can do

iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

and get the desired behaviour. Note that the second rule (the one in the FORWARD chain) covers all virtual services offered by the same director, so if another service is offered on port 9900, the complete set of rules required would be

iptables -A INPUT -p tcp -d 4.3.2.1 --dport 8899 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A INPUT -p tcp -d 4.3.2.1 --dport 9900 -m state --state NEW,ESTABLISHED,RELATED -j ACCEPT
iptables -A FORWARD -p tcp -m state --state ESTABLISHED,RELATED -j ACCEPT

i.e., one rule in the INPUT chain is required per virtual service, but the rule in the FORWARD chain covers all virtual services.

Patches required

  • Patch to main kernel source

    There is a small change required to the kernel patch. The stock kernel patch which comes with the ip_vs distribution just adds a few EXPORT_SYMBOL()s to ksyms.c. For the Netfilter connection-tracking functionality, we need a bit more. The files affected, and reasons, are:

    ip_conntrack_core.c: init_conntrack(): Mark even more clearly that the newly-created connection-tracking entry is not in the hash tables. This change isn't strictly necessary but makes assertion-checking easier.

    ip_conntrack_standalone.c: Export the symbol __ip_conntrack_confirm(). I didn't really like the idea of exporting a symbol starting with double-underscore, but nothing too bad seems to have happened. The function seems to take care of reference-counting, so I think we're OK here.

    ip_nat_core.c: ip_nat_replace_in_hashes(): (new function) Exported wrapper round replace_in_hashes() which deals with the locking on ip_nat_lock.

    ip_nat_standalone.c: Export the new ip_nat_replace_in_hashes() function.

    ip_nat.h: Declare the new ip_nat_replace_in_hashes() function.

    More explanation below.

  • Patch to ip_vs code

    ip_vs_app.c: skb_replace(): Copy debugging information across to the new skb, if debugging is enabled. This is a separate issue to the main connection-tracking patch, but was causing spurious warnings about which hooks a skb had passed through.

    ip_vs_conn.c: Include some netfilter header files. Declare a new function ip_vs_deal_with_conntrack().

    ip_vs_nat_xmit(): Code to make sure that Netfilter's connection-tracking entry is correct.

    ip_vs_deal_with_conntrack(): (new function) The guts of the new functionality. Changes the data inside the Netfilter connection-tracking entry to match the actual packet flow.

    ip_vs_core.c: route_me_harder(): (new function) Copied from ip_nat_standalone.c. Code to re-make the routing decision for a packet, treated as locally-generated.

    ip_vs_out(): Separate from the connection-tracking code changes, don't send ICMP unreachable messages. This has been discussed on the list recently and I think the consensus was that this change is OK. The sysctl method would be better though, so ignore this bit.

    Also call route_me_harder() to decide whether the outbound packet needs to be routed differently now it is supposed to be coming from the director machine itself.

    ip_vs_in(): When checking if a packet might be trying to start a new connection, check that it has SYN but not ACK. Previously, the only check was that it had SYN set.

    If there is a new connection being attempted, check for consistency between Netfilter's connection-tracking table and LVS'. More explanation of this bit below.

    ip_vs_ftp.c: Include the Netfilter header files. Declare new function ip_vs_ftp_expect_callback().

    ip_vs_ftp_out(): Once we have noticed that a passive data-transfer connection has been negotiated at application level, tell Netfilter to expect this connection and so treat it as RELATED.

    ip_vs_ftp_in(): Once we have noticed that an active data-transfer connection has been negotiated at application level, tell Netfilter to expect this connection and so treat it as RELATED.

    ip_vs_ftp_expect_callback(): (new function) When the RELATED packet arrives (for a data-transfer connection), update Netfilter's connection-tracking entry for the connection.

29.9.3.1. General connections (i.e. not FTP)

Each entry in Netfilter's connection-tracking table has two tuples describing source and destination addresses and ports. One of these tuples is the ORIG tuple, and describes the addressing of packets travelling in the "original" direction, i.e., from the machine that initiated the connection to the machine that responded. The other is the REPLY tuple, which describes the addressing of packets travelling in the "reply" direction, i.e., from the responding machine to the initiating machine. Normally, the REPLY tuple is just the "inverse" of the ORIG tuple, i.e., has its source and destination reversed. But for LVS connections, this is not the case. This is what causes the problem when using the unmodified Netfilter code with IPVS connections. Actually, it's one of the things that causes trouble.

The following is roughly what happens with the unmodified code for the start of a TCP connection to a virtual service. Suppose we have

   +--------+
   | Client |
   +--------+
     (CIP)       <-- Client's IP address
       |
       |
  { internet }
       |
       |
     (VIP)       <-- Virtual IP address
  +----------+
  | Director |
  +----------+
     (PIP)       <-- (Director's Private IP address)
       |
       |
     (RIP)       <-- Real (server's) IP address
 +-------------+
 | Real server |
 +-------------+

Then the client sends a packet to the VIP:VPORT; say

CIP:CPORT -> VIP:VPORT

Netfilter on the director makes a note of this packet, and sets up a temporary connection-tracking entry with tuples as follows:

ORIG:  CIP:CPORT -> VIP:VPORT
REPL:  VIP:VPORT -> CIP:CPORT

(the "src-ip:src-port -> dest-ip:dest-port" notation is hopefully clear enough). We will call a connection-tracking entry a "CTE" from now on.

LVS notices (in ip_vs_in(), called as part of the LOCAL_INPUT hook) that VIP:VPORT is something it's interested in, grabs the packet, re-writes it to be addressed

CIP:CPORT -> RIP:RPORT

and sends it on its way by means of ip_send(). As a result, the POST_ROUTING hook gets called, and ip_vs_post_routing() gets a look at the packet. It notices that the packet has been marked as belonging to LVS, and calls the (*okfn), sending the packet to the wire without further ado.

When it has been transmitted, the reference count on the CTE falls to zero, and it is deleted. (This is a mild guess but I think it is right.) Normally, CTEs avoid this fate because __ip_conntrack_confirm() is called for them, either via ip_confirm() as a late hook in LOCAL_IN, or through ip_refrag() called as a late hook in POST_ROUTING. "Confirming" the CTE involves linking it into some hash tables, and ensuring it isn't deleted.

So this is the first problem --- the CTE is not "confirmed".

Suppose we confirmed the connection. Then when the Real Server replies to this packet, it sends a packet addressed as

RIP:RPORT -> CIP:CPORT

to the director (because the Director is the router for such packets, as seen by the Real Server). Then the connection-tracking code in Netfilter on the director tries to look up the CTE for this packet, but can't find one. The CTE we /want/ it to match says

ORIG:  CIP:CPORT -> VIP:VPORT
REPL:  VIP:VPORT -> CIP:CPORT

with no mention of the RIP:RPORT. So this reply packet gets labelled as "NEW", whereas we wanted it to be labelled as "ESTABLISHED".

So as well as confirming the CTE, we also need to alter the REPLY tuple so that it will match the

RIP:RPORT -> CIP:CPORT

packet the Real Server sends back. Then everything will work.

These two things are what the ip_vs_deal_with_conntrack() function does. Luckily there is an ip_conntrack_alter_reply() function exported by Netfilter, which we can use. Then we can also call the newly-exported __ip_conntrack_confirm() to confirm the connection. (We need to do the reply altering first because __ip_conntrack_confirm()ing the CTE uses the addresses in the ORIG and REPLY tuples to place the CTE in the hash tables, and we want it placed based on the /new/ reply tuple.)

There is a slight complication in that the NAT code in Netfilter gets confused if addressing tuples change, so we need to tell the NAT code to re-place the CTE in its hash tables. This is done with the newly-exported ip_nat_replace_in_hashes() function.

The ip_vs_deal_with_conntrack() function is called from the ip_vs_nat_xmit() function, since this whole problem only applies to LVS-NAT. It is only called if the CTE is unconfirmed.

Hacking round a possible race

When testing this, we found that very occasionally there would be a problem when the Netfilter CTE timed out and was deleted. The code would fail an assertion: the CTE about to be deleted was not linked into the hash chain it claimed it was. This would happen after a few tens of thousands of connections from the same client to the same virtual service.

We tracked this down to the above ip_vs_deal_with_conntrack() code being called for a CTE which already existed and was already confirmed. Doing this moved the CTE to a different hash chain and broke things.

The only explanation I could come up with is that there is a race in the ip_vs code. The ip_vs code doesn't set up one timer per connection entry. Instead, it uses a kernel timer to do some work every second. I didn't look into this too deeply, but it looked like the following is a possibility.

If the slow-timer code decides that a LVS connection should be expired, there seems to be a window where a packet can arrive and update that connection, meaning that it should no longer be expired. But it is anyway. There are more details; supplied on request. But if somebody who knows the timer code could check whether the above is a possibility, and fix it if so, that would be good.

The workaround detects if the CTE is already confirmed, and deletes it and also drops the packet if so. Higher levels in the stack take care of retransmitting so nothing too drastic goes wrong.

Later, we noticed the workaround being triggered much more often than we'd expect, and it turned out that incoming packets with the SYN and ACK bits both set were being treated as potentially starting new connections, whereas SYN/ACK packets are in fact a response to a connection initiated by the director itself. So we tightened the test to be

((h.th->syn && !h.th->ack) || (iph->protocol != IPPROTO_TCP))

instead of

(h.th->syn || (iph->protocol!=IPPROTO_TCP))

which is how it is in the original LVS code. This doesn't seem to have caused any nasty side effects.

Note that this only happened when an FTP virtual service was configured, because of the code in ip_vs_service_get() which allows a "wild-card" match for incoming FTP data connections.

29.9.3.2. FTP connections

The other main change is to the LVS FTP module. We add code to the two functions ip_vs_ftp_out() and ip_vs_ftp_in(), to deal with passive and active data transfers respectively. The basic idea is the same for both types of transfer.

By keeping an eye on the actual traffic going between the client and the FTP server, we can tell when a data transfer is about to take place. For a passive transfer, the ip_vs_ftp module looks out for the string "227 Entering Passive Mode" followed by the address and port the server will listen on. For an active transfer, the client transmits the "PORT" command followed by the address and port the client will listen on.

Once we have detected that a data transfer is about to take place, we add code to tell Netfilter's connection-tracking code to /expect/ the data connection. Then, packets belonging to the data connection will be labelled "RELATED" and can be allowed by firewall rules. There is an exported function ip_conntrack_expect_related(), which we call. The only difference between the set-up for passive and active transfers is that for passive transfers we don't know the port the client will connect from, so have to specify the source port as "don't care" by means of its mask.

The ip_conntrack_expect_related() function allows us to specify a callback function; we use ip_vs_ftp_expect_callback() (new function in this patch). ip_vs_ftp_expect_callback() works out whether the new connection is for passive or active, modifies the REPLY tuple, and confirms the CTE.

I've just noticed that I modify the reply tuple directly instead of calling ip_conntrack_alter_reply(). Can't see any good reason for this, so should probably change the code to use ip_conntrack_alter_reply() instead. Might not have time to test that change here, so will leave it alone for now.

So to run a virtual FTP service, load the extra ip_vs_ftp module, but /not/ the ip_conntrack_ftp or ip_nat_ftp modules. It is very likely that the ip_vs_ftp module would not cooperate very well with those two modules, so if you want to run a non-virtual FTP service /and/ load-balance a virtual FTP service on the same machine, more work might be required.

route_me_harder()

We call this function to possibly re-route the packet, because we were using policy routing (iproute2). This allows routing decisions to depend on more than just the destination IP address of the packet. In particular, a routing decision can be influenced by the source IP address of the packet, and by the fact that the packet should be treated as originating with the local machine. The call to route_me_harder() re-makes the routing decision in light of the new state of the packet. It could be removed (or disabled via a sysctl) if the overhead was too annoying in an application which didn't require this extra flexibility.

Additional #defines

There are additional #defines available to add assertion-checking and various amounts of debugging to the output of the new code.

#define BN_ASSERTIONS to include extra code which checks various things are as they should be. This adds a small amount of overhead (sorry, haven't measured it) but caught some problems in development.

#define BN_DEBUG_FTP to emit diagnostic and tracing information from the modified ip_vs_ftp module. Again, was useful during development but probably not useful in production.

#define BN_DEBUG_IPVS_CONN to emit diagnostic and tracing information from the new code which handles Netfilter's CTEs. Same comments apply: useful while I was working on it, but probably not in actual use.

29.10. The design of LVS as a netfilter module, pt1

Tao Zhao taozhao (at) cs (dot) nyu (dot) edu 11 Jul 2001

The source code of LVS adds ip_vs_in() to the netfilter hook NF_IP_LOCAL_IN to change the destination of packets. As I understand it, this hook is called AFTER the routing decision has been made. So how can it forward the packet to the newly assigned destination without routing?

Henrik Nordstrom hno (at) marasystems (dot) com

Instead of rewriting the packet inside the normal packet flow of Linux-2.4, IPVS accepts the packet, constructs a new one, routes it and sends it out. This approach does not make much sense for LVS-NAT within the netfilter framework, but fits quite well for the other modes.

Julian

LVS does not follow the netfilter recommendations. What happens if we don't change the destination (e.g. the DR and TUN methods, which don't change the IP header)? When such a packet hits the routing code, the IP header fields are used for the routing decision. Netfilter can forward only by using NAT methods.

LVS tries not to waste CPU cycles in the routing cache. You can see that there is an output routing call involved, but there is an optimization you can find even in TCP - the destination cache. The output routing call is avoided in most cases. This model is close to the one achieved in Netfilter, i.e. the input routing function is called only once (2.2 calls it twice for DNAT). I'm now testing a patch for 2.2 (on top of LVS) that avoids the second input routing call and that can reroute the masqueraded traffic to the right gateway when many gateways are used, and especially when these gateways are on the same device. The tests will show how the speed of this patched LVS for 2.2 compares with the 2.4 one (one CPU of course).

We decided to use the LOCAL_IN hook for many reasons. Maybe you can find more info on the LVS integration into the Netfilter framework by searching the LVS mailing list archive for "netfilter".

Julian 29 Oct 2001

IPVS uses only netfilter's hooks. It uses its own connection tracking and NAT. You can see how LVS fits into the framework in the mailing list archive.

Ratz

I see that the defense_level is triggered via a sysctl and invoked in the sltimer_handler, as is *_dropentry. If we pushed those functions one level higher and introduced a metalayer that registers the defense_strategy (selectable via sysctl and currently containing update_defense_level), we would have the possibility to register other defense strategies, e.g. a limiting threshold. Is this feasible? I mean, instead of calling update_defense_level() and ip_vs_random_dropentry() in the sltimer_handler we would just call the registered defense_strategy[sysctl_value] function. In the existing case defense_strategy[0]=update_defense_level(), which also merges the ip_vs_dropentry. Do I make myself sound stupid? ;)

The different strategies work in different places and it is difficult to use one hook. The current implementation allows them to work together. But maybe there is another solution, considering how LVS is called: to drop packets or to drop entries. There are not many places for such hooks, so maybe something can be done. But first let's see what kinds of other defense strategies come up.
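
For context, the defense strategies under discussion are the ones already selectable from user space via the ip_vs sysctls; roughly like this (the values shown are examples only, not recommendations):

    # drop connection entries when director memory runs low
    sysctl -w net.ipv4.vs.drop_entry=1
    # drop a rate of incoming packets when under attack
    sysctl -w net.ipv4.vs.drop_packet=1
    # use stricter TCP state transitions while defending
    sysctl -w net.ipv4.vs.secure_tcp=1
    # available-memory threshold (pages) that triggers the defenses
    sysctl -w net.ipv4.vs.amemthresh=1024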

Yes, the project has grown larger and gained more of a reputation than some of us initially thought. The code is very clear and stable; it's time to enhance it. The only very big problem that I see is that it looks like we're going to have two separate code paths: one patch for 2.2.x kernels and one for 2.4.x.

Yes, this is the reality. We can try to keep things from looking different to user space.

This would be a pain in the ass if we had two versions of ipvsadm. IMHO the userspace tools should recognize (at compile time) which kernel they are working with and enable the appropriate feature set. This will of course bloat them up in the future as the feature differences between the 2.2.x and 2.4.x series grow.

Not possible; the sockopts are different in 2.4.

Joe (I think)

Could you point me to a sketch where I could see what the control path for a packet looks like in kernel 2.4? I mean something like I would do for 2.2.x kernels:

Julian (I think)

I hope there is a nice ascii diagram in the netfilter docs, but the info below may be more useful if you already know what each hook means.

           ----------------------------------------------------------------
           |            ACCEPT/                              lo interface |
           v           REDIRECT                  _______                  |
   --> C --> S --> ______ --> D --> ~~~~~~~~ -->|forward|----> _______ -->
       h     a    |input |    e    {Routing }   |Chain  |     |output |ACCEPT
       e     n    |Chain |    m    {Decision}   |_______| --->|Chain  |
       c     i    |______|    a     ~~~~~~~~        |     | ->|_______|
       k     t       |        s       |             |     | |     |
       s     y       |        q       |             v     | |     |
       u     |       v        e       v            DENY/  | |     v
       m     |     DENY/      r   Local Process   REJECT  | |   DENY/
       |     v    REJECT      a       |                   | |  REJECT
       |   DENY               d       --------------------- |
       v                      e -----------------------------
      DENY

Ratz (I think)

The biggest problem I see here is that maybe the user space daemons don't get enough scheduling time to be accurate enough.

That is definitely true. When the CPU(s) are busy transferring packets the processes can be delayed. So, the director better not spend many cycles in user space. This is the reason I prefer all these health checks to run in the realservers but this is not always good/possible.

No, considering the fact that not all RS are running Linux. We would need to port the healthchecks to every possible RS architecture.

Yes, this is a drawback.

unknown (Ratz ?)

Tell me, which scheduler should I take? None of the existing ones currently gives me good enough results with persistence. We have to accept the fact that 3-tier application programmers don't know about load balancing or clustering; they mostly use Java, and that is just about the end of trying to load balance the application smoothly.

WRR + load-informed cluster software. But I'm not sure about the case when persistence is on (it can do bad things).

I currently get some values via a daemon coded in perl on the realservers, started via xinetd. The LB connects to the healthcheck port and gets some prepared results. It then puts this stuff into a db and starts calculating the next steps to reconfigure the LVS cluster to smooth out the imbalance. The longer you let it run, the more data you get and the fewer adjustments you have to make. I reckon some guy who showed up on this list once had this idea, going in the direction of fuzzy logic. Hey Julian, maybe we should accept the fact that the wlc scheduler also isn't a very advanced one:

loh = atomic_read(&least->activeconns)*50+atomic_read(&least->inactconns);

What would you think would change if we made this 50 dynamic?

Not sure :) I don't have results from experiments with wlc :) You can put it in /proc and make different experiments, for example :) But warning: ip_vs_wlc can be a module; check how the lblc* schedulers register their /proc vars.
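
A sketch of the experiment being suggested: if the factor of 50 were exported as a /proc variable (the name below is hypothetical - stock LVS has no such entry), it could be tuned and the effect observed on the fly:

    # hypothetical knob -- stock ip_vs_wlc exports no such /proc entry
    echo 30 > /proc/sys/net/ipv4/vs/wlc_active_factor
    # then watch how the connection counts shift between realservers
    ipvsadm -L -n --stats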

29.11. The design of LVS for Netfilter and Linux 2.4, pt2

The most recent version of Julian's writeup of LVS and Netfilter (NF) is on the LVS website. Here is the version available in Jun 2002.

29.11.1. TODO:

- redesign LVS to work in setups with multiple default routes (this requires changes in the kernels, calling ip_route_input with different arguments). The end goal: one routing call in any direction (as before) but do correct routing in in->out direction. The problems:

  • fwmark virtual services (see the example after this list) and the need for working at prerouting. Solution: hook at PREROUTING after the filter and do the connection creation there (after QoS and fwmark setup). Hook at prerouting, listen for traffic for established connections and call ip_route_input with the right arguments (possibly in the routing chain). Goal: always pass one filter chain in each direction (FORWARD). The fwmark is used only for connection setup and is then ignored.

  • hash the NAT connections twice in the same table (at prerouting we can see both requests and replies); compare with cp->vaddr to detect the right direction

- help from Netfilter to redesign the kernel hooks:

  • ROUTING hook (used from netfilter's NAT, LVS-DR and in->out LVS-NAT)
  • fixed ip_route_input to do source routing with the masquerade address as source (lsrc argument)
  • more control over what to walk in the netfilter hooks?

- different timeouts for each virtual server (more control over the connection timeouts)

- Allow LVS to be used as NAT router/balancer for outgoing traffic
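
The fwmark virtual services mentioned in the first TODO item are configured from user space roughly as follows (a sketch; the VIP, mark value, scheduler and realserver are placeholders):

    # mark web traffic to the VIP in the mangle table
    iptables -t mangle -A PREROUTING -d 192.0.2.1/32 -p tcp --dport 80 \
        -j MARK --set-mark 1
    # create a virtual service keyed on the fwmark rather than on IP:port
    ipvsadm -A -f 1 -s wlc
    ipvsadm -a -f 1 -r 10.0.0.10:80 -m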

29.11.2. CURRENT STATE:

Running variants:

  • 1. Only lvs - the fastest
  • 2. lvs + ipfw NAT
  • 3. lvs + iptables NAT

Where is LVS placed:

LOCAL_IN:100 ip_vs_in

FORWARD:99 ip_vs_forward_icmp
FORWARD:100 ip_vs_out

POST_ROUTING:NF_IP_PRI_NAT_SRC-1 ip_vs_post_routing

The chains:

The out->in LVS packets (for any forwarding method) walk:

pre_routing -> LOCAL_IN -> ip_route_output or dst cache -> POST_ROUTING


	LOCAL_IN
	ip_vs_in	-> ip_route_output/dst cache
			-> mark skb->nfcache with special bit value
			-> ip_send -> POST_ROUTING

	POST_ROUTING
	ip_vs_post_routing
			- check skb->nfcache and exit from the
			chain if our bit is set

The in->out LVS packets (for LVS/NAT) walk:

pre_routing -> FORWARD -> POST_ROUTING

	FORWARD (check for related ICMP):
	ip_vs_forward_icmp	-> local delivery -> mark
				skb->nfcache -> POST_ROUTING

	FORWARD
	ip_vs_out		-> NAT -> mark skb->nfcache -> NF_ACCEPT

	POST_ROUTING
	ip_vs_post_routing
			- check skb->nfcache and exit from the
			chain if our bit is set

Why LVS is placed there:

- LVS creates connections after the packets are marked, i.e. after PRE_ROUTING:MANGLE:-150 or PRE_ROUTING:FILTER:0. LVS can use the skb->nfmark as a virtual service ID.

- LVS must be after PRE_ROUTING:FILTER+1:sch_ingress.c - QoS setups. This way the incoming traffic can be policed before reaching LVS.

- LVS creates connections after the input routing because the routing can decide to deliver locally packets that are marked, or other packets specified with routing rules. Transparent proxying handled from the netfilter NAT code is not always a good solution.

- LVS needs to forward packets without looking at the IP header (direct routing method), so calling ip_route_input with arguments taken only from the IP header is not useful for LVS

- LVS is after any firewall rules in LOCAL_IN and FORWARD

29.11.3. Requirements for the PRE_ROUTING chain

Sorry, we can't waste time here. The netfilter connection tracking can mangle packets here and we don't know at this time whether a packet is for our virtual service (a new connection) or for an existing connection (which needs a lookup in the LVS connection table). We are sure that we can't make decisions about creating new connections at this place, but a lookup for existing connections is possible under some conditions: the packets must be defragmented, etc.

There are so many nice modules in this chain that can feed LVS with packets (probably modified).

29.11.4. Requirements for the LOCAL_IN chain

The conditions when sk_buff comes:

- ip_local_deliver() defragments the packets (ip_defrag) for us

- the incoming sk_buff can be non-linear

- when the incoming sk_buff arrives, only read access is guaranteed

What we do:

- packets generated locally are not considered because there is no known forwarding method that can establish a connection initiated from the director.

- only TCP, UDP and their related ICMP packets are considered

- the protocol header must be present before doing any work based on fields from the IP or protocol header.

- we detect here packets for the virtual services or packets for existing connections, and then the transmitter function for the forwarding method in use is called

- the NAT transmitter performs the following actions:

We try to make some optimizations for most of the traffic we see: the normal traffic that is not bound to any application helper, i.e. where the data part (payload) of the packets is not written or even read at all. In that case, we change the addresses and the ports in the IP and protocol headers but we don't do any checksum checking for them. We perform an incremental checksum update after the packet is mangled and rely on the realserver to perform the full check (headers and payload).

If the connection is bound to some application helper (FTP for example) we always perform checksum checking, with the assumption that the data is usually changed and with the additional assumption that the traffic using application helpers is low. To perform such a check the whole payload should be present in the provided sk_buff. For this, we call functions to linearize the sk_buff data by assembling all its data fragments.

Before the addresses and ports are changed we need write access to the packet data (headers and payload). This guarantees that any other readers see the packet data unchanged. The copy-on-write is performed by the linearization function for packets that had many fragments. For all other packets we have to copy the packet data (headers and payload) if it is used by someone else (the sk_buff was cloned). Packets not bound to application helpers need such write access only for the first fragment, because for them only the IP and protocol headers are changed and we guarantee that they are in the first fragment. For packets using application helpers the linearization is already done and we are sure that there is only one fragment. As a result, we need write access (copy if cloned) only for the first fragment. After the application helper is called to update the packet data we perform a full checksum calculation.

- the DR transmitter performs the following actions:

Nothing special; it may be the shortest function. The only action is to reroute the packet to the bound realserver. If the packet is fragmented then ip_send_check() should be called to refresh the checksum.

- the TUN transmitter performs the following actions:

Copies the packet if it is already referenced by someone else or when there is no space for the IPIP prefix header. The packet is rerouted to the realserver. If the packet is fragmented then ip_send_check() should be called to refresh the checksum in the old IP header.

- if the packets must leave the box we send them to POST_ROUTING via ip_send and return NF_STOLEN. This means that we remove the packet from the LOCAL_IN chain before reaching priority LAST-1. The LocalNode feature just returns NF_ACCEPT without mangling the packet.

In this chain, if a packet is for an LVS connection (even a newly created one), LVS calls ip_route_output (or uses a destination cache), marks the packet as LVS property (sets a bit in skb->nfcache) and calls ip_send() to jump to the POST_ROUTING chain. There our ip_vs_post_routing hook must call the okfn for packets with our special nfcache bit value (is skb->nfcache used after the routing calls? We rely on the fact that it is not) and return NF_STOLEN.

One side effect: LVS can forward packets even when ip_forward=0, but only for the DR and TUN methods. For these methods the TTL is not even decremented, nor is the data checksum checked.
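
In other words, a DR or TUN director keeps balancing even with forwarding switched off; a quick way to check this (standard /proc path, nothing LVS-specific):

    # forwarding disabled at the IP layer ...
    echo 0 > /proc/sys/net/ipv4/ip_forward
    cat /proc/sys/net/ipv4/ip_forward
    # ... yet LVS-DR/LVS-TUN traffic still reaches the realservers, because
    # ip_vs_in() hands the packets straight to POST_ROUTING via ip_send()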

29.11.5. Requirements for the FORWARD chain

LVS checks first for ICMP packets related to TCP or UDP connections. Such packets are handled as if they were received in the LOCAL_IN chain - they are locally delivered. Used for transparent proxy setups.

LVS looks in this chain for in->out packets but only for the LVS/NAT method. In any case new connections are not created here, the lookup is for existing connections only.

In this chain the ip_vs_out function can be called from many places:

FORWARD:0 - the ipfw compat mode calls ip_vs_out between the forward firewall and the masquerading. This way LVS can grab the outgoing packets for its connections and avoid their being handled by netfilter's NAT code.

FORWARD:100 - ip_vs_out is registered after FILTER=0. We can come here twice if the ipfw compat module is used, because ip_vs_out is called once from FORWARD:0 (fw_in) and after that from pri=100, where LVS always registers the ip_vs_out function. We detect this second call by looking at the skb->nfcache bit value. If the bit is set we return NF_ACCEPT. In fact, the second ip_vs_out call is avoided if the first returns NF_STOLEN after calling the okfn function.

The actions we perform are the same as in the LOCAL_IN chain for the NAT transmitter, with the exception that we should call ip_defrag(). The other difference is that we have write access to the first fragment (it is not referenced by anyone else) after ip_forward() calls skb_cow().

29.11.6. Requirements for the POST_ROUTING chain

LVS marks the packets for debugging and they appear to come from LOCAL_OUT, but this chain is not traversed. The LVS requirements for the POST_ROUTING chain include the fragmentation code only. But even the ICMP messages are generated and mangled ready for sending long before the POST_ROUTING chain: ip_send() does not call ip_fragment() for the LVS packets because LVS returns ICMP_FRAG_NEEDED when the MTU is too small.

LVS makes MTU checks when accepting packets and selecting the output device. So the ip_refrag POST_ROUTING hook is not used by LVS.

The result is: LVS must hook POST_ROUTING first (maybe only after the ipfw compat filter) and return NF_STOLEN for its packets (detected by checking the special skb->nfcache bit value).

The Netfilter hooks:

Priorities:
        NF_IP_PRI_FIRST = INT_MIN,
        NF_IP_PRI_CONNTRACK = -200,
        NF_IP_PRI_MANGLE = -150,
        NF_IP_PRI_NAT_DST = -100,
        NF_IP_PRI_FILTER = 0,
        NF_IP_PRI_NAT_SRC = 100,
        NF_IP_PRI_LAST = INT_MAX,


PRE_ROUTING (ip_input.c:ip_rcv):
	CONNTRACK=-200, ip_conntrack_core.c:ip_conntrack_in
	MANGLE=-150, iptable_mangle.c:ipt_hook
	NAT_DST=-100, ip_nat_standalone.c:ip_nat_fn
	FILTER=0, ip_fw_compat.c:fw_in, defrag, firewall, demasq, redirect
	FILTER+1=1, net/sched/sch_ingress.c:ing_hook

LOCAL_IN (ip_input.c:ip_local_deliver):
	FILTER=0, iptable_filter.c:ipt_hook
	LVS=100, ip_vs_in
	LAST-1, ip_fw_compat.c:fw_confirm
	CONNTRACK=LAST-1, ip_conntrack_standalone.c:ip_confirm

FORWARD (ip_forward.c:ip_forward):
	FILTER=0, iptable_filter.c:ipt_hook
	FILTER=0, ip_fw_compat.c:fw_in, firewall, LVS:check_for_ip_vs_out,
		masquerade
	LVS=99, ip_vs_forward_icmp
	LVS=100, ip_vs_out

LOCAL_OUT (ip_output.c):
	CONNTRACK=-200, ip_conntrack_standalone.c:ip_conntrack_local
	MANGLE=-150, iptable_mangle.c:ipt_local_out_hook
	NAT_DST=-100, ip_nat_standalone.c:ip_nat_local_fn
	FILTER=0, iptable_filter.c:ipt_local_out_hook

POST_ROUTING (ip_output.c:ip_finish_output):
	FILTER=0, ip_fw_compat.c:fw_in, firewall, unredirect,
		mangle ICMP replies
	LVS=NAT_SRC-1, ip_vs_post_routing
	NAT_SRC=100, ip_nat_standalone.c:ip_nat_out
	CONNTRACK=LAST, ip_conntrack_standalone.c:ip_refrag


CONNTRACK:
	PRE_ROUTING, LOCAL_IN, LOCAL_OUT, POST_ROUTING

FILTER:
	LOCAL_IN, FORWARD, LOCAL_OUT

MANGLE:
	PRE_ROUTING, LOCAL_OUT

NAT:
	PRE_ROUTING, LOCAL_OUT, POST_ROUTING

29.12. Example ip_tables filter scripts

A simple firewall installation script is gshield.

This script below by Tim Cronin, was written before ipvs netfilter connection tracking module, ipvs_nfct became available.

Tim Cronin tim (at) 13-colonies (dot) com 14 Feb 2003

# ipvsadm
IP Virtual Server version 1.0.6 (size=1048576)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  xx.xx.xx.xx:http wlc persistent 1200
  -> 192.168.1.25:http            Masq    1      0          2
TCP  xx.xx.xx.xx:http wlc persistent 1200
  -> 192.168.1.20:http            Masq    2      16         11
  -> 192.168.1.10:http            Masq    3      17         23

This script (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/files/tim_cronin.sh) has been running reliably in production for 6 months. The link at the top of the script is a good starting point to understand how it works. Note that the default config generates copious logs. The IP addresses have been changed to protect the innocent server. I had problems with the SYN flag, hence the section ignoring traffic going to the VIPs.
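
Since the full script is only linked here, the sketch below shows the general shape of such a ruleset (everything in it, from the VIP to the ports, is a placeholder and is not taken from Tim's script; note that on a pre-NFCT director the balanced service must be accepted statelessly):

    # placeholder VIP; adjust interfaces, ports and policies to taste
    VIP=192.0.2.1
    iptables -P INPUT DROP
    # accept the balanced service without state matching
    iptables -A INPUT -d $VIP -p tcp --dport 80 -j ACCEPT
    # traffic for the director itself (e.g. ssh) can stay stateful
    iptables -A INPUT -m state --state ESTABLISHED,RELATED -j ACCEPT
    iptables -A INPUT -p tcp --dport 22 -j ACCEPT
    # log whatever falls through (this is where the copious logs come from)
    iptables -A INPUT -j LOG --log-prefix "director-drop: "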

29.13. performance hit on director with iptables/netfilter

Dant dan (at) id-confirm (dot) com

I set up an intrusion detection system (IDS) on the director, and I ran a simple test on the director, collecting the mean service time, hit ratio and so on, which cost me two weeks. I found that snort does not affect the performance.

ratz 16 Nov 2005

Depends on the configuration; it's actually quite easy to kill Snort with an advanced ruleset and above-L4 checks.

As both snort and iptables use the libpcap library to capture packets,

iptables does not use libpcap, as far as I know, so I'm not sure I understand your question.

does it mean that iptables will not affect the director's performance? Or was I right before about snort?

iptables is the user space part of netfilter, which is in the kernel. So no, iptables will not hurt performance, but netfilter certainly does, depending on the usage, the number of rules and your hardware configuration.

I'm running several highly loaded LVSs; these days I find there are so many malicious scans that I want to ban them all with portsentry. We are also troubled by DDoS. :-/

Portsentry only mitigates the problem, it doesn't solve it. Also, it's not something that should be implemented on the LVS. Having a NIDS on the director is also a bit suboptimal, since an IDS should at best not be detectable and should also be in read-only mode. Either put a second box between the networks you need to sniff, preferably in bridge mode, or modify your network cables by removing the TX part, so only receiving is possible. Neither suggestion works well with a director.
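
A minimal sketch of the bridge-mode sniffing box being suggested (assumes the bridge-utils tools and a box with two spare NICs; the interface names and config path are placeholders):

    # create a transparent bridge between the two network segments
    brctl addbr br0
    brctl addif br0 eth1
    brctl addif br0 eth2
    # bring everything up without IP addresses so the sensor stays invisible
    ifconfig eth1 0.0.0.0 up
    ifconfig eth2 0.0.0.0 up
    ifconfig br0 0.0.0.0 up
    # run the sensor on the bridge interface
    snort -i br0 -c /etc/snort/snort.conf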

On the modifying-network-cables-for-IDS part: http://www.snort.org/docs/tap/

While we're touching on this subject here, what kind of a NIDS do people use inside an LVS setup, and how can it be implemented?

There's nothing special about LVS that would require a different approach to NIDS, so this is more a general question of how to deploy IDS; and this, I'm afraid, is subject to personal views. I don't know at which level you plan on deploying IDS, but a good starting point is the Snort documentation corner, which can be found at http://www.snort.org/docs/. Especially interesting is the IDS load balancer. I talked to Marty about load balancing traffic to multiple IDS nodes to share the load, in 2001 I think, but I don't remember what our consensus was.

Other than that you'd have to be a bit more specific. I'd be glad to help, although I left the IDS field 2-3 years ago. One of the reasons is that with the Basel II and Sarbanes-Oxley acts (http://www.aicpa.org/sarbanes/index.asp) you can barely allow yourself to "lose" data anymore, which in the sense of IDS translates to either "false positives" or "true negatives". Since the two items mentioned are a general issue of IDS systems and require highly skilled personnel, other means of achieving the demanded level of security quality management have to be found, for example reliable logging and monitoring, on top of a well-thought-out and properly implemented security policy.

29.14. Long sessions through LVS DR director terminated by icmp-host-prohibited (ICMP type 3 code 10)

Note
This problem was found in the normal operation of an LVS. The problem is with netfilter, not LVS. Netfilter is gratuitously sending icmp packets in an ESTABLISHED connection. We don't know what the problem is about (yet). It's here in case someone else finds it too.

Klaas Jan Wierenga k (dot) j (dot) wierenga (at) home (dot) nl 13 Mar 2007

I have a problem where sometimes some long-standing mp3 streaming sessions over HTTP are terminated because the LVS-DR director sends an "ICMP type 3 code 10 - host unreachable" packet to the client (which is the source of the mp3 stream). When this happens the client stops sending packets for 15 minutes (the TCP idle session timeout of LVS?) before it reconnects on the same ports. The 15 minutes seems to be related to the connection timing out of the LVS connection table.

When this happens the realservers are all fine, the load is not heavy and ldirectord is able to perform its checks. In fact nothing shows in the ldirectord.log, so the realservers are all available. This is quite a long post; I've tried to include all relevant details.

My setup: ISP router -> LVS Director -> Local switch -> Realserver[123] -> ISP router

Directions I've looked into so far and questions I've asked myself:

  1. enabled "quiescent=yes" to maybe not terminate existing connections, but I don't think this is the problem because the realservers are all available when this happens.
  2. Where is this ICMP packet generated in linux/net/ipv4/ipvs/* source files? Answer: nowhere! At least not with type 3 code 10.
  3. Could it be that this ICMP packet is generated by some sort of denial-of-service defense code that I'm unaware of?
  4. Where is this specific ICMP packet (HOST_UNREACH_ANO) generated in the kernel?

    Answer: net/ipv4/netfilter/ipt_REJECT.c: send_unreach(*pskb, ICMP_HOST_ANO);

    So it appears that netfilter (iptables?) is sending it. Why? This could be due to the firewall rule:

    REJECT     all  --  anywhere             anywhere            reject-with icmp-host-prohibited
    

    But why is this sent on an existing, established and active connection? Or is there some TCP timeout because the director only sees incoming packets on the connection? Maybe this rings a bell with someone.

  5. Maybe the client is not behaving correctly by not continuing to send data after receiving ICMP host unreachable? TCP/IP Illustrated Vol1 [Stevens] says on page 317, 21.10 ICMP Errors:

    "A received host unreachable or network unreachable is effectively ignored, since these two errors are considered transient. ... It could be that an intermediate router has gone down and it can take the routing protocols a few minutes to stabilize on an alternative route. During this period either of these two ICMP errors can occur, but they must not abort the connection. Instead, TCP keeps trying to send the data that caused the error, although it may eventually time out."

Later...

A while ago I posted about a problem I was having with long mp3 streaming sessions which were terminated because the streaming LVS cluster (managed by me) was sending icmp-host-prohibited on an established connection to the client, causing the connection to be terminated.

Initially I suspected the LVS director, but after some investigation I found out that it never sends icmp-host-prohibited. The only other possibility was netfilter sending it. The relevant parts of my initial iptables rules were (/etc/sysconfig/iptables):

*filter
:FORWARD ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -j RH-Firewall-1-INPUT
-A INPUT -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 80 --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT

After I changed the port 80 rule to the one below, effectively disabling connection tracking on port 80, the problem disappeared.

-A RH-Firewall-1-INPUT -p tcp --dport 80 -j ACCEPT

Initially I made this iptables change only on the LVS director, but then the realservers would sometimes send icmp-host-prohibited on established connections; only after also changing iptables on the realservers did the problem go away.

It is still unclear to me why netfilter would decide to send icmp-host-unreachable on an established connection when connection tracking is active. Maybe someone on the netfilter list can shed some light on this.

Still later... did you ever find a better solution?

Klaas Jan Wierenga k (dot) j (dot) wierenga (at) home (dot) nl 26 Jun 2007

Not really. It appears to be a netfilter problem because when I changed my firewall rules (/etc/sysconfig/iptables) to disable connection tracking, the problem went away.

# Don't do connection tracking on port 80 and 8000, 
# because sometimes it results in dropped connections due to ICMP_HOST_UNREACHABLE messages
#
#-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 80 --state NEW -j ACCEPT
#-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 8000 --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 8000 -j ACCEPT

29.15. stateful filtering: LVS-NAT

Note
Laurentiu didn't know about Siim Poder's patch for stateful LVS-NAT filtering, posted about 2 months previously. FIXME: write up Siim's patch and link this section to it.

Laurentiu C. Badea (L.C.) lc (at) waat (dot) com 10 Sep 2008

This is my solution to a problem I found of the last FIN-ACK being eaten by iptables. I had a simple LVS-NAT configuration where LVS lives on a gateway. The kernel is 2.6.20 with ipvsadm 1.24 (Fedora 5).

  CIP - VIP/LVS - RIP

The LVS box does run iptables, but fairly open: INPUT allows all incoming traffic for the VIP, OUTPUT allows the NEW,ESTABLISHED,RELATED states and FORWARD is open (for what it's worth, I think ipvs does not go through there at all). So that worked fine.

Until I noticed the real server has many connections in FIN_WAIT2 state. They have the same timeout as TIME_WAIT so I was gonna let it go, but then I looked at the client and all of them were in LAST_ACK state. The client kept resending FIN-ACKs, none of which made it to the server at all. On the LVS, ipvsadm -Lc shows connections in TIME_WAIT state so it did get them.

Well, long story short, the OUTPUT chain blocked *only* that FIN-ACK packet for some odd reason. I was sure that ipvs was short-circuiting iptables and bypassing OUTPUT, but I guess I misinterpreted the little map in the HOWTO. All the other packets matched the "NEW" rule. This one would probably have ended up as INVALID. I am now adding rules to allow all OUTPUT towards the RIPs, stateless.

This is how it looked from the network side:
Incoming traffic to LVS (CIP->VIP)
  0.016862  CIP -> VIP HTTP HEAD / HTTP/1.1
  0.017193 VIP -> CIP  [ACK] Seq=1 Ack=117 Win=5888 Len=0
  0.021949 VIP -> CIP  [TCP segment of a reassembled PDU]
  0.022173 VIP -> CIP  [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
  0.034046  CIP -> VIP [ACK] Seq=117 Ack=195 Win=6912 Len=0
 *0.046042  CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0*
the packet marked above does not make it; beyond here there are only retransmits
  0.250217 VIP -> CIP  [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
  0.260110  CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0
  0.267333  CIP -> VIP [TCP Dup ACK 11#1] [ACK] Seq=118 Ack=196 Win=6912 Len=0 SLE=195 SRE=196
  0.705855  CIP -> VIP [FIN, ACK] Seq=117 Ack=196 Win=6912 Len=0

Coming out the other end towards RIP:
  0.016847  CIP -> RIP HTTP HEAD / HTTP/1.1
  0.017119 RIP -> CIP  [ACK] Seq=1 Ack=117 Win=5888 Len=0
  0.021873 RIP -> CIP  [TCP segment of a reassembled PDU]
  0.022115 RIP -> CIP  [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
  0.034021  CIP -> RIP [ACK] Seq=117 Ack=195 Win=6912 Len=0
only two retransmits seen:
  0.250147 RIP -> CIP  [FIN, ACK] Seq=195 Ack=117 Win=5888 Len=0
  0.267312  CIP -> RIP [TCP Previous segment lost] [ACK] Seq=118 Ack=196 Win=6912 Len=0 SLE=195 SRE=196

The current setup seems to work except for a minor annoyance - the netfilter conntrack table still has the connections, when I would have expected that to be almost empty, given that LVS steals the packets from nf. The connections display as UNREPLIED and originating on the RIP:80 so they aren't "real" but I'm curious which packets from the real server triggered them.

A blanket ACCEPT rule on outgoing traffic doesn't seem very secure for a firewall, and in my case there's a firewall in front of the LVS.

It is not the outgoing FORWARDed traffic that is being allowed, though; it is the traffic originating on the LVS machine itself (the OUTPUT chain in the main table), which is usually left open anyway.

Since then I have noticed the INPUT chain would have blocked the same packet in the same configuration, so both INPUT and OUTPUT need to have a stateless ACCEPT on that tcp port for the LVS to work.
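
Concretely, for a virtual service on port 80 that means something like the following on the director (a sketch following the description above; port 80 is a placeholder and the match can be restricted to the VIP):

    # stateless ACCEPTs for the virtual service port in both chains
    iptables -A INPUT  -p tcp --dport 80 -j ACCEPT
    iptables -A OUTPUT -p tcp --sport 80 -j ACCEPT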

29.16. stateful filtering: LVS-DR

Note
Because the director cannot see the reply packets from the realserver, the standard netfilter stateful filtering can't be used with LVS-DR (or LVS-Tun).

Thomas Pedoussaut thomas (at) pedoussaut (dot) com 22 Apr 2008

For one of my dozen services (a straight TCP connection), the TCP FIN packets that arrive on the load balancer are never passed to the realserver. I activated iptables logging and could see the FIN packets being dropped. No idea why the FINs are dropped and not the other packets. I obviously have the --state ESTABLISHED,RELATED -j ACCEPT in my iptables rules.

I had a quick look at /proc/net/ip_conntrack before, during and after the connection but nothing specific to that connection seems to be inserted (the module is loaded and other traffic gets tracked).

It even happens when I close the client connection within seconds of creation, so I don't think timeouts are involved. My issue is that the backend application doesn't deal with timeouts, so it never initiates the closing of the connection.

Later...

Basically, all packets (SYN and non-SYN) are allowed by the "--state NEW" iptables rule but not by the ESTABLISHED,RELATED rule, because the director never sees the replies from the realserver and so never creates a conntrack entry for that connection. When a FIN packet arrives it is not accepted as --state NEW, because its FIN flag is set, and so that particular packet is dropped.

So the solution is to change the iptables rule

from
-A RH-Firewall-1-INPUT -m state --state NEW -m tcp -p tcp --dport $VPORT -j ACCEPT
to
-A RH-Firewall-1-INPUT -m tcp -p tcp --dport $VPORT -j ACCEPT