35. LVS: ICMP

Note
ICMP is an IP-layer protocol for sending error and control messages between two hosts. ICMP messages are not propagated beyond the two hosts involved in the error condition.

Sometime in 2000, code was added to LVS to handle ICMP. For LVS'ed services, the director handles ICMP redirects and MTU discovery, delivering the ICMP messages to the correct realserver. ICMP packets for non-LVS'ed services are delivered locally.

Setups where packets are not defragmented properly are difficult to diagnose: only large packets are affected, so the setup works much of the time, but clients will see their connections hang and the realservers can accumulate large numbers of connections stuck in FIN_WAIT. We see this when packets are enlarged by encapsulation midway through their passage through the network and decapsulated again before arriving at their destination, and the en/de-capsulating mechanism doesn't send ICMP need-fragment packets.

35.1. MTU discovery and ICMP handling

joern maier 13 Dec 2000

What happens if an ICMP "host unreachable" message is sent to the director because a client went down? Are the entries removed from the connection table?

Julian Anastasov ja (at) ssi (dot) bg Wed, 13 Dec 2000

No

Are the messages forwarded to the Realservers ?

Julian 13 Dec 2000

Yes, the embedded TCP or UDP datagram is inspected and this information is used to forward the ICMP message to the right realserver. All other messages that are not related to existing connections are accepted locally.

Eric Mehlhaff mehlhaff (at) cryptic (dot) com passed on more info

Theoretically, path-MTU discovery happens on every new TCP connection. In most cases the default path MTU is fine. It's the weird cases (ethernet LAN connections behind low-MTU WAN links) that expose broken path-MTU discovery. E.g. for a while I had my home LAN (MTU 1500) hooked up via a modem connection whose MTU I had set to 500. The minimum MTU on the path was the 500 at my end, but there were many broken web sites I could not see because they had blocked the ICMP must-fragment packets at their servers. One can also see the effects of broken path-MTU discovery on FDDI local networks.

Anyway, here's some good web pages about it:

http://www.freelabs.com/~whitis/isp_mistakes.html
http://www.worldgate.com/~marcs/mtu/
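The failure mode Eric describes can be sketched as a toy model (plain Python, not kernel code): a DF-flagged packet either fits every link on the path or bounces at the first link that is too small; if the router at that link has its ICMP filtered, the sender never learns the smaller MTU and the transfer silently hangs.

```python
def send_df_packet(size, links):
    """links: list of (mtu, icmp_allowed) tuples along the path.
    Returns ('delivered', size), ('frag_needed', mtu) for an ICMP
    must-fragment reply, or ('blackhole', None) if the reply is filtered."""
    for mtu, icmp_allowed in links:
        if size > mtu:
            if icmp_allowed:
                return ('frag_needed', mtu)  # ICMP "fragmentation needed"
            return ('blackhole', None)       # error silently dropped
    return ('delivered', size)

def discover_pmtu(initial_size, links):
    """Shrink to each advertised MTU until a packet gets through."""
    size = initial_size
    while True:
        status, info = send_df_packet(size, links)
        if status == 'delivered':
            return size
        if status == 'frag_needed':
            size = info        # retry at the advertised MTU
        else:
            return None        # broken PMTUD: discovery stalls forever
```

With `[(1500, True), (500, True)]` discovery converges on 500; flip the second link's flag to `False` (the blocked-ICMP case Eric hit) and discovery never completes.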

What happens if a realserver is connected to a client which is no longer reachable? ICMP replies go back to the VIP and will not necessarily be forwarded to the correct realserver.

Jivko Velev jiko (at) tremor (dot) net

Assume that we have TCP connections and a realserver is trying to respond to the client but cannot reach it (the client is down, the route no longer exists, an intermediate gateway is congested). In these cases your VIP will receive ICMP dest-unreachable, source-quench and friends. If you don't route these packets to the correct realserver you will hurt the performance of the LVS: the realserver will continue to resend the unacknowledged packets to the client, and gateways will continue to send ICMP packets back to the VIP for every packet they drop. The TCP stack will drop such a connection after its timeouts expire, but if the director forwards the ICMP packets to the appropriate realserver, this will happen a little earlier and will avoid overloading the director with ICMP stuff.

When you receive an ICMP error packet it contains the full IP header of the packet that caused the ICMP to be generated, plus at least the first 8 bytes of its data, so you can assume that you have the TCP/UDP ports too. So it is possible to implement "persistence rules" for ICMP packets.
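Because the offending packet's header travels inside the error, the connection key can be recovered directly from the ICMP payload. A minimal parsing sketch (plain Python, not LVS kernel code; field offsets per the IP and ICMP specs, assuming the embedded header's IHL field is honest):

```python
import struct
import socket

def icmp_error_conn_tuple(icmp_packet):
    """Given the raw bytes of an ICMP error message (e.g. dest-unreach),
    recover (proto, src_ip, src_port, dst_ip, dst_port) of the original
    datagram from the embedded IP header plus first 8 payload bytes."""
    # skip ICMP type/code/checksum/unused (8 bytes) to the embedded IP header
    embedded = icmp_packet[8:]
    ihl = (embedded[0] & 0x0F) * 4               # embedded IP header length
    proto = embedded[9]                          # 6 = TCP, 17 = UDP
    src_ip = socket.inet_ntoa(embedded[12:16])
    dst_ip = socket.inet_ntoa(embedded[16:20])
    # the first 4 bytes after the IP header are the TCP/UDP port pair
    src_port, dst_port = struct.unpack('!HH', embedded[ihl:ihl + 4])
    return proto, src_ip, src_port, dst_ip, dst_port
```

This tuple is exactly what a director needs to look up the connection table entry and pick the right realserver.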

Summary: This problem was handled in kernel 2.2.12 and earlier by having the configure script turn off icmp redirects in the kernel (through the proc interface). For 2.2.13 the ipvs patch handles this. The configure script knows which kernel you are using on the director and does the Right Thing (TM).

Joe: from a posting I picked off Dejanews by Barry Margolin

the criteria for sending a redirect are:

  1. The packet is being forwarded out the same physical interface that it was received from,
  2. The IP source address in the packet is on the same Logical IP (sub)network as the next-hop IP address,
  3. The packet does not contain an IP source route option.

Routers ignore redirects and shouldn't even be receiving them in the first place, because redirects should only be sent if the source address and the preferred router address are in the same subnet. If the traffic is going through an intermediary router, that shouldn't be the case. The only time a router should get redirects is if it's originating the connections (e.g. you do a "telnet" from the router's exec), but not when it's doing normal traffic forwarding.
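The three criteria above can be expressed as a small predicate (a hypothetical helper for illustration, not router code):

```python
from ipaddress import ip_address, ip_network

def should_send_redirect(in_iface, out_iface, src_ip, next_hop,
                         subnet, has_source_route):
    """Sketch of the redirect criteria quoted above: same physical
    interface in and out, source and next hop on the same logical
    subnet, and no IP source-route option in the packet."""
    return (in_iface == out_iface
            and ip_address(src_ip) in ip_network(subnet)
            and ip_address(next_hop) in ip_network(subnet)
            and not has_source_route)
```

E.g. a packet arriving and leaving on `eth0` with source and next hop both in `192.168.1.0/24` triggers a redirect; routed traffic from another subnet does not, which is why routers normally never see redirects.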

unknown

Well, remember that ICMP redirects are just bandages to cover routing problems. No one really should be routing that way.

ICMP redirects are easily spoofed, so many systems ignore them; otherwise they risk having their connectivity cut off on a whim. Also, many systems no longer send ICMP redirects because some people actually want to pass traffic through an intervening system! I don't know how FreeBSD ships these days, but I suggest that it should ship with ignoring ICMP redirects as the default.

35.2. LVS code only needs to handle icmp redirects for LVS-NAT and not for LVS-DR and LVS-Tun

Julian: 12 Jan 2001

Only for LVS-NAT do the packets from the realservers, i.e. the outgoing packets, hit the forward chain. With LVS-DR and LVS-Tun, packets are delivered straight to LOCAL_IN, so the FORWARD chain, where the redirect is sent, is skipped. The incoming packets for LVS-NAT use ip_route_input() for the forwarding, so they can hit the FORWARD chain too and generate ICMP redirects after the packet is translated. So the problem always exists for LVS-NAT, for packets in both directions, because after the packets are translated we always use ip_forward() to send the packets to both ends.

I'm not sure, but maybe the old LVS versions used ip_route_input() to forward the DR traffic to the realservers (this was never true for the TUN method). This call to ip_route_input() can generate ICMP redirects, so you may be right that for the old LVS versions this was a problem for DR. Looking in the ChangeLog it seems this change occurred in LVS version 0.9.4, near Linux 2.2.13. So there is something in the HOWTO that is true: there is no ICMP redirect problem for LVS-DR starting from Linux 2.2.13 :) But the problem remains for LVS-NAT even in the latest kernel. This change in LVS was not made to solve the ICMP redirect problem, though. Yes, the problem is solved for DR, but the goal was to speed up forwarding for the DR method by skipping the forward chain; when the forward chain is skipped, the ICMP redirect is not sent.

ICMP redirects and LVS: (Joe and Wensong)

The test setups shown in this HOWTO for LVS-DR and LVS-Tun have the client, director and realservers on the same network. In production the client will connect via a router from a remote network (and for LVS-Tun the realservers could be remote and all on separate networks).

The client forwards the packet for the VIP to the director; the director receives the packet on eth0 (eth0:1 is an alias of eth0), then forwards the packet to the realserver through eth0. The director sees that the packet arrived and left through the same interface without any change, so an ICMP redirect is sent to the client to notify it to send packets for the VIP directly to the RIP.

However, when all machines are on the same network, the client is not a router and is directly connected to the director; it ignores the ICMP redirect message and the LVS works properly.

If there is a router between the client and the director, and it listens to ICMP redirects, the director will send an ICMP redirect telling the router to send packets for the VIP to the realserver directly. The router will act on this ICMP redirect message and change its routing table, and then the LVS-DR won't work.

The symptom is that once the load balancer sends an ICMP redirect to the router, the router changes its routing table entry for the VIP to point at the realserver, and then the whole LVS stops working. Since you did your test on the same network, with the client in the same network as the load balancer and the server, packets don't pass through a router to reach the LVS, so you won't see this symptom. :)

Suppressing ICMP redirects on an interface is only needed with LVS-DR, when a single interface both receives packets for the VIP and connects to the realservers.

Joe

ICMP redirects are turned on by default in the 2.2 kernel. The configure script turns off ICMP redirects on the director using the proc interface:

echo 0 > /proc/sys/net/ipv4/conf/eth0/send_redirects

(Wensong) In the reverse direction, replies coming back from the realserver to the client

                                  |<------------------------|
                                  |                  realserver
 client <--> tunlhost1=======tunlhost2 --> director ------->|

After the first response packet from the realserver arrives at tunlhost2, tunlhost2 will try to send the packet through the tunnel. If the packet is too big, tunlhost2 will send an ICMP packet to the VIP asking for the packet to be fragmented. In previous versions of ipvs, the director wouldn't forward this ICMP packet to (any) realserver. With 2.2.13, code has been added to handle ICMP and make the director forward such packets to the corresponding realservers.
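The arithmetic behind the "packet is too big" case: IPIP encapsulation prepends one extra 20-byte IPv4 header, so a full-sized 1500-byte packet no longer fits a 1500-byte link, and the tunnel endpoint's ICMP must-fragment reply should advertise the link MTU minus that overhead. A small sketch of the calculation:

```python
IPIP_OVERHEAD = 20   # one extra outer IPv4 header added by encapsulation

def needs_frag(packet_len, link_mtu, overhead=IPIP_OVERHEAD):
    """True if the packet, once encapsulated, exceeds the link MTU."""
    return packet_len + overhead > link_mtu

def tunnel_advertised_mtu(link_mtu, overhead=IPIP_OVERHEAD):
    """MTU a tunnel endpoint should advertise in its ICMP
    'fragmentation needed' reply."""
    return link_mtu - overhead
```

So a 1500-byte packet entering an IPIP tunnel over a 1500-byte link triggers the ICMP, advertising an MTU of 1480; once the sender shrinks its segments to 1480 bytes, they pass.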

If the client has two connections to the LVS (say telnet and http), each to a different realserver, and the client goes down, the director gets 2 ICMP_DEST_UNREACH packets. Does the director know from the CIP:port which realserver to send each icmp packet to?

Wensong Zhang 21 Jan 2000

The director has handled ICMP packets for virtual services for a long time now; please check the ChangeLog of the code.

ChangeLog for 0.9.3-2.2.13

The incoming ICMP packets for virtual services will be forwarded to the right realservers, and outgoing ICMP packets from virtual services will be altered and sent out correctly. This is important for error and control notification between clients and servers, such as MTU discovery.

Joe

If a realserver goes down after the connection is established, will the client get a dest_unreachable from the director?

No. Here is a design issue: if the director sent an ICMP_DEST_UNREACH immediately, all transferred data for the established connection would be lost and the client would need to establish a new connection. Instead, we would rather wait for the connection to time out; if the realserver recovers from its temporary outage (such as an overloaded state) before the connection expires, the connection can continue. If the realserver doesn't recover before the expiry, an ICMP_DEST_UNREACH is sent to the client.

If the client goes down after the connection is established, where do the dest_unreachable icmp packets generated by the last router go?

If the client is unreachable, some router will generate an ICMP_DEST_UNREACH packet and send it to the VIP; the director will then forward the ICMP packet to the realserver.

Since icmp packets are connectionless, are they routed through the director independently of the services that are being LVS'ed? i.e. if the director is only forwarding port 80/tcp from a CIP to a particular RIP, does the LVS code which handles the icmp forward all icmp packets from that CIP to that RIP? What if the client has a telnet session to one realserver and http to another realserver?

It doesn't matter, because the header of the original packet is encapsulated in the icmp packet, so it is easy to identify which connection the icmp packet is for.

35.3. ICMP checksum errors

(This problem pops up in the mailing list occasionally, e.g. Ted Pavlic on 2000-08-01.)

Jerry Glomph Black

The kernel debug log (dmesg) occasionally gets bursts of messages of the following form on the LVS box:

IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!
IP_MASQ:reverse ICMP: failed checksum from 199.108.9.188!

What is this, is it a serious problem, and how to deal with it?

Joe

I looked in dejanews. No-one there knows either and people there are wondering if they are being attacked too. It appears in non-LVS situations, so it probably isn't an LVS problem. The posters don't know the identity of the sending node.

Wensong

I don't think it is a serious problem. If these messages are generated, the ICMP packets must have failed their checksum. Maybe the ICMP packets from 199.108.9.188 are malformed for some unknown reason.

Here are some other reports

Hendrik Thiel thiel (at) falkag (dot) de 18 Jun 2001

I noticed this in dmesg and messages:

kernel: IP_MASQ:reverse ICMP:failed checksum from 213.xxx.xxx.xxx!
last message repeated 1522 times.

Is this lvs specific (using nat) ? or can this be an attack?

Alois Treindl alois (at) astro (dot) ch

I see those too

Jun 17 22:16:19 wwc kernel: IP_MASQ:reverse ICMP: failed checksum from 193.203.8.8!

not as many as you but every few hours a bunch.

Juri Haberland juri (at) koschikode (dot) com

From time to time I see them too, on a firewall masquerading the company's net. I always assumed it was a corrupted ICMP packet... Who knows...

35.4. ICMP Timeouts

Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 14 Feb 2001

What is the purpose of the ICMP packets that are sent when new packets arrive for a TCP connection that has timed out in the LVS box? I understand it obviously for UDP, but I don't see their role for a TCP connection...

Julian

I assume your question is about the reply after ip_vs_lookup_real_service.

It is used to remove the open request in SYN_RECV state on the realserver. LVS replies for more states too; some OSes may report them as soft errors (Linux), others may report them as hard errors, who knows.

It's about ICMP packets from an LVS-NAT director to the client. For example, a client accesses a TCP virtual service and then stops sending data for a long time, long enough for the LVS entry to expire. When the client tries to send new data over this same TCP connection, the LVS box sends ICMP (port unreachable) packets to the client. For a TCP connection, how do these ICMP packets "influence" the client? It will stop sending packets on this (as far as the LVS box is concerned) expired TCP connection only after its own timeouts, won't it?

By default, TCP replies with RST to the client when there is no existing socket. LVS does not keep info for already-expired connections, so we can only reply with an ICMP error rather than a TCP RST. (If we implemented TCP RST replies, we could reply with a TCP RST instead of ICMP.)

What does the client do with this ICMP packet? By default, the application does not listen for ICMP errors and they are reported as soft errors after a TCP timeout and according to the TCP state. Linux at least allows the application to listen for such ICMP replies. The application can register for these ICMP errors and detect them immediately as they are received by the socket. It is not clear whether it is a good idea to accept such information from untrusted sources. ICMP errors are reported immediately for some TCP (SYN) states.
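On Linux, the mechanism Julian mentions is the IP_RECVERR socket option: the application opts in, and queued ICMP errors are then read from the socket's error queue with MSG_ERRQUEUE. A minimal sketch (the numeric IP_RECVERR value is taken from <linux/in.h> because the Python socket module does not always expose it; 192.0.2.1 is a TEST-NET address used only as a plausibly unreachable probe target):

```python
import socket

IP_RECVERR = 11    # from <linux/in.h>; Linux-specific
MSG_ERRQUEUE = getattr(socket, 'MSG_ERRQUEUE', 0x2000)

s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.setsockopt(socket.IPPROTO_IP, IP_RECVERR, 1)   # opt in to queued ICMP errors
s.setblocking(False)

try:
    s.sendto(b'probe', ('192.0.2.1', 9))   # TEST-NET-1: normally unreachable
    # An ICMP error, once one arrives, is read from the error queue:
    data, ancdata, flags, addr = s.recvmsg(512, 512, MSG_ERRQUEUE)
    # ancdata carries a sock_extended_err with the ICMP type and code
except OSError:
    pass   # no route, or simply no error queued yet on the non-blocking read
```

Without IP_RECVERR the same error would only surface as a soft error on a later send/recv, which is exactly the delayed reporting described above.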

35.5. PMTUD (path MTU discovery)

see MTU

35.6. Long sessions through LVS DR director terminated by icmp-host-prohibited (ICMP type 3 code 10)

Note
Klaas's posting of a bug. We don't know what it's about yet.

Klaas Jan Wierenga k.j.wierenga (at) home (dot) nl 26 Mar 2007

I have a problem where sometimes some long-standing mp3 streaming sessions over HTTP are terminated, because the LVS-DR director sends an ICMP type 3 code 10 - host unreachable (icmp-host-prohibited) packet to the client (the source of the mp3 stream). When this happens the client stops sending packets for 15 minutes (the TCP idle session timeout of LVS?)

Initially I suspected the LVS director, but after some investigation I found out that it never sends icmp-host-prohibited (it's nowhere in the linux/net/ipv4/ipvs/* source files). The only other possibility was netfilter sending it (found in net/ipv4/netfilter/ipt_REJECT.c: send_unreach(*pskb, ICMP_HOST_ANO)). But why is this sent on an existing, established and active connection?

The relevant parts of my initial iptables were (/etc/sysconfig/iptables):

*filter
:FORWARD ACCEPT [0:0]
:INPUT ACCEPT [0:0]
:RH-Firewall-1-INPUT - [0:0]
:OUTPUT ACCEPT [0:0]
-A FORWARD -j RH-Firewall-1-INPUT
-A INPUT -j RH-Firewall-1-INPUT
-A RH-Firewall-1-INPUT -i lo -j ACCEPT
-A RH-Firewall-1-INPUT -p icmp -m icmp --icmp-type any -j ACCEPT
-A RH-Firewall-1-INPUT -m state --state RELATED,ESTABLISHED -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 80 --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -j REJECT --reject-with icmp-host-prohibited
COMMIT

After I changed the port 80 rule to the one below, effectively disabling connection tracking on port 80, the problem disappeared.

-A RH-Firewall-1-INPUT -p tcp --dport 80 -j ACCEPT

Initially I made this iptables change only on the LVS director, but then the realservers would still sometimes send icmp-host-prohibited on established connections; only after also changing iptables on the realservers did the problem go away.

It is still unclear to me why netfilter would decide to send icmp-host-prohibited on an established connection when connection tracking is active. Maybe someone on the netfilter list can shed some light on this.

later on following up: 26 Jun 2007

I never figured it out. It appears to be a netfilter problem because when I changed my firewall rules (/etc/sysconfig/iptables) to disable connection tracking, the problem went away.

# Don't do connection tracking on port 80 and 8000 because sometimes it
# results in dropped connections due to ICMP_HOST_UNREACHABLE messages
#-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 80 --state NEW -j ACCEPT
#-A RH-Firewall-1-INPUT -p tcp -m state -m tcp --dport 8000 --state NEW -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 80 -j ACCEPT
-A RH-Firewall-1-INPUT -p tcp --dport 8000 -j ACCEPT