44. LVS: Weird hardware (and software)

44.1. Arp caching defeats Heartbeat switchover

Claudio Di-Martino claudio (at) claudio (dot) csita (dot) unige (dot) it

I've set up a LVS using direct routing composed of two linux-2.2.9 boxes with the 0.4 patch applied. The load balancer acts as a local node too. I configured mon to monitor the state of the services and update the redirect table accordingly. I also configured heartbeat so that when the load balancer fails the second machine takes over the virtual ip, sets up the redirect table and starts mon. When the load balancer restarts, the backup reconfigures itself as a realserver, drops the interface alias that carries the virtual ip, stops mon, clears the redirect table. Although the configuration of the two machines is set up correctly it fails to restore the load balancer due to arp caching problems.

It seems that the local gateway keeps routing requests for the VIP to the load balancer backup. Sending gratuitous arp packets from the load balancer has no effect since the interface of the backup is still alive and responding.

Has anyone encountered a similar problem and is there a hack or a proper solution to take back control of the virtual ip?

Antony Lee AntonyL (at) hwl (dot) com (dot) hk

I am new to LVS and I have a problem setting up two LVSes for failover. The problem is related to the ARP caching of the primary LVS' MAC address in the realservers and in the router connected to the Internet. It causes all Internet connections to stall until the ARP cache entries in the Web servers and the router have expired. Can anyone help to solve the problem by making some changes in the Linux LVS? (I am not able to change the router's ARP cache time. The router is owned by the Web hosting company, not by me.)

In each LVS, there are two network cards installed. The eth0 is connected to a router which is connected to the Internet. The eth1 is connected to a private network which is the same segment as the two NT IIS4.

The eth0 of the primary LVS is assigned an IP address 202.53.128.56
The eth0 of the backup LVS is assigned an IP address 202.53.128.57
The eth1 of the primary LVS is assigned an IP address 192.128.1.9
The eth1 of the backup LVS is assigned an IP address 192.128.1.10

In addition, both primary and backup LVS have enabled the IPV4 FORWARD and IPV4 DEFRAG. In the file /etc/rc.d/rc.local the following command was also added:

ipchains -A forward -s 192.168.1.0/24 -d 0.0.0.0/0 -j MASQ

I use piranha to configure the LVS so that the two LVSes have a common IP address 202.53.128.58 on eth0 as eth0:1, and an IP address 192.128.1.1 on eth1 as eth1:1.

The pulse daemon is also started automatically when the two LVSes boot.

In my configuration, the Internet clients can still access our Web server when one of the NT servers is disconnected from the LVS. The backup LVS --CAN AUTOMATICALLY-- take up the role of the primary LVS when the primary LVS is shut down or disconnected from the backup LVS. However, I found that none of the NT Web servers can reach the backup LVS through the common IP address 192.128.1.1, and all the Internet clients stall when connecting to our web servers.

Later, I found that the problem may be due to the ARP caching in the Web servers and router. I tried limiting the ARP cache time to 5 seconds in the NT servers and half of the problem was solved, i.e. the NT Web servers can reach the backup LVS through the common IP address 192.128.1.1 when the primary LVS is down. However, Internet clients still cannot connect when the LVS failover occurs.

Wensong

I just tried two LVS boxes with piranha 0.3.15. When the primary LVS stops or fails, the backup will take over and send out 5 Gratuitous Arp packets for the VIP and the NAT router IP respectively, which should clean the ARP caching in both the web servers and the external router.

After the LVS failover occurs, the established connections from the clients will be lost in the current version, and the clients need to reconnect to the LVS.

.. 5 ARP packets for each IP address, and 10 for both the VIP and the NAT router IP. I saw the log file as follows:

Mar  3 11:12:14 PDL-Linux2 pulse[4910]: running command "/sbin/ifconfig" "eth0:5" "192.168.10.1" "up"
Mar  3 11:12:14 PDL-Linux2 pulse[4908]: running command "/usr/sbin/send_arp" "-i" "eth0" "192.168.10.1" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:14 PDL-Linux2 pulse[4913]: running command  "/sbin/ifconfig" "eth0:1" "172.26.20.118" "up"
Mar  3 11:12:14 PDL-Linux2 kernel: send_arp uses obsolete (PF_INET,SOCK_PACKET)
Mar  3 11:12:14 PDL-Linux2 pulse[4909]: running command "/usr/sbin/send_arp" "-i" "eth0" "172.26.20.118" "00105A839CBE" "172.26.20.255" "ffffffffffff"
Mar  3 11:12:17 PDL-Linux2 nanny[4911]: making 192.168.10.2:80 available

I don't know if the target addresses of the 2 send_arp commands are set correctly. I am not sure if it is different when broadcast or source IP is used as target address, or any target address is OK.

Horms

Are there just 5 ARPs, or 5 to start this and then more gratuitous ARPs at regular intervals? If the gratuitous ARPs only occur at fail-over, then once the ARP caches on hosts expire there is a chance that a failed host - whose kernel is still functional - could reply to an ARP request.

wanger (at) redhat (dot) com

When we put this together, I talked to Alan Cox about this. His opinion was that we should send 5 ARPs out, 2 seconds apart. If there is something out there that is listening and cares, then it will pick it up.

The way piranha works, as long as the kernel is alive, the backup (or failed node) will not maintain any interfaces that are Piranha managed. In other words, it removes any of those IPs/interfaces from its routing table upon failure recovery.
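
As a rough illustration of what the send_arp calls in the log above put on the wire, here is a minimal Python sketch that builds a gratuitous ARP request and repeats it 5 times, 2 seconds apart. It is not piranha's or heartbeat's code. The function names are hypothetical, the frame uses the common form in which the sender and target IP are both the VIP (the logged send_arp calls pass the broadcast address as target instead), and sending needs root on Linux.

```python
import socket
import struct
import time


def build_gratuitous_arp(mac: str, ip: str) -> bytes:
    """Return a 42-byte Ethernet frame carrying a gratuitous ARP request."""
    hw = bytes.fromhex(mac.replace(":", ""))
    addr = socket.inet_aton(ip)
    broadcast = b"\xff" * 6
    # Ethernet header: destination MAC, source MAC, EtherType 0x0806 (ARP)
    eth = broadcast + hw + struct.pack("!H", 0x0806)
    # ARP header: hardware type 1 (Ethernet), protocol 0x0800 (IPv4),
    # address lengths 6 and 4, opcode 1 (request)
    arp = struct.pack("!HHBBH", 1, 0x0800, 6, 4, 1)
    # Sender MAC/IP then target MAC/IP; sender IP == target IP is the
    # assumption that makes this request "gratuitous"
    arp += hw + addr + broadcast + addr
    return eth + arp


def announce(ifname: str, mac: str, ip: str,
             count: int = 5, interval: float = 2.0) -> None:
    """Send the frame `count` times (needs root and Linux AF_PACKET)."""
    frame = build_gratuitous_arp(mac, ip)
    with socket.socket(socket.AF_PACKET, socket.SOCK_RAW) as sock:
        sock.bind((ifname, 0))
        for _ in range(count):
            sock.send(frame)
            time.sleep(interval)
```

With the values from the log, announce("eth0", "00105A839CBE", "172.26.20.118") would advertise the VIP. Whether an upstream router honours such packets is exactly what the posts in this section disagree about.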

44.2. Weird Hardware I: cisco catalyst routers gratuitously cache arp data (failover is slow)

Some hardware manufacturers release equipment with broken TCP/IP implementations.

Sean Roe May 06, 2004

I was looking for some info on cisco catalyst switches to help speed up the failover between my two director boxes. I have the following LVS-NAT setup:

                   |--------|-----|WebServer1|
       -- |LVS01|--|Cisco   |-----|WebServer2|
       |           |        |-----|WebServer3|
 ------|           |Catalyst|-----|WebServer4|
       |           |        |-----|WebServer5|
       -- |LVS02|--|Switch  |-----|WebServer6|
                   |--------|
  Virt     LVS                    Real
  IP       Servers                Servers

My problem is that if lvs01 fails, lvs02 takes over the load, but it takes forever (5-6 minutes) for the realservers to start using the new director. It also seems to work faster if I actually restart the httpd on each webserver. This is an LVS-NAT with multiple virtual IPs going to different ports on the webservers.

John Reuning john (at) metalab (dot) unc (dot) edu 23 Apr 2004

I've seen a similar delay in failover when using cisco routers. They don't update the internal MAC address table after receiving gratuitous arp packets during an LVS director failover event. I don't know if the heartbeat package uses arps to fail over, but keepalived does. Cisco routers seem to need icmp packets before they'll update the MAC address table. For LVS, the problem here is that the router continues to send traffic to the VIP at the master's hw address instead of shifting to the backup's hw address.

However, this wouldn't explain why your realservers route to the wrong address. The realservers and the LVS directors are on the same network segment, right?

The problem isn't with the layer-2 switches, it's with the next-hop router (the external default gateway for the LVS directors). It's common behavior with Cisco routers to update their arp cache table in response to source-generated packets but not in response to gratuitous arp packets.

Peter Mueller

I've seen a similar delay in failover when using cisco routers.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 24 Apr 2004

Me too. ISPs often configure managed routers to not respond to arp requests. You tend to have to ask them to flush the routing table if you change any of your router-facing IPs. I'm sure the routers can be configured to respond to ARPs.

Horms 07 May 2004

Sounds like there could be a problem with your gratuitous arps that are supposed to effect failover. I have used catalyst switches quite a lot; in fact both my test rack and the main switch for the network here at VA Japan used catalyst switches. I have found that they are quite aggressive about caching ARP information, and in some cases seem to effect proxy arp. But the current send_arp code in heartbeat seems to work just fine. Actually, I sometimes run that command manually after rearranging IP addresses on machines.

44.3. Weird Hardware II: autonegotiation failure on cisco CSS 11050

Ed Fisher efisher (at) mrskin (dot) com 08 Feb 2005

We're trying to setup LVS to serve as a drop-in replacement for a pair of Cisco CSS 11050s. We aren't doing any fancy layer 7 stuff on the CSS, like passing certain directories to other servers, or anything like that. I got it all setup, working, and I was able to drop it in for the CSSes pretty smoothly. Our traffic spikes on the CSS reach 90mbit/s. Not huge by a lot of standards, but still sizable. The CSS was pushing out about 50mbit/s when we cut over to the LVS-NAT box, and traffic immediately dropped to about 20mbit/s, never breaking 30. A test download from a box on another network, with a 100mbit connection to the Internet, was able to download a single file at well over 40mbit/s through the CSS. Through the LVS, it peaked at 1Mbit/s at the beginning and then quickly fell to about 300kbit/s after a few seconds, and stayed there. The hardware for the LVS machine: P4 2.26ghz, 2GB of memory. Two e1000 NICs, but both are hooked up to 100mbit switches, since we haven't done our gigabit upgrade yet.

The problem turned out to be that I was plugging into an extreme 24e2 switch, which was uplinked to an extreme 1i router. It was the connection between the 24e2 and the 1i that was bad. The 1i was set to force full duplex, the switch was set to autonegotiate, and so was, for annoying reasons, defaulting to half duplex. I plugged the LVS machine directly into the 1i, set the port to auto-negotiate on the 1i and the LVS machine, and linked up at 1000baseT FD and performance increased dramatically. Testing without the css in the mix showed I was indeed able to saturate the link.
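
A duplex mismatch like this one can be spotted from the Linux side by reading what the NIC actually negotiated. The following Python sketch reads the speed and duplex files that Linux exposes under /sys/class/net/<interface>/. The function names are made up for illustration, and the half-duplex check is only a heuristic: it catches the usual symptom of one end being forced to full duplex while the other autonegotiates.

```python
from pathlib import Path


def link_settings(ifname: str, sysfs_root: str = "/sys/class/net") -> dict:
    """Read negotiated speed (Mbit/s) and duplex for one interface."""
    base = Path(sysfs_root) / ifname
    return {
        "speed": int((base / "speed").read_text().strip()),
        "duplex": (base / "duplex").read_text().strip(),
    }


def looks_like_duplex_mismatch(settings: dict) -> bool:
    """Half duplex on a switched link usually means negotiation went wrong."""
    return settings["duplex"] == "half"
```

Reading these files can fail while the link is down, so a real check would need to handle OSError. The sysfs_root parameter is only there so the function can be pointed at a test directory.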

44.4. Weird Hardware III: Watchguard firewall at client site

Jacob Coby jcoby (at) listingbook (dot) com 29 Jul 2005

I've got a client IP addr that, on occasion, takes up a mass of connections and leaves them in an ESTABLISHED state. The IP addr is of a business that uses our website, but it's causing a DOS of sorts.

Software:

ipvsadm v1.21 2002/07/09 (compiled with popt and IPVS v1.0.4)
kernel-2.4.20-28.7.um.3
redhat 7.3

Symptoms:

  • Connections jump from 50-100 up to 300-600.
  • A single IP address takes up 80-90% of those connections.
  • All of the connections from that ip address are in the ESTABLISHED state.
  • Very few of them are actually sending/receiving data (when using tcpdump -xX -s 1024 "host bad.ip.addr"). I see a few packets with the F and S flags set.

I have disabled Keepalive on the real servers (Keepalive on allows them to close connection by themselves). It's too expensive to keep enabled for our site. Any ideas? snort logs don't show anything malicious from the ip. Because these are all ESTABLISHED connections to our website, they're taking up an apache process, and eventually locking everyone else out.

Graeme Fowler

Since it's an application-level problem your LVS is doing exactly what it should :)

Jacob

Yeah, that's what I was thinking. I didn't know if LVS was accidentally not FIN'ing connections or whatever.

Graeme

If you switch on the extended status and server-status handlers in Apache, you can check what Apache thinks is happening, at the very least. If it's always the same source IP, I'd consider tracking down what the machine is and seeing if (for example) it's a broken proxy, or whether you can actually route back to it - if the latter then it could be a simple network problem (or even a complex one!).

Jacob

I'd forgotten about the server-status modules.. I've actually got them enabled. As for trying to route back to the ip addr, good idea. I'll try that next time the issue happens. I know the ip addr is a NAT firewall, so it could be the firewall, it could be an office computer, or it could be a personal notebook causing the problem. They've scanned all of their computers for viruses. My only thought at this point is some spyware is screwing up the tcp stack and isn't closing connections.

later...

As an update, we traced this back to a problem with the firewall the company is using. Both firewalls are made by Watchguard (http://www.watchguard.com). They are different models - one is a Firebox and I'm not sure what the other is. The Firebox apparently can operate in two NAT modes: direct (?) and proxy. In direct mode, it is leaving hundreds of connections in the EST state. In proxy mode, it works correctly. The other firewall can only act in direct mode, so it is still causing a problem.

44.5. Weird Hardware IV: wrong device gets MAC address

Note
An HP ProCurve switch is used in this installation, but Horms doesn't think it's the problem.

Troy Hakala troy (at) recipezaar (dot) com

In an LVS-NAT setup, on a rare occasion, the ARP cache of one of the realservers gets the wrong MAC address for the director, I assume after a re-ARP. It gets the MAC of eth0 instead of eth1. It's easy to fix with an arp -s, but I'd like to understand why this happens.

Horms 18 Nov 2005

Taking a wild guess, I would say that eth1 is handling a connection that has the VIP as the local address. And during the course of that connection, the local arp cache expires. The director sends an arp-request to refresh its cache. However, the source address of the ARP requests is the VIP, as it is a connection to the VIP that caused the ARP requests. ARP requests actually act as ARP announcements. And thus the MAC of eth0 is advertised as the MAC of the VIP.

Just a guess. If it's correct, then this is exactly the problem that the arp_announce proc entry is designed to address. Or alternatively you can use arptables.

This is the second half of the ARP problem that has to be solved when doing LVS-DR, and I have some limited explanation of it at http://www.ultramonkey.org/3/topologies/hc-ha-lb-eg.html#real-servers Just ignore the bits that aren't about either arp_announce or /sbin/arptables -A OUT -j mangle...

It could also be that for some completely perverse reason eth1 is receiving an ARP request for the VIP. In that case you are really in the same boat as using LVS-DR.
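
The arp_announce entry Horms mentions lives under /proc/sys/net/ipv4/conf/<interface>/arp_announce and accepts the values 0, 1 and 2. Here is a minimal Python sketch of setting it. The helper name is hypothetical, and whether level 2 (always use the best local address for the target as the ARP source) is the right choice for Troy's director is an assumption, not something confirmed in the thread.

```python
from pathlib import Path


def set_arp_announce(ifname: str, level: int = 2,
                     proc_root: str = "/proc/sys/net/ipv4/conf") -> Path:
    """Write the arp_announce level for one interface (needs root on /proc)."""
    if level not in (0, 1, 2):
        raise ValueError("arp_announce accepts only 0, 1 or 2")
    target = Path(proc_root) / ifname / "arp_announce"
    target.write_text(f"{level}\n")
    return target
```

Calling set_arp_announce("all") changes the setting for every interface, since the kernel uses the higher of the "all" and per-interface values. The change does not survive a reboot unless it is also put in the system's sysctl configuration.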

44.6. Weird Hardware V: SonicWALL firewall rewriting sequence numbers

G. Allen Morris III gam3-lvs-users (at) gam3 (dot) net 6 Mar 2006

It seems that the firewall changes the sequence number of packets coming in (changing them again on the way out). This of course breaks LVS-Tun, as the sequence number does not get restored when the reply leaves the network wrapped in the IPIP packet.

I can not find any SonicWALL documentation and would like to know if anyone here knows if there is a fix for this.

44.7. Weird Hardware VI: cisco 2924XL switch

Tony Spencer tony (at) games-master (dot) co (dot) uk 10 Mar 2006

We have a couple LVS's running on Centos 4.2 they are working fine and failover as they should, the backup server takes the VIP when the primary server is taken offline. However I've noticed an issue when the failover occurs.

From inside our network we can get to web sites and radius server using the VIP when the failover occurs. But from outside our network we can't, the connection fails. Both servers are plugged into a Cisco 2924XL switch which sees the IP move from one port to another when it fails over. Into the same switch is our upstream link also.

Could it be an arp issue with the upstream because the MAC address of the VIP has changed? I thought this at first, but even when I brought the primary server back up so the VIP was on the same MAC address it still wouldn't work.

44.8. Weird Hardware VII: unknown switches don't defragment

Andreas Lundqvist lvs (at) rsv (dot) se 1 Nov 2006

I have an LVS running on SUSE SLES9 with three Linux realservers and direct routing. This is part of our intranet, which spans the nation, and when we launched this Monday I had not run into any problems whatsoever. This isn't the case anymore. My problem is that clients at two sites in different cities are getting their connections stuck in the FIN_WAIT state, and these will not time out, so now I have 15000 FIN_WAITs per realserver and rising each day. Other sites, including the site I'm in, do not have this problem; the FIN_WAITs from my PC time out just fine.

I'm told by our network guys that just these two sites are indeed running different network hardware than our other sites (cisco).

Later - We seem to have found a workaround. The problem is that our WAN only allows packets with a maximum size of 1518, and with encryption headers we exceed that. Our other sites fix this in their switches before sending traffic out on our WAN, but this is apparently not supported in the switches at my two problem sites. So the fix was to lower the MTU on my three realservers to 1400. I had to reboot my LVS server to free the hanging FIN_WAITs - I tried to just unload the modules, but the rmmod command just hung.
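
The arithmetic behind that workaround can be sketched as follows. The 1518 limit matches a standard maximum Ethernet frame (1500 bytes of payload plus 14 bytes of header and 4 bytes of frame check sequence). The post does not say how large the encryption headers are, so the overhead is left as a parameter, and the sketch assumes it is added to every packet inside the Ethernet frame.

```python
ETHERNET_HEADER = 14  # destination MAC + source MAC + EtherType
ETHERNET_FCS = 4      # frame check sequence


def max_mtu(frame_limit: int, extra_overhead: int) -> int:
    """Largest MTU whose frames still fit once extra headers are added."""
    return frame_limit - ETHERNET_HEADER - ETHERNET_FCS - extra_overhead


def headroom(frame_limit: int, mtu: int) -> int:
    """Bytes left for extra headers (e.g. encryption) at a given MTU."""
    return frame_limit - ETHERNET_HEADER - ETHERNET_FCS - mtu


print(max_mtu(1518, 0))      # 1500: no room at all for added headers
print(headroom(1518, 1400))  # 100: bytes available at the MTU chosen above
```

In other words, an MTU of 1400 leaves 100 bytes for whatever the encryption layer adds, under the assumptions stated above.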

44.9. Weird Hardware VIII: bad routers/routing tables at ISP

Matthew matthew (at) matthewboehm (dot) com, 20 Feb 2007

We've been running a 3 machine setup (1 dir, 2 rs) in TUN mode for about the past 2 months. Recently we've had this affliction where if you go to the VIP, everything is super slow. But if you go to RS1 or RS2 everything is blazing fast (about a 10x difference in bandwidth). This problem started this morning for the 2nd time. It happened about 2 weeks ago but we did nothing to the setup and the problem seemed to fix itself. But now that it's happened again we need some answers. I'd also add that my connection from home doesn't have this problem. From home, the VIP, RS1 and RS2 are all blazing fast. But we just had a customer call from Chicago who was getting slow speeds. Here in our office it's slow as well to the VIP but not to RS1 or RS2.

Matthew matthew (at) matthewboehm (dot) com 22 Feb 2007

Well isn't this fan-damn-tastic. Turns out it wasn't our problem at all. We noticed that things started picking up speed around 3. We all assumed it was because I moved the VIP to point to RS1, thus bypassing the LVS. From our hosting provider, thePlanet.com:

After careful analysis of outbound throughput issues with various providers The Planet engineers determined that a gradual route table exhaustion was at the core of an increasing number of issues. It was determined that to repair this issue an emergency reload of the router dsr01.dllstx3 was required. At approximately 2:42pm CST we reloaded this router. There was a brief period of packet loss as the redundant assumed the load and at 2:51pm CST as the router re-converged. By 2:53pm CST all traffic was forwarding normally.

Amazing.

44.10. Possible Weird Hardware (or driver) IX: Broadcom GigE card

Note
This is one of those frustrating sets of posts where the person disappears without letting us know what happened. We don't know if there really is a problem, or what it is. It might serve as an example of how to go about figuring out such problems.

Jesse Cantara jesse_cantara (at) esupport (dot) com 20 Jul 2007

I'm trying to figure out a problem I'm having with my LVS-NAT setup. It's a very simple setup, one director, two networks (director has two nics, one on lan one on internet), three webservers on LAN only on port 80. The issue I'm having is occasionally and randomly the director will apparently just sever the connection when trying to download a file from the webserver. I have performed these tests just fine without issue: 1) Downloading a file directly from the director to a client 2) Downloading a file from the webserver to the director So it would appear that the physical connection is OK, I can make connections to the individual machines without problem, just when connecting through the director to the webserver.

What happens is I will be downloading a file, and it will hang (at random points during the download, sometimes not at all), and not continue. ipvsadm will show "ESTABLISHED" on that connection for quite a long time, then "ERR!" after it times out I believe. Watching the traffic on a packet-sniffer client-side shows that directly before the failure, my client keeps sending the same "ack" message back to the server over and over, and the server appears to not recognize it.

later...

This might be a hardware/driver issue. I'm having basically the exact same problem when attempting to use IPTABLES on the director just as a simple NAT router to one webserver (trying to isolate the problem), and I still get the exact same behavior (connection closes randomly). So the problem doesn't appear to be limited to IPVS anyway. The machine I'm using for the director/router is a Dell 860 with a Broadcom NetXtreme BCM5721 with the "Tigon3" (tg3.ko) driver.

Here is the config of my machines: CentOS 5 latest kernel 2.6.18-8.1.8.el5 ipvsadm v1.24 IPVS v1.2.0

later...

It is definitely a driver issue. The tg3 broadcom kernel module doesn't seem to work properly at gigabit speeds. 100 mbit works fine (which is OK in this situation).

Tobias Klausmann klausman (at) schwarzvogel (dot) de 23 Jul 2007

Then I'd look into the matter more closely. We use about 1k machines with tg3 chips, some 500 of them in farms using GE links, and have had no hiccups whatsoever (beyond the usual amount of simply broken hardware which was replaced). Certainly no systematic error. Don't forget that your switching infrastructure might be at fault, too.

Jesse

The device on my machine is the Broadcom BCM5721, and the reason why I decided that the driver was at fault is because I found somebody else online with the same problem and that particular model of Broadcom NIC.

Tobias

My fault, I neglected to take into consideration that the tg3 driver supports more than one particular chip(set). Our machines have BCM5703X GE chips. We also have machines that have BCM5708 chips, but those are served by another driver (bnx2).

The question is on which layer the error originates (as opposed to where you *see* it). This can be as simple as using wireshark to see the connection break down (can be a hassle if it takes long to trigger) or be as full-fledged as bringing in a hardware network analyzer. The order should be obvious :)

Jesse

I have replaced some of the switching hardware, but not all of it. The reason why I don't think it is the switch or cables is because direct system-to-system communication works just fine, it's only when I'm doing any sort of packet-forwarding (with lvs-nat or just simple iptables port-forwarding).

Tobias

That surely points to software rather than hardware. Do you have any funky iptable setups that might interfere? Also you might want to try to use add-in GE boards but keep everything else the same. Intel's EEPro1000 might be worth a try - it uses an entirely different driver, yet it's readily available. That way, you could rule out both the NIC hw and the driver.

44.11. slow nics

With everyone expecting 1Gbps throughput, you need to choose your ethernet cards carefully. Mikio was using 2.6.18, which has proven to be a buggy kernel. Someone recommended that he get a better kernel.

Mikio Kishi mkishi (at) 104 (dot) net 24 Nov 2008

With new connections/sec >15,000, %CPU ksoftirqd went to 100% eth0: (PCI Express:2.5GB/s:Width x1), eth0: Intel(R) PRO/1000 Network Connection.

Graeme

There's only so many interrupts you can generate from one NIC before the CPU wedges - and in most cases every packet generates an interrupt. If you're using plain-old-PCI, you get less throughput than with more advanced PCI variants.
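
To see how many interrupts a NIC is generating, and on which CPUs they land, you can look at /proc/interrupts. Here is a minimal Python sketch that totals the per-CPU counters for one device. The SAMPLE text is invented for illustration (it is not from Mikio's machine), and the helper name is hypothetical.

```python
SAMPLE = """\
           CPU0       CPU1
  0:        125          0   IO-APIC-edge      timer
 16:    9000000        500   IO-APIC-fasteoi   eth0
 17:       1200       3400   IO-APIC-fasteoi   eth1
NMI:          0          0   Non-maskable interrupts
"""


def irq_count(interrupts_text: str, device: str) -> int:
    """Sum the per-CPU interrupt counters of every IRQ line for `device`."""
    lines = interrupts_text.splitlines()
    ncpu = len(lines[0].split())  # header row lists one column per CPU
    total = 0
    for line in lines[1:]:
        fields = line.split()
        if not fields or not fields[0].endswith(":"):
            continue
        # Everything after the counters is the controller and device names;
        # also accept per-queue names such as eth0-rx-0
        names = [n.rstrip(",") for n in fields[1 + ncpu:]]
        if any(n == device or n.startswith(device + "-") for n in names):
            total += sum(int(x) for x in fields[1:1 + ncpu])
    return total


print(irq_count(SAMPLE, "eth0"))
```

Reading the real file twice, a second apart, and subtracting the two totals gives an interrupts-per-second figure to compare against the packet rate.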

44.12. PCI-X nics

Joe: commodity PCI-X has not worked well, due to clock skews on the 2X-wide bus (you can't get 128 lines to go up and down together). IBM got it to work in their high end hardware because they made the mobo and the NIC and could guarantee that they would work together on their implementation of the PCI-X bus. The chances of a commodity PCI-X mobo working with someone else's PCI-X NIC aren't real good and PCI-X has been abandoned for PCI-e.

Sashi Kant sashi (dot) kant (at) eng (dot) admob (dot) com 24 Oct 2008 was having RX Packet drops on high traffic LVS in DR setup.

Here's his cure:

  • Intel PCIe gigabit dual port card with RX buffer tuned at 1024 using ethtool
  • No need for port channel as switch ports are Gig capable and bandwidth is not an issue.
  • Bonding will be used in active-backup mode as it is not negatively impacting the server.
  • No more packet drops at 60k pps with above changes.
  • Getting rid of Broadcom PCI-X cards from HA servers.

44.13. Microsoft http clients and servers violate the RFC for TCP/IP

In case you're getting funny results

44.14. MSIE SSL bugs

44.14.1. MSIE SSL bug 1

Benoit Gauthier gauthier (at) circum (dot) com 22 May 2008

We are experiencing a bizarre problem involving LVS, SSL and Internet Explorer.

We have an LVS cluster comprised of one load balancer connecting to six real servers. Each real server serves Apache requests to a .cgi script located on a backend server that is accessed by each node using NFS. This setup has worked well so far, both using regular requests and SSL requests.

A problem appeared last week soon after we migrated the cluster from using public IP addresses for real servers to using local (192.168) IP addresses (i.e., we went from DR to NAT). Non-SSL requests are handled fine. SSL requests made directly to the backend server (i.e., avoiding the LVS cluster) are handled fine. SSL request handled by the LVS cluster are fine if they are issued by any browser other than Internet Explorer 5 (and Internet Explorer 6 in the case of one tester). But Internet Explorer 5 SSL requests to the load balancer are sometimes handled correctly, but, other times, they either return the same page instead of the requested page or they return a browser error page (browser cannot reach the page).

To add to the difficulty, it appears that these errors are not actually produced by the Apache server: one of the testers has a slow Internet connection and can clearly feel a lag when the next page is being properly processed; when an error page is returned, it comes up immediately, suggesting that no Internet communication took place.

We can replicate the problem fairly easily. The problem does NOT recur if we limit the <img> tags on the page to a single image. If we add a second <img> tag (coded as <img src=dir/image.gif>), we can produce the problem. The images are all static (not dynamically generated).

We added a "persistent=300" line in ldirector.conf. To no avail.

Graeme

Hrm. Has your Apache SSL config got the following lines?

SetEnvIf User-Agent ".*MSIE.*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

They can be set in the vhost context, or globally. See ssl_faq.html (http://www.modssl.org/docs/2.8/ssl_faq.html#ToC49) for details.

Also, have you got keepalives switched on in your Apache config?

Benoit

The following instruction was in the general definition of the ssl portion:

BrowserMatch ".*MSIE.*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

and no keepalive instruction elsewhere in the Apache config.

I inserted this right after the "Listen 443" in the Apache SSL configuration:

SetEnvIf User-Agent ".*MSIE.*" \
         nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

as opposed to the 'BrowserMatch ".*MSIE.*"' instruction that I already had. And this worked! After about 50 page tests, no errors were reported.

Thanks again. And let's all pray for the demise of Internet Explorer.

44.14.2. MSIE SSL bug 2

Daniel Kerrutt d (dot) kerrutt (at) googlemail (dot) com 25 Jun 2008

I posted this subject to microsoft.public.internetexplorer.general, but wanted to ask for help here, too.

I have a problem while connecting to a loadbalanced website via SSL. Sometimes IE displays a "Page cannot be displayed" error message. Sometimes the error happens instantly, sometimes after entering the URL 3-4 times.

Server software is Apache 2.2 / mod_ssl. Keep-Alive is forced off for SSL connections with the following configuration statement:

SetEnvIf User-Agent ".*MSIE.*" nokeepalive ssl-unclean-shutdown \
         downgrade-1.0 force-response-1.0

Joe

Graeme replied to a similar problem on 22 May (IE and apache/ssl). Go look in the archives. The source of the problem is described in the SSL-FAQ (http://www.modssl.org/docs/2.8/ssl_faq.html#ToC49). I don't know why this is a problem when connecting through a director but not when connecting directly, so I may not understand the problem.

Daniel

Ok, finally it's fixed. It turned out that the SSL session cache was not activated across all realservers - so I did hit the IE bug.... *use head with wall*

Although not a problem of LVS itself, it might be an environment where users possibly meet this MSIE bug. Right now I cannot confirm that the error is happening _only_ through LVS, because there was an inconsistent configuration throughout the realservers with my setup, which I did not notice. This means that I maybe tested the direct connection on a proper configured realserver, but another case could perhaps confirm that (http://archive.linuxvirtualserver.org/html/lvs-users/2008-05/msg00062.html).

One part of the solution was turning off keep-alive for MSIE clients as posted before by Graeme (http://archive.linuxvirtualserver.org/html/lvs-users/2008-05/msg00074.html), which is not configured by default with a standard Debian Apache setup (AFAIR).

But in my case, the actual error was something else: Make sure that the SSLSessionCache is activated, like described at the bottom of the subject here: http://www.modssl.org/docs/2.8/ssl_faq.html#ToC49

SSLSessionCache        shmcb:/var/run/apache2/ssl_scache(512000)
SSLSessionCacheTimeout  300

The SSLSessionCache directive should be activated by default with a standard Apache setup. But in my case it happened that it was not activated across all the realservers.
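
Since the root cause here was a configuration that differed between realservers, a quick consistency check may help. The Python sketch below scans Apache configuration text for an active (uncommented) SSLSessionCache directive and lists the servers that lack one. The function names and the idea of passing in each server's configuration as a string are assumptions for illustration; fetching the files from the realservers is left out.

```python
import re

# Matches "SSLSessionCache <value>" at the start of a line, but not
# "SSLSessionCacheTimeout" and not lines commented out with '#'
_DIRECTIVE = re.compile(r"^[ \t]*SSLSessionCache[ \t]+(\S+)",
                        re.IGNORECASE | re.MULTILINE)


def session_cache_setting(conf_text: str):
    """Return the last active SSLSessionCache value, or None if unset."""
    matches = _DIRECTIVE.findall(conf_text)
    return matches[-1] if matches else None


def servers_without_cache(configs: dict) -> list:
    """Names of servers whose config has no cache or sets it to 'none'."""
    missing = []
    for name, text in sorted(configs.items()):
        value = session_cache_setting(text)
        if value is None or value.lower() == "none":
            missing.append(name)
    return missing
```

For example, servers_without_cache({"rs1": text1, "rs2": text2}) returns the names of the realservers that still need the directive. Include files and per-vhost overrides are not followed, so treat an empty result as a hint rather than proof.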

All solutions are described in the SSL-FAQ (http://www.modssl.org/docs/2.8/ssl_faq.html#ToC49), so you could refer to the SSL-FAQ somewhere in the HOWTO, because users might meet these problems within an LVS environment.