20. LVS: Services: single-port

20.1. ftp, tcp 21

This is a multi-port service and is covered in the ftp section of multi-port services.

20.2. ssh, sftp, scp, tcp 22

Surprisingly (considering that it negotiates a secure connection) ssh needs nothing special either. You do not need a persistent port/client connection for it: sshd is a standard one-port tcp service.

The director will timeout an idle tcp connection (e.g. ssh, telnet) in 15 mins, independently of any settings on the client or server. You will want to change the tcpip idle timeouts.

ssh also has its own timeouts; you will get "Connection reset by peer" independently of LVS.

linuxxpert linuxxpert (at) gmail (dot) com 19 Dec 2008

  • Make sure you have "TCPKeepAlive yes" in your sshd_config file.
  • If TCPKeepAlive is already yes, then add "ClientAliveInterval" in your sshd_config.

man sshd_config: ClientAliveInterval

Sets a timeout interval in seconds after which if no data has been received from the client, sshd will send a message through the encrypted channel to request a response from the client. The default is 0, indicating that these messages will not be sent to the client. This option applies to protocol version 2 only.
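
For example, a minimal sshd_config fragment for the realservers (a sketch; the interval and count values are arbitrary, and sshd must be restarted or reloaded for them to take effect):

TCPKeepAlive yes
# send an encrypted keepalive probe after 60s of client silence (protocol 2 only)
ClientAliveInterval 60
# give up the session after 3 unanswered probes
ClientAliveCountMax 3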

Note
sftp and scp are also single port services on port 22.

Note

The current (Jul 2001) implementations of ssh (openssh-2.x/openssl-0.9.x) all use ssh protocol 2. Almost all the webpages/HOWTOs available are about sshd protocol 1. None of this protocol 1 information is of any use for protocol 2 - everything from setting up the keys onward is different. Erase everything from your protocol 1 installations (the old ssh binaries, /etc/ssh*), do not try to be backwards compatible, avoid the old HOWTOs and go buy

SSH The Secure Shell
Daniel J. Barrett and Richard E. Silverman
O'Reilly 2001
ISBN: 0-596-00011-1

(sometime in 2002: looks like new information is arriving on webpages.)

anonymous

How can I configure rsh/ssh to enable copying of files between 2 machines without human intervention?

Brent Cook busterb (at) mail (dot) utexas (dot) edu 09 Jul 2002

Here are some articles on OpenSSH key management.

http://www-106.ibm.com/developerworks/library/l-keyc.html
http://www-106.ibm.com/developerworks/library/l-keyc2/
http://www-106.ibm.com/developerworks/linux/library/l-keyc3/
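
The basic recipe, in outline (a sketch, not taken from the articles; recent openssh shown - older releases used ~/.ssh/authorized_keys2 for protocol 2 keys):

# on the machine that initiates the copy, generate a key with an empty passphrase
ssh-keygen -t rsa -N "" -f ~/.ssh/id_rsa
# append the public key to the target account on the other machine
cat ~/.ssh/id_rsa.pub | ssh root@realserver 'cat >> ~/.ssh/authorized_keys'
# scp/ssh now work without a password
scp /etc/motd root@realserver:/tmp/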

My realservers aren't connected to the outside world so I've set up ssh to allow root to log in with no passwd. If you compile ssh with the default --prefix, which installs it in /usr/local/bin, you will have to change the default $PATH for sshd too (I suggest using --prefix=/usr so you don't have to bother with this).

./configure --with-none --with-default-path=/bin:/usr/bin:/usr/local/bin

Here's my sshd_config - the docs on passwordless root logins were not helpful.

# This is ssh server systemwide sshd_config

Port 22
#Protocol 2,1
ListenAddress 0.0.0.0
#ListenAddress ::
HostKey /usr/local/etc/ssh_host_key
ServerKeyBits 768
LoginGraceTime 600
KeyRegenerationInterval 3600
PermitRootLogin yes
#PermitRootLogin without-password
#
# Don't read ~/.rhosts and ~/.shosts files
IgnoreRhosts yes
# Uncomment if you don't trust ~/.ssh/known_hosts for RhostsRSAAuthentication
#IgnoreUserKnownHosts yes
StrictModes yes
X11Forwarding no
X11DisplayOffset 10
PrintMotd yes
KeepAlive yes

# Logging
SyslogFacility AUTH
LogLevel INFO
#obsoletes QuietMode and FascistLogging

RhostsAuthentication no
#
# For this to work you will also need host keys in /usr/local/etc/ssh_known_hosts
#RhostsRSAAuthentication no
RhostsRSAAuthentication yes
#
RSAAuthentication yes

# To disable tunneled clear text passwords, change to no here!
PasswordAuthentication yes
#PermitEmptyPasswords no
PermitEmptyPasswords yes
# Uncomment to disable S/key passwords
#SkeyAuthentication no
#KbdInteractiveAuthentication yes

# To change Kerberos options
#KerberosAuthentication no
#KerberosOrLocalPasswd yes
#AFSTokenPassing no
#KerberosTicketCleanup no

# Kerberos TGT Passing does only work with the AFS kaserver
#KerberosTgtPassing yes

CheckMail no
#UseLogin no

# Uncomment if you want to enable sftp
#Subsystem	sftp	/usr/local/libexec/sftp-server
#MaxStartups 10:30:60

20.2.1. keys for realservers running sshd

If you install sshd and generate the host keys for the realservers using the default settings, you'll get a working LVS'ed sshd. However you should be aware of what you've done. The default sshd listens on 0.0.0.0 and you will have generated host keys for a machine whose name corresponds to the RIP (and not the VIP). Since the client will be shown a prompt with the name of the realserver (rather than the name associated with the VIP), this works just fine. However the client will get a different realserver on each connection (which is OK too) and will accumulate keys for each realserver. If instead you want the client to be presented with one virtual machine, each realserver's hostname must be the name associated with the VIP, sshd will have to listen on the VIP (for LVS-DR, LVS-Tun) and the host keys will have to be generated for the name of the VIP.
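
A sketch of the single-virtual-machine setup for LVS-DR/LVS-Tun (the name lvs.domain.com, the VIP 192.168.1.110 and the key path are placeholders):

# 1. give each realserver the hostname associated with the VIP
hostname lvs.domain.com
# 2. generate ONE set of host keys and copy the same files to every realserver,
#    so that clients always see the same key for the one name
ssh-keygen -t rsa -N "" -f /etc/ssh/ssh_host_rsa_key
scp /etc/ssh/ssh_host_rsa_key* realserver2:/etc/ssh/
# 3. in sshd_config, listen on the VIP rather than 0.0.0.0
#    ListenAddress 192.168.1.110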

20.2.2. ssh zombie processes

The user must exit cleanly from their ssh session, or the realserver will be left running the ssh-invoked process at high load average. The problem is that what the user thinks is a clean exit and what sshd thinks is a clean exit may be different things. (There is a similar problem on a webserver running a process invoked by a cgi script, when the client disconnects by clicking to another page or hitting "stop".)

Shivaji Navale

On our realservers, after the users have logged out of their ssh session, zombie processes run at high load average in the background. We resort to killing the zombie processes. The ssh connection to the director doesn't get closed even after ctrl-D. From netstat, the connection is still active.

Dave Jagoda dj (at) opsware (dot) com 23 Nov 2003

Does this sound like what's going on? http://www.openssh.org/faq.html#3.10, ssh hangs on exit.

20.2.3. persistence with sshd

You do not need persistence with ssh, but you can use it.

Piero Calucci calucci (at) sissa (dot) it 30 Mar 2004

I use ipvs to load balance ssh and I do use persistence (and my users are happy with it -- in fact they asked me to do so). With this setup when they open multiple sessions they are guaranteed to reach always the same realserver, so they can see all their processes and local /tmp files.

If you use persistence, be aware of the effects.

jeremy (at) xxedgexx (dot) com

I'm using ipvs to balance ssh connections but for some reason ipvs is only using one realserver and keeps using that server until I delete its arp entry from the ipvs machine and remove the virtual loopback on the realserver. Also, I noticed that connections do not register with this behavior.

Wensong

Do you use a persistent port for your VIP:22? If so, the default timeout of a persistent port is 360 seconds: once the ssh session finishes, it takes 360 seconds for the persistent session to expire. (In ipvs-0.9.1, you can flexibly set the timeout for the persistent port.) There is no need to use a persistent port for the ssh service, because the RSA keys are exchanged in each ssh session, and the sessions are not related to each other.
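
If you do want persistence anyway (as Piero does above), here is a sketch of a persistent ssh virtual service with an explicit persistence timeout (the VIP/RIPs are placeholders and LVS-DR forwarding is assumed):

# persistent ssh service, 600s persistence timeout instead of the 360s default
ipvsadm -A -t 192.168.1.110:22 -s wlc -p 600
ipvsadm -a -t 192.168.1.110:22 -r 192.168.1.11:22 -g -w 1
ipvsadm -a -t 192.168.1.110:22 -r 192.168.1.12:22 -g -w 1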

20.3. telnet, tcp 23

A simple one-port service. Use telnet (or phatcat) for initial testing of your LVS. It is a simpler client/service than http (it is not persistent) and a connection shows up as an ActiveConn in the ipvsadm output. You can fabricate a fake service on the realservers with LVS and inetd.

(Also note the director timeout problem, explained in the ssh section).
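
A minimal test from the client, with a check on the director while the connection is open (the VIP is a placeholder):

client$ telnet 192.168.1.110 23
director# ipvsadm -L -n      # the telnet session shows up under ActiveConn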

20.4. smtp, tcp 25; pop3, tcp 110; imap tcp/udp 143 (imap2), 220 (imap3). Also sendmail, qmail, postfix, and mailfarms.

(for non LVS solutions to high throughput, high availability mailers, see tutorial by Damir Horvat).

20.4.1. The many reader, single writer problem with smtp

For mail which is being passed through, LVS is a good solution.

If the realserver is the final delivery target, then the mail will arrive randomly at any one of the realservers and write to the different filesystems. This is the many reader/many writer problem that LVS has. Since you probably want your mail to arrive at one place only, the only way of handling this right now is to have the /home directory nfs mounted on all the realservers from a backend fileserver which is not part of the LVS. (an nfs.monitor is in the works.) Each realserver will have to be configured to accept mail for the virtual server DNS name (say lvs.domain.com).
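
A sketch of the corresponding fstab entry on each realserver (the fileserver name and mount options are assumptions):

# /etc/fstab: /home served by the backend fileserver
fileserver:/home    /home    nfs    rw,hard,intr    0 0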

Rio rio (at) forestoflives (dot) com 21 Jun 2007

We're using the proprietary mail server software (surgemail) which has built-in 2-way mirroring that updates each other within milliseconds of a change.

(Joe: can't imagine it would be that hard to add that feature to a GPL'ed smtpd.)

20.4.2. mailbox formats: mbox, maildir

Joe - if mail arrives on different realservers, then people have tried merging/synching the files on the different machines. Only one of the file formats, maildir, is worth attempting. Even that hasn't worked out too well and it seems that most successful LVS mail farms are using centralised file servers.

Todd Lyons tlyons (at) ivenue (dot) com 13 Jan 2006

  • mbox: one file for all messages (like the /var/spool/mail/$USER mailbox)
  • maildir: one file per message

I don't know that I'd feel all that comfortable about syncing either one: mbox - This could end up with a corrupted mbox file since it's all one big long file. maildir - Guaranteed that you won't end up with any filename conflicts since the hostname is used in the filename, but there are some shared files, such as the quota or subscribed files. Those files could easily get out of sync.

Graeme Fowler graeme (at) graemef (dot) net 13 Jan 2006

Indeed they can (says someone long since burnt, not exactly by rsync, but by some other methods of attempting to sync mailboxes...).

I now use NetApp filers on which my maildirs live, and I have multiple IMAP/POP (and Horde/IMP webmail) servers living behind a pair of LVS directors. Using maildir format [0], I export the mailboxes via NFS to the frontend servers and it works a treat - generally because maildir saves files using filenames derived from the hostname, so the frontend servers don't hit race conditions when trying to manipulate files.

If you need really (and I mean _really_) high NFS transaction rates and fault-tolerance, I can't recommend the NetApp kit highly enough. If you don't, then other options - cheaper ones! - are (for example) high-end Intel-based IBM, Dell, HP/Compaq etc. servers with large hardware RAID arrays for redundancy. You'd have to address the various benchmarks according to your means, but there's a price point for every pocket somewhere.

[0] for the inevitable extra question: Exim v4.x for the (E)SMTP server, Courier IMAP server for IMAP and POP; all backended with MySQL for mailbox/address lookups and POP/IMAP/SMTP authentication.

20.4.3. Maintaining user passwds on the realservers

Gabriel Neagoe Gabriel (dot) Neagoe (at) snt (dot) ro

for syncing the passwords - IF THE ENVIRONMENT IS SAFE- you could use NIS or rdist

20.4.4. identd (auth) problems with MTAs

You will not be explicitly configuring identd in an LVS. However authd/identd is used by sendmail and tcpwrappers and will cause problems. Services running on realservers can't use identd when running on an LVS (see identd and sendmail). Running identd as an LVS service doesn't fix this.

sendmail

in sendmail.cf set the value

Timeout.ident=0
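
If you build sendmail.cf from an mc file, the equivalent macro is (to the best of my knowledge; check the defaults for your sendmail version):

dnl turns into Timeout.ident=0 in the generated sendmail.cf
define(`confTO_IDENT', `0')dnl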

Also see Why do connections to the smtp port take such a long time?

qmail

Martin Lichtin lichtin (at) bivio (dot) com

  • If invoked with tcp-env in inetd.conf - use the -R option
  • If spawned using svc and DJ's daemontools packages -
    #/usr/local/bin/tcpserver -u 1002 -g 1001 -c 500 -v 0 smtp /var/qmail/bin/qmail-smtpd
    

    tcpserver is the recommended method of running qmail, where you use the -R option for tcpserver

    -R: Do not attempt to obtain $TCPREMOTEINFO from the remote host. To avoid loops, you must use this option for servers on TCP ports 53 and 113.

exim

Michael Stiller ms (at) 2scale (dot) net 26 Jun 2003

in exim.conf set the timeout for identd/auth to zero

rfc1413_query_timeout = 0s

Remember to send exim a HUP so it rereads its config.

20.4.5. postfix

Note: postfix is not name resolution friendly. This will not be a problem if smtp is an LVS'ed service, but will be a problem if you use it for local delivery too.

20.4.6. testing an LVS'ed sendmail

Here's an LVS-DR with sendmail listening on the VIP on the realserver. Notice that the reply from the MTA includes the RIP allowing you to identify the realserver.

Connect from the client to VIP:smtp

client:~# telnet lvs.cluster.org smtp
 trying 192.168.1.110...
 Connected to lvs.cluster.org
 Escape character is '^]'.
220 lvs.cluster.org ESMTP Sendmail 8.9.1a/8.9.0; Sat 6 Nov 1999 13:16:30 GMT
 HELO client.cluster.org
250 client.cluster.org Hello root@client.cluster.org [192.168.1.12], pleased to meet you
 quit
221 client.cluster.org closing connection

check that you can access each realserver in turn (here the realserver with RIP=192.168.1.12 was accessed).

20.4.7. pop3, tcp 110

pop3 - as for smtp. The mail agents must see the same /home file system, so /home should be mounted on all realservers from a single file server.

Using qmail -

Abe Schwartz sloween (at) hotmail (dot) com

Has anyone used LVS to balance POP3 traffic in conjunction with qmail?

Wayne wayne (at) compute-aid (dot) com 13 Feb 2002

We used LVS to balance POP3 and Qmail without any problem.

Mike McLean mikem (at) redhat (dot) com 21 Apr 2004

Generally RW activity like with POP does not work well with LVS. When such things are required, one reasonable solution is to have a three tier setup where the lvs director load balances across multiple servers (in your case pop/imap), which access their data via a highly available shared filesystem/database. Of course this requires more machines and more hardware.

Kelly Corbin kcorbin (at) theiqgroup (dot) com 23 Aug 2004

Pop-Before-SMTP and LVS

I just added pop and smtp services to my load balancers, and I wanted to know if there was a way to tie the two connections together somehow. I use pop-before-smtp to allow my users to send mail, but sometimes they reconnect to a different server for smtp than the one they pop-ed into. Most of the time it's OK because they have pop-ed to both of the servers in the cluster but every now and then I have a user with an issue.

Is there a way to have it keep track of the IP and always send those SMTP and POP connections to the same realserver?

Josh Marshall josh (at) worldhosting (dot) org

You don't need to have the smtp and pop connections on the same realserver. We've got three mail servers here with pop-before-smtp running... I just point the daemons at the same file on an nfs share and it all just works. The pop-before-smtp daemons can handle all writing to the same file just fine.

Joe

You can link the two services with fwmark (and possibly persistence, depending on the time between connections). LVS is not really designed for writing to the realservers: to do so, you have to serialise the writes and then propagate them to the other realservers. This is a bit of a kludge, but it is the way everyone handles writes to an LVS. The other point of an LVS is load balancing; what you're proposing turns off the load balancing and leaves the content on each of the realservers different. You also need to be able to fail out a realserver.

What you're proposing can be done, but it isn't an LVS anymore.
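
For completeness, a sketch of tying smtp and pop3 together with fwmark plus persistence (the VIP, RIPs, mark value and LVS-DR forwarding are all placeholders):

# give smtp and pop3 packets for the VIP the same firewall mark
iptables -t mangle -A PREROUTING -d 192.168.1.110 -p tcp --dport 25 -j MARK --set-mark 1
iptables -t mangle -A PREROUTING -d 192.168.1.110 -p tcp --dport 110 -j MARK --set-mark 1
# one persistent virtual service keyed on the mark, so a client's smtp and
# pop3 connections land on the same realserver
ipvsadm -A -f 1 -s rr -p 3600
ipvsadm -a -f 1 -r 192.168.1.11 -g -w 1
ipvsadm -a -f 1 -r 192.168.1.12 -g -w 1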

20.4.8. imap tcp/udp 143 (imap2), 220 (imap3)

Ramon Kagan rkagan (at) YorkU (dot) CA 20 Nov 2002

Previously we had been using DNS shuffle records to spread the load of IMAP connections across 3 machines. Having used LVS-DR for web traffic for a long time now (about 2.5 years) we decided to bring our IMAP service into the pool. The imap servers have been set up to retain idle connections for 24 hours (not my choice, but the current setup anyway). The lvs servers have been set up with a persistence of 8 hours. What has been happening lately is that users get the following type of error.

"The server has disconnected, may have gone down or may be a network problem, try connecting again"

So, users don't have to reauthenticate per se, they just need to reclick on whatever button, so that the client they are using reauthenticates them. The timeout for this is as low as ten minutes. However the imap daemon is kept alive, and the client is shipped the PID of the imapd to reconnect directly. This is supposed to circumvent the necessity of reauthentication. In the case where the connection is lost, the reauthentication actually starts a new imapd process.

Joe

there is an LVS timeout and another timeout (tcp-keepalive) associated with tcp connections which affects telnet, ssh sessions. You need to reset both (see timeouts).

The solution: On the linux imap servers the tcp-keepalive value in /proc/sys/net/ipv4/tcp_keepalive_time is set to 7200

centaur# pwd
/proc/sys/net/ipv4
centaur# cat tcp_keepalive_time
7200

This must be matched by the ipvsadm tcp timeout. So the options are to do one of:

  • use ipvsadm --set 7200 0 0 on the lvs server (raising the director's tcp timeout to match the 7200s keepalive on the realservers), or
  • echo "900" >! /proc/sys/net/ipv4/tcp_keepalive_time on the realservers (lowering their keepalive to the director's default 900s tcp timeout)
    

Ramon Kagan rkagan (at) YorkU (dot) CA 27 Nov 2002

Further update on the solution: although this works there is one catch. If a person logs out and comes back within the 7200s as stated in the previous email, they continue to get the message because the realserver and the director don't match again. We will be lowering the value to somewhere between 5 and 15 minutes (300-900) to address this type of usage.

Torsten Rosenberger rosenberger (at) taoweb (dot) at 16 Sep 2003

I want to build a mail cluster with cyrus-imapd but I don't know how to handle the mailbox database on LVS.

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 16 Sep 2003

We use HP ServiceGuard for clustering (12 Cyrus instances running on three physical servers, storage on HP VA7410 connected through SAN), Perdition as a proxy (connects the user to the correct instance, also acts as an SSL accelerator, runs on two physical servers) and LVS (keepalived, active-active on two physical servers).

The other route is to use Cyrus Murder; shared folders, e.g., will probably work better with Murder, and administration is easier too. We have to connect to the correct instance. At the time we chose our architecture, Murder seemed unfinished, and we were worried about a single point of failure. I'm afraid I haven't followed its development closely.

20.4.9. Thoughts about sendmail/pop

(another variation on the many reader/many writer problem)

loc (at) indochinanet (dot) com wrote:

I need this to convince my boss that LVS is the solution for very scalable and high available mail/POP server.

Rob Thomas rob (at) rpi (dot) net (dot) au

This is about the hardest clustering thing you'll ever do. Because of the constant read/write accesses you -will- have problems with locking and file corruption. The 'best' way to do this is (IMHO):

  1. NetCache Filer as the NFS disk server.
  2. Several SMTP clients using NFS v3 to the NFS server.
  3. Several POP/IMAP clients using NFS v3 to the NFS server.
  4. At least one dedicated machine for sending mail out (smarthost)
  5. LinuxDirector box in front of 2 and 3 firing requests off

Now, items 1, 2 -and- 3 can be replaced by Linux boxes, but NFS v3 is still in alpha on Linux. I -believe- that NetBSD (FreeBSD? One of them) has a fully functional NFS v3 implementation, so you can use that.

The reason why I emphasize NFSv3 is that it -finally- has 'real' locking support. You -must- have atomic locks to the file server, otherwise you -will- get corruption. And it's not something that'll happen occasionally. Picture this:

  [client]  --  [ l.d ] -- [external host]
                   |
     [smtp server]-+-[pop3 server]
                   |
               [filesrv]

Whilst [client] is reading mail (via [pop3 server]), [external host] sends an email to his mailbox. The pop3 daemon has a file handle on the mail spool, and suddenly data is appended to it. Now the problem is, the pop3 daemon has a copy of (what it thinks is) the mail spool in memory, and when the user deletes a message, the mail that's just been received will be deleted, because the pop3 daemon doesn't know about it.

This is actually rather a simplification, as just about every pop3 daemon understands this and will let go of the file handle. But the same thing will happen if a message comes in -whilst the pop3d is deleting mail-.

                           POP Client    SMTP Client
  I want to lock this file <--
  I want to lock this file       	<--
  You can lock the file    -->
  You can lock the file                  -->
  Consider it locked       <--
  File is locked           -->
  Consider it locked             	<--
  Ooh, I can't lock it                   -->

The issue with NFS v1 and v2 is that whilst it has locking support, it's not atomic. NFS v3 can do this:

  I want to lock this file <--
  I want to lock this file       	<--
  File is locked           -->
  Ooh, I can't lock it                   -->

That's why you want NFSv3. Plus, it's faster, and it works over TCP, rather than UDP 8-)

This is about the hardest clustering thing you'll ever do.

Stefan Stefanov sstefanov (at) orbitel (dot) bg

I think this might not be so hard to achieve with CODA and Qmail.

Coda (http://www.coda.cs.cmu.edu) allows "clustering" of file system space. Qmail's (http://www.qmail.org) default mailbox format is Maildir, which is a very lock-safe format (even on NFS without lockd).

(I haven't implemented this, it's just a suggestion.)

20.4.10. pop3/LVS-DR by Fabien

Fabien fabien (at) oeilpouroeil (dot) org 03 Oct 2002

I have an LVS-DR (1 director and 2 realservers) with http and pop3 load balancing/high availability (using some hints from ultramonkey project). On both realservers I have
  • the smtp daemon postfix with the VDA patch to handle maildir. (LVS doesn't handle smtp, I just use the dns multiple MX feature.)
  • the lightweight pop3 daemon tpop3d which can manage maildir format.

The incoming mails are stored on random realservers in maildir format following dns MX scheduling. To synchronize both realservers (so that pop3 accounts are correct when checked), I use rsync and especially the drsync.pl rsync wrapper, which keeps track of a given filelist between rsync synchronizations (here the filelist is the maildir files). drsync is run on each realserver every minute (cron) synchronizing the content with the other realserver(s). So far it works with a dozen pop3 accounts and hundreds of mail sent with no loss.
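
A sketch of the cron entry (the maildir path and peer hostname are assumptions, and sync-maildirs.sh stands in for a site-local wrapper around drsync.pl - see the drsync documentation for its real options):

# /etc/cron.d/maildir-sync on each realserver
* * * * *   root   /usr/local/sbin/sync-maildirs.sh otherrealserver /home/vmail/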

20.4.11. strange mail problem

trietz trietz (at) t-ipnet (dot) net 18 Oct 2006

I'm using LVS-NAT for simple rr loadbalancing between two sendmail realservers. I set up a director with 3 NICs, one for the external connection (eth0) and the other two (eth1 and eth2) for connecting my realservers over crossover cables. The director has two IP addresses on the external interface. Here is the output from ipvsadm-save:

-A -t x.x.x.123:smtp -s rr
-a -t x.x.x.123:smtp -r 192.168.0.1:smtp -m -w 1
-a -t x.x.x.123:smtp -r 192.168.0.2:smtp -m -w 1
-A -t x.x.x.122:smtp -s rr
-a -t x.x.x.122:smtp -r 192.168.0.1:smtp -m -w 1
-A -t x.x.x.123:imaps -s rr
-a -t x.x.x.123:imaps -r 192.168.0.1:imaps -m -w 1

Packets initiated by the realservers are successfully SNAT'ed with iptables on the director. My problem: loadbalancing works fine, but I see a lot of reply packets from the realservers leaving the director on interface eth0 with the internal IPs 192.168.0.1 and 192.168.0.2.

After testing I assume it is a timeout problem.

  • The client creates an smtp connection through the director to the realserver, which works perfectly
  • The connection hangs for a while and times out

  • The realserver tries to close the connection, but the director doesn't "SNAT" the packet
  • It looks like the director has forgotten the connection, because the realserver's timeout is longer than the director's.

My solution:

  • Patch my kernel sources with the ipvs_nfct patch.
  • Activate conntrack:

    echo 1 > /proc/sys/net/ipv4/vs/conntrack
    
  • Add the following iptables rule on the director (eth1,2 are on the DRIP network, eth0 faces the internet):

    iptables -A FORWARD -i eth1 -o eth0 -m state --state INVALID -j DROP
    iptables -A FORWARD -i eth2 -o eth0 -m state --state INVALID -j DROP
    

Horms explanation (off-list)

Basically the real-server and end user have a connection open. It idles for a long time - so long that the connection entry on the linux director expires. Then the real-server sends a packet over the connection, which the linux director doesn't recognise and sends out to the ether without un-NAT'ing it.

Solution? Well other than the one he suggests, changing the timeouts would help, assuming his analysis is correct.
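
To see and change the director's idle timeouts (a sketch; 7200s here matches a typical realserver tcp keepalive):

director# ipvsadm -L --timeout     # show the current tcp/tcpfin/udp timeouts
director# ipvsadm --set 7200 0 0   # raise the tcp timeout; a 0 leaves that value unchanged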

Joe - this timeout problem came up earlier.

Dominik Klein dk (at) in-telegence (dot) net 15 May 2006

So let's just say we have a simple setup:

client -> switch -> director -> switch -> realserver

The client establishes a connection, sends some data, whatever. Then it does nothing for say 10 minutes. After that it tries to reuse the still-established connection and it works just fine. Then it does nothing for say 16 minutes and tries again to use the still-established connection. In the meantime, the default timeout for the connection table (default 15 minutes) runs out and so this connection is no longer valid on the director. So the director replies with RST to the PSH packet from the client and the connection breaks for the client. The realserver does not know anything about the reset on the director, so it still considers the connection established.

The client opens a new connection, but the old one will still be considered established on the realserver. That's what made my MySQL server hit the max_connection limit and reject any new clients. I will try to set the timeout higher, as it can easily happen that my clients do nothing for a few hours at night, which will - sooner or later - hit the max_connection limit again.

20.5. Mail Farms

20.5.1. Designing a Mail Farm

Peter Mueller pmueller (at) sidestep (dot) com 10 May 2001

what open source mail programs have you guys used for SMTP mail farm with LVS? I'm thinking about Qmail or Sendmail?

Michael Brown Michael_E_Brown (at) Dell (dot) com, Joe and Greg Cope gjjc (at) rubberplant (dot) freeserve (dot) co (dot) uk 10 May 2001

You can do load balancing against multiple mail servers without LVS. Use multiple MX records to load balance, and mailing list management software (Mailman, maybe?). DNS responds with all MX records for a request. The MTA then chooses one at random from among those at the same priority. (A caching DNS server will also return all MX records.) You don't get persistent use of one MX record. If the chosen MX record points to a machine that's down, the MTA will choose another MX record.
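
For example, a zone file fragment with equal-preference MX records (a sketch; the names and addresses are placeholders):

; sending MTAs pick one of the preference-10 MX hosts at random and
; fail over to the other if it refuses connections
example.com.    IN  MX  10  mail1.example.com.
example.com.    IN  MX  10  mail2.example.com.
mail1           IN  A       192.0.2.11
mail2           IN  A       192.0.2.12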

Wensong

I think that central load balancing is more efficient in resource utilization than clients randomly picking servers; basic queuing theory can prove this. For example, if there are two mail servers grouped by multiple DNS MX records, it is quite possible that a mail server with load near 1 is still receiving new connections (QoS is bad here), while the other mail server has a load of only 0.1. If central load balancing can keep the load of the two servers around 0.7 each, the resource utilization and QoS are better than in the above case. :)

Michael Brown Michael_E_Brown (at) Dell (dot) com 15 May 2001

I agree, but... :-)

  1. You can configure most mail programs to start refusing connections when load rises above a certain limit. The protocol itself has built-in redundancy and error-recovery. Connections will automatically fail over to the secondary server when the primary refuses connections. Mail will _automatically_ spool on the sender's side if the server experiences a temporary outage.
  2. Mail service is a special case. The protocol/RFC itself specifies application-level load balancing, no extra software required.
  3. Central load balancer adds complexity/layers that can fail.

I maintain that mail serving (smtp only, pop/imap is another case entirely) is a special case that does not need the extra complexity of LVS. Basic Queuing theory aside, the protocol itself specifies load-balancing, failover, and error-recovery which has been proven with years of real-world use.

LVS is great for protocols that do not have the built-in protocol-level load-balancing and error recovery that SMTP inherently has (HTTP being a great example). All I am saying is use the right tool for the job.

Note this discussion applies to mail which is being forwarded by the MTA. The final target machine has the single-writer, many-reader problem as before (which is fine if it's a single node), i.e. don't run the leaf node as an LVS.

Joe

How would someone like AOL handle the mail farm problem? How do users get to their mail? Does everyone in AOL get their mail off one machine (or replicated copies of it) or is each person directed to one of many smaller machines to get their mail?

Michael Brown

Tough question... AOL has a system of inbound mail relays to receive all their users' mail. Take a look:

[mebrown@blap opt]$ nslookup
Default Server:  ausdhcprr501.us.dell.com
Address:  143.166.227.254

> set type=mx
> aol.com
Server:  ausdhcprr501.us.dell.com
Address:  143.166.227.254

aol.com	preference = 15, mail exchanger = mailin-03.mx.aol.com
aol.com	preference = 15, mail exchanger = mailin-04.mx.aol.com
aol.com	preference = 15, mail exchanger = mailin-01.mx.aol.com
aol.com	preference = 15, mail exchanger = mailin-02.mx.aol.com
aol.com	nameserver = dns-01.ns.aol.com
aol.com	nameserver = dns-02.ns.aol.com
mailin-03.mx.aol.com	internet address = 152.163.224.88
mailin-03.mx.aol.com	internet address = 64.12.136.153
mailin-03.mx.aol.com	internet address = 205.188.156.186
mailin-04.mx.aol.com	internet address = 152.163.224.122
mailin-04.mx.aol.com	internet address = 205.188.158.25
mailin-04.mx.aol.com	internet address = 205.188.156.249
mailin-01.mx.aol.com	internet address = 152.163.224.26
mailin-01.mx.aol.com	internet address = 64.12.136.57
mailin-01.mx.aol.com	internet address = 205.188.156.122
mailin-01.mx.aol.com	internet address = 205.188.157.25
mailin-02.mx.aol.com	internet address = 64.12.136.89
mailin-02.mx.aol.com	internet address = 205.188.156.154
mailin-02.mx.aol.com	internet address = 64.12.136.121
dns-01.ns.aol.com	internet address = 152.163.159.232
dns-02.ns.aol.com	internet address = 205.188.157.232

So that is on the receive side. On the side of the actual user reading their mail, things are much different. AOL doesn't use normal SMTP mail. They have their own proprietary system, which interfaces to the normal internet SMTP system through gateways. I don't know how AOL does their internal, proprietary stuff, but I would guess it is a massively distributed system.

Basically, you can break down your mail-farm problem into two, possibly three, areas.

1) Mail receipt (from the internet)
2) Users reading their mail
3) Mail sending (to the internet)

Items 1 and 3 can normally be hosted on the same set of machines, but it is important to realize that these are separate functions, and can be split up, if need be.

For item #1, the listing above showing what AOL does is probably a good example of how to set up a super-high-traffic mail gateway system. I normally prefer to add one more layer of protection on top of this: a super low-priority MX at an offsite location. (example: aol.com preference = 100, mail exchanger = disaster-recovery.offsite.aol.com )

For item #2, that is going to be a site policy, and can be handled many different ways depending on what mail software you use (imap, pop, etc). The good IMAP software has LDAP integration. This means you can separate groups of users onto separate IMAP servers. The mail client then can get the correct server from LDAP and contact it with standard protocols (IMAP/POP/etc).

For item #3, you will solve this differently depending on what software you have for #2. If the client software wants to send mail directly to a smart gateway, you are probably going to DNS round-robin between several hosts. If the client expects its server (from #2) to handle sending email, then things will be handled differently.

Wenzhuo Zhang wenzhuo (at) zhmail (dot) com

Here's an article on paralleling mail servers by Derek Balling.

Shain Miley 25 May 2001

I am planning on setting up an LVS IMAP cluster. I read some posts that talk about file locking problems with NFS that might cause mailbox corruption. Do you think NFS will do the trick, or is there a better (faster, journaling) file system out there that will work in a production environment?

Matthew S. Crocker matthew (at) crocker (dot) com 25 May 2001

NFS will do the trick but you will have locking problems if you use mbox format e-mail. You *must* use MailDir instead of mbox to avoid the locking issues.

You can also use GFS (www.globalfilesystem.org) which has a fault tolerant shared disk solution.

Don Hinshaw dwh (at) openrecording (dot) com

I do this. I use Qmail as it stores the email in Maildir format, which uses one file per message as opposed to mbox which keeps all messages in a single file. On a cluster this is an advantage since one server may have a file locked for writing while another is trying to write. Since they are locking two different files it eases the problems with NFS file locking.

Courier also supports Maildir format as I believe does Postfix.

I use Qmail+(many patches) for SMTP, Vpopmail for a single UID mail sandbox (shell accounts my ass, not on this rig), and Courier-Imap. Vpopmail is configured to store userinfo in MySQL and Courier-Imap auths out of Vpopmail's tables.

Joe:

I've always had the creeps about pop and imap sending clear text passwds. How do you handle passwds?

It's a non-issue on that particular system, which is a webmail server. There is no pop, just imapd and it's configured to allow connections only from localhost. The webmail is configured to connect to imapd on localhost. No outside connections allowed.

But, this is another reason that I started using Vpopmail. Since it is a mail sandbox that runs under a single UID, email users don't get a shell account, so even if their passwords are sniffed, it only gets the cracker a look into that user's mailbox, nothing more.

At least on our system. If a cracker grabs someone's passwd and then finds that the user uses the same passwd on every account they have, there's not much I can do about that.

On systems where users do have an ftp or shell login, I make sure that their login is not the same as their email login and I also gen random passwords for all such accounts, and disallow the users changing it.

I'm negotiating a commercial contract to host webmail for a company (that you would recognize if I weren't prohibited by NDA from disclosing the name), and if it goes through then I'll gen an SSL cert for that company and auth the webmail via SSL.

You can also support SSL for regular pop or imap clients such as Netscape Messenger or MS Outlook or Outlook Express.

Everything is installed in /var/qmail/* and that /var/qmail/ is an NFS v3 export from a RAID server. All servers connect to a dedicated MySQL server that also stores its databases on another NFS share from the RAID. Also each server mounts /www from the RAID.

Each realserver runs all services: smtpd, imapd, httpd and dns. I use TWIG as a webmail imap client, which is configured to connect to imapd on localhost (each server does this). Incoming smtp, httpd and dns requests are load balanced, but not imapd, since those are local connections on each server. Each server stores its logs locally, then they are combined with a cron script and moved to the raid.

It's been working very well in a devel environment for over a year (round-robin dns, not lvs). I've recently begun the project to rebuild the system and scale it up into a commercially viable system, which is quite a task since most of the software packages are at least a year old, and I'll be using a pair of LVS directors instead of the RRDNS.

Matthew Crocker

Users will also be using some sort of webmail (IMP/HORDE) to get their mail when they are off site...other than that standard Eudora/Netscape will be used for retrieval.

I settled on TWIG mainly because of its vhost support. With Vpopmail, I can execute

# /var/qmail/vpopmail/bin/vadddomain somenewdomain.com <postmaster passwd>

and add that domain to dns and begin adding users and serving it up. I had to tweak TWIG just a bit to get it to properly deal with the "user@domain" style login that Vpopmail requires, but otherwise it works great. Each vhost domain can have its own config file, but there is only one copy of TWIG in /www. TWIG uses MySQL, and though it doesn't require it, I also create a separate database for each vhost domain.

IMP's development runs along at about Mach 0.00000000004 and I got tired of waiting for them to get a working addressbook. That plus it doesn't vhost all that well. SquirrelMail is very nice, but again not much vhost support. Plus TWIG includes the kitchen sink: email, contacts, schedule, notes, todo, bookmarks and even USENET (which I don't use); each module can be enabled/disabled in the config file for that domain, and it's got a very complete advanced security module (which I also don't use). It's all PHP and using mod_gzip it's pretty fast. I tested the APC optimizer for PHP, but every time I made a change to a script I had to reload Apache. Not very handy for a devel system, but it did add some noticeable speed increases, until I unloaded it.

(Joe - I've lost track of who is who here)

The realservers would need access to both the users' home directories as well as the /var/mail directory. I am not too familiar with the actual locking problems... I understand the basics, but I also hear that NFS V3 was supposed to fix some of the locking issues with V2... I also saw some links to GFS, AFS, etc. and am not too sure how they would work...

For those of you that missed the importance of using Maildir format...

Alexandra Alvarado

I need to implement a cluster mail server with 2 computers for smtp, 2 computers for pop3 and 2 computers (NAS) for failover storage. The idea is to have duplicated copies of the mail (/var/spool/mail and /var/spool/mqueue) on the NAS using online replication.

Matthew Crocker matthew (at) crocker (dot) com 20 May 2003

Do not use mbox mailbox format. mbox does not work well over NFS with multiple hosts writing to the same file. You should use Maildir format. Just about every SMTP server can handle Maildir.

I'm running 4 mail servers. All mail servers have SMTP, POP3, IMAP, SMTPS, IMAP-SSL and POP3-SSL running on them. I'm using qmail, qmail-ldap, and Courier-IMAP. I have a Network Appliance NetFiler F720 200gig NFS server for my Maildir storage. I have 3 OpenLDAP servers set up with 1 master and 2 slaves. The mail servers only talk with the slaves. All mailAddress information is held in the LDAP directory.

You need to centralize the storage of mail using NAS and centralize the storage of account information using either LDAP or MySQL.

How do you plan on failing over the NAS? NFS with soft mounts should handle it pretty well. Set up 2 Linux boxes, one as an active and one as a passive NFS server, both connected via Fibre Channel to a chunk of drives. The passive machine *must* have a way to STOMITH (Shoot The Other Machine In The Head) the active server if it crashes. You do not want both machines mounting the same drive space at the same time. Very bad things will happen....

So: set up an EXT3 filesystem on the FC drives, mount it on the active Linux box, and export it over NFS with a virtual IP address. If the active server fails you need to remove power from it using a remote power switch; you need to be able to guarantee that it won't come back to life and start writing to the filesystem again. Clean the FC filesystem, remount it and export it over NFS on the same virtual IP address. keepalived can handle the VIP stuff with VRRP. I think it can also launch an external script during a failover to handle the shooting, cleaning, mounting and exporting of the filesystem. EXT3 cleans pretty quickly.
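
A sketch of what the failover script has to do on the passive server (the device, mount point and client network are assumptions; the STOMITH step is site-specific):

# 1. power off the failed active server via the remote power switch (site-specific)
# 2. clean and mount the shared ext3 filesystem
e2fsck -p /dev/sda1
mount -t ext3 /dev/sda1 /export/mail
# 3. export it and bring up the virtual IP the clients mount from
exportfs -o rw,sync 192.168.1.0/24:/export/mail
ip addr add 192.168.1.250/24 dev eth0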

The SMTP/POP3 servers will be very unhappy to see their NFS server disappear so you will need to recover quickly. Processes will probably pile up in 'D'isk wait status on all of the machines and the load will go through the roof. After the NFS server comes back online the hung processes should recover and finish up.

Adaptec makes a very nice 12 disk rack mounted RAID controller that has U160 SCSI going to the disks and 1 gig FiberChannel going to the servers. Redundant power, Redundant RAID controllers, Redundant FC loops going to each server. It is called the DuraStor 7320S. Plan on about $10kUS + drives for this type of system.

Network Appliance makes an amazing box with complete High Availability failover of clustered data. You can expect to pay $200KUS for a complete clustered solution with 300GB usable storage. Fully redundant. Pretty much shoot a shotgun at it and not go down or lose data.

EXT3 running on Logical Volume Manager (LVM) can handle journalling and snapshots

Making the servers/services redundant is easy. Making your NAS/SAN redundant is expensive.

I'm looking into the Adaptec external RAID controller/drive array setup with 2 Linux boxes for my NAS. I've been running my NetApp for 3 years and have *NEVER* had it crash. It really is an amazing box. The problem is it is only one box and I don't have $200k to make it a cluster. I think I can do a pretty good job for about $20k with the Adaptec box, a bunch of Seagate drives and a couple Linux boxes.

You could also look into distributed filesystems like GFS, Coda... but I don't feel confident enough in them to handle production data just yet.

Just a quick note: About a week ago I tried compiling a kernel that had been patched by SGI for XFS. The kernel (2.4.2) compiled fine, but choked once the LVS patches had been applied. Not having a lot of time to play around with it, I simply moved to 2.4.4+lvs 0.9 and decided not to bother with XFS on the director boxes.

Also, I thought about samba, but only found one post from last year where someone was going to try it, with no more info there.

Well, there's how I do it. I've tried damned near every combination of GPL software available over about the last 2 years to finally arrive at my current setup. Now if I could just load balance MySQL...

Greg Cope

MySQL connections/data transfer work much faster (20% ish) on localhost - so how about running mysql on each host as a select-only system, with each local copy replicating from a master DB that is used for inserts and updates?

Ultimately I think I'll have to. After I get done rebuilding the system to use kernel 2.4 and LVS and everything is stabilized, then I'll be looking very hard at just this sort of thing.

Joe, 04 Jun 2001

SMTP servers need access to DNS for reverse name lookup. If they are LVS'ed in a LVS-DR setup, won't this be a problem?

Matthew S. Crocker matthew (at) crocker (dot) com

You only need to make sure you have the proper forward and reverse lookups set. Inbound mail to an SMTP server gets load balanced by the LVS but the server still sees the original IP of the sender and can do reverse lookups as normal. Outbound mail from an SMTP server makes connections from its real IP address, which can be NAT'ed by a firewall or not; that IP address can also be reverse looked up.

Normally the realservers in an LVS-DR setup have private IPs for the RIPs and hence they can't receive replies from calls made to external name servers.

I would also assume that people would write filter rules to only allow packets in and out of the realservers that belong to the services listed in the director's ipvsadm tables.

I take it that your LVS'ed SMTP servers can access external DNS servers, either by NAT through the director, or in the case of LVS-DR by having public IPs and making calls from those IPs to external nameservers via the default gw of the realservers?

We currently have our realservers with public IP addresses.

Bowie Bailey Bowie_Bailey (at) buc (dot) com

You can also do this by NAT through a firewall or router. I am not doing SMTP, but my entire LVS setup (VIPs and all) is private. I give the VIPs a static conduit through the firewall for external access. The realservers can access the internet via NAT, the same as any computer on the network.
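
A sketch of the masquerading rule (on the director or the firewall; the RIP network and outside interface are assumptions):

# let the realservers on the RIP network reach external DNS, smtp etc.
iptables -t nat -A POSTROUTING -s 192.168.1.0/24 -o eth0 -j MASQUERADE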

Adail Oliveira

I want to build an e-mail cluster using lvs; does anybody have experience with this?

kwijibo (at) zianet (dot) com

It is pretty much the same as setting up an LVS for any other service. The biggest problem you will probably have is figuring out how to handle storage for the mailboxes. My experience is that it works great. I am not sure how I would handle our mail load without it.

Todd Lyons tlyons (at) ivenue (dot) com 27 May 2005

Agreed. We use a NetApp for our central NFS server, 2 http machines for webmail, 2 imap machines, and 2 smtp machines. We have a 2 node load balancer with failover that balances the 3 protocols listed above (as well as other webservers). The 6 machines serving mail are dual P4 2.8 GHz with 1Gig RAM boxen, the load balancers are old P3 700 boxes that only do load balancing. We're just a small system though compared to many who do it. We estimate we could scale up to about 10-20 real servers for each service before we start to get throughput problems with the NetApp.

Graeme Fowler graeme (at) graemef (dot) net 15 Dec 2005

My day job sees me working for an ISP; we have a number of mail systems where we use multiple frontend servers (some behind LVS, some using other methods) with NetApp Filer backends offering mail storage over NFS mounts to the real/frontend servers. They handle SMTP, POP, IMAP, Webmail (and other things not necessarily mail related) but store the data on common NFS mounts.

Yes, there are well-documented issues with "mail on NFS" but that usually happens with shared SMTP server spools rather than IMAP/POP systems. We haven't had a problem yet; the filers are very reliable (and good if they do go wrong on rare occasions, too) and the platform scales out nicely.

Pierre Ancelot pierre (at) bostoncybertech (dot) com 15 Dec 2005

I tried to implement a load balanced mail system with imap and imaps. In this case, a user creating an imap folder will create it on only one node.... How do I update a mail received on one host to every other host? Using rsync would delete mail received in the meantime on the other servers...

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 15 Dec 2005

Perdition or Courier-Imap. Both of them are/have imap proxies (or directors). So what you end up with is one (or more) front end proxies that send the person to the right machine. It looks something like this. Btw, we are moving to such a setup so I don't know how well it will work.

The connection hits your load balancers (just straight ip_vs). Then it hands off the connection to a pool of imap proxies. Then each imap proxy figures out which real mail warehouse to send the message to (ldap is a good place to store this info because it too can be load balanced - the slaves anyway, and then there's no need to load balance the master).

At that point the user does their thing and its all stored on one server. But you have many mail boxes distributed across many mail warehouses. There is an issue of backups and such for the mail warehouse, but this distributes the load so you can serve many more mail boxes than one server could. I think we are gonna go with RAID arrays and then hot spare mail warehouse mirrors. If the lead warehouse fails the backup comes online(through something like heartbeat). This should be more than enough redundancy for us, but you may want to look into other solutions if you want more, aka fiberchannel or some such.

Obviously this setup can get very complex, but it becomes very stable. Depending on your application you can throw money at it, or not even have hot spares and trust in your RAID/backups.

Mark msalists (at) gmx (dot) net 15 Dec 2005

Maybe I am misinterpreting this, but it sounds like each mailbox is assigned to exactly one realserver? So you have distributed the mailboxes for load-balancing, but is there any redundancy if one of the boxes goes down?

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 15 Dec 2005

Nope, you're not misinterpreting anything. With mail you can't really have the type of redundancy you can have with say a webserver serving out content. The problem is that imap doesn't like to be distributed. The only way is to do it over something like iSCSI or fiberchannel or some other enterprise level storage medium. What we are doing is to distribute load and to provide some redundancy. If one mail box goes down then we bring up the hot spare, but most of our mail boxes still stay up. Also if one has a failed RAID, then we can move all the mail boxes off of it and bring it down, repair the raid, then bring it back up and move the mail boxes back. It really offers more flexibility and does increase the redundancy on a site level.

Todd Lyons tlyons (at) ivenue (dot) com 16 Dec 2005

Could always go with something like cyrus-imap with the murder extension, which is for an imap cluster. I've never set it up, don't know any details much beyond what I've stated here. But it's supposed to make the mail machines look like a cluster.

Kees Bos k.bos (at) zx (dot) nl 16 Dec 2005

Maybe you can put some kind of imapproxy in front of your imap servers. The imapproxy then has to know about the multiple imap servers, and the imapproxy itself can be loadbalanced. I haven't used it myself, but perdition seems to do this: http://www.vergenet.net/linux/perdition/

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 16 Dec 2005

Yes, I forgot to mention that, but word from someone who's already been there: it's HARD. At least in my experience it's more trouble than it's worth.

20.5.2. Commercial Mail Farm

Here's an example of a commercial mail farm using LVS. Commercial ventures are usually loath to tell us how they use LVS - seems they don't understand the spirit of the GPL. Even after you're told the setup, you still need someone to get it going and keep it going, so I don't know what advantage they get out of keeping their setups secret. Michael Spiegle popped up on the mailing list and gave some info about an LVS'ed mail farm he'd set up for a customer, so I asked off-list if he'd mind giving us more details. So here it is. Thanks Michael.

Michael Spiegle mike (at) nauticaltech (dot) com 13 Nov 2006

The setup is VERY simple and straightforward. We have 17 mailservers in production right now. Here's some specs on our setup:

Pair of LVS-NAT directors, each dual Xeon, 2GB RAM, dual 3com Fiber 1000SX NICs, failover uses keepalived. There is no firewall in front of the directors. The 17 mail servers are dual Xeon, 4GB RAM, dual 10K SCSI disks in RAID1 (software), dual onboard e1000.

During peak time today (Monday has the highest load), we were doing about 14K active connections at any given time (probably over 30K inactive connections). Actual bandwidth isn't terribly high, due to the nature of the service. The director pair also provides load balancing for a cluster of 30 Sun X1 servers running HTTPd, which has far fewer connections but is a little more bandwidth intensive. I think we might be pushing 350mbit/sec at peak times (haven't seen MRTG graphs in a while).

Our services allow local mailboxes and forwarding to external addresses. We have something in the neighborhood of 250K local mailboxes and 170K external forwards. The load on the director pair is practically non-existent (generally 0.0x).

The ONLY time we have EVER come close to maxing out our LVS was during a DDOS attack. The NICs we use don't have interrupt coalescence in the driver, so we were actually running out of interrupts on the box which isn't the fault of LVS. I don't recall any other metrics from the DDOS attack, but LVS would have swallowed it if we had interrupt coalescence.

The mailservers however have some issues which need to be sorted out. They run anywhere from 2 to 15 as the load average. They use their local disks heavily for temporary incoming mail storage. We store all mail on a pair of NetApp FAS-940s which we appear to be pushing the limits of somehow (the excessive amount of NFS-wait we're hitting is driving up the load on the mailservers).

How does localmail get from the realservers to the NetApp? The mailserver uses multiple daemons to accomplish mail tasks. When a piece of mail comes in, a message is created on the disk. Another daemon is dedicated to figuring out where the messages go (to an external forward, or to a local mailbox). If the message is going to an external forward, it is handled by an external-mail daemon. If the message is staying local, an internal-mail daemon puts it on the NetApp.

When a customer connects via POP, the POP server looks up the location of the mailbox in memory, then goes straight to the NetApp to fetch the messages. All NetApps are mounted on the realservers via the Linux NFS client.

nfs used to be a real dog. It's impressive that nowadays you can run a network disk for a machine that's being pushed hard.

On any given server, we've got 300 items in dmesg regarding "couldn't contact NetApp". We have some issues to sort out regarding the NFS mount options we're using. Also, the NetApps have an issue where if a single mailbox hits the dirsize limits (I think its something on the order of 2 million messages in a single directory), the NetApp freaks out and pegs 100% CPU. During this time, load on the mailservers doubles because those mail processes/daemons can't talk to the NetApp and causes a pile-up of connections.

It is a real "cluster" in the sense that customer data (email address to local mailbox mapping) is stored in a memory-resident database - therefore, access is VERY fast and any server can handle any number of connections.

We have a pair of dual-xeon boxes running tinyDNS to provide caching DNS for the mail cluster. Previously, we had our default gateways on the mailservers set to the 2nd interface of the directors (which were also load balanced). This caused all caching DNS traffic to go through the LVS, which resulted in a conntrack table of over 100K entries at peak times. Even though the DNS traffic was UDP, netfilter attempts to create a very basic connection status for the traffic. For example, if mailserver01 sends out a DNS request to dnscache01, netfilter will create a conntrack table entry and label it as unack'd until LVS sees a response from dnscache01. When it sees this response, it relabels the connection as ack'd. Once I realized what was going on, I re-architected the DNS layout slightly to allow the mailservers to communicate directly with the caching nameservers. Now, the LVS runs about 20-25K connections in conntrack.

About the limitations of conntrack. Previously (a year ago), we had an old pair of DNS caches, which were nothing more than single-proc P3 boxes @ 900Mhz. We NEVER had a problem with these boxes until we provisioned the new cache boxes (dual xeon). The old boxes were on the same VLAN as the mailservers, so traffic/connections through LVS weren't really a problem. The new DNS caches were to be "segregated" from other internal networks, so the mailservers used LVS as a gateway to hit them. I always had a feeling that our mail servers were "slower" with the new caches than the old ones. I dug deeper one day and found out that we were hitting conntrack limits on LVS from all these DNS queries. I alleviated that as I explained earlier by re-architecting the DNS cache layout, but I still don't feel it was quite up to snuff. I did notice that each of the new DNS caches had netfilter enabled in the kernel (an unnecessary addition from our shoddy development team) and we were hitting conntrack limits on them as well.

The moral of the story is that I never had these problems with the old servers, because I compiled the kernel WITHOUT netfilter. So with the old kernel:

  • there are no conntrack limits to hit,
  • the kernel doesn't have to do lookups in a massive 100K table for each connection. If my math serves me right, 100K conntrack entries consumes a little over 100MB of RAM.

It's not very scientific, but I believe conntrack introduces enough latency to be noticeable at our level.

Since we run a "real" cluster, we can fail out any machine as we please. Any mailserver can handle any customer.

We also have a separate LVS-DR web cluster based on linux x86. It runs the same memory-resident database to build apache virtualhost entries on the fly. Also, since each customer has their own IP (for SSL purposes), the realservers have about 65K IPs (250+ class Cs) bound to the loopback. Works beautifully.
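A minimal sketch of how such a block of per-customer IPs might be bound to the loopback on an LVS-DR realserver (the 192.168.1.0/24 block and the use of iproute2 are assumptions; repeat the loop for each class C, and the usual LVS-DR arp suppression for VIPs still applies):

#!/bin/sh
# bind one class C of customer VIPs to lo on a realserver (example block)
NET=192.168.1
for i in $(seq 1 254); do
    ip addr add ${NET}.${i}/32 dev lo
done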

How do you handle failover of large numbers of VIPs?

Since I haven't had a chance to work on that particular system yet, I don't know. I can tell you however that the slave wouldn't have to ARP for all of those IPs in our particular setup. We have a pair of PIX firewalls in front of the LVS which do passthru to the LVS. The only thing the slave has to ARP for is the placeholder IP on the interface, and the firewall will figure out where to send the traffic. True, if the master firewall died, the slave would have to ARP for all the IPs... but we've found the firewalls to be quite reliable.

I'm hoping to be able to push LVS a little farther with a possible project in the upcoming month. I'm leaving the place I currently work at and am going to a company that does lots of media streaming. They're using netscalers to push 10gbit of bandwidth outbound in an asymmetrical-routing sort of way. Since they cost 80K/ea, I'm hoping I can convince them that LVS is just as good if not better for MUCH cheaper.

20.6. dns, tcp/udp 53 (and dhcpd server 67, dhcp client 68)

Note

For an article containing a section on loadbalancing by DNS, see http://1wt.eu/articles/2006_lb/ Making applications scalable with Load Balancing by Willy Tarreau.

Another article about using round robin DNS to load balance services.

For rotating/round robin DNS/DNS for geographically distributed load balancing, see Load Balancing by DNS

DNS uses tcp and udp on port 53. It's a little more complicated than a regular single port service and is in the multiple port section at DNS

20.7. http name and IP-based (with LVS-DR or LVS-Tun), tcp 80

http, whether name- or ip-based, is a simple one port service. Your httpd must be listening to the VIP which will be on lo:0 or tunl0:0. The httpd can be listening on the RIP too (on eth0) for mon, but for the LVS you need the httpd listening to the VIP.

Thanks to Doug Bagley doug (at) deja (dot) com for getting this info on ip and name based http into the HOWTO.

Both ip-based and name-based webserving in an LVS are simple. In ip-based (HTTP/1.0) webserving, the client sends a request to a hostname which resolves to an IP (the VIP on the director). The director sends the request to the httpd on a realserver. The httpd looks up its httpd.conf to determine how to handle the request (e.g. which DOCUMENTROOT).

In name-based (HTTP/1.1) webserving, the client passes the HOST: header to the httpd. The httpd looks up the httpd.conf file and directs the request to the appropriate DOCUMENTROOT. In this case all URL's on the webserver can have the same IP.

The difference between ip- and name-based web support is handled by the httpd running on the realservers. LVS operates at the IP level and has no knowledge of ip- or name-based httpd and has no need to know how the URLs are being handled.

Here's the definitive word on ip-based and name-based web support. Here are some excerpts.

The original (HTTP/1.0) form of http was IP-based, i.e. the httpd accepted a call to an IP:port pair, eg 192.168.1.110:80. In the single server case, the machine name (www.foo.com) resolves to this IP and the httpd listens to calls to this IP. Here's the lines from httpd.conf

Listen 192.168.1.110:80
<VirtualHost 192.168.1.110>
        ServerName lvs.mack.net
        DocumentRoot /usr/local/etc/httpd/htdocs
        ServerAdmin root@chuck.mack.net
        ErrorLog logs/error_log
        TransferLog logs/access_log
</VirtualHost>

To make an LVS with IP-based httpds, this IP is used as the VIP for the LVS and if you are using LVS-DR/VS-Tun, then you set up multiple realservers, each with the httpd listening to the VIP (ie its own VIP). If you are running an LVS for 2 urls (www.foo.com, www.bar.com), then you have 2 VIPs on the LVS and the httpd on each realserver listens to 2 IPs.

The problem with ip-based virtual hosts is that an IP is needed for each url and ISPs charge for IPs.

Doug Bagley doug (at) deja (dot) com

Name-based virtual hosting uses the HTTP/1.1 "Host:" header, which HTTP/1.1 clients send. This allows the server to know what host/domain the client thinks it is connecting to. A normal HTTP request line only has the request path in it, no hostname, hence the new header. IP-based virtual hosting works for older browsers that use HTTP/1.0 and don't send the "Host:" header, and requires the server to use a separate IP for each virtual domain.

The httpd.conf file then has

NameVirtualHost 192.168.1.110

<VirtualHost 192.168.1.110>
ServerName www.foo.com
DocumentRoot /www.foo.com/
..
</VirtualHost>

<VirtualHost 192.168.1.110>
ServerName www.bar.com
DocumentRoot /www.bar.com/
..
</VirtualHost>

DNS for both hostnames resolves to 192.168.1.110 and the httpd determines the hostname to accept the connection from the "Host:" header. Old (HTTP/1.0) browsers will be served the webpages from the first VirtualHost in the httpd.conf.

For LVS again nothing special has to be done. All the hostnames resolve to the VIP and on the realservers, VirtualHost directives are setup as if the machine was a standalone.

Ted Pavlic pavlic (at) netwalk (dot) com.

Note that in 2000, http://www.arin.net/announcements/ ARIN (look for "name based web hosting" announcements, the link changes occasionally, couldn't find it anymore - May 2002) announced that IP based webserving would be phased out in favor of name based webserving for ISPs who have more than 256 hosts. This will only require one IP for each webserver. (There are exceptions, ftp, ssl, frontpage...)

20.7.1. "/" terminated urls

Noah Roberts wrote:

When I use urls like www.myserver.org/directory/ everything works fine. But if I don't have the ending / then my client attempts to find the realserver and ask it, and it uses the hostname that I have in /etc/hosts on the director which is to the internal LAN so it fails badly.

Scott Laird laird (at) internap (dot) com 02 Jul 2001

Assuming that you're using Apache, set the ServerName for the realserver to the virtual server name. When a user does a 'GET /some/directory', Apache returns a redirect to 'http://$servername/some/directory/'.
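A sketch of the realserver httpd.conf lines this implies (the hostname is an example; UseCanonicalName On is an extra assumption that makes Apache build self-referential redirects from ServerName rather than from the client-supplied name):

# on each realserver
ServerName www.myserver.org
UseCanonicalName On
# a request for /some/directory now redirects to
# http://www.myserver.org/some/directory/ instead of the internal hostname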

20.8. http with LVS-NAT

Summary: make sure the httpd on the realserver is listening on the RIP not the VIP (this is the opposite of what was needed for LVS-DR or LVS-Tun). (Remember, there is no VIP on the realserver with LVS-NAT).

tc lewis had an (ip-based) non-working http LVS-NAT setup. The VIP was a routable IP, while the realservers were virtual hosts on the non-routable 192.168.1.0/24 network.

Michael Sparks michael (dot) sparks (at) mcc (dot) ac (dot) uk

What's happening is a consequence of using NAT. Your LVS is accepting packets for the VIP, and re-writing them to either 192.168.123.3 or 192.168.123.2. The packets therefore arrive at those two servers marked for address 192.168.123.2 or 192.168.123.3, not the VIP.

As a result when apache sees this:

<VirtualHost w1.bungalow.intra>
...
</VirtualHost>

It notices that the packets are arriving on either 192.168.123.2 or 192.168.123.3 and not w1.bungalow.intra, hence your problem.

Solutions

  • If this is the only website being serviced by these two servers, change the config so the default doc root is the one you want.

  • If they're servicing many websites, map a realworld IP to an alias on the realservers and use that to do the work. IMO this is messy, and could cause you major headaches.

  • Use LVS-DR or LVS-Tun - that way the above config could be used without problems since the VS address is a local address as well. This'd be my choice.

Joe 10 May 2001

It just occurred to me that a realserver in a LVS-NAT LVS is listening on the RIP. The client is sending to the VIP. In an HTTP 1.1 or name based httpd, doesn't the server get a request with the URL (which will have the VIP) in the payload of the packet (where an L4 switch doesn't see it)? Won't the server be unhappy about this? This has come up before with name based service like https and for indexing of webpages. Does anyone know how to force an HTTP 1.1 connection (or to check whether the connection was HTTP 1.0 or 1.1) so we can check this?

Paul Baker pbaker (at) where2getit (dot) com 10 May 2001

The HTTP 1.1 request (and also 1.0 requests from any modern browser) contains a Host: header which specifies the hostname of the server. As long as the webservers on the realservers are aware that they are serving this hostname, there should be no issue with 1.1 vs 1.0 http requests.

So both VirtualHost and ServerName should be the reverse dns of the VIP?

Yes. Your Servername should be the reverse dns of the VIP and you need to have a Virtualhost entry for it as well. In the event that you are serving more than one domain on that VIP, then you need to have a VirtualHost entry for each domain as well.

what if instead of the name of the VIP, I surf to the actual IP? There is no device with the VIP on the LVS-NAT realserver. Does there need to be one? Will an entry in /etc/hosts that maps the VIP to the public name do?

Ilker Gokhan IlkerG (at) sumerbank (dot) com (dot) tr

If you write URL with IP address such as http://123.123.123.123/, the Host: header is filled with this IP address, not hostname. You can see it using any network monitor program (tcpdump).
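You can watch the Host: header arrive on the realserver with tcpdump (a sketch; the interface name is an example and -A needs a reasonably recent tcpdump):

tcpdump -i eth0 -s 0 -A 'tcp port 80' | grep --line-buffered 'Host:'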

20.9. httpd is stateless and normally closes connections

http is stateless, in that the httpd has no memory of your previous connections. Unlike other stateless protocols (NFS, ntp) a connection is made (it is tcp rather than udp based). However httpd will usually attempt to disconnect as soon as possible, in which case you will not see any entries in the ActiveConn column of the ipvsadm output. For HTTP/1.1, the browser/server can negotiate a persistent httpd connection.

If you look with ipvsadm to see the activity on an LVS serving httpd, you won't see much. A non-persistent httpd on the realserver closes the connection after sending the packets. Here's the output from ipvsadm, immediately after retrieving a gif filled webpage from a 2 realserver LVS.

director:# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:www rr
  -> RS2.mack.net:www            Masq    1      2          12
  -> RS1.mack.net:www            Masq    1      1          11

The InActConn column shows connections that transferred hits and have since been closed; they are in the FIN state waiting to time out. You may see "0" in the InActConn column, leading you to think that you are not getting the packets via the LVS.

Roberto Nibali ratz@drugphish.ch 22 Dec 2003

If you want to see connections before they are closed, you should invoke ipvsadm with watch,

watch -n 1 'ipvsadm -Ln'

or if you want it realtime (warning, eats a lot of CPU time):

watch -n -1 'ipvsadm -Ln'

When a client connects, you'll see a positive integer in the ActiveConn column.

20.10. netscape/database/tcpip persistence (keepalives)

With the first version of the http protocol, HTTP/1.0, a client would request a hit/page from the httpd. After the transfer, the connection was dropped. It is expensive to setup a tcp connection just to transfer a small number of packets, when it is likely that the client will be making several more requests immediately afterwards (e.g. if the client downloads an html page which contains gifs, then after parsing the html page, the client will request the gifs). With HTTP/1.1 application level persistent connection was possible. The client/server pair negotiate to see if persistent connection is possible. The server uses an algorithm to determine when to drop the connection (KeepAliveTimeout, 15sec usually or needs to recover file handles...). The client can drop the connection at anytime without consulting the server (e.g. when it has got all the hits on a page). Persistence is only allowed when the file transfer size is known ahead of time (e.g. an html page or a gif). The output from a cgi script is of unknown size and it will be transferred by a regular (non-persistent) connection. Persistent connection requires more resources from the server, as file handles can be open for much longer than the time needed for a tcpip transfer. The number of keepalive connections/client and the timeout are set in httpd.conf. Persistent connection with apache is called keepalive (http://www.auburn.edu/docs/apache/keepalive.html) and is described in http persistent connection.
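The keepalive knobs referred to above are set in httpd.conf. These are the stock Apache directives with their usual defaults (tune KeepAliveTimeout to your traffic - see the discussion below):

KeepAlive On
MaxKeepAliveRequests 100
KeepAliveTimeout 15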

With the introduction of lingering_close() to apache_v1.2, a bug in some browsers would hold open the connection forever (see Connections in FIN_WAIT_2 and Apache, http://www.auburn.edu/docs/apache/misc/fin_wait_2.html), leaving the output of netstat on the server filled with connections in FIN_WAIT_2 state. This required the addition of an RFC-violating timeout for the FIN_WAIT_2 state to the server's tcpip stack.

Kees Hoekzema kees (at) tweakers (dot) net 17 Feb 2005

When using keepalive a client opens a connection to the cluster and that connection stays open (for as long as the clients wants, or a timeout occurs serverside). So the loadbalancer can not (at normal LVS level) see whether it is a normal connection with just one large request from the server or that it is a keepalived connection with lots of requests. As far as I know persistence has no influence on keepalive.

Jacob Coby

The Apache KeepAlive option is one of the first things to turn off when you start getting a lot of traffic.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 2005/02/18

This is a dangerous shortcut. Sometimes, opening/closing connections a hundred times a second can bring down your server. Turning off HTTP keepalive implies a good choice between the 3 available Apache MPMs. In my case, I turn it off for the banner server, but keep it on (only for a short time) on our product sites (which have lots of little images). Having keepalive on or off will not affect your LVS performance; LVS persistence (aka affinity) will.

Alois Treindl alois (at) astro (dot) ch 30 Apr 2001

when I reload a page on the client, the browser makes several http hits on the server for the graphics in the page. These hits are load balanced between the realservers. I presume this is normal for HTTP/1.0 protocol, though I would have expected Netscape 4.77 to use HTTP/1.1 with one connection for all parts of a page.

Joe

Here's the output of ipvsadm after downloading a test page consisting of 80 different gifs (the html file has 80 lines of <img src="foo.gif">).

director:/etc/lvs# ipvsadm
IP Virtual Server version 1.0.7 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr
  -> RS2.mack.net:http              Route   1      2          0
  -> RS1.mack.net:http              Route   1      2          0

It would appear that the browser has made 4 connections which are left open. The client shows (netstat -an) 4 connections which are ESTABLISHED, while the realservers show 2 connections each in FIN_WAIT2. Presumably each connection was used to transfer an average of 20 requests.

If the client-server pair were using persistent connection, I would expect only one connection to have been used.

Andreas J. Koenig andreas (dot) koenig (at) anima (dot) de 02 May 2001

Netscape just doesn't use a single connection, and not only Netscape. All major browsers fire mercilessly a whole lot of connections at the server. They just don't form a single line, they try to queue up on several ports simultaneously...

...and that is why you should never set KeepAliveTimeout to 15 unless you want to burn your money. You keep several gates open for a single user who doesn't use them most of the time while you lock others out.

Julian

Hm, I think the browsers fetch the objects by creating 3-4 connections (not sure how many exactly). If there is a KeepAlive option in the httpd.conf you can expect a small number of inactive connections after the page download is completed. Without this option the client is forced to create new connections after each object is downloaded and the HTTP connections are not reused.

The browsers reuse the connection but there are more than one connections.

KeepAlive Off can be useful for banner serving, but a short KeepAlive period has its advantages in some cases with long rtt, where the connection setup costs time and where modern browsers are limited in the number of connections they open. Of course, the default period can be reduced, but its value depends on the served content, whether the client is expected to open many connections for a short period or just one.

Peter Mueller pmueller (at) sidestep (dot) com 01 May 2001

I was searching around on the web and found the following relevant links..

http://thingy.kcilink.com/modperlguide/performance/KeepAlive.html
http://httpd.apache.org/docs/keepalive.html -- not that useful
http://www.apache.gamma.ru/docs/misc/fin_wait_2.html -- old but interesting

Andreas J. Koenig andreas (dot) koenig (at) anima (dot) de 02 May 2001

If you have 5 servers with 15 secs KeepAliveTimeout, then you can serve

60*60*24*5/15 = 28800 requests per day

Joe

don't you actually have MaxClients=150 servers available and this can be increased to several thousand presumably?

Peter Mueller

I think a factor of 64000 is forgotten here (number of possible reply ports), plus the fact that most http connections do seem to terminate immediately, despite the KeepAlive.

Andreas (?)

Sure, and people do this and buy lots of RAM for them. But many of those servers are just in 'K' state, waiting for more data on these KeepAlive connections. Moreover, they do not compile the status module into their servers and never notice.

Let's rewrite the above formula:

MaxClients / KeepAliveTimeout

denotes the number of requests that can be satisfied if all clients *send* a keepalive header (I think that's "Connection: keepalive") but *do not actually use* the kept-alive line. If they actually use the kept-alive line, you can serve more, of course.

Try this: start apache with the -X flag, so it will not fork children, and set the keepalivetimeout to 60. Then load a page from it with Netscape that contains many images. You will notice that many pictures arrive quickly and a few pictures arrive after a long, long, long, looooong time.

When the browser parses the incoming HTML stream and sees the first IMG tag it will fire off the first IMG request. It will do likewise for the next IMG tag. At some point it will reach an IMG tag and be able to re-use an open keepalive connection. This is good and does save time. But if a whole second has passed after a keepalive request it becomes very unlikely that this connection will be re-used ever, so 15 seconds is braindead. One or two seconds is OK.

In the above experiment my Netscape loaded 14 images immediately after the HTML page was loaded, but it took about a minute for each of the remaining 4 images which happened to be the first in the HTML stream.

Joe

Here's the output of ipvsadm after downloading the same 80 gif page with the -X option on apache (only one httpd is seen with ps, rather than the 5 I usually have).

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.11 (size=16384)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port             Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr
  -> RS2.mack.net:http              Route   1      1          1
  -> RS1.mack.net:http              Route   1      0          2

The page shows a lot of loading at the status line, then stops, showing 100% of 30k. However the downloaded page is blank. A few seconds later the gifs are displayed. The client shows 4 connections in CLOSE_WAIT and the realservers each show 2 connections in FIN_WAIT2.

Paul J. Baker pbaker (at) where2getit (dot) com 02 May 2001

The KeepAliveTimeout value is NOT the connection time out. It says how long Apache will keep an active connection open waiting for a new request to come on the SAME connection after it has fulfilled a request. Setting this to 15 seconds does not mean apache cuts all connections after 15 seconds.

I write server load-testing software so I have done quite a bit of research into the behaviour of each browser. If Netscape hits a page with a lot of images on it, it will usually open about 8 connections. It will use these 8 connections to download things as quickly as it can. If the server cuts each connection after 1 request is fulfilled, then the Netscape browser has to keep reconnecting. This costs a lot of time. KeepAlive is a GOOD THING. Netscape does close the connections when it is done with them, which will be well before the 15 seconds since the last request expire.

Think of KeepAliveTimeout as being like an Idle Timeout in FTP. Imagine it being set to 15 seconds.

Ivan Pulleyn ivan (at) sixfold (dot) com 23 Jan 2003

To totally fragment the request, if using apache, 'KeepAlive off' option will disable HTTP keep-alive sessions. So a single browser load will have to connect() many times; once for the document, then again for each image on the page, style sheet, etc. Also, sending a pragma no-cache in the HTTP header would be a good idea to ensure the client actually reloads.

20.10.1. Sudden Changes in InActConn

nigel (at) turbo10 (dot) com

This weekend the web service we run came under increased load --- about an extra 10,000,000 queries per day ---- when InActConn went from 200-300 to 2000+ in about 60 seconds and the LVS locks up.

Rob ipvsuser (at) itsbeen (dot) sent (dot) com 2005/03/13

I had a high number of inactive connections with apache set up to not use keepalive at all. After activating keepalive in apache (LVS was already persistent) the number of inactive connections went way down.

So in my case at least, connections were setup, used for a single GET for a gif, button, jpeg, js script, or other page component then the server closed the connection, only to open another for the next gif, etc.

You might be able to use something like multilog to watch a bunch of the logs at the same time to get an idea if the traffic looks like real people (get page 1, get page 1 images, get page 2, get page 2 images) or if it is random hammering from a dos attack.

I wrote a small shell script that pulled the recent log entries, counted the hits per IP address for certain requests and then created an iptables rule on the director (or some machine in front of the director) to tarpit requests from that IP. This worked in my situation because we knew that certain URLs were only hit a small number of times during a legit use session (like a login page shouldn't be hit 957 times in an hour by the same external IP). This could help reduce the tide of requests if you are actually encountering a (d)dos. I ran it every 12 minutes or so. If you are getting ddos'd, the tarpit function of iptables (http://www.securityfocus.com/infocus/1723) or the standalone tarpit can be a great help. Also, Felix and his company seem to have helped some large companies deal with high traffic ddos attacks - http://www.fefe.de/
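For illustration, the rule such a script generates might look like this (the offending IP is an example; the TARPIT target is not in stock netfilter - it comes from patch-o-matic/xtables-addons - so plain DROP is the fallback):

iptables -A INPUT -p tcp --dport 80 -s 10.11.12.13 -j TARPIT
# without the TARPIT patch, at least shed the load:
# iptables -A INPUT -p tcp --dport 80 -s 10.11.12.13 -j DROP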

BTW, You might be interested in http://www.backhand.org/mod_log_spread/ for centralized and redundant logging. That way you can run different kinds of real time analysis with no extra load on the webservers or the normal logging hosts by just having an additional machine join/subscribe to the multicast spread group with the log data.

OK I can't find my script, but this was the start of it, it is hardly a shell script (but someone may find it useful): Add a "grep blah" command just before the awk '{print $2}' if you want just certain requests or other filtering.

multidaychk.sh
#!/bin/sh
# look for multiday patterns
# $1 is how many days back to search
# $2 is how many high usage IPs to list
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -${1} | xargs -n 1 cat | \
  awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}

byhrchk.sh
#!/bin/sh
# looks for IPs hitting during a certain hr of the day
# $1 is how many days back to search
# $2 is how many high usage IPs to list
# $3 is which hour of the day
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -${1} | xargs -n 1 cat | \
  fgrep "2005:${3}" | awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}

recentchk.sh
#!/bin/sh
# This just checks the latest X lines from the newest log file
# $1 is how many lines from the file
# $2 is how many high usage IPs to list
ls -1tr /usr/local/apache2/logs/access_log.200*0 | tail -1 | xargs -n 1 tail -${1} | \
  awk '{print $2}' | sort | uniq -c | sort -nr | head -${2}

20.11. dynamically generated images on web pages

On static webpages, all realservers serve identical content. A dynamically generated image is only present on the webserver that gets the request (and which generates the image). However the director will send the client's request for that image to any of the realservers and not necessarily to the realserver that generated the image.

Solutions are

  • generate the images in a shared directory/filesystem
  • use fwmark to setup the LVS.

Both methods are described in the section using fwmark for dynamically generated images.

20.12. http: sanity checks, shutting down, indexing programs (modifying /etc/hosts), htpasswd, apache proxy and reverse proxy to look at URL, mod_backhand, logging

People running webservers are interested in optimising throughput and often want to look at the content of packets. You can't do this with LVS, since LVS works at layer 4. However there are many ways of looking at the content of packets that are passed through an LVS to backend webservers. Material on reverse proxies is all through this HOWTO. I haven't worked out whether to pull it all together or leave it in the context it came up. As a start...

defn: forward and reverse proxy: adapted from Apache Overview HOWTO (http://www.tldp.org/HOWTO/Apache-Overview-HOWTO-2.html) and Running a Reverse Proxy with Apache (http://www.apacheweek.com/features/reverseproxies). See also http://en.wikipedia.org/wiki/Proxy_server and http://en.wikipedia.org/wiki/Reverse_proxy

A proxy is a program that performs requests on behalf of another program. The source and destination IPs on the packets do not change.

forward proxy: (the traditional http proxy), accepts requests from clients, contacts the remote http server and returns the response. An example is "squid". The main advantage of a squid is that it caches responses (it's a proxy cache). Thus a repeat request for a webpage will be returned more quickly, since the proxy cache will (usually) be closer to the client on the internet. Forward proxies are of interest because of their caching. That they cache by doing a proxy request is not of much interest to users.

reverse proxy: a webserver placed in front of other servers, providing a unified front end to the client, but offloading certain tasks (e.g. SSL processing, FTP) from the backend webservers to other machines. The most common reason to run a reverse proxy is to enable controlled access from the internet to servers behind a firewall. Squid can also reverse proxy.

Note
Quite what is "reverse" about this sort of proxy, I don't know - perhaps they needed a name to differentiate it from "forward". "reverse" is not a helpful name here. Both forward and reverse proxies have the same functionality: the forward proxy is at the client end, while the reverse proxy is at the server end.

In some sense, the LVS director functions as a reverse proxy for the realservers.

20.12.1. sanity checks

When first setting up, to check that your LVS is working...

  1. put something different on each realserver's web page (e.g. the string "realserver_1" at the top of the homepage).
  2. use rr for your scheduler
  3. make sure you're rotating through the different web pages (each one is different) and look at the output of ipvsadm to see a new connection (probably InActConn)
  4. ping the VIP from a machine directly connected to the outside of the director, then check the MAC address for the VIP with arp -a (as shown below)
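For step 4, something like this from the outside client (the VIP is an example); the MAC returned should belong to the director's NIC, not to any realserver:

ping -c 1 192.168.1.110
arp -an | grep 192.168.1.110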

20.12.2. replies coming from wrong VIP (check configs)

Nicklas Bondesson nicklas (dot) bondesson (at) mindping (dot) com 24 Feb 2007 (with help from Graeme Fowler)

I have several VIPs. Regardless of the VIP the client connects to, they get a response from a different IP which never varies. I found out that everything was working the way it should with https, which further led me into debugging our apache setup rather than LVS. Apache didn't have the appropriate virtual hosts configured for all the VIPs. This is why I always saw the same ip address as source - the ip of the _default_ (first configured) apache virtual host.

20.12.3. Shutting down http

You need to shut down httpd gracefully, by bringing the weight to 0 and letting connections drop, or you will not be able to bind to port 80 when you restart httpd. If you want to do on the fly modifications to your httpd, and keep all realservers in the same state, you may have problems.

Thornton Prime thornton (at) jalan (dot) com 05 Jan 2001

I have been having some problems restarting apache on servers that are using LVS-NAT and was hoping someone had some insight or a workaround.

Basically, when I make a configuration change to my webservers and I try to restart them (either with a complete shutdown or even just a graceful restart), Apache tries to close all the current connections and re-bind to the port. The problem is that invariably it takes several minutes for all the current connections to clear even if I kill apache, and the server won't start as long as any socket is open on port 80, even if it is in a 'CLOSING' state.

Michael E Brown wrote:

Catch-22. I think the proper way to do something like this is to take the affected server out of the LVS table _before_ making any configuration changes to the machine. Wait until all connections are closed, then make your change and restart apache. You should run into less problems this way. After the server has restarted, then add it back into the pool.

I thought of that, but unfortunately I need to make sure that the servers in the cluster remain in a near identical state, so the reconfiguration time should be minimal.

Julian wrote

Hm, I don't have such problems with Apache. I use the default configuration-time settings, may be with higher process limit only. Are you sure you use the latest 2.2 kernels in the realservers?

I'm guessing that my problem is that I am using LVS persistent connections, and combined with apache's lingering close this makes it difficult for apache to know the difference between a slow connection and a dead connection when it tries to close down, so the time it takes to clear some of the sockets approaches my LVS persistence time.

I haven't tried turning off persistence, and I haven't tried re-compiling apache without lingering-close. This is a production cluster with rather heavy traffic and I don't have a test cluster to play with. In the end rebooting the machine has been faster than waiting for the ports to clear so I can restart apache, but this seems really dumb, and doesn't work well because then my cluster machines have different configuration states.

One reason for your servers to block is a very low value for the client number. You can build apache in this way:

CFLAGS=-DHARD_SERVER_LIMIT=2048 ./configure ...

and then to increase MaxClients (up to the above limit). Try with different values. And don't play too much with the MinSpareServers and MaxSpareServers. Values near the default are preferred. Is your kernel compiled with higher value for the number of processes:

/usr/src/linux/include/linux/tasks.h

Is there any way anyone knows of to kill the sockets on the webserver other than simply wait for them to clear out or rebooting the machine? (I tried also taking the interface down and bringing it up again ... that didn't work either.)

Is there any way to 'reset' the MASQ table on the LVS machine to force a reset?

No way! The masq follows the TCP protocol and it is transparent to the both ends. The expiration timeouts in the LVS/MASQ box are high enough to allow the connection termination to complete. Do you remove the realservers from the LVS configuration before stopping the apaches? This can block the traffic and can delay the shutdown. It seems the fastest way to restart the apache is apachectl graceful, of course, if you don't change anything in apachectl (in the httpd args).

20.12.4. Running indexing programs (e.g. htdig) on the LVS (modifying /etc/hosts)

(From Ted I think)

Setup -

realservers are node1.foobar.com, node2.foobar.com... nodeN.foobar.com, director has VIP=lvs.foobar.com (all realservers appear as lvs.foobar.com to users).

Problem -

if you run the indexing program on one of the (identical) realservers, the urls of the indexed files will be

http://nodeX.foobar.com/filename

These urls will be unuseable by clients out in internetland since the realservers are not individually accessible by clients.

If instead you run the indexing program from outside the LVS (as a user), you will get the correct urls for the files, but you will have to move/copy your index back to the realservers.

Solution (from Ted Pavlic, edited by Joe).

On the indexing node, if you are using LVS-NAT add a non-arping device (eg lo:0, tunl0, ppp0, slip0 or dummy) with IP=VIP as if you were setting up LVS-DR (or LVS-Tun). With LVS-DR/VS-Tun this device with the VIP is already setup. The VIP is associated in dns with the name lvs.foobar.com. To index, on the indexing node, start indexing from http://lvs.foobar.com and the realserver will index itself giving the URLs appropriate for the user in the index.
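A sketch of adding the VIP to a non-arping device on the LVS-NAT indexing node (the VIP and hostname are examples):

ifconfig lo:0 192.168.1.110 netmask 255.255.255.255 up
# or, with iproute2: ip addr add 192.168.1.110/32 dev lo
# then start the indexing run against http://lvs.foobar.com/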

Alternately (for LVS-NAT), on the indexing node, add the following line to /etc/hosts.

127.0.0.1       localhost lvs.foobar.com

make sure your resolver looks to /etc/hosts before it looks to dns and then run your indexing program. This is a less general solution, since if the name of lvs.foobar.com was changed to lvs.bazbar.com, or if lvs.foobar.com is changed to be a CNAME, then you would have to edit all your hosts files. The solution with the VIP on every machine would be handled by dns.
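The resolver order is set in /etc/nsswitch.conf on glibc systems (older libc5 systems use the "order" line in /etc/host.conf):

# /etc/nsswitch.conf
hosts: files dns
# /etc/host.conf (libc5)
# order hosts,bind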

There is no need to fool with anything unless you are running LVS-NAT.

20.12.5. htpasswd with http

Noah Roberts wrote:

If anyone has had success with htpasswords in an LVS cluster please tell me how you did it.

Thornton Prime thornton (at) jalan (dot) com Fri, 06 Jul 2001

We have HTTP authentication working on dozens of sites through LVS with all sorts of different password storage from old fashioned htpasswd files to LDAP. LVS when working properly is pretty transparent to HTTP between the client and server.

20.12.6. apache proxy (reverse proxy) rather than LVS

Tony Requist

We currently have a LVS configuration with 2 directors and a set of web servers using LVS-DR and keepalived between the directors (and a set of MySql servers also). This is all working well using the standard RR scheduling without persistence. We will be adding functionality that will be storing data on some but not all web servers. For this, we need to be able to route requests to specific web servers according to the URL. Ideally I would generate URLs like:

stuff.domain.com://KEY

and I could have a little code in LVS (or called from LVS) where I could decode KEY to find that the data is on server A, B and C -- then have LVS route to one of these three servers. I've looked through the HOWTO and searched around but I have not been able to find anything.

Scott J. Henson scotth (at) csee (dot) wvu (dot) edu 20 Jul 2005

I would personally use apache proxy statements on the servers that don't have the information. This will increase load slightly, but is probably the easiest.

If you really want to go the LVS route, there are some issues, I believe. If my memory serves, the current version of LVS is a level 4 router and to do what you want, you would need a level 7 router. I have heard of some patches floating around to turn ip_vs into a level 7 router, but I've not seen them personally, nor tried them.

Note

L7 (see L7 Switch) requires much more computation than L4. You don't want to do anything at L7 that you can handle any way at all at L4.

Todd Lyons tlyons (at) ivenue (dot) com 20 Jul 2005

This is application level, not network level. The better solution (IMHO) is to put two machines doing reverse proxy with the rules to send the requests to the correct server. Then have your load balancers balance among the two rproxies.

A poor man's solution would be to put the reverse proxying on the webservers themselves.

This is not really good for HA though since you don't have redundancy if there is only one webserver serving a particular resource.

If the reverse proxies have a different IP than the public IP of the webservers, then you have more options.

Andres Canada

When the cluster node gets the request, it looks to the apache configuration (and whatever else) to serve the request. When a director node receives a request for a special web application that is only on one cluster node (for example node35), there should be something inside it that sends that request to that "special node" (not the next one if it's using round robin, but node35).

Todd Lyons tlyons (at) ivenue (dot) com Dec 14 2005

You should consider setting up a reverse proxy. This is a machine that sits in front of your apache boxen that examines urls and sends them to various private apache servers, and sends the reply back. The outside world doesn't talk directly to the private apache servers.

In our case, we have several different machines that handle different types of requests. We have 3 rp's sitting in front of them getting load balanced by two LVS boxen. The load balanced rp's receive the GET or POST from the outside world, examine it, and send the request to the appropriate private machine, waits for the reply, and sends that to the requesting client.
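A minimal sketch of such reverse-proxy routing with Apache's mod_rewrite/mod_proxy (all hostnames and URL patterns here are invented for illustration; the real decoding of the key to a backend would live in your own rewrite rules or a rewrite map):

# on the reverse proxy (or on a realserver that doesn't hold the data)
RewriteEngine On
RewriteRule ^/stuff/(KEYA.*)$ http://backend-a.internal/stuff/$1 [P,L]
ProxyPassReverse /stuff/ http://backend-a.internal/stuff/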

20.12.7. mod_backhand

From Lars lmb (at) suse (dot) de mod_backhand, a method of balancing apache httpd servers that looks like ICP for web caches.

Jessica Yang, 8 Oct 2004

Our application requires L7 load balancing because we use URL rewriting to keep the session info in the requested URLs, like this: http://ip/servlet/servletname;jsessionid=*****. Basically, we want the load balancer to deliver requests that have the same jsessionid to the same realserver. Looking through the LVS documentation, KTCPVS seems to be able to provide L7 load balancing, but I couldn't find any documentation about compiling, configuring, features and commands of KTCPVS. Does KTCPVS have the feature to distinguish the jsessionid in the requested URL and/or in the Cookie header? Does KTCPVS have to be bundled together with IPVS? And what is the process to make it work?

Wensong 09 Oct 2004

KTCPVS has a cookie-injection load balancing feature just as you described. You can use something like the following

insmod ktcpvs.o
insmod ktcpvs_chttp.o
tcpvsadm -A -i http -s chttp
tcpvsadm -a -i http -r realserver1:80
tcpvsadm -a -i http -r realserver2:80
tcpvsadm -a -i http -r realserver3:80

Jacob Coby jcoby (at) listingbook (dot) com 08 Oct 2004

It almost sounds like you need to use a proxy instead of LVS to do the load balancing. If something in your jsessionid is unique to a server, it would be very simple to make a rewrite rule accomplish what you want.

Clint Byrum cbyrum (at) spamaps (dot) org 08 Oct 2004

mod_backhand (http://www.backhand.org/mod_backhand/).

VERY nice load balancing proxy module for apache. It does require that your content be served from Apache 1.3 (Windows or Unix) though.

cheaney Chen

There are a lot of different kinds of SLB techniques, e.g. DNS-based, dispatcher-based (like LVS), and server-based, etc. My question is: for a commercial web site (like yahoo or ...), how do they do SLB? What methods are used to handle huge numbers of clients' requests? A combination of the SLB techniques above, or ... ?!

Clint Byrum cbyrum (at) spamaps (dot) org 06 Jan 2005

I've used LVS for frontend balancing, and backhand (www.backhand.org) for backend.

In short, mod_backhand takes specific resource-intensive requests and proxies them to whichever servers are least busy. It works *VERY* well. We have a farm of cheap boxes serving lots of CPU intensive requests and every box has the same exact load average within 2-3%. It even allows persistence. Only downside is it requires Apache 1.3, but so far that hasn't been a problem for us. :)

20.12.8. Apache logging

isplist (at) logicore (dot) net 2007-07-25

How can I exclude the logging from the LVS servers on apache? The constant checking for the host is creating VERY large log files.

Graeme Fowler graeme (at) graemef (dot) net 25 Jul 2007

This is really a question you should be asking on an Apache mailing list, but anyway... The easiest thing to do is to create a separate <VirtualHost blah> definition that simply logs to /dev/null:

<VirtualHost 1.2.3.4:80>
  ServerName blah.test.domain
  CustomLog /dev/null combined
  ...other directives...
</VirtualHost>

then configure whatever healthcheck/monitor app you're using to query the virtual host blah.test.domain by hostname - that differs so much between mon, keepalived and ldirectord that I'll leave that as an exercise for you. However, I have to say that even with a check interval of 1 second that would only give you 86400 lines per day - if you're using LB of any form I'd expect you to be generating that number of entries per hour, if not more.

20.13. HTTP 1.0 and 1.1 requests

Joe: Does anyone know how to force an HTTP 1.1 connection?

Patrick O'Rourke orourke (at) missioncriticallinux (dot) com:

httperf has an 'http-version' flag which will cause it to generate 1.0 or 1.1 requests.
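For example (the VIP and URI are placeholders):

httperf --server 192.168.1.110 --port 80 --uri /index.html \
        --http-version=1.0 --num-conns 100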

20.14. Large HTTP POST with LVS-Tun

If a client does a large (packet>MTU) POST through a tunnel device (i.e. LVS-Tun), the MTU will be exceeded. This is normally handled by the icmp layer, but linux kernels, at least up to 2.4.23, do not handle this properly for paths involving tunnel devices. See path MTU discovery.

20.15. http keepalive - effect on InActConn

Randy Paries rtparies (at) gmail (dot) com 07 Feb 2006

I just added a new realserver (local.lovejoy). It has many more InActConn than the other servers. It's newer hardware. Any ideas?

ipvsadm
IP Virtual Server version 1.0.10 (size=65536)
Prot LocalAddress:Port Scheduler Flags
 -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  www.unitnet.com:http wlc persistent 1800 mask 255.255.255.0
 -> local.lovejoy:http           Route   1      113        5568
 -> local.krusty:http            Route   1      97         223
 -> local.flanders:http          Route   1      91         158
TCP  www.unitnet.com:https wlc persistent 1800 mask 255.255.255.0
 -> local.flanders:https         Route   1      0          12

Graeme Fowler graeme (at) graemef (dot) net 7 Feb 2006

The newer machine is a newer OS, running a newer version of Apache and probably newer hardware (OK, those last two are assumptions) - I bet it's responding more quickly. The InActConn counter displays those connections in TIME_WAIT or related states, after a FIN packet has arrived to end the connection. If you run:

ipvsadm -Ln --persistent-conn --sort --rate

and
 
ipvsadm -Ln --persistent-conn --sort --stats

You will probably see that lovejoy is handling rather more traffic than krusty and flanders.

Joe - this could have been the answer, but it wasn't. It was a timeout problem though.

Randy Paries

This ended up being the KeepAlive setting (or lack thereof) in httpd.conf. Changing to KeepAlive On made the problem go away.

20.16. Fallback/Sorry pages with Apache

Gustavo Mateus

I want to customize a fallback server page for each of the 10 web sites (domains) running on 5 realservers. The way I imagine it can be done is setting lighttpd to respond to 10 different ips, one ip on the fallback server for every virtual server that I have. Is there a way to avoid that?

prockter (at) lse (dot) ac (dot) uk 30 May 2007

The fallback web server can use virtual hosts just like any other web service, so you can provide all the sorry pages (little mini sites with graphics and all) from a single server. Or you can use a cgi script which varies what it does based on the environment (which will include virtual host information). Very ancient web browsers don't send enough information; to support them you will have to use IP based hosting, so if you want a single IP just provide a catch-all page for those few (if any) browsers.

You use

fallback=192.168.20.5:80 masq

the information about which virtual host it is comes in the http request from the user's browser, just like it does when they talk to the real service.
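If you are using ldirectord (where the fallback= syntax above comes from), a minimal virtual block might look like this (all addresses are examples):

virtual=10.0.0.1:80
        real=192.168.20.10:80 masq
        real=192.168.20.11:80 masq
        fallback=192.168.20.5:80 masq
        service=http
        scheduler=rr
        protocol=tcp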

20.17. Testing http with apachebench (ab)

Larry Ludwig ludes (at) yahoo (dot) com 12 Nov 2006

From testing it appears that my load balancer is working. Using apachebench (ab), about half of the connections fail. Sometimes the test doesn't even complete. I don't get these errors if I test directly against the server IP. Sometimes I get:

apr_recv: No route to host (113)
Total of 1000 requests completed

It turns out my "error" wasn't an error after all. Everything was working fine, except for the apachebench error. What happens is that apachebench (ab) stores a copy of the first downloaded web page and, if future page requests don't match it, marks them as errors. So if the pages on the realservers are not EXACTLY the same, it will spew an error like the one I got. In our case the page listed the server name.
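A typical ab invocation against the VIP (addresses are examples); if the realserver pages differ, the mismatches show up under "Failed requests" as Length errors:

ab -n 1000 -c 50 http://192.168.1.110/index.html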

20.18. Apache setup for DoS

Willem de Groot willem (at) byte (dot) nl 18 Apr 2006

To my surprise, opening 150 tcp connections to a default apache installation is enough to effectively DoS it for a few minutes (until the connections time out). This could be circumvented by using mod_throttle, mod_bwshare or mod_limitipconn, but imho a much better place to solve this is at the LVS loadbalancer, which already does source IP tracking for the "persistence" feature.

Ratz

Only on a really badly configured web server or maybe a 486 machine :). Otherwise this does not hold. Every web server will handle at least 1000 concurrent TCP connections easily. After that you need some ulimit or epoll tweaking.

Nope, these won't circumvent anything - you then just open a HTTP 1.1 channel and reload using GET / every MaxKeepAliveTimeout-1. Those modules will not help much IMHO. They only do QoS on established sockets. It's the wrong place to interfere.

It's not a Layer 4 issue, it's a higher layer issue. Even if it wasn't, how would source IP tracking ever help? Check out HTTP 1.1 and pipelining. Read up on the timing configurations and so on.

Only poorly-configured web servers will allow you to hold a socket after a simple TCP handshake without sending any data, you get a close on the socket for HTTP 1.1 configured web servers with low timeouts.

You are right however, in that using such an approach of blocking TCP connections (_including_ data fetching) can tear down a lot of (even very well known) web sites. I actually started writing a paper on this last year, but never finished it. I wrote a proof-of-concept tool that would (after some scanning and timeout guessing) block a whole web site, if not properly configured. This was done using the CURL library. It simulates some sort of slow-start slashdot-effect.

Ken Brownfield krb (at) irridia (dot) com 18 Apr 2006

This 150 connection limit is the default MaxClients setting in Apache, which in practice should be adjusted as high as you can without Apache using more memory than you want it to (e.g., 80-100% of available RAM -- no need for the DoS to swap-kill your box, too). Each Apache process will use several megabytes (higher or lower depending on 32- or 64-bit platforms, add-on modules, etc) so this can't be set too high. Disabling KeepAlives will drop your working process count by roughly an order of magnitude, and unless you're just serving static content it's generally worth disabling. But for your case of 150 idle connections, it doesn't help.

Netfilter has various matching modules that can limit connections from and/or to specific IPs, for example:

iptables --table filter --append VIP_ACCEPT --match dstlimit \
  --dstlimit 666 --dstlimit-mode srcipdstip-dstport \
  --dstlimit-name VIP_LIMIT --jump ACCEPT

The reason DoS attacks are so successful (especially full-handshake attacks) is that something needs to be able to cache and handle incoming connections. And that is exactly where Apache is weakest -- the process model is terrible at handling a high number of simultaneous, quasi-idle connections.

LVS has some DoS prevention settings which you should consider (drop_entry, drop_packet, secure_tcp) but they're generally only useful for SYN floods. A full handshake will be passed on through LVS to the application, and that is where the resources must be available. And given persistence, a single-IP attack will be extremely effective if you only have one (or few) real servers.
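The LVS defense sysctls mentioned above live under /proc/sys/net/ipv4/vs/; a minimal sketch (setting a value of 1 enables each defense, which then activates under memory pressure - see the ipvs sysctl documentation for the other values):

echo 1 > /proc/sys/net/ipv4/vs/drop_entry
echo 1 > /proc/sys/net/ipv4/vs/drop_packet
echo 1 > /proc/sys/net/ipv4/vs/secure_tcp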

Once a connection has been made to Apache, it will need to either relegate idle connections out of process (see Apache 2.2's new event MPM, not sure if it only works on idle keepalives) or limit based on IP with the modules you mention.

This problem is difficult to solve completely, and I agree that solving it in Apache is the least powerful, least convenient, and highest overhead solution. Given Netfilter functionality (2.6 and later), the absence of throttles or connection limits in LVS isn't fatal. But I do feel that LVS could be made a more comprehensive system if it rolled in even basic connection throttling/limiting, plus a more closely integrated and maintained health checking system. And source-routing support. ;)

There are commercial products available that implement heavy-duty DoS/ intrusion protection. They block the vast majority of simple attacks and are crucial for any large-scale public-facing services. But a good distributed full-handshake or even valid HTTP request DoS is almost impossible to fully block.

I agree that the ~1,000 simultaneous connection count is indeed the general breaking point for select()- or poll()-based web servers (in my experience), and epoll() is a much better solution as you say. But Apache will not handle 1,000 simultaneous connections unless you have 4GB of RAM, you're on a 32-bit platform, and you have every feature turned off. And then only if you don't want any disk buffer/ cache. :)

With typical application server support (e.g., mod_php), Apache will not reach 1000 processes without something like 8-16G of RAM. I've never been able to set MaxClients above 200... Copy-on-write only goes so far.

Sorry for the tangent, but throttling/DoS prevention is especially important for any web/application server based on the process model.

Graeme Fowler graeme (at) graemef (dot) net 18 Apr 2006

This is an application (Apache) configuration issue, not really a load balancing issue at all. A default Apache configuration shouldn't, ideally, be in production. The MaxClients setting is 150 (may be higher depending on distro and choice of MPM) for a reason, which is that not everyone has the same hardware and resource availability. It's better that you're given a limited resource version than one which immediately spins away and causes your server to expire due to lack of memory, for example.

This is a problem which LVS itself can't help with, given that the concept of true feedback isn't implemented.

If you spend the time to get to know your server, you'll find that you can sort out this sort of resource famine quite easily by tuning Apache, with the caveat that it will _always_ be possible to cause Apache (or any other webserver for that matter) to fall over by flooding it. Think about the infamous "Slashdot effect".

You could, in theory, do some limiting with netfilter/iptables on the director, but that's OT for this list. To test, just use ApacheBench, which comes with Apache :)

Ratz

Too bad that apache only allows epoll for MPM event models. For the other interested readers, we're essentially talking about a feature which is best described here: http://www.kegel.com/c10k.html

Now, as for the memory pressure mentioned below, I beg to differ a bit ... I have rarely hit the problems serving 800-1000 concurrent sessions on 32bit using a normal 2G/2G-split 2.4.x or 2.6.x kernel. As for memory/cache... Again, I believe that if you already hit the memory limits, you did something wrong in your configuration or setup :). mod_php or even mod_perl are memory hogs, but if you use a proper m:n threading model, I bet you can still serve a couple of hundred concurrent connections.

I would argue that copy on write kills your performance because your application was not designed properly :). No pun intended, but I've more than once fixed rather broken web service architectures based on PHP or Servlets or JSP or ASP or -insert you favourite web service technology-.

DoS prevention does not exist, this topic has been beaten to death already :). DoS mitigation, maybe yes. Maybe we should define throttling before continuing discussing its pro/cons. It could very well be that we agree on that.

Most of our customer's httpd show RSS between 800KB and 2MB; some of them it's including mod_perl or mod_php. You can drop the process count if you set your timeouts correctly, or you implement proper state mapping using a reverse proxy and a cookie engine. With your iptables command, no wonder you have no memory left on your box :).

You can't drop a certain amount of illegitimate _and_ legitimate connections when you're running on a strict SLA. QoS approaches based on window scaling help a bit more.

Regarding throttling, I reckon you use the TC framework available in the Linux kernel since long before netfilter hit the tables.

Commercial packages use traffic contracts and the like, just like TC for example. Blocking or dropping is not acceptable, diverting or migrating is. The biggest issue on large-scale web services according to my experience is the detection of malicious requests.

Ken

The mod_python and mod_php applications currently under my care are at 38-44MB resident on 64-bit. On a minimal 64-bit box, I'm seeing 6MB resident. I've honestly never seen an application, either CGI- or mod-based, use less than 2MB on 32-bit including the CGI, and most in the 15-45MB range. As you say, I think the application is a huge variable, but therein lies the weakness of the process model.

Timeouts certainly help, but that's somewhat akin to saying that if you set your synrecv timeout low enough, the DoS won't hurt you. :) KeepAlives by their nature will increase the simultaneous connection count, but I apologize if I came across as advocating turning them off as a knee-jerk fix to connection-count problems.

Whether they're beneficial or not (for scalability reasons) depends on whether you bottleneck on CPU or RAM first, and whether you're willing to scale wider to keep the behavior change due to keepalives.

The iptables rule I gave was just an example, and 666 is my numeric "foo".

I was just mentioning dropping packets, not advocating them. drop_packet and secure_tcp, set to 1, seem decent choices. If LVS is out of RAM, I think your SLA is doomed, only to be perhaps aided by these settings. Having them on all the time is indeed Bad.

I had forgotten about TC, though I'm not sure it can throttle *connections* vs *throughput*.

As for improving LVS: I had to completely rewrite the LVS alert module for mon, in addition to tweaking several of the other mon modules. Now, this was on a so-last-year 2.4 distro -- I haven't worked with LVS under 2.6 yet or more modern mon installs. I also wrote a simple CLI interface wrapper to ipvsadm, since editing the ipvsadm rules file isn't terribly operator-friendly and is prone to error for simple host/service disables.

I think all the parts are there for a Unix admin to complete an install. But for a health-checking, stateful-failover, user-friendly-interface setup, it's pretty piecemeal. And there's no L7 to my knowledge. There are some commercial alternatives (that will appeal to some admins for these reasons) that are likely inferior overall to LVS. I think the work lies mostly in integration, both of the documentation and testing, and perhaps patch integration.

The primary parts of the commercial DoS systems I alluded to are the attack fingerprints and flood detection that intelligently blocks bad traffic, not good traffic. Nothing is 100%, but in terms of intelligent, low-false-positive malicious request / flood blocking, they do extremely well at blocking the bad stuff and passing the good stuff. Is it worth the bank that they charge, or the added points of failure? Depends on how big your company is I suppose. But I know of no direct OSS alternative -- or I'd use it!

Ratz

As for RSS - these seem to be my findings as well (contrary to what I stated earlier), after logging into various high volume web servers of our customers. In fact, I quickly set up an apache2 with some modules and this is the result:

vmware-test:~# chroot /var/apache2 /apache2/sbin/apachectl -l
Compiled in modules:
  core.c
  mod_deflate.c
  mod_ssl.c
  prefork.c
  http_core.c
  mod_so.c
vmware-test:~# grep ^LoadModule /var/apache2/etc/apache2/apache2.conf
LoadModule php5_module modules/libphp5.so
LoadModule access_module modules/mod_access.so
LoadModule dir_module modules/mod_dir.so
LoadModule fastcgi_module modules/mod_fastcgi.so
LoadModule log_config_module modules/mod_log_config.so
LoadModule mime_module modules/mod_mime.so
LoadModule perl_module modules/mod_perl.so
LoadModule rewrite_module modules/mod_rewrite.so
LoadModule setenvif_module modules/mod_setenvif.so
vmware-test:~# ps -U wwwrun -u wwwrun -o pid,user,args,rss,size,vsz
  PID USER     COMMAND                       RSS    SZ    VSZ
 8761 wwwrun   /apache2/sbin/fcgi-pm -f /e 11660  2964  17144
 8762 wwwrun   /apache2/sbin/apache2 -f /e 11764  3096  17332
 8763 wwwrun   /apache2/sbin/apache2 -f /e 11760  3096  17332
 8764 wwwrun   /apache2/sbin/apache2 -f /e 11760  3096  17332
 8765 wwwrun   /apache2/sbin/apache2 -f /e 11760  3096  17332
 8766 wwwrun   /apache2/sbin/apache2 -f /e 11760  3096  17332

If I disable everything non-important except mod_php, I get the following:

vmware-test:~# ps -U wwwrun -u wwwrun -o pid,user,args,rss,size,vsz
  PID USER     COMMAND                       RSS    SZ    VSZ
 9088 wwwrun   /apache2/sbin/fcgi-pm -f /e  9768  2304  15004
 9089 wwwrun   /apache2/sbin/apache2 -f /e  9856  2304  15060
 9090 wwwrun   /apache2/sbin/apache2 -f /e  9852  2304  15060
 9091 wwwrun   /apache2/sbin/apache2 -f /e  9852  2304  15060
 9092 wwwrun   /apache2/sbin/apache2 -f /e  9852  2304  15060
 9093 wwwrun   /apache2/sbin/apache2 -f /e  9852  2304  15060

A bare apache2 which only serves static content (not stripped or optimized) yields:

vmware-test:~# ps -U wwwrun -u wwwrun -o pid,user,args,rss,size,vsz
  PID USER     COMMAND                       RSS    SZ    VSZ
 9191 wwwrun   /apache2/sbin/apache2 -f /e  2588  1364   5376
 9192 wwwrun   /apache2/sbin/apache2 -f /e  2584  1364   5376
 9193 wwwrun   /apache2/sbin/apache2 -f /e  2584  1364   5376
 9194 wwwrun   /apache2/sbin/apache2 -f /e  2584  1364   5376
 9195 wwwrun   /apache2/sbin/apache2 -f /e  2584  1364   5376

However, copy on write does not occur for carefully designed application logic with shared data. So normally even 40MB RSS does not hurt you. Stripping both perl and python to a minimal set of functionality helps further.

I checked with various customers' CMS installations based on CGIs and they range between 1.8MB and 11MB RSS. Again, this does not hurt so long as the thread model is enabled. However, one has to be cautious about the thread pool settings, both for apache and for the application handler (perl, python, ...) within apache's threading model, or else resource starvation or locking issues bite you in the butt. For Perl I believe the thread-related settings are:

   PerlInterpStart    <ThreadLimit/4>
   PerlInterpMax      <ThreadLimit/3*2>
   PerlInterpMaxSpare <ThreadLimit/2>

These, however, interfere heavily with the underlying apache threading model. If you only have LWPs (pre-2.6 kernel days) those settings are better left unused, or you get COW behaviour in the perl thread pool. For NPTL-based setups this yields much reduced memory pressure. I cannot post customer data for obvious reasons.

I believe that 3 simple design techniques help reduce the weakness of the process model.

  • Design your web service with shared sources in mind
  • Use caches and ram disks for your storage
  • Optimise your OS (most people don't know this)

The last point sounds trivial but I've seen people running web servers on SuSE or RedHat using a preemptive kernel, NAPI and runlevel 5 with KDE or Gnome, d/i-notify and power management on!

Basic debugging with valgrind, vmstat and slabtop would have shown them immediately why there was I/O starvation, memory pressure and heavy network latency.

I didn't actually mean TCP timeouts, but KeepAlive timeouts for example.
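
For example, the directives in question are Apache's own, not the kernel's TCP timers; a sketch with illustrative values:

# httpd.conf - Timeout is the per-request I/O timeout; KeepAliveTimeout is
# how long an idle keepalive connection keeps a process or thread occupied
Timeout 30
KeepAlive On
KeepAliveTimeout 5
MaxKeepAliveRequests 100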

I don't buy the CPU bottleneck for web service applications. Yes, I have seen 36-CPU nodes brought to their knees by simply invoking a Servlet directly, but after fixing that application and moving to a multi-tier load balanced environment, things smoothed out quite a bit. My experience is that CPU constraints for web services are a result of bad software engineering. Excellent technology and frameworks are available; people sometimes just don't know how to use them ;).

For RAM, I'd have to agree that this is always a weakness in the system, but I reckon that any sane IT budget for mapping your business onto an e-business web service already allows enough expense on hardware, including enough (fast and reliable) RAM.

You should seriously consider adding advice about installing iptables/netfilter on high-volume networked machines. At least make sure you do not load the ip_conntrack module, or you'll run out of memory in no time. I've seen badly configured web servers with the ip_conntrack module loaded (collecting every incoming connection tuple and keeping it for a timeout of a couple of hours) run out of memory within hours. Before this fix the customer rebooted his box 3 times a day from a cronjob ... go figure.
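
A minimal sketch of how to check for this on a 2.4/2.6-era box (proc paths vary a little between kernel versions):

# is the connection tracking module loaded at all?
lsmod | grep conntrack

# if so, how many tuples is it holding, and what is the limit?
wc -l /proc/net/ip_conntrack
cat /proc/sys/net/ipv4/ip_conntrack_max

# on a plain web server doing no NAT or stateful filtering, the simplest
# fix is not to load it at all (rmmod only works if nothing depends on it)
rmmod ip_conntrack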

In my 8+ years of LVS development and usage, I have never seen an LVS box run out of memory. I'd very much like to see such a site :).

For TC: with the (not so very well documented) action classifier and the u32 filter plus a few classes you should get there.
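
As a hedged sketch of that idea (the device, rates and class numbers are assumptions), an HTB class plus a u32 filter caps the bandwidth that port 80 replies may consume; note this throttles throughput rather than connection counts:

# root HTB qdisc; unclassified traffic falls into class 1:10
tc qdisc add dev eth0 root handle 1: htb default 10
tc class add dev eth0 parent 1: classid 1:10 htb rate 100mbit

# class 1:20 caps web replies at 20mbit
tc class add dev eth0 parent 1: classid 1:20 htb rate 20mbit ceil 20mbit

# u32 filter: anything leaving with source port 80 goes into class 1:20
tc filter add dev eth0 parent 1: protocol ip prio 1 u32 \
    match ip sport 80 0xffff flowid 1:20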

As for L7: ktcpvs is a start, not much tested in the wild though I believe. OTOH putting my load balancer consultancy hat on, I rarely see L7 load balancing needs, except maybe cookie based load balancing. I would very much like to see a simple, working and proper VRRP implementation or integration into LVS. This is what gives hardware load balancers USPs.

We spend a considerable amount of time doing consultancy work in banking or government environments (after all, what else is there in Switzerland :)), and there is a tendency towards zero tolerance of false positives in blocking. Explaining why they happen nevertheless is sort of difficult at times.

20.19. squids, tcp 80, 3128

A squid realserver can for the most part be regarded as just another httpd realserver. Squids were one of the first big uses of LVS. There are some exceptions:

  • scheduling squids.

    In an LVS of squids, the content of each squid will be different. This breaks one of the assumptions of LVS, so you need to use an appropriate scheduler.

  • see 3-Tier LVS setups

I haven't set up a squid LVS myself but some people have found problems.

Rafael Morales

Before I run the rc.lvs_dr script, my realserver can connect to the outside world, but after I run it, I lose the connection. The only difference in the route table is this:

10.2.2.71 dev lo  scope link  src 10.2.2.71

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 16 Jan 2004

I had the same thing happen. You should not add any lo route. I don't know why.

Here is how I configure my realservers for LVS-DR :

  • I use noarpctl and first of all I add the noarp entry.
  • I add a lo:n interface using ifconfig (never use ifup; it seems to do some arp things and behaves badly on an already running cluster).
  • I start the service. Then I configure the director via keepalived.
  • To make this survive a reboot, I add an ifcfg-lo:n file to the redhat /etc/sysconfig/network-scripts directory:

    DEVICE=lo:1
    IPADDR=(VIP)
    NETMASK=255.255.255.255
    BROADCAST=(VIP)
    ONBOOT=yes
    

    And then I have this init.d script I use for noarp :

    #!/bin/bash
    # noarp init script
    # For LinuxVirtualServer realservers
    # FJ 05/01/2004 (logical, french date format)
    # noarp devs are welcome to add it to the noarp distribution as
    # an example under the GPLv2 or BSD licence (but I would never grant
    # any rights for such a simple script).
    
    start () {
    # This is a little bit tricky. I use my director for both
    # public (194.*) and private (10.*) IPs. So...
    for i in /etc/sysconfig/network-scripts/ifcfg-eth* ; do
            . $i
            if [ `echo $IPADDR | cut -f 1 -d "."` ==  194 ] ; then
                    RIPPUB=$IPADDR
            fi
            if [ `echo $IPADDR | cut -f 1 -d "."` ==  10 ] ; then
                    RIPPRIV=$IPADDR
            fi
    done
    # Let's have a look at the loopback aliases
    for i in /etc/sysconfig/network-scripts/ifcfg-lo:* ; do
            . $i
            if [ `echo $IPADDR | cut -f 1 -d "."` ==  194 ] ; then
                    /usr/local/sbin/noarpctl add $IPADDR $RIPPUB
            fi
            if [ `echo $IPADDR | cut -f 1 -d "."` ==  10 ] ; then
                    /usr/local/sbin/noarpctl add $IPADDR $RIPPRIV
            fi
    done
    }
    
    stop () {
    /usr/local/sbin/noarpctl reset
    }
    status () {
    /usr/local/sbin/noarpctl list
    }
    restart () {
    # stop then start (called by the restart|reload case below)
    stop
    start
    }
    
    case "$1" in
      start)
            start $1
            ;;
      stop)
            stop $1
            ;;
      restart|reload)
            restart $1
            ;;
      status)
            status
            ;;
      *)
            echo $"Usage: $0 {start|stop|restart|status}"
            exit 1
    esac
    

Palmer J.D.F J (dot) D (dot) F (dot) Palmer (at) swansea (dot) ac (dot) uk Nov 05, 2001

Using iptables etc. on the directors, can you arrange for certain URLs/IPs (such as ones requiring NTLM authentication, like FrontPage servers) not to go through the caches, but just to be transparently routed to their destination?

Horms

This can only be done if the URLs to be passed straight through can be identified by IP address and/or port. LVS only understands IP addresses and ports, whether TCP or UDP, and other low-level data that can be matched using ipchains/iptables.

In particular LVS does _not_ understand HTTP; it cannot differentiate between, for instance, http://blah/index.html and http://blah/index.asp. Rather you would need to set up something along the lines of http://www.blah/index.html and http://asp.blah/index.asp, and have www.blah and asp.blah resolve to different IP addresses.

Further to this you may want to take a look at Janet (http://wwwcache.ja.net/JanetService/PilotService.html), one of the first big uses of LVS with squids.

Jezz Palmer had to add a default route from his squid realservers to get them to work. Squid accessed machines on the internet. An alternate approach would be to use iproute2 to add routes only for the services required, and to not add a default route.

Jezz J (dot) D (dot) F (dot) Palmer (at) swansea (dot) ac (dot) uk 10 Apr 2002

Here is a list of ports that squid accesses on the internet (outside world).
80          # http
21          # ftp
443,563     # https, snews
70          # gopher
210         # wais
1025-65535  # unregistered ports
280         # http-mgmt
488         # gss-http
591         # filemaker
777         # multiling http (multilingual translation services)
83,81,90    # Special web sites. These are web servers we need to access,
            # running on idiotic reserved ports.

20.19.1. Setting up squids with fwmark on the director and transparent proxy on the realservers

With squids you can't use a VIP to set up a virtual service - the requests you're interested in are all going to a port, port 80. Since the requests are being sent to an IP that's not on the director, you also need to force the director to accept the packets for local processing (see routing to a director without a VIP). Both of these problems were handled in one blow in the 2.0 and 2.2 kernels by Horms' method using the -j REDIRECT option to ipchains. This doesn't work for the 2.4 and later kernels, as the dst_addr of the packet is rewritten before the packet is delivered to the director. A possible (untested) solution is the TPROXY method.

The method used starting with the 2.4 kernels is to mark all packets to port 80 and schedule on the mark. The packets are accepted locally by iproute2 commands.

Con Tassios ct (at) swin (dot) edu (dot) au 13 Feb 2005

Transparent proxy with squid works well if you use fwmarks. I use it with the following LVS-DR configuration:

  • Directors: kernel 2.4.29, keepalived 1.1.7

    Assuming 192.168.0.0/16 is the local network, mark all non-local http packets with mark 1.

    # iptables -t mangle -A PREROUTING -p tcp -d !192.168.0.0/16 --dport 80 -j MARK --set-mark 1
    

    Then configure LVS using fwmark 1 as the virtual service.

    Use these commands so the director will accept the packets

    # ip rule add prio 100 fwmark 1 table 100
    # ip route add local 0/0 dev lo table 100
    
  • Realservers: standard RHEL kernel, squid, noarp:

    Configure the squid servers to handle transparency the normal way as described in the squid documentation.

bikrant (at) wlink (dot) com (dot) np Jun 24 2005

   <cisco router>
    202.79.xx.230
       |
       |-------------------------|-----------------------|
       |                         |                       |
       |                         |                       |
eth0:202.79.xx.240/24 fxp0: 202.79.xx.241/24     202.79.xx.235/24
    <Director>           <realserver >             <client>
    (gw: cisco)           (gw: cisco)              (gw: cisco)

The default route for all machines is the router. Forwarding is by LVS-DR.

Director

Gentoo Linux with 2.6.10 Kernel

ipvsadm -A -f 1 -s sh
ipvsadm -a -f 1 -r 202.79.xx.241:80

(-g is the default, the :80 is ignored for LVS-DR)

iptables -t mangle -A PREROUTING -p tcp --dport 80 -j MARK --set-mark 1

ip rule add prio 100 fwmark 1 table 100
ip route add local 0/0 dev lo table 100

echo 0 >  /proc/sys/net/ipv4/ip_forward

Realserver configuration: FreeBSD 5.3 with squid configured as a transparent proxy.

Cisco Router:


interface Ethernet0/0
ip address 202.79.xx.230 255.255.255.0
ip policy route-map proxy-redirect

access-list 110
     access-list 110 deny tcp host 202.79.xx.241 any eq 80
     access-list 110 permit tcp 202.79.xx.0 0.0.0.255 any eq 80

route-map proxy-redirect permit 10
    match ip address 110
    set ip next-hop 202.79.xx.240

20.20. authd/identd, tcp 113 and tcpwrappers (tcpd)

You do not explicitly set up authd (identd) as an LVS service. Some services (e.g. sendmail, and services running inside tcpwrappers) initiate a call from the identd client on the realserver back to the identd server on the client. With realservers on private networks (192.168.x.x) these calls will have non-routable src_addrs and the LVS'ed service will have to wait for the call to time out. authd initiates calls from the realservers to the client, while LVS is designed for services which receive connect requests from clients. LVS therefore stops authd from working, and this must be taken into account when running services that cooperate with authd. The inability of authd to work with LVS is important enough that there is a separate section on authd/identd.
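
If you can't avoid such a service, one common workaround (a hedged example; the directive is standard sendmail, and a value of zero is the point) is to tell the daemon not to wait for identd at all. For sendmail:

# /etc/mail/sendmail.mc - disable ident callbacks so sendmail doesn't
# block waiting for a reply that can never get back through the LVS
define(`confTO_IDENT', `0')dnl

# or the equivalent setting directly in sendmail.cf
O Timeout.ident=0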

20.21. ntp, udp 123

ntp is a protocol for time-synching machines. The protocol relies on long-term averaging of data from multiple machines, with statistics kept separately for each machine. The protocol has its own loadbalancing and failure detection and handling mechanisms. If LVS is brought into an ntp setup, then an ntp client machine would be balanced to several different servers over a long time period, negating the effort of averaging the data as if it were coming from one machine. ntp is probably not a good service to LVS.

Joe May 2002

I tried setting up ntp under LVS-DR and found that on the realserver, ntpd did not bind to the VIP (on lo:xxx). ntpd bound to 0.0.0.0, 127.0.0.1 and the RIP on eth0, but not to the VIP. Requests to the VIP:ntp from the client would receive a reply from RIP:ntp. The client does not accept these packets (reachable = 0).

Attempts to fix this on the realserver, all of which produced reply packets which were _not_ accepted by the client, were

  • bring up ntpd while only lo was up. ntpd is bound to 0.0.0.0 and 127.0.0.1.
  • bring up ntpd while the VIP was on lo:xxx and while eth0 was down: result ntpd bound to 0.0.0.0 and 127.0.0.1 but not to the VIP.
  • put the VIP onto another ethernet card, e.g. eth1. Under these conditions, the realserver worked for LVS:telnet. However ntpd bound to eth0, lo and 0.0.0.0, but not to the VIP on eth1.
  • put the VIP onto eth0 and the RIP onto eth1. ntpd now bound to the VIP, but with my hardware, only eth1 was connected to the network and I couldn't figure out how to route between the RIP on eth0 and the outside world.

For comparison, telnet also binds to 0.0.0.0 under the same circumstances, but LVS-DR telnet realservers return packets from the VIP rather than the RIP, allowing telnet to work in an LVS. The difference is that telnet is invoked by inetd: when a packet arrives on the VIP, a copy of telnetd is attached to the VIP on the realserver.

If you run ntpd under inetd using this line in inetd.conf

ntp      dgram   udp     wait    root    /usr/local/bin/ntpd  ntpd

and look in the logfiles, you'll see that for every ntp packet that arrives from the client, ntpd is invoked, rereads the /etc/ntp.drift file, finds that another ntpd is bound to port 123 and gets a signal_no_reset. However ntpd is bound to the VIP on the realserver, the client does get back packets from the VIP and it starts accumulating data. Meanwhile on the realservers, hundreds of copies of ntpd are running and hundreds of entries appear in ntpq>peer. You can now kill all the ntpds but the first and have a working ntp realserver with ntpd listening on the VIP (lo:xxx). ntp is not designed to run under inetd. Postings on news:comp.protocols.time.ntp about binding ntpd to selected IPs indicate that the code for binding to an interface was written early in the history of ntpd and is due for a rewrite.

Invoking ntpd under inetd with the nowait option, produces similar results on the realserver, except that now the client does not get (accept) any packets.

Tc Lewis managed to get ntp working on LVS-DR realservers, after some wrestling with the routing tables. ntp was being NAT'ed to the outside world, rather than being LVS'ed (i.e. the realservers were time-synching with outside ntp master machines). See also NAT clients under LVS-DR.

Wayne wayne (at) compute-aid (dot) com 2000-04-24

I'm setting up an LVS for NTP (udp 123) using LVS-NAT. Two issues are

  1. rr seems to balance better than lc

  2. the balance seems fine over a large time frame (sampling the NTP log every 5 minutes) but not when sampling the NTP log every second. Is round robin sending traffic to servers based on each request or based on a period of time?

We have tested LVS with DNS, which is UDP based, too. What we are doing with this test is not about heavy load; rather we want to see if LVS can provide a fail-over mechanism for the services. By load balancing the servers, we can have the servers back each other up: if one fails, the service will not stop.

If round-robin does this one request per server, it is pretty hard to explain what we saw in the server log, which indicates one server getting twice the requests of the other two in some seconds, and getting a lot less in other seconds. Could you explain why we are seeing that?

Joe

DNS occasionally issues tcp requests too, which might muddy the waters. I tested LVS on DNS about 6 months ago at moderate load (about 5 requests/sec I believe) and it behaved well. I don't remember looking for exact balance and I doubt I would have regarded a 50% imbalance as a problem. I was just seeing that it worked and didn't lock up etc.

If you have two ntp servers at the same stratum level and the rest of the machines are slaves, the whole setup will keep functioning if you pull the plug on one of the two servers. Will LVS give you anything more than that?

Just realised that you must be running a busy site there. If I have only one client, after everything has settled down, it will only be making one request/1024secs. If I have 5 servers they'll only be getting requests every 5120 secs. Got any suggestions about simulating a busy network? run ntpdate in a loop on the client?

One client will not do. We set up a lot of clients. The reason is that the NTP server somehow remembers the client, and if a client asks too often, it will not answer it for a period of time. I don't know who designed this in, but we found it out from our tests. We have close to 100 client computers in the test.

If you have 100 clients making requests every 1024 secs, that's only 1 request every 10 secs. You seem to be getting more than that. Even at startup with 1 request every 64 secs, that's only about 2 requests/sec.

We make two to three requests per second per client. Some seconds later the NTP servers will stop talking, but that is fine; you've already got what you need by then. The scheduling has been changed to round-robin now; it does work better, but it still has problems on the micro scale.

20.22. https, tcp 443

http is an IP based protocol, while https is a name based protocol.

http: you can test an httpd from the console by configuring it to listen on the RIP of the realserver. Then when you bring up the LVS you can re-configure it to listen on the VIP.

https: requires a certificate with the official (DNS) name of the server as the client sees it (the DNS name of the LVS cluster which is associated with the VIP). The https on the realserver then must be setup as if it had the name of the LVS cluster. To do this, activate the VIP on a device on the realserver (it can be non-arping or arping - make sure there are no other machines with the VIP on the network or disconnect your realserver from the LVS), make sure that the realserver can resolve the DNS name of the LVS to the VIP (by dns or /etc/hosts), setup the certificate and conf file for https and startup the httpd. Check that a netscape client running on the realserver (so that it connects to the realserver's VIP and not to the arping VIP on the director) can connect to https://lvs.clustername.org. Since the certificate is for the URL and not for the IP, you only need 1 certificate for an LVS running LVS-DR serving https.

Do this for all the realservers, then use ipvsadm on the director to forward https requests to each of the RIPs.

The scheduling method for https must be persistent for keys to remain valid.
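
On the director that looks something like this (a sketch; the VIP, RIPs, weights and the 10-minute persistence timeout are placeholders):

# persistent https virtual service: connections from a given client keep
# going to the same realserver for 600s after its last connection
ipvsadm -A -t VIP:443 -s rr -p 600
ipvsadm -a -t VIP:443 -r RIP1:443 -g -w 1
ipvsadm -a -t VIP:443 -r RIP2:443 -g -w 1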

When doing health checking of the https service, you can connect directly to the IP:port. e.g. see code in https.monitor at http://ftp.kernel.org/pub/software/admin/mon/contrib/monitors/https/https.monitor
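
Outside of mon, a quick manual check can be done with openssl's s_client (a sketch; substitute a realserver's RIP):

# connect to the https port on a realserver directly; you'll see the
# certificate details, then you can type "GET / HTTP/1.0" followed by a
# blank line to fetch the page over the encrypted connection
openssl s_client -connect RIP:443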

Jaroslav Libak jarol1 (at) seznam (dot) cz

I run several apache ip based virtual servers on several RSs and test them using ldirectord via http only, even though they run https too. If https is configured properly it will work whenever http does.

When compiling SSL support into Apache, what kind of certificate should I create for a real application with Thawte?

Alexandre Cassen Alexandre (dot) Cassen (at) wanadoo (dot) fr

When you generate your CSR, use the CN (Common Name) of the DNS entry of your VIP.
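
A sketch of generating such a CSR with openssl (the file names and key size are placeholders); the important part is answering the Common Name prompt with the cluster's DNS name rather than any realserver's name:

# generate a private key, then a certificate signing request from it
openssl genrsa -out clustername.key 1024
openssl req -new -key clustername.key -out clustername.csr

# at the "Common Name" prompt enter the DNS name that resolves to the VIP,
# e.g. lvs.clustername.org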

pb peterbaitz (at) yahoo (dot) com 18 Feb 2003

F5 and Foundry both DO NOT put SSL Cert processing on their load balancers, they offload it to SSL Accelerator boxes. So, don't let anyone tell you anything negative about LVS in this regard. The big boys don't recommend or do it either.

20.22.1. use reverse proxy to run https on localnode while other services are forwarded

peterbaitz - 27 Jan 2003

Is it possible for SSL to be supported on the director rather than on the realserver(s)?

Right now, the powers that be have brought up the question of placing a purchased Mirapoint email system behind a "free" load balancer (neglecting to consider that Mirapoint runs on FreeBSD, and that Piranha is a purchasable LVS product as well).

Joe

I think you're saying that you want the director to forward port 80 and to accept port 443 locally, ie to not forward port 443. If this is the case then you add entries with ipvsadm for port 80 only. All other traffic sent to the VIP on other ports will be handled locally.

Matthew Crocker matthew (at) crocker (dot) com 27 Jan 2003

It can be done; in fact just about anything can be done these days. Whether it is a smart thing to do is another matter... What you are trying to do isn't really a function of LVS. You can set up Apache+SSL running in a reverse proxy configuration. That apache can be running on, or in front of, the LVS director. The apache can then make normal web connections to the internal machines, which can be run through the LVS director and load balanced.

You can use keepalived or heartbeat to manage the high availability functions of your Apache/SSL proxy. You can use hardware based SSL engines to handle the encryption/decryption. This is all transparent to the functions of LVS.

LVS is 'just' a smart IP packet router, you give it a packet and tell it how you want it handled. It can be configured to do a bunch of things.

The ideal solution for the highest performance and greatest availability is to have 2 groups of directors, each group having N+1 machines running LVS. Have 1 group of Apache/SSL servers configured and 1 group of internal web servers.

LVS group 1 load balances the inbound SSL traffic to one of the Apache/SSL servers. The apache servers make connections to the internal servers through LVS group 2. LVS group 2 load balances the internal HTTP traffic to the realservers.

To save money you could move Apache/SSL onto the LVS directors but that could hurt performance.

20.22.2. Matt Venn's director(NAT)+mod_proxy+mod_ssl+apache HOWTO (howto run https in localnode while forwarding other services, uses reverse proxy)

Matt Venn lvs (at) attvenn (dot) net Jul 6 2004

You might want to do this if you have highly specced director(s) that you don't want to waste, or not much SSL traffic. I use this setup to cache all images, and to do SSL acceleration for my realservers. Requirements

Method:

patch ip_vs_core.c with Carlos' patch, configure the kernel for LVS, build kernel, install and reboot, compile and install ipvsadm-1.21.

Here are my config files for a small cluster with 1 director and 2 realservers. This config will do the SSL for traffic to editcluster.localnet, and load balance both https and http traffic to the 2 realservers.

#/etc/hosts
127.0.0.1               localhost
192.168.0.50            director1.localnet editcluster.localnet vhost1.localnet
192.168.1.1             director1.safenet editcluster.safenet vhost1.safenet

192.168.1.3             processor1.safenet
192.168.1.4             processor2.safenet

ipvsadm rules for a setup in which the director listens on port 8080 and load balances to the realservers on port 80.

-A -t 192.168.1.1:8080 -s rr
-a -t 192.168.1.1:8080 -r 192.168.1.3:80 -m -w 1
-a -t 192.168.1.1:8080 -r 192.168.1.4:80 -m -w 1

apache: note that I have many virtual hosts, and then one domain for the SSL content.

for reverse proxy cache

<IfModule mod_proxy.c>
        CacheRoot                                               "/tmp/proxy"
        CacheSize                                               1000000 
</IfModule>

for SSL content

<VirtualHost 192.168.0.50:443>
        ServerName editcluster.localnet
        SSLEngine                                               On
        ProxyPass / http://editcluster.safenet:8080/
        ProxyPassReverse / http://editcluster.safenet:8080/
</VirtualHost>

one of these for each virtual host

<VirtualHost 192.168.0.50:80>
        ServerName vhost1.localnet
        ProxyPass / http://vhost1.safenet:8080/
        ProxyPassReverse / http://vhost1.safenet:8080/
</VirtualHost>

Then you need a properly configured apache on your realservers that is set up with virtual hosts for vhost1.safenet and editcluster.safenet, all on port 80.

20.22.3. https without persistence, how sessions work

William Francis 29 Jul 2003 14:46:08

Is it possible to use LVS-DR with https without persistence?

James Bourne james (at) sublime (dot) com (dot) au 30 Jul 2003

It is possible. I made sure that the SSL certificate was available to each realserver/virtual host via an NFS mount. I use a single centralised httpd.conf file across all realservers. For example:

<VirtualHost <VIP>:443>
        SSLEngine               On
        ServerName              servername:443

        DocumentRoot            "/net/content/httpd/vhostname"
        ServerAdmin             email@domain.com
        ErrorLog                /net/logs/httpd/vhostname/ssl_error_log
        TransferLog             /net/logs/httpd/vhostname/ssl_access_log
        CustomLog               /net/logs/httpd/vhostname/ssl_request_log "%t
%h %{SSL_PROTOCOL}x %{SSL_CIPHER}x \"%r\" %b"
        SSLCertificateFile      /net/conf/httpd/certs/vhostname.crt
        SSLCertificateKeyFile   /net/conf/httpd/certs/vhostname.key
        SSLCipherSuite
ALL:!ADH:!EXPORT56:RC4+RSA:+HIGH:+MEDIUM:+LOW:+SSLv2:+EXP:+eNULL

	<Directory />
                Options         None
                AllowOverride   None
                Order           Allow,Deny
                Allow from      a.b.c.d/255.255.255.0 a.b.c.d/255.255.255.0
	</Directory>
</VirtualHost>

/net/logs, /net/conf and /net/content are all NFS mount points.

The downside is that unless you have real signed certificates from Thawte etc. your browser may want to confirm the legitimacy of the certificate presented each time it hits a new realserver. This depends on the load balancing method used.

Hence the use of persistence is a good idea with https.

Horms

The other reason that persistence is a good idea relates to session resumption. This allows subsequent connections to be set up much faster if an end-user connects to the same realserver. Some Layer 4 switching implementations allow persistence based on session ID for this reason. LVS doesn't do this, and it is a bit hard to put into the current code (when I say a bit, I mean more or less impossible).

For those who are interested, this is how Session IDs are used.

A basic SSL/TLS connection has two main phases, the handshake phase and the data transfer phase. Typically the handshake occurs at the beginning of the connection and once it has finished, data transfer takes place. The handshake uses asymmetric (public key) cryptography, typically RSA, while the data transfer uses symmetric cryptography, typically something like 3DES. When the session begins the public keys are generated. They are then used to securely transfer the keys that are generated for use with the symmetric cryptography that is used for the data transfer.

In a nutshell, the idea is to use slow asymmetric cryptography to share the keys required for fast symmetric cryptography, which is used to transfer the data. Unfortunately the handshake itself is quite slow. This hurts especially for many short connections - since the handshake usually occurs only once, its percentage of the connection time diminishes the longer the connection lasts.

To avoid this problem Session IDs may be used. This allows an end-user and realserver to identify each other using the Session ID that was issued by the realserver in a previous session. When this occurs an abbreviated handshake is used which avoids the more expensive parts of the handshake, thus making things faster.

Note that using different real-servers will not cause connections that try to use Session IDs to fail. They will just use the slower version of the handshake.

Nicolas Niclausse Jul 30, 2003

Indeed, it will be MUCH slower. I've made a few benchmarks, and https with renegotiation is ~20 times slower.

There is an alternative to persistence: you can share the session IDs on the realserver side with distcache (http://distcache.sourceforge.net/).
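
On the apache side this amounts to pointing mod_ssl's session cache at distcache; a sketch (the socket path is a placeholder, and mod_ssl must have been built with distcache support):

# httpd.conf on each realserver: hand session caching to distcache via a
# local socket (typically a dc_client proxy forwarding to a shared dc_server)
SSLSessionCache dc:UNIX:/var/run/sslcache.socket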

Christian Wicke Jul 31, 2003

Is load balancing based on the session id extracted from the request possible?

Horms

LVS works at layer 4 so fundamentally it doesn't have the capability to handle session ids.

20.22.4. You can have two IPs for an https domainname

Cheong Tek Mun

Is it possible to have one domain name for https with two VIPs? For example, the DNS entry for the domain name test.com is 166.166.166.100. I have an LVS with these two VIPs: 166.166.166.100 and 166.166.166.101. Can I have https service on both VIPs?

Horms 21 Dec 2004

Yes.

Joe - I assume test.com is listed in DNS as having two IPs

20.23. name based virtual hosts for https

Dirk Vleugels dvl (at) 2scale (dot) net 05 Jul 2001

I want to host several https domains on a single LVS-DR cluster. The setup of http virtual hosts is straightforward, but what about https? The director needs to be known with several VIP's (or it would be impossible to select the correct server certificate).

Matthew S. Crocker matthew (at) crocker (dot) com

SSL certs are labelled with the URL name but the SSL session is established before any HTTP requests. So you can only have one SSL cert tied to an IP address. If you want a single host to handle multiple SSL certs you need a separate IP for each cert. You also need to set up the director to handle all the IPs.

Name-based HTTP virtual hosts DO NOT WORK with SSL, because the SSL cert is sent BEFORE the HTTP request, so the server won't know which cert to send.

Note

Horms has described how sessions are established.

Martin Hierling mad (at) cc (dot) fh-lippe (dot) de

You can't do name-based vhosts, because the SSL stuff is done before HTTP kicks in. So at the beginning there is only the IP:port and no www.domain.com. Look at "Why can't I use SSL with name-based/non-IP-based virtual hosts?" (here reproduced in its entirety).

The reason is very technical. Actually it's some sort of a chicken and egg problem: The SSL protocol layer stays below the HTTP protocol layer and encapsulates HTTP. When an SSL connection (HTTPS) is established Apache/mod_ssl has to negotiate the SSL protocol parameters with the client. For this mod_ssl has to consult the configuration of the virtual server (for instance it has to look for the cipher suite, the server certificate, etc.). But in order to dispatch to the correct virtual server Apache has to know the Host HTTP header field. For this the HTTP request header has to be read. This cannot be done before the SSL handshake is finished. But the information is already needed at the SSL handshake phase.

Bingo!

Simone

I have done a configuration with LVS-DR and keepalived. I will use the server for an intranet application. I need to manage various intranet domains on the "job machine" and each domain has to be encrypted with ssl. Apache needs to use a different IP for each ssl certificate. What is the right way to implement about 10 ssl domains on the job machine?

Stephen Walker swalker (at) walkertek (dot) com 18 Aug 2003

You cannot use name-based virtual hosts in conjunction with a secure Web server. The SSL handshake occurs before the HTTP request identifies the appropriate name-based virtual host. Name-based virtual hosts only work with the non-secure Web server.

Dirk

With LVS-NAT this would be no problem (targeting different ports on the RS's). But with direct routing I need different virtual IP's on the RS. The question: will the return traffic use the VIP-IP by default? Otherwise the client will notice the mismatch during the SSL handshake.

"Matthew S. Crocker" matthew (at) crocker (dot) com

Yes, on the realservers you will have multiple dummy interfaces, one for each VIP. Apache will bind itself to each interface. The sockets for the SSL session are also bound to the interface. The machine will send packets from the IP address of the interface the packet leaves the machine on. So it will work as expected. The clients will see packets from the IP address they connected to.

Julian Anastasov ja (at) ssi (dot) bg

NAT:	one RIP for each name/key
DR/TUN: one VIP for each name/key
Is this correct?

James Ogley james (dot) ogley (at) pinnacle (dot) co (dot) uk

The realservers also need a VIP for each https URL, as they need to be able to resolve that URL to themselves on a unique IP (this can be achieved with /etc/hosts of course).

Joe

Are you saying that https needs its own IP:port rather than just IP?

Dirk

Nope. A unique IP is sufficient. Apache has to decide which cert to use _before_ seeing the 'Host' header in the HTTP request (the SSL handshake comes first). A unique port is also sufficient to decide which virtual server is meant (and via NAT it is easier to manage imho).

(I interpret Dirk as saying that the IP:port must be unique. Either the IP is unique and the port is the same for all urls, or the IP is common and there is a unique port for each url.)

anon:

how would you run two virtual domains in apache with different certificates, but just one ip address?

Jacob Coby jcoby (at) listingbook (dot) com 26 Feb 2003

It is impossible to share an ip address across multiple https domains on the standard port. Why? Because the HTTP Host header is encapsulated inside the SSL session, and apache (or anything else) can't figure out which SSL cert to use, until AFTER decoding the session. But, to decode the session, it must first send the cert to the client.

Catch-22.

To use multiple https domains, you'll have to differentiate them by IP and/or by port.

What if I use an SSL hardware decoder box?

I'm not sure what you are talking about, but I really don't think it will help. The problem is still the same: trying to serve up two different SSL certs based on a Host: header alone in the HTTP stream which is encapsulated by the SSL session which can only be verified by the correct SSL cert.

The server _cannot_ get to this Host header without sending a SSL cert.

Niraj Patel niraj (at) vipana (dot) com 20 Dec 2006

Since https uses name resolution to pull the SSL cert, would I also need something like the following:

  • a dns entry for each virtual host that maps a fqdn like web.abc.com to each of the RIPs i.e. web.abc.com resolves to RIP1, RIP2, etc.
  • an SSL certificate for web.abc.com that's installed on each RS.

Jaro jarol1 (at) seznam (dot) cz Dec 20 2006

Name resolution is used to discover the IP address, not to pull SSL certificates. The client initiates a TCP connection to the server's IP address to receive the SSL certificate. SSL will also work if you connect to the IP directly in your browser (in the sense that encryption will take place).

You don't need the DNS entries. ldirectord should be able to perform https checks to IP directly. You will need the certificates.

20.24. Obtaining certificates for https

May 2006: Van Jacobson (of TCP/IP fame) (http://en.wikipedia.org/wiki/Van_Jacobson) has a talk on how they got from circuit switching to packet switching. Now he wants networking changed from point-to-point to allow fetching signed data BitTorrent style without having to specify the location. The problem with the current system is that only the connection is certified (you know who you've connected to via ssl/ssh, you don't know who originated the data/e-mail). If each webpage/piece of data was signed, then there'd be no more pharming, phishing or spam.

He points out that obtaining certificates from Verisign is a single point of failure. He tells the story that in about 2004, someone (and they don't know who) obtained a root certificate in the name of Microsoft from Verisign. Better is a distributed (and presumably revokable) system e.g. like PKI (http://en.wikipedia.org/wiki/Public_key_infrastructure).

Zachariah Mully zmully (at) smartbrief (dot) com 26 Aug 2002

Finally received the quote from Verisign for 128-bit SSL certs for our website, and I was blown away: $1595/yr! These guys must be making money hand over fist at these prices. They want $895 for one cert and license for one server and another $700 for each additional server in the cluster. This is only for one FQDN; by the end of next year, I'll need to secure three more domains hosted by these servers... Perhaps I heard wrong, but I had thought that I could simply get one cert for a domain (in this case www.smartbrief.com) and use it on all the servers hosting it in my LVS system, but the Verisign people said I needed to buy licenses for each server in the system!

So I am wondering if Verisign is yanking my chain and if anyone has any recommendations for other Root CAs that have more reasonable pricing.

Joe (warning - rant follows)

I'll sell you one for $1500 or for $1 if you like. They're both the same ;-\

This is a rip-off because Verisign got their certificates into Netscape/IE back when it counted and no-one else bothered to do the same thing. It's the same monopoly that they had on domain names and they've just got greedy. When I needed to get a certificate, I looked up all the companies listed in my Netscape browser. Most didn't exist anymore or weren't offering certificates. The only two left were verisign and Thawte. Thawte was in South Africa and were half the price of Verisign. I wasn't sure how well a South African certificate would stand up in a US court. Thawte then bungled by setting the expiration of their certificates to be short enough that everyone with the current browsers of the time would not recognise Thawte certificates anymore. End of Thawte.

Eventually Verisign bought out Thawte. No more competition.

The webpage to get a certificate was an abomination a few years back. I can't imagine the dimwit who wrote it.

No-one has stepped in to be an alternate RootCA, and I can't imagine why. I would expect EFF could do it, anyone could do it. You do need a bit of money and have to setup secure machine(s), have some way of keeping track of keys and making sure that the webbrowsers have them pre-installed. It appears to be more than anyone else wants to do, even with the price going through the roof at $1500 a pop.

The browser people could help here by making newly approved RootCA certificates downloadable from the website for each browser, but it would appear that all are colluding with Verisign.

As far as the website operation is concerned a self signed certificate is just as good as one from Verisign. The only problem is when the user gets the ominous message warning them that the signing authority of this certificate is not recognised.

You could engage in a bit of user education here and tell them that Verisign's signature is no better than yours.

Otherwise you're over a barrel that doesn't need to be there and no-one has stepped forward to fix the situation.

Doug Schasteen dschast (at) escindex (dot) com 26 Aug 2002

www.ssl.com sells certs but their prices aren't much better. $249 per domain. The real kick in the teeth is that you need a separate certificate for not only each domain, but also sub-domains. So I have to pay an additional $249 if I want to secure something like intranet.mywebsite.com or mail.mywebsite.com as opposed to just www.mywebsite.com. As far as I know, Thawte is still the cheapest, even though they are owned by Verisign.

nick garratt nick-lvs (at) wordwork (dot) co (dot) za 26 Aug 2002

if you're using NAT you'll just need the one cert since, as you correctly state, it's per FQDN. one cert works fine for my NATed cluster. not too sure what the implications of DR would be in this context...

all a CA is, is a trusted third party with buy-in from the browser manufacturers. the tech is not rocket science either. it could be anyone; there's clearly an opportunity for another operator.

Zachariah Mully wrote:

Thanks Joe, this is unfortunately exactly what I expected to hear. And yes, the ominous warning will definitely confuse and scare our brain-dead users.

Joe

They aren't really brain dead. They just don't understand what's going on and quite reasonably in that situation they are worried about their credit card number and what's going to happen to it. They have a right to know that their connection isn't being rerouted to some other entity and this fear is how Verisign is making their money.

I've just had an offline exchange with someone who self-signs and sends the client a pop-up explaining the situation. This appears to be for in-house stuff. I don't know if this is going to work in the general case - I expect that you'll get a different reception if you are the Bank of London than if you are selling dubious services. You could try it initially and log the connections that don't follow through after getting the educational pop-up, to see how many people are scared off.

As someone pointed out, one cert should work fine for many NAT'ed servers, anyone know if my DR config would change that?

The certificate is for a domainname. All realservers think they are running that domainname. For LVS-DR they all have the same IP (the VIP). For LVS-NAT they all have different IPs (the various RIPs) in which case you have to have a different /etc/hosts file for each realserver (see the HOWTO). In all cases the machines have the same domainname and can run the same certificate. (Hmm, it's been a while, I can't remember whether the RootCA asks you for your IP or not, so I don't remember if the IP is part of the cert). I can't imagine how Verisign is ever going to tell that you have multiple machines using the same cert. Perhaps you could NFS export the one copy of the cert to all realservers.

You do need a bit of money and have to setup secure machine(s), have some way of keeping track of keys and making sure that the webbrowsers have them pre-installed.

Greg Woods woods (at) ucar (dot) edu 26 Aug 2002

The last part of this is the difficult part. We run our own RootCA here, because we were quoted a price from Verisign in excess of $50K per year for what we wanted to do. Then there is the ominous-looking spam that VeriSign sends that makes it sound like you will lose your domain name if you don't register it through them, so I won't do business with them anyway even if the price *has* come down.

So we had little choice, and we've just had to guide our users through the scary dialog boxes to get them to accept our CA. Once that's done though, we can now use SSL with authentication to control viewing of our internal web pages. Works for us, but your mileage may vary. I do recall hearing a lot of cursing coming from the security administrator's office while they were trying to get the RootCA working, too. That can be rather tricky.

Eric Schwien fred (at) igtech (dot) fr 26 Aug 2002

Thawte is selling 1-year certs at $199 each. If you have a load balancer system, they ask you to buy additional "licences" for the second, third, etc ... realserver.

In fact, if you just copy and paste the original cert, everything works fine (i.e. without additional "licences") ... but you do not have the right to do it. This is a new pricing scheme of Thawte's, which still seems cheaper than the offers you've had!

However, you still need one cert for each domain.

Their Web Site is all new, quite long to read everything, but procedures are well explained.

"Chris A. Kalin" cak (at) netwurx (dot) net 26 Aug 2002

Nope, he was talking about 128 bit certs, which even from Thawte are $449/year.

Joe Cooper joe (at) swelltech (dot) com 26 Aug 2002

How about GeoTrust? Looks like $119/year for a 128-bit cert. Though some colo/hosting providers seem to be offering the same product for $49/year (RackShack.net, for example). Maybe only for their own customers, I don't know.

Bobby Johns bobbyj (at) freebie (dot) com 29 Aug 2002

If you're using DR you only need one cert. I'm running that way right now and it works flawlessly.

Also, if you're interested in the nuts and bolts of making your own certificates, see Holt Sorensen's articles on SecurityFocus.

Parts 1-4:

http://online.securityfocus.com/cgi-bin/sfonline/infocus.pl?id=1388
http://online.securityfocus.com/cgi-bin/sfonline/infocus.pl?id=1462
http://online.securityfocus.com/cgi-bin/sfonline/infocus.pl?id=1466
http://online.securityfocus.com/cgi-bin/sfonline/infocus.pl?id=1486

Here's what the readers at Slashdot have to say about why certificates are so expensive

Malcolm Turnbull wrote

As far as I am aware Thawte does not require you to buy a separate cert for each realserver (just for each domain).

Simon Young simon-lvs (at) blackstar (dot) co (dot) uk 18 Feb 2003

Just for the record, here's the relevant section from the definitions section of Thawte's ssl certificate license agreement:

"Licensing Option" shall mean the specific licensing option on the enrollment screen that permits a subscriber to use of a Certificate on one physical Device and obtain additional Certificate licenses for each physical server that each Device manages, or where replicated Certificates may otherwise reside

"Device" shall mean a network management tool, such as a server load balancer or SSL accelerator, that routes electronic data from one point to single or multiple devices or servers.

And from section 4 or the agreement:

... You are also prohibited from using your Certificate on more than one server at a time, except where you have purchased the specific licensing option on the enrollment screen that permits the use of a Certificate on multiple servers (the Licensing Option). ...

So it looks like each realserver does indeed require a license - or at least you have to buy the 'multiple server' license option, which is more expensive than the single machine license.

In addition:

In the event you purchase the Licensing Option, you hereby acknowledge and agree that ... you may not copy the Certificate on more than five (5) servers.

So a large number of realservers may need a multiple server license for every five machines. This could get expensive...

In summary, you need a valid license for every copy of your certificate being used, whether it be single licenses for each, or multiple licenses for every 5 machines.

anon

You only need a certificate per domain. You should be able to copy it to as many servers as you want. I had SSL IPs load balanced using LVS-TUN across two computers, using the same certificate, and nothing complained about the certificate.

Malcolm Turnbull Malcolm (at) loadbalancer (dot) org 22 Oct 2003

Me too, and it only took 2 years for me to realise that I was breaking the licence/law... Verisign et. al. have clauses in their contracts stating that you can't use a cert on more than one web server unless you pay for a multiple use licence...

Joe Dec 2003,

"Crypto" Steven Levy, Viking Pub, 2001, ISBN 0-670-85950-8. I enjoyed this book - it describes the people involved in producing cryptography for the masses: Diffie, Hellman, Rivest, Shamir, Adleman, Zimmermann, Chaum and Ellis (and many others), how they did it and how they had to fight the NSA, and the legislators to get their discoveries out into the public in a useful way.

The real story of why RSA (or its descendants, e.g. Verisign) has the only certificate in Netscape is revealed in this book on p278. I will attempt to summarise

RSA owned or controlled all the patents needed for cryptography for the masses. They had licensed their patents for Lotus Notes and to Microsoft but were limited to 40 bits for the export version and 56 bits for the US version (Microsoft shipped 40 bit enabled code in all versions, to simplify maintenance). RSA was not making much money and attempts to put their patents to use were hobbled by the NSA, the US laws on cryptography, and by the lack of awareness amongst the general public and application writers that cryptography was useful (c.f. how hard it is to get wifi users to enable WEP). The Netscape team was assembled from the authors of Mosaic by Jim Clark, the just departed CEO of Silicon Graphics, who was casting about for a new idea for a start-up company.

" The idea was to develope an improved browser called the Navigator, along with software for servers that would allow businesses to go on-line. The one missing component was security. If companies were doing to sell products and make transactions over the internet, surely customers would demand protection. It was the perfect job for encryption technology.

Fortunately Jim Clark knew someone in the field - Jim Bidzos (the business manager at RSA). By the time negotiations were completed, Netscape had a license for RSA and the company's help in developing a security standard for the Web: a public key-based protocol known as the Secure Sockets Layer. Netscape would build this into its software, ensuring that its estimated millions of users would automatically get the benefits of crypto as envisioned by Merkle, Diffie and Hellman, and implemented by Rivest, Shamir and Adleman. A click of the mouse would send Netscape users into crypto mode: a message would appear informing them that all information entered from that point was secure. Meanwhile, RSA's encryption and authentication would be running behind the scenes.

Jim Bidzos drove his usual hard bargain with Netscape: in exchange for its algorithms, RSA was given 1% of the new company. In mid-1995, Netscape ran the most successful public offering in Wall Street's history, making RSA's share of the company worth over $20M. "

The point of this quote is that the people who'd invented cryptography for the masses, had struggled against their own evil empire (NSA, the US Govt) and were now in a position to make good. Because of the patents, there was only one game in town, RSA, and when Netscape was casting around for crypto, RSA was it. Even if RSA had been staffed by GPL true believers, there weren't any other companies with root CA's (if there were, they would have had to license RSA's patents).

So because of historical accident (the crypto algorithms were all patented and because the US Govt/NSA tried to keep the genie of crypto from getting out by sitting on people) RSA was the only company that had crypto at the dawn of the internet.

20.25. Self made certificates

Matthias Krauss MKrauss (at) hitchhiker (dot) com 18 Aug 2003

you can find a nice explanation of configuring self-made certs and virtual addresses at http://www.eclectica.ca/howto/ssl-cert-howto.php
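
As a minimal sketch (file names and lifetime are placeholders), a self-signed test certificate can be made in one openssl step; as with a CSR, the Common Name you enter must be the cluster's DNS name:

# key plus self-signed certificate, valid for one year
openssl req -new -x509 -nodes -days 365 \
    -keyout lvs.clustername.org.key -out lvs.clustername.org.crt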

20.26. SSL Accelerators and Load Balancers

Note

Horms has described how sessions are established

For a description of the (commercial) Radware SSL accelerator setup see Radware. This setup has the SSL accelerator as a realserver; the decrypted http traffic is fed back to the director for loadbalancing as http traffic.

Encrypted versions of services (e.g. https/http, imaps/imap, pops/pop, smtps/smtp) are available which require decryption of the client stream, with the plain text being fed to the regular demon. In the reply direction, the plain text must be re-encrypted. Decryption/encryption is a CPU intensive process and also requires the tracking of keys. Vendors have produced SSL accelerators (cards or stand-alone boxes), which do the de- and en-cryption, taking the load off the server's CPU and allowing it to do other things. These accelerator cards are useful if

  • you need to increase the capacity of your server(s) and don't want to buy a bigger server to handle the extra load from encryption/decryption,
  • you already have the biggest server you can buy
  • you don't have the SSL enabled version of the demon

The cards (or boxes) usually have proprietary software. There are only 2 products which work with Linux (both based on the Broadcom chip?).

Since these SSL accelerator boxes are not commodity items, they are always going to be more expensive than the equivalent extra computing power in more servers. The niche for SSL accelerators seems to be the suits, who, faced with choosing between

  • a low cost solution which requires some understanding of technology
  • or a high cost solution supported by an external vendor, which requires no understanding of technology

will choose the high cost solution. The people with money understand money; they usually don't understand technology. They have little basis to judge the information coming from the technologically aware people they hire and whose job it is to advise them (i.e. the suits don't trust the people at the keyboard.)

Solutions for technologically aware people are (this is not my area - anyone care to expand on this? - Joe):

  • add more servers and use a load balancer. The demon on the servers will have its own SSL code.
  • use apache with mod_reverse_proxy
  • use ssl engine

The dominant school of thought amongst LVS'ers is to add realservers running the SSL'ified demon.
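For illustration, a minimal sketch of that approach (the IP addresses are made up): an LVS-DR virtual service on :443 forwarding straight to SSL-enabled realservers, with persistence so that a client's SSL session keeps hitting the same realserver.

    # VIP 192.168.1.110, two realservers running SSL-enabled demons, LVS-DR
    ipvsadm -A -t 192.168.1.110:443 -s rr -p 600
    ipvsadm -a -t 192.168.1.110:443 -r 192.168.1.11:443 -g
    ipvsadm -a -t 192.168.1.110:443 -r 192.168.1.12:443 -g

The director never sees plain text; all the crypto work is done on the realservers.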

Kenton Smith

Do I terminate the SSL traffic at the director or the realserver? How do I handle the certs? If the traffic is terminated at the realserver, do I need a certificate for each realserver? Can I use a name-based cert using the domain name that goes with the virtual IP on the director, thus only requiring one certificate?

Joe (caveat: I haven't done SSL with LVS).

Some rules when thinking about how to handle services on LVS

  • each realserver thinks it is being connected directly by the client.
  • each client thinks it is directly connected to a single box (the realserver).
  • Neither the client or the realserver knows the director exists.

So - set up each realserver as if the client was directly connecting to it. Put a name-based cert on each realserver and let the realserver handle the SSL de/encoding.

pb peterbaitz (at) yahoo (dot) com 15 Oct 2003

Where I work we use Piranha (Red Hat's spin of LVS) and regarding SSL, we let the realservers do the SSL work. No sense busying the director with processing the SSL, and even if you wanted to, you would look to SSL Accelerators, which we have not implemented, though we looked at the technology theoretically speaking - but you also get into what service(s) you are using SSL for, webmail, web sites, etc. Better to let the realservers handle the SSL... you can always add more realservers if SSL processing bogs them down by some fraction.

Horms

I agree. And the arguments that I have heard to the contrary are usually tedious at best. SSL is probably the most expensive thing that your cluster needs to do. Thus distributing it amongst the realservers makes the most sense, as you can scale that by just adding new machines.

You can terminate the SSL connection at the director, perhaps using something like squid as a reverse proxy, but then the Linux Director has to do a _lot_ of work. You probably want to get a hardware crypto card if you are going down that road and have a reasonable amount of traffic.

You should use the same certificate on each of the realservers. That way end-users will always see the same certificate for a given virtual service.

Can I use a name-based cert using the domain name that goes with the virtual IP on the director, thus only requiring one certificate?

I am not sure that I follow this. The name in the certificate needs to match the name that your end-users are connecting to. So if you have www.a.com, www.b.com and www.c.com then they can't use the same certificate. Though the certificates can have wildcards, so you could use the same certificate for www1.a.com, www2.a.com and www3.a.com.

On a related note, you have to have a different IP address, or use a different port, for each different certificate. There is no way to use name-based virtual services with certificates, as SSL has no facility for virtual hosting and thus there is no way for the ssl server to select between different certificates on the same IP/port.

Peter Mueller

If I wanted to use a hardware SSL decrypting device such as a card in my LVS-director boxes, how could I set this up in LVS? I see no problem getting 443 to decrypt, but how do people then forward this traffic to the realserver boxes? I like the idea of saving 20-30+ Thawte bills a month AND offloading a whole bunch of CPU for the one-time cost of $500/card.

AFAIK at this time the only real way to do this is to use a user-space proxy of some sort. Once you have it in user space it is pretty straightforward, as long as the card is supported by openssl / provides the appropriate engine library for openssl.

On the other hand, surely there is someone who isn't committing highway robbery to provide certificates. AFAIK the reason you offer above is the only reason to use an accelerator card in this situation. It is a technical solution to Thawte overcharging. A much better solution is to distribute load on the cluster, that is what it is there for.

Matthew Crocker matthew (at) crocker (dot) com 22 Oct 2003

There is more than one way to handle SSL traffic. This is how I do it:

I have 2 working machines (aka realservers) running Linux/Apache/SSL. I have 1 /24 subnet (256 IP addresses) assigned to SSL serving. I register 1 SSL certificate per SSL domain I host (www.abc.com, www.def.com, www.ghi.com). I assign each domain to an IP address from the SSL pool using DNS (www.abc.com IN A 159.250.20.1, www.def.com IN A 159.250.20.2). I use LVS-DR to load balance the connections to the 2 realservers. I setup the realservers to handle every IP in the SSL pool.

In short, SSL certificates are branded with the domain name. The SSL protocol establishes security before any HTTP requests. The client web browser checks the domain it went to (location bar) against the domain in the certificate. If the domains do not match, the web browser complains to the user; SSL is still established. Because of this process you must use a separate IP address for each SSL certificate, so Apache will know which SSL cert to use when establishing the connection.
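For illustration, here is roughly what the per-certificate, IP-based vhosts look like on the realservers (mod_ssl-style Apache config; the cert paths are made up, the IPs are from the SSL pool above):

    <VirtualHost 159.250.20.1:443>
        ServerName www.abc.com
        SSLEngine on
        SSLCertificateFile    /usr/local/apache/conf/ssl/www.abc.com.crt
        SSLCertificateKeyFile /usr/local/apache/conf/ssl/www.abc.com.key
    </VirtualHost>

    <VirtualHost 159.250.20.2:443>
        ServerName www.def.com
        SSLEngine on
        SSLCertificateFile    /usr/local/apache/conf/ssl/www.def.com.crt
        SSLCertificateKeyFile /usr/local/apache/conf/ssl/www.def.com.key
    </VirtualHost>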

I use a hybrid LVS-NAT/LVS-DR setup with fwmarks and some static routes to handle my SSL traffic. Check a couple months back in the logs where I detail how I do it.

My realservers are not on the Internet. Only traffic in the SSL pool going to ports 80 and 443 is routed to the realservers. Each realserver has a copy of all SSL certs (shared drive). If I needed SSL decryption hardware I would place it in the realservers. Persistence is set on the LVS box for the connections. Ports 80 and 443 are bound together for persistence.
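A simplified sketch of binding 80 and 443 together with fwmarks (this is not Matthew's exact hybrid NAT/DR setup, just the general idea; the addresses are made up):

    # give port 80 and 443 traffic for one VIP the same firewall mark
    iptables -t mangle -A PREROUTING -d 159.250.20.1 -p tcp --dport 80  -j MARK --set-mark 1
    iptables -t mangle -A PREROUTING -d 159.250.20.1 -p tcp --dport 443 -j MARK --set-mark 1

    # balance on the mark; persistence keeps a client's 80 and 443
    # connections on the same realserver
    ipvsadm -A -f 1 -s rr -p 600
    ipvsadm -a -f 1 -r 10.0.0.2 -g
    ipvsadm -a -f 1 -r 10.0.0.3 -g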

As part of his work, PB has been finding out about SSL accelerators. Here's his writeup (Mar 2003).

20.26.1. PB's SSL Accelerator write up

I've been working with SSL accelerators and have been thinking about using them with LVS. After having Foundry, F5 and Cisco in here for a technical review of their products, I find that LVS has the same basic load balancing functionality. F5 uses BSD Unix, Cisco runs on Linux with ASIC chips, and they have fancy GUI and various additional functionality, but LVS has all the same basic stuff.

Note
Red Hat is a Cisco customer, and uses Cisco's load balancing and SSL solution in-house (I assume), rather than their own Red Hat Advanced Server + Piranha + SSL solution, which would have helped all us RH Linux Piranha customers.

An SSL accelerator is a piece of software running either on a separate box which is inserted into your data stream, decrypting on the way to the server and then re-encrypting on the way back to the client, or running on a card with its own processor, that's inserted into the server to do the same thing.

In simpler times we put SSL Certificates (Verisign, Thawte, etc.) on the realservers, and let the load balancer route the traffic to the realservers where the SSL decryption is done. https and the e-mail protocols smtps, imaps, and pops are all SSL encrypted. Additionally, many new applications today have joined the SSL bandwagon. Now SSL decryption is CPU-intensive and adds real load on your realservers. Solution? Add more realservers behind your load balancer, right? Yes and no.

Adding additional realservers behind your load balancer to offset SSL decryption load is the intuitive solution, since that is the purpose of a load balancer: to allow many realservers, and lower the load on each. However some folks today add what is generally called an "SSL Accelerator" solution to the mix, remove the SSL certificates and SSL decryption from the realservers, and pre-process the SSL encrypted data stream before it hits the realservers. In short SSL Accelerators decrypt the data, then pass the decrypted data via clear-text (standard) protocols (like http, smtp, imap, pop) to the realservers.

e.g. for https, in a standard server, the decrypted https is sent as clear text to the httpd on port 80. Same for smtps -> smtp, imaps ->imap, pops-> pop, etc.

What happens to ssh? I think because ssh is using its own RSA keys (not SSL) on the server the information is encrypted all the way.

The point of the SSL accelerator is to reduce the load on your server (which is now busy crypting, rather than serving pages), so that it can deliver more pages of data to clients without changing your server setup. However you could do the same thing by beefing up your server (or putting an array of realservers behind an LVS) without any change in software on your servers.

The problem then is one of cost. The Ingrian and Sonicwall SSL boxes are in the multiple 10K$ range. The cards cost less, and other stand-alone units that support fewer protocols (like http/https only) cost a few thousand. Suits feel better spending money when it is available (i.e. an SSL accelerator array will ensure we don't need to spend more money on 1 or 2 or 3 more realservers in the future). So an SSL accelerator is aimed at non-technical, but financially knowledgeable managers, rather than technically competent but financially naive computer people. (When faced with the choice of going to an L7 load-balancer or rewriting the application to be L4 friendly, the suits will go for the slow and expensive L7 load-balancer, while the programmers will re-write the application - the choice depends on what you have at hand that you understand).

You can have an SSL Accelerator without a load balancer, but in order to have an array of them, you want a load balancer. (Another choice is DNS round robin, which several people, including Horms, have found is not a good way to go).

There are two main ways this passing of the baton is done. First, your load balancer can load balance all data streams to a few of these SSL Accelerator units, which decrypt any SSL encrypted data and themselves load balance all protocols to your realservers. Second, your load balancer can load balance just the SSL encrypted data streams to several SSL Accelerator units, which, after they decrypt the data, pass it back to your load balancer as clear text to be routed by your load balancer as non-encrypted data over standard protocols to your realservers.

SSL Accelerators are available as stand alone units (you would normally buy several and make an array of them) or as SSL Accelerator CARDS which plug into your realservers to speed them up with regard to SSL decryption (which is yet another solution).

In principle you could separate out the SSL handling from a linux server and run it on a separate box to make a linux-only SSL accelerator box. The natural people to do this would be the mod_ssl people, but they are supporting linux compatible SSL hardware (e.g. cards that use the Broadcom chipset) via the "OpenSSL engine" feature. (see the mod_ssl mailing list archives)

This will allow you to build an SSL Accelerator Linux box, or to beef up a realserver in your Linux LVS with a Broadcom card. Since the extra processing is now on a separate processor, in principle this should not add a lot of extra load to your realserver.

Ingrian and Sonicwall are a couple of fairly expensive SSL Accelerator solutions which support all the protocols you need. Broadcom makes the card you can add to your realservers. There are other brands which make less expensive SSL Accelerators that support only the https (web) protocol.

The information from most vendors is vendor-speak, however Ingrian has some white papers.

I personally called Ingrian and found them to be extremely TECHNICALLY helpful (more than I could understand myself). They seemed willing to help me even though I told them I use Piranha/LVS (and not Foundry, their partner). I did not find any level of open-source style detail on SSL acceleration/decryption. I got all my info from the load balancer companies, Ingrian, and white papers.

I don't know how an SSL accelerator box/card works. Presumably several levels are involved.

  • The box has to get the packets

    In one of the two methods (called "one arm config") you have the load balancer route only SSL-bound protocols (i.e. https, imaps, smtps, etc.) to the SSL accelerator.

  • for a card, it has to grab the packets off the PCI bus. (Anyone know how this is done?).
  • the SSL accelerator has to keep track of session data etc.

The end result is that the box/card takes the SSL encrypted form of the data, decrypts it to clear text, and shoots it out like it was never encrypted.

Some SSL accelerators have load balancing built-in (e.g. Ingrian), but not all. The standard "one arm config" does NOT require it. You use your own load balancer and send SSL encrypted data to your SSL Accelerators; they then output decrypted clear-text data, sending it back to the load balancer for routing to the realservers. Only Ingrian mentioned/recommended that they can also do load balancing.

I don't know how you would use an SSL accelerator with LVS-DR. The load balancer companies talk about the decrypted clear text going to the realservers, but I do not recall a discussion about it going back out again.

20.26.2. from the mailing list

Matthias Krauss MKrauss (at) hitchhiker (dot) com 10 Mar 2003 (severely editorialised by Joe)

I had LVS-DR forwarding http (but not https). I hoped to decrypt the https packets on the director with the SSL accelerator card and then pass the decrypted packets to the realservers via the LVS, to save cpu time on the realservers. I simulated 10 concurrent requests and downloads of about 3 GB via the ssl accelerator; apache's cpu time went up to 30% on a 1GHz/512MB host.

When you had the accelerator card in front of the director, what does it do with traffic to other ports, e.g. port 80? Does it just pass them through? Does it look like a router/bridge except for port 443?

For me it looked like some kind of proxy, dealing with the incoming ssl/443/encrypted traffic: it decrypts it and passes http/80/decrypted traffic to the LVS. With tcpdump I saw that between the ssl rewrite engine and the VIP there was only regular http traffic. By the way, I didn't have an accelerator card for the apache box; I just used apache's rewrite and proxy pass mods for the decryption job.
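A minimal sketch of the sort of apache config Matthias is describing (assumed: mod_ssl, mod_rewrite and mod_proxy are compiled in; the addresses and cert paths are made up - 192.168.1.110 stands for the http VIP):

    <VirtualHost 192.168.1.1:443>
        SSLEngine on
        SSLCertificateFile    /usr/local/apache/conf/ssl/server.crt
        SSLCertificateKeyFile /usr/local/apache/conf/ssl/server.key

        # decrypt here, then proxy the request to the http VIP as plain http
        RewriteEngine on
        RewriteRule ^/(.*)$ http://192.168.1.110/$1 [P,L]
        ProxyPassReverse / http://192.168.1.110/
    </VirtualHost>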

Julian 11 Mar 2003

The SSL Accelerator cards I know allow user space processes (usually many threads) to accelerate the handling of private keys. What I know is that the normal traffic is still encrypted and decrypted from the CPU(s).

OTOH, decrypting the SSL data on the director is done mostly to modify the HTTP headers for cookie and scheduling purposes. For other applications it can be for another reason. Even the HTTP stream from the realservers is modified. So, I don't think LVS can be used here. Of course, it is possible to implement everything in such a way that the user space handling is avoided and all processing is moved into kernel space: queues for SSL async processing, HTTP protocol handling, cookies, just like ktcpvs works in kernel space to avoid memory copies.

I don't follow the SSL forums, so I don't know at what stage is the kernel-level SSL acceleration. My experience shows that the user-space model needs many threads just for private keys to keep the accelerator busy and the rest is spent for CPU encryption and decryption of the data.

Joe: private keys are kept in threads rather than in a table?

Julian: If you have to use the following sequence for an incoming SSL connection (user space):

SSL_accept
loop:
	SSL_read, SSL_write to client
	read, write to realserver

The SSL_accept operation is the bottleneck for non-hardware SSL processing; SSL_accept handles the private key operation, which is very expensive. The hw accel cards offload at least this processing (the engine is used internally by SSL_accept) but we continue to call SSL_read and SSL_write without using the hw engine.

Considering the above sequence we have two phases which repeat for every incoming connection:

  • wait for the card driver to finish the private key operations. That means one thread waiting in a blocked state for SSL_accept to finish.
  • do I/O for the connection (this includes data encryption and decryption and everything else)

What we want is, while the card is busy processing (it does no PCI I/O during this processing), to use the CPU not for the idle kernel process but for encryption and decryption of other connections that are not waiting on the engine. So, the goal is to keep the hwaccel queue busy with requests and to use the CPU at the same time for other processing. As a result, the accel reaches its designed limit of RSA keys/sec and we don't waste CPU waiting on a single SSL_accept for results.

Joe: if there was only one thread, would the accelerator be a bottleneck, or would it not work at all?

The CPU is idle waiting for SSL_accept (the card), then the card is idle waiting for the CPU to encrypt/decrypt data. The result could be 20% usage of the card and 30% usage of the CPU, and the idle process is happy: 70% CPU.

I don't know if this is true for all cards. But even with an accelerator, using SSL costs 3-4 times more than plain HTTP. My opinion is that this game is useful only for cookie persistence. For other cases LVS can be used on the directors and the accelerators on the realservers - LVS forwards TCP(:SSL:HTTP) at L4, and the accel is used from user space as usual.

Joe: is this 3-4 times the number of CPU cycles?

Yes, handling SSL encrypted HTTP traffic with hwaccel is 3-4 times slower than handling the same HTTP traffic without SSL. Of course, without hwaccel this difference could be 20x, depending on the CPU model used. What I want to say is that it is better to do this processing at the place where it is needed: if the SSL traffic needs to be decrypted for scheduling and persistence reasons then it should be done on the director, but if SSL is used only as a secure transport and not for the above reasons then it is better to buy one or more hwaccel cards for the realservers. Loading the director should be avoided if possible.

As for any tricks to include LVS in the SSL processing, I don't know how that can be done without using kernel-space SSL hwaccel support. And even then LVS can not be used; maybe ktcpvs could perform the URL switching and cookie management.

pb Mar 11, 2003

I wonder if a software engine could be written to accept data from any SSL service (https/smtps/imaps/pops) and let the apache rewrite + proxy pass mods decrypt it, then get it sent back out the correct clear text port (http/smtp/imap/pop). It's all SSL encrypted the same way, so once decrypted just pass it to the right protocol. No?

Horms

Yes this would be possible. But I fail to see why it would be desirable unless you are running a daemon that can't do SSL itself (not all demons are SSL enabled).

Joe: see SSL'ifying demons.

unknown: Would an SMP system make a difference such that the OS and general I/O is not bogged down?

Horms

Surely the problem is CPU and not I/O so to that end one would expect that an SMP machine would help.

Ratz

The simple solution that works in almost all cases is apache + mod_reverse_proxy. You can build highly scalable and secure transaction servers. No need for SSL accelerators, since after all there are only 2 known products out there that work with linux. Also you can double-balance the requests with the same load balancer, before the reverse proxy and after the proxy, with LVS-DR and two NICs, which leads to a pretty nice HA setup (the healthchecking scripts are a bit complex though).

Horms

Personally I agree that SSL accelerators are pointless. I don't really see that you can do much that can't be done with a reverse proxy or better still, just handling the SSL on the realservers. When you think about it, SSL is probably at least as CPU intensive as whatever else the Real Servers are doing, so it makes sense to me that load should be spread out. I can see some arguments relating to SSL session IDs and the like. But the only real way forward here is to move stuff into the kernel. I haven't given this much thought, but what Julian had to say on the matter makes a lot of sense.

(from a while back)

Jeremy Johnson jjohnson (at) real (dot) com 12 Oct 2000

I have been evaluating the Intel 7110 and the 7180 SSL E-commerce accelerators recently. I would prefer to use the 7110's (Straight SSL Accelerators) coupled with LVS instead of throwing out LVS and using the 7180 as the cluster director.

Now, before anyone says "You can do that with LVS-Tun", I am aware that I can fix this problem with LVS-Tun but I would like to see if there is a fix for LVS-DR that would let this work as I prefer the performance of LVS-DR over LVS-Tun. Here is how I have the network setup

                        ________
                       |        |
                       | client |
                       |________|
                           |
                        (router)
                           |
             __________    |
            |          |   |
            | director |---|
            |__________|   |
                           |
                           |
          -----------------------------------
          |                |                |
          |                |                |
    ____________      ____________    ____________
   |            |    |            |  |            |
   | 7110 SSL   |    |  7110 SSL  |  | 7110 SSL   |
   |____________|    |____________|  |____________|
          |                |                |
          |                |                |
    ____________      ____________    ____________
   |            |    |            |  |            |
   | realserver |    | realserver |  | realserver |
   |____________|    |____________|  |____________|

These 7110 SSL Accelerators have 2 ports, one IN and one OUT. The problem is that when the request comes into the Director and the director looks up the MAC address of the RealServer, it gets the MAC address of the first NIC in the 7110, so when the packet headers are rewritten with the destination MAC address, the MAC address of a NIC in the 7110 is written instead of the MAC of the RealServer. The effect is a black hole. The cluster works fine without the 7110 in between LVS and the RealServer; as soon as I swap in the 7110, all traffic bound for the RealServer with the 7110 in front of it is blackholed.

Any ideas? I could be wrong about what exactly is happening, but basically as soon as I swap in the 7110, LVS for that RealServer dies. The same thing happens when I set the box in fail-through mode, where the box acts as a straight piece of wire, so I believe it has something to do with the wrong MAC address.

I am really hoping that someone has encountered something like this and has a fix. The obvious fix would be to be able, on the director, to force the MAC address of the RealServer instead of looking it up.

Terje Eggestad terje (dot) eggestad (at) scali (dot) no 13 Oct 2000

Since you're in an evaluation process, I offer another solution to the accelerator problem, one that will give you a cleaner LVS setup. I've tried a PCI accelerator card from RainBow. My experience was with Netscape Enterprise Server on HP, but Red Hat flags that their version of Apache, Stronghold, has drivers for this card. The effect was nothing short of amazing. Before adding this card the CPU usage ratio between the web server and the CGI programs was 75:25. After adding this card it was in the neighbourhood of 10:90.

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 17 Aug 2005

I have been using LVS for a small farm for a couple of years now. I am interested in adding SSL hardware acceleration to my two LVS servers. It is my goal to maintain performance by offloading the SSL chores, and reduce the cost of certificate renewal by not applying certificates to my web servers.

Can anyone offer advice from experience doing the same? I am using an LVS-NAT configuration currently and am happy with it. It has been suggested that I get a commercial product to do this (Big-IP from F5) which I am not absolutely opposed to, but if there is a good track record with adding SSL hardware acceleration to LVS then I will be happy to stick with what I've been using.

Peter Mueller pmueller (at) sidestep (dot) com

Intel used to make a daisy-chain network device that would do this. A lot of companies still add an SSL card to a few servers, e.g. http://h18004.www1.hp.com/products/servers/security/axl600l/ or http://www.chipsign.com/modex_7000.htm. And then there are the accelerators on F5s and their like. I think the least disruptive way will be the add-on-card to two servers and a :443 vip containing only them.

kwijibo (at) zianet (dot) com

Why are you worrying about offloading it? I would just buy some boxes with faster CPUs if speed is a concern. In the time it takes to ship the data back and forth over the bus to the SSL accelerator, your processor probably could have taken care of it, especially if the algorithm uses all the bells-and-whistles features today's CPUs have.

Note

Joe: as Horms says, it's all about the number of certificates. This is not a problem that has a technical solution.

20.26.3. SSL'ifying demons

Richard Lyons posting to the qmail mailing list, 05 Feb 2003

To configure Qmail to use "pop3s" and "smtps", there are plenty of examples in the archives at http://www-archive.ornl.gov:8000/, but let me hit the high points.

There are two ways to offer secure versions of the SMTP and POP services. In the first, the existing services on port 25 and 110 can be enhanced with the STARTTLS and STLS extensions (RFCs 2487 and 2595), allowing clients to negotiate a secure connection. The other method is to provide services on different ports and wrap or forward connections on the new ports to the existing services.

STARTTLS for qmail-smtpd is done either with the starttls patch (look for starttls on http://www.qmail.org) or with Scott Gifford's TLS proxy, see http://www.suspectclass.com/~sgifford/stunnel-tlsproxy/stunnel-tlsproxy.html

As far as I know, the only implementation of STLS for qmail-pop3d is Scott's TLS proxy, see the above link.

The starttls patch requires patching and installing a modified qmail-smtpd/qmail-remote but no other changes (apart from configuring certs, etc). The starttls patch also allows secure connections from your mailserver to others supporting RFC2487.

The TLS proxy requires patching and installing a modified stunnel and changing your run scripts, but doesn't modify the current qmail install.

Creating new services can be done with stunnel (http://www.stunnel.org) or sslwrap (http://www.quiltaholic.com/rickk/sslwrap/). You can either configure a daemon to listen on the secure ports (465 and 995 for SSMTP and SPOP3) and forward the traffic to the normal services, or run a service on the ports that invokes the secure wrapper inline. A drawback of the first approach is that connections appear to be from 127.0.0.1, reducing the usefulness of tcpserver on the unencrypted port (I'm told there's a workaround for this on Linux machines).

Creating new services requires configuring stunnel and new run scripts, but no changes to the existing installation.
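For illustration, a minimal stunnel-4 style config wrapping smtps and pop3s onto the existing cleartext services (the cert path is an assumption; see the stunnel-3 and stunnel-4 examples linked below for real-world run scripts):

    ; /etc/stunnel/stunnel.conf
    cert = /etc/stunnel/mail.pem

    [smtps]
    accept  = 465
    connect = 127.0.0.1:25

    [pop3s]
    accept  = 995
    connect = 127.0.0.1:110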

An example of stunnel-3 wrapped services can be found in http://www.ornl.gov/its/archives/mailing-lists/qmail/2003/01/msg01105.html

Jesse reports success with stunnel-4 wrapped services in http://www.ornl.gov/its/archives/mailing-lists/qmail/2002/09/msg00238.html

Finally, let me note that if your users want secure services, they should be using something like PGP/GPG and APOP.

Brad Taylor btaylor (at) Autotask (dot) com 8 Aug 2005

I'm running a web server and Squid in reverse proxy mode, terminating SSL on the Squid box and allowing HTTP to the realserver. I want to add two more realservers for a total of 3. Could I put an LVS into this mix in front of the Squid server, add IPs to Squid and load balance the Squid IPs, allowing Squid to continue to terminate SSL? Traffic would move like this: HTTPS -> (LVS) -> HTTPS (1 of 3 Squid IPs) -> HTTP (1 of 3 realservers). Or would the SSL traffic need to be decrypted at the LVS first?

Joe 2005-08-08

Whether it's better to have the SSL decryption on the realserver or in a separate SSL accelerator isn't clear AFAIK. It's not like it comes up a lot and we've figured out what to do. Horms probably has the clearest position on the matter, which is not to have a separate SSL engine but to have each realserver do its own decrypting/encrypting.

Graeme Fowler graeme (at) graemef (dot) net 08 Aug 2005

I've built a similar system at work (can't go into too much detail about certain parts of it, sadly) but the essence is as follows:

           director + failover director (keepalived/LVS)
                    |
    _________________________________
   |                |                |
squid 1      ... squid 2      ... squid N
   |________________|________________|
   |                |                | 
realserver 1 ... realserver 2 ... realserver N

The Squids are acting in reverse proxy mode and I terminate the SSL connections on them via the frontend LVS, so they're all load balanced. Behind the scenes is a bit more complicated as certain vhosts are only present on certain groups of servers within the "cluster" and the Squids aren't necessarily aware where they might be, so the Squids use a custom-written redirector to do lookups against an appropriate directory and redirect their requests accordingly.

In a position where you have a 1:1 map of squids/realservers you could in theory park a single server behind a single squid, but that doesn't give you much scalability. It does however mean that if a webserver fails then that failure gets cascaded up into the cluster more easily. Then again, if you're shrewd with your healthchecks you can combine tests for your SSL IP addresses and take them down if all the webservers fail.

Also, don't overallocate IP addresses on the Squids. If I say "think TCP ports", you only need a single IP on your frontend NIC... but I'll leave you to work that one out!

Remember that the LVS is effectively just a clever router; it isn't application-aware at all. That's what L7 kit is for :)

Horms

Here is a description I wrote a while ago about SSL/LVS. In a nutshell, you probably want to use persistence and have the real-servers handle the SSL decryption: http://archive.linuxvirtualserver.org/html/lvs-users/2003-07/msg00184.html Actually, using the lblc scheduler, or something similar, might be another good solution to this problem.

Graeme

...this is OK if you have a set of servers onto which you can install multiple SSL Certs and undergo the pain of potentially having an IP management nightmare:

1 realserver, 1 site -> 1 "cluster" IP
2 realservers, 1 site -> 2 "cluster" IPs
...
10 realservers, 10 sites -> 100 "cluster" IPs

I think you can see where this is going! Very rapidly the IP address management becomes unwieldy.

You can of course get around this by assigning different *ports* to each site on each server, but if you have the cert installed on all servers in the cluster this soon becomes difficult to manage too.

Offloading onto some sort of SSL "proxy", accelerator, engine (call it what you will) means that you can then simply utilise the processing power of that (those) system(s) to do the SSL overhead and keep your webservers doing just that, serving pages. If each VIP:443 points to a different port on the "engine" you greatly simplify your address management too.

Horms

I do not follow how having traffic that arrives on the real-servers in plain-text or SSL changes the IP address situation you describe above.

Graeme

Also, most commercial SSL certification authorities will charge you an additional fee to deploy a cert on more than one machine, so if you can reduce the number of "engine" servers you reduce your costs by quite some margin.

Horms

This is the sole reason to use an SSL accelerator in my opinion. The fact is that SSL is likely to be the most expensive ($) operation your cluster is doing. And in many cases it is cheaper to buy some extra servers, and offload this processing to an LVS cluster, than to buy an SSL accelerator card.

Graeme

You do however still need to use persistence, and potentially deploy some sort of "pseudo-persistence" in your engines too to ensure that they are utilising the same backend server. If you don't, you'll get all sorts of application-based session oddness occurring (unless you can share session states across the cluster).

Horms

If you have applications that need persistence, then of course you need persistence regardless of SSL or not. But if you are delivering SSL to the real servers then you probably want persistence, regardless of your applications, to allow SessionID to work, and thus lower the SSL processing cost.

20.27. SSL termination at localnode: patch by Carlos Lozano, Siim Poder and Malcolm Turnbull

Note

This code was originally written by Carlos Lozano (Carlos Lozano's ip_vs_core.c.diff). Malcolm wanted it ported to the current kernel and offered to pay for the work. Siim Poder did the recoding. This patch will be in 2.6.28 (Dec 2008).

Malcolm

I just paid Siim Poder for the work to convert Carlos Lozano's earlier patch to the latest kernel... And then luckily Horms thought it was a good idea... I think that Kemp and Barracuda did it as well but didn't feed it back to the community.

lists lists (at) loadbalancer (dot) org 01 May 2008

At the moment I can do SSL termination with Pound (http://www.apsis.ch/pound/), then hand off locally to HaProxy (http://haproxy.1wt.eu/) for cookie insertion and load balancing:

Pound -> HaProxy -> Real Servers
x.x.x.10:443 -> x.x.x.10:80 -> Real Servers

But I'd like to do :

Pound -> LVS -> Real Servers
x.x.x.10:443 -> x.x.x.10:80 -> Real Servers

But the Pound process on the director can't access the realservers via the local LVS set up at x.x.x.10:80. Is this the local node problem? I've tried in NAT and DR mode. Is there any way I can get LVS to pick up a local request, i.e. so that wget x.x.x.10:80 (from the local console) picks up data from a realserver?

I found what I was after in the HOWTO (Carlos Lozano's ip_vs_core.c.diff)!

Malcolm 17 Dec 2008

The patch is so that we can have a local server proxy (e.g. mod_proxy, pound, stunnel, haproxy) feed into an LVS server pool on the same server (Joe: i.e. the director). We use Pound (reverse proxy) to terminate HTTPS traffic and then pass it to an LVS VIP (masq/NAT) which then distributes to HTTP servers. This only works in LVS-NAT mode with the load balancer as the default gateway.

Blog post here: LVS Local node patch for Linux 2.6.25, Centos 5 kernel build how-to (http://www.loadbalancer.org/blog/lvs-local-node-patch-for-linux-2625-centos-5-kernel-build-how-to/)
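For illustration, a minimal sketch of the Pound side of such a setup (Pound config syntax; the cert path is made up, x.x.x.10 is the address from the example above):

    ListenHTTPS
        Address x.x.x.10
        Port    443
        Cert    "/etc/pound/site.pem"
        Service
            BackEnd
                # the LVS VIP on the same box - this is what the
                # localnode patch makes possible
                Address x.x.x.10
                Port    80
            End
        End
    End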

Note
Joe: you don't need to do the patch/build. The code is in the kernel.

Siim Poder siim (at) p6drad-teel (dot) net 18 Dec 2008

afaik it doesn't require the director to be the default gw in general (for any LVS mode), as the director's IP will be the source IP of the packet. The added functionality is that the director can connect to one of its own VIPs and have the connections load balanced to a number of RSs. As Malcolm already mentioned, the main use case would be having an https proxy on the director accepting https and using a local VIP to LB to http RSs, instead of LBing https to https RSs (potentially using director resources more efficiently, in case the traffic is low enough).

The reason it didn't work (before the patch) is that traditionally ipvs code only tried to load balance packets that were coming IN to the director (from the network). This patch adds the possibility to load balance packets that are leaving the director - that is, packets originating from the director itself.

Localnode only comes into play if you have two directors and a setup that load balances both incoming and outgoing connections. For only load balancing the outgoing connections (that is, just one director) you do not need to have a VIP:443; the https_proxy can just listen on your "normal every-day" IP.

 CLIENTS
    |
    v
 VIP:443
DIRECTOR1
    |
    |\----------------\
    |                 |
    v                 v
localnode          lvs-dr
    |                 |
    v                 v
DIRECTOR1          DIRECTOR2
https_proxy        https_proxy
    |                 |
    v                 v
 VIP:80             VIP:80
DIRECTOR1           DIRECTOR2
    |                 |
    |                 |
    |\               /|
    | \----\ /------/ |
    |       x         |
    | /----/ \------\ |
    |/               \|
    |                 |
    v                 v
 lvs-dr             lvs-dr
    |                 |
    v                 v
REALSERVER1        REALSERVER2
httpd              httpd

The part that was not previously possible, is the https_proxy connecting to VIP:80 on the DIRECTOR1 itself and having the connections load balanced between RS1 and RS2.

There is nothing special to the setup. You just have to connect to one of the VIPs from the director. Before the patch, the connection would get reset, but after patch, it is load balanced to whatever real servers behind the VIP.

Something like this:

ipvsadm -A -t 192.168.0.1:80 -s rr
ipvsadm -a -t 192.168.0.1:80 -r 192.168.1.2:80 -m
ipvsadm -a -t 192.168.0.1:80 -r 192.168.1.3:80 -m

nc 192.168.0.1 80	# after setup, run netcat to VIP:80

For unpatched kernels you would get connection refused; for patched ones you get connected to either 192.168.1.2:80 or 192.168.1.3:80.

20.28. r commands; rsh, rcp (and their ssh replacements), tcp 514

The r commands use multiport protocols; see Section 21.12.

20.29. lpd, tcp 515

Network printers have a functioning tcpip stack and can be used as realservers in an LVS-NAT setup to produce a print farm. I got the idea for this at Disneyworld in Florida. At the end of one of the roller coaster rides (Splash Mountain), cameras take your photo and if you buy your photo, it is printed out on one of about 6 colour printers. I watched the people collecting the print-outs - they didn't seem to know which printer the output was going to and looked at all of them to find their print. The same thing would happen with an LVS-NAT printer farm, since you can't pick the realserver that will get the job.

Using LVS-NAT with lpd is written up in section 5.3 of the article on performance of single realserver LVSs.
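For illustration, a minimal sketch of such a printer farm (addresses made up): an LVS-NAT virtual service on the lpd port, with the network printers as the realservers.

    ipvsadm -A -t 192.168.1.110:515 -s rr
    ipvsadm -a -t 192.168.1.110:515 -r 10.1.1.11:515 -m
    ipvsadm -a -t 192.168.1.110:515 -r 10.1.1.12:515 -m

Clients print to the VIP and the job comes out on whichever printer the scheduler picked.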

20.30. Databases

Because of the large cost of distributed or parallel database servers, people have looked at using LVS with mysql or postgres.

A read-only LVS of databases is simple to setup; you only need some mechanism to update the database (on all of the realservers) periodically. A read-write LVS of databases is more difficult, as a write to the database on one realserver has to be propagated to the databases on the other realservers. A read-mostly database (like a shopping cart) can be loadbalanced by LVS with the updates occurring outside the LVS. However a full read-write LVS database is not feasible at the moment.

Moving data around between realservers is external to LVS. If clients are writing, you need to do something at the back end to propagate the writes to the other nodes on a time scale which is atomic compared to the time of reads by other users. If you have a shopping cart, then other users don't need to know about a particular user's writes until they commit to ordering the item (decrementing the stock), so you have some time to propagate the writes.

Currently most LVS databases are deployed in a multi-tier setup. The client connects to a web-server, which in turn is a client for the database; this web-server database client then connects to a single databased. In this arrangement the LVS should balance the webservers/database clients and not balance the database directly.

Production LVS databases, e.g. the service implemented by Ryan Hulsker RHulsker (at) ServiceIntelligence (dot) com (sample load data at http://www.secretshopnet.com/mrtg/) have the LVS users connect to database clients (perl scripts running under a webpage) on each realserver. These database clients connect to a single databased running on a backend machine that the LVS user can't directly access. The databased isn't being LVS'ed - instead the user connects to LVS'ed database clients on the realserver(s), which handle intermediate data processing, increasing your throughput.

Next, replication was added to mysqld and updates to one database could be propagated to another. Rather than using peer-to-peer replication, LVS users set up replication in a star, with a master that accepted the writes, which were propagated to slave databases on the realservers. Then mysqld acquired functionality similar to loadbalancing, removing any need to develop LVS for loadbalancing a database (e.g. - from Ricardo Kleeman 4 Aug 2006 - multi-master replication, which means you can have 2 servers as master and slave... so in that sense a pair should be usable for load balancing). (also see loadbalanced mysql cluster) Some of the earlier comments about mysqld below may no longer be relevant to current versions of mysqld.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 10 Dec 2008

Load balancing an SQL server is no problem with LVS, but just like every other database it is pretty difficult to get your application to work with it unless it's a read-only copy. The normal configuration is a shared-storage twin-head write cluster + load balanced read-only clusters.

20.30.1. Multiple Writer, Common Filesystem

Note
summary: this doesn't work

An alternative to propagating updates to databases is to use a distributed filesystem, which then becomes responsible for the updates appearing on other machines. (see Filesystems for realserver content and Synchronising realserver content). While this may work for a webserver, a databased is not happy about having agents outside its control writing to its data filesystem.

A similar early naive approach, of having databaseds on each realserver access a common filesystem on a back-end server, fails. Tests with mysqld running on each of two realservers, working off the same database files mounted from a backend machine, showed that reads were OK, but writes from any realserver either weren't seen by the other mysqld or corrupted the database files. Presumably each mysqld thinks it owns the database files and keeps copies of locks and pointers. If another mysqld is updating the filesystem at the same time then the first set of locks and pointers is invalid. Presumably any setup in which multiple (unlocked) databaseds were writing to one file system (whether NFS'ed, GFS'ed, coda, intermezzo...) would fail for the same reason.

In an early attempt to set up this sort of LVS, Jake Buchholz jake (at) execpc (dot) com set up an LVS'ed mysql database with a web interface. LVS was to serve http and each realserver to connect to the mysqld running on itself. Jake wanted the mysql service to be LVS'ed as well and for each realserver to be a mysql client. The solution was to have 2 VIPs on the director, one for http and the other for mysqld. Each http realserver makes a mysql request to the mysql VIP. In this case no realserver is allowed to have both a mysqld and an httpd. A single copy of the database is nfs'ed from a fileserver. This works for reads.

20.30.2. Parallel Databases

Malcolm Cowe

What if the realservers were also configured to be part of an HA cluster like SGI Failsafe or MC/ServiceGuard (from HP)? Put the real servers into a highly available configuration (probably with shared storage), then use the director to load balance connections to the virtual IP addresses of the HA cluster. Then you have a system that parallelises the database and load balances the connections. This might require ORACLE Open Parallel Server (OPS) edition of the database, you'd have to check.

"John P. Looney" john (at) antefacto (dot) com 12 Apr 2002

My previous employer did try such a system, and had very little luck with it. In general, you are much better off having one writer, and multiple readers. Then, update the "read-only" databases from the write-database. We were using Oracle Parallel Server on HP-UX, and it crashed about once every two weeks, not always just under heavy load. Databases are by design single entities. Trying to find a technical solution to such a mathematical problem is asking for pain and suffering, over a prolonged period of time.

20.30.3. Linux Scalable Database project

Note
May 2001: The LSD project does not seem to be active anymore. The replication feature of mysql is functionally equivalent.

The Linux Scalable Database (LSD) project http://lsdproject.sourceforge.net/ is working on code to serialise client writes so that they can be written to all realservers by an intermediate agent. Their code is experimental at the moment, but is a good prospect in the long term for setting up multiple databased and file systems on separate realservers.

20.30.4. Gaston's Apache/mysql setup

Gaston Gorosterrazgoro (at) hostarg (dot) com (dot) ar Jun 12 2003

I solved my problems quite differently from what is seen in the HOWTO, so someone may need this email in the near future:

-------------------------------------------------------------------------
    Director
        192.168.0.23 (eth1)
        192.168.0.24 (eth0:100)
        10.0.0.1 (eth0)
        Just configured ip_vs to load balance Apache between Server3 and
Server1 (in this order because for me, Server3 is the primary apache server)
-------------------------------------------------------------------------
    Server1 (primary MySQL, secondary Apache)
        10.0.0.3 (eth0)
        10.0.0.10 (eth0:100)
        MySQL is Master/slave for Server3

-------------------------------------------------------------------------
    Server2 (secondary MySQL)
        10.0.0.4 (eth0)
        Mon running, checking every 10 secs. MySQL in 10.0.0.3. If it fails,
then this machine configures eth0:100 as 10.0.0.10.
        MySQL is Master/slave for Server1
-------------------------------------------------------------------------
    Server3 (primary Apache)
        10.0.0.5
        Apache (in my case php), connects always to 10.0.0.10.
-------------------------------------------------------------------------

End of story. I have better security for the MySQL daemons (not accessible from clients), less load on the director machine (it doesn't have to worry about MySQL, nor run MON yet), and I'm happy. :) Now it's the turn of Cyrus on Server2 and Server3.

20.31. Databases: mysql

20.31.1. How To Set Up A Load-Balanced MySQL Cluster

There is a writeup (Mar 2006) at http://www.howtoforge.com/loadbalanced_mysql_cluster_debian by Falko Timme based on UltraMonkey. (we don't know this guy and he's never posted to the LVS mailing list - he seems to know what he's doing though.)

20.31.2. mysql replication

MySQL (and most other databases) supports replication of databases.

Ted Pavlic tpavlic (at) netwalk (dot) com 23 Mar 2001

When used with LVS, a replicated database is still a single database. The MySQL service is not load balanced. HOWEVER, it is possible to put some of your databases on one server and others on another. Replicate each SET of databases to the OTHER server and only access them from the other server when needed (at an application or at some fail-over level).

Doug Sisk sisk (at) coolpagehosting (dot) com 9 May 2001

An article on mysql's built-in replication facility

Michael McConnell michaelm (at) eyeball (dot) com 13 Sep 2001

Can anyone see a downside or a reason why one could not have two systems in a failover relationship running MySQL? The database files would be synchronized between the two systems via a crontab and rsync. Can anyone see a reason why rsync would not work? I've on many occasions copied the entire mysql data directory to another system and started it up without a problem. I recognize that there is a potential problem in that the rsync might take place while the master is writing, so the sync would only have part of a table, but mysql's new table structure is supposed to account for this. If anything, a quick myisamchk should resolve these problems.

Michael McConnel michaelm (at) eyeball (dot) com 2001-09-14

There are many fundamental problems with MySQL Replication. MySQL's replication requires that two systems be set up with identical data sources and activated in a master/slave relationship. If the Master fails, all requests can be directed to the Slave. Unfortunately this Slave does not have a Slave, and the only way to give it a slave is to turn it off, synchronize its data with another system and then activate them in a Master/Slave relationship, resulting in serious downtime when databases are in excess of 6 gigs (-:

This is the most important problem, but there are many many more, and I suggest people take a serious look at other options. Currently I use a method of syncing 3 systems using BinLog's.

Paul Baker pbaker (at) where2getit (dot) com

What is the downtime when you have to run myisamchk against the 6 gig database because rsync ran at exactly the same time as mysql was writing to the database files, and now your sync'd image is corrupted?

There is no reason you can not set up the slave as a master in advance from the beginning. You just use the same database image as from the original master.

When the master goes down, set up a new slave by simply copying the original master image over to the new slave, then point it at the old slave that was already set up to be a master. You wouldn't need to take the original slave down at all to bring up a new one. You would essentially be setting up a replication chain, but with only the first 2 links active.

Michael McConnell michaelm (at) eyeball (dot) com

In the configuration I described using rsync, the myisamchk would take place on the slave system. I recognize the time involved would be very large, but this is only the slave. This configuration would be set up so that an rsync between the master and slave takes place every 2 hours, and then the slave would execute a myisamchk to ensure the data is ready for action.

I recognize that approximately 2 hours worth of data could be lost, but I would probably use the MySQL BinLogs, rotated at 15 minute intervals and stored on the slave, to allow this to be manually merged in, keeping the data loss time down to only 15 minutes.

Paul, you said that I could simply copy the data from the Slave to a new Slave, but you must keep in mind that in order to do this MySQL requires that the Master and Slave data files be IDENTICAL; that means the Master must be turned off, the data copied to the slave, and then both systems activated, resulting in serious downtime.

Paul

You only have to make a copy of the data once, when you initially set up your Master. Your downtime is only as long as it takes to do this:

   kill `cat /path/to/mysql/datadir/mysqld.pid`       # stop the master (pid file name/location varies)
   tar cvf mysql-snapshot.tar /path/to/mysql/datadir  # snapshot the data files
   /path/to/mysql/bin/safe_mysqld &                   # restart the master

Your downtime is essentially only how long it takes to copy your 6 gigs of data, NOT across a network, but just on the same machine (which is far less than a myisamchk on the same data). Once that is done, you can copy the 6 gigs to the slave at your leisure, while the master is still up and serving requests.

You can then continue to make slave after slave after slave just by copying the original snapshot to each one. The master never has to be taken offline again.

Michael McConnell

You explained that I can kill the MySQL on the Master, tar it up, copy the data to the Slave and activate it as the Slave. Unfortunately this is not how MySQL works. MySQL requires that the Master and Slave have identical data files, *IDENTICAL*; that means the Master (tar file) cannot change before the Slave comes online.

Paul

Well I suppose there was an extra step that I left out (because it doesn't affect the amount of downtime). The complete migration steps would be:

  1. Modify my.cnf file to turn on replication of the master server. This is done while the master mysql daemon is still running with the previous config in memory.
  2. shutdown the mysql daemon via kill.
  3. tar up the data.
  4. start up the mysql daemon. This will activate replication on the master and cause it to start logging all changes for replication since the time of the snapshot in step 3. At this point downtime is only as long as it takes you to do steps 2, 3, and 4.
  5. copy the snapshot to a slave server and activate replication in the my.cnf on the slave server as both a master and a slave.
  6. start up the slave daemon. at this time the slave will connect to the master and catch up to any of the changes that took place since the snapshot.

So as you see, the data can change on the master before the slave comes online. The data just can't change between when you make the snapshot and when the master daemon comes up configured for replication as a master.
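For illustration, a minimal sketch of the my.cnf entries involved (3.23/4.0-era option names; the hostnames and passwords are made up). Note that the slave also has log-bin turned on, so it is already a master-in-waiting, as described above:

    # master my.cnf
    [mysqld]
    server-id = 1
    log-bin

    # slave my.cnf
    [mysqld]
    server-id       = 2
    master-host     = master.example.com
    master-user     = repl
    master-password = secret
    log-bin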

Michael McConnell

Paul you are correct. I've just done this experiment.

A(master) -> B(slave)
B(master) -> C(slave)

A died. Turn off C's database, tar it up, replicate the data to A, activate A as a slave to C. No data loss, and 0 downtime.

(there appears to have been an offline exchange in here.)

Michael McConnell michaelm (at) eyeball (dot) com

I've just completed the rsync deployment method; this works very well. I find this method vastly superior to both of the other methods we discussed. Using rsync allows me to use only 2 hosts and I can still provide 100% uptime. In the other methods I need 3 systems to provide 100% uptime.

In addition the Rsync method is far easier to maintain and setup.

I do recognize this is not *perfect*. I run the rsync every 20 minutes, and then I run myisamchk on the slave system immediately afterwards. I run the myisamchk to only scan tables that have changed since the last check; not all my tables change every 20 minutes. I will be timing operations and lowering this rsync interval to approximately 12 minutes. This method works very effectively for managing a 6 gig database that changes approximately 400 megs of data a day.
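For illustration, a sketch of what the slave's crontab for this scheme might look like (the paths, hostname and the use of ssh as the rsync transport are assumptions):

    # pull the master's data every 20 minutes, then check only the
    # tables that have changed since the last check
    0,20,40 * * * *  rsync -a -e ssh master:/var/lib/mysql/ /var/lib/mysql/
    5,25,45 * * * *  myisamchk --silent --check-only-changed /var/lib/mysql/*/*.MYI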

Keep in mind, there are no *real time* replication methods available for MySQL. Running with MySQL's built-in replication commonly results (at least with 6 gigs / 400 megs changing) in as much as 1 hour of data inconsistency. The only way to get true real time is to use a shared storage array.

Paul Baker

MySQL builtin replication is supposed to be "realtime" AFAIK. It should only fall behind when the slave is doing selects against a database that causes changes to wait until the selects are complete. Unless you have a select that is taking an hour, there is no reason it should fall that far behind. Have you discussed your findings with the MySQL developers?

Michael McConnell

I do not see MySQL making any claims to Real-time. There are many situations where a high load will result in systems getting backed up, especially if your Slave system performs other operations.

MySQL's built-in replication functions like so (a small command-line sketch follows the list):

  1. Master writes updates to Binary Log
  2. Slave checks for Binary Updates
  3. Slave Downloads new Bin Updates / Installs
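
A small sketch of watching these steps from the command line, assuming you can run the mysql client against both boxes:

# on the master: which binary log is current and how far writing has got
mysql -e 'SHOW MASTER STATUS'

# on a slave: which master log it is reading and how far it has caught up
mysql -e 'SHOW SLAVE STATUS\G' | egrep 'Master_Log_File|Master_Log_Pos'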

Alexander N. Spitzer aspitzer (at) 3plex (dot) com

How often are you running rsync? Since this is not realtime replication, you run the risk of losing information if the master dies before the rsync has run (and new data has been added since the last rsync).

Don Hinshaw dwh (at) openrecording (dot) com

I have one client where we do this. At this moment (I just checked) their database is 279 megs. An rsync from one box to another across a local 100mbit connection takes 7-10 seconds if we run it at 15 minute intervals. If we run it at 5 minute intervals it takes <3 seconds. If we run it too often, we get an error of "unexpected EOF in read_timeout".

I don't know what causes that error, and they are very happy with the current situation, so I haven't spent any time to find out why. I assume it has something to do with write caching or filesystem syncing, but that's just a wild guess with nothing to back it up. For all I know, it could be ssh related.

We also do an rsync of their http content and httpd logs, which total approximately 30 gigs. We run this sync hourly, and it takes about 20 minutes.

Benjamin Lee benjaminlee (at) consultant (dot) com

For what it's worth, I have also been witness to the EOF error. I have also fingered ssh as the culprit.

John Cronin

What kind of CPU load are you seeing? rsync is more CPU intensive than most other replication methods, which is how it gains its bandwidth efficiency. How many files are you syncing up - a whole filesystem, or just a few key files? From your answer, I assume you are not seeing a significant CPU load.

Michael McConnell

I rsync 6 Gigs worth of data, approximately 50 files (tables). Calculating checksums is really a very simple calculation; the CPU used to do this is between 0% and 1% of a PIII 866 (care of vmstat).

I believe all of the articles you have found are related to rsync servers that function as major FTP servers, for example ftp.linuxberg.org or ftp.cdrom.com.

Mark

We're using master/slave replication. The LVS balances the reads from the slaves and the master isn't in the load-balancing pool at all. The app knows that writes go to the master and reads go the VIP for the slaves. Aside from replication latency, this works very reliably.

Todd Lyons tlyons (at) ivenue (dot) com 26 Oct 2005

This works for us too. We have an in-house DBI wrapper that knows to read from slaves and write to the master. Reads are directed to the master for a certain number of seconds until we are reasonably sure that the data has been replicated out.

This is exactly how I was doing it. The problem was that I expected the data on the slaves to be realtime - and the replication moved too slowly. Not to mention it broke a lot (not sure why) - anytime there was a query that failed, it halted replication.

We've not seen this. Our replication has been very stable, and our system is roughly 80% read, 20% write. But we do as few reads as possible on the master, keeping all of the reads on the slaves, doing writes to the master.

In which case, I had to either take the master down, or put it in read-only mode, so I could copy the database over and re-initialize replication.

Make a tarball of the databases that you replicate and keep that tarball someplace available to all slaves. Record the replication point and filename. Then when you have to restart a slave, blow away the db, extract the tarball, configure the info file with the correct data, then start mysql and it will populate itself from the bin files on the master. For us, we try to do it once per month, but I've seen 100 days of bin files processed, and it only took about 30 minutes to catch the database up. We have gig networking though, so your results may vary.
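
A sketch of that slave rebuild, using CHANGE MASTER TO rather than editing the info file by hand; the hostnames, paths, log file and position are placeholders (the log name/position echo the example output below), and it assumes the tarball was made at a recorded replication point as described above:

# on the slave being rebuilt (assumes the snapshot was made with
#   tar czf mysql-snapshot.tar.gz -C /var/lib/mysql . )
/etc/init.d/mysql stop
rm -rf /var/lib/mysql/*
tar xzf /share/mysql-snapshot.tar.gz -C /var/lib/mysql
/etc/init.d/mysql start
mysql -e "CHANGE MASTER TO MASTER_HOST='fs51', MASTER_USER='repl',
    MASTER_PASSWORD='secret', MASTER_LOG_FILE='fs51-bin.395', MASTER_LOG_POS=18123717"
mysql -e "START SLAVE"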

We also have a script that connects and checks the replication status for each slave and spits out email warnings if the bin logs don't match. It doesn't check the position because it frequently changes while the script runs, sometimes more than once. Here's an example:

[todd@tlyons ~]$ checkmysqlslave.sh 
Status Check of MySQL slave replication.
Log file names should match exactly, and
log positions should be relatively close.

Machine      : sql51
Log File     : fs51-bin.395
Log Position : 18123717

Machine      : sql52
Log File     : fs51-bin.395
Log Position : 18123863

Machine      : images3
Log File     : fs51-bin.395
Log Position : 18123863

Machine      : images4
Log File     : fs51-bin.395
Log Position : 18124769

Machine      : images5
Log File     : fs51-bin.395
Log Position : 18125333

Machine      : images6
Log File     : fs51-bin.395
Log Position : 18125333
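
The script itself isn't shown; a minimal reimplementation might look like the following, where the host list and credentials are assumptions:

#!/bin/sh
# checkmysqlslave.sh (sketch) - report the master log file/position each slave has read
echo "Status Check of MySQL slave replication."
for host in sql51 sql52 images3 images4 images5 images6; do
    echo
    echo "Machine      : $host"
    mysql -h $host -u monitor -psecret -e 'SHOW SLAVE STATUS\G' |
        awk '$1 == "Master_Log_File:"     { print "Log File     : " $2 }
             $1 == "Read_Master_Log_Pos:" { print "Log Position : " $2 }'
done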

Of course, this was on sub-par equipment, and maybe nowadays it runs better. I'm thinking that the NDB cluster solution might be better since it's designed more for this - replication still (afaik) sucks because when you need to re-sync it requires the server to be down or put in read-only mode, which is essentially downtime. The main problem I see with NDB is that it requires the entire database (or whatever part each node is responsible for) to be in memory.

We have had very stable results from mysql stable releases doing replication. We typically use the MySQL-built rpms instead of the vendor rpms. We had serious stability problems with NDB, but we were also trying a very early version. It is liable to be much more solid now.

NDB does not work with files the way the normal InnoDB, MyISAM, etc. engines work. And I think this is still the case in MySQL 5 (correct me if I'm wrong.) I don't know what happens if your storage node suddenly runs out of memory.

It won't be until MySQL 5.1 that you'll be able to do file-based clustering. Then load balancing becomes simple, for both reads and writes.

Dan Yocum yocum (at) fnal (dot) gov 10 Dec 2008

While not strictly on topic wrt LVS, with MySQL one can use multi-master, shared-nothing replication or MySQL clustering (yes, these are 2 separate methods for replicating data), with all nodes having read and write access. I have implemented the former with LVS load balancing the connections to the MySQL realservers with good success. I used these guides to set this up: http://www.howtoforge.com/mysql_database_replication and http://www.howtoforge.com/loadbalanced_mysql_cluster_debian. And I wrote up the whole procedure for our (grid) authentication and authorization environment: https://docs.google.com/View?docID=ddszv68g_19d88pzv&revision=_latest

20.31.3. Clustering mysql

Matthias Mosimann matthias_mosimann (at) gmx (dot) net 2005/03/01

I also don't have experience with any of these solutions but...

  • MySQL's own solution for making a mysql cluster is MySQL Cluster (http://www.mysql.com/products/cluster/).
  • The book "High Performance MySQL Cluster" is available from O'Reilly.
  • For people using PHP with a MySQL cluster see SQL relay (http://sqlrelay.sourceforge.net/).
  • If you're using a java application maybe this page could help you: ObjectWeb: Clustered JDBC (http://c-jdbc.objectweb.org/)

Jan Klopper janklopper (at) gmail (dot) com 2005/03/02

Your best method (if you only do a lot of SELECT queries) would be to use one mysql master server which gets all the update/delete/insert queries and stores the main database. Then you could use a mysql master-slave setup to replicate this master database to a couple of slave servers on which you could do all of your selects.

You could use Mon/ldirectord to make an LVS cluster out of these select servers just like any other type of server. I'm not sure what you should do about persistence though (especially if you use mysql_pconnect from within php/apache).

I have the following setup:

  • cheap webservers (LVS)
  • cheap load balancers (VIP)
  • expensive raid 10 mysql server (master)
  • cheap fileserver + mysql slave (just for backup, not for queries)

Granted, if your database server dies, you're screwed, but with the current mysql master/slave/replication/cluster technologies you wouldn't be able to provide failover for update/insert/delete queries anyway.

Note
Making one of the slaves into a master, and thus changing its database, will give you a serious problem when your master comes back up again. You'd have to make the old master a slave, stop the current master (breaking the whole point of always being online) and replicate to it, then when it is up to date change its role back to master.

Jan Klopper

So you can set up a mysql server on each of the nodes of your cluster (allowing all of the nodes to be load balanced mysql servers).

What would be a better solution (afaik) is to place mysql on all the apache servers as well, since this would allow them to connect to their own database instead of using a network connection, decreasing the overhead for each query/connect. And if one of the webserver nodes goes down, so does its mysql, so that's not a real problem. LVS will just not redirect any requests to the machine at all. A non-clustered server would then be the ndbd server managing and updating all of the mysql servers on the nodes itself. And thus you could use really inexpensive hardware for the mysql server, as you won't need UBER speed on that machine to handle all requests from all LVS-nodes anymore.

Todd Lyons tlyons (at) ivenue (dot) com

We've always found that separating services out to different machines works better than running multiple services on each machine.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com (replying to Jan)

Well, you still need a very powerful ndbd server with high network capabilities, and you should use at least two data nodes.

Matthias Mosimann

Here's a thread in the mysql mailing list about clustering / loadbalancing mysql http://forums.mysql.com/read.php?25,14754,14754#msg-14754

Francois JEANMOUGIN

The idea is to have small MySQL nodes on the realservers and 2 or 3 data nodes (powerful machines) on the backend. The current state of the art for MySQL is to use an NDB cluster. You can use LVS to load-balance connections to your MySQL servers. You need a small management server to manage it.

Todd Lyons tlyons (at) ivenue (dot) com 2005/03/02

We worked for a short while with NDB clustering and experienced huge problems. It was probably more an issue with perl-DBD than NDB itself, but it was enough that we had to go back to standard replication. Our current setup uses a homegrown DBI wrapper that does all reads from the slave and directs all writes to the master (and for a short time all reads in that session go to that master too). As read load escalates, just add another slave and put it on an LVS VIP.

Francois JEANMOUGIN

To check whether a MySQL server is alive or not, you have two solutions (assuming you choose Keepalived):

  • Use a TCP_CHECK that will detect if the server accepts connections.
  • Use the check_mysql from nagios-plugins embedded in a MISC_CHECK script (both options are sketched below).
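
A minimal keepalived.conf fragment sketching both options; the addresses, credentials and nagios-plugins path are assumptions, and only one checker would normally be active per realserver:

real_server 192.168.1.11 3306 {
    weight 1
    # option 1: just check that mysqld accepts connections
    TCP_CHECK {
        connect_port 3306
        connect_timeout 3
    }
    # option 2: ask mysqld a real question via the nagios plugin
    # MISC_CHECK {
    #     misc_path "/usr/lib/nagios/plugins/check_mysql -H 192.168.1.11 -u monitor -p secret"
    #     misc_timeout 5
    # }
}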

Jan Klopper

But as soon as you start doing updates to the database, you will horribly screw up your data, because mysql can't propagate the changes made on slaves back to the master and so on. This will lead to inconsistencies in your database. If you only need to do loads (and I mean loads) of select queries with no changes, then you could very well use mysql slaves inside an LVS to run these queries.

Francois JEANMOUGIN

Are we talking about the same thing? I'm not talking about the master/slave or active-slave method using the old MySQL replication process. I'm talking about NDB cluster: http://dev.mysql.com/tech-resources/articles/mysql-cluster-for-two-servers.html, http://dev.mysql.com/doc/mysql/en/mysql-cluster-basics.html I'm currently evaluating this solution, which has passed the "case study" test. I will implement it at the end of this month. I would like to know if anyone here has any (bad or good) experience with such a MySQL setup.
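
For reference, the two-server setup in those links boils down to something like the following config.ini on the management host, plus a couple of lines in each SQL node's my.cnf; the hostnames and paths here are placeholders, not a tested configuration:

# config.ini on the management server
[ndbd default]
NoOfReplicas=2

[ndb_mgmd]
HostName=mgmt.example.com

[ndbd]
HostName=data1.example.com
DataDir=/var/lib/mysql-cluster

[ndbd]
HostName=data2.example.com
DataDir=/var/lib/mysql-cluster

[mysqld]
[mysqld]

# my.cnf on each SQL node (the mysqld that the applications/LVS talk to)
[mysqld]
ndbcluster
ndb-connectstring=mgmt.example.com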

Todd Lyons tlyons (at) ivenue (dot) com

We've had experience and it was bad. Things like 'DESCRIBE table' would give erratic results. Required a beta version of perl-DBD. Performance was good though. We had a 2 node cluster with plans to scale upwards. I think once supporting code stabilizes it will be a good solution. I like the way they keep things in memory because RAM is cheap. We are also evaluating the true clustering solution provided by Emic Networks. It is fantastic in my opinion, but it's not cheap. It requires custom configuration of your switches and routers to handle multicast MACs, a virtual gateway, and some VLAN'ing. It works well though and scales well. If your system has more than 50% reads, it will work well for you.

Jan Klopper (replying to Francois)

You are talking about pure failover only, right? I was talking about load balancing for mysql, which is pretty much undoable. Pure failover might be possible; just make sure the slave becomes the master on failover, and when the old master gets back up it starts being a slave, and replicates from the new master.

Francois

No no no. Please read the links I sent, and add this one: http://dev.mysql.com/doc/mysql/en/mysql-cluster-overview.html I'm talking about a pure active-active cluster. This setup could be strengthened by LVS and a Keepalived MISC checker. In the figure in that overview, you put LVS between Applications and SQL. About the link above, note that the mysqld server and ndbd server can be on the same machine.

Jan Klopper

Even normal master-slave replication still gives loads of problems, so anything cutting edge which might do the trick and allow the slaves to be full database servers (and thus capable of handling updates/inserts/deletes) should not be trusted, in my opinion. Granted, if they get the slaves to also be masters to each other, and they get it to play nice, then we're all set to have the best and fastest clustering database on the planet (except for oracle?).

Bosco Lau bosco (at) musehub (dot) com

Let's forget about the 'multi'-master scenario. It is basically an unsolvable problem for MySQL (at the moment). The only sensible multi-server setup for MySQL is 'single master' - 'multiple slave', preferably in 'star shape formation', i.e. each slave connected directly to the master. Also don't bother 'daisy-chaining' the slaves; this creates a single point of failure. With the multiple slaves/LVS setup, you can just use MyISAM tables in the slave databases whereas you may have InnoDB tables in the master. This will make your SELECT performance even better. Since the slaves just replicate the data from the master, there is no need for transactional table types in the slave database.

Jan Klopper

Multi-master setups are a huge problem in mysql. If you do use slaves for just the queries, use MyISAM tables (preferably stored completely in memory) to increase select speeds. I'm using mysql to run huge banner/advertisement systems, and thus I need at least one update/insert per view, or I should store them outside of the db, and thus I can't use normal slaves (because I'd still need to connect to the master to make the update). Star formation is the way to go, unless you have a huge amount of changes on the master which makes you need a second cascade in there to load balance the updates and relieve the master from most of the slave replication queries. I think this step could also be done with LVS (having a couple of slaves retrieve the updates from one master, balanced by LVS) and then use them as masters for a couple of other normal query slaves.

Matthias Mosimann matthias_mosimann (at) gmx (dot) net 2005/04/18

In a mysql cluster, data is stored in memory. If you store about 100 GB in your database you need about 100 GB of memory (ram + swap). Now it looks like there will be a solution for that. I found a thread in the mysql-cluster mailing list and a link in the manual. MySQL 5.1 will be able to handle disk-based records in the cluster.

Thread:
http://forums.mysql.com/read.php?25,20801,20801#msg-20801

Link:
http://dev.mysql.com/doc/mysql/en/mysql-5-1-cluster-roadmap.html

Nigel Hamilton nigel (at) turbo10 (dot) com 2005/04/18

The RAM requirements of the initial MySQL Cluster design make it unworkable for me, so a hybrid RAM + disk-based system is welcome news. Our writes may equal or exceed reads - I was planning on using "memcached" to help with handing off as many reads as possible to a distributed RAM bucket. But I'm still thinking of ways to distribute writes - ideally at the application level all I need is a database handle that connects to one big high speed amorphous distributed DB. My current short term solution to distributing writes is to use a MySQL heap/ram table on each node which acts as a bucket collecting writes and which is periodically emptied to the Master's disk.
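
A sketch of the periodic flush for such a write bucket, assuming each node has a MEMORY (HEAP) table with the same structure as the master's on-disk table; all names and hosts are made up, and error handling is omitted:

#!/bin/sh
# flush-writes.sh - replay locally buffered rows against the master, then empty the buffer
mysqldump --no-create-info --compact localdb hits > /tmp/hits.sql
mysql -h master.example.com maindb < /tmp/hits.sql
mysql localdb -e 'TRUNCATE TABLE hits'   # rows inserted between the dump and here are lost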

Matthias Mosimann matthias_mosimann (at) gmx (dot) net

IIRC we had a discussion (Jan Klopper, Francois Jeanmougin and me) on the 2nd of March this year on the same topic already. Here's the link: http://archive.linuxvirtualserver.org/html/lvs-users/2005-03/mail4.html

Francois Jeanmougin

We did some more testing on high availability/high performance MySQL:

  • InnoDB has a bottleneck if you have more than 1024 simultaneous concurrent connections to the DB (we have about 5000).
  • NDB (the MySQL cluster) works quite well, but has a problem with DB opening (use database;) which limits the scalability of the solution.

We didn't have time to test NDB properly in our environment. The solution seems to be good in terms of design: it is memory based and uses table partitioning (so it'll split the data across several servers). It has to be improved to be really usable and strong.

Troy Hakala troy (at) recipezaar (dot) com 25 Oct 2005

We're using master/slave replication. The LVS balances the reads from the slaves and the master isn't in the load-balancing pool at all. The app knows that writes go to the master and reads go the VIP for the slaves. Aside from replication latency, this works very reliably.

The problem then is failure of the master. The master is redundant in itself (RAID drives, redundant power) to minimize the risk. But yes, we are very read-intensive and keeping the master out of the read load-balancing further increases the master's lifespan.

We haven't tried master-master replication, but it's pretty easy and quick to turn a slave into a master. We prefer simplicity to complexity. Besides, we're not a bank and no one dies if the master db goes down for a few minutes every 5 years. :-)

FWIW, our outages have been caused by bandwidth outages more than server hardware failure. There's supposed to be redundancy there too, but for some reason or another, the redundancy never kicks in. ;-)

Replication can be restarted easily: slave start after you fix the error, which is usually just a mysqlcheck on the table. You shouldn't have to take the master down at all.

And if you use a replication check in LVS, LVS will take the db server out of rotation if it's not replicating, so it shouldn't even be noticed until you can get around to fixing it. Nagios is also recommended to let you know when a slave is no longer replicating.

The latency is on the order of a couple seconds and it's easy to take care of in the app. In fact, it's only a problem if you cache db results for hours or days (we use memcached). So it's not much of a problem in reality if you account for it.

NDB violates my simplicity+commodity ideals... it's complicated and requires very expensive (lots of RAM) servers. And, I think it *requires* SCSI drives for some reason (I thought it was memory-based)! Doesn't Yahoo use MySQL replication? If it's good enough for Yahoo, it's good enough for most people. :-)

If the master went down, we'd make a slave into a master manually. But it's never happened in real life, just in lab tests. A script could be written to do it, I guess, but it's not that simple.

All the talk about redundancy and high-availability is a bit academic, IMO, unless it really is mission-critical -- and if it was, I'd be using Oracle and not MySQL so I could blame them. ;-)

As I mentioned, none of this matters if bandwidth and electricity go out, which is less reliable in my experience than solid-state computer hardware.

mike mike503 (at) gmail (dot) com 25 Oct 2005

Perhaps changing a slave to a master has changed from when I used it - anytime a slave died, it would not start replicating (or you wouldn't want it to) unless it caught up to the master since it died. But to grab a snapshot of the master data, you would have to take the master down or flush tables with a read lock until it was done getting the data synced. Then you could restart it...

For example, I was using this on a forum - when someone posts a message:

webserver inserts data into master
master replicates to slave
webserver reads from slave

If that doesn't happen within milliseconds, the user is taken to a page that doesn't have their post show up yet.

Pierre Ancelot pierre (at) bostoncybertech (dot) com 26 Oct 2005

I run NDB and it works pretty well... Debian GNU/Linux Sarge, MySQL 4.1, 2 nodes and a management server.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 26 Oct 2005

Note that it is unable to handle more than about 1000 rq/s. Our MySQL server (standalone MyISAM) goes up to 3000 rq/s. InnoDB hangs at 1024 rq/s.

20.32. Using Zope with databases

Karl Kopper karl (at) gardengrown (dot) org 10 Feb 2004

Here is the Zope database (ZODB) (http://www.zope.org/Products/ZEO/ZEOFactSheet). The "ZEO" client/server model was designed to make ZODB work in a cluster for read-mostly data.

I'm using it, or about to use it in production. Here is a quick write-up:

Zope (http://www.zope.org/) is a collection of python programs that help display web content. Zope can run as a standalone web server or underneath Apache. Add-on features to Zope are called "products." One Zope product is called the Content Management Framework or CMF. The CMF product introduces the concept of users to the Zope content. A separate project later sprang up on top of the CMF called Plone (plone.org). Plone and Zope let you build your own web pages using either the TAL or the deprecated DTML language that Zope interprets. Zope has connectors to external databases like Postgres and MySQL but it also comes with its own database called ZODB.

ZEO sits between the Zope scripts that you write and the ZODB. The ZEO client grabs all calls the local Zope server makes to the ZODB and tries to satisfy them using a local cache or by sending the request over the network to a ZEO server. The ZEO server then writes the data to local storage using the ZODB.

Because ZEO clients talk to the ZEO server on an IP address, the ZEO server can be made highly available with Heartbeat (on a server pair outside the cluster), and each realserver in the LVS cluster can be a ZEO client. There are a few catches to this; the biggest is that all Zope servers should share an "instancehome". See the Zope web site for details.
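
A rough sketch of the moving parts, assuming a ZEO server address of zeoserver:8100 (the hostname, port and Data.fs path are made up):

# on the ZEO server (the address that heartbeat fails over):
runzeo -a 8100 -f /var/zope/var/Data.fs &

# each realserver's Zope instance is then configured as a ZEO client,
# pointing its ZODB at zeoserver:8100 instead of a local Data.fs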

20.33. Databases: Microsoft SQL server, tcp 1433

Malcolm Turnbull had a 3-Tier LVS with database clients running on the realserver webservers connecting back through the VIP on the director to other realservers running M$SQL. The requests never made it from the database clients to the database.

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 10 May 2004

Apparently if you use either a netbios name in an ASP db connection string, or if your web server has M$oft authentication turned on, IIS will use a challenge-response handshake to log on to the SQL server even though you've specified a basic username and password. Anyway it's working fine now. Here's the M$oft article (http://support.microsoft.com/default.aspx?scid=kb;EN-US;175671).

20.34. Databases: Oracle

Bilia Gilad giladbi (at) rafael (dot) co (dot) il 03 Feb 2005

Hi, I configured Oracle Application Server 10g with LVS. All works fine except Oracle Portal. From the Oracle Portal cluster manual: http://www.tju.cn/docs/oas90400/core.904/b12115/networking.htm

" The Parallel Page Engine (PPE) in Portal makes loopback connections to Oracle Application Server Web Cache for requesting page metadata information. In a default configuration, OracleAS Web Cache and the OracleAS Portal middle-tier are on the same machine and the loopback is local. When Oracle Application Server is front-ended by an LBR, all loopback requests from the PPE will start contacting OracleAS Web Cache through the LBR. Assume that the OracleAS Portal middle-tier and OracleAS Web Cache are on the same machine, or even on the same subnet. Then, without additional configuration, loopback requests result in network handshake issues during the socket connection calls. "

" In order for loopbacks to happen successfully, you must set up a Network Address Translation (NAT) bounce back rule in the LBR, which essentially configures the LBR as a proxy for requests coming to it from inside the firewall. This causes the response to be sent back to the source address on the network, and then forwarded back to the client. "

Does anyone know how I can make the bounce back rule in the LBR (load balancer)? Does this mean we have to work with NAT (we prefer DR)?

Con Tassios ct (at) swin (dot) edu (dot) au 03 Feb 2005

We run LVS-DR with Oracle portal and don't experience any problems. When the application server connects to the VIP the connection is internal to the server itself as each server is configured with the VIP on lo:1.
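
For reference, one common recipe for putting the VIP on the loopback of an LVS-DR realserver (2.6 or late 2.4 kernels with arp_ignore/arp_announce; the VIP here is a placeholder) is:

# on each realserver
ifconfig lo:1 192.168.1.100 netmask 255.255.255.255 up
echo 1 > /proc/sys/net/ipv4/conf/all/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/all/arp_announce
echo 1 > /proc/sys/net/ipv4/conf/lo/arp_ignore
echo 2 > /proc/sys/net/ipv4/conf/lo/arp_announce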

20.35. Databases: ldap, tcp/udp 389, tcp/udp 636

20.35.1. ldap, read only

Tim Mooney Tim (dot) Mooney (at) ndsu (dot) edu 10 Sep 2007

We've been load balancing OpenLDAP for years using LVS-DR. When clients can update LDAP, balancing becomes much more tricky. It's a pretty standard setup. Original setup was done by someone else, but openldap was the first service we used LVS for, before even http. We've been using LVS-DR with OpenLDAP for at least 5 years, probably closer to 7. For now it's only one port. Clients don't need to bind and can't retrieve anything that's sensitive, so we're only doing ldap (no ldaps). We have additional balanced services beyond LDAP, but the LDAP portion looks like:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
   -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  vs2.ndsu.NoDak.edu:ldap lc
   -> obscured2.NoDak.edu:ldap       Route   1      16         982
   -> obscured1.NoDak.edu:ldap       Route   1      17         984
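
The ipvsadm commands to build an equivalent LVS-DR service would be something like the following (the VIP and realserver addresses are placeholders):

ipvsadm -A -t 192.168.1.50:389 -s lc
ipvsadm -a -t 192.168.1.50:389 -r 192.168.1.61:389 -g -w 1
ipvsadm -a -t 192.168.1.50:389 -r 192.168.1.62:389 -g -w 1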

If you do an ldapsearch against our directory, you're getting our LVS-DR openldap:

ldapsearch -x -LLL -h vs2.ndsu.nodak2.edu -b dc=ndsu,dc=nodak,dc=edu uid=mooney

There's another organization co-located with the IT organization here at the university, and they've also been running LVS-DR in front of their openldap directory for nearly as long as we have. LDAP is a critical component of Hurderos, which we've been using since its inception. Hence the need for a highly-available LDAP.

There can be replication between ldap servers, like there is with mysql, but in our case we have a master repository (an Oracle database) that feeds adds/deletes/modifies directly to our two back-end LDAP servers (bypassing the LVS-DR director). The built-in replication of ldap has really matured. Once OpenLDAP 2.4 is out, I need to revisit what's possible with it.

20.35.2. ldap, read/write

We haven't got ldap working in read/write mode with LVS yet.

Konrads (dot) Smelkovs (at) emicovero (dot) com 03 Jul 2003

I have an LVS-DR, wlc, with three realservers running openldap 2.0. I have noticed that when going through the loadbalancer the nodes do not always reply, while if issuing requests directly, I get a reply all the time. The service is pretty busy (about 1k connections at any given moment). Adding persistence does not help. If I understand it correctly, LDAP is a simple TCP service. Usually, after performing an initial connection and query it idles there and reuses the connection.

Joe

Not knowing anything about ldap I looked at some of the HOWTOs. They are all long on configuring and using ldap, but none describe the ldap protocol. From the LDAP HOWTO I see that ldap uses iterative calls to the slapd (somewhat like DNS), that can be sent to other slapds (I don't know how this would work in your case). As well, the slapd needs a 3rd tier database (e.g. dbm). If the clients are writing to the database, then you're going to have the many-reader, many-writer LVS problem. You're going to have to have only one copy of the database if clients are writing to the slapd.

My setup: I have a "master" LDAP server that performs all writes and it replicates to the other "slave" servers. If a write request is sent to a slave, it is referred to the master server. The client then follows this referral and makes a successful write. So in my case I have a stand-alone master, which is not load-balanced. Also, due to the specifics of the application (think authentication or such), it performs one or two exact search requests, like uid=konrads, and requests an attribute to be returned, e.g. userPassword.
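
A sketch of the slapd.conf directives behind that arrangement, in the slurpd style used by OpenLDAP 2.0/2.1; the DNs, hosts and credentials are placeholders:

# on the master
replogfile /var/lib/ldap/replog
replica host=slave1.example.com:389
        binddn="cn=replicator,dc=example,dc=com"
        bindmethod=simple credentials=secret

# on each slave
updatedn "cn=replicator,dc=example,dc=com"
updateref ldap://master.example.com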

Presumably these are two consecutive tcpip connections? If so you'll need persistence. As to what else is missing I don't know. Just for sanity checks, can you use these machines/IPs to set up some conventional LVS'ed service, e.g. telnet, httpd? You know the ldap realservers are working OK outside of the LVS (i.e. you can connect to them directly, possibly after fiddling IPs)? After that it's brute force, e.g. tcpdump, I'm afraid. If you know the protocol well enough you can also debug with phatcat.

20.36. nfs, udp 2049

It is possible with LVS to export directories from realservers to a client, making a high-capacity nfs fileserver (see the performance data for a single realserver LVS, near the end). This would be all fine and dandy, except that there is no way to fail-out the nfs service.

The problem is that the filehandle the client is handed is derived from the location of the file on the disk and not from the filename (or directory). The filehandle is required by the NFS spec to be invariant under name change or a move to a new mount point, or to another directory. I've talked to the people who write the NFS code for Linux and they think these are all good specs and there's no way the specs are going to change. The effect this has for LVS is that if you have to failout a realserver (or shut it down for routine maintenance), the client's file presented from a different realserver will have a different filehandle and the filehandle the client has will be for a file that doesn't exist. The client will get a stale file handle requiring a reboot.

Although no-one has attempted it, a read-only NFS LVS is possible.

20.36.1. NFS, the dawning of awareness of the fail-out problem

Matthew Enger menger (at) outblaze (dot) com 1 Sep 2000

I am looking into running two NFS servers, one as a backup, for serving lots of small text files, and am wondering what would be the best solution for doing this. Does anyone have any recommendations?

Joe

LVS handles NFS just fine (see the performance document on the docs page of the lvs website). You have to handle the problem of writes yourself to keep both NFS servers in sync.

Jeremy Hansen jeremy (at) xxedgexx (dot) com 1 Sep 2000

Can nfs handle picking up a client machine in a failover situation? If my primary nfs server dies, and my secondary takes over for the first, can nfs clients handle this? Something tells me nfs wouldn't be very happy in this situation.

Wayne wayne (at) compute-aid (dot) com 01 Sep 2000

It will not work for NFS. NFS works by having the server hand an opaque handle to the client. The client will then communicate with the server based on that opaque handle. Constructing that opaque handle normally involves some file system ID from that particular server, which most likely will be different from one server to the other.

Joe

Is this a problem even for UDP NFS? (I know NFS is supposed to be stateless but isn't.)

Wayne

If the NFS servers are identical, there is a chance that it may work. However, if the file systems are not identical (from a file system ID point of view), it will not work, whether it is UDP or not. The statelessness is only true for that particular server. But that is based on the NFSes I had worked on before, based on Sun's original invention; it may not be true for other implementations.

Alan Cox alan (at) lxorguk (dot) ukuu (dot) org (dot) uk 1 Sep 2000

IFF your NFS servers are running some kind of duplexing protocol and handling write commits to both disk sets before acking them, then the protocol is good enough - for any sane performance you would want NFSv3. The implementations are another matter.

John Cronin jsc3 (at) havoc (dot) gtf (dot) org

It works as long as the filesystems are identical. That means either readonly content dd'd to identical disk on failover machine, or dual ported disk storage with two hosts (or more) attached. When the failover happens the backup system then mounts the disks and takes over. You need to be VERY sure that the primary system is NOT up and writing to the disk or you will have to go to backups after the disk gets corrupted (having two systems perform unsynchronized writes to the same filesystem is not a good idea).

This is the way Sun handles it in their HA cluster. They go a step further by using Veritas Volume Manager, and Veritas has to forcibly import the disk group to the backup when a failover is done - the backup also sends a break or poweroff command to the primary via the serial terminal interface to make darn sure the primary is down. That said, I have seen three systems all mounting the same volume during some pathological testing I did on some systems at a Sun HA training course. The storage used in this situation was Sun D1000/A1000 (dual ported UltraSCSI) and Sun A5000 (Fibre Channel).

Joe

Would the client get a stale file handle on realserver failure? Now that I think about it, I snooped on a thread with similar content recently. It would have been linux-ha or beowulf. I had assumed you could fail out an NFS server, since file servers like the Auspex boxes use NFS (I thought) and can failover. How they work at the low level I don't know.

Joe

In principle this could be handled by generating the filehandle from the path/filename, which would be the same on all systems?

Wayne

It could be done like that. But if the hard drive SCSI ID is different, that could cause the system to assign a different file system ID to the file system, and you end up with a stale handle.

20.36.2. failover of NFS

Here we're asking one of the authors of the Linux NFS code.

Joseph Mack

One of the problems with running NFS as an LVS'ed service (ie to make an LVS fileserver), that has come up on this mailing list is that a filehandle is generated from disk geometry and file location data. In general then the identical copies of the same file that are on different realservers will have different file handles. When a realserver is failed out (e.g. for maintenance) and the client is swapped over to a new machine (which he is not supposed to be able to detect), he will now have an invalid file handle.

Is our understanding of the matter correct?

Dave Higgen dhiggen (at) valinux (dot) com 14 Nov 2000

In principle. The file handle actually contains a 'dev', indicating the filesystem, the inode number of the file, and a generation number used to avoid confusion if the file is deleted and the inode reused for another file. You could arrange things so that the secondary server has the same FS dev... but there is no guarantee that equivalent files will have the same inode number; (depends on order of file creation etc.) And finally the kicker is that the generation number on any given system will almost certainly be different on equivalent files, since it's created from a random seed.

If so is it possible to generate a filehandle only on the path/name of the file say?

Well, as I explained, the file handle doesn't contain anything explicitly related to the pathname. (File handles aren't big enough for that; only 32 bytes in NFS2, up to 64 in NFS3.)

Trying to change the way file handles are generated would be a MASSIVE redesign project in the NFS code, I'm afraid... In fact, you would really need some kind of "universal invariant file ID" which would have to be supported by the underlying local filesystem, so it would ramify heavily into other parts of the system too...

NFS just doesn't lend itself to replication of 'live' filesystems in this manner. It was never a design consideration when it was being developed (over 15 years ago, now!)

There HAVE been a number of heroic (and doomed!) efforts to do this kind of thing; for example, Auspex had a project called 'serverguard' a few years ago into which they poured millions in resources... and never got it working properly... :-(

Sorry. Not the answer you were hoping for, I guess...

20.36.3. shared scsi solution for NFS

It seems that the code which calculates the filehandle is so entrenched in NFS that it can't be rewritten to allow disks with the same content (but not necessarily the same disk geometry) to act as failovers in NFS. The current way around this problem is for a reliable (e.g. RAID-5) disk to be on a shared scsi line. This way two machines can access the same disk. If one machine fails, then the other can supply the disk content. If the RAID-5 disk fails, then you're dead.

John Cronin jsc3 (at) havoc (dot) gtf (dot) org 08 Aug 2001

You should be able to do it with shared SCSI, in an active-passive failover configuration (one system at a time active, the second system standing by to takeover if the first one fails). Only by using something like GFS could both be active simultaneously, and I am not familiar enough with GFS to comment on how reliable it is. If you get the devices right, you can have a seamless NFS failover. If you don't, you may have to umount and remount the file systems to get rid of stale file handle problems. For SMB, users will have to reconnect in the event of a server failure, but that is still not bad.
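
With heartbeat (v1 style haresources), the active-passive arrangement is roughly one resource line; the node name, device, mount point, init script name and VIP below are all assumptions:

# /etc/ha.d/haresources
nfs1 IPaddr::192.168.1.40 Filesystem::/dev/sdb1::/export::ext3 nfs-kernel-server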

Salvatore J. Guercio Jr sguercio (at) ccr (dot) buffalo (dot) edu 12 Jun 2003

Here is background on my LVS/NFS setup:

I have 4 IBM x330s and an IBM FastT500 SAN; the project I am working on is to set up the SAN to serve 635GB worth of storage space to the Media Studies department of the University. Each x330 is connected to the SAN with gigabit fibre channel and at 100Mbit to a Cisco 4006 switch. The switch has a gigabit connection to the media studies department where they have a lab of 4 Macs running MacOS X. They will mount the share and access media data on the share. Each client has a gigabit connection to the Cisco switch.

The LVS comes in handy to increase the bandwidth of the SAN as well as giving us some redundancy. Since I have a shared SCSI solution, I will not run into the file handle problem, and I am hoping that using the IBM GPFS filesystem will help me with the lockd problem.

Did you have to do anything as far as forwarding your portmap or any other ports to the realservers? Right now I am only forwarding udp port 2049.

Joe

Just a sanity check here... Do you understand the problems of exporting NFS via LVS (see the HOWTO)? Are these problems going to affect you (if not, this would be useful for us to know)? I assume this setup will handle write locking etc. For my information, how is LVS going to help here? Is it increasing throughput to disk?

20.36.4. detecting failed NFS

Don Hinshaw

Would a simple TCP_CONNECT be the right way to test NFS?

Horms

If you are running NFS over TCP/IP then perhaps, but in my experience NFS deployments typically use NFS over UDP/IP. I'm suspecting a better test would be to issue some rpc calls to the NFS server and look at the response, if any. Something equivalent to what showmount can do might not be a bad start.

Joe

How about exporting the directory to the director as well and doing an `ls` every so often?

Malcolm Cowe malcolm_cowe (at) agilent (dot) com 7 Aug 2001

HP's MC/ServiceGuard cluster software monitors NFS using the "rpcinfo" command -- it can be quite sensitive to network congestion, but it is probably the best tool for monitoring this kind of service.

The problem with listing an NFS exported directory is that when the service bombs, ls will wait for quite a while -- you don't want the director hanging because of an NFS query that's gone awry.
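
A minimal rpcinfo-based probe, usable from mon or a keepalived MISC_CHECK, which avoids hanging the way a stuck ls can; the realserver address is passed as an argument:

#!/bin/sh
# check_nfs.sh <realserver> - exit 0 if nfsd and mountd answer an RPC null call
RS=$1
rpcinfo -u "$RS" nfs    > /dev/null 2>&1 || exit 1
rpcinfo -u "$RS" mountd > /dev/null 2>&1 || exit 1
exit 0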

Nathan Patrick np (at) sonic (dot) net 09 Aug 2001

Mon includes a small program which extends what "rpcinfo" can do (and shares some code!) Look for rpc.monitor.c in the mon package, available from kernel.org among other places. In a nutshell, you can check all RPC services or specific RPC services on a remote host to make sure that they respond (via the RPC null procedure). This, of course, implies that the portmapper is up and responding.

From the README.rpc.monitor file:

This program is a monitor for RPC-based services such as the NFS protocol, NIS, and anything else that is based on the RPC protocol. Some general examples of RPC failures that this program can detect are:

  - missing and malfunctioning RPC daemons (such as mountd and nfsd)
  - systems that are mostly down (responding to ping and maybe
    accepting TCP connections, but not much else is working)
  - systems that are extremely overloaded (and start timing out simple
    RPC requests)

To test services, the monitor queries the portmapper for a listing of RPC programs and then optionally tests programs using the RPC null procedure.

At Transmeta, we use:

  "rpc.monitor -a" to monitor Network Appliance filers
  "rpc.monitor -r mountd -r nfs" to monitor Linux and Sun systems

Michael E Brown michael_e_brown (at) dell (dot) com 08 Aug 2001

how about rpcping?

(Joe - rpcping is at nocol_xxx.tar.gz for those of us who didn't know rpcping existed.)

20.36.5. NFS locking and persistence

Steven Lang Sep 26, 2001

The primary protocol I am interested in here is NFS. I have the director set up with DR with LC scheduling, no persistence, with UDP connections timing out after 5 seconds. I figured the time it would need to be accessing the same host would be when reading a file, so they are not all in contention for the same file, which seems to cost performance in GFS. That would all come in a series of accesses. So there is not much need to keep the traffic going to the same host beyond 5 seconds.

Horms horms (at) vergenet (dot) net

I know this isn't the problem you are asking about, but I think there are some problems with your architecture. I spent far too much of last month delving into NFS - for reasons not related to load balancing - and here are some of the problems I see with your design. I hope they are useful.

As far as I can work out you'll need the persistence to be much longer than 5s. NFS is stateless, but regardless, when a client connects to a server a small amount of state is established on both sides. The stateless aspect comes into play in that if either side times out, or a reboot is detected, then the client will attempt to reconnect to the server. If the client doesn't reconnect, but rather its packets end up on a different server because of load balancing, the server won't know anything about the client and nothing good will come of the situation. The solution to this is to ensure a client consistently talks to the same server, by setting a really long persistence.

There is also the small issue of locks. lockd should be running on the server and keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has.

Of course unless you can get the lockd on the different NFS servers to talk to each other you are going to have a problem if different clients connected to different servers want to lock the same file. I think if you want to have any locking you're in trouble.

I actually specifically tested for this. Now it may just be a linux thing, not common to all NFS platforms, but in my tests, when packets were sent to a server other than the one mounted, it happily served up the requested data without even blinking. So whatever state is being saved (and I do know there is some minimal state) seems unimportant. It actually surprised me how seamlessly it all worked, as I was expecting the non-mounted server to reject the requests or something similar.

That is quite surprising, as the server should maintain state as to what clients have what mounted.

There is also the small issue of locks. lockd should be running on the server and keeping track of all the locks the client has. If the client has to reconnect, then it assumes all its locks are lost, but in the meantime it assumes everything is consistent. If it isn't accessing the same server (which wouldn't work for the reason given above) then the server won't know about any locks it thinks it has.

This could indeed be an issue. Perhaps setting up persistence for locks. But I don't think it is all bad. Of course, I am basing this off several assumptions that I have not verified at the moment. I am assuming that NFS and GFS will Do The Right Thing. I am assuming the NFS lock daemon will coordinate locks with the underlying OS. I am also assuming that GFS will then coordinate the locking with the rest of the servers in the cluster. Now as I understand locking on UNIX, it is only advisory and not enforced by the kernel, the programs are supposed to police themselves. So in that case, as long as it talks to a lock daemon, and keeps talking to the same lock daemon, it should be okay, even if the lock daemon is not on the server it is talking to, right?

That should be correct, the locks are advisory. As long as the lock daemons are talking to the underlying file system (GFS) then the behaviour should be correct, regardless of which lock daemon a client talks to, as long as the client consistently talks to the same lock daemon for a given lock.

Of course, in the case of a failure, I don't know what will happen. I will definitely have to look at the whole locking thing in more detail to know if things work right or not, and maybe get the lock daemons to coordinate locking.

Given that lockd currently lives entirely in the kernel that is easier said than done.

20.36.6. Other Network file systems: replacements for NFS

This is from the beowulf mailing list

Ronald G Minnich rminnich (at) lanl (dot) gov 26 Sep 2002

NFS is dead for clusters. We are targeting three possible systems, each having a different set of advantages:

  • panasas (http://www.panasas.com)

  • lustre (http://www.lustre.org)

  • v9fs (http://v9fs.sourceforge.net), from yours truly

20.36.7. nfs mounts: hard and soft

This is not LVS exactly, but here's from a discussion on hard and soft mounts from the beowulf mailing list.

Alvin Oga alvin (at) Maggie (dot) Linux-Consulting (dot) com 14 May 2003

if you export a raid subsystem via nfs... you will have problems

  • if you export it w/ hard mount, all machines sit and wait for the machine that went down to come back up

  • if you export it w/ soft mount, you can ^C out of any commands that try to access the machine that went down, and continue on your merry way as long as you didn't need its data

best way to get around NFS export problems - have 2 or 3 redundant systems of identical data ( a cluster of data ... like that of www.foo.com websites )

Greg Lindahl lindahl (at) keyresearch (dot) com 15 May 2003

There are soft and hard mounts, and there are interruptible mounts ("intr" -- check out "man nfs").

  • A hard mount will never time out. If you make it interruptible, then the user can choose to ^C. This is the safe option.

  • A soft mount will time out, typically leaving you with data corruption for any file open at the time. This is the option you probably never want to use.

By the way, if you use the automounter to make sure that unused NFS filesystems are not mounted, it can help quite a bit when you have a crash, depending on your usage pattern.
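
For example, a hard, interruptible client mount in /etc/fstab (the server name and paths are placeholders) would look like:

fileserver:/export/home   /home   nfs   hard,intr,rsize=8192,wsize=8192   0 0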

Other potential problems include: inadequate power or airflow, high input air temperature, old BIOS on the 3ware card, old Linux kernel, etc.

20.36.8. linux nfs in general

Benjamin Lee ben (at) au (dot) atlas-associates (dot) com

I am wondering what (currently) people consider a production quality platform to run an NFS server on? I am thinking maybe a BSD although the kernel NFS code in Linux is much more stable now, I've heard. The idea is to put together a RAID boxen which will serve web pages and mail spool via NFS. (Don't worry, I'm only mounting the mail spool once. ;-) It's not a large enterprise system. )

Ted Pavlic tpavlic (at) netwalk (dot) com 19 Sep 2000

While Linux does not provide very good NFS support and generally has problems with things like locking (major problems), and quotas, don't rule it out. Personally I have a system now which is very similar to the system in which you sound like you want to build. Until recently, it was a Linux server with an external RAID serving both mailspool and web pages to four realservers. (and a couple of other servers -- application, news, etc.) Right now (for various reasons most having to do with how nightly backups are handled) I have two Linux servers both with external RAIDs. One handles mailspool, the other handles web pages. Both are configured so that if and when one machine Linux server dies completely the other can pick up the other RAID and serve both RAIDs again. There will be some manual interaction, of course, but I have NOCs 24/7 and hopefully such a problem would occur when a tech was available to handle the switch.

Such a system seems to work fine for me. (It's worked for a long time). It's easy for me to administrate because everything's Linux. I run into problems here or there when a program has trouble locking and requires a local share, but I get around those problems... and overall it's not that bad.

20.36.9. stale file handles

Joe 16 Dec 2005

The client gets a stale file handle when the client has a file||directory open and the server stops serving the file||directory. This error is part of the protocol (i.e. it's not a bug or a problem, it's supposed to happen). The stale file handle doesn't happen at the server end (the server doesn't care if the client is totally hosed).

client                     server

                           server:# exportfs client:/home/user
                           server:# ls /home/user
                                  foo

client:# mount /home/user
client:# cd /home/user/
client:# ls ./
          foo

client:# cd foo
client:# ls
    ..listing of files in foo

                          server:# exportfs -u client:/home/user

client:# ls
      error: stale file handle

df and mount will hang (or possibly return after a long timeout). The error goes away when the server comes back.

                          server:# exportfs client:/home/user

client:# ls
    .. listing of files in foo

If the servers went down and the clients (realservers) stayed up, all with open file handles, the clients just have to wait till the servers come up again. The stale file handle will mess up the client till the server comes back. Since foo is on the same piece of disk real-estate, foo comes back with the same file handle when the server reappears.

An irrecoverable example:

                            server:# exportfs client:/home/user

client:# mount /home/user
client:# cd /home/user
client:# ls ./
    foo

(ie as before so far)

do something different, 
force an irreversible failure on the server

                            server:# rmdir /home/user/foo

client:# ls ./

    stale file handle

                            server:# mkdir /home/user/foo

client:# ls ./
    stale file handle

In this example, when /home/user/foo is recreated, it's on a new piece of disk real-estate and will have a different file handle. The client is hung and you can't umount /home/user (maybe you can with umount -f). If you can't umount /home/user, you will have to reboot the client.

20.36.10. read-only NFS LVS

Joseph L. Kaiser 7 Mar 2006

I have been tasked to mount a read-only NFS mounted software area to 500+ nodes. I wanted to do this with NFS and LVS, but after reading all the howtos with regard to NFS and LVS and reading all the email with regard to this in the archives (twice), it seems clear to me that this doesn't work.

Joe

I haven't done any of this, I've just talked to people - so my knowledge is only theoretical.

(ro) is a different kettle of fish. If you use identical disks (you only need identical geometry, so you could in principle use different model disks) and make the disks bitwise identical (e.g. made with dd), then clients will always get the same filehandle for the same file, no matter which realserver they connect to. For disk failure, make sure you have a few dd copies spare. I assume (but don't know) you won't be able to use RAID etc because of problems keeping the individual disks identical on the different realservers. You'll only be able to use single disks.

If you have to update files, then you'll have to do it bitwise. I don't have any ideas on how to do this off the top of my head. Perhaps you could have two disks, one mounted and the other unmounted, but having received the new dd copy. You would change the weight to 0 with ipvsadm and let the connections expire (about 3 mins for TCP/UDP scheduling), umount the current disk and then mount the new disk. Presumably you'll have some period in the changeover where the realservers are presenting different files.
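
A sketch of that swap for one realserver, assuming an LVS-DR udp service on port 2049; the VIP, RIP, device names and timings are placeholders, and on some setups nfsd may need to be stopped and restarted around the umount:

ipvsadm -e -u 192.168.1.110:2049 -r 192.168.1.11:2049 -g -w 0   # quiesce: no new clients
sleep 300                                                       # let existing entries expire
ssh 192.168.1.11 'umount /export && mount /dev/sdc1 /export && exportfs -ra'
ipvsadm -e -u 192.168.1.110:2049 -r 192.168.1.11:2049 -g -w 1   # back in service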

Presumably for speed, you should have as much memory in the realservers as possible, so that recently accessed files are still in memory.

However, I have a boss, and he wanted me to ask if turning off attribute caching (the "noac" mount option) would help the reliability of this service.

If the disks are (ro) then the attributes on the server will never change, in which case you want the client to cache the attributes forever.

He has seen with another NFS mounted filesystem that using "noac" turns off caching, and clients that sometimes get "Stale NFS file handle" will reread the file and succeed.

Is having the client not reboot only "sometimes" an acceptable spec? With "noac" will it now pass the tests in the stale file handle section?

Note
DRBD is not a service in the network sense, but it is used as a file server.

Neamtu Dan dlxneamtu(at)yahoo(dot)com

I have a test system made up of a director, 3 servers running ftp and http, and 2 storage computers from which the servers take their data. If I were to stick to one storage computer there would be no problem, but I want redundancy. So I've set up a working heartbeat on both the storage computers, but when the primary fails the servers can no longer use the data unless they umount and mount again on the Virtual IP address the heartbeat uses (I tried mounting with nfs and samba so far). As I understand it, these 2 mounting methods will not work in case of failover because of different disk geometry. Do you know what I should do for nfs or samba to work in case of failover (at least for read only, if read-write on the storage is not possible)?

mike mike503 (at) gmail (dot) com 1 Apr 2006

DRBD+NFS works for me, pretty well too. Check out linux-ha.org for HaNFS.

Martijn Grendelman martijn (at) pocos (dot) nl 21 Mar 2007

I have recently set up a two-node cluster, both servers configured identically, both handling HTTP, HTTPS and FTP connections over LVS. Both machines are capable of playing the role of LVS director, but only one is active at once. Monitoring of real servers is done with Mon.

Files are on a DRBD device, which is exported over NFS. Failover of LVS (VIP + rules), DRBD, NFS and MySQL (also on DRBD) is handled by Heartbeat. Works like a charm!! The information on http://www.linux-ha.org/DRBD/NFS was extremely useful.