19. LVS: Services: general, setup, debugging new services

If you just want to find out about setting up a particular service that we have already figured out (e.g. all single port read-only services, some multiport services), then just go to that section. This section is for when you are having trouble setting up a service, or want to know more about how services are set up.

19.1. Single port services are simple

Single port tcp services all use the same format:

  • the realserver listens on a known port (e.g. port 23 for telnet)
  • the client initiates the connection by sending a SYN from a high port (port number > 1024) to the VIP:known_port (e.g. VIP:23)
  • the director selects the next realserver to service the request from its scheduling table, allocates a new entry in its hash (ipvsadm) table, and forwards the SYN packet to the realserver.
  • the realserver replies to the client. For LVS-DR and LVS-Tun, the default gw for the realserver is not the director: the reply packet goes directly to the client. For LVS-NAT, the default gw for the realserver is the director: the reply packet is sent to the director, where it is masqueraded and then sent to the client.

A similar chain of events is involved in pulling down the tcp connection.
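The director's part in the steps above can be sketched as a toy model. This is only an illustration of round-robin selection plus a per-connection table; the class and method names are hypothetical and the IPs are made up, and the real ip_vs code in the kernel is considerably more involved.

```python
from itertools import cycle


class Director:
    """Toy model of a director: round-robin scheduler plus connection table."""

    def __init__(self, vip, port, realservers):
        self.vip = vip
        self.port = port
        self._next = cycle(realservers)   # round-robin scheduling order
        self.connections = {}             # (client_ip, client_port) -> realserver

    def handle_packet(self, client_ip, client_port, syn=False):
        """Return the realserver that should receive this packet."""
        key = (client_ip, client_port)
        if syn and key not in self.connections:
            # New connection: pick the next realserver and remember the choice.
            self.connections[key] = next(self._next)
        # Later packets of the same connection follow the stored entry
        # (a packet for an unknown connection raises KeyError in this toy).
        return self.connections[key]

    def close(self, client_ip, client_port):
        """Drop the table entry when the connection is pulled down."""
        self.connections.pop((client_ip, client_port), None)


director = Director("192.168.1.110", 23, ["realserver1", "realserver2"])
first = director.handle_packet("10.0.0.5", 1025, syn=True)
second = director.handle_packet("10.0.0.6", 1026, syn=True)
again = director.handle_packet("10.0.0.5", 1025)   # same connection as first
print(first, second, again)
```

The point of the table is that every packet after the SYN goes to the realserver chosen for that connection, whatever the scheduler would pick next.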

In principle, setting up a service on an LVS is simple - you run the service on the realserver and forward the packets from the director. A simple service to set up on LVS is telnet: the client types a string of characters and the server returns a string of characters, making it the service of choice for debugging an LVS.

In practice some services interact more with their environment. ftp needs two ports. With http, the server needs to know its name (it will have the IP of realserver, but will need to proclaim to the client that it has the name associated with the VIP). https is not listening to an IP, but to requests to a nodename. This section shows the steps needed to get the common single-port services working. A section on multi-port services shows how to set up multi-port services like ftp or e-commerce sites.

When trying something new on an LVS, always have telnet running as an LVS'ed service. If something is not working with your service, check how telnet is doing. Telnet has these advantages:

  • telnetd listens on all addresses (0.0.0.0) on the realserver (at least under inetd)
  • the exchange between the client and server is simple and well documented
  • the connection is non-persistent (new sessions initiated from a client will make a new connection with the LVS), unencrypted and in ascii (you can follow it with tcpdump)
  • the telnet client is available on most OS's

19.2. setting up a (new) service

When setting up your LVS, you should first test that your realservers are working correctly. Make sure that you can connect to each realserver from a test client, then put the realservers behind the director. Putting the realservers into an LVS changes the networking. For testing the realservers separately:

  • LVS-DR, LVS-Tun: Have the VIP on lo:n or tunl0:n with the service listening to the VIP. You'll need some way of routing to a director without a VIP from the test client. Alternately you can put the VIP on eth0 and move it back to the local device afterwards.
  • LVS-NAT: The service will be listening on the RIP. In production the client will be connecting to the VIP, so name resolution may be required mapping the RIP to the name of the VIP. If you need this see indexing.
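A connection test from the test client to each realserver can be scripted. The following is a minimal sketch, assuming a plain TCP service; the function names are hypothetical, and the demonstration uses a throwaway listener on 127.0.0.1 so that it runs anywhere. In practice you would pass your realserver names and the service port (e.g. 23 for telnet).

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name lookup failures.
        return False


def check_realservers(realservers, port, timeout=3.0):
    """Map each realserver name or IP to the result of a TCP connect test."""
    return {rs: can_connect(rs, port, timeout) for rs in realservers}


# Demonstration against a local listener standing in for a realserver.
listener = socket.socket()
listener.bind(("127.0.0.1", 0))        # port 0: let the OS pick a free port
listener.listen(1)
port = listener.getsockname()[1]

up = check_realservers(["127.0.0.1"], port)     # service is listening
listener.close()
down = check_realservers(["127.0.0.1"], port)   # service has gone away
print(up, down)
```

This only shows that something accepts connections on the port; it does not show that the service is listening on the right IP for the forwarding method.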

The LVS director behaves as a router (with slightly different rules). Thus when setting up an LVS on a new service, the client-server semantics are maintained

  • the client thinks it is connecting directly to a server
  • the realserver thinks it is being contacted directly by the client

Example: nfs over LVS, realserver exports its disk, client mounts a disk from LVS (this example taken from performance data for single realserver LVS),

realserver:/etc/exports (realserver exports disk to client, here a host called client2)

/       client2(rw,insecure,link_absolute,no_root_squash)

The client mounts the disk from the VIP. Here's client2:/etc/fstab (client mounts disk from machine with an /etc/hosts entry of VIP=lvs).

lvs:/   /mnt            nfs     rsize=8192,wsize=8192,timeo=14,intr 0 0

The client makes requests to VIP:nfs. The director must forward these packets to the realservers. Here's the conf file for the director.

#lvs_dr.conf for nfs on realserver1
VIP=eth1:110 lvs
DIP=eth0 dip
SERVICE=t telnet rr realserver1 realserver2	#for sanity check on LVS
#to call NFS the name "nfs" put the following in /etc/services
#nfs             2049/udp
#note the 'u' for "udp" in the next line
SERVICE=u nfs rr realserver1			#the service of interest
SERVER_NET_DEVICE=eth0
#----------end lvs_dr.conf------------------------------------
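To make the SERVICE lines more concrete, here is a sketch that turns one such line into the kind of ipvsadm commands it stands for. It is not the configure script itself. The parsing rules are an assumption read off the example above (protocol letter, service name, scheduler, then realservers), and the function name is hypothetical. The ipvsadm options used (-A, -a, -t, -u, -s, -r, and -g/-i/-m for LVS-DR/LVS-Tun/LVS-NAT) are the standard ones.

```python
FORWARDING_FLAGS = {"dr": "-g", "tun": "-i", "nat": "-m"}
PROTOCOL_FLAGS = {"t": "-t", "u": "-u"}   # tcp or udp service


def ipvsadm_commands(service_line, vip, forwarding="dr"):
    """Translate a 'SERVICE=...' conf line into ipvsadm command strings."""
    line = service_line.split("#", 1)[0].strip()   # drop trailing comment
    key, _, value = line.partition("=")
    if key != "SERVICE":
        raise ValueError("not a SERVICE line: %r" % service_line)
    proto, service, scheduler, *realservers = value.split()
    proto_flag = PROTOCOL_FLAGS[proto]
    # One command to add the virtual service, then one per realserver.
    commands = [f"ipvsadm -A {proto_flag} {vip}:{service} -s {scheduler}"]
    for rs in realservers:
        commands.append(
            f"ipvsadm -a {proto_flag} {vip}:{service} "
            f"-r {rs}:{service} {FORWARDING_FLAGS[forwarding]}"
        )
    return commands


nfs_line = "SERVICE=u nfs rr realserver1\t\t\t#the service of interest"
nfs_commands = ipvsadm_commands(nfs_line, "lvs")
for command in nfs_commands:
    print(command)
```

Note that the 'u' in the conf line becomes -u (a UDP service), which is the point of the comment in the conf file.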

19.3. services must be setup for forwarding type

The services must be set up to listen on the correct IP. With telnet, this is easy (telnetd listens on 0.0.0.0 under inetd), but most other services need to be configured to listen to a particular IP.

For LVS-NAT, the packets will arrive with dst_addr=RIP, i.e. the service will be listening to (and replying from) the RIP of the realserver. When the realserver replies, the name of the machine returned will be that of the realserver (the RIP), but the src_addr will be rewritten by the director to be the VIP. If the name of the realserver is part of its service (as with name based http) then the client will associate the VIP with this name. The realserver then will need to associate the RIP with this name. You could put an entry for the RIP into /etc/hosts linking it to this name.

With LVS-DR and LVS-Tun the packets will arrive with dst_addr=VIP, i.e. the service will be listening to (and replying from) an IP which is NOT the IP of the realserver. Configuring the httpd to listen to the RIP rather than the VIP is a common cause of problems for people setting up http/https. All realservers will need to think that they have the hostname associated with the VIP.

In both cases, in production, the name of the machine given by the realserver must be the name associated with the VIP.
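The rule can be summarised in a few lines. This is a sketch only; the function names, IPs and hostname are hypothetical, and the /etc/hosts line is just one way of making the realserver associate the VIP's name with the IP it listens on.

```python
def service_listen_ip(forwarding, vip, rip):
    """Return the IP the service on a realserver should listen on."""
    forwarding = forwarding.lower()
    if forwarding == "nat":
        return rip            # LVS-NAT: packets arrive with dst_addr=RIP
    if forwarding in ("dr", "tun"):
        return vip            # LVS-DR, LVS-Tun: packets arrive with dst_addr=VIP
    raise ValueError("unknown forwarding method: %r" % forwarding)


def hosts_entry(forwarding, vip, rip, vip_name):
    """Build an /etc/hosts line tying the VIP's name to the listen IP."""
    return "%s\t%s" % (service_listen_ip(forwarding, vip, rip), vip_name)


nat_line = hosts_entry("nat", "192.168.1.110", "10.1.1.2", "www.example.com")
dr_line = hosts_entry("dr", "192.168.1.110", "10.1.1.2", "www.example.com")
print(nat_line)
print(dr_line)
```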

Note: if the realserver is Linux 2.4 and is accepting packets by transparent proxy, then see the section on Transparent Proxy (Horm's method) for the IP the service should listen on.

19.4. Realservers present the same content: Synchronising (filesharing) content and config files, backing up realservers

Realservers should have identical files/content for any particular service (since the client can be connected to any of them). This is not a problem for slowly changing sites (e.g. ftp servers), where the changes can be made by hand, but sites serving webpages may have to be changed daily or hourly. Some semi-automated method is needed to stage the content in a place where it is reviewed and then moved to the realservers.

For a database LVS, the changes have to be propagated in seconds. In an e-commerce site you have to either keep the client on the same machine when they transfer from http to https (using persistence), which may be difficult if they do some comparative shopping or have lunch in the middle of filling their shopping cart, or propagate the information e.g. a cookie to the other realservers.

Here are comments from the mailing list.


If you just have two servers, it might be easy to use rsync to synchronize the backup server, and put the rsync job in the crontab on the primary. See http://rsync.samba.org/ for rsync.

If you have a big cluster, you might be interested in Coda, a fault-tolerant distributed filesystem. See the Coda website for more information.


From comments on the mailing list, Coda now (Aug 2001) seems to be a usable project. I don't know what has happened to the sister project Intermezzo.

May 2004. It appears that development has stopped on both Coda and Intermezzo. I think the problem was too difficult.

Jan 2006. Coda appears to be back in development.

J Saunders 27 Sep 1999

I plan to start a frequently updated web site (potentially every minute or so).

Baeseman, Cliff Cliff (dot) Baeseman (at) greenheck (dot) com

I use mirror to do this. I created a ftp point on the director. All nodes run against the director ftp directory and update the local webs. It runs very fast and very solid. upload to a single point and the web propagates across the nodes.

Paul Baker pbaker (at) where2getit (dot) com 23 Jul 2001 (and another posting on 18 Jul 2002 announcing v0.9.2.2)

PFARS Project on SourceForge

I have just finished committing the latest revisions to the PFARS project CVS up on SourceForge. PFARS, pronounced 'farce', is the "PFARS For Automatic Replication System."

PFARS is currently used to handle server replication for Where2GetIt.com's LVS cluster. It has been in the production environment for over 2 months so we are pretty confident with the code stability. We decided to open source this program under the GPL to give back to the community that provided us with so many other great FREE tools that we could not run our business without (especially LVS). It is written in Perl and uses rsync over SSH to replicate server file systems. It also natively supports Debian Package replication.

Although the current version number is 0.8.1 it's not quite ready for release yet. It is seriously lacking in documentation and there is no installation procedure yet. Also in the future we would like to add support for RPM based linux distros, many more replication stages, and support for restarting server processes when certain files are updated. If anyone would like to contribute to this project in any way do not be afraid to email me directly or join the development mailing list at pfars-devel@lists.sourceforge.net.

Please visit the project page at http://sourceforge.net/projects/pfars/ and check it out. You will need to check it out from CVS as there are no files released yet. Any feedback will be greatly appreciated.

Joe (May 2004): the last code entry for pfars was Sep 2002. The project appears to have stopped development.

Zachariah Mully

I am having a debate with one of my software developers about how to most efficiently sync content between realservers in an LVS system.

The situation is this... Our content production software that we'll be putting into active use soon will enable our marketing folks to insert the advertising into our newsletters without the tech and launch teams getting involved (this isn't necessarily a good thing, but I'm willing to remain open minded ;). This will require that the images they put into newsletters be synced between all the webservers... The problem though is that the web/app servers running this software are load-balanced so I'll never know which server the images are being copied to.

Obviously loading the images into the database backend and then out to the servers would be one method, but the software guys are convinced that there is a way to do it with rsync. I've looked over the documentation for rsync and I don't see any way to set up cron jobs on the servers to run an rsync job that will look at the other servers' content, compare it and then either upload or download content to that server. Perhaps I am missing an obvious way of doing this, so can anyone give me some advice as to the best way of pursuing this problem?

Bjoern Metzdorf bm (at) turtle-entertainment (dot) de 19 Jul 2001

You have at least 3 possibilities:

  • You let them upload to all RIPs (uploading to each realserver)
  • You let them upload to a testserver, and after some checks you use rsync to put the images onto the RIPs.
  • You let them upload to one defined RIP instead of the VIP and rsync from there (no need for a testserver)

Stuart Fox stuart (at) fotango (dot) com 19 Jul 2001

nfs mount one directory for upload and serve the images from there.

Write a small perl/bash script to monitor both upload directories remotely then rsync the differences when detected.

Don Hinshaw dwh (at) openrecording (dot) com 19 Jul 2001

You can use rsync, rsync over ssh or scp.

You can also use partition syncing with a network/distributed filesystem such as Coda or OpenAFS or DRBD (DRBD is still too experimental for me). Such a setup creates partitions which are mirrored in real-time. I.e., changes to one reflect on them all.

We use a common NFS share on a RAID array. In our particular setup, users connect to a "staging" server and make changes to the content on the RAID. As soon as they do this, the real-servers are immediately serving the changed content. The staging server will accept FTP uploads from authenticated users, but none of the real-servers will accept any FTP uploads. No content is kept locally on the real-servers so they never need be synced, except for config changes like adding a new vhost to Apache.

jik (at) foxinternet (dot) net 19 Jul 2001

If you put the conf directory on the NFS mount along with htdocs then you only need to edit one file, then ssh to each server and "apachectl graceful"

Don Hinshaw dwh (at) openrecording (dot) com 20 Jul 2001

Um, no. We're serving a lot of: <VirtualHost x.x.x.x> and the IP is different for each machine. In fact the conf files for all the real-servers are stored in an NFS mounted dir. We have a script that manages the separate configs for each real-server.

I'm currently building a cluster for a client that uses a pair of NFS servers which will use OpenAFS to keep synced, then use linux-ha to make sure that one of them is always available. One thing to note about such a system is that the synced partitions are not "backups" of each other. This is really a "meme" (way of thinking about something). The distinction is simply that you cannot rollback changes made to a synced filesystem (because the change is made to them both), whereby with a backup you can rollback. So, if a user deletes a file, you must reload from backup. I mention this because many people that I've talked to think that if you have a synced filesystem, then you have backups.

What I'm wondering is why you would want to do this at all. From your description, your marketing people are creating newsletters with embedded advertising. If they are embedding a call for a banner (called a creative in adspeak) then normally that call would grab the creative/clickthrough link from the ad server not the web servers. For tracking the results of the advertising, this is a superior solution. Any decent ad server will have an interface that the marketing dept. can access without touching the servers at all.

Marc Grimme grimme (at) comoonics (dot) atix (dot) de 20 Jul 2001

Depending on how much data you need to sync, you could think about using a cluster filesystem, so that any node in the LVS cluster can concurrently access the same physical data. Have a look at GFS. We have a clustered webserver with 3 to 5 nodes with GFS underneath and it is pretty stable.

If you are sure which server the latest data is uploaded to, there is no problem with rsync. If not, I would consider using a network or cluster filesystem. That should save a lot of scripting work and is more storage efficient.

jgomez (at) autocity (dot) com

We are using rsync as a server. We are using a function that uploads the contents to the server and syncs the uploaded file to the other servers. The list of servers we have to sync is in a file like:


When a file is uploaded, the server reads this file and makes the sync to all the other servers.
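The poster's file format and function were not preserved, so the following is only a guess at the general shape of such a scheme. It assumes one hostname per line, an rsync daemon on each peer exporting a module called "web" (a hypothetical name), and content under /var/www (also hypothetical). It builds the rsync commands and leaves running them to the caller.

```python
import shlex


def read_server_list(text):
    """Parse one hostname per line, skipping blanks and # comments."""
    servers = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if line:
            servers.append(line)
    return servers


def rsync_commands(relpath, servers, self_name,
                   local_root="/var/www", module="web"):
    """Build one rsync command per peer for a freshly uploaded file.

    'peer::module/path' is rsync's syntax for talking to an rsync daemon.
    """
    src = f"{local_root}/{relpath}"
    return [
        ["rsync", "-a", src, f"{peer}::{module}/{relpath}"]
        for peer in servers
        if peer != self_name          # do not sync to ourselves
    ]


server_list = """\
# one host per line (assumed format)
web1
web2
web3
"""

commands = rsync_commands("img/banner.gif", read_server_list(server_list), "web1")
for cmd in commands:
    print(shlex.join(cmd))            # pass cmd to subprocess.run() to execute
```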

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 May 2002

Working machines have local disks for qmail queue and /etc /boot which are EXT3. Working data (/home, /usr, /shared, /webspace) lives on a Network Appliance Netfiler. I really can't say enough about the NetApps they are simply awesome. You pay a chunk of money but it is money well spent.

"Andres Tello Abrego C.A.K." criptos (at) aullox (dot) com 06 Sep 2002

Using the KISS principle.

The collection of usernames and passwords must be centralized for control: only one place where you update, change and remove passwords. Then, with a little help from scp, the trick is done.

Just copy your password collection file over a secure connection and you are synced. We even developed a web-based "cluster" admin app. The principle of operation was that one server is the "first one"; then, using a small program triggered by an ssh command execution or attached to a port using the inetd super server, you are done.

"Matthew S. Crocker" matthew (at) crocker (dot) com 07 Sep 2002

Instead of using NIS, or NIS+ I use LDAP for all my customer information records. I store, Radius, Qmail, DHCP, DNS, and Apache Virtual Host information in my LDAP server. We have a couple LDAP slaves and have all servers query the LDAP servers for info. Radius, Qmail are real time, everything else is updated via a script.

To replicate NIS functions in LDAP check out www.padl.com. They have a schema and migration tool set

Jerker Nyberg jerker (at) update (dot) uu (dot) se 08 Sep 2002

I used to take information from the customer database and store shadow/passwd/groups/httpd.conf/aliases/virtusertable/etc in two high-availability MySQL-databases (on the same machines that run LVS), then every 30 minutes or as appropriate generate the files on the realservers. The "source of all information" for us was the customer database (also MySQL), which we can modify in our own customized python/GTK-clients or the customers can modify (indirectly) via a webinterface.

One of the ideas with this was to move the focus from what is stored on the servers to what is in the customer database. In that way it is easy to inactivate accounts if customers don't pay their fees etc. If the real-servers go down, they can all be reinstalled with a kickstart installation including the scripts that generate the configuration files. I found it easier with "pull" instead of a "push" for the configuration files.

Local files (with databases "db" instead of linear files in /etc/nsswitch.conf - this began to make a difference with more than 10k users) in my experience always seemed to be faster than any networked nameservices (LDAP, NIS etc) even if you use nscd to cache as much as you can.

James Ogley james.ogley (at) pinnacle (dot) co (dot) uk 14 Aug 2002

We have an internal 'staging' server that our web designers upload content to. A shell script we run as a daemon then rsync's the content across the cluster members.

In addition, we have an externally facing FTP server that external customers upload their content to. The above mentioned shell script rsyncs that content to itself as part of its operation.

"Matthew S. Crocker" matthew (at) crocker (dot) com 14 Aug 2002

We use keepalived for the cluster/LVS monitoring/management. We use SCP to move the keepalived.conf file around to all the servers.

For content it is all stored on a Netfiler F720. The next upgrade will replace the F720 with a cluster of Netfilers (F85C ??). The realservers NFS mount the content (Webdirs, Maildir)

Put new content on the NFS server, every machine sees it

We run SMTP,POP3,IMAP. We'll be adding HTTP, HTTPS and FTP in the next few weeks. Our LVS is LVS-NAT, 2 directors, 4 realservers, Cisco 3548 switch.

Do you know of any new technology or propriety solutions that need an open source implementation?

I would like to see a Netfiler type appliance open sourced. I know I can go with a linux box but I'm just not sure on the performance. I want a fiber channel based NFS server with complete journaling, 30 second reboots, fail over, snapshots. I think a linux box with EXT3 or ReiserFS comes close but you don't get snapshots and I'm not sure how the failover would work.

Doug Schasteen wrote:

What does the keepalived vrrp do exactly? What are you using MySQL for? Because if you are running scripts or web programs, then don't you need to specify an IP in your connection strings? I'm just wondering how that works, because if all your connection strings are set to a certain IP and then that IP goes down, how does it know to fail over to the second machine? The only thing I can think of is your second machine takes over that IP somehow.

keepalived is an awesome tool, bundled with VRRP it allows for machine failover. Basically you set it up like this.

Machine A:

 Physical IP
 MySQL Master
 keepalived Master
  VRRP IP address

Machine B:

 Physical IP
 MySQL Slave
 keepalived BACKUP
  VRRP IP address

The MySQL Master is set up to replicate to the MySQL Slave. The SQL client apps connect to MySQL on the VRRP IP address. The IP address will only exist on the machine which keepalived determines to be the active MASTER. If something causes that machine to crash, or if the backup machine stops receiving VRRP announcements from the master, the backup will enable the IP address and send out arps for the IP. The clients will connect to the same IP address all the time. That IP address can be on either machine.

I use keepalived to fail over my LVS servers but it can be used to failover any group of machines.

I was planning on tackling this issue by writing (rewriting) all of my scripts/programs to include one file that does the mysql connections. Then I only have to change one file when I want to change where my mysql connections go. And then maybe I'll add a failover connection inside of that include file, like an "if the first connection didn't work, try the backup server". The problem with that is that if for some odd reason the first connection doesn't work (perhaps I rebooted the machine), it will put them on the backup server and updates will be made to the backup server. Any updates made to the backup mysql server while I'm rebooting the main mysql server will probably be lost. I can maybe add two-way replication for when something like this happens (but not use two-way replication all the time, because I've heard that has problems.)

Have the slave server dump transaction logs so you can manually replicate the data back over when you recover the master server.

Ramon Kagan rkagan (at) YorkU (dot) ca 14 Aug 2002

We use lvs with keepalived for High Availability. All our servers are identical in setup, and use NFS to a cluster of Netapp filers (two F840s). Our setup uses LVS-DR: since we push very close to the 100Mbit/s range, NAT seemed to have too much overhead.

Services that we run are web (http, https), web mail, mail delivery. Pop and imap are soon to be added to this list.

New content is put onto the filers, thus all nfs clients pick it up immediately.

For our MySQL setup I have a single "MySQL" machine. I set up my MySQL to listen on the designated port and have set up strict rules in MySQL for authentication and access (see the mysql.user and mysql.db tables). For redundancy I have a second machine running as a replication slave against the MySQL machine. I'm using keepalived's vrrp framework to force failover when problems arise (hasn't happened yet, knocking on wood really hard). I have tested this in a development environment and it seems to work nicely. I found that with both machines running 100 Full Duplex, our MySQL server can complete over 10,000 write transactions per second and the latency between databases is on average 0.0019 seconds (yes, under 2 thousandths of a second!). I will admit that these are pretty strong machines (Dual P3 1.4 with 2 GB memory, 100% SCSI based), but I've seen similar performance on P3 600 with 512 MB memory, IDE based, still 100 Full Duplex though.

VRRP - virtual router redundancy protocol

Keepalived is software written by Alexandre Cassen. What it can do is:

  1. Health checking of realserver allowing removing of unresponsive ones from an lvs table (auto control of the lvs table)
  2. Heartbeat between two lvs boxes so that if one fails the other takes over.

So, with these two you can create a High Availability (HA) cluster.

Using "2" only you can set up a service, like MySQL replication, and run just the heartbeat (in this case VRRP framework) without the health checking or LVS framework. Then if the master node goes down the slave node can run a script to convert the slave database into a master database.

So, if you have a master system, say dbmast.ex.com, and a slave dbslave.ex.com, you create a service IP db.ex.com. All clients talk through db.ex.com. On startup, dbmast.ex.com arps out gratuitously that it is db.ex.com. On failure, dbslave.ex.com would arp out to take over the service IP.

This way, clients need not know of any changes.

Go to www.keepalived.org for more info. If you have any further questions, there is a mailing list, and I have helped in the past with the documentation.
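The takeover logic can be pictured with a toy model. This is not keepalived code: it only captures the idea that, among the nodes still sending announcements, the one with the highest priority holds the service IP. The priorities are made up, and real VRRP has further rules (tie-breaking, preemption settings) that are ignored here.

```python
def holder_of_service_ip(nodes):
    """Return the node that should hold the service IP, or None.

    nodes maps a node name to a (priority, alive) tuple.
    """
    alive = {name: prio for name, (prio, is_alive) in nodes.items() if is_alive}
    if not alive:
        return None                      # nobody left to hold the IP
    return max(alive, key=alive.get)     # highest priority among live nodes


nodes = {
    "dbmast.ex.com": (150, True),
    "dbslave.ex.com": (100, True),
}
before = holder_of_service_ip(nodes)     # master holds db.ex.com

nodes["dbmast.ex.com"] = (150, False)    # master stops sending announcements
after = holder_of_service_ip(nodes)      # slave takes over db.ex.com
print(before, after)
```

The clients keep connecting to db.ex.com throughout; only the machine answering for that IP changes.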

nick garratt nick-lvs (at) wordwork (dot) co (dot) za 14 Aug 2002

transferring of content is currently done via tar over ssh:

tar cBf - . | ssh remoteserver "cd /to/content/location && tar xBf -"

(beware path length !)

it's useful for transferring entire hierarchies, preserving perms, symlinks and whatnot, but we'll probably migrate to rsync.

before new content is deployed it is transferred to an intermediate deploy server from the dev server where it is thoroughly tested/abused. none of these machines forms part of the cluster per se.

the content is transferred to the remote access server in the cluster after testing. from this machine it is transferred to each of the web servers in turn using the same mechanism described above.

any content which must be shared (rw) by all web servers (client templates, ftp dirs) is NFS served.

content is database driven dynamic content (apart from obvious exceptions) providing both web site facilities and an http/s get/post and xml api.

  • cluster manager: lvs using fwmark nat for (public) http/s, ftp, smtp, dns and fwmark dr (internal clustered services) for http. mon.
  • www1 - wwwn: php, apache. all run msession and an ftpd although only one is IT at any point
  • db: pair of mysql servers in master/slave.
  • loghost: responsible for misc services (ntp master for cluster, syslogd - remote logging, tftpd for log archiving, log analysis ...)
  • remote: dns for the world (also fwmark) and secondary nameserver for cluster, ssh

Neulinger, Nathan nneul (at) umr (dot) edu 26 Apr 2004

We use AFS/OpenAFS as our backend storage for all regular user and web data. (Mail and databases are separate.) About 3 TB total capacity, of which about 1.9 used. We have 3500+ clients, many of which are user desktops - thus unsuitable for nfs. NFS is suitable for tightly controlled server clusters, but not really for export to clients that may or may not be friendly.

Joe 26 Apr 2004

If it's a readonly site, then have only static pages on the realservers and rsync them from a staging machine (which may have dynamic html). If you need randomly different content from dynamic pages, generate different versions of them every few minutes and push different ones to each realserver. You want the httpd to be fetching as much from the disk cache as possible.

John Reuning john (at) metalab (dot) unc (dot) edu 26 Apr 2004

Squid might be a good solution for caching static pages on real servers. For php caching, you can use turck mmcache. It works well most of the time, but is sometimes flaky with poorly-written php code.

We actually do what Joe suggested, manually rsync'ing content to a cache directory on the realservers. Alias directives are added to map the cached content to the proper url location. Our shared storage is nfs, and we, too, have lots of user-maintained files. However, we've targeted directories that don't change often (theme and layout directories for CMS applications, for example). Rsync is efficient, and we run it every hour or so.

There's one performance problem that's not solved by this, though. Apache performs lots of stat() calls when serving pages (checking for .htaccess files). The stat calls are made before the content is served and go to the nfs servers. Under a high traffic load, the stat calls bog down our nfs servers despite the content being cached on the real servers.
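To see why the number of stat() calls adds up, here is a sketch that lists the .htaccess files Apache would look for when serving one file. It assumes that overrides are allowed (AllowOverride is not None) from the document root downwards; the paths are hypothetical. With AllowOverride None, Apache does not look for these files at all.

```python
from pathlib import PurePosixPath


def htaccess_lookups(document_root, url_path):
    """List the .htaccess paths checked for one request, root first."""
    root = PurePosixPath(document_root)
    target = root / url_path.lstrip("/")
    current = root
    lookups = [current / ".htaccess"]
    # Walk every directory between the document root and the file itself.
    for part in target.relative_to(root).parts[:-1]:
        current = current / part
        lookups.append(current / ".htaccess")
    return [str(p) for p in lookups]


lookups = htaccess_lookups("/nfs/www", "/cms/themes/blue/style.css")
for path in lookups:
    print(path)
```

In this example a single request for a stylesheet produces four lookups on the NFS server before any content is served, even if the stylesheet itself comes from a local cache.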

Graeme Fowler keepalived (at) graemef (dot) net 28 May 2004

There are loads of ways to synchronise realserver content, or the whole filesystem in case of realserver disk failure.

  • Have a "management station" which can do pubkey SSH logins to the managed machines, run scripts, push software and so on.
  • Create a disk image of a server you're happy with and have it network boot using syslinux/pxelinux
  • Utilise HP's open source OpenSSI project (an OSS cluster management system). I've found the OpenSSI concepts to be hugely useful in theory, if not in practice.
  • Use a "virtual" machine system such as UML. You can then keep a "dumb" system running the virtual machine, make changes to the image offline, copy it to the "dumb" system and reboot the virtual machine instead.

J. Simonetti jeroens (at) office (dot) netland (dot) nl 28 May 2004

I also found systemimager (http://www.systemimager.org/) myself which sounds promising as well.

19.5. cfengine for synchronising files

cfengine is designed to control and propagate config files to large numbers of machines. Presumably it could be used to synchronise realserver content files as well.

Has anyone successfully rolled out a cluster of real-servers using coda? (Main reason would be for the replication of config files (Apache/qmail) across all real-servers) - Is this doable? Or am I better off using rsync?

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no 27 Jun 2003

I recommend using rdist or cfengine which are designed to distribute configuration files, not to make exact copies of a directory (structure). Both rdist and cfengine are more configurable than rsync.

19.6. File Systems for (really big) Clusters: Lustre, Panasas

People have discussed CODA as a filesystem for synchronisation. In clusters NFS has a lot of overhead and failed mounts result in hung systems. Here's a posting from the beowulf mailing list

Bari Ari bari (at) onelabs (dot) com 26 Sep 2002

NFS is dead for clusters. We are targeting three possible systems, each having a different set of advantages:

  • panasas (http://www.panasas.com)
  • lustre (http://www.lustre.org)
  • v9fs (http://v9fs.sourceforge.net)

19.7. File Systems for Clusters: Samba waits for a commit and is slow, NFS fills buffers and is fast

(from the TriLUG mailing list).

John Broome

I have a RH 9 machine that is acting as a fileserver for a completely windows network (98 and 2000). The users mentioned that the file transfers seemed slow. Some testing showed that samba was moving data much slower than NFS. NFS was using pretty much the entire speed potential of the network, where SMB was about half that, or less. No indication on the server that CPU, HDD, or memory is the problem. When tested off site with different hardware and a different OS (Ubuntu 5.04), the same problem popped up - SMB dragging along, NFS cranking. Since this is a mostly windows network we can't really use NFS instead of samba.

Tanner Lovelace clubjuggler (at) gmail (dot) com 06/20/2005

A quick search for "samba tuning" brings up this quote from http://www.oreilly.com/catalog/samba/chapter/book/appb_01.html "If you run Samba on a well-tuned NFS server, Samba may perform rather badly."

If you follow the link at the bottom of the page (http://www.oreilly.com/catalog/samba/chapter/book/appb_02.html) it has suggestions for things in samba to tune.

Jason Tower jason (at) cerient (dot) net

I was testing transfer speed using my t42 with ubuntu. File transfers using nfs would occur at wire speed (12.5 MB/s) while the exact same file transferred using smb (mounted with smbmount) would only be about 5.5 MB/s.

However, when I booted into windows on the t42 I could copy the file (with smb of course) at nearly wire speed. So it seems that at least part of the perceived problem has something to do with the smb *client*, not the server. cpu utilization and iowait were not even close to being a bottleneck so I'm not sure where the slowdown is occurring or why.
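The "wire speed" figure is easy to check. The sketch below assumes the link is 100 Mbit/s Ethernet, which is what 12.5 MB/s corresponds to; the post does not state the link speed, and protocol overhead is ignored.

```python
def wire_speed_mb_per_s(link_mbit_per_s):
    """Convert a link speed in Mbit/s to MB/s (8 bits per byte)."""
    return link_mbit_per_s / 8


link = 100                             # Mbit/s, assumed
nfs_rate = wire_speed_mb_per_s(link)   # what NFS achieved in the post
smb_rate = 5.5                         # MB/s, smbmount figure from the post
fraction = smb_rate / nfs_rate

print(f"wire speed: {nfs_rate} MB/s")
print(f"smbmount reached {fraction:.0%} of wire speed")
```

So the smbmount client was running at a bit under half of what the network could carry.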

Jon Carnes jonc (at) nc (dot) rr (dot) com 06/20/2005

I wrote this up for TriLUG about 5 years ago... We tested various forms of file transfer including NFS and Samba and - if I remember correctly - we found Samba (version 3) to be about 1/3 the speed of NFS (version 2). The problem was that the Samba process waited for a commit before negotiating for the next data transfer whereas NFS filled a buffer and continuously pushed that buffer out.

Obviously if you're running from a buffer out of RAM you can run at network speeds (or as fast as your internal bus and cpu can go).

Microsoft's implementation of SMB pumps the data to be moved into a buffer and works similarly, so it's almost as fast as using NFS (though it does some other weirdness that always makes it a bit slower than NFS...)

NFS v3 had a toggle that also defaulted to waiting for a commit from the remote hard drive before sending more data - that moved files around just slightly faster than Samba (it crawled.)
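The difference between "wait for a commit" and "keep the buffer full" can be put into rough numbers. The following is only a back-of-the-envelope model, not a measurement: the 12.5 MB/s wire speed comes from the posting above, while the 64 KB block size and 5 ms commit delay are made-up values chosen for illustration.

```python
def effective_throughput(link_mb_per_s, block_kb, commit_ms, wait_for_commit):
    """Modelled throughput in MB/s for a long file transfer.

    link_mb_per_s   -- wire speed (12.5 MB/s is 100 Mbit/s)
    block_kb        -- size of each block sent (assumed value)
    commit_ms       -- time the sender waits for a commit (assumed value)
    wait_for_commit -- True models "wait for a commit before the next block",
                       False models "fill a buffer and keep pushing it out"
    """
    block_mb = block_kb / 1024.0
    send_s = block_mb / link_mb_per_s            # time on the wire per block
    if wait_for_commit:
        # each block pays for its wire time plus the commit delay
        per_block_s = send_s + commit_ms / 1000.0
    else:
        # the next block goes out while the previous one is being committed
        per_block_s = send_s
    return block_mb / per_block_s


if __name__ == "__main__":
    for wait in (False, True):
        rate = effective_throughput(12.5, 64, 5, wait)
        print("wait_for_commit=%s: %.2f MB/s" % (wait, rate))
```

With these assumed numbers the waiting sender gets about half of wire speed, which is the same order as the figures reported in the thread. The real ratio depends on block size and commit latency.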

19.8. Discussion on distributed filesystems

This was a long thread in which everyone discussed their knowledge of the matter. I've added subsequent postings at the end. If clients are reading from the realservers (e.g. webpages), then it's simple to have the same content on each machine and push content, say, once a day. If clients are writing to disks, you have a different problem: propagating the writes to all machines. In this case you may want an (expensive) central fileserver (look for NetApp elsewhere in this HOWTO for happy users e.g. NetApp for mailservers).

Graham D. Purcocks

What, if any, Distributed Filesystems have any LVS users tried/use with any success?


We hear little about distributed file systems with LVS. Intermezzo is supposed to be the successor to CODA, but we don't hear much more here on the LVS mailing list about Intermezzo than we do about CODA.

Distributed file systems are a subject of great interest to others (e.g. beowulfs) and you will probably find better info elsewhere.

The simple (to setup) distributed filesystems (e.g. PVFS) are unreliable, i.e. if one machine dies, you lose the whole file system. This is not a problem for beowulfs, since the calculation has to be restarted if a compute node dies, and jobs can be checkpointed. Reliable distributed filesystems require some effort to setup. GFS looks like it would take months and much money to setup. A talk at OLS_2003 described Lustre (http://www.lustre.com), a file system for 1024 node clusters that is in deployment. It sounds simpler to setup than GFS, but I expect it will still be work. Lustre expects the layer of hardware (disks) below it to be reliable (all disks are RAID).

Rather than depending on the filesystem to distribute state/content, for an LVS where clients write infrequently (if at all), state/content can be maintained on a failover pair of machines which push content to the realservers.
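A minimal sketch of the "push content to the realservers" approach, assuming rsync over ssh is available on the content master and that the realservers accept logins from it. The host names and directories in the usage note are hypothetical; `-a` and `--delete` are standard rsync options.

```python
import subprocess


def rsync_command(src_dir, host, dest_dir):
    """Build the rsync command that mirrors src_dir onto one realserver."""
    # the trailing slash makes rsync copy the directory contents, not the directory
    return ["rsync", "-a", "--delete", src_dir.rstrip("/") + "/",
            "%s:%s" % (host, dest_dir)]


def push_content(src_dir, realservers, dest_dir, run=subprocess.run):
    """Push content to every realserver; return {host: True/False}."""
    results = {}
    for host in realservers:
        proc = run(rsync_command(src_dir, host, dest_dir))
        results[host] = (proc.returncode == 0)
    return results
```

Run from cron on the active member of the failover pair, e.g. `push_content("/var/www", ["rs1", "rs2"], "/var/www")`. A realserver that reports False has stale content and is a candidate for removal from the LVS until the push succeeds.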

Graham Purcocks grahamp (at) wsieurope (dot) com 04 Nov 2003

My thoughts are:-

  • NFS is fine if the content is not changing often. As it is a single point of failure, you have to have a backup and do all the failover and synchronizing stuff mentioned in other emails.
  • With a distributed file system, this is not the case and it 'should' sort itself out as servers go offline. A distributed file system is needed if you have dynamic content which changes often; in that case rsync will not cut it.

John Barrett jbarrett (at) qx (dot) net 04 Nov 2003

nfs directory naming has not been an issue for me in the least -- I always mount nfs volumes as /nfs[n] with subdirs in each volume for specific data -- then symlink from where the data should be to the nfs volume -- same thing you will have to do with coda -- in either case the key is planning your nfs/coda setup in advance so that you don't have issues with directory layouts, by keeping the mountpoints and symlinks consistent across all the machines.

I'm not currently doing a replicated database -- I'm relying on raid5+hotswap and frequent database backups to another server using mysqldump+bacula. Based on my reading, mysql+nfs is not a very good idea in any case -- mysql has file locking support, but it really slows things down because locking granularity is at the file level (or was the last time I looked into it -- over a year ago -- please check if there have been any improvements)

based on my current understanding of the art with mysql, your best bet is to use mysql replication and have a failover server used primarily for read only until the read/write server fails (if you need additional query capacity) (ldirectord solution), or do strict failover (heartbeat solution), only one server active at a time, both writing replication logs, with the inactive reading the logs from the active whenever both machines are up (some jumping through hoops needed to get mysql to startup both ways -- possibly different my.cnf files based on which server is active)

with either setup -- the worst case scenario is one machine goes down, then the other goes down some period of time later after getting many updates, then the out of sync server comes up first without all those updates

(interesting thought just now -- using coda to store the replication logs and replicating the coda volume on both the database servers and a 3rd server for additional protection, 3rd server may or may not run mysql, your choice if you want to do a "tell me 3 times" setup -- then you just have to keep a file flag that tells which server was most recently active, then any server becoming active can easily check if it needs to merge the replication logs -- but we are going way beyond anything I have ever actually done here, pure theory based on old research)

in either case you are going to have to very carefully test that your replication config recovers gracefully from failover/failback scenarios

my current cluster design concept that I'm planning to build next week might give you some ideas on where to go (everything running ultramonkey kernels, in my case, on RH9, with software raid1 ide boot drives, and 3ware raid5 for main data storage):

M 1 -- midrange system with large ide raid on 3ware controller, raid5+hotswap, coda SCM, bacula network backup software, configured to backup directly to disk for first stage, then 2nd stage backup the bacula volumes on disk to tape (allows for fast restores from disk, and protects the raid array in case of catastrophic failure)

M 2 and 3 heartbeat+ldirectord load balancers in LVS-DR mode -- anything that needs load balancing, web, mysql, etc, goes through here (if you are doing mysql with a read/write + multiple read only servers, the read only access goes through here, write access goes direct to the mysql server and you will need heartbeat on those servers to control possession of the mysql write ip, and of course your scripts will need to open separate mysql connections for reading and writing)

M 4 -- mysql server, ide raid + hotswap -- I'm not doing replication, but we already discussed that one :)

then as many web/application servers as you need to do the job :) each one is also a replica coda server for the webspace volume, giving replicated protection of the webspace and accessibility even if the scm and all the other webservers are down -- you may want multiple dedicated coda servers if your webspace volume is large enough that having a replicate on each webserver would be prohibitively expensive

If you use replicated mysql servers, this script may provide a starting point for a much simplified LVS-DR implementation: http://dlabs.4t2.com/pro/dexter/dexter-info.html -- the script is designed to layer additional functionality on top of a single mysql server, but the quick glance that I took shows it could be extended to route queries to multiple servers based on the type of query.. i.e. read requests to the local mysql instance, write requests to a mysql instance on another server.

setup 1 mysql master server, it will not be part of the mysql cluster, its sole task is to handle insert/update and replicate those requests to the slave servers.

setup any number of mysql replicated slaves -- they should bind to a VIP on the "lo" interface, and the kernel should have the hidden ip patch (ultramonkey kernels for instance)

modify dexter to intercept insert/update requests and redirect them to the master server (will mean keeping 2 mysql sessions open -- one to the master, one to the local slave instance) -- if the master isn't there, fail the update -- install the modified script on all the slave servers

setup the VIP on an ldirectord box and add all the slaves as targets -- since mysql connections can be long running, I suggest using LeastConnection balancing

Now clients can connect to the VIP, the slaves handle all read accesses, and the master server handles all writes
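The read/write split described above can be sketched in a few lines. This is not the dexter script, only an illustration of the routing decision it would have to make. The function name and the connection labels are hypothetical, and treating DELETE and REPLACE as writes is an assumption beyond the insert/update mentioned in the posting.

```python
# statements that modify data go to the master; everything else stays local
WRITE_VERBS = {"INSERT", "UPDATE", "DELETE", "REPLACE"}


def route_query(sql, master, local_slave, master_up=True):
    """Return the connection (here just a label) that should run this query."""
    words = sql.strip().split(None, 1)
    verb = words[0].upper() if words else ""
    if verb in WRITE_VERBS:
        if not master_up:
            # "if the master isn't there, fail the update"
            raise RuntimeError("master unavailable: failing the update")
        return master
    return local_slave


if __name__ == "__main__":
    print(route_query("SELECT * FROM orders", "master", "local"))
    print(route_query("update orders set paid=1 where id=7", "master", "local"))
```

A real implementation would also need to send transactions and locking statements to the master; that is left out here.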

The only possible issue with this setup is allowing for propagation delays after insert/update -- i.e. you won't be able to read back the new data instantly at the slaves -- it may take a second or 2 before the slaves have the data -- your code can loop querying the database to see if the update has completed if absolutely necessary
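The "loop querying the database" idea might look like the following sketch. The `check` callable is a placeholder for whatever query tells you the new row has reached the slave; the 2 second timeout and 0.1 second interval are assumed values based on the "a second or 2" estimate above.

```python
import time


def wait_until_visible(check, timeout_s=2.0, interval_s=0.1,
                       sleep=time.sleep, clock=time.monotonic):
    """Poll check() until it returns True or timeout_s has passed.

    check -- hypothetical callable that queries the local slave and returns
             True once the update is visible there.
    """
    deadline = clock() + timeout_s
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False          # give up; the caller decides what to do next
        sleep(interval_s)
```

Use it only where a read really must follow its own write. Polling on every request would put the load you moved off the master back onto the slaves.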

you still have a single point of failure for database updates, but your database is always backed up on the slaves, and because there is only one point of update, it's very difficult for the slaves to get out of sync, and read access is very unlikely to fail -- also you have none of the problems with database locking, as NFS is still not recommended based on the lists that I scanned to get up to date on the issues

Lastly -- the master CAN be a read server if you wish (my setup above assumes it is not) given that the update load is not so much that the master gets overloaded -- if you have high update frequency, then let the slaves handle all the reads, and the master only updates

Ariyo Nugroho ariyo (at) ebdesk (dot) com 31 Oct 2003

After successfully setting up LVS-NAT for telnet, http, and ftp services, I'm now going to do it with databases. The HOWTO says that implementing such a configuration needs a distributed filesystem. There are so many names mentioned in the HOWTO that I'm confused.

The first name I noticed from the LVS page is Coda. But then I found that Peter Braam has stopped working in the Coda team. He initiated another one, Intermezzo. Many articles stated that this new filesystem is more promising than Coda. Unfortunately, until now, I can't find out whether Intermezzo has become stable or not.

So, do any of you have experience with any distributed filesystems? Which one would you recommend?


The HOWTO says that you need some way of distributing the writes to all the realservers. This can be done at the application level or at the filesystem level. At the time we first considered running databases on LVS, neither distributed filesystems nor replication was easily available on Linux. Pushing writes would have to be done via a DBI/DBD interface or similar. I expected that distributed filesystems were just around the corner (intermezzo, CODA), but they never arrived. In the meantime mysql has implemented replication and pushing the writes now seems best done at the application level.

Ratz in his postings to the mailing list has shown how most problems that involve maintaining state across the realservers can and should be solved at the application level. It only seems reasonable to approach the database write problem the same way.

If someone comes up with a bulletproof, easy to maintain, reliable distributed filesystem, then all of this will be thrown out the window and we'll all go to distributed filesystems. However with the effort that's been put into distributed filesystems and the lack of progress that's been made, relative to the progress from modifying the application, I think it's going to be a while before anyone has another go at distributed filesystems for Linux.

I'm sure the HOWTO says that LVS databases can use distributed filesystems. However LVS'ed databases don't require (==need) distributed filesystems.

Karl Kopper karl (at) gardengrown (dot) org 31 Oct 2003

Did I mention I have had a good experience with NFS?

In my experience NFS is rock solid on the newer 2.4 Linux kernels. What lets me say this is the fact that since late February the company where I work has been using Linux NFS clients in an LVS cluster as the main business system without a single problem. This is a system that processes half a billion dollars a year in orders and prints hundreds of documents (warehouse pick sheets for example) every day (using the rock-solid LPRng system btw). NFS has not failed once. The existing in-house order processing applications did not have to be rewritten to run on the cluster over NFS because they use fcntl locking.

This cluster is all in one data center. I would sooner quit my job than be forced into implementing a file system for a mission critical application that had to do lock operations to guarantee data integrity over a WAN. I don't care how good the file system code is.


This went over my head. Are you using locking with nfs or not?

We are doing locking with NFS (the fcntl calls from the application cause the Linux kernel to use statd/lockd to talk to the NAS server). I just wouldn't do locking over a WAN. I'm just talking about doing any type of lock operation over a WAN (for example an NFS client that connects to the NAS box over the WAN). Not really distributed content stored on direct attached disk on each node (though that would have to be even worse for locking over a WAN).
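For readers who have not used fcntl locking, here is a minimal sketch of the kind of call an application makes. The function name and file layout are hypothetical. On an NFS mount it is the kernel, not this code, that forwards the lock request to the server, as described above.

```python
import fcntl
import os


def append_with_lock(path, line):
    """Append one line to a shared file while holding an exclusive lock."""
    with open(path, "a") as f:
        fcntl.lockf(f, fcntl.LOCK_EX)       # blocks until the lock is granted
        try:
            f.write(line + "\n")
            f.flush()
            os.fsync(f.fileno())            # push the data out before unlocking
        finally:
            fcntl.lockf(f, fcntl.LOCK_UN)
```

Every writer has to take the lock for this to protect anything; these locks are advisory, so a process that skips the call is not stopped from writing.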


I missed the WAN bit in the original posting. So you were referring to some sort of file system (distributed ?, eg AFS) spread over a WAN? I had forgotten that some distributed file systems aren't local only.

Let's see if we understand each other.

  • Where you're coming from:

    You don't like file locking onto a box on another network over possibly non-dedicated links. You're happy with NFS because you have the disks local (in some arrangement I don't know about yet) on a network that others can't get to.

  • Where I'm coming from:

    I think of distributed file systems as being used on machines like beowulfs (or an LVS) where the disks are on machines on a separate and dedicated network, that is not accessible to clients (all jobs are submitted to a master node, and the client never sees the filesystem behind it). The people running the beowulf have complete control over the file system(s) and network behind the master node.


The problem is a lack of a consistent definition of the term "distributed file system." Here is the configuration I'm referring to:

[LVS RS 1]------------>|   NAS       |
                       |    Server   |
[LVS RS 2]------------>|             |
                       | (NFS Server)|
[LVS RS 3]------------>|_____________|

Each LVS RS has /var, /usr, etc. on a local disk drive, but shared data is placed on the NFS-mounted file system. (They are just ordinary NFS clients to the NAS box). Lock arbitration is performed by the NAS box.

I think the terms "distributed file system" and "cluster file system" suffer from the same problem of vagueness of definition. Awhile back Alan Cox had this to say about the term cluster file system (CFS):

It seems to mean about three different things

  1. A clusterwide view of the file store implemented by any unspecified means - i.e. an application view point.
  2. A filesystem which supports operation of a cluster
  3. A filesystem with multiple systems accessing a single shared file system on shared storage

Meaning #3 can be really confusing because a 'cluster file system' in that sense is actually exactly what you don't want for many cluster setups, especially those with little active shared storage. For example if you are doing database failover you can do I/O fencing and mount/umount of a more normal fs.

John Barrett jbarrett (at) qx (dot) net 01 Nov 2003

I've just recently installed Coda, and must say that i'm less impressed with it compared to NFS for a number of reasons. There is a server setup I will be doing in a few weeks where Coda will be the only choice, so don't think that I'm being completely against Coda, I just feel the range of areas where it is usable is fairly narrow.

  • NFS is already in the linux kernel as is Coda, but Coda is an older version (5.3.xx vs current 6.0.xx, make sure your user-space code is right for your kernel module)
  • NFS -- setting up the client and server is a no-brainer, webmin handles both if you dont want to get your keyboard dirty, and even if you hand-edit, its no problem, no such luck with Coda
  • Coda's main strength is replicated servers, but you can do the same with NFS if you are willing to accept some delay before changes propagate to the other NFS servers (i.e. rsyncing the nfs servers every so often, using heartbeat failover to bring the backup NFS server online as needed) -- if your files on disc are frequently changing and replicated servers must be in sync, Coda is the better choice.
  • Coda's main weakness IMHO is the hoops you have to jump through when you make changes to the server system -- you have to kill and restart the client daemon on each client machine to get changes to take. (adding a new server or replicate, creating or deleting volumes, etc -- experiment a lot on non-production systems, get the production setup right the 1st time)
  • Coda does much more in the way of local caching than NFS, and the cache size is configurable... make the cache as large as the distributed filespace and it is possible to continue to operate if all the servers are down, and any changes will be committed when one or more of the servers come back online (presuming all the files needed are in the cache)
  • Coda does not use system uid/gid for its file -- it maintains its own user/pass database, and you must login/acquire a ticket before accessing Coda volumes -- NFS runs off the existing uid/gid system, all you have to do is keep the passwd/group files on all the machines in sync for key users, or setup NIS+

In closing, I feel that I had a bad experience with Coda, but I won't hesitate to try again when I have more time to dig into the detail. I was under a lot of time pressure on this latest job, so I went with NFS just to get the system online NOW :)

Todd Lyons tlyons (at) ivenue (dot) com 16 Feb 2006

Here are some questions to ask to compare which avenue you would like to take for a cluster filesystem:

  1. In a cluster filesystem, how many real servers can go down at one time and leave the virtual raid array still operational?
  2. In a NAS box, how many drives can die simultaneously and the system is still operational?
  3. In a cluster filesystem, what happens if half of the switch ports just all of a sudden die, or the whole switch, or some corruption happens in your switch and several ports suddenly segment themselves into their own VLAN?
  4. For a NAS box, ask #3. (Hint: a good NAS like the FAS270c with twin heads will do a complete takeover if necessary, so 3 of the 4 ethernet ports can lose connectivity. As long as one is still connected, that one head can serve the load of both heads.)
  5. In an NAS box, does it have multiple power supplies? Multiple ethernet ports? Multiple parity drives? Multiple spare drives?
  6. How much money do you have available to spend? Are you counting the amount of time it will take you to keep a cluster filesystem running and stable as compared to the relatively troublefree NAS boxes (assuming you don't undersize it)?

In terms of complete disaster recovery:

  1. How long does it take to backup a complete cluster fs?
  2. How long does it take to backup a NAS?
  3. How long would it take to rebuild and restore a cluster if half of the machines died all at once? If all of the machines died at once?
  4. How long would it take to rebuild and restore a NAS box if half of the drives died all at once? If all of the drives died at once? If both power supplies died at the same time?

If you think that you'll never see any of these situations, you might be lucky and never will, but my general experience with computer hardware is that things run very smooth for a very long time, and then something hiccups hard and takes a bit of work to recover from. I've also heard mention that "Murphy was an optimist." :-)

If you can't tell, I'm of the opinion that a NAS will do more for you than a cluster filesystem, but keep in mind that's also my comfort zone. If I was daily into the inner workings of a cluster fs production system, I might feel differently.

19.9. load balancing and scheduling based on the content of the packet: Cookies, URL, file requested, session headers

Mar 2002: questions on these topics have come up in the context of persistent connection, L7 Switch, stateful failover or cookie based routing (see L7 switch Introduction). I've tried to collect it all here. Make sure you read these other sections if you are implementing cookies or URL rewriting.

LVS being an L4 switch does not have access to the content of packets, only the packet headers. So LVS doesn't know the URL inside the packet. Even if it did, LVS would need an understanding of the http protocol to inspect cookies or rewrite URL headers. Often people think that an L7 Switch is the answer here. However an L7 switch is slow and expensive. If you have what looks to be an L7 problem, you should see if there is a solution at the L4 level first (see rewriting your e-commerce application).

The short answer is that you can't use LVS to load balance based on the content of the packet. In the case of http, there are other tools which understand the content of http packets and you can use those.

19.9.1. Cookies

Cookies are an application level protocol for maintaining client state in the stateless http/https protocols. Cookies are passed between servers and clients which have http, https and/or database services. For the cookie specification see netscape site. When you sign in at Ebay, you receive a cookie. You will still be signed in days later on the same screen, even if you don't do anything with your screen/mouse/keyboard.

A cookie is left on the disk of the client machine. It's intrusive and can be used by the server to spy on your shopping/surfing habits. They are a security hazard and clients should turn them off. There are other non-intrusive methods for maintaining state, e.g. passing information to the client in the URL with url rewriting or session management using php. However many sites require you to allow cookies. Make sure you set your browser to flush your cookies when it exits.

Joe: Feb 2004. I thought having the client's data in the URL was a good idea till I talked to people at OLS, when I find that this method can't be used, as the client's data is visible to and alterable by the client and hence isn't secure.

Although initially designed as a helpful tool, the only use for cookies now is information gathering by the server (it's like having your pocket picked while window shopping).

Being a layer 4 switch, LVS doesn't inspect the content of packets and doesn't know what's in them. A cookie is contained in a packet and the packet looks just like any other packet to an LVS.

If you're asked to setup an LVS in front of cookie dependent webservers, you will need to turn on persistence so that the client will be guaranteed of connecting to the same realserver.

Eric Brown wrote:

Can LVS in any of its modes be configured to support cookie based persistent sessions?

Horms horms (at) vergenet (dot) net 3 Jan 2001

No. This would require inspection of the TCP data section, and in fact an understanding of HTTP. LVS has access only to the TCP headers.
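To make "inspection of the TCP data section and an understanding of HTTP" concrete, here is a small sketch of what any cookie-aware device would have to do with the payload before it could make a decision. It is an illustration only; LVS does nothing of the kind. The function name is hypothetical and the parsing is simplified (a single complete request in one buffer).

```python
from http.cookies import SimpleCookie


def cookies_from_request(payload):
    """Extract cookies from the raw bytes of one HTTP request."""
    head, _, _ = payload.partition(b"\r\n\r\n")        # headers end at a blank line
    lines = head.decode("iso-8859-1").split("\r\n")
    cookies = {}
    for line in lines[1:]:                             # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "cookie":
            jar = SimpleCookie()
            jar.load(value.strip())
            for key, morsel in jar.items():
                cookies[key] = morsel.value
    return cookies


if __name__ == "__main__":
    request = (b"GET /cart HTTP/1.1\r\n"
               b"Host: www.example.com\r\n"
               b"Cookie: session=abc123; theme=dark\r\n\r\n")
    print(cookies_from_request(request))
```

None of this information is in the IP or TCP headers, which is all an L4 switch looks at.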

valery brasseur

I would like to do load balancing based on cookie and/or URL,


Have a look at http://www.LinuxVirtualServer.org/docs/persistence.html :-)

matt matt (at) paycom (dot) net

I have run into a problem with the persistent connection flag, I'm hoping that someone can help me. First off, I don't think there is anything like this out now, but, is there any way to load-balance via URL? Such as http://www.matthew.com being balanced among 5 servers without persistent connections turned on, and http://www.matthew.com/dynamic.html being flagged with persistence? Second question is this; I don't exactly need a persistent connection, but I do need to make sure that requests from a particular person continue to go to the same server. Is there any way to do this?

James CE Johnson jcej (at) tragus (dot) org Jul 2001

We ran into something similar a while back. Our solution was to create a simple Apache module that pushes a cookie to the browser when the "session" begins (i.e. when no cookie exists). The content of the cookie is some indicator of the realserver. On the second and subsequent requests the Apache module sees the cookie and uses the Apache proxy mechanism to forward the request to the appropriate realserver and return the results.
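The logic of such a module can be sketched as follows. This is not the module from the posting, whose code is not shown; the class name, the cookie name `RSID` and the round-robin choice for new sessions are all assumptions made for the example.

```python
import itertools


class CookieRouter:
    """Pick a realserver from a cookie, or assign one if the cookie is missing."""

    def __init__(self, realservers, cookie_name="RSID"):
        self.realservers = list(realservers)
        self.cookie_name = cookie_name
        self._next = itertools.cycle(self.realservers)   # round robin for new sessions

    def route(self, cookies):
        """cookies: dict of cookie name -> value taken from the request.

        Returns (realserver, set_cookie_header); the header is None when the
        client already carries a valid cookie.
        """
        chosen = cookies.get(self.cookie_name)
        if chosen in self.realservers:
            return chosen, None
        chosen = next(self._next)
        return chosen, "%s=%s; Path=/" % (self.cookie_name, chosen)
```

A cookie naming a realserver that is no longer in the list is treated like a missing cookie, so a client whose server has been removed is simply reassigned.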

19.9.2. Forwarding an httpd request based on file name (mod_proxy, mod_rewrite)

Sean, 25 Dec 2000

I need to forward requests using the Direct Routing method to a server. However I determine which server to send the request to depending on the file it has requested in the HTTP GET, not based on its load.

Michael E Brown

Use LVS to balance the load among several servers set up to reverse proxy your realservers, set up the proxy servers to load balance to realservers based upon content.

Atif Ghaffar atif (at) 4unet (dot) net

On the LVS servers you can run apache with mod_proxy compiled in, then redirect traffic with it.


        ProxyPass /files/downloads/ http://internaldownloadserver/ftp/
        ProxyPass /images/ http://internalimagesserver/images/

Proxy pass, or you can use mod_rewrite, in that case, your realservers should be reachable from the net. There is also a transparent proxy module for apache.
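As an illustration of what the two ProxyPass lines do, here is the same prefix mapping written out as a function. It is a model of the behaviour, not of Apache's implementation; the function name is made up, and the first-match-wins ordering follows the mod_proxy documentation as I understand it.

```python
# (URL prefix on the proxy, backend URL that replaces the prefix)
PROXY_RULES = [
    ("/files/downloads/", "http://internaldownloadserver/ftp/"),
    ("/images/", "http://internalimagesserver/images/"),
]


def proxy_target(path, rules=PROXY_RULES):
    """Return the backend URL for a request path, or None if no rule matches."""
    for prefix, backend in rules:
        if path.startswith(prefix):
            # keep whatever follows the prefix and append it to the backend URL
            return backend + path[len(prefix):]
    return None                     # not proxied; served locally


if __name__ == "__main__":
    print(proxy_target("/images/logo.png"))
    print(proxy_target("/files/downloads/lvs.tgz"))
    print(proxy_target("/index.html"))
```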

Yan Chan ychan (at) ureach (dot) com 19 Aug 2003

My ipvs is set up right and everything. I set the VIP's address of port 80 to forward to my Real Web Servers. I then set port 90 to another Web Server with different stuff in it. My problem is when I try to access the page for port 80, www.abc.com, the web page shows fine. In order for me to access the page in port 90, I have to type www.abc.com:90. As you can see, it doesn't look elegant. Is there a way to change it so I can make www.abc.com/ipvs equal to www.abc.com:90? Like the rewrite rule in apache? I tried using apache in the loadbalancer. But it doesn't seem to work.

Stephen Walker swalker (at) walkertek (dot) com 19 Aug 2003

This is how I set up my reverse proxy in apache:

    ProxyRequests Off
    RewriteEngine On

    ProxyPass /perl http://www.abc.com:8080/perl

    RewriteRule (^.*\.pl) http://www.abc.com:8080$1 [proxy,last]
    RewriteRule (^.*\.cgi) http://www.abc.com:8080$1 [proxy,last]

    ProxyPassReverse / http://www.abc.com:8080/
    ProxyReceiveBufferSize 49152

The ProxyPass rule says everything that goes to /perl is forwarded to port 8080, the Rewrite rules take care of scripts ending in .pl or .cgi. Obviously you need to have the rewrite and proxy modules running on your apache server.
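One detail worth checking before copying the rewrite rules: the pattern `(^.*\.pl)` is not anchored at the end, so it also matches paths that merely contain ".pl". The sketch below uses Python's `re` module as a stand-in for Apache's regex engine (an assumption; the two should agree for a pattern this simple) and compares it with an end-anchored variant, which is my suggestion and not part of the original posting.

```python
import re

PL_RULE = re.compile(r"(^.*\.pl)")            # pattern as posted
PL_RULE_ANCHORED = re.compile(r"^(.*\.pl)$")  # hypothetical stricter variant


def rewrite(rule, path, backend="http://www.abc.com:8080"):
    """Return the proxied URL if the rule matches the path, else None."""
    m = rule.search(path)
    return backend + m.group(1) if m else None


if __name__ == "__main__":
    for path in ("/scripts/test.pl", "/docs/test.plan.html"):
        print(path, rewrite(PL_RULE, path), rewrite(PL_RULE_ANCHORED, path))
```

If only real .pl scripts should go to port 8080, test the rules against a few URLs like these on your own server.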

Randy Paries 4 Feb 2004

I want to loadbalance based on file names. I want
http://www.mydomain.com/a/ to go to realserver_1
http://www.mydomain.com/b/ to go to realserver_2

Dave Lowenstein

You could try apache's mod_rewrite.


Actually I think mod_proxy would probably be the right choice. Though this could be combined with mod_rewrite. Actually, it would be pretty trivial me thinks.

Viperman Aug 23, 2003

I'm successfully using LVS with a reverse proxy configuration in apache, everything working just fine. I just faced a problem, when a user is trying to read the client IP address using PHP with the $_SERVER['REMOTE_ADDR'] variable. My RIP is showing up in place of the CIP.

Horms horms (at) verge (dot) net (dot) au 24 Aug 2003

I believe that when proxies are involved you need to check variables other than REMOTE_ADDR, which will generally be the IP address of the proxy, as this is acting as an end-user of sorts. See http://www.php.net/getenv

In any case the problem should not be caused by LVS as it does not change the source IP address nor the HTTP headers (or body).
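A common way to recover the client address behind a reverse proxy is the X-Forwarded-For request header. The sketch below assumes that the proxy adds this header (check your proxy's documentation) and that you know the addresses of your own proxies. The function name and the addresses in the example are hypothetical.

```python
def client_ip(remote_addr, headers, trusted_proxies):
    """Best guess at the real client IP for a request seen by a realserver.

    remote_addr     -- peer address of the TCP connection (REMOTE_ADDR)
    headers         -- dict of request headers
    trusted_proxies -- set of addresses of your own reverse proxies
    """
    if remote_addr not in trusted_proxies:
        return remote_addr                  # direct connection; ignore the header
    forwarded = headers.get("X-Forwarded-For", "")
    hops = [h.strip() for h in forwarded.split(",") if h.strip()]
    # walk from the nearest hop outwards, skipping our own proxies
    for hop in reversed(hops):
        if hop not in trusted_proxies:
            return hop
    return remote_addr


if __name__ == "__main__":
    proxies = {"10.0.0.5", "10.0.0.9"}
    hdrs = {"X-Forwarded-For": "203.0.113.7, 10.0.0.5"}
    print(client_ip("10.0.0.9", hdrs, proxies))
```

The header is only trusted when the connection comes from a known proxy, because a client can put anything it likes into X-Forwarded-For.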

19.9.3. Rewriting URL headers

alex (at) short (dot) net

We have 4 distinct sites, all being virtual hosted and load balanced by a single VIP. NameVirtualHost in apache picks the right content dependent on the host header. This works great for those distinct sites and their corresponding domains.

Problem is that we now have 25 domains to point to distinct site 1 and 25 domains pointing to distinct site 2. Right now the nameserver entries are www CNAME www.domain.com for all of those domains i.e. distinct sites


I want

www.a1.com -> www.a.com
www.a2.com -> www.a.com
www.b1.com -> www.b.com

I'd rather not fill my httpd conf with all these domains. I was either hoping that LVS can do some host header modifications or I'll have to make 4 VIP and each site have a distinct external ip.

Jacob Coby jcoby (at) listingbook (dot) com 19 Feb 2003

LVS sits too low to handle munging the HTTP headers. Check out ServerAlias (http://httpd.apache.org/docs-2.0/mod/core.html#serveralias), I think it will do pretty much exactly what you want it to do with a minimal of fuss. It's available in 1.3.x and 2.x, I only link to the 2.0 docs since they look better.

You could even stick the aliases in another file (a_aliases.conf or whatever) and include that into the VirtualHost section of Site A. The only real "problem" with this setup is that you have to bump apache (you can use 'apachectl graceful' if you don't have SSL) every time you add a new alias or aliases.

Magnus Nordseth magnun (at) stud (dot) ntnu (dot) no Thu, 20 Feb 2003 18:07:17 +0100

If the hosts have almost identical configuration (in apache) you can use dynamic virtual hosting

19.9.4. URL parsing


Is there any way to do URL parsing for http requests (ie send cgi-bin requests to one server group, static to another group?)

John Cronin jsc3 (at) havoc (dot) gtf (dot) org 13 Dec 2000

Probably the best way to do this is to do it in the html code itself; make all the cgis hrefs to cgi.yourdomain.com. Similarly, you can make images hrefs to image.yourdomain.com. You then set these up as additional virtual servers, in addition to your www virtual server. That is going to be a lot easier than parsing URLs; this is how they have done it at some of the places I have done consulting for; some of those places were using Extreme Networks load balancers, or Resonate, or something like that, using dozens of Sun and Linux servers, in multiple hosting facilities.


What you are after is a layer-7 switch, that is something that can inspect HTTP packets and make decisions based on that information. You can use squid to do this; there are other options. A post was made to this list about doing this a while back. Try hunting through the archives.

LVS on the other hand is a layer-4 switch; the only information that it has available to it is IP address, port and protocol (TCP/IP or UDP/IP). It cannot inspect the data segment and see, or even understand, that the request is an HTTP request, let alone that the URL requested is /cgi-bin or whatever.

There has been talk of doing this, but to be honest it is a different problem to that which LVS solves and arguably should live in user space rather than kernel space as a _lot_ more processing is required.

19.9.5. session headers

Torvald Baade Bringsvor Dec 05, 2002

We have a setup with two reverse proxies, two frontends and two backend application servers. Usually we just use LVS to switch between the proxies, and then establish a direct mapping from each of the reverse proxies onto an application server. But now I am wondering if it is possible to use LVS to switch between the two frontends and the two backends as well, in other words to cluster the frontends and backends. Regular persistence doesn't work here, because (as far as the backends are concerned) all the traffic comes from just two addresses. It would be really nice to be able to inspect the session ID (which is contained in the http header of the requests) and route the request based on that. But is it possible? Has anybody done this?

Horms 10 Mar 2003

Unfortunately LVS does not have access to the session header so this information cannot be used for load balancing.

19.9.6. Scheduling by packet content

Horms 07 May 2004

It would be nice to be able to use KTCPVS-like schedulers that make use of L7 information inside LVS, but there are major problems. In terms of TCP the problem is that LVS wants to schedule the connection when the first packet arrives. However, to get L7 information you need the three-way handshake to have completed. I guess this would be possible if LVS itself handled the three-way handshake, and then buffered up the packets in the established state until enough L7 information had been collected to schedule a given connection. But I suspect this really would be quite painful.

19.10. Timeouts for TCP/UDP connections to services

Sometime in 2008: LVS, as originally written, would time out connections between client and server, independently of whether the connection itself wanted to be timed out. The timeout was short enough that people setting up a new LVS would find their sessions disconnected, without knowing why. Expect that sometime soon the timeouts will be changed to match those in netfilter.

This is part of an off-line discussion on why the timeouts shouldn't be set to infinity.

Graeme Fowler graeme (at) graemef (dot) net 23 Dec 2008

Although it is (theoretically, at least, and I'll have to check the TCP RFC for this) possible to write an app which does the three-way handshake and then holds the connection open without exchanging any further packets for a long time (for some value of "long"), in the case of ip_vs this could result in resource starvation on highly loaded systems.

ip_vs is essentially a man in the middle (albeit a nice one) which has some knowledge about the transactions in progress at TCP level. If we set the timeout to 0/infinity, then under some conditions (say broken networks, BGP peerings dropping, or - ooh, topical - multiple undersea fibres being severed) the director will be left with a table stuffed full of "tracked" connections (I use the term carefully, noting the similarity to netfilter's conntrack modules) which never go away unless a FIN/RST turns up.

If that/those packet/s never arrive, the director will have an ever-increasing number of connections in the table. Under circumstances where there's a high turnover of connections (100,000/sec, for example, which is deliberately high for illustrative purposes) even 0.0001% of connections getting into that state would result in the following:

  100000  conns/sec
  0.0001  % never closed
     0.1  conns/sec "stalled"

   86400   secs/day
    8640   conns/day "stalled"

Assuming a well-managed, not-interfered-with director that would give us 3153600 "stalled" connections in a year of uptime. OK, so that may be far-fetched for some people *but* it's perfectly plausible in terms of embedded systems.
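The arithmetic above can be checked with a few lines of Python. This is only a sketch of the estimate in the post: the three input numbers are the poster's deliberately pessimistic illustrative values, and the function name is made up for this example.

```python
# Worked version of the "stalled connections" estimate above.
# All inputs are the illustrative numbers from the post, not measurements.
CONNS_PER_SEC = 100_000        # connection turnover on the director
NEVER_CLOSED_PERCENT = 0.0001  # percentage of connections that never see FIN/RST
SECS_PER_DAY = 86_400


def stalled_connections(days):
    """Entries left in the director's table after `days` with no timeout."""
    stalled_per_sec = CONNS_PER_SEC * NEVER_CLOSED_PERCENT / 100
    return round(stalled_per_sec * SECS_PER_DAY * days)


if __name__ == "__main__":
    print("per day: ", stalled_connections(1))
    print("per year:", stalled_connections(365))
```

With an infinite timeout nothing ever removes these entries, so the count only grows with uptime.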

And embedded systems often don't have much RAM - and that's the killer factor here. Low RAM means little space for the hash table, and if the hash table fills up we stop routing (I presume, unless it FIFOs entries out or does some other non-time-related scavenging).

I see Horms commented similarly, but without numbers... now let me look at RFC 793 and what it says...

"The timeout, if present, permits the caller to set up a timeout for all data submitted to TCP. If data is not successfully delivered to the destination within the timeout period, the TCP will abort the connection. The present global default is five minutes."

It doesn't specify upper or lower bounds, so it looks like infinity is (technically) possible. Note however that lots of firewall devices, NAT boxes and so on will drop them anyway - Cisco PIX and ASA, Checkpoint devices have a 24 hour default; netfilter's conntrack modules have a default 5 day TCP session timeout for established connections. For more netfilter goodness, see: /proc/sys/net/netfilter/ There are lots of sysctl goodies in there!

Why not, as ip_vs is linked so closely to netfilter, make use of the same sysctls?


I'm not sure how it would work in practice - perhaps some people would want to tune LVS and netfilter separately - but at the very least we could use the same default.
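To see what netfilter is currently using on a given box, the files under /proc/sys/net/netfilter/ can be read directly. A minimal sketch, assuming a kernel with the conntrack module loaded and the entry name nf_conntrack_tcp_timeout_established (the name found on current kernels; it is not mentioned in the post above). The function name is hypothetical.

```python
from pathlib import Path

# 5 days in seconds, the conntrack default for established TCP quoted above.
FIVE_DAYS = 5 * 24 * 60 * 60


def read_timeout(name, root="/proc/sys/net/netfilter"):
    """Return the sysctl value in seconds, or None if it is not available."""
    try:
        return int((Path(root) / name).read_text().split()[0])
    except (OSError, ValueError, IndexError):
        # Missing file (conntrack not loaded), unreadable, or not a number.
        return None


if __name__ == "__main__":
    # Entry name is an assumption about the running kernel.
    value = read_timeout("nf_conntrack_tcp_timeout_established")
    if value is None:
        print("not available (is conntrack loaded?)")
    else:
        print(value, "seconds; 5 days would be", FIVE_DAYS)
```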

Feb 2003: The timeout information is now in man(8) ipvsadm.

19.10.1. 2.2 kernels

Julian, 28 Jun 00

LVS uses the default timeouts for idle connections set by MASQ for (EST/FIN/UDP) of 15,2,5 mins (set in /usr/src/linux/include/net/ip_masq.h). These values are fine for ftp or http, but if you have people sitting on a LVS telnet connection, they won't like the 15min timeout. You can't read these timeout values, but you can set them with ipchains. The format is

$ipchains -M -S tcp tcpfin udp

Wensong Aug 2002

for 2.4 kernels

director:/etc/lvs# ipvsadm --set tcp tcpfin udp


$ipchains -M -S 36000 0 0

sets the timeout for established connections to 10hrs. The value "0" leaves the current timeout unchanged, in this case FIN and UDP.

19.10.2. 2.4 kernels

Julian Anastasov ja (at) ssi (dot) bg 31 Aug 2001

The timeout is set by ipvsadm. Although this feature has been around for a while, it didn't work till ipvsadm 1.20 (possibly 1.19) and ipvs-0.9.4 (thanks to Brent Cook for finding this bug).

$ipvsadm --set tcp tcpfin udp

The default timeout is 15 min for the LVS connections in established (EST) state. For any NAT-ed client connections, ask iptables.

To set the tcp timeout to 10hrs, while leaving tcpfin and udp timeouts unchanged, do

#ipvsadm --set 36000 0 0
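The three arguments are in seconds, and 0 means "leave unchanged". If you set these from a script, a small helper avoids getting the hours-to-seconds conversion wrong. This is only a sketch: the function name is made up, and it builds the command line without running it.

```python
def ipvsadm_set_command(tcp_hours=0.0, tcpfin_secs=0, udp_secs=0):
    """Build the argv for `ipvsadm --set tcp tcpfin udp` (values in seconds).

    A value of 0 leaves the corresponding timeout unchanged.
    """
    tcp_secs = int(round(tcp_hours * 3600))
    return ["ipvsadm", "--set", str(tcp_secs), str(tcpfin_secs), str(udp_secs)]


if __name__ == "__main__":
    # 10 hours for established TCP connections, tcpfin and udp untouched.
    print(" ".join(ipvsadm_set_command(tcp_hours=10)))
```

The resulting list can be handed to subprocess.run() on the director (as root) if you want the script to apply the setting itself.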

Brent Cook busterb (at) mail (dot) utexas (dot) edu 31 Aug 2001

I found the relevant code in the kernel to modify this behavior in 2.4 kernels without using ipchains. I got this info from http://www.cs.princeton.edu/~jns/security/iptables/iptables_conntrack.html In /usr/src/linux/net/ipv4/netfilter/ip_conntrack_proto_tcp.c , change TCP_CONNTRACK_TIME_WAIT to however long you need to wait before a tcp connection timeout.

Does anyone foresee a problem with other tcp connections as a result of this? A regular tcp program will probably close the connection anyway.

static unsigned long tcp_timeouts[]
= { 30 MINS,    /*      TCP_CONNTRACK_NONE,     */
    5 DAYS,     /*      TCP_CONNTRACK_ESTABLISHED,      */
    2 MINS,     /*      TCP_CONNTRACK_SYN_SENT, */
    60 SECS,    /*      TCP_CONNTRACK_SYN_RECV, */
    2 MINS,     /*      TCP_CONNTRACK_FIN_WAIT, */
    2 MINS,     /*      TCP_CONNTRACK_TIME_WAIT,        */
    10 SECS,    /*      TCP_CONNTRACK_CLOSE,    */
    60 SECS,    /*      TCP_CONNTRACK_CLOSE_WAIT,       */
    30 SECS,    /*      TCP_CONNTRACK_LAST_ACK, */
    2 MINS,     /*      TCP_CONNTRACK_LISTEN,   */
};

In the general case you cannot change the settings at the client. If you have access to the client, you can arrange for the client to send keepalive packets often enough to reset the timer above and keep the connection open.

Kyle Sparger ksparger (at) dialtone (dot) com 5 Oct 2001

You can address this from the client side by reducing the tcp keepalive transmission intervals. Under Linux, reduce it to 5 minutes:

echo 300 > /proc/sys/net/ipv4/tcp_keepalive_time

where '300' is the number of seconds. I find this useful in all manner of situations where the OS times out connections.
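Note that the sysctl above only affects sockets on which the application has turned on SO_KEEPALIVE. If you control the client code, you can instead set keepalive per connection. A minimal sketch in Python, assuming a Linux client (the TCP_KEEP* options are guarded because they don't exist on every platform); the function name and the interval and probe values are illustrative choices, only the 300 seconds comes from the post.

```python
import socket


def enable_keepalive(sock, idle_secs=300, interval_secs=60, probes=5):
    """Turn on TCP keepalive for one socket.

    idle_secs: idle time before the first probe (per-socket equivalent of
    tcp_keepalive_time). It must be shorter than the director's timeout.
    """
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # Linux-specific tuning knobs; skipped where the platform lacks them.
    if hasattr(socket, "TCP_KEEPIDLE"):
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPIDLE, idle_secs)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPINTVL, interval_secs)
        sock.setsockopt(socket.IPPROTO_TCP, socket.TCP_KEEPCNT, probes)
    return sock


if __name__ == "__main__":
    s = enable_keepalive(socket.socket(socket.AF_INET, socket.SOCK_STREAM))
    print("keepalive on:", bool(s.getsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE)))
    s.close()
```

Call enable_keepalive() on the socket before (or just after) connecting to the VIP.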

19.10.3. udp flush bug in early 2.6 kernels

Ashish Jain

The load balancer works fine the way I expected except for one thing: sometimes after a heavy load surge (500 UDP packets per sec. on the same connection), the output of ipvsadm -l -c shows active UDP connections even if there is no traffic and I stop sending any UDP packets. In ipvsadm, the default UDP connection timeout is set to 300 sec (I did not change that. I did not even use the persistent flag). I expect these connections to go away after 300 seconds. But after the 300 sec timer expires, it gets reset to 60 sec and these active UDP connections stay forever. What could be the reason? I have noticed from the ip_vs code (ip_vs_conn.c) that there are 2 functions implemented to expire a connection:

  • ip_vs_conn_expire (this one resets the timer to 60*HZ if the reference count for this connection is greater than 1 or there is an error deleting the connection from the hash table)
  • ip_vs_conn_expire_now (deletes the connection immediately)

The function called after the timer expires is ip_vs_conn_expire and not the second one. Why is this so?

Horms 27 Oct 2005

This is a bug, I believe it was fixed in

How can I turn on debugging for ip_vs?

You need to enable IP_VS_DEBUG at compile time, and then fiddle the debug proc value in /proc/sys/net/ipv4/vs at run time.

19.11. name resolution on realservers: running name resolution friendly demons on realservers

Unless the realserver is in a 3-Tier LVS, it is just sending packets from the VIP to the CIP and doesn't need name resolution.

The realservers however run services other than just the LVS services, e.g. smtp to mail logs and cron output. The smtpd should only need /etc/hosts to send mail locally. I upgraded from sendmail to postfix on one of my realservers to find that I could no longer mail to or from the upgraded machine. The problem is that postfix (http://www.postfix.org/) requires DNS for name resolution, thus requiring a nameserver. Postfix couldn't deliver mail on my realserver, as there was no resolution for the private realserver names (which have private IPs). Thus to run postfix, you need also to run DNS. This adds the security complications of punching a hole in your filter and routing rules to get to port 53 on other machines. Postfix works fine on a machine delivering mail to users on the internet and where the hostname is publicly known, but doesn't work for machines on a private network with no DNS running.

A little later, I found that you can turn off DNS lookups in postfix (see postfix faq, look for "resolv"). However this doesn't get you much - now you need to do everything via /etc/hosts. You can't use /etc/hosts for local machines and /etc/resolv.conf for the occasional machine that's not on the local network.

With (the earlier versions of) sendmail, you can have a nameserver in /etc/resolv.conf where it will only be used for hosts not in /etc/hosts. You don't want to be running postfix on a realserver just for local mail. If you run sendmail only for local mail delivery, then you only need an /etc/hosts file.

Jan 2004: just installed sendmail-8.12.10. It's been "improved" and now requires DNS for all addresses, including private addresses. It doesn't look at /etc/hosts. You can turn off DNS lookups by telling sendmail to look at /etc/service.switch (ignoring the already available /etc/host.conf and nsswitch.conf), where you can tell it to look at /etc/hosts but now it will not use DNS. This is a step backwards for sendmail.

From the TriLUG mailing list: Tanner Lovelace clubjuggler (at) gmail (dot) com 26 Apr 2007

Postfix is a mail transport agent and therefore by design does not look up A records. Instead it looks up MX records. Note that /etc/hosts does not contain MX records, so it is therefore appropriate that postfix not look there. However, it is possible to make postfix both look for A records and use /etc/hosts. This postfix config line will make postfix use /etc/hosts:

disable_dns_lookups = yes

For more information about this see this URL: http://www.postfix-jp.info/origdocs/QandA-en.html#4.10

The facilities for name resolution on Linux are problematic. I had assumed that when an application asked for name resolution, local facilities (libresolve.so?) handled the request (gethostbyname, gethostbyaddr?) using whatever resources were available (in /etc/nsswitch.conf, /etc/host.conf) and the application accepted the result without knowing how the name was resolved (e.g. whether in /etc/hosts, NIS or DNS).

This isn't what happens - the application has to do it all. If you watch ping deliver a packet to a remote machine, whose name is not in /etc/hosts,

`strace ping -c 1 remote_machine > strace.out 2>&1 `

you see ping access files in this order


and then finally connect to the dns port (53) on the first nameserver in /etc/resolv.conf.

It turns out there is no "resolving facility". The application has to work its way through all these files and then handle the name resolution itself. Ping appears to be using a "resolving facility", but only because it goes through all the files in /etc searching in the order you'd expect for a "resolving facility". Applications can ignore these files and do whatever they want. nslookup doesn't look at /etc/hosts, but goes straight to DNS. Postfix also connects directly to DNS.

I'm a little disappointed to find that every application writer has to handle name resolution themselves. For postings on the topic where other people have been similarly disabused of their ignorance on name resolution, do a google search on /etc/hosts, /etc/host.conf, /etc/resolv.conf, gethostbyname, nslookup and throw in "postfix" for more info on the postfix part of the problem, e.g. the neohapsis archives for postfix and openbsd.

For info on setting up /etc/hosts, /etc/nsswitch.conf, /etc/host.conf see network HOWTO (http://www.tldp.org/HOWTO/Net-HOWTO/).

For an MTA on realservers with private RIPs, neither postfix nor sendmail are suitable (although you may have to use them). An older version of sendmail should be fine.

From the TriLUG mailing list: Tanner Lovelace clubjuggler (at) gmail (dot) com 26 Apr 2007

nslookup is a tool that previously came with the name server (and is now deprecated in favor of dig, which is also a dns testing tool). nslookup was written specifically to test DNS resolution and therefore it is perfectly valid that it not check local files.

You have to take into account the context of what the application is looking for. The /etc/hosts file only provides names and IP addresses. Postfix, by default, isn't looking for that. It's looking for MX records. The programs nslookup, dig, and host are all tools written to test and debug the DNS system. It would be wrong for them to look in /etc/hosts, since it is not part of the DNS system. For most applications, though, that only look for IP addresses (A records) or hostnames (PTR records), looking in /etc/hosts is appropriate, and in fact, this is what the gethostbyname and gethostbyaddr library calls do. Anything that uses the gethostbyname call does follow what is in /etc/nsswitch.conf. For instance, with this line in nsswitch.conf

hosts:          files dns

If I run ping www.trilug.org and examine what files it opens I get this:

open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
open("/etc/nsswitch.conf", O_RDONLY)    = 3
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/libnss_db.so.2", O_RDONLY)   = 3
open("/lib/tls/i686/cmov/libnss_files.so.2", O_RDONLY) = 3
open("/usr/lib/libdb3.so.3", O_RDONLY)  = 3
open("/var/lib/misc/protocols.db", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/var/lib/misc/protocols.db", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/etc/protocols", O_RDONLY)        = 3
open("/etc/resolv.conf", O_RDONLY)      = 4
open("/etc/host.conf", O_RDONLY)        = 4
open("/etc/hosts", O_RDONLY)            = 4
open("/etc/ld.so.cache", O_RDONLY)      = 4
open("/lib/tls/i686/cmov/libnss_dns.so.2", O_RDONLY) = 4
open("/lib/tls/i686/cmov/libresolv.so.2", O_RDONLY) = 4

Note that it does go to /etc/hosts first, as specified by nsswitch.conf. If I then change the line in nsswitch.conf to be this instead:

hosts:          dns

and rerun the same test I get this:

open("/etc/ld.so.preload", O_RDONLY)    = -1 ENOENT (No such file or directory)
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/tls/i686/cmov/libc.so.6", O_RDONLY) = 3
open("/etc/nsswitch.conf", O_RDONLY)    = 3
open("/etc/ld.so.cache", O_RDONLY)      = 3
open("/lib/libnss_db.so.2", O_RDONLY)   = 3
open("/lib/tls/i686/cmov/libnss_files.so.2", O_RDONLY) = 3
open("/usr/lib/libdb3.so.3", O_RDONLY)  = 3
open("/var/lib/misc/protocols.db", O_RDWR|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/var/lib/misc/protocols.db", O_RDONLY|O_LARGEFILE) = -1 ENOENT (No such file or directory)
open("/etc/protocols", O_RDONLY)        = 3
open("/etc/resolv.conf", O_RDONLY)      = 4
open("/etc/ld.so.cache", O_RDONLY)      = 4
open("/lib/tls/i686/cmov/libnss_dns.so.2", O_RDONLY) = 4
open("/lib/tls/i686/cmov/libresolv.so.2", O_RDONLY) = 4
open("/etc/host.conf", O_RDONLY)        = 4

Note that it does not look in /etc/hosts. It isn't ping that's searching these files, it's gethostbyname in the C library calling into libnss_*. The libnss libraries are the resolver.

I agree that there should be tools other than DNS debugging tools. Kevin suggested probably the best one:

% getent hosts {hostname}

This will correctly use the linux name resolving functions and follow what has been set up in nsswitch.conf.
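The same check can be made from a scripting language, since most of them call the C library resolver rather than talking to DNS themselves. A minimal sketch in Python: socket.getaddrinfo() goes through the C library, so on Linux it should honor the hosts: line in nsswitch.conf in the same way getent does. The function name is made up for this example.

```python
import socket
import sys


def resolve(name):
    """Resolve `name` through the C library resolver; return sorted addresses."""
    infos = socket.getaddrinfo(name, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address as a string.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    # usage: python resolve.py realserver1 www.trilug.org ...
    for name in sys.argv[1:]:
        try:
            print(name, " ".join(resolve(name)))
        except socket.gaierror as err:
            print(name, "not resolvable:", err)
```

A private realserver name that is only in /etc/hosts should resolve here with "hosts: files dns" and fail with "hosts: dns", matching the strace output above.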

TriLUG mailing list: jason (at) monsterjam (dot) org 26 Apr 2007

A very easy way to do what you want is to install dnsmasq. It will allow you to treat your /etc/hosts as dns entries to your server AND clients. I think the -b flag will do what you want.

release 0.991 Added -b flag: when set causes dnsmasq to always answer
              reverse queries on the RFC 1918 private IP space itself and
              never forward them to an upstream server. If the name is not in
              /etc/hosts, dnsmasq replies with the dotted-quad address.

19.12. Debugging new services

At some stage you'll be trying to LVS a service that works just fine when you connect directly to the realserver, but doesn't work when you connect to the same service through the director. There is something about the service that you've taken for granted or may not even be aware of, and that assumption doesn't hold when the two-way tcpip connection is spread amongst 3 machines (adding the director). It will probably be because the service

  • uses multiple ports. The multiport problem can be solved by persistence to port 0. This isn't a particularly subtle approach, but will at least get your service working.
  • requires multiple rounds of tcpip connections. This can be solved by persistence to the service's port, when all connections will go to the same realserver.
  • writes to the realserver (see Filesystems for realserver content).
  • does something we don't know about, or you've just plain messed up.

In this case you'll need brute force.

  • run tcpdump on the client and server (i.e. without a director, and not using an LVS) to see the packet exchanges when the service is working. Then connect up the LVS (make sure that it's working by testing say telnet as an LVS'ed service), then run tcpdump on the client, director (both NICs if a 2 NIC director) and the realservers. This will be tedious.
  • If you know the service's protocol (test with just the client and server, i.e. with no LVS), you can work your way through the connection with phatcat. For example sessions of phatcat with a 2 port protocol, see the section on ftp.

19.13. "broken" services: servlets and j2ee

Here, Ratz is replying to a poster about his problems LVS'ing a website with servlets. The servlets are writing content to different realservers from the same client. This is normally handled by persistence.

Roberto Nibali ratz (at) tac (dot) ch 10 Jul 2003

You have a broken :) service which you would like to load balance with persistence.


How do you handle broken services? How would you design the service not to be broken?

I don't know how I shall answer this question. I hope his service is not broken. But how do you understand his problem? It should be solved by setting up port 0 service, right?

Sometimes things have to be done in a more complex way. The only thing that they shouldn't do is migrate sessions within the realserver pool for CPU load sharing, as is not uncommon for servlet technology. If they do this, they need to have a common pool for allocated resources (and thus have the process migration overhead) which then again defeats the purpose of inter-node CPU load balancing at first hand. But we do not know what exactly he's trying to come up with.

Another possibility would be to set up persistent fwmark pools consisting of a mark for incoming to service:80 and the same mark for incoming to service:defined_portrange. With the System.Properties in JDK you can set the port range which will be allocated and thus you can pretty much restrict the service fwmark pool. You then of course load balance the fwmark pool. I only told him to use port 0 because he doesn't know the application, so in all likelihood he wouldn't know the dynamically opened ports of this application either and therefore would not be able to restrict it accordingly.

Then again he could do something like:

iptables -t mangle -A PREROUTING -j MARK --set-mark 1 -p tcp -d ${VIP}/32 --dport 80
iptables -t mangle -A PREROUTING -j MARK --set-mark 1 -p tcp -d ${VIP}/32 --dport 1023:61000
ipvsadm -A -f 1 -s wlc -p 300 -M
ipvsadm -a -f 1 -r ${RIP1} -g -w 1
ipvsadm -a -f 1 -r ${RIP2} -g -w 1

Joao Clemente jpcl (at) rnl (dot) ist (dot) utl (dot) pt 24 Jun 2003

I've been talking with the developer of the j2ee app I'm trying to cluster, and I guess I'll have a hard time with this feature:

After a user interaction with the web server, the user will download an applet. That applet will communicate with a service (on a well-known port) that was started on the server (at that time).

So, I see this problem coming:

At instance1:
Node1 gets the user http request, returns the html page, the applet, and
creates the service. Node2 knows nothing about this.

At instance2:
Applet connects to port xxx, gets round-robin to node2... bad

No matter what persistency rules I setup here, as I have 2 different ports (80 and xxx) I see no way to say "when user interacts with server, set persistency rule for yyy time that maps user:80 to node1:80 AND ALSO user:whatever to node1:xxx"

Besides that, I also have another question: That service that is listening on the server node will then give the connection to another instance, that will control the connection from there on (there is a pool of instances waiting to take over). Will lvs route those connections, that it doesn't even know of? I'm not sure, but this mechanism seems something similar to a passive-ftp connection... Maybe someone knows an lvs-friendly tip to make things work. Btw, this applet and the connection are used to allow server->browser communication without using http refresh/polling.

19.14. http logs, error logs

The logs from the various realservers need to be merged.

From the postings below, at least in the period 2001-3, using a common nfs filesystem doesn't work and no-one knows whether this is a locking problem fixable by NFS-v3.0 or not. The way to go seems to be mergelog.

Emmanuel Anne emanne (at) absysteme (dot) fr

..the problem about the logs. Apparently the best is to have each web server process its log file on a local disk, and then to make stats on all the files for the same period... It can become quite complex to handle; is there not a way to have only one log file for all the servers?

Joe - (this is quite old i.e. 2000 or older and hasn't been tested).

log to a common nfs mounted disk? I don't know whether you can have httpds running on separate machines writing to the same file. I notice (using truss on Solaris) that apache does write locking on files while it is running. Possibly it write-locks the log files. Normally multiple forked httpds are running. Presumably each of them writes to the log files and presumably each of them locks the log files for writing.

Webstream Technical Support mmusgrove (at) webstream (dot) net 18 May 2001

I've got 1 host and 2 realservers running apache(ver 1.3.12-25). The 2nd server NFS exports a directory called /logs. The 1st acts as a client and mounts that drive. I have apache on the 1st card saving the access_log file for each site into that directory as access1.log. The 2nd server saves it as access2.log in the same directory. Our stats program on another server looks for *.log files in that directory. The problem is that whenever I access a site (basically browse through all the pages of a site), the 2nd card adds the access info into the access2.log file and everything is fine. The 1st card saves it to the access1.log file for a few seconds, then all of a sudden the file size goes down to 0 and its empty.

Alois Treindl alois (at) astro (dot) ch

I am running a similar system, but with Linux 2.4.4 which has NFS version 3, which is supposed to have safe locking. Earlier NFS versions are said to have buggy file locking, and as Apache must lock the access_log for each entry, this might be the cause of your problem.

I have chosen not to use a shared access_log between the realservers, i.e. not sharing it via NFS. I share the documents directory and a lot else via NFS between all realservers, but not the logfiles.

I use remote syslog logging to collect all access logs on one server.

  1. On server w1, which holds the collective access_log and error_log, I have in /etc/syslog.conf the entry:

    local0.=info /var/log/httpd/access_log
    local0.err   /var/log/httpd/error_log
  2. on all other servers, I have an entry which sends the messages to w1:

    local0.info     @w1
    local0.err      @w1
  3. On all servers, I have in httpd.conf the entry:

    CustomLog "|/usr/local/bin/http_logger" common
  4. and the utility http_logger, which sends the log messages to w1, contains:

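    # script: logger
    # Reads Apache log lines on stdin and forwards each one to syslog.
    use Sys::Syslog;
    $SERVER_NAME = 'w1';
    $FACILITY = 'local0';
    $PRIORITY = 'info';
    openlog ($SERVER_NAME,'ndelay',$FACILITY);
    while (<>) {
        # The loop body is missing from the posting; what follows is an
        # assumed reconstruction: log each input line at $PRIORITY.
        syslog($PRIORITY, '%s', $_);
    }
    closelog();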
  5. I also do error_log logging to the same central server. This is even easier, because Apache allows you to configure in httpd.conf:

    ErrorLog syslog:local0

On all realservers, except w1, these log entries are sent to w1 by the syslog.conf given above.

I think it is superior to using NFS. The access_log entries of course contain additional fields in front of the Apache log lines, which originate from the syslogd daemon.

It is also essential that the realservers are well synchronized, so that the log entries appear in correct timestamp sequence.

I have a shared directory setup and both Real Servers have their own access_log files that are put into that directory (access1.log and access2.log... I do it this way so the Stats server can grab both files and only use 1 license), so I don't think it's a file locking issue at all. Each apache server is writing to its own separate access log file, it's just that they happen to be in the same shared directory. How would the httpd daemon on server A know to LOCK the access log from server B?


Why do you think it is NOT a file locking problem? On each realserver, you have a lot of httpd daemons running, and to write into the same file without interfering, they will have to use file locking, to get exclusive access. On each server, you do not have just one httpd daemon, but many forked copies. All these processes on ONE server need to write to the SAME logfile. For this shared write access, they use file locking.

If this file sits on an NFS server, and NFS file locking is buggy (which I only know as rumor, not as experience), then it might well be the cause of your problem.

Why don't you keep your access_log local on each server, and rotate them frequently, to collect them on one server (merge-sorted by date/time), and then use your Stats server on it?

If you use separate log files anyway, I cannot see the need to create them on NFS. Nothing prevents you from rotating them every 6 hours, and you will probably not need more current stats.

So the log files HAVE to be on a local disk or else one may run into such a problem as I am having now?


I don't know. I have only read that NFS file locking before NFS 3.0 is broken. It is not a problem related to LVS. You may want to read http://httpd.apache.org/docs/mod/core.html#lockfile

Thanks, but I've seen that before. Each server saves that lock file to its own local directory.

Anyone have a quick and dirty script to merge-sort by date/time the combined apache logs?

Martin Hierling mad (at) cc (dot) fh-lippe (dot) de

try mergelog


assuming that all files contain only entries from the same month, I think you can try:

sort -m -k 4 file1 file2 file3 ...
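The sort -m approach compares the timestamp field as text, hence the restriction to logs from a single month. A small script that parses the timestamp lifts that restriction. The following is a sketch only (it is not how mergelog works internally): it assumes Apache common/combined log format with a [dd/Mon/yyyy:hh:mm:ss +zzzz] timestamp, that each input file is already in time order (as with sort -m), and that the script runs in the default C locale so English month names parse. The function names are made up.

```python
import heapq
import re
import sys
from datetime import datetime

# Timestamp as written by Apache's common log format,
# e.g. [13/Jul/2001:10:15:32 +0200]
TIMESTAMP = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2} [+-]\d{4})\]")


def entry_time(line):
    """Return the timezone-aware timestamp of one access_log line."""
    match = TIMESTAMP.search(line)
    if match is None:
        raise ValueError("no timestamp in line: %r" % line)
    return datetime.strptime(match.group(1), "%d/%b/%Y:%H:%M:%S %z")


def merge_logs(*streams):
    """Lazily merge already-sorted log streams into one time-ordered stream."""
    return heapq.merge(*streams, key=entry_time)


if __name__ == "__main__":
    # usage: python merge_access.py access1.log access2.log > access.log
    files = [open(path, errors="replace") for path in sys.argv[1:]]
    sys.stdout.writelines(merge_logs(*files))
```

Because heapq.merge() only holds one line per input file in memory, log size should not be an issue; speed is another matter, and the postings below suggest mergelog is hard to beat there.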

Arnaud Brugnon arnaud (dot) brugnon (at) 24pmteam (dot) com

We successfully use mergelog (you can find it on freshmeat or SourceForge) for merging logs (gz or not) from our cluster nodes. We use a simple perl script for downloading them to a single machine.

Juri Haberland list-linux.lvs.users (at) spoiled (dot) org Jul 13 2001

I'm looking for a script that merges and sorts the access_log files of my three realservers running apache. The logs can be up to 500MB if combined.

Michael Stiller ms (at) 2scale (dot) net Jul 13 2001

You want to look at mod_log_spread

Stuart Fox stuart (at) fotango (dot) com

cat one log to the end of the other then run

sort -t - -k 3 ${WHEREVER}/access.log > new.log

then you can run webalizer on it.

That's what I use; it doesn't take more than about 30 seconds. If you can copy the logs from your realservers to another box and run sort there, it seems to be better.

Heck, here's the whole (sanitized) script


## Set constants

DATE=`date "+%d-%b-%Y"`
YESTERDAY=`date --date="1 day ago" "+%d-%b-%Y"`

## First(1) Remove the tar files left yesterday
find ${ROOT} -name "*.tar.bz2" |xargs -r rm -v

## First get the access logs
## Make sure some_account has read-only access to the logs

su - some_account -c "$SSH some_account@real.server1 \"cat /usr/local/apache/logs/access.log\" >  ${ROOT}/logs/$DATE.log"
su - some_account -c "$SSH some_account@real.server2 \"cat /usr/local/apache/logs/access.log\" >> ${ROOT}/logs/$DATE.log"

## Second sort the contents in date order

sort -t - -k 3 ${ROOT}/logs/$DATE.log > ${ROOT}/logs/access.log

## Third run webalizer on the sorted files
## Just set webalizer to dump the files in ${ROOT}/logs

/usr/local/bin/webalizer -c /usr/stats/conf/webalizer.conf

## Fourth remove all the crud
## You still got the originals on the realservers

find ${ROOT} -name "*.log"|xargs -r rm -v

## Fifth tar up all the files for transport to somewhere else

cd ${ROOT}/logs && tar cfI ${DATE}.tar.bz2 *.png *.tab *.html && \
    chown some_account.some_account ${DATE}.tar.bz2

Stuart Fox stuart (at) fotango (dot) com

Ok scrub my last post, i just tested mergelog. On a 2 x 400mb log it took 40 seconds, my script did it in 245 seconds.

Juri Haberland list-linux.lvs.users (at) spoiled (dot) org

Ok, thanks to you all very much! That was quick and successful :-)

I tried mergelog, but I had some difficulties to compile it on Solaris 2.7 until I found that I was missing GNU make...

But now: Happy happy, joy joy!

karkoma abambala (at) genasys (dot) es

Another possibility... http://www.xach.com/multisort/

Stuart Fox stuart (at) fotango (dot) com

mergelog seems to be 33% faster than multisort using exactly the same file

Julien 7 Jan 2003

Does s/b know a way to merge apache error logs? Mergelog and Multisort only merge Access logs.

Jacob Coby 1 Dec 2003

cat error_log* >error_log.all
 or to sort by date
cat error_log* | sort -r > error_log.all

ratz 01 Dec 2003

This will not sort entries by date. Imagine following two (fictive, but syntactically and semantically correct) error_logs:

# cat error_log.1
[Thu Dec 4 03:47:24 2002] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations
[Thu Mar 27 03:47:24 2003] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations
# cat error_log.2
[Sun Jan 19 06:57:41 2003] [error] [client] File does not
[Thu Apr 20 03:47:24 2003] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations

Your pipeline does not sort them in a correct way (entries by date) at all. IMHO it's not so easy to script ;).

# cat error_log.* | sort -r
[Thu Mar 27 03:47:24 2003] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations
[Thu Dec 4 03:47:24 2002] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations
[Thu Apr 20 03:47:24 2003] [notice] Apache/1.3.27 (Unix) mod_jk/1.2.2
mod_perl/1.27 PHP/4.3.1 mod_ssl/2.8.14 OpenSSL/0.9.7a configured --
resuming normal operations
[Sun Jan 19 06:57:41 2003] [error] [client] File does not

To me the best solution is still either to write all error_logs into the same file or to configure httpd.conf so that the logs are sent via the syslog() interface. Then you use syslog-ng to do all the needed logic, data handling, merging, correlation and event triggering.


If you're trying to stitch error logs from separate sources, then yes, it becomes less trivial, and you're better off going with a scripting language to do the stitching.
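
A minimal sketch (not from the thread) of what such a stitching script might look like in Python. It assumes the Apache 1.3/2.0 error_log stamp format shown above, "[Thu Dec 4 03:47:24 2002]", treats any line without a stamp as a continuation of the previous entry, and ignores the weekday. The function names are made up for illustration.

```python
import re
import sys
from datetime import datetime

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Leading stamp of an error_log entry, e.g. "[Thu Dec 4 03:47:24 2002]".
STAMP = re.compile(
    r"^\[\w{3} (%s) +(\d{1,2}) (\d{2}):(\d{2}):(\d{2}) (\d{4})\]" % "|".join(MONTHS))


def parse_entries(lines):
    """Group lines into (timestamp, [lines]) entries.

    A line that does not start with a stamp is appended to the entry
    before it; unstamped lines before the first entry are ignored.
    """
    entries = []
    for line in lines:
        line = line.rstrip("\n")
        m = STAMP.match(line)
        if m:
            mon, day, hh, mm, ss, year = m.groups()
            when = datetime(int(year), MONTHS[mon], int(day),
                            int(hh), int(mm), int(ss))
            entries.append((when, [line]))
        elif entries:
            entries[-1][1].append(line)
    return entries


def merge_error_logs(*logs):
    """Merge several logs (each an iterable of lines) into date order."""
    entries = []
    for log in logs:
        entries.extend(parse_entries(log))
    entries.sort(key=lambda entry: entry[0])  # stable: ties keep input order
    return [line for _, block in entries for line in block]


if __name__ == "__main__":
    logs = []
    for path in sys.argv[1:]:
        with open(path, errors="replace") as f:
            logs.append(f.readlines())
    for line in merge_error_logs(*logs):
        print(line)
```

Saved under a name of your choice (say merge_errlog.py), it would be run as "python3 merge_errlog.py error_log.1 error_log.2 > error_log.all". Sorting whole entries rather than lines keeps the wrapped "resuming normal operations" lines attached to their stamp.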

Guy Waugh

Would this work to sort the error logs...

cat error_log.* | sort -o sorted-error_log -k 2M -k 3n

(use -r if you want the order reversed) I don't understand why, but when I do this, it sorts on the fourth field as well (the time)...

ratz 02 Dec 2003

It doesn't work with my sort or (more correctly) with my LC_TIME settings. If you want to sort with the '... -k xM ...' you need an appropriate LC_TIME entry or it will not work. A possible one is:

LC_TIME="%a %b %w %H:%M:%S %Y"

But this must be written by hand according to locale(5) and then compiled with localedef(1). Lucky you, if you have a charmap which matches the apache log files output ;). Also read the info page on sort to see the difference between '-k 3n' and '-k 3,3n'.

`-k POS1[,POS2]'
      Specify a sort field that consists of the part of the line between
      POS1 and POS2 (or the end of the line, if POS2 is omitted),
      _inclusive_.  Fields and character positions are numbered starting
      with 1.  So to sort on the second field, you'd use `--key=2,2'
      (`-k 2,2').  See below for more examples.

laurie.baker (at) bt (dot) com 08 Jan 2003

Take a look at logresolvemerge.pl. While I currently only use it myself for access logs, I believe you can configure it for whatever you require.

Joe: the people on the Beowulf mailing list use Swatch to parse logs collected on a centralised logserver.

Mikkel Kruse Johnsen mkj (dot) its (at) cbs (dot) dk 19 May 2003

You can use spreadlogd for logging the activity, so that all your web frontends send their logs to one server. www.spread.org, http://www.backhand.org/mod_log_spread/.

there are tools for migrating logs under: http://awstats.sourceforge.net/docs/awstats_tools.html

Joe Stump joe (at) joestump (dot) net 22 Dec 2005

I know of two ways to merge logs...

  • A centralized logging server (i.e. syslog). Have Apache log to that and then parse the logs from there.
  • Use rsync or scp to sync the logs to a central server, cat them into one larger file and then parse them from there.

I'm currently doing #2, but plan on moving to #1 pretty soon. The reason for the change is that it's a lot more streamlined than my current setup and it keeps logs in a single location instead of N locations where N is the number of nodes.

I currently use NFS for pretty much everything. All of my DocRoot's and configuration files (apache/php) are on the NFS server. I then aggregate the logs onto the NFS server and compile them with awstats on another server (don't ask).
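
A plain cat leaves the combined file out of time order. Since each realserver's access log is already in time order, the files can be merged rather than re-sorted, which is roughly what tools like mergelog do. A minimal Python sketch, assuming the common/combined log format with a "[22/Dec/2005:11:48:09 -0800]" timestamp on every line; the function names are invented for this example.

```python
import heapq
import re
from datetime import datetime, timedelta, timezone

MONTHS = {name: number for number, name in enumerate(
    "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1)}

# Timestamp field of the common/combined log format.
STAMP = re.compile(
    r"\[(\d{2})/(%s)/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})\]"
    % "|".join(MONTHS))


def stamp(line):
    """Return the line's timestamp as a timezone-aware datetime."""
    m = STAMP.search(line)
    if m is None:
        raise ValueError("no access-log timestamp in line: %r" % line)
    day, mon, year, hh, mm, ss, sign, off_h, off_m = m.groups()
    offset = timedelta(hours=int(off_h), minutes=int(off_m))
    if sign == "-":
        offset = -offset
    return datetime(int(year), MONTHS[mon], int(day),
                    int(hh), int(mm), int(ss), tzinfo=timezone(offset))


def merge_access_logs(*logs):
    """Lazily merge logs that are each already sorted by time.

    heapq.merge holds only one pending line per input, so memory use
    stays small even for very large files.
    """
    return heapq.merge(*logs, key=stamp)
```

Open file objects can be passed in directly, e.g. merge_access_logs(open("web1.log"), open("web2.log")), and the result written out line by line (the file names here are placeholders).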

Dan Trainor dan (at) id-confirm (dot) com 22 Dec 2005

I have a few very high traffic sites, and I found that it would sometimes take AWStats so long to read the logs in one pass that I set it up to rotate and parse logs up to six times a day. I would imagine that, with a heavy LVS setup with many realservers, you may face the same problem.

Perhaps the awstats author at some point will create a tool which will merge all the gathered data into one single file or database. This way, the realservers could process their own logs, so you would not put all the load on one main processing server, and would not face the problem I mentioned above.

Todd Lyons tlyons (at) ivenue (dot) com 22 Dec 2005

You tell syslog or syslog-ng to log to a remote network source instead of or in addition to a local file on each of the real servers, then on the central logging server configure it to listen for incoming network log info and tell it where to put it.

Here's a syslog-ng master server config:

options { 

        # The default action of syslog-ng 1.6.0 is to log a STATS line
        # to the file every 10 minutes.  That's pretty ugly after a
        # while.
        # Change it to every 12 hours so you get a nice daily update of
        # how many messages syslog-ng missed (0).

source src { unix-stream("/dev/log"); internal(); pipe("/proc/kmsg"); };
source net { udp(); };

filter f_authpriv { facility(auth, authpriv); };
filter f_cron { facility(cron); };
filter f_ldap { facility(local4); };
filter f_mail { facility(mail); };
filter f_messages { level(info .. warn)
        and not facility(auth, authpriv, cron, mail, local4); };

destination authlog { file("/var/log/auth.log"); };
destination cron { file("/var/log/cron"); };
destination ldap_net { file("/disk1/log/slapd.log"); };
destination mail_net { file("/disk1/log/maillog"); };
destination mail { file("/var/log/maillog"); };
destination messages { file("/var/log/messages"); };

# By default messages are logged to tty12...
destination console_all { file("/dev/tty12"); };
# ...if you intend to use /dev/console for programs like xconsole
# you can comment out the destination line above that references
# /dev/tty12
# and uncomment the line below.
#destination console_all { file("/dev/console"); };

log { source(src); filter(f_authpriv); destination(authlog); };
log { source(src); filter(f_cron); destination(cron); };
log { source(net); filter(f_ldap); destination(ldap_net); };
log { source(src); filter(f_mail); destination(mail); };
log { source(net); filter(f_mail); destination(mail_net); };

log { source(src); filter(f_messages); destination(messages); };
log { source(src); destination(console_all); };

Here's a client that logs maillog locally and to a remote syslog server:
options { 

source src { unix-stream("/dev/log"); internal(); pipe("/proc/kmsg"); };

filter f_authpriv { facility(auth, authpriv); };
filter f_cron { facility(cron); };
filter f_mail { facility(mail); };
filter f_messages { level(info .. warn)
        and not facility(auth, authpriv, cron, mail); };
filter f_monitoring { not match("(did not issue)|("); };

destination authlog { file("/var/log/auth.log"); };
destination cron { file("/var/log/cron"); };
destination mail { file("/var/log/maillog"); udp(""); };
destination messages { file("/var/log/messages"); };

# By default messages are logged to tty12...
destination console_all { file("/dev/tty12"); };

log { source(src); filter(f_authpriv); destination(authlog); };
log { source(src); filter(f_cron); destination(cron); };
log { source(src); filter(f_mail); filter(f_monitoring);
destination(mail); };

log { source(src); filter(f_messages); destination(messages); };
log { source(src); destination(console_all); };

Here's a client that logs ldap only to a remote syslog server:
options { 

source src { unix-stream("/dev/log"); internal(); pipe("/proc/kmsg"); };

filter f_authpriv { facility(auth, authpriv); };
filter f_cron { facility(cron); };
filter f_ldap { facility(local4); };
filter f_mail { facility(mail); };
filter f_messages { level(info .. warn)
        and not facility(auth, authpriv, cron, mail, local4); };

destination authlog { file("/var/log/auth.log"); };
destination cron { file("/var/log/cron"); };
destination ldap { udp(""); };
destination mail { file("/var/log/maillog"); };
destination messages { file("/var/log/messages"); };

# By default messages are logged to tty12...
destination console_all { file("/dev/tty12"); };

log { source(src); filter(f_authpriv); destination(authlog); };
log { source(src); filter(f_cron); destination(cron); };
log { source(src); filter(f_ldap); destination(ldap); };
log { source(src); filter(f_mail); destination(mail); };

log { source(src); filter(f_messages); destination(messages); };
log { source(src); destination(console_all); };

Tomas Ruprich xruprich (at) akela (dot) mendelu (dot) cz 22 Dec 2005

Well, I implemented something like this about a month ago... In the apache configuration file on each application server I have this line:

CustomLog "|/usr/bin/logger -t cluster_access_log" combined env=!dontlog

and on the log server I have syslog-ng installed, with these configuration lines:

destination d_cluster_access_log { file("/var/log/httpd/all_clusters_log"); };
filter f_cluster_access_log { match("cluster_access_log"); };
source s_net { udp(); };
log { source(s_net); filter(f_cluster_access_log); destination(d_cluster_access_log); };

awstats is a very good idea; I use it too.

Here's the syslog configuration on the application servers. I think it's quite simple, but for completeness... In /etc/syslog.conf:

*.*                                              @<syslog_server_IP>
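
One practical wrinkle with this approach: the file written by the log server carries a syslog prefix in front of each Apache line, which a log analyser expecting the plain combined format may not accept. A minimal Python sketch for stripping it, assuming the usual "Mon DD HH:MM:SS host tag: message" layout (check what your syslog-ng template really writes); the function names are hypothetical.

```python
import re


def make_prefix_re(tag):
    """Regex for 'Dec 22 10:00:00 host tag: rest' or 'tag[pid]: rest'."""
    return re.compile(
        r"^\w{3} [ \d]\d \d{2}:\d{2}:\d{2} (\S+) %s(?:\[\d+\])?: (.*)$"
        % re.escape(tag))


def extract_access_lines(lines, tag="cluster_access_log"):
    """Yield (host, apache_line) for every line carrying the given tag.

    Lines from other programs (different tag) are skipped.
    """
    prefix = make_prefix_re(tag)
    for line in lines:
        m = prefix.match(line.rstrip("\n"))
        if m:
            yield m.group(1), m.group(2)
```

The host field tells you which realserver served the hit, which the Apache line itself does not record.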

Lemaire, Olivier olivier (dot) lemaire (at) siemens (dot) com 22 Dec 2005

Mergelog is your friend (http://mergelog.sourceforge.net/), after rsyncing your files to a larger server. A centralised logging server is probably overkill unless you need your logs up-to-date at the last second.

Graeme Fowler graeme (at) graemef (dot) net 22 Dec 2005

mod_log_mysql might help you out too: http://bitbrook.de/software/mod_log_mysql/

...or its Apache 1 cousin, mod_log_sql: http://www.outoforder.cc/projects/apache/mod_log_sql/

Note however that for a big and/or very busy cluster you need to be very, very careful with your database design and the setup of your servers. At work a colleague recently ran this up across 40 Apache servers and knocked the ass out of the MySQL logging server, jamming it up with 1000 persistent client connections. That was bad operational design on our part, but still something worth remembering.

Performance-wise it seems to do well as all the queries are inserts, and it's obviously possible to make use of MySQL table replication to amalgamate several collected tables onto one host for post-processing.

As a theory it's definitely got legs, we just have to find out how many in practice now!


When sending logs to a central server, are there any problems with streams becoming intermixed, so that you get nonsense?

Todd Lyons tlyons (at) ivenue (dot) com 23 Dec 2005

No one host's line will be interrupted by another host's line. The lines from different hosts will be intermingled, but they will be complete. Here's an example:

Dec 23 11:48:09 smtp2 sm-mta[7171]: jBNJm7WM007171: Milter add: header: X-Spam-Status: Yes, hits=30.5 required=5.0 tests=BAYES_99,DRUGS_ERECTILE,\n\tDRUGS_MALEDYSFUNCTION,FORGED_RCVD_HELO,HTML_30_40,HTML_MESSAGE,\n\tHTML_MIME_NO_HTML_TAG,MIME_HEADER_CTYPE_ONLY,MIME_HTML_ONLY,\n\tNO_REAL_NAME,URIBL_AB_SURBL,URIBL_JP_SURBL,URIBL_OB_SURBL,URIBL_SBL,\n\tURIBL_SC_SURBL,URIBL_WS_SURBL autolearn=no version=3.1.0-gr0
Dec 23 11:48:09 smtp2 sm-mta[7171]: jBNJm7WM007171: Milter: data, reject=550 5.7.1 Blocked by SpamAssassin
Dec 23 11:48:09 smtp2 sm-mta[7171]: jBNJm7WM007171: to=<dianag@domain.com>, delay=00:00:01, pri=30500, stat=Blocked by SpamAssassin
Dec 23 11:48:09 smtp1 sm-mta[10021]: jBNJm8o0010021: from=<qylyrxy@domain2.com>,size=0, class=0, nrcpts=0, proto=SMTP, daemon=MTA, relay=[]
Dec 23 11:48:09 smtp2 spamd[26407]: prefork: child states: IIIII
Dec 23 11:48:09 smtp1 sm-mta[9872]: jBNJlQPs009872: <asing@domain3.com>... User unknown
Dec 23 11:48:09 smtp1 sm-mta[10010]: jBNJm3SU010010: ruleset=check_rcpt, arg1=<admin@domain4.com>, relay=24-176-185-20.dhcp.reno.nv.charter.com [] (may be forged), reject=550 5.7.1 <admin@domain5.com>... Relaying denied. IP name possibly forged []
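
The reason lines stay whole is that, with UDP transport, each syslog message travels as one datagram, and a datagram is delivered in one piece or not at all. A small Python sketch that shows this on the loopback interface; the hostnames, tags and the fixed timestamp are made up for the demonstration, and no real syslog daemon is involved.

```python
import socket


def send_syslog(sock, addr, host, tag, text, pri=22):
    """Send one BSD-syslog-style message as a single UDP datagram.

    pri 22 = facility mail (2) * 8 + severity info (6).
    """
    msg = "<%d>Dec 23 11:48:09 %s %s: %s" % (pri, host, tag, text)
    sock.sendto(msg.encode("ascii"), addr)


def demo():
    """Interleave messages from two 'hosts' and read them back."""
    receiver = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    receiver.bind(("127.0.0.1", 0))   # port 0: let the OS pick a free port
    receiver.settimeout(2.0)          # do not hang if a datagram is lost
    addr = receiver.getsockname()
    senders = {name: socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
               for name in ("smtp1", "smtp2")}
    try:
        for i in range(3):
            for name, sock in senders.items():
                send_syslog(sock, addr, name, "sm-mta[%d]" % i,
                            "message %d from %s" % (i, name))
        # Each recvfrom() returns exactly one datagram, i.e. one whole line.
        return [receiver.recvfrom(2048)[0].decode("ascii") for _ in range(6)]
    finally:
        receiver.close()
        for sock in senders.values():
            sock.close()
```

The flip side is that UDP gives no delivery guarantee: under heavy load a whole line can be dropped, but never half of one.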

If you have a central log server, what do you do if it dies? How do you fail over this solution?

Joe Stump

Well you could always load balance your log traffic on an LVS setup with redundant NetApp's, but really why on God's green Earth would you do that? It's log traffic. I've never heard of a place where log traffic ever justified redundant servers.

Johan van den Berg vdberj (at) unisa (dot) ac (dot) za 11 Jan 2006

The following works like a charm. I have 4 nodes, 1 fileserver, and 1 lvs machine... The nodes are high usage, and log everything to syslog using the "| logger..." syntax in apache, and syslog forwards to @fileserver. The fileserver uses the - option before the filename so that the access log is only synced to disk when needed, so that the access logs don't cause too much filesystem access on the fileserver. I am concerned, though: every once in a while I've seen a line or two go missing when I push too much into syslog at one time.

Graeme Fowler graeme (at) graemef (dot) net 03 May 2006 (in reply to a query about logging from a two realserver LVS).

Use mod_log_spread (http://www.backhand.org/mod_log_spread/). This makes use of the multicast spread toolkit to allow you to log messages to remote servers. The mechanics of it I leave to you as they aren't hugely simple.

Alternatively, make Apache log to a remote syslog host which combines the logs for you. This could easily be *both* of your realservers logging to each other, and again I leave the mechanics of it to you. Note that this will not scale up or out very far, but for a two-node solution it's perfect IMO. I'm saying that having each realserver act as a logging host for all the rest won't scale. Beyond a pair, having a dedicated syslog host (or indeed more than one, for robust logging) is the way forward, as you say.

Lasse Karstensen lkarsten (at) hyse (dot) org 4 May 2006

I tried compiling mod_log_spread a few months ago, and even found an apache2 patch on some mailing list, but without luck. Has anyone succeeded in using it with apache2? The project seems abandoned. Too bad, we used it before and were mostly happy.

We're using the syslog solution. We're having 10-12 realservers now, with some moderate amounts of traffic.

Graeme Fowler graeme (at) graemef (dot) net 04 May 2006

I used mod_log_spread with apache2, but I no longer have access to either the slightly modified codebase or the production results. It worked well, however, shared between 8 servers with a pair of collectors. There was a slight lag to log processing as each listener piped the log arrivals through a Perl script (running in a "while (<>)" loop), did $magic with them, and put them in the right users' logfiles.
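
The original Perl script is not available, so the following is only a guess at the general shape of such a collector loop, written in Python. It assumes that each incoming line starts with the virtual host name (for example a LogFormat beginning with %v) and that one logfile per vhost is wanted; both are assumptions, and the names are invented.

```python
import os
import re

# Only plain hostname characters are accepted, so a crafted first field
# cannot escape the output directory.
SAFE = re.compile(r"^[A-Za-z0-9.-]+$")


def route_lines(lines, outdir):
    """Append each line to outdir/<vhost>.log, keyed on its first field.

    Lines without a usable vhost field go to outdir/unmatched.log.
    Returns the sorted list of names that were written to.
    """
    handles = {}
    try:
        for line in lines:
            vhost, _, rest = line.partition(" ")
            if not rest or not SAFE.match(vhost) or vhost.startswith("."):
                vhost, rest = "unmatched", line
            if vhost not in handles:
                path = os.path.join(outdir, vhost + ".log")
                handles[vhost] = open(path, "a")
            handles[vhost].write(rest if rest.endswith("\n") else rest + "\n")
    finally:
        for handle in handles.values():
            handle.close()
    return sorted(handles)
```

Fed from a pipe it would be called as route_lines(sys.stdin, "/some/log/dir") (after "import sys"), which is the equivalent of the Perl "while (<>)" loop.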

Clint Byrum cbyrum (at) spamaps (dot) org 04 May 2006

We used mod_log_spread for about 6 months at Adicio, Inc. We were pumping 4-8 million hits per day, reaching rates of 600-1000 hits per second at times, across 8 servers through it. I even submitted a few patches and we paid the author, George Schlossnagle, to enhance it to be more robust for us.

We gave up on it ultimately, as the underlying toolkit, spread, wasn't scaling for us. We'd have a little network blip on one machine, and the whole ring would stop working for 5 minutes. Server loads would go up retrying, and retransmitting, error logs would be flying. It was a real mess.

Ultimately we needed to do some tuning that involved recompiling the spread daemon. I gained a deep understanding of the spread protocol, and decided it was far too complex for this purpose.

We've gone to a system now where logs are written locally using the program 'cronolog' and once per hour they are collected via NFS export. It works pretty well, though it was nice to have one big log file.

Dan Trainor dan (at) id-confirm (dot) com 05 May 2006

What we've done in the past is also used mod_log_mysql. While not the most efficient way of logging (sure, your setup may differ) - this did allow us to be flexible as to where we wanted logs sent. We would dump this log nightly and export it back into one on disk, then run our stats against it. We later modified our stats system to read directly from the database, which worked out quite nicely. That's a lot of work, though. I guess if you're looking for something simple, that's not the answer for you. However, it's food for thought.

Daniel Ballenger lpmusix (at) gmail (dot) com 5 May 2006

I just recently ran a benchmark on mysql on a machine for my company... With MySQL I was pushing >1000 inserts per second. This was on a Quad 700MHz (Compaq DL580) box with 1GB of ram and 4 9.1GB drives in raid 5. I'm sure with faster disks I'd be able to push that box even harder with mysql. But of course, I've yet to hit that many queries per second with it in production :).

Graeme Fowler graeme (at) graemef (dot) net 06 May 2006

In testing, we found an interesting limit - MySQL 4.x seems to have a hard limit of 1000 client connections, and it can't be raised. As every single Apache child process opens a connection to the server to log accesses, this means that (for example) 5 Apache servers with MaxClients set to more than 200 can block the MySQL server. In the same tests we found that the server doesn't recover from this, so it stops Apache from working while each child waits for its logging connection to close. MySQL 5.x did not show this behaviour.
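
The arithmetic behind that example is simple enough to check before deploying. A small Python sketch, assuming (as described above) one logging connection per Apache child and the 1000-connection limit reported for MySQL 4.x; the function names are just for illustration.

```python
def worst_case_connections(servers, max_clients):
    """Connections opened if every Apache child on every server is busy."""
    return servers * max_clients


def max_clients_per_server(connection_limit, servers):
    """Largest MaxClients that keeps the worst case within the limit."""
    return connection_limit // servers


def can_exhaust(connection_limit, servers, max_clients):
    """True if the cluster is able to demand more connections than allowed."""
    return worst_case_connections(servers, max_clients) > connection_limit
```

For 5 realservers the ceiling is 200 children each; for a 40-server cluster like the one mentioned earlier it drops to 25 each, which is why a connection-per-child logging design needs care on large clusters.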