27. LVS: Persistent Connection (Persistence, Affinity in Cisco-speak)

Note

Apr 2006: No-one has tried this, but it seems that the sh (source hashing, ipvsadm -s sh) scheduler could replace persistence, without the failover problems of persistence. The sh scheduler schedules according to the client IP, meaning that all of a client's connection requests will be sent to the same RIP. The sh scheduler has been around for a while, but it seems that no-one knew what it did. One of the problems was that no-one knew how to use the weight parameter.
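An untested sketch of what that might look like for an https service on LVS-DR. The VIP and RIPs are placeholders, and the weights are simply left at 1 because, as the note says, it isn't well understood how the sh scheduler uses them:

```shell
# Untested sketch: schedule https by source hashing instead of persistence.
# 192.168.1.110 (VIP) and 192.168.1.11/.12 (RIPs) are placeholder addresses.
ipvsadm -A -t 192.168.1.110:443 -s sh
ipvsadm -a -t 192.168.1.110:443 -r 192.168.1.11:443 -g -w 1
ipvsadm -a -t 192.168.1.110:443 -r 192.168.1.12:443 -g -w 1
```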

Note
Sep 2002: Rewritten. All references to the LVS persistence used in kernels <2.2.12 have been dropped.

(For another writeup on persistence, see the LVS persistence page.)

For LVS, the term "persistence" has 2 meanings.

The two types of persistence are quite different. Unfortunately, both features are persistent and can reasonably claim the name "persistent". This causes some confusion in nomenclature. LVS persistence could alternately be described as connection affinity or port affinity.

LVS persistence directs all (tcpip) connection requests from the client to one particular realserver. Each new (tcpip) connection request from the client resets a timeout (the time is set by the -p option of ipvsadm). LVS persistence has been part of LVS for quite a while (first implemented by Pete Kese, when it was called pcc) and was added to handle SSL connections, squids and multiport connections like ftp (squids now have their own scheduler).

(LVS) persistence is also used when the realserver must maintain state (e.g. when the client sends information to the realserver in shopping carts, writes to an application such as a database, or must hold a cookie).

Persistence has the following effects

You should understand the consequences of using persistence if you plan to use it in production. The ideal approach from a theoretical point of view is to rewrite the application so that data is propagated to all realservers immediately (or at least before the client initiates a new SSL session), allowing the LVS to run in non-persistent mode. Rewriting your application is difficult, but if you're in production with a secure (SSL) site, you're already spending money. Despite our taking every opportunity to urge people to rewrite their applications, we find that most people don't, and continue to use persistence.

Alternatives to persistence include

27.1. LVS persistence

LVS persistence makes a client connect to the same realserver for different tcpip connections. LVS persistence works at the layer 4 (protocol) level.

LVS persistence is rarely needed and has some pitfalls (as explained below). It's useful when state must be maintained on the realserver, e.g. for https key exchanges, where the session keys are held on the realserver and the client must always reconnect with that realserver to maintain the session.

LVS persistence has two consequences

  • A client making a new tcpip connection, within the timeout period (usually 5-10mins), will be sent to the same realserver as on the previous connection. The new tcp connection will reset the timer. A connect request made past the timeout period will be treated as a new connection and will be assigned a realserver by the scheduler. The default timeout varies with LVS release, but is in the 300-600sec range.

    When implementing LVS persistence, there are problems in recognising a client as the same client returning for another connection. While the application can recognise a returning client by state information e.g. cookies (which we don't encourage, see below for better suggestions), at layer 4, where LVS operates, only the IPs and port numbers are available. If it's left to the application to recognise the client (e.g. by a cookie), it may be too late: the client may already be on the wrong realserver, and the SSL connection is refused. For LVS persistence, the client is recognised by its IP (CIP) or, in recent versions of ip_vs, by CIP:dst_port (i.e. by the CIP and the port being forwarded by the LVS). If only the CIP is used to schedule persistence, then the entries in the output of ipvsadm will be of the form VIP:0 (i.e. with port=0); otherwise the output of ipvsadm will be of the form VIP:port.

    Recognising the client is simple enough for machines on static IPs, but people on dial-up links

    • come up on a different IP for each dial-up session. If the phone line drops during a session the client will reappear with a different IP (but probably coming from the same class C network)
    • if they are coming through a proxy (like AOL), they will come from different IPs (again probably in the same class C network) for different tcpip connections within a single session (i.e. the requests for the hits on a web page may come from several IPs). (For more info see persistence granularity.)

    The solution to this is to set a netmask (e.g. /24) for persistence and to accept any IPs in this netmask as the same client. The downside is that if a significant fraction of your clients are from AOL, they will appear to be a single client and will all be beating on one realserver, while the other realservers are near idle.

    Note
    For regular http, you don't care how many different IP(s) the client uses to request its hits for a single webpage and you don't need persistence.

  • When all ports (VIP:0) are scheduled to be persistent, then requests by a client for services on different ports (e.g. to VIP:telnet, to VIP:http) will go to the same realserver. This is useful when the client needs access to multiple ports to complete a session. Useful multi-port connections are

    • 20,21 for active ftp
    • 21 and a high port for passive ftp
    • port 80,443 for an e-commerce site

    A side effect is that once persistence is set for all ports, requests by the client to any port, not just the ones you think the client is interested in, will be forwarded to the realserver. (The client will get a "connection refused" if the realserver is not listening on the other forwarded ports.) For security (to stop port scans etc), you'll have to filter requests to the other ports.

    The ports won't necessarily be paired in the way you want, e.g. in the (admittedly unlikely) event that you have an ftp and e-commerce setup on the same LVS, both ftp and e-commerce requests will go to the same realserver. What you'd like is for the e-commerce (80,443) requests to be scheduled independently of the ftp (20,21) requests, so that your ftp requests go to one realserver while your requests to the e-commerce site go to a different realserver. It's simpler administratively to put different services (ftp, http/https) on different LVSs.

    The all ports (VIP:0) approach is quite crude, and was a first attempt at bundling together connect requests for multiple services from a client. This side effect (of persistence activating all ports), does not arise if multiport services are forwarded by a persistent fwmark. To bundle services see fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html) - in particular persistence granularity with fwmark (http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.fwmark.html#fwmark_persistence_granularity).
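For the persistence netmask in the first item above: as I understand it, ip_vs masks the CIP with the service's persistence netmask before looking up a template, so every client on the same masked network shares one template and therefore one realserver. The shell function below is a hypothetical helper (not part of ipvsadm) that just does that masking, so you can check which clients will be lumped together. The mask itself is set with ipvsadm's -M option together with -p; the VIP and timeout in the comment are placeholders.

```shell
#!/bin/sh
# persist_key CIP MASKLEN: print the network address CIP falls into for a
# persistence netmask of MASKLEN bits (24 = 255.255.255.0). Clients with the
# same result share one persistence template. Hypothetical helper; the mask
# is actually set on the director with something like
#   ipvsadm -A -t VIP:https -s rr -p 600 -M 255.255.255.0
persist_key() {
    cip=$1
    bits=$2
    old_ifs=$IFS
    IFS=.
    set -- $cip                      # split the dotted quad into $1..$4
    IFS=$old_ifs
    ip=$(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
    mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
    net=$(( ip & mask ))
    echo "$(( (net >> 24) & 255 )).$(( (net >> 16) & 255 )).$(( (net >> 8) & 255 )).$(( net & 255 ))"
}
```

For example, `persist_key 192.168.3.77 24` and `persist_key 192.168.3.200 24` both print 192.168.3.0, so with -M 255.255.255.0 those two clients (say, two proxies on the same class C) are treated as one client and sent to the same realserver.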

Note: the persistence timeout is the elapsed time, between different tcpip connections, for the client to be recognised as a returning client. You still have the same idle timeout within a tcpip connection as for other services.

Wensong Zhang wensong (at) gnuchina (dot) org 11 Jan 2001

The working principle of persistence in LVS is as follows:
  • a persistent template is used to keep the persistence between the client and the server.
  • when the first connection from a client arrives, the LVS box selects a server according to the scheduling algorithm, then creates a persistent template and the connection entry. The connection entry is controlled by the template.
  • Later connections from the client will be forwarded to the same server, as long as the template doesn't expire. Their connection entries are also controlled by the template.
  • If the template has controlled connections, it won't expire.
  • If the template has no controlled connections, it expires in its own time.
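The steps Wensong lists can be played with in a toy model. This is a hypothetical shell sketch, not the kernel code: connections are assumed to close instantly (so no template is ever held open by controlled connections), a template's expiry is fixed when it is created, and the scheduler is plain rr over two placeholder realservers. The point is the order of operations: look for a live template first, and only ask the scheduler if there isn't one.

```shell
#!/bin/sh
# Toy model of LVS persistence templates (hypothetical, not kernel code).
RS_LIST="rs1 rs2"    # placeholder realservers
TIMEOUT=360          # persistence timeout, as in ipvsadm -p 360
TEMPLATES=""         # one template per line: "client realserver expiry"
RR_NEXT=0            # round-robin position

# rr_pick: set PICK to the next realserver in round-robin order.
# Sets a global (instead of echoing) so RR_NEXT is not lost in a subshell.
rr_pick() {
    set -- $RS_LIST
    idx=$(( RR_NEXT % $# + 1 ))
    RR_NEXT=$(( RR_NEXT + 1 ))
    eval "PICK=\${$idx}"
}

# connect CLIENT NOW: set RS to the realserver this new connection goes to.
connect() {
    client=$1
    now=$2
    # 1. Look for an unexpired template for this client.
    RS=$(printf '%s\n' "$TEMPLATES" |
        awk -v c="$client" -v t="$now" '$1 == c && $3 > t { print $2; exit }')
    if [ -n "$RS" ]; then
        return 0    # template found: the scheduler is not consulted
    fi
    # 2. No live template: ask the scheduler, then record a new template,
    #    replacing any expired one for this client.
    rr_pick
    RS=$PICK
    TEMPLATES=$(printf '%s\n' "$TEMPLATES" | awk -v c="$client" 'NF && $1 != c')
    TEMPLATES="$TEMPLATES
$client $RS $(( now + TIMEOUT ))"
}
```

Calling `connect c1 0`, `connect c2 10`, `connect c3 20`, `connect c1 100`, `connect c1 400` in turn leaves RS set to rs1, rs2, rs1, then rs1 again (c1's template is reused and the rr position does not move), and finally rs2 (c1's template expired at 360, so the scheduler is asked again).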

malcolm lists (at) netpbx (dot) org

What's the maximum setting for the persistence timeout? The docs say it's unlimited but I don't believe that :-).

Horms 25 Aug 2006

ipvsadm may have some other limit due to signedness issues and the like. But in the kernel it is stored as an unsigned int, which represents seconds. So any value between 0 and (2^32)-1 seconds is valid, which is potentially a rather long time.
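For scale, that upper limit worked out in shell arithmetic (365-day years, rounded down):

```shell
#!/bin/sh
# How long is (2^32)-1 seconds, the largest value the kernel can store?
max_timeout_years() {
    max=$(( (1 << 32) - 1 ))      # 4294967295 seconds
    year=$(( 365 * 24 * 3600 ))   # 31536000 seconds in a 365-day year
    echo $(( max / year ))        # whole years, rounded down
}
max_timeout_years                 # prints 136
```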

27.2. Scheduling looks different under persistence

In a normal (non-persistent) LVS, if you connect to VIP:telnet with rr scheduling, you will connect to each realserver in turn. This is because the director is scheduling each tcpip connection as separate items. When you logout of your telnet session and telnet to the VIP again, the director sees a new tcpip connection and schedules it round robin style i.e. to the next realserver in the ipvsadm table.

However, if you then make the LVS persistent, the director schedules each CIP as a separate item. Repeated telnet tcpip connections (logins and logouts) to the VIP (within the persistence timeout period) will be regarded as the same scheduling item, since they are coming from the same client, and will all be sent to the same realserver. Even though rr scheduling is in effect, you will be connected to the same realserver. To test that the scheduler is round-robin'ing under persistence, you will need to login from several different clients (i.e. with different IPs), or after the persistence timeout has expired.

If two services are scheduled as persistent (here telnet, http), they are scheduled independently. Here I have only 1 client (so it isn't a good test) and I connect twice by telnet and then twice by http. Scheduling is within the blocks set up by the `ipvsadm -A` command (each starting at "TCP ..."). Here there are two blocks, scheduled separately.

ipvsadm
IP Virtual Server version 0.9.4 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
TCP  lvs.mack.net:http rr persistent 360
  -> RS2.mack.net:http            Route   1      0          2
  -> RS1.mack.net:http            Route   1      0          0
TCP  lvs.mack.net:telnet rr persistent 360
  -> RS2.mack.net:telnet          Route   1      0          2
  -> RS1.mack.net:telnet          Route   1      0          0

Doing the same test a bit later, I found all connections going to the other realserver.
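A table like the one above could be built with commands along these lines. This is a sketch: the hostnames are the ones in the output, and LVS-DR (-g) is assumed from the "Route" column.

```shell
# Two persistent services on the same VIP, each scheduled independently
# (rr, 360 s persistence, LVS-DR). Hostnames taken from the output above.
VIP=lvs.mack.net
for port in http telnet; do
    ipvsadm -A -t "$VIP:$port" -s rr -p 360
    ipvsadm -a -t "$VIP:$port" -r "RS1.mack.net:$port" -g -w 1
    ipvsadm -a -t "$VIP:$port" -r "RS2.mack.net:$port" -g -w 1
done
```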

Will the timeout variable set on persistent connection affect an open socket that's open for several days streaming data?

Horms 2005/02/22

No. The persistence timeout has no effect whatsoever on the timeout of open connections. They have their own timeouts, which are generally in line with those of TCP.

Will another connection from the same client go to a different realserver while there's an open socket with streaming data?

Not if you use persistence. If you use persistence, and either there is a connection open, or the persistence timeout has not elapsed since the last connection was closed, then a subsequent connection from the same end-user will go to the same realserver.

For those who care, this is all controlled by the expiry of connection entries and persistence templates in ip_vs_conn_expire().

27.3. Persistent and regular (non-persistent) services together on the same realserver.

If you set up both a non-persistent service (for testing, say telnet) and persistence on the same VIP, then all services will be persistent except telnet, which will be scheduled independently of the persistent services. In this case connections to VIP:telnet would be scheduled by rr (or whatever) and you would connect to all realservers in rotation, while connections to VIP:http will go to the same realserver.

Example: if you set up a 2-realserver LVS-DR LVS with persistence,

director:/etc/lvs# ipvsadm -A -t VIP -p 360 -s rr
director:/etc/lvs# ipvsadm -a -t VIP -r rs1 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP -r rs2 -g -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0              Route   1      0          0
  -> RS1.mack.net:0              Route   1      0          0

then (as expected) a client can connect to any service on the realservers (always getting the same realserver).

If you now add an entry for telnet to both realservers (you can run these commands before or after the first set of 3 commands above)

director:/etc/lvs# ipvsadm -A -t VIP:telnet -s rr
director:/etc/lvs# ipvsadm -a -t VIP:telnet -r rs1 -g -w 1
director:/etc/lvs# ipvsadm -a -t VIP:telnet -r rs2 -g -w 1

giving the ipvsadm output

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.5 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:0 rr persistent 360
  -> RS2.mack.net:0              Route   1      0          0
  -> RS1.mack.net:0              Route   1      0          0
TCP  lvs2.mack.net:telnet rr
  -> RS2.mack.net:telnet         Route   1      0          0
  -> RS1.mack.net:telnet         Route   1      0          0

the client will telnet to both realservers in turn as would be expected for an LVS serving only telnet, but all other services (ie !telnet) go to the same first realserver. All services but telnet are persistent.

The director will make persistent all ports except those that are explicitly set as non-persistent. These two sets of ipvsadm commands do not overwrite each other. Persistent and non-persistent connections can be made at the same time.

Julian

This is part of the LVS design. The templates used for persistence are not inspected when scheduling packets for non-persistent connections.

Examples:

  • ftp (LVS-NAT): connections to both ftp ports for passive ftp are handled by the module ip_masq_ftp. You don't need to add persistence for ftp with LVS-NAT.
  • ftp (LVS-DR or LVS-Tun): you need persistence on the realservers. Run the first set of commands above.
  • ftp and http (LVS-NAT): persistence not needed (ip_masq_ftp handles the ftp ports for active and passive ftp).
  • ftp and http (LVS-DR or LVS-Tun): persistence is needed to handle the two-port protocol ftp. If you just have one entry in the ipvsadm table (persistence to VIP:0) then a client connecting to the http service of the LVS will always get the same realserver (this may not be a great problem). If you want to make the http service non-persistent while leaving all other services persistent, then add a non-persistent entry for http.
  • http and https (all forwarding methods): Normally an https connection is made after the client has made selections on an http connection when data is stored on the realserver for the client. In this case the realserver should be made persistent for all services.
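For the http + https case, a persistent fwmark service (mentioned earlier as the alternative to VIP:0 persistence) bundles just those two ports, so the other ports on the VIP are not forwarded. A sketch, assuming an iptables-based director, LVS-DR and placeholder addresses; mark 1 and the 1800 s timeout are arbitrary choices:

```shell
# Mark http and https packets to the VIP with fwmark 1, then make one
# persistent virtual service out of the mark. Addresses are placeholders.
VIP=192.168.1.110
iptables -t mangle -A PREROUTING -d "$VIP" -p tcp -m multiport --dports 80,443 \
    -j MARK --set-mark 1
ipvsadm -A -f 1 -s rr -p 1800
ipvsadm -a -f 1 -r 192.168.1.11 -g -w 1
ipvsadm -a -f 1 -r 192.168.1.12 -g -w 1
```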

Note: making realserver connections persistent allows _all_ ports to be forwarded by the LVS to the realservers. An open, persistently connected realserver then is a security hazard. You should have filter rules on the director to block all services on the VIP except those you want forwarded to the realservers.
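A minimal sketch of such filter rules, assuming an iptables-based director where the filter table's INPUT chain sees packets for the VIP before LVS does (check this on your kernel), a VIP-wide persistent service, and that only http and https should reach the realservers. The VIP is a placeholder. ICMP is deliberately not dropped, so error messages (e.g. for path MTU discovery) still get through:

```shell
# Allow only http/https to the VIP; drop all other TCP and UDP to it.
VIP=192.168.1.110
iptables -A INPUT -d "$VIP" -p tcp -m multiport --dports 80,443 -j ACCEPT
iptables -A INPUT -d "$VIP" -p tcp -j DROP
iptables -A INPUT -d "$VIP" -p udp -j DROP
```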

27.4. Tracing connections: where will the client connect next?

You can trace your system in the following way. For example:

[root@kangaroo /root]# ipvsadm -ln
IP Virtual Server version 1.0.3 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port          Forward Weight ActiveConn InActConn
TCP  172.26.20.118:80 wlc persistent 360
  -> 172.26.20.91:80             Route   1      0          0
  -> 172.26.20.90:80             Route   1      0          0
TCP  172.26.20.118:23 wlc persistent 360
  -> 172.26.20.90:23             Route   1      0          0
  -> 172.26.20.91:23             Route   1      0          0

[root@kangaroo /root]# ipchains -L -M -n
IP masquerading entries
prot expire   source               destination          ports
TCP  02:46.79 172.26.20.90         172.26.20.222        23 (23) -> 0

Although there is no connection, the template isn't expired. So, new connections from the client 172.26.20.222 will be forwarded to the server 172.26.20.90.

For 2.4 kernels

director:/etc/lvs# ipvsadm -Lc
or
director:/etc/lvs# ipvsadm -Lcn

This shows the state of the connection (ESTABLISHED, FIN_WAIT) and the time left till persistence timeout.
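To answer "where will this client go next?" from that output, you can pull out just the persistence templates. This is a hypothetical helper that assumes templates appear in `ipvsadm -Lcn` with state NONE and client port 0 (check this against your own ipvsadm output):

```shell
#!/bin/sh
# list_templates: read "ipvsadm -Lcn" output on stdin and print one line
# per persistence template: client IP, realserver IP, time left.
list_templates() {
    awk '$3 == "NONE" {
        split($4, src, ":")    # source      = CIP:cport
        split($6, dst, ":")    # destination = RIP:port
        if (src[2] == "0")
            print src[1], dst[1], $2
    }'
}
```

Usage: `ipvsadm -Lcn | list_templates`. Each line names a client whose next connection will go to the listed realserver, as long as the time left hasn't run out.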

27.5. Bringing down persistent services.

Note
This is the behaviour before late 2004.

27.5.1. Clearing the table

If a client is connected (persistently) to a realserver and the ipvsadm table is cleared (ipvsadm -C) then the connection will hang. If you then reinstall the original ipvsadm rules for that service, the connection will work again (and you'll see the correct entries in ActiveConn and InActConn). Wensong (a bit below) explains why the code doesn't clear the entry, but only removes the pointer to the entry.

Ratz

In new versions of ip_vs (look for it Sep 2002 or later) you can affect the behaviour of ip_vs towards the connection when the ipvsadm table is cleared with a sysctl. Details are in the sysctl document http://www.linux-vs.org/docs/sysctl.html. e.g.

  • net.ipv4.vs.expire_nodest_conn=0

    maintain entry in table (but silently drop any packets sent), allowing service to continue if the ipvsadm table entries are restored.

  • net.ipv4.vs.expire_nodest_conn=1

    expire the entry in table immediately and inform client that connection is closed. This is the expected behaviour by some people when running `ipvsadm -C`
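Setting it is one line on a director that has this sysctl. `sysctl -w` changes only the running value, so put the same key in /etc/sysctl.conf if you want it to survive a reboot:

```shell
# Show the current setting, then expire entries for removed realservers
# immediately (1) instead of silently dropping their packets (0).
sysctl net.ipv4.vs.expire_nodest_conn
sysctl -w net.ipv4.vs.expire_nodest_conn=1
```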

However if you have some client at the other end buying $1M of your software with his credit card, you want to be nice to them. The nice way of deleting a service is to set the weight to zero (when no new connections will be allowed to that realserver) and then wait for the current connections to disconnect/expire before deleting them (use some script to monitor the number of connections). Since the client can stay connected for hours (for some services) you can't predict when you'll be able to bring your server down.
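The "some script to monitor the number of connections" might look like the sketch below. It assumes LVS-DR (-g), takes the service and realserver as IP:port strings as printed by `ipvsadm -Ln`, and both function names are hypothetical. With a persistent service, remember that clients with live templates can still get new connections on a weight 0 realserver (see 27.5.2), so the loop may take a long time to finish.

```shell
#!/bin/sh
# rs_conns SERVICE RS: read "ipvsadm -Ln" output on stdin and print
# ActiveConn + InActConn for realserver RS under virtual service SERVICE.
rs_conns() {
    awk -v svc="$1" -v rs="$2" '
        $1 ~ /^(TCP|UDP|FWM)$/ { in_svc = ($2 == svc); next }
        in_svc && $1 == "->" && $2 == rs { total += $5 + $6 }
        END { print total + 0 }'
}

# drain SERVICE RS: quiesce RS, then wait until it has no connections left.
drain() {
    service=$1 rs=$2
    ipvsadm -e -t "$service" -r "$rs" -g -w 0    # weight 0: no new scheduling
    while [ "$(ipvsadm -Ln | rs_conns "$service" "$rs")" -gt 0 ]; do
        sleep 10
    done
    echo "$rs has no connections left under $service"
}
```

For example, `drain 172.26.20.118:80 172.26.20.91:80`, after which the realserver can be deleted from the table with `ipvsadm -d`.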

27.5.2. time to clear quiescent persistent connections

In a normal (non-persistent) tcp connection, after setting a service to weight=0, the ipvsadm connection (hash) table will clear FIN_WAIT time (with Linux, about 2 mins) after the last client disconnects. With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) time after the last client disconnects. This time defaults to about 5mins but can be much longer. Thus you cannot bring down a realserver offering a persistent service till the persistence timeout has expired: clients who have connected recently can still reconnect.

Tim Cronin wrote:

if you're using pasv you need persistence....

/sbin/ipvsadm -A -t 172.24.1.240:ftp -p
#forward ftp to realserver 192.168.1.20 using LVS-NAT (-m), with weight=1
/sbin/ipvsadm -a -t 172.24.1.240:ftp -r 192.168.1.20:ftp -m -w 1

If I change the weight of the .20 RIP to 0 and rerun the script, my connections continue to go to that server, even after I zeroed the weight and cleared the table.

Julian 1 Nov 2002

Because the virtual service is marked persistent. In such case RSs with weight 0 can continue to accept _new_ conns.

27.5.3. Resetting timeout

The persistence timeout is not reset to the original timeout on each new tcpip connection, it is incremented by TIME_WAIT.

unknown (possibly Julian)

Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection", it could be useful.

Terry Green, 7 Feb 2003

Agree completely - however, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT

Julian Anastasov 2003-02-08 2:21:35

The persistence timeout is used only once: when the first connection from this client is established. Its current meaning is the period of time covered by persistence after the client first appears. It is extended if there are still active connections. Then there are 3 (or more) options:

  • extend it again with the persistent time
  • extend it with 2mins
  • use the persistence time after the last connection from client terminates

The second option is implemented, as it was expected from other users :)

A long time ago my opinion was that the persistence time should be applied when the last connection terminates (3 above). This can be a config option, if someone wants to implement it.

unknown (Julian?)

Maybe you see it 20 seconds after the 2-minute cycle is restarted. It is "reset" only when its timer expires, not when the controlled connections expire.

Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I did another connect from the client, and the template record was reset again to 2 minutes (not the persistence setting again), suggesting the template record data structure had somehow had its persistence time reset from the original setting to 2 minutes.

Julian

Well, then it is not set to 1:40 but to 2:00 as expected.

Terry

Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection.

Julian

True, your reading is accurate :) I now see why it was 1:40

Terry

My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_TIME_WAIT.

Julian

You can do it in your source tree or to implement it for other users as config option. I don't know what the other people think.

Ratz and another poster (12 Aug 2004) would prefer resetting the timeout to the persistence value.
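Julian's second option (the one implemented) can be turned into a little arithmetic. This is a hypothetical shell sketch under the assumptions in its comments, not a copy of the kernel timer code:

```shell
#!/bin/sh
# template_expiry FIRST PERSIST GONE EXT   (all in seconds)
#   FIRST   - time of the client's first connection (template created)
#   PERSIST - persistence timeout (ipvsadm -p)
#   GONE    - time the last controlled connection entry disappears
#   EXT     - re-arm interval when the timer fires with connections still
#             present (IP_VS_TIME_WAIT, 2*60*HZ = 120 s); must be > 0
# Prints the time at which the template finally expires.
template_expiry() {
    first=$1 persist=$2 gone=$3 ext=$4
    expire=$(( first + persist ))
    while [ "$expire" -lt "$gone" ]; do
        expire=$(( expire + ext ))    # timer fired, connections remain
    done
    echo "$expire"
}
```

`template_expiry 0 360 1000 120` prints 1080, so the client keeps its realserver for only 80 s after its last connection goes away, whereas option 3 in Julian's list would give 1000 + 360 = 1360. With Terry's 2*50*HZ hack (EXT=100) the same case gives 1060.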

27.5.4. Persistence is independent of scheduler

The scheduler determines which realserver gets the next connection. With persistence, the same realserver gets the next connection.

Horms 13 Sep 2004

Persistence operates independently of the scheduler. It does not matter if you use the RR, WLC, DH or any other type of scheduler; it always works the same way. That is, it looks up a persistence template and if it finds one, it uses it, else it asks the scheduler what to do.

In other words, if there was a connection from a given end-user, and the persistence timeout has not expired, subsequent connections from the same end-user (masked with the persistence netmask) will go to the same realserver. As this lookup occurs _before_ a call to the scheduler, it is not affected by quirks in any scheduler.

Brett

I have an LVS director that uses wrr with 3600 of persistence for two realservers. I noticed that connections going through a firewall from my internal network tend to get locked into one of my realservers and usually don't go to the other realserver unless all of the connections to the first realserver have expired.

Ratz 10 Aug 2004

Correct.

From what I understand, LVS supports using the source IP for persistence, but I wasn't sure if it also used the source port.

No, it doesn't. The persistent template is created as follows:

<proto, caddr, 0, vaddr, vport, daddr, dport>

As you can see, the cport is set to 0 globally.

Horms

The source IP address is used, but the source port is not. This is because successive connections from the same host will almost certainly have a different ephemeral source port. There is no parameter in LVS to change this behaviour. Though off the top of my head it would seem like a simple hack to alter this if you needed to for some reason.

Would using a different scheduler or a kernel upgrade (with a new lvs version) work around this?

Horms

Not likely.

Ratz

You would need to tweak ../net/ipv4/ipvs/ip_vs_core.c:ip_vs_sched_persist().

27.6. Forcing a break in a persistent connection: expire_quiescent_template - Horms sysctl for quiescing persistent connections

Horms horms (at) verge (dot) net (dot) au 12 Apr 2004

Expire Quiescent Template. Here's the writeup.

This patch adds a proc entry to tell LVS to expire persistence templates for quiescent servers. As per the documentation patch below:

expire_quiescent_template - BOOLEAN

0 - disabled (default)
not 0 - enabled

When set to a non-zero value, the load balancer will expire
persistent templates when the destination server is quiescent. This
may be useful, when a user makes a destination server quiescent by
setting its weight to 0 and it is desired that subsequent otherwise
persistent connections are sent to a different destination server.
By default new persistent connections are allowed to quiescent
destination servers.

If this feature is enabled, the load balancer will expire the
persistence template if it is to be used to schedule a
new connection and the destination server is quiescent.

This patch was written to allow loadbalancing of https, with failover. However it can be used to force a break of a persistent connection. With persistent connection and the weight of a realserver set to 0, any new connections will go to other realservers, but existing connections will stay till they timeout or the client disconnects an active session (whichever is longer). Experience on the mailing list shows that this could be a long time. Misconfigured clients stay connected forever. This patch forces the client's connection to break. The client probably will not be happy about this, but then you may not want to wait 24hrs to do maintenance either.
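Turning it on, assuming a kernel that includes the patch (otherwise the /proc entry won't exist):

```shell
# Expire persistence templates that point at quiescent (weight 0) realservers.
sysctl -w net.ipv4.vs.expire_quiescent_template=1
cat /proc/sys/net/ipv4/vs/expire_quiescent_template    # should now print 1
```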

Graeme Fowler graeme (at) graemef (dot) net 19 Jul 2007

/proc/sys/net/ipv4/vs/expire_quiescent_template ensures that when a realserver's weight is changed to 0 (ie. it is set to "quiescent"), rather than removed from the pool of realservers, existing persistent sessions on that realserver are expired from the persistence template.

Where you have persistence set on a virtual service, setting the weight to 0 usually results in no new sessions being forwarded to that realserver *but* existing sessions will continue to be handled until they are closed. It's the "graceful" way of taking a server down for maintenance, for example. If you remove a realserver from the pool, with expire_quiescent_template set to 1, those sessions expire immediately.

Additionally, setting expire_nodest_conn to 1 also helps by removing persistent entries when a realserver is removed from the pool. Without this, persistent entries will hang until they timeout and get redirected to another realserver.

Are there any problems caused by setting both expire_nodest_conn and expire_quiescent_template?

None that I can think of directly; however if a healthcheck fails because something goes awry (local intervening network conditions, transient load on director, something like that) then a realserver could well be quiescent or removed briefly - in which case all established persistent sessions will be terminated. This may not be desirable in the case of a condition which is resolved in a few seconds.

I guess careful tuning of healthchecks along with good network design would be the way to not trigger it, but that's outside the scope of this discussion :)

Nicola Pero nicola (at) brainstorm (dot) co (dot) uk 25 Nov 2004

Has anyone been able to setup ldirectord to load balance two HTTPS servers with failover ?

The two real HTTPS servers are stateless (except for the SSL info in the web servers); there are few concurrent users (up to 10), but instant switchover in case of failure is essential.

Anyway, the problem we have is that when one of the two HTTPS servers goes down, the load balancer detects it but all clients connected to the server which is down keep being sent to it. Changing 'persistent', 'quiescent', timeouts etc didn't seem to have any effect on this!

Our case is also complicated by the fact that in certain cases we might decide that a realserver should not be used even if HTTPS is still running fine on the server. That might happen if the application sitting behind the HTTPS has a problem. We've got a URL on the realserver which can be checked to know if the realserver is OK to be used or not. Checking those seems to be working fine! The problem is with the realserver being marked as down, and all requests still being sent to it!

Keep in mind this is not a typical web farm, there are few concurrent users (most often 0 or 1), but it's critical that the web application is always available.

Malcolm Turnbull Nov 25, 2004

You definitely want quiescent=no (in ldirectord).

Horms 26 Nov 2004

Or use this patch. http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html

The patch just makes persistent sessions behave sensibly(tm) when a realserver is made quiescent. This isn't specific to HTTPS at all, but I think it is the problem that the user is seeing. The other solution is not to make the realservers quiescent, and just remove them instead.

2.4 patch (http://www.in-addr.de/pipermail/lvs-users/2004-February/011018.html), 2.6 patch (http://article.gmane.org/gmane.linux.network/18906).

The Existing behaviour.

When a realserver is marked as quiescent (by setting the weight to zero) no additional connections will be allocated to that realserver by the scheduler (the LVS connection allocator, not the cpu scheduler, the packet scheduler, or your secretary).

This works quite well, unless the scheduler is bypassed for some reason. As it happens, this occurs only if a virtual service is marked as persistent and there is a persistent template in existence: that is, there was recently a prior connection from the same end-user.

In this case the presence of the persistent template is sufficient for additional connections to be scheduled, despite the fact that the server is marked as quiescent. The connections do, however, have to come from an end-user (IP address/netmask) that was forwarded to the realserver in question within the persistence timeout.

My patch allows this behaviour to be changed, by expiring the templates when a realserver is marked as quiescent. Thus the scheduler gets called, and the behaviour is the same as for a non-persistent service, which is generally what people expect/want.

Joe

and just rip out the connections?

By removing a realserver you break all the connections and remove all the persistent templates. So no further connections are forwarded whatsoever. Actually, no further packets are forwarded. Unfortunately, this breaks connections that are in progress.

So what happens in the following case:

You've filled your shopping cart under http, then you go to https to give your credit card info, which usually takes at least 3 webpages (fill in your credit card and shipping info, click send, get confirmation page, click accept, get final page for printing). Let's say while you're reviewing the confirmation page, the realserver goes down and the LVS removes it by running ipvsadm. The tcpip state of the client is ESTABLISHED and the client has the SSL session ID. The LVS has to cache the credit card info somewhere to make it available to the new realserver. When the user hits the accept button, the browser presumably is going to get a tcpip reset from the new realserver. Does the browser just handle it and attempt to make a new tcpip connection? From what you say above, the browser will find that its SSL session ID is invalid and it will do the long handshake. Once that happens the client will hopefully be SSL connected to a realserver that knows about the credit card transaction already underway.

In the situation you describe above the main factors in determining if it would work or not are

  • how do the realservers store their data?

    if some sort of shared storage is used, say for example NFS, and the transaction is not in some half broken state, then it should be ok, though there might be a race in there

  • will the client's browser reconnect (either automatically or by the user hitting reload)?

    the answer is generally yes.

What I am trying to say is that it really boils down to an interaction between the end-user's browser and the realservers' web application. The LVS magic in between neither hinders nor helps the situation, other than allowing the end-user to connect to a different realserver if/when a reload occurs.

And the Session ID shouldn't really come into it: if it is still valid, it will be used, and if it is invalid, it will be discarded and a full handshake will be performed. Sure, it might take an extra few moments, and possibly the realserver might be a bit overloaded if a lot of reconnects of this nature occur simultaneously, but the success (or failure) of the SSL handshake should not be affected.

I had thought that the keys are in memory and you can't move the keys/session data from one machine to another.

That is not the case, let me elaborate. (I wrote an SSL implementation once so I know this one :)

SSL makes use of public key encryption (e.g. RSA) and private key (symmetric) encryption (e.g. DES, AES), as well as a host of other techniques, to make communications more secure. In a nutshell, public key encryption, which is slow but does not require any prior agreement of keys, is used to negotiate a key for private key encryption, which is fast but requires a key to be negotiated. This key negotiation phase is part of the SSL handshake.

It turns out, particularly for small transfers as are typical on the web, that the public key encryption negotiation phase of the handshake is quite expensive. To alleviate this, the server may (and almost always will) give the client a Session ID during the course of the handshake. If the client reconnects, it _may_ offer this Session ID, and _if_ the server recognises it, an abridged version of the handshake is performed which relies on cryptographic information that both the client and server have cached.

Observe:

  • If the client does not offer a Session ID, the long handshake is performed.
  • If the server does not recognise the Session ID - perhaps because it has expired, perhaps because it is a different machine, perhaps because the Session ID is bogus - the long handshake is performed.
  • Also, if the client tries to guess a Session ID and guesses one that the server knows about, the handshake will fail and the session will terminate unless the cached key information it holds matches. Guessing the cryptographic information is usually difficult at best, though it depends on what cipher suite (combination of cryptographic algorithms) was used for the original session. Thus, DoS issues aside, guessing Session IDs is typically of little value.

It is the second point above that allows failover of SSL servers to work. The server is actually allowed to cache the Session ID for as long as it wants, including discarding it immediately. This is catered for by falling back to the long handshake if the Session ID is not matched.
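The fallback logic described in these points can be sketched as follows. This is a hypothetical model, not a TLS library API. In a real abbreviated handshake, each side proves knowledge of the cached master secret through the Finished messages rather than by sending it; the model collapses that into a direct comparison. Each realserver has its own in-memory cache, so a Session ID issued by one realserver is simply unknown to another.

```python
import secrets


class SslServer:
    """Hypothetical model of one realserver's SSL session cache (not a real TLS API)."""

    def __init__(self, name: str):
        self.name = name
        self.cache = {}   # Session ID -> cached master secret, held in this server's memory

    def handshake(self, offered_id=None, offered_secret=None):
        """Return (kind, session_id, secret) where kind is 'abbreviated' or 'full'."""
        if offered_id is not None and offered_id in self.cache:
            # Known ID: the abridged handshake only succeeds if both sides' cached
            # key material matches (modelled here as a direct comparison).
            if self.cache[offered_id] != offered_secret:
                raise ConnectionError("abbreviated handshake failed")
            return "abbreviated", offered_id, offered_secret
        # No ID, or an ID this server does not know (expired, issued by another
        # realserver, or bogus): perform the long handshake and cache a new session.
        session_id = secrets.token_hex(16)
        secret = secrets.token_hex(32)
        self.cache[session_id] = secret
        return "full", session_id, secret
```

If the client fails over from rs1 to rs2 with rs1's Session ID, rs2 performs one full handshake and nothing breaks. The only case that fails is a known ID offered with the wrong key material.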

Stephane Klein

I have an LVS using persistence. All is working well until I stop a real server. The director continues to send requests to the real server which was stopped. ipvsadm -Lcn confirms that the requests are still sent to the stopped real server.

Horms 2005/03/22

The problem here is that persistence still takes effect even after the real server is removed (I assume you have quiescent=1). You can change this behaviour by running:

echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template

The effect of this is that the persistence templates are expired when a realserver is made quiescent, and thus no additional connections will be directed to the realserver in question.

27.7. what if a realserver holding a persistent (sticky) connection crashes

An explanation of the problem:

  • normal (non-persistent) connection to a service (e.g. httpd).

    If the server crashes while your tcpip connection is open, that connection will hang (it will eventually time out). The client will notice some icon showing that the browser is continuing to look for the page. The director will notice that the realserver has died and will remove the realserver from the ipvs table by first setting its weight to 0. This will stop any new connections, but allow current connections to continue (and eventually exit). Since the current connections are hung, the director will assume they have exited after the time of the tcp timeouts. Once the connection table for that realserver is empty, the entries for the realserver are removed from the ipvs table. Eventually the browser will time out or the user will reload, thus establishing a new tcpip connection, whereupon the LVS will connect the user to a working realserver (the dead one not being sent any new connections). The connection with the original realserver was lost (or hung). The persistence of the connection is the tcpip connection: any new tcpip connection will be sent to a working realserver.

    Clients are used to connections on the internet hanging and will not realise that a realserver died on them. The behaviour of ip_vs here gives a satisfactory behaviour as far as the client is concerned.

    Note
    If the service was telnet, the client would have a hung session and would have to close their window and reconnect. This is not satisfactory, but there's no way to transfer a tcpip connection to a new machine.

  • persistent connection to a service (e.g. https) with -p 600 (10 mins timeout).

    Everything is the same as for the non-persistent connection, except the criteria for terminating the user session.

    If you have set a persistence timeout on the director of 10mins, then the director is saying "no matter what happens, I will connect this client to that realserver for all tcpip connection requests for the next 10mins (even if the realserver is dead)". The director is guaranteeing that the realserver will be up for the next 10mins and the persistence extends beyond any single tcpip connection to cover new tcpip connections in the timeout period. If the director sets weight=0 for a realserver (e.g. if it has crashed), then new tcpip connections from the client will still be sent to the same (dead) realserver.

    The behaviour of ipvs, which satisfactorily removes realservers when the granularity is a tcpip connection, doesn't work when the LVS session can cover many tcpip connections.

Ben Hollingsworth

OK, so I've got my setup nailed down pretty well. This is a pair of squid web proxies on a 2-host LVS running UltraMonkey / HB 2.0.7-8 on RHEL4 (2.6.9). I'm struggling with one more thing, though. With quiescent=true, if I shut down squid on one box, connections from new hosts fail over to the other box just fine, but connections from persistent hosts keep going to the same, dead box. I realize this is as intended. If I set quiescent=false, all client communication with the dead box ceases immediately, which includes cutting off active connections at the knees. That's not an issue if the squid actually dies. However, most of our failovers will be due to my own planned maintenance. In that case, I'd like to allow existing connections (which may be lengthy downloads) to finish before sending new requests (even from persistent clients) to the live box. I can't find any way to do this without hacking the kernel to match a 2-yr-old patch that Horms published (assuming that even applies to my setup). Most of the info about this seems to have been written three years ago. Is there a way to make this work without a custom compile?

Adrian Chapela achapela (dot) rexistros (at) gmail (dot) com 12 Mar 2007

OK, to solve the problem there are two variables:

  • /proc/sys/net/ipv4/vs/expire_nodest_conn: to expire connections before the protocol timeout. This is to solve the problem when a server goes down. For example in the UDP protocol the protocol timeout is too high.
  • /proc/sys/net/ipv4/vs/expire_quiescent_template: (see Section 27.6) this variable I think is the variable to solve your problem. With this you timeout your persistent template when a server goes down.

I don't know exactly what they do, but the first one solved my problems. You could test it.
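The two sysctls can also be set from a script. This sketch writes `1` to each entry, which is equivalent to the `echo 1 > ...` command shown earlier. It assumes root access and a kernel that provides both files; `expire_quiescent_template` is missing on kernels without the patch discussed in this section. The helper `dry_run()` (a hypothetical name) runs the same code against a throwaway directory, so the script can be tried without touching /proc. Values written this way do not survive a reboot.

```python
import tempfile
from pathlib import Path

IPVS_SYSCTL_DIR = Path("/proc/sys/net/ipv4/vs")
EXPIRY_SYSCTLS = ("expire_nodest_conn", "expire_quiescent_template")


def enable_expiry_sysctls(base: Path = IPVS_SYSCTL_DIR) -> dict:
    """Write 1 to both sysctls (same as `echo 1 > ...`) and return what reads back."""
    result = {}
    for name in EXPIRY_SYSCTLS:
        path = base / name
        path.write_text("1\n")       # on /proc this raises OSError if the entry is absent
        result[name] = path.read_text().strip()
    return result


def dry_run() -> dict:
    """Run enable_expiry_sysctls against a throwaway directory instead of /proc."""
    with tempfile.TemporaryDirectory() as tmp:
        base = Path(tmp)
        for name in EXPIRY_SYSCTLS:
            (base / name).write_text("0\n")
        return enable_expiry_sysctls(base)
```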

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 12 Mar 2007

I have one solution, but it works only in the case of a transparent proxy setup. Instead of persistence, use the lblc scheduler (without persistence). lblc itself gives you some kind of persistence of 6 minutes or more. If 6 minutes is not enough for you, please look here: http://kb.linuxvirtualserver.org/wiki/Talk:Locality-Based_Least-Connection_Scheduling

-----------------------------------------------------------------

The material below is older, from when the persistence code was at an earlier stage of development. None of this code exists anymore.

Ted Pavlic tpavlic_list (at) netwalk (dot) com

Is this a bug or a feature of the PCC scheduling...

A person connects to the virtual server, gets direct routed to a machine. Before the time set to expire persistent connections, that real machine dies. mon sees that the machine died, and deletes the realserver entries until it comes back up.

But now that same person tries to connect to the virtual server again, and PCC *STILL* schedules them for the non-existent real server that is currently down. Is that a feature? I mean -- I can see how it would be good for small outages... so that a machine could come back up really quick and keep serving its old requests... YET... For long outages those particular people will have no luck.

Wensong

You can now set the timeout of the template masq entry to a small number, and the connection will expire soon.

Or, I will add some code to let each realserver entry keep a list of its template masq entries, and remove those template masq entries if the realserver entry is deleted.
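Wensong's second suggestion amounts to an index from each realserver to its templates. The minimal sketch below uses a hypothetical class, not the actual ip_vs data structures, to show why this helps. Removing a realserver touches only that realserver's own templates. The alternative is to scan the whole table, which Wensong later points out may hold a million entries.

```python
from collections import defaultdict


class TemplateTable:
    """Hypothetical sketch: persistence templates indexed by client and by realserver."""

    def __init__(self):
        self.by_client = {}                  # client IP -> realserver
        self.by_server = defaultdict(set)    # realserver -> client IPs holding a template

    def add(self, client: str, rs: str) -> None:
        old = self.by_client.get(client)
        if old is not None:
            self.by_server[old].discard(client)   # keep both indexes consistent
        self.by_client[client] = rs
        self.by_server[rs].add(client)

    def lookup(self, client: str):
        return self.by_client.get(client)

    def remove_server(self, rs: str) -> int:
        """Drop every template pointing at rs; work is proportional to rs's own templates."""
        clients = self.by_server.pop(rs, set())
        for client in clients:
            del self.by_client[client]
        return len(clients)
```

The trade-off is extra memory and bookkeeping on every template insert in exchange for a cheap flush when a realserver is deleted.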

To me, this seems most sensible. Lowering the timeouts has other effects, affecting general session persistence...

I agree with this. This was what I was hoping for when I sent the original message. I figure, if the server the person was connecting to went down, any persistence wouldn't be that useful when the server came back up. There might be temporary files in existence on that server that don't exist on another server, but otherwise... FTP or SSL or anything like that -- it might as well be brought up anew on another server.

Plus, any protocol that requires a persistent connection is probably one that the user will access frequently during one session. It makes more sense to bring that protocol up on another server than to wait for the old server to come back up; it will be more transparent to the user (even though they may have to completely re-connect once).

So, yes, deleting the entry when a realserver goes down sounds like the best choice. I think you'll find most other load balancers do something similar to this.

mike mike (at) bizittech (dot) com 28 Sep 2003

I am using LVS-DR to balance 4 MS servers. Due to the nature of the web application and the user behavior I had to set the connection timeout to 30 min.

Note
Joe: he does not specify whether he is using persistence and this is the persistence timeout from setting up with persistence or the tcpip idle timeout. Presumably it is the persistence timeout.

In case of failure of one of the realservers, users need to be forced to connect to a different server. That means the LVS tables need to be cleared of connections from clients to the failed box, so that any reconnect attempt will open a new connection to one of the functioning servers. I am using ldirectord to start up and monitor.

I am using ldirectord to poll the realserver for the result of an asp page. In case of failure it sets the weight to 0 on the ipvs rule. No new connections will be sent to the dead realserver, but on every retry the clients still try to connect to the dead realserver until the timeout of that connection. This is the expected behaviour according to the LVS documentation.

Joao Clemente jpcl (at) rnl (dot) ist (dot) utl (dot) pt

How do you delete the entry of the realserver?

Mike

Basically I'm using a similar rule to the one used to insert the virtual servers into lvs. It's something like this (I can't be 100% exact as I don't have access to my lvs box from home)

/sbin/ipvsadm -d -t $VIP:PORT -r $REALSERVER

Matthew Crocker matthew (at) crocker (dot) com 28 Sep 2003

Don't set the weight to 0, remove the realserver from the LVS table when it fails. When you remove the realserver from the table you also remove the information from the persistence table. Setting the weight to 0 is normally used for orderly shutdown of a realserver for maintenance.

Note
Joe: the entries for the current connections to the realserver stay in the ip_vs hash table until they timeout, even though they are no longer displayed in the default output of ipvsadm. These current connections can't be used with a dead realserver.

Peter Nash peter.nash (at) changeworks (dot) co (dot) uk 29 Sep 2003

I'm using LVS-NAT with persistence controlled by ldirectord. I've found that the "quiescent=" line in ldirectord.cf controls the behaviour you are looking for. If "quiescent=no", then when a realserver fails its LVS entries are removed from the table and clients immediately fail over to an alternate server. If "quiescent=yes", then when a realserver fails its entries remain in the LVS tables but the weight is set to 0, and clients will continue to try to connect to that server until the persistence expires. The default setting (on my installation) was "yes" and I had to change this to get the behaviour I wanted.

Rommel, Florian Florian (dot) Rommel (at) quartal (dot) com 28 Sep 2003

in your ldirectord.cf, add this line at the top (above your virtual section)

quiescent = no

It deletes the server entry from the table automatically if the server fails. Once the server is back up, it'll add it automatically. If that line is not set, the default is yes, which just sets the server to weight 0 and leaves the connections persistent. I had to look for a while to find that little line.

Mike

Thanks Florian Rommel and peter nash. That was it.

vilsalio (at) eupmt (dot) es

I don't know how I can remove the persistence when one of my realservers crash, without waiting for expiration of the timeout.

ratz 27 Nov 2003

Please refer to the sysctl. /proc/sys/net/ipv4/vs/expire_nodest_conn should do what you want.

Patrick Kormann pkormann (at) datacomm (dot) ch

I have the following problem: I have a direct routed 'cluster' of 4 proxies. My problem is that even if the proxy is taken out of the list of real servers, the persistent connection is still active, that means, that proxy is still used.

Andres Reiner

Now I found some strange behaviour using 'mon' for high availability. If a server goes down, it is correctly removed from the routing table. BUT if a client made a request prior to the server's failure, it will still be directed to the failed server afterwards. I guess this has something to do with the persistent connection setting (which is used for the ColdFusion applications/session variables).

In my understanding the LVS should, if a routing entry is deleted, no longer direct clients to the failed server even if the persistent connection setting is used.

Is there some option I missed or is it a bug ?

Wensong Zhang wrote:

No, you didn't miss anything and it is not a bug either. :)

In the current design of LVS, the connection won't be drastically removed; instead, its packets are silently dropped once the destination of the connection is down. This is because monitoring software may mark the server as temporarily down when the server is too busy, or the monitoring software may make errors. When the server is up again, the connection continues. If the server is not up for a while, then the client will time out. One thing is guaranteed: no new connections will be assigned to a server when it is down. When the client re-establishes the connection (e.g. presses reload/refresh in the browser), a new server will be assigned.

jacob (dot) rief (at) tis (dot) at wrote:

Unfortunately I have the same problem as Andres (see above). If I remove a realserver from a list of persistent virtual servers, this connection never times out, not even after the specified timeout has been reached.

Wensong

The persistent template won't time out until all its connections time out. After all the connections from the same client expire, new connections can be assigned to one of the remaining servers. You can use "ipchains -M -L -n" (or netstat -M) to check the connection table (for 2.4.x use cat /proc/net/ip_conntrack).

Only if I unset persistence is the connection redirected onto the remaining realservers. Now if I turn on persistence again, a previously attached client does not reconnect anymore; it seems as if LVS remembers such clients. It does not even help if I delete the whole virtual service and restore it immediately, in the hope of clearing the persistence tables.

director:/etc/lvs# ipvsadm -D -t <VIP>; ipvsadm -A -t <VIP> -p; ipvsadm -a -t <VIP> -r <alive realserver>

And it also does not help to close the browser and restart it. I run LVS in masquerading mode on a 2.2.13 kernel patched with ipvs-0.9.5. Wouldn't it be a nice feature to flush the persistent client connection table, and/or list all such connections?

Wensong

There are several reasons that I didn't do it in the current code. One is that it is time-consuming to search a big table (maybe one million entries) to flush the connections destined for the dead server; the other is that the template won't expire until its connections expire, so the client will be assigned to the same server as long as there is a connection that has not expired. Anyway, I will think about a better way to solve this problem.

27.8. Load Balancing time constant is longer with persistence

(This is from a thread on 'Preference' instead of 'persistence' started by Martijn Klingens on 2002-10-08.)

Load balancing occurs with a time constant of the connection to the LVS. For a non-persistent connection like http, with FIN_WAIT=2mins, loads will balance on a time scale longer than 2mins. At shorter time scales, the loads will not be balanced. For persistence with a persistence time out of 30mins, load balancing will require times greater than 30mins (like several hours).

This problem is related to the unbalance caused by proxy farms (e.g. AOL).
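A small simulation illustrates why per-client stickiness balances only coarsely. To the director, a single proxy IP carrying many users is just one "client". The function and client names are hypothetical; the simulation assumes round robin scheduling and ignores template expiry.

```python
from collections import Counter
from itertools import cycle


def balance(requests, servers, persistent):
    """Count connections per realserver under round robin scheduling.

    requests: client IPs, one entry per new TCP connection, in arrival order.
    With persistent=True only a client's first connection is scheduled; the
    rest follow its template (templates never expire in this toy model).
    """
    rr = cycle(servers)
    templates = {}
    load = Counter()
    for client in requests:
        if persistent:
            if client not in templates:
                templates[client] = next(rr)
            rs = templates[client]
        else:
            rs = next(rr)
        load[rs] += 1
    return load


# One proxy IP carrying many users, plus three ordinary clients.
REQUESTS = ["proxy"] * 100 + ["c1", "c2", "c3"]
```

Per-connection scheduling splits the 103 connections 52/51. With persistence, all 100 of the proxy's connections land on rs1, giving 101/2 until the template expires and the client is scheduled afresh.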

27.9. The tcp NONE flag

Malcolm Turnbull malcolm (at) loadbalancer (dot) org 2005/04/26

What does the TCP flag NONE mean? When I make a connection through LVS to a real server and look in the connection table I normally get

TCP 17:24 ESTABLISHED 173.19.13.214:1736 173.19.15.175:80 173.19.12.243:80 

And everything including persistence works as expected But when I connect using a bit of javascript from IE(client side) I get :

TCP 17:24 NONE 173.19.13.214:0 173.19.15.175:80 173.19.12.243:80 

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com

This is the way LVS manages persistence: by creating a NONE connection in the connection table.

And the first connection gets a 404 error, further refreshes work fine, and then persistence doesn't seem to work? Do connections with a status of NONE not get put in the persistence table? The javascript is refreshing a page from the server every 1 minute. If you set the javascript to go every 10mins you get far more 404 errors.

This looks strange. Your persistence timeout seems to be about 20min (the time to timeout is just after the string "TCP"), so 1min or 10min should be the same. I would suspect a problem in the js itself. Look at your server error_log.

This post (http://www.in-addr.de/pipermail/lvs-users/2005-February/013235.html) suggested dropping all TCP NONE packets as they weren't required.

Your servers do not receive NONE TCP connections; they are created locally and are just there for persistence management purposes.
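Templates appear in `ipvsadm -Lc` output with state NONE and client port 0, so they are easy to separate from real connections when post-processing the listing. The sketch below assumes the six-column TCP layout shown in this section (protocol, expire, state, source, virtual, destination). Other protocols or ipvsadm versions may format lines differently. The class and function names are hypothetical.

```python
from typing import NamedTuple


class IpvsEntry(NamedTuple):
    proto: str
    expire: str
    state: str
    source: str
    virtual: str
    destination: str

    @property
    def is_template(self) -> bool:
        # Persistence templates are listed with state NONE and client port 0.
        return self.state == "NONE" and self.source.endswith(":0")


def parse_lc(text: str) -> list:
    """Parse `ipvsadm -Lc` style lines; headers and unrecognised lines are skipped."""
    entries = []
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 6 and fields[0] in ("TCP", "UDP"):
            entries.append(IpvsEntry(*fields))
    return entries


# Header line plus the two entries Malcolm posted above.
SAMPLE = """\
pro expire state       source             virtual          destination
TCP 17:24 NONE        173.19.13.214:0    173.19.15.175:80 173.19.12.243:80
TCP 17:24 ESTABLISHED 173.19.13.214:1736 173.19.15.175:80 173.19.12.243:80
"""
```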

27.10. Resetting the persistence timeout counter (persistence behaviour for short timeout values)

Terry Green tgreen (at) mitra (dot) com 2003-02-06

the LVS-HOWTO states:

With persistent connection, the connection table doesn't clear till the persistence timeout (set with ipvsadm) time after the last client disconnects.

This appears to be not quite true. (In the following tests I'm using Kernel 2.4.19 with patch 1.0.7)

Testing/Observations - using a port 80 definition with 5 minute persistence (keepalived being used to do the configs).

# ipvsadm
  IP Virtual Server version 1.0.7 (size=16384)
  Prot LocalAddress:Port Scheduler Flags
     -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
  TCP  devlivelink:http rr persistent 300
     -> devlivelink2:http            Route   1      0          0
     -> devlivelink1:http            Route   1      0          0
  TCP  devlivelink:https rr persistent 300
     -> devlivelink2:https           Route   1      0          0
     -> devlivelink1:https           Route   1      0          0

I start a connection to the web server for purposes of downloading a large file (which will take more than 5 minutes). Every time I connect from the client, I see the connection template timeout reset to 5 minutes, as you would expect from the persistence timeout value (300sec).

# ipvsadm -Lc
  TCP 04:59  NONE        greenblade.mitra.com:0 devlivelink:http devlivelink1:http
  TCP 00:03  FIN_WAIT    greenblade.mitra.com:51330 devlivelink:http devlivelink1:http
  TCP 00:02  FIN_WAIT    greenblade.mitra.com:51329 devlivelink:http devlivelink1:http
Note

I've shortened the TCP timeouts for testing purposes with ipvsadm --set 5 4 0

However, if the template record's timer runs out while there is still an active connection, the record is kept, but its time is reset to the IP_VS_S_TIME_WAIT constant (defaulting to 2 minutes in ip_vs_conn.c) rather than to the persistence time set for this session. Further, the data structure for the connection template appears to have been corrupted, as any further connections from the client reset the template time to 2 minutes instead of the original persistence time.

To verify this, I changed line 317 of ip_vs_conn.c from

        [IP_VS_S_TIME_WAIT]     =       2*60*HZ,

to

        [IP_VS_S_TIME_WAIT]     =       2*50*HZ,

and recompiled the kernel

Rerunning the tests, I see the connection template record being reset to 1:40 instead of 2:00. Here are the IPVS connection entries (output of ipvsadm -Lc) as time progresses.

pro expire state      source                     virtual          destination
TCP 04:50 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:49 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:48 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 04:47 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

.
.

TCP 00:02 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:04 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 00:01 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

(here being reset to 1:40)

TCP 01:39 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

TCP 01:38 NONE        greenblade.mitra.com:0     devlivelink:http devlivelink2:http
TCP 00:05 ESTABLISHED greenblade.mitra.com:51356 devlivelink:http devlivelink2:http

Julian

Yes, as implemented, the persistence timeout guarantees affinity starting from the first connection. It lasts _after_ the last connection from this "session" is terminated. There is still no option to say "persistence time starts for each connection"; it could be useful.

Terry

Agree completely. However, I expected the template record to be reset to the session persistence time, not to the value of IP_VS_S_TIME_WAIT.

Julian

The persistence timeout is used only once: when the first connection from this client is established. The current meaning is that the persistence time covers the period after the client first appears. It is extended if there are still active connections. Then there are 3 (or more) options:

  1. extend it again with the persistent time
  2. extend it with 2mins
  3. use the persistence time after the last connection from client terminates

The second option is the one implemented, as we found it was the behaviour the users expected :) A long time ago my opinion was that it is better to use the persistence time when the last connection terminates (item 3 above). We could make this a config option if anyone wants it.

Maybe you saw the value 20 seconds after the 2-minute cycle restarted. The template is "reset" only when its timer expires, not when the controlled connections expire.

Terry

Nope - perhaps I wasn't clear... I was watching ipvsadm -Lc every second. I did the tests originally and saw the template record being reset to 2 minutes if it expired with an active connection (even though the persistence setting for the connection was NOT 2 minutes). Then I made another connection from the client, and the template record was reset again to 2 minutes (not to the persistence setting), suggesting the template record data structure had somehow had its persistence time changed from the original setting to 2 minutes.

Then, to prove to myself that my reading of the source was accurate, I hacked the source to make IP_VS_S_TIME_WAIT 2*50*HZ instead of 2*60*HZ, and with the newly compiled kernel, the template record started being reset to 100 seconds when it expired with an active connection.

My expectation would have been that the template record's timer would get reset to the session persistence value rather than to IP_VS_S_TIME_WAIT.

True, your reading is accurate :) I now see why it was 1:40

Joe - other people have found this behaviour too

chulmin2 (at) hotmail (dot) com 2003-02-11

I have set the persistence timeout to 30s

ipvsadm -A -t 211.1.1.1:80 -p 30

after I connected, I confirmed the settings

# ipvsadm -Lc
TCP 00:30.00 NONE        211.1.1.2:0     211.1.1.1:http 192.168.1.3:http
TCP 02:00.00 TIME_WAIT   211.1.1.2:40929 211.1.1.1:http 192.168.1.3:http

But after 30s the timeout returns to 2 mins.

TCP 02:00.00 NONE        211.1.1.2:0     211.1.1.1:http 192.168.1.3:http
   ~~~~~~~~~~
TCP 01:30.00 TIME_WAIT   211.1.1.2:40952 211.1.1.1:http 192.168.1.3:http

Here's Terry's summary:

I observed the same behavior, and traced it down to the scenario where the template record times out with valid connection records still counting down. In this case, the template record is reset to 2 minutes (actually, to the value of the IP_VS_S_TIME_WAIT constant). When this happens, the data structure record representing the template connection also gets altered, because any further connections from the client reset the template record to 2 minutes (NOT the original session persistence time).

The replies I got from Julian suggested that this behavior was intended (and thus, I would suggest, the documentation is slightly inaccurate). I didn't pursue it too far, as this only showed up when I was using really short persistence times for testing purposes. I don't expect it will happen too often or have too much impact when using a more practical session timeout.
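The timer behaviour described by Julian and Terry can be modelled in a few lines. This is a hypothetical sketch that reproduces the observations above, not a transcription of ip_vs_conn.c. It makes three assumptions:

  • each new connection refreshes the template to its stored timeout;
  • if the timer fires while connections are still attached, the template is re-armed with IP_VS_S_TIME_WAIT (120 s by default);
  • that re-armed value also replaces the stored timeout.

```python
IP_VS_S_TIME_WAIT = 120   # seconds: the 2*60*HZ default quoted from ip_vs_conn.c above


class Template:
    """Hypothetical model of a persistence template's timer (not kernel code)."""

    def __init__(self, persistence_timeout: int):
        self.timeout = persistence_timeout      # value used whenever the timer is refreshed
        self.expires_in = persistence_timeout
        self.controlled = 0                     # connections still attached to the template

    def new_connection(self) -> None:
        self.controlled += 1
        self.expires_in = self.timeout          # each new client connection refreshes it

    def connection_closed(self) -> None:
        self.controlled -= 1

    def tick(self, seconds: int) -> bool:
        """Advance time; return False once the template has really expired."""
        self.expires_in -= seconds
        if self.expires_in > 0:
            return True
        if self.controlled > 0:
            # Timer fired with connections still attached: re-arm with 2 minutes,
            # and that value now also replaces the stored persistence timeout.
            self.timeout = IP_VS_S_TIME_WAIT
            self.expires_in = IP_VS_S_TIME_WAIT
            return True
        return False
```

With `Template(30)`, as in chulmin's `-p 30` test, the template starts at 30 s. It jumps to 120 s when it expires with a connection still attached, and later connections refresh it to 120 s rather than 30 s.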

27.11. Why you don't want persistence for your e-commerce site: why you should rewrite your application

Malcolm Turnbull Malcolm.Turnbull (at) crocus (dot) co (dot) uk 18 Sep 2002

The main problem with using persistence for session variable tracking is that the only thing you are gaining by using LVS is increased performance. You are not getting any high availability, i.e. if your realserver falls over during a persistent SSL session, you lose your shopping basket (or whatever).

Anyone using ASP/IIS will be well used to the service restarting all the time due to the 64MB ASP memory limit in IIS5 (wonder if they'll raise this in .net)

My wife always leaves web sites open for things like holidays/hotels etc. so that when I come home I can see them... Often as soon as I click anything I lose the session.. :-(

Bad design. To save money on re-coding, code it properly in the first place.

Joe Stump joe (at) joestump (dot) net 22 Nov 2002 (replying to another thread)

What Joe is trying to get at here (and this would apply to you PHP session users out there as well) is that your realservers should have access to your session files. The simple solution is a shared drive (under Windows) or an NFS mount (under *NIX). Other solutions include NetApps, SANs, etc. The problem Devendra was having is that the session files exist independently on each of the realservers. When a realserver dies or is taken out of the RAIC, all of the session files on that realserver are gone. If they were in a central location where all servers had access, they wouldn't die.
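A toy illustration of this point follows. A Python dict stands in for the shared NFS directory or database, and the `Realserver` class and `cart_after_failover` helper are hypothetical. When each realserver keeps its own store, a failover loses the cart. When all realservers share one store, the next realserver sees the cart.

```python
class Realserver:
    """Hypothetical web application node; `store` holds its session data."""

    def __init__(self, name: str, store: dict):
        self.name = name
        self.store = store   # a private dict per server, or one dict shared by all

    def add_to_cart(self, session_id: str, item: str) -> None:
        self.store.setdefault(session_id, []).append(item)

    def cart(self, session_id: str) -> list:
        return list(self.store.get(session_id, []))


def cart_after_failover(shared: bool) -> list:
    """Fill a cart on rs1, 'lose' rs1, then ask rs2 for the same session."""
    common = {}
    rs1 = Realserver("rs1", common if shared else {})
    rs2 = Realserver("rs2", common if shared else {})
    rs1.add_to_cart("sess-42", "book")
    return rs2.cart("sess-42")
```

A real shared store also has to handle concurrent writers, which is where the locking issues mentioned below come from.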

Joe

I've sat on https sites for more than 30mins of inactivity. Also I've had the modem line drop on me in the middle of filling in forms on badly written websites (e.g. registering a domain name); when I come back, I have a new IP. I expect anyone who wants to do internet business to handle these problems seamlessly.

Roberto Nibali ratz (at) tac (dot) ch 10 Sep 2002

Exactly, and most of the time you've got non-technical stakeholders or managers in the back who will rip your head off if that happens.

Persistence only gets you so far here, since memory requirements limit you to the number of connections maintained.

Yes, memory and timeout constraints combined in a linear fashion.

Ratz's idea (in the HOWTO) is to redesign the application. He can do that. Not everyone can. He maintains state data on the servers with a database.

Everyone can, and the other people can work with Tomcat's internal state replication module to do that. But it was slow the last time I tested it (1 year ago) and tends to have nasty locking issues.

Alternatively, in php3 you could encode the state information in the URL that the client moves to on the next click (this functions the same as cookies).

Note
Joe Dec 2003: I thought this was the solution for quite a while. However I now find that since all the data is encoded in a long string as part of the URL, the client can manipulate it, making the data at the client and server different. You do not want this to happen.

If you can't rewrite the application, then you'll risk losing some customers, and I would say that LVS is not for you.

DoS problems are difficult for everybody. With persistence it's just worse.

DoS problems are not to be solved on the LVS box.

Matthias Krauss 10 Sep 2002

What is the maximum timeout value?

Joe

There is no maximum value. However the connection underneath will timeout eventually, and you will start to use a lot of memory with a large number of connections.

Julian

Note that setting 0 as the RS weight is treated as "temporarily stopped": the existing connections continue to work. It is assumed that the RS weight is set to 0 some time before deleting the RS. This way we give all connections/sessions time to terminate gracefully. Sometimes weight 0 can be used by health checks as a step before deleting the RS. Such a two-step realserver shutdown can avoid temporary unavailability of the realserver: a graceful stop. At least the health checks can choose whether to stop the RS before deleting it.
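The two-step shutdown Julian describes maps onto two ipvsadm invocations:

  • `-e ... -w 0` to quiesce the realserver;
  • `-d` to delete it once the active connections have drained.

This sketch only builds the argument lists; you could pass them to `subprocess.run` on the director. It also reads ActiveConn from `ipvsadm -L -n` style output. The function names and addresses are placeholders. The default forwarding flag (`-g`) is an assumption: pass the flag the service actually uses (`-g`, `-i` or `-m`), since it is safest to restate it when editing.

```python
def graceful_removal_commands(vip: str, rip: str, fwd: str = "-g") -> list:
    """Build the two ipvsadm calls for a quiesce-then-delete realserver shutdown.

    vip and rip are "address:port" strings; fwd is the service's existing
    forwarding flag (-g, -i or -m), restated when editing the realserver.
    """
    return [
        # Step 1: weight 0 stops new connections; existing ones keep working.
        ["ipvsadm", "-e", "-t", vip, "-r", rip, fwd, "-w", "0"],
        # Step 2 (once ActiveConn has drained): delete the realserver entry.
        ["ipvsadm", "-d", "-t", vip, "-r", rip],
    ]


def active_conns(listing: str, rip: str) -> int:
    """Read the ActiveConn column for rip from `ipvsadm -L -n` style output."""
    for line in listing.splitlines():
        fields = line.split()
        # Realserver lines look like: -> RIP:port Forward Weight ActiveConn InActConn
        if len(fields) == 6 and fields[0] == "->" and fields[1] == rip:
            return int(fields[4])
    raise KeyError(rip)
```

A maintenance script would run the first command and poll `active_conns` until it reaches 0 (or a deadline passes). It would then run the second command.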

Roberto Nibali ratz (at) tac (dot) ch 13 Sep 2002

It's also useful for introducing a service window for maintenance work. If you have a service level agreement allowing only a few minutes of downtime a year and you need to exchange the HD of one RS, you can quiesce that particular RS about 1 hour before the maintenance work. If by then you have a reasonably low (or, most of the time, even zero) active connection rate, you unplug the cable, shut down the server and fix the problem. Then you put it back in, set the weight>0 and off you go.

If the RS is deleted the traffic for existing conns is stopped (and if expire_nodest_conn sysctl var is set the conn entries are even deleted). Of course, if for some connections we don't see packets these conns can remain in the table until their timer is expired.
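The two-step shutdown described above can be sketched as follows. This is a minimal sketch, assuming a service on port 80 and hypothetical VIP/RIP addresses; only the helper that reads the ActiveConn column from `ipvsadm -L -n` output is live code, while the ipvsadm commands are left as comments because they must be run as root on the director.

```shell
#!/bin/sh
# Hypothetical service and realserver addresses; substitute your own.
VIP=192.168.1.110
RIP=192.168.1.11

# Read `ipvsadm -L -n` output on stdin and print the ActiveConn column
# for one realserver ($1 = address:port). A realserver line looks like:
#   -> RemoteAddress:Port  Forward  Weight  ActiveConn  InActConn
# If the realserver belongs to several virtual services, one line per
# service is printed.
active_conns() {
    awk -v rs="$1" '$1 == "->" && $2 == rs { print $5 }'
}

# Step 1: quiesce (weight 0: no new connections, existing ones carry on).
#   Keep the forwarding flag (-g/-m/-i) the server was added with.
#   ipvsadm -e -t $VIP:80 -r $RIP:80 -g -w 0
# Step 2: wait until no active connections remain.
#   while [ "$(ipvsadm -L -n | active_conns $RIP:80)" != 0 ]; do sleep 10; done
# Step 3: delete the realserver, then unplug it and do the work.
#   ipvsadm -d -t $VIP:80 -r $RIP:80
```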

Bobby Johns bobbyj (at) freebie (dot) com 13 Sep 2002

When you add in the persistence problem I suspect you're doing something that's a bad idea. I suspect the reason you need persistence (or think you do) is because you're storing state or session information locally on each web server. Although it may work, it's a weak design for a web app. If you want a high performance solution, use a common server with something like MySQL on it to hold the session or state information. If you're nervous about the single point of failure on the database box, add a replicated server behind it. Keeping state info on each web server is just a weak solution in a highly-available high-performance environment. Hardware is pretty cheap in comparison.

I would suggest 2 LVS servers running HA between them, 2 or more web servers, and 2 session/state db servers running replicated. Bang for the buck, it's a good solution and gives you a pretty resilient, robust, and scalable system. The system you are trying to implement now will hammer 33% of your user sessions if you have a web server failure and ALL of them if you have an LVS server failure. With the proper monitoring and HA, no single machine failure will hammer your users in the system I suggest. For the price of 6 or 7 Linux server boxes, you have what people used to pay more than $100K for just a few years ago.

27.12. more about e-commerce sites: we used to think memory was the problem - it isn't

The original idea of persistence was to allow for connections like https sessions. This solved the problem of keeping the client's connection on the same realserver. However it doesn't work well.

The first problem is that it uses a lot of memory. The default timeout for LVS persistence is somewhere around 360secs, while the default timeout for a regular LVS connection via LVS-DR is TIME_WAIT (about 1 minute). This means that persistent connections will stay in the LVS connection table about 6 times longer than regular ones. As a consequence the hash table (and memory requirements) will be 6 times larger for the same number of connections/sec. Make sure you have enough memory to hold the increased table size if you're using persistent connections. If the persistence is being used to hold state (e.g. a shopping cart), then you must allow a long enough timeout for the client to surf to another site for a better price, make a cup of coffee, think about it and then go find their credit card. This is going to be much longer than any reasonable timeout for LVS persistence, so the state information will have to be held on a disk somewhere on the realservers, and you'll have to allow for the client to appear on a different realserver later with their credit card information.
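The proportionality is easy to check with a rough sketch in shell arithmetic. It assumes about 128 bytes per connection entry (the figure quoted later in this section) and a hypothetical rate of 100 new connections/sec; the 60s and 360s timeouts are the ones from the paragraph above. Hash-bucket and allocator overhead are ignored.

```shell
#!/bin/sh
# Rough LVS connection-table size: entries = rate * timeout.
# Assumption: ~128 bytes per entry; real kernels may use a bit more.
ENTRY_BYTES=128

table_entries() {   # $1 = new connections/sec, $2 = seconds an entry lives
    echo $(( $1 * $2 ))
}

table_bytes() {     # same arguments; result in bytes
    echo $(( $(table_entries "$1" "$2") * ENTRY_BYTES ))
}

RATE=100            # hypothetical load: 100 new connections/sec
echo "regular (TIME_WAIT, ~60s): $(table_bytes "$RATE" 60) bytes"
echo "persistent (360s):         $(table_bytes "$RATE" 360) bytes"
```

At this rate the persistent table needs 4608000 bytes against 768000 for regular connections; the 6x ratio is simply 360/60 and does not depend on the rate.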

The next problem is that persistence doesn't allow for failover.

The memory problem really isn't as bad as was originally thought. Here are some exchanges from the mailing list which talk about the real problems.

Joe 18 Sep 2002

The conventional LVS wisdom is that it's not a good idea to build an LVS e-commerce website in which https is persistent for long periods. The initial idea was that a long timeout allows the customer to have a cup of coffee or surf to other websites while thinking about their on-line purchase.

Julian 18 Sep 2002

Yes, if your site uses persistence for HTTP/HTTPS then you had better use cookies (not LVS). If you don't care about HTTPS persistence (any realserver can serve connections from one client "session") then you create a normal service. In that case you have to take care of the backend DB.

The problem with this approach is that the amount of memory use is expected to be large and the director will run out of memory. We've been telling people to rewrite their application so that state is maintained on the realservers allowing the customer to take an indefinite time to complete their purchase. Currently 1G of memory costs about an hour of programmer's time (+ benefits, + office rental/heating/airconditioning/equipment + support staff). Since memory is cheap compared to the cost of rewriting your application, I was wondering if brute force might just be acceptable. I can't find any estimates of the numbers involved in the HOWTO although similar situations have been discussed on the mailing list e.g.

http://marc.theaimsgroup.com/?l=linux-virtual-server&m=99200010425473&w=2

There, the calculation was done to see how long a director would hold up under a DoS. The answer was about 100secs for 128M memory and a 100Mbps link to the attacker doing a SYN flood. I'm not running one of these web sites and I don't know the real numbers here. Is amazon.com or ebay connected by 100Mbps to the outside world?

What can you do with 1G of memory on the director? Each connection requires 128 bytes. 1G/128 is 8M customers online at any one time. Assuming everyone buys something, this is 1500 purchases/sec. You'd need the population of a large town just to handle shipping stuff at this rate. I doubt if any website at peak load has 8M simultaneous customers.
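The 8M figure can be reproduced with a one-function sketch. It assumes 128 bytes per entry and no hash-table or allocator overhead, so treat the result as an optimistic upper bound.

```shell
#!/bin/sh
# How many connection entries fit in a given amount of director memory?
# Assumption: ~128 bytes per entry, no hash-table or allocator overhead.
ENTRY_BYTES=128

max_entries() {   # $1 = memory in bytes
    echo $(( $1 / ENTRY_BYTES ))
}

GIG=$(( 1024 * 1024 * 1024 ))
echo "1G of memory holds about $(max_entries "$GIG") entries"
```

This prints 8388608, the "8M" in the text.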

However you only have 64k ports on each realserver to connect with customers, allowing only 64k customers/realserver.

Note that the port limit is only between two IPs. You still can reuse one port for many connections if the two connections don't have same ends (IP and port).

How much memory do you need on the director to handle a fully connected realserver?

64k x 128 = 8M

Let's say there are 8 realservers. How much memory is needed on the director?

8 x 8M = 64M
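The same arithmetic for the port-limited case, as a sketch. It follows the simple model used in the question (64k client ports per realserver, 128 bytes per entry), even though, as Julian points out, the port limit really applies per pair of IPs.

```shell
#!/bin/sh
# Director memory needed if every realserver has all 64k ports in use.
ENTRY_BYTES=128           # assumed bytes per connection entry
PORTS=$(( 64 * 1024 ))    # 64k ports per realserver (simple model)

director_bytes() {        # $1 = number of realservers
    echo $(( $1 * PORTS * ENTRY_BYTES ))
}

echo "1 realserver:  $(director_bytes 1) bytes"
echo "8 realservers: $(director_bytes 8) bytes"
```

This prints 8388608 (8M) and 67108864 (64M) bytes.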

This is not a lot of memory. So the problem isn't memory but realserver ports, AFAIK.

No, you don't waste realserver ports for connections from client to the LVS. But using many sockets in realserver hurts. Memory for sockets is a problem, sometimes the sockets can reserve huge buffers for data.

What is the minimum throughput of customers assuming they all take 4000 sec (66 mins) to make their purchase?

8 x 64k/4000 = about 130 purchases/sec

You're still going to need to hire a few people to pack and ship all this stuff. If people only take 6 mins for their purchase, you'll be shipping about 1450 packages/sec.

Assuming you make $10/purchase at 130 purchases/sec, that's about $41G/yr.

So with 64M of memory, 8 realservers, 4000sec persistence timeout, and a margin of $10/purchase I can make a profit of about $41G/yr.

It seems memory is not the problem here, but realserver ports (or being able to ship all the items you sell).
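The throughput and margin figures can be recomputed with a short sketch. It assumes, as the argument above does, that each of the 64k slots per realserver is held for the whole purchase time; the $10 margin and a 365-day year are the only other inputs. Integer shell arithmetic rounds down.

```shell
#!/bin/sh
# Steady-state purchase rate when every slot is held for the full time.
PORTS=$(( 64 * 1024 ))               # 64k client slots per realserver
SECS_PER_YEAR=$(( 365 * 24 * 3600 ))

purchases_per_sec() {   # $1 = realservers, $2 = seconds per purchase
    echo $(( $1 * PORTS / $2 ))
}

yearly_margin() {       # $1 = purchases/sec, $2 = margin per purchase ($)
    echo $(( $1 * $2 * SECS_PER_YEAR ))
}

RATE=$(purchases_per_sec 8 4000)
echo "$RATE purchases/sec"
echo "margin: $(yearly_margin "$RATE" 10) dollars/yr"
```

`purchases_per_sec 8 360` gives 1456 (the 6-minute case), and `purchases_per_sec 1 256` gives 256, the per-realserver squid figure in the next paragraph.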

Let's look at another use of persistence - for squids (despite the arrival of the -DH scheduler, some people prefer persistence for squids).

Here you aren't limited by shipping and handling of purchases. Instead you are just shipping packets to the various target httpd servers on the internet. You are still limited to 64k clients/realserver. Assume you set persistence = 256secs (any client who is idle for that time is not interested in performance). This means that the throughput/realserver is 256 hits/sec. This isn't great. I don't know what throughput to expect out of a squid, but I suspect it's a lot more.

Ratz

Well, it depends what you want to offer. If it's an online shop like amazon.com you certainly want to store the generated cookie or whatever it is on a central DB cluster which every RS can connect to and request the ID from if it doesn't already have one.

The memory is a completely different layer. It's about software engineering and not about saving money. Yes, you can probably kill the problem temporarily by adding more memory, but a broken application framework remains a broken application framework.

Plus, normally when you build an e-commerce site, you have a customer that has outsourced this task to your company. So you do a C-requirement and a feasibility study to provide the customer with a proper cost estimation. Now suppose the application is built in a broken way, so that you need to either fix it or (in our case) add more RAM. The big problem here is:

  • you might have a strict SLA that doesn't permit this
  • you change the C-requirements and thus you need a new test phase
  • the customer gets upset because she spent big bucks on you

It's lack of engineering and a typical situation of plain incompetence: When you earnestly believe you can compensate for a lack of skill by doubling your efforts, there's no end to what you can't do.

But all this also depends on the situation. I don't think we can give people a generalised view of how things have to be done. One might argue that people come to this project because of monetary constraints and they sure do not care about the application if the problem is solved by putting more RAM into the director.

I, for example, would rather spend a few bucks on good hardware and a lot of RAM for the RS, because they need to carry the execution weight of the application. The director is just a more or less intelligent router.

pb (who has 1GB of memory and who wants to increase his persistence time to 60mins)

We handle 1 million messages a day and 20,000+ webmail users, thus 125,000 messages per hour sent/received in an 8 hour work day. Would changing persistence on the LVS from 15 to 60 min take up a lot of memory and processing (CPU/load) overhead? We're running 1GB of memory and dual Pentium IIIs.

Malcolm Turnbull malcolm (at) loadbalanceri (dot) org 27 Apr 2004

I would think it would be fine, 1 GB should handle almost 8 million connections in the timeout period i.e. 60 mins (or 2mins with no persistence).

Horms horms (at) verge (dot) net (dot) au 27 Apr 2004

I think Malcolm is on the money here. Keep in mind that each connection entry / persistence entry consumes something like 128 bytes (actually, it might be a bit bigger now, but it is still in that ball-park). You can do the maths (actually you should, my brain has already checked out for the day), but if you are getting 100 connections/s for an hour, each from a unique host, then you are still only going to end up using about 45MB of memory for persistence entries. I doubt that will hurt you. I would also be surprised if you are getting connections from 360,000 unique hosts per hour :-)

Joe

For 4 realservers that's 40k messages/hr. I don't know how many tcp connections are required for a message transfer, but let's say it's 1. You have 15mins persistence, so 10k connections will be in existence at any one time.

For memory for the ipvs hash table: at 128 bytes/connection, that's 1.28M of memory for the ipvs hash table. You have quite a margin with memory.

For disk and network I/O:

Let's say the average e-mail is 10kB. Each realserver is processing (10k messages/(15*60) secs) * 10kB = 0.1MBytes/sec. Your disks and network also have large margins of safety.
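Joe's estimate can be re-run as a small sketch. The inputs are the ones from the text (40k messages/hr per realserver, 15 min persistence, one connection per message, 10kB per message, 128 bytes per entry). The director's table covers all realservers, so for 4 realservers multiply the table size by 4 (about 5MB, still small).

```shell
#!/bin/sh
# Re-run Joe's estimate for one realserver (numbers taken from the text).
ENTRY_BYTES=128              # approx. bytes per connection entry
MSGS_PER_HOUR=40000          # messages/hr per realserver
PERSIST_SECS=$(( 15 * 60 ))  # 15 min persistence
MSG_KB=10                    # assumed average message size in kB

# connections alive at once = arrival rate * time each one is kept
live_conns() {
    echo $(( MSGS_PER_HOUR * PERSIST_SECS / 3600 ))
}

LIVE=$(live_conns)
echo "entries: $LIVE"
echo "table:   $(( LIVE * ENTRY_BYTES )) bytes"
echo "I/O:     $(( LIVE * MSG_KB / PERSIST_SECS )) kB/s"
```

This prints 10000 entries, 1280000 bytes and 111 kB/s, matching the 1.28M and 0.1MBytes/sec above.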

The error that just a couple of random people are having with WEBMAIL is "invalid session ID", as though they lost their connection to the realserver (actually a "message director") they were on. But I don't know if it is the "message director's" fault, or LVS's.

I have no idea, but I don't see any heavy load here. Are the clients timing out after 15mins and attempting to continue their session? Wouldn't the app/client know that the session has been closed and go through the whole login procedure again? I don't know much about your app, I'm sorry.

Nothing here addresses the issue of the persistence timeout. That is determined by how long you allow the client to be disconnected before the state changes made on the realserver during the last connection are propagated to all realservers.

27.13. persistence with windows realservers

With unix realservers, we've been encouraging developers to rewrite the application (see rewriting your e-commerce application), to save client state in a failover safe fashion (i.e. in a place accessible by all realservers). Previously you would ask the client to accept a cookie or save the client state on the realserver to which the client connects (but where the state will be lost if the realserver fails). Rewriting the application is possible with unix, which gives you access to the primitives and you can build the application any way you want (provided that you have enough time and you understand the primitives well enough).

With Windows, you aren't given access to the primitives, but instead are given access to an API. If Windows has already coded up the function you want and you are happy to use that, then it's easy. If you want something else, you're SOL.

devendra orion dev_orion (at) yahoo (dot) com 22 Nov 2002

I need to enable load balancing on our current director machine (LVS-NAT enabled). We currently have 3 realservers serving the same website, which need to be load balanced. The website is hosted on w2k and uses IIS session management (no cookies). The only problem is that we need to keep this session alive for basically 8 hrs, as our clients access the application continuously. How can I configure the load balancer to keep the connection persistent to the same server after a successful client login?

Joe (giving the party line)

The best solution is not to use persistence, but to re-write the application, so that the state information is stored in a place accessible to all realservers. In this way, if one realserver fails, the session with the client can continue.

Alex Kramarov alex (at) incredimail (dot) com 22 Nov 2002

I continuously hear suggestions on this list to rewrite applications to use session management other than the one that comes with IIS. As a Windows/Unix developer/administrator (you can mix and match any one of the 2 groups ;), I would really like to say that usually this is not that easy in the IIS environment, especially if you try to tell this to Windows-only developers who don't know anything other than the MS way to do things. The best they can hope for is to wait for the upcoming IIS 6 release, which includes session management meant for use in webfarms (db based), or to try some non-Microsoft (still proprietary) solutions that try to do the same, like the frameWERKS framework, which functions as a drop-in replacement for the IIS session components.

I am not saying this to start an MS war on the list, but only to say that when an MS-inclined person hears that he should "re-write the application", there is a 95% chance that this will be his last try at using LVS for his solutions. On the other hand, saying that there is such-and-such a solution that can help him will probably be considered...

(Alex has given an explanation of IIS session management below).

Tim Cronin tim (at) 13-colonies (dot) com 22 Nov 2002

We use IIS with sessions and LVS-NAT, with wlc and persistence. The persistence time must match the IIS session timeout (iis.session.timeout). We haven't had any problems, but we only have a 20min session.
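A minimal sketch of Tim's rule, assuming a 20-minute IIS session timeout and hypothetical VIP/RIP addresses. ipvsadm's -p option takes seconds, so the IIS value (in minutes) has to be converted. The script only prints the commands, so they can be checked before being run on the director.

```shell
#!/bin/sh
# Hypothetical addresses; substitute your own.
VIP=192.168.1.110
RIPS="10.1.1.2 10.1.1.3 10.1.1.4"
IIS_SESSION_MINUTES=20       # should match the IIS session timeout

# ipvsadm -p expects seconds
persist_seconds() {          # $1 = minutes
    echo $(( $1 * 60 ))
}

# LVS-NAT (-m), wlc scheduler, persistence equal to the IIS session timeout
echo "ipvsadm -A -t $VIP:80 -s wlc -p $(persist_seconds "$IIS_SESSION_MINUTES")"
for rip in $RIPS; do
    echo "ipvsadm -a -t $VIP:80 -r $rip:80 -m"
done
```

For the 8 hour sessions asked about above, `persist_seconds 480` gives 28800.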

27.14. messing with the ipvsadm table while your LVS is running

This is an example of persistence with firewall mark (fwmark).

Bowie Bailey

If I start a service with:

ipvsadm -A -f 1 -s wlc -p 180

and then change the persistence flag (setting the persistence granularity netmask to /24 with the -M option) to

ipvsadm -E -f 1 -s wlc -p 180 -M 255.255.255.0

how does that affect the connections that have already been made?

Julian 30 Jul 2001

The connections are already established. But the persistence is broken, and after changing the netmask you can expect the next connections to be established to other realservers (not to the same one as before the change).

(also see persistence netmask).

If IP address 1.2.3.4 was connected to RIP1 before I changed the persistence and then 1.2.3.5 tries to connect afterwards, would he be sent to RIP1, or would it be considered a new connection and possibly be sent to either server since the mask was 255.255.255.255 when the first connection happened?

New realserver will be selected.
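As we understand the persistence code, the persistence entry is looked up by the client address ANDed with the persistence netmask (-M). The sketch below, in plain shell with hypothetical helper names (ip_to_int, int_to_ip, persist_key), computes that key. Under the old /32 mask, 1.2.3.4 keys on itself; under /24, both 1.2.3.4 and 1.2.3.5 key on 1.2.3.0, which matches no existing entry, so the scheduler picks a realserver afresh.

```shell
#!/bin/sh
# Dotted quad -> 32-bit integer
ip_to_int() {
    old_ifs=$IFS
    IFS=.
    set -- $1          # split "a.b.c.d" into $1..$4
    IFS=$old_ifs
    echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
}

# 32-bit integer -> dotted quad
int_to_ip() {
    echo "$(( ($1 >> 24) & 255 )).$(( ($1 >> 16) & 255 )).$(( ($1 >> 8) & 255 )).$(( $1 & 255 ))"
}

# Hypothetical helper: the key used to group clients for persistence
persist_key() {        # $1 = client IP, $2 = persistence netmask (-M)
    int_to_ip $(( $(ip_to_int "$1") & $(ip_to_int "$2") ))
}

persist_key 1.2.3.4 255.255.255.255   # 1.2.3.4 (default /32)
persist_key 1.2.3.4 255.255.255.0     # 1.2.3.0
persist_key 1.2.3.5 255.255.255.0     # 1.2.3.0
```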

unknown

Let's say I have 1000 http requests (A) through a customer's firewall (so in fact all requests have the same source IP for the load balancer, because of NAT), then one request (B) from the Intranet, and then again 1000 requests (C) from that firewall. What does the load balancer do? I have three realservers r1, r2, r3 (ppc with rr).

a) A to r1, B to r2, C to r1 (because of SourceIP) [Distribution:2000:1:0.0000001]
b) A to r1, B to r2, C to r3 (because r3 is free) [Distribution:1000:1:1000]
c) A to r1, B to r2, C to r2 (due to the low load of r2) [Distribution:1000:1000:0.000001]
d) A to r1 && r2 && r3 (depending on source port), B to r1 || r2 || r3, C to r1 && r2 && r3 [Distribution: 667:667:666]

Ratz ratz (at) tac (dot) ch 12 Sep 1999

If C reaches the load balancer before all the 1000 requests of A expire, then the requests of C will be sent to r1, and the distribution is 2000:1:0.

If all the requests of A expire, the requests of C will be forwarded to a server that is selected by a scheduler.

BTW, persistent port is used to solve the connection affinity problem, but it may lead to dynamic load imbalance among servers.

Jean-Francois Nadeau

I will use LVS to load balance web servers (Direct Routing and the WRR algorithm). I use persistence with a big timeout (10 minutes). Many of our clients are behind big proxies and I fear this will unbalance our cluster because of the persistence timeout.

Wensong

Persistent virtual services may lead to load imbalance among servers. Using some weight adaptation approach may help prevent some servers from being overloaded for a long time. When a server is overloaded, decrease its weight so that connections from new clients won't be sent to that server. When the server is underloaded, increase its weight.
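One possible weight adaptation rule, as a minimal sketch. The thresholds, the step of 1 and the maximum weight are hypothetical, and how the load is measured (SNMP, ssh, an agent on the realserver) is left out. The script prints an `ipvsadm -e` command rather than running it. Lowering the weight only affects new clients; clients that already have a persistence entry keep going to the same realserver until it expires.

```shell
#!/bin/sh
# Hypothetical tuning values; pick your own.
HIGH_LOAD=80      # % load above which new clients should go elsewhere
LOW_LOAD=30       # % load below which the server can take more clients
MAX_WEIGHT=10

# $1 = current weight, $2 = measured load in percent; prints the new weight
adapt_weight() {
    w=$1
    if [ "$2" -gt "$HIGH_LOAD" ] && [ "$w" -gt 0 ]; then
        w=$(( w - 1 ))    # overloaded: fewer new clients
    elif [ "$2" -lt "$LOW_LOAD" ] && [ "$w" -lt "$MAX_WEIGHT" ]; then
        w=$(( w + 1 ))    # underloaded: more new clients
    fi
    echo "$w"
}

# Example: an LVS-DR realserver at weight 5 reporting 95% load.
# Keep the forwarding flag (-g here) the server was added with.
VIP=192.168.1.110
RIP=10.1.1.2
echo "ipvsadm -e -t $VIP:80 -r $RIP:80 -g -w $(adapt_weight 5 95)"
```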

Can we alter directly /proc/net/ip_masquerade ?

No, it is not feasible, because directly modifying masq entries will break the established connection.

27.15. Persistence for multiport services

Persistence was originally used to handle multiport services (e.g. ftp/ftp-data, http/https). While persistence is still the best method for ftp with LVS-DR and LVS-Tun, http/https is better handled by persistence granularity with fwmark.

27.16. Proxy services, e.g. AOL

Clients from AOL or T-online access the internet via proxies. Because of the way proxies can work, a client can come from one IP for one connection (eg port 80) and from another IP for the next connection (eg port 443) and will appear to be two different clients. Since there is no relation between the CIP and the identity of the client, LVS cannot loadbalance by CIP. Usually these two connections will come from the same /24 network. Lars wrote the persistence granularity patch for LVS, which allows LVS to loadbalance all clients from a netmask as one group. If you set the netmask for persistence to /24 (with the -M option to ipvsadm), all clients from the same class C network will be sent to the same realserver. This will mean that clients from AOL appear as a single (very active) client, and will likely take up all capacity on one realserver, leading to load imbalance across the realservers. This is as good as we can do with LVS.

Wensong

If you want to build a persistent proxy cluster, you just need to put an LVS box in front of all the proxy servers and use the persistent port option in the ipvsadm commands. BTW, you can have a look at wwwcache.ja.net/JanetServices/PilotServices.html "how to build a big JANET cache cluster using LVS" (link dead, May 2002).

If you want to build a persistent web service but some proxy farms are non-persistent on the client side, then you can use persistence granularity so that clients are grouped. For example, if you use the 255.255.255.0 mask, clients from the same /24 network will go to the same server.

Jeremy Johnson jjohnson (at) real (dot) com

How does LVS handle a single client that uses multiple proxies? For instance, when an AOL user attempts to connect to a website, each request can come from a different proxy. So how (and if) does LVS know that the requests are from the same client and bind them to the same server?

Joe

If this is what AOL does, then each request will be independent and will not necessarily go to the same realserver. Previous discussions about AOL have assumed that everyone from AOL was coming out of the same IP (or same class C network). Currently this is handled by making the connection persistent, and all connections from AOL will go to one realserver.

Michael Sparks zathras (at) epsilon3 (dot) mcc (dot) ac (dot) uk

If an ISP (eg AOL) has a proxy array/farm, then the requests are _likely_ to come from one of two possibilities:

  • A single subnet (if using an L4/L7 switch that rewrites ether frames, or using several NAT based L4/L7 switches)
  • A single IP (If using the common form of L4/L7 switch)

The former can be handled using a subnet mask in the persistence settings, the latter is handled by normal persistence.

*However*, in the case of our proxy farm neither of these would work, since we have 2 subnet ranges for our systems - 194.83.240/24 and 194.82.103/24 - and an end user request may come out of each subnet, totally defeating the persistence idea... (in fact, depending on our clients' configuration of their caches, the request could appear to come from the above two subnets, or from the above 2 subnets and about 1000 other ones as well)

Unfortunately this problem is more common than might be obvious, due to the NLANR hierarchy, so whilst persistence on IP/subnet solves a large number of problems, it can't solve all of them.

Billy Quinn bquinn (at) ifleet (dot) com 05 Jun 2001

I've come to the conclusion that I need an expensive (higher layer) load balancer. We have a load balancer node which load balances port 80 (using persistence because of sessions) to 3 realservers, each running an apache web server and the tomcat servlet engine. Each of the 3 servers is independent and no tomcat load balancing occurs.

This has worked great for about a year, while we only had to support certain IP address ranges. Now, however, we have to support clients using AOL and their proxy servers, which completely messes up the session handling in tomcat. In other words, one client comes from multiple different IP addresses based on which proxy server it comes through.

It seems the thing to do is to adjust the persistence granularity. However, if I adjust the netmask, all of our internal network traffic will go to one server, which kind of defeats the purpose.

What I'm concluding is that I'll need to change the network architecture (since we are all on one subnet), or buy a load balancer which will look at the actual data in the packets (layer 7?).

Joe

There have been comments by people dealing with this problem (not many), but they seem to still be able to use LVS. We don't hear of anyone who is having lots of trouble with this, but it could be because no-one on this list is dealing with AOL as a large slice of their work.

If 1/3 of your customers are from AOL you could sacrifice one server to them, but it's not ideal. If all your customers are from AOL, I'd say we can't help you at the moment.

My concern with that would be anyone else doing proxying, now or in the future. I would not be opposed to routing all of the AOL customers to one server for now, though. I guess we would have to deal with each case of proxying individually. I wonder how many other ISPs do proxying like that.

How many different proxy IPs do AOL customers arrive on the internet from? How many will appear from multiple IP's in the same session and how big is the subnet they come from? (/24?)

Good question, I'm not sure about that one. The customer that reported the problem seemed to be coming from about 2-4 different IP addresses (for the same session).

If AOL customers come from at least 3 of these subnets and you have 3 servers, then you can use LVS as a balancer.

Peter Mueller pmueller (at) sidestep (dot) com

Over here we also need layer-7 'intelligent' balancing with our apache/jakarta setup. We utilize two tiers of 'load-balancing'. One is the initial LVS-DR round-robin type setup, while the second layer is our own creation, layer-7. Currently we round-robin the first connection to one server, then that server calls a routine that will ask the second-tier layer-7 java monitor boxes which box to send the connection to. (If for some reason the second layer is down, standard round-robin occurs).

We're about 50% done with migration from cisco LD (yuck!) to LVS-DR. After the migration is fully complete the goal is to have the two layers interacting more efficiently and hopefully merged into one 'layer' eventually.. for example, if we tell our java-monitor second-tier controllers to shutdown a server, the first tier will then mark the node out of service automatically.

PS - we found the added layer-7 intelligent balancing to be about 30-50% (?) added effectiveness to cisco round robin LD.. I think the analogy of a hub versus a switch works fairly well here..

Chris Egolf cegolf (at) refinedsolutions (dot) net

We're having the exact same problem with WebSphere cookie-based sessions. I was testing this earlier today and I think I've solved this particular problem by using fwmarks.

Basically, I'm setting everything from our internal network with one FWMARK and everything else with another. Then, I setup the ipvsadm rules with the default client persistence for our internal network(/32) and a class C netmask granularity (/24) for everything from the outside to deal w/ the AOL proxy farms.

Here's the iptables script I'm using to set the marks:

iptables -F -t mangle
iptables -t mangle -A PREROUTING  -p tcp -s 10.3.4.0/24 -d $VIP/32 \
--dport 80 -j MARK --set-mark 1
# everything NOT from 10.3.4.0/24 (current iptables wants "! -s", not "-s !")
iptables -t mangle -A PREROUTING  -p tcp ! -s 10.3.4.0/24 -d $VIP/32 \
--dport 80 -j MARK --set-mark 2

Then, I have the following rules setup for ipvsadm:

director:/etc/lvs# ipvsadm -C
director:/etc/lvs# ipvsadm -A -f 1 -s wlc -p 2000
director:/etc/lvs# ipvsadm -a -f 1 -r $RIP1:0 -g -w 1
director:/etc/lvs# ipvsadm -a -f 1 -r $RIP2:0 -g -w 1

director:/etc/lvs# ipvsadm -A -f 2 -s wlc -p 2000 -M 255.255.255.0
director:/etc/lvs# ipvsadm -a -f 2 -r $RIP1:0 -g -w 1
director:/etc/lvs# ipvsadm -a -f 2 -r $RIP2:0 -g -w 1

FWMARK #1 doesn't have a persistent mask specified, so each client on the 10.3.4.0/24 network is seen as an individual client. FWMARK #2 packets are seen as a class C client network to deal with the AOL proxy farm problem. (for more on persistent netmask see the section in fwmark on fwmark persistence granularity).

Like I said, I just did this today, and based on my limited testing, I think it works. I'm thinking about maybe setting a whole bunch of rules to deal w/ each of the published AOL cache-proxy server networks (http://webmaster.info.aol.com/index.cfm?article=15&sitenum=2), but I think that would be too much of an administrative nightmare if they change it.

The ktcpvs project implements some level of layer-7 switching by matching URL patterns, but we need the same type of cookie based persistence for our WebSphere realservers. Hopefully, it won't be too long before that gets added.

Matthias Krauss MKrauss (at) hitchhiker (dot) com 2003-01-30

I turned on our LVS and it didn't take long before the phone started ringing with AOL people. They are switching between proxies, with the result that the target webserver is different. We need it persistent.

Lars

The persistency netmask feature might help you, in exchange for lower granularity of the load balancing (but it shouldn't matter). However, all AOL users will then likely hit the same webserver. It just goes to show that IP addresses are unsuitable for identifying a single user ;-) The real fix would be to use layer7 switching based on the URL or even a cookie; alternatively, you could make your application less dependent on persistence, for example by storing your session data in a global cache/db, which would also make it easier for you to preserve sessions when a single webserver fails.

I now have the persistency netmask feature up and it seems to work fine. All the sender networks are forwarded to 1 RIP and the load share on all RIPs is nearly equal. The AOL users are still complaining and I have the impression that AOL uses different netmasks on their proxies. I found a list at http://webmaster.info.aol.com/proxyinfo.html and used this info for my fwmarks. Here's my iptables list:

#mark all packets from these networks to VIP:80 with fwmark=3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  64.12.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  152.163.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  195.93.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.16.0/21 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  198.81.26.0/26 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  202.67.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3
director:/etc/lvs# iptables -t mangle -A PREROUTING -p tcp -s  205.188.0.0/16 -d $VIP/32 --dport 80 -j MARK --set-mark 3

and for ipvsadm apply persistence granularity with fwmark.

#forward all packets with fwmark=3 with rr scheduler.
director:/etc/lvs# ipvsadm -A -f 3 -s rr -p 3600 -M 255.255.255.0
director:/etc/lvs# ipvsadm -a -f 3 -r $RIP1 -g -w 100

Of course there is no balancing any more for the above nets, but fortunately we don't have many AOL customers.

Alternatively, we found another option: using P3P HTTP headers, which AOL describes at http://webmaster.info.aol.com/headers.html

Note
P3P is the W3C's Platform for Privacy Preferences.

Here's another set of postings from Dec 2003, this time using fwmark to aggregate all the traffic from AOL. As with above, all connections from the proxy servers (i.e. all of AOL) will all go to one realserver.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 30 Dec 2003

For AOL clients, I need to use persistent connections. AOL rotates the client IPs very quickly across several /8 or /16 networks. Here's the list of AOL IPs for their clients:

64.12.0.0 - 64.12.255.255
152.163.0.0 - 152.163.255.255
172.128.0.0 - 172.191.255.255
195.93.0.0 - 195.93.63.255
195.93.64.0 - 195.93.127.255
198.81.0.0 - 198.81.31.255
202.67.64.0 - 202.67.95.255
205.188.0.0 - 205.188.255.255

Can I handle this with fwmark?

Matthias Krauss MKrauss (at) hitchhiker (dot) com 30 Dec 2003

Here's the list of AOL proxies (http://webmaster.info.aol.com/proxyinfo.html).

#The proxy list from above

AOLPROXYS="64.12.96.0/19 152.163.188.0/21 152.163.189.0/21 152.163.194.0/21 \
152.163.195.0/21 152.163.197 152.163.201.0/21 152.163.204.0/21 \
152.163.205.0/21 152.163.206.0/21 152.163.207.0/21 152.163.213.0/21 \
152.163.240.0/21 152.163.248.0/22 152.163.252.0/23 195.93.32.0/22 \
195.93.48.0/22 195.93.64.0/19 198.81.0.0/22 198.81.8.0/23 \
198.81.16.0/21 198.81.26.0/23 202.67.64.0/21 205.188.178.0/21 \
205.188.192.0/21 205.188.193.0/21 205.188.195.0/21 205.188.196.0/21 \
205.188.197.0/21 205.188.198.0/21 205.188.199.0/21 205.188.200.0/21 \
205.188.201.0/21 205.188.208.0/21 205.188.209.0/21"

for aolproxys in $AOLPROXYS
do
  iptables -t mangle -A PREROUTING -p tcp -s "$aolproxys" -d "$VIP"/32 \
    --dport 80 -j MARK --set-mark 1
done

#-M is the persistence netmask: clients whose source addresses are equal
#after masking share one persistence template (and so one realserver).
#The netmasks in the iptables rules above only decide which packets get
#fwmark 1. With a single realserver, as here, -M makes no practical difference.
ipvsadm -A -f 1 -s wrr -p 3600 -M 255.255.255.0
ipvsadm -a -f 1 -r "$RIP" -g
#=> All listed AOL traffic now goes to the one realserver $RIP
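To check which source addresses the loop above would actually mark, the pure-bash sketch below tests an address against a list of NET/LEN blocks in the same way as a "-s NET/LEN" match: (addr & mask) == (net & mask). It is only an illustration. in_cidr and would_mark are hypothetical helpers, not part of iptables. The sketch also assumes every list entry has an explicit /LEN, which is not true of the 152.163.197 entry in AOLPROXYS.

```bash
#!/usr/bin/env bash
# Sketch only: which source addresses would the iptables loop mark?
# Mirrors a "-s NET/LEN" match: (addr & mask) == (net & mask).

ip2int() {
    local IFS=.
    local -a o=($1)
    echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

# in_cidr IP NET/LEN: exit status 0 if IP lies inside NET/LEN
in_cidr() {
    local net=${2%/*} len=${2#*/}
    local mask=$(( (0xFFFFFFFF << (32 - len)) & 0xFFFFFFFF ))
    (( ($(ip2int "$1") & mask) == ($(ip2int "$net") & mask) ))
}

# would_mark IP CIDR...: print the first matching block, or "no mark"
would_mark() {
    local ip=$1 cidr
    shift
    for cidr in "$@"; do
        if in_cidr "$ip" "$cidr"; then
            echo "mark 1 ($cidr)"
            return 0
        fi
    done
    echo "no mark"
}

would_mark 152.163.190.7 64.12.96.0/19 152.163.188.0/21   # mark 1 (152.163.188.0/21)
would_mark 10.1.2.3 64.12.96.0/19 152.163.188.0/21        # no mark
```

To check a real client address, pass $AOLPROXYS (unquoted, so it splits into words) as the list.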

I think you could combine some of the /21s. Based on both whois and AOL's technical contact, I deduced:

AOLPROXYS="64.12.0.0/16 152.163.0.0/16 172.128.0.0/10 195.93.0.0/17 198.81.0.0/19 202.67.64.0/19 205.188.0.0/16"
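This aggregated list is simply Francois's first-last ranges rewritten in CIDR notation. The two adjacent 195.93.x ranges merge into one /17. A range maps onto a single CIDR block only when its size is a power of two and its first address is aligned to that size. The bash sketch below checks both conditions and prints the block. range_to_cidr is a hypothetical helper written for this illustration.

```bash
#!/usr/bin/env bash
# Sketch only: convert a "first - last" range into one CIDR block,
# if the range is a single aligned power-of-two block.

ip2int() {
    local IFS=.
    local -a o=($1)
    echo $(( (o[0] << 24) | (o[1] << 16) | (o[2] << 8) | o[3] ))
}

int2ip() {
    local n=$1
    echo "$(( (n >> 24) & 255 )).$(( (n >> 16) & 255 )).$(( (n >> 8) & 255 )).$(( n & 255 ))"
}

# range_to_cidr FIRST LAST: print NET/LEN, or fail if not one block
range_to_cidr() {
    local lo hi size len=32
    lo=$(ip2int "$1")
    hi=$(ip2int "$2")
    size=$(( hi - lo + 1 ))
    # size must be a power of two, and lo a multiple of size
    if (( size <= 0 || (size & (size - 1)) != 0 || (lo & (size - 1)) != 0 )); then
        echo "not a single CIDR block: $1 - $2" >&2
        return 1
    fi
    while (( size > 1 )); do
        size=$(( size >> 1 ))
        len=$(( len - 1 ))
    done
    echo "$(int2ip "$lo")/$len"
}

range_to_cidr 172.128.0.0 172.191.255.255   # 172.128.0.0/10
range_to_cidr 195.93.0.0 195.93.127.255     # 195.93.0.0/17 (two ranges merged)
range_to_cidr 198.81.0.0 198.81.31.255      # 198.81.0.0/19
```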

Here is my entry for the fwmark rule in keepalived.conf. Note the string "fwmark 1", which replaces "VIP port" as used in the standard setup. (Presumably "fwmark 1" is just a string which is passed to ipvsadm.)

virtual_server fwmark 1 {
    delay_loop 20
    lb_algo rr
    lb_kind DR
    persistence_timeout 1800
    persistence_granularity 255.255.255.0
    protocol TCP
    virtualhost www.toto.com
    real_server 172.16.1.4 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
}

#And here it is for the "standard users":

virtual_server $VIP 80 {
    delay_loop 20
    lb_algo rr
    lb_kind DR
    persistence_timeout 1800
    persistence_granularity 255.255.255.0
    protocol TCP
    virtualhost www.toto.com
    real_server 172.16.1.4 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
    real_server 172.16.1.5 80 {
        weight 1
        HTTP_GET {
            url {
              path /index.jsp
            }
            connect_port 8083
            connect_timeout 10
            nb_get_retry 5
            delay_before_retry 20
        }
    }
}

So, in one file, you have both your load-balancing and your HA settings. Of course, you still need to configure iptables yourself. Also, keepalived does nothing on the realservers (there, I just need to configure the VIPs and the noarpctl settings, which is easy).

# ipvsadm -Ln

FWM 1 rr persistent 1800 mask 255.255.255.0
  -> 172.16.1.4:80                Route   1     0          0

For an example of using fwmark with keepalived, see ./doc/samples/keepalived.conf.fwmark in the source directory.
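For comparison with a hand-built setup, the sketch below prints the ipvsadm commands that roughly correspond to the "virtual_server fwmark 1" block. lb_algo maps to -s, persistence_timeout maps to -p, persistence_granularity maps to -M, and lb_kind DR maps to -g. The sketch only echoes commands and does not run them. fwm_service_cmds is a hypothetical helper, and this is not a claim about how keepalived itself applies its configuration.

```bash
#!/usr/bin/env bash
# Sketch only: echo (do not run) ipvsadm commands roughly equivalent to
# the keepalived "virtual_server fwmark 1" block.

# fwm_service_cmds MARK ALGO TIMEOUT MASK RIP:PORT...
fwm_service_cmds() {
    local mark=$1 algo=$2 timeout=$3 mask=$4 rs
    shift 4
    # lb_algo -> -s, persistence_timeout -> -p, persistence_granularity -> -M
    echo "ipvsadm -A -f $mark -s $algo -p $timeout -M $mask"
    for rs in "$@"; do
        # lb_kind DR -> -g (direct routing); weight 1 -> -w 1
        echo "ipvsadm -a -f $mark -r $rs -g -w 1"
    done
}

fwm_service_cmds 1 rr 1800 255.255.255.0 172.16.1.4:80
# ipvsadm -A -f 1 -s rr -p 1800 -M 255.255.255.0
# ipvsadm -a -f 1 -r 172.16.1.4:80 -g -w 1
```

Applying those two commands by hand should give an "ipvsadm -Ln" listing of the same shape as the one shown above (FWM 1 rr persistent 1800 mask 255.255.255.0, realserver 172.16.1.4:80 with Route and weight 1).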

Casey Zacek cz (at) neospire (dot) net 2005/04/08

Here's the current AOL proxy list (http://webmaster.info.aol.com/proxyinfo.html) in a rawer format. It changes occasionally (I just had to update my own list when I went looking for it):

Note
Joe: most of these are not class C
64.12.96.0/19

149.174.160.0/20

152.163.240.0/21
152.163.248.0/22
152.163.252.0/23
152.163.96.0/22
152.163.100.0/23

195.93.32.0/22
195.93.48.0/22
195.93.64.0/19
195.93.96.0/19
195.93.16.0/20

198.81.0.0/22
198.81.16.0/20
198.81.8.0/23

202.67.64.128/25

205.188.192.0/20
205.188.208.0/23
205.188.112.0/20
205.188.146.144/30

207.100.112.0/21
207.200.116.0/23
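To make Joe's remark concrete: a class C (/24) network has 256 addresses, and a block with prefix length LEN has 2^(32 - LEN) addresses. The short bash sketch below prints the size of a few entries from the list above. describe_block is a hypothetical helper written for this illustration.

```bash
#!/usr/bin/env bash
# Sketch only: size of a CIDR block compared with a class C (/24).

# describe_block NET/LEN
describe_block() {
    local len=${1#*/}
    local size=$(( 1 << (32 - len) ))
    local cmp="same as a /24"
    (( len < 24 )) && cmp="$(( size / 256 )) x /24"
    (( len > 24 )) && cmp="smaller than a /24"
    echo "$1: $size addresses ($cmp)"
}

for block in 64.12.96.0/19 149.174.160.0/20 202.67.64.128/25 205.188.146.144/30; do
    describe_block "$block"
done
# 64.12.96.0/19: 8192 addresses (32 x /24)
# 149.174.160.0/20: 4096 addresses (16 x /24)
# 202.67.64.128/25: 128 addresses (smaller than a /24)
# 205.188.146.144/30: 4 addresses (smaller than a /24)
```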

27.17. key exchanges (SSL)

Persistence is required for SSL services, as keys are cached.

Francis Corouge wrote:

I set up an LVS-DR LVS. All services work well, but with IE 4.1 over a secure connection, pages arrive only intermittently. When you make several requests, sometimes the page is displayed, and sometimes a popup error message is displayed:

IE can't open your Internet Site <url>
An error occured with the secured connexion.

I did not test with other versions of IE, but Netscape works fine. It works when I connect directly to the realserver (realserver disconnected from the LVS, and the VIP on the realserver allowed to arp).

Julian

Was the https service created as persistent, i.e. with ipvsadm -p? I assume the problem lies in the way SSL works: cached keys, etc. Without persistence configured, SSL connections break when they hit another realserver. It may be in the way the bugs are encoded. It also depends on how the SSL requests are performed (which we don't know).
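Julian's explanation can be pictured with a toy model of a persistence template. The bash sketch below is not the kernel implementation. It assumes:

- two hypothetical realservers (rs1, rs2);
- round-robin scheduling for clients without a valid template;
- a 360 s timeout, chosen only for illustration;
- that every new connection from a client resets the timeout.

Once a template expires, the client may be scheduled to a different realserver. At that point, any SSL session state cached on the old realserver is no longer available to the client. The sketch requires bash 4 or later for associative arrays.

```bash
#!/usr/bin/env bash
# Toy model of LVS persistence (not the kernel code). Assumptions:
# hypothetical realservers, round-robin for new clients, and a
# timeout that is reset by every new connection from the client.

REALSERVERS=(rs1 rs2)   # hypothetical realserver names
TIMEOUT=360             # seconds; illustrative value, as in ipvsadm -p 360
declare -A TPL_RS TPL_EXPIRES
NEXT=0

# connect CLIENT_IP NOW: print the realserver this connection goes to
connect() {
    local client=$1 now=$2
    if [[ -n ${TPL_RS[$client]:-} ]] && (( now < ${TPL_EXPIRES[$client]} )); then
        :   # template still valid: reuse the same realserver
    else
        # no template, or it has expired: schedule round-robin
        TPL_RS[$client]=${REALSERVERS[NEXT]}
        NEXT=$(( (NEXT + 1) % ${#REALSERVERS[@]} ))
    fi
    TPL_EXPIRES[$client]=$(( now + TIMEOUT ))   # each connection resets the timer
    echo "${TPL_RS[$client]}"
}

connect 10.0.0.1 0      # rs1 (new template)
connect 10.0.0.2 10     # rs2 (new template)
connect 10.0.0.1 300    # rs1 (template valid, now expires at 660)
connect 10.0.0.3 400    # rs1 (new template)
connect 10.0.0.1 1000   # rs2 (template expired, client rescheduled)
```

Call connect directly rather than inside $(...) when the updated state must be kept, because a command substitution runs in a subshell.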

Notes from Peter Kese, who implemented the first persistence (pcc) (this is probably from 1999).

The PCC scheduling algorithm might produce some imbalance of load on realservers. This happens because the number of connections established by clients might vary a l