23. LVS: UDP Services - unique problems

LVS has been able to schedule and forward UDP packets from the very beginning. Most of our users have been load balancing TCP services, and we've glibly assumed that all our TCP knowledge and coding applied to UDP as well. However, while TCP connections are all alike (they start with SYN and end with FIN), a service using UDP has more flexibility.

A single scheduler is not going to work for all of these. Little progress has been made for UDP over LVS because few people are using UDP services with LVS. Most UDP services of interest (ntp, dns) have their own built-in load balancing, so there has been no pressing requirement for LVS/UDP like there has been for LVS/TCP.

23.1. SIP (Session Initiation Protocol)

SIP is a protocol for VoIP (voice over IP) telephony that runs mostly over UDP. It has lots of ports and is a bit complicated for LVS (no-one has it working under LVS). Asterisk has its own load balancer (see below). Currently we suggest that you try that first.

Joao Filipe Placido, Jul 19, 2005

I have read some posts about a SIP module for lvs. Is this available? Does it do SIP dialog persistence?


Unfortunately it is not available as the project was halted before it became ready for release. I'm not entirely sure what SIP dialog persistence is, but in a nutshell all my design did was to provide persistence based on the Caller-ID, rather than the client's address and port as LVS usually does.

Mike Machado mmachado (at) o1 (dot) com 20 Aug 2004

I have a LVS-DR setup with two realservers. My service is a voip application, SIP specifically. I am trying to balance requests between the two. I have all the LVS stuff setup, and was able to get the telnet test to work properly. With UDP though, there seems to be a problem. When the application is forming its reply, it uses the realserver as the source IP, instead of the VIP, as it does with the telnet test. I assume this is because UDP is stateless. I tried to SNAT the packets back to the correct IP, but you cannot SNAT locally generated packets.

I was able to change my voip application to just BIND to the VIP, but due to the nature of this application, it needs to be able to communicate on both the VIP and the RIP. I just want reply packets to use, as their source IP, the address the inbound packets were sent to.

Anyone come across this problem for UDP applications, along with a possible solution?


What about using IP_PKTINFO in sendmsg? "srcip" is your server IP, the one used in each request packet as daddr. See, e.g., this example (http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/0771.html) and thread (http://www.ussg.iu.edu/hypermail/linux/kernel/0406.1/index.html#0247).


Fixing the application to send reply packets from the addresses that they were received on is the best solution IMHO. It is the way I have resolved this problem in the past. The alternative would be to use LVS-NAT instead.
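
The IP_PKTINFO approach suggested above can be sketched roughly as follows. This is a minimal Python sketch for Linux, not code from the thread: the function names and the port are illustrative assumptions, and the fallback value 8 is the Linux value of IP_PKTINFO, used only when the Python socket module does not export the constant. The server asks the kernel to report the destination address of each datagram and then sends the reply with that address as its source, so a realserver that holds both a RIP and a VIP answers from whichever one the client addressed.

```python
import socket
import struct

# Assumption: Linux. Older Python versions lack socket.IP_PKTINFO; 8 is the Linux value.
IP_PKTINFO = getattr(socket, "IP_PKTINFO", 8)
# struct in_pktinfo { int ipi_ifindex; struct in_addr ipi_spec_dst; struct in_addr ipi_addr; }
PKTINFO = struct.Struct("i4s4s")


def dest_addr(ancdata):
    """Return the IP address a received datagram was addressed to, or None."""
    for level, ctype, data in ancdata:
        if level == socket.IPPROTO_IP and ctype == IP_PKTINFO:
            _ifindex, _spec_dst, addr = PKTINFO.unpack(data[:PKTINFO.size])
            return socket.inet_ntoa(addr)
    return None


def source_cmsg(src_ip):
    """Build ancillary data asking the kernel to send from src_ip."""
    data = PKTINFO.pack(0, socket.inet_aton(src_ip), socket.inet_aton("0.0.0.0"))
    return [(socket.IPPROTO_IP, IP_PKTINFO, data)]


def serve(port=5060):
    """Hypothetical echo loop: reply from the address each request arrived on."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.IPPROTO_IP, IP_PKTINFO, 1)
    sock.bind(("0.0.0.0", port))
    while True:
        payload, ancdata, _flags, peer = sock.recvmsg(
            65535, socket.CMSG_SPACE(PKTINFO.size))
        dst = dest_addr(ancdata)
        # Without pktinfo the kernel picks the source address itself.
        sock.sendmsg([payload], source_cmsg(dst) if dst else [], 0, peer)
```

A real SIP application would build its own reply rather than echo the payload; only the recvmsg/sendmsg handling of the address is the point here.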

Erik Versaevel erik (at) infopact (dot) nl 13 Jan 2005

I'm currently trying to create a load balancing SIP (VoIP protocol) cluster; however, for this to work I need SIP messages from the same call (identifiable by the SIP Call-ID field) to get to the same realserver over and over again (so I need persistence based on the contents of the SIP Call-ID field). This would call for ktcpvs as we need to process packets at layer 7; however, that poses 2 new problems: the first is that SIP uses clear text UDP messages, not TCP, and the second is that there are no SIP modules for ktcpvs.

Another option would be to mark SIP packets with iptables/netfilter based on the Call-ID; however, I run into the same problem: there are no modules to accomplish this.

I know that there are commercial products available that are able to do SIP session persistence based on Call-ID, the F5 Big-IP for example. The downside is that it costs around $10,000 for a single load balancer (which is a SPOF, so you need 2) and is a bit overkill, as I don't need multi-gigabit load balancing.

High persistence won't work because reply packets from another SIP source might be balanced to the wrong server, i.e. packets from might be balanced to real server 1, which sends its reply directly to (the end point for the call) but the replies from might end up at another real server. (Be aware I'm using direct routing because of NAT traversal.)

Using TCP SIP would only solve half the problem (and cause some more). Answers to requests could still end up at the wrong server (but one would only have to write a module for that, and not a kudpvs), and not all clients support TCP-based SIP.

Wensong Zhang wensong (at) linux-vs (dot) org 17 Jan 2005

We cannot use ktcpvs, because ktcpvs supports TCP only, and there are no SIP modules for TCP transport. The firewall marking doesn't solve the problem. However, we can write a special SIP UDP scheduling module for IPVS. It can detect the Call-ID from the UDP packet and send it to the SIP server according to the recorded Call-ID table. We assume that there are no UDP fragments. Some NAT boxes (such as early IOS versions of Cisco routers) may drop UDP fragments except the first one.
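
The Call-ID table idea can be illustrated in user space. What follows is a hypothetical Python sketch, not the IPVS module (which was never released): the class and function names are invented for illustration, the table never expires entries, and folded (continuation) header lines are ignored. It assumes, as above, that each SIP message arrives in a single unfragmented datagram. New Call-IDs are scheduled round robin; any later message carrying the same Call-ID, from any source address, goes to the recorded server.

```python
import itertools


def call_id(datagram: bytes):
    """Extract the Call-ID header value from a SIP message, or None."""
    head = datagram.split(b"\r\n\r\n", 1)[0]
    for line in head.split(b"\r\n")[1:]:  # skip the request/status line
        name, sep, value = line.partition(b":")
        # Header names are case-insensitive; "i" is the compact form of Call-ID.
        if sep and name.strip().lower() in (b"call-id", b"i"):
            return value.strip().decode("ascii", "replace")
    return None


class CallIdScheduler:
    """Round robin for new calls, Call-ID persistence for known ones."""

    def __init__(self, realservers):
        self._rr = itertools.cycle(realservers)
        self._table = {}  # Call-ID -> realserver

    def pick(self, datagram: bytes):
        cid = call_id(datagram)
        if cid is None:
            return next(self._rr)  # not parseable: plain round robin
        if cid not in self._table:
            self._table[cid] = next(self._rr)
        return self._table[cid]


sched = CallIdScheduler(["rs1", "rs2"])
invite = b"INVITE sip:bob@example.com SIP/2.0\r\nCall-ID: abc@host\r\n\r\n"
answer = b"SIP/2.0 200 OK\r\ni: abc@host\r\n\r\n"
other = b"INVITE sip:carol@example.com SIP/2.0\r\nCall-ID: xyz@host\r\n\r\n"
print(sched.pick(invite), sched.pick(answer), sched.pick(other))
```

The point of keying on the Call-ID rather than on CIP:CPORT is that the answer from the far end of the call, which arrives from a different address, still lands on the server that handled the INVITE.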

Erik Versaevel

Such a module would definitely solve the problem. A round robin with Call-ID persistence would be awesome. Currently a device which can do that costs around $10,000 each (times 2 for HA).

Malcolm Turnbull wrote:

If I wanted to test load balancing SIP using standard LVS UDP, is there an OpenSource or Commercial Free Server to test against?


AFAIK you can do all of VoIP PBX on Linux with Asterisk. The handsets (phones) are more of a problem. I believe you can get a linux box with a sound card and mic/headphones to be a phone. There was a large purchase of VoIP phones a while ago, that someone got to run linux, but these have all gone. You'll probably need to buy a real VoIP phone to test with.

Curt Moore tgrman21 (at) gmail (dot) com 11 Feb 2005

I believe that Wensong has the right idea here. Although SIP does support TCP, the vast majority of SIP endpoints only support UDP.

Asterisk can be used for a SIP application server but it's not geared/written to be a SIP proxy. For this you need something like SER, SIP Express Router.

When correctly configured, SER acting as a SIP proxy does support the distributing of calls to Asterisk boxes acting as media/application servers. The issue becomes how to load balance/distribute calls to multiple SER boxes, based on SIP call-id, so that the same SIP call-id always goes to the same SER box. Once you've statefully routed a call, based on call-id, to a particular SER box, SER can take over and ensure that things go to the correct Asterisk media/application server or endpoint based on its routing configuration. The director nodes just need to be smart enough to send the right call to the right SIP proxy residing in the LVS cluster.

As far as NAT goes, I've found through lots of experience that you'll never be able to penetrate every NAT implementation out there when it comes to SIP/UDP. You can only hope to get 99% of them, as many of them don't fully conform to the RFC.

Gerry Reno greno (at) verizon (dot) net 16 May 2008

Ok, I finished setting up some pbx (asterisk). Can I use LVS to load balance the call traffic between multiple pbx's? Or with SIP protocol is it necessary to use OpenSER?


You probably can, but given the nature of SIP - two transport protocols, multi-port, session based - it could get very complicated. You could definitely sort out the main ports - TCP/UDP port 5060 - trivially; but the follow-on complication is how you then track the session traffic which can wander around all over the place (cf. the LVS FTP helper).

I'd strongly recommend you have a good read of the Asterisk mailing list - it seems that there are several app-based load balancing schemes for Asterisk, and if they do what you need, I'd use them.

Morgan Fainberg morgan (at) mediatemple (dot) net 17 May 2008

In theory, you could use a FWM (firewall mark) setup and persistent connections. If you map the virtual server group to use the same FWM for the TCP (SIP uses TCP port 5060) and UDP (RTP usually is configured for UDP ports 16384-32767) datastreams, it should work.

However, the application-based Load-balancing in Asterisk does function fairly well and you might end up with a better solution. Typically, with load-balancing I find that the more complexity you add just makes it that much harder to debug when things go awry.


I think the fwmark approach might work. And I like this since load-balancing with LVS is better for me because I have all my other services on it. I'm keeping all traffic going through the Asterisk box with canreinvite=no. canreinvite=yes would present a further scenario as the endpoints would then end up in direct communication for RTP. You'll have to excuse me if I've oversimplified this. I have not used fwmarks before.

So let's see, I'm using keepalived so in the conf I guess I would have something like:

virtual service RS_IP 5060 { # SIP
virtual service fwmark 1 { # SIP RTP

In iptables (directors):
iptables -t mangle -A PREROUTING -p udp -d <RS_IP> --dport \
10000:20000 -j MARK --set-mark 1 # SIP RTP: where -d has ip of real servers

In iptables (realservers): # only for NAT, what about DR?
iptables -A PREROUTING -t mangle -d <VIRTUAL_IP> -j MARK --set-mark 1 # route back to director


Those look reasonable; however, you will probably not want to separate the SIP and RTP traffic. It would make more sense to use two iptables rules that set the same firewall mark. I.e., you can set as many iptables rules as the system can handle to assign a given firewall mark. Any traffic (regardless of port/type) can be balanced with the FWM. FWM is (as you can see from the ipvsadm man page) its own service type. Instead of specifying --tcp-service or --udp-service you specify --fwmark-service. Given that I use Keepalived vs. the other methods, it is slightly different than making direct calls with ipvsadm.

In short, there is no need to have separate VIPs for SIP and RTP unless you have different servers handling SIP traffic.

It would probably look something more like this:

virtual service fwmark 1 { # SIP RTP

iptables -t mangle -A PREROUTING -p udp -d <RS_IP> --dport \
10000:20000 -j MARK --set-mark 1 # SIP RTP: where -d has ip of real servers
iptables -t mangle -A PREROUTING -p tcp -d <RS_IP> --dport \
5060 -j MARK --set-mark 1 # SIP RTP: where -d has ip of real servers

I've not used FWM+NAT in a good long while. You probably don't need to set the firewall mark on the realservers as the firewall mark (I don't believe) stays with the packet once it leaves the local networking stack (ie, it is not sent out on the wire). So unless the system needs to do something specific with the firewall mark (IE iprule to policy-route to the director) the firewall mark will not need to be set on the real-server.

A DR configuration should work almost identically; however, I've not done UDP in a DR configuration (always NAT). A standard DR configuration ~should~ function for an Asterisk setup like this.


Yes, of course, I need to keep the SIP and RTP together since I'm not using a separate SIP server. So now if we use ARA we should have a good extensible solution. To me this seems like it might be better than OpenSER, because with OpenSER you have a SPOF whereas with keepalived/LVS you have a more robust solution. My setup is LVS-DR, so I need to think about whether the direct return route is going to create any problems. Otherwise, the only thing lacking in this picture is that FreePBX does not support ARA :-(

later...Gerry Reno greno (at) verizon (dot) net 26 Dec 2008

Actually, I abandoned plans for LVS+Asterisk. We just beefed up our recovery techniques and made sure everyone knew what to do if Asterisk crashed or hung on us. Yes, we lose calls and it's a pain but we live with it right now.

23.2. UDP timeouts (SIP)

Benjamin Lawetz blawetz (at) teliphone (dot) ca 30 Jun 2005

I have a setup that load balances SIP UDP packets between 4 servers. Today one of my servers failed and mon removed it from the load-balancing, but some of the connections still remain and keep getting refreshed. I noticed something bizarre though with the ipvsadm UDP timeout: it is set to 35s, but the 3 connections that "stay stuck" seem to have an expire of 60 seconds instead of 35. And even though the server is removed, and the rest of the connections time out and get redirected to another server, those 3 just keep going to the failed server.

Anyone have any idea why these 3 connections are (so it seems) auto-refreshing every 60s on a server that doesn't exist? Any way to clear these?

Ipvsadm startup script:

echo 1 > /proc/sys/net/ipv4/ip_forward
/sbin/ipvsadm --set 0 0 35
/sbin/ipvsadm -A -u -s rr
/sbin/ipvsadm -a -u -r -i -w 5
/sbin/ipvsadm -a -u -r -i -w 5
/sbin/ipvsadm -a -u -r -i -w 5
/sbin/ipvsadm -a -u -r -i -w 5

Before removal this is what I have:

ipvsadm -L -n:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP rr
  ->            Tunnel  5      0          4
  ->            Tunnel  5      0          3
  ->            Tunnel  5      0          3
  ->            Tunnel  5      0          6

ipvsadm -L -n -c:

IPVS connection entries
pro expire state       source             virtual            destination
UDP 00:30  UDP
UDP 00:25  UDP
UDP 00:34  UDP
UDP 00:26  UDP
UDP 00:22  UDP
UDP 00:32  UDP
UDP 00:21  UDP
UDP 00:20  UDP
UDP 00:34  UDP
UDP 00:23  UDP
UDP 00:31  UDP
UDP 00:32  UDP
UDP 00:27  UDP
UDP 00:40  UDP
UDP 00:28  UDP
UDP 00:25  UDP

After removal I have:

ipvsadm -L -n:

IP Virtual Server version 1.2.1 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port           Forward Weight ActiveConn InActConn
UDP rr
  ->            Tunnel  5      0          3
  ->            Tunnel  5      0          3
  ->            Tunnel  5      0          6

ipvsadm -L -n -c:

IPVS connection entries
pro expire state       source             virtual            destination
UDP 00:28  UDP
UDP 00:31  UDP
UDP 00:31  UDP
UDP 00:31  UDP
UDP 00:29  UDP
UDP 00:27  UDP
UDP 00:24  UDP
UDP 00:24  UDP
UDP 00:33  UDP
UDP 00:40  UDP
UDP 00:11  UDP
UDP 00:57  UDP
UDP 00:57  UDP
UDP 00:33  UDP
UDP 00:33  UDP

Sorry to answer myself, but narrowed down the problem.

I sniffed the traffic on the load balancer coming from one of the "stuck" IPs and I'm getting no traffic whatsoever. So basically, while getting no traffic coming in, having no destination to go to and having a timeout of 35 seconds, the connection entries count down from 59 seconds to 0 seconds and loop back to 59 seconds again. Is there a way to remove these connections? Anyone have any idea why they are looping when no traffic is coming in?


Which kernel do you have? This sounds like a counter bug that was resolved recently. However I couldn't convince myself that it manifests in 2.4 (I looked at 2.4.27). http://archive.linuxvirtualserver.org/html/lvs-users/2005-05/msg00043.html Also, I may as well mention this as it has been floating around: http://oss.sgi.com/archives/netdev/2005-06/msg00564.html

Marcos Hack marcoshack (at) gmail (dot) com 20 Dec 2005

I'm using LVS to load balance SIP UDP connections, and when a virtual service fails on a real server the director doesn't clean up the UDP connections in the connection table. The solution seems to be to set /proc/sys/net/ipv4/vs/expire_nodest_conn=1 and to change keepalived to remove the virtual service on failure instead of using "inhibit_on_failure" (set weight to 0).
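
For directors that are configured from a script, the sysctl part of this fix could be applied along the following lines. This is only a sketch: the helper name and the `base` parameter are assumptions made for illustration, and writing to the real /proc path needs root on a director with the ip_vs module loaded.

```python
from pathlib import Path

# Directory holding the IPVS sysctls, as named in the post above.
IPVS_SYSCTL_DIR = Path("/proc/sys/net/ipv4/vs")


def set_ipvs_sysctl(name, value, base=IPVS_SYSCTL_DIR):
    """Write one IPVS sysctl and return the value read back.

    'base' exists only so the helper can be pointed at another
    directory for a dry run; it is not part of any LVS tool.
    """
    path = Path(base) / name
    path.write_text(f"{value}\n")
    return path.read_text().strip()


# On a director, as root:
#   set_ipvs_sysctl("expire_nodest_conn", 1)
```

The value does not survive a reboot, so the call (or the equivalent echo) belongs in whatever script brings up the LVS.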

Michael Pfeuffer wq5c (at) texas (dot) net 18 Nov 2008

I've got an LVS-NAT configuration that works for HTTP traffic, but SIP UDP traffic is not being load balanced (and all SIP packets come from the same source address) - it always goes to the 1st RIP on the list. The services are all configured the same except for the checktype.


The same source address would explain it. You could artificially reduce the UDP protocol timeout for testing:

ipvsadm --set <tcp> <tcpfin> <udp>  (see man ipvsadm for info)

but you can't make it less than 1 second. Also, because UDP is stateless, a "session" is viewed as traffic arriving from $host with $source_port to a given VIP/Port within the UDP timeout detailed above. This causes problems with SIP signalling data in some cases, because it has a tendency to be sourced from port 5060 to port 5060, and is quite regular.

If you have a larger spread of clients, over time things will become roughly balanced according to your RS weights.
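
The behaviour described here can be reproduced with a toy model of the director's UDP connection table. This is a hypothetical Python sketch, not IPVS code: the class name is invented, round robin stands in for whatever scheduler is configured, and time is passed in explicitly so the effect of the timeout is easy to see. An entry is keyed on the client and virtual address/port pair, every packet refreshes its timer, and the scheduler is consulted only when no live entry exists.

```python
import itertools


class UdpAffinityTable:
    """Toy model of how a director tracks UDP "connections"."""

    def __init__(self, realservers, timeout):
        self._rr = itertools.cycle(realservers)
        self.timeout = timeout  # seconds, the <udp> value given to ipvsadm --set
        self._entries = {}      # (cip, cport, vip, vport) -> (realserver, expires_at)

    def forward(self, cip, cport, vip, vport, now):
        key = (cip, cport, vip, vport)
        entry = self._entries.get(key)
        if entry is None or entry[1] <= now:
            rs = next(self._rr)  # scheduler runs only for new or expired entries
        else:
            rs = entry[0]        # live entry: same realserver again
        self._entries[key] = (rs, now + self.timeout)  # each packet refreshes the timer
        return rs


director = UdpAffinityTable(["rs1", "rs2"], timeout=35)
# One SIP peer sending from port 5060 to port 5060 every 10 seconds.
steady = [director.forward("198.51.100.7", 5060, "192.0.2.10", 5060, t)
          for t in (0, 10, 20, 30, 40)]
# The same peer after a silence longer than the timeout.
later = director.forward("198.51.100.7", 5060, "192.0.2.10", 5060, 100)
print(steady, later)
```

Because regular signalling keeps refreshing the entry, a single source that never pauses for longer than the timeout stays on one realserver indefinitely, whatever the scheduler.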

23.3. UDP timeouts (DNS)

The sysctl expire_quiescent_template (see Section 28.6) seems to be useful in several situations.
We don't find out till the end that Adrian is using persistence.

Adrian Chapela achapela (dot) rexistros (at) gmail (dot) com 05 Mar 2007

I have two directors in a high availability configuration. Everything is OK for TCP, but not for UDP. With UDP the load balancing is OK, but failover doesn't happen. If a client is "redirected" to one realserver, its connections are always redirected to that server, even if the server goes down. I'm not using DNS's built-in redundancy; I use a failover mechanism with LVS to get high availability for my DNS servers. Today I changed the virtual server config for my DNS servers. Before, I had two configs, one for UDP and another for TCP, for the "same" service. A DNS service can use UDP or TCP, and I configured my DNS servers to serve DNS queries over both protocols. When keepalived serves the DNS service over both UDP and TCP, failover is OK, but when I configure it to serve UDP only, failover doesn't happen.

My UDP health check does the right thing (I think...). The checks correctly recognize when a server goes down. It no longer appears in the list, but packets are still sent to that server many minutes later, so the packets go to "limbo" (/dev/null I think...). I don't know what is happening, but in another situation, with a firewall made with Shorewall, I had a similar problem. I changed the rules in the firewall (one port for another) and the packets were still routed to the first port. I rebooted the machine and all was OK. With TCP this never happens.

Later - apparently after help from Graeme

The solution was to set /proc/sys/net/ipv4/vs/expire_nodest_conn=1. When a server is removed from the pool, its 'established' connections are now removed; before, the connections waited for the protocol timeout, which for UDP is too long. Another important variable is /proc/sys/net/ipv4/vs/expire_quiescent_template (see Section 28.6). For now I don't use it.

Simon Pearce sp (at) http (dot) net 27 Nov 2006

I am running a DNS cluster (Gentoo) with two directors active/active and 4 realservers running powerdns. Each server has a 3GHz Pentium 4 and 1 Gig of RAM. I have about 250 VIPs. I could do it all with one VIP of course, but quite a few of our customers require their own DNS servers with their own IP address. A lot of them don't really need it, but it looks good to them.

Every time the DNS cluster exceeds a certain limit, some of the IP addresses stop working properly. It affects the system in such a way that for certain domains you get a timeout when querying the cluster. Some of the transferred IPs seem to stop working or slow down to such an extent that other DNS servers stop querying us. Load average is 1-2. Even though queries don't get through the director (reply in 4000ms), the realservers answer direct requests. The only iptables rule is on the director to masquerade out calls to the internet.

Joe: Is the problem load or the number of IPs (if you can tell)? There is another problem with failover of large numbers of IPs, just in case you want to read more on the topic (it may not be related to your problem). http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.failover.html#1024_failover

Can you setup ipvsadm with a single fwmark instead of all the IPs? That would shift the responsibility for handling all the IPs to iptables, rather than ipvsadm.

Graeme Fowler graeme (at) graemef (dot) net 27 Nov 2006

I know it was LVS-DR, and that it didn't have 250+ IP addresses, but the DNS system I built for my previous employer used LVS with keepalived. The last time I had access to the statistics, it was running at something like 1200 queries/sec (which will have risen now by something like 25% if memory serves), 99% of which were UDP, without a glitch.

However - as Joe mentioned - I built it to balance on fwmarks, not on TCP or UDP. Incoming packets were marked in the netfilter 'mangle' table according to protocol and port, and the LVS was then built up from the corresponding fwmarks.

There was one network "race" we never bottomed, which has affected the system once or twice since I left, where an unmarked packet somehow slipped through to the "inside" (i.e. realserver-facing rather than client-facing) LAN and then caused massive traffic amplification. That however isn't related in any way to the OP's problem.

Wayne wayne (at) compute-aid.com who has a financial interest (works for?) Webmux, posted that only Webmux has solved this problem, but didn't give any details (Simon has now solved it). At the moment this stands as an uncorroborated statement by Wayne. Presumably the 53-tcp/udp replies and calls to forwarders (from the RIP?) were being correctly nat'ed. A solution was posted by Graeme Fowler where the IPs are fwmark'ed and the LVS balances the fwmark, but this was for a smaller number of IPs and Graeme didn't know if it would work for 250IPs.

Simon Pearce sp (at) http (dot) net 6 Apr 2007

Some of you on the list might remember my problem concerning our DNS cluster last year. http://archive.linuxvirtualserver.org/html/lvs-users/2006-11/msg00278.html

These problems (DNS timeouts) have continued throughout this year and I have been desperately trying to find the solution. I have been following the mailing list and stumbled over the problems Adrian Chapela was having with his DNS setup, which brought me to the solution. According to ipvsadm -L --timeout, the default timeout for UDP packets was set to 300 seconds, which is way too long and should be changed: the load balancers were waiting 5 minutes to time out a UDP packet, and I get about 1500 queries a second. I changed the setting to 15 seconds last week and moved some of our old windows/bind DNS servers to the new linux DNS cluster. Before I changed the timeout settings, I always received a call from our customers within two hours saying "your DNS services are not responding correctly". The IPs that refused to answer would always change. I have 254 IPs; some of the large German dialup providers would refuse to talk to us, which resulted in domains not being reachable. Our DNS cluster is authoritative for about 250000 domains, so you can imagine how many complaints I received. I was about to give up and scrap keepalived; I am so glad I did not. Changing the timeout value solved my problems and I am a happy man at the moment. Is there a way to set the timeout value permanently so it is saved after a reboot of the server? One last thing I would like to say is a big thank you to Graeme Fowler, Horms, Adrian Chapela and Alexandre Cassen for writing this great piece of software, and to anyone else on the list who may have contributed to help me finally find the solution. Thank you guys, you do a great job on the mailing list.
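
To get a feel for what the timeout change means for the director, here is a back-of-the-envelope calculation using the figures from this post (about 1500 queries a second, a 5 minute timeout reduced to 15 seconds). It is a worst-case sketch under the assumption that every query arrives from a new client address and port, so that each one creates its own connection entry; real traffic reuses entries and will sit below this bound. The function name is made up for the example, and nothing here claims that table size was the cause of the failures described.

```python
def max_udp_entries(packets_per_second, udp_timeout_seconds):
    """Upper bound on live UDP entries if every packet comes from a new CIP:CPORT."""
    return packets_per_second * udp_timeout_seconds


before = max_udp_entries(1500, 5 * 60)  # 5 minute timeout
after = max_udp_entries(1500, 15)       # 15 second timeout
print(before, after)                    # 450000 22500
```

The same ratio (20 to 1) applies to how long a stale entry can keep steering a busy resolver to one realserver.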

horms 8 May 2007

glad to hear that you got to the bottom of your problem. I am a little concerned about the idea of reducing UDP timeouts significantly because to be quite frank UDP load-balancing is a bit of a hack. The problem lies in the connectionless nature of the protocol, so naturally LVS has a devil of a time tracking UDP "connections" - that is a series of datagrams between a client and server that are really part of what would be a connection if TCP was being used.

As UDP doesn't really have any state, all LVS can do to identify such "connections" is to set up affinity based on the source and destination IP and port tuples. If my memory serves me correctly, DNS quite often originates from port 53, and so if you are getting lots of requests from the same DNS server then this affinity heuristic breaks down.

The trouble is that if the timeout is significantly reduced, the probability of it breaking down the other way - in the case where that affinity is correct - increases.

I'm not saying that you don't have a good case. Nor am I saying that changing the default timeout is off-limits. Just that what exactly is a good default timeout is a tricky question, because what works well in some cases will not work well in others, and vice versa.

To some extent I wonder if the userspace tools should have the smarts to change the timeout if port 53 (DNS) is in use. Though that may be an even worse heuristic.

I wonder if a better idea might be the one packet scheduling patches by Julian http://archive.linuxvirtualserver.org/html/lvs-users/2005-09/msg00214.html. Much to my surprise these aren't merged. Perhaps that's my fault. I should look into it.

I also wonder: if the problem relates to connection entries for servers that have been quiesced, does setting expire_quiescent_template help (see Section 28.6)?

echo 1 > /proc/sys/net/ipv4/vs/expire_quiescent_template

Rudd, Michael Michael (dot) Rudd (at) tekelec (dot) com 8 May 2007

My 2 cents in dealing with DNS and your idea of the OPS feature. I have implemented the OPS feature into the 2.6 kernel and it's running well. Without that feature, we wound up having all the DNS queries from our DNS client get sent to the same realserver.

The problem we did run into, which I've gotten help from the community on, is when using LVS-NAT, the source packet isn't SNAT'd. This is because LVS on the outgoing packet doesn't know the packet is an LVS packet, so it just forwards it out. I fixed this with an iptables rule to SNAT it myself. Just an FYI if you ever choose to use OPS with LVS-NAT.


Mmm, I guess OPS isn't quite the right solution to the DNS problem :(

23.4. Julian's One Packet Scheduler (OPS) for UDP, timeouts for DNS

Although UDP packets are connectionless and independent of each other, in an LVS, consecutive packets from a client are sent to the same realserver, at least till a timeout or a packet count has been reached. This is required for services like ntp where each realserver is offset in time by a small amount from the other realservers and the client must see the same offset for each packet to determine the time. The output of ipvsadm in this situation will not show a balanced LVS, particularly for a small number of client IPs.

Julian has experimental LVS patches for a "one packet scheduler" which sends each client's UDP packet to a different realserver.

Ratz 17 Nov 2006

First off: OPS is not a scheduler :), it's a scheduler overrider, at best.

It's really a bit of a hack (but probably a hack that is needed), especially with regard to the non-visible part in user space. I have solved this in my previously submitted Server-Pool implementation, where several flags are exported to user space and displayed visibly. Now I remember that all this has already been discussed ... with viktor (at) inge-mark (dot) hr and Nicolas Baradakis has already ported stuff to 2.6.x kernels:


Julian's reply:


So Julian sees this patch in 2.6.x as well :). I've also found a thread where I put my concerns regarding OPS:


The porting is basically all done, once we've put effort into Julian's and my concerns. User space is the issue here, and with it how Wensong believes it should look like.


As an option, I can't see any harm in this and I do appreciate that it is needed for some applications. Definitely not as default policy for UDP, because the semantic difference is rather big:

srcIP:srcPort --> dstIP:dstPort --> call scheduler code & add template
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> read template entry for this hash
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> read template entry for this hash
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> call scheduler code & add template
srcIP:srcPort <-- dstIP:dstPort --> do nothing

srcIP:srcPort --> dstIP:dstPort --> call scheduler code (RR, LC, ...)
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> call scheduler code
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> call scheduler code
srcIP:srcPort <-- dstIP:dstPort --> do nothing
srcIP:srcPort --> dstIP:dstPort --> call scheduler code
srcIP:srcPort <-- dstIP:dstPort --> do nothing

The other question is, if it makes sense to restrict it to UDP, or give a choice with TCP as well?


This is a problem with people's perception and expectation regarding load balancing. Even though the wording refers to a balanced state, this is all dependent on the time frame you're looking at it. Wouldn't it be more important to evaluate the (median) balanced load over a longer period on the RS to judge the fairness of a scheduler? I've had to explain this to so many customers already. There are situations where the OPS (I prefer EPS for each packet scheduling) approach makes sense.

Julian Nov 18 2006

I don't know what the breakage is and who will solve it but I can only summarize the discussions:

  • RADIUS has responses, not only requests
  • DNS expects responses (sometimes many if client retransmits)
  • there is a need for application aware scheduling (e.g. SIP). SIP does not look like a good candidate for OPS in its current semantics.
  • Everyone talks about OPS. But for what applications? Maybe that is why OPS is not included in the kernel: there is not enough demand, and not enough understanding of the test case that OPS solved in one or two setups.
  • Last known 2.6 patch for OPS (parts):


Threads of interest:

2005-Sep, Oct:


"Rudd, Michael" Michael (dot) Rudd (at) tekelec (dot) com 12 Apr 2007

With the OPS feature turned off, the source IP address is correctly SNATed to my VIP. With the OPS feature on and working correctly (which we need for our UDP service), the source IP address isn't correctly SNATed.

Julian 18 Apr 2007

OPS is implemented for setups where there is no reply for the original packet.

DNS and RADIUS have reply, so they need something different (which is not done by OPS):

  • every original packet should pass scheduling step
  • reply (or replies) should go back properly, the hash connection should keep the needed information (VIP, VPORT)

So, what you need is something different. OPS now works in this way:

  • schedule original packet but don't hash the connection
  • when the reply packet is received it can not find a connection, and the reply packet is treated as a non-IPVS packet (the real server receives an ICMP error or the packet goes further). That is why such replies don't have the proper VIP:VPORT if passed to the output device.

Is anybody aware of the code for this? I assume it's related to not looking up the connection in the hash table anymore with OPS, thus not SNATing. Maybe an iptables rule could fix this?

You can use a rule, but maybe you will not get the right VPORT every time. Not sure why you need OPS. It was created when someone needed to generate many requests from the same CIP:CPORT with the assumption that there are no replies. Only when many connections come from the same CIP:CPORT within the configured UDP timeout period does connection reuse prevent scheduling from being used for every packet. That is why OPS was needed: to schedule many packets (coming before expiration) from the same CIP:CPORT->VIP:VPORT to different real servers.

Maybe what you need from OPS is impossible: when OPS is not used and a reply is delayed, IPVS will wait until the configured UDP timeout expires, but this value can be different from the timeout your clients are using. A difference of milliseconds can be fatal. What can happen is that a different request from the same CPORT will go to the same real server as long as the UDP timeout has not expired. There can be different situations:

  • clients can retransmit on some timeout (DNS, RADIUS)
  • nobody tells IPVS how many requests (and the same number of replies, if the application works that way) should pass before the NAT connection is removed explicitly, ahead of expiration, so that the next request can be scheduled to a different real server.

So, the main problem is that it is not easy to balance a single CIP across many real servers if there are replies that can be delayed or requests that can be retransmitted. IPVS has no way to know when to forget a connection so that the next packet from the same CIP:CPORT can be scheduled. So, if the client expects replies, OPS should not be used. Instead, use a short UDP timeout, and be prepared for a single CIP:CPORT to be scheduled to the same real server even if many distinct (from the application's point of view) requests are sent from the same socket.
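The effect of the UDP timeout on scheduling can be illustrated with a small model. This is a sketch under stated assumptions, not IPVS itself: the `UdpConnTable` class is hypothetical, time is passed in explicitly as seconds, scheduling is round-robin, and each packet is assumed to restart the entry's timer.

```python
class UdpConnTable:
    """Toy connection table with a UDP timeout; names are hypothetical."""

    def __init__(self, realservers, udp_timeout):
        self.realservers = list(realservers)
        self.udp_timeout = udp_timeout
        self.entries = {}   # (cip, cport) -> (realserver, expires_at)
        self._next = 0      # round-robin position

    def schedule(self, cip, cport, now):
        """Return the realserver for a packet arriving at time `now`."""
        key = (cip, cport)
        entry = self.entries.get(key)
        if entry is not None and now < entry[1]:
            rs = entry[0]   # entry still alive: same realserver
        else:
            rs = self.realservers[self._next % len(self.realservers)]
            self._next += 1  # entry missing or expired: schedule again
        self.entries[key] = (rs, now + self.udp_timeout)  # restart the timer
        return rs


arrivals = (0, 1, 2, 10)  # packet arrival times in seconds
short = UdpConnTable(["RIP1", "RIP2"], udp_timeout=5)
long_ = UdpConnTable(["RIP1", "RIP2"], udp_timeout=300)
print([short.schedule("CIP", 1024, t) for t in arrivals])
print([long_.schedule("CIP", 1024, t) for t in arrivals])
```

With the short timeout, the packet arriving after a quiet period is scheduled afresh. With the long timeout, every request from the same CIP:CPORT stays on one realserver, however many distinct application requests the socket carries.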

So I send my DNS query to my VIP on my directors. It gets routed to a realserver on which I've attached the VIP to bond1.201:0. According to others I've talked to, I shouldn't need an iptables rule, but I still don't see the packet go out with the source IP address of the VIP. I see the packet with the source IP of the actual realserver. It's possible it is a routing issue, though, so I plan on digging deeper into that today.

For LVS-DR the reply should be generated on the real server with src=VIP. If you are asking about LVS-NAT, then with OPS you will need the iptables SNAT rule because IPVS does not recognize replies. But I have never tested such a setup. Without OPS you don't need an iptables SNAT rule; IPVS translates the source address.

Should I need an iptables rule at all for LVS-DR?

No, the reply goes directly from the real server to the client.

23.5. icmp responses aren't generated by UDP timeouts on VIP-less directors

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 19 Jan 2007

I am using LVS director with no VIP for load balancing ipsec servers accessed by NAT'ed clients (udp 500/4500, fwmark method). When I remove a realserver (ipvsadm -d ...), its clients are not notified after their connections expire. I suspect that icmp responses are simply not generated on the director as they should be.


Yes, icmp_send() has code that feeds ip_route_output*() with a non-local source address (the VIP that is not configured as an IP address on the director). The ICMP reply logic is implemented in a way that ensures the ICMP packet will use a local IP address, for example, when an IP router wants to send a reply for a packet destined to the next hop (which looks like our case).

The networking maintainers are still waiting for someone to split all callers of ip_route_output*() into those that require a local source address and those that don't. The goal is to move the check for a local source address out of ip_route_output, to allow code such as NAT or IPVS to get an output route with a non-local source address (maybe there are other such uses). Every call site should be audited, and the check for a local IP added only where needed.

The ICMP reply code is one such place that needs to send ICMP replies with a local address: the receiver should see who generated the error.

So, for the problem in the original posting: IPVS users who need to send ICMP replies for VIPs should configure the VIPs on the director. I'm not sure there will be another solution. If one day ip_route_output does not validate the source address, maybe icmp_send can rely only on this check, as before:

        saddr = iph->daddr;
        if (!(rt->rt_flags & RTCF_LOCAL))
                saddr = 0;

Then the director will send ICMP replies from VIPs, by using the local-delivery method to accept traffic for the VIP.
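The source-address choice made by the check above can be mirrored in a few lines of Python. This is only a model of the logic, not kernel code: the function name and parameters are hypothetical, and the fallback to the director's own address when saddr is 0 is a simplifying assumption about what route lookup would select.

```python
def icmp_source(packet_daddr, route_is_local, director_addr):
    """Model of the saddr choice for an ICMP error; names are hypothetical."""
    # saddr = iph->daddr, kept only if the packet was routed for local delivery
    saddr = packet_daddr if route_is_local else None
    # saddr == 0 in the kernel snippet: assume routing picks the director's address
    return saddr if saddr is not None else director_addr


print(icmp_source("VIP", True, "DIP"))
print(icmp_source("VIP", False, "DIP"))
```

In this model the client sees the VIP as the ICMP source only when traffic for the VIP is delivered locally on the director. Otherwise the error carries the director's own address.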

I hope the problem can be solved in another way, e.g.

  • isakmp keep alive
  • longer UDP timeout
  • persistency

It is against the principles of UDP for users to expect reliability from the internet. See RFC 1122, section 4.1.1.

Any support for ISAKMP keep alives in your devices?

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 13 Feb 2007

If you mean DPD (dead peer detect) - yes, it is supported (I use OpenSwan), but it does not work very well for me. In my case, several tunnels can use the same ISAKMP association, and only one of them is removed when the peer is assumed dead. Other tunnels stay on, ignoring ICMP port unreachable messages my patched director is sending, until they expire. My current workaround is not using DPD, but setting a short rekey period (15 mins or less).

Ratz 5 Feb 2007

It generally does not make sense for the director to send notification of the connection expiration, since the connection was established between the CIP and the VIP. The director does not have any open sockets. However, I can understand the need for such a notification. If the director knew the endpoint's socket state, we wouldn't need the opportunistic timeout handling currently present in IPVS. From my understanding, no one is required to send an ICMP message back, especially on the grounds of pulling a machine from the network. I would need to re-read the RFC and study Stevens' diagrams to make a supported statement.

Here's Janusz's description of the patch (which is being incorporated into LVS). If a realserver goes down, then the director being the last hop is supposed to let the client know that the (virtual) service it was connected to doesn't exist anymore. Presumably shortly thereafter, failover will fix up the ipvsadm table and the client gets to pick a new service?

Janusz Krzysztofik jkrzyszt (at) tis (dot) icnet (dot) pl 27 Mar 2007

Yes, that is it, but there's more too. It acts on a VIP-less director in two cases at least:

In an overload case, it should allow using the VIP source address in icmp port unreachable packets which are sent by ip_vs_leave(), "called by ip_vs_in, when the virtual service is available, but no destination is available for a new connection" (quote from ip_vs_core.c).

As you said, but with some tricks: On my VIP-less fwmark LVS-DR director, I use it for sending icmp port unreachable packets after a UDP connection is removed when a realserver goes down, or when it just expires and a packet could be sent to a different realserver without the client's knowledge (for TCP, TCP_RESET is sent, which does not need this patch).

First of all, my LVS-DR traffic all goes through conntrack (see F5_snat http://www.austintek.com/LVS/LVS-HOWTO/HOWTO/LVS-HOWTO.LVS-NAT.html#F5_snat). Only the first packet of each connection is marked for a specific service, to get the IPVS connection entry established. Subsequent packets are just marked for a local routing rule, so they can be seen, matched against the existing connection and redirected by ip_vs_in(). If the connection is removed (expired, timed out after a realserver goes down, or removed immediately with the help of expire_nodest_conn), the packet goes through ip_vs_in() untouched (no matching connection or service) and ends up in udp_rcv(), where the icmp port unreachable packet with the VIP source address is generated and sent to the client with the help of the patch in question.

Unfortunately, this does not remove the corresponding conntrack entry (unlike TCP_RESET does), so if a client ignores icmp errors and keeps sending, all following packets go the same way and director keeps responding with icmp errors. To solve this problem, I have moved ip_vs_in() before INPUT filter hook (http://www.icnet.pl/download/ip_vs_core-input-before-filter.patch), set up netfilter rules that generate log entries for this case, and set up a user-space daemon tracing log and removing conntrack entries for non-existing LVS connections (http://www.icnet.pl/download/delete_connection.sh).
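A log-tracing helper of the kind described above might look roughly like the following Python sketch. It is not Janusz's delete_connection.sh. It assumes the usual netfilter LOG line layout (SRC=, DST=, PROTO=, SPT=, DPT= fields) and the conntrack command from conntrack-tools. The function name and the sample addresses are hypothetical, and the sketch only builds the command line without running it.

```python
import re

# Fields of interest in a netfilter LOG line (assumed layout)
LOG_FIELD = re.compile(r"\b(SRC|DST|PROTO|SPT|DPT)=(\S+)")


def conntrack_delete_args(log_line):
    """Build a `conntrack -D` argument list for a logged UDP packet, or None."""
    fields = {}
    for name, value in LOG_FIELD.findall(log_line):
        fields.setdefault(name, value)  # keep the first occurrence of each field
    if any(name not in fields for name in ("SRC", "DST", "PROTO", "SPT", "DPT")):
        return None                     # not a complete packet log line
    if fields["PROTO"] != "UDP":
        return None                     # only UDP entries need this cleanup
    return [
        "conntrack", "-D", "-p", "udp",
        "--orig-src", fields["SRC"], "--orig-dst", fields["DST"],
        "--sport", fields["SPT"], "--dport", fields["DPT"],
    ]


sample = ("IN=eth0 OUT= SRC=192.0.2.10 DST=198.51.100.1 LEN=76 TOS=0x00 "
          "TTL=64 ID=0 DF PROTO=UDP SPT=4500 DPT=4500 LEN=56")
print(conntrack_delete_args(sample))
```

A daemon would read new log lines in a loop and execute the resulting command for each match. Deciding which log entries correspond to non-existing LVS connections is left to the netfilter rules that generate them.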