34. LVS: Details of LVS operation, Security, DoS

34.1. Top 20 security vulnerabilities

See list of top 10 windows and top 10 unix vulnerabilities

34.2. Top 75 security tools from the people at nmap

See the Top 75 Security Tools survey of May 2003, compiled by polling the nmap mailing list.

34.3. Network Testing with Aberrant Packets

Note
This is not exactly DoS, but is from a thread on another mailing list.

Jeff The Riffer riffer (at) vaxer (dot) net 27 Feb 2007

We used several tools to generate aberrant behavior, rather than packet replays. One was Core Impact, which actually exploits known holes and installs agents. It can do TCP evasion techniques to a limited extent. For aberrant behavior, we found a nifty little open-source tool called isic, which lets you generate all sorts of abnormal traffic:

http://www.mirrors.wiretapped.net/security/packet-construction/isic/

It has binaries to generate abnormal ethernet, UDP, TCP, IP, and ICMP traffic. You can control percentages of the different abnormalities as well as volume of traffic. It's VERY noisy and aggressive stuff, but great for seeing if you can bring down a system. You can also use it to generate a packet storm while trying to sneak in through a more mundane attack and trick your IDS/IPS route. We had problems getting it compiled, but someone was able to find a Debian package for it. The Debian package was converted to RPM using Alien and the RPM worked great under SuSE 10.0.

Other than that we just used NMap and Nessus to generate varying levels of traffic and alerts. Isic was very useful for us...

NMap and Nessus are there to test how well an IPS detects scanning. We were doing a very comprehensive test so we made no assumptions about capabilities of the products. NMap and Nessus by themselves would not be sufficient.

The problem with replaying actual packet captures to test an IPS is that the packets will carry whatever IP addresses were in play when the capture was done, so that won't really work either. You can muck around with the .cap file and change the IPs and MAC addresses but it's an iffy solution.

Core Impact is really great. But it's commercial and expensive, so most folks aren't going to have it. But, Metasploit is free and can do many of the same things. Just not as easily.

34.4. Do I need security, really?

Malcolm Turnbull

Assuming that you have an LVS loadbalancer running on a Linux box and this box is behind a firewall so that only ports 80 and 443 are allowed from clients, do you really need to harden the loadbalancer firewall rules? What about SYN cookies?

Ratz 01 Oct 2002

Yes, always. DROP ALL, accept TCP 80/443 only. Especially if the packet filter in front and the LVS are running the same OS :)

Nothing can prevent SYN flooding; you can only live better with it when you have SYN cookies enabled. With a wrongly set backlog queue size you still face a big penalty with SYN/RST attacks. See syncookies.

Roberto Nibali ratz (at) tac (dot) ch 03 May 2001

It doesn't matter whether you're running an e-gov site or your mom's homepage. You have to secure it anyway, because the webserver is not the only machine on a net. A breach of the webserver will lead to a breach of the other systems too.

Joe Sep 2005

As Marcus Ranum says (http://www.ranum.com/security/computer_security/editorials/dumb/) "Worms aren't smart enough to realize that your web site/home network isn't interesting".

Joe 01 Oct 2002

Yes Virginia, you need security.

There's the technical level. Can an intruder, who gets beyond the firewall, do any damage after getting access to the director, the realservers? If so, do you care? Maybe you do, maybe you don't - it will depend on what you have on those machines - if it's only publicly available (readonly) webpages, you're less concerned than if you have customer business information on it.

Are there adjacent machines on the network, that have more sensitive data than yours, that could be attacked from your compromised machines? You don't want to be an intermediate site to an attack on an expensive setup in the next rack.

You may think that with a hardened front end, backend machines need less protection. However a new exploit we haven't thought about may hop through a firewall and land on one of the less protected machines behind it. You should think about the damage that could occur if the attacker gained root access to any of your machines. LVS-DR is easier to protect in this case, as packets from the attacker on a realserver will be coming from the RIP, while packets from the LVS'ed service will be coming from the VIP. (If the realservers are in a 3-Tier LVS, then only the packets from the RIP on the realserver to the external 3-Tier services should have routes.) There should be no default gw on the realservers for packets from the RIPs. Packets from the RIPs to 0:0 should be dropped (and logged). The only allowed packets from the RIPs are those needed for internal networking between the machines in the LVS (e.g. local mailing of logs, updates to files). These packets will have dst_addr as another machine on the RIP network. On the director, there should be no default gw for packets from the VIP (see routing for LVS-DR). The only way an attacker with root can send packets to the outside world is by changing the routing tables (which you should be able to pick up).

But security is more than a technical thing. How will your customers react if their website goes down, gets defaced or has credit card info stolen? You're going to have to explain that your actions were diligent and the break-in was beyond anything that you could be expected to handle. You then have to mollify them and make sure you keep the account. You'll also have to explain to potential customers why that humongous break-in that made it to the front page of the NY Times for a week, doesn't reflect poorly on you. If these people are non-technical (as most people with the money are) this may be difficult.

It is just easier to make sure there never is a break-in. Of course there's no end to the things you can do in this department, so somewhere you'll have to decide what you're prepared to deal with upfront at the keyboard and what you're prepared to deal with at the backend, after a break-in, face-to-face with an unhappy customer, who is simultaneously dealing with his own unhappy customers. To your client, you aren't a genius with a keyboard who understands computers. Nope - you're the security guard they've hired to look after a warehouse of their widgets and if you let someone get off with them, they're not going to be terribly interested in your explanation of why or how it happened.

I'd say the minimum for a production machine, exposed to the internet, is a set of rules on each machine (director and realservers) that only allows the packets needed for the LVS (by port, IP, proto) and drops the rest. Every packet to and from a machine must be inspected by a filter rule. Every rejected packet must be logged (at least till you find out where they're coming from). Routing should be designed to allow out, only the packets you want to go out (outgoing packets are filtered by port and IP).
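The "allow only what the LVS needs, log and drop the rest" policy above can be sketched as a small rule generator. This is only an illustration and deliberately incomplete: the function name and the VIP are made up for the example, the rules are placed on INPUT on the assumption that this is where LVS traffic is seen on the director, and a real ruleset would also have to cover admin access, the realservers and outgoing packets. It prints commands; it does not run them.

```python
def minimal_input_rules(vip, ports=(80, 443)):
    """Return iptables commands that accept TCP to vip on the given ports,
    then log and drop everything else arriving on INPUT.

    Hypothetical helper for illustration; not part of LVS or iptables.
    """
    rules = []
    for port in ports:
        rules.append(
            f"iptables -A INPUT -d {vip} -p tcp --dport {port} -j ACCEPT"
        )
    # Log whatever was not accepted, so you can find out where it comes from.
    rules.append('iptables -A INPUT -j LOG --log-prefix "lvs-drop: "')
    # Then drop it.
    rules.append("iptables -A INPUT -j DROP")
    return rules


# Example VIP; substitute your own.
for rule in minimal_input_rules("192.168.1.110"):
    print(rule)
```

Generating the rules from a list of ports keeps the accepted services in one place, which makes it easier to check that nothing else is let through.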

If you're being bombarded with malicious traffic (spam, DoS), tcpdump is not a good diagnostic tool - you will not be able to decipher the deluge. Try snort. Here's an Introduction to Network-Based Intrusion Detection Systems Using Snort.

Limit places where intruders can login (e.g. with xinetd). For maintenance, don't login over the same networks that the LVS traffic flows on (e.g. RIP network, outside network with VIP). For maintenance/admin, only allow connection by ssh through a separate ethernet card on a different set of wires and different network (backed up with filter rules and xinetd), or via the console.

34.5. What to do after a break-in, prevention strategies

In the early 1990s, a break-in was unusual and, being a criminal act, some investigative body was notified (e.g. the police). This being a new type of crime, usually the investigators had no idea how to handle the situation. At my work a multi-CPU, refrigerator-sized mail server was compromised and the investigators swooped in and seized the server; not just the disk(s) - the whole server, and wheeled it out the door. We were told that the server would be returned on completion of the investigations (and any trial if suspects were apprehended). On asking them when that might be, we were told that if they could not find any suspects, the machine would be returned when the case timed out, which would be 8 years.

This was a big lesson to all involved. The next break-in I was involved in, the machine wouldn't boot. I reformatted the disk and reinstalled the OS from CD and the user's files from backups. When the investigators arrived and asked to inspect the machine, I told them that the disk had been reformatted and offered them the last tape backup. They didn't want it.

Subsequently I attended a talk by a programmer/lawyer/cyber-investigator from Washington DC, who worked for the US govt. He told us that after a break-in we must not touch the machine (it's tampering with evidence) and gave us a list of contacts. At question time, I told him the story about the server being wheeled out the door (which several people in the audience were familiar with), and my reformatting of the disk on the machine I handled, at which he grew visibly angry. I asked him if he expected us to call them if they were going to take our hardware away when they only needed to take a bitwise copy of the disks. After all, the police don't seize your house after a burglary.

In his reply, it was clear that he recognised that such actions by the investigators weren't optimal, but as to what I should do next time, he only offered more standard party line and I decided that next time I wouldn't be calling anyone.

Following a set of Unix/Linux break-ins (Apr 2004), Stanford U put out (link dead Jun 2004 http://securecomputing.stanford.edu/alerts/multiple-unix-6Apr2004.html) "Multiple Unix compromises on campus", describing their problems and offering links to further pages (such as information on rootkits).

Unfortunately the current state of security is that much work is needed to get it and much of the prevention work seems to be applying patches. This is a lot of work and I can't imagine that it will prevent most break-ins. I personally find tiresome the practice of forced passwd expiration every 3 months on the 30 accounts I have in several administrative fiefdoms. I'm expected to keep them in my head when they are 8 char mixed uc/lc and contain at least 2 numbers. Who are they kidding? I tape them to the edge of my monitor. The article links to Steps for Recovering from a UNIX or NT System Compromise, a CERT/AusCERT paper. It seems to have been written by people who like being on committees and who want you to spend so much time securing your machine, that you'll never be able to use it. Unfortunately the people who make decisions about managing computers, who have never dealt with a break-in and know nothing about security, will cover their asses (arses) with a never ending round of patching. The result is that users have to suffer machines being rebooted from under them every 2 weeks (and losing all the user settings), the SA never does anything useful, but when the inevitable break-in occurs the manager can happily say to some committee "we did everything we could to prevent it".

My only interaction with CERT was not good. One day (mid 1990s) I got vitriolic e-mail from a person announcing that he was one of the top AusCERT security experts, and that my map server (now at AZ_PROJ map server and producing about 10,000 maps/yr) was doing robotic attacks on his network. If I didn't cease and desist immediately, a dire fate would befall me.

Now if there was some problem with my machine and I had come to the notice of CERT, I would expect a letter from CERT saying

Dear Sir,

	It has come to our notice that your machine (IP=xxx.xxx.xxx.xxx)
has been sending these packets (logs enclosed) to machines (n1...nx).
Since these aren't part of the expected packet stream, we're concerned
that there may be some problem. Do you know anything about this?

	This is a routine letter and indicates the beginning
of an investigation into a problem and can be tracked at
url/case_number. 

	Hoping to hear from you.
	Thank you
	Your friendly CERT representative

I assumed I was talking to a crazed idiot.

The next day I got an even more vitriolic e-mail from the same guy, promising me certain internet death if I didn't stop attacking his machines. Somewhere in here, he sent me logs, whose relevance to the problem was not obvious at the time.

Then I got e-mail from a user of my map generator saying that it had stopped working for him and could I help. The map generator produced azimuthal equidistant projection maps in many formats, including an X-client which could popup an X-window map on your screen (there were instructions on setting xhost etc). The user was having problems getting the X-display of my maps going, when previously it had worked. AFAIK, no-one was using the X-client and I had forgotten it was there (everyone was generating gifs). Somehow (IPs?) I connected this user to the AusCERT expert. I told the user what to do and then sent off e-mail to the CERT expert, giving the url of my map generator and telling him to go look at what it did.

This only inflamed the CERT security expert more and shortly thereafter I got e-mail from an even higher level AusCERT uber-expert telling me that I'd been listed as one of the biggest internet nasties of all time and that no-one was ever going to get a packet in or out of my machine ever again.

I explained to the uber-expert what my machine was doing and that he was probably getting X-packets from my server and to go try it out for himself.

There was silence for a couple of days, and then a rash of apologies from both CERT experts. There was no indication that they had learned anything from this or would change their methods next time.

The fact that the top CERT experts in Australia don't know an X-packet from a hole in the wall and are prepared to declare internet death on someone without an investigation (courteous or otherwise), indicates that we shouldn't hold out much hope of CERT saving us from anything. If CERT can save us from CERT, we should be thankful.

34.6. More about syncookies

anon

Humans usually do not establish SYN connections. It is more likely to be Nimda or other worms. If I can determine a threshold of simultaneous SYN connections that Nimda usually creates, I will be able to drop packets from specific source IPs which meet the threshold.

Roberto Nibali ratz (at) tac (dot) ch 06 Aug 2003

Search google using my name and syncookies for more information on why syncookies have no measurable impact on reducing real DoS.

If you can _really_ figure out a metric for mutually exclusive TCP/SYN patterns generated by existing worms and write it down in a mathematical formula which has a lower false positive rate than any TCP/QoS "defense" mechanism using a stochastic (timed) fairness approach, you need not worry about money anymore. In fact influential people in the Internet business might feel a sudden urge to talk to you! ;)

34.7. Can filter rules stop the intruder hopping to other machines?

Malcolm Turnbull malcolm (dot) turnbull (at) crocus (dot) co (dot) uk 14 Feb 2003

Nope, if you're hacked they can just change your firewall rules... One of my clients got hacked and the only way they found out was because the hacker (possibly script kiddy) tried to flush the iptables rules, therefore breaking all of the NAT rules, therefore taking down the web site...

How did he get in: broke into IIS through common bug, installed a trojan, used SSH to get from the web server to the firewall .. etc etc...

Even if you put the LVS behind a firewall (which I prefer) you still need to open port 80... Is it secure? Yes, I think so. Hackers tend to concentrate on the application (i.e. apache or IIS) these days; it's much easier.

One other gotcha.. If your fallback server is localhost you are obviously exposing your local apache installation !

Nate Carlson natecars (at) real-time (dot) com

The firewall should be configured so untrusted hosts (e.g. the web server -- any box that isn't the box that people are expected to log in from) can't connect to the SSH port (or any other service) on the firewall.

34.8. Where filter rules act

Joe - iptables (2.4 kernels) has no "iptables -C" to check your rules (at least not yet - one is promised).

Ratz

If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:

packets coming from outside to the LVS do:

PRE_ROUTING -> LOCAL_IN(LVS in) -> POST_ROUTING

packets leaving the LVS travel:

PRE_ROUTING -> FORWARD(LVS out) -> POST_ROUTING

From the iptables howto: COMPATIBILITY WITH IPCHAINS

This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.

Julian

  • 2.4 director:

    Packets coming into the director (out->in):

    • NAT: INPUT -> input routing -> local: LVS/DEMASQ -> input routing -> forwarding -> OUTPUT
    • DR/TUN: INPUT -> input routing -> local: LVS -> output routing -> OUTPUT

    packets leaving the LVS travel (in->out):

    • NAT only: INPUT -> input routing -> FORWARD (-j MASQ) -> LVS/MASQ -> OUTPUT

  • 2.2 director:

    INPUT in 2.2 is similar to PRE_ROUTING in 2.4, i.e. INPUT, OUTPUT and FORWARD are the 2.2 firewall chains

    input routing: ip_route_input()
    output routing: ip_route_output()
    forwarding: ip_forward()
    local: ip_local_deliver()
    

Matthew S. Crocker matthew (at) crocker (dot) com 31 Aug 2001

How do I filter LVS? Does LVS grab the packets before or after iptables?

Julian

LVS is designed to work after any kind of firewall rules. So, you can put your ipchains/iptables rules safely. If you are using iptables put them on LOCAL_IN, not on FORWARD. The LVS packets do not go through FORWARD.

Note

Although LVS is compatible with any kind of filter rule (i.e. ipchains, iptables), it has incompatibilities with netfilter, i.e. you may not be able to have your firewall on the director. For more info see Running a firewall on the director.

Joe

If you are being attacked, it might be better to filter upstream (e.g. the router or your ISP), to prevent the LAN from being flooded.

34.9. /proc filesystem flags for ipv4, e.g. rp_filter

You could wind up flipping a lot of these flags. Explanations are available in the obscure section of the Adv Routing HOWTO. In particular rp_filter and log_martians are used in julians_martian_modification. For more information on rp_filter see Reverse Path Filtering.

34.10. tcp timeout values, don't change them (at least yet)

The tcp timeout values have their values for good reason (even if you don't know what they are), and realservers operating as an LVS must appear as normal tcp servers to the clients.

Wayne, 19 Oct 2001

I have a question about the "IP_MASQ_S_FIN_TIMEOUT" values in "net/ipv4/ip_masq.c" for the 2.2.x kernel. What purpose is served by having the terminated masqueraded TCP connection entries remain in memory for the default timeout of 2 minutes? Why isn't the entry freed immediately?

Julian Anastasov ja (at) ssi (dot) bg 20 Oct 2001

Because the TCP connection is full-duplex. The internal-end sends FIN and waits for the FIN from external host. Then TIME_WAIT is entered.

Perhaps what I'm really asking is why there is an mFW state at all.

[IP_MASQ_S_FIN_WAIT] = 2*60*HZ,
/* OUTPUT */
/* mNO, mES, mSS, mSR, mFW, mTW, mCL, mCW, mLA, mLI */
/*syn*/ {{mSS, mES, mSS, mSR, mSS, mSS, mSS, mSS, mSS, mLI }},
/*fin*/ {{mTW, mFW, mSS, mTW, mFW, mTW, mCL, mTW, mLA, mLI }},
/*ack*/ {{mES, mES, mSS, mES, mFW, mTW, mCL, mCW, mLA, mES }},
/*rst*/ {{mCL, mCL, mSS, mCL, mCL, mTW, mCL, mCL, mCL, mCL }},
};
/mFW

This state has a timeout corresponding to the similar state in the internal end. The remote end is still sending data while the internal side is in FIN_WAIT state (after a shutdown). The remote end can claim that it is still in established state, not having seen the FIN packet from the internal side. In any case, the remote end has 2 minutes to reply. It can even work for a longer time if each packet follows within these two minutes, not allowing the timer to expire. It depends on what state the internal end is in, FIN_WAIT1 or FIN_WAIT2. Maybe the socket in the internal end is already closed.

The only thing I can think of is if the other end of the TCP connection spontaneously issues a half close before the initiator sends his half close. Then it might be desirable to wait a while for the initiator to send his half close prior to disposing of the connection totally. What would be the consequences of using ipchains -M -S to set this value to, say, 1 second?

In any case, timeout values lower than those in the internal hosts are not recommended. If we drop the entry in LVS, the internal end still can retransmit its FIN packet. And the remote end has two minutes to flush its sending queue and to ack the FIN. IMO, you should claim that the timer in FIN_WAIT state should not be restarted on packets coming from remote end. Anything else is not good because it can drop valid packets that fit into the normal FIN_WAIT time range.

Jaroslav Libak 28 Nov 2006

When I click refresh in firefox several times while viewing load balanced page, I get a FIN_WAIT connection for every refresh. So I set tcpfin parameter using ipvsadm to 15 seconds to get rid of them fast (is this ok btw?, it was like 2 minutes before which I think is way too long). What is worse, I get "established" connection on the backup director (running the syncd) for every refresh. I have read this is due to a simplification in the synchronization code. I'm using hash table size 2^20 (which doesn't limit the maximum number of values in it, it just sets the number of rows, then each row has a linked list). Doesn't it cause some slowdown in the LVS?

Horms 29 Nov 2006

There has long been a plan to allow the timeout values to be manipulated from user space. I think it actually was possible using /proc at some stage, but the code was removed for various (good) reasons. Then there was a plan to implement the feature by extending the sysctl interface. I suspect that this, or using sysfs, is currently the preferred option of the upstream kernel guys.

A really worthwhile contribution to LVS would be to complete this code. I can find out from the upstream people what their preferred option for implementing this is if you are interested in having a crack at it. I don't imagine the code will be that hard.

I understand that your concern is memory pressure on the slave in the case of a DoS attack. And it is true that the simplification in the synchronisation protocol can exacerbate that problem. However, by doing it this way the synchronisation traffic is actually reduced, including in the case of a DoS attack. So expanding it may actually just move the problem elsewhere.

Keeping in mind that a connection entry is in the vicinity of 128 bytes, it is my opinion that unless you have an extremely small amount of memory available on the system to start with, DoSing the machine in this way is quite hard. I did try once, DoSing a box from itself, and basically the default timeouts were easily able to keep up with the DoS, and I think the total memory used never exceeded a few hundred Mb.

I would be very surprised if increasing the value would cause a slowdown, it does however increase the memory required for the array that forms the base of the hash - at 2^20 you are looking at order 2^20 = 1Mb for the size of that array. For larger values, like 32 (=4Gb), this starts to become ridiculous. Decreasing it can, in theory, cause a slowdown if you have a lot of connections. But in practice I don't think it does unless you make it very small.

In short, 20 should be fine, though you can probably get the same performance with 16. 10 is probably a bit too small.
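The growth of the base array with the table size can be checked with a few lines of arithmetic. This is a rough sketch: Horms's "order 1Mb" figure counts buckets, and the cost of one bucket depends on the kernel and architecture, so the 8 bytes per bucket used here is an assumption made for the example and not a number taken from the ip_vs source.

```python
def bucket_array_bytes(tab_bits, bytes_per_bucket=8):
    """Memory needed for the base array of a hash table with 2**tab_bits buckets.

    bytes_per_bucket is an assumed per-bucket cost; adjust it for your system.
    """
    return (2 ** tab_bits) * bytes_per_bucket


for bits in (10, 16, 20):
    buckets = 2 ** bits
    mib = bucket_array_bytes(bits) / 2 ** 20
    print(f"bits={bits:2d}  buckets={buckets:8d}  base array={mib:.3f} MiB")
```

Whatever the per-bucket cost, each extra bit doubles the array, which is why values far above 20 quickly stop making sense.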

34.11. /proc file system settings for LVS: security and private copies of tcp timeouts for LVS connections (you can change these)

In LVS-DR, the director only sees the packets from the client going to the realserver, but not the replies. After seeing a CLOSE, the director puts the connection into InActConn and uses its value of TIME_WAIT before assuming that the connection has dropped. (In fact the director has no idea of the connection state of the realserver, but these assumptions seem to work OK). In the earlier versions of LVS, the director uses the standard tcpip timeouts for its estimates of the connection state of the realserver. In the newer versions of LVS (somewhere in 2.4.x), you can fiddle with a set of private copies of the timeout values which ip_vs uses for LVS connection tracking.

As well, other ip_vs parameters (e.g. for security) can be altered in /proc.

Roberto Nibali ratz (at) tac (dot) ch 03 May 2001

The load balancer is basically only as secure as Linux itself is. ipchains settings don't affect LVS functionality (unless by mistake you use the same mark for ipchains and ipvsadm). LVS itself has some built-in security, mainly to try to secure the realserver in case of a DoS attack. There are several parameters you might want to set in the proc-fs.

Note

Ratz 10 Aug 2004

Those values below were used as kind of a defense mechanism in the ancient days. I believe these are to be replaced by the same parameters exported through the ip_conntrack module. Load ip_conntrack and walk the /proc/sys/net/ipv4/netfilter tree.

  • /proc/sys/net/ipv4/vs/amemthresh
  • /proc/sys/net/ipv4/vs/am_droprate
  • /proc/sys/net/ipv4/vs/drop_entry
  • /proc/sys/net/ipv4/vs/drop_packet
  • /proc/sys/net/ipv4/vs/secure_tcp
  • /proc/sys/net/ipv4/vs/debug_level

    With this you select the debug level (0: no debug output, >0: debug output in kernlog, the higher the number the higher the verbosity)

    The following are timeout settings. For more information see TCP/IP Illustrated Vol. I, R. Stevens.

  • /proc/sys/net/ipv4/vs/timeout_close - CLOSE
  • /proc/sys/net/ipv4/vs/timeout_closewait - CLOSE_WAIT
  • /proc/sys/net/ipv4/vs/timeout_established - ESTABLISHED
  • /proc/sys/net/ipv4/vs/timeout_finwait - FIN_WAIT
  • /proc/sys/net/ipv4/vs/timeout_icmp - ICMP
  • /proc/sys/net/ipv4/vs/timeout_lastack - LAST_ACK
  • /proc/sys/net/ipv4/vs/timeout_listen - LISTEN
  • /proc/sys/net/ipv4/vs/timeout_synack - SYN_ACK
  • /proc/sys/net/ipv4/vs/timeout_synrecv - SYN_RECEIVED
  • /proc/sys/net/ipv4/vs/timeout_synsent - SYN_SENT
  • /proc/sys/net/ipv4/vs/timeout_timewait - TIME_WAIT
  • /proc/sys/net/ipv4/vs/timeout_udp - UDP
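To see which of these entries exist on a particular director, and what they currently hold, something like the following can be used. It is a minimal sketch: the function name is made up for the example, it only reads (never writes) the files, and it returns an empty result on kernels where the timeout_* entries are absent or ip_vs is not loaded.

```python
import os


def read_vs_timeouts(vs_dir="/proc/sys/net/ipv4/vs"):
    """Return {entry name: contents} for every timeout_* file found in vs_dir.

    Hypothetical helper for illustration; the values are returned as the raw
    strings found in the files, without interpreting their units.
    """
    timeouts = {}
    if not os.path.isdir(vs_dir):
        return timeouts  # ip_vs not loaded, or no such directory on this kernel
    for name in sorted(os.listdir(vs_dir)):
        if not name.startswith("timeout_"):
            continue  # skip secure_tcp, drop_entry, debug_level, ...
        try:
            with open(os.path.join(vs_dir, name)) as f:
                timeouts[name] = f.read().strip()
        except OSError:
            pass  # unreadable entry; ignore it in this sketch
    return timeouts


for name, value in read_vs_timeouts().items():
    print(f"{name} = {value}")
```

Passing the directory as a parameter makes it easy to try the function against a copy of the files taken from another machine.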

You don't want your director replying to pings from the outside world.

With the FIN timeout being about 1 min (2.2.x kernels), if most of your connections are non-persistent http (only taking 1 sec or so), most of your connections will be in the InActConn state.
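The reason most entries end up inactive follows from simple steady-state arithmetic: the number of entries in a state is roughly the arrival rate multiplied by the time spent in that state. The sketch below uses the figures from the paragraph above (about 1 sec active, about 60 sec in FIN timeout); the rate of 100 connections/sec is an assumed example value.

```python
def steady_state_counts(conn_per_sec, active_secs, inactive_timeout_secs):
    """Estimate (ActiveConn, InActConn) as rate multiplied by time in each state."""
    active = conn_per_sec * active_secs
    inactive = conn_per_sec * inactive_timeout_secs
    return active, inactive


# Assumed load: 100 new http connections per second.
active, inactive = steady_state_counts(100, 1, 60)
print("ActiveConn ~", active)
print("InActConn  ~", inactive)
print("inactive fraction ~", round(inactive / (active + inactive), 3))
```

With these numbers there are about 100 active and 6000 inactive entries, so roughly 98% of the table is InActConn regardless of the actual rate.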

unknown

Will the info from loading ip_conntrack and walking the /proc/sys/net/ipv4/netfilter tree be used, along with the secure_tcp defense strategy described in LVS DoS defense strategy (http://www.linux-vs.org/docs/defense.html), to replace the timeouts mentioned?

Ratz 12 Aug 2004

I don't know (I've been out of the development loop for about a year) but I rather think not since they look kind of orthogonal to the existing netfilter timers which only got added about 6 months or so ago. One of the issues in fiddling with those timers is that they influence too much of the rest of the stack.

I also don't think the documentation is up to date anymore; it should be adjusted to reflect the current state of operation. As it stands, it only confuses people who don't want to or can't read the kernel code.

If you're interested, check out the following path:

net/ipv4/ipvs/ip_vs_ctl.c:ip_vs_sysctl_defense_mode()
net/ipv4/ipvs/ip_vs_ctl.c:update_defense_level()
net/ipv4/ipvs/ip_vs_ctl.c:ip_vs_secure_tcp_set()
net/ipv4/ipvs/ip_vs_conn.c:"set state table, according to proc-fs value"

from there you set the TCP state transition table. If you have the secure_tcp sysctl set, the kernel will be dealing with the vs_tcp_states_dos state transition table, if you have it unset, it will be dealing with the normal vs_tcp_states table.

The related timers for the state transitions are vs_timeout_table{_dos}. In former days you could influence those timers via proc-fs. Nowadays we seem to switch to the *_dos timer model under attack according to the comment in the code. But this is not correct. It should read that as soon as the sysctl for tcp_defense is set, we will also be using the *_dos table timers along with the vs_tcp_states_dos state transition table.

Conclusion: The disabled proc-fs values have been replaced by a static hardcoded mapping of the timers for tcp_defense. I could imagine that not a lot of people really used to tweak those parameters anyway.

Hendrik Thiel, 20 Mar 2001

we are using a lvs in NAT Mode and everything works fine ... Probably, the only Problem seems to be the huge number of (idle) Connection Entries.

ipvsadm shows a lot of InActConn (more than 10000 entries per Realserver) entries. ipchains -M -L -n shows that these connections last 2 minutes. Is it possible to reduce the time to keep the Masquerading Table small? e.g. 10 seconds ...

Joe

For 2.2 kernels, you can use netstat -M instead of ipchains -M -L. For 2.4.x kernels use cat /proc/net/ip_conntrack.

Julian

One entry occupies 128 bytes. 10k entries mean 1.28MB memory. This is not a lot of memory and may not be a problem.
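Julian's estimate is easy to redo for other table sizes. A small sketch, using the 128 bytes per entry quoted above; the larger entry counts are just example values.

```python
BYTES_PER_ENTRY = 128  # per-entry size quoted above


def table_memory_bytes(entries, bytes_per_entry=BYTES_PER_ENTRY):
    """Memory taken by the given number of connection entries."""
    return entries * bytes_per_entry


for entries in (10_000, 100_000, 1_000_000):
    megabytes = table_memory_bytes(entries) / 1e6
    print(f"{entries:>9} entries -> {megabytes:.2f} MB")
```

For 10,000 entries this gives 1,280,000 bytes, i.e. the 1.28MB mentioned above; even a million entries stays at 128 MB.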

For 2.2, to reduce the number of entries in the ipchains table, you can reduce the timeout values. You can edit the TIME_WAIT, FIN_WAIT values in ip_masq.c, or enable the secure_tcp strategy and alter the proc values there. FIN_WAIT can also be changed with ipchains.

Note

It is not a good idea to change the tcpip timeouts (particularly to save 1M).

With the later versions of ip_vs (2.4.x), the director has its own copies of the tcpip timeout values, and you can change them.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 10 May 2004

If you are concerned about the number of InActConn, you can reduce the FIN_WAIT timeout in /proc/sys/net/ipv4/vs/timeout_finwait.

For 2.6.x versions of ip_vs (May 2004), the timeouts have not been implemented yet.

Julian 12 May 2004

IPVS for 2.6 has code to use different timeout tables but we forgot to implement it fully. The intention was to implement per protocol/service/app timeouts by adding some code to libipvs and the kernel. It is not preferred to export so many values via the /proc interface, so now it is disabled until someone decides to implement the above set/get controls. Only the timeout_* values in /proc are disabled, so now they do not exist in 2.6. All other sysctls remain.

34.12. timeouts the same for all services

Alois Treindl alois (at) astro (dot) ch

I have LVS-NAT configured so that ssh VIP connects me to one particular realserver. I would like to keep this ssh connection permanent, (to observe the cluster during its operation). This ssh connection times out with inactivity as expected. Can this be changed, without affecting the timeout values of other LVS services? Alternately I can connect by ssh to each machine without using LVS.

Julian Anastasov ja (at) ssi (dot) bg 12 May 2001

Currently, there are only global timeout values which are not very useful for some boxes with mixed functions. The masquerading, the LVS and its virtual services use same timeout values. The problem is that there are too many timeouts.

The solution would be to separate these timeouts, i.e. per virtual service timeouts, separated from the masquerading. According to the virtual service protocol it can serve the TCP EST and the UDP timeout role. So this can be one value that will be specified by the users. This way the in->out ssh/telnet/whatever connections can use their own timeout (1/10/100 days) and the external clients will use the standard credit of 15 minutes. But maybe it is too late for 2.2 to change this model. Is one user specified timeout value enough?

34.13. Director Connection Hash Table

Note
Because the 2.0.x implementation of ip_vs was in the masquerading code, this table used to be called the "IP masquerading table".
Note
Joe: A regular table has room for N entries, with an index of range N. A hash table is a table that has room for N entries, but stores entries for indices that have a range of M, where M>N. In the case of LVS, the connection hash table must store entries over the whole range of internet IPs, but only has (initially) 4096 (say) entries. Algorithms exist which allow adding and deleting entries in hash tables at speeds comparable to those in regular tables.

from Peter Mueller: a general article on hashing (http://www.citi.umich.edu/projects/linux-scalability/reports/hash.html, site gone Sep 2004)

The director maintains a hash table of connections marked with

<CIP, CPort, VIP, VPort, RIP, RPort>

where

  • CIP: Client IP address
  • CPort: Client Port number
  • VIP: Virtual IP address
  • VPort: Virtual Port number
  • RIP: RealServer IP address
  • RPort: RealServer Port number.

The hash table speeds up the connection lookup and keeps state so that packets belonging to a connection from the client will be sent to the allocated realserver. If you are editing the .config by hand look for CONFIG_IP_MASQUERADE_VS_TAB_BITS.
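As a rough illustration of how a chained hash table gives fast lookup while keeping per-connection state, here is a minimal Python sketch. The names ConnTable and Conn are hypothetical, Python's built-in hash() stands in for the kernel's hash function, and the choice of fields used as the lookup key is an assumption made for the example. It is not the ip_vs implementation.

```python
from collections import namedtuple

# One connection entry, using the six fields listed above.
Conn = namedtuple("Conn", "cip cport vip vport rip rport")


class ConnTable:
    """Array of buckets; each bucket is a list (chain) of connections."""

    def __init__(self, tab_bits=12):
        self.size = 1 << tab_bits          # number of buckets, a power of 2
        self.mask = self.size - 1
        self.buckets = [[] for _ in range(self.size)]

    def _index(self, cip, cport, vip, vport):
        # Assumption: the client/virtual address pair selects the bucket.
        return hash((cip, cport, vip, vport)) & self.mask

    def add(self, conn):
        idx = self._index(conn.cip, conn.cport, conn.vip, conn.vport)
        self.buckets[idx].append(conn)

    def lookup(self, cip, cport, vip, vport):
        # Only one chain is walked, not the whole table.
        for conn in self.buckets[self._index(cip, cport, vip, vport)]:
            if (conn.cip, conn.cport, conn.vip, conn.vport) == (cip, cport, vip, vport):
                return conn
        return None
```

A packet from a known client finds its entry and hence its realserver (conn.rip, conn.rport); a packet with no entry returns None and would be scheduled as a new connection.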

Warning

Do not even think of changing the LVS (hash) table size unless you know a lot more about ip_vs than we do. If you still want to change the hash table size, at least read everything here first.

tao cui

In the output of ipvsadm what does the "size" mean?

IP Virtual Server version 1.0.9 (size=4096)

or

IP Virtual Server version 1.0.9 (size=65536)

Horms 24 Dec 2003

This refers to the number of hash buckets in the IPVS connection table. This is configured at compile time by setting CONFIG_IP_VS_TAB_BITS, the default is 12.

size = 2^CONFIG_IP_VS_TAB_BITS

Thus CONFIG_IP_VS_TAB_BITS = 12 -> size = 4096
     CONFIG_IP_VS_TAB_BITS = 16 -> size = 65536

Note that this is the number of hash buckets, not the maximum number of connections. A bucket can contain zero or more connections. The maximum number of connections is only limited by the memory available.

Janno de Wit

How can I see if the connection table is full? `dmesg` gives no output.

Horms 07 Jan 2005

The connection table cannot become full. It is a hash table and you can continue to add entries until you run out of memory, at which time something very apparent should turn up in dmesg.

Ratz

The original poster actually has got a point :) So what about this:

Note
partial diff shown for brevity - Joe
diff -Nur linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_conn.c linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_conn.c
--- linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_conn.c   2003-11-03 17:26:50.000000000 +0100
+++ linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_conn.c   2003-12-24 09:21:37.000000000 +0100
@@ -1519,7 +1519,7 @@

         IP_VS_INFO("Connection hash table configured "
-                  "(size=%d, memory=%ldKbytes)\n",
+                  "(hash buckets=%d, memory=%ldKbytes)\n",
                    IP_VS_CONN_TAB_SIZE,
                    (long)(IP_VS_CONN_TAB_SIZE*sizeof(struct list_head))/1024);
         IP_VS_DBG(0, "Each connection entry needs %d bytes at least\n",

diff -Nur linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_ctl.c linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_ctl.c
--- linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_ctl.c    2003-11-03 17:26:50.000000000 +0100
+++ linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_ctl.c    2003-12-24 09:22:47.000000000 +0100
@@ -1488,7 +1488,7 @@
         pos = 192;
         if (pos > offset) {
                 sprintf(temp,
-                       "IP Virtual Server version %d.%d.%d (size=%d)",
+                       "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
                         NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
                 len += sprintf(buf+len, "%-63s\n", temp);
                 len += sprintf(buf+len, "%-63s\n",
@@ -1942,7 +1942,7 @@
         {
                 char buf[64];

-               sprintf(buf, "IP Virtual Server version %d.%d.%d (size=%d)",
+               sprintf(buf, "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
                         NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
                 if (*len < strlen(buf)+1) {
                         ret = -EINVAL;

The default LVS hash table size (2^12 entries) originally meant 2^12 simultaneous connections. These early versions of ipvs would crash your machine if you allotted too much memory to this table.

Julian 7 Jun 2001

This was because the resulting bzImage was too big. Users selected a value too big for the hash table and even the empty table (without linked connections) couldn't fit in the available memory.

This problem has been fixed in ipvs versions > 0.9.9, where each row of the connection table is a linked list.

Note
If you're looking for memory use with "top", it reports memory allocated, not memory you are using. No matter how much memory you have, Linux will eventually allocate all of it as you continue to run the machine and load programs.

Each connection entry takes 128 bytes, so 2^12 connections require 512 kbytes.

Note
not all connections are active - some are waiting to timeout.

As of ipvs-0.9.9 the hash table is different.

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

With CONFIG_IP_MASQUERADE_VS_TAB_BITS we specify not the maximum number of entries (connections in your case) but the number of rows in a hash table. The columns of this table are unlimited. You can set your table to 256 rows and have 1,800,000 connections in 7000 columns on average, but the lookup is slower: the lookup function chooses one row using the hash function and then searches all these 7000 entries for a match. So, by increasing the number of rows we speed up the lookup. There is _no_ connection limit. It depends on the free memory. Try to tune the number of rows so that the columns do not exceed 16 (average), for example. It is not fatal if there are more columns (on average), and if your CPU is fast enough this is not a problem.

All entries are included in a table with (1 << IP_VS_TAB_BITS) rows and unlimited number of columns. 2^16 rows is enough. Currently, LVS 0.9.7 can eat all your memory for entries (using any number of rows). The memory checks are planned in the next LVS versions (are in 0.9.9?).

Julian 7 Jun 2001

Here is the picture:

the hash table is an array of double-linked list heads, i.e.

struct list_head *ip_vs_conn_tab;

Some versions ago ( < 0.9.9? ) it was a static array, i.e.

struct list_head ip_vs_table[IP_VS_TAB_SIZE];

struct list_head is 8 bytes on 32-bit x86 (a doubly-linked list head: the next and prev pointers).

In the second variant when IP_VS_TAB_SIZE is selected too high the kernel crashes on boot. Currently (the first variant), vmalloc(IP_VS_TAB_SIZE*sizeof(struct list_head)) is used to allocate the space for the empty hash table for connections. Once the table is created, more memory is allocated only for connections, not for the table itself.

In any case, after boot, before any connections are created, the memory occupied by this empty table is IP_VS_TAB_SIZE*8 bytes. For 20 bits this is (2^20)*8 bytes=8MB. When we start to create connections, they are enqueued in one of these 2^20 double-linked lists after evaluating a hash function. In the ideal case you can have one connection per row (a dream), so 2^20 connections. When I'm talking about columns, in this example we have 2^20 rows and on average 1 column used.

The *TAB_BITS define only the number of rows (a power of 2 is useful because the hash function result can be masked with IP_VS_TAB_SIZE-1 instead of using the '%' modulo operation). But this is not a limit on the number of connections. When the user selects the value, the real number of connections must be considered. For example, if you think your site can accept 1,000,000 simultaneous connections, you have to select a number of hash rows that will spread all connections over short rows. You can create these 1,000,000 conns with TAB_BITS=1 too, but then all these connections will be linked in two rows and the lookup process will take too much time to walk 500,000 entries. This lookup is performed on each received packet.
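The masking trick can be checked with a few lines of Python. This is only a sketch of the arithmetic; the function names are made up for the example.

```python
def bucket_by_mask(hash_value, tab_bits):
    """Select a row by masking with (table size - 1)."""
    size = 1 << tab_bits
    return hash_value & (size - 1)


def bucket_by_modulo(hash_value, tab_bits):
    """Select a row with the modulo operator."""
    return hash_value % (1 << tab_bits)
```

For a non-negative hash value and a table size that is a power of 2, both functions return the same row, which is why the table size is expressed in bits.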

The selection of *TAB_BITS is entirely based on the recommendation to keep the d-linked lists short (less than 20, not 500,000). This will speed up the lookup dramatically.

So, for our example of 1,000,000 we must select table with 1,000,000/20 rows, i.e. 50,000 rows. In our case the min TAB_BITS value is 16 (2^16=65536 >= 50000). If we select 15 bits (32768 rows) we can expect 30 entries in one row (d-linked list) which increases the average time to access these connections.
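The selection rule can be written down as a small calculation. The following Python sketch is an illustration only; min_tab_bits and average_chain are hypothetical helper names, and the default of 20 entries per row is the figure suggested above.

```python
def min_tab_bits(connections, max_chain=20):
    """Smallest number of bits whose row count keeps the average
    row length at or below max_chain."""
    rows_needed = -(-connections // max_chain)      # ceiling division
    return max(0, (rows_needed - 1).bit_length())   # ceil(log2(rows_needed))


def average_chain(connections, tab_bits):
    """Average entries per row, assuming the hash spreads them evenly."""
    return connections / (1 << tab_bits)
```

For 1,000,000 connections this gives 16 bits (about 15 entries per row); with 15 bits the average row would hold about 30 entries.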

So, the TAB_BITS selection is a compromise between the memory used by the empty table and the lookup speed in one table row. The two goals pull against each other. More rows => more memory => faster access. So, for 1,000,000 entries (which is a real limit for 128MB directors) you don't need more than 16 bits for the conn hash table. And the space occupied by such an empty table is 65536*8=512KBytes. Bits greater than 16 can speed up the lookup more but we waste too much memory. And usually we don't achieve 1,000,000 conns with 128MB directors, since some memory is occupied by other things.
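The memory side of the compromise is easy to tabulate. This Python sketch uses the two sizes quoted in this section (8 bytes per row head, 128 bytes per connection entry); the function names are hypothetical.

```python
LIST_HEAD_BYTES = 8    # per row: next and prev pointers (figure from the text)
CONN_BYTES = 128       # per connection entry (figure from the text)


def empty_table_bytes(tab_bits):
    """Memory taken by the table before any connection exists."""
    return (1 << tab_bits) * LIST_HEAD_BYTES


def total_bytes(tab_bits, connections):
    """Empty table plus the connection entries linked into it."""
    return empty_table_bytes(tab_bits) + connections * CONN_BYTES
```

With 16 bits the empty table costs 512 KB, with 20 bits 8 MB, and with 26 bits 512 MB. The connection entries themselves (about 128 MB for 1,000,000 connections) dominate at sensible table sizes.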

The reason to move to a vmalloc-ed buffer is that a 65536-row table occupies 512KB, and if the table is statically defined in the kernel the boot image is 512KB larger, which is obviously very bad. So, the new definition is a pointer (4 bytes instead of 512KB in the bzImage) to the vmalloc'ed area.

Ratz's code adds limits per service while this sysctl can limit everything. Or it can be an additional strategy (oh, another one), vs/lowmem. The semantics can be "Don't allocate memory for new connections when the low memory threshold is reached". It can work for the masquerading connections too (2.2). This way we will reserve memory for user space. Very dangerous option, though.

Joe

what's dangerous about it?

One user process can allocate too much memory and cause the LVS to drop new connections because the lowmem threshold is reached.

Maybe conn_limit is better or something like this:

if (conn_number > min_conn_limit && free_memory < lowmem_thresh)
         DROP_THIS_PACKET_FOR_NEW_CONN

why have a min_conn_limit in here? If you put more memory into the director, then you'll have to recompile your kernel. Is it because finding conn_number is cheaper than finding free_memory?

:) The above example with real numbers:

if (conn_number > 500000 && free_memory < 10MB) DROP

i.e. don't allow the user processes to use memory that LVS can use. But when there are "enough" LVS connections created we can consider reserving 10MB for user space and start dropping new connections early, i.e. when there is less than 10MB of free memory. If conn_number < 500000, LVS will simply hit the 0MB free memory point and user space will be hurt, because these processes allocated too much memory in this case.

But obtaining the "free_memory" may cost CPU cycles. Maybe we can stick with a snapshot taken each second.
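The proposal can be sketched in Python as follows. This only illustrates the idea discussed in this thread, which is described here as a proposal rather than existing code; should_drop_new_conn and FreeMemorySnapshot are hypothetical names, and the caller supplies both the clock value and the function that reads free memory.

```python
MB = 1024 * 1024


def should_drop_new_conn(conn_number, free_memory,
                         min_conn_limit=500_000, lowmem_thresh=10 * MB):
    """Drop new connections only when LVS already has 'enough'
    connections AND free memory is below the reserve for user space."""
    return conn_number > min_conn_limit and free_memory < lowmem_thresh


class FreeMemorySnapshot:
    """Cache the free-memory reading so it is refreshed at most once
    per max_age seconds instead of on every packet."""

    def __init__(self, read_free_memory, max_age=1.0):
        self._read = read_free_memory
        self._max_age = max_age
        self._value = None
        self._taken_at = None

    def get(self, now):
        if self._taken_at is None or now - self._taken_at >= self._max_age:
            self._value = self._read()
            self._taken_at = now
        return self._value
```

As noted in the discussion, a snapshot that is up to one second old can be badly out of date during a SYN flood, when the number of entries changes quickly.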

The number of valid connections shouldn't change dramatically in 1 sec. However a DoS might still cause problems.

Yes, the problem is on SYN attack.

Ratz

max amount of concurrent connections: 3495. We assume having 4 realservers, equally load balanced, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.

what's the security problem?

SYN/RST flood. My patch will set the weight of the realserver to 0 when the upper threshold is reached. But I do not test whether the requesting traffic is malicious or not, so in the case of a SYN flood it may be that 99% of the packets are the ones causing the server to be taken out of service. In the end we have set all servers to weight 0 and the load balancer is non-functional as well. But you don't have the memory problem :)

And it hasn't crashed either.

Ratz

I kinda like it but as you said, there is the amem_thresh, my approach (which was not actually done because of this problem :) and now having a lowmem_thresh. I think this will end up in an orthogonal semantic for memory allocation. For example if you enable the amem_thresh, the conn_number > min_conn_limit && free_memory < lowmem_thresh would never be the case. OTOH if you set the lowmem_thresh too low the amem_thresh is ineffective. My patch would suffer from this too.

Julian Anastasov ja (at) ssi (dot) bg 08 Jun 2001

lowmem_thresh is not related to amemthresh, but when amemthresh < lowmem_thresh the strategies will never be activated. lowmem_thresh should be less than amemthresh. Then the strategies will try to keep the free memory in the lowmem_thresh:amemthresh range instead of the current range 0:amemthresh.

Example (I hope you have memory to waste):

lowmem_thresh=16MB (think of it as reserved for user processes and kernel)
amemthresh=32MB (when the defense strategies trigger)
min_conn_limit=500000 (think of it as 60MB reserved for LVS connections)

So, the conn_number can grow far beyond min_conn_limit, but only while lowmem_thresh is not reached. If conn_number < 500000 and free_memory < lowmem_thresh we will wait for the OOM killer to help us. So, we have 2 tuning parameters: the desired number of connections and some space reserved for user processes. And maybe this is difficult to tune; we don't know how the kernel prevents problems in VM before activating the killer, i.e. swapping, etc. And the cluster software can take some care when allocating memory.

Hayden Myers hayden (at) spinbox (dot) com 18 Mar 2002

There's also some info located in kernel help. I posted it below for convenience.

Using a big ipvs hash table for virtual server will greatly reduce conflicts in the ipvs hash table when there are hundreds of thousands of active connections. Note the table size must be power of 2. The table size will be the value of 2 to the your input number power. For example, the default number is 12, so the table size is 4096. Don't input the number too small, otherwise you will lose performance on it. You can adapt the table size yourself, according to your virtual server application. It is good to set the table size not far less than the number of connections per second multiplying average lasting time of connection in the table. For example, your virtual server gets 200 connections per second, the connection lasts for 200 seconds in average in the masquerading table, the table size should be not far less than 200x200, it is good to set the table size 32768 (2**15).

Another note that each connection occupies 128 bytes effectively and each hash entry uses 8 bytes, so you can estimate how much memory is needed for your box.
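The rule of thumb in the help text can be turned into a small estimate. This Python sketch is only an illustration of that arithmetic; the function names are hypothetical, and rounding down to the nearest power of 2 is one reading of "not far less than".

```python
def expected_entries(conns_per_second, avg_lifetime_seconds):
    """Connections held in the table at any one time."""
    return conns_per_second * avg_lifetime_seconds


def lower_power_bits(entries):
    """Bits for the largest power of 2 not above `entries`."""
    return entries.bit_length() - 1


def memory_estimate_bytes(entries, tab_bits, conn_bytes=128, bucket_bytes=8):
    """128 bytes per connection plus 8 bytes per hash entry (row)."""
    return entries * conn_bytes + (1 << tab_bits) * bucket_bytes
```

For the example in the help text, 200 connections per second lasting 200 seconds gives 40,000 entries, 15 bits (32768 rows), and roughly 5.4 MB in total.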

Ratz: Leave the settings as a general rule.

Some people still want to change the hash table size

Daniel Burke 28 Jun 2002

In anticipation of our capacity requirements growing, we had decided it was necessary to increase the connection table size. The value was 16; based on our calculations we needed to bump it to 26 to handle what we will be throwing at it.

Julian

It is insane to use 26. That means 2^26 * space for 2 pointers. On x86 it takes 512MB just for allocating empty hash table with 2^26 d-linked lists. Refer to the HOWTO for calculating the best hash table size according to the number of connections. You can select the size (POWER) in this way:

2^POWER = AVERAGE_NUM_CONNS/10

The magic value 10 in this case is the average number of conns expected in one d-linked list, the lookup is slower for more conns.

Example:

POWER=16 => 65536 rows => 655360 conns, 10 on each row

Joe - Wensong has stepped in to stop people from doing this anymore.

Wensong

Just added code that limits the input number from 8 to 20, in order to prevent this configuring problem from happening again.

Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. I think Julian and/or Wensong wrote it. Does anyone of you still have that code?

I'd say that the rehashing that would need to take place would consume more CPU cycles than a yet-to-be-proven gain from increasing the bucket size.

Horms horms (at) verge (dot) net (dot) au 17 Feb 2003

Agreed. To expand on this for the benefit of others. The hash table is just that. A hash. Each bucket in the hash can have multiple entries. The implementation is such that each bucket is a linked list of entries.

Exactly. And around 2000 we tuned the hash generation function to have the best balanced distribution over all buckets with a magic prime number which IIRC was pretty exactly the golden ratio if you divided 2^32 through that number.

ratz@zar:~ > echo "(2^32)/2654435761" | bc -l
1.61803399392945414737
ratz@zar:~ >

We took the wisdom from Chuck Lever's paper Linux Kernel Hash Table Behavior: Analysis and Improvements (http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf, site down Sep 2004). So once you have an evenly distributed hash table the search for an entry is almost best case for all entries.
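To see why such a constant spreads keys evenly, here is a minimal Python sketch of a multiplicative hash that uses the number from the bc session above. It is a textbook-style illustration, not the actual ip_vs hash function, and the names mult_hash and bucket_counts are made up for the example.

```python
MAGIC = 2654435761   # the constant quoted above; 2^32 / MAGIC is about 1.618


def mult_hash(key, tab_bits):
    """Multiply, keep the low 32 bits, then use the top tab_bits bits
    of that 32-bit product as the bucket number."""
    return ((key * MAGIC) & 0xFFFFFFFF) >> (32 - tab_bits)


def bucket_counts(keys, tab_bits):
    """How many of the given integer keys land in each bucket."""
    counts = [0] * (1 << tab_bits)
    for key in keys:
        counts[mult_hash(key, tab_bits)] += 1
    return counts
```

Feeding consecutive integers (for example, a run of adjacent client addresses) through bucket_counts and printing the minimum and maximum counts shows how close the distribution stays to the average chain length.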

Thus there is no limit on the number of entries the hash table can hold (other than the amount of memory available). The only advantage of increasing the hash table size is to reduce, statistically speaking, the number of entries in each bucket.

Which gains you almost nothing. The search is still ... left as an exercise to all CS students here :)

And thus the amount of hash table traversal. To be honest I think the gain is probably minimal even in extreme circumstances.

Especially since computing the hash value takes, most of the time, almost the same amount of time as walking the linked list.

Anecdotally, a colleague of mine tested reordering the linked lists, so that the most recently used entry was moved to the front.

Interesting approach. However I guess that the cache line probably already had this entry. It would be interesting to see the amount of TLB flushes done by the kernel depending on the amount of traffic and hash table size.

He then pushed a little bit of traffic through the LVS box (>700Mbit/s). We didn't really see any improvement. Which would make me think that hash table performance isn't a particular bottleneck in the current code.

Jernberg, Bernt wrote:

my throughput is 2Gb/s

The tests that I was involved with, with >700Mb/s of traffic, used a hash table with 17 bits. I am not entirely sure how that number was derived as I did not do the tests myself. But it would probably be a good start for you.

You are probably going to see a bigger difference by compiling a non-SMP kernel and eliminating spinlocks than you will twiddling the hash-table bits. I believe that by having an SMP kernel, the overhead of spinlocks is significant under high load.

(For more on SMP/UMP with LVS, see comments by Horms on SMP doesn't help.)

Jernberg, Bernt

I have deployed it at a customer site which offers ftp, http and rsync services. They calculated that they will need 2^21 entries in the hash if it is supported.

Ratz

Let me see (4 secs session coherency and 1/8 of the traffic are valid SYN requests matching the template):

ratz@zar:~ > echo "l(4*2*1024*1024*1024/8)" | bc -l
20.79441541679835928251
ratz@zar:~ >

So yes, this would roughly be 21 bits. But now I ask you to read the nice explanation of Horms in this thread on why you do not need to increase the bucket size of the hash table to be able to hold 2**21 entries. You can perfectly well use 17 bits which would give you a linked list depth (provided we have an equilibrium in distribution over the buckets):

ratz@zar:~ > echo "2^21/2^17" | bc -l
16.00000000000000000000
ratz@zar:~ >

So lousy 16 entries for one bucket when using 17 bit. This is bloody _nothing_. Let's take the worst case: You'd have maybe 32 entries which is still _nothing_. Your CPU doesn't even fully awake to find an entry in this list :).

The amount of RAM you need to hold 2^21 templates entries for a session time of 4 seconds is roughly:

ratz@zar:~ > echo "4*(128*2^21)/1024/1024" | bc -l
1024.00000000000000000000
ratz@zar:~ >

1GB. So you're on the safe end. However if you plan on using persistency you'd run out of memory pretty soonish.
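The figures in this exchange can be re-checked with a few lines of Python; this is only a restatement of the arithmetic shown in the bc sessions above. One point to keep in mind when reading the first bc session: bc's l() function is the natural logarithm, so 20.79 is ln(2^30), while the base-2 logarithm of the same quantity is 30. The chain depth and RAM figures do not depend on that.

```python
import math

entries_wanted = 2 ** 21     # entries the customer expects to hold
tab_bits = 17                # table size suggested in the thread

# Average chain depth with an even distribution over the buckets.
avg_depth = entries_wanted / 2 ** tab_bits

# RAM estimate exactly as computed in the text, in MB.
ram_mb = 4 * (128 * 2 ** 21) / 1024 / 1024

# The quantity fed to l() in the first bc session.
quantity = 4 * 2 * 1024 ** 3 / 8
natural_log = math.log(quantity)    # what bc's l() computes
base2_log = math.log2(quantity)     # number of bits needed to count it

print(avg_depth, ram_mb, round(natural_log, 2), base2_log)
```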

Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. We used it to tweak the hash function.

Julian, 17 Feb 2003

hashlvs I created a script for easy testing. Currently, there are 2 hash functions for tests. I don't remember for what LVS version the hash_*.c files were created.

usage: get copy of the conn table output (use real data) and feed it to the scripts by specifying the used hash table size (in bits) and the desired hash function method. The result is the expected access time in pseudo units. Bigger access time means slower lookup.

34.14. Hash table connection timeouts

How long are the connection entries held for ? (Column 8 of /proc/net/ip_masquerade ?)

Julian

The default timeout value for a TCP session is 15 minutes, for a TCP session after receiving FIN it is 2 minutes, and for a UDP session it is 5 minutes. You can use ipchains -M -S tcp tcpfin udp to set your own timeout values.

If we assume a clunky set of web servers being balanced that take 3s to serve an object, then if the connection entries are dropped immediately then we can balance about 20 million web requests per minute with 128M RAM. If however the connection entries are kept for a longer time period this puts a limit on the balancer.

Yeah, it is true.

e.g. (assuming column 8 is the thing I'm after!)

Actually, the column 8 is the delta value in sequence numbers. The timeout value is in column 10.

[zathras@consus /]$ head -n 1000 /proc/net/ip_masquerade | \
sed -e "s/  */ /g"|cut -d" " -f8|sort -nr|tail -n500|head -n1
8398

i.e. Held for about 2.3 hours, which would limit a 128Mb machine to balance about 10.4 million requests per day. (Which is definitely on the low side knowing our throughput...)
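Both throughput estimates in this exchange follow from the same relation: the number of entries that fit in memory divided by the time each entry is held. Here is a minimal Python sketch of that calculation, assuming 128 bytes per entry and treating 128M as 128,000,000 bytes (about 1 million entries); the function names are hypothetical.

```python
ENTRY_BYTES = 128


def max_entries(ram_bytes):
    """Entries that fit if all of the memory is used for the table."""
    return ram_bytes // ENTRY_BYTES


def sustainable_requests(entries, hold_seconds, period_seconds):
    """Requests per period if each one holds an entry for hold_seconds."""
    return entries * period_seconds / hold_seconds


entries = max_entries(128_000_000)
per_minute = sustainable_requests(entries, 3, 60)             # entries dropped after 3 s
per_day = sustainable_requests(entries, 2.3 * 3600, 86400)    # entries held about 2.3 hours
print(entries, per_minute, round(per_day / 1e6, 1))
```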

Horms horms (at) vergenet (dot) net

When a connection is received by an IPVS server and forwarded (by whatever means) to a back-end server, at what stage is this connection entered into the IPVS table? Is it before or as the packet is sent to the back-end server, or is it delayed until after the 3 way handshake is complete?

Lars

The first packet is when the connection is assigned to a realserver, thus it must be entered into the table then, otherwise the 3 way handshake would likely hit 3 different realservers.

unknown

It has been alleged that IBM's Net Director waits until the completion of the three way handshake to avoid the table being filled up in the case of a SYN flood. To my mind the existing SYN flood protection in Linux should protect the IPVS table in any case, and the connection needs to be in the IPVS table to enable the 3 way handshake to be completed.

Wensong

There is state management in the connection entries in the IPVS table. A connection in different states has different timeout values; for example, the timeout of the SYN_RECV state is 1 minute and the timeout of the ESTABLISHED state is 15 minutes (the default). Each connection entry occupies 128 bytes of effective memory. Supposing that there are 128 Mbytes of free memory, the box can hold 1 million connection entries. A SYN flood of over 16,667 packets/second can make the box run out of memory, and the SYN-flooding attacker would probably need a T3 link or more to perform the attack. It is difficult to SYN-flood an IPVS box. It would be much more difficult to attack a box with more memory.
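The 16,667 packets/second figure can be reproduced directly. A minimal Python sketch, assuming every SYN creates one 128-byte entry that stays for the full SYN_RECV timeout; syn_rate_to_fill is a hypothetical name.

```python
def syn_rate_to_fill(free_bytes, timeout_seconds=60, entry_bytes=128):
    """SYN packets per second needed to keep memory full when every
    entry lives for timeout_seconds before it expires."""
    return (free_bytes // entry_bytes) / timeout_seconds


rate = syn_rate_to_fill(128_000_000)   # 1,000,000 entries / 60 s
print(round(rate))
```

Doubling the free memory doubles the rate the attacker must sustain, which is the sense in which a box with more memory is harder to flood.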

I assume that the timeout is tunable, though reducing the timeout could have implications for prematurely dropping connections. Is there a possibility of implementing random SYN drops if too many SYN are received as I believe is implemented in the kernel TCP stack.

Yup, I should have implemented random early drop of SYN entries a long time ago, as Alan Cox suggested. Actually, it would be simple to add this feature to the existing IPVS code, because the slow timer handler is activated every second to collect stale entries. I just need to add some code to that handler: if over 90% (or 95%) of memory is used, run drop_random_entry to randomly traverse 10% (or 5%) of the entries and drop the SYN-RCV entries among them.
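The idea can be sketched in Python as follows. This is an illustration of the proposal as worded above, not the kernel code: entries are plain dicts, "memory used" is approximated by the entry count relative to a capacity, and the function name and parameters are assumptions made for the example.

```python
import random


def drop_random_syn_entries(entries, capacity, rng,
                            high_water=0.90, sample_fraction=0.10):
    """When the table is over high_water of capacity, visit a random
    sample of entries and drop those still in SYN_RECV state."""
    if len(entries) <= high_water * capacity:
        return entries                       # below the threshold: do nothing
    sample_size = int(len(entries) * sample_fraction)
    visited = set(rng.sample(range(len(entries)), sample_size))
    return [entry for i, entry in enumerate(entries)
            if not (i in visited and entry["state"] == "SYN_RECV")]
```

Established connections are never dropped by this routine, even when they fall into the random sample, so a flood of half-open connections is thinned out at the expense of at most a few legitimate connections that had not yet completed their handshake.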

A second, related question is what happens if a packet is forwarded to a server, and this server has failed and is subsequently removed from the available pool using something like ldirectord. Is there a window where the packet can be retransmitted to a second server? This would only really work if the packet was a new connection.

Yes, it is true. If the primary load balancer fails over, all the established connections will be lost after the backup takes over. We probably need to investigate how to exchange the state (connection entries) periodically between the primary and the backup without too much performance degradation.

If persistent connections are being used and a client is cached but doesn't have any active connections does this count as a connection as far as load balancing, particularly lc and wlc is concerned. I am thinking no. This being the case, is the memory requirement for each client that is cached but has no connections 128bytes as per the memory required for a connection.

The reason that the existing code uses one template and creates different entries for different connections from the same client is to manage the state of different connections from the same client, and it is easy to seamlessly add into the existing IP Masquerading code. If only one template were used for all the connections from the same client, then when the box receives a RST packet it would be impossible to identify which connection it belongs to.

We are using a hash table to record established network connections. How do we know that the data transmission on a connection is over, and when should we delete it from the hash table?

Julian Anastasov ja (at) ssi (dot) bg 24 Dec 2000

OK, here we'll analyze the LVS and mostly the MASQ transition tables from net/ipv4/ip_masq.c. LVS support adds some extensions to the original MASQ code but the handling is same.

First, we have three protocols handled: TCP, UDP and ICMP. The first one (TCP) has many states and with different timeout values, most of them set to reasonable values corresponding to the recommendations from some TCP related rfc* documents. For UDP and ICMP there are other timeout values that try to keep the both ends connected for reasonable time without creating many connection entries for each packet.

There are some rules that keep the things working:

- when a packet is received for an existing connection or when a new connection is created, a timer is started/restarted for this connection. The timeout used is selected according to the connection state. If a packet is received for this connection (from either end) the timer is restarted again (and maybe after a state change). If no packet is received during the selected period, the masq_expire() function is called to try to release the connection entry. It is possible for masq_expire() to restart the timer again for this connection if it is used by other entries. This is the case for the templates used to implement the persistent timeout. They occupy one entry with the timer set to the value of the persistent time interval. There are other cases, mostly used by the MASQ code, where helper connections are used and masq_expire() can't release the expired connection because it is used by others.

- according to the direction of the packet we distinguish two cases: INPUT where the packet comes in demasq direction (from the world) and OUTPUT where the packet comes from internal host in masq direction.
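The timer rule in the first item can be sketched in Python. This is a simplified model rather than the ip_masq.c code: the class and method names are hypothetical, the timeout values are the defaults quoted elsewhere in this section, and the in_use argument stands in for "this entry is still used by other entries" (for example, a persistence template).

```python
# Default timeouts in seconds, as quoted in this section.
TIMEOUTS = {"SYN_RECV": 60, "ESTABLISHED": 15 * 60, "FIN_WAIT": 2 * 60, "UDP": 5 * 60}


class ConnTimers:
    def __init__(self):
        self.expires = {}    # connection key -> (state, absolute expiry time)

    def packet_seen(self, key, state, now):
        """Start or restart the timer using the timeout of the current state."""
        self.expires[key] = (state, now + TIMEOUTS[state])

    def expire(self, now, in_use=()):
        """Release timed-out entries; restart the timer for entries
        that other entries still depend on."""
        for key, (state, deadline) in list(self.expires.items()):
            if deadline <= now:
                if key in in_use:
                    self.expires[key] = (state, now + TIMEOUTS[state])
                else:
                    del self.expires[key]
```

An entry that keeps receiving packets is re-armed each time and never expires, while an idle entry is released once the timeout of its last state has passed.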

What does "masq direction" mean for packets that are not translated using NAT (masquerading), for example, for Direct Routing or Tunneling? The short answer is: there is no masq direction for these two forwarding methods. It is explained in the LVS docs. In short, we have packets in both directions when NAT is used and packets only in one direction (INPUT) when DR or TUN are used. The packets are not demasqueraded for the DR and TUN methods. LVS just hooks the LOCAL_IN chain as the MASQ code is privileged in Linux 2.2 to inspect the incoming traffic when the routing decides that the traffic must be delivered locally. After some hacking, the demasquerading is avoided for these two methods, of course, after some changes in the packet and in its next destination - the realservers. Don't forget that without LVS or MASQ rules, these packets hit the local socket listeners.

How are the connection states changed? Let's analyze for example the masq_tcp_states table (we analyze the TCP states here, UDP and ICMP are trivial). The columns specify the current state. The rows explain the TCP flag used to select the next TCP state and its timeout. The TCP flag is selected from masq_tcp_state_idx(). This function analyzes the TCP header and decides which flag (if many are set) is meaningful for the transition. The row (flag index) in the state table is returned. masq_tcp_state() is called to change ms->state according to the current ms->state and the TCP flag looking in the transition table. The transition table is selected according to the packet direction: INPUT, OUTPUT. This helps us to react differently when the packets come from different directions. This is explained later, but in short the transitions are separated in such way (between INPUT and OUTPUT) that transitions to states with longer timeouts are avoided, when they are caused from packets coming from the world. Everyone understands the reason for this: the world can flood us with many packets that can eat all the memory in our box. This is the reason for this complex scheme of states and transitions. The ideal case is when there is no different timeouts for the different states and when we use one timeout value for all TCP states as in UDP and ICMP. Why not one for all these protocols? The world is not ideal. We try to give more time for the established connections and if they are active (i.e. they don't expire in the 15 mins we give them by default) they can live forever (at least to the next kernel crash^H^H^H^H^Hupgrade).

How does LVS extend this scheme? For the DR and TUN methods we have packets coming from the world only. We don't use the OUTPUT table to select the next state (the director doesn't see packets returning from the internal hosts). We need to relax our INPUT rules and switch to the state required by the external hosts :( We can't derive our transitions from the trusted internal hosts. We change the state based only on the packets coming from the clients. When we use the INPUT_ONLY table (for DR and TUN), LVS expects a SYN packet and then an ACK packet from the client to enter the established state. The director enters the established state after a two-packet sequence from the client without knowing what happens in the realserver, which can drop the packets (if they are invalid) or establish a connection. When an attacker sends SYN and ACK packets to flood an LVS-DR or LVS-Tun director, many connections enter the established state. Each established connection will hold resources (memory) for 15 mins by default. If the attacker uses many different source addresses for this attack, the director will run out of memory.

For these two methods LVS introduces one more transition table: the INPUT_ONLY table which is used for the connections created for the DR and TUN forwarding methods. The main goal: don't enter established state too easily - make it harder.

Oh, maybe you're just reading the TCP specifications. There are sequence numbers that both ends attach to each TCP packet. And you don't see the masq or LVS code trying to filter the packets according to the sequence numbers. This can be fatal for some connections, as the attacker can cause a state change by hitting a connection with a RST packet, for example (ES->CL). The only info needed for this kind of attack is the source and destination IP addresses and ports. Such attacks are possible but not always fatal for the active connections. The MASQ code tries to avoid such attacks by selecting minimal timeouts that are enough for the active connections to resurrect. For example, if the connection is hit by a TCP RST packet from an attacker, this connection has 10 seconds to give evidence of its existence by passing an ACK packet through the masq box.

To make things more complex and harder for the attacker to block a masq box with many established connections, LVS extends the NAT mode (INPUT and OUTPUT tables) by introducing internal server driven state transitions: the secure_tcp defense strategy. When enabled, the TCP flags in the client's packet can't trigger switching to established state without acknowledgement from the internal end of this connection. secure_tcp changes the transition tables and the state timeouts to achieve this goal. The mechanism is simple: keep the connection in SR state with a timeout of 10 seconds instead of the default 60 seconds used when secure_tcp is not enabled.

This trick depends on the different defense power in the realservers. If they don't implement SYN cookies and so sometimes don't send SYN+ACK (because the incoming SYN is dropped from their full backlog queue), the connection expires in LVS after 10 seconds. This action assumes that this is a connection created from attacker, since one SYN packet is not followed by another, as part from the retransmissions provided from the client's TCP stack.

We give 10 seconds to the realserver to reply with SYN+ACK (even 2 are enough). If the realserver implements SYN cookies the SYN+ACK reply follows the SYN request immediately. But if there are no SYN cookies implemented the SYN requests are dropped when the backlog queue length is exceeded. So secure_tcp is by default useful for realservers that don't implement SYN cookies. In this case LVS expires the connections in SYN state in a short time, releasing the memory resources allocated for them. In any case, secure_tcp does not allow switching to established state by looking at the client's packets. We expect an ACK from the realserver to allow this transition to EST state.

The main goal of the defense strategies is to keep the LVS box with more free memory for other connections. The defense for the realservers can be built into the realservers. But maybe I'll propose to Wensong to add a per-connection packet rate limit. This will help against attacks that create a small number of connections but send many packets and in this way load the realservers dramatically. Maybe two values: a rate limit for all incoming packets and a rate limit per connection.

The good news is that all these timeout values can be changed in the LVS setup, but only when the secure_tcp strategy is enabled. An SR timeout of 2 seconds is a good value for LVS clusters when realservers don't implement SYN cookies: if there is no SYN+ACK from the realserver then drop the entry at the director.
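To see why a shorter SR timeout matters, here is a rough sketch (not LVS code) of the steady-state number of half-open entries a director holds under a constant SYN flood. It assumes the entries simply accumulate at the arrival rate and expire after the SR timeout, and it borrows the 128 bytes per connection figure quoted elsewhere in this document. The flood rate of 10000 SYN/sec is a made-up example value.

```python
# Rough estimate of director memory held by SYN_RECV entries during a flood.
# Assumption: every SYN creates one entry that lives exactly sr_timeout seconds.
CONN_ENTRY_BYTES = 128  # per-connection size quoted elsewhere in this document


def steady_state_entries(syn_rate, sr_timeout):
    """Entries alive at any moment = arrival rate * lifetime."""
    return syn_rate * sr_timeout


def steady_state_bytes(syn_rate, sr_timeout):
    """Memory held by those entries."""
    return steady_state_entries(syn_rate, sr_timeout) * CONN_ENTRY_BYTES


rate = 10_000  # hypothetical flood rate, SYN packets per second
for timeout in (60, 10, 2):  # default SR timeout, secure_tcp value, suggested value
    entries = steady_state_entries(rate, timeout)
    mbytes = steady_state_bytes(rate, timeout) / 2**20
    print(f"SR timeout {timeout:2d}s: {entries:7d} entries, {mbytes:6.1f} MB")
```

Under these assumptions the memory held scales linearly with the timeout, so going from 60 seconds to 2 seconds cuts the footprint of the same flood by a factor of 30.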

The bad news is of course, for the DR and TUN methods. The director doesn't see the packets returning from the realservers and LVS-DR and LVS-Tun forwarding can't use the internal server driven mechanism. There are other defense strategies that help when using these methods. All these defense strategies keep the director with memory free for more new connections. There is no known way to pass only valid requests to the internal servers. This is because the realservers don't provide information to the director and we don't know which packet is dropped or accepted from the socket listener. We can know this only by receiving an ACK packet from the internal server when the three-way handshake is completed and the client is identified from the internal server as a valid client, not as spoofed one. This is possible only for the NAT method.

ksparger (at) dialtoneinternet (dot) net (29 Jan 2001) rephrases this by saying the LVS-NAT is layer-3 aware. For example, NAT can 'see' if a realserver responds to a packet it's been sent or not, since it's watching all of the traffic anyway. If the server doesn't respond within a certain period of time, the director can automatically route that packet to another server. LVS doesn't support this right now, but, NAT would be the more likely candidate to support it in the future, as NAT understands all of the IP layer concepts, and DR doesn't necessarily.

Julian

Someone must put back the realserver when it is alive. This sounds like a user space job. The traffic will not start until we send requests. We have to send L4 probes to the realserver (from the user space) or to probe it with requests (LVS from kernel space)?

34.15. Hash Table DoS

A posting (Jun 2003) on Slashdot links to a paper on Denial of Service via Algorithmic Complexity Attacks. The article shows how to mount a DoS attack on hash tables. For many algorithms the cost of an operation differs greatly between the average case and the worst case: a lookup in a hash table with well-distributed keys takes roughly constant time, O(1), while a lookup in a table where every key has landed in the same bucket degrades to a linear list search, O(n), so that inserting n such keys costs O(n^2) in total. Programmers hope that real life data is not pathological. If the attacker knows the hash algorithm (i.e. has the source code), then they may be able to construct a worst case dataset for the hashing algorithm, which will bring the server to its knees. The paper discusses constructing attacks in which all data is entered into one bucket of the hash table.

In the case of LVS, the hash table contents are (CIP:port, proto, VIP:port). The client only has a small number of variables (the port and proto they are sending from) from which to mount an attack, the others being fixed (CIP, VIP:port).

Horms horms (at) verge (dot) net (dot) au 05 Jun 2003

Here is my take on this: LVS uses the following hash

(proto XOR CIP XOR (CIP>>IP_VS_CONN_TAB_BITS) XOR port) & IP_VS_CONN_TAB_MASK

where:
proto =  protocol (TCP=6, UDP=17)
CIP   =  source/client IP address (host byte order)
port  =  source port (host byte order)
IP_VS_CONN_TAB_BITS defaults to 8
IP_VS_CONN_TAB_MASK is (1 << IP_VS_CONN_TAB_BITS) - 1 thus the default is 0xff

(from here '^' will mean power)

The only inputs a user/client can affect are CIP and port.

I would say that it is quite easy for someone to set things up so that they consistently hit the same bucket. For instance by connecting from the same ip address with different ports from the set (port % IP_VS_CONN_TAB_MASK) = n (though we observe that each end-user only has 2^(16-IP_VS_CONN_TAB_BITS) = 256 such ports). The client would need to use multiple source IP addresses.

The effect is that instead of n connections going into 2^IP_VS_CONN_TAB_BITS different buckets they will go into one bucket. Thus LVS will have to do on average n/2 traversals instead of n/2^(IP_VS_CONN_TAB_BITS+1) traversals.

The real effect is to amplify traversal times by 2^IP_VS_CONN_TAB_BITS. Though it is worth remembering that the larger IP_VS_CONN_TAB_BITS is, the lower 2^(16-IP_VS_CONN_TAB_BITS) becomes, and thus the greater the number of source IP addresses required. Though if it was a UDP based attack this might not be an issue, as the source IP could be spoofed.

This could become a problem if n became very large. But how large? Traversal is actually pretty fast. So I think that n would need to be quite large indeed to have a noticeable effect on LVS and larger still to seriously degrade performance. Though I could be wrong.

As for solutions, it is a bit tricky AFAIK. Perhaps using some component of the skb which is static for a connection, but not directly influenced by end users. But that may well open up a whole new can of worms.
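The collision argument above can be checked with a small sketch. This is a Python transcription of the hash expression quoted by Horms, not the kernel code itself. The client address 192.0.2.10 is an arbitrary documentation address, and the function name conn_hash is made up for this example.

```python
import ipaddress

IP_VS_CONN_TAB_BITS = 8
IP_VS_CONN_TAB_MASK = (1 << IP_VS_CONN_TAB_BITS) - 1  # 0xff by default


def conn_hash(proto, cip, port):
    """Transcription of the hash quoted above (all values in host byte order)."""
    return (proto ^ cip ^ (cip >> IP_VS_CONN_TAB_BITS) ^ port) & IP_VS_CONN_TAB_MASK


TCP = 6
cip = int(ipaddress.IPv4Address("192.0.2.10"))  # example client address

# Pick one bucket, then find every source port from this client that hits it.
target_bucket = conn_hash(TCP, cip, 1024)
colliding_ports = [p for p in range(1 << 16) if conn_hash(TCP, cip, p) == target_bucket]

print("bucket:", target_bucket)
print("ports landing in that bucket:", len(colliding_ports))
```

With proto and CIP fixed, only the low IP_VS_CONN_TAB_BITS bits of the port change the bucket, so one client address has 2^(16-IP_VS_CONN_TAB_BITS) = 256 ports per bucket, which is the figure given in the post.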

Ratz 05 Jun 2003

Theoretically we're susceptible to this sort of attack. Check out the devastating effects on running Julian's testlvs with certain parameters. You can still enable the LVS DoS defense strategies though (see testing DoS strategies).

34.15.1. testing hash code

Julian, 14 Jun 2003

Maybe it is time to change the hash function used for the table with connections. Today I played with some data from my 2.2 director and fixed the tools that measure different hash functions. I tested the default LVS function, one that uses 2654435761 and the Jenkins hash that is present in recent 2.4 and 2.5 kernels. We need some help from the math perspective.

Here are some tools for testing the hash functions. Look in Julian's LVS page for test.txt, which contains short instructions for testing, and ipvs-1.0.9-hash1.diff, the test hash code for 2.4.21.

I created test patch against ipvs-1.0.9 (not tested). This is an attempt to introduce randomness on IPVS load. As for the tests with the different hash functions you can see my results and of course to try them yourself. My conclusion is that 2654435761 is better and faster but I hope we will see other results.

34.16. Hash table size, director will crash when it runs out of memory.

Yasser Nabi

IP Virtual Server version 0.9.0 (size=16777216)

Julian Anastasov ja (at) ssi (dot) bg 25 May 2001

Too much, it takes 128MB for the table only. Use 16 bits for example.

Is this a hidden/undocumented problem with IPVS or it's just an observation of memory waste ? (we use 18 bits in production)

empty hash tables:

18 bits occupy 2MB RAM
24 bits occupy 128MB RAM

If the box has 128MB and the bits are 24 the kernel crash is mandatory, sooner or later. And this is a good reason the virtual service not to be hit. Expect funny things to happen on a box with low memory.

I forgot that not anyone uses 256Mb or more RAM on directors :)

Yes, 256MB in real situation is ~1,500,000 connections, 128 bytes each, 64MB for other things ... until someone experiments with SYN attack

However, for me it makes sense to use up to 66% of total memory for LVS, especially on high-traffic directors (on the assumption that the directors don't run all the desktop garbage that comes with most distros).

If the used bits are 24, an empty hash table is 128MB. For the rest 128MB you can allocate 1048576 entries, 128 bytes each ... after the kernel killed all processes.

Some calcs considering the magic value 16 as average bucket length and for 256MB memory:

For 17 bits:

2^17=131072 => 1MB for empty hash table

131072*16=2097152 entries=256MB for connections

For 18 bits:

2^18=262144 => 2MB for empty hash table

for each MB for hash table we lose space for 8192 entries but we speedup the lookup.

So, even for 1GB directors, 19 or 20 is the recommended value. Anything above is a waste of memory for hash table. In 128MB we can put 1048576 entries. In the 24-bit case they are allocated for d-linked list heads.
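The arithmetic in this exchange can be reproduced with a short sketch. It assumes 8 bytes per hash bucket (a doubly linked list head, i.e. two 4-byte pointers on a 32-bit kernel), which is what the quoted 2MB and 128MB figures imply, and 128 bytes per connection entry. The function names are made up for this example.

```python
BUCKET_BYTES = 8   # assumed: one d-linked list head = two 4-byte pointers (32-bit)
CONN_BYTES = 128   # per-connection size quoted in the posts above
MB = 2**20


def table_bytes(bits):
    """Memory used by an empty hash table with 2^bits buckets."""
    return (1 << bits) * BUCKET_BYTES


def max_entries(total_bytes, bits):
    """Connection entries that fit in what is left after the empty table."""
    return (total_bytes - table_bytes(bits)) // CONN_BYTES


def avg_bucket_length(entries, bits):
    """Average chain length if entries spread evenly over the buckets."""
    return entries / (1 << bits)


total = 256 * MB
for bits in (16, 17, 18, 20, 24):
    entries = max_entries(total, bits)
    print(f"{bits} bits: table {table_bytes(bits) / MB:6.1f} MB, "
          f"room for {entries} entries, "
          f"avg bucket length {avg_bucket_length(entries, bits):.1f}")
```

On a 256MB box this reproduces the numbers above: a 24-bit table takes 128MB and leaves room for 1048576 entries, while a 17-bit table takes 1MB and reaches an average bucket length of about 16 when memory is full.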

Joe 6 Jun 2001

what happens after the table fills up? Does ipvs handle new connect requests gracefully (ie drops them and doesn't crash)?

Julian

The table has fixed number of rows and unlimited number of columns (d-lists where the connection structures are enqueued). The number of connections allocated depends on the free memory.

Once there is no memory to allocate connection structure, the connection requests will be dropped. Expect crashes maybe at another place (usually user space) :)

I'm not sure what the kernel will decide in this situation but don't rely on the fact some processes will not be killed. There is a constant network activity and a need for memory for packets (floods/bursts).

And the reason the defense strategies exist is to free memory for new connections by removing the stalled ones. The defense strategy can be automatically activated on a memory threshold. Killing the cluster software on memory pressure is not good.

So, the memory can be controlled, for example, by setting drop_entry to 1 and tuning amemthresh. On floods it can be increased. It depends on the network speed too: 100/1000mbit. Thresholds of 16 or 32 megabytes can be used in such situations, of course, when there are more memory chips.

Roberto Nibali ratz (at) tac (dot) ch

The director never crashes because of exhaustion of memory. If it tries to allocate memory for a new entry in the table and kmalloc returns NULL, we return, or better, drop the packet being processed and generate a page fault.

You could use my threshold limitation patch. You calculate how many connections you can sustain with your memory by counting 128 bytes for each connection entry, divide by the number of realservers, and set the limit accordingly. Example:

128MByte, persistency 300s: max amount of concurrent connections: 3495. We assume having 4 realservers equally load balance, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.
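The figures in this example appear to come from the following arithmetic. This is a reading of the numbers, not something stated in the post: memory divided by 128 bytes per entry gives the number of entries, dividing by the 300s persistence gives 3495, and dividing by the 4 realservers gives 873. The function name is made up for this sketch.

```python
def per_realserver_threshold(mem_bytes, conn_bytes, persistence_s, realservers):
    """Return (entries that fit, entries / persistence, limit per realserver)."""
    entries = mem_bytes // conn_bytes          # how many entries fit in memory
    sustained = entries // persistence_s       # each entry is held for persistence_s
    return entries, sustained, sustained // realservers


entries, sustained, limit = per_realserver_threshold(
    mem_bytes=128 * 2**20,   # 128 MByte
    conn_bytes=128,          # bytes per connection entry
    persistence_s=300,       # persistence timeout
    realservers=4,
)
print(entries, sustained, limit)
```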

Joe

It would seem that we need a method of stopping the director hash table from using all memory whether as a result of a DoS attack or in normal service. Let's say you fill up RAM with the hash table and all user processes go to swap, then there will be problems - I don't know what, but it doesn't sound great - at a high number of connections I expect the user space processes will be needed too. I expect we need to leave a certain amount for user space processes and not allow the director to take more than a certain amount of memory.

It would be nice if the director didn't crash when the number of connections got large. Presumably a director would be functioning only as a director and the amount of memory allocated to user space processes wouldn't change a whole lot (ie you'd know how much memory it needed).

34.17. The LVS code does not swap

Joe Feb 2001

With sufficient number of connections, a director could start to swap out its tables (is this true?) In this case, throughput could slow to a crawl. I presume the kernel would have to retrieve parts of the table to find the realserver associated with incoming packets. I would think in this case it would be better to drop connect requests than to accept them.

Julian

IMO, this is not true. LVS uses GFP_ATOMIC allocations and, as far as I know, such allocations can't be swapped out.

If it's possible for LVS to start the director to swap, is there some way to stop this?

You can try with testlvs whether the LVS uses swap. Start the kernel with LILO option mem=8M and with large swap area. Then check whether more than 8MB swap is used.

34.18. Other factors determining the number of connections

In earlier versions of LVS, you set the amount of memory for the tables (in bytes). Now you allocate a number of hashes, whose size can grow without limit, allowing an unlimited number of connections. Once the number of connections becomes sufficiently large, then other resources will become limiting.

  • out of memory.

    The ipvs code doesn't handle this, presumably the director will crash (also see the threshold patch). Instead you handle this by brute force, adding enough memory to accept the maximum number of connections your setup will ever be asked to handle (e.g. under a DoS attack). This memory size can be determined by multiplying the rate at which your network connection can push connect requests to the director by the timeout values, which are set by FIN_WAIT or the persistence.

  • out of ports.

    You can expand the number of ports to 65k, but eventually you'll reach the 65k port limit.

34.19. Port range: limitations, expanding port range on directors

Sometimes client processes on the realservers need to connect with machines on the internet (see clients on realservers).

Wayne wayne (at) compute-aid (dot) com Nov 5 2001

Say you have a web page that has to retrieve on-line ads from one of your advertisers (people who pay you for showing their ads). If you have 50,000 visitors on your site, you will open 50,000 connections between your web server and the ad server out there somewhere. The masquerade limit is 4,096 per pair of IP addresses, and 40,960 per LVS box. In our case, the realserver is behind the LVS-NAT director, which also functions as the firewall, so the realserver MUST use the director to reach the ad servers.

Usually the RIP is private (e.g.192.168/16) and will have to be NAT'ed to the outside world. This can be done with LVS-NAT or LVS-DR by adding masquerading rules to the director's iptables/ipchains rules. (With LVS-DR, you also have to route the packets from the RIP - this routing is setup by default with the configure script)

Less often you want to use more ports on your LVS client machines.

Wang Haiguang

My client machine uses port numbers between 1024 and 4096. After reaching 4096, it loops back to 1024 and reuses the ports. I want to use more port numbers.

michael_e_brown (at) dell (dot) com 06 Feb 2001

echo 1024 65000 > /proc/sys/net/ipv4/ip_local_port_range
/usr/src/linux/Documentation/networking/ip-sysctl.txt

While normal client processes use ports in order starting at 1024, masqueraded ports start at 61440 (2^16-2^12) for 2.2.x kernels (see clients on realservers). The masquerading code does not check if other processes are requesting ports and thus port collisions could occur. It is assumed on a NAT box that no other processes are initiating connections (i.e. you aren't running a passive ftp server).

Horms 26 Jun 2007

I believe that there is a school of thought that source ports should be randomised to mitigate certain classes of security threats.

Horms 17 May 2004

I am a bit rusty on 2.2. But the restricted port range for NAT'ed connections with 2.2 sounds familiar. I seem to recall you could change the range, but it required changing a define in the kernel and recompiling. I think it was changed to a /proc value in 2.4.

Note
For 2.4.x kernels, the restriction to only use high ports is removed. The NAT router uses ports starting at 1024.

Horms 17 May 2004

In 2.4 the ephemeral port range and the NAT port range are the same. If this is the case, which I guess it is, then it would seem likely there is some sort of collision detection. I took a look and this does not seem to be the case. I assume the kernel has some other way of handling this. But I am not sure at this moment. If you are interested try looking at tcp_v4_get_port() and tcp_unique_tuple(). I'm not convinced that Michael Brown's comment is correct at the moment, but I don't have the definitive answer either.

Wayne wayne (at) compute-aid (dot) com 14 May 2000,

If running a load balancer tester, say the one from IXIA to issue connections to 100 powerful web servers, would all the parameters in Julian's description need to be changed, or it should not be a problem for having many many connections from a single tester?

Julian

There is no limit for the connections from the internal hosts. Currently, the masquerading allows one internal host to create 40960 TCP connections. But the limit of 4096 connections to one external service is still valid.

If 10 internal hosts try to connect to one external service, each internal host can create 4096/10 => 409 connections.

For UDP the problem is sometimes worse. It depends on the /proc/sys/net/ipv4/ip_masq_udp_dloose value.

Joe

which is internal and which is external here? The client, the realservers?

This is a plain masquerading so internal and external refer to masquerading. These limits are not for the LVS connections, they are only for the 2.2 masquerading.

		 / 65095	Internal Servers
External Server:PORT	-  ...	 MADDR --------------------
		 \ 61000

When many internal clients try to connect to same external real service, the total number of TCP connections from one MADDR to this remote service can be 4096 because the masq uses only 4096 masq ports by default. This is a normal TCP limit, we distinguish the TCP connections by the fact they use different ports, nothing more. And the masq code is restricted by default to use the above range of 4096 ports.

In the whole masquerading table there is a space only for 40960 TCP, 40960 UDP and 40960 ICMP connections. These values can be tuned by changing ip_masq.c:PORT_MASQ_MUL.

For 2.4 masquerading, all ports can be used for the masqueraded connections.

Wayne wayne (at) compute-aid (dot) com 1 Nov 2001

PORT_MASQ_MUL appears to serve only as a check to make sure the masquerading facility does not hog all the memory, and that actually things would still work no matter how large PORT_MASQ_MUL is, or even if the checks using it are disabled. Is this true?

Julian

By multiplying this constant with the masq port range, you define the connection limit for each protocol. This is related to the memory used for masquerading. This is a real limit, but not for LVS connections, because they are usually not limited by port collisions, and LVS does not check this limit.

What about using more than the 32k range? What is the maximum I could select?

Peter Mueller pmueller (at) sidestep (dot) com

You should be able to use about 60k, i.e. 1024-61000. I hope you have lots of RAM :-)

Julian continuing

The PORT_MASQ_MUL value simply determines the recommended length of one row in the masq hash table for connections, but in fact it is involved in the above connection limits. It is recommended that busy masq routers increase this value, and maybe the 4096 masq port range too. This applies, for example, to squid servers behind a masq router.

LVS uses another table without limits. For LVS setups the same TCP restrictions apply but for the external clients:

	4999 \
Client	     - VIP:VPORT LVS Director
	1024 /

The number of client connections to one VIP:VPORT is limited by the number of client ports in use from the same client IP.

The same restrictions apply to UDP. UDP has the same port ranges. But for UDP the 2.2 kernel can apply different restrictions. They are caused from some optimizations that try to create one UDP entry for many connections. The reason for this is the fact that one UDP client can connect to many UDP servers while this is not common for TCP.

Joe

when you increase the port range, you need more memory. Is this only because you can have more connections and hence will need a bigger ipvsadm table?

Yes, the first need is for more masqueraded connections and they allocate memory. LVS uses separate table and it is not limited. We distinguish LVS-NAT from Masquerade. LVS-NAT (and any other method) does not allocate extra ports, even for other ranges. It shadows only the defined port. No other ports are involved until masquerading is used.

ipvs doesn't check port ranges and so collisions can occur with regular services (ftp was mentioned). I would have thought that a process needing to open an IP connection would ask the tcp code in the kernel for a connection and let that code handle the assignment of the port.

LVS does not allocate local ports. When the masquerade is used to help with some protocol, the masquerade performs the check (ftp for example).

The port range has nothing to do with LVS. It helps the masquerading to create more connections because there is a fixed limit for each protocol. But sometimes LVS for 2.2 uses ip_masq_ftp, so maybe only then is this mport range used.

X-window connections are at 6000.. Will you be able to start an X-session if these ports are in use by a director masquerading out connections from the realservers?

If we put LVS (ipvsadm -A ) in front of this port 6000 then X sessions will be stopped. OTOH, masquerade does not select ports in this range, the default is above 61000. So, any FTP sessions will not disturb local ports, of course, if you don't increase the mport range to cover the well defined server ports such as X.

34.20. Director does not have any ports (connections) open for an LVS connection

The director is just a router (admittedly with slightly different rules than the standard layer 3 router). There are no connections made (ports opened) between the client and the director, or between the realservers and the director. The director does keep track of the packets passing through that are for LVS'ed services (connection tracking) as part of updating the hash table and for the server state synch demon.

Sebastien BONNET sebastien(dot) bonnet (at) experian (dot) fr 11 May 2004

There's no "open" connection on the director, just tracked connections. The clients are not "connected" to the load balancer.

For the client, assuming a client always uses a different port for an outgoing connection, it can roughly initiate 65K connections.

For the realserver, there's no real port limit for a daemon listening on a single port: it uses just one. The realserver can have connections to all ports on all IPs, i.e. 256*256*256*256*(65536-1024) connections (the realserver may run out of memory before it can make all these connections).

If there was no file descriptor limit nor memory constraint, a server could handle way more than the current "port limit" (65K) simultaneous connections.

34.21. apps starved for ports

LVS Account, 27 Feb 2001

I'm trying to do some load testing of LVS using a reverse proxy cache server as the load balanced app. The error I get is from a load generating app.. Here is the error:

byte count wrong 166/151

Julian Anastasov ja (at) ssi (dot) bg

Broken app.

this goes on for a few hundred requests then I start getting:

Address already in use

App uses too many local ports.

This is when I can't telnet to port 80 any more... If I try to telnet to 10.0.0.80 80 I get this:

$ telnet 10.0.0.80 80
Trying 10.0.0.80...
telnet: Unable to connect to remote host: Resource temporarily unavailable

No more free local ports.

If I go directly to the web server OR if I go directly to the IP of the reverse proxy cache server, I don't get these errors.

Hm, there are free local ports now.

I'm using a load balancing app that I call this way:

/home/httpload/load -sequential -proxyaddr 10.0.0.80 -proxyport
0  -parallel 120 -seconds 6000000 /home/httpload/url

upping the local port range has helped tremendously

34.22. realserver running out of ports

Here's a case where a realserver ran out of UDP ports doing DNS lookups while serving http.

Hendrik Thiel thiel (at) falkag (dot) de

I am using IP Virtual Server version 0.9.14 (size=4096). We have 6 Realservers each.

-> RemoteAddress:Port   Forward Weight ActiveConn InActConn
-> server1:www          Masq    1      68         12391

Today we reached a new peak (very fast, few minutes) 30Mbps, up from the normal 15Mbit/s. Afterwards the following kernel messages (dmesg) showed up...

IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31888).

Julian Anastasov ja (at) ssi (dot) bg 20 Nov 2001 (heavily edited by Joe)

It seems you are flooding a single remote host with UDP requests from a realserver. Your service, www, is TCP and is not directly connected to these messages. You've reached the UDP limit per destination (4096), there are still free UDP ports on the realserver for other destinations.

Hendrik

yes it's DNS, each realserver is a caching DNS.

resolv.conf
nameserver 127.0.0.1
nameserver external IP

34.23. Maximum number of NICs

This is not really an LVS question, but people want to know.

ntadmin (at) reachone (dot) com

We are nearing 255 virtual interfaces on the external side of our LVS system (Joe - presumably the number of VIPs). Can somebody tell me if this is going to be a hard limit or if we can go beyond 255 on one network card?

Roberto Nibali ratz (at) tac (dot) ch 17 Dec 2003

No problem (proof of concept below):

# ip addr show dev lo | grep 'inet ' | wc -l
       1
# for ((i = 1; i < 255; i++)); do for ((j = 1; j < 4; j++)); do ip addr add \
    127.0.$j.$i/32 dev lo brd + label lo:$i-$j 1>/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
     763
# for ((i = 1; i < 255; i++)); do for ((j = 1; j < 4; j++)); do ip addr del \
    127.0.$j.$i/32 dev lo brd + label lo:$i-$j 1>/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
       1

34.24. DoS

LVS is vulnerable to DoS by an attacker making repeated connection requests. Each connection requires 128 bytes of memory, so eventually the director will run out of memory. This will take a while, but an attacker has plenty of time if you're asleep. Also, with LVS-DR and LVS-Tun, the director doesn't have access to the TCP/IP tables in the realserver(s) which show whether a connection has closed (see director hash table). The director can only guess that the connection has really closed, and does so using timeouts.

Roberto Nibali ratzi (at) tac (dot) ch 10 Sep 2002

It's _impossible_ to differentiate between malicious and good traffic. End of story. But you can rate limit incoming SYNs within ingress policy. This was discussed about 2 years ago when the secure_tcp and drop_packet stuff was about to be introduced.

For information on DoS strategies for LVS see DoS page.

Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 14 Feb 2001

If I am not misunderstanding something, the variable /proc/sys/net/ipv4/vs/timeout_established gives the time a TCP connection can be idle and after that the entry corresponding to this connection is cleared. My problem is that it seems that sometimes it's not the case. For example I have a system (2.2.16 and ipvs 0.9.15) with /proc/sys/net/ipv4/vs/timeout_established = 480, but the entries are created with a real timeout of 120.

Julian Anastasov ja (at) ssi (dot) bg

Read The secure_tcp defense strategy where the timeouts are explained. They are valid for the defense strategies only. For TCP EST state you need to read the ipchains man page.

For more explanation of the secure_tcp strategy also see the explanation of the director's hash table.

When I play with ipchains -M -S > [value] 0 0, the variable /proc/sys/net/ipv4/vs/timeout_established is modified even when /proc/sys/net/ipv4/vs/secure_tcp is set to 0, i.e. when I'm not using the secure TCP defense. The "real" timeout is of course set to [value] when a new TCP connection appears. So should I understand that timeout_established, timeout_udp, ... are always modified by "ipchains -M -S ...", whether or not I use the secure TCP defense, but that if secure_tcp is set to 0, other variables give the timeouts to use? If so, are these variables accessible, and how can I check their values?

ipchains -M -S modifies the two TCP timeouts and the UDP timeout in both secure_tcp modes: off and on. So, ipchains changes the three timeout_XXX vars. When you change the timeout_* vars directly, you change them for secure_tcp=on only. Think of the timeouts as two sets, one for each secure_tcp mode (on and off). ipchains changes the 3 vars in both sets. While secure_tcp is off, changing timeout_* does not affect the connection timeouts. They are used when secure_tcp is on.

Note
Joe: ipchains 0 value 0, where value=10, does not change the timeout values or number of entries seen in InActConn or seen with netstat -M, or ipchains -M -L -n.

LVS has its own tcpip state table, when in secure_tcp mode.

carl.huang

what are the vs_tcp_states[ ] and vs_tcp_states_dos[ ] elements in the ip_vs_conn structure for?

Roberto Nibali ratz (at) tac (dot) ch 16 Apr 2001

The vs_tcp_states[] table is the modified state transition table for the TCP state machine. The vs_tcp_states_dos[] is a yet again modified state table in case we are under attack and secure_tcp is enabled. It is tighter but does not conform to the RFC anymore. Let's take an example of how you can read it:

static struct vs_tcp_states_t vs_tcp_states [] = {
/*      INPUT */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }},

The elements 'sXX' mean state XX, so for example, sFW means TCP state FIN_WAIT, sSR means TCP state SYN_RECV and so on. The table describes the transition of the TCP state machine from one TCP state to another after a state event occurred. The commented header row lists the current states, and the comment at the start of each row names the incoming TCP flag (syn, fin, ack, rst) that drives the transition. For example, take column 2, headed by sES and ending with sCL. The rest is easy: if you're in column 2 (state sES) and get a fin, you go from sES to sCW, which should conform to the RFC and Stevens.

Short illustration:

/*           , sES,
/*syn*/ {{   ,    ,
/*fin*/ {{   , sCW,
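The lookup described above can be tried out in a few lines of shell. This is only a sketch for reading the table, not the kernel code: the rows are copied from the INPUT excerpt of vs_tcp_states[] shown above, and the function name next_state is made up for illustration.

```shell
#!/bin/sh
# Current states, in the column order of the header comment of vs_tcp_states[].
STATES="sNO sES sSS sSR sFW sTW sCL sCW sLA sLI sSA"

# One row per incoming flag, copied from the INPUT table above.
ROW_syn="sSR sES sES sSR sSR sSR sSR sSR sSR sSR sSR"
ROW_fin="sCL sCW sSS sTW sTW sTW sCL sCW sLA sLI sTW"
ROW_ack="sCL sES sSS sES sFW sTW sCL sCW sCL sLI sES"
ROW_rst="sCL sCL sCL sSR sCL sCL sCL sCL sLA sLI sSR"

# next_state FLAG STATE: print the state reached from STATE when FLAG arrives.
# (hypothetical helper, not part of LVS)
next_state() {
    flag=$1
    cur=$2
    case $flag in
        syn) row=$ROW_syn ;;
        fin) row=$ROW_fin ;;
        ack) row=$ROW_ack ;;
        rst) row=$ROW_rst ;;
        *) echo "unknown flag: $flag" >&2; return 1 ;;
    esac
    # Find the 1-based column of the current state.
    col=0
    idx=0
    for s in $STATES; do
        col=$((col + 1))
        if [ "$s" = "$cur" ]; then idx=$col; fi
    done
    if [ "$idx" -eq 0 ]; then
        echo "unknown state: $cur" >&2
        return 1
    fi
    # Pick the idx-th entry of the selected row.
    set -- $row
    shift $((idx - 1))
    echo "$1"
}

# The example from the text: a fin arriving in state sES.
next_state fin sES
```

Calling next_state with the other flags for sES walks down column 2 of the table in the same way.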

It was some months ago last year when Wensong, Julian and I discussed a security enhancement for the TCP state transition and after some heavy discussion they implemented it. So the second table vs_tcp_states_dos[] was born. (look in the mailing list in early 2000).

34.25. DoS, from the mailing list

34.25.1. Malicious attacks (SYN floods)

LVS has been tested with a 100Mbit/sec syn-flooding attack by Alan Cox and Wensong.

Each connection requires 128 bytes. A machine with 128M of free memory could hold 1M concurrent connections. An average connection lasts 300secs. Connections which just receive the syn packet are expired in 30secs (starting ipvs 0.8 ). An attacker would have to initiate 3k connections/sec (600Mbps) to maintain the memory at the 128M mark and would require several T3 lines to keep up the attack.
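The arithmetic behind these figures can be checked with a few lines of shell. This is only a sketch: the function names max_entries and sustain_rate are made up for illustration, and the 600Mbps bandwidth figure is not recomputed because the packet size it assumes is not given.

```shell
#!/bin/sh
# max_entries MEM_MB ENTRY_BYTES: how many connection entries fit in MEM_MB.
max_entries() {
    echo $(( $1 * 1024 * 1024 / $2 ))
}

# sustain_rate ENTRIES LIFETIME_SECS: new connections/sec an attacker must
# create to keep ENTRIES entries alive when each expires after LIFETIME_SECS.
sustain_rate() {
    echo $(( $1 / $2 ))
}

entries=$(max_entries 128 128)
echo "entries that fit in 128M:       $entries"
echo "rate needed at 300s lifetime:   $(sustain_rate "$entries" 300) conn/sec"
echo "rate needed at 30s SYN timeout: $(sustain_rate "$entries" 30) conn/sec"
```

With the 300sec lifetime the result is roughly the 3k connections/sec quoted above; with the 30sec timeout for SYN-only entries the attacker needs about ten times that rate.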

34.25.2. testing DoS

joern maier 22 Nov 2000

I've got a problem protecting my LVS from SYN-flood attacks. Somehow the drop_entry mechanism seems not to work. Doing a SYN-flood with 3 clients to my LVS ( 1 director + 3 RS ) the system gets unreachable. A single realserver under the same attack by those clients stays alive.

Julian

You can't SYN flood the director with only 3 clients. You need more clients (or as an alternative, you can download testlvs from the LVS web site). What does ipvsadm -Ln show under attack? How do you activate drop_entry? What does cat drop_entry show?

all realservers have tcp_syncookies enabled (1), tcp_max_syn_backlog=128, and on the director the drop_entry var is set to 1 (echo 1 > drop_entry). Before compiling the kernel, I set the table size to 2^20. My Director has 256 MB and no other applications running.

You don't need such a large table, really.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 04 Nov 2004

I'm currently facing a ddos syn-flood attack against my cluster. Fortunately, those guys do not have enough machines to flood all my servers and the service is still up and running. They seem to use spoofed source IPs (as usual) so I can't even know where it comes from.

Anyway, It is now 24 hours they are playing like that, and I would like to stop it. Do you have an idea? Don't tell me that I have to use iptables to reduce the syn rate, I can't :). I have a lot of mobile clients, and the wap gateways can send me a lot of valid syns.

Jacob Coby jcoby (at) listingbook (dot) com 04 Nov 2004

You can try turning on tcp_syncookies: http://www.mail-archive.com/focus-linux@securityfocus.com/msg00185.html

echo 1 > /proc/sys/net/ipv4/tcp_syncookies

I forgot to mention that I've had tcp_syncookies enabled on individual systems for about 3 years now with no problems. I've had it enabled on every machine in a LVS-DR cluster for 6 months with no problems.

With testlvs and two clients, my LVS gets to a denial of service, although cat drop_entry shows me a "1".

director:/etc/lvs# ipvsadm -Ln:
192.168.10.1:80 lc
192.168.1.4:80	Tunnel 	1	0	33246
192.168.1.3:80	Tunnel 	1	0	33244
192.168.1.2:80	Tunnel 	1	0	33246

run testlvs with 100,000 source addresses.

during the flooding attack the connection values stay around this size. Using the SYN-flood tool with which I tried it before, ipvsadm shows me

192.168.10.1:80 lc
192.168.1.4:80	Tunnel 	1	0	356046
192.168.1.3:80	Tunnel 	1	0	355981
192.168.1.2:80	Tunnel 	1	0	356013

so it shows me about ten times as many connections as your tool. I took a look at the packets, both are quite similar, they only differ in the window size (testlvs has 0, the other tool uses a size of 65534) and sequence numbers (o.k. checksum as well)

I am activating drop entry like this:

I switch on my computer (director) and start linux with the LVS Kernel

cd /proc/sys/net/ipv4/vs
echo 1 > drop_entry

Julian

Maybe you need to tune amemthresh. 1024 pages (4MB) is too low a value. How much memory does "free" show under attack? You can try with 1/8 RAM size for example. The main goal of these defense strategies is to keep free memory in the director, nothing more. The defense strategies are activated according to the free memory size. The packet rate is not considered.

joern maier

That all sounds good to me, but what I'm really wondering about is why the drop_entry variable still has a value of 1. I thought it has to be 2 when my system is under attack? To me it looks like LVS does not even think it's under attack and therefore does not use the drop_entry mechanism.

You are right. You forgot to specify when LVS should think it is under attack. drop_entry switches automatically from 1 to 2 when the free memory reaches amemthresh. Do you know that your free memory is below 4MB? See defense strategies.

So, the 1,000,000 entries created by the other tool use 128MB of memory. You have 256MB :) To reduce the amount of memory the kernel sees, boot with mem=128MB (in lilo) or set amemthresh to 32768 or run testlvs with more source addresses (2,000,000). I'm not sure if the last will help if the other tool you use does not limit the number of spoofed addresses. But don't run testlvs with less than -srcnum 2000000. If the setup allows a rate > 33,333 packets/sec, LVS can create 2,000,000 entries that expire after 60 seconds (the SYN_RECV timeout). Better not to use the -random option in testlvs for this test.

So, you can test with such large values but make sure you tune amemthresh in production with the best value for your director. The default value is not very useful. You can test whether 1/8 is a good value (8192 for 4K page size).
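The page counts mentioned in this thread can be reproduced with a small calculation. This is a sketch under stated assumptions: amemthresh_pages is a hypothetical helper, the 256MB and 4K page size are the values from this thread, and the final line only prints the command (assuming amemthresh sits in /proc/sys/net/ipv4/vs alongside drop_entry) rather than running it.

```shell
#!/bin/sh
# amemthresh_pages RAM_MB DIVISOR PAGE_KB
# Number of pages corresponding to RAM_MB/DIVISOR of memory,
# for a page size of PAGE_KB kilobytes. (hypothetical helper)
amemthresh_pages() {
    echo $(( $1 * 1024 / $2 / $3 ))
}

# 1/8 of a 256MB director with 4K pages.
pages=$(amemthresh_pages 256 8 4)
echo "1/8 of 256MB with 4K pages = $pages pages"

# The alternative from the thread: a threshold of 128MB.
echo "128MB with 4K pages        = $(amemthresh_pages 128 1 4) pages"

# Print (rather than run) the command that would apply the value on the director.
echo "echo $pages > /proc/sys/net/ipv4/vs/amemthresh"
```

The two results match the 8192 and 32768 values quoted in the replies above; on a real director you would pick the divisor after watching "free" under load.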

Sameer Garg sameer (dot) garg (at) gmail (dot) com 15 Apr 2008

We have been experiencing D/Dos on http. The LVS is unaffected by the D/Dos but the real servers are suffering. Besides the D/Dos, the LVS is currently handling 5 subdomains and approximately 10QPS.

We are using LVS-Tun configuration. Due to our distributed setup and service provider limitation we can't put a perimeter firewall so we are thinking of stopping them at or before the LVS.

At the director I have tuned the route flush and route garbage collection variables but that is all I could figure out. After reading the howto and the mailing list I have concluded that it is possible to use iptables with LVS-DR and LVS-NAT. Is it advisable to put iptables on the director in a LVS-TUN setup?

Unrelated question: Anybody using an open-source firewall (iptables/pf) in production for 100M connections?

Michael Schwartzkopff misch (at) multinet (dot) de 15 Apr 2008

Yes. iptables is even necessary if you take LVS decisions based on the mangle table. I haven't seen any 100M setups in production, but it should be possible. Perhaps this helps: http://lists.sans.org/pipermail/unisog/2005-August/025040.html

Bgs bgs (at) bgs (dot) hu

We use an lvs+netfilter solution with ~Gbit/sec traffic and DDoS attacks above a gigabit. We have had DDoS attacks in the 60k-100k bot range. You can handle these with a reasonable level of service, but if you want your users to experience just small hiccups, you need a mitigator on the outermost layer with good feedback from your system into the mitigator blacklist.

34.25.3. on the design of the DoS preventer

Alan Cox alan (at) lxorguk (dot) ukuu (dot) org (dot) uk

The biggest problem with load balancing, when you need to do this sort of trickery (and it's one the existing load balancing patches seem to have), is that if you store per connection state then a synflood will take out your box (if you run out of ram) or run a delightfully efficient DoS attack, if you don't. The moment you factor time into your state tables you are basically well and truly screwed.

Lars Marowsky-Bree lmb (at) teuto (dot) net 8 Jun 1999

This can be solved with a hashtable, where you take the source ip as the key and look up the server to direct the request. Since the hash table is fixed size, we can do with fixed resources.

Given a proper hash function, this scheme is _ideal_ for basic round-robin and weighted round-robin with fixed weights and we should look at implementing this. Keeping state if not necessary _is_ a bug.
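A stateless mapping of this kind can be sketched in shell. The assumptions are loud ones: a plain modulo of the 32-bit source address stands in for the "proper hash function" (it is not one), the realserver names RS1..RS3 are placeholders, and ip_to_int and pick_server are hypothetical helpers, not anything in LVS.

```shell
#!/bin/sh
# ip_to_int A.B.C.D: print the address as a 32-bit integer.
# The body runs in a subshell so the IFS change does not leak out.
ip_to_int() (
    IFS=.
    set -- $1
    echo $(( ($1 << 24) + ($2 << 16) + ($3 << 8) + $4 ))
)

# pick_server SRC_IP SERVER...: map a source address onto one of the servers.
# No per-connection state is kept, so memory use is fixed regardless of
# how many (possibly spoofed) source addresses show up.
pick_server() {
    ip=$1
    shift
    idx=$(( $(ip_to_int "$ip") % $# ))
    shift "$idx"
    echo "$1"
}

for src in 10.0.0.1 10.0.0.2 10.0.0.3 10.0.0.1; do
    echo "$src -> $(pick_server "$src" RS1 RS2 RS3)"
done
```

The same source address always lands on the same server as long as the server list does not change, which is why this scheme fits fixed-weight round-robin but not the dynamic cases discussed next.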

We are screwed however and can't do this if we want to do least-connections, dynamic load-based balancing, add servers at a later time etc and still deliver sticky connections (ie that connections from client A will stay on server B until a timeout happens or server B dies).

Basically, since we _need_ to keep state on a per-client basis for this we can be screwed easily by bombarding us with a randomized source IP.

Now - for all but the most simple load balancing, we NEED to keep state. So, we need to weasel our way out of this mess somehow.

One approach would be to integrate SYN cookies into the load-balancer itself and only pass on the connection if the TCP handshake succeeded. Now, there are a few obvious problems with this: It is a very complex task. And, it still screws us in the case of a UDP flood.

"The easy way out" for TCP connections is to do this stuff in user space - a load-balancing proxy, which connects to the backend servers. Problems with this are that it isn't transparent to the backend servers anymore (all connections come from the IP of the loadbalancer), it does not scale as well (no direct routing approach etc possible), and we still did not solve UDP.

I propose the following: We continue to maintain state like we always did. But when we hit, let's say, 32,000 simultaneous connections, we go into "overload" mode - that is, all new connections are mapped using the hash table like Alan proposed, but we still search the stateful database first.

There are a few problems with this too: It is not as fast as the pure hash table, since we need to look into the stateful database before consulting the hashtable. If weights change during overload mode, sticky connections can't be easily guaranteed (I thus propose to NOT change weights during overload mode, or at least ignore the changes with regard to the hashing).

However, these are disadvantages which only happen under attack. At the moment, we would simply crash, so it IS an improvement. It is a fully transparent approach and works with UDP too. The effort to implement this is acceptable. (if it were userspace I would give it a try sometime;)

And if we implement this scheme for fixed loadbalancing, which someone else definitely should, reusing the code here might not be that much of a problem.

34.25.4. Timeout in MASQ tables

Michael McConnell michaelm (at) eyeball (dot) com 08 Oct 2001

the command

#ipchains -L -M

returns a list of masqueraded connections, i.e.

TCP  01:38.01 10.1.1.41           21.1.112.43         80 (80) -> 4052
TCP  01:38.08 10.1.1.41           21.1.112.43         80 (80) -> 4053
TCP  00:25.09 10.1.1.11           20.170.180.17       80 (80) -> 4430

If ipchains (kernel 2.2) has been set with a 2hr TCP timeout

ipchains -M -S 7200 0 0 (2 hour TCP timeout)

Now these connections remain (will populate the ipvsadm table) for 2 hours. Does anyone have any suggestions as to how to purge this table manually? If I run out of ports, I get a DoS (2 hr timeout, 30,000 TCP connections...DoS)

Peter Mueller

If you alter /proc/net/ip_masquerade, it will break the established connection. Isn't that what you want to do?

No matter what I do I can not seem to reset, clear or modify this manually.

if you do not like the prospect of altering directly perhaps try a shell script:

#!/bin/sh
#hopefully this works and you won't shoot yourself in the foot...
ipchains -M -S 1 0 0
sleep 5
ipchains -M -S 7200 0 0

Setting this value only affects *NEW* connections; connections already set are unaffected.

Julian Anastasov ja (at) ssi (dot) bg

Without timeout values specific to each LVS virtual service and another for the masqueraded connections it is difficult to play such games. It seems only one timeout needs to be separated, the TCP EST timeout. The reason that such support is not in 2.2 is because nobody wants to touch the user structures. IMO, it can be added for 2.4 if there are enough free options in ipvsadm but it also depends on some implementation details.

If you worry about free memory you can use some of the LVS DoS defense strategies

	echo 1 > drop_entry

34.25.5. BIG IP SYN Check and Dynamic Reaping

unknown

Is there anything like the BIG-IP syn check (http://www.f5.com/solutions/tech/security/) to prevent DoS?

Ratz 12 Aug 2004

For the RS or the director or both? I think you are referring to those two marketing features:

SYN Check: One type of DoS attack is known as a SYN flood in which an attack is made for the purpose of exhausting a system's resources leaving it unable to establish legitimate connections. The BIG-IP system's SYN Check feature works to alleviate SYN flooding by sending cookies to the requesting client on the server's behalf and by not recording state information for connections that have not completed the initial TCP handshake. This unique feature ensures that servers only process legitimate connections and the BIG-IP SYN queue is not exhausted, and normal TCP communication can continue. The SYN Check feature complements the BIG-IP Dynamic Reaping feature, in that while the Dynamic Reaping handles established connection flooding, SYN Check addresses embryonic connection flooding to prevent the SYN queue from becoming exhausted.

Dynamic Reaping - The BIG-IP software contains two global settings that provide the ability to reap connections adaptively. Used to prevent denial-of-service (DoS) attacks, enterprises can specify a low-watermark threshold and a high-watermark threshold. The low-watermark threshold determines at what point adaptive reaping becomes more aggressive. The high-watermark threshold determines when non-established connections through the BIG-IP product will no longer be allowed. The value of this variable represents a percentage of memory utilization. Once memory utilization has reached this mark, connections are disallowed until the available memory has been reduced to the low-watermark threshold.

SYN Check can be enabled on the RS for all major Unices. For the rest we have to accept the fact that LVS is a software load balancer and does not have the possibilities a hardware load balancer has with regard to doing SYN cookies for other servers. Also limiting the backlog queue on a per socket basis definitely helps.

Dynamic Reaping could very well be brought into conjunction with our TCP DoS defense mechanism. Read about it at: LVS DoS defense (http://www.linux-vs.org/docs/defense.html).

I have tested F5 BigIP load balancers and I was able to flood the RS just as well as using LVS. SYN flooding cannot be prevented, it can only be rate limited. It's a flaw in the TCP protocol which we'll have to live with. There are a couple of defense mechanisms but none of them can really distinguish between malicious TCP/SYN and friendly TCP/SYN.

34.26. Testing DoS Strategies with testlvs: Creating large numbers of InActConn

34.26.1. testlvs

testlvs (by Julian ja (at) ssi (dot) bg) is available on Julian's software page.

It sends a stream of SYN packets (SYN flood) from a range of addresses (default starting at 10.0.0.1) simulating connect requests from many clients. Running testlvs from a client will occupy most of the resources of your director and the director's screen/mouse/keyboard will/may lock up for the period of the test. To run testlvs, I export the testlvs directory (from my director) to the realservers and the client and run everything off this exported directory.

Fabrice fabrice (at) urbanet (dot) ch 11 Dec 2001

I can reach 60K SYN/s with a mean of about 54K using a PIII 500MHz client.

The load on the LVS-NAT director was high (always 100% system usage, and a switch between the ttys takes about 3-5 seconds). That poor box couldn't handle the load and wasn't able to send back packets (maybe only 10 per second). This means that the DoS was successful, but it only works during the flood; it won't break any services (thanks to Syn_Cookies).

Julian Anastasov ja (at) ssi (dot) bg

If you want to measure a maximal possible rate use -srcnum 10 or another small number to avoid beating the routing cache in the director. If you need to test the defense strategies you need large value in -srcnum. The default is too small for this, it avoids errors.

I think the only way to prevent the DoS in this case is to upgrade the LVS box hardware :)

Not always. LVS does not protect the realservers. The result can be the output pipe loaded from replies on DoS attack. You should try some ingress rate limiting, independent from LVS. Of course, your hardware should not be blocked from such attacks, you need faster MB+CPU if you care for such problems.

I looked with vmstat 1 and 10, as Julian recommended. Shouldn't the number of interrupts shown by "vmstat 10" be 10 times that shown by "vmstat 1"?

No, they should be equal, up to 5% are good, they show that the process scheduling really works. If you are under attack and you can't handle it then the snapshots from vmstat 1 are delayed and the results differ too much from the results provided for longer time interval.

I got with vmstat 1: interrupts = ca. 400'000, cpu sys = 100 and with vmstat 10: interrupts = ca. 60'000, cpu sys = 100

Your director reached its limits. You should try to flood it with slower client(s). When you see that the input packet rate is equal to the rate of successfully forwarded packets (received on the realserver), stop slowing down the attack: you have reached the maximal packet rate that can be delivered to the realservers. On NAT you should consider the replies; they are not measured with testlvs tests. They will need maybe the same CPU power.
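The "up to 5%" rule of thumb for comparing the two vmstat readings can be written down as a small check. This is only a sketch: rates_agree is a hypothetical helper, and the interrupt figures are the ones quoted in this exchange.

```shell
#!/bin/sh
# rates_agree A B TOLERANCE_PCT
# Succeeds if the two per-second readings differ by at most TOLERANCE_PCT
# percent of the larger one. (hypothetical helper)
rates_agree() {
    a=$1
    b=$2
    tol=$3
    if [ "$a" -ge "$b" ]; then hi=$a; lo=$b; else hi=$b; lo=$a; fi
    if [ "$hi" -eq 0 ]; then return 0; fi
    [ $(( (hi - lo) * 100 )) -le $(( hi * tol )) ]
}

# Interrupt counts reported above for "vmstat 1" and "vmstat 10".
if rates_agree 400000 60000 5; then
    echo "readings agree: scheduling is keeping up"
else
    echo "readings differ by more than 5%: the director is overloaded"
fi
```

Because vmstat reports per-second averages, the two intervals should give nearly the same number on a healthy machine, which is the point of Julian's answer.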

34.26.2. configure realserver

The realserver is configured to reject packets with src_address 10.0.0.0/8.

Here's my modified version of Julian's show_traffic.sh which is run on the realserver to measure throughput. Start this on the realserver before running testlvs on the client. For your interest you can look on the realserver terminal to see what's happening during a test.

#!/bin/bash
#show_traffic.sh
#by Julian Anastasov ja (at) ssi (dot) bg
#modified by Joseph Mack jmack (at) wm7d (dot) net
#
#run this on the realserver before starting testlvs on the client
#when finished, exit with ^C.
#
#suggested parameters for testlvs
#testlvs VIP:port -tcp -packets 20000
#where
#	VIP:port - target IP:port for test
#
#packets are sent at about 10000 packets/sec on my
#100Mbps setup using 75 and 133MHz pentium classics.
#
#------------------------------------------
# setup a few things
to=10		#sleep time
trap return INT #trap ^C from the keyboard (used to exit the program)
iface="$1"	#NIC to listen on

#------------------------------------------
#user defined variables

#network has to be the network of the -srcnet IP
#that is used by the copy of testlvs being run on the client
#(default for testlvs is 10.0.0.0)

network="10.0.0.0"
netmask="255.0.0.0"
#-------------------------------------------
function get_packets() {
	cat /proc/net/dev | sed -n "s/.*${iface}:\(.*\)/\1/p" | \
	awk '{ packets += $2} ; END { print packets }'
}

function call_get_packets() {
	while true
	do
		sleep $to
		p1="`get_packets "$iface"`"
		echo "$((($p1-$p0)/$to)) packets/sec"
		p0="$p1"
	done
}
#-------------------------------------------
echo "Hit control-C to exit"

#initialise packets at $iface
p0="`get_packets "$iface"`"

#reject packets from $network
route add -net $network netmask $netmask reject

call_get_packets

#restore routing table on exit
route del -net $network netmask $netmask reject
#-------------------------------------------

34.26.3. configure director

I used LVS-NAT on a 2.4.2 director, with netpipe (port 5002) as the service on two realservers. You won't be using netpipe for this test, i.e. you won't need a netpipe server on the realserver. You just need a port that you can set up an LVS on and netpipe is in my /etc/services, so the port shows up as a name rather than a number.

Here's my director

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:netpipe rr
  -> RS2.mack.net:netpipe             Masq    1      0          0
  -> RS1.mack.net:netpipe             Masq    1      0          0

34.26.4. run testlvs from client

run testlvs (I used v0.1) on the client. Here testlvs is sending 256 packets from 254 addresses (the default) in the 10.0.0.0 network. (My setup handles 10,000 packets/sec. 256 packets appears to be instantaneous.)

client: #./testlvs 192.168.2.110:5002 -tcp -packets 256

when the run has finished, go to the director

director:/etc/lvs# ipvsadm
IP Virtual Server version 0.2.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:netpipe rr
  -> RS2.mack.net:netpipe             Masq    1      0          127
  -> RS1.mack.net:netpipe             Masq    1      0          127

(If you are running a 2.2.x director, you can get more information from ipchains -M -L -n, or netstat -M. For 2.4.x use cat /proc/net/ip_conntrack.)

This output shows 254 connections that have closed and are waiting to timeout. A minute or so later, the InActConn will have cleared (on my machine, it's 50secs).

If you send the same number of packets (256), from 1000 different addresses, (or 1000 packets to 256 addresses), you'll get the same result in the output of ipvsadm (not shown)

client: #./testlvs 192.168.2.110:5002 -tcp -srcnum 1000 -packets 256

In all cases, you've made 254 connections.

If you send 1000 packets from 1000 addresses, you'd expect 1000 connections.

./testlvs 192.168.2.110:5002 -tcp -srcnum 1000 -packets 1000

Here's the total number of InActConn as a function of the number of packets (connection attempts). Results are for 3 consecutive runs, allowing the connections to timeout in between.

The numbers are not particularly consistent between runs (aren't computers deterministic?). Sometimes the blinking lights on the switch stopped during a test, possibly a result of the tcp race condition (see the performance page)

packets		InActConn
 1000		 356, 368, 377
 2000		 420, 391, 529
 4000		 639, 770, 547
 8000		 704, 903,1000
16000		1000,1000,1000

You don't get 1000 InActConn with 1000 packets (connection attempts). We don't know why this is.

Julian

I'm not sure what's going on. In my tests there are dropped packets too. They are dropped before reaching the director, maybe from the input device queue or from the routing cache. We have to check it.

34.26.5. InActConn with drop_entry defense strategy

repeating the control experiment above, but using the drop_entry strategy (see the DoS strategies for more information).

director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_entry

packets		InActConn, drop_entry=3
 1000		369,368,371
 2000		371,380,409
 4000		467,578,458
 8000		988,725,790
16000		999,994,990

The drop_entry strategy drops 1/32 of the entries every second, so the number of InActConn decreases linearly during the timeout period, rather than dropping suddenly at the end of the timeout period.

34.26.6. InActConn with drop_packet defense strategy

repeating the control experiment above, but using the drop_packet strategy (see the DoS strategies for more information).

director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/drop_packet

packets		InActConn, drop_packet=3
 1000		338,339,336
 2000		331,421,382
 4000		554,684,629
 8000		922,897,480,662
16000		978,998,996

The drop_packet=3 strategy will drop 1/10 of the packets before sending them to the realserver. The connections will all timeout at the same time (as for the control experiment, about 1min), unlike for the drop_entry strategy. With the variability of the InActConn number, it is hard to see the drop_packet defense working here.

34.26.7. InActConn with secure_tcp defense strategy

repeating the control experiment above, but using the secure_tcp strategy (see the DoS strategies for more information). The SYN_RECV value is the suggested value for LVS-NAT.

director:/etc/lvs# echo "3" >/proc/sys/net/ipv4/vs/secure_tcp
director:/etc/lvs# echo "10" >/proc/sys/net/ipv4/vs/timeout_synrecv

packets		InActConn, secure_tcp=3
 1000		 338, 372, 359
 2000		 405, 367, 362
 4000		 628, 507, 584
 8000		 642,1000, 886
16000		1000,1000,1000

This strategy drops the InActConn from the ipvsadm table after 10secs.

34.26.8. maximum number of InActConn

If you want to get the maximum number of InActConn, you need to run the test for longer than the FIN timeout period (here 50secs). 2M packets is enough here. You also want as many different source addresses as possible. Since testlvs is connecting from the 10.0.0.0/8 network, you could have 254^3=16M connections. Since only 2M packets can be passed before connections start to timeout and the director connection table reaches a steady state with new connections arriving and old connections timing out, there is no point in sending packets from more than 2M source addresses.
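The steady-state reasoning above is just rate times timeout, which can be checked in shell. This is a sketch assuming the 10,000 packets/sec and 50sec timeout quoted for this setup, that every packet comes from a distinct address, and that none is dropped; steady_state_entries is a hypothetical helper.

```shell
#!/bin/sh
# steady_state_entries RATE_PPS TIMEOUT_SECS
# Entries alive at once when every SYN creates an entry that
# lives for TIMEOUT_SECS. (hypothetical helper)
steady_state_entries() {
    echo $(( $1 * $2 ))
}

# Addresses available in 10.0.0.0/8 if each of the last three
# octets runs from 1 to 254.
addr_space=$(( 254 * 254 * 254 ))

echo "address space in 10.0.0.0/8:    $addr_space"
echo "entries at 10000 pps, 50s:      $(steady_state_entries 10000 50)"
echo "entries at 33333 pps, 60s:      $(steady_state_entries 33333 60)"
```

At 10,000 packets/sec the table cannot hold more than about 500,000 live entries, so 2M packets is comfortably more than one timeout period's worth and the 16M address space is never the limit.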

Note: you can view the contents of the connection table with

2.2

  • netstat -Mn

  • cat /proc/net/ip_masquerade

2.4

  • cat /proc/net/ip_vs_conn

Here's the InActConn with various defense strategies. The InActConn is the maximum number reachable; the srcnum and packets are the numbers needed to saturate the director. The time of the test must exceed the timeouts. InActConn was determined by running a command like this

client: #./testlvs 192.168.2.110:5002 -tcp -srcnum 1000000 -packets 2000000

and then adding the (two) entries in the InActConn column from the output of ipvsadm.
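Adding up the InActConn column by hand gets tedious with more realservers, so here is a small sketch that does it with awk. The function name sum_inactconn is made up, and the sample input is the ipvsadm listing shown earlier in this section rather than live output.

```shell
#!/bin/sh
# sum_inactconn: read ipvsadm output on stdin and add up the last column
# (InActConn) of every realserver line, i.e. lines whose first field is "->".
# The column-header line also starts with "->" but its last field is not
# numeric, so it is skipped.
sum_inactconn() {
    awk '$1 == "->" && $NF ~ /^[0-9]+$/ { sum += $NF } END { print sum + 0 }'
}

# Sample taken from the ipvsadm listing earlier in this section.
sum_inactconn <<'EOF'
IP Virtual Server version 0.2.6 (size=4096)
Prot LocalAddress:Port Scheduler Flags
  -> RemoteAddress:Port               Forward Weight ActiveConn InActConn
TCP  lvs2.mack.net:netpipe rr
  -> RS2.mack.net:netpipe             Masq    1      0          127
  -> RS1.mack.net:netpipe             Masq    1      0          127
EOF
```

On a director you would pipe the real listing into the function, e.g. ipvsadm | sum_inactconn.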

kernel      DoS strategy   InActConn   -srcnum     -packets (10k/sec)
SYN cookie
no          secure_tcp     13,400      200,000     200,000
            syn_recv=10
no          none           99,400      500,000     1,000,000
yes         none           70,400      1,000,000   2,000,000

34.26.9. Is the number of InActConn a problem?

edited from Julian

The memory used is 128 bytes/connection and 60k connections will tie up 7M of memory. LVS does not use system sockets. LVS has its own connection table. The limit is the amount of memory you have - virtually unlimited. The masq table (by default 40960 connections per protocol) is a separate table and is used only for LVS/NAT FTP or for normal MASQ connections.

However the director was quite busy during the testlvs test. Attempts to connect to other LVS'ed services (not shown in the above ipvsadm table) failed. Netpipe tests run at the same time from the client's IP (in the 192.168.1.0/24 network) stopped, but resumed at the expected rate after the testlvs run completed (i.e. before the InActConn count dropped to 0).

34.26.10. port starved machines

Matthijs van der Klip matthijs (dot) van (dot) der (dot) klip (at) nos (dot) nl 10 Nov 2001

used a fast (Origin 200) single client to generate between 3000 and 3500 hits/connections per second to his LVS'ed web cluster. No matter how many/few realservers in the cluster, he could only get 65k connections.

Julian

You are missing one reason for this problem: the fact that your client(s) create connections from a limited number of addresses and ports. Try to work out from how many different client saddr/sport pairs you hit the LVS cluster. IMO, you reach this limit. I'm not sure how many test client hosts you are using. If there is only one client host then there is a limit of 65536 TCP ports per src IP addr. Each connection has an expiration time according to its proto state. When the rate is high enough not to allow the old entries to expire, you reach a situation where the connections are reused, i.e. the connection number shown by ipvsadm -L does not increase.
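How quickly a single client wraps around its port range follows directly from the numbers in this exchange. The sketch below assumes one source address with all 65536 ports usable (in practice fewer are available); seconds_to_wrap is a hypothetical helper.

```shell
#!/bin/sh
# seconds_to_wrap PORTS RATE_PER_SEC
# Whole seconds until a client opening RATE_PER_SEC connections/sec
# from one address has used PORTS source ports. (hypothetical helper)
seconds_to_wrap() {
    echo $(( $1 / $2 ))
}

for rate in 3000 3500; do
    echo "$rate conn/sec: ports reused after $(seconds_to_wrap 65536 "$rate") seconds"
done
```

If that time is shorter than the expiry time of the connection entries on the director, the client starts reusing saddr/sport pairs before the old entries have gone, and the count in ipvsadm stops growing at around 65k.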

34.27. Debugging LVS

34.27.1. new way

echo x > /proc/sys/net/ipv4/vs/debug_level
where 0<x<9

34.27.2. old way (may still work - haven't tested it)

Is there any way to debug/watch the path between the director and the realserver?

Wensong

below the entry

CONFIG_IP_MASQUERADE_VS_WLC=m

in /usr/src/linux/.config, add the line

CONFIG_IP_VS_DEBUG=y

This switch affects ip_vs.h and ip_vs.c. make clean in /usr/src/linux/net/ipv4 and rebuild the kernel and modules.

(other switches you will find in the code are IP_VS_ERR IP_VS_DBG IP_VS_INFO )

Look in syslog/messages for the output. The actual location of the output is determined by /etc/syslog.conf. For instance

kern.*                                          /usr/adm/kern

sends kernel messages to /usr/adm/kern (re-HUP syslogd if you change /etc/syslog.conf). Here's the output when LVS is first setup with ipvsadm

$ tail /var/adm/kern

Nov 13 17:26:52 grumpy kernel: IP_VS: RR scheduling module loaded.

( Note CONFIG_IP_VS_DEBUG is not a debug level output, so you don't need to add

*.=debug                                        /usr/adm/debug

to your syslog.conf file )

Finally check whether packets are forwarded successfully through direct routing. (also you can use tcpdump to watch packets between machines.)

Ratz ratz (at) tac (dot) ch

In recent LVS versions, extensive debugging can be enabled, either to get more information about exactly what is going on or to help you understand the process of packet handling within the director's kernel. Be sure to have compiled in debug support for LVS (CONFIG_IP_VS_DEBUG=y in .config).

You can enable debugging by setting:

echo $DEBUG_LEVEL > /proc/sys/net/ipv4/vs/debug_level

where DEBUG_LEVEL is between 0 and 10.

Then do a tail -f /var/log/kernlog and watch the output flying by while connecting to the VIP from a CIP.

If you want to disable debug messages in kernlog do:

echo 0 > /proc/sys/net/ipv4/vs/debug_level

If you run tcpdump on the director and see a lot of packets with the same ISN and only SYN and RST, then either

  • you haven't handled the Arp Problem (most likely)

  • you're trying to connect directly to the VIP from within the cluster itself

34.28. realserver content: filesystem or database? (the many reader, single writer problem)

The client can be assigned to any realserver. One of the assumptions of LVS is that all realservers have the same content. This assumption is easy to fulfill for services like http, where the administrator updates the files on all realservers when needed. For services like mail or databases, the client writes to storage on one realserver. The other realservers do not see the updates unless something intervenes. Various tricks are described elsewhere here for mailservers and databases. These require the realservers to write to common storage (for mail the mailspool is nfs mounted; for databases, the LVS client connects to a database client on each realserver and these database clients write to a single databased on a backend machine, or the databased's on each realserver are capable of replicating).

One solution is to have a file system which can propagate changes to other realservers. We have mentioned gfs and coda in several places in this HOWTO as holding out hope for the future. People now have these working.

Wensong Zhang wensong (at) gnuchina (dot) org 05 May 2001

It seems to me that Coda is becoming quite stable. I have run coda-5.3.13 with the root volume replicated on two coda file servers for nearly two months, and I haven't met any problem which needed manual maintenance until now. BTW, I just use it for testing purposes; it is not running on a production site.

Mark Hlawatschek hlawatschek (at) atix (dot) de 2001-05-04

we've had good experiences with the use of GFS. We've used LVS with the GFS for about one year in older versions and it worked quite stably. We successfully demonstrated the solution with a newer version of GFS (4.0) at the CEBit 2001. Several domains (i.e. http://www.atix.org) will be served by the new configuration next week.

Mark's slides from his talk in German at DECUS in Berlin (2001) are available.

K Kopper karl_kopper (at) yahoo (dot) com 6 Jun 2006

To share files on the real servers and ensure that all real servers see the same changes at the same time, a good NAS box or even a Linux NFS server built on top of a SAN (using Heartbeat to fail over the NFS server service and the IP address the real servers use to access it) works great. If you run "legacy" applications that perform POSIX-compliant locking, you can use the instructions at http://linux-ha.org/HaNFS to build your own HA NFS solution with two NFS server boxes and a SAN (only one NFS server can mount the SAN disks at a time, but at failover time the backup server simply mounts the SAN disks and fails over the statd locking information). Of course, purchasing a good HA NAS device has other benefits, like non-volatile memory cache commits for faster write speed.

If you are building an application from scratch then your best bet is probably to store data using a database and not the file system. The database can be made highly available behind the real servers on a Heartbeat pair (again with SAN disks wired up to both machines in the HA pair, but only one server mounting the SAN disks where the database resides at a time). Heartbeat comes with a Filesystem script that helps with this failover job. If your applications store state/session information in SQL and can query back into the database at each request (a cookie, login id, etc.) then you will have a cluster that can tolerate the failure of a real server without losing session information--hopefully just a reload click on the web browser for all but the worst cases (like "in flight" transactions).

With either of these solutions your applications do not have to be made cluster-aware. If you are developing something from scratch you could try something like Zope Enterprise Objects (ZEO) for Python, or in Java (JBOSS) there is JGroups to multicast information to all Java containers/threads, but then you'll have to re-solve the locking problem (something NFS and SQL have a long track record of doing safely). But you were just asking about file systems and I got off topic . . .
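
Karl's suggestion of keeping state/session information in SQL, and looking it up at each request from a cookie, can be illustrated with a minimal sketch. This is not code from the thread. It assumes Python, and it uses sqlite3 (a file-based database) purely as a stand-in for the highly available database behind the realservers. The table name sessions and the functions connect, save_session, load_session and demo are hypothetical names made up for the illustration.

```python
import json
import os
import sqlite3
import tempfile
import uuid


def connect(db_path):
    """Open a connection to the shared session store.

    Assumption: sqlite3 stands in for the networked HA database; in a
    real LVS each realserver would connect to the backend database.
    """
    conn = sqlite3.connect(db_path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS sessions ("
        "session_id TEXT PRIMARY KEY, data TEXT NOT NULL)"
    )
    conn.commit()
    return conn


def save_session(conn, session_id, data):
    """Store the session state under the id carried in the client's cookie."""
    conn.execute(
        "INSERT OR REPLACE INTO sessions (session_id, data) VALUES (?, ?)",
        (session_id, json.dumps(data)),
    )
    conn.commit()


def load_session(conn, session_id):
    """Return the stored session state, or None if the id is unknown."""
    row = conn.execute(
        "SELECT data FROM sessions WHERE session_id = ?", (session_id,)
    ).fetchone()
    return json.loads(row[0]) if row else None


def demo():
    """Two connections play the part of two realservers sharing one database."""
    db_path = os.path.join(tempfile.mkdtemp(), "sessions.db")
    rs1 = connect(db_path)
    rs2 = connect(db_path)

    cookie = uuid.uuid4().hex  # session id handed to the client
    save_session(rs1, cookie, {"user": "alice", "cart": ["widget"]})

    rs1.close()  # the first "realserver" goes away
    return load_session(rs2, cookie)  # the next request lands on the other one


if __name__ == "__main__":
    print(demo())
```

Because neither connection keeps session state of its own, the director can send the client's next request to any surviving realserver. Only work that was not yet committed to the database (the "in flight" case above) is lost.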

Brad Dameron brad (at) seatab (dot) com 7 Jun 2006

I am using RedHat GFS with my SAN to do file sharing. Works great.

Note
It's not clear what the following poster is doing. He may just be smb exporting a filesystem to the realserver. Here's more info on LVS'ing samba.

Kai Suchomel1 KAISUCH (at) de (dot) ibm (dot) com 12 Jun 2006

The Samba service is responsible for sharing a SAN filesystem, in this case GPFS. This filesystem is shared among all the Samba services on the RSs, so when the client connects to the VIP to reach the SAN filesystem, it is transparent to the client which RS the connection is established on. When an RS fails, the client can, after reconnecting, access the SAN filesystem through another RS. I am trying to implement HA for Samba filesharing.

Joe

What happens to the state stored on the domain server (or whatever the LVS appears to be to the windows clients), when the RS goes down? Are you copying files between the LVS and the windows clients? What are your windows clients using the LVS for? So you have a single SAN exporting files to multiple realservers? Why do you do this? Is this faster than having the SAN export the files directly? Or are the realservers doing something else as well? (or doesn't the SAN export files to windows machines?)

Kai

The realservers are responsible for the filesystem; here GPFS is used. GPFS is IBM's cluster filesystem.

Joe: I guess we'll hear more later.

34.29. Development: Supporting IPSec on LVS

See Julian's notes on developing code for IPSec over LVS.

Farid Sawari has IPSec working with 2.4 and 2.6 LVS-NAT.