33. LVS: Details of LVS operation, Security, DoS

33.1. Top 20 security vulnerabilities

See list of top 10 Windows and top 10 Unix vulnerabilities

33.2. Top 75 security tools from the people at nmap

See Top 75 Security Tools survey of May 2003, by polling the nmap mailing list.

33.3. Network Testing with Aberrant Packets

Note
This is not exactly DoS, but is from a thread on another mailing list.

Jeff The Riffer riffer (at) vaxer (dot) net 27 Feb 2007

We used several tools to generate aberrant behavior, rather than packet replays. One was Core Impact, which actually exploits known holes and installs agents. It can do TCP evasion techniques to a limited extent. For aberrant behavior, we found a nifty little open-source tool called isic, which lets you generate all sorts of abnormal traffic:

http://www.mirrors.wiretapped.net/security/packet-construction/isic/

It has binaries to generate abnormal ethernet, UDP, TCP, IP, and ICMP traffic. You can control percentages of the different abnormalities as well as volume of traffic. It's VERY noisy and aggressive stuff, but great for seeing if you can bring down a system. You can also use it to generate a packet storm while trying to sneak in through a more mundane attack and trick your IDS/IPS by that route. We had problems getting it compiled, but someone was able to find a Debian package for it. The Debian package was converted to RPM using Alien and the RPM worked great under SuSE 10.0.

Other than that we just used NMap and Nessus to generate varying levels of traffic and alerts. Isic was very useful for us...

NMap/Nessus are for testing how good an IPS is at detecting scanning. We were doing a very comprehensive test so we made no assumptions about capabilities of the products. NMap and Nessus by themselves would not be sufficient.

The problem with replaying actual packet captures to test an IPS is that the replay will be for whatever IP addresses were in play when the capture was made, so that won't really work either. You can muck around with the .cap file and change the IPs and MAC addresses but it's an iffy solution.

Core Impact is really great. But it's commercial and expensive, so most folks aren't going to have it. But, Metasploit is free and can do many of the same things. Just not as easily.

33.4. Do I need security, really?

Malcolm Turnbull

Assuming that you have an LVS loadbalancer running on a Linux box, and this box is behind a firewall so that only ports 80 and 443 are allowed from clients, do you really need to harden the loadbalancer firewall rules? What about SYN cookies?

Ratz 01 Oct 2002

Yes, always. DROP ALL, accept TCP 80/443 only. Especially if the packet filter in front and the LVS are running the same OS :)

Nothing can prevent SYN flooding; you can only live better with it when you have SYN cookies enabled. With a wrongly set backlog queue size you still face a big penalty with SYN/RST attacks. See syncookies.

Roberto Nibali ratz (at) tac (dot) ch 03 May 2001

It doesn't matter whether you're running an e-gov site or your mom's homepage. You have to secure it anyway, because the webserver is not the only machine on a net. A breach of the webserver will lead to a breach of the other systems too.

Joe Sep 2005

As Marcus Ranum says (http://www.ranum.com/security/computer_security/editorials/dumb/) "Worms aren't smart enough to realize that your web site/home network isn't interesting".

Joe 01 Oct 2002

Yes Virginia, you need security.

There's the technical level. Can an intruder, who gets beyond the firewall, do any damage after getting access to the director or the realservers? If so, do you care? Maybe you do, maybe you don't - it will depend on what you have on those machines - if it's only publicly available (readonly) webpages, you're less concerned than if you have customer business information on it.

Are there adjacent machines on the network, that have more sensitive data than yours, that could be attacked from your compromised machines? You don't want to be an intermediate site to an attack on an expensive setup in the next rack.

You may think that with a hardened front end, backend machines need less protection. However, a new exploit we haven't thought about may hop through a firewall and land on one of the less protected machines behind it. You should think about the damage that could occur if the attacker gained root access to any of your machines. LVS-DR is easier to protect in this case, as packets from the attacker on a realserver will be coming from the RIP, while packets from the LVS'ed service will be coming from the VIP. (If the realservers are in a 3-Tier LVS, then only the packets from the RIP on the realserver to the external 3-Tier services should have routes.) There should be no default gw on the realservers for packets from the RIPs. Packets from the RIPs to 0:0 should be dropped (and logged). The only allowed packets from the RIPs are those needed for internal networking between the machines in the LVS (e.g. local mailing of logs, updates to files). These packets will have dst_addr as another machine on the RIP network. On the director, there should be no default gw for packets from the VIP (see routing for LVS-DR). The only way an attacker with root can send packets to the outside world is by changing the routing tables (which you should be able to pick up).

But security is more than a technical thing. How will your customers react if their website goes down, gets defaced or has credit card info stolen? You're going to have to explain that your actions were diligent and the break-in was beyond anything that you could be expected to handle. You then have to mollify them and make sure you keep the account. You'll also have to explain to potential customers why that humongous break-in that made it to the front page of the NY Times for a week, doesn't reflect poorly on you. If these people are non-technical (as most people with the money are) this may be difficult.

It is just easier to make sure there never is a break-in. Of course there's no end to the things you can do in this department, so somewhere you'll have to decide what you're prepared to deal with upfront at the keyboard and what you're prepared to deal with at the backend, after a break-in, face-to-face with an unhappy customer, who is simultaneously dealing with his own unhappy customers. To your client, you aren't a genius with a keyboard who understands computers. Nope - you're the security guard they've hired to look after a warehouse of their widgets and if you let someone get off with them, they're not going to be terribly interested in your explanation of why or how it happened.

I'd say the minimum for a production machine, exposed to the internet, is a set of rules on each machine (director and realservers) that only allows the packets needed for the LVS (by port, IP, proto) and drops the rest. Every packet to and from a machine must be inspected by a filter rule. Every rejected packet must be logged (at least till you find out where they're coming from). Routing should be designed to allow out, only the packets you want to go out (outgoing packets are filtered by port and IP).
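The "allow only what the LVS needs, drop and log the rest" policy can be sketched as a default-deny match. This is only an illustration of the logic, not a replacement for real ipchains/iptables rules; the `Packet` and `Rule` types, the `filter_packet` function, and the VIP address are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Packet:
    proto: str      # e.g. "tcp", "udp", "icmp"
    dst_ip: str
    dst_port: int

@dataclass(frozen=True)
class Rule:
    proto: str
    dst_ip: str
    dst_port: int

# Hypothetical policy for a director serving http/https on one VIP.
VIP = "192.0.2.10"  # documentation address, an assumption for this sketch
ALLOW = (Rule("tcp", VIP, 80), Rule("tcp", VIP, 443))

def filter_packet(pkt, allow=ALLOW, log=None):
    """Return "ACCEPT" if some rule matches by proto, IP and port.

    Anything else is appended to `log` (if given) and dropped, so the
    default is deny and every rejected packet leaves a record.
    """
    for rule in allow:
        if (pkt.proto, pkt.dst_ip, pkt.dst_port) == (rule.proto, rule.dst_ip, rule.dst_port):
            return "ACCEPT"
    if log is not None:
        log.append(pkt)
    return "DROP"
```

The point of the ordering is that a packet nobody thought about falls through to the logged drop, rather than being accepted because no rule forbade it.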

If you're being bombarded with malicious traffic (spam, DoS), tcpdump is not a good diagnostic tool - you will not be able to decipher the deluge. Try snort. Here's an Introduction to Network-Based Intrusion Detection Systems Using Snort.

Limit places where intruders can login (e.g. with xinetd). For maintenance, don't login over the same networks that the LVS traffic flows on (e.g. RIP network, outside network with VIP). For maintenance/admin, only allow connection by ssh through a separate ethernet card on a different set of wires and different network (backed up with filter rules and xinetd), or via the console.

33.5. What to do after a break-in, prevention strategies

In the early 1990's, a break-in was unusual and being a criminal act, some investigative body was notified (e.g. the police). This being a new type of crime, usually the investigators had no idea how to handle the situation. At my work a multi cpu refrigerator sized mail server was compromised and the investigators swooped in and seized the server; not just the disk(s) - the whole server, and wheeled it out the door. We were told that the server would be returned on completion of the investigations (and any trial if suspects were apprehended). On asking them when that might be, we were told that if they could not find any suspects, that the machine would be returned when the case timed out, which would be 8yrs.

This was a big lesson to all involved. The next break-in I was involved in, the machine wouldn't boot. I reformatted the disk and reinstalled the OS from CD and the user's files from backups. When the investigators arrived and asked to inspect the machine, I told them that the disk had been reformatted and offered them the last tape backup. They didn't want it.

Subsequently I attended a talk by a programmer/lawyer/cyber-investigator from Washington DC, who worked for the US govt. He told us that after a break-in we must not touch the machine (it's tampering with evidence) and gave us a list of contacts. At question time, I told him the story about the server being wheeled out the door (which several people in the audience were familiar with), and my reformatting of the disk on the machine I handled, at which he grew visibly angry. I asked him if he expected us to call them if they were going to take our hardware away when they only needed to take a bitwise copy of the disks. After all, the police don't seize your house after a burglary.

In his reply, it was clear that he recognised that such actions by the investigators weren't optimal, but as to what I should do next time, he only offered more standard party line and I decided that next time I wouldn't be calling anyone.

Following a set of Unix/Linux break-ins (Apr 2004), Stanford U put out (link dead Jun 2004 http://securecomputing.stanford.edu/alerts/multiple-unix-6Apr2004.html) "Multiple Unix compromises on campus", describing their problems and offering links to further pages (such as information on rootkits).

Unfortunately the current state of security is that much work is needed to get it, and much of the prevention work seems to be applying patches. This is a lot of work and I can't imagine that it will prevent most break-ins. I personally find tiresome the practice of forced passwd expiration every 3 months on the 30 accounts I have in several administrative fiefdoms. I'm expected to keep them in my head when they are 8 char mixed uc/lc and contain at least 2 numbers. Who are they kidding? I tape them to the edge of my monitor. The article links to Steps for Recovering from a UNIX or NT System Compromise, a CERT/AusCERT paper. It seems to have been written by people who like being on committees and who want you to spend so much time securing your machine that you'll never be able to use it. Unfortunately the people who make decisions about managing computers, who have never dealt with a break-in and know nothing about security, will cover their asses (arses) with a never ending round of patching. The result is that users have to suffer machines being rebooted from under them every 2 weeks (and losing all the user settings), and the SA never does anything useful, but when the inevitable break-in occurs the manager can happily say to some committee "we did everything we could to prevent it".

My only interaction with CERT was not good. One day (mid 1990s) I got vitriolic e-mail from a person announcing that he was one of the top AusCERT security experts, and that my map server (now at AZ_PROJ map server and producing about 10,000 maps/yr) was doing robotic attacks on his network. If I didn't cease and desist immediately, a dire fate would befall me.

Now if there was some problem with my machine and I had come to the notice of CERT, I would expect a letter from CERT saying

Dear Sir,

	It has come to our notice that your machine (IP=xxx.xxx.xxx.xxx)
has been sending these packets (logs enclosed) to machines (n1...nx).
Since these aren't part of the expected packet stream, we're concerned
that there may be some problem. Do you know anything about this?

	This is a routine letter and indicates the beginning
of an investigation into a problem and can be tracked at
url/case_number. 

	Hoping to hear from you.
	Thank you
	Your friendly CERT representative

I assumed I was talking to a crazed idiot.

The next day I got an even more vitriolic e-mail from the same guy, promising me certain internet death if I didn't stop attacking his machines. Somewhere in here, he sent me logs, whose relevance to the problem was not obvious at the time.

Then I got e-mail from a user of my map generator saying that it had stopped working for him and could I help. The map generator produced azimuthal equidistant projection maps in many formats, including an X-client which could popup an X-window map on your screen (there were instructions on setting xhost etc). The user was having problems getting the X-display of my maps going, when previously it had worked. AFAIK, no-one was using the X-client and I had forgotten it was there (everyone was generating gifs). Somehow (IPs?) I connected this user to the AusCERT expert. I told the user what to do and then sent off e-mail to the CERT expert, giving the url of my map generator and telling him to go look at what it did.

This only inflamed the CERT security expert more and shortly thereafter I got e-mail from an even higher level AusCERT uber-expert telling me that I'd been listed as one of the biggest internet nasties of all time and that no-one was ever going to get a packet in or out of my machine ever again.

I explained to the uber-expert what my machine was doing and that he was probably getting X-packets from my server and to go try it out for himself.

There was silence for a couple of days, and then a rash of apologies from both CERT experts. There was no indication that they had learned anything from this or would change their methods next time.

The fact that the top CERT experts in Australia don't know an X-packet from a hole in the wall and are prepared to declare internet death on someone without an investigation (courteous or otherwise), indicates that we shouldn't hold out much hope of CERT saving us from anything. If CERT can save us from CERT, we should be thankful.

33.6. More about syncookies

anon

Humans usually do not establish SYN connections. It is more likely to be Nimda or other worms. If I can determine a threshold of simultaneous SYN connections that Nimda usually creates, I will be able to drop packets from specific source IPs which meet the threshold.

Roberto Nibali ratz (at) tac (dot) ch 06 Aug 2003

Search google using my name and syncookies for more information on why syncookies have no measurable impact on reducing real DoS.

If you can _really_ figure out a metric for mutually exclusive TCP/SYN patterns generated by existing worms and write it down in a mathematical formula which has lower false positive rate than any TCP/QoS "defense" mechanism using stochastic (timed) fairness approach, you need not worry about money anymore. In fact influential people in the Internet business might feel a sudden urge to talk to you! ;)
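For concreteness, the per-source threshold that anon proposes amounts to a counter like the one below. It is a sketch of the idea only, not a working defense: the function name and the input format (one source IP per half-open connection) are assumptions for the example, and Ratz's objection applies in full, since choosing a threshold with an acceptably low false-positive rate is the hard part that this code does nothing to solve.

```python
from collections import Counter

def sources_over_threshold(half_open_sources, threshold):
    """Return the set of source IPs holding more than `threshold`
    half-open (SYN seen, handshake not completed) connections.

    half_open_sources: iterable of source IP strings, one per
    half-open connection currently tracked.
    """
    counts = Counter(half_open_sources)
    return {ip for ip, n in counts.items() if n > threshold}
```

Counting is trivial; deciding what number separates a worm from many legitimate clients sharing one address is not.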

33.7. Can filter rules stop the intruder hopping to other machines?

Malcolm Turnbull malcolm (dot) turnbull (at) crocus (dot) co (dot) uk 14 Feb 2003

Nope, if you're hacked they can just change your firewall rules... One of my clients got hacked and the only way they found out was because the hacker (possibly script kiddy) tried to flush the iptables rules, therefore breaking all of the NAT rules, therefore taking down the web site...

How did he get in: broke into IIS through common bug, installed a trojan, used SSH to get from the web server to the firewall .. etc etc...

Even if you put the LVS behind a firewall (which I prefer) you still need to open port 80... Is it secure? Yes, I think so; hackers tend to concentrate on the application, i.e. apache or IIS, these days as it's much easier.

One other gotcha.. If your fallback server is localhost you are obviously exposing your local apache installation !

Nate Carlson natecars (at) real-time (dot) com

The firewall should be configured so untrusted hosts (e.g. the web server -- any box that isn't the box that people are expected to log in from) can't connect to the SSH port (or any other service) on the firewall.

33.8. Where filter rules act

Joe - iptables (2.4 kernels) has no "iptables -C" to check your rules (at least not yet - one is promised).

Ratz

If you're dealing with netfilter, packets don't travel through all chains anymore. Julian once wrote something about it:

packets coming from outside to the LVS do:

PRE_ROUTING -> LOCAL_IN(LVS in) -> POST_ROUTING

packets leaving the LVS travel:

PRE_ROUTING -> FORWARD(LVS out) -> POST_ROUTING

From the iptables howto: COMPATIBILITY WITH IPCHAINS

This iptables is very similar to ipchains by Rusty Russell. The main difference is that the chains INPUT and OUTPUT are only traversed for packets coming into the local host and originating from the local host respectively. Hence every packet only passes through one of the three chains; previously a forwarded packet would pass through all three.
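The two traversal paths Ratz quotes can be written down as a lookup table, which makes it easy to check whether a rule attached to a given hook would ever see a given class of LVS packet. The hook names are the ones quoted above; the dictionary keys and function names are labels invented for this sketch.

```python
# Hook sequences exactly as quoted from Ratz/Julian above.
# The keys "client_to_lvs" and "lvs_to_client" are hypothetical labels.
LVS_PATHS = {
    "client_to_lvs": ["PRE_ROUTING", "LOCAL_IN", "POST_ROUTING"],
    "lvs_to_client": ["PRE_ROUTING", "FORWARD", "POST_ROUTING"],
}

def hooks_for(direction):
    """Return the ordered list of hooks a packet in `direction` passes."""
    return LVS_PATHS[direction]

def rule_is_seen(hook, direction):
    """True if a filter rule attached to `hook` sees packets in `direction`."""
    return hook in hooks_for(direction)
```

For example, `rule_is_seen("FORWARD", "client_to_lvs")` is false under this table, so a rule placed only on FORWARD would never filter requests arriving for the virtual service.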

Julian

  • 2.4 director:

    Packets coming into the director (out->in):

    • NAT: INPUT -> input routing -> local: LVS/DEMASQ -> input routing -> forwarding -> OUTPUT
    • DR/TUN: INPUT -> input routing -> local: LVS -> output routing -> OUTPUT

    packets leaving the LVS travel (in->out):

    • NAT only: INPUT -> input routing -> FORWARD (-j MASQ) -> LVS/MASQ -> OUTPUT

  • 2.2 director:

    INPUT in 2.2 is similar to PRE_ROUTING in 2.4, i.e. INPUT, OUTPUT and FORWARD are the 2.2 firewall chains

    input routing: ip_route_input()
    output routing: ip_route_output()
    forwarding: ip_forward()
    local: ip_local_deliver()
    

Matthew S. Crocker matthew (at) crocker (dot) com 31 Aug 2001

How do I filter LVS? Does LVS grab the packets before or after iptables?

Julian

LVS is designed to work after any kind of firewall rules. So, you can put your ipchains/iptables rules safely. If you are using iptables put them on LOCAL_IN, not on FORWARD. The LVS packets do not go through FORWARD.

Note

Although LVS is compatible with any kind of filter rule (i.e. ipchains, iptables), it has incompatibilities with netfilter, i.e. you may not be able to have your firewall on the director. For more info see Running a firewall on the director.

Joe

If you are being attacked, it might be better to filter upstream (e.g. the router or your ISP), to prevent the LAN from being flooded.

33.9. /proc filesystem flags for ipv4, e.g. rp_filter

You could wind up flipping a lot of these flags. Explanations are available in the obscure section of the Adv Routing HOWTO. In particular rp_filter and log_martians are used in julians_martian_modification. For more information on rp_filter see Reverse Path Filtering.

33.10. tcp timeout values, don't change them (at least yet)

The tcp timeout values have their values for good reason (even if you don't know what they are), and realservers operating as an LVS must appear as normal tcp servers to the clients.

Wayne, 19 Oct 2001

I have a question about the "IP_MASQ_S_FIN_TIMEOUT" values in "net/ipv4/ip_masq.c" for the 2.2.x kernel. What purpose is served by having the terminated masqueraded TCP connection entries remain in memory for the default timeout of 2 minutes? Why isn't the entry freed immediately?

Julian Anastasov ja (at) ssi (dot) bg 20 Oct 2001

Because the TCP connection is full-duplex. The internal-end sends FIN and waits for the FIN from external host. Then TIME_WAIT is entered.

Perhaps what I'm really asking is why there is an mFW state at all.

[IP_MASQ_S_FIN_WAIT] = 2*60*HZ,
/* OUTPUT */
/* mNO, mES, mSS, mSR, mFW, mTW, mCL, mCW, mLA, mLI */
/*syn*/ {{mSS, mES, mSS, mSR, mSS, mSS, mSS, mSS, mSS, mLI }},
/*fin*/ {{mTW, mFW, mSS, mTW, mFW, mTW, mCL, mTW, mLA, mLI }},
/*ack*/ {{mES, mES, mSS, mES, mFW, mTW, mCL, mCW, mLA, mES }},
/*rst*/ {{mCL, mCL, mSS, mCL, mCL, mTW, mCL, mCL, mCL, mCL }},
};
/mFW

This state has a timeout corresponding to the similar state in the internal end. The remote end is still sending data while the internal side is in FIN_WAIT state (after a shutdown). The remote end can claim that it is still in established state, not seeing the FIN packet from the internal side. In any case, the remote end has 2 minutes to reply. It can even work for a longer time if each packet follows within these two minutes, not allowing the timer to expire. It depends on what state the internal end is in, FIN_WAIT1 or FIN_WAIT2. Maybe the socket in the internal end is already closed.
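The OUTPUT rows quoted from ip_masq.c above can be read as a lookup: the column is the current state, the row is the TCP flag seen, and the cell is the next state. The sketch below transcribes only those four quoted rows into Python to show how an entry gets into and stays in mFW (the FIN_WAIT state). The function name is made up for the example, and the kernel's other tables (for the other packet directions) are not reproduced here.

```python
# Column order, as given in the comment line of the quoted fragment.
STATES = ["mNO", "mES", "mSS", "mSR", "mFW", "mTW", "mCL", "mCW", "mLA", "mLI"]

# The four OUTPUT rows, transcribed from the quoted fragment.
OUTPUT_TABLE = {
    "syn": ["mSS", "mES", "mSS", "mSR", "mSS", "mSS", "mSS", "mSS", "mSS", "mLI"],
    "fin": ["mTW", "mFW", "mSS", "mTW", "mFW", "mTW", "mCL", "mTW", "mLA", "mLI"],
    "ack": ["mES", "mES", "mSS", "mES", "mFW", "mTW", "mCL", "mCW", "mLA", "mES"],
    "rst": ["mCL", "mCL", "mSS", "mCL", "mCL", "mTW", "mCL", "mCL", "mCL", "mCL"],
}

def next_state(current, flag):
    """Look up the state that follows `current` when a packet with
    `flag` ("syn", "fin", "ack" or "rst") is seen in the OUTPUT direction."""
    return OUTPUT_TABLE[flag][STATES.index(current)]
```

Reading the table this way, a FIN seen while in mES moves the entry to mFW, and further ACKs or FINs in this direction leave it in mFW; this is the state whose 2 minute timer is being discussed.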

The only thing I can think of is if the other end of the TCP connection spontaneously issues a half close before the initiator sends his half close. Then it might be desirable to wait a while for the initiator to send his half close prior to disposing of the connection totally. What would be the consequences of using ipchains -M -S to set this value to, say, 1 second?

In any case, timeout values lower than those in the internal hosts are not recommended. If we drop the entry in LVS, the internal end still can retransmit its FIN packet. And the remote end has two minutes to flush its sending queue and to ack the FIN. IMO, you should claim that the timer in FIN_WAIT state should not be restarted on packets coming from remote end. Anything else is not good because it can drop valid packets that fit into the normal FIN_WAIT time range.

Jaroslav Libak 28 Nov 2006

When I click refresh in firefox several times while viewing load balanced page, I get a FIN_WAIT connection for every refresh. So I set tcpfin parameter using ipvsadm to 15 seconds to get rid of them fast (is this ok btw?, it was like 2 minutes before which I think is way too long). What is worse, I get "established" connection on the backup director (running the syncd) for every refresh. I have read this is due to a simplification in the synchronization code. I'm using hash table size 2^20 (which doesn't limit the maximum number of values in it, it just sets the number of rows, then each row has a linked list). Doesn't it cause some slowdown in the LVS?

Horms 29 Nov 2006

There has long been a plan to allow the timeout values to be manipulated from user space. I think it actually was possible using /proc at some stage, but the code was removed for various (good) reasons. Then there was a plan to implement the feature by extending the sysctl interface. I suspect that this, or using sysfs, is currently the preferred option of the upstream kernel guys.

A really worthwhile contribution to LVS would be to complete this code. I can find out from the upstream people what their preferred option for implementing this is if you are interested in having a crack at it. I don't imagine the code will be that hard.

I understand that your concern is memory pressure on the slave in the case of a DoS attack. And it is true that the simplification in the synchronisation protocol can exacerbate that problem. However, by doing it this way the synchronisation traffic is actually reduced, including in the case of a DoS attack. So expanding it may actually just move the problem elsewhere.

Keeping in mind that a connection entry is in the vicinity of 128 bytes, it is my opinion that unless you have an extremely small amount of memory available on the system to start with, DoSing the machine in this way is quite hard. I did try once, DoSing a box from itself, and basically the default timeouts were easily able to keep up with the DoS, and I think the total memory used never exceeded a few hundred Mb.

I would be very surprised if increasing the value would cause a slowdown, it does however increase the memory required for the array that forms the base of the hash - at 2^20 you are looking at order 2^20 = 1Mb for the size of that array. For larger values, like 32 (=4Gb), this starts to become ridiculous. Decreasing it can, in theory, cause a slowdown if you have a lot of connections. But in practice I don't think it does unless you make it very small.

In short, 20 should be fine, though you can probably get the same performance with 16. 10 is probably a bit too small.
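The trade-off Horms describes (more bits means a bigger base array, fewer bits means longer chains per bucket) can be put into numbers. This is a back-of-the-envelope sketch: the function name is invented, the connection count is illustrative, and `bytes_per_bucket` defaults to 1 only to match Horms's "order of" figure; the real per-bucket size depends on the kernel and is an assumption you should replace.

```python
def table_stats(bits, connections, bytes_per_bucket=1):
    """Rough sizing for a hash table with 2**bits buckets.

    bytes_per_bucket is an assumption (see lead-in); avg_chain is the
    mean number of connection entries per bucket if they spread evenly.
    """
    buckets = 2 ** bits
    return {
        "buckets": buckets,
        "array_bytes": buckets * bytes_per_bucket,
        "avg_chain": connections / buckets,
    }

# Illustrative load of one million tracked connections.
for bits in (10, 16, 20):
    stats = table_stats(bits, connections=1_000_000)
    print(bits, stats["buckets"], round(stats["avg_chain"], 2))
```

With a million connections, 10 bits leaves on the order of a thousand entries to walk per lookup, while 16 and 20 bits keep the chains short, which is consistent with the advice that 10 is probably too small and 16 probably performs like 20.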

33.11. /proc file system settings for LVS: security and private copies of tcp timeouts for LVS connections (you can change these)

In LVS-DR, the director only sees the packets from the client going to the realserver, but not the replies. After seeing a CLOSE, the director puts the connection into InActConn and uses its value of TIME_WAIT before assuming that the connection has dropped. (In fact the director has no idea of the connection state of the realserver, but these assumptions seem to work OK). In the earlier versions of LVS, the director uses the standard tcpip timeouts for its estimates of the connection state of the realserver. In the newer versions of LVS (somewhere in 2.4.x), you can fiddle with a set of private copies of the timeout values which ip_vs uses for LVS connection tracking.

As well, other ip_vs parameters (e.g. for security) can be altered in /proc.

Roberto Nibali ratz (at) tac (dot) ch 03 May 2001

The load balancer is basically only as secure as Linux itself is. ipchains settings don't affect LVS functionality (unless by mistake you use the same mark for ipchains and ipvsadm). LVS itself has some built-in security, mainly to try to secure the realserver in case of a DoS attack. There are several parameters you might want to set in the proc-fs.

Note

Ratz 10 Aug 2004

Those values below were used as kind of a defense mechanism in the ancient days. I believe these are to be replaced by the same parameters exported through the ip_conntrack module. Load ip_conntrack and walk the /proc/sys/net/ipv4/netfilter tree.

  • /proc/sys/net/ipv4/vs/amemthresh
  • /proc/sys/net/ipv4/vs/am_droprate
  • /proc/sys/net/ipv4/vs/drop_entry
  • /proc/sys/net/ipv4/vs/drop_packet
  • /proc/sys/net/ipv4/vs/secure_tcp
  • /proc/sys/net/ipv4/vs/debug_level

    With this you select the debug level (0: no debug output, >0: debug output in kernlog, the higher the number the higher the verbosity)

    The following are timeout settings. For more information see TCP/IP Illustrated Vol. I, R. Stevens.

  • /proc/sys/net/ipv4/vs/timeout_close - CLOSE
  • /proc/sys/net/ipv4/vs/timeout_closewait - CLOSE_WAIT
  • /proc/sys/net/ipv4/vs/timeout_established - ESTABLISHED
  • /proc/sys/net/ipv4/vs/timeout_finwait - FIN_WAIT
  • /proc/sys/net/ipv4/vs/timeout_icmp - ICMP
  • /proc/sys/net/ipv4/vs/timeout_lastack - LAST_ACK
  • /proc/sys/net/ipv4/vs/timeout_listen - LISTEN
  • /proc/sys/net/ipv4/vs/timeout_synack - SYN_ACK
  • /proc/sys/net/ipv4/vs/timeout_synrecv - SYN_RECEIVED
  • /proc/sys/net/ipv4/vs/timeout_synsent - SYN_SENT
  • /proc/sys/net/ipv4/vs/timeout_timewait - TIME_WAIT
  • /proc/sys/net/ipv4/vs/timeout_udp - UDP
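To see which of the timeout entries listed above actually exist on a given director, and what they are set to, something like the following can be used. It is a minimal sketch: the function name is invented, it only reads (never writes), and it assumes each file holds a single integer. On kernels where the timeout_* files are absent it simply returns an empty dictionary.

```python
from pathlib import Path

def read_vs_timeouts(base="/proc/sys/net/ipv4/vs"):
    """Return {filename: integer value} for every timeout_* file under
    `base`. Returns {} if the directory or the files do not exist."""
    result = {}
    base_path = Path(base)
    if not base_path.is_dir():
        return result
    for entry in sorted(base_path.glob("timeout_*")):
        try:
            result[entry.name] = int(entry.read_text().strip())
        except (OSError, ValueError):
            # Unreadable or non-numeric entry: skip it rather than guess.
            continue
    return result

if __name__ == "__main__":
    for name, value in read_vs_timeouts().items():
        print(name, value)
```

Taking the directory as a parameter keeps the function usable against a saved copy of the tree as well as a live /proc.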

You don't want your director replying to pings from the outside world.

With the FIN timeout being about 1 min (2.2.x kernels), if most of your connections are non-persistent http (only taking 1 sec or so), most of your connections will be in the InActConn state.
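The reason InActConn dominates follows from simple steady-state arithmetic: the number of entries in a state is roughly the arrival rate times the time each connection spends in that state. The sketch below assumes a steady arrival rate and that every connection spends `active_secs` as active and then `fin_timeout_secs` as inactive; the rate of 100 connections/sec is purely illustrative.

```python
def steady_state_entries(conn_per_sec, active_secs, fin_timeout_secs):
    """Estimate (ActiveConn, InActConn) at steady state as
    rate * time-in-state for each of the two states."""
    active = conn_per_sec * active_secs
    inactive = conn_per_sec * fin_timeout_secs
    return active, inactive

# 1 sec http fetches with a 60 sec FIN timeout, at an assumed 100 conn/sec.
active, inactive = steady_state_entries(100, 1, 60)
print(active, inactive)
```

Whatever the arrival rate, the ratio of inactive to active entries under these assumptions is just the ratio of the two times, here 60 to 1.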

unknown

Will the info from loading ip_conntrack and walking the /proc/sys/net/ipv4/netfilter tree be used along with the secure_tcp defense strategy, as described in LVS DoS defense strategy (http://www.linux-vs.org/docs/defense.html), to replace the timeouts mentioned?

Ratz 12 Aug 2004

I don't know (I've been out of the development loop for about a year) but I rather think not since they look kind of orthogonal to the existing netfilter timers which only got added about 6 months or so ago. One of the issues in fiddling with those timers is that they influence too much of the rest of the stack.

I also don't think the documentation is up to date anymore; it should be adjusted to reflect the current state of operation. As it stands it only confuses people who don't want to or can't read the kernel code.

If you're interested, check out following path:

net/ipv4/ipvs/ip_vs_ctl.c:ip_vs_sysctl_defense_mode()
net/ipv4/ipvs/ip_vs_ctl.c:update_defense_level()
net/ipv4/ipvs/ip_vs_ctl.c:ip_vs_secure_tcp_set()
net/ipv4/ipvs/ip_vs_conn.c:"set state table, according to proc-fs value"

from there you set the TCP state transition table. If you have the secure_tcp sysctl set, the kernel will be dealing with the vs_tcp_states_dos state transition table, if you have it unset, it will be dealing with the normal vs_tcp_states table.

The related timers for the state transitions are vs_timeout_table{_dos}. In former days you could influence those timers via proc-fs. Nowadays we seem to switch to the *_dos timer model under attack according to the comment in the code. But this is not correct. It should read that as soon as the sysctl for tcp_defense is set, we will also be using the *_dos table timers along with the vs_tcp_states_dos state transition table.

Conclusion: The disabled proc-fs values have been replaced by a static hardcoded mapping of the timers for tcp_defense. I could imagine that not a lot of people really used to tweak those parameters anyway.

Hendrik Thiel, 20 Mar 2001

we are using a lvs in NAT Mode and everything works fine ... Probably, the only Problem seems to be the huge number of (idle) Connection Entries.

ipvsadm shows a lot of InActConn (more than 10000 entries per Realserver) entries. ipchains -M -L -n shows that these connections last 2 minutes. Is it possible to reduce the time to keep the Masquerading Table small? e.g. 10 seconds ...

Joe

For 2.2 kernels, you can use netstat -M instead of ipchains -M -L. For 2.4.x kernels use cat /proc/net/ip_conntrack.

Julian

One entry occupies 128 bytes. 10k entries mean 1.28MB memory. This is not a lot of memory and may not be a problem.
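Julian's estimate is easy to rerun for other table sizes. The helper below uses the 128 bytes per entry quoted above and decimal megabytes, as he does; the function name is invented for the example.

```python
ENTRY_BYTES = 128  # per-entry size quoted by Julian above

def table_memory_mb(entries, entry_bytes=ENTRY_BYTES):
    """Memory used by `entries` connection entries, in MB (10**6 bytes)."""
    return entries * entry_bytes / 1_000_000

print(table_memory_mb(10_000))
```

Even with 10k entries on each of several realservers, the total stays in the range of a few MB, which is the basis for saying it may not be a problem.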

For 2.2, to reduce the number of entries in the ipchains table, you can reduce the timeout values. You can edit the TIME_WAIT, FIN_WAIT values in ip_masq.c, or enable the secure_tcp strategy and alter the proc values there. FIN_WAIT can also be changed with ipchains.

Note

It is not a good idea to change the tcpip timeouts (particularly to save 1M).

With the later versions of ip_vs (2.4.x), the director has its own copies of the tcpip timeout values, and you can change them.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 10 May 2004

If you are concerned about the number of InActConn, you can reduce the FIN_WAIT timeout in /proc/sys/net/ipv4/vs/timeout_finwait.

For 2.6.x versions of ip_vs (May 2004), the timeouts have not been implemented yet.

Julian 12 May 2004

IPVS for 2.6 has code to use different timeout tables but we forgot to implement it fully. The intention was to implement per protocol/service/app timeouts by adding some code to libipvs and the kernel. It is not preferred to export so many values via the /proc interface, so now it is disabled until someone decides to implement the above set/get controls. Only the timeout_* values in /proc are disabled, so now they do not exist in 2.6. All other sysctls remain.

33.12. timeouts the same for all services

Alois Treindl alois (at) astro (dot) ch

I have LVS-NAT configured so that ssh VIP connects me to one particular realserver. I would like to keep this ssh connection permanent, (to observe the cluster during its operation). This ssh connection times out with inactivity as expected. Can this be changed, without affecting the timeout values of other LVS services? Alternately I can connect by ssh to each machine without using LVS.

Julian Anastasov ja (at) ssi (dot) bg 12 May 2001

Currently, there are only global timeout values which are not very useful for some boxes with mixed functions. The masquerading, the LVS and its virtual services use same timeout values. The problem is that there are too many timeouts.

The solution would be to separate these timeouts, i.e. per virtual service timeouts, separated from the masquerading. According to the virtual service protocol it can serve the TCP EST and the UDP timeout role. So this can be one value that will be specified by the users. This way the in->out ssh/telnet/whatever connections can use their own timeout (1/10/100 days) and the external clients will use the standard credit of 15 minutes. But maybe it is too late for 2.2 to change this model. Is one user specified timeout value enough?

33.13. Director Connection Hash Table

Note
Because the 2.0.x implementation of ip_vs was in the masquerading code, this table used to be called the "IP masquerading table".
Note
Joe: A regular table has room for N entries, with an index of range N. A hash table is a table that has room for N entries, but stores entries for indices that have a range of M, where M>N. In the case of LVS, the connection hash table must store entries over the whole range of internet IPs, but only has (initially) 4096 (say) entries. Algorithms exist which allow adding and deleting entries in hash tables at speeds comparable to those in regular tables.

from Peter Mueller: a general article on hashing (http://www.citi.umich.edu/projects/linux-scalability/reports/hash.html, site gone Sep 2004)

The director maintains a hash table of connections marked with

<CIP, CPort, VIP, VPort, RIP, RPort>

where

  • CIP: Client IP address
  • CPort: Client Port number
  • VIP: Virtual IP address
  • VPort: Virtual Port number
  • RIP: RealServer IP address
  • RPort: RealServer Port number.

The hash table speeds up the connection lookup and keeps state so that packets belonging to a connection from the client will be sent to the allocated realserver. If you are editing the .config by hand look for CONFIG_IP_MASQUERADE_VS_TAB_BITS.
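As a minimal sketch of the lookup described above (not the kernel's data structure), the table can be pictured as a map from the client side of the tuple to the allocated realserver. The class name ConnTable, its methods and the addresses are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple

ConnKey = Tuple[str, int, str, int]   # (CIP, CPort, VIP, VPort)
RealServer = Tuple[str, int]          # (RIP, RPort)


class ConnTable:
    """Toy stand-in for the director's connection table (illustrative only)."""

    def __init__(self) -> None:
        self._entries: Dict[ConnKey, RealServer] = {}

    def lookup(self, key: ConnKey) -> Optional[RealServer]:
        """Return the realserver already allocated to this connection, if any."""
        return self._entries.get(key)

    def assign(self, key: ConnKey, rs: RealServer) -> RealServer:
        """Record rs for a new connection; keep the old choice for a known one."""
        return self._entries.setdefault(key, rs)


table = ConnTable()
key = ("192.0.2.10", 34567, "203.0.113.1", 80)
first = table.assign(key, ("10.0.0.2", 80))
again = table.assign(key, ("10.0.0.3", 80))  # ignored: connection already allocated
```

Later packets of the same connection find the stored entry, so they keep going to the realserver chosen for the first packet.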

Warning

Do not even think of changing the LVS (hash) table size unless you know a lot more about ip_vs than we do. If you still want to change the hash table size, at least read everything here first.

tao cui

In the output of ipvsadm what does the "size" mean?

IP Virtual Server version 1.0.9 (size=4096)

or

IP Virtual Server version 1.0.9 (size=65536)

Horms 24 Dec 2003

This refers to the number of hash buckets in the IPVS connection table. This is configured at compile time by setting CONFIG_IP_VS_TAB_BITS, the default is 12.

size = 2^CONFIG_IP_VS_TAB_BITS

Thus CONFIG_IP_VS_TAB_BITS = 12 -> size = 4096
     CONFIG_IP_VS_TAB_BITS = 16 -> size = 65536

Note that this is the number of hash buckets, not the maximum number of connections. A bucket can contain zero or more connections. The maximum number of connections is only limited by the memory available.

Janno de Wit

How can I see if the connection table is full? `dmesg` gives no output.

Horms 07 Jan 2005

The connection table cannot become full. It is a hash table and you can continue to add entries until you run out of memory, at which time something very apparent should turn up in dmesg.

Ratz

The original poster actually has got a point :) So what about this:

Note
partial diff shown for brevity - Joe
diff -Nur linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_conn.c linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_conn.c
--- linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_conn.c   2003-11-03 17:26:50.000000000 +0100
+++ linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_conn.c   2003-12-24 09:21:37.000000000 +0100
@@ -1519,7 +1519,7 @@

         IP_VS_INFO("Connection hash table configured "
-                  "(size=%d, memory=%ldKbytes)\n",
+                  "(hash buckets=%d, memory=%ldKbytes)\n",
                    IP_VS_CONN_TAB_SIZE,
                    (long)(IP_VS_CONN_TAB_SIZE*sizeof(struct list_head))/1024);
         IP_VS_DBG(0, "Each connection entry needs %d bytes at least\n",

diff -Nur linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_ctl.c linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_ctl.c
--- linux-2.4.23-preX-orig/net/ipv4/ipvs/ip_vs_ctl.c    2003-11-03 17:26:50.000000000 +0100
+++ linux-2.4.23-preX-ratz/net/ipv4/ipvs/ip_vs_ctl.c    2003-12-24 09:22:47.000000000 +0100
@@ -1488,7 +1488,7 @@
         pos = 192;
         if (pos > offset) {
                 sprintf(temp,
-                       "IP Virtual Server version %d.%d.%d (size=%d)",
+                       "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
                         NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
                 len += sprintf(buf+len, "%-63s\n", temp);
                 len += sprintf(buf+len, "%-63s\n",
@@ -1942,7 +1942,7 @@
         {
                 char buf[64];

-               sprintf(buf, "IP Virtual Server version %d.%d.%d (size=%d)",
+               sprintf(buf, "IP Virtual Server version %d.%d.%d (hash buckets=%d)",
                         NVERSION(IP_VS_VERSION_CODE), IP_VS_CONN_TAB_SIZE);
                 if (*len < strlen(buf)+1) {
                         ret = -EINVAL;

The default LVS hash table size (2^12 entries) originally meant 2^12 simultaneous connections. These early versions of ipvs would crash your machine if you allotted too much memory to this table.

Julian 7 Jun 2001

This was because the resulting bzImage was too big. Users selected a value too big for the hash table and even the empty table (without linked connections) couldn't fit in the available memory.

This problem has been fixed in ipvs>0.9.9, where the connection table is allocated at run time as an array of linked-list heads instead of being compiled into the kernel image.

Note
If you're looking for memory use with "top", it reports memory allocated, not memory you are using. No matter how much memory you have, Linux will eventually allocate all of it as you continue to run the machine and load programs.

Each connection entry takes 128 bytes, 2^12 connections requires 512kbytes.

Note
not all connections are active - some are waiting to timeout.

As of ipvs-0.9.9 the hash table is different.

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

With CONFIG_IP_MASQUERADE_VS_TAB_BITS we specify not the max number of the entries (connections in your case) but the number of the rows in a hash table. This table has columns which are unlimited. You can set your table to 256 rows and have 1,800,000 connections in 7000 columns on average. But the lookup is slower. The lookup function chooses one row using the hash function and starts to search all these 7000 entries for a match. So, by increasing the number of rows we want to speed up the lookup. There is _no_ connection limit. It depends on the free memory. Try to tune the number of rows so that the columns will not exceed 16 (average), for example. It is not fatal if the average is higher; if your CPU is fast enough this is not a problem.

All entries are included in a table with (1 << IP_VS_TAB_BITS) rows and unlimited number of columns. 2^16 rows is enough. Currently, LVS 0.9.7 can eat all your memory for entries (using any number of rows). The memory checks are planned in the next LVS versions (are in 0.9.9?).

Julian 7 Jun 2001

Here is the picture:

the hash table is an array of double-linked list heads, i.e.

struct list_head *ip_vs_conn_tab;

In some versions ago ( < 0.9.9? ) it was a static array, i.e.

struct list_head ip_vs_table[IP_VS_TAB_SIZE];

struct list_head is 8 bytes on 32-bit x86 (d-linked list): the next and prev pointers

In the second variant when IP_VS_TAB_SIZE is selected too high the kernel crashes on boot. Currently (the first variant), vmalloc(IP_VS_TAB_SIZE*sizeof(struct list_head)) is used to allocate the space for the empty hash table for connections. Once the table is created, more memory is allocated only for connections, not for the table itself.

In any case, after boot, before any connections are created, the occupied memory for this empty table is IP_VS_TAB_SIZE*8 bytes. For 20 bits this is (2^20)*8 bytes=8MB. When we start to create connection they are enqueued in one of these 2^20 double-linked lists after evaluating a hash function. In the ideal case you can have one connection per row (a dream), so 2^20 connections. When I'm talking about columns, in this example we have 2^20 rows and average 1 column used.
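The arithmetic above can be checked with a short sketch. It assumes the 8-byte list head quoted above (two 4-byte pointers, as on 32-bit x86); the function name is made up for this example.

```python
LIST_HEAD_BYTES = 8  # assumption from the text: next + prev pointers, 4 bytes each


def empty_table_bytes(tab_bits: int) -> int:
    """Memory used by the empty hash table: one list head per row."""
    rows = 1 << tab_bits
    return rows * LIST_HEAD_BYTES


# Print the empty-table size for a few TAB_BITS values.
for bits in (12, 16, 20):
    print(f"TAB_BITS={bits}: {empty_table_bytes(bits) // 1024} KB")
```

This counts only the empty table. Memory for the connection entries themselves is allocated separately as connections arrive.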

The *TAB_BITS define only the number of rows (the power of 2 is useful to mask the hash function result with IP_VS_TAB_SIZE-1 instead of using the '%' modulo operation). But this is not a limit for the number of connections. When the value is selected by the user, the real number of connections must be considered. For example, if you think your site can accept 1,000,000 simultaneous connections, you have to select such a number of hash rows that will spread all connections in short rows. You can create these 1,000,000 conns with TAB_BITS=1 too but then all these connections will be linked in two rows and the lookup process will take too much time to walk 500,000 entries. This lookup is performed on each received packet.
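A small sketch of why a power-of-2 row count lets the row be picked with a mask: for a non-negative hash value, masking with (rows - 1) gives the same row as the modulo operation. The function names are illustrative, not kernel identifiers.

```python
def row_by_mask(hash_value: int, tab_bits: int) -> int:
    """Pick a row by masking with (table size - 1); needs a power-of-2 size."""
    return hash_value & ((1 << tab_bits) - 1)


def row_by_modulo(hash_value: int, tab_bits: int) -> int:
    """Pick a row with the modulo operator; works for any table size."""
    return hash_value % (1 << tab_bits)


# Compare the two methods over a spread of sample hash values.
sample = range(0, 1 << 20, 977)
same = all(row_by_mask(h, 12) == row_by_modulo(h, 12) for h in sample)
```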

The selection of *TAB_BITS is entirely based on the recommendation to keep the d-linked lists short (less than 20, not 500,000). This will speedup the lookup dramatically.

So, for our example of 1,000,000 we must select table with 1,000,000/20 rows, i.e. 50,000 rows. In our case the min TAB_BITS value is 16 (2^16=65536 >= 50000). If we select 15 bits (32768 rows) we can expect 30 entries in one row (d-linked list) which increases the average time to access these connections.
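The selection rule above can be written as a short sketch: find the smallest number of bits whose row count keeps the average list length at or below a target. The function names and the default target of 20 are taken from this example only, not from the ipvs code.

```python
def min_tab_bits(connections: int, max_avg_chain: int = 20) -> int:
    """Smallest TAB_BITS such that connections / 2**bits <= max_avg_chain."""
    rows_needed = -(-connections // max_avg_chain)  # ceiling division
    bits = 0
    while (1 << bits) < rows_needed:
        bits += 1
    return bits


def avg_chain(connections: int, tab_bits: int) -> float:
    """Average entries per row, assuming the hash spreads entries evenly."""
    return connections / (1 << tab_bits)


bits = min_tab_bits(1_000_000)          # 50,000 rows needed
depth_15 = avg_chain(1_000_000, 15)     # what one bit fewer would give
```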

So, the TAB_BITS selection is a compromise between the memory that the empty table will use and the lookup speed in one table row. They are orthogonal. More rows => more memory => faster access. So, for 1,000,000 entries (which is a real limit for 128MB directors) you don't need more than 16 bits for the conn hash table. And the space occupied by such an empty table is 65536*8=512KBytes. Bits greater than 16 can speed up the lookup more but we waste too much memory. And usually we don't achieve 1,000,000 conns with 128MB directors, since some memory is occupied by other things.

The reason to move to a vmalloc-ed buffer is that a 65536-row table occupies 512KB, and if the table is statically defined in the kernel the boot image is 512KB larger, which is obviously very bad. So, the new definition is a pointer (4 bytes instead of 512KB in the bzImage) to the vmalloc'ed area.

Ratz's code adds limits per service while this sysctl can limit everything. Or it can be additional strategy (oh, another one) vs/lowmem. The semantic can be "Don't allocate memory for new connections when the low memory threshold is reached". It can work for the masquerading connections too (2.2). By this way we will reserve memory for the user space. Very dangerous option, though.

Joe

what's dangerous about it?

One user process can allocate too much memory and to cause the LVS to drop new connections because the lowmem threshold is reached.

May be conn_limit is better or something like this:

if (conn_number > min_conn_limit && free_memory < lowmem_thresh)
         DROP_THIS_PACKET_FOR_NEW_CONN

why have a min_conn_limit in here? If you put more memory into the director, then you'll have to recompile your kernel. Is it because finding conn_number is cheaper than finding free_memory?

:) The above example with real numbers:

if (conn_number > 500000 && free_memory < 10MB) DROP

i.e. don't allow the user processes to use memory that LVS can use. But when there are "enough" LVS connections created we can consider reserving 10MB for the user space and start dropping new connections early, i.e. when there is less than 10MB of free memory. If conn_number <500000 LVS simply will hit the 0MB free memory point and the user space will be hurt because these processes allocated too much memory in this case.
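The proposed rule can be sketched as a predicate. This is only an illustration of the idea discussed in this thread; the function name is hypothetical and the default values are the example numbers quoted above, not settings that exist in ipvs.

```python
MB = 1024 * 1024


def should_drop_new_conn(conn_number: int,
                         free_memory: int,
                         min_conn_limit: int = 500_000,
                         lowmem_thresh: int = 10 * MB) -> bool:
    """Drop a packet that would create a new connection only when LVS already
    holds more than min_conn_limit connections AND free memory is below
    lowmem_thresh."""
    return conn_number > min_conn_limit and free_memory < lowmem_thresh


# Few connections: never drop, even when memory is short (user space suffers).
few_conns_low_mem = should_drop_new_conn(100_000, 1 * MB)
# Many connections and little free memory: start dropping new connections.
many_conns_low_mem = should_drop_new_conn(600_000, 5 * MB)
```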

But obtaining "free_memory" may cost CPU cycles. Maybe we can stick with a snapshot each second.

The number of valid connections shouldn't change dramatically in 1 sec. However a DoS might still cause problems.

Yes, the problem is on SYN attack.

Ratz

max amount of concurrent connections: 3495. We assume having 4 realservers equally load balance, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.

what's the security problem?

SYN/RST flood. My patch will set the weight of the realserver to 0 in case the upper threshold is reached. But I do not test if the requesting traffic is malicious or not, so in case of SYN-flood it may be 99% of the packets causing the server to be taken out of service. In the end we have set all server to weight 0 and the load balancer is non-functional either. But you don't have the memory problem :)

And it hasn't crashed either.

Ratz

I kinda like it but as you said, there is the amem_thresh, my approach (which was not actually done because of this problem :) and now having a lowmem_thresh. I think this will end up in an orthogonal semantic for memory allocation. For example if you enable the amem_thresh the conn_number > min_conn_limit && free_memory <lowmem_thresh would never be the case. OTOH if you set the lowmem_thresh too low the amem_thresh is ineffective. My patch would suffer from this too.

Julian Anastasov ja (at) ssi (dot) bg 08 Jun 2001

lowmem_thresh is not related to amemthresh but when amemthresh <lowmem_thresh the strategies will never be activated. lowmem_thresh should be less than amemthresh. Then the strategies will try to keep the free memory in the lowmem_thresh:amemthresh range instead of the current range 0:amemthresh

Example (I hope you have memory to waste):

lowmem_thresh=16MB (think of it as reserved for user processes and kernel)
amemthresh=32MB (when the defense strategies trigger)
min_conn_limit=500000 (think of it as 60MB reserved for LVS connections)

So, the conn_number can grow far away after min_conn_limit but only while lowmem_thresh is not reached. If conn_number <500000 and free_memory <lowmem_thresh we will wait the OOM killer to help us. So, we have 2 tuning parameters: the desired number of connections and some space reserved for user processes. And may be this is difficult to tune, we don't know how the kernel prevents problems in VM before activating the killer, i.e. swapping, etc. And the cluster software can take some care when allocating memory.

Hayden Myers hayden (at) spinbox (dot) com 18 Mar 2002

There's also some info located in kernel help. I posted it below for convenience.

Using a big ipvs hash table for virtual server will greatly reduce conflicts in the ipvs hash table when there are hundreds of thousands of active connections. Note the table size must be power of 2. The table size will be the value of 2 to the your input number power. For example, the default number is 12, so the table size is 4096. Don't input the number too small, otherwise you will lose performance on it. You can adapt the table size yourself, according to your virtual server application. It is good to set the table size not far less than the number of connections per second multiplying average lasting time of connection in the table. For example, your virtual server gets 200 connections per second, the connection lasts for 200 seconds in average in the masquerading table, the table size should be not far less than 200x200, it is good to set the table size 32768 (2**15).

Another note that each connection occupies 128 bytes effectively and each hash entry uses 8 bytes, so you can estimate how much memory is needed for your box.
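The kernel help text's rule of thumb can be worked through in a few lines. This sketch uses only the figures quoted above (128 bytes per connection, 8 bytes per hash entry); the function name is made up for the example.

```python
CONN_BYTES = 128      # per connection entry, as quoted in the kernel help
BUCKET_BYTES = 8      # per hash entry (row), as quoted in the kernel help


def estimate(conn_rate: int, avg_lifetime_s: int, tab_bits: int):
    """Return (expected simultaneous connections, bytes of memory needed)."""
    conns = conn_rate * avg_lifetime_s
    memory = conns * CONN_BYTES + (1 << tab_bits) * BUCKET_BYTES
    return conns, memory


# 200 connections/s, each held for 200 s, with a 2**15-row table.
conns, memory = estimate(200, 200, 15)
print(conns, "connections,", memory // 1024, "KB")
```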

Ratz: Leave the settings as a general rule.

Some people still want to change the hash table size

Daniel Burke 28 Jun 2002

In anticipation of our capacity requirements growing, we had decided it was necessary to increase the connection table size. The value it was at was 16; based on our calculations we needed to bump it to 26 to handle what we will be throwing at it.

Julian

It is insane to use 26. That means 2^26 * space for 2 pointers. On x86 it takes 512MB just for allocating empty hash table with 2^26 d-linked lists. Refer to the HOWTO for calculating the best hash table size according to the number of connections. You can select the size (POWER) in this way:

2^POWER = AVERAGE_NUM_CONNS/10

The magic value 10 in this case is the average number of conns expected in one d-linked list, the lookup is slower for more conns.

Example:

POWER=16 => 65536 rows => 655360 conns, 10 on each row

Joe - Wensong has stepped in to stop people from doing this anymore.

Wensong

Just added code that limits the input number from 8 to 20, in order to prevent this configuring problem from happening again.

Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. I think Julian and/or Wensong wrote it. Does anyone of you still have that code?

I'd say that the rehashing that would need to take place would consume more CPU cycles than a yet-to-be-proven gain from increasing the bucket size.

Horms horms (at) verge (dot) net (dot) au 17 Feb 2003

Agreed. To expand on this for the benefit of others. The hash table is just that. A hash. Each bucket in the hash can have multiple entries. The implementation is such that each bucket is a linked list of entries.

Exactly. And around 2000 we tuned the hash generation function to have the best balanced distribution over all buckets with a magic prime number which IIRC was pretty exactly the golden ratio if you divided 2^32 through that number.

ratz@zar:~ > echo "(2^32)/2654435761" | bc -l
1.61803399392945414737
ratz@zar:~ >

We took the wisdom from Chuck Lever's paper Linux Kernel Hash Table Behavior: Analysis and Improvements (http://www.citi.umich.edu/techreports/reports/citi-tr-00-1.pdf, site down Sep 2004) So once you have an evenly distributed hash table the search for an entry is almost best case for all entries.
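To illustrate how a constant like this is typically used, here is a sketch of Knuth-style multiplicative hashing: multiply the key by the constant, keep 32 bits, and take the top bits as the bucket number. This is an assumption about the general technique, not a copy of the ip_vs hash function, and mult_hash is a hypothetical name.

```python
GOLDEN_PRIME = 2654435761  # the constant quoted above; 2**32 / it is about 1.618


def mult_hash(key: int, tab_bits: int) -> int:
    """Map a 32-bit key to a bucket using the top tab_bits of key * constant."""
    product = (key * GOLDEN_PRIME) & 0xFFFFFFFF   # keep the low 32 bits
    return product >> (32 - tab_bits)


# Hash 4096 consecutive keys into 256 buckets and count the bucket sizes.
counts = [0] * 256
for key in range(4096):
    counts[mult_hash(key, 8)] += 1
largest_bucket = max(counts)
```

With an even spread each of the 256 buckets would hold 16 of the 4096 keys, so largest_bucket shows how far the worst bucket is from that ideal.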

Thus there is no limit on the number of entries the hash table can hold (other than the amount of memory available). The only advantage of increasing the hash table size is to reduce, statistically speaking, the number of entries in each bucket.

Which gains you almost nothing. The search is still ... left as an exercise to all CS students here :)

And thus the amount of hash table traversal. To be honest I think the gain is probably minimal even in extreme circumstances.

Especially since computing the hash value for an entry takes, most of the time, almost as long as traversing the linked list.

Anecdotally, a colleague of mine tested reordering the linked lists, so that the most recently used entry was moved to the front.

Interesting approach. However I guess that the cache line probably already had this entry. It would be interesting to see the amount of TLB flushes done by the kernel depending on the amount of traffic and hash table size.

He then pushed a little bit of traffic through the LVS box (>700Mbit/s). We didn't really see any improvement. Which would make me think that hash table performance isn't a particular bottleneck in the current code.

Jernberg, Bernt wrote:

my throughput is 2Gb/s

The tests that I was involved with, with >700Mb/s of traffic, used a hash table with 17 bits. I am not entirely sure how that number was derived as I did not do the tests myself. But it would probably be a good start for you.

You are probably going to see a bigger difference by compiling a non-SMP kernel and eliminating spinlocks than you will twiddling the hash-table bits. I believe that by having an SMP kernel, the overhead of spinlocks is significant under high load.

(For more on SMP/UMP with LVS, see comments by Horms on SMP doesn't help.)

Jernberg, Bernt

I have deployed it at a customer site which offers ftp, http and rsync services. They calculated that they will need 2^21 entries in the hash if it is supported.

Ratz

Let me see (4 secs session coherency and 1/8 of the traffic are valid SYN requests matching the template):

ratz@zar:~ > echo "l(4*2*1024*1024*1024/8)" | bc -l
20.79441541679835928251
ratz@zar:~ >

Note
bc's l() is the natural logarithm, so 20.79 is ln(2^30). The base-2 logarithm of the same number is 30. The rest of the thread takes the 2^21 entries as given.

So yes, this would roughly be 21 bits. But now I ask you to read the nice explanation of Horms in this thread on why you do not need to increase the bucket size of the hash table to be able to hold 2**21 entries. You can perfectly well use 17 bits which would give you a linked list depth (provided we have an equilibrium in distribution over the buckets):

ratz@zar:~ > echo "2^21/2^17" | bc -l
16.00000000000000000000
ratz@zar:~ >

So lousy 16 entries for one bucket when using 17 bit. This is bloody _nothing_. Let's take the worst case: You'd have maybe 32 entries which is still _nothing_. Your CPU doesn't even fully awake to find an entry in this list :).

The amount of RAM you need to hold 2^21 templates entries for a session time of 4 seconds is roughly:

ratz@zar:~ > echo "4*(128*2^21)/1024/1024" | bc -l
1024.00000000000000000000
ratz@zar:~ >

1GB. So you're on the safe end. However if you plan on using persistency you'd run out of memory pretty soonish.

Ratz

I would love to know why people always want to increase the hash table size? I remember that at one point we had a piece of code testing the hash table. We used it to tweak the hash function.

Julian, 17 Feb 2003

hashlvs: I created a script for easy testing. Currently, there are 2 hash functions for tests. I don't remember for what LVS version the hash_*.c files were created.

usage: get copy of the conn table output (use real data) and feed it to the scripts by specifying the used hash table size (in bits) and the desired hash function method. The result is the expected access time in pseudo units. Bigger access time means slower lookup.

33.14. Hash table connection timeouts

How long are the connection entries held for ? (Column 8 of /proc/net/ip_masquerade ?)

Julian

The default timeout value for a TCP session is 15 minutes, for a TCP session after receiving FIN it is 2 minutes, and for a UDP session 5 minutes. You can use ipchains -M -S tcp tcpfin udp to set your own time values.

If we assume a clunky set of web servers being balanced that take 3s to serve an object, then if the connection entries are dropped immediately then we can balance about 20 million web requests per minute with 128M RAM. If however the connection entries are kept for a longer time period this puts a limit on the balancer.

Yeah, it is true.

e.g. (assuming column 8 is the thing I'm after!)

Actually, the column 8 is the delta value in sequence numbers. The timeout value is in column 10.

[zathras@consus /]$ head -n 1000 /proc/net/ip_masquerade | \
sed -e "s/  */ /g"|cut -d" " -f8|sort -nr|tail -n500|head -n1
8398

i.e. held for about 2.3 hours, which would limit a 128MB machine to balancing about 10.4 million requests per day. (Which is definitely on the low side knowing our throughput...)

Horms horms (at) vergenet (dot) net

When a connection is received by an IPVS server and forwarded (by whatever means) to a back-end server, at what stage is this connection entered into the IPVS table? Is it before or as the packet is sent to the back-end server, or delayed until after the 3 way handshake is complete?

Lars

The first packet is when the connection is assigned to a realserver, thus it must be entered into the table then, otherwise the 3 way handshake would likely hit 3 different realservers.

unknown

It has been alleged that IBM's Net Director waits until the completion of the three way handshake to avoid the table being filled up in the case of a SYN flood. To my mind the existing SYN flood protection in Linux should protect the IPVS table in any case, and the connection needs to be in the IPVS table to enable the 3 way handshake to be completed.

Wensong

There is state management in connection entries in the IPVS table. The connection in different states has different timeout values: for example, the timeout of the SYN_RECV state is 1 minute, and the timeout of the ESTABLISHED state is 15 minutes (the default). Each connection entry occupies 128 bytes of effective memory. Supposing that there is 128 Mbytes of free memory, the box can have 1 million connection entries. A SYN flood at a rate over 16,667 packets/second can make the box run out of memory, and the syn-flooding attacker probably needs a T3 link or more to perform the attack. It is difficult to syn-flood an IPVS box. It would be much more difficult to attack a box with more memory.
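The 16,667 packets/second figure follows from the numbers above: the table fills when entries are created faster than they time out. A minimal sketch of that arithmetic, with a made-up function name and the text's round figure of 1 million entries in 128 Mbytes:

```python
def flood_rate_to_exhaust(free_bytes: int,
                          entry_bytes: int = 128,
                          timeout_s: int = 60) -> float:
    """SYN packets per second needed to keep memory full, when every SYN
    creates an entry that lives for timeout_s seconds (SYN_RECV timeout)."""
    max_entries = free_bytes // entry_bytes
    return max_entries / timeout_s


# 128,000,000 bytes / 128 bytes per entry = 1,000,000 entries.
rate = flood_rate_to_exhaust(128_000_000)
print(round(rate), "packets/second")
```

Doubling the free memory or halving the SYN_RECV timeout doubles the rate an attacker has to sustain.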

I assume that the timeout is tunable, though reducing the timeout could have implications for prematurely dropping connections. Is there a possibility of implementing random SYN drops if too many SYNs are received, as I believe is implemented in the kernel TCP stack?

Yup, I should have implemented random early drop of SYN entries a long time ago, as Alan Cox suggested. Actually, it would be simple to add this feature to the existing IPVS code, because the slow timer handler is activated every second to collect stale entries. I just need to add some code to that handler: if over 90% (or 95%) of memory is used, run drop_random_entry to randomly traverse 10% (or 5%) of the entries and drop the SYN_RECV entries among them.
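The idea can be sketched in user space as follows. This is not the kernel code: the dictionary of connection states, the function name and the parameters are hypothetical stand-ins, and the thresholds are the example figures from the paragraph above.

```python
import random
from typing import Dict, List, Optional


def drop_random_syn_entries(entries: Dict[int, str],
                            mem_used: float,
                            threshold: float = 0.90,
                            fraction: float = 0.10,
                            rng: Optional[random.Random] = None) -> List[int]:
    """If memory use exceeds threshold, visit a random fraction of the entries
    and delete those still in SYN_RECV. Returns the ids that were dropped."""
    if mem_used <= threshold or not entries:
        return []
    rng = rng or random.Random()
    sample_size = max(1, int(len(entries) * fraction))
    visited = rng.sample(list(entries), sample_size)
    victims = [cid for cid in visited if entries[cid] == "SYN_RECV"]
    for cid in victims:
        del entries[cid]
    return victims


# 100 half-open connections, memory 95% used: 10% of them are visited and dropped.
conns = {i: "SYN_RECV" for i in range(100)}
dropped = drop_random_syn_entries(conns, 0.95, rng=random.Random(0))
```

Established entries that happen to be sampled are left alone, so only half-open connections pay for the memory pressure.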

A second, related question: if a packet is forwarded to a server, and this server has failed and is subsequently removed from the available pool using something like ldirectord, is there a window where the packet can be retransmitted to a second server? This would only really work if the packet was a new connection.

Yes, it is true. If the primary load balancer fails over, all the established connections will be lost after the backup takes over. We probably need to investigate how to exchange the state (connection entries) periodically between the primary and the backup without too much performance degradation.

If persistent connections are being used and a client is cached but doesn't have any active connections, does this count as a connection as far as load balancing, particularly lc and wlc, is concerned? I am thinking no. This being the case, is the memory requirement for each client that is cached but has no connections 128 bytes, as per the memory required for a connection?

The reason that the existing code uses one template and creates different entries for different connections from the same client is to manage the state of different connections from the same client, and it is easy to add seamlessly into the existing IP Masquerading code. If only one template is used for all the connections from the same client, then when the box receives a RST packet it is impossible to identify which connection it belongs to.

We are using a hash table to record an established network connection. How do we know the data transmission on a connection is over, and when should we delete it from the hash table?

Julian Anastasov ja (at) ssi (dot) bg 24 Dec 2000

OK, here we'll analyze the LVS and mostly the MASQ transition tables from net/ipv4/ip_masq.c. LVS support adds some extensions to the original MASQ code but the handling is same.

First, we have three protocols handled: TCP, UDP and ICMP. The first one (TCP) has many states with different timeout values, most of them set to reasonable values corresponding to the recommendations from some TCP related rfc* documents. For UDP and ICMP there are other timeout values that try to keep both ends connected for a reasonable time without creating many connection entries for each packet.

There are some rules that keep the things working:

- when a packet is received for an existing connection or when a new connection is created, a timer is started/restarted for this connection. The timeout used is selected according to the connection state. If a packet is received for this connection (from either end) the timer is restarted again (possibly after a state change). If no packet is received during the selected period, the masq_expire() function is called to try to release the connection entry. It is possible for masq_expire() to restart the timer again for this connection if it is used by other entries. This is the case for the templates used to implement the persistent timeout. They occupy one entry with the timer set to the value of the persistent time interval. There are other cases, mostly used by the MASQ code, where helper connections are used and masq_expire() can't release the expired connection because it is used by others.

- according to the direction of the packet we distinguish two cases: INPUT where the packet comes in demasq direction (from the world) and OUTPUT where the packet comes from internal host in masq direction.
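
The timer rule in the first item above can be sketched as follows. The timeout values are the defaults quoted earlier in this section (15 minutes established, 1 minute SYN_RECV, 2 minutes after FIN, 5 minutes UDP); the class, the state names used as dictionary keys and the method names are hypothetical, and the sketch leaves out templates and helper connections.

```python
TIMEOUTS = {            # seconds, per connection state
    "ESTABLISHED": 15 * 60,
    "SYN_RECV": 60,
    "FIN_WAIT": 2 * 60,
    "UDP": 5 * 60,
}


class Conn:
    """Toy connection entry with a state-dependent expiry time."""

    def __init__(self, state: str, now: int) -> None:
        self.state = state
        self.expires = now + TIMEOUTS[state]

    def on_packet(self, now: int, new_state: str = "") -> None:
        """Every packet restarts the timer, possibly after a state change."""
        if new_state:
            self.state = new_state
        self.expires = now + TIMEOUTS[self.state]

    def expired(self, now: int) -> bool:
        return now >= self.expires


conn = Conn("SYN_RECV", now=0)          # would expire at t=60
conn.on_packet(now=10, new_state="ESTABLISHED")   # now expires at t=910
```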

What does "masq direction" mean for packets that are not translated using NAT (masquerading), for example, for Direct Routing or Tunneling? The short answer is: there is no masq direction for these two forwarding methods. It is explained in the LVS docs. In short, we have packets in both directions when NAT is used and packets only in one direction (INPUT) when DR or TUN are used. The packets are not demasqueraded for the DR and TUN methods. LVS just hooks the LOCAL_IN chain as the MASQ code is privileged in Linux 2.2 to inspect the incoming traffic when the routing decides that the traffic must be delivered locally. After some hacking, the demasquerading is avoided for these two methods, of course, after some changes in the packet and in its next destination - the realservers. Don't forget that without LVS or MASQ rules, these packets hit the local socket listeners.

How are the connection states changed? Let's analyze for example the masq_tcp_states table (we analyze the TCP states here, UDP and ICMP are trivial). The columns specify the current state. The rows explain the TCP flag used to select the next TCP state and its timeout. The TCP flag is selected from masq_tcp_state_idx(). This function analyzes the TCP header and decides which flag (if many are set) is meaningful for the transition. The row (flag index) in the state table is returned. masq_tcp_state() is called to change ms->state according to the current ms->state and the TCP flag looking in the transition table. The transition table is selected according to the packet direction: INPUT, OUTPUT. This helps us to react differently when the packets come from different directions. This is explained later, but in short the transitions are separated in such a way (between INPUT and OUTPUT) that transitions to states with longer timeouts are avoided when they are caused by packets coming from the world. Everyone understands the reason for this: the world can flood us with many packets that can eat all the memory in our box. This is the reason for this complex scheme of states and transitions. The ideal case is when there are no different timeouts for the different states and when we use one timeout value for all TCP states as in UDP and ICMP. Why not one for all these protocols? The world is not ideal. We try to give more time for the established connections and if they are active (i.e. they don't expire in the 15 mins we give them by default) they can live forever (at least to the next kernel crash^H^H^H^H^Hupgrade).

How does LVS extend this scheme? For the DR and TUN methods we have packets coming from the world only. We don't use the OUTPUT table to select the next state (the director doesn't see packets returning from the internal hosts). We need to relax our INPUT rules and to switch to the state required by the external hosts :( We can't derive our transitions from the trusted internal hosts. We change the state based only on the packets coming from the clients. When we use the INPUT_ONLY table (for DR and TUN), LVS expects a SYN packet and then an ACK packet from the client to enter the established state. The director enters the established state after a two-packet sequence from the client without knowing what happens in the realserver, which can drop the packets (if they are invalid) or establish a connection. When an attacker sends SYN and ACK packets to flood an LVS-DR or LVS-Tun director, many connections enter the established state. Each established connection will hold resources (memory) for 15 mins by default. If the attacker uses many different source addresses for this attack the director will run out of memory.

For these two methods LVS introduces one more transition table: the INPUT_ONLY table which is used for the connections created for the DR and TUN forwarding methods. The main goal: don't enter established state too easily - make it harder.

Oh, maybe you're just reading the TCP specifications. There are sequence numbers that both ends attach to each TCP packet. And you don't see the masq or LVS code trying to filter the packets according to the sequence numbers. This can be fatal for some connections, as the attacker can cause a state change by hitting a connection with a RST packet, for example (ES->CL). The only info needed for this kind of attack is the source and destination IP addresses and ports. Such attacks are possible but not always fatal for the active connections. The MASQ code tries to avoid such attacks by selecting minimal timeouts that are enough for the active connections to resurrect. For example, if the connection is hit by a TCP RST packet from an attacker, this connection has 10 seconds to give evidence of its existence by passing an ACK packet through the masq box.

To make things harder for an attacker trying to block a masq box with many established connections, LVS extends the NAT mode (INPUT and OUTPUT tables) by introducing internal server driven state transitions: the secure_tcp defense strategy. When enabled, the TCP flags in the client's packet can't trigger switching to established state without acknowledgement from the internal end of this connection. secure_tcp changes the transition tables and the state timeouts to achieve this goal. The mechanism is simple: keep the connection in SR state with a timeout of 10 seconds instead of the default 60 seconds used when secure_tcp is not enabled.

This trick depends on the different defense power in the realservers. If they don't implement SYN cookies and so sometimes don't send SYN+ACK (because the incoming SYN is dropped from their full backlog queue), the connection expires in LVS after 10 seconds. This action assumes that the connection was created by an attacker, since one SYN packet is not followed by another as part of the retransmissions made by the client's TCP stack.

We give 10 seconds to the realserver to reply with SYN+ACK (even 2 are enough). If the realserver implements SYN cookies, the SYN+ACK reply follows the SYN request immediately. But if SYN cookies are not implemented, the SYN requests are dropped when the backlog queue length is exceeded. So secure_tcp is by default useful for realservers that don't implement SYN cookies. In this case LVS expires the connections in SYN state in a short time, releasing the memory resources allocated for them. In any case, secure_tcp does not allow switching to established state by looking at the client's packets. We expect an ACK from the realserver to allow this transition to EST state.

The main goal of the defense strategies is to keep the LVS box with more free memory for other connections. The defense for the realservers can be built into the realservers. But maybe I'll propose to Wensong to add a per-connection packet rate limit. This will help against attacks that create a small number of connections but send many packets and in this way load the realservers dramatically. Maybe two values: a rate limit for all incoming packets and a rate limit per connection.

The good news is that all these timeout values can be changed in the LVS setup, but only when the secure_tcp strategy is enabled. An SR timeout of 2 seconds is a good value for LVS clusters when realservers don't implement SYN cookies: if there is no SYN+ACK from the realserver then drop the entry at the director.
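As a rough illustration of why a shorter SR timeout saves memory on the director, the sketch below estimates how many half-open entries a steady SYN flood leaves in the table for the 60, 10 and 2 second timeouts mentioned above. The flood rate is a hypothetical figure chosen only for the example, and the 128 bytes per entry is the size quoted elsewhere in this section; real numbers will depend on the kernel and the attack.

```python
CONN_ENTRY_BYTES = 128  # per-entry size quoted elsewhere in this section


def syn_recv_entries(syn_rate, sr_timeout_s):
    """Steady-state number of half-open entries: arrivals per second * lifetime."""
    return syn_rate * sr_timeout_s


def memory_mb(entries):
    """Memory held by that many connection entries, in MB."""
    return entries * CONN_ENTRY_BYTES / (1 << 20)


FLOOD_RATE = 10_000  # hypothetical: spoofed SYNs per second that never complete

for timeout in (60, 10, 2):
    n = syn_recv_entries(FLOOD_RATE, timeout)
    print(f"SR timeout {timeout:2d}s: {n:6d} entries, {memory_mb(n):.1f} MB")
```

Under these assumptions the memory held by unanswered SYNs scales linearly with the SR timeout, which is the whole point of shortening it.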

The bad news is of course, for the DR and TUN methods. The director doesn't see the packets returning from the realservers and LVS-DR and LVS-Tun forwarding can't use the internal server driven mechanism. There are other defense strategies that help when using these methods. All these defense strategies keep the director with memory free for more new connections. There is no known way to pass only valid requests to the internal servers. This is because the realservers don't provide information to the director and we don't know which packet is dropped or accepted from the socket listener. We can know this only by receiving an ACK packet from the internal server when the three-way handshake is completed and the client is identified from the internal server as a valid client, not as spoofed one. This is possible only for the NAT method.

ksparger (at) dialtoneinternet (dot) net (29 Jan 2001) rephrases this by saying the LVS-NAT is layer-3 aware. For example, NAT can 'see' if a realserver responds to a packet it's been sent or not, since it's watching all of the traffic anyway. If the server doesn't respond within a certain period of time, the director can automatically route that packet to another server. LVS doesn't support this right now, but, NAT would be the more likely candidate to support it in the future, as NAT understands all of the IP layer concepts, and DR doesn't necessarily.

Julian

Someone must put back the realserver when it is alive. This sounds like a user space job. The traffic will not start until we send requests. We have to send L4 probes to the realserver (from the user space) or to probe it with requests (LVS from kernel space)?

33.15. Hash Table DoS

A posting (Jun 2003) on Slashdot links to a paper on Denial of Service via Algorithmic Complexity Attacks. The article shows how to mount a DoS by attacking hash tables. For most hashing algorithms, the cost of filling a table differs between the average case (well-distributed keys, where inserting n entries takes about O(n) time) and the worst case (all keys colliding, where it can take O(n^2)). Programmers hope that real life data is not pathological. If the attacker knows the hash algorithm (i.e. has the source code), then they may be able to construct a worst case dataset for the hashing algorithm, which will bring the server to its knees. The paper discusses constructing attacks in which all data is entered into one bucket of the hash table.

In the case of LVS, the hash table contents are (CIP:port, proto, VIP:port). The client only has a small number of variables (the port and proto they are sending from) from which to mount an attack, the others being fixed (CIP, VIP:port).

Horms horms (at) verge (dot) net (dot) au 05 Jun 2003

Here is my take on this: LVS uses the following hash

(proto XOR CIP XOR (CIP>>IP_VS_CONN_TAB_BITS) XOR port) & IP_VS_CONN_TAB_MASK

where:
proto =  protocol (TCP=6, UDP=17)
CIP   =  source/client IP address (host byte order)
port  =  source port (host byte order)
IP_VS_CONN_TAB_BITS defaults to 8
IP_VS_CONN_TAB_MASK is (1 << IP_VS_CONN_TAB_BITS) - 1 thus the default is 0xff

(from here '^' will mean power)

The only inputs a user/client can affect are CIP and port.

I would say that it is quite easy for someone to set things up so that they consistently hit the same bucket, for instance by connecting from the same IP address with different ports from the set (port & IP_VS_CONN_TAB_MASK) = n (though we observe that each end-user only has 2^(16-IP_VS_CONN_TAB_BITS) = 256 such ports). To get more entries than that into one bucket, the client would need to use multiple source IP addresses.

The effect is that instead of n connections going into 2^IP_VS_CONN_TAB_BITS different buckets they will go into one bucket. Thus LVS will have to do on average n/2 traversals instead of n/2^(IP_VS_CONN_TAB_BITS+1) traversals.

The real effect is to amplify traversal times by 2^IP_VS_CONN_TAB_BITS. It is worth remembering, though, that the larger IP_VS_CONN_TAB_BITS is, the lower 2^(16-IP_VS_CONN_TAB_BITS) becomes, and thus the greater the number of source IP addresses required. If it were a UDP-based attack this might not be an issue, as the source IP could be spoofed.
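The bucket collision described above can be checked with a short sketch. It transcribes the hash formula as quoted by Horms (not the kernel source itself) and enumerates the source ports that land in one bucket for a fixed protocol and client address. The client address used here is hypothetical and chosen only for illustration.

```python
IP_VS_CONN_TAB_BITS = 8
IP_VS_CONN_TAB_MASK = (1 << IP_VS_CONN_TAB_BITS) - 1  # 0xff

TCP = 6


def conn_hash(proto, cip, port):
    """Bucket index per the formula quoted above (values in host byte order)."""
    return (proto ^ cip ^ (cip >> IP_VS_CONN_TAB_BITS) ^ port) & IP_VS_CONN_TAB_MASK


def colliding_ports(proto, cip, bucket):
    """All 16-bit source ports that put (proto, cip, port) into the given bucket."""
    return [p for p in range(1 << 16) if conn_hash(proto, cip, p) == bucket]


# Hypothetical client address 192.168.1.10, as a 32-bit integer.
cip = (192 << 24) | (168 << 16) | (1 << 8) | 10
ports = colliding_ports(TCP, cip, bucket=0)

n = len(ports)
print("ports hitting bucket 0 from one address:", n)
print("average traversals, all in one bucket:  ", n / 2)
print("average traversals, evenly spread:      ", n / 2 ** (IP_VS_CONN_TAB_BITS + 1))
```

With the default of 8 bits, one source address can supply 256 colliding ports, and all of them share the same low byte, which is why an attacker needs additional source addresses to make a single chain longer.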

This could become a problem if n became very large. But how large? Traversal is actually pretty fast. So I think that n would need to be quite large indeed to have a noticeable effect on LVS and larger still to seriously degrade performance. Though I could be wrong.

As for solutions, it is a bit tricky AFAIK. Perhaps using some component of the skb which is static for a connection, but not directly influenced by end users. But that may well open up a whole new can of worms.

Ratz 05 Jun 2003

Theoretically we're susceptible to this sort of attack. Check out the devastating effects on running Julian's testlvs with certain parameters. You can still enable the LVS DoS defense strategies though (see testing DoS strategies).

33.15.1. testing hash code

Julian, 14 Jun 2003

Maybe it is time to change the hash function used for the table with connections. Today I played with some data from my 2.2 director and fixed the tools that measure different hash functions. I tested the default LVS function, one that uses 2654435761 and the Jenkins hash that is present in recent 2.4 and 2.5 kernels. We need some help from the math perspective.

Here are some tools for testing the hash functions. Look in Julian's LVS page for test.txt, which contains short instructions for testing, and ipvs-1.0.9-hash1.diff, the test hash code for 2.4.21.

I created a test patch against ipvs-1.0.9 (not tested). This is an attempt to introduce randomness on IPVS load. As for the tests with the different hash functions, you can see my results and, of course, try them yourself. My conclusion is that 2654435761 is better and faster, but I hope we will see other results.

33.16. Hash table size, director will crash when it runs out of memory.

Yasser Nabi

IP Virtual Server version 0.9.0 (size=16777216)

Julian Anastasov ja (at) ssi (dot) bg 25 May 2001

Too much, it takes 128MB for the table only. Use 16 bits for example.

Is this a hidden/undocumented problem with IPVS or it's just an observation of memory waste ? (we use 18 bits in production)

empty hash tables:

18 bits occupy 2MB RAM
24 bits occupy 128MB RAM

If the box has 128MB and the bits are 24, a kernel crash is inevitable, sooner or later. And this is a good reason for the virtual service not to be hit. Expect funny things to happen on a box with low memory.

I forgot that not everyone uses 256MB or more RAM on directors :)

Yes, 256MB in real situation is ~1,500,000 connections, 128 bytes each, 64MB for other things ... until someone experiments with SYN attack

However, for me it makes sense to use up to 66% of total memory for LVS, especially on high-traffic directors (on the idea that the directors don't run all the desktop garbage that comes with most distros).

If the used bits are 24, an empty hash table is 128MB. For the rest 128MB you can allocate 1048576 entries, 128 bytes each ... after the kernel killed all processes.

Some calcs considering the magic value 16 as average bucket length and for 256MB memory:

For 17 bits:

2^17=131072 => 1MB for empty hash table

131072*16=2097152 entries=256MB for connections

For 18 bits:

2^18=262144 => 2MB for empty hash table

for each MB for hash table we lose space for 8192 entries but we speedup the lookup.

So, even for 1GB directors, 19 or 20 is the recommended value. Anything above is a waste of memory for hash table. In 128MB we can put 1048576 entries. In the 24-bit case they are allocated for d-linked list heads.
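The sizing figures in this thread can be reproduced with a few lines of arithmetic. This is a sketch under one assumption that is not stated in the posts: each hash bucket is a list head of 8 bytes (two 4-byte pointers on a 32-bit kernel), which is the size implied by "2^17 => 1MB". The 128 bytes per connection entry is taken from the thread.

```python
LIST_HEAD_BYTES = 8     # assumption: two 4-byte pointers per bucket (32-bit kernel)
CONN_ENTRY_BYTES = 128  # per-connection size quoted in the thread
MB = 1 << 20


def empty_table_mb(bits):
    """Memory used by the bucket array alone, before any connection exists."""
    return (1 << bits) * LIST_HEAD_BYTES / MB


def entries_in(mem_mb):
    """Number of connection entries that fit in mem_mb megabytes."""
    return mem_mb * MB // CONN_ENTRY_BYTES


for bits in (17, 18, 24):
    print(f"{bits} bits: empty table = {empty_table_mb(bits):.0f} MB")

print("entries lost per MB of table:", entries_in(1))
print("entries in 128 MB:", entries_in(128))
print("entries in 256 MB:", entries_in(256), "=", (1 << 17) * 16, "(2^17 buckets * 16)")
```

On a 64-bit kernel the list heads would be larger, so the empty-table sizes above should be treated as specific to the systems discussed here.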

Joe 6 Jun 2001

what happens after the table fills up? Does ipvs handle new connect requests gracefully (ie drops them and doesn't crash)?

Julian

The table has fixed number of rows and unlimited number of columns (d-lists where the connection structures are enqueued). The number of connections allocated depends on the free memory.

Once there is no memory to allocate connection structure, the connection requests will be dropped. Expect crashes maybe at another place (usually user space) :)

I'm not sure what the kernel will decide in this situation but don't rely on the fact some processes will not be killed. There is a constant network activity and a need for memory for packets (floods/bursts).

The reason the defense strategies exist is to free memory for new connections by removing the stalled ones. The defense strategy can be activated automatically at a memory threshold. Killing the cluster software under memory pressure is not good.

So, the memory can be controlled, for example, by setting drop_entry to 1 and tuning amemthresh. On floods it can be increased. It depends on the network speed too: 100/1000 Mbit. Thresholds of 16 or 32 megabytes can be used in such situations, provided, of course, that enough memory is installed.

Roberto Nibali ratz (at) tac (dot) ch

The director never crashes because of exhaustion of memory. If he tries to allocate memory for a new entry into the table and kmalloc returns NULL, we return, or better drop the packet in processing and generate a page fault.

You could use my threshold limitation patch. You calculate how many connections you can sustain with your memory by allowing 128 bytes for each connection entry, divide by the number of realservers, and set the limit accordingly. Example:

128MByte, persistency 300s: max amount of concurrent connections: 3495. We assume 4 realservers, equally load balanced, thus we have to limit the upper threshold per realserver to 873. Like this you would never have a memory problem but a security problem.
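The example figures can be reproduced as follows. This is a reconstruction of the arithmetic that yields 3495 and 873, not code from the threshold patch; in particular, dividing by the persistence timeout is an inference from the numbers given.

```python
def per_realserver_limit(mem_mb, timeout_s, realservers, entry_bytes=128):
    """Return (overall limit, per-realserver limit) for the given memory budget."""
    total_entries = mem_mb * 1024 * 1024 // entry_bytes  # entries that fit in memory
    sustained = total_entries // timeout_s               # spread over the persistence timeout
    return sustained, sustained // realservers


overall, per_rs = per_realserver_limit(mem_mb=128, timeout_s=300, realservers=4)
print(overall, per_rs)
```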

Joe

It would seem that we need a method of stopping the director hash table from using all memory whether as a result of a DoS attack or in normal service. Let's say you fill up RAM with the hash table and all user processes go to swap, then there will be problems - I don't know what, but it doesn't sound great - at a high number of connections I expect the user space processes will be needed too. I expect we need to leave a certain amount for user space processes and not allow the director to take more than a certain amount of memory.

It would be nice if the director didn't crash when the number of connections got large. Presumably a director would be functioning only as a director and the amount of memory allocated to user space processes wouldn't change a whole lot (ie you'd know how much memory it needed).

33.17. The LVS code does not swap

Joe Feb 2001

With sufficient number of connections, a director could start to swap out its tables (is this true?) In this case, throughput could slow to a crawl. I presume the kernel would have to retrieve parts of the table to find the realserver associated with incoming packets. I would think in this case it would be better to drop connect requests than to accept them.

Julian

IMO, this is not true. LVS uses GFP_ATOMIC kind of allocations and as I know such allocations can't be swapped out.

If it's possible for LVS to start the director to swap, is there some way to stop this?

You can try with testlvs whether the LVS uses swap. Start the kernel with LILO option mem=8M and with large swap area. Then check whether more than 8MB swap is used.

33.18. Other factors determining the number of connections

In earlier versions of LVS, you set the amount of memory for the tables (in bytes). Now you allocate a number of hashes, whose size can grow without limit, allowing an unlimited number of connections. Once the number of connections becomes sufficiently large, then other resources will become limiting.

  • out of memory.

    The ipvs code doesn't handle this; presumably the director will crash (also see the threshold patch). Instead you handle this by brute force, adding enough memory to accept the maximum number of connections your setup will ever be asked to handle (e.g. under a DoS attack). This memory size can be determined by multiplying the rate at which your network connection can push connect requests to the director by the timeout values, which are set by FIN_WAIT or the persistence.

  • out of ports.

    You can expand the number of ports to 65k, but eventually you'll reach the 65k port limit.

33.19. Port range: limitations, expanding port range on directors

Sometimes client processes on the realservers need to connect with machines on the internet (see clients on realservers).

Wayne wayne (at) compute-aid (dot) com Nov 5 2001

Say you have a web page that has to retrieve on-line ads from one of your advertisers (people who pay you for showing their ads). If you have 50,000 visitors on your site, you will open 50,000 connections between your web server and the ad server out there somewhere. The masquerade limit is 4,096 per pair of IP addresses, and 40,960 per LVS box. In our case, the realserver is behind the LVS-NAT director, which also functions as the firewall, so the realserver MUST use the director to reach the ad servers.

Usually the RIP is private (e.g. 192.168/16) and will have to be NAT'ed to the outside world. This can be done with LVS-NAT or LVS-DR by adding masquerading rules to the director's iptables/ipchains rules. (With LVS-DR, you also have to route the packets from the RIP; this routing is set up by default with the configure script.)

Less often you want to use more ports on your LVS client machines.

Wang Haiguang

My client machine uses port numbers between 1024 and 4096. After reaching 4096, it loops back to 1024 and reuses the ports. I want to use more port numbers.

michael_e_brown (at) dell (dot) com 06 Feb 2001

echo 1024 65000 > /proc/sys/net/ipv4/ip_local_port_range
/usr/src/linux/Documentation/networking/ip-sysctl.txt
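To see how many local ports a given setting provides, the two numbers in ip_local_port_range can be parsed and counted. The sketch below works on the file's text so it can be tried anywhere; the helper names are made up for this example.

```python
def parse_port_range(text):
    """Parse the two whitespace-separated numbers found in ip_local_port_range."""
    low, high = (int(x) for x in text.split())
    return low, high


def port_count(low, high):
    """Number of ports in the inclusive range low..high."""
    return high - low + 1


# Text as it would read after the echo command above.
low, high = parse_port_range("1024\t65000\n")
print(port_count(low, high))    # 63977 ports after widening the range
print(port_count(1024, 4096))   # 3073 ports in the range Wang Haiguang reports
```

On a live Linux box the text can be obtained with `open("/proc/sys/net/ipv4/ip_local_port_range").read()`.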

While normal client processes use ports in order starting at 1024, masqueraded ports start at 61440 (2^16-2^12) for 2.2.x kernels (see clients on realservers). The masquerading code does not check if other processes are requesting ports and thus port collisions could occur. It is assumed on a NAT box that no other processes are initiating connections (i.e. you aren't running a passive ftp server).

Horms 26 Jun 2007

I believe that there is a school of thought that source ports should be randomised to mitigate certain classes of security threats.

Horms 17 May 2004

I am a bit rusty on 2.2. But the restricted port range for NAT'ed connections with 2.2 sounds familiar. I seem to recall you could change the range, but it required changing a define in the kernel and recompiling. I think it was changed to a /proc value in 2.4.

Note
For 2.4.x kernels, the restriction to only use high ports is removed. The NAT router uses ports starting at 1024.

Horms 17 May 2004

In 2.4 the ephemeral port range and the NAT port range are the same. If this is the case, which I guess it is, then it would seem likely there is some sort of collision detection. I took a look and this does not seem to be the case. I assume the kernel has some other way of handling this. But I am not sure at this moment. If you are interested try looking at tcp_v4_get_port() and tcp_unique_tuple(). I'm not convinced that Michael Brown's comment is correct at the moment, but I don't have the definitive answer either.

Wayne wayne (at) compute-aid (dot) com 14 May 2000

If running a load balancer tester, say the one from IXIA to issue connections to 100 powerful web servers, would all the parameters in Julian's description need to be changed, or it should not be a problem for having many many connections from a single tester?

Julian

There is no limit for the connections from the internal hosts. Currently, the masquerading allows one internal host to create 40960 TCP connections. But the limit of 4096 connections to one external service is still valid.

If 10 internal hosts try to connect to one external service, each internal host can create 4096/10 => 409 connections.

For UDP the problem is sometimes worse. It depends on the /proc/sys/net/ipv4/ip_masq_udp_dloose value.

Joe

which is internal and which is external here? The client, the realservers?

This is a plain masquerading so internal and external refer to masquerading. These limits are not for the LVS connections, they are only for the 2.2 masquerading.

		 / 65095	Internal Servers
External Server:PORT	-  ...	 MADDR --------------------
		 \ 61000

When many internal clients try to connect to same external real service, the total number of TCP connections from one MADDR to this remote service can be 4096 because the masq uses only 4096 masq ports by default. This is a normal TCP limit, we distinguish the TCP connections by the fact they use different ports, nothing more. And the masq code is restricted by default to use the above range of 4096 ports.

In the whole masquerading table there is a space only for 40960 TCP, 40960 UDP and 40960 ICMP connections. These values can be tuned by changing ip_masq.c:PORT_MASQ_MUL.
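The limits quoted for 2.2 masquerading follow from the size of the masq port range. The sketch below assumes that the range in the diagram (61000 to 65095) is inclusive and that PORT_MASQ_MUL is 10, the value implied by 40960 / 4096; check ip_masq.c on your kernel before relying on either.

```python
MASQ_PORT_BEGIN = 61000  # from the diagram above
MASQ_PORT_END = 65095    # from the diagram above, assumed inclusive
PORT_MASQ_MUL = 10       # assumption: implied by 40960 / 4096

masq_ports = MASQ_PORT_END - MASQ_PORT_BEGIN + 1
per_protocol_limit = masq_ports * PORT_MASQ_MUL


def share_per_host(internal_hosts):
    """Connections each internal host gets to ONE external service if shared evenly."""
    return masq_ports // internal_hosts


print("masq ports (limit to one external IP:port):", masq_ports)
print("table limit per protocol (TCP, UDP, ICMP): ", per_protocol_limit)
print("per host when 10 hosts share one service:  ", share_per_host(10))
```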

For 2.4 masquerading, all ports can be used for the masqueraded connections.

Wayne wayne (at) compute-aid (dot) com 1 Nov 2001

PORT_MASQ_MUL appears to serve only as a check to make sure the masquerading facility does not hog all the memory, and that actually things would still work no matter how large PORT_MASQ_MUL is, or even if the checks using it are disabled. Is this true?

Julian

By multiplying this constant with the masq port range, you define the connection limit for each protocol. This is related to the memory used for masquerading. This is a real limit, but not for LVS connections, because they are usually not limited by port collisions, and LVS does not check this limit.

What about using more than the 32k range? What is the maximum I could select?

Peter Mueller pmueller (at) sidestep (dot) com

You should be able to use about 60k, i.e. 1024-61000. I hope you have lots of RAM :-)

Julian continuing

The PORT_MASQ_MUL value simply determines the recommended length of one row in the masq hash table for connections, but in fact it is involved in the above connection limits. It is recommended that busy masq routers increase this value, and maybe the 4096 masq port range too. This applies to squid servers behind a masq router.

LVS uses another table without limits. For LVS setups the same TCP restrictions apply but for the external clients:

	4999 \
Client	     - VIP:VPORT LVS Director
	1024 /

The number of client connections to one VIP:VPORT is limited to the number of client ports in use from the same client IP.

The same restrictions apply to UDP. UDP has the same port ranges. But for UDP the 2.2 kernel can apply different restrictions. They are caused from some optimizations that try to create one UDP entry for many connections. The reason for this is the fact that one UDP client can connect to many UDP servers while this is not common for TCP.

Joe

when you increase the port range, you need more memory. Is this only because you can have more connections and hence will need a bigger ipvsadm table?

Yes, the first need is for more masqueraded connections and they allocate memory. LVS uses separate table and it is not limited. We distinguish LVS-NAT from Masquerade. LVS-NAT (and any other method) does not allocate extra ports, even for other ranges. It shadows only the defined port. No other ports are involved until masquerading is used.

ipvs doesn't check port ranges and so collisions can occur with regular services (ftp was mentioned). I would have thought that a process needing to open an IP connection would ask the tcp code in the kernel for a connection and let that code handle the assignment of the port.

LVS does not allocate local ports. When the masquerade is used to help with some protocol, the masquerade performs the check (ftp for example).

The port range has nothing to do with LVS. It helps the masquerading to create more connections because there is a fixed limit for each protocol. But sometimes LVS for 2.2 uses ip_masq_ftp, so maybe only then is this mport range used.

X-window connections are at 6000.. Will you be able to start an X-session if these ports are in use by a director masquerading out connections from the realservers?

If we put LVS (ipvsadm -A ) in front of this port 6000 then X sessions will be stopped. OTOH, masquerade does not select ports in this range, the default is above 61000. So, any FTP sessions will not disturb local ports, of course, if you don't increase the mport range to cover the well defined server ports such as X.

33.20. Director does not have any ports (connections) open for an LVS connection

The director is just a router (admittedly with slightly different rules than the standard layer 3 router). There are no connections made (ports opened) between the client and the director, or between the realservers and the director. The director does keep track of the packets passing through that are for LVS'ed services (connection tracking) as part of updating the hash table and for the server state synch daemon.

Sebastien BONNET sebastien(dot) bonnet (at) experian (dot) fr 11 May 2004

There's no "open" connection on the director, just tracked connections. The clients are not "connected" to the load balancer.

For the client, assuming a client always uses a different port for an outgoing connection, it can roughly initiate 65K connections.

For the realserver, there's no real port limit for a daemon listening on a single port: it uses just one. The realserver can have connections to all ports on all IPs, i.e. 256*256*256*256*(65536-1024) connections (the realserver may run out of memory before it can make all these connections).

If there was no file descriptor limit nor memory constraint, a server could handle way more than the current "port limit" (65K) simultaneous connections.

33.21. apps starved for ports

LVS Account, 27 Feb 2001

I'm trying to do some load testing of LVS using a reverse proxy cache server as the load balanced app. The error I get is from a load generating app.. Here is the error:

byte count wrong 166/151

Julian Anastasov ja (at) ssi (dot) bg

Broken app.

this goes on for a few hundred requests then I start getting:

Address already in use

App uses too many local ports.

This is when I can't telnet to port 80 any more... If I try to telnet to 10.0.0.80 80 I get this:

$ telnet 10.0.0.80 80
Trying 10.0.0.80...
telnet: Unable to connect to remote host: Resource temporarily unavailable

No more free local ports.

If I go directly to the web server OR if I go directly to the IP of the reverse proxy cache server, I don't get these errors.

Hm, there are free local ports now.

I'm using a load balancing app that I call this way:

/home/httpload/load -sequential -proxyaddr 10.0.0.80 -proxyport 0 -parallel 120 -seconds 6000000 /home/httpload/url

upping the local port range has helped tremendously

33.22. realserver running out of ports

Here's a case where a realserver ran out of UDP ports doing DNS lookups while serving http.

Hendrik Thiel thiel (at) falkag (dot) de

I am using IP Virtual Server version 0.9.14 (size=4096). We have 6 Realservers each.

-> RemoteAddress:Port   Forward Weight ActiveConn InActConn
-> server1:www          Masq    1      68         12391

Today we reached a new peak (very fast, few minutes) 30Mbps, up from the normal 15Mbit/s. Afterwards the following kernel messages (dmesg) showed up...

IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31894).
IP_MASQ:ip_masq_new(proto=UDP): could not get free masq entry (free=31888).

Julian Anastasov ja (at) ssi (dot) bg 20 Nov 2001 (heavily edited by Joe)

It seems you are flooding a single remote host with UDP requests from a realserver. Your service, www, is TCP and is not directly connected to these messages. You've reached the UDP limit per destination (4096), there are still free UDP ports on the realserver for other destinations.

Hendrik

yes it's DNS, each realserver is a caching DNS.

resolv.conf
nameserver 127.0.0.1
nameserver external IP

33.23. Maximum number of NICs

This is not really an LVS question, but people want to know.

ntadmin (at) reachone (dot) com

We are nearing 255 virtual interfaces on the external side of our LVS system (Joe - presumably the number of VIPs). Can somebody tell me if this is going to be a hard limit or if we can go beyond 255 on one network card?

Roberto Nibali ratz (at) tac (dot) ch 17 Dec 2003

No problem (proof of concept below):

# ip addr show dev lo | grep 'inet ' | wc -l
       1
# for ((i = 1; i < 255; i++)); do for ((j = 1; j < 4; j++)); do ip addr add 127.0.$j.$i/32 dev lo brd + label lo:$i-$j 1>/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
     763
# for ((i = 1; i < 255; i++)); do for ((j = 1; j < 4; j++)); do ip addr del 127.0.$j.$i/32 dev lo brd + label lo:$i-$j 1>/dev/null 2>&1; done; done
# ip addr show dev lo | grep 'inet ' | wc -l
       1

33.24. DoS

LVS is vulnerable to DoS by an attacker making repeated connection requests. Each connection requires 128 bytes of memory; eventually the director will run out of memory. This will take a while but an attacker has plenty of time if you're asleep. Also, with LVS-DR and LVS-Tun, the director doesn't have access to the TCP/IP tables in the realserver(s) which show whether a connection has closed (see director hash table). The director can only guess that the connection has really closed, and does so using timeouts.

Roberto Nibali ratzi (at) tac (dot) ch 10 Sep 2002

It's _impossible_ to differentiate between malicious and good traffic. End of story. But you can rate limit incoming SYNs within ingress policy. This was discussed about 2 years ago when the secure_tcp and drop_packet stuff was about to be introduced.

For information on DoS strategies for LVS see DoS page.

Laurent Lefoll Laurent (dot) Lefoll (at) mobileway (dot) com 14 Feb 2001

If I am not misunderstanding something, the variable /proc/sys/net/ipv4/vs/timeout_established gives the time a TCP connection can be idle and after that the entry corresponding to this connection is cleared. My problem is that it seems that sometimes it's not the case. For example I have a system (2.2.16 and ipvs 0.9.15) with /proc/sys/net/ipv4/vs/timeout_established = 480, but the entries are created with a real timeout of 120.

Julian Anastasov ja (at) ssi (dot) bg

Read The secure_tcp defense strategy where the timeouts are explained. They are valid for the defense strategies only. For TCP EST state you need to read the ipchains man page.

For more explanation of the secure_tcp strategy also see the explanation of the director's hash table.

when I play with ipchains -M -S > [value] 0 0 the variable /proc/sys/net/ipv4/vs/timeout_established is modified even when /proc/sys/net/ipv4/vs/secure_tcp is set to 0, so I'm not using the secure TCP defense. The "real" timeout is of course set to [value] when a new TCP connection appears. So should I understand that timeout_established, timeout_udp,... are always modified by "ipchains -M -S .... whatever I use or not secure TCP defense but if secure-tcp is set to 0, other variables give the timeouts to use? If so, are these variable accessible or how to check their value?

ipchains -M -S modifies the two TCP timeouts and the UDP timeout in both secure_tcp modes, off and on. So ipchains changes the three timeout_XXX vars. When you change the timeout_* vars directly, you change them for secure_tcp=on only. Think of the timeouts as two sets, one for each secure_tcp mode. ipchains changes the 3 vars in both sets. While secure_tcp is off, changing timeout_* does not affect the connection timeouts; they are used only when secure_tcp is on.

Note
Joe: ipchains 0 value 0, where value=10, does not change the timeout values or number of entries seen in InActConn or seen with netstat -M, or ipchains -M -L -n.

LVS has its own tcpip state table, when in secure_tcp mode.

carl.huang

what are the vs_tcp_states[ ] and vs_tcp_states_dos[ ] elements in the ip_vs_conn structure for?

Roberto Nibali ratz (at) tac (dot) ch 16 Apr 2001

The vs_tcp_states[] table is the modified state transition table for the TCP state machine. The vs_tcp_states_dos[] is a yet again modified state table in case we are under attack and secure_tcp is enabled. It is tighter but not conforming to the RFC anymore. Let's take an example of how you can read it:

static struct vs_tcp_states_t vs_tcp_states [] = {
/*      INPUT */
/*        sNO, sES, sSS, sSR, sFW, sTW, sCL, sCW, sLA, sLI, sSA */
/*syn*/ {{sSR, sES, sES, sSR, sSR, sSR, sSR, sSR, sSR, sSR, sSR }},
/*fin*/ {{sCL, sCW, sSS, sTW, sTW, sTW, sCL, sCW, sLA, sLI, sTW }},
/*ack*/ {{sCL, sES, sSS, sES, sFW, sTW, sCL, sCW, sCL, sLI, sES }},
/*rst*/ {{sCL, sCL, sCL, sSR, sCL, sCL, sCL, sCL, sLA, sLI, sSR }},

The elements 'sXX' mean state XX; for example, sFW means TCP state FIN_WAIT, sSR means TCP state SYN_RECV, and so on. The table describes the transition of the TCP state machine from one TCP state to another after a state event occurs. The commented header row lists the current state, and the label at the start of each row (syn, fin, ack, rst) is the incoming TCP flag that drives the transition. For example, take column 2, which is headed sES and ends with sCL. If you are in sES and get a fin, you go from sES to sCW, which should conform to the RFC and Stevens.

Short illustration:

/*           , sES,
/*syn*/ {{   ,    ,
/*fin*/ {{   , sCW,
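
The lookup described above can be sketched in a few lines. This is only an illustration of how to read the table, not the kernel code: the four rows are transcribed from the INPUT excerpt shown here, and the names STATES, INPUT_TABLE and next_state are hypothetical.

```python
# Column headers of the INPUT table: the state the connection is in now.
STATES = ["sNO", "sES", "sSS", "sSR", "sFW", "sTW", "sCL", "sCW", "sLA", "sLI", "sSA"]

# One row per incoming TCP flag; each entry is the state reached from the
# state named by the column header at the same position.
INPUT_TABLE = {
    "syn": ["sSR", "sES", "sES", "sSR", "sSR", "sSR", "sSR", "sSR", "sSR", "sSR", "sSR"],
    "fin": ["sCL", "sCW", "sSS", "sTW", "sTW", "sTW", "sCL", "sCW", "sLA", "sLI", "sTW"],
    "ack": ["sCL", "sES", "sSS", "sES", "sFW", "sTW", "sCL", "sCW", "sCL", "sLI", "sES"],
    "rst": ["sCL", "sCL", "sCL", "sSR", "sCL", "sCL", "sCL", "sCL", "sLA", "sLI", "sSR"],
}


def next_state(current, flag):
    """Return the state reached from `current` when a packet carrying `flag` arrives."""
    column = STATES.index(current)   # pick the column for the current state
    return INPUT_TABLE[flag][column]  # pick the row for the incoming flag


if __name__ == "__main__":
    # The example from the text: established connection receives a fin.
    print(next_state("sES", "fin"))  # sCW
```

Reading down a column gives every transition out of one state; reading along a row gives what one flag does in every state.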

Some months ago last year, Wensong, Julian and I discussed a security enhancement for the TCP state transition, and after some heavy discussion they implemented it. So the second table, vs_tcp_states_dos[], was born (look in the mailing list in early 2000).

33.25. DoS, from the mailing list

33.25.1. Malicious attacks (SYN floods)

LVS has been tested with a 100Mbit/sec syn-flooding attack by Alan Cox and Wensong.

Each connection requires 128 bytes. A machine with 128M of free memory could hold 1M concurrent connections. An average connection lasts 300 secs. Connections which just receive the syn packet are expired in 30 secs (starting with ipvs 0.8). An attacker would have to initiate 3k connections/sec (600Mbps) to maintain the memory at the 128M mark and would require several T3 lines to keep up the attack.
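
The table-size and connection-rate figures above can be checked with a little arithmetic. The sketch below assumes only the numbers given in the text (128 bytes per entry, 128M of free memory, 300 sec and 30 sec lifetimes); the function names are hypothetical, and the bandwidth figure is not derived here because the text does not state a packet size.

```python
ENTRY_BYTES = 128                        # memory per connection entry (from the text)
FREE_MEMORY_BYTES = 128 * 1024 * 1024    # 128M of free memory (from the text)


def max_entries(free_bytes, entry_bytes=ENTRY_BYTES):
    """Number of connection entries that fit in free_bytes."""
    return free_bytes // entry_bytes


def sustain_rate(entries, lifetime_secs):
    """New connections per second needed to keep `entries` alive
    when each entry expires after lifetime_secs."""
    return entries / lifetime_secs


if __name__ == "__main__":
    entries = max_entries(FREE_MEMORY_BYTES)
    print(entries)                          # about 1M entries
    print(round(sustain_rate(entries, 300)))  # rate if entries last 300 secs
    print(round(sustain_rate(entries, 30)))   # rate if syn-only entries last 30 secs
```

With a 300 sec lifetime the rate comes out in the low thousands per second, in line with the "3k connections/sec" quoted above; with the 30 sec expiry for syn-only entries the attacker needs roughly ten times that rate.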

33.25.2. testing DoS

joern maier 22 Nov 2000

I've got a problem protecting my LVS from SYN-flood attacks. Somehow the drop_entry mechanism does not seem to work. When I SYN-flood my LVS (1 director + 3 RS) from 3 clients, the system becomes unreachable. A single realserver under the same attack by those clients stays alive.

Julian

You can't SYN flood the director with only 3 clients. You need more clients (or as an alternative, you can download testlvs from the LVS web site). What does ipvsadm -Ln show under attack? How do you activate drop_entry? What does cat drop_entry show?

All realservers have tcp_syncookies enabled (1) and tcp_max_syn_backlog=128; the director has the drop_entry var set to 1 (echo 1 > drop_entry). Before compiling the kernel, I set the table size to 2^20. My director has 256 MB and no other applications running.

You don't need such a large table, really.

Francois JEANMOUGIN Francois (dot) JEANMOUGIN (at) 123multimedia (dot) com 04 Nov 2004

I'm currently facing a ddos syn-flood attack against my cluster. Fortunately, those guys do not have enough machines to flood all my servers and the service is still up and running. They seem to use spoofed source IPs (as usual) so I can't even know where it comes from.

Anyway, they have been at it for 24 hours now, and I would like to stop it. Do you have any ideas? Don't tell me that I have to use iptables to reduce the syn rate, I can't :). I have a lot of mobile clients, and the wap gateways can send me a lot of valid syns.

Jacob Coby jcoby (at) listingbook (dot) com 04 Nov 2004

You can try turning on tcp_syncookies: http://www.mail-archive.com/focus-linux@securityfocus.com/msg00185.html

echo 1 > /proc/sys/net/ipv4/tcp_syncookies

I forgot to mention that I've had tcp_syncookies enabled on individual systems for about 3 years now with no problems. I've had it enabled on every machine in a LVS-DR cluster for 6 months with no problems.

With testlvs and two clients, my LVS gets to a denial of service, although cat drop_entry shows me a "1".

director:/etc/lvs# ipvsadm -Ln:
192.168.10.1:80 lc
192.168.1.4:80	Tunnel 	1	0	33246
192.168.1.3:80	Tunnel 	1	0	33244
192.168.1.2:80	Tunnel 	1	0	33246

run testlvs with 100,000 source addresses.

During the flooding attack the connection values stay around this size. Using the SYN-flood tool I tried before, ipvsadm shows me

192.168.10.1:80 lc
192.168.1.4:80	Tunnel 	1	0	356046
192.168.1.3:80	Tunnel 	1	0	355981
192.168.1.2:80	Tunnel 	1	0	356013

so it shows me about ten times as many connections as your tool. I took a look at the packets: both are quite similar, they only differ in the window size (testlvs has 0, the other tool uses a size of 65534) and sequence numbers (o.k., checksum as well).

I am activating drop entry like this:

I switch on my computer (director) and start linux with the LVS Kernel

cd /proc/sys/net/ipv4/vs
echo 1 > drop_entry

Julian

Maybe you need to tune amemthresh. 1024 pages (4MB) is too low a value. How much memory does "free" show under attack? You can try with 1/8 of the RAM size, for example. The main goal of these defense strategies is to keep free memory in the director, nothing more. The defense strategies are activated according to the free memory size. The packet rate is not considered.

joern maier

That all sounds good to me, but what I'm really wondering is why the drop_entry variable still has a value of 1. I thought it had to be 2 when my system is under attack? To me it looks like LVS does not even think it's under attack and therefore does not use the drop_entry mechanism.

You are right. You forgot to specify when LVS should consider itself under attack. drop_entry switches automatically from 1 to 2 when the free memory falls to amemthresh. Do you know that your free memory is below 4MB? See defense strategies.

So, the 1,000,000 entries created by the other tool use 128MB of memory. You have 256MB :) To reduce the amount of memory the kernel sees, boot with mem=128MB (in lilo), or set amemthresh to 32768, or run testlvs with more source addresses (2,000,000). I'm not sure if the last will help if the other tool you use does not limit the number of spoofed addresses. But don't run testlvs with fewer than -srcnum 2000000. If the setup allows a rate > 33,333 packets/sec, LVS can create 2,000,000 entries that expire after 60 seconds (the SYN_RECV timeout). Better not to use the -random option in testlvs for this test.

So, you can test with such large values but make sure you tune amemthresh in production with the best value for your director. The default value is not very useful. You can test whether 1/8 is a good value (8192 for 4K page size).
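
The numbers in this exchange follow from simple arithmetic, sketched below. The figures (4K pages, 1/8 of RAM, 128 bytes per entry, 33,333 packets/sec, 60 second SYN_RECV timeout, 256MB director) come from the thread; the function names are hypothetical and the 256MB is read as 256 MiB.

```python
PAGE_SIZE = 4096      # 4K pages, as in the thread
ENTRY_BYTES = 128     # memory per connection entry


def amemthresh_pages(ram_bytes, fraction=8, page_size=PAGE_SIZE):
    """amemthresh (in pages) that corresponds to 1/fraction of RAM."""
    return ram_bytes // fraction // page_size


def pages_to_bytes(pages, page_size=PAGE_SIZE):
    """Convert an amemthresh value in pages back to bytes."""
    return pages * page_size


def steady_state_entries(packets_per_sec, timeout_secs):
    """Entries alive at once when each SYN creates an entry that
    expires after timeout_secs."""
    return packets_per_sec * timeout_secs


if __name__ == "__main__":
    ram = 256 * 1024 * 1024
    print(amemthresh_pages(ram))                    # 1/8 of 256MB, in pages
    print(pages_to_bytes(1024) // (1024 * 1024))    # default 1024 pages, in MB
    print(pages_to_bytes(32768) // (1024 * 1024))   # 32768 pages, in MB
    print(steady_state_entries(33333, 60))          # close to 2,000,000 entries
    print(1_000_000 * ENTRY_BYTES)                  # bytes used by 1,000,000 entries
```

On the 256MB director in this thread, 1/8 of RAM is the 8192 pages mentioned above, and the suggested 32768 pages is 128MB. The result would be written to amemthresh in the same directory as drop_entry (echo 8192 > amemthresh), assuming that is where the variable lives on your kernel.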

33.25.3. on the design of the DoS preventer

Alan Cox alan (at) lxorguk (dot) ukuu (dot) org (dot) uk

The bi