32. LVS: Performance and Kernel Tuning

We are now (2006) in an era where the CPU is no longer rate determining in an LVS director.

Ratz 20 Feb 2006

The processor never is an issue regarding LVS unless your NIC is so badly designed that the CPU would need to take over the packet processing ;)

32.1. Performance Articles

(a non-LVS article on Configuring large scale unix clusters by Dan Kegel.)

The article performance data for a single realserver LVS shows how to test the network links with netpipe and how to determine the effects of LVS on latency and throughput. Note that the effect of the LVS code on system performance is the difference between the performance with the director box just forwarding packets as a router to the realservers and the director box acting as an LVS director. If you find that the director is causing the throughput to decrease, you first have to determine whether the slowdown is due to the hardware/OS or due to ip_vs. It is possible that the PCI bus or your network cards cannot handle the throughput even when the box is just forwarding packets. Several articles have been published about the performance of LVS by people who did not differentiate the effects of the ip_vs code from slowdown caused by the hardware that the ip_vs code was running on.

Pat O'Rourke has done some performance tests on high end machines.

Padraig Brady padraig (at) antefacto (dot) com 29 May 2002, measured 60usec latency for normal forwarding and 80usec latency for forwarding as a director on his setup.

Ted Pavlic was running 4 realservers with 1016 (4 x 254) RIPs way back (1999?).

Jeremy Kusnet (1 Oct 2002) is running a setup with 53 VIPs, 8 services/VIP, 6 realservers, (53*8*6 = 2544) RIPs.

unknown:

Has anyone on this list used LVS to load balance firewalls? If so, what kind of limitations did you see with regard to Mbps and kpps?

Peter Mueller pmueller (at) sidestep (dot) com 30 Dec 2004

Yes, see the list archives. The limit is the PCI bus: if your traffic is already pushing the limit of the director's PCI bus, LVS won't help.

Anyway, I have not gotten more than 100kpps without using NAPI. Some people reported getting up to 1.2Mpps (!) a few years ago with Intel gigabit NICs, each on its own 64-bit/66MHz card and bus. These figures are with 64 byte packets. There was a post on the quagga archive recently about this; check there for more details.

Did you run into any issues with stateful connections and how many simultaneous connections did it handle?

I'm sure if you run iptables on a router it will drop your numbers, probably by a lot.

32.2. Estimating throughput: Rule of Thumb

People on the LVS mailing list have found that a 400MHz director will saturate a 100Mbps ethernet link.

Somewhere else I read that you need 1Hz of CPU for every bps of I/O. So 100Mbps ethernet (two directions) will need a 200MHz machine.

In the old days, I measured 50Mbps throughput with a 75MHz director, indicating that you need a 150MHz CPU to saturate a 100Mbps link.
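As a back-of-the-envelope check, the two rules of thumb above can be written out as shell arithmetic. This is a sketch only; the numbers are the estimates quoted above, not measurements:

    # rule 1: 1Hz of CPU per bps of I/O, counting both directions of a 100Mbps link
    awk 'BEGIN { link=100e6; printf "rule 1: %d MHz\n", 2*link/1e6 }'
    # rule 2: scale the old measurement (a 75MHz CPU drove 50Mbps)
    awk 'BEGIN { link=100e6; printf "rule 2: %d MHz\n", 75*(link/1e6)/50 }'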

32.3. Estimating throughput: 100Mbps FE is really 8000packets/sec ethernet

If you are just setting up an LVS to see if you can set one up, then you don't care about performance. When you want to put one on-line for other people to use, you'll want to know the expected performance.

On the assumption that you have tuned/tweaked your farm of realservers and you know that they are capable of delivering data to clients at some total rate (bits/sec or packets/sec), you need to design a director capable of routing this number of requests and replies for the clients.

First some background information on networking hardware. At least for Linux (the OS I've measured, see performance data for single realserver LVS), a network rated at 100Mbps is not 100Mbps all the time. It's only 100Mbps when continuously carrying packets of mtu size (1500bytes). A packet with 1 bit of data takes as long to transmit as a full mtu sized packet. If your packets are <ack>s, or 1 character packets from your telnet editing session, or requests for http pages and images, you'll barely reach 1Mbps on the same network. On the performance page, you'll notice that the hit rate increases as the size of the hit targets (in bytes) gets smaller. Hit rate is not necessarily a good indicator of network throughput.

A possible explanation for the ethernet rate being a function only of the packet rate is in an article on Gigabit Ethernet Jumbo Frames (look for the section "Local performance issues") (also see jumbo frames). Each packet requires an interrupt and the per-packet processing overhead sets the limit for TCP performance. This has resulted in a push for larger (jumbo) packets with Gigabit ethernet (e.g. 64kB, a whole NFS frame). The original problem was that the MTU size chosen for 100Mbps ethernet (1500bytes) is the same as for 10Mbps ethernet. This was so that packets traversing the different types of ethernet would not have to be fragmented and defragmented. However the side effect was that the packets are too small for 100Mbps ethernet and you can double your ethernet throughput by doubling your packet size.

Tcpip can't use the full 100Mbps of 100Mbps network hardware, as most packets are paired (data, ack; request, ack). A link carrying full mtu data packets and their corresponding <ack>s will presumably only be carrying 50Mbps. A better measure of network capacity is the packet throughput. An estimate of the packet throughput comes from the network capacity (100Mbps) divided by the mtu size (1500 bytes = 12,000 bits), i.e. about 8333 packets/sec.
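The same estimate as a one-liner (ignoring ethernet headers and the inter-frame gap, so it slightly overstates the real rate):

    # packets/sec for a 100Mbps link carrying back-to-back 1500 byte (MTU sized) frames
    awk 'BEGIN { printf "%d packets/sec\n", 100e6/(1500*8) }'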

Thinking of a network as 100Mbps rather than as ca. 8000 packets/sec is a triumph of marketing. Offered the choice, everyone will buy network hardware rated at 100Mbps, even though that capacity can't be reached with their protocols, over another network rated as running continuously at 8000 packets/sec for all protocols. Only for applications like ftp will near full network capacity be reached (and even then you'll only be running at 50% of the rated capacity, since half the packets are <ack>s). I notice (Jun 2005) that switch specifications, e.g. the Netgear FS516 (http://www.netgear.com/products/details/FS516.php), are quoted in packets/sec rather than bytes/sec.

A netpipe test (the realservers are 75MHz pentiums and can't saturate the 100Mbps network) shows that some packets must be "small". Julian's show_traffic script shows that for small packets (<128 bytes), the throughput is constant at 1200 packets/sec. As packets get bigger (up to mtu size), the packet throughput decreases to 700 packets/sec, and then increases to 2600 packets/sec for large packets.

The constant throughput in packets/sec is a first order approximation of tcpip network throughput and is the best information we have to predict director performance.

In the case where a client is in an exchange of small packets (<mtu size) with a realserver in a LVS-DR LVS, each of the links (client-director, director-realserver, realserver-client) would be saturated with packets, although the bps rate would be low. This is the typical case for non-persistent http when 7 packets are required for the setup and termination of the connection, 2 packets are required for data passing (eg the request GET /index.html and the reply) and an <ack> for each of these. Thus only 1 out of 11 packets is likely to be near mtu size, and throughput will be 10% of the rated bps throughput even though the network is saturated.
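A rough sanity check of that 10% figure; a sketch only, where the 8000 packets/sec ceiling and the 11 packets/hit with one MTU sized packet are the assumptions from the paragraph above:

    awk 'BEGIN { pps=8000; hits=pps/11; mbps=hits*1500*8/1e6;
                 printf "%.0f hits/sec, about %.1f Mbps of data (%.0f%% of 100Mbps)\n", hits, mbps, mbps }'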

The first thing to determine then is the rate at which the realservers are generating/receiving packets. If the realservers are network limited, i.e. the realservers are returning data in memory cache (eg a disk-less squid) and have 100Mbps connections, then each realserver will saturate a 100Mbps link. If the service on the realserver requires disk or CPU access, then each realserver will be using proportionately less of the network. If the realserver is generating images on demand (and hence is compute bound) then it may be using very little of the network and the director can be handling packets for another realserver.

The forwarding method affects packet throughput. With LVS-NAT all packets go through the director in both directions. As well the LVS-NAT director has to rewrite incoming and reply packets for each realserver. This is a compute intensive process (but less so for 2.4 LVS-NAT). In a LVS-DR or LVS-Tun LVS, the incoming packets are just forwarded (requiring little intervention by the director's CPU) and replies from the realservers return to the client directly by a separate path (via the realserver's default gw) and aren't seen by the director.

In a network limited LVS, for the same hardware, because there are separate paths for incoming and returning packets with LVS-DR and LVS-Tun, the maximum (packet) throughput is twice that of LVS-NAT. Because of the rewriting of packets in LVS-NAT, the load average on a LVS-NAT director will be higher than for a LVS-DR or LVS-Tun director managing twice the number of packets.

In a network bound situation, a single realserver will saturate a director of similar hardware. This is a relatively unusual case for the LVS's deployed so far. However it's the situation where replies are from data in the memory cache on the realservers (eg squids).

With a LVS-DR LVS, where the realservers have their own connection to the internet, the rate limiting step is the NIC on the director which accepts packets (mostly <ack>s) from the clients. The incoming network is saturated for packets but is only carrying low bps traffic, while the realservers are sending full mtu sized packets out their default gw (presumably the full 100Mbps).

The information needed to design your director then is the number of packets/sec your realserver farm is delivering. The director doesn't know what's in the packets (being an L4 switch) and doesn't care how big they are (1 byte of payload or full mtu size).

If the realservers are network limited, then the director will need the same CPU and network capacity as the total of your realservers. If the realservers are not network limited, then the director will need correspondingly less capacity.

If you have 7 network limited realservers with 100Mbps NICs, then they'll be generating an average of 7x8000 = 56,000 packets/sec. Assuming the packets arrive randomly, the standard deviation for 1 second's worth of packets is +/- sqrt(56000), about 237 (i.e. small compared to the rate of arrival of packets). You should be able to connect these realservers to a 1Gbps NIC via a switch, without saturating your outward link.
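The same arithmetic spelled out (the per-realserver rate is the assumed figure from above; the spread is just the square root of the count, as for a Poisson arrival process):

    awk 'BEGIN { n=7; pps=8000; total=n*pps;
                 printf "%d packets/sec total, +/- %.0f over any one second\n", total, sqrt(total) }'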

If you are connected to the outside world by a slow connection (eg T1 line), then no matter how many 8000packet/sec realservers you have, you are only going to get 1.5Mbps throughput (or half that, since half the packets are <ack>s).

Note: The carrying capacity of 100Mbps network of 8000packets/sec may only apply to tcpip exchanges. My 100Mbps network will carry 10,000 SYN packets/sec when tested with Julian's testlvs program.

Wayne wayne (at) compute-aid (dot) com 03 Apr 2001

The performance page calculates the acks as 50% or so of the total packets. I think that might not be accurate, since with twisted-pair and full duplex mode, acks and requests travel on two different pairs. Even in half duplex mode, the packets for the two directions are transmitted over two pairs, one for send and one for receive; only the card and driver decide whether they are handled in full duplex or half duplex mode. So a full duplex card should be able to carry 8000 packets/sec in each direction all the time.

Joe: presumably for any particular connection, the various packets have to be sent in order and whether they are travelling over one or two pairs of cables would not matter. However multiple connections may be able to make use of both pairs of wires.

Unfortunately we only can approximately predict the performance of an LVS director. Still the best estimates come from comparing with a similar machine.

The performance page shows that a 133MHz pentium director can handle 50Mbps throughput. With 2.2 kernel LVS-NAT, the load average on the director is unusably high, but with LVS-DR, the director has a low load average.

Statements on the website indicate that a 300MHz pentium LVS-DR director running a 2.2.x kernel can handle the traffic generated by a 100Mbps link to the clients. (A 550MHz PIII can direct 120Mbps.)

Other statements indicate that single CPU high end (800MHz) directors cannot handle 1Gbps networks. Presumably multiple directors or SMP directors will be needed for Gbps networks. (also see the section on SMP doesn't help.)

From: Jeffrey A Schoolcraft dream (at) dr3amscap3 (dot) com 7 Feb 2001

I'm curious if there are any known DR LVS bottlenecks? My company had the opportunity to put LVS to the test the day following the superbowl when we delivered 12TB of data in 1 day, and peaked at about 750Mbps.

In doing this we had a couple of problems with LVS (I think they were with LVS). I was using the latest lvs for 2.2.18, and ldirectord to keep the machines in and out of LVS. The LVS servers were running redhat with an EEPro100. I had two clusters, web and video. The web cluster was a couple of 1U's with an acenic gig card, running 2.4.0, thttpd, with a somewhat performance tuned system (parts of the C10K). At peak our LVS got slammed with 40K active connections (so said ipvsadm). When we reached this number, or sometime before, LVS became inaccessible. I could however pull content directly from a server, just not through the LVS. LVS was running on a single proc p3, and load never went much above 3% the entire time; I could execute tasks on the LVS but http requests weren't getting passed along.

A similar thing occurred with our video LVS. While our realservers aren't quite capable of handling the C10K, we did about 1500 apiece and maxed out at about 150Mbps per machine. I think this is primarily the modem users' fault. I think we would have pushed more bandwidth to a smaller number of high bandwidth users (of course).

I know this volume of traffic choked LVS. What I'm wondering is, if there is anything I could do to prevent this. Until we got hit with too many connections (mostly modems I imagine) LVS performed superbly. I wonder if we could have better performance with a gig card, or some other algorithm (I started with wlc, but quickly changed to wrr because all the rr calculations should be done initially and never need to be done again unless we change weights, I thought this would save us).

Another problem I had was with ldirectord and the test (negotiate, connect). It seemed like I needed some type of test to put the servers in initially; then too many connections happened so I wanted no test (off), but the servers would still drop out from ldirectord. That's a snowball type problem for my amount of traffic: one server gets bumped because it's got too many connections, then the other servers get overloaded and get dropped too, and then I'll have an LVS directing to localhost.

So, if anyone has pushed DR LVS to the limits and has ideas to share on how to maximize its potential for given hardware, please let me know.

32.4. Jumbo frames

All users of ethernet should understand the effects of MTU size on packet throughput and why you need jumbo frames. The problem is that the MTU of 1500 bytes was designed for the original implementation of ethernet at 3Mbps. The clock speed was upped to 10Mbps for commercial release, but the MTU was not changed, presumably so as not to change the required buffer size. When 100Mbps ethernet arrived, the MTU was kept for backward compatibility on mixed 10/100 networks. The MTU is 30 times too small for 100Mbps and 300 times too small for 1Gbps ethernet.

Joe 26 Apr 2002 (posting to the beowulf mailing list)

I know that jumbo frames increase throughput rate on GigE and was wondering if a similar thing is possible with regular FE.

Donald Becker becker (at) scyld (dot) com 26 Apr 2002

I used to track which FE NICs support oversized frames. Jumbo frames turned out to be so problematic that I've stopped maintaining the table.

The MTU of 1500 was chosen for 10Mbps ethernet and was kept for 100Mbps and 1Gbps ethernet for backwards compatibility on mixed networks. However MTU=1500 is too small for 100Mbps and 1Gbps ethernet. In Gbps ethernet, jumbo frames (ie a bigger MTU) are used to increase throughput.

Yup, 1500 bytes was chosen for interactive response on original Ethernet. (Note: originally Ethernet was 3Mbps, but commercial equipment started at 10Mbps.) The backwards compatibility issue is severe. The only way to automatically support jumbo frames is using the paged autonegotiation information, and there is no standard established for this.

Jumbo frames *will* break equipment that isn't expecting oversized packets. If you detect a receive jabber (which is what a jumbo frame looks like), you are allowed to (and _should_) disable your receiver for a period of time. The rationale is that a network with an on-going problem is likely to be generating flawed packets that shouldn't be interpreted as valid.

With netpipe I found that throughput on FE was approx linear with increasing MTU upto the max=1500bytes. I assume that there is no sharp corner at 1500 and if in principle larger frames could be sent, then throughput should also increase for FE. (Let's assume that the larger packets will never get off the LAN and will never need to be fragmented).

I couldn't increase the MTU above 1500 with ifconfig or ip link. I found that the MTU seemed to be defined in

linux/include/if_ether.h
as
ETH_DATA_LEN and ETH_FRAME_LEN

and increased these by 1500, recompiled the kernel and net-tools and rebooted. I still can't install a device with MTU>1500
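For reference, this is the sort of thing that was being attempted; a hedged sketch only, where eth0 is a placeholder for your NIC, and on the hardware discussed above both commands simply refuse anything over 1500:

    # show the current MTU of the NIC
    ip link show dev eth0
    # attempt to raise it; this only succeeds if the NIC and its driver support jumbo frames
    ip link set dev eth0 mtu 9000
    # the older equivalent
    ifconfig eth0 mtu 9000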

VLAN sends packets larger than the standard MTU, carrying an extra 4 bytes of out-of-band data. The VLAN people have problems with larger MTUs. Here's their mailing list

http://www.WANfear.com/pipermail/vlan/

where I found the following e-mails

http://www.WANfear.com/pipermail/vlan/2002q2/002385.html
http://www.WANfear.com/pipermail/vlan/2002q2/002399.html
http://www.WANfear.com/pipermail/vlan/2002q2/002401.html

which indicate that the MTU is set in the NIC driver and that in some cases the MTU=1500 is coded into the hardware or is at least hard to change.

Most of the VLAN people don't initially understand the capability of the NICs, or why disabling Rx length checks is a Very Bad Idea. There are many modern NIC types that have explicit VLAN support, and VLAN should only be used with those NICs. (Generic clients do not require VLAN support.)

I don't know whether regular commodity switches (eg Netgear FS series) care about packet size, but I was going to try to send packets over a cross-over cable initially.

Hardware that isn't expecting to handle oversized frames might break in unexpected ways when Rx frame size checking is disabled. Breaking for every packet is fine. Occasionally corrupting packets as a counter rolls over might never be pinned on the NIC.

The driver also comes into play. Most drivers are designed to receive packets into a single skbuff, assigned to a single descriptor. With jumbo frames the driver might need to be redesigned with multiple descriptors per packet. This adds complexity and might introduce new race conditions. Another aspect is that dynamic Tx FIFO threshold code is likely to be broken when the threshold size exceeds 2KB. This is a lurking failure -- it will not reveal itself until the PCI is very busy, then Boom...

Most switches very much care about packet size. Consider what happens in store-and-forward mode.

All of these issues can be fixed or addressed on a case-by-case basis. If you know the hardware you are using, and the symptoms of the potential problems, it's fine to use jumbo frames. But I would never ship a turn-key product or preconfigured software that used jumbo frames by default. It should always require expertise and explicit action for the end user to turn it on.

Josip Loncaric josip (at) icase (dot) edu 29 Apr 2002

The backwards compatibility issue is severe

Jumbo frames are great for reducing host frame processing overhead, but, unfortunately, we arrived at the same conclusion: jumbo frames and normal equipment do not mix well. If you have a separate network where all participants use jumbo frames, fine; otherwise, things get messy.

Alteon (a key proponent of jumbo frames) has some suggestions: define a normal frame VLAN including everybody and a (smaller) jumbo frame VLAN; then use their ACEswitch 180 to automatically fragment UDP datagrams when routing from a jumbo frame VLAN to a non-jumbo frame VLAN (TCP is supposed to negotiate MTU for each connection, so it should not need this help). This sounds simple, but it requires support for 802.1Q VLAN tagging in Linux kernel if a machine is to participate in both jumbo frame and in non-jumbo frame VLAN. Moreover, in practice this mix is fragile for many reasons, as Donald Becker has explained...

One of the problems I've seen involves UDP packets generated by NFS. When a large UDP packet (jumbo frame MTU=9000) is fragmented into 6 standard (MTU=1500) UDP packets, the receiver is likely to drop some of these 6 fragments because they are arriving too closely spaced in time. If even one fragment is dropped, the NFS has to resend that jumbo UDP packet, and the process can repeat. This results in a drastic NFS performance drop (almost 100:1 in our experience). To restore performance, you need significant interrupt mitigation on the receiver's NIC (e.g. receive all 6 packets before interrupting), but this can hurt MPI application performance. NFS-over-TCP may be another good solution (untested!).

We got good gigabit ethernet bandwidth using jumbo frames (about 2-3 times better than normal frames using NICs with Alteon chipsets and the acenic driver), but in the end full compatibility with existing non-jumbo equipment won the argument: we went back to normal frames. The frame processing overhead does not seem as bad now that CPUs are so much faster (2GHz+), even with our gigabit ethernet, and particularly not with fast ethernet.

However, if we had a separate jumbo-frame-only gigabit ethernet network, we'd stick to jumbo frames. Jumbo frames are simply a better solution for bulk data transfer, even with fast CPUs.

32.5. Network Latency

Network latency in an LVS is determined by the internet and is beyond the control of the person setting up the LVS. In a beowulf, the network is local and latency is important. Here's a posting from the beowulf mailing list about latency and throughput for small packets on Gbps (GigE) ethernet.

Richard Walsh rbw (at) ahpcrc (dot) org 07 Mar 2003

In the limit of a 1 byte message, the inverse of the latency is the worst-case bandwidth for repeatedly sending 1 byte. On a GigE system with a latency of say 50 usecs your worst case bandwidth is 20 KB/sec :-(. This is mostly a hardware number. If you add in other contributors to the latency things get worse. As message size shrinks latency eventually dominates the transfer time ... the larger the latency the sooner this happens. Under the heading of "everything is just another form of something else", the distinction between latency and bandwidth gets muddy as latency grows relative to message size.

On the other hand, if you can manage your message sizes, keep the latency piece a small percentage of the message transit time, and have good bandwidth, you may not care what the latency is. Pushing up data volumes per node implies larger surfaces to communicate, which implies larger messages. These transfers can be hidden behind compute cycles. Of course, one has to worry about faster processors shrinking compute cycles.

32.6. Mixture of 100Mbps and GigE ethernet

Jeremy Kusnetz

We are planning on upgrading the network on our realservers to gigE to support a gigE connection to our NFS server. I need to have gigE on the realservers due to potential buffering issues losing NFS udp packets coming from the NFS server.

Now that the realservers will be on gigE, I can see the potential for the realservers to send data to the director faster than the director's internal 100mb connection can handle, and start buffering packets on the switch. Because of that I'm planning on putting a gigE interface on the internal connection of the director, but leaving a 100mb nic connecting the director to the outside routers. Now the director would be buffering data coming in at gigE speeds and sending out the data at 100mb speeds. Am I going to have any problems on the director doing this kind of buffering? I figure it could probably handle it better than the switch could. Am I right?

Ryan Leathers ryan (dot) leathers (at) globalknowledge (dot) com 29 Mar 2004

TCP is your friend. Even if you had jumbo frames enabled and large payloads, it's extremely unlikely that TCP would fail to sufficiently throttle the delivery rate before you would bump into a hard buffer. TCP is decoupled from the underlying transports. While this is inefficient in obvious ways, it is also precisely what protects us from situations like the one you describe. The down side is that TCP doesn't really get with the program until a problem BEGINS to happen. GigE on your director will reduce some of the switching burden, but ultimately it's TCP's behavior which will throttle end-to-end traffic within the tolerable capacity of your infrastructure. If top performance is of concern you might consider traffic shaping on the realservers for egress traffic.

Unfortunately we have some UDP protocols we are load balancing, namely DNS and radius. I'm not too worried about TCP traffic, but I am worried about losing UDP packets. Maybe I should keep the interconnects between the realservers and the director 100DB like it is now, but do NFS on a separate network off of the gigE cards.

32.7. NICs and Switches, 100Mbps (FE) and 1Gbps (GigE)

(Apr 2002)

If you are going into production, you should test that your NICs and switch work well with the hardware in your node. Give them a good exercising with a netpipe test (see the performance page where the netpipe test is used). Netpipe will determine the network latency, the maximum throughput and whether the hardware behaves properly under stress. Latency determines system speed for processes transferring small packets over a small number of hops (usually one hop), while maximum throughput determines your system behaviour for large numbers of MTU sized packets.

The beowulfers have done the most to find out which network hardware is useful at high load. You can look through the whole beowulf mailing list archive after downloading it. Unfortunately there is no online keyword search like we have on the LVS mailing list (thanks to Hank). Beowulfers are interested in both latency and throughput. If your LVS is sending packets to clients on the internet, your latency will be determined by the internet. If your connection to your clients is through a T1 line, your maximum throughput will also be determined by the internet. The beowulfers use either 100Mbps FE or, for high throughput, myrinet. They don't use 1Gbps ethernet as it doesn't scale and is expensive.

Network performance is expensive. In a beowulf with myrinet, half the cost of the hardware is in the networking. The difference between a $100k beowulf and a $5M Cray, which has the same number, type and speed of commodity DEC Alpha CPUs, is the faster interconnects in the Cray. With a Cray supercomputer, you aren't buying fast CPUs (Cray doesn't make its CPUs anymore, they're using the same CPUs that you can buy in a desktop machine), what you're buying is the fast onboard networking between the CPUs.

32.7.1. 100Mbps

Martin Seigert at Simon Fraser U posted benchmarks for various NICs to the beowulf mailing list. The conclusion was that for fast CPUs (i.e. 600MHz, which can saturate 100Mbps ethernet) the 3c95x and tulip cards were equivalent.

Note

from a posting on the beowulf mailing list: "the tulip is the ne2k of the late 90's". (I use them when I can - Joe).

For slower CPUs (166MHz) which cannot saturate 100Mbps ethernet, the on-board processing on the 3Com cards allowed marginally better throughput.

I use Netgear FA310TX (tulip), and eepro100 for single-port NICs. The related FA311 card seems to be Linux incompatible (postings to the beowulf mailing list), currently (Jul 2001) requiring a driver from Netgear (this was the original situation with the FA310 too). I also use a quad DLink DFE-570TX (tulip) on the director. I'm happy with all of them.

The eepro100 has problems, as Intel seems to change the hardware without notice and the linux driver writers have trouble handling all the versions of hardware. One kernel (2.2.18?) didn't work with the eepro100, and new kernels seem to have problems occasionally. I bought all of my eepro100's at once, so presumably they are identical and presumably the bugs have been worked out for them. There have been a relatively large number of postings from people with eepro100 problems on the LVS mailing list. You should expect continuing problems with this card, which will be incrementally solved by kernel patches.

32.7.2. 100Mbps switches

These are now cheap. The parameters that determine the performance of a switch are

  • the backplane bandwidth

    This is the total rate of packet or bit throughput that the switch can handle through all ports at once. If you have an 8 port 100Mbps switch, you need a backplane bandwidth of 400Mbps to allow all 4 pairs of ports to talk at the same time.

    A hub is just a switch that only allows one pair of ports to talk at a time.

  • cut through, store and forward

    At low packet throughput a switch will "cut through", i.e. after decoding the dst_lladdr (the MAC address of the target) on the packet, it will switch the packet through to the appropriate port before the rest of the packet has arrived at the switch. This will ensure low latency for packet transfer. You can test switch latency by replacing the switch with a cross-over cable.

    When the packet throughput exceeds the backplane bandwidth, the switch can store packets till there is space on the link layer. This is called "store and forward"

    What you want to know is whether the switch does cut through and/or store and forward, and if it does both, when the change over occurs.

  • large packet handling

    You want to know what the switch does with packets larger than the standard 1500byte MTU.

Unfortunately the suppliers of commodity switches, Netgear, 3COM and HP are less than forthcoming on their specs.

(Apr 2002, Extreme Networks, quote backplane, which they call "switch fabric", bandwidth for their products, which include GigE switches. Sep 2002 - they've removed the webpage.)

Netgear gives the most information (and has the cheapest switches) and I bought my switch from them, as a way of supporting their efforts here. Someone I know who buys a lot of switches said he had some Netgear switches arrive DOA. They were returned without problem, but manufacturers should be shipping working boxes. 3COM gives less information than Netgear, while HP just wants to know whether you are using the device at home or at a business and they'll tell you which box you need. I started seeing advertisements for Dell switches in Sept 2001, at lower prices than for Netgear, but the Dell website didn't acknowledge that they existed and a contact at Dell couldn't find specs for them. The Dell switches appeared to be rebadged equipment from another networking company. None of these suppliers give the necessary information listed above - they aren't selling switches, they're selling "productivity enhancement solutions".

I'm happy with my Netgear FS series switch, but then I haven't tested it with 100Mbps simultaneously on all ports.

Since the price of a switch rises exponentially with its backplane bandwidth (required for more ports), an often suggested solution (which I haven't tested), is to divide your network into smaller groupings of computers, connected by 2 layers of switches (this will increase your network latency, since now two hops may be required).

32.7.3. 550MHz CPU saturates 100Mbps ethernet

Martin Hamilton martin (at) net (dot) lut (dot) ac (dot) uk Nov 14 2001

we (JWCS) also use LVS on our home institutional caches. These are somewhat smaller scale, e.g. some 10m URLs/day at the moment for Loughborough's campus caches vs. 130m per day typically on the JANET caches. The good news is that LVS in tunnelling mode is happily load balancing peaks of 120MBit/s of traffic on a 550MHz PIII.

Folk in ac.uk are welcome to contact us at support (at) wwwcache (dot) ja (dot) net for advice on setting up and operating local caches. I'm afraid we can only provide this to people on the JANET network, like UK Universities and Colleges.

32.7.4. 1Gbps (GigE) NICs

Here's a review of GigE over copper wire. All current (May 2002) GigE NICs support jumbo frames and are cheap (US$100). The best latency (SysKonnect SK9821 NIC) is 50usecs. While this is nice, I'm getting 150usec on a pair of 100Mbps ethernet cards connected between a 133MHz pentium 1 and a dual 180MHz Pentium Pro at a fraction of the cost. The SysKonnect card can deliver 900Mbps with jumbo frames.

Here's the (http://www.scd.ucar.edu/nets/docs/reports/HighSpeed/ - link dead Jul 2004) NCAR High Performance Networking Tests which gives background info on fast networks (ATM and GigE). The main point for us is that the jumbo frame formats are proprietary and non-interoperable (we need a standard here).

32.7.5. 1Gbps (GigE) switches

While GigE NICs are cheap, the switches are still expensive. The suppliers of commodity GigE switches are even less forthcoming with their specs than they are for their 100Mbps switches.

(Note: Apr 2002, I just found "http://www.extremenetworks.com/products/products.asp which quote backplane, which they call "switch fabric", bandwidth for their products, which include GigE switches. Note: Oct 2002, link is dead.)

It seems that the manufacturers would rather you figure out the specs yourself. Netgear is selling a 24 port GigE switch (according to a vendor), but all you can find on the Netgear website (Apr 2002) is a 100Mbps switch with 4 GigE ports. Some of the vendors want you to figure out the existence of their boxes too.

Here's my estimate of commodity GigE switch specs:

A 24 port cisco 6000 series GigE chassis (no added boards) costs US$60k, has a backplane bandwidth of 32Gbps and supports jumbo frames. The commodity switches (e.g. the 24 port HP4108) cost US$6k and do not support jumbo frames. On the assumption that the backplane bandwidth is what you're paying for, the spec on the HP box would be about 3Gbps, i.e. only 3 pairs of ports on your 24 port box can be active at once. This box is not much more than a hub, something the manufacturers would not want you to know.

You need jumbo frames and you can't have them. It is pointless going to GigE unless you have jumbo frames. These switch specs explain why GigE scales so badly and why beowulfers would rather stay with 100Mbps than spend any money on GigE.

For some performance data see Gigabit ethernet TCP switch performance.

32.8. Ethernet,NIC Bonding

This has not proven reliable or easy to setup, at least in the hands of the Beowulfers. Make sure your LVS is working properly before trying ethernet bonding.

Craig Ward

Are there any known issues with bonding NICs and LVS? I've got a setup with 4 boxes; all 4 are web servers and 2 are directors. The VIP is being brought up on bond0:0 fine, but I can't ping this from any machine and it's not showing in the arp table on my windows client. Strangely, if I manually bring up an ip on bond0:0 that is NOT the VIP, I can ping it fine. I'm wondering if one of the noarp rules on each director, used for when they are slave directors, is somehow "stuck" and it's not arping for the VIP whatever interface it's on?

Brad Hudson brad (dot) hudson (at) gmail (dot) com 5 Nov 2005

Here is how I use bonding:

$node = IPVS server
$real = real server
$fip = front end ip
$fvip = front end vip
$bip = back end ip
$bvip = back end vip
  1. $node has eth1 and eth3 bonded together on $fip as bond0
  2. $fvip sits on bond0:0 and accepts all incoming requests for load balancing
  3. $node has eth0 and eth2 bonded together on $bip as bond1
  4. $bvip sits on bond1:0 and talks to all real servers where each $real has a gateway of $bvip

FYI: $node is also in failover cluster #1 and all real servers are in load balanced cluster #2
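For readers who haven't set up bonding before, here is a minimal hand-configured sketch of the front-end half of such a layout. The interface names, addresses and the bonding mode are placeholder assumptions, not Brad's actual values, and most distributions do this through their network init scripts instead:

    # load the bonding driver (mode 1 = active-backup and miimon=100 are example choices)
    modprobe bonding mode=1 miimon=100
    # bring up bond0 on the front end IP ($fip) and enslave the two NICs
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth1 eth3
    # put the front end VIP ($fvip) on the bond0:0 alias
    ifconfig bond0:0 192.168.1.11 netmask 255.255.255.255 up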

32.9. NIC problems - eepro100

32.9.1. counter overflows

(This is from 1999 I think)

linux with an eepro100 can't pass more than 2^31-1 packets. This may not still be a problem.

Jerry Glomph Black black (at) real (dot) com Subject: 2-billion-packet bug?

I've seen several 2.2.12/2.2.13 machines lose their network connections after a long period of fine operation. Tonight our main LVS box fell off the net. I visited the box, it had not crashed at all. However, it was not communicating via its (Intel eepro100) ethernet port.

The evil evidence:

eth0      Link encap:Ethernet  HWaddr 00:90:27:50:A8:DE
          inet addr:172.16.0.20  Bcast:172.16.255.255  Mask:255.255.0.0
          UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
          RX packets:15 errors:288850 dropped:0 overruns:0 frame:0
          TX packets:2147483647 errors:1 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:100
          Interrupt:10 Base address:0xd000

Check out the TX packets number! That's 2^31-1. Prior to the rollover, In-and-out packets were roughly equal. I think this has happened to non-LVS systems as well, on 2.2 kernels. ifconfigging eth0 down-and-up did nothing. A reboot (ugh) was necessary.

It's still happening 2yrs later. This time the counter stops, but the network is still functional.

Hendrik Thiel thiel (at) falkag (dot) de 20 Nov 2001

using lvs with eepro100 cards (kernel 2.2.17) and encountered a TX packets value stopping at 2147483647 (2^31-1) - that's what ifconfig tells us... the system still runs fine ...

it seems to be an ifconfig bug - the TX packets counter is stuck at the same 2^31-1 value shown in the output above.

Simon A. Boggis

Hmmm, I have a couple of eepro100-based linux routers - the one that's been up the longest is working fine (167 days, kernel 2.2.9) but the counters are jammed - for example, `ifconfig eth0' gives:

eth0 Link encap:Ethernet HWaddr 00:90:27:2A:55:48
inet addr:138.37.88.251 Bcast:138.37.88.255 Mask:255.255.255.0
IPX/Ethernet 802.2 addr:8A255800:0090272A5548
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:2147483647 errors:41 dropped:0 overruns:1 frame:0
TX packets:2147483647 errors:13 dropped:0 overruns:715 carrier:0
Collisions:0
Interrupt:15 Base address:0xa000

BUT /proc/net/dev reports something more believable:

hammer:/# cat /proc/net/dev
Inter-| Receive | Transmit
face |bytes packets errs drop fifo frame compressed multicast|bytes packets errs drop fifo colls carrier compressed
eth0:2754574912 2177325200 41 0 1 0 0 0 2384782514 3474415357 13 0 715 0 0 0

That's RX packets: 2177325200 and TX packets: 3474415357, compared to 2147483647 from ifconfig eth0.
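A quick way to read those counters directly, skipping ifconfig's stuck display; this assumes the /proc/net/dev field layout shown above (RX packets is the 3rd field and TX packets the 11th once the ':' is stripped):

    grep 'eth0:' /proc/net/dev | tr ':' ' ' | \
        awk '{ print "RX packets:", $3, " TX packets:", $11 }'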

Note

From Purrer Wolfgang www (dot) purrer (dot) at 24 Apr 2003: If Donald Becker's drivers aren't helping, you can always get the drivers from Intel. (Joe - it's an appallingly designed site).

32.9.2. new drivers

Andrey Nekrasov

After I changed to kernel 2.4.x with "arp hidden patch"

eepro100: wait_for_cmd_done timeout!

I haven't had any problems before with NIC Intel EEPRO/100.

Julian 4 Feb 2002

This problem happens not only with LVS. Search the web or linux-kernel: http://marc.theaimsgroup.com/?t=100444264400003&r=1&w=2

32.9.3. Kernel 2.4.18

Peter Mueller pmueller (at) sidestep (dot) com 22 May 2002

2.4.18 has an eepro100 bug in it. Looking in dejanews, there is a slowdown under some circumstances. You should use the driver from 2.4.17.

32.9.4. bonding with eepro100

Roberto Nibali ratz (at) drugphish (dot) ch 07 Nov 2003

The eepro100 never really worked well for me in conjunction with bonding. I also use the e100/e1000 drivers as suggested by Brian.

Also note that the bonding architecture has gotten a major (actually huge) overhaul between the 2.4.21 and 2.4.23-preX phase. Among the changes are the possibility to set and change the MAC and the MTU in ALB/TLB modes, fixed arp monitoring, better 802.3ad support and proper locking. You might wanna play with some of the newer 2.4.23-pre kernels, or at least with 2.4.22, if the problem persists; then again it's highly time-consuming to follow the latest ("stable") development kernels currently.

32.10. NIC problems - tulip

(Joe, Nov 2001 I don't know if this is still a problem, we haven't heard any more about it and haven't had any other tulip problems, unlike the eepro100.)

John Connett jrc (at) art-render (dot) com 05 May 1999

Any suggestions as to how to narrow it down? I have an Intel EtherExpress PRO 100+ and a 3COM 3c905B which I could try instead of the KNE 100TX to see if that makes a difference.

A tiny light at the end of the tunnel! Just tried an Intel EtherExpress PRO 100+ and it works! Unfortunately, the hardware is fixed for the application I am working on and has to use a single Kingston KNE 100TX NIC ...

Some more information. The LocalNode problem has been observed with both the old style (21140-AF) and the new style (21143-PD) of Kingston KNE 100TX NIC. This suggests that there is a good chance that it will be seen with other "tulip" based NICs. It has been observed with both the "v0.90 10/20/98" and the "v0.91 4/14/99" versions of tulip.c.

I have upgraded to vs-0.9 and the behaviour remains the same: the EtherExpress PRO 100+ works; the Kingston KNE 100TX doesn't work.

It is somewhat surprising that the choice of NIC should have this impact on the LocalNode behaviour but work successfully on connections to slave servers.

Any suggestions as to how I can identify the feature (or bug) in the tulip driver would be gratefully received. If it is a bug I will raise it on the tulip mailing list.

32.11. dual/quad ethernet cards, IRQ sharing problems

any recommendations?

Ratz 04 Dec 2002: We're using Adaptec Quadboards (Adaptec ANA-62044, 64-bit PCI) and they work like a charm. You can stick 6 of those in an Intel Serverboard and have 24 NICs. We're however testing the new Intel Quadboards that will officially be available in Q1 2003. We chose Adaptec because in the past we've had bad experiences with badly broken DLink hardware. This mostly concerned their switches. But once a product sheds bad light on the decision, you will hardly convince yourself that the rest of the product line works correctly, IMHO.

Are IRQ-sharing lockups a thing of the past?

P3-600's: I very well remember and we have delivered a few such packet filters without any major problems.

ASUS P*B-*: never had a single problem. Ok, we use a 2.2.x kernel with some enhancements of mine (not IRQ routing related). There should be no problem.

The boards in this era that I used had this problem, and guides like Anandtech and tomshardware advised configuring the BIOS to have each PCI slot be a set IRQ.

Do _not_ do that! Linux will choose an IRQ for the PCI slot, and depending on whether the board has SCSI or IDE the IRQ wired routing on the local APIC is different. Forcing an IRQ on a specific PCI slot makes ASUS boards with older firmware releases go bananas when assigning the IRQ routing, especially those with an onboard SCSI chip. There you have a reversed initialisation phase. Also, if you're using the PCI-sharing option from the BIOS, make sure to enable PCI-2.x compliance and use an up-to-date BIOS release. And last but not least: all this doesn't work if you use Realtek-chipset based NICs. They are fundamentally flawed when it comes to IRQ sharing. Nowadays this is solved, however, and you can use this el-cheapo NIC.

Nowadays you can look in the motherboard booklet and see the wiring. If you intend to put in an additional SCSI card you need to make sure that the routing is separated. In most 5 to 6 PCI-slot boards, you could for example select slots 1 and 2 for separation since they are not routed over the same chip. It depends on the bridge, however.

This all changes if you have a SMP board (how could it be any different of course :)). There you need to distinguish every single motherboard factorisation to know how to solve the eventual problem of deplaced IRQ sharing. It will very much depend on the PCI chipset support in the kernel (in Windows world this would be the busmaster driver).

Ok.. so IRQ sharing is FIXED in 99% of situations now? I can take 2 quad cards from different manufacturers and put them in the same box and they will work on the same IRQ (from the BIOS perspective)?

This is not said. All 4 ports of a single quadboard will have the same IRQ, but if you put in a second quadboard from a different vendor your machine might just end up using a different IRQ. Interport IRQ routing on a single quadboard is almost always shared. Also you need to take into account that this can change if you enable the local APIC on UP, or APIC in general for SMP boards. There you most probably end up with less probability of shared IRQs. However you end up with bigger problems with certain Intel boards based on the 440GX+/440LX+ chipset.

So SCSI needs to be a separate IRQ from the rest? Don't share SCSI. What about Firewire or USB2 or ... ;)

I'm not that much into firewire and USB2, since having a packet filter or high traffic node with quadboards normally implies not needing any firewire or USB2 devices. YMMV.

Have you looked at the AMD bus at all?

If you're going SMP then yes, pay 200 bucks more and get a decent board with EMP and MCE support via console (UART). I have looked at the AMD boards, but in our lab we've found them to be less ready to work properly with the rest of our hardware than Intel based boards. I wish support for AMD boards in Linux were better, but this is just a matter of time.

32.12. Flakey Switch

Here a user serving Windows Media Server tracked down a poor performance problem to a possibly flakey switch.

Mark Weaver mark (at) npsl (dot) co (dot) uk 23 Mar 2004

The test client is using the Windows Media Load Simulator. This just makes a lot of connections and streams back data. The average stream only gets up to about 35Mbit. At this point, CPU usage on the director is ~20% (which would seem to indicate that I should be able to get a lot more out of it). CPU on the test box is at about 25% and on the media server at 4%.

The problematic part is that the director begins dropping about 10% of externally originated packets at this level of load. I wouldn't say any machine involved is stressed here, but pinging the external IP of the director gives that huge loss. This noticeably affects, say, SSH on the director or TS to the media server. This is contrasted with pinging the external IP of the test box, which gives 0% loss.

I would therefore conclude that this is an issue with the director, but I'm not sure what. My next guess would be to try swapping the VIA NIC for another 3com one, but could it really be that bad? I can't see it being an issue with the cisco switch (test box and director are both connected to it); the cisco router (same), or the d-link switch (not involved in ping to director), so I'm at a loss as to what else to conclude.

I trundled myself down to the hosting centre to do some further testing. It turns out that when plugging a couple of test clients directly into the switch in front of the director, I can get 90Mbits sustained load out of it using around 40% CPU on the director, nearly 100% CPU on both test clients, 5% CPU on the WMS machine, and ~1.5k concurrent connections. Fantastic!

The issue then, appears to be either the cisco switch or the cable connecting it; there is nothing else left to test. I did swap the NICs out for eepros (but the problem still persists when stressing through the cisco kit). Safe to say, I'm very, very impressed by this!

32.13. performance testing tools

32.13.1. Web Polygraph

Dennis Kruyt, 9 Jan 2002

I am looking for software for testing my lvs webservers with persistent connections. With normal http benchmarking tools all requests come from one IP, but I want to simulate a few hundred connections from different IPs with persistent connections.

Joe Cooper joe (at) swelltech (dot) com 09 Jan 2002

Web Polygraph is a benchmarking framework originally designed for web proxies. It will generate thousands of IPs on the client box if you request them. It does not currently have a method to test existing URLs, as far as I know (it provides its own realserver(s) and content, so that data is two-sided). It currently works very well for stress-testing an LVS balancer, but for the realservers themselves it probably needs a pretty good amount of coding. The folks who developed it will add features for pretty good rates, particularly if the new features fit in with their future plans. It does have some unfortunate licensing restrictions, but is free to get, use and modify for your own internal purposes.

32.13.2. getmeter

Alexandre Cassen Alexandre (dot) Casseni (at) wanadoo (dot) fr 13 Jan 2003 (2008, Alexandre is now at Alexandre (dot) Cassen (at) free (dot) fr)

getmeter: Simple tool for emulating a multi-threaded web browser. The code works with an HTTP/1.1 webserver. The purpose of this tool is to monitor webserver response time. It implements HTTP/1.1 GET using a realtime multi-threaded design, dealing with a url pool to compute global page response time (page and first level elements response time). It

  1. connects to a webserver (HTTP or SSL),
  2. creates 2 multi-threaded persistent connections to this webserver,
  3. performs a GET HTTP/1.1 on the url specified,
  4. parses the html content returned and creates an element pool
  5. performs GET HTTP/1.1 on each element.
  6. for each GET, measure the response time.
  7. print the global response time for the page requested.

An extension can use MRTG or RRDTOOL to graph the output.
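getmeter itself is Alexandre's tool; for a quick single-URL spot check of page response time (no parsing of page elements, no parallel threads), curl's write-out timer can be used instead. A hedged example only, and the URL is a placeholder:

    # print the total fetch time in seconds, 10 times in a row
    for i in $(seq 1 10); do
        curl -s -o /dev/null -w '%{time_total}\n' http://webserver.example.com/index.html
    done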

32.14. Max number of realservers

what is the maximum number of servers I can have behind the LVS without any risk of failure?

Horms horms (at) vergenet (dot) net 03 Jul 2001

LVS does not set artificial limits on the number of servers that you can have. The real limitations are the number of packets you can get through the box, the amount of memory you have to store connection information and, in the case of LVS-NAT, the number of ports available for masquerading. These limitations affect the number of concurrent connections you can handle and your maximum through-put. This indirectly affects how many servers you can have.

(also see the section on port range limitations.)

32.15. FAQ: What is the minimum hardware requirements for a director

Enough for the machine to boot, i.e. a 386 CPU, 8M of memory, no hard disk, 10Mbps ethernet.

32.16. FAQ: How fast/big should my director be?

There isn't a simple answer.

The speed of the director is determined by the packet throughput from/to the clients and not by the number of realservers. From the mailing list, 300-500MHz UMP directors running 2.2.x kernels with the ipvs patch can handle 100Mbps throughput. We don't know what is needed for 1Gbps throughput, but postings on the mailing list show that top end UMP machines (eg 800MHz) can't handle it.

For the complicated answer, see the rest of this section.

Horms 12 Feb 2004

If you only want to use LVS to load balance 100Mb/s Ethernet then any machine purchased in the last few years should easily be able to do that. End of conversation :-)

If you want to go to 1Gb/s Ethernet then things get more interesting. At that point here are the things to watch out for:

  • Make sure your machines have a nice fast PCI bus. These days most machines seem to have 66MHz/64bit or 100MHz/64bit slots so you are fine. Back when 33MHz/32bit was standard this was a bit more problematic.
  • Buy good NICs that have well maintained drivers.
  • Use UP not SMP. Unless you really need SMP on the machine for some other reason, the locking overhead is greater than the gain of an extra CPU when using LVS. This is particularly true when handling small connections, where the TCP handshake becomes significant. (That was on 2.4, not sure about 2.6, though I assume it still holds.)
  • CPU isn't really much of an issue. If you can purchase a CPU these days that is too slow to run LVS, even up to 1Gb/s then I would be very surprised. Certainly anything over a 1GHz PII should be fine.
  • Memory. First understand that LVS has no internal limits on the number of connections it can handle, so you are only bound by your system resources. Here is the equation: for each connection in LVS's connection table you need about 128 bytes, and connections stay in the table for 120 seconds after a connection is closed. So if your peak is, say, 300 connections/s, then you need about 300*120*128 = 4,608,000 bytes, i.e. about 4.6Mbytes of memory for the connection table, which I think you will agree isn't much. If you are using persistence then an extra entry (template) will be created per end-user (masked by the persistence netmask) and these will stay around for the duration of the persistence timeout. You can do the maths there (a one-liner version of the calculation is sketched after this list). But the bottom line is that unless you are expecting an extremely high number of connections, you don't need much memory.

    Obviously you will need memory for other things like the OS, monitoring tools etc... But I think that 256Mb of RAM should be more than enough.
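Horms' calculation as a one-liner; plug in your own connection rate and timeout, and note that the 128 bytes/entry figure is his approximation:

    awk 'BEGIN { rate=300; timeout=120; entry=128;
                 printf "about %.1f Mbytes for the connection table\n", rate*timeout*entry/1e6 }'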

32.17. SMP doesn't help, but 64 bit does

LVS is kernel code. In particular the network code is kernel code. Kernel code is only SMP in 2.4.x kernels (user space SMP started in 2.0.x kernels). To take advantage of SMP for LVS then you must be running a 2.4.x kernel.

Horms (06 Apr 2003): SMP doesn't help for kernel code at high load -

Some things that you may want to consider are: using a non-SMP kernel - there is actually more overhead in obtaining spinlocks than the advantage you get from multiple CPUs if you are only doing LVS. If you only have one processor you should definitely use a non-SMP kernel. You may also want to consider using NAPI with the ethernet driver, and if you really want to use SMP then setting up affinity for the two NICs with different processors is probably a good idea.
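A hedged illustration of that NIC-to-CPU affinity, using the /proc/irq interface; the IRQ numbers and masks below are made-up examples, so check /proc/interrupts for the real ones on your director:

    # see which IRQ each NIC uses and how interrupts are currently spread over the CPUs
    cat /proc/interrupts
    # pin eth0's IRQ (say 24) to CPU0 and eth1's IRQ (say 25) to CPU1 (hex CPU bitmasks)
    echo 1 > /proc/irq/24/smp_affinity
    echo 2 > /proc/irq/25/smp_affinity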

Dusten Splan Dusten (at) opinionsurveys (dot) com 12 May 2003

Is LVS smp aware? I have dual 1.1GHz processors, dual 1000BaseT Ethernet, with 2.4.20 compiled as an smp kernel and with the lvs sources compiled in, set up as a one nic, one network DR unit with wrr, and it is working very nicely. It is pushing about 50Mbps at peak and is only using about 15% on average of the processing power when doing a vmstat. Now here's the question - when doing a vmstat the numbers look like this (this is a snapshot when we are doing about 30Mbps):

 0  0  0      0 474252  22496 352428    0    0     0     0 25769  3248  0 11 89
 0  0  0      0 474252  22496 352428    0    0     0     0 25816  3194  0 12 88
 0  0  0      0 474252  22496 352428    0    0     0     0 24526  2772  0 11 89
 0  0  0      0 474252  22496 352428    0    0     0     0 24714  2939  0  9 91
 0  0  0      0 474252  22496 352428    0    0     0     0 25404  3081  0  8 92
 0  0  0      0 474252  22496 352428    0    0     0     0 25238  2996  0 11 89
 0  0  0      0 474252  22496 352428    0    0     0     0 24872  2960  0 10 90
 0  0  0      0 474252  22496 352428    0    0     0     0 24760  2850  0  7 93
 0  0  0      0 474252  22496 352428    0    0     0     0 25341  2984  0 10 90
 0  0  0      0 474252  22496 352428    0    0     0     0 24689  2743  0  8 92

Now if I look at top I get:

27 processes: 26 sleeping, 1 running, 0 zombie, 0 stopped
CPU0 states:   0.0% user  20.0% system    0.0% nice   0.0% iowait  80.0% idle
CPU1 states:   0.1% user   1.0% system    0.0% nice   0.0% iowait  98.4% idle
Mem:  1032948k av,  558912k used,  474036k free, 0k shrd, 22496k buff, 188192k active, 189740k inactive
Swap:  522104k av,       0k used,  522104k free, 352428k cached

When doing more traffic the load on cpu0 increases and nothing is happening on cpu1. My question is why am I not seeing this processor usage distributed over both processors. I know that on a Sun box the network card is stuck to a single processor and will not use the other processors.

Here's a sample of what mpstat has to say about the whole thing.

[root@www99 root]# mpstat -P ALL 1 10;
Linux 2.4.20 (www99)    05/14/2003

03:10:05 PM  CPU   %user   %nice %system   %idle    intr/s
03:10:06 PM  all    0.00    0.00   12.00   88.00  19841.00
03:10:06 PM    0    0.00    0.00   23.00   77.00  19748.00
03:10:06 PM    1    0.00    0.00    1.00   99.00    100.00

Horms

LVS really should utilise both CPUs. As you note, the 2.4 kernels are multithreaded and LVS should take advantage of this. It definitely bears further investigation.

The problem with performance is that in multithreading the kernel a lot of spinlocks were introduced. From the testing that I was involved in, it seems that the overhead of obtaining these locks is greater than the advantage of having access to a second CPU - at least in the case of using the box only as an LVS director. If you are doing lots of other things as well then this may not be the case. I would suggest that if you are building a machine that will act primarily as an LVS director, then a non-SMP kernel should give you the best performance.

This, however, does not answer, and is not particularly relevant to, your problem. Sorry.

Wensong

The major LVS processing runs inside softirqs in the 2.4 kernel. The softirqs (even the same one) can run in parallel on two or more CPUs in the 2.4 kernel, so LVS in the 2.4 kernel should take advantage of SMP. We spent a lot of effort keeping the locking granularity of LVS small too.

As for Dusten's problem, I am not sure why one CPU is 80% idle and the other is always 100% idle. From the mpstat output, almost all the interrupts go to the first CPU. Is it possible that 20% of the CPU cycles are being spent handling interrupts on the first CPU?

Michael Brown michael_e_brown (at) dell (dot) com wrote on 26 Dec 2000

I've seen significant improvements using dual and quad processors with 2.4. Under 2.2 there are improvements but not astonishing ones. Things like 90% saturation of a Gig link using quad processors. 70% using dual processors and 55% using a single processor under 2.4.0test.

I haven't had much of a chance to do a full comparison of 2.2 vs 2.4, but most evidence points to >100% improvement for network intensive tasks.

Only one CPU can be in the kernel with 2.2. Since LVS is all kernel code, there is no benefit to LVS from using SMP with 2.2.x. Kernels 2.[3-4] can use multiple CPUs. While standard (300MHz pentium) directors can easily handle 100Mbps networks, they cannot handle an LVS at Gbps speeds. Either SMP directors with 2.4.x kernels or multiple directors (each with a separate VIP, all pointing to the same realservers) are needed.

Since LVS-NAT requires computation on the director (to rewrite the packets) that is not needed for LVS-DR and LVS-Tun, SMP should help LVS-NAT throughput.

Joe

If you're using LVS-NAT then you'll need a machine that can handle the full bandwidth of the expected connections. If this is T1, you won't need much of a machine. If it's 100Mbps you'll need more (I can saturate 100Mbps with a 75MHz machine). If you're running LVS-DR or LVS-Tun you'll need less horsepower. Since most of LVS is I/O, I would suspect that SMP won't get you much. However, if the director is doing other things too, then SMP might be useful.

Julian Anastasov uli (at) linux (dot) tu-varna (dot) acad (dot) bg

Yep, LVS in 2.2 can't use both CPUs. This is not an LVS limitation. It is already solved in the latest 2.3 kernels: softnet. If you are using the director as a realserver too, SMP is recommended.

Pat O'Rourke orourke (at) mclinux (dot) com 03 Jan 2000

In our performance tests we've been seeing an SMP director perform significantly worse than a uni-processor one (using the same hardware - only difference was booting an SMP kernel or uni-processor).

We've been using a 2.2.17 kernel with the 1.0.2 LVS patch and bumped the send/recv socket buffer memory to 1MB for both the uni-processor and SMP scenarios. The director is an Intel based system with 550MHz Pentium IIIs.

In some tests I've done with FTP, I have seen *significant* improvements using dual and quad processors using 2.4. Under 2.2, there are improvements, but not astonishing ones.

Things like 90% saturation of a Gig link using quad processors, 70% using dual processors and 55% using a single processor under 2.4.0test. Really amazing improvements.

Michael E Brown michael_e_brown (at) dell (dot) com 26 Dec 2000

What are the percentage differences on each processor configuration between 2.2 and 2.4? How does a 2.2 system compare to a 2.4 system on the same hardware?

I haven't had much of a chance to do a full comparison of 2.2 vs 2.4, but most of the evidence on tests that I have run points to a > 100% improvement for *network intensive* tasks.

In our experiments we've been seeing an SMP director perform significantly worse than a uni-processor one (using the same hardware - only difference was booting an SMP kernel or uni-processor).

Kees Hoekzema kees (at) tweakers (dot) net 10 Jun 2008

Yes, it is true that for LVS it does not really matter, but LVS is not always the only thing running on a system; in my case it also runs a DNS server and has a lot of iptables rules. At the moment we are doing around 60 Mbit/s and my stats are:

$ mpstat -P ALL 1 10 
Average:     CPU   %user   %nice    %sys %iowait    %irq   %soft  %steal %idle    intr/s
Average:     all    0.23    0.00    0.69    0.13    0.48    2.93    0.00 95.54  11772.75
Average:       0    0.10    0.00    0.30    0.40    0.00    0.00    0.00 99.20   1000.30
Average:       1    0.30    0.00    1.01    0.10    0.00    0.00    0.00 98.59      0.00
Average:       2    0.10    0.00    0.41    0.00    0.62    1.24    0.00 97.62   2941.92
Average:       3    0.31    0.00    1.04    0.00    1.35   10.77    0.00 86.54   7830.64

(A quad-core Xeon was just a little bit more expensive than a single-core CPU at the time we bought the servers.)

You can also have different CPUs handle the interrupts:

$ cat /proc/interrupts
            CPU0       CPU1       CPU2       CPU3
1274:        220    4719154 2977300793          0   PCI-MSI-edge      eth1
1275:        244    3649053          0 1213052367   PCI-MSI-edge      eth0

With the speed of current CPUs, however, I think it does not really matter whether you have a single-core or multi-core CPU; they can all handle Gbit/s of data.

64 bit is helpful if you don't want counters like those in /proc/net/dev overflowing (a 32-bit byte counter wraps at 4 Gbyte; 4 Gbyte per 5 minutes is roughly 115 Mbit/s of traffic, so if you get above that and only read those statistics every 5 minutes, the counter may wrap before you read it again).
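
For the arithmetic behind that wrap-around figure (a sketch; the 5 minute polling interval is the one quoted above, and bc is only used to do the sums):

# a 32-bit byte counter wraps at 2^32 bytes; at a 5 minute polling interval
# that corresponds to a sustained rate of roughly:
echo "scale=1; 2^32 * 8 / (5 * 60) / 10^6" | bc     # ~114.5 Mbit/s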

32.18. Performance Hints from the Squid people

There is information on the Squid site about tuning a squid box for performance. I've lost the original URL, but here's one about file descriptors and another (link dead Mar 2004 http://www.devshed.com/Server_Side/Administration/SQUID/) by Joe Cooper (occasional contributor to the LVS mailing list) that also addresses the FD_SETSIZE problem (i.e. not enough file descriptors). The squid performance information should apply to an LVS director. For a 100Mbps network, current PC hardware on a director can saturate the network without these optimizations. However, current single processor hardware cannot saturate a 1Gbps network, and optimizations are helpful. The squid information is as good a place to start as any.

Here's some more info

Michael E Brown michael_e_brown (at) dell (dot) com 29 Dec 2000

How much memory do you have? How fast are your network links? There are some kernel parameters you can tune in 2.2 that help out, and there are even more in 2.4. Off the top of my head:

  • /proc/sys/net/core/*mem* - tune to your memory spec. The defaults are not optimized for network throughput on large memory machines (a sketch follows this list).
  • 2.4 only /proc/sys/net/ipv4/*mem*
  • For fast links with multiple adapters (two gig links, dual CPU), 2.4 has NIC-->CPU IRQ binding. That can also really help on heavily loaded links.
  • For 2.2 I think I would go into your BIOS or RCU (if you have one) and hardcode all NIC adapters (assuming identical/multiple NICs) to the same IRQ. You get some gain due to cache affinity, and one interrupt may service IRQs from multiple adapters in one go on heavily loaded links.
  • Think "Interrupt coalescing". Figure out how your adapter driver turns this on and do it. If you are using Intel Gig links, I can send you some info on how to tune it. Acenic Gig adapters are pretty well documented.

For a really good tuning guide, go to spec.org, and look up the latest TUX benchmark results posted by Dell. Each benchmark posting has a full list of kernel parameters that were tuned. This will give you a good starting point from which to examine your configuration.

The other obvious tuning recommendation: Pick a stable 2.4 kernel and use that. Any (untuned) 2.4 kernel will blow away 2.2 in a multiprocessor configuration. If I remember correctly 2.4.0test 10-11 are pretty stable.

Some information is on

http://www.LinuxVirtualServer.org/lmb/LVS-Announce.html

Joe: here's a posting from the Beowulf mailing list, about increasing the number of file descriptors/sockets. It is similar to the postings on the Squid webpages (mentioned above).

Yudong Tian yudong (at) hsb (dot) gsfc (dot) nasa (dot) gov 30 Sep 2003

The number of sockets a process can open is limited by the number of file descriptors (fds). Type "ulimit -n" under bash to get this number, which is usually 1024 by default.

You can increase this number if you wish. Google "increase Linux file descriptors" and you will find many examples, like this one: http://support.zeus.com/faq/zws/v4/entries/os/linuxfd.html

If you want to be really sure, you can compile and run the following C program to get the number, which is the output plus 3 (stdin, stdout and stderr):

/*
 * test how many fds you have
 * $Id$
 */
#include <stdio.h>

int main(int argc, char *argv[])
{
    int i = 0;

    /* open temporary files until the process runs out of descriptors */
    while (tmpfile()) { i++; }
    printf("Your free fds: %d\n", i);

    return 0;
}
/*** end of program ***/

If you are running a TCP server and want to test how many clients that server can support, you can run the following perl program to test:

#!/usr/bin/perl
# Test the max number of tcp connections a server can support
# $Id: testMaxCon.pl,v 1.1 2003/06/23 19:10:41 yudong Exp yudong $
# usage: testMaxCon.pl IP port

use IO::Socket::INET;

@ARGV == 2 or die "Usage: testMaxCon.pl svr_ip svr_port\n";

my ($ip, $port) = @ARGV;

my $i = 0;

do {
  $i++;
  $socket[$i] = IO::Socket::INET->new(PeerAddr => $ip,
                                      PeerPort => $port,
                                      Proto => "tcp",
                                      Timeout => 6,
                                      Type => SOCK_STREAM);
} while ($socket[$i]);

print "Max TCP connections supported for this client: ", $i-1, "\n";
## end of program

Of course, for this test you have to make sure that you (the client) have more fds available than the server does.

Brian Barrett brbarret (at) osl (dot) iu (dot) edu 02 Oct 2003

On linux, there is a default per-process limit of 1024 (hard and soft limits) file descriptors. You can see the per-process limit by running limit (csh/tcsh) or ulimit -n (sh). There is also a limit on the total number of file descriptors that the system can have open, which you can find by looking at /proc/sys/fs/file-max. On my home machine, the max file descriptor count is around 104K (the default), so that probably isn't a worry for you.

There is the concept of a soft and hard limit for file descriptors. The soft limit is the "default limit", which is generally set to somewhere above the needs of most applications. The soft limit can be increased by a normal user application up to the hard limit. As I said before, the defaults for the soft and hard limits on modern linux machines are the same, at 1024. You can adjust either limit by adding the appropriate lines in /etc/security/limits.conf (at least, that seems to be the file on both Red Hat and Debian). In theory, you could set the limit up to file-max, but that probably isn't a good idea. You really don't want to run your system out of file descriptors.
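
For example, the per-user limits can be raised via /etc/security/limits.conf; the user name and values here are only illustrative:

# /etc/security/limits.conf
squid   soft    nofile  4096
squid   hard    nofile  8192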

There is one other concern you might want to think about. If you ever use any of the created file descriptors in a call to select(), you have to ensure all the select()ed file descriptors fit in an FD_SET. On Linux, the size of an FD_SET is hard-coded at 1024 (on most of the BSDs, Solaris, and Mac OS X, it can be altered at application compile time). So you may not want to ever set the soft limit above 1024. Some applications may expect that any file descriptor that was successfully created can be put into an FD_SET. If this isn't the case, well, life could get interesting.

AlberT AlberT (at) SuperAlberT (dot) it 02 Oct 2003

from man setrlimit:

getrlimit and setrlimit get and set resource limits respectively. Each resource has an associated soft and hard limit, as defined by the rlimit structure (the rlim argument to both getrlimit() and setrlimit()):

            struct rlimit {
                rlim_t rlim_cur;   /* Soft limit */
                rlim_t rlim_max;   /* Hard limit (ceiling
                                      for rlim_cur) */
            };

The soft limit is the value that the kernel enforces for the corresponding resource. The hard limit acts as a ceiling for the soft limit: an unprivileged process may only set its soft limit to a value in the range from 0 up to the hard limit, and (irreversibly) lower its hard limit. A privileged process may make arbitrary changes to either limit value.

The value RLIM_INFINITY denotes no limit on a resource (both in the structure returned by getrlimit() and in the structure passed to setrlimit()).

RLIMIT_NOFILE Specifies a value one greater than the maximum file descriptor number that can be opened by this process. Attempts (open(), pipe(), dup(), etc.) to exceed this limit yield the error EMFILE.
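
From a shell, the same soft/hard distinction can be inspected and adjusted with ulimit (bash shown; the value is illustrative):

ulimit -Sn          # show the soft limit (RLIMIT_NOFILE)
ulimit -Hn          # show the hard limit
ulimit -n 4096      # raise the soft limit, up to the hard limit, for this shell and its children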

32.19. realservers filling conntrack tables (LVS-DR)

Wiboon Warasittichai wiboon (dot) w (at) psu (dot) ac (dot) th 08 Jun 2007

I set up 2 directors with IP 192.168.96.11 (active/standby) and 4 realservers (squid) a week ago. I noticed this in the dmesg output on the realservers:

ip_conntrack: table full, dropping packet.
ip_conntrack: table full, dropping packet.

So I restarted iptables. Then ip_conntrack went back below the 65536 maximum.

[root@proxy5-in ~]# cat /proc/slabinfo | grep conn
ip_conntrack_expect      0      0     92   42    1 : tunables  120   60 8 : slabdata      0      0      0
ip_conntrack         20723  20723    232   17    1 : tunables  120   60 8 : slabdata   1219   1219    120

But within a day it reached the ip_conntrack maximum again. I checked with cat /proc/net/ip_conntrack | grep UNREPLIED, which showed many lines with ESTABLISHED and [UNREPLIED]:

tcp      6 419803 ESTABLISHED src=192.168.96.11 dst=192.168.192.7 
sport=8080 dport=56055 packets=1 bytes=601 [UNREPLIED] src=192.168.192.7 
dst=192.168.96.11 sport=56055 dport=8080 packets=0 bytes=0 mark=0 use=1

I think it's because the squid realservers send their replies directly to the client, and the client then sends its FIN to the director - is that right? Does LVS-DR have any configuration to get rid of these ip_conntrack entries? Do I need to unload the ip_conntrack module on all the squid boxes?

Graeme Fowler graeme (at) graemef (dot) net 08 Jun 2007

Ideally you have to unload the module. Why do you have the conntrack module loaded in the first place? An alternative method, if you absolutely must keep the conntrack rules in place, is to explicitly use the NOTRACK target on packets destined for the Squid service. On the realserver, as an example:

iptables -t raw -I PREROUTING -p tcp --dport 3128 -j NOTRACK
iptables -I INPUT -p tcp -m tcp --dport 3128 -j ACCEPT

The first line will remove tracking from packets destined for TCP port 3128 on the realserver.
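
To check whether the realserver's conntrack table is still filling up, something like the following can be used (the same /proc paths quoted elsewhere in this section):

wc -l < /proc/net/ip_conntrack             # current number of tracked connections
cat /proc/sys/net/ipv4/ip_conntrack_max    # the table's maximum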

32.20. Conntrack, effect on throughput

Note

I can't figure out if this should belong in the performance or the netfilter section - any suggestions?

Conntrack is part of netfilter. It keeps track of connections and knows the relationship of a packet to existing connections. This enables filtering to reject or allow packets, e.g. a packet appearing to be part of a passive ftp connection is only valid after a call to set up a passive ftp connection. For a tutorial on conntrack, see the links inside Iptables - How Does It Work? (http://www.sns.ias.edu/~jns/wp/2006/01/24/iptables-how-does-it-work/) by James Stephens.

Rodger Erickson rerickson (at) aventail (dot) com 17 Dec 2001

Does anyone have any comments they can make on the effect of conntrack on LVS performance? The LVS device I'm using also has to do some DNAT and SNAT, which require conntrack to be enabled.

Julian: We need to port the 2.2 masquerade to 2.4 :) LVS reused some code from 2.2 but much of it has been removed and I'm not sure it can be added back easily. It would be better to redesign some parts of netfilter for 2.5 or 2.7 :) You can use the ipchains compat module, but maybe it does not work for FTP and is broken in some places.

The best approach might be to test the slowdown when using both LVS and conntracking and if it's not fast enough, buy faster hardware. It will take less time :) You can test the slowdown with some app or even with testlvs.

patrick edwards

My LVS works with no problem. However, within a matter of an hour or two my bandwidth drops to virtually nothing and the CPU load goes ballistic. I have a 100Mbit internal network, but at times I'm lucky to see 50Kps.

Christian Bronk chris (at) isg (dot) de 15 Jan 2002: For our 2.4 kernel test servers, it turned out that ipchains under kernel 2.4 does full connection-tracking and makes the system slow. Try to use iptables or the arp-patch instead.

Fabrice fabrice (at) urbanet (dot) ch 17 Dec 2001

When I ran testlvs with conntrack enabled on the client machine (the one that runs testlvs), I had a mean of about 2000 SYN/s. When I removed those modules (there are many conntrack modules) I reached 54,000 SYN/s!

Julian Anastasov ja (at) ssi (dot) bg 22 Dec 2001: I performed these tests with 40K SYN/s incoming (director near its limits), LVS-NAT, 2.4.16, noapic, SYN flood from testlvs with -srcnum 32000 to 1 realserver.

  • IPVS 0.9.8 alone: able to send out the 40K SYN/s (same as incoming), 3000 context switches/sec
  • after modprobe ip_conntrack hashsize=131072

    10-15% performance drop, 500 context switches/sec

  • after modprobe iptable_nat (no NAT rules) 5% performance drop, same number of context switches/sec
  • Additional test: -j DNAT instead of IPVS
    # modprobe ip_conntrack hashsize=131072
    # modprobe iptable_nat
    # cat /proc/sys/net/ipv4/ip_conntrack_max
    1048576
    
    Tragedy: 1000 packets/s out; the director is overloaded.

I looked into the ip_conntrack hashing; it is not perfect for incoming traffic between two IPs, but note that testlvs uses different source IPs, so after a little tuning it seems that DNAT's problem is not the bad hash function. Maybe I'm missing some netfilter tuning parameters.

Andy Levine: Is it absolutely necessary to have IP connection tracking turned on in the kernel if we are using LVS-DR? We are experiencing performance hits with the connection tracking code (especially on SMP boxes) and would like to take it out of our kernel.

Wensong 25 Dec 2002: if you have a performance problem you can remove it. LVS uses its own simple and fast connection tracking for performance reasons, instead of netfilter's connection tracking. So it will not affect LVS if the netfilter conntrack modules are not loaded. LVS-NAT should work without the conntrack modules too.
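
A hedged sketch of checking for and removing the netfilter conntrack modules on a director (or realserver) that doesn't need them; this is only possible if no iptables rules or other modules still depend on them:

lsmod | grep -E 'ip_conntrack|iptable_nat'   # see whether the modules are loaded
rmmod iptable_nat                            # NAT depends on conntrack, so remove it first
rmmod ip_conntrack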

32.21. Don't use the preemptible/preemptable/preemptive kernels

Note
the different spellings of the word in the title are there to help searching.

Brian Jackson

Just as a little experiment, since I have enjoyed my preemptible/low latency patches, I decided to test my lvs cluster with the patches. The results were interesting.

Roberto Nibali ratz (at) tac (dot) ch 10 Sep 2002

Preemptible kernels don't buy you anything on a server; simply speaking, they are for a desktop machine where you'd like to listen to mp3s (non-skipping, of course) while compiling the kernel. Low latency for TCP/IP, as needed for LVS, is incompatible with the concept of preemptible kernels, in that the network stack runs in softirqs and gets preempted by the kernel scheduler. If your driver generates a lot of IRQs for RX buffer dehooking, the scheduler must be invoked to get those packets pushed into the TCP stack or you lose packets. As long as you don't run X and some number-crunching software on the realservers, preemptible kernels hurt TCP/IP stack performance IMHO.
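
If you are unsure whether your kernel was built preemptible, a quick check against the kernel config (assuming your distribution installs the config file under /boot, which not all do):

grep CONFIG_PREEMPT /boot/config-$(uname -r)    # CONFIG_PREEMPT=y means a preemptible kernel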

32.22. 9.6Gbps served using LVS-DR with gridftp

Horms 24 Nov 2005

This information comes from Dan Yocum, slightly reformatted and forwarded with permission. Note that while the cluster was pushing 9.6Gbps, the linux director was doing a negligible amount of work, which seems to indicate that LVS could push a great deal more traffic given sufficient realservers and end-users.

On Mon, Nov 21, 2005 at 01:51:27PM -0600, Dan Yocum wrote:

Just a quick update on the LVS-DR server I used for our bandwidth challenge last week at SuperComputing: the director saw an increase of around 120Kbps when we ran our bandwidth challenge tests. At times the aggregate bandwidth out of the 21 realservers was around 9.6Gbps, so the amount of traffic on the director was negligible. There were 41 clients grabbing the data from the servers; each machine ran a gridftp client with 16 parallel streams. Packet size was standard (1500), so no jumbo frames.

The URL for the mrtg graphs is here: http://m-s-fcc-mrtg.fnal.gov/~netadmin/mrtg/mrtg-rrd.cgi/s-s-fcc1-server/s-s-fcc1-server_3_3.html

The test occurred on Wed of last week in the last half of the day.

Sure, no problem. One of the BWC participants put a page up here if you're interested in the details of the other participants: http://www-iepm.slac.stanford.edu/monitoring/bulk/sc2005/hiperf.html