AustinTek LVS page | LVS homepage | AustinTek homepage |

Ottawa Linux Symposium (OLS) 23-26 Jul 2003

Joseph Mack

jmack (at) wm7d (dot) net

Aug 2003 and updated Feb 2004


Table of Contents

1. About OLS
2. Technical Presentations
3. After Dinner Talks
4. Other Interesting Points
4.1. People who are difficult to work with
4.2. Some programs won't compile
4.3. Disks are cheap
4.4. The real reason you want to give away your code
4.5. Travel broadens your perspective
4.6. The Mounties get their man

1. About OLS

This meeting, and the many satellite meetings that occur before and after, is the main annual technical meeting for Linux. There are no vendors or sales people. There is a limit of 500 attendees.

OLS has grown in large part because of the location of Ottawa.

  • Ottawa in July is a nice place to be. The hotels, conference center and restaurants are all within walking distance, downtown is safe (people roller blade along with the cars in the traffic, like cyclists), and if you want to go for a walk, you are near the Ottawa River, parks and Rideau canal. Apart from the conference center (which, like most indoor places in North America in the summer, is freezing thanks to air conditioning), temperatures are pleasant day and night.
  • The east coast of NA represents a reasonable compromise in travel cost, totalled over all attendees, most of whom come from NA and Europe. International attendees come from as far as Australia and Japan.
  • The US govt doesn't want people to innovate software and has subjected people who've done so to multi-year legal intimidation for implementing algorithms from text books, and has incarcerated people without bail for discussing technical matters that were common knowledge to the audience. Much of the technical innovation in open source software has moved out of the USA and no-one expects a technical Linux conference in the USA for the foreseeable future.

Although OLS started 5 yrs ago as a single meeting, satellite meetings are now conducted before or after the main meeting, to take advantage of the presence of people working on common projects e.g. the (by invitation only) Linux Kernel Summit is held in the two days before OLS.

The Linux Virtual Server project (LVS) met before (and during and after) the meeting. Unlike the Kernel Summit, we in LVS are quite inclusive about whom we'll talk to and only expect you to be able to drink beer. LVS team members are from China, Japan, Switzerland, France, Australia and Bulgaria. I'm the only American [1]. Apart from e-mail, this is our only chance to meet face-to-face. This year Lars, Horms, Ratz and myself attended.

The priority scheduler, originally proposed by Ratz, was worked on in hotel rooms and on the steps leading into the conference center, anywhere that laptops could be put down. Ratz and Horms did their LVS development with VMware, simulating multiple machines on their laptops.

Ratz needs the priority scheduler to handle sudden peaks in load, when a client will be sent to a "Please wait 1 min" page while state information is held about the client. This scheduler is needed for servers where an SLA requires that users always be able to connect. This is a problem when a server has to handle I/O in bursts, e.g. when bookings open for a rock concert (for the Rolling Stones concerts in Canada, 500,000 tickets sold out in several minutes, requiring the servers to hold 100,000 connections for the time needed for the bookings).
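A minimal sketch of the overflow idea (the function names and the capacity threshold are invented for illustration; this is not Ratz's actual scheduler code):

```python
# Toy sketch of a priority/overflow scheduler: below capacity, clients
# go to the real servers; at capacity, new clients are parked on a
# "please wait" server so they are never refused outright.
ACTIVE_CAPACITY = 100_000  # assumed limit the cluster can serve at once

def schedule(active_count, real_servers, wait_server):
    """Return the server a newly connecting client should be sent to."""
    if active_count < ACTIVE_CAPACITY:
        # normal case: simple round-robin over the real servers
        return real_servers[active_count % len(real_servers)]
    # overload: park the client on the wait page; its state is held so
    # it can be promoted to a real server when capacity frees up
    return wait_server

print(schedule(3, ["rs1", "rs2"], "wait-page"))        # a real server
print(schedule(100_000, ["rs1", "rs2"], "wait-page"))  # the wait page
```

The SLA is satisfied because every client gets *some* connection; the wait page just holds the client's place in line.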

2. Technical Presentations

  • Horms' LVS tutorial was the first talk of the conference.

    Horms' talk included a description of his implementation of the LVS sync daemon and of his Saru code (http://www.ultramonkey.org/papers/active_active/), which allows multiple directors to be active at the same time. Horms showed how the sync state daemon didn't need to run in client-server mode, but would work better in peer-to-peer mode, where it would broadcast the state of all connections it knew about. If the director was in backup mode, then it would be silent. Since only ESTABLISHED connections are relevant to failover, only ESTABLISHED state information is broadcast.
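    The broadcast rule (only ESTABLISHED state is worth sending) can be sketched as follows; the connection-table layout and JSON payload are invented for illustration and are not the sync daemon's actual wire format:

```python
import json

def sync_payload(conn_table):
    """Serialize only the ESTABLISHED entries, the only state that
    matters for failover, for broadcast to the peer directors."""
    established = [c for c in conn_table if c["state"] == "ESTABLISHED"]
    return json.dumps(established)

table = [
    {"client": "10.1.1.5:4711", "server": "rs1:80", "state": "ESTABLISHED"},
    {"client": "10.1.1.6:4712", "server": "rs2:80", "state": "SYN_RECV"},
]
print(sync_payload(table))  # only the ESTABLISHED entry appears
```

    A director in backup mode would simply skip the broadcast and only listen.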

  • Infiniband: a high bandwidth, low latency commodity connect backed by Intel (and presumably others) that may yet prove to be useful in beowulves. It has been in the works for quite a while but the drivers (non-tcpip) are still being written. It's hard to believe Infiniband will be deployed before it's obsolete. Other high throughput (10Gbps), but expensive, connects, e.g. Quadrics, are already deployed.
  • Zero copy: Transfer of data through the kernel requires multiple copies of the data. While latency is a problem everywhere, in the tcpip stack it limits the transfer of data over high speed connects. Although this can be fixed for specific hardware with a hardware-specific communication protocol, the tcpip stack's flexibility of working with any lower layer is then lost. Since no other models/protocols are available, high speed driver writers are faced with writing their own incompatible and hardware-specific protocols.
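    At the application level the familiar instance of the zero-copy idea is the sendfile() system call, which asks the kernel to move bytes between two descriptors without a read()/write() round trip through a userspace buffer. A minimal Linux sketch via Python's os.sendfile wrapper (file contents invented for illustration):

```python
import os
import tempfile

src = tempfile.TemporaryFile()
src.write(b"bulk data moved without a userspace copy")
src.flush()

dst = tempfile.TemporaryFile()
# The bytes move kernel-side; this process never copies them through
# a userspace buffer.
sent = os.sendfile(dst.fileno(), src.fileno(), 0, 40)

dst.seek(0)
print(sent, dst.read())
```

    (Classically the destination of sendfile() had to be a socket; Linux later allowed a regular file as well. Either way the data path stays inside the kernel.)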
  • Embedded Linux: As usual with proprietary hardware for which no information is available, the main problem is trying to figure out how to talk to the hardware.
  • Suspend to Disk: This isn't only for laptops. To conserve power and disk life, everyone wants to suspend their desktop machines overnight. For quite some time now, all hardware devices have been designed with the ability to move to and return from various agreed-upon levels of suspension/activity. The required instructions are closely held business secrets and people who want to know them are regarded as terrorists by governments who don't understand the Boston Tea Party. Accordingly Linux has 3 different methods of suspending to disk, each of which works on its own limited range of hardware. This talk discussed the new Linux framework to be released in August 2003, which will call drivers to shut down and awaken hardware in an orderly fashion. It now only remains to figure out how to talk to the remaining hardware.
  • Lustre: This is a filesystem designed for kilonode clusters; it won the contract for ASCI Purple and is running on 2 other kilonode clusters. No-one else had written a file system for such large systems (Sistina's GFS only scales to about 64 nodes) and Lustre, a handful of people, had no competition for the bid (even from IBM, who subcontracted them rather than trying to compete). Lustre sits one level above the hardware and disk failure is handled by RAID.
  • Google: A sysadmin at Google showed photos of the Google hardware at various stages from the first day (8 assorted desktop boxes) to today's setup (rows of 1U racks to the vanishing point). They only use cheap and nasty hardware (no RAID, no failover, no ECC memory). With the MTBF for a disk being about a year, they must have several disk jockeys doing nothing but replacing disks. For hardware it's "one strike and you're out": if a machine crashes, it is rebooted; if it doesn't come up, it is replaced and not used again in the Google servers (they find other uses for it). They don't have a problem with income - advertising handles it. He illustrated the "spehl chequer" by showing the 250 different spellings of "Britney Spears" understood by Google. This would have been a good after dinner talk. During question time, he displayed the search strings being entered in realtime back at Google HQ. These arrived at several/sec. About every 10th entry brought laughs or hoots from the audience (people want to know about all sorts of things, it seems). After several mins, the audience stopped reacting; the previously remarkable queries started to all look the same.
  • Linux Virtual Memory Management: The Linux VM is not in good shape; a Linux machine with a load average of 80 will likely crash. The VM for *BSD is quite good. (I don't know why the *BSD VM isn't used in Linux.) Linux developers are trying various algorithms for the VM with the hope of developing an O(1) VM. They aren't there yet, but the Linux VM is better than it was a year ago. Academics regard the VM problem as solved, but the original VM algorithms were developed for machines that were much nearer to the von Neumann architecture than current machines are. Although most people think that computer speed is increasing with Moore's Law, not all hardware is advancing at the same rate. These skews have accentuated the non-von-Neumann properties of modern computers.

    Table 1. Scaling of Speeds of Hardware with Moore's Law

    Scaling relative to Moore's Law   Hardware examples
    -------------------------------   -----------------------------------------
    faster than Moore's Law           disk size
    with Moore's Law                  CPU speed, memory size, number of processes
    slower than Moore's Law           disk speed, memory speed
    much slower than Moore's Law      number of CPUs
    does not scale with Moore's Law   speed of light (affects distance separating chips on a board)

    These changes have meant that machines that not so long ago were running 2,000 processes are now handling 50,000 processes, where the algorithms for VM do not scale. In particular, the ratio of job size to disk speed means that it's now expensive to swap.

    The VM problem shows up with one of the beowulfs I run. For an overtemperature event at 100% CPU on 32 CPUs, it takes 45 secs to schedule an orderly shutdown (so instead we schedule a disorderly shutdown, which takes about 10 secs).

    Note
    In my 2.4.24 kernel, I now notice that at high load, the CPU runs at 100% (sys, user, nice), whereas previously there would always be idle time (5-20%) no matter how high the load average. Presumably this is a result of better job scheduling.
  • Security/Packet Filtering: The security tools and traffic shaping available with Linux are excellent, although difficult to use. (Traffic shaping gives priority to interactive and streaming applications like realtime audio, while lowering the priority of non-interactive applications like ftp.) However, the code implementing them, and the user interface, change with each major kernel version. Although two years ago we were promised that the filtering would not change in kernel 2.6, it is now clear to all concerned that it has to change. Thus vendors and users who had to port their 2.0 filtering rules to 2.2, and then to 2.4, can now look forward to doing the same thing for 2.6. One problem is that rule loading scales with O(n^2) and is only usable for small (<5,000) rule sets. The rules look like assembler: it's easy to make mistakes, difficult to check, and rules are not extendable or nestable (you can't make a subroutine). A higher level language is needed and a compiler or parser needs to be written. The user interface is only usable by adept security people who set up rules all the time. An independent package, nf-hipac, is now available which scales to at least 50,000 rules and has been mathematically proven to give the filtering defined in its conf files. However its setup overhead is large, and there is no point in using it for small rulesets.
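    As an aside on why large rulesets hurt: the traditional filtering model is first-match-wins over an ordered list, so matching cost grows linearly with the number of rules (and loading, per the talk, quadratically). A toy model of the semantics (rule fields invented for illustration):

```python
def filter_packet(rules, packet, policy="ACCEPT"):
    """Return the verdict of the first rule whose fields all match the
    packet; fall back to the chain's default policy."""
    for match, verdict in rules:
        if all(packet.get(k) == v for k, v in match.items()):
            return verdict
    return policy  # no rule matched: the default policy applies

rules = [
    ({"src": "10.0.0.1"}, "DROP"),
    ({"dport": 22}, "ACCEPT"),
]
print(filter_packet(rules, {"src": "10.0.0.1", "dport": 80}))   # DROP
print(filter_packet(rules, {"src": "10.0.0.9", "dport": 443}))  # ACCEPT
```

    nf-hipac's gain comes from replacing this linear scan with a precomputed classification structure, which is also why its setup overhead only pays off for large rulesets.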

3. After Dinner Talks

These are put on by sponsors and follow welcoming speeches by the organisers etc. Although the talks are not given by Linux developers, the sponsors usually put some effort into making their talk relevant to the audience. Last year Wayne Meretsky, a senior designer from AMD, talked about their new 64-bit CPU, its pipeline and multithreading design, and showed that many-way SMP systems would be possible with near-UMA performance (the access time to memory on another CPU was only twice that to local memory, meaning that there was no point in programming the compiler to treat memory accesses to other CPUs any differently from local memory accesses). This talk was well received.

This year a marketer from Intel gave a long talk on the "Ecology of Linux" the message being that "working together we can get it done". Very quickly the noise level at the tables rose as people returned to their conversations.

On another evening, IBM sponsored a talk by Ian Foster (Mr "Grid"). Ian Foster has a book out on Grid which reviewers on amazon.com regard as "cheer leading". Instead of a technical talk (say, on the design of Globus), he plugged the latest edition of his book and gave a zero-content PowerPoint presentation on the similarities between Grid computing and the electrical grid.

Figure 1. LVS Project members at IBM Sponsored Dinner

IBM Sponsored dinner, OLS 2003
L-R: Ratz (Sicilian living in Switzerland), Lars (Germany), Joe (Australian living in USA), Horms (Australian living in Japan). Lars is always programming, at least from his drinks. The rest of us are in post-programming mode. Photo by Horms.

4. Other Interesting Points

4.1. People who are difficult to work with

Much of the information transfer occurred outside the talks. Problems that I've walked around and accepted as part of living would come up when someone would say "I found out why the glibc `make test` fails for machines with disks mounted with 'noatime'", or "anyone know why there are 3 drivers for device 'foo'?". The last problem is usually due to the person who started the project not accepting code/patches/new ideas from other people. These ("difficult to work with") people are unfortunately common in Linux.

4.2. Some programs won't compile

There is a lot of commonly used code out there that won't compile in my hands (gnome, gnucash) and people use the precompiled binaries, negating most of the point of having the source code. I found that other people can't compile these packages either. While reassured that I'm not incompetent, I'm distressed that code is being released in such shape and being incorporated into other code.

I've spent some time trying to compile gnome parts on my machines, only to find that instead of looking for the file they want (e.g. some library), they look for an ascii *.la file that is supposed to list the location of the file, but often doesn't. Why the hell doesn't gnome use ldconfig to find the libraries, just like everyone else does?

4.3. Disks are cheap

Disks are really cheap (Come on guys - you're supposed to say "Joe, how cheap are they?"). Well, they are so cheap ($1/G) that if you're working with people with reasonable disk hygiene, it's cheaper to install another disk than to send a sysadmin who costs $1/min to free up some diskspace.

4.4. The real reason you want to give away your code

People usually think free code is a good idea because it gives them a whole lot of code that does exactly what they want for free. This, of course, is completely wrong. The main expense of code is not writing it, but debugging, maintaining and upgrading it. The real reason you want to give your code away, then, is so that you'll quickly have 1000s of users who will send in patches, bug fixes and new ideas, and you won't have to do anything except take credit for it all.

4.5. Travel broadens your perspective

The Europeans are amazed that power lines in the USA are above ground, particularly in areas of ice or hurricanes (or where I live, both) (Sydney, with neither, has its power underground). They also aren't happy that NA uses cell phone standards incompatible with those of the rest of the world, and are left standing around unable to get a dial-tone. Of course you patiently explain to them that the phone providers here are only responding to the voice of the consumer exercising their right to choose, as you would expect in a free country. Here the consumer is asking for, nay demanding, incompatible protocols in adjacent zones and that people not be allowed to send data over cell phones. The situation in the rest of the world (interoperability from country to country, and people who've been using their cell phones as laptop modems for a decade) is exactly what you'd expect when governments step in, over-regulate and strangle innovation in a free-enterprise system.

I mentioned to Ratz that I had UPSs at home. Why? (he asked in astonishment). Because the power goes out occasionally, and even if it doesn't, you get power bumps (<1 sec) now and again, enough to bring down a machine and derange the phone answering machine, VCR and clocks. (In Maryland, I used to get a 1 sec power bump once a week.) This brought cries of amazement. His machine at home didn't have a UPS and had an uptime of over 400 days. He had hundreds of machines at work, all of which were required to be up all year, and none of which had a UPS or had ever needed one. Why, if you couldn't rely on the power, no-one would be able to do business. It sounds like there isn't a UPS in the whole of Europe and you have to go to places like Iraq before you find anything comparable to the state of the power system in the USA.

4.6. The Mounties get their man

We had the first near-arrest in the 5 yrs of OLS. One of the younger speakers was found by the RCMP in a state of geographical confusion and gravitational challenge, while attempting to return to his hotel one night following discussions about the future of open source code. After engaging in some light conversation, he was taken in hand by the Mounties, who safely chauffeured him without further incident to his hotel.

Note
A Conservative Prime Minister of Australia was once found in the foyer of a hotel in another country in a similar condition, but without his trousers. The newspapers in Australia described him as being found in an "emotional state", adding a phrase to the Australian lexicon. The whereabouts of his trousers is still a mystery.


[1] :-) For those who haven't met me, I was born in Australia and still have an Australian accent. Although I have US citizenship, it's not obvious to most people that I'm an American.
