37. LVS: Dynamic Routing, multiple gateways, realservers in multiple LVSs, dead gateway detection

Normally multiple routes are handled by routers. However you may not have access to the router tables for administrative reasons or because someone wants to protect their turf (they don't want someone not in their department poisoning their router tables). Here we describe setting up multiple routes and how they can be used in an LVS.

37.1. Setting up multiple gateways: Realservers shared between two LVSs: ip route append

If realservers are supplying services through two directors, then the realservers need two default routes (one through each director). This is allowed by the TCP/IP RFCs but rarely implemented. You cannot add a second default route with ip route add; you'll get an error saying that the route already exists. Instead you use the command ip route append. This was worked out by Pawel Osko (posko).
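
For example (a minimal sketch; the gateway addresses are placeholders, not from the setup described below):

ip route add default via 192.168.1.1      # first default route
ip route add default via 192.168.2.1      # fails: "RTNETLINK answers: File exists"
ip route append default via 192.168.2.1   # adds a second default route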

posko P (dot) Osko (at) elka (dot) pw (dot) edu (dot) pl 17 Apr 2002

I used ip route append at home when I was testing source address routing for realservers. But when I started working with my setup, I found that I couldn't set up two default routes for different addresses in one routing table (in Linux this is by default the table 'main', where all normal routes are stored), because only one default route works at a time (the first one added to the table). So I decided to create separate routing tables (named 201 and 202 in my setup), each containing a default route for one alias, using the following command:

ip route add 0/0 src 192.168.1.2 via 192.168.1.1 table 201

and route packets with source address from 192.168.1.2 according to this table (201)

ip rule add from 192.168.1.2/25 table 201 prio 220

Here are the details from Pawel Osko, Warsaw University of Technology, Faculty of Electronics and Information Technology.

You can create two (or more) LVS-NAT directors using policy routing. The simplest setup is one RS working with two DIRs:

         --------
        | client |
         --------
            |
            |
         _________
        |         |
       DIR1      DIR2
        |         |
         ---------
            |
           RS

The first step is to create a working setup with one DIR and one RS. In my setup I'm using a one-NIC, two-network LVS-NAT. Example (my) setup (you can use the Configure Script to set it up):

DIR1:
VIP=A.B.C.70
DIRECTOR_VIP_DEVICE=eth0:110
DIRECTOR_INSIDEIP=192.168.1.1
DIRECTOR_DEFAULT_GW=A.B.C.126

RS:
RIP=192.168.1.2
GW=192.168.1.1

Now test it. If everything is ok, set up the second DIR and change the settings on the RS:

DIR2:
VIP=A.B.C.71
DIRECTOR_VIP_DEVICE=eth0:110
DIRECTOR_INSIDEIP=192.168.2.1
DIRECTOR_DEFAULT_GW=A.B.C.126


RS:
RIP=192.168.2.2
GW=192.168.2.1

and test it.

Now you know for sure that your DIRs are set up properly and your RS can work with both of them.

Step 2.

Keep the directors running. Delete the addresses on the RS's network interface (using the `ip addr del` command, for example), then add two addresses to the NIC (I'm using eth0):

ip addr add 192.168.1.2/25 broadcast 192.168.1.127 dev eth0 label eth0:1
ip addr add 192.168.2.2/25 broadcast 192.168.2.127 dev eth0 label eth0:2

Check if everything is ok:

ip addr show

Each of these addresses will work with a different DIR. Now you must make packets from eth0:1 go to DIR1 and packets from eth0:2 go to DIR2. Source routing will be used to do this.

Create rules for each IP:

ip rule add from 192.168.1.2/25 table 201 prio 220
ip rule add from 192.168.2.2/25 table 202 prio 220

where 201 and 202 are names of tables.

Add default routes for each IP:

ip route add 0/0 src 192.168.1.2 via 192.168.1.1 table 201
ip route add 0/0 src 192.168.2.2 via 192.168.2.1 table 202

You are done! Now all packets from 192.168.1.2 will go through DIR1 and packets from 192.168.2.2 through DIR2.
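
To verify the setup (a quick sketch using the rules and tables just created):

ip rule show
ip route show table 201
ip route show table 202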

New RSs can be added now; simply follow the instructions in Step 2 for the new IPs. You can also have more DIRs, just add more IPs on the RS. I set up an LVS-NAT with four DIRs working with four RSs, using mon to detect RS failures, and everything works perfectly (at last!).

37.2. Connecting from clients through multiple parallel links: the dead gateway problem

This is not an LVS problem, just a normal routing problem. You can have multiple default gateways in Linux. The problem is knowing when one of them has died.

The "Connected" site has a discussion of dead gateway detection (http://www.freesoft.org/CIE/RFC/1122/56.htm, site gone 14 Sep 2004) derived from the RFCs. The points raised are

  • active probes (e.g. pings) are expensive, scale poorly and MUST NOT be used continuously to check the status of a first-hop gateway. When pings must be used, they MUST only be used when traffic is being sent to the gateway.
  • other layers (above and below the IP layer) SHOULD be able to give advice to the routing layer, when positive (gateway OK) or negative (gateway dead) information is available.
  • Dead gateway detection is covered in some detail in RFC-816 [IP:11]. Experience to date has not produced a complete algorithm which is totally satisfactory, though it has identified several forbidden paths and promising techniques.

In case you're wondering, what they're really saying is that dead gateway detection was not built into the protocol and no satisfactory solution for its absence has been found.

Ratz 22 Jan 2006

According to RFC 816 and RFC 1122 there are multiple ways to perform DGD; however, I've only seen about three of them in the wild:

  • Link-layer information that reliably detects and reports host failures (e.g., ARPANET Destination Dead messages) should be used as negative advice.
  • An ICMP Redirect message from a particular gateway should be used as positive advice about that gateway.
  • Packets arriving from a particular link-layer address are evidence that the system at this address is alive. However, turning this information into advice about gateways requires mapping the link-layer address into an IP address, and then checking that IP address against the gateways pointed to by the route cache. This is probably prohibitively inefficient.

The Alteon switch does media detection and can also listen to special L2 PDU packets, including advertisements. Media detection under Linux is an often-discussed and, to date, unresolved issue. For about two months, starting last November, a couple of people on netdev have been working on proper link state propagation in the core kernel; the result will be seen in 2.6.17 ;). Other than that, I suggest you use non-cheap but excellently supported NICs, like the e1000, and check the media state using ethtool, or write a netlink listener.
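
A sketch of checking the media state from userspace (the ethtool output format varies by version and driver, and /sys/class/net is only present on 2.6 kernels):

ethtool eth0 | grep "Link detected"
cat /sys/class/net/eth0/carrier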

You are allowed to ping, but only if nothing else works for you (3.3.1.4):

  • Active probes such as "pinging" (i.e., using an ICMP Echo Request/Reply exchange) are expensive and scale poorly. In particular, hosts MUST NOT actively check the status of a first-hop gateway by simply pinging the gateway continuously.
  • Even when it is the only effective way to verify a gateway's status, pinging MUST be used only when traffic is being sent to the gateway and when there is no other positive indication to suggest that the gateway is functioning.
  • To avoid pinging, the layers above and/or below the Internet layer SHOULD be able to give "advice" on the status of route cache entries when either positive (gateway OK) or negative (gateway dead) information is available.

Multiple routes to the internet are discussed in Routing for multiple uplinks/providers (http://lartc.org/howto/lartc.rpdb.multiple-links.html) and Multiple Connections to the Internet (http://linux-ip.net/html/adv-multi-internet.html). Julian (immediately below) has a dead gateway detection mechanism, and a working setup with dead gateway detection is shown in the Nano-Howto to use more than one independent Internet connection by Christoph Simon (http://www.ssi.bg/~ja/nano.txt). The author warns

The setup of all this is not a question of 5 minutes

Logu lvslog (at) yahoo (dot) com 5 Oct

I have two ISDN internet connections from two different ISPs. I am going to put an LVS-NAT between the users and these two links so as to load balance the bandwidth.

Julian

You can use Linux's multipath routing feature:

# ip ru
0:      from all lookup local
50:     from all lookup main
...
100:    from 192.168.0.0/24 lookup 100
200:    from all lookup 200
32766:  from all lookup main
32767:  from all lookup 253

# ip r l t 100
default  src DUMMY_IP
	nexthop via ISP1  dev DEV1 weight 1
	nexthop via ISP2  dev DEV2 weight 1

# ip r l t 200
default via ISP1 dev DEV1  src MY_IP1
default via ISP2 dev DEV2  src MY_IP2

You can add my dead gateway detection extension (for now only against 2.2)

This way you will be able to fully utilize both lines for masquerading. Without this patch you will not be able to select different public IPs for each ISP. They are named "Alternative routes". Of course, in any case the management is not an easy task; it needs understanding.
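
For reference, the commands behind Julian's example would look roughly like this (a sketch; DUMMY_IP, ISP1/ISP2, DEV1/DEV2 and MY_IP1/MY_IP2 are the placeholders from his output, and the second default route in table 200 needs either ip route append or his alternative routes patch):

ip rule add from 192.168.0.0/24 table 100 prio 100
ip rule add from all table 200 prio 200
ip route add default src DUMMY_IP table 100 \
        nexthop via ISP1 dev DEV1 weight 1 \
        nexthop via ISP2 dev DEV2 weight 1
ip route add default via ISP1 dev DEV1 src MY_IP1 table 200
ip route append default via ISP2 dev DEV2 src MY_IP2 table 200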

anon

I currently have multiple ADSL modems that connect to the internet.

Alexandre Cassen alexandre (dot) cassen (at) wanadoo (dot) fr 11 Apr 2003

This is a routing design problem, commonly solved by load balancing the default route at the routing level (netlink). You add two default gateways with the same weight to provide outbound load balancing. Since current Linux kernel routing lacks dead gateway detection, you will need to apply Julian's "dead gateway detection" patch.

37.3. Dynamic Routing to handle loss of routing in directors

Here I show how to use dynamic routing to handle routing following failure of the link from a director to its default gw. The director with the failed default route gets its new routing information from the adjacent director, which is assumed to have a functional route to the outside world.

After I got this to work, I found out that you don't normally use dynamic routing if the interfaces on the two machines are in the same networks, as shown here and as happens with duplicate directors (or routers).

-------network_A---------
      |         |
    host_1    host_2
      |         |
-------network_B---------

In the case of common networks, alternate routes are (usually) handled by multiple static routes with different weights, e.g. Routing for multiple uplinks/providers in the Linux Advanced Routing and Traffic Control HOWTO (see the sketch below). This section, then, is not exactly central to LVS failover, and unless you have some other reason to read about dynamic routing, you may want to skip it. This was my first attempt at dynamic routing; even if you use dynamic routing, I won't be surprised if there are better ways of doing it. Suggestions welcome.
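
For reference, a minimal sketch of the static-route approach (the gateway addresses are the routers from the diagrams later in this section, used here only as placeholders). The kernel uses the lower-metric route while it exists; without dead gateway detection it only falls back when that route is removed, e.g. when its interface goes down:

ip route add default via 192.168.1.253 metric 10
ip route add default via 192.168.1.254 metric 20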

You use dynamic routing only if the hosts are connected to non-common networks, as here, where host_1 is not connected to network_C, while host_2 (which is connected to and can communicate with host_1) is.

----network_A-----
      |
    host_1
      |
----network_B-----
      |
    host_2
      |
----network_C----

Dynamic routing would be used by host_2 to send information about network_C to host_1 (etc).

I had previously been handling routing failure with scripts. Script-driven failover (where, as well, you have to reconfigure demons to listen to the moveable IP and the router has to think that it has a new name) requires the scripts to run in pairs (to_up on one machine, and to_down on the other). The scripts have to be synchronized and have to run to completion on both machines. If one machine becomes deranged and loses track of its state, then the scripts won't fail over cleanly. You should be able to down/crash/wedge any single NIC/route/disk/demon in a failover router pair without losing routing, no matter what. I found that my scripts would often end up in some hung state. Perhaps better scripts would have handled it, but this suggests that functional scripts are difficult to write.

I was looking for other ways of handling routing failure when John Reuning posted on the mailing list that he was using zebra. I had not managed to even figure out how to set up the .conf file the last time I tried (several years ago), as I found the docs inscrutable (some sections were blank). Here's the posting from John Reuning, which showed me how simple it was to configure zebra and which started me off with dynamic routing.

John Reuning john (at) metalab (dot) unc (dot) edu 17 Feb 2004

I've included the .conf files below. I didn't do anything crazy coming up with this stuff. There were sample config files in the source code, and I just copied what I needed.

To make snmp work, these need to go in the snmpd config:

smuxpeer 1.3.6.1.4.1.3317.1.2.1 zebra
smuxpeer 1.3.6.1.4.1.3317.1.2.2 zebra_bgpd

The one quirk I remember is that one of the daemons needs to start before the other. If zebra isn't running when bgpd starts up, it freaks out.

bgpd.conf

! bgpd.conf
!
hostname bgpd
password xxxxxx
enable password xxxxxx
log syslog
!log stdout
!log file bgpd.log
smux peer .1.3.6.1.4.1.3317.1.2.2 zebra_bgpd
!
!bgp multiple-instance
!
router bgp 2
  bgp router-id 192.168.1.254
  neighbor 10.0.1.1 remote-as 2
  neighbor 192.168.2.1 remote-as 2
  neighbor 192.168.2.1 route-reflector-client
  neighbor 192.168.2.2 remote-as 2
  neighbor 192.168.2.2 route-reflector-client
!
line vty
!

zebra.conf

! zebra.conf
!
hostname director
password xxxxxx
enable password xxxxxx
log syslog
!log stdout
!log file zebra.log
!
smux peer .1.3.6.1.4.1.3317.1.2.1 zebra
!

I thought it would be better to handle the failover with hardened and well-tested demon(s) running on each machine, which maintain communication and know what to do when one machine is in an arbitrary fault state. These demons would then run the minimum depth of the more fragile, dependent scripts.

Zebra is a GPL package containing the common dynamic routing demons (ripd, bgpd, ospfd). Zebra runs on many platforms and uses a command syntax close to that of Cisco IOS (i.e. you can use the Cisco documentation if you're stuck). Useful documentation I found:

  • A review on Zebra by Mike Metzger (link dead Mar 2004, http://www.unixreview.com/documents/s=1350/urm001a/). An introduction to setting up Zebra. This didn't give me enough information to get going, but did tell me that someone understood it and gave me hope that I would too.
  • Build a network router on Linux by Dominique Cimafranca and Rex Young (http://www-106.ibm.com/developerworks/linux/library/l-emu/) - a slightly more advanced introduction to Zebra. This, together with the config files supplied by John Reuning (below), contained enough information for me to get Zebra to do something.
  • The Zebra documentation (http://www.zebra.org/) (this seems to be complete now - a few years back, whole sections were blank).
  • Cisco documentation (http://www.cisco.com/pcgi-bin/Support/browse/index.pl?i=Technologies&f=770). After bootstrapping from the Cimafranca and Young article, I could use the articles here e.g. Routing Information Protocol (RIP) (link dead Mar 2004, http://www.cisco.com/univercd/cc/td/doc/cisintwk/ito_doc/rip.html), Using the Border Gateway Protocol for Interdomain Routing (link dead Mar 2004, http://www.cisco.com/univercd/cc/td/doc/cisintwk/ics/icsbgp4.html).
  • I was helped by Tom Brosnan and Steve Buchanan, networking people at my job.
  • After I got this working, I found Dynamic routing - OSPF and BGP (http://www.lartc.org/howto/lartc.dynamic-routing.html) in the Linux Advanced Routing and Traffic Control HOWTO.

As with most computer documentation, you already have to understand the topic in order to be able to read it. Much documentation about dynamic routing concerns the differences between RIP, BGP, OSPF, and goes into details about convergence, horizons... You don't need any of that right now. All you need to know is that these 3 protocols move routing information from one machine to another and that the syntax of the commands for them is much the same. For moving routing information within an AS, you use rip (the original protocol) or ospf (the newer protocol). For moving routing information between different ASs, you need bgp (I think).

To the LVS client, as far as routing is concerned, an LVS appears to be a single leaf node. For an LVS with one director, all routing is to the director and the LVS really is a single leaf node. When multiple directors are involved, and the VIP hops between directors on failover, the inbound routing can be handled at the arp level (the director uses send-arp to update the location of the VIP). For outbound routing (i.e. packets from the VIP on the director to 0/0), dynamic routing protocols can be used. One place that dynamic routing could be used in an LVS is following failure of the link to 0/0: the director, no longer having a route to 0/0, has to route packets through the other director (see the diagram below).

Note
I wanted the setup to run a router failover pair. If you are using this to maintain outbound routing for an LVS, you will only need this for LVS-NAT. For LVS-DR and LVS-Tun, for security, there should be no route from the VIP on the director to 0/0 - see default gw for director with LVS-DR/LVS-Tun.

Normally with dynamic routing, the routers (here, the two directors) are in contact with upstream routers (running a dynamic routing protocol), which feed routing information to them. The link state of the network (up||down) can be inferred from the presence (or absence) of the routing advertisements. With routing advertisements exchanged at 30-60 sec intervals, it will take ripd about 3 minutes to time out a dead link. bgp is a little faster and takes about a minute to time out a dead link.
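
For reference, the RIP timers behind these numbers can be set in ripd.conf. A sketch (the zebra/quagga "timers basic" command takes the update, timeout and garbage-collection intervals in seconds; these values are the defaults, and the syntax may vary between versions):

router rip
 ! advertise every 30s, declare a route dead after 180s, remove it 120s later
 timers basic 30 180 120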

In the general case, you may not be able to get dynamic routing information from upstream. Some organisations are big and inflexible, there may be turf battles, and the IT department will worry about getting bogus advertisements from you.

Where I live, network link failure (or routing failure, which may appear as a link failure) is the most common problem when maintaining service.

Note

Other problems, e.g. power failures, occur more often, but these can be handled by UPSs; disk failures, which you have to plan for, are handled by disk monitoring tools and pre-emptive replacement of working disks as they approach their warranty expiration.

Assuming that the two routers (directors) are both functional, failover after a routing/link failure has to handle two problems

  • detection of link/routing failure

    A setup is needed that works without link (or routing) information from upstream machines. In the absence of packets from an upstream machine, link (or routing) failure detection is difficult. I will assume that this is being handled by the failover demon (keepalived/vrrpd or Linux-HA).

  • reconfiguring the default gw

    The director with a failed route to the outside has to route via the other director.

Here's some info about the differences between routing via tables (i.e. how you set up a leaf node) and routing with dynamic routing protocols (i.e. on a router)

  • leaf nodes: automatically route to networks on interfaces. All other packets are sent to a default gw. The machine's view of the network is fixed and it knows that it is at the edge of the internet.

  • routers: advertise networks and IPs. Other routers pick up the messages and figure out the routing. All routers see themselves in the middle of the internet, with no idea where they are in it, or how big the internet is (the Ptolemaic view of the network). In particular, RIP and OSPF routers don't know about the edges of ASs or the existence of other ASs. You don't explicitly set routes; rather you list the IPs of the neighbors and then let the routing demons figure out the topology. Except for border routers, the other routers (running RIP or OSPF) don't know about an AS.

    Note
    What you need to have your own AS depends on your clout and size in the networking world. If you're a big governmental agency with offices throughout the country, have thousands of networked computers, route all your intra-agency packets through clouds (leased lines, where packets don't go onto the internet) and route all your packets to the internet over multiple redundant links, via local ISPs at each site, then you'll have your own AS. Big ISPs will have an AS. If you're a small organisation, you'll have static links to your provider and you won't have an AS. Small dial-up companies with just a few machines handling the traffic have static links and don't even get routing advertisements from their ISPs. Businesses usually deal with computers or with networks, but not both. If you're in a business that uses computers (e.g. you're an applications programmer), then you won't have an AS. If you even ask the question "what do I need to have an AS?", then you aren't in the network business and you won't have one.

    An AS is connected through border routers (usually two or more for redundancy) to an ISP which is connected to the internet backbone. The border routers act as a default gw for the routers inside the AS (they do so via the "default-information originate" instruction in their .conf file) and appear as a "route of last resort" in the routing tables of the inside machines.

    If you want your own (private) AS, then use the private AS numbers 64512-65535 (the AS equivalent of 192.168.x.x IPs). Advertisements for these ASs are not propagated.

    After convergence, the routers within an AS will know the routes in the AS and will know which machines to use as their default gw (gateway of last resort).

Here's the setup for the demonstration with two routers (directors), working as an active/backup pair, running a dynamic routing protocol.

Note

dummy0 is configured with an IP in the Cimafranca paper, partly for their demonstration. This IP allows you to ping the node from the outside, as long as at least one hardware NIC is up. Supposedly this IP is a convenience to be able to identify a host (although I didn't have any need for it). dummy0 is chosen as it is the interface least likely to go down. cisco routers use lo for this IP, but apparently the convention with Linux is to use dummy0.

The IPs on each dummy0 interface are in different networks. If they are in the same network, you can't route to the IP on dummy0 on adjacent machines.
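
If dummy0 doesn't already exist on your machine, here is a minimal sketch of bringing it up (this assumes the standard dummy module; zebra will add the IP as shown below, or you can add it by hand):

modprobe dummy
ip link set dummy0 up
# optionally add the IP by hand instead of via zebra:
# ip addr add 10.0.1.1/24 brd + dev dummy0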

Here is the network during normal functioning

                    ______________    ______________
                   |              |  |              |
                   |    router    |  |    router    |
                   |______________|  |______________|
                    192.168.1.253     192.168.1.254
                          |                 |
                          |                 |
                          |                 |
                          |                 |
                          |                 |
                     192.168.1.1       192.168.1.2
                    ______________    ______________
                   |              |  |              |
10.0.1.1/24=dummy0-|    backup    |  |    active    |-dummy0=10.0.2.1/24
                   |______________|  |______________|
                          |                 |
                     192.168.2.1       192.168.2.2
                          |                 |
                           -----------------
                                   |
                              realservers

Here is the network immediately following link loss to the backup director's default gw.

  • the backup director has no default gw.
  • the active director has a default gw.
  • The job of the dynamic routing demon is to let the backup director know where the default gw is.
                    ______________    ______________
                   |              |  |              |
                   |    router    |  |    router    |
                   |______________|  |______________|
                    192.168.1.253     192.168.1.254
                          |                 |
                                            |
                          X                 |
                                            |
                          |                 |
                     192.168.1.1       192.168.1.2
                    ______________    ______________
                   |              |  |              |
10.0.1.1/24=dummy0-|    backup    |  |    active    |-dummy0=10.0.2.1/24
                   |______________|  |______________|
                          |                 |
                     192.168.2.1       192.168.2.2
                          |                 |
                           -----------------
                                   |
                              realservers

Here is the network after re-establishing the default route for the backup master node. This takes about 25 secs with RIP.

                    ______________    ______________
                   |              |  |              |
                   |    router    |  |    router    |
                   |______________|  |______________|
                    192.168.1.253     192.168.1.254
                          |                 |
                                            |
                                            |
                           -----------------|
                          |                 |
                     192.168.1.1       192.168.1.2
                    ______________    ______________
                   |              |  |              |
10.0.1.1/24=dummy0-|    backup    |  |    active    |-dummy0=10.0.2.1/24
                   |______________|  |______________|
                          |                 |
                     192.168.2.1       192.168.2.2
                          |                 |
                           -----------------
                                   |
                              realservers
Note

You must install the iproute2 tools for zebra to work, and any CLI commands must be policy routing (iproute2) commands.

There are two series of network tools available with Linux

  • ifconfig, route; these are the old style commands.
  • ip addr show, ip route show from the iproute2 policy routing tools.

The routes/IPs added by rip/zebra are added with the iproute2 tools. The two series of commands are incompatible: IPs (or routes) added by iproute2 may not be visible to ifconfig (or route), and routes added by ip route add may be visible to route but can't be deleted by route. All IP and route commands from the command line must use the iproute2 tools.
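
A sketch of the incompatibility (192.168.1.240 is just a placeholder address):

ip addr add 192.168.1.240/24 dev eth0
ip addr show dev eth0     # shows the new (unlabelled) secondary IP
ifconfig eth0             # doesn't show it - ifconfig only lists labelled aliases like eth0:1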

If you like names rather than port numbers, add these to /etc/services

#zebra
zebrasrv      2600/tcp            # zebra service
zebra         2601/tcp            # zebra vty
ripd          2602/tcp            # RIPd vty
ripngd        2603/tcp            # RIPngd vty
ospfd         2604/tcp            # OSPFd vty
bgpd          2605/tcp            # BGPd vty
ospf6d        2606/tcp            # OSPF6d vty
#

zebra.conf

!
! Zebra configuration saved from vty
!   2004/02/19 17:48:27
!
! hostname given at zebra prompt and passwd
hostname zebra
password zebra
!
! enable "enable" command and give passwd for it.
enable password zebra
!
! log to a file
log file /var/log/zebra.log
! alternatively, log to a facility
!log syslog
!log stdout
!
! turn on vtysh access
line vty
!
! the interfaces you want Zebra to know about
! (tell zebra about all of them)
interface lo
!
interface dummy0
!
interface tunl0
!
interface eth0
!
interface eth1
!---------------------------------

Here are my zebra init script, ripd init script, bgpd init script, and ospfd init script.

Now use the zebra shell (vtysh or telnet localhost zebra) to install an IP on dummy0 (following the instructions of Cimafranca and Young).

Note
  • these instructions will only work if dummy0 is in zebra.conf
  • note the different prompts for bash, zebra, zebra in "enable" mode, zebra in "configure terminal" mode, and zebra in "configure interface" mode.

You can add the IP for dummy0 into zebra.conf with an editor instead. You could also add the IP on bootup, but by adding the information to the .conf file, the IP will only be present after you start up zebra.

director:/etc/zebra# /etc/rc.d/rc.zebra start
director:/etc/zebra# telnet localhost zebra
Trying 127.0.0.1...
Connected to localhost.
Escape character is '^]'.

Hello, this is zebra (version 0.94).
Copyright 1996-2002 Kunihiro Ishiguro.


User Access Verification

Password:
zebra> enable
Password:
zebra# configure terminal
zebra(config)# interface dummy0
zebra(config-if)# ip address 10.0.1.1/24
zebra(config-if)# quit
zebra(config)# write
Configuration saved to /etc/zebra/zebra.conf
zebra(config)# end
zebra# show run

Current configuration:
!
hostname zebra
password zebra
enable password zebra
log file /var/log/zebra.log
!
interface lo
!
interface dummy0
 ip address 10.0.1.1/24
!
interface tunl0
!
interface eth0
!
interface eth1
!
!
line vty
!
end
zebra# quit
Connection closed by foreign host.
director:/etc/zebra# cat zebra.conf
!
! Zebra configuration saved from vty
!   2004/02/24 17:51:02
!
hostname zebra
password zebra
enable password zebra
log file /var/log/zebra.log
!
interface lo
!
interface dummy0
 ip address 10.0.1.1/24
!
interface tunl0
!
interface eth0
!
interface eth1
!
!
line vty
!

Next time you start up zebra, the new zebra.conf file will add the IP to dummy0 and the src route (as if you'd run ip addr add 10.0.1.1/24 dev dummy0 brd + from the command line).

Start up zebra on the second director and add an IP to dummy0 there (you can copy the zebra.conf file here to the other director and change the IP for dummy0).

Now you're going to start ripd. Here's my ripd.conf

!
! Zebra configuration saved from vty
!   2004/03/01 14:38:03
!
hostname ripd
password zebra
enable password zebra
log file /var/log/ripd.log
!
interface lo
!
interface dummy0
!
interface tunl0
!
interface eth0
!
interface eth1
!
router rip
 network eth0
 network eth1
!
line vty
!

Here I add networks to the conf file from the ripd vty interface (you could use an editor on the conf file too).

director:/etc/zebra# telnet 0 ripd
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.

Hello, this is zebra (version 0.94).
Copyright 1996-2002 Kunihiro Ishiguro.


User Access Verification

Password:
ripd> enable
Password:
ripd# configure terminal
ripd(config)# router rip
ripd(config-router)# network 10.0.1.0/24
ripd(config-router)# network 192.168.1.0/24
ripd(config-router)# write
Configuration saved to /etc/zebra/ripd.conf
ripd(config-router)# show run

Current configuration:
!
hostname ripd
password zebra
enable password zebra
log file /var/log/ripd.log
!
interface lo
!
interface dummy0
!
interface tunl0
!
interface eth0
!
interface eth1
!
router rip
 network 10.0.1.0/24
 network 192.168.1.0/24
 network eth0
 network eth1
!
line vty
!
end
ripd(config-router)#  quit
ripd(config)# exit
ripd# exit
Connection closed by foreign host.
director:/etc/zebra#

Here's the ripd.conf I used for the demo.

! -*- rip -*-
!
! RIPd sample configuration file
!
! $Id: ripd.conf.sample,v 1.11 1999/02/19 17:28:42 developer Exp $
!
hostname ripd
password zebra
enable password zebra
!
! debug rip events
! debug rip packet
!
router rip
network 0.0.0.0/0
network 192.168.1.0/24
network 192.168.2.0/24
network eth0
network eth1
redistribute kernel
!
!default-information originate
!
log file /var/log/ripd.log

Make sure both routers have default routes.

backup:/etc/zebra: ip route add default via 192.168.1.253
active:/etc/zebra: ip route add default via 192.168.1.254

Activate debugging in zebra (so you will see notices of rip updates on the screen) and then show the routes

backup:/etc/zebra# telnet 0 zebra
Trying 0.0.0.0...
Connected to 0.
Escape character is '^]'.

Hello, this is zebra (version 0.94).
Copyright 1996-2002 Kunihiro Ishiguro.


User Access Verification

Password:
zebra> enable
Password:
zebra# debug zebra packet
zebra# show ip route
Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
       B - BGP, > - selected route, * - FIB route

K>* 0.0.0.0/0 via 192.168.1.253, eth1
R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:07:44
C>* 10.0.2.0/24 is directly connected, dummy0
K * 127.0.0.0/8 is directly connected, lo
C>* 127.0.0.0/8 is directly connected, lo
K * 192.168.1.0/24 is directly connected, eth1
C>* 192.168.1.0/24 is directly connected, eth1
K * 192.168.2.0/24 is directly connected, eth0
C>* 192.168.2.0/24 is directly connected, eth0

The output shows that the backup router has a default route added by the kernel (at the CLI above) and a route to 10.0.1.0 added by RIP, which enables routing to 10.0.1.1 on the other machine. (A similar view will be seen by running ip route show at the CLI.) The [120/2] indicates the administrative distance of the route (120, the default for RIP) and the metric, here the number of hops (2).

Then do the following in order -

  • From another window, remove the default route at the command prompt (ip route del default via 192.168.1.253)
  • in the zebra window above, up arrow and rerun the show ip route to show that the default route has gone.
    zebra# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
           B - BGP, > - selected route, * - FIB route
    
    R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:18:01
    C>* 10.0.2.0/24 is directly connected, dummy0
    K * 127.0.0.0/8 is directly connected, lo
    C>* 127.0.0.0/8 is directly connected, lo
    K * 192.168.1.0/24 is directly connected, eth1
    C>* 192.168.1.0/24 is directly connected, eth1
    K * 192.168.2.0/24 is directly connected, eth0
    C>* 192.168.2.0/24 is directly connected, eth0
    
  • watch in the zebra window for a RIP update
    zebra# 2004/03/02 21:21:54 ZEBRA: zebra message received [ZEBRA_IPV4_ROUTE_ADD] 14
    
  • up arrow in the zebra window and rerun the show ip route to show the new default route.
    zebra# show ip route
    Codes: K - kernel route, C - connected, S - static, R - RIP, O - OSPF,
           B - BGP, > - selected route, * - FIB route
    
    R>* 0.0.0.0/0 [120/2] via 192.168.1.254, eth1, 00:00:03
    R>* 10.0.1.0/24 [120/2] via 192.168.2.1, eth0, 00:18:31
    C>* 10.0.2.0/24 is directly connected, dummy0
    K * 127.0.0.0/8 is directly connected, lo
    C>* 127.0.0.0/8 is directly connected, lo
    K * 192.168.1.0/24 is directly connected, eth1
    C>* 192.168.1.0/24 is directly connected, eth1
    K * 192.168.2.0/24 is directly connected, eth0
    C>* 192.168.2.0/24 is directly connected, eth0
    

The new (x.x.x.254 rather than x.x.x.253) default gw is now installed and this time it's installed by RIP (rather than the kernel). Here's the view of the routing as shown from the CLI

backup:/etc/zebra# ip route show
10.0.1.0/24 via 192.168.2.1 dev eth0  proto zebra  metric 2 equalize
192.168.2.0/24 dev eth0  scope link
192.168.2.0/24 dev eth0  proto kernel  scope link  src 192.168.2.253
192.168.1.0/24 dev eth1  scope link
192.168.1.0/24 dev eth1  proto kernel  scope link  src 192.168.1.253
10.0.2.0/24 dev dummy0  proto kernel  scope link  src 10.0.2.253
127.0.0.0/8 dev lo  scope link
default via 192.168.1.254 dev eth1  proto zebra  metric 2 equalize

This time the default route is installed by zebra.

You can time the route failover: between the two listings, the age of the 10.0.1.0 route went from 00:18:01 to 00:18:31 (30 seconds), and the new default route is only 00:00:03 old, so the new route appeared about 30-3=27 seconds after the old default route was deleted.

The new default gw is the other director's default gw. I had initially hoped that the new default gw would be an IP on the active director, and that ICMP redirects would handle re-routing to the active director's default gw. However this didn't work, although I thought it would for a while. Here's what happened.

If you activate the line

default-information originate

in ripd.conf on just the active director, the active director, having a default route of its own, will advertise that it is a default route. If you then do the failover, the default route on the backup director will be an IP on the active director. (I thought I was home free at this stage.) Since you want to do this symmetrically, you activate the same line in ripd.conf on the backup director. The problem (from talking to Steve Buchanan) is that the backup director, if it's been told to advertise that it is the default route, is not going to accept an advertisement from anyone else (like the active director) declaring that they are the default gw instead. After activating the option default-information originate, then on failure of the link, the backup master node will not accept the RIP update of a default route and will not show a default route.

With dynamic routing then, after failover, the default route for the backup router is the default route of the active router, and not an IP on the active router. Functionally these achieve the same result if there are no other problems with the routing on the backup router.

37.4. Dynamic routing with gated: An LVS that connects to the outside world through two networks

Patrick LeBoutillier patl (at) fusemail (dot) com 26 May 2004

Here is a "recipe" for creating LVS clusters with machines that support redundant networking.

Our production environment is fully redundant at the network level (each machine has two network interfaces, each connected to a different network). All machines are connected to both these networks and data can come from either network. On each machine, services run on a local network address and gated announces the route for these networks via both network interfaces. My task was to create an LVS cluster of two such machines (each a potential director and realserver as well).

The network setup:

Network 1 is 192.168.10.0/24
Network 2 is 192.168.11.0/24

Machine 1:
  - eth0: 192.168.10.1
  - eth1: 192.168.11.1
  - local network on loopback (lo:real): 192.168.20.1/32

Machine 2:
  - eth0: 192.168.10.2
  - eth1: 192.168.11.2
  - local network on loopback (lo:real): 192.168.21.1/32

Virtual IP is 192.168.30.1

gated setup:

Have gated announce (and accept) the following routes:

  • Machine 1:

    - announce 192.168.20.1/32

    - accept routes from 192.168.10.2 and 192.168.11.2

  • Machine 2:

    - announce 192.168.21.1/32

    - accept routes from 192.168.10.1 and 192.168.11.1

These routes will be used by ldirectord to monitor the realservers.

Recipe

  1. Install UltraMonkey as usual, but:

    Make sure to configure ping nodes in both networks.

    Note
    A "ping node" is a pingable IP that is used by the heartbeat ipfail plugin, to determine if a director has lost network connectivity. The "ping node" terminology is defined at Getting Started with Linux-HA (heartbeat) (http://linux-ha.org/download/GettingStarted.html).

    - Create the virtual IP alias as 192.168.30.1

    - A virtual service definition in ldirectord.cf should look something like this:

         virtual=192.168.30.1:80
                 real=192.168.20.1:80 gate
                 real=192.168.21.1:80 gate
                 service=http
                 checkport=80
                 request="/test.html"
                 receive="test"
                 scheduler=rr
                 protocol=tcp
    

    In a normal setup, heartbeat manages the virtual IP alias and brings it up on the active director. If I understand correctly, an arp request is then sent, making the other machines in the local network aware that the active director is now the machine to be reached for the virtual IP.

    In this setup we will tell heartbeat to leave the virtual IP alias alone and have it tell gated to announce the route for the 192.168.30.1/32 network instead. Therefore ONLY the active director will announce the routes to reach the virtual IP network.

  2. Change your haresources line to something like this:

       node1.cluster.tld gated-toggle ldirectord
    
  3. Place the following (or equivalent) code in a file called /etc/ha.d/resource.d/gated-toggle:

    --------8<--------
    #!/bin/bash
    #
    # This gated control script should only be called by heartbeat!
    #
    # start: RESTART gated with the original (non-director config)
    # stop:  RESTART gated with the director config
    #
    
    # Source function library.
    . /etc/rc.d/init.d/functions
    
    # Source networking configuration.
    . /etc/sysconfig/network
    
    # Check that networking is up.
    [ "${NETWORKING}" = "no" ] && exit 0
    
    gdc=/usr/sbin/gdc
    gated=/usr/sbin/gated
    prog=gated
    
    if [ ! -f /etc/gated.conf -o ! -f $gdc ] ; then
            action $"Not starting $prog: " true
            exit 0
    fi
    
    PATH=$PATH:/usr/bin:/usr/sbin
    
    RETVAL=0
    
    start() {
            echo -n $"Starting $prog: "
            CFG=$1
            if [ "$CFG" != "" ] ; then
                    # strip the leading '#' from config lines that end in "# heartbeat-toggle"
                    RES='$2$3'
                    RE="s/^(\s*\#+)(.*)(\#\s*heartbeat-toggle\s*)$/$RES/"
                    /usr/bin/perl -p -e "$RE" /etc/gated.conf > "$CFG"
                    daemon $gated -f $CFG
            else
                    daemon $gated
            fi
            RETVAL=$?
            [ $RETVAL -eq 0 ] && touch /var/lock/subsys/gated
            echo
            return $RETVAL
    }
    
    stop() {
            # Stop daemons.
            action $"Stopping $prog" $gdc stop
            RETVAL=$?
            if [ $RETVAL -eq 0 ] ; then
                    rm -f /var/lock/subsys/gated
            fi
            return $RETVAL
    }
    
    # See how we were called.
    case "$1" in
      start)
            stop
            start "/etc/gated-heartbeat.conf"
            ;;
      stop)
            stop
            start
            ;;
      *)
            echo $"Usage: $0 {start|stop}"
            exit 1
    esac
    
    exit $RETVAL
    -------->8--------
    

    What this script does is:

    • On resource acquisition:

      Copy the gated configuration file (/etc/gated.conf) to another file (/etc/gated-heartbeat.conf), activate the route for the virtual IP network and restart gated using the new file.

    • On resource loss:

      Restart gated using the original configuration.

    Note
    gated must always be running and must start at boot time using the non-active (default) config.

  4. Modify /etc/gated.conf accordingly. Here is the /etc/gated.conf file for machine 1:

    --------8<--------
    options syslog upto debug;
    
    smux off;
    bgp off;
    egp off;
    ospf off;
    
    rip yes{
      interface all noripin noripout;
      interface eth0 ripin ripout version 2 multicast;
      interface eth1 ripin ripout version 2 multicast;
      trustedgateways 192.168.10.2 192.168.11.2 (...); # other routers in the network
    };
    
    
    static {
            192.168.20.1 masklen 32 interface 127.0.0.1 preference 0 retain;
            192.168.30.1 masklen 32 interface 127.0.0.1 preference 0 retain;
    };
    
    import proto rip{
      all;
    };
    
    # Export different things, depending on the operating mode (production/standby)
    export proto rip{
      proto static{
              host 192.168.20.1 metric 1;
    #          host 192.168.30.1 metric 1; # heartbeat-toggle
      };
    };
    -------->8--------
    

    The gated-toggle script will look for all lines ending with "# heartbeat-toggle" and turn them on (or off) depending on the cluster state.
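
For example, with the /etc/gated.conf above, the perl substitution in gated-toggle strips the leading comment marker from the tagged line, roughly turning

#          host 192.168.30.1 metric 1; # heartbeat-toggle

into

          host 192.168.30.1 metric 1; # heartbeat-toggle

so that the active director exports the route for the virtual IP network, while the backup (running the unmodified /etc/gated.conf) leaves it commented out.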

I suspect you could do something similar with zebra or some other routing software, as long as you can restart it with a different config or (even better) change its config dynamically (maybe you can dynamically change the config for gated, but I'm not aware of this).

37.5. Flapping stemming from convergence time for spanning tree

Shaun McCullagh shaun (dot) mccullagh (dot) marviq (dot) com 27 May 2004

I've encountered some flapping problems with Keepalived v1.1.1 (on RH Linux 7.3, kernel 2.4.18-5) when used with Cisco 2948 and C3548-XL switches. Both the Master and Backup PCs use 3Com 905C NICs. As an experiment I tried ifconfig eth2 down on the Backup system to check that it recovered from a FAULT state. The system went into FAULT state as expected, but when I did ifconfig eth2 up, keepalived initially went to BACKUP state, then started oscillating between MASTER and BACKUP state.

I fixed the problem by increasing advert_int to 35 seconds (on both the Master and Backup systems). The problem with this is that when Keepalived is started, the VIPs obviously take much longer to come up than if advert_int is set to 5 seconds.

I'd be grateful for suggestions as to what to investigate, as I'd quite like to set advert_int back to 5 seconds.

Graeme Fowler keepalived (at) graemef (dot) net 27 May 2004

Hard-set your switch speed/duplex settings for those ports, and use "mii-tool" (assuming it supports your cards) to do the same at the server end. Cisco switches take up to 30 seconds to complete their autonegotiation - if they're hard-set, they don't.

Kjetil Torgrim Homme kjetilho (at) ifi (dot) uio (dot) no 27 May 2004

It's not auto-negotiation that takes time, it's the spanning tree algorithm. It's required to wait for 30 seconds to discover loops in the topology (nodes will only announce their presence so often). You can turn this off with the configuration option spanning-tree portfast, if you're certain the port will never be used to connect to switches.

Graeme Fowler keepalived (at) graemef (dot) net 28 May 2004

Whoops! My mistake; indeed it is the spanning tree algorithm. I also ensure that I have "spanning-tree portfast" set on interfaces which I know will always be connected to hosts rather than switches (or in fact where I know that the port may connect to a switch which is not spanning tree capable).

One point of note though is that I have on occasion been bitten by interfaces which continually autonegotiate - whilst connectivity seems OK, the interface itself flaps wildly every few seconds. Hence the comments about hard-setting port speeds :)
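
For reference, the portfast setting mentioned above looks roughly like this on the switch (a Cisco IOS sketch; the interface name is hypothetical and should be the host-facing port):

interface FastEthernet0/10
 spanning-tree portfast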