Sunday, December 4, 2011

EIGRP Query-Reply Process and Stuck-in-Active Routes

When an EIGRP router loses a route and has no feasible successor (FS) for it in its topology table, it looks for an alternative path to the destination. The route is then considered active; a route is considered passive when the router is not performing any computation for it. Recomputing an active route involves sending query packets to all neighbors on interfaces other than the one used for the previous successor (due to the split horizon rule), asking whether they have a route to the destination. If a neighbor has an alternative route, it answers the query with the path in a reply packet and does not propagate the query further. If a neighbor does not have an alternative path, it in turn queries its own neighbors. The queries are thus propagated through the network and create an expanding tree of queries.

When a router replies to a query, it stops the spread of the query through that branch of the network. However, the query can still spread through other portions of the network as other routers attempt to find alternative paths, which might not exist.
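
The way a reply cuts off one branch of the query tree can be illustrated with a small Python sketch. This is a deliberately simplified model, not the actual DUAL state machine: the query_scope function, the four-router chain, and the has_alternative set are hypothetical names and data made up for illustration.

```python
from collections import deque


def query_scope(graph, origin, has_alternative):
    """Return the set of routers that receive a query (simplified model).

    graph maps router -> list of neighbors; has_alternative is the set of
    routers that can answer from their own tables and so stop the query.
    """
    queried = set()
    visited = {origin}
    frontier = deque([(origin, None)])
    while frontier:
        router, came_from = frontier.popleft()
        for nbr in graph[router]:
            if nbr == came_from:
                continue  # never query the neighbor the query arrived from
            queried.add(nbr)
            # A neighbor without an alternative path queries its own neighbors.
            if nbr not in visited and nbr not in has_alternative:
                visited.add(nbr)
                frontier.append((nbr, router))
    return queried


# Hypothetical chain A - B - C - D, with A losing the route.
chain = {"A": ["B"], "B": ["A", "C"], "C": ["B", "D"], "D": ["C"]}
stopped_at_c = query_scope(chain, "A", has_alternative={"C"})
full_spread = query_scope(chain, "A", has_alternative=set())
print(sorted(stopped_at_c))  # ['B', 'C']
print(sorted(full_spread))   # ['B', 'C', 'D']
```

When C can answer the query, D is never involved; when nobody has an alternative path, the query reaches the end of the chain.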

Because EIGRP uses reliable multicast, a reply must be received for each query generated when searching for alternative paths for a lost route. Once a route enters the active state and queries are sent, the only way it can leave the active state and transition to the passive state is for the router to receive a reply to every query it generated. In addition, every query and every reply is acknowledged with an ACK message.

For each neighbor to whom a query is sent, an EIGRP router will set a reply status flag (r) to keep track of all outstanding queries that are waiting for replies. The DUAL computation is completed when the router has received a reply for every query sent out earlier.
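
The bookkeeping described above can be sketched in Python as follows. The ActiveRoute class, its attribute names, and the neighbor names R2 to R4 are hypothetical; the sketch only models the reply status flags, not the rest of DUAL.

```python
class ActiveRoute:
    """Tracks outstanding replies for one route in the active state (illustrative only)."""

    def __init__(self, prefix, queried_neighbors):
        self.prefix = prefix
        # One reply status flag per neighbor that was sent a query.
        self.reply_pending = {n: True for n in queried_neighbors}

    def receive_reply(self, neighbor):
        # Clear the flag; replies from neighbors that were never queried are ignored.
        if neighbor in self.reply_pending:
            self.reply_pending[neighbor] = False

    @property
    def state(self):
        # The route stays active while any reply is still outstanding.
        return "active" if any(self.reply_pending.values()) else "passive"


route = ActiveRoute("10.1.1.0/24", ["R2", "R3", "R4"])
route.receive_reply("R2")
route.receive_reply("R3")
state_after_two = route.state    # R4 has not replied yet
route.receive_reply("R4")
state_after_all = route.state
print(state_after_two, state_after_all)  # active passive
```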

By default, if an EIGRP router does not receive a reply to each outstanding query within 3 minutes (the default Active timer value), the route enters the Stuck-in-Active (SIA) state. When a route is in the SIA state, the router resets the neighbor relationships with the neighbors that failed to reply, which causes it to go active on all routes known through those neighbors and to re-advertise all its known routes to them. Query scoping, which limits how far queries propagate through a network (the query range), helps to reduce the occurrence of SIA routes. Keeping query packets close to the source reduces the chance that a failure in one part of the network involves routers in other parts of the network in the query-reply convergence process.
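
As a rough illustration of the active timer check, the sketch below returns the neighbors whose replies have been outstanding for longer than the limit. The function name sia_neighbors, the timestamps, and the use of limit=None to model a disabled timer are assumptions made for this example, not router behavior taken from any implementation.

```python
ACTIVE_TIMER_SECONDS = 3 * 60  # default active timer of 3 minutes, as stated in the text


def sia_neighbors(pending_since, now, limit=ACTIVE_TIMER_SECONDS):
    """Return neighbors whose reply has been outstanding longer than the limit.

    pending_since maps neighbor -> time in seconds at which the query was sent.
    limit=None is this sketch's way of modelling a disabled active timer.
    """
    if limit is None:
        return []
    return sorted(n for n, sent in pending_since.items() if now - sent > limit)


# Hypothetical timestamps: both queries sent at t=0, checked at t=181 seconds.
pending = {"R2": 0, "R3": 0}
expired = sia_neighbors(pending, now=181)
not_yet = sia_neighbors(pending, now=120)
disabled = sia_neighbors(pending, now=10_000, limit=None)
print(expired, not_yet, disabled)  # ['R2', 'R3'] [] []
```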

The best methods for limiting the scope of query range and building scalable EIGRP networks:
i) Configure route summarization on the outbound interfaces of the appropriate routers.
ii) Configure the remote routers as EIGRP stub routers.

Other methods used for limiting the query range include route filtering and interface packet filtering. For example, when specific EIGRP routing updates to a router are filtered and the router then receives a query about a filtered network, it replies that the network is unreachable and does not propagate the query to other neighbors.

The timers active-time [time-limit | disabled] EIGRP router subcommand changes the active-state time limit, in minutes. The default value is 3 minutes, so timers active-time 3 is not shown in the router configuration. The no timers active-time command is equivalent to the timers active-time disabled command. With the active timer disabled, a route that remains active without receiving a reply within 3 minutes does not cause a reset of the neighbor relationship between the querying and queried routers.

The eigrp log-neighbor-changes router subcommand enables the logging of neighbor adjacency events, which helps to monitor the stability of the routing system and to detect SIA-related problems.
Note: This command is enabled by default.
Note: The show ip eigrp sia-event hidden EXEC command displays the SIA events.

An erroneous approach that is often used to reduce the chance of SIA routes is to implement multiple EIGRP ASs, with mutual redistribution between them, to somewhat simulate OSPF areas and bound the query range. However, this approach does not achieve the intended result. When a query reaches the edge of an AS where routes are redistributed into another AS, the original query is answered first, and the edge router then initiates a new query in the other AS. The query process is therefore not stopped; it continues into the other AS, and the route can still eventually become SIA.

Another misconception about AS boundaries is that implementing multiple ASs protects one AS from route flapping in another AS. If routes are redistributed between ASs, route transitions that occur in one AS will be detected in the other ASs.

The EIGRP Query-Reply Process on Redundant Topology

The figure above is used to discuss the EIGRP Query-Reply process for a lost route on a redundant topology. The convergence process is complex even with only 2 headquarters routers and 2 branch routers. In networks with hundreds of branch routers, the process becomes even more complex and can cause severe problems.

HQ1 advertises 172.16.1.0/24 to all other routers. The successor path for HQ2 to reach the network is via the Fast Ethernet link to HQ1. The successor paths for branch routers (BR1 and BR2) to reach the network are via their serial links to HQ1. Additionally, they have also learnt the feasible successor path through HQ2.

Assume that the EIGRP metrics for Fast Ethernet and Serial are 100 and 1000 respectively.

The EIGRP topology table of BR1 and BR2 for network 172.16.1.0/24 is shown below. BR1 and BR2 have determined that the path through HQ1 is the successor, while the path through HQ2 is the feasible successor (the AD through HQ2 is 200, which is less than the FD of 1100 through HQ1).

Neighbor   Feasible Distance   Advertised Distance
HQ1        1100                100
HQ2        1200                200

The EIGRP topology table of HQ2 for network 172.16.1.0/24 is shown below. HQ2 does not have an FS, as all paths through the branch routers have an AD (1100) greater than the FD through HQ1 (200).

Neighbor   Feasible Distance   Advertised Distance
HQ1        200                 100
BR1        2100                1100
BR2        2100                1100
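
The feasibility check behind both tables can be reproduced with a few lines of Python. This is a minimal sketch: the feasible_successors function is a hypothetical helper, and the only inputs are the distance values listed in the two tables above.

```python
def feasible_successors(paths):
    """paths maps neighbor -> (distance_via_neighbor, advertised_distance)."""
    # The successor is the neighbor offering the lowest total distance.
    successor = min(paths, key=lambda n: paths[n][0])
    fd = paths[successor][0]
    # Feasibility condition: the neighbor's advertised distance must be
    # strictly less than the feasible distance of the current successor.
    fs = [n for n, (_, ad) in paths.items() if n != successor and ad < fd]
    return successor, fs


branch_table = {"HQ1": (1100, 100), "HQ2": (1200, 200)}
hq2_table = {"HQ1": (200, 100), "BR1": (2100, 1100), "BR2": (2100, 1100)}

print(feasible_successors(branch_table))  # ('HQ1', ['HQ2'])
print(feasible_successors(hq2_table))     # ('HQ1', [])
```

The branch routers keep HQ2 as a feasible successor because 200 is less than 1100, while HQ2 has none because 1100 is not less than 200.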

At some point, HQ1 loses the route to 172.16.1.0/24 due to a network device failure. As it has no feasible successor, it sends a query to HQ2, BR1, and BR2. When the branch routers receive the query, they remove the unreachable path through HQ1 from their routing tables, install the feasible successor path through HQ2, and reply to HQ1 with their new successor path through HQ2.

Now HQ1 has received the responses for 2 out of its 3 queries, but it is still waiting for the response from HQ2. When HQ2 receives the query from HQ1 for 172.16.1.0/24, it propagates the query to BR1 and BR2, as it does not have a FS but knows that a path exists through each branch router to reach 172.16.1.0/24.

BR1 and BR2 receive the query from HQ2 and check their topology tables for alternative paths. However, neither branch router has another path at the moment, as HQ1 has just informed them that it has lost the path to this network. As the branch routers do not have an answer to the query from HQ2, they each create a query and send it to all neighbors except the neighbor from which they received the query (due to split horizon). In this case, the branch routers send the query to HQ1.

HQ1 would reply to BR1 and BR2 that the route has an infinite metric and is unreachable. Once BR1 and BR2 receive the reply, the edge of the network is reached and the edge routers do not have any more neighbors to query. BR1 and BR2 would reply back to HQ2 and eventually HQ2 would reply back to HQ1.
Note: The longest query range in this sample network is HQ1 > HQ2 > BR1, BR2 > HQ1.

The events of the Query-Reply process in this scenario are summarized below:
1) HQ1 sends queries to HQ2, BR1, and BR2.
2) BR1 and BR2 reply to HQ1 with their feasible successor paths via HQ2.
3) HQ2 propagates the query from HQ1 to BR1 and BR2 as it does not have a feasible successor path.
4) BR1 and BR2 propagate the query from HQ2 to HQ1.
5) HQ1 replies to BR1 and BR2 that it does not have a feasible successor path.
6) BR1 and BR2 reply back to HQ2.
7) HQ2 replies back to HQ1.
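
The seven steps above can be replayed in Python to confirm that every query is eventually matched by a reply, and to count how many queries are outstanding at the worst point. The event list is a direct transcription of the steps; the replay function and the tuple layout are hypothetical and chosen only for this illustration.

```python
# (packet type, sender, receiver), in the order of steps 1 to 7 above.
events = [
    ("query", "HQ1", "HQ2"), ("query", "HQ1", "BR1"), ("query", "HQ1", "BR2"),
    ("reply", "BR1", "HQ1"), ("reply", "BR2", "HQ1"),
    ("query", "HQ2", "BR1"), ("query", "HQ2", "BR2"),
    ("query", "BR1", "HQ1"), ("query", "BR2", "HQ1"),
    ("reply", "HQ1", "BR1"), ("reply", "HQ1", "BR2"),
    ("reply", "BR1", "HQ2"), ("reply", "BR2", "HQ2"),
    ("reply", "HQ2", "HQ1"),
]


def replay(events):
    """Return (queries still unanswered, peak number of outstanding queries)."""
    outstanding = set()  # (querier, queried) pairs awaiting a reply
    peak = 0
    for kind, src, dst in events:
        if kind == "query":
            outstanding.add((src, dst))
        else:
            # A reply travels back to the querier; raises KeyError if no query matches.
            outstanding.remove((dst, src))
        peak = max(peak, len(outstanding))
    return outstanding, peak


unanswered, peak = replay(events)
queries = sum(1 for kind, _, _ in events if kind == "query")
print(queries, len(events) - queries, peak, unanswered)  # 7 7 5 set()
```

Even in this four-router network, a single lost prefix produces 7 queries and 7 replies (not counting the ACKs), with up to 5 queries outstanding at the same time.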

This scenario shows that in a network with redundant links between the headquarters and the branch offices, the branch routers are not only required to respond to queries from the headquarters, but they also continue the search for a successor by propagating the queries back to the headquarters (HQ1 in this case). The fact that Query-Reply packets may travel from the headquarters to the branch offices and back to the headquarters is often overlooked by network architects.

Route summarization can limit the query range by limiting the knowledge of a router regarding the subnets in a network. When a subnet is down, queries are propagated only as far as the routers that receive the summary route and have no knowledge of the specific subnet, as a router propagates a query about a network only if it has an exact match in the routing table.

Route Summarization Limits the Scope of EIGRP Query Range

The figure above shows a sample scenario in which RT2 advertises a summary route of 172.16.0.0/16 to RT3. When the network 172.16.1.0/24 is down, RT3 receives a query from RT2, as propagated from RT1. As RT3 has received only the summary route and the specific queried network is not in its routing table, it replies with a “Network 172.16.1.0/24 is unreachable” message with an infinite metric of 4294967295 and does not propagate the query to RT4.
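
RT3's decision can be sketched with Python's standard ipaddress module. The handle_query function and its return convention are hypothetical, and the sketch assumes that a router with an exact match has no alternative path of its own; the point is only that the lookup is an exact match, not a longest-prefix match.

```python
import ipaddress

INFINITE_METRIC = 4294967295  # infinite metric value quoted in the text


def handle_query(known_prefixes, queried_prefix, downstream_neighbors):
    """Return (reply_metric, neighbors_to_query) for a received query (simplified)."""
    prefix = ipaddress.ip_network(queried_prefix)
    if prefix not in known_prefixes:
        # No exact match: reply "unreachable" and stop the query here.
        return INFINITE_METRIC, []
    # Exact match, and (by assumption) no alternative path: keep the query moving.
    return None, list(downstream_neighbors)


# RT3 only knows the summary route advertised by RT2.
rt3_prefixes = {ipaddress.ip_network("172.16.0.0/16")}

specific = handle_query(rt3_prefixes, "172.16.1.0/24", ["RT4"])
summary = handle_query(rt3_prefixes, "172.16.0.0/16", ["RT4"])
print(specific)  # (4294967295, [])
print(summary)   # (None, ['RT4'])
```

Although 172.16.0.0/16 covers 172.16.1.0/24, the query for the /24 finds no exact match and stops at RT3, whereas a query for the /16 itself would be passed on to RT4.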

However, RT2 would also send a query for the summary route 172.16.0.0/16 to RT3, and that query will be propagated to RT4: the only component subnet, 172.16.1.0/24, is unreachable, and RT2 requires at least one valid component subnet for the summary route to remain in its routing table. In production environments, RT1 would normally have multiple directly connected subnets in the 172.16.0.0/16 major network, and hence this additional query-reply process would not occur.
Note: The FD of a summary route is the same as the FD of the lowest-metric component subnet. If RT2 has different metrics to the component subnets advertised by RT1, it sends a query for the summary route to RT3 (which is propagated to RT4) when the lowest-metric component subnet, which determines the metric of the 172.16.0.0/16 summary route, becomes unreachable. It does so to look for any metric that is lower than that of its next lowest-metric component subnet, which would otherwise determine the new metric of the summary route.
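
The note above can be illustrated with a short Python sketch. The component subnets and their metrics are made-up values, and summary_metric and loss_triggers_summary_query are hypothetical helpers that only model the rule that the summary inherits the lowest component metric.

```python
def summary_metric(components):
    """Metric of the summary route: the lowest metric among its component subnets."""
    return min(components.values()) if components else None


def loss_triggers_summary_query(components, lost_subnet):
    """True if losing lost_subnet changes (or removes) the summary route's metric."""
    remaining = {s: m for s, m in components.items() if s != lost_subnet}
    return summary_metric(remaining) != summary_metric(components)


# Hypothetical component subnets of 172.16.0.0/16 and their metrics on RT2.
components = {"172.16.1.0/24": 1100, "172.16.2.0/24": 1500, "172.16.3.0/24": 2000}

print(summary_metric(components))                                # 1100
print(loss_triggers_summary_query(components, "172.16.1.0/24"))  # True
print(loss_triggers_summary_query(components, "172.16.3.0/24"))  # False
```

Losing the lowest-metric subnet changes the summary metric and so leads to a query for the summary, while losing any other component leaves the summary untouched.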

Route summarization can be implemented on the network shown in Figure 6-10 to reduce the convergence traffic (queries and replies) on the redundant topology. The ip summary-address eigrp 100 172.16.0.0 255.255.0.0 interface subcommand, configured on the outbound interfaces of the headquarters routers, makes HQ1 and HQ2 advertise the 172.16.0.0/16 summary route instead of the specific route to the branch routers BR1 and BR2. As a result, BR1 and BR2 do not propagate the query for 172.16.1.0/24 back to HQ1, but reply to a query for 172.16.1.0/24 with the “network is unreachable” message instead.
