Monday, April 9, 2012

Cisco IOS Packet Switching Architectures

Cisco IOS packet switching algorithms (or paths) are the most foundation components of Cisco routers. Different switching methods and structures affect how a router performs its primary task.

In Ethernet, switching is the process of forwarding frames based on destination MAC addresses. In routers, switching is the process of forwarding packets between interfaces within a router.

Packet switching is responsible for moving packets across a router. Its most basic operations are:
  1. A packet is received from an inbound interface.
  2. The packet’s destination address is inspected and compared against the routing table.
  3. If there is a match, the packet is forwarded out to the appropriate outbound interface.
  4. If there is no match, the packet is discarded or dropped.
The main concern of IOS is not much on how to switch packets, but how to switch them quickly. Packet switching is a data-intensive operation as opposed to a computation-intensive operation. Speeding up the operation is not as simple as using a faster CPU. Other factors, eg: I/O bus and memory technology, can also have big impact on the packet switching performance.

The challenge of IOS developers is to create the highest possible performance based on the limitations of the given CPU, I/O bus, and memory technology.

Process switching was the only switching method when IOS was first developed. Later IOS releases introduced newer improved switching methods. Some methods reply on hardware-specific optimizations while other use software techniques that work well on many platforms.

Below lists the switching methods developed and available as of Cisco IOS release 12.0:
  1. Process Switching
  2. Fast Switching
  3. Autonomous Switching
  4. Silicon Switching Engine (SSE) Switching
  5. Optimum Switching
  6. Distributed Fast Switching
  7. Cisco Express Forwarding (CEF)
  8. Distributed Cisco Express Forwarding (dCEF)
This chapter covers only 4 of these methods in detail – process switching, fast switching, optimum switching, and Cisco Express Forwarding. Autonomous and SSE switching are platform-specific and aren’t commonly used in today’s networks. Distributed fast switching is actually an implementation of optimum switching on intelligent line cards.


Process Switching

Process switching is available on every version of IOS, and every platform. Its main disadvantage against other switching architectures is its speed, as it is required to perform routing table lookup for every packet and has the least amount of performance optimizations, and consumes large amounts of CPU resources. However, it has the advantage of being platform-independent and provides some load balancing capabilities not found in other switching methods.

Process Switching

Below lists the steps for process switching:
  1. The network interface processor detects there is a packet on the network media that needs to be processed. It receives and transfers the packet to the I/O memory.
  2. The interface processor generates a receive interrupt which interrupts the processor and inform it that there is a received packet waiting in I/O memory that needs to be processed. The IOS interrupt handler inspects the packet’s header information (eg: encapsulation type, network layer header) to determine packet type, and copies it into the processor memory if necessary (platform dependant). The processor then places the packet on input queue of the appropriate switching process and releases the interrupt. The switching process for IP packets is called ip_input.
  3. During the next time the scheduler runs, it notices the presence of a packet in the input queue for the ip_input process, and schedules the process for execution.
  4. The actual packet forwarding operation begins when the ip_input process runs. It consults the RIB (routing table) to determine the next-hop address and the output interface, and then consults the ARP cache to retrieve the corresponding physical layer address for the next-hop.
  5. The ip_input process rewrites the MAC header of the packet with the appropriate addresses, and places the packet on the output queue of the appropriate outbound network interface. The packet is queued for transmission.
  6. The packet is moved from the output queue of the outbound interface to the transmit queue of the outbound interface. Any outbound QoS operation takes place between these two queues.
  7. The output interface processor detects a packet in its transmit queue and waiting for transmission. It dequeues the packet and transmits it onto the network media.
  8. After the output interface processor transmitted the packet, it interrupts the processor to inform that the packet has been transmitted. IOS then updates the outbound packet counters and frees or releases the space in I/O memory formerly occupied by the packet.
The key points in the process that make it slow is that the processor waits for the next scheduled execution of the ip_input process; and when the ip_input process finally runs, it refers the RIB – a very slow process. The ip_input process is run at the same priority level as other processes, eg: routing protocols, and the HTTP web server interface. Process switching should never be used as the switching method of choice. Other methods produce significantly better performance.

Packet sourced from or destined to the router itself, eg: SNMP traps from the router and Telnet packets destined for the router, are always process-switched.

Load Balancing with Process Switching

One of the advantages of process switching is per-packet load balancing, which provides a simple way to route traffic over parallel paths to a destination – packets are distributed among the available paths based on the routing metric (or path cost) assigned to each path. The metric or cost is used to calculate a load share counter, which actually determines which path to take.

Note: Some applications, eg: VoIP, cannot tolerate per-packet load balancing because packets may arrive out of order and affect the processing of the data. Always ensure that load balancing is performed on a per-destination basis when the network is supporting such applications.

Equal Cost Path Load Balancing

Below shows that RT1 has 2 equal cost paths to the 10.1.3.0/24 network. The asterisk character (*) next to one of the paths indicates the path that will be used for the next packet switched toward 10.1.3.0/24. The traffic share count on both paths is 1, which means packets are switched out each path in a round-robin basis.
RT1#sh ip route 10.1.3.0 255.255.255.0
Routing entry for 10.1.3.0/24
  Known via "static", distance 1, metric 0
  Routing Descriptor Blocks:
  * 10.1.1.2
      Route metric is 0, traffic share count is 1
    10.1.2.2
      Route metric is 0, traffic share count is 1

RT1#

IGRP and EIGRP can install unequal cost paths into the routing table. Figure A2-3 below shows a modified setup in Figure A2-2, with the upper path has about twice the bandwidth of the other.

Unequal Cost Path Load Balancing

Below shows RT1 has 2 unequal cost paths to the 10.1.3.0/24 network. The lower-cost path through 10.1.1.1 has a traffic share count of 2, while the higher-cost path through 10.1.2.1 has a traffic share count of 1. In such a way, for every 2 packets switched out the higher-cost path, 1 packet is switched out the lower-cost path, as indicated by the packet numbers in the figure.
RT1#sh ip route 10.1.3.1
Routing entry for 10.1.3.0/24
  Known via "eigrp 100", distance 90, metric 1538560, type internal
  Redistributing via eigrp 100
  Last update from 10.1.2.2 on Serial0/1, 00:02:15 ago
  Routing Descriptor Blocks:
  * 10.1.1.2, from 10.1.1.2, 00:02:15 ago, via Serial0/0
      Route metric is 1538560, traffic share count is 2
      Total delay is 20100 microseconds, minimum bandwidth is 2500 Kbit
      Reliability 255/255, minimum MTU 1500 bytes
      Loading 1/255, Hops 1
    10.1.2.2, from 10.1.2.2, 00:02:15 ago, via Serial0/1
      Route metric is 3074560, traffic share count is 1
      Total delay is 20100 microseconds, minimum bandwidth is 1000 Kbit
      Reliability 255/255, minimum MTU 1500 bytes
      Loading 1/255, Hops 1

RT1#

Although per-packet load balancing gives the most even distribution across parallel paths, it has a significant drawback – it does not preserve packet order, packets can arrive out of order due to the varying latency and delay factors throughout the various routing paths. Out-of-order packet processing can significantly degrade system performance, reduce TCP session throughput, loss of data in some UDP-based protocols (eg: SNA or NetBIOS Fast Sequenced Transport (FST), or Voice-over-IP – VoIP), and might be even interpreted as attacks by some firewalls.

As mentioned earlier, the main disadvantage of process switching is its speed. Process switching requires routing table and ARP cache lookup for every packet, the time required to perform a lookup grows along with the sizes of the routing table and ARP cache. Recursive routes also require additional lookups in the routing table.

Another factor that affects the speed of process switching is in-memory data transfer time. On some platforms, packets must be copied between I/O memory and processor memory before and after the switching process. Memory data copy operations are very CPU intensive.

Fast Cache was deployed to overcome the speed limitation of process switching. It allows IOS to have smaller lookup tables, switches packets during the packet receive interrupt, and removes the need for in-memory data copying.


Interrupt Context Switching

Interrupt context switching is another switching method that is much faster than process switching. Below are the main differences between interrupt context switching and process switching:
  1. The current process is interrupted to perform packet switching. Packets are switched during the packet receive interrupt, rather than the scheduled ip_input process. The ip_input process is rarely called, and the processor no longer has to wait for a process to complete.
  2. The processor uses route cache to find all the information needed to switch the packet. The type of route cache being used and the contents in the cache depend on the router hardware and the type of interrupt content switching – fast switching, optimum switching, or Cisco Express Forwarding (CEF).
  3. Cache entries are built by the ip_input process while packets are first process switched. The first packet sent to a destination will always be process switched. Subsequent packets to the same destination can be fast switched and the scheduled ip_input process does not get involved in switching the packets as long as the cache entry exists.
A cache is a special high-speed storage mechanism used to store some frequently access data. A cache has smaller size or capacity, higher speed, and more expensive than normal main memory. It is being used to achieve higher performance as applications can access frequently access data in a faster manner compared to accessing the slower main memory.

Interrupt Context Switching

Below are the steps of interrupt context switching and how cache is being used in the switching process:
  1. The network interface processor detects there is a packet on the network media that needs to be processed. It receives and transfers the packet to the I/O memory.
  2. The interface processor generates a receive interrupt. During the interrupt, the IOS interrupt handler determines the packet type, and then begins the switching process.
  3. The process consults the route cache to determine the next-hop address, the output interface, and the MAC header for the next-hop (used for rewriting the MAC header of the packet). Note: A cache entry is built by the ip_input process during the first lookup process.
  4. The packet is copied to either the output or transmit queue of the outbound interface depends upon various factors. The receive interrupt is ended, and the originally running process continues.
  5. The output interface processor detects a packet in its transmit queue and waiting for transmission. It dequeues the packet and transmits it onto the network media.
  6. After the output interface processor transmitted the packet, it interrupts the processor to indicate that the packet has been transmitted. IOS then updates the outbound packet counters and frees or releases the space in I/O memory formerly occupied by the packet.
Notice that the ip_input process never gets involved when switching the packet. The currently running process is interrupted, instead of waiting for the next scheduled execution of the ip_input process. No scheduled process is involved when a packet is fast switched as long as a cache entry exists. IOS can now perform the entire packet switching operation with very short period of an interrupt using Fast Cache! Caching allows IOS to separate the resource-intensive task of making a routing decision from the relatively easy task of forwarding a packet. Fast switching introduced the concept of “route once, forward many times”.

The show ip cache verbose EXEC command displays the routing table cache for fast switching.
RT1#sh ip cache verbose
IP routing cache 2 entries, 344 bytes
   5 adds, 3 invalidates, 0 refcounts
Minimum invalidation interval 2 seconds, maximum interval 5 seconds,
   quiet interval 3 seconds, threshold 0 requests
Invalidation rate 0 in last second, 0 in last 3 seconds
Last full cache invalidation occurred 00:00:32 ago

Prefix/Length           Age       Interface       Next-hop
10.1.3.1/32-24          00:00:30  Serial0/1       10.1.2.2
                   4   0F000800
192.168.1.2/32-24       00:00:30  FastEthernet1/0 192.168.1.2
                   14  CC020F400000CC000F4000100800

RT1#
RT1#sh arp
Protocol  Address          Age (min)  Hardware Addr   Type   Interface
Internet  192.168.1.1             -   cc00.0f40.0010  ARPA   FastEthernet1/0
Internet  192.168.1.2             1   cc02.0f40.0000  ARPA   FastEthernet1/0
RT1#

Unlike the master tables that used to build the caches, which are basically just big lists, caches are implemented with well organized data structures to achieve fast retrieval of a particular entry with minimal search operations. The searching time increases proportionately with the size of the table for the master tables; while the searching time for the caches can be kept small and fairly constant regardless of the total number of entries.

The IP Fast Cache was first implemented as a data structure called hash table. In Cisco IOS Release 10.2, it was replaced with another data structure called 2-way radix tree, a form of binary tree. A radix tree stores the forwarding information and MAC headers based on the binary representation of the key (the unique field that uniquely identifies each set of data). The binary tree can have up to 32 levels, corresponding to the bits within an IP address.

Below shows the steps of searching a number throughout a binary tree:
  1. Start from the left (the most significant bit) of the binary number to be searched for.
  2. Begin at the root of the tree, branch left or right in the tree based on that digit.
  3. Compare the next digit until reaching an end node.
The Fast Switch Binary Tree
- For visualizing purpose, the nodes that marked in grey in the figure above match an IP address of 170.x.x.x (the binary value of 170 is 10101010).

The benefit of this design is speed when compared with searching the RIB, as information regarding the next-hop and MAC address changes is stored within each end node, and finding specific entries is very quick since the tree is very deterministic.

The route cache is not directly related to the routing table, and it is updated only when packets are being process-switched. To keep the entries in the route cache current and from losing their synchronization with the routing table and ARP cache, as well as to keep unused entries in the fast cache from unduly consuming memory on the router, 1/20th of the route cache is aged out or discarded randomly every minute. Those entries must then be rebuilt through process switching.

And since there is no correlation between the MAC headers (used for rewrites) in the ARP cache and the structure of the fast cache, when the ARP table changes, some portion of the fast cache must be invalidated and rebuilt through process switching.


Traffic Load Sharing with Fast Switching

Fast switching does not support load sharing on a per-packet basis, which could lead to uneven link utilization in some situations where multiple paths exist between 2 destinations. This lack of a deterministic load balancing scheme is a concern for many network designers. Newer switching algorithms, eg: Cisco Express Forwarding (CEF), now support load balancing schemes that overcome this problem.

Load sharing with fast switching occurs on a per-destination basis. When there are multiple equal cost paths for a particular destination network, fast cache has an entry for each host reachable within that network, and all traffic destined for a particular host follows a particular link.


Optimum Switching

Optimum switching is basically fast switching with some caching optimizations. As like fast switching, optimum switching performs the entire packet switching process during a single packet receive interrupt. The main difference between fast and optimum switching is the way that the route cache is accessed. The optimum switching code is also developed to take advantages of certain processor architectures; while the fast switching code is generic, and is not optimized for any specific processor. Unlike fast switching that utilizes binary tree that is designed to support any type of value, optimum switching is available only for the IP protocol.

Optimum switching utilizes a 256-way multiway tree instead of a binary tree to record and retrieve information in the route cache. For IP addresses, a multiway tree is much faster because each octet can only be one of 256 values.

Optimum Switching 256-way Multiway Tree

The root branches into 256 nodes numbered 0 to 255. Each node then branches into an additional 256 nodes. This pattern continues for 4 levels, one for each octet in the IPv4 addressing. The nodes in grey show how an IP address (3.253.7.5) would be matched. This type of tree is much faster for IP address lookups than a binary tree because the table is optimized for the IP address format. Information for each route (prefix) or IP address is stored in the final node. Since the size of this information is variable and each node may or may not contain information, the overall table size is also variable; and this is a drawback – searching the tree is not as efficient as it might be if every node were of a known static size.

Since the information stored in the nodes has no direct relationship to the RIB and ARP cache, the entries are also aged out and rebuilt through process switching as like with fast switching.

Optimum switching is available only on high-end routers, eg: Cisco 7500 and 7600 series.


Cisco Express Forwarding (CEF)

Cisco Express Forwarding (CEF) is the preferred switching path on any router that supports it and is the default switching path on all modern routers.

CEF introduces the concept of a trie (the forwarding table – FIB) upon the optimum switching multiway tree, in which the data is not stored within the nodes, but rather each node that is the same size becomes a pointer to another table – the adjacency table, that contains the data, eg: next-hop information and MAC header, that are needed to perform packet switching. The forwarding table is built from the routing table; while the adjacency table is built from the ARP table, Frame Relay map table, or other L2 table types.
Note: The term trie comes from the word retrieve, and is pronounced like tree or try.
Note: The terms CEF forwarding table, Forwarding Information Base (FIB) and CEF table are being used interchangeably for many CEF-related texts.

CEF Forwarding Table and Adjacency Table

One of the biggest advantages of CEF is that the forwarding and adjacency tables are built without process switching – without waiting for a packet to be sent to a destination, in fact the tables are built before any packets are switched. Additionally, the forwarding table is built separately from the adjacency table – an error in one of the tables does not cause the other to become stale. When the ARP cache changes, only the adjacency table changes, therefore aging and invalidation of the forwarding table is no longer necessary. The CEF entries in the forwarding table never age, as they are linked directly to the routing table, changes upon the dynamic routing table are immediately propagated to the forwarding table.

The adjacency table contains information other than MAC header and outbound interface as below:
  • cached – a prebuilt MAC header rewrite string and outbound interface used to reach a particular adjacent host or router.
  • receive – Packets destined to this IP address should be received by the router, which includes broadcast addresses and IP addresses configured on the router itself.
  • drop – Packets destined to this IP address should be dropped, which includes traffic denied by an access list, or routed to the NULL interface.
  • punt – Packets that Cisco Express Forwarding cannot switch and will be passed to the next best switching algorithm for processing – normally fast switching.
  • glean – The destination is directly connected, but there is no prebuilt MAC header rewrite string currently available. An ARP request must be sent to build the MAC header.

Below shows how CEF build the appropriate adjacency table entry from the ARP cache information for a directly connected destination – 192.168.1.2 that is resolved to a glean adjacency.
RT1#sh ip cef detail
IP CEF with switching (Table Version 16), flags=0x0
  6 routes, 0 reresolve, 0 unresolved (0 old, 0 new), peak 2
  3 instant recursive resolutions, 0 used background process
  9 leaves, 11 nodes, 12664 bytes, 18 inserts, 9 invalidations
  0 load sharing elements, 0 bytes, 0 references
  universal per-destination load sharing algorithm, id B02A5381
  2(0) CEF resets, 0 revisions of existing leaves
  Resolution Timer: Exponential (currently 1s, peak 1s)
  1 in-place/0 aborted modifications
  refcounts:  3078 leaf, 3072 node

  Table epoch: 0 (9 entries at this epoch)

0.0.0.0/0, version 0, epoch 0, attached, default route handler
0 packets, 0 bytes
  via 0.0.0.0, 0 dependencies
    valid no route adjacency
0.0.0.0/32, version 0, epoch 0, receive
192.168.1.0/24, version 5, epoch 0, attached, connected
0 packets, 0 bytes
  via FastEthernet1/0, 0 dependencies
    valid glean adjacency
192.168.1.0/32, version 3, epoch 0, receive
192.168.1.1/32, version 2, epoch 0, receive
192.168.1.255/32, version 4, epoch 0, receive
224.0.0.0/4, version 1, epoch 0
0 packets, 0 bytes, Precedence routine (0)
  via 0.0.0.0, 0 dependencies
    next hop 0.0.0.0
    valid drop adjacency
224.0.0.0/24, version 2, epoch 0, receive
255.255.255.255/32, version 1, epoch 0, receive
RT1#
RT1#debug ip cef table
IP CEF table debugging is on
RT1#
RT1#debug arp
ARP packet debugging is on
RT1#
RT1#ping 192.168.1.2

Type escape sequence to abort.
Sending 5, 100-byte ICMP Echos to 192.168.1.2, timeout is 2 seconds:

00:10:10: IP ARP: creating incomplete entry for IP address: 192.168.1.2 interface FastEthernet1/0
00:10:10: IP ARP: sent req src 192.168.1.1 cc00.0550.0010,
                 dst 192.168.1.2 0000.0000.0000 FastEthernet1/0
00:10:10: IP ARP: rcvd rep src 192.168.1.2 cc02.0550.0000, dst 192.168.1.1 FastEthernet1/0
00:10:10: CEF-IP: Checking dependencies of 192.168.1.0/24
00:10:10: CEF-Table: Adjacency-prefix 192.168.1.2/32 add request -- succeeded.
00:10:12: IP ARP: rcvd req src 192.168.1.2 cc02.0550.0000, dst 192.168.1.1 FastEthernet1/0
00:10:12: IP ARP: sent rep src 192.168.1.1 cc00.0550.0010,
                 dst 192.168.1.2 cc02.0550.0000 FastEthernet1/0.!!!
Success rate is 60 percent (3/5), round-trip min/avg/max = 16/25/40 ms
RT1#
RT1#sh ip cef detail
IP CEF with switching (Table Version 17), flags=0x0
  7 routes, 0 reresolve, 0 unresolved (0 old, 0 new), peak 2
  3 instant recursive resolutions, 0 used background process
  10 leaves, 11 nodes, 12800 bytes, 19 inserts, 9 invalidations
  0 load sharing elements, 0 bytes, 0 references
  universal per-destination load sharing algorithm, id B02A5381
  2(0) CEF resets, 0 revisions of existing leaves
  Resolution Timer: Exponential (currently 1s, peak 1s)
  1 in-place/0 aborted modifications
  refcounts:  3080 leaf, 3072 node

  Table epoch: 0 (10 entries at this epoch)

Adjacency Table has 1 adjacency
0.0.0.0/0, version 0, epoch 0, attached, default route handler
0 packets, 0 bytes
  via 0.0.0.0, 0 dependencies
    valid no route adjacency
0.0.0.0/32, version 0, epoch 0, receive
192.168.1.0/24, version 5, epoch 0, attached, connected
0 packets, 0 bytes
  via FastEthernet1/0, 0 dependencies
    valid glean adjacency
192.168.1.0/32, version 3, epoch 0, receive
192.168.1.1/32, version 2, epoch 0, receive
192.168.1.2/32, version 16, epoch 0, connected, cached adjacency 192.168.1.2
0 packets, 0 bytes
  via 192.168.1.2, FastEthernet1/0, 0 dependencies
    next hop 192.168.1.2, FastEthernet1/0
    valid cached adjacency
192.168.1.255/32, version 4, epoch 0, receive
224.0.0.0/4, version 1, epoch 0
0 packets, 0 bytes, Precedence routine (0)
  via 0.0.0.0, 0 dependencies
    next hop 0.0.0.0
    valid drop adjacency
224.0.0.0/24, version 2, epoch 0, receive
255.255.255.255/32, version 1, epoch 0, receive
RT1#

The following packets are sent or punted to the CPU for process switching:
  • ARP requests and replies
  • IP packets that require a response from a router, eg: TTL has expired, MTU is exceeded, fragmentation is needed, etc.
  • IP broadcasts that will be relayed as unicast, eg: DHCP requests, IP helper address functions.
  • Routing protocol updates.
  • Cisco Discovery Protocol packets.
  • Packets requiring encryption.
  • Packets triggering Network Address Translation.
  • Other non-IP protocol packets, eg: IPX, AppleTalk, DECnet, etc.
The version number indicates the number of times the CEF entry has been updated since the CEF table was generated. The epoch number indicates the number of times the CEF table has been flushed and regenerated as a whole.

CEF supports load balancing over equal-cost paths. Load balancing at the switching level is far superior by load balancing via routing protocols, as routing protocols operate at a higher level. CEF load balancing is performed on a per-destination basis by default – all packets for a destination will use the same link. CEF can be configured to perform load balancing on a per-packet basis, for scenarios such as there is only a single host on the other end of the links, or there is a single server that consumes the majority of the bandwidth for any single link.

Issue the ip load-sharing [per-packet | per-destination] interface subcommand to enables per-packet or per-destination CEF load balancing for an interface. When enabling per-packet load balancing to a particular destination, all interfaces that can forward traffic to the destination must be enabled for per-packet load balancing.
Note: The ip load-sharing per-destination interface subcommand is the default configuration and is not shown in the configuration files.

No comments:

Post a Comment