- The Triad Elements—Address, Route, Rule
- RPDB—The Linux Policy Routing Implementation
- System Packet Paths—IPChains/NetFilter
- Summary
RPDB: The Linux Policy Routing Implementation
Under Linux, Policy Routing is implemented through the Routing Policy DataBase (RPDB). The RPDB is the cohesive set of routes, route tables, and rules. Since addressing is a direct function of these elements, it is also part of the system. Primarily, the RPDB provides the internal structure and mechanism for implementing the rule element of Policy Routing. It also provides the multiple routing tables available under Linux.
With the RPDB and the complete rewrite of the IP addressing and routing structures in kernel 2.1 and higher, Linux supports 255 routing tables and 2^32 rules. That is one rule per IP address under IPv4. In other words, you can specify a rule to govern every single address available in the entire IPv4 address space. That works out to over 4 billion rules (2^32 = 4,294,967,296).
The RPDB itself operates upon the rule and route elements of the triad. In the operation of RPDB, the first element considered is the operation of the rule. The rule, as you saw, may be considered as the filter or selection agent for applying Policy Routing.
The following text about the RPDB and the definition of Policy Routing is adapted from Alexey Kuznetsov's documentation for the IPROUTE2 utility suite, with Alexey's permission. I have rewritten parts of the text to clarify some points. Any errors or omissions should be directed to me.
Classic routing algorithms used on the Internet make routing decisions based only on the destination address of packets and, in theory but not in practice, on the TOS field. In some circumstances you may want to route packets differently, depending not only on the destination addresses but also on other packet fields such as source address, IP protocol, transport protocol ports, or even packet payload. This task is called Policy Routing.
To solve this task, the conventional destination-based routing table, ordered according to the longest match rule, is replaced with the RPDB, which selects the appropriate route through execution of rules. These rules may have many keys of different natures, and therefore they have no natural order except that which is imposed by the network administrator. In Linux the RPDB is a linear list of rules ordered by a numeric priority value. The RPDB explicitly allows matching packet source address, packet destination address, TOS, incoming interface (which is packet metadata, rather than a packet field), and using fwmark values for matching IP protocols and transport ports. Fwmark is the packet filtering tag that you will use in Chapter 6 and is explained later in this chapter in the section "System Packet Paths: IPChains/NetFilter."
Each routing policy rule consists of a selector and an action predicate. The RPDB is scanned in the order of increasing priority, with the selector of each rule applied to the source address, destination address, incoming interface, TOS, and fwmark. If the selector matches the packet, the action is performed. The action predicate may return success, in which case the rule output provides either a route or a failure indication, and RPDB lookup is then terminated. Otherwise, the RPDB program continues on to the next rule.
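The scan just described can be sketched in Python. This is a simplified model rather than the kernel's code: the Packet and Rule classes, the rpdb_lookup function, and the sample addresses are hypothetical names chosen for illustration, and a selector field left as None is assumed to mean "match anything."

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network
from typing import Callable, Optional


@dataclass(frozen=True)
class Packet:
    src: str
    dst: str
    iif: str = "lo"
    tos: int = 0
    fwmark: int = 0


@dataclass(frozen=True)
class Rule:
    priority: int
    # The action returns a result on success, or None to let the scan continue.
    action: Callable[[Packet], Optional[str]]
    src: str = "0.0.0.0/0"
    dst: str = "0.0.0.0/0"
    iif: Optional[str] = None
    tos: Optional[int] = None
    fwmark: Optional[int] = None

    def matches(self, pkt: Packet) -> bool:
        """Apply the selector to the packet fields and metadata."""
        return (
            ip_address(pkt.src) in ip_network(self.src)
            and ip_address(pkt.dst) in ip_network(self.dst)
            and self.iif in (None, pkt.iif)
            and self.tos in (None, pkt.tos)
            and self.fwmark in (None, pkt.fwmark)
        )


def rpdb_lookup(rules, pkt):
    """Scan rules in order of increasing priority; stop at the first success."""
    for rule in sorted(rules, key=lambda r: r.priority):
        if rule.matches(pkt):
            result = rule.action(pkt)
            if result is not None:
                return result
    return None


# Hypothetical rule set; the action strings stand in for real route lookups.
rules = [
    Rule(32766, lambda p: "via 192.0.2.1 (main)"),
    Rule(100, lambda p: "via 198.51.100.1 (table 10)", src="10.1.0.0/16"),
    Rule(50, lambda p: None, fwmark=7),  # matches, but its action finds no route
]
```

A packet carrying fwmark 7 matches the rule at priority 50 first. Because that action does not succeed, the scan falls through to the later rules instead of stopping.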
What is the action semantically? The natural action is to select the nexthop and the output device. This is the way a packet path route is selected by Cisco IOS; let us call it "match & set." In Linux the approach is more flexible because the action includes lookups in destination-based routing tables and selecting a route from these tables according to the classic longest match algorithm. The "match & set" approach then becomes the simplest case of Linux route selection, realized when the second level routing table contains a single default route. Remember that Linux supports multiple routing tables managed with the ip route command.
At startup, the kernel configures a default RPDB consisting of three rules:
Priority 0: Selector = match anything
Action = lookup routing local table (ID 255)
The local table is the special routing table containing high priority control routes for local and broadcast addresses.
Rule 0 is special; it cannot be deleted or overridden.
Priority 32766: Selector = match anything
Action = lookup routing main table (ID 254)
The main table is the normal routing table containing all non-policy routes. This rule may be deleted or overridden with other rules.
Priority 32767: Selector = match anything
Action = lookup routing table default (ID 253)
The table default is empty and reserved for post-processing if previous default rules did not select the packet. This rule also may be deleted.
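As a rough illustration of how these three default rules behave, the following Python sketch walks the tables in priority order. The table contents and the lookup function are hypothetical; only the priorities and table IDs come from the text above.

```python
from ipaddress import ip_address, ip_network

# The default RPDB: (priority, table name). Every selector matches anything.
DEFAULT_RULES = [(0, "local"), (32766, "main"), (32767, "default")]
TABLE_IDS = {"local": 255, "main": 254, "default": 253}

# Hypothetical table contents for a host at 192.0.2.10 on 192.0.2.0/24.
tables = {
    "local": {"127.0.0.0/8": "local", "192.0.2.10/32": "local"},
    "main": {"192.0.2.0/24": "dev eth0", "0.0.0.0/0": "via 192.0.2.1"},
    "default": {},  # empty unless the administrator populates it
}


def lookup(dst):
    """Return (table name, route) from the first table that has a match."""
    addr = ip_address(dst)
    for _priority, name in sorted(DEFAULT_RULES):
        candidates = [ip_network(p) for p in tables[name] if addr in ip_network(p)]
        if candidates:
            best = max(candidates, key=lambda net: net.prefixlen)
            return name, tables[name][str(best)]
    return None
```

With these contents, the host's own address resolves in the local table and everything else resolves in main. The default table is consulted only if main has no matching route.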
Do not mix routing tables and rules. Rules point to routing tables, several rules may refer to one routing table, and some routing tables may have no rules pointing to them. If you delete all the rules referring to a table, then the table is not used but still exists. A routing table will disappear only after all the routes contained within it are deleted.
Each RPDB entry has additional attributes attached. Each rule has a pointer to some routing table. NAT and masquerading rules have the attribute to select a new IP address to translate/masquerade. Additionally, rules have some of the optional attributes that routes have, such as realms. These values do not override those contained in routing tables; they are used only if the route did not select any of those attributes.
The RPDB may contain rules of the following types:
unicast: The rule prescribes returning the route found in the routing table referenced by the rule.
blackhole: The rule prescribes dropping a packet silently.
unreachable: The rule prescribes generating the error Network is unreachable.
prohibit: The rule prescribes generating the error Communication is administratively prohibited.
nat: The rule prescribes translating the source address of the IP packet to some other value.
You will see how these rule actions operate primarily in Chapters 5 and 6. There you will make hands-on use of the command set and implement several Policy Routing structures.
The RPDB was both the first implementation and the first mention of the concept of Policy Routing within the Linux community. When you consider that the ip utility was first released in late spring of 1997, and that Alexey's documentation was released in April of 1999 coinciding with the official Linux 2.2 kernel release in May of 1999, then you realize that the Linux Policy Routing structure is already over four years old. In Internet time that is considered almost ancient. But as with most new network subjects, such as IPv6 and Policy Routing, Linux leads the way.
The RPDB itself was an integral part of the rewrite of the networking stack in Linux kernel 2.2. The way in which the Policy Routing extensions are accessed is through a defined set of additional control structures within the Linux kernel. These are the NETLINK and RT_NETLINK objects and related constructs. If you are curious about the programmatic details you can look through the source to the ip utility itself. The call structure and reference to the kernel internals is laid out quite well.
One of the important features that makes the RPDB implementation so special is that it is completely backward-compatible with the standard network utilities. You do not need to use the ip utility to perform standard networking tasks on your system. You can use ifconfig and route and get along quite fine. In fact, you can even compile the kernel without the NETLINK family objects and still use standard networking tools. It is only when you need to use the full features of the RPDB that you need to use the appropriate utility.
This backward compatibility is due to the RPDB being a complete replacement of the Linux networking structure, especially as it relates to routing. The addressing modalities for Policy Routing, as discussed in the "Address" section earlier in this chapter (and illustrated in depth in Chapter 5), were also implemented as part of this change. But the main changes, besides the addition of the rule element, were the changes to the route element. Drawing upon Alexey's documentation again I provide the following information on the route element construct.
In the RPDB, each route entry has a key consisting of the protocol prefix, which is the pairing of the network address and network mask length, and optionally the TOS value. An IP packet matches the route if the highest bits of the packet's destination address are equal to the route prefix, at least up to the prefix length, and if the TOS of the route is zero or equal to the TOS of the packet.
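The match condition can be written out as a small Python predicate. The function name route_matches and the sample prefixes are assumptions made for this sketch. The comparison itself follows the description above: the destination masked to the prefix length must equal the route prefix, and the route's TOS must be zero or equal to the packet's TOS.

```python
from ipaddress import ip_address, ip_network


def route_matches(prefix, route_tos, dst, pkt_tos):
    """Return True if a packet to dst with pkt_tos matches the route key."""
    net = ip_network(prefix)
    addr = int(ip_address(dst))
    # Compare only the highest bits, up to the prefix length.
    prefix_ok = (addr & int(net.netmask)) == int(net.network_address)
    # A route with TOS 0 accepts any packet TOS.
    tos_ok = route_tos == 0 or route_tos == pkt_tos
    return prefix_ok and tos_ok
```

For example, a route for 10.1.0.0/16 with TOS 0 matches a packet to 10.1.2.3 whatever its TOS, while the same prefix with TOS 0x10 matches only packets that carry TOS 0x10.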
If several routes match the packet, the following pruning rules are used to select the best one:
1. The longest matching prefix is selected; all shorter ones are dropped.
2. If the TOS of some route with the longest prefix is equal to the TOS of the packet, routes with different TOSes are dropped.
3. If no exact TOS match is found and routes with TOS=0 exist, the rest of the routes are pruned. Otherwise the route lookup fails.
4. If several routes remain after steps 1 through 3 have been tried, then routes with the best preference value are selected.
5. If several routes still exist, then the first of them is selected.
Note the ambiguity of action 5. Unfortunately, Linux historically allowed such a bizarre situation. The sense of the word "the first" depends on the literal order in which the routes were added to the routing table, and it is practically impossible to maintain a bundle of such routes in any such order.
For simplicity we will limit ourselves to the case wherein such a situation is impossible, and routes are uniquely identified by the triplet of prefix, TOS, and preference. Using the ip command for route creation and manipulation makes it impossible to create non-unique routes.
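The pruning sequence can be modeled in a few lines of Python. This is a sketch under stated assumptions, not the kernel's algorithm: the Route class and select_route function are hypothetical, and a lower preference value is assumed to be the better one, in the manner of a metric.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Route:
    prefix: str
    tos: int
    preference: int  # assumption: lower value is better
    nexthop: str


def select_route(routes, dst, pkt_tos):
    """Apply the pruning steps to the routes that match the packet."""
    addr = ip_address(dst)
    matching = [
        r for r in routes
        if addr in ip_network(r.prefix) and r.tos in (0, pkt_tos)
    ]
    if not matching:
        return None
    # Keep only the longest matching prefix.
    longest = max(ip_network(r.prefix).prefixlen for r in matching)
    matching = [r for r in matching if ip_network(r.prefix).prefixlen == longest]
    # Prefer an exact TOS match; otherwise fall back to TOS=0 routes.
    exact = [r for r in matching if r.tos == pkt_tos]
    matching = exact or [r for r in matching if r.tos == 0]
    if not matching:
        return None
    # Keep the routes with the best preference value.
    best = min(r.preference for r in matching)
    matching = [r for r in matching if r.preference == best]
    # If several remain, the first one in insertion order is taken.
    return matching[0]


# Hypothetical routes; the nexthop strings are labels, not real gateways.
routes = [
    Route("0.0.0.0/0", 0, 0, "gw-default"),
    Route("10.0.0.0/8", 0, 10, "gw-a"),
    Route("10.0.0.0/8", 0x10, 10, "gw-lowdelay"),
    Route("10.0.0.0/8", 0, 5, "gw-b"),
]
```

The final `matching[0]` is where the ambiguity lives: when routes are not unique by prefix, TOS, and preference, the answer depends on nothing but the order in which they were added.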
One useful exception to this rule is the default route on non-forwarding hosts. It is "officially" allowed to have several fallback routes in cases when several routers are present on directly connected networks. In this case, Linux performs "dead gateway detection" as controlled by Neighbour Unreachability Detection (nud) and references from the transport protocols to select the working router. Thus the ordering of the routes is not essential. However, in this specific case it is not recommended that you manually fiddle with default routes but instead use the Router Discovery protocol. Actually, Linux IPv6 does not even allow user-level applications access to default routes.
Of course, the preceding route selection steps are not performed in exactly this sequence. The routing table in the kernel is kept in a data structure that allows the final result to be achieved with minimal cost. Without depending on any particular routing algorithm implemented in the kernel, we can summarize the sequence as follows: a route is identified by the triplet {prefix,tos,preference} key, which uniquely locates the route in the routing table.
Each route key refers to a routing information record. The routing information record contains the data required to deliver IP packets, such as output device and next hop router, and additional optional attributes, such as path MTU (Maximum Transmission Unit) or the preferred source address for communicating to that destination.
It is important that the set of required and optional attributes depends on the route type. The most important route type is a unicast route, which describes real paths to other hosts. As a general rule, common routing tables contain only unicast routes. However, other route types with different semantics do exist. The full list of types understood by the Linux kernel is as follows:
unicast: The route entry describes real paths to the destinations covered by the route prefix.
unreachable: These destinations are unreachable; packets are discarded and the ICMP message host unreachable (ICMP Type 3 Code 1) is generated. The local senders get error EHOSTUNREACH.
blackhole: These destinations are unreachable; packets are silently discarded. The local senders get error EINVAL.
prohibit: These destinations are unreachable; packets are discarded and the ICMP message communication administratively prohibited (ICMP Type 3 Code 13) is generated. The local senders get error EACCES.
local: The destinations are assigned to this host; the packets are looped back and delivered locally.
broadcast: The destinations are broadcast addresses; the packets are sent as link broadcasts.
throw: Special control route used together with policy rules. If a throw route is selected, then lookup in this particular table is terminated, pretending that no route was found. Without any Policy Routing, it is equivalent to the absence of the route in the routing table: the packets are dropped, and the ICMP message net unreachable (ICMP Type 3 Code 0) is generated. The local senders get error ENETUNREACH.
nat: Special NAT route. Destinations covered by the prefix are considered as dummy (or external) addresses, which require translation to real (or internal) ones before forwarding. The addresses to translate to are selected with the attribute via.
anycast (not implemented): The destinations are anycast addresses assigned to this host. They are mainly equivalent to local addresses, with the difference that such addresses are invalid to be used as the source address of any packet.
multicast: Special type, used for multicast routing. It is not present in normal routing tables.
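To make the difference between throw and the other error-producing types concrete, here is a small Python model. The function names, table layout, and addresses are hypothetical. The errno values are the ones listed above, and the fallback to ENETUNREACH when every table is exhausted mirrors the "no route found" case described for throw.

```python
import errno
from ipaddress import ip_address, ip_network

# Error returned to local senders for each terminating route type.
ERRNO_BY_TYPE = {
    "unreachable": errno.EHOSTUNREACH,
    "blackhole": errno.EINVAL,
    "prohibit": errno.EACCES,
}


def lookup_table(table, dst):
    """Return the longest-prefix (type, nexthop) entry in one table, or None."""
    addr = ip_address(dst)
    hits = [
        (ip_network(prefix).prefixlen, entry)
        for prefix, entry in table.items()
        if addr in ip_network(prefix)
    ]
    if not hits:
        return None
    return max(hits, key=lambda hit: hit[0])[1]


def resolve(tables_in_rule_order, dst):
    """Try each table in turn; a throw route behaves like 'no route here'."""
    for table in tables_in_rule_order:
        entry = lookup_table(table, dst)
        if entry is None or entry[0] == "throw":
            continue  # fall through to the table named by the next rule
        route_type, nexthop = entry
        if route_type in ERRNO_BY_TYPE:
            return ("error", errno.errorcode[ERRNO_BY_TYPE[route_type]])
        return ("forward", nexthop)
    return ("error", errno.errorcode[errno.ENETUNREACH])


# Hypothetical policy table and main table.
table10 = {
    "10.0.0.0/8": ("unicast", "192.0.2.1"),
    "10.9.0.0/16": ("throw", None),
    "10.8.0.0/16": ("prohibit", None),
}
main = {"0.0.0.0/0": ("unicast", "192.0.2.254")}
```

In this model a prohibit route ends the lookup with an error, whereas a throw route only ends the lookup in its own table. The packet to 10.9.1.1 is then routed by main.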
Linux can place routes within multiple routing tables identified by a number in the range from 1 to 255 or by a name taken from the file /etc/iproute2/rt_tables. By default all normal routes are inserted to the table main (ID 254), and the kernel uses only this table when calculating routes.
Actually, another routing table always exists that is invisible but even more important. It is the local table (ID 255). This table consists of routes for local and broadcast addresses. The kernel maintains this table automatically, and administrators should not ever modify it and do not even need to look at it in normal operation.
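The name-to-number mapping in /etc/iproute2/rt_tables is a plain text file with one "ID name" pair per line and # for comments. The following Python sketch parses such content. The function name is hypothetical, the sample text imitates a stock file with one made-up entry (100 dmz), and decimal IDs are assumed.

```python
SAMPLE_RT_TABLES = """\
#
# reserved values
#
255	local
254	main
253	default
0	unspec
#
# local
#
100	dmz
"""


def parse_rt_tables(text):
    """Map table names to numeric IDs, ignoring comments and blank lines."""
    names = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        ident, name = line.split()[:2]
        names[name] = int(ident)  # assumption: IDs are written in decimal
    return names


table_ids = parse_rt_tables(SAMPLE_RT_TABLES)
```

On a real system you would read the file itself, for example with `open("/etc/iproute2/rt_tables").read()`, rather than use the sample string.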
In Policy Routing, the routing table identifier becomes effectively one more parameter added to the key triplet {prefix,tos,preference}. Thus, under Policy Routing the route is obtained by {tableid,key triplet}, identifying the route uniquely. So you can have several identical routes in different tables that will not conflict, as was mentioned earlier in the description of action 5 and "the first" mechanism associated with action 5.
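One way to picture the extended key is as a dictionary indexed by the table ID plus the triplet. The RouteStore class below is a hypothetical illustration, not a kernel structure. It also shows that a table exists only as long as it holds at least one route.

```python
class RouteStore:
    """Routes keyed by (table id, prefix, tos, preference)."""

    def __init__(self):
        self._routes = {}

    def add(self, table, prefix, tos, preference, nexthop):
        key = (table, prefix, tos, preference)
        if key in self._routes:
            raise KeyError(f"route already exists: {key}")
        self._routes[key] = nexthop

    def delete(self, table, prefix, tos, preference):
        del self._routes[(table, prefix, tos, preference)]

    def tables(self):
        """A table is present only while it contains routes."""
        return {key[0] for key in self._routes}


store = RouteStore()
# The same {prefix,tos,preference} triplet in two tables does not conflict.
store.add(254, "10.0.0.0/8", 0, 0, "192.0.2.1")
store.add(10, "10.0.0.0/8", 0, 0, "198.51.100.1")
```

Adding the same triplet twice to one table raises an error in this model, while deleting the only route in table 10 makes that table disappear from `store.tables()`.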
These changes to the route element provide one of the core strengths of the RPDB, multiple independent route tables. As you will see in Chapter 5, the rule element alone can only perform a selection or filter operation. It is still up to the route to indicate where the packet needs to go next. Adding on top of these elements the QoS mechanisms to determine and set the TOS field and the ability to route by the TOS field provides you with the most powerful and flexible routing structure available under IPv4 and IPv6.
In summary, the RPDB is the core facility for implementing Policy Routing under Linux. The RPDB streamlines the mechanism of dealing with rules and multiple route tables. All operations of the rule and route structure are centralized into a single point of access and control. The addition of various alternate actions and destinations for routes and rules through the RPDB allows you to fine-tune the mechanism of Policy Routing without needing to hack sections of the networking code.