Building scalable networks

Prof. Laurent Vanbever, Joel Fischer (Scribe 1), Alexandra Marchon (Scribe 2), Dennis Küenzi (Scribe 3)

In this part of the lecture, we'll speak about how we can build scalable networks, that is, networks that can handle a large (and growing) number of routers, routes, and prefixes. We'll do that considering both the control plane (routing) viewpoint and the data plane (forwarding) viewpoint.

Slide 1
Slide 1
Slide 2
Slide 2

First off, let me define what I mean by "large" network. Here, our goal should be to build networks that can have tens of thousands of routers, learning dozens of millions of routes, for up to 1.2 million prefixes. (Each prefix can have multiple paths associated with it, cf. BGP.) This is the realm of the Googles, the Metas, the Amazons, the Deutsche Telekoms, and the Microsofts of this world.

Slide 3
Slide 3

Being responsible for computing the forwarding state, the control plane will be affected by all these metrics: the size of the topology (the number of routers), how many routes are learned (that is, how many paths), and for how many prefixes.

Slide 4
Slide 4

In contrast, the data plane scalability will mostly depend on the number of prefixes it needs to carry.

Slide 5
Slide 5

As a reminder, the control plane is in charge of maintaining the routing tables (known as Routing Information Bases, or RIBs) which contain all the routes that the router has learned thus far; computing (and propagating) the best route for each destination prefix; and downloading the best routes for each destination prefix—that is, the forwarding state—into the forwarding table.

Slide 6
Slide 6

The forwarding table is often referred to as Forwarding Information Base (or FIB). A router uses its FIB to forward packets from one port to another. To sustain high forwarding speeds, the FIB is typically implemented in hardware, using SRAM. Doing so allows routers to forward at multiple Terabits per second.

Slide 7
Slide 7

We'll mainly see two scaling techniques in this lecture. The first one will help us improve how routes propagate within a network, considering both inter-domain routing (BGP) and intra-domain routing (OSPF). The second technique will help us reduce the number of entries in the forwarding tables by aggregating and filtering them away.

Slide 8
Slide 8
Slide 9
Slide 9
Slide 10
Slide 10
Slide 11
Slide 11
Slide 12
Slide 12

Let us start with how we can scale BGP route propagation in large networks.

Slide 13
Slide 13
Slide 14
Slide 14

As you know, BGP is the routing protocol used in today's Internet. It's used between different Autonomous Systems, or ASes.

Slide 15
Slide 15
Slide 16
Slide 16

BGP is a path vector protocol: you have (AS) paths in the routing announcements, and each AS appends itself to the path before sending a route to the next AS.

Slide 17
Slide 17

BGP comes in two flavors. First, you have external BGP (or eBGP) sessions which run between routers located in different ASes. These sessions are used to learn/advertise routes from/to neighboring ASes.

Slide 18
Slide 18
Slide 19
Slide 19
Slide 20
Slide 20

Then you have internal BGP (iBGP), which runs between routers located in the same AS. iBGP sessions are used to propagate externally-learned (and internally-sourced) routing information within the network.

Slide 21
Slide 21

So for instance, iBGP allows all the routers in this US-based network to learn that they can reach this /16 prefix with this AS path using Seattle as the egress router.

Slide 22
Slide 22

Given a set of routes, BGP routers elect one best path for a prefix. BGP is often referred to as a single path protocol. The way to think about it is the following: let's say I'm a router and I know 10 routes for reaching one of Google's prefixes, meaning I need to go from 10 to 1. What I will do is apply each of these steps in sequence until there is only one route left. Observe that there is a final tie-breaking rule, meaning we are guaranteed to end up with a single route at the end of the process. So, you start with 10 routes, and then you remove all the routes that do not have the highest local preference. Let's say you have 4 routes left. Then, out of these 4, you remove all the ones that do not have the shortest AS path. Let's say now you only have 2, and you continue going down the list until you hit 1, which is then the best route. That's the route you propagate, not only externally but also internally, modulo policies (route-maps).
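The elimination process above can be sketched as a sequence of filters over a list of candidate routes. This is only an illustrative sketch: the attribute names, the dictionaries, and the values are made up, and the real decision process has more steps (e.g. MED, origin).

```python
# Hypothetical sketch of the BGP decision process as successive filters.
# Route attributes and step order follow the lecture; names are illustrative.

def best_route(routes):
    """Narrow a list of candidate routes down to a single best one."""
    # Global steps: highest local preference, then shortest AS path.
    best_pref = max(r["local_pref"] for r in routes)
    routes = [r for r in routes if r["local_pref"] == best_pref]
    shortest = min(len(r["as_path"]) for r in routes)
    routes = [r for r in routes if len(r["as_path"]) == shortest]
    # Local steps (hot potato): prefer eBGP over iBGP, then the lowest
    # IGP metric to the egress router.
    if any(r["ebgp"] for r in routes):
        routes = [r for r in routes if r["ebgp"]]
    lowest = min(r["igp_metric"] for r in routes)
    routes = [r for r in routes if r["igp_metric"] == lowest]
    # Final tie-break guarantees a single route (here: lowest peer address).
    return min(routes, key=lambda r: r["peer_addr"])

routes = [
    {"local_pref": 200, "as_path": [20], "ebgp": False, "igp_metric": 5, "peer_addr": "10.0.0.1"},
    {"local_pref": 200, "as_path": [40], "ebgp": False, "igp_metric": 2, "peer_addr": "10.0.0.2"},
    {"local_pref": 100, "as_path": [30], "ebgp": True,  "igp_metric": 0, "peer_addr": "10.0.0.3"},
]
print(best_route(routes)["peer_addr"])  # -> 10.0.0.2 (lowest IGP metric wins)
```

Note how the eBGP route is eliminated immediately despite hot potato routing: local preference is a global step and is applied first.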

The different steps in the decision process can be divided into two sets: global and local steps. Global steps refer to globally-visible attributes (the local preference, the AS path length, ...) which are directly attached to the BGP routes. At the end of the selection process, all the routers will have picked equivalent routes from the global steps viewpoint: all the routers will have a route with the same (highest) local pref and the same (shortest) AS path length. Note that you can have 2 routers that pick 2 different AS paths, but if one has a length of 3, the other one will also have a length of 3.

The local steps refer to attributes that are local to a specific router and pertain mostly to hot potato routing. Hot potato routing is the idea that, when a network receives traffic, it should push it away as quickly as possible to minimize transit costs. There are two steps there: prefer eBGP routes over iBGP ones, and prefer the route with the lower IGP metric to the egress router. If you are an egress router with both external and internal sessions, and you learn 2 routes, one external and one internal, you should prefer the external one if they are otherwise equivalent, because it lets you send the traffic away from the network instead of through it. Likewise, if you are a router inside the network and you learn multiple internal routes that are, again, equivalent, then you prefer the one that comes from an egress that is closer to you over one from an egress that is further away. So if I am a router in Zurich, and I learn one route from Basel and one from Geneva, and they are otherwise perfectly equivalent, it is much better for me to select the one via Basel, because Basel is closer to Zurich than Geneva is: the traffic travels the shortest distance inside the network if it exits via Basel. Because these are local decisions, different routers can make different choices.
A router in Lausanne, faced with the same decision, would obviously prefer to exit via Geneva rather than Basel.

Slide 23
Slide 23
Slide 24
Slide 24

Besides using the IGP to select routes, a BGP router also uses the IGP to figure out the physical next hop to use to reach an egress. For instance, if you are this Atlanta router over there and the next hop for this route is Seattle (which is not directly connected to you), then you need a way to figure out the physical next hop to use to reach Seattle. This is where you also use the IGP.

Slide 25
Slide 25

This process of figuring out the physical next hop for each BGP prefix is known as (recursive) next-hop resolution. Think of it as merging two routing tables. The first routing table, the BGP routing table, tells you that Seattle is the egress for that /16. The second routing table, the OSPF routing table, tells you that if you want to reach Seattle then you need to send the traffic to Chicago. Chicago is a directly-connected next hop.
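As a toy sketch (the tables and names are simplified from the example, not a real router data structure), the resolution is essentially two successive table lookups:

```python
# A toy sketch of recursive next-hop resolution: merging the BGP table
# (prefix -> egress router) with the IGP table (destination router ->
# directly-connected next hop). Names follow the lecture example.

bgp_table = {"1.0.0.0/16": "Seattle"}   # prefix -> BGP next hop (egress)
igp_table = {"Seattle": "Chicago"}      # egress -> physical next hop

def resolve(prefix):
    egress = bgp_table[prefix]          # step 1: BGP lookup gives the egress
    return igp_table[egress]            # step 2: IGP lookup gives the next hop

print(resolve("1.0.0.0/16"))  # -> Chicago
```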

Slide 26
Slide 26

This process happens on all the routers in the network. At the end, Atlanta will send the traffic to Chicago, which will send it to Kansas City, Salt Lake City, and Seattle, and then the traffic finally gets out.

Slide 27
Slide 27

This is an illustration of what I was saying just before about the hot potato routing step. Suppose you are this router here in Houston and you learn two iBGP routes: one for which the egress is Seattle, and one for which the egress is New York. These routes have the same local preference, 200, and the same AS path length, 1. The AS paths themselves differ (20 and 40), but their length is the same, so the routes are equivalent from the BGP selection viewpoint.

Slide 28
Slide 28

Because the routes are perfectly equivalent, you will reach this step of the decision process. In this case, Houston happens to be closer to New York than to Seattle, meaning that the traffic to this /16 will go to New York.

With hot potato routing, you tend to separate the network into zones. Assuming that external routes come in east and west (which is common for such a US-based network), somewhere in the middle, in the Midwest, routers will send traffic left, while on the east side they send it right. So there is a kind of frontier, somewhere around here, where you go left or right. You can think of it like a watershed: there are separation lines on one side of which water flows to the Mediterranean, and on the other side of which it flows to the Atlantic.

Slide 29
Slide 29
Slide 30
Slide 30
Slide 31
Slide 31
Slide 32
Slide 32

So, inter-domain routing in a nutshell. With eBGP, you learn and advertise routes from and to your neighboring ASes. iBGP is used to disseminate internal and external routes within the AS, and the IGP (e.g. OSPF, shortest path routing) helps you compute the shortest paths within the AS and identify the closest egress point.

Slide 33
Slide 33

We are now ready to start talking about scalability.

Slide 34
Slide 34

One thing we have always told you in your intro lecture is that iBGP routers do not redistribute iBGP routes to one another. That means if I'm an iBGP router, and I learn a route from another iBGP router, I don't redistribute that route to any other iBGP router. With this rule in place, the only way to guarantee full propagation is to create a complete graph of iBGP sessions amongst all routers. That is, we need a full mesh of BGP sessions.

That's a lot of iBGP sessions: n(n-1)/2, where n is the number of iBGP routers. With 100 routers, you have to configure 5k sessions, and with 1000 routers, which is again the ballpark number we are targeting, we end up with half a million iBGP sessions configured. There is no way you can do that manually.
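The ballpark numbers above come straight from the formula:

```python
# Number of iBGP sessions needed for a full mesh of n routers: n(n-1)/2.
def full_mesh_sessions(n):
    return n * (n - 1) // 2

print(full_mesh_sessions(100))   # -> 4950, roughly 5k sessions
print(full_mesh_sessions(1000))  # -> 499500, roughly half a million
```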

Slide 35
Slide 35
Slide 36
Slide 36

Even if you automate this though, a full mesh remains very annoying: each router must maintain a session with every other router, i.e. about a thousand sessions per router in our ballpark network (and half a million network-wide). First, that's a lot of TCP connections to maintain. Second, you will often end up with your routers learning too many BGP routes. Then, whenever a router updates its best route, it has to send that update to a thousand neighbors (which consumes CPU and bandwidth). The killer argument though is that, whenever you add or remove a router in your network, you need to touch every single other router in your network to add or remove the corresponding session. This is extremely annoying in terms of network operations.

Slide 37
Slide 37

One solution, which is heavily used in practice, is known as BGP route reflection. As the name indicates, route reflection allows some iBGP routers to reflect routes that they learn over an iBGP session onto another iBGP session. Doing so allows us to create a hierarchy of iBGP routers, with route reflectors sitting above client routers.

One of the classical insights of computer science, which says that whenever you want more scalability, you should introduce a hierarchical structure, very much applies here.

Slide 38
Slide 38

BGP route reflectors establish "normal" iBGP sessions. The only thing that changes is that, in the configuration, some sessions are indicated as being client sessions, while others are indicated as being peer sessions. Route reflectors establish peer sessions amongst themselves (i.e. with other route reflectors) and client sessions otherwise.

Slide 39
Slide 39

This is a first example of route reflection topology (on the right). Each of the clients (RA and RB) only maintains one single session with their route reflectors (RR1 and RR2) which in turn maintain a single peer session in-between them.

Slide 40
Slide 40

If a route reflector learns a route from a peer (another route reflector), it will relay it to its clients, but not to its other peers. In contrast, if a route reflector learns a route from a client, it will relay it to everyone. You might recognize here a connection with the classical propagation rules you know from eBGP, based on customer, peer, and provider sessions. (The two logics are indeed related.)
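The iBGP propagation rules above can be sketched as a small function. This is a hedged illustration: the session dictionary, router names, and function are made up, and eBGP sessions and the full reflection rules (e.g. cluster IDs) are not modeled.

```python
# A minimal sketch of the iBGP route reflection rules from the text:
# a route learned from a client is reflected to everyone; a route learned
# from a peer (another route reflector) goes to clients only.

def reflect_to(sessions, learned_from):
    """Return the sessions onto which a best route is re-advertised."""
    targets = []
    for name, kind in sessions.items():
        if name == learned_from:
            continue  # never send a route back where it came from
        if sessions[learned_from] == "client":
            targets.append(name)   # from a client: reflect to everyone
        elif kind == "client":
            targets.append(name)   # from a peer: reflect to clients only
    return sorted(targets)

# A route reflector with two clients (RA, RB) and one peer (RR2).
sessions = {"RA": "client", "RB": "client", "RR2": "peer"}
print(reflect_to(sessions, "RA"))   # -> ['RB', 'RR2']
print(reflect_to(sessions, "RR2"))  # -> ['RA', 'RB']
```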

Slide 41
Slide 41

Let's look at another example. Similarly to the example above, you can see that there are two route reflectors, RR1 and RR2, connected by a peer session. RA is the client of RR1, while RB is the client of both RR1 and RR2. (For redundancy purposes, it's good practice to ensure that each client router is connected to at least two distinct route reflectors.)

As I said, to whom a route reflector reflects depends on the session type on which the best route is learned. So if RR1 learns a route from its client, RA, it will reflect it to everyone, including RR2. Upon receiving the route, and assuming it's the best route, RR2 will also reflect the route to everyone (except on the session over which it learned it), meaning RB will receive two copies of the route: one via RR1, the other via RR2.

Slide 42
Slide 42

This slide considers the case where RR2 learns a prefix over an eBGP session, which it then reflects to everyone, peers and clients. RR1 will then reflect it to its own client, RA, because it comes from a peer. Observe that, unlike RB, RA only receives one copy of the route, since it has only one route reflector.

Slide 43
Slide 43

Route reflection allows to scale, but comes at the cost of route diversity. In a full mesh, each router is learning all the globally-preferred routes known for a prefix. In contrast, with route reflection, each router will only learn the best route advertised by its route reflector(s). Often this means going from having routers learning dozens of paths towards the same prefix (one per egress router learning a globally-preferred route) to two or three.

Besides, routers might end up using different routes in a route reflection topology than they would with a full mesh. In other words, there can be suboptimal routing. This is because the route reflector makes its own decision based on its own local attributes. The decision step pertaining to the IGP distance I was referring to before is now done from the route reflector's viewpoint, so the route reflector makes local decisions that are then reflected towards the clients. In other words, clients are not (necessarily) making their own local decisions anymore.

Slide 44
Slide 44

Let me give you an example of that. We have a simple network with RA, RB, and RC that act as egresses for the network. RA learns one external route, r1, RB learns r2 and r3, and then RC learns r4. These routes are for the same prefix, and they are equally-preferred: they have the same local pref, same AS path length, etc.

In this case, RA would select r1, RB would select r2 (because of the tie-breaking step, assuming r2 comes over the session with the smallest peer address), and RC would select r4. So if we are looking at an iBGP full mesh, RD would learn r1, r2, and r4. RD would then prefer r1 based on the IGP distance to reach each of the egresses (2 is less than 5 or 6).

Slide 45
Slide 45
Slide 46
Slide 46
Slide 47
Slide 47

Now, let's make the router in the middle, RE, a route reflector. We remove the full mesh, and we just add client sessions between RE and RA, RB, RC, and RD. As before, RA, RB, and RC select r1, r2, and r4; that doesn't change. They now advertise these routes to their route reflector, RE.

And now RE will select, from its own viewpoint, the best route. From RE's viewpoint, reaching RA has a cost of six; reaching RB a cost of one; and reaching RC a cost of two. So RE takes a different decision than RD and picks r2, which it then reflects to all its clients, including RD. RD thus learns one single route instead of three, and this route is r2, with RB as the egress. This is a very clear case of suboptimal routing. It's correct: the traffic is not dropped, and there are no loops or black holes here. But we are now using a longer path than what we would have taken in a full mesh. That's the price to pay for route reflection.

Of course, how suboptimal the routing becomes depends on the position of the route reflectors. And you can phrase that as an optimization problem, but we won't go into details of that optimization problem in this lecture.

Slide 48
Slide 48
Slide 49
Slide 49
Slide 50
Slide 50
Slide 51
Slide 51
Slide 52
Slide 52

With route reflection, things can get worse than "just" suboptimal routing though. Route reflection can indeed lead to both routing and forwarding anomalies. Forwarding anomalies are of two types: forwarding loops and black holes. Routing anomalies include problems such as permanent oscillations and multiple stable states. Route reflection, when done wrong, can cause both anomalies.

Let me first give you an example inducing forwarding loops. Granted, I'm creating weird route reflection topologies in order to create these anomalies, but these are perfectly legit.

Slide 53
Slide 53

So now I'm considering the same example as before, with two route reflectors, RRA and RRB, and two clients, RC and RD. But as you can see, I have swapped the client sessions over, so that the route reflection topology is no longer in sync with the physical topology. Instead of RC being a client of RRA, RC is a client of RRB, and RD is a client of RRA. Each client is thus attached to the route reflector at the opposite corner.

RRA and RRB behave as before: RRA learns a route r1, and RRB learns another route r2.

Again, the two routes are for the same prefix and are equivalent, just as before. RRA and RRB reflect them to their respective clients, meaning that RC learns r2 (via RRB) and RD learns r1 (via RRA).

So what is the shortest path from RC towards r2's egress? It goes via RD. And what is the shortest path from RD towards r1's egress? It goes via RC. As you can see, a forwarding loop has now formed between these two routers.

This is a permanent forwarding loop: traffic will forever be dropped between these two routers, unless, of course, you notice it and change the configuration. This is a classical example of the problem. Again, it can be subtle.

This problem only manifests in exactly this situation: a route on RRA and a route on RRB that are equivalent. If you had only the route on RRA, or only the route on RRB, the forwarding anomaly would not appear.

So again, I just want to tell you that verifying a network by testing is not something that you can really do at scale. Traffic is now trapped in a forwarding loop. That's an example of a forwarding anomaly.
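One way to see the anomaly is to follow next hops hop by hop until traffic either leaves the network or revisits a router. A toy sketch, with the two routers from the example hard-coded (the dictionary and function are illustrative, not part of any real protocol):

```python
# Each router forwards toward the egress of its selected route via its IGP
# next hop. In the broken example, RC's egress lies behind RD and vice versa.
next_hop = {"RC": "RD", "RD": "RC"}

def follow(start, max_hops=10):
    """Follow next hops from `start`; report the path and whether it loops."""
    path, node = [start], start
    for _ in range(max_hops):
        node = next_hop.get(node)
        if node is None:
            return path, False           # traffic left the network
        if node in path:
            return path + [node], True   # revisited a router: forwarding loop
        path.append(node)
    return path, False

path, looped = follow("RC")
print(path, looped)  # -> ['RC', 'RD', 'RC'] True
```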

Slide 54
Slide 54
Slide 55
Slide 55
Slide 56
Slide 56
Slide 57
Slide 57

Another example of an anomaly at scale is permanent oscillation. Again, this is a bit of a crazy network, but I'll just give you one example where it happens.

Here you have a topology with three egress routers at the bottom, learning routes r1, r2, and r3 respectively. And then you have three route reflectors. The route reflectors themselves are totally fine: there is a full mesh of peer sessions between them.

Each of the bottom routers is the client of one route reflector: the first is the client of RRA, the second of RRB, and the third of RRC. But this is where the weirdness happens. If you look at RRA's preferences in terms of IGP cost, RRA is closer to RRB's client than to its own client. So RRA prefers r2 over r1.

Likewise, RRB is closer to RRC's client than to its own, so RRB prefers r3 over r2. And finally, RRC prefers r1 over r3. So as you can see, we now have a cyclical relationship, where each route reflector prefers the route learned by the next one, and RRC's preference goes back to RRA's route, closing the cycle.

You will work through this in the exercises, but this network will never converge. Nothing prevents you from spinning up a virtual network and configuring it like this; if you do, you will see that the routers will forever exchange routes. They will never converge.
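A hedged way to think about this: the trouble comes from the route reflectors' IGP-based preferences forming a cycle. Detecting such a cycle in a (made-up) preference map is straightforward; this sketch only flags the cyclical structure, it does not simulate BGP itself.

```python
# Each route reflector prefers the route advertised via another reflector:
# RRA prefers RRB's, RRB prefers RRC's, RRC prefers RRA's. Names follow the
# lecture example; the dictionary is illustrative.
prefers = {"RRA": "RRB", "RRB": "RRC", "RRC": "RRA"}

def has_preference_cycle(prefers):
    """Return True if following the preferences revisits a reflector."""
    for start in prefers:
        seen, node = set(), start
        while node in prefers:
            if node in seen:
                return True
            seen.add(node)
            node = prefers[node]
    return False

print(has_preference_cycle(prefers))  # -> True: oscillation risk
```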

It doesn't necessarily mean that the traffic will be lost. It just means that the routes keep shifting: sometimes RRA sends the traffic towards one egress, sometimes towards another, and so on. The traffic keeps moving around, not necessarily being dropped. You can also combine oscillations with forwarding loops, in which case the oscillation creates the forwarding loops; these are the most extreme cases, but that is not what happens here.

Still, it's very annoying for traffic to oscillate. Think about what TCP does: TCP tries to estimate the RTT, the round-trip time between the source and the destination, and it needs a bit of time (a few packets) to measure this. If the paths keep changing, the estimates of TCP become pointless, because the two paths can have drastically different RTTs in practice, meaning that the behavior of TCP will keep shifting as well. So you get poor performance: TCP really doesn't like jitter, that is, variance in the RTT. This is why it's bad. And on top of that, your routers are forever doing useless work.

Slide 58
Slide 58
Slide 59
Slide 59
Slide 60
Slide 60

The last problem I'd like to talk about is non-determinism. What I mean by that is that there are multiple solutions that can be computed, multiple stable solutions, and you do not know which one will be picked. This is very annoying from an operational viewpoint, because you cannot reason about what your network will do in advance.

In this example we have two route reflectors, RRA and RRB, and two clients, RC and RD. RC advertises r1, and RD advertises r2. What you have to see is that RRA prefers r2 over r1, because it has a cost of one to reach RD and a cost of two to reach RC. RRB is exactly the opposite: it prefers r1 over r2.

So what happens? Well, let's say RC is a bit faster than RD at announcing its route to the route reflectors. RRA will learn r1 first, and then reflect it to RRB. In this case, RRB is happy: it has learned its best route. This means that, later on, when RD finally advertises r2 to RRB, RRB doesn't reflect it, because it is not its best route. In this situation, RRA loses.

Of course, you can have the symmetrical case, in which RD is the first one to advertise r2. In that case, the network stabilizes in another configuration, where RRA is the happy one. This non-determinism is an issue: since you do not know whether the traffic will go left or right, it's hard to do things like capacity planning, for instance.

Note that this is not an oscillation, the network does stabilize, but you do not know in advance into which state.

Slide 61
Slide 61
Slide 62
Slide 62
Slide 63
Slide 63
Slide 64
Slide 64
Slide 65
Slide 65
Slide 66
Slide 66

Unfortunately, deciding whether a given route reflection configuration is correct or not is NP-hard, meaning there cannot exist an efficient algorithm to answer this question in the general case (unless P equals NP, and let's not go there).

So what do you do? It's not like the Internet is falling apart all the time because of this, so there must be something at work that allows it to function.

Luckily, sufficient (not necessary) conditions for correctness are known. This means that if you configure your network following these guidelines, you will be guaranteed that there will be no problem. If you don't follow these guidelines, then you may or may not have a problem.

Slide 67
Slide 67

The first condition is to ensure that route reflectors prefer the routes coming from their clients over the routes coming from their peers. In a lot of the examples I was showing you, this was not the case: you had route reflectors preferring routes learned from other route reflectors over the routes of their own clients. That's bad, because you can then combine these preferences into cycles.

The second condition is to ensure that "BGP messages follow the data path". That is, every shortest path between any two routers should be a valid "signaling path". If we look back at the example of slide 57, that was clearly not the case: there, the shortest path between RC and RRA (resp. between RD and RRB) was not a valid signaling path (there was no corresponding BGP session on it). This condition allows us to guarantee the absence of deflection. Deflection happens when different BGP routers along a forwarding path choose different egresses. Again, in slide 57, we had RC picking RRB's egress, and RD picking RRA's egress, each of them lying along the other's shortest path towards its selected egress.

Slide 68
Slide 68

Slide 69 is another example of a network which violates the second condition (every shortest path between any two routers should be a valid signaling path), which creates a deflection.

In this network, there is only one route reflector (RR1) which connects to two clients (C1 and C2). Both RR1 and C2 learn an equally-preferred route for a prefix p (r1 and r2, respectively).

As you can see, the shortest path between C1 and C2 is the direct link (C1, C2). This shortest path is not a valid BGP signaling path: BGP routes do not flow alongside it. Here, there is only one valid signaling path between C2 and C1: C2, RR1, C1.
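This check can be sketched on the example's topology. The session set and the paths are hard-coded from the figure, and, as an assumption for simplicity, a path counts as a signaling path whenever each of its links carries an iBGP session (the full definition also constrains the session types along the way):

```python
# iBGP sessions of the slide-69 example: RR1 to each of its clients.
sessions = {frozenset(("RR1", "C1")), frozenset(("RR1", "C2"))}

def path_is_signaling_path(path):
    """True if every link along `path` carries an iBGP session."""
    return all(frozenset((a, b)) in sessions
               for a, b in zip(path, path[1:]))

print(path_is_signaling_path(["C1", "C2"]))         # -> False: no session on (C1, C2)
print(path_is_signaling_path(["C2", "RR1", "C1"]))  # -> True
```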

Because of this, a deflection arises. Here, RR1 chooses r1 as best route, and propagates it to C1 and C2. C1 therefore selects r1 as best. C2 though learns both routes: r1 from RR1, and r2 from its eBGP peer. Out of the two, C2 prefers the external route (eBGP routes are preferred over iBGP routes).

Now, observe that the shortest path between C1 and RR1 goes via C2. This means that when C1 sends traffic to p (which it believes will exit via RR1), the traffic hits C2, which then forwards it out directly instead of towards RR1.

While this deflection looks innocent at first, it isn't. Among other things, it can lead to forwarding loops across different ASes, since traffic does not follow the AS-PATH anymore. (If you are interested, you can check Figure 16 in this paper for an example.)

Slide 69
Slide 69

In Slide 70, we have turned C2 into a route reflector (and renamed accordingly to RR2) and connected it to C1 as well. Now, the condition that every shortest path should be a valid signaling path is respected since there is an iBGP session and BGP messages "flowing" alongside (RR2, C1). This network is not subject to deflections. Here, C1 will select r2 as egress and traffic will indeed leave via RR2.

Slide 70
Slide 70

In most cases, you'll see a single level of route reflection. But just so you know, you can actually create multiple levels of route reflection, should you want to do that. You can have a route reflector that has client route reflectors that themselves have clients; you can stack as many route reflection layers on top of one another as you want. Finally, you can configure route reflectors with what is known as a cluster ID, which, as the name indicates, allows you to group route reflectors into clusters.

Slide 71
Slide 71
Slide 72
Slide 72

Let us switch gears now and speak about how we scale link-state IGPs.

Slide 73
Slide 73

In link-state routing, as you know, what happens is that the routers build a map of the network graph by flooding link-state advertisements. Each router maintains its incident links and costs and then floods that throughout the entire network. Think of it as a tiny piece of a puzzle: each router knows its local view and then sends that to everyone else. Everybody puts the pieces together, builds a graph, and then runs Dijkstra on it.

What's distributed in OSPF is not the computation: each router locally runs Dijkstra on the reconstituted graph. Instead, what is distributed in OSPF is the flooding of the graph. In BGP (and, more generally, in distance-vector protocols), the computation of the routes themselves is distributed. So these are very different families of protocols; they work (very) differently.

Slide 74
Slide 74

So what are the scalability challenges behind link-state routing? Well, the larger the topology, the more work it takes for the routers to flood. Likewise, the larger the topology, the larger the graph to store. And then, of course, the larger the topology, the more time it takes to run Dijkstra on it. So, you need more bandwidth for the flooding, more memory to store the graph, and more compute time to run Dijkstra.

Slide 75
Slide 75

How do we improve scalability? Well, the first part is to reduce as much as possible the overhead of the algorithm. That means trying to reduce the amount of computation you need to do and reduce the frequency at which you do these computations.

The second step is to introduce a hierarchy, similarly to what we have done with route reflection.

Slide 76
Slide 76

Let's talk quickly about reducing the algorithm overhead.

The first thing is to make sure that you use optimized data structures. In your introduction lecture, you may have seen simple versions of Dijkstra that run in $\mathcal{O}(n^2)$, where $n$ represents the number of vertices in the network. Using a heap, you can reduce the complexity to $\mathcal{O}((m+n) \log n)$, where $m$ represents the number of edges.
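For reference, here is a minimal heap-based Dijkstra in this spirit, on a made-up three-node graph. It is a sketch, not a router's SPF implementation:

```python
import heapq

# Heap-based Dijkstra: the graph maps each node to (neighbor, cost) pairs,
# where costs play the role of IGP link metrics.
def dijkstra(graph, source):
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry, already found a shorter path
        for v, cost in graph[u]:
            nd = d + cost
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

graph = {
    "A": [("B", 1), ("C", 4)],
    "B": [("A", 1), ("C", 2)],
    "C": [("A", 4), ("B", 2)],
}
print(dijkstra(graph, "A"))  # -> {'A': 0, 'B': 1, 'C': 3}
```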

The second thing: when there is a change, you might not need to recompute the entire forwarding state (that is, recompute the shortest paths towards all other nodes). You might be able to reuse some portions of the existing shortest paths. In short, you can make the algorithm incremental.

The last technique is simply to run the algorithm less frequently by e.g. pacing the computation: you make the router wait a little bit before recomputing. This is very useful, for instance, when you have links flapping, that is, links that go down, up, down, up, etc. These are very typical failures. Since each router recomputes its shortest paths every time there is a change in the graph, the entire network is recomputing shortest paths with each link flap. Pacing can help here. Of course, if you are too aggressive with your pacing, then you run the risk that your routers do not react immediately in case of a real failure. As usual, this comes with limitations that you need to be aware of.
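Pacing can be sketched as an exponential backoff on the SPF timer. This is only an illustration under made-up assumptions: the class name and the timer values are invented, not those of any actual router OS.

```python
# A minimal sketch of SPF pacing: each triggered run waits for the current
# hold time, which doubles up to a maximum; a long quiet period resets it.
class SpfPacer:
    def __init__(self, initial=0.05, maximum=5.0, quiet=10.0):
        self.initial, self.maximum, self.quiet = initial, maximum, quiet
        self.hold = initial
        self.last_trigger = None

    def delay_before_run(self, now):
        """Seconds to wait before the SPF run triggered at time `now`."""
        # A long quiet period since the last trigger resets the backoff.
        if self.last_trigger is not None and now - self.last_trigger > self.quiet:
            self.hold = self.initial
        self.last_trigger = now
        delay = self.hold
        self.hold = min(self.hold * 2, self.maximum)
        return delay

pacer = SpfPacer()
print(pacer.delay_before_run(0.0))    # first flap: short delay
print(pacer.delay_before_run(0.1))    # rapid flapping: the delay grows
print(pacer.delay_before_run(100.0))  # long quiet period: the delay resets
```

The design choice here is the one described in the text: flapping links trigger progressively fewer recomputations, while an isolated failure after a quiet period still reacts quickly.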

Slide 77
Slide 77
Slide 78
Slide 78

Reusing intermediate results is interesting, because in a lot of cases, when there is a change in the graph, it may be that the result of the computation does not change that much. For instance, this is what happens if a router starts advertising a new prefix (a new leaf appears). From the graph viewpoint, there is no change in the shortest path. (Recall that the shortest paths are computed towards nodes, towards routers, not towards prefixes.)

Slide 79
Slide 79
Slide 80
Slide 80

Likewise, if you have a link that fails which does not belong to the previous shortest path tree, then that link failing will not affect the result of the shortest path computation. You can see this in the example in Slide 81.

In this network, in blue, we have all the links that belong to the shortest path tree between A and everyone else. These are the links that are used to forward traffic between A and everyone else in the network.

Slide 81
Slide 81

If any of the links in red, the ones that are not part of the shortest path tree, fail, that won't change a thing for A. By simply checking whether a topology change affects any of the links on the shortest path tree, A can easily save on computations.
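The check can be sketched in a few lines. The set of tree links is made up to mirror the blue links of the example (the actual topology in the slide may differ):

```python
# Links on A's shortest path tree (the "blue" links); names are illustrative.
spt_links = {("A", "B"), ("B", "L"), ("A", "C")}

def needs_recomputation(failed_link):
    """A failed link forces a recomputation only if it is on the tree."""
    a, b = failed_link
    return (a, b) in spt_links or (b, a) in spt_links

print(needs_recomputation(("B", "L")))  # -> True: on the tree, recompute
print(needs_recomputation(("C", "L")))  # -> False: red link, nothing changes
```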

Slide 82
Slide 82

Okay, so now, of course, what about the failure of the other links? The blue links are on the shortest-path tree. If they fail, they will inevitably affect some aspect of the shortest paths, and you will have no choice but to recompute. But, again, in most situations you do not have to recompute the entire shortest-path tree.

Slide 83
Slide 83

For instance, consider the link failure marked with the red star: it only affects one part of the graph, the subtree rooted in L.

The paths between A and B, A and C, A and D are completely independent. They won't change if that link over there fails. Only the paths towards the routers in that subtree might change after the failure.

The other example on this slide is where a new link appears. Again, this does not necessarily affect the whole graph; there is only one impacted subtree.

Slide 84
Slide 84
Slide 85
Slide 85

There is a more advanced version of ISPF which also works here: it automatically identifies the subtrees affected by a failure or by a new link appearing, and then only redoes that part of the computation. We won't see the algorithm in the lecture. Why? Because even Cisco acknowledges that it's actually a tricky algorithm to run, as you can see in Slide 86: "This performance improvement has been proved to be risky." What they are acknowledging there is that there were bugs in their implementation. That's one thing; the other thing is that the Dijkstra algorithm is actually pretty damn fast to run. Modern routers have server-grade CPUs; running even the full-blown computation, without any of the ISPF tricks, takes only milliseconds, even on a large graph. So, in a sense, they did not see the benefit anymore of maintaining (or fixing) a complex, buggy piece of code when the new generation of routers essentially doesn't need it anymore. That's why, slowly but surely, as you buy new routers now, full ISPF is not activated anymore.

Slide 86
Slide 86

So let me now speak about how we can introduce hierarchy in link-state IGPs. Essentially, the idea is that a router should only know the detailed map of the region it belongs to. Say I'm a router in the Niederdorf region of Zurich; I will only know the graph of the Niederdorf region, and I won't know the graph outside of that region. We split the graph into pieces, and each router only knows its own piece of the graph. Of course, you still want routers in different regions to learn about destinations in the other regions.

Slide 87
Slide 87

Using hierarchical (instead of flat) IGPs requires dividing the topology into a set of regions known as areas. The division is constrained: the resulting topology should be composed of a special area (the backbone area, often referred to as area 0) connecting peripheral areas. We call border router any router that belongs to more than one area.

Slide 88
Slide 88

This is how it tends to look in practice. Here you see a network with 5 areas and, in between areas, the area border routers. You can have plenty of non-border routers in each area, and plenty of them in area 0 as well.

For instance, here, you have four routers in area zero, one of which is connected to this 10/24 prefix over there.

Slide 89
Slide 89
Slide 90
Slide 90

Routers only know the entire topology of the areas they are part of. So how does it work between areas? This is where it gets a bit interesting. In OSPF, the only thing that is distributed is the flooding of the puzzle pieces of the graph; each router then runs Dijkstra locally, and nothing else. If you look at a protocol like BGP, the computation itself is really distributed, because I keep sending messages until I converge.

Hierarchical IGPs behave a little bit like BGP in-between areas in that area border routers, like this guy over here, only advertise the cost to reach each destination across areas instead of advertising the entire topology. So here, the border router will just advertise that it can reach 10/24 with a given cost, in this case 16, which is indicated in the graph.

This means now all the routers in area 1 know that they can reach 10/24 via this router with a cost of 16, but they do not know the entire topology of area 0. And they don't need to; they just need to know the cost to reach this destination via the border router.

That's where the scaling happens. You remove the topology information, and you replace that with cost information pertaining to prefixes. That means really the only thing you gain is the fact that the graph is smaller. You still have one entry per prefix in the network, that doesn't change, you still have one advertisement per destination.

But you can still gain things. The idea is that, in a network, you can have two problems: a problem with the number of prefixes, or a problem with the size of the graph. Or both. So far, what we have discussed only helps you when your problem is the size of the graph, not the number of destinations.

Slide 91
Slide 91
Slide 92
Slide 92

To also improve on the number of prefixes, area border routers can also aggregate multiple prefixes together before redistributing them across areas. You can see an example of this at play in Slide 94. Instead of advertising both 10.0.0.0/24 and 10.0.1.0/24, they advertise one aggregate prefix, 10.0.0.0/23. When doing so, the big question is the cost that should be associated with the aggregate. In OSPF, that cost is set to the largest cost to reach any child prefix in the aggregate. For the top border router, that cost is 16, as it has a cost of 16 to reach 10.0.0.0/24. Likewise, the bottom router will advertise a cost of 14, as this is its largest cost to reach any of the child prefixes (here 10.0.0.0/24 as well).
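As a one-line sketch of this rule (the cost of 16 is from the slide; the second child cost is a hypothetical value I picked for illustration):

```python
def aggregate_cost(child_costs):
    """OSPF-style summarization: the aggregate is advertised with the
    largest cost among the child prefixes it covers."""
    return max(child_costs.values())

# The top border router reaches 10.0.0.0/24 with cost 16; assuming it
# reaches 10.0.1.0/24 with a (hypothetical) cost of 12, it advertises
# the aggregate 10.0.0.0/23 with cost 16.
cost = aggregate_cost({"10.0.0.0/24": 16, "10.0.1.0/24": 12})
```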

Slide 93
Slide 93
Slide 94
Slide 94

As usual, scaling doesn't come for free. Like route reflection, going hierarchical with an IGP can introduce anomalies: whenever you divide the network into areas, the forwarding paths computed by the network can change.

Slide 95
Slide 95

Let us compare these paths before and after aggregation. Assuming this is the topology inside area 1: before aggregation, if you compute the shortest path between this router and this router, it's this path over here, with a cost of 11. Now introduce the areas. One border router announces 16, the other announces 14. From the perspective of this router over there: it learns 16 from that router, and 16 + 1 = 17; it learns 14 from this router, and 14 + 1 = 16. Since 16 is less than 17, it picks the latter.

You see that, as you introduce areas and start to aggregate prefixes together, this leads to suboptimal routing. This is the equivalent of the route-reflection case, where by reducing diversity, I was introducing suboptimal routing. Here, by reducing visibility of the topology, I'm introducing suboptimal routing as well. Whenever you try to scale things, you remove routing information, and this means that forwarding can become suboptimal.

Slide 96
Slide 96
Slide 97
Slide 97
Slide 98
Slide 98

Summarization helps, but it comes at a cost. You have to be aware of that cost, and of whether you are ready to pay it. The advantages: you reduce the size of the link-state database, which is nice, and you have a smaller graph to deal with. But you can introduce suboptimal routing, and the network becomes more complex to operate.

One of the constraints you have is that the network must look like an area 0 connected to other areas, and if it doesn't, you have to technically redesign it. That's why, in practice, I must say, people nowadays tend to run flat (non-hierarchical) IGPs, because the operational cost is deemed not worth it.

Slide 99
Slide 99

Alright, as I told you at the beginning, this lecture very much relates to research topics as well, so I just want to mention a few papers in this domain. I will try to do that for every piece of the lecture.

Slide 100
Slide 100

The first paper relates to iBGP correctness. I told you that deciding whether an iBGP configuration is correct or not is NP-hard, this paper relates to that.

Slide 101
Slide 101

This paper also relates to complexity theory. Here we aimed at proving that deciding whether a network will converge or not is actually exactly the same problem as deciding whether a program will terminate. I'm guessing you all know the halting problem? This is one of the problems Alan Turing worked on: the complexity of deciding whether a program will terminate or not, on any possible input.

That's a famous, undecidable problem. There is no algorithm, and there will never be an algorithm to do that. Skipping tons of details, how Turing proved it is by introducing the Turing machine, and then showing the result for Turing machines. What we have done in the paper mentioned in Slide 102 is build a Turing machine using only BGP routers. So you can kind of simulate computation using BGP as an engine.

It's a bit twisted, and it's definitely not practical. But it essentially allows you to make a statement like this: BGP has the same computing power as a Turing machine. And that means that deciding whether a network converges or not is impossible, at least in the general case. This is why people look at these questions: to understand the complexity of some problems. If you are into complexity theory, it's definitely a (super) fun paper to read.

Slide 102
Slide 102

Ok. Let us now switch gears and start speaking about how we reduce the number of prefixes that we have to propagate. Once you have optimized route propagation, you indeed also want to start reducing the amount of things you are propagating.

Slide 103
Slide 103

Maintaining many destinations (prefixes) is problematic for at least three reasons. First, whenever there are new prefixes, your routers need to spend time computing the routing and forwarding state for these new prefixes. Second, you need them to maintain this information in memory, both in the control plane and in the data plane. The latter is especially problematic, as routers tend to have a very limited amount of memory in the data plane. Why? Because the data plane is built to forward traffic fast, Terabits per second fast, and you can't do that in software. And, of course, fast hardware-based memory, i.e. SRAM, is expensive. In many devices today, the limit lies around 1.5 million entries. Once you run out of space in your hardware, the solution is to buy a new router or a new switch. You cannot buy an upgrade module from digitec.ch and install it into your router. It doesn't work.

There is more. Not only do you need to store this state, you also need to reprogram it whenever there is a change in the network. Just consider the example of a router which is forwarding everything to the left, and then needs to forward everything to the right after a failure. This reprogramming takes time. And the more prefixes you have, the more entries you need to rewrite, and the more time it takes.

Slide 104
Slide 104

So we would like to reduce the size of the forwarding tables. The intuition to do so is to leverage the fact that the IP address space is hierarchical, you can have children-parent relationships between prefixes.

Slide 105
Slide 105

So, for instance, you can have this /8 prefix, which contains $2^{24}$ IP addresses. It can contain different child prefixes inside it: here, a /16 and two /24s. Whenever a child prefix has the same next hop as its parent prefix, you can delete the child prefix, and then rely on the parent entry to forward the corresponding matching packets.

Slide 106
Slide 106
Slide 107
Slide 107
Slide 108
Slide 108

In this example, you can see that two children share the same next-hop interface (2) as the parent. If you delete them, the router will just start forwarding the corresponding traffic using the /8 as the longest prefix match. For packets falling into these /24 prefixes, the forwarding decision is exactly as before.
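A minimal sketch of this filtering with Python's `ipaddress` module (the table contents are illustrative, not the slide's exact prefixes):

```python
import ipaddress

def filter_children(fib):
    """Drop every entry whose next hop equals the next hop of its
    closest covering parent: the parent entry already forwards the
    corresponding packets the same way."""
    nets = {ipaddress.ip_network(p): nh for p, nh in fib.items()}
    kept = {}
    for net, nh in nets.items():
        parents = [q for q in nets if q != net and net.subnet_of(q)]
        if parents:
            # the closest parent is the longest covering prefix
            closest = max(parents, key=lambda q: q.prefixlen)
            if nets[closest] == nh:
                continue  # redundant child: LPM falls back to the parent
        kept[str(net)] = nh
    return kept

fib = {"10.0.0.0/8": 2, "10.0.0.0/24": 2,
       "10.64.0.0/24": 2, "10.128.0.0/16": 3}
```

Both /24s disappear, since they share next hop 2 with the /8, while the /16 towards next hop 3 must stay.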

Slide 109
Slide 109
Slide 110
Slide 110
Slide 111
Slide 111
Slide 112
Slide 112

This filtering operation can be done locally, at the router level, or globally, at the network level. By local filtering, I mean only filtering the prefixes that get installed in the forwarding table of a router, without filtering what this router advertises to its neighbors. In the previous example, the router would still advertise the two /24s in the control plane. By global filtering, I mean filtering not only the prefixes that get installed in the forwarding table, but also the prefixes that get advertised in the control plane, i.e. the routing tables. This is more interesting, as it allows savings network-wide. As you may realize, though, doing so is trickier: we need more powerful techniques for global filtering than for local filtering. Today, local filtering is definitely done quite often, while global filtering is still a research idea.

Slide 113
Slide 113

Let us start unpacking what local filtering is.

Slide 114
Slide 114

You can think of four strategies (or levels) for local filtering. The first one is the one on the slide I was just showing you. You have, let's say, a /8, and a /16 which is a child of it. Provided they have the same next hop, you can just delete the /16. This is exactly the case of the previous slide.

Slide 115
Slide 115

Then there are situations like this: you have two /24s that are adjacent to each other, 10/24 and 10.1/24, but there is no 10/23. You have two children that are next to each other, so if the parent were in the table, you could delete them, but there is no parent in the table.

Here, what you can do is introduce a 10/23, which forwards to the common next hop. That allows you to filter out the two children. So there is this concept of adding parent prefixes that then allow you to filter out the child prefixes. Of course, you want to introduce fewer parent prefixes than the children you are filtering out; otherwise, there is no gain.
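A sketch of this second level (the prefixes and the next hop value 7 are illustrative): if two adjacent prefixes form an exact supernet and share a next hop, install the parent and drop both children.

```python
import ipaddress

def merge_adjacent(fib):
    """Level-2 filtering sketch: replace two adjacent sibling prefixes
    that share a next hop with their (previously absent) supernet."""
    nets = {ipaddress.ip_network(p): nh for p, nh in fib.items()}
    for net, nh in list(nets.items()):
        if net not in nets or net.prefixlen == 0:
            continue
        parent = net.supernet()                      # e.g. 10.0.0.0/23
        sibling = next(s for s in parent.subnets(prefixlen_diff=1)
                       if s != net)
        if nets.get(sibling) == nh and parent not in nets:
            del nets[net], nets[sibling]
            nets[parent] = nh                        # one entry replaces two
    return {str(n): nh for n, nh in nets.items()}

fib = {"10.0.0.0/24": 7, "10.0.1.0/24": 7}
```

Here the two /24s share next hop 7, so the table collapses into the single entry `10.0.0.0/23`, a net saving of one entry.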

Slide 116
Slide 116
Slide 117
Slide 117

If you do these operations and filter the table accordingly, you end up with forwarding that works exactly as before, and with a strongly consistent forwarding table. So we introduce this concept of strong forwarding consistency: every single packet you may think of is forwarded in exactly the same way before and after the filtering. This holds both for packets that matched a specific entry before and for packets towards any non-routed destination.

Slide 118
Slide 118
Slide 119
Slide 119

In contrast, weak forwarding consistency only mandates the forwarding behavior to be maintained for the routable destinations, but not necessarily for the non-routable destinations. With weak forwarding consistency, packets that were dropped before may end up being forwarded to a next-hop after filtering.

Slide 120
Slide 120

The two bottom strategies (Level 3 and Level 4) are only weakly forwarding consistent. They aggregate non-adjacent child prefixes under one parent prefix. Doing so creates "holes" in the prefix tree that become routable by matching on the parent prefix. One example here could be aggregating prefixes 10.0.0.0/24 and 10.0.3.0/24 (and only these two prefixes) into 10.0.0.0/22. In doing so, the aggregate entry starts forwarding traffic towards 10.0.1.0/24 and 10.0.2.0/24, while traffic for these prefixes was initially dropped.
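This example can be checked directly (the next hop "A" is a placeholder):

```python
import ipaddress

def lpm(table, addr):
    """Longest-prefix match; returns the next hop, or None (drop)."""
    a = ipaddress.ip_address(addr)
    best = None
    for prefix, nh in table.items():
        net = ipaddress.ip_network(prefix)
        if a in net and (best is None or net.prefixlen > best[0].prefixlen):
            best = (net, nh)
    return best[1] if best else None

before = {"10.0.0.0/24": "A", "10.0.3.0/24": "A"}
after  = {"10.0.0.0/22": "A"}   # the aggregate covers both, plus two holes

# Routable destinations keep their behavior...
assert lpm(before, "10.0.3.7") == lpm(after, "10.0.3.7") == "A"
# ...but an address in a hole, dropped before, is now forwarded.
assert lpm(before, "10.0.1.9") is None
assert lpm(after, "10.0.1.9") == "A"
```

That is weak forwarding consistency in a nutshell: the reachable destinations are untouched, only previously dropped traffic changes fate.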

Slide 121
Slide 121

I told you last week, just before ending, that there is a well-known algorithm for this. It's known as the Optimal Routing Table Constructor (ORTC), and ORTC is essentially an optimal algorithm that can squeeze the forwarding table into its minimal form. So it's an interesting algorithm from that point of view: it guarantees the minimum number of entries and, on top of that, it also guarantees strong forwarding consistency. You have exactly the same forwarding behavior before and after applying the algorithm. And, of course, under the hood it relies on subnetting and supernetting, i.e. the prefix hierarchy that is inherent to the address space.

Slide 122
Slide 122

So this is an example of ORTC, just to get us started. As you will see, the algorithm is quite simple; you will also do exercises on it, and applying the algorithm yourself is the best way to see how it works. The algorithm works on forwarding rules written in binary format. This is, for instance, a forwarding table with four entries: a default entry that maps traffic to port 1, 00* mapped to 2, 10* mapped to 2 as well, and 11* mapped to 3.

You can also represent this table as a binary tree, which is what the algorithm uses under the hood. This is just the exact same representation of the four rules. The root, which is the default route, has 1 as next hop; it's a binary tree, so each node has at most two children: 0 goes left, 1 goes right. As you go down the tree, you add bits to the prefix you are currently matching. If the address starts with 00, I forward to 2. If it starts with 01, nothing is defined, so I inherit the next hop of the parent, which is 1. If it starts with 10, I forward to 2, and with 11, to 3. So this binary tree is a different representation of the exact same semantics as the four rules over there.

What ORTC does is reduce this into another equivalent binary tree with, ideally, fewer entries. For this example you gain one entry, which is not a crazy saving, as I'm sure you see, but in practice you can save a lot more: this algorithm can typically reach a reduction factor of almost 50%, i.e. you can divide the number of entries by two in many practical cases. That's quite significant. Looking at the output of ORTC, you can already directly observe one thing: out of the three possible next hops (1, 2, and 3), the most popular one is next hop 2 (it appears twice). What the algorithm is trying to do is promote next hop 2 closer to, or in this case at, the root. By doing this, you allow most nodes to inherit from the root.
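The semantics of the four rules can be sketched with a small longest-prefix match over binary strings (the empty prefix being the default route):

```python
def lookup(rules, bits):
    """Longest-prefix match on rules keyed by binary prefix strings."""
    best = max((p for p in rules if bits.startswith(p)), key=len)
    return rules[best]

rules = {"": 1, "00": 2, "10": 2, "11": 3}
```

An address starting with 01 matches no specific rule and inherits the default next hop 1, exactly the "inherit from the parent" behavior in the binary-tree picture.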

Slide 123
Slide 123

That is the insight behind ORTC: short prefixes, those closer to the root, should route to the most popular next hops, while longer prefixes, closer to the leaves, route to the less popular next hops. Another way to see it: the closer the prefix to the root, the more popular the next hop.

Slide 124
Slide 124

So now let's see how the algorithm works. It works in three consecutive tree traversals, so you traverse the tree three times. It's sometimes not the most efficient algorithm from that viewpoint, but it's still linear in the number of entries. The first pass normalizes the tree: you will see what that means, but essentially it fully expands the tree, ensuring that every node has zero or two children, and pushes the next-hop information from the root down to the leaves. This happens top-down. The second pass computes next-hop popularity; this happens bottom-up: we start from the leaves and move up to the root, selecting the most popular next hops, because these are the next hops that should end up closer to the root. That is why we start from the leaves. Finally, in the third pass, we actually filter out information: we go down from the root towards the leaves, tracking the inherited information from the top to determine whether a node's entry can be filtered out as we move lower, relying on the parent's information instead. That's the idea of the three passes.

Slide 125
Slide 125

In the first pass, as I said, we normalize the tree. Starting from the table we considered initially, we populate the tree so that each node has zero or two children, and we push the information coming from the parent down to the leaves. As you can see here, most leaves are already defined (we have 2, 2, 3), but this node here is missing. That node will forward to 1, because the closest parent with a defined next hop has next hop 1. This is why the algorithm puts a 1 there. Now we have a tree in which all the leaves are defined, and nothing is defined at the higher levels. The other thing the algorithm does, which is a bit of a detail, is turn each next hop into a set; that will be important for the second pass. So this is really just normalization, nothing crazy.

Slide 126
Slide 126
Slide 127
Slide 127

Second pass: now we compute the most popular next hops. As I said, we go from the leaves towards the root. Since we are looking at a normalized binary tree, every internal node has exactly two children. We look at the two children sets, left and right, call them A and B (remember, these are sets now), and we apply a merging operation, denoted here with this hash sign; the result is what we put in the parent node. The merging operation is the following: if the intersection is not empty, the parent gets the intersection; if the intersection is empty, the parent gets the union. Example: we look at these two leaves, {1} and {2}. The intersection is empty; when the intersection is empty, we take the union, so {1, 2}, and that's what we put in the parent. Likewise for {2} and {3}: empty intersection, so we take the union, {2, 3}. We continue recursively until we hit the root: {1, 2} intersected with {2, 3} is non-empty, the intersection is {2}, so we put {2} at the top over there. Then we stop, because we have reached the root. As you can see, what this intersection is essentially doing is selecting the most popular next hop, which is 2 in this case.

Slide 128
Slide 128
Slide 129
Slide 129
Slide 130
Slide 130

That leads us to step three, filtering. This again proceeds from the root down. What we do now is essentially remember the next hop we are currently inheriting. When we start from the root, the inherited next hop is 2. Then we go down the tree, level by level, and look at whether each node's set contains a 2 or not. If it contains a 2, we can filter out that node, because we can rely upon the parent. We apply this process recursively. For instance, here we select 2 at the root because there is only one choice. Here we can delete the entry because it matches the inherited 2; we can delete that one over there as well, same thing over there. Then we reach the nodes {1, 2} and {2, 3} and continue the process: as you can see, we can delete this node because it also contains 2, and that node as well. In the end, we are left with this tree over there. That is the final outcome.
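The three passes can be sketched end-to-end on the running example. This is a compact reimplementation of my own, not the slide's code, and it breaks ties in pass 3 arbitrarily (here, by taking the smallest candidate):

```python
class Node:
    def __init__(self):
        self.left = self.right = None
        self.nh = None     # next hop from the input table
        self.cand = None   # candidate next-hop set (pass 2)
        self.out = None    # next hop kept in the compressed table (pass 3)

def build(rules):
    root = Node()
    for prefix, nh in rules.items():
        n = root
        for bit in prefix:
            attr = "left" if bit == "0" else "right"
            if getattr(n, attr) is None:
                setattr(n, attr, Node())
            n = getattr(n, attr)
        n.nh = nh
    return root

def normalize(n, inherited):              # pass 1: top-down
    nh = n.nh if n.nh is not None else inherited
    if n.left is None and n.right is None:
        n.nh = nh                         # leaves carry the effective next hop
        return
    n.left, n.right = n.left or Node(), n.right or Node()
    n.nh = None
    normalize(n.left, nh)
    normalize(n.right, nh)

def popularity(n):                        # pass 2: bottom-up
    if n.left is None:
        n.cand = {n.nh}
    else:
        a, b = popularity(n.left), popularity(n.right)
        n.cand = (a & b) or (a | b)       # intersection if non-empty, else union
    return n.cand

def filter_down(n, inherited):            # pass 3: top-down
    if inherited not in n.cand:
        n.out = inherited = min(n.cand)   # keep one candidate, pass it down
    if n.left:
        filter_down(n.left, inherited)
        filter_down(n.right, inherited)

def ortc(rules):
    root = build(rules)
    normalize(root, None)
    popularity(root)
    filter_down(root, None)
    table, stack = {}, [(root, "")]
    while stack:                          # collect the surviving entries
        n, p = stack.pop()
        if n.out is not None:
            table[p] = n.out
        if n.left:
            stack += [(n.left, p + "0"), (n.right, p + "1")]
    return table
```

On the lecture's table, `ortc({"": 1, "00": 2, "10": 2, "11": 3})` yields `{"": 2, "01": 1, "11": 3}`: three entries instead of four, with the popular next hop 2 promoted to the root.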

Slide 131
Slide 131
Slide 132
Slide 132

So the algorithm itself, the code shown on the slide, is very basic. We will apply it during the exercise session as well, and that's really the best way to see how it works: play a little bit with it on a few examples. You can see, in the third pass, this inherited variable, which holds the currently inherited next hop. When we initialize it, the inherited next hop is the one of the root; it can then change, of course, as we move down the tree, and it is kept in this variable. You really need to execute the algorithm once yourself to see how it works, but I can assure you it's not complicated at all.

Slide 133
Slide 133

So this algorithm is optimal, but its effectiveness will depend on a few things.

Slide 134
Slide 134

The first thing it depends on is how many next hops you have. If you have a router with, say, 256 interfaces, i.e. 256 next hops, you have less chance to aggregate than with a router with only two next hops, because the more next hops entries have in common, the more opportunities you have to aggregate. That's the first condition. The second condition is how aggregatable the prefixes are in the first place: whether you have many adjacent prefixes that can be filtered out, or not. In the end, these two conditions determine the reduction factor you get.

Slide 135
Slide 135

So it's optimal, but then comes the trade-off, as usual. The trade-off is that it does not work when it comes to updates. Once you have compressed the table, you cannot update the compressed table directly: whenever there is an update, you need to recompress the entire thing in order to guarantee correctness. Let us look at an example.

Slide 136
Slide 136

So this is the original table over there; it is not compressed. As you can see, we have two /15s, one that forwards to B and one to A, and we also have a /16 that forwards to A. Now consider a simple update operation where we change the first /15 from sending to B to sending to Q. After the update, we end up with this table over here: Q, A, and A. That's the forwarding behavior.

Slide 137
Slide 137

Okay, now let's apply ORTC to this table. When you apply ORTC here, you should see that A is the most popular next hop: we have two As and only one B. So A, logically speaking, ends up at the root. You still need to preserve the semantics, so you also need a B entry. When do you forward to B? When the address starts with 01; this branch over there is the branch for which you forward to B. Otherwise, the rest goes to A. That's essentially what this aggregated table tells you. Now, what if I perform the exact same update on the aggregated table? I change this /15 from B to Q, so here I update this /15 entry from nothing to Q; that is the equivalent update performed on the aggregated table. I end up with this table over here. And here you should see that the table on the right-hand side and the table on the left-hand side are not semantically equivalent. Here, for instance, you go to Q when the address starts with 01, as we discussed, while there, you go to Q when it starts with 00. That's not good. You should also observe that here we never go to B, while there we still forward to B on 01. That's not good either. So clearly there are multiple things wrong in this table. This means you cannot apply update operations directly on the aggregated table: you need to apply them on the non-aggregated table and then re-run the algorithm. That's how you guarantee correctness.
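The same trap can be reproduced with an even simpler, Level-1-style compression; the prefixes and next hops below are illustrative, not the slide's /15s:

```python
import ipaddress

def lpm(table, addr):
    """Longest-prefix match; returns the next hop."""
    a = ipaddress.ip_address(addr)
    hits = [p for p in table if a in ipaddress.ip_network(p)]
    return table[max(hits, key=lambda p: ipaddress.ip_network(p).prefixlen)]

full       = {"10.0.0.0/8": "A", "10.1.0.0/16": "A"}
compressed = {"10.0.0.0/8": "A"}   # the /16 was filtered: same next hop

# Apply the same update to both tables: the /8 now forwards to Q.
full["10.0.0.0/8"] = "Q"
compressed["10.0.0.0/8"] = "Q"
```

After the update, `lpm(full, "10.1.2.3")` still returns "A", but `lpm(compressed, "10.1.2.3")` now returns "Q": the compressed table silently changed the behavior of the filtered /16. Hence the rule: update the uncompressed table, then re-run the compression.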

Slide 138
Slide 138
Slide 139
Slide 139

There are much more complex algorithms that also allow applying updates to the aggregated state. This is one of them, for instance. We won't see it in this lecture, but if you are interested, just know that these exist.

Slide 140
Slide 140

Okay, so now, for the last few minutes before the break, let me talk about global filtering, global aggregation. Here, I want to target not only the FIBs but also the RIBs, the routing tables, and I want to do that internet-wide, not routing table by routing table.

Slide 141
Slide 141

So this is a research paper we wrote in 2016. I love this paper. Being a research paper, it is not (yet?) deployed. But still, I want to give you insights into the technique. As you will see, it relates to isotonicity and routing algebras, so it's a nice connection point for later in the course.

Slide 142
Slide 142

So in this paper, we define DRAGON, a distributed route aggregation technique. DRAGON essentially allows a router to reason locally about the global effect it might have by filtering out some routes. That's the crux of this entire paper: we allow a router to reason about what the global effect would be of doing things locally.

Slide 143
Slide 143
Slide 144
Slide 144

And the way it works is by having routers compare routes for different prefixes, and then figure out automatically whether they can remove the child prefixes. A DRAGON router will receive messages, i.e. BGP routes, from its neighbors. For instance, here we receive a /8 and a /24; observe that the /24 is a child of the /8. We refer to them as p and q in the rest of the slides. The DRAGON-enabled router here will figure out, according to some rules, that the child can be filtered out. That means two things: first, not putting it in the forwarding table; and second, not advertising the /24 to any neighbor. As you can see, this clearly has effects on the rest of the network, so of course we want to make sure that, as we do this, no terrible effects happen in the internet.

Slide 145
Slide 145

So the cool thing with DRAGON is that it really aligns with how road networks work. I told you: if you think about signs to HB (Hauptbahnhof), I don't need a sign to Zurich Hauptbahnhof when I'm in Basel. When I'm in Basel, I want a sign to Zurich; and when I arrive in Zurich, I want a sign to Zurich Hauptbahnhof. This is exactly what DRAGON enables you to do. The sign to Hauptbahnhof, think of it as the route for the /24. You want the route for the /24, which is a very specific route, to be visible in a small vicinity around the origin AS. As you go further and further, you want it to be filtered out, relying instead on the parent prefix, for instance the /16 here. The /16 in my example would be the sign towards Zurich. And everyone else in Europe, and probably outside of Europe, just needs a route that says Europe goes right, and then, within Europe, towards Switzerland.

Slide 146
Slide 146
Slide 147
Slide 147

So DRAGON does that in a way that preserves routing and forwarding decisions, because we want correctness, of course. We don't want to end up with the internet being incorrect or losing stability.

Slide 148
Slide 148
Slide 149
Slide 149

So we already talked at length about forwarding consistency. So there is strong forwarding consistency, weak forwarding consistency, that DRAGON supports as well.

But in this lecture we talk about a slightly looser consistency property: routing consistency. With routing consistency, I relax my consistency criterion: I allow routers to change their forwarding next hop after we run DRAGON. What I want to preserve, though, is the routing attribute. If you were using a customer route to reach the /24 prior to filtering, I want you to have a customer route to the /16 after filtering; I don't want you to start sending to a provider post-filtering. If you are interested, we also show in the paper how you can preserve forwarding consistency, but that requires more technical background.
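As a very rough sketch of the kind of local check this enables; this is my simplification of the intuition, not the paper's actual condition (which is stated in terms of isotone routing algebras):

```python
# Route classes in order of preference, best first (Gao-Rexford style).
RANK = {"customer": 0, "peer": 1, "provider": 2}

def may_filter_child(parent_class, child_class):
    """Filter the child prefix q only if the route for the parent p is
    at least as preferred: anyone who then falls back on p keeps a
    route of the same or a better class (route consistency)."""
    return RANK[parent_class] <= RANK[child_class]
```

A router holding a customer route for the /16 may filter a provider route for the /24; the reverse would downgrade traffic from a customer route to a provider route, violating route consistency.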

Slide 150
Slide 150
Slide 151
Slide 151

So let's consider this simple internet. We have nine ASes, with one router per AS.

Slide 152
Slide 152

As always, we have some ASes acting as providers and customers. By convention, I will draw the provider over the customer, and join them by a straight line. So U7 is the provider of U9, U6 is the provider of U9, U4 is the provider of U7, etc. You get the idea. You can also see here that U1 and U2 are peers.

Slide 153
Slide 153

As in the internet today, provider ASes tend to advertise (shorter) parent prefixes, and customer ASes tend to advertise (longer) child prefixes. For instance, we assume that U7 advertises a /16 (we call it p), and AS U9 a /24 (we call it q), which is contained inside the /16.
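The parent/child relationship between the two prefixes can be checked mechanically. Here is a minimal sketch using Python's `ipaddress` module; the concrete prefixes are made-up documentation ranges standing in for the lecture's p (/16) and q (/24), since the slides don't give actual addresses.

```python
import ipaddress

# Hypothetical prefixes standing in for the lecture's p and q.
P = ipaddress.ip_network("203.0.0.0/16")    # parent prefix, advertised by U7
Q = ipaddress.ip_network("203.0.113.0/24")  # child prefix, advertised by U9

# q is a child of p iff q's address range is contained in p's
# and q is strictly more specific (a longer prefix).
print(Q.subnet_of(P))             # True: the /24 is covered by the /16
print(Q.prefixlen > P.prefixlen)  # True: the child is more specific
```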

Slide 154
Slide 154
Slide 155
Slide 155

With DRAGON we want to preserve route consistency. Here, we'll only care about two attributes: "learned from a customer", which I will denote in blue, and "learned from a provider", which I will denote in red. Of course, you prefer customer routes over provider routes. The export rules you know already: customer routes are advertised to everyone, while provider routes are only advertised to customers. So this is all classical, nothing fancy.

Slide 156
Slide 156

Now let us start to look at the routing state for q. q is advertised by U9 to U6 and U7. For U6 and U7, these are customer routes, since they come from a customer (U9), so they are colored in blue.

Slide 157
Slide 157

U6 then propagates this route to U1 and U8, and U7 propagates it to U4 and U2. U1, U2 and U4 are also colored blue, because they learn customer routes, while U8 learns a provider route, which is why it is in red.

Slide 158
Slide 158

Then we have the third round of propagation: U3 learns a customer route from U4, and U5 learns a provider route from U1.

Slide 159
Slide 159

And that's it: this is the final routing state for q.
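The propagation we just walked through can be simulated in a few lines. The sketch below encodes the toy topology as described on the slides and assumes the standard Gao-Rexford model: customer routes are exported to everyone, while peer and provider routes are exported only to customers, with the preference customer > peer > provider. (The peer attribute is included for the U1-U2 link, even though no elected route ends up using it here.)

```python
# provider -> set of its customers, as drawn on the slides
customers_of = {
    "U1": {"U6", "U3", "U5"},
    "U2": {"U7", "U3"},
    "U3": {"U4"},
    "U4": {"U7"},
    "U6": {"U9", "U8"},
    "U7": {"U9"},
}
peers = {("U1", "U2")}

# Build the undirected neighbor map from the business relationships.
neighbors = {}
for prov, custs in customers_of.items():
    for c in custs:
        neighbors.setdefault(prov, set()).add(c)
        neighbors.setdefault(c, set()).add(prov)
for a, b in peers:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Lower rank = more preferred.
RANK = {"origin": 0, "customer": 1, "peer": 2, "provider": 3}

def learned_as(u, v):
    """The attribute v assigns to a route it hears from neighbor u."""
    if u in customers_of.get(v, set()):
        return "customer"
    if v in customers_of.get(u, set()):
        return "provider"
    return "peer"

def propagate(origin):
    """Iterate route announcements to a fixpoint; return each AS's best attribute."""
    best = {origin: "origin"}
    changed = True
    while changed:
        changed = False
        for u in list(best):
            attr = best[u]
            for v in neighbors[u]:
                # Export rule: customer (and own) routes go to everyone,
                # peer/provider routes only to customers.
                if attr not in ("origin", "customer") and v not in customers_of.get(u, set()):
                    continue
                new = learned_as(u, v)
                if v not in best or RANK[new] < RANK[best[v]]:
                    best[v] = new
                    changed = True
    return best

# q is originated by U9: U8 and U5 end up "provider" (red),
# all the other ASes end up "customer" (blue), matching the slides.
colours = propagate("U9")
print(colours)
```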

Slide 160
Slide 160

So now we do the same thing for p.

Slide 161
Slide 161

p, in turn, is not advertised by U9 but by U7. We apply the same kind of trick: we propagate the prefix and look at who ends up blue and who ends up red. Here, as you can see, U9 ends up red, because it learns the prefix from a provider. That also means U9 won't advertise it to U6: provider routes are only exported to customers, as there is no economic incentive to do otherwise. The prefix then propagates up through the upper part of the topology, where ASes learn customer routes, and finally it propagates back down, so U5, U6 and U8 learn provider routes.

Slide 162
Slide 162
Slide 163
Slide 163
Slide 164
Slide 164
Slide 165
Slide 165
Slide 166
Slide 166
Slide 167
Slide 167

Okay, so we have the state for q on the left and the state for p on the right. Something you should see directly is that many nodes have the same colour for both. That's good, because it means that technically these nodes could filter out q and rely on p, while keeping routing consistency.

Slide 168
Slide 168

And then there are a few nodes that have different colours for q and p: blue for one, red for the other. These cannot filter, because if they did, they would clearly be worse off.

Slide 169
Slide 169
Slide 170
Slide 170

So the big question that DRAGON asks, and that we asked in this paper, is: what if all the nodes that have the same colour for p and q were to filter? (We call them PR nodes in the paper; the exact name doesn't really matter.) What happens then? It turns out that it works: if all these PR nodes filter, you actually end up with a state that is route consistent.

Slide 171
Slide 171
Slide 172
Slide 172

So we need a starting point, one node to kick-start the process. It doesn't matter which node you pick; here I will just pick U4. From now on I'm drawing the merged state, as you can see in the legend: on the left you have the state for prefix q, and on the right the state for prefix p. I've just combined the two images into one to make it a bit more visual.

Slide 173
Slide 173

U4 is blue on both sides, left and right, so it can filter. Let's say it filters out prefix q. Filtering out means you stop advertising q to your neighbours, so U3 now stops receiving prefix q from U4.

Slide 174
Slide 174

For U3, that was the only customer route it knew to reach q. See that? The other two routes it knows for q come from its providers, U1 and U2. So the fact that U4 filters prefix q makes U3 switch to a provider route for q.

Slide 175
Slide 175
Slide 176
Slide 176

So filtering alone does not help everyone: if you are U3, you are being screwed. Because your customer is filtering out the more-specific prefix, you now have to rely on your provider routes. That's bad, so we should not stop here. Let's continue.

Slide 177
Slide 177
Slide 178
Slide 178

U3, I would argue, has an incentive to filter as well. Why? Because its route for q is now a provider route, while its route for p is a customer route. If it filters out the q route, that is, deletes it, it will simply forward packets according to the /16, which it reaches via a customer route. So for U3 the incentive is double: you filter, so you use less space in the forwarding table, and by filtering, you retrieve a customer route instead of using a provider route.

Slide 179
Slide 179
Slide 180
Slide 180

So doing this, you save twice. In the paper (not in the lecture) we actually show that everyone filtering is a Nash equilibrium: everybody has an incentive to filter, and once everyone filters, nobody has an incentive to deviate. This kind of brings everyone on board. And again, when you filter, you retrieve a customer route, so instead of going through a provider, you go through a customer. That's great. And on the side, you also win space in the forwarding table.

Slide 181
Slide 181

Let me now give you a super basic algorithm that you can run on any router to make it DRAGON-enabled. Consider a node (router) u, a child prefix q and its parent prefix p. The rule u applies to figure out whether it can filter is the following: if u knows a route for p, and the elected q route is greater than or equal to the elected p route, then u filters its q routes. You have to understand "greater than" as meaning less preferred in this context: if the elected q route is less preferred than, or equally preferred to, the elected p route, then you filter the q routes.
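The rule just stated is a one-liner once attributes are encoded as preference ranks. A minimal sketch, where a larger rank means a less preferred route (the peer attribute is included for completeness, though the lecture's example only uses customer and provider):

```python
# Larger rank = less preferred, matching the slide's "greater than" convention.
RANK = {"customer": 1, "peer": 2, "provider": 3}

def should_filter_q(elected_q, elected_p):
    """Return True if router u may filter its routes for the child prefix q.

    elected_q / elected_p are the attributes of u's best routes for q and
    its parent p, or None if u has no route for that prefix.
    """
    if elected_p is None:   # no parent route: filtering q would black-hole it
        return False
    if elected_q is None:   # nothing to filter
        return False
    # Filter when the q route is equally or less preferred than the p route.
    return RANK[elected_q] >= RANK[elected_p]

print(should_filter_q("customer", "customer"))  # True  (same colour: safe to filter)
print(should_filter_q("provider", "customer"))  # True  (filtering upgrades u to a customer route)
print(should_filter_q("customer", "provider"))  # False (filtering would downgrade u)
```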

Slide 182
Slide 182
Slide 183
Slide 183

That's it. As I said, this is extremely simple to write down. The difficulty, of course, is to prove its correctness. This is something we do at length in the paper and that I won't be able to do in the lecture, but if you are interested, please check the paper. Just the gist of the idea: we prove the theorem that no matter the order in which the routers run the algorithm, a route consistent state is eventually reached. That rests on two other results. First, we show that for every router u, the elected q route (the route you had for the /24 in my example) can only worsen as other nodes around you filter; that gives you the incentive to start filtering as well. Second, when you are in a position where your q route is strictly better than your p route, that q route won't change. Putting these two things together allows us to prove that, no matter what happens, sooner or later we reach a route consistent state.

Slide 184
Slide 184
Slide 185
Slide 185
Slide 186
Slide 186
Slide 187
Slide 187

The reason this works rests on an assumption: isotonicity. If you remember, I talked about isotonicity in the first lecture, and it's really based on that. The idea of isotonicity, again, is this: if I am an AS and I prefer route A over route B, then you, my neighbor, would also prefer route A over route B. Preferences between routes don't change as the routes propagate through the network. And so if I am an AS and I decide to filter out the q route because it is, for instance, worse than my p route, you would have made the same decision in my shoes, because I'm assuming isotonicity. Isotonicity is really hidden, but under the hood it's there. Just know that it is needed for optimality, not for correctness, which is something we also explain in the paper. What that means is that if you have policies that are not isotone, it's not that we create black holes or anything like that; we just lose optimality, so we might end up not filtering as much as we could. Not losing correctness is good when you want to get the paper accepted.

Slide 188
Slide 188

A quick word about performance: it is extremely good. We show with this technique that you can filter out between 50% and 80% of the internet forwarding table. The 50% is what you get without introducing aggregation prefixes. What I mean by this is that if you look at the internet routing table and at the children prefixes, roughly 50% of them have a parent. That makes them eligible for DRAGON, because there is a parent you can rely on after filtering them out. But for the remaining 50% of the prefixes, there is no parent, no less-specific prefix covering them, so you can't do anything: if you filter them, you create a black hole.
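The "has a parent" statistic can be computed directly from a routing table. A minimal sketch on a toy table (the prefixes are made-up documentation ranges, not real internet data): a prefix is eligible for filtering only if some other, less-specific prefix in the table covers it.

```python
import ipaddress

# Toy "routing table" with made-up documentation prefixes.
table = [
    ipaddress.ip_network("203.0.0.0/16"),
    ipaddress.ip_network("203.0.113.0/24"),   # has a parent (the /16 above)
    ipaddress.ip_network("198.51.100.0/24"),  # parentless
    ipaddress.ip_network("192.0.2.0/24"),     # parentless
]

def has_parent(q, table):
    """True if some other prefix in the table covers q."""
    return any(q != p and q.subnet_of(p) for p in table)

filterable = [q for q in table if has_parent(q, table)]
print(len(filterable), "of", len(table), "prefixes have a parent")  # 1 of 4
```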

Slide 189
Slide 189
Slide 190
Slide 190
Slide 191
Slide 191

We also have an extended version of DRAGON, in which we can introduce aggregation prefixes that then act as parents for the currently parentless children. That is what allows us to go beyond the 50%, because now we can have, for instance, a /8 acting as parent for a /24. I won't go into the details here, but just so you know, this is where the 80% comes from: by relying on aggregate prefixes that act as parents for the parentless prefixes, you go from 50% (roughly the maximum you can reach by filtering almost everything that has a parent) to 80%, or 79% to be precise. That, you could argue, is a big saving. There is a lot of waste in today's internet forwarding table.
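The effect of introducing an aggregate can be illustrated on a toy table. The sketch below (again with made-up documentation prefixes) starts with two parentless /24s, then injects a covering /23 as a synthetic aggregate, making both children filterable:

```python
import ipaddress

# Two parentless children (made-up documentation prefixes).
table = [
    ipaddress.ip_network("198.51.100.0/24"),
    ipaddress.ip_network("198.51.101.0/24"),
]

def covered(q, table):
    """True if some other prefix in the table covers q."""
    return any(q != p and q.subnet_of(p) for p in table)

print(sum(covered(q, table) for q in table))  # 0: nothing can be filtered yet

# Inject an aggregate prefix that covers both children...
table.append(ipaddress.ip_network("198.51.100.0/23"))
print(sum(covered(q, table) for q in table))  # 2: both /24s are now filterable
```

The aggregate itself must of course still be advertised, so the net saving is two entries removed for one added, and it grows with the number of children the aggregate covers.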

Slide 192
Slide 192
Slide 193
Slide 193