Fast convergence

Prof. Laurent Vanbever, Viola Zucchi (Scribe 1), Jonas Schmid (Scribe 2)

Let us now speak about fast convergence, i.e. how do we converge quickly when we lose connectivity upon a sudden, unplanned failure? Our goal here will be to enable even large networks to converge within 1 second.

Slide 1
Slide 2

Convergence is the distributed process of transitioning from a network-wide routing and forwarding state to another network-wide routing and forwarding state.

Slide 3

Convergence can be unplanned, or sudden, e.g. upon link or node failures, or it can be planned, e.g. you choose to upgrade a link tomorrow at 10am: at that point, you need to shut down the link, install the new one, and then turn it back on. When the failure is planned, you can of course plan ahead. For instance, you can preemptively move the traffic away from the corresponding link (this is known as "draining" the link).

Slide 4

During convergence, different routers might have inconsistent views of the network. This is usually a recipe for (transient) disasters.

Slide 5

Whenever different routers believe that the network looks different, they might make decisions that are inconsistent with one another. And it can be hard to work out where the anomaly is, etc. So, whenever you deviate from the initial or final forwarding state, problems can happen.

Slide 6

These problems/anomalies, you know them well by now. We have blackholes first, which are annoying because traffic is lost but, at least, it is lost right away. Then we have forwarding loops, which are even more annoying: as with blackholes, traffic is eventually lost, but not right away, which means that this looping traffic needlessly consumes resources.

Slide 7

Let's consider a simple example of convergence in this small network with 6 routers running OSPF.

Slide 8

Let us look at what happens to paths from everyone to RF when there is a failure of the link RC-RE. This link failure does not disconnect the graph, meaning that routers can retrieve connectivity by recomputing their forwarding paths. To do so, each router will first have to learn about the failure and update its internal representation of the topology, then run Dijkstra to recompute its shortest-path tree, and finally (possibly) update its forwarding state accordingly. The question is: how long will it take for the routers to do all this and retrieve connectivity following the failure?

Slide 9
Slide 10
Slide 11
Slide 12

Let us put all the events on a time axis. At some point $t$, the failure happens. The link dies. After some time, RC and RE will detect the failure. RC will then start flooding routing updates to its neighbors, recompute its shortest path, and then switch its next hop to RA. After some time, RA will receive the update, also rerun Dijkstra, and then update its state to forward to the right, to RB. After even longer, RB will also receive the message from RA, re-compute Dijkstra, and then start pushing traffic down to RD.

Slide 13

The first thing to note is that during the time immediately following the failure and before RC switches to RA, we have a black hole. The link is not there anymore, and RC is dropping all the packets towards RF on the floor. Then, the moment RC starts to push the traffic up, a forwarding loop appears. This forwarding loop will last until both RA and RB update their forwarding state. Here, you will have two loops appearing: one between RA and RC, and then another one between RA and RB.

Slide 14
Slide 15

So essentially during the time when the failure happens and RB updates, you have traffic losses due to two distinct reasons: black holes and forwarding loops.

Slide 16

There are four main sources of delay here. You first need to detect the failure; then you need to communicate about the failure to your neighbors; then you need to recompute your forwarding state; and then, finally, you need to update your forwarding state.

Slide 17

The first three take on the order of milliseconds. The last one, updating the forwarding state, is annoying because its duration scales with the number of prefixes. As I told you, this can be very slow, like hundreds of seconds slow.

Slide 18
Slide 19

Coming back to the measurements we discussed before, we see that, in the worst case, we observed traffic loss for up to 2.5 minutes. That particular router took 250 microseconds to update one prefix; multiply this by half a million prefixes, and you end up in that ballpark (a couple of minutes).
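As a sanity check on these orders of magnitude, the arithmetic can be spelled out (using the per-prefix cost and prefix count quoted above; note that exactly half a million entries at 250 µs each gives roughly two minutes):

```python
# Back-of-the-envelope FIB update time, using the figures from the
# measurement discussed above: ~250 microseconds per prefix update,
# ~500k prefixes in a full routing table.
per_prefix_update_s = 250e-6
num_prefixes = 500_000

total_s = per_prefix_update_s * num_prefixes
print(f"{total_s:.0f} s ≈ {total_s / 60:.1f} minutes")
```

The lesson is that per-prefix update cost, multiplied by a full routing table, easily dominates every other source of convergence delay.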

Slide 20
Slide 21
Slide 22
Slide 23
Slide 24

Here is another set of stats about the number of failures you can expect to see in large networks with thousands of devices and tens of thousands of links. As you can see, operators observed around 40 link failures and 5 node failures on average. So, every single hour, you can expect more than one link failure in a large network. Clearly, how fast the network reacts matters.

Slide 25

Four main sources of delay mean four main sources of improvement.

Slide 26

The first improvement is speeding up the time routers take to detect failures.

Slide 27

What do you want from a failure detector besides being fast? Well, you also want it to be both accurate and lightweight. As with many things in distributed systems, it is hard to get all three at the same time. You can have something which is fast and lightweight, but it's not accurate at all. You can have something which is accurate and lightweight, but it's not fast, etc. There is a design space, and you have to adapt your detector to your needs.

In the following, I will talk about two simple mechanisms. The first one is to rely on the physical signals that come from hardware, whenever you can. And then the second one is to rely on either software-based (or hardware-based) beaconing.

Slide 28

Some physical/link-layer technologies offer hardware-based failure detection by tracking the signal level and reporting any changes in it. This is typically the case when you are using optical fibers, for instance, like SONET, SDH, etc. All of these are optical fiber technologies which monitor the light signal, and whenever light disappears (this is known as a "loss of light" event), an alert is triggered to the control plane, which then initiates the convergence process.

Slide 29
Slide 30

What are the pros and cons of relying upon hardware-based signals? Well, the big pro is that it's fast, as fast as can be generally. Think a few milliseconds. It's also lightweight, as the hardware needs to track the signal anyway, so detecting losses is not a big overhead.

The con is typically the coverage. It only detects local failures, and only if you have the right physical-layer technology. In this example, for instance, you have two routers, RA and RB, that are connected through a switch using optical fiber connections. Suppose the link between RB and the switch fails. RB will see it, but RA won't, because from RA's viewpoint, its connection is with the switch, and that connection is totally fine. There is still signal/light over there.

Bottom line, these are great signals to track but you cannot solely rely on them.

Slide 31

The other way to do it is to use software- or hardware-based beaconing.

Slide 32

Beaconing is as simple as you would expect. You have your routers exchange beacons. (We often talk about hello messages in networking contexts.) And you have them exchange them at a given frequency. Whenever they miss $k$ of them in a row (where $k$ is a configurable parameter), they declare the failure and start converging.

By default, each routing protocol comes with its own beaconing system. BGP has one, OSPF has one, IS-IS has one, etc. For each of them, you can configure two things: the time separating two beacons, i.e. the frequency, and the interval after which a failure is declared, i.e. how many misses are allowed before triggering. The first one is typically known as the hello interval, while the second one is known as the dead interval.
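The hello/dead-interval mechanism can be sketched in a few lines (an illustrative model, not a real OSPF or BFD implementation; the class name and timer values are assumptions for the example):

```python
import time

class BeaconDetector:
    """Toy sketch of protocol beaconing: a peer is declared dead once
    no hello has been received for a full dead interval."""

    def __init__(self, hello_interval: float, dead_interval: float):
        self.hello_interval = hello_interval
        self.dead_interval = dead_interval
        self.last_hello = time.monotonic()

    def on_hello(self, now: float) -> None:
        # Record the arrival of a beacon from the peer.
        self.last_hello = now

    def peer_alive(self, now: float) -> bool:
        # The peer is considered alive while the gap since the last
        # received hello is shorter than the dead interval.
        return (now - self.last_hello) < self.dead_interval

# Aggressive, BFD-like timers: 50 ms hello interval, 150 ms dead interval.
d = BeaconDetector(hello_interval=0.050, dead_interval=0.150)
t0 = time.monotonic()
d.on_hello(t0)
print(d.peer_alive(t0 + 0.100))  # True: a hello was missed, but gap < dead interval
print(d.peer_alive(t0 + 0.200))  # False: gap exceeds the dead interval
```

With these timers, a failure is detected after at most three missed hellos, i.e. within 150 ms.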

Slide 33

Here are default values for some common protocols on Cisco devices. As you can see, we are talking about (many) seconds. So you most likely want to lower these default values as one of the first steps when you configure a router.

Of course, increasing the beaconing frequency has a cost when it comes to CPU usage. Just think about BGP routers, some of them can have hundreds of BGP neighbors. So there can be a lot of processing time involved there.

Slide 34

Again let us discuss pros and cons. One big pro is that it works on all platforms and it detects a wide range of failures. Indeed, in order for a router to send a beacon to another router, you need the beacon to cross the control plane, go into the dataplane, be forwarded by the forwarding table, go across the link, any possible intermediate device (such as a switch), hit the other router, climb up from the dataplane into the control plane, all the way to the corresponding routing application on the other router. So we are testing a huge amount of components there. With the first technique, where we were only tracking the loss of light, we were only looking at a small piece of the puzzle.

Slide 35

The cons? Well, for one, it tends to be slower, and there is a lot of overhead. The overhead comes from the cost of increasing the frequency, but also from the fact that, by default, each protocol uses its own beaconing system, which is of course quite silly. If you have two routers that are both OSPF and BGP neighbors, you don't need two processes sending beacons across the common link to realize it is dead; one would suffice.

Slide 36

The solution to having each protocol "beacons" on its own is to use a protocol-agnostic beaconing service known as Bi-directional Forwarding Detection (BFD). Routers supporting BFD only run one BFD beaconing session and routing protocols then subscribe to BFD notifications.

Slide 37

The cool thing with BFD is that it can also run in hardware. After all, if you think about it, generating and keeping track of beaconing messages is not complicated and can easily be supported in ASICs. That means you can run very aggressive intervals without bothering the CPU. Typical values here are a 50-millisecond hello interval and a 150-millisecond dead interval.

Slide 38

What are the pros and cons? Pros: it is almost as fast as relying upon physical signals (when running in hardware), it also has low overhead (when running in hardware), and it tests a wider range of failures. Cons: it requires router support, which is not always there.

Slide 39

To sum up what we have seen about failure detection, whenever you have support from the hardware, use it. Whenever you have support from BFD, use that as well. They are complementary. Only fall back to protocol-specific beaconing as a last resort.

Slide 40

Let us now talk about how to speed up the second source of delay: communicating about the failure.

Slide 41

What we are after here is two things really: we want to quickly generate the messages telling everyone that a link is dead, and we want these messages to propagate quickly inside the network.

Slide 42

Both problems are easy to solve. For the first problem, we need to ensure that the software processes that are responsible for generating routing messages have absolute priority over the others. We also want the ability to preempt other processes that might be running. This way, if a management process (e.g. one that collects statistics) is running, it can be paused for the process generating routing messages to execute.

Slide 43

Once the messages are out, we don't want them to be queued up into buffers because there might be congestion inside the network. The solution here is again simple. Routing messages are tagged with specific labels and routers/switches are pre-configured to give absolute priority to these packets. So, if there is congestion inside the network, these routing messages will be able to bypass queues.

Slide 44

One important note though: you do not need the entire network to learn about the failure to retrieve connectivity. It really depends on where the connectivity is lost and what the final path is.

Think about this network, for instance, in which the link between R4 and R5 fails. In blue, you can see the original path, and in red the final one. The final path essentially goes the other way around. Observe that R2 and R1 do not change their forwarding state between the initial and the final state. This means that they don't really care about the failure in this case.

In contrast, here R3 and R4 are the critical routers. These are the routers that need to learn the failure and to adapt their forwarding state in order to retrieve connectivity.

Slide 45
Slide 46

This is true towards R5, but the same thing also happens towards R4 with different routers. There, it is R1 and R5 that are critical.

Slide 47

What about the third source of delay now? How do we recompute forwarding state quickly?

Slide 48

Well, we already talked about this in the context of internal routing. There I talked about things like iSPF, various optimizations to the Dijkstra algorithm, etc. I won't go over those again. Instead, I want to speak a little bit more about fast computation when it comes to BGP.

Slide 49
Slide 50

If you think about BGP versus link-state routing from a computation viewpoint, two bottlenecks quickly emerge. The first one is that, in BGP, computations are prefix-based. In OSPF, computations are router-based (OSPF routers compute shortest paths towards other routers). This is very annoying when we end up with millions of prefixes.

A second problem in BGP is that routers often do not have the full visibility of the routing state available and need to wait for it to propagate network-wide. BGP is indeed great when it comes to hiding paths, especially if you start to use route reflection and whatnot. In many instances, routers are stuck waiting to learn about the new routes, and this slows things down tremendously.

Slide 51

Take this simple network as an example. We have an AS connected to a provider network through two sessions, a primary and a backup. Over both sessions, the AS learns 1.1 million prefixes.

Slide 52

Here, R1 is the primary egress, while R5 is the backup one. So R1 is configured to attach a higher local pref (say 200) to the routes it learned from the provider, while R5 is configured to attach a strictly lower local pref (50) on the routes it learns from the provider.

Slide 53

Let us now consider what happens when the link between R1 and the provider fails. Before the failure, you should note that all the internal routers (R2, R3, and R4) only know about the routes announced by R1, even though the iBGP sessions form a full mesh. This is because, for R5 too, the best route is the one that goes via R1, meaning R5 will not announce its external routes internally. (Remember: in BGP, each router propagates only one route per prefix.) In the end, prior to the failure, no one except R5 knows that there is a way to exit the network via R5.
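This route-hiding behavior can be sketched in a few lines (a deliberately simplified model of BGP: best route = highest local preference, and only the best route is announced internally; the prefix and names follow the example):

```python
# R5 knows two routes for the same prefix: the one learned internally
# (egress R1, local-pref 200) and its own external one (local-pref 50).
routes_at_R5 = [
    {"prefix": "10.0.0.0/8", "egress": "R1", "local_pref": 200},
    {"prefix": "10.0.0.0/8", "egress": "R5", "local_pref": 50},
]

# Best route = highest local preference (simplified decision process).
best = max(routes_at_R5, key=lambda r: r["local_pref"])
print(best["egress"])  # 'R1': R5's own external route stays hidden

# After R1 withdraws its routes, only the local-pref-50 route remains,
# and R5 finally starts announcing it internally.
remaining = [r for r in routes_at_R5 if r["egress"] != "R1"]
best_after = max(remaining, key=lambda r: r["local_pref"])
print(best_after["egress"])  # 'R5'
```

The switch from the first to the second `best` is exactly the (slow, per-prefix) transition described in the following slides.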

Slide 54
Slide 55
Slide 56

When the inevitable failure happens, R1 first needs to detect it; then it sends withdraw messages for these 1.1 million prefixes. These withdrawal messages then have to propagate towards all the neighbors, including R5.

Slide 57

Upon receiving these withdrawals, R5 will remove the routes from its routing table and, doing so, realize that all these external routes with a local pref of 50 are now best routes. After that, R5 will start sending updates for 1.1 million prefixes internally. As you may imagine, this process takes a lot of time.

Slide 58

Multiple solutions exist to solve this problem. One of them is to force R5 to advertise these "best external routes" inside the network in order for all routers to learn the backup routes. (Observe that this requires making routers behave in a non-standard way.) Observe that no one, including R5, will actually use any of these routes, because they all know other routes with local pref 200.

Slide 59

When the failure happens and routers receive the withdrawals from R1, they can immediately switch to the route via R5, without having to wait for R5 to announce the backup routes. This is where time is saved.

Slide 60

Multiple BGP extensions support these extra advertisements. One of them is ADD-PATH, another one is Best External. Both work similarly, as shown before, and allow routers to propagate backup routes inside the network.

When you have more backup routes, you need less communication once there is a failure. Of course, you also require more memory. There is no free lunch. And so there it is, once again: the classical trade-off in networking between speed and memory.

Slide 61

Let me now finally speak about the "pièce de résistance", which is the update of the forwarding state.

Slide 62

As I said multiple times now, updating the forwarding state tends to be the key bottleneck in many situations.

Slide 63

Fundamentally, the problems we have come from the fact that our forwarding tables are flat: they simply map (many) destination prefixes to few physical next-hops. And whenever there is a change to one of these physical next-hops, many entries (prefixes) need to be updated.

In order to solve this problem, we need to re-organize the (typically hardware-based) forwarding tables, making them hierarchical so as to support fast incremental updates. Observe that this re-organization tends to be "good for business" for router vendors, as it gives them an extra motivation to sell new boxes.

Slide 64

Having hardware that allows for fast incremental updates is only one piece of the puzzle though. The other pieces include pre-computing the backup forwarding state (for a subset of the possible failure scenarios) and pre-loading the backup forwarding state into the forwarding tables. This means that the forwarding tables will now contain both the primary state alongside with backup states. Upon a failure of a particular resource (say a link), the corresponding backup state (if pre-loaded) is then quickly activated. Ideally, this activation is done with a single bit flip.

Slide 65

In the following, we'll speak about how we can do that for both intra-domain and inter-domain routing. For intra-domain routing, we'll look at a technique called Loop-Free Alternates (LFA). For inter-domain routing, we'll look at another technique called Prefix-Independent Convergence (PIC).

Slide 66
Slide 67

An LFA for a router $x$ is a physical neighbor of $x$ that $x$ can deviate the traffic to for a destination $p$ without the traffic coming back to $x$. You can think of LFAs as backup neighbors. A router can immediately retrieve connectivity by sending the traffic to them, while everyone is busy propagating routing messages and recomputing Dijkstra.

Slide 68

Let us look at a first example. We have a basic topology with 3 routers (R1, R2, R3). Prefix $p_1$ is connected to router R2. We consider the perspective of R1 in the following.

If you are router R1, you should observe that router R3 does not use you to reach $p_1$. Because R3 goes straight to R2 using its direct link. We say that R3 is a per-prefix LFA for R1 for prefix $p_1$.

What is cool is that in a link-state IGP like OSPF, everybody knows the entire map of the network. So R1 can do this computation. You wouldn't be able to do that in BGP, because BGP routers do not learn any information about the internal topology.

Slide 69

This slide gives the exact condition that R1 can use to compute its LFA for each prefix. This condition states that, for a neighbor $N$ to be an LFA, the distance between $N$ and the destination prefix $p$ must be less than the distance between $N$ and R1 plus the distance between R1 and $p$. This guarantees that the neighbor will not send the traffic back via the router.

You can just plug in the numbers and see that the condition holds on the topology from our previous example. Here, the distance between R3 and $p_1$ is 1, which is clearly less than the distance between R3 and R1 (10) plus the distance between R1 and $p_1$ (1).
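The condition can be written as a small predicate (a sketch; `dist` stands for any function returning IGP shortest-path distances, and the numbers below are the ones from the example):

```python
# Loop-free alternate condition: neighbor N will not send traffic for
# prefix p back through source S iff dist(N, p) < dist(N, S) + dist(S, p).
def is_per_prefix_lfa(dist, neighbor, source, prefix) -> bool:
    return dist(neighbor, prefix) < dist(neighbor, source) + dist(source, prefix)

# Distances from the example: dist(R3, p1)=1, dist(R3, R1)=10, dist(R1, p1)=1.
d = {("R3", "p1"): 1, ("R3", "R1"): 10, ("R1", "p1"): 1}
print(is_per_prefix_lfa(lambda a, b: d[(a, b)], "R3", "R1", "p1"))  # True: 1 < 10 + 1
```

Since every router in a link-state IGP knows the full topology, R1 can evaluate this predicate locally for each of its neighbors and each prefix.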

Slide 70

By default, the computation of an LFA is done on a per-prefix basis. If you look at this topology, there are two prefixes: $p_1$ connected to R5 and $p_2$ connected to R3. Again, we consider the perspective of R1.

What is the shortest path from R2 to reach $p_2$? Well, the path R2-R1-R3 has a cost of 2, and R2-R4-R3 has a cost of 2. So, assuming that router R2 is load balancing, R2 is (in part) using R1 to reach $p_2$. This means R1 cannot protect against the link failure here by blasting traffic for $p_2$ to R2: R2 is not a per-prefix LFA for $p_2$.

But R2 is a per-prefix LFA for $p_1$. We see that the shortest path between R2 and $p_1$ is R2-R4-R5 with a cost of 6, which is smaller than R2-R1-R3-R5 with a cost of 7. So that means that R2 is an LFA of R1 for the destination $p_1$.

Slide 71

You can extend/generalize the notion of per-prefix LFA to that of a per-link LFA. The difference, as the name indicates, is that a per-link LFA is a neighbor that you can redirect the traffic to for any destination routed over a given link.

Slide 72

You can compute per-link LFAs using a very simple algorithm. Essentially, to compute a per-link LFA for link $l$, you look at all the destination prefixes you are using link $l$ for, and then you look for a neighbor of yours (hopefully there is one) such that, for all of these destinations, the neighbor is a per-prefix LFA. If there is such a neighbor, it becomes a per-link LFA.

The advantage of computing per-link LFAs is that, whenever a link fails, you can blast all the traffic crossing it to that neighbor, independently of the destination.
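The per-link LFA algorithm just described amounts to a set intersection (a sketch; the prefix names and the prefix-to-LFA map are illustrative assumptions):

```python
# A neighbor is a per-link LFA for a link iff it is a per-prefix LFA for
# *every* prefix routed over that link.
def per_link_lfas(prefixes_on_link, per_prefix_lfas, neighbors):
    candidates = set(neighbors)
    for p in prefixes_on_link:
        # Keep only neighbors that also protect this prefix.
        candidates &= per_prefix_lfas.get(p, set())
    return candidates

lfas = per_link_lfas(
    prefixes_on_link={"p1", "p2"},
    per_prefix_lfas={"p1": {"R2", "R3"}, "p2": {"R3"}},
    neighbors={"R2", "R3"},
)
print(lfas)  # {'R3'}: the only neighbor protecting every prefix on the link
```

This intersection also makes the coverage argument of the next paragraph concrete: the per-link candidate set can only shrink as more prefixes are added.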

Slide 73

We already saw an example here, but it should be clear from the definition that the coverage of per-link LFAs is lower than or equal to the coverage of per-prefix LFAs.

Some topologies are better than others when it comes to LFA coverage. A topology consisting of rings is not great because each router only has two neighbors. Topologies that are more meshed will see more alternative paths and so more potential for LFAs.

Slide 74

This is just an illustration of this: you can see here the LFA coverage in terms of the percentage of links, i.e. how many links can be protected through LFAs in different topologies. This varies quite a lot: some topologies are very bad, with only 15% of the links protected, while others reach around 2/3 of the links. So what distinguishes them? The two lowest numbers correspond to ring-based designs which, as I said, are not great, while the others are highly meshed, triangle-based designs, which offer many more possibilities.

Slide 75
Slide 76
Slide 77

I now want to speak about the notion of remote LFA, another extension that allows the coverage of LFAs to be extended a little bit.

Consider the ring topology depicted here, in which we look at what R1 can do in terms of LFAs towards R6 (to be precise, towards a prefix connected to R6). Similar to the previous example, R1 has neither a per-link nor a per-prefix LFA for R6. Why? R1 has only two neighbors: it uses the first one, R2, to reach R6, while the second one, R3, itself uses R1 to reach R6 (given the link weights, R3's shortest path to R6 goes up through R1 rather than down through R5). So R1 cannot blast the traffic for R6 to R3, because doing so would create a loop. This is where remote LFAs help. As the name indicates, they allow a router to use a remote node as an LFA, i.e. not only its direct physical neighbors, but also routers beyond them. Let me give you the intuition of how this works. We just explained that R3 is no good for sending traffic to R6, because that traffic would come back. But observe R3's shortest path to R5: it is the direct link (cost 15, cheaper than going the other way around). So R3 does not use R1 to go to R5, even though it does use R1 to go to R6. That is the first precondition of a remote LFA. The second precondition is that R5, to go to R6, uses its own direct link; it does not go back up either.

Slide 78
Slide 79

That is the second piece of the puzzle, because now, as you can see on the slide, we can encapsulate the traffic. Encapsulating means adding a new header onto the packet, and the destination in that new header will be R5 instead of R6. So if R1 encapsulates the traffic towards R5 and sends it to R3, R3 will happily push it down; it will not push it back anymore, because to reach R5, R3 goes down directly. R5 will then see that it is the destination of the outer header, remove that header, and forward the traffic according to the inner header, which says R6. And to reach R6, R5 goes straight down. As you can see, this allows you to create a shortcut, a repair tunnel, between R1 and R5, even though R5 is not directly physically connected to R1. Of course, you need this encapsulation capability; we will talk about MPLS, one technology to do that, in just a few minutes.

Slide 80

There are many, many ways to encapsulate traffic with a different header, but I hope you get the intuition here. Let me take a few questions.

Question: In the remote LFA example, if a different link fails, how does R1 know not to use the remote LFA? Otherwise the traffic could go to R5 and then come back again. Answer: You need to condition the usage of the LFA on the failure that is actually happening in the network. That is why you need more complex hardware: the router must also be able to check what is going on in order to do the right thing. Good question.

Question: Is the use of an LFA only triggered by failures of links directly connected to R1, or could it also be triggered by a remote failure, say of the link between R2 and R4? Answer: The activation of an LFA is a local decision, but the traffic that reaches R1 for R6 need not be locally sourced; imagine an entire network connected to R1. By activating the LFA, all the traffic that hits R1 takes the shortcut. As for the trigger: in principle it could be both, but R1 is far more likely to be the first to know when it is one of its own links that fails. Reacting to remote failures is not happening by default; that is more of a research project. You could imagine that if R2 and R4 are very slow to notify the others, R1 could still realize that there is an issue because traffic is hitting a backup path, e.g. by seeing TCP retransmissions. Technically, what R1 needs is to realize that there is a failure: either that comes from a physical signal, a loss of light, and then it triggers the LFA, or it comes from fancier mechanisms.

Question: Can the LFA be used all the time, even if multiple sources use it at the same time? Answer: Yes, that is the beauty of LFAs: the traffic is guaranteed not to loop, even if multiple sources use the LFA simultaneously. That is the nice property of this backup state. Again though, the coverage is not 100%: unless you design the network for it, an LFA might simply not exist when you need one.

Slide 81

How do you compute remote LFAs? You can look at the algorithm on the slide; let me just give you the gist of it using an example, it's easier. It really boils down to computing two sets and intersecting them. Say you are R1 and you want to compute your remote LFAs for the failure of the link R1-R2, protecting the traffic towards R6. You first compute the set $P$ of nodes that you can reach without going via the protected link. Here, these are R3 and R5: they are the only two nodes R1 reaches without using the link R1-R2 (R1 goes directly to R3 and reaches R5 via R3; for all the other nodes, R1 uses the link on the left). The second set, $Q$, is the set of nodes which reach the destination you are protecting, R6, without going via the link R1-R2. Here, this set is R2, R4, and R5. Then you simply intersect $P$ and $Q$. As you can see, the intersection contains a single node, R5: this is the only remote LFA available to protect the traffic towards R6. Again, this does not guarantee full coverage; the intersection can be empty. To give you an example where it is empty: if you change the link weight from 15 to 23, R3 starts using R1 to reach R5 as well, because going the other way around (1, 6, 10, 12, i.e. 22 in total) becomes cheaper. If you then recompute the two sets, their intersection is empty, which means there is no remote LFA, and since there is no local LFA either, there is no LFA, full stop. It all depends on the particular link weights and on the topology.
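The P/Q computation can be sketched with a small Dijkstra (a sketch: the exact link weights of the slide are not fully recoverable from the notes, so the ring weights below are a hypothetical reconstruction chosen so that the resulting sets match the example, P = {R3, R5} and Q = {R2, R4, R5}):

```python
import heapq

def dijkstra(adj, src):
    """Standard Dijkstra over an adjacency dict {node: {neighbor: weight}}."""
    dist = {src: 0}
    pq = [(0, src)]
    while pq:
        d, u = heapq.heappop(pq)
        if d > dist.get(u, float("inf")):
            continue
        for v, w in adj[u].items():
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(pq, (nd, v))
    return dist

def remote_lfas(adj, src, dst, link):
    """P = nodes src reaches as cheaply without the protected link;
    Q = nodes reaching dst as cheaply without it; remote LFAs = P ∩ Q."""
    a, b = link
    pruned = {u: {v: w for v, w in nbrs.items() if {u, v} != {a, b}}
              for u, nbrs in adj.items()}
    d_full_src, d_cut_src = dijkstra(adj, src), dijkstra(pruned, src)
    # Undirected, symmetric weights: dist(n, dst) == dist(dst, n).
    d_full_dst, d_cut_dst = dijkstra(adj, dst), dijkstra(pruned, dst)
    nodes = set(adj) - {src, dst}
    P = {n for n in nodes if d_cut_src.get(n) == d_full_src.get(n)}
    Q = {n for n in nodes if d_cut_dst.get(n) == d_full_dst.get(n)}
    return P, Q, P & Q

# Hypothetical ring weights (assumption, see lead-in).
W = {("R1", "R2"): 5, ("R2", "R4"): 6, ("R4", "R6"): 10,
     ("R6", "R5"): 12, ("R5", "R3"): 15, ("R3", "R1"): 1}
adj = {f"R{i}": {} for i in range(1, 7)}
for (u, v), w in W.items():
    adj[u][v] = adj[v][u] = w

P, Q, rlfa = remote_lfas(adj, "R1", "R6", ("R1", "R2"))
print(sorted(P), sorted(Q), sorted(rlfa))
# ['R3', 'R5'] ['R2', 'R4', 'R5'] ['R5']
```

With these weights, the only node in both sets is R5, i.e. the repair-tunnel endpoint of the example.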
But again, like the local LFA, the remote LFA is a simple condition that a router can check locally, without coordination, to know which neighbor can safely redirect the traffic. Okay, so that's it for LFA computation. Now, I want to mention how we reorganize the forwarding table to activate this LFA. There are multiple ways to do that; I will show just one. It is actually quite relevant that we will talk about P4 just after, because in P4 you can really program the forwarding table to support this, and it is very easy. The idea is, instead of having a single flat table in your hardware, to have multiple tables in sequence, loading the primary forwarding state and the backup forwarding state alongside one another. When packets enter the switch or the router, they are matched in a first table. Instead of attaching just a primary next hop to each packet, we attach both a primary next hop and a backup next hop. The packet then goes into a second table, where a second match is performed, not against the destination IP anymore, but against the next hop. In particular, the second table checks the status of the next hop: one means the next hop is alive, zero means the next hop is dead. Depending on whether the next hop is alive or dead, you use the primary or the backup next hop. This goes back to some of your questions: as you can see, I am only protecting against the failure of a single next hop at a time. If you want to survive two failures at the same time, you need to extend the second table to check the status of not one but two next hops. These are, again, trade-offs.

If you want to protect against more failures, you need to load more state into your tables, and that will cost you more. It is a design choice: what is the probability of such failures happening, versus the cost of loading all this state into your hardware? It is an optimization problem whose answer differs from network to network. Keep that in mind, because in P4, as you will see, it is quite easy to implement this.
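As a rough illustration of the two-table pipeline just described (in Python rather than P4, with invented prefixes and next-hop names, and a dictionary standing in for the hardware status bits):

```python
# Stage 1: destination prefix -> (primary next hop, backup next hop).
# Prefixes and next-hop names are made up for illustration.
FIB = {
    "10.0.0.0/8":     ("R2", "R3"),  # primary via R2, LFA backup via R3
    "192.168.0.0/16": ("R2", "R4"),
}

# Stage 2: next-hop liveness bit (1 = alive, 0 = dead),
# maintained by the local failure detector.
NH_STATUS = {"R2": 1, "R3": 1, "R4": 1}

def forward(prefix):
    """First match on the destination, second match on the next-hop status."""
    primary, backup = FIB[prefix]
    return primary if NH_STATUS[primary] else backup
```

When the detector flips `NH_STATUS["R2"]` to 0, every prefix whose primary is R2 instantly falls back to its preloaded backup, without touching the first table.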

Slide 82
Slide 82
Slide 83
Slide 83
Slide 84
Slide 84

So that was for LFA; that part is now over. I will quickly go through PIC, which is a pretty simple idea, and it relates to what I just said about structuring the forwarding tables.

Slide 85
Slide 85

So PIC, or prefix-independent convergence: the idea, as the name indicates, is to make a BGP router converge in a time that is independent of the number of prefixes it forwards traffic for. Whether the router forwards traffic for 10 prefixes or 1.1 million prefixes, upon a failure it should converge equally fast. That is the idea. The problem with BGP by default, without these fancier tables, is that the forwarding state is flattened out: the forwarding table is computed by merging the BGP RIB with the IGP RIB. The BGP RIB tells you, for each of the Internet prefixes, which egress to go to in order to reach it.

Slide 86
Slide 86

For instance, consider R1 here: for the 800k prefixes, R4 is the egress, so the BGP next hop, which is the egress, is R4 for all these prefixes. The router then performs a recursive lookup: how do I reach R4 locally? R4 is not a direct neighbor; the local neighbor I should use to reach R4 is connected via port 0, and port 0, as you can see, is connected to R2. The router then merges these two tables together and downloads into the hardware a flat table that maps each of the 800k prefixes to physical port 0. This means that upon a failure, for instance when the link between R1 and R2 fails and port 0 must change to port 1, you need to perform 800k updates in your forwarding table, which is very slow: again, on the order of minutes. So what is the problem?
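A minimal sketch of this control-plane flattening, with three invented prefixes standing in for the 800k (the R4 and port numbers follow the lecture example):

```python
# Control plane: BGP RIB (prefix -> egress) and the result of the
# recursive IGP lookup (egress -> local physical port).
bgp_rib = {"8.0.0.0/8": "R4", "9.0.0.0/8": "R4", "11.0.0.0/8": "R4"}
igp_rib = {"R4": 0}   # R4 is reached through R2, behind port 0

# The merge: one flat hardware entry per prefix.
flat_fib = {p: igp_rib[nh] for p, nh in bgp_rib.items()}

# Link R1-R2 fails; R4 is now reached via port 1. Every prefix entry
# must be rewritten: with a full table, that is 800k updates.
igp_rib["R4"] = 1
flat_fib = {p: igp_rib[nh] for p, nh in bgp_rib.items()}
```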

Slide 87
Slide 87
Slide 88
Slide 88
Slide 89
Slide 89

The problem is that we are merging in the control plane and then downloading the flattened table into the data plane, into the hardware. We would like to avoid flattening the table: we would like to keep the hierarchical structure inside the hardware as well, as two tables in sequence in the forwarding pipeline, a first BGP table followed by an IGP table. So imagine we are in hardware now, not in software anymore. A packet arrives, it is matched against one of the 800k prefixes, and the BGP next hop, R4 in this case, is attached to it. Think of it as a little piece of state travelling alongside the packet: the packet hits the switch, and the first table attaches R4 to it. The packet and this extra state then enter the second table, where a second match is performed on that extra piece of state, turning it into the physical port, here port 0. When there is a failure, I can then simply change that single mapping from R4 and port 0 to R4 and port 1.
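The hierarchical variant keeps the two tables in the hardware, so the failover becomes a single update; again a Python sketch with invented prefixes in place of the 800k:

```python
# Two tables in sequence, as in BGP PIC.
bgp_table = {"8.0.0.0/8": "R4", "9.0.0.0/8": "R4", "11.0.0.0/8": "R4"}
igp_table = {"R4": 0}   # BGP next hop R4 -> physical port 0

def forward(prefix):
    nh = bgp_table[prefix]   # first match: prefix -> BGP next hop
    return igp_table[nh]     # second match: BGP next hop -> port

# Link R1-R2 fails: one update to the second table reroutes the
# traffic of every prefix that hangs off R4.
igp_table["R4"] = 1
```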

Slide 90
Slide 90

I do a single update, which takes about 250 microseconds, and all the traffic, for all destinations, immediately shifts to the new next hop; everything is rerouted. This is where the name prefix-independent convergence comes from. If you take the same router as before: without prefix-independent convergence, that was the convergence time observed in the worst case; when you activate it, this is the convergence time observed. It is essentially completely independent of the number of prefixes affected. It is a very nice feature to have, of course, but your router needs to support it: you need hardware support for these tables in sequence and for this single-entry update.

Slide 91
Slide 91

Alright, questions on this before we move on? Okay, so that's it for prefix-independent convergence. As usual, I want to mention a few research works related to this topic of fast convergence.

Slide 92
Slide 92

The first one I would like to mention is Blink. This is a paper we wrote a few years ago, built on the observation that a failure may not be local but remote. When the failure is local, you can apply all the techniques presented in this lecture and converge within one second. That's great. But what happens if the failure is, for instance, in your provider's network? Then you need eBGP to tell you about the failure before you can start moving traffic away from it. So for remote failures, which are more frequent than local failures, you are still bound by slow convergence, still depending on BGP. Now, and this relates to the question asked before, think about what goes on in the data plane when a remote failure happens. Say the path from Swisscom towards Netflix dies, at peak time. You get tons of TCP connections that suddenly start failing at the same time, and retransmitting at the same time as well. This creates a very strong signal in the data plane: a burst of retransmitted packets immediately after the failure, or close to it. If we can harness that signal and extract it directly from the data plane, we can use it as a detection signal and reroute the traffic before BGP tells us the path is dead. This is exactly what the paper does. It analyzes retransmissions using P4, on live traffic, as the packets traverse the switch; it catches bursts of retransmissions, infers that there is a problem, and reroutes the traffic. This gives us local-like fast convergence for remote failures.
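The data-plane signal just described can be mimicked with a toy detector. The flow identifiers, the one-second window, and the three-flow threshold below are all invented for illustration; the real system keeps this kind of state per prefix inside the P4 pipeline.

```python
WINDOW = 1.0     # seconds: how long a retransmission counts as "recent"
THRESHOLD = 3    # distinct retransmitting flows that trigger the alarm

last_seq = {}       # flow id -> highest TCP sequence number seen
retx_events = []    # (timestamp, flow id) of observed retransmissions

def observe(ts, flow, seq):
    """Process one TCP packet; True means a remote failure is suspected."""
    prev = last_seq.get(flow)
    last_seq[flow] = seq if prev is None else max(prev, seq)
    if prev is not None and seq <= prev:   # sequence not advancing: retransmit
        retx_events.append((ts, flow))
    # Distinct flows with a retransmission inside the sliding window.
    recent = {f for t, f in retx_events if ts - t <= WINDOW}
    return len(recent) >= THRESHOLD
```

Isolated retransmissions of a single flow never raise the alarm; three flows retransmitting within the same window do.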

Slide 93
Slide 93

The second paper is from two years ago, and it looks at more complicated failures than clean, hard cuts. All the examples I gave you were abrupt: a link going down, a neighbor going down, et cetera. These failures are the easiest to detect; we discussed techniques such as loss-of-signal detection and beaconing, which are very simple because nothing goes through the link anymore. But many, indeed most, failures are not black and white. They are gray. These are nasty failures in which only a subset of the traffic is affected, not all of it. And oftentimes your detection traffic is not affected at all: your hello messages will happily flow over the link. Yet there is a problem, for instance a memory corruption that affects a particular prefix, and the traffic towards that prefix is then dropped. This is very hard to detect, because beaconing, by design, cannot capture it. You really need to look at all the traffic, since these failures can affect any subset of it, and analyze it in real time in order to discover the failure. What we do to solve this, again using P4, is have adjacent switches, a switch A and a switch B where A forwards to B, exchange counters. Each of them computes how much traffic it sees for each destination: A does it, and B, being downstream of A, does it too and sends its counters to A, say every second. We then compare the two counters for each destination and look for discrepancies. That is the high-level insight. Of course, as you may imagine, you cannot do this for 1.1 million counters.

This is where the optimization starts: how do you encode these counters in a way that scales? That is way beyond this lecture, but if you are interested, that is essentially how we do it. The paper starts by discussing the shortcomings of protocols like BFD, which we have talked about, so it really relates to these topics.
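The counter-comparison idea can be condensed into a few lines. The prefixes, counts, and the zero tolerance are invented for illustration, and a real deployment must also account for packets in flight between A and B and for the encoding tricks that make a million counters feasible.

```python
from collections import Counter

def gray_suspects(upstream, downstream, tolerance=0):
    """Prefixes for which downstream B saw notably fewer packets than
    upstream A forwarded to it during one reporting interval."""
    return {p for p, sent in upstream.items()
            if sent - downstream.get(p, 0) > tolerance}

# One reporting interval: per-prefix packet counts at A and at B.
a_counts = Counter({"10.0.0.0/8": 1000, "20.0.0.0/8": 500})
b_counts = Counter({"10.0.0.0/8": 1000, "20.0.0.0/8": 120})
```

Here `gray_suspects(a_counts, b_counts)` flags 20.0.0.0/8: its hello traffic may still flow, yet most of its data packets vanish between A and B.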

Slide 94
Slide 94