Network Tomography

Dr. Georgia Fragkouli, Alexandra Marchon (Scribe)

Today we'll see together how we can infer link performance properties from end-to-end observations.

Slide 1

Let me start by introducing the problem. Think of the Internet as a simple graph, where the nodes at the edges represent end-users, while the nodes in the middle are switches/routers. The edges in this graph interconnect the different nodes and are links that carry Internet traffic between pairs of end-users.

When we look at the graph on this slide, we see the topology of the network: four end-users at the top and five different end-users at the bottom. The graph connects them, and the end-users communicate with each other.

Slide 2

Most of the time, the Internet links are expected to behave well, in the sense that they don't lose more than a certain amount of the packets that go along them. But sometimes the links drop more packets than they are expected to drop. In this lecture, we are going to be characterizing, during a time interval, certain links as good or bad.

Slide 3

We will be saying that a link is bad or lossy if, during that time interval, the link's loss rate exceeds or is equal to a given threshold. We represent these links in orange. The other links we are going to call good or non-lossy. These are the links where the loss rate is below this threshold. And we will represent those in green.

In this graph, we have taken a snapshot of the network during a particular time interval. And we have marked some links as orange (three links here), while the rest of them are green.

Slide 4

The fact that some links are good or bad affects the performance of end-to-end paths that cross these links. We will be saying that a path (or individual path) is a sequence of links through which traffic flows between two end-users.

If we take a look again at this graph, we see that we have marked here two paths in purple. Both of them are starting at the same end-user on the top, and they finish at different destination end-users at the bottom.

Slide 5

Now, similar to the characterization of links as good or bad, a path is bad/lossy if it goes through at least one bad link during the time interval, and we represent those in orange. The other paths, which have no bad links in this time interval, we are going to call them good/non-lossy, and we represent them in green.

So here, on this graph, we look at the same snapshot we have just seen, and we see that one of the two paths is marked green (the left one), while the other one is marked orange. The first one has all green links along the path for this time interval, while the other one has at least one orange link, in this case two bad links.

Slide 6

Now, if you put yourselves in the shoes of these end-users communicating through these paths, upon observing poor path performance, you may start wondering which of the links in between is responsible. The challenge is that you, as a user, do not have access to the individual links. You cannot go and measure their statistics. So how are you going to figure out which particular link is the cause, just based on end-to-end performance observations? This is the main problem we are going to try to solve today.

Slide 7

The technique we are going to use is called network-performance tomography, or for short, tomography. It infers link-performance properties by combining the performance of multiple paths that intersect at given links. The intuition here is that if you know just the performance of one path, then you cannot figure out the link-performance data. But if you can have multiple paths intersecting at common links, then this becomes possible.

Tomography in general takes two inputs: the network topology---the link sequences along which traffic flows from one end-user to another, i.e., which links make up the paths over which you communicate---and secondly the path performance, for example the loss rate of the path.

Slide 8

The key idea of inferring the link-performance properties is to write a system of equations that connects what users know---the path performance---to what users don't know---the link performance. Intuitively, if you can have this system of equations, then you can solve it to figure out the unknowns, the link performance.

Slide 9

Tomography has been used in wide-area networks and ISPs to figure out different per-link performance properties. For example, it could be used to find the per-link loss rate, average delay, or utilization.

We are going to focus on two properties. The first one is the probability of a link being good, and the second one is link neutrality.

Slide 10

I will come back to this later, but for now, we say that a link is neutral only if it treats all packets the same, regardless of which path's traffic they belong to. In particular, in this lecture, we are going to consider a link as neutral if it has the same probability of being good or bad across all paths.

You may wonder why a link could be non-neutral. There are multiple reasons for that. One reason may be that this is caused on purpose by the ISP, because the ISP wants to prioritize some traffic over some other traffic. Or another reason might be that there is some bug in the switches causing them to drop only certain traffic.

Slide 11

That was the intro. We will now continue with the remainder of the lecture. The first part will focus on the probability of a link being good.

Slide 12

Here we see another network, which is a simplified version of the previous one. It is a network with three end-users. The one at the top is communicating with the left user at the bottom through path 1 ($p_1$), and also communicating with another user at the bottom through path 2 ($p_2$). The network has three links, $l_C$, $l_1$ and $l_2$.

Slide 13

During a given time interval, we will consider a link as good if its loss rate is below 1%. What does that mean? It means that if the link receives X packets during the time interval, it delivers to the next link at least 99% of them.

Slide 14

Here, let's say that links $l_1$ and $l_2$ are good all the time. So their loss rate is below 1% at any given time interval. In contrast, $l_C$ gets bad with probability 0.5. This means that if you were to observe what $l_C$ does for multiple time intervals, you could expect that 50% of the time, $l_C$'s loss rate is above or equal to 1%.

The question we ask is: How can end-users infer that $l_C$ is the culprit here for the performance they observe on the paths $p_1$ and $p_2$? We will see one by one the steps we take to write the system of equations we talked about earlier.

Slide 15

The first step is to estimate the path performance, which is part of the input to the tomographic algorithm. Here, the path performance is the probability of the path being good.

Slide 16

How do we decide, based on our observations, whether the path is good at any given time interval?

Slide 17

We consider a time interval during which we send traffic along the path, and then we measure what fraction of the packets do not make it to the destination. This is the loss rate during the time interval.

Slide 18

Let me give you two concrete examples of path performance of a good and a bad path. For simplicity, we will consider that: whenever a link is bad, it has a loss rate of 10%, which is above the 1% threshold we mentioned before; whenever a link is good, it has a loss rate of 0.1%, which is below the 1% threshold.

We consider the case of path $p_1$ being good or bad based on whether $l_C$ is good or bad. Links $l_1$ and $l_2$ are always good here. Let us take the case of $l_C$ being bad. Then, the loss rate of $p_1$ is 1 minus the delivery rate along the path. As said, $l_1$ is a good link, so it has a loss rate of 0.1%, i.e., a delivery rate of 99.9%. $l_C$ is bad in this case, so it has a loss rate of 10%, i.e., a delivery rate of 90%. So, when $l_C$ is bad, the loss rate of $p_1$ is around 10.1%.

If we were to do the same calculation for the case of $l_C$ being good, we would arrive at a loss rate of around 0.2%.
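
Written out, the two calculations are:

$$\text{loss}(p_1 \mid l_C \text{ bad}) = 1 - 0.999 \times 0.90 \approx 10.1\%$$

$$\text{loss}(p_1 \mid l_C \text{ good}) = 1 - 0.999 \times 0.999 \approx 0.2\%$$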

So based on this, you see that these two numbers, 10.1% and 0.2%, are far apart.

Slide 19

The question is: Where exactly do we set the threshold so that below the threshold we identify the path as good and above it as bad? We do not have access to the links, so we have to judge whether a path is good or bad based on the path-performance numbers.

Slide 20

This is the next step. After we have computed the path loss rate, we compare this loss rate to a threshold. Note that this threshold is not necessarily the 1% we use for links. Let's see how we have to compute it so that we distinguish the two path-performance numbers we have seen before. Ideally, we would like to set this threshold such that the path definition holds: a path is good if and only if all the links are good.

Slide 21

In practice, however, this is not always possible. The logical statement in the path definition has two directions and by setting the threshold, we can only guarantee one of them, which is: if the path is bad, then at least one of its links must have been bad during this time interval. But just based on this threshold, we cannot guarantee that whenever we see a good path, then all the links must have been good during that interval.

Slide 22

Let's see how to set the threshold to guarantee the first direction I mentioned. We set the threshold to the maximum loss rate that the path would experience if all of its links were good.
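
In compact form (using this lecture's 1% per-link threshold), the threshold for a path $p$ with $k$ links is

$$T_p = 1 - (1 - 0.01)^k = 1 - 0.99^k.$$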

Slide 23

For example, here we know that $p_1$ has two links, $l_C$ and $l_1$. So during a time interval, if we were to send some number of packets over $p_1$, then if $l_C$ is good, at least 99% of them will make it to $l_1$.

Slide 24
Slide 25

Out of the packets that are delivered to $l_1$, at least 99% of them will make it to the destination, because $l_1$ is also good.

Slide 26

So, the delivery rate over $p_1$ is $99\% \times 99\% \approx 98\%$, and the corresponding loss rate is 1 minus the delivery rate, i.e., around 2%.

Slide 27

Hence, we set the path threshold to this number.

Now compare that to the numbers we have seen before, 10.1% and 0.2%. You see that if we had set the threshold to 2%, the bad path would indeed be characterized as bad, while the good one would be characterized as good.

Slide 28

Now we have a threshold. Let's see how to estimate the probability of the path being good, because this is ultimately what we need to plug into the system of equations we are going to write.

Slide 29

For that, we divide time into multiple time intervals, and for each of them, we characterize the path as good or bad. In the end, we estimate the probability of the path being good as the fraction of intervals where the path has been good.

Slide 30

Let me give you an example.

Slide 31

Here, we have split the entire measurement time into four time intervals.

Slide 32

For each of them, we measure the performance of $p_1$, i.e., its loss rate. It turns out that in 50% of the cases, $p_1$ had a loss rate below 2%, so it was good, and in the other cases, it was bad.

Slide 33
Slide 34

Given these measurements, we would then say that the probability of $p_1$ being good equals 0.5, meaning that 50% of the time the path is good and 50% bad.
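
A minimal sketch of this estimation step (the helper name and the sample loss rates are illustrative; the 2% path threshold is the one we derived earlier):

```python
PATH_LOSS_THRESHOLD = 0.02  # path-level threshold derived from the per-link 1%

def prob_path_good(interval_loss_rates):
    """Estimate the probability of a path being good as the fraction of
    time intervals in which its loss rate stayed below the threshold."""
    good = sum(1 for rate in interval_loss_rates if rate < PATH_LOSS_THRESHOLD)
    return good / len(interval_loss_rates)

# Four intervals as in the example: good, bad, bad, good.
print(prob_path_good([0.002, 0.101, 0.101, 0.001]))  # -> 0.5
```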

Slide 35

Now we move to step 2. Once we have the path performance that we need, we write the equations of the system, one for each individual path.

Slide 36

We again start with path $p_1$.

Slide 37

By definition of a good path, $p_1$ is good when $l_C$ is good and $l_1$ is good. That gives us this equation here, which says that the probability of $p_1$ being good equals the probability of both links that make up the path being good.

Slide 38

Now, if we assume that links are good or bad independently of each other, then we can arrive at this formula here. This says that to obtain the probability of $p_1$ being good, we have to multiply the per-link probabilities.

Slide 39

A common technique in network tomography to get linear equations is to take the logarithm on both sides; by the properties of the logarithm, the multiplication is replaced by addition.

Slide 40

For simplicity, we will be using this notation: We use Ys for the logarithms of path performance, and Xs for the logarithms of link performance. Here we have written the same equation but with a Y/X notation.

Slide 41

Then, we write the equation for path $p_2$. We take a similar approach and get the equation in green. It differs from the above one in that it has X variables for the performance of $l_C$ and $l_2$, instead of $l_C$ and $l_1$. This is because $p_2$ has links $l_C$ and $l_2$ (instead of $l_C$ and $l_1$ for $p_1$).
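
Concretely, the two per-path equations in this notation are

$$Y_1 = X_C + X_1, \qquad Y_2 = X_C + X_2,$$

where $Y_i$ is the logarithm of the probability of $p_i$ being good and $X_j$ is the logarithm of the probability of $l_j$ being good.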

Slide 42

So what we have seen so far is how to write a system of linear equations: the left-hand side has the path performances, which is what users can measure; the right-hand side has these X variables, which denote per-link performance, and these are the unknowns that tomography aims to infer.

Slide 43

The third step is to solve this system of equations to arrive at a solution for the unknowns. We do this by using linear algebra.

Slide 44

However, as you might have noticed, in the general case this system has multiple solutions. It is not full-rank, because it is underdetermined: it has two linearly independent equations for three unknowns. So, it has infinitely many solutions.
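
In matrix form, the system reads

$$\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \end{pmatrix} \begin{pmatrix} X_C \\ X_1 \\ X_2 \end{pmatrix} = \begin{pmatrix} Y_1 \\ Y_2 \end{pmatrix},$$

a rank-2 system in three unknowns.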

Slide 45

Let us see two of the possible solutions. Before, we have estimated that path $p_1$ is good 50% of the time. Assume the same for $p_2$.

Slide 46

If we plug these numbers here, we would have log(0.5) on the left side of both equations.

Slide 47

This is one solution, which says that $l_C$ is the culprit: We see that $X_C$ has a non-zero value here ($X_C=log(0.5)$), indicating that $l_C$ is bad 50% of the time, while $X_1=X_2=0$, indicating that $l_1$ and $l_2$ are always good.

Slide 48

However, there are other solutions. Here is another one, which says that $l_C$ is not the culprit---it is always good---but the problem is $l_1$ and $l_2$, because these are bad 50% of the time.

The problem is that if we have many solutions, not all of them are realistic. One of them is what actually happened, and we may not know how to pick the right one.

People have used heuristics to constrain the system of equations. For example, one may favor the solution where $l_C$ is the culprit, because based on historical data, $l_C$ is often bad. Or one may favor the solution that minimizes the bad links, as this is more realistic than many/all the links being congested/lossy.

However, using heuristics, one may not necessarily get the correct solution. We would like to find a unique solution. So we have to further constrain the system of equations, and we need to obtain more information.

Slide 49

How do we obtain more information? We use the so-called performance correlation among multiple paths. It can actually be shown that if we do not have this performance correlation, then in the general case, any system of equations we construct is going to be underdetermined.

Slide 50

The intuition here is that if the performance of a set of paths, let's say $p_1$ and $p_2$, is correlated, then one possible explanation is that this is caused by the common links of the two paths.

Slide 51

Let's see in this network how we would estimate the path performance correlation based on end-to-end observations. We estimate the performance correlation of $p_1$ and $p_2$ as the fraction of time where both $p_1$ and $p_2$ are simultaneously good.

Slide 52

This is the information we have used so far. We have split time into intervals, and for each interval we have used whether a path is good or bad.

Slide 53

The additional information we are going to consider is how often both of them are good simultaneously. In this case, we see that $p_1$ and $p_2$ are either simultaneously good, which happens 50% of the time, or simultaneously bad, which also happens 50% of the time. Based on this, we estimate that the probability of both of them being good is 0.5.
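
A minimal sketch of this extra estimation step (the helper name is illustrative; the inputs are per-interval good/bad flags for each path):

```python
def prob_both_good(path1_good, path2_good):
    """Estimate the performance correlation of two paths as the fraction of
    time intervals in which both were simultaneously good."""
    both = sum(1 for g1, g2 in zip(path1_good, path2_good) if g1 and g2)
    return both / len(path1_good)

# As in the example: p1 and p2 are simultaneously good in 2 of 4 intervals.
print(prob_both_good([True, False, False, True],
                     [True, False, False, True]))  # -> 0.5
```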

Slide 54

Knowing the performance correlation of $p_1$ and $p_2$ helps us figure out a unique solution, where $l_C$ is the culprit. Again, we need the assumption of independent link behavior. Otherwise, another valid explanation here is that $l_C$ is always good while links $l_1$ and $l_2$ are correlated: they are always either simultaneously good (50% of the time) or simultaneously bad (the other 50% of the time).

However, assuming independent links, the only explanation is that the common link of $p_1$ and $p_2$, $l_C$, is bad 50% of the time, while the rest of the links are always good.

Slide 55

Let's look at an alternative scenario just to show how the performance correlation of $p_1$ and $p_2$ would have changed if it was not $l_C$ causing the problem, but $l_1$ and $l_2$.

Slide 56

If $l_1$ gets bad 50% of the time, and that happens independently of the 50% of the time that $l_2$ gets bad, then as before, each of the paths would be bad 50% of the time. But now we would not expect to see them getting simultaneously bad all the time. For example, in time interval 3 they got simultaneously bad, but in time interval 1 they did not.

Slide 57

As a result, the probability of both of them being good would be lower: 0.25 instead of the 0.5 we have seen before. And if we were to plug in these numbers in the system of equations, the solution would indicate that $l_1$ and $l_2$ are responsible.
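
Indeed, using the pathset equation introduced below ($Y_{12} = X_C + X_1 + X_2$) together with the two per-path equations, plugging in $Y_1 = Y_2 = \log(0.5)$ and $Y_{12} = \log(0.25)$ gives

$$X_C = Y_1 + Y_2 - Y_{12} = 0, \qquad X_1 = Y_{12} - Y_2 = \log(0.5), \qquad X_2 = Y_{12} - Y_1 = \log(0.5),$$

i.e., $l_C$ is always good while $l_1$ and $l_2$ are each good with probability 0.5.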

Slide 58

So what that tells us is that to find a unique solution to the system of equations, we do not only have to write equations for individual paths, but also for sets of paths.

Slide 59

We will be calling two paths a path pair, while for the more generic case of one, two, or more paths, we will use the term pathset.

Slide 60

Again, we will follow the same characterization that we followed for links and individual paths. During a time interval, we will be saying that a path pair is good if and only if all of its paths are good, and similarly for pathsets.

Slide 61

So let's revise step 2: Not only do we have to write equations for individual paths, but also for each of the pathsets.

Slide 62

What does that mean in this particular network? Here, additionally to $p_1$ and $p_2$, there is only one pathset, namely path pair $\{p_{12}\}$, and let's see how to write the equation for this.

Slide 63

By definition, the path pair $\{p_{12}\}$ is good when all of its paths, $p_1$ and $p_2$, are good. So, the probability of $\{p_{12}\}$ being good equals the probability of both $p_1$ being good and $p_2$ being good.

Slide 64

Then we expand what it means by the definition for each of the paths to be good: $p_1$ is good when $l_C$ is good---the first link of $p_1$---and $l_1$ is good---the second link of $p_1$. And similarly for $p_2$.

Slide 65

This is equivalent to this equation here.

Slide 66

Then, we use again the assumption of independent link behavior to write the probability of $\{p_{12}\}$ being good as the product of the probabilities of all the links that make up the paths. And then, by taking again the logarithms and following the notation, we have this equation here, where $Y_{12}$ indicates the logarithm of the probability of path pair $\{p_{12}\}$ being good. What you can note here is that each of the individual links that make up the two paths appears only once; an important note for when you solve the exercises.

Slide 67

Putting it all together, we have this system of three linear equations that are independent of one another. That means that the system is now full-rank, and we can solve for the unknowns to uniquely infer the per-link probabilities.
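
Written out, the full-rank system is

$$Y_1 = X_C + X_1, \qquad Y_2 = X_C + X_2, \qquad Y_{12} = X_C + X_1 + X_2,$$

three independent equations in three unknowns.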

Slide 68

To do that, we plug in the performance numbers for the pathsets that we have seen before. In the case where only $l_C$ is bad (w.p. 0.5), the probabilities were 0.5 for each of $p_1$, $p_2$, and $\{p_{12}\}$.

Slide 69

And then, if we solve the system, this is the unique solution, which says that $l_C$ is bad with probability 0.5. This is the same as the ground truth, so tomography correctly inferred the culprit link.
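
A minimal sketch of this solving step (NumPy is used here purely for illustration):

```python
import numpy as np

# Rows: equations for p1, p2, and the pair {p12};
# columns: unknowns X_C, X_1, X_2 (log-probabilities of links being good).
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 0.0, 1.0],
              [1.0, 1.0, 1.0]])
Y = np.log([0.5, 0.5, 0.5])  # measured pathset performances
X = np.linalg.solve(A, Y)
print(np.exp(X))             # -> [0.5, 1.0, 1.0]: only l_C is bad, 50% of the time
```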

Slide 70
Slide 71

We already mentioned some assumptions, but let's see what other assumptions we make when writing such systems of equations.

Slide 72

The first assumption is stationarity, meaning that the loss behavior of a link does not change over time. How did we implicitly use this assumption? By having a single variable capture whether a link is bad.

Slide 73

The second assumption is neutrality, which says that the loss behavior of a link is the same for all paths crossing it. We use this assumption by using, for each link, the same variable across all paths/equations that involve this link.

Slide 74

The third assumption is stability, which says that paths remain unchanged over time. We used it by writing a single equation for each path.

Slide 75

The fourth assumption---we already mentioned it---is independence. It says that a link is bad independently of any other link. We used this assumption by writing the path performance as the product of the per-link probabilities.

Slide 76

And the final assumption is separability, which says that a path is good if and only if all of its links are good. This is not entirely an assumption; only part of it is, because this logical statement has the two directions we talked about earlier. For one direction, we set the path loss threshold to guarantee that whenever we see a bad path, then at least one of its links was bad.

But there is the other part of the statement, which says that whenever we see a good path, all of its links must have been good. The latter is not guaranteed by the threshold we set: some links can be just slightly lossier than they should be, hence characterized as bad, while the other links are very good (for example, they do not drop any traffic) and make up for the little extra loss that the bad ones inflict. Overall, we then see a good path, while at least one link might have actually been bad.
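
As a concrete (hypothetical) instance with this lecture's numbers: on $p_1$, suppose $l_C$ has a loss rate of 1.2% (a bad link, since $1.2\% \geq 1\%$) while $l_1$ drops nothing. The path loss rate is then

$$1 - (1 - 0.012)(1 - 0) = 1.2\% < 2\%,$$

so the path is characterized as good even though one of its links was bad.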

Slide 77

So now, let's assume these assumptions hold. And let's say we can write a full-rank system of equations. Then tomography can always infer the behavior of the links.

Slide 78

And here are the sufficient conditions for having a full-rank system of equations. The first condition says that we can include in the system an equation for each pathset, and that includes individual paths. Why might this condition not hold? Because we do not have enough end-users to measure all these paths. We might be able to write the equation, but we do not have the path performance that corresponds to it, so it is as if we do not have the equation.

And the second condition is that every link is distinguishable from one another. This means that they are traversed by different sets of paths.

Slide 79

Let's see why these sufficient conditions hold in the network we have seen before. Here, the paths that traverse $l_C$ are $p_1$ and $p_2$, while for $l_1$ it is $p_1$, and for $l_2$ it is $p_2$. You see that all these three sets are different from one another, which means the second condition---that links are distinguishable from one another---indeed holds.

And because we are able to monitor the different pathsets and we wrote the system of equations for those, the first condition holds as well; this is why we have sufficient conditions to find a unique solution.

Slide 80

Instead, if we had this slightly different version of the network---where we just replaced $l_C$ with two links, $l_{C1}$ and $l_{C2}$---the sufficient conditions do not hold, even if we monitor both $p_1$ and $p_2$. Why? Because you see that here, $l_{C1}$ and $l_{C2}$ are both traversed by the same set of paths, $p_1$ and $p_2$. This means, intuitively, that just based on external observations, we can never be sure that one of the two links is responsible. The lossy behavior of, for example, $l_{C1}$ can always be attributed to $l_{C2}$. So we cannot say with certainty whether the problematic link was $l_{C1}$ or $l_{C2}$.

Slide 81

That concludes the first part about the first link performance property. Now we will switch to the second one, which is the link neutrality status.

Slide 82

Here we are going to consider a slightly different network, which has one additional path, $p_3$, one additional link, $l_3$, and one more end-user.

Slide 83

Let's consider that the network is non-neutral. What does that mean? That it has at least one non-neutral link. The non-neutral link here is $l_C$: $l_C$ never drops any traffic from $p_1$, but 50% of the time, it does drop traffic from $p_2$ and $p_3$. So the behavior of $l_C$ is different depending on the path we look at. This is why this link is non-neutral.

So the question we are going to ask is, can we find whether this link is non-neutral based on end-to-end performance observations?

Slide 84

Before we see how to answer this question, let's see what it means for a link to be non-neutral. Here is the original network with $l_C$ non-neutral. We can think of it as writing the neutral-equivalent network, where all links are neutral: We substitute $l_C$ with two links, $l_{CC}$ and $l_{CA}$. $l_{CC}$ captures the common behavior that $l_C$ has to all paths crossing it, $p_1$ included. $l_{CA}$ captures the additionally bad/lossy behavior that the $l_C$ link here has on $p_2$ and $p_3$; you see that we put $l_{CA}$ only along paths $p_2$ and $p_3$, so $l_{CA}$ is not traversed by $p_1$.

Slide 85
Slide 86

If we take a snapshot of this network, and we use the same color mapping as before, we will mark $l_C$ here with both green and orange, because for this time snapshot, $l_C$ was good for $p_1$, but bad for the other two paths.

Slide 87

In its neutral-equivalent network---which does not exist but is just a way to think about the actual network---$l_{CC}$ is green, while $l_{CA}$ is orange. We also see that $p_1$ is green, because all of its links are green, while the other two paths are orange, because at least one of their links is orange.

So the question is, how do we find that $l_C$ is non-neutral? How do we write the system of equations to prove it?

Slide 88

The system of equations will assume neutral links and then try to disprove it: If we form an unsolvable system of equations, that reveals that at least one of the neutrality assumptions we have made is wrong, i.e., at least one link is non-neutral.

Slide 89

So here is the intuition. Let's say that some links are non-neutral, meaning they treat different paths differently. Then, if we were to use a different variable to capture a link's performance across different paths, the system of equations would be solvable, because what we wrote down simply says that each link can have a different performance---a different variable---for the different paths.

Slide 90

The equivalent statement of this logical sentence is that if the system is unsolvable, then we used the same variable to capture different things, namely to capture the different behavior of a link across different paths.

Slide 91

This is equivalent to saying that we wrongly assumed as neutral at least one link, which was actually non-neutral.

Slide 92

Here, we did not reach a conclusion about which particular link is non-neutral; we just said that at least one must have been non-neutral. How do we find that $l_C$ specifically was non-neutral?

Slide 93

We have to write the system of equations as follows. There are two steps. We write equations not for all pathsets, but only for the path pairs that share only $\lambda$---the link we want to prove non-neutral. So these path pairs must intersect at this $\lambda$ link and nothing else. Then we write an equation for each of these path pairs, capturing the performance correlation of its paths, plus an equation for each individual path involved in these path pairs.

And then, when we write the system of equations, we have to be careful how we name variables: We have to map the performance of the $\lambda$ link to a single variable across all paths, and then for each individual path, we have to use a single new variable to capture the performance of all other links, other than $\lambda$ ($l_C$ here), on this path.

Slide 94
Slide 95

We will see an example soon, but let me give you first the intuition for this. What we achieve by following this recipe is that the only assumption about neutrality we make is about $\lambda$. So, if it turns out the system of equations is unsolvable, by what we described earlier, the only explanation is that we assumed $\lambda$ to be neutral while it was not.

And then, by writing these separate equations for the different path pairs, we try to form a system of equations that, if unsolvable, proves that $\lambda$ is non-neutral.

Slide 96

Here we have an example. It is the same network we have seen before, and we try to figure out that $l_C$ is non-neutral (this is the only non-neutral link here) based on end-to-end observations.

To do that, first we have to identify which path pairs intersect only at $l_C$: $\{p_{12}\}$, $\{p_{13}\}$, and $\{p_{23}\}$.

Slide 97

So by the recipe we just described, we have to write equations for all these three path pairs, plus the individual paths, $p_1$, $p_2$, $p_3$, that make up these three path pairs.

Slide 98

To write this system of equations, let's start with the equations for the individual paths. Here, we see these equations. What I would like you to note is that, indeed, we have used the same variable, $X_C$, across all equations to capture the performance of $l_C$. This is our assumption that $l_C$ is neutral. And then, for each of the paths, we have used another variable to capture the performance of all other links of the path except for $l_C$. In this small network there was just one more link per path---for example, $p_1$ has only $l_1$ in addition to $l_C$---but in the exercises we will see a more complex topology where this is not the case. Note that even if $p_1$ additionally had links $l_5$, $l_6$, and $l_8$, we would have used a single variable, different from the others, to capture the performance of all links of $p_1$ other than $l_C$.

Slide 99

We also see the equations for the three path pairs. The equations use the assumption of link independence, and they contain one instance of every variable that corresponds to links that belong to the path pair.
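
Written out with this variable naming (a sketch; each $X_i$ aggregates all non-$l_C$ links of $p_i$), the system is

$$Y_1 = X_C + X_1, \qquad Y_2 = X_C + X_2, \qquad Y_3 = X_C + X_3,$$

$$Y_{12} = X_C + X_1 + X_2, \qquad Y_{13} = X_C + X_1 + X_3, \qquad Y_{23} = X_C + X_2 + X_3.$$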

Slide 100

Now we have to show that the system of equations is unsolvable. For any system of equations, it is enough to prove that a sub-system of it is unsolvable; then we know that the overall system is also unsolvable. Here, we will try to prove that the red sub-system of equations is unsolvable.

Slide 101

We start by plugging in the path performance. $p_1$ has zero path performance, because $p_1$ is always good: it consists of always good links ($l_C$ is always good for $p_1$, same for $l_1$). So the probability of $p_1$ being good equals one, and then the logarithm of one equals zero.

The path performance of $p_2$ is log(0.5), which says that the probability of $p_2$ being bad is 0.5. Why is that? Because for $p_2$, $l_C$ gets bad with probability 0.5, which results in the overall path getting bad with the same probability. Similarly for $p_3$.

Finally, for path pair $\{p_{23}\}$, we see that the performance is log(0.5). Why is that? If you consider only the part of the network that contains $p_2$ and $p_3$, it is the same as the original neutral network we saw at the beginning of the lecture. So if the common link gets bad with probability 0.5, then the correlation of $p_2$ and $p_3$ is 0.5 because of the common link.
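
Assuming the red sub-system consists of the equations for $p_1$, $p_2$, $p_3$, and $\{p_{23}\}$, plugging in these numbers gives

$$X_C + X_1 = 0, \qquad X_C + X_2 = \log(0.5), \qquad X_C + X_3 = \log(0.5), \qquad X_C + X_2 + X_3 = \log(0.5).$$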

Slide 102

So now that we have the path performance, we can solve the system of equations. Here, it is easy to solve because the first equation already gives us a lot of information. It tells us that $X_C=X_1=0$. This is because their sum is zero and the Xs are non-positive numbers: they are logarithms of probabilities, so they are at most log(1)=0. As such, a sum of non-positive numbers can equal zero only if each of them is zero. Then we plug $X_C=X_1=0$ into all equations.

Slide 103

Then we can solve for $X_2$ based on equation two. We get $X_2=log(0.5)$, and we plug it in everywhere.

Slide 104

Here, we have arrived at an inconsistency: based on the third equation, $X_3=log(0.5)$, but based on the last equation, $X_3=0$. So this is an inconsistency and it means that the system is unsolvable.

Slide 105

Since the only assumption about neutrality we made here was that $l_C$ is neutral, the system being unsolvable means that we have proven this assumption to be wrong---tomography correctly infers that $l_C$ is non-neutral.

In the exercises we will see scenarios with more complex topologies. It is important to note that this technique cannot prove that a link is neutral---it can only prove that it is non-neutral. So, if we were to arrive at a solvable system, that wouldn't say anything about the link. It could be that we just did not have enough path performance measurements to constrain the system such that it becomes unsolvable.

Slide 106

This brings us to the conclusion of the tomography part. What we have seen today is that tomography is a technique that exploits correlations in end-to-end performance measurements to infer per-link performance properties.

Slide 107

In this lecture, we looked at two particular properties: one is the probability of a link being good/non-lossy; the other one is link neutrality. And we have seen how to write the system of equations to figure out the properties using tomography.

Slide 108