On the Conservation of Mutual Information

Machine Learning, like several other disciplines, models many phenomena as random variables (RVs): variables that can take on any value in some set, each with some probability.¹ These RVs can be pixel intensities in an image, words in a sentence, sensor readings from a robot’s surroundings, or even the exchange of observations and actions as an agent interacts with its environment in time. Information theory is one way to study these RVs. One of the many interesting results in information theory is the conservation of directed information: a law that tells us that as two entities exchange RVs back and forth, the total information they share is neatly partitioned into what flows from one to the other, and what flows back. This post’s focus will be the proof of that conservation law.

I will save the formal introduction of these ideas for more articulate voices (Shannon, 1948; MacKay, 2003) , and just give a brief outline here. The entropy of a random variable, $\mathbb{H}(X)$, measures how certain we are about which values the RV will take on. The outcome of a fair dice roll is a high-entropy event, but the outcome of whether the sun will rise tomorrow morning is very low-entropy. Building on entropy, we can also measure how much of it two RVs have in common. The mutual information of two RVs is a measure of how much information they share, $\mathbb{I}(X;Y) = \mathbb{H}(X)+\mathbb{H}(Y)-\mathbb{H}(X,Y) = \mathbb{I}(Y;X)$. In other words how much one can tell about a RV by knowing another RV. Two independent RVs have 0 mutual information, and an RV’s mutual information with itself is just its entropy $\mathbb{H}(X)$.

Say we have two sequences of random variables, each of length $n$, denoted $X^n = (X_1, X_2, \dots X_n)$ and $Y^n = (Y_1, Y_2, \dots Y_n)$, with a new $X_t$ and $Y_t$ being generated in that order in each unit of time (we will use subscripts to denote the RV at a timestep, and superscripts to denote the sequence of RVs up to that timestep). Information theory equips us with tools to quantify how the $X$s and $Y$s influence each other over time.

RV sequence diagram — A sequence of random variables exchanged over time.

Massey (1990) introduced the directed information to measure this influence:

$$ \mathbb{I}(X^n \rightarrow Y^n) \doteq \sum_{k=1}^n \mathbb{I}(X^k; Y_k|Y^{k-1}). $$

If you’re anything like me, this sum of conditional mutual information was very difficult to parse at first. It may help to look at the sum term by term. Notice that each term conditions on all prior $Y$s, capturing the cumulative context available at each step.

Directed Information diagram — Visualizing the directed information between two sequences of random variables.

Directed information is asymmetric by design: \mathbb{I}(X^n \rightarrow Y^n) captures what flows forward, from the $X$s to the $Y$s, but says nothing about what flows back. A natural question then is: does anything nice happen when you account for both directions at once? It turns out the answer is a clean conservation law, the directed information analogue of kinetic and potential energy summing to a constant:

$$ \mathbb{I}(X^n \rightarrow Y^n) + \mathbb{I}(0 \circ Y^{n-1} \rightarrow X^n) = \mathbb{I}(X^n; Y^n) $$

If we assume $X_t$ always happens before $Y_t$ in this synchronized sequence, we have to prepend the $Y$-sequence with a $0$ to properly quantify the effect of the $Y$s on the $X$s (second term on the LHS),as (Shannon, 1948) measures influence only between sequences of the the same length. (MacKay (2003) later generalizes this to arbitrary sequences). We use $\circ$ to denote the concatenation of sequences. It worth noting any arbitrary constant ($1, 0.42, 1/137, -3, \pi$) could have been used here, because it’s just a hack on our end. Any $X_1$ we measure is never affected by our choice of constant.

Shannon (1948) proves this statement for sequences of any length by induction, and we’ll present a more spelled-out, verbose version of it here. A proof by induction is often used to prove a statement for all natural numbers ($\mathbb{N} = 1, 2, 3, \dots$) in this case we are proving the statement for all sequence lengths), and is always composed of the following 4 steps:

Show that the statement is true for $n=1$.
Assume there exists some $k \in \mathbb{N}$, such that the statement is true.
Show that if the statement is true for $n=k$, it must be true for $n=k+1$ as well.
Conclude that since we have shown the statement is true for $n=1$, it must also be true for $n=2$, and hence also $n=3$, and so on it, so it must be true for all $n \in \mathbb{N}$.

I like to think of this akin to knocking over a set of dominos, with step 1 giving that initial push, and step 3 propagating that knock to all of the natural numbers.

We begin by considering the case when $n=1$. On the left hand side, we see:

\begin{align*} \mathbb{I}(X^1 \rightarrow Y^1) + \mathbb{I}(0 \rightarrow X^1) &= \sum_{n=1}^1 \mathbb{I}(X^n;Y_n|Y^{n-1}) + 0 \\ &= \mathbb{I}(X_1; Y_1|()) \\ &= \mathbb{I}(X^1; Y^1) \tag{the right hand side} \end{align*}

$\therefore$ true for $n=1$.

We assume there exists $k \in \mathbb{N}$ such that
$$ \mathbb{I}(X^k\rightarrow Y^k) + \mathbb{I}(0 \circ Y^{k-1}\rightarrow X^k) = \mathbb{I}(X^k;Y^k) $$
With this assumption, let us consider $n = k+1$. On the left hand side we see:

$$ \begin{align}\mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) + \mathbb{I}(0\circ Y^{k} \rightarrow X^{k+1}) \end{align} $$

Using the defintion of directed information, we can write out the first term of (1) as

$$ \begin{align*} \mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) &=\sum_{n=1}^{k+1}\mathbb{I}(X^n; Y_n|Y^{n-1})\\ &=\sum_{n=1}^{k}\mathbb{I}(X^n; Y_n|Y^{n-1}) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\\ &=\mathbb{I}(X^k \rightarrow Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k). \end{align*} $$

Similarly for the second term:

$$ \begin{align*} \mathbb{I}(0 \circ Y^k \rightarrow X^{k+1}) &= \mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(0 \circ Y^k; X_{k+1}|X^k) \\ &=\mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(Y^k; X_{k+1}|X^k), \end{align*} $$

where the last line is us ignoring the effect of our arbitrary constant, $0$, on the mutual information term (sequence lengths only have to be the same in Shannon (1948)’s directed information formulation). With both these decompositions, we can rewrite the left handside expression $(2)$ as:

$$ \begin{align*} \mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) + \mathbb{I}(0\circ Y^{k} \rightarrow X^{k+1}) &= \mathbb{I}(X^k \rightarrow Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\\ &\,\,\,\,\, \color{black} + \mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(Y^k; X_{k+1}|X^k) \\ & = \mathbb{I}(X^k; Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\\ &\color{white}= \color{nlack} + \mathbb{I}(Y^k; X_{k+1}|X^k) \tag{$n=k$ assumption}\\ &= \mathbb{I}(X^{k+1}; Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\\ &= \mathbb{I}(X^{k+1}; Y^{k+1}) \end{align*} $$

where the last two lines are a result of the chain rule for mutual information.

$\therefore$ the statement is true for $n=k+1$.

We have shown that if the statement is true for some $n=k$, then it must be true for $n=k+1$. Since we have shown that it is true for $n=1$, by induction it must be true for all $n \in \mathbb{N}$.

References

Shannon (1948), A Mathematical Theory of Communication. Bell System Technical Journal.
MacKay (2003), Information Theory, Inference and Learning Algorithms. Cambridge University Press.
Massey (1990), Causality, Feedback and Directed Information. Proceedings of the International Symposium on Information Theory and its Applications.

Probability density, in the case of continuous sets. ↩︎