Machine Learning, like several other disciplines, models many phenomena as random variables (RVs): variables that can take on any value in some set, each with some probability \footnote{probability density, in the case of continuous sets}. These RVs can be pixel intensities in an image, words in a sentence, sensor readings from a robot’s surroundings, or even the exchange of observations and actions as an agent interacts with its environment in time. Information theory is one way to study these RVs. One of the many interesting results in information theory is the conservation of directed information: a law that tells us that as two entities exchange RV signals back and forth, the total information they share is neatly partitioned into what flows from one to the other, and what flows back. This post’s focus will be the proof of that conservation law.
I will save the formal introduction of these ideas for more articulate voices [1,2], and just give a brief outline here. The entropy of a random variable, $\mathbb{H}(X)$, measures how certain we are about which values the RV will take on. The outcome of a fair dice roll is a high-entropy event, but the outcome of whether the sun will rise tomorrow morning is very low-entropy. Building on entropy, we can also measure how much of it two RVs have in common. The mutual information of two RVs is a measure of how much information they share, $\mathbb{I}(X;Y) = \mathbb{H}(X)+\mathbb{H}(Y)-\mathbb{H}(X,Y) = \mathbb{I}(Y;X)$. In other words how much one can tell about a RV by knowing another RV. Two independent RVs have 0 mutual information, and an RV’s mutual information with itself is just its entropy $\mathbb{H}(X)$.
Say we have two sequences of random variables, each of length $n$, denoted $X^n = (X_1, X_2, \dots X_n)$ and $Y^n = (Y_1, Y_2, \dots Y_n)$, with a new $X_t$ and $Y_t$ being generated in that order in each unit of time (we will use subscripts to denote the RV at a timestep, and superscripts to denote the sequence of RVs up to that timestep). Information theory equips us with tools to quantify how the $X$s and $Y$s influence each other over time.
[3] introduced the directed information to measure this influence: $$ \mathbb{I}(X^n \rightarrow Y^n) \doteq \sum_{k=1}^n \mathbb{I}(X^k; Y_k|Y^{k-1}). $$ If you’re anything like me, this sum of conditional mutual information was very difficult to parse at first. It may help to look at the sum term by term. Notice that each term conditions on all prior $Y$s, capturing the cumulative context available at each step.
Directed information is asymmetric by design: \mathbb{I}(X^n \rightarrow Y^n) captures what flows forward, from the $X$s to the $Y$s, but says nothing about what flows back. A natural question then is: does anything nice happen when you account for both directions at once? It turns out the answer is a clean conservation law, the directed information analogue of kinetic and potential energy summing to a constant:
$$ \mathbb{I}(X^n \rightarrow Y^n) + \mathbb{I}(0 \circ Y^{n-1} \rightarrow X^n) = \mathbb{I}(X^n; Y^n) $$
If we assume $X_t$ always happens before $Y_t$ in this synchronized sequence, we have to prepend the $Y$-sequence with a $0$ to properly quantify the effect of the $Y$s on the $X$s (second term on the LHS),as [1] measures influence only between sequences of the the same length. ([2] later generalizes this to arbitrary sequences). We use $\circ$ to denote the concatenation of sequences. It worth noting any arbitrary constant ($1, 0.137, -3, \pi, c$) could have been used here, because it’s just a hack on our end. Any $X_1$ we measure is never affected by our choice of constant.
[1] proves this statement for sequences of any length by induction, and we’ll present a more spelled-out, verbose version of it here. A proof by induction is often used to prove a statement for all natural numbers ($\mathbb{N} = 1, 2, 3, \dots$; in this case we are proving the statement for all sequence lengths), and is always composed of the following 4 steps:
- Show that the statement is true for $n=1$.
- Assume there exists some $k \in \mathbb{N}$, such that the statement is true.
- Show that if the statement is true for $n=k$, it must be true for $n=k+1$ as well.
- Conclude that since we have shown the statement is true for $n=1$, it must also be true for $n=2$, and hence also $n=3$, and so on it, so it must be true for all $n \in \mathbb{N}$.
I like to think of this akin to knocking over a set of dominos, with step 1 giving that initial push, and step 3 propagating that knock to all of the natural numbers.
We begin with 1., considering the case when $n=1$. On the left hand side, we see:
$$ \begin{align*}\mathbb{I}(X^1 \rightarrow Y^1) + \mathbb{I}(0 \rightarrow X^1) &= \sum_{n=1}^1 \mathbb{I}(X^n;Y_n|Y^{n-1}) + 0\ &= \mathbb{I}(X_1; Y_1|())\ &= \mathbb{I}(X^1; Y^1) \text{(the right hand side)} \end{align*}
$$
$\therefore$ true for $n=1$.
- Assume there exists $k \in \mathbb{N}$ such that
$$ \mathbb{I}(X^k\rightarrow Y^k) + \mathbb{I}(0 \circ Y^{k-1}\rightarrow X^k) = \mathbb{I}(X^k;Y^k) $$
- When $n = k+1$, on the left hand side we see:
$$ \begin{align}\mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) + \mathbb{I}(0\circ Y^{k} \rightarrow X^{k+1}) \end{align} $$
Note that the first term can be rewritten as:
$$ \begin{align*} \mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) &=\sum_{n=1}^{k+1}\mathbb{I}(X^n; Y_n|Y^{n-1})\ &=\sum_{n=1}^{k}\mathbb{I}(X^n; Y_n|Y^{n-1}) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\ &=\mathbb{I}(X^k \rightarrow Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k) \end{align*} $$
Similarly for the second term:
$$ \begin{align*} \mathbb{I}(0 \circ Y^k \rightarrow X^{k+1}) &= \mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(0 \circ Y^k; X_{k+1}|X^k) \ &=\mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(Y^k; X_{k+1}|X^k) \end{align*} $$
where the last line is basically ignoring the effect of our arbitrary constant ($0$) on the mutual information term (sequence lengths only have to be the same in [1]’s directed information formulation). With both these decompositions, we can rewrite the left handside expression $(2)$ as:
$$ \begin{align*} \mathbb{I}(X^{k+1} \rightarrow Y^{k+1}) + \mathbb{I}(0\circ Y^{k} \rightarrow X^{k+1}) &= \color{ForestGreen}\mathbb{I}(X^k \rightarrow Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\ &\color{white}= \color{black} + \mathbb{I}(0 \circ Y^{k-1} \rightarrow X^k) + \mathbb{I}(Y^k; X_{k+1}|X^k) \ & = \mathbb{I}(X^k; Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\ &\color{white}= \color{nlack} + \mathbb{I}(Y^k; X_{k+1}|X^k) (\text{by our assumption at,} n=k)\ &= \mathbb{I}(X^{k+1}; Y^k) + \mathbb{I}(X^{k+1};Y_{k+1}|Y^k)\ &= \mathbb{I}(X^{k+1}; Y^{k+1}) \end{align*} $$
where the last two lines are a result of the chain rule for mutual information.
$\therefore$ the statement is true for $n=k+1$.
- We have shown that if the statement is true for some $n=k$, then it must be true for $n=k+1$. Since we have shown that it is true for $n=1$, by induction it must be true for all $n \in \mathbb{N}$.
References
[1] Kaneda, Toshiko, and Carl Haub, How many people have ever lived on earth?, Population Reference Bureau 9 (2018)
[2] Bodleian Library, Carbon dating finds Bakhshali manuscript contains oldest recorded origins of the symbol 'zero', University of Oxford, hyperlink , last accessed: 20th Feb 2021
[3] C. F. Gauss, Theoria residuorum biquadraticorum. Commentatio secunda, Commentationes Societatis Regiae Scientiarum Gottingensis Recentiores (1831)
[4] E. Seigel, Scientific Proof Is A Myth, Forbes, hyperlink , last accessed: 20th Feb 2021
[5] R. Feynman, The Character of Physical Law Lecture Series, Cornell University (1964), hyperlink , last accessed 21 st Feb 2021