Conditional entropy has a natural interpretation: it tells us, on average, how surprised we are to see $Y = y$ given that we know $X = x$. If $X$ and $Y$ are independent, then $H(Y) = H(Y \mid X)$, because observing $X$ gives no additional information about $Y$.
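As a quick numerical check (a minimal sketch; the pmfs, the choice of bits via $\log_2$, and the helper names entropy and conditional_entropy are illustrative assumptions):

```python
import numpy as np

def entropy(p):
    """Shannon entropy in bits of a pmf p; zero-probability entries are skipped."""
    p = p[p > 0]
    return -np.sum(p * np.log2(p))

def conditional_entropy(p_xy):
    """H(Y|X) = sum_x p(x) H(Y | X = x) for a joint pmf p_xy[x, y]."""
    p_x = p_xy.sum(axis=1)
    return sum(px * entropy(p_xy[x] / px) for x, px in enumerate(p_x) if px > 0)

# Independent X and Y: the joint pmf is an outer product of the marginals,
# so H(Y | X) equals H(Y) -- observing X changes nothing about Y.
p_x = np.array([0.5, 0.5])
p_y = np.array([0.25, 0.75])
p_xy = np.outer(p_x, p_y)
print(entropy(p_y), conditional_entropy(p_xy))  # both ~0.8113 bits
```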
Theorem 23 (Chain Rule of Entropy)
$$H(X,Y) = H(X) + H(Y \mid X).$$
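One way to see this is to factor the joint pmf as $p(x,y) = p(x)\,p(y \mid x)$:
$$H(X,Y) = -\sum_{x,y} p(x,y) \log p(x,y) = -\sum_{x,y} p(x,y)\left[\log p(x) + \log p(y \mid x)\right] = H(X) + H(Y \mid X).$$
By symmetry, $H(X,Y) = H(Y) + H(X \mid Y)$ as well.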
In addition to knowing how much our surprise about one random variable changes when we observe another, we can also quantify how much additional information observing one random variable gives us about the other.
Definition 43
For random variables $X$ and $Y$, the mutual information is given by
$$I(X;Y) = H(X) - H(X \mid Y) = H(Y) - H(Y \mid X).$$
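Continuing the sketch above (reusing the illustrative entropy and conditional_entropy helpers defined there), mutual information can be computed directly from a joint pmf:

```python
def mutual_information(p_xy):
    """I(X;Y) = H(Y) - H(Y|X), computed from the joint pmf p_xy[x, y]."""
    p_y = p_xy.sum(axis=0)
    return entropy(p_y) - conditional_entropy(p_xy)

# X and Y agree with probability 0.9, so observing X says a lot about Y.
p_xy = np.array([[0.45, 0.05],
                 [0.05, 0.45]])
print(mutual_information(p_xy))  # ~0.531 bits; it would be 0 under independence
```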
Source Coding
Source coding deals with finding the minimal number of bits required to represent data. This is essentially the idea of lossless compression. In this case, our message is the sequence of realizations of independently and identically distributed random variables $(X_i)_{i=1}^n \sim p_X$. The probability of observing a particular sequence is then
$$P(x_1, x_2, \cdots, x_n) = \prod_{i=1}^n p_X(x_i).$$
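A small simulation of this setup (the binary alphabet and the pmf below are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(0)
p_X = np.array([0.2, 0.8])                # pmf of each X_i over the alphabet {0, 1}
x = rng.choice(len(p_X), size=10, p=p_X)  # one realized sequence x_1, ..., x_10
prob = np.prod(p_X[x])                    # P(x_1, ..., x_n) = prod_i p_X(x_i)
print(x, prob)
```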
Theorem 24 (Asymptotic Equipartition Property)
If we have a sequence of independently and identically distributed random variables $(X_i)_{i=1}^n \sim p_X$, then
$$-\frac{1}{n} \log P(X_1, X_2, \cdots, X_n)$$
converges to $H(X)$ in probability.
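A quick empirical check of Theorem 24, using the same illustrative pmf as above: the per-symbol surprise of a sampled sequence settles near $H(X)$ as $n$ grows.

```python
import numpy as np

rng = np.random.default_rng(0)
p_X = np.array([0.2, 0.8])
H = -np.sum(p_X * np.log2(p_X))               # H(X) ~ 0.7219 bits

for n in (10, 100, 10_000):
    x = rng.choice(len(p_X), size=n, p=p_X)
    per_symbol = -np.log2(p_X[x]).sum() / n   # -(1/n) log2 P(x_1, ..., x_n)
    print(n, per_symbol, H)
# The gap between the sampled value and H(X) shrinks as n grows,
# illustrating convergence in probability.
```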
Theorem 24 tells us that, with overwhelming probability, we will observe a sequence that is assigned probability approximately $2^{-nH(X)}$. Using this idea, we can define a subset of possible sequences to which, in the limit, our observed sequence must belong with overwhelming probability.
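For instance (an illustrative case), if $X$ is a fair coin flip, then $H(X) = 1$ bit and every length-$n$ sequence has probability exactly $2^{-n} = 2^{-nH(X)}$; for a biased coin, the $2^{-nH(X)}$ approximation holds only for the typical sequences that this subset collects.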