Information Theory

Information theory is a field that addresses two questions:

  1. Source Coding: How many bits do I need to losslessly represent an observation?

  2. Channel Coding: How reliably and quickly can I communicate a message over a noisy channel?

Quantifying Information

Intuitively, for the PMF of a discrete random variable, the surprise associated with a particular realization $x$ is $-\log p_X(x)$, since less probable realizations are more surprising. With this intuition, we can try to quantify the "expected surprise" of a distribution.

Definition 40

For a Discrete Random Variable $X \sim p_X$, the Entropy of $X$ is given by

$$H(X) = \mathbb{E}\left[-\log_2 p_X(X)\right] = -\sum_{x\in\mathcal{X}} p_X(x)\log_2 p_X(x).$$

Alternative interpretations of entropy are the average uncertainty in $X$ and a measure of how random $X$ is. Just like probabilities, we can define both joint and conditional entropies.
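As a quick sanity check on Definition 40, here is a minimal Python sketch (the function name `entropy` and the example PMFs are arbitrary choices, not from the notes) that computes the entropy of a finite PMF directly from the formula:

```python
import numpy as np

def entropy(pmf):
    """Entropy in bits of a finite PMF given as an array of probabilities."""
    p = np.asarray(pmf, dtype=float)
    p = p[p > 0]                       # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

print(entropy([0.5, 0.5]))             # fair coin -> 1.0 bit
print(entropy([0.9, 0.1]))             # biased coin -> ~0.47 bits (less surprising on average)
print(entropy([0.25] * 4))             # uniform over 4 symbols -> 2.0 bits
```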

Definition 41

For Discrete Random Variables $X$ and $Y$, the joint entropy is given by

$$H(X,Y) = \mathbb{E}\left[-\log_2 p_{XY}(X, Y)\right] = -\sum_{(x,y)\in\mathcal{X}\times\mathcal{Y}} p_{XY}(x, y)\log_2 p_{XY}(x, y).$$

Definition 42

For Discrete Random Variables $X$ and $Y$, the conditional entropy is given by

$$H(Y|X) = \mathbb{E}\left[-\log_2 p_{Y|X}(Y|X)\right] = \sum_{x\in\mathcal{X}} p_X(x) H(Y|X=x).$$

Conditional entropy has a natural interpretation: it tells us how surprised we are, on average, to see $Y$ given that we know $X$. If $X$ and $Y$ are independent, then $H(Y) = H(Y|X)$ because observing $X$ gives no additional information about $Y$.

Theorem 23 (Chain Rule of Entropy)

$$H(X, Y) = H(X) + H(Y|X).$$

Besides measuring how our surprise about one random variable changes once we observe another, we can also quantify how much additional information observing one random variable gives us about another.

Definition 43

For random variables $X$ and $Y$, the mutual information is given by

$$I(X;Y) = H(X) - H(X|Y) = H(Y) - H(Y|X).$$
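The quantities in Definitions 41–43 and the chain rule can be checked numerically from any finite joint PMF. Below is a minimal sketch, assuming a small made-up joint distribution over two binary random variables (the numbers are arbitrary):

```python
import numpy as np

def H(p):
    """Entropy in bits of a PMF given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]                                     # convention: 0 * log(0) = 0
    return float(-np.sum(p * np.log2(p)))

# A made-up joint PMF p_XY(x, y); rows index x, columns index y.
p_xy = np.array([[0.30, 0.10],
                 [0.20, 0.40]])
p_x = p_xy.sum(axis=1)                               # marginal of X
p_y = p_xy.sum(axis=0)                               # marginal of Y

# Conditional entropies computed directly from Definition 42.
H_y_given_x = sum(px * H(row / px) for px, row in zip(p_x, p_xy))
H_x_given_y = sum(py * H(col / py) for py, col in zip(p_y, p_xy.T))

# Chain rule (Theorem 23): H(X, Y) = H(X) + H(Y|X).
assert abs(H(p_xy) - (H(p_x) + H_y_given_x)) < 1e-12

# Mutual information (Definition 43), computed both ways.
I_xy = H(p_x) - H_x_given_y
I_yx = H(p_y) - H_y_given_x
assert abs(I_xy - I_yx) < 1e-12
print(I_xy)                                          # ~0.125 bits shared between X and Y
```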

Source Coding

Source coding deals with finding the minimal number of bits required to represent data; this is essentially lossless compression. In this case, our message is a sequence of realizations of independent and identically distributed random variables $(X_i)_{i=1}^n \sim p_X$. The probability of observing a particular sequence is then

$$P(x_1, x_2, \cdots, x_n) = \prod_{i=1}^n p_X(x_i).$$

Theorem 24 (Asymptotic Equipartition Property)

If $(X_i)_{i=1}^n \sim p_X$ is a sequence of independent and identically distributed random variables, then $-\frac{1}{n}\log_2 P(X_1, X_2, \cdots, X_n)$ converges to $H(X)$ in probability.

Theorem 24 tells us that with overwhelming probability, we will observe a sequence that is assigned probability roughly $2^{-nH(X)}$. Using this idea, we can define a subset of possible observed sequences to which, in the limit, our observed sequence must belong with overwhelming probability.

Definition 44

For a fixed $\epsilon > 0$ and each $n \geq 1$, the typical set is given by

$$A_\epsilon^{(n)} = \left\{ (x_1, x_2, \cdots, x_n) : 2^{-n(H(X)+\epsilon)} \leq P(x_1, x_2, \cdots, x_n) \leq 2^{-n(H(X)-\epsilon)} \right\}.$$

Two important properties of the typical set are that

  1. $\lim_{n\to\infty} P\left((X_1, X_2, \ldots, X_n) \in A_{\epsilon}^{(n)}\right) = 1$

  2. $\left|A_{\epsilon}^{(n)}\right| \leq 2^{n(H(X)+\epsilon)}$
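Both the convergence in Theorem 24 and property 1 can be observed numerically. Below is a minimal sketch, assuming a Bernoulli(0.3) source and tolerance $\epsilon = 0.05$ (both arbitrary choices); as $n$ grows, the fraction of sampled sequences landing in $A_\epsilon^{(n)}$ approaches 1:

```python
import numpy as np

rng = np.random.default_rng(0)
p, eps = 0.3, 0.05                                   # Bernoulli(p) source, tolerance
Hx = -(p * np.log2(p) + (1 - p) * np.log2(1 - p))    # H(X) ~ 0.881 bits

for n in [10, 100, 1000, 10000]:
    # For a Bernoulli source, a sequence's probability depends only on its number
    # of ones, so draw that count directly for 5000 i.i.d. length-n sequences.
    k = rng.binomial(n, p, size=5000)
    neg_log_p = -(k * np.log2(p) + (n - k) * np.log2(1 - p)) / n
    in_typical = np.abs(neg_log_p - Hx) <= eps       # membership in the typical set
    print(n, neg_log_p.mean(), in_typical.mean())    # mean ~ H(X); fraction -> 1
```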

The typical set gives us an easy way to do source coding. If I have $N$ total objects, then I only need $\log_2 N$ bits to represent each one, so I can define a simple protocol:

  1. If $(x_i)_{i=1}^{n} \in A^{(n)}_{\frac{\epsilon}{2}}$, then describe it using $\log_2\left|A^{(n)}_{\frac{\epsilon}{2}}\right| \leq n\left(H(X)+\frac{\epsilon}{2}\right)$ bits.

  2. If $(x_i)_{i=1}^n \notin A^{(n)}_{\frac{\epsilon}{2}}$, then describe it naively with $n\log_2|\mathcal{X}|$ bits.

This makes the average number of bits required to describe a message

$$\begin{aligned} \mathbb{E}\left[\#\text{ of bits}\right] &\leq n\left(H(X)+\frac{\epsilon}{2}\right)P\left((x_i)_{i=1}^n\in A_{\frac{\epsilon}{2}}^{(n)}\right) + n\log_2 |\mathcal{X}|\,P\left((x_i)_{i=1}^n\notin A_{\frac{\epsilon}{2}}^{(n)}\right) \\ &\leq n\left(H(X)+\frac{\epsilon}{2}\right) + n\frac{\epsilon}{2} \leq n(H(X)+\epsilon), \end{aligned}$$

where the second inequality holds for $n$ large enough that $\log_2|\mathcal{X}|\,P\left((x_i)_{i=1}^n\notin A_{\frac{\epsilon}{2}}^{(n)}\right) \leq \frac{\epsilon}{2}$, which Theorem 24 guarantees.

This is the first half of a central result of source coding.

Theorem 25 (Source Coding Theorem)

If $(X_i)_{i=1}^n \sim p_X$ is a sequence of independent and identically distributed random variables, then for any $\epsilon > 0$ and $n$ sufficiently large, we can represent $(X_i)_{i=1}^n$ using fewer than $n(H(X) + \epsilon)$ bits on average. Conversely, we cannot losslessly represent $(X_i)_{i=1}^n$ using fewer than $nH(X)$ bits.

This lends a new interpretation to the entropy $H(X)$: it is the average number of bits required to represent $X$.
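The two-part scheme above can be evaluated exactly for a binary source, since a Bernoulli sequence's probability depends only on its number of ones. Below is a minimal sketch, assuming a Bernoulli(0.3) source, $n = 1000$, and $\epsilon = 0.1$ (all arbitrary choices); it checks that the expected description length stays below $n(H(X) + \epsilon)$:

```python
from math import comb, log2

p, eps, n = 0.3, 0.1, 1000
Hx = -(p * log2(p) + (1 - p) * log2(1 - p))          # H(X) for a Bernoulli(p) source

typical_bits = n * (Hx + eps / 2)                    # bound on log2|A_{eps/2}^(n)| used inside the typical set
naive_bits = n * log2(2)                             # n log2|X| bits, with |X| = 2

expected_bits, p_typical = 0.0, 0.0
for k in range(n + 1):
    # All sequences with k ones share probability p^k (1-p)^(n-k); there are C(n, k) of them.
    prob_k = comb(n, k) * p**k * (1 - p)**(n - k)
    neg_log = -(k * log2(p) + (n - k) * log2(1 - p)) / n
    if abs(neg_log - Hx) <= eps / 2:                 # the sequence lies in A_{eps/2}^(n)
        expected_bits += prob_k * typical_bits
        p_typical += prob_k
    else:
        expected_bits += prob_k * naive_bits

print(p_typical)                                     # probability of the typical set, ~0.995
print(expected_bits / n, Hx + eps)                   # average bits per symbol vs. H(X) + eps
```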

Channel Coding

Whereas source coding deals with encoding information, channel coding deals with transmitting it over a noisy channel. In general, we have a message $M$, an encoder, a channel, and a decoder, as in Figure 1.

Each channel can be described by a conditional probability distribution $p_{Y|X}(y|x)$ for each time the channel is used.

Definition 45

For a channel described by $p_{Y|X}$, the capacity is given by

$$C = \max_{p_X} I(X; Y).$$

In words, the capacity is the largest mutual information between the channel input and output, taken over all choices of input distribution $p_X$.
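As a concrete example (not worked out in the notes): for a binary symmetric channel that flips each input bit with probability $\delta$, the capacity is known to be $1 - H(\delta)$, and a brute-force search over input distributions $p_X = (q, 1-q)$ recovers it. The identity $I(X;Y) = H(X) + H(Y) - H(X,Y)$ used below follows from Definition 43 and the chain rule.

```python
import numpy as np

def H(p):
    """Entropy in bits of a PMF given as an array of any shape."""
    p = np.asarray(p, dtype=float).ravel()
    p = p[p > 0]
    return float(-np.sum(p * np.log2(p)))

delta = 0.1                                          # crossover probability of the BSC
p_y_given_x = np.array([[1 - delta, delta],          # rows index x, columns index y
                        [delta, 1 - delta]])

best = 0.0
for q in np.linspace(0.001, 0.999, 999):             # candidate input PMFs p_X = (q, 1-q)
    p_x = np.array([q, 1 - q])
    p_xy = p_x[:, None] * p_y_given_x                # joint PMF p_X(x) p_{Y|X}(y|x)
    p_y = p_xy.sum(axis=0)
    best = max(best, H(p_x) + H(p_y) - H(p_xy))      # I(X;Y) for this input distribution

print(best)                                          # ~0.531, maximized at the uniform input
print(1 - H([delta, 1 - delta]))                     # known BSC capacity 1 - H(delta)
```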

Definition 46

Suppose we use the channel $n$ times to send a message $M$ that takes on average $H(M)$ bits to encode; then the rate of this communication scheme is

$$R = \frac{H(M)}{n}.$$

Theorem 26 (Channel Coding Theorem)

For a channel described by $p_{Y|X}$, any $\epsilon > 0$, and any rate $R < C$, for all $n$ sufficiently large there exists a rate-$R$ communication scheme that achieves a probability of error less than $\epsilon$. If $R > C$, then the probability of error converges to 1 for any communication scheme.
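As a toy illustration of trading rate for reliability (a repetition code is far from optimal and is not the construction behind Theorem 26): repeating each bit three times over a binary symmetric channel with $\delta = 0.1$ gives rate $R = 1/3$, well below $C \approx 0.53$, and a much smaller error probability than sending bits uncoded. The parameters below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
delta, n_bits = 0.1, 100_000                         # BSC crossover probability, message length

bits = rng.integers(0, 2, n_bits)                    # uniform message bits, so H(M) = n_bits
coded = np.repeat(bits, 3)                           # rate-1/3 repetition code: 3 channel uses per bit
flips = rng.random(coded.shape) < delta              # channel noise: each symbol flipped w.p. delta
received = coded ^ flips
decoded = received.reshape(-1, 3).sum(axis=1) >= 2   # majority vote per 3-symbol block
print(np.mean(decoded != bits))                      # ~0.028, versus 0.1 for uncoded transmission
```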


Figure 1: Channel Coding