Random Variables and their Distributions

Definition 5

A random variable is a function $X:\Omega\rightarrow\mathbb{R}$ with the property that $\forall \alpha\in\mathbb{R},\ \{\omega\in\Omega:\ X(\omega)\leq\alpha\}\in\mathcal{F}$.

The condition in Definition 5 is necessary to compute $P(X\leq\alpha)$ for all $\alpha\in\mathbb{R}$. This requirement also lets us compute $P(X\in B)$ for most sets $B$ by leveraging the fact that $\mathcal{F}$ is closed under complements, unions, and intersections. For example, we can also compute $P(X>\alpha)$ and $P(\alpha<X\leq\beta)$. In this sense, the property binds the probability space to the random variable.

Definition 5 also implies that random variables satisfy particular algebraic closure properties. For example, if $X$ and $Y$ are random variables, then so are $X+Y$, $XY$, $X^p$, and $\lim_{n\to\infty}X_n$ (for a sequence of random variables $X_n$), etc.

Definition 6

A discrete random variable is a random variable whose codomain is countable.

Definition 7

A continuous random variable is a random variable whose codomain is the real numbers and whose distribution admits a probability density function (see Definition 9).

Although random variables are defined in terms of an underlying probability space, it is often most natural to model problems without explicitly specifying that space. This works so long as we specify the random variables and their distributions in a “consistent” way. This is formalized by the so-called Kolmogorov Extension Theorem, but the details can largely be ignored.

Distributions

Roughly speaking, the distribution of a random variable gives an idea of the likelihood that it takes a particular value or set of values.

Definition 8

The probability mass function (or distribution) of a discrete random variable $X$ is the frequency with which $X$ takes on different values.

$$p_X:\mathcal{X}\rightarrow[0,1] \text{ where } \mathcal{X}=\text{range}(X), \qquad p_X(x)=\Pr\{X=x\}.$$

Note that $\sum_{x\in\mathcal{X}}p_X(x)=1$ since the events $\{\omega: X(\omega)=x\}$ are disjoint and $\bigcup_{x\in\mathcal{X}}\{\omega: X(\omega)=x\}=\Omega$.

Continuous random variables are largely similar to discrete random variables. One key difference is that instead of being described by a probability “mass”, they are described by a probability “density”.

Definition 9

The probability density function (distribution) of a continuous random variable describes the density with which the random variable takes a particular value.

$$f_X:\mathbb{R}\to[0,\infty) \text{ where } \int_{-\infty}^{\infty}f_X(x)\,dx=1 \text{ and } \Pr\{X\in B\}=\int_B f_X(x)\,dx.$$

Observe that if a random variable $X$ is continuous, then the probability that it takes on a particular value is zero.

$$\Pr\{X=x\}=\lim_{\delta\to 0}\Pr\{x\leq X\leq x+\delta\}=\lim_{\delta\to 0}\int_x^{x+\delta}f_X(u)\,du=\int_x^x f_X(u)\,du=0.$$

Definition 10

The cumulative distribution function (CDF) gives us the probability of a random variable $X$ being less than or equal to a particular value.

$$F_X:\mathbb{R}\to[0,1], \qquad F_X(x)=\Pr\{X\leq x\}.$$

Note that by the Kolmogorov axioms, $F_X$ must satisfy three properties:

  1. $F_X$ is non-decreasing.

  2. $\lim_{x\to-\infty}F_X(x)=0$ and $\lim_{x\to\infty}F_X(x)=1$.

  3. $F_X$ is right continuous.

It turns out that any function $F_X$ satisfying these three properties is the CDF of some random variable on some probability space. The CDF also gives us an alternative way to define continuous random variables: if $F_X$ is absolutely continuous, then it can be expressed as

$$F_X(x)=\int_{-\infty}^{x}f_X(u)\,du$$

for some non-negative function $f_X$, and this $f_X$ is the PDF of a continuous random variable.
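
Since the PDF is the derivative of an absolutely continuous CDF, the relationship can be sanity-checked numerically. Below is a minimal sketch (assuming NumPy is available) that differentiates the CDF of an $\text{Exp}(2)$ random variable and compares it against the known density.

```python
import numpy as np

lam = 2.0
x = np.linspace(0.01, 3, 300)
cdf = 1 - np.exp(-lam * x)            # F_X(x) for the exponential distribution
pdf_numeric = np.gradient(cdf, x)     # d/dx F_X(x) via finite differences
pdf_exact = lam * np.exp(-lam * x)    # f_X(x) = lambda * exp(-lambda * x)

print(np.max(np.abs(pdf_numeric - pdf_exact)))  # small discretization error
```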

Often, when modeling problems, there are multiple random variables that we want to keep track of.

Definition 11

If $X$ and $Y$ are random variables on a common probability space $(\Omega,\mathcal{F},P)$, then the joint distribution (denoted $p_{XY}(x,y)$ or $f_{XY}(x,y)$) describes the frequencies of joint outcomes.

Note that it is possible for $X$ to be continuous and $Y$ to be discrete (or vice versa).

Definition 12

The marginal distribution of a joint distribution is the distribution of a single random variable.

$$p_X(x)=\sum_y p_{XY}(x,y), \qquad f_X(x)=\int_{-\infty}^{\infty}f_{XY}(x,y)\,dy.$$

Definition 13

Two random variables $X$ and $Y$ are independent if their joint distribution is the product of the marginal distributions.

Just like independence, we can extend the notion of conditional probability to random variables.

Definition 14

The conditional distribution of $X$ given $Y$ captures the frequencies of $X$ given that we know the value of $Y$.

$$p_{X|Y}(x|y)=\frac{p_{XY}(x,y)}{p_Y(y)}, \qquad f_{X|Y}(x|y)=\frac{f_{XY}(x,y)}{f_Y(y)}.$$

Often, we need to combine or transform several random variables. A derived distribution is the distribution obtained by doing arithmetic on several random variables or by applying a function to one or more random variables. Since the CDF essentially defines a random variable, it is often easiest to work backwards from the CDF to the PDF or PMF. In the special case where $Y=g(X)$ for a function $g$,

$$F_Y(y)=\Pr\{Y\leq y\}=\Pr\{g(X)\leq y\}=\Pr\{X\in g^{-1}((-\infty,y])\}, \quad \text{where } g^{-1}(B)=\{x: g(x)\in B\}.$$
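
As a concrete sketch of the CDF method (assuming NumPy and SciPy are available), take $Y=X^2$ with $X\sim\mathcal{N}(0,1)$, so $F_Y(y)=\Phi(\sqrt{y})-\Phi(-\sqrt{y})$; differentiating should recover the chi-square density with one degree of freedom.

```python
import numpy as np
from scipy import stats

y = np.linspace(0.05, 4, 400)
cdf_Y = stats.norm.cdf(np.sqrt(y)) - stats.norm.cdf(-np.sqrt(y))  # F_Y(y)
pdf_Y = np.gradient(cdf_Y, y)   # f_Y(y) by numerically differentiating F_Y

# The derived density should agree with chi-square(1) up to discretization error.
print(np.max(np.abs(pdf_Y - stats.chi2.pdf(y, df=1))))
```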

Another special case of a derived distribution is when adding random variables together.

Theorem 5

The resulting distribution of a sum of two independent random variables is the convolution of the distributions of the two random variables.

$$p_{X+Y}(z)=\sum_{k=-\infty}^{\infty}p_X(k)p_Y(z-k), \qquad f_{X+Y}(z)=\int_{-\infty}^{\infty}f_X(x)f_Y(z-x)\,dx.$$
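
For discrete random variables the convolution is a finite sum, so it is easy to compute directly. A small sketch (assuming NumPy) for the sum of two independent fair dice:

```python
import numpy as np

die = np.ones(6) / 6                # p_X(k) = 1/6 for k = 1, ..., 6
pmf_sum = np.convolve(die, die)     # p_{X+Y}(z) = sum_k p_X(k) p_Y(z - k)
for total, p in zip(range(2, 13), pmf_sum):
    print(total, round(p, 4))       # peaks at 7 with probability 6/36
```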

Properties of Distributions

Expectation

Definition 15

The expectation of a random variable describes the center of a distribution,

$$\mathbb{E}[X]=\sum_{x\in\mathcal{X}}x\,p_X(x), \qquad \mathbb{E}[X]=\int_{-\infty}^{\infty}x\,f_X(x)\,dx,$$

provided the sum or integral converges.

Expectation has several useful properties. If we want to compute the expectation of a function of a random variable, then we can use the law of the unconscious statistician.

Theorem 6 (Law of the Unconscious Statistician)

$$\mathbb{E}[g(X)]=\sum_{x\in\mathcal{X}}g(x)p_X(x), \qquad \mathbb{E}[g(X)]=\int_{-\infty}^{\infty}g(x)f_X(x)\,dx.$$

Another useful property is its linearity.

$$\mathbb{E}[aX+bY]=a\mathbb{E}[X]+b\mathbb{E}[Y], \quad \forall a,b\in\mathbb{R}.$$

Sometimes it can be difficult to compute expectations directly. For discrete distributions, we can use the tail-sum formula.

Theorem 7 (Tail Sum)

For a non-negative integer random variable,

$$\mathbb{E}[X]=\sum_{k=1}^{\infty}\Pr\{X\geq k\}.$$
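
As a quick check of the tail-sum formula, for $X\sim\text{Geom}(p)$ we have $\Pr\{X\geq k\}=(1-p)^{k-1}$, and the tail probabilities should sum to $\mathbb{E}[X]=1/p$. A minimal sketch in plain Python:

```python
# Tail-sum check for X ~ Geom(p): sum_k Pr{X >= k} should equal E[X] = 1/p.
p = 0.3
tail_sum = sum((1 - p) ** (k - 1) for k in range(1, 500))  # truncated tail sum
print(tail_sum, 1 / p)   # both approximately 3.3333
```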

When two random variables are independent, expectation has some additional properties.

Theorem 8

If $X$ and $Y$ are independent, then

$$\mathbb{E}[XY]=\mathbb{E}[X]\,\mathbb{E}[Y].$$

Earlier, we saw that we can find a derived distribution by transforming and combining random variables. Sometimes we don’t need to compute the full distribution, only some of its properties.

Definition 16

The $n$th moment of a random variable is $\mathbb{E}[X^n]$.

It turns out that we can encode the moments of a distribution into the coefficients of a special power series.

Definition 17

The moment generating function of a random variable $X$ is given by $M_X(t)=\mathbb{E}[e^{tX}]$.

Notice that if we apply the power series expansion of $e^{tX}$, we see that

$$M_X(t)=\sum_{n=0}^{\infty}\frac{t^n}{n!}\mathbb{E}[X^n].$$

Thus the $n$th moment is encoded in the coefficients of the power series, and we can retrieve it by differentiating and evaluating at $t=0$:

$$\mathbb{E}[X^n]=\frac{d^n}{dt^n}M_X(t)\bigg|_{t=0}.$$
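
A short symbolic sketch of this fact (assuming SymPy is available as tooling): recover the first two moments of $X\sim\text{Exp}(\lambda)$ from its MGF $M_X(t)=\lambda/(\lambda-t)$.

```python
import sympy as sp

t, lam = sp.symbols("t lambda", positive=True)
M = lam / (lam - t)                           # MGF of Exp(lambda), valid for t < lambda

first_moment = sp.diff(M, t, 1).subs(t, 0)    # E[X]   = 1/lambda
second_moment = sp.diff(M, t, 2).subs(t, 0)   # E[X^2] = 2/lambda^2
print(first_moment, second_moment)
```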

Another interesting point to notice is that for a continuous random variable

$$M_X(t)=\int_{-\infty}^{\infty}f_X(x)e^{tx}\,dx$$

is the Laplace transform of the distribution over the real line, and for a discrete random variable,

$$M_X(t)=\sum_{x=-\infty}^{\infty}p_X(x)e^{tx}$$

is the Z-transform of the distribution evaluated at $z=e^{-t}$.

Theorem 9

If the MGF of a random variable exists, then it uniquely determines the distribution.

This provides another way to compute the distribution of a sum of independent random variables: the MGF of the sum is the product of the individual MGFs.

Variance

Definition 18

The variance of a random variable $X$ describes its spread around the expectation and is given by

$$\text{Var}(X)=\mathbb{E}[(X-\mathbb{E}[X])^2]=\mathbb{E}[X^2]-\mathbb{E}[X]^2.$$

Theorem 10

When two random variables $X$ and $Y$ are independent, then

$$\text{Var}(X+Y)=\text{Var}(X)+\text{Var}(Y).$$

Definition 19

The covariance of two random variables describes how much they depend on each other and is given by

$$\text{Cov}(X,Y)=\mathbb{E}[(X-\mathbb{E}[X])(Y-\mathbb{E}[Y])]=\mathbb{E}[XY]-\mathbb{E}[X]\mathbb{E}[Y].$$

If $\text{Cov}(X,Y)=0$, then $X$ and $Y$ are uncorrelated.

Definition 20

The correlation coefficient gives a single number which describes how random variables are correlated.

$$\rho(X,Y)=\frac{\text{Cov}(X,Y)}{\sqrt{\text{Var}(X)}\sqrt{\text{Var}(Y)}}.$$

Note that $-1\leq\rho\leq 1$.

Common Discrete Distributions

Definition 21

$X$ is uniformly distributed when each value of $X$ has equal probability.

$$X\sim\text{Uniform}(\{1,2,\cdots,n\}) \implies p_X(x)=\begin{cases}\frac{1}{n} & x=1,2,\cdots,n,\\ 0 & \text{else.}\end{cases}$$

Definition 22

$X$ is a Bernoulli random variable if it is either $0$ or $1$ with $p_X(1)=p$.

$$X\sim\text{Bernoulli}(p) \implies p_X(x)=\begin{cases}1-p & x=0,\\ p & x=1,\\ 0 & \text{else.}\end{cases}$$

$$\mathbb{E}[X]=p, \qquad \text{Var}(X)=p(1-p).$$

Bernoulli random variables are good for modeling experiments like a coin flip, where there is some probability of success. They are frequently used as indicator random variables $\mathbb{1}_A$, where

$$\mathbb{1}_A=\begin{cases}1 & \text{if } A \text{ occurs,}\\ 0 & \text{else.}\end{cases}$$

When paired with the linearity of expectation, indicators give a powerful method for computing expectations, as the sketch below illustrates.
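
For instance, the expected number of fixed points of a uniformly random permutation is $\sum_{i=1}^{n}\Pr\{\pi(i)=i\}=n\cdot\frac{1}{n}=1$ by indicators plus linearity, even though the indicators are not independent. A small simulation sketch in plain Python:

```python
import random

def count_fixed_points(n: int) -> int:
    perm = list(range(n))
    random.shuffle(perm)
    return sum(perm[i] == i for i in range(n))   # sum of indicator variables

n, trials = 10, 100_000
avg = sum(count_fixed_points(n) for _ in range(trials)) / trials
print(avg)   # close to 1, regardless of n
```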

Definition 23

$X$ is a Binomial random variable when

$$X\sim\text{Binomial}(n,p) \implies p_X(x)=\begin{cases}\binom{n}{x}p^x(1-p)^{n-x} & x=0,1,\cdots,n,\\ 0 & \text{else.}\end{cases}$$

$$\mathbb{E}[X]=np, \qquad \text{Var}(X)=np(1-p).$$

A binomial random variable can be thought of as the number of successes in $n$ independent trials. In other words,

$$X\sim\text{Binomial}(n,p) \implies X=\sum_{i=1}^{n}X_i, \quad X_i\overset{\text{i.i.d.}}{\sim}\text{Bernoulli}(p).$$

By construction, if $X\sim\text{Binomial}(n,p)$ and $Y\sim\text{Binomial}(m,p)$ are independent, then $X+Y\sim\text{Binomial}(m+n,p)$.

Definition 24

A Geometric random variable is distributed as

$$X\sim\text{Geom}(p) \implies p_X(x)=\begin{cases}p(1-p)^{x-1} & x=1,2,\cdots,\\ 0 & \text{else.}\end{cases}$$

$$\mathbb{E}[X]=\frac{1}{p}, \qquad \text{Var}(X)=\frac{1-p}{p^2}.$$

Geometric random variables are useful for modeling the number of trials required to obtain the first success. In other words,

$$X\sim\text{Geom}(p) \implies X=\min\{k\geq 1: X_k=1\} \text{ where } X_i\overset{\text{i.i.d.}}{\sim}\text{Bernoulli}(p).$$

A useful property of geometric random variables is that they are memoryless:

$$\Pr\{X=k+m \mid X>k\}=\Pr\{X=m\}.$$

Definition 25

A Poisson random variable is distributed as

$$X\sim\text{Poisson}(\lambda) \implies p_X(x)=\begin{cases}\frac{\lambda^x e^{-\lambda}}{x!} & x=0,1,\cdots,\\ 0 & \text{else.}\end{cases}$$

$$\mathbb{E}[X]=\lambda, \qquad \text{Var}(X)=\lambda.$$

Poisson random variables are good for modeling the number of arrivals in a given interval. Suppose you take a given time interval and divide it into $n$ chunks, where the probability of an arrival in chunk $i$ is $X_i\sim\text{Bernoulli}(p_n)$. Then the total number of arrivals $X_n=\sum_{i=1}^{n}X_i$ is distributed as a Binomial random variable with expectation $np_n=\lambda$. As we increase $n$ to infinity while keeping $\lambda$ fixed, we arrive at the Poisson distribution; the sketch below compares the two.
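
A minimal numerical illustration of this limit (assuming NumPy and SciPy are available), comparing the $\text{Binomial}(n,\lambda/n)$ PMF with the $\text{Poisson}(\lambda)$ PMF as $n$ grows:

```python
import numpy as np
from scipy import stats

lam = 4.0
ks = np.arange(15)
for n in (10, 100, 1000):
    binom_pmf = stats.binom.pmf(ks, n, lam / n)   # Binomial(n, lambda/n)
    pois_pmf = stats.poisson.pmf(ks, lam)         # Poisson(lambda)
    print(n, np.max(np.abs(binom_pmf - pois_pmf)))   # gap shrinks as n grows
```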

A useful fact about Poisson random variables is that if $X\sim\text{Poisson}(\lambda)$ and $Y\sim\text{Poisson}(\mu)$ are independent, then $X+Y\sim\text{Poisson}(\lambda+\mu)$.

Common Continuous Distributions

Definition 26

A continuous random variable is uniformly distributed when the PDF of $X$ is constant over a range.

$$X\sim\text{Uniform}(a,b) \implies f_X(x)=\begin{cases}\frac{1}{b-a} & a\leq x\leq b,\\ 0 & \text{else.}\end{cases}$$

The CDF of a uniform distribution is given by

$$F_X(x)=\begin{cases}0 & x<a,\\ \frac{x-a}{b-a} & x\in[a,b),\\ 1 & x\geq b.\end{cases}$$

Definition 27

A continuous random variable is exponentially distributed when its pdf is given by

$$X\sim\text{Exp}(\lambda) \implies f_X(x)=\begin{cases}\lambda e^{-\lambda x} & x\geq 0,\\ 0 & \text{else.}\end{cases}$$

Exponential random variables are the only continuous random variables that have the memoryless property:

$$\Pr\{X>t+s \mid X>s\}=\Pr\{X>t\}, \quad s,t\geq 0.$$

The CDF of the exponential distribution is given by

$$F_X(x)=\lambda\int_0^{x}e^{-\lambda u}\,du=1-e^{-\lambda x}, \quad x\geq 0.$$

Definition 28

$X$ is a Gaussian random variable with mean $\mu$ and variance $\sigma^2$ (denoted $X\sim\mathcal{N}(\mu,\sigma^2)$) if it has the PDF

$$f_X(x)=\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{(x-\mu)^2}{2\sigma^2}}.$$

The standard normal is $X\sim\mathcal{N}(0,1)$, and it has the CDF

$$\Phi(x)=\frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{-\frac{u^2}{2}}\,du.$$

There is no closed form for $\Phi(x)$. It turns out that every normal random variable can be transformed into the standard normal (i.e., $\frac{X-\mu}{\sigma}\sim\mathcal{N}(0,1)$); see the sketch after the list below. Some facts about Gaussian random variables are

  1. If $X\sim\mathcal{N}(\mu_x,\sigma_x^2)$ and $Y\sim\mathcal{N}(\mu_y,\sigma_y^2)$ are independent, then $X+Y\sim\mathcal{N}(\mu_x+\mu_y,\ \sigma_x^2+\sigma_y^2)$.

  2. If $X,Y$ are independent and $(X+Y),(X-Y)$ are independent, then both $X$ and $Y$ are Gaussian with the same variance.
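
Standardization is how Gaussian probabilities are computed in practice: $\Pr\{a<X\leq b\}=\Phi\left(\frac{b-\mu}{\sigma}\right)-\Phi\left(\frac{a-\mu}{\sigma}\right)$. A minimal sketch, assuming SciPy's `norm.cdf` is used for $\Phi$ since $\Phi$ has no closed form:

```python
from scipy.stats import norm

mu, sigma = 5.0, 2.0
a, b = 4.0, 8.0
# Pr{a < X <= b} = Phi((b - mu)/sigma) - Phi((a - mu)/sigma)
prob = norm.cdf((b - mu) / sigma) - norm.cdf((a - mu) / sigma)
print(prob)   # Pr{4 < X <= 8} for X ~ N(5, 4)
```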

Jointly Gaussian Random Variables

Jointly Gaussian random variables, also known as Gaussian vectors, can be defined in a variety of ways.

Definition 29

A Gaussian random vector $\boldsymbol{X}=\begin{bmatrix}X_1 & \cdots & X_n\end{bmatrix}^T$ with density on $\mathbb{R}^n$, $\text{Cov}(\boldsymbol{X})=\Sigma$, and $\mathbb{E}[\boldsymbol{X}]=\boldsymbol{\mu}$ is defined by the PDF

$$f_{\boldsymbol{X}}(\boldsymbol{x})=\frac{1}{\sqrt{(2\pi)^n\det(\Sigma)}}e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}.$$

Definition 30

A jointly Gaussian random vector is an affine transformation of independent and identically distributed standard normals:

$$\boldsymbol{X}=\boldsymbol{\mu}+A\boldsymbol{W},$$

where $A=\Sigma^{1/2}$ is a full-rank matrix and $\boldsymbol{W}$ is a vector of i.i.d. standard normals.
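
This definition gives a direct recipe for sampling a Gaussian vector. A minimal sketch (assuming NumPy), using a Cholesky factor as one valid choice of $A$ with $AA^T=\Sigma$:

```python
import numpy as np

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.8],
                  [0.8, 1.0]])
A = np.linalg.cholesky(Sigma)       # one valid A with A @ A.T == Sigma

W = np.random.standard_normal((2, 100_000))
X = mu[:, None] + A @ W             # columns are i.i.d. samples of X

print(X.mean(axis=1))               # approximately mu
print(np.cov(X))                    # approximately Sigma
```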

Definition 31

A random vector is jointly Gaussian if all 1D projections are Gaussian:

$$\boldsymbol{a}^T\boldsymbol{X}\sim\mathcal{N}(\boldsymbol{a}^T\boldsymbol{\mu},\ \boldsymbol{a}^T\Sigma\boldsymbol{a}), \quad \forall \boldsymbol{a}\in\mathbb{R}^n.$$

In addition to their many definitions, jointly Gaussian random variables also have interesting properties.

Theorem 11

If $\boldsymbol{X}$ and $\boldsymbol{Y}$ are jointly Gaussian random vectors, then

$$\boldsymbol{X}=\boldsymbol{\mu}_{\boldsymbol{X}}+\Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y}-\boldsymbol{\mu}_{\boldsymbol{Y}})+\boldsymbol{V} \text{ where } \boldsymbol{V}\sim\mathcal{N}(\boldsymbol{0},\ \Sigma_{\boldsymbol{X}}-\Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}\Sigma_{\boldsymbol{YX}}).$$

Theorem 11 tells us that each entry in a Gaussian vector can be thought of as a “noisy” version of the others.

Hilbert Spaces of Random Variables

One way to understand random variables is through linear algebra by thinking of them as vectors in a vector space.

Definition 32

A real inner product space is a vector space $V$ over a real scalar field equipped with an inner product $\langle\cdot,\cdot\rangle$ that satisfies, $\forall u,v,w\in V$ and $a,b\in\mathbb{R}$,

  1. $\langle u,v\rangle=\langle v,u\rangle$,

  2. $\langle au+bv,w\rangle=a\langle u,w\rangle+b\langle v,w\rangle$,

  3. $\langle u,u\rangle\geq 0$ and $\langle u,u\rangle=0 \Leftrightarrow u=0$.

Inner product spaces are equipped with the norm $\|v\|=\sqrt{\langle v,v\rangle}$.

Definition 33

A Hilbert Space is a real inner product space that is complete with respect to its norm.

Loosely, completeness means that we can take limits without leaving the space. It turns out that random variables satisfy the definition of a Hilbert Space.

Theorem 12

Let $(\Omega,\mathcal{F},P)$ be a probability space. The collection of random variables $X$ with $\mathbb{E}[X^2]<\infty$ on this probability space forms a Hilbert Space with respect to the inner product $\langle X,Y\rangle=\mathbb{E}[XY]$.

Hilbert spaces are important because they provide a notion of geometry that is compatible with our intuition as well as with the geometry of $\mathbb{R}^n$ (which is a Hilbert Space). One geometric idea is that of orthogonality: two vectors are orthogonal if $\langle X,Y\rangle=0$. For example, two zero-mean, uncorrelated random variables are orthogonal. Using orthogonality, we can also define projections.

Theorem 13 (Hilbert Projection Theorem)

Let $\mathcal{H}$ be a Hilbert Space and $\mathcal{U}\subseteq\mathcal{H}$ be a closed subspace. For each vector $v\in\mathcal{H}$, $\text{argmin}_{u\in\mathcal{U}}\|u-v\|$ has a unique solution (there is a unique closest point $u\in\mathcal{U}$ to $v$). Moreover, $u$ is the closest point to $v$ if and only if $\forall u'\in\mathcal{U},\ \langle u-v,u'\rangle=0$.

Theorem 13 is what gives rise to important properties like the Pythagorean Theorem, which holds in any Hilbert Space:

$$\|u\|^2+\|u-v\|^2=\|v\|^2 \text{ where } u=\text{argmin}_{u\in\mathcal{U}}\|u-v\|.$$

Suppose we have two random variables $X$ and $Y$. What happens if we try to project one onto the other?

Definition 34

The conditional expectation of $X$ given $Y$ is the bounded continuous function of $Y$ such that $X-\mathbb{E}[X|Y]$ is orthogonal to all other bounded continuous functions $\phi(Y)$:

$$\forall\phi,\ \mathbb{E}[(X-\mathbb{E}[X|Y])\phi(Y)]=0.$$

Thus, the conditional expectation is the function of $Y$ that is closest to $X$. Its interpretation is that the expectation of $X$ can change after observing some other random variable $Y$. To find $\mathbb{E}[X|Y]$, we can use the conditional distribution of $X$ given $Y$.

Theorem 14

The conditional expectation can be computed from the conditional distribution:

$$\mathbb{E}[X|Y=y]=\sum_{x\in\mathcal{X}}x\,p_{X|Y}(x|y), \qquad \mathbb{E}[X|Y=y]=\int_{-\infty}^{\infty}x\,f_{X|Y}(x|y)\,dx.$$

Notice that $\mathbb{E}[X|Y]$ is a function of the random variable $Y$, meaning we can apply Theorem 6.

Theorem 15 (Tower Property)

For all functions $f$,

$$\mathbb{E}[f(Y)X]=\mathbb{E}[f(Y)\,\mathbb{E}[X|Y]].$$

Alternatively, we could apply linearity of expectation to Definition 34 to arrive at the same result. If we apply Theorem 15 to the function $f(Y)=1$, then we can see that $\mathbb{E}[\mathbb{E}[X|Y]]=\mathbb{E}[X]$.

Just as expectation can change when we know additional information, so can variance.

Definition 35

Conditional variance is the variance of $X$ given the value of $Y$.

$$\text{Var}(X|Y=y)=\mathbb{E}[(X-\mathbb{E}[X|Y=y])^2 \mid Y=y]=\mathbb{E}[X^2|Y=y]-\mathbb{E}[X|Y=y]^2.$$

Conditional variance is a random variable, just as conditional expectation is.

Theorem 16 (Law of Total Variance)

$$\text{Var}(X)=\mathbb{E}[\text{Var}(X|Y)]+\text{Var}(\mathbb{E}[X|Y]).$$

The first term $\mathbb{E}[\text{Var}(X|Y)]$ can be interpreted as how much uncertainty remains in $X$, on average, once we know $Y$, while the second term $\text{Var}(\mathbb{E}[X|Y])$ measures how much the conditional mean of $X$ varies with $Y$; the sketch below checks the decomposition numerically.
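
A simulation sketch (assuming NumPy) of both the tower property and the law of total variance, for the hypothetical hierarchical model $Y\sim\text{Uniform}(\{1,\cdots,6\})$ and $X\mid Y=y\sim\text{Binomial}(y,1/2)$:

```python
import numpy as np

rng = np.random.default_rng(0)
Y = rng.integers(1, 7, size=500_000)     # Y ~ Uniform{1, ..., 6}
X = rng.binomial(Y, 0.5)                 # X | Y = y ~ Binomial(y, 1/2)

# Tower property: E[X] = E[E[X|Y]] = E[Y/2] = 1.75
print(X.mean())

# Law of total variance: Var(X) = E[Var(X|Y)] + Var(E[X|Y])
cond_var_mean = (Y * 0.25).mean()        # E[Var(X|Y)], since Var(X|Y) = Y/4
var_cond_mean = (Y * 0.5).var()          # Var(E[X|Y]), since E[X|Y] = Y/2
print(X.var(), cond_var_mean + var_cond_mean)
```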
