# Random Variables and their Distributions

The condition in Definition 5 is necessary to compute $P(X\leq \alpha)$ for all $\alpha\in\mathbb{R}$. This requirement also lets us compute $P(X\in B)$ for most sets $B$ by leveraging the fact that $\mathcal{F}$ is closed under complements, unions, and intersections. For example, we can also compute $P(X > \alpha)$ and $P(\alpha < X \leq \beta)$. In this sense, the property binds the probability space to the random variable.

Definition 5 also implies that random variables satisfy particular algebraic properties. For example, if $X,Y$ are random variables, then so are $X+Y, XY, X^p, \lim_{n\to\infty}X_n$, etc.

Although random variables are defined based on a probability space, it is often most natural to model problems without explicitly specifying the probability space. This works so long as we specify the random variables and their distributions in a “consistent” way. This is formalized by the so-called Kolmogorov Extension Theorem, a technicality that can largely be ignored.

## Distributions

Roughly speaking, the distribution of a random variable gives an idea of the likelihood that a random variable takes a particular value or set of values.

Note that $\sum_{x\in\mathcal{X}}p_X(x) = 1$ since the events $\{w: X(w) = x\}$ are disjoint and $\bigcup_{x\in\mathcal{X}}\{w: X(w) = x\} = \Omega$.

Continuous random variables are largely similar to discrete random variables. One key difference is that instead of being described by a probability “mass”, they are described by a probability “density”.

#### Definition 9

The probability density function (distribution) of a continuous random variable describes the density by which a random variable takes a particular value.

$f_X: \mathbb{R}\to [0, \infty) \text{ where } \int_{-\infty}^{\infty}f_X(x)dx = 1 \text{ and } \text{Pr}\left\{X\in B\right\} = \int_B f_X(x)dx$

Observe that if a random variable $X$ is continuous, then the probability that it takes on a particular value is zero.

$\text{Pr}\left\{X=x\right\} = \lim_{\delta\to0} \text{Pr}\left\{x \leq X \leq x +\delta\right\} = \lim_{\delta\to 0}\int_x^{x+\delta}f_X(u)du = \int_{x}^{x}f_X(u)du = 0$

Note that by the Kolmogorov axioms, $F_X$ must satisfy three properties:

$F_X$ is non-decreasing.

$\lim_{x\to-\infty} F_X(x) = 0$ and $\lim_{x\to\infty} F_X(x) = 1$.

$F_X$ is right continuous.

It turns out that if we have any function $F_X$ that satisfies these three properties, then it is the CDF of some random variable on some probability space. Note that $F_X(x)$ gives us an alternative way to define continuous random variables. If $F_X(x)$ is absolutely continuous, then it can be expressed as

$F_X(x) = \int_{-\infty}^{x}f_X(u)du$

for some non-negative function $f_X(x)$, and this is the PDF of a continuous random variable.

Often, when modeling problems, there are multiple random variables that we want to keep track of.

Note that it is possible for $X$ to be continuous and $Y$ to be discrete (or vice versa).

Just like independence, we can extend the notion of conditional probability to random variables.

Often, we need to combine or transform several random variables. A derived distribution is the distribution obtained by doing arithmetic on several random variables or by applying a function to one or more random variables. Since the CDF of a distribution essentially defines that random variable, it is often easiest to compute the CDF first and then work backwards to the PDF or PMF. In the special case where we want the distribution of $Y=g(X)$ for a function $g$, we have

$F_Y(y) =\text{Pr}\left\{Y \leq y\right\} = \text{Pr}\left\{g(X) \leq y\right\} = \text{Pr}\left\{X \in g^{-1}((-\infty, y])\right\} , \quad g^{-1}(B) = \{ x: g(x) \in B \}.$
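As a quick sanity check of the CDF method, the sketch below takes a hypothetical case that is not from the text: $X\sim\text{Uniform}(0,1)$ and $g(x) = x^2$. For this choice the derived CDF is $\text{Pr}\left\{X \leq \sqrt{y}\right\} = \sqrt{y}$ on $[0, 1]$, and the code compares it with an empirical CDF from simulation.

```python
import random

rng = random.Random(0)  # fixed seed so the run is repeatable
n = 100_000

# Assumed example: X ~ Uniform(0, 1) and Y = g(X) with g(x) = x^2.
samples = [rng.random() ** 2 for _ in range(n)]

def empirical_cdf(y):
    """Fraction of simulated values of Y that are <= y."""
    return sum(s <= y for s in samples) / n

def derived_cdf(y):
    """CDF obtained from the formula: Pr{X <= sqrt(y)} = sqrt(y) for 0 <= y <= 1."""
    return y ** 0.5

for y in (0.04, 0.25, 0.81):
    print(y, empirical_cdf(y), derived_cdf(y))
```

Differentiating the derived CDF gives the density $\frac{1}{2\sqrt{y}}$ on $(0, 1]$, which is the "work backwards" step described above.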

Another special case of a derived distribution is when adding random variables together.
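For independent discrete random variables, the PMF of the sum is the convolution of the individual PMFs. The sketch below illustrates this with two fair dice; the dice and the helper name `convolve` are illustrative assumptions, not part of the original notes.

```python
from fractions import Fraction

def convolve(p, q):
    """PMF of X + Y for independent X ~ p and Y ~ q (dicts mapping value -> probability)."""
    out = {}
    for x, px in p.items():
        for y, py in q.items():
            out[x + y] = out.get(x + y, Fraction(0)) + px * py
    return out

# Assumed example: two independent fair six-sided dice.
die = {k: Fraction(1, 6) for k in range(1, 7)}
two_dice = convolve(die, die)

print(two_dice[7])             # 1/6
print(sum(two_dice.values()))  # 1
```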

## Properties of Distributions

### Expectation

Expectation has several useful properties. If we want to compute the expectation of a function of a random variable, then we can use the law of the unconscious statistician.

Another useful property is its linearity.

$\mathbb{E}\left[aX+bY\right] = a\mathbb{E}\left[X\right] +b\mathbb{E}\left[Y\right] ,\ \forall a, b\in\mathbb{R}.$
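Linearity holds even when $X$ and $Y$ are dependent. The sketch below checks this exactly on a small, assumed example: two fair dice with $X$ the first die and $Y$ the maximum of the two, and hypothetical coefficients $a = 2$, $b = -3$.

```python
from fractions import Fraction

# Sample space: two fair dice, each outcome with probability 1/36.
outcomes = [(a, b) for a in range(1, 7) for b in range(1, 7)]
p = Fraction(1, 36)

def expect(f):
    """E[f] under the uniform measure on the two-dice sample space."""
    return sum(f(a, b) * p for a, b in outcomes)

def X(a, b):
    return a            # value of the first die

def Y(a, b):
    return max(a, b)    # depends on the first die, so X and Y are dependent

lhs = expect(lambda a, b: 2 * X(a, b) - 3 * Y(a, b))  # E[2X - 3Y]
rhs = 2 * expect(X) - 3 * expect(Y)                   # 2E[X] - 3E[Y]
print(lhs, rhs)  # -77/12 -77/12
```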

Sometimes it can be difficult to compute expectations directly. For discrete distributions supported on the non-negative integers, we can use the tail-sum formula.
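For a random variable taking values in the non-negative integers, the tail-sum formula reads $\mathbb{E}\left[X\right] = \sum_{k=1}^{\infty}\text{Pr}\left\{X \geq k\right\}$. The sketch below compares it with the direct definition on an assumed example, a fair six-sided die.

```python
from fractions import Fraction

# Assumed example: X is a fair six-sided die (values 1..6).
pmf = {k: Fraction(1, 6) for k in range(1, 7)}

# Direct definition: E[X] = sum_x x * p_X(x).
direct = sum(x * p for x, p in pmf.items())

def tail(k):
    """Pr{X >= k}."""
    return sum(p for x, p in pmf.items() if x >= k)

# Tail-sum formula: E[X] = sum_{k >= 1} Pr{X >= k}; terms past the largest value are zero.
tail_sum = sum(tail(k) for k in range(1, max(pmf) + 1))

print(direct, tail_sum)  # 7/2 7/2
```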

When two random variables are independent, expectation has some additional properties.

Earlier, we saw that we find a derived distribution by transforming and combining random variables. Sometimes, we don’t need to actually compute the distribution, but only some of its properties.

It turns out that we can encode the moments of a distribution into the coefficients of a special power series.

Notice that if we apply the power series expansion of $e^{tX}$, we see that

$M_X(t) = \sum_{n=0}^{\infty}\frac{t^n}{n!}\mathbb{E}\left[X^n\right] .$

Thus the $n$th moment is encoded in the coefficient of $t^n$ in the power series, and we can retrieve it by differentiating $n$ times and evaluating at $t=0$:

$\mathbb{E}\left[X^n\right] = \frac{d^{n}}{dt^{n}}M_X(t)\Big|_{t=0}.$
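To see the derivative rule in action, the sketch below differentiates an MGF numerically at $t = 0$. It assumes $X\sim\text{Bernoulli}(p)$ with a hypothetical $p = 0.3$, whose MGF is $(1-p) + pe^{t}$; since $X^n = X$ for a Bernoulli variable, every moment should come out close to $p$.

```python
import math

p = 0.3  # hypothetical Bernoulli parameter

def mgf(t):
    """MGF of a Bernoulli(p) random variable: E[e^{tX}] = (1 - p) + p * e^t."""
    return (1 - p) + p * math.exp(t)

h = 1e-3  # step size for central finite differences

first_moment = (mgf(h) - mgf(-h)) / (2 * h)               # approximates M'(0) = E[X]
second_moment = (mgf(h) - 2 * mgf(0) + mgf(-h)) / h ** 2  # approximates M''(0) = E[X^2]

print(first_moment, second_moment)  # both close to p
```

Finite differences are only an approximation; with a symbolic MGF one would differentiate exactly instead.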

Another interesting point to notice is that for a continuous random variable

$M_X(t) = \int_{-\infty}^{\infty}f_X(x)e^{tx}dx$

is the two-sided Laplace transform of the density evaluated at $s=-t$, and for a discrete random variable,

$M_X(t) = \sum_{x=-\infty}^{\infty}p_X(x)e^{tx}$

is the Z-transform of the PMF evaluated at $z = e^{-t}$.

This provides another way to compute the distribution of a sum of independent random variables, because the MGF of the sum is just the product of their MGFs.

### Variance

#### Definition 19

The covariance of two random variables describes how strongly they vary together (their linear dependence) and is given by

$\text{Cov}\left(X, Y\right) = \mathbb{E}\left[(X-\mathbb{E}\left[X\right] )(Y-\mathbb{E}\left[Y\right] )\right] = \mathbb{E}\left[XY\right] - \mathbb{E}\left[X\right] \mathbb{E}\left[Y\right] .$

If $\text{Cov}\left(X,Y\right) = 0$ then $X$ and $Y$ are uncorrelated.
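Uncorrelated does not imply independent. The sketch below uses a standard counterexample, chosen here for illustration: $X$ uniform on $\{-1, 0, 1\}$ and $Y = X^2$. The covariance is exactly zero even though $Y$ is a function of $X$.

```python
from fractions import Fraction

# Assumed example: X uniform on {-1, 0, 1} and Y = X^2.
pmf_x = {-1: Fraction(1, 3), 0: Fraction(1, 3), 1: Fraction(1, 3)}

e_x = sum(x * p for x, p in pmf_x.items())            # E[X]
e_y = sum(x ** 2 * p for x, p in pmf_x.items())       # E[Y] = E[X^2]
e_xy = sum(x * x ** 2 * p for x, p in pmf_x.items())  # E[XY] = E[X^3]
cov = e_xy - e_x * e_y

# Dependence check: Pr{X = 0, Y = 0} versus Pr{X = 0} * Pr{Y = 0}.
p_joint = pmf_x[0]               # Y = 0 exactly when X = 0
p_product = pmf_x[0] * pmf_x[0]  # Pr{Y = 0} equals Pr{X = 0}

print(cov)                 # 0
print(p_joint, p_product)  # 1/3 1/9
```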

Note that $-1\leq \rho \leq 1$.

## Common Discrete Distributions

Bernoulli random variables are good for modeling things like a coin flip where there is a probability of success. Bernoulli random variables are frequently used as indicator random variables $\mathbb{1}_A$ where

$\mathbb{1}_A = \begin{cases} 1 & \text{if A occurs,}\\ 0 & \text{ else.} \end{cases}$

When paired with the linearity of expectation, this can be a powerful method of computing the expectation of something.
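A classic illustration of the indicator method, used here as an assumed example, is counting fixed points of a uniformly random permutation of $n$ items. Writing the count as $\sum_{i}\mathbb{1}_{A_i}$, where $A_i$ is the event that item $i$ stays in place, each indicator has expectation $\frac{1}{n}$, so the expected count is $1$. The sketch below compares this with brute-force enumeration for a hypothetical $n = 4$.

```python
from fractions import Fraction
from itertools import permutations

n = 4  # hypothetical number of items
perms = list(permutations(range(n)))

# Brute force: average number of fixed points over all n! equally likely permutations.
total_fixed = sum(sum(perm[i] == i for i in range(n)) for perm in perms)
brute_force = Fraction(total_fixed, len(perms))

# Indicator method: E[sum_i 1_{A_i}] = sum_i Pr(A_i), with Pr(A_i) = 1/n for each i.
by_indicators = sum(Fraction(1, n) for _ in range(n))

print(brute_force, by_indicators)  # 1 1
```

The indicators here are not independent, which is exactly why linearity of expectation is the convenient tool.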

A binomial random variable can be thought of as the number of successes in $n$ trials. In other words,

$X \sim \text{Binomial}(n, p) \implies X = \sum_{i=1}^{n}X_i, \quad X_i \sim \text{Bernoulli}(p) \text{ independent}.$

By construction, if $X\sim\text{Binomial}(n, p)$ and $Y\sim\text{Binomial}(m, p)$ are independent, then $X+Y \sim \text{Binomial}(m+n, p)$.

Geometric random variables are useful for modeling the number of trials required to obtain the first success (counting the successful trial itself). In other words,

$X \sim \text{Geom}(p) \implies X = \min\{k \geq 1: X_k=1 \} \text{ where } X_i\sim \text{Bernoulli}(p).$

A useful property of geometric random variables is that they are memoryless:

$\text{Pr}\left\{X=k+m|X>k\right\} = \text{Pr}\left\{X=m\right\} .$
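The memoryless property can be verified exactly from the geometric PMF $\text{Pr}\left\{X = k\right\} = (1-p)^{k-1}p$. The sketch below does so with exact fractions; the values of $p$, $k$, and $m$ are arbitrary choices for illustration.

```python
from fractions import Fraction

p = Fraction(1, 4)  # hypothetical success probability

def pmf(k):
    """Pr{X = k} = (1 - p)^(k - 1) * p for k >= 1."""
    return (1 - p) ** (k - 1) * p

def tail(k):
    """Pr{X > k} = (1 - p)^k: the first k trials all fail."""
    return (1 - p) ** k

k, m = 5, 3
conditional = pmf(k + m) / tail(k)  # Pr{X = k + m | X > k}

print(conditional == pmf(m))  # True
```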

Poisson random variables are good for modeling the number of arrivals in a given interval. Suppose you take a given time interval and divide it into $n$ chunks, where an arrival in chunk $i$ is indicated by $X_i \sim \text{Bernoulli}(p_n)$, independently across chunks. Then the total number of arrivals $S_n = \sum_{i=1}^{n}X_i$ is distributed as a Binomial random variable with expectation $np_n=\lambda$. As we increase $n$ to infinity but keep $\lambda$ fixed, we arrive at the Poisson distribution.
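The limiting argument can be checked numerically. The sketch below compares the Binomial$(n, \lambda/n)$ PMF of the arrival count with the Poisson PMF $e^{-\lambda}\frac{\lambda^k}{k!}$ for a large $n$; the values $\lambda = 2$ and $n = 10{,}000$ are arbitrary choices for illustration.

```python
import math

lam = 2.0   # hypothetical arrival rate, held fixed
n = 10_000  # number of chunks
p_n = lam / n

def binomial_pmf(k):
    """Pr{k arrivals} when the arrival count is Binomial(n, p_n)."""
    return math.comb(n, k) * p_n ** k * (1 - p_n) ** (n - k)

def poisson_pmf(k):
    """Pr{X = k} for X ~ Poisson(lam)."""
    return math.exp(-lam) * lam ** k / math.factorial(k)

for k in range(6):
    print(k, binomial_pmf(k), poisson_pmf(k))
```

The two columns should agree to roughly three decimal places, and the gap shrinks as $n$ grows.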

A useful fact about Poisson random variables is that if $X\sim\text{Poisson}(\lambda)$ and $Y\sim\text{Poisson}(\mu)$ are independent, then $X+Y \sim \text{Poisson}(\lambda + \mu)$.

## Common Continuous Distributions

The CDF of a uniform distribution is given by

$F_X(x) = \begin{cases} 0, & x < a,\\ \frac{x-a}{b-a}, & x\in[a, b),\\ 1, & x \geq b. \end{cases}$

Exponential random variables are the only continuous random variables with the memoryless property:

$\text{Pr}\left\{X > t+s | X > s\right\} = \text{Pr}\left\{X > t\right\} , \quad s, t \geq 0.$

The CDF of the exponential distribution is given by

$F_X(x) = \lambda \int_0^{x}e^{-\lambda u}du = 1 - e^{-\lambda x}, \quad x \geq 0.$
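With the CDF in hand, the memoryless property follows from $\text{Pr}\left\{X > x\right\} = e^{-\lambda x}$. The sketch below checks it numerically; the rate and the values of $s$ and $t$ are arbitrary choices for illustration.

```python
import math

lam = 0.5  # hypothetical rate parameter

def cdf(x):
    """F_X(x) = 1 - e^{-lam * x} for x >= 0."""
    return 1 - math.exp(-lam * x)

def survival(x):
    """Pr{X > x} = 1 - F_X(x)."""
    return 1 - cdf(x)

s, t = 2.0, 3.0
conditional = survival(t + s) / survival(s)  # Pr{X > t + s | X > s}

print(conditional, survival(t))  # equal up to floating-point rounding
```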

The standard normal is $X\sim\mathcal{N}(0, 1)$, and it has the CDF

$\Phi(x) = \frac{1}{\sqrt{2\pi}}\int_{-\infty}^{x}e^{\frac{-u^2}{2}} du$

There is no closed form for $\Phi(x)$. It turns out that every normal random variable $X\sim\mathcal{N}(\mu, \sigma^2)$ can be transformed into the standard normal (i.e. $\frac{X - \mu}{\sigma} \sim \mathcal{N}(0, 1)$). Some facts about Gaussian random variables are:

If $X\sim\mathcal{N}(\mu_x, \sigma_x^2),\ Y\sim\mathcal{N}(\mu_y, \sigma_y^2)$ are independent, then $X+Y \sim \mathcal{N}(\mu_x+\mu_y, \sigma_x^2 + \sigma_y^2)$.

If $X,Y$ are independent and $(X+Y), (X-Y)$ are independent, then both $X$ and $Y$ are Gaussian with the same variance.
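Although $\Phi$ has no closed form, it can be written in terms of the error function as $\Phi(x) = \frac{1}{2}\left(1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right)$, and standardization reduces any normal probability to an evaluation of $\Phi$. The sketch below checks this against Python's `statistics.NormalDist`; the mean, standard deviation, and query point are arbitrary choices for illustration.

```python
import math
from statistics import NormalDist

def phi(x):
    """Standard normal CDF written via the error function."""
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

mu, sigma = 10.0, 2.0  # hypothetical mean and standard deviation
x = 13.0

via_standardization = phi((x - mu) / sigma)  # Pr{X <= x} = Phi((x - mu) / sigma)
via_library = NormalDist(mu, sigma).cdf(x)   # same probability from the standard library

print(via_standardization, via_library)
```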

### Jointly Gaussian Random Variables

Jointly Gaussian Random Variables, also known as Gaussian Vectors, can be defined in a variety of ways.

#### Definition 29

A Gaussian Random Vector $\boldsymbol{X} = \begin{bmatrix} X_1 & \cdots & X_n \end{bmatrix}^T$ with density on $\mathbb{R}^n$, $\text{Cov}\left(\boldsymbol{X}\right) =\Sigma$, and $\mathbb{E}\left[\boldsymbol{X}\right] =\boldsymbol{\mu}$ is defined by the PDF

$f_{\boldsymbol{X}}(\boldsymbol{x}) = \frac{1}{\sqrt{(2\pi)^n\text{det}(\Sigma)}}e^{-\frac{1}{2}(\boldsymbol{x}-\boldsymbol{\mu})^T\Sigma^{-1}(\boldsymbol{x}-\boldsymbol{\mu})}$

In addition to their many definitions, jointly Gaussian random variables also have interesting properties.

#### Theorem 11

If $\boldsymbol{X}$ and $\boldsymbol{Y}$ are jointly Gaussian random vectors, then

$\boldsymbol{X} = \boldsymbol{\mu_X}+\Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y} - \boldsymbol{\mu_Y}) + \boldsymbol{V} \text{ where } \boldsymbol{V} \sim \mathcal{N}(0, \Sigma_{\boldsymbol{X}}-\Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}\Sigma_{\boldsymbol{YX}}) \text{ is independent of } \boldsymbol{Y}$

Theorem 11 tells us that each entry in a Gaussian vector can be thought of as a “noisy” version of the others.
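In the scalar case the decomposition reads $X = \mu_X + \frac{\sigma_{XY}}{\sigma_Y^2}(Y - \mu_Y) + V$ with $\text{Var}\left(V\right) = \sigma_X^2 - \frac{\sigma_{XY}^2}{\sigma_Y^2}$. The simulation below is a rough sketch of this under an assumed model, $Y\sim\mathcal{N}(0,1)$ and $X = 2Y + W$ with independent $W\sim\mathcal{N}(0,1)$, so the gain should be near $2$ and the noise variance near $1$.

```python
import random

rng = random.Random(1)  # fixed seed for repeatability
n = 50_000

# Assumed jointly Gaussian pair: Y ~ N(0, 1), X = 2Y + W with W ~ N(0, 1) independent of Y.
ys = [rng.gauss(0.0, 1.0) for _ in range(n)]
xs = [2.0 * y + rng.gauss(0.0, 1.0) for y in ys]

def mean(v):
    return sum(v) / len(v)

def cov(a, b):
    """Sample covariance (dividing by n, which is fine for a sketch)."""
    ma, mb = mean(a), mean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / len(a)

mx, my = mean(xs), mean(ys)
gain = cov(xs, ys) / cov(ys, ys)                         # estimate of sigma_XY / sigma_Y^2
vs = [x - mx - gain * (y - my) for x, y in zip(xs, ys)]  # the "noise" term V

predicted_var = cov(xs, xs) - cov(xs, ys) ** 2 / cov(ys, ys)  # sigma_X^2 - sigma_XY^2 / sigma_Y^2
print(gain, cov(vs, vs), predicted_var, cov(vs, ys))
```

The sample covariance between the residual and $Y$ is zero by construction here; the theorem's stronger statement is that for jointly Gaussian variables the residual is actually independent of $Y$.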

## Hilbert Spaces of Random Variables

One way to understand random variables is through linear algebra by thinking of them as vectors in a vector space.

#### Definition 32

A real inner product space is composed of a vector space $V$ over the real scalar field equipped with an inner product $\langle \cdot,\cdot \rangle$ that satisfies, $\forall u,v,w\in V$, $a, b\in\mathbb{R}$,

$\langle u, v \rangle = \langle v, u \rangle$

$\langle au + bv, w \rangle = a \langle u,w \rangle + b \langle v,w \rangle$

$\langle u,u \rangle \geq 0$ and $\langle u,u \rangle = 0 \Leftrightarrow u = 0$

Inner product spaces are equipped with the norm $\|v\| = \sqrt{\langle v, v \rangle }$.

Loosely, completeness means that we can take limits without leaving the space. It turns out that random variables with finite second moments form a Hilbert space under the inner product $\langle X, Y\rangle = \mathbb{E}\left[XY\right]$.

Hilbert spaces are important because they provide a notion of geometry that is compatible with our intuition as well as the geometry of $\mathbb{R}^n$ (which is a Hilbert Space). One geometric idea is that of orthogonality. Two vectors are orthogonal if $\langle X, Y\rangle = 0$. Two random variables will be orthogonal if they are zero-mean and uncorrelated. Using orthogonality, we can also define projections.

#### Theorem 13 (Hilbert Projection Theorem)

Let $\mathcal{H}$ be a Hilbert Space and $\mathcal{U} \subseteq \mathcal{H}$ be a closed subspace. For each vector $v\in\mathcal{H}$, $\text{argmin}_{u\in\mathcal{U}} \|u-v\|$ has a unique solution (there is a unique closest point $u\in\mathcal{U}$ to $v$). If $u$ is the closest point to $v$, then $\forall u'\in\mathcal{U},\ \langle u-v, u'\rangle = 0$.

Theorem 13 is what gives rise to important properties like the Pythagorean Theorem for any Hilbert Space.

$\|u\|^2 + \|u-v\|^2 = \|v\|^2 \text{ where } u=\text{argmin}_{u'\in\mathcal{U}}\|u'-v\|.$
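The same geometry is easy to check in $\mathbb{R}^3$. The sketch below projects a vector $v$ onto the one-dimensional subspace spanned by a vector $a$ (both chosen arbitrarily for illustration), then verifies that the error $u - v$ is orthogonal to the subspace and that $\|u\|^2 + \|u-v\|^2 = \|v\|^2$.

```python
from fractions import Fraction

def inner(a, b):
    """Standard inner product on R^n."""
    return sum(ai * bi for ai, bi in zip(a, b))

# Assumed example in R^3: project v onto the subspace U = span{a}.
v = [Fraction(3), Fraction(4), Fraction(0)]
a = [Fraction(1), Fraction(1), Fraction(1)]

coef = inner(v, a) / inner(a, a)
u = [coef * ai for ai in a]                # closest point in U to v
error = [ui - vi for ui, vi in zip(u, v)]  # u - v

print(inner(error, a))                                 # 0: the error is orthogonal to U
print(inner(u, u) + inner(error, error), inner(v, v))  # 25 25
```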

Suppose we had two random variables $X$ and $Y$. What happens if we try to project one onto the other?

Thus, the conditional expectation is the function of $Y$ that is closest to $X$. Its interpretation is that the expectation of $X$ can change after observing some other random variable $Y$. To find $\mathbb{E}\left[X|Y\right]$, we can use the conditional distribution of $X$ given $Y$.

Notice that $\mathbb{E}\left[X|Y\right]$ is a function of the random variable $Y$, meaning we can apply Theorem 6.

Alternatively, we could apply linearity of expectation to Definition 34 to arrive at the same result. If we apply Theorem 15 to the function $f(Y) = 1$, then we can see that $\mathbb{E}\left[\mathbb{E}\left[X|Y\right] \right] = \mathbb{E}\left[X\right]$.
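The identity $\mathbb{E}\left[\mathbb{E}\left[X|Y\right] \right] = \mathbb{E}\left[X\right]$ can be checked by hand on a small joint PMF. The sketch below uses made-up probabilities purely for illustration and computes $\mathbb{E}\left[X|Y=y\right]$ from the conditional PMF.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p_{X,Y}(x, y), stored as {(x, y): probability}.
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}

def p_y(y):
    """Marginal Pr{Y = y}."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_exp_x(y):
    """E[X | Y = y], computed from the conditional PMF p_{X|Y}(x | y)."""
    return sum(x * p for (x, yy), p in joint.items() if yy == y) / p_y(y)

e_x = sum(x * p for (x, _), p in joint.items())                      # E[X]
tower = sum(cond_exp_x(y) * p_y(y) for y in {y for (_, y) in joint})  # E[E[X|Y]]

print(e_x, tower)  # 7/8 7/8
```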

Just as expectation can change when we know additional information, so can variance.

Conditional variance is a random variable, just as conditional expectation is.

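The second term in the law of total variance ($\text{Var}\left(\mathbb{E}\left[X|Y\right] \right)$) can be interpreted as how much of the uncertainty in $X$ is explained by $Y$, since it measures how much the conditional mean of $X$ moves as $Y$ varies. The other term, $\mathbb{E}\left[\text{Var}\left(X|Y\right)\right]$, is how much uncertainty remains in $X$ on average once we know $Y$.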
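The law of total variance, $\text{Var}\left(X\right) = \mathbb{E}\left[\text{Var}\left(X|Y\right)\right] + \text{Var}\left(\mathbb{E}\left[X|Y\right] \right)$, can be verified exactly on a small joint PMF. The probabilities below are made up for illustration.

```python
from fractions import Fraction as F

# Hypothetical joint PMF p_{X,Y}(x, y), stored as {(x, y): probability}.
joint = {(0, 0): F(1, 8), (1, 0): F(3, 8), (0, 1): F(1, 4), (2, 1): F(1, 4)}
y_values = {y for (_, y) in joint}

def p_y(y):
    """Marginal Pr{Y = y}."""
    return sum(p for (_, yy), p in joint.items() if yy == y)

def cond_moment(y, k):
    """E[X^k | Y = y]."""
    return sum(x ** k * p for (x, yy), p in joint.items() if yy == y) / p_y(y)

def cond_var(y):
    """Var(X | Y = y)."""
    return cond_moment(y, 2) - cond_moment(y, 1) ** 2

e_x = sum(x * p for (x, _), p in joint.items())
var_x = sum(x ** 2 * p for (x, _), p in joint.items()) - e_x ** 2

remaining = sum(cond_var(y) * p_y(y) for y in y_values)                     # E[Var(X|Y)]
explained = sum((cond_moment(y, 1) - e_x) ** 2 * p_y(y) for y in y_values)  # Var(E[X|Y])

print(var_x, remaining, explained)  # 39/64 19/32 1/64
```

In this example most of the variance of $X$ remains after observing $Y$, so $Y$ explains only a small share of it.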
