Estimation

Whereas hypothesis testing is about discriminating between two or more hypotheses, estimation is about guessing the numerical value or ground truth of a random variable.

To measure the quality of an estimate, we need an error metric. One commonly used metric is the mean squared error

$$\mathbb{E}\left[(X - \hat{X}(Y))^2\right].$$

Theorem 45

The minimum mean square error (MMSE) estimate of a random variable $X$ given an observation $Y$ is the conditional expectation:

$$\hat{X}(Y) = \mathbb{E}\left[X|Y\right] = \text{argmin}_{\hat{X}} \mathbb{E}\left[(X - \hat{X}(Y))^2\right].$$
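
To spell out the projection argument, expand the error of an arbitrary estimator $g(Y)$ around the conditional expectation:

$$\mathbb{E}\left[(X - g(Y))^2\right] = \mathbb{E}\left[(X - \mathbb{E}\left[X|Y\right])^2\right] + \mathbb{E}\left[(\mathbb{E}\left[X|Y\right] - g(Y))^2\right],$$

where the cross term $2\,\mathbb{E}\left[(X - \mathbb{E}\left[X|Y\right])(\mathbb{E}\left[X|Y\right] - g(Y))\right]$ vanishes because $X - \mathbb{E}\left[X|Y\right]$ is orthogonal to every function of $Y$. The right-hand side is minimized by choosing $g(Y) = \mathbb{E}\left[X|Y\right]$.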

More abstractly, this follows from the definition of conditional expectation: the error $X - \mathbb{E}\left[X|Y\right]$ is orthogonal to every function of $Y$, so by the Hilbert Projection Theorem, $\mathbb{E}\left[X|Y\right]$ must be the projection of $X$ onto the space of all functions of $Y$. There are two problems with using the MMSE all the time.

  1. We often don’t know $p_{Y|X}$ explicitly and only have a model for it.

  2. Even if we knew the model $p_{Y|X}$, conditional expectations are difficult to compute (see the numerical sketch below).
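
To make the second point concrete, here is a minimal sketch of what computing a single conditional expectation involves, under an assumed toy model (the model, function name, and integration limits are illustrative choices, not part of the notes): $X \sim \mathcal{N}(0,1)$ and $Y = X + W$ with $W \sim \mathcal{N}(0,1)$, so $\mathbb{E}\left[X|Y=y\right]$ requires integrating over the posterior.

```python
from scipy import integrate, stats

# Toy model (an assumption for illustration): X ~ N(0, 1), Y = X + W, W ~ N(0, 1).
def mmse_estimate(y, prior=stats.norm(0, 1), noise_std=1.0):
    """Approximate E[X | Y = y] by numerically integrating the posterior."""
    likelihood = lambda x: stats.norm(x, noise_std).pdf(y)   # p_{Y|X}(y | x)
    num, _ = integrate.quad(lambda x: x * likelihood(x) * prior.pdf(x), -10, 10)
    den, _ = integrate.quad(lambda x: likelihood(x) * prior.pdf(x), -10, 10)
    return num / den

# In this jointly Gaussian case the answer happens to be linear: E[X|Y=y] = y/2.
print(mmse_estimate(1.3), 1.3 / 2)
```

Even in one dimension this already requires numerical integration; with a vector $X$ and a less convenient model, the integrals quickly become intractable, which motivates restricting attention to linear estimators.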

Linear Estimation

Since finding the MMSE is often difficult, we can restrict ourselves to functions of a particular type.

Definition 86

The Linear Least Squares Estimator (LLSE) $\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right]$ is the projection of a vector of random variables $\boldsymbol{X}$ onto the subspace of linear functions of the observations $Y_i$, $\mathcal{U} = \left\{ \boldsymbol{a} + B\boldsymbol{Y} \right\}$, where $\boldsymbol{Y}$ is the vector of observations.

By the orthogonality principle,

  1. $\mathbb{E}\left[(\boldsymbol{X} - \mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right])\cdot 1\right] = 0 \implies \mathbb{E}\left[\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right]\right] = \mathbb{E}\left[\boldsymbol{X}\right]$

  2. $\mathbb{E}\left[(\boldsymbol{X} - \mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right])Y_i\right] = 0$

From here, we can derive a closed-form expression for the LLSE. Let $\boldsymbol{\mu_Y} = \mathbb{E}\left[\boldsymbol{Y}\right]$, $\boldsymbol{\mu_X} = \mathbb{E}\left[\boldsymbol{X}\right]$, $\Sigma_{\boldsymbol{Y}} = \mathbb{E}\left[(\boldsymbol{Y}-\boldsymbol{\mu_Y})(\boldsymbol{Y}-\boldsymbol{\mu_Y})^T\right]$, and $\Sigma_{\boldsymbol{XY}} = \mathbb{E}\left[(\boldsymbol{X}-\boldsymbol{\mu_X})(\boldsymbol{Y}-\boldsymbol{\mu_Y})^T\right]$. Substituting $\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{a}+B\boldsymbol{Y}$ into the equations from the orthogonality principle,

$$\begin{aligned} \boldsymbol{a}+B\boldsymbol{\mu_Y} &= \boldsymbol{\mu_X} \\ \boldsymbol{a}(\boldsymbol{\mu_Y})_i + B \mathbb{E}\left[\boldsymbol{Y}Y_i\right] = \mathbb{E}\left[\boldsymbol{X}Y_i\right] &\implies \boldsymbol{a}(\boldsymbol{\mu_Y})_i + B(\Sigma_{\boldsymbol{Y}})_i + B(\boldsymbol{\mu_Y})_i\boldsymbol{\mu_Y} = (\Sigma_{\boldsymbol{XY}})_i + (\boldsymbol{\mu_Y})_i\boldsymbol{\mu_X}\\ &\implies \boldsymbol{a}\boldsymbol{\mu_Y}^T + B\Sigma_{\boldsymbol{Y}}+B\boldsymbol{\mu_Y}\boldsymbol{\mu_Y}^T = \Sigma_{\boldsymbol{XY}}+\boldsymbol{\mu_X}\boldsymbol{\mu_Y}^T\end{aligned}$$

Solving this system yields

$$B = \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1} \qquad \boldsymbol{a} = \boldsymbol{\mu_X} - \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}\boldsymbol{\mu_Y}.$$

Theorem 46

The Linear Least Squares Estimator of a vector of random variables $\boldsymbol{X}$ given a vector of random variables $\boldsymbol{Y}$ is

$$\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{\mu_X} + \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y}-\boldsymbol{\mu_Y})$$

If $X$ and $Y$ are both scalar random variables, this reduces to

$$\mathbb{L}\left[X|Y\right] = \mu_X + \frac{\text{Cov}\left(X, Y\right)}{\text{Var}\left(Y\right)}(Y - \mu_Y)$$
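
As a concrete illustration of Theorem 46, the following sketch estimates the required moments from samples of a made-up joint distribution (the model matrices, noise level, and variable names are illustrative assumptions) and forms the plug-in LLSE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Made-up joint model: X in R^2 with correlated components, Y = H X + noise in R^3.
X = rng.normal(size=(n, 2)) @ np.array([[1.0, 0.5], [0.0, 1.0]])
H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Y = X @ H.T + 0.3 * rng.normal(size=(n, 3))

# Plug-in estimates of mu_X, mu_Y, Sigma_XY, Sigma_Y.
mu_X, mu_Y = X.mean(axis=0), Y.mean(axis=0)
Xc, Yc = X - mu_X, Y - mu_Y
Sigma_XY = Xc.T @ Yc / n            # E[(X - mu_X)(Y - mu_Y)^T]
Sigma_Y = Yc.T @ Yc / n             # E[(Y - mu_Y)(Y - mu_Y)^T]

# Theorem 46: L[X|Y] = mu_X + Sigma_XY Sigma_Y^{-1} (Y - mu_Y).
B = Sigma_XY @ np.linalg.inv(Sigma_Y)
X_hat = mu_X + Yc @ B.T

print("empirical MSE of the LLSE:", np.mean(np.sum((X - X_hat) ** 2, axis=1)))
```

In practice one would solve the linear system $\Sigma_{\boldsymbol{Y}} B^T = \Sigma_{\boldsymbol{XY}}^T$ rather than forming the inverse explicitly; the inverse is kept here to mirror the formula.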

Since the LLSE is essentially a projection onto a linear subspace, if we have an orthogonal basis for the subspace, then we can do the projection one component at a time. The Gram-Schmidt process turns the vectors $Y_1,\cdots,Y_n$ into an orthonormal set $\tilde{Y}_1, \cdots, \tilde{Y}_n$. If we define $Y^{(n)}=(Y_1, \cdots, Y_n)$,

  1. $\tilde{Y}_1 = \frac{Y_1}{\|Y_1\|}$

  2. $\tilde{Y}_{i+1} = Y_{i+1} - \sum_{k=1}^{i}\langle Y_{i+1}, \tilde{Y}_k \rangle \tilde{Y}_k = Y_{i+1} - \mathbb{L}\left[Y_{i+1}|Y^{(i)}\right]$

Definition 87

The linear innovation sequence of random variables $Y_1,\cdots,Y_n$ is the orthogonal set $\tilde{Y}_1, \cdots, \tilde{Y}_n$ produced by Gram-Schmidt.

Since $\tilde{Y}_n$ is orthogonal to $\mathbb{L}\left[Y_n|\tilde{Y}^{(n-1)}\right]$, they belong to orthogonal parts of the subspace spanned by $Y_1,\cdots,Y_n$.

Theorem 47

$$\mathbb{L}\left[X|Y^{(n)}\right] = \mathbb{L}\left[X|\tilde{Y}_n\right] + \mathbb{L}\left[X|\tilde{Y}^{(n-1)}\right]$$
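
Here is a sample-based sketch of Theorem 47, assuming zero-mean random variables (so the constant term drops out) and approximating the inner product $\mathbb{E}\left[UV\right]$ by an average over Monte Carlo samples; the specific variables below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
inner = lambda u, v: np.mean(u * v)       # empirical <U, V> = E[UV]

# Made-up zero-mean observations and target.
Y1 = rng.normal(size=n)
Y2 = 0.8 * Y1 + rng.normal(size=n)
Y3 = -0.5 * Y2 + rng.normal(size=n)
X = Y1 + 2.0 * Y3 + rng.normal(size=n)

# Gram-Schmidt: each innovation is Y_i minus its projection onto earlier innovations.
def innovations(Ys):
    tilde = []
    for Y in Ys:
        tilde.append(Y - sum(inner(Y, T) / inner(T, T) * T for T in tilde))
    return tilde

T1, T2, T3 = innovations([Y1, Y2, Y3])

# Theorem 47: projecting onto Y^(3) equals the sum of projections onto the
# orthogonal innovations, computed one component at a time.
proj = lambda X, T: inner(X, T) / inner(T, T) * T
lhs = proj(X, T1) + proj(X, T2) + proj(X, T3)

# Direct LLSE from Theorem 46 (zero-mean case: B = Sigma_XY Sigma_Y^{-1}).
Ymat = np.stack([Y1, Y2, Y3])
B = (Ymat @ X / n) @ np.linalg.inv(Ymat @ Ymat.T / n)
rhs = B @ Ymat

print(np.max(np.abs(lhs - rhs)))   # ~0: both are the same projection, computed two ways
```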

Note that in general, the LLSE is not the same as the MMSE. However, if $X$ and $Y$ are jointly Gaussian, then the LLSE does, in fact, equal the MMSE.
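
A quick Monte Carlo comparison under an assumed non-Gaussian model (chosen purely for illustration) shows the gap: with $Y \sim \text{Uniform}(-1,1)$ and $X = Y^2 + \text{noise}$, the MMSE $\mathbb{E}\left[X|Y\right] = Y^2$ is nonlinear, while $\text{Cov}\left(X, Y\right) = 0$ forces the LLSE to collapse to the constant $\mathbb{E}\left[X\right]$.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 500_000

# Assumed non-Gaussian pair: Y ~ Uniform(-1, 1), X = Y^2 + small noise.
Y = rng.uniform(-1, 1, size=n)
X = Y**2 + 0.1 * rng.normal(size=n)

mmse = Y**2                                        # E[X|Y], known for this model
cov = np.cov(X, Y)
llse = X.mean() + cov[0, 1] / cov[1, 1] * (Y - Y.mean())

print("MSE of the MMSE:", np.mean((X - mmse) ** 2))   # ~0.01, just the noise variance
print("MSE of the LLSE:", np.mean((X - llse) ** 2))   # ~0.10, noticeably worse
```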

Kalman Filtering

Definition 88

A system evolves according to a state space model if the state $\boldsymbol{X}_n$ at time $n$ and observations $\boldsymbol{Y}_n$ at time $n$ are related by

$$\forall n\geq 0,\ \boldsymbol{X}_{n+1} = A\boldsymbol{X}_n + \boldsymbol{V}_n \qquad \forall n\geq 1,\ \boldsymbol{Y}_n=C\boldsymbol{X}_n+\boldsymbol{W}_n$$

where $\boldsymbol{V}_n$ and $\boldsymbol{W}_n$ are noise terms.
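
For instance (a standard illustration, not from the notes), a scalar random walk observed in additive noise fits this form with $A = C = 1$:

$$X_{n+1} = X_n + V_n, \qquad Y_n = X_n + W_n.$$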

State space models are flexible and describe a variety of processes. Suppose we want to linearly estimate $\boldsymbol{X}_n$ from the observations $\boldsymbol{Y}_1, \cdots, \boldsymbol{Y}_n$ we have seen so far.

Theorem 48

The linear estimate $\hat{\boldsymbol{X}}_{n|n} = \mathbb{L}\left[\boldsymbol{X}_n|\boldsymbol{Y}_1,\cdots,\boldsymbol{Y}_n\right]$ can be computed recursively via the Kalman filter.

  1. $\hat{\boldsymbol{X}}_{0|0} = 0,\ \Sigma_{0|0} = \text{Cov}\left(\boldsymbol{X}_0\right)$.

  2. For $n\geq 1$, update

$$\hat{\boldsymbol{X}}_{n|n} = A\hat{\boldsymbol{X}}_{n-1|n-1} + K_n\tilde{\boldsymbol{Y}}_n \qquad \tilde{\boldsymbol{Y}}_n = \boldsymbol{Y}_n - C\hat{\boldsymbol{X}}_{n|n-1} \qquad \Sigma_{n|n-1} = A\Sigma_{n-1|n-1}A^T+\Sigma_{\boldsymbol{V}}$$

$$K_n = \Sigma_{n|n-1}C^T(C\Sigma_{n|n-1}C^T+\Sigma_{\boldsymbol{W}})^{-1} \qquad \Sigma_{n|n}=(I - K_nC)\Sigma_{n|n-1}$$

Here $\hat{\boldsymbol{X}}_{n|n-1} = A\hat{\boldsymbol{X}}_{n-1|n-1}$ denotes the one-step prediction.

Kalman filtering is a simple recursive algorithm that lets us do online, optimal linear estimation. Variants of it can also handle prediction and smoothing.
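
Below is a minimal NumPy sketch of the recursion in Theorem 48, run on a scalar random walk observed in noise ($A = C = 1$); the function name, noise levels, and fallback initialization are illustrative assumptions rather than part of the notes.

```python
import numpy as np

def kalman_filter(Y, A, C, Sigma_V, Sigma_W, x0=None, P0=None):
    """Compute X_hat_{n|n} = L[X_n | Y_1, ..., Y_n] for each row of Y (shape (T, dim_y))."""
    dim_x = A.shape[0]
    x = np.zeros(dim_x) if x0 is None else x0      # X_hat_{0|0}
    P = Sigma_V.copy() if P0 is None else P0       # Sigma_{0|0} = Cov(X_0); Sigma_V is a crude fallback
    estimates = []
    for y in Y:
        # Predict: X_hat_{n|n-1} = A X_hat_{n-1|n-1}, Sigma_{n|n-1} = A Sigma_{n-1|n-1} A^T + Sigma_V.
        x_pred = A @ x
        P_pred = A @ P @ A.T + Sigma_V
        # Correct with the innovation Y_n - C X_hat_{n|n-1} and the gain K_n.
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_W)
        x = x_pred + K @ (y - C @ x_pred)
        P = (np.eye(dim_x) - K @ C) @ P_pred       # Sigma_{n|n}
        estimates.append(x)
    return np.array(estimates)

# Toy run: scalar random walk X_{n+1} = X_n + V_n observed as Y_n = X_n + W_n.
rng = np.random.default_rng(3)
A = C = np.array([[1.0]])
Sigma_V, Sigma_W = np.array([[0.01]]), np.array([[1.0]])
X_true = np.cumsum(0.1 * rng.normal(size=200))
Y = X_true[:, None] + rng.normal(size=(200, 1))
X_hat = kalman_filter(Y, A, C, Sigma_V, Sigma_W)
print("raw observation MSE:", np.mean((Y[:, 0] - X_true) ** 2))
print("filtered MSE:       ", np.mean((X_hat[:, 0] - X_true) ** 2))
```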
