
Estimation

Whereas hypothesis testing is about discriminating between two or more hypotheses, estimation is about guessing the numerical value or ground truth of a random variable.
Figure 3: Estimation Setup
In order to measure the quality of our estimation, we need a metric for error. One commonly used metric is the mean squared error
$$\mathbb{E}\left[(X - \hat{X}(Y))^2\right].$$

Theorem 45

The minimum mean square estimate (MMSE) of a random variable $X$ given observations $Y$ is the conditional expectation:
$$\hat{X}(Y) = \mathbb{E}\left[X|Y\right] = \text{argmin}_{\hat{X}} \mathbb{E}\left[(X - \hat{X}(Y))^2\right].$$
This essentially follows from the definition of conditional expectation: the error $X - \mathbb{E}\left[X|Y\right]$ is orthogonal to all functions of $Y$, so by the Hilbert Projection Theorem, $\mathbb{E}\left[X|Y\right]$ must be the projection of $X$ onto the space of all functions of $Y$. There are two problems with using the MMSE all the time.
  1. We often don't know $p_{Y|X}$ explicitly and only have a good model for it.
  2. Even if we knew the model $p_{Y|X}$, conditional expectations are difficult to compute.

Linear Estimation

Since finding the MMSE is difficult, we can restrict ourselves to functions of a particular type.

Definition 86

The Linear Least Squares Estimator (LLSE) $\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right]$ is the projection of a vector of random variables $\boldsymbol{X}$ onto the subspace of linear functions of the observations $Y_i$, $\mathcal{U} = \left\{ \boldsymbol{a} + B\boldsymbol{Y} \right\}$, where $\boldsymbol{Y}$ is the vector of observations.
By the orthogonality principle,
  1. $\mathbb{E}\left[(\boldsymbol{X} - \mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right])\cdot 1\right] = 0 \implies \mathbb{E}\left[\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right]\right] = \mathbb{E}\left[\boldsymbol{X}\right]$
  2. $\mathbb{E}\left[(\boldsymbol{X} - \mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right])Y_i\right] = 0$
From here, we can derive a closed-form expression for the LLSE. Let
$$\boldsymbol{\mu_{Y}} = \mathbb{E}\left[\boldsymbol{Y}\right], \quad \boldsymbol{\mu_{X}} = \mathbb{E}\left[\boldsymbol{X}\right], \quad \Sigma_{\boldsymbol{Y}} = \mathbb{E}\left[(\boldsymbol{Y}-\boldsymbol{\mu_Y})(\boldsymbol{Y}-\boldsymbol{\mu_Y})^T\right], \quad \Sigma_{\boldsymbol{XY}} = \mathbb{E}\left[(\boldsymbol{X}-\boldsymbol{\mu_X})(\boldsymbol{Y}-\boldsymbol{\mu_Y})^T\right].$$
Substituting $\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{a}+B\boldsymbol{Y}$ into the equations we found from the orthogonality principle,
$$\begin{aligned} \boldsymbol{a}+B\boldsymbol{\mu_Y} &= \boldsymbol{\mu_X} \\ \boldsymbol{a}(\boldsymbol{\mu_Y})_i + B \mathbb{E}\left[\boldsymbol{Y}Y_i\right] = \mathbb{E}\left[\boldsymbol{X}Y_i\right] &\implies \boldsymbol{a}(\boldsymbol{\mu_{Y}})_i + B(\Sigma_{\boldsymbol{Y}})_i + B(\boldsymbol{\mu_{Y}})_i\boldsymbol{\mu_Y} = (\Sigma_{\boldsymbol{XY}})_i + (\boldsymbol{\mu_{Y}})_i\boldsymbol{\mu_X}\\ &\implies \boldsymbol{a}\boldsymbol{\mu_Y}^T + B\Sigma_{\boldsymbol{Y}}+B\boldsymbol{\mu_Y}\boldsymbol{\mu_Y}^T = \Sigma_{\boldsymbol{XY}}+\boldsymbol{\mu_X}\boldsymbol{\mu_Y}^T.\end{aligned}$$
Solving this system (substitute $\boldsymbol{a} = \boldsymbol{\mu_X} - B\boldsymbol{\mu_Y}$ from the first equation into the second) yields
$$B = \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}, \qquad \boldsymbol{a} = \boldsymbol{\mu_X} - \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}\boldsymbol{\mu_Y}.$$

Theorem 46

The Linear Least Squares Estimator of a vector of random variables $\boldsymbol{X}$ given a vector of random variables $\boldsymbol{Y}$ is
$$\mathbb{L}\left[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{\mu_X} + \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y}-\boldsymbol{\mu_Y}).$$
If $X$ and $Y$ are both single random variables, this reduces to
$$\mathbb{L}\left[X|Y\right] = \mu_X + \frac{\text{Cov}\left(X, Y\right)}{\text{Var}\left(Y\right)}(Y - \mu_Y).$$
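As one concrete use of Theorem 46, here is a minimal numerical sketch (the model, the helper name `llse`, and all parameter values are my own choices for illustration) that plugs empirical moments into $\boldsymbol{\mu_X} + \Sigma_{\boldsymbol{XY}}\Sigma_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y}-\boldsymbol{\mu_Y})$.

```python
import numpy as np

def llse(mu_X, mu_Y, Sigma_XY, Sigma_Y, y):
    """Evaluate L[X|Y] = mu_X + Sigma_XY Sigma_Y^{-1} (y - mu_Y) at an observation y."""
    return mu_X + Sigma_XY @ np.linalg.solve(Sigma_Y, y - mu_Y)

# Illustration with moments estimated from samples (model chosen arbitrarily):
# X is 2-dimensional, Y = H X + noise for some matrix H.
rng = np.random.default_rng(0)
n = 200_000
X = rng.standard_normal((n, 2)) @ np.array([[1.0, 0.3], [0.0, 1.0]])
H = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 2.0]])
Y = X @ H.T + 0.5 * rng.standard_normal((n, 3))

mu_X, mu_Y = X.mean(axis=0), Y.mean(axis=0)
Sigma_Y = np.cov(Y, rowvar=False)
Sigma_XY = ((X - mu_X).T @ (Y - mu_Y)) / (n - 1)   # cross-covariance E[(X-mu_X)(Y-mu_Y)^T]

y0 = Y[0]
print("L[X|Y=y0] =", llse(mu_X, mu_Y, Sigma_XY, Sigma_Y, y0), " true X[0] =", X[0])
```

Using `np.linalg.solve` rather than explicitly inverting $\Sigma_{\boldsymbol{Y}}$ is the usual numerically safer way to apply the formula.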
Since the LLSE is essentially a projection onto a linear subspace, if we have an orthogonal basis for the subspace, then we can do the projection onto the subspace one component at a time. The Gram-Schmidt process turns the vectors $Y_1,\cdots,Y_n$ into an orthonormal set $\tilde{Y}_1, \cdots, \tilde{Y}_n$. If we define $Y^{(n)}=(Y_1, \cdots, Y_n)$,
  1. $\tilde{Y}_1 = \frac{Y_1}{\|Y_1\|}$
  2. $\tilde{Y}_{i+1} = Y_{i+1} - \sum_{k=1}^{i}\langle Y_{i+1}, \tilde{Y}_k \rangle \tilde{Y}_k = Y_{i+1} - \mathbb{L}\left[Y_{i+1}|Y^{(i)}\right]$
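Numerically, this Gram-Schmidt step can be mimicked by treating each random variable as a long vector of samples and using the empirical inner product $\langle U, V \rangle \approx \frac{1}{n}\sum_i u_i v_i$. The sketch below is my own illustration; it assumes zero-mean random variables and normalizes every step.

```python
import numpy as np

def gram_schmidt(Y):
    """Gram-Schmidt on the rows of Y (each row = samples of one zero-mean random variable),
    using the empirical inner product <U, V> = mean(U * V).  Returns the tilde-Y's as rows."""
    tilde = []
    for y in Y:
        r = y.copy()
        for t in tilde:
            r -= np.mean(r * t) * t          # subtract the projection onto earlier tilde-Y_k
        r /= np.sqrt(np.mean(r * r))         # normalize (assumes r is not identically zero)
        tilde.append(r)
    return np.array(tilde)

# Sanity check on correlated zero-mean Gaussians (model chosen arbitrarily).
rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 500_000))
Y = np.array([Z[0], Z[0] + Z[1], Z[0] - 2 * Z[2]])
T = gram_schmidt(Y)
print(np.round(T @ T.T / Y.shape[1], 3))     # ~ identity: E[tilde-Y_i tilde-Y_j] = delta_ij
```

The printed Gram matrix being close to the identity confirms that the $\tilde{Y}_i$ are (empirically) orthonormal.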

Definition 87

The linear innovation sequence of random variables $Y_1,\cdots,Y_n$ is the orthogonal set $\tilde{Y}_1, \cdots, \tilde{Y}_n$ produced by Gram-Schmidt.
Since $\tilde{Y}_{n}$ is orthogonal to $\mathbb{L}\left[Y_n|\tilde{Y}^{(n-1)}\right]$, they belong to different parts of the subspace formed by $Y_1,\cdots,Y_n$.

Theorem 47

$$\mathbb{L}\left[X|Y^{(n)}\right] = \mathbb{L}\left[X|\tilde{Y}_n\right] + \mathbb{L}\left[X|\tilde{Y}^{(n-1)}\right]$$
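Theorem 47 can be checked empirically. The sketch below is my own (the model and the use of sample-based second moments are my choices, with everything zero-mean); it compares the joint LLSE from Theorem 46 against the innovation-based decomposition.

```python
import numpy as np

# Empirical check of Theorem 47 for zero-mean scalars (model chosen arbitrarily):
# X is hidden, Y1 and Y2 are two correlated noisy views of X.
rng = np.random.default_rng(0)
n = 1_000_000
X = rng.standard_normal(n)
Y1 = X + rng.standard_normal(n)
Y2 = 2 * X + Y1 + rng.standard_normal(n)

def L(target, obs):
    """Sample-based LLSE of a zero-mean target given one zero-mean observation."""
    return (np.mean(target * obs) / np.mean(obs * obs)) * obs

# Joint LLSE L[X | Y^(2)] via Theorem 46, using empirical second moments
# (means are treated as zero, per the zero-mean assumption above).
Y = np.vstack([Y1, Y2])                       # 2 x n
Sigma_Y = Y @ Y.T / n
Sigma_XY = (X[None, :] @ Y.T) / n             # 1 x 2
joint = (Sigma_XY @ np.linalg.solve(Sigma_Y, Y)).ravel()

# Innovation-based decomposition: tilde-Y1 = Y1, tilde-Y2 = Y2 - L[Y2 | Y1].
Y2_tilde = Y2 - L(Y2, Y1)
decomposed = L(X, Y2_tilde) + L(X, Y1)

print(np.max(np.abs(joint - decomposed)))     # ~0, up to floating-point error
```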
Note that in general, the LLSE is not the same as the MMSE. However, if $X$ and $Y$ are jointly Gaussian, then the LLSE does, in fact, equal the MMSE.

Kalman Filtering

Definition 88

A system evolves according to a state space model if the state $\boldsymbol{X}_n$ at time $n$ and the observation $\boldsymbol{Y}_n$ at time $n$ are related by
$$\forall n\geq 0,\ \boldsymbol{X}_{n+1} = A\boldsymbol{X}_n + \boldsymbol{V}_n, \qquad \forall n\geq 1,\ \boldsymbol{Y}_n=C\boldsymbol{X}_n+\boldsymbol{W}_n,$$
where $\boldsymbol{V}_n$ and $\boldsymbol{W}_n$ are noise terms.
State space models are flexible and describe a variety of processes. Suppose we want to linearly estimate $\boldsymbol{X}_n$ from the observations $\boldsymbol{Y}_n$ we have seen so far.
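For concreteness, here is one hypothetical instance of Definition 88 (the constant-velocity tracking model and all noise covariances are my own choices, not from the notes): the state is a position-velocity pair and we observe only a noisy position.

```python
import numpy as np

# A hypothetical constant-velocity tracking model as a state space model:
# state X_n = (position, velocity), observation Y_n = noisy position.
dt = 1.0
A = np.array([[1.0, dt],
              [0.0, 1.0]])          # X_{n+1} = A X_n + V_n
C = np.array([[1.0, 0.0]])          # Y_n = C X_n + W_n
Sigma_V = 0.01 * np.eye(2)          # process-noise covariance (assumed)
Sigma_W = np.array([[0.25]])        # observation-noise covariance (assumed)

rng = np.random.default_rng(0)
N = 50
X = np.zeros((N + 1, 2))
Y = np.zeros((N + 1, 1))
for n in range(N):
    X[n + 1] = A @ X[n] + rng.multivariate_normal(np.zeros(2), Sigma_V)
    Y[n + 1] = C @ X[n + 1] + rng.multivariate_normal(np.zeros(1), Sigma_W)

print(np.hstack([X, Y])[:5])        # first rows: [position, velocity, observed position]
```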

Theorem 48

The linear estimate $\hat{\boldsymbol{X}}_{n|n} = \mathbb{L}\left[\boldsymbol{X}_n|\boldsymbol{Y}_1,\cdots,\boldsymbol{Y}_n\right]$ can be computed recursively via the Kalman filter.
  1. $\hat{\boldsymbol{X}}_{0|0} = 0$, $\Sigma_{0|0} = \text{Cov}\left(\boldsymbol{X}_0\right)$.
  2. For $n\geq 1$, update
$$\hat{\boldsymbol{X}}_{n|n} = A\hat{\boldsymbol{X}}_{n-1|n-1} + K_n\tilde{\boldsymbol{Y}}_n, \qquad \tilde{\boldsymbol{Y}}_n = \boldsymbol{Y}_n - C\hat{\boldsymbol{X}}_{n|n-1}, \qquad \Sigma_{n|n-1} = A\Sigma_{n-1|n-1}A^T+\Sigma_{\boldsymbol{V}},$$
$$K_n = \Sigma_{n|n-1}C^T(C\Sigma_{n|n-1}C^T+\Sigma_{\boldsymbol{W}})^{-1}, \qquad \Sigma_{n|n}=(I - K_nC)\Sigma_{n|n-1},$$
where $\hat{\boldsymbol{X}}_{n|n-1} = A\hat{\boldsymbol{X}}_{n-1|n-1}$ is the one-step prediction of the state and $\Sigma_{\boldsymbol{V}}, \Sigma_{\boldsymbol{W}}$ are the covariances of the noise terms $\boldsymbol{V}_n, \boldsymbol{W}_n$.
Kalman filtering is a simple algorithm that lets us perform optimal linear estimation online. Variants of it can also perform prediction and smoothing.
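The recursion in Theorem 48 translates almost line by line into code. The sketch below is my own (the function name, model, and parameters are assumptions; it re-simulates the constant-velocity model from the earlier sketch so that it runs on its own) and compares the filtered position error to the raw observation error.

```python
import numpy as np

def kalman_filter(Y, A, C, Sigma_V, Sigma_W, x0_hat, P0):
    """Run the recursion of Theorem 48 on observations Y[1..N] (Y[0] is unused)."""
    x_hat, P = x0_hat.copy(), P0.copy()
    estimates = [x_hat.copy()]
    for n in range(1, len(Y)):
        # Predict: X_hat_{n|n-1} = A X_hat_{n-1|n-1},  Sigma_{n|n-1} = A Sigma_{n-1|n-1} A^T + Sigma_V.
        x_pred = A @ x_hat
        P_pred = A @ P @ A.T + Sigma_V
        # Innovation and gain: Y_tilde_n = Y_n - C X_hat_{n|n-1},
        # K_n = Sigma_{n|n-1} C^T (C Sigma_{n|n-1} C^T + Sigma_W)^{-1}.
        innovation = Y[n] - C @ x_pred
        K = P_pred @ C.T @ np.linalg.inv(C @ P_pred @ C.T + Sigma_W)
        # Update: X_hat_{n|n} = X_hat_{n|n-1} + K_n Y_tilde_n,  Sigma_{n|n} = (I - K_n C) Sigma_{n|n-1}.
        x_hat = x_pred + K @ innovation
        P = (np.eye(len(x_hat)) - K @ C) @ P_pred
        estimates.append(x_hat.copy())
    return np.array(estimates)

# Simulate the constant-velocity model from the earlier sketch, then filter it.
dt = 1.0
A = np.array([[1.0, dt], [0.0, 1.0]])
C = np.array([[1.0, 0.0]])
Sigma_V, Sigma_W = 0.01 * np.eye(2), np.array([[0.25]])
rng = np.random.default_rng(0)
N = 50
X = np.zeros((N + 1, 2))
Y = np.zeros((N + 1, 1))
for n in range(N):
    X[n + 1] = A @ X[n] + rng.multivariate_normal(np.zeros(2), Sigma_V)
    Y[n + 1] = C @ X[n + 1] + rng.multivariate_normal(np.zeros(1), Sigma_W)

X_hat = kalman_filter(Y, A, C, Sigma_V, Sigma_W, x0_hat=np.zeros(2), P0=np.eye(2))
print("mean |position error|, raw observations:", np.mean(np.abs(Y[1:, 0] - X[1:, 0])))
print("mean |position error|, Kalman estimates:", np.mean(np.abs(X_hat[1:, 0] - X[1:, 0])))
```

On this model the filtered position error should come out noticeably smaller than the raw observation error, which is the point of the recursion.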