# Estimation

Whereas hypothesis testing is about discriminating between two or more hypotheses, estimation is about guessing the numerical value (the ground truth) of a random variable $$X$$ from an observation $$Y$$.

![Figure 3: Estimation Setup](https://3896527066-files.gitbook.io/~/files/v0/b/gitbook-x-prod.appspot.com/o/spaces%2F-MQYqMkpf2WPiP6MKWRe%2Fuploads%2Fgit-blob-495c0db34dd62fdb6dabd0a2b658dcf15e3e97f0%2Fe2c33359d2b7dd223cea31d0e87089b684a47b09.png?alt=media)

To measure the quality of an estimate $$\hat{X}(Y)$$, we need an error metric. A commonly used one is the mean squared error

$$\mathbb{E}\left\[(X - \hat{X}(Y))^2\right] .$$

{% hint style="info" %}

### Theorem 45

The minimum mean square estimate (MMSE) of a random variable $$X$$ given an observation $$Y$$ is the conditional expectation.

$$\hat{X}(Y) = \mathbb{E}\left\[X|Y\right] = \text{argmin}\_{\hat{X}} \mathbb{E}\left\[(X - \hat{X}(Y))^2\right] .$$
{% endhint %}

This essentially follows from the definition of conditional expectation: the error $$X - \mathbb{E}\left\[X|Y\right]$$ is orthogonal to every function of $$Y$$, so by the Hilbert Projection Theorem, $$\mathbb{E}\left\[X|Y\right]$$ must be the projection of $$X$$ onto the space of all functions of $$Y$$. There are two problems with using the MMSE all the time.

1. We often don’t know $$p\_{Y|X}$$ explicitly and only have a good model for it.
2. Even if we knew the model $$p\_{Y|X}$$, conditional expectations are difficult to compute.
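To make Theorem 45 concrete, here is a minimal sketch that computes the conditional expectation from a small joint pmf and compares its mean squared error with that of another function of $$Y$$. The pmf values and the helper name `mse` are made up for illustration.

```python
import numpy as np

# Hypothetical joint pmf p_xy[i, j] = P(X = x_vals[i], Y = j), with Y in {0, 1}.
x_vals = np.array([0.0, 1.0, 2.0])
p_xy = np.array([[0.30, 0.05],
                 [0.15, 0.15],
                 [0.05, 0.30]])

def mse(p_xy, x_vals, estimate_per_y):
    """E[(X - g(Y))^2], where g(j) = estimate_per_y[j]."""
    err_sq = (x_vals[:, None] - estimate_per_y[None, :]) ** 2
    return float(np.sum(p_xy * err_sq))

p_y = p_xy.sum(axis=0)              # marginal pmf of Y
cond_mean = (x_vals @ p_xy) / p_y   # E[X | Y = j] for each value j

mmse = mse(p_xy, x_vals, cond_mean)
# Any other function of Y should do no better; try a perturbed guess.
other = mse(p_xy, x_vals, cond_mean + np.array([0.1, -0.2]))
```

This only works because the joint pmf is small and fully known, which is exactly the situation the two problems above say we are rarely in.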

## Linear Estimation

Since finding the MMSE is difficult, we can restrict ourselves to functions of a particular type.

{% hint style="info" %}

### Definition 86

The Linear Least Squares Estimator (LLSE) $$\mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right]$$ is the projection of a vector of random variables $$\boldsymbol{X}$$ onto the subspace of linear functions of observations $$Y\_i,\ \mathcal{U} = \left\lbrace \boldsymbol{a} + B\boldsymbol{Y} \right\rbrace$$ where $$\boldsymbol{Y}$$ is a vector of observations.
{% endhint %}

By the orthogonality principle,

1. $$\mathbb{E}\left\[(\boldsymbol{X} - \mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right])1\right] = 0 \implies \mathbb{E}\left\[\mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right]\right] = \mathbb{E}\left\[\boldsymbol{X}\right]$$
2. $$\mathbb{E}\left\[(\boldsymbol{X} - \mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right])Y\_i\right] = 0$$

From here, we can derive a closed form expression for the LLSE. Let $$\boldsymbol{\mu\_{Y}} = \mathbb{E}\left\[\boldsymbol{Y}\right] , \boldsymbol{\mu\_{X}} = \mathbb{E}\left\[\boldsymbol{X}\right] , \Sigma\_{\boldsymbol{Y}} = \mathbb{E}\left\[(\boldsymbol{Y}-\boldsymbol{\mu\_Y})(\boldsymbol{Y}-\boldsymbol{\mu\_Y})^T\right] , \Sigma\_{\boldsymbol{XY}} = \mathbb{E}\left\[(\boldsymbol{X}-\boldsymbol{\mu\_X})(\boldsymbol{Y}-\boldsymbol{\mu\_Y})^T\right]$$. By substituting $$\mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{a}+B\boldsymbol{Y}$$ into the equations we found from the orthogonality principle,

$$\begin{aligned} \boldsymbol{a}+B\boldsymbol{\mu\_Y} &= \boldsymbol{\mu\_X} \\ \boldsymbol{a}(\boldsymbol{\mu\_Y})\_i + B \mathbb{E}\left\[\boldsymbol{Y}Y\_i\right] = \mathbb{E}\left\[\boldsymbol{X}Y\_i\right] &\implies \boldsymbol{a}(\boldsymbol{\mu\_Y})\_i + B(\Sigma\_{\boldsymbol{Y}})\_i + B(\boldsymbol{\mu\_Y})\_i\boldsymbol{\mu\_Y} = (\Sigma\_{\boldsymbol{XY}})\_i + (\boldsymbol{\mu\_Y})\_i\boldsymbol{\mu\_X} \\ &\implies \boldsymbol{a}\boldsymbol{\mu\_Y}^T + B\Sigma\_{\boldsymbol{Y}}+B\boldsymbol{\mu\_Y\mu\_Y}^T = \Sigma\_{\boldsymbol{XY}}+\boldsymbol{\mu\_X\mu\_Y}^T\end{aligned}$$

Solving this system yields

$$B = \Sigma\_{\boldsymbol{XY}}\Sigma\_{\boldsymbol{Y}}^{-1} \qquad \boldsymbol{a} = \boldsymbol{\mu\_X} - \Sigma\_{\boldsymbol{XY}}\Sigma\_{\boldsymbol{Y}}^{-1}\boldsymbol{\mu\_Y}.$$

{% hint style="info" %}

### Theorem 46

The Linear Least Squares Estimator for a vector of random variables $$\boldsymbol{X}$$ given a vector of random variables $$\boldsymbol{Y}$$ is

$$\mathbb{L}\left\[\boldsymbol{X}|\boldsymbol{Y}\right] = \boldsymbol{\mu\_X} + \Sigma\_{\boldsymbol{XY}}\Sigma\_{\boldsymbol{Y}}^{-1}(\boldsymbol{Y}-\boldsymbol{\mu\_Y})$$
{% endhint %}

If $$X$$ and $$Y$$ are each a single random variable, this reduces to

$$\mathbb{L}\left\[X|Y\right] = \mu\_X + \frac{\text{Cov}\left(X, Y\right) }{\text{Var}\left(Y\right) }(Y - \mu\_Y)$$
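The closed form in Theorem 46 translates almost directly into NumPy. The sketch below assumes the means and covariances are already known; the function name `llse` and all numerical values are hypothetical.

```python
import numpy as np

def llse(mu_x, mu_y, sigma_xy, sigma_y, y):
    """Evaluate mu_X + Sigma_XY Sigma_Y^{-1} (y - mu_Y) for one observation y."""
    # Solve Sigma_Y z = (y - mu_Y) rather than forming the inverse explicitly.
    z = np.linalg.solve(sigma_y, y - mu_y)
    return mu_x + sigma_xy @ z

# Hypothetical moments: scalar X and two observations Y = (Y1, Y2).
mu_x = np.array([1.0])
mu_y = np.array([0.0, 2.0])
sigma_y = np.array([[2.0, 0.5],
                    [0.5, 1.0]])
sigma_xy = np.array([[1.0, 0.3]])   # Cov(X, Y1), Cov(X, Y2)

y_obs = np.array([1.0, 2.5])
x_hat = llse(mu_x, mu_y, sigma_xy, sigma_y, y_obs)
```

Observing $$\boldsymbol{Y} = \boldsymbol{\mu\_Y}$$ returns $$\boldsymbol{\mu\_X}$$, since the correction term vanishes.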

Since the LLSE is essentially a projection onto a linear subspace, if we have an orthogonal basis for the subspace, then we can do the projection one component at a time. For zero-mean random variables with inner product $$\langle U, V \rangle = \mathbb{E}\left\[UV\right]$$, the Gram-Schmidt Process turns $$Y\_1,\cdots,Y\_n$$ into an orthogonal set $$\tilde{Y}\_1, \cdots, \tilde{Y}\_n$$. If we define $$Y^{(n)}=(Y\_1, \cdots, Y\_n)$$,

1. $$\tilde{Y}\_1 = Y\_1$$
2. $$\tilde{Y}\_{i+1} = Y\_{i+1} - \sum\_{k=1}^{i}\frac{\langle Y\_{i+1}, \tilde{Y}\_k \rangle}{\lVert\tilde{Y}\_k\rVert^2} \tilde{Y}\_k = Y\_{i+1} - \mathbb{L}\left\[Y\_{i+1}|Y^{(i)}\right]$$

{% hint style="info" %}

### Definition 87

The linear innovation sequence of random variables $$Y\_1,\cdots,Y\_n$$ is the orthogonal set $$\tilde{Y}\_1, \cdots, \tilde{Y}\_n$$ produced by Gram-Schmidt.
{% endhint %}
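For zero-mean random variables, the innovation sequence can be computed from second moments alone. The sketch below assumes the inner product $$\langle Y\_i, Y\_j \rangle = \mathbb{E}\left\[Y\_iY\_j\right]$$ is given as a Gram matrix and uses the unnormalized form of Gram-Schmidt, so the output is orthogonal but not orthonormal. The function name and the matrix entries are hypothetical.

```python
import numpy as np

def innovation_coefficients(gram):
    """Return T whose row i expresses tilde{Y}_i as a combination of Y_1..Y_n.

    gram[i, j] = E[Y_i Y_j] for zero-mean random variables Y_1..Y_n.
    """
    n = gram.shape[0]
    T = np.eye(n)
    for i in range(1, n):
        for k in range(i):
            inner = gram[i] @ T[k]            # <Y_i, tilde{Y}_k>
            norm_sq = T[k] @ gram @ T[k]      # ||tilde{Y}_k||^2
            T[i] -= (inner / norm_sq) * T[k]  # remove the component along tilde{Y}_k
    return T

# Hypothetical Gram matrix of three correlated observations.
gram = np.array([[2.0, 1.0, 0.5],
                 [1.0, 2.0, 1.0],
                 [0.5, 1.0, 2.0]])
T = innovation_coefficients(gram)
innovation_gram = T @ gram @ T.T   # E[tilde{Y}_i tilde{Y}_j]
```

If the innovations are orthogonal, `innovation_gram` is diagonal, and `T` is lower triangular because each innovation only involves observations seen so far.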

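Since $$\tilde{Y}\_{n}$$ is orthogonal to $$Y\_1,\cdots,Y\_{n-1}$$ (and hence to $$\tilde{Y}^{(n-1)}$$), the subspace spanned by $$Y\_1,\cdots,Y\_n$$ splits into two orthogonal pieces: the span of $$\tilde{Y}^{(n-1)}$$ and the span of $$\tilde{Y}\_n$$. Projecting onto the whole subspace is therefore the same as projecting onto each piece and adding the results.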

{% hint style="info" %}

### Theorem 47

$$\mathbb{L}\left\[X|Y^{(n)}\right] = \mathbb{L}\left\[X|\tilde{Y}\_n\right] + \mathbb{L}\left\[X|\tilde{Y}^{(n-1)}\right]$$
{% endhint %}
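As a sanity check on Theorem 47 for $$n = 2$$, the sketch below compares the batch LLSE with the sum of the two one-dimensional projections. It assumes zero-mean variables, and the covariance entries are hypothetical.

```python
import numpy as np

# Hypothetical covariance of the zero-mean vector (X, Y1, Y2).
cov = np.array([[1.0, 0.6, 0.5],
                [0.6, 2.0, 0.8],
                [0.5, 0.8, 1.5]])
sigma_xy, sigma_y = cov[0, 1:], cov[1:, 1:]

# Batch LLSE: L[X | Y1, Y2] = c_batch . (Y1, Y2).
c_batch = np.linalg.solve(sigma_y, sigma_xy)

# Innovation tilde{Y}_2 = Y2 - L[Y2 | Y1] = t . (Y1, Y2).
t = np.array([-sigma_y[1, 0] / sigma_y[0, 0], 1.0])
cov_x_innov = sigma_xy @ t      # Cov(X, tilde{Y}_2)
var_innov = t @ sigma_y @ t     # Var(tilde{Y}_2)

# L[X | Y1] + L[X | tilde{Y}_2], written as coefficients on (Y1, Y2).
c_first = np.array([sigma_xy[0] / sigma_y[0, 0], 0.0])
c_recursive = c_first + (cov_x_innov / var_innov) * t
```

The two coefficient vectors agree, which is what lets a new observation be folded in without redoing the whole projection.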

Note that in general, the LLSE is not the same as the MMSE. However, if $$X$$ and $$Y$$ are Jointly Gaussian, then the LLSE does, in fact, equal the MMSE.
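A small worked example (not from the original notes) shows how far apart the two can be. Let $$Y$$ be uniform on $$\lbrace -1, 0, 1 \rbrace$$ and $$X = Y^2$$, so that $$\mathbb{E}\left\[Y\right] = 0$$ and $$\mathbb{E}\left\[X\right] = \frac{2}{3}$$. Then

$$\text{Cov}\left(X, Y\right) = \mathbb{E}\left\[Y^3\right] - \mathbb{E}\left\[Y^2\right]\mathbb{E}\left\[Y\right] = 0 \implies \mathbb{L}\left\[X|Y\right] = \mathbb{E}\left\[X\right] = \frac{2}{3}, \qquad \mathbb{E}\left\[X|Y\right] = Y^2 = X.$$

The LLSE ignores $$Y$$ entirely and has mean squared error $$\text{Var}\left(X\right) = \frac{2}{3} - \frac{4}{9} = \frac{2}{9}$$, while the MMSE recovers $$X$$ exactly.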

## Kalman Filtering

{% hint style="info" %}

### Definition 88

A system evolves according to a state space model if the state $$\boldsymbol{X}\_n$$ at time $$n$$ and observations $$\boldsymbol{Y}\_n$$ at time $$n$$ are related by

$$\forall n\geq 0,\ \boldsymbol{X}\_{n+1} = A\boldsymbol{X}\_n + \boldsymbol{V}\_n \qquad \forall n\geq 1,\ \boldsymbol{Y}\_n=C\boldsymbol{X}\_n+\boldsymbol{W}\_n$$

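where $$\boldsymbol{V}\_n$$ and $$\boldsymbol{W}\_n$$ are zero-mean noise terms with covariances $$\Sigma\_{\boldsymbol{V}}$$ and $$\Sigma\_{\boldsymbol{W}}$$, typically assumed uncorrelated across time, with each other, and with the initial state $$\boldsymbol{X}\_0$$.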
{% endhint %}
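To get a feel for Definition 88, here is a minimal simulation sketch. Gaussian noise is an assumption made here for convenience (the definition only asks for noise terms), and the function name `simulate` and the position/velocity matrices are hypothetical.

```python
import numpy as np

def simulate(A, C, sigma_v, sigma_w, x0, n_steps, rng):
    """Draw X_1..X_n and Y_1..Y_n from X_{k+1} = A X_k + V_k, Y_k = C X_k + W_k."""
    xs, ys = [], []
    x = x0
    for _ in range(n_steps):
        # Process noise drives the state forward one step.
        v = rng.multivariate_normal(np.zeros(A.shape[0]), sigma_v)
        x = A @ x + v
        # Observation noise corrupts what we get to see.
        w = rng.multivariate_normal(np.zeros(C.shape[0]), sigma_w)
        xs.append(x)
        ys.append(C @ x + w)
    return np.array(xs), np.array(ys)

rng = np.random.default_rng(0)
A = np.array([[1.0, 1.0],
              [0.0, 1.0]])      # hypothetical position/velocity dynamics
C = np.array([[1.0, 0.0]])      # only the position is observed
xs, ys = simulate(A, C, 0.01 * np.eye(2), np.array([[0.25]]), np.zeros(2), 50, rng)
```

Here `xs` holds the hidden states and `ys` the noisy observations an estimator would actually have access to.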

State space models are flexible and describe a variety of processes. Suppose we want to linearly estimate $$\boldsymbol{X}\_n$$ from the observations $$\boldsymbol{Y}\_1,\cdots,\boldsymbol{Y}\_n$$ we have seen so far.

{% hint style="info" %}

### Theorem 48

The linear estimate $$\hat{\boldsymbol{X}}\_{n|n} = \mathbb{L}\left\[\boldsymbol{X}\_n|\boldsymbol{Y}\_1,\cdots,\boldsymbol{Y}\_n\right]$$ can be computed recursively via the Kalman Filter.

1. $$\hat{\boldsymbol{X}}\_{0|0} = 0, \Sigma\_{0|0} = \text{Cov}\left(\boldsymbol{X}\_0\right)$$.
2. For $$n\geq 1$$, update

$$\hat{\boldsymbol{X}}\_{n|n} = A\hat{\boldsymbol{X}}\_{n-1|n-1} + K\_n\tilde{\boldsymbol{Y}}\_n \qquad \tilde{\boldsymbol{Y}}\_n = \boldsymbol{Y}\_n - C\hat{\boldsymbol{X}}\_{n|n-1} \qquad \hat{\boldsymbol{X}}\_{n|n-1} = A\hat{\boldsymbol{X}}\_{n-1|n-1} \qquad \Sigma\_{n|n-1} = A\Sigma\_{n-1|n-1}A^T+\Sigma\_{\boldsymbol{V}}$$

$$K\_n = \Sigma\_{n|n-1}C^T(C\Sigma\_{n|n-1}C^T+\Sigma\_{\boldsymbol{W}})^{-1} \qquad \Sigma\_{n|n}=(I - K\_nC)\Sigma\_{n|n-1}$$
{% endhint %}

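Kalman filtering is a simple recursive algorithm which lets us do online estimation that is optimal among linear estimators, and optimal among all estimators when the noise and initial state are jointly Gaussian. Variants of it can do things such as prediction or smoothing.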
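The recursion in Theorem 48 can be sketched in a few lines of NumPy. This is an illustrative implementation under the theorem's assumptions (zero initial estimate, known $$A$$, $$C$$, $$\Sigma\_{\boldsymbol{V}}$$, $$\Sigma\_{\boldsymbol{W}}$$), not a production filter. The function name `kalman_filter` and the scalar test values are hypothetical.

```python
import numpy as np

def kalman_filter(ys, A, C, sigma_v, sigma_w, sigma_0):
    """Return X_hat_{n|n} and Sigma_{n|n} for n = 1..N, given observations ys."""
    x_hat = np.zeros(A.shape[0])     # X_hat_{0|0} = 0
    sigma = sigma_0                  # Sigma_{0|0} = Cov(X_0)
    identity = np.eye(A.shape[0])
    estimates, covariances = [], []
    for y in ys:
        # Predict: X_hat_{n|n-1} and Sigma_{n|n-1}.
        x_pred = A @ x_hat
        sigma_pred = A @ sigma @ A.T + sigma_v
        # Innovation tilde{Y}_n and gain K_n = Sigma_pred C^T S^{-1}.
        innovation = y - C @ x_pred
        s = C @ sigma_pred @ C.T + sigma_w
        # S and Sigma_pred are symmetric, so K_n^T = S^{-1} C Sigma_pred.
        gain = np.linalg.solve(s, C @ sigma_pred).T
        # Update: X_hat_{n|n} and Sigma_{n|n}.
        x_hat = x_pred + gain @ innovation
        sigma = (identity - gain @ C) @ sigma_pred
        estimates.append(x_hat)
        covariances.append(sigma)
    return np.array(estimates), np.array(covariances)

# Hypothetical scalar case: a constant state (A = 1, no process noise)
# observed in unit-variance noise, with prior variance 4.
one = np.array([[1.0]])
ys = np.array([[1.0], [2.0], [3.0]])
est, covs = kalman_filter(ys, one, one, np.array([[0.0]]), one, 4.0 * one)
```

In this scalar case the filter behaves like a running average of the observations shrunk toward the prior mean of zero, and the error variance decreases with every new observation.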
