# Linear Estimation

In Linear Estimation, we are trying to estimate a random variable $$\boldsymbol{X}$$ using an observation $$\boldsymbol{Y}$$ with a linear function of $$\boldsymbol{Y}$$. If $$\boldsymbol{Y}$$ is finite dimensional, then we can say $$\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W\boldsymbol{Y}$$ where $$W$$ is some matrix. Using Theorem 1 and the orthogonality principle, we know that

$$\langle \boldsymbol{X}-W\boldsymbol{Y}, \boldsymbol{Y} \rangle = \boldsymbol{0} \Leftrightarrow R\_{XY} = W\boldsymbol{R}\_Y$$

This is known as the **Normal Equation**. If $$R\_Y$$ is invertible, then we can apply the inverse to find $$W$$. Otherwise, we can apply the pseudoinverse $$R\_Y^\dagger$$ to find $$W$$, which may not be unique. If we want to measure the quality of the estimation, since $$\boldsymbol{X} = \boldsymbol{X}+(\boldsymbol{X}-\hat{\boldsymbol{X}})$$,

$$\begin{aligned} |\boldsymbol{X}|^2 &= |\hat{\boldsymbol{X}}|^2 + |\boldsymbol{X} - \hat{\boldsymbol{X}}|^2 \implies \ |\boldsymbol{X}-\hat{\boldsymbol{X}}|^2 &= |\boldsymbol{X}|^2 - |\hat{\boldsymbol{X}}|^2 = R\_X - R\_{XY}R\_Y^{-1}R\_{YX}\end{aligned}$$

## Affine Estimation

If we allow ourselves to consider an affine function for estimation $$\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W\boldsymbol{Y}+b$$, then this is equivalent to instead finding an estimator

$$\hat{\boldsymbol{X}}(\boldsymbol{Y}') = W\boldsymbol{Y}' \qquad \text{ where } \boldsymbol{Y}' = \begin{bmatrix} \boldsymbol{Y} \ 1 \end{bmatrix}$$

This is equivalent to the following orthogonality conditions:

1. $$\langle \boldsymbol{X}-\hat{\boldsymbol{X}}, \boldsymbol{Y} \rangle$$
2. $$\langle \boldsymbol{X}-\hat{\boldsymbol{X}}, 1 \rangle$$

Solving gives us

$$\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W(\boldsymbol{Y}-\boldsymbol{\mu}*Y) + \mu\_x \qquad \text{ where } W\Sigma\_Y=\Sigma*{XY}.$$

$$\Sigma\_Y$$ and $$\Sigma\_{XY}$$ are the auto-covariance and cross-covariance respectively. Recall that if

$$\begin{bmatrix} \boldsymbol{X} \ \boldsymbol{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}*X \ \boldsymbol{\mu}*Y \end{bmatrix}, \begin{bmatrix} \Sigma\_X & \Sigma*{XY}\ \Sigma*{YX} & \Sigma\_Y \end{bmatrix}\right)$$

then

$$\boldsymbol{X}|\boldsymbol{Y} \sim \mathcal{N}\left(\boldsymbol{\mu}*X + \Sigma*{XY}\Sigma\_Y^{-1}(\boldsymbol{Y}-\boldsymbol{\mu}*Y), \Sigma\_X-\Sigma*{XY}\Sigma\_Y^{-1}\Sigma\_{YX} \right)$$

Thus in the Joint Gaussian case, the mean of the conditional distribution is the best affine estimator of $$\boldsymbol{X}$$ using $$\boldsymbol{Y}$$, and the covariance is the estimation error. This has two interpretations.

1. Under the Gaussian assumption, the best nonlinear estimator $$\mathbb{E}\left\[\boldsymbol{X}|\boldsymbol{Y}\right]$$is affine
2. Gaussian random variables are the hardest predict because nonlinearity should improve our error, but it does not in the Gaussian case. This means if affine estimation works well, we shouldn’t try and find better non-linear estimators.

## Least Squares

The theory of linear estimation is very closely connected with the theory behind least squares in linear algebra. In least squares, we have a deterministic $$\boldsymbol{x}$$ and assume nothing else about it, meaning we are looking for an unbiased estimator. Theorem 2 tells us how to find the best linear unbiased estimator in a linear setting.

{% hint style="info" %}

#### Theorem 2 (Gauss Markov Theorem) <a href="#theorem-2" id="theorem-2"></a>

Suppose that $$\boldsymbol{Y}=H\boldsymbol{x}+\boldsymbol{Z}$$ and $$Z$$ is zero-mean with $$\langle \boldsymbol{Z}, \boldsymbol{Z} \rangle = \boldsymbol{I}$$, $$H$$ is full-column rank, then $$\hat{\boldsymbol{x}\_b} = (H^*H)^{-1}H^*\boldsymbol{Y}$$is the best linear unbiased estimator.
{% endhint %}

### Recursive Least Squares

Suppose we extend the least squares setup to allow a stochastic, but fixed, $$\boldsymbol{X}$$ where $$\langle \boldsymbol{X}, \boldsymbol{X} \rangle = \Pi\_0$$. At each timestep, we receive observations of $$\boldsymbol{X}$$ such that $$\boldsymbol{Y}\_i = h\_i^\* \boldsymbol{X} + \boldsymbol{V}\_i$$ where $$\langle \boldsymbol{V}\_i, \boldsymbol{V}\_j \rangle = \delta\[i, j]$$ and $$\langle \boldsymbol{X}, \boldsymbol{V} \rangle$$. Define

$$\boldsymbol{Y}^i = \begin{bmatrix} \boldsymbol{Y}\_0 \ \boldsymbol{Y}\_1 \ \cdots \ \boldsymbol{Y}\_i \end{bmatrix} \qquad H\_i = \begin{bmatrix} h\_0^*\ h\_1^*\ \vdots\ h\_i^\*\ \end{bmatrix} \qquad \boldsymbol{V}^i = \begin{bmatrix} \boldsymbol{V}\_0 \ \boldsymbol{V}\_1 \ \cdots \ \boldsymbol{V}\_i \end{bmatrix}$$

Then our setup becomes $$\boldsymbol{Y}^i= H\_i \boldsymbol{X} + \boldsymbol{V}^i$$.

$$R\_{XY^i} = \Pi\_0 H\_i^\* \qquad R\_{Y^i} = (H\_i\Pi\_0H\_i^\* + I)$$

Applying Theorem 1 and solving the normal equation, we see

$$\begin{aligned} W &= \Pi\_0 H\_i^*(H\_i\Pi\_0H\_i^* + I)^{-1} = \Pi\_0 H\_i^\* (I - H\_i(\Pi\_0^{-1} + H\_i^*H\_i)^{-1}H\_i^*)\ &= \Pi\_0 (I - H\_i^\*H\_i(\Pi\_0^{-1} + H\_i^*H\_i)^{-1})H\_i^* \ &= \Pi\_0 ((\Pi\_0^{-1} + H\_i^\*H\_i)(H\_i^\*H\_i)^{-1}(H\_i^\*H\_i)(\Pi\_0^{-1} + H\_i^\*H\_i)^{-1}- H\_i^\*H\_i(\Pi\_0^{-1} + H\_i^*H\_i)^{-1})H\_i^*\ &= \Pi\_0 \Pi\_0^{-1}(H\_i^*H\_i)^{-1}H\_i^*H\_i(\Pi\_0^{-1}+H\_i^*H\_i)^{-1}H\_i^*\ &= (\Pi\_0^{-1} + H\_i^* H\_i)^{-1}H\_i^*\end{aligned}$$

Suppose we want to do this in an online fashion where at each timestep $$i$$, we only use the current $$h\_i, \boldsymbol{Y}*i$$ and our previous estimate $$\boldsymbol{X}*{i-1}$$. Let $$P\_i = (\Pi\_0^{-1} + H\_i^\*H\_i)^{-1}$$. Then

$$P\_i^{-1} = \Pi\_0 + \sum\_{k=0}^i h\_k h\_k^\* = P\_{i-1}^{-1} + h\_ih\_i^\*.$$

By applying the Sherman-Morrison-Woodbury identity, we can see that

$$P\_i = P\_{i-1} = P\_{i-1} \frac{h\_ih\_i^\*}{1 + h\_i^\*P\_{-1}h\_i} P\_{i-1}$$

{% hint style="info" %}

#### Theorem 3 (Recursive Least Squares Update) <a href="#theorem-3" id="theorem-3"></a>

The best least squares estimate using $$i+1$$ data points can be found by updating the best least squares estimate using $$i$$ data points using

$$\hat{\boldsymbol{X}}*i = \hat{\boldsymbol{X}}*{i-1} + \frac{P\_{i-1}h\_i}{1 + h\_i^*P\_{i-1}h\_i}(\boldsymbol{Y}\_i - h\_i^* \hat{\boldsymbol{X}}\_{i-1})$$
{% endhint %}

Notice that this formula scales an innovation in order to improve the current estimate of $$\boldsymbol{X}$$.

Just as we could compute a recursive update, we can also compute a “downdate” where we forget a particular observation. More concretely, we want to use $$\hat{\boldsymbol{X}}*i$$ to find $$\hat{\boldsymbol{X}}*{i|k}$$, the best linear estimator of $$\boldsymbol{X}$$ using $$\boldsymbol{Y}*0,\boldsymbol{Y}*1,\cdots,\boldsymbol{Y}*{k-1},\boldsymbol{Y}*{k+1},\cdots,\boldsymbol{Y}*i$$. Defining $$P*{i|k} = (\Pi\_0^{-1} + H\_{i|k}^\*H\_{i|k})^{-1}$$,

$$P\_{i|k}^{-1} = \Pi\_0^{-1} + \sum\_{j=0,j\neq k}^i h\_jh\_j^\* = P\_i^{-1} - h\_kh\_k^{-1}.$$

Applying the Sherman-Morrison-Woodbury identity,

$$P\_{i|k} = P\_i + P\_i \frac{h\_kh\_k^\*}{h\_k^\*P\_ih\_k - 1}P\_i$$

{% hint style="info" %}

#### Theorem 4 (Recursive Least Squares Downdate) <a href="#theorem-4" id="theorem-4"></a>

The best least squares estimate using all but the kth observation can be found by updating the best least squares estimate using all data points using

$$\hat{\boldsymbol{X}}\_{i|k} = \hat{X}\_i + \frac{P\_ih\_k}{h\_k^*P\_ih\_k - 1}(Y\_k - h\_k^*\hat{\boldsymbol{X}}\_i)$$
{% endhint %}


---

# Agent Instructions: Querying This Documentation

If you need additional information that is not directly available in this page, you can query the documentation dynamically by asking a question.

Perform an HTTP GET request on the current page URL with the `ask` query parameter:

```
GET https://aparande.gitbook.io/berkeley-notes/eecs225a-0/eecs225a-2.md?ask=<question>
```

The question should be specific, self-contained, and written in natural language.
The response will contain a direct answer to the question and relevant excerpts and sources from the documentation.

Use this mechanism when the answer is not explicitly present in the current page, you need clarification or additional context, or you want to retrieve related documentation sections.
