Linear Estimation

In Linear Estimation, we are trying to estimate a random variable $\boldsymbol{X}$ from an observation $\boldsymbol{Y}$ using a linear function of $\boldsymbol{Y}$. If $\boldsymbol{Y}$ is finite dimensional, then we can write $\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W\boldsymbol{Y}$ where $W$ is some matrix. Using Theorem 1 and the orthogonality principle, we know that

$$\langle \boldsymbol{X}-W\boldsymbol{Y}, \boldsymbol{Y} \rangle = 0 \Leftrightarrow R_{XY} = WR_Y$$

This is known as the Normal Equation. If $R_Y$ is invertible, then we can apply the inverse to find $W$. Otherwise, we can apply the pseudoinverse $R_Y^\dagger$ to find a $W$, which may not be unique. If we want to measure the quality of the estimation, then since $\boldsymbol{X} = \hat{\boldsymbol{X}}+(\boldsymbol{X}-\hat{\boldsymbol{X}})$ and the error is orthogonal to the estimate,

$$\begin{aligned} \|\boldsymbol{X}\|^2 &= \|\hat{\boldsymbol{X}}\|^2 + \|\boldsymbol{X} - \hat{\boldsymbol{X}}\|^2 \\ \implies \|\boldsymbol{X}-\hat{\boldsymbol{X}}\|^2 &= \|\boldsymbol{X}\|^2 - \|\hat{\boldsymbol{X}}\|^2 = R_X - R_{XY}R_Y^{-1}R_{YX} \end{aligned}$$
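
To make the normal equation concrete, here is a minimal NumPy sketch that builds $W$ from sample correlation matrices and compares the empirical error with the closed form above. The data model and all variable names are made up for illustration, not taken from the notes.

```python
import numpy as np

# Minimal sketch (made-up data): form W from sample correlation matrices and
# compare the empirical error with trace(R_X - R_XY R_Y^{-1} R_YX).
rng = np.random.default_rng(0)
N = 100_000
Y = rng.standard_normal((3, N))                      # zero-mean observation samples
X = np.array([[1.0, 0.5, -0.2],
              [0.0, 2.0, 1.0]]) @ Y + 0.1 * rng.standard_normal((2, N))

R_Y = Y @ Y.T / N                                    # autocorrelation of Y
R_XY = X @ Y.T / N                                   # cross-correlation
R_X = X @ X.T / N
W = R_XY @ np.linalg.pinv(R_Y)                       # solves W R_Y = R_XY

err = X - W @ Y
print(np.mean(np.sum(err**2, axis=0)))               # empirical mean squared error
print(np.trace(R_X - R_XY @ np.linalg.pinv(R_Y) @ R_XY.T))  # closed-form error
```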

Affine Estimation

If we allow ourselves to consider an affine function for estimation $\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W\boldsymbol{Y}+b$, then this is equivalent to instead finding an estimator

$$\hat{\boldsymbol{X}}(\boldsymbol{Y}') = W\boldsymbol{Y}' \qquad \text{where } \boldsymbol{Y}' = \begin{bmatrix} \boldsymbol{Y} \\ 1 \end{bmatrix}$$

This is equivalent to the following orthogonality conditions:

  1. $\langle \boldsymbol{X}-\hat{\boldsymbol{X}}, \boldsymbol{Y} \rangle = 0$

  2. $\langle \boldsymbol{X}-\hat{\boldsymbol{X}}, 1 \rangle = 0$

Solving gives us

$$\hat{\boldsymbol{X}}(\boldsymbol{Y}) = W(\boldsymbol{Y}-\boldsymbol{\mu}_Y) + \boldsymbol{\mu}_X \qquad \text{where } W\Sigma_Y=\Sigma_{XY}.$$
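
As a concrete illustration, the following is a minimal NumPy sketch of this affine estimator, with the covariances replaced by sample estimates; the data model is made up and the names are illustrative only.

```python
import numpy as np

# Sketch of the affine estimator with Sigma_Y and Sigma_XY replaced by sample
# covariances (made-up data model, illustrative names).
rng = np.random.default_rng(1)
N = 100_000
Y = rng.standard_normal((2, N)) + np.array([[3.0], [-1.0]])   # non-zero mean
X = np.array([[1.0, -2.0]]) @ Y + 0.5 * rng.standard_normal((1, N)) + 1.0

mu_X = X.mean(axis=1, keepdims=True)
mu_Y = Y.mean(axis=1, keepdims=True)
Sigma_Y = (Y - mu_Y) @ (Y - mu_Y).T / N              # auto-covariance of Y
Sigma_XY = (X - mu_X) @ (Y - mu_Y).T / N             # cross-covariance
W = Sigma_XY @ np.linalg.pinv(Sigma_Y)               # solves W Sigma_Y = Sigma_XY

X_hat = W @ (Y - mu_Y) + mu_X                        # affine estimate of X
print(np.mean((X - X_hat) ** 2))                     # residual error is small
```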

$\Sigma_Y$ and $\Sigma_{XY}$ are the auto-covariance and cross-covariance respectively. Recall that if

$$\begin{bmatrix} \boldsymbol{X} \\ \boldsymbol{Y} \end{bmatrix} \sim \mathcal{N}\left(\begin{bmatrix} \boldsymbol{\mu}_X \\ \boldsymbol{\mu}_Y \end{bmatrix}, \begin{bmatrix} \Sigma_X & \Sigma_{XY}\\ \Sigma_{YX} & \Sigma_Y \end{bmatrix}\right)$$

then

$$\boldsymbol{X}|\boldsymbol{Y} \sim \mathcal{N}\left(\boldsymbol{\mu}_X + \Sigma_{XY}\Sigma_Y^{-1}(\boldsymbol{Y}-\boldsymbol{\mu}_Y),\ \Sigma_X-\Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX} \right)$$

Thus in the jointly Gaussian case, the mean of the conditional distribution is the best affine estimator of $\boldsymbol{X}$ given $\boldsymbol{Y}$, and the conditional covariance is the covariance of the estimation error. This has two interpretations.

  1. Under the Gaussian assumption, the best estimator $\mathbb{E}\left[\boldsymbol{X}|\boldsymbol{Y}\right]$, even allowing nonlinear functions of $\boldsymbol{Y}$, is affine.

  2. Gaussian random variables are the hardest to predict: allowing nonlinearity should improve our error, but it does not in the Gaussian case. This means that if affine estimation works well, we shouldn't try to find better nonlinear estimators.
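
As a sanity check on the jointly Gaussian case, the sketch below builds the conditional-mean (equivalently, affine) estimate and verifies that its empirical error covariance matches $\Sigma_X - \Sigma_{XY}\Sigma_Y^{-1}\Sigma_{YX}$. The mean vector and covariance matrix are arbitrary made-up values.

```python
import numpy as np

# Jointly Gaussian check: the affine estimate is E[X|Y], and its error
# covariance equals Sigma_X - Sigma_XY Sigma_Y^{-1} Sigma_YX (made-up numbers).
rng = np.random.default_rng(2)
mu = np.array([1.0, 0.0, -1.0])                      # first entry: X, last two: Y
Sigma = np.array([[2.0, 0.8, 0.3],
                  [0.8, 1.5, 0.2],
                  [0.3, 0.2, 1.0]])
samples = rng.multivariate_normal(mu, Sigma, size=200_000).T
X, Y = samples[:1], samples[1:]

Sigma_X, Sigma_XY, Sigma_Y = Sigma[:1, :1], Sigma[:1, 1:], Sigma[1:, 1:]
K = Sigma_XY @ np.linalg.inv(Sigma_Y)
X_hat = mu[:1, None] + K @ (Y - mu[1:, None])        # conditional mean of X given Y

err = X - X_hat
print(err @ err.T / err.shape[1])                    # empirical error covariance
print(Sigma_X - K @ Sigma_XY.T)                      # conditional covariance formula
```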

Least Squares

The theory of linear estimation is very closely connected with the theory behind least squares in linear algebra. In least squares, we have a deterministic $\boldsymbol{x}$ and assume nothing else about it, meaning we are looking for an unbiased estimator. Theorem 2 tells us how to find the best linear unbiased estimator in a linear setting.

Theorem 2 (Gauss Markov Theorem)

Suppose that $\boldsymbol{Y}=H\boldsymbol{x}+\boldsymbol{Z}$ where $\boldsymbol{Z}$ is zero-mean with $\langle \boldsymbol{Z}, \boldsymbol{Z} \rangle = I$ and $H$ has full column rank. Then $\hat{\boldsymbol{x}}_b = (H^*H)^{-1}H^*\boldsymbol{Y}$ is the best linear unbiased estimator.
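
A minimal sketch of the Gauss-Markov estimate on synthetic, made-up data is shown below; numerically it is preferable to use a least-squares solver rather than forming $(H^*H)^{-1}$ explicitly.

```python
import numpy as np

# Sketch of the BLUE x_hat = (H* H)^{-1} H* Y on synthetic data:
# full-column-rank H, unknown deterministic x, white noise Z (all made up).
rng = np.random.default_rng(3)
x_true = np.array([1.0, -2.0, 0.5])
H = rng.standard_normal((50, 3))                     # full column rank w.p. 1
Y = H @ x_true + rng.standard_normal(50)             # Z zero-mean, identity covariance

x_hat, *_ = np.linalg.lstsq(H, Y, rcond=None)        # solves (H*H) x = H* Y stably
print(x_hat)                                          # close to x_true
```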

Recursive Least Squares

Suppose we extend the least squares setup to allow a stochastic, but fixed, $\boldsymbol{X}$ where $\langle \boldsymbol{X}, \boldsymbol{X} \rangle = \Pi_0$. At each timestep, we receive an observation of $\boldsymbol{X}$ of the form $\boldsymbol{Y}_i = h_i^* \boldsymbol{X} + \boldsymbol{V}_i$ where $\langle \boldsymbol{V}_i, \boldsymbol{V}_j \rangle = \delta[i, j]$ and $\langle \boldsymbol{X}, \boldsymbol{V}_i \rangle = 0$. Define

$$\boldsymbol{Y}^i = \begin{bmatrix} \boldsymbol{Y}_0 \\ \boldsymbol{Y}_1 \\ \vdots \\ \boldsymbol{Y}_i \end{bmatrix} \qquad H_i = \begin{bmatrix} h_0^*\\ h_1^*\\ \vdots\\ h_i^* \end{bmatrix} \qquad \boldsymbol{V}^i = \begin{bmatrix} \boldsymbol{V}_0 \\ \boldsymbol{V}_1 \\ \vdots \\ \boldsymbol{V}_i \end{bmatrix}$$

Then our setup becomes $\boldsymbol{Y}^i = H_i \boldsymbol{X} + \boldsymbol{V}^i$, and

$$R_{XY^i} = \Pi_0 H_i^* \qquad R_{Y^i} = H_i\Pi_0H_i^* + I$$

Applying Theorem 1 and solving the normal equation, we see

$$\begin{aligned} W &= \Pi_0 H_i^*(H_i\Pi_0H_i^* + I)^{-1} = \Pi_0 H_i^* \left(I - H_i(\Pi_0^{-1} + H_i^*H_i)^{-1}H_i^*\right)\\ &= \Pi_0 \left(I - H_i^*H_i(\Pi_0^{-1} + H_i^*H_i)^{-1}\right)H_i^* \\ &= \Pi_0 \left((\Pi_0^{-1} + H_i^*H_i) - H_i^*H_i\right)(\Pi_0^{-1} + H_i^*H_i)^{-1}H_i^*\\ &= \Pi_0 \Pi_0^{-1}(\Pi_0^{-1}+H_i^*H_i)^{-1}H_i^*\\ &= (\Pi_0^{-1} + H_i^* H_i)^{-1}H_i^* \end{aligned}$$
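
The first and last expressions for $W$ can be spot-checked numerically; the sketch below, with a random made-up positive-definite $\Pi_0$ and random $H_i$, confirms that $\Pi_0 H_i^*(H_i\Pi_0H_i^*+I)^{-1}$ and $(\Pi_0^{-1}+H_i^*H_i)^{-1}H_i^*$ agree.

```python
import numpy as np

# Numerical spot check of the identity derived above, with a random
# positive-definite Pi0 and a random H (both made up for illustration).
rng = np.random.default_rng(4)
n, m = 3, 7
A = rng.standard_normal((n, n))
Pi0 = A @ A.T + np.eye(n)                            # positive-definite prior
H = rng.standard_normal((m, n))

lhs = Pi0 @ H.T @ np.linalg.inv(H @ Pi0 @ H.T + np.eye(m))
rhs = np.linalg.inv(np.linalg.inv(Pi0) + H.T @ H) @ H.T
print(np.allclose(lhs, rhs))                         # True
```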

Suppose we want to do this in an online fashion where at each timestep $i$, we only use the current $h_i, \boldsymbol{Y}_i$ and our previous estimate $\hat{\boldsymbol{X}}_{i-1}$. Let $P_i = (\Pi_0^{-1} + H_i^*H_i)^{-1}$. Then

$$P_i^{-1} = \Pi_0^{-1} + \sum_{k=0}^i h_k h_k^* = P_{i-1}^{-1} + h_ih_i^*.$$

By applying the Sherman-Morrison-Woodbury identity, we can see that

$$P_i = P_{i-1} - \frac{P_{i-1}h_ih_i^*P_{i-1}}{1 + h_i^*P_{i-1}h_i}$$

Theorem 3 (Recursive Least Squares Update)

The best least squares estimate using $i+1$ data points can be found by updating the best least squares estimate using $i$ data points via

$$\hat{\boldsymbol{X}}_i = \hat{\boldsymbol{X}}_{i-1} + \frac{P_{i-1}h_i}{1 + h_i^*P_{i-1}h_i}(\boldsymbol{Y}_i - h_i^* \hat{\boldsymbol{X}}_{i-1})$$

Notice that this formula scales the innovation $\boldsymbol{Y}_i - h_i^*\hat{\boldsymbol{X}}_{i-1}$ in order to improve the current estimate of $\boldsymbol{X}$.
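
Putting Theorem 3 together with the rank-one update for $P_i$ gives the usual recursive least squares loop. The sketch below is a minimal version on made-up data in which the unknown vector is held fixed; function and variable names are illustrative only.

```python
import numpy as np

# Minimal RLS sketch following Theorem 3 and the rank-one update for P_i.
def rls_update(x_hat, P, h, y):
    """One step: fold the scalar observation y = h* X + v into (x_hat, P)."""
    Ph = P @ h
    gain = Ph / (1.0 + h @ Ph)                 # P_{i-1} h_i / (1 + h_i* P_{i-1} h_i)
    x_hat = x_hat + gain * (y - h @ x_hat)     # scale the innovation
    P = P - np.outer(gain, Ph)                 # Sherman-Morrison rank-one update
    return x_hat, P

rng = np.random.default_rng(5)
x_true = np.array([0.5, -1.0, 2.0])
x_hat, P = np.zeros(3), np.eye(3)              # prior mean 0, Pi_0 = I
for _ in range(500):
    h = rng.standard_normal(3)
    y = h @ x_true + rng.standard_normal()     # Y_i = h_i* X + V_i
    x_hat, P = rls_update(x_hat, P, h, y)
print(x_hat)                                    # approaches x_true
```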

Just as we could compute a recursive update, we can also compute a “downdate” where we forget a particular observation. More concretely, we want to use $\hat{\boldsymbol{X}}_i$ to find $\hat{\boldsymbol{X}}_{i|k}$, the best linear estimator of $\boldsymbol{X}$ using $\boldsymbol{Y}_0,\boldsymbol{Y}_1,\cdots,\boldsymbol{Y}_{k-1},\boldsymbol{Y}_{k+1},\cdots,\boldsymbol{Y}_i$. Defining $P_{i|k} = (\Pi_0^{-1} + H_{i|k}^*H_{i|k})^{-1}$,

$$P_{i|k}^{-1} = \Pi_0^{-1} + \sum_{j=0,j\neq k}^i h_jh_j^* = P_i^{-1} - h_kh_k^*.$$

Applying the Sherman-Morrison-Woodbury identity,

$$P_{i|k} = P_i + \frac{P_ih_kh_k^*P_i}{1 - h_k^*P_ih_k}$$

Theorem 4 (Recursive Least Squares Downdate)

The best least squares estimate using all but the $k$-th observation can be found by downdating the best least squares estimate that uses all of the data points via

$$\hat{\boldsymbol{X}}_{i|k} = \hat{\boldsymbol{X}}_i + \frac{P_ih_k}{h_k^*P_ih_k - 1}(\boldsymbol{Y}_k - h_k^*\hat{\boldsymbol{X}}_i)$$
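
As a quick check of Theorem 4, the sketch below downdates a batch solution to remove one observation and compares the result with a direct recomputation that leaves that observation out. The data and dimensions are made up for illustration.

```python
import numpy as np

# Check of the downdate formula: remove observation k from a batch solution
# and compare with recomputing the estimate without it (toy, made-up data).
rng = np.random.default_rng(6)
n, m, k = 3, 20, 7
Pi0 = np.eye(n)
H = rng.standard_normal((m, n))
Y = H @ np.array([1.0, 0.0, -1.0]) + rng.standard_normal(m)

P_i = np.linalg.inv(np.linalg.inv(Pi0) + H.T @ H)
x_i = P_i @ H.T @ Y                                  # estimate using all observations

h_k, y_k = H[k], Y[k]
x_downdated = x_i + (P_i @ h_k) / (h_k @ P_i @ h_k - 1.0) * (y_k - h_k @ x_i)

H_k, Y_k = np.delete(H, k, axis=0), np.delete(Y, k)  # drop the k-th observation
x_direct = np.linalg.inv(np.linalg.inv(Pi0) + H_k.T @ H_k) @ H_k.T @ Y_k
print(np.allclose(x_downdated, x_direct))            # True
```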
