Filtering

If we think of our signal as a discrete time random process, then, just as with a deterministic signal, we can try filtering it.

Just like with deterministic signals, filtering can be accomplished either with an LTI system or with some other non-linear/non-time-invariant system.

LTI Filtering on WSS Processes

If we use an LTI filter on a WSS process, then we can easily compute how the filter impacts the spectrum of the signal.

Theorem 13

When $Y_n$ is formed by passing a WSS process $X_n$ through a stable LTI system with impulse response $h[n]$ and transfer function $H(z)$, then $S_Y(z) = H(z)S_X(z)H^*(z^{-*})$ and $S_{YX}(z) = H(z)S_X(z)$. If we have a third process $Z_n$ that is jointly WSS with $(Y_n, X_n)$, then $S_{ZY}(z) = S_{ZX}(z)H^*(z^{-*})$.

This gives us an interesting interpretation of the spectral factorization (Definition 14), since it amounts to passing a white WSS process with auto-correlation $R_W(k) = r_e\delta[k]$ through a minimum-phase filter with transfer function $L(z)$.

Wiener Filter

Suppose we have a WSS process $Y_n$ that is jointly WSS with $X_n$, and we want to find the best linear estimator of $X_n$ using $Y_n$. The best linear estimator of $X_n$ given the observations $Y_n$ can be written as

$$\hat{X}_n = \sum_{m\in\mathbb{Z}}h(m)Y_{n-m} = h[n] * Y_n.$$

This is identical to passing $Y_n$ through an LTI filter. If we restrict ourselves to using $\{Y_i\}_{i=-\infty}^{n}$ to estimate $X_n$, then the best linear estimator can be written as

$$\hat{X}_n = \sum_{m=0}^\infty h(m)Y_{n-m} = h[n] * Y_n.$$

This is identical to passing $Y_n$ through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if the random variables we use for estimation were uncorrelated with each other. In other words, instead of using $Y$ directly, we want to transform $Y$ into a new process $W$ where $R_W(k) = \delta[k]$. This transformation is known as whitening. From the spectral factorization of $Y$, we know that if we use the filter $G(z) = \frac{1}{S_Y^+(z)}$, then

$$S_W(z) = \frac{S_Y(z)}{S_Y^+(z)S_Y^{+*}(z^{-*})} = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1.$$

Now we want to find the best linear estimator of $X$ using our new process $W$ by designing an LTI filter $Q(z)$.

Non-Causal Case

Starting with the non-causal case, we can apply the orthogonality principle:

$$\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m\in\mathbb{Z}}q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) = \sum_{m\in\mathbb{Z}}q(m)R_W(k-m) &\implies S_{XW}(z) = Q(z)S_W(z)\\ \therefore Q(z) = \frac{S_{XW}(z)}{S_W(z)} = S_{XW}(z) &= S_{XY}(z)(S_Y^+(z^{-*}))^{-*} = \frac{S_{XY}(z)}{S_Y^-(z)}\end{aligned}$$

When we cascade these filters,

$$H(z) = Q(z)G(z) = \frac{S_{XY}(z)}{S_Y^-(z)} \frac{1}{S_Y^+(z)} = \frac{S_{XY}(z)}{S_Y(z)}.$$

Definition 21

The best linear estimator of $X_n$ using $Y_n$, where $(X_n, Y_n)$ are jointly WSS, is given by the non-causal Wiener filter.

$$H(z) = \frac{S_{XY}(z)}{S_Y(z)}$$

If we interpret Definition 21 in the frequency domain, then for a specific $\omega$, we can understand $H(e^{j\omega})$ as an optimal linear estimator for $F_X(\omega)$, where $F_X(\omega)$ is the stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decomposition of $Y_n$:

$$\begin{aligned} \hat{X}_n &= \sum_{i\in\mathbb{Z}}h[i]\int_{-\pi}^\pi e^{j\omega(n-i)}dF_Y(\omega)\\ &= \int_{-\pi}^{\pi}\left(\sum_{i\in\mathbb{Z}}h[i]e^{-j\omega i}\right)e^{j\omega n}dF_Y(\omega) \\ &= \int_{-\pi}^\pi H(e^{j\omega})e^{j\omega n}dF_Y(\omega)\end{aligned}$$

Since $F_X$ and $F_Y$ have jointly orthogonal increments, this tells us that $H(e^{j\omega})$ is just the optimal linear estimator of $dF_X(\omega)$ using $dF_Y(\omega)$. $dF_X(\omega)$ and $dF_Y(\omega)$ exist on a Hilbert space, meaning we are essentially projecting each frequency component of $X_n$ onto the corresponding frequency component of $Y_n$.
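To make Definition 21 concrete, here is a minimal numerical sketch (not from the original notes) that evaluates $H(e^{j\omega}) = S_{XY}(e^{j\omega})/S_Y(e^{j\omega})$ on a frequency grid. It assumes a hypothetical model $Y_n = X_n + V_n$ with $X_n$ an AR(1) process and $V_n$ white noise independent of $X_n$, so that $S_{XY} = S_X$ and $S_Y = S_X + \sigma_v^2$; all numerical parameters are made up for illustration.

```python
import numpy as np

# Sketch: non-causal Wiener filter H = S_XY / S_Y for an assumed model Y = X + V,
# with X an AR(1) process and V white noise independent of X.
# All parameters (a, sigma_u2, sigma_v2, N) are illustrative assumptions.
N = 1024
a, sigma_u2, sigma_v2 = 0.9, 1.0, 1.0
w = 2 * np.pi * np.fft.fftfreq(N)                        # frequency grid in FFT ordering

S_X = sigma_u2 / np.abs(1 - a * np.exp(-1j * w)) ** 2    # AR(1) power spectrum of X
S_Y = S_X + sigma_v2                                     # S_Y = S_X + sigma_v^2
H = S_X / S_Y                                            # non-causal Wiener filter (S_XY = S_X here)

h = np.fft.ifft(H).real   # two-sided impulse response; indices above N//2 are negative lags
```

As expected, this filter passes the frequencies where the signal dominates the noise ($S_X \gg \sigma_v^2$) and attenuates the rest.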

Causal Case

First, note that in the causal case, whitening doesn't break causality because $\frac{1}{S_Y^+(z)}$ is causal. When we apply the orthogonality principle,

$$\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m=0}^\infty q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) &= \sum_{m = 0}^\infty q[m]R_W(k-m) \qquad k \geq 0\end{aligned}$$

We can't take the Z-transform of both sides because the equation is not necessarily true for $k < 0$. Instead, we can look at the function

$$f(k) = R_{XW}(k) - \sum_{m=0}^\infty R_W(k-m)q[m] = \begin{cases} 0 & k\geq 0,\\ ? & \text{else.}\end{cases}$$

Taking the unilateral Z-transform of both sides,

$$\begin{aligned} \left[F(z)\right]_+ &= \left[S_{XW}(z) - S_W(z)Q(z)\right]_+ = \left[S_{XW}(z)\right]_+ - Q(z) = 0\\ Q(z) &= \left[S_{XW}(z)\right]_+ = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \end{aligned}$$

Thus the filter $H$ which gives the causal best linear estimator of $X$ using $Y$ is

$$H(z) = Q(z) G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.$$

Definition 22

The best linear estimator of $X_n$ using $\{Y_i\}_{i=-\infty}^{n}$ is given by the causal Wiener filter.

$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.$$

Intuitively, this should make sense because we are using the same $W$ process as in the non-causal case, but only the samples we are allowed to use; hence we take the one-sided (unilateral Z-transform) part of the whitened-domain non-causal filter, which amounts to truncating the non-causal filter to make it causal.
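Numerically, the only extra ingredient compared to the non-causal case is the spectral factorization $S_Y = S_Y^+S_Y^-$. One common grid-based approach (an approximation, and not part of the original notes) is cepstral factorization: split the cepstrum of $S_Y$ into its causal and anti-causal halves and exponentiate. The sketch below reuses the assumed $Y_n = X_n + V_n$ model from the previous snippet.

```python
import numpy as np

# Sketch: causal Wiener filter H = [S_XY / S_Y^-]_+ / S_Y^+ on a frequency grid,
# using an approximate cepstral spectral factorization of S_Y.
# Assumed model (as before): Y = X + V, X AR(1), V white and independent of X.
N = 1024
a, sigma_u2, sigma_v2 = 0.9, 1.0, 1.0
w = 2 * np.pi * np.fft.fftfreq(N)
S_X = sigma_u2 / np.abs(1 - a * np.exp(-1j * w)) ** 2
S_Y = S_X + sigma_v2
S_XY = S_X

# Cepstral spectral factorization: S_Y^+ = exp(causal part of the cepstrum of S_Y).
c = np.fft.ifft(np.log(S_Y))              # cepstrum of S_Y (real and even here)
c_plus = np.zeros(N, dtype=complex)
c_plus[0] = c[0] / 2                      # split the zero-lag term evenly
c_plus[1:N // 2] = c[1:N // 2]            # keep strictly positive lags
c_plus[N // 2] = c[N // 2] / 2            # split the Nyquist-lag term evenly
S_Y_plus = np.exp(np.fft.fft(c_plus))     # minimum-phase factor S_Y^+(e^{jw})
S_Y_minus = np.conj(S_Y_plus)             # on the unit circle, S_Y^- = (S_Y^+)^*

# [S_XY / S_Y^-]_+ : drop the negative-lag part of the whitened cross spectrum.
q = np.fft.ifft(S_XY / S_Y_minus)
q[N // 2 + 1:] = 0                        # zero out negative lags (grid-based truncation)
Q = np.fft.fft(q)

H_causal = Q / S_Y_plus                   # causal Wiener filter frequency response
```

On a finite grid this is only an approximation to the true factorization, but it illustrates the whiten-truncate-unwhiten structure of Definition 22.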

Theorem 14

If $\hat{X}_{NC}(n)$ is the non-causal Wiener filter estimate of $X$, then the causal Wiener filter of $X$ given $Y$ is the same as the causal Wiener filter of $\hat{X}_{NC}$ given $Y$, and if $Y$ is white noise, then

$$\hat{X}_C(n) = \sum_{i=0}^{\infty}h[i]Y_{n-i}$$

Vector Case

Suppose that instead of a wide-sense stationary process, we have a length-$N$ signal $\boldsymbol{X}$ which we want to estimate with another length-$N$ signal $\boldsymbol{Y}$. We can represent both $\boldsymbol{X}$ and $\boldsymbol{Y}$ as vectors in $\mathbb{C}^N$. If we are allowed to use all entries of $\boldsymbol{Y}$ to estimate $\boldsymbol{X}$, this is identical to linear estimation.

Definition 23

The non-causal Wiener filter of a finite length-$N$ signal $\boldsymbol{Y}$ is given by

$$K_s = R_{\boldsymbol{X}\boldsymbol{Y}}R_{\boldsymbol{Y}}^{-1}.$$

Note that this requires $R_{\boldsymbol{Y}} \succ 0$. Suppose that we wanted to design a causal filter for the vector case, so $\hat{X}_i$ only depends on $\{Y_j\}_{j=1}^i$. By the orthogonality principle,

$$\forall\, 1 \leq l \leq i,\quad \mathbb{E}\left[\left(X_i - \sum_{j=1}^i K_{f, ij}Y_j\right)Y_l^*\right] = 0 \implies R_{\boldsymbol{XY}}(i, l) = \sum_{j=1}^i K_{f, ij}R_{\boldsymbol{Y}}(j, l)$$

In matrix form, this means

$$R_{\boldsymbol{XY}} - K_fR_{\boldsymbol{Y}} = U^+$$

where $U^+$ is strictly upper triangular.

Theorem 15

If a matrix $H \succ 0$, then there exists a unique factorization $H = LDL^*$ where $L$ is lower triangular and invertible with unit diagonal entries and $D$ is diagonal with positive entries.

Applying the LDL decomposition, we see that

$$\begin{aligned} R_{\boldsymbol{XY}} - K_fLDL^* = U^+ &\implies R_{\boldsymbol{XY}}L^{-*}D^{-1} - K_f L = U^+L^{-*}D^{-1}\\ \therefore\ [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_L - K_f L &= 0\end{aligned}$$

where $[\cdot]_L$ represents the lower triangular part of a matrix.

Definition 24

The causal Wiener filter of a finite length-$N$ signal $\boldsymbol{Y}$ is given by

$$K_f = [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_LL^{-1}$$
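Definition 24 translates directly into a few lines of linear algebra. The sketch below is an illustration rather than something from the notes; the function name and the use of numpy are assumptions. It obtains the $LDL^*$ factors of $R_{\boldsymbol{Y}}$ from its Cholesky factor and then forms $K_f$.

```python
import numpy as np

def causal_wiener_matrix(R_XY, R_Y):
    """Sketch of Definition 24: K_f = [R_XY L^{-*} D^{-1}]_L L^{-1},
    where R_Y = L D L^* is obtained from the Cholesky factor of R_Y (assumed positive definite)."""
    C = np.linalg.cholesky(R_Y)              # R_Y = C C^*, C lower triangular
    d = np.diag(C) ** 2                      # D = diag(d), since C = L D^{1/2}
    L = C / np.diag(C)                       # unit-diagonal lower triangular factor
    M = R_XY @ np.linalg.inv(L).conj().T @ np.diag(1.0 / d)
    return np.tril(M) @ np.linalg.inv(L)     # [.]_L keeps the lower triangular part
```

Since $K_f$ is lower triangular, row $i$ of the estimate $\hat{\boldsymbol{X}} = K_f\boldsymbol{Y}$ only touches $Y_1, \dots, Y_i$, whereas the non-causal gain $K_s = R_{\boldsymbol{XY}}R_{\boldsymbol{Y}}^{-1}$ uses all of $\boldsymbol{Y}$.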

Hidden Markov Model State Estimation

Suppose we have a Hidden Markov Process $\{Y_n\}_{n\geq1}$. We can think of determining the state $\{X_n\}_{n\geq1}$ as filtering $\{Y_n\}_{n\geq1}$.

Causal Distribution Estimation

Suppose we want to know the distribution of $X_t$ after we have observed $Y^t$.

$$\begin{aligned} p(x_t|y^t) &= \frac{p(x_t,y^t)}{p(y^t)} = \frac{p(x_t)p(y_t, y^{t-1}|x_t)}{\sum_{x}p(y_t,y^{t-1}|x_t=x)p(x_t=x)}\\ &= \frac{p(x_t)p(y_t|x_t)p(y^{t-1}|x_t)}{\sum_xp(y_t|x_t=x)p(y^{t-1}|x_t=x)p(x_t=x)} = \frac{p(y_t|x_t)p(y^{t-1})p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(y^{t-1})p(x_t=x|y^{t-1})}\\ &=\frac{p(y_t|x_t)p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(x_t=x|y^{t-1})}\end{aligned}$$

Now if we know $p(x_t|y^{t-1})$, then we are set.

$$\begin{aligned} p(x_t|y^{t-1}) &= \sum_xp(x_t,x_{t-1}=x|y^{t-1}) = \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x,y^{t-1}) \\ &= \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x)\end{aligned}$$

If we denote $\alpha_t(x_t) = p(x_t|y^t)$ and $\beta_t(x_t) = p(x_t|y^{t-1})$, these two recursions give us a recursive (forward) algorithm for computing the distribution of $x_t$.
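A minimal sketch of this forward recursion for finite state and observation alphabets (the array conventions and function name are assumptions, not from the notes):

```python
import numpy as np

def hmm_forward(P, B, pi, y):
    """Causal distribution estimation: returns alpha_t(x) = p(x_t | y^t) for each t.
    P[i, j] = p(x_{t+1}=j | x_t=i), B[i, k] = p(y_t=k | x_t=i), pi[i] = p(x_1=i)."""
    beta = pi.copy()                   # beta_1(x) = p(x_1), playing the role of p(x_1 | y^0)
    alphas = []
    for obs in y:                      # y is a sequence of observation indices
        unnorm = B[:, obs] * beta      # p(y_t | x_t) p(x_t | y^{t-1})
        alpha = unnorm / unnorm.sum()  # normalize over x_t (the denominator in the recursion)
        alphas.append(alpha)
        beta = P.T @ alpha             # time update: beta_{t+1}(x') = sum_x alpha_t(x) P[x, x']
    return np.array(alphas)
```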

Non-Causal Distribution Estimation

Suppose we are allowed to non-causally filter our signal and we care about the distribution of $X_t$ after we have observed $Y^n$. In other words, for $t \leq n$, we want to find $\gamma_t(x_t) = p(x_t|y^n)$. When $t=n$, $\gamma_n(x_n) = \alpha_n(x_n)$. If we continue expanding backwards, then

$$\begin{aligned} p(x_t|y^n) &= \sum_x p(x_t,x_{t+1}=x|y^n) = \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t,y_{t+1}^n)\\ &= \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t) = \sum_x p(x_{t+1}=x|y^n)\frac{p(x_t|y^t)p(x_{t+1}=x|x_t,y^t)}{p(x_{t+1}=x|y^t)}\\ &= \sum_x \gamma_{t+1}(x)\frac{\alpha_t(x_t)p(x_{t+1}=x|x_t)}{\beta_{t+1}(x)}\end{aligned}$$

This gives us a clear algorithm for non-causally computing the distribution of $x_t$.
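Continuing the sketch above, the backward pass might look like the following (again illustrative; it assumes the same transition-matrix convention $P[i,j] = p(x_{t+1}=j\,|\,x_t=i)$ and that every $\beta_{t+1}(x)$ appearing in the division is positive).

```python
import numpy as np

def hmm_smooth(P, alphas):
    """Non-causal distribution estimation: gamma_t(x) = p(x_t | y^n), built from the
    forward quantities alpha_t produced by hmm_forward above."""
    n = len(alphas)
    gammas = [None] * n
    gammas[-1] = alphas[-1]                  # gamma_n = alpha_n
    for t in range(n - 2, -1, -1):
        beta_next = P.T @ alphas[t]          # beta_{t+1}(x') = p(x_{t+1}=x' | y^t)
        # gamma_t(x) = alpha_t(x) sum_{x'} p(x_{t+1}=x' | x_t=x) gamma_{t+1}(x') / beta_{t+1}(x')
        gammas[t] = alphas[t] * (P @ (gammas[t + 1] / beta_next))
    return np.array(gammas)
```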

State Sequence Estimation

Suppose we want to find the most likely sequence of states given our observations. This means we should compute

$$\hat{X}^n = \text{argmax}_{x^n}\, p(x^n|y^n)$$

Since $p(x^n|y^n) \propto p(x^n, y^n)$, we can equivalently maximize the joint distribution, which factors recursively:

$$\begin{aligned} p(x^t, y^t) &= p(x^{t-1}, y^{t-1})p(x_t, y_t|x^{t-1},y^{t-1})\\ &= p(x^{t-1}, y^{t-1})p(x_t|x^{t-1},y^{t-1})p(y_t|x_t,x^{t-1},y^{t-1}) \\ &= p(x^{t-1},y^{t-1})p(x_t|x_{t-1})p(y_t|x_t)\end{aligned}$$

We see that there is a recursion in the joint distribution, so if we let Vt(xt)=max⁡xt−1p(xt,yt)V_t(x_t) = \max_{x^{t-1}}p(x^t,y^t)Vt​(xt​)=maxxt−1​p(xt,yt), then

$$\begin{aligned} V_t(x_t) &= \max_{x^{t-1}} p(x^t, y^t) = p(y_t|x_t)\max_{x^{t-1}}p(x^{t-1},y^{t-1})p(x_t|x_{t-1})\\ &= p(y_t|x_t)\max_{x_{t-1}}\left[p(x_t|x_{t-1}) \max_{x^{t-2}} p(x^{t-1},y^{t-1})\right]\\ &= p(y_t|x_t)\max_{x_{t-1}}p(x_t|x_{t-1}) V_{t-1}(x_{t-1})\end{aligned}$$

The base case is $V_1(x_1) = p(x_1)p(y_1|x_1)$. $V_t$ is useful because $\hat{x}_n = \text{argmax}_{x_n}V_n(x_n)$: we can first maximize over $x^{n-1}$, so the only thing left to maximize over is $x_n$. Once we have $\hat{x}_t$, we can compute $\hat{x}_{t-1}$ by

$$\hat{x}_{t-1} = \text{argmax}_{x_{t-1}}\,p(\hat{x}_t|x_{t-1})V_{t-1}(x_{t-1}).$$

Putting these equations together gives us the Viterbi algorithm.
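A compact sketch of the Viterbi recursion in log-space (illustrative; the array conventions match the earlier HMM snippets, and zero probabilities become $-\infty$ in the logs):

```python
import numpy as np

def viterbi(P, B, pi, y):
    """Most likely state sequence: V_t(x) = p(y_t|x) max_{x'} p(x|x') V_{t-1}(x'), in log-space.
    P[i, j] = p(x_{t+1}=j | x_t=i), B[i, k] = p(y_t=k | x_t=i), pi[i] = p(x_1=i)."""
    logP, logB = np.log(P), np.log(B)
    n, S = len(y), len(pi)
    V = np.zeros((n, S))
    back = np.zeros((n, S), dtype=int)
    V[0] = np.log(pi) + logB[:, y[0]]                 # V_1(x) = log p(x_1) + log p(y_1|x_1)
    for t in range(1, n):
        scores = V[t - 1][:, None] + logP             # scores[x', x] = V_{t-1}(x') + log p(x|x')
        back[t] = np.argmax(scores, axis=0)           # best predecessor for each current state
        V[t] = logB[:, y[t]] + np.max(scores, axis=0)
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(V[-1]))                  # hat{x}_n = argmax_x V_n(x)
    for t in range(n - 1, 0, -1):                     # backtrack through the stored argmaxes
        path[t - 1] = back[t, path[t]]
    return path
```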

Kalman Filtering

In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model, with state evolution $\boldsymbol{X}_{i+1} = F_i\boldsymbol{X}_i + G_i\boldsymbol{U}_i$ and observations $\boldsymbol{Y}_i = H_i\boldsymbol{X}_i + \boldsymbol{V}_i$. We want to estimate the state vectors $\boldsymbol{X}_i$ using some linear combination of the observations $\boldsymbol{Y}_i$.

Kalman Prediction Filter

Suppose that we want to compute the one-step prediction. In other words, given $\boldsymbol{Y}^i$, we want to compute $\hat{\boldsymbol{X}}_{i+1|i}$, the prediction of $\boldsymbol{X}_{i+1}$. Our observations $\boldsymbol{Y}$ are the only thing which give us information about the state, so it would be nice if we could de-correlate all of the $\boldsymbol{Y}$. To do this, we can define the innovation process

$$\boldsymbol{e}_i = \boldsymbol{Y}_i - \hat{\boldsymbol{Y}}_{i|i-1} = \boldsymbol{Y}_i - H_i\hat{\boldsymbol{X}}_{i|i-1}$$

The last equality follows from the state-space model and the fact that past observation noises are uncorrelated with the current one. Now, to compute the one-step prediction, we just need to project $\boldsymbol{X}_{i+1}$ onto the innovations.

$$\begin{aligned} \hat{\boldsymbol{X}}_{i+1|i} &= \sum_{j=0}^i\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \sum_{j=0}^{i-1}\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \hat{\boldsymbol{X}}_{i+1|i-1}\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + F_i\hat{\boldsymbol{X}}_{i|i-1}\end{aligned}$$

The second-to-last equality is just the one-step prediction of $\boldsymbol{X}_{i+1}$ from the past innovations, and the last equality follows from the wide-sense Markovity of state-space models and from the state evolution noise being uncorrelated with past observations. If we let $K_{p,i} = \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}$ (called the prediction gain), then we have a recursive estimate of the optimal one-step predictor.

$$\hat{\boldsymbol{X}}_{i+1|i} = F_i\hat{\boldsymbol{X}}_{i|i-1} + K_{p,i}\boldsymbol{e}_i.$$

Now, we just need to find a recursive formulation for $K_{p,i}$ and $R_{\boldsymbol{e},i}$. Starting with $R_{\boldsymbol{e},i}$, notice that we can write $\boldsymbol{e}_i = \boldsymbol{Y}_i-H_i\hat{\boldsymbol{X}}_{i|i-1} = H_i(\boldsymbol{X}_i-\hat{\boldsymbol{X}}_{i|i-1})+\boldsymbol{V}_i$.

$$\begin{aligned} R_{\boldsymbol{e},i} &= \langle H_i(\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1})+ \boldsymbol{V}_i, H_i(\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= H_i\langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1} \rangle H_i^* + R_i\end{aligned}$$

To find $K_{p,i}$, we should first find $\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle$.

$$\begin{aligned} \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle &= F_i\langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, H_i(\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1})+\boldsymbol{V}_i \rangle + G_i\langle \boldsymbol{U}_i, H_i(\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle (\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}) + \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1} \rangle H_i^* + G_iS_i\end{aligned}$$

Notice that the matrix $P_i = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1} \rangle$ is the auto-correlation of the estimation error, and it shows up in both $K_{p,i}$ and $R_{\boldsymbol{e},i}$. It would be useful to have a recursive solution for this matrix as well.

$$\begin{aligned} P_{i+1} &= \Pi_{i+1}-\langle \hat{\boldsymbol{X}}_{i+1|i}, \hat{\boldsymbol{X}}_{i+1|i} \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - \langle F_i\hat{\boldsymbol{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i, F_i\hat{\boldsymbol{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - F_i\langle \hat{\boldsymbol{X}}_{i|i-1}, \hat{\boldsymbol{X}}_{i|i-1} \rangle F_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\\ &= F_iP_iF_i^* + G_iQ_iG_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\end{aligned}$$

Putting this into a concrete algorithm, we get the Kalman Prediction Filter.
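A minimal numpy sketch of this recursion, assuming time-invariant, real-valued model matrices and the conventions $\langle U,U\rangle = Q$, $\langle V,V\rangle = R$, $\langle U,V\rangle = S$, $\Pi_0 = \langle X_0, X_0\rangle$ (all of this layout is an assumption for illustration), might look like this:

```python
import numpy as np

def kalman_predict(F, G, H, Q, R, S, Pi0, ys):
    """Sketch of the Kalman prediction filter: returns hat{X}_{i+1|i} for each observation Y_i.
    State model: X_{i+1} = F X_i + G U_i, Y_i = H X_i + V_i (time-invariant matrices assumed)."""
    x_pred = np.zeros(F.shape[0])     # hat{X}_{0|-1} = 0
    P = Pi0.copy()                    # P_0 = Pi_0
    preds = []
    for y in ys:
        e = y - H @ x_pred                                # innovation e_i
        Re = H @ P @ H.T + R                              # R_{e,i}
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)    # prediction gain K_{p,i}
        x_pred = F @ x_pred + Kp @ e                      # hat{X}_{i+1|i}
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T    # P_{i+1}
        preds.append(x_pred)
    return np.array(preds)
```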

Schmidt’s Modification of the Kalman Filter

The predictive Kalman filter goes directly from $\hat{\boldsymbol{X}}_{i|i-1}$ to $\hat{\boldsymbol{X}}_{i+1|i}$ without ever determining $\hat{\boldsymbol{X}}_{i|i}$. The Schmidt modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.

  1. Measurement Update: Find $\hat{\boldsymbol{X}}_{i|i}$ given the latest observation $\boldsymbol{Y}_{i}$ and $\hat{\boldsymbol{X}}_{i|i-1}$.

  2. State Evolution (Time) Update: Find $\hat{\boldsymbol{X}}_{i+1|i}$ using what we know about the state evolution.

This mimics the approach of the forward algorithm for Hidden Markov Models, which separates updates to the distribution into a time update and a measurement update. Using our innovation process,

$$\begin{aligned} \hat{\boldsymbol{X}}_{i|i} &= \sum_{j=0}^i \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \hat{\boldsymbol{X}}_{i|i-1} + \langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \hat{\boldsymbol{X}}_{i|i-1} + \langle (\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}) + \hat{\boldsymbol{X}}_{i|i-1}, H_i(\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}) + \boldsymbol{V}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \hat{\boldsymbol{X}}_{i|i-1} + P_iH_i^*R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}$$

The gain applied to the innovation, $K_{f,i}=P_iH_i^*R_{\boldsymbol{e},i}^{-1}$, is called the Kalman Filter Gain. The error of our estimator, $P_{i|i} = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i}, \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i} \rangle$, is given by

$$P_{i|i} = P_i - P_iH_i^*R_{\boldsymbol{e},i}^{-1}H_iP_i.$$

For the time update,

$$\begin{aligned} \hat{\boldsymbol{X}}_{i+1|i} &= F_i\hat{\boldsymbol{X}}_{i|i} + G_i\hat{\boldsymbol{U}}_{i|i} \\ &= F_i\hat{\boldsymbol{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\hat{\boldsymbol{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, H_i\boldsymbol{X}_i+\boldsymbol{V}_i - H_i\hat{\boldsymbol{X}}_{i|i-1} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\hat{\boldsymbol{X}}_{i|i} + G_iS_iR_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}$$

We can re-write the error of this estimator $P_{i+1}$ as

$$P_{i+1} = F_iP_{i|i}F_i^* + G_i(Q_i - S_iR_{\boldsymbol{e},i}^{-1}S_i^*)G_i^* - F_iK_{f,i}S_i^*G_i^* - G_iS_iK_{f,i}^*F_i^*$$

Writing this as an algorithm, we alternate the measurement and time updates at every step.
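A minimal numpy sketch of the two updates, under the same assumed conventions as the prediction-filter sketch above, might look like this:

```python
import numpy as np

def kalman_schmidt(F, G, H, Q, R, S, Pi0, ys):
    """Sketch of the measurement/time update form: returns hat{X}_{i|i} for each observation Y_i
    (time-invariant real-valued model matrices assumed)."""
    x_pred, P = np.zeros(F.shape[0]), Pi0.copy()     # hat{X}_{0|-1}, P_0
    filtered = []
    for y in ys:
        # Measurement update
        e = y - H @ x_pred                           # innovation e_i
        Re = H @ P @ H.T + R                         # R_{e,i}
        Kf = P @ H.T @ np.linalg.inv(Re)             # Kalman filter gain K_{f,i}
        x_filt = x_pred + Kf @ e                     # hat{X}_{i|i}
        P_filt = P - Kf @ H @ P                      # P_{i|i} = P_i - P_i H^* R_e^{-1} H P_i
        filtered.append(x_filt)
        # Time update
        x_pred = F @ x_filt + G @ S @ np.linalg.inv(Re) @ e          # hat{X}_{i+1|i}
        P = (F @ P_filt @ F.T + G @ (Q - S @ np.linalg.inv(Re) @ S.T) @ G.T
             - F @ Kf @ S.T @ G.T - G @ S @ Kf.T @ F.T)              # P_{i+1}
    return np.array(filtered)
```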

Kalman Smoother

The Kalman Prediction Filter and Schmidt's modification of the Kalman filter are both causal filters. The Kalman Smoother provides a non-causal estimate in the same way that the Backward Recursion algorithm does for Hidden Markov Processes. In other words, the Kalman Smoother computes $\hat{\boldsymbol{X}}_{i|N}$, the best linear estimator of $\boldsymbol{X}_i$ from $\{\boldsymbol{Y}_0, \cdots, \boldsymbol{Y}_N\}$. As before, we can start with the innovation process $\boldsymbol{e}_j = \boldsymbol{Y}_j - H_j\hat{\boldsymbol{X}}_{j|j-1}$.

$$\hat{\boldsymbol{X}}_{i|N} = \sum_{j=0}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j = \hat{\boldsymbol{X}}_{i|i-1} + \sum_{j=i}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.$$

We can compute $\hat{\boldsymbol{X}}_{i|i-1}$ using the predictive Kalman filter, so we just need to compute the terms $\langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j$.

$$\begin{aligned} \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle &= \langle \boldsymbol{X}_i, H_j(\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1}) + \boldsymbol{V}_j \rangle = \langle \boldsymbol{X}_i, H_j(\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1}) \rangle \\ &= \langle \hat{\boldsymbol{X}}_{i|i-1} + (\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}), \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle H_j^*\end{aligned}$$

$\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1}$ is orthogonal to any linear function of $\{\boldsymbol{Y}_0, \cdots, \boldsymbol{Y}_{j-1}\}$, so when $j \geq i$, it must be orthogonal to $\hat{\boldsymbol{X}}_{i|i-1}$ since that is a function of $\{\boldsymbol{Y}_k\}^{i-1}_0$. Thus, for $j\geq i$,

$$\langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle H_j^*$$

If we denote $P_{ij} = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle$, then

$$P_{ij} = P_{i}\Phi_{p}^*(j, i) = P_{i} \begin{cases} I & \text{ if } j = i\\ \prod_{k=i}^{j-1} F_{p,k} & \text{ if } j > i \end{cases}$$

where $F_{p,k} = (F_k - K_{p, k}H_k)$. This gives us the expression

$$\hat{\boldsymbol{X}}_{i|N} = \hat{\boldsymbol{X}}_{i|i-1} + P_i \sum_{j=i}^N \Phi_p^*(j, i)H_j^*R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.$$

If we let $\lambda_{i|N} = \sum_{j=i}^N \Phi_p^*(j, i)H_j^*R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j$, then we get the recursion

$$\lambda_{i|N} = F_{p,i}^*\lambda_{i+1|N} + H_i^*R_{\boldsymbol{e}, i}^{-1}\boldsymbol{e}_i.$$

If we want to look at the error of this estimator, we see that

$$P_{i|N} = P_i - P_i \Lambda_{i|N} P_i \quad\text{ where }\quad \Lambda_{i|N} = \langle \lambda_{i|N}, \lambda_{i|N} \rangle = F_{p,i}^*\Lambda_{i+1|N}F_{p,i}+H_i^*R_{\boldsymbol{e}, i}^{-1}H_i.$$

Writing all of this as an algorithm, we run a forward pass of the prediction filter and then a backward pass to compute $\lambda_{i|N}$.
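A minimal numpy sketch of the smoother, again under the same assumed conventions (a forward pass of the prediction filter followed by the backward $\lambda$ recursion), might look like this:

```python
import numpy as np

def kalman_smooth(F, G, H, Q, R, S, Pi0, ys):
    """Sketch of the Kalman smoother: returns hat{X}_{i|N} for i = 0, ..., N
    (time-invariant real-valued model matrices assumed)."""
    n = F.shape[0]
    x_pred, P = np.zeros(n), Pi0.copy()
    preds, Ps, es, Res, Fps = [], [], [], [], []
    for y in ys:                                          # forward pass: prediction filter
        e = y - H @ x_pred
        Re = H @ P @ H.T + R
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)
        preds.append(x_pred); Ps.append(P); es.append(e); Res.append(Re)
        Fps.append(F - Kp @ H)                            # F_{p,i} = F_i - K_{p,i} H_i
        x_pred = F @ x_pred + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    lam = np.zeros(n)                                     # lambda_{N+1|N} = 0
    smoothed = [None] * len(ys)
    for i in range(len(ys) - 1, -1, -1):                  # backward pass: lambda recursion
        lam = Fps[i].T @ lam + H.T @ np.linalg.solve(Res[i], es[i])
        smoothed[i] = preds[i] + Ps[i] @ lam              # hat{X}_{i|N} = hat{X}_{i|i-1} + P_i lambda_{i|N}
    return np.array(smoothed)
```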

Figure 1: Filtering a Discrete Time Random Process with an LTI system with transfer function $H(z)$
Figure 2: Finding the best linear estimator of $X$ using $W$ with a two-stage filter that first whitens the input.