Filtering

If we think of our signal as a discrete-time random process, then just as with a deterministic signal, we can try filtering our random process.

Just as with deterministic signals, filtering can be accomplished either with an LTI system or with some other non-linear or time-varying system.

LTI Filtering on WSS Processes

If we use an LTI filter on a WSS process, then we can easily compute how the filter impacts the spectrum of the signal.

Theorem 13

When Y_n is formed by passing a WSS process X_n through a stable LTI system with impulse response h[n] and transfer function H(z), then S_Y(z) = H(z)S_X(z)H^*(z^{-*}) and S_{YX}(z) = H(z)S_X(z). If we have a third process Z_n that is jointly WSS with (Y_n, X_n), then S_{ZY}(z) = S_{ZX}(z)H^*(z^{-*}).

This gives us an interesting interpretation of the spectral factorization (Definition 14), since it is essentially equivalent to passing a WSS process with auto-correlation R_W(k) = r_e\delta[k] through a minimum-phase filter with transfer function L(z).
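
As a quick numerical sanity check of Theorem 13 (a minimal sketch; the filter coefficients, sample sizes, and use of Welch's method below are my own choices, not from the text), we can pass white noise through a stable LTI filter and compare the estimated output spectrum against |H(e^{j\omega})|^2 S_X(e^{j\omega}):

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(0)

# White WSS input: S_X(e^{jw}) = sigma2 at every frequency (fs = 1).
sigma2 = 2.0
x = rng.normal(scale=np.sqrt(sigma2), size=200_000)

# An arbitrary stable LTI filter H(z) = B(z) / A(z).
b, a = [1.0, 0.5], [1.0, -0.8]
y = signal.lfilter(b, a, x)

# Estimate S_Y with Welch's method and compare to |H(e^{jw})|^2 S_X(e^{jw}).
f, S_y_hat = signal.welch(y, nperseg=4096, return_onesided=False)
_, H = signal.freqz(b, a, worN=2 * np.pi * f)
S_y_theory = np.abs(H) ** 2 * sigma2

# Median relative error across frequencies; small up to spectral-estimation noise.
print(np.median(np.abs(S_y_hat / S_y_theory - 1)))
```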

Wiener Filter

Suppose we have a WSS process Y_n that is jointly WSS with X_n, and we want to find the best linear estimator of X_n using Y_n. The best linear estimator of X_n given the observations Y_n can be written as

\hat{X}_n = \sum_{m\in\mathbb{Z}}h(m)Y_{n-m} = h[n] * Y_n.

This is identical to passing Y_n through an LTI filter. If we restrict ourselves to using \{Y_i\}_{i=-\infty}^{n} to estimate X_n, then the best linear estimator can be written as

\hat{X}_n = \sum_{m=0}^\infty h(m)Y_{n-m} = h[n] * Y_n.

This is identical to passing Y_n through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if the random variables we use for estimation were uncorrelated with each other. In other words, instead of using Y directly, we want to transform Y into a new process W with R_W(k) = \delta[k]. This transformation is known as whitening. From the spectral factorization of Y, we know that if we use the filter G(z) = \frac{1}{S_Y^+(z)}, then

S_W(z) = \frac{S_Y(z)}{S_Y^+(z)S_Y^{+*}(z^{-*})} = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1.

Now we want to find the best linear estimator of X using our new process W by designing an LTI filter Q(z).

Non-Causal Case

Starting with the non-causal case, we can apply the orthogonality principle,

\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m\in\mathbb{Z}}q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) = \sum_{m\in\mathbb{Z}}q(m)R_W(k-m) &\implies S_{XW}(z) = Q(z)S_W(z)\\ \therefore Q(z) = \frac{S_{XW}(z)}{S_W(z)} = S_{XW}(z) &= S_{XY}(z)(S_Y^+(z^{-*}))^{-*} = \frac{S_{XY}(z)}{S_Y^-(z)}\end{aligned}

When we cascade these filters,

H(z) = Q(z)G(z) = \frac{S_{XY}(z)}{S_Y^-(z)} \frac{1}{S_Y^+(z)}= \frac{S_{XY}(z)}{S_Y(z)}.

Definition 21

The best linear estimator of X_n using Y_n, where (X_n, Y_n) are jointly WSS, is given by the non-causal Wiener filter.

H(z) = \frac{S_{XY}(z)}{S_Y(z)}

If we interpret Definition 21 in the frequency domain, then for a specific \omega, we can understand H(e^{j\omega}) as an optimal linear estimator for F_X(\omega), where F_X(\omega) is the stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decomposition of Y_n.

\begin{aligned} \hat{X}_n &= \sum_{i\in\mathbb{Z}}h[i]\int_{-\pi}^\pi e^{j\omega(n-i)}dF_Y(\omega)\\ &= \int_{-\pi}^{\pi}\left(\sum_{i\in\mathbb{Z}}h[i]e^{-j\omega i}\right)e^{j\omega n}dF_Y(\omega) \\ &= \int_{-\pi}^\pi H(e^{j\omega})e^{j\omega n}dF_Y(\omega)\end{aligned}

Since F_X and F_Y have jointly orthogonal increments, this tells us that H(e^{j\omega}) is just the optimal linear estimator of dF_X(\omega) using dF_Y(\omega). dF_X(\omega) and dF_Y(\omega) are elements of a Hilbert space, meaning we are essentially projecting each frequency component of X_n onto the corresponding frequency component of Y_n.
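
To make Definition 21 concrete, here is a minimal numerical sketch (the toy signal model and all spectral-estimation choices are my own assumptions, not from the text): estimate S_{XY} and S_Y from data, form \hat{H}(e^{j\omega}) = S_{XY}/S_Y, and apply it as a two-sided, non-causal FIR filter. For this particular model S_{XY} = S_X is real, so cross-spectrum conjugation conventions do not affect the result.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(1)
n = 100_000

# Jointly WSS toy pair: X is an AR(1) process, Y is a noisy observation of X.
x = signal.lfilter([1.0], [1.0, -0.9], rng.normal(size=n))
y = x + rng.normal(size=n)

# Estimate S_XY and S_Y on a common frequency grid.
nfft = 1024
f, S_xy = signal.csd(x, y, nperseg=nfft, return_onesided=False)
_, S_y = signal.welch(y, nperseg=nfft, return_onesided=False)

# Non-causal Wiener filter H(e^{jw}) = S_XY / S_Y, realized as a two-sided
# FIR impulse response via the inverse DFT (lags -nfft/2 .. nfft/2 - 1).
H = S_xy / S_y
h = np.fft.fftshift(np.fft.ifft(H)).real

# Convolve and realign so that x_hat[t] = sum_m h[m] y[t - m].
x_hat = np.convolve(y, h)[nfft // 2 : nfft // 2 + n]
print(np.mean((x - x_hat) ** 2), np.mean((x - y) ** 2))  # filtered MSE < raw MSE
```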

Causal Case

First, note that in the causal case, whitening doesn’t break causality because \frac{1}{S_Y^+(z)} is causal. When we apply the orthogonality principle,

\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m=0}^\infty q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) &= \sum_{m = 0}^\infty q[m]R_W(k-m) \qquad k \geq 0\end{aligned}

We can’t take the Z-transform of both sides because the equation is not necessarily true for k < 0. Instead, we can look at the function

f(k) = R_{XW}(k) - \sum_{m=0}^\infty R_W(k-m)q[m] = \begin{cases} 0 & k\geq 0,\\ ? & \text{ else.}\end{cases}

Taking the unilateral Z-transform of both sides,

\begin{aligned} \left[F(z)\right]_+ &= \left[S_{XW}(z) - S_W(z)Q(z)\right]_+ = \left[S_{XW}(z)\right]_+ - Q(z) = 0\\ Q(z) &= \left[S_{XW}(z)\right]_+ = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \end{aligned}

Thus the filter H which gives the causal best linear estimator of X using Y is

H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.

Definition 22

The best linear estimator of X_n using \{Y_i\}_{i=-\infty}^{n} is given by the causal Wiener filter.

H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.

Intuitively, this should make sense because we are using the same whitened process W as in the non-causal case, only restricted to the samples we are allowed to use; hence we take the unilateral Z-transform of the non-causal Wiener filter, which amounts to truncating the non-causal filter to make it causal.
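
The construction above goes through spectral factorization. As a purely numerical alternative (my own finite-order sketch, not the construction in the text), one can approximate the causal Wiener filter with M causal taps by solving the Wiener-Hopf normal equations \sum_{m=0}^{M-1} h[m]R_Y(k-m) = R_{XY}(k) for k = 0, \dots, M-1, which is exactly the orthogonality condition above restricted to finitely many taps:

```python
import numpy as np
from scipy.linalg import toeplitz

def causal_fir_wiener(r_y, r_xy):
    """Length-M causal Wiener filter from r_y = [R_Y(0), ..., R_Y(M-1)] and
    r_xy = [R_XY(0), ..., R_XY(M-1)] (real-valued case for simplicity)."""
    R = toeplitz(r_y)                # R[k, m] = R_Y(k - m)
    return np.linalg.solve(R, r_xy)  # taps h[0], ..., h[M-1]

# Toy model: Y_n = X_n + V_n with X an AR(1) process and V white noise.
a, sigma_v2, M = 0.9, 0.25, 32
lags = np.arange(M)
r_x = a ** lags / (1 - a ** 2)        # R_X(k) for k >= 0 (unit-variance driving noise)
r_y = r_x + sigma_v2 * (lags == 0)    # R_Y(k) = R_X(k) + sigma_v^2 delta[k]
r_xy = r_x                            # R_XY(k) = E[X_n Y_{n-k}] = R_X(k)

h = causal_fir_wiener(r_y, r_xy)
print(h[:5])                          # decaying causal taps
```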

Theorem 14

If \hat{X}_{NC}(n) is the non-causal Wiener filter of X, then the causal Wiener filter of X given Y is the same as the causal Wiener filter of \hat{X}_{NC} given Y, and if Y is white noise, then

\hat{X}_C(n) = \sum_{i=0}^{\infty}h[i]Y_{n-i}

Vector Case

Suppose that instead of a wide-sense stationary process, we have a length-N signal \boldsymbol{X} which we want to estimate with another length-N signal \boldsymbol{Y}. We can represent both \boldsymbol{X} and \boldsymbol{Y} as vectors in \mathbb{C}^N. If we are allowed to use all entries of \boldsymbol{Y} to estimate \boldsymbol{X}, this is identical to linear estimation.

Definition 23

The non-causal Wiener filter of a finite length-N signal \boldsymbol{Y} is given by

K_s = R_{\boldsymbol{X}\boldsymbol{Y}}R_{\boldsymbol{Y}}^{-1}.

Note that this requires R_{\boldsymbol{Y}} \succ 0. Suppose that we wanted to design a causal filter for the vector case, so \hat{X}_i only depends on \{Y_j\}_{j=1}^i. By the orthogonality principle,

\forall 1 \leq l \leq i,\ \mathbb{E}\left[\left(X_i - \sum_{j=1}^iK_{f, ij}Y_j\right)Y_l^*\right] = 0 \implies R_{\boldsymbol{XY}}(i, l) = \sum_{j=1}^i K_{f, ij}R_{\boldsymbol{Y}}(j, l)

In matrix form, this means

R_{\boldsymbol{XY}} - K_fR_{\boldsymbol{Y}} = U^+

where U^+ is strictly upper triangular.

Theorem 15

If a matrix H \succ 0, then there exists a unique lower-diagonal-upper triangular factorization H = LDL^*, where L is lower triangular and invertible with unit diagonal entries and D is diagonal with positive entries.

Applying the LDL decomposition to R_{\boldsymbol{Y}}, we see that

\begin{aligned} R_{\boldsymbol{XY}} - K_fLDL^* = U^+ \implies R_{\boldsymbol{XY}}L^{-*}D^{-1} -K_f L = U^+L^{-*}D^{-1}\\ \therefore [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_L -K_f L = 0\end{aligned}

where [\cdot]_L represents the lower triangular part of a matrix.

Definition 24

The causal Wiener filter of a finite length-N signal \boldsymbol{Y} is given by

K_f = [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_LL^{-1}
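
A small numerical sketch of Definitions 23 and 24 (the random covariances and variable names below are purely illustrative). The LDL factors are obtained here from a Cholesky factorization of R_Y, and the sanity check verifies that R_{XY} - K_f R_Y is strictly upper triangular, as required:

```python
import numpy as np

def ldl(R):
    """LDL* factorization of a Hermitian positive definite R via Cholesky."""
    C = np.linalg.cholesky(R)        # R = C C*, C lower triangular
    d = np.real(np.diag(C))
    L = C / d                        # unit-diagonal lower triangular factor
    D = np.diag(d ** 2)
    return L, D

def causal_vector_wiener(R_xy, R_y):
    """K_f = [R_XY L^{-*} D^{-1}]_L L^{-1}, taking the lower-triangular part
    (including the diagonal)."""
    L, D = ldl(R_y)
    M = R_xy @ np.linalg.inv(L).conj().T @ np.linalg.inv(D)
    return np.tril(M) @ np.linalg.inv(L)

# Toy problem: draw a random valid joint covariance for the stacked vector [X; Y].
rng = np.random.default_rng(0)
N = 6
A = rng.normal(size=(2 * N, 2 * N))
Sigma = A @ A.T
R_xy, R_y = Sigma[:N, N:], Sigma[N:, N:]

K_s = R_xy @ np.linalg.inv(R_y)        # non-causal filter (Definition 23)
K_f = causal_vector_wiener(R_xy, R_y)  # causal filter (Definition 24)

# Sanity check: R_XY - K_f R_Y should be strictly upper triangular.
U = R_xy - K_f @ R_y
print(np.allclose(np.tril(U), 0))      # True
```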

Hidden Markov Model State Estimation

Suppose we have a Hidden Markov Process \{Y_n\}_{n\geq1}. We can think of determining the states \{X_n\}_{n\geq1} as filtering \{Y_n\}_{n\geq1}.

Causal Distribution Estimation

Suppose we want to know the distribution of X_t after we have observed Y^t.

\begin{aligned} p(x_t|y^t) &= \frac{p(x_t,y^t)}{p(y^t)} = \frac{p(x_t)p(y_t, y^{t-1}|x_t)}{\sum_{x}p(y_t,y^{t-1}|x_t=x)p(x_t=x)}\\ &= \frac{p(x_t)p(y_t|x_t)p(y^{t-1}|x_t)}{\sum_xp(y_t|x_t=x)p(y^{t-1}|x_t=x)p(x_t=x)} = \frac{p(y_t|x_t)p(y^{t-1})p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(y^{t-1})p(x_t=x|y^{t-1})}\\ &=\frac{p(y_t|x_t)p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(x_t=x|y^{t-1})}\end{aligned}

Now if we know p(x_t|y^{t-1}), then we are set.

\begin{aligned} p(x_t|y^{t-1}) &= \sum_xp(x_t,x_{t-1}=x|y^{t-1}) = \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x,y^{t-1}) \\ &= \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x)\end{aligned}

Now we have a recursive algorithm for computing the distribution of x_t.
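
A minimal implementation sketch of this forward recursion (the finite-state setup, variable names, and toy numbers are my own; \alpha_t denotes p(x_t|y^t) and \beta_t denotes p(x_t|y^{t-1}), matching the notation used in the next subsection):

```python
import numpy as np

def hmm_forward(pi0, P, lik):
    """Causal state-distribution estimates for a finite-state HMM.

    pi0[x]    = p(x_1 = x)                  (initial distribution)
    P[x, x']  = p(x_{t+1} = x' | x_t = x)   (transition matrix)
    lik[t, x] = p(y_t | x_t = x)            (likelihoods of the observed y's)

    Returns alpha[t, x] = p(x_t = x | y^t) and beta[t, x] = p(x_t = x | y^{t-1}).
    """
    T, K = lik.shape
    alpha, beta = np.zeros((T, K)), np.zeros((T, K))
    beta[0] = pi0                          # nothing observed yet: p(x_1)
    for t in range(T):
        unnorm = lik[t] * beta[t]          # p(y_t | x_t) p(x_t | y^{t-1})
        alpha[t] = unnorm / unnorm.sum()   # measurement update
        if t + 1 < T:
            beta[t + 1] = alpha[t] @ P     # time update: p(x_{t+1} | y^t)
    return alpha, beta

# Two-state toy chain; lik holds p(y_t | x_t) for three already-observed y's.
pi0 = np.array([0.5, 0.5])
P = np.array([[0.95, 0.05], [0.10, 0.90]])
lik = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]])
alpha, beta = hmm_forward(pi0, P, lik)
print(alpha[-1])                           # p(x_3 | y^3)
```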

Non-Causal Distribution Estimation

Suppose we are allowed to non-causally filter our signal and we care about the distribution of X_t after we have observed Y^n. In other words, for t \leq n, we want to find \gamma_t(x_t) = p(x_t|y^n). When t=n, \gamma_n(x_n) = \alpha_n(x_n). If we continue expanding backwards, then

\begin{aligned} p(x_t|y^n) &= \sum_x p(x_t,x_{t+1}=x|y^n) = \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t,y_{t+1}^n)\\ &= \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t) = \sum_x p(x_{t+1}=x|y^n)\frac{p(x_t|y^t)p(x_{t+1}=x|x_t,y^t)}{p(x_{t+1}=x|y^t)}\\ &= \sum_x \gamma_{t+1}(x)\frac{\alpha_t(x_t)p(x_{t+1}=x|x_t)}{\beta_{t+1}(x)}\end{aligned}

This gives us a clear algorithm for non-causally computing the distribution of x_t.
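
Continuing the sketch above (it reuses the hypothetical P, alpha, and beta from the forward pass), the backward pass implements the \gamma_t recursion directly:

```python
import numpy as np

def hmm_smooth(alpha, beta, P):
    """gamma[t, x] = p(x_t = x | y^n) from the forward quantities
    alpha[t, x] = p(x_t = x | y^t) and beta[t, x] = p(x_t = x | y^{t-1})."""
    T, K = alpha.shape
    gamma = np.zeros((T, K))
    gamma[-1] = alpha[-1]          # base case: gamma_n = alpha_n
    for t in range(T - 2, -1, -1):
        # gamma_t(x_t) = alpha_t(x_t) * sum_x p(x_{t+1}=x|x_t) gamma_{t+1}(x) / beta_{t+1}(x)
        gamma[t] = alpha[t] * (P @ (gamma[t + 1] / beta[t + 1]))
    return gamma

gamma = hmm_smooth(alpha, beta, P)  # alpha, beta, P from the forward sketch above
print(gamma[0])                     # p(x_1 | y^n)
```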

State Sequence Estimation

Suppose we want to find the most likely sequence of states given our observations. This means we should compute

\hat{x}^n = \text{argmax}_{x^n}p(x^n|y^n)

Since p(x^n|y^n) \propto p(x^n, y^n), we can equivalently maximize the joint distribution, which factors recursively:

\begin{aligned} p(x^t, y^t) &= p(x^{t-1}, y^{t-1})p(x_t, y_t|x^{t-1},y^{t-1})\\ &= p(x^{t-1}, y^{t-1})p(x_t|x^{t-1},y^{t-1})p(y_t|x_t,x^{t-1},y^{t-1}) \\ &= p(x^{t-1},y^{t-1})p(x_t|x_{t-1})p(y_t|x_t)\end{aligned}

We see that there is a recursion in the joint distribution, so if we let V_t(x_t) = \max_{x^{t-1}}p(x^t,y^t), then

\begin{aligned} V_t(x_t) &= \max_{x^{t-1}} p(x^t, y^t) = p(y_t|x_t)\max_{x^{t-1}}p(x^{t-1},y^{t-1})p(x_t|x_{t-1})\\ &= p(y_t|x_t)\max_{x_{t-1}}\left[p(x_t|x_{t-1}) \max_{x^{t-2}} p(x^{t-1},y^{t-1})\right]\\ &= p(y_t|x_t)\max_{x_{t-1}}p(x_t|x_{t-1}) V_{t-1}(x_{t-1})\end{aligned}

The base case is V_1(x_1) = p(x_1)p(y_1|x_1). V_t is useful because \hat{x}_n = \text{argmax}_{x_n}V_n(x_n). This is because V_n already maximizes the joint over x^{n-1}, so the only variable left to maximize over is x_n. Once we have \hat{x}_t, we can compute \hat{x}_{t-1} by

\hat{x}_{t-1} = \text{argmax}_{x_{t-1}}p(\hat{x}_t|x_{t-1})V_{t-1}(x_{t-1}).

Putting these equations together gives us the Viterbi algorithm.
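
A minimal sketch of the Viterbi algorithm under the same hypothetical finite-state setup as the earlier HMM sketches (working in the log domain is my own addition to avoid numerical underflow; it does not change the argmax):

```python
import numpy as np

def viterbi(pi0, P, lik):
    """Most likely state sequence argmax_{x^n} p(x^n | y^n) for a finite-state HMM.

    pi0[x]    = p(x_1 = x), P[x, x'] = p(x_{t+1} = x' | x_t = x),
    lik[t, x] = p(y_t | x_t = x) for the observed sequence y^n.
    """
    T, K = lik.shape
    logP = np.log(P)
    V = np.log(pi0) + np.log(lik[0])            # log V_1(x_1)
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # scores[x_prev, x] = log p(x | x_prev) + log V_{t-1}(x_prev)
        scores = logP + V[:, None]
        back[t] = np.argmax(scores, axis=0)
        V = np.log(lik[t]) + np.max(scores, axis=0)
    # Backtrack: x_hat_n = argmax V_n, then x_hat_{t-1} = argmax p(x_hat_t|.) V_{t-1}(.).
    x_hat = np.zeros(T, dtype=int)
    x_hat[-1] = int(np.argmax(V))
    for t in range(T - 2, -1, -1):
        x_hat[t] = back[t + 1][x_hat[t + 1]]
    return x_hat

# Same toy chain as in the earlier HMM sketches.
pi0 = np.array([0.5, 0.5])
P = np.array([[0.95, 0.05], [0.10, 0.90]])
lik = np.array([[0.8, 0.1], [0.7, 0.2], [0.1, 0.9]])
print(viterbi(pi0, P, lik))                     # most likely state indices
```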

Kalman Filtering

In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model. We want to estimate the state vectors \boldsymbol{X}_i using some linear combination of the observations \boldsymbol{Y}_i.

Kalman Prediction Filter

Suppose that we want to compute the one-step prediction. In other words, given \boldsymbol{Y}^i, we want to predict \boldsymbol{X}_{i+1}. Our observations \boldsymbol{Y} are the only thing which give us information about the state, so it would be nice if we could de-correlate all of the \boldsymbol{Y}. To do this, we can define the innovation process

\boldsymbol{e}_i = \boldsymbol{Y}_i - \boldsymbol{\hat{Y}}_{i|i-1} = \boldsymbol{Y}_i - H_i\boldsymbol{\hat{X}}_{i|i-1}

The last equality follows from the state-space model and the fact that past observation noises are uncorrelated with the current one. Now, to compute the one-step prediction, we just need to project \boldsymbol{X}_{i+1} onto the innovations.

\begin{aligned} \boldsymbol{\hat{X}}_{i+1|i} &= \sum_{j=0}^i\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \sum_{j=0}^{i-1}\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \boldsymbol{\hat{X}}_{i+1|i-1}\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + F_i\boldsymbol{\hat{X}}_{i|i-1}\end{aligned}

The second-to-last equality follows from the Wide-Sense Markovity of state-space models, and the last equality holds because the state evolution noise is uncorrelated with past observations. If we let K_{p,i} = \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1} (called the prediction gain), then we have a recursive expression for the optimal one-step predictor.

\boldsymbol{\hat{X}}_{i+1|i} = F_i\boldsymbol{\hat{X}}_{i|i-1} + K_{p,i}\boldsymbol{e}_i.

Now, we just need to find a recursive formulation for K_{p,i} and R_{\boldsymbol{e},i}. Starting with R_{\boldsymbol{e},i}, notice that we can write \boldsymbol{e}_i = \boldsymbol{Y}_i-H_i\boldsymbol{\hat{X}}_{i|i-1} = H_i(\boldsymbol{X}_i-\boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i.

\begin{aligned} R_{\boldsymbol{e},i} &= \langle H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+ \boldsymbol{V}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= H_i\langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + R_i\end{aligned}

To find K_{p,i}, we should first find \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle.

\begin{aligned} \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle &= F_i\langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle + G_i\langle \boldsymbol{U}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle (\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\end{aligned}

Notice that the matrix P_i = \langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle is the auto-correlation of the estimation error, and it shows up in both K_{p,i} and R_{\boldsymbol{e},i}. It would be useful to have a recursive solution for this matrix as well.

\begin{aligned} P_{i+1} &= \Pi_{i+1}-\langle \boldsymbol{\hat{X}}_{i+1|i}, \boldsymbol{\hat{X}}_{i+1|i} \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - \langle F_i\boldsymbol{\hat{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i, F_i\boldsymbol{\hat{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - F_i\langle \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{\hat{X}}_{i|i-1} \rangle F_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\\ &= F_iP_iF_i^* + G_iQ_iG_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\end{aligned}

Putting this into a concrete algorithm, we get the Kalman Prediction Filter.
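
Below is a minimal implementation sketch of the resulting recursions, assuming the standard state-space model from the earlier sections (X_{i+1} = F_iX_i + G_iU_i, Y_i = H_iX_i + V_i with \langle U_i,U_i \rangle = Q_i, \langle V_i,V_i \rangle = R_i, \langle U_i,V_i \rangle = S_i, and \Pi_0 = \langle X_0,X_0 \rangle). The code further assumes a real-valued, time-invariant model with zero-mean states, so conjugate transposes become plain transposes; all variable names are my own.

```python
import numpy as np

def kalman_predict(Y, F, G, H, Q, R, S, Pi0):
    """One-step Kalman predictor: x_hat[i] = X_hat_{i|i-1} for i = 0, ..., len(Y).

    Model: X_{i+1} = F X_i + G U_i, Y_i = H X_i + V_i, with <U,U> = Q,
    <V,V> = R, <U,V> = S, <X_0,X_0> = Pi0 (real-valued and time-invariant here).
    """
    n_obs, p = Y.shape[0], F.shape[0]
    x_hat = np.zeros((n_obs + 1, p))     # x_hat[0] = 0: zero-mean prior guess
    P = Pi0.copy()                       # P_i, the prediction-error covariance
    for i in range(n_obs):
        e = Y[i] - H @ x_hat[i]                           # innovation e_i
        Re = H @ P @ H.T + R                              # R_{e,i}
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)    # prediction gain K_{p,i}
        x_hat[i + 1] = F @ x_hat[i] + Kp @ e              # X_hat_{i+1|i}
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T    # P_{i+1}
    return x_hat

# Scalar random-walk illustration: X_{i+1} = X_i + U_i, Y_i = X_i + V_i.
rng = np.random.default_rng(0)
F = G = H = np.eye(1)
Q, R, S, Pi0 = np.eye(1), 4 * np.eye(1), np.zeros((1, 1)), np.eye(1)
x_true = np.cumsum(rng.normal(size=100))
Y = (x_true + 2 * rng.normal(size=100)).reshape(-1, 1)
x_pred = kalman_predict(Y, F, G, H, Q, R, S, Pi0)   # x_pred[i] predicts X_i from Y^{i-1}
```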

Schmidt’s Modification of the Kalman Filter

The predictive Kalman filter goes directly from \boldsymbol{\hat{X}}_{i|i-1} to \boldsymbol{\hat{X}}_{i+1|i} without ever determining \boldsymbol{\hat{X}}_{i|i}. The Schmidt Modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.

  1. Measurement Update: Find \boldsymbol{\hat{X}}_{i|i} given the latest observation \boldsymbol{Y}_{i} and \boldsymbol{\hat{X}}_{i|i-1}.

  2. State Evolution (Time) Update: Find \boldsymbol{\hat{X}}_{i+1|i} using what we know about the state evolution.

This mimics the approach of the forward algorithm for Hidden Markov Models, which separated updates to the distribution using a time update and a measurement update. Using our innovation process,

\begin{aligned} \boldsymbol{\hat{X}}_{i|i} &= \sum_{j=0}^i \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \boldsymbol{\hat{X}}_{i|i-1} + \langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \boldsymbol{\hat{X}}_{i|i-1} + \langle (\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{\hat{X}}_{i|i-1}, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{V}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \boldsymbol{\hat{X}}_{i|i-1} + P_iH_i^*R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}

The coefficient on the innovation, K_{f,i}=P_iH_i^*R_{\boldsymbol{e},i}^{-1}, is called the Kalman Filter Gain. The error of our estimator, P_{i|i} = \langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i} \rangle, is given by

P_{i|i} = P_i - P_iH_i^*R_{\boldsymbol{e},i}^{-1}H_iP_i.

For the time update,

\begin{aligned} \boldsymbol{\hat{X}}_{i+1|i} &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\boldsymbol{\hat{U}}_{i|i} \\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, H_i\boldsymbol{X}_i+\boldsymbol{V}_i - H_i\boldsymbol{\hat{X}}_{i|i-1} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_iS_iR_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}

We can re-write the error of this estimator P_{i+1} as

P_{i+1} = F_iP_{i|i}F_i^* + G_i(Q_i - S_iR_{\boldsymbol{e},i}^{-1}S_i^*)G_i^* - F_iK_{f,i}S_i^*G_i^* - G_iS_iK_{f,i}^*F_i^*

Writing this as an algorithm:
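
A sketch of the two-step form under the same model assumptions and naming conventions as the prediction-filter sketch above (again my own, not the text's notation):

```python
import numpy as np

def kalman_filter_schmidt(Y, F, G, H, Q, R, S, Pi0):
    """Schmidt form: alternate measurement updates (X_hat_{i|i}) and time updates
    (X_hat_{i+1|i}).  Same model and assumptions as the prediction-filter sketch.
    Returns the filtered estimates X_hat_{i|i}."""
    n_obs, p = Y.shape[0], F.shape[0]
    x_pred = np.zeros(p)                 # X_hat_{0|-1} = 0 (zero-mean prior)
    P = Pi0.copy()                       # P_i
    filtered = np.zeros((n_obs, p))
    for i in range(n_obs):
        # Measurement update.
        e = Y[i] - H @ x_pred
        Re_inv = np.linalg.inv(H @ P @ H.T + R)
        Kf = P @ H.T @ Re_inv                          # Kalman filter gain K_{f,i}
        x_filt = x_pred + Kf @ e                       # X_hat_{i|i}
        P_filt = P - Kf @ H @ P                        # P_{i|i} = P_i - P_i H* Re^{-1} H P_i
        filtered[i] = x_filt
        # Time update (the G S Re^{-1} e term vanishes when S = 0).
        x_pred = F @ x_filt + G @ S @ Re_inv @ e       # X_hat_{i+1|i}
        P = (F @ P_filt @ F.T + G @ (Q - S @ Re_inv @ S.T) @ G.T
             - F @ Kf @ S.T @ G.T - G @ S @ Kf.T @ F.T)  # P_{i+1}
    return filtered
```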

Kalman Smoother

The Kalman Prediction Filter and Schmidt’s modification of the Kalman filter are both causal filters. The Kalman Smoother provides a non-causal estimate in the same way that the Backward Recursion algorithm does for Hidden Markov Processes. In other words, the Kalman Smoother computes \hat{\boldsymbol{X}}_{i|N}, the best linear estimator of \boldsymbol{X}_i from \{\boldsymbol{Y}_0, \cdots, \boldsymbol{Y}_N\}. As before, we can start with the innovation process \boldsymbol{e}_j = \boldsymbol{Y}_j - H_j\hat{\boldsymbol{X}}_{j|j-1}.

\hat{\boldsymbol{X}}_{i|N} = \sum_{j=0}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j = \hat{\boldsymbol{X}}_{i|i-1} + \sum_{j=i}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.

We can compute \hat{\boldsymbol{X}}_{i|i-1} using the predictive Kalman filter, so we just need to compute the terms \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.

\begin{aligned} \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle &= \langle \boldsymbol{X}_i, H_j(\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1}) + \boldsymbol{V}_j \rangle = \langle \boldsymbol{X}_i, H_j(\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1}) \rangle \\ &= \langle \hat{\boldsymbol{X}}_{i|i-1} + (\boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}), \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle H_j^*\end{aligned}

\boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} is orthogonal to any linear function of \{\boldsymbol{Y}_0, \cdots, \boldsymbol{Y}_{j-1}\}, so when j \geq i, it must be orthogonal to \hat{\boldsymbol{X}}_{i|i-1} since that is a function of \{\boldsymbol{Y}_k\}^{i-1}_0. Thus, for j\geq i,

\langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle H_j^*

If we denote P_{ij} = \langle \boldsymbol{X}_i - \hat{\boldsymbol{X}}_{i|i-1}, \boldsymbol{X}_j - \hat{\boldsymbol{X}}_{j|j-1} \rangle, then

P_{ij} = P_{i}\Phi_{p}^*(j, i), \qquad \Phi_p(j, i) = \begin{cases} I & \text{ if } j = i\\ F_{p,j-1}F_{p,j-2}\cdots F_{p,i} & \text{ if } j > i \end{cases}

where F_{p,k} = (F_k - K_{p, k}H_k). This gives us the expression

\hat{\boldsymbol{X}}_{i|N} = \hat{\boldsymbol{X}}_{i|i-1} + P_i \sum_{j=i}^N \Phi_p^*(j, i)H_j^*R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.

If we let \lambda_{i|N} = \sum_{j=i}^N \Phi_p^*(j, i)H_j^*R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j, then we get the recursion

\lambda_{i|N} = F_{p,i}^*\lambda_{i+1|N} + H_i^*R_{\boldsymbol{e}, i}^{-1}\boldsymbol{e}_i.

If we want to look at the error of this estimator, we see that

P_{i|N} = P_i - P_i \Lambda_{i|N} P_i \text{ where } \Lambda_{i|N} = \langle \lambda_{i|N}, \lambda_{i|N} \rangle = F_{p,i}^*\Lambda_{i+1|N}F_{p,i}+H_i^*R_{\boldsymbol{e}, i}^{-1}H_i.

Writing all of this as an algorithm:
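
A sketch of the smoother under the same assumptions as the previous Kalman sketches; the backward pass uses the assumed base case \lambda_{N+1|N} = 0.

```python
import numpy as np

def kalman_smooth(Y, F, G, H, Q, R, S, Pi0):
    """Fixed-interval Kalman smoother: returns X_hat_{i|N} for i = 0, ..., N.
    Same real-valued, time-invariant model and zero-mean prior as the sketches above."""
    n_obs, p, m = Y.shape[0], F.shape[0], Y.shape[1]
    # Forward pass: run the prediction filter, storing what the backward pass needs.
    x_pred = np.zeros((n_obs, p))        # X_hat_{i|i-1}
    P_pred = np.zeros((n_obs, p, p))     # P_i
    innov = np.zeros((n_obs, m))         # e_i
    Re_all = np.zeros((n_obs, m, m))     # R_{e,i}
    Fp_all = np.zeros((n_obs, p, p))     # F_{p,i} = F - K_{p,i} H
    x, P = np.zeros(p), Pi0.copy()
    for i in range(n_obs):
        x_pred[i], P_pred[i] = x, P
        e = Y[i] - H @ x
        Re = H @ P @ H.T + R
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)
        innov[i], Re_all[i], Fp_all[i] = e, Re, F - Kp @ H
        x = F @ x + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    # Backward pass: lambda_{i|N} = F_{p,i}^* lambda_{i+1|N} + H^* R_{e,i}^{-1} e_i,
    # starting from lambda_{N+1|N} = 0, and X_hat_{i|N} = X_hat_{i|i-1} + P_i lambda_{i|N}.
    x_smooth = np.zeros((n_obs, p))
    lam = np.zeros(p)
    for i in range(n_obs - 1, -1, -1):
        lam = Fp_all[i].T @ lam + H.T @ np.linalg.solve(Re_all[i], innov[i])
        x_smooth[i] = x_pred[i] + P_pred[i] @ lam
    return x_smooth
```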
