Filtering
If we think of our signal as a discrete-time random process, then, just as with a deterministic signal, we can try filtering our random process.
Figure 1: Filtering a Discrete Time Random Process with an LTI system with transfer function $H(z)$
Just as with deterministic signals, filtering can be accomplished either with an LTI system or with some other non-linear/non-time-invariant system.

LTI Filtering on WSS Processes

If we use an LTI filter on a WSS process, then we can easily compute how the filter impacts the spectrum of the signal.

Theorem 13

When $Y_n$ is formed by passing a WSS process $X_n$ through a stable LTI system with impulse response $h[n]$ and transfer function $H(z)$, then
$$S_Y(z) = H(z)S_X(z)H^*(z^{-*}) \qquad \text{and} \qquad S_{YX}(z) = H(z)S_X(z).$$
If we have a third process $Z_n$ that is jointly WSS with $(Y_n, X_n)$, then
$$S_{ZY}(z) = S_{ZX}(z)H^*(z^{-*}).$$

This gives us an interesting interpretation of the spectral factorization (Definition 14), since it amounts to passing a WSS process with auto-correlation $R_W(k) = r_e\delta[k]$ through a minimum-phase filter with transfer function $L(z)$.
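
As a quick sanity check of Theorem 13 (an illustration, not part of the derivation), the sketch below passes white noise through an arbitrary stable FIR filter and compares a Welch estimate of $S_Y$ against $|H(e^{j\omega})|^2 S_X(e^{j\omega})$; the filter taps, noise variance, and signal length are made-up choices.

```python
import numpy as np
from scipy import signal

# Numerical check of Theorem 13: for a real filter, S_Y = |H|^2 S_X on the unit circle.
rng = np.random.default_rng(0)

sigma2 = 2.0                      # white input => S_X(e^{jw}) = sigma2
x = rng.normal(scale=np.sqrt(sigma2), size=200_000)

h = np.array([1.0, -0.6, 0.25])   # arbitrary stable FIR filter (illustrative choice)
y = signal.lfilter(h, [1.0], x)

# Two-sided Welch estimate of S_Y (fs = 1, so frequencies are in cycles/sample).
f, S_y_hat = signal.welch(y, nperseg=4096, return_onesided=False)

# Theorem 13 prediction evaluated at the same frequencies.
_, H = signal.freqz(h, 1, worN=2 * np.pi * f)
S_y_theory = np.abs(H) ** 2 * sigma2

# The ratio should hover around 1, up to Welch estimation noise.
print("mean S_Y_hat / S_Y_theory:", np.mean(S_y_hat / S_y_theory))
```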

Wiener Filter

Suppose we have a stochastic WSS process $Y_n$ that is jointly WSS with $X_n$, and that we want to find the best linear estimator of $X_n$ using $Y_n$. The best linear estimator of $X_n$ given the observations $Y_n$ can be written as
$$\hat{X}_n = \sum_{m\in\mathbb{Z}}h(m)Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through an LTI filter. If we restrict ourselves to using $\{Y_i\}_{i=-\infty}^{n}$ to estimate $X_n$, then the best linear estimator can be written as
$$\hat{X}_n = \sum_{m=0}^\infty h(m)Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if the random variables we use for estimation were uncorrelated with each other. In other words, instead of using $Y$ directly, we want to transform $Y$ into a new process $W$ where $R_W(k) = \delta[k]$. This transformation is known as whitening. From the spectral factorization of $Y$, we know that if we use the filter
$$G(z) = \frac{1}{S_Y^+(z)},$$
then
$$S_W(z) = \frac{S_Y(z)}{S_Y^+(z)S_Y^{+*}(z^{-*})} = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1.$$
Now we want to find the best linear estimator of $X$ using our new process $W$ by designing an LTI filter $Q(z)$.
Figure 2: Finding the best linear estimator of $X$ using $W$ with a two-stage filter that first whitens the input.

Non-Causal Case

Starting with the non-causal case, we can apply the orthogonality principle:
$$\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m\in\mathbb{Z}}q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) = \sum_{m\in\mathbb{Z}}q(m)R_W(k-m) &\implies S_{XW}(z) = Q(z)S_W(z)\\ \therefore Q(z) = \frac{S_{XW}(z)}{S_W(z)} = S_{XW}(z) &= S_{XY}(z)(S_Y^+(z^{-*}))^{-*} = \frac{S_{XY}(z)}{S_Y^-(z)}\end{aligned}$$
When we cascade these filters,
$$H(z) = Q(z)G(z) = \frac{S_{XY}(z)}{S_Y^-(z)} \cdot \frac{1}{S_Y^+(z)} = \frac{S_{XY}(z)}{S_Y(z)}.$$

Definition 21

The best linear estimator of $X_n$ using $Y_n$, where $(X_n, Y_n)$ is jointly WSS, is given by the non-causal Wiener filter
$$H(z) = \frac{S_{XY}(z)}{S_Y(z)}.$$
If we interpret Definition 21 in the frequency domain, for a specific $\omega$, we can understand $H(e^{j\omega})$ as an optimal linear estimator for $F_X(\omega)$, where $F_X(\omega)$ is the stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decomposition of $Y_n$:
$$\begin{aligned} \hat{X}_n &= \sum_{i\in\mathbb{Z}}h[i]\int_{-\pi}^\pi e^{j\omega(n-i)}dF_Y(\omega)\\ &= \int_{-\pi}^{\pi}\left(\sum_{i\in\mathbb{Z}}h[i]e^{-j\omega i}\right)e^{j\omega n}dF_Y(\omega) \\ &= \int_{-\pi}^\pi H(e^{j\omega})e^{j\omega n}dF_Y(\omega)\end{aligned}$$
Since $F_X$ and $F_Y$ have jointly orthogonal increments, this tells us that $H(e^{j\omega})$ is just the optimal linear estimator of $dF_X(\omega)$ using $dF_Y(\omega)$. $dF_X(\omega)$ and $dF_Y(\omega)$ exist on a Hilbert space, meaning we are essentially projecting each frequency component of $X_n$ onto the corresponding frequency component of $Y_n$.
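
To see Definition 21 in action, here is a small numerical sketch (my own illustration with an assumed model, not from the notes): $X_n$ is an AR(1) process and $Y_n = X_n + V_n$ with white noise independent of $X$, so $S_{XY} = S_X$ and $S_Y = S_X + \sigma_v^2$; the non-causal Wiener filter is evaluated on a DFT grid and applied in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed model (illustrative): X is AR(1), Y = X + V with white V independent of X.
a, sig_w, sig_v = 0.9, 1.0, 1.0
n = 100_000
w = rng.normal(scale=sig_w, size=n)
x = np.zeros(n)
for t in range(1, n):
    x[t] = a * x[t - 1] + w[t]
y = x + rng.normal(scale=sig_v, size=n)

# Model spectra on the DFT grid: S_X = sig_w^2 / |1 - a e^{-jw}|^2, S_XY = S_X.
omega = 2 * np.pi * np.fft.fftfreq(n)
S_x = sig_w**2 / np.abs(1 - a * np.exp(-1j * omega)) ** 2
H = S_x / (S_x + sig_v**2)          # non-causal Wiener filter S_XY / S_Y

# Apply H in the frequency domain (circular convolution; fine for a long record).
x_hat = np.fft.ifft(np.fft.fft(y) * H).real

print("MSE using Y directly :", np.mean((x - y) ** 2))
print("MSE after Wiener filt:", np.mean((x - x_hat) ** 2))
```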

Causal Case

First, note that in the causal case, whitening doesn't break causality because $\frac{1}{S_Y^+(z)}$ is causal. When we apply the orthogonality principle,
$$\begin{aligned} \mathbb{E}\left[(X_n-\hat{X}_n)W_{n-k}^*\right] = 0 &\implies \mathbb{E}\left[X_nW^*_{n-k}\right] = \sum_{m=0}^\infty q(m)\mathbb{E}\left[W_{n-m}W^*_{n-k}\right] \\ \therefore R_{XW}(k) &= \sum_{m = 0}^\infty q[m]R_W(k-m) \qquad k \geq 0\end{aligned}$$
We can't take the Z-transform of both sides because the equation is not necessarily true for $k < 0$. Instead, we can look at the function
$$f(k) = R_{XW}(k) - \sum_{m=0}^\infty R_W(k-m)q[m] = \begin{cases} 0 & k\geq 0,\\ ? & \text{else.}\end{cases}$$
Taking the unilateral Z-transform of both sides,
$$\begin{aligned} \left[F(z)\right]_+ &= \left[S_{XW}(z) - S_W(z)Q(z)\right]_+ = \left[S_{XW}(z)\right]_+ - Q(z) = 0\\ Q(z) &= \left[S_{XW}(z)\right]_+ = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \end{aligned}$$
Thus the filter $H$ which gives the causal best linear estimator of $X$ using $Y$ is
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.$$

Definition 22

The best linear estimator of $X_n$ using $\{Y_i\}_{i=-\infty}^{n}$ is given by the causal Wiener filter
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+ \frac{1}{S_Y^+(z)}.$$
Intuitively, this should make sense: we use the same $W$ process as in the non-causal case, but only the samples we are allowed to use; hence we take the unilateral Z-transform of the non-causal filter, which amounts to truncating it to make it causal.
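
Computing the causal Wiener filter numerically requires a spectral factorization and the $[\cdot]_+$ operation. The sketch below is an illustration under the same assumed AR(1)-plus-noise model as above; the cepstral method stands in for an exact factorization, and $[\cdot]_+$ is implemented by zeroing the negative-time samples of an inverse DFT.

```python
import numpy as np
from scipy import signal

rng = np.random.default_rng(2)
a, sig_w, sig_v = 0.9, 1.0, 1.0

# Simulate X (AR(1)) and Y = X + V.
n = 100_000
x = signal.lfilter([1.0], [1.0, -a], rng.normal(scale=sig_w, size=n))
y = x + rng.normal(scale=sig_v, size=n)

# Model spectra on a dense frequency grid.
Nf = 8192
omega = 2 * np.pi * np.fft.fftfreq(Nf)
S_x = sig_w**2 / np.abs(1 - a * np.exp(-1j * omega)) ** 2
S_xy = S_x                        # V independent of X
S_y = S_x + sig_v**2

# Cepstral spectral factorization: S_y_plus is causal and minimum-phase,
# with S_y_plus * conj(S_y_plus) = S_y on the unit circle.
c = np.fft.ifft(np.log(S_y)).real
cp = np.zeros(Nf)
cp[0] = c[0] / 2
cp[1:Nf // 2] = c[1:Nf // 2]
cp[Nf // 2] = c[Nf // 2] / 2
S_y_plus = np.exp(np.fft.fft(cp))
S_y_minus = np.conj(S_y_plus)

# Causal part of S_XY / S_Y^-.
t = np.fft.ifft(S_xy / S_y_minus)
t[Nf // 2:] = 0.0                 # zero the negative-time (anti-causal) samples
Q = np.fft.fft(t)

# Causal Wiener filter and its impulse response.
H = Q / S_y_plus
h = np.fft.ifft(H).real
print("largest anti-causal tap:", np.max(np.abs(h[Nf // 2:])))  # should be ~0

# Apply the (truncated) causal filter.
x_hat = signal.lfilter(h[:400], [1.0], y)
print("MSE using Y directly  :", np.mean((x - y) ** 2))
print("MSE causal Wiener filt:", np.mean((x - x_hat) ** 2))
```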

Theorem 14

If $\hat{X}_{NC}(n)$ is the non-causal Wiener estimate of $X$ with filter $h[n]$, then the causal Wiener filter of $X$ given $Y$ is the same as the causal Wiener filter of $\hat{X}_{NC}$ given $Y$, and if $Y$ is white noise, then
$$\hat{X}_C(n) = \sum_{i=0}^{\infty}h[i]Y_{n-i}.$$

Vector Case

Suppose that instead of a wide-sense stationary process, we have an $N$-length signal $\boldsymbol{X}$ which we want to estimate with another $N$-length signal $\boldsymbol{Y}$. We can represent both $\boldsymbol{X}$ and $\boldsymbol{Y}$ as vectors in $\mathbb{C}^N$. If we are allowed to use all entries of $\boldsymbol{Y}$ to estimate $\boldsymbol{X}$, this is identical to linear estimation.

Definition 23

The non-causal Wiener filter of a finite length $N$ signal $\boldsymbol{Y}$ is given by
$$K_s = R_{\boldsymbol{X}\boldsymbol{Y}}R_{\boldsymbol{Y}}^{-1}.$$
Note that this requires $R_{\boldsymbol{Y}} \succ 0$.

Suppose that we wanted to design a causal filter for the vector case, so $\hat{X}_i$ only depends on $\{Y_j\}_{j=1}^i$. By the orthogonality principle,
$$\forall 1 \leq l \leq i,\quad \mathbb{E}\left[\left(X_i - \sum_{j=1}^iK_{f, ij}Y_j\right)Y_l^*\right] = 0 \implies R_{\boldsymbol{XY}}(i, l) = \sum_{j=1}^i K_{f, ij}R_{\boldsymbol{Y}}(j, l).$$
In matrix form, this means
$$R_{\boldsymbol{XY}} - K_fR_{\boldsymbol{Y}} = U^+$$
where $U^+$ is strictly upper triangular.

Theorem 15

If a matrix $H \succ 0$, then there exists a unique factorization $H=LDL^*$ where $L$ is lower triangular and invertible with unit diagonal entries and $D$ is diagonal with positive entries.

Applying the LDL decomposition, we see that
$$\begin{aligned} R_{\boldsymbol{XY}} - K_fLDL^* = U^+ &\implies R_{\boldsymbol{XY}}L^{-*}D^{-1} -K_f L = U^+L^{-*}D^{-1}\\ &\therefore [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_L -K_f L = 0\end{aligned}$$
where $[\cdot]_L$ represents the lower triangular part of a matrix.

Definition 24

The causal Wiener filter of a finite length $N$ signal $\boldsymbol{Y}$ is given by
$$K_f = [R_{\boldsymbol{XY}}L^{-*}D^{-1}]_LL^{-1}.$$
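
Here is a small numerical sketch of Definitions 23 and 24 (an illustration with an arbitrary randomly generated joint covariance): it builds $R_{\boldsymbol{Y}}$ and $R_{\boldsymbol{XY}}$, extracts the unit-diagonal $L$ and $D$ from a Cholesky factor of $R_{\boldsymbol{Y}}$, forms $K_s$ and $K_f$, and checks that $R_{\boldsymbol{XY}} - K_f R_{\boldsymbol{Y}}$ is strictly upper triangular, as the orthogonality conditions require.

```python
import numpy as np

rng = np.random.default_rng(3)
N = 6

# A random joint covariance for (X, Y) (real-valued here for simplicity).
A = rng.normal(size=(2 * N, 2 * N))
R = A @ A.T + 0.1 * np.eye(2 * N)      # positive definite
R_xy, R_y = R[:N, N:], R[N:, N:]

# Non-causal (smoothing) filter: K_s = R_XY R_Y^{-1}.
K_s = R_xy @ np.linalg.inv(R_y)

# LDL* factorization of R_Y from its Cholesky factor C = L D^{1/2}.
C = np.linalg.cholesky(R_y)
d = np.diag(C) ** 2                    # entries of D
L = C / np.sqrt(d)                     # unit-diagonal lower-triangular L
D_inv = np.diag(1.0 / d)

# Causal filter: K_f = [R_XY L^{-*} D^{-1}]_L L^{-1}.
L_inv = np.linalg.inv(L)
K_f = np.tril(R_xy @ L_inv.T @ D_inv) @ L_inv

# Check: R_XY - K_f R_Y should be strictly upper triangular.
U = R_xy - K_f @ R_y
print("max |lower part incl. diagonal|:", np.max(np.abs(np.tril(U))))
print("K_f is lower triangular:", np.allclose(K_f, np.tril(K_f)))
```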

Hidden Markov Model State Estimation

Suppose we have a Hidden Markov Process $\{Y_n\}_{n\geq1}$. We can think of determining the state $\{X_n\}_{n\geq1}$ as filtering $\{Y_n\}_{n\geq1}$.

Causal Distribution Estimation

Suppose we want to know the distribution of $X_t$ after we have observed $Y^t$.
$$\begin{aligned} p(x_t|y^t) &= \frac{p(x_t,y^t)}{p(y^t)} = \frac{p(x_t)p(y_t, y^{t-1}|x_t)}{\sum_{x}p(y_t,y^{t-1}|x_t=x)p(x_t=x)}\\ &= \frac{p(x_t)p(y_t|x_t)p(y^{t-1}|x_t)}{\sum_xp(y_t|x_t=x)p(y^{t-1}|x_t=x)p(x_t=x)} = \frac{p(y_t|x_t)p(y^{t-1})p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(y^{t-1})p(x_t=x|y^{t-1})}\\ &=\frac{p(y_t|x_t)p(x_t|y^{t-1})}{\sum_{x}p(y_t|x_t=x)p(x_t=x|y^{t-1})}\end{aligned}$$
Now if we know $p(x_t|y^{t-1})$, then we are set.
$$\begin{aligned} p(x_t|y^{t-1}) &= \sum_xp(x_t,x_{t-1}=x|y^{t-1}) = \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x,y^{t-1}) \\ &= \sum_x p(x_{t-1}=x|y^{t-1})p(x_t|x_{t-1}=x)\end{aligned}$$
Now we have a recursive algorithm for computing the distribution of $x_t$: the measurement update gives $\alpha_t(x_t) = p(x_t|y^t)$, and the prediction step gives $\beta_t(x_t) = p(x_t|y^{t-1})$.
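
The two steps above are exactly the forward (filtering) recursion. A minimal sketch, assuming a toy two-state chain with discrete emissions (all matrices below are made up for illustration):

```python
import numpy as np

def forward_filter(P, B, pi0, obs):
    """Causal state-distribution estimation for an HMM.

    P[i, j] = p(x_{t+1} = j | x_t = i)   (transition matrix)
    B[i, k] = p(y_t = k | x_t = i)       (emission matrix)
    pi0[i]  = p(x_1 = i)
    obs     = observed sequence y_1, ..., y_n (integer symbols)

    Returns alpha[t, i] = p(x_t = i | y^t).
    """
    n, S = len(obs), len(pi0)
    alpha = np.zeros((n, S))
    pred = pi0                                  # p(x_1), no observations yet
    for t, y in enumerate(obs):
        upd = B[:, y] * pred                    # measurement update (unnormalized)
        alpha[t] = upd / upd.sum()              # p(x_t | y^t)
        pred = alpha[t] @ P                     # prediction: p(x_{t+1} | y^t)
    return alpha

# Toy example (all numbers are arbitrary illustrations).
P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[0.8, 0.2],       # state 0 mostly emits symbol 0
              [0.3, 0.7]])      # state 1 mostly emits symbol 1
pi0 = np.array([0.5, 0.5])
obs = [0, 0, 1, 1, 1, 0, 1, 1]
print(forward_filter(P, B, pi0, obs))
```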

Non-Causal Distribution Estimation

Suppose we are allowed to non-causally filter our signal and we care about the distribution of $X_t$ after we have observed $Y^n$. In other words, for $t \leq n$, we want to find $\gamma_t(x_t) = p(x_t|y^n)$. When $t=n$, $\gamma_n(x_n) = \alpha_n(x_n)$. If we continue expanding backwards, then
$$\begin{aligned} p(x_t|y^n) &= \sum_x p(x_t,x_{t+1}=x|y^n) = \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t,y_{t+1}^n)\\ &= \sum_x p(x_{t+1}=x|y^n)p(x_t|x_{t+1}=x,y^t) = \sum_x p(x_{t+1}=x|y^n)\frac{p(x_t|y^t)p(x_{t+1}=x|x_t,y^t)}{p(x_{t+1}=x|y^t)}\\ &= \sum_x \gamma_{t+1}(x)\frac{\alpha_t(x_t)p(x_{t+1}=x|x_t)}{\beta_{t+1}(x)}\end{aligned}$$
This gives us a clear algorithm for non-causally computing the distribution of $x_t$.
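
A minimal sketch of the smoothing recursion, assuming the same toy model as above; the forward pass stores $\alpha_t$ and the one-step predictions $\beta_t$, and the backward pass applies the $\gamma_t$ recursion:

```python
import numpy as np

def forward_backward(P, B, pi0, obs):
    """Non-causal state-distribution estimation (smoothing) for an HMM.

    Forward pass stores alpha[t] = p(x_t | y^t) and beta[t] = p(x_t | y^{t-1});
    backward pass applies
        gamma_t(x) = sum_x' gamma_{t+1}(x') * alpha_t(x) P[x, x'] / beta_{t+1}(x').
    """
    n, S = len(obs), len(pi0)
    alpha, beta = np.zeros((n, S)), np.zeros((n, S))
    beta[0] = pi0                               # p(x_1) before any observation
    for t, y in enumerate(obs):
        upd = B[:, y] * beta[t]
        alpha[t] = upd / upd.sum()
        if t + 1 < n:
            beta[t + 1] = alpha[t] @ P          # p(x_{t+1} | y^t)

    gamma = np.zeros((n, S))
    gamma[-1] = alpha[-1]                       # gamma_n = alpha_n
    for t in range(n - 2, -1, -1):
        gamma[t] = alpha[t] * (P @ (gamma[t + 1] / beta[t + 1]))
    return gamma

P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
print(forward_backward(P, B, pi0, [0, 0, 1, 1, 1, 0, 1, 1]))
```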

State Sequence Estimation

Suppose we want to find the most likely sequence of states given our observations. This means we should compute
$$\hat{x}^n = \text{argmax}_{x^n}\,p(x^n|y^n).$$
Since $y^n$ is fixed, maximizing $p(x^n|y^n)$ is equivalent to maximizing the joint distribution $p(x^n, y^n)$, which factors recursively:
$$\begin{aligned} p(x^t, y^t) &= p(x^{t-1}, y^{t-1})p(x_t, y_t|x^{t-1},y^{t-1})\\ &= p(x^{t-1}, y^{t-1})p(x_t|x^{t-1},y^{t-1})p(y_t|x_t,x^{t-1},y^{t-1}) \\ &= p(x^{t-1},y^{t-1})p(x_t|x_{t-1})p(y_t|x_t)\end{aligned}$$
We see that there is a recursion in the joint distribution, so if we let $V_t(x_t) = \max_{x^{t-1}}p(x^t,y^t)$, then
$$\begin{aligned} V_t(x_t) &= \max_{x^{t-1}} p(x^t, y^t) = p(y_t|x_t)\max_{x^{t-1}}p(x^{t-1},y^{t-1})p(x_t|x_{t-1})\\ &= p(y_t|x_t)\max_{x_{t-1}}\left[p(x_t|x_{t-1}) \max_{x^{t-2}} p(x^{t-1},y^{t-1})\right]\\ &= p(y_t|x_t)\max_{x_{t-1}}p(x_t|x_{t-1}) V_{t-1}(x_{t-1})\end{aligned}$$
The base case is $V_1(x_1) = p(x_1)p(y_1|x_1)$. $V_t$ is useful because $\hat{x}_n = \text{argmax}_{x_n}V_n(x_n)$: we can maximize over $x^{n-1}$ first, so the only thing left to maximize over is $x_n$. Once we have $\hat{x}_t$, we can compute $\hat{x}_{t-1}$ by
$$\hat{x}_{t-1} = \text{argmax}_{x_{t-1}}p(\hat{x}_t|x_{t-1})V_{t-1}(x_{t-1}).$$
Putting these equations together gives us the Viterbi algorithm.
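
A log-domain sketch of the resulting Viterbi algorithm (logs are a standard trick to avoid underflow on long sequences; the toy model is again arbitrary):

```python
import numpy as np

def viterbi(P, B, pi0, obs):
    """Most likely state sequence argmax_{x^n} p(x^n | y^n) via the V_t recursion."""
    n, S = len(obs), len(pi0)
    logP, logB = np.log(P), np.log(B)
    V = np.zeros((n, S))                  # V[t, x] = log max_{x^{t-1}} p(x^t, y^t)
    back = np.zeros((n, S), dtype=int)    # argmax over x_{t-1}, for backtracking

    V[0] = np.log(pi0) + logB[:, obs[0]]  # base case: p(x_1) p(y_1 | x_1)
    for t in range(1, n):
        # scores[i, j] = V_{t-1}(i) + log p(x_t = j | x_{t-1} = i)
        scores = V[t - 1][:, None] + logP
        back[t] = np.argmax(scores, axis=0)
        V[t] = scores[back[t], np.arange(S)] + logB[:, obs[t]]

    # Backtrack: x_hat_n = argmax V_n, then x_hat_{t-1} = back[t, x_hat_t].
    x_hat = np.zeros(n, dtype=int)
    x_hat[-1] = np.argmax(V[-1])
    for t in range(n - 1, 0, -1):
        x_hat[t - 1] = back[t, x_hat[t]]
    return x_hat

P = np.array([[0.95, 0.05],
              [0.10, 0.90]])
B = np.array([[0.8, 0.2],
              [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
print(viterbi(P, B, pi0, [0, 0, 1, 1, 1, 0, 1, 1]))
```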

Kalman Filtering

In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model. We want to form predictions $\boldsymbol{\hat{X}}_i$ of the state vectors using some linear combination of the observations $\boldsymbol{Y}_i$.

Kalman Prediction Filter

Suppose that we want to compute the one-step prediction. In other words, given $\boldsymbol{Y}^i$, we want to predict $\boldsymbol{X}_{i+1}$. Our observations $\boldsymbol{Y}$ are the only thing that gives us information about the state, so it would be nice if we could de-correlate all of the $\boldsymbol{Y}$. To do this, we can define the innovation process
$$\boldsymbol{e}_i = \boldsymbol{Y}_i - \boldsymbol{\hat{Y}}_{i|i-1} = \boldsymbol{Y}_i - H_i\boldsymbol{\hat{X}}_{i|i-1}.$$
The last equality follows from the state-space model and the fact that past observation noises are uncorrelated with the current one. Now, to compute the one-step prediction, we just need to project $\boldsymbol{X}_{i+1}$ onto the innovations.
$$\begin{aligned} \boldsymbol{\hat{X}}_{i+1|i} &= \sum_{j=0}^i\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \sum_{j=0}^{i-1}\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + \boldsymbol{\hat{X}}_{i+1|i-1} = \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + F_i\boldsymbol{\hat{X}}_{i|i-1} + G_i\boldsymbol{\hat{U}}_{i|i-1}\\ &= \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_{i} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i + F_i\boldsymbol{\hat{X}}_{i|i-1}\end{aligned}$$
The second to last equality follows from the wide-sense Markovity of state-space models, and the last equality is due to the state evolution noise being uncorrelated with the past observations (so $\boldsymbol{\hat{U}}_{i|i-1} = 0$). If we let $K_{p,i} = \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}$ (called the prediction gain), then we have a recursive formula for the optimal one-step predictor:
$$\boldsymbol{\hat{X}}_{i+1|i} = F_i\boldsymbol{\hat{X}}_{i|i-1} + K_{p,i}\boldsymbol{e}_i.$$
Now, we just need to find a recursive formulation for $K_{p,i}$ and $R_{\boldsymbol{e},i}$. Starting with $R_{\boldsymbol{e},i}$, notice that we can write $\boldsymbol{e}_i = \boldsymbol{Y}_i-H_i\boldsymbol{\hat{X}}_{i|i-1} = H_i(\boldsymbol{X}_i-\boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i$, so
$$\begin{aligned} R_{\boldsymbol{e},i} &= \langle H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+ \boldsymbol{V}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= H_i\langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + R_i.\end{aligned}$$
To find $K_{p,i}$, we should first find $\langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle$.
$$\begin{aligned} \langle \boldsymbol{X}_{i+1}, \boldsymbol{e}_i \rangle &= F_i\langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle + G_i\langle \boldsymbol{U}_i, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1})+\boldsymbol{V}_i \rangle \\ &= F_i\langle \boldsymbol{X}_i, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle (\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\\ &= F_i\langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle H_i^* + G_iS_i\end{aligned}$$
Notice that the matrix $P_i = \langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1} \rangle$ is the auto-correlation of the estimation error, and it shows up in both $K_{p,i}$ and $R_{\boldsymbol{e},i}$. It would be useful to have a recursive solution for this matrix as well.
$$\begin{aligned} P_{i+1} &= \Pi_{i+1}-\langle \boldsymbol{\hat{X}}_{i+1|i}, \boldsymbol{\hat{X}}_{i+1|i} \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - \langle F_i\boldsymbol{\hat{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i, F_i\boldsymbol{\hat{X}}_{i|i-1}+K_{p,i}\boldsymbol{e}_i \rangle \\ &= F_i\Pi_iF_i^* + G_iQ_iG_i^* - F_i\langle \boldsymbol{\hat{X}}_{i|i-1}, \boldsymbol{\hat{X}}_{i|i-1} \rangle F_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\\ &= F_iP_iF_i^* + G_iQ_iG_i^* - K_{p,i}R_{\boldsymbol{e},i}K_{p,i}^*\end{aligned}$$
Putting this into a concrete algorithm, we get the Kalman Prediction Filter.
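
One possible rendering of the recursion as code (a sketch assuming time-invariant matrices $F$, $G$, $H$ and noise covariances $Q$, $R$, $S$, all chosen arbitrarily; the observations below are stand-ins just to exercise the recursion):

```python
import numpy as np

def kalman_predictor(F, G, H, Q, R, S, Pi0, ys):
    """One-step Kalman prediction filter.

    State model:  X_{i+1} = F X_i + G U_i,   Y_i = H X_i + V_i,
    with cov(U_i) = Q, cov(V_i) = R, cov(U_i, V_i) = S, cov(X_0) = Pi0.
    Returns the one-step predictions X_hat_{i+1|i}.
    """
    x_pred = np.zeros(F.shape[0])     # X_hat_{0|-1} = 0 (zero-mean prior)
    P = Pi0.copy()                    # P_0, error covariance of X_hat_{0|-1}
    preds = []
    for y in ys:
        e = y - H @ x_pred                              # innovation e_i
        Re = H @ P @ H.T + R                            # R_{e,i}
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)  # prediction gain K_{p,i}
        x_pred = F @ x_pred + Kp @ e                    # X_hat_{i+1|i}
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T  # P_{i+1} (Riccati recursion)
        preds.append(x_pred)
    return np.array(preds)

# Toy example with arbitrary matrices and stand-in observations.
F = np.array([[0.9, 1.0], [0.0, 0.8]])
G = np.eye(2)
H = np.array([[1.0, 0.0]])
Q, R, S = 0.1 * np.eye(2), np.array([[0.5]]), np.zeros((2, 1))
rng = np.random.default_rng(4)
ys = rng.normal(size=(50, 1))
print(kalman_predictor(F, G, H, Q, R, S, np.eye(2), ys)[:3])
```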

Schmidt’s Modification of the Kalman Filter

The predictive Kalman filter goes directly from $\boldsymbol{\hat{X}}_{i|i-1}$ to $\boldsymbol{\hat{X}}_{i+1|i}$ without ever determining $\boldsymbol{\hat{X}}_{i|i}$. The Schmidt modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.
1. Measurement Update: Find $\boldsymbol{\hat{X}}_{i|i}$ given the latest observation $\boldsymbol{Y}_{i}$ and $\boldsymbol{\hat{X}}_{i|i-1}$.
2. State Evolution (Time) Update: Find $\boldsymbol{\hat{X}}_{i+1|i}$ using what we know about the state evolution.
This mimics the approach of the forward algorithm for Hidden Markov Models, which separated updates to the distribution into a time update and a measurement update. Using our innovation process,
$$\begin{aligned} \boldsymbol{\hat{X}}_{i|i} &= \sum_{j=0}^i \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j \\ &= \boldsymbol{\hat{X}}_{i|i-1} + \langle \boldsymbol{X}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \boldsymbol{\hat{X}}_{i|i-1} + \langle (\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{\hat{X}}_{i|i-1}, H_i(\boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i-1}) + \boldsymbol{V}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= \boldsymbol{\hat{X}}_{i|i-1} + P_iH_i^*R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}$$
The coefficient of the innovation, $K_{f,i}=P_iH_i^*R_{\boldsymbol{e},i}^{-1}$, is called the Kalman filter gain. The error of our estimator, $P_{i|i} = \langle \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i}, \boldsymbol{X}_i - \boldsymbol{\hat{X}}_{i|i} \rangle$, is given by
$$P_{i|i} = P_i - P_iH_i^*R_{\boldsymbol{e},i}^{-1}H_iP_i.$$
For the time update,
$$\begin{aligned} \boldsymbol{\hat{X}}_{i+1|i} &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\boldsymbol{\hat{U}}_{i|i} \\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, \boldsymbol{e}_i \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_i\langle \boldsymbol{U}_i, H_i\boldsymbol{X}_i+\boldsymbol{V}_i - H_i\boldsymbol{\hat{X}}_{i|i-1} \rangle R_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\\ &= F_i\boldsymbol{\hat{X}}_{i|i} + G_iS_iR_{\boldsymbol{e},i}^{-1}\boldsymbol{e}_i\end{aligned}$$
We can re-write the error of this estimator, $P_{i+1}$, as
$$P_{i+1} = F_iP_{i|i}F_i^* + G_i(Q_i - S_iR_{\boldsymbol{e},i}^{-1}S_i^*)G_i^* - F_iK_{f,i}S_i^*G_i^* - G_iS_iK_{f,i}^*F_i^*.$$
Writing this as an algorithm, we obtain Schmidt's modification of the Kalman filter.
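
A sketch of the two-step recursion (same hypothetical time-invariant setup as the prediction-filter sketch above):

```python
import numpy as np

def schmidt_kalman(F, G, H, Q, R, S, Pi0, ys):
    """Schmidt's modification: explicit measurement and time updates.

    Returns the filtered estimates X_hat_{i|i} and the predictions X_hat_{i+1|i}.
    """
    x_pred = np.zeros(F.shape[0])     # X_hat_{0|-1}
    P = Pi0.copy()                    # P_0
    filtered, predicted = [], []
    for y in ys:
        # Measurement update.
        e = y - H @ x_pred                            # innovation e_i
        Re = H @ P @ H.T + R                          # R_{e,i}
        Re_inv = np.linalg.inv(Re)
        Kf = P @ H.T @ Re_inv                         # Kalman filter gain K_{f,i}
        x_filt = x_pred + Kf @ e                      # X_hat_{i|i}
        P_filt = P - P @ H.T @ Re_inv @ H @ P         # P_{i|i}

        # Time (state-evolution) update.
        x_pred = F @ x_filt + G @ S @ Re_inv @ e      # X_hat_{i+1|i}
        P = (F @ P_filt @ F.T
             + G @ (Q - S @ Re_inv @ S.T) @ G.T
             - F @ Kf @ S.T @ G.T
             - G @ S @ Kf.T @ F.T)                    # P_{i+1}

        filtered.append(x_filt)
        predicted.append(x_pred)
    return np.array(filtered), np.array(predicted)
```

When $S = 0$ (state evolution noise uncorrelated with observation noise), the cross terms vanish and the time update reduces to $\boldsymbol{\hat{X}}_{i+1|i} = F_i\boldsymbol{\hat{X}}_{i|i}$ with $P_{i+1} = F_iP_{i|i}F_i^* + G_iQ_iG_i^*$.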

Kalman Smoother

The Kalman Prediction Filter and Schmidt's modification of the Kalman filter are both causal filters. The Kalman Smoother provides a non-causal estimate in the same way that the Backward Recursion algorithm does for Hidden Markov Processes. In other words, the Kalman Smoother predicts $\hat{\boldsymbol{X}}_{i|N}$, the best linear estimator of $\boldsymbol{X}_i$ from $\{\boldsymbol{Y}_0, \cdots, \boldsymbol{Y}_N\}$. As before, we can start with the innovation process $\boldsymbol{e}_j = \boldsymbol{Y}_j - H_j\hat{\boldsymbol{X}}_{j|j-1}$.
$$\hat{\boldsymbol{X}}_{i|N} = \sum_{j=0}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j = \hat{\boldsymbol{X}}_{i|i-1} + \sum_{j=i}^N \langle \boldsymbol{X}_i, \boldsymbol{e}_j \rangle R_{\boldsymbol{e},j}^{-1}\boldsymbol{e}_j.$$