Filtering
If we think of our signal as a discrete-time random process, then, just as with a normal deterministic signal, we can try filtering it.

Figure 1: Filtering a Discrete Time Random Process with an LTI system with transfer function $H(z)$
Just as with deterministic signals, filtering can be accomplished either with an LTI system or with some other non-linear or time-varying system.
If we use an LTI filter on a WSS process, then we can easily compute how the filter impacts the spectrum of the signal.
This gives us an interesting interpretation of the spectral factorization (Definition 14), since it essentially amounts to passing a WSS process with auto-correlation $R(k) = \delta[k]$ (i.e., white noise) through a minimum-phase filter with transfer function $S_X^+(z)$, the causal factor in $S_X(z) = S_X^+(z)S_X^-(z)$.
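For example (using the factor notation above, where $S_X^+(z)$ denotes the causal, minimum-phase factor), consider a spectrum that factors by inspection:
$$S_X(z) = \frac{5}{4} + \frac{1}{2}\left(z + z^{-1}\right) = \underbrace{\left(1 + \tfrac{1}{2}z^{-1}\right)}_{S_X^+(z)}\underbrace{\left(1 + \tfrac{1}{2}z\right)}_{S_X^-(z)}.$$
The factor $S_X^+(z)$ is causal and minimum-phase (its zero sits at $z = -\tfrac{1}{2}$, inside the unit circle), so passing unit-variance white noise through it produces a WSS process with exactly this spectrum, since $\left|1 + \tfrac{1}{2}e^{-j\omega}\right|^2 = \tfrac{5}{4} + \cos\omega$.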
Suppose we have a stochastic WSS process $Y$ that is jointly WSS with $X$ and that we want to find the best linear estimator of $X_n$ using $Y$. The best linear estimator of $X_n$ given all of the observations $(Y_i)_{i\in\mathbb{Z}}$ can be written as
$$\hat{X}_n = \sum_{i=-\infty}^{\infty} h(i)\,Y_{n-i}.$$
This is identical to passing $Y$ through an LTI filter. If we restrict ourselves to using only $(Y_i)_{i\le n}$ to estimate $X_n$, then the best linear estimator can be written as
$$\hat{X}_n = \sum_{i=0}^{\infty} h(i)\,Y_{n-i}.$$
It is identical to passing $Y$ through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if each of the random variables we are using for estimation were uncorrelated with each other. In other words, instead of using $Y$ directly, we want to transform $Y$ into a new process $W$ where $R_W(k) = \delta[k]$. This transformation is known as whitening. From the spectral factorization of $S_Y(z) = S_Y^+(z)S_Y^-(z)$, we know that if we use the filter
$$F(z) = \frac{1}{S_Y^+(z)},$$
then
$$S_W(z) = F(z)F^*(z^{-*})S_Y(z) = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1.$$
Now we want to find the best linear estimator of $X$ using our new process $W$ by designing an LTI filter $H(z)$.

Figure 2: Finding the best linear estimator of $X$ using $W$ with a two-stage filter that first whitens the input.
Starting with the noncausal case, we can apply the orthogonality principle: writing $\hat{X}_n = \sum_i h(i)W_{n-i}$ and $R_{XW}(k) = \mathbb{E}\left[X_{n+k}W_n^*\right]$, for every $k$,
$$\mathbb{E}\left[\left(X_n - \sum_i h(i)W_{n-i}\right)W_{n-k}^*\right] = 0 \quad\Longrightarrow\quad R_{XW}(k) = \sum_i h(i)R_W(k-i) = h(k),$$
so $H(z) = S_{XW}(z)$. When we cascade these filters, since $W$ is obtained by passing $Y$ through $F(z) = \frac{1}{S_Y^+(z)}$ (so $S_{XW}(z) = S_{XY}(z)F^*(z^{-*}) = \frac{S_{XY}(z)}{S_Y^-(z)}$), the overall noncausal Wiener filter from $Y$ to $\hat{X}$ is
$$H_{\text{nc}}(z) = \frac{1}{S_Y^+(z)}\,S_{XW}(z) = \frac{1}{S_Y^+(z)}\cdot\frac{S_{XY}(z)}{S_Y^-(z)} = \frac{S_{XY}(z)}{S_Y(z)}.$$
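For example, in the standard signal-plus-noise special case $Y_n = X_n + V_n$ with the noise $V$ uncorrelated with $X$ (so $S_{XY} = S_X$ and $S_Y = S_X + S_V$), the noncausal Wiener filter reduces to
$$H_{\text{nc}}(e^{j\omega}) = \frac{S_X(e^{j\omega})}{S_X(e^{j\omega}) + S_V(e^{j\omega})},$$
which passes the frequencies where the signal dominates the noise and attenuates the frequencies where the noise dominates.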
If we interpret Definition 21 in the frequency domain, for a specific $\omega$, we can understand $H_{\text{nc}}(e^{j\omega})$ as an optimal linear estimator for $dZ_X(\omega)$, where $Z_X$ is the orthogonal-increment stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decompositions of $X$ and $Y$,
$$X_n = \int_{(-\pi,\pi]} e^{j\omega n}\,dZ_X(\omega), \qquad Y_n = \int_{(-\pi,\pi]} e^{j\omega n}\,dZ_Y(\omega).$$
Since $Z_X$ and $Z_Y$ have jointly orthogonal increments, this tells us that $H_{\text{nc}}(e^{j\omega})\,dZ_Y(\omega)$ is just the optimal linear estimator of $dZ_X(\omega)$ using $dZ_Y(\omega)$. $dZ_X(\omega)$ and $dZ_Y(\omega)$ exist in a Hilbert space, meaning we are essentially projecting each frequency component of $X$ onto the corresponding frequency component of $Y$.
First, note that in the causal case, whitening doesn't break causality because $\frac{1}{S_Y^+(z)}$ is causal (it is the inverse of a minimum-phase filter). When we apply the orthogonality principle, we only get a condition on the non-negative lags:
$$R_{XW}(k) = \sum_{i\ge 0} h(i)R_W(k-i) = h(k) \qquad \text{for } k \ge 0.$$
We can't take the Z-transform of both sides because the equation is not necessarily true for $k < 0$. Instead, we can look at the function $R_{XW}(k)u(k)$, where $u$ is the unit step; since $h$ is causal, $h(k) = R_{XW}(k)u(k)$ for all $k$. Taking the unilateral Z-transform of both sides,
$$H(z) = \left[S_{XW}(z)\right]_+ = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+,$$
where $[\cdot]_+$ keeps only the causal part (the terms with non-negative powers of $z^{-1}$). Thus the filter which gives the causal best linear estimator of $X$ using $Y$ is
$$H_{\text{c}}(z) = \frac{1}{S_Y^+(z)}\left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+.$$
Intuitively, this should make sense because we are using the same $W$ process as in the non-causal case, but only the variables which we are allowed to use; hence we take the unilateral Z-transform of the non-causal Wiener filter for the whitened process, which amounts to truncating the noncausal filter to make it causal.
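As a small worked example of this formula, consider one-step prediction of a process with spectrum $S_Y(z) = \frac{1}{(1 - az^{-1})(1 - az)}$ for a real $|a| < 1$ using its own past: we observe $Y = X$ and the target is $X_{n+1}$, so the relevant cross-spectrum is $zS_Y(z)$. Then
$$S_Y^+(z) = \frac{1}{1 - az^{-1}}, \qquad S_Y^-(z) = \frac{1}{1 - az},$$
$$\left[\frac{z\,S_Y(z)}{S_Y^-(z)}\right]_+ = \left[\frac{z}{1 - az^{-1}}\right]_+ = \left[z + a + a^2z^{-1} + a^3z^{-2} + \cdots\right]_+ = \frac{a}{1 - az^{-1}},$$
$$H_{\text{c}}(z) = \frac{1}{S_Y^+(z)}\left[\frac{z\,S_Y(z)}{S_Y^-(z)}\right]_+ = \left(1 - az^{-1}\right)\frac{a}{1 - az^{-1}} = a,$$
so the causal best linear one-step predictor is simply $\hat{X}_{n+1} = aX_n$.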
Suppose that instead of a Wide-Sense Stationary process, we have an $N$-length signal $X$ which we want to estimate with another $N$-length signal $Y$. We can represent both $X$ and $Y$ as vectors in $\mathbb{C}^N$. If we are allowed to use all entries of $Y$ to estimate $X$, this is identical to linear estimation:
$$\hat{X} = R_{XY}R_Y^{-1}Y.$$
Note that this requires $R_Y$ to be invertible. Suppose that we wanted to design a causal filter for the vector case, so $\hat{X}_i$ only depends on $Y_j$ for $j \le i$; in other words, $\hat{X} = HY$ for a lower triangular matrix $H$. By the orthogonality principle,
$$\mathbb{E}\left[\left(X - HY\right)Y^*\right]_{ij} = 0 \quad \text{for all } j \le i.$$
In matrix form, this means
$$R_{XY} - HR_Y = U,$$
where $U$ is strictly upper triangular. Applying the LDL decomposition $R_Y = LDL^*$, we see that
$$H = \mathcal{L}\left[R_{XY}L^{-*}\right]D^{-1}L^{-1},$$
where $\mathcal{L}[\cdot]$ represents the lower triangular part of a matrix (including the diagonal).
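As a quick numerical sanity check of this formula, here is a minimal sketch in NumPy. It substitutes a Cholesky factor $R_Y = GG^*$ for the LDL factorization (which only changes a diagonal rescaling that cancels out), and the covariances are randomly generated for illustration.

```python
# Minimal sketch: causal linear estimation of a finite-length X from Y.
# Uses R_Y = G G^T (Cholesky) so that H = tril(R_XY G^{-T}) G^{-1}.
import numpy as np

rng = np.random.default_rng(0)
N = 5

# Build a valid joint covariance for the stacked vector [X; Y].
M = rng.standard_normal((2 * N, 2 * N))
Sigma = M @ M.T
R_XY = Sigma[:N, N:]          # cross-covariance of X with Y
R_Y = Sigma[N:, N:]           # covariance of Y

G = np.linalg.cholesky(R_Y)   # lower-triangular factor of R_Y
# Keep only the lower-triangular part (including the diagonal), as in the formula above.
H = np.tril(R_XY @ np.linalg.inv(G.T)) @ np.linalg.inv(G)

# Sanity checks: H is lower triangular (causal), and the orthogonality principle
# holds, i.e. R_XY - H R_Y is strictly upper triangular.
assert np.allclose(H, np.tril(H))
assert np.allclose(np.tril(R_XY - H @ R_Y), 0)
print(np.round(H, 3))
```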
Suppose we have a Hidden Markov Process $(X_n, Y_n)$ where $X_n$ is the hidden state and $Y_n$ is the observation. We can think of determining the state $X_n$ as filtering the observations $Y$. Suppose we want to know the distribution of $X_n$ after we have observed $Y_0, \ldots, Y_n$. By Bayes' rule and the structure of the HMM,
$$p(x_n \mid y_0^n) \propto p(y_n \mid x_n)\,p(x_n \mid y_0^{n-1}).$$
Now if we know $p(x_n \mid y_0^{n-1})$, then we are set. Expanding it over the previous state,
$$p(x_n \mid y_0^{n-1}) = \sum_{x_{n-1}} p(x_n \mid x_{n-1})\,p(x_{n-1} \mid y_0^{n-1}).$$
Now we have a recursive algorithm (the forward algorithm) for computing the distribution of $X_n$ given the observations seen so far.
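This forward recursion is straightforward to implement for a finite-state chain. Below is a minimal sketch; the array conventions are assumptions made here for illustration: `P[i, j]` $= p(x_{n+1}=j \mid x_n=i)$, `B[i, k]` $= p(y=k \mid x=i)$, and `pi0` is the initial state distribution.

```python
# Minimal sketch of the HMM forward (filtering) recursion.
import numpy as np

def hmm_filter(P, B, pi0, observations):
    """Return the predicted p(x_n | y_0^{n-1}) and filtered p(x_n | y_0^n) for each n."""
    n_steps, n_states = len(observations), len(pi0)
    predicted = np.zeros((n_steps, n_states))
    filtered = np.zeros((n_steps, n_states))
    for n, y in enumerate(observations):
        # Time update: p(x_n | y_0^{n-1}) = sum_{x_{n-1}} p(x_n | x_{n-1}) p(x_{n-1} | y_0^{n-1}).
        predicted[n] = pi0 if n == 0 else filtered[n - 1] @ P
        # Measurement update: p(x_n | y_0^n) is proportional to p(y_n | x_n) p(x_n | y_0^{n-1}).
        unnormalized = B[:, y] * predicted[n]
        filtered[n] = unnormalized / unnormalized.sum()
    return predicted, filtered

# Example usage with a 2-state chain and binary observations.
P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
predicted, filtered = hmm_filter(P, B, pi0, observations=[0, 1, 1, 0])
print(filtered)
```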

Suppose we are allowed to non-causally filter our signal, and we care about the distribution of $X_n$ after we have observed $Y_0, \ldots, Y_N$. In other words, for $n \le N$, we want to find $p(x_n \mid y_0^N)$. When $n = N$, this is just the filtered distribution $p(x_N \mid y_0^N)$ computed by the forward recursion. If we continue expanding backwards, then
$$p(x_n \mid y_0^N) = \sum_{x_{n+1}} p(x_n \mid x_{n+1}, y_0^n)\,p(x_{n+1} \mid y_0^N), \qquad p(x_n \mid x_{n+1}, y_0^n) = \frac{p(x_{n+1} \mid x_n)\,p(x_n \mid y_0^n)}{\sum_{x'} p(x_{n+1} \mid x')\,p(x' \mid y_0^n)}.$$
This gives us a clear algorithm for non-causally computing the distribution of $X_n$: run the forward recursion up to time $N$, and then recurse backwards from $n = N$.
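A matching sketch of the backward (smoothing) pass is below; it reuses `P`, `predicted`, and `filtered` from the forward sketch above, with the same assumed conventions.

```python
# Backward (smoothing) pass; consumes the outputs of hmm_filter above.
import numpy as np

def hmm_smooth(P, predicted, filtered):
    """Return p(x_n | y_0^N) for every n, given the forward-pass outputs."""
    smoothed = np.zeros_like(filtered)
    smoothed[-1] = filtered[-1]                  # base case n = N
    for n in range(len(filtered) - 2, -1, -1):
        # backward_kernel[i, j] = p(x_n = i | x_{n+1} = j, y_0^n); predicted[n + 1]
        # supplies the normalizer sum_{x'} p(x_{n+1} | x') p(x' | y_0^n).
        backward_kernel = (filtered[n][:, None] * P) / predicted[n + 1][None, :]
        smoothed[n] = backward_kernel @ smoothed[n + 1]
    return smoothed

smoothed = hmm_smooth(P, predicted, filtered)    # continues the example above
print(smoothed)
```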

Suppose we want to find the most likely sequence of states given our observations. This means we should compute
$$\hat{x}_0^n = \arg\max_{x_0^n}\ p(x_0^n \mid y_0^n) = \arg\max_{x_0^n}\ p(x_0^n, y_0^n).$$
We see that there is a recursion in the joint distribution, so if we let
$$V_n(x_n) = \max_{x_0^{n-1}} p(x_0^n, y_0^n),$$
then
$$V_n(x_n) = p(y_n \mid x_n)\,\max_{x_{n-1}}\ p(x_n \mid x_{n-1})\,V_{n-1}(x_{n-1}).$$
The base case is that $V_0(x_0) = p(x_0)\,p(y_0 \mid x_0)$. $V_n$ is useful because
$$\max_{x_0^n} p(x_0^n, y_0^n) = \max_{x_n} V_n(x_n).$$
This is because we can first maximize over $x_0^{n-2}$ and $x_{n-1}$, so the only thing left to maximize over is $x_n$. Once we have $\hat{x}_n = \arg\max_{x_n} V_n(x_n)$, then we can compute the remaining states $\hat{x}_{n-1}, \ldots, \hat{x}_0$ by back-tracking:
$$\hat{x}_{k-1} = \arg\max_{x_{k-1}}\ p(\hat{x}_k \mid x_{k-1})\,V_{k-1}(x_{k-1}).$$
Putting these equations together gives us the Viterbi algorithm.
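Below is a minimal log-domain sketch of this procedure, using the same assumed `P`, `B`, `pi0` conventions as the earlier HMM sketches; the back-pointers store the maximizing previous state so that the final back-tracking step is a table lookup.

```python
# Minimal sketch of the Viterbi algorithm (log-domain).
import numpy as np

def viterbi(P, B, pi0, observations):
    """Return the most likely state sequence given the observations."""
    n_steps, n_states = len(observations), len(pi0)
    logV = np.zeros((n_steps, n_states))
    back = np.zeros((n_steps, n_states), dtype=int)
    logV[0] = np.log(pi0) + np.log(B[:, observations[0]])   # base case V_0
    for n in range(1, n_steps):
        # scores[i, j] = log p(x_n = j | x_{n-1} = i) + log V_{n-1}(i)
        scores = np.log(P) + logV[n - 1][:, None]
        back[n] = np.argmax(scores, axis=0)
        logV[n] = np.log(B[:, observations[n]]) + np.max(scores, axis=0)
    # Back-track from the most likely final state.
    path = [int(np.argmax(logV[-1]))]
    for n in range(n_steps - 1, 0, -1):
        path.append(int(back[n][path[-1]]))
    return path[::-1]

P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.8, 0.2], [0.3, 0.7]])
pi0 = np.array([0.5, 0.5])
print(viterbi(P, B, pi0, observations=[0, 1, 1, 0]))
```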

In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model (say $X_{n+1} = AX_n + W_n$ and $Y_n = CX_n + V_n$, where the state evolution noise $W_n$ and observation noise $V_n$ are uncorrelated white noises with auto-correlations $Q$ and $R$). We want to predict the state vectors $X_n$ using some linear combination of the observations $Y_n$.

Suppose that we want to compute the one-step prediction. In other words, given $Y_0, \ldots, Y_n$, we want to predict $X_{n+1}$; call this prediction $\hat{X}_{n+1|n}$. Our observations $Y_i$ are the only thing which gives us information about the state, so it would be nice if we could de-correlate all of the $Y_i$. To do this, we can define the innovation process
$$e_n = Y_n - \hat{Y}_{n|n-1} = Y_n - C\hat{X}_{n|n-1}.$$
The last equality follows from the state-space model and the fact that the current observation noise is uncorrelated with the past observations. Now, to compute the one-step prediction, we just need to project $X_{n+1}$ onto the innovations:
$$\hat{X}_{n+1|n} = \sum_{i=0}^{n}\mathbb{E}\left[X_{n+1}e_i^*\right]R_{e,i}^{-1}e_i = A\sum_{i=0}^{n-1}\mathbb{E}\left[X_{n}e_i^*\right]R_{e,i}^{-1}e_i + \mathbb{E}\left[X_{n+1}e_n^*\right]R_{e,n}^{-1}e_n = A\hat{X}_{n|n-1} + \mathbb{E}\left[X_{n+1}e_n^*\right]R_{e,n}^{-1}e_n,$$
where $R_{e,i} = \mathbb{E}[e_ie_i^*]$. Here we have used the Wide-Sense Markovity of state-space models and the fact that the state evolution noises are uncorrelated with the past innovations. If we let $K_{p,n} = \mathbb{E}[X_{n+1}e_n^*]R_{e,n}^{-1}$ (called the prediction gain), then we have a recursive estimate of the optimal one-step predictor:
$$\hat{X}_{n+1|n} = A\hat{X}_{n|n-1} + K_{p,n}e_n.$$
Now, we just need to find a recursive formulation for $K_{p,n}$ and $R_{e,n}$. Starting with $R_{e,n}$, notice that we can write $e_n = C(X_n - \hat{X}_{n|n-1}) + V_n$, so
$$R_{e,n} = C\,\mathbb{E}\left[(X_n - \hat{X}_{n|n-1})(X_n - \hat{X}_{n|n-1})^*\right]C^* + R.$$
To find $K_{p,n}$, we should first find $\mathbb{E}[X_{n+1}e_n^*]$:
$$\mathbb{E}\left[X_{n+1}e_n^*\right] = A\,\mathbb{E}\left[(X_n - \hat{X}_{n|n-1})(X_n - \hat{X}_{n|n-1})^*\right]C^*.$$
Notice that the matrix $P_{n|n-1} := \mathbb{E}\left[(X_n - \hat{X}_{n|n-1})(X_n - \hat{X}_{n|n-1})^*\right]$ is the auto-correlation of the estimation error, and it shows up in both $R_{e,n}$ and $K_{p,n}$. It would be useful to have a recursive solution for this matrix as well. Expanding the error $X_{n+1} - \hat{X}_{n+1|n} = A(X_n - \hat{X}_{n|n-1}) + W_n - K_{p,n}e_n$ and computing its auto-correlation yields the Riccati recursion
$$P_{n+1|n} = AP_{n|n-1}A^* + Q - K_{p,n}R_{e,n}K_{p,n}^*.$$
Putting this into a concrete algorithm, we get the Kalman Prediction Filter.
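A minimal sketch of the resulting recursion is below. The model matrices `A`, `C`, `Q`, `R`, the initial prediction, and the initial error covariance are illustrative assumptions (and are taken to be time-invariant to keep the code short).

```python
# Minimal sketch of the Kalman prediction filter for X_{n+1} = A X_n + W_n, Y_n = C X_n + V_n.
import numpy as np

def kalman_predictor(A, C, Q, R, x0_pred, P0_pred, observations):
    """Return the one-step predictions X_{n|n-1} computed from the innovation recursion."""
    x_pred, P_pred = x0_pred.astype(float), P0_pred.astype(float)
    predictions = []
    for y in observations:
        predictions.append(x_pred)
        innovation = y - C @ x_pred                        # e_n = Y_n - C X_{n|n-1}
        R_e = C @ P_pred @ C.T + R                         # innovation auto-correlation
        K_p = A @ P_pred @ C.T @ np.linalg.inv(R_e)        # prediction gain
        x_pred = A @ x_pred + K_p @ innovation             # X_{n+1|n}
        P_pred = A @ P_pred @ A.T + Q - K_p @ R_e @ K_p.T  # Riccati recursion for P_{n+1|n}
    return np.array(predictions)

# Tiny scalar example: a slowly varying state observed in noise.
A = np.array([[0.95]]); C = np.array([[1.0]])
Q = np.array([[0.1]]);  R = np.array([[1.0]])
ys = [np.array([1.0]), np.array([0.8]), np.array([1.2]), np.array([0.9])]
print(kalman_predictor(A, C, Q, R, x0_pred=np.zeros(1), P0_pred=np.eye(1), observations=ys))
```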
The predictive Kalman filter goes directly from $\hat{X}_{n|n-1}$ to $\hat{X}_{n+1|n}$ without ever determining $\hat{X}_{n|n}$, the best estimate of the current state. The Schmidt Modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.

1. Measurement Update: Find $\hat{X}_{n|n}$ given the latest observation $Y_n$ and the prediction $\hat{X}_{n|n-1}$.
2. State Evolution (Time) Update: Find $\hat{X}_{n+1|n}$ using what we know about the state evolution.
This mimics the approach of the forward algorithm for Hidden Markov Models, which separated updates to the distribution into a time update and a measurement update. Using our innovation process, the measurement update is
$$\hat{X}_{n|n} = \hat{X}_{n|n-1} + K_{f,n}e_n, \qquad K_{f,n} = P_{n|n-1}C^*R_{e,n}^{-1}.$$
The gain $K_{f,n}$ (the coefficient of the innovation) is called the Kalman Filter Gain. The error of our estimator $\hat{X}_{n|n}$ is given by
$$P_{n|n} = \mathbb{E}\left[(X_n - \hat{X}_{n|n})(X_n - \hat{X}_{n|n})^*\right] = P_{n|n-1} - K_{f,n}R_{e,n}K_{f,n}^*.$$
For the time update,
$$\hat{X}_{n+1|n} = A\hat{X}_{n|n}, \qquad P_{n+1|n} = AP_{n|n}A^* + Q.$$
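A minimal sketch of this two-step form, under the same assumptions and names as the prediction-filter sketch above:

```python
# Minimal sketch of the measurement-update / time-update (Schmidt) form of the Kalman filter.
import numpy as np

def kalman_filter(A, C, Q, R, x0_pred, P0_pred, observations):
    """Return the filtered estimates X_{n|n}."""
    x_pred, P_pred = x0_pred.astype(float), P0_pred.astype(float)
    filtered = []
    for y in observations:
        # Measurement update: fold in the newest observation via the Kalman filter gain.
        R_e = C @ P_pred @ C.T + R
        K_f = P_pred @ C.T @ np.linalg.inv(R_e)
        x_filt = x_pred + K_f @ (y - C @ x_pred)
        P_filt = P_pred - K_f @ R_e @ K_f.T
        filtered.append(x_filt)
        # Time update: push the estimate through the state evolution.
        x_pred = A @ x_filt
        P_pred = A @ P_filt @ A.T + Q
    return np.array(filtered)

A = np.array([[0.95]]); C = np.array([[1.0]])
Q = np.array([[0.1]]);  R = np.array([[1.0]])
ys = [np.array([1.0]), np.array([0.8]), np.array([1.2]), np.array([0.9])]
print(kalman_filter(A, C, Q, R, x0_pred=np.zeros(1), P0_pred=np.eye(1), observations=ys))
```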