If we think of our signal as a discrete-time random process, then, just like a normal deterministic signal, we can try filtering our random process.
As with deterministic signals, filtering can be accomplished either with an LTI system or with some other non-linear or time-varying system.
LTI Filtering on WSS Processes
If we use an LTI filter on a WSS process, then we can easily compute how the filter impacts the spectrum of the signal.
Theorem 13
Wiener Filter
Non-Causal Case
Starting with the non-causal case, we can apply the orthogonality principle: the estimation error $X_n - \hat{X}_n$ must be orthogonal to every observation, so $R_{XY}(m) = \sum_{k} h(k)R_Y(m-k)$ for all $m$, or equivalently, $S_{XY}(z) = H(z)S_Y(z)$.
When we cascade these filters,
Definition 21
Causal Case
Taking the unilateral Z-transform of both sides,
Definition 22
Theorem 14
Vector Case
Definition 23
In matrix form, this means
Theorem 15
Applying the LDL decomposition, we see that
Definition 24
Hidden Markov Model State Estimation
Causal Distribution Estimation
Non-Causal Distribution Estimation
State Sequence Estimation
Suppose we want to find the most likely sequence of states given our observations. This means we should compute
$$\hat{x}^n = \arg\max_{x^n} p(x^n \mid y^n).$$
Putting these equations together gives us the Viterbi algorithm.
Kalman Filtering
Kalman Prediction Filter
Putting this into a concrete algorithm, we get the Kalman Prediction Filter.
Schmidt’s Modification of the Kalman Filter
This mimics the approach of the forward algorithm for Hidden Markov Models, which separates updates to the distribution into a time update and a measurement update. Using our innovation process, the measurement update is
$$\hat{X}_{i|i} = \hat{X}_{i|i-1} + P_iH_i^*R_{e,i}^{-1}e_i.$$
For the time update, $\hat{X}_{i+1|i} = F_i\hat{X}_{i|i}$, since the state evolution noise is uncorrelated with the observations up to time $i$.
Writing this as an algorithm,
Kalman Smoother
If we want to look at the error of this estimator, we see that
Writing all of this as an algorithm,
When $Y_n$ is formed by passing a WSS process $X_n$ through a stable LTI system with impulse response $h[n]$ and transfer function $H(z)$, then $S_Y(z) = H(z)S_X(z)H^*(z^{-*})$ and $S_{YX}(z) = H(z)S_X(z)$. If we have a third process $Z_n$ that is jointly WSS with $(Y_n, X_n)$, then $S_{ZY}(z) = S_{ZX}(z)H^*(z^{-*})$.
This gives us an interesting interpretation of the spectral factorization (Definition 14), since it is essentially passing a WSS process with auto-correlation $R_W(k) = r_e\delta[k]$ through a minimum-phase filter with transfer function $L(z)$.
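As a numerical sanity check of Theorem 13 (an illustrative sketch, not part of the original derivation), we can pass white noise through an arbitrary FIR filter and compare the averaged periodogram of the output against $|H(e^{j\omega})|^2 S_X(e^{j\omega})$. The filter taps, variance, and sample sizes below are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary stable FIR filter h[n] and white WSS input X_n with variance sigma2,
# so S_X(e^{jw}) = sigma2 and Theorem 13 predicts S_Y(e^{jw}) = |H(e^{jw})|^2 * sigma2.
h = np.array([1.0, 0.5, 0.25])
sigma2 = 2.0
N, trials = 4096, 200

S_Y_hat = np.zeros(N)
for _ in range(trials):
    X = rng.normal(scale=np.sqrt(sigma2), size=N)
    Y = np.convolve(X, h, mode="same")            # Y_n = (h * X)_n
    S_Y_hat += np.abs(np.fft.fft(Y)) ** 2 / N     # periodogram of this realization
S_Y_hat /= trials                                 # average over realizations

S_Y_theory = np.abs(np.fft.fft(h, N)) ** 2 * sigma2
print("mean relative error:", np.mean(np.abs(S_Y_hat - S_Y_theory) / S_Y_theory))
```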
Suppose we have a stochastic WSS process $Y_n$ that is jointly WSS with $X_n$ and that we want to find the best linear estimator of $X_n$ using $Y_n$. The best linear estimator of $X_n$ given the observations $Y_n$ can be written as
$$\hat{X}_n = \sum_{m\in\mathbb{Z}} h(m)Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through an LTI filter. If we restrict ourselves to using $\{Y_i\}_{i=-\infty}^{n}$ to estimate $X_n$, then the best linear estimator can be written as
$$\hat{X}_n = \sum_{m=0}^{\infty} h(m)Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if each of the random variables we are using for estimation were uncorrelated with each other. In other words, instead of using $Y$ directly, we want to transform $Y$ into a new process $W$ where $R_W(k) = \delta[k]$. This transformation is known as whitening. From the spectral factorization of $Y$, we know that if we use the filter $G(z) = \frac{1}{S_Y^+(z)}$, then
$$S_W(z) = G(z)S_Y(z)G^*(z^{-*}) = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1,$$
so $W$ is indeed white.
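For a concrete illustration of whitening (a standard example chosen here for illustration, not a process from earlier in the notes): if $Y_n = aY_{n-1} + W_n$ with $|a| < 1$ and $W_n$ unit-variance white noise, then $S_Y(z) = \frac{1}{(1 - az^{-1})(1 - az)}$, so $S_Y^+(z) = \frac{1}{1 - az^{-1}}$ and $G(z) = 1 - az^{-1}$. Applying $G$ gives $W_n = Y_n - aY_{n-1}$, which recovers the driving noise and indeed has $R_W(k) = \delta[k]$.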
The best linear estimator of $X_n$ using $Y_n$, where $(X_n, Y_n)$ are jointly WSS, is given by the non-causal Wiener filter
$$H(z) = \frac{S_{XY}(z)}{S_Y(z)}.$$
If we interpret Definition 21 in the frequency domain, then for a specific $\omega$ we can understand $H(e^{j\omega})$ as an optimal linear estimator for $F_X(\omega)$, where $F_X(\omega)$ is the stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decomposition of $Y_n$.
Since $F_X$ and $F_Y$ have jointly orthogonal increments, this tells us that $H(e^{j\omega})$ is just the optimal linear estimator of $dF_X(\omega)$ using $dF_Y(\omega)$. $dF_X(\omega)$ and $dF_Y(\omega)$ live in a Hilbert space, meaning we are essentially projecting each frequency component of $X_n$ onto the corresponding frequency component of $Y_n$.
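To make the non-causal Wiener filter concrete, here is a small simulation sketch under an assumed toy model (not from the notes): $X_n$ is an AR(1) process observed in independent additive white noise, so $S_{XY} = S_X$, $S_Y = S_X + \sigma_v^2$, and $H = S_X / (S_X + \sigma_v^2)$ can be applied in the frequency domain.

```python
import numpy as np

rng = np.random.default_rng(1)

# Assumed toy model: X_n = a X_{n-1} + (unit-variance white noise), Y_n = X_n + V_n
# with V_n white of variance sv2 and independent of X. Then S_XY = S_X and
# S_Y = S_X + sv2, so the non-causal Wiener filter is H = S_X / (S_X + sv2).
a, sv2, N = 0.9, 1.0, 2**14
w = np.exp(1j * 2 * np.pi * np.arange(N) / N)   # points e^{jw} on the unit circle
S_X = 1.0 / np.abs(1 - a / w) ** 2              # S_X(e^{jw}) for the AR(1) process
H = S_X / (S_X + sv2)                           # H(e^{jw}) = S_XY / S_Y

X = np.zeros(N)
for n in range(1, N):
    X[n] = a * X[n - 1] + rng.normal()
Y = X + rng.normal(scale=np.sqrt(sv2), size=N)

# Apply the (two-sided) filter in the frequency domain; circular effects are negligible here.
X_hat = np.real(np.fft.ifft(np.fft.fft(Y) * H))

print("raw MSE:     ", np.mean((Y - X) ** 2))
print("filtered MSE:", np.mean((X_hat - X) ** 2))   # noticeably smaller
```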
First, note that in the causal case, whitening doesn't break causality, because $\frac{1}{S_Y^+(z)}$ is causal (it is the inverse of a minimum-phase filter). When we apply the orthogonality principle to the whitened process, we find that the causal filter $Q$ acting on $W$ must satisfy $q(m) = R_{XW}(m)$ for $m \geq 0$ (and $q(m) = 0$ for $m < 0$), since $R_W(k) = \delta[k]$.
Thus the filter $H$ which gives the causal best linear estimator of $X$ using $Y$ is
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+\frac{1}{S_Y^+(z)}.$$
The best linear estimator of $X_n$ using $\{Y_i\}_{i=-\infty}^{n}$ is given by the causal Wiener filter
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+\frac{1}{S_Y^+(z)}.$$
Intuitively, this should make sense because we are using the same whitened process $W$ as in the non-causal case, but only the samples of it that we are allowed to use; hence we take the unilateral Z-transform of the non-causal Wiener filter, which amounts to truncating the non-causal filter to make it causal.
If $\hat{X}_{NC}(n)$ is the non-causal Wiener estimate of $X$, then the causal Wiener filter of $X$ given $Y$ is the same as the causal Wiener filter of $\hat{X}_{NC}$ given $Y$, and if $Y$ is white noise, then
$$\hat{X}_C(n) = \sum_{i=0}^{\infty} h[i]Y_{n-i},$$
where $h$ is the impulse response of the non-causal Wiener filter.
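A small simulation sketch of the white-observation case of Theorem 14 (the two-sided filter $g$ and the sample size are arbitrary choices): when $Y_n$ is unit-variance white noise and $X_n = \sum_k g[k]Y_{n-k}$, the non-causal Wiener filter is just $h = g$, and the causal estimate drops the anticausal taps; its MSE equals the energy in the dropped taps.

```python
import numpy as np

rng = np.random.default_rng(2)

lags = np.arange(-3, 4)
g = 0.6 ** np.abs(lags)                 # two-sided impulse response for lags -3..3

N = 50_000
Y = rng.normal(size=N)                  # white observations
X = np.convolve(Y, g, mode="same")      # X_n = sum_k g[k] Y_{n-k}

h_causal = np.where(lags >= 0, g, 0.0)  # truncate the anticausal part of h = g
X_c = np.convolve(Y, h_causal, mode="same")

print("causal-estimate MSE:", np.mean((X - X_c) ** 2))
print("energy of dropped anticausal taps:", np.sum(g[lags < 0] ** 2))
```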
Suppose that instead of a Wide-Sense Stationary process, we have a length-$N$ signal $X$ which we want to estimate with another length-$N$ signal $Y$. We can represent both $X$ and $Y$ as vectors in $\mathbb{C}^N$. If we are allowed to use all entries of $Y$ to estimate $X$, this is identical to linear estimation.
The non-causal Wiener filter of $X$ given a finite length-$N$ signal $Y$ is given by
$$K_s = R_{XY}R_Y^{-1}.$$
Note that this requires $R_Y \succ 0$. Suppose that we wanted to design a causal filter for the vector case, so $\hat{X}_i$ only depends on $\{Y_j\}_{j=1}^{i}$. By the orthogonality principle, $\langle X_i - \hat{X}_i, Y_j\rangle = 0$ for all $j \leq i$.
If a matrix $H \succ 0$, then there exists a unique factorization $H = LDL^*$ where $L$ is lower triangular and invertible with unit diagonal entries and $D$ is diagonal with positive entries.
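Numerically, the LDL factorization in Theorem 15 can be obtained from a Cholesky factor by rescaling its columns. A minimal sketch (the helper name ldl and the test matrix are illustrative):

```python
import numpy as np

def ldl(H):
    """LDL* factorization of a Hermitian positive definite H via Cholesky.

    If H = C C^* with C lower triangular, then L = C diag(C)^{-1} has unit
    diagonal and D = diag(C)^2 is diagonal, so H = L D L^*.
    """
    C = np.linalg.cholesky(H)
    d = np.real(np.diag(C)) ** 2
    L = C / np.diag(C)
    return L, d

# Quick check on a random positive definite matrix.
rng = np.random.default_rng(3)
A = rng.normal(size=(5, 5))
H = A @ A.T + 5 * np.eye(5)
L, d = ldl(H)
print(np.allclose(L @ np.diag(d) @ L.conj().T, H))  # True
```

In the causal vector filter, factoring $R_Y = LDL^*$ plays the same role that whitening played before: $W = L^{-1}Y$ satisfies $\langle W, W\rangle = D$, which is diagonal.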
Now we have a recursive algorithm for computing the distribution of $x_t$.
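A minimal sketch of this causal (forward) recursion for a finite state space, assuming a transition matrix A with A[i, j] = p(x_{t+1}=j | x_t=i), an emission matrix B with B[i, k] = p(y_t=k | x_t=i), and an initial distribution pi; these names and the discrete-observation setup are illustrative conventions, not fixed by the notes.

```python
import numpy as np

def forward(pi, A, B, ys):
    """Causal state-distribution estimates alpha_t(x) = p(x_t | y_1..y_t).

    pi: (S,) initial distribution; A: (S, S) transition matrix,
    A[i, j] = p(x_{t+1}=j | x_t=i); B: (S, K) emission matrix,
    B[i, k] = p(y_t=k | x_t=i); ys: sequence of observed symbol indices.
    """
    alpha = pi * B[:, ys[0]]
    alpha /= alpha.sum()
    alphas = [alpha]
    for y in ys[1:]:
        alpha = (alpha @ A) * B[:, y]   # time update, then measurement update
        alpha /= alpha.sum()            # normalize to get p(x_t | y^t)
        alphas.append(alpha)
    return np.array(alphas)
```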
Suppose we are allowed to non-causally filter our signal and we care about the distribution of $X_t$ after we have observed $Y_1, \ldots, Y_n$. In other words, for $t \leq n$, we want to find $\gamma_t(x_t) = p(x_t \mid y^n)$. When $t = n$, $\gamma_n(x_n) = \alpha_n(x_n)$. If we continue expanding backwards, then
$$\gamma_t(x_t) = \sum_{x_{t+1}} \gamma_{t+1}(x_{t+1}) \frac{p(x_{t+1}\mid x_t)\,\alpha_t(x_t)}{\sum_{x_t'} p(x_{t+1}\mid x_t')\,\alpha_t(x_t')}.$$
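Continuing the sketch above, one way to carry out this backward expansion in code, reusing the stored forward estimates (same illustrative conventions as before):

```python
import numpy as np

def smooth(alphas, A):
    """Non-causal estimates gamma_t(x) = p(x_t | y_1..y_n) from the forward pass.

    alphas: (n, S) output of forward(); A: (S, S) transition matrix.
    Uses gamma_n = alpha_n and the backward expansion written above.
    """
    n, S = alphas.shape
    gammas = np.zeros((n, S))
    gammas[-1] = alphas[-1]
    for t in range(n - 2, -1, -1):
        # joint[j, i] = p(x_{t+1}=j | x_t=i) * alpha_t(i); normalizing each row
        # gives p(x_t=i | x_{t+1}=j, y^t) via Bayes' rule.
        joint = A.T * alphas[t]
        cond = joint / joint.sum(axis=1, keepdims=True)
        gammas[t] = gammas[t + 1] @ cond
    return gammas
```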
The base case is that $V_1(x_1) = p(x_1)p(y_1 \mid x_1)$. $V_t$ is useful because $\hat{x}_n = \arg\max_{x_n} V_n(x_n)$. This is because we can first maximize over $x^{n-1}$, so the only thing left to maximize over is $x_n$. Once we have $\hat{x}_t$, we can compute $\hat{x}_{t-1}$ by
$$\hat{x}_{t-1} = \arg\max_{x_{t-1}} p(\hat{x}_t \mid x_{t-1})V_{t-1}(x_{t-1}).$$
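A log-domain sketch of the $V_t$ recursion and the backtracking step just described (same illustrative pi, A, B conventions as the earlier HMM sketches):

```python
import numpy as np

def viterbi(pi, A, B, ys):
    """Most likely state sequence via V_t and backtracking (log-domain)."""
    n, S = len(ys), len(pi)
    logA, logB = np.log(A), np.log(B)
    V = np.log(pi) + logB[:, ys[0]]            # V_1(x_1) = log p(x_1) p(y_1 | x_1)
    back = np.zeros((n, S), dtype=int)
    for t in range(1, n):
        scores = V[:, None] + logA             # scores[i, j]: best path ending at i, then i -> j
        back[t] = np.argmax(scores, axis=0)
        V = scores[back[t], np.arange(S)] + logB[:, ys[t]]
    x_hat = np.zeros(n, dtype=int)
    x_hat[-1] = np.argmax(V)                   # x_hat_n = argmax_{x_n} V_n(x_n)
    for t in range(n - 1, 0, -1):
        x_hat[t - 1] = back[t, x_hat[t]]       # backtrack as in the equation above
    return x_hat
```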
In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model. We want to estimate the state vectors $X_i$ using linear combinations of the observations $Y_i$.
Suppose that we want to compute the one-step prediction. In other words, given the observations up to time $i$, we want to predict $X_{i+1}$. Our observations $Y$ are the only thing which give us information about the state, so it would be nice if we could de-correlate all of the $Y$. To do this, we can define the innovation process
$$e_i = Y_i - \hat{Y}_{i|i-1} = Y_i - H_i\hat{X}_{i|i-1}.$$
The last equality follows from the state-space model and the fact that past observation noises are uncorrelated with the current one. Now, to compute the one-step prediction, we just need to project $X_{i+1}$ onto the innovations.
The second-to-last equality follows from the wide-sense Markovity of state-space models, and the last equality is due to the state evolution noises being uncorrelated. If we let $K_{p,i} = \langle X_{i+1}, e_i\rangle R_{e,i}^{-1}$ (called the prediction gain), then we have a recursive estimate of the optimal one-step predictor.
$$\hat{X}_{i+1|i} = F_i\hat{X}_{i|i-1} + K_{p,i}e_i.$$
Now, we just need to find a recursive formulation for $K_{p,i}$ and $R_{e,i}$. Starting with $R_{e,i}$, notice that we can write $e_i = Y_i - H_i\hat{X}_{i|i-1} = H_i(X_i - \hat{X}_{i|i-1}) + V_i$.
Notice that the matrix $P_i = \langle X_i - \hat{X}_{i|i-1}, X_i - \hat{X}_{i|i-1}\rangle$ is the auto-correlation of the estimation error, and it shows up in both $K_{p,i}$ and $R_{e,i}$. It would be useful to have a recursive solution for this matrix as well.
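Putting the innovation, gain, and predictor recursions together as code: a sketch assuming the standard state-space model $X_{i+1} = FX_i + GU_i$, $Y_i = HX_i + V_i$ with white $U_i, V_i$ of covariances $Q, R$, uncorrelated with each other and with the initial state (time-invariant matrices are used only for brevity). The Riccati update for $P_i$ written below is the standard one; it is not derived in the lines above.

```python
import numpy as np

def kalman_predict(F, G, H, Q, R, ys, x0, P0):
    """Kalman prediction filter: returns the one-step predictors X_hat_{i+1|i}."""
    x, P = np.array(x0, dtype=float), np.array(P0, dtype=float)
    preds = []
    for y in ys:
        e = y - H @ x                          # innovation e_i = Y_i - H X_hat_{i|i-1}
        Re = H @ P @ H.T + R                   # R_{e,i}
        Kp = F @ P @ H.T @ np.linalg.inv(Re)   # prediction gain K_{p,i}
        x = F @ x + Kp @ e                     # X_hat_{i+1|i} = F X_hat_{i|i-1} + K_{p,i} e_i
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T   # standard Riccati update for P_{i+1}
        preds.append(x)
    return np.array(preds)
```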
The predictive Kalman filter goes directly from $\hat{X}_{i|i-1}$ to $\hat{X}_{i+1|i}$ without ever determining $\hat{X}_{i|i}$. The Schmidt modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.
Measurement Update: Find $\hat{X}_{i|i}$ given the latest observation $Y_i$ and $\hat{X}_{i|i-1}$.
State Evolution (Time) Update: Find $\hat{X}_{i+1|i}$ using what we know about the state evolution.
The gain applied to the innovation, $K_{f,i} = P_iH_i^*R_{e,i}^{-1}$, is called the Kalman Filter Gain. The error of our estimator, $P_{i|i} = \langle X_i - \hat{X}_{i|i}, X_i - \hat{X}_{i|i}\rangle$, is given by
$$P_{i|i} = P_i - P_iH_i^*R_{e,i}^{-1}H_iP_i.$$
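A sketch of one measurement-plus-time update, under the same assumed state-space model as the prediction-filter sketch above:

```python
import numpy as np

def kalman_filter_step(F, G, H, Q, R, x_pred, P_pred, y):
    """One step of the two-stage (Schmidt) form: measurement update, then time update."""
    # Measurement update: fold in Y_i to get X_hat_{i|i} and P_{i|i}.
    e = y - H @ x_pred
    Re = H @ P_pred @ H.T + R
    Kf = P_pred @ H.T @ np.linalg.inv(Re)      # Kalman filter gain K_{f,i} = P_i H^* R_{e,i}^{-1}
    x_filt = x_pred + Kf @ e                   # X_hat_{i|i}
    P_filt = P_pred - Kf @ Re @ Kf.T           # P_{i|i} = P_i - P_i H^* R_{e,i}^{-1} H P_i
    # Time update: propagate through the state equation.
    x_next = F @ x_filt                        # X_hat_{i+1|i} = F X_hat_{i|i}
    P_next = F @ P_filt @ F.T + G @ Q @ G.T
    return x_filt, P_filt, x_next, P_next
```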
The Kalman Prediction Filter and Schmidt's modification of the Kalman filter are both causal filters. The Kalman Smoother provides a non-causal estimate in the same way that the Backward Recursion algorithm does for Hidden Markov Processes. In other words, the Kalman Smoother computes $\hat{X}_{i|N}$, the best linear estimator of $X_i$ from $\{Y_0, \cdots, Y_N\}$. As before, we can start with the innovation process $e_j = Y_j - H_j\hat{X}_{j|j-1}$.
$X_j - \hat{X}_{j|j-1}$ is orthogonal to any linear function of $\{Y_0, \cdots, Y_{j-1}\}$, so when $j \geq i$, it must be orthogonal to $\hat{X}_{i|i-1}$ since the latter is a function of $\{Y_k\}_{k=0}^{i-1}$. Thus, for $j \geq i$,
$$\langle X_i, e_j\rangle = \langle X_i - \hat{X}_{i|i-1}, X_j - \hat{X}_{j|j-1}\rangle H_j^*.$$
If we denote $P_{ij} = \langle X_i - \hat{X}_{i|i-1}, X_j - \hat{X}_{j|j-1}\rangle$, then
$$P_{ij} = P_i\Phi_p^*(j,i), \qquad \Phi_p(j,i) = \begin{cases} I & \text{if } j = i, \\ \prod_{k=i}^{j-1} F_{p,k} & \text{if } j > i, \end{cases}$$
where $F_{p,k} = (F_k - K_{p,k}H_k)$. This gives us the expression
$$\hat{X}_{i|N} = \hat{X}_{i|i-1} + P_i\sum_{j=i}^{N}\Phi_p^*(j,i)H_j^*R_{e,j}^{-1}e_j.$$
If we let $\lambda_{i|N} = \sum_{j=i}^{N}\Phi_p^*(j,i)H_j^*R_{e,j}^{-1}e_j$ (so that $\hat{X}_{i|N} = \hat{X}_{i|i-1} + P_i\lambda_{i|N}$), then we get the recursion
$$\lambda_{i|N} = F_{p,i}^*\lambda_{i+1|N} + H_i^*R_{e,i}^{-1}e_i.$$
$$P_{i|N} = P_i - P_i\Lambda_{i|N}P_i, \qquad \text{where } \Lambda_{i|N} = \langle\lambda_{i|N}, \lambda_{i|N}\rangle = F_{p,i}^*\Lambda_{i+1|N}F_{p,i} + H_i^*R_{e,i}^{-1}H_i.$$
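Combining the forward prediction filter with the $\lambda_{i|N}$ recursion above gives a smoother sketch (same assumed state-space model and conventions as the earlier Kalman sketches):

```python
import numpy as np

def kalman_smooth(F, G, H, Q, R, ys, x0, P0):
    """Kalman smoother X_hat_{i|N} via the lambda recursion."""
    n = len(ys)
    x, P = np.array(x0, dtype=float), np.array(P0, dtype=float)
    xs, Ps, es, Res, Fps = [], [], [], [], []
    # Forward pass: prediction filter, storing what the backward pass needs.
    for y in ys:
        xs.append(x); Ps.append(P)
        e = y - H @ x
        Re = H @ P @ H.T + R
        Kp = F @ P @ H.T @ np.linalg.inv(Re)
        es.append(e); Res.append(Re); Fps.append(F - Kp @ H)   # F_{p,i} = F - K_{p,i} H
        x = F @ x + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    # Backward pass: lambda_{i|N} = F_{p,i}^* lambda_{i+1|N} + H^* R_{e,i}^{-1} e_i,
    # then X_hat_{i|N} = X_hat_{i|i-1} + P_i lambda_{i|N}.
    lam = np.zeros_like(np.asarray(x0, dtype=float))
    smoothed = [None] * n
    for i in range(n - 1, -1, -1):
        lam = Fps[i].T @ lam + H.T @ np.linalg.solve(Res[i], es[i])
        smoothed[i] = xs[i] + Ps[i] @ lam
    return np.array(smoothed)
```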