When $Y_n$ is formed by passing a WSS process $X_n$ through a stable LTI system with impulse response $h[n]$ and transfer function $H(z)$, then $S_Y(z) = H(z)S_X(z)H^*(z^{-*})$ and $S_{YX}(z) = H(z)S_X(z)$. If we have a third process $Z_n$ that is jointly WSS with $(Y_n, X_n)$, then $S_{ZY}(z) = S_{ZX}(z)H^*(z^{-*})$.
This gives us an interesting interpretation of the spectral factorization (Definition 14), since it essentially amounts to passing a white WSS process with auto-correlation $R_W(k) = r_e\delta[k]$ through a minimum-phase filter with transfer function $L(z)$.
Suppose we have a WSS process $Y_n$ that is jointly WSS with $X_n$, and we want to find the best linear estimator of $X_n$ using $Y_n$. The best linear estimator of $X_n$ given the observations $Y_n$ can be written as
$$\hat{X}_n = \sum_{m\in\mathbb{Z}} h[m]\,Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through an LTI filter. If we restrict ourselves to using $\{Y_i\}_{i=-\infty}^{n}$ to estimate $X_n$, then the best linear estimator can be written as
$$\hat{X}_n = \sum_{m=0}^{\infty} h[m]\,Y_{n-m} = h[n] * Y_n.$$
This is identical to passing $Y_n$ through a causal LTI filter. Since we are trying to find a best linear estimator, it would be nice if the random variables we use for estimation were uncorrelated with each other. In other words, instead of using $Y$ directly, we want to transform $Y$ into a new process $W$ where $R_W(k) = \delta[k]$. This transformation is known as whitening. From the spectral factorization of $Y$, we know that if we use the filter $G(z) = \frac{1}{S_Y^+(z)}$, then
$$S_W(z) = \frac{S_Y(z)}{S_Y^+(z)S_Y^{+*}(z^{-*})} = \frac{S_Y(z)}{S_Y^+(z)S_Y^-(z)} = 1.$$
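As a quick sanity check on the whitening step, here is a minimal sketch assuming $Y$ is an AR(1) process driven by unit-variance white noise, so that $S_Y^+(z) = \frac{1}{1 - az^{-1}}$ is known in closed form and the whitening filter $\frac{1}{S_Y^+(z)} = 1 - az^{-1}$ is FIR. The pole $a$ and the use of scipy are my choices for illustration.

```python
import numpy as np
from scipy.signal import lfilter

rng = np.random.default_rng(0)

# Hypothetical AR(1) model: Y_n = a*Y_{n-1} + W_n with unit-variance white W,
# so S_Y^+(z) = 1/(1 - a z^{-1}) and the whitening filter is 1 - a z^{-1}.
a, n = 0.8, 100_000
w = rng.standard_normal(n)
y = lfilter([1.0], [1.0, -a], w)        # shape Y by passing white noise through S_Y^+(z)

w_hat = lfilter([1.0, -a], [1.0], y)    # apply G(z) = 1/S_Y^+(z) = 1 - a z^{-1}

# Empirical auto-correlation of the whitened process: should be approximately delta[k].
for k in range(4):
    print(k, np.mean(w_hat[: n - k] * w_hat[k:]))
```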
Now we want to find the best linear estimator of X using our new process W by designing an LTI filter Q(z).
$$\begin{aligned}
E[(X_n - \hat{X}_n)W_{n-k}^*] = 0 &\implies E[X_nW_{n-k}^*] = \sum_{m\in\mathbb{Z}} q[m]\,E[W_{n-m}W_{n-k}^*] \\
\therefore\ R_{XW}(k) = \sum_{m\in\mathbb{Z}} q[m]\,R_W(k-m) &\implies S_{XW}(z) = Q(z)S_W(z) \\
\therefore\ Q(z) = \frac{S_{XW}(z)}{S_W(z)} = S_{XW}(z) &= S_{XY}(z)\left(S_Y^+(z^{-*})\right)^{-*} = \frac{S_{XY}(z)}{S_Y^-(z)}
\end{aligned}$$
$$H(z) = Q(z)G(z) = \frac{S_{XY}(z)}{S_Y^-(z)}\cdot\frac{1}{S_Y^+(z)} = \frac{S_{XY}(z)}{S_Y(z)}.$$
The best linear estimator of Xn using Yn where (Xn,Yn) is jointly WSS is given by the non-causal Wiener filter.
$$H(z) = \frac{S_{XY}(z)}{S_Y(z)}$$
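As a sketch of how this filter is applied, the snippet below assumes the common model $Y_n = X_n + V_n$ with $X$ and $V$ uncorrelated, so that $S_{XY} = S_X$ and $S_Y = S_X + S_V$; it evaluates $H(e^{j\omega})$ on a DFT grid and applies it as a circular approximation. The spectra and noise level are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 4096
omega = 2 * np.pi * np.fft.fftfreq(n)        # frequency grid covering [-pi, pi)

# Assumed spectra: X is AR(1) with pole a, V is white with variance sigma_v2.
a, sigma_v2 = 0.9, 0.5
S_X = 1.0 / np.abs(1 - a * np.exp(-1j * omega)) ** 2
S_V = sigma_v2 * np.ones(n)
H = S_X / (S_X + S_V)                        # H(e^{jw}) = S_XY / S_Y for this model

# Synthesize X with spectrum ~S_X, add white noise, and apply H in the frequency domain.
x = np.fft.ifft(np.sqrt(S_X) * np.fft.fft(rng.standard_normal(n))).real
y = x + np.sqrt(sigma_v2) * rng.standard_normal(n)
x_hat = np.fft.ifft(H * np.fft.fft(y)).real

print(np.mean((x - y) ** 2), np.mean((x - x_hat) ** 2))   # the Wiener estimate has lower MSE
```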
If we interpret Definition 21 in the frequency domain, for a specific $\omega$, we can understand $H(e^{j\omega})$ as an optimal linear estimator for $F_X(\omega)$, where $F_X(\omega)$ is the stochastic process given by the Cramer-Khinchin decomposition (Theorem 7). More specifically, we can use the Cramer-Khinchin decomposition of $Y_n$.
$$\hat{X}_n = \sum_{i\in\mathbb{Z}} h[i]\int_{-\pi}^{\pi} e^{j\omega(n-i)}\,dF_Y(\omega) = \int_{-\pi}^{\pi}\left(\sum_{i\in\mathbb{Z}} h[i]e^{-j\omega i}\right)e^{j\omega n}\,dF_Y(\omega) = \int_{-\pi}^{\pi} H(e^{j\omega})e^{j\omega n}\,dF_Y(\omega)$$
Since $F_X$ and $F_Y$ have jointly orthogonal increments, this tells us that $H(e^{j\omega})$ is just the optimal linear estimator of $dF_X(\omega)$ using $dF_Y(\omega)$. $dF_X(\omega)$ and $dF_Y(\omega)$ live in a Hilbert space, meaning we are essentially projecting each frequency component of $X_n$ onto the corresponding frequency component of $Y_n$.
First, note that in the causal case, whitening doesn't break causality: $S_Y^+(z)$ is minimum-phase, so $\frac{1}{S_Y^+(z)}$ is a causal (and stable) filter. When we apply the orthogonality principle,
$$\begin{aligned}
E[(X_n - \hat{X}_n)W_{n-k}^*] = 0 &\implies E[X_nW_{n-k}^*] = \sum_{m=0}^{\infty} q[m]\,E[W_{n-m}W_{n-k}^*] \\
\therefore\ R_{XW}(k) &= \sum_{m=0}^{\infty} q[m]\,R_W(k-m), \qquad k \ge 0
\end{aligned}$$
We can’t take the Z-transform of both sides because the equation is not necessarily true for k<0. Instead, we can look at the function
$$f(k) = R_{XW}(k) - \sum_{m=0}^{\infty} R_W(k-m)\,q[m] = \begin{cases} 0 & k \ge 0 \\ ? & \text{else.}\end{cases}$$
Since $f(k) = 0$ for $k \ge 0$, the causal part (denoted $[\cdot]_+$) of its Z-transform $F(z)$ must vanish. Because $S_W(z) = 1$ and $Q(z)$ is causal, $[S_W(z)Q(z)]_+ = Q(z)$, so
$$[F(z)]_+ = [S_{XW}(z) - S_W(z)Q(z)]_+ = [S_{XW}(z)]_+ - Q(z) = 0$$
$$\therefore\ Q(z) = [S_{XW}(z)]_+ = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+$$
Thus the filter H which gives the causal best linear estimator of X using Y is
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+\frac{1}{S_Y^+(z)}.$$
The best linear estimator of $X_n$ using $\{Y_i\}_{i=-\infty}^{n}$ is given by the causal Wiener filter.
$$H(z) = Q(z)G(z) = \left[\frac{S_{XY}(z)}{S_Y^-(z)}\right]_+\frac{1}{S_Y^+(z)}.$$
Intuitively, this should make sense: we are using the same whitened process $W$ as in the non-causal case, but only the samples we are allowed to use. Hence we take the unilateral Z-transform (the causal part) of the filter that acts on $W$, which amounts to truncating the non-causal filter to make it causal.
If $\hat{X}_{NC}(n)$ is the non-causal Wiener estimate of $X$, then the causal Wiener filter of $X$ given $Y$ is the same as the causal Wiener filter of $\hat{X}_{NC}$ given $Y$, and if $Y$ is white noise, then
$$\hat{X}_C(n) = \sum_{i=0}^{\infty} h[i]\,Y_{n-i}$$
where $h[i]$ is the impulse response of the non-causal Wiener filter.
Suppose that instead of a Wide-Sense Stationary process, we have a length-$N$ signal $X$ which we want to estimate with another length-$N$ signal $Y$. We can represent both $X$ and $Y$ as vectors in $\mathbb{C}^N$. If we are allowed to use all entries of $Y$ to estimate $X$, this is identical to linear estimation.
The non-causal Wiener filter for estimating $X$ from a length-$N$ signal $Y$ is given by
$$K_s = R_{XY}R_Y^{-1}.$$
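For a concrete (made-up) numerical example, again take $Y = X + V$ with $V$ white and uncorrelated with $X$, so that $R_{XY} = R_X$ and $R_Y = R_X + R_V$:

```python
import numpy as np

rng = np.random.default_rng(2)
N = 8
idx = np.arange(N)
R_X = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # assumed covariance of X
R_V = 0.5 * np.eye(N)                              # white observation noise
R_XY, R_Y = R_X, R_X + R_V

K_s = R_XY @ np.linalg.inv(R_Y)                    # non-causal (smoothing) filter

# Apply it to one realization of (X, Y).
x = np.linalg.cholesky(R_X) @ rng.standard_normal(N)
y = x + np.sqrt(0.5) * rng.standard_normal(N)
x_hat = K_s @ y
print(np.mean((x - y) ** 2), np.mean((x - x_hat) ** 2))
```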
Note that this requires $R_Y \succ 0$. Suppose that we wanted to design a causal filter for the vector case, so that $\hat{X}_i$ only depends on $\{Y_j\}_{j=1}^{i}$. By the orthogonality principle,
$$\forall\, 1 \le l \le i,\quad E\left[\left(X_i - \sum_{j=1}^{i} K_{f,ij}Y_j\right)Y_l^*\right] = 0 \implies R_{XY}(i,l) = \sum_{j=1}^{i} K_{f,ij}R_Y(j,l)$$
In matrix form, this means
$$R_{XY} - K_fR_Y = U_+$$
where $U_+$ is strictly upper triangular and $K_f$ is lower triangular (causal).
If a matrix $H \succ 0$, then there exists a unique $LDL^*$ factorization $H = LDL^*$, where $L$ is lower triangular and invertible with unit diagonal entries and $D$ is diagonal with positive entries.
$$R_{XY} - K_fLDL^* = U_+ \implies R_{XY}L^{-*}D^{-1} - K_fL = U_+L^{-*}D^{-1} \implies \left[R_{XY}L^{-*}D^{-1}\right]_L - K_fL = 0$$
where $[\cdot]_L$ denotes the lower triangular part of a matrix. The last implication holds because $K_fL$ is lower triangular while $U_+L^{-*}D^{-1}$ is strictly upper triangular, so taking the lower triangular part of both sides eliminates it.
The causal Wiener filter for estimating $X$ from a length-$N$ signal $Y$ is given by
$$K_f = \left[R_{XY}L^{-*}D^{-1}\right]_L L^{-1}$$
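A sketch of the computation, using the same illustrative model as above and building the $LDL^*$ factors of $R_Y$ from its Cholesky factor; the final check confirms that $K_f$ is lower triangular, i.e., causal.

```python
import numpy as np

N = 8
idx = np.arange(N)
R_X = 0.9 ** np.abs(idx[:, None] - idx[None, :])   # assumed covariance of X
R_Y = R_X + 0.5 * np.eye(N)                        # Y = X + white noise
R_XY = R_X

C = np.linalg.cholesky(R_Y)                        # R_Y = C C^*
d = np.diag(C)
L = C / d                                          # unit-diagonal lower-triangular factor
D = np.diag(d ** 2)                                # so R_Y = L D L^*

inner = R_XY @ np.linalg.inv(L.conj().T) @ np.linalg.inv(D)
K_f = np.tril(inner) @ np.linalg.inv(L)            # [.]_L keeps the lower triangular part

print(np.allclose(K_f, np.tril(K_f)))              # True: each X_hat_i uses only Y_1..Y_i
```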
Suppose we have a Hidden Markov Process $\{Y_n\}_{n\ge 1}$. We can think of determining the states $\{X_n\}_{n\ge 1}$ as filtering $\{Y_n\}_{n\ge 1}$. Suppose we want to know the distribution of $X_t$ after we have observed $y^t = (y_1,\ldots,y_t)$.
$$\begin{aligned}
p(x_t\mid y^t) &= \frac{p(x_t, y^t)}{p(y^t)} = \frac{p(y_t, y^{t-1}\mid x_t)\,p(x_t)}{\sum_x p(y_t, y^{t-1}\mid x_t=x)\,p(x_t=x)} \\
&= \frac{p(y_t\mid x_t)\,p(y^{t-1}\mid x_t)\,p(x_t)}{\sum_x p(y_t\mid x_t=x)\,p(y^{t-1}\mid x_t=x)\,p(x_t=x)} \\
&= \frac{p(y_t\mid x_t)\,p(y^{t-1})\,p(x_t\mid y^{t-1})}{\sum_x p(y_t\mid x_t=x)\,p(y^{t-1})\,p(x_t=x\mid y^{t-1})} = \frac{p(y_t\mid x_t)\,p(x_t\mid y^{t-1})}{\sum_x p(y_t\mid x_t=x)\,p(x_t=x\mid y^{t-1})}
\end{aligned}$$
Now if we know $p(x_t\mid y^{t-1})$, then we are set.
$$\begin{aligned}
p(x_t\mid y^{t-1}) &= \sum_x p(x_t, x_{t-1}=x\mid y^{t-1}) = \sum_x p(x_{t-1}=x\mid y^{t-1})\,p(x_t\mid x_{t-1}=x, y^{t-1}) \\
&= \sum_x p(x_{t-1}=x\mid y^{t-1})\,p(x_t\mid x_{t-1}=x)
\end{aligned}$$
Now we have a recursive algorithm for computing the distribution of xt.
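A minimal sketch of this forward recursion for a finite-state HMM with a finite observation alphabet; the transition matrix, emission matrix, initial distribution, and observation sequence below are all hypothetical.

```python
import numpy as np

def hmm_filter(P, B, pi0, obs):
    """Forward (filtering) recursion: returns alpha[t, x] = p(x_t = x | y^t).

    P[i, j] = p(x_{t+1} = j | x_t = i)   (assumed transition matrix)
    B[i, k] = p(y_t = k | x_t = i)       (assumed emission matrix)
    pi0[i]  = p(x_1 = i),  obs = observed symbols y_1, ..., y_T
    """
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))
    pred = pi0                                   # p(x_1) before any observation
    for t, y in enumerate(obs):
        post = B[:, y] * pred                    # p(y_t | x_t) p(x_t | y^{t-1})
        alpha[t] = post / post.sum()             # normalize over x_t
        pred = alpha[t] @ P                      # p(x_{t+1} | y^t) = sum_x alpha_t(x) P[x, .]
    return alpha

# Tiny illustrative example (all numbers are made up).
P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
pi0 = np.array([0.5, 0.5])
print(hmm_filter(P, B, pi0, [0, 0, 1, 1]))
```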
Suppose we are allowed to non-causally filter our signal, and we care about the distribution of $X_t$ after we have observed all of $y^n$. In other words, for $t \le n$, we want to find $\gamma_t(x_t) = p(x_t\mid y^n)$. When $t = n$, $\gamma_n(x_n) = \alpha_n(x_n)$, where $\alpha_t(x_t) = p(x_t\mid y^t)$ is the filtered distribution computed above. If we continue expanding backwards, then
$$\begin{aligned}
p(x_t\mid y^n) &= \sum_x p(x_t, x_{t+1}=x\mid y^n) = \sum_x p(x_{t+1}=x\mid y^n)\,p(x_t\mid x_{t+1}=x, y^t, y_{t+1}^n) \\
&= \sum_x p(x_{t+1}=x\mid y^n)\,p(x_t\mid x_{t+1}=x, y^t) \\
&= \sum_x p(x_{t+1}=x\mid y^n)\,\frac{p(x_t\mid y^t)\,p(x_{t+1}=x\mid x_t, y^t)}{p(x_{t+1}=x\mid y^t)} = \sum_x \gamma_{t+1}(x)\,\frac{\alpha_t(x_t)\,p(x_{t+1}=x\mid x_t)}{\beta_{t+1}(x)}
\end{aligned}$$
where $\beta_{t+1}(x) = p(x_{t+1}=x\mid y^t)$ is the one-step prediction from the forward recursion.
This gives us a clear algorithm for non-causally computing the distribution of xt.
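Continuing the sketch above, here is a hedged implementation of this backward pass; it recomputes the forward quantities $\alpha_t$ and the one-step predictions $\beta_{t+1}$ so that the snippet is self-contained, and the model matrices are again made up.

```python
import numpy as np

def hmm_smoother(P, B, pi0, obs):
    """gamma[t, x] = p(x_t = x | y^n) via
    gamma_t(x_t) = alpha_t(x_t) * sum_x gamma_{t+1}(x) p(x_{t+1}=x | x_t) / beta_{t+1}(x)."""
    T, S = len(obs), len(pi0)
    alpha = np.zeros((T, S))                 # alpha[t] = p(x_t | y^t)
    beta = np.zeros((T + 1, S))              # beta[t]  = p(x_{t+1} | y^t) (one-step prediction)
    beta[0] = pi0
    for t, y in enumerate(obs):              # forward pass
        post = B[:, y] * beta[t]
        alpha[t] = post / post.sum()
        beta[t + 1] = alpha[t] @ P
    gamma = np.zeros((T, S))
    gamma[-1] = alpha[-1]                    # gamma_n = alpha_n
    for t in range(T - 2, -1, -1):           # backward pass
        gamma[t] = alpha[t] * (P @ (gamma[t + 1] / beta[t + 1]))
    return gamma

P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(hmm_smoother(P, B, np.array([0.5, 0.5]), [0, 0, 1, 1]))
```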
Suppose we now want the MAP estimate of the entire state sequence given the observations:
$$\hat{X}^n = \arg\max_{x^n} p(x^n\mid y^n).$$
Consider the joint distribution of the states and observations.
$$\begin{aligned}
p(x^t, y^t) &= p(x^{t-1}, y^{t-1})\,p(x_t, y_t\mid x^{t-1}, y^{t-1}) = p(x^{t-1}, y^{t-1})\,p(x_t\mid x^{t-1}, y^{t-1})\,p(y_t\mid x_t, x^{t-1}, y^{t-1}) \\
&= p(x^{t-1}, y^{t-1})\,p(x_t\mid x_{t-1})\,p(y_t\mid x_t)
\end{aligned}$$
We see that there is a recursion in the joint distribution, so if we let $V_t(x_t) = \max_{x^{t-1}} p(x^t, y^t)$, then
$$\begin{aligned}
V_t(x_t) &= \max_{x^{t-1}} p(x^t, y^t) = p(y_t\mid x_t)\max_{x^{t-1}} p(x^{t-1}, y^{t-1})\,p(x_t\mid x_{t-1}) \\
&= p(y_t\mid x_t)\max_{x_{t-1}}\left[p(x_t\mid x_{t-1})\max_{x^{t-2}} p(x^{t-1}, y^{t-1})\right] = p(y_t\mid x_t)\max_{x_{t-1}} p(x_t\mid x_{t-1})\,V_{t-1}(x_{t-1})
\end{aligned}$$
The base case is $V_1(x_1) = p(x_1)\,p(y_1\mid x_1)$. $V_t$ is useful because $\hat{x}_n = \arg\max_{x_n} V_n(x_n)$: since $p(x^n\mid y^n) \propto p(x^n, y^n)$, we can maximize over $x^{n-1}$ first, leaving only $x_n$ to maximize over. Once we have $\hat{x}_t$, we can compute $\hat{x}_{t-1}$ by
$$\hat{x}_{t-1} = \arg\max_{x_{t-1}} p(\hat{x}_t\mid x_{t-1})\,V_{t-1}(x_{t-1}).$$
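A sketch of the resulting Viterbi-style recursion, using the same hypothetical model as before; for long sequences one would work with logarithms to avoid numerical underflow.

```python
import numpy as np

def viterbi(P, B, pi0, obs):
    """MAP state sequence via the V_t recursion and backtracking (a sketch)."""
    T, S = len(obs), len(pi0)
    V = np.zeros((T, S))
    V[0] = pi0 * B[:, obs[0]]                          # V_1(x_1) = p(x_1) p(y_1 | x_1)
    for t in range(1, T):
        # V_t(x_t) = p(y_t | x_t) * max_{x_{t-1}} p(x_t | x_{t-1}) V_{t-1}(x_{t-1})
        V[t] = B[:, obs[t]] * np.max(V[t - 1][:, None] * P, axis=0)
    path = np.zeros(T, dtype=int)
    path[-1] = np.argmax(V[-1])                        # x_hat_n = argmax_{x_n} V_n(x_n)
    for t in range(T - 2, -1, -1):
        # x_hat_t = argmax_{x_t} p(x_hat_{t+1} | x_t) V_t(x_t)
        path[t] = np.argmax(P[:, path[t + 1]] * V[t])
    return path

P = np.array([[0.9, 0.1], [0.2, 0.8]])
B = np.array([[0.7, 0.3], [0.1, 0.9]])
print(viterbi(P, B, np.array([0.5, 0.5]), [0, 0, 1, 1]))
```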
In the Kalman Filter setup, we assume that the signal we would like to filter can be represented by a state-space model. We want to estimate the state vectors $X_i$ using linear combinations of the observations $Y_i$.
Suppose that we want to compute the one-step prediction; in other words, given the observations up to time $i$, we want to form $\hat{X}_{i+1\mid i}$. Our observations $Y$ are the only thing which give us information about the state, so it would be nice if we could de-correlate all of the $Y_i$. To do this, we can define the innovation process
$$e_i = Y_i - \hat{Y}_{i\mid i-1} = Y_i - H_i\hat{X}_{i\mid i-1}$$
The last equality follows from the state-space model and the fact that past observation noises are uncorrelated with the current one. Now, to compute the one-step prediction, we just need to project $X_{i+1}$ onto the innovations.
$$\begin{aligned}
\hat{X}_{i+1\mid i} &= \sum_{j=0}^{i}\langle X_{i+1}, e_j\rangle R_{e,j}^{-1}e_j = \langle X_{i+1}, e_i\rangle R_{e,i}^{-1}e_i + \sum_{j=0}^{i-1}\langle X_{i+1}, e_j\rangle R_{e,j}^{-1}e_j \\
&= \langle X_{i+1}, e_i\rangle R_{e,i}^{-1}e_i + \hat{X}_{i+1\mid i-1} = \langle X_{i+1}, e_i\rangle R_{e,i}^{-1}e_i + F_i\hat{X}_{i\mid i-1}
\end{aligned}$$
The second-to-last equality follows from the Wide-Sense Markovity of state-space models, and the last equality is due to the state evolution noise being uncorrelated with past observations. If we let $K_{p,i} = \langle X_{i+1}, e_i\rangle R_{e,i}^{-1}$ (called the prediction gain), then we have a recursive estimate of the optimal one-step predictor.
$$\hat{X}_{i+1\mid i} = F_i\hat{X}_{i\mid i-1} + K_{p,i}e_i.$$
Now, we just need to find recursive formulations for $K_{p,i}$ and $R_{e,i}$. Starting with $R_{e,i}$, notice that we can write $e_i = Y_i - H_i\hat{X}_{i\mid i-1} = H_i(X_i - \hat{X}_{i\mid i-1}) + V_i$.
$$R_{e,i} = \langle H_i(X_i - \hat{X}_{i\mid i-1}) + V_i,\ H_i(X_i - \hat{X}_{i\mid i-1}) + V_i\rangle = H_i\langle X_i - \hat{X}_{i\mid i-1},\ X_i - \hat{X}_{i\mid i-1}\rangle H_i^* + R_i$$
To find $K_{p,i}$, we should first find $\langle X_{i+1}, e_i\rangle$.
$$\begin{aligned}
\langle X_{i+1}, e_i\rangle &= F_i\langle X_i, e_i\rangle + G_i\langle U_i, e_i\rangle = F_i\langle X_i,\ H_i(X_i - \hat{X}_{i\mid i-1}) + V_i\rangle + G_i\langle U_i,\ H_i(X_i - \hat{X}_{i\mid i-1}) + V_i\rangle \\
&= F_i\langle X_i,\ X_i - \hat{X}_{i\mid i-1}\rangle H_i^* + G_iS_i = F_i\langle (X_i - \hat{X}_{i\mid i-1}) + \hat{X}_{i\mid i-1},\ X_i - \hat{X}_{i\mid i-1}\rangle H_i^* + G_iS_i \\
&= F_i\langle X_i - \hat{X}_{i\mid i-1},\ X_i - \hat{X}_{i\mid i-1}\rangle H_i^* + G_iS_i
\end{aligned}$$
Notice that the matrix $P_i = \langle X_i - \hat{X}_{i\mid i-1},\ X_i - \hat{X}_{i\mid i-1}\rangle$ is the auto-correlation of the estimation error, and it shows up in both $K_{p,i}$ and $R_{e,i}$. It would be useful to have a recursive solution for this matrix as well.
$$\begin{aligned}
P_{i+1} &= \Pi_{i+1} - \langle\hat{X}_{i+1\mid i}, \hat{X}_{i+1\mid i}\rangle = F_i\Pi_iF_i^* + G_iQ_iG_i^* - \langle F_i\hat{X}_{i\mid i-1} + K_{p,i}e_i,\ F_i\hat{X}_{i\mid i-1} + K_{p,i}e_i\rangle \\
&= F_i\Pi_iF_i^* + G_iQ_iG_i^* - \left(F_i\langle\hat{X}_{i\mid i-1}, \hat{X}_{i\mid i-1}\rangle F_i^* + K_{p,i}R_{e,i}K_{p,i}^*\right) \\
&= F_iP_iF_i^* + G_iQ_iG_i^* - K_{p,i}R_{e,i}K_{p,i}^*
\end{aligned}$$
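Collecting the recursions for $e_i$, $R_{e,i}$, $K_{p,i}$, $\hat{X}_{i+1\mid i}$, and $P_{i+1}$, here is a minimal sketch of the one-step predictor. The function and argument names are my own, the model is taken to be time-invariant, and $S = \langle U_i, V_i\rangle$ defaults to zero.

```python
import numpy as np

def kalman_predictor(F, G, H, Q, R, x0_pred, P0, ys, S=None):
    """One-step predictive Kalman filter (sketch of the recursions above).

    Returns the predictions x_hat_{i+1|i} and the error matrices P_{i+1}.
    """
    S = np.zeros((G.shape[1], H.shape[0])) if S is None else S
    x_pred, P = x0_pred, P0
    preds, errs = [], []
    for y in ys:
        e = y - H @ x_pred                               # innovation e_i
        Re = H @ P @ H.T + R                             # R_{e,i} = H P_i H^* + R
        Kp = (F @ P @ H.T + G @ S) @ np.linalg.inv(Re)   # prediction gain K_{p,i}
        x_pred = F @ x_pred + Kp @ e                     # x_hat_{i+1|i}
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T   # Riccati recursion for P_{i+1}
        preds.append(x_pred)
        errs.append(P)
    return preds, errs

# Example: scalar random walk X_{i+1} = X_i + U_i observed in noise Y_i = X_i + V_i.
F = G = H = np.eye(1)
Q, R = 0.1 * np.eye(1), np.eye(1)
ys = [np.array([1.0]), np.array([1.2]), np.array([0.9])]
preds, _ = kalman_predictor(F, G, H, Q, R, np.zeros(1), np.eye(1), ys)
print(preds)
```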
The predictive Kalman filter goes directly from $\hat{X}_{i\mid i-1}$ to $\hat{X}_{i+1\mid i}$ without ever determining $\hat{X}_{i\mid i}$. The Schmidt modification of the Kalman filter separates the predictive Kalman filter into two steps, allowing us to estimate the current state.
Measurement Update: Find $\hat{X}_{i\mid i}$ given the latest observation $Y_i$ and $\hat{X}_{i\mid i-1}$.
State Evolution (Time) Update: Find $\hat{X}_{i+1\mid i}$ using what we know about the state evolution.
$$\begin{aligned}
\hat{X}_{i\mid i} &= \sum_{j=0}^{i}\langle X_i, e_j\rangle R_{e,j}^{-1}e_j = \hat{X}_{i\mid i-1} + \langle X_i, e_i\rangle R_{e,i}^{-1}e_i \\
&= \hat{X}_{i\mid i-1} + \langle (X_i - \hat{X}_{i\mid i-1}) + \hat{X}_{i\mid i-1},\ H_i(X_i - \hat{X}_{i\mid i-1}) + V_i\rangle R_{e,i}^{-1}e_i = \hat{X}_{i\mid i-1} + P_iH_i^*R_{e,i}^{-1}e_i
\end{aligned}$$
The gain on the innovation, $K_{f,i} = P_iH_i^*R_{e,i}^{-1}$, is called the Kalman Filter Gain. The error of this estimator, $P_{i\mid i} = \langle X_i - \hat{X}_{i\mid i},\ X_i - \hat{X}_{i\mid i}\rangle$, is given by
$$P_{i\mid i} = P_i - P_iH_i^*R_{e,i}^{-1}H_iP_i.$$
$$\begin{aligned}
\hat{X}_{i+1\mid i} &= F_i\hat{X}_{i\mid i} + G_i\hat{U}_{i\mid i} = F_i\hat{X}_{i\mid i} + G_i\langle U_i, e_i\rangle R_{e,i}^{-1}e_i \\
&= F_i\hat{X}_{i\mid i} + G_i\langle U_i,\ H_iX_i + V_i - H_i\hat{X}_{i\mid i-1}\rangle R_{e,i}^{-1}e_i = F_i\hat{X}_{i\mid i} + G_iS_iR_{e,i}^{-1}e_i
\end{aligned}$$
We can rewrite the error of this estimator, $P_{i+1}$, as
$$P_{i+1} = F_iP_{i\mid i}F_i^* + G_i(Q_i - S_iR_{e,i}^{-1}S_i^*)G_i^* - F_iK_{f,i}S_i^*G_i^* - G_iS_iK_{f,i}^*F_i^*$$
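A sketch of this measurement/time-update form, assuming for simplicity that $S_i = 0$ and that the model is time-invariant (function and argument names are mine).

```python
import numpy as np

def kalman_schmidt(F, G, H, Q, R, x0_pred, P0, ys):
    """Measurement/time-update (Schmidt) form of the Kalman filter, assuming S_i = 0.

    Returns the filtered estimates x_hat_{i|i} and their errors P_{i|i}.
    """
    x_pred, P = x0_pred, P0
    filtered, errs = [], []
    for y in ys:
        # Measurement update: fold in the newest observation Y_i.
        e = y - H @ x_pred
        Re = H @ P @ H.T + R
        Kf = P @ H.T @ np.linalg.inv(Re)         # Kalman filter gain K_{f,i}
        x_filt = x_pred + Kf @ e                 # x_hat_{i|i}
        P_filt = P - Kf @ H @ P                  # P_{i|i} = P_i - P_i H^* Re^{-1} H P_i
        # Time update: push the estimate through the state evolution (S = 0).
        x_pred = F @ x_filt                      # x_hat_{i+1|i} = F x_hat_{i|i}
        P = F @ P_filt @ F.T + G @ Q @ G.T       # P_{i+1}
        filtered.append(x_filt)
        errs.append(P_filt)
    return filtered, errs
```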
The Kalman Prediction Filter and Schmidt's modification of the Kalman filter are both causal filters. The Kalman Smoother provides a non-causal estimate in the same way that the Backward Recursion algorithm does for Hidden Markov Processes. In other words, the Kalman Smoother computes $\hat{X}_{i\mid N}$, the best linear estimate of $X_i$ from $\{Y_0,\cdots,Y_N\}$. As before, we can start with the innovation process $e_j = Y_j - H_j\hat{X}_{j\mid j-1}$.
$$\hat{X}_{i\mid N} = \sum_{j=0}^{N}\langle X_i, e_j\rangle R_{e,j}^{-1}e_j = \hat{X}_{i\mid i-1} + \sum_{j=i}^{N}\langle X_i, e_j\rangle R_{e,j}^{-1}e_j.$$
We can compute $\hat{X}_{i\mid i-1}$ using the predictive Kalman filter, so we just need to compute the terms $\langle X_i, e_j\rangle R_{e,j}^{-1}e_j$ for $j \ge i$.
$$\langle X_i, e_j\rangle = \langle X_i,\ H_j(X_j - \hat{X}_{j\mid j-1}) + V_j\rangle = \langle X_i,\ H_j(X_j - \hat{X}_{j\mid j-1})\rangle = \langle \hat{X}_{i\mid i-1} + (X_i - \hat{X}_{i\mid i-1}),\ X_j - \hat{X}_{j\mid j-1}\rangle H_j^*$$
$X_j - \hat{X}_{j\mid j-1}$ is orthogonal to any linear function of $\{Y_0,\cdots,Y_{j-1}\}$, so when $j \ge i$, it must be orthogonal to $\hat{X}_{i\mid i-1}$ since that is a function of $\{Y_k\}_{k=0}^{i-1}$. Thus, for $j \ge i$,
$$\langle X_i, e_j\rangle = \langle X_i - \hat{X}_{i\mid i-1},\ X_j - \hat{X}_{j\mid j-1}\rangle H_j^*$$
If we denote $P_{ij} = \langle X_i - \hat{X}_{i\mid i-1},\ X_j - \hat{X}_{j\mid j-1}\rangle$, then
$$P_{ij} = P_i\Phi_p^*(j,i), \qquad \Phi_p(j,i) = \begin{cases} I & \text{if } j = i \\ \prod_{k=i}^{j-1} F_{p,k} & \text{if } j > i\end{cases}$$
where $F_{p,k} = F_k - K_{p,k}H_k$. This gives us the expression
$$\hat{X}_{i\mid N} = \hat{X}_{i\mid i-1} + P_i\sum_{j=i}^{N}\Phi_p^*(j,i)H_j^*R_{e,j}^{-1}e_j.$$
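Since $\Phi_p(j,i) = \Phi_p(j,i+1)F_{p,i}$, the sum above can be accumulated with the backward recursion $\lambda_i = H_i^*R_{e,i}^{-1}e_i + F_{p,i}^*\lambda_{i+1}$ (with $\lambda_{N+1} = 0$), so that $\hat{X}_{i\mid N} = \hat{X}_{i\mid i-1} + P_i\lambda_i$. Below is a sketch under the same simplifying assumptions as before (time-invariant model, $S = 0$, names mine): a forward predictive pass stores the innovations, and a backward pass applies the formula above.

```python
import numpy as np

def kalman_smoother(F, G, H, Q, R, x0_pred, P0, ys):
    """Fixed-interval smoother sketch: x_hat_{i|N} = x_hat_{i|i-1} + P_i * lambda_i,
    with lambda_i = H^* R_{e,i}^{-1} e_i + F_{p,i}^* lambda_{i+1} and lambda_{N+1} = 0."""
    x_pred, P = x0_pred, P0
    hist = []
    for y in ys:                                   # forward pass: predictive Kalman filter
        e = y - H @ x_pred
        Re = H @ P @ H.T + R
        Kp = F @ P @ H.T @ np.linalg.inv(Re)       # S = 0 assumed
        hist.append((x_pred, P, e, Re, F - Kp @ H))   # x_hat_{i|i-1}, P_i, e_i, R_{e,i}, F_{p,i}
        x_pred = F @ x_pred + Kp @ e
        P = F @ P @ F.T + G @ Q @ G.T - Kp @ Re @ Kp.T
    lam = np.zeros(F.shape[0])                     # lambda_{N+1} = 0
    smoothed = [None] * len(ys)
    for i in range(len(ys) - 1, -1, -1):           # backward pass accumulates lambda_i
        x_p, P_i, e, Re, Fp = hist[i]
        lam = H.T @ np.linalg.solve(Re, e) + Fp.T @ lam
        smoothed[i] = x_p + P_i @ lam              # x_hat_{i|N}
    return smoothed
```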