Whereas hypothesis testing is about discriminating between two or more hypotheses, estimation is about guessing the numerical value of or ground truth of a random variable.
In order to measure the quality of our estimation, we need a metric to measure error. One commonly used error is the mean squared error
E[(X−X^(Y))2].
Theorem 45
Linear Estimation
Since finding the MMSE is difficult, we can restrict ourselves to funtions of a particular type.
Definition 86
By the orthogonality principle,
Solving this system yields
Theorem 46
Definition 87
Theorem 47
Kalman Filtering
Definition 88
Theorem 48
Kalman filtering is a simple algorithm which lets us do online, optimal estimation. Variants of it can do things such as prediction or smoothing.
The minimum mean square estimate (MMSE) of a random variable X is given by the conditional expectation.
X^(Y)=E[X∣Y]=argminX^E[(X−X^(Y))2].
This essentially follows from the definition of conditional expectation since it is orthogonal to all other functions of Y, and so by the Hilbert Projection Theorem, it must be the projection of X onto the space of all functions of Y. There are two problems with using MMSE all the time.
We often don’t know pY∣X explicitly and only have a good model for it.
Even if we knew the model pY∣X, conditional expectations are difficult to compute.
The Linear Least Squares Estimator (LLSE) L[X∣Y] is the projection of a vector of random variables X onto the subspace of linear functions of observations Yi,U={a+BY} where Yis a vector of observations.
E[(X−L[X∣Y])1]=0⟹E[L[X∣Y]]=E[X]
E[(X−L[X∣Y])Yi]=0
From here, we can derive a closed form expression for the LLSE. Let μY=E[Y],μX=E[X],ΣY=E[(Y−μY)(Y−μY)T],ΣXY=E[(X−μX)(Y−μY)T]. By substituting L[X∣Y]=a+BY into the equations we found from the orthogonality principle,
The Linear Least Squares Estimator for vector of random variables X given a vector of random variables Y is
L[X∣Y]=μX+ΣXYΣY−1(Y−μY)
If X and Y are both a single random variable, this reduces to
L[X∣Y]=μX+Var(Y)Cov(X,Y)(Y−μY)
Since LLSE is essentially projection onto a Linear Subspace, if we have an orthogonal basis for the subspace, then we can do the projection onto the subspace one component at a time. The Gram-Schmidt Process turns vectors Y1,⋯,Yn into an orthonormal set Y~1,⋯,Y~n. If we define Y(n)=(Y1,⋯,Yn),