
Linear Algebraic Optimization

Many optimization problems can be solved using the machinery of Linear Algebra. These problems do not have inequality constraints or non-Euclidean norms in the objective function.

Projection

The idea behind projection is to find the point in a set that is closest (with respect to a particular norm) to a given point.

Definition 28

Given a vector $\mathbf{x}$ in an inner product space $\mathcal{X}$ and a subspace $S\subseteq\mathcal{X}$, the projection of $\mathbf{x}$ onto $S$ is given by

$$\Pi_S(\mathbf{x}) = \text{argmin}_{\mathbf{y}\in S}\|\mathbf{y}-\mathbf{x}\|$$

where the norm is the one induced by the inner product.

Theorem 9

There exists a unique vector $\mathbf{x}^*\in S$ which solves

$$\min_{\mathbf{y}\in S} \|\mathbf{y}-\mathbf{x}\|.$$

It is necessary and sufficient for $\mathbf{x}^*$ to be optimal that $(\mathbf{x}-\mathbf{x}^*)\perp S$. The same condition applies when projecting onto an affine set.
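
The orthogonality condition gives a direct way to compute and verify a projection numerically. Below is a minimal numpy sketch, with a made-up matrix `A` whose columns span $S$ and an arbitrary point `x`, that computes $\Pi_S(\mathbf{x})$ via least squares and checks that the residual is orthogonal to $S$.

```python
import numpy as np

# Hypothetical subspace S = span of the columns of A, and a point x to project.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])

# Writing y = A c, the projection solves min_c ||A c - x||_2.
c_star, *_ = np.linalg.lstsq(A, x, rcond=None)
x_star = A @ c_star                 # Pi_S(x)

# Theorem 9: the residual x - x* must be orthogonal to S,
# i.e. orthogonal to every column of A.
print(A.T @ (x - x_star))           # ~ [0, 0] up to floating-point error
```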

Matrix Pseudo-inverses

Definition 29

A pseudo-inverse of a matrix $A$ is a matrix $A^{\dagger}$ that satisfies:

$$A A^\dagger A = A \quad A^\dagger A A^\dagger = A^\dagger \quad (AA^\dagger)^T = A A^\dagger \quad (A^\dagger A)^T = A^\dagger A$$
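
As a quick sanity check, the four conditions can be verified numerically with `numpy.linalg.pinv`; the matrix below is an arbitrary example chosen for illustration.

```python
import numpy as np

# Arbitrary example matrix.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])
Ad = np.linalg.pinv(A)   # Moore-Penrose pseudo-inverse

print(np.allclose(A @ Ad @ A, A))        # A A^+ A = A
print(np.allclose(Ad @ A @ Ad, Ad))      # A^+ A A^+ = A^+
print(np.allclose((A @ Ad).T, A @ Ad))   # (A A^+)^T = A A^+
print(np.allclose((Ad @ A).T, Ad @ A))   # (A^+ A)^T = A^+ A
```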
There are several special cases of pseudo-inverses (cases 1 and 3 are checked numerically in the sketch after this list).
  1. $A^\dagger = V_r \text{diag}\left(\frac{1}{\sigma_1},\cdots,\frac{1}{\sigma_r}\right)U_r^T$ is the Moore-Penrose pseudo-inverse.
  2. When $A$ is square and non-singular, $A^\dagger = A^{-1}$.
  3. When $A$ has full column rank, $A^\dagger = (A^TA)^{-1}A^T$.
  4. When $A$ has full row rank, $A^{\dagger} = A^T(AA^T)^{-1}$.
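
A short sketch, using an arbitrary full-column-rank matrix, showing that the SVD construction in case 1 and the full-column-rank formula in case 3 agree with `np.linalg.pinv`.

```python
import numpy as np

# Arbitrary full-column-rank matrix for illustration.
A = np.array([[1.0, 2.0],
              [3.0, 4.0],
              [5.0, 6.0]])

# Case 1: Moore-Penrose pseudo-inverse from the compact SVD, V_r diag(1/sigma_i) U_r^T.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
A_dag_svd = Vt.T @ np.diag(1.0 / s) @ U.T

# Case 3: full column rank, (A^T A)^{-1} A^T.
A_dag_cr = np.linalg.inv(A.T @ A) @ A.T

print(np.allclose(A_dag_svd, np.linalg.pinv(A)))  # True
print(np.allclose(A_dag_cr, np.linalg.pinv(A)))   # True
```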
Pseudo-inverses are useful because they make it easy to compute the projection of a vector onto the fundamental subspaces associated with $A$ (see the sketch after this list).
  1. $\text{argmin}_{\mathbf{z}\in\mathcal{R}(A)}\|\mathbf{z}-\mathbf{y}\|_2 = AA^\dagger \mathbf{y}$
  2. $\text{argmin}_{\mathbf{z}\in\mathcal{R}(A)^\perp}\|\mathbf{z}-\mathbf{y}\|_2 = (I - AA^\dagger)\mathbf{y}$
  3. $\text{argmin}_{\mathbf{z}\in\mathcal{N}(A)}\|\mathbf{z}-\mathbf{y}\|_2 = (I - A^\dagger A)\mathbf{y}$
  4. $\text{argmin}_{\mathbf{z}\in\mathcal{N}(A)^\perp}\|\mathbf{z}-\mathbf{y}\|_2 = A^\dagger A\mathbf{y}$
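
The sketch below applies these four formulas to an arbitrary rank-deficient square matrix, so that $\mathcal{R}(A)$ and $\mathcal{N}(A)$ are both proper subspaces of the same space, and checks that complementary projections add back up to the original vector.

```python
import numpy as np

# Hypothetical rank-2 matrix in R^{3x3}; the third row/column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0]])
y = np.array([1.0, 2.0, 3.0])

A_dag = np.linalg.pinv(A)
I = np.eye(3)

proj_R      = A @ A_dag @ y         # onto R(A)
proj_R_perp = (I - A @ A_dag) @ y   # onto R(A)^perp
proj_N      = (I - A_dag @ A) @ y   # onto N(A)
proj_N_perp = A_dag @ A @ y         # onto N(A)^perp

# Complementary projections decompose y.
print(np.allclose(proj_R + proj_R_perp, y))   # True
print(np.allclose(proj_N + proj_N_perp, y))   # True
```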

Explained Variance

The Low Rank Approximation problem is to approximate a matrix $A$ with a rank-$k$ matrix $A_k$:

$$\min_{A_k} \|A - A_k\|_F^2 \text{ such that } \text{rank}(A_k) = k.$$

The solution to the low rank approximation problem is simply the first $k$ terms of the SVD:

$$A_k^\star = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}^T_i.$$

This is because the singular values give us a notion of how much of the Frobenius norm (total variance) each dyad explains:

$$\eta = \frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^r \sigma_i^2}$$
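
A minimal sketch, with a random matrix standing in for $A$, of truncating the SVD to rank $k$ and computing the explained-variance ratio $\eta$.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))    # hypothetical data matrix
k = 2

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Best rank-k approximation: keep the first k dyads sigma_i u_i v_i^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Explained variance: fraction of the squared Frobenius norm captured by A_k.
eta = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(eta)
print(np.linalg.norm(A_k, "fro") ** 2 / np.linalg.norm(A, "fro") ** 2)  # same value
```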

PCA

Suppose we had a matrix $X$ containing $m$ data points in $\mathbb{R}^n$ (each data point is a column), and, without loss of generality, assume this data is centered around 0 (i.e., $\sum_i \mathbf{x}_i = 0$). The variance of this data along a particular direction $\mathbf{z}$ is given by $\mathbf{z}^TXX^T\mathbf{z}$. Principal Component Analysis finds the directions $\mathbf{z}$ along which the variance is maximized:

$$\max_{\mathbf{z}\in\mathbb{R}^n} \mathbf{z}^TXX^T\mathbf{z} \text{ such that } \|\mathbf{z}\|_2 = 1$$

The left singular vector of $X$ corresponding to its largest singular value $\sigma_1$ solves this problem, and the variance along this direction is $\sigma_1^2$. Subsequent directions of maximal variance are the left singular vectors corresponding to the next largest singular values.
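
A minimal PCA sketch on random data (columns as data points, as in the setup above): center the data, take the SVD of $X$, and read off the top direction and its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 100))       # hypothetical: 100 points in R^3, one per column
X = X - X.mean(axis=1, keepdims=True)   # center so that sum_i x_i = 0

# Left singular vectors of X are the principal directions;
# the variance along u_i is sigma_i^2.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
u1 = U[:, 0]

print(s[0] ** 2)            # sigma_1^2
print(u1 @ X @ X.T @ u1)    # variance along u1 -- matches sigma_1^2
```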

Removing Constraints

Following from the Fundamental Theorem of Linear Algebra, if $A\mathbf{x}=\mathbf{y}$ has a solution, then the set of solutions can be expressed as

$$S = \{\bar{\mathbf{x}} + N\mathbf{z}\}$$

where $A\bar{\mathbf{x}}=\mathbf{y}$ and the columns of $N$ form a basis for $\mathcal{N}(A)$. This means that if we have a constrained optimization problem

$$\min_\mathbf{x} f_0(\mathbf{x}) \ : \ A\mathbf{x} = \mathbf{b},$$

we can write an equivalent unconstrained problem

$$\min_\mathbf{z} f_0(\mathbf{x}_0 + N\mathbf{z})$$

where $A\mathbf{x}_0 = \mathbf{b}$.
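
A sketch of this elimination on a made-up problem: minimize $\|\mathbf{x}\|_2^2$ subject to $A\mathbf{x} = \mathbf{b}$, with the particular solution taken from the pseudo-inverse and the null-space basis $N$ read off the SVD of $A$.

```python
import numpy as np

# Hypothetical constrained problem: min ||x||_2^2 such that A x = b.
A = np.array([[1.0, 1.0, 1.0]])
b = np.array([3.0])

# Particular solution A x0 = b and a basis N for N(A).
x0 = np.linalg.pinv(A) @ b
_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-12))
N = Vt[rank:].T                       # columns span the null space of A

# Unconstrained problem in z: min_z f0(x0 + N z) = ||x0 + N z||_2^2,
# which for this quadratic is the least-squares problem N z = -x0.
z_star, *_ = np.linalg.lstsq(N, -x0, rcond=None)
x_star = x0 + N @ z_star

print(x_star)        # [1, 1, 1]
print(A @ x_star)    # [3], so the constraint holds
```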