# Linear Algebraic Optimization

Many optimization problems can be solved using the machinery of Linear Algebra. These problems do not have inequality constraints or non-Euclidean norms in the objective function.

## Projection

The idea behind projection is to find the point in a set that is closest (with respect to a particular norm) to a given point.

### Definition 28

Given a vector $\mathbf{x}$ in an inner product space $\mathcal{X}$ and a subspace $S\subseteq\mathcal{X}$, the projection of $\mathbf{x}$ onto $S$ is given by

$\Pi_S(\mathbf{x}) = \text{argmin}_{\mathbf{y}\in S}\|\mathbf{y}-\mathbf{x}\|$

where the norm is the one induced by the inner product.

### Theorem 9

There exists a unique vector $\mathbf{x}^*\in S$ which solves

$\min_{\mathbf{y}\in S} \|\mathbf{y}-\mathbf{x}\|.$

It is necessary and sufficient for $\mathbf{x}^*$ to be optimal that $(\mathbf{x}-\mathbf{x}^*)\perp S$. The same condition applies when projecting onto an affine set.
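
The orthogonality condition also gives a practical recipe for computing projections. Here is a minimal numpy sketch (the matrix `A` and point `x` are illustrative, not from the notes) that projects a point onto the span of a matrix's columns via least squares and verifies that the residual is orthogonal to the subspace.

```python
import numpy as np

# Illustrative subspace S = range(A) and a point x to project onto it.
A = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
x = np.array([1.0, 2.0, 3.0])

# Pi_S(x) = A w* where w* solves min_w ||A w - x||_2.
w, *_ = np.linalg.lstsq(A, x, rcond=None)
x_star = A @ w

# Optimality check: (x - x*) must be orthogonal to S,
# i.e., orthogonal to every column of A.
print(A.T @ (x - x_star))  # ~[0, 0] up to floating-point error
```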

### Definition 29

A pseudoinverse of a matrix $A$ is a matrix $A^{\dagger}$ that satisfies:

$A A^\dagger A = A \quad A^\dagger A A^\dagger = A^\dagger \quad (AA^\dagger)^T = A A^\dagger \quad (A^\dagger A)^T = A^\dagger A$

There are several special cases of pseudoinverses (a numerical check follows this list).

1. $A^\dagger = V_r \text{diag}\left(\frac{1}{\sigma_1},\cdots,\frac{1}{\sigma_r}\right)U_r^T$ is the Moore-Penrose pseudoinverse, built from the compact SVD $A = U_r\Sigma_rV_r^T$.
2. When $A$ is square and non-singular, $A^\dagger = A^{-1}$.
3. When $A$ has full column rank, $A^\dagger = (A^TA)^{-1}A^T$.
4. When $A$ has full row rank, $A^{\dagger} = A^T(AA^T)^{-1}$.
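
As a sanity check on the definition, the following numpy sketch (the random matrix is purely illustrative) verifies the four conditions for `np.linalg.pinv` and confirms the full-column-rank formula agrees with it.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 3))  # full column rank with probability 1
A_dag = np.linalg.pinv(A)        # Moore-Penrose pseudoinverse

# The four defining conditions of a pseudoinverse.
assert np.allclose(A @ A_dag @ A, A)
assert np.allclose(A_dag @ A @ A_dag, A_dag)
assert np.allclose((A @ A_dag).T, A @ A_dag)
assert np.allclose((A_dag @ A).T, A_dag @ A)

# Special case 3: full column rank, so A^dagger = (A^T A)^{-1} A^T.
assert np.allclose(np.linalg.solve(A.T @ A, A.T), A_dag)
```
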
Pseudoinverses are useful because they make it easy to compute the projection of a vector onto the fundamental subspaces associated with $A$ (see the sketch after this list).

1. $\text{argmin}_{\mathbf{z}\in\mathcal{R}(A)}\|\mathbf{z}-\mathbf{y}\|_2 = AA^\dagger \mathbf{y}$
2. $\text{argmin}_{\mathbf{z}\in\mathcal{R}(A)^\perp}\|\mathbf{z}-\mathbf{y}\|_2 = (I - AA^\dagger)\mathbf{y}$
3. $\text{argmin}_{\mathbf{z}\in\mathcal{N}(A)}\|\mathbf{z}-\mathbf{y}\|_2 = (I - A^\dagger A)\mathbf{y}$
4. $\text{argmin}_{\mathbf{z}\in\mathcal{N}(A)^\perp}\|\mathbf{z}-\mathbf{y}\|_2 = A^\dagger A\mathbf{y}$
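
A short numpy sketch of these formulas (the matrices are illustrative): it forms the projectors from the pseudoinverse, checks that a vector projected onto $\mathcal{N}(A)$ is annihilated by $A$, and that complementary projectors sum to the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 6))  # wide matrix, so N(A) is nontrivial
y = rng.standard_normal(6)
A_dag = np.linalg.pinv(A)

P_null = np.eye(6) - A_dag @ A   # projector onto N(A)
P_row = A_dag @ A                # projector onto N(A)^perp

z = P_null @ y                   # closest point to y in N(A)
print(np.linalg.norm(A @ z))     # ~0, since z lies in the null space

# Complementary projectors sum to the identity.
assert np.allclose(P_null + P_row, np.eye(6))
```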

## Explained Variance

The Low Rank Approximation problem is to approximate a matrix $A$ with a rank $k$ matrix:

$\min_{A_k} \|A - A_k\|_F^2 \text{ such that rank}(A_k) = k.$

The solution to the low rank approximation problem is simply the first $k$ terms of the SVD:

$A_k^\star = \sum_{i=1}^k \sigma_i\mathbf{u}_i\mathbf{v}^T_i.$

This is because the singular values give us a notion of how much of the Frobenius Norm (Total Variance) each dyad explains:

$\eta = \frac{\|A_k\|_F^2}{\|A\|_F^2} = \frac{\sum_{i=1}^k \sigma_i^2}{\sum_{i=1}^r \sigma_i^2}$
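
A minimal numpy sketch of the truncated SVD (the matrix and $k$ are illustrative): it assembles $A_k^\star$ from the top $k$ dyads and checks the explained-variance ratio against the singular values.

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((8, 5))
k = 2

# Thin SVD with singular values in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Rank-k approximation: sum of the first k dyads sigma_i u_i v_i^T.
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Explained variance eta: both expressions agree.
eta = np.sum(s[:k] ** 2) / np.sum(s ** 2)
print(eta)
print(np.linalg.norm(A_k, "fro") ** 2 / np.linalg.norm(A, "fro") ** 2)
```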

### PCA

Suppose we had a matrix $X$ containing $m$ data points in $\mathbb{R}^n$ (each data point is a column), and without loss of generality, assume this data is centered around 0 (i.e., $\sum_i \mathbf{x}_i = 0$). The variance of this data along a particular unit direction $\mathbf{z}$ is given by $\mathbf{z}^TXX^T\mathbf{z}$. Principal Component Analysis finds the directions $\mathbf{z}$ along which the variance is maximized:

$\max_{\mathbf{z}\in\mathbb{R}^n} \mathbf{z}^TXX^T\mathbf{z} \text{ such that } \|\mathbf{z}\|_2 = 1$

The optimizer of this problem is the left singular vector of $X$ corresponding to its largest singular value $\sigma_1$ (equivalently, the top eigenvector of $XX^T$), and the variance along this direction is $\sigma_1^2$. Subsequent directions of maximal variance are the left singular vectors corresponding to the next largest singular values.
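
As a quick illustration (the data here is synthetic), the numpy sketch below recovers the principal directions from the SVD of the centered data matrix and confirms that the variance along the top direction is $\sigma_1^2$.

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 100))      # n = 3 dimensions, m = 100 points as columns
X = X - X.mean(axis=1, keepdims=True)  # center so that sum_i x_i = 0

U, s, Vt = np.linalg.svd(X, full_matrices=False)
z1 = U[:, 0]                           # direction of maximal variance

print(z1 @ X @ X.T @ z1)               # variance along z1 ...
print(s[0] ** 2)                       # ... equals sigma_1^2
```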

## Removing Constraints

Following from the Fundamental Theorem of Linear Algebra, if $A\mathbf{x}=\mathbf{y}$ has a solution, then the set of solutions can be expressed as

$S = \{\bar{\mathbf{x}} + N\mathbf{z}\}$

where $A\bar{\mathbf{x}}=\mathbf{y}$ and the columns of $N$ form a basis for $\mathcal{N}(A)$. This means that if we have a constrained optimization problem

$\min_\mathbf{x} f_0(\mathbf{x}) \ : \ A\mathbf{x} = \mathbf{b},$

we can write an equivalent unconstrained problem

$\min_\mathbf{z} f_0(\mathbf{x}_0 + N\mathbf{z})$

where $\mathbf{x}_0$ is any particular solution satisfying $A\mathbf{x}_0 = \mathbf{b}$.
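
To make the reduction concrete, here is a minimal numpy sketch (the objective $f_0(\mathbf{x}) = \|\mathbf{x}-\mathbf{c}\|_2^2$ and all matrices are illustrative): it builds a particular solution and a null-space basis, then solves the unconstrained problem in $\mathbf{z}$.

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((2, 5))   # 2 equality constraints in R^5
b = rng.standard_normal(2)
c = rng.standard_normal(5)        # target for f0(x) = ||x - c||^2

# Particular solution x0 and a basis N for N(A) (assumes A has full row rank).
x0 = np.linalg.pinv(A) @ b
_, s, Vt = np.linalg.svd(A)
N = Vt[len(s):].T                 # remaining right singular vectors span N(A)

# Unconstrained problem: min_z ||x0 + N z - c||^2 is ordinary least squares.
z, *_ = np.linalg.lstsq(N, c - x0, rcond=None)
x_star = x0 + N @ z

print(np.linalg.norm(A @ x_star - b))  # ~0: the constraint holds by construction
```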