Linear Algebra

### Definition 1

An affine set is one of the form
$\mathcal{A}=\{ \mathbf{x}\in\mathcal{X}:\ \mathbf{x}=\mathbf{v}+\mathbf{x_0},\ \mathbf{v}\in\mathcal{V}\}$
where
$\mathcal{V}$
is a subspace of a vector space
$\mathcal{X}$
and
$x_0$
is a given point.
Notice that by Definition 1, a subspace is simply an affine set containing the origin. Also notice that the dimension of an affine set
$\mathcal{A}$
is the same as the dimension of
$\mathcal{V}$
.

# Norms

### Definition 2

A norm on the vector space
$\mathcal{X}$
is a function
$\|\cdot\|:\mathcal{X}\rightarrow\mathbb{R}$
which satisfies:
1. 1.
$\|\mathbf{x}\|\geq 0$
with equality if and only if
$\mathbf{x}=\boldsymbol{0}$
2. 2.
$\|\mathbf{x}+\mathbf{y}\|\leq\|\mathbf{x}\|+\|\mathbf{y}\|$
3. 3.
$\|\alpha \mathbf{x}\| = |\alpha|\|\mathbf{x}\|$
for any scalar
$\alpha$
.

### Definition 3

The
$l_p$
norms are defined by
$\|\mathbf{x}\|_p=\left( \sum_{k=1}^n|x_k|^p \right)^{\frac{1}{p}},\ 1\leq p\leq \infty$
In the limit as
$p\to\infty$
,
$\|\mathbf{x}\|_{\infty} = \max_k|x_k|.$
Similar to vectors, matrices can also have norms.

### Definition 4

A function
$f: \mathbb{R}^{m\times n} \to \mathbb{R}$
is a matrix norm if
$f(A) \geq 0 \quad f(A) = 0 \Leftrightarrow A = 0 \quad f(\alpha A) = |\alpha| f(A) \quad f(A+B) \leq f(A) + f(B)$

### Definition 5

The Froebenius norm is the
$l_2$
norm applied to all elements of the matrix.
$\|A\|_F = \sqrt{\text{trace} AA^T} = \sqrt{\sum_{i=1}^m \sum_{j=1}^n |a_{ij}|^2}$
One useful way to characterize matrices is by measuring their “gain” relative to some
$l_p$
norm.

### Definition 6

The operator norms is defined as
$\|A\|_p = \max_{\mathbf{u}\ne0} \frac{\|A\mathbf{u}\|_p}{\|u\|_p}$
When
$p=2$
, the norm is called the spectral norm because it relates to the largest eigenvalue of
$A^TA$
.
$\|A\|_2 = \sqrt{\lambda_{max}(A^TA)}$

# Inner Products

### Definition 7

An inner product on real vector space is a function that maps
$\mathbf{x},\mathbf{y} \in \mathcal{X}$
to a non-negative scalar, is distributive, is commutative, and
$\langle \mathbf{x}, \mathbf{x}, \rangle = 0 \Leftrightarrow \mathbf{x}=0$
.
Inner products induce a norm
$\|\mathbf{x}\| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$
. In
$\mathbb{R}^n$
, the standard inner product is
$\mathbf{x}^T\mathbf{y}$
. The angle bewteen two vectors is given by
$\cos\theta = \frac{\mathbf{x}^T\mathbf{y}}{\|\mathbf{x}\|_2\|\mathbf{y}\|_2}.$
In general, we can bound the absolute value of the standard inner product between two vectors.

### Theorem 1 (Holder Inequality)

$|\mathbf{x}^T\mathbf{y}| \leq \sum_{k=1}^n |x_ky_k| \leq \|\mathbf{x}\|_p\|\mathbf{y}\|_q,\ p, q\geq 1 \text{ s.t } p^{-1}+q^{-1}=1.$
Notice that for
$p=q = 2$
, Theorem 1 turns into the Cauchy-Schwartz Inequality (
$|\mathbf{x}^T\mathbf{y}| \leq \|\mathbf{x}\|_2\|\mathbf{y}\|_2$
).

# Functions

We consider functions to be of the form
$f:\mathbb{R}^n\rightarrow\mathbb{R}$
. By contrast, a map is of the form
$f:\mathbb{R}^n\rightarrow\mathbb{R}^m$
. The components of the map
$f$
are the scalar valued functions
$f_i$
that produce each component of a map.

### Definition 8

The graph of a function
$f$
is the set of input-output pairs that
$f$
can attain.
$\left\{ (x, f(x))\in \mathbb{R}^{n+1}:\ x\in\mathbb{R}^n \right\}$

### Definition 9

The epigraph of a function is the set of input-output pairs that
$f$
can achieve and anything above.
$\left\{ (x,t) \in \mathbb{R}^{n+1}:\ \mathbf{x}\in\mathbb{R}^{n+1},\ t\geq f(x) \right\}$

### Definition 10

The t-level set is the set of points that achieve exactly some value of
$f$
.
$\{ \mathbf{x}\in\mathbb{R}^n:\ f(x)=t \}$

### Definition 11

The t-sublevel set of
$f$
is the set of points achieving at most a value
$t$
.
$\{ x\in\mathbb{R}^n:\ f(x)\leq t \}$

### Definition 12

The half-spaces are the regions of space which a hyper-plane separates.
$H_{\_} = \{ x: \mathbf{a}^T\mathbf{x}\leq b \} \qquad H_{+} = \{ x: \mathbf{a}^T\mathbf{x} > b \}$

### Definition 13

A polyhedron is the intersection of
$m$
half-spaces given by
$\mathbf{a}_i^T\mathbf{x}\leq b_i$
for
$i\in[1,m]$
.
When a polyhedron is bounded, it is called a polytope.

## Types of Functions

### Theorem 2

A function is linear if and only if it can be expressed as
$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}+b$
for some unique pair
$(\mathbf{a}, b)$
.
An affine function is linear when
$b=0$
. A hyperplane is simply a level set of a linear function.

### Theorem 3

Any quadratic function can be written as the sum of a quadratic term involving a symmetric matrix and an affine term:
$q(x) = \frac{1}{2}\mathbf{x}^TH\mathbf{x}+\mathbf{c}^T\mathbf{x} + d.$
Another special class of functions are polyhedral functions.

### Definition 14

A function
$f:\mathbb{R}^n\to\mathbb{R}$
is polyhedral if its epigraph is a polyhedron.
$\text{epi } f = \left\{(x,t) \in \mathbb{R}^{n+1} :\ C \begin{bmatrix}\mathbf{x} \\ t \end{bmatrix} \leq d \right\}$

## Vector Calculus

We can also do calculus with vector functions.

### Definition 15

The gradient of a function at a point
$x$
where
$f$
is differentiable is a column vector of first derivatives of
$f$
with respsect to the components of
$\mathbf{x}$
$abla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x_1}\\ \vdots\\ \frac{\partial f}{\partial x_n} \end{bmatrix}$
The gradient is perpendicular to the level sets of
$f$
and points from a point
$\mathbf{x}_0$
to higher values of the function. In other words, it is the direction of steepest increase. It is akin to the derivative of a 1D function.

### Definition 16

The Hessian of a function
$f$
at point
$x$
is a matrix of second derivatives.
$H_{ij} = \frac{\partial^2 f}{\partial x_i \partial x_j}$
The Hessian is akin to the second derivative in a 1D function. Note that the Hessian is a symmetric matrix.

# Matrices

Matrices define a linear map between an input space and an output space. Any linear map
$f: \mathbb{R}^n \to \mathbb{R}^m$
can be represented by a matrix.

### Theorem 4 (Fundamental Theorem of Linear Algebra)

For any matrix
$A\in\mathbb{R}^{m\times n}$
,
$\mathcal{N}(A) \oplus \mathcal{R}(A^T) = \mathbb{R}^n \qquad \mathcal{R}(A) \oplus \mathcal{N}(A^T) = \mathbb{R}^m.$

## Symmetric Matrices

Recall that a symmetric matrix is one where
$A = A^T$
.

### Theorem 5 (Spectral Theorem)

Any symmetric matrix is orthogonally similar to a real diagonal matrix.
$A = A^T \implies A = U \Lambda U^T = \sum_i \lambda_i \mathbf{u}_i\mathbf{u}_i^T,\quad \|\mathbf{u}\| = 1, \quad \mathbf{u}_i^T\mathbf{u}_j = 0 \ (i \ne j)$
Let
$\lambda_{min}(A)$
be the smallest eigenvalue of symmetric matrix
$A$
and
$\lambda_{max}(A)$
be the largest eigenvalue.

### Definition 17

The Rayleigh Quotient for
$\mathbf{x} \ne \boldsymbol{0}$
is
$\frac{\mathbf{x}^TA\mathbf{x}}{\|\mathbf{x}\|^2}.$

### Theorem 6

For any
$\mathbf{x} \ne \boldsymbol{0}$
,
$\lambda_{min}(A) \leq \frac{\mathbf{x}^TA\mathbf{x}}{\|\mathbf{x}\|^2} \leq \lambda_{max}(A).$
Two special types of symmetric matrices are those with non-negative eigenvalues.

### Definition 18

A symmetric matrix is positive semi-definite if
$\mathbf{x}^TA\mathbf{x} \geq 0 \implies \lambda_{min}(A) \geq 0$
.

### Definition 19

A symmetric matrix is poitive definite if
$\mathbf{x}^TA\mathbf{x} > 0 \implies \lambda_{min}(A) > 0$
.
These matrices are important because they often have very clear geometric structures. For example, an ellipsoid in multi-dimensional space can be defined as the set of points
$\mathcal{E} = \{ x\in\mathbb{R}^m : \ \mathbf{x}^T P^{-1} \mathbf{x} \leq 1 \}$
where
$P$
is a positive definite matrix. The eigenvectors of
$P$
give the principle axes of this ellipse, and
$\sqrt{\lambda}$
are the semi-axis lengths.

## QR Factorization

Similar to how spectral theorem allows us to decompose symmetric matrices, QR factorization is another matrix decomposition technique that works for any general matrix.

### Definition 20

The QR factorization matrix are the orthogonal matrix Q and the upper triangular matrix R such that
$A = QR$
An easy way to find the QR factorization of a matrix is to apply Graham Schmidt to the columns of the matrix and express the result in matrix form. Suppose that our matrix
$A$
is full rank (i.e its columns
$\mathbf{a}_i$
are linearly independent) and we have applied Graham-Schmidt to columns
$\mathbf{a}_{i+1}\cdots\mathbf{a}_n$
to get orthogonal vectors
$\mathbf{q}_{i+1}\cdots\mathbf{q}_{n}$
. Continuing the procedure, the ith orthogonal vector
$\mathbf{q}_i$
is
$\mathbf{\tilde{q}}_i = \mathbf{a}_i - \sum_{k=i+1}^{n} (\mathbf{q}_k^T \mathbf{a}_k)\mathbf{q}_k \qquad \mathbf{q}_i = \frac{\mathbf{\tilde{q}}_i}{\|\mathbf{\tilde{q}}_i\|_2}.$
If we re-arrange this, to solve for
$\mathbf{a}_i$
, we see that
$\mathbf{a}_i = \|\mathbf{\tilde{q}}_i\|_2 \mathbf{q}_i + \sum_{k=i+1}^{n} (\mathbf{q}_k^T \mathbf{a}_k)\mathbf{q}_k.$
Putting this in matrix form, we can see that
$\begin{bmatrix} | & | & & | \\ \mathbf{a}_1 & \mathbf{a}_2 & \cdots & \mathbf{a}_{n}\\ | & | & & | \\ \end{bmatrix} = \begin{bmatrix} | & | & & | \\ \mathbf{q}_1 & \mathbf{q}_2 & \cdots & \mathbf{q}_{n}\\ | & | & & | \\ \end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & \cdots & r_{1n}\\ 0 & r_{22} & \cdots & r_{2n}\\ \vdots & \ddots & \ddots & \vdots\\ 0 & \cdots & 0 & r_{nn} \end{bmatrix} \qquad r_{ij} = \mathbf{a}_i^T\mathbf{q_j}, r_{ii} = \|\mathbf{\tilde{q}}_i\|_2.$

## Singular Value Decomposition

### Definition 21

A matrix
$A\in\mathbb{R}^{m\times n}$
is a dyad if it can be written as
$\mathbf{p}\mathbf{q}^T$
.
A dyad is a rank-one matrix. It turns out that all matrices can be decomposed into a sum of dyads.

### Definition 22

The Singular Value Decomposition of a matrix
$A$
is
$A = \sum_{i=1}^{r} \sigma_i \mathbf{u}_i\mathbf{v}_i^T$
where
$\sigma_i$
are the singular values of
$A$
and
$\mathbf{u}_i$
and
$\mathbf{v}_i$
are the left and right singular vectors.
Th singular values are ordered such that
$\sigma_1 >= \sigma_2 >= \cdots$
. The left singular values are the eigenvectors of
$AA^T$
and the right singular values are the eigenvectors of
$A^TA$
. The singular values are
$\sqrt{\lambda}_i$
where
$\lambda_i$
are the eigenvalues of
$A^TA$
. Since
$AA^T$
and
$A^TA$
are symmetric,
$\mathbf{u}_i$
and
$\mathbf{v}_i$
are orthogonal. The number of non-zero singular values is equal to the rank of the matrix. We can write the SVD in matrix form as
$A = \left[U_r\quad U_{n-r}\right]\text{diag}(\sigma_1,\cdots,\sigma_r,0,\cdots,0)\begin{bmatrix}V^T_r\\V^T_{n-r}\end{bmatrix}$
Writing the SVD tells us that
1. 1.
$V_{n-r}$
forms a basis for
$\mathcal{N}(A)$
2. 2.
$U_{r}$
form a basis for
$\mathcal{R}(A)$
The Frobenius norm and spectral norm are tightly related to the SVD.
$\|A\|_F = \sum_{i}\sigma_i^2$
$\|A\|_2^2 = \sigma_1^2$