# Linear Algebra

{% hint style="info" %}

#### Definition 1

An affine set is one of the form $$\mathcal{A}={ \mathbf{x}\in\mathcal{X}:\ \mathbf{x}=\mathbf{v}+\mathbf{x\_0},\ \mathbf{v}\in\mathcal{V}}$$ where $$\mathcal{V}$$ is a subspace of a vector space $$\mathcal{X}$$ and $$x\_0$$is a given point.
{% endhint %}

Notice that by Definition 1, a subspace is simply an affine set containing the origin. Also notice that the dimension of an affine set $$\mathcal{A}$$ is the same as the dimension of $$\mathcal{V}$$.

## Norms

{% hint style="info" %}

#### Definition 2

A norm on the vector space $$\mathcal{X}$$ is a function $$|\cdot|:\mathcal{X}\rightarrow\mathbb{R}$$ which satisfies:

1. $$|\mathbf{x}|\geq 0$$ with equality if and only if $$\mathbf{x}=\boldsymbol{0}$$
2. $$|\mathbf{x}+\mathbf{y}|\leq|\mathbf{x}|+|\mathbf{y}|$$
3. $$|\alpha \mathbf{x}| = |\alpha||\mathbf{x}|$$ for any scalar $$\alpha$$.
   {% endhint %}

{% hint style="info" %}

#### Definition 3

The $$l\_p$$ norms are defined by

$$|\mathbf{x}|*p=\left( \sum*{k=1}^n|x\_k|^p \right)^{\frac{1}{p}},\ 1\leq p\leq \infty$$
{% endhint %}

In the limit as $$p\to\infty$$,

$$|\mathbf{x}|\_{\infty} = \max\_k|x\_k|.$$

Similar to vectors, matrices can also have norms.

{% hint style="info" %}

#### Definition 4

A function $$f: \mathbb{R}^{m\times n} \to \mathbb{R}$$ is a matrix norm if

$$f(A) \geq 0 \quad f(A) = 0 \Leftrightarrow A = 0 \quad f(\alpha A) = |\alpha| f(A) \quad f(A+B) \leq f(A) + f(B)$$
{% endhint %}

{% hint style="info" %}

#### Definition 5

The Froebenius norm is the $$l\_2$$ norm applied to all elements of the matrix.

$$|A|*F = \sqrt{\text{trace} AA^T} = \sqrt{\sum*{i=1}^m \sum\_{j=1}^n |a\_{ij}|^2}$$
{% endhint %}

One useful way to characterize matrices is by measuring their “gain” relative to some $$l\_p$$ norm.

{% hint style="info" %}

#### Definition 6

The operator norms is defined as

$$|A|*p = \max*{\mathbf{u}\ne0} \frac{|A\mathbf{u}|\_p}{|u|\_p}$$
{% endhint %}

When $$p=2$$, the norm is called the spectral norm because it relates to the largest eigenvalue of $$A^TA$$.

$$|A|*2 = \sqrt{\lambda*{max}(A^TA)}$$

## Inner Products

{% hint style="info" %}

#### Definition 7

An inner product on real vector space is a function that maps $$\mathbf{x},\mathbf{y} \in \mathcal{X}$$ to a non-negative scalar, is distributive, is commutative, and $$\langle \mathbf{x}, \mathbf{x}, \rangle = 0 \Leftrightarrow \mathbf{x}=0$$.
{% endhint %}

Inner products induce a norm $$|\mathbf{x}| = \sqrt{\langle \mathbf{x}, \mathbf{x} \rangle}$$. In $$\mathbb{R}^n$$, the standard inner product is $$\mathbf{x}^T\mathbf{y}$$. The angle bewteen two vectors is given by

$$\cos\theta = \frac{\mathbf{x}^T\mathbf{y}}{|\mathbf{x}|\_2|\mathbf{y}|\_2}.$$

In general, we can bound the absolute value of the standard inner product between two vectors.

{% hint style="info" %}

#### Theorem 1 (Holder Inequality) <a href="#theorem-1" id="theorem-1"></a>

$$|\mathbf{x}^T\mathbf{y}| \leq \sum\_{k=1}^n |x\_ky\_k| \leq |\mathbf{x}|\_p|\mathbf{y}|\_q,\ p, q\geq 1 \text{ s.t } p^{-1}+q^{-1}=1.$$
{% endhint %}

Notice that for $$p=q = 2$$, Theorem 1 turns into the Cauchy-Schwartz Inequality ($$|\mathbf{x}^T\mathbf{y}| \leq |\mathbf{x}|\_2|\mathbf{y}|\_2$$).

## Functions

We consider functions to be of the form $$f:\mathbb{R}^n\rightarrow\mathbb{R}$$. By contrast, a map is of the form $$f:\mathbb{R}^n\rightarrow\mathbb{R}^m$$. The components of the map $$f$$ are the scalar valued functions $$f\_i$$ that produce each component of a map.

{% hint style="info" %}

#### Definition 8

The graph of a function $$f$$ is the set of input-output pairs that $$f$$ can attain.

$$\left{ (x, f(x))\in \mathbb{R}^{n+1}:\ x\in\mathbb{R}^n \right}$$
{% endhint %}

{% hint style="info" %}

#### Definition 9

The epigraph of a function is the set of input-output pairs that $$f$$ can achieve and anything above.

$$\left{ (x,t) \in \mathbb{R}^{n+1}:\ \mathbf{x}\in\mathbb{R}^{n+1},\ t\geq f(x) \right}$$
{% endhint %}

{% hint style="info" %}

#### Definition 10

The t-level set is the set of points that achieve exactly some value of $$f$$.

$${ \mathbf{x}\in\mathbb{R}^n:\ f(x)=t }$$
{% endhint %}

{% hint style="info" %}

#### Definition 11

The t-sublevel set of $$f$$ is the set of points achieving at most a value $$t$$.

$${ x\in\mathbb{R}^n:\ f(x)\leq t }$$
{% endhint %}

{% hint style="info" %}

#### Definition 12

The half-spaces are the regions of space which a hyper-plane separates.

$$H\_{\_} = { x: \mathbf{a}^T\mathbf{x}\leq b } \qquad H\_{+} = { x: \mathbf{a}^T\mathbf{x} > b }$$
{% endhint %}

{% hint style="info" %}

#### Definition 13

A polyhedron is the intersection of $$m$$ half-spaces given by $$\mathbf{a}\_i^T\mathbf{x}\leq b\_i$$ for $$i\in\[1,m]$$.
{% endhint %}

When a polyhedron is bounded, it is called a polytope.

### Types of Functions

{% hint style="info" %}

#### Theorem 2

A function is linear if and only if it can be expressed as $$f(\mathbf{x}) = \mathbf{a}^T\mathbf{x}+b$$ for some unique pair $$(\mathbf{a}, b)$$.
{% endhint %}

An affine function is linear when $$b=0$$. A hyperplane is simply a level set of a linear function.

{% hint style="info" %}

#### Theorem 3

Any quadratic function can be written as the sum of a quadratic term involving a symmetric matrix and an affine term:

$$q(x) = \frac{1}{2}\mathbf{x}^TH\mathbf{x}+\mathbf{c}^T\mathbf{x} + d.$$
{% endhint %}

Another special class of functions are polyhedral functions.

{% hint style="info" %}

#### Definition 14

A function $$f:\mathbb{R}^n\to\mathbb{R}$$ is polyhedral if its epigraph is a polyhedron.

$$\text{epi } f = \left{(x,t) \in \mathbb{R}^{n+1} :\ C \begin{bmatrix}\mathbf{x} \ t \end{bmatrix} \leq d \right}$$
{% endhint %}

### Vector Calculus

We can also do calculus with vector functions.

{% hint style="info" %}

#### Definition 15

The gradient of a function at a point $$x$$ where $$f$$ is differentiable is a column vector of first derivatives of $$f$$ with respsect to the components of $$\mathbf{x}$$

$$abla f(x) = \begin{bmatrix} \frac{\partial f}{\partial x\_1}\ \vdots\ \frac{\partial f}{\partial x\_n} \end{bmatrix}$$
{% endhint %}

The gradient is perpendicular to the level sets of $$f$$ and points from a point $$\mathbf{x}\_0$$ to higher values of the function. In other words, it is the direction of steepest increase. It is akin to the derivative of a 1D function.

{% hint style="info" %}

#### Definition 16

The Hessian of a function $$f$$ at point $$x$$ is a matrix of second derivatives.

$$H\_{ij} = \frac{\partial^2 f}{\partial x\_i \partial x\_j}$$
{% endhint %}

The Hessian is akin to the second derivative in a 1D function. Note that the Hessian is a symmetric matrix.

## Matrices

Matrices define a linear map between an input space and an output space. Any linear map $$f: \mathbb{R}^n \to \mathbb{R}^m$$ can be represented by a matrix.

{% hint style="info" %}

#### Theorem 4 (Fundamental Theorem of Linear Algebra) <a href="#theorem-4" id="theorem-4"></a>

For any matrix $$A\in\mathbb{R}^{m\times n}$$,

$$\mathcal{N}(A) \oplus \mathcal{R}(A^T) = \mathbb{R}^n \qquad \mathcal{R}(A) \oplus \mathcal{N}(A^T) = \mathbb{R}^m.$$
{% endhint %}

### Symmetric Matrices

Recall that a symmetric matrix is one where $$A = A^T$$.

{% hint style="info" %}

#### Theorem 5 (Spectral Theorem) <a href="#theorem-5" id="theorem-5"></a>

Any symmetric matrix is orthogonally similar to a real diagonal matrix.

$$A = A^T \implies A = U \Lambda U^T = \sum\_i \lambda\_i \mathbf{u}\_i\mathbf{u}\_i^T,\quad |\mathbf{u}| = 1, \quad \mathbf{u}\_i^T\mathbf{u}\_j = 0 \ (i \ne j)$$
{% endhint %}

Let $$\lambda\_{min}(A)$$ be the smallest eigenvalue of symmetric matrix $$A$$ and $$\lambda\_{max}(A)$$ be the largest eigenvalue.

{% hint style="info" %}

#### Definition 17

The Rayleigh Quotient for $$\mathbf{x} \ne \boldsymbol{0}$$ is $$\frac{\mathbf{x}^TA\mathbf{x}}{|\mathbf{x}|^2}.$$
{% endhint %}

{% hint style="info" %}

#### Theorem 6

For any $$\mathbf{x} \ne \boldsymbol{0}$$,

$$\lambda\_{min}(A) \leq \frac{\mathbf{x}^TA\mathbf{x}}{|\mathbf{x}|^2} \leq \lambda\_{max}(A).$$
{% endhint %}

Two special types of symmetric matrices are those with non-negative eigenvalues.

{% hint style="info" %}

#### Definition 18

A symmetric matrix is positive semi-definite if $$\mathbf{x}^TA\mathbf{x} \geq 0 \implies \lambda\_{min}(A) \geq 0$$.
{% endhint %}

{% hint style="info" %}

#### Definition 19

A symmetric matrix is poitive definite if $$\mathbf{x}^TA\mathbf{x} > 0 \implies \lambda\_{min}(A) > 0$$.
{% endhint %}

These matrices are important because they often have very clear geometric structures. For example, an ellipsoid in multi-dimensional space can be defined as the set of points

$$\mathcal{E} = { x\in\mathbb{R}^m : \ \mathbf{x}^T P^{-1} \mathbf{x} \leq 1 }$$

where $$P$$ is a positive definite matrix. The eigenvectors of $$P$$ give the principle axes of this ellipse, and $$\sqrt{\lambda}$$ are the semi-axis lengths.

### QR Factorization

Similar to how spectral theorem allows us to decompose symmetric matrices, QR factorization is another matrix decomposition technique that works for any general matrix.

{% hint style="info" %}

#### Definition 20

The QR factorization matrix are the orthogonal matrix Q and the upper triangular matrix R such that $$A = QR$$
{% endhint %}

An easy way to find the QR factorization of a matrix is to apply Graham Schmidt to the columns of the matrix and express the result in matrix form. Suppose that our matrix $$A$$ is full rank (i.e its columns $$\mathbf{a}*i$$ are linearly independent) and we have applied Graham-Schmidt to columns $$\mathbf{a}*{i+1}\cdots\mathbf{a}*n$$ to get orthogonal vectors $$\mathbf{q}*{i+1}\cdots\mathbf{q}\_{n}$$. Continuing the procedure, the ith orthogonal vector $$\mathbf{q}\_i$$ is

$$\mathbf{\tilde{q}}\_i = \mathbf{a}*i - \sum*{k=i+1}^{n} (\mathbf{q}\_k^T \mathbf{a}\_k)\mathbf{q}\_k \qquad \mathbf{q}\_i = \frac{\mathbf{\tilde{q}}\_i}{|\mathbf{\tilde{q}}\_i|\_2}.$$

If we re-arrange this, to solve for $$\mathbf{a}\_i$$, we see that

$$\mathbf{a}\_i = |\mathbf{\tilde{q}}\_i|\_2 \mathbf{q}*i + \sum*{k=i+1}^{n} (\mathbf{q}\_k^T \mathbf{a}\_k)\mathbf{q}\_k.$$

Putting this in matrix form, we can see that

$$\begin{bmatrix} | & | & & | \ \mathbf{a}*1 & \mathbf{a}*2 & \cdots & \mathbf{a}*{n}\ | & | & & | \ \end{bmatrix} = \begin{bmatrix} | & | & & | \ \mathbf{q}*1 & \mathbf{q}*2 & \cdots & \mathbf{q}*{n}\ | & | & & | \ \end{bmatrix} \begin{bmatrix} r*{11} & r*{12} & \cdots & r\_{1n}\ 0 & r\_{22} & \cdots & r\_{2n}\ \vdots & \ddots & \ddots & \vdots\ 0 & \cdots & 0 & r\_{nn} \end{bmatrix} \qquad r\_{ij} = \mathbf{a}*i^T\mathbf{q\_j}, r*{ii} = |\mathbf{\tilde{q}}\_i|\_2.$$

### Singular Value Decomposition

{% hint style="info" %}

#### Definition 21

A matrix $$A\in\mathbb{R}^{m\times n}$$ is a dyad if it can be written as $$\mathbf{p}\mathbf{q}^T$$.
{% endhint %}

A dyad is a rank-one matrix. It turns out that all matrices can be decomposed into a sum of dyads.

{% hint style="info" %}

#### Definition 22

The Singular Value Decomposition of a matrix $$A$$ is

$$A = \sum\_{i=1}^{r} \sigma\_i \mathbf{u}\_i\mathbf{v}\_i^T$$

where $$\sigma\_i$$ are the singular values of $$A$$ and $$\mathbf{u}\_i$$ and $$\mathbf{v}\_i$$are the left and right singular vectors.
{% endhint %}

Th singular values are ordered such that $$\sigma\_1 >= \sigma\_2 >= \cdots$$. The left singular values are the eigenvectors of $$AA^T$$ and the right singular values are the eigenvectors of $$A^TA$$. The singular values are $$\sqrt{\lambda}\_i$$ where $$\lambda\_i$$ are the eigenvalues of $$A^TA$$. Since $$AA^T$$ and $$A^TA$$ are symmetric, $$\mathbf{u}\_i$$ and $$\mathbf{v}\_i$$ are orthogonal. The number of non-zero singular values is equal to the rank of the matrix. We can write the SVD in matrix form as

$$A = \left\[U\_r\quad U\_{n-r}\right]\text{diag}(\sigma\_1,\cdots,\sigma\_r,0,\cdots,0)\begin{bmatrix}V^T\_r\V^T\_{n-r}\end{bmatrix}$$

Writing the SVD tells us that

1. $$V\_{n-r}$$ forms a basis for $$\mathcal{N}(A)$$
2. $$U\_{r}$$ form a basis for $$\mathcal{R}(A)$$

The Frobenius norm and spectral norm are tightly related to the SVD.

$$|A|*F = \sum*{i}\sigma\_i^2$$

$$|A|\_2^2 = \sigma\_1^2$$
