A matrix $A$ is $m \times n$, where $m$ is the number of rows and $n$ is the number of columns.
Definitions
$Ax = x_1 a_1 + x_2 a_2 + \dots + x_n a_n$, where $a_i$ is the $i$th column of the matrix.
$AB = [\,Ab_1 \;\; Ab_2 \;\; \dots\,]$, just $A$ acting on each column vector of $B$.
For an inverse to exist, $Ax = b$ needs to have a unique solution for every $b$.
Independent vectors: there is no $x \neq 0$ s.t. $Ax = 0$.
$n$ independent vectors form a basis in $\mathbb{R}^n$. Every vector in the space is a unique combination of those basis vectors.
Here are particular bases for $\mathbb{R}^n$ among all the choices we could make:
- Standard basis = columns of the identity matrix
- General basis = columns of any invertible matrix
- Orthonormal basis = columns of any orthogonal matrix
If $A$ is invertible, then $A^T A$ is invertible (follows from $A$ invertible), symmetric (always), and positive definite ($x^T A^T A x = \|Ax\|^2 > 0$ for $x \neq 0$, since $Ax = 0$ only for $x = 0$).
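A minimal numpy sketch of the facts above, on a made-up invertible matrix: $Ax$ as a combination of the columns, and $A^T A$ symmetric and positive definite.

```python
import numpy as np

A = np.array([[2., 0., 1.],
              [1., 3., 0.],
              [0., 1., 1.]])   # an invertible 3x3 example (illustrative values)
x = np.array([1., 2., -1.])

# Ax is a combination of the columns: x_1*a_1 + x_2*a_2 + x_3*a_3
combo = sum(x[i] * A[:, i] for i in range(3))
print(np.allclose(A @ x, combo))              # True

G = A.T @ A
print(np.allclose(G, G.T))                    # symmetric (always)
print(np.all(np.linalg.eigvalsh(G) > 0))      # positive definite since A is invertible
```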
Similar matrices
- Two similar matrices describe the same linear map, i.e. their mappings are isomorphic and the isomorphism is the matrix $M$, also called a change of basis.
- Two square matrices $A$ and $B$ are called similar if there is an invertible matrix $M$ s.t. $B = M^{-1} A M$.
- Property: Similar matrices have the same characteristic polynomial (i.e. same eigenvalues).
- Proof: $\det(B - \lambda I) = \det(M^{-1} A M - \lambda I) = \det\big(M^{-1}(A - \lambda I)M\big) = \det(M^{-1})\det(A - \lambda I)\det(M) = \det(A - \lambda I)$ (using the multiplicative property of the determinant).
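As a sanity check of the property above, a short numpy sketch (random matrices, purely illustrative) comparing the eigenvalues of $A$ and $M^{-1} A M$:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
M = rng.standard_normal((4, 4))          # almost surely invertible
B = np.linalg.inv(M) @ A @ M             # B = M^{-1} A M is similar to A

# Similar matrices share their eigenvalues (compare after sorting)
print(np.sort_complex(np.linalg.eigvals(A)))
print(np.sort_complex(np.linalg.eigvals(B)))
```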
Column and row space
Definition: The column space contains all the linear combinations of the columns.
Useful decomposition if $A$ is not full rank: one can always decompose a matrix as $A = CR$ into a column matrix $C$ and a row matrix $R$, where $A$ is $m \times n$, $C$ is $m \times r$, $R$ is $r \times n$ ($r$ = rank).
For a $3 \times 3$ matrix with rank $r = 2$, one can decompose $A = CR$ where $C$ is $3 \times 2$ (columns of $C$ are a basis for the column space) and $R$ is $2 \times 3$ (rows of $R$ are a basis for the row space).
$R$ is the identity if $A$ is full-rank. Otherwise, it will be a block matrix $[\,I \;\; F\,]$ where the columns of $F$ describe how to get the columns $r+1, \dots, n$ by using the previous columns.
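A small sketch of the $A = CR$ idea on a rank-2 example (the numbers are made up for illustration): $C$ holds a basis for the column space, and the columns of $R$ say how each column of $A$ is built from it.

```python
import numpy as np

C = np.array([[1., 0.],
              [2., 1.],
              [3., 1.]])                 # 3x2: basis for the column space
R = np.array([[1., 0., 2.],
              [0., 1., 1.]])             # 2x3: starts with I, then the "recipe" column F = [2, 1]
A = C @ R                                # 3x3 matrix of rank 2

print(np.linalg.matrix_rank(A))                          # 2
print(np.allclose(A[:, 2], 2 * A[:, 0] + 1 * A[:, 1]))   # third column from the first two, as read off R
```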
Orthogonality of null-spaces and column spaces
- Remember that $Ax = 0$ means that the dot product between $x$ and every row of $A$ is equal to zero: the nullspace of $A$ is orthogonal to the row space (and, likewise, the nullspace of $A^T$ is orthogonal to the column space).
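A quick numerical check of this orthogonality, using a made-up rank-1 matrix and a nullspace basis taken from the SVD:

```python
import numpy as np

A = np.array([[1., 2., 3.],
              [2., 4., 6.]])             # rank 1, so the nullspace is 2-dimensional

_, s, Vt = np.linalg.svd(A)
rank = int(np.sum(s > 1e-10))
N = Vt[rank:].T                          # columns span the nullspace of A

# A @ N == 0 means every row of A is orthogonal to every nullspace vector
print(np.allclose(A @ N, 0))
```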
Motivation for least squares
Suppose $A$ is tall and thin ($m > n$). The $n$ columns are likely to be independent. But if $b$ is not in the column space, $Ax = b$ has no solution. The least squares method minimizes $\|Ax - b\|^2$ by solving $A^T A \hat{x} = A^T b$ (i.e. project $b$ onto the column space of $A$).
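A sketch of least squares on random data (illustrative only): solving the normal equations $A^T A \hat{x} = A^T b$ and checking against numpy's built-in solver.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 2))          # tall and thin: 6 equations, 2 unknowns
b = rng.standard_normal(6)               # generally not in the column space of A

x_hat = np.linalg.solve(A.T @ A, A.T @ b)        # normal equations
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)    # reference solution
print(np.allclose(x_hat, x_ref))

r = b - A @ x_hat                        # residual is orthogonal to the column space
print(np.allclose(A.T @ r, 0))
```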
Orthogonal vectors
- $Q$: the columns are orthogonal, $q_i^T q_j = 0$ for $i \neq j$ (dot product between columns).
- $Q$ orthonormal: the columns are also unit vectors, $Q^T Q = I$.
- If $Q$ is square, then $Q Q^T = I$ also, and thus $Q^{-1} = Q^T$.
- Orthogonal matrices $Q$ are rotation (or reflection) transforms. Indeed, $\|Qx\|^2 = x^T Q^T Q x = x^T x = \|x\|^2$.
- For the eigenvalues of $Q$: $Qx = \lambda x$ and $\|Qx\| = \|x\|$ give $|\lambda| = 1$.
- For $A$ full-rank, we can orthogonalize its columns:
- $A = QR$. Then the columns of $Q$ are orthonormal. $R$ is upper-triangular (by the Gram-Schmidt iterative construction).
- Example for least squares:
- $Ax = b$: $m$ equations, $n$ unknowns ($m > n$), minimize $\|Ax - b\|^2$.
- Normal equations for the best $\hat{x}$: $A^T(b - A\hat{x}) = 0$, or $A^T A \hat{x} = A^T b$, or $\hat{x} = (A^T A)^{-1} A^T b$.
- If $A = QR$, then $A^T A = R^T Q^T Q R = R^T R$, which leads to $R\hat{x} = Q^T b$ ($R$ is much easier to invert since it is triangular).
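A sketch (random data, illustrative) of least squares through $QR$: the columns of $Q$ come out orthonormal, $R$ upper-triangular, and $R\hat{x} = Q^T b$ gives the same $\hat{x}$ as the normal equations.

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((6, 3))
b = rng.standard_normal(6)

Q, R = np.linalg.qr(A)                   # reduced QR: Q is 6x3 with orthonormal columns, R is 3x3 upper-triangular
print(np.allclose(Q.T @ Q, np.eye(3)))
print(np.allclose(R, np.triu(R)))

x_qr = np.linalg.solve(R, Q.T @ b)       # R x = Q^T b (easy triangular solve)
x_ne = np.linalg.solve(A.T @ A, A.T @ b) # normal equations give the same answer
print(np.allclose(x_qr, x_ne))
```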
Eigenvalues and Eigenvectors
- An eigenvector $x$ with eigenvalue $\lambda$ of matrix $A$ (only for square matrices) satisfies $Ax = \lambda x$.
- To find the eigenvalues, we need to find the nullspace of $A - \lambda I$, i.e. $x \neq 0$ s.t. $(A - \lambda I)x = 0$.
- There exists a nonzero nullspace iff $A - \lambda I$ is not invertible iff $\det(A - \lambda I) = 0$. This is the characteristic equation, and we solve it for $\lambda$.
- Property: The eigenvalues of a triangular matrix are the entries on its main diagonal.
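A short numpy check of these facts on made-up matrices: $Ax = \lambda x$, the characteristic equation $\det(A - \lambda I) = 0$, and the diagonal rule for triangular matrices.

```python
import numpy as np

A = np.array([[4., 1.],
              [2., 3.]])                 # illustrative 2x2 matrix (eigenvalues 5 and 2)

lam, X = np.linalg.eig(A)                # columns of X are eigenvectors
for l, x in zip(lam, X.T):
    print(np.allclose(A @ x, l * x))                          # A x = lambda x
    print(np.isclose(np.linalg.det(A - l * np.eye(2)), 0))    # det(A - lambda I) = 0

T = np.array([[1., 2., 3.],
              [0., 5., 6.],
              [0., 0., 9.]])             # triangular: eigenvalues are the diagonal entries
print(np.sort(np.linalg.eigvals(T).real), np.diag(T))
```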
If not symmetric
- Powers $A^k$ (and $A^{-1}$ when it exists) have the same eigenvectors as $A$, with eigenvalues $\lambda^k$ (resp. $1/\lambda$).
Spectral theorem
- Let $S$ be a symmetric matrix.
- Then $S$ has orthogonal eigenvectors: $x_i^T x_j = 0$ for eigenvectors with different eigenvalues (easy proof).
- Let $q_1, \dots, q_n$ be the orthonormal eigenvectors of $S$; then $SQ = Q\Lambda$ and thus $S = Q\Lambda Q^T = \sum_i \lambda_i q_i q_i^T$ (spectral theorem). This is a sum of rank-one matrices formed by the $q_i q_i^T$.
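A numerical illustration of the spectral theorem on a random symmetric matrix: `np.linalg.eigh` returns orthonormal eigenvectors, and summing the rank-one pieces $\lambda_i q_i q_i^T$ rebuilds $S$.

```python
import numpy as np

rng = np.random.default_rng(2)
B = rng.standard_normal((4, 4))
S = (B + B.T) / 2                        # a symmetric matrix

lam, Q = np.linalg.eigh(S)               # real eigenvalues, orthonormal eigenvectors
print(np.allclose(Q.T @ Q, np.eye(4)))   # eigenvectors are orthogonal

S_sum = sum(l * np.outer(q, q) for l, q in zip(lam, Q.T))
print(np.allclose(S, S_sum))             # S = sum_i lambda_i q_i q_i^T
```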
Singular values
- $A^T A$ is square, symmetric, nonnegative definite.
- With $S = A^T A$ (thus symmetric), this will lead to the singular values of $A$. SVD: $A = U\Sigma V^T$ with $U^T U = I$ and $V^T V = I$.
- We have $A^T A = (U\Sigma V^T)^T (U\Sigma V^T) = V\,\Sigma^T \Sigma\, V^T$.
- Indeed, the $v_i$ are eigenvectors of $A^T A$, and $A^T A$ is symmetric: $A^T A v_i = \sigma_i^2 v_i$. Similarly, $A A^T u_i = \sigma_i^2 u_i$.
- We then have $A v_i = \sigma_i u_i$ and thus $AV = U\Sigma$.
- SVD: $A = U\Sigma V^T = \sum_i \sigma_i u_i v_i^T$.
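A sketch tying the SVD to the eigendecomposition of $A^T A$ on a random matrix: $\sigma_i^2$ are the eigenvalues of $A^T A$, and $A v_i = \sigma_i u_i$.

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((5, 3))

U, s, Vt = np.linalg.svd(A, full_matrices=False)
print(np.allclose(A, U @ np.diag(s) @ Vt))       # A = U Sigma V^T

lam = np.linalg.eigvalsh(A.T @ A)[::-1]          # eigenvalues of A^T A, descending
print(np.allclose(lam, s**2))                    # sigma_i^2 = lambda_i(A^T A)

print(np.allclose(A @ Vt.T, U * s))              # A v_i = sigma_i u_i, i.e. AV = U Sigma
```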
Trace
- Is $\operatorname{tr}(A)$ the divergence of the vector field $x \mapsto Ax$ created by $A$?
- divergence = rate of area change/area (of a local area around a point that evolves in a vector field).
- Usually, divergence is a quantity dependent on the position of the point within the vector field.
- But for the vector field generated by matrix A, the divergence tr(A) is a constant.
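A finite-difference check of this claim (made-up matrix, simple numerical divergence): the divergence of the field $x \mapsto Ax$ comes out equal to $\operatorname{tr}(A)$ at every point.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
f = lambda x: A @ x                      # vector field created by A

def divergence(f, x, h=1e-6):
    """Central finite-difference divergence: sum_i d f_i / d x_i."""
    div = 0.0
    for i in range(x.size):
        e = np.zeros_like(x); e[i] = h
        div += (f(x + e)[i] - f(x - e)[i]) / (2 * h)
    return div

for _ in range(3):
    x = rng.standard_normal(3)           # any point gives the same value
    print(divergence(f, x), np.trace(A))
```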
Determinant
- Property 1. $\det I = 1$.
- Property 2. Exchanging two rows of $A$ reverses the sign of $\det A$. Thus, for permutation matrices, $\det P = \pm 1$.
- Property 3a. For a single row multiplied by $t$ (other rows unchanged), $\det$ is multiplied by $t$.
- Property 3b. For a single row written as a sum $v + w$ (other rows unchanged), $\det$ splits into the sum of the two determinants.
- $\det$ is a linear operator for each row, while keeping the other rows the same.
- Property 4. If there are 2 equal rows, then $\det A = 0$ (test for invertibility). Proof: exchange the two equal rows. The determinant must change sign but the matrix is the same, so $\det A = -\det A$, hence $\det A = 0$.
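A quick numerical pass over the four properties with a random matrix (illustrative only):

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((3, 3))
det = np.linalg.det

print(np.isclose(det(np.eye(3)), 1.0))                 # Property 1: det I = 1

A_swap = A[[1, 0, 2]]                                  # exchange rows 0 and 1
print(np.isclose(det(A_swap), -det(A)))                # Property 2: sign flips

A_scaled = A.copy(); A_scaled[0] *= 5.0                # multiply one row by t = 5
print(np.isclose(det(A_scaled), 5.0 * det(A)))         # Property 3a

A_equal = A.copy(); A_equal[1] = A_equal[0]            # two equal rows
print(np.isclose(det(A_equal), 0.0))                   # Property 4: det = 0
```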
PCA
- Given a data matrix $X$ with $n$ datapoints and $d$ features, we can project it into a smaller-dimensional space in which the features are optimally linearly combined, where optimal means the best low-rank approximation in the least-squares / Frobenius-norm sense.
- The principal directions are the columns of $V$, where $V$ comes from the SVD decomposition $X = U\Sigma V^T$,
- or from the eigenvector decomposition of the covariance matrix (don't forget to demean the data matrix first), because the covariance matrix is symmetric and positive semi-definite.
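A sketch of both routes on random data (names and numbers are illustrative): SVD of the demeaned data vs. eigendecomposition of its covariance matrix give the same principal directions and variances.

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 4)) @ rng.standard_normal((4, 4))  # 100 datapoints, 4 features
Xc = X - X.mean(axis=0)                          # demean the data matrix

# Route 1: SVD of the centered data
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

# Route 2: eigendecomposition of the covariance matrix
C = Xc.T @ Xc / (len(Xc) - 1)
lam, V = np.linalg.eigh(C)
lam, V = lam[::-1], V[:, ::-1]                   # descending order

print(np.allclose(lam, s**2 / (len(Xc) - 1)))    # same variances along principal directions
print(np.allclose(np.abs(V), np.abs(Vt.T)))      # same directions (up to sign)

Z = Xc @ Vt.T[:, :2]                             # project onto the top-2 principal directions
```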
Big picture
- $A = LU$: elimination
- $A = QR$: orthogonalization
- $S = Q\Lambda Q^T$: eigenvalues
- $A = U\Sigma V^T$: singular values
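The four factorizations side by side in numpy/scipy (random matrices, just to see each one run):

```python
import numpy as np
from scipy.linalg import lu

rng = np.random.default_rng(6)
A = rng.standard_normal((4, 4))
S = A + A.T                                      # symmetric matrix for the eigenvalue factorization

P, L, U = lu(A)                                  # elimination:        A = P L U
Q, R = np.linalg.qr(A)                           # orthogonalization:  A = Q R
lam, Qs = np.linalg.eigh(S)                      # eigenvalues:        S = Q Lambda Q^T
U2, s, Vt = np.linalg.svd(A)                     # singular values:    A = U Sigma V^T

print(np.allclose(A, P @ L @ U))
print(np.allclose(A, Q @ R))
print(np.allclose(S, Qs @ np.diag(lam) @ Qs.T))
print(np.allclose(A, U2 @ np.diag(s) @ Vt))
```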