Hey guys! Since class is over and I'm back in Taiwan, I want to share some of what I've learned about the mathematical foundations of ML. I'm not sure exactly what I want to cover or how deep to go, but I'll try to make this as approachable as possible.

Prerequisites:

Matrix Operations and Identities, Basis, Diagonalizability, Eigenvalues and Eigenvectors, Variance and Covariance

What is PCA and Why?

Principal Component Analysis (PCA) is a powerful statistical technique commonly used for dimension reduction and simplification, while retaining the important information in the data.

Consider a typical RGB image with dimensions of 224 x 224 x 3, totaling 150,528 data points for a single image. That is quite a lot of data for each image! In reality, many of these points are correlated and don't contribute much to our understanding of the image's content: altering a few pixels won't change whether an image is recognized as a cat or a dog.

Performing dimension reduction with PCA can speed up computation by reducing the number of dimensions to process, allowing only the most crucial information to be retained while removing redundant, useless data (noise).

Formulating and Solving PCA

Consider a dataset where each data point $\mathbf{x}_i$ has $m$ features and is represented as a column of the matrix $\mathbf{X}$. Here, $\mathbf{X}$ is an $m \times n$ matrix with data points $\mathbf{x}_1, \dots, \mathbf{x}_n$. We preprocess the data so that each feature has mean 0.

The goal of PCA is to reduce the dimensionality of this dataset from $m$ to $k$ (where $k < m$) by finding a new orthonormal basis $\beta' = \{\beta'_1, \dots, \beta'_k\}$, ordered so that $\beta'_1$ captures the most variance and $\beta'_k$ the least, that “best expresses” the variability in the data.

After obtaining $\beta'$, each data point $\mathbf{x}_i$ is projected onto this new basis to obtain:

$$\mathbf{y}_i = (\mathbf{x}_i \cdot \beta'_1)\beta'_1 + \dots + (\mathbf{x}_i \cdot \beta'_k)\beta'_k$$

where we denote $\mathbf{Y}$ as the matrix whose $i$-th column holds the coefficients of $\mathbf{y}_i$.

We can also write this projection process as a matrix equation:

$$\mathbf{Y} = \mathbf{B}\mathbf{X}$$

where the $i$-th row of $\mathbf{B}$ is $\mathbf{B}_i = \beta'_i$ (this equation is also called a change of basis).
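To make the change of basis concrete, here is a minimal PyTorch sketch (the data and basis here are made up for illustration, not part of the worked example later in this post): it builds an orthonormal basis with a QR decomposition, stacks the basis vectors as rows of B, and computes the coefficients Y = BX.

import torch

m, n, k = 5, 100, 2                     # features, data points, reduced dimension
X = torch.randn(m, n)                   # toy data, one data point per column
X = X - X.mean(dim=1, keepdim=True)     # center each feature to mean 0

# Any orthonormal set of k vectors gives a valid change of basis;
# the Q factor of a random matrix is an easy way to get one to play with.
Q, _ = torch.linalg.qr(torch.randn(m, k))
B = Q.T                                 # row i of B is the basis vector beta'_i
Y = B @ X                               # coefficients of every data point in the new basis
print(Y.shape)                          # torch.Size([2, 100])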

Optimization Objective

We need to clarify what “best expresses” means in the previous paragraph.

In the best case, we would want each principal component to have the highest possible variance. High variance indicates a wide spread from the mean, which essentially means we retain as much information as possible from the original dataset.

Furthermore, we would want the covariance between the principal components to be minimized. High covariance means the components are correlated, and therefore redundant.

Put together, it is easy to see that in the optimal case, the covariance matrix of $\mathbf{Y}$, $\mathbf{C_Y} = \frac{1}{n - 1}\mathbf{Y}\mathbf{Y}^T$, should be a diagonal matrix.
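As a quick sanity check (my own sketch, not part of the derivation), this covariance matrix is exactly what torch.cov computes when rows are variables and columns are observations:

import torch

Y = torch.randn(3, 50)                      # 3 components, 50 observations (made-up data)
Yc = Y - Y.mean(dim=1, keepdim=True)        # center each row
C_manual = Yc @ Yc.T / (Y.shape[1] - 1)     # C_Y = Y Y^T / (n - 1)
C_torch = torch.cov(Y)                      # torch.cov also centers and divides by n - 1
print(torch.allclose(C_manual, C_torch, atol=1e-6))   # True
# Diagonal entries are the variances of each component; off-diagonal
# entries are the covariances, which PCA tries to drive to zero.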

Our problem now becomes finding the $\mathbf{B}$ that satisfies our goals, which we can do by manipulating some equations:

$$\begin{aligned} \mathbf{C_Y} &= \frac{1}{n - 1}\mathbf{Y}\mathbf{Y}^T \\ &= \frac{1}{n - 1}(\mathbf{B}\mathbf{X})(\mathbf{B}\mathbf{X})^T \\ &= \frac{1}{n - 1}\mathbf{B}(\mathbf{X}\mathbf{X}^T)\mathbf{B}^T \\ &= \frac{1}{n - 1}\mathbf{B}(\mathbf{P}\mathbf{D}\mathbf{P}^{-1})\mathbf{B}^T, \end{aligned}$$

where the columns of $\mathbf{P}$ are the eigenvectors of $\mathbf{X}\mathbf{X}^T$ and $\mathbf{D}$ is the diagonal matrix of its eigenvalues.

Notice that choosing $\mathbf{B} = \mathbf{P}_k^T$, which contains only the top $k$ eigenvectors as rows, simplifies $\mathbf{C_Y}$ to:

$$\begin{aligned} \mathbf{C_Y} &= \frac{1}{n - 1}(\mathbf{P}_k^T\mathbf{P})\mathbf{D}(\mathbf{P}^{-1}\mathbf{P}_k) \\ &= \frac{1}{n - 1} \mathbf{I}_k\mathbf{D}\mathbf{I}_k^T \\ &= \frac{1}{n - 1} \mathbf{D}_k \end{aligned}$$

where $\mathbf{D}_k$ contains only the largest $k$ eigenvalues.

Thus, we've shown that choosing $\mathbf{B} = \mathbf{P}_k^T$ ensures $\mathbf{C_Y}$ is diagonal, effectively making each principal component capture distinct variance from the dataset and “best expressing” the data, as per our initial goal.

We are also guaranteed that $\mathbf{B}$ forms an orthonormal basis, since $\mathbf{P}$ is orthonormal: its columns are eigenvectors of a symmetric matrix.
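Putting the whole derivation together, here is a minimal sketch (variable names are my own) that builds B from the top-k eigenvectors of XX^T and checks that the rows of B are orthonormal and that C_Y comes out diagonal:

import torch

m, n, k = 5, 200, 2
X = torch.randn(m, n)
X = X - X.mean(dim=1, keepdim=True)          # center the data

S = X @ X.T                                  # symmetric m x m matrix X X^T
eigvals, P = torch.linalg.eigh(S)            # ascending eigenvalues, orthonormal eigenvectors
B = P[:, -k:].T                              # B = P_k^T: top-k eigenvectors as rows

C_Y = B @ S @ B.T / (n - 1)                  # covariance of the projected data
print(torch.allclose(B @ B.T, torch.eye(k), atol=1e-4))                   # orthonormal rows
print(torch.allclose(C_Y, torch.diag(torch.diagonal(C_Y)), atol=1e-4))    # (near) diagonal
print(torch.diagonal(C_Y))                   # variances = top eigenvalues / (n - 1)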

Examples

After roughly going through the idea behind PCA, I will give two examples to make the process even clearer.

Example 1: Tabular Data

This example walks through the process of PCA mathematically on a simple tabular dataset.

$$\mathbf{X} = \begin{bmatrix} 2.3 & 2.6 & 1.5 & 3.1 \\ 4.9 & 5.3 & 3.2 & 6.3 \\ 5.1 & 5.2 & 4.9 & 5.3 \\ 8.2 & 6.3 & 7.4 & 6.8 \\ 4.4 & 3.1 & 3.6 & 3.5 \end{bmatrix}$$

Our dataset contains 4 data points arranged in columns, with 5 features for each data point. I designed it so that $x_2$ is around two times $x_1$, $x_4$ is around two times $x_5$, and $x_3$ is independent.

Step 1. Center the data

We first want to make the mean of the data 0 so that we can calculate the covariance matrix later.

$$\mathbf{X}_{\text{c}} = \mathbf{X} - \bar{\mathbf{X}} = \begin{bmatrix} -0.075 & 0.225 & -0.875 & 0.725 \\ -0.025 & 0.375 & -1.725 & 1.375 \\ -0.025 & 0.075 & -0.225 & 0.175 \\ 1.025 & -0.875 & 0.225 & -0.375 \\ 0.75 & -0.55 & -0.05 & -0.15 \end{bmatrix}$$

We use $\mathbf{X} = \mathbf{X}_\text{c}$ from now on.
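If you want to follow along in code, this is the same centering step in PyTorch (a small sketch using the matrix above):

import torch

X = torch.tensor([[2.3, 2.6, 1.5, 3.1],
                  [4.9, 5.3, 3.2, 6.3],
                  [5.1, 5.2, 4.9, 5.3],
                  [8.2, 6.3, 7.4, 6.8],
                  [4.4, 3.1, 3.6, 3.5]])

mean = X.mean(dim=1, keepdim=True)   # per-feature mean, one value per row
Xc = X - mean                        # centered data, matching the matrix above
print(Xc)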

Step 2. Calculate the symmetric matrix $\mathbf{X}\mathbf{X}^T$

$$\begin{aligned} \mathbf{X}\mathbf{X}^T &= \begin{bmatrix} -0.075 & 0.225 & -0.875 & 0.725 \\ -0.025 & 0.375 & -1.725 & 1.375 \\ -0.025 & 0.075 & -0.225 & 0.175 \\ 1.025 & -0.875 & 0.225 & -0.375 \\ 0.75 & -0.55 & -0.05 & -0.15 \end{bmatrix} \begin{bmatrix} -0.075 & -0.025 & -0.025 & 1.025 & 0.75 \\ 0.225 & 0.375 & 0.075 & -0.875 & -0.55 \\ -0.875 & -1.725 & -0.225 & 0.225 & -0.05 \\ 0.725 & 1.375 & 0.175 & -0.375 & -0.15 \end{bmatrix} \\ &= \begin{bmatrix} 1.347 & 2.592 & 0.343 & -0.742 & -0.245 \\ 2.592 & 5.008 & 0.658 & -1.257 & -0.345 \\ 0.343 & 0.658 & 0.088 & -0.207 & -0.075 \\ -0.742 & -1.257 & -0.207 & 2.007 & 1.295 \\ -0.245 & -0.345 & -0.075 & 1.295 & 0.89 \end{bmatrix} \end{aligned}$$

Step 3. Find the eigenvalues and eigenvectors of the matrix

You can use your favorite method to find these quantities; I just used a solver, and this is the result:

Eigenvalues:

$$\begin{bmatrix} -4 \cdot 10^{-7} & -0.0 & 0.0 & 0.0 & 0.0 \\ -0.0 & -2.7 \cdot 10^{-7} & 0.0 & 0.0 & 0.0 \\ -0.0 & -0.0 & 0.002 & 0.0 & 0.0 \\ -0.0 & -0.0 & 0.0 & 2.351 & 0.0 \\ -0.0 & -0.0 & 0.0 & 0.0 & 6.986 \end{bmatrix}$$

Eigenvectors:

$$\begin{bmatrix} -0.833 & 0.012 & 0.324 & 0.116 & -0.434 \\ 0.455 & 0.094 & -0.053 & 0.311 & -0.827 \\ -0.042 & -0.966 & -0.227 & 0.02 & -0.111 \\ 0.15 & -0.154 & 0.541 & 0.75 & 0.313 \\ -0.274 & 0.183 & -0.74 & 0.571 & 0.132 \end{bmatrix}$$

(Note: torch.linalg.eigh returns the eigenvalues and eigenvectors in ascending order.)
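Steps 2 and 3 in code, continuing from the Xc computed above (a sketch; this is essentially the solver I used):

XXt = Xc @ Xc.T                              # the 5 x 5 symmetric matrix from Step 2
eigvals, eigvecs = torch.linalg.eigh(XXt)    # ascending order: smallest eigenvalue first
print(eigvals)                               # the last two entries are about 2.351 and 6.986
print(eigvecs)                               # columns are the corresponding eigenvectors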

Step 4. Choose your $k$ and do the reduction

In this example I will choose $k = 3$, but usually you would choose the smallest $k$ that keeps some percentage of the variance (e.g., 95% or 99%).
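One common way to pick $k$ in code (a sketch, assuming the eigvals from the previous snippet, sorted in ascending order):

explained = torch.flip(eigvals, [0]).clamp(min=0)   # largest eigenvalue first; clip tiny negatives
ratio = explained.cumsum(0) / explained.sum()       # cumulative fraction of variance retained
k = int((ratio >= 0.95).nonzero()[0]) + 1           # smallest k keeping at least 95% of the variance
print(ratio, k)

For this toy dataset the top two components already retain over 99% of the variance, so this rule would give $k = 2$; I stick with $k = 3$ below for illustration.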

When $k = 3$, we choose

$$\mathbf{B} = \mathbf{P}_k^T = \begin{bmatrix} 0.324 & -0.053 & -0.227 & 0.541 & -0.74 \\ 0.116 & 0.311 & 0.02 & 0.75 & 0.571 \\ -0.434 & -0.827 & -0.111 & 0.313 & 0.132 \end{bmatrix}$$

Then,

$$\begin{aligned} 3\mathbf{C_Y} &= \mathbf{B}(\mathbf{P}\mathbf{D}\mathbf{P}^{-1})\mathbf{B}^T \\ &= \begin{bmatrix} 0.002 & 0.0 & 0.0 \\ 0.0 & 2.351 & 0.0 \\ 0.0 & 0.0 & 6.986 \end{bmatrix} \end{aligned}$$

which is what we expected (the factor of 3 is just $n - 1$).

Step 5. Project to new basis and reconstruct data

We will now obtain the projection coefficient matrix $\mathbf{Y}$ and use it to reconstruct the data in the new basis.

$$\mathbf{Y} = \mathbf{B}\mathbf{X} = \begin{bmatrix} -0.018 & -0.031 & 0.018 & 0.03 \\ 1.181 & -0.826 & -0.503 & 0.148 \\ 0.475 & -0.762 & 1.895 & -1.608 \end{bmatrix}$$

And if we try to reconstruct the original matrix with our new basis:

$$\mathbf{B}^T\mathbf{Y} + \bar{\mathbf{X}} = \begin{bmatrix} 2.3 & 2.6 & 1.5 & 3.1 \\ 4.9 & 5.3 & 3.2 & 6.3 \\ 5.1 & 5.2 & 4.9 & 5.3 \\ 8.2 & 6.3 & 7.4 & 6.8 \\ 4.4 & 3.1 & 3.6 & 3.5 \end{bmatrix}$$

And if we use $k = 2$:

$$\begin{bmatrix} 2.306 & 2.61 & 1.494 & 3.09 \\ 4.899 & 5.298 & 3.201 & 6.302 \\ 5.096 & 5.193 & 4.904 & 5.307 \\ 8.21 & 6.317 & 7.39 & 6.784 \\ 4.387 & 3.077 & 3.613 & 3.522 \end{bmatrix}$$

And finally, $k = 1$:

$$\begin{bmatrix} 2.169 & 2.706 & 1.553 & 3.073 \\ 4.532 & 5.556 & 3.357 & 6.255 \\ 5.072 & 5.21 & 4.914 & 5.304 \\ 7.324 & 6.937 & 7.767 & 6.672 \\ 3.713 & 3.549 & 3.9 & 3.438 \end{bmatrix}$$

As you can see, PCA did a pretty good job of preserving the relationships between features with fewer dimensions.
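Steps 4 and 5 in code, continuing from the eigvecs, Xc, and mean computed in the earlier snippets (a sketch; a smaller k simply keeps fewer eigenvector rows):

for k in (3, 2, 1):
    B = eigvecs[:, -k:].T            # top-k eigenvectors as rows, i.e. P_k^T
    Y = B @ Xc                       # coefficients in the new basis
    X_rec = B.T @ Y + mean           # project back and add the mean
    print(k, X_rec)                  # matches the reconstructions shown above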

Example 2: MNIST

MNIST is a classic handwritten digit dataset. We will use PCA to reduce its dimensionality and check the reconstruction results for different values of $k$.

Here is the Google Colab for this example; I will only show the result and the PCA code in this blog.

import torch

def PCA(images, k):
    # Flatten each image into a vector and center the data
    images = images.reshape([images.shape[0], -1])
    mean = images.mean(0)
    images = images - mean
    # Calculate the covariance matrix and its eigenvalues/eigenvectors
    cov_matrix = torch.matmul(images.T, images) / (images.shape[0] - 1)
    eigen_values, eigen_vectors = torch.linalg.eigh(cov_matrix)
    # Keep the eigenvectors for the top k eigenvalues (eigh sorts them in ascending order)
    eigen_vectors_k = eigen_vectors[:, -k:]
    # Get Y, the matrix of coefficients w.r.t. the new basis
    projected_data = torch.matmul(images, eigen_vectors_k)
    # Reconstruct the images from the projection
    new_images = torch.matmul(projected_data, eigen_vectors_k.T) + mean
    return new_images
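A quick usage sketch (the random tensor here is just a stand-in for a batch of MNIST images, and k = 30 is an arbitrary choice; the real data loading lives in the Colab):

images = torch.rand(64, 28, 28)                     # pretend batch of 64 MNIST digits
reconstructed = PCA(images, k=30)                   # keep the top 30 principal components
reconstructed = reconstructed.reshape(64, 28, 28)   # back to image shape for visualization
print(reconstructed.shape)                          # torch.Size([64, 28, 28])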


Afterword

PCA is quite an old, yet powerful, linear dimension reduction algorithm. Encoder-decoder architectures and embedding systems all rely on dimension reduction, so this is very important! If you want to learn more about dimension reduction, (deep) autoencoders are a good place to start. Good luck!

It took me quite some time to compose this blog, so I hope it was clear and concise enough to follow.

My freshman year is finally over, and after my grades all roll out, I think I will write a short blog summing up my first year in college!

Taiwan food yummy~

References

A Tutorial on Principal Component Analysis

ML Lecture 13: Unsupervised Learning - Linear Methods