Friday, August 21, 2015

PCA vs Least Squares

We all know that least squares minimizes the sum of the squared errors, where each error is the difference between the measured and the predicted value of the dependent variable, and the predicted value is a linear function of the independent variables. If both the dependent and the independent variables are measured quantities and may be noisy, then we might be more interested in minimizing the orthogonal distance of the data points to a linear hyperplane, and the best such direction turns out to be the first principal component of the data. The page here has a couple of good images that get this point across. But let's see that this is indeed the case, again, with matrix algebra :)

Let's start with the basics. The projection of a vector \(x\) on a vector \(w\) can be expressed as \(\left(w^Tx\right) \frac{w}{\| w \|^2}\), which for a unit-norm vector \(w\) reduces to \(\left(w^Tx\right) w\). So, if we have a data matrix \(X\) whose columns are the data points, then the projection of the entire data matrix can be written as \(w\left(w^TX\right)\). Basically, the projections all point along \(w\), with lengths given by the entries of the row vector \(w^TX\). In PCA, we seek the direction along which the projections have the maximum spread (or variance), and we can keep treating this direction as a vector through the origin if we assume that the data matrix \(X\) has zero mean. Equivalently, we want to find a unit-norm direction \(w\) that maximizes the following:

\[\mathrm{Var}\left( w^TX\right) = w^T\mathrm{Cov}\left(X\right)w \propto w^TXX^Tw\]

(The covariance carries a constant \(1/n\) factor that does not affect which direction maximizes the expression, so we can simply maximize \(w^TXX^Tw\).)
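To make the setup concrete, here is a minimal NumPy sketch (my own toy example, with made-up data) that projects a zero-mean data matrix onto a unit-norm direction and checks that the spread of the projections equals the quadratic form \(w^TXX^Tw\):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data matrix: 2 features (rows) x 200 data points (columns), centered to zero mean.
X = rng.normal(size=(2, 200)) * np.array([[3.0], [1.0]])
X = X - X.mean(axis=1, keepdims=True)

# An arbitrary unit-norm direction w.
w = np.array([1.0, 2.0])
w = w / np.linalg.norm(w)

# Projections of the columns of X onto w: lengths w^T X, projected points w (w^T X).
lengths = w @ X                      # shape (200,)
projections = np.outer(w, lengths)   # shape (2, 200)

# The spread (sum of squared projection lengths) equals the quadratic form w^T X X^T w.
spread = lengths @ lengths
quadratic_form = w @ X @ X.T @ w
print(np.isclose(spread, quadratic_form))   # True
# (Dividing either quantity by the number of points gives the variance of the projections.)
```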

Now, let's see what we get for the sum of the squared orthogonal errors of the data points with respect to the line spanned by the vector \(w\). Each data point has an error given by the 2-norm of the vector representing its orthogonal residual. The sum of the squared errors over all data points can then be written as the squared Frobenius norm of the matrix whose columns are these error vectors.


\[\|X - w (w^TX)\|_F^2 = \mathrm{Tr}\left( (X-ww^TX)^T (X - ww^TX) \right)\]
\[= \mathrm{Tr}\left(X^TX - X^Tww^TX-X^Tww^TX+X^Tww^Tww^TX\right)\]
\[= \mathrm{Tr}\left(X^TX - X^Tww^TX-X^Tww^TX+X^Tww^TX\right)\] (as \(w\) is unit norm)
\[=\mathrm{Tr}\left( X^TX \right) - \mathrm{Tr}\left(X^Tww^TX\right)\]
\[=\mathrm{Tr}\left( X^TX \right) - \mathrm{Tr}\left(w^TXX^Tw\right)\] (Since trace of cyclic rotations of matrix products is the same)
\[=\mathrm{Tr}\left( X^TX \right) - \left(w^TXX^Tw\right)\] (The trace of a \(1\times 1\) matrix is just its scalar entry)

which is the same quadratic form \(w^TXX^Tw\) that appears in the PCA objective above, apart from a constant term \(\mathrm{Tr}\left(X^TX\right)\) that is independent of \(w\), and apart from the sign: PCA maximizes the projected variance, while here we minimize the orthogonal error, which amounts to the same thing. Now, the matrix \(A=XX^T\) is (up to a constant factor) the covariance matrix of the data points, and we seek a unit-norm direction that maximizes the quadratic form \(w^TAw\) for a positive semi-definite symmetric matrix \(A\). One can either construct the Lagrangian or write \(w\) in terms of the eigenbasis of \(A\) to see that we should choose \(w\) to be the eigenvector with the largest eigenvalue.
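As a quick numerical sanity check of the algebra (again just a sketch on made-up data), one can verify that \(\|X - w(w^TX)\|_F^2 = \mathrm{Tr}(X^TX) - w^TXX^Tw\) for any unit-norm \(w\):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 50))
X = X - X.mean(axis=1, keepdims=True)

# Any unit-norm direction w.
w = rng.normal(size=3)
w = w / np.linalg.norm(w)

# Left-hand side: squared Frobenius norm of the orthogonal residuals X - w(w^T X).
residuals = X - np.outer(w, w @ X)
lhs = np.linalg.norm(residuals, 'fro') ** 2

# Right-hand side: Tr(X^T X) - w^T X X^T w, from the derivation above.
rhs = np.trace(X.T @ X) - w @ X @ X.T @ w
print(np.isclose(lhs, rhs))   # True
```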

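Finally, to see the difference between the two fits on actual numbers, the sketch below (a toy 2D example of my own, not from any reference) compares the ordinary least squares slope of \(y\) on \(x\) with the direction of the first principal component, taken as the top eigenvector of \(XX^T\). With noise in both coordinates, the OLS slope is attenuated while the PCA direction stays close to the true line:

```python
import numpy as np

rng = np.random.default_rng(2)

# Noisy 2D data: both coordinates are measurements with noise; the true slope is 2.
n = 500
t = rng.normal(size=n)
x = t + 0.5 * rng.normal(size=n)
y = 2.0 * t + 0.5 * rng.normal(size=n)
X = np.vstack([x, y])
X = X - X.mean(axis=1, keepdims=True)

# Ordinary least squares of y on x: minimizes vertical errors only.
ols_slope = (X[0] @ X[1]) / (X[0] @ X[0])

# PCA: the top eigenvector of X X^T minimizes the orthogonal errors.
eigvals, eigvecs = np.linalg.eigh(X @ X.T)
w = eigvecs[:, np.argmax(eigvals)]
pca_slope = w[1] / w[0]

print(ols_slope, pca_slope)   # OLS slope is pulled below 2; PCA slope stays near 2.
```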