Tuesday, December 8, 2015

Linear Regression

I recently gave a talk on Linear Regression and its extensions (Ridge regression, LASSO, LARS etc.) as part of the Austin SIGKDD Advanced Machine Learning Meetup. The slides can be found here.
One of the interesting things that I learnt while preparing for this talk was the fact that coordinate descent works for objective functions where the non-differentiable part of the objective is separable (slides 29-30). This is why coordinate descent can be used to minimize the LASSO objective function, whose non-differentiable \(\mathrm{L1}\) penalty is separable across coordinates. The proof works out, and the only background needed is subgradients. Also, it is instructive to check where the proof "really" uses the property of separability and why the proof breaks down when we do not have separability.
Hint: For a separable function the subgradient set is the direct product of the subgradient sets of the individual components, so coordinate-wise optimality gives the full optimality condition. However, if the non-differentiable function is not separable, then at a coordinate-wise minimum the negative gradient of the differentiable part may not lie in the subgradient set of the non-differentiable function.
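To make this concrete, here is a minimal sketch of cyclic coordinate descent for the LASSO (a hypothetical implementation: the function names and synthetic data are just for illustration, and, unlike the column convention used elsewhere in these posts, the data points here are the rows of \(X\)). Each coordinate update is a one-dimensional problem whose closed-form solution is a soft-thresholding step, and the scheme is valid precisely because the \(\mathrm{L1}\) term is separable across coordinates.

```python
import numpy as np

def soft_threshold(a, lam):
    """Soft-thresholding: the proximal operator of lam * |.|."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def lasso_cd(X, y, lam, n_iter=100):
    """Minimize 0.5 * ||y - X b||^2 + lam * ||b||_1 by cyclic coordinate descent.

    This works because the non-differentiable L1 term is separable in the
    coordinates of b; each coordinate-wise subproblem is solved exactly.
    """
    n, d = X.shape
    b = np.zeros(d)
    r = y - X @ b                      # current residual
    col_sq = (X ** 2).sum(axis=0)      # ||x_j||^2 for each column
    for _ in range(n_iter):
        for j in range(d):
            # correlation of column j with the residual that excludes coordinate j
            rho = X[:, j] @ r + col_sq[j] * b[j]
            b_new = soft_threshold(rho, lam) / col_sq[j]
            r += X[:, j] * (b[j] - b_new)   # keep the residual up to date
            b[j] = b_new
    return b

# tiny usage example on synthetic data (purely illustrative)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 10))
beta_true = np.zeros(10)
beta_true[:3] = [2.0, -1.5, 1.0]
y = X @ beta_true + 0.1 * rng.normal(size=50)
print(lasso_cd(X, y, lam=5.0))         # sparse estimate: most coefficients are exactly 0
```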
Another interesting thing was the connection between LASSO and LARS. LASSO minimizes the quadratic residual norm with an \(\mathrm{L1}\) penalty and requires that the sign of the coefficient associated with a vector be the same as the sign of the inner product of the residual with that vector (easy to show using the Lagrangian; see the derivation below). LARS does not enforce this and can therefore produce different solutions. LARS can be modified to drop a vector from the active set when this sign change happens and can thus be tailored to generate a LASSO solution. However, the number of steps taken by LARS is then no longer bounded by \(n\), the dimensionality of the problem.
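To spell out the "easy to show" part: writing the LASSO problem as \(\min_\beta \frac{1}{2}\|y-X\beta\|^2 + \lambda\|\beta\|_1\), the subgradient optimality condition is

\[X^T\left(y-X\beta\right) = \lambda s, \qquad s \in \partial\|\beta\|_1,\]

so, with \(r = y-X\beta\) denoting the residual, each coordinate satisfies

\[x_j^Tr = \lambda\,\mathrm{sign}(\beta_j) \;\text{ if }\; \beta_j \ne 0, \qquad |x_j^Tr| \le \lambda \;\text{ if }\; \beta_j = 0.\]

Hence every active coefficient must have \(\mathrm{sign}(\beta_j) = \mathrm{sign}(x_j^Tr)\) (and \(|x_j^Tr| = \lambda\)), which is exactly the sign condition that plain LARS does not enforce.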

Friday, August 21, 2015

PCA vs Least Squares

We all know that least squares minimizes the sum of the squared errors. The error is defined as the difference between the measured and the predicted value of the dependent variable, where the predicted value is a linear function of the independent variables. If both the dependent and the independent variables are measured quantities and may be noisy, then we might be more interested in minimizing the error of the orthogonal projection of the data points onto a linear hyperplane, and this turns out to give the first principal component of the data. The page here has a couple of good images that get this point across. But let's see that this is indeed the case, again with matrix algebra :)

Let's start with the basics. The projection of a vector \(x\) on a vector \(w\) can be expressed as \((w^Tx) \frac{w}{\| w \|^2}\), which in the case of a unit-norm vector \(w\) reduces to \((w^Tx) w\). So, if we have a data matrix \(X\) whose columns are the data points, then the projection of the entire data matrix can be written as \(w\left(w^TX\right)\), a matrix whose columns are the projected points. Basically, the projections all point along \(w\) with lengths given by the entries of \(w^TX\). In PCA, we seek the direction with the maximum spread (or variance) of these projections, and we can keep treating the direction as a vector through the origin if we assume that the data matrix \(X\) has zero mean. Equivalently, we want to find a unit-norm direction \(w\) that maximizes the following:

\[\mathrm{Cov}\left( w^TX\right) = w^T\mathrm{Cov}\left(X\right)w = w^TXX^Tw\]
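As a quick numerical sanity check of the projection formula and this quadratic form, here is a minimal NumPy sketch (the random data is purely illustrative, and it uses the unnormalized convention \(\mathrm{Cov}(X)=XX^T\) from above):

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(3, 200))           # columns are data points
X = X - X.mean(axis=1, keepdims=True)   # center so directions pass through the origin

w = rng.normal(size=3)
w = w / np.linalg.norm(w)               # unit-norm direction

coords = w @ X                          # w^T X: projected coordinates along w
proj = np.outer(w, coords)              # w (w^T X): the projected points themselves

# the spread of the projections along w equals the quadratic form w^T X X^T w
print(np.allclose(coords @ coords, w @ X @ X.T @ w))
```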

Now, let's see what we get for the sum of the squared orthogonal errors of the data points projected onto the line given by the vector \(w\). Each data point has an error given by the 2-norm of the vector connecting it to its projection. The sum of these squared errors can then be written as the squared Frobenius norm of the matrix composed of the error vectors.


\[\|X - w (w^TX)\|_F^2 = \mathrm{Tr}\left( (X-ww^TX)^T (X - ww^TX) \right)\]
\[= \mathrm{Tr}\left(X^TX - X^Tww^TX-X^Tww^TX+X^Tww^Tww^TX\right)\]
\[= \mathrm{Tr}\left(X^TX - X^Tww^TX-X^Tww^TX+X^Tww^TX\right)\] (as \(w\) is unit norm)
\[=\mathrm{Tr}\left( X^TX \right) - \mathrm{Tr}\left(X^Tww^TX\right)\]
\[=\mathrm{Tr}\left( X^TX \right) - \mathrm{Tr}\left(w^TXX^Tw\right)\] (since the trace is invariant under cyclic permutations of a matrix product)
\[=\mathrm{Tr}\left( X^TX \right) - w^TXX^Tw\] (the trace of a \(1\times 1\) matrix is just its scalar entry)

which is the same as the PCA objective above, except for a constant term, \(\mathrm{Tr}\left(X^TX\right)\), that is independent of \(w\), and a flip of sign: we seek to maximize the PCA variance and to minimize the orthogonal error. Now, the matrix \(A=XX^T\) is the covariance matrix of the data points, and in both cases one seeks a unit-norm direction that maximizes the quadratic form \(w^TAw\) for a positive semi-definite symmetric matrix \(A\). One can either construct the Lagrangian or write \(w\) in terms of the eigenbasis to see that we need to choose \(w\) as the eigenvector with the largest eigenvalue.
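Here is a small NumPy sketch that checks this numerically: the top eigenvector of \(XX^T\) incurs a smaller orthogonal error than a random direction, and the orthogonal error equals \(\mathrm{Tr}(X^TX) - w^TXX^Tw\), as derived above (the anisotropic random data is just for illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.normal(size=(3, 500)) * np.array([[3.0], [1.0], [0.3]])  # columns are data points
X = X - X.mean(axis=1, keepdims=True)                            # zero-mean data

def orth_error(w, X):
    """Sum of squared orthogonal distances to the line spanned by the unit vector w."""
    return np.linalg.norm(X - np.outer(w, w @ X), 'fro') ** 2

evals, evecs = np.linalg.eigh(X @ X.T)
w_pca = evecs[:, -1]                       # eigenvector with the largest eigenvalue

w_rand = rng.normal(size=3)
w_rand = w_rand / np.linalg.norm(w_rand)   # an arbitrary unit-norm competitor

print(orth_error(w_pca, X) <= orth_error(w_rand, X))            # True: the PCA direction wins
print(np.isclose(orth_error(w_pca, X),
                 np.trace(X @ X.T) - w_pca @ X @ X.T @ w_pca))  # the identity derived above
```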

Tuesday, August 18, 2015

Logistic Regression: Gradients computed the matrix way

I really enjoy performing calculus with matrix/vector operations rather than messing around with tons of summations. The other nice thing about working with matrices/vectors is that the results come out cleanly in a form that can be used directly in Matlab or a similar high-level language, where implementations based on matrix/vector multiplications are typically much faster than writing your own for-loops to compute the result. One of the nice examples of this approach is computing the gradients for logistic regression without using any summations at all!

Before we get to logistic regression let's build up one small piece of the machinery (vectorized functions) that will help us when we get there. Let's begin with vectors \(p, z \in \mathbb{R}^n\) and let \(f_v\) be the vectorized version of a function \(f\), which applies \(f\) component-wise.

\[p = f_v(z)\]
\[\mathrm{d}p = \mathrm{Diag}((f')_v(z))\mathrm{d}z\].
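As a quick check of this rule, here is a tiny finite-difference experiment (it uses \(\tanh\) as an arbitrary smooth component-wise function; the numbers are purely illustrative):

```python
import numpy as np

f = np.tanh                               # any smooth elementwise function will do
fprime = lambda z: 1.0 - np.tanh(z) ** 2  # its derivative, applied elementwise

z = np.array([0.3, -1.2, 2.0])
dz = 1e-6 * np.array([1.0, -2.0, 0.5])    # a small perturbation

dp_linearized = np.diag(fprime(z)) @ dz   # Diag(f'_v(z)) dz
dp_actual = f(z + dz) - f(z)

print(np.allclose(dp_linearized, dp_actual, atol=1e-10))
```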

Now, our logistic regression objective function is

\[J(\theta) = \frac{1}{m} \left(   -y^T\log(p) -(1-y)^T\log(1-p)   \right)\]

where

\[p=g_v(z)\]
\[g(x)=\frac{1}{1+e^{-x}}\]
\[z=X^T\theta\]

Let's start with the simplest derivatives and then use those to compute the gradient of the objective function:

\[\mathrm{d}g=\frac{e^{-x}}{(1+e^{-x})^2}\mathrm{d}x=g(x)(1-g(x))\mathrm{d}x\]

Now, let's get to \(p\) and make use of the machinery we developed at the very beginning:

\[\mathrm{d}p = \mathrm{Diag}((g')_v(z)) \mathrm{d}z\]
\[\mathrm{d}p = \mathrm{Diag}\left(g_v(z)(1-g_v(z))\right)\mathrm{d}z=\mathrm{Diag}\left(g_v(z)(1-g_v(z))\right)X^T\mathrm{d}\theta\]

Now, we come to our original goal of computing the gradient of the objective function \(J\)

\[\mathrm{d}J = \frac{1}{m} \left(   -y^T(\mathrm{Diag}(1/p)) + (1-y)^T(\mathrm{Diag}(1/(1-p)))   \right)\mathrm{d}p\]
\[\mathrm{d}J = \frac{1}{m} \left(   -y^T(\mathrm{Diag}(1/g_v(z))) + (1-y)^T(\mathrm{Diag}(1/(1-g_v(z))))   \right)\mathrm{Diag}\left(g_v(z)(1-g_v(z))\right)X^T\mathrm{d}\theta\]
\[\mathrm{d}J = \frac{1}{m} \left(   -y^T(\mathrm{Diag}(1-g_v(z))) + (1-y)^T(\mathrm{Diag}(g_v(z)))   \right)X^T\mathrm{d}\theta\]
\[\mathrm{d}J = \frac{1}{m} \left(   g_v(z)^T - y^T    \right)X^T\mathrm{d}\theta\]

So, finally we have

\[\nabla J = \frac{1}{m}X\left(g_v(z)-y\right)\]
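Here is a minimal NumPy sketch that implements this gradient and checks it against central finite differences of \(J\) (following the layout in the text, the data points are the columns of \(X\); the synthetic data is purely illustrative):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def J(theta, X, y):
    """Logistic regression objective; data points are the columns of X."""
    p = sigmoid(X.T @ theta)
    m = y.size
    return (-y @ np.log(p) - (1.0 - y) @ np.log(1.0 - p)) / m

def grad_J(theta, X, y):
    """The matrix-form gradient derived above: (1/m) X (g_v(z) - y)."""
    m = y.size
    return X @ (sigmoid(X.T @ theta) - y) / m

# finite-difference check on random data (illustrative only)
rng = np.random.default_rng(3)
d, m = 4, 30
X = rng.normal(size=(d, m))
y = (rng.random(m) < 0.5).astype(float)
theta = rng.normal(size=d)

eps = 1e-6
g_numeric = np.array([(J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
                      for e in np.eye(d)])
print(np.allclose(grad_J(theta, X, y), g_numeric, atol=1e-6))
```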

A good reference for matrix calculus can be found here.

Saturday, August 15, 2015

Kernelizing the SVM

Recently, I learnt a few examples of the kernel trick. Earlier, I used to believe that the kernel trick was as simple as adding a few nonlinear features, running your standard machine learning algorithm that works with linear features, and getting out a nonlinear model. It's a little more than that :)
The kernel trick is really the above idea with the additional goal that the computations should never be performed explicitly in the higher-dimensional feature space that includes both the linear and the nonlinear features; instead, we get away with computations that involve the original data points only through inner products.

In this post I will go over the kernel trick as applied to the SVM and in a later post to linear least squares (my favourite thing of all).

SVM 


The basic idea of SVM is to find a linear hyperplane that separates two classes of data points. So, ideally, we want to find a normal vector \(w\) and an offset \(b\) such that for points \(x\) with the class \(+1\) we have \(w^Tx + b \ge 1\) and for points \(x\) with the class \(-1\) we have \(w^Tx + b \le -1\).


To make the problem realistic we allow for situations where the data might not be linearly separable by adding positive slacks \(s_i\) by which the above constraints can be relaxed. We want a large margin, which corresponds to a small \(w^Tw\), while also keeping the total slack \(\sum_i s_i\) small; the two goals are combined into a single objective with a user-provided weight \(c\) on the slack term.

\[\mathrm{min}_{w,b,s}:   \frac{1}{2}w^Tw + c (\mathbf{1}^Ts)\]
\[\mathrm{s.t.}: \mathrm{Diag}(y) \left(X^Tw + b\mathbf{1} \right) \ge \mathbf{1} - s\]
\[\mathrm{s.t.}: s \ge 0\]

where \(y\) represents the vector of data point classes and \(X\) is the matrix whose columns are the data points.
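As a concrete illustration, here is a minimal sketch that solves this primal QP directly with cvxpy (just one convenient way to solve it; the synthetic two-class data and all names are purely illustrative):

```python
import numpy as np
import cvxpy as cp

# synthetic, roughly separable 2-D data; columns of X are data points (illustrative only)
rng = np.random.default_rng(4)
n = 40
X = np.hstack([rng.normal(-2.0, 1.0, size=(2, n // 2)),
               rng.normal(+2.0, 1.0, size=(2, n // 2))])
y = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))
c = 1.0

w = cp.Variable(2)
b = cp.Variable()
s = cp.Variable(n)

objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(s))
constraints = [cp.multiply(y, X.T @ w + b) >= 1 - s,   # Diag(y)(X^T w + b 1) >= 1 - s
               s >= 0]
cp.Problem(objective, constraints).solve()

print(w.value, b.value)                                # the separating hyperplane
```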
The Lagrangian can then be expressed as:

\[\frac{1}{2}w^Tw + c(\mathbf{1}^Ts) - \lambda^T\left(\mathrm{Diag}(y)\left(X^Tw+b\mathbf{1} \right)-\mathbf{1}+s\right) - \mu^Ts\]
where \(\lambda\) and \(\mu\) are nonnegative Lagrange multipliers.

Minimizing over the primal variables gives us the following constraints:
\[b: \lambda^T\mathrm{Diag}(y)\mathbf{1}=\lambda^Ty=0\]
\[s: \lambda + \mu=c\mathbf{1}\]
\[w: w = X\mathrm{Diag}(y)\lambda\]

So, \(w\) turns out to be a weighted sum of the data points in the \(X\) matrix. Also, by complementary slackness, a \(\lambda_i\) can be strictly positive only if the corresponding constraint in the primal is satisfied with equality; these data points form our support vectors.

We will solve the dual problem which can be obtained by simplifying the Lagrangian using the constraints obtained above to get

\[\mathrm{min}_{\lambda}: \frac{1}{2}\lambda^T\mathrm{Diag}(y)\left(X^TX\right)\mathrm{Diag}(y)\lambda - \lambda^T\mathbf{1}\]
\[\mathrm{s.t.}: \lambda^Ty=0\]
\[\mathrm{s.t.}: 0 \le \lambda \le c\mathbf{1}\]

Now, the kernel trick is to observe that in the dual problem the data matrix \(X\) only appears multiplied with its own transpose. The matrix \(X^TX\) can be thought of as a matrix whose \((i,j)\) entry is the dot-product \(x_i^Tx_j\) of the \(i\)-th and \(j\)-th data points. Using the kernel trick we replace each entry by a nonlinear kernel function \(K(x_i, x_j)\), which captures the inner product in the higher-dimensional feature space. More on kernel functions later.

Another important thing to note is that the kernel trick should carry over to the prediction step as well. In the SVM case the prediction is \(\mathrm{sgn}(w^Tx_{test}+b)\). Now, \(w^Tx_{test}=\lambda^T\mathrm{Diag}(y)X^Tx_{test}\), and we see that the data points again appear only through dot-products, which we can replace with the kernel function. The value of \(b\) can also be obtained with the data points involved only in dot-products, by observing that the primal classification constraint is satisfied with equality and \(s_i=0\) whenever \(0 \lt \lambda_i \lt c\). When \(0 \lt \lambda_i\) the constraint \(y_i(w^Tx_i+b)\ge1-s_i\) is satisfied with equality, and when \(\lambda_i \lt c\) we have a non-zero value for \(\mu_i\), which forces \(s_i\) to be zero. In that case we can evaluate \(b=y_i-w^Tx_i\). We already saw that the \(w^Tx_i\) term involves only dot-products once we plug in the expression for \(w\).
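Putting the pieces together, here is a minimal end-to-end sketch of a kernelized SVM: build an RBF Gram matrix (one example of a kernel), solve the dual with cvxpy, recover \(b\) from the margin support vectors, and predict using kernel evaluations only. The quadratic term is written through a Cholesky factor of \(K\) so that the solver sees an explicit sum of squares; the solver, kernel choice, and synthetic data are all illustrative rather than canonical.

```python
import numpy as np
import cvxpy as cp

def rbf_kernel(A, B, gamma=0.5):
    """K[i, j] = exp(-gamma * ||a_i - b_j||^2), with data points as the columns of A and B."""
    sq = (A ** 2).sum(axis=0)[:, None] + (B ** 2).sum(axis=0)[None, :] - 2.0 * A.T @ B
    return np.exp(-gamma * sq)

# synthetic two-class data; columns of X are data points (illustrative only)
rng = np.random.default_rng(5)
n = 60
X = np.hstack([rng.normal(-1.5, 1.0, size=(2, n // 2)),
               rng.normal(+1.5, 1.0, size=(2, n // 2))])
y = np.array([-1.0] * (n // 2) + [1.0] * (n // 2))
c = 1.0

K = rbf_kernel(X, X)                            # the Gram matrix that replaces X^T X
L = np.linalg.cholesky(K + 1e-8 * np.eye(n))    # tiny ridge for numerical stability

lam = cp.Variable(n)
# 0.5 * lam^T Diag(y) K Diag(y) lam - 1^T lam, written via the Cholesky factor of K
objective = cp.Minimize(0.5 * cp.sum_squares(L.T @ cp.multiply(y, lam)) - cp.sum(lam))
constraints = [y @ lam == 0, lam >= 0, lam <= c]
cp.Problem(objective, constraints).solve()
lam_v = lam.value

# b from margin support vectors (0 < lam_i < c, so s_i = 0 and the constraint is tight)
on_margin = (lam_v > 1e-5) & (lam_v < c - 1e-5)
if not on_margin.any():                         # heuristic fallback: use all support vectors
    on_margin = lam_v > 1e-5
wTx = K @ (lam_v * y)                           # w^T x_i for every training point, via kernels only
b = np.mean(y[on_margin] - wTx[on_margin])

def predict(X_test):
    """sgn(w^T x + b), with w^T x expressed purely through kernel evaluations."""
    return np.sign(rbf_kernel(X, X_test).T @ (lam_v * y) + b)

print((predict(X) == y).mean())                 # training accuracy of the kernelized SVM
```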