Saturday, August 15, 2015

Kernelizing the SVM

Recently, I learnt a few examples of the kernel trick. I used to believe that the kernel trick was as simple as adding a few nonlinear features, running your standard machine learning algorithm that works with linear features, and getting out a nonlinear model. It's a little more than that :)
The kernel trick is really the above idea, with the additional goal that the computations should not be performed in the higher-dimensional feature space that includes both the linear and the nonlinear features; instead, we get away with performing computations in a space that has the same dimensionality as the original set of linear features.

In this post I will go over the kernel trick as applied to the SVM and in a later post to linear least squares (my favourite thing of all).

SVM 


The basic idea of the SVM is to find a hyperplane that separates two classes of data points. Ideally, we want to find a normal vector \(w\) and an offset \(b\) such that for points \(x\) with class \(+1\) we have \(w^Tx + b \ge 1\), and for points \(x\) with class \(-1\) we have \(w^Tx + b \le -1\).


To make the problem realistic we allow for situations where the data might not be linearly separable, by adding positive slacks \(s_i\) by which the above constraints can be relaxed. We want to keep the total slack \(\sum_i s_i\) small while also maximizing the margin, which amounts to keeping the quadratic term \(w^Tw\) small; the two goals are combined into a single objective with a user-provided weight \(c\) on the slack term.

\[\mathrm{min}_{w,b,s}:   \frac{1}{2}w^Tw + c (\mathbf{1}^Ts)\]
\[\mathrm{s.t.}: \mathrm{Diag}(y) \left(X^Tw + b\mathbf{1} \right) \ge \mathbf{1} - s\]
\[\mathrm{s.t.}: s \ge 0\]

where \(y\) is the vector of class labels and \(X\) is the matrix whose columns are the data points.
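
To make this concrete, here is a minimal sketch of the primal above, using numpy and the cvxpy modelling library (neither appears in the original post); the toy data, the dimensions and the weight \(c\) are made up for illustration.

    # Soft-margin SVM primal, as written above: X is d x n with data points as
    # columns, y holds +/-1 labels, and c weights the slack penalty.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    d, n, c = 2, 40, 1.0
    X = np.hstack([rng.normal(-1.5, 1.0, (d, n // 2)),   # class -1 cloud
                   rng.normal(+1.5, 1.0, (d, n // 2))])  # class +1 cloud
    y = np.r_[-np.ones(n // 2), np.ones(n // 2)]

    w = cp.Variable(d)
    b = cp.Variable()
    s = cp.Variable(n)

    objective = cp.Minimize(0.5 * cp.sum_squares(w) + c * cp.sum(s))
    constraints = [cp.multiply(y, X.T @ w + b) >= 1 - s,  # Diag(y)(X^T w + b 1) >= 1 - s
                   s >= 0]
    cp.Problem(objective, constraints).solve()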
The Lagrangian can then be expressed as:

\[\frac{1}{2}w^Tw + c(\mathbf{1}^Ts) - \lambda^T\left(\mathrm{Diag}(y)\left(X^Tw+b\mathbf{1}\right)-\mathbf{1}+s\right) - \mu^Ts\]
where \(\lambda\) and \(\mu\) are nonnegative Lagrange multipliers.

Minimizing over the primal variables gives us the following constraints:
\[b: \lambda^T\mathrm{Diag}(y)\mathbf{1}=\lambda^Ty=0\]
\[s: \lambda + \mu=c\mathbf{1}\]
\[w: w = X\mathrm{Diag}(y)\lambda\]

So \(w\) turns out to be a weighted sum of the data points in the \(X\) matrix. Also, by complementary slackness, a \(\lambda_i\) can be strictly positive only if the corresponding constraint in the primal is satisfied with equality; those data points form our support vectors.
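
Continuing the hypothetical cvxpy sketch above, the multipliers \(\lambda\) can be read off as the dual values of the classification constraint, and these conditions can be checked numerically (sign conventions follow cvxpy, which returns nonnegative multipliers for inequality constraints):

    # lambda is the dual of the first constraint; the checks below mirror the
    # stationarity conditions for w and b derived above.
    lam = constraints[0].dual_value
    print(np.allclose(w.value, X @ (y * lam), atol=1e-4))  # w = X Diag(y) lambda
    print(abs(lam @ y) < 1e-4)                             # lambda^T y = 0
    support = np.where(lam > 1e-6)[0]                      # support vector indices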

We will solve the dual problem, which is obtained by simplifying the Lagrangian using the constraints above:

\[\mathrm{min}_{\lambda}: \frac{1}{2}\lambda^T\mathrm{Diag}(y)\left(X^TX\right)\mathrm{Diag}(y)\lambda - \lambda^T\mathbf{1}\]
\[\mathrm{s.t.}: \lambda^Ty=0\]
\[\mathrm{s.t.}: 0 \le \lambda \le c\mathbf{1}\]

Now, the kernel trick is to observe that in the dual problem the data matrix \(X\) appears only in a multiplication with its transpose. The composite \(X^TX\) can be thought of as a matrix whose \((i,j)\) entry is the dot product \(x_i^Tx_j\) of the \(i\)-th and \(j\)-th data points. Using the kernel trick, we replace each entry by a nonlinear function \(K(x_i, x_j)\) which captures the inner product in the higher-dimensional feature space. More on kernel functions later.
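
As a sketch of what this looks like in code, continuing the example above, the \(X^TX\) matrix is replaced by a Gram matrix built from an RBF kernel (one common choice; this post does not commit to a particular kernel), and the dual is solved with cvxpy; the small ridge added to the quadratic form is only there to keep it numerically positive semidefinite.

    # K(x_i, x_j) = exp(-gamma * ||x_i - x_j||^2), computed for all pairs.
    def rbf_gram(A, B, gamma=0.5):
        # A is d x m, B is d x n; returns the m x n matrix of kernel values.
        sq = (A**2).sum(0)[:, None] + (B**2).sum(0)[None, :] - 2 * A.T @ B
        return np.exp(-gamma * sq)

    K = rbf_gram(X, X)                 # plays the role of X^T X in the dual
    Q = np.diag(y) @ K @ np.diag(y)

    lam_k = cp.Variable(n)
    dual = cp.Problem(
        cp.Minimize(0.5 * cp.quad_form(lam_k, Q + 1e-8 * np.eye(n)) - cp.sum(lam_k)),
        [lam_k @ y == 0, lam_k >= 0, lam_k <= c])
    dual.solve()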

Another important thing to note is that the kernel trick should carry over to the prediction step as well. In the SVM case the prediction is \(\mathrm{sgn}(w^Tx_{test}+b)\). Now, \(w^Tx_{test}=\lambda^T\mathrm{Diag}(y)X^Tx_{test}\), so the data points again appear only in dot products, which we replace with the kernel function. The value of \(b\) can also be obtained with the data points involved only in dot products, by observing that the primal classification constraint is satisfied with equality and \(s_i=0\) whenever \(0 \lt \lambda_i \lt c\): when \(0 \lt \lambda_i\) the constraint \(y_i(w^Tx_i+b)\ge1-s_i\) is satisfied with equality, and when \(\lambda_i \lt c\) we have a nonzero \(\mu_i\), which forces \(s_i\) to be zero. In that case we can evaluate \(b=y_i-w^Tx_i\), and we already saw that the \(w^Tx_i\) term involves only dot products once we plug in the expression for \(w\).
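
Finally, here is a sketch of the kernelized prediction step, continuing the example above; it assumes that at least one margin support vector with \(0 \lt \lambda_i \lt c\) exists in the toy data.

    # Recover b from a margin support vector, then classify test points using
    # only kernel evaluations against the training data.
    lam_v = lam_k.value
    margin = np.where((lam_v > 1e-6) & (lam_v < c - 1e-6))[0]
    i = margin[0]                              # assumes a margin support vector exists
    b_k = y[i] - (lam_v * y) @ K[:, i]         # b = y_i - w^T x_i in kernel form

    X_test = rng.normal(size=(d, 5))           # a few hypothetical test points
    K_test = rbf_gram(X, X_test)               # n x n_test kernel evaluations
    pred = np.sign((lam_v * y) @ K_test + b_k)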
