Tuesday, August 18, 2015

Logistic Regression: Gradients computed the matrix way

I really enjoy doing calculus with matrix/vector operations rather than messing around with tons of summations. Another nice thing about working with matrices and vectors is that the results come out cleanly, in a form that can be used directly in Matlab or a similar high-level language, where implementations built on matrix/vector multiplication are typically much faster than hand-rolled for-loops. A nice example of this approach is computing the gradient for logistic regression without writing a single summation!

Before we get to logistic regression, let's build up one small piece of machinery (vectorized functions) that will help us when we get there. Let's begin with vectors \(p, z \in \mathbb{R}^n\) and let \(f_v\) be the vectorized version of a function \(f\), i.e. \(f_v\) applies \(f\) component-wise:

\[p = f_v(z)\]
\[\mathrm{d}p = \mathrm{Diag}((f')_v(z))\,\mathrm{d}z\]

where \(\mathrm{Diag}(v)\) denotes the diagonal matrix with the entries of the vector \(v\) on its diagonal.
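
To see this rule in action numerically, here is a quick NumPy sketch (my own illustration, with \(f = \tanh\) as an arbitrary smooth choice) that checks a finite-difference Jacobian of a component-wise function against \(\mathrm{Diag}((f')_v(z))\):

```python
# Check that the Jacobian of a component-wise map p = f_v(z) is diagonal,
# equal to Diag(f'_v(z)). Uses f = tanh as an illustrative choice.
import numpy as np

rng = np.random.default_rng(0)
n = 4
z = rng.standard_normal(n)

f = np.tanh
fprime = lambda x: 1 - np.tanh(x) ** 2

# Finite-difference Jacobian of p = f_v(z): column j is dp/dz_j.
eps = 1e-6
J = np.empty((n, n))
for j in range(n):
    dz = np.zeros(n)
    dz[j] = eps
    J[:, j] = (f(z + dz) - f(z - dz)) / (2 * eps)

print(np.allclose(J, np.diag(fprime(z)), atol=1e-8))  # True
```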

Now, our logistic regression objective function is

\[J(\theta) = \frac{1}{m} \left( -y^T\log(p) - (1-y)^T\log(1-p) \right)\]

where

\[p = g_v(z)\]
\[g(x) = \frac{1}{1+e^{-x}}\]
\[z = X^T\theta\]

Here \(X\) holds one training example per column, \(\theta\) is the parameter vector, and \(y \in \{0,1\}^m\) is the vector of labels for the \(m\) training examples.
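
Translating these definitions into code makes the conventions concrete. Here is a minimal NumPy sketch (the names `g`, `J`, `X`, `y`, `theta` are my own; remember that \(X\) stores one example per column, so \(z = X^T\theta\) has one entry per example):

```python
# A minimal sketch of the objective J(theta) as written above.
import numpy as np

def g(x):
    """Sigmoid, applied component-wise."""
    return 1.0 / (1.0 + np.exp(-x))

def J(theta, X, y):
    """Logistic regression objective; X is (n features) x (m examples)."""
    m = X.shape[1]
    p = g(X.T @ theta)  # p = g_v(z), z = X^T theta
    return (-y @ np.log(p) - (1 - y) @ np.log(1 - p)) / m
```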

Let's start with the simplest derivative, that of \(g\) itself, and then use it to compute the gradient of the objective function.

\[\mathrm{d}g = \frac{e^{-x}}{(1+e^{-x})^2}\,\mathrm{d}x = \frac{1}{1+e^{-x}} \cdot \frac{e^{-x}}{1+e^{-x}}\,\mathrm{d}x = g(x)(1-g(x))\,\mathrm{d}x\]
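
The identity \(g'(x) = g(x)(1-g(x))\) is easy to spot-check against a centered finite difference; a small sketch:

```python
# Sanity check: g'(x) = g(x)(1 - g(x)), verified numerically.
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-5, 5, 11)
eps = 1e-6
numeric = (g(x + eps) - g(x - eps)) / (2 * eps)
print(np.allclose(numeric, g(x) * (1 - g(x)), atol=1e-8))  # True
```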

Now, let's get to \(p\) and make use of the machinery we developed at the very beginning:

\[\mathrm{d}p = \mathrm{Diag}((g')_v(z))\,\mathrm{d}z\]
\[\mathrm{d}p = \mathrm{Diag}(g_v(z)(1-g_v(z)))\,\mathrm{d}z = \mathrm{Diag}(g_v(z)(1-g_v(z)))\,X^T\mathrm{d}\theta\]
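
In other words, the Jacobian of \(p\) with respect to \(\theta\) is \(\mathrm{Diag}(g_v(z)(1-g_v(z)))X^T\). Here is a numerical spot-check of that claim (dimensions and names are illustrative):

```python
# Check d p / d theta = Diag(g_v(z)(1 - g_v(z))) X^T numerically.
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)
n, m = 3, 5
X = rng.standard_normal((n, m))  # one example per column
theta = rng.standard_normal(n)
z = X.T @ theta

# Finite-difference Jacobian: column j is dp/dtheta_j.
eps = 1e-6
Jp = np.empty((m, n))
for j in range(n):
    dt = np.zeros(n)
    dt[j] = eps
    Jp[:, j] = (g(X.T @ (theta + dt)) - g(X.T @ (theta - dt))) / (2 * eps)

gz = g(z)
print(np.allclose(Jp, np.diag(gz * (1 - gz)) @ X.T, atol=1e-7))  # True
```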

Now we come to our original goal, computing the gradient of the objective function \(J\):

\[\mathrm{d}J = \frac{1}{m} \left( -y^T\mathrm{Diag}(1/p) + (1-y)^T\mathrm{Diag}(1/(1-p)) \right)\mathrm{d}p\]
\[\mathrm{d}J = \frac{1}{m} \left( -y^T\mathrm{Diag}(1/g_v(z)) + (1-y)^T\mathrm{Diag}(1/(1-g_v(z))) \right)\mathrm{Diag}(g_v(z)(1-g_v(z)))\,X^T\mathrm{d}\theta\]
\[\mathrm{d}J = \frac{1}{m} \left( -y^T\mathrm{Diag}(1-g_v(z)) + (1-y)^T\mathrm{Diag}(g_v(z)) \right)X^T\mathrm{d}\theta\]
\[\mathrm{d}J = \frac{1}{m} \left( g_v(z)^T - y^T \right)X^T\mathrm{d}\theta\]

So, reading off the gradient from \(\mathrm{d}J = (\nabla J)^T\mathrm{d}\theta\), we finally have

\[\nabla J = \frac{1}{m}X\left(g_v(z)-y\right)\]
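
And that is the promised summation-free gradient: a single matrix-vector product. Here is a sketch of how it looks in NumPy (the same one-liner translates directly to Matlab), together with a finite-difference check; all names are my own:

```python
# Summation-free gradient of the logistic regression objective,
# checked against centered finite differences of J.
import numpy as np

def g(x):
    return 1.0 / (1.0 + np.exp(-x))

def J(theta, X, y):
    m = X.shape[1]
    p = g(X.T @ theta)
    return (-y @ np.log(p) - (1 - y) @ np.log(1 - p)) / m

def grad_J(theta, X, y):
    m = X.shape[1]
    return X @ (g(X.T @ theta) - y) / m  # nabla J = (1/m) X (g_v(z) - y)

rng = np.random.default_rng(1)
n, m = 4, 20
X = rng.standard_normal((n, m))          # one example per column
y = rng.integers(0, 2, m).astype(float)  # binary labels
theta = rng.standard_normal(n)

eps = 1e-6
numeric = np.array([
    (J(theta + eps * e, X, y) - J(theta - eps * e, X, y)) / (2 * eps)
    for e in np.eye(n)
])
print(np.allclose(numeric, grad_J(theta, X, y), atol=1e-7))  # True
```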

A good reference for matrix calculus can be found here.
