Tying loose ends

Monday, December 12, 2016

Logistic Regression

I recently gave a talk on Logistic Regression and its usage using scikit-learn as part of the Austin SIGKDD Advanced Machine Learning Meetup. The slides can be found here.

Friday, October 21, 2016

Quasi Newton Methods - Part II (Symmetric Rank 1)

In the previous post we saw that the "smallest" update to a matrix so that it satisfies a given linear constraint is a rank-1 update. However, the updated matrix may lose it symmetric and positive definite characteristics. In this post we will find a matrix update that maintains the symmetric nature of the matrix and we will stick to a rank-1 update.

A symmetric rank-1 update can be expressed as \(\alpha xx^T\) where \(x\) is some vector and \(\alpha\) is \(\pm 1\). So, compared to our previous post, instead of the matrix \(X\) we have this symmetric rank-1 update and we still have the same constraint, namely,
\[ \alpha xx^T a = b\]
which can be rewritten as
\[ \alpha \left( x^Ta \right) x = b\]
and straightaway tells us that \(x\) is linearly dependent on \(b\). Plugging, \(x=\beta b\) into the original constraint we get:
\[ \alpha \beta^2 (b^Ta) b = b\]
which implies that
\[ \alpha \beta^2 = \frac{1}{b^Ta} \]
giving us our symmetric rank-1 update as \(\frac{bb^T}{b^Ta}\). Mapping, the symbols back to the original problem we get:
\[J'_n = J'_{n-1} + \frac{ (\Delta f - J_{n-1}x_{n-1})(\Delta f - J_{n-1}x_{n-1})^T}{(\Delta f - J_{n-1}x_{n-1})^T(\Delta x)}\]
This update is known by the TLA SR1 (Symmetric Rank-1 update) and is the update for a symmetric rank-1 update that satisfies the secant condition. Depending on the sign of the denominator this update may or may not be positive definite. Since SR1 was a result of just requiring the update to be symmetric and rank-1 and no other constraint we cannot hope to obtain a symmetric positive definite rank-1 update that satisfies the secant condition. We will again use the approach of Part I and look for a "smallest" update that gives us both symmetry and positive definiteness and we will end up with a rank-2 update and the famous DFP and BFGS formulas.

Tuesday, October 11, 2016

Quasi Newton Methods - Part I (Rank 1)

This time we will be looking at Quasi Newton Methods. But lets start off with an introduction to Newton's Method, firstly, looking at its application at root finding. Let \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) be a function and we want to find an \(x\) such that \(f(x)=0\). If we start from an initial guess \(x_0\) we can use Newton's method to iteratively update the initial guess. In its most basic form we look at the Jacobian matrix of \(f\) and create a simple first order approximation of the function and use the zero of the approximate function as the initial guess for the next iteration of the method. So,

\[ f_{approx}(x) = f(x_0) + J_0(x-x_0)\]

where \(f_{approx}\) represents the first order approximation to \(f\) and \(J_0\) is the Jacobian matrix of \(f\) at \(x_0\). So, our next guess will be the zero of \(f_{approx}\) which can be expressed as (assuming \(J_0\) is invertible)

\[ x_1 = x_0 - J_{0}^{-1}f(x_0)\]

and is known as the "full" Newton step. In general, going from the \(n-1^{th}\) iterate to \(n^{th}\) iterate we have

\[ x_{n} = x_{n-1} - J_{n-1}^{-1}f(x_{n-1})\]

Computing, the Jacobian matrix at each iterate is a computationally expensive step and "Quasi" Newton methods attempt to approximate the Jacobian matrix based on the Jacobian used to obtain the iterate. Lets impose the condition that the approximation we construct should match the function at the previous iterate, in addition to the current iterate (as in the very first equation in this post). This gives,

\[f_{n-1} = f_n + J'_n (x_{n-1}-x_{n})\]

where \(J'\) is our approximation to the Jacobian. The constraint above can be succinctly represented as \(J'_n\Delta x = \Delta f\) and is known as the Secant Condition. The constraint is not enough to approximate the Jacobian as we have \(n^2\) variables and only \(n\) constraints. Since, we want to use the previous Jacobian to approximate the new Jacobian lets add an objective function that minimizes the Frobenius norm of the change. This given us our overall optimization problem as:

\[\mathrm{min} : \frac{1}{2}\|J'_n-J'_{n-1}\|_F\]

\[\mathrm {s.t.} : J'_n\Delta x = \Delta f\]

Lets rewrite the above by using \(X = J'_n-J'_{n-1}\) and \(a = \Delta x\) and \(b = \Delta f - J'_{n-1}\Delta x\) as

\[\mathrm{min} : \frac{1}{2}\|X\|_F\]

\[\mathrm {s.t.} : Xa = b\]

Lets, first square the objective function (which does not change the optimization problem as we have a positive objective function) and solve the resulting problem by introducing the Lagrangian and using Matrix derivatives:

\[L(X, \lambda) = \frac{1}{2}\mathbb{Tr}(X^TX) + \lambda^T(Xa-b) \]

\[\mathrm{d}L = \mathbb{Tr}\left((X^T\mathrm + a\lambda^T)\mathrm{d}X\right) \]

Which implies that at the optimal point \((X^*, \lambda^*)\), \(X^* + \lambda^* a^T=0\). Plugging, this into the original constraint of the optimization problem we get \(\lambda^* = \frac{-b}{a^Ta}\). Plugging, this back into the expression for \(X^*\), finally, we get

\[ X^* = \frac{ba^T}{a^Ta}\]

So we see that the update is Rank-1 and using this we can now write the expression for the updated Jacobian as

\[J'_n = J'_{n-1} + \frac{ (\Delta f - J_{n-1}x_{n-1})(\Delta x)^T}{(\Delta x)^T(\Delta x)}\]

which is known as Broyden's good method. The Broyden's bad method is obtained if we try to minimize the Frobenius norm of the update to the inverse of the Jacobian and can be easily written down by using the result for our simplified optimization problem (by flipping the roles of \(a\) and \(b\)) as

\[J'^{-1}_n = J'^{-1}_{n-1} + \frac{ (\Delta x - J^{-1}_{n-1}f_{n-1})(\Delta f)^T}{(\Delta f)^T(\Delta f)}\]

Using the Sherman–Morrison formula we can write the update to the inverse of the Jacobian in the good Broyden case and the update to the Jacobian in the bad Broyden case as rank-1 updates as well.

There are two shortcomings of this result:

1) The update is not symmetric. So even if we start with a symmetric Jacobian we will lose symmetry. This is important if we look at Newton's method as an optimization method where the function \(f\) will be a gradient of an objective function and the Jacobian will be the Hessian of the objective function and, therefore, symmetric. The next post will be about a symmetric update to the Jacobian though still rank-1.

2) The update is not positive definite. Again a problem when we are looking to optimize say a convex function. The third post in this series will be on the DFP and BFGS updates that will get us the desired SPD (symmetric positive definite) update, but we will need to go one rank up to a rank-2 update.

Tuesday, September 20, 2016

Bishop PRML Chapter 1 (1.15-1.16)

Finding \( n(D,M) \) (problem 1.15) is finding the number of ways we can sum \(D\) non-negative numbers to M. This is a basic combinatorics problem with a number of interesting solution methods. I will outline three approaches that feel somewhat different and find application in solving other problems.

\[ \mathrm{Solve:} \quad x_1+x_2+...+x_D=M \]
\[ x_i \gt =0 \quad 1 \le i \le D\]

Method 1: Using generating functions

Coefficient of \(x^M\) in \( \left(1+x+x^2+....\right)^D \)

=Coefficient of \(x^M\) in \( \left(1-x\right)^{-D} \)

Using -ve binomial series the coefficient is \( \left( \begin{matrix} D+M-1\\M \end{matrix} \right) \)

Method 2: Bars and Balls

Lets say we have a set of \(M\) 1’s, and a set of \(D-1\) bars. The \(D-1\) bars split the 1’s into the number of 1's representing each of the \(D x_i\)’s . So in total we have \(D+M-1\) positions out of which we select \(M\) position for 1’s and the rest are for the bars.

A following similar approach yields the wrong solution: So we have \(M\) 1’s so we can select each bar in \(M+1\) ways. The mistake here is that selecting the first and second bars in positions 1 and 2 is the same as selecting them in positions 2 and 1. We need each new bar to come to the right of the ones we have already positioned. Okay, so can't we just select the position of \(D-1\) bars from \(M+1\) positions. This gives us the solution to the problem where each \(x_i \gt 0\) because we cannot choose the same position for two bars. However, when \(x_i \ge 0\) we are allowed to choose bars from the same position between balls so we start off with a set of positions for bars and balls and then select the positions of the bars.

Method 3: Indices

Let each \(x_i\) represent the number of balls in box \(i\). The number of boxes is \(D\) and the number of balls is \(M\). Represent each solution as a non-decreasing list of box numbers. E.g. 11335 represents 2 balls in box 1, two balls in box 3 and one ball in box 5. Now create another sequence which is the indices of the balls 01234. Add the two sequences 12569. We can see that all increasing sequences can be generated with a smallest value of 1 and a largest value of \(D+M-1\). So the number of ways is again \( \left( \begin{matrix} D+M-1\\M \end{matrix} \right) \).

The extension to problem 1.16 is then fairly straightforward. We just add another variable giving us a total of \(D+1\) dimensions and solve the same problem as above. The newly added variable will take over the leftover whenever the other \(D\) variables sum to less than \(M\). So, in this case we have the number of ways as \( \left( \begin{matrix} D+M\\M \end{matrix} \right) \)

Sunday, July 31, 2016

Puzzles: Languages and locks

I recently heard a fun puzzle. Puzzle: How many languages are needed so that we can have that for any two people in a group of 70 each one knows a language that the other does not.

Solution: Basically we need to assign to each person a unique subset of languages. If there are K languages number of such such subsets will be maximum if each person knows K/2 languages. So, we will need the number of languages K to be such that

\[ \left( \begin{matrix} K\\K/2 \end{matrix} \right) \ge 70 \]

Another, variant of this puzzle is regarding bank locks. Puzzle: Suppose, you have \(n\) people working at a bank and you want that at least \(m\) should be present to be able to open a lock. How many locks do you need, and how many keys are needed for each lock.

Solution: For each subset of \(m-1\) people there should be a lock that they cannot open so we need

\[ \left( \begin{matrix} n\\m-1 \end{matrix} \right) \]

locks. Also, for any lock we cannot find m people that together cannot open it. So each lock should have \(n-(m-1)\) keys.

Trace and transpose trick: Computing derivatives

I am not sure why but none of the college courses on linear algebra or calculus ever introduced me to working with derivatives of functions of matrices and vectors. This comes up on a daily basis in optimization and is a lot cleaner than dealing with summations. Here is a trick that comes in very handy when taking derivatives of a functions that maps vectors or matrices to scalars. Basically, one can replace a scalar results of matrix vector products by its transpose and in some cases by their trace (and then use the cyclic property of trace to rewrite expressions) in which the desired terms appear at the end and are correctly transposed.

Simplest case: Linear function takes a vector to a scalar.
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(c^Tx \right) = c^T \mathrm{d}x\]

Next level: Function takes a vector to a scalar using a quadratic form. We will use the transpose trick here:
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(x^TAx \right) = (\mathrm{d}x)^TAx + x^TA(\mathrm{d}x) \]

The first term on the RHS is a scalar so we can take its transpose and get

\[ \mathrm{d} \left(x^TAx \right) = x^TA^T(\mathrm{d}x) + x^TA(\mathrm{d}x) = (x^TA^T + x^TA) \mathrm{d}x\]

Next level: Function takes a matrix to a scalar. We will use the trace trick here:

\[f : \mathbb{R}^{nxn} \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(a^TXb \right) = a^T (\mathrm{d}X)b \]
Now since the RHS is a scalar we can think of it as a trace and rotate the terms to get:
\[ \mathrm{d} \left(a^TXb \right) = \mathrm{Tr}(ba^T \mathrm{d}X) \].
Now, we will use the symbol (:) that rewrites the matrix as a concatenation of its vectors and rewrite the above as:
\[ \mathrm{d} \left(a^TXb \right) = (ab^T): (\mathrm{d}X): \].
Giving us \( \frac{\mathrm{d} \left(a^TXb \right)}{\mathrm{d}X} = ab^T \).

Lets use this machinery to attempt something more complicated and find \( \frac{\mathrm{d}(a^TX^TCXb)}{\mathrm{d}X} \)

\[ \mathrm{d} (a^TX^TCXb) = a^T(\mathrm{d}X^T)CXb + a^TX^TC(\mathrm{d}X)b \]

\[ = b^TX^TC^T(dX)a + a^TX^TC(dX)b \]

\[ = \mathrm{Tr}( b^TX^TC^T(\mathrm{d}X)a + a^TX^TC(\mathrm{d}X)b ) \]

\[ = \mathrm{Tr}( ab^TX^TC^T(\mathrm{d}X) + ba^TX^TC(\mathrm{d}X)) \]

\[ = (CXba^T+C^TXab^T): (\mathrm{d}X): \]

\[ \Rightarrow \mathrm{d}(a^TX^TCXb)/\mathrm{d}X = (CXba^T+C^TXab^T) \]

Bell states to test nature's inherent uncertainity

This post is to succinctly describe the key ideas that are presented by Prof. Vazirani in his QMQC course on EdX (I saw this when it was originally offered on coursera) related to Bell's idea to show that quantum randomness is (or is not) inherent in nature.

So, as usual, we have two players Alice and Bob who can initially agree on a strategy and are then separated by a huge distance over which we will assume that they cannot communicate. Now, the game is that both Alice and Bob receive a binary input and produce a binary input. They win when they generate different inputs only when both their inputs are 1, otherwise they should produce the same output. Classically, we can see that the best that Alice and Bob can do is win with a probability of 0.75.

However, if they can share an entangled Bell state then they can achieve a better probability of success. An entangled state of two particles is one which cannot be expressed as a product of the states of the two particles. So, as an example,

\[ \lvert \Psi \rangle = \frac{1}{\sqrt{2}} \left( \lvert 00 \rangle + \lvert 11 \rangle \right) \]

is an entangled state. It is instructive to see that this state cannot be written as a tensor product of the individual states of the two particles. Now, the next fun thing to note is that the representation of this state when written in any other orthonormal basis \(u, u^{\perp} \) does not change (again, expand out the tensor products to see this). So, our state above can be expressed as

\[ \lvert \Psi \rangle = \frac{1}{\sqrt{2}} \left( \lvert uu \rangle + \lvert u^{\perp}u^{\perp} \rangle \right) \]

Now, let Alice and Bob share the entangled state with one particle each before separation and are then taken to two corners of the universe. They can choose their basis (x for Alice and y for Bob) as shown in the figure below and announce their result depending on whether they see the particle oriented along their basis vector or orthogonal to it. In, all cases except when \(x=y=1\) their basis vectors are oriented at an angle of \( \frac{\pi}{8} \), and only when \(x=y=1\) there basis vectors are orientated at an angle of \( \frac{3\pi}{8} \). In both cases the probability of winning is \( \cos^2\frac{\pi}{8} \approx 0.85 \), much better than can be obtained classically.

A nice article discussing the questions still surrounding the experimental results with Bell's states appeared here recently.