Friday, October 21, 2016

Quasi Newton Methods - Part II (Symmetric Rank 1)

In the previous post we saw that the "smallest" update to a matrix so that it satisfies a given linear constraint is a rank-1 update. However, the updated matrix may lose its symmetric and positive definite character. In this post we will find an update that preserves the symmetry of the matrix while still sticking to a rank-1 update.

A symmetric rank-1 update can be expressed as \(\alpha xx^T\), where \(x\) is some vector and \(\alpha\) is \(\pm 1\). So, compared to the previous post, in place of the matrix \(X\) we now have this symmetric rank-1 update, and we still have the same constraint, namely,
\[ \alpha xx^T a = b\]
which can be rewritten as
\[ \alpha \left( x^Ta \right) x = b\]
which straightaway tells us that \(x\) must be a scalar multiple of \(b\). Plugging \(x=\beta b\) into the original constraint we get:
\[ \alpha \beta^2 (b^Ta) b = b\]
which implies that
\[ \alpha \beta^2 = \frac{1}{b^Ta} \]
giving us our symmetric rank-1 update as \(\frac{bb^T}{b^Ta}\). Mapping the symbols back to the original problem (recall from Part I that \(a = \Delta x\) and \(b = \Delta f - J'_{n-1}\Delta x\)) we get:
\[J'_n = J'_{n-1} + \frac{ (\Delta f - J'_{n-1}\Delta x)(\Delta f - J'_{n-1}\Delta x)^T}{(\Delta f - J'_{n-1}\Delta x)^T\Delta x}\]
This update is known by the TLA SR1 (Symmetric Rank-1) and is the unique symmetric rank-1 update that satisfies the secant condition. Depending on the sign of the denominator the correction term is positive or negative semidefinite, so the updated matrix may or may not remain positive definite. Since SR1 is the result of requiring only symmetry and rank-1 (plus the secant condition), with no other freedom left, we cannot hope to obtain a symmetric positive definite rank-1 update that satisfies the secant condition. We will again use the approach of Part I and look for the "smallest" update that gives us both symmetry and positive definiteness; we will end up with a rank-2 update and the famous DFP and BFGS formulas.
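For concreteness, here is a minimal NumPy sketch of the SR1 update (the function name `sr1_update` and the small-denominator safeguard are my own additions, not something from the post or from a particular library):

```python
import numpy as np

def sr1_update(B, dx, df):
    """Symmetric rank-1 (SR1) update of the approximation B.

    dx = x_n - x_{n-1}, df = f_n - f_{n-1}.  The returned matrix stays
    symmetric (if B was symmetric) and satisfies B_new @ dx = df.
    """
    r = df - B @ dx         # the vector b = Delta f - B Delta x
    denom = r @ dx          # the denominator b^T a, which may be negative
    if abs(denom) < 1e-12:  # safeguard (my addition): skip the update if b^T a is tiny
        return B
    return B + np.outer(r, r) / denom
```

When the denominator \(b^Ta\) is negative the correction term \(\frac{bb^T}{b^Ta}\) is negative semidefinite, which is exactly how positive definiteness can be lost.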

Tuesday, October 11, 2016

Quasi Newton Methods - Part I (Rank 1)

This time we will be looking at Quasi Newton Methods. But let's start off with an introduction to Newton's Method, looking first at its application to root finding. Let \(f : \mathbb{R}^n \rightarrow \mathbb{R}^n\) be a function for which we want to find an \(x\) such that \(f(x)=0\). Starting from an initial guess \(x_0\), Newton's method iteratively improves the guess. In its most basic form we use the Jacobian matrix of \(f\) to build a simple first order approximation of the function and take the zero of that approximation as the guess for the next iteration of the method. So,
\[ f_{approx}(x) = f(x_0) + J_0(x-x_0)\]
where \(f_{approx}\) represents the first order approximation to \(f\) and \(J_0\) is the Jacobian matrix of \(f\) at \(x_0\). So, our next guess will be the zero of \(f_{approx}\) which can be expressed as (assuming \(J_0\) is invertible)
\[ x_1 = x_0 - J_{0}^{-1}f(x_0)\]
and is known as the "full" Newton step. In general, going from the \((n-1)^{th}\) iterate to the \(n^{th}\) iterate we have
\[ x_{n} = x_{n-1} - J_{n-1}^{-1}f(x_{n-1})\]
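As a quick illustration, here is a minimal NumPy sketch of this iteration (the function name `newton`, the tolerance, and the iteration cap are my own choices); it solves a linear system for the step rather than forming the inverse explicitly:

```python
import numpy as np

def newton(f, jac, x0, tol=1e-10, max_iter=50):
    """Newton's method for f(x) = 0 with f: R^n -> R^n.

    jac(x) returns the n x n Jacobian of f at x.  Each step solves
    J_{n-1} (x_n - x_{n-1}) = -f(x_{n-1}).
    """
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        fx = f(x)
        if np.linalg.norm(fx) < tol:
            break
        x = x - np.linalg.solve(jac(x), fx)  # the "full" Newton step
    return x
```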
Computing the Jacobian matrix at each iterate is a computationally expensive step, and "Quasi" Newton methods attempt to approximate the new Jacobian using the Jacobian (approximation) from the previous iterate. Let's impose the condition that the first order approximation we construct should match the function at the previous iterate, in addition to the current iterate (as in the very first equation in this post). This gives
\[f_{n-1} = f_n + J'_n (x_{n-1}-x_{n})\]
where \(J'_n\) is our approximation to the Jacobian at \(x_n\). The constraint above can be succinctly written as \(J'_n\Delta x = \Delta f\), with \(\Delta x = x_n - x_{n-1}\) and \(\Delta f = f_n - f_{n-1}\), and is known as the Secant Condition. This constraint alone is not enough to determine the approximate Jacobian, as we have \(n^2\) variables and only \(n\) constraints. Since we want to use the previous Jacobian to approximate the new one, let's add an objective function that minimizes the Frobenius norm of the change. This gives us our overall optimization problem:
\[\mathrm{min} : \frac{1}{2}\|J'_n-J'_{n-1}\|_F\]
\[\mathrm {s.t.} :  J'_n\Delta x = \Delta f\]
Let's rewrite the above using \(X = J'_n-J'_{n-1}\), \(a = \Delta x\), and \(b = \Delta f - J'_{n-1}\Delta x\) as

\[\mathrm{min} : \frac{1}{2}\|X\|_F\]
\[\mathrm {s.t.} :  Xa = b\]

Let's first square the objective function (which does not change the optimization problem since the objective is nonnegative) and solve the resulting problem by introducing the Lagrangian and using matrix derivatives:
\[L(X, \lambda) = \frac{1}{2}\mathbb{Tr}(X^TX) + \lambda^T(Xa-b) \]
\[\mathrm{d}L = \mathbb{Tr}\left(\left(X^T + a\lambda^T\right)\mathrm{d}X\right) \]
which implies that at the optimal point \((X^*, \lambda^*)\) we have \(X^* + \lambda^* a^T=0\), i.e. \(X^* = -\lambda^* a^T\). Plugging this into the original constraint of the optimization problem we get \(\lambda^* = \frac{-b}{a^Ta}\). Plugging this back into the expression for \(X^*\), we finally get
\[ X^* = \frac{ba^T}{a^Ta}\]
So we see that the update is rank-1, and using it we can now write the expression for the updated Jacobian as
\[J'_n = J'_{n-1} + \frac{ (\Delta f - J'_{n-1}\Delta x)(\Delta x)^T}{(\Delta x)^T\Delta x}\]
which is known as Broyden's good method. Broyden's bad method is obtained if we instead minimize the Frobenius norm of the update to the inverse of the Jacobian; it can be written down directly from the result of our simplified optimization problem (with the roles of \(\Delta x\) and \(\Delta f\) interchanged) as
\[J'^{-1}_n = J'^{-1}_{n-1} + \frac{ (\Delta x - J'^{-1}_{n-1}\Delta f)(\Delta f)^T}{(\Delta f)^T\Delta f}\]
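As a rough sketch (the function names are mine, not from any library), the two updates look like this in NumPy; `broyden_good_update` maintains the Jacobian approximation, while `broyden_bad_update` maintains the inverse approximation directly:

```python
import numpy as np

def broyden_good_update(J, dx, df):
    """Good Broyden: rank-1 update of J so that J_new @ dx = df."""
    return J + np.outer(df - J @ dx, dx) / (dx @ dx)

def broyden_bad_update(H, dx, df):
    """Bad Broyden: rank-1 update of the inverse approximation H so that H_new @ df = dx."""
    return H + np.outer(dx - H @ df, df) / (df @ df)
```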

Using the Sherman–Morrison formula we can write the update to the inverse of the Jacobian in the good Broyden case and the update to the Jacobian in the bad Broyden case as rank-1 updates as well.
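As a rough numerical check (the toy data and variable names are mine), the following applies Sherman–Morrison to the good Broyden update and confirms that the result is indeed the inverse of the updated Jacobian:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4
J = rng.standard_normal((n, n)) + n * np.eye(n)  # toy, well-conditioned Jacobian approximation
H = np.linalg.inv(J)                             # its inverse
dx = rng.standard_normal(n)
df = rng.standard_normal(n)

# Good Broyden update, written as a rank-1 correction J + u v^T.
u = (df - J @ dx) / (dx @ dx)
v = dx
J_new = J + np.outer(u, v)

# Sherman-Morrison: (J + u v^T)^{-1} = H - (H u)(v^T H) / (1 + v^T H u)
Hu = H @ u
H_new = H - np.outer(Hu, v @ H) / (1.0 + v @ Hu)

print(np.allclose(H_new @ J_new, np.eye(n)))     # expected: True
```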

There are two shortcomings of this result:

1) The update is not symmetric. So even if we start with a symmetric Jacobian we will lose symmetry. This matters when we view Newton's method as an optimization method, where the function \(f\) is the gradient of an objective function and the Jacobian is therefore the Hessian of that objective, and hence symmetric. The next post will be about a symmetric, though still rank-1, update to the Jacobian.

2) The update is not positive definite. This is again a problem when we are looking to minimize, say, a convex function. The third post in this series will be on the DFP and BFGS updates, which will get us the desired SPD (symmetric positive definite) update, but we will need to go one rank up, to a rank-2 update.