This time we will be looking at Quasi Newton Methods. But lets start off with an introduction to Newton's Method, firstly, looking at its application at root finding. Let \(f : \mathbb{R}^n \rightarrow \mathbb{R}\) be a function and we want to find an \(x\) such that \(f(x)=0\). If we start from an initial guess \(x_0\) we can use Newton's method to iteratively update the initial guess. In its most basic form we look at the Jacobian matrix of \(f\) and create a simple first order approximation of the function and use the zero of the approximate function as the initial guess for the next iteration of the method. So,
\[ f_{approx}(x) = f(x_0) + J_0(x-x_0)\]
where \(f_{approx}\) represents the first order approximation to \(f\) and \(J_0\) is the Jacobian matrix of \(f\) at \(x_0\). So, our next guess will be the zero of \(f_{approx}\) which can be expressed as (assuming \(J_0\) is invertible)
Lets, first square the objective function (which does not change the optimization problem as we have a positive objective function) and solve the resulting problem by introducing the Lagrangian and using Matrix derivatives:
\[ x_1 = x_0 - J_{0}^{-1}f(x_0)\]
and is known as the "full" Newton step. In general, going from the \(n-1^{th}\) iterate to \(n^{th}\) iterate we have
\[ x_{n} = x_{n-1} - J_{n-1}^{-1}f(x_{n-1})\]
Computing, the Jacobian matrix at each iterate is a computationally expensive step and "Quasi" Newton methods attempt to approximate the Jacobian matrix based on the Jacobian used to obtain the iterate. Lets impose the condition that the approximation we construct should match the function at the previous iterate, in addition to the current iterate (as in the very first equation in this post). This gives,
\[f_{n-1} = f_n + J'_n (x_{n-1}-x_{n})\]
where \(J'\) is our approximation to the Jacobian. The constraint above can be succinctly represented as \(J'_n\Delta x = \Delta f\) and is known as the Secant Condition. The constraint is not enough to approximate the Jacobian as we have \(n^2\) variables and only \(n\) constraints. Since, we want to use the previous Jacobian to approximate the new Jacobian lets add an objective function that minimizes the Frobenius norm of the change. This given us our overall optimization problem as:
\[\mathrm{min} : \frac{1}{2}\|J'_n-J'_{n-1}\|_F\]
\[\mathrm {s.t.} : J'_n\Delta x = \Delta f\]
Lets rewrite the above by using \(X = J'_n-J'_{n-1}\) and \(a = \Delta x\) and \(b = \Delta f - J'_{n-1}\Delta x\) as
\[\mathrm{min} : \frac{1}{2}\|X\|_F\]
\[\mathrm {s.t.} : Xa = b\]
Lets, first square the objective function (which does not change the optimization problem as we have a positive objective function) and solve the resulting problem by introducing the Lagrangian and using Matrix derivatives:
\[L(X, \lambda) = \frac{1}{2}\mathbb{Tr}(X^TX) + \lambda^T(Xa-b) \]
\[\mathrm{d}L = \mathbb{Tr}\left((X^T\mathrm + a\lambda^T)\mathrm{d}X\right) \]
Which implies that at the optimal point \((X^*, \lambda^*)\), \(X^* + \lambda^* a^T=0\). Plugging, this into the original constraint of the optimization problem we get \(\lambda^* = \frac{-b}{a^Ta}\). Plugging, this back into the expression for \(X^*\), finally, we get
\[ X^* = \frac{ba^T}{a^Ta}\]
So we see that the update is Rank-1 and using this we can now write the expression for the updated Jacobian as
\[J'_n = J'_{n-1} + \frac{ (\Delta f - J_{n-1}x_{n-1})(\Delta x)^T}{(\Delta x)^T(\Delta x)}\]
which is known as Broyden's good method. The Broyden's bad method is obtained if we try to minimize the Frobenius norm of the update to the inverse of the Jacobian and can be easily written down by using the result for our simplified optimization problem (by flipping the roles of \(a\) and \(b\)) as
\[J'^{-1}_n = J'^{-1}_{n-1} + \frac{ (\Delta x - J^{-1}_{n-1}f_{n-1})(\Delta f)^T}{(\Delta f)^T(\Delta f)}\]
Using the Sherman–Morrison formula we can write the update to the inverse of the Jacobian in the good Broyden case and the update to the Jacobian in the bad Broyden case as rank-1 updates as well.
There are two shortcomings of this result:
1) The update is not symmetric. So even if we start with a symmetric Jacobian we will lose symmetry. This is important if we look at Newton's method as an optimization method where the function \(f\) will be a gradient of an objective function and the Jacobian will be the Hessian of the objective function and, therefore, symmetric. The next post will be about a symmetric update to the Jacobian though still rank-1.
2) The update is not positive definite. Again a problem when we are looking to optimize say a convex function. The third post in this series will be on the DFP and BFGS updates that will get us the desired SPD (symmetric positive definite) update, but we will need to go one rank up to a rank-2 update.
No comments:
Post a Comment