I recently gave a talk on Linear Regression and its extensions (Ridge regression, LASSO, LARS etc.) as part of the Austin SIGKDD Advanced Machine Learning Meetup. The slides can be found here.
One of the interesting things that I learnt about while preparing for this talk was the fact that coordinate descent works for function where the non-differentiable part of the objective function is separable (slides 29-30). This is why we can use gradient optimization for minimizing the LASSO objective function. The proof works out and the only background that is needed is subgradients. Also, it is instructive to make sure that the proof "really" uses the property of separability and why the proof breaks when we do not have separability.
Hint: For a separable function the subgradient set is a direct product of the subgradient sets of the individual components. However, if the non-differentiable function is not separable then the negative gradient of the differentiable part may not lie in the subgradient set of the non-differentiable function.
Another, interesting thing was the connection between LASSO and LARS. LASSO minimizes the quadratic residual norm with a \(\mathrm{L1}\) penalty and requires that the sign of the coefficient related to a vector be the same as the sign of the inner product of the residual with the vector (easy to show using the Lagrangian). LARS does not enforce this and can therefore produce different solutions. LARS can modified to drop a vector from the active set when this sign change happens and can thus be tailored to generate a LASSO solution. However, then the number of steps taken by LARS is not bounded by \(n\), the dimensionality of the problem.