Sunday, July 31, 2016

Trace and transpose trick: Computing derivatives

I am not sure why but none of the college courses on linear algebra or calculus ever introduced me to working with derivatives of functions of matrices and vectors. This comes up on a daily basis in optimization and is a lot cleaner than dealing with summations. Here is a trick that comes in very handy when taking derivatives of a functions that maps vectors or matrices to scalars. Basically, one can replace a scalar results of matrix vector products by its transpose and in some cases by their trace (and then use the cyclic property of  trace to rewrite expressions) in which the desired terms appear at the end and are correctly transposed.

Simplest case: Linear function takes a vector to a scalar.
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(c^Tx \right) = c^T \mathrm{d}x\]

Next level: Function takes a vector to a scalar using a quadratic form. We will use the transpose trick here:
\[f : \mathbb{R}^n \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(x^TAx \right) = (\mathrm{d}x)^TAx + x^TA(\mathrm{d}x) \]
The first term on the RHS is a scalar so we can take its transpose and get
\[ \mathrm{d} \left(x^TAx \right) = x^TA^T(\mathrm{d}x) + x^TA(\mathrm{d}x)  = (x^TA^T + x^TA) \mathrm{d}x\]

Next level: Function takes a matrix to a scalar. We will use the trace trick here: 
\[f : \mathbb{R}^{nxn} \rightarrow \mathbb{R}\]
\[ \mathrm{d} \left(a^TXb \right) = a^T (\mathrm{d}X)b \]
Now since the RHS is a scalar we can think of it as a trace and rotate the terms to get:
\[ \mathrm{d} \left(a^TXb \right) = \mathrm{Tr}(ba^T \mathrm{d}X) \].
Now, we will use the symbol (:) that rewrites the matrix as a concatenation of its vectors and rewrite the above as:
\[ \mathrm{d} \left(a^TXb \right) = (ab^T): (\mathrm{d}X): \].
Giving us \( \frac{\mathrm{d} \left(a^TXb \right)}{\mathrm{d}X} = ab^T \).

Lets use this machinery to attempt something more complicated and find \( \frac{\mathrm{d}(a^TX^TCXb)}{\mathrm{d}X} \)
\[ \mathrm{d} (a^TX^TCXb)  = a^T(\mathrm{d}X^T)CXb + a^TX^TC(\mathrm{d}X)b \]
\[ = b^TX^TC^T(dX)a + a^TX^TC(dX)b \]
\[ = \mathrm{Tr}( b^TX^TC^T(\mathrm{d}X)a + a^TX^TC(\mathrm{d}X)b ) \]
\[ = \mathrm{Tr}( ab^TX^TC^T(\mathrm{d}X) + ba^TX^TC(\mathrm{d}X)) \]
\[ = (CXba^T+C^TXab^T): (\mathrm{d}X): \]
\[ \Rightarrow \mathrm{d}(a^TX^TCXb)/\mathrm{d}X = (CXba^T+C^TXab^T) \]

No comments:

Post a Comment