The best of ICML 2022 – Paper reviews (part 5)
This article is one of a series of paper reviews from our researchers and machine learning engineers – view more
Last month, G-Research were Diamond sponsors at ICML, hosted this year in Baltimore, US.
As well as having a stand and team in attendance, a number of our quantitative researchers and machine learning engineers attended as part of their ongoing learning and development.
We asked our quants and machine learning practitioners to write about some of the papers and research that they found most interesting.
Here, Linden R, Quantitative Researcher at G-Research, discusses three papers.
Yilang Zhang, Mengchu Xu, Xiaojun Mao, Jian Wang
Compressed image vector =
(random matrix) x (original image vector)
Reconstructed image =
If we are trying to lossily compress an image, one way is to record the dot product of the image with
k predetermined random vectors. Linear regression won’t give us the original image back, but rather a codimension-k “plausible” subspace where the image lies. So we will need some priors to pick the “most likely” input image within (or near) this plausible subspace. If we train an encoder-decoder model with latent variable vector
z, we can pick
z which gives an image near the plausible subspace, and where
z is L2-small. This does pretty well in practice; however, our training of the encoder-decoder model may have overfit, and its weights vector theta may be noisy.
The paper’s novelty is therefore to model the “true” theta as being normally distributed around the trained theta, and pick z and true theta to maximise the posterior likelihood. Thus the true theta is penalised by its distance to the trained theta, and
z is penalised by its norm as before. These optimal parameters are found efficiently using an iterative solver described in the paper. In this way the authors effectively find weights near to the trained weights which give an easily-describable (small
z) image which compresses to (something similar to) the compressed image.
The results are visibly better-looking than similar methods and the authors give guarantees on when this method will produce an image “near to” the original image.
Aviv Navon, Aviv Shamsian, Idan Achituve, Haggai Maron, Kenji Kawaguchi, Gal Chechik, Ethan Fetaya
In this setting we are trying to train one model to learn multiple tasks. A common way to do this is to gradient descend the model weights on each task simultaneously during training. However, different tasks can have contradictory directions and step sizes, leading to cancellation of the steps and reducing the effectiveness of the training. The authors of this paper have tried to resolve this issue by, instead of summing the different tasks’ gradients, finding the Nash equilibrium of a bargaining “game” between the different gradients, effectively forcing the different tasks to negotiate which direction to move the model weights in. As a Nash equilibrium, it is invariant to changing the sizes of the gradients, i.e. it behaves as if the gradients were all normalised. This prevents one gradient from dominating the joint descent.
When choosing a joint direction
d, the objective function the authors minimise is the sum of
g_i is the gradient vector from the ith task. The fact that this is a Nash equilibrium means that if all but one of the “players” choose
d, then the remaining player maximises their own utility (which is
g_i.d) by choosing
d as well, since otherwise there is no agreement and the step defaults to
The authors calculate that the unique equilibrium point is a linear combination of the task gradients
g_i whose coefficients
G^T G a = 1/a, where
1/a is the element-wise reciprocal.
As a special case, if the gradients are pairwise orthogonal, then the combined gradient (up to scale) is just the sum of the normalised gradients. The authors provide an efficient way to numerically solve for
a, and compare the resulting gradient descent procedure to the state-of-the-art in multi-task learning where their Nash-MTL model performs significantly better than previous MTL methods.
Antonio Orvieto, Hans Kersting, Frank Proske, Francis Bach, Aurelien Lucchi
This paper gives an experimental variant of stochastic gradient descent (SGD). In SGD, usually you add in a noise term with the gradient descent, which in practice seems to help the descent process move out of minima which are in some sense too narrow and find wide, highly-robust minima. The novelty in this paper is to add noise terms which are anticorrelated, i.e. the nth noise term is of the form
(z_n - z_n+1) where the
z_i are iid normal.
This is functionally equivalent to modified gradient descent where you evaluate the gradient at a random nearby point, instead of the current point. In expectation, this “random nearby loss” function acts as a smoothed version of the original loss function, and Taylor-expanding the loss gradient to second order and a cheeky use of Clairaut’s theorem (assuming the loss function has continuous fourth-order partial derivatives!) tells you that each step of this descent procedure is on average just gradient descent of a modified loss function where we regularise by adding in the loss’s Laplacian (times a small constant). Thus this seemingly simple change to SGD gives a procedure which elegantly penalises sharp (high-Laplacian) minima.
The authors show that this is well-behaved under in many circumstances and show it in action on some toy problems, but I am interested to see if this can be applied to larger descents.