A Universal Class of Sharpness-Aware Minimization Algorithms
Behrooz Tahmasebi, Ashkan Soleymani, Dara Bahri, Stefanie Jegelka, Patrick Jaillet
Sharpness-Aware Minimization (SAM) (Foret et al., 2021) [1] is an optimisation procedure that aims to improve the generalisation of trained models by biasing training towards flatter minima in the loss landscape.
Specifically, SAM minimises the maximum of the loss L over a small neighbourhood of the current parameters, i.e. min_x max_{||ε|| ≤ ρ} L(x + ε). This has been shown to be equivalent to minimising the largest eigenvalue of the Hessian matrix H(x) on the zero-loss manifold. A variant uses the average loss over the neighbourhood instead of the maximum; this corresponds to minimising the trace of H(x) (and thus the average eigenvalue). These functions of the Hessian can be considered measures of sharpness.
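In practice, the inner maximisation is approximated by a single gradient ascent step of radius ρ, followed by a descent step using the gradient at the perturbed point. A minimal PyTorch-style sketch of one such update (function and argument names here are my own, not the paper's):

```python
import torch

def sam_step(model, loss_fn, data, target, base_opt, rho=0.05):
    """One SAM update: ascend to an approximate worst-case point, descend from there."""
    # First pass: gradient of the loss at the current parameters.
    loss_fn(model(data), target).backward()
    params = [p for p in model.parameters() if p.grad is not None]
    with torch.no_grad():
        grad_norm = torch.sqrt(sum((p.grad ** 2).sum() for p in params))
        eps = [rho * p.grad / (grad_norm + 1e-12) for p in params]
        for p, e in zip(params, eps):
            p.add_(e)  # single ascent step to the edge of the rho-ball
    model.zero_grad()

    # Second pass: gradient at the perturbed point drives the actual update.
    loss_fn(model(data), target).backward()
    with torch.no_grad():
        for p, e in zip(params, eps):
            p.sub_(e)  # restore the original parameters
    base_opt.step()       # apply the perturbed-point gradient at the original point
    base_opt.zero_grad()
```

The average-based variant would instead sample perturbations from the neighbourhood rather than ascending to its worst point.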
However, these particular sharpness measures have shortcomings. For instance, the saddle L(a, b) = a² − b² has tr(H) = 0 everywhere, so the average-based sharpness objective cannot distinguish the saddle point from a genuinely flat minimum. The authors also provide examples where L(x) is scale-invariant but the sharpness measures are not.
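A quick numerical check of this example (my own illustration): the Hessian of L(a, b) = a² − b² is the constant matrix diag(2, −2), whose trace vanishes even though the spectrum is far from flat.

```python
import numpy as np

# Hessian of L(a, b) = a^2 - b^2 is diag(2, -2) at every point.
H = np.diag([2.0, -2.0])

print(np.trace(H))                    # 0.0 -> average-based sharpness sees nothing
print(np.linalg.eigvalsh(H).max())    # 2.0 -> max-based sharpness still detects curvature
print(np.linalg.norm(H, "fro") ** 2)  # 8.0 -> squared Frobenius norm is also nonzero
```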
Using the average-based objective as a starting point, the authors define a generalised class of sharpness measures S(x) that are functions of the Hessian. They then prove that this class of sharpness measures is universal for functions of Hessian eigenvalues, as well as for arbitrary functions of the Hessian.
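Schematically, any symmetric function φ of the Hessian eigenvalues yields a candidate sharpness measure in this spirit. The paper's formal construction goes through the optimisation objective rather than direct spectral evaluation, but the following toy snippet (my own illustration) shows the measures discussed here on the Hessian from the previous example:

```python
import numpy as np

def sharpness(H, phi):
    """Apply a symmetric function phi to the Hessian spectrum."""
    return phi(np.linalg.eigvalsh(H))

H = np.diag([2.0, -2.0])
print(sharpness(H, np.max))                      # largest eigenvalue (max-loss SAM)
print(sharpness(H, np.mean))                     # average eigenvalue (trace-based variant)
print(sharpness(H, lambda lam: np.sum(lam**2)))  # squared Frobenius norm
print(sharpness(H, np.prod))                     # determinant
```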
They then provide an objective function that relies only on zeroth-order information about the training loss (i.e. loss evaluations rather than Hessian computations), with an explicit bias towards minimising S(x), as well as the full generalised SAM algorithm. They recover the original maximum- and average-based measures under this parameterisation, and introduce new measures based on the Frobenius norm and the determinant of the Hessian to address the aforementioned problems with saddle points and scale-invariance.
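The authors' exact estimator is more involved, but the flavour of "zeroth-order only" can be illustrated as follows: a central finite difference along a random Gaussian direction ε costs three loss evaluations and yields a sample of εᵀH(x)ε, whose mean over directions is tr(H) and whose variance is 2||H||_F², so a Frobenius-style sharpness is recoverable from loss queries alone. A rough sketch (names are mine):

```python
import numpy as np

def quad_form_samples(loss, x, rho=1e-3, n=20000, seed=0):
    """Zeroth-order samples of eps^T H(x) eps via central differences:
    three loss evaluations per random Gaussian direction eps."""
    rng = np.random.default_rng(seed)
    L0 = loss(x)
    samples = np.empty(n)
    for i in range(n):
        eps = rng.standard_normal(len(x))
        samples[i] = (loss(x + rho * eps) + loss(x - rho * eps) - 2 * L0) / rho**2
    return samples

# Same toy loss as before: L(a, b) = a^2 - b^2, so tr(H) = 0 and ||H||_F^2 = 8.
loss = lambda x: x[0]**2 - x[1]**2
q = quad_form_samples(loss, np.zeros(2))
print(q.mean())       # ~ 0.0  (estimates tr(H))
print(0.5 * q.var())  # ~ 8.0  (estimates ||H||_F^2)
```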
The authors then demonstrate that these SAM variants are competitive with the original ones on various vision datasets, and can outperform them in certain scenarios, e.g. when training data is limited or label noise is present. It would be interesting to see how these variants compare across a broader range of datasets and model architectures.
[1] Foret, P., Kleiner, A., Mobahi, H., Neyshabur, B. Sharpness-Aware Minimization for Efficiently Improving Generalization. ICLR 2021.