ICML 2024: Paper Review #8

25 September 2024
  • Quantitative Research

Machine Learning (ML) is a fast-evolving discipline, which means attending conferences and hearing about the very latest research is key to the ongoing development and success of our quantitative researchers and ML engineers.

In this paper review series, our ICML 2024 attendees reveal the research and papers they found most interesting.

Here, discover the perspectives of Oliver, one of our quantitative researchers, as he discusses his most compelling findings from the conference.

Better & Faster Large Language Models via Multi-token Prediction

Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve

This paper introduces a novel approach to LLM pre-training, challenging the conventional next-token prediction paradigm. The authors propose a non-autoregressive multi-token prediction loss, aiming to enhance model performance whilst maintaining parallelisability. Intuitively, such a multi-token loss should help, as it makes the pre-training objective more similar to the autoregressive text generation used in downstream tasks.
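
To make the approach concrete, here is a minimal PyTorch sketch of multi-token prediction: k output heads on a shared trunk, with the loss averaging cross-entropy over the next k tokens at every position. The independent linear heads and tensor shapes are simplifications for illustration; the paper instead ties the unembedding matrix across heads and gives each head its own transformer layer.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiTokenHeads(nn.Module):
    """k output heads on a shared trunk: head i predicts the token (i + 1) steps ahead."""

    def __init__(self, d_model: int, vocab_size: int, k: int):
        super().__init__()
        self.k = k
        self.heads = nn.ModuleList(nn.Linear(d_model, vocab_size) for _ in range(k))

    def forward(self, hidden: torch.Tensor) -> torch.Tensor:
        # hidden: (B, T, d_model) -> logits: (B, T, k, vocab)
        return torch.stack([head(hidden) for head in self.heads], dim=2)


def multi_token_loss(logits: torch.Tensor, tokens: torch.Tensor, k: int) -> torch.Tensor:
    """Average cross-entropy over the next k tokens at every position."""
    B, T, _, V = logits.shape
    losses = []
    for i in range(k):
        valid = T - (i + 1)                     # positions that have a target i + 1 steps ahead
        pred = logits[:, :valid, i, :].reshape(-1, V)
        target = tokens[:, i + 1 : i + 1 + valid].reshape(-1)
        losses.append(F.cross_entropy(pred, target))
    return torch.stack(losses).mean()
```

Setting k = 1 recovers the standard next-token objective, which is one way to see that the two losses can share the same trunk and differ only in the heads.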

The research examines models ranging from 0.3 to 13 billion parameters, revealing that the multi-token loss becomes increasingly beneficial as model size grows. Notably, the approach yields significant improvements on coding benchmarks such as MBPP and HumanEval. The fact that smaller models (under 1 billion parameters) see degraded performance with this method, even on coding benchmarks, may explain why it had not been explored previously.

Besides these gains on coding benchmarks, I think their findings on byte-level training are particularly promising, as the multi-token loss nearly bridges the performance gap between byte-level and standard tokenisers. This development could potentially pave the way for ‘tokeniser-free’ models in the future.

Besides these empirical gains, the paper also offers two intuitions for why a multi-token loss should indeed help with autoregressive text generation, which I found helpful.

It’s worth noting that the claimed reduction in inference time refers to wall-clock time rather than FLOPs, as the gains stem from increased parallelisability and speculative decoding (using the additional prediction heads), not from fundamental computational efficiency improvements.
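
To illustrate where the wall-clock gains come from, below is a hedged sketch of greedy self-speculative decoding with the extra heads: one forward pass drafts k tokens, and the ordinary next-token head verifies them on the next pass. The model interface (logits of shape (1, T, k, vocab)) and the purely greedy acceptance rule are assumptions made for illustration; the paper's exact procedure differs in its details.

```python
import torch


@torch.no_grad()
def self_speculative_generate(model, prompt_ids, max_new_tokens, k):
    """Greedy self-speculative decoding sketch.

    Assumes `model(ids)` returns logits of shape (1, T, k, vocab): at each position,
    one distribution per future offset 1..k, where offset 1 (index 0) is the
    ordinary next-token head and acts as the verifier."""
    ids = prompt_ids
    target_len = prompt_ids.shape[1] + max_new_tokens
    while ids.shape[1] < target_len:
        prefix_len = ids.shape[1]
        draft = model(ids)[0, -1].argmax(-1)                # (k,) drafted continuation
        candidate = torch.cat([ids, draft.unsqueeze(0)], dim=1)
        verify = model(candidate)[0, :, 0]                  # next-token head only: (T + k, vocab)

        accepted = 0
        for i in range(k):
            # logits at position prefix_len - 1 + i predict the token at prefix_len + i,
            # conditioned on the prompt plus the first i (already accepted) draft tokens
            if verify[prefix_len - 1 + i].argmax(-1) != draft[i]:
                break
            accepted += 1

        if accepted == k:
            ids = candidate
        else:
            # keep the accepted drafts plus the verifier's own token at the first mismatch
            fix = verify[prefix_len - 1 + accepted].argmax(-1, keepdim=True)
            ids = torch.cat([ids, draft[:accepted].unsqueeze(0), fix.unsqueeze(0)], dim=1)

    return ids[:, :target_len]
```

Every accepted draft token saves a full forward pass, which is where the wall-clock speed-up comes from; the number of FLOPs per generated token is not reduced.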

Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations

Alexander Hägele, Elie Bakouch, Atli Kosson, Loubna Ben Allal, Leandro Von Werra, Martin Jaggi

Determining the optimal training length for a given model is a crucial aspect of generative AI research. This line of inquiry culminated (in the public domain) with the Chinchilla paper, which demonstrated that the learning rate schedule is an important factor. [1] Until recently, cosine annealing was the scheduler of choice. Despite its strong performance, it has a major drawback: to reach state-of-the-art performance, the decay length must be fixed upfront, proportional to the total number of training steps. This makes it difficult to reuse model checkpoints across different experiments.

This paper investigates a simple learning rate schedule: a constant phase followed by a cool-down. The results are promising, with the final performance of models being at least as good as with cosine annealing. In addition, the checkpoints from the constant learning rate phase can be reused for experiments of different durations. Interestingly, the Llama 3.1 paper, released during ICML, describes a very similar approach, though they opted for a linear learning rate decay during the cool-down rather than the 1-sqrt version proposed in this paper. [2]
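
For concreteness, a minimal sketch of such a schedule is given below: a linear warm-up, a long constant phase whose checkpoints can be reused, and a short cool-down with the 1-sqrt shape discussed in the paper (Llama 3.1 reportedly used a linear decay instead). The warm-up and the cooldown_frac knob are illustrative assumptions rather than the paper's exact settings.

```python
def constant_plus_cooldown_lr(step: int, max_lr: float, total_steps: int,
                              warmup_steps: int, cooldown_frac: float = 0.2) -> float:
    """Learning rate at `step`: linear warm-up -> constant -> 1 - sqrt cool-down."""
    cooldown_steps = int(cooldown_frac * total_steps)
    cooldown_start = total_steps - cooldown_steps

    if step < warmup_steps:                      # linear warm-up
        return max_lr * (step + 1) / warmup_steps
    if step < cooldown_start:                    # constant phase: checkpoints here are reusable
        return max_lr
    progress = (step - cooldown_start) / max(cooldown_steps, 1)
    return max_lr * (1.0 - progress ** 0.5)      # 1 - sqrt decay towards zero
```

Because the constant phase does not depend on total_steps, the same checkpoint can be cooled down over many different horizons, which is what makes the schedule attractive for scaling law studies.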

Beyond scaling law experiments, the new scheduler enables checkpoint sharing for investigations into different data mixtures in a curriculum learning setup. Llama 3.1 leveraged this too, exploring different data mixtures during the cool-down period, i.e. re-using the last checkpoint from the constant learning rate phase.

The authors also analysed Stochastic Weight Averaging (SWA) and the recently proposed schedule-free optimiser, both of which promise strong performance throughout training. Whilst these do yield strong performance at every training iteration, the paper finds that they cannot quite match the proposed constant-plus-cool-down method. Additionally, these alternatives would only offer compute savings for scaling law runs, not for experiments investigating data mixtures or curriculum learning stages.
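
For reference, here is a toy PyTorch sketch of stochastic weight averaging during a constant learning rate phase; the model, data and averaging frequency are placeholders rather than the paper's setup.

```python
import torch
import torch.nn as nn
from torch.optim.swa_utils import AveragedModel

torch.manual_seed(0)
model = nn.Linear(16, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # constant learning rate
swa_model = AveragedModel(model)                          # running equal-weight average

for step in range(1000):
    x = torch.randn(32, 16)
    y = x.sum(dim=1, keepdim=True)                        # toy regression target
    loss = nn.functional.mse_loss(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    if step % 10 == 0:
        swa_model.update_parameters(model)                # fold current weights into the average

# `swa_model` plays the role of a 'cooled-down' checkpoint, while `model` can keep
# training at the constant learning rate or branch into new experiments.
```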

[1] Training Compute-Optimal Large Language Models

[2] The Llama 3 Herd of Models

Read more of our quantitative researchers' thoughts

ICML 2024: Paper Review #1

Discover the perspectives of Yousuf, one of our machine learning engineers, on the following papers:

  • Arrows of Time for Large Language Models
  • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Read now
ICML 2024: Paper Review #2

Discover the perspectives of Danny, one of our machine learning engineers, on the following papers:

  • Compute Better Spent: Replacing Dense Layers with Structured Matrices
  • Emergent Equivariance in Deep Ensembles
Read now
ICML 2024: Paper Review #3

Discover the perspectives of Jonathan, one of our software engineers, on the following papers:

  • A Universal Class of Sharpness-Aware Minimization Algorithms
  • Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
Read now
ICML 2024: Paper Review #4

Discover the perspectives of Evgeni, one of our senior quantitative researchers, on the following papers:

  • Trained Random Forests Completely Reveal your Dataset
  • Test-of-time Award: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Read now
ICML 2024: Paper Review #5

Discover the perspectives of Michael, one of our Scientific Directors, on the following papers:

  • Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
  • Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Read now
ICML 2024: Paper Review #6

Discover the perspectives of Fabian, one of our senior quantitative researchers, on the following papers:

  • I/O Complexity of Attention, or How Optimal is Flash Attention?
  • Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff
Read now
ICML 2024: Paper Review #7

Discover the perspectives of Ingmar, one of our quantitative researchers, on the following papers:

  • Offline Actor-Critic Reinforcement Learning Scales to Large Models
  • Information-Directed Pessimism for Offline Reinforcement Learning
Read now
