
ICML 2024: Paper Review #1

24 September 2024
  • Quantitative Research

Machine Learning (ML) is a fast-evolving discipline, which means that attending conferences and hearing about the very latest research is key to the ongoing development and success of our quantitative researchers and ML engineers.

In this paper review series, our ICML 2024 attendees reveal the research and papers they found most interesting.

Here, discover the perspectives of Machine Learning Engineer Yousuf as he discusses the papers he found most compelling at the conference.

Arrows of Time for Large Language Models

Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

Large Language Models (LLMs) typically model the probability of observing a token given the previous tokens, generating text autoregressively. The product of these conditional probabilities forms the joint probability over the sequence, representing the likelihood of observing the entire sequence of tokens.

The authors of the paper observe that this probability distribution can also be learned in a right-to-left fashion, by predicting the previous token instead of the next one. This raises the question: does modelling left-to-right produce a better estimate of the joint distribution than modelling right-to-left? If so, why?
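
Concretely, the chain rule factorises the same joint distribution in either direction, so both models target exactly the same object:

\[
p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{t+1}, \dots, x_T).
\]

Any systematic difference in perplexity between the two directions therefore reflects how hard each factorisation is to learn, not a difference in the distribution being modelled.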

When the left-to-right model consistently outperforms the right-to-left model, the authors speak of a “forward arrow of time”. Their main result is the identification of a consistent forward arrow of time in language models across various architectures, model sizes and languages, with the performance gap widening as model size and capability increase.

They argue that natural language inherently possesses a certain degree of sparsity. In the forward direction this sparsity is less pronounced, because the natural progression of language is more predictable and structured. This reduced sparsity makes it easier for language models to predict the next token, leading to better performance (lower perplexity). Conversely, predicting previous tokens in the backward direction encounters higher sparsity and less predictable structure, resulting in higher perplexity.

To illustrate this, the authors construct artificial datasets where one direction is easier to model. For instance, they create a dataset of random prime numbers followed by their product. Modelling this sequence in reverse, starting with the product, would require the model to perform prime factorization, a computationally complex task. This example demonstrates a strong forward arrow of time, emphasising the computational complexity aspect.
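
As a toy illustration of this asymmetry (a sketch in the spirit of the construction, with made-up helper names; the paper's actual datasets and tokenisation differ), generating such a sequence left-to-right only requires multiplication, while recovering the earlier tokens from the product amounts to factorisation:

```python
import random

def small_primes(limit):
    """Primes up to `limit` via a simple sieve."""
    sieve = [True] * (limit + 1)
    sieve[0] = sieve[1] = False
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            sieve[i * i::i] = [False] * len(sieve[i * i::i])
    return [n for n, is_p in enumerate(sieve) if is_p]

PRIMES = small_primes(1000)

def make_sequence(k=3):
    """One toy example: k random primes followed by their product."""
    primes = sorted(random.sample(PRIMES, k))
    product = 1
    for p in primes:
        product *= p          # forward direction: just multiply
    return primes + [product]

def factorise(n):
    """What the backward direction implicitly requires: recover the primes from the product."""
    factors, d = [], 2
    while d * d <= n:
        while n % d == 0:
            factors.append(d)
            n //= d
        d += 1
    if n > 1:
        factors.append(n)
    return factors

seq = make_sequence()
print(seq)                  # the last element is cheap to compute when reading left to right
print(factorise(seq[-1]))   # predicting the earlier elements from the product means factorising it
```

Trial division is of course quick for numbers this small; the point is that multiplication is easy while factorisation is believed to be hard in general, so a model reading the sequence in reverse is implicitly asked to solve a much harder problem.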

The authors hope these insights can inform improved training procedures that explicitly account for the arrow of time present in the dataset. This could lead to more efficient and effective language models.


Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu

The main contributions of this paper are two-fold: a new framework for reasoning about sequence models, and a new architecture, Mamba-2, which improves on Mamba.

The authors’ new framework, termed State Space Duality (SSD), demonstrates that Transformers and SSMs can be understood as specific instances of a broader class of models. This framework represents sequence models as a single matrix multiplication on a sequence, where the “sequence-mixing” matrix may depend on the input.
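
To make this “sequence-mixing matrix” view concrete, here is a minimal NumPy sketch (our own illustration, not the paper's notation or code; all names are ours) of single-head causal attention written as an explicit matrix M applied to the value sequence:

```python
import numpy as np

def causal_attention_as_matrix(X, Wq, Wk, Wv):
    """Single-head causal attention written as y = M @ V, with an explicit (T, T) mixing matrix M."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    T, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    causal = np.tril(np.ones((T, T), dtype=bool))
    scores = np.where(causal, scores, -np.inf)                # mask out future positions
    M = np.exp(scores - scores.max(axis=-1, keepdims=True))   # row-wise softmax ...
    M = M / M.sum(axis=-1, keepdims=True)                     # ... gives the input-dependent mixing matrix
    return M, M @ V

rng = np.random.default_rng(0)
T, d = 6, 4
X = rng.normal(size=(T, d))
Wq, Wk, Wv = [rng.normal(size=(d, d)) for _ in range(3)]
M, Y = causal_attention_as_matrix(X, Wq, Wk, Wv)
print(np.allclose(M, np.tril(M)))   # True: the mixing matrix is lower triangular (causal)
```

The mixing matrix here is lower triangular and depends on the input through the query-key scores, which is the sense in which attention fits the single-matrix-multiplication picture described above.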

The authors show that SSMs, viewed in this light, produce a sequence-mixing matrix that is structured and semiseparable: it is lower triangular, and every submatrix contained in its lower-triangular part has low rank. This property allows the matrix to be factorised and multiplied with sub-quadratic complexity, which forms the basis of an efficient algorithm for training such models and leads to greatly improved computational performance.
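
The same exercise for a toy SSM with scalar state, recurrence h_t = a_t h_{t-1} + b_t x_t and output y_t = c_t h_t (a deliberate simplification of the paper's general setting, with our own names), makes the structure visible: the mixing matrix has entries M[t, s] = c_t (a_{s+1} · · · a_t) b_s for s ≤ t and zeros above the diagonal.

```python
import numpy as np

def ssm_scan(a, b, c, x):
    """Reference: sequential recurrence h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def ssm_mixing_matrix(a, b, c):
    """Materialise M with M[t, s] = c_t * (a_{s+1} * ... * a_t) * b_s for s <= t, zero above the diagonal."""
    T = len(a)
    M = np.zeros((T, T))
    for t in range(T):
        for s in range(t + 1):
            M[t, s] = c[t] * np.prod(a[s + 1:t + 1]) * b[s]
    return M

rng = np.random.default_rng(0)
T = 8
a, b, c, x = [rng.uniform(0.5, 1.0, T) for _ in range(4)]
M = ssm_mixing_matrix(a, b, c)

print(np.allclose(M @ x, ssm_scan(a, b, c, x)))    # True: one matrix multiplication reproduces the scan
print(np.allclose(M, np.tril(M)))                  # True: the matrix is lower triangular
print(np.linalg.matrix_rank(M[4:, :4]))            # 1: off-diagonal blocks are low rank (state here is scalar)
```

The explicit T x T matrix is only for exposition; the point of the paper is that, thanks to this structure, the matrix never has to be materialised and its product with the input can be computed in sub-quadratic time.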

Part of the improved performance stems from the fact that previous SSMs relied on scan algorithms, which are harder to implement efficiently on GPUs. The SSD framework, which generalises previous SSMs, allows the model to be computed via batched matrix multiplications instead. Furthermore, NVIDIA GPUs have specialised so-called “tensor cores” for fast tiled matrix multiplications, so this reparameterisation leads to much better throughput.
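
A rough sketch of that reformulation, again for the scalar toy recurrence and emphatically not the authors' actual SSD kernel: split the sequence into chunks, handle each chunk with a small dense mixing matrix (one matrix multiplication), and carry a single state between chunks, so the long sequential scan shrinks to a short one over chunks.

```python
import numpy as np

def ssm_scan(a, b, c, x):
    """Reference sequential scan: h_t = a_t * h_{t-1} + b_t * x_t, y_t = c_t * h_t."""
    h, ys = 0.0, []
    for t in range(len(x)):
        h = a[t] * h + b[t] * x[t]
        ys.append(c[t] * h)
    return np.array(ys)

def ssm_chunked(a, b, c, x, chunk=4):
    """Same computation restructured: a dense matmul inside each chunk, a short scan across chunks."""
    T, y, h = len(x), np.zeros(len(x)), 0.0
    for start in range(0, T, chunk):
        ac, bc, cc, xc = (v[start:start + chunk] for v in (a, b, c, x))
        L = len(ac)
        # A[i, j] = a_{j+1} * ... * a_i within the chunk (1 on the diagonal, 0 above it)
        A = np.array([[np.prod(ac[j + 1:i + 1]) if j <= i else 0.0 for j in range(L)] for i in range(L)])
        M = cc[:, None] * A * bc[None, :]                      # intra-chunk mixing matrix
        y[start:start + L] = M @ xc + cc * np.cumprod(ac) * h  # matmul + contribution of the carried state
        h = np.prod(ac) * h + (A[-1] * bc) @ xc                # state handed to the next chunk
    return y

rng = np.random.default_rng(0)
T = 16
a, b, c, x = [rng.uniform(0.5, 1.0, T) for _ in range(4)]
print(np.allclose(ssm_chunked(a, b, c, x), ssm_scan(a, b, c, x)))  # True
```

Most of the work is now dense matrix multiplication, which is what maps well onto tensor cores; only the per-chunk state hand-off remains sequential.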

The authors demonstrate this empirically: the SSD implementation is up to 8x faster and is particularly well suited to the large state sizes that are important for some tasks. Moreover, the SSD framework allows for more natural distributed formulations that use fewer collective operations per layer, further improving the performance of large-scale training.

The authors note that for sufficiently short sequence lengths, around 2048, fused attention implementations (such as FlashAttention) are still slightly faster, as they take advantage of specific features introduced by NVIDIA in the Hopper generation of GPUs. They expect this gap to close as the SSD implementation matures; crucially, for long sequences the sub-quadratic complexity already makes SSD much faster. At a sequence length of 32k, for example, SSD outperforms FlashAttention by more than 10x.

In summary, this paper extends the previous SSM literature with some favourable computational properties and introduces a general framework for reasoning about sequence models.


Read more of our quantitative researchers' thoughts

ICML 2024: Paper Review #2

Discover the perspectives of Danny, one of our machine learning engineers, on the following papers:

  • Compute Better Spent: Replacing Dense Layers with Structured Matrices
  • Emergent Equivariance in Deep Ensembles
ICML 2024: Paper Review #3

Discover the perspectives of Jonathan, one of our software engineers, on the following papers:

  • A Universal Class of Sharpness-Aware Minimization Algorithms
  • Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
ICML 2024: Paper Review #4

Discover the perspectives of Evgeni, one of our senior quantitative researchers, on the following papers:

  • Trained Random Forests Completely Reveal your Dataset
  • Test-of-time Award: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
ICML 2024: Paper Review #5

Discover the perspectives of Michael, one of our Scientific Directors, on the following papers:

  • Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
  • Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
ICML 2024: Paper Review #6

Discover the perspectives of Fabian, one of our senior quantitative researchers, on the following papers:

  • I/O Complexity of Attention, or How Optimal is Flash Attention?
  • Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff
ICML 2024: Paper Review #7

Discover the perspectives of Ingmar, one of our quantitative researchers, on the following papers:

  • Offline Actor-Critic Reinforcement Learning Scales to Large Models
  • Information-Directed Pessimism for Offline Reinforcement Learning
ICML 2024: Paper Review #8

Discover the perspectives of Oliver, one of our quantitative researchers, on the following papers:

  • Better & Faster Large Language Models via Multi-token Prediction
  • Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
