ICML 2024: Paper Review #1

24 September 2024
  • Quantitative Research

Machine Learning (ML) is a fast-evolving discipline, which means attending conferences to hear about the very latest research is key to the ongoing development and success of our quantitative researchers and ML engineers.

In this paper review series, our ICML 2024 attendees reveal the research and papers they found most interesting.

Here, discover the perspectives of Yousuf, one of our Machine Learning Engineers, as he discusses his most compelling findings from the conference.

Arrows of Time for Large Language Models

Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler

Large Language Models (LLMs) typically model the probability of observing a token given the previous tokens, generating text autoregressively. The product of these conditional probabilities forms the joint probability over the sequence, representing the likelihood of observing the entire sequence of tokens.

The authors of the paper observe that this probability distribution can also be learned in a right-to-left fashion, by predicting the previous token instead of the next one. This raises the question: does modelling left-to-right produce a better joint distribution estimate? If so, why?
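As a rough illustration of the two factorisations (not the paper's setup, which trains full forward and backward GPT-style models on real corpora), the toy count-based bigram model below scores a sentence left-to-right and, after reversing the sequences, right-to-left. The corpus, test sentence and add-one smoothing are invented purely for this example.

```python
import math
from collections import Counter, defaultdict

corpus = ["the cat sat on the mat", "the dog sat on the rug"]
tokenised = [s.split() for s in corpus]

def bigram_model(sequences):
    """Count-based conditional distribution p(token | previous token) with add-one smoothing."""
    vocab = {tok for seq in sequences for tok in seq}
    counts = defaultdict(Counter)
    for seq in sequences:
        for prev, nxt in zip(seq, seq[1:]):
            counts[prev][nxt] += 1
    def prob(nxt, prev):
        return (counts[prev][nxt] + 1) / (sum(counts[prev].values()) + len(vocab))
    return prob

# The forward model predicts the next token; the backward model is trained on
# reversed sequences, so it effectively predicts the previous token.
p_fwd = bigram_model(tokenised)
p_bwd = bigram_model([list(reversed(seq)) for seq in tokenised])

def neg_log_likelihood(seq, prob):
    # Sum of -log p(x_t | x_{t-1}) over the sequence; lower is better.
    return -sum(math.log(prob(nxt, prev)) for prev, nxt in zip(seq, seq[1:]))

test = "the cat sat on the rug".split()
print("forward  NLL:", neg_log_likelihood(test, p_fwd))
print("backward NLL:", neg_log_likelihood(list(reversed(test)), p_bwd))
```

Given enough data, the two directions describe the same joint distribution; the paper's question is which direction is easier for a model to learn well in practice.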

The authors refer to a “forward arrow of time” when the left-to-right model consistently outperforms the right-to-left model. Their main result is the identification of a consistent forward arrow of time in language models across various architectures, model sizes and languages, with the performance gap widening as model size and capability increase.

They argue that natural language inherently possesses a certain degree of sparsity. In the forward direction, this sparsity is less pronounced due to the natural progression of language being more predictable and structured. This reduced sparsity makes it easier for language models to predict the next token, leading to better performance (lower perplexity). Conversely, predicting previous tokens in the backward direction encounters higher sparsity and less predictable structures, resulting in higher perplexity.

To illustrate this, the authors construct artificial datasets where one direction is easier to model. For instance, they create a dataset of random prime numbers followed by their product. Modelling this sequence in reverse, starting with the product, would require the model to perform prime factorization, a computationally complex task. This example demonstrates a strong forward arrow of time, emphasising the computational complexity aspect.
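The snippet below sketches this kind of synthetic sequence; the prime range, the number of factors and the formatting are illustrative guesses rather than the paper's exact construction.

```python
import random

def is_prime(n):
    # Simple trial division; fine for the small primes used here.
    if n < 2:
        return False
    return all(n % d for d in range(2, int(n ** 0.5) + 1))

def random_prime(lo, hi, rng):
    while True:
        n = rng.randrange(lo, hi)
        if is_prime(n):
            return n

def make_example(rng, n_primes=2, lo=10, hi=1000):
    primes = [random_prime(lo, hi, rng) for _ in range(n_primes)]
    product = 1
    for p in primes:
        product *= p
    # Read left to right, the final token follows from a single multiplication;
    # read right to left, recovering the primes from the product requires factorisation.
    return " ".join(map(str, primes)) + " = " + str(product)

rng = random.Random(0)
for _ in range(3):
    print(make_example(rng))
```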

The authors hope these insights can inform improved training procedures that explicitly account for the arrow of time present in the dataset. This could lead to more efficient and effective language models.


Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality

Tri Dao and Albert Gu

The main contributions of this paper are two-fold: a new framework for reasoning about sequence models, and a new architecture, Mamba-2, which improves on Mamba.

The authors’ new framework, termed State Space Duality (SSD), demonstrates that Transformers and SSMs can be understood as specific instances of a broader class of models. This framework represents sequence models as a single matrix multiplication on a sequence, where the “sequence-mixing” matrix may depend on the input.
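The toy comparison below is a rough numerical sketch of that viewpoint, not the paper's formal construction: causal attention and a scalar-decay SSM both mix a sequence of values with a T x T lower-triangular matrix, and only how that matrix is built differs. The scalar decay is a simplification; in the paper the mixing matrices (like attention's, via QKᵀ) can depend on the input.

```python
import numpy as np

T, d = 6, 4
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((T, d)) for _ in range(3))

# Attention: the mixing matrix is the row-wise softmax of the causally masked QK^T.
scores = Q @ K.T / np.sqrt(d)
mask = np.tril(np.ones((T, T), dtype=bool))
scores = np.where(mask, scores, -np.inf)
A = np.exp(scores - scores.max(axis=-1, keepdims=True))
A /= A.sum(axis=-1, keepdims=True)
y_attention = A @ V

# Toy SSM: the mixing matrix is built from powers of a decay factor
# (kept input-independent here purely for simplicity).
a = 0.9
idx = np.arange(T)
M = np.tril(a ** (idx[:, None] - idx[None, :]))
y_ssm = M @ V

print(y_attention.shape, y_ssm.shape)  # both (T, d): one mixing matrix applied to the sequence
```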

The authors show that SSMs, viewed in this light, produce a sequence-mixing matrix that is structured and semiseparable: it is lower triangular, and every submatrix within the lower-triangular part is low-rank. This property allows the matrix to be factorised and multiplied efficiently, with sub-quadratic complexity. This is the basis of an efficient algorithm for training such models and leads to greatly improved computational performance.
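To make this concrete, the toy example below uses a scalar-decay recurrence, a deliberately simplified stand-in for the paper's general semiseparable matrices: materialising the lower-triangular mixing matrix and multiplying costs O(T²), while the equivalent recurrence costs O(T), and both produce the same output.

```python
import numpy as np

T, a = 8, 0.9                       # sequence length and decay (toy values)
rng = np.random.default_rng(0)
x = rng.standard_normal(T)

# Quadratic route: build the dense lower-triangular mixing matrix
# M[t, s] = a**(t - s) for s <= t, then multiply it by the input sequence.
idx = np.arange(T)
M = np.tril(a ** (idx[:, None] - idx[None, :]))
y_dense = M @ x

# Linear route: the recurrence h_t = a * h_{t-1} + x_t computes the same thing
# in a single pass, without ever materialising the T x T matrix.
y_scan = np.empty(T)
h = 0.0
for t in range(T):
    h = a * h + x[t]
    y_scan[t] = h

print(np.allclose(y_dense, y_scan))  # True: both compute the same sequence mix
```

In the actual models the mixing matrix also depends on the input, but it is this lower-triangular, semiseparable structure that the efficient algorithm exploits.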

Part of the improved performance relies on the fact that previous SSMs used scan algorithms, which are harder to implement efficiently on GPUs. The SSD framework, which generalises previous SSMs, allows the model to be computed via batched matrix multiplications instead. Furthermore, NVIDIA GPUs have specialised “tensor cores” for fast tiled matrix multiplications, so this reparameterisation leads to much better throughput.
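The sketch below mimics that idea for the same scalar-decay toy model (it is not the Mamba-2 kernel): each chunk of the sequence is mixed with a small dense matrix multiplication, the kind of operation tensor cores accelerate, and a single carried state passes information between chunks.

```python
import numpy as np

T, C, a = 16, 4, 0.9                # sequence length, chunk size, decay (toy values)
rng = np.random.default_rng(0)
x = rng.standard_normal(T)

# Reference: the plain sequential recurrence h_t = a * h_{t-1} + x_t.
ref = np.empty(T)
h = 0.0
for t in range(T):
    h = a * h + x[t]
    ref[t] = h

# Chunked evaluation: a dense matmul inside each chunk, one scalar state between chunks.
i = np.arange(C)
L = np.tril(a ** (i[:, None] - i[None, :]))               # intra-chunk mixing matrix (C x C)
y = np.empty(T)
state = 0.0
for k in range(0, T, C):
    xc = x[k:k + C]
    y[k:k + C] = L @ xc + (a ** (i + 1)) * state              # local matmul + carried state
    state = (a ** C) * state + np.sum(a ** (C - 1 - i) * xc)  # pass the state to the next chunk

print(np.allclose(ref, y))  # True: chunked evaluation matches the sequential recurrence
```

Replacing one long sequential scan with a handful of small dense matrix multiplications is what maps well onto tensor cores.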

The authors demonstrate this empirically: the SSD implementation is up to 8x faster and is particularly well suited to large state sizes, which are important for some tasks. Moreover, the SSD framework allows for more natural distributed formulations that use fewer collective operations per layer, further improving the performance of large-scale training.

The authors note that for sufficiently short sequence lengths (around 2048), fused attention implementations such as FlashAttention are still slightly faster, as they take advantage of specific features introduced by NVIDIA in the Hopper generation of GPUs. They expect this gap to shrink as the SSD implementation matures, and, crucially, for long sequences the sub-quadratic complexity already makes SSD much faster: at a sequence length of 32k, the gap between SSD and FlashAttention is more than 10x.

In summary, this paper extends the previous SSM literature with some favourable computational properties and introduces a general framework for reasoning about sequence models.


Read more of our quantitative researchers' thoughts

ICML 2024: Paper Review #2

Discover the perspectives of Danny, one of our machine learning engineers, on the following papers:

  • Compute Better Spent: Replacing Dense Layers with Structured Matrices
  • Emergent Equivariance in Deep Ensembles
ICML 2024: Paper Review #3

Discover the perspectives of Jonathan, one of our software engineers, on the following papers:

  • A Universal Class of Sharpness-Aware Minimization Algorithms
  • Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
ICML 2024: Paper Review #4

Discover the perspectives of Evgeni, one of our senior quantitative researchers, on the following papers:

  • Trained Random Forests Completely Reveal your Dataset
  • Test-of-time Award: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
ICML 2024: Paper Review #5

Discover the perspectives of Michael, one of our Scientific Directors, on the following papers:

  • Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
  • Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
ICML 2024: Paper Review #6

Discover the perspectives of Fabian, one of our senior quantitative researchers, on the following papers:

  • I/O Complexity of Attention, or How Optimal is Flash Attention?
  • Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff
ICML 2024: Paper Review #7

Discover the perspectives of Ingmar, one of our quantitative researchers, on the following papers:

  • Offline Actor-Critic Reinforcement Learning Scales to Large Models
  • Information-Directed Pessimism for Offline Reinforcement Learning
