Arrows of Time for Large Language Models
Vassilis Papadopoulos, Jérémie Wenger, Clément Hongler
Large Language Models (LLMs) typically model the probability of observing a token given the previous tokens, generating text autoregressively. The product of these conditional probabilities is the joint probability of the entire token sequence.
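In symbols, for a token sequence $x_1, \dots, x_T$ this left-to-right factorisation reads

$$p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_1, \dots, x_{t-1}).$$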
The authors of the paper observe that this distribution can equally well be learned in a right-to-left fashion, by predicting the previous token instead of the next one. This raises the question: does modelling left-to-right yield a better estimate of the joint distribution than modelling right-to-left, and if so, why?
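By the chain rule, the same joint distribution also admits a right-to-left factorisation,

$$p(x_1, \dots, x_T) \;=\; \prod_{t=1}^{T} p(x_t \mid x_{t+1}, \dots, x_T),$$

so the two directions are equally expressive in principle; any consistent performance gap must come from how easily each factorisation can be learned.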
The authors speak of a “forward arrow of time” when the left-to-right model consistently outperforms the right-to-left model. Their main result is the identification of a consistent forward arrow of time in language models across various architectures, model sizes and languages, with the performance gap widening as model size and capability increase.
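As a rough illustration of how such a comparison can be framed (this is a minimal sketch, not the authors' code), the snippet below assumes two causal language models of identical architecture: a hypothetical `fw_model` trained on text in its original token order and a `bw_model` trained on the same text with every sequence reversed, each mapping a `(batch, seq_len)` tensor of token ids to `(batch, seq_len, vocab)` logits.

```python
# Sketch only: assumes `fw_model` (trained left-to-right) and `bw_model`
# (trained on reversed sequences) with the interface described above.
import torch
import torch.nn.functional as F

def mean_cross_entropy(model, token_ids: torch.Tensor) -> float:
    """Average next-token cross-entropy (in nats) of `model` on `token_ids`."""
    with torch.no_grad():
        logits = model(token_ids[:, :-1])          # predict token t from tokens < t
        targets = token_ids[:, 1:]
        return F.cross_entropy(
            logits.reshape(-1, logits.size(-1)),
            targets.reshape(-1),
        ).item()

def arrow_of_time_gap(fw_model, bw_model, token_ids: torch.Tensor) -> float:
    """Backward minus forward loss; a positive gap is a 'forward arrow of time'."""
    fw_loss = mean_cross_entropy(fw_model, token_ids)
    bw_loss = mean_cross_entropy(bw_model, token_ids.flip(dims=[1]))  # same text, reversed
    return bw_loss - fw_loss
```

Perplexity is simply the exponential of this average cross-entropy, so a positive loss gap corresponds directly to higher perplexity in the backward direction.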
They argue that natural language inherently possesses a certain degree of sparsity, and that this sparsity is asymmetric. In the forward direction it is less pronounced, because the natural progression of language is more predictable and structured; this makes it easier for language models to predict the next token, yielding better performance (lower perplexity). In the backward direction, predicting previous tokens runs into sparser, less predictable structure, resulting in higher perplexity.
To illustrate this, the authors construct artificial datasets where one direction is easier to model. For instance, they create a dataset of random prime numbers followed by their product. Modelling this sequence in reverse, starting with the product, would require the model to perform prime factorization, a computationally complex task. This example demonstrates a strong forward arrow of time, emphasising the computational complexity aspect.
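A minimal sketch of how such a synthetic dataset might be generated is shown below; the exact format used in the paper may differ, but the asymmetry is the same: completing a line left to right only requires a multiplication, while predicting it right to left from the product amounts to integer factorization.

```python
# Sketch only: each example lists two random primes followed by their product.
import random

def is_prime(n: int) -> bool:
    """Deterministic trial-division primality test, adequate for small n."""
    if n < 2:
        return False
    for d in range(2, int(n ** 0.5) + 1):
        if n % d == 0:
            return False
    return True

def random_prime(lo: int, hi: int) -> int:
    """Rejection-sample a random prime from [lo, hi)."""
    while True:
        candidate = random.randrange(lo, hi)
        if is_prime(candidate):
            return candidate

def make_example(lo: int = 1000, hi: int = 10000) -> str:
    p, q = random_prime(lo, hi), random_prime(lo, hi)
    return f"{p} * {q} = {p * q}"

if __name__ == "__main__":
    # e.g. "7919 * 1009 = 7990271": easy to continue forward, hard to invert.
    for _ in range(5):
        print(make_example())
```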
The authors hope these insights can inform improved training procedures that explicitly account for the arrow of time present in the dataset. This could lead to more efficient and effective language models.