I/O Complexity of Attention, or How Optimal is Flash Attention?
Barna Saha & Christopher Ye
In 2022, Dao et al. introduced FlashAttention, an algorithm that speeds up self-attention by reducing I/O operations, i.e. transfers between fast cache and slow memory, which they identified as the major bottleneck in standard attention implementations.
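To make the tiling idea concrete, here is a minimal NumPy sketch of I/O-aware exact attention in the spirit of FlashAttention; it is an illustration only, not the authors' kernel, and the function name `blocked_attention` and the `block` parameter are my own. Keeping one query tile resident while streaming K/V tiles means K and V are re-read once per query tile, which is where the N²d²/M-type I/O cost discussed below comes from.

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Exact attention softmax(Q K^T) V computed tile by tile.

    `block` plays the role of the cache size M: only O(block * d) values are
    held "in cache" at once, and the N x N score matrix is never materialized.
    The usual 1/sqrt(d) score scaling is omitted for brevity.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block):
        Qi = Q[i:i+block]                     # query tile kept resident
        m = np.full(Qi.shape[0], -np.inf)     # running row maxima
        l = np.zeros(Qi.shape[0])             # running softmax denominators
        acc = np.zeros((Qi.shape[0], d))      # unnormalized output accumulator
        for j in range(0, N, block):
            Kj, Vj = K[j:j+block], V[j:j+block]   # stream one K/V tile
            S = Qi @ Kj.T                          # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)              # rescale older partial sums
            P = np.exp(S - m_new[:, None])         # "online softmax" update
            l = l * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ Vj
            m = m_new
        O[i:i+block] = acc / l[:, None]
    return O

# Sanity check against the naive algorithm that materializes all N^2 scores.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(blocked_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```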
In their paper, Saha and Ye give a comprehensive analysis of the I/O complexity of the self-attention module used in transformer architectures, and they show that FlashAttention is I/O-optimal in most practical parameter regimes.
Saha and Ye establish a tight lower bound on the I/O complexity of attention computed with standard matrix multiplication whenever the cache size M exceeds d², where d is the head dimension. FlashAttention matches this bound, which is proved using the red-blue pebble game framework introduced by Hong and Kung in 1981 [1].
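For concreteness, writing N for the sequence length, the tight bound in this large cache regime can be stated as follows (my rendering of the result; I am not reproducing the paper's exact constants or theorem numbering):

```latex
% Large cache regime: tight I/O bound, attained by FlashAttention's tiling.
\[
  \mathrm{IO}_{\text{attention}}(N, d, M) \;=\; \Theta\!\left(\frac{N^2 d^2}{M}\right)
  \qquad \text{for } M \ge d^2.
\]
```

Intuitively, FlashAttention attains this by processing tiles of roughly M/d rows at a time, so the N x d matrices K and V are each streamed from slow memory about Nd/M times.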
They also use a compression framework of Pagh and Silvestri (2014) to show that this bound cannot be improved for exact attention computation even by algorithms built on fast matrix multiplication [2].
For M below d² (the small cache regime), they show that the I/O complexity of attention matches that of standard matrix multiplication, and they give an algorithm with better I/O complexity than FlashAttention in this regime by employing standard techniques from I/O-efficient matrix multiplication.
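Spelling this out under the classical Hong and Kung style bound of Θ(W/√M) I/Os for a computation performing W scalar multiplications (my assumption about the intended reading; here W = N²d for attention's two rectangular products):

```latex
% Small cache regime: attention inherits the matrix-multiplication bound.
\[
  \mathrm{IO}_{\text{attention}}(N, d, M) \;=\; \Theta\!\left(\frac{N^2 d}{\sqrt{M}}\right)
  \qquad \text{for } M < d^2.
\]
```

As a sanity check, the two regimes agree at the crossover M = d², where both expressions evaluate to Θ(N²).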
Collectively, these results indicate that, at least in the large cache regime that covers most practical settings, no exact attention algorithm can surpass FlashAttention in terms of I/O complexity.
[1] Jia-Wei Hong and H. T. Kung. I/O Complexity: The Red-Blue Pebble Game. STOC 1981.
[2] Rasmus Pagh and Francesco Silvestri. The Input/Output Complexity of Triangle Enumeration. PODS 2014.