
ICML 2024: Paper Review #6

24 September 2024
  • Quantitative Research

Machine Learning (ML) is a fast-evolving discipline, which means that attending conferences and hearing about the very latest research is key to the ongoing development and success of our quantitative researchers and ML engineers.

In this paper review series, our ICML 2024 attendees reveal the research and papers they found most interesting.

Here, discover the perspectives of Fabian, one of our senior quantitative researchers, as he discusses the papers he found most compelling at the conference.

I/O Complexity of Attention, or How Optimal is Flash Attention?

Barna Saha & Christopher Ye

In 2022, Dao et al. introduced FlashAttention, an algorithm that speeds up the computation of self-attention by reducing I/O operations between slow and fast memory, which they identified as a major bottleneck in standard attention implementations.
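As a rough illustration of the idea, here is a minimal NumPy sketch of blockwise attention with an online softmax. This is not the authors' fused GPU kernel, and the block sizes are placeholders standing in for whatever fits in fast memory; it only shows why the full N x N score matrix never needs to be materialised.

# Minimal sketch of I/O-aware (blockwise) attention in the spirit of
# FlashAttention: Q, K, V are processed in tiles and the softmax is
# accumulated online, so only tile-sized intermediates are ever held.
import numpy as np

def blockwise_attention(Q, K, V, block_q=64, block_k=64):
    N, d = Q.shape
    out = np.zeros_like(Q, dtype=np.float64)
    for qs in range(0, N, block_q):
        q = Q[qs:qs + block_q]                 # one query tile
        m = np.full(q.shape[0], -np.inf)       # running row-wise max
        l = np.zeros(q.shape[0])               # running softmax denominator
        acc = np.zeros((q.shape[0], d))        # running weighted sum of V
        for ks in range(0, N, block_k):
            k, v = K[ks:ks + block_k], V[ks:ks + block_k]
            s = q @ k.T / np.sqrt(d)           # scores for this tile only
            m_new = np.maximum(m, s.max(axis=1))
            p = np.exp(s - m_new[:, None])     # tile-local numerator
            scale = np.exp(m - m_new)          # rescale previous partial sums
            l = l * scale + p.sum(axis=1)
            acc = acc * scale[:, None] + p @ v
            m = m_new
        out[qs:qs + block_q] = acc / l[:, None]
    return out

# Sanity check against the naive implementation that builds the full score matrix.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T / np.sqrt(32)
P = np.exp(S - S.max(axis=1, keepdims=True))
ref = (P / P.sum(axis=1, keepdims=True)) @ V
assert np.allclose(blockwise_attention(Q, K, V), ref)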

In their paper, Saha and Ye provide a comprehensive analysis of the I/O complexity of the self-attention module used in transformer architectures. They demonstrate that the FlashAttention algorithm is optimal in terms of I/O complexity for most practical scenarios.

Saha and Ye establish a tight lower bound on the I/O complexity of computing attention with standard matrix multiplication when the cache size M exceeds d², where d is the head dimension. This bound is achieved by FlashAttention and is proved using the red-blue pebble game method introduced by Hong and Kung in 1981. [1]
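As a point of reference, the FlashAttention paper itself reports an HBM-access count of O(N²d²/M) for sequence length N, head dimension d and fast-memory size M. My paraphrase of the large cache result (not the paper's exact theorem statement) is:

% Paraphrase of the large cache regime; N: sequence length,
% d: head dimension, M: size of the fast memory (cache / SRAM).
\[
  \text{I/O cost of exact attention} \;=\; \Theta\!\left(\frac{N^2 d^2}{M}\right)
  \quad \text{when } M \ge d^2 ,
\]
i.e. the upper bound achieved by FlashAttention is matched by a lower bound in this regime.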

They also utilise a compression framework from Pagh & Silvestri (2014) to show that the established bound cannot be improved for exact attention computation, even by algorithms that use fast matrix multiplication. [2]

For scenarios where M is less than d² (referred to as the small cache regime), they demonstrate that the I/O complexity of attention matches that of standard matrix multiplication. They also propose an algorithm with better I/O complexity than FlashAttention for this case, employing standard techniques for reducing the I/O cost of matrix multiplication.
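For comparison, my paraphrase of the small cache result (again, not the paper's exact statement) is that the bound takes the familiar Hong-Kung matrix-multiplication form:

% Paraphrase of the small cache regime, same notation as above.
\[
  \text{I/O cost of exact attention} \;=\; \Theta\!\left(\frac{N^2 d}{\sqrt{M}}\right)
  \quad \text{when } M < d^2 .
\]
As a sanity check, the two regimes agree at the boundary M = d², where both expressions equal N².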

These results collectively indicate that no other exact attention algorithm can surpass FlashAttention in terms of I/O complexity.

[1] Hong & Kung (1981), I/O Complexity: The Red-Blue Pebble Game

[2] Pagh & Silvestri (2014), The Input/Output Complexity of Triangle Enumeration

I/O Complexity of Attention, or How Optimal is Flash Attention?

Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff

Simran Arora, Sabri Eyuboglu, Michael Zhang, Aman Timalsina, Silas Alberti, Dylan Zinsley, James Zou, Atri Rudra, Christopher Ré

One theme of ICML 2024 has been the competition and comparison between attention-based models (i.e. transformers) and state-space models (SSMs). While SSMs have shown remarkable performance in general language modelling with a smaller memory footprint, thanks to their fixed-size recurrent state, they have demonstrated relatively poor recall, that is, the ability to ground their generation in previously seen tokens. In contrast, attention-based models excel at recall, provided the context length is sufficiently large.
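Recall in this sense is usually probed with synthetic associative-recall tasks: the model reads a list of key-value pairs and is later queried with one of the keys. The toy generator below is my own illustration of the kind of task being measured, not the paper's actual benchmark.

# Toy associative-recall task: given "k1v1 k2v2 ... | kq ?", the model must
# output the value bound to the queried key. A transformer can attend back to
# the exact pair; a fixed-state recurrent model must compress all pairs.
import random

def make_recall_example(num_pairs=8, vocab=list("abcdefghijklmnop"), seed=None):
    rng = random.Random(seed)
    keys = rng.sample(vocab, num_pairs)
    values = [rng.choice("0123456789") for _ in keys]
    query = rng.choice(keys)
    context = " ".join(f"{k}{v}" for k, v in zip(keys, values))
    answer = values[keys.index(query)]
    return f"{context} | {query} ?", answer

prompt, answer = make_recall_example(seed=0)
print(prompt)   # e.g. "c4 h7 ... | h ?"
print(answer)   # the digit paired with the queried key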

In this paper, Arora et al. reveal that this observation stems from a fundamental trade-off between recall ability and state size.

First, they empirically compare several models on their recall ability as a function of state size, revealing a consistent pattern: increased state size leads to better recall performance. However, some models consistently underperform compared to others with the same state size.

Second, they provide theoretical analysis to support these empirical findings.

Third, the authors introduce a new model called BASED, which combines linear attention with sliding window attention. This model improves on the existing Pareto frontier, outperforming prior models in recall ability for a given state-size budget, while also achieving competitive results in general language modelling tasks.
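For intuition, here are minimal sketches of the two attention primitives that BASED builds on. The feature map and window size are placeholders of my own choosing, not the authors' design (BASED uses a Taylor approximation of the softmax exponential and a fused, I/O-aware implementation), and how the two components are composed in the full architecture is not shown here.

# (1) Causal linear attention: the recurrent state is a d_k x d_v matrix per
#     head, independent of sequence length.
# (2) Sliding window attention: exact softmax attention over only the most
#     recent `window` tokens.
import numpy as np

def feature_map(x):
    # Placeholder positive feature map; not the authors' Taylor-expansion kernel.
    return np.maximum(x, 0.0) + 1e-6

def causal_linear_attention(Q, K, V):
    N, d_v = V.shape
    d_k = Q.shape[1]
    S = np.zeros((d_k, d_v))       # running sum of phi(k) v^T
    z = np.zeros(d_k)              # running sum of phi(k), for normalisation
    out = np.zeros((N, d_v))
    for t in range(N):
        q, k, v = feature_map(Q[t]), feature_map(K[t]), V[t]
        S += np.outer(k, v)
        z += k
        out[t] = (q @ S) / (q @ z + 1e-6)
    return out

def sliding_window_attention(Q, K, V, window=16):
    N, d = Q.shape
    out = np.zeros_like(V)
    for t in range(N):
        lo = max(0, t - window + 1)
        s = Q[t] @ K[lo:t + 1].T / np.sqrt(d)
        w = np.exp(s - s.max())
        out[t] = (w / w.sum()) @ V[lo:t + 1]
    return out

rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((128, 16)) for _ in range(3))
print(causal_linear_attention(Q, K, V).shape)   # (128, 16)
print(sliding_window_attention(Q, K, V).shape)  # (128, 16)

The key point is that the linear-attention state does not grow with context length, while the small exact-attention window handles precise local lookups; the paper's contribution is showing where this combination sits on the recall/state-size Pareto frontier and how to implement it efficiently.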

Finally, they describe the implementation of BASED in an I/O-aware manner, significantly improving training and inference times.

In conclusion, the authors demonstrate a fundamental trade-off between recall and state size. They also show that BASED improves upon the best existing models for recall while providing competitive results in general language modelling tasks.

Simple Linear Attention Language Models Balance the Recall-Throughput Tradeoff


Read more of our quantitative researchers' thoughts

ICML 2024: Paper Review #1

Discover the perspectives of Yousuf, one of our machine learning engineers, on the following papers:

  • Arrows of Time for Large Language Models
  • Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Read now
ICML 2024: Paper Review #2

Discover the perspectives of Danny, one of our machine learning engineers, on the following papers:

  • Compute Better Spent: Replacing Dense Layers with Structured Matrices
  • Emergent Equivariance in Deep Ensembles
Read now
ICML 2024: Paper Review #3

Discover the perspectives of Jonathan, one of our software engineers, on the following papers:

  • A Universal Class of Sharpness-Aware Minimization Algorithms
  • Rotational Equilibrium: How Weight Decay Balances Learning Across Neural Networks
Read now
ICML 2024: Paper Review #4

Discover the perspectives of Evgeni, one of our senior quantitative researchers, on the following papers:

  • Trained Random Forests Completely Reveal your Dataset
  • Test-of-time Award: DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition
Read now
ICML 2024: Paper Review #5

Discover the perspectives of Michael, one of our Scientific Directors, on the following papers:

  • Stop Regressing: Training Value Functions via Classification for Scalable Deep RL
  • Physics of Language Models: Part 3.1, Knowledge Storage and Extraction
Read now
ICML 2024: Paper Review #7

Discover the perspectives of Ingmar, one of our quantitative researchers, on the following papers:

  • Offline Actor-Critic Reinforcement Learning Scales to Large Models
  • Information-Directed Pessimism for Offline Reinforcement Learning
Read now
ICML 2024: Paper Review #8

Discover the perspectives of Oliver, one of our quantitative researchers, on the following papers:

  • Better & Faster Large Language Models via Multi-token Prediction
  • Scaling Laws and Compute-Optimal Training Beyond Fixed Training Durations
Read now
