I/O Complexity of Attention, or How Optimal is Flash Attention?
Barna Saha & Christopher Ye
In 2022, Dao et al. introduced FlashAttention, an algorithm that speeds up self-attention by reducing I/O operations, i.e. transfers between fast cache and slow memory, which they identified as the major bottleneck in standard attention implementations.
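To make the tiling idea concrete, here is a minimal NumPy sketch of I/O-aware exact attention in the spirit of FlashAttention; it is an illustration only, not the authors' kernel, and the function name `blocked_attention` and the `block` parameter are my own. Keeping one query tile resident while streaming K/V tiles means K and V are re-read once per query tile, which is where the N²d²/M-type I/O cost discussed below comes from.

```python
import numpy as np

def blocked_attention(Q, K, V, block=64):
    """Exact attention softmax(Q K^T) V computed tile by tile.

    `block` plays the role of the cache size M: only O(block * d) values are
    held "in cache" at once, and the N x N score matrix is never materialized.
    The usual 1/sqrt(d) score scaling is omitted for brevity.
    """
    N, d = Q.shape
    O = np.zeros((N, d))
    for i in range(0, N, block):
        Qi = Q[i:i+block]                     # query tile kept resident
        m = np.full(Qi.shape[0], -np.inf)     # running row maxima
        l = np.zeros(Qi.shape[0])             # running softmax denominators
        acc = np.zeros((Qi.shape[0], d))      # unnormalized output accumulator
        for j in range(0, N, block):
            Kj, Vj = K[j:j+block], V[j:j+block]   # stream one K/V tile
            S = Qi @ Kj.T                          # scores for this tile only
            m_new = np.maximum(m, S.max(axis=1))
            scale = np.exp(m - m_new)              # rescale older partial sums
            P = np.exp(S - m_new[:, None])         # "online softmax" update
            l = l * scale + P.sum(axis=1)
            acc = acc * scale[:, None] + P @ Vj
            m = m_new
        O[i:i+block] = acc / l[:, None]
    return O

# Sanity check against the naive algorithm that materializes all N^2 scores.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((256, 32)) for _ in range(3))
S = Q @ K.T
P = np.exp(S - S.max(axis=1, keepdims=True))
assert np.allclose(blocked_attention(Q, K, V), (P / P.sum(axis=1, keepdims=True)) @ V)
```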
In their paper, Saha and Ye give a comprehensive analysis of the I/O complexity of the self-attention module used in transformer architectures, and they show that FlashAttention is I/O-optimal in most practical parameter regimes.
Saha and Ye establish a tight lower bound on the I/O complexity of attention computed with standard matrix multiplication whenever the cache size M exceeds d², where d is the head dimension. FlashAttention matches this bound, which is proved using the red-blue pebble game framework introduced by Hong and Kung in 1981 [1].
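For concreteness, writing N for the sequence length, the tight bound in this large cache regime can be stated as follows (my rendering of the result; I am not reproducing the paper's exact constants or theorem numbering):

```latex
% Large cache regime: tight I/O bound, attained by FlashAttention's tiling.
\[
  \mathrm{IO}_{\text{attention}}(N, d, M) \;=\; \Theta\!\left(\frac{N^2 d^2}{M}\right)
  \qquad \text{for } M \ge d^2.
\]
```

Intuitively, FlashAttention attains this by processing tiles of roughly M/d rows at a time, so the N x d matrices K and V are each streamed from slow memory about Nd/M times.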
They also use a compression framework of Pagh and Silvestri (2014) to show that this bound cannot be improved for exact attention computation even by algorithms built on fast matrix multiplication [2].
For M below d² (the small cache regime), they show that the I/O complexity of attention matches that of standard matrix multiplication, and they give an algorithm with better I/O complexity than FlashAttention in this regime by employing standard techniques from I/O-efficient matrix multiplication.
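Spelling this out under the classical Hong and Kung style bound of Θ(W/√M) I/Os for a computation performing W scalar multiplications (my assumption about the intended reading; here W = N²d for attention's two rectangular products):

```latex
% Small cache regime: attention inherits the matrix-multiplication bound.
\[
  \mathrm{IO}_{\text{attention}}(N, d, M) \;=\; \Theta\!\left(\frac{N^2 d}{\sqrt{M}}\right)
  \qquad \text{for } M < d^2.
\]
```

As a sanity check, the two regimes agree at the crossover M = d², where both expressions evaluate to Θ(N²).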
Collectively, these results indicate that, at least in the large cache regime that covers most practical settings, no exact attention algorithm can surpass FlashAttention in terms of I/O complexity.
[1] Jia-Wei Hong and H. T. Kung. I/O Complexity: The Red-Blue Pebble Game. STOC 1981.
[2] Rasmus Pagh and Francesco Silvestri. The Input/Output Complexity of Triangle Enumeration. PODS 2014.