The best of ICML 2022 – Paper reviews (part 3)
This article is one of a series of paper reviews from our researchers and machine learning engineers.
Last month, G-Research were Diamond sponsors at ICML, hosted this year in Baltimore, US.
As well as having a stand and team in attendance, a number of our quantitative researchers and machine learning engineers attended as part of their ongoing learning and development.
We asked our quants and machine learning practitioners to write about some of the papers and research that they found most interesting.
Here, Angus L, Machine Learning Engineer at G-Research, discusses three papers.
FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness
Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, Christopher Ré
Attention mechanisms have been responsible for many of the incredible strides made by deep learning models in recent years, but their need to materialise a square matrix with dimensions equal to the length of the input context results in quadratic costs in both memory and time, making them prohibitively expensive for long context lengths.
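To make the quadratic cost concrete, here is a small back-of-the-envelope sketch. The function name and the assumption of 2 bytes per element (half precision) are illustrative choices, not figures from the paper, and the count covers a single attention head for a single sequence only.

```python
def attention_matrix_bytes(context_length, bytes_per_element=2):
    """Memory needed to materialise one (n x n) attention score matrix.

    Assumes half-precision storage (2 bytes per element) by default.
    """
    return context_length * context_length * bytes_per_element

# Doubling the context length quadruples the memory for the score matrix.
for n in (1024, 2048, 65536):
    mib = attention_matrix_bytes(n) / 2**20
    print(f"context length {n}: {mib:.0f} MiB per head, per sequence")
```

Under these assumptions a context of 65,536 tokens already needs 8 GiB for one score matrix, before accounting for multiple heads, batch size, or the copies kept for the backward pass.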
Much recent work has focused on alleviating this problem via approximate self-attention mechanisms, which explicitly trade model quality for reduced computational complexity. In contrast, FlashAttention is a highly efficient exact attention mechanism, which uses IO-awareness (accounting for reads and writes between levels of GPU memory) to greatly optimise the performance of the computation on GPUs.
By tiling the input tensors into smaller blocks, the attention computation can be completed entirely in a single fused kernel from fast GPU on-chip SRAM, without requiring multiple reads and writes to comparatively slow high-bandwidth memory (HBM). By additionally re-computing the attention on-chip during the backward pass (which remains faster than reading from HBM!), the intermediate activations do not have to be stored on the GPU, resulting in substantial memory savings.
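The tiling idea can be sketched on the CPU with NumPy. This is only an illustration of why blockwise computation can remain exact, not the authors' fused GPU kernel: it tiles the keys and values only, uses a running row-wise maximum and normaliser to keep the softmax exact, and says nothing about SRAM or HBM. The function names and block size are assumptions made for the example.

```python
import numpy as np

def naive_attention(q, k, v):
    """Reference attention that materialises the full (n, n) score matrix."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block_size=16):
    """Exact attention computed one key/value block at a time.

    Only an (n, block_size) slice of the scores exists at any moment.
    A running row-wise max and normaliser keep the softmax exact.
    """
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros((n, v.shape[-1]))
    row_max = np.full((n, 1), -np.inf)
    row_sum = np.zeros((n, 1))
    for start in range(0, k.shape[0], block_size):
        k_blk = k[start:start + block_size]
        v_blk = v[start:start + block_size]
        scores = (q @ k_blk.T) * scale
        new_max = np.maximum(row_max, scores.max(axis=-1, keepdims=True))
        # Rescale the partial results accumulated under the old maximum.
        correction = np.exp(row_max - new_max)
        p = np.exp(scores - new_max)
        row_sum = row_sum * correction + p.sum(axis=-1, keepdims=True)
        out = out * correction + p @ v_blk
        row_max = new_max
    return out / row_sum

rng = np.random.default_rng(0)
q, k, v = rng.standard_normal((3, 50, 8))
max_diff = np.abs(naive_attention(q, k, v) - tiled_attention(q, k, v)).max()
print(f"largest difference between naive and tiled outputs: {max_diff:.2e}")
```

The two functions agree up to floating-point rounding, which is the sense in which a tiled computation can be exact rather than approximate.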
The authors report a 2-4x speedup and 5-20x memory reduction when compared to existing attention implementations, opening the door for the training of attention-based models on much larger context sizes.
Head2Toe: Utilizing Intermediate Representations for Better Transfer Learning
Utku Evci, Vincent Dumoulin, Hugo Larochelle, Michael C Mozer
Transfer learning is the task of taking a machine learning model trained on a particular source domain and adapting it to perform well in a separate target domain, particularly when data is much more abundant in the former.
The hope is that the intermediate representations learnt by the model in order to perform well in the source domain will transfer to the target domain, allowing it to generalise effectively to the new task with minimal additional training and data. For instance, a model trained to detect cats may not require much modification in order to successfully detect dogs, since visually they have many features in common.
Two common approaches for classification task transfer learning with deep neural networks are:
- linear probing – where the trained model weights are frozen and a new final classification layer is trained for the target domain
- fine-tuning – where all of the model parameters are further trained in the target domain
Fine-tuning can achieve greater accuracy at the cost of much more computation, but Head2Toe achieves the best of both worlds. On the VTAB benchmark for visual transfer learning, it attains classification performance on par with (and in some cases exceeding) that of fine-tuning, while reducing both training and storage costs by more than 100x.
Head2Toe achieves this by training a new classification head on features selected from all layers of the source model (with no fine-tuning!), using group-lasso regularisation to learn a sparse linear layer that selects the most relevant features for the target domain.
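The selection step can be illustrated with a toy NumPy sketch. Everything here is an assumption made for illustration: the "source model" is two random frozen layers, the target task is synthetic, and the optimiser is plain proximal gradient descent. It is not the authors' implementation; it only shows how a group-lasso penalty on the rows of a linear head yields per-feature relevance scores that can be used to pick features from every layer.

```python
import numpy as np

def softmax(logits):
    logits = logits - logits.max(axis=1, keepdims=True)
    e = np.exp(logits)
    return e / e.sum(axis=1, keepdims=True)

def train_head(features, labels, num_classes, lam=0.0, lr=0.1, steps=500):
    """Train a linear softmax head by (proximal) gradient descent.

    With lam > 0, a group-lasso penalty treats each weight row (one row per
    input feature) as a group and shrinks whole rows towards zero.
    """
    n, f = features.shape
    w = np.zeros((f, num_classes))
    onehot = np.eye(num_classes)[labels]
    for _ in range(steps):
        grad = features.T @ (softmax(features @ w) - onehot) / n
        w -= lr * grad
        if lam > 0:
            # Proximal step for the group-lasso penalty: shrink each row norm.
            norms = np.linalg.norm(w, axis=1, keepdims=True)
            w *= np.maximum(0.0, 1.0 - lr * lam / np.maximum(norms, 1e-12))
    return w

# Hypothetical frozen source model: two random layers that are never updated.
rng = np.random.default_rng(0)
w1 = rng.standard_normal((10, 16))
w2 = rng.standard_normal((16, 16))

def all_layer_features(x):
    """Concatenate activations from every layer, not just the last one."""
    h1 = np.tanh(x @ w1)
    h2 = np.tanh(h1 @ w2)
    return np.concatenate([h1, h2], axis=1)

x = rng.standard_normal((500, 10))
feats = all_layer_features(x)
# Synthetic target task that depends on unit 3 of the first (early) layer.
labels = (feats[:, 3] > 0).astype(int)

# Stage 1: score features by the row norms of a group-lasso-regularised head.
scores = np.linalg.norm(train_head(feats, labels, 2, lam=0.05), axis=1)
selected = np.argsort(scores)[-4:]

# Stage 2: train a fresh, unregularised head on the selected features only.
head = train_head(feats[:, selected], labels, 2)
accuracy = (np.argmax(feats[:, selected] @ head, axis=1) == labels).mean()
print("selected feature indices:", sorted(selected.tolist()))
print(f"training accuracy on selected features: {accuracy:.3f}")
```

In this toy setting the useful feature lives in an early layer, which a head trained on the final layer alone would see only indirectly.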
I’m particularly excited to see how well this approach generalises beyond image classification settings.
DRAGONN: Distributed Randomized Approximate Gradients of Neural Networks
Zhuang Wang, Zhaozhuo Xu, Xinyu Wu, Anshumali Shrivastava, T. S. Eugene Ng
In data-parallel distributed training of neural networks, gradients are computed on separate mini-batches of data across multiple GPUs and then combined across all GPUs via an all-reduce operation. Since the size of these gradients is proportional to the number of parameters in the model, communicating these gradients becomes increasingly expensive as models grow ever larger.
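A single-process NumPy simulation can show what the all-reduce step accomplishes. The worker count, the linear model, and the helper names are assumptions for illustration, and a real system would use a collective communication library rather than a Python list.

```python
import numpy as np

def local_gradient(w, x, y):
    """Gradient of mean squared error for a linear model on one mini-batch."""
    return 2.0 * x.T @ (x @ w - y) / len(y)

def all_reduce_mean(grads):
    """Stand-in for an all-reduce: every worker receives the same average."""
    avg = np.mean(grads, axis=0)
    return [avg.copy() for _ in grads]

rng = np.random.default_rng(0)
num_workers = 4
x = rng.standard_normal((32, 5))
y = rng.standard_normal(32)
w = np.zeros(5)

# Each simulated worker sees an equally sized shard of the batch.
x_shards = np.split(x, num_workers)
y_shards = np.split(y, num_workers)
local_grads = [local_gradient(w, xs, ys) for xs, ys in zip(x_shards, y_shards)]

synced = all_reduce_mean(local_grads)
full_batch_grad = local_gradient(w, x, y)
print("matches full-batch gradient:", np.allclose(synced[0], full_batch_grad))
```

Each entry of `local_grads` has as many elements as the model has parameters, which is why the communicated volume grows with model size.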
To alleviate this, gradient compression schemes seek to reduce the size of these communicated gradients while adding as little computational overhead as possible.
DRAGONN is one such scheme. It uses a hashing-based compression algorithm that builds on Deep Gradient Compression (DGC). In DGC, only gradient elements whose magnitude exceeds some threshold are communicated to the other GPUs. Even in the most efficient GPU implementations, this requires multiple scans of the tensor in order to create a Boolean mask of elements above the threshold and extract their corresponding indices.
DRAGONN instead implements a hashing scheme where, if the magnitude of a gradient element is greater than or equal to some threshold, its index is mapped into some fixed integer range, which is used to compute an offset for a pre-allocated memory location into which the gradient value is written.
While this introduces some additional noise due to hash collisions, this approach greatly reduces the compression cost incurred relative to DGC, which in turn shortens the training time to model convergence by up to 3.5x relative to DGC.
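The mask-then-extract pattern described for DGC can be sketched in NumPy as follows. This is a simplified illustration with hypothetical function names; it compares magnitudes against the threshold and leaves out everything else a full scheme would need, such as how the threshold is chosen.

```python
import numpy as np

def threshold_compress(grad, threshold):
    """Keep only elements whose magnitude exceeds the threshold."""
    flat = grad.ravel()
    mask = np.abs(flat) > threshold   # one scan: build the Boolean mask
    indices = np.flatnonzero(mask)    # another scan: extract the indices
    return indices, flat[indices]

def threshold_decompress(indices, values, shape):
    """Rebuild a dense tensor, with zeros where nothing was sent."""
    out = np.zeros(int(np.prod(shape)), dtype=values.dtype)
    out[indices] = values
    return out.reshape(shape)

grad = np.array([[0.1, -2.0], [3.0, 0.05]])
indices, values = threshold_compress(grad, threshold=1.0)
restored = threshold_decompress(indices, values, grad.shape)
print("sent", len(values), "of", grad.size, "elements")
```

Only the `(indices, values)` pair would be communicated, at the cost of the separate passes needed to build the mask and gather the indices.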
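A rough sketch of the hashing idea follows. Several details are assumptions made purely for illustration and should not be read as DRAGONN's actual design: the multiplicative hash and its constants, storing the original index next to the value so the receiver can place it, and letting the last write win on a collision. The Python loop stands in for what would be a single parallel pass on a GPU.

```python
import numpy as np

def hash_compress(grad, threshold, num_slots, a=2654435761, b=40503):
    """Write large-magnitude elements into a fixed-size buffer in one pass.

    The constants a and b are arbitrary choices for this illustration.
    """
    flat = grad.ravel()
    slot_index = np.full(num_slots, -1, dtype=np.int64)  # -1 marks an empty slot
    slot_value = np.zeros(num_slots, dtype=flat.dtype)
    for i, g in enumerate(flat):
        if abs(g) >= threshold:
            slot = (a * i + b) % num_slots  # map the index into [0, num_slots)
            slot_index[slot] = i            # a collision overwrites the earlier entry
            slot_value[slot] = g
    return slot_index, slot_value

def hash_decompress(slot_index, slot_value, shape):
    """Scatter the surviving values back into a dense tensor of zeros."""
    out = np.zeros(int(np.prod(shape)), dtype=slot_value.dtype)
    filled = slot_index >= 0
    out[slot_index[filled]] = slot_value[filled]
    return out.reshape(shape)

grad = np.array([0.1, -2.0, 3.0, 0.05, 1.5])
slot_index, slot_value = hash_compress(grad, threshold=1.0, num_slots=8)
restored = hash_decompress(slot_index, slot_value, grad.shape)

# With a single slot every selected element collides and only one survives.
crowded = hash_decompress(*hash_compress(grad, 1.0, num_slots=1), grad.shape)
print("elements kept with 8 slots:", np.count_nonzero(restored))
print("elements kept with 1 slot:", np.count_nonzero(crowded))
```

The buffer size sets the trade-off in this sketch: fewer slots mean less data to communicate but more values lost to collisions.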