The best of ICML 2023 (part 2)
This article is one of a series of paper reviews from our researchers and machine learning engineers.
Machine Learning (ML) is a fast-evolving discipline, which means that attending conferences and hearing about the very latest research are key to the ongoing development and success of our quantitative researchers and ML engineers.
As part of their attendance at ICML 2023 in Hawaii, we asked our quants and machine learning practitioners to write about some of the papers and research that they found most interesting.
Here, Jonathan L, Quantitative Researcher at G-Research, discusses three papers.
Resurrecting Recurrent Neural Networks for Long Sequences
Antonio Orvieto, Samuel L Smith, Albert Gu, Anushan Fernando, Caglar Gulcehre, Razvan Pascanu, Soham De
The authors introduce a new recurrent neural net architecture, based on a series of fairly simple modifications to standard RNNs.
The changes to the vanilla RNN include, in particular: (i) a linear recurrent unit, as opposed to passing the hidden state through a non-linearity; (ii) a complex diagonal matrix in the linear recurrence of the hidden state, which yields faster computation than a dense linear system; (iii) an exponential parameterization of the diagonal coefficients of the recurrence matrix, which makes their scale easier to control during training and therefore helps control stability; and (iv) normalization of the hidden activations.
Stacking these linear recurrences into a deep RNN, they show that the model performs strongly, and runs fast thanks to the linear recurrence, on tasks that require memorizing long-range interactions between the inputs.
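The four ingredients listed above can be sketched for a single layer in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the sampling range `r_min`/`r_max`, and the choice to feed one scalar input directly into every state dimension (no learned input or output projections) are all hypothetical simplifications.

```python
import cmath
import math
import random


def make_lru(state_dim, r_min=0.9, r_max=0.999, seed=0):
    """Sample parameters of a diagonal recurrence (hypothetical init range)."""
    rng = random.Random(seed)
    nu_log, theta_log = [], []
    for _ in range(state_dim):
        # Magnitude r in [r_min, r_max), stored as nu_log = log(-log(r)).
        u = rng.random()
        r = math.sqrt(u * (r_max ** 2 - r_min ** 2) + r_min ** 2)
        nu_log.append(math.log(-math.log(r)))
        # Phase in (0, 2*pi], stored as its logarithm.
        theta_log.append(math.log(2.0 * math.pi * (1.0 - rng.random())))
    return nu_log, theta_log


def lru_scan(inputs, nu_log, theta_log):
    """Run h_t = lam * h_{t-1} + gamma * x_t elementwise over a scalar sequence."""
    # (iii) Exponential parameterization: |lam| = exp(-exp(nu_log)) is always < 1.
    lam = [cmath.exp(complex(-math.exp(n), math.exp(t)))
           for n, t in zip(nu_log, theta_log)]
    # (iv) Normalization factor that keeps the state scale under control.
    gamma = [math.sqrt(1.0 - abs(l) ** 2) for l in lam]
    h = [0j] * len(lam)
    outputs = []
    for x in inputs:
        # (i) Linear recurrence, (ii) complex diagonal: one multiply per state.
        h = [l * hi + g * x for l, hi, g in zip(lam, h, gamma)]
        outputs.append([hi.real for hi in h])
    return lam, outputs


nu_log, theta_log = make_lru(state_dim=4)
lam, outputs = lru_scan([1.0, 0.5, -0.25, 0.0], nu_log, theta_log)
```

Because the recurrence matrix is diagonal, each step costs one complex multiplication per state dimension, and the exponential parameterization keeps every coefficient inside the unit circle whatever values `nu_log` takes during training.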
Their approach shares many similarities with deep state space models (e.g. S4) based on ‘HiPPO’ matrices. The latter approach also involves a linear recurrence of a hidden state, designed with a principled initialization scheme. Precisely, given a measure over the time dimension that reflects a prior belief on the scale and shape of long-range dependencies, the initial values of the hidden state are chosen as the projection coefficients of the time-series input onto a basis of polynomials orthogonal with respect to that measure. The S4 algorithm then learns the coefficients of those dynamics, and uses various numerical and discretization techniques to accelerate the computation. By contrast, the approach in the present paper requires neither a structured initialization nor the discretization mechanism of S4, and thus provides a simpler implementation for a comparably efficient model on long sequences.
Theoretical Guarantees of Learning Ensembling Strategies with Applications to Time Series Forecasting
Hilaf Hasson, Danielle C. Maddix, Yuyang Wang, Gaurav Gupta, Youngsuk Park
The authors consider the problem of ensembling different weak learners in the context of probabilistic time-series forecasting over different horizons. Their proposed approach is based on a three-step fitting and validation procedure.
First, they train several weak learners on data up to a date D1. Second, they add a new chunk of data between D1 and a more recent date D2 to the training set, form the weak learners' forecasts over that chunk (up to D2), and fit ensembling weights with each of several candidate ensembling algorithms (e.g. Lasso with various regularization parameters). Third, they add another chunk of data up to a date D3, retrain the weak learners and the ensembling weights using data up to D2, and choose the ensembling method with the best validation performance on [D2, D3]. Finally, they refit the ensembling weights using all data up to D3, and then retrain the weak learners up to D3. These final weak learners and ensembling weights are the ones used for testing.
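The date-split procedure described above can be sketched on a toy series. Everything in this sketch is a stand-in chosen for brevity: the two point-forecast "weak learners", the two candidate ensembling methods, the mean absolute error metric and all function names are hypothetical, and the paper itself works with probabilistic forecasts and richer ensemblers.

```python
def mae(pred, actual):
    """Mean absolute error between two equal-length lists."""
    return sum(abs(p - a) for p, a in zip(pred, actual)) / len(actual)


# Toy weak learners: each "fit" returns a function producing a flat forecast.
def fit_mean(train):
    m = sum(train) / len(train)
    return lambda horizon: [m] * horizon


def fit_last(train):
    last = train[-1]
    return lambda horizon: [last] * horizon


LEARNERS = [fit_mean, fit_last]


# Toy ensembling methods: map learner forecasts and actuals to weights.
def equal_weights(forecasts, actual):
    return [1.0 / len(forecasts)] * len(forecasts)


def inverse_error_weights(forecasts, actual):
    inv = [1.0 / (mae(f, actual) + 1e-9) for f in forecasts]
    return [v / sum(inv) for v in inv]


METHODS = {"equal": equal_weights, "inverse_error": inverse_error_weights}


def combine(forecasts, weights):
    steps = len(forecasts[0])
    return [sum(w * f[t] for w, f in zip(weights, forecasts)) for t in range(steps)]


def forecasts_for(y, train_end, test_end):
    """Train every learner on y[:train_end] and forecast y[train_end:test_end]."""
    return [fit(y[:train_end])(test_end - train_end) for fit in LEARNERS]


def select_and_fit(y, d1, d2, d3):
    # Steps 1 and 2: learners trained up to d1 forecast the chunk between d1
    # and d2; every candidate method fits its weights on those forecasts.
    f12 = forecasts_for(y, d1, d2)
    weights = {name: m(f12, y[d1:d2]) for name, m in METHODS.items()}
    # Step 3: learners retrained up to d2 forecast the chunk between d2 and d3;
    # keep the method with the best validation score there.
    f23 = forecasts_for(y, d2, d3)
    scores = {name: mae(combine(f23, w), y[d2:d3]) for name, w in weights.items()}
    best = min(scores, key=scores.get)
    # Final: refit the chosen method's weights on all out-of-sample forecasts
    # up to d3, then retrain the learners on all data up to d3.
    f_all = [a + b for a, b in zip(f12, f23)]
    final_weights = METHODS[best](f_all, y[d1:d3])
    final_models = [fit(y[:d3]) for fit in LEARNERS]
    return best, final_weights, final_models


y = [float(t) for t in range(30)]  # simple upward trend
best, final_weights, final_models = select_and_fit(y, d1=10, d2=20, d3=30)
```

The point of the split is that the ensembling weights are always fitted on forecasts the weak learners made out of sample, and the choice between ensembling methods is validated on a chunk that neither the learners nor the weights have seen.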
They test their approach on several time-series datasets, and consider a wide range of weak learners for probabilistic forecasting (XGBoost, deep neural nets) and ensembling methods (quantile regression with a combination of entropy-based and L1 regularization terms, which control the uniformity and sparsity of the weights across the learners and the different tasks). They show improvements over standard ensembling baselines.
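To make the regularized quantile-regression ensembler more concrete, here is a minimal sketch of what such an objective could look like. The pinball loss is the standard quantile loss; the exact form of the two penalties, the parameter names `lam_l1` and `lam_ent`, and the function names are assumptions made for illustration and are not taken from the paper.

```python
import math


def pinball_loss(y_true, y_pred, q):
    """Average quantile (pinball) loss at quantile level q in (0, 1)."""
    total = 0.0
    for y, p in zip(y_true, y_pred):
        diff = y - p
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)


def regularized_objective(weights, learner_preds, y_true, q,
                          lam_l1=0.0, lam_ent=0.0):
    """Pinball loss of the weighted ensemble plus two hypothetical penalties."""
    combined = [sum(w * preds[t] for w, preds in zip(weights, learner_preds))
                for t in range(len(y_true))]
    # The L1 term pushes weights towards zero, encouraging sparsity.
    l1 = sum(abs(w) for w in weights)
    # Entropy of the normalized absolute weights is largest when they are
    # uniform, so subtracting it rewards spreading weight across learners.
    probs = [abs(w) / l1 for w in weights if abs(w) > 0.0] if l1 > 0.0 else []
    entropy = -sum(p * math.log(p) for p in probs)
    return pinball_loss(y_true, combined, q) + lam_l1 * l1 - lam_ent * entropy


value = regularized_objective([0.5, 0.5], [[1.0], [3.0]], [2.0], q=0.5,
                              lam_l1=0.1, lam_ent=0.1)
```

The two penalties pull in opposite directions, which is why the relative size of `lam_l1` and `lam_ent` would decide whether the fitted ensemble concentrates on a few learners or spreads its weight across many.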
Rockmate: an Efficient, Fast, Automatic and Generic Tool for Re-materialization in PyTorch
Xunyi Zhao, Théotime Le Hellard, Lionel Eyraud, Julia Gusak, Olivier Beaumont
This paper presents a new tool, called Rockmate, designed to improve the memory efficiency of training neural networks in PyTorch. When training large neural networks, high memory requirements can stem both from the number of parameters and from the size of the variables kept in device memory to perform back-propagation. Several approaches have been proposed to reduce the memory footprint: (i) ‘model parallelism’, which distributes the memory across multiple devices; (ii) ‘offloading’, which moves some activations from the GPU to the CPU and prefetches them at the appropriate time; and (iii) ‘re-materialization’, which deletes from GPU memory some variables computed during the forward pass and recomputes them during the backward pass. Unlike the first two approaches, re-materialization incurs no communication cost, but it does add computational overhead from recomputing some variables.
Rockmate is based on the re-materialization approach. Given the model's computational graph and a memory constraint, it solves an optimization problem to find a sequence of compute, forget and recompute actions that minimizes the computational overhead.
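The trade-off behind re-materialization can be illustrated on a toy chain of scalar layers. This sketch is not Rockmate and does not use its API: it applies a fixed, uniform checkpointing schedule (keep one activation every `segment` layers), whereas Rockmate chooses the schedule by solving an optimization problem. The layer function and all names are hypothetical.

```python
import math


def layer(w, x):
    return math.tanh(w * x)


def layer_grad(w, x):
    # Derivative of tanh(w * x) with respect to x; needs the layer input x.
    return w * (1.0 - math.tanh(w * x) ** 2)


def grad_store_all(weights, x0):
    """Backprop keeping every layer input in memory."""
    acts = [x0]
    for w in weights:
        acts.append(layer(w, acts[-1]))
    grad = 1.0
    for w, x in zip(reversed(weights), reversed(acts[:-1])):
        grad *= layer_grad(w, x)
    # Returns: gradient, number of stored layer inputs, forward layer calls.
    return grad, len(acts) - 1, len(weights)


def grad_rematerialized(weights, x0, segment):
    """Backprop keeping one input per segment and recomputing the rest."""
    n = len(weights)
    checkpoints = {}
    x, forward_calls = x0, 0
    for i, w in enumerate(weights):
        if i % segment == 0:
            checkpoints[i] = x  # keep only the input of each segment
        x = layer(w, x)
        forward_calls += 1
    grad, peak = 1.0, len(checkpoints)
    for start in sorted(checkpoints, reverse=True):
        end = min(start + segment, n)
        # Recompute the forgotten inputs of this segment from its checkpoint.
        acts = [checkpoints[start]]
        for w in weights[start:end - 1]:
            acts.append(layer(w, acts[-1]))
            forward_calls += 1
        peak = max(peak, len(checkpoints) + len(acts) - 1)
        for w, a in zip(reversed(weights[start:end]), reversed(acts)):
            grad *= layer_grad(w, a)
        del checkpoints[start]  # this segment is done, free its checkpoint
    return grad, peak, forward_calls


weights = [0.5 + 0.05 * i for i in range(16)]
full = grad_store_all(weights, 0.3)
remat = grad_rematerialized(weights, 0.3, segment=4)
```

Both functions return the same gradient; the re-materialized version holds fewer layer inputs at its peak in exchange for extra forward calls, which is the memory-versus-compute trade-off that a tool like Rockmate tunes automatically under a given memory budget.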
The authors show a significantly lower memory consumption (by a factor of 2 to 5) for a small overhead (of the order of 10% to 20%) when training large neural networks (e.g. ResNet-101 and GPT2-large). Their implementation is open source (link provided in the paper).