NeurIPS 2022: Paper review #6
G-Research were headline sponsors at NeurIPS 2022, in New Orleans.
ML is a fast-evolving discipline; attending conferences like NeurIPS and keeping up-to-date with the latest developments is key to the success of our quantitative researchers and machine learning engineers.
Our NeurIPS 2022 paper review series gives you the opportunity to hear about the research and papers that our quants and ML engineers found most interesting from the conference.
Here, Maria R, Quantitative Researcher at G-Research, discusses two papers from NeurIPS:
- Low-rank lottery tickets: finding efficient low-rank neural networks via matrix differential equations
- AUTOMATA: Gradient Based Data Subset Selection for Compute-Efficient Hyper-parameter Tuning
Steffen Schotthöfer, Emanuele Zangrando, Jonas Kusch, Gianluca Ceruti, Francesco Tudisco
Various results suggest that at the end of training, a large proportion of the parameters in modern neural networks are unnecessary (for example, the lottery ticket hypothesis).
Building on this observation, the authors approximate the weight matrix of the network by a matrix of much smaller rank, of the form USV, obtained by decomposing the original matrix and keeping only the top singular values (i.e. the first entries on the diagonal of S).
To ensure the low-rank structure of the matrix is preserved, instead of optimising with discrete stochastic gradient descent, the authors integrate the gradient-flow equations for each of U, S and V.
The ordinary differential equation integrator they propose further allows them to optimise the rank of the approximate matrix (i.e. the dimension of the manifold on which the optimisation occurs). The pseudo-algorithm is presented in Algorithm 1, and theoretical convergence guarantees and numerical results are given in Section 5 (tested on MNIST, CIFAR-10 and ImageNet1K).
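To make the starting point concrete, here is a minimal NumPy sketch of the truncated SVD factorisation the method builds on: a dense weight matrix is replaced by the product of its top-r singular factors. The matrix shapes and the rank r are illustrative assumptions, and this shows only the static approximation, not the paper's gradient-flow training of the factors.

```python
import numpy as np

rng = np.random.default_rng(0)

# A hypothetical dense weight matrix, e.g. from a fully connected layer.
W = rng.standard_normal((256, 128))

# Truncated SVD: keep only the top-r singular values, giving a
# low-rank factorisation W ≈ U S V of the kind used in the paper.
r = 16
U, s, Vt = np.linalg.svd(W, full_matrices=False)
U_r, S_r, Vt_r = U[:, :r], np.diag(s[:r]), Vt[:r, :]
W_r = U_r @ S_r @ Vt_r  # rank-r approximation of W

# The factors store far fewer parameters than the dense matrix.
dense_params = W.size                       # 256 * 128 = 32768
lowrank_params = U_r.size + r + Vt_r.size   # 4096 + 16 + 2048 = 6160
print(dense_params, lowrank_params)
```

The compression ratio grows as r shrinks; the paper's contribution is training U, S and V directly (and adapting r) so that the low-rank structure holds throughout optimisation rather than being imposed after the fact.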
Krishnateja Killamsetty, Guttu Sai Abhishek, Aakriti, Alexandre V. Evfimievski, Lucian Popa, Ganesh Ramakrishnan, Rishabh Iyer
The paper aims to reduce the significant computational cost of hyper-parameter tuning.
Typically, hyper-parameter optimisation is performed on randomly chosen subsets of the training data. The authors propose a gradient-based, informative subset selection method that allows for much faster tuning. They combine this with standard hyper-parameter search into their proposed hyper-parameter tuning framework: AUTOMATA.
They empirically test their framework on various benchmark datasets: SST2 (text), glue-SST2 (text), CIFAR10 (image), CIFAR100 (image), and CONNECT-4 (tabular). The speed-up is considerable: 10 to 30 times relative to tuning on the full dataset when no scheduler is used, and two to three times when the ASHA scheduler is used. The performance loss is no more than 3% in both cases.
The AUTOMATA framework consists of three components:
- a hyper-parameter search algorithm that determines which configurations need to be evaluated;
- the proposed gradient-based subset selection algorithm (SSD), which trains and evaluates each configuration efficiently;
- a hyper-parameter scheduling algorithm, which provides early stopping by eliminating poor configurations quickly.
The novel SSD method (the second component) assigns weights to different data samples, allowing it to find, for a given choice of hyper-parameters, the most informative data subset on which to evaluate the model. It finds the optimal subset and weights by gradient descent on the difference between the training loss on the full dataset and on the weighted subset (Eq. 2), which it alternates with the usual training epochs for the network's parameters.
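The weight-fitting half of that alternation can be sketched in a few lines. The snippet below is a toy NumPy illustration, not the paper's algorithm: the per-sample losses, subset size, learning rate and loop count are all assumptions, and it optimises only the subset weights (by gradient descent on the squared gap between the full-dataset loss and the weighted subset loss, loosely in the spirit of Eq. 2), skipping the interleaved training of the network itself.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-sample training losses for a full dataset,
# and a fixed candidate subset of 50 samples.
full_losses = rng.uniform(0.0, 2.0, size=1000)
subset_idx = rng.choice(1000, size=50, replace=False)
subset_losses = full_losses[subset_idx]

L_full = full_losses.sum()  # training loss over the full dataset

# Gradient descent on gap^2, where gap = L_full - sum_i w_i * l_i:
# the weighted subset loss is pushed towards the full-dataset loss.
w = np.ones_like(subset_losses)  # one weight per retained sample
lr = 1e-3
for _ in range(500):
    gap = L_full - w @ subset_losses
    w += lr * 2.0 * gap * subset_losses  # -d(gap^2)/dw = 2*gap*l_i
    w = np.clip(w, 0.0, None)            # keep weights non-negative

print(abs(L_full - w @ subset_losses))
```

In AUTOMATA this weight update is interleaved with ordinary training epochs, and the selection is driven by gradient information rather than a fixed random candidate set, so the subset adapts to each hyper-parameter configuration being evaluated.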