# NeurIPS 2018 Paper Review


*In December 2018, G-Research attended the 32nd annual Neural Information Processing Systems (NeurIPS) Conference held in Montréal, Canada. Six of our Quantitative Researchers have each short-listed their favourite papers from the conference and provided a summary of each paper*:


### Andrew – Machine Learning Research Manager:


**How Does Batch Normalization Help Optimization?**

*Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry*

Batch normalization is a very popular method for improving training in deep neural networks. It was introduced in a 2015 paper by Ioffe and Szegedy, motivated by the idea that “covariate shift” between network layers makes training difficult, and that BN can reduce this effect. Santurkar et al. improve our understanding of this important technique by exploring that hypothesis, and demonstrate that in fact far more important than any impact on covariate shift is the effect BN has on the smoothness of the optimisation landscape.

**Visualizing the Loss Landscape of Neural Nets**

*Hao Li, Zheng Xu, Gavin Taylor, Christoph Studer, Tom Goldstein*

The authors offer a fascinating insight into the behaviour of a wide variety of neural network architectures during training, with several compelling visualisations that shed light on how network width and depth, and particularly skip connections, can affect the smoothness of the loss landscape.

**How to Start Training: The Effect of Initialization and Architecture**

*Boris Hanin, David Rolnick*

Inappropriate initialisation of weights in a deep network can have seriously harmful effects on training. In this work, the authors provide useful practical advice on weight initialisation and neural architectures for deep ReLU networks, all supported with rigorous proofs.

**Gaussian Process Conditional Density Estimation**

*Vincent Dutordoir, Hugh Salimbeni, James Hensman, Marc Deisenroth*

Although Gaussian processes are less prominent since the rise of deep learning (at least in conference proceedings), they remain a very powerful tool for data modelling, and are particularly favoured for their sound handling of uncertainty. In this work the authors present an elegant and flexible model for conditional density estimation which is particularly suitable for small data regimes (where its Bayesian construction prevents overfitting), but which can also be run efficiently on massive data sets.

**Depth-Limited Solving for Imperfect-Information Games**

*Noam Brown, Tuomas Sandholm, Brandon Amos*

Brown and Sandholm won the best paper award for their work on imperfect-information games at NIPS 2017. Ideas from that paper were at the core of Libratus, their essentially unbeatable no-limit hold’em poker bot. However, the algorithm is computationally intensive, so in this work the authors propose an approximation that remains sound yet yields a superhuman poker AI using only the resources of a standard laptop.

### Sean – Senior Quantitative Researcher:


**Neural Ordinary Differential Equations**

*Ricky T. Q. Chen, Yulia Rubanova, Jesse Bettencourt, David Duvenaud*

Residual neural networks use skip connections to try to overcome the problem of vanishing gradients in deep networks. The idea is to learn a correction to the identity map, rather than to learn a map from scratch. The effect is that extra layers are at worst superfluous, since each layer defaults to the identity. The starting point of this paper is the cute observation that such a connection h_{t+1} = h_t + f(h_t, theta_t) resembles one step of Euler’s method for solving ODEs, and adding more layers looks a lot like using a smaller step-size in Euler’s method. Why not then just try to learn an ODE, h'(t) = f(h(t), t, theta), and worry about depth/step-size during evaluation? The authors demonstrate how to do this using a black-box ODE solver, taking advantage of more modern methods for solving ODEs such as adaptive step-sizes.
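The analogy with Euler's method can be checked numerically. The sketch below is a toy illustration rather than the authors' code: the dynamics `f` are assumed to be linear (a rotation) so that the exact solution is known, the weights are shared across layers, and the names `residual_stack`, `err_coarse` and `err_fine` are hypothetical. It stacks residual updates h <- h + dt * f(h, t, theta) and compares a shallow stack with a deep one.

```python
import numpy as np

def f(h, t, theta):
    """Toy dynamics, linear in h; t is unused but kept to mirror f(h(t), t, theta)."""
    return theta @ h

def residual_stack(h0, theta, num_layers, total_time=1.0):
    """Apply num_layers residual updates h <- h + dt * f(h, t, theta).

    With dt = total_time / num_layers this is exactly Euler's method
    for h'(t) = f(h(t), t, theta) on the interval [0, total_time].
    """
    dt = total_time / num_layers
    h = np.array(h0, dtype=float)
    for k in range(num_layers):
        h = h + dt * f(h, k * dt, theta)
    return h

# Assumed toy problem: h' = theta @ h rotates h, so h(1) = (cos 1, sin 1).
theta = np.array([[0.0, -1.0], [1.0, 0.0]])
h0 = [1.0, 0.0]
exact = np.array([np.cos(1.0), np.sin(1.0)])

err_coarse = np.linalg.norm(residual_stack(h0, theta, 4) - exact)   # 4 "layers"
err_fine = np.linalg.norm(residual_stack(h0, theta, 64) - exact)    # 64 "layers"
print(err_coarse, err_fine)
```

In this toy setting the deeper stack lands closer to the exact solution, which is the sense in which more layers behave like a smaller step-size. An adaptive solver would choose the step-size itself instead of fixing it through the depth.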

**Deep Reinforcement Learning and the Deadly Triad**

*Hado van Hasselt, Yotam Doron, Florian Strub, Matteo Hessel, Nicolas Sonnerat, Joseph Modayil*

The “deadly triad” is a buzzword invented by Sutton that describes a fundamental difficulty in reinforcement learning. The three parts of the triad are function approximation (or generalization), bootstrapping, and off-policy learning. With any two of the three we have well-established algorithms that work well both in theory and in practice, but with all three together we have no guarantees. Why then has DQN been so successful? This paper re-examines some of the bells and whistles of DQN in this light, explaining how each helps to mitigate the risk factors indicated by the deadly triad. For example, target Q-learning reduces the risk of bootstrapping noisy values, and using a deeper network reduces the risk of aliasing.

**Meta-Gradient Reinforcement Learning**

*Zhongwen Xu, Hado van Hasselt, David Silver*

Hyperparameters are frustrating, in all of machine learning but particularly in reinforcement learning. The discount factor gamma, the lambda in the lambda-return, the epsilon in epsilon-greedy, even the nature of the return itself: unfortunately performance can be very sensitive to these choices, and the best choice depends on the environment. Worse, what about an environment consisting of different stages or rooms in which the best hyperparameters change dramatically from one stage to the next? The ambitious idea of this paper is to create an agent that adjusts its hyperparameters on the fly using online cross-validation. Most reinforcement learning algorithms have at their heart an update theta’ = theta + f(tau, theta, eta) that tells you how to update the parameters theta of your value or policy network depending on an experience tau and some “metaparameters” eta (gamma, lambda, epsilon, etc.). The idea is to measure the performance of the new parameters theta’ using an independent experience tau’, use this to estimate the gradient of performance with respect to eta, and then update eta using stochastic gradient descent. Inevitably there are still many true hyperparameters (e.g. the learning rate for eta) that sit on the outer layer of abstraction and must be tuned, but the dream is to reduce to a collection of hyperparameters that really can be tuned once and for all.
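To make the update above concrete, here is a deliberately tiny sketch. It rests on assumptions that are not taken from the paper: theta is a single scalar value estimate, the only metaparameter eta is the discount gamma, performance on tau’ is a squared error against a return computed with a fixed reference discount `gamma_ref`, and the gradient is taken through one update only. All function names are hypothetical.

```python
import numpy as np

def discounted_return(rewards, gamma):
    """g_gamma(tau) = sum_k gamma**k * r_k for the rewards of one trajectory."""
    powers = gamma ** np.arange(len(rewards))
    return float(np.dot(powers, rewards))

def d_return_d_gamma(rewards, gamma):
    """Derivative of the discounted return with respect to gamma."""
    k = np.arange(1, len(rewards))
    return float(np.dot(k * gamma ** (k - 1), rewards[1:]))

def inner_update(theta, rewards, gamma, alpha=0.1):
    """theta' = theta + f(tau, theta, eta): a step towards the gamma-return."""
    return theta + alpha * (discounted_return(rewards, gamma) - theta)

def meta_loss(theta_new, rewards_val, gamma_ref):
    """Squared error of theta' measured on an independent experience tau'."""
    return (discounted_return(rewards_val, gamma_ref) - theta_new) ** 2

def meta_gradient(theta, rewards, rewards_val, gamma, gamma_ref, alpha=0.1):
    """Chain rule: dJ'/dgamma = dJ'/dtheta' * dtheta'/dgamma."""
    theta_new = inner_update(theta, rewards, gamma, alpha)
    dloss_dtheta = -2.0 * (discounted_return(rewards_val, gamma_ref) - theta_new)
    dtheta_dgamma = alpha * d_return_d_gamma(rewards, gamma)
    return dloss_dtheta * dtheta_dgamma

rng = np.random.default_rng(0)
theta, gamma, gamma_ref, beta = 0.0, 0.5, 0.9, 0.01   # beta: learning rate for eta
for _ in range(200):
    tau = rng.normal(1.0, 0.1, size=5)       # experience used for the update
    tau_val = rng.normal(1.0, 0.1, size=5)   # independent experience tau'
    grad = meta_gradient(theta, tau, tau_val, gamma, gamma_ref)
    theta = inner_update(theta, tau, gamma)
    gamma = float(np.clip(gamma - beta * grad, 0.0, 1.0))
print(theta, gamma)
```

Note that `beta` and `gamma_ref` are exactly the kind of outer-layer hyperparameters that still have to be chosen by hand.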

### Alessandro – Senior Quantitative Researcher:


**Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces**

*Yu-An Chung, Wei-Hung Weng, Schrasing Tong, and James Glass*

This is one of those papers that when you read it you think: “What?! This cannot possibly be true!” Yet apparently it is. Embedding alignment has been a very prolific area of research in the last few years, and surprisingly good results in text translation have recently been achieved by aligning embeddings independently generated for different languages. But this paper takes this concept a whole step further, showing that it is also possible to align embeddings generated from different modalities (in this case text and speech) and create a robust mapping between the two.

**Supervising Unsupervised Learning**

*Vikas K. Garg, Adam Kalai*

Supervised learning allows us to evaluate model performance, select the most suitable architecture for our problem and tune hyper-parameters to get the best possible result. But how can we achieve the same when dealing with unsupervised learning? This paper brilliantly addresses this question, providing a principled way to evaluate unsupervised algorithms, as well as a framework to transfer knowledge acquired from a repository of supervised datasets to new unsupervised ones.

**SING: Symbol-to-Instrument Neural Generator**

*Alexandre Defossez, Neil Zeghidour, Nicolas Usunier, Leon Bottou, Francis Bach*

We all love it when machine learning mixes with art, and one very trendy example of this is the use of DL to generate music. State-of-the-art techniques for synthesizing sounds (e.g. WaveNet or SampleRNN) use autoregressive models to generate one audio sample at a time. This turns out to be very inefficient at both training and inference, with prohibitive computational costs for most real-world applications. This paper proposes an alternative approach: using a combination of recurrent and convolutional neural networks and exploiting a new loss function that minimizes the distance between the log spectrograms of the generated and target waveforms, the authors are able to generate notes from nearly 1000 instruments with a single decoder, outperforming previous state-of-the-art models. Have fun reading (and listening to the generated audio samples on their website here: https://research.fb.com/wp-content/themes/fb-research/research/sing-paper/).
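A loss on log spectrograms can be sketched in a few lines of NumPy. This is an illustration under assumptions, not the SING implementation: the frame length, hop size, Hann window, the offset `eps` and the choice of a mean absolute difference are all illustrative settings, and `log_spectrogram` and `spectral_loss` are hypothetical names.

```python
import numpy as np

def log_spectrogram(wave, frame_len=256, hop=64, eps=1.0):
    """log(eps + |STFT|^2) with a Hann window; all settings are illustrative."""
    window = np.hanning(frame_len)
    num_frames = 1 + (len(wave) - frame_len) // hop
    frames = np.stack([wave[i * hop : i * hop + frame_len] * window
                       for i in range(num_frames)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(eps + power)

def spectral_loss(generated, target, **kwargs):
    """Mean absolute difference between the two log spectrograms."""
    diff = log_spectrogram(generated, **kwargs) - log_spectrogram(target, **kwargs)
    return float(np.mean(np.abs(diff)))

sample_rate = 16000
t = np.arange(sample_rate // 4) / sample_rate
target = np.sin(2 * np.pi * 440.0 * t)
shifted = np.sin(2 * np.pi * 440.0 * t + 1.0)   # same note, different phase
other = np.sin(2 * np.pi * 880.0 * t)           # a different note

loss_shifted = spectral_loss(shifted, target)
loss_other = spectral_loss(other, target)
waveform_mse_shifted = float(np.mean((shifted - target) ** 2))
print(loss_shifted, loss_other, waveform_mse_shifted)
```

In this toy comparison the phase-shifted note scores much closer to the target than the different note does, even though its sample-by-sample error is large. That is one plausible reason to compare spectrograms rather than raw waveforms.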

**ResNet with one-neuron hidden layers is a Universal Approximator**

*Hongzhou Lin, Stefanie Jegelka*

Deep learning has come to dominate ML research because of its pervasive, all-round effectiveness. Yet we all feel that the theoretical foundations of neural networks’ success are still quite wobbly. For this reason, whenever we find a paper that formally investigates the theory of neural networks we feel quite reassured. This is one of those papers, proving that the efficacy of ResNets with very narrow and deep architectures is theoretically grounded and sound. Specifically, the authors prove that an infinitely deep ResNet with alternating layers of dimension one and d can uniformly approximate any Lebesgue integrable function in d dimensions.
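The architecture in question is easy to write down. The following forward pass is a minimal sketch, assuming a ReLU activation and randomly drawn, untrained weights; the function names are hypothetical. Each block maps the d-dimensional input to a single hidden neuron and back, then adds the result to the identity.

```python
import numpy as np

def one_neuron_residual_block(x, u, b, v):
    """x + v * relu(u . x + b): the hidden layer is a single ReLU neuron."""
    hidden = max(0.0, float(u @ x) + b)   # layer of dimension one
    return x + v * hidden                 # back to dimension d, plus the skip

def narrow_resnet(x, blocks):
    """Stack of residual blocks alternating between dimension one and d."""
    for u, b, v in blocks:
        x = one_neuron_residual_block(x, u, b, v)
    return x

rng = np.random.default_rng(0)
d, depth = 3, 50
# Each block has only 2 * d + 1 parameters: u (d), b (1) and v (d).
blocks = [(rng.normal(size=d), rng.normal(), 0.1 * rng.normal(size=d))
          for _ in range(depth)]
x = rng.normal(size=d)
out = narrow_resnet(x, blocks)
print(out)
```

Setting every `v` to zero turns the whole stack into the identity map, which is the sense in which extra residual layers are at worst superfluous.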

**How Does Batch Normalization Help Optimization?**

*Shibani Santurkar, Dimitris Tsipras, Andrew Ilyas, Aleksander Madry*

We all use Batch Normalization in our daily research. BN simplifies our lives by making our networks more robust, allowing us to use higher learning rates (which accelerates training) and improving the generalisation of our models. But why is Batch Norm so powerful? For many years we all thought it had to do with reducing the “internal covariate shift” of network activations. Yet recent results seem to show that this is not sufficient to explain its effectiveness, and various papers at this year’s NeurIPS addressed the topic. The paper mentioned here provides one of the most appealing answers to this question, proposing that batch norm makes the optimization landscape significantly smoother. But other ideas are floating in the air too (see for example “Understanding Batch Normalization” by Johan Bjorck et al. and “Norm matters: efficient and accurate normalization schemes in deep networks” by Elad Hoffer et al.). So have a read and let us know what you think in the comments below.

### Tomasz – Senior Quantitative Researcher:

**On the Dimensionality of Word Embedding**

*Zi Yin, Yuanyuan Shen*

In recent years, word embeddings have become an inherent part of NLP pipelines around the world. Despite their wide use, the size of the embedding vector has usually been chosen empirically, and not much work has been done to provide a theoretical justification for its dimensionality. In this work, by introducing a novel metric for the dissimilarity between word embeddings and using matrix perturbation theory, the authors reveal a fundamental bias-variance trade-off in dimensionality selection and, as a consequence, prove the existence of an optimal dimensionality as well as provide a practical way to select it for your problem.

**Learning to Reconstruct Shapes from Unseen Classes**

*Xiuming Zhang, Zhoutong Zhang, Chengkai Zhang, Josh Tenenbaum, Bill Freeman, Jiajun Wu*

Humans can confidently reconstruct the 3D shape of an object from a single 2D image by drawing on their accumulated life experience. 3D reconstruction algorithms have been developed to solve this problem, but most state-of-the-art algorithms tend to overfit strongly to the training classes and fail to generalize to unseen classes. During this fascinating talk, the authors presented a novel algorithm, Generalizable Reconstruction, designed to capture class-agnostic shape priors. The presented results were very impressive, generalizing successfully to unseen classes by exploiting the causal structure of how 3D shapes give rise to 2D images.

**L4: Practical loss-based stepsize adaptation for deep learning**

*Michal Rolinek, Georg Martius*

Deep learning practitioners spend a large amount of time tuning hyperparameters to make their models behave as expected. One of the most important hyperparameters is the choice of optimiser and its learning rate schedule. Finding the right values can take a large amount of time and resources, which is very problematic if either is constrained. The authors of this paper propose a stepsize adaptation scheme that works out of the box and outperforms the popular Adam and Momentum optimizers with constant stepsize, which can be very useful for day-to-day practitioners.

**Automatic Machine Learning (Tutorial)**

*Frank Hutter, Joaquin Vanschoren*

While not a paper, I believe this tutorial deserves to be mentioned as one of the NeurIPS 2018 highlights. Currently, achieving state-of-the-art results for a given task requires a great amount of human expertise. Machine learning practitioners are responsible for choosing the family of models, the architecture and, frequently, a large set of hyperparameters. Automatic Machine Learning is an emerging trend of creating off-the-shelf classifiers that can be applied directly to the problem in question, without relying on human experts. The tutorial authors, including Frank Hutter from the University of Freiburg, whose team has won several AutoML competitions, gave an extremely interesting talk on the current state of the art in this growing research area.

### Szymon – Senior Quantitative Researcher:

**DeepPINK: reproducible feature selection in deep neural networks.**

*Yang Young Lu, Yingying Fan, Jinchi Lv, William Stafford Noble*

DeepPINK, or Deep feature selection using Paired-Input Nonlinear Knockoffs, is a novel method of ranking input features in deep networks. The objective is to pick the best possible subset of features of a given size. The authors use the false discovery rate (the proportion of irrelevant features in the subset) to evaluate results. The idea is to define so-called model-X knockoff features that resemble the distribution of the input features and yet are independent of the response conditioned on the input features. Pairwise-coupling layers are then used in training a deep network so that the input features compete with their knockoffs. In the end, the weights of the original and knockoff features are compared to determine feature importance. A truly relevant feature would overshadow its knockoff, while an irrelevant feature would be comparable to or less important than its knockoff.
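The competition between features and their knockoffs can be illustrated on synthetic data. The sketch below makes strong simplifying assumptions and is not DeepPINK itself. The features are independent standard normals, in which case a fresh independent draw serves as a valid knockoff. An ordinary least-squares fit stands in for the deep network and its pairwise-coupling layers. The importance statistic `W` is simply the difference of absolute coefficients.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 2000, 10, 3            # samples, features, truly relevant features

# Assumption: independent Gaussian features, so knockoffs are fresh draws
# from the same distribution, generated without looking at the response.
X = rng.normal(size=(n, p))
X_knock = rng.normal(size=(n, p))

beta = np.zeros(p)
beta[:k] = 2.0                   # only the first k features affect y
y = X @ beta + rng.normal(size=n)

# Stand-in for the network: fit originals and knockoffs side by side.
coef, *_ = np.linalg.lstsq(np.hstack([X, X_knock]), y, rcond=None)

# A large positive W means the original feature beat its knockoff.
W = np.abs(coef[:p]) - np.abs(coef[p:])
ranking = np.argsort(-W)
print(ranking[:k], np.round(W, 2))
```

On this synthetic problem the three relevant features come out on top, while the statistics of the irrelevant ones hover around zero with either sign.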

**Learning Overparameterized Neural Networks via Stochastic Gradient Descent on Structured Data**

*Yuanzhi Li, Yingyu Liang*

Recently, deep neural networks have proved to be very successful in modelling real-world data. While it is obvious that a non-linear model with plenty of parameters can model training data very well, it is less obvious why it generalizes to data it hasn’t seen before. The authors explain why a huge number of parameters doesn’t necessarily cause over-fitting in the case of a specific two-layer network architecture.


**Bayesian Model-Agnostic Meta-Learning**

*Jaesik Yoon, Taesup Kim, Ousmane Dia, Sungwoong Kim, Yoshua Bengio, Sungjin Ahn*

Model-agnostic meta-learning (MAML) has the nice property of being a gradient-based method, which makes it applicable to many tasks. This paper presents a novel Bayesian approach to MAML. The particle-based approach it uses makes it possible to capture complex posterior distributions. Uncertainty awareness is also extended to the meta-update using the Chaser Loss, which further prevents over-fitting. The authors also argue that their method shows better policy exploration in reinforcement learning.

### Pawel – Quantitative Research Manager:

**Generalizing Tree Probability Estimation via Bayesian Networks**

*Cheng Zhang, Frederick A Matsen IV*

The paper introduces a way of modelling phylogenetic trees in molecular evolution. Phylogenetic trees are used to describe the evolutionary relationships amongst various species. The usual approaches rely on MCMC sampling, which does not generalise well because it requires an unreasonable number of samples. The authors propose a framework for tree probability estimation based on subsplit Bayesian networks (SBNs).

SBNs exploit the similarity amongst trees to provide flexible probability estimators that generalise to unsampled trees. The method allows maximum likelihood to be used for rooted trees, whilst unrooted trees can be treated via a variation of the EM algorithm.

**Meta-Reinforcement Learning of Structured Exploration Strategies**

*Abhishek Gupta, Russell Mendonca, YuXuan Liu, Pieter Abbeel, Sergey Levine*

Humans are very good at solving new tasks because they leverage prior experience. This paper introduces the Model Agnostic Exploration with Structured Noise method, which allows robots to use prior experience to explore and then quickly adapt to new tasks. This is achieved by introducing a latent space of prior behaviours and ensuring that an intention is picked at the beginning of an episode and persists until its end. This makes exploration more efficient. Once the robot gets a reward, it starts adapting in the latent space to quickly learn how to solve the new task.

**Visual Reinforcement Learning with Imagined Goals**

*Ashvin Nair, Vitchyr Pong, Murtaza Dalal, Shikhar Bahl, Steven Lin, Sergey Levine*

The paper introduces a way for a robot to autonomously learn tasks that can be useful in the future. The robot can therefore explore and learn new skills without a specific prescription from humans. The method is called reinforcement learning with imagined goals (RIG).

An agent is put in an environment where it receives images as inputs, and learns a low-dimensional latent representation of the input. A generative latent variable model (a variational autoencoder) is used to set goals. Goals are set for the agent by sampling new variables in the latent space, with the distance in the latent space serving as the reward function. With such a setup, the robot can learn autonomously. A real-world task can then be set by encoding the real image and getting the robot to move towards it in the latent space.
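The goal-setting mechanism described above can be sketched as follows. This is a schematic illustration, not the RIG implementation. A fixed random linear map stands in for a trained VAE encoder, the latent prior is assumed to be a standard normal, the reward is taken to be the negative Euclidean distance in latent space, and all names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
IMAGE_DIM, LATENT_DIM = 64, 4

# Hypothetical stand-in for a trained VAE encoder: a fixed random linear map.
ENCODER_WEIGHTS = rng.normal(size=(LATENT_DIM, IMAGE_DIM)) / np.sqrt(IMAGE_DIM)

def encode(image):
    """Map a flattened image to a latent vector (stand-in for the encoder mean)."""
    return ENCODER_WEIGHTS @ image

def sample_imagined_goal():
    """Draw a goal from the latent prior, assumed here to be a standard normal."""
    return rng.normal(size=LATENT_DIM)

def latent_reward(image, latent_goal):
    """Reward is the negative Euclidean distance to the goal in latent space."""
    return -float(np.linalg.norm(encode(image) - latent_goal))

# Training time: the agent sets its own goal by sampling in latent space.
observation = rng.uniform(size=IMAGE_DIM)
imagined_goal = sample_imagined_goal()
training_reward = latent_reward(observation, imagined_goal)

# Test time: a real goal image is encoded and used in exactly the same way.
goal_image = rng.uniform(size=IMAGE_DIM)
test_reward = latent_reward(observation, encode(goal_image))
print(training_reward, test_reward)
```

Because imagined and real goals live in the same latent space, the same reward function serves both self-supervised practice and a user-specified task.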

The results on the synthetic simulators look quite promising, especially given that the only inputs that the robot can receive are raw images.