# NeurIPS 2020 Paper Review


*We were proud to be platinum sponsors of the 2020 Conference on Neural Information Processing Systems (NeurIPS) in December.*

*Following the conference, we asked some of our quantitative researchers to share the papers and research from the event that they found most interesting. Here’s what they came back with:*

**Samuel M – Quantitative Researcher**

**Leap-Of-Thought: Teaching Pre-Trained Models to Systematically Reason Over Implicit Knowledge**

*Alon Talmor, Oyvind Tafjord, Peter Clark, Yoav Goldberg, Jonathan Berant*

In this paper, the authors examine the ability of language models to combine implicit knowledge learned during training with explicit knowledge provided in a query. For instance, among many other things, they investigate reasoning of the form (A is a type of B) + (all B satisfy C) => (A satisfies C), where C is a concept unlikely to be known to the pre-trained model. They provide the model with (all B satisfy C) at query time and check whether it correctly deduces (A satisfies C). They show that this deduction essentially succeeds whenever the model already happens to know that (A is a type of B). This suggests that language models can reason logically, combining their internal knowledge with new information presented to them.
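The deduction pattern can be made concrete with a toy sketch (our illustration, not the authors' code or model): the stored taxonomy plays the role of the model's implicit pre-trained knowledge, while the rule "all B satisfy C" arrives only at query time.

```python
# Toy illustration of the Leap-of-Thought pattern (not the authors' code):
# implicit knowledge "A is a type of B" is stored in advance, while the
# rule "all B satisfy C" is supplied only in the query.
implicit_knowledge = {("whale", "mammal"), ("oak", "tree")}

def deduce(a, b, c):
    """Does 'A satisfies C' follow from the stored taxonomy plus the
    query-time rule 'all B satisfy C'?"""
    # As the paper finds, the deduction succeeds exactly when the model
    # already "knows" that A is a type of B.
    return (a, b) in implicit_knowledge

assert deduce("whale", "mammal", "is warm-blooded")
assert not deduce("whale", "reptile", "lays eggs")
```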

**Gibbs Sampling with People**

*Peter Harrison, Raja Marjieh, Fede G Adolfi, Pol van Rijn, Manuel Anglada-Tort, Ofer Tchernichovski, Pauline Larrouy-Maestri, Nori Jacoby*

This paper describes a technique for studying semantic representations in human minds. One can think of a semantic concept as a probability distribution over a space of perceptual inputs. For instance, the colour “lavender” can be thought of as a probability distribution over three-dimensional colour space, describing how likely each colour is to be described as “lavender”. The authors explain how to use Gibbs sampling to draw samples from such a human semantic distribution: subjects are presented with a slider controlling one coordinate of the perceptual space, and adjust it to best match the proposed concept. By repeating this process according to the Gibbs sampling algorithm, possibly changing the human subject at each step, reliable samples from the semantic distribution can be obtained. The authors present many fascinating results around these human semantic representations.
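The sampling loop can be sketched as follows (an assumed setup, not the authors' code): each Gibbs step fixes all but one coordinate, and a participant moves a slider along the free coordinate. Here the participant is simulated by a function that picks the best-matching slider value for a made-up "lavender" density; a real experiment replaces `adjust_slider` with a human judgement.

```python
import random

# Sketch of Gibbs sampling with a (simulated) human in the loop.
def concept_density(x):
    # Toy stand-in for P(stimulus | "lavender"): peaks at (0.7, 0.5, 0.9).
    target = (0.7, 0.5, 0.9)
    return sum(-(xi - ti) ** 2 for xi, ti in zip(x, target))

def adjust_slider(x, dim, grid=51):
    """Simulated participant: pick the slider position on coordinate `dim`
    that best matches the concept, holding the other coordinates fixed."""
    best = max(range(grid), key=lambda k: concept_density(
        x[:dim] + [k / (grid - 1)] + x[dim + 1:]))
    return best / (grid - 1)

def gibbs_with_people(steps=30, dims=3, seed=0):
    random.seed(seed)
    x = [random.random() for _ in range(dims)]
    for t in range(steps):
        d = t % dims                  # cycle through the coordinates
        x[d] = adjust_slider(x, d)    # one slider judgement per step
    return x

sample = gibbs_with_people()
```

With a deterministic simulated participant the chain settles on the density's mode; with real (noisy) participants, repeated runs trace out samples from the semantic distribution.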

**Jaak S – Quantitative Researcher**

**Delta-STN: Efficient Bilevel Optimization for Neural Networks using Structured Response Jacobians**

*Juhan Bae, Roger Grosse*

The paper presents an improved approach to self-tuning networks (STNs). Such networks bring two key practical advantages. Firstly, they avoid costly hyperparameter optimisation that requires separate runs with different dropout rates and weight decays. Secondly, they reach better performance than hyperparameter search, as an STN can adapt its hyperparameters during the run. This paper presents several improvements over previous STN approaches, resulting in higher accuracy and faster convergence.
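The bilevel structure behind this family of methods can be shown with a deliberately tiny stand-in (not Delta-STN itself, and with invented data): the inner problem fits weights for a fixed hyperparameter, and an outer loop adapts the hyperparameter online against validation loss, replacing separate hyperparameter runs.

```python
# Toy bilevel optimisation: tune a ridge penalty `lam` online rather
# than via separate training runs. Data and step sizes are invented.
train = [(1.0, 2.1), (2.0, 3.9), (3.0, 6.2)]
val = [(1.5, 3.0), (2.5, 5.0)]

def inner_fit(lam):
    # Inner problem, closed form for 1-D ridge regression:
    # w = sum(x*y) / (sum(x*x) + lam).
    sxy = sum(x * y for x, y in train)
    sxx = sum(x * x for x, y in train)
    return sxy / (sxx + lam)

def val_loss(lam):
    w = inner_fit(lam)
    return sum((w * x - y) ** 2 for x, y in val)

def tune(lam=1.0, lr=0.2, steps=300, eps=1e-4):
    # Outer loop: gradient step on the hyperparameter via the
    # validation loss (finite differences for simplicity).
    for _ in range(steps):
        grad = (val_loss(lam + eps) - val_loss(lam - eps)) / (2 * eps)
        lam = max(0.0, lam - lr * grad)
    return lam

best_lam = tune()
```

In a real STN the inner problem is a neural network rather than a closed form, and the response of the weights to the hyperparameters is approximated rather than recomputed, but the nesting is the same.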

**What shapes feature representations? Exploring datasets, architectures, and training**

*Katherine Hermann, Andrew Lampinen*

The paper investigates what representations neural networks learn when given redundant features of differing complexity. This is an important practical issue, as redundancy and non-linear combinations of inputs are often present. The authors present a very interesting finding: linear features dominate more complex ones in the learned representations, even when the former have weaker predictive performance. Additionally, they find that complex features do not produce the same representations across different runs. The work highlights the importance of properly designing a model's inputs and building the right inductive biases into the model.

**Tom H – Group Head**

**Big Bird: Transformers for Longer Sequences**

*Manzil Zaheer, Guru Guruganesh, Kumar Avinava Dubey, Joshua Ainslie, Chris Alberti, Santiago Ontanon, Philip Pham, Anirudh Ravula, Qifan Wang, Li Yang, Amr Ahmed*

Since 2018, pre-trained language models such as Google’s BERT based on the transformer architecture have been phenomenally successful, revolutionising NLP.

Their SOTA performance relies on the self-attention mechanism, which allows networks to use contextual information from an entire text sequence. However, the computational and memory requirements of self-attention grow quadratically with sequence length, making processing long sequences very expensive.

The Big Bird paper proposes a new sparse attention mechanism to address this problem, posing it as a graph sparsification problem. Their generalised attention mechanism is described by a directed graph D whose vertices represent the tokens of the input sequence. If D were fully connected (every token attends to every other token), we would recover quadratic self-attention. Instead, BigBird uses a sparse directed graph built from a local component (tokens attend to their close neighbours), a global component (tokens that attend to all other tokens, and that every token attends to) and a random component (tokens attend to a number of other randomly chosen tokens).
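The three components can be sketched as a boolean attention mask (an illustration of the pattern only; the window size, global-token count and random-link count below are arbitrary choices, not the paper's defaults):

```python
import random

# Build a BigBird-style sparse attention mask from local, global and
# random components. mask[i][j] == True means token i attends to token j.
def bigbird_mask(n, window=1, n_global=1, n_random=1, seed=0):
    rng = random.Random(seed)
    mask = [[False] * n for _ in range(n)]
    for i in range(n):
        # Local component: attend to nearby tokens.
        for j in range(max(0, i - window), min(n, i + window + 1)):
            mask[i][j] = True
        # Global component: global tokens attend everywhere and are
        # attended to by everyone.
        for g in range(n_global):
            mask[i][g] = mask[g][i] = True
        # Random component: a few random long-range links.
        for j in rng.sample(range(n), n_random):
            mask[i][j] = True
    return mask

m = bigbird_mask(8)
edges = sum(row.count(True) for row in m)  # O(n) edges, not n*n
```

Because each row holds a constant number of `True` entries, attention restricted to this mask costs O(n) rather than O(n²) in sequence length.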

The authors show that BigBird preserves all known theoretical properties of self-attention models, whilst giving improved results on long-context NLP tasks such as Question Answering and Abstractive Summarization.

The authors also propose applications to genomic data.

Overall, this is an interesting paper, although it is similar to other recent work in this area (e.g. Longformer and the Extended Transformer Construction), and some details are unclear – in particular some of their best results are for a variant of their model that appears not to include the random component at all.

**Unsupervised Data Augmentation for Consistency Training**

*Qizhe Xie, Zihang Dai, Eduard Hovy, Thang Luong, Quoc V Le*

Deep Learning typically requires large amounts of labelled data to work effectively. Semi-supervised learning (SSL) is one way of leveraging unlabeled data to address this problem. Data augmentation is the problem of taking existing labelled examples and perturbing them in some way to generate plausible further examples that can increase the size of the training set. The desired result is new examples that are valid, diverse, and provide inductive biases missing from the existing labelled data.

In this paper, the authors present a general framework for unsupervised data augmentation, based on an unsupervised consistency loss. They also present concrete strategies for augmentation. The first, RandAugment, is applied to image classification. For NLP, they propose two methods: one based on backtranslation (from English, to French, and back to English), and the second based on using TF-IDF to identify uninformative words which can safely be replaced whilst preserving meaning. They report impressive results on IMDb sentiment classification by applying backtranslation to a small number of labelled examples. Their code is available on GitHub and is interesting to study.
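The shape of the consistency objective can be sketched as follows (our paraphrase, not the authors' code): a supervised cross-entropy term on labelled data, plus a divergence between the model's predictions on an unlabelled example and on its augmented version. The `model` and `augment` functions below are toy stand-ins.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl(p, q):
    # KL divergence between two discrete distributions.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def uda_loss(model, labelled, unlabelled, augment, weight=1.0):
    # Supervised cross-entropy on labelled examples ...
    sup = -sum(math.log(softmax(model(x))[y]) for x, y in labelled)
    # ... plus consistency between clean and augmented unlabelled examples.
    cons = sum(kl(softmax(model(x)), softmax(model(augment(x))))
               for x in unlabelled)
    return sup + weight * cons

# Toy usage: a linear "model" on 1-D inputs, and jitter as "augmentation".
model = lambda x: [x, -x]
augment = lambda x: x + 0.01
loss = uda_loss(model, [(1.0, 0)], [0.5], augment)
```

The consistency term needs no labels at all, which is what lets the unlabelled data contribute.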

**Hugh S – Quantitative Researcher**

**Neural Controlled Differential Equations for Irregular Time Series**

*Patrick Kidger, James Morrill, James Foster, Terry Lyons*

This is the latest in a recent series of papers connecting ideas from deep learning to ideas from differential equations. One way to view the connection is that applying successive functions (i.e. ‘layers’) is essentially identical to integrating a parameterised differential equation with a simple fixed-step-size solver. Thinking of a neural network as a differential equation brings two advantages: firstly, the differential equation is reversible, so we can avoid keeping intermediate activations in memory for the backward pass; secondly, we can use adaptive step-size solvers rather than a predetermined number of layers. This paper applies the differential equation approach to recurrent neural network models. Aside from the conceptual elegance, this approach has the advantage that it can deal with very long sequences. This is possible because the memory requirement doesn't grow with the length of the sequence, and because if the underlying function is slow-moving, the solver can take fewer steps than there are observations.
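The layers-as-solver-steps view can be seen in a few lines (an illustration of the general connection, not the paper's neural CDE model): a stack of residual layers h ← h + f(h) is exactly Euler integration of dh/dt = f(h) with step size 1.

```python
# A fixed "layer" function, doubling as the ODE's vector field.
def f(h):
    return -0.5 * h

def residual_network(h, depth):
    for _ in range(depth):
        h = h + f(h)          # one residual layer
    return h

def euler_solve(h, steps, dt=1.0):
    for _ in range(steps):
        h = h + dt * f(h)     # one Euler step of dh/dt = f(h)
    return h

# With dt = 1, the two computations are identical.
assert residual_network(2.0, 4) == euler_solve(2.0, 4)
```

Shrinking `dt` (more, smaller steps) makes the solver track the true solution of the ODE, which is the freedom adaptive solvers exploit.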

**Jeremy M – Senior Quantitative Researcher**

**Bootstrap Your Own Latent: A New Approach to Self-Supervised Learning**

*Jean-Bastien Grill, Florian Strub, Florent Altché, Corentin Tallec, Pierre H. Richemond, Elena Buchatskaya, Carl Doersch, Bernardo Avila Pires, Zhaohan Daniel Guo, Mohammad Gheshlaghi Azar, Bilal Piot, Koray Kavukcuoglu, Rémi Munos, Michal Valko*

My tagline: “Contrastive Learning without negative examples”.

This paper presents a surprising twist on Contrastive Learning in image recognition. Contrastive Learning has recently taken off as a powerful way to learn representations of many types of input without needing labelled data. The learnt representations are powerful enough to reuse in a range of supervised tasks with a small amount of fine tuning. The state of the art in image recognition trains a network to encode images in such a way that corrupted versions of the same image will be A) identified with each other, and B) distinguished from ‘negative examples’, i.e. different images. The twist here is to do away with the negative examples. Normally this would lead to network collapse and useless representations. This paper shows how to set up training correctly to avoid this, and in the process finds a training technique which is easier to implement, more robust to the choice of image corruption used, and produces more useful and generalizable image representations. A counterintuitive but pleasing result which may prove useful beyond image recognition.
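The structure of the training step can be sketched with scalar stand-ins for the real networks (all the numbers below are invented for illustration): the online branch, via a predictor, regresses the target branch's output on a second view of the same input; no negative examples appear anywhere, and the target is an exponential moving average (EMA) of the online weights.

```python
import random

random.seed(0)
online, predictor, target = 0.5, 0.1, 0.5   # scalar "networks"
tau, lr = 0.99, 0.1

def view(x):
    return x + random.gauss(0.0, 0.01)      # a corrupted "view" of x

for _ in range(300):
    x = random.random()
    v1, v2 = view(x), view(x)               # two views of one input
    pred = predictor * (online * v1)        # online branch + predictor
    tgt = target * v2                       # target branch: no gradient
    err = pred - tgt                        # plain regression loss
    g_pred = 2 * err * online * v1
    g_online = 2 * err * predictor * v1
    predictor -= lr * g_pred                # update online side only
    online -= lr * g_online
    target = tau * target + (1 - tau) * online   # EMA target update
```

In this scalar toy the asymmetry (predictor plus EMA target) keeps the "representation" from collapsing to zero, echoing the paper's finding that these ingredients substitute for negative examples.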

**Modern Hopfield Networks and Attention for Immune Repertoire Classification**

*Michael Widrich, Bernhard Schäfl, Milena Pavlović, Hubert Ramsauer, Lukas Gruber, Markus Holzleitner, Johannes Brandstetter, Geir Kjetil Sandve, Victor Greiff, Sepp Hochreiter*

This paper showcases two cool things: firstly how dot-product attention (aka ‘the transformer’) can be adapted to a huge range of tasks, and secondly how machine learning is being applied to problems in biology and medicine – in this case “Immune Repertoire Classification”. The Immune Repertoire is the enormous collection of genetic sequences of immune receptors in an individual. The classification task is to determine the vastly smaller number of diseases these provide immunity to. The authors use an attention mechanism to map each entire immune repertoire onto a fixed length representation to be used for classification and outperform the previous best attempts. They frame their method in terms of “Modern Hopfield Networks”, but good news: these are nothing more than the standard attention mechanism, but with either the queries or keys as fixed network parameters. I don’t have a sense of how applicable this is in the real world, but certainly a worthy goal.
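That last point can be sketched directly (our reading of the mechanism, with made-up vectors): a fixed query that is a network parameter attends over a variable-length set of receptor embeddings, pooling any repertoire down to one fixed-length vector for the classifier.

```python
import math

# Attention pooling with a fixed, learned query: softmax(q.K / sqrt(d)).V
def attention_pool(query, keys, values):
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale
              for key in keys]
    m = max(scores)
    w = [math.exp(s - m) for s in scores]
    z = sum(w)
    w = [wi / z for wi in w]                 # softmax over the set
    dim = len(values[0])
    return [sum(w[i] * values[i][d] for i in range(len(values)))
            for d in range(dim)]

# Two "repertoires" of different sizes pool to the same fixed length.
query = [1.0, 0.0]                           # a learned parameter
rep_a = [[0.9, 0.1], [0.0, 1.0], [0.8, 0.2]]
rep_b = rep_a + [[0.5, 0.5], [0.1, 0.9]]
pa = attention_pool(query, rep_a, rep_a)
pb = attention_pool(query, rep_b, rep_b)
assert len(pa) == len(pb) == 2
```

The output length depends only on the embedding dimension, never on the repertoire size, which is what makes the downstream classification tractable.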

**Michel B – Quant Research Manager**

**Untangling tradeoffs between recurrence and self-attention in neural networks**

*Giancarlo Kerg, Bhargav Kanuparthi, Anirudh Goyal, Kyle Goyette, Yoshua Bengio, Guillaume Lajoie*

Recurrent networks are eminently well-suited for sequential tasks. However, they tend to struggle on long sequences: they’re forgetful. A known remedy to this is to allow a recurrent network to pay attention to past hidden states. However, there’s a balance to be struck: paying attention to all past states leads to quadratic complexity in sequence length. The authors of this paper contribute a formal analysis of gradient propagation in recurrent networks with attention mechanisms. They identify two quantities, “dependency depth” and “sparsity”: both are key to good gradient propagation and each represents one direction in the trade-off between “better memory” and “more complexity”. Building on these insights they propose a simple new attention mechanism, “relevancy screening”, that strikes a good balance between the two. A great read for anyone interested in better understanding the mechanics of attention mechanisms in recurrent networks.
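The flavour of the trade-off can be sketched with a toy recurrence (our illustration of the idea, not the paper's exact mechanism): rather than attending to all t past hidden states, which is quadratic over the whole sequence, the network keeps only a small buffer of the k most "relevant" ones.

```python
def relevance(h):
    return abs(h)                        # toy stand-in relevancy score

def run(inputs, k=3):
    h, buffer = 0.0, []
    for x in inputs:
        # Attend only to the screened past states: O(k) per step.
        context = sum(buffer) / len(buffer) if buffer else 0.0
        h = 0.5 * h + x + 0.1 * context  # toy recurrent update
        buffer.append(h)
        # Screening: keep the k most relevant past states.
        buffer = sorted(buffer, key=relevance, reverse=True)[:k]
    return h

out = run([1.0, -2.0, 0.5, 3.0, -1.0])
```

The buffer size k sets the position on the trade-off: larger k means better memory but more computation per step.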

**Autoencoders that don’t overfit towards the Identity**

*Harald Steck*

I really enjoyed Harald Steck’s talk this year, very clear and full of interesting takeaways. In his paper, he deals with a tendency of auto-encoders to over-fit to the identity: relying too much on input feature i to predict output feature i, and not learning enough about interactions with other features j. He shows why dropout only addresses this problem indirectly, whereas emphasised denoising counteracts the over-fitting directly. I like closed-form solutions – they’re great for intuition – so it’s nice to see a simple closed-form solution to the full-rank linear variant of an auto-encoder with emphasised denoising. The paper also contains a number of experimental results that highlight (among other things) that the choice of regularisation can have substantial effects on performance, and that a properly regularised linear auto-encoder achieves impressively good results compared to deep non-linear models.
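To illustrate a closed form from the same line of work, here is the related linear model from Steck's earlier EASE paper (2019), not this paper's emphasised-denoising variant: B = I − P·diag(1/diag(P)) with P = (XᵀX + λI)⁻¹, which forces diag(B) = 0 so feature i is never predicted from itself.

```python
def mat_inv(a):
    """Gauss-Jordan inverse of a small square matrix (lists of lists)."""
    n = len(a)
    aug = [row[:] + [float(i == j) for j in range(n)]
           for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[piv] = aug[piv], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

def ease(X, lam=1.0):
    """Closed-form linear auto-encoder with a zero-diagonal constraint."""
    n = len(X[0])
    # Regularised Gram matrix G = X^T X + lam * I, then P = G^{-1}.
    G = [[sum(X[r][i] * X[r][j] for r in range(len(X)))
          + (lam if i == j else 0.0) for j in range(n)] for i in range(n)]
    P = mat_inv(G)
    # B[i][j] = -P[i][j] / P[j][j] off-diagonal, 0 on the diagonal.
    return [[0.0 if i == j else -P[i][j] / P[j][j] for j in range(n)]
            for i in range(n)]

B = ease([[1.0, 1.0, 0.0], [1.0, 0.0, 1.0], [0.0, 1.0, 1.0]])
assert all(B[i][i] == 0.0 for i in range(3))   # no self-prediction
```

The zero diagonal is the closed-form analogue of the point above: the model is structurally prevented from copying feature i straight through to output i.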