
NeurIPS paper reviews 2025 #11

30 January 2026
  • News
  • Quantitative research

In this paper review series our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2025.

Here, discover the perspectives of Timothy, one of our Machine Learning Engineers.

ZeroS: Zero-Sum Linear Attention for Efficient Transformers

Jiecheng Lu, Xu Han, Yan Sun, Viresh Pati, Yubin Kim, Siddhartha Somani, Shihao Yang

Linear attention aims to scale transformers to long sequences by reducing the quadratic cost of self-attention to O(N), but in practice often underperforms standard softmax attention, in part due to attention dilution. In causal linear attention, the output at position t is typically written as

 o_t = \frac{\phi(q_t)^\top \sum_{i=1}^t \phi(k_i) v_i}{\phi(q_t)^\top \sum_{i=1}^t \phi(k_i)},

where ϕ(·) is a kernel feature map.
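As a minimal illustration of this recurrence (a sketch, assuming elu(x) + 1 as the feature map ϕ; not the paper's implementation), the causal sums can be maintained incrementally so each position costs O(1):

```python
import numpy as np

def phi(x):
    # Example non-negative feature map: elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def causal_linear_attention(Q, K, V):
    """Computes o_t = phi(q_t)^T S_t / (phi(q_t)^T z_t), where
    S_t = sum_{i<=t} phi(k_i) v_i^T and z_t = sum_{i<=t} phi(k_i)
    are carried as running prefix sums."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))        # running sum of outer products phi(k_i) v_i^T
    z = np.zeros(d)               # running sum of phi(k_i)
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        q_feat = phi(Q[t])
        out[t] = q_feat @ S / (q_feat @ z + 1e-9)
    return out
```

Because only the fixed-size state (S, z) is carried across positions, the total cost grows linearly with sequence length rather than quadratically.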

This formulation enforces non-negative, sum-to-one weights, constraining outputs to the convex hull of the values.

While this convexity is benign in softmax attention, whose adaptive normalisation can become arbitrarily sharp, it becomes harmful under linearisation, hardening into a uniform averaging bias as sequence length grows.

ZeroS traces this bias to the Taylor expansion of the softmax function around zero. Writing the softmax of the scores s_i as

 \operatorname{softmax}(s_i) \approx \frac{1}{t} + \frac{1}{t}\,\delta_i + \frac{1}{2t} \left( \delta_i^2 - \frac{1}{t}\sum_{j=1}^t \delta_j^2 \right) + O(\lVert s \rVert^3),

where δ_i = s_i − (1/t) ∑_j s_j is the centred score, the leading 1/t term corresponds to a uniform, query-independent contribution. In exact softmax attention this term is counterbalanced by higher-order terms and adaptive normalisation, but in linear attention, where only low-order structure is retained, it dominates, driving attention toward indiscriminate averaging.
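As a quick sanity check of this expansion (a sketch, not taken from the paper), one can compare the exact softmax of a small score vector with the second-order approximation; the error is third order in the scores, and the approximation is dominated by the uniform 1/t term:

```python
import numpy as np

rng = np.random.default_rng(0)
t = 8
s = 0.1 * rng.standard_normal(t)      # small scores, so the expansion is accurate
delta = s - s.mean()                  # centred scores delta_i

exact = np.exp(s) / np.exp(s).sum()
approx = 1 / t + delta / t + (delta**2 - (delta**2).mean()) / (2 * t)

print(np.abs(exact - approx).max())   # error is O(||s||^3), so very small here
print(approx.sum())                   # the approximation still sums to 1
```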

ZeroS explicitly removes this zero-order component, yielding attention weights that sum to zero before normalisation. This breaks the strict convexity constraint by allowing signed weights and contrastive interactions, while preserving the factorised structure required for efficient computation. As a result, attention can still be computed using prefix sums in linear time.
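To see why signed, zero-sum weights remain compatible with linear-time computation, consider the following illustrative sketch. It is not the paper's exact parameterisation or normalisation, which ZeroS derives from the expansion above; it only shows that centring each query's scores before weighting the values still factorises into a small number of running prefix sums:

```python
import numpy as np

def zero_sum_linear_attention(Q, K, V, phi):
    """Illustrative zero-sum variant: for each query the per-key weights
    a_i - mean_j(a_j), with a_i = phi(q_t).phi(k_i), sum to zero, yet the
    output still reduces to three running prefix sums."""
    T, d = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d, d_v))   # sum_i phi(k_i) v_i^T
    z = np.zeros(d)          # sum_i phi(k_i)
    u = np.zeros(d_v)        # sum_i v_i
    out = np.zeros((T, d_v))
    for t in range(T):
        S += np.outer(phi(K[t]), V[t])
        z += phi(K[t])
        u += V[t]
        q_feat = phi(Q[t])
        # sum_i (a_i - a_bar) v_i = phi(q)^T S - a_bar * sum_i v_i
        out[t] = q_feat @ S - (q_feat @ z) / (t + 1) * u
    return out
```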

By identifying attention dilution as a consequence of the zero-order term rather than an inherent limitation of linear attention, ZeroS shows that efficient transformers can retain much of the expressive power of softmax attention without sacrificing O(N) scalability.

The results show that ZeroS closes the gap with softmax attention and performs competitively with other efficient methods such as Mamba and GLA:

ZeroS achieves high accuracy on the RegBench benchmark suite.

NeurIPS 2024 paper reviews

Read paper reviews from NeurIPS 2024 from a number of our quantitative researchers and machine learning practitioners.

Read now

In Search of Adam’s Secret Sauce

Antonio Orvieto, Robert M. Gower

Adam is the de-facto standard optimiser in deep learning, especially for language models, yet the reasons for its strong performance are still not fully understood. In this paper, the authors conduct an extensive empirical investigation of Adam and several alternative optimisers, uncovering a number of insightful results.

Across 1,500+ language-model pretraining runs, with careful hyper-parameter sweeps to ensure each optimiser is well tuned, Adam consistently comes out on top:

Pretraining 160M-parameter models on SlimPajama. Despite extensive tuning, Signum still achieves slightly worse overall performance and is about 25% slower.

This suggests that Adam retains a genuine “secret sauce” that continues to distinguish it from simpler alternatives. By systematically ablating Adam’s hyper-parameters, the authors uncover a surprising pattern involving the forgetting factors β1 and β2 of the exponential moving averages of the first and second gradient moments: setting β1 = β2 is almost always optimal:

Pretraining 410M-parameter models on SlimPajama. Equal β values yield near-optimal performance.

In fact, they recommend this configuration as a strong default for Adam in language-model training. Constraining Adam to β1 = β2 = β also leads to a simple, elegant interpretation. The update direction for parameter coordinate k becomes

 d_k = \frac{-\operatorname{sign}(m_k)}{\sqrt{1 + \sigma_k^2 / m_k^2}},

where m_k and σ_k are online estimates of the mean and variance of the stochastic gradients. From this perspective, Adam can be viewed as a steepest-descent method with a variable trust region: when gradient noise dominates the signal, the signal-to-noise ratio m_k / σ_k shrinks and the effective step size is reduced; as uncertainty decreases, the effective step size smoothly increases toward 1.
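A quick numerical check of this reading (a sketch in which bias correction and ε are ignored, and σ_k² is taken as v_k − m_k²): with β1 = β2, Adam’s per-coordinate direction −m_k/√v_k coincides with the signal-to-noise form above.

```python
import numpy as np

rng = np.random.default_rng(0)
beta, T = 0.95, 5000
g = 1.0 + 0.5 * rng.standard_normal(T)     # noisy stochastic gradients for one coordinate

m = v = 0.0
for grad in g:
    m = beta * m + (1 - beta) * grad       # EMA of gradients (first moment)
    v = beta * v + (1 - beta) * grad**2    # EMA of squared gradients (second moment)

sigma2 = v - m**2                          # online variance estimate sigma_k^2 = v_k - m_k^2
adam_dir = -m / np.sqrt(v)                 # Adam direction with beta1 = beta2 (no eps, no bias correction)
snr_dir = -np.sign(m) / np.sqrt(1.0 + sigma2 / m**2)
print(adam_dir, snr_dir)                   # identical up to floating-point error
```

As the noise scale grows relative to |m_k|, both expressions shrink toward zero, matching the trust-region interpretation.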

Quantitative research & machine learning

Want to learn more about life as a researcher at G-Research?

Learn more

Read more paper reviews

NeurIPS 2025: Paper review #1

Discover the perspectives of Nick, one of our Quantitative Researchers, on the following papers:

  • Counterfactual Identifiability via Dynamic Optimal Transport
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Progressive Data Dropout: An Embarrassingly Simple Approach to Train Faster
Read now
NeurIPS 2025: Paper review #2

Discover the perspectives of Tomi, one of our Quantitative Research Managers, on the following papers:

  • Kronos: A Foundation Model for the Language of Financial Markets
  • LOBERT: Generative AI Foundation Model for Limit Order Book Messages
  • Auto-Compressing Networks
Read now
NeurIPS 2025: Paper review #3

Discover the perspectives of Dustin, one of our Scientific Directors, on the following papers:

  • Statistical Inference for Gradient Boosting Regression
  • Dynamic Low-Rank Training with Spectral Regularisation: Achieving Robustness in Compressed Representations
Read now
NeurIPS 2025: Paper review #4

Discover the perspectives of Nick, one of our Software Engineers, on the following papers:

  • Pass@K Policy Optimisation: Solving Harder Reinforcement Learning Problems
  • Antidistillation Sampling
  • Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Read now
NeurIPS 2025: Paper review #5

Discover the perspectives of Cédric, one of our Quantitative Researchers, on the following papers:

  • Learning (Approximately) Equivariant Networks via Constrained Optimisation
  • On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
Read now
NeurIPS 2025: Paper review #6

Discover the perspectives of Ognjen, one of our Quantitative Researchers, on the following papers:

  • Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimisation
  • Bubbleformer: Forecasting Boiling with Transformers
Read now
NeurIPS 2025: Paper review #7

Discover the perspectives of Radomir, one of our Machine Learning Engineers, on the following papers:

  • Learning Task-Agnostic Representations through Multi-Teacher Distillation
  • Contrastive Representations for Temporal Reasoning
Read now
NeurIPS 2025: Paper review #8

Discover the perspectives of Benjamin, one of our Quantitative Researchers, on the following papers:

  • Dynamical Decoupling of Generalisation and Overfitting in Large Two-Layer Networks
  • Backward Conformal Prediction
  • Predicting the Performance of Black-box Language Models with Follow-up Queries
Read now
NeurIPS 2025: Paper review #9

Discover the perspectives of Casey, one of our Machine Learning Engineers, on the following papers:

  • Distributed Orthonormal Updates for Large-Scale Training
  • Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenisations
Read now
NeurIPS 2025: Paper review #10

Discover the perspectives of Hugh, one of our Quantitative Research Managers, on the following papers:

  • Understanding and Mitigating Numerical Sources of Non-determinism in LLM Inference
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Read now
NeurIPS 2025: Paper review #12

Discover the perspectives of David, one of our Quantitative Researchers, on the following papers:

  • 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
  • Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Read now
NeurIPS 2025: Paper review #13

Discover the perspectives of Szymon, one of our Senior Quantitative Researchers, on the following papers:

  • Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
  • Parallelizing MCMC Across the Sequence Length
Read now
NeurIPS 2025: Paper review #14

Discover the perspectives of Simon, one of our Senior Quantitative Researchers, on the following papers:

  • Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
  • ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Read now

