NeurIPS paper reviews 2025 #9

30 January 2026
  • News
  • Quantitative research

In this paper review series our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2025.

Here, discover the perspectives of Casey, one of our Machine Learning Engineers.

Distributed Orthonormal Updates for Large-Scale Training

Kwangjun Ahn

This industry talk at NeurIPS presented recent research on orthonormal optimisers. These are a class of optimisation methods designed to address the scalability and conditioning issues of standard optimisers when training modern large language models (LLMs).

The motivation behind orthonormal optimisers stems from two key observations about widely used methods like AdamW:

  • Memory overhead: AdamW maintains both a momentum buffer and a second-moment buffer, each roughly the size of the model itself. For LLMs with tens or hundreds of billions of parameters, these extra copies can be prohibitively expensive.
  • Poorly conditioned updates: Empirically, AdamW tends to produce ill-conditioned updates. For an m×n weight matrix, many rows or columns of the update are often nearly linear combinations of others, leading to redundancy and inefficient learning (a small diagnostic for this is sketched below).
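
To make the second point concrete, the conditioning of any optimiser's update can be measured directly from its singular values. This is a minimal illustrative sketch, not something shown in the talk:

```python
import torch

def update_conditioning(update: torch.Tensor) -> float:
    """Ratio of the largest to the smallest singular value of an update.

    A large ratio means a few directions dominate the step: many rows or
    columns of the update are close to linear combinations of others. A
    perfectly orthonormalised update has all singular values equal to 1,
    i.e. a ratio of exactly 1.
    """
    s = torch.linalg.svdvals(update)              # singular values, descending
    return (s[0] / s[-1].clamp_min(1e-12)).item()
```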

A more recent optimiser, MUON (Momentum Orthogonalized by Newton-Schulz), addresses both issues. Unlike AdamW, which flattens matrices into long parameter vectors and applies updates independently to each element, MUON explicitly leverages the two-dimensional structure of weight matrices and enforces well-conditioned updates through orthonormalisation.

If G_t is the Nesterov momentum at step t, the update rule is:

W_{t+1} = W_t - \eta_t \cdot \sqrt{\frac{\text{fan}_{\text{out}}}{\text{fan}_{\text{in}}}} \cdot \mathrm{Orthogonalize}(G_t).

This update rule can be derived by minimising a linear approximation of the loss function subject to a constraint that limits the update magnitude in a root-mean-square (RMS) sense.
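
As a concrete illustration, here is a minimal PyTorch sketch of a single MUON step for a 2-D weight matrix. This is a reconstruction for illustration, not the speaker's code; the quintic Newton–Schulz coefficients are the ones given in the public Muon write-up.

```python
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    # Quintic Newton-Schulz iteration with the coefficients from the public
    # Muon write-up. It pushes the singular values of G towards 1 without
    # computing an SVD.
    a, b, c = 3.4445, -4.7750, 2.0315
    X = G / (G.norm() + 1e-7)                 # scale so the spectral norm <= 1
    transposed = X.shape[0] > X.shape[1]
    if transposed:                            # iterate on the wide orientation
        X = X.T
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * (A @ A)) @ X
    return X.T if transposed else X

def muon_step(W: torch.Tensor, G: torch.Tensor, lr: float) -> torch.Tensor:
    # One MUON update with G the Nesterov momentum:
    # W <- W - lr * sqrt(fan_out / fan_in) * Orthogonalize(G).
    fan_out, fan_in = W.shape
    return W - lr * (fan_out / fan_in) ** 0.5 * newton_schulz_orthogonalize(G)
```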

The results on the NanoGPT speedrun benchmark show a convincing improvement over Adam:

MUON improves validation loss compared to Adam for equal training time (Source: Muon: An optimizer for hidden layers in neural networks)

The primary focus of the talk was on follow-up work aimed at making MUON practical at scale. Researchers from various labs have proposed implementations and optimisations to reduce both communication overhead and redundant FLOPs, which can otherwise dominate training costs.

To this end, the talk also covered DION (Distributed Orthonormalisation) and DION-2, which retain the core idea of orthonormal updates but avoid orthogonalising the entire weight matrix. They also replace MUON's Newton–Schulz implementation of the Orthogonalize operation with a cheaper, iterative method that works more efficiently in distributed settings.
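
The talk stayed at a high level, but the low-rank flavour of the idea can be sketched roughly as follows: carry a rank-r orthonormal basis between steps and refresh it with a single power-iteration-plus-QR pass, instead of orthogonalising the full m×n matrix. This is an illustrative simplification that omits DION's error-feedback bookkeeping; the function below is hypothetical, not the published algorithm:

```python
import torch

def lowrank_orthonormal_step(M: torch.Tensor, Q: torch.Tensor):
    # M: m x n momentum matrix; Q: n x r orthonormal basis carried over from
    # the previous step. One power-iteration pass refreshes the dominant
    # subspace, and two thin QRs keep both factors orthonormal, so the
    # returned rank-r update has all of its r singular values equal to 1.
    P = M @ Q                                # m x r projection onto the basis
    P, _ = torch.linalg.qr(P)                # orthonormal left factor
    Q_new, _ = torch.linalg.qr(M.T @ P)      # refreshed right basis, n x r
    return P @ Q_new.T, Q_new                # orthonormal update, new basis
```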


Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenisations

Brian Siyuan Zheng, Alisa Liu, Orevaoghene Ahia, Jonathan Hayase, Yejin Choi, Noah A. Smith

One of the key ingredients in training large language models (LLMs) is tokenisation. Tokenisation maps raw text into a sequence of integers, where each integer represents a substring of characters. Because there are many valid ways to tokenise the same string, different models are often trained using different tokenisation schemes.

The figure below illustrates how two tokenisers segment the same string differently:

Left: GPT-4o tokenises the string into 5 tokens. Right: GPT-3 tokenises the same string into only 2 tokens
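
To see this concretely, the same string can be mapped to several valid token sequences that all decode to identical text. A small sketch using the `tiktoken` library (the tokeniser choice and example string here are illustrative, not the paper's exact setup):

```python
import tiktoken

enc = tiktoken.get_encoding("o200k_base")      # the GPT-4o tokeniser

text = "hello world"
canonical = enc.encode(text)                   # the tokenisation BPE produces
# One non-canonical alternative: tokenise each character on its own.
per_char = [tok for ch in text for tok in enc.encode(ch)]

print(canonical, "->", enc.decode(canonical))  # few tokens
print(per_char, "->", enc.decode(per_char))    # many tokens, same string
```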

This paper asks the question: “What happens if we use a different tokenisation at inference time than the one used during training?”. Surprisingly, the authors find that “instruction-tuned LMs across many model families are extremely robust to non-canonical tokenisations.”

In contrast, the base language models (before RLHF, SFT, or DPO) are not robust to alternate tokenisations. Instead of producing grammatically correct responses, these models tend to perform raw text continuation, mimicking the quirks and idiosyncrasies of the input tokenisation. Although the resulting output may look jumbled or corrupted, there is still evidence that the model partially understands the underlying prompt.

To explain this robustness, the authors investigate several possible sources and conclude that much of it emerges during the Supervised Fine-Tuning (SFT) stage, in particular through the use of a chat template.

Perhaps even more surprising than the robustness result itself, the authors show that non-canonical tokenisations can sometimes improve performance relative to the canonical tokenisation; for example, right-digit-aligned tokenisation helps on arithmetic tasks.
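
As a small illustration of what right-digit alignment means, the helper below chunks a digit string into groups of three from the right, so that the final chunk of every number always holds the lowest place values (a hypothetical sketch, not the authors' implementation):

```python
def right_aligned_digit_chunks(number: str, width: int = 3) -> list[str]:
    # Group digits from the right, e.g. "1234567" -> ["1", "234", "567"],
    # so place values line up across numbers of different lengths.
    head = len(number) % width
    chunks = [number[:head]] if head else []
    chunks += [number[i:i + width] for i in range(head, len(number), width)]
    return chunks

print(right_aligned_digit_chunks("1234567"))  # ['1', '234', '567']
print(right_aligned_digit_chunks("89"))       # ['89']
```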


Read more paper reviews

NeurIPS 2025: Paper review #1

Discover the perspectives of Nick, one of our Quantitative Researchers, on the following papers:

  • Counterfactual Identifiability via Dynamic Optimal Transport
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Progressive Data Dropout: An Embarrassingly Simple Approach to Train Faster
Read now
NeurIPS 2025: Paper review #2

Discover the perspectives of Tomi, one of our Quantitative Research Managers, on the following papers:

  • Kronos: A Foundation Model for the Language of Financial Markets
  • LOBERT: Generative AI Foundation Model for Limit Order Book Messages
  • Auto-Compressing Networks
Read now
NeurIPS 2025: Paper review #3

Discover the perspectives of Dustin, one of our Scientific Directors, on the following papers:

  • Statistical Inference for Gradient Boosting Regression
  • Dynamic Low-Rank Training with Spectral Regularisation: Achieving Robustness in Compressed Representations
Read now
NeurIPS 2025: Paper review #4

Discover the perspectives of Nick, one of our Software Engineers, on the following papers:

  • Pass@K Policy Optimisation: Solving Harder Reinforcement Learning Problems
  • Antidistillation Sampling
  • Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Read now
NeurIPS 2025: Paper review #5

Discover the perspectives of Cédric, one of our Quantitative Researchers, on the following papers:

  • Learning (Approximately) Equivariant Networks via Constrained Optimisation
  • On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
Read now
NeurIPS 2025: Paper review #6

Discover the perspectives of Ognjen, one of our Quantitative Researchers, on the following papers:

  • Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimisation
  • Bubbleformer: Forecasting Boiling with Transformers
Read now
NeurIPS 2025: Paper review #7

Discover the perspectives of Radomir, one of our Machine Learning Engineers, on the following papers:

  • Learning Task-Agnostic Representations through Multi-Teacher Distillation
  • Contrastive Representations for Temporal Reasoning
Read now
NeurIPS 2025: Paper review #8

Discover the perspectives of Benjamin, one of our Quantitative Researchers, on the following papers:

  • Dynamical Decoupling of Generalisation and Overfitting in Large Two-Layer Networks
  • Backward Conformal Prediction
  • Predicting the Performance of Black-box Language Models with Follow-up Queries
Read now
NeurIPS 2025: Paper review #10

Discover the perspectives of Hugh, one of our Quantitative Research Managers, on the following papers:

  • Understanding and Mitigating Numerical Sources of Non-determinism in LLM Inference
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Read now
NeurIPS 2025: Paper review #11

Discover the perspectives of Timothy, one of our Machine Learning Engineers, on the following papers:

  • ZeroS: Zero-Sum Linear Attention for Efficient Transformers
  • In Search of Adam’s Secret Sauce
Coming soon
NeurIPS 2025: Paper review #12

Discover the perspectives of David, one of our Quantitative Researchers, on the following papers:

  • 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
  • Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Read now
NeurIPS 2025: Paper review #13

Discover the perspectives of Szymon, one of our Senior Quantitative Researchers, on the following papers:

  • Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
  • Parallelizing MCMC Across the Sequence Length
Read now
NeurIPS 2025: Paper review #14

Discover the perspectives of Simon, one of our Senior Quantitative Researchers, on the following papers:

  • Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)
  • ConfTuner: Training Large Language Models to Express Their Confidence Verbally
Read now
