
NeurIPS paper reviews 2025 #14

30 January 2026
  • News
  • Quantitative research

In this paper review series our team of researchers and machine learning practitioners discuss the papers they found most interesting at NeurIPS 2025.

Here, discover the perspectives of Simon, one of our Senior Quantitative Researchers.

Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi

This paper investigates the behaviour of large language models on open-ended tasks where no single ground truth exists.

This topic is arguably neglected in current frontier models, where maths-, coding- and reasoning-focused post-training can contribute to mode collapse. As a result, the diversity of observed responses can be lacking compared to that of humans, as the paper neatly demonstrates both qualitatively and quantitatively.

The authors share `Infinity-Chat`, a dataset of real-world open-ended LLM chat queries on a range of topics, including brainstorming and creative content generation. These queries lack a ground truth to benchmark against, but response similarities within and between model classes can be studied and compared to human responses.

The authors report a disconcertingly high level of response similarity, both intra- and inter-model, across a range of closed and open-source models. This may be acceptable for some verifiable tasks, but for creative tasks it is suboptimal.
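
To make the intra- and inter-model comparison concrete, here is a minimal sketch (our illustration, not the paper's code) of mean pairwise cosine similarity over response embeddings; the embedding step is assumed and the data below is a toy stand-in:

```python
# Our illustration: intra- and inter-model response similarity, assuming
# each model's sampled responses to one prompt have already been embedded
# as rows of a matrix (e.g. by any sentence-embedding model).
import numpy as np

def mean_pairwise_cosine(a, b=None):
    """Mean cosine similarity between row pairs of `a` (intra-model),
    or between rows of `a` and rows of `b` (inter-model)."""
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    if b is None:
        sims = a @ a.T
        n = len(a)
        return float((sims.sum() - n) / (n * (n - 1)))  # drop the diagonal
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    return float((a @ b.T).mean())

rng = np.random.default_rng(0)
model_a = rng.normal(size=(8, 384))  # 8 sampled responses, toy embeddings
model_b = rng.normal(size=(8, 384))
print("intra-model:", mean_pairwise_cosine(model_a))
print("inter-model:", mean_pairwise_cosine(model_a, model_b))
```

A "hivemind" shows up as high intra-model similarity with inter-model similarity nearly as high, whereas a pool of human responses would be expected to score lower on both.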

The paper also shows that human preferences over open-ended responses are more pluralistic than those of LLM-based judges: multiple answers can be considered high quality by different human annotators, who disagree more strongly with each other than LLM judges do.

This topic is important to many scientific domains: how can we get the most out of language models to aid exploratory aspects of science, such as hypothesis generation?


Read paper reviews from NeurIPS 2024 from a number of our quantitative researchers and machine learning practitioners.

Read now

ConfTuner: Training Large Language Models to Express Their Confidence Verbally

Yibo Li, Miao Xiong, Jiaying Wu, Bryan Hooi

For high-risk, real-world applications, it is hard to trust an LLM that is prone to hallucination and exhibits extreme overconfidence. Models can be prompted to verbalise their confidence, but this is typically poorly calibrated. The authors of ConfTuner propose fine-tuning a language model with a proper scoring rule objective to remedy this.

For open-weight models, the method involves prompting the LLM for a confidence level (say, 0-100%) along with its answer. The probability distribution over some pre-defined confidence tokens is extracted from the model's logits, and training then minimises a tokenised Brier score. This steers the model towards producing confidence tokens that are well calibrated. The method can also be adapted to provide proxy confidence for closed-weight models.
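
As a minimal PyTorch sketch of such a tokenised Brier score (our illustration, not the authors' implementation; the confidence-token vocabulary ids, shapes and names below are assumptions):

```python
# Sketch of a tokenised Brier score loss, assuming the model emits one
# token from a fixed set of confidence tokens ("0", "10", ..., "100").
import torch

# Placeholder vocabulary ids for the 11 confidence tokens; in practice
# these would come from the tokenizer.
CONF_TOKEN_IDS = torch.arange(100, 111)
CONF_VALUES = torch.linspace(0.0, 1.0, steps=11)  # 0%, 10%, ..., 100%

def tokenised_brier_loss(logits, correct):
    """logits: (batch, vocab) at the confidence-token position;
    correct: (batch,) 1.0 if the model's answer was right, else 0.0."""
    # Distribution over the pre-defined confidence tokens only.
    probs = torch.softmax(logits[:, CONF_TOKEN_IDS], dim=-1)  # (batch, 11)
    # Squared error between each stated confidence and 0/1 correctness,
    # weighted by the probability mass on that confidence token.
    sq_err = (CONF_VALUES.unsqueeze(0) - correct.unsqueeze(1)) ** 2
    return (probs * sq_err).sum(dim=-1).mean()

# Toy usage: two examples, 32k-token vocabulary.
loss = tokenised_brier_loss(torch.randn(2, 32_000), torch.tensor([1.0, 0.0]))
```

Because the Brier score is a proper scoring rule, its expectation is minimised when the stated confidence matches the true probability of the answer being correct, which is what pushes the model towards calibration.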

Empirically, ConfTuner works well at reducing the expected calibration error of these verbalised confidence scores. The scores are also shown to be useful for decision-making, for example in model cascades, where a more powerful model is used only if confidence in the initial answer falls below a threshold.
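
The cascade logic itself is simple; a sketch under our assumptions (`cheap_model` and `strong_model` are hypothetical callables returning an answer together with a verbalised confidence in [0, 1]):

```python
# Our illustration of a confidence-gated model cascade: escalate to the
# stronger (more expensive) model only when the cheap model's verbalised
# confidence falls below a threshold.
def cascade(question, cheap_model, strong_model, threshold=0.8):
    answer, confidence = cheap_model(question)
    if confidence >= threshold:
        return answer          # trust the cheap model's answer
    answer, _ = strong_model(question)
    return answer              # escalate when confidence is low
```

The threshold trades off cost against accuracy, and a well-calibrated confidence score is exactly what makes choosing it meaningful.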

This is a nice application of classical ML to enhance the outputs of language models. Techniques like this could help improve trust and interpretability in LLM applications.


Read more paper reviews

NeurIPS 2025: Paper review #1

Discover the perspectives of Nick, one of our Quantitative Researchers, on the following papers:

  • Counterfactual Identifiability via Dynamic Optimal Transport
  • Does Reinforcement Learning Really Incentivize Reasoning Capacity in LLMs Beyond the Base Model?
  • Progressive Data Dropout: An Embarrassingly Simple Approach to Train Faster
Read now
NeurIPS 2025: Paper review #2

Discover the perspectives of Tomi, one of our Quantitative Research Managers, on the following papers:

  • Kronos: A Foundation Model for the Language of Financial Markets
  • LOBERT: Generative AI Foundation Model for Limit Order Book Messages
  • Auto-Compressing Networks
Read now
NeurIPS 2025: Paper review #3

Discover the perspectives of Dustin, one of our Scientific Directors, on the following papers:

  • Statistical Inference for Gradient Boosting Regression
  • Dynamic Low-Rank Training with Spectral Regularisation: Achieving Robustness in Compressed Representations
Read now
NeurIPS 2025: Paper review #4

Discover the perspectives of Nick, one of our Software Engineers, on the following papers:

  • Pass@K Policy Optimisation: Solving Harder Reinforcement Learning Problems
  • Antidistillation Sampling
  • Fundamental Limitations in Pointwise Defences of LLM Finetuning APIs
Read now
NeurIPS 2025: Paper review #5

Discover the perspectives of Cédric, one of our Quantitative Researchers, on the following papers:

  • Learning (Approximately) Equivariant Networks via Constrained Optimisation
  • On the Closed-Form of Flow Matching: Generalization Does Not Arise from Target Stochasticity
Read now
NeurIPS 2025: Paper review #6

Discover the perspectives of Ognjen, one of our Quantitative Researchers, on the following papers:

  • Omnipresent Yet Overlooked: Heat Kernels in Combinatorial Bayesian Optimisation
  • Bubbleformer: Forecasting Boiling with Transformers
Read now
NeurIPS 2025: Paper review #7

Discover the perspectives of Radomir, one of our Machine Learning Engineers, on the following papers:

  • Learning Task-Agnostic Representations through Multi-Teacher Distillation
  • Contrastive Representations for Temporal Reasoning
Read now
NeurIPS 2025: Paper review #8

Discover the perspectives of Benjamin, one of our Quantitative Researchers, on the following papers:

  • Dynamical Decoupling of Generalisation and Overfitting in Large Two-Layer Networks
  • Backward Conformal Prediction
  • Predicting the Performance of Black-box Language Models with Follow-up Queries
Read now
NeurIPS 2025: Paper review #9

Discover the perspectives of Casey, one of our Machine Learning Engineers, on the following papers:

  • Distributed Orthonormal Updates for Large-Scale Training
  • Broken Tokens? Your Language Model can Secretly Handle Non-Canonical Tokenisations
Read now
NeurIPS 2025: Paper review #10

Discover the perspectives of Hugh, one of our Quantitative Research Managers, on the following papers:

  • Understanding and Mitigating Numerical Sources of Non-determinism in LLM Inference
  • Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Read now
NeurIPS 2025: Paper review #11

Discover the perspectives of Timothy, one of our Machine Learning Engineers, on the following papers:

  • ZeroS: Zero-Sum Linear Attention for Efficient Transformers
  • In Search of Adam’s Secret Sauce
Coming soon
NeurIPS 2025: Paper review #12

Discover the perspectives of David, one of our Quantitative Researchers, on the following papers:

  • 1000 Layer Networks for Self-Supervised RL: Scaling Depth Can Enable New Goal-Reaching Capabilities
  • Revisiting Residual Connections: Orthogonal Updates for Stable and Efficient Deep Networks
Read now
NeurIPS 2025: Paper review #13

Discover the perspectives of Szymon, one of our Senior Quantitative Researchers, on the following papers:

  • Rollout Roulette: A Probabilistic Inference Approach to Inference-Time Scaling of LLMs using Particle-Based Monte Carlo Methods
  • Parallelizing MCMC Across the Sequence Length
Read now

