Better & Faster Large Language Models via Multi-token Prediction
Fabian Gloeckle, Badr Youbi Idrissi, Baptiste Rozière, David Lopez-Paz, Gabriel Synnaeve
This paper introduces a novel approach to LLM pre-training that challenges the conventional next-token prediction paradigm. The authors propose a multi-token prediction loss in which several output heads on a shared trunk predict the next few tokens in parallel, i.e. non-autoregressively, aiming to enhance model performance whilst maintaining parallelisability. Intuitively, such a multi-token loss should help because it brings the pre-training objective closer to the autoregressive text generation performed in downstream tasks.
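To make the idea concrete, here is a minimal PyTorch-style sketch of such a multi-token loss, assuming a shared trunk that embeds the token ids and returns hidden states; the class and variable names are mine, and the separate linear heads are a simplification of the paper's architecture rather than the authors' code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTokenPredictor(nn.Module):
    """Shared trunk with one output head per future offset (illustrative sketch)."""

    def __init__(self, trunk: nn.Module, d_model: int, vocab_size: int, n_future: int = 4):
        super().__init__()
        self.trunk = trunk          # assumed to map (batch, seq_len) ids -> (batch, seq_len, d_model)
        self.n_future = n_future    # number of future tokens predicted at each position
        self.heads = nn.ModuleList(
            [nn.Linear(d_model, vocab_size) for _ in range(n_future)]
        )

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        """tokens: (batch, seq_len) integer ids; returns the summed multi-token loss."""
        hidden = self.trunk(tokens)                      # (batch, seq_len, d_model)
        loss = torch.zeros((), device=tokens.device)
        for k, head in enumerate(self.heads, start=1):
            logits = head(hidden[:, :-k])                # position t predicts token t+k
            targets = tokens[:, k:]                      # ground truth k steps ahead
            loss = loss + F.cross_entropy(
                logits.reshape(-1, logits.size(-1)), targets.reshape(-1)
            )
        return loss
```

At inference time only the first (next-token) head is strictly needed, so the model can still be used as an ordinary autoregressive LM; the extra heads come back into play for the speculative decoding mentioned below.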
The research examines models ranging from 0.3 to 13 billion parameters, revealing that the multi-token loss becomes increasingly beneficial as model size grows. Notably, the approach yields significant improvements on coding benchmarks such as MBPP and HumanEval. The fact that smaller models (under 1 billion parameters) actually degrade with this method, even on coding benchmarks, may explain why it hasn't been explored before.
Besides these gains on coding benchmarks, I think their findings on byte-level training are particularly promising, as the multi-token loss nearly bridges the performance gap between byte-level and standard tokenisers. This development could pave the way for ‘tokeniser-free’ models in the future.
Beyond the empirical gains, the paper also offers two intuitions for why a multi-token loss should indeed help with autoregressive text generation, which I found helpful.
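Sketching the information-theoretic one from memory (so the exact bookkeeping is my reconstruction): write X for the next token and Y for the token after it. Summed over the two positions, standard next-token training optimises H(X) + H(Y | X), whereas a 2-token prediction loss optimises H(X) + H(Y). Decomposing both via the mutual information I(X;Y),

$$
H(X) + H(Y \mid X) = H(X \mid Y) + I(X;Y) + H(Y \mid X),
$$

$$
H(X) + H(Y) = H(X \mid Y) + 2\,I(X;Y) + H(Y \mid X),
$$

so the multi-token objective puts twice the weight on the term that ties the next token to the rest of the continuation, arguably the part that matters most when generating long sequences.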
It’s worth noting that the claimed reduction in inference time refers to wall-clock time rather than FLOPs, as the gains stem from increased parallelisability and self-speculative decoding with the extra heads, not from fundamental improvements in computational efficiency.
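For concreteness, here is a rough sketch of how the extra heads could drive greedy self-speculative (blockwise) decoding. The interface is an assumption of the sketch, not the authors' implementation: `model(tokens)` is taken to return per-head logits of shape `(n_future, batch, seq_len, vocab)`, with head 0 being the ordinary next-token head, and batch size 1 is assumed for simplicity.

```python
import torch

@torch.no_grad()
def self_speculative_step(model, tokens: torch.Tensor) -> torch.Tensor:
    """One greedy decoding step. tokens: (1, T); returns (1, T + n_accepted)."""
    # 1) Draft: every head proposes a token for positions t+1 .. t+n
    #    from a single forward pass over the current context.
    head_logits = model(tokens)                      # (n_future, 1, T, V)
    drafts = head_logits[:, 0, -1, :].argmax(-1)     # (n_future,)
    n_future = drafts.size(0)

    # 2) Verify: one forward pass over context + drafts, then read what the
    #    next-token head would have generated at each drafted position.
    extended = torch.cat([tokens, drafts.unsqueeze(0)], dim=1)   # (1, T + n_future)
    verify_logits = model(extended)[0, 0]                        # (T + n_future, V)
    preds = verify_logits[-n_future - 1:-1].argmax(-1)           # (n_future,)

    # 3) Accept the longest prefix of drafts the next-token head agrees with;
    #    on the first mismatch, keep the verified token instead and stop.
    #    (The first draft always matches, so at least one token is accepted.)
    accepted = []
    for k in range(n_future):
        if drafts[k] != preds[k]:
            accepted.append(preds[k])
            break
        accepted.append(drafts[k])
    new_tokens = torch.stack(accepted).unsqueeze(0)              # (1, n_accepted)
    return torch.cat([tokens, new_tokens], dim=1)
```

When most drafts are accepted, several tokens are emitted per pair of forward passes, which is where the wall-clock savings come from, even though the FLOPs per generated token do not decrease.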