Understanding and Mitigating Numerical Sources of Non-determinism in LLM Inference

Jiayi Yuan, Hao Li, Xinheng Ding, Wenya Xie, Yu-Jhe Li, Wentian Zhao, Kun Wan, Jing Shi, Xia Hu, Zirui Liu
This paper sheds light on a commonly observed problem: non-determinism in transformer inference. While randomness is usually attributed to random sampling, the paper demonstrates that non-determinism in floating-point arithmetic is a significant factor as well.
The paper shows that the non-determinism is mostly attributable to the non-associativity of floating-point addition: (a + b) + c is not necessarily numerically equal to a + (b + c), despite the two being mathematically equivalent. GPU execution does not, in general, guarantee the order of summation, so it is difficult to maintain deterministic inference while keeping GPU execution performant.
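To make the non-associativity concrete, here is a minimal, self-contained Python illustration (mine, not from the paper): the same values summed in different orders produce different floating-point results.

```python
import random

# Non-associativity: (a + b) + c != a + (b + c) in floating point.
a, b, c = 0.1, 1e16, -1e16
print((a + b) + c)  # 0.0  -- 0.1 is absorbed when added to 1e16
print(a + (b + c))  # 0.1  -- the large terms cancel exactly first

# The same effect at scale: summation order changes the result,
# analogous to a GPU reduction whose order varies between runs.
vals = [random.uniform(-1.0, 1.0) for _ in range(100_000)]
total_forward = sum(vals)
random.shuffle(vals)  # a different reduction order
total_shuffled = sum(vals)
print(total_forward == total_shuffled)      # typically False
print(abs(total_forward - total_shuffled))  # small but nonzero
```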
Beyond the LLM world, this is problematic because flash-attention-based models are non-deterministic, which means results are not reproducible, and running paired comparison tests between model variants with the same seed becomes more difficult. The paper proposes a LayerCast method that casts each layer's weights to higher precision at compute time, sketched below. They show that this greatly mitigates the issue in practice, though it does not eliminate it entirely.
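A rough sketch of the idea, assuming a PyTorch-style layer (the class and its names are illustrative, not the paper's implementation): weights stay in 16-bit storage, but are upcast to FP32 just before the matrix multiply.

```python
import torch

class LayerCastLinear(torch.nn.Module):
    """Illustrative linear layer: 16-bit storage, FP32 compute."""

    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        # Keep the memory footprint of bfloat16 weights...
        self.weight = torch.nn.Parameter(
            torch.randn(out_features, in_features, dtype=torch.bfloat16)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # ...but upcast to FP32 at compute time. The extra mantissa
        # bits shrink the rounding differences caused by varying
        # summation orders, so outputs diverge far less run to run.
        return x.float() @ self.weight.float().t()
```

The trade-off is a cast per layer at compute time in exchange for FP32-like numerical stability without paying for FP32-sized weights in memory.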