Liwei Jiang, Yuanjun Chai, Margaret Li, Mickel Liu, Raymond Fok, Nouha Dziri, Yulia Tsvetkov, Maarten Sap, Yejin Choi
This paper investigates the behaviour of large language models on open-ended tasks where no single ground truth exists.
This topic is arguably neglected in current frontier model development, where maths-, coding- and reasoning-focused post-training can contribute to mode collapse. As a result, the diversity of observed responses can be lacking compared to humans, which the paper demonstrates neatly, both qualitatively and quantitatively.
The authors share `Infinity-Chat`, a dataset of real-world open-ended LLM chat queries covering a range of topics, including brainstorming and creative content generation. These queries lack a ground truth to benchmark against, but response similarities within and between model classes can be studied and compared to human responses.
The authors report a disconcertingly high level of response similarity, both intra- and inter-model, across a range of closed and open-source models. This may be acceptable for some verifiable tasks, but for creative tasks it is suboptimal.
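To make the intra- vs inter-model distinction concrete, here is a minimal sketch of how such similarity could be quantified with sentence embeddings and cosine similarity. This is an illustration rather than the paper's exact metric; the `sentence-transformers` model name and the example responses are assumptions.

```python
# Minimal sketch: intra- and inter-model response similarity via sentence embeddings.
# Not the paper's exact metric; model name and responses are illustrative.
from itertools import combinations

from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

# Hypothetical responses to one open-ended prompt, grouped by model.
responses = {
    "model_a": ["Rain taps the window, soft and slow...", "Grey clouds gather over quiet streets..."],
    "model_b": ["Soft rain on rooftops, evening hush...", "Clouds drift low as rain begins to fall..."],
}

encoder = SentenceTransformer("all-MiniLM-L6-v2")
embeddings = {m: encoder.encode(r, convert_to_tensor=True) for m, r in responses.items()}

def intra_similarity(emb):
    """Average pairwise cosine similarity within one model's samples."""
    pairs = list(combinations(range(len(emb)), 2))
    return sum(float(cos_sim(emb[i], emb[j])) for i, j in pairs) / len(pairs)

def inter_similarity(emb_a, emb_b):
    """Average cosine similarity across two models' samples."""
    return float(cos_sim(emb_a, emb_b).mean())

for model, emb in embeddings.items():
    print(model, "intra:", round(intra_similarity(emb), 3))
print("inter:", round(inter_similarity(embeddings["model_a"], embeddings["model_b"]), 3))
```

High scores within a single model's samples indicate mode collapse; high scores across models indicate the "hivemind" effect the paper describes.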
The paper also shows that human preferences over open-ended responses are more pluralistic than those of LLM-based judges: multiple answers can be considered high quality by different human annotators, who disagree more strongly with each other than LLM judges do.
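The kind of comparison described above can be illustrated with a small sketch: average pairwise rating disagreement among human annotators versus among LLM judges scoring the same responses. The ratings below are made up for illustration and do not come from the paper.

```python
# Sketch: compare mean pairwise rating disagreement for humans vs LLM judges.
# All ratings here are hypothetical, purely to illustrate the comparison.
import numpy as np

def mean_pairwise_disagreement(ratings):
    """ratings: (n_raters, n_items) array of quality scores, e.g. on a 1-5 scale."""
    ratings = np.asarray(ratings, dtype=float)
    n = ratings.shape[0]
    diffs = [np.mean(np.abs(ratings[i] - ratings[j]))
             for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(diffs))

# Hypothetical scores from three human annotators and three LLM judges on five responses.
human_ratings = [[5, 2, 4, 1, 3],
                 [2, 5, 3, 4, 1],
                 [4, 1, 5, 2, 5]]
judge_ratings = [[4, 4, 5, 4, 4],
                 [4, 5, 5, 4, 4],
                 [5, 4, 5, 4, 5]]

print("human disagreement:", mean_pairwise_disagreement(human_ratings))
print("judge disagreement:", mean_pairwise_disagreement(judge_ratings))
```

A larger value for humans than for judges would reflect the pluralism the paper reports: different annotators rank different answers highly, while LLM judges converge on similar scores.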
This topic matters for many scientific domains: how can we get the most out of language models to aid exploratory aspects of science, such as hypothesis generation?
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)