Practitioners of large language models (LLMs) commonly observe that outputs can vary for the same inputs, even under settings expected to be deterministic.
This work presents a systematic investigation into the non-determinism of five LLMs configured to be deterministic.
Accuracy variations of up to 15% were observed across naturally occurring runs, and the gap between best-possible and worst-possible performance reached up to 70%.
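The two quantities above can be computed from per-run, per-item correctness records: the observed variation is the spread of per-run accuracies, while the best-possible/worst-possible gap compares items any run got right against items every run got right. A minimal sketch (function name and data layout are illustrative, not from the study):

```python
def determinism_metrics(runs):
    """runs[r][i] is True if item i was answered correctly in run r.

    Returns (observed accuracy variation, best-worst possible gap).
    """
    n_items = len(runs[0])
    # Accuracy of each naturally occurring run.
    per_run_acc = [sum(run) / n_items for run in runs]
    # Spread between the best and worst observed runs.
    accuracy_variation = max(per_run_acc) - min(per_run_acc)
    # Best possible: count an item if any run answered it correctly.
    best = sum(any(r[i] for r in runs) for i in range(n_items)) / n_items
    # Worst possible: count an item only if every run answered it correctly.
    worst = sum(all(r[i] for r in runs) for i in range(n_items)) / n_items
    return accuracy_variation, best - worst

# Example: 3 runs over 4 items, each run scoring 50%.
runs = [
    [True, True, False, False],
    [True, False, True, False],
    [False, True, True, False],
]
# Observed variation is 0, yet the best-worst gap is 0.75:
# three items are solvable by some run, none by all runs.
```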
Because non-determinism in LLMs stems from techniques considered essential to the efficient use of compute resources, this issue is likely to persist.