There is a growing literature on reasoning by large language models (LLMs), but discussion of the uncertainty in their responses remains limited.
This study assesses how much confidence LLMs place in their answers and how that confidence correlates with accuracy.
The performance of three LLMs (GPT-4o, GPT-4-turbo, and Mistral) was evaluated on benchmark question sets covering causal judgement, formal fallacies, probability, and statistical puzzles.
All three LLMs perform better than random guessing, but they vary in their tendency to change initial answers, and their self-reported confidence scores tend to overstate their actual accuracy.
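To make the calibration comparison concrete, the following is a minimal sketch (not drawn from the paper's code or data) of how self-reported confidence can be compared against observed accuracy; the `records` list is a hypothetical placeholder standing in for graded model outputs.

```python
# Minimal calibration sketch: compare mean self-reported confidence
# against observed accuracy. The records are illustrative placeholders,
# not real benchmark results.

from statistics import mean

# Each record: (self-reported confidence in [0, 1], whether the answer was correct)
records = [
    (0.95, True), (0.90, False), (0.85, True),
    (0.80, False), (0.99, True), (0.75, False),
]

accuracy = mean(1.0 if correct else 0.0 for _, correct in records)
stated_confidence = mean(conf for conf, _ in records)

# A positive gap indicates overconfidence: stated confidence exceeds accuracy.
print(f"accuracy:        {accuracy:.2f}")
print(f"mean confidence: {stated_confidence:.2f}")
print(f"overconfidence:  {stated_confidence - accuracy:+.2f}")
```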