<ul><li>There is a growing literature on reasoning by large language models (LLMs), but discussion of the uncertainty in their responses remains limited.</li><li>This study assesses how confident LLMs are in their answers and how that confidence correlates with accuracy.</li><li>The performance of three LLMs (GPT4o, GPT4-turbo, and Mistral) was evaluated on benchmark sets of questions covering causal judgement, formal fallacies, probability, and statistical puzzles.</li><li>The LLMs perform better than random guessing, but they vary in their tendency to change their initial answers, and their self-reported confidence scores tend to be overstated.</li></ul>

Confidence in the Reasoning of Large Language Models
