A University of Oxford study raises concerns about the effectiveness of medical advice chatbots, showing that humans performed poorly at diagnosing medical conditions when assisted by large language models (LLMs).
Participants using LLMs identified relevant conditions less consistently than those in a control group who self-diagnosed, highlighting issues with human-technology interaction.
Despite the LLMs providing correct information, participants often gave the models incomplete details or misinterpreted their responses, leading to incorrect self-diagnoses and actions.
The study demonstrates that testing LLMs solely on standard measures, such as medical licensing exams, may not reflect their real-world performance when interacting with humans.
Simulated participants interacting with LLMs performed better than humans, suggesting that LLMs may interact more effectively with other AI models than with humans.
User experience specialist Nathalie Volkheimer emphasizes the importance of understanding the audience and customer experience before deploying LLMs as chatbots.
Volkheimer stresses the need for well-curated training materials to make chatbots useful and warns that blaming users for poor interactions is not a constructive approach.
The study urges AI engineers and designers to test LLMs in interactions with humans, rather than relying solely on standardized benchmarks, to avoid misjudging their real-world capabilities.
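To make that recommendation concrete, the sketch below shows one way such interaction-based evaluation could be structured: instead of scoring an LLM on exam-style questions in isolation, the model is run through a multi-turn conversation with a (human or simulated) user, and what gets graded is the conclusion the user actually reaches. This is a minimal, hypothetical illustration, not the study's own harness; the `call_llm` and `call_simulated_user` helpers are placeholders for whatever model APIs an evaluator would actually use.

```python
# Hypothetical sketch of interaction-based evaluation of a medical-advice LLM.
# `call_llm` and `call_simulated_user` are placeholders for real model calls.

from typing import Callable, Dict, List

Message = Dict[str, str]  # e.g. {"role": "user", "content": "..."}


def run_consultation(
    call_llm: Callable[[List[Message]], str],
    call_simulated_user: Callable[[List[Message], str], str],
    scenario: str,
    max_turns: int = 6,
) -> List[Message]:
    """Run a multi-turn chat between an advice LLM and a (simulated) patient.

    `scenario` is the ground-truth vignette the simulated patient acts out;
    the patient reveals details only when asked, mimicking how real users
    give incomplete information.
    """
    transcript: List[Message] = []
    for _ in range(max_turns):
        user_msg = call_simulated_user(transcript, scenario)
        transcript.append({"role": "user", "content": user_msg})
        assistant_msg = call_llm(transcript)
        transcript.append({"role": "assistant", "content": assistant_msg})
        # Arbitrary stop condition for the sketch: end once the model
        # signals it has given its final advice.
        if "final recommendation" in assistant_msg.lower():
            break
    return transcript


def outcome_is_correct(transcript: List[Message], expected_action: str) -> bool:
    """Grade what the user would do after the chat, not what the model knows.

    A real evaluation would use human raters or a separate grading model;
    simple string matching here only keeps the sketch self-contained.
    """
    final_exchange = " ".join(m["content"] for m in transcript[-2:])
    return expected_action.lower() in final_exchange.lower()
```

The key design point, mirroring the study's argument, is that the unit being scored is the whole interaction and its outcome, not a single question-answer pair.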
The discrepancy in performance between humans and simulated participants using LLMs highlights the complexities of human-technology interactions in chatbot applications.
Human participants often failed to follow the recommendations provided by LLMs, showcasing the challenges in translating LLM medical knowledge into practical self-diagnoses.
The study serves as a critical reminder for AI developers to evaluate LLMs in real-life scenarios with human users to accurately assess their performance and usability.