The race to AGI isn’t just about creating smarter models; it’s also about building systems that think, adapt, and feel human, especially in the age of intelligent design.
Vibe checks, as OpenAI’s Greg Brockman calls them, are becoming official benchmarks, hinting that a more intuitive approach to AI evaluation is taking shape.
Researchers at UC Berkeley have identified and quantified qualitative differences, or “vibes,” in the outputs of large language models (LLMs).
They introduce VibeCheck, a framework that evaluates these vibes on three criteria: consistency, the ability to distinguish models, and alignment with user preferences.
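To make the three criteria concrete, here is a minimal, purely illustrative sketch of how a single vibe such as "formal tone" might be scored for consistency, model separability, and user-preference alignment. The `judge_vibe` heuristic, the toy responses, and the preference labels are hypothetical stand-ins, not the researchers' actual pipeline, which relies on LLM judges rather than keyword checks.

```python
# Toy sketch of scoring one "vibe" (formal tone) along the three axes
# described above. Everything here is a hypothetical placeholder.
import random
from statistics import mean

random.seed(0)

def judge_vibe(response: str) -> int:
    """Stand-in for an LLM judge: keyword heuristic plus a little noise
    to mimic judge disagreement. Returns 1 if the vibe is present."""
    formal = any(w in response.lower() for w in ("regards", "dear", "sincerely"))
    verdict = 1 if formal else 0
    return verdict if random.random() > 0.1 else 1 - verdict

# Toy outputs from two models for the same two prompts,
# plus toy labels for which answer the user preferred (1 = model A).
model_a = ["Kind regards, here is the summary you requested.",
           "Dear user, the report covers three findings."]
model_b = ["yo, here's the gist of it",
           "quick rundown below, lmk if that works"]
user_prefers_a = [1, 1]

def consistency(responses, trials=5):
    """How stable the judge's verdict is across repeated scorings."""
    stability = []
    for r in responses:
        votes = [judge_vibe(r) for _ in range(trials)]
        stability.append(max(votes.count(0), votes.count(1)) / trials)
    return mean(stability)

def separability(resp_a, resp_b):
    """How strongly the vibe distinguishes model A from model B."""
    return abs(mean(judge_vibe(r) for r in resp_a) -
               mean(judge_vibe(r) for r in resp_b))

def preference_alignment(resp_a, resp_b, prefs):
    """Fraction of prompts where the vibe verdict matches the user's pick."""
    hits = 0
    for a, b, p in zip(resp_a, resp_b, prefs):
        vibe_favors_a = judge_vibe(a) >= judge_vibe(b)
        hits += int(vibe_favors_a == bool(p))
    return hits / len(prefs)

print("consistency:", consistency(model_a + model_b))
print("separability:", separability(model_a, model_b))
print("preference alignment:", preference_alignment(model_a, model_b, user_prefers_a))
```

A vibe that scores high on all three axes is both reliably detectable and genuinely useful: it tells two models apart and tracks what users actually prefer.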
“Vibes” are subjective, which makes them hard to automate or standardise and resource-intensive to evaluate, especially in large-scale or real-time applications.
The future of AI evaluation will likely combine data-driven metrics with human intuition, focusing on user experience and developing methods that assess how well AI models align with human expectations and emotions.
Researchers like Dunlap agree that model evaluations will expand beyond numerical scores, such as those on MMLU, to include more subjective traits.
The focus will shift from global, standardised evaluations to user-specific evaluations.
Vibe-based benchmarks are most effective for open-ended tasks, such as asking a chatbot like ChatGPT to write a story or using LLMs for customer service.
Reka AI researchers have introduced Vibe-Eval, an open benchmark of nuanced, hard prompts designed to probe traits such as humour, tone, and conversational depth.
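In practice, benchmarks of this kind pair open-ended prompts with an LLM-as-judge scoring step. The sketch below shows the general shape of such a loop under stated assumptions: the `hard_prompts.json` file, the rubric, and both model functions are hypothetical placeholders, not Reka's released tooling.

```python
# Generic LLM-as-judge evaluation loop over open-ended prompts,
# in the spirit of benchmarks like Vibe-Eval. All names are placeholders.
import json

JUDGE_RUBRIC = (
    "Rate the response to the prompt from 1 (poor) to 5 (excellent) for tone, "
    "humour where appropriate, and conversational depth. Reply with a single digit."
)

def candidate_model(prompt: str) -> str:
    """Placeholder for the model under evaluation (e.g. an API call)."""
    return "Sure, here is a thoughtful, on-topic reply to: " + prompt

def judge_model(prompt: str, response: str) -> int:
    """Placeholder judge. A real setup would send JUDGE_RUBRIC, the prompt,
    and the response to a strong reference model and parse its score."""
    return 4 if len(response.split()) > 5 else 2

def run_eval(prompt_file: str = "hard_prompts.json") -> float:
    """Average judge score over a file of open-ended prompts
    (expected format: a JSON list of strings)."""
    with open(prompt_file) as f:
        prompts = json.load(f)
    scores = [judge_model(p, candidate_model(p)) for p in prompts]
    return sum(scores) / len(scores)

if __name__ == "__main__":
    print(f"mean judge score: {run_eval():.2f}")
```

The value of such a benchmark lies less in the final average than in where a model loses points, which is exactly the kind of qualitative signal vibe-based evaluation is meant to surface.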