Recent advances in vision-language models (VLMs) offer the potential to automate design assessments, but it is crucial to ensure that these AI ``judges'' perform on par with human experts.
This work introduces a statistical framework for determining whether an AI judge's ratings are equivalent to those of human experts in design evaluation.
The top-performing AI judge, which uses text- and image-based in-context learning, achieves expert-level agreement on uniqueness and drawing quality and matches or outperforms trained novices on all metrics.
Reasoning-supported VLMs can thus achieve human-expert equivalence in design evaluation, with implications for design education and practice.