<ul><li>A study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development.</li><li>A new benchmark evaluates LLMs on freelance programming and data analysis tasks derived from economic data, with task values standardized to USD.</li><li>Four modern LLMs (Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral) were evaluated on accuracy and on the total "freelance earnings" they achieved.</li><li>Claude 3.5 Haiku performed best, earning $1.52 million, followed by GPT-4o-mini, Qwen 2.5, and Mistral; the study also reports on error distribution and task complexity.</li></ul>

Can AI Freelancers Compete? Benchmarking Earnings, Reliability, and Task Success at Scale
