A study explores Large Language Models (LLMs) as autonomous agents for real-world tasks, including freelance software development.
A new benchmark evaluates LLMs on freelance programming and data analysis tasks derived from economic data, with task prices standardized to USD.
Four modern LLMs (Claude 3.5 Haiku, GPT-4o-mini, Qwen 2.5, and Mistral) were evaluated on accuracy and on the total 'freelance earnings' each achieved.
Results show Claude 3.5 Haiku performs best, earning $1.52 million, followed by GPT-4o-mini, Qwen 2.5, and Mistral. The study also reports on error distribution and task complexity.
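
To make the two reported metrics concrete, here is a minimal Python sketch of how accuracy and total 'freelance earnings' could be computed from per-task results. It is an illustration under stated assumptions, not the study's actual scoring code: the `TaskResult` fields, the `score` function, and the rule that a task pays its full price only when fully solved are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class TaskResult:
    """One benchmark task outcome (hypothetical structure, not from the study)."""
    task_id: str
    price_usd: float  # assumed: payout attached to the task, already in USD
    passed: bool      # assumed: True only if the model's solution passed all checks


def score(results: list[TaskResult]) -> tuple[float, float]:
    """Return (accuracy, total earnings in USD) for one model.

    Assumption: a task pays its full price if solved and nothing otherwise.
    """
    if not results:
        return 0.0, 0.0
    solved = [r for r in results if r.passed]
    accuracy = len(solved) / len(results)
    earnings = sum(r.price_usd for r in solved)
    return accuracy, earnings


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    demo = [
        TaskResult("t1", 100.0, True),
        TaskResult("t2", 50.0, False),
        TaskResult("t3", 250.0, True),
    ]
    accuracy, earnings = score(demo)
    print(f"accuracy={accuracy:.2f}, earnings=${earnings:,.2f}")
```

Under this assumed scoring rule, two models with the same accuracy can differ in earnings, because solving higher-priced tasks contributes more to the total.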