A new study evaluating AI agents revealed that even advanced systems struggle with basic tasks like closing a pop-up window or waiting 10 minutes before escalating an issue.
Researchers from Carnegie Mellon University simulated a digital company staffed by AI agents built on popular models and found they're not ready to replace human jobs.
The top performer, Claude 3.5 Sonnet, completed only 24% of tasks, highlighting how far AI remains from being ready to take over human work.
The benchmark, called The Agent Company, used complex, realistic tasks and showed agents failing due to a lack of common sense, weak social skills, and poor web-browsing ability.
The study found AI agents more competent in technical tasks like software engineering than in administrative ones, challenging the assumption that simpler jobs are easier to automate.
Researchers emphasized that current AI agents are not ready for business-critical tasks, citing hallucinations, a lack of common sense, and the risk of causing harm if given too much autonomy.
Despite current limitations, the study suggests that AI agents could evolve to complete over 90% of tasks and become significantly more useful in the future.
While AI models perform well in controlled benchmarks, real-world conditions expose their shortcomings in tasks beyond code generation.
The research team's experiment showed that AI agents often fail at real-world tasks that fall outside what conventional benchmarks measure.
The study's findings suggest that AI agents require significant advancement before becoming capable of replacing a substantial portion of human work tasks.