Enterprises increasingly need to evaluate AI agents on complex, domain-specific workflows, including workflows conducted over voice interfaces. Salesforce AI Research & Engineering designed a benchmark to assess AI agents in both text and voice environments on enterprise tasks. The benchmark spans four domains: healthcare appointment management, financial transactions, inbound sales, and e-commerce order processing. It emphasizes tool integration, protocol adherence, domain expertise, and voice robustness for comprehensive evaluation.

The benchmark architecture comprises environments, tasks, participants, and metrics, supporting reproducible evaluation across these different enterprise operations. Tasks range from simple requests to multi-step processes, and all are human-verified for realism and difficulty. Agents are scored on accuracy and efficiency in both text and voice modalities, with noise injection used to test robustness.

The implementation is written in Python and features modular environment and task definitions, simulated client-agent interactions, multi-provider model support, and voice processing. Experimental results highlight the difficulty of financial-transaction tasks, a gap between voice and text accuracy, and degraded performance on complex, multi-step tasks. The sketches below illustrate, under stated assumptions, how such a modular environment, a scoring function, and a noise-injection step might look in Python.
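First, a minimal sketch of modular environment and task definitions with tool integration. The class and tool names here (`Environment`, `Tool`, `Task`, `book_appointment`) are hypothetical illustrations, not the benchmark's actual API.

```python
from dataclasses import dataclass, field
from typing import Any, Callable

# Hypothetical modular definitions: an environment bundles domain tools,
# and a task pairs a simulated client's goal with a human-verified
# reference trajectory used later for scoring.
@dataclass
class Tool:
    name: str
    description: str
    fn: Callable[..., Any]

@dataclass
class Task:
    task_id: str
    domain: str
    instruction: str                 # the simulated client's goal
    expected_tool_calls: list[str]   # human-verified reference trajectory

@dataclass
class Environment:
    domain: str
    tools: dict[str, Tool] = field(default_factory=dict)

    def register(self, tool: Tool) -> None:
        self.tools[tool.name] = tool

    def call(self, name: str, **kwargs: Any) -> Any:
        return self.tools[name].fn(**kwargs)

# Example: a toy appointment-management environment.
env = Environment(domain="healthcare_appointments")
env.register(Tool(
    name="book_appointment",
    description="Book a slot for a patient with a given provider.",
    fn=lambda patient, provider, slot: {"status": "booked", "slot": slot},
))

task = Task(
    task_id="appt-001",
    domain="healthcare_appointments",
    instruction="Book the earliest available slot with Dr. Lee.",
    expected_tool_calls=["book_appointment"],
)

print(env.call("book_appointment", patient="P123", provider="Dr. Lee", slot="09:00"))
```

Keeping tools and tasks as plain data makes it straightforward to swap in new domains without touching the evaluation loop.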
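Next, a scoring sketch. The benchmark reports accuracy and efficiency; the exact definitions are not given in this summary, so this assumes accuracy means matching the human-verified tool-call trajectory and efficiency penalizes extra conversational turns.

```python
# A minimal scoring sketch under assumed metric definitions:
# exact-match trajectory accuracy, turn-count-based efficiency.
def score(expected_calls: list[str], actual_calls: list[str],
          turns: int, max_turns: int = 20) -> dict[str, float]:
    accuracy = float(expected_calls == actual_calls)  # strict exact match
    efficiency = max(0.0, 1.0 - turns / max_turns)    # fewer turns scores higher
    return {"accuracy": accuracy, "efficiency": efficiency}

print(score(["book_appointment"], ["book_appointment"], turns=4))
```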
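Finally, a sketch of noise injection for voice-robustness testing. This uses generic white Gaussian noise at a target signal-to-noise ratio; the benchmark's actual noise conditions (e.g. background chatter or codec artifacts) may differ.

```python
import numpy as np

def inject_noise(audio: np.ndarray, snr_db: float,
                 rng: np.random.Generator | None = None) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (dB).

    A generic robustness perturbation, assumed here for illustration.
    """
    rng = rng or np.random.default_rng()
    signal_power = np.mean(audio ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=audio.shape)
    return audio + noise

# Example: perturb one second of a synthetic 440 Hz tone at 10 dB SNR.
sr = 16_000
t = np.linspace(0, 1, sr, endpoint=False)
clean = 0.5 * np.sin(2 * np.pi * 440 * t)
noisy = inject_noise(clean, snr_db=10.0)
print(f"clean RMS={np.sqrt(np.mean(clean**2)):.3f}, "
      f"noisy RMS={np.sqrt(np.mean(noisy**2)):.3f}")
```

Running the same task set over clean and perturbed audio makes the voice-versus-text accuracy gap directly measurable.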