Large language models (LLMs) have substantially improved conversational AI assistants, but evaluating personalization in these assistants remains challenging.
Existing personalization benchmarks do not capture the complexities of personalized task-oriented assistance.
To address this gap, we introduce PersonaLens, a benchmark for evaluating personalization in task-oriented AI assistants.
PersonaLens includes diverse user profiles with rich preferences and interaction histories, along with specialized LLM-based user and judge agents.
The user agent engages in realistic task-oriented dialogues with AI assistants, while the judge agent assesses personalization, response quality, and task success.
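To make this evaluation flow concrete, the sketch below outlines one plausible reading of the setup in Python: a user agent conditioned on a profile converses with the assistant under test, and a judge agent scores the resulting dialogue along the three dimensions above. The `call_llm` helper, the data classes, and the function names are illustrative assumptions for this sketch, not the benchmark's actual interface.

```python
# Minimal sketch of a PersonaLens-style evaluation loop, assuming a generic
# `call_llm(prompt) -> str` completion function. All names here (UserProfile,
# simulate_dialogue, judge_dialogue) are hypothetical, not the benchmark's API.
from dataclasses import dataclass


def call_llm(prompt: str) -> str:
    """Placeholder for an LLM completion call (hypothetical)."""
    return "..."


@dataclass
class UserProfile:
    preferences: dict          # e.g. {"cuisine": "vegan", "budget": "low"}
    interaction_history: list  # summaries of past task-oriented sessions


@dataclass
class DialogueTurn:
    speaker: str  # "user" or "assistant"
    text: str


def simulate_dialogue(profile: UserProfile, task: str, assistant, max_turns: int = 6):
    """LLM-based user agent converses with the assistant under test."""
    turns: list[DialogueTurn] = []
    for _ in range(max_turns):
        # The user agent speaks in character, conditioned on profile, task, and context.
        user_msg = call_llm(
            f"You are a user with preferences {profile.preferences} and past "
            f"interactions {profile.interaction_history}. Pursue the task: {task}. "
            f"Dialogue so far: {[t.text for t in turns]}"
        )
        turns.append(DialogueTurn("user", user_msg))
        turns.append(DialogueTurn("assistant", assistant(user_msg)))
    return turns


def judge_dialogue(profile: UserProfile, task: str, turns) -> dict:
    """LLM-based judge scores personalization, response quality, and task success."""
    transcript = "\n".join(f"{t.speaker}: {t.text}" for t in turns)
    scores = {}
    for dimension in ("personalization", "response_quality", "task_success"):
        scores[dimension] = call_llm(
            f"Given the user profile {profile.preferences}, the task '{task}', and the "
            f"dialogue:\n{transcript}\nRate {dimension} from 1 to 5. Reply with a number."
        )
    return scores
```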
Extensive experiments with current LLM-based assistants across diverse tasks reveal significant variability in their personalization capabilities.
PersonaLens thus provides crucial insights for advancing personalized conversational AI systems.