Large Language Models (LLMs) are used for behavior planning based on natural language instructions, but they struggle when those instructions are ambiguous in real-world scenarios.
Various methods have been proposed for detecting task ambiguity, but the lack of a universal benchmark makes comparison between them difficult.
To address this, the AmbiK dataset, which focuses on ambiguous tasks in a kitchen environment, has been introduced.
The dataset includes 1000 pairs of ambiguous tasks and their unambiguous counterparts, categorized by ambiguity type, and was created with the help of LLMs and validated by humans.
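To make the paired structure concrete, the following is a minimal sketch of how one such ambiguous/unambiguous task pair could be represented in code; the field names, example texts, and category label are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical representation of a single AmbiK-style entry.
# Field names and values below are assumptions, not the real dataset schema.
@dataclass
class TaskPair:
    ambiguous_task: str    # instruction as a user might naturally phrase it
    unambiguous_task: str  # disambiguated version of the same instruction
    ambiguity_type: str    # category label for the kind of ambiguity

# Illustrative example (invented for this sketch):
example = TaskPair(
    ambiguous_task="Put the cup on the shelf.",           # which cup? which shelf?
    unambiguous_task="Put the red cup on the top shelf.",
    ambiguity_type="underspecified reference",
)

print(example.ambiguous_task, "->", example.unambiguous_task)
```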