Evaluating how well LLMs handle long contexts is essential, especially for retrieving specific, relevant information embedded in lengthy inputs.
The Needle-in-a-Haystack (NIAH) task challenges models to retrieve critical information embedded in predominantly irrelevant content, but existing NIAH benchmarks lack tasks that require retrieving and correctly ordering sequential information.
The Sequential-NIAH benchmark is designed to assess how well LLMs retrieve sequential information, referred to as a needle, from long texts while preserving its order.
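To make the task concrete, here is a minimal Python sketch of how such a test sample might be constructed and scored. The function names, the insertion scheme, and the strict order-match metric are illustrative assumptions, not the benchmark's actual pipeline.

```python
import random

def build_sequential_niah_sample(haystack_paragraphs, needle_steps, question):
    """Insert ordered needle pieces into filler text at increasing positions.

    `needle_steps` is a list of strings whose original order must be
    preserved in the model's answer (e.g. the steps of a process).
    This construction is an assumption about how such samples could be built.
    """
    text = list(haystack_paragraphs)
    # Pick insertion points and sort them so the needles keep their order.
    positions = sorted(random.sample(range(len(text) + 1), len(needle_steps)))
    for offset, (pos, step) in enumerate(zip(positions, needle_steps)):
        # Each earlier insert shifts later indices by one, hence the offset.
        text.insert(pos + offset, step)
    return {
        "context": "\n\n".join(text),
        "question": question,
        "expected_order": needle_steps,
    }

def order_accuracy(model_answer, expected_order):
    """Score 1.0 only if every needle step appears in the correct order.

    A deliberately strict exact-match metric for illustration; the actual
    benchmark may evaluate answers differently.
    """
    cursor = 0
    for step in expected_order:
        idx = model_answer.find(step, cursor)
        if idx == -1:
            return 0.0
        cursor = idx + len(step)
    return 1.0
```

Under this sketch, a model is given `context` and `question`, and its answer is checked for both the presence and the relative ordering of the needle steps, which is what distinguishes the sequential variant from standard single-needle retrieval.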
Tests on popular LLMs showed a best accuracy of just 63.15%, highlighting the difficulty of the task and the need for further advancement.