<ul data-eligibleForWebStory="true">
<li>The research paper introduces ImageChain, which enhances multimodal large language models with sequential reasoning capabilities over image data.</li>
<li>ImageChain models visual sequences as a multi-turn conversation, interleaving images with their corresponding textual descriptions.</li>
<li>The framework explicitly captures temporal dependencies and narrative progression across an image sequence.</li>
<li>It optimizes for the task of next-scene description, in which the model generates a context-aware description of the next scene from the preceding visual and textual cues.</li>
<li>The approach improves next-scene description performance, raising the SimRate metric from 3.7% to 19% on average.</li>
<li>ImageChain demonstrates robust zero-shot out-of-domain performance in applications such as comics and robotics.</li>
<li>Extensive experiments validate the importance of instruction tuning in a multimodal, multi-turn conversation design for enhanced reasoning.</li>
</ul>
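The interleaving described above can be pictured as a chat-style message list: each prior scene contributes a user turn (the image) and an assistant turn (its description), and the final user turn asks for the next-scene description. The sketch below is illustrative only; the function name and message schema are assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of a multi-turn interleaved prompt for
# next-scene description. The message schema is an assumption
# modeled on common multimodal chat formats, not ImageChain's code.

def build_next_scene_prompt(scenes):
    """Arrange (image, description) pairs as a multi-turn conversation.

    Every scene except the last becomes a user turn (the image)
    followed by an assistant turn (its description); the final user
    turn supplies the newest image and requests its description.
    """
    messages = []
    for image, description in scenes[:-1]:
        messages.append({"role": "user",
                         "content": [{"type": "image", "image": image}]})
        messages.append({"role": "assistant", "content": description})
    last_image, _ = scenes[-1]
    messages.append({
        "role": "user",
        "content": [
            {"type": "image", "image": last_image},
            {"type": "text",
             "text": "Describe this scene, continuing the narrative."},
        ],
    })
    return messages

# Example: three scenes; the third description is what the model generates.
scenes = [
    ("frame_001.png", "A robot arm hovers over a red block."),
    ("frame_002.png", "The gripper closes around the block."),
    ("frame_003.png", None),
]
msgs = build_next_scene_prompt(scenes)
print(len(msgs))  # 5 turns: 2 x (user image + assistant text) + final query
```

This layout is what lets the model condition each new description on both the earlier images and the earlier text, which is the temporal dependency the framework is designed to capture.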