Action segmentation is a core challenge in high-level video understanding, aiming to partition an untrimmed video into segments and assign each segment a predefined action label.
Existing methods mainly address single-person activities, leaving multi-person scenarios largely unexplored.
A new dataset, RHAS133, is introduced for Referring Human Action Segmentation in multi-person settings, comprising 133 movies annotated with 137 actions and with textual descriptions referring to the target individuals.
Benchmarking existing methods on the RHAS133 dataset reveals limited performance, particularly in aggregating visual cues for the target individuals.
To improve action segmentation in multi-person scenarios, a new framework called HopaDIFF is proposed.
HopaDIFF leverages a holistic-partial aware Fourier-conditioned diffusion approach and a novel cross-input gate attentional xLSTM for enhanced long-range reasoning.
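As a rough illustration of the cross-input gating idea, the sketch below fuses a holistic (whole-scene) stream with a partial (target-person) stream using a learned per-frame gate. A standard bidirectional nn.LSTM stands in for the paper's xLSTM, and all module names and dimensions are illustrative assumptions rather than the released HopaDIFF implementation.

```python
# Hypothetical sketch: cross-input gated fusion of holistic and partial streams.
# nn.LSTM is a stand-in for the xLSTM used in the paper; layout is assumed.
import torch
import torch.nn as nn

class CrossInputGateFusion(nn.Module):
    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.holistic_rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.partial_rnn = nn.LSTM(dim, hidden, batch_first=True, bidirectional=True)
        self.gate = nn.Sequential(nn.Linear(4 * hidden, 2 * hidden), nn.Sigmoid())

    def forward(self, holistic: torch.Tensor, partial: torch.Tensor) -> torch.Tensor:
        # holistic, partial: (batch, time, dim) frame-level features
        h, _ = self.holistic_rnn(holistic)        # (batch, time, 2*hidden)
        p, _ = self.partial_rnn(partial)          # (batch, time, 2*hidden)
        g = self.gate(torch.cat([h, p], dim=-1))  # per-frame gate in (0, 1)
        return g * h + (1.0 - g) * p              # gated mix of the two streams

fused = CrossInputGateFusion(dim=256, hidden=128)(
    torch.randn(2, 300, 256), torch.randn(2, 300, 256))
```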
The framework further introduces a Fourier-domain condition to gain finer control over the diffusion process and improve the generated action segmentation.
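The sketch below shows, purely as an assumption-laden example, one way such a Fourier condition could be formed: the temporal frequency spectrum of frame features is projected into a conditioning vector that a diffusion denoiser could consume alongside its timestep embedding. Names such as FourierCondition, feat_dim, and cond_dim are hypothetical and not taken from the released code.

```python
# Hypothetical sketch of Fourier-domain conditioning for a diffusion-based
# segmentation head; not the released HopaDIFF API.
import torch
import torch.nn as nn

class FourierCondition(nn.Module):
    """Projects the temporal frequency spectrum of frame features into a
    conditioning vector for the denoising network."""
    def __init__(self, feat_dim: int, cond_dim: int, num_freqs: int = 64):
        super().__init__()
        self.num_freqs = num_freqs
        self.proj = nn.Linear(feat_dim * num_freqs, cond_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim) frame-level features
        spec = torch.fft.rfft(feats, dim=1)           # complex spectrum over time
        mag = spec.abs()[:, : self.num_freqs, :]      # low-frequency magnitudes
        if mag.shape[1] < self.num_freqs:             # pad short clips
            pad = self.num_freqs - mag.shape[1]
            mag = nn.functional.pad(mag, (0, 0, 0, pad))
        return self.proj(mag.flatten(1))              # (batch, cond_dim)

# Usage: the condition vector would be injected into the denoiser, e.g. added
# to its timestep embedding at every denoising step.
cond = FourierCondition(feat_dim=256, cond_dim=512)(torch.randn(2, 300, 256))
```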
HopaDIFF achieves state-of-the-art results on the RHAS133 dataset across various evaluation scenarios.
The code for HopaDIFF is available at https://github.com/KPeng9510/HopaDIFF.git.