Sequential multiple-instance learning involves learning representations of sets distributed across discrete timesteps.
Existing methods either learn static set representations, ignoring temporal dynamics, or treat sequences as ordered lists of individual elements and therefore lack explicit mechanisms for representing sets.
The Set2Seq Transformer is a novel architecture that jointly models permutation-invariant set structure and temporal dependencies by learning temporally and positionally aware representations of sets within a sequence in an end-to-end, multimodal manner.
The Set2Seq Transformer significantly improves over traditional static multiple-instance learning methods by learning set representations that are permutation-invariant within each timestep and temporally and positionally aware across the sequence.
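
To make the two properties named above concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes mean pooling as the permutation-invariant set encoder, a standard sinusoidal encoding of the timestep index, and a single self-attention layer with identity projections (no learned weights) as the sequence model. All function names (`pool_set`, `positional_encoding`, `self_attention`, `encode_sequence_of_sets`) are hypothetical.

```python
import math

def pool_set(items):
    """Permutation-invariant set encoder (assumed): element-wise mean over one set."""
    dim = len(items[0])
    return [sum(item[d] for item in items) / len(items) for d in range(dim)]

def positional_encoding(position, dim):
    """Sinusoidal encoding of a timestep index (assumed choice of positional signal)."""
    enc = []
    for d in range(dim):
        angle = position / (10000 ** (2 * (d // 2) / dim))
        enc.append(math.sin(angle) if d % 2 == 0 else math.cos(angle))
    return enc

def self_attention(seq):
    """Single-head scaled dot-product self-attention with identity projections."""
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out.append([sum(w / total * v[d] for w, v in zip(weights, seq))
                    for d in range(dim)])
    return out

def encode_sequence_of_sets(sets):
    """Pool each set, add its timestep encoding, then mix information across timesteps."""
    dim = len(sets[0][0])
    pooled = [pool_set(s) for s in sets]
    positioned = [[x + p for x, p in zip(vec, positional_encoding(t, dim))]
                  for t, vec in enumerate(pooled)]
    return self_attention(positioned)

# Three timesteps, each holding a set of 4-dimensional items; set sizes may differ.
sets = [
    [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]],
    [[0.0, 0.0, 1.0, 0.0]],
    [[1.0, 1.0, 1.0, 1.0], [2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]],
]
encoded = encode_sequence_of_sets(sets)  # one vector per timestep
```

In this sketch, reordering the items inside any set leaves the output unchanged up to floating-point rounding, while reordering the timesteps changes it, which is the distinction between set structure and temporal order that the abstract draws.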