SpikeVideoFormer is introduced as an efficient spike-driven video Transformer with linear temporal complexity O(T).
The model features a spike-driven Hamming attention (SDHA), which adapts conventional real-valued dot-product attention to binary spike features by measuring similarity in Hamming space.
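The key observation behind a Hamming-based attention is that, for binary vectors, the Hamming distance can be expressed with the same matrix products as dot-product attention. The following is a minimal NumPy sketch of this idea, not the paper's exact SDHA formulation; the function name and the `(N, d)` shapes are illustrative assumptions.

```python
import numpy as np

def hamming_attention(Q, K, V):
    """Illustrative sketch of attention via Hamming similarity.

    Q, K, V: binary spike tensors of shape (N, d) with entries in {0, 1}.
    For binary vectors q, k, the Hamming distance satisfies
        d_H(q, k) = |q| + |k| - 2 * (q . k),
    so a Hamming *similarity* (d - d_H) is computable with ordinary
    matrix products, keeping the pipeline multiplication-light and
    softmax-free (details of SDHA's normalization are omitted here).
    """
    N, d = Q.shape
    # Pairwise Hamming distances via the dot-product identity above.
    dist = Q.sum(1, keepdims=True) + K.sum(1) - 2 * Q @ K.T   # (N, N)
    sim = d - dist        # agreement count: higher means more similar
    return sim @ V        # integer-valued aggregation of spike values

# Toy usage with random binary spike maps.
rng = np.random.default_rng(0)
Q = (rng.random((4, 8)) > 0.5).astype(int)
K = (rng.random((4, 8)) > 0.5).astype(int)
V = (rng.random((4, 8)) > 0.5).astype(int)
out = hamming_attention(Q, K, V)
```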
Several spike-driven space-time attention designs are analyzed to identify a scheme that is well suited to video tasks while retaining linear temporal complexity.
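Linear temporal complexity in attention typically comes from matrix-product associativity: computing K^T V first costs O(T d^2) rather than the O(T^2 d) of forming the full T-by-T attention map. A hedged sketch of this reordering (generic linear attention, not the paper's specific design; names and shapes are assumptions):

```python
import numpy as np

def linear_spacetime_attention(Q, K, V):
    """Sketch: associativity yields cost linear in sequence length T.

    Q, K, V: (T, d) binary spike tensors. Computing K.T @ V first is a
    single pass over the T timesteps, O(T * d^2); multiplying by Q is
    another O(T * d^2). Contrast with (Q @ K.T) @ V, which materializes
    a (T, T) map at O(T^2 * d) cost.
    """
    kv = K.T @ V   # (d, d) summary accumulated over all timesteps
    return Q @ kv  # (T, d); identical result by associativity

# Toy usage with random binary spike tensors.
rng = np.random.default_rng(1)
T, d = 16, 8
Q = (rng.random((T, d)) > 0.5).astype(int)
K = (rng.random((T, d)) > 0.5).astype(int)
V = (rng.random((T, d)) > 0.5).astype(int)
out = linear_spacetime_attention(Q, K, V)
```

Because the tensors are integer-valued, the reordered product matches the quadratic-cost version exactly, with no floating-point drift.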
SpikeVideoFormer demonstrates strong performance across diverse video tasks, such as classification, human pose tracking, and semantic segmentation, outperforming existing SNN approaches while offering significant efficiency gains.