Modern vision transformers increasingly use noise injection during training to improve object detection performance.
Early detection transformers such as DETR relied on learned decoder queries for object detection, but converged slowly.
More recent architectures add deformable feature aggregation and explicit spatial anchors, improving both convergence and detection results.
These models assign predictions to ground truth via bipartite (Hungarian) matching. Early in training, small changes in the predictions can flip the optimal assignment, so a query's target may jump between objects from one iteration to the next, making the training objective unstable.
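A minimal sketch of this matching step, using only an L1 box cost for illustration (DETR's actual cost also includes classification and GIoU terms):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def hungarian_match(pred_boxes, gt_boxes):
    """pred_boxes: (N, 4), gt_boxes: (M, 4), both in normalized cxcywh.
    Returns index pairs of the minimum-cost one-to-one assignment."""
    # Pairwise L1 distance as the matching cost, shape (N, M).
    cost = np.abs(pred_boxes[:, None, :] - gt_boxes[None, :, :]).sum(-1)
    pred_idx, gt_idx = linear_sum_assignment(cost)
    return pred_idx, gt_idx
```

Because the assignment is recomputed every iteration, a slight perturbation of the predictions can swap which query is matched to which object; this target-flipping is exactly the instability the denoising task sidesteps.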
DN-DETR tackles this instability with an auxiliary denoising task: noised copies of the ground-truth boxes are fed to the decoder as extra queries, and the model learns to reconstruct the originals. Because each noised query's target is known in advance, no matching is needed for these queries, which stabilizes training and speeds up convergence.
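A hedged sketch of the box-noising step; the noise model (center shift plus scale jitter) and the `box_noise_scale` value are assumptions for illustration, not DN-DETR's exact scheme:

```python
import torch

def make_denoising_queries(gt_boxes, box_noise_scale=0.4):
    """gt_boxes: (M, 4) in normalized cxcywh. Returns noised copies whose
    reconstruction target is the original box, so no matching is needed."""
    cx, cy, w, h = gt_boxes.unbind(-1)
    # Shift each center by up to half the box extent, scaled by the noise level.
    dx = (torch.rand_like(cx) * 2 - 1) * box_noise_scale * w / 2
    dy = (torch.rand_like(cy) * 2 - 1) * box_noise_scale * h / 2
    # Jitter width and height multiplicatively around 1.
    sw = 1 + (torch.rand_like(w) * 2 - 1) * box_noise_scale
    sh = 1 + (torch.rand_like(h) * 2 - 1) * box_noise_scale
    noised = torch.stack([cx + dx, cy + dy, w * sw, h * sh], dim=-1)
    return noised.clamp(0, 1)
```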
DINO extends this with contrastive denoising: alongside lightly noised positive queries that must reconstruct their boxes, it adds heavily noised negative queries that must be classified as background, improving detection performance even further.
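Building on the `make_denoising_queries` sketch above, the contrastive variant can be outlined as follows; the two noise scales are illustrative assumptions, not DINO's exact hyperparameters:

```python
def make_cdn_groups(gt_boxes, pos_scale=0.4, neg_scale=1.0):
    """Contrastive denoising in the spirit of DINO: positive queries get
    mild noise and reconstruct their box; negative queries get stronger
    noise and are supervised as 'no object' (background)."""
    positives = make_denoising_queries(gt_boxes, box_noise_scale=pos_scale)
    negatives = make_denoising_queries(gt_boxes, box_noise_scale=neg_scale)
    # Training labels: positives regress to their GT box; negatives are
    # pushed toward the background class, teaching the model to reject
    # near-miss anchors.
    return positives, negatives
```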
Temporal models such as Sparse4Dv3 carry the idea further, propagating denoising groups across frames (temporal denoising) to stabilize training for detection and tracking over time.
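A rough sketch of what propagating a denoising group across frames could look like; the `keep_ratio`, the box layout (first three values as a 3D center), and the rigid ego-motion warp are all assumptions, not Sparse4Dv3's exact mechanism:

```python
import torch

def propagate_temporal_dn(prev_noised_boxes, ego_transform, keep_ratio=0.5):
    """prev_noised_boxes: (M, D) with the first 3 dims as xyz centers.
    ego_transform: (4, 4) pose change from the previous to the current frame.
    Keeps a subset of last frame's denoising queries and warps them forward,
    so the model also learns to refine temporally propagated anchors."""
    k = int(len(prev_noised_boxes) * keep_ratio)
    kept = prev_noised_boxes[:k]
    # Warp the 3D centers into the current ego frame via homogeneous coords.
    centers = kept[:, :3]
    centers_h = torch.cat([centers, torch.ones_like(centers[:, :1])], dim=-1)
    warped = (centers_h @ ego_transform.T)[:, :3]
    # Keep the remaining box parameters (size, yaw, ...) unchanged.
    return torch.cat([warped, kept[:, 3:]], dim=-1)
```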
In practice, denoising accelerates convergence and improves detection quality, with the gains reported mainly in models that use learnable anchors.
This raises two open questions: does denoising reduce the need for learnable anchors at all, and how much does it help in models whose anchors are fixed rather than learned?
While denoising clearly stabilizes optimization, its relevance for models with spatially constrained, non-learnable queries remains a topic for further exploration.