Recent multi-teacher distillation methods have unified the encoders of multiple foundation models into a single encoder, achieving competitive performance on core vision tasks.
The paper introduces the concept of heterogeneous teacher distillation, where teacher models differ significantly in both their design objectives and their training data.
The researchers propose DUNE, a single encoder that excels in 2D vision, 3D understanding, and 3D human perception, with performance comparable to, or even surpassing, that of its larger teachers on their respective tasks.
On Map-free Visual Relocalization, DUNE outperforms MASt3R while using a smaller encoder.
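
To make the idea of distilling several teachers into one encoder concrete, the following is a minimal sketch of a generic multi-teacher feature-distillation loss. It is an illustration under stated assumptions, not DUNE's actual training objective: the teacher names, the feature dimensions, the per-teacher linear projection heads, and the choice of cosine distance are all hypothetical. The only point it shows is that one shared student feature can be mapped through a separate head per teacher, so teachers with different output spaces can each supervise the same encoder.

```python
import math
import random


def linear(weights, x):
    """Apply a projection matrix (given as a list of rows) to a feature vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def cosine_distance(a, b):
    """Return 1 - cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + 1e-12)


def multi_teacher_loss(student_feat, teacher_feats, heads):
    """Average the per-teacher distances between projected student and teacher features.

    student_feat:  feature vector from the shared student encoder
    teacher_feats: dict mapping teacher name -> that teacher's feature vector
    heads:         dict mapping teacher name -> projection matrix into that teacher's space
    """
    per_teacher = {}
    for name, target in teacher_feats.items():
        projected = linear(heads[name], student_feat)
        per_teacher[name] = cosine_distance(projected, target)
    total = sum(per_teacher.values()) / len(per_teacher)
    return total, per_teacher


# Toy usage with hypothetical teachers whose feature sizes differ (assumption).
rng = random.Random(0)
student_dim = 4
teacher_dims = {"teacher_2d": 3, "teacher_3d": 5, "teacher_human": 2}

student_feat = [rng.uniform(-1, 1) for _ in range(student_dim)]
teacher_feats = {
    name: [rng.uniform(-1, 1) for _ in range(dim)]
    for name, dim in teacher_dims.items()
}
heads = {
    name: [[rng.uniform(-1, 1) for _ in range(student_dim)] for _ in range(dim)]
    for name, dim in teacher_dims.items()
}

total, per_teacher = multi_teacher_loss(student_feat, teacher_feats, heads)
print("total loss:", total)
for name, value in per_teacher.items():
    print(name, value)
```

In a real training loop the student features and projection heads would be learned by gradient descent, and the per-teacher terms might be weighted differently; the uniform average here is only the simplest choice.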