Advances in self-distillation have shown that when knowledge is distilled from a teacher to a student sharing the same deep learning (DL) architecture, the student's performance can surpass that of the teacher, particularly when the network is overparameterized and the teacher is trained with early stopping.
This paper proposes to train only one model and generate multiple diverse teacher representations using distillation-time dropout.
To mitigate the noise in these representations, a novel stochastic self-distillation (SSD) training strategy is introduced, which uses student-guided knowledge distillation (SGKD) to filter and weight the teacher representations.
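As a rough illustration of this idea, the sketch below shows one plausible PyTorch realization of distillation-time dropout with student-guided weighting: several stochastic forward passes of the same network act as teachers, and each teacher is weighted by its agreement with the student's prediction. The network `SmallNet`, the agreement-based weighting rule, and all hyperparameters (`n_teachers`, `tau`, `alpha`) are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class SmallNet(nn.Module):
    """Toy classifier with dropout; its stochastic passes serve as teachers."""

    def __init__(self, in_dim=32, num_classes=10, p_drop=0.5):
        super().__init__()
        self.fc1 = nn.Linear(in_dim, 128)
        self.drop = nn.Dropout(p_drop)
        self.fc2 = nn.Linear(128, num_classes)

    def forward(self, x):
        return self.fc2(self.drop(F.relu(self.fc1(x))))


def ssd_step(model, x, y, optimizer, n_teachers=4, tau=2.0, alpha=0.5):
    """One hypothetical SSD-style training step (assumed form, for illustration)."""
    model.train()  # dropout stays active at distillation time
    student_logits = model(x)  # student view, keeps gradients

    # Distillation-time dropout: multiple stochastic teacher passes, no grad.
    with torch.no_grad():
        teacher_logits = torch.stack([model(x) for _ in range(n_teachers)])

        # Student-guided weighting (assumption): weight each teacher pass by
        # its agreement with the student distribution via negative KL divergence.
        s_logprob = F.log_softmax(student_logits / tau, dim=-1)
        s_prob = s_logprob.exp()
        t_logprob = F.log_softmax(teacher_logits / tau, dim=-1)
        kl_per_teacher = (s_prob * (s_logprob - t_logprob)).sum(-1).mean(-1)
        weights = F.softmax(-kl_per_teacher, dim=0)            # (n_teachers,)
        teacher_prob = (weights.view(-1, 1, 1) * t_logprob.exp()).sum(0)

    # Combine the hard-label loss with distillation toward the weighted teacher.
    ce = F.cross_entropy(student_logits, y)
    kd = F.kl_div(F.log_softmax(student_logits / tau, dim=-1),
                  teacher_prob, reduction="batchmean") * tau * tau
    loss = alpha * ce + (1 - alpha) * kd

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```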
Experimental results show that the proposed SSD method outperforms state-of-the-art methods on various datasets without increasing the model size and with negligible additional computational cost.