Accurate sleep stage classification is significant for sleep health assessment.
A new cross-modal transformer-based method for sleep stage classification is proposed.
The method outperforms the state-of-the-art methods and eliminates the black-box behavior of deep-learning models.
Considerable reductions in the number of parameters and training time are achieved compared to the state-of-the-art methods.