Ultrasound videos are an important form of clinical imaging data for diagnostic analysis. E-ViM$^3$ is a data-efficient Vision Mamba network that enhances space-time correlations. Enclosure Global Tokens (EGT) capture and aggregate global features effectively. With limited labels, E-ViM$^3$ achieves competitive performance in semantic analysis tasks.