Audio-Visual Target Speaker Extraction (AV-TSE) aims to enhance auditory perception using visual cues. A model-agnostic strategy called Mask-And-Recover (MAR) is proposed to improve extraction quality by integrating contextual correlations. A Fine-grained Confidence Score (FCS) model is introduced to assess extraction quality and to guide improvement on low-quality segments. The resulting model-agnostic training paradigm yields consistent performance improvements across various metrics on the VoxCeleb2 dataset.
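
As a rough illustration of how per-segment confidence scores could direct attention to low-quality segments, the sketch below selects segments whose score falls under a threshold and masks their frames. It is a toy under stated assumptions, not the authors' implementation: the function names, the fixed segment length, the threshold, and the zero-fill masking are all hypothetical, and the recovery step (a model reconstructing the masked frames from surrounding context) is omitted.

```python
from typing import List


def select_low_confidence_segments(scores: List[float], threshold: float) -> List[int]:
    """Return indices of segments whose confidence is below the threshold."""
    return [i for i, s in enumerate(scores) if s < threshold]


def mask_segments(frames: List[float], segment_len: int, segment_ids: List[int]) -> List[float]:
    """Zero out the frames of the chosen segments (hypothetical masking step)."""
    masked = list(frames)  # copy so the input is left untouched
    for seg in segment_ids:
        start = seg * segment_len
        end = min(start + segment_len, len(masked))  # guard a short final segment
        for t in range(start, end):
            masked[t] = 0.0
    return masked


# Toy example: 8 frames split into 4 segments of 2 frames each.
frames = [0.5, 0.4, 0.9, 0.8, 0.1, 0.2, 0.7, 0.6]
scores = [0.9, 0.8, 0.3, 0.95]  # hypothetical per-segment confidence values
low = select_low_confidence_segments(scores, threshold=0.5)
masked = mask_segments(frames, segment_len=2, segment_ids=low)
print(low)
print(masked)
```

In a full system, the masked input would be passed back through the extraction model so that the flagged segments are re-estimated from their context; how scores are computed and how recovery is trained are left open here.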