Debiasing approaches often degrade model capabilities such as accuracy and knowledge retention.
Existing debiasing methods face trade-offs that result in reduced truthfulness, knowledge loss, or unintelligible outputs, especially in smaller models.
To address these limitations, a contrastive learning framework is proposed that learns from both positive and negative examples and introduces contrast computation and dynamic loss scaling.
Experimental results show that this approach simultaneously improves toxicity reduction and faithfulness preservation, without the capability degradation seen in existing methods.
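
The summary above does not give the exact formulation of the contrast computation or of the dynamic loss scaling, so the following is only a minimal sketch under stated assumptions. It substitutes a standard InfoNCE-style contrastive loss over cosine similarities for the contrast computation. The scaling rule in `dynamic_scale` is purely hypothetical: it caps the contrastive term so that it never exceeds the task loss in magnitude. All function names, the temperature value, and the toy embeddings are illustrative and not taken from the source.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(anchor, positive, negatives, temperature=0.1):
    """InfoNCE-style loss (an assumed stand-in for the contrast computation).

    Low when the anchor is close to the positive example (e.g. a non-toxic,
    faithful output) and far from the negative examples (e.g. toxic outputs).
    """
    pos_logit = cosine(anchor, positive) / temperature
    logits = [pos_logit] + [cosine(anchor, n) / temperature for n in negatives]
    # Numerically stable log-sum-exp over all candidates.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - pos_logit


def dynamic_scale(task_loss, contrast_loss, max_scale=1.0, eps=1e-8):
    """Hypothetical scaling rule, not from the source.

    Shrinks the weight of the contrastive term when it is large relative to
    the task loss, so the weighted term never exceeds the task loss.
    """
    return min(max_scale, task_loss / (contrast_loss + eps))


def total_loss(task_loss, anchor, positive, negatives):
    """Task loss plus the dynamically scaled contrastive term."""
    c = contrastive_loss(anchor, positive, negatives)
    return task_loss + dynamic_scale(task_loss, c) * c


if __name__ == "__main__":
    anchor = [1.0, 0.0]
    good = [1.0, 0.0]   # toy embedding of a desirable output
    bad = [0.0, 1.0]    # toy embedding of an undesirable output
    print(contrastive_loss(anchor, good, [bad]))   # anchor aligned with positive
    print(contrastive_loss(anchor, bad, [good]))   # anchor aligned with negative
    print(total_loss(2.0, anchor, bad, [good]))
```

In an actual training loop the scale would presumably be treated as a constant with respect to the gradient, so that it only reweights the contrastive term rather than changing its direction; that choice is likewise an assumption and not something the summary states.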