Recent copyright agreements highlight the need for controlling language models' reproduction of copyrighted text.
Existing methods either sacrifice model utility or fail to prevent verbatim leakage reliably.
A new method, Obliviate, selectively suppresses exact reproduction of specified sequences while preserving the model's semantic understanding.
Obliviate first identifies memorized passages, then adjusts the model's output distribution with a Kullback-Leibler divergence penalty that lowers the probability of reproducing them verbatim.
A consistency loss enforced on non-target tokens preserves fluency and task performance.
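The abstract does not give the exact objective, but the two-term structure it describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name, the per-position masking scheme, the direction of the KL term, and the weights `alpha` and `beta` are all assumptions.

```python
import numpy as np

def log_softmax(x):
    # Numerically stable log-softmax over the last axis.
    x = x - x.max(axis=-1, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=-1, keepdims=True))

def suppression_loss(logits, ref_logits, next_tokens, target_mask,
                     alpha=1.0, beta=1.0):
    """Hypothetical two-term objective (sketch, not the paper's loss).

    logits:      (T, V) fine-tuned model logits at each position
    ref_logits:  (T, V) frozen reference (original) model logits
    next_tokens: (T,)   index of the ground-truth next token
    target_mask: (T,)   1.0 where the next token lies in a memorized passage
    """
    logp = log_softmax(logits)          # (T, V)
    ref_logp = log_softmax(ref_logits)  # (T, V)

    # Term 1 (suppression): minimizing this term pushes down the
    # log-probability of the memorized next tokens.
    tok_logp = logp[np.arange(len(next_tokens)), next_tokens]
    n_target = max(target_mask.sum(), 1.0)
    suppress = (tok_logp * target_mask).sum() / n_target

    # Term 2 (consistency): KL(ref || model) on non-target positions
    # keeps the rest of the distribution close to the original model.
    kl = (np.exp(ref_logp) * (ref_logp - logp)).sum(axis=-1)  # (T,)
    non_target = 1.0 - target_mask
    n_other = max(non_target.sum(), 1.0)
    consistency = (kl * non_target).sum() / n_other

    return alpha * suppress + beta * consistency
```

When the fine-tuned model matches the reference and no positions are flagged as memorized, the loss is zero; raising the probability of a flagged token raises the loss, so gradient descent drives verbatim continuations down while the KL term anchors behavior everywhere else.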
Obliviate is evaluated on various models using synthetic memorization benchmarks and copyrighted excerpts like Moby Dick and Alice in Wonderland.
It substantially reduces verbatim recall while causing minimal degradation in downstream accuracy across benchmarks.
Compared against existing unlearning and copyright-protection techniques, the method proves effective for copyright compliance in language models.