This paper explores the risks of unintentional and malicious disclosure in large language models trained for code generation.
Unintentional disclosure occurs when the language model reveals secrets to a user who did not request them, while malicious disclosure occurs when an attacker deliberately prompts the model to extract secrets.
The study assesses the risks of unintentional and malicious disclosure in the Open Language Model (OLMo) family of models and the Dolma datasets used to train them.
The results show that changes in data sources and processing substantially affect the risk of unintended memorization, and that the risk of disclosing sensitive information varies with the prompting strategy and with the type of sensitive information involved.