<ul><li>Large language models (LLMs) can distinguish past from future events: probes on their activations reach 90% accuracy.</li><li>Models can be trained with backdoors triggered by a temporal distributional shift, which activate when the model is exposed to news headlines from after its training cut-off date.</li><li>Fine-tuning on helpful, harmless, and honest (HHH) data is effective at removing these temporally triggered backdoors.</li><li>At the modest model scale tested, standard safety measures appear to be enough to remove these backdoors.</li></ul>
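The 90% figure in the first finding comes from probes trained on model activations. As a minimal, hypothetical illustration of what such a probe does, the sketch below assumes a linear (logistic-regression) probe and uses synthetic vectors in place of real hidden states. The helper names (`make_synthetic_activations`, `train_probe`, `accuracy`), the planted "time" direction, and all hyperparameters are illustrative assumptions, not the authors' setup, and the accuracy it prints says nothing about real models.

```python
import math
import random


def make_synthetic_activations(n, dim, rng, shift=2.0):
    """Hypothetical stand-in for LLM hidden-state activations.

    Label 1 marks a 'future' headline (after the training cut-off) and
    label 0 a 'past' one. ASSUMPTION: the label is encoded linearly along a
    single fixed unit direction, with isotropic Gaussian noise on top.
    """
    direction = [1.0 / math.sqrt(dim)] * dim  # assumed unit-norm "time" direction
    data = []
    for _ in range(n):
        label = rng.randint(0, 1)
        sign = 1.0 if label == 1 else -1.0
        x = [sign * shift * d + rng.gauss(0.0, 1.0) for d in direction]
        data.append((x, label))
    return data


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_probe(data, dim, lr=0.5, epochs=50):
    """Fit a logistic-regression probe with full-batch gradient descent."""
    w = [0.0] * dim
    b = 0.0
    n = len(data)
    for _ in range(epochs):
        grad_w = [0.0] * dim
        grad_b = 0.0
        for x, y in data:
            p = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
            err = p - y  # gradient of the log-loss with respect to the logit
            for i in range(dim):
                grad_w[i] += err * x[i]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b


def accuracy(w, b, data):
    """Fraction of examples where the probe's predicted label matches the true one."""
    correct = 0
    for x, y in data:
        pred = 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else 0
        correct += int(pred == y)
    return correct / len(data)


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed for reproducibility
    dim = 16
    train_set = make_synthetic_activations(600, dim, rng)
    test_set = make_synthetic_activations(400, dim, rng)
    w, b = train_probe(train_set, dim)
    print(f"held-out probe accuracy: {accuracy(w, b, test_set):.3f}")
```

A linear probe is a deliberately weak classifier, so high held-out accuracy suggests the past/future distinction is linearly readable from the activations rather than computed by the probe itself.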

Future Events as Backdoor Triggers: Investigating Temporal Vulnerabilities in LLMs
