Many language models exhibit an 'over-refusal' behavior, hindering their ability to engage with sensitive or controversial topics intelligently.
A new dataset named 'FalseReject' aims to address this issue by retraining models to handle sensitive topics more effectively without sacrificing safety.
Researchers from Dartmouth College and Amazon developed the FalseReject dataset, containing prompts likely to trigger refusals but are harmless.
The dataset challenges models to learn a flexible tolerance towards potentially risky prompts, rather than relying on a fixed 'white-list' approach.
Language models often struggle with over-refusal, impacting their interactions with users on various topics.
Refusal patterns vary among different model families, with reasoning models like DeepSeek-R1 showing better alignment in handling sensitive prompts.
The FalseReject dataset includes prompts that challenge models to distinguish between casual inquiry and security research-level queries.
Open-source models like Mistral-7B and DeepSeek-R1 demonstrate strong performance in handling over-refusal, potentially outperforming closed-source models.
Training with FalseReject data helps reduce over-refusal in non-reasoning models and enhances safety in reasoning models.
The new approach emphasizes the importance of balancing safety and engagement in language models, especially in evolving ethical and legal contexts.