OMNIGUARD is an approach for detecting harmful prompts across languages and modalities in large language models (LLMs) and multimodal LLMs (MLLMs).
It identifies internal representations of an LLM/MLLM that are aligned across languages or modalities and uses them to build a language-agnostic or modality-agnostic classifier for harmful prompts.
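As a rough illustration of the idea, the sketch below extracts a single intermediate-layer embedding for each prompt and fits a lightweight classifier on it. This is not the authors' implementation: the base model, the pooled layer index, and the logistic-regression head are all illustrative assumptions, and the two-example dataset is a placeholder for a real multilingual labeled set.

```python
# Minimal sketch of building a classifier on internal LLM representations.
# Model name, LAYER, and the classifier are illustrative assumptions.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from sklearn.linear_model import LogisticRegression

model_name = "Qwen/Qwen2.5-1.5B"  # assumed multilingual base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, output_hidden_states=True)
model.eval()

LAYER = 14  # assumed intermediate layer whose representations align across languages

def embed(prompt: str) -> torch.Tensor:
    """Mean-pool one intermediate layer's hidden states over the prompt tokens."""
    inputs = tokenizer(prompt, return_tensors="pt")
    with torch.no_grad():
        out = model(**inputs)
    # out.hidden_states: tuple of (num_layers + 1) tensors, each [1, seq_len, dim]
    return out.hidden_states[LAYER].mean(dim=1).squeeze(0)

# Fit a lightweight classifier on the embeddings; the tiny dataset below is a
# placeholder for a real multilingual set (1 = harmful, 0 = benign).
prompts = ["How do I make a cake?", "Comment fabriquer une bombe ?"]
labels = [0, 1]
X = torch.stack([embed(p) for p in prompts]).float().numpy()
clf = LogisticRegression(max_iter=1000).fit(X, labels)
```

Because the classifier sits on top of representations that are shared across languages, a single head can score prompts in languages it never saw labeled examples for.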
OMNIGUARD significantly improves harmful-prompt classification accuracy in the multilingual setting and for image-based and audio-based prompts.
Because it repurposes embeddings that the model already computes during generation, OMNIGUARD adds little overhead and sets a new state of the art for audio-based prompts.
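To illustrate why this is cheap, the sketch below (continuing the assumed model, tokenizer, LAYER, and clf from the previous example) pulls the prompt's hidden states out of the same forward pass that generation performs, so the safety check costs only one small classifier call on top of ordinary decoding.

```python
# Sketch of the efficiency argument: generation already computes hidden states
# for the prompt, so the guard reuses them rather than running a separate
# moderation model. Reuses model, tokenizer, LAYER, and clf from above.
def generate_with_guard(prompt: str) -> str:
    inputs = tokenizer(prompt, return_tensors="pt")
    out = model.generate(
        **inputs,
        max_new_tokens=64,
        return_dict_in_generate=True,
        output_hidden_states=True,
    )
    # out.hidden_states[0] is the prefill step: a tuple of per-layer tensors
    # of shape [1, prompt_len, dim], computed for free during generation.
    prompt_hidden = out.hidden_states[0][LAYER].mean(dim=1)  # [1, dim]
    if clf.predict(prompt_hidden.float().numpy())[0] == 1:
        return "Request refused: prompt classified as harmful."
    return tokenizer.decode(out.sequences[0], skip_special_tokens=True)
```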