Vision-Language Models (VLMs) such as CLIP achieve strong performance on cross-modal tasks through large-scale pre-training.
Parameter-Efficient Fine-Tuning (PEFT) techniques such as LoRA have emerged as scalable alternatives to full fine-tuning, adapting large transformer-based models like VLMs by training only a small number of additional parameters.
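To make the mechanism concrete, the following is a minimal sketch of a LoRA-augmented linear layer in PyTorch; the rank, scaling factor, and layer placement are illustrative assumptions and not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a frozen pre-trained linear layer with a trainable low-rank update."""

    def __init__(self, base: nn.Linear, rank: int = 4, alpha: float = 16.0):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)          # pre-trained weights stay frozen
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Effective weight is W + (alpha / rank) * B @ A; only A and B are trained.
        self.lora_A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scaling = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)
```

In practice, such layers typically replace the attention projections of the transformer, so only the low-rank matrices are optimized during fine-tuning.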
Adversarial attacks can significantly degrade the performance of VLMs, and adversarial training is crucial for improving model robustness in few-shot scenarios.
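As a reference point for what such attacks look like, here is a minimal sketch of an L-infinity PGD attack, a common choice both for evaluating robustness and for generating training-time perturbations; the step sizes and budgets are illustrative assumptions.

```python
import torch

def pgd_attack(model, images, labels, loss_fn, eps=4/255, alpha=1/255, steps=10):
    """Projected Gradient Descent: iteratively ascend the loss inside an eps-ball."""
    adv = images.clone().detach()
    adv = adv + torch.empty_like(adv).uniform_(-eps, eps)   # random start
    adv = adv.clamp(0, 1)
    for _ in range(steps):
        adv.requires_grad_(True)
        loss = loss_fn(model(adv), labels)
        grad = torch.autograd.grad(loss, adv)[0]
        adv = adv.detach() + alpha * grad.sign()             # gradient-sign step
        adv = images + (adv - images).clamp(-eps, eps)       # project back to eps-ball
        adv = adv.clamp(0, 1).detach()
    return adv
```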
AdvCLIP-LoRA is introduced as the first algorithm to enhance the adversarial robustness of CLIP models fine-tuned with LoRA in few-shot settings. It comes with theoretical convergence guarantees and yields significant robustness gains against common adversarial attacks while maintaining clean accuracy.
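For intuition only, the following sketches a generic min-max training step that couples an inner attack (reusing the pgd_attack sketch above) with an outer update restricted to LoRA parameters. This is an assumed, simplified illustration of adversarial LoRA fine-tuning, not the AdvCLIP-LoRA update itself, whose precise formulation and convergence analysis are given in the paper.

```python
import torch

def adversarial_lora_step(model, optimizer, images, labels, loss_fn,
                          eps=4/255, alpha=1/255, steps=10):
    """One illustrative min-max step: craft perturbations, then update LoRA weights.

    Assumes `optimizer` was constructed over only the LoRA parameters
    (e.g. the lora_A / lora_B tensors from the sketch above).
    """
    model.eval()                                   # keep normalization stats fixed while attacking
    adv_images = pgd_attack(model, images, labels, loss_fn, eps, alpha, steps)
    model.train()
    optimizer.zero_grad()
    loss = loss_fn(model(adv_images), labels)      # minimize loss on adversarial inputs
    loss.backward()
    optimizer.step()                               # only LoRA parameters are updated
    return loss.item()
```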