A new method called SharpZO has been proposed for fine-tuning vision-language models without backpropagation, making such fine-tuning feasible on memory-constrained edge devices.
SharpZO uses a sharpness-aware two-stage optimization process: a global exploration stage based on evolutionary strategies, followed by a fine-grained local search stage using zeroth-order optimization.
The approach relies solely on forward passes during optimization and shows significant gains in accuracy and convergence speed over existing forward-only methods, with up to a 7% average accuracy improvement in experiments on CLIP models.
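To make the two-stage, forward-only idea concrete, here is a minimal sketch in Python. It is not the SharpZO implementation: the toy `loss` function stands in for the forward-pass loss of a frozen CLIP model, the first stage uses a plain (mu, lambda) evolution strategy rather than the paper's sharpness-aware variant, and the second stage uses a generic two-point (SPSA-style) zeroth-order gradient estimate; all hyperparameters are illustrative.

```python
import numpy as np

# Toy objective standing in for the forward-pass loss of a frozen CLIP model.
def loss(theta):
    return np.sum(theta ** 2) + 0.5 * np.sum(np.sin(3 * theta))

rng = np.random.default_rng(0)
dim = 16

# --- Stage 1: global exploration with a simple evolution strategy ---
# (SharpZO uses a sharpness-aware ES; this is a plain (mu, lambda)-ES sketch.)
pop_size, elite, sigma = 32, 8, 0.5
mean = rng.normal(size=dim)
for gen in range(50):
    candidates = mean + sigma * rng.normal(size=(pop_size, dim))
    fitness = np.array([loss(c) for c in candidates])   # forward passes only
    elites = candidates[np.argsort(fitness)[:elite]]
    mean = elites.mean(axis=0)                           # recombine the elites
    sigma *= 0.98                                        # anneal the step size

# --- Stage 2: local refinement with zeroth-order gradient estimates ---
theta, lr, mu = mean.copy(), 0.05, 1e-3
for step in range(200):
    u = rng.normal(size=dim)
    # Two-point finite-difference estimate of the gradient along direction u,
    # again using only forward evaluations of the loss.
    g = (loss(theta + mu * u) - loss(theta - mu * u)) / (2 * mu) * u
    theta -= lr * g

print("after stage 1:", loss(mean), "after stage 2:", loss(theta))
```

The design point the sketch illustrates is that both stages query the model only through loss evaluations, so no activations or gradients need to be stored, which is what makes the approach attractive for memory-constrained devices.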