This article walks through fine-tuning a vision language model (VLM), Qwen 2.5 VL 7B, to improve performance on a specific task: extracting handwritten text.
The focus is on fine-tuning the model on a custom annotated dataset so that it transcribes handwritten records more accurately and reliably than the base model.
Topics covered include the motivation, the advantages of VLMs over traditional OCR, an overview of the dataset, annotation, the fine-tuning pipeline, technical details of supervised fine-tuning (SFT), and results with accompanying plots.
The motivation is to showcase the end-to-end process of fine-tuning a VLM for a specific task, here extracting handwritten text, which has valuable applications in fields such as climate research.
VLMs offer advantages over traditional OCR engines: they extract text more accurately, cope better with variation in handwriting, and can be steered with task-specific instructions about what to extract and how to format it.
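To illustrate the instruction-following point, here is a minimal inference sketch using the Hugging Face transformers API for Qwen 2.5 VL; the image path and prompt are placeholders, not the article's actual data:

```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype="auto", device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Placeholder image and instruction: a VLM can be told exactly what to
# extract and how to format it, unlike a conventional OCR engine.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": "record_card.jpg"},
        {"type": "text", "text": "Transcribe the handwritten text on this "
                                 "card. Return only the transcription."},
    ],
}]

text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

output_ids = model.generate(**inputs, max_new_tokens=256)
# Strip the prompt tokens so only the generated transcription is decoded.
trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, output_ids)]
print(processor.batch_decode(trimmed, skip_special_tokens=True)[0])
```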
Fine-tuning follows an iterative three-step loop: the current model predicts transcriptions, a human reviews and corrects its mistakes, and the model is retrained on the corrected data, which makes efficient use of annotation effort (see the sketch below).
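A minimal sketch of that loop follows; the `predict`, `review`, and `retrain` callables are hypothetical stand-ins for the article's actual pipeline:

```python
from typing import Callable, List, Tuple

Example = Tuple[str, str]  # (image path, transcription)

def annotation_loop(
    predict: Callable[[str], str],                     # current model transcribes one image
    review: Callable[[List[Example]], List[Example]],  # human reviews and corrects predictions
    retrain: Callable[[List[Example]], None],          # fine-tune the model on all data so far
    images: List[str],
    batch_size: int = 100,
) -> List[Example]:
    """Model-assisted annotation: predict, correct, retrain, repeat."""
    dataset: List[Example] = []
    for start in range(0, len(images), batch_size):
        batch = images[start:start + batch_size]
        predictions = [(img, predict(img)) for img in batch]  # step 1: pre-annotate
        dataset.extend(review(predictions))                   # step 2: correct mistakes
        retrain(dataset)                                      # step 3: retrain on everything
    return dataset
```

Because each retrained model pre-annotates the next batch more accurately, the human reviewer corrects progressively fewer mistakes per round.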
Supervised fine-tuning (SFT) updates the model's weights on labeled examples, and must contend with challenges such as similar-looking characters, background noise in the images, and the correctness of the annotations themselves.
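At its core, SFT minimizes next-token cross-entropy on the target transcriptions. A minimal training-step sketch, assuming batches prepared by the model's processor with non-target positions masked:

```python
import torch

def sft_step(model, optimizer, batch):
    """One SFT step: cross-entropy loss on the target transcription tokens.

    `batch` is assumed to come from the model's processor and to contain a
    `labels` tensor equal to `input_ids`, with prompt and padding positions
    set to -100 so they are excluded from the loss.
    """
    outputs = model(**batch)  # Hugging Face models compute the loss from `labels`
    outputs.loss.backward()
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)  # stabilize updates
    optimizer.step()
    optimizer.zero_grad()
    return outputs.loss.item()
```

Masking everything but the transcription with -100 ensures the model is trained only on producing the answer, not on reproducing the prompt.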
Hyperparameter search and dataset balancing are crucial for getting the most out of training, as is selecting which layers to fine-tune for the task at hand, such as OCR-style handwritten text extraction (see the sketch below).
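As one example of layer selection, here is a hedged sketch of freezing the vision encoder while leaving the rest of the model trainable; the `visual.` parameter prefix is an assumption about a Qwen2.5-VL-style checkpoint layout and should be verified against the actual model:

```python
def freeze_vision_tower(model):
    """Freeze the vision encoder so only the remaining weights are trained.

    Assumes vision-tower parameters are named `visual.*`, as in
    Qwen2.5-VL-style models; check `model.named_parameters()` first.
    """
    trainable, frozen = 0, 0
    for name, param in model.named_parameters():
        if name.startswith("visual."):
            param.requires_grad = False  # exclude vision encoder from updates
            frozen += param.numel()
        else:
            trainable += param.numel()
    print(f"trainable: {trainable:,} params | frozen: {frozen:,} params")
```

Freezing components this way shrinks the search space for hyperparameters and reduces memory use, at the cost of less flexibility if the task also demands visual adaptation.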
Results show that the fine-tuned Qwen model outperforms the base model, as measured on held-out test sets.
The article concludes with insights into the phenology dataset, the process of extracting its handwritten text, the model fine-tuning pipeline, the results, and visualizations of the extracted data.