Researchers have proposed VLAD, a vision-language autonomous driving model that integrates a fine-tuned Vision-Language Model (VLM) with VAD, a state-of-the-art end-to-end driving system.
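To make the division of labor concrete, the minimal sketch below shows one plausible way such an integration could be wired together; the class and method names (`VLAD`, `infer`, `plan`) are illustrative assumptions, not the paper's actual interfaces.

```python
# A minimal sketch of a VLM-over-planner integration, under assumed
# interfaces: the fine-tuned VLM emits a high-level command that
# conditions the end-to-end VAD planner. All names are hypothetical.
class VLAD:
    def __init__(self, vlm, vad_planner):
        self.vlm = vlm          # fine-tuned Vision-Language Model
        self.vad = vad_planner  # end-to-end VAD planning module

    def step(self, camera_images, ego_state):
        # The VLM reasons over the multi-view scene and produces a
        # high-level navigational command plus a natural-language rationale.
        command, explanation = self.vlm.infer(camera_images)
        # VAD plans the low-level trajectory, conditioned on that command.
        trajectory = self.vad.plan(camera_images, ego_state, command)
        return trajectory, explanation
```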
VLAD is fine-tuned on custom question-answer datasets specifically designed to enhance the model's spatial reasoning capabilities.
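As a rough illustration of what such a spatial-reasoning record might look like (the summary does not specify the actual schema, so the fields, camera views, and wording below are assumptions):

```python
# Hypothetical question-answer record for spatial-reasoning fine-tuning;
# the field names and distances are invented for illustration only.
qa_example = {
    "images": ["CAM_FRONT.jpg", "CAM_FRONT_LEFT.jpg"],  # multi-view frames
    "question": "Is the pedestrian on the left closer than the parked car ahead?",
    "answer": "Yes. The pedestrian is roughly 8 m away; the parked car is about 15 m ahead.",
}
```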
The system generates high-level navigational commands for vehicle operation and provides interpretable natural language explanations of driving decisions to increase transparency and trustworthiness.
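One way to picture this dual output is a response that pairs a discrete command with a free-text rationale. The command vocabulary and delimiter-based format below are assumptions made for illustration, not the paper's actual output protocol; note the conservative fallback to STOP when the command cannot be parsed.

```python
# Illustrative parsing of a combined command-plus-explanation response,
# e.g. "STOP | A pedestrian is crossing ahead." The vocabulary, delimiter,
# and fallback behavior are all assumed, not taken from the paper.
COMMANDS = {"GO_STRAIGHT", "TURN_LEFT", "TURN_RIGHT", "STOP", "YIELD"}

def parse_vlm_output(text: str) -> tuple[str, str]:
    """Split 'COMMAND | explanation' into its parts, defaulting to STOP."""
    command, _, explanation = (part.strip() for part in text.partition("|"))
    return (command, explanation) if command in COMMANDS else ("STOP", text.strip())

print(parse_vlm_output("STOP | A pedestrian is crossing at the crosswalk ahead."))
# -> ('STOP', 'A pedestrian is crossing at the crosswalk ahead.')
```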
Evaluation on the nuScenes dataset shows that VLAD reduces average collision rates by 31.82% compared to baseline methodologies, setting a new benchmark for VLM-augmented autonomous driving systems.
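For clarity, the reported figure is a relative reduction. The snippet below shows how such a percentage is computed; the absolute collision rates used here are made-up placeholders, not numbers from the paper.

```python
# Worked example of a relative collision-rate reduction. The absolute
# rates below are illustrative placeholders, chosen only so the relative
# reduction comes out near the reported 31.82%.
baseline_rate = 0.22  # average collision rate of the baseline (%)
vlad_rate = 0.15      # average collision rate of VLAD (%)

reduction = (baseline_rate - vlad_rate) / baseline_rate * 100
print(f"Relative reduction: {reduction:.2f}%")  # Relative reduction: 31.82%
```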