GPT-4V, a vision-enabled large language model, can solve complex physics problems even when the text of a question is converted to a PNG image. Vision-Language Models (VLMs) combine a large language model with a vision encoder to enable the model to "see"; a minimal sketch of this pattern appears at the end of this section. VLMs are capable of image analysis, visual question answering, summarising images and video, and solving complex math and physics problems.

VLMs are useful in logistics and manufacturing, where robots can sort items based on appearance and verbal guidance. Their main limitations are challenges around spatial reasoning and long-context video understanding. Training VLMs requires large image-caption datasets and substantial computational power.

The ethical implications of VLMs involve job displacement and labor impacts as machines begin to match or outperform human capabilities. The challenge is not just technological but societal, and it requires embracing innovation without sacrificing the human touch that defines us.
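
To make the "language model plus vision encoder" combination concrete, here is a minimal sketch in PyTorch. `ToyVLM`, its dimensions, and the placeholder modules are illustrative assumptions, not the architecture of GPT-4V or any particular model; real systems typically pair a pretrained vision transformer (often frozen) with a pretrained LLM, joined by a learned projection.

```python
import torch
import torch.nn as nn

class ToyVLM(nn.Module):
    """Illustrative VLM skeleton: a vision encoder turns image patches into
    features, a projection layer maps those features into the language
    model's token-embedding space, and the language model attends over
    image tokens and text tokens together."""

    def __init__(self, vision_dim=256, llm_dim=512):
        super().__init__()
        # Placeholder modules: a real system would use a pretrained ViT
        # and a pretrained decoder-only LLM here.
        self.vision_encoder = nn.Linear(3 * 16 * 16, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )

    def forward(self, image_patches, text_embeddings):
        # image_patches:   (batch, num_patches, 3*16*16) flattened pixel patches
        # text_embeddings: (batch, seq_len, llm_dim) embedded prompt tokens
        image_tokens = self.projector(self.vision_encoder(image_patches))
        # Prepending image tokens to the text sequence is what lets the
        # language model "see" the image.
        fused = torch.cat([image_tokens, text_embeddings], dim=1)
        return self.llm(fused)

model = ToyVLM()
patches = torch.randn(1, 196, 3 * 16 * 16)  # e.g. a 224x224 image as 16x16 patches
prompt = torch.randn(1, 8, 512)             # stand-in for embedded question tokens
output = model(patches, prompt)             # shape: (1, 204, 512)
```

The key design choice this illustrates is that the image is translated into the language model's own token space, so the LLM's core architecture needs no modification to handle visual input.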
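
For the visual question answering capability mentioned above, a small open model can be run locally through the Hugging Face `transformers` library. The sketch below assumes `transformers` and `Pillow` are installed; `Salesforce/blip-vqa-base` is one publicly available VQA model (not the model discussed above), and `photo.jpg` is a hypothetical local image file.

```python
from PIL import Image
from transformers import BlipProcessor, BlipForQuestionAnswering

# Load a small open VQA model from the Hugging Face Hub.
processor = BlipProcessor.from_pretrained("Salesforce/blip-vqa-base")
model = BlipForQuestionAnswering.from_pretrained("Salesforce/blip-vqa-base")

# "photo.jpg" is a stand-in for any local image file.
image = Image.open("photo.jpg").convert("RGB")
question = "How many people are in the picture?"

# The processor prepares both modalities; generate() produces the answer text.
inputs = processor(image, question, return_tensors="pt")
answer_ids = model.generate(**inputs)
print(processor.decode(answer_ids[0], skip_special_tokens=True))
```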