Kyutai has introduced MoshiVis, the first open-source real-time speech model that can talk about images. MoshiVis is a Vision Speech Model (VSM) that enables natural, real-time spoken interaction about images. It integrates lightweight cross-attention modules to process and discuss visual inputs while remaining efficient and responsive. The open-source release invites collaboration and encourages further innovation in vision-speech models.
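
To make the cross-attention idea concrete, here is a minimal sketch of the general technique, not MoshiVis's actual code. It assumes a single attention head with no learned projections or gating, and the names `cross_attention`, `speech_tokens`, and `image_tokens` are hypothetical. Queries come from the speech stream, while keys and values come from the image features, so each speech token pulls in a weighted mix of visual information.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(speech_tokens, image_tokens):
    """Single-head cross-attention without learned projections (illustrative only).

    Queries come from the speech stream; keys and values come from the
    image features. Returns one updated vector per speech token.
    """
    dim = len(speech_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for query in speech_tokens:
        # Scaled dot-product score of this query against every image token.
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in image_tokens
        ]
        weights = softmax(scores)
        # Weighted average of the image tokens (used here as the values).
        attended = [
            sum(w * value[i] for w, value in zip(weights, image_tokens))
            for i in range(dim)
        ]
        # Residual connection: visual information is added to the speech state.
        outputs.append([q + a for q, a in zip(query, attended)])
    return outputs


# Toy data: two speech-token vectors attending over three image-patch vectors.
speech = [[1.0, 0.0], [0.0, 1.0]]
image = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(speech, image)
```

Because the speech stream only reads from the image features through this extra step, a module of this kind can in principle be added to an existing speech model without changing how the speech tokens themselves are produced.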