<ul><li>The study investigates how the performance of vision-language models scales with the number of vision tokens.</li><li>The model exhibits weak scaling with respect to the number of vision tokens, with performance approximately following a power-law relationship.</li><li>This scaling behavior is unaffected by whether the user's question is included in the input.</li><li>Fusing the user's question with the vision tokens can improve model performance when the question is relevant.</li></ul>
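
The power-law relationship mentioned above can be checked with a simple log-log fit. The following is a minimal sketch, not the study's procedure: it assumes the functional form score ≈ a · N^b, where N is the number of vision tokens, and the function name `fit_power_law`, the token counts, and the scores are all hypothetical. The scores are synthetic values generated from a known power law, not measurements from the study.

```python
import math


def fit_power_law(token_counts, scores):
    """Fit scores ~ a * token_counts**b by least squares in log-log space.

    Taking logs turns the power law into a straight line:
        log(score) = log(a) + b * log(N)
    so the slope of the fitted line is the exponent b.
    """
    xs = [math.log(n) for n in token_counts]
    ys = [math.log(s) for s in scores]
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    # Ordinary least-squares slope and intercept for the line y = intercept + b * x.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


# Synthetic illustration only: scores generated from a = 0.5, b = 0.1.
token_counts = [1, 4, 16, 64, 256]
scores = [0.5 * n ** 0.1 for n in token_counts]

a, b = fit_power_law(token_counts, scores)
print(f"a = {a:.3f}, b = {b:.3f}")
```

Under this assumed form, a small exponent b means that performance changes only slowly as the number of vision tokens grows, which is one way to read the "weak scaling" described above.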

Weak Scaling Capability in Token Space: An Observation from Large Vision Language Model
