The study investigates how vision-language models scale with respect to the number of vision tokens. The model exhibits weak scaling behavior as the number of vision tokens grows, with performance approximately following a power law. This scaling behavior is unaffected by whether the user's question is included in the input. However, fusing the user's question with the vision tokens can enhance model performance when the question is relevant.
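To make the "weak scaling" claim concrete, the sketch below fits a power law S(N) ≈ a · N^b between a benchmark score S and the number of vision tokens N by linear regression in log-log space. The data points and variable names are hypothetical placeholders for illustration, not figures from the study; a small fitted exponent b is what weak scaling looks like in this form.

```python
import numpy as np

# Hypothetical (token count, score) pairs: score grows slowly with N.
# These values are illustrative only, not results from the study.
n_tokens = np.array([1, 4, 16, 64, 256, 1024])
scores = np.array([41.0, 45.2, 48.9, 51.7, 53.8, 55.4])

# Fit log S = log a + b * log N via least squares (linear in log-log space).
b, log_a = np.polyfit(np.log(n_tokens), np.log(scores), 1)
a = np.exp(log_a)
print(f"fitted power law: score ≈ {a:.2f} * N^{b:.3f}")

# A small exponent (roughly 0.04 for the placeholder data above) means
# quadrupling the vision tokens yields only a modest score gain.
```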