A study by Sorbonne University and Apple challenges conventional architectural assumptions for multimodal AI models. Late-fusion pipelines, which combine separately trained unimodal encoders after the fact, struggle to capture cross-modality dependencies and add complexity at scale. The researchers instead investigate early-fusion ("native") multimodal models, which integrate all modalities in a single backbone from the first layer, and characterize their scaling properties.

Comparing the two designs, the study finds that early-fusion models perform better at lower compute budgets and are more efficient to train, giving them an advantage in both efficiency and scalability. Native multimodal models also follow scaling patterns similar to those of language models.

Sparse architectures such as Mixture of Experts (MoE) deliver further gains: MoE models outperform dense models at smaller sizes, and their compute-optimal configurations prioritize training tokens over active parameters. The experts also specialize by modality, which improves the handling of heterogeneous data.

Overall, the results point to early-fusion architectures with dynamic parameter allocation as a promising direction for efficient multimodal AI systems.
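To make the architectural contrast concrete, here is a minimal PyTorch-style sketch of the two designs. It is an illustration under assumptions, not the study's code: the class names `EarlyFusionModel` and `LateFusionModel`, the layer counts, and the dimensions are hypothetical.

```python
# Illustrative sketch of early vs. late fusion; names, layer counts,
# and dimensions are hypothetical, not taken from the paper's code.
import torch
import torch.nn as nn


class EarlyFusionModel(nn.Module):
    """One transformer consumes text tokens and image patches jointly."""

    def __init__(self, vocab_size=32000, patch_dim=768, d_model=768):
        super().__init__()
        self.text_embed = nn.Embedding(vocab_size, d_model)
        # Raw patches are projected directly; there is no separate vision encoder.
        self.patch_embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=12, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)

    def forward(self, text_ids, image_patches):
        tokens = torch.cat(
            [self.patch_embed(image_patches), self.text_embed(text_ids)], dim=1
        )
        return self.backbone(tokens)  # cross-modal attention from layer 1


class LateFusionModel(nn.Module):
    """Separate unimodal encoders; modalities only meet in a final fusion stage."""

    def __init__(self, vocab_size=32000, patch_dim=768, d_model=768):
        super().__init__()
        make_layer = lambda: nn.TransformerEncoderLayer(
            d_model, nhead=12, batch_first=True
        )
        self.text_embed = nn.Embedding(vocab_size, d_model)
        self.patch_embed = nn.Linear(patch_dim, d_model)
        self.text_encoder = nn.TransformerEncoder(make_layer(), num_layers=12)
        self.image_encoder = nn.TransformerEncoder(make_layer(), num_layers=12)
        self.fusion = nn.TransformerEncoder(make_layer(), num_layers=2)

    def forward(self, text_ids, image_patches):
        t = self.text_encoder(self.text_embed(text_ids))
        v = self.image_encoder(self.patch_embed(image_patches))
        return self.fusion(torch.cat([v, t], dim=1))  # fusion only at the end
```

The key difference is where the concatenation happens: in the early-fusion sketch, image patches and text tokens attend to each other from the first layer, while in the late-fusion sketch each modality is encoded in isolation and only a small fusion stage ever sees both.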
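The claim that native multimodal models follow scaling patterns similar to language models can be read against the standard power-law form used for LLMs. As a hedged sketch (this is the usual Chinchilla-style form, not the paper's fitted coefficients), the pretraining loss as a function of parameter count $N$ and training tokens $D$ is modeled as:

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Minimizing $L$ under a fixed compute budget $C$ (roughly proportional to $N \cdot D$ for dense transformers) yields compute-optimal choices of $N$ and $D$. The study's observation that sparse models prioritize training tokens over active parameters corresponds to the optimal $D$ growing faster than the optimal count of active parameters $N$.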
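The modality specialization reported for sparse models falls out of how MoE layers route tokens. Below is a minimal sketch of a token-choice MoE layer with top-1 routing; the class name `MoELayer`, the expert count, and the routing scheme are illustrative assumptions, not the study's implementation.

```python
# Minimal token-choice MoE layer with top-1 routing; names and sizes
# are illustrative assumptions, not the study's implementation.
import torch
import torch.nn as nn


class MoELayer(nn.Module):
    def __init__(self, d_model=768, d_ff=3072, num_experts=8):
        super().__init__()
        self.router = nn.Linear(d_model, num_experts)  # per-token expert scores
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
            )
            for _ in range(num_experts)
        )

    def forward(self, x):  # x: (batch, seq, d_model)
        logits = self.router(x)
        weights, idx = logits.softmax(-1).max(-1)  # top-1 expert per token
        out = torch.zeros_like(x)
        for e, expert in enumerate(self.experts):
            mask = idx == e  # tokens routed to expert e
            if mask.any():
                out[mask] = weights[mask].unsqueeze(-1) * expert(x[mask])
        return out
```

Because each token is routed independently, text and image tokens can end up served by different experts, which is the mechanism behind the modality-specific specialization the study observes; only one expert's parameters are active per token, so capacity grows without a proportional increase in per-token compute.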