LlamaFusion is a framework that equips pretrained text-only large language models (LLMs) with multimodal generative capabilities. It enables an LLM to understand and generate both text and images in arbitrary sequences. LlamaFusion uses dedicated modules to process text and images separately, while still allowing text and image features to interact. In experiments, LlamaFusion shows improved image understanding and generation while maintaining the language capabilities of the text-only LLM.
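
To make the idea of dedicated modules with cross-modal interaction concrete, here is a minimal sketch and not the LlamaFusion implementation. It assumes a toy layer in which every token first passes through one shared causal attention step (so text and image features can mix) and is then routed to a modality-specific linear map. The names `fusion_layer`, `causal_attention`, and `modules`, the parameter-free attention, and the random weights are all hypothetical choices made for illustration.

```python
import math
import random


def rand_matrix(rng, rows, cols):
    """Random weights standing in for a trained module (illustrative only)."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def causal_attention(states):
    """Parameter-free softmax attention shared by all tokens.

    Each position attends to itself and to earlier positions regardless of
    modality, which is where text and image features interact in this sketch.
    """
    dim = len(states[0])
    mixed = []
    for i, query in enumerate(states):
        scores = [
            sum(q * k for q, k in zip(query, states[j])) / math.sqrt(dim)
            for j in range(i + 1)
        ]
        peak = max(scores)  # subtract the max for numerical stability
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        mixed.append([
            sum(w * states[j][d] for j, w in enumerate(weights)) / total
            for d in range(dim)
        ])
    return mixed


def fusion_layer(states, modalities, modules):
    """Shared attention, then one dedicated module per modality."""
    mixed = causal_attention(states)
    return [matvec(modules[m], h) for m, h in zip(modalities, mixed)]


rng = random.Random(0)
DIM = 4
# Hypothetical per-modality weights: one module for text, one for images.
modules = {
    "text": rand_matrix(rng, DIM, DIM),
    "image": rand_matrix(rng, DIM, DIM),
}
# An interleaved sequence: text, two image tokens, then text again.
modalities = ["text", "image", "image", "text"]
states = [[rng.uniform(-1.0, 1.0) for _ in range(DIM)] for _ in modalities]
outputs = fusion_layer(states, modalities, modules)
```

In this toy setup, replacing `modules["image"]` changes only the outputs at image positions, which loosely mirrors how the text pathway can be left untouched, while the shared attention step still lets a later text token depend on earlier image tokens.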