Llama 4 introduces an industry-leading context window of 10 million tokens, a large jump from the 128,000-token limit of Llama 3.
Despite the larger context window, RAG (Retrieval-Augmented Generation) remains valuable: instead of stuffing everything into the prompt, it retrieves only the passages relevant to a question from a document collection and supplies them to the model.
Building a simple RAG system from open-source components can help locally hosted models such as Llama or Qwen deliver accurate, grounded answers.
A diverse and interesting dataset of four books from Project Gutenberg makes the benefits of the RAG system easy to appreciate.
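To make the retrieval step concrete, here is a minimal sketch of the core RAG loop: chunk the books, score chunks against the question, and build a prompt from the top matches. This is an illustration, not the article's actual pipeline; it uses simple bag-of-words cosine similarity as a stand-in for a real embedding model, and the corpus snippet and function names are hypothetical.

```python
import math
from collections import Counter

def chunk(text, size=9):
    """Split text into chunks of roughly `size` words (real systems chunk by tokens)."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def vectorize(text):
    """Bag-of-words term frequencies; a real system would use an embedding model."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, chunks, k=2):
    """Return the k chunks most similar to the query."""
    qv = vectorize(query)
    return sorted(chunks, key=lambda c: cosine(qv, vectorize(c)), reverse=True)[:k]

def build_prompt(query, chunks, k=2):
    """Assemble the retrieved context and the question into one prompt."""
    context = "\n---\n".join(retrieve(query, chunks, k))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

# Tiny hypothetical corpus standing in for the four Gutenberg books.
corpus = chunk(
    "Captain Ahab pursues the white whale across the ocean. "
    "Elizabeth Bennet visits Pemberley and meets Mr. Darcy."
)
print(build_prompt("Who pursues the whale?", corpus, k=1))
```

The assembled prompt is what gets sent to the locally hosted model; swapping `vectorize` for a proper embedding model and storing vectors in an index is the main change needed to scale this beyond a toy corpus.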