Large Language Models (LLMs) are often treated as black boxes.
This article breaks an LLM down layer by layer so that its inner workings become easier to understand and examine.
Tokenization is the process of splitting text into smaller pieces called tokens.
Depending on the model's design, a token can be a word, a subword, or even a single character.
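As a rough illustration, the sketch below uses the Hugging Face `transformers` library (an assumption; any tokenizer would serve) to show how a GPT-2 style byte-level BPE tokenizer splits a sentence into subword tokens and the integer IDs the model actually consumes.

```python
# Minimal sketch, assuming the `transformers` package is installed and the
# GPT-2 tokenizer can be downloaded. Shows text -> subword tokens -> token IDs.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # byte-level BPE tokenizer

text = "Tokenization splits text into smaller pieces."
tokens = tokenizer.tokenize(text)  # subword strings, e.g. ['Token', 'ization', ...]
ids = tokenizer.encode(text)       # integer IDs fed to the model

print(tokens)
print(ids)
```

Note that common words tend to survive as single tokens, while rarer words such as "Tokenization" are split into several subword pieces.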