A generative AI application stores data from multiple sources and queries that data to generate human-language responses.
Tokenization is the process of breaking human text into smaller units, each assigned a token ID, so that models can process it.
There are various types of tokenization, such as sentence-level, word-level, subword, character-level, and hybrid tokenizers, each balancing efficiency and flexibility.
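As a rough illustration of how text becomes token IDs, the Python sketch below contrasts a toy word-level tokenizer with a character-level one. The splitting rules and vocabulary here are invented for the example and are far simpler than production subword tokenizers such as BPE.

```python
# Minimal sketch (not any specific library's tokenizer) contrasting
# word-level and character-level tokenization with toy token IDs.

def word_level_tokenize(text: str) -> list[str]:
    # Split on whitespace; real tokenizers also handle punctuation and casing.
    return text.lower().split()

def char_level_tokenize(text: str) -> list[str]:
    # Every character becomes its own token.
    return list(text.lower())

def build_vocab(tokens: list[str]) -> dict[str, int]:
    # Assign an ID to each unique token, in order of first appearance.
    vocab: dict[str, int] = {}
    for tok in tokens:
        vocab.setdefault(tok, len(vocab))
    return vocab

text = "Tokenization breaks text into smaller units"

word_tokens = word_level_tokenize(text)
word_vocab = build_vocab(word_tokens)
print([word_vocab[t] for t in word_tokens])   # few tokens, but a large vocabulary in practice

char_tokens = char_level_tokenize(text)
char_vocab = build_vocab(char_tokens)
print([char_vocab[t] for t in char_tokens])   # many tokens, but a tiny vocabulary
```

Subword tokenizers sit between these two extremes, which is why they are the common choice for LLMs.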
Embedding is the process of converting tokens into numerical vectors, enabling the model to process them.
Three common types of embeddings are traditional word embeddings, contextual embeddings, and positional embeddings.
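The sketch below, assuming NumPy and using made-up sizes with random weights in place of learned ones, shows the basic mechanics: each token ID indexes a row of an embedding matrix, and a positional embedding is added so the model knows token order. Contextual embeddings are then produced when the model's layers process these inputs together.

```python
# Minimal sketch of how token IDs become input vectors.
# Vocabulary size, embedding dimension, and random weights are
# placeholders; real models learn these matrices during training.
import numpy as np

vocab_size, embed_dim, max_len = 1000, 8, 32
rng = np.random.default_rng(0)

token_embeddings = rng.normal(size=(vocab_size, embed_dim))   # one row per token ID
position_embeddings = rng.normal(size=(max_len, embed_dim))   # one row per position

token_ids = np.array([5, 42, 7, 42])          # toy sequence of token IDs

# Embedding lookup: each token ID selects its vector.
tok_vecs = token_embeddings[token_ids]        # shape (4, 8)

# Positional step: add a vector that encodes each token's position.
pos_vecs = position_embeddings[: len(token_ids)]
model_input = tok_vecs + pos_vecs             # what the model layers would consume

print(model_input.shape)                      # (4, 8)
```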
To generate human-like outputs, the user's query vector is compared with corpus vectors using cosine similarity or Euclidean distance.
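A minimal sketch of that comparison, assuming NumPy and using random placeholder vectors in place of real embeddings:

```python
# Compare a query vector against corpus vectors with cosine similarity
# and Euclidean distance; the vectors are placeholders for real embeddings.
import numpy as np

rng = np.random.default_rng(1)
query = rng.normal(size=8)                 # embedding of the user query
corpus = rng.normal(size=(5, 8))           # embeddings of 5 stored documents

# Cosine similarity: higher means more similar (smaller angle between vectors).
cosine = corpus @ query / (np.linalg.norm(corpus, axis=1) * np.linalg.norm(query))

# Euclidean distance: lower means more similar (shorter straight-line distance).
euclidean = np.linalg.norm(corpus - query, axis=1)

print("most similar by cosine:   ", int(np.argmax(cosine)))
print("most similar by euclidean:", int(np.argmin(euclidean)))
```

Note that cosine similarity is maximized while Euclidean distance is minimized when ranking the closest matches.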
Tokenization and embeddings are two critical processes that enable LLMs to process and generate human language.
As GenAI evolves, improvements in tokenization methods and embeddings will be important for building more efficient and powerful language models.