An attempt was made to pre-train the GPT-2 124M model on Hindi Devanagari text using the Hin_Deva dataset, sourced from books, articles, and websites. Challenges arose during tokenization because the GPT-2 tokenizer is optimized for English, so it fragmented Hindi text into suboptimal chunks (see the tokenization sketch below). Despite computational resource limitations, the model was pre-trained on a Lambda Labs cluster equipped with 8× NVIDIA A100 SXM4 80GB GPUs. Training ran for 19,073 steps with a batch size of 524,288 tokens per step, costing approximately $82 in total. The resulting loss values were lower than those of OpenAI's released model, indicating improved token prediction for Hindi, and generated Hindi sample sentences were somewhat coherent, suggesting the model learned useful representations despite the tokenization challenges.

Following the Hindi pre-training, a subsequent pre-training run was conducted on an English dataset drawn from FineWeb. This run used the same batch size, lasted 38,146 steps, and cost around $116 in total. Model performance was evaluated on the HellaSwag benchmark, demonstrating the effectiveness of the standard pre-training recipe in its conventional English setting.

Future work may involve exploring custom tokenizers for non-Latin languages and further optimizing the training pipeline for enhanced performance.
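To make the tokenization issue concrete, the following is a minimal sketch (assuming the tiktoken package and a hypothetical sample sentence, neither taken from the original experiment) that compares how GPT-2's byte-level BPE handles Devanagari versus English text of similar meaning:

```python
# Sketch: GPT-2's byte-level BPE tends to fragment Devanagari text,
# since its merges were learned mostly from English.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the standard GPT-2 tokenizer

hindi = "भारत एक विशाल देश है।"        # hypothetical sample: "India is a vast country."
english = "India is a vast country."

hindi_ids = enc.encode(hindi)
english_ids = enc.encode(english)

# Devanagari characters are multi-byte in UTF-8 and rarely merge into
# whole-word tokens, so the Hindi sentence yields many more tokens.
print(f"Hindi:   {len(hindi)} chars -> {len(hindi_ids)} tokens")
print(f"English: {len(english)} chars -> {len(english_ids)} tokens")
```

The token-per-character ratio is what makes Hindi training less efficient with this tokenizer: more tokens are spent encoding the same amount of text, which also makes loss values harder to compare directly with English runs.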
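For context, the token budgets implied by the reported settings can be worked out directly from the step counts and batch size stated above; the arithmetic below is a back-of-the-envelope sketch, not additional measurement:

```python
# Token budget implied by the reported training settings.
tokens_per_step = 524_288          # 2**19 tokens per optimization step

hindi_steps = 19_073
english_steps = 38_146

hindi_tokens = hindi_steps * tokens_per_step      # ~10.0 billion tokens
english_tokens = english_steps * tokens_per_step  # ~20.0 billion tokens

print(f"Hindi run:   ~{hindi_tokens / 1e9:.1f}B tokens")
print(f"English run: ~{english_tokens / 1e9:.1f}B tokens")
```

In other words, the Hindi run processed roughly 10 billion tokens and the English run roughly 20 billion, which is consistent with the English run costing about twice as much at the same batch size.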