Source: Medium

Pre-training GPT-2 (124M) on Hindi (Devanagari) Text from scratch: A Journey Through Tokenization…

  • The author pre-trained the GPT-2 124M model from scratch on Hindi (Devanagari) text, using the Hin_Deva dataset sourced from books, articles, and websites.
  • Tokenization was the first hurdle: the GPT-2 tokenizer is optimized for English and fragments Hindi text into suboptimal chunks (a short tokenizer demonstration follows this list).
  • Despite limited computational resources, the model was pre-trained on a cluster of 8× NVIDIA A100 SXM4-80GB GPUs from Lambda Labs.
  • The Hindi model was trained for 19,073 steps with a batch size of 524,288 tokens, roughly 10 billion tokens in total, at a cost of approximately $82 (see the token-budget arithmetic below).
  • The resulting loss values were lower than those of OpenAI's GPT-2 checkpoint, indicating improved token prediction on Hindi.
  • Generated Hindi sample sentences were somewhat coherent, suggesting the model learned useful representations despite the tokenization challenges.
  • After the Hindi run, a second pre-training run was carried out on English text from the FineWeb dataset.
  • The English run used the same batch size for 38,146 steps, roughly 20 billion tokens, and cost around $116 in total.
  • Model performance was evaluated on the HellaSwag benchmark, demonstrating the effectiveness of the conventional pre-training setup (a scoring sketch appears after the token-budget arithmetic below).
  • Future work may explore custom tokenizers for non-Latin scripts and further optimization of the training pipeline.
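To make the tokenization issue concrete, the short sketch below runs the stock GPT-2 byte-level BPE on an English and a Hindi sentence. It is not taken from the article; the use of the tiktoken library and the sample sentences are illustrative assumptions.

```python
# Minimal sketch: how the stock GPT-2 BPE splits English vs. Hindi (Devanagari) text.
import tiktoken

enc = tiktoken.get_encoding("gpt2")  # the original GPT-2 byte-level BPE

samples = {
    "English": "The weather is very nice today.",
    "Hindi": "आज मौसम बहुत अच्छा है।",  # roughly the same sentence in Hindi
}

for lang, text in samples.items():
    ids = enc.encode(text)
    # Devanagari characters are 3 bytes each in UTF-8 and GPT-2's merges were learned
    # mostly from English text, so Hindi ends up with far more tokens per character.
    print(f"{lang}: {len(text)} chars -> {len(ids)} tokens "
          f"({len(ids) / len(text):.2f} tokens/char)")
```

Because the Hindi sentence costs far more tokens per character than the English one, the model's 1024-token context window covers much less Hindi text, which is the fragmentation problem the author describes.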

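The step counts and batch size quoted above determine the total token budgets directly; the small calculation below is derived from those figures (how the 524,288-token batch was split into per-GPU micro-batches and gradient-accumulation steps on the 8×A100 node is not stated in the summary).

```python
# Token budgets and cost per billion tokens implied by the figures above.
batch_tokens = 524_288  # tokens consumed per optimization step (2**19)

runs = [
    ("Hindi (Hin_Deva)", 19_073, 82),   # name, steps, approximate cost in USD
    ("English (FineWeb)", 38_146, 116),
]

for name, steps, cost_usd in runs:
    total_tokens = steps * batch_tokens
    print(f"{name}: {total_tokens / 1e9:.2f}B tokens, "
          f"~${cost_usd / (total_tokens / 1e9):.2f} per billion tokens")
```

This works out to roughly 10 billion tokens for the Hindi run and 20 billion for the English run, i.e. about $8 and $6 per billion tokens respectively.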
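The summary does not show how the HellaSwag evaluation was wired up. The sketch below illustrates the way GPT-2-class models are commonly scored on HellaSwag: append each candidate ending to the context and pick the ending with the lowest average per-token loss. The Hugging Face transformers API and the example item are assumptions for illustration, not the article's code or a real benchmark question.

```python
# Minimal sketch of HellaSwag-style scoring for a GPT-2-class language model.
import torch
import torch.nn.functional as F
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

model = GPT2LMHeadModel.from_pretrained("gpt2").eval()  # or a local checkpoint path
tok = GPT2TokenizerFast.from_pretrained("gpt2")

@torch.no_grad()
def ending_loss(context: str, ending: str) -> float:
    """Average cross-entropy of the ending tokens, conditioned on the context."""
    ctx_ids = tok.encode(context)
    end_ids = tok.encode(" " + ending)  # leading space keeps the BPE boundary clean
    ids = torch.tensor([ctx_ids + end_ids])
    logits = model(ids).logits          # shape (1, seq_len, vocab)
    # Logits at position t predict token t+1, so shift targets by one position.
    losses = F.cross_entropy(logits[0, :-1], ids[0, 1:], reduction="none")
    return losses[len(ctx_ids) - 1:].mean().item()  # average over the ending only

# Illustrative item, not an actual HellaSwag example.
context = "A man walks into the kitchen, picks up a knife and"
endings = [
    "starts chopping vegetables on the cutting board.",
    "flies out of the window into the clouds.",
    "turns into a bicycle and rides away.",
    "sings the alphabet to the refrigerator.",
]
best = min(range(len(endings)), key=lambda i: ending_loss(context, endings[i]))
print("predicted ending:", endings[best])
```

Because every ending is scored by the same frozen language model, this evaluation requires no task-specific fine-tuning, which is why HellaSwag is a convenient benchmark for comparing base pre-training runs.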