Language loss in the age of AI is not inevitable; it is merely convenient for the people who build these systems. The algorithm can be reclaimed, but only by giving it the data it currently lacks.
Most AI models are trained on text scraped from the internet, so if your language is underrepresented online, the model simply will not understand it.
That makes intentional data creation crucial: writing, publishing, and uploading content in your language is how you teach AI that your language exists.
Community-led digitization projects take this further, turning recordings, archives, and everyday writing into structured language data that can be used to train language models.
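As a concrete illustration, here is a minimal sketch of how a digitization project might store its corpus as JSONL, one record per line. The schema (text, language, source, license, consent) is hypothetical rather than any standard; the point is that each contribution carries its provenance and permissions with it.

```python
import json

# Hypothetical record schema for a community-built corpus.
# Field names are illustrative; the key idea is that every text
# carries its provenance, license, and contributor consent.
records = [
    {
        "text": "A sentence written in the community's language.",
        "language": "xx",               # ISO 639 code for the language
        "source": "community-archive",  # where the text came from
        "license": "CC-BY-SA-4.0",      # explicit terms of reuse
        "consent": True,                # contributor approved training use
    },
]

# Write one JSON object per line (JSONL), a format most
# training pipelines can load directly.
with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Keeping license and consent fields alongside the text also supports the data-governance point made further below: the terms of use travel with the data itself.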
Open-weight language models such as Mistral, LLaMA, Falcon, and BLOOM can be fine-tuned locally on such a corpus, producing AI that reflects a specific language and the values of its speakers.
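Below is a minimal fine-tuning sketch using the Hugging Face transformers, datasets, and peft libraries, assuming the corpus.jsonl file from above with a "text" field. The model name and LoRA hyperparameters are placeholders, not recommendations; LoRA is used here so the run can fit on modest community hardware.

```python
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Any open-weight causal LM works here; Mistral-7B is one example.
base_model = "mistralai/Mistral-7B-v0.1"

tokenizer = AutoTokenizer.from_pretrained(base_model)
tokenizer.pad_token = tokenizer.eos_token  # causal LMs often lack a pad token
model = AutoModelForCausalLM.from_pretrained(base_model)

# LoRA: train small adapter matrices instead of all base weights,
# so fine-tuning fits on a single consumer GPU.
model = get_peft_model(model, LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],  # attention projections
    task_type="CAUSAL_LM",
))

# Load the community corpus (one {"text": ...} record per line)
# and tokenize it for causal language modeling.
dataset = load_dataset("json", data_files="corpus.jsonl", split="train")
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=dataset.column_names,
)

Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="finetuned-model",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=8,
        num_train_epochs=3,
        learning_rate=2e-4,
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
).train()
```

The resulting adapter weights stay with the community: they can be published, shared, or kept under local control, which is exactly the point.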
Preservation and empowerment must come from speakers themselves; big tech companies have little commercial incentive to prioritize low-resource languages.
Governments, schools, and cultural institutions should fund this kind of language-model work, treating it as part of preserving national identity in the digital age.
AI should be seen as something to shape, not merely something to use: every interaction is a teaching moment.
Community data governance is equally crucial: without it, language data can be extracted and exploited, and communities lose collective rights over the datasets built from their own words.
The goal is to train machines on our own terms, rather than allowing languages to be erased through passive inaction.
The future worth building is one in which AI understands diverse voices, speaks many languages, and helps cultural identity survive into the digital age.