Bluesky, a microblogging alternative to X (formerly Twitter), doubled its user base between September and November 20, reaching 20 million users.
The platform is competing against Elon Musk’s X, which has approximately 611 million monthly active users, and Meta’s Threads, which has 275 million.
Unlike X, Bluesky offers an open API, which allows its data to be used for training AI models. Daniel van Strien, a machine learning engineer at Hugging Face, recently released a controversial dataset of one million public posts sourced from Bluesky’s Firehose API without user consent.
Clem Delangue, CEO of Hugging Face, responded on X, claiming that “there are a lot of toxic users on Bluesky.” Bluesky itself says it has no intention of using user content to train generative AI.
X updated its terms of service to state that users who upload content permit X to use it for analysis, including to help train machine learning and artificial intelligence models. This change prompted users to migrate to Bluesky.
Meta’s updated privacy policy also specifies that it trains its models using users’ posts, photos, and captions.
Startups like OpenAI and Anthropic have already exhausted the human-generated content available for training their models and now rely on synthetic data for their upcoming frontier models.
However, user consent is still essential.
In India, Sarvam AI is using synthetic data created by Meta Llama 3.1 405B to train its model, while OpenAI reportedly uses Strawberry to generate synthetic data for GPT-5.
This sets up a ‘recursive improvement cycle,’ where each GPT version is trained on higher-quality synthetic data created by the previous model.