Scholarly publishers have begun to monetize their research content by licensing it as training data for large language models (LLMs).
Major academic publishers, including Wiley and Taylor & Francis, have reported substantial revenue from licensing their content to technology companies developing generative AI models.
The scholarly community is no stranger to fraudulent research, and risks arise when such questionable work infiltrates AI training datasets.
The implications are profound when LLMs are trained on databases containing fraudulent or low-quality research: such models could perpetuate inaccuracies, with harmful consequences for fields such as medicine.
Publishers must strengthen their peer-review processes to catch unreliable studies before they make their way into training datasets.
Choosing publishers and journals with a strong reputation for high-quality, well-reviewed research is key to reducing the risk that flawed research disrupts AI training.
AI tools themselves can also be designed to flag suspicious data, reducing the risk that questionable research spreads further.
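What such screening could look like in practice will vary by publisher and pipeline; the Python sketch below is a minimal, hypothetical illustration that combines a blocklist of retracted DOIs with a crude phrase-based heuristic. The DOIs, field names, and phrase list are illustrative assumptions, not real screening resources or any particular company's method.

```python
# Minimal sketch of a pre-training data filter for a scholarly corpus.
# Assumptions: each candidate paper arrives with a "doi" and "text" field,
# and a list of retracted DOIs has been exported locally from a retraction
# database. The phrase list is a tiny illustrative sample only.

from dataclasses import dataclass

RETRACTED_DOIS = {
    "10.1234/example.retracted.001",  # hypothetical entries
    "10.1234/example.retracted.002",
}

# Phrases of the kind research-integrity screeners associate with
# machine-paraphrased, low-quality manuscripts (illustrative only).
SUSPICIOUS_PHRASES = (
    "counterfeit consciousness",  # garbled "artificial intelligence"
    "profound learning",          # garbled "deep learning"
    "bosom peril",                # garbled "breast cancer"
)


@dataclass
class Paper:
    doi: str
    text: str


def screen(paper: Paper) -> tuple[bool, str]:
    """Return (keep, reason) for a candidate training document."""
    if paper.doi.lower() in RETRACTED_DOIS:
        return False, "retracted"
    lowered = paper.text.lower()
    hits = [p for p in SUSPICIOUS_PHRASES if p in lowered]
    if hits:
        return False, "suspicious phrasing: " + ", ".join(hits)
    return True, "ok"


if __name__ == "__main__":
    corpus = [
        Paper("10.1234/example.retracted.001", "A study of profound learning."),
        Paper("10.5555/example.ok.042", "A well-reviewed clinical trial report."),
    ]
    kept = []
    for paper in corpus:
        keep, reason = screen(paper)
        print(f"{paper.doi}: {'kept' if keep else 'dropped'} ({reason})")
        if keep:
            kept.append(paper)
```

A real pipeline would draw its blocklist from up-to-date retraction records and could layer learned classifiers on top of simple heuristics like these, but even basic checks keep the most clearly compromised papers out of the training mix.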
Transparency is also essential: publishers and AI companies should openly share how research is used and where royalties go.
Open access to high-quality research should be encouraged to ensure inclusivity and fairness in AI development.
By focusing on reliable, well-reviewed research, we can build better AI tools, protect scientific integrity, and maintain the public’s trust in science and technology.