Researchers investigate using LLM-generated data for continual pretraining of encoder models in specialized domains with limited training data.
They enrich domain-specific ontologies with LLM-generated data and use the result to pretrain the encoder as an ontology-informed embedding model for concept definitions.
The proposed approach proves effective in the scientific domain of invasion biology, achieving substantial improvements over standard LLM pretraining.
The study also explores the feasibility of applying this approach to domains without comprehensive ontologies. In that setting, ontological concepts are substituted with concepts extracted from scientific abstracts, and relationships between them are established using distributional statistics.
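As an illustrative sketch only (not the paper's implementation), one common distributional statistic for relating extracted concepts is pointwise mutual information (PMI) over document-level co-occurrence. The function name, threshold, and toy corpus below are assumptions for demonstration:

```python
from collections import Counter
from itertools import combinations
import math

def pmi_relations(abstract_concepts, threshold=0.0):
    """Link concept pairs whose pointwise mutual information exceeds a threshold.

    abstract_concepts: list of concept sets, one set per abstract.
    Returns {(concept_a, concept_b): pmi_score} for pairs above the threshold.
    (Hypothetical helper; the paper does not specify its exact statistic.)
    """
    n_docs = len(abstract_concepts)
    concept_counts = Counter()
    pair_counts = Counter()
    for concepts in abstract_concepts:
        concept_counts.update(concepts)
        pair_counts.update(combinations(sorted(concepts), 2))

    relations = {}
    for (a, b), n_ab in pair_counts.items():
        # PMI = log P(a, b) / (P(a) * P(b)), estimated over abstracts
        pmi = math.log((n_ab * n_docs) / (concept_counts[a] * concept_counts[b]))
        if pmi > threshold:
            relations[(a, b)] = pmi
    return relations

# Toy corpus: each set holds concepts extracted from one (invented) abstract
corpus = [
    {"invasive species", "propagule pressure"},
    {"invasive species", "propagule pressure", "biotic resistance"},
    {"biotic resistance", "native community"},
    {"native community", "ecosystem function"},
]
```

Pairs that co-occur more often than chance (positive PMI) would then serve as relationships between concepts, standing in for the edges an ontology would otherwise provide.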