Large language models (LLMs) are increasingly used in molecular science to support scientific discovery.
A new framework, CLEANMOL, has been introduced to improve LLMs' understanding of molecular structures encoded as SMILES strings.
CLEANMOL formulates SMILES parsing as a set of clean, deterministic tasks designed to enhance graph-level molecular comprehension.
Results show that pre-training LLMs on tasks from the CLEANMOL framework improves their structural comprehension, and that the resulting models perform competitively on the Mol-Instructions benchmark.
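
To make the idea of a clean, deterministic SMILES parsing task concrete, the following is a minimal sketch and not the CLEANMOL implementation. It assumes a hypothetical ring-counting task in which the answer is computed by rule from the ring-closure labels in the SMILES string. The function names `count_ring_closures` and `make_example`, as well as the prompt wording, are illustrative assumptions.

```python
def count_ring_closures(smiles: str) -> int:
    """Count ring-closure bonds in a SMILES string.

    Each matched pair of ring-closure labels (a single digit, or '%'
    followed by two digits) closes one ring bond. Digits inside
    bracket atoms such as [13CH4] are isotopes, hydrogen counts, or
    charges and are ignored.
    """
    open_labels = set()
    closures = 0
    in_bracket = False
    i = 0
    while i < len(smiles):
        ch = smiles[i]
        if ch == "[":
            in_bracket = True
        elif ch == "]":
            in_bracket = False
        elif not in_bracket:
            label = None
            if ch == "%":
                # Two-digit ring-closure label, e.g. %10
                label = smiles[i + 1:i + 3]
                if len(label) != 2 or not label.isdigit():
                    raise ValueError("malformed %nn ring-closure label")
                i += 2
            elif ch.isdigit():
                label = ch
            if label is not None:
                if label in open_labels:
                    # Second occurrence closes the ring bond
                    open_labels.remove(label)
                    closures += 1
                else:
                    # First occurrence opens the ring bond
                    open_labels.add(label)
        i += 1
    if open_labels:
        raise ValueError("unmatched ring-closure label(s)")
    return closures


def make_example(smiles: str) -> dict:
    """Build one hypothetical instruction/answer pair (format is an assumption)."""
    return {
        "instruction": f"How many ring-closure bonds does the SMILES {smiles} contain?",
        "answer": str(count_ring_closures(smiles)),
    }


if __name__ == "__main__":
    for s in ["CCO", "c1ccccc1", "c1ccc2ccccc2c1"]:
        print(make_example(s))
```

Because the label is derived by a fixed rule from the string itself, examples of this kind can be generated at scale without manual annotation and checked exactly.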