Large Language Models, particularly reasoning models, have shown improved abilities in advanced problem-solving domains like mathematics and software engineering.
A novel benchmark, ChemIQ, was created to assess how well reasoning models perform chemistry tasks directly, without external assistance.
Reasoning models such as OpenAI's o3-mini correctly answered 28% to 59% of the ChemIQ questions, with accuracy rising as the reasoning level increased.
These models surpassed the non-reasoning model GPT-4o and demonstrated capabilities such as converting SMILES strings to IUPAC names and elucidating structures from NMR data.
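
To make the accuracy figures above concrete, the following is a minimal sketch of how a "fraction of questions answered correctly" score could be computed for short-answer questions. It is an illustration under stated assumptions, not the ChemIQ scoring procedure: the `normalize` and `accuracy` helpers and the toy question data are hypothetical, and simple exact matching on normalized text is assumed.

```python
def normalize(answer: str) -> str:
    """Lowercase and collapse whitespace so trivial formatting differences
    are not counted as errors. Assumption: answers are chemical names;
    case folding would NOT be safe for SMILES strings, where case carries
    meaning (for example, aromatic 'c' versus aliphatic 'C')."""
    return " ".join(answer.strip().lower().split())


def accuracy(predictions: list[str], references: list[str]) -> float:
    """Return the fraction of predictions that exactly match their reference
    after normalization."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        return 0.0
    correct = sum(
        normalize(p) == normalize(r) for p, r in zip(predictions, references)
    )
    return correct / len(references)


# Hypothetical toy data, not taken from the benchmark.
references = ["ethanol", "benzene", "propan-2-ol", "acetic acid"]
predictions = ["Ethanol", "toluene", "propan-2-ol ", "ethanoic acid"]

score = accuracy(predictions, references)
print(f"{score:.0%}")  # prints "50%"
```

In this sketch "ethanoic acid" is scored as wrong even though it names the same compound as "acetic acid". Exact string matching cannot recognize synonyms, so a real evaluation of chemical names or structures would likely need to compare the underlying molecules instead of the text.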