<ul><li>CARL-GT is a benchmark for evaluating the causal reasoning capabilities of large language models (LLMs).</li><li>CARL-GT assesses LLMs in areas such as causal graph reasoning, knowledge discovery, and decision-making.</li><li>LLMs are found to be weak at causal reasoning, particularly when they must uncover new insights from tabular data.</li><li>LLMs show different strengths on different benchmark tasks, and their performance is correlated across tasks within a category.</li></ul>

CARL-GT: Evaluating Causal Reasoning Capabilities of Large Language Models
