Causal reasoning capabilities of large language models (LLMs) are evaluated using a benchmark named CARL-GT. CARL-GT assesses LLMs in areas such as causal graph reasoning, knowledge discovery, and decision-making. LLMs are found to be weak in causal reasoning, particularly in discovering new insights from tabular data. Different benchmark tasks reveal varying strengths of LLMs, with performance correlated among tasks within the same category.