Frontier large language models can struggle on high-school math problems that fall outside standard benchmarks. A deductive consistency metric is proposed to analyze chain-of-thought outputs from language models. The metric evaluates two abilities: understanding input premises and inferring conclusions over multiple reasoning hops. Language models are found to be robust to an increasing number of input premises, but their accuracy decays as the number of reasoning hops grows.
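The hop-decay finding can be illustrated with a minimal scoring sketch. The `hop_accuracy` helper and the sample results below are hypothetical illustrations, not the paper's actual evaluation harness:

```python
from collections import defaultdict

def hop_accuracy(results):
    """Compute per-hop-count accuracy.

    results: list of (num_hops, is_correct) pairs, one per evaluated problem.
    Returns a dict mapping hop count -> fraction of problems answered correctly.
    """
    totals = defaultdict(lambda: [0, 0])  # num_hops -> [correct, total]
    for hops, correct in results:
        totals[hops][0] += int(correct)
        totals[hops][1] += 1
    return {h: c / t for h, (c, t) in sorted(totals.items())}

# Toy data showing the qualitative trend: accuracy decays with hop count.
results = [(1, True), (1, True), (2, True), (2, False), (3, False), (3, False)]
print(hop_accuracy(results))  # {1: 1.0, 2: 0.5, 3: 0.0}
```

Grouping results by hop count rather than by premise count makes the reported decay directly visible as a monotone drop in the per-hop accuracy curve.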