Large Language Models (LLMs) are unreliable for Cyber Threat Intelligence (CTI) tasks.
This work presents an evaluation methodology for testing LLMs on CTI tasks under three settings: zero-shot learning, few-shot learning, and fine-tuning.
Experiments with three state-of-the-art LLMs and a dataset of 350 threat intelligence reports revealed potential security risks in relying on LLMs for CTI.
Even with few-shot learning and fine-tuning, the LLMs showed insufficient performance on real-size reports, along with inconsistent answers across repeated queries and overconfidence in their predictions.
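The inconsistency finding above can be illustrated with a minimal sketch: query a model several times on the same report and measure how often the answers agree. The `query_llm` function here is a hypothetical stand-in (not the paper's actual setup); a real evaluation would prompt an actual LLM with a zero-shot or few-shot CTI prompt.

```python
import random

def query_llm(report: str, seed: int) -> tuple[str, float]:
    """Hypothetical stand-in for an LLM call.

    Returns (predicted_label, self-reported confidence). The real
    evaluation would send the report to a model; here we simulate a
    model that answers somewhat randomly yet always reports high
    confidence, mimicking the overconfidence the study observed.
    """
    rng = random.Random(seed)
    label = rng.choice(["malware campaign", "phishing campaign"])
    confidence = rng.uniform(0.9, 1.0)  # high confidence regardless of accuracy
    return label, confidence

def consistency(report: str, runs: int = 5) -> float:
    """Fraction of repeated queries that agree with the majority answer.

    A value near 1.0 means the model answers consistently; values near
    the chance level signal the inconsistency problem.
    """
    answers = [query_llm(report, seed)[0] for seed in range(runs)]
    majority = max(set(answers), key=answers.count)
    return answers.count(majority) / runs

report = "APT group used spear-phishing emails with malicious macros."
print(f"consistency over 5 runs: {consistency(report):.2f}")
```

With five binary answers the majority fraction is always at least 0.6, so scores hovering near that floor indicate near-random behavior, while a reliable model would score close to 1.0.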