Recent works have proposed examining the activations produced by Large Language Models (LLMs) at inference time to assess the correctness of their answers.
These works suggest that a 'geometry of truth' can be learned, where the activations associated with correct answers differ from those that produce mistakes.
However, one highlighted limitation is that these 'geometries of truth' are task-dependent and do not transfer across tasks.
Linear classifiers trained on distinct tasks show little similarity to one another, and even more sophisticated approaches fail to bridge this gap, since the activation vectors used to classify answers form separate clusters when examined across tasks.
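To make the cross-task transfer setting concrete, the following sketch trains a linear probe on activations from one task and evaluates it on another. It is an illustrative assumption, not the setup of the cited works: the activations are synthetic stand-ins, and scikit-learn's LogisticRegression is used as the linear classifier.

```python
# Illustrative sketch (assumption: a logistic-regression probe on hidden-state
# activations; synthetic data stands in for real LLM activations).
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
d_model = 768        # hidden-state dimensionality (placeholder value)
n_per_task = 500     # labeled (activation, correct/incorrect) pairs per task

def make_task_activations(center):
    """Stand-in for real activations: each task occupies its own region of
    activation space, with correct/incorrect answers separated along a
    task-specific direction."""
    labels = rng.integers(0, 2, n_per_task)
    truth_direction = rng.normal(size=d_model)      # task-specific 'truth' axis
    acts = center + rng.normal(size=(n_per_task, d_model))
    acts += np.outer(labels * 2.0 - 1.0, truth_direction)
    return acts, labels

# Two tasks whose activations form separate clusters.
acts_a, y_a = make_task_activations(center=rng.normal(size=d_model) * 5.0)
acts_b, y_b = make_task_activations(center=rng.normal(size=d_model) * 5.0)

# Train a linear probe on task A, then score it in-task (on its training data,
# for brevity) and on the held-out task B.
probe = LogisticRegression(max_iter=1000).fit(acts_a, y_a)
print("in-task accuracy (A -> A):   ", probe.score(acts_a, y_a))
print("cross-task accuracy (A -> B):", probe.score(acts_b, y_b))
```

Under these synthetic assumptions the probe is near-perfect in-task but close to chance on the other task, mirroring the lack of transfer described above; in practice the stand-in arrays would be replaced by activations extracted from a hidden layer of the model under study.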