Language models need to be trustworthy and reliable as they become more powerful and sophisticated.
Researchers trained a Taboo model to harbor specific hidden knowledge that is never stated explicitly in its training data or prompts, creating a testbed for eliciting that knowledge.
They evaluated non-interpretability (black-box) approaches and applied mechanistic interpretability techniques, such as the logit lens and sparse autoencoders, to uncover the hidden knowledge.
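As a rough illustration of the logit-lens idea, the sketch below projects each layer's hidden state through the model's final layer norm and unembedding matrix to read off which tokens intermediate layers favor; if the hidden knowledge (e.g., a secret word) surfaces in these intermediate predictions before the output layer, that is a signal the technique can exploit. This is a minimal sketch assuming a HuggingFace-style GPT-2 checkpoint as a stand-in, not the actual Taboo model or the authors' exact pipeline; the prompt and model name are placeholders.

```python
# Minimal logit-lens sketch (illustrative only; not the original authors' code).
# Assumes a HuggingFace-style causal LM; "gpt2" is a placeholder checkpoint.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder, not the Taboo model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

prompt = "Give me a hint about your secret word."  # hypothetical prompt
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_hidden_states=True)

# hidden_states: tuple of (num_layers + 1) tensors, each [batch, seq, d_model]
final_norm = model.transformer.ln_f  # GPT-2's final layer norm (name varies by architecture)
unembed = model.lm_head              # unembedding matrix shared with the output head

for layer_idx, h in enumerate(outputs.hidden_states):
    # Project the last token position of each layer into vocabulary space.
    # (The last hidden state may already include the final norm, depending on the model.)
    logits = unembed(final_norm(h[:, -1, :]))
    top_tokens = logits.topk(5, dim=-1).indices[0]
    print(f"layer {layer_idx:2d}:", tokenizer.convert_ids_to_tokens(top_tokens.tolist()))
```

In practice, one would scan these per-layer predictions for the concealed token across many prompts; sparse-autoencoder approaches instead look for interpretable latent features that activate on the hidden concept.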
The findings suggest promising avenues for future work in eliciting hidden knowledge from language models to ensure their safe deployment.