Post-training is essential for the success of large language models (LLMs), transforming pre-trained base models into more useful and aligned post-trained models.
This paper compares base and post-trained LLMs from four perspectives to understand how post-training changes models internally.
Findings reveal that post-training preserves the storage locations of factual knowledge, adapts knowledge representations inherited from the base model, and develops new knowledge representations.
Truthfulness interventions transfer effectively between the base and post-trained models, whereas refusal interventions show limited forward transferability from the base model to the post-trained model.