The study explores the Linear Representation Transferability (LRT) Hypothesis, which builds on the observation that neural networks with similar architectures trained on the same data tend to learn shared representations relevant to the learning task.
It extends this idea by proposing that the representations learned by different models trained on the same data can each be expressed as linear combinations of a universal set of basis features.
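To see why a shared basis would imply such a cross-model map, here is a minimal sketch; the symbols $z$, $W_A$, $W_B$, $b_A$, $b_B$ below are illustrative and not the study's notation. If each model's hidden state is an affine function of the same universal feature vector $z$, then composing one decoding with the other's pseudoinverse yields an affine map between the two representation spaces:

```latex
h_A = W_A z + b_A, \qquad h_B = W_B z + b_B
\;\;\Longrightarrow\;\;
h_B \approx \underbrace{W_B W_A^{+}}_{A}\, h_A \;+\; \underbrace{b_B - W_B W_A^{+} b_A}_{b},
```

where $W_A^{+}$ denotes a left pseudoinverse, so the map $h_A \mapsto A h_A + b$ is affine.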
Concretely, the hypothesis posits that an affine transformation exists between the representation spaces of different models, so that steering vectors retain their semantic effect when transferred from small to large language models.
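A minimal sketch of how such a transfer could be implemented, assuming paired hidden states from the small and large model on a shared prompt set are already available (the commented-out helper `collect_activations` and all variable names are hypothetical, not the study's code):

```python
import numpy as np

def fit_affine_map(H_small, H_large, reg=1e-3):
    """Fit an affine map h_large ≈ A @ h_small + b by ridge-regularized
    least squares on paired hidden states (same inputs through both models).

    H_small: (n_samples, d_small), H_large: (n_samples, d_large)
    Returns A with shape (d_large, d_small) and b with shape (d_large,).
    """
    n, d_small = H_small.shape
    # Append a constant column so the bias b is learned jointly with A.
    X = np.hstack([H_small, np.ones((n, 1))])
    # Ridge solution: W = (X^T X + reg*I)^{-1} X^T Y, shape (d_small+1, d_large).
    gram = X.T @ X + reg * np.eye(d_small + 1)
    W = np.linalg.solve(gram, X.T @ H_large)
    A, b = W[:-1].T, W[-1]
    return A, b

def transfer_steering_vector(v_small, A):
    """A steering vector is a *difference* of hidden states, so under an
    affine map T(h) = A h + b the bias cancels: T(h + v) - T(h) = A v."""
    return A @ v_small

# Hypothetical usage with activations collected on a shared prompt set:
# H_small, H_large = collect_activations(prompts)   # placeholder helper
# A, b = fit_affine_map(H_small, H_large)
# v_large = transfer_steering_vector(v_small, A)    # add v_large to the large
#                                                   # model's hidden states
```

Because a steering vector is a difference of hidden states rather than an absolute position in representation space, only the linear part A of the fitted affine map is needed to move it between models; the offset b drops out.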
Empirically, the study finds that such affine mappings can preserve steering behavior, indicating that representations learned by small models can be used to guide the behavior of larger ones, in support of the LRT hypothesis.