A new study presents the Junk DNA Hypothesis, which concerns the small-magnitude pre-trained weights of large language models such as GPT-3. The hypothesis challenges the common belief that these small weights are redundant and can be pruned without hurting performance, arguing instead that they encode knowledge essential for solving harder downstream tasks.
Removing these seemingly insignificant weights causes an irreversible loss of that knowledge and a performance drop on difficult tasks that cannot be recovered even with continued training.
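To make the pruning operation concrete, here is a minimal sketch of global magnitude pruning in PyTorch. The function name, the `sparsity` parameter, and the threshold selection are illustrative assumptions, not the study's exact procedure.

```python
import torch

def magnitude_prune(model: torch.nn.Module, sparsity: float = 0.5) -> None:
    """Zero out the smallest-magnitude weights across the whole model.

    `sparsity` is the fraction of weights to remove; under the Junk DNA
    Hypothesis, removing them should hurt hard tasks the most.
    """
    # Gather the absolute values of every weight parameter.
    all_weights = torch.cat([
        p.detach().abs().flatten()
        for name, p in model.named_parameters()
        if name.endswith("weight")
    ])
    # Magnitude threshold below which weights are treated as "junk DNA".
    k = max(1, int(sparsity * all_weights.numel()))
    threshold = all_weights.kthvalue(k).values

    with torch.no_grad():
        for name, p in model.named_parameters():
            if name.endswith("weight"):
                p.mul_((p.abs() > threshold).to(p.dtype))  # zero the small ones

# Example on a toy model: prune 60% of the weights.
model = torch.nn.Sequential(torch.nn.Linear(16, 32), torch.nn.ReLU(), torch.nn.Linear(32, 4))
magnitude_prune(model, sparsity=0.6)
```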
According to the study, quantization, another common compression method, does not exhibit the same effect as weight pruning in exposing task-difficulty information. Extensive experiments support the Junk DNA Hypothesis.
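For contrast, below is a minimal sketch of symmetric round-to-nearest quantization, a simple illustrative stand-in assuming per-tensor scaling; the study's exact quantization schemes may differ. Unlike pruning, small weights are rescaled to a coarser grid rather than deliberately zeroed, one plausible reason quantization behaves differently.

```python
import torch

def quantize_dequantize(w: torch.Tensor, bits: int = 8) -> torch.Tensor:
    """Symmetric round-to-nearest quantization of a weight tensor.

    Weights are snapped to a low-precision grid rather than removed,
    so the information carried by small-magnitude weights is largely
    preserved instead of being wiped out.
    """
    qmax = 2 ** (bits - 1) - 1            # e.g. 127 for int8
    scale = w.abs().max() / qmax          # per-tensor scale factor
    q = torch.clamp(torch.round(w / scale), -qmax, qmax)
    return q * scale                      # dequantize back to float

# Small weights lose precision but mostly survive:
w = torch.tensor([0.01, -0.02, 0.5, -1.0])
print(quantize_dequantize(w, bits=8))
```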