At their core, LLMs are statistical models of language patterns; aligning them to specific tasks requires human supervision and fine-tuning methods such as DPO (Direct Preference Optimization).
Fine-tuning, although often effective, is not foolproof: it does not guarantee that the resulting model is aligned with human values as intended.
Methods like DPO reinforce LLMs to prefer specific responses over rejected alternatives, akin to teaching a child certain behaviors through repetition.
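To make the repetition analogy concrete, here is a minimal sketch of the DPO loss for a single (chosen, rejected) response pair; the summed log-probability inputs and the `beta` value are illustrative placeholders, not taken from any particular implementation.

```python
import numpy as np

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for one preference pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model. The loss falls as the
    policy shifts probability toward the chosen response relative to the
    reference -- i.e. as the preferred behavior is "reinforced".
    """
    # Log-ratio of policy to reference, per response
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    # Logistic (Bradley-Terry style) loss on the scaled margin
    margin = beta * (chosen_ratio - rejected_ratio)
    return -np.log(1.0 / (1.0 + np.exp(-margin)))  # -log(sigmoid(margin))
```

A policy that favors the chosen response (relative to the reference) gets a lower loss than one that favors the rejected response, which is exactly the gradient signal that "teaches by repetition".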
Aligning LLMs to human values requires understanding where and how these values are stored in the network, a challenging task.
Mechanistic interpretability offers one route: train sparse autoencoders on a model's internal activations to decompose them into interpretable features, then perturb specific neurons or features to test how particular concepts change.
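The idea can be sketched as follows, with an untrained toy sparse autoencoder over a hypothetical residual-stream activation; the dimensions, weights, and the ablation target are illustrative assumptions (a real SAE is trained with a reconstruction loss plus an L1 sparsity penalty, and is far larger).

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, d_features = 16, 64  # toy sizes; real SAEs use thousands of features

# Untrained encoder/decoder weights of a sparse autoencoder (illustrative)
W_enc = rng.normal(0.0, 0.1, (d_model, d_features))
b_enc = np.zeros(d_features)
W_dec = rng.normal(0.0, 0.1, (d_features, d_model))

def sae_encode(activation):
    # ReLU keeps only positively-firing features, encouraging sparse codes
    return np.maximum(0.0, activation @ W_enc + b_enc)

def sae_decode(features):
    # Reconstruct the model activation from the feature code
    return features @ W_dec

# A stand-in for one LLM activation vector
act = rng.normal(size=d_model)
feats = sae_encode(act)

# Intervention: zero out the strongest feature and decode back, to test
# what concept that feature causally contributes to the activation
ablated = feats.copy()
ablated[feats.argmax()] = 0.0
patched_act = sae_decode(ablated)
```

In practice `patched_act` would be written back into the forward pass, and the change in model behavior indicates what the ablated feature encodes.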
Neurons correlated with abstract concepts like 'expected human suffering' can in principle be found, but reliably pinning down the specific ones remains a challenge.
A proposed new approach: identify neurons that exert significant influence on a specific human value while exerting minimal influence on contrary values.
The key step is constructing a function f that scores how strongly the network expresses the desired values while penalizing expression of the undesired ones; boosting f without side effects is what true alignment would require.
Taking the gradient of f with respect to the network's activations yields an 'injection vector' that, when added to those activations, pushes the network toward the desired human values.
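A minimal sketch of this gradient-based injection, assuming f is a simple linear probe score (desired-value direction minus undesired-value direction); the probe vectors `w_desired` and `w_undesired` and the scale `alpha` are hypothetical stand-ins, not learned from any model.

```python
import numpy as np

rng = np.random.default_rng(1)
d_model = 16  # toy hidden-state size

# Hypothetical linear probes: dot products with a hidden state estimate how
# strongly it expresses the desired vs. the contrary value
w_desired = rng.normal(size=d_model)
w_undesired = rng.normal(size=d_model)

def f(h):
    # Objective: reward the desired value, penalize the contrary one
    return w_desired @ h - w_undesired @ h

# For this linear f, the gradient w.r.t. the hidden state is constant --
# this is the 'injection vector'
grad_f = w_desired - w_undesired

def inject(h, alpha=1.0):
    # Push the hidden state along the gradient of f
    return h + alpha * grad_f

h = rng.normal(size=d_model)
h_steered = inject(h, alpha=0.5)
```

Adding the vector provably increases f for any positive `alpha` in this linear case; for a real, nonlinear f the gradient would be recomputed per activation via backpropagation.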
Achieving true alignment with human values in LLMs therefore requires a deep exploration of individual and cultural values, together with a mechanism for injecting those values into the model.