techminis

A naukri.com initiative

Medium · 1d read

Image Credit: Medium

LLM thought experiment — 4 — alignment through interpretability

  • LLMs are essentially language pattern statistics that require alignment to specific tasks through human supervision and fine-tuning methods like DPO (Direct Preference Optimization).
  • Fine-tuning, although often effective, is not foolproof and may not always align LLMs with human values as intended.
  • Methods like DPO reinforce LLMs to produce specific responses verbatim, akin to teaching a child certain behaviors through repetition.
  • Aligning LLMs to human values requires understanding where and how these values are stored in the network, a challenging task.
  • Mechanistic interpretability suggests training sparse autoencoders to decompose activations, making it possible to investigate how changing specific neurons affects specific concepts within LLMs.
  • Finding neurons related to abstract concepts like 'expected human suffering' is possible, but identifying specific ones remains a challenge.
  • The proposed approach identifies neurons with significant influence on specific human values while minimizing their influence on contrary values.
  • Defining a function f that boosts the influence of desired values without reinforcing undesired ones is crucial for true alignment.
  • Deriving the gradient of f enables the creation of an 'injection vector' that pushes the network towards alignment with human values.
  • Achieving true alignment with human values in LLMs requires deeply exploring individual and cultural values and injecting them into the model.
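The sparse-autoencoder idea above can be sketched as follows. This is a minimal toy, not the article's implementation: the activations are synthetic stand-ins for residual-stream activations captured from an LLM layer, and the dimensions, learning rate, and L1 weight are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for activations captured from one LLM layer.
d_model, d_hidden, n = 16, 64, 512
activations = rng.normal(size=(n, d_model))

# Sparse autoencoder parameters (overcomplete hidden layer).
W_enc = rng.normal(scale=0.1, size=(d_model, d_hidden))
b_enc = np.zeros(d_hidden)
W_dec = rng.normal(scale=0.1, size=(d_hidden, d_model))
b_dec = np.zeros(d_model)

l1, lr = 1e-3, 1e-2  # sparsity weight and learning rate (illustrative)
for step in range(200):
    # Forward: sparse code via ReLU, then linear reconstruction.
    pre = activations @ W_enc + b_enc
    code = np.maximum(pre, 0.0)
    recon = code @ W_dec + b_dec
    err = recon - activations

    # Backward: squared reconstruction error + L1 penalty on the code.
    g_recon = 2 * err / n
    g_code = g_recon @ W_dec.T + l1 * np.sign(code)
    g_pre = g_code * (pre > 0)

    W_dec -= lr * code.T @ g_recon
    b_dec -= lr * g_recon.sum(axis=0)
    W_enc -= lr * activations.T @ g_pre
    b_enc -= lr * g_pre.sum(axis=0)

loss = np.mean((recon - activations) ** 2)
sparsity = (code > 0).mean()
print(f"reconstruction MSE: {loss:.4f}, active units: {sparsity:.2%}")
```

The L1 penalty is what makes individual hidden units tend to stand for individual concepts: each hidden unit of a trained SAE can then be inspected, and ablating or boosting it probes which concept it encodes.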

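The function f and its 'injection vector' can be sketched like this. All names here are hypothetical: `w_desired` and `w_undesired` stand for probe directions that score how strongly a desired or undesired value is expressed in a hidden state, and `lam` and `alpha` are illustrative trade-off and strength parameters, not values from the article.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 8  # hypothetical hidden-state dimension

# Hypothetical probe directions: unit vectors whose dot product with a
# hidden state estimates how strongly a value is expressed.
w_desired = rng.normal(size=d)
w_desired /= np.linalg.norm(w_desired)
w_undesired = rng.normal(size=d)
w_undesired /= np.linalg.norm(w_undesired)

lam = 1.0  # trade-off: boost desired vs. suppress undesired

def f(h):
    """Scores a hidden state: high when the desired value dominates."""
    return w_desired @ h - lam * (w_undesired @ h)

# f is linear in h, so its gradient is this constant vector;
# normalising it gives the 'injection vector'.
grad_f = w_desired - lam * w_undesired
injection = grad_f / np.linalg.norm(grad_f)

h = rng.normal(size=d)  # a hidden state from some forward pass
alpha = 0.5             # injection strength
h_steered = h + alpha * injection  # push the state along grad f

print(f"f before: {f(h):.3f}, after: {f(h_steered):.3f}")
```

Because the step follows the gradient of f, f(h_steered) is always at least f(h) plus alpha times the gradient's norm, so the desired-value score rises while the undesired direction is actively pushed down via the lam term.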