Training

RLHF

Reinforcement Learning from Human Feedback: fine-tuning a model so that its outputs align with human preferences.

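RLHF is a multi-stage alignment technique. First, human raters compare or rank model outputs by quality. Second, a reward model is trained on these rankings to predict which outputs people prefer. Finally, the language model is fine-tuned with reinforcement learning (commonly PPO) to maximize the reward model's score, usually with a penalty that keeps it close to the original model. RLHF is a large part of what makes chat models helpful, harmless, and honest rather than merely statistically fluent. Related methods include DPO (Direct Preference Optimization), which trains on the preference data directly and skips the separate reward model.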
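To make the reward model step and the DPO variant more concrete, here is a minimal sketch of the two loss functions for a single preference pair. It assumes the common pairwise (Bradley-Terry style) formulation. The function names, the scalar inputs, and the default `beta` value are illustrative choices, not part of any particular library. Real training would apply these losses to batches of model outputs inside a deep learning framework.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    # For negative x, use the equivalent form x - log(1 + exp(x)) to avoid overflow.
    return x - math.log1p(math.exp(x))


def reward_model_loss(r_chosen: float, r_rejected: float) -> float:
    """Pairwise preference loss for a reward model (illustrative).

    r_chosen and r_rejected are the scalar rewards the model assigns to the
    output the rater preferred and the one the rater rejected. The loss is
    small when the preferred output scores higher.
    """
    return -log_sigmoid(r_chosen - r_rejected)


def dpo_loss(
    logp_chosen: float,
    logp_rejected: float,
    ref_logp_chosen: float,
    ref_logp_rejected: float,
    beta: float = 0.1,  # assumed value; controls how far the policy may drift
) -> float:
    """DPO loss for one preference pair (illustrative).

    Inputs are sequence log-probabilities under the policy being trained and
    under a frozen reference model. No separate reward model is involved:
    the log-probability ratios play the role of an implicit reward.
    """
    chosen_ratio = logp_chosen - ref_logp_chosen
    rejected_ratio = logp_rejected - ref_logp_rejected
    return -log_sigmoid(beta * (chosen_ratio - rejected_ratio))


if __name__ == "__main__":
    # A reward model that ranks the pair correctly gets a lower loss.
    print(reward_model_loss(2.0, 0.0), reward_model_loss(0.0, 2.0))
    # A policy identical to the reference sits at log(2), about 0.693.
    print(dpo_loss(-5.0, -7.0, -5.0, -7.0))
```

Both losses have the same shape: the negative log-sigmoid of a margin between the preferred and the rejected output. The difference is where the margin comes from, either an explicit reward model or the policy's own log-probabilities relative to a reference.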

Related terms