Alignment
The field of ensuring AI systems behave according to human values and intentions.
Alignment research addresses the challenge of building AI systems that reliably do what humans intend, are honest, and avoid harmful behavior, including when deployed at scale. Practical techniques include reinforcement learning from human feedback (RLHF), Constitutional AI, and red-teaming. A misaligned model may appear helpful on the surface while being subtly deceptive or sycophantic, or it may produce harmful outputs under adversarial prompting.
Related Terms
RLHF: Reinforcement Learning from Human Feedback, a technique for training models to align with human preferences.
Constitutional AI: Anthropic's method of training models to self-critique and revise outputs using a set of principles.
Prompt Injection: An attack where malicious text in the environment overrides a model's instructions.
AI Safety: The discipline of building AI systems that are reliably beneficial and avoid harmful outputs.