Safety & Ethics

Alignment

The field of ensuring AI systems behave according to human values and intentions.

Alignment research addresses the challenge of making AI systems that reliably do what humans intend, are honest, and avoid harmful behavior at scale. Practical techniques include RLHF, Constitutional AI, and red-teaming. Misaligned models may be superficially helpful but subtly deceptive, sycophantic, or capable of harmful outputs under adversarial prompting.

Related Terms

RLHF

Reinforcement Learning from Human Feedback — training models to align with human preferences.

Constitutional AI

Anthropic's method of training models to self-critique and revise outputs using a set of principles.

Prompt Injection

An attack where malicious text in the environment overrides a model's instructions.

AI Safety

The discipline of building AI systems that are reliably beneficial and avoid harmful outputs.