Safety and Ethics

AI Safety

The discipline of building AI systems that are reliably beneficial and avoid harmful outputs.

AI safety encompasses technical and policy efforts to ensure AI systems behave predictably, honestly, and without causing harm. It includes both near-term concerns (jailbreaks, biased outputs, prompt injection) and long-term concerns (misaligned superintelligence). Safety research is core to Anthropic's mission and influences Claude's design via Constitutional AI and RLHF.

Related Terms

Alignment

The field of ensuring AI systems behave according to human values and intentions.

Constitutional AI

Anthropic's method of training models to self-critique and revise outputs using a set of principles.

Prompt Injection

An attack in which malicious text in content the model processes (for example, a web page, document, or tool output) overrides the model's intended instructions.

RLHF

Reinforcement Learning from Human Feedback: training models on human feedback so that their outputs align with human preferences.