AI Safety
The discipline of building AI systems that are reliably beneficial and avoid harmful outputs.
AI safety encompasses technical and policy efforts to ensure AI systems behave predictably, honestly, and without causing harm. It includes both near-term concerns (jailbreaks, biased outputs, prompt injection) and long-term concerns (misaligned superintelligence). Safety research is core to Anthropic's mission and influences Claude's design via Constitutional AI and RLHF.
Termes Associés
The field of ensuring AI systems behave according to human values and intentions.
Anthropic's method of training models to self-critique and revise outputs using a set of principles.
An attack where malicious text in the environment overrides a model's instructions.
Reinforcement Learning from Human Feedback — training models to align with human preferences.