Details
- Anthropic has expanded its bug bounty program to specifically target universal jailbreak attacks against its next-generation AI safety mitigations before they are deployed.
- The initiative is conducted in partnership with HackerOne and invites selected researchers to uncover vulnerabilities that could undermine safeguards in critical areas like CBRN and cybersecurity, offering rewards up to $15,000.
- Researchers will receive early access to unreleased safety systems, enabling them to probe for jailbreaks that could bypass AI guardrails across multiple domains.
- The program builds on Anthropic’s existing invite-only model and is in line with voluntary AI safety commitments made to the White House and G7.
- Universal jailbreaks are a growing concern: recent research, such as HiddenLayer's findings in April 2025, has revealed exploits that can bypass safety features in all major large language models.
Impact
By targeting universal jailbreaks, Anthropic is addressing some of the most dangerous vulnerabilities in AI systems, ones that can compromise safeguards at scale rather than in isolated cases. This proactive initiative not only aims to strengthen AI safety in high-risk domains but also sets a rigorous benchmark for competitors as the industry faces mounting evidence of systemic threats. The move could accelerate the adoption of robust pre-deployment testing standards across the AI sector.