AI Red Teaming & Adversarial Testing: NIST AI RMF Govern 1.7, MITRE ATLAS, EU AI Act Article 9, and White House EO Requirements
AI Red Teaming Reference
NIST AI RMF GOVERN 1.7: Red-Teaming as a Governance Requirement
The NIST AI Risk Management Framework (AI RMF 1.0, January 2023) organizes AI risk management into four functions: Govern, Map, Measure, and Manage. GOVERN 1.7 specifically addresses red-teaming: "Processes and procedures are in place for conducting regular assessments and reviews of AI system behavior, including mechanisms for testing and evaluation of AI systems, for identifying and mitigating risks associated with adversarial manipulation of AI systems." GOVERN 1.7 makes red-teaming an organizational governance practice, not just a one-time technical exercise.
The NIST AI RMF Playbook provides additional guidance on implementing GOVERN 1.7: organizations should establish a red team with diverse expertise (technical AI security, domain expertise in the application area, and ethical/social science perspectives), define a testing scope that reflects the AI system's actual use cases and likely adversarial contexts, document findings and remediations in a format that enables tracking over time, and integrate red-teaming results into the AI risk register maintained under GOVERN 4.1 (AI risk is integrated into enterprise risk management processes).
NIST AI RMF also connects red-teaming to the MEASURE function. MEASURE 2.5 requires that "the AI system to be deployed has been tested to evaluate performance across the range of expected use cases and conditions," while MEASURE 2.6 requires testing for "harmful and unintended consequences." Together with GOVERN 1.7, these criteria establish a testing program architecture: red-teaming to identify vulnerabilities (GOVERN 1.7), measurement of AI performance under adversarial conditions (MEASURE 2.5/2.6), and integration of findings into ongoing risk management (MANAGE function). Organizations seeking to align with the NIST AI RMF — increasingly referenced in financial services regulatory guidance and state AI regulations — need a documented red-teaming program that spans all four functions.
GOVERN 1.7 — Adversarial Testing
Requires processes for regular AI behavior assessments including adversarial manipulation testing. Must be documented, recurring, and integrated into AI governance program. Finding and remediation tracking is required — not just one-time testing.
MEASURE 2.5 — Use Case Testing
Requires AI system performance testing across the full range of expected use cases before deployment. For enterprise AI, this means testing with the actual data types, query patterns, and user populations the AI will encounter in production — not just synthetic test cases.
MANAGE 2.4 — Risk Response
Requires that identified AI risks — including those surfaced by red-teaming — have documented response plans. Findings from red-team exercises must flow into the AI risk register and trigger risk treatment decisions: accept, mitigate, transfer, or avoid.
MITRE ATLAS: Adversarial ML Tactics, Techniques, and Procedures
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is MITRE's knowledge base of adversarial tactics and techniques against AI systems, modeled after MITRE ATT&CK for traditional cybersecurity. ATLAS documents real-world attacks on AI systems, organizing them into a matrix of tactics (the attacker's goal) and techniques (how they achieve it). For enterprise AI red teams, ATLAS provides a structured vocabulary for planning test cases and documenting findings.
ATLAS Tactic: ML Reconnaissance — Attackers gather information about the target AI system before attacking. Techniques include: AML.T0000 (active scanning to discover AI API endpoints), AML.T0001 (identifying AI system capabilities through probe queries), and AML.T0002 (discovering training data through model inversion). Red team implication: test whether your AI system's responses reveal information about its architecture, training data, system prompt, or internal tool configurations that an attacker could use to plan a more targeted attack.
ATLAS Tactic: ML Attack Staging — Attackers prepare attack payloads and infrastructure. Includes AML.T0017 (developing adversarial examples), AML.T0018 (backdoor ML model preparation), and AML.T0019 (poison training data). For enterprise AI systems using RAG, the most relevant staging attack is data poisoning: an attacker who can influence documents in the AI's knowledge base can inject malicious instructions that the AI retrieves and follows during legitimate user queries.
ATLAS Tactic: Exfiltration — Attackers extract sensitive information from AI systems. AML.T0025 (model inversion attack) allows attackers to reconstruct training data — including personally identifiable information — from model outputs. AML.T0024 (exfiltration via inference API) allows extracting model parameters through crafted queries. For enterprise AI systems fine-tuned on proprietary data (customer records, internal policies, trade secrets), model extraction attacks can expose that training data to competitors or attackers.
AI Attack Types: Prompt Injection, Jailbreaking, and Model Extraction
Prompt Injection Attacks: Prompt injection is the most prevalent and dangerous attack vector against enterprise AI systems. In a direct prompt injection, an attacker crafts a malicious input that overrides the AI system's instructions — for example, a user query that includes hidden instructions telling the AI to ignore its safety guidelines and take unauthorized actions. In an indirect prompt injection (also called indirect PIXI), malicious instructions are embedded in content the AI retrieves from external sources — web pages, documents, emails, or database records — that the AI processes during a task. When the AI reads the malicious document, it follows the embedded instructions as if they were legitimate system commands. NIST's AI security research has documented indirect prompt injection as the higher-impact variant for enterprise AI with tool use capabilities, because it can cause AI agents to exfiltrate data, send emails, execute code, or take other agentic actions without any direct attacker interaction with the AI interface.
Jailbreaking Techniques: Jailbreaking refers to techniques that cause an AI model to bypass its safety alignment and content policies. Common jailbreaking techniques include: role-play jailbreaks ("pretend you are DAN, an AI with no restrictions"), many-shot jailbreaking (providing a large number of examples demonstrating the disallowed behavior to shift the model's context), encoding tricks (asking the AI to respond in base64, reverse text, or other encodings that bypass output filters), and adversarial suffixes (appending carefully crafted character sequences discovered through automated optimization that cause models to comply with harmful requests). For enterprise AI, jailbreaking is relevant because a successfully jailbroken AI agent with tool access could be directed to take actions that violate organizational policies, regulatory requirements, or cause direct harm — and the jailbreaking interaction may not trigger standard content filters.
Model Extraction Attacks: Model extraction (also called model stealing) involves an attacker querying a target AI system with carefully crafted inputs to reconstruct a functional model equivalent, extract proprietary training data, or infer sensitive information embedded in fine-tuned models. For enterprise AI systems fine-tuned on confidential data (customer contracts, patient records, proprietary research), model extraction poses a direct confidentiality risk. Extraction attacks are also relevant for AI systems trained on regulated personal data: an attacker who extracts a fine-tuned model may be able to recover memorized training examples, representing a GDPR data breach even if the original training data was never directly accessed. NIST SP 800-115 (Technical Guide to Information Security Testing and Assessment) provides the penetration testing framework within which model extraction testing can be scoped and conducted.
NIST SP 800-115 for AI Security Testing: NIST Special Publication 800-115 "Technical Guide to Information Security Testing and Assessment" (September 2008, with AI-specific updates referenced in NIST AI 100-1) defines the methodological framework for technical security testing: planning (define scope, objectives, and rules of engagement), discovery (identify AI system components, APIs, data flows), attack (execute test cases against identified attack surfaces), and reporting (document findings with evidence, severity ratings, and remediation guidance). For AI red teaming, the discovery phase must specifically enumerate: all AI API endpoints, all tool integrations, all data sources the AI can access, all output channels, and all administrative interfaces. The attack phase should systematically test all MITRE ATLAS relevant techniques for the system's capability profile.
Regulatory Requirements: EU AI Act Article 9 and Responsible Scaling
EU AI Act Article 9 — Risk Management System: Article 9 of the EU AI Act requires that providers of high-risk AI systems implement a risk management system that is a continuous iterative process throughout the entire lifecycle of the AI system. Critically, Article 9(4) specifies that risk management measures must include testing procedures to ensure that the AI system performs consistently with its intended purpose and meets the requirements of the Regulation. Article 9(5) further requires that testing shall be performed before placing the AI system on the market or into service, and may also be performed throughout the lifecycle. The specific mention of testing procedures in the context of risk management — combined with the Article 9(4)(b) requirement to identify known and foreseeable risks — creates a direct mandate for adversarial testing of AI systems as part of EU AI Act compliance. Conformity assessment bodies evaluating high-risk AI systems will examine whether red-teaming or adversarial testing was conducted and documented as part of the risk management system.
Anthropic Responsible Scaling Policy (RSP): Anthropic's Responsible Scaling Policy — a voluntary but influential framework — establishes the Responsible Scaling Commitment: the policy that Anthropic will not train or deploy AI models above certain capability thresholds unless specific safety measures are in place. The RSP defines AI Safety Levels (ASL) and requires evaluations — including red-team testing — to determine whether models have crossed capability thresholds that trigger additional safety requirements. For enterprise AI deployers using Anthropic's models (Claude), the RSP is relevant because it defines the adversarial testing regime that the underlying model has been subjected to. Enterprise security teams should request RSP-aligned evaluation documentation from AI providers as part of vendor security assessments — understanding what adversarial testing the AI model itself has undergone is a prerequisite for understanding what residual risks the enterprise deployment inherits.
Air Canada AI Chatbot Liability (2024)
The British Columbia Civil Resolution Tribunal ruled that Air Canada was liable for its AI chatbot providing incorrect bereavement fare information. The ruling established that organizations cannot disclaim responsibility for AI system outputs — and that AI systems that had not been tested for factual accuracy and manipulation resistance create direct legal liability for the deploying organization.
Samsung ChatGPT Data Leak (2023)
Samsung engineers inadvertently uploaded proprietary semiconductor design information to ChatGPT during a debugging session. The incident — a form of data exfiltration through AI interaction — demonstrated that adversarial testing must include insider threat scenarios and data handling edge cases, not just external attacker simulations.
GPT-4 Prompt Injection in Enterprise Deployments
Multiple documented cases in 2023-2024 demonstrated that enterprise AI assistants with email, calendar, and CRM tool access could be manipulated via prompt injection in incoming emails to forward sensitive information to attackers. These real-world incidents — not theoretical attack scenarios — drove regulatory guidance requiring adversarial testing for agentic AI deployments.
AI Red Teaming Program Checklist
- Establish a formal AI red team program per NIST AI RMF GOVERN 1.7Document the red team charter, scope, methodology, and frequency; assign ownership to a named security team; integrate findings into the AI risk register; schedule recurring exercises (minimum annually, quarterly for high-risk AI)
- Map AI attack surface using MITRE ATLAS taxonomyEnumerate all AI system components, APIs, tool integrations, data sources, and output channels; map each to relevant MITRE ATLAS tactics and techniques; prioritize test cases by risk to your specific deployment context
- Conduct prompt injection testing (direct and indirect)Test direct prompt injection against all user-facing AI interfaces; test indirect prompt injection via all data sources the AI ingests (emails, documents, web content, database records); document results with evidence and severity ratings
- Test jailbreaking resistance across known technique categoriesTest role-play jailbreaks, many-shot jailbreaks, encoding-based bypasses, and adversarial suffix techniques; document the AI system's resistance profile and any successful bypasses requiring remediation
- Conduct model extraction and data leakage testingTest for training data memorization via targeted queries; assess model extraction risk for fine-tuned models containing sensitive data; test for system prompt extraction attacks that reveal internal AI configuration
- Document adversarial testing for EU AI Act Article 9 complianceMaintain written records of all adversarial testing conducted as part of the risk management system required by Article 9; include testing scope, methodology, findings, severity, and remediation status in AI conformity documentation
- Follow NIST SP 800-115 methodology for structured AI security testingApply SP 800-115 testing phases: planning (scope and rules of engagement), discovery (AI attack surface mapping), attack (systematic ATLAS technique testing), and reporting (findings with CVSS-equivalent AI severity ratings)
- Test AI agent tool use under adversarial conditionsSpecifically red-team AI agent tool invocations: can prompt injection cause the AI to call tools with unauthorized parameters? Can an attacker cause the AI agent to take actions not intended by the system designer? Document all tool abuse scenarios discovered
- Integrate red team findings into vulnerability managementTrack AI red team findings in the enterprise vulnerability management system; assign severity ratings, owners, and remediation deadlines; verify remediation before closing findings; include AI vulnerabilities in executive risk reporting
- Assess AI provider red-teaming evidence during vendor security reviewsRequest red-teaming documentation from AI model providers (Anthropic RSP evaluations, OpenAI safety assessments); include AI adversarial testing documentation in third-party risk management program assessments
Frequently Asked Questions
What is AI red teaming and how does it differ from traditional penetration testing?
Traditional penetration testing (governed by NIST SP 800-115) tests the security of IT infrastructure, applications, and networks by simulating attacker exploitation of technical vulnerabilities — SQL injection, authentication bypasses, privilege escalation. AI red teaming extends this to include attacks specific to AI systems: prompt injection (manipulating AI behavior through crafted inputs), jailbreaking (bypassing AI safety alignments), model extraction (stealing trained model capabilities or training data), and data poisoning (corrupting AI knowledge bases). AI red teaming also has a safety/alignment dimension that traditional pen testing lacks: teams assess whether the AI system can be manipulated to produce harmful, biased, or policy-violating outputs — not just whether it can be compromised as a technical system. NIST AI RMF GOVERN 1.7 requires both security-focused adversarial testing and safety-focused behavioral evaluation as part of a comprehensive AI red teaming program.
What is MITRE ATLAS and how is it used in AI red teaming?
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems) is a knowledge base of adversarial tactics and techniques against AI systems, modeled after MITRE ATT&CK for traditional cybersecurity. ATLAS organizes AI-specific attacks into a matrix of Tactics (the attacker's high-level goal) and Techniques (specific methods to achieve that goal), enabling red teams to systematically plan test cases and communicate findings using a common vocabulary. For enterprise AI red teams, ATLAS serves as the test case library: for each AI system capability (web browsing, code execution, document retrieval, CRM access), the red team maps the relevant ATLAS techniques and tests each one systematically. ATLAS also provides real-world case studies of AI attacks that have occurred, helping red teams understand realistic threat actor behavior rather than only testing theoretical vulnerabilities.
What is prompt injection and why is it the top AI security concern for enterprise deployments?
Prompt injection is an attack where malicious instructions embedded in AI inputs override the AI system's intended behavior. Direct prompt injection attacks come from users who craft inputs to manipulate the AI into ignoring its system prompt or safety guidelines. Indirect (or second-order) prompt injection is more dangerous for enterprise AI: the attacker embeds malicious instructions in content that the AI retrieves from external sources — an email the AI is asked to summarize, a web page the AI is asked to read, a document in the AI's knowledge base. When the AI processes this content, it executes the embedded instructions as if they were legitimate commands. For enterprise AI agents with tool use (ability to send emails, access databases, execute code, call APIs), successful prompt injection can cause the AI to take unauthorized actions entirely outside the normal user interface. OWASP has ranked prompt injection as the #1 vulnerability in their LLM Application Security Top 10, and NIST has issued guidance specifically on prompt injection defenses for enterprise AI.
How does EU AI Act Article 9 create an adversarial testing obligation?
EU AI Act Article 9 requires high-risk AI providers to implement a risk management system that includes testing procedures sufficient to ensure the AI system performs as intended and meets the Regulation's requirements (Article 9(4)). The requirement to identify "known and foreseeable risks" under Article 9(2)(a) implicitly includes adversarial risks — prompt injection, data poisoning, and model manipulation attacks are "foreseeable risks" for AI systems in 2025. Article 9(5) requires testing before market placement and throughout the lifecycle. Together, these provisions create a legally mandated testing obligation that includes adversarial scenarios for high-risk AI. Conformity assessment bodies (notified bodies for certain high-risk AI categories) will review risk management system documentation and expect to see evidence that adversarial testing was conducted — absence of adversarial testing documentation is a conformity finding that blocks market authorization. Organizations deploying high-risk AI in the EU should ensure their Article 9 risk management documentation explicitly addresses prompt injection, jailbreaking, and data poisoning as identified foreseeable risks with corresponding test procedures and mitigations.
What does Anthropic's Responsible Scaling Policy mean for enterprises using Claude?
Anthropic's Responsible Scaling Policy (RSP) is a voluntary commitment that Anthropic will not train or deploy AI models above AI Safety Level (ASL) thresholds unless corresponding safety measures are demonstrated through evaluations including red-team testing. For enterprises using Claude, the RSP means that Anthropic conducts structured adversarial evaluations of Claude models before deployment — testing for dangerous capabilities like autonomous cyberoffense, weapons synthesis assistance, and deceptive alignment. Anthropic publishes evaluation summaries that enterprises can reference in their own AI risk assessments and vendor security reviews. When conducting your own AI vendor risk assessment, request Anthropic's most recent model evaluation card and RSP compliance documentation. This provides evidence that the underlying model has been subject to adversarial capability evaluation — complementing (but not replacing) the enterprise-level red teaming your organization should conduct on your specific AI deployment configuration, custom system prompts, and tool integrations, which the model provider does not test.
How Claire Addresses AI Red Teaming
Claire undergoes quarterly adversarial testing by our internal red team and annual third-party AI security assessments using MITRE ATLAS methodology. Our red team program covers prompt injection (direct and indirect), jailbreaking resistance, model extraction defenses, and tool abuse scenarios for every AI agent configuration. Findings are tracked in our AI risk register with remediation timelines and are referenced in our NIST AI RMF GOVERN 1.7 documentation. Enterprise prospects can request our red team program summary and ATLAS coverage report as part of the security review process.