AI Penetration Testing: OWASP LLM Top 10 Vulnerability Assessment, Prompt Injection Testing, and Model Security Architecture Review
AI Pen Test Reference
OWASP LLM Top 10 2025: The AI Vulnerability Framework
The OWASP Foundation published the LLM Top 10 in 2023 and updated it in 2025. The 2025 edition covers ten vulnerability categories specific to LLM applications and AI agents. Unlike the traditional OWASP Web Application Top 10 (which covers web application vulnerabilities), the LLM Top 10 is specific to AI systems and requires AI-specific testing methodology.
LLM01 — Prompt Injection: The manipulation of LLM behavior through crafted inputs, including direct injection (user-provided malicious prompts) and indirect injection (malicious content in retrieved documents, emails, or web pages that the agent processes). Testing methodology: systematically attempt to override system prompts, extract system prompt content, cause the model to ignore its safety guidelines, and manipulate agent actions through injected content in all input channels.
LLM02 — Sensitive Information Disclosure: AI systems inadvertently revealing confidential data, training data, or system internals. Testing methodology: attempt to extract system prompt content, probe for training data memorization (reconstruct training data through targeted queries), and test whether RAG retrieval returns unauthorized documents.
LLM03 — Supply Chain Vulnerabilities: Compromised model weights, poisoned training data, or malicious plugins in the AI supply chain. Testing methodology: review model provenance, inspect plugin/tool code for supply chain risks, verify model integrity hashes.
LLM04 — Data and Model Poisoning: Manipulation of training data to introduce backdoors or biases. Testing methodology: test for anomalous model behavior on specific trigger inputs, probe for backdoor patterns in fine-tuned models.
LLM05 — Improper Output Handling: Failure to validate AI outputs before downstream use, enabling XSS, SSRF, or code injection if AI output is rendered in web contexts or executed as code. Testing methodology: inject output payloads through AI responses and test whether they execute in downstream systems.
LLM06: Excessive Agency
AI agents granted more permissions than needed, enabling attackers to achieve more damage through prompt injection by exploiting over-privileged tool access. Test: attempt to use agent tool access beyond intended scope; test for unauthorized data access, external calls, or system modifications.
LLM08: Vector and Embedding Weaknesses
Vulnerabilities in RAG vector databases enabling unauthorized document retrieval, embedding inversion attacks (reconstructing original text from embeddings), or cross-tenant data exposure. Test: attempt to retrieve documents belonging to other users or tenants via crafted embedding queries.
LLM09: Misinformation
AI systems generating plausible but false information (hallucination) that downstream systems or users act upon. Test: probe for known hallucination failure modes; test factual accuracy on domain-specific knowledge; assess overconfidence in uncertain responses.
AI Pen Test Methodology: Phases and Techniques
Phase 1 — Reconnaissance: Map the AI system architecture: identify all input channels (user messages, file uploads, API inputs, webhook receivers), output channels (UI, API responses, downstream system writes), and tool integrations (CRM, database, file system, email, web search). Document the model type and version, system prompt (if discoverable), RAG knowledge base scope, and tool permission model. This reconnaissance phase is critical and often missed in AI security reviews.
Phase 2 — Direct prompt injection testing: Systematically test all user-controlled inputs for prompt injection: role-playing attacks ("Act as an administrator..."), delimiter confusion (using prompt delimiters to break out of instruction context), goal hijacking ("Ignore previous instructions and..."), jailbreaks (safety guideline bypasses, DAN prompts), and multilingual attacks (injecting in languages the safety training may be weaker in). Document which techniques succeed and what access or behavior they enable.
Phase 3 — Indirect injection testing: Test all channels where external content enters the agent context. For RAG systems: inject instructions into document stores accessible to the agent; craft search queries that retrieve attacker-controlled content. For web browsing agents: set up attacker-controlled web pages with embedded injection instructions. For email processing agents: send crafted emails with injected instructions. For tool output channels: manipulate API responses or database records with injected instructions.
Phase 4 — Authorization and access control testing: Verify that RAG retrieval respects document-level access controls (does user A get user B's documents?), that agent tool invocations are authorized (can non-admin users invoke admin tools?), and that multi-tenant isolation is maintained (can tenant A queries retrieve tenant B data?). These are traditional authorization tests applied to AI-specific access patterns.
Advanced AI Attacks: Model Inversion, Membership Inference, and Training Data Extraction
Model inversion attacks: Attempt to reconstruct training data from model outputs by repeatedly querying the model with targeted inputs. For fine-tuned models trained on proprietary or sensitive data (medical records, financial data, customer PII), model inversion can expose that training data. Testing methodology: systematically probe the model with names, email formats, and data patterns that may appear in training data; measure memorization through exact-match or near-match retrieval.
The most famous documented case of training data extraction is the 2021 paper "Extracting Training Data from Large Language Models" (Carlini et al., Google Brain), which demonstrated that GPT-2 memorized and could be prompted to reproduce verbatim training data including names, email addresses, phone numbers, and other PII. Subsequent research has shown that memorization is a systematic property of large language models, not an incidental bug — larger models memorize more training data.
Membership inference attacks: Determine whether specific data was included in a model's training set by querying model confidence or loss values on target examples. This is particularly relevant for models fine-tuned on private datasets: a membership inference attack can reveal whether a specific individual's data was included in fine-tuning, which has implications under GDPR (right to know what processing has occurred) and CCPA (right to know what personal information is held).
Adversarial inputs: Craft inputs that cause AI systems to fail in targeted ways — misclassification attacks for classification models, targeted hallucination for generative models, or evasion of content filters. For enterprise AI agents, adversarial inputs can be used to bypass safety guidelines, cause consistent misbehavior on specific inputs, or evade regulatory compliance guardrails.
AI Penetration Testing Checklist
- Commission AI-specific pen testEngage penetration testing firm with documented AI/LLM testing methodology; verify they test all OWASP LLM Top 10 2025 categories
- Direct prompt injection testingTest all user-controlled input channels for prompt injection: role-play attacks, delimiter confusion, goal hijacking, jailbreaks, multilingual attacks
- Indirect injection testingTest all external content channels: document stores, web content, email inputs, API responses, database records for indirect injection vulnerabilities
- RAG authorization testingVerify document-level access controls in RAG: test cross-user and cross-tenant document retrieval; confirm vector search respects permissions
- Agent tool permission testingTest whether agent tool invocations can be expanded beyond intended scope via prompt injection; verify least-privilege tool access
- Model inversion assessmentProbe fine-tuned models for training data memorization; test whether PII or confidential data in training set can be reconstructed
- Multi-tenant isolation testingTest whether tenant-specific data isolation prevents cross-tenant data access via crafted queries or embedding attacks
- Output handling securityTest whether AI outputs rendered in web UI introduce XSS; verify AI-generated code is sandboxed before execution
- Supply chain reviewVerify model integrity (hash verification), review plugin/tool code for supply chain risks, confirm model provenance from trusted sources
- Annual retest scheduleSchedule AI pen tests annually at minimum; retest after major model updates, new tool integrations, or significant architecture changes
Frequently Asked Questions
What makes AI penetration testing different from traditional pen testing?
Traditional pen testing targets known vulnerability classes in web applications, networks, and operating systems — SQL injection, XSS, authentication bypasses, misconfigurations. AI pen testing addresses a different attack surface: natural language instruction manipulation (prompt injection), model knowledge extraction (training data extraction, membership inference), AI-specific authorization (RAG access control, multi-tenant isolation), and AI supply chain risks (model poisoning, malicious plugins). An AI system can be fully hardened against traditional web vulnerabilities and still be completely vulnerable to prompt injection.
How often should we conduct AI penetration tests?
Annual AI pen testing is the minimum required for SOC 2 Type II compliance. However, AI systems that receive model updates, new tool integrations, or architectural changes should be retested after each significant change. For high-risk AI deployments (financial decisions, healthcare recommendations, employment screening), quarterly pen testing is recommended. NIST AI RMF Govern function recommends continuous adversarial testing as part of AI risk management, which can be supplemented by internal red teaming between annual external pen tests.
Can you detect prompt injection attacks in production AI systems?
Yes, but imperfectly. Behavioral monitoring approaches for production prompt injection detection include: detecting known injection patterns in inputs (regex/ML classifiers, but easily evaded by novel attacks), monitoring for anomalous agent actions (unexpected tool calls, unusual output content), embedding-based input anomaly detection (embedding inputs and detecting statistically anomalous inputs), and LLM-as-judge monitoring (using a second LLM to evaluate whether the primary LLM's actions are consistent with its system prompt). No single technique is sufficient — defense-in-depth monitoring is required.
What is a Transfer Impact Assessment (TIA) and when is it required?
This question appears misplaced in an AI pen test FAQ — it's relevant to data residency. For AI pen testing: a Transfer Impact Assessment isn't relevant. What's relevant is a Threat Model — a structured assessment of attack vectors, threat actors, and potential impacts specific to the AI system being tested. OWASP's LLM Top 10 provides a starting threat taxonomy; MITRE ATLAS provides adversarial ML attack techniques. A threat model should be created before pen testing to ensure the test scope covers the most significant risks.
How does Claire's security architecture defend against OWASP LLM Top 10 vulnerabilities?
Claire implements defense-in-depth controls for each OWASP LLM Top 10 category: LLM01 (prompt injection) — input validation, context isolation, behavioral monitoring, least-privilege tool access; LLM02 (sensitive information disclosure) — output filtering, RAG permission-respecting retrieval, system prompt protection; LLM03 (supply chain) — model provenance verification, plugin code review, integrity validation; LLM06 (excessive agency) — allowlisted tool actions, human-in-the-loop for high-stakes operations; LLM08 (vector/embedding) — per-tenant namespace isolation, embedding access controls. Annual third-party pen testing covers all OWASP LLM Top 10 categories.
How Claire Addresses AI Security Testing
Claire undergoes annual third-party AI penetration testing covering all OWASP LLM Top 10 2025 categories, plus internal quarterly red teaming using MITRE ATLAS techniques. Our security architecture implements defense-in-depth controls for prompt injection, model data exposure, and AI supply chain risks. Request our most recent pen test summary report (executive findings) as part of a security briefing.