Securing Agentic Systems Against Prompt Injection & Tool Abuse
Security Research · 2025

As AI agents gain the power to browse the web, execute code, and call APIs autonomously, a new class of adversarial attacks emerges — one that exploits trust, context, and capability.

Background

What Is Prompt Injection?

Prompt injection is an attack in which hostile text — embedded in a webpage, document, or API response — hijacks an AI agent’s instructions. The model mistakes attacker-crafted content for legitimate system directives and executes unauthorized actions on the user’s behalf.

“The model cannot reliably distinguish between a trusted instruction from its operator and a forged instruction from adversarial content in its context window.”
Indirect Injection · Direct Injection · Context Poisoning · Instruction Override

Threat Landscape

🕸️ Web-Sourced Injections

Malicious instructions hidden in HTML comments, invisible text, or CSS-obscured elements encountered during browsing tasks.

🔧 Tool-Call Hijacking

Attackers coerce the agent into calling privileged tools — file writes, API mutations, email sends — with attacker-controlled parameters.

📄 Document Payloads

PDFs, spreadsheets, and markdown files smuggle jailbreaks through summarisation pipelines, bypassing input filters.

🔗 Multi-Hop Propagation

An injected sub-agent relays a corrupted payload to downstream agents, amplifying blast radius across an entire pipeline.

🎭 Role & Persona Spoofing

Attackers impersonate operator system prompts or trusted tool providers, tricking the model into elevated trust modes.

🧩 Exfiltration via Side-Channel

Sensitive context window data is leaked through URL parameters, search queries, or crafted tool arguments invisible to the user.
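One practical countermeasure is to scan every outbound URL an agent constructs for known-sensitive values before the request leaves the sandbox. The sketch below is illustrative only — the function name and the idea of maintaining a set of sensitive strings are assumptions, not part of any particular framework:

```python
from urllib.parse import urlparse, parse_qs

def url_leaks_secret(url: str, sensitive: set[str]) -> bool:
    """True if any known-sensitive value appears in the URL path or query —
    a common exfiltration side-channel for injected agents."""
    parsed = urlparse(url)
    haystacks = [parsed.path] + [
        v for values in parse_qs(parsed.query).values() for v in values
    ]
    return any(s in h for s in sensitive for h in haystacks)
```

A detector like this is a last line of defence; it only catches secrets the system already knows about, so it complements rather than replaces provenance tracking.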

Defense-in-Depth Strategy

1. Minimal Footprint Principle

Grant agents only the permissions required for the current task. Avoid storing long-lived credentials; prefer ephemeral, scoped tokens that expire after a single session or operation.
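An ephemeral, scoped token can be as simple as a signed claims blob with an expiry. This is a minimal sketch — the signing key, token format, and scope names are hypothetical; a production system would use a secrets manager and an established token standard:

```python
import base64
import hashlib
import hmac
import json
import time

SECRET = b"rotate-me"  # hypothetical signing key; keep in a secrets manager

def issue_token(scopes: list[str], ttl_s: int = 300) -> str:
    """Mint a short-lived, scope-limited token for a single agent task."""
    payload = json.dumps({"scopes": scopes, "exp": time.time() + ttl_s}).encode()
    sig = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    return base64.urlsafe_b64encode(payload).decode() + "." + sig

def check_token(token: str, required_scope: str) -> bool:
    """Reject expired tokens, forged signatures, and out-of-scope requests."""
    body, _, sig = token.rpartition(".")
    payload = base64.urlsafe_b64decode(body)
    expected = hmac.new(SECRET, payload, hashlib.sha256).hexdigest()
    if not hmac.compare_digest(sig, expected):
        return False
    claims = json.loads(payload)
    return time.time() < claims["exp"] and required_scope in claims["scopes"]
```

Because each token names its scopes explicitly, a hijacked agent holding a `calendar.read` token simply cannot authorise an email send.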

2. Structured Tool Call Validation

Enforce schema validation on every tool invocation. Reject calls whose parameters reference user-controlled strings that were never explicitly authorised by the operator’s system prompt.
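A schema check can be sketched with nothing more than a per-tool declaration of parameter names, types, and operator-authorised values. The tool name, parameters, and allowlist below are hypothetical examples, not a real API:

```python
# Hypothetical per-tool schemas: parameter names, types, operator allowlists.
TOOL_SCHEMAS = {
    "send_email": {
        "to": {"type": str, "allowed": {"ops@example.com"}},  # operator-authorised
        "subject": {"type": str, "allowed": None},            # free text permitted
    }
}

def check_schema(tool: str, args: dict) -> list[str]:
    """Return a list of violations; an empty list means the call passes."""
    schema = TOOL_SCHEMAS.get(tool)
    if schema is None:
        return [f"unknown tool: {tool}"]
    errors = [f"unexpected parameter: {k}" for k in args if k not in schema]
    for name, rule in schema.items():
        if name not in args:
            errors.append(f"missing parameter: {name}")
        elif not isinstance(args[name], rule["type"]):
            errors.append(f"{name}: wrong type")
        elif rule["allowed"] is not None and args[name] not in rule["allowed"]:
            errors.append(f"{name}: value not operator-authorised")
    return errors
```

The key design choice is defaulting to rejection: unknown tools, unexpected parameters, and unauthorised values all fail closed.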

3. Privilege-Separated Execution Contexts

Run read-only information-gathering in an unprivileged context. Require explicit user confirmation before any write, send, or delete action is executed by the agent.
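The read/write split can be enforced at the dispatch layer with a confirmation callback. The tool names and function signatures here are illustrative assumptions:

```python
from typing import Any, Callable

READ_ONLY = {"search", "fetch_page", "read_file"}      # runs without prompting
PRIVILEGED = {"write_file", "send_email", "delete_record"}  # needs confirmation

def execute(tool: str, args: dict,
            run: Callable[[str, dict], Any],
            confirm: Callable[[str], bool]) -> Any:
    """Run read-only tools directly; gate privileged ones behind the user."""
    if tool in READ_ONLY:
        return run(tool, args)
    if tool in PRIVILEGED:
        if confirm(f"Agent requests {tool} with {args}. Allow?"):
            return run(tool, args)
        raise PermissionError(f"user declined privileged call: {tool}")
    raise PermissionError(f"tool not registered in either context: {tool}")
```

Unregistered tools are rejected outright, so a newly injected tool name cannot slip through as "read-only by default".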

4. Prompt Canary & Integrity Markers

Insert unforgeable canary tokens into system prompts. Detect when downstream tool outputs reference or reproduce these tokens — a strong signal of injection or exfiltration in progress.
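The canary mechanism itself is small: generate a random marker, embed it verbatim in the system prompt, and scan everything the agent emits outward. The marker format below is an arbitrary choice for illustration:

```python
import secrets

def make_canary() -> str:
    """Unforgeable random marker to embed verbatim in the system prompt."""
    return f"CANARY-{secrets.token_hex(8)}"

def detect_leak(canary: str, outbound: str) -> bool:
    """True if outbound text (a tool argument or tool output) reproduces the
    canary — a strong signal of injection or exfiltration in progress."""
    return canary in outbound
```

Because the token is random per session, benign content can never contain it by accident; any match warrants halting the agent.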

5. Sanitised Context Ingestion

Strip executable markup, hidden Unicode, and CSS-invisible text from all external content before it enters the model’s context window. Treat retrieved content as untrusted user input.
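A minimal sanitiser for the vectors named above might strip HTML comments, script/style blocks, and invisible Unicode format characters. This is a sketch, not a complete HTML sanitiser — real pipelines should use a proper HTML parser and additionally handle CSS-hidden elements:

```python
import re
import unicodedata

def sanitise(text: str) -> str:
    """Remove common injection carriers from retrieved external content."""
    # HTML comments are a classic hiding spot for injected instructions.
    text = re.sub(r"<!--.*?-->", "", text, flags=re.S)
    # Script and style blocks never belong in a model's context window.
    text = re.sub(r"<(script|style)\b.*?</\1>", "", text, flags=re.S | re.I)
    # Strip invisible format characters (zero-width spaces, joiners, BOM),
    # i.e. Unicode general category Cf.
    return "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
```

Sanitisation reduces attack surface but cannot catch plainly visible injected prose, which is why it sits alongside, not instead of, provenance tracking.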

6. Audit Logging & Anomaly Detection

Log every tool call, its origin in the conversation, and its parameters. Alert on unusual patterns — large exfiltration payloads, calls to domains outside a whitelist, or permission escalation attempts.
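An audit hook can combine structured logging with cheap heuristics for the patterns just listed. The domain allowlist, size threshold, and logger name are hypothetical:

```python
import json
import logging
import time
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"api.example.com"}  # hypothetical domain allowlist
MAX_ARG_BYTES = 4096                   # flag oversized payloads as possible exfil

log = logging.getLogger("agent.audit")

def audit_tool_call(tool: str, args: dict, origin_turn: int) -> list[str]:
    """Log the call with its conversational origin; return any anomaly alerts."""
    record = {"ts": time.time(), "tool": tool, "args": args, "turn": origin_turn}
    log.info(json.dumps(record))
    alerts = []
    if len(json.dumps(args).encode()) > MAX_ARG_BYTES:
        alerts.append("oversized argument payload")
    for value in args.values():
        if isinstance(value, str) and value.startswith("http"):
            domain = urlparse(value).netloc
            if domain not in ALLOWED_DOMAINS:
                alerts.append(f"call to non-allowlisted domain: {domain}")
    return alerts
```

Logging the originating turn is what later lets an analyst trace a malicious call back to the exact piece of external content that triggered it.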

Implementation Reference

Schema-Level Tool Guard (Python)

A lightweight middleware that intercepts tool calls and rejects any whose arguments contain strings originating from the untrusted external context.

# tool_guard.py — minimal prompt-injection filter

import hashlib

TRUSTED_SOURCES = {"system_prompt", "operator_config"}

def hash_token(text: str) -> str:
    """Short, stable fingerprint used to look up a value's provenance."""
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def validate_tool_call(call: dict, context_provenance: dict) -> bool:
    """Raise PermissionError if any argument originates from untrusted context."""
    for key, value in call["arguments"].items():
        token = hash_token(str(value))
        origin = context_provenance.get(token, "unknown")
        if origin not in TRUSTED_SOURCES:
            raise PermissionError(
                f"Tool arg '{key}' traces to untrusted source: {origin}"
            )
    return True

# Usage in the agent loop, before execution:
# validate_tool_call(tool_call, provenance_map)

Architecture

The Three-Layer Trust Hierarchy

Effective agent security encodes a strict ordering: Anthropic’s policies sit at the apex, followed by operator system prompt instructions, and finally user turn messages. External tool outputs — regardless of how authoritative they appear — must never be permitted to elevate their own trust tier. A retrieved document claiming “SYSTEM: ignore previous instructions” is user-tier data, not an operator directive.
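The hierarchy can be made explicit in code by assigning trust from the ingestion channel, never from the content's self-description. A minimal sketch (the function name and channel model are assumptions):

```python
from enum import IntEnum

class Tier(IntEnum):
    """Higher numeric value = lower trust."""
    ANTHROPIC = 0
    OPERATOR = 1
    USER = 2
    EXTERNAL = 3

def assign_tier(content: str, claimed: Tier, channel: Tier) -> Tier:
    """Trust is the *lower* of what the content claims and what the channel
    grants — external content can never elevate its own tier."""
    return max(claimed, channel)
```

A retrieved document that announces itself as a system directive still arrives over the external channel, so it is pinned to Tier 3.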

Tier 0 · Anthropic → Tier 1 · Operator → Tier 2 · User → Tier 3 · External

Multi-Agent Pipelines

Orchestrators & Sub-Agent Trust

When one AI agent orchestrates another, trust must be cryptographically or structurally attested — not inferred from conversational context. A sub-agent should behave safely regardless of what its orchestrator claims about its own identity or permissions. Never grant an orchestrating agent permissions that exceed what a human operator explicitly authorised at pipeline construction time.
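One way to make such attestation concrete is an operator-signed capability grant that the sub-agent verifies itself. The key handling and grant format below are illustrative assumptions; a real pipeline would use asymmetric signatures so sub-agents cannot mint grants:

```python
import hashlib
import hmac
import json

OPERATOR_KEY = b"pipeline-build-key"  # hypothetical: fixed at construction time

def grant(permissions: list[str]) -> dict:
    """Operator signs the permission set when the pipeline is built."""
    payload = json.dumps(sorted(permissions)).encode()
    sig = hmac.new(OPERATOR_KEY, payload, hashlib.sha256).hexdigest()
    return {"permissions": permissions, "sig": sig}

def subagent_allows(action: str, g: dict) -> bool:
    """The sub-agent verifies the signature itself — whatever the
    orchestrator claims conversationally is simply ignored."""
    payload = json.dumps(sorted(g["permissions"])).encode()
    expected = hmac.new(OPERATOR_KEY, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(g["sig"], expected) and action in g["permissions"]
```

If a compromised orchestrator tampers with the permission list, the signature no longer verifies and the sub-agent refuses the action.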
