
Guardrails & Error Handling for Production AI Applications

A practitioner’s reference for building resilient, safe, and observable AI-powered systems — from input validation through graceful degradation to post-incident recovery.

Applies to LLM APIs · Agents · RAG pipelines
Level Intermediate → Advanced
Updated April 2025
  • 73% of AI outages trace to missing input validation
  • Cost reduction with proper retry + caching strategy
  • 99.9% uptime achievable with circuit breakers + fallbacks

Core Principles

Production AI applications fail differently than traditional software. Model outputs are probabilistic, latency is variable, and failure modes include subtle semantic errors that no exception handler can catch. Your error strategy must account for every layer, from the network up to the semantics of the model's output.

🛡
Fail gracefully, not silently

Every error path should return a meaningful response to the user, never a blank screen or cryptic stack trace.

🔁
Retry with intelligence

Exponential backoff, jitter, and budget-aware retry limits prevent thundering herds and runaway costs.

🧱
Validate at every boundary

Sanitise input before the model, and validate output structure before it reaches the user or a downstream system.

👁
Observe everything

Latency, token usage, refusals, and error rates should stream into your observability stack in real time.


Error Taxonomy

Categorising failures precisely lets you route them to the right handler. AI applications surface four distinct error families, each requiring a different response strategy; the sketch after the table shows one way to route on these classes.

Class            Severity   Examples                                      Strategy
Infrastructure   Critical   API timeout, 5xx, network partition           Retry + circuit breaker
Rate limiting    High       429 Too Many Requests, quota exceeded         Backoff + queue
Semantic         Medium     Hallucination, schema mismatch, refusal       Output validation + fallback prompt
Input policy     Low        Unsafe content, PII leak, prompt injection    Guardrail — reject before model
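
A minimal routing sketch, assuming illustrative failure signals (an HTTP status code and a guardrail flag) rather than any particular SDK's error types:

Python
from enum import Enum, auto

class ErrorClass(Enum):
    INFRASTRUCTURE = auto()   # API timeout, 5xx, network partition
    RATE_LIMIT     = auto()   # 429, quota exceeded
    SEMANTIC       = auto()   # hallucination, schema mismatch, refusal
    INPUT_POLICY   = auto()   # unsafe content, PII leak, prompt injection

def classify(status: int | None = None, guardrail_hit: bool = False) -> ErrorClass:
    # Transport-level signals are checked first; anything that reaches
    # the model and still fails is treated as semantic by default.
    if guardrail_hit:
        return ErrorClass.INPUT_POLICY
    if status == 429:
        return ErrorClass.RATE_LIMIT
    if status is not None and status >= 500:
        return ErrorClass.INFRASTRUCTURE
    return ErrorClass.SEMANTIC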

Retry Patterns

Naive retries amplify traffic under failure. The gold standard for AI APIs combines exponential backoff with full jitter — spreading retries across time to prevent synchronised bursts.

TypeScript
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  opts: { maxAttempts: number; baseDelayMs: number }
): Promise<T> {
  let attempt = 0;

  while (attempt < opts.maxAttempts) {
    try {
      return await fn();
    } catch (err) {
      const isRetryable = isRetryableError(err);
      const isLast      = attempt === opts.maxAttempts - 1;

      if (!isRetryable || isLast) throw err;

      // Exponential backoff with full jitter
      const cap   = opts.baseDelayMs * 2 ** attempt;
      const delay = Math.random() * cap;
      await sleep(delay);
      attempt++;
    }
  }
  throw new Error("Max retry attempts reached");
}

function isRetryableError(err: unknown): boolean {
  // APIError and NetworkError stand in for your SDK's error classes
  if (err instanceof APIError) {
    return [429, 500, 502, 503, 504].includes(err.status);
  }
  return err instanceof NetworkError;
}

// Helper used by the retry loop above
const sleep = (ms: number) =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));
⚠ Budget Guard

Always wrap your retry loop in a token-budget or cost-budget check. Retrying a 50k-token prompt 5 times costs 250k tokens — set hard spending limits before entering the retry loop.
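
One possible shape for such a guard, assuming a hard 200,000-token ceiling (the numbers are illustrative):

Python
class RetryBudgetExceeded(Exception):
    pass

def check_retry_budget(prompt_tokens: int, max_attempts: int,
                       budget_tokens: int = 200_000) -> None:
    # Worst case: every attempt resends the full prompt
    worst_case = prompt_tokens * max_attempts
    if worst_case > budget_tokens:
        raise RetryBudgetExceeded(
            f"{max_attempts} attempts x {prompt_tokens} tokens = {worst_case}, "
            f"over the {budget_tokens}-token budget"
        )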


Input Validation

The cheapest error to handle is the one you prevent from reaching the model. Every input should pass through a validation pipeline before a single token is spent; a condensed sketch of the full pipeline follows the steps below.

1
Schema validation

Enforce request shape — type, length limits, required fields. Reject malformed inputs with HTTP 400 before any model call.

2
PII detection & redaction

Scan for emails, SSNs, credit cards, and phone numbers. Redact or pseudonymise before the prompt is assembled.

3
Prompt injection detection

Flag attempts to override system instructions — “ignore previous instructions”, role-play escapes, delimiter manipulation.

4
Token budget check

Estimate prompt tokens before calling the API. Reject or truncate inputs that would breach context limits or cost thresholds.

5
Content policy pre-screen

Run a fast classifier or keyword filter for clearly policy-violating content before spending on a full model call.
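
A condensed sketch of the five stages, assuming illustrative limits and regexes; production systems should use dedicated PII and injection classifiers rather than keyword lists:

Python
import re

MAX_INPUT_CHARS = 32_000
MAX_EST_TOKENS  = 6_000
PII_PATTERNS = [
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),          # SSN
    re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"),    # email
]
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.I),
]

def validate_input(text: str) -> str:
    # 1. Schema/shape checks: reject before any model spend
    if not text or len(text) > MAX_INPUT_CHARS:
        raise ValueError("400: input missing or over length limit")
    # 2. PII redaction
    for pattern in PII_PATTERNS:
        text = pattern.sub("[REDACTED]", text)
    # 3. Prompt-injection pre-screen
    if any(p.search(text) for p in INJECTION_PATTERNS):
        raise ValueError("400: prompt injection pattern detected")
    # 4. Token budget: rough 4-chars-per-token heuristic
    if len(text) / 4 > MAX_EST_TOKENS:
        raise ValueError("400: estimated tokens exceed budget")
    # 5. Content policy pre-screen (fast classifier) would run here
    return text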


The Guardrails Layer

Guardrails are your application’s immune system. They sit between raw user input and the model, and between the model and your downstream systems. A well-designed guardrail layer is symmetric — it filters both ingress and egress.

“A guardrail that can be bypassed by rephrasing the request is not a guardrail — it’s a speed bump. Design for adversarial inputs from day one.”
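
End to end, the symmetric shape looks roughly like this; call_model and fallback_response are hypothetical placeholders, while validate_input and parse_and_guard are sketched elsewhere in this guide:

Python
def guarded_completion(user_text: str):
    safe_input = validate_input(user_text)    # ingress: filter before the model
    raw_output = call_model(safe_input)       # the actual LLM call (placeholder)
    guarded    = parse_and_guard(raw_output)  # egress: validate before the user
    return guarded if guarded is not None else fallback_response()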

Output guardrails

After the model responds, validate the output before surfacing it to the user or invoking any tool calls embedded in the response.

Python
import json
from json import JSONDecodeError
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class StructuredResponse(BaseModel):
    action: Literal["search", "summarise", "answer"]
    confidence: float = Field(ge=0.0, le=1.0)   # enforced 0.0 – 1.0
    content: str
    citations: list[str] = []

def parse_and_guard(raw_output: str) -> StructuredResponse | None:
    try:
        data = json.loads(raw_output)
        resp = StructuredResponse(**data)

        # Post-parse semantic checks (contains_pii, redact, and the log_*
        # calls are application-specific helpers)
        if resp.confidence < 0.4:
            log_low_confidence(resp)
            return None          # trigger fallback

        if contains_pii(resp.content):
            resp.content = redact(resp.content)

        return resp

    except (ValidationError, JSONDecodeError) as e:
        log_parse_error(e, raw_output)
        return None
✓ Best Practice

Always define a structured output schema and instruct the model to adhere to it. JSON mode or tool-use structured outputs reduce parse failures by 60–80% compared to free-form text extraction.
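
One way to make the contract explicit, assuming Pydantic v2 and the StructuredResponse model defined above (the prompt wording is illustrative):

Python
import json

schema = StructuredResponse.model_json_schema()
system_prompt = (
    "Respond with a single JSON object that validates against this "
    "JSON Schema. Output nothing but the JSON.\n"
    + json.dumps(schema, indent=2)
)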

✕ Critical Risk: Tool-call injection

If your application uses tool-calling or function-calling, never execute a model-generated tool call without validating the function name against an allowlist and sanitising all arguments. Treat tool calls as untrusted user input.
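
A minimal allowlist check, assuming a generic {"name": ..., "arguments": ...} tool-call shape and hypothetical tool names:

Python
ALLOWED_TOOLS  = {"search_docs", "get_weather"}   # hypothetical tool names
MAX_ARG_LENGTH = 1_000

def validate_tool_call(call: dict) -> dict:
    name = call.get("name")
    if name not in ALLOWED_TOOLS:
        raise ValueError(f"tool {name!r} not in allowlist")
    args = call.get("arguments") or {}
    for key, value in args.items():
        # Length-cap string arguments; add per-tool type/format checks here
        if isinstance(value, str) and len(value) > MAX_ARG_LENGTH:
            raise ValueError(f"argument {key!r} exceeds length limit")
    return {"name": name, "arguments": args}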


Circuit Breakers

When a downstream service is degraded, continuing to send traffic amplifies the failure. A circuit breaker detects sustained failure and opens, routing requests to a fallback until the service recovers.

TypeScript
type State = "CLOSED" | "OPEN" | "HALF_OPEN";

class CircuitBreaker {
  private state: State = "CLOSED";
  private failures = 0;
  private lastFailureTime?: number;

  constructor(
    private threshold = 5,     // failures before opening
    private timeout  = 60_000 // ms before half-open probe
  ) {}

  async execute<T>(fn: () => Promise<T>, fallback: () => T): Promise<T> {
    if (this.state === "OPEN") {
      const elapsed = Date.now() - (this.lastFailureTime ?? 0);
      if (elapsed < this.timeout) return fallback();
      this.state = "HALF_OPEN";  // probe with next request
    }
    try {
      const result = await fn();
      this.onSuccess();
      return result;
    } catch (err) {
      this.onFailure();
      return fallback();
    }
  }

  private onSuccess() { this.failures = 0; this.state = "CLOSED"; }
  private onFailure() {
    this.lastFailureTime = Date.now();
    if (++this.failures >= this.threshold) this.state = "OPEN";
  }
}
ℹ Fallback hierarchy

Design a tiered fallback: (1) retry same model → (2) smaller/cheaper model → (3) cached response → (4) static safe response. Each tier should degrade gracefully, never crash.
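
A sketch of that hierarchy; call_model and cache_lookup are hypothetical placeholders, and each tier should carry its own timeout:

Python
STATIC_SAFE_RESPONSE = "We're having trouble right now. Please try again shortly."

def tiered_completion(prompt: str) -> str:
    tiers = [
        lambda: call_model(prompt, model="primary-model"),  # 1. primary (retries inside)
        lambda: call_model(prompt, model="small-model"),    # 2. smaller/cheaper model
        lambda: cache_lookup(prompt),                       # 3. cached response
    ]
    for tier in tiers:
        try:
            result = tier()
            if result:
                return result
        except Exception:
            continue                  # degrade to the next tier
    return STATIC_SAFE_RESPONSE       # 4. static safe response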


Observability

You cannot improve what you cannot see. Every LLM call should emit structured telemetry covering latency, token consumption, cache status, and guardrail outcomes.

Python
from dataclasses import dataclass, field
import time, uuid

@dataclass
class LLMSpan:
    trace_id:      str   = field(default_factory=lambda: str(uuid.uuid4()))
    model:         str   = ""
    prompt_tokens: int   = 0
    output_tokens: int   = 0
    latency_ms:    float = 0.0
    cached:        bool  = False
    error_type:    str | None = None
    guardrail_hit: str | None = None

def traced_completion(client, **kwargs) -> tuple:
    span = LLMSpan(model=kwargs["model"])
    t0   = time.perf_counter()
    try:
        response = client.messages.create(**kwargs)
        span.prompt_tokens = response.usage.input_tokens
        span.output_tokens = response.usage.output_tokens
        # cache_read_input_tokens may be None when caching is not in use
        span.cached        = (response.usage.cache_read_input_tokens or 0) > 0
        return response, span
    except Exception as e:
        span.error_type = type(e).__name__
        raise
    finally:
        span.latency_ms = (time.perf_counter() - t0) * 1000
        emit_span(span)   # → your telemetry sink
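
Usage is a drop-in replacement for a raw client call; the client and model id here are illustrative:

Python
# client: your SDK client, e.g. an Anthropic-style client as assumed above
response, span = traced_completion(
    client,
    model="claude-3-5-sonnet-latest",
    max_tokens=512,
    messages=[{"role": "user", "content": "Summarise this report."}],
)
print(span.latency_ms, span.prompt_tokens, span.cached)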

Key metrics to track

  • P50 / P95 / P99 latency — segmented by model and prompt template
  • Token spend rate — prompt vs. completion, with daily budget alerts
  • Error rate by category — infra errors vs. policy refusals vs. parse failures
  • Guardrail hit rate — which rules trigger most, and on what inputs
  • Cache hit ratio — prompt caching effectiveness across your workload
  • Fallback activation rate — how often circuit breakers open

Production Launch Checklist

Before shipping a new AI feature to production, verify every item below. This is a minimum bar — not a ceiling.

Input & validation

  • Request schema validated with 400-level rejection on malformed input
  • PII detection and redaction pipeline active
  • Prompt injection patterns covered by integration tests
  • Token budget enforced — inputs truncated or rejected above limit

Resilience

  • Exponential backoff with jitter on all API calls
  • Circuit breaker configured with tested fallback path
  • Timeout set explicitly — never rely on the SDK default
  • Graceful degradation message copy written and tested

Output & safety

  • Structured output schema with Pydantic / Zod validation
  • Tool-call allowlist and argument sanitisation in place
  • Low-confidence responses route to human review or fallback
  • Output PII redaction symmetric with input redaction

Observability

  • Structured spans emitted for every LLM call
  • Latency and token-spend dashboards live in your ops console
  • Alerts configured for error rate > 2% or latency P95 > threshold
  • Guardrail hit-rate dashboard reviewed weekly
✓ You’re ready when

Every item above is checked, you can kill the model API and the application returns a graceful degraded experience, and your on-call runbook covers the top 5 failure scenarios with step-by-step resolution guides.

Production AI Engineering Guide · 2025 · Built for practitioners, by practitioners
