AI Evaluation & Optimization — Expert Reference · 2026

Evaluating & Optimizing Large Language Models

A comprehensive, practitioner-grade reference covering every dimension of AI model evaluation — from benchmark design and alignment techniques to inference optimization and production observability.

12 core domains · 60+ techniques · iteration loops
01

Benchmarking Frameworks

Reasoning

MMLU & MMLU-Pro

Massive Multitask Language Understanding — 57 academic subjects from STEM to law. Pro variant adds harder, multi-step questions resistant to guessing strategies.

↑ 92.1% GPT-4o · 88.7% Claude 3.5 · Frontier baseline
Coding

HumanEval & SWE-Bench

HumanEval tests function synthesis from docstrings. SWE-Bench evaluates real GitHub issue resolution — a more realistic coding benchmark for agentic systems.

pass@1 metric · 50.8% best SWE-Bench Verified
Math

MATH & AIME

MATH benchmark covers competition math across 7 subject areas and 5 difficulty levels. AIME (American Invitational Mathematics Examination) probes extreme reasoning — near-impossible for older models.

Level 5 accuracy is the discriminating frontier signal
Safety

TruthfulQA & WildGuard

TruthfulQA measures resistance to human misconceptions. WildGuard covers refusal accuracy, over-refusal rates, and safety classification across 13 harm categories.

Dual metric: safety rate + helpfulness rate
Multimodal

MMMU & ChartQA

Massive Multi-discipline Multimodal Understanding across 30 subjects. ChartQA tests structured visual reasoning on charts, graphs, and tables in document contexts.

Vision-language capability discriminator
Agentic

GAIA & AgentBench

GAIA tests real-world general AI assistants with web browsing and tool use. AgentBench evaluates agents across OS, DB, web shopping, and coding environments.

Multi-step task completion rate · hardest tier
BIG-Bench Hard · HellaSwag · ARC-Challenge · GSM8K · DROP · WinoGrande · GPQA Diamond · MT-Bench
02

Key Evaluation Metrics

| Metric | Measures | Typical Use | Caution |
|---|---|---|---|
| Perplexity (PPL) | How well the model predicts a text sample — lower = better language fit | Pre-training quality, base model comparison | Indirect — doesn't correlate with task performance |
| BLEU / ROUGE | N-gram overlap between generated and reference text | Summarization, translation | Weak — penalizes valid paraphrases; largely deprecated |
| BERTScore | Semantic similarity via contextual embeddings | Generation quality, translation | Medium — better than BLEU but still reference-dependent |
| Win Rate (LLM-as-Judge) | Preference comparison using a judge model (GPT-4, Llama-3) | Chat quality, instruction-following | Robust — watch for position & verbosity bias in the judge |
| Calibration (ECE) | Whether model confidence matches empirical accuracy | Reliability, hallucination mitigation | Critical — essential for high-stakes applications |
| Refusal Rate / Over-refusal | Safety compliance vs. helpfulness balance | Safety-aligned models | Dual — both false positives and negatives matter equally |
| Latency / TTFT | Time to first token & end-to-end latency (p50/p95/p99) | Production deployment SLAs | Critical — tail latency (p99) often defines UX |
| Token Throughput | Output tokens per second per GPU | Inference cost optimization | Tradeoff — batching improves throughput, increases latency |
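
Since calibration (ECE) is flagged above as critical, here is a minimal sketch of how it is typically computed, assuming equal-width confidence bins; the bin count and example data are illustrative, not a reference implementation.

```python
# Minimal ECE sketch: bin predictions by confidence, compare mean
# confidence to empirical accuracy in each bin, weight by bin size.
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap
    return ece

# Example: a model that is highly confident but only half right
print(expected_calibration_error([0.9, 0.95, 0.8, 0.99], [1, 0, 1, 0]))
```
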
03

The Evaluation Pipeline

01
Dataset Curation
Stratified sampling across domains, difficulty levels, and demographic groups. Decontaminate from training data. Build held-out sets.
02
Automated Eval
Run standardized benchmarks. LLM-as-Judge scoring with multi-model consensus. Regression suites on key capabilities.
03
Human Eval
Expert annotators for nuanced dimensions — factuality, tone, safety. Inter-annotator agreement via Cohen’s κ or Fleiss’ κ (see the κ sketch after these steps).
04
A/B Shadow Test
Shadow deploy candidate model. Real traffic, implicit signals (thumbs, rewrites, session length). Statistical significance gating.
05
Red Teaming
Adversarial probing by internal + external teams. Jailbreak attempts, prompt injection, multi-turn manipulation scenarios.
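
A minimal sketch of Cohen's κ for two annotators over categorical labels, as referenced in the Human Eval step; the label data is illustrative.

```python
# Cohen's kappa for two annotators: kappa = (p_o - p_e) / (1 - p_e),
# where p_o is observed agreement and p_e is chance agreement
# implied by each annotator's marginal label frequencies.
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    n = len(labels_a)
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n)
              for c in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)

print(cohens_kappa(["safe", "unsafe", "safe", "safe"],
                   ["safe", "unsafe", "unsafe", "safe"]))  # 0.5
```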

“Evaluation is not a gate — it is a feedback loop. The quality of your eval dataset directly bounds the quality of your model improvements.”

— Core principle, modern MLOps practice
04

RLHF & Alignment Techniques

Reinforcement Learning from Human Feedback

RLHF is the dominant alignment paradigm: human raters compare model outputs, a reward model learns their preferences, and the language model is optimized via PPO (Proximal Policy Optimization) to maximize reward while not drifting too far from the base policy (KL penalty).

Three-stage pipeline: SFT (supervised fine-tuning on demonstrations) → Reward Model Training → RL Policy Optimization.
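
A hedged sketch of the reward shaping described above: the sequence-level reward-model score lands on the final token, and a per-token KL penalty keeps the policy near the SFT reference. Tensor shapes and the kl_coef value are illustrative assumptions, not any particular library's API.

```python
# PPO-RLHF reward shaping sketch: RM score on the full response minus a
# per-token KL penalty against the frozen reference (SFT) policy.
import torch

def shaped_rewards(rm_score, policy_logprobs, ref_logprobs, kl_coef=0.1):
    # policy_logprobs / ref_logprobs: (seq_len,) log-probs of sampled tokens
    kl_per_token = policy_logprobs - ref_logprobs   # per-token KL contribution
    rewards = -kl_coef * kl_per_token               # penalty at every token
    rewards[-1] = rewards[-1] + rm_score            # RM score at the last token
    return rewards

rewards = shaped_rewards(
    rm_score=torch.tensor(1.7),
    policy_logprobs=torch.tensor([-1.2, -0.8, -2.0]),
    ref_logprobs=torch.tensor([-1.0, -0.9, -1.5]),
)
```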

Modern Alternatives

  • DPO — Direct Preference Optimization. Removes the explicit reward model; optimizes directly on preference pairs. More stable, cheaper, slightly less flexible (loss sketch after this list).
  • IPO — Identity Preference Optimization. Fixes DPO overfitting to deterministic preferences.
  • ORPO — Odds Ratio Preference Optimization. Merges SFT + alignment into one stage.
  • GRPO — Group Relative Policy Optimization. Used in DeepSeek-R1; removes value network entirely.
  • Constitutional AI — Rule-based self-critique (see §07).
  • RLAIF — AI-generated feedback replaces expensive human annotations.
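
A minimal sketch of the DPO loss from the first bullet above; the inputs are summed log-probabilities of whole responses under the policy and a frozen reference model, and beta and the example values are illustrative.

```python
# DPO loss sketch: push the policy to prefer the chosen response over the
# rejected one, measured relative to a frozen reference model.
import torch
import torch.nn.functional as F

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    # Each argument: summed log-prob of a full response under one model.
    chosen_margin = pi_chosen - ref_chosen
    rejected_margin = pi_rejected - ref_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin))

loss = dpo_loss(
    pi_chosen=torch.tensor(-12.0), pi_rejected=torch.tensor(-15.0),
    ref_chosen=torch.tensor(-13.0), ref_rejected=torch.tensor(-14.0),
)
```
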
05

Prompt Engineering & Optimization

Reasoning

Chain-of-Thought (CoT)

Elicit step-by-step reasoning with “Let’s think step by step” or few-shot examples. Zero-shot CoT works on capable models; few-shot CoT is more reliable and domain-controllable.

+25–40% accuracy on multi-step math & logic
Ensemble

Self-Consistency

Sample k CoT paths at temperature > 0, then take a majority vote over final answers. Dramatically reduces variance for deterministic problems without training.

k=40 samples → near-ceiling gains on GSM8K
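
A minimal self-consistency sketch; sample_answer is a hypothetical helper that runs one CoT completion at temperature > 0 and returns the parsed final answer.

```python
# Self-consistency sketch: sample k reasoning paths and majority-vote the
# final answers. `sample_answer` is a hypothetical, injected callable.
from collections import Counter

def self_consistent_answer(question, sample_answer, k=40):
    votes = Counter(sample_answer(question) for _ in range(k))
    answer, count = votes.most_common(1)[0]
    return answer, count / k   # majority answer and its vote share
```
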
Structure

Tree of Thoughts (ToT)

Expand reasoning into a search tree. Model evaluates intermediate steps and backtracks from dead ends. Best for tasks with clear partial-progress signals.

BFS / DFS search strategies · 74% Game of 24
Automation

APE / DSPy Optimization

Automatic Prompt Engineer generates and selects candidate prompts by scoring on a validation set. DSPy compiles declarative pipelines into optimized prompts and few-shot examples.

Outperforms hand-crafted prompts on 17/24 tasks
Grounding

Few-Shot Selection

Retrieve semantically-similar exemplars per query (rather than fixed examples). Use embedding-based k-NN over curated demonstration pools. Dynamic ICL > static ICL.

KATE retrieval framework · coverage + diversity
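
A minimal sketch of embedding-based exemplar retrieval as described above; embed is a hypothetical function returning a 1-D vector, and demos / demo_embeddings stand in for an assumed pre-built demonstration pool.

```python
# Dynamic few-shot selection sketch: pick the k demonstrations whose
# embeddings are closest (cosine similarity) to the incoming query.
import numpy as np

def select_exemplars(query, demos, demo_embeddings, embed, k=4):
    q = embed(query)                                   # hypothetical embedder
    emb = np.asarray(demo_embeddings)                  # (n_demos, dim)
    sims = emb @ q / (np.linalg.norm(emb, axis=1) * np.linalg.norm(q) + 1e-8)
    top = np.argsort(-sims)[:k]
    return [demos[i] for i in top]
```
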
Format

Output Structuring

Constrained decoding (JSON mode, grammar-guided generation) ensures parseable outputs. Use XML tags for complex multi-part reasoning. Grammar-constrained generation is production-grade.

Outlines / Guidance / LMQL libraries
```python
# DSPy — compile a pipeline with optimized prompts
import dspy

class CoT(dspy.Module):
    def __init__(self):
        super().__init__()
        self.prog = dspy.ChainOfThought("question -> answer")

    def forward(self, question):
        return self.prog(question=question)

teleprompter = dspy.BootstrapFewShot(metric=validate_answer)
optimized = teleprompter.compile(CoT(), trainset=trainset)
```
06

Fine-Tuning Strategies

Parameter-Efficient Fine-Tuning

  • LoRA — Low-Rank Adaptation. Inject trainable rank-r matrices into attention layers. 0.1–1% of parameters updated. Standard for 7B–70B models (config sketch after this list).
  • QLoRA — LoRA on 4-bit quantized base model. Enables 70B fine-tuning on single A100.
  • DoRA — Weight-Decomposed LoRA. Decomposes weights into magnitude + direction; more expressive than vanilla LoRA.
  • Prefix Tuning / P-Tuning v2 — Learn soft prompt tokens prepended at every layer; the base model’s weights stay frozen.
  • IA³ — Learns element-wise scaling vectors. Even fewer parameters than LoRA; good for few-shot adaptation.
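
A minimal LoRA sketch using the Hugging Face peft library, assuming a Llama-style causal LM; the rank, alpha, and target module names are illustrative and architecture-dependent.

```python
# LoRA sketch with peft: wrap a causal LM so only low-rank adapter
# matrices injected into the attention projections are trained.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
config = LoraConfig(
    r=16,                                  # adapter rank
    lora_alpha=32,                         # scaling factor
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],   # architecture-dependent
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, config)
model.print_trainable_parameters()         # typically well under 1% of weights
```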

Full Fine-Tuning Considerations

  • Catastrophic Forgetting — mitigate with EWC (Elastic Weight Consolidation), replay buffers, or low learning rates.
  • Data Quality — 1k high-quality examples often beat 100k noisy ones. Deduplication is essential.
  • Curriculum Learning — order training by difficulty; start easy, introduce hard examples progressively.
  • Learning Rate Schedule — cosine decay with linear warmup (sketch after this list). Peak LR: 1e-5 to 5e-5 for instruction tuning.
  • Multi-task SFT — mix target task with general instruction data to preserve generalization.
  • Merge & Mix — Model merging (SLERP, TIES, DARE) combines fine-tuned models without re-training.
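
A minimal sketch of the cosine-with-warmup schedule from the list above, expressed as a multiplier on the peak learning rate; the warmup length, floor ratio, and peak LR are illustrative.

```python
# Cosine decay with linear warmup, as a multiplier on the peak LR.
import math

def lr_multiplier(step, warmup_steps, total_steps, min_ratio=0.1):
    if step < warmup_steps:
        return step / max(1, warmup_steps)                # linear warmup
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1 + math.cos(math.pi * progress))     # 1 -> 0
    return min_ratio + (1 - min_ratio) * cosine

peak_lr = 2e-5
lrs = [peak_lr * lr_multiplier(s, warmup_steps=100, total_steps=1000)
       for s in range(1000)]
```
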
07

RAG — Retrieval-Augmented Generation

A
Indexing
Chunk docs (fixed / semantic / hierarchical). Embed with dense retriever (E5, BGE). Store in vector DB (Pinecone, Weaviate, pgvector).
B
Retrieval
Dense (ANN), sparse (BM25), or hybrid retrieval. HyDE: generate hypothetical doc then retrieve. Multi-query expansion.
C
Re-ranking
Cross-encoder re-ranking of the top-K results (Cohere Rerank, BGE reranker). Reciprocal Rank Fusion for ensemble retrieval (fusion sketch after stage E).
D
Generation
Long-context packing. Lost-in-the-middle mitigation: put key docs at start/end. Faithfulness check post-generation.
E
RAGAS Eval
Context precision, recall, answer faithfulness, answer relevancy — the four RAGAS dimensions for systematic RAG evaluation.
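
A minimal Reciprocal Rank Fusion sketch, as mentioned in the re-ranking stage; the k constant (60 is a common default) and the example rankings are illustrative.

```python
# Reciprocal Rank Fusion sketch: merge ranked lists from dense and sparse
# retrievers by summing 1 / (k + rank) per document across lists.
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

fused = reciprocal_rank_fusion([
    ["doc3", "doc1", "doc7"],   # dense (ANN) ranking
    ["doc1", "doc5", "doc3"],   # sparse (BM25) ranking
])
```
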
08

Constitutional AI & Safety

Constitutional AI (CAI)

Anthropic’s approach to alignment using a set of explicit principles (a “constitution”). Model critiques and revises its own outputs according to these rules — replacing expensive human preference labeling for safety. Used in Claude’s development.

Two phases: SL-CAI (supervised on self-revised outputs) followed by RL-CAI (RLHF but with AI-generated preference data from constitutional principles rather than humans).

Red Teaming & Robustness

  • Jailbreak Categories — direct instruction, roleplay-based, many-shot, prompt injection, cipher encoding, virtualization attacks.
  • Automated Red Teaming — train attacker LM with RL to generate adversarial prompts. Scales beyond manual red teaming.
  • Adversarial Training — include successful jailbreaks in safety training. Arms race dynamic — requires continuous updates.
  • Input Guardrails — classifier-based filters (Llama Guard, WildGuard) before model call. Adds latency; tunable thresholds.
  • Output Guardrails — post-hoc moderation on generated text. Catch what prompts miss. NeMo Guardrails, Guardrails AI.
09

Inference Optimization

Memory

KV Cache Management

Cache key-value attention states across requests. Prefix caching (same system prompt → shared KV) reduces TTFT by 80%+ on long contexts. PagedAttention (vLLM) enables dynamic KV allocation.

Prefix cache hit rate is the primary cost lever
Decoding

Speculative Decoding

A small draft model generates k tokens; the large model verifies them in parallel. Accept/reject via rejection sampling. 2–4× throughput with identical output distribution.

Works best with a large draft-to-target size gap, e.g. Llama-68M drafting for a 70B target
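
A minimal sketch of the token-level accept/reject rule behind speculative decoding; p_draft and p_target are assumed to be full next-token probability distributions from the draft and target models.

```python
# Speculative decoding acceptance sketch: accept a drafted token with
# probability min(1, p_target / p_draft); on rejection, resample from the
# normalized residual distribution max(0, p_target - p_draft).
import numpy as np

def accept_or_resample(token, p_draft, p_target, rng=np.random.default_rng()):
    if rng.random() < min(1.0, p_target[token] / p_draft[token]):
        return token                                   # draft token accepted
    residual = np.maximum(p_target - p_draft, 0.0)
    residual /= residual.sum()
    return rng.choice(len(p_target), p=residual)       # corrected resample
```
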
Quantization

Weight Quantization

INT8 (bitsandbytes, LLM.int8()), INT4 (GPTQ, AWQ), FP8 (native H100 support). AWQ calibrates per-channel to preserve outlier weights. Near-zero quality loss at 4-bit.

2–4× memory reduction · 1.5–3× throughput
Batching

Continuous Batching

Dynamic insertion of new requests mid-batch (iteration-level scheduling). Eliminates idle GPU time from variable-length sequences. vLLM, TensorRT-LLM, SGLang implement this.

5–10× GPU utilization vs. static batching
Distillation

Knowledge Distillation

Train a small “student” model to mimic the probability distributions (soft targets) of a large “teacher.” Intermediate-layer distillation captures richer representations than output-only.

DistilBERT: 40% smaller, 97% capability retained
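
A minimal distillation-loss sketch mixing temperature-softened teacher targets with the ordinary hard-label loss; T and alpha are illustrative hyperparameters.

```python
# Knowledge distillation sketch: KL between temperature-softened teacher
# and student distributions, mixed with hard-label cross-entropy.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)                                   # gradient scale correction
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```
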
Architecture

MoE & Sparse Models

Mixture of Experts activates only a subset of parameters per token (Mixtral: 2/8 experts). Huge parameter counts with inference cost of smaller dense model. Routing quality is critical.

Mixtral 8×7B ≈ 13B active params · GPT-4 widely reported to be MoE
10

LLMOps — Production & Observability

Production Monitoring Stack

  • Tracing — end-to-end request traces with latency breakdown per component (retrieval, LLM, guardrails). OpenTelemetry + Langfuse / LangSmith.
  • Prompt Versioning — treat prompts as code. Git-based versioning with eval regression tests on every change. Never deploy a prompt without eval gating.
  • Cost Tracking — per-user, per-feature token consumption. Cache optimization opportunities. Alert on >2σ cost anomalies.
  • Hallucination Detection — groundedness scoring (RAG context vs. output), factual consistency classifiers (FActScore, SAFE), entity verification.
  • Drift Detection — monitor output distribution shifts. Input/output similarity to training data. Human eval periodic spot-checks.

Deployment Patterns

  • Blue/Green LLM Rollout — parallel model serving; shift traffic by percentage; instant rollback.
  • Canary Deployment — route 5% of real traffic to candidate model; use implicit signals before full launch.
  • Model Router — classify query complexity; route cheap queries to small fast models, hard queries to frontier models. RouteLLM, LlamaIndex.
  • Fallback Chains — primary model → fallback model → rule-based. Resilience without hard outages (sketch after this list).
  • Caching Layer — semantic cache on query embeddings (GPTCache). High hit rates on FAQ workloads; up to 60% cost reduction.
  • Human-in-the-Loop — confidence threshold triggers human review. Capture corrections as training signal. Flywheel for continuous improvement.
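
A minimal fallback-chain sketch as described in the list above; the model-calling functions in the chain are hypothetical placeholders.

```python
# Fallback chain sketch: try models in order of preference and degrade to a
# rule-based reply rather than failing hard.
def answer_with_fallback(query, chain):
    for name, call in chain:
        try:
            return name, call(query)
        except Exception:
            continue                       # timeout, rate limit, 5xx, ...
    return "rules", "Sorry, I can't answer that right now."

# chain = [("primary", call_frontier_model), ("fallback", call_small_model)]
```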

“A model that is 95% accurate but 100% confident is far more dangerous than one that is 90% accurate and knows when it doesn’t know.”

— On calibration, the underappreciated dimension of LLM quality
```python
# Model router — intelligent cost-quality tradeoff
from routellm import Controller

router = Controller(
    routers=["mf"],                 # matrix factorization router
    strong_model="gpt-4o",
    weak_model="claude-haiku-4-5",
    threshold=0.11584,              # ~50% cost reduction, minimal quality loss
)
response = router.chat.completions.create(
    model="router-mf-0.11584",
    messages=[{"role": "user", "content": query}],
)
```
