Evaluating & Optimizing
Large Language Models
A comprehensive, practitioner-grade reference covering every dimension of AI model evaluation — from benchmark design and alignment techniques to inference optimization and production observability.
Benchmarking Frameworks
MMLU & MMLU-Pro
Massive Multitask Language Understanding — 57 academic subjects from STEM to law. The Pro variant adds harder, multi-step questions and expands answer choices from 4 to 10, blunting guessing strategies.
HumanEval & SWE-Bench
HumanEval tests function synthesis from docstrings. SWE-Bench evaluates real GitHub issue resolution — a more realistic coding benchmark for agentic systems.
MATH & AIME
MATH benchmark covers competition math across 5 difficulty levels. AIME (American Invitational Mathematics Examination) problems probe extreme reasoning — near-impossible for older models.
TruthfulQA & WildGuard
TruthfulQA measures resistance to human misconceptions. WildGuard covers refusal accuracy, over-refusal rates, and safety classification across 13 harm categories.
MMMU & ChartQA
Massive Multi-discipline Multimodal Understanding — college-level questions spanning 30 subjects across six disciplines. ChartQA tests structured visual reasoning over charts and graphs.
GAIA & AgentBench
GAIA tests real-world general AI assistants with web browsing and tool use. AgentBench evaluates agents across OS, DB, web shopping, and coding environments.
Key Evaluation Metrics
| Metric | Measures | Typical Use | Caution |
|---|---|---|---|
| Perplexity (PPL) | How well the model predicts a text sample — lower = better language fit | Pre-training quality, base model comparison | Indirect — correlates only loosely with downstream task performance |
| BLEU / ROUGE | N-gram overlap between generated and reference text | Summarization, translation | Weak — penalizes valid paraphrases; largely deprecated |
| BERTScore | Semantic similarity via contextual embeddings | Generation quality, translation | Medium — better than BLEU but still reference-dependent |
| Win Rate (LLM-as-Judge) | Preference comparison using a judge model (GPT-4, Llama-3) | Chat quality, instruction-following | Robust — watch for position & verbosity bias in judge |
| Calibration (ECE) | Whether model confidence matches empirical accuracy | Reliability, hallucination mitigation | Critical — essential for high-stakes applications (sketch below) |
| Refusal Rate / Over-refusal | Safety compliance vs. helpfulness balance | Safety-aligned models | Dual — both false positives and negatives matter equally |
| Latency / TTFT | Time To First Token & end-to-end latency (p50/p95/p99) | Production deployment SLAs | Critical — tail latency (p99) often defines UX |
| Token Throughput | Output tokens per second per GPU | Inference cost optimization | Tradeoff — batching improves throughput, increases latency |
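As a worked example for the calibration row, here is a minimal sketch of Expected Calibration Error with equal-width confidence bins; the parallel arrays `confidences` and `correct` are assumed to come from an eval run.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: sample-weighted gap between accuracy and mean confidence per bin."""
    confidences = np.asarray(confidences)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # weight by the fraction of samples in this bin
    return ece
```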
The Evaluation Pipeline
“Evaluation is not a gate — it is a feedback loop. The quality of your eval dataset directly bounds the quality of your model improvements.”
RLHF & Alignment Techniques
Reinforcement Learning from Human Feedback
RLHF is the dominant alignment paradigm: human raters compare model outputs, a reward model learns their preferences, and the language model is optimized via PPO (Proximal Policy Optimization) to maximize reward while not drifting too far from the base policy (KL penalty).
Three-stage pipeline: SFT (supervised fine-tuning on demonstrations) → Reward Model Training → RL Policy Optimization.
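In standard notation (a generic formulation, not tied to any specific implementation), the RL stage maximizes learned reward under a KL leash to the reference policy:

$$
\max_{\pi_\theta}\ \mathbb{E}_{x \sim \mathcal{D},\, y \sim \pi_\theta(\cdot \mid x)}\big[\, r_\phi(x, y) \,\big] \;-\; \beta\, \mathbb{D}_{\mathrm{KL}}\big[\, \pi_\theta(\cdot \mid x) \,\|\, \pi_{\mathrm{ref}}(\cdot \mid x) \,\big]
$$

Too small a β and the policy reward-hacks; too large and it barely moves from the SFT reference.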
Modern Alternatives
- DPO — Direct Preference Optimization. Removes the explicit reward model; optimizes directly on preference pairs (loss sketched after this list). More stable, cheaper, slightly less flexible.
- IPO — Identity Preference Optimization. Fixes DPO overfitting to deterministic preferences.
- ORPO — Odds Ratio Preference Optimization. Merges SFT + alignment into one stage.
- GRPO — Group Relative Policy Optimization. Used in DeepSeek-R1; removes value network entirely.
- Constitutional AI — Rule-based self-critique (see §07).
- RLAIF — AI-generated feedback replaces expensive human annotations.
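A minimal PyTorch sketch of the DPO loss as given in the paper; the inputs are assumed to be sequence-level (token-summed) log-probabilities of the chosen and rejected responses under the trainable policy and a frozen reference model.

```python
import torch.nn.functional as F

def dpo_loss(pi_chosen_logps, pi_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """DPO: push the policy's implicit reward margin toward the preference."""
    chosen_margin = beta * (pi_chosen_logps - ref_chosen_logps)
    rejected_margin = beta * (pi_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_margin - rejected_margin).mean()
```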
Prompt Engineering & Optimization
Chain-of-Thought (CoT)
Elicit step-by-step reasoning with “Let’s think step by step” or few-shot examples. Zero-shot CoT works on capable models; few-shot CoT is more reliable and domain-controllable.
Self-Consistency
Sample k CoT paths at temperature > 0, then take a majority vote over final answers. Dramatically reduces variance on problems with a single verifiable answer — no extra training required.
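A minimal sketch, assuming a hypothetical `generate(prompt, temperature)` model client and an `extract_answer` parser for the final-answer span:

```python
from collections import Counter

def self_consistency(prompt, k=16, temperature=0.7):
    """Sample k reasoning paths, majority-vote the final answers."""
    answers = []
    for _ in range(k):
        completion = generate(prompt, temperature=temperature)  # hypothetical model call
        answers.append(extract_answer(completion))              # hypothetical parser
    answer, votes = Counter(answers).most_common(1)[0]
    return answer, votes / k  # agreement rate doubles as a rough confidence signal
```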
Tree of Thoughts (ToT)
Expand reasoning into a search tree. Model evaluates intermediate steps and backtracks from dead ends. Best for tasks with clear partial-progress signals.
APE / DSPy Optimization
Automatic Prompt Engineer generates and selects candidate prompts by scoring on a validation set. DSPy compiles declarative pipelines into optimized prompts and few-shot examples.
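The core propose-and-select loop behind APE-style optimization is simple enough to sketch; `propose_variants` (LLM-generated paraphrases of the current best prompt) and `score` (a validation-set metric) are hypothetical stand-ins, and this is not the DSPy API:

```python
def optimize_prompt(seed_prompt, val_set, rounds=3, beam=8):
    """Greedy search over LLM-proposed prompt candidates (APE-style sketch)."""
    best = seed_prompt
    for _ in range(rounds):
        candidates = [best] + propose_variants(best, n=beam)   # hypothetical LLM call
        scored = [(score(p, val_set), p) for p in candidates]  # hypothetical eval metric
        best = max(scored)[1]                                  # keep the top scorer
    return best
```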
Few-Shot Selection
Retrieve semantically-similar exemplars per query (rather than fixed examples). Use embedding-based k-NN over curated demonstration pools. Dynamic ICL > static ICL.
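A sketch of dynamic exemplar retrieval, assuming a hypothetical `embed` function and a precomputed, L2-normalized matrix `pool_embs` over the demonstration pool:

```python
import numpy as np

def select_exemplars(query, pool, pool_embs, k=4):
    """Pick the k demonstrations nearest the query by cosine similarity."""
    q = embed(query)       # hypothetical embedding call
    q = q / np.linalg.norm(q)
    sims = pool_embs @ q   # cosine similarity via dot product (rows pre-normalized)
    return [pool[i] for i in np.argsort(-sims)[:k]]
```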
Output Structuring
Constrained decoding (JSON mode, grammar-guided generation) ensures parseable outputs. Use XML tags for complex multi-part reasoning. Grammar-constrained generation is production-grade.
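Where grammar-constrained decoding isn't available, a validate-and-retry loop is a common fallback; a sketch with a hypothetical `generate` call and a caller-supplied list of required keys:

```python
import json

def generate_json(prompt, required_keys, max_retries=3):
    """Request JSON, re-prompt on parse or shape failures (fallback sketch)."""
    for attempt in range(max_retries):
        raw = generate(prompt + "\nRespond with a single JSON object.")  # hypothetical
        try:
            obj = json.loads(raw)
            if all(key in obj for key in required_keys):
                return obj
        except json.JSONDecodeError:
            pass
        prompt += f"\nAttempt {attempt + 1} was not valid JSON. Try again."
    raise ValueError("no valid JSON after retries")
```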
Fine-Tuning Strategies
Parameter-Efficient Fine-Tuning
- LoRA — Low-Rank Adaptation. Inject trainable rank-r matrices into attention layers; only 0.1–1% of parameters are updated (minimal layer sketch after this list). Standard for 7B–70B models.
- QLoRA — LoRA on 4-bit quantized base model. Enables 70B fine-tuning on single A100.
- DoRA — Weight-Decomposed LoRA. Decomposes weights into magnitude + direction; more expressive than vanilla LoRA.
- Prefix Tuning / P-Tuning v2 — Learn soft prompt vectors prepended to every layer's input; the base model's weights stay frozen.
- IA³ — Learns element-wise scaling vectors. Even fewer parameters than LoRA; good for few-shot adaptation.
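A minimal sketch of the core LoRA mechanic in PyTorch, wrapping a single linear layer (real adapters target the attention projections across all blocks):

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base linear layer plus a trainable rank-r update."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # freeze pretrained weights
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: update starts at 0
        self.scale = alpha / r

    def forward(self, x):
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```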
Full Fine-Tuning Considerations
- Catastrophic Forgetting — mitigate with EWC (Elastic Weight Consolidation), replay buffers, or low learning rates.
- Data Quality — 1k high-quality examples often beats 100k noisy ones. Deduplication is essential.
- Curriculum Learning — order training by difficulty; start easy, introduce hard examples progressively.
- Learning Rate Schedule — cosine decay with linear warmup (sketch after this list). Peak LR: 1e-5 to 5e-5 for instruction tuning.
- Multi-task SFT — mix target task with general instruction data to preserve generalization.
- Merge & Mix — Model merging (SLERP, TIES, DARE) combines fine-tuned models without re-training.
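The warmup-plus-cosine schedule mentioned above, as a small sketch:

```python
import math

def lr_at_step(step, total_steps, warmup_steps, peak_lr=2e-5, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))
```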
RAG — Retrieval-Augmented Generation
Ground generation in retrieved evidence: embed the query, fetch the most relevant chunks from a vector index, and inject them into the prompt so answers draw on source documents rather than parametric memory alone. Retrieval quality (chunking, embedding model, reranking) bounds answer quality.
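A minimal retrieve-then-generate sketch; `embed` and `generate` are hypothetical stand-ins for an embedding model and an LLM client, and `chunk_embs` is assumed L2-normalized:

```python
import numpy as np

def rag_answer(query, chunks, chunk_embs, k=5):
    """Retrieve top-k chunks by cosine similarity, then answer from them."""
    q = embed(query)                        # hypothetical embedding call
    q = q / np.linalg.norm(q)
    top = np.argsort(-(chunk_embs @ q))[:k]
    context = "\n\n".join(chunks[i] for i in top)
    prompt = (f"Answer using only the context below.\n\n"
              f"Context:\n{context}\n\nQuestion: {query}")
    return generate(prompt)                 # hypothetical model call
```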
Constitutional AI & Safety
Constitutional AI (CAI)
Anthropic’s approach to alignment using a set of explicit principles (a “constitution”). Model critiques and revises its own outputs according to these rules — replacing expensive human preference labeling for safety. Used in Claude’s development.
Two phases: SL-CAI (supervised on self-revised outputs) followed by RL-CAI (RLHF but with AI-generated preference data from constitutional principles rather than humans).
Red Teaming & Robustness
- Jailbreak Categories — direct instruction, roleplay-based, many-shot, prompt injection, cipher encoding, virtualization attacks.
- Automated Red Teaming — train attacker LM with RL to generate adversarial prompts. Scales beyond manual red teaming.
- Adversarial Training — include successful jailbreaks in safety training. Arms race dynamic — requires continuous updates.
- Input Guardrails — classifier-based filters (Llama Guard, WildGuard) before the model call (gate sketch after this list). Adds latency; tunable thresholds.
- Output Guardrails — post-hoc moderation on generated text. Catch what prompts miss. NeMo Guardrails, Guardrails AI.
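A sketch of the input-guardrail gate described above; `safety_score` (standing in for a classifier such as Llama Guard returning P(unsafe)) and `generate` are hypothetical:

```python
def guarded_call(user_input, threshold=0.5):
    """Classify the input before spending LLM tokens; refuse above threshold."""
    if safety_score(user_input) > threshold:  # hypothetical classifier call
        return "I can't help with that request."
    return generate(user_input)               # hypothetical model call
```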
Inference Optimization
KV Cache Management
Cache key-value attention states across requests. Prefix caching (same system prompt → shared KV) reduces TTFT by 80%+ on long contexts. PagedAttention (vLLM) enables dynamic KV allocation.
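A toy illustration of prefix-cache keying; `compute_kv` is a hypothetical prefill pass, and real engines (e.g. vLLM's PagedAttention) manage this in paged GPU blocks rather than a Python dict:

```python
import hashlib

kv_store = {}  # prefix hash -> cached per-layer KV tensors (toy in-memory store)

def get_prefix_kv(prompt_tokens, prefix_len):
    """Reuse attention KV states for a shared prompt prefix."""
    key = hashlib.sha256(repr(tuple(prompt_tokens[:prefix_len])).encode()).hexdigest()
    if key not in kv_store:
        kv_store[key] = compute_kv(prompt_tokens[:prefix_len])  # hypothetical prefill
    return kv_store[key]  # decoding resumes after the cached prefix
```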
Speculative Decoding
A small draft model generates k tokens; the large model verifies them in parallel. Accept/reject via rejection sampling. 2–4× throughput with identical output distribution.
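A greedy-verification simplification of one draft-then-verify step; the full algorithm accepts/rejects probabilistically so the output distribution exactly matches the target model. `draft_next_tokens` and `target_greedy` are hypothetical calls, with `target_greedy` returning the target's greedy token at each of the k+1 positions from a single forward pass:

```python
def speculative_step(tokens, k=4):
    """Draft k tokens cheaply, verify them with one large-model pass."""
    proposal = draft_next_tokens(tokens, k)     # hypothetical small-model call
    verified = target_greedy(tokens, proposal)  # hypothetical: k+1 greedy target tokens
    accepted = []
    for i, tok in enumerate(proposal):
        if verified[i] != tok:                  # first mismatch: take the target's token, stop
            accepted.append(verified[i])
            break
        accepted.append(tok)
    else:
        accepted.append(verified[k])            # all k accepted: one bonus token for free
    return tokens + accepted
```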
Weight Quantization
INT8 (bitsandbytes, LLM.int8()), INT4 (GPTQ, AWQ), FP8 (native H100 support). AWQ calibrates per-channel scales to preserve outlier weights. Typically near-zero quality loss at 4-bit.
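A sketch of plain symmetric per-channel INT8 quantization, the baseline that schemes like GPTQ and AWQ improve on with calibration and outlier handling:

```python
import numpy as np

def quantize_int8(w):
    """Symmetric per-output-channel INT8 quantization of a weight matrix."""
    scale = np.abs(w).max(axis=1, keepdims=True) / 127.0
    scale = np.maximum(scale, 1e-12)  # guard against all-zero rows
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float32) * scale
```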
Continuous Batching
Dynamic insertion of new requests mid-batch (iteration-level scheduling). Eliminates idle GPU time from variable-length sequences. vLLM, TensorRT-LLM, SGLang implement this.
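The scheduling idea in miniature; `decode_step` is a hypothetical batched forward pass over request objects with a `finished` flag, and real engines layer paged KV memory management on top:

```python
def serve_loop(waiting, max_batch=32):
    """Iteration-level scheduling: admit and evict requests every decode step."""
    running = []
    while waiting or running:
        while waiting and len(running) < max_batch:
            running.append(waiting.pop(0))  # admit new requests mid-flight
        decode_step(running)                # hypothetical: one token per running request
        running = [r for r in running if not r.finished]  # evict completed sequences
```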
Knowledge Distillation
Train a small “student” model to mimic the probability distributions (soft targets) of a large “teacher.” Intermediate-layer distillation captures richer representations than output-only.
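The classic soft-target loss (Hinton et al.) as a PyTorch sketch; the logits are assumed to come from aligned teacher and student forward passes:

```python
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL between temperature-softened distributions; T^2 rescales gradients."""
    return F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
```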
MoE & Sparse Models
Mixture of Experts activates only a subset of parameters per token (Mixtral: 2/8 experts). Huge parameter counts with inference cost of smaller dense model. Routing quality is critical.
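A sketch of top-k routing for a batch of token representations; `gate` (a linear layer producing expert logits) and the `experts` list are assumptions, and production routers add load-balancing losses this omits:

```python
import torch

def moe_forward(x, gate, experts, k=2):
    """Route each token to its top-k experts and mix their outputs."""
    weights = torch.softmax(gate(x), dim=-1)      # [batch, n_experts]
    topw, topi = weights.topk(k, dim=-1)
    topw = topw / topw.sum(dim=-1, keepdim=True)  # renormalize over selected experts
    out = torch.zeros_like(x)
    for slot in range(k):
        for e, expert in enumerate(experts):
            mask = topi[:, slot] == e
            if mask.any():
                out[mask] += topw[mask, slot, None] * expert(x[mask])
    return out
```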
LLMOps — Production & Observability
Production Monitoring Stack
- Tracing — end-to-end request traces with latency breakdown per component (retrieval, LLM, guardrails). OpenTelemetry + Langfuse / LangSmith.
- Prompt Versioning — treat prompts as code. Git-based versioning with eval regression tests on every change. Never deploy a prompt without eval gating.
- Cost Tracking — per-user, per-feature token consumption; surfaces cache-optimization opportunities. Alert on >2σ cost anomalies (sketch after this list).
- Hallucination Detection — groundedness scoring (RAG context vs. output), factual consistency classifiers (FActScore, SAFE), entity verification.
- Drift Detection — monitor output distribution shifts. Input/output similarity to training data. Human eval periodic spot-checks.
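The >2σ alert from the cost-tracking bullet, reduced to its core; `history` is assumed to be recent daily token spend (at least two days):

```python
import statistics

def is_cost_anomaly(history, today, sigma=2.0):
    """Flag spend more than `sigma` standard deviations above the rolling mean."""
    return today > statistics.fmean(history) + sigma * statistics.stdev(history)
```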
Deployment Patterns
- Blue/Green LLM Rollout — parallel model serving; shift traffic by percentage; instant rollback.
- Canary Deployment — route 5% of real traffic to candidate model; use implicit signals before full launch.
- Model Router — classify query complexity; route cheap queries to small fast models, hard queries to frontier models. RouteLLM, LlamaIndex.
- Fallback Chains — primary model → fallback model → rule-based. Resilience without hard outages.
- Caching Layer — semantic cache on query embeddings (GPTCache). High hit rate on FAQ workloads; up to 60% cost reduction (sketch after this list).
- Human-in-the-Loop — confidence threshold triggers human review. Capture corrections as training signal. Flywheel for continuous improvement.
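A sketch of the semantic-cache pattern (GPTCache-style, not its API); `embed` is a hypothetical embedding call, and the similarity threshold is the main tuning knob:

```python
import numpy as np

class SemanticCache:
    """Embedding-similarity response cache."""
    def __init__(self, threshold=0.92):
        self.keys, self.values, self.threshold = [], [], threshold

    def get(self, query):
        if not self.keys:
            return None
        q = embed(query)                # hypothetical embedding call
        q = q / np.linalg.norm(q)
        sims = np.stack(self.keys) @ q  # cosine similarity (keys pre-normalized)
        i = int(np.argmax(sims))
        return self.values[i] if sims[i] >= self.threshold else None

    def put(self, query, response):
        q = embed(query)                # hypothetical embedding call
        self.keys.append(q / np.linalg.norm(q))
        self.values.append(response)
```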
“A model that is 95% accurate but 100% confident is far more dangerous than one that is 90% accurate and knows when it doesn’t know.”


