Fine-Tuning & Optimization — Expert Guide
AI Expert Series · Vol. 3

Fine-Tuning &
Optimization

A complete practitioner’s guide to instruction tuning, prompt strategies, parameter-efficient adaptation, and alignment techniques for large language models.

Instruction Tuning LoRA / QLoRA RLHF DPO Prompt Engineering Evaluation PEFT Alignment
01 — FOUNDATIONS

When to fine-tune

Fine-tuning is not always the right answer. Understand the decision matrix before committing GPU-hours.

Fine-tune when…
You have 500+ high-quality examples, need consistent tone/format, want to compress long system prompts into weights, or require domain-specific knowledge not in the base model.
Don’t fine-tune when…
The base model + good prompting already works, your data is scarce (<100 examples), or you need real-time knowledge. RAG is often cheaper and more maintainable.
Consider RAG + prompting first
Retrieval-augmented generation solves knowledge gaps without retraining. Pair with a strong system prompt and few-shot examples before reaching for fine-tuning.
Cost vs. benefit
Fine-tuning can reduce inference costs if it lets you use a smaller model. A fine-tuned 7B model often outperforms a prompted 70B model on narrow tasks.
02 — FOUNDATIONS

Data preparation

The quality of your fine-tune is bounded by the quality of your data. Garbage in, garbage out — always.

  • 01
    Define the task precisely Write down input format, expected output format, edge cases, and failure modes before collecting any data.
  • 02
    Collect or generate examples Use existing logs, human annotators, or GPT-4 to bootstrap. For instruction tuning, target ≥500 diverse examples. For specialized tasks, 50–200 high-quality examples can suffice with PEFT.
  • 03
    Format consistently Pick a chat template (ChatML, Alpaca, ShareGPT) and stick to it. Mix of templates in training data causes degraded performance.
  • 04
    Split and deduplicate 80/10/10 train/val/test. Run MinHash deduplication — even 1% overlap between train and test inflates eval metrics significantly.
  • 05
    Run a data audit Inspect random samples. Look for label noise, format inconsistencies, and distribution imbalance. Fix before training, not after.
# Alpaca-style data format (JSONL) { “instruction”: “Classify the sentiment of the following review.”, “input”: “The product broke after two days. Terrible quality.”, “output”: “negative” } # ChatML format (preferred for chat models) { “messages”: [ { “role”: “system”, “content”: “You are a sentiment classifier.” }, { “role”: “user”, “content”: “The product broke after two days.” }, { “role”: “assistant”, “content”: “negative” } ] }
03 — TECHNIQUES

Instruction tuning

Instruction tuning teaches a base model to follow natural-language directions by fine-tuning on instruction–response pairs.

Base LLMpretrained weights
SFTsupervised fine-tune
Reward Modelpreference data
RLHF / DPOalignment
Deployed Modelproduction
Full fine-tuning
Updates all model weights. Highest potential quality but requires significant VRAM (typically 2–4× model size for optimizer states). Best when you have >10K examples and a large enough cluster.
SFT (supervised)
Next-token prediction on labeled instruction–response pairs. The backbone of instruction tuning. Loss is computed only on the response tokens, not the prompt. Simple, effective, and the starting point for everything else.
Multi-task tuning
Train on multiple task types simultaneously. Improves generalization and reduces catastrophic forgetting. Mix classification, generation, extraction, QA, and summarization tasks in training data.
Continual pre-training
Domain-specific text before instruction tuning. For highly technical domains (medical, legal, code), run a continual pre-training pass on domain documents first, then instruction-tune on top.
Key insight: Loss masking matters. During instruction tuning, mask the loss on instruction tokens (only compute loss on the completion). Otherwise, the model wastes capacity learning to reproduce prompts it already knows.
04 — TECHNIQUES

Prompt strategies

Effective prompting is the cheapest optimization lever. Master these before reaching for any training approach.

Zero-shot
Direct instruction, no examples. Works well with large RLHF-tuned models. Fails on niche formats and specialized domains.
Few-shot
2–8 in-context examples. Dramatically improves consistency of output format. Examples must be diverse and high-quality.
Chain-of-thought
Prompt the model to reason step-by-step before answering. “Let’s think step by step” unlocks latent reasoning in models ≥7B.
ReAct
Interleave Reasoning and Acting. Model alternates between thought steps and tool calls. Foundation of most agent frameworks.
Self-consistency
Sample multiple reasoning paths, take majority vote. Increases accuracy 5–15% on math/logic at cost of 3–5× inference compute.
Program of thought
Model writes code instead of prose for computation. Offloads arithmetic to a Python interpreter — reduces hallucination on numerical tasks.
System prompt engineering
Role + context
Open with a role declaration and task context. “You are an expert medical coder specializing in ICD-10 classification…” conditions the distribution before the first user token.
Output constraints
Specify format explicitly: JSON schema, word count, required fields, and tone. Use negative constraints (“do not include explanations”) alongside positive ones.
Persona + principles
List behavioral rules inline. “Always cite sources. Refuse off-topic requests. Use metric units.” Numbered lists of principles outperform long prose system prompts.
Example anchoring
End the system prompt with 1–2 worked examples in the exact input/output format you expect. This is more powerful than describing the format in prose.
# Structured system prompt template “”” ## Role You are a {role} specializing in {domain}. ## Task {clear task description with scope boundaries} ## Output format Return a JSON object with these fields: – {field_1}: {type and description} – {field_2}: {type and description} ## Rules 1. {behavioral constraint} 2. {quality standard} 3. Never {what to avoid} ## Example Input: {example input} Output: {exact expected output} “””
05 — TECHNIQUES

PEFT & LoRA

Parameter-efficient fine-tuning adapts large models by training only a small fraction of parameters — often <1% of total weights.

LoRA weight decomposition
W’ = W₀ + BA  where  B ∈ ℝd×r, A ∈ ℝr×k
r = rank (typically 4–64). Only A and B are trained. W₀ is frozen.
Method Trainable params Memory Best for
Full fine-tune 100% Very high Large datasets, maximum quality
LoRA 0.1–1% Low Most use cases, great default
QLoRA 0.1–1% Very low Consumer GPUs, 4-bit quantized base
Prefix tuning 0.01% Very low Lightweight task adaptation
IA³ <0.01% Minimal Few-shot adapted models
Adapter layers 0.5–3% Low Multi-task, modular deployments
# QLoRA config using Hugging Face PEFT + bitsandbytes from peft import LoraConfig, get_peft_model from transformers import BitsAndBytesConfig bnb_config = BitsAndBytesConfig( load_in_4bit=True, bnb_4bit_quant_type=“nf4”, bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=torch.bfloat16 ) lora_config = LoraConfig( r=16, # rank lora_alpha=32, # scaling = alpha / r = 2.0 target_modules=[“q_proj”, “v_proj”, “k_proj”, “o_proj”], lora_dropout=0.05, bias=“none”, task_type=“CAUSAL_LM” )
Rule of thumb: Start with r=16, lora_alpha=32 (scaling=2). If underfitting, increase r to 32 or 64. Apply LoRA to all attention projection matrices (q, k, v, o) for best results. Adding MLP layers helps on complex reasoning tasks.
06 — TECHNIQUES

RLHF & DPO

Alignment techniques that optimize model outputs to match human preferences beyond what supervised loss captures.

RLHF — Reinforcement Learning from Human Feedback
Three-stage pipeline: (1) SFT on demonstrations, (2) train a reward model on preference pairs, (3) optimize the SFT model with PPO against the reward model. Complex, unstable, but powerful.
DPO — Direct Preference Optimization
Eliminates the reward model entirely. Reparameterizes the RLHF objective as a classification loss on preferred vs. rejected response pairs. Simpler, more stable, and competitive with PPO.
DPO loss function
DPO = −𝔼[log σ(β log πθ(yw)/πref(yw) − β log πθ(yl)/πref(yl))]
yw = preferred response · yl = rejected response · β controls KL penalty strength (typical: 0.1–0.5)
Preference data format
Pairs of responses ranked by human annotators or a strong model. Each example contains a prompt, a “chosen” response, and a “rejected” response. Quality trumps quantity — 1K excellent pairs > 10K mediocre ones.
KTO (Kahneman-Tversky)
Alignment without explicit pairs. Uses only binary “thumbs up / thumbs down” signals, not preference comparisons. More practical when paired data collection is expensive.
ORPO (Odds Ratio)
Combines SFT and preference alignment in a single training pass. Eliminates the need for a reference model. Faster to train and increasingly popular for open-source fine-tuning.
07 — ENGINEERING

Hyperparameters

Sensible defaults for most fine-tuning jobs. Tune these in order of impact.

Parameter Recommended range Notes
Learning rate 1e-5 → 3e-4 Start at 2e-4 for LoRA, 1e-5 for full fine-tune. Use cosine schedule with warmup.
Warmup steps 3–10% of total steps Critical for stability. Prevents large gradient updates early in training.
Batch size 32–256 effective Use gradient accumulation to simulate large batches on limited VRAM.
Epochs 1–5 Small datasets: 3–5 epochs. Large datasets: 1–2. Watch validation loss for overfitting.
Max sequence length 512–4096 tokens Memory scales quadratically with sequence length. Pack shorter sequences for efficiency.
Weight decay 0.0–0.1 Light regularization. Use 0.01 as a safe default with AdamW.
Gradient clipping 0.3–1.0 max_grad_norm = 1.0 is the standard. Lower (0.3) for unstable training runs.
Common pitfall: Learning rate too high is the #1 cause of fine-tune failures. If loss diverges or validation perplexity spikes after a few hundred steps, halve the learning rate before changing anything else.
08 — ENGINEERING

Evaluation

Automatic metrics are necessary but not sufficient. Build a multi-layer evaluation stack.

Perplexity / loss
Good for detecting overfitting. Low training loss + high validation loss = overfit. Not correlated with downstream quality on generation tasks.
Task-specific metrics
F1, BLEU, ROUGE, accuracy, exact match. Choose the metric that matches your task. ROUGE for summarization, EM for QA, F1 for extraction.
LLM-as-judge
Use GPT-4 or Claude to score outputs on a rubric. Define axes: accuracy, helpfulness, format compliance, conciseness. Correlates better with human preference than n-gram metrics.
Human eval
A/B preference studies. Gold standard. Run on 100–200 random samples. Use trained annotators with clear rubrics and inter-annotator agreement checks.
Regression suite
Curated hard examples covering known failure modes. Run on every checkpoint. Catch regressions before they reach production.
Safety & alignment
Red-team with adversarial prompts. Test for refusal quality, hallucination rate, toxicity. Never skip for user-facing deployments.
09 — ENGINEERING

Deployment optimization

Getting a fine-tuned model into production efficiently requires inference-time optimization on top of training-time choices.

Quantization
Reduce weight precision from FP16 to INT8 or INT4. GPTQ and AWQ are preferred for LLMs. Expect 2–4× memory reduction with <1% quality drop on most tasks. Use bitsandbytes for on-the-fly quantization.
KV cache
Cache attention keys/values for prefill tokens. Critical for long-context inference. PagedAttention (vLLM) enables high-throughput serving by managing KV cache memory like an OS page table.
Continuous batching
Process requests at the iteration level, not request level. Eliminates GPU idle time from variable-length outputs. Essential for production serving. Use vLLM, TGI, or TensorRT-LLM.
Speculative decoding
Use a small draft model to generate token candidates. The target model verifies in parallel. Typically 2–4× speedup with identical output distribution. Works best for long generation tasks.
Merging adapters
Merge LoRA weights back into the base model before deployment. Eliminates adapter overhead at inference time. Use merge_and_unload() in PEFT. Store the original LoRA separately for future updates.
Production checklist: Merge adapters → quantize to INT8 → enable continuous batching → set up KV cache → add request-level rate limiting → monitor GPU utilization and P95 latency → set up a shadow deployment for A/B evaluation before full rollout.
Please enter product(-s) ASIN(-s)!

Leave a Reply

Your email address will not be published. Required fields are marked *