Expert Guide · Supervised Fine-Tuning
Preparing & Cleaning Custom Datasets
A comprehensive, practitioner-level reference for every stage of building high-quality training data — from raw collection to production-ready JSONL.
✓ What Good Data Does
- Teaches the model how to respond — tone, length, style
- Aligns outputs with human preferences and safety constraints
- Generalises to unseen prompts through consistent patterns
- Reduces hallucination by grounding responses in real examples
✕ What Bad Data Does
- Introduces style drift and inconsistent formatting
- Amplifies annotation biases into the fine-tuned model
- Causes catastrophic forgetting of pre-trained knowledge
- Produces reward hacking when used for RLHF preference data
Human-Written
Highest quality signal. Annotators write prompts and ideal completions from scratch. Expensive (~$15–$80 per example for expert domains), but gold-standard for safety-critical tasks.
Scraped / Synthetic
Use a teacher LLM (GPT-4, Claude) to generate large volumes of instruction-response pairs. Cheap at scale. Risk: distributional collapse if source model is weak.
Converted Corpora
Transform existing structured data — FAQs, documentation, support tickets — into instruction format. Requires careful prompt templating and thorough cleaning.
“Diversity of tasks in the prompt distribution matters more than sheer volume. A 10k dataset spanning 200 distinct task types will typically outperform 100k examples of one task.”
Prompt Diversity Checklist
- Cover all task types relevant to your use case (summarisation, Q&A, extraction, generation, classification)
- Vary prompt length: short one-liners, medium paragraphs, long multi-part instructions
- Include edge cases: ambiguous prompts, refused requests, multi-turn context windows
- Balance domain distribution (do not over-represent the easiest domain)
- Sample adversarial prompts for safety alignment if applicable
- Exact Deduplication: Hash each (prompt, completion) pair with SHA-256 or MD5 and remove any row whose hash has been seen before. This is O(n) and should always be your first pass — it's free signal cleanup.
- Near-Duplicate Detection with MinHash/LSH: Use MinHash Locality-Sensitive Hashing to group pairs with Jaccard similarity above ~0.8, then keep one representative per cluster. Libraries: `datasketch`, `text-dedup`. Effective for scraped corpora.
- Semantic Deduplication: Embed prompts using a sentence encoder (e.g. `all-MiniLM-L6-v2`), cluster with FAISS or cosine similarity, and prune clusters down to a target size. Catches paraphrastic duplicates missed by character-level methods.
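A minimal sketch of the first two passes, assuming rows are dicts with `prompt` and `completion` keys; the unigram shingling and the 0.8 threshold are illustrative choices, not canon:

```python
import hashlib
from datasketch import MinHash, MinHashLSH  # pip install datasketch

def exact_dedup(rows):
    """Pass 1: drop rows whose (prompt, completion) SHA-256 has been seen before."""
    seen, kept = set(), []
    for row in rows:
        # \x1e record separator avoids prompt/completion boundary collisions
        digest = hashlib.sha256(
            (row["prompt"] + "\x1e" + row["completion"]).encode("utf-8")
        ).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(row)
    return kept

def near_dedup(rows, threshold=0.8, num_perm=128):
    """Pass 2: MinHash LSH over word unigrams; keeps the first row seen in
    each cluster of pairs with Jaccard similarity above `threshold`."""
    lsh = MinHashLSH(threshold=threshold, num_perm=num_perm)
    kept = []
    for i, row in enumerate(rows):
        m = MinHash(num_perm=num_perm)
        for token in set((row["prompt"] + " " + row["completion"]).split()):
            m.update(token.encode("utf-8"))
        if not lsh.query(m):      # no existing cluster is this similar
            lsh.insert(str(i), m)
            kept.append(row)
    return kept
```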
Length Filters
- Minimum prompt tokens (e.g. ≥ 8): removes stub prompts
- Maximum context tokens (≤ model context window): prevents truncation
- Completion length ratio: flag if completion is < 10% or > 500% of prompt length
Perplexity Filters
- Run a small language model (KenLM, n-gram) on completions
- Remove very high perplexity (garbled / incoherent text)
- Remove very low perplexity (boilerplate / templated filler)
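A sketch of this band-pass filter with the `kenlm` Python bindings, assuming an n-gram model file you have trained or downloaded yourself; the thresholds are placeholders to calibrate on your own data:

```python
import kenlm  # pip install kenlm; requires a pre-trained .arpa or .bin model

lm = kenlm.Model("5gram.bin")  # hypothetical path — supply your own n-gram LM

def perplexity_band(rows, low=10.0, high=1000.0):
    """Keep completions inside a middle perplexity band.
    Thresholds are illustrative; calibrate on a labelled sample first."""
    kept = []
    for row in rows:
        ppl = lm.perplexity(row["completion"])
        if low <= ppl <= high:  # too low = templated filler, too high = garbled
            kept.append(row)
    return kept
```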
Reward Model Scoring
- Pass (prompt, completion) through an RM trained on human preference data
- Retain only top-K percentile by RM score
- Effective but requires a pre-trained RM — use `OpenAssistant/reward-model-deberta` (see the sketch below)
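A sketch of RM-based filtering with `transformers`. The full repo id below is my completion of the model name above — an assumption to verify on the Hugging Face Hub before relying on it:

```python
import numpy as np
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Assumed repo id completing the name above -- verify on the Hub.
RM_NAME = "OpenAssistant/reward-model-deberta-v3-large-v2"
tok = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME).eval()

@torch.no_grad()
def rm_score(prompt: str, completion: str) -> float:
    """Score one (prompt, completion) pair; higher = preferred by the RM."""
    inputs = tok(prompt, completion, return_tensors="pt", truncation=True)
    return rm(**inputs).logits[0].item()

def keep_top_percentile(rows, pct=50.0):
    """Retain the top `pct` percent of rows by reward-model score."""
    scores = np.array([rm_score(r["prompt"], r["completion"]) for r in rows])
    cutoff = np.percentile(scores, 100.0 - pct)
    return [r for r, s in zip(rows, scores) if s >= cutoff]
```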
Heuristic Rules
- Bullet/symbol ratio: remove if > 30% of lines start with "•" or "-"
- URL density: flag completions containing > 5 URLs
- Language detection: keep only the target language (use `langdetect`)
- PII scan: redact emails, phone numbers, SSNs via regex or `presidio`
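The length and heuristic rules above compose naturally into a single rule-based gate. A sketch, with every threshold and the target language as assumptions to tune; token counts use whitespace splitting for brevity, not your real tokenizer:

```python
import re
from langdetect import detect  # pip install langdetect
from langdetect.lang_detect_exception import LangDetectException

URL_RE = re.compile(r"https?://\S+")

def passes_heuristics(prompt: str, completion: str, min_prompt_tokens: int = 8) -> bool:
    """Rule-based gate combining the length and heuristic filters above."""
    if len(prompt.split()) < min_prompt_tokens:       # stub prompt
        return False
    ratio = len(completion) / max(len(prompt), 1)
    if ratio < 0.1 or ratio > 5.0:                    # suspicious length ratio
        return False
    lines = completion.splitlines() or [completion]
    bullets = sum(line.lstrip().startswith(("•", "-")) for line in lines)
    if bullets / len(lines) > 0.3:                    # bullet-spam completion
        return False
    if len(URL_RE.findall(completion)) > 5:           # URL-stuffed completion
        return False
    try:
        if detect(completion) != "en":                # target language is an assumption
            return False
    except LangDetectException:                       # too short/ambiguous to detect
        return False
    return True
```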
[Chart: Typical Retention Rates by Filter]
- Unicode & Encoding Normalisation: Apply `unicodedata.normalize("NFC", text)` to collapse multiple codepoints to canonical form. Strip zero-width spaces, BOM characters, and other invisible glyphs that cause tokenizer mismatches.
- Whitespace & Line Ending Standardisation: Replace all CRLF with LF. Collapse runs of more than two newlines into two. Strip trailing whitespace per line. Be deliberate about whether you want tabs or spaces in code examples.
- Quotation Mark & Dash Unification: Decide on a canonical form: curly quotes (“ ”) vs straight quotes (" "), em-dashes (—) vs double hyphens (--). Consistency matters for the tokenizer and reduces vocabulary fragmentation.
- HTML / Markdown Artefact Removal: Scraped data often contains residual HTML tags (e.g. `<div>`). Use `BeautifulSoup` for HTML stripping; write targeted regex for common Markdown artefacts you don't want in completions.
- PII Redaction: Before release or training, scan with `presidio-analyzer` or `spacy` NER to detect names, emails, phone numbers, and credit card patterns. Replace with typed placeholders: `[EMAIL]`, `[PHONE]`.
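A single-pass sketch of these normalisation steps; the quote and dash canon chosen here (straight quotes, double hyphens) is one convention, not the only defensible one:

```python
import re
import unicodedata

# Zero-width space/joiners and BOM, mapped to None for deletion
ZERO_WIDTH = dict.fromkeys(map(ord, "\u200b\u200c\u200d\ufeff"))
QUOTES = str.maketrans({"\u201c": '"', "\u201d": '"', "\u2018": "'", "\u2019": "'"})

def normalise(text: str) -> str:
    """Apply the normalisation steps above in one pass."""
    text = unicodedata.normalize("NFC", text)      # canonical codepoint form
    text = text.translate(ZERO_WIDTH)              # strip invisible glyphs
    text = text.replace("\r\n", "\n")              # CRLF -> LF
    text = re.sub(r"\n{3,}", "\n\n", text)         # collapse runs of >2 newlines
    text = "\n".join(line.rstrip() for line in text.split("\n"))
    text = text.translate(QUOTES)                  # curly -> straight quotes
    text = text.replace("\u2014", "--")            # em dash -> double hyphen
    return text
```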
| Format | Schema | Use When | Frameworks |
|---|---|---|---|
| Prompt–Completion | `{"prompt": "…", "completion": "…"}` | Base model fine-tuning, simple text completion tasks | OpenAI legacy, Hugging Face SFTTrainer |
| Chat Messages | `{"messages": [{"role":"user","content":"…"},{"role":"assistant","content":"…"}]}` | Instruction-tuning chat models, multi-turn conversations | OpenAI, Axolotl, LlamaFactory |
| Alpaca Format | `{"instruction":"…","input":"…","output":"…"}` | When there's a clear task instruction separate from a user input | Alpaca, LlamaFactory, FastChat |
| ShareGPT | `{"conversations": [{"from":"human","value":"…"},{"from":"gpt","value":"…"}]}` | Multi-turn datasets, human/GPT conversational data | FastChat, LlamaFactory, Axolotl |
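Converting between these schemas is mechanical. As an illustration, a sketch mapping Alpaca rows onto the chat-messages format; joining instruction and input with a blank line is an assumption, so match whatever template your training framework expects:

```python
import json

def alpaca_to_chat(row, system_prompt=None):
    """Map one Alpaca-format row onto the chat-messages schema."""
    user = row["instruction"]
    if row.get("input"):                 # optional input field in Alpaca rows
        user = f"{user}\n\n{row['input']}"
    messages = [{"role": "system", "content": system_prompt}] if system_prompt else []
    messages += [
        {"role": "user", "content": user},
        {"role": "assistant", "content": row["output"]},
    ]
    return {"messages": messages}

# JSONL export: one JSON object per line
# print(json.dumps(alpaca_to_chat(row), ensure_ascii=False))
```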
System Prompt Design
Your system prompt establishes the model’s persona and behavioral contract. Keep it: concise (under 200 tokens), consistent across all examples in the dataset, and representative of the exact system prompt used at inference time. Mismatch here causes distribution shift at deployment.
Multi-Turn Construction
For multi-turn datasets, ensure each conversation is coherent end-to-end. Avoid synthetic conversations where the assistant answers questions not present in the prior context. Use conversation trees when generating branching dialogues from a single prompt root.
Annotation Guidelines
- Write explicit rubrics for quality, style, and format
- Provide 10–20 worked examples per task type
- Define clear edge case handling rules
- Version control your guidelines document
Inter-Annotator Agreement
- Target Cohen’s κ > 0.7 before scaling
- Run calibration sessions after each batch
- Adjudicate disagreements with a senior reviewer
- Track IAA per annotator to detect drift
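Cohen's κ can be checked with scikit-learn; the annotator labels below are hypothetical:

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical labels from two annotators over the same calibration batch
annotator_a = ["good", "bad", "good", "good", "bad"]
annotator_b = ["good", "bad", "bad", "good", "bad"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # target > 0.7 before scaling annotation
```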
Quality Assurance Sampling
- Sample 5–10% of each annotator’s work for QA review
- Auto-flag examples for re-annotation based on RM score outliers
- Maintain a gold set of 200–500 expert-verified examples
“The annotation rubric is your training objective made legible. If annotators cannot agree on what a good answer looks like, your model cannot learn it.”
- Schema validation: every row matches the expected JSON schema — required keys present, types correct, no null values in critical fields
- Tokenization dry-run: pass all examples through the target tokenizer; assert no sequence exceeds max_length; flag truncated examples
- Special token audit: ensure no raw special tokens (e.g. `<|endoftext|>`, `<s>`) appear in the text fields; they must only appear as designated control tokens
- Train/val leakage check: run deduplication across split boundaries; ensure no exact or near-duplicate prompts exist in both train and eval sets
- Label distribution: plot completion length histograms and task-type distributions; ensure eval set distribution mirrors train set
- Encoding sanity: assert all files are valid UTF-8 JSONL — one JSON object per line, no trailing commas, no BOM
- Toxicity & safety scan: run a hate speech / PII classifier over all completions; quarantine flagged examples for review
- Loss masking verification: if using loss masking to train only on completions, verify the mask correctly zeros out the prompt token positions
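A minimal pre-flight script covering the first three checks, assuming prompt–completion JSONL; the `gpt2` tokenizer and 4096-token limit are stand-ins for your target model's:

```python
import json
from transformers import AutoTokenizer

MAX_LEN = 4096                               # assumption: target context window
SPECIAL = ("<|endoftext|>", "<s>", "</s>")   # extend with your model's control tokens
tok = AutoTokenizer.from_pretrained("gpt2")  # stand-in: use your target tokenizer

def preflight(path: str):
    """Schema validation, special-token audit, and tokenization dry-run."""
    with open(path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, 1):
            row = json.loads(line)           # raises on malformed JSON
            assert isinstance(row.get("prompt"), str), f"line {lineno}: bad prompt"
            assert isinstance(row.get("completion"), str), f"line {lineno}: bad completion"
            text = row["prompt"] + row["completion"]
            assert not any(s in text for s in SPECIAL), f"line {lineno}: raw special token"
            n = len(tok(text)["input_ids"])
            if n > MAX_LEN:
                print(f"line {lineno}: {n} tokens, would be truncated")
```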
- Parallelise preprocessing: Use Hugging Face `datasets`' `.map()` for parallelised normalisation.
- Version everything: Tag dataset versions with hashes. If a model degrades, you can bisect to the exact data commit that caused it. Use DVC or Hugging Face dataset repos.
- Start small, scale deliberately: Train on a 1k–5k sample first. Verify loss curves are sensible, eval metrics are improving, and there are no data bugs before committing to full-dataset training.
- Maintain a held-out human eval set: Keep 200–500 expert-human-written examples permanently out of training. Use them only to evaluate final model quality. Do not update this set as the project evolves.
- Monitor for distribution shift: Compare token length distributions, vocabulary overlap, and task-type counts between train/val/test. Distribution shift between splits is one of the most common silent failure modes.
- Respect the context budget: Do not simply truncate long examples. Instead, chunk intelligently: split documents at natural boundaries (paragraphs, sections) and create separate training examples from each chunk.
- Use loss masking on prompts: When fine-tuning, compute loss only on the assistant completion tokens, not on the system prompt or user message. This is critical: training on the user turn wastes capacity and distorts the gradient signal (a sketch follows this list).
- Iterate with ablations: Run controlled ablations on dataset composition: remove one data source, change a filter threshold, adjust the system prompt. This builds causal understanding of what your data is actually teaching.
- Document everything: Maintain a data card (following the HF or Google Data Cards format) recording sources, licenses, cleaning decisions, annotation guidelines, split sizes, and known limitations. Essential for reproducibility and responsible AI.
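As referenced in the loss-masking item above, a minimal sketch using the Hugging Face/PyTorch convention that label `-100` is ignored by cross-entropy; this is a plain-text version, and chat models should build the prompt via the tokenizer's chat template instead:

```python
def build_masked_example(tokenizer, prompt: str, completion: str):
    """Compute loss only on completion tokens: -100 is the ignore index
    for PyTorch cross-entropy / Hugging Face causal-LM training."""
    prompt_ids = tokenizer(prompt, add_special_tokens=False)["input_ids"]
    completion_ids = tokenizer(completion, add_special_tokens=False)["input_ids"]
    return {
        "input_ids": prompt_ids + completion_ids,
        "labels": [-100] * len(prompt_ids) + completion_ids,  # prompt masked out
    }

def verify_mask(example, n_prompt_tokens: int):
    """The loss-masking verification check from the pre-flight list above."""
    labels = example["labels"]
    assert all(l == -100 for l in labels[:n_prompt_tokens]), "prompt not fully masked"
    assert all(l != -100 for l in labels[n_prompt_tokens:]), "completion wrongly masked"
```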

