SFT Dataset Preparation — Expert Guide

Preparing & Cleaning Custom Datasets

A comprehensive, practitioner-level reference for every stage of building high-quality training data — from raw collection to production-ready JSONL.

Pipeline: 01 Collect → 02 Deduplicate → 03 Filter → 04 Normalize → 05 Annotate → 06 Format → 07 Validate → 08 Tokenize
01 · Foundation: Why Data Quality Defines SFT Success
  • 80% of SFT failures are traced to data-quality issues
  • Quality beats 10× quantity in training signal
  • ~5% duplicate rate is acceptable in a cleaned corpus
  • 3–5 human rater passes for annotation gold sets
Supervised fine-tuning teaches a model to follow instructions by showing it (prompt, completion) pairs where the completion is the desired response. The model learns to mimic the distribution of these completions — which means garbage in, garbage out is not just a cliché here; it is a mathematical certainty. A single noisy annotation pass can introduce systematic biases that survive hundreds of thousands of gradient steps.

What Good Data Does

  • Teaches the model how to respond — tone, length, style
  • Aligns outputs with human preferences and safety constraints
  • Generalises to unseen prompts through consistent patterns
  • Reduces hallucination by grounding responses in real examples

What Bad Data Does

  • Introduces style drift and inconsistent formatting
  • Amplifies annotation biases into the fine-tuned model
  • Causes catastrophic forgetting of pre-trained knowledge
  • Produces reward hacking when used for RLHF preference data
02 · Step One: Data Collection Strategies
The origin of your data shapes everything downstream. There are three primary sourcing strategies, each with distinct trade-offs between coverage, cost, and control.

Human-Written

Highest quality signal. Annotators write prompts and ideal completions from scratch. Expensive (~$15–$80 per example for expert domains), but gold-standard for safety-critical tasks.


Best for: instruction following, safety

Scraped / Synthetic

Use a teacher LLM (GPT-4, Claude) to generate large volumes of instruction-response pairs. Cheap at scale. Risk: distributional collapse if source model is weak.


Best for: volume, domain coverage

Converted Corpora

Transform existing structured data — FAQs, documentation, support tickets — into instruction format. Requires careful prompt templating and thorough cleaning.


Best for: domain-specific knowledge

“Diversity of tasks in the prompt distribution matters more than sheer volume. A 10k dataset spanning 200 distinct task types will typically outperform 100k examples of one task.”

— Core principle from FLAN, Self-Instruct, and Alpaca research

Prompt Diversity Checklist

  • Cover all task types relevant to your use case (summarisation, Q&A, extraction, generation, classification)
  • Vary prompt length: short one-liners, medium paragraphs, long multi-part instructions
  • Include edge cases: ambiguous prompts, refused requests, multi-turn context windows
  • Balance domain distribution (do not over-represent the easiest domain)
  • Sample adversarial prompts for safety alignment if applicable
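One way to audit the balance items above is a quick distribution count. This is a minimal sketch; the `task_type` field is a hypothetical label your pipeline would attach to each example:

```python
from collections import Counter

def task_balance(dataset, max_share=0.25):
    """Count task types and flag any that dominate the mix.

    `task_type` is a hypothetical per-example label; substitute
    whatever field your pipeline actually uses.
    """
    counts = Counter(row["task_type"] for row in dataset)
    total = sum(counts.values())
    # Flag task types whose share of the dataset exceeds max_share
    flagged = {t: n / total for t, n in counts.items() if n / total > max_share}
    return counts, flagged

data = [
    {"task_type": "qa"}, {"task_type": "qa"}, {"task_type": "qa"},
    {"task_type": "summarisation"},
]
counts, flagged = task_balance(data)
# "qa" makes up 75% of this toy set, so it is flagged as over-represented
```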
03 · Step Two: Deduplication
Duplicate examples cause the model to overfit specific phrasings and inflate training metrics without adding new signal. There are three levels of deduplication you must apply in order.
  • Exact Deduplication

    Hash each (prompt, completion) pair with SHA-256 or MD5. Remove any row whose hash has been seen before. This is O(n) and should always be your first pass — it’s free signal cleanup.

  • Near-Duplicate Detection with MinHash/LSH

    Use MinHash with locality-sensitive hashing (LSH) to group pairs with Jaccard similarity above ~0.8. Keep one representative per cluster. Libraries: datasketch, text-dedup. Effective for scraped corpora.

  • Semantic Deduplication

    Embed prompts using a sentence encoder (e.g. all-MiniLM-L6-v2), cluster with FAISS or cosine similarity, and prune clusters down to a target size. Catches paraphrastic duplicates missed by character-level methods.
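The exact-deduplication pass described above takes only a few lines; a minimal sketch:

```python
import hashlib

def exact_dedup(dataset):
    """Drop rows whose (prompt, completion) pair has been seen before."""
    seen, kept = set(), []
    for row in dataset:
        # Hash the pair; the \x1e separator avoids accidental collisions
        # between ("ab", "c") and ("a", "bc").
        key = hashlib.sha256(
            (row["prompt"] + "\x1e" + row["completion"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"prompt": "Hi", "completion": "Hello!"},
    {"prompt": "Hi", "completion": "Hello!"},  # exact duplicate
    {"prompt": "Bye", "completion": "Goodbye!"},
]
# exact_dedup(rows) keeps 2 of the 3 rows
```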

# Near-dedup with datasketch (Python)
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)
seen_keys = set()

for idx, row in enumerate(dataset):
    m = MinHash(num_perm=128)
    for token in row["prompt"].split():
        m.update(token.encode("utf-8"))
    result = lsh.query(m)
    if not result:  # no near-duplicate found
        lsh.insert(f"id_{idx}", m)
        seen_keys.add(idx)  # keep this example

clean_dataset = [row for i, row in enumerate(dataset) if i in seen_keys]
04 · Step Three: Quality Filtering
After deduplication, apply a multi-signal filter to remove low-quality examples. Each filter below is a gate: examples failing any gate are discarded or sent for human review.

Length Filters

  • Minimum prompt tokens (e.g. ≥ 8): removes stub prompts
  • Maximum context tokens (≤ model context window): prevents truncation
  • Completion length ratio: flag if completion is < 10% or > 500% of prompt length

Perplexity Filters

  • Run a small language model (KenLM, n-gram) on completions
  • Remove very high perplexity (garbled / incoherent text)
  • Remove very low perplexity (boilerplate / templated filler)

Reward Model Scoring

  • Pass (prompt, completion) through an RM trained on human preference data
  • Retain only top-K percentile by RM score
  • Effective but requires a pre-trained RM — use OpenAssistant/reward-model-deberta

Heuristic Rules

  • Bullet/symbol ratio: remove if > 30% of lines start with "•" or "-"
  • URL density: flag > 5 URLs in completion
  • Language detection: keep only target language (use langdetect)
  • PII scan: redact emails, phone numbers, SSNs via regex or presidio
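A minimal sketch of the length and heuristic gates above, using the illustrative thresholds from this section (a real pipeline would count tokenizer tokens rather than whitespace tokens):

```python
import re

def passes_heuristics(prompt, completion):
    """Gate an example on simple length and surface heuristics."""
    # Stub-prompt gate: whitespace tokens as a proxy for the >= 8-token minimum
    if len(prompt.split()) < 8:
        return False
    # Completion/prompt length ratio gate: between 10% and 500%
    ratio = len(completion) / max(len(prompt), 1)
    if ratio < 0.10 or ratio > 5.0:
        return False
    # Bullet ratio gate: more than 30% of lines starting with a bullet
    lines = [ln for ln in completion.splitlines() if ln.strip()]
    if lines:
        bullets = sum(ln.lstrip().startswith(("•", "-")) for ln in lines)
        if bullets / len(lines) > 0.30:
            return False
    # URL density gate: more than 5 URLs in the completion
    if len(re.findall(r"https?://\S+", completion)) > 5:
        return False
    return True
```

Examples failing any gate would be discarded or routed to human review, as described above.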

Typical Retention Rates by Filter

  Exact dedup: ~95% retained
  Near-dedup: ~82% retained
  Length filter: ~78% retained
  Perplexity filter: ~68% retained
  RM scoring (top 70%): ~48% retained
05 · Step Four: Text Normalization
Normalization imposes a consistent surface form on your data so the model learns from content, not formatting artifacts. Apply these transformations in a deterministic, reproducible pipeline.
  • Unicode & Encoding Normalisation

    Apply unicodedata.normalize("NFC", text) to collapse decomposed codepoint sequences into their canonical composed form. Strip zero-width spaces, BOM characters, and other invisible glyphs that cause tokenizer mismatches.

  • Whitespace & Line Ending Standardisation

    Replace all CRLF with LF. Collapse runs of more than two newlines into two. Strip trailing whitespace per line. Be deliberate about whether you want tabs or spaces in code examples.

  • Quotation Mark & Dash Unification

    Decide on a canonical form: curly quotes (“ ”) vs straight quotes (" "). Em dashes (—) vs double hyphens (--). Consistency matters for the tokenizer and reduces vocabulary fragmentation.

  • HTML / Markdown Artefact Removal

    Scraped data often contains residual HTML tags (<div>, &nbsp;). Use BeautifulSoup for HTML stripping; write targeted regex for common Markdown artefacts you don’t want in completions.

  • PII Redaction

    Before release or training, scan with presidio-analyzer or spacy NER to detect names, emails, phone numbers, credit card patterns. Replace with typed placeholders: [EMAIL], [PHONE].
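The first two normalisation steps above can be sketched as one deterministic function:

```python
import re
import unicodedata

def normalise(text):
    """Apply NFC, strip invisible glyphs, standardise whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Strip BOM and zero-width characters that cause tokenizer mismatches
    text = re.sub(r"[\ufeff\u200b\u200c\u200d]", "", text)
    # CRLF -> LF, then collapse runs of 3+ newlines down to two
    text = text.replace("\r\n", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Strip trailing whitespace on each line
    return "\n".join(line.rstrip() for line in text.split("\n"))
```

Because every step is deterministic, rerunning the pipeline on the same input always yields the same output, which keeps dataset versions reproducible.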

06 · Step Five: Dataset Formatting & Schema
Your cleaned examples must be serialised into a format that your training framework understands. The two dominant formats are the prompt-completion pair (legacy) and the conversation / messages array (modern chat format). Choose based on your model’s expected inference format.
Format: Prompt–Completion
  Schema: {"prompt": "…", "completion": "…"}
  Use when: base model fine-tuning, simple text completion tasks
  Frameworks: OpenAI legacy, Hugging Face SFTTrainer

Format: Chat Messages
  Schema: {"messages": [{"role": "user", "content": "…"}, {"role": "assistant", "content": "…"}]}
  Use when: instruction-tuning chat models, multi-turn conversations
  Frameworks: OpenAI, Axolotl, LlamaFactory

Format: Alpaca
  Schema: {"instruction": "…", "input": "…", "output": "…"}
  Use when: a clear task instruction is separate from the user input
  Frameworks: Alpaca, LlamaFactory, FastChat

Format: ShareGPT
  Schema: {"conversations": [{"from": "human", "value": "…"}, {"from": "gpt", "value": "…"}]}
  Use when: multi-turn datasets, human/GPT conversational data
  Frameworks: FastChat, LlamaFactory, Axolotl
# Chat format JSONL — one JSON object per line (shown pretty-printed for readability)
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."},
    {"role": "assistant", "content": "def reverse_linked_list(head):\n    prev = None\n    curr = head\n    while curr:\n        next_node = curr.next\n        curr.next = prev\n        prev = curr\n        curr = next_node\n    return prev"}
  ]
}

System Prompt Design

Your system prompt establishes the model’s persona and behavioral contract. Keep it: concise (under 200 tokens), consistent across all examples in the dataset, and representative of the exact system prompt used at inference time. Mismatch here causes distribution shift at deployment.

Multi-Turn Construction

For multi-turn datasets, ensure each conversation is coherent end-to-end. Avoid synthetic conversations where the assistant answers questions not present in the prior context. Use conversation trees when generating branching dialogues from a single prompt root.
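Converting between the formats above is mechanical. A sketch that lifts an Alpaca-style row into the chat-messages schema (the default system prompt here is illustrative, not prescribed):

```python
def alpaca_to_messages(row, system_prompt="You are a helpful assistant."):
    """Convert an Alpaca-format example into the chat-messages schema."""
    # Fold the optional `input` field into the user turn when present
    user_content = row["instruction"]
    if row.get("input"):
        user_content += "\n\n" + row["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": row["output"]},
        ]
    }

ex = {
    "instruction": "Summarise the text.",
    "input": "SFT needs clean data.",
    "output": "Clean data matters.",
}
converted = alpaca_to_messages(ex)
```

Keeping one converter per source format makes it easy to merge heterogeneous corpora into a single chat-format JSONL before training.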

07 · Step Six: Annotation & Labelling
For human-written or human-verified datasets, annotation quality is everything. Poor annotation guidelines produce inter-annotator disagreement that corrupts your training signal.

Annotation Guidelines

  • Write explicit rubrics for quality, style, and format
  • Provide 10–20 worked examples per task type
  • Define clear edge case handling rules
  • Version control your guidelines document

Inter-Annotator Agreement

  • Target Cohen’s κ > 0.7 before scaling
  • Run calibration sessions after each batch
  • Adjudicate disagreements with a senior reviewer
  • Track IAA per annotator to detect drift

Quality Assurance Sampling

  • Sample 5–10% of each annotator’s work for QA review
  • Auto-flag examples for re-annotation based on RM score outliers
  • Maintain a gold set of 200–500 expert-verified examples
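The κ > 0.7 target above can be checked without extra dependencies; a minimal two-rater Cohen's κ:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each rater's marginals
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "good", "bad", "bad", "bad", "good"]
# kappa ≈ 0.67 for this toy pair — just below the 0.7 scaling bar
```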

“The annotation rubric is your training objective made legible. If annotators cannot agree on what a good answer looks like, your model cannot learn it.”

— Annotation best practice, adapted from Ouyang et al. (InstructGPT)
08 · Step Seven: Validation & Integrity Checks
Before any training run, execute a full integrity audit on your final dataset. These checks catch silent failures — malformed JSON, tokenization edge cases, label leakage — that would otherwise corrupt training.
  • Schema validation: every row matches the expected JSON schema — required keys present, types correct, no null values in critical fields
  • Tokenization dry-run: pass all examples through the target tokenizer; assert no sequence exceeds max_length; flag truncated examples
  • Special token audit: ensure no raw special tokens (e.g. <|endoftext|>, <s>) appear in the text fields — they must only appear as designated control tokens
  • Train/val leakage check: run deduplication across split boundaries — ensure no exact or near-duplicate prompts exist in both train and eval sets
  • Label distribution: plot completion length histograms and task-type distributions; ensure eval set distribution mirrors train set
  • Encoding sanity: assert all files are valid UTF-8 JSONL — one JSON object per line, no trailing commas, no BOM
  • Toxicity & safety scan: run a hate speech / PII classifier over all completions; quarantine flagged examples for review
  • Loss masking verification: if using loss masking to train only on completions, verify the mask correctly zeros out the prompt token positions
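The train/eval leakage check above can be sketched at the exact-match level (near-duplicate checks would reuse the MinHash pipeline from the deduplication step):

```python
import hashlib

def split_leakage(train_prompts, eval_prompts):
    """Return eval prompts that also appear (exactly) in the train split."""
    def h(p):
        # Normalise trivially before hashing so casing and whitespace
        # differences do not hide an exact duplicate
        return hashlib.sha256(" ".join(p.split()).lower().encode()).hexdigest()
    train_hashes = {h(p) for p in train_prompts}
    return [p for p in eval_prompts if h(p) in train_hashes]

leaks = split_leakage(
    ["Reverse a linked list.", "Explain softmax."],
    ["explain   softmax.", "Write a haiku."],
)
# leaks == ["explain   softmax."] — caught despite casing/whitespace drift
```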
# Tokenization validation with Hugging Face
from transformers import AutoTokenizer
import json

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
MAX_LEN = 4096
issues = []

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        ex = json.loads(line)
        text = tokenizer.apply_chat_template(ex["messages"], tokenize=False)
        ids = tokenizer(text)["input_ids"]
        if len(ids) > MAX_LEN:
            issues.append({"row": i, "length": len(ids)})

print(f"Found {len(issues)} over-length examples")
09 · Ecosystem: Tools & Libraries
datasets (HF)
Load, stream, map, filter, and split large datasets efficiently. First-class JSONL support. Use .map() for parallelised normalisation.
text-dedup
MinHash LSH deduplication at scale. Supports suffix array exact dedup. Handles 100M+ documents on a single machine.
datasketch
Pure-Python MinHash and LSH. Excellent for prototyping near-dedup pipelines before scaling to text-dedup.
presidio
Microsoft’s PII detection and anonymisation engine. Rule-based + NER. Supports 15+ entity types including SSN, IBAN, medical record numbers.
cleanlab
Confident Learning — detects label errors, near-duplicates, and outliers automatically using cross-validation. Works on classification and generative tasks.
Argilla
Open-source annotation platform purpose-built for NLP. Supports human feedback collection, dataset versioning, and RM training data curation.
LabelStudio
Versatile annotation UI. Customisable labelling interfaces for text classification, NER, and instruction-following rating tasks.
faiss
Facebook AI Similarity Search. Use for semantic deduplication — embed prompts, cluster, and prune near-semantic duplicates at billion-scale.
LlamaFactory
End-to-end fine-tuning framework. Native support for ShareGPT, Alpaca, and OpenAI chat formats. Handles dataset merging and training loop.
10 · Expert Principles: Best Practices Summary
  • Version everything

    Tag dataset versions with hashes. If a model degrades, you can bisect to the exact data commit that caused it. Use DVC or Hugging Face dataset repos.

  • Start small, scale deliberately

    Train on a 1k–5k sample first. Verify loss curves are sensible, eval metrics are improving, and there are no data bugs — before committing to full dataset training.

  • Maintain a held-out human eval set

    Keep 200–500 expert-human-written examples permanently out of training. Use them only to evaluate final model quality. Do not update this set as the project evolves.

  • Monitor for distribution shift

    Compare token length distributions, vocabulary overlap, and task-type counts between train/val/test. Distribution shift between splits is one of the most common silent failure modes.

  • Respect the context budget

    Do not simply truncate long examples. Instead, chunk intelligently — split documents at natural boundaries (paragraphs, sections) and create separate training examples from each chunk.

  • Use loss masking on prompts

    When fine-tuning, compute loss only on the assistant completion tokens — not on the system prompt or user message. This is critical: training on the user turn wastes capacity and distorts gradient signal.

  • Iterate with ablations

    Run controlled ablations on dataset composition: remove one data source, change a filter threshold, adjust the system prompt. This builds causal understanding of what your data is actually teaching.

  • Document everything

    Maintain a data card (following HF or Google Data Cards format) recording: sources, licenses, cleaning decisions, annotation guidelines, split sizes, and known limitations. Essential for reproducibility and responsible AI.
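The "Respect the context budget" practice above can be sketched as paragraph-boundary chunking under a budget (character-based here for simplicity; a real pipeline would count tokenizer tokens):

```python
def chunk_document(text, max_chars=2000):
    """Split a document at paragraph boundaries into budget-sized chunks."""
    chunks, current = [], []
    size = 0
    for para in text.split("\n\n"):
        # Start a new chunk rather than splitting a paragraph mid-way
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then becomes its own training example, so no document is silently truncated.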
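The "Use loss masking on prompts" practice above is conventionally implemented by setting prompt positions in the labels sequence to -100, the ignore index used by PyTorch's cross-entropy loss; a framework-agnostic sketch:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(prompt_ids, completion_ids):
    """Labels for causal-LM SFT: compute loss only on completion tokens."""
    input_ids = prompt_ids + completion_ids
    # Mask every prompt position; keep completion tokens as-is
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy token ids for a (prompt, completion) pair
inp, lab = build_labels([101, 7592, 2129], [2000, 4521, 102])
# inp == [101, 7592, 2129, 2000, 4521, 102]
# lab == [-100, -100, -100, 2000, 4521, 102]
```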
