SFT Dataset Preparation — Expert Guide

Preparing & Cleaning Custom Datasets

A comprehensive, practitioner-level reference for every stage of building high-quality training data — from raw collection to production-ready JSONL.

Pipeline: 01 Collect → 02 Deduplicate → 03 Filter → 04 Normalize → 05 Annotate → 06 Format → 07 Validate → 08 Tokenize
01 · Foundation: Why Data Quality Defines SFT Success
  • 80% of SFT failures are traced to data-quality issues
  • Quality beats 10× quantity in training signal
  • ~5% duplicate rate is acceptable in a cleaned corpus
  • 3–5 human rater passes for annotation gold sets
Supervised fine-tuning teaches a model to follow instructions by showing it (prompt, completion) pairs where the completion is the desired response. The model learns to mimic the distribution of these completions — which means garbage in, garbage out is not just a cliché here; it is a mathematical certainty. A single noisy annotation pass can introduce systematic biases that survive hundreds of thousands of gradient steps.

What Good Data Does

  • Teaches the model how to respond — tone, length, style
  • Aligns outputs with human preferences and safety constraints
  • Generalises to unseen prompts through consistent patterns
  • Reduces hallucination by grounding responses in real examples

What Bad Data Does

  • Introduces style drift and inconsistent formatting
  • Amplifies annotation biases into the fine-tuned model
  • Causes catastrophic forgetting of pre-trained knowledge
  • Produces reward hacking when used for RLHF preference data
02 · Step One: Data Collection Strategies
The origin of your data shapes everything downstream. There are three primary sourcing strategies, each with distinct trade-offs between coverage, cost, and control.

Human-Written

Highest quality signal. Annotators write prompts and ideal completions from scratch. Expensive (~$15–$80 per example for expert domains), but gold-standard for safety-critical tasks.


Best for: instruction following, safety

Scraped / Synthetic

Use a teacher LLM (GPT-4, Claude) to generate large volumes of instruction-response pairs. Cheap at scale. Risk: distributional collapse if source model is weak.


Best for: volume, domain coverage

Converted Corpora

Transform existing structured data — FAQs, documentation, support tickets — into instruction format. Requires careful prompt templating and thorough cleaning.


Best for: domain-specific knowledge

“Diversity of tasks in the prompt distribution matters more than sheer volume. A 10k dataset spanning 200 distinct task types will typically outperform 100k examples of one task.”

— Core principle from FLAN, Self-Instruct, and Alpaca research

Prompt Diversity Checklist

  • Cover all task types relevant to your use case (summarisation, Q&A, extraction, generation, classification)
  • Vary prompt length: short one-liners, medium paragraphs, long multi-part instructions
  • Include edge cases: ambiguous prompts, refused requests, multi-turn context windows
  • Balance domain distribution (do not over-represent the easiest domain)
  • Sample adversarial prompts for safety alignment if applicable
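One way to audit the balance items above is a quick distribution count. This is a minimal sketch; the `task_type` field is a hypothetical label your pipeline would attach to each example:

```python
from collections import Counter

def task_balance(dataset, max_share=0.25):
    """Count task types and flag any that dominate the mix.

    `task_type` is a hypothetical per-example label; substitute
    whatever field your pipeline actually uses.
    """
    counts = Counter(row["task_type"] for row in dataset)
    total = sum(counts.values())
    # Flag task types whose share of the dataset exceeds max_share
    flagged = {t: n / total for t, n in counts.items() if n / total > max_share}
    return counts, flagged

data = [
    {"task_type": "qa"}, {"task_type": "qa"}, {"task_type": "qa"},
    {"task_type": "summarisation"},
]
counts, flagged = task_balance(data)
# "qa" makes up 75% of this toy set, so it is flagged as over-represented
```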
03 · Step Two: Deduplication
Duplicate examples cause the model to overfit specific phrasings and inflate training metrics without adding new signal. There are three levels of deduplication you must apply in order.
  • Exact Deduplication

    Hash each (prompt, completion) pair with SHA-256 or MD5. Remove any row whose hash has been seen before. This is O(n) and should always be your first pass — it’s free signal cleanup.

  • Near-Duplicate Detection with MinHash/LSH

    Use MinHash with locality-sensitive hashing (LSH) to group pairs with Jaccard similarity above ~0.8. Keep one representative per cluster. Libraries: datasketch, text-dedup. Effective for scraped corpora.

  • Semantic Deduplication

    Embed prompts using a sentence encoder (e.g. all-MiniLM-L6-v2), cluster with FAISS or cosine similarity, and prune clusters down to a target size. Catches paraphrastic duplicates missed by character-level methods.
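The exact-deduplication pass described above takes only a few lines; a minimal sketch:

```python
import hashlib

def exact_dedup(dataset):
    """Drop rows whose (prompt, completion) pair has been seen before."""
    seen, kept = set(), []
    for row in dataset:
        # Hash the pair; the \x1e separator avoids accidental collisions
        # between ("ab", "c") and ("a", "bc").
        key = hashlib.sha256(
            (row["prompt"] + "\x1e" + row["completion"]).encode("utf-8")
        ).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(row)
    return kept

rows = [
    {"prompt": "Hi", "completion": "Hello!"},
    {"prompt": "Hi", "completion": "Hello!"},  # exact duplicate
    {"prompt": "Bye", "completion": "Goodbye!"},
]
# exact_dedup(rows) keeps 2 of the 3 rows
```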

# Near-dedup with datasketch (Python)
from datasketch import MinHash, MinHashLSH

lsh = MinHashLSH(threshold=0.8, num_perm=128)
seen_keys = set()

for idx, row in enumerate(dataset):
    m = MinHash(num_perm=128)
    for token in row["prompt"].split():
        m.update(token.encode("utf-8"))
    result = lsh.query(m)
    if not result:  # no near-duplicate found
        lsh.insert(f"id_{idx}", m)
        seen_keys.add(idx)  # keep this example

clean_dataset = [row for i, row in enumerate(dataset) if i in seen_keys]
04 · Step Three: Quality Filtering
After deduplication, apply a multi-signal filter to remove low-quality examples. Each filter below is a gate: examples failing any gate are discarded or sent for human review.

Length Filters

  • Minimum prompt tokens (e.g. ≥ 8): removes stub prompts
  • Maximum context tokens (≤ model context window): prevents truncation
  • Completion length ratio: flag if completion is < 10% or > 500% of prompt length

Perplexity Filters

  • Run a small language model (KenLM, n-gram) on completions
  • Remove very high perplexity (garbled / incoherent text)
  • Remove very low perplexity (boilerplate / templated filler)

Reward Model Scoring

  • Pass (prompt, completion) through an RM trained on human preference data
  • Retain only top-K percentile by RM score
  • Effective but requires a pre-trained RM — use OpenAssistant/reward-model-deberta

Heuristic Rules

  • Bullet/symbol ratio: remove if > 30% of lines start with "•" or "-"
  • URL density: flag > 5 URLs in completion
  • Language detection: keep only target language (use langdetect)
  • PII scan: redact emails, phone numbers, SSNs via regex or presidio
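A minimal sketch of the length and heuristic gates above, using the illustrative thresholds from this section (a real pipeline would count tokenizer tokens rather than whitespace tokens):

```python
import re

def passes_heuristics(prompt, completion):
    """Gate an example on simple length and surface heuristics."""
    # Stub-prompt gate: whitespace tokens as a proxy for the >= 8-token minimum
    if len(prompt.split()) < 8:
        return False
    # Completion/prompt length ratio gate: between 10% and 500%
    ratio = len(completion) / max(len(prompt), 1)
    if ratio < 0.10 or ratio > 5.0:
        return False
    # Bullet ratio gate: more than 30% of lines starting with a bullet
    lines = [ln for ln in completion.splitlines() if ln.strip()]
    if lines:
        bullets = sum(ln.lstrip().startswith(("•", "-")) for ln in lines)
        if bullets / len(lines) > 0.30:
            return False
    # URL density gate: more than 5 URLs in the completion
    if len(re.findall(r"https?://\S+", completion)) > 5:
        return False
    return True
```

Examples failing any gate would be discarded or routed to human review, as described above.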

Typical Retention Rates by Filter

  Exact dedup: ~95% retained
  Near-dedup: ~82% retained
  Length filter: ~78% retained
  Perplexity filter: ~68% retained
  RM scoring (top 70%): ~48% retained
05 · Step Four: Text Normalization
Normalization imposes a consistent surface form on your data so the model learns from content, not formatting artifacts. Apply these transformations in a deterministic, reproducible pipeline.
  • Unicode & Encoding Normalisation

    Apply unicodedata.normalize("NFC", text) to collapse decomposed codepoint sequences into their canonical composed form. Strip zero-width spaces, BOM characters, and other invisible glyphs that cause tokenizer mismatches.

  • Whitespace & Line Ending Standardisation

    Replace all CRLF with LF. Collapse runs of more than two newlines into two. Strip trailing whitespace per line. Be deliberate about whether you want tabs or spaces in code examples.

  • Quotation Mark & Dash Unification

    Decide on a canonical form: curly quotes (“ ”) vs straight quotes (" "). Em dashes (—) vs double hyphens (--). Consistency matters for the tokenizer and reduces vocabulary fragmentation.

  • HTML / Markdown Artefact Removal

    Scraped data often contains residual HTML tags (<div>, &nbsp;). Use BeautifulSoup for HTML stripping; write targeted regex for common Markdown artefacts you don’t want in completions.

  • PII Redaction

    Before release or training, scan with presidio-analyzer or spacy NER to detect names, emails, phone numbers, credit card patterns. Replace with typed placeholders: [EMAIL], [PHONE].
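The first two normalisation steps above can be sketched as one deterministic function:

```python
import re
import unicodedata

def normalise(text):
    """Apply NFC, strip invisible glyphs, standardise whitespace."""
    text = unicodedata.normalize("NFC", text)
    # Strip BOM and zero-width characters that cause tokenizer mismatches
    text = re.sub(r"[\ufeff\u200b\u200c\u200d]", "", text)
    # CRLF -> LF, then collapse runs of 3+ newlines down to two
    text = text.replace("\r\n", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    # Strip trailing whitespace on each line
    return "\n".join(line.rstrip() for line in text.split("\n"))
```

Because every step is deterministic, rerunning the pipeline on the same input always yields the same output, which keeps dataset versions reproducible.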

06 · Step Five: Dataset Formatting & Schema
Your cleaned examples must be serialised into a format that your training framework understands. The two dominant formats are the prompt-completion pair (legacy) and the conversation / messages array (modern chat format). Choose based on your model’s expected inference format.
Format: Prompt–Completion
  Schema: {"prompt": "…", "completion": "…"}
  Use when: base model fine-tuning, simple text completion tasks
  Frameworks: OpenAI legacy, Hugging Face SFTTrainer

Format: Chat Messages
  Schema: {"messages": [{"role": "user", "content": "…"}, {"role": "assistant", "content": "…"}]}
  Use when: instruction-tuning chat models, multi-turn conversations
  Frameworks: OpenAI, Axolotl, LlamaFactory

Format: Alpaca
  Schema: {"instruction": "…", "input": "…", "output": "…"}
  Use when: a clear task instruction is separate from the user input
  Frameworks: Alpaca, LlamaFactory, FastChat

Format: ShareGPT
  Schema: {"conversations": [{"from": "human", "value": "…"}, {"from": "gpt", "value": "…"}]}
  Use when: multi-turn datasets, human/GPT conversational data
  Frameworks: FastChat, LlamaFactory, Axolotl
# Chat format JSONL — one JSON object per line (shown pretty-printed for readability)
{
  "messages": [
    {"role": "system", "content": "You are a helpful coding assistant."},
    {"role": "user", "content": "Write a Python function to reverse a linked list."},
    {"role": "assistant", "content": "def reverse_linked_list(head):\n    prev = None\n    curr = head\n    while curr:\n        next_node = curr.next\n        curr.next = prev\n        prev = curr\n        curr = next_node\n    return prev"}
  ]
}

System Prompt Design

Your system prompt establishes the model’s persona and behavioral contract. Keep it: concise (under 200 tokens), consistent across all examples in the dataset, and representative of the exact system prompt used at inference time. Mismatch here causes distribution shift at deployment.

Multi-Turn Construction

For multi-turn datasets, ensure each conversation is coherent end-to-end. Avoid synthetic conversations where the assistant answers questions not present in the prior context. Use conversation trees when generating branching dialogues from a single prompt root.
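Converting between the formats above is mechanical. A sketch that lifts an Alpaca-style row into the chat-messages schema (the default system prompt here is illustrative, not prescribed):

```python
def alpaca_to_messages(row, system_prompt="You are a helpful assistant."):
    """Convert an Alpaca-format example into the chat-messages schema."""
    # Fold the optional `input` field into the user turn when present
    user_content = row["instruction"]
    if row.get("input"):
        user_content += "\n\n" + row["input"]
    return {
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_content},
            {"role": "assistant", "content": row["output"]},
        ]
    }

ex = {
    "instruction": "Summarise the text.",
    "input": "SFT needs clean data.",
    "output": "Clean data matters.",
}
converted = alpaca_to_messages(ex)
```

Keeping one converter per source format makes it easy to merge heterogeneous corpora into a single chat-format JSONL before training.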

07 · Step Six: Annotation & Labelling
For human-written or human-verified datasets, annotation quality is everything. Poor annotation guidelines produce inter-annotator disagreement that corrupts your training signal.

Annotation Guidelines

  • Write explicit rubrics for quality, style, and format
  • Provide 10–20 worked examples per task type
  • Define clear edge case handling rules
  • Version control your guidelines document

Inter-Annotator Agreement

  • Target Cohen’s κ > 0.7 before scaling
  • Run calibration sessions after each batch
  • Adjudicate disagreements with a senior reviewer
  • Track IAA per annotator to detect drift

Quality Assurance Sampling

  • Sample 5–10% of each annotator’s work for QA review
  • Auto-flag examples for re-annotation based on RM score outliers
  • Maintain a gold set of 200–500 expert-verified examples
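The κ > 0.7 target above can be checked without extra dependencies; a minimal two-rater Cohen's κ:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labelling the same items."""
    n = len(labels_a)
    # Observed agreement: fraction of items where both raters agree
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under chance, from each rater's marginals
    ca, cb = Counter(labels_a), Counter(labels_b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    return (observed - expected) / (1 - expected)

a = ["good", "good", "bad", "good", "bad", "good"]
b = ["good", "good", "bad", "bad", "bad", "good"]
# kappa ≈ 0.67 for this toy pair — just below the 0.7 scaling bar
```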

“The annotation rubric is your training objective made legible. If annotators cannot agree on what a good answer looks like, your model cannot learn it.”

— Annotation best practice, adapted from Ouyang et al. (InstructGPT)
08 · Step Seven: Validation & Integrity Checks
Before any training run, execute a full integrity audit on your final dataset. These checks catch silent failures — malformed JSON, tokenization edge cases, label leakage — that would otherwise corrupt training.
  • Schema validation: every row matches the expected JSON schema — required keys present, types correct, no null values in critical fields
  • Tokenization dry-run: pass all examples through the target tokenizer; assert no sequence exceeds max_length; flag truncated examples
  • Special token audit: ensure no raw special tokens (e.g. <|endoftext|>, <s>) appear in the text fields — they must only appear as designated control tokens
  • Train/val leakage check: run deduplication across split boundaries — ensure no exact or near-duplicate prompts exist in both train and eval sets
  • Label distribution: plot completion length histograms and task-type distributions; ensure eval set distribution mirrors train set
  • Encoding sanity: assert all files are valid UTF-8 JSONL — one JSON object per line, no trailing commas, no BOM
  • Toxicity & safety scan: run a hate speech / PII classifier over all completions; quarantine flagged examples for review
  • Loss masking verification: if using loss masking to train only on completions, verify the mask correctly zeros out the prompt token positions
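The train/eval leakage check above can be sketched at the exact-match level (near-duplicate checks would reuse the MinHash pipeline from the deduplication step):

```python
import hashlib

def split_leakage(train_prompts, eval_prompts):
    """Return eval prompts that also appear (exactly) in the train split."""
    def h(p):
        # Normalise trivially before hashing so casing and whitespace
        # differences do not hide an exact duplicate
        return hashlib.sha256(" ".join(p.split()).lower().encode()).hexdigest()
    train_hashes = {h(p) for p in train_prompts}
    return [p for p in eval_prompts if h(p) in train_hashes]

leaks = split_leakage(
    ["Reverse a linked list.", "Explain softmax."],
    ["explain   softmax.", "Write a haiku."],
)
# leaks == ["explain   softmax."] — caught despite casing/whitespace drift
```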
# Tokenization validation with Hugging Face
from transformers import AutoTokenizer
import json

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B-Instruct")
MAX_LEN = 4096
issues = []

with open("train.jsonl") as f:
    for i, line in enumerate(f):
        ex = json.loads(line)
        text = tokenizer.apply_chat_template(ex["messages"], tokenize=False)
        ids = tokenizer(text)["input_ids"]
        if len(ids) > MAX_LEN:
            issues.append({"row": i, "length": len(ids)})

print(f"Found {len(issues)} over-length examples")
09 · Ecosystem: Tools & Libraries
datasets (HF)
Load, stream, map, filter, and split large datasets efficiently. First-class JSONL support. Use .map() for parallelised normalisation.
text-dedup
MinHash LSH deduplication at scale. Supports suffix array exact dedup. Handles 100M+ documents on a single machine.
datasketch
Pure-Python MinHash and LSH. Excellent for prototyping near-dedup pipelines before scaling to text-dedup.
presidio
Microsoft’s PII detection and anonymisation engine. Rule-based + NER. Supports 15+ entity types including SSN, IBAN, medical record numbers.
cleanlab
Confident Learning — detects label errors, near-duplicates, and outliers automatically using cross-validation. Works on classification and generative tasks.
Argilla
Open-source annotation platform purpose-built for NLP. Supports human feedback collection, dataset versioning, and RM training data curation.
LabelStudio
Versatile annotation UI. Customisable labelling interfaces for text classification, NER, and instruction-following rating tasks.
faiss
Facebook AI Similarity Search. Use for semantic deduplication — embed prompts, cluster, and prune near-semantic duplicates at billion-scale.
LlamaFactory
End-to-end fine-tuning framework. Native support for ShareGPT, Alpaca, and OpenAI chat formats. Handles dataset merging and training loop.
10 · Expert Principles: Best Practices Summary
  • Version everything

    Tag dataset versions with hashes. If a model degrades, you can bisect to the exact data commit that caused it. Use DVC or Hugging Face dataset repos.

  • Start small, scale deliberately

    Train on a 1k–5k sample first. Verify loss curves are sensible, eval metrics are improving, and there are no data bugs — before committing to full dataset training.

  • Maintain a held-out human eval set

    Keep 200–500 expert-human-written examples permanently out of training. Use them only to evaluate final model quality. Do not update this set as the project evolves.

  • Monitor for distribution shift

    Compare token length distributions, vocabulary overlap, and task-type counts between train/val/test. Distribution shift between splits is one of the most common silent failure modes.

  • Respect the context budget

    Do not simply truncate long examples. Instead, chunk intelligently — split documents at natural boundaries (paragraphs, sections) and create separate training examples from each chunk.

  • Use loss masking on prompts

    When fine-tuning, compute loss only on the assistant completion tokens — not on the system prompt or user message. This is critical: training on the user turn wastes capacity and distorts gradient signal.

  • Iterate with ablations

    Run controlled ablations on dataset composition: remove one data source, change a filter threshold, adjust the system prompt. This builds causal understanding of what your data is actually teaching.

  • Document everything

    Maintain a data card (following HF or Google Data Cards format) recording: sources, licenses, cleaning decisions, annotation guidelines, split sizes, and known limitations. Essential for reproducibility and responsible AI.
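The "Respect the context budget" practice above can be sketched as paragraph-boundary chunking under a budget (character-based here for simplicity; a real pipeline would count tokenizer tokens):

```python
def chunk_document(text, max_chars=2000):
    """Split a document at paragraph boundaries into budget-sized chunks."""
    chunks, current = [], []
    size = 0
    for para in text.split("\n\n"):
        # Start a new chunk rather than splitting a paragraph mid-way
        if current and size + len(para) > max_chars:
            chunks.append("\n\n".join(current))
            current, size = [], 0
        current.append(para)
        size += len(para)
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Each chunk then becomes its own training example, so no document is silently truncated.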
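The "Use loss masking on prompts" practice above is conventionally implemented by setting prompt positions in the labels sequence to -100, the ignore index used by PyTorch's cross-entropy loss; a framework-agnostic sketch:

```python
IGNORE_INDEX = -100  # positions with this label contribute no loss

def build_labels(prompt_ids, completion_ids):
    """Labels for causal-LM SFT: compute loss only on completion tokens."""
    input_ids = prompt_ids + completion_ids
    # Mask every prompt position; keep completion tokens as-is
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(completion_ids)
    return input_ids, labels

# Toy token ids for a (prompt, completion) pair
inp, lab = build_labels([101, 7592, 2129], [2000, 4521, 102])
# inp == [101, 7592, 2129, 2000, 4521, 102]
# lab == [-100, -100, -100, 2000, 4521, 102]
```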
