LLM Training Pipeline

Deep Learning · Language Models · Training Methods

How LLMs Learn to Think

A visual walkthrough of the three-stage pipeline that turns raw text and compute into a conversational assistant: pre-training, fine-tuning, and alignment through human feedback.

Stage 01: Pre-Training → Stage 02: Fine-Tuning → Stage 03: RLHF → Output: Deployed Model
1
Foundation

Pre-Training

Learning the shape of language at scale

Pre-training is the first and most computationally intensive stage. A neural network — typically a Transformer architecture — is exposed to trillions of tokens drawn from the internet, books, and curated corpora. The model learns by a simple self-supervised objective: predict the next token.

“The model is never told what language is. It discovers grammar, facts, reasoning, and style purely from statistical co-occurrence — billions of times over.”

Through gradient descent, the network adjusts its parameters (billions to hundreds of billions in current models) to minimise prediction error. Over time it internalises syntax, factual associations, code patterns, and even rudimentary logic, all as a byproduct of next-token prediction.
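
The objective described above can be written compactly. This is a sketch only: the symbols (a token sequence x_1 … x_T, parameters θ, learning rate η) are introduced here for illustration and do not come from the original text. It shows plain gradient descent for simplicity; real training runs typically use an adaptive optimiser on mini-batches.

```latex
\mathcal{L}(\theta) \;=\; -\frac{1}{T}\sum_{t=1}^{T} \log p_\theta\!\left(x_t \mid x_{<t}\right),
\qquad
\theta \;\leftarrow\; \theta - \eta\,\nabla_\theta \mathcal{L}(\theta)
```

Minimising this average negative log-likelihood is the same as minimising the cross-entropy between the model's predicted next-token distribution and the token that actually follows.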

Data scale
Trillions of tokens

Self-supervised objective

No human labels needed. The text itself provides the training signal via next-token prediction.
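
To make "the text itself provides the training signal" concrete, here is a minimal Python sketch. The function name `next_token_pairs` and the `context_len` parameter are hypothetical, and whole words stand in for the subword tokens a real tokenizer would produce.

```python
def next_token_pairs(token_ids, context_len):
    """Build (context, target) training pairs from one token sequence.

    The target for each context is simply the token that follows it,
    so no human labelling is involved.
    """
    pairs = []
    for end in range(1, len(token_ids)):
        # Keep at most `context_len` tokens of left context.
        start = max(0, end - context_len)
        pairs.append((token_ids[start:end], token_ids[end]))
    return pairs


# Toy example: words stand in for token ids (an assumption for readability).
tokens = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in next_token_pairs(tokens, context_len=3):
    print(context, "->", target)
```

Every position in the corpus yields one such pair, which is why unlabelled text at web scale is enough to train on.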

Transformer architecture

Causal self-attention lets each token attend to every earlier token in the context window, capturing long-range dependencies.
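
The idea of attending only to earlier tokens can be sketched in a few lines of plain Python. This is an illustrative single-head version with no learned projections; the function names are hypothetical, and production models use batched tensor operations instead of loops.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(queries, keys, values):
    """Single-head scaled dot-product attention with a causal mask.

    queries, keys, values: one float vector per token position.
    Returns the per-token outputs and the attention weights used.
    """
    dim = len(keys[0])
    outputs, all_weights = [], []
    for t, query in enumerate(queries):
        # Causal mask: position t only scores positions 0..t.
        scores = [
            sum(q * k for q, k in zip(query, keys[s])) / math.sqrt(dim)
            for s in range(t + 1)
        ]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible values.
        output = [
            sum(w * values[s][j] for s, w in enumerate(weights))
            for j in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


x = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = causal_attention(x, x, x)
print([len(w) for w in weights])  # number of positions each token can see
```

Because the first token can only see itself, its attention weight is 1 and its output equals its own value vector.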

Emergent capabilities

At sufficient scale, abilities such as in-context learning and chain-of-thought reasoning appear that are weak or absent in smaller models, and they are hard to predict in advance.

Base model output

The result is a powerful but “raw” model — fluent, knowledgeable, yet not instruction-following.

2
Specialisation

Fine-Tuning

Teaching the model to follow instructions

The pre-trained base model is a world-class language predictor, but it does not know how to help a user. Fine-tuning bridges this gap by continuing training on a much smaller, curated dataset of (instruction, ideal response) pairs — often called Supervised Fine-Tuning (SFT).

Human annotators craft thousands of examples demonstrating good assistant behaviour: answering questions accurately, following formatting requests, refusing harmful prompts politely. The model is then trained on these examples in the usual supervised fashion — compute loss against the ideal response, backpropagate, update weights.

“Fine-tuning does not teach the model new facts. It reshapes how the model uses the knowledge it already has — pivoting from prediction to conversation.”

Instruction datasets

Carefully curated pairs of prompts and ideal completions, covering diverse tasks and domains.

Supervised learning

Standard cross-entropy loss against target responses. The dataset is small, so training takes far fewer gradient steps than pre-training.
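
A minimal sketch of that loss in Python follows. It assumes the model's log-probabilities for each target token are already available, and it averages only over response tokens. Masking out the prompt is a common convention rather than something the text above specifies, and the function name `sft_loss` is hypothetical.

```python
import math


def sft_loss(token_log_probs, is_response_token):
    """Mean negative log-likelihood over response tokens only.

    token_log_probs: log-probability the model gave each target token.
    is_response_token: True for tokens in the ideal response,
        False for prompt tokens (assumed to be excluded from the loss).
    """
    selected = [
        -log_prob
        for log_prob, keep in zip(token_log_probs, is_response_token)
        if keep
    ]
    if not selected:
        raise ValueError("no response tokens to train on")
    return sum(selected) / len(selected)


# One prompt token followed by two response tokens.
log_probs = [math.log(0.5), math.log(0.25), math.log(0.5)]
mask = [False, True, True]
print(sft_loss(log_probs, mask))
```

The loss falls as the model assigns higher probability to the annotator's response, which is the same cross-entropy used in pre-training applied to a much smaller, curated dataset.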

Format & style

The model learns conversational structure: how to open, acknowledge, reason, and conclude an answer.

Domain adaptation

Fine-tuning can also specialise a model for medicine, law, coding, or any narrow vertical.

Data scale
Thousands–millions of examples
3
Alignment

RLHF

Reinforcement Learning from Human Feedback

Even after SFT, the model may produce responses that are subtly unhelpful, verbose, or unsafe. RLHF addresses this by using human preferences — not just examples — as the training signal, via a reinforcement learning loop.

01
SFT Model generates responses
02
Humans rank responses
03
Reward model trained on rankings
04
PPO optimises policy against reward

The key insight is that it is easier for humans to compare two responses than to write an ideal one from scratch. Annotators rank pairs of outputs; these rankings train a separate reward model that learns to score responses as humans would. The policy (LLM) is then optimised via Proximal Policy Optimisation (PPO) to maximise the reward model’s score.

“RLHF turns implicit human taste into an explicit numerical signal — and then uses that signal to nudge billions of parameters toward helpfulness, harmlessness, and honesty.”

Reward modelling

A separate model trained on preference pairs. It predicts a scalar score for any (prompt, response) pair.
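
One common way to train such a model on preference pairs is a Bradley-Terry style loss, sketched below. The source does not say which loss is used, so treat this as an assumption; the function name is hypothetical and the scores would come from the reward model itself.

```python
import math


def pairwise_reward_loss(score_chosen, score_rejected):
    """-log(sigmoid(score_chosen - score_rejected)), computed stably.

    The loss is small when the preferred response scores higher than
    the rejected one, and large when the order is reversed.
    """
    margin = score_chosen - score_rejected
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    # For negative margins, rewrite to avoid overflow in exp().
    return -margin + math.log1p(math.exp(margin))


print(pairwise_reward_loss(2.0, 0.5))  # correct ordering: lower loss
print(pairwise_reward_loss(0.5, 2.0))  # wrong ordering: higher loss
```

Only the difference between the two scores matters, so the reward model learns a relative ranking and not an absolute scale.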

PPO algorithm

Proximal Policy Optimisation clips each policy update so the model changes only slightly per step. This stabilises training and limits how quickly the LLM can drift into “hacking” the reward model.
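
The mechanism that keeps updates small is PPO's clipped surrogate objective. The sketch below evaluates it for a single sample; the function name and the default `clip_eps=0.2` are illustrative assumptions, and a full implementation would also need advantage estimation and a value function.

```python
def ppo_clipped_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample PPO surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A).

    ratio: probability of the sampled action under the new policy
        divided by its probability under the old policy.
    advantage: how much better the action was than expected.
    """
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)


# A good action (positive advantage) stops earning extra credit once the
# ratio leaves the clip range, so there is no incentive for a large jump.
print(ppo_clipped_objective(ratio=1.5, advantage=1.0))
print(ppo_clipped_objective(ratio=1.1, advantage=1.0))
```

Taking the minimum of the clipped and unclipped terms makes the objective a pessimistic bound: moving the policy far from the old one never increases the objective beyond what the clip range allows.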

KL-divergence penalty

A regularisation term keeps the RLHF policy close to the SFT model it started from, preserving knowledge and preventing degeneration.
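
In many RLHF setups this penalty is folded into the reward before the policy update. The sketch below assumes per-token log-probabilities of one sampled response under both the current policy and the frozen SFT reference; the function name and the coefficient `beta=0.1` are hypothetical.

```python
def kl_shaped_reward(reward_score, policy_log_probs, reference_log_probs, beta=0.1):
    """Reward-model score minus beta times a sample-based KL estimate.

    The sum of (policy log-prob - reference log-prob) over the sampled
    tokens is a single-sample estimate of the KL divergence between the
    policy and the reference model for this response.
    """
    kl_estimate = sum(
        p - r for p, r in zip(policy_log_probs, reference_log_probs)
    )
    return reward_score - beta * kl_estimate


# The policy has become much more confident than the reference on this
# response, so part of the reward is taken away.
print(kl_shaped_reward(1.0, [-0.1, -0.2], [-1.1, -1.2]))
```

A larger `beta` holds the policy closer to the reference model; a smaller one lets it chase the reward model's score more freely.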

Constitutional AI & DPO

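Modern variants change different parts of the loop. Anthropic’s Constitutional AI (CAI) replaces much of the human feedback with AI feedback guided by written principles, while Direct Preference Optimisation (DPO) drops the separate reward model and RL loop and trains directly on preference pairs.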
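
As a rough illustration of how Direct Preference Optimisation avoids an RL loop, its loss can be computed directly from sequence log-probabilities under the policy and a frozen reference model. The sketch below is for a single preference pair; the argument names and `beta=0.1` are illustrative assumptions.

```python
import math


def dpo_loss(policy_chosen_lp, policy_rejected_lp,
             ref_chosen_lp, ref_rejected_lp, beta=0.1):
    """DPO loss for one preference pair.

    Each argument is the summed log-probability of a whole response.
    The loss is -log(sigmoid(beta * (chosen_gain - rejected_gain))),
    where each gain is the policy log-prob minus the reference log-prob.
    """
    chosen_gain = policy_chosen_lp - ref_chosen_lp
    rejected_gain = policy_rejected_lp - ref_rejected_lp
    logit = beta * (chosen_gain - rejected_gain)
    # Numerically stable -log(sigmoid(logit)).
    if logit >= 0:
        return math.log1p(math.exp(-logit))
    return -logit + math.log1p(math.exp(logit))


# The policy has shifted probability toward the chosen response relative
# to the reference, so the loss drops below its starting value of log(2).
print(dpo_loss(-10.0, -14.0, -12.0, -12.0))
```

The loss has the same shape as a pairwise reward-model loss, but the "score" is the policy's own log-probability ratio against the reference, so no separate reward model or sampling loop is needed.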

Overview

Comparing the Stages

Side-by-side at a glance

| Dimension | Pre-Training | Fine-Tuning | RLHF |
| --- | --- | --- | --- |
| Goal | General language understanding | Instruction following | Preference alignment |
| Data | Web-scale unlabelled text | Curated (prompt, response) pairs | Human preference rankings |
| Supervision | Self-supervised | Supervised (teacher-forced) | Reinforcement learning |
| Compute | Enormous (weeks on thousands of GPUs) | Moderate (hours–days) | Moderate–high (complex loop) |
| Data size | Trillions of tokens | Thousands–millions of examples | Tens–hundreds of thousands of comparisons |
| Key challenge | Scale, data quality, infrastructure | Prompt diversity, annotation quality | Reward hacking, annotation bias |
