How LLMs Learn to Think: Pre-Training, Fine-Tuning & RLHF Explained

Stage 01

Pre-Training

Stage 02

Fine-Tuning

Stage 03

RLHF

Output

Deployed Model

Foundation

Pre-Training

Learning the shape of language at scale

Pre-training is the first and most computationally intensive stage. A neural network — typically a Transformer architecture — is exposed to trillions of tokens drawn from the internet, books, and curated corpora. The model learns by a simple self-supervised objective: predict the next token.

“The model is never told what language is. It discovers grammar, facts, reasoning, and style purely from statistical co-occurrence — billions of times over.”

Through gradient descent, the network adjusts hundreds of billions of parameters to minimise prediction error. Over time it internalises syntax, factual associations, code patterns, and even rudimentary logic — all as a byproduct of next-token prediction.

Data scale

Trillions of tokens

◎

Self-supervised objective

No human labels needed. The text itself provides the training signal via next-token prediction.

◈

Transformer architecture

Attention mechanisms let each token attend to all previous tokens, capturing long-range dependencies.

◑

Emergent capabilities

At sufficient scale, abilities like in-context learning and chain-of-thought reasoning emerge unpredictably.

◻

Base model output

The result is a powerful but “raw” model — fluent, knowledgeable, yet not instruction-following.

Specialisation

Fine-Tuning

Teaching the model to follow instructions

The pre-trained base model is a world-class language predictor, but it does not know how to help a user. Fine-tuning bridges this gap by continuing training on a much smaller, curated dataset of (instruction, ideal response) pairs — often called Supervised Fine-Tuning (SFT).

Human annotators craft thousands of examples demonstrating good assistant behaviour: answering questions accurately, following formatting requests, refusing harmful prompts politely. The model is then trained on these examples in the usual supervised fashion — compute loss against the ideal response, backpropagate, update weights.

“Fine-tuning does not teach the model new facts. It reshapes how the model uses the knowledge it already has — pivoting from prediction to conversation.”

◈

Instruction datasets

Carefully curated pairs of prompts and ideal completions, covering diverse tasks and domains.

◎

Supervised learning

Standard cross-entropy loss against target responses. Small dataset, many fewer gradient steps than pre-training.

◑

Format & style

The model learns conversational structure: how to open, acknowledge, reason, and conclude an answer.

◻

Domain adaptation

Fine-tuning can also specialise a model for medicine, law, coding, or any narrow vertical.

Data scale

Thousands–millions of examples

Alignment

RLHF

Reinforcement Learning from Human Feedback

Even after SFT, the model may produce responses that are subtly unhelpful, verbose, or unsafe. RLHF addresses this by using human preferences — not just examples — as the training signal, via a reinforcement learning loop.

SFT Model generates responses

→

Humans rank responses

→

Reward model trained on rankings

→

PPO optimises policy against reward

The key insight is that it is easier for humans to compare two responses than to write an ideal one from scratch. Annotators rank pairs of outputs; these rankings train a separate reward model that learns to score responses as humans would. The policy (LLM) is then optimised via Proximal Policy Optimisation (PPO) to maximise the reward model’s score.

“RLHF turns implicit human taste into an explicit numerical signal — and then uses that signal to nudge billions of parameters toward helpfulness, harmlessness, and honesty.”

◎

Reward modelling

A separate model trained on preference pairs. It predicts a scalar score for any (prompt, response) pair.

◈

PPO algorithm

Proximal Policy Optimisation keeps updates small, preventing the LLM from “hacking” the reward model catastrophically.

◑

KL-divergence penalty

A regularisation term keeps the RLHF model close to the SFT base, preserving knowledge and preventing degeneration.

◻

Constitutional AI & DPO

Modern variants like Anthropic’s CAI and Direct Preference Optimisation simplify or replace the RL loop entirely.

∑

Overview

Comparing the Stages

Side-by-side at a glance

Dimension	Pre-Training	Fine-Tuning	RLHF
Goal	General language understanding	Instruction following	Preference alignment
Data	Web-scale unlabelled text	Curated (prompt, response) pairs	Human preference rankings
Supervision	Self-supervised	Supervised (teacher-forced)	Reinforcement learning
Compute	Enormous (weeks on thousands of GPUs)	Moderate (hours–days)	Moderate–high (complex loop)
Data size	Trillions of tokens	Thousands–millions of examples	Tens–hundreds of thousands of comparisons
Key challenge	Scale, data quality, infrastructure	Prompt diversity, annotation quality	Reward hacking, annotation bias

How LLMs Learn to Think: Pre-Training, Fine-Tuning & RLHF Explained

Quick Start Guide to Large Language Models: Strategies and Best P…

Building A large language model with Ai: A Practical Guide to Str…

Mastering Large Language Models: A Hands-On Guide to AI-Powered A…

How LLMs
Learn to
Think

Pre-Training

Self-supervised objective

Transformer architecture

Emergent capabilities

Base model output

Fine-Tuning

Instruction datasets

Supervised learning

Format & style

Domain adaptation

RLHF

Reward modelling

PPO algorithm

KL-divergence penalty

Constitutional AI & DPO

Comparing the Stages

Quick Start Guide to Large Language Models: Strategies and Best P…

Building A large language model with Ai: A Practical Guide to Str…

Mastering Large Language Models: A Hands-On Guide to AI-Powered A…

By Somish Saipar

Leave a Reply Cancel reply

You Missed

LLM Fine-Tuning & Optimization: Instruction Tuning, LoRA, RLHF & Prompt Strategies

PEFT, LoRA & QLoRA Explained: The Complete Guide to Efficient LLM Fine-Tuning (2025)

Mastering AI Expertise Through Fine-Tuning

Claude AI API Integration — Build Smarter Apps with the World’s Most Capable AI (2026)

About Us

Follow Us

Latest Posts

LLM Fine-Tuning & Optimization: Instruction Tuning, LoRA, RLHF & Prompt Strategies

PEFT, LoRA & QLoRA Explained: The Complete Guide to Efficient LLM Fine-Tuning (2025)

Mastering AI Expertise Through Fine-Tuning

Claude AI API Integration — Build Smarter Apps with the World’s Most Capable AI (2026)

Feed the algorithm. Can we parallel paths are we in agreeance?

Quick Start Guide to Large Language Models: Strategies and Best P…

Building A large language model with Ai: A Practical Guide to Str…

Mastering Large Language Models: A Hands-On Guide to AI-Powered A…

Pre-Training

Self-supervised objective

Transformer architecture

Emergent capabilities

Base model output

Fine-Tuning

Instruction datasets

Supervised learning

Format & style

Domain adaptation

RLHF

Reward modelling

PPO algorithm

KL-divergence penalty

Constitutional AI & DPO

Comparing the Stages

Quick Start Guide to Large Language Models: Strategies and Best P…

Building A large language model with Ai: A Practical Guide to Str…

Mastering Large Language Models: A Hands-On Guide to AI-Powered A…

By Somish Saipar

Related Post

Leave a Reply Cancel reply

You Missed