Transformers &
Attention Mechanism
The architecture that changed everything — how models learn to focus on what matters most in any sequence of text.
What Is a Transformer?
Introduced in the landmark 2017 paper “Attention Is All You Need”, the Transformer replaced recurrent networks (RNNs) entirely. Instead of reading text step by step, it processes the entire sequence at once — every token attending to every other token in parallel.
This parallelism unlocked massive scale. GPT, BERT, Claude, and every modern LLM are built on this single architectural idea.
💡 The key insight: context is everything. The word “bank” means something different near “river” vs. “loan”. Attention lets the model weight each surrounding word when computing a token’s meaning.
The Transformer Block
Each Transformer layer stacks several sub-components. Data flows from bottom to top, with residual connections preserving information across each operation.
Output / Logits
Probability over vocabulary
Layer Norm + Residual
Stabilises training
Feed-Forward Network
Two linear layers + ReLU / GELU
Multi-Head Self-Attention
Core of the Transformer ✦
Positional Encoding
Injects word order into embeddings
Input Tokens
Tokenized text sequence
How Attention Works
For every token, attention asks: “Which other tokens are most relevant to understanding me?” It does this via three learned projections of each embedding.
Query
“What am I looking for?”
The current token’s question.
Key
“What do I contain?”
Every token’s label.
Value
“What do I contribute?”
The actual information passed forward.
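The three projections above can be sketched in a few lines of NumPy. This is a toy illustration: the sizes are arbitrary, and the weight matrices are random stand-ins for parameters that would be learned during training.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, d_model, d_k = 4, 8, 8   # toy sizes, chosen for illustration

X = rng.normal(size=(seq_len, d_model))   # token embeddings
W_q = rng.normal(size=(d_model, d_k))     # learned in practice; random here
W_k = rng.normal(size=(d_model, d_k))
W_v = rng.normal(size=(d_model, d_k))

Q = X @ W_q   # Query: "what am I looking for?"
K = X @ W_k   # Key:   "what do I contain?"
V = X @ W_v   # Value: "what do I contribute?"

print(Q.shape, K.shape, V.shape)   # each (4, 8): one vector per token
```

Each token ends up with its own Query, Key, and Value vector; everything that follows is built from these three matrices.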
Dot-Product Scores
Each Query vector is compared with every Key vector via a dot product, producing a raw score — how strongly each token “matches” the query.
Scale by √dₖ
Dividing by the square root of the key dimension prevents scores from becoming too large, keeping gradients stable during training.
Softmax → Weights
Softmax converts raw scores into a probability distribution that sums to 1 — these are the attention weights.
Weighted Sum of Values
The attention weights are used to compute a weighted average of all Value vectors. The result is a rich, context-aware representation of the token.
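The four steps above — score, scale, softmax, weighted sum — fit in one small function. A minimal NumPy sketch (random Q, K, V standing in for real projections):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)               # dot-product scores, scaled by sqrt(d_k)
    scores -= scores.max(axis=-1, keepdims=True)  # shift for numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ V, weights                   # weighted sum of Value vectors

rng = np.random.default_rng(0)
Q = rng.normal(size=(4, 8))
K = rng.normal(size=(4, 8))
V = rng.normal(size=(4, 8))

out, w = scaled_dot_product_attention(Q, K, V)
print(out.shape)                          # (4, 8): one context-aware vector per token
print(np.allclose(w.sum(axis=-1), 1.0))   # True: each row is a probability distribution
```

Row *i* of `w` is exactly the attention pattern for token *i*: how much of every other token's Value it pulls into its new representation.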
Visualising the Full Attention Matrix
An attention heatmap shows, for every token (row), how much weight it assigns to every other token (column). Brighter cells mean stronger attention. Patterns reveal syntactic and semantic relationships the model has learned.
Why Multiple Heads?
A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention heads in parallel, each learning to capture different linguistic patterns. Their outputs are concatenated and projected back to the model dimension.
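A minimal sketch of the idea, again with random weights in place of learned ones: each head attends in its own smaller subspace, and the concatenated result is sent through an output projection.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads, rng):
    """Run n_heads attention heads in parallel, each in a d_k-dim
    subspace, then concatenate and project back to d_model."""
    seq_len, d_model = X.shape
    d_k = d_model // n_heads
    heads = []
    for _ in range(n_heads):
        W_q = rng.normal(size=(d_model, d_k))   # learned in practice
        W_k = rng.normal(size=(d_model, d_k))
        W_v = rng.normal(size=(d_model, d_k))
        Q, K, V = X @ W_q, X @ W_k, X @ W_v
        weights = softmax(Q @ K.T / np.sqrt(d_k))
        heads.append(weights @ V)               # one attention pattern per head
    W_o = rng.normal(size=(d_model, d_model))   # output projection
    return np.concatenate(heads, axis=-1) @ W_o

rng = np.random.default_rng(0)
X = rng.normal(size=(6, 16))       # 6 tokens, d_model = 16
out = multi_head_attention(X, n_heads=4, rng=rng)
print(out.shape)                   # (6, 16): same shape as the input
```

Real implementations compute all heads in one batched matrix multiply rather than a Python loop, but the loop makes the per-head structure explicit.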
GPT-3 uses 96 attention heads per layer across 96 layers — over 9,000 attention patterns operating simultaneously on each forward pass.
Preserving Word Order
Attention is order-agnostic — it treats the input as a set, not a sequence. To inject position information, a positional encoding vector is added to each token embedding before the first layer.
The original Transformer used sinusoidal functions of varying frequencies. Modern models use learned positional embeddings or RoPE (Rotary Position Embedding), which encodes relative position directly into the attention computation.
Even Dimensions
PE(pos, 2i) = sin(pos / 10000^(2i/d)) — sinusoids spanning a geometric range of frequencies encode absolute position smoothly.
Odd Dimensions
PE(pos, 2i+1) = cos(pos / 10000^(2i/d)) — the cosine counterpart allows the model to attend to relative offsets via linear transformations.
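The two formulas above can be computed together in vectorised form. A short sketch (sizes chosen for illustration):

```python
import numpy as np

def sinusoidal_pe(max_len, d_model):
    pos = np.arange(max_len)[:, None]         # (max_len, 1) positions
    i = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = pos / 10000 ** (i / d_model)     # pos / 10000^(2i/d)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)              # even dims: sine
    pe[:, 1::2] = np.cos(angles)              # odd dims: cosine
    return pe

pe = sinusoidal_pe(max_len=10, d_model=8)
print(pe.shape)    # (10, 8): one encoding vector per position
print(pe[0])       # position 0: sin terms are 0, cos terms are 1
```

This matrix is simply added to the token embeddings before the first layer, so the same word at different positions enters the network with a slightly different vector.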
RoPE, used by LLaMA, Mistral, and Claude, rotates Q and K vectors by an angle proportional to position — letting attention scores naturally encode relative distance between tokens.
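The relative-distance property of RoPE can be demonstrated directly: rotating each (even, odd) pair of dimensions by an angle proportional to position makes the Q·K score depend only on the offset between the two positions. A minimal single-vector sketch (a simplification of real implementations, which apply this per head across a batch):

```python
import numpy as np

def rope_rotate(x, pos, base=10000.0):
    """Rotate each (even, odd) dimension pair of x by pos * theta_i,
    where theta_i = base^(-2i/d) — one frequency per pair."""
    d = x.shape[-1]
    theta = base ** (-np.arange(0, d, 2) / d)
    angle = pos * theta
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin   # standard 2-D rotation
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# The score depends only on the relative offset: (5, 2) matches (10, 7).
s1 = rope_rotate(q, 5) @ rope_rotate(k, 2)
s2 = rope_rotate(q, 10) @ rope_rotate(k, 7)
print(np.allclose(s1, s2))   # True
```

Because rotations compose, the dot product of a query rotated by position m and a key rotated by position n depends only on m − n — which is exactly how attention scores come to encode relative distance.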

