Transformers & Attention Mechanism Explained

NLP Fundamentals · Part II

The architecture that changed everything — how models learn to focus on what matters most in any sequence of text.

What Is a Transformer?

Introduced in the landmark 2017 paper “Attention Is All You Need”, the Transformer replaced recurrent networks (RNNs) entirely. Instead of reading text step-by-step, it processes the entire sequence at once — every token attending to every other token in parallel.

This parallelism unlocked massive scale. GPT, BERT, Claude, and every modern LLM are built on this single architectural idea.

💡 The key insight: context is everything. The word “bank” means something different near “river” vs. “loan”. Attention lets the model weight each surrounding word when computing a token’s meaning.

The Transformer Block

Each Transformer layer stacks several sub-components. Data flows from bottom to top, with residual connections preserving information across each operation.

From top to bottom:

- Output / Logits: probability distribution over the vocabulary
- Layer Norm + Residual: stabilises training
- Feed-Forward Network: two linear layers with a ReLU / GELU activation
- Multi-Head Self-Attention: the core of the Transformer (the attention + feed-forward block is repeated N times, e.g. 96 layers in GPT-3)
- Positional Encoding: injects word order into the embeddings
- Token Embeddings: maps token IDs to dense vectors
- Input Tokens: the tokenized text sequence
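As a sketch, the stack above can be wired together in a few lines. This is a toy single-head, pre-norm block in NumPy with illustrative sizes (d_model = 16, one attention head, random weights), not any real model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean / unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # row-wise softmax
    return weights @ V

def feed_forward(x, W1, W2):
    # Two linear layers with a ReLU in between.
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)  # residual 1
    x = x + feed_forward(layer_norm(x), W1, W2)        # residual 2
    return x

params = (rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_ff)),
          rng.normal(0, 0.1, (d_ff, d_model)))

tokens = rng.normal(size=(seq_len, d_model))  # stand-in embeddings
out = transformer_block(tokens, params)
print(out.shape)                              # (5, 16)
```

A real model stacks N of these blocks and adds positional information to the embeddings first; the residual additions are what let information flow past each sub-layer unchanged.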

How Attention Works

For every token, attention asks: “Which other tokens are most relevant to understanding me?” It does this via three learned projections of each embedding.

- Q (Query): "What am I looking for?" The current token's question.
- K (Key): "What do I contain?" Every token's label.
- V (Value): "What do I contribute?" The actual information passed forward.

Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) · V
1. Dot-Product Scores: each Query vector is dotted with every Key vector, producing a raw score for how strongly each token matches the query.

2. Scale by √dₖ: dividing by the square root of the key dimension prevents the scores from becoming too large, keeping gradients stable during training.

3. Softmax → Weights: softmax converts the raw scores into a probability distribution that sums to 1; these are the attention weights.

4. Weighted Sum of Values: the attention weights compute a weighted average of all Value vectors. The result is a rich, context-aware representation of the token.
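The four steps can be written out line by line. A minimal NumPy sketch, assuming Q, K, and V have already been produced by the learned projections (random values stand in for them here, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T                           # 1. dot-product scores, (4, 4)
scores /= np.sqrt(d_k)                     # 2. scale by sqrt(d_k)
scores -= scores.max(-1, keepdims=True)    #    (shift for numerical stability)
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)  # 3. softmax -> attention weights
output = weights @ V                       # 4. weighted sum of Values, (4, 8)

print(weights.sum(axis=-1))                # each row sums to 1 (up to float error)
```

Each row of `weights` is one token's probability distribution over all tokens; multiplying by V blends the Value vectors according to that distribution.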


Visualising the Full Attention Matrix

An attention heatmap shows, for every token (row), how much weight it assigns to every other token (column). Brighter cells mean stronger attention. Patterns reveal syntactic and semantic relationships the model has learned.

Why Multiple Heads?

A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention heads in parallel, each learning to capture different linguistic patterns. Their outputs are concatenated and projected.

- Head 1: syntactic subject–verb agreement
- Head 2: coreference & pronouns
- Head 3: long-range dependencies
- Head 4: noun–adjective relations
- Head 5: positional proximity
- Head 6: semantic similarity
- Head 7: verb–object binding
- Head 8: discourse structure

GPT-3 uses 96 attention heads per layer across 96 layers — over 9,000 attention patterns operating simultaneously on each forward pass.
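The split-attend-concatenate-project pattern can be sketched in NumPy. Sizes here are illustrative (4 heads, d_model = 16), far smaller than GPT-3's:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h                       # per-head dimension

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

X  = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(h, d_model, d_k))  # one projection set per head
Wk = rng.normal(size=(h, d_model, d_k))
Wv = rng.normal(size=(h, d_model, d_k))
Wo = rng.normal(size=(d_model, d_model)) # final output projection

# Each head attends independently in its own d_k-dimensional subspace...
heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]
# ...then the heads are concatenated back to d_model and projected.
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)                         # (5, 16)
```

Because each head works in a smaller subspace (d_k = d_model / h), the total cost is close to that of one full-width head, while each head is free to learn a different attention pattern.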

Preserving Word Order

Attention is order-agnostic — it treats the input as a set, not a sequence. To inject position information, a positional encoding vector is added to each token embedding before the first layer.

The original Transformer used sinusoidal functions of varying frequencies. Modern models use learned positional embeddings or RoPE (Rotary Position Embedding), which encodes relative position directly into the attention computation.

Even dimensions (sin): PE(pos, 2i) = sin(pos / 10000^(2i/d)). Sinusoids of varying frequency encode absolute position smoothly.

Odd dimensions (cos): PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The cosine counterpart allows the model to attend to relative offsets via linear transformations.
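The two formulas can be computed for all positions at once; a short NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]   # positions, shape (max_len, 1)
    i = np.arange(d // 2)[None, :]      # dimension-pair index, shape (1, d/2)
    angle = pos / 10000 ** (2 * i / d)  # one frequency per pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)         # PE(pos, 2i)   -> even dimensions
    pe[:, 1::2] = np.cos(angle)         # PE(pos, 2i+1) -> odd dimensions
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first layer, so two occurrences of the same word at different positions enter the network with different vectors.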

RoPE, used by LLaMA, Mistral, and Claude, rotates Q and K vectors by an angle proportional to position — letting attention scores naturally encode relative distance between tokens.
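A toy illustration of that rotation idea (not a production RoPE implementation): rotate each (even, odd) pair of a vector by an angle proportional to its position. Because rotations compose, the dot product of a rotated Query and Key depends only on their relative offset:

```python
import numpy as np

def rope(x, pos, base=10000):
    # Rotate each 2-D pair of a 1-D vector by pos * (per-pair frequency).
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2 / d)
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]                      # even / odd halves of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Attention score for positions (5, 2) equals the score for (8, 5):
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 8) @ rope(k, 5)
print(np.isclose(a, b))   # True: only the relative offset (3) matters
```

This is why RoPE needs no separate positional embedding table: position enters through the Q/K rotation, and the attention scores see only relative distances.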
