Transformers & Attention Mechanism Explained

NLP Fundamentals · Part II

The architecture that changed everything — how models learn to focus on what matters most in any sequence of text.

What Is a Transformer?

Introduced in the landmark 2017 paper “Attention Is All You Need”, the Transformer replaced recurrent networks (RNNs) entirely. Instead of reading text step-by-step, it processes the entire sequence at once — every token attending to every other token in parallel.

This parallelism unlocked massive scale. GPT, BERT, Claude, and every modern LLM are built on this single architectural idea.

💡 The key insight: context is everything. The word “bank” means something different near “river” vs. “loan”. Attention lets the model weight each surrounding word when computing a token’s meaning.

The Transformer Block

Each Transformer layer stacks several sub-components. Data flows from bottom to top, with residual connections preserving information across each operation.

From top to bottom:

- Output / Logits: probability distribution over the vocabulary
- Layer Norm + Residual: stabilises training
- Feed-Forward Network: two linear layers with a ReLU / GELU activation
- Multi-Head Self-Attention: the core of the Transformer (the attention + feed-forward block is repeated N times, e.g. 96 layers in GPT-3)
- Positional Encoding: injects word order into the embeddings
- Token Embeddings: maps token IDs to dense vectors
- Input Tokens: the tokenized text sequence
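As a sketch, the stack above can be wired together in a few lines. This is a toy single-head, pre-norm block in NumPy with illustrative sizes (d_model = 16, one attention head, random weights), not any real model's configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, d_ff, seq_len = 16, 64, 5

def layer_norm(x, eps=1e-5):
    # Normalise each token vector to zero mean / unit variance.
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def self_attention(x, Wq, Wk, Wv):
    # Single-head scaled dot-product self-attention.
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights /= weights.sum(-1, keepdims=True)          # row-wise softmax
    return weights @ V

def feed_forward(x, W1, W2):
    # Two linear layers with a ReLU in between.
    return np.maximum(x @ W1, 0.0) @ W2

def transformer_block(x, params):
    Wq, Wk, Wv, W1, W2 = params
    x = x + self_attention(layer_norm(x), Wq, Wk, Wv)  # residual 1
    x = x + feed_forward(layer_norm(x), W1, W2)        # residual 2
    return x

params = (rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_model)),
          rng.normal(0, 0.1, (d_model, d_ff)),
          rng.normal(0, 0.1, (d_ff, d_model)))

tokens = rng.normal(size=(seq_len, d_model))  # stand-in embeddings
out = transformer_block(tokens, params)
print(out.shape)                              # (5, 16)
```

A real model stacks N of these blocks and adds positional information to the embeddings first; the residual additions are what let information flow past each sub-layer unchanged.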

How Attention Works

For every token, attention asks: “Which other tokens are most relevant to understanding me?” It does this via three learned projections of each embedding.

- Q (Query): "What am I looking for?" The current token's question.
- K (Key): "What do I contain?" Every token's label.
- V (Value): "What do I contribute?" The actual information passed forward.

Attention(Q, K, V) = softmax( QKᵀ / √dₖ ) · V
1. Dot-Product Scores: each Query vector is dotted with every Key vector, producing a raw score for how strongly each token matches the query.

2. Scale by √dₖ: dividing by the square root of the key dimension prevents the scores from becoming too large, keeping gradients stable during training.

3. Softmax → Weights: softmax converts the raw scores into a probability distribution that sums to 1; these are the attention weights.

4. Weighted Sum of Values: the attention weights compute a weighted average of all Value vectors. The result is a rich, context-aware representation of the token.
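The four steps can be written out line by line. A minimal NumPy sketch, assuming Q, K, and V have already been produced by the learned projections (random values stand in for them here, with illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8
Q = rng.normal(size=(seq_len, d_k))
K = rng.normal(size=(seq_len, d_k))
V = rng.normal(size=(seq_len, d_k))

scores = Q @ K.T                           # 1. dot-product scores, (4, 4)
scores /= np.sqrt(d_k)                     # 2. scale by sqrt(d_k)
scores -= scores.max(-1, keepdims=True)    #    (shift for numerical stability)
weights = np.exp(scores)
weights /= weights.sum(-1, keepdims=True)  # 3. softmax -> attention weights
output = weights @ V                       # 4. weighted sum of Values, (4, 8)

print(weights.sum(axis=-1))                # each row sums to 1 (up to float error)
```

Each row of `weights` is one token's probability distribution over all tokens; multiplying by V blends the Value vectors according to that distribution.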


Visualising the Full Attention Matrix

An attention heatmap shows, for every token (row), how much weight it assigns to every other token (column). Brighter cells mean stronger attention. Patterns reveal syntactic and semantic relationships the model has learned.

Why Multiple Heads?

A single attention head can only focus on one type of relationship at a time. Multi-head attention runs several attention heads in parallel, each learning to capture different linguistic patterns. Their outputs are concatenated and projected.

- Head 1: syntactic subject–verb agreement
- Head 2: coreference & pronouns
- Head 3: long-range dependencies
- Head 4: noun–adjective relations
- Head 5: positional proximity
- Head 6: semantic similarity
- Head 7: verb–object binding
- Head 8: discourse structure

GPT-3 uses 96 attention heads per layer across 96 layers — over 9,000 attention patterns operating simultaneously on each forward pass.
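The split-attend-concatenate-project pattern can be sketched in NumPy. Sizes here are illustrative (4 heads, d_model = 16), far smaller than GPT-3's:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, h = 5, 16, 4
d_k = d_model // h                       # per-head dimension

def attention(Q, K, V):
    # Scaled dot-product attention for one head.
    s = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(s - s.max(-1, keepdims=True))
    w /= w.sum(-1, keepdims=True)
    return w @ V

X  = rng.normal(size=(seq_len, d_model))
Wq = rng.normal(size=(h, d_model, d_k))  # one projection set per head
Wk = rng.normal(size=(h, d_model, d_k))
Wv = rng.normal(size=(h, d_model, d_k))
Wo = rng.normal(size=(d_model, d_model)) # final output projection

# Each head attends independently in its own d_k-dimensional subspace...
heads = [attention(X @ Wq[i], X @ Wk[i], X @ Wv[i]) for i in range(h)]
# ...then the heads are concatenated back to d_model and projected.
out = np.concatenate(heads, axis=-1) @ Wo
print(out.shape)                         # (5, 16)
```

Because each head works in a smaller subspace (d_k = d_model / h), the total cost is close to that of one full-width head, while each head is free to learn a different attention pattern.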

Preserving Word Order

Attention is order-agnostic — it treats the input as a set, not a sequence. To inject position information, a positional encoding vector is added to each token embedding before the first layer.

The original Transformer used sinusoidal functions of varying frequencies. Modern models use learned positional embeddings or RoPE (Rotary Position Embedding), which encodes relative position directly into the attention computation.

Even dimensions (sin): PE(pos, 2i) = sin(pos / 10000^(2i/d)). Sinusoids of varying frequency encode absolute position smoothly.

Odd dimensions (cos): PE(pos, 2i+1) = cos(pos / 10000^(2i/d)). The cosine counterpart allows the model to attend to relative offsets via linear transformations.
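The two formulas can be computed for all positions at once; a short NumPy sketch:

```python
import numpy as np

def positional_encoding(max_len, d):
    pos = np.arange(max_len)[:, None]   # positions, shape (max_len, 1)
    i = np.arange(d // 2)[None, :]      # dimension-pair index, shape (1, d/2)
    angle = pos / 10000 ** (2 * i / d)  # one frequency per pair
    pe = np.zeros((max_len, d))
    pe[:, 0::2] = np.sin(angle)         # PE(pos, 2i)   -> even dimensions
    pe[:, 1::2] = np.cos(angle)         # PE(pos, 2i+1) -> odd dimensions
    return pe

pe = positional_encoding(max_len=50, d=16)
print(pe.shape)    # (50, 16)
print(pe[0, :4])   # position 0: sin(0)=0, cos(0)=1 -> [0. 1. 0. 1.]
```

The resulting matrix is simply added to the token embeddings before the first layer, so two occurrences of the same word at different positions enter the network with different vectors.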

RoPE, used by LLaMA, Mistral, and Claude, rotates Q and K vectors by an angle proportional to position — letting attention scores naturally encode relative distance between tokens.
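A toy illustration of that rotation idea (not a production RoPE implementation): rotate each (even, odd) pair of a vector by an angle proportional to its position. Because rotations compose, the dot product of a rotated Query and Key depends only on their relative offset:

```python
import numpy as np

def rope(x, pos, base=10000):
    # Rotate each 2-D pair of a 1-D vector by pos * (per-pair frequency).
    d = x.shape[-1]
    theta = base ** (-np.arange(d // 2) * 2 / d)
    ang = pos * theta
    x1, x2 = x[0::2], x[1::2]                      # even / odd halves of each pair
    out = np.empty_like(x)
    out[0::2] = x1 * np.cos(ang) - x2 * np.sin(ang)
    out[1::2] = x1 * np.sin(ang) + x2 * np.cos(ang)
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Attention score for positions (5, 2) equals the score for (8, 5):
a = rope(q, 5) @ rope(k, 2)
b = rope(q, 8) @ rope(k, 5)
print(np.isclose(a, b))   # True: only the relative offset (3) matters
```

This is why RoPE needs no separate positional embedding table: position enters through the Q/K rotation, and the attention scores see only relative distances.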
