LLMs & Latent Space — A Visual Guide

A Visual Primer

Large Language Models &
Latent Space

How a neural network compresses the meaning of human language into a geometric universe of numbers — and what that makes possible.

01 — Foundations

What Is a Large Language Model?

An LLM is a neural network trained on vast quantities of text — books, code, conversations, encyclopaedias — with a single deceptively simple task: predict the next token. A token is roughly a word or word-fragment. By learning to predict tokens billions of times, the model is forced to build rich internal representations of grammar, facts, reasoning patterns, and even nuanced sentiment.

The “large” refers to both parameter count — modern models carry hundreds of billions of learned numbers — and the sheer diversity of training data. Scale unlocks qualitatively new behaviours: the ability to translate, code, reason, and write poetry all emerge from the same prediction objective.

1. Tokenisation

Raw text is broken into tokens. “unbelievable” might become [“un”, “believ”, “able”]. The model operates on token IDs, never raw characters.
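A minimal sketch of the idea, using a hand-made vocabulary and greedy longest-match splitting. Real LLM tokenisers learn their vocabularies from data (e.g. via byte-pair encoding), but the resulting token-to-ID mapping works the same way:

```python
# Toy subword tokeniser: greedy longest-match against a hand-made
# vocabulary. The vocabulary and IDs here are invented for illustration.
VOCAB = {"un": 0, "believ": 1, "able": 2, "cat": 3, "s": 4}

def tokenise(word: str) -> list[int]:
    """Split `word` into the longest vocabulary pieces, left to right."""
    ids = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first.
        for j in range(len(word), i, -1):
            if word[i:j] in VOCAB:
                ids.append(VOCAB[word[i:j]])
                i = j
                break
        else:
            raise ValueError(f"no token covers {word[i:]!r}")
    return ids

print(tokenise("unbelievable"))  # [0, 1, 2]
print(tokenise("cats"))          # [3, 4]
```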

2. Embedding Lookup

Each token ID is mapped to a high-dimensional vector — its initial position in latent space — via a learned embedding matrix.
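The lookup itself is just row indexing into a matrix. In this sketch the vectors are 4-dimensional and hand-made; real models use thousands of dimensions and learn every value during training:

```python
# Embedding lookup: each token ID selects one row of a learned matrix.
# These numbers are invented for illustration.
EMBEDDING = [
    [0.1, -0.3, 0.8, 0.0],   # ID 0: "un"
    [0.5, 0.2, -0.1, 0.9],   # ID 1: "believ"
    [-0.4, 0.7, 0.3, -0.2],  # ID 2: "able"
]

def embed(token_ids: list[int]) -> list[list[float]]:
    """Map token IDs to their initial positions in latent space."""
    return [EMBEDDING[i] for i in token_ids]

vectors = embed([0, 1, 2])
print(vectors[0])  # [0.1, -0.3, 0.8, 0.0]
```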

3. Transformer Layers

Stacked attention and feed-forward layers refine these vectors, letting every token “look at” every other token and update its meaning accordingly.
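The overall flow through the stack can be sketched as a loop of residual updates, which is the standard Transformer pattern: each sublayer proposes an update that is added back onto each token's vector. The `attention` and `feed_forward` callables below are placeholders; the point is the repeated `x = x + sublayer(x)` shape:

```python
def add(u: list[float], v: list[float]) -> list[float]:
    """Element-wise vector addition (the residual connection)."""
    return [a + b for a, b in zip(u, v)]

def transformer(x, layers):
    """x: one vector per token; layers: (attention, feed_forward) pairs,
    each returning one update vector per token."""
    for attention, feed_forward in layers:
        x = [add(xi, ui) for xi, ui in zip(x, attention(x))]     # mix in context
        x = [add(xi, ui) for xi, ui in zip(x, feed_forward(x))]  # refine per token
    return x
```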

4. Output Projection

The final vector is projected onto a probability distribution over the entire vocabulary. The most likely next token is sampled or selected.
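Turning raw scores (logits) into probabilities is done with a softmax; greedy decoding then picks the most likely token. The vocabulary and logit values below are hypothetical:

```python
import math

def softmax(logits: list[float]) -> list[float]:
    """Convert raw output scores into a probability distribution."""
    m = max(logits)                       # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits over a 4-token vocabulary.
vocab = ["the", "cat", "sat", "mat"]
logits = [1.0, 3.0, 0.5, 2.0]
probs = softmax(logits)
next_token = vocab[probs.index(max(probs))]  # greedy selection
print(next_token)  # cat
```

Sampling instead of taking the argmax draws from `probs` at random, which is what temperature-based decoding does in practice.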

02 — Latent Space

The Geometry of Meaning

Every token, sentence, and concept that passes through an LLM is encoded as a point in a very high-dimensional space — typically thousands of dimensions. This is the latent space. Its power lies in what its geometry encodes: points that lie close together represent related meanings.

Words like “king”, “queen”, “monarch” cluster near each other. Concepts like “sadness” and “grief” live close together. But what’s extraordinary is the structure that emerges — linear algebraic relationships encode meaning:

Classic Example

vector(“king”) − vector(“man”) + vector(“woman”) ≈ vector(“queen”)
The model has learned gender as a consistent direction through latent space.
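The analogy can be checked mechanically: subtract, add, then find the nearest vector by cosine similarity. The 3-D "embeddings" below are hand-made so the gender direction is consistent; real embeddings are learned and far higher-dimensional:

```python
import math

# Hand-made toy embeddings, chosen so that the "gender" direction
# (third coordinate) is consistent across the royal/commoner pairs.
E = {
    "king":  [0.9, 0.8, 0.1],
    "man":   [0.1, 0.8, 0.1],
    "woman": [0.1, 0.8, 0.9],
    "queen": [0.9, 0.8, 0.9],
}

def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

# vector("king") - vector("man") + vector("woman")
analogy = [k - m + w for k, m, w in zip(E["king"], E["man"], E["woman"])]
best = max(E, key=lambda word: cosine(E[word], analogy))
print(best)  # queen
```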

Figure: simulated 2D projection of a latent space.

This latent geometry is not hand-crafted. It arises purely from the statistics of co-occurrence in training text. The model discovers that certain directions encode tense, plurality, sentiment, formality — building a map of language that no human designed.

03 — Attention

Self-Attention: Context Shapes Meaning

A word’s meaning shifts with context. “Bank” in “river bank” is very different from “bank” in “savings bank”. Self-attention is the mechanism that resolves this ambiguity.

At each layer, every token queries every other token: “How relevant are you to me right now?” The answers — called attention weights — determine how much each neighbouring token contributes to updating the current token’s latent representation. After many layers, context is fully integrated: the latent vector for “bank” in a finance document has drifted far from the one in a geography text.

🔍

Query

“What am I looking for?” — each token projects a query vector.

🗝️

Key

“Here’s what I offer” — every token broadcasts a key.

💎

Value

“Here’s my actual information” — the weighted sum of values updates the representation.
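The query/key/value dance above is scaled dot-product attention, sketched here in plain Python over toy 2-D vectors (real models use learned projections, many heads, and much larger dimensions):

```python
import math

def softmax(xs: list[float]) -> list[float]:
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attention(Q, K, V):
    """Scaled dot-product attention: one query/key/value vector per token."""
    d = len(Q[0])
    out = []
    for q in Q:
        # "How relevant is each token to me right now?"
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in K]
        weights = softmax(scores)
        # The weighted sum of values becomes the updated representation.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Two tokens with toy 2-D queries, keys, and values.
Q = [[1.0, 0.0], [0.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0]]
print(attention(Q, K, V))
```

Because each query aligns best with its own key, each output leans toward its own value while still blending in the other token's information — context shaping meaning in miniature.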

04 — Emergence

Why Scale Unlocks Abilities

Smaller models can autocomplete sentences. Larger models — with richer latent spaces — begin to reason, translate, write code, and solve maths problems. These emergent capabilities are not explicitly programmed; they arise at scale thresholds that are difficult to predict in advance.

🧠

In-context Learning

A few examples in the prompt reshape which region of latent space the model operates in — no fine-tuning required.

🔗

Chain-of-Thought

Verbalising intermediate steps guides the model through latent space in a more structured trajectory, improving reasoning.

🌐

Cross-lingual Transfer

Concepts align across languages in shared latent regions — the model learns “language-agnostic” meaning.

05 — Limitations

What Latent Space Cannot Do

The latent space is a compression of training data, not a window into truth. Several fundamental limitations follow:

!

Hallucination

When queried about facts outside its training distribution, the model generates plausible-sounding but false content — the geometry of nearby concepts pulls it toward confident but wrong answers.

!

Knowledge Cutoff

Latent space is frozen at training time. Events after the cutoff date simply don’t exist in the model’s geometry.

!

Spurious Correlations

Biases in training data imprint into latent geometry. Stereotyped associations can be baked into the directions the model learns.

Analogy

Latent space is like a vast, beautifully organised library — but the books are all written before a certain date, a few are fiction presented as fact, and the librarian occasionally confabulates titles that don’t exist.


Understanding LLMs & Latent Space · A Visual Primer · 2025
