Tokenization & Embeddings Explained

How language models turn raw text into numbers they can reason about — from characters to concepts.

Breaking Text into Tokens

Before a model reads a single word, it must convert text into a sequence of discrete units called tokens. Tokens are not always full words — they can be subwords, punctuation, or even individual characters.

Modern models (like GPT-4 or Claude) use Byte-Pair Encoding (BPE) or SentencePiece, which split rare words into known subword pieces while keeping common words whole. The word “unbelievable” might become ["un", "believ", "able"].
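To make the idea concrete, here is a minimal sketch of subword splitting using greedy longest-prefix matching against a tiny hand-picked vocabulary. Real tokenizers apply learned BPE merge rules rather than greedy matching, and the `VOCAB` set here is purely illustrative.

```python
# Hypothetical toy vocabulary; real vocabularies hold 32k-100k entries.
VOCAB = {"un", "believ", "able"}

def split_into_subwords(word, vocab):
    """Greedily match the longest known piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try longest match first
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            pieces.append(word[start])  # unknown character falls through alone
            start += 1
    return pieces

print(split_into_subwords("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

Note how the rare word decomposes into pieces the vocabulary already knows, which is exactly what lets a fixed-size vocabulary cover an open-ended set of words.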

The Tokenization Pipeline

1. Normalisation

Text is lowercased, Unicode-normalised, and whitespace is standardised. Some models preserve casing for named entities.

2. Pre-tokenisation

A regex splits on whitespace and punctuation, producing rough word-level chunks before subword splitting begins.

3. Subword Merging (BPE)

Pairs of adjacent symbols (initially single characters, later previously merged pieces) that appear together most frequently in training data are merged iteratively until a fixed vocabulary size is reached — typically 32k–100k tokens.

4. ID Lookup

Each token string is mapped to an integer ID from the vocabulary. These integers are the actual inputs to the model.
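The merge step (step 3) can be sketched in a few lines. This is a toy version of BPE training under simplifying assumptions: the corpus is a small hand-written dictionary of character-split words with counts, and we run three merges instead of growing to a full vocabulary budget.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (word -> frequency) corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of the pair into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: character-split words with occurrence counts.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("l", "o", "w", "e", "s", "t"): 3}
for _ in range(3):  # three merge steps for illustration
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(list(corpus))  # [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
```

After three merges the common stem "low"/"lowe" has become a single symbol, while the rarer suffixes remain as separate pieces — the same whole-common-words, split-rare-words behaviour described above.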

Turning Tokens into Vectors

Token IDs are just numbers — they carry no semantic meaning. Embeddings fix this by mapping each token ID to a dense vector of floating-point numbers (typically 512–4096 dimensions).

These vectors are learned during training. Words used in similar contexts end up with similar vectors. The geometry of this space encodes meaning.
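Mechanically, an embedding layer is nothing more than a lookup table: a matrix with one row per vocabulary entry. A minimal sketch, with randomly initialised values standing in for what training would learn (the sizes here are deliberately tiny):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 100, 8  # real models: 32k-100k tokens, 512-4096 dims

# The embedding table is a (vocab_size x dim) matrix of floats, tuned
# during training; here it is just randomly initialised.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Map each integer token ID to its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([5, 17, 42])
print(len(vectors), len(vectors[0]))  # 3 8
```

This is why token IDs themselves are arbitrary: all the semantics live in the rows they index.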

💡 The famous example: King − Man + Woman ≈ Queen. Vector arithmetic mirrors conceptual relationships because the model has learned structure from patterns in language.
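The arithmetic itself is easy to demonstrate. The 3-d vectors below are hand-built along hypothetical axes (roughly "royalty", "masculinity", "age") purely for illustration; real embeddings learn such directions from data rather than having them assigned.

```python
# Hand-built toy vectors; axes are a made-up stand-in for learned features.
words = {
    "king":  [0.9,  0.9, 0.5],
    "queen": [0.9, -0.9, 0.5],
    "man":   [0.1,  0.9, 0.4],
    "woman": [0.1, -0.9, 0.4],
}

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in
          zip(words["king"], words["man"], words["woman"])]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Nearest word to the resulting point
closest = min(words, key=lambda w: euclidean(words[w], target))
print(closest)  # queen
```

Subtracting "man" removes the masculine direction, adding "woman" supplies the feminine one, and the royal component carries through untouched.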

Words as Points in Space

High-dimensional embeddings are projected down to 2D (via PCA or t-SNE) for visualisation. Related words cluster together; distances reflect semantic similarity.

Measuring Closeness with Cosine Similarity

To compare embeddings we use cosine similarity — the cosine of the angle between two vectors. A score of 1 means identical direction (very similar meaning), 0 means unrelated, −1 means opposite.

Why cosine and not Euclidean distance? Cosine ignores magnitude, caring only about direction. Two long documents and a short one on the same topic will score similarly.
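Cosine similarity is short enough to write out in full: the dot product of the two vectors divided by the product of their lengths. The example below also shows the magnitude-invariance point from the paragraph above: scaling a vector does not change its score.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
# A scaled copy points the same way, so the score is still 1.
print(round(cosine_similarity(a, [2.0, 4.0, 6.0]), 6))   # 1.0
# A negated copy points the opposite way.
print(round(cosine_similarity(a, [-1.0, -2.0, -3.0]), 6))  # -1.0
```

In practice, embeddings are often normalised to unit length up front, at which point cosine similarity reduces to a plain dot product.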

Why This All Matters

Tokenization and embeddings are the foundation that every language model, search engine, and recommendation system is built on. They allow raw text to flow through neural networks as differentiable numbers.

Downstream tasks — sentiment analysis, question answering, translation, retrieval-augmented generation — all start with good embeddings. The richer the embedding space, the more nuanced the model’s understanding of language.

Modern contextual embeddings (from Transformer attention) go further: the same word gets a different vector depending on its surrounding context. “Bank” near “river” and “bank” near “loan” produce distinct vectors.
