Tokenization & Embeddings Explained

How language models turn raw text into numbers they can reason about — from characters to concepts.

Breaking Text into Tokens

Before a model reads a single word, it must convert text into a sequence of discrete units called tokens. Tokens are not always full words — they can be subwords, punctuation, or even individual characters.

Modern models (like GPT-4 or Claude) use Byte-Pair Encoding (BPE) or SentencePiece, which split rare words into known subword pieces while keeping common words whole. The word “unbelievable” might become ["un", "believ", "able"].
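To make the idea concrete, here is a minimal sketch of subword splitting using greedy longest-prefix matching against a tiny hand-picked vocabulary. Real tokenizers apply learned BPE merge rules rather than greedy matching, and the `VOCAB` set here is purely illustrative.

```python
# Hypothetical toy vocabulary; real vocabularies hold 32k-100k entries.
VOCAB = {"un", "believ", "able"}

def split_into_subwords(word, vocab):
    """Greedily match the longest known piece at each position."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):  # try longest match first
            piece = word[start:end]
            if piece in vocab:
                pieces.append(piece)
                start = end
                break
        else:
            pieces.append(word[start])  # unknown character falls through alone
            start += 1
    return pieces

print(split_into_subwords("unbelievable", VOCAB))  # ['un', 'believ', 'able']
```

Note how the rare word decomposes into pieces the vocabulary already knows, which is exactly what lets a fixed-size vocabulary cover an open-ended set of words.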

The Tokenization Pipeline

1. Normalisation

Text is lowercased, Unicode-normalised, and whitespace is standardised. Some models preserve casing for named entities.

2. Pre-tokenisation

A regex splits on whitespace and punctuation, producing rough word-level chunks before subword splitting begins.

3. Subword Merging (BPE)

Pairs of adjacent symbols (initially single characters, later previously merged pieces) that appear together most frequently in training data are merged iteratively until a fixed vocabulary size is reached — typically 32k–100k tokens.

4. ID Lookup

Each token string is mapped to an integer ID from the vocabulary. These integers are the actual inputs to the model.
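The merge step (step 3) can be sketched in a few lines. This is a toy version of BPE training under simplifying assumptions: the corpus is a small hand-written dictionary of character-split words with counts, and we run three merges instead of growing to a full vocabulary budget.

```python
from collections import Counter

def most_frequent_pair(words):
    """Count adjacent symbol pairs across the (word -> frequency) corpus."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get)

def merge_pair(words, pair):
    """Rewrite every word, fusing each occurrence of the pair into one symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = freq
    return merged

# Tiny corpus: character-split words with occurrence counts.
corpus = {("l", "o", "w"): 5, ("l", "o", "w", "e", "r"): 2,
          ("l", "o", "w", "e", "s", "t"): 3}
for _ in range(3):  # three merge steps for illustration
    corpus = merge_pair(corpus, most_frequent_pair(corpus))
print(list(corpus))  # [('low',), ('lowe', 'r'), ('lowe', 's', 't')]
```

After three merges the common stem "low"/"lowe" has become a single symbol, while the rarer suffixes remain as separate pieces — the same whole-common-words, split-rare-words behaviour described above.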

Turning Tokens into Vectors

Token IDs are just numbers — they carry no semantic meaning. Embeddings fix this by mapping each token ID to a dense vector of floating-point numbers (typically 512–4096 dimensions).

These vectors are learned during training. Words used in similar contexts end up with similar vectors. The geometry of this space encodes meaning.
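Mechanically, an embedding layer is nothing more than a lookup table: a matrix with one row per vocabulary entry. A minimal sketch, with randomly initialised values standing in for what training would learn (the sizes here are deliberately tiny):

```python
import random

random.seed(0)
VOCAB_SIZE, DIM = 100, 8  # real models: 32k-100k tokens, 512-4096 dims

# The embedding table is a (vocab_size x dim) matrix of floats, tuned
# during training; here it is just randomly initialised.
embedding_table = [[random.gauss(0.0, 0.02) for _ in range(DIM)]
                   for _ in range(VOCAB_SIZE)]

def embed(token_ids):
    """Map each integer token ID to its row of the embedding table."""
    return [embedding_table[i] for i in token_ids]

vectors = embed([5, 17, 42])
print(len(vectors), len(vectors[0]))  # 3 8
```

This is why token IDs themselves are arbitrary: all the semantics live in the rows they index.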

💡 The famous example: King − Man + Woman ≈ Queen. Vector arithmetic mirrors conceptual relationships because the model has learned structure from patterns in language.
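The arithmetic itself is easy to demonstrate. The 3-d vectors below are hand-built along hypothetical axes (roughly "royalty", "masculinity", "age") purely for illustration; real embeddings learn such directions from data rather than having them assigned.

```python
# Hand-built toy vectors; axes are a made-up stand-in for learned features.
words = {
    "king":  [0.9,  0.9, 0.5],
    "queen": [0.9, -0.9, 0.5],
    "man":   [0.1,  0.9, 0.4],
    "woman": [0.1, -0.9, 0.4],
}

# king - man + woman, computed component-wise
target = [k - m + w for k, m, w in
          zip(words["king"], words["man"], words["woman"])]

def euclidean(u, v):
    return sum((a - b) ** 2 for a, b in zip(u, v)) ** 0.5

# Nearest word to the resulting point
closest = min(words, key=lambda w: euclidean(words[w], target))
print(closest)  # queen
```

Subtracting "man" removes the masculine direction, adding "woman" supplies the feminine one, and the royal component carries through untouched.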

Words as Points in Space

High-dimensional embeddings are projected down to 2D (via PCA or t-SNE) for visualisation. Related words cluster together; distances reflect semantic similarity.

Measuring Closeness with Cosine Similarity

To compare embeddings we use cosine similarity — the cosine of the angle between two vectors. A score of 1 means identical direction (very similar meaning), 0 means unrelated, −1 means opposite.

Why cosine and not Euclidean distance? Cosine ignores magnitude, caring only about direction. Two long documents and a short one on the same topic will score similarly.
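Cosine similarity is short enough to write out in full: the dot product of the two vectors divided by the product of their lengths. The example below also shows the magnitude-invariance point from the paragraph above: scaling a vector does not change its score.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: dot(u, v) / (|u| * |v|)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

a = [1.0, 2.0, 3.0]
# A scaled copy points the same way, so the score is still 1.
print(round(cosine_similarity(a, [2.0, 4.0, 6.0]), 6))   # 1.0
# A negated copy points the opposite way.
print(round(cosine_similarity(a, [-1.0, -2.0, -3.0]), 6))  # -1.0
```

In practice, embeddings are often normalised to unit length up front, at which point cosine similarity reduces to a plain dot product.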

Why This All Matters

Tokenization and embeddings are the foundation that every language model, search engine, and recommendation system is built on. They allow raw text to flow through neural networks as differentiable numbers.

Downstream tasks — sentiment analysis, question answering, translation, retrieval-augmented generation — all start with good embeddings. The richer the embedding space, the more nuanced the model’s understanding of language.

Modern contextual embeddings (from Transformer attention) go further: the same word gets a different vector depending on its surrounding context. “Bank” near “river” and “bank” near “loan” produce distinct vectors.
