Expert Reference Vol. I — Foundations

Introduction to Generative AI & Large Language Models

A comprehensive guide to understanding how modern AI systems generate text, images, code, and more — from mathematical foundations to real-world deployment.

  • 175B+ parameters in GPT-3
  • ~1T tokens in training
  • 2017: the Transformer paper
01

What is Generative AI?

Generative AI refers to systems that can produce new content — text, images, audio, video, code — by learning statistical patterns from vast amounts of existing data.

Unlike traditional discriminative models that classify or predict labels for given inputs, generative models learn the underlying distribution of data and can sample novel outputs from that distribution. This fundamental difference unlocks creative and open-ended capabilities previously impossible with classical machine learning.

Text Generation

Writing, summarisation, translation, Q&A, code completion, dialogue.

Image Synthesis

Diffusion models and GANs create photorealistic and artistic imagery.

Audio & Music

Speech synthesis, music generation, voice cloning, sound design.

Multimodal

Unified models that reason across text, vision, audio simultaneously.

Key Insight

Generative AI doesn’t “understand” in the human sense — it models the statistical likelihood of tokens given context, with emergent behaviours that can appear remarkably human-like.

02

A Brief History

From perceptrons to foundation models — the trajectory of generative AI spans seven decades of research breakthroughs.

1950s
Early Neural Networks

Rosenblatt’s Perceptron (1958) introduced the first trainable neural model. Limited by single-layer architecture.

1986
Backpropagation

Rumelhart, Hinton & Williams demonstrate efficient gradient-based learning in multi-layer networks.

1997
LSTM Networks

Hochreiter & Schmidhuber introduce Long Short-Term Memory, enabling sequence modelling over longer contexts.

2014
GANs & Seq2Seq

Goodfellow introduces Generative Adversarial Networks. Seq2Seq models emerge for neural machine translation.

2017
“Attention Is All You Need”

Vaswani et al. at Google introduce the Transformer architecture — the foundation of virtually all modern LLMs.

2018–20
BERT, GPT-2 & GPT-3

OpenAI and Google scale up pre-trained language models, demonstrating few-shot and zero-shot capabilities.

2022+
ChatGPT & the Public Moment

Instruction-tuned models and RLHF make LLMs accessible. Stable Diffusion democratises image synthesis. The era of foundation models begins.

03

How LLMs Work

Large Language Models are autoregressive neural networks that predict the probability distribution of the next token given all preceding tokens.

Token Prediction Pipeline
Raw Text
Tokeniser
Embedding
Transformer Blocks
Logits
Softmax
Next Token

Tokens are subword units — typically 3–4 characters on average. A tokeniser like BPE (Byte Pair Encoding) converts raw text into integer IDs. Each ID maps to a learned vector (embedding) in high-dimensional space.
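
To make that concrete, here is a deliberately tiny sketch in Python: the vocabulary, IDs, and dimensions are made up, and the greedy longest-match lookup stands in for a real BPE tokeniser.

import numpy as np

# Toy vocabulary; real tokenisers use BPE with ~50k-100k entries.
vocab = {"<eos>": 0, "Gen": 1, "erative": 2, " AI": 3, " is": 4, " fun": 5}

def tokenise(text):
    # Greedy longest-match lookup, a stand-in for byte pair encoding.
    ids, rest = [], text
    while rest:
        match = max((t for t in vocab if rest.startswith(t)), key=len, default=None)
        if match is None:
            rest = rest[1:]          # skip unknown characters in this toy version
            continue
        ids.append(vocab[match])
        rest = rest[len(match):]
    return ids

d_model = 8                                   # real models use thousands of dimensions
embedding_table = np.random.randn(len(vocab), d_model)

ids = tokenise("Generative AI is fun")        # -> [1, 2, 3, 4, 5]
vectors = embedding_table[ids]                # one learned vector per token ID
print(ids, vectors.shape)                     # (5, 8)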

These embeddings pass through a stack of Transformer blocks, each performing multi-head self-attention and feed-forward operations. The final output is a probability distribution over the vocabulary. Sampling from this distribution yields the next token.

# Simplified autoregressive generation
def generate(model, prompt, max_tokens=200):
    tokens = tokenise(prompt)
    for _ in range(max_tokens):
        logits = model(tokens)              # Forward pass
        probs = softmax(logits[-1])         # Last token's distribution
        next_t = sample(probs, temp=0.8)    # Temperature sampling
        tokens.append(next_t)
        if next_t == EOS_TOKEN:
            break
    return detokenise(tokens)
04

The Transformer Architecture

The Transformer, introduced in “Attention Is All You Need” (2017), replaced recurrent networks with a fully attention-based architecture that parallelises training over entire sequences.

“Attention mechanisms allow the model to weigh the relevance of every token in context when encoding each position — enabling rich long-range dependencies.” — Vaswani et al., 2017

Self-Attention

  • Queries, Keys & Values
  • Scaled dot-product scoring
  • Multi-head parallelism
  • O(n²) time complexity
  • Absolute + rotary positional encodings

Feed-Forward Layers

  • Two linear projections
  • Non-linearity (GeLU/SiLU)
  • 4× hidden dimension expansion
  • Applied position-wise
  • Stores factual knowledge

Each block also includes residual connections and Layer Normalisation for training stability. Modern LLMs stack 32–96 such blocks. Context windows have expanded from 2,048 tokens (GPT-3) to over 1 million tokens in recent models.
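
The sketch below shows both core components at toy scale: a single attention head with no causal mask, and a position-wise feed-forward layer. Residual connections and normalisation are omitted for brevity, and the dimensions are illustrative rather than those of any real model.

import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    # Scaled dot-product attention for a single head (no causal mask shown).
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    scores = Q @ K.T / np.sqrt(K.shape[-1])   # (seq, seq) relevance scores
    return softmax(scores) @ V                # weighted sum of value vectors

def feed_forward(X, W1, b1, W2, b2):
    # Position-wise MLP: expand to 4x the width, apply GELU, project back.
    h = X @ W1 + b1
    gelu = 0.5 * h * (1 + np.tanh(np.sqrt(2 / np.pi) * (h + 0.044715 * h**3)))
    return gelu @ W2 + b2

seq_len, d_model = 4, 16
X = np.random.randn(seq_len, d_model)
Wq = Wk = Wv = np.random.randn(d_model, d_model)
W1, b1 = np.random.randn(d_model, 4 * d_model), np.zeros(4 * d_model)
W2, b2 = np.random.randn(4 * d_model, d_model), np.zeros(d_model)

out = feed_forward(self_attention(X, Wq, Wk, Wv), W1, b1, W2, b2)
print(out.shape)   # (4, 16): same shape as the input, ready for the residual add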

05

The Training Process

Modern LLM training occurs in stages, each refining the model’s behaviour from raw pattern matching to helpful, aligned interaction.

Pre-Training

Self-supervised next-token prediction on trillions of tokens from web, books, code. Learns world knowledge and language.
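
Under the hood, the pre-training objective is simply cross-entropy on the next token. A toy calculation for a single position, with made-up logits and a five-token vocabulary:

import numpy as np

logits = np.array([2.0, 0.5, -1.0, 0.1, 0.3])   # model scores over the vocabulary
target_id = 0                                    # the token that actually came next

probs = np.exp(logits - logits.max())
probs /= probs.sum()
loss = -np.log(probs[target_id])                 # cross-entropy at this position
print(round(float(loss), 2))                     # ~0.47

# Pre-training averages this loss over every position of every sequence
# in the corpus and minimises it with stochastic gradient descent.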

Supervised Fine-Tuning

Human-written demonstrations of desired behaviours. Teaches instruction following, Q&A, formatting norms.

RLHF

Reinforcement Learning from Human Feedback. A reward model ranks outputs; PPO optimises toward human preferences.

Constitutional AI

AI-generated feedback based on explicit principles. Scales alignment supervision without exhaustive human labelling.

Scaling Laws

Chinchilla research (Hoffmann et al., 2022) showed model performance scales predictably with both parameter count and training token count — optimal training requires roughly 20× more tokens than parameters.
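
As a back-of-the-envelope sketch of that rule of thumb, here is a short calculation using the widely quoted C ≈ 6·N·D approximation for training compute (N parameters, D tokens); the exact constants vary from paper to paper.

def chinchilla_estimate(params):
    """Rough compute-optimal recipe: ~20 training tokens per parameter."""
    tokens = 20 * params
    flops = 6 * params * tokens      # common C ~= 6 * N * D approximation
    return tokens, flops

# e.g. Chinchilla itself: 70B parameters trained on ~1.4T tokens
for n in (1e9, 70e9):
    tokens, flops = chinchilla_estimate(n)
    print(f"{n/1e9:.0f}B params -> ~{tokens/1e9:.0f}B tokens, ~{flops:.1e} training FLOPs")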

06

The Art of Prompting

Prompt engineering is the practice of crafting inputs that elicit optimal model outputs — a skill that blends linguistic intuition with mechanistic understanding.

Technique | Description | Best for
Zero-Shot | No examples, just instructions | Simple tasks
Few-Shot | 2–8 input/output examples in context | Format learning
Chain-of-Thought | “Think step by step” elicits reasoning | Complex reasoning
System Prompt | Persona and context framing | Tone and role
RAG | Retrieved context injected into prompt | Knowledge grounding
Tool Use | Model calls external functions/APIs | Agentic workflows
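
In practice, several of these techniques are often combined into a single prompt string before the model is called. The sketch below is purely illustrative; the question, retrieved passage, and examples are invented for demonstration.

def build_prompt(question, retrieved_passages, examples):
    """Assemble a prompt mixing RAG context, few-shot examples,
    and a chain-of-thought cue."""
    parts = ["You are a careful assistant. Use only the context provided."]
    parts.append("Context:\n" + "\n".join(retrieved_passages))      # RAG grounding
    for q, a in examples:                                           # few-shot examples
        parts.append(f"Q: {q}\nA: {a}")
    parts.append(f"Q: {question}\nLet's think step by step.\nA:")   # chain-of-thought
    return "\n\n".join(parts)

prompt = build_prompt(
    question="Which paper introduced the Transformer?",
    retrieved_passages=["Vaswani et al. (2017), 'Attention Is All You Need', "
                        "introduced the Transformer architecture."],
    examples=[("Who proposed the LSTM?", "Hochreiter & Schmidhuber (1997).")],
)
print(prompt)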
07

Notable Models

The landscape of foundation models has diversified rapidly, with closed and open-weight options spanning a wide range of capability and scale.

Model | Organisation | Release | Notable feature
GPT-4o | OpenAI | 2024 | Omni: text, vision, audio
Claude 3.7 Sonnet | Anthropic | 2025 | Extended thinking, 200K context
Gemini 2.0 Flash | Google DeepMind | 2024 | Natively multimodal
Llama 3.1 405B | Meta AI | 2024 | Open weights, 128K context
Mistral Large | Mistral AI | 2024 | European-built, multilingual
DeepSeek-V3 | DeepSeek | 2024 | MoE architecture, cost-efficient
08

Real-World Applications

Generative AI is reshaping industries at an extraordinary pace — from automating knowledge work to enabling entirely new categories of product.

💊
Drug Discovery

Protein structure prediction, molecule generation, clinical trial optimisation.

⚖️
Legal

Contract analysis, case research, document drafting and review automation.

💻
Software Dev

Code generation, debugging, documentation, test writing, refactoring.

🎨
Creative Media

Copywriting, image/video production, game asset generation, storyboarding.

📚
Education

Personalised tutoring, curriculum generation, instant feedback systems.

📊
Finance

Earnings analysis, fraud detection, report generation, risk narratives.

09

Challenges & Ethics

With remarkable capability comes significant responsibility. The field grapples with fundamental technical and societal challenges.

01
Hallucination

LLMs confidently produce plausible-sounding but factually incorrect content. Mitigations include RAG, fine-tuning, and chain-of-thought verification.

02
Bias & Fairness

Training data reflects societal biases. Models can perpetuate or amplify stereotypes across gender, race, culture, and ideology.

03
Safety & Alignment

Ensuring models behave in accordance with human values at scale remains an open research problem. Misuse, misalignment, and catastrophic risk all require active mitigation.

04
Environmental Cost

Training frontier models requires significant energy — GPT-3’s training emitted ~552 tonnes CO₂e. Inference at scale compounds this further.

05
Intellectual Property

Legal uncertainty around training on copyrighted data, model outputs, and attribution is actively contested across multiple jurisdictions.

10

Key Glossary

Token

A subword unit — the atomic element processed by LLMs. “Unbelievable” might tokenise as [“Un”, “believ”, “able”].

Embedding

A dense vector representation of a token in continuous space. Semantically similar tokens cluster together.
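
A tiny illustration of that clustering effect using cosine similarity; the three-dimensional vectors here are invented by hand, whereas real embeddings have thousands of learned dimensions.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

cat   = np.array([0.9, 0.1, 0.0])
dog   = np.array([0.8, 0.2, 0.1])
piano = np.array([0.0, 0.1, 0.9])

print(cosine(cat, dog))    # high similarity: related concepts sit close together
print(cosine(cat, piano))  # low similarity: unrelated concepts are far apart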

Attention

A mechanism that weights the relevance of all context tokens when computing a representation for each position.

Temperature

A sampling parameter (0–2) controlling output randomness. Lower = deterministic; higher = creative/chaotic.
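
A quick sketch of how temperature reshapes the distribution being sampled from, using made-up logits:

import numpy as np

def sample_distribution(logits, temperature):
    # Dividing logits by the temperature sharpens (<1) or flattens (>1) the distribution.
    scaled = np.array(logits) / max(temperature, 1e-6)
    e = np.exp(scaled - scaled.max())
    return e / e.sum()

logits = [2.0, 1.0, 0.2]
for t in (0.2, 0.8, 1.5):
    print(t, np.round(sample_distribution(logits, t), 3))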

Context Window

The maximum number of tokens an LLM can “see” in one forward pass. Determines memory capacity.

Fine-Tuning

Continued training of a pre-trained model on a smaller, task-specific dataset to specialise its behaviour.

RLHF

Reinforcement Learning from Human Feedback. Aligns model outputs to human preferences via a reward model.

RAG

Retrieval-Augmented Generation. Grounds LLM responses in retrieved documents to reduce hallucination.

Inference

Running a trained model on new inputs to produce outputs — as opposed to training.

Perplexity

A measure of how well a language model predicts a text sample. Lower perplexity indicates better fit.
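
In code, perplexity is simply the exponential of the average negative log-probability the model assigned to each actual next token; the probabilities below are invented for illustration.

import numpy as np

# Probabilities the model assigned to each actual next token in a sample text.
token_probs = [0.40, 0.10, 0.65, 0.30]
perplexity = np.exp(-np.mean(np.log(token_probs)))
print(round(float(perplexity), 2))   # ~3.4: as uncertain as a uniform choice over ~3 tokens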

Further Reading

Attention Is All You Need (Vaswani et al., 2017) · Language Models are Few-Shot Learners (Brown et al., 2020) · Training Language Models to Follow Instructions with Human Feedback (Ouyang et al., 2022) · Scaling Laws for Neural Language Models (Kaplan et al., 2020)
