Introduction to
Generative AI
& Large Language Models
A comprehensive guide to understanding how modern AI systems generate text, images, code, and more — from mathematical foundations to real-world deployment.
What is Generative AI?
Generative AI refers to systems that can produce new content — text, images, audio, video, code — by learning statistical patterns from vast amounts of existing data.
Unlike traditional discriminative models that classify or predict labels for given inputs, generative models learn the underlying distribution of data and can sample novel outputs from that distribution. This fundamental difference unlocks creative and open-ended capabilities previously impossible with classical machine learning.
- **Text:** Writing, summarisation, translation, Q&A, code completion, dialogue.
- **Images:** Diffusion models and GANs create photorealistic and artistic imagery.
- **Audio:** Speech synthesis, music generation, voice cloning, sound design.
- **Multimodal:** Unified models that reason across text, vision and audio simultaneously.
Generative AI doesn’t “understand” in the human sense — it models the statistical likelihood of tokens given context, with emergent behaviours that can appear remarkably human-like.
A Brief History
From perceptrons to foundation models — the trajectory of generative AI spans seven decades of research breakthroughs.
- **1958:** Rosenblatt’s Perceptron introduces the first trainable neural model, though limited by its single-layer architecture.
- **1986:** Rumelhart, Hinton & Williams demonstrate backpropagation, enabling efficient gradient-based learning in multi-layer networks.
- **1997:** Hochreiter & Schmidhuber introduce Long Short-Term Memory (LSTM), enabling sequence modelling over longer contexts.
- **2014:** Goodfellow et al. introduce Generative Adversarial Networks; Seq2Seq models emerge for neural machine translation.
- **2017:** Vaswani et al. at Google introduce the Transformer architecture, the foundation of all modern LLMs.
- **2018–2020:** OpenAI and Google scale up pre-trained language models, demonstrating few-shot and zero-shot capabilities.
- **2022–present:** Instruction-tuned models and RLHF make LLMs broadly accessible; Stable Diffusion democratises image synthesis. The era of foundation models begins.
How LLMs Work
Large Language Models are autoregressive neural networks that predict the probability distribution of the next token given all preceding tokens.
Tokens are subword units — typically 3–4 characters on average. A tokeniser like BPE (Byte Pair Encoding) converts raw text into integer IDs. Each ID maps to a learned vector (embedding) in high-dimensional space.
These embeddings pass through a stack of Transformer blocks, each performing multi-head self-attention and feed-forward operations. The final output is a probability distribution over the vocabulary. Sampling from this distribution yields the next token.
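The final prediction step can be sketched with toy numbers; the vocabulary and logits below are invented for illustration, not taken from a real model:

```python
import math

# Toy vocabulary and hand-picked logits for some context,
# standing in for a real model's output layer.
vocab = ["mat", "dog", "roof", "moon"]
logits = [3.2, 1.1, 2.0, -0.5]

def softmax(xs):
    """Turn raw logits into a probability distribution."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax(logits)
# Greedy decoding picks the most probable token; real systems
# often sample from the distribution instead.
next_token = vocab[probs.index(max(probs))]
```

Sampling rather than taking the argmax is what makes generation non-deterministic, and is where parameters such as temperature come in.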
The Transformer Architecture
The Transformer, introduced in “Attention Is All You Need” (2017), replaced recurrent networks with a fully attention-based architecture that parallelises training over entire sequences.
Self-Attention
- Queries, Keys & Values
- Scaled dot-product scoring
- Multi-head parallelism
- O(n²) time complexity
- Absolute + rotary positional encodings
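The bullets above can be condensed into a minimal single-head sketch of scaled dot-product attention, with tiny hand-written matrices standing in for the learned query/key/value projections:

```python
import math

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def attention(Q, K, V):
    """Scaled dot-product attention for one head, no batching or masking."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Score each key against this query, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k)
                  for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out

# Three positions, d_k = 2 — illustrative values only.
Q = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
K = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
V = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
out = attention(Q, K, V)
```

Each output row is a convex combination of the value rows, which is why attention is often described as a soft, differentiable lookup.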
Feed-Forward Layers
- Two linear projections
- Non-linearity (GeLU/SiLU)
- 4× hidden dimension expansion
- Applied position-wise
- Stores factual knowledge
Each block also includes residual connections and Layer Normalisation for training stability. Modern LLMs stack 32–96 such blocks. Context windows have expanded from 1,024 tokens (GPT-2) to over 1 million tokens in recent models.
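The residual-and-normalisation wiring can be sketched as below; scalar weights and ReLU stand in for the learned projection matrices and GeLU of a real block:

```python
import math

def layer_norm(x, eps=1e-5):
    """Normalise a vector to zero mean and unit variance."""
    mu = sum(x) / len(x)
    var = sum((v - mu) ** 2 for v in x) / len(x)
    return [(v - mu) / math.sqrt(var + eps) for v in x]

def ffn(x, w1=2.0, w2=0.5):
    # Stand-in for the two linear projections; real blocks use
    # learned weight matrices and GeLU/SiLU, not scalars and ReLU.
    hidden = [max(0.0, w1 * v) for v in x]
    return [w2 * v for v in hidden]

def block(x):
    # Pre-norm residual wiring: x + Sublayer(LayerNorm(x)).
    return [a + b for a, b in zip(x, ffn(layer_norm(x)))]

x = [0.5, -1.2, 3.3, 0.0]
y = block(x)
```

The residual path means each sublayer only has to learn a correction to its input, which is a large part of why very deep stacks remain trainable.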
The Training Process
Modern LLM training occurs in stages, each refining the model’s behaviour from raw pattern matching to helpful, aligned interaction.
- **Pre-training:** Self-supervised next-token prediction on trillions of tokens from the web, books and code. Learns world knowledge and language.
- **Supervised fine-tuning (SFT):** Human-written demonstrations of desired behaviours. Teaches instruction following, Q&A and formatting norms.
- **RLHF:** Reinforcement Learning from Human Feedback. A reward model ranks outputs; PPO optimises toward human preferences.
- **RLAIF / Constitutional AI:** AI-generated feedback based on explicit principles. Scales alignment supervision without exhaustive human labelling.
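The pre-training stage optimises next-token cross-entropy, which can be sketched with toy distributions:

```python
import math

def next_token_loss(predicted_probs, target_ids):
    """Average cross-entropy of predicted distributions vs. true next tokens.

    predicted_probs[i] is the model's distribution at position i;
    target_ids[i] is the index of the actual next token there.
    """
    losses = [-math.log(p[t]) for p, t in zip(predicted_probs, target_ids)]
    return sum(losses) / len(losses)

# Two positions over a toy 3-token vocabulary (values invented).
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
targets = [0, 1]
loss = next_token_loss(probs, targets)
```

Perfect predictions (probability 1 on the true token) would give zero loss; everything else is penalised by how much probability mass was placed elsewhere.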
Chinchilla research (Hoffmann et al., 2022) showed model performance scales predictably with both parameter count and training token count — optimal training requires roughly 20× more tokens than parameters.
The Art of Prompting
Prompt engineering is the practice of crafting inputs that elicit optimal model outputs — a skill that blends linguistic intuition with mechanistic understanding.
| Technique | Description | Best For |
|---|---|---|
| Zero-Shot | No examples, just instructions | Simple tasks |
| Few-Shot | 2–8 input/output examples in context | Format learning |
| Chain-of-Thought | “Think step by step” elicits reasoning | Complex reasoning |
| System Prompt | Persona and context framing | Tone & Role |
| RAG | Retrieved context injected into prompt | Knowledge grounding |
| Tool Use | Model calls external functions/APIs | Agentic workflows |
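At the API level, few-shot prompting is ultimately string assembly. A minimal sketch follows; the system line and example pairs are invented for illustration:

```python
def build_prompt(system, examples, query):
    """Assemble a few-shot prompt: system framing, worked examples, query."""
    lines = [system, ""]
    for inp, out in examples:
        lines.append(f"Input: {inp}")
        lines.append(f"Output: {out}")
        lines.append("")
    lines.append(f"Input: {query}")
    lines.append("Output:")  # the model continues from here
    return "\n".join(lines)

prompt = build_prompt(
    "You convert country names to capitals.",
    [("France", "Paris"), ("Japan", "Tokyo")],
    "Canada",
)
```

Ending the prompt mid-pattern, right after `Output:`, is what nudges the model to complete it in the demonstrated format.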
Notable Models
The landscape of foundation models has diversified rapidly, with closed and open-weight options spanning a wide range of capability and scale.
| Model | Organisation | Release | Notable Feature |
|---|---|---|---|
| GPT-4o | OpenAI | 2024 | Omni — text, vision, audio |
| Claude 3.7 Sonnet | Anthropic | 2025 | Extended thinking, 200K context |
| Gemini 2.0 Flash | Google DeepMind | 2024 | Natively multimodal |
| Llama 3.1 405B | Meta AI | 2024 | Open weights, 128K context |
| Mistral Large | Mistral AI | 2024 | European-built, multilingual |
| DeepSeek-V3 | DeepSeek | 2024 | MoE architecture, cost-efficient |
Real-World Applications
Generative AI is reshaping industries at an extraordinary pace — from automating knowledge work to enabling entirely new categories of product.
- **Healthcare & biotech:** Protein structure prediction, molecule generation, clinical trial optimisation.
- **Legal:** Contract analysis, case research, document drafting and review automation.
- **Software engineering:** Code generation, debugging, documentation, test writing, refactoring.
- **Creative & media:** Copywriting, image/video production, game asset generation, storyboarding.
- **Education:** Personalised tutoring, curriculum generation, instant feedback systems.
- **Finance:** Earnings analysis, fraud detection, report generation, risk narratives.
Challenges & Ethics
With remarkable capability comes significant responsibility. The field grapples with fundamental technical and societal challenges.
- **Hallucination:** LLMs confidently produce plausible-sounding but factually incorrect content. Mitigations include RAG, fine-tuning and chain-of-thought verification.
- **Bias & fairness:** Training data reflects societal biases. Models can perpetuate or amplify stereotypes across gender, race, culture and ideology.
- **Alignment & safety:** Ensuring models behave in accordance with human values at scale remains an open research problem. Misuse, misalignment and catastrophic risk all require active mitigation.
- **Environmental cost:** Training frontier models requires significant energy; GPT-3’s training is estimated to have emitted ~552 tonnes CO₂e. Inference at scale compounds this further.
- **Copyright & IP:** Legal uncertainty around training on copyrighted data, model outputs and attribution is actively contested across multiple jurisdictions.
Key Glossary
- **Token:** A subword unit, the atomic element processed by LLMs. “Unbelievable” might tokenise as [“Un”, “believ”, “able”].
- **Embedding:** A dense vector representation of a token in continuous space. Semantically similar tokens cluster together.
- **Attention:** A mechanism that weights the relevance of all context tokens when computing a representation for each position.
- **Temperature:** A sampling parameter (typically 0–2) controlling output randomness. Lower is more deterministic; higher is more creative and chaotic.
- **Context window:** The maximum number of tokens an LLM can “see” in one forward pass. Determines working memory capacity.
- **Fine-tuning:** Continued training of a pre-trained model on a smaller, task-specific dataset to specialise its behaviour.
- **RLHF:** Reinforcement Learning from Human Feedback. Aligns model outputs to human preferences via a reward model.
- **RAG:** Retrieval-Augmented Generation. Grounds LLM responses in retrieved documents to reduce hallucination.
- **Inference:** Running a trained model on new inputs to produce outputs, as opposed to training.
- **Perplexity:** A measure of how well a language model predicts a text sample. Lower perplexity indicates better fit.
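The perplexity definition can be verified with a short sketch: it is the exponential of the average negative log-likelihood per token.

```python
import math

def perplexity(token_probs):
    """exp(average negative log-likelihood) over a sequence of token probabilities."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# Assigning probability 0.25 to every token gives perplexity 4:
# equivalent to guessing uniformly among four options.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```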
Further Reading
- Vaswani et al., “Attention Is All You Need” (2017)
- Brown et al., “Language Models are Few-Shot Learners” (2020)
- Ouyang et al., “Training Language Models to Follow Instructions with Human Feedback” (2022)
- Kaplan et al., “Scaling Laws for Neural Language Models” (2020)

