Pretrained Language Models Explained GPT, BERT, LLaMA & Claude — The Transformers Shaping Modern AI
Foundation Models · Deep Learning

The Pretrained Models
Shaping Modern AI

A visual guide to GPT, BERT, LLaMA, and Claude — the transformer-based architectures that redefined what language models can do.

🧠
GPT Series
OpenAI · 2018 – present
Autoregressive · Decoder-only · Generative · Causal LM

The Generative Pre-trained Transformer family pioneered large-scale unsupervised pretraining on internet text followed by task-specific fine-tuning. GPT-3 (175B parameters) demonstrated that scale alone unlocks emergent few-shot abilities; later releases added RLHF alignment (InstructGPT) and multimodal reasoning (GPT-4).

175B · GPT-3 parameters
2018 · First release
96 · GPT-3 layers
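The decoder-only design behind GPT comes down to one constraint: during training, position i may only attend to positions ≤ i, so the model can be trained on next-token prediction over a whole sequence at once. A minimal numpy sketch of that causal mask (an illustration, not any model's actual implementation):

```python
import numpy as np

def causal_attention(q, k, v):
    """Single-head self-attention with a causal mask, as in decoder-only
    models like GPT: position i may only attend to positions <= i."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)                      # (t, t) attention logits
    mask = np.triu(np.ones((t, t), dtype=bool), k=1)   # True above the diagonal
    scores[mask] = -np.inf                             # block attention to the future
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)     # row-wise softmax
    return weights @ v, weights

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))          # 5 toy token embeddings, dim 8
out, w = causal_attention(x, x, x)
print(np.allclose(np.triu(w, k=1), 0.0))  # True: no weight on future tokens
```

Because the upper triangle of the attention matrix is forced to zero, every position's prediction depends only on earlier tokens, which is exactly what makes autoregressive generation consistent with training.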
🔍
BERT
Google · 2018
Bidirectional · Encoder-only · MLM · NSP

Bidirectional Encoder Representations from Transformers changed NLP benchmarks overnight. By masking random tokens and training the model to predict them using left and right context simultaneously, BERT produced deeply contextual embeddings ideal for classification, NER, QA, and semantic search.

340M · BERT-Large parameters
2018 · Released
24 · BERT-Large layers
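The masked-LM objective described above can be sketched in a few lines: hide a random subset of tokens and keep the originals as labels only at the hidden positions, so the model must use context from both sides to fill the gaps. A toy sketch (real BERT also applies an 80/10/10 mask/random/keep split, omitted here):

```python
import random

def make_mlm_example(tokens, mask_prob=0.15, seed=1):
    """Build a BERT-style masked-LM training pair: replace a random subset
    of tokens with [MASK]; labels keep the originals only at masked slots."""
    rng = random.Random(seed)
    inputs, labels = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            inputs.append("[MASK]")
            labels.append(tok)      # the model must predict this token
        else:
            inputs.append(tok)
            labels.append(None)     # position ignored by the loss
    return inputs, labels

inp, lab = make_mlm_example("the cat sat on the mat".split())
print(inp)  # → ['[MASK]', 'cat', 'sat', 'on', 'the', 'mat']
```

Since the loss is computed only at masked slots, the encoder is free to attend bidirectionally everywhere else, which is why its embeddings suit classification and retrieval better than generation.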
🦙
LLaMA Series
Meta AI · 2023 – present
Open-weights · Decoder-only · RoPE · GQA

Large Language Model Meta AI democratised foundation-model research by releasing competitive weights publicly. LLaMA 2 added grouped-query attention (GQA) for inference efficiency; LLaMA 3 extended the context window to 128K tokens and trained on over 15T tokens. Its open availability spurred thousands of fine-tunes and derivative models.

405B · Llama 3 max parameters
128K · Context window
15T · Training tokens
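RoPE, mentioned in the badges above, encodes position by rotating each consecutive pair of query/key dimensions through a position-dependent angle; the useful property is that the post-rotation dot product depends only on the *relative* offset between tokens. A minimal numpy sketch of that property (illustrative, not Meta's implementation):

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Apply rotary position embedding (RoPE) to one vector: rotate each
    even/odd dimension pair by an angle proportional to `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # per-pair rotation frequencies
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[0::2], x[1::2]                   # even / odd dimension pairs
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin             # 2-D rotation per pair
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
# Same relative offset (2) at different absolute positions -> same score.
s1 = rope(q, 5) @ rope(k, 3)
s2 = rope(q, 105) @ rope(k, 103)
print(np.isclose(s1, s2))  # True
```

Because attention scores depend only on relative offsets, rotary embeddings extrapolate more gracefully to long contexts than learned absolute positions, which is part of why they pair well with extended context windows.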
Claude Series
Anthropic · 2023 – present
Constitutional AI · RLHF · Safety-first · Long context

Built around Constitutional AI — a method that uses a set of principles to guide self-critique and revision — Claude prioritises helpfulness, harmlessness, and honesty. Claude 3 Opus matched or exceeded GPT-4 on many benchmarks; the Claude 3.5 and 4 families extended multimodal reasoning and tool use.

200K · Context tokens
2023 · First release
CAI · Alignment method
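The critique-and-revise loop at the heart of Constitutional AI can be sketched as control flow. This is a toy illustration only: `critique_fn` and `revise_fn` are hypothetical stand-ins for model calls, not Anthropic's actual training pipeline (which uses the revised outputs as preference data for further training):

```python
def constitutional_revision(draft, principles, critique_fn, revise_fn, rounds=2):
    """Toy sketch of a Constitutional AI-style loop: for each principle,
    critique the current draft, and revise it when a flaw is found."""
    text = draft
    for _ in range(rounds):
        for principle in principles:
            critique = critique_fn(text, principle)
            if critique:                 # only revise when the critique flags something
                text = revise_fn(text, critique)
    return text

# Stub "model": flags drafts containing a rude word, rewrites them politely.
flag = lambda text, p: "rude wording" if "stupid" in text else None
fix = lambda text, c: text.replace("stupid", "mistaken")

out = constitutional_revision("That idea is stupid.", ["be respectful"], flag, fix)
print(out)  # → That idea is mistaken.
```

The key design point the sketch captures is that the principles drive self-critique rather than acting as hard output filters: the model generates, evaluates its own output against each principle, and rewrites.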

Architecture at a Glance

Model  | Architecture                      | Training objective                     | Best for
GPT    | Decoder-only transformer          | Next-token prediction (CLM)            | Open-ended generation, chat, code
BERT   | Encoder-only transformer          | Masked LM + next-sentence prediction   | Classification, NER, semantic search
LLaMA  | Decoder-only (RoPE + GQA)         | Next-token prediction (CLM)            | Open research, fine-tuning, edge deployment
Claude | Decoder-only + Constitutional AI  | RLHF + CAI self-critique               | Long-context reasoning, safe assistants

A Brief History

  • 2017
    Attention Is All You Need — Vaswani et al. introduce the Transformer, replacing recurrent nets with pure self-attention, laying the foundation for every model on this page.
  • 2018
    GPT-1 & BERT — OpenAI’s GPT shows unsupervised pretraining + fine-tuning wins at NLU. Google’s BERT simultaneously proves bidirectional context is king for understanding tasks.
  • 2020
    GPT-3 — 175B parameters and in-context few-shot learning stun the research community. Scale, it turns out, is a feature.
  • 2023
    LLaMA 1 & Claude 1 — Meta opens the weights to researchers; Anthropic ships Constitutional AI-aligned Claude. The open/closed dichotomy defines a new era of LLM competition.
  • 2024 – 25
    Claude 3 / 4, LLaMA 3, GPT-4o — Multimodal reasoning, 128K–1M token contexts, tool use, and real-time voice. The frontier accelerates faster than ever.
