The Pretrained Models Shaping Modern AI
A visual guide to GPT, BERT, LLaMA, and Claude — the transformer-based architectures that redefined what language models can do.
The Generative Pre-trained Transformer family pioneered large-scale unsupervised pretraining on internet text followed by task-specific fine-tuning. GPT-3 (175 B parameters) demonstrated that scale alone unlocks emergent few-shot abilities; RLHF alignment arrived with InstructGPT and ChatGPT, and GPT-4 added multimodal reasoning.
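The training objective behind the whole GPT line is next-token prediction: at each position the model sees only earlier tokens and is scored by cross-entropy against the true next token. A minimal sketch of that loss, using a toy stand-in for the model (the `logits_fn` callable and the uniform predictor are illustrative, not any real GPT):

```python
import math

def causal_lm_loss(token_ids, logits_fn):
    """Average next-token cross-entropy: position t predicts token t+1,
    conditioned only on tokens 0..t (the causal constraint)."""
    total = 0.0
    for t in range(len(token_ids) - 1):
        context = token_ids[: t + 1]      # no peeking at future tokens
        logits = logits_fn(context)       # one score per vocab entry
        # log-sum-exp softmax normalizer, numerically stable
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[token_ids[t + 1]]
    return total / (len(token_ids) - 1)

# Toy "model": uniform logits over a 4-token vocabulary.
uniform = lambda ctx: [0.0, 0.0, 0.0, 0.0]
loss = causal_lm_loss([0, 1, 2, 3], uniform)
# A uniform predictor scores exactly ln(4) ≈ 1.386 nats per token.
```

Anything that lowers this loss — more data, more parameters, longer training — is what the scaling results on this page are measuring.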
Bidirectional Encoder Representations from Transformers changed NLP benchmarks overnight. By masking random tokens and training the model to predict them using left and right context simultaneously, BERT produced deeply contextual embeddings ideal for classification, named-entity recognition (NER), question answering (QA), and semantic search.
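The masking step can be sketched concretely. In the original BERT recipe, about 15% of positions become prediction targets; of those, 80% are replaced with `[MASK]`, 10% with a random token, and 10% are left unchanged so the model cannot rely on always seeing `[MASK]`. A minimal version (the `MASK_ID` value 103 matches the original BERT WordPiece vocab; -100 is the common "ignore this position" label convention):

```python
import random

MASK_ID = 103  # id of [MASK] in the original BERT vocabulary

def mask_tokens(token_ids, vocab_size, mask_prob=0.15, seed=0):
    """BERT-style masking: ~mask_prob of positions become targets;
    80% -> [MASK], 10% -> random token, 10% -> kept as-is."""
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(token_ids)        # -100 = not a prediction target
    for i, tok in enumerate(token_ids):
        if rng.random() < mask_prob:
            labels[i] = tok                 # the model must recover this token
            r = rng.random()
            if r < 0.8:
                inputs[i] = MASK_ID
            elif r < 0.9:
                inputs[i] = rng.randrange(vocab_size)
            # else: leave the original token in place
    return inputs, labels
```

The loss is then computed only at the labeled positions, using both left and right context — the bidirectional part.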
Large Language Model Meta AI democratised foundation-model research by releasing competitive weights publicly. LLaMA 2 added grouped-query attention for efficiency; the LLaMA 3 family trained on over 15 T tokens, with LLaMA 3.1 extending context to 128 K tokens. Its open availability spurred thousands of fine-tunes and derivative models.
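Grouped-query attention's efficiency win comes from a simple idea: many query heads share one key/value head, shrinking the KV cache. The index math is just integer division, sketched below (the 8-query/2-KV numbers are illustrative, not LLaMA's actual head counts):

```python
def gqa_kv_head(query_head, n_query_heads, n_kv_heads):
    """Grouped-query attention: query heads are split into n_kv_heads
    groups, and every head in a group reads the same key/value head."""
    group_size = n_query_heads // n_kv_heads   # query heads per KV head
    return query_head // group_size

# e.g. 8 query heads sharing 2 KV heads -> two groups of 4
mapping = [gqa_kv_head(q, 8, 2) for q in range(8)]
# mapping == [0, 0, 0, 0, 1, 1, 1, 1]
```

With `n_kv_heads = n_query_heads` this reduces to standard multi-head attention, and with `n_kv_heads = 1` to multi-query attention — GQA sits between the two.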
Built around Constitutional AI — a method that uses a set of principles to guide self-critique and revision — Claude prioritises helpfulness, harmlessness, and honesty. Claude 3 Opus matched or exceeded GPT-4 on many benchmarks; the Claude 3.5 and 4 families extended multimodal reasoning and tool use.
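The self-critique-and-revision idea can be sketched as a loop, though the real Constitutional AI pipeline is more involved (the revised outputs also feed preference training). Everything below is a hypothetical illustration: `model` is a placeholder callable (prompt in, text out), not Anthropic's API, and the prompt wording is invented:

```python
def constitutional_revision(model, draft, principles, rounds=1):
    """Sketch of a CAI-style loop: for each principle, ask the model to
    critique the current answer against it, then rewrite accordingly."""
    answer = draft
    for _ in range(rounds):
        for principle in principles:
            critique = model(
                f"Critique this reply against the principle: {principle}\n"
                f"Reply: {answer}"
            )
            answer = model(
                f"Rewrite the reply to address the critique.\n"
                f"Critique: {critique}\nReply: {answer}"
            )
    return answer
```

The point of the design is that the principles, not human labels on every example, steer the revisions — which is what lets the constitution scale.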
Architecture at a Glance
| Model | Architecture | Training objective | Best for |
|---|---|---|---|
| GPT | Decoder-only transformer | Next-token prediction (CLM) | Open-ended generation, chat, code |
| BERT | Encoder-only transformer | Masked LM + Next sentence pred. | Classification, NER, semantic search |
| LLaMA | Decoder-only (RoPE + GQA) | Next-token prediction (CLM) | Open research, fine-tuning, edge deploy |
| Claude | Decoder-only + Constitutional AI | RLHF + CAI self-critique | Long-context reasoning, safe assistants |
A Brief History
- 2017: Attention Is All You Need — Vaswani et al. introduce the Transformer, replacing recurrent nets with pure self-attention and laying the foundation for every model on this page.
- 2018: GPT-1 & BERT — OpenAI's GPT shows unsupervised pretraining + fine-tuning wins at NLU. Google's BERT simultaneously proves bidirectional context is king for understanding tasks.
- 2020: GPT-3 — 175 B parameters and in-context few-shot learning stun the research community. Scale, it turns out, is a feature.
- 2023: LLaMA 1 & Claude 1 — Meta opens the weights to researchers; Anthropic ships Constitutional AI-aligned Claude. The open/closed dichotomy defines a new era of LLM competition.
- 2024–25: Claude 3 / 4, LLaMA 3, GPT-4o — Multimodal reasoning, 128 K–1 M token contexts, tool use, and real-time voice. The frontier accelerates faster than ever.

