
Building Your First Chatbot with an LLM

A gentle introduction to Large Language Models and Retrieval-Augmented Generation — from zero to a working, knowledge-aware chatbot.

Chapter 01

What is a Large Language Model?

A Large Language Model (LLM) is a neural network trained on vast amounts of text. It learns the statistical patterns of language, enabling it to generate coherent, context-aware responses to almost any prompt.

  • Trained on billions of tokens of text from the web, books, and code
  • Understands and generates natural language
  • Accessed via an API — you send a prompt, receive a completion
  • Popular models: GPT-4, Claude, Gemini, Llama 3
Key insight: The LLM is your engine. Your chatbot is the vehicle built around it — with routing logic, memory, and context management.
Chapter 02

Basic Chatbot Architecture

At its simplest, a chatbot is a loop: receive user input → build a prompt → call the LLM API → return the response. Here’s the minimal Python skeleton:

import anthropic

client = anthropic.Anthropic()
history = []

while True:
    user_msg = input("You: ")
    history.append({"role": "user", "content": user_msg})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    print(f"Bot: {reply}\n")
Conversation history is just a list of messages passed back each time. The model has no memory of its own — you give it context on every call.
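Because the full history is resent on every call, long conversations will eventually exceed the model's context window. The simplest fix is to drop the oldest turns before each call — a minimal sketch (the `MAX_TURNS` value here is illustrative, not a recommendation; summarising old turns, covered later, preserves more information):

```python
# Keep only the most recent turns so the prompt stays bounded.
MAX_TURNS = 10  # illustrative limit: 10 user/assistant pairs

def trim_history(history, max_turns=MAX_TURNS):
    """Drop the oldest messages, keeping at most max_turns * 2 entries
    (each turn is one user message plus one assistant message)."""
    return history[-max_turns * 2:]

# Example: a 50-message history shrinks to the last 20 messages.
history = [{"role": "user", "content": f"msg {i}"} for i in range(50)]
trimmed = trim_history(history)
print(len(trimmed))  # 20
```

You would call `trim_history(history)` just before `client.messages.create(...)` in the loop above.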
Chapter 03

The Problem RAG Solves

LLMs are powerful, but they have a hard knowledge cutoff and know nothing about your private documents, product docs, or recent events. Two naive workarounds both fail at scale:

  • Fine-tuning — expensive, slow to update, and still hallucinates facts
  • Stuffing entire documents into the prompt — context windows are finite and costly

Retrieval-Augmented Generation (RAG) solves this by fetching only the relevant snippets at query time and injecting them into the prompt — giving the model accurate, up-to-date grounding without retraining.

User Query → 🔍 Retriever → Top-k Chunks → LLM + Context → Answer ✅
Chapter 04

How RAG Works — Step by Step

  1. Ingest & Chunk — Split your documents into overlapping chunks (~300–500 tokens each) so retrieval is fine-grained.
  2. Embed — Run each chunk through an embedding model (e.g. text-embedding-3-small) to produce a dense vector representation.
  3. Index — Store vectors in a vector database (Pinecone, Chroma, pgvector, FAISS).
  4. Retrieve — At query time, embed the user’s question and perform a nearest-neighbour search to find the top-k most similar chunks.
  5. Augment — Prepend retrieved chunks to the system prompt as context.
  6. Generate — The LLM now answers grounded in your private knowledge.
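Step 1 fits in a few lines of Python. This toy chunker counts whitespace-separated words as a rough stand-in for real tokenizer tokens, and the sizes are illustrative:

```python
def chunk_text(text, chunk_size=400, overlap=50):
    """Split text into overlapping chunks.

    chunk_size and overlap are measured in whitespace-separated words
    here — a simplification; production code would count tokenizer tokens.
    """
    words = text.split()
    step = chunk_size - overlap  # advance by chunk_size minus the overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the final chunk already reaches the end of the text
    return chunks

doc = " ".join(f"w{i}" for i in range(1000))
pieces = chunk_text(doc)
print(len(pieces))  # 3
```

The overlap means the tail of each chunk is repeated at the head of the next, so a sentence falling on a boundary is still retrievable as a whole.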
Chapter 05

Minimal RAG in Python

Using ChromaDB (local vector store) + sentence-transformers for embeddings and Claude for generation:

import chromadb, anthropic
from sentence_transformers import SentenceTransformer

# 1. Setup
embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.Client()
coll = db.create_collection("docs")

# 2. Index your chunks
chunks = [
    "RAG stands for Retrieval-Augmented Generation...",
    "Embeddings map text to high-dimensional vectors...",
]
coll.add(
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    ids=[str(i) for i in range(len(chunks))],
)

# 3. Query + generate
query = "What is RAG?"
q_emb = embedder.encode([query]).tolist()
hits = coll.query(query_embeddings=q_emb, n_results=2)
ctx = "\n\n".join(hits["documents"][0])

client = anthropic.Anthropic()
answer = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=f"Answer using this context:\n{ctx}",
    messages=[{"role": "user", "content": query}],
)
print(answer.content[0].text)
Pro tip: Add metadata filters (e.g. date range, document type) to your vector query for more precise retrieval in production systems.
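Conceptually, a metadata filter narrows the candidate set *before* similarity ranking. Here is a library-agnostic sketch in pure Python — the toy 2-D vectors, the `doc_type` field, and the tiny index are all invented for illustration; real vector databases expose equivalent filter arguments in their query APIs:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

# Toy index: each entry has an id, an embedding, and metadata.
index = [
    {"id": "1", "vec": [1.0, 0.0], "meta": {"doc_type": "faq"}},
    {"id": "2", "vec": [0.9, 0.1], "meta": {"doc_type": "manual"}},
    {"id": "3", "vec": [0.0, 1.0], "meta": {"doc_type": "faq"}},
]

def filtered_query(q_vec, where, k=1):
    """Keep only entries matching every metadata condition, then rank by similarity."""
    candidates = [
        d for d in index
        if all(d["meta"].get(key) == val for key, val in where.items())
    ]
    candidates.sort(key=lambda d: cosine(d["vec"], q_vec), reverse=True)
    return [d["id"] for d in candidates[:k]]

print(filtered_query([1.0, 0.0], {"doc_type": "faq"}))  # ['1']
```

Note that doc "2" is the second-most-similar vector overall, but the filter excludes it before ranking ever happens — that is exactly the precision gain the pro tip describes.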
What’s Next

Level Up Your Chatbot

Once the basics are working, these improvements make a real-world difference:

  • Hybrid search — combine dense (vector) + sparse (BM25) retrieval for better recall
  • Reranking — use a cross-encoder to reorder retrieved chunks before passing to the LLM
  • Streaming — stream tokens back to the UI for a snappier experience
  • Conversation memory — summarise old turns to stay within context limits
  • Evaluation — use frameworks like RAGAS to measure faithfulness and answer relevance
  • Guardrails — add input/output safety layers before deploying publicly
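The first item above, hybrid search, often reduces to fusing two ranked lists — one from the vector index, one from BM25. A common recipe is Reciprocal Rank Fusion (RRF); this sketch assumes you already have both rankings (the document IDs are made up):

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank).

    k=60 is the constant from the original RRF paper; it dampens the
    influence of any single list's top ranks.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d2", "d1", "d3"]   # ranking from the vector index
sparse = ["d2", "d4", "d1"]  # ranking from BM25
print(rrf([dense, sparse]))  # ['d2', 'd1', 'd4', 'd3']
```

Documents that appear high in both lists ("d2" here) dominate, while documents found by only one retriever still surface — which is where the recall improvement comes from.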
Built with curiosity · LLMs + RAG · 2026
