Building Your First Chatbot with an LLM
A gentle introduction to Large Language Models and Retrieval-Augmented Generation — from zero to a working, knowledge-aware chatbot.
What is a Large Language Model?
A Large Language Model (LLM) is a neural network trained on vast amounts of text. It learns the statistical patterns of language, enabling it to generate coherent, context-aware responses to almost any prompt.
- Trained on billions of tokens of text from the web, books, and code
- Understands and generates natural language
- Accessed via an API — you send a prompt, receive a completion
- Popular models: GPT-4, Claude, Gemini, Llama 3
Basic Chatbot Architecture
At its simplest, a chatbot is a loop: receive user input → build a prompt → call the LLM API → return the response. Here’s the minimal Python skeleton:
import anthropic
client = anthropic.Anthropic()
history = []
while True:
    user_msg = input("You: ")
    history.append({"role": "user", "content": user_msg})
    response = client.messages.create(
        model="claude-sonnet-4-20250514",
        max_tokens=1024,
        system="You are a helpful assistant.",
        messages=history,
    )
    reply = response.content[0].text
    history.append({"role": "assistant", "content": reply})
    print(f"Bot: {reply}\n")
The Problem RAG Solves
LLMs are powerful, but they have a hard knowledge cutoff and know nothing about your private documents, product docs, or recent events. Two naive workarounds both fail at scale:
- Fine-tuning — expensive, slow to update, and still hallucinates facts
- Stuffing entire documents into the prompt — context windows are finite and costly
Retrieval-Augmented Generation (RAG) solves this by fetching only the relevant snippets at query time and injecting them into the prompt — giving the model accurate, up-to-date grounding without retraining.
How RAG Works — Step by Step
- Ingest & Chunk — Split your documents into overlapping chunks (~300–500 tokens each) so retrieval is fine-grained.
- Embed — Run each chunk through an embedding model (e.g. text-embedding-3-small) to produce a dense vector representation.
- Index — Store vectors in a vector database (Pinecone, Chroma, pgvector, FAISS).
- Retrieve — At query time, embed the user’s question and perform a nearest-neighbour search to find the top-k most similar chunks.
- Augment — Prepend retrieved chunks to the system prompt as context.
- Generate — The LLM now answers grounded in your private knowledge.
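The chunking step above can be sketched with a simple word-based splitter. This is a rough stand-in for true token counting (a tokenizer such as tiktoken would be more precise), and the chunk_size and overlap values are illustrative defaults, not tuned recommendations:

```python
def chunk_text(text: str, chunk_size: int = 400, overlap: int = 50) -> list[str]:
    """Split text into overlapping chunks.

    Word counts approximate token counts here; overlap must be
    smaller than chunk_size so the window always advances.
    """
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + chunk_size])
        if chunk:
            chunks.append(chunk)
        if start + chunk_size >= len(words):
            break  # last window already covers the tail
    return chunks
```

The overlap means a sentence falling on a chunk boundary still appears whole in at least one chunk, which keeps retrieval from missing facts that straddle two chunks.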
Minimal RAG in Python
Using ChromaDB (local vector store) + sentence-transformers for embeddings and Claude for generation:
import chromadb, anthropic
from sentence_transformers import SentenceTransformer
# 1. Setup
embedder = SentenceTransformer("all-MiniLM-L6-v2")
db = chromadb.Client()
coll = db.create_collection("docs")
# 2. Index your chunks
chunks = [
    "RAG stands for Retrieval-Augmented Generation...",
    "Embeddings map text to high-dimensional vectors...",
]
coll.add(
    documents=chunks,
    embeddings=embedder.encode(chunks).tolist(),
    ids=[str(i) for i in range(len(chunks))],
)
# 3. Query + generate
query = "What is RAG?"
q_emb = embedder.encode([query]).tolist()
hits = coll.query(query_embeddings=q_emb, n_results=2)
ctx = "\n\n".join(hits["documents"][0])
client = anthropic.Anthropic()
answer = client.messages.create(
    model="claude-sonnet-4-20250514",
    max_tokens=512,
    system=f"Answer using this context:\n{ctx}",
    messages=[{"role": "user", "content": query}],
)
print(answer.content[0].text)
Level Up Your Chatbot
Once the basics are working, these improvements make a real-world difference:
- Hybrid search — combine dense (vector) + sparse (BM25) retrieval for better recall
- Reranking — use a cross-encoder to reorder retrieved chunks before passing to the LLM
- Streaming — stream tokens back to the UI for a snappier experience
- Conversation memory — summarise old turns to stay within context limits
- Evaluation — use frameworks like RAGAS to measure faithfulness and answer relevance
- Guardrails — add input/output safety layers before deploying publicly
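As a concrete example of the hybrid-search idea, results from a dense (vector) retriever and a sparse (BM25) retriever can be merged with reciprocal rank fusion. This is a minimal sketch; the constant k=60 is the value commonly used in the literature, not a tuned choice:

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked lists of document ids into one.

    Each document scores 1 / (k + rank) per list it appears in,
    so items ranked highly by multiple retrievers rise to the top.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

Because RRF only uses ranks, not raw scores, it sidesteps the problem that cosine similarities and BM25 scores live on incomparable scales.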