Managing Context Windows & Token Constraints in Large Language Models — A Complete Practitioner’s Guide

Every large language model reads the world through a porthole — fixed in size, finite in scope. That porthole is the context window: a hard ceiling on how many tokens a model can see at once, spanning your system prompt, the entire conversation history, any retrieved documents, and the response it must generate.

Exceed it and older content silently vanishes. Approach it carelessly and quality degrades, costs spike, and latency balloons. Managing this constraint is not an afterthought — it is a core engineering discipline.

A token is the atomic unit of LLM text — roughly ¾ of an English word on average. A 128 k-token context holds approximately 96,000 words, or a short novel. Sounds generous until you factor in multi-turn history, dense retrieval chunks, and verbose system prompts.

The strategies on this page are drawn from production deployments at scale. They apply whether you are building a simple chatbot or an agentic pipeline that spans dozens of tool calls.

§ 01 — Anatomy

What fills a
context window?

Typical 128 k token budget — production chatbot with RAG

SYS

HISTORY

RETRIEVED DOCS

RESPONSE

free (25%)

System prompt (~8 k)

Chat history (~28 k)

Retrieved docs (~45 k)

Reserved for response (~16 k)

Headroom / buffer

Agent loop — after 12 tool calls

SYS

TOOL CALLS + RESULTS

CTX

RESP

free (20%)

Tool calls accumulate fast — a critical pressure point in long agent runs

§ 02 — Strategies

Six ways to stay within
the window

Sliding Window Truncation

Drop the oldest turns first while always preserving the system prompt and the most recent N exchanges. Simple, deterministic, and zero extra cost. Loses long-term coherence on extended sessions.

Simplest

Summarise-and-Replace

When history fills past a threshold, compress older turns into a concise summary using a fast, cheap model. Inject the summary as a synthetic message. Preserves narrative continuity at the cost of a summarisation call.

Recommended

Retrieval-Augmented Generation

Store knowledge in a vector index. Retrieve only the top-K most relevant chunks per turn instead of dumping an entire knowledge base into the prompt. Keeps the context lean and semantically focused.

Production

Prompt Compression

Use models like LLMLingua or Selective Context to drop low-entropy tokens from long documents before they enter the prompt. Can cut document length 3–5× with minimal quality loss.

Advanced

Hierarchical Memory

Maintain three tiers: in-context (hot), vector store (warm), and structured DB (cold). A memory manager agent decides what to promote or evict. Scales to arbitrarily long sessions.

Agentic

Response Reservation

Always pre-subtract your expected max output tokens from the available budget before filling with inputs. Prevents the model from being truncated mid-response — a silent and embarrassing failure mode.

Essential

§ 03 — Implementation

Token-aware
message trimming

The snippet opposite implements a production-grade context manager in Python. It trims history from the oldest turn inward, always preserving the system prompt and reserving headroom for the model’s response.

Key design decisions baked in:

Uses tiktoken for byte-pair-encoded counting — fast and accurate
Reserves output budget before calculating available input space
Never drops the system prompt regardless of pressure
Logs warnings before the window is critically full

import tiktoken

class ContextManager:
    def __init__(self,
                 model: str = "gpt-4o",
                 max_tokens: int = 128_000,
                 reserve_output: int = 4_096):
        self.enc = tiktoken.encoding_for_model(model)
        self.budget = max_tokens - reserve_output

    def count(self, text: str) -> int:
        return len(self.enc.encode(text))

    def trim(self,
             system: str,
             messages: list) -> list:
        sys_tokens = self.count(system)
        available = self.budget - sys_tokens
        result, used = [], 0

        # Walk history newest-first
        for msg in reversed(messages):
            cost = self.count(msg["content"])
            if used + cost > available:
                break
            result.insert(0, msg)
            used += cost

        return result

    def utilisation(self, system, messages):
        used = sum(
            self.count(m["content"])
            for m in messages
        ) + self.count(system)
        return used / self.budget

§ 04 — Reference

Context windows
across major models

Model	Context Window	Output Limit	Recommended Use
Claude 3.5 Sonnet	200 k tokens	8 192	Long-document analysis, dense RAG pipelines
GPT-4o	128 k tokens	16 384	Extended agent loops, multi-doc reasoning
Gemini 1.5 Pro	1 M tokens	8 192	Full codebase ingestion, video transcripts
GPT-4o mini	128 k tokens	16 384	Cost-sensitive workloads with manageable context
Llama 3.1 70B	128 k tokens	4 096	Open-weight deployments, on-prem fine-tuning
Mistral 7B	32 k tokens	8 192	Edge deployment, latency-critical paths

📏

Count before you call

Always measure token usage before sending a request. Surprises at the API boundary are expensive and unrecoverable in production.

🗜️

Compress aggressively

Whitespace, repeated boilerplate, and verbose system prompts add up. Every saved token is latency and cost recovered at scale.

🧠

Externalise memory

Don’t fight the window — work around it. Vector stores and structured databases are infinitely scalable; context windows are not.

📉

Monitor utilisation

Track context utilisation per request as a metric. Alert at 80%. Hitting 95% regularly indicates an architectural issue, not a tuning one.

Managing Context Windows & Token Constraints in Large Language Models — A Complete Practitioner’s Guide

RAG-Driven Generative AI: Build MAS-RAG with DualRAG, GraphRAG, m…

Create AI Agents with LangChain: Build Intelligent AI Systems wit…

Learn Generative AI and RAG From Scratch: A Practical Guide to Bu…

SAMEZONE Microfiber Cleaning Cloth Roll 25×25 cm – Reusable, Wash…

Managing Context Windows
& Token Constraints

What fills a
context window?

Six ways to stay within
the window

Sliding Window Truncation

Summarise-and-Replace

Retrieval-Augmented Generation

Prompt Compression

Hierarchical Memory

Response Reservation

Token-aware
message trimming

Context windows
across major models

Count before you call

Compress aggressively

Externalise memory

Monitor utilisation

RAG-Driven Generative AI: Build MAS-RAG with DualRAG, GraphRAG, m…

Create AI Agents with LangChain: Build Intelligent AI Systems wit…

Learn Generative AI and RAG From Scratch: A Practical Guide to Bu…

Mastering Graph RAG Pipelines: A Practical Guide to Scalable LLM …

By Somish Saipar

Leave a Reply Cancel reply

You Missed

LLM Fine-Tuning & Optimization: Instruction Tuning, LoRA, RLHF & Prompt Strategies

PEFT, LoRA & QLoRA Explained: The Complete Guide to Efficient LLM Fine-Tuning (2025)

Mastering AI Expertise Through Fine-Tuning

Claude AI API Integration — Build Smarter Apps with the World’s Most Capable AI (2026)

About Us

Follow Us

Latest Posts

LLM Fine-Tuning & Optimization: Instruction Tuning, LoRA, RLHF & Prompt Strategies

PEFT, LoRA & QLoRA Explained: The Complete Guide to Efficient LLM Fine-Tuning (2025)

Mastering AI Expertise Through Fine-Tuning

Claude AI API Integration — Build Smarter Apps with the World’s Most Capable AI (2026)

Feed the algorithm. Can we parallel paths are we in agreeance?

RAG-Driven Generative AI: Build MAS-RAG with DualRAG, GraphRAG, m…

Create AI Agents with LangChain: Build Intelligent AI Systems wit…

Learn Generative AI and RAG From Scratch: A Practical Guide to Bu…

SAMEZONE Microfiber Cleaning Cloth Roll 25×25 cm – Reusable, Wash…

What fills acontext window?

Six ways to stay withinthe window

Sliding Window Truncation

Summarise-and-Replace

Retrieval-Augmented Generation

Prompt Compression

Hierarchical Memory

Response Reservation

Token-awaremessage trimming

Context windowsacross major models

Count before you call

Compress aggressively

Externalise memory

Monitor utilisation

RAG-Driven Generative AI: Build MAS-RAG with DualRAG, GraphRAG, m…

Create AI Agents with LangChain: Build Intelligent AI Systems wit…

Learn Generative AI and RAG From Scratch: A Practical Guide to Bu…

Mastering Graph RAG Pipelines: A Practical Guide to Scalable LLM …

By Somish Saipar

Related Post

Leave a Reply Cancel reply

You Missed

What fills a
context window?

Six ways to stay within
the window

Token-aware
message trimming

Context windows
across major models