Short-term Memory: Managing Conversational History in Loops
How language-model agents maintain context across multi-turn interactions — without losing their minds or blowing their token budgets.
🧠 What Is Short-term Memory?
In the context of LLM-powered agents, short-term memory is the slice of context passed into each model call that represents the ongoing conversation. Unlike human short-term memory, it lives entirely inside the context window — a finite buffer measured in tokens.
Every time the agent runs a new loop iteration — calling tools, receiving results, and reasoning about next steps — the conversational history must be carefully curated, trimmed, and serialized back into the prompt. Get this wrong, and the model hallucinates, contradicts itself, or silently forgets critical instructions.
🔄 The Agentic Loop
Most agentic frameworks run a think → act → observe → repeat cycle, and conversational history threads through every iteration.
📋 Representing History in Code
History is typically stored as an ordered list of message objects. Each message has a role and content. Tool interactions are interleaved as first-class citizens:
```python
# History grows with every turn of the loop
history = [
    {"role": "user", "content": "Summarise last month's sales"},
    {"role": "assistant", "content": [{"type": "tool_use", "name": "query_db", ...}]},
    {"role": "user", "content": [{"type": "tool_result", "content": "..."}]},
    {"role": "assistant", "content": "Here are your sales figures..."},
]

def run_loop(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})
    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            messages=history,  # ← entire history sent each call
            tools=TOOLS,
        )
        history.append({"role": "assistant", "content": response.content})
        if response.stop_reason == "end_turn":
            return extract_text(response)
        # handle tool calls, append results, continue loop…
```
⚡ History Management Strategies
As conversations grow, naïvely appending every message will eventually exhaust the context window. Choose a strategy that fits your use-case:
✂️ Token-aware Truncation
Before sending history to the model, enforce a hard token budget. Always preserve the first message (system context) and the most recent exchanges:
```python
def trim_history(history, max_tokens=80_000):
    """Remove oldest middle messages until we fit the budget."""
    while count_tokens(history) > max_tokens:
        if len(history) <= 3:
            raise MemoryError("Single message exceeds token budget")
        # Remove the oldest non-system message (index 1)
        history.pop(1)
    return history

def summarise_old_turns(history, keep_recent=6):
    """Compress all-but-the-latest N turns into one summary message."""
    to_compress = history[:-keep_recent]
    summary = call_model("Summarise this conversation concisely: " + str(to_compress))
    return [{"role": "user", "content": "[Summary] " + summary}] \
        + history[-keep_recent:]
```
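To see the truncation behave end-to-end, here is a self-contained sketch. The `count_tokens` helper is an assumption for the demo — a crude four-characters-per-token heuristic, where a real implementation would call the provider's tokenizer:

```python
def count_tokens(history):
    """Rough heuristic: ~4 characters per token.
    Assumption for this demo; use the provider's tokenizer in practice."""
    return sum(len(str(m)) for m in history) // 4

def trim_history(history, max_tokens=80_000):
    """Remove oldest middle messages until we fit the budget."""
    while count_tokens(history) > max_tokens:
        if len(history) <= 3:
            raise MemoryError("Single message exceeds token budget")
        history.pop(1)  # oldest non-pinned message
    return history

# Ten ~100-token user turns squeezed into a 500-token budget
history = [{"role": "user", "content": f"{i:02d}" + "x" * 400} for i in range(10)]
trimmed = trim_history(history, max_tokens=500)
# The pinned first message and the three newest messages survive
```

Note that the first message is never popped, so pinned system context survives however aggressive the budget.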
⚖️ Strategy Trade-offs at a Glance
| Strategy | Token Cost | Coherence | Complexity | Best For |
|---|---|---|---|---|
| Full history | ⬆ High | ⭐⭐⭐ | Low | Short sessions |
| Sliding window | Capped | ⭐⭐ | Low | Streaming chat |
| Summarisation | Medium | ⭐⭐⭐ | Medium | Long research tasks |
| Pinned + pruned | Low | ⭐⭐⭐ | Medium | Tool-heavy agents |
| RAG memory | Low | ⭐⭐⭐ | High | Persistent assistants |
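The sliding-window row above takes only a few lines to implement. This sketch (names are illustrative, not from any particular framework) pins the first message and keeps the most recent `window` messages:

```python
def sliding_window(history, window=8):
    """Keep the pinned first message plus the `window` most recent messages."""
    if len(history) <= window + 1:
        return history
    return [history[0]] + history[-window:]

history = [{"role": "user", "content": f"turn {i}"} for i in range(20)]
windowed = sliding_window(history, window=4)
# windowed holds turn 0 plus turns 16–19
```

The appeal is the hard cap: token cost is bounded regardless of session length, at the price of forgetting everything between the pinned message and the window.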
⚠️ Common Pitfalls
If truncation drops an assistant tool-call message but keeps the corresponding user tool-result message, the API will reject the malformed turn pair.
Always validate that your history array forms a well-interleaved user → assistant → user → … sequence before each API call. Tool-use turns follow the structure: assistant tool_use → user tool_result, and these pairs must never be separated by truncation.
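A sketch of that pre-call validation, assuming the message shape used in the earlier examples (`tool_use` blocks in assistant content, answered by `tool_result` blocks in the next user message):

```python
def validate_history(history):
    """Raise if roles don't alternate user/assistant, or if a tool_use
    block is not answered by a tool_result in the following message."""
    for i, msg in enumerate(history):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError(f"message {i}: expected {expected!r}, got {msg['role']!r}")
        blocks = msg["content"] if isinstance(msg["content"], list) else []
        if any(b.get("type") == "tool_use" for b in blocks):
            nxt = history[i + 1]["content"] if i + 1 < len(history) else []
            nxt_blocks = nxt if isinstance(nxt, list) else []
            if not any(b.get("type") == "tool_result" for b in nxt_blocks):
                raise ValueError(f"message {i}: tool_use with no following tool_result")
    return True
```

Running this before every API call turns a confusing 400 response into a precise local error naming the offending index.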
🚀 Quick-reference Checklist
✅ Always include the system prompt / first user message
✅ Count tokens BEFORE sending — not after a 400 error
✅ Keep tool_use / tool_result pairs together when pruning
✅ Summarise periodically for tasks > 20 turns
✅ Log the full history externally for debugging
✅ Test context-loss scenarios: does the agent ask for clarification?
✅ For prod: set a hard max_turns guard to exit infinite loops
❌ Don't embed giant raw tool outputs verbatim — compress first
❌ Don't discard error messages — they carry diagnostic signal
❌ Don't rely on the model "remembering" anything across sessions
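The max_turns guard from the checklist can be as small as the sketch below. `step` is a placeholder for one model call plus tool handling, and the ceiling of 25 is an assumed value to tune per workload:

```python
def run_guarded(step, user_message, max_turns=25):
    """Call `step` (a stand-in for one model call + tool handling)
    until it signals end_turn, aborting after `max_turns` iterations."""
    history = [{"role": "user", "content": user_message}]
    for _ in range(max_turns):
        response = step(history)
        history.append({"role": "assistant", "content": response["text"]})
        if response["stop_reason"] == "end_turn":
            return response["text"]
        history.append({"role": "user", "content": "[tool results]"})
    raise RuntimeError(f"agent exceeded {max_turns} turns without finishing")
```

Raising rather than silently returning makes runaway loops visible in monitoring instead of burning tokens indefinitely.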

