Short-term Memory: Managing Conversational History in Loops
AI Engineering · Memory Systems


How language-model agents maintain context across multi-turn interactions — without losing their minds or blowing their token budgets.

🧠 What Is Short-term Memory?

In the context of LLM-powered agents, short-term memory is the slice of context passed into each model call that represents the ongoing conversation. Unlike human short-term memory, it lives entirely inside the context window — a finite buffer measured in tokens.

Every time the agent runs a new loop iteration — calling tools, receiving results, and reasoning about next steps — the conversational history must be carefully curated, trimmed, and serialized back into the prompt. Get this wrong, and the model hallucinates, contradicts itself, or silently forgets critical instructions.

🔄 The Agentic Loop

Most agentic frameworks run a think → act → observe → repeat cycle. Conversational history threads through every iteration:

💬 User Input → 🤔 LLM Reasons → 🛠️ Tool Call → 📥 Observe Result → 📝 Update History → (repeat)
Key insight: The model has no persistent state between calls. The entire “memory” it has access to is what you explicitly include in the messages array on each request. Nothing more.

📋 Representing History in Code

History is typically stored as an ordered list of message objects. Each message has a role and content. Tool interactions are interleaved as first-class citizens:

agent_loop.py
# History grows with every turn of the loop
history = [
  {"role": "user",      "content": "Summarise last month's sales"},
  {"role": "assistant", "content": [{"type": "tool_use", "name": "query_db", ...}]},
  {"role": "user",      "content": [{"type": "tool_result", "content": "..."}]},
  {"role": "assistant", "content": "Here are your sales figures..."},
]

def run_loop(user_message: str) -> str:
    history.append({"role": "user", "content": user_message})

    while True:
        response = client.messages.create(
            model="claude-opus-4-6",
            messages=history,   # ← entire history sent each call
            tools=TOOLS,
        )
        history.append({"role": "assistant", "content": response.content})

        if response.stop_reason == "end_turn":
            return extract_text(response)

        # handle tool calls, append results, continue loop…

🧭 History Management Strategies

As conversations grow, naïvely appending every message will eventually exhaust the context window. Choose a strategy that fits your use case:

🪟 Sliding Window: Keep the last N turns. Simple and predictable — ideal when recent context is all that matters. Risk: abrupt loss of early instructions.

🗜️ Summarisation: Periodically ask the model to compress older turns into a compact summary, then replace them. Retains semantic meaning at a fraction of the token cost.

📌 Pinned + Pruned: Pin critical messages (system context, key decisions) and prune low-value ones (verbose tool outputs). Surgical and controllable.

🔍 Retrieval-Augmented: Store history in a vector database. At each turn, retrieve the most relevant past messages. Scales to arbitrarily long sessions.
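The sliding-window strategy is simple enough to sketch in full. This is a minimal illustration, not any particular framework's API; the choice to always pin the first message is an assumption that matches the truncation advice later in this article:

```python
def sliding_window(history, max_turns=10):
    """Keep the first message (pinned context) plus the last N messages.

    Note: this trims by message count, not tokens, and does not keep
    tool_use/tool_result pairs together — see the pitfalls section.
    """
    if len(history) <= max_turns + 1:
        return history
    return [history[0]] + history[-max_turns:]
```

Because the window is fixed, token usage stays roughly capped, but anything older than the window (other than the pinned first message) is lost abruptly.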

✂️ Token-aware Truncation

Before sending history to the model, enforce a hard token budget. Always preserve the first message (system context) and the most recent exchanges:

memory.py
def trim_history(history, max_tokens=80_000):
    """Remove the oldest middle messages until we fit the budget."""
    while count_tokens(history) > max_tokens:
        if len(history) <= 3:
            raise MemoryError("Cannot trim further without dropping "
                              "the pinned context or the latest exchange")
        # Remove the oldest non-system message (index 1)
        history.pop(1)
    return history

def summarise_old_turns(history, keep_recent=6):
    """Compress everything between the first message and the latest
    N messages into a single summary message."""
    to_compress = history[1:-keep_recent]   # preserve history[0]
    summary     = call_model("Summarise this conversation concisely: "
                             + str(to_compress))
    # Note: you may need to adjust roles to keep user/assistant alternation
    return [history[0],
            {"role": "user", "content": "[Summary] " + summary}] \
           + history[-keep_recent:]
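The `count_tokens` helper above is left undefined. For exact budgets you would use your provider's tokenizer; as a hedged stand-in, a character-count heuristic (roughly four characters per token for English text, a common rule of thumb) is often good enough for trimming decisions:

```python
import json

def count_tokens(history):
    """Approximate token count for a list of message dicts.

    Serialises the history and assumes ~4 characters per token.
    This is a budgeting heuristic only — swap in a real tokenizer
    when exact counts matter.
    """
    return len(json.dumps(history)) // 4
```

Err on the conservative side: set `max_tokens` comfortably below the model's context limit so a heuristic undercount cannot push a request over it.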

⚖️ Strategy Trade-offs at a Glance

| Strategy        | Token Cost | Coherence | Complexity | Best For              |
|-----------------|------------|-----------|------------|-----------------------|
| Full history    | ⬆ High     | ⭐⭐⭐    | Low        | Short sessions        |
| Sliding window  | Capped     | ⭐⭐      | Low        | Streaming chat        |
| Summarisation   | Medium     | ⭐⭐⭐    | Medium     | Long research tasks   |
| Pinned + pruned | Low        | ⭐⭐⭐    | Medium     | Tool-heavy agents     |
| RAG memory      | Low        | ⭐⭐⭐    | High       | Persistent assistants |

⚠️ Common Pitfalls

Lost system prompt: Naïve window truncation discards the system message — the model forgets its persona, tools, and constraints mid-conversation.
Orphaned tool results: If you trim away an assistant tool-call message but keep the corresponding user tool-result message, the API will reject the malformed turn pair.
Over-summarisation: Aggressive compression loses nuance. If a summary strips the specifics behind “we decided to use PostgreSQL”, the model may later drift back to recommending SQLite.

Always validate that your history array forms a well-interleaved user → assistant → user → … sequence before each API call. Tool-use turns follow the structure: assistant tool_use → user tool_result, and these pairs must never be separated by truncation.
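A minimal validator along those lines might look like the sketch below. It checks role alternation and that every `tool_result` block directly follows an assistant `tool_use` message, using the message shapes shown earlier in this article; real provider APIs may enforce additional rules:

```python
def validate_history(history):
    """Check role alternation and tool_use/tool_result pairing."""
    for i, msg in enumerate(history):
        expected = "user" if i % 2 == 0 else "assistant"
        if msg["role"] != expected:
            raise ValueError(f"Role at index {i} should be {expected!r}")
        content = msg["content"]
        if not isinstance(content, list):
            continue
        for block in content:
            if block.get("type") != "tool_result":
                continue
            # A tool_result must follow an assistant tool_use message
            prev = history[i - 1]["content"] if i > 0 else []
            prev_blocks = prev if isinstance(prev, list) else []
            if not any(b.get("type") == "tool_use" for b in prev_blocks):
                raise ValueError(f"Orphaned tool_result at index {i}")
    return True
```

Running this check just before each API call turns a confusing provider-side 400 error into a precise local exception pointing at the offending index.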

🚀 Quick-reference Checklist

checklist.md
✅  Always include the system prompt / first user message
✅  Count tokens BEFORE sending — not after a 400 error
✅  Keep tool_use / tool_result pairs together when pruning
✅  Summarise periodically for tasks > 20 turns
✅  Log the full history externally for debugging
✅  Test context-loss scenarios: does the agent ask for clarification?
✅  For prod: set a hard max_turns guard to exit infinite loops
❌  Don't embed giant raw tool outputs verbatim — compress first
❌  Don't discard error messages — they carry diagnostic signal
❌  Don't rely on the model "remembering" anything across sessions
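The `max_turns` guard from the checklist takes only a few lines to wire in. In this sketch, `step` is a placeholder for one think → act → observe iteration of your own loop body; the convention that it returns a final string when done (or `None` to keep looping) is an assumption for illustration:

```python
def run_loop_guarded(step, max_turns=25):
    """Run an agentic step function until it finishes or the guard trips.

    `step(turn)` performs one loop iteration and returns the final
    answer string, or None if the agent should keep going.
    """
    for turn in range(max_turns):
        result = step(turn)
        if result is not None:
            return result
    raise RuntimeError(f"Agent exceeded {max_turns} turns without finishing")
```

In production you would typically catch the `RuntimeError`, log the full history for debugging, and surface a graceful "I couldn't complete this task" message to the user.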
