Your AI Agent Forgets and Degrades — Engineering Context Across Time

Two Ways the Same Thing Breaks

I've watched an AI agent fail the same project in two completely different ways on the same afternoon.

The first failure: I asked it to wire up a feature, and it rebuilt an abstraction we had explicitly torn out the week before. Not a hallucination — a sincere, confident rebuild of a thing we had already decided was wrong. It had no memory that the decision was ever made. The conversation where we made it had ended, and with it, the reasoning.

The second failure happened inside a single long session. Early on, the agent was sharp. It tracked the file layout, respected the conventions, made clean edits. Four hours and a few hundred thousand tokens later, it was a different collaborator: missing constraints I'd stated explicitly, re-asking questions I'd already answered, contradicting an edit it had made twenty minutes earlier. Nothing had changed about the model. The only thing that had changed was how full its context was.

Two failures, one root cause: context. An AI agent is bounded by what it can hold in its working context at the moment it acts. That context fails in two directions across two timescales. It rots within a session — as the context fills, the quality of what the model does with it degrades. And it evaporates between sessions — when the conversation ends, nothing it learned persists.

The fix is not two fixes. It's one discipline applied across both timescales: engineer the context. Isolate and compact what's in the window within a session; persist the high-signal residue to durable memory across sessions. This post is about both halves and the architecture that ties them together.

Context Rot: Why a Fuller Window Is a Worse Window

The intuition most people carry is that a bigger context window is strictly better. You paid for 200K tokens, so use them — dump the whole codebase, the whole conversation history, the whole spec. The model will sort it out.

It will not sort it out as well as you think. The most rigorous public evidence here is Chroma's 2025 study, "Context Rot: How Increasing Input Tokens Impacts LLM Performance." They evaluated 18 frontier models — including Claude Opus 4, GPT-4.1, Gemini 2.5 Pro, and Qwen3 — and found, in their words, that "model performance degrades as input length increases, often in surprising and non-uniform ways." Every model they tested got less reliable as the input grew, even on tasks that were trivially easy at short lengths. This is not the model hitting its token limit and overflowing. The degradation shows up well inside the advertised window — a model with a 200K window can wobble at a fraction of that.

Two things make this counterintuitive, and both are worth understanding.

NIAH gives false confidence. The benchmark that convinced everyone long context was "solved" is needle-in-a-haystack: bury a sentence in a huge document, ask the model to find it. Models ace it, and the takeaway became long context works. But as Chroma notes, NIAH "typically assesses direct lexical matching, which may not be representative of flexible, semantically oriented tasks." When they varied the semantic similarity between the question and the buried target, the picture changed sharply: "performance degrades more quickly in input length with lower similarity needle-question pairs." Real work is almost never lexical lookup. It's the harder, fuzzier kind of retrieval NIAH doesn't measure — which is exactly where rot bites.

Position matters, and the middle is the weak spot. This is the older "lost in the middle" finding from Liu et al. (2023), and it has held up: models attend well to information at the very beginning and the very end of their context and poorly to what's stranded in the middle. Put the critical constraint in paragraph 40 of a 90-paragraph dump and you've put it in the model's blind spot. It's the same primacy-and-recency U-curve you see in human memory, which is a little unsettling.

There's a mechanical reason underneath all of this. Attention is quadratic — every token attends to every other token — so the cost grows quadratically as the window fills. You can think of it as a fixed attention budget spreading across an ever-larger set of pairwise relationships: more tokens, thinner slice each. That's an intuition pump, not a proof, but it points the right way.

Two honest caveats so nobody quotes me out of context. First, this is reliability degrading with length, not a hard cliff — performance trends down, it doesn't fall off a ledge at some magic number. Second, the magnitude is task- and model-dependent; Chroma found no single model that resisted rot best across the board. So I won't hang a precise percentage on it. The directional finding is what's robust and what you should design around: the enemy is context fill, not context size. Filling the window is the failure mode. The size of the window just tells you how much rope you have.

Fighting Rot Within the Session

If fill is the enemy, the discipline is to keep the window lean and high-signal at the moment the model acts. A few moves do most of the work.

Isolation — give each task a fresh budget. The single highest-leverage move is to stop running everything in one ever-growing conversation. Spin up a fresh subagent per task. Each one starts with a clean context budget, sees only what that task needs, does the work, and returns a compact result to the parent. The 300K-token main thread never has to hold the 80K tokens of file-reading and dead ends a single subtask burned through — it gets the answer, not the slog. This is exactly the orchestration pattern I describe in verifying AI-generated code: a clean-context agent that did the implementation is a poor reviewer of its own work, but it's also a fresh budget for the next job. Isolation buys you both independence and rot resistance in one move.

Compaction. When a session genuinely has to run long, summarize the history into a dense state rather than carrying the full transcript. Twenty exchanges of exploration collapse into five bullet points of conclusions. You trade the verbatim record — which the model was attending to worse and worse anyway — for a high-signal digest it can actually use.

The smallest high-signal set. Before adding anything to context, ask whether it earns its tokens. Most of the time the answer is to add less. Which leads to the rule I break the most and regret the most:

Never dump a whole file when an excerpt will do. Reading a 1,200-line file into context to use one 15-line function is the most common self-inflicted rot I see. You've spent the tokens, you've pushed the actually-relevant lines toward the lossy middle, and you've diluted every subsequent decision. Read the function. Read the type signature. Read the section. Reach for the whole file only when you genuinely need the whole file.

Deliberate ordering. Since the middle is the blind spot, put what matters at the edges. The task and the hard constraints go at the top or the bottom of the prompt, not buried in a wall of pasted context. Order is a free lever and almost nobody pulls it.

Every one of these is the same instinct: curate, don't accumulate.

Memory: Fighting Evaporation Across Sessions

Isolation and compaction keep a single session healthy. They do nothing for the second failure — the agent that rebuilt the abstraction we'd torn out. When a session ends, its context is gone. Tomorrow's agent starts from zero, and "we already decided this is wrong" is not written down anywhere it will look.

The fix is durable, file-based memory. Not a vector database, not a fine-tune — a directory of small markdown files the agent reads, writes, and curates. Here's the architecture I actually run, and it's deliberately boring, because boring is what survives.

One fact per file. Each memory is a small markdown file holding a single durable fact. Not a journal, not a log — one idea per file, so it can be recalled, updated, or retired without dragging unrelated context along with it.

Frontmatter that makes recall cheap. Every file carries a name (a kebab-case slug), a one-line description that states the fact tightly enough to judge its relevance at a glance, and a type. The type tells the agent what kind of memory it is:

user — who the person is, how they like to work.
feedback — how the agent should work: corrections and confirmed-good approaches, captured with the why so the rule survives the situation that produced it.
project — ongoing work, decisions, and constraints that aren't derivable from reading the code.
reference — pointers to external resources, repos, and docs.

A single feedback memory looks like this:

---
name: dump-whole-file-causes-rot
description: Reading an entire file to use one function pollutes context and degrades later decisions — excerpt instead.
type: feedback
---
 
When you only need one function, read that function — not the whole file.
Loading a 1,200-line file to use 15 lines pushes the relevant lines toward
the lossy middle of the window and dilutes every decision that follows.
Reach for the full file only when the task genuinely needs the full file.
 
Related: [[smallest-high-signal-context-set]]

An index loaded every session. A MEMORY.md file holds one line per memory — a title plus a short hook — and that is what loads into context at the start of every session. Not every memory file. Just the index:

# Memory Index
 
- [Ref: contrast-budget carry-forwards](ref_contrast_budget.md) — 30+ numbered measured floors; any scene-brightening change must re-verify against them.
- [Feedback: dump-whole-file-causes-rot](feedback_dump_whole_file.md) — excerpt, don't dump; whole-file reads pollute context.
- [Project: ledger-app](project_ledger_app.md) — practice-management SaaS; current sprint + constraints not in code.

The index is the trick that makes the whole thing tractable. The agent knows what it knows without paying to load it all. Recall is by relevance: the agent reads the descriptions, decides which two or three memories actually bear on the task at hand, and loads only those. Dumping every memory file into context every session would itself be a self-inflicted dose of context rot — the index is the defense against that, not just an organizational nicety.

Links between memories. A memory can reference another with a [[other-memory-name]] link, so related facts connect into a small graph instead of sitting as isolated notes. When you pull one, you can see what it touches.

Carry-forwards as institutional memory. The highest-value entries are the ones earned the hard way: numbered, durable lessons captured the moment something bit you. The canonical example lives in my WCAG contrast budget in CI work — that project accumulated 30-plus numbered carry-forwards, each one a measured floor the next change has to respect ("this wire-tight margin is 3.11:1; any change that brightens the scene must re-verify against it first"). These aren't suggestions. They're the institutional memory the agent enforces across sessions, so a lesson paid for once in June isn't re-learned the hard way in September.

It's One Discipline, Not Two

Step back and the two halves are obviously the same move.

Within a session, you curate the smallest high-signal context: isolate, compact, excerpt, order. Across sessions, you persist what compounds: one fact per file, typed, indexed, recalled by relevance. The within-session instinct — only let in what earns its place — is identical to the across-session instinct — only write down what will matter later, and only load it when it's relevant. Both are the refusal to confuse more context with better context.

This is also why the industry converged on a durable project-context file. AGENTS.md (and Anthropic's CLAUDE.md, which serves the same role for Claude Code) has become the de-facto standard place to write the project context an agent should load every time — build commands, conventions, hard boundaries, the stuff that isn't derivable from the code. By 2026 a shared agent-instructions file in this format is read across a wide range of coding agents — Claude Code, OpenAI's Codex CLI, Cursor, Aider, Gemini CLI, GitHub Copilot, Windsurf, and a growing list of others. It is the simplest possible instance of the across-sessions half of this discipline: write the durable context down once, in a predictable place, so every future session starts pre-loaded with it instead of rediscovering it. The memory architecture above is what you graduate to when one file stops being enough.

The Payoff: A Teammate Instead of a Tool

Here is what this buys you, and it's the whole reason to bother.

Without memory, every session is day one. The agent is a brilliant contractor with amnesia — fast, capable, and condemned to relearn your project from scratch every single morning. You spend the first hour of every session re-explaining the constraints, re-litigating the decisions, watching it rebuild the thing you tore out last week. It never gets to know your project, because it never gets to keep knowing it.

With the discipline in place, the curve bends the other way. Each session deposits its hard-won lessons into durable memory; each new session starts already aware of them. The carry-forwards accumulate. The feedback memories sharpen. A constraint discovered in week two is still enforced in month four without anyone re-typing it. The agent stops resetting and starts compounding — month three is genuinely smarter than month one, because the residue of every session before it is still in play. That accumulation is a big part of what I mean when I write about shipping SaaS 10x faster: the speed isn't just raw generation, it's not paying the re-onboarding tax every session.

That's the line between a tool and a teammate. A tool does what you point it at and forgets. A teammate remembers why you made the call, holds the line on the constraint, and gets more useful the longer you work together. Memory — engineered context across time — is what moves an agent across that line.

Engineer the Context, Both Directions

An AI agent's effectiveness is bounded by its context, and that context fails two ways: it rots as the window fills within a session, and it evaporates entirely between sessions. The answer to both is the same discipline run across two timescales — isolate and compact what's live in the window now, persist the high-signal residue to durable, indexed, relevance-recalled memory for later. Do the first and your sessions stay sharp to the end. Do the second and your project compounds instead of resetting. Do both and the agent stops being a clever stranger every morning and starts being someone who knows the work.

If you're building with AI agents and want to talk through context engineering, memory architecture, or shipping production software with this kind of discipline, get in touch.