Memory, Compounding, and Persistence
One of the most powerful levers for improving your agentic systems
*This piece is part of How to Design an Agentic System, a series on the design decisions required to specify an internal agentic system.*
You built an agentic system. It works. But it works the same today as it did three months ago. Every session starts from scratch. The foundation model gets smarter (that’s the vendor’s job), but your system doesn’t learn anything from its own experience. The gap between a tool and a colleague is memory. And the gap between a static system and one that improves over time is compounding.
The Compounding Loop
Two months ago, my research agent Elise cited a study without checking the sample size. I corrected her. She wrote the correction into her working memory. She hasn’t made that mistake since.
That’s the compounding loop: learn from a failure, persist the lesson, apply it to future work. Three steps, and each one is load-bearing. Skip “learn” and the agent repeats mistakes. Skip “persist” and the lesson evaporates when the session ends. Skip “apply” and you have a journal, not a learning system.
Over time, these corrections accumulate into something that looks a lot like judgment. Not general intelligence. Domain-specific judgment: knowing what to check, what to flag, what to skip, what matters to this particular team in this particular context.
Three Layers
Compounding happens at three levels, and the most effective systems invest in all three simultaneously.
Agent-level. An individual agent learns from its own mistakes and preferences. Elise learns to check sample sizes. Morgan learns how Peter likes status updates formatted. These are personal corrections that make a single agent better at its specific role.
System-level. The factory itself improves. Pipeline metrics identify failure patterns. Workflows get tightened. If agents consistently fail at step 4 of a seven-step pipeline, the system can add a verification gate at step 3. This is organizational learning, not individual learning.
Human-level. The people using the system get better at specifying intent. You learn what the agent needs to hear. You learn which instructions produce good output and which produce garbage. This layer is often overlooked, yet it can be the fastest to compound. A human who learns to write clear specifications gets better results from any model.
Implementing Memory Practically
Memory is only useful if it’s managed. Too little and the agent repeats the same mistakes every session. Too much and it drowns in stale context, spending tokens on corrections that no longer apply to problems it solved weeks ago.
The practical pattern most production systems converge on: write everything during a session. At the end, trim to essentials. Update what’s changed, remove what’s obsolete, compress what’s redundant. Treat memory like code. It needs refactoring.
What goes in: corrections, preferences, patterns, working context the next version of the agent will need. Things the agent learned that aren’t documented elsewhere. Observations that don’t fit in the task database or the architecture docs.
What gets trimmed: resolved issues, stale details, entries that duplicate information already in the codebase, context that was useful for one session but won’t matter next time.
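The end-of-session trim pass can be sketched in a few lines. This is a minimal illustration, not a real memory API: the `MemoryEntry` fields and the keep/drop rules are assumptions standing in for whatever criteria your system uses.

```python
# Sketch of an end-of-session memory compaction pass.
# Entry fields and the keep/drop rules are illustrative, not a real API.
from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class MemoryEntry:
    text: str
    kind: str                # "correction", "preference", "pattern", "context"
    created: datetime
    resolved: bool = False   # e.g. an issue that has since been fixed

def compact(entries: list[MemoryEntry], now: datetime,
            stale_after: timedelta = timedelta(days=30)) -> list[MemoryEntry]:
    """Keep corrections, preferences, and patterns; drop resolved issues
    and session-scoped context that has gone stale."""
    kept = []
    for e in entries:
        if e.resolved:
            continue          # resolved issues get trimmed
        if e.kind == "context" and now - e.created > stale_after:
            continue          # stale working context gets trimmed
        kept.append(e)
    return kept
```

The point of the sketch is the shape of the loop, not the rules themselves: write freely during the session, then run one deliberate pass that decides what the next session actually needs.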
Gas Town’s Beads system handles this with semantic decay: completed tasks get summarized with the reasoning preserved, then archived. The stale details go away. The lessons stay. This is the right instinct. Memory should preserve intent and judgment while shedding implementation details that won’t be relevant next session.
Manus, the Singapore-based agent platform, found that typical tasks require around 50 tool calls spanning hundreds of conversational turns. They refactored their context architecture five times in the first year. That’s what it looks like when a team takes memory management seriously. It’s not a one-time design decision. It’s an ongoing engineering practice.
Where State Lives
Different kinds of state need different storage. Forcing everything into a single substrate means accepting the wrong access patterns for most of it.
Git for configuration, pipeline definitions, agent persona files, and anything that benefits from versioning and auditability. When you need to see what changed, when it changed, and who changed it, git is the right answer. OpenAI’s Harness Engineering team stored plans as first-class git artifacts. Ephemeral plans for small changes, complete specifications for complex work. The version history became institutional memory.
Relational database (Postgres) for structured operational data: tasks, contacts, events, metrics, interaction logs. When you need to query, filter, aggregate, and join, this is where it belongs. If you’re storing tasks as markdown files, you’ll regret it the moment you need to answer “what’s overdue across all agents?”
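The "what's overdue across all agents?" question is exactly what a relational store answers in one query. A rough sketch, with SQLite standing in for Postgres and a hypothetical `tasks` schema:

```python
# Hypothetical tasks table; SQLite stands in for Postgres here.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE tasks (
        agent    TEXT,
        title    TEXT,
        due_date TEXT,   -- ISO date
        status   TEXT    -- 'open' or 'done'
    )
""")
conn.executemany(
    "INSERT INTO tasks VALUES (?, ?, ?, ?)",
    [
        ("elise",  "verify citations", "2025-01-10", "open"),
        ("morgan", "status update",    "2025-02-01", "open"),
        ("elise",  "archive notes",    "2025-01-05", "done"),
    ],
)

# One query answers the question across every agent -- the kind of
# aggregation a pile of markdown files can't do.
overdue = conn.execute(
    "SELECT agent, title FROM tasks "
    "WHERE status = 'open' AND due_date < ? ORDER BY due_date",
    ("2025-01-15",),
).fetchall()
```

The schema here is invented for illustration; the design point is that operational questions become single queries instead of file-parsing scripts.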
Vector stores for semantic similarity. When you need to find things that are conceptually close, not just textually matching. Useful for retrieval-augmented generation, for surfacing related past work, for finding “we solved something like this before” connections.
Graph databases when traversal patterns matter. Dependency chains, relationship networks, organizational hierarchies. If your primary access pattern is “follow the connections,” a graph is the right fit.
Versioned databases (Dolt) when you need SQL with git semantics. Branch-per-agent isolation, time-travel debugging, cell-level diffs. Gas Town runs 160 agents on a single host using Dolt for coordination state. The upside: agents can branch, experiment, and merge results just like code. Rollback is trivial if an agent corrupts its state. The downside: it’s another database to operate, and most teams won’t need branch-merge semantics until they’re running agents in parallel at scale. Add it when environment isolation becomes a real problem, not before.
Your system will likely use several of these. The choice isn’t “where do I store things.” It’s “what properties does this state need?” Durability, queryability, versioning, semantic proximity: each points to a different substrate.
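The "what properties does this state need?" question can be made concrete as a decision sketch. The mapping below is a heuristic summarizing the substrates above, not a prescription:

```python
# Heuristic routing from state properties to storage substrate.
# The priority order is a judgment call, not a rule.
def pick_substrate(needs_versioning: bool, needs_queries: bool,
                   needs_similarity: bool, needs_traversal: bool) -> str:
    if needs_versioning and needs_queries:
        return "versioned database (e.g. Dolt)"   # SQL with git semantics
    if needs_versioning:
        return "git"                              # auditability, history
    if needs_traversal:
        return "graph database"                   # follow the connections
    if needs_similarity:
        return "vector store"                     # conceptually close, not textually matching
    if needs_queries:
        return "relational database"              # filter, aggregate, join
    return "flat files"
```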
Context management has economic weight too. Spotify reduced token costs from $1.00 to $0.25 per call by managing what goes into context more carefully. When your system runs thousands of tasks, the difference between loading everything and loading what’s relevant is a significant cost line.
Drift
Here’s the arithmetic problem with accumulated learning. If each correction has 95% fidelity to the original intent (a reasonable assumption for any learning-from-feedback system), then after 100 accumulated corrections your system retains 0.95^100 ≈ 0.6% of its original fidelity. That’s not a theoretical risk. It’s math.
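The arithmetic is worth running once, because the half-life is shorter than intuition suggests:

```python
# Per-step fidelity compounds multiplicatively, so high per-correction
# fidelity still erodes fast.
import math

fidelity_per_correction = 0.95
corrections = 100

remaining = fidelity_per_correction ** corrections   # about 0.006, i.e. 0.6%

# How many corrections before fidelity drops below 50%?
half_life = math.log(0.5) / math.log(fidelity_per_correction)   # about 13.5
```

At 95% per-step fidelity, the system is below half of its original intent after only 14 corrections. That is why drift needs scheduled detection rather than occasional suspicion.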
Drift is subtle because each individual step looks fine. The agent’s behavior this week is only slightly different from last week. But compare this month to three months ago, and the gap can be significant. The system has been optimizing for local feedback without maintaining global coherence.
Detection requires periodic comparison against baseline criteria. What did you originally want this agent to do? Does it still do that? Regular audits of memory state, comparing current behavior against original specifications, will catch drift before it compounds.
Correction means human review of the memory itself. Not just the agent’s output, but the accumulated instructions, preferences, and corrections that shape its behavior. Prune entries that have drifted from intent. Re-anchor to first principles. This is maintenance, and it needs to be scheduled, not reactive.
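One way to make the baseline comparison operational is to keep a fixed set of baseline tasks and replay them through the current agent on a schedule. A minimal sketch, where the agent callable and the similarity scorer are hypothetical stand-ins for whatever your system uses:

```python
# Periodic drift audit: replay baseline prompts and flag answers that
# have moved too far from the original spec. `agent` and `score` are
# hypothetical stand-ins (score returns a similarity in [0, 1]).
def audit_drift(agent, baselines, score, threshold=0.8):
    """Return (prompt, expected, actual) triples whose similarity to
    the baseline answer falls below the threshold."""
    flagged = []
    for prompt, expected in baselines:
        actual = agent(prompt)
        if score(expected, actual) < threshold:
            flagged.append((prompt, expected, actual))
    return flagged
```

The flagged triples are what goes to human review: they localize where accumulated corrections have pulled behavior away from the original intent.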
Risks
Memory makes your system smarter. It also creates new failure modes.
Overfitting. The agent over-indexes on recent corrections. You told it once that a particular formatting choice was wrong, and now it avoids that format in every context, including ones where it was fine. The correction was local; the agent applied it globally.
Contamination. Bad data enters memory and persists. An incorrect correction, a misunderstood preference, a factual error that gets treated as ground truth. Without provenance tracking, contaminated memory is hard to detect and harder to clean.
Provenance decay. The agent “knows” something but can’t trace where it learned it. Was this a correction from the user? An inference from a pattern? A fact from a source that turned out to be unreliable? Without provenance, you can’t audit why the agent behaves the way it does.
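The fix for provenance decay is cheap at write time and impossible to retrofit: tag every memory entry with where it came from. A minimal sketch, with illustrative field names:

```python
# Provenance-tagged memory entry. Field names and source categories
# are illustrative, not a standard schema.
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class Memory:
    claim: str
    source: str      # "user_correction", "inference", "external_doc", ...
    origin: str      # who/what produced it: a user id, URL, or session id
    recorded: datetime

def quarantine(entries, untrusted=frozenset({"inference", "external_doc"})):
    """Split memory into trusted and needs-review piles by provenance --
    the audit that becomes impossible once provenance has decayed."""
    trusted = [e for e in entries if e.source not in untrusted]
    review = [e for e in entries if e.source in untrusted]
    return trusted, review
```

With provenance in place, contamination and injection cleanup become queries ("show me everything learned from external content in that session") instead of archaeology.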
Memory as an attack surface. Palo Alto Networks demonstrated that indirect prompt injection can poison an agent’s long-term memory through a link in an email. The agent includes attacker instructions in its session summary, then silently exfiltrates user data in every subsequent session. No visible malicious behavior during the initial attack. The memory becomes the persistence mechanism for the exploit. This connects directly to the broader security question covered in the prompt injection piece.
Each of these risks is manageable. None of them is solved by ignoring memory entirely. The right response is to build memory deliberately: with trimming, with provenance, with periodic human review, and with awareness that memory is both an asset and an attack surface.

