Curate Before You Store

A deep dive on the Memory and Knowledge plane. Why memory is where most agentic deployments quietly fail, and what to build instead.

This is the fourth piece in a series on enterprise multi-agent architecture. The flagship laid out five planes and a Trust Fabric. The second piece went deep on the Agent plane and the discipline of harness engineering. The third piece went deep on the Model plane and the AI gateway. This piece goes inside the Memory and Knowledge plane and stays there.

The eight weeks before anyone noticed

A category leader at a Fortune 500 procurement organization opens her queue on a Monday morning. The Renewal Supervisor has assembled a recommended action on a 2.3 million dollar enterprise contract: terminate the relationship, switch to a competing vendor. The justification is detailed. The Risk Reassessment agent cites the incumbent’s “ongoing security incident” and “deteriorating financial position per recent SEC filings.”

She approves. The vendor is notified. The migration kicks off.

Eight weeks later she learns the security incident was resolved fourteen months ago, and the SEC filings the agent referenced were from 2024, not “recent.” The incumbent’s reputation has been damaged without cause. They are entitled to compensation. She gets to face them across a table.

The model worked correctly. The harness ran exactly the loop it was designed to run. The tool calls returned exactly what the tools were configured to return. The failure happened in the memory plane: an episodic record that aged out of relevance without aging out of retrieval, a semantic memory entry that was never tagged with a temporal validity, a knowledge graph edge that pointed to financial data without a timestamp.

This is the failure mode that does not crash anything. The agent is confident. The reasoning is articulate. The action is wrong. By the time anyone notices, weeks have passed.

In the field, this is the category of failure that takes the longest to detect, costs the most to remediate, and burns the most trust between the system and the people who depend on it. It is also the category of failure that almost every production agent deployment ships with, because memory is the part of the architecture that most teams treat as an afterthought.

This piece is about why that has to change, and what changes it.

The thesis

Memory is where most agentic deployments quietly fail.

Production agents need four distinct memory types. They are not interchangeable. They differ in write authority, retention policy, retrieval pattern, and governance. Treating all four as a single vector database is the architectural equivalent of putting every table in your relational database into one denormalized blob and wondering why nothing scales.

A serious Memory and Knowledge plane has four properties. It is typed, because working, episodic, semantic, and procedural memory are not the same thing and should not be stored the same way. It is curated, because copying chaos into storage produces an agent that confidently cites whatever someone wrote in a Slack thread three quarters ago and forgot about. It is hybrid in retrieval, because vector search is necessary for unstructured fuzziness and embarrassingly bad at multi-hop reasoning about entities, relationships, and time. And it is governed, with authority tagging on every write, a forgetting path on every memory, and access controls enforced at retrieval time, not after.

Curate before you store. Build the forgetting path. Hybrid retrieval is the production answer. Memory is the long-term character of the system.

The remainder of this piece defends each of those claims.

What changed in 2025 and 2026

The Memory and Knowledge plane went from afterthought to first-class architectural concern in roughly eighteen months. A few inflection points are worth naming because they reshape what serious looks like.

The collapse of the “RAG is dead” narrative. Through late 2025, a meaningful share of the practitioner discourse argued that long context windows would make dedicated retrieval unnecessary. By May 2026, the VentureBeat Pulse enterprise survey data made the position untenable. Respondents identifying long-context-as-dominant-architecture collapsed from 15.5 percent in January to 3.5 percent in February before partially recovering to 6.7 percent in March. The enterprise market answered the question. Retrieval is not going away. It is becoming hybrid.

The CoALA framework as the consensus taxonomy. Sumers et al. at Princeton and CMU published Cognitive Architectures for Language Agents (arXiv:2309.02427) in 2023. By 2026 it is the canonical academic reference for the four memory types. Mem0, Letta (formerly MemGPT), LangChain, Zep, and LlamaIndex all use it as their taxonomy foundation. The cognitive science roots run deeper. Tulving distinguished episodic from semantic memory in 1972. Squire added procedural memory in 1987. Baddeley and Hitch formalized working memory in 1974. The taxonomy is not new. Its application as production architecture is.

Dedicated memory benchmarks. BEAM (Beyond a Million Tokens) emerged through 2026 as the industry-standard methodology for long-horizon memory evaluation. It scales to ten million tokens across one hundred procedurally generated multi-turn conversations and tests ten distinct memory dimensions including contradiction resolution, event ordering, instruction following across time, and preference tracking. The previous benchmark (LoCoMo) was found in a 2026 audit to contain score-corrupting errors in 6.4 percent of its ground-truth answer key. Memory evaluation finally has rigorous tooling.

Foundational research arriving. March 2026 brought the Governed Memory paper (arXiv:2603.17787), which formalized five structural failures of ungoverned multi-agent memory: memory silos, governance fragmentation, unstructured memories unusable by downstream systems, redundant context delivery, and silent quality degradation. The AgeMem framework (arXiv:2601.01885) defined a six-tool action space for memory operations: ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, FILTER. The SSGM paper (arXiv:2603.11768) introduced a Stability and Safety Governed Memory framework. February’s “Rethinking Memory Mechanisms of Foundation Agents” survey (arXiv:2602.06052) consolidated three years of research into one reference. The field caught up to itself.

Microsoft GraphRAG reaching production maturity. Open-sourced in 2024, GraphRAG hit version 1.0 in 2026 with substantially better cost characteristics through the LazyGraphRAG optimization. Production deployments report up to 35 percent precision improvement over vector-only retrieval when knowledge graphs are integrated. Healthcare, finance, and legal sectors have been the early adopters, because each one has problems where the answer is a relationship rather than a document.

The Snowflake ontology result. Snowflake published internal research showing that adding an ontology layer to their agents produced a 20 percent improvement in answer accuracy and a 39 percent reduction in tool calls. The result is consequential because it is enterprise-scale, internally validated, and concrete. The ontology is entity identity mapping across systems, which is precisely the kind of structured knowledge that flat vector search cannot represent and that knowledge graphs handle natively.

Six inflection points. One direction. The Memory and Knowledge plane is now a designed system, not a database choice. Architectures that treated it as a database choice in 2024 are the architectures hitting the scale wall in 2026.

The four memory types

The CoALA taxonomy is the foundation. Production agents need four distinct memory types, designed deliberately.

[Figure: Anatomy of the Memory and Knowledge plane: four memory types, hybrid retrieval, governance across all]

Working memory. What the agent is processing right now. It lives in the context window and the active scratchpad. It is volatile by design. It clears between tasks. Working memory’s failure mode is overflow. As context fills, attention degrades, instructions buried in the middle get ignored, and the agent silently starts losing the thread. The harness piece covered the compaction patterns that mitigate this.

Episodic memory. What the agent has done and what happened. Traces, decisions, tool calls, outcomes. It is the audit trail. It is also the source of learning. Episodic memory needs to be tiered (hot, warm, archive) because not every interaction needs to be retrievable at the same latency forever. Its failure mode is the opposite of working memory’s: not overflow but staleness. An episodic record from fourteen months ago is sometimes the most relevant fact in the world and sometimes a hazard. The discipline is to know which.

Semantic memory. Facts and certified knowledge. The procurement playbook. The vendor commercial appetites. The customer coverage profile. Semantic memory is where most of the “curate before storage” discipline lives because it is the most consequential category and the easiest to pollute. Its failure mode is contamination: stale, low-quality, or hostile content makes its way in, and the agent confidently cites it forever.

Procedural memory. How to do things. Workflows. Heuristics. Playbooks. Procedural memory should be loaded by relevance to the current task, not by default, because loading too many playbooks consumes the context budget the working memory needs. Its failure mode is brittleness: a playbook that worked last quarter no longer matches the current process, and the agent applies it confidently anyway.

The most common failure I see is treating all four types as a single vector database. They differ in nearly every architectural dimension. Write authority differs (working memory writes on every step; semantic memory should require explicit curation). Retention policy differs (working memory clears immediately; episodic memory tiers; semantic memory is governed). Retrieval pattern differs (working memory is read at every reasoning step; semantic memory is read on retrieval; procedural memory is read on task initiation). Governance differs (working memory needs almost none; semantic memory needs the most). Designing all four with the same primitives produces a system in which the wrong content gets retrieved at the wrong time, repeatedly, and the agent appears confidently broken.
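One way to keep the four types from collapsing into one is to declare each type's policy explicitly instead of leaving it implicit in a shared index. The sketch below is illustrative only. The `MemoryPolicy` fields mirror the four dimensions named above, but the field names and several values (for example, who writes episodic and procedural memory) are my assumptions, not a prescribed schema.

```python
from dataclasses import dataclass
from enum import Enum


class MemoryType(Enum):
    WORKING = "working"
    EPISODIC = "episodic"
    SEMANTIC = "semantic"
    PROCEDURAL = "procedural"


@dataclass(frozen=True)
class MemoryPolicy:
    """Hypothetical per-type policy record; field names are illustrative."""
    write_authority: str  # who or what may write
    retention: str        # how entries leave the store
    read_trigger: str     # when the agent reads this store
    governance: str       # how much review a write needs


# Values paraphrase the text where it is specific; the rest are assumptions.
POLICIES = {
    MemoryType.WORKING: MemoryPolicy(
        "every reasoning step", "cleared at task end",
        "every reasoning step", "minimal"),
    MemoryType.EPISODIC: MemoryPolicy(
        "harness, after each action (assumed)", "tiered: hot, warm, archive",
        "on retrieval", "TTL and audit"),
    MemoryType.SEMANTIC: MemoryPolicy(
        "explicit curation only", "governed, with temporal validity",
        "on retrieval", "highest"),
    MemoryType.PROCEDURAL: MemoryPolicy(
        "playbook owners (assumed)", "revised when a playbook fails",
        "task initiation", "change review (assumed)"),
}


def policy_for(memory_type: MemoryType) -> MemoryPolicy:
    """Look up the policy; a KeyError means a type was added without one."""
    return POLICIES[memory_type]
```

The point of the table is less the values than the fact that a store cannot be created without answering all four questions.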

Hybrid retrieval as the production answer

Vector search is necessary. It is also not sufficient. The 2026 production consensus is hybrid retrieval, and the reason is mechanical.

Vector embeddings excel at semantic similarity. “Show me a passage about pricing changes” works well. They are embarrassingly bad at multi-hop structural reasoning. “Which suppliers serve competitors who recently entered our market?” requires traversing relationships across multiple entities. The vector database does not represent relationships. The knowledge graph does. The two technologies are not alternatives. They are complements.

The production pattern that has consolidated through 2026 has three components running in parallel, with the retrieval layer fusing the results.

Vector for fuzziness. Dense embeddings, semantic similarity, the standard RAG building block. Excellent for unstructured retrieval, document-oriented search, “find me content like this.” Pinecone, Weaviate, Milvus, Qdrant, ChromaDB are the established options. The standalone vector database category has been under pressure through 2026 as enterprises move toward hybrid, but the underlying capability remains essential.

Knowledge graph for structure. Property graphs that explicitly model entities and their relationships. Microsoft’s GraphRAG (open-sourced 2024, version 1.0 in 2026) and the LazyGraphRAG cost optimization that followed have brought knowledge graph adoption inside the cost envelope most enterprises can justify. The published precision improvement over vector-only retrieval runs up to 35 percent in domains where multi-hop reasoning matters, which is most enterprise domains. The Snowflake ontology result (20 percent accuracy improvement, 39 percent tool call reduction) is the cleanest single proof point.

Lexical search and reranking. BM25 keyword matching for exact-string requirements that semantic search misses. A cross-encoder reranker that re-scores the top candidates from vector and graph retrieval before they enter the prompt. Hybrid indexing with BM25 plus dense embeddings produces 15 to 30 percent precision improvements across enterprise deployments. The reranker is the unsung step. Most teams skip it and pay for it later.

The architectural move is to run all three in parallel and fuse the results. Reciprocal Rank Fusion (RRF) is the standard merge algorithm and works well enough for most production traffic. The retrieval layer’s job is not to pick the right method per query. It is to run the right methods in parallel and rank what comes back. The agent then reasons over a curated, ranked, multi-source result set.
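For readers who have not implemented it, Reciprocal Rank Fusion is small enough to show in full. Each document scores the sum of 1 / (k + rank) across the lists it appears in, and k = 60 is the commonly used default. The document ids and the three hit lists below are made up for illustration, and this is a minimal sketch rather than a production retrieval layer.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with rank starting at 1.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id so the output is deterministic.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Hypothetical results from the three retrievers, best hit first.
vector_hits = ["doc_a", "doc_b", "doc_c"]
graph_hits = ["doc_b", "doc_d"]
lexical_hits = ["doc_b", "doc_a", "doc_e"]

fused = reciprocal_rank_fusion([vector_hits, graph_hits, lexical_hits])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c', 'doc_e']
```

`doc_b` wins because all three retrievers returned it, even though only two ranked it first. That is the behavior you want from the fusion step: agreement across methods counts for more than a single high rank. A cross-encoder reranker would then re-score the top of this fused list.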

This is what serious looks like in 2026. Not “RAG with a vector database.” Not “agentic memory replaces retrieval.” Hybrid retrieval over four typed memory stores, with the agent orchestrating both the retrieval and the stores.

The taxonomy of memory failures

Six failure modes I have seen consistently across deployments, my own and others’:

Memory contamination. Stale, low-quality, or hostile content makes its way into semantic or episodic memory and the agent confidently cites it forever. The 2026 Gamage study of 4,416 trials across six conversation depths quantified the downstream effect: constraint compliance dropped from 73 percent at turn 5 to 33 percent at turn 16 without memory mitigation. By turn 16, the agent is violating its constraints roughly two and a half times as often as it was at turn 5 (67 percent of the time, up from 27 percent). It does not know this is happening. It keeps running confidently.

Stale memory poisoning. A specific case of contamination worth naming separately. A tool returns data that was correct at retrieval time but becomes stale across the session. The agent integrates the data as fact, and every subsequent reasoning step builds on a premise that has moved underneath it. The opening scenario was a textbook case. The fix requires temporal validity on stored memories and active staleness detection at retrieval, not passive time-based decay alone.
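A minimal sketch of what temporal validity at retrieval can look like, assuming each entry carries an optional `valid_until` date. The `MemoryEntry` shape, the status labels, and the dates are hypothetical. The one design choice worth copying is that an entry with no validity is surfaced as unknown, not silently treated as fresh.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional


@dataclass
class MemoryEntry:
    text: str
    observed_on: date                   # when the fact was recorded
    valid_until: Optional[date] = None  # None means validity was never set


def annotate_staleness(entry: MemoryEntry, today: date) -> dict:
    """Attach an explicit freshness status before the entry reaches the prompt."""
    if entry.valid_until is None:
        status = "unknown_validity"  # a warning, not a pass
    elif today > entry.valid_until:
        status = "stale"
    else:
        status = "current"
    return {
        "text": entry.text,
        "observed_on": entry.observed_on.isoformat(),
        "status": status,
    }


# Illustrative dates only, loosely modeled on the opening scenario.
today = date(2026, 6, 1)
incident = MemoryEntry("Vendor has an ongoing security incident",
                       observed_on=date(2025, 1, 10),
                       valid_until=date(2025, 4, 10))
filing = MemoryEntry("Deteriorating financial position per SEC filings",
                     observed_on=date(2024, 11, 1))

incident_view = annotate_staleness(incident, today)
filing_view = annotate_staleness(filing, today)
```

Here `incident_view["status"]` is `"stale"` and `filing_view["status"]` is `"unknown_validity"`. Either label in the prompt would have given the agent, and the reviewer, a reason to pause.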

Confident wrong citation. The agent retrieves the wrong document or the wrong record and cites it persuasively. The retrieval looks fine on standard metrics like NDCG or recall at k. The cited content semantically matched the query. It just was not the right content. This is the failure mode in regulated domains: clinical trial data from 2022 retrieved for a query that needed the 2025 safety profile, executive compensation documents retrieved because they semantically matched a benefits question.

Access control leakage. The retrieval layer returns results that semantically match the query without considering whether the requesting user is authorized to see them. Industry analysis suggests this risk is present in at least 73 percent of production RAG implementations. The most common incident: employees receiving context from board minutes or executive compensation documents because the retrieval layer ignored the access controls that the underlying document management system enforced. The fix is retrieval-native access control: permission predicates embedded in the retrieval query, not applied as a post-filter.
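The difference between a permission predicate and a post-filter is easiest to see side by side. The sketch below uses a toy in-memory index with precomputed scores standing in for similarity. The `Doc` shape, the group names, and the document ids are all hypothetical, and a real deployment would push the predicate into the vector or graph store's own query filter.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Doc:
    doc_id: str
    score: float               # stand-in for a similarity score
    allowed_groups: frozenset  # groups permitted to read this document


INDEX = [
    Doc("board_minutes", 0.95, frozenset({"board"})),
    Doc("exec_comp_plan", 0.90, frozenset({"hr_exec"})),
    Doc("benefits_guide", 0.80, frozenset({"all_staff"})),
    Doc("benefits_faq", 0.70, frozenset({"all_staff"})),
]


def search_with_predicate(user_groups, top_k):
    """Retrieval-native: restricted documents never become candidates."""
    permitted = [d for d in INDEX if d.allowed_groups & user_groups]
    return sorted(permitted, key=lambda d: -d.score)[:top_k]


def search_then_post_filter(user_groups, top_k):
    """Anti-pattern: rank everything, then remove what the user cannot see."""
    top = sorted(INDEX, key=lambda d: -d.score)[:top_k]
    return [d for d in top if d.allowed_groups & user_groups]


staff = frozenset({"all_staff"})
native = [d.doc_id for d in search_with_predicate(staff, top_k=2)]
post = [d.doc_id for d in search_then_post_filter(staff, top_k=2)]
print(native)  # ['benefits_guide', 'benefits_faq']
print(post)    # []
```

The post-filter version fails twice. The restricted documents occupied the top-k slots, so the user gets nothing useful back, and those documents passed through the ranking stage where any logging, caching, or reranking step could have exposed them.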

Cross-session identity drift. The agent loses track of who the user is across sessions. A returning customer is treated as new. A canceled vendor is treated as active. Identity in memory is a hard, open problem in 2026, and the Mem0 State of AI Agent Memory report names it as one of the three hardest unresolved problems alongside temporal abstraction at scale and memory staleness.

The five governance failures from the Governed Memory paper. Memory silos (each agent maintains its own memory, none can read another’s). Governance fragmentation (no consistent policy on what gets stored, who can read, when memory is forgotten). Unstructured memories unusable by downstream systems (stored in formats that subsequent agents cannot consume). Redundant context delivery (the same content retrieved repeatedly, paying for it every time). Silent quality degradation (memory quality decays without any signal that it is decaying).

These six are not exotic. They are the ordinary ways memory breaks. A serious memory engineering practice means you have explicit detection and mitigation for each.

The discipline that prevents most of the failures above is the discipline of curating before storage.

The principle is simple to state and hard to maintain. The write path to semantic memory is governed. Not every document, not every tool response, not every agent observation flows into long-term memory by default. Each candidate is tagged for authority before storage. Each is tagged for temporal validity. Each is tagged for the tenant or scope that owns it. Each is reviewed against the access controls of its source. What enters semantic memory is what your organization considers true, current, and authorized to share with the agent.
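A governed write path can be as plain as a gate that refuses any candidate missing one of those four tags. This is a sketch under my own assumptions: the `Candidate` fields, the `Authority` levels, and the rejection messages are hypothetical, and a real gate would also record who made the decision.

```python
from dataclasses import dataclass
from datetime import date
from enum import Enum
from typing import Optional, Tuple


class Authority(Enum):
    POLICY = "policy"
    STANDARD = "standard"
    OPINION = "opinion"
    DRAFT = "draft"


# Only these levels may enter semantic memory in this sketch.
ADMISSIBLE = {Authority.POLICY, Authority.STANDARD}


@dataclass
class Candidate:
    text: str
    authority: Optional[Authority]
    valid_until: Optional[date]
    tenant: Optional[str]
    source_acl: Optional[frozenset]  # access controls copied from the source


def admit(candidate: Candidate) -> Tuple[bool, str]:
    """Return (admitted, reason). Every check must pass before storage."""
    if candidate.authority not in ADMISSIBLE:
        return False, "authority not admissible"
    if candidate.valid_until is None:
        return False, "missing temporal validity"
    if not candidate.tenant:
        return False, "missing tenant scope"
    if candidate.source_acl is None:
        return False, "source access controls not captured"
    return True, "admitted"


policy_doc = Candidate("Vendor risk policy v3", Authority.POLICY,
                       date(2027, 1, 1), "procurement",
                       frozenset({"procurement"}))
slack_thread = Candidate("Thread debating the policy", Authority.OPINION,
                         None, "procurement", frozenset({"procurement"}))

policy_result = admit(policy_doc)
slack_result = admit(slack_thread)
```

`policy_result` is `(True, "admitted")` and `slack_result` is `(False, "authority not admissible")`. The default is rejection: a candidate with no authority tag at all fails the first check.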

Three operational rules:

Tag by authority level. Policy and Standard go in. Opinion and draft do not. The same source system can produce content at different authority levels. The procurement policy database is authoritative. A Slack thread debating the policy is not, even if both technically describe the policy. The agent should know the difference, which means the memory plane should know the difference, which means the write path should tag the difference.

Build the forgetting path. Most teams forget to. Every memory should have a temporal validity, a freshness signal, and an explicit prune mechanism. TTL on episodic records. Decay on low-relevance content. Active staleness detection on high-relevance content (the open problem). The forgetting path is not a cleanup job that runs once a quarter. It is a primary memory operation that fires continuously.

Use a structured action space. The AgeMem framework defines six memory operations: ADD inserts new entries, UPDATE modifies existing ones, DELETE actively prunes stale or redundant knowledge, RETRIEVE pulls relevant content, SUMMARY consolidates, and FILTER manages the boundaries of working context. The architectural move is to treat memory operations as a first-class API, not as a side effect of model inference. Every write is intentional. Every delete is intentional. Every update preserves audit history.
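To make "first-class API" concrete, here is a toy store that exposes the six operations as explicit methods and appends an audit record for each. Only the operation names come from the AgeMem description above. The `MemoryStore` class, its method signatures, and the substring-match retrieval are my own simplifications, not the framework's implementation.

```python
from dataclasses import dataclass, field
from enum import Enum


class Op(Enum):
    ADD = "ADD"
    UPDATE = "UPDATE"
    DELETE = "DELETE"
    RETRIEVE = "RETRIEVE"
    SUMMARY = "SUMMARY"
    FILTER = "FILTER"


@dataclass
class MemoryStore:
    entries: dict = field(default_factory=dict)  # key -> current text
    audit: list = field(default_factory=list)    # append-only (op, key, before, after)

    def _log(self, op, key, before, after):
        self.audit.append((op, key, before, after))

    def add(self, key, text):
        if key in self.entries:
            raise KeyError(f"{key} already exists; use update")
        self.entries[key] = text
        self._log(Op.ADD, key, None, text)

    def update(self, key, text):
        before = self.entries[key]  # the prior value survives in the audit log
        self.entries[key] = text
        self._log(Op.UPDATE, key, before, text)

    def delete(self, key):
        before = self.entries.pop(key)
        self._log(Op.DELETE, key, before, None)

    def retrieve(self, term):
        # Naive substring match stands in for real retrieval.
        hits = {k: v for k, v in self.entries.items() if term.lower() in v.lower()}
        self._log(Op.RETRIEVE, term, None, sorted(hits))
        return hits

    def summary(self, keys, new_key, summarize):
        text = summarize([self.entries[k] for k in keys])
        self.entries[new_key] = text
        self._log(Op.SUMMARY, new_key, list(keys), text)

    def filter(self, context, keep):
        kept = [item for item in context if keep(item)]
        self._log(Op.FILTER, None, len(context), len(kept))
        return kept


store = MemoryStore()
store.add("incident-1", "Vendor X incident open")
store.update("incident-1", "Vendor X incident resolved")
store.delete("incident-1")
history = [(op.value, before, after) for op, _, before, after in store.audit]
```

After those three calls the entry is gone from `store.entries`, but `history` still shows the open state, the resolved state, and the deletion. That is what "every update preserves audit history" means in practice.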

The deepest version of this principle, which I have come to repeat to every new engineer on the team: do not copy chaos. Connect to truth. When you ingest enterprise content into your memory plane, you are choosing what your agents will believe. Choose deliberately.

A worked example: the Risk Reassessment agent’s memory plane

Recall the Risk Reassessment agent from the harness and model pieces. Its job is to assemble a current view of a vendor’s risk profile by pulling SOC 2 history, security incident records, financial filings, and fresh external signals, then producing a structured risk score.

Here is what its memory plane looks like.

Its working memory holds the current task: this vendor, this renewal window, this risk reassessment in flight. It is the context window plus an active scratchpad that the harness manages. It clears when the task completes.

Its episodic memory holds the agent’s prior risk reassessments on this vendor, tiered by recency. The last six months sit in a hot tier with sub-second retrieval. The next two years sit in a warm tier. Older history sits in archive and is retrievable on explicit request. Every reassessment includes the date, the sources cited, the score produced, and the human review outcome. When the agent runs against a vendor it has reassessed before, episodic memory is what makes the new reassessment incremental rather than from scratch.
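The tiering rule itself is a few lines. The sketch below assumes tiers are assigned purely by age, approximates six months as 183 days and two years as 730 days, and uses a hypothetical `tier_for` helper. A real system might also promote old records that are still being retrieved.

```python
from datetime import date

HOT_DAYS = 183               # roughly six months (approximation)
WARM_DAYS = HOT_DAYS + 730   # the following two years (approximation)


def tier_for(reassessed_on: date, today: date) -> str:
    """Assign a storage tier from the age of a prior reassessment."""
    age_days = (today - reassessed_on).days
    if age_days <= HOT_DAYS:
        return "hot"
    if age_days <= WARM_DAYS:
        return "warm"
    return "archive"


today = date(2026, 6, 1)  # illustrative date
print(tier_for(date(2026, 3, 1), today))  # hot
print(tier_for(date(2025, 6, 1), today))  # warm
print(tier_for(date(2023, 1, 1), today))  # archive
```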

Its semantic memory holds the certified facts. The organization’s risk taxonomy. The vendor risk policy. The standard set of carriers and their commercial appetites. The supplier ontology that maps entity identity across procurement, legal, finance, and security systems. Every entry has an authority tag. Policy and Standard documents are present. Drafts and proposals are not. Every entry has a temporal validity. SOC 2 reports older than fifteen months trigger a staleness flag at retrieval.

Its procedural memory holds the playbooks for the reassessment workflow itself. The standard sequence of checks. The escalation criteria. The diagnostic patterns for unusual signal combinations. The agent loads only the playbook relevant to the current case, not all of them. When the playbook fails (the human reviewer overrides the recommendation, or the recommendation later proves wrong) the failure is logged and the playbook is updated for future runs. The harness piece called this the eval flywheel. The memory plane is where it lives.

The retrieval layer runs three methods in parallel. Vector search finds semantically similar content (passages about similar incident patterns). The knowledge graph traverses entity relationships (this vendor’s parent companies, subsidiaries, suppliers, and the customers exposed to them). Lexical search and reranking handle exact-match requirements (a specific CVE identifier, a specific regulatory citation). The agent reasons over a fused, ranked result set.

Governance runs across all four memory types. Authority tagging on every write to semantic memory. TTL on every episodic entry. Access control predicates embedded in every retrieval query, scoped to the calling agent’s identity. Audit logs on every read and every write. Tenant scoping on shared infrastructure so that one customer’s memory cannot leak into another’s reasoning.

The Snowflake-style productivity gain shows up across a quarter of operation. The structured ontology reduces redundant tool calls by roughly a third. The episodic memory of prior reassessments lets the agent skip work that has already been done. Hybrid retrieval lifts the precision of evidence the agent reasons over. The combination produces answers that the procurement leader can act on without spending eight weeks wondering whether the agent was right.

That is what one agent’s memory plane looks like in production. Multiply across an environment of dozens of agents and you start to see why the Memory and Knowledge plane, properly designed, is the difference between a system that compounds in usefulness over quarters and a system that degrades silently into expensive nonsense.

What I would do differently

The lessons below are the ones I paid for. I share them in the hope that they cost you less.

I underbuilt the forgetting path for too long. We had episodic memory accumulating for a year before we built a real pruning strategy. By the time we built it, retrieval quality on the older content was visibly degrading, and we did not have a clean way to know which entries were still relevant. Build the forgetting path before you need it. TTL on every entry. Decay on low-relevance content. Active staleness detection on high-relevance content from day one.

I treated all memory as one vector database. It was easier. It was also wrong. The four memory types differ in nearly every architectural dimension, and conflating them produces a system in which the wrong content gets retrieved at the wrong time. Type your memory from the start. Storing it in different substrates is fine. Pretending it is one thing is not.

I shipped without staleness detection. Our early agents retrieved content based on semantic similarity and recency, without any explicit signal of whether the content was still valid. The first time a confidently-wrong recommendation cost a customer relationship, we built temporal validity into the schema. It should have been there from the start.

I underestimated the access control problem. Retrieval-native access control was an afterthought in our early architecture. We patched it later by applying user-permission filters as a post-retrieval step. That works until the day a permission-restricted document leaks into an agent’s reasoning through the embeddings, even if it does not appear in the final answer. The fix is permission predicates embedded in the retrieval query itself, not post-filters. The earlier this is built, the less expensive it is.

I did not invest in the knowledge graph soon enough. We were vector-only for the first two years. The agents could handle “find me content like this” perfectly well. They could not handle “which suppliers serve competitors who recently entered our market.” When we finally built the knowledge graph layer, several entire categories of agent failure disappeared. If I had to start over, the knowledge graph would be in the architecture from week one, even with a small initial entity set.

The 90-day move

If you are reading this and wondering where to begin, here is what I would do this quarter.

Stand up the registry of memory writes. Every write to semantic or episodic memory, by any agent, gets logged with the source, the authority tag, the temporal validity, the tenant scope, and the writing agent’s identity. This is the equivalent of the agent registry from the harness piece, applied to memory. Build it on day one.

Instrument staleness. Add a freshness signal to every memory entry. Set thresholds per memory type and per content category. Surface staleness on every retrieval. Build a dashboard that shows the staleness distribution across your memory plane.

Add a knowledge graph for entity-relationship reasoning. Start small. The supplier ontology, the customer hierarchy, the product taxonomy. Use Microsoft GraphRAG or one of the established commercial options. The cost of starting is lower than it was a year ago. The cost of not starting is higher than it was a year ago.

Validate retrieval against access controls. Embed permission predicates in the retrieval query, not as post-filters. Test that restricted documents cannot enter the reasoning stream of an agent acting on behalf of a user without the relevant permissions.

Build the forgetting path. TTL on every episodic entry. Decay on low-relevance content. Active staleness detection on high-relevance content. Treat memory operations as a six-tool action space (ADD, UPDATE, DELETE, RETRIEVE, SUMMARY, FILTER) with audit on each operation.
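The first two moves fit together: once every write lands in a registry with a temporal validity, the staleness dashboard is an aggregation over that registry. The sketch below is one possible shape. The `WriteRecord` fields follow the registry description in the list above, while the field names, the sample rows, and the two-status split are assumptions of mine.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class WriteRecord:
    """One registry row per memory write (hypothetical field names)."""
    memory_type: str   # "semantic" or "episodic"
    source: str
    authority: str
    valid_until: date
    tenant: str
    writer_agent: str


def staleness_distribution(records, today):
    """Count current and stale entries per memory type for a dashboard."""
    counts = Counter()
    for record in records:
        status = "stale" if today > record.valid_until else "current"
        counts[(record.memory_type, status)] += 1
    return dict(counts)


# Made-up rows for illustration.
records = [
    WriteRecord("semantic", "policy_db", "policy",
                date(2027, 1, 1), "acme", "curator"),
    WriteRecord("semantic", "soc2_report", "standard",
                date(2025, 12, 31), "acme", "risk_agent"),
    WriteRecord("episodic", "reassessment_run", "standard",
                date(2026, 9, 1), "acme", "risk_agent"),
]

dist = staleness_distribution(records, date(2026, 6, 1))
```

With these rows, `dist` reports one current and one stale semantic entry and one current episodic entry. The stale SOC 2 row is exactly the kind of entry the dashboard exists to surface before an agent cites it.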

That is roughly a quarter of focused work for a small team. The payback shows up immediately as a reduction in confidently-wrong answers, and compounds over quarters as the system’s long-term character becomes something the business can rely on.

What this means

The flagship made the case that the model is not the product. The architecture is the product. The harness piece refined that claim into the runtime scaffolding that makes a model into a reliable agent. The model piece refined it into the substitutable, tiered, gateway-served portfolio that absorbs model market churn. This piece refines it one layer further, into the memory plane that gives the system its long-term character.

You are not building chat. You are building an agent that has to remember the right things, forget the right things, retrieve the right things, and refuse to confidently cite the wrong things. The system’s character lives here. So does its credibility.

Build the four memory types. Run hybrid retrieval. Curate before storage. Build the forgetting path. Govern access at retrieval, not after. Tag authority on every write. Treat memory operations as a first-class API.

The next and final piece in this series goes deep on the Trust Fabric: the cross-cutting set of controls (identity, policy, observability, evals, FinOps, human oversight, compliance) that turns the five planes into a system the business can put its name on.