Five Planes and a Trust Fabric: A reference architecture for production multi-agent systems

The six tensions, the planes that hold them, and the lessons paid for in expensive ways.

It is a Tuesday morning at a Fortune 500 procurement organization. A category leader opens a spreadsheet of contracts expiring in the next ninety days. Six hundred and forty rows. Industry research puts twenty to thirty percent of enterprise SaaS and services spend at risk of waste through overlap, unused seats, and unrenegotiated terms. For a company this size that is somewhere between forty and sixty million dollars a year, leaking quietly out of the back of the budget. The category leader has the same headcount she had two years ago. She will work through the top fifty rows carefully and approve the rest at last year’s terms. The leak continues.

This is not a headcount problem. It is a system problem. The architecture I share below is one approach to solving it, and a few hundred problems shaped like it.

I have spent the last year and a half building agentic software in regulated environments, mostly insurance. The lesson I would write down for any CTO is this: the model is not the product. The architecture is the product.

Frontier models are converging in capability. They are sold by the token, hosted by everyone, and they will only get better and cheaper. The competitive moat is no longer “we picked the right model.” The moat is what you built underneath. The agentic systems that win the next decade will look less like clever prompts and more like distributed systems with probabilistic components. They will be designed, instrumented, and governed accordingly.

What follows is the reference architecture I wish I had been handed when we started. It has five planes, one cross-cutting Trust Fabric, and six architectural tensions that define every meaningful design decision. It is opinionated. It is also field-tested. I use it to evaluate every new feature, every vendor pitch, and every architectural decision my team makes. If you are a CTO, VP of Engineering, or principal architect trying to move from copilots to autonomous workflows without setting fire to your security team’s hair, this is the map.

The thesis

A multi-agent system is not an LLM with extra steps. It is a distributed system in which several components reason in natural language, fail unpredictably, and accumulate state across interactions. That sentence sounds banal. It is the most consequential idea in this entire essay.

Most agent pilots stall short of production for the same reason most early microservice migrations stalled. People build the happy path and discover, only in production, that the architecture has no answer for identity, observability, cost, failure isolation, or upgrade paths. Gartner projects forty percent of enterprise applications will embed task-specific agents by the end of 2026. Industry analysis of pilots puts the share that reach production at roughly fifteen percent. The gap between those two numbers is not a model problem. It is an architecture problem.

The six tensions

An architect’s job is to hold trade-offs in productive opposition. Most “AI strategy” decks read like wish lists (scalable, secure, cheap, flexible, extensible) as if all five properties could be maximized at once. They cannot. The interesting design choices live in the trade-offs, and six tensions define the space:

Autonomy and Oversight. How much an agent decides alone, and where a human reclaims the steering wheel. Too much autonomy and you lose accountability and regulatory cover. Too much oversight and the system delivers nothing the human could not have done themselves.

Quality and Cost. Frontier reasoning is excellent and expensive. Small specialized models are cheap and adequate for most steps. Routing the wrong task to the wrong tier produces systems that are either too expensive to scale or too unreliable to trust.

Velocity and Safety. Ship fast and gate hard are in permanent opposition. The eval flywheel, shadow deployments, and graduated rollout are how you live with both. Skipping any of them is how you ship features that look like outages.

Specialization and Composability. Narrow agents are reliable but compose-heavy. General agents are flexible but unreliable. This is the same trade-space that gave us microservices versus monoliths, and it has the same right answer for most enterprises.

Adaptability and Stability. The model layer churns every quarter. Production systems need durable contracts. The architecture has to absorb the churn invisibly to the application code.

Action and Reversibility. Every side effect is a potential cleanup bill. Idempotency keys, dry-run modes, and compensating workflows are not nice-to-haves. They are the design pattern that lets an agent act on a system the business depends on.

A reference architecture is the discipline of holding all six tensions in productive opposition and refusing to surrender any of them. Every plane in the stack below is doing real work against at least one tension, sometimes two or three.

The architecture in one picture

Five horizontal planes, stacked from foundation to experience. One vertical Trust Fabric that cuts across all of them. Every plane has its own primitives, its own failure modes, and its own evolution path. The Trust Fabric is what turns a stack of probabilistic components into something an enterprise can put its name on.

Read the picture this way. A request arrives at the top, from a human or another system. The Orchestration plane decides which agents will run and how they will coordinate. Each agent reasons and acts using the Tool and Action plane to touch the outside world. It reads from and writes to the Memory and Knowledge plane to maintain context across time. Underneath it all, the Model plane provides the reasoning fabric, routed and cached and budgeted. The Trust Fabric watches every step, enforces policy, accounts for cost, and decides when to call a human.

To make this concrete, let me run one request end to end.

One request, end to end: the Monday queue

It is Sunday at 7:14 AM. A contract renewal triggers automatically ninety days before its expiration. The work that used to take a procurement analyst the better part of a week, spread across vendor portals, SOC 2 repositories, usage analytics, and three internal systems, will be staged for review before the team gets to their desks Monday morning.

A Renewal Supervisor agent wakes up in the Orchestration plane. It pulls the contract from the source-to-pay system, reads its declared policy in the Trust Fabric (“read-only across these systems, write-allowed against the draft repository, commercial term changes above ten percent require human approval”), and decomposes the work into four tracks: three that run in parallel and a fourth that assembles their outputs.

A Risk Reassessment agent queries the Memory and Knowledge plane for the vendor’s history (three years of SOC 2 reports, security incident records, financial filings) and pulls fresh external signals (regulatory actions, public breach disclosures, financial press) from a curated retrieval index. The model behind it is a mid-tier model routed through the AI gateway, because the work is retrieval-shaped rather than reasoning-shaped.

A Market Benchmark agent uses the Tool and Action plane to query the company’s contract repository for comparable agreements, queries internal pricing intelligence, and pulls public pricing where the vendor publishes it. Each query is authenticated with a short-lived token tied to the agent’s identity, not a shared service account. Every action is logged.

A Usage and Fit agent evaluates whether the vendor’s product still fits the company’s footprint. It uses semantic memory to apply the procurement organization’s vendor strategy and procedural memory to recall the company’s standard consolidation patterns. It flags two seat tiers that should be cut and one feature tier that should be downgraded.

A Communication and Routing agent assembles the renewal package: a risk reassessment summary, a benchmark-based negotiation position, a usage-based commercial recommendation, and a draft outreach to the vendor’s account team. It composes from the outputs of the other three agents, references the category leader’s preferred negotiation tone from episodic memory, and routes the package through the approval workflow. Legal sees the contract diff. Security sees the risk reassessment. Finance sees the commercial recommendation.

The Renewal Supervisor reconciles the four tracks, validates the proposed action against policy (“proposed commercial term change is eight percent, no executive approval required for the negotiation position; the security risk score moved one tier, route for security lead review”), writes a complete trace into observability, attributes the cost of the entire run to the contract record, and surfaces the work in the category leader’s queue.

When the team opens their laptops on Monday morning, six hundred and forty renewals are staged with full context, recommended actions, and a single click to approve each track. The work that was impossible at this scale has become routine. The architecture made it look easy.

Notice what just happened across the six tensions. The Renewal Supervisor concentrated autonomy where the cost of being wrong was low and routed to humans where the cost was high. The model tiering gave us quality where reasoning mattered and cost discipline everywhere else. The four worker agents specialized while the Supervisor composed. The gateway absorbed the churn in the model layer. Every action was idempotent and reversible. Every step was visible. The system was fast and safe at the same time, because the architecture made the trade-offs explicit instead of implicit.

Now let me show you what is inside each plane.

The Model plane: portfolio, not pick

The first instinct of most engineering teams is to pick a model. That instinct is wrong. You are not picking a model. You are building a portfolio.

A serious Model plane has four properties. It is multi-provider by default, because any single provider is one outage or pricing change away from breaking your business. It is tiered, because the cost gap between a small specialized model and a frontier reasoning model is now two orders of magnitude on tasks where both perform comparably. It is routed, because the right model for classifying a customer message is not the right model for synthesizing a six-source analysis. And it is cached, because in real-world workloads more than thirty percent of requests are semantically similar, and you should not pay for the same answer twice.

The mechanical realization of all four properties is an AI gateway sitting between your application code and every model provider. Your code calls the gateway. The gateway routes, caches, falls back, enforces budgets, and emits telemetry. Production teams running this pattern report cost reductions in the forty to seventy percent range without measurable quality regression.
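
The control flow described above can be sketched in a few dozen lines. This is a minimal illustration under stated assumptions, not any vendor's gateway: `Gateway`, `Route`, the tier label, and the provider callables are hypothetical names, the flat per-call cost is made up, and an exact-match cache stands in for semantic caching to keep the example short.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Provider = Callable[[str], str]  # hypothetical: takes a prompt, returns text


class BudgetExceeded(Exception):
    pass


@dataclass
class Route:
    providers: List[Provider]   # tried in order; later entries are fallbacks
    cost_per_call: float        # illustrative flat cost, not real pricing


@dataclass
class Gateway:
    routes: Dict[str, Route]    # tier name -> route, loaded from config
    budget: float               # hard spend ceiling enforced by the gateway
    spent: float = 0.0
    cache: Dict[tuple, str] = field(default_factory=dict)
    cache_hits: int = 0

    def complete(self, tier: str, prompt: str) -> str:
        key = (tier, prompt)
        if key in self.cache:   # exact-match cache; a real one would be semantic
            self.cache_hits += 1
            return self.cache[key]
        route = self.routes[tier]
        if self.spent + route.cost_per_call > self.budget:
            raise BudgetExceeded(f"tier {tier!r} would exceed the budget")
        last_error = None
        for provider in route.providers:
            try:
                answer = provider(prompt)
            except Exception as exc:   # provider outage: try the next one
                last_error = exc
                continue
            self.spent += route.cost_per_call
            self.cache[key] = answer
            return answer
        raise RuntimeError("all providers failed") from last_error


def flaky_provider(prompt: str) -> str:
    raise ConnectionError("simulated outage")


def backup_provider(prompt: str) -> str:
    return f"summary of: {prompt}"


gateway = Gateway(
    routes={"mid": Route([flaky_provider, backup_provider], cost_per_call=0.01)},
    budget=0.015,
)
first = gateway.complete("mid", "vendor risk history")   # falls back to backup
second = gateway.complete("mid", "vendor risk history")  # served from cache
```

The application code only ever names a tier. Which provider answered, whether the answer was cached, and whether the budget allowed the call are all decided inside the gateway.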

There is one design rule I would tattoo on every engineer’s wrist: no model names in application code, ever. If a model identifier appears as a string literal anywhere outside your gateway configuration, you are one provider change away from a refactor. Treat model names like database hostnames. They live in config.

The deeper play is to design for substitution. The gateway abstracts the wire format. Your evals run against any backing model. Your prompts are versioned. When the next frontier release lands, you swap a config flag, rerun the eval suite, and ship. The teams that get this right will spend an afternoon doing what their competitors spend a quarter doing.

This is where the Adaptability and Stability tension is fought and resolved. The application code above the gateway has zero awareness that the model layer is changing every six weeks. The gateway absorbs the churn.

A final word on small language models. The 2025 to 2026 progression in distilled and task-specific SLMs has been remarkable. For classification, extraction, routing, and structured generation, a well-tuned SLM running on cheap inference often beats a frontier model on the only three metrics that matter in production: latency, cost, and reliability. Use them. The agentic future is not all frontier. It is a portfolio.

The Memory and Knowledge plane: curate before you store

If the Model plane is the engine, the Memory and Knowledge plane is the long-term character of the system. It is also where most agentic deployments quietly fail.

Production agents need four distinct memory types, designed deliberately:

  • Working memory holds the current task. It lives in the context window and the active scratchpad. It is volatile by design.
  • Episodic memory records what the agent has done and what happened. It is the audit trail. It is also the source of learning.
  • Semantic memory holds facts: the company’s procurement playbook, the vendors’ commercial appetite, the customer’s coverage profile.
  • Procedural memory holds how-to knowledge: the workflows, the heuristics, the playbooks.

The most common failure I see is treating all four as a single vector database. They are not. They differ in write authority, retention policy, retrieval pattern, and governance. A vector database is necessary, but it is not the architecture.

The principle I repeat to every new engineer is this: do not copy chaos. Connect to truth. When you ingest enterprise content into your memory plane, curate before storage. Tag content by authority level. Policy and Standard go in. Opinion and draft do not. Otherwise your agents will confidently cite documents that someone wrote in a Slack thread three quarters ago and forgot about.
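
A minimal sketch of curation at ingest, assuming each document already carries an authority tag. The `Authority` enum, the `Document` type, and the sample documents are hypothetical; the admitted levels follow the rule stated above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Iterable, List


class Authority(Enum):
    POLICY = "policy"
    STANDARD = "standard"
    OPINION = "opinion"
    DRAFT = "draft"


# Only these authority levels are admitted to the memory plane.
ADMITTED = {Authority.POLICY, Authority.STANDARD}


@dataclass(frozen=True)
class Document:
    doc_id: str
    authority: Authority
    text: str


def curate(docs: Iterable[Document]) -> List[Document]:
    """Keep only documents whose authority level is admitted."""
    return [d for d in docs if d.authority in ADMITTED]


incoming = [
    Document("proc-policy-7", Authority.POLICY, "Renewals over $1M need two bids."),
    Document("chat-thread-12", Authority.OPINION, "I think we can skip the bid."),
    Document("sec-standard-3", Authority.STANDARD, "Vendors must hold SOC 2 Type II."),
    Document("playbook-draft", Authority.DRAFT, "Proposed consolidation steps."),
]
stored = curate(incoming)
```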

The unresolved problem in this layer is staleness. A memory about a vendor’s pricing, a metric definition, or a customer’s employer is highly relevant until it is not, at which point it becomes confidently wrong. The right answer combines retrieval recency, source certification, and intentional forgetting. Build the forgetting path. Most teams forget to.
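
One way to make the forgetting path concrete, sketched under assumptions: every memory carries a time-to-live and a flag for whether its source is certified. The `Memory` fields, the TTL values, and the ranking rule (certified first, then most recent) are illustrative choices, not a prescribed design.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import List


@dataclass
class Memory:
    key: str
    value: str
    recorded_at: datetime
    certified: bool      # True if the source is a certified system of record
    ttl: timedelta       # how long the fact is trusted without a refresh


def is_stale(m: Memory, now: datetime) -> bool:
    return now - m.recorded_at > m.ttl


def forget_stale(memories: List[Memory], now: datetime) -> List[Memory]:
    """The forgetting path: drop expired memories instead of serving them."""
    return [m for m in memories if not is_stale(m, now)]


def rank(memories: List[Memory], now: datetime) -> List[Memory]:
    """Prefer certified sources, then the most recently recorded."""
    return sorted(memories, key=lambda m: (not m.certified, now - m.recorded_at))


now = datetime(2026, 1, 1)
memories = [
    Memory("vendor_price", "$40/seat", now - timedelta(days=400), True,
           timedelta(days=365)),
    Memory("vendor_price", "$46/seat", now - timedelta(days=10), False,
           timedelta(days=90)),
    Memory("vendor_price", "$44/seat", now - timedelta(days=30), True,
           timedelta(days=365)),
]
ranked = rank(forget_stale(memories, now), now)
```

Passing `now` in explicitly keeps the logic testable and makes it easy to replay what the agent would have retrieved on a given date.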

And bring back knowledge graphs. Vector search is necessary for unstructured retrieval. It is also embarrassingly bad at multi-hop reasoning about entities, relationships, and time. The teams I admire most are running hybrid retrieval: vector for fuzziness, graph for structure, with the agent orchestrating both.

The Specialization and Composability tension lives here too. The four memory types are deliberately specialized. The agent runtime is what composes across them. Treating all memory as one undifferentiated blob is the architectural equivalent of putting all of your data in one denormalized table and wondering why nothing scales.

The Tool and Action plane: the new API surface

This is the plane where ambition meets risk. An agent that only reads is a research assistant. An agent that acts is a system. The Tool and Action plane is what makes the difference, and it is the surface where most security incidents will live.

The single most important development in this plane is the Model Context Protocol. MCP, originated by Anthropic in late 2024, donated to the Linux Foundation in late 2025, and adopted by every major model provider through 2025 and 2026, is doing for agent-to-tool integration what HTTP did for documents and what USB did for peripherals. Before MCP, every agent needed bespoke connectors for every tool. After MCP, the integration problem becomes N + M rather than N × M. If you are not designing your tool layer on MCP today, you are choosing future technical debt deliberately.

But MCP by itself is just a protocol. The production pattern is the MCP gateway, which sits between your agents and your tools the same way the AI gateway sits between your agents and your models. The MCP gateway authenticates agents, issues short-lived just-in-time tokens for tool access, inspects traffic for prompt injection attempts and data exfiltration patterns, enforces per-tool policies, and records every action for audit. It is the agent equivalent of an API gateway, and it is non-negotiable for any production deployment.

A few design rules that have earned their place in my notes:

Idempotency keys on every side-effectful action. Agents retry. So do their orchestrators. Without idempotency, you will double-bind a customer or send two emails.
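
A minimal sketch of the idea, assuming the key is derived from the action name and a canonicalized payload. `IdempotentExecutor` and `send_email` are hypothetical names, and the in-memory dictionary stands in for the durable store a real deployment would need.

```python
import hashlib
import json
from typing import Any, Callable, Dict, List


def idempotency_key(action: str, payload: Dict[str, Any]) -> str:
    """Derive a stable key from the action name and its canonicalized payload."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{action}:{canonical}".encode()).hexdigest()


class IdempotentExecutor:
    def __init__(self) -> None:
        self._results: Dict[str, Any] = {}   # in production, a durable store

    def run(self, action: str, payload: Dict[str, Any],
            effect: Callable[[Dict[str, Any]], Any]) -> Any:
        key = idempotency_key(action, payload)
        if key in self._results:             # a retry: return the recorded result
            return self._results[key]
        result = effect(payload)
        self._results[key] = result
        return result


sent: List[str] = []


def send_email(payload: Dict[str, Any]) -> str:
    sent.append(payload["to"])               # the side effect we must not repeat
    return f"sent to {payload['to']}"


executor = IdempotentExecutor()
payload = {"to": "vendor@example.com", "subject": "Renewal"}
r1 = executor.run("send_email", payload, send_email)
# A retry with the same content, even with keys in a different order, is a no-op.
retry_payload = {"subject": "Renewal", "to": "vendor@example.com"}
r2 = executor.run("send_email", retry_payload, send_email)
```

A purely content-derived key also deduplicates two intentionally identical actions, so in practice the key usually includes a task or run identifier as well.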

Dry-run modes on every dangerous action. Before the agent files an actual binding request with a vendor, it should be able to run the same call in a simulated mode and surface the diff. This is the most important pattern almost no team builds.
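
A sketch of what a dry-run path can look like, assuming the action is an update to a contract record held as a dictionary. `update_contract` and `ChangeResult` are hypothetical names; the point is that the same call computes the diff in both modes and only mutates state when `dry_run` is explicitly turned off.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass
class ChangeResult:
    applied: bool
    diff: Dict[str, Tuple[object, object]]   # field -> (old value, new value)


def update_contract(record: Dict[str, object], changes: Dict[str, object],
                    dry_run: bool = True) -> ChangeResult:
    """Compute the diff; mutate the record only when dry_run is False."""
    diff = {k: (record.get(k), v) for k, v in changes.items()
            if record.get(k) != v}
    if not dry_run:
        record.update(changes)
    return ChangeResult(applied=not dry_run, diff=diff)


contract = {"seats": 500, "tier": "enterprise"}
proposed = {"seats": 420, "tier": "enterprise"}

preview = update_contract(contract, proposed)            # simulated: nothing changes
seats_after_preview = contract["seats"]
commit = update_contract(contract, proposed, dry_run=False)
```

Defaulting `dry_run` to `True` means a caller has to opt in to the side effect, which is the safer failure mode for an agent that retries.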

Read-only first, then graduated write access. When a new agent enters production, give it read access for a quarter. Let it earn the right to write. The discipline pays for itself the first time the agent does something unexpected.

No long-lived credentials in agent context. This is the single biggest preventable security vulnerability in the agentic stack. Static API keys in agent memory mean a prompt injection becomes a data breach. Use JIT tokens, scoped to the action, expiring in minutes.
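
A minimal sketch of just-in-time, action-scoped tokens. The names (`ScopedToken`, `issue_token`, `authorize`), the scope strings, and the five-minute default are assumptions for illustration; a real system would delegate issuance and verification to the enterprise identity provider rather than hold tokens in process.

```python
import secrets
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScopedToken:
    value: str
    agent_id: str
    scope: str             # a single action, e.g. "contracts:read"
    expires_at: datetime


def issue_token(agent_id: str, scope: str, now: datetime,
                ttl: timedelta = timedelta(minutes=5)) -> ScopedToken:
    """Issue a random token bound to one agent and one scope, expiring at now + ttl."""
    return ScopedToken(secrets.token_urlsafe(32), agent_id, scope, now + ttl)


def authorize(token: ScopedToken, agent_id: str, scope: str,
              now: datetime) -> bool:
    """Allow the call only for the bound agent, the bound scope, and before expiry."""
    return (token.agent_id == agent_id
            and token.scope == scope
            and now < token.expires_at)


issued_at = datetime(2026, 1, 5, 7, 14, tzinfo=timezone.utc)
token = issue_token("market-benchmark", "contracts:read", now=issued_at)

in_scope = authorize(token, "market-benchmark", "contracts:read",
                     issued_at + timedelta(minutes=1))
wrong_scope = authorize(token, "market-benchmark", "contracts:write",
                        issued_at + timedelta(minutes=1))
expired = authorize(token, "market-benchmark", "contracts:read",
                    issued_at + timedelta(minutes=6))
```

If a prompt injection leaks this token, the attacker gets one read scope for a few minutes, not a standing key.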

This is where the Action and Reversibility tension is decided. Every design choice in this plane should be evaluated against one question: what does it cost us to undo this action if it turns out to be wrong? If the answer is “more than we can afford,” redesign until the answer changes.

Browser-based action deserves a brief note because many of the systems agents need to act on, especially in regulated industries, do not have public APIs and never will. Browser automation is not a hack. It is a permanent component of the agent stack. Design it like one: sandboxed runtimes, fingerprint hygiene, session vaulting, and the same audit trail as any other action.

The Agent plane: harness engineering, or why the model is not the bottleneck

In every agentic deployment I have observed, the teams that succeed build narrow agents. The teams that struggle build one general agent that tries to do everything.

A production agent is not a clever prompt. It is a unit of software with an anatomy: a planner that decomposes goals, an executor that calls tools, a critic that checks outputs, memory adapters that read and write across the Memory plane, and an error handler that knows the difference between a recoverable failure and an escalation. It has an interface, a versioning scheme, a test suite, and an owner.

The field has converged, through early 2026, on a name for the discipline of building this scaffolding: harness engineering. The frame, popularized by Anthropic and amplified by OpenAI’s engineering blog and a wave of practitioner pieces through the first half of the year, is this: the model is the brain. The harness is the nervous system that lets the brain do useful work. The harness comprises the prompt construction logic, the memory orchestration, the tool dispatch layer, the execution loop, the sandbox, the error handler, the cache manager, and the audit hooks. It is the runtime equivalent of an operating system around a CPU.

The architectural equation worth committing to memory:

Reliable Agent = Foundation Model + Harness

Industry analysis of failed enterprise pilots converges on the same finding: roughly two-thirds of failures trace to harness-level defects (memory contamination, tool misconfiguration, brittle prompt scaffolding, missing error recovery) rather than to model capability. As noted earlier, the large majority of pilots never reach production at all. The decisive variable is the harness, not the model.

The practical implication for any CTO: the ratio of engineers working on prompts versus engineers working on harness is one of the most consequential staffing decisions you will make this year. The teams I admire most run that ratio at roughly one to four. The teams I watch struggle run it inverted.

The most underrated artifact in the Agent plane is the agent registry. Every agent in your environment is registered: who owns it, what it is allowed to do, what its declared blast radius is, what data it can touch, what models it can call, what its eval suite says about its current quality. The registry is the single source of truth that lets your security team sleep, your finance team forecast, and your engineering team upgrade with confidence. Build it on day one, even when there is only one agent. It is far easier to add agents to a registry than to retrofit a registry around agents.
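
A registry can start very small. The sketch below assumes an in-memory store and hypothetical field names that mirror the attributes listed above; the agent name, owner, and values are made up.

```python
from dataclasses import dataclass
from typing import Dict, FrozenSet


@dataclass(frozen=True)
class AgentRecord:
    name: str
    owner: str                        # a named human, not a team alias
    allowed_actions: FrozenSet[str]
    allowed_models: FrozenSet[str]    # gateway tier aliases, not provider model names
    data_scopes: FrozenSet[str]
    blast_radius: str                 # declared worst case if the agent misbehaves
    eval_pass_rate: float             # latest result from the agent's eval suite


class AgentRegistry:
    def __init__(self) -> None:
        self._agents: Dict[str, AgentRecord] = {}

    def register(self, record: AgentRecord) -> None:
        if not record.owner:
            raise ValueError("every agent needs a named owner")
        if record.name in self._agents:
            raise ValueError(f"agent {record.name!r} is already registered")
        self._agents[record.name] = record

    def can(self, agent_name: str, action: str) -> bool:
        """Unregistered agents are denied everything."""
        record = self._agents.get(agent_name)
        return record is not None and action in record.allowed_actions

    def names(self) -> list:
        return sorted(self._agents)


registry = AgentRegistry()
registry.register(AgentRecord(
    name="risk-reassessment",
    owner="jane.doe",
    allowed_actions=frozenset({"contracts:read", "drafts:write"}),
    allowed_models=frozenset({"mid"}),
    data_scopes=frozenset({"vendor-history"}),
    blast_radius="draft repository only",
    eval_pass_rate=0.97,
))
```

Denying by default for unknown agents is the property that makes the registry useful to a security team: an agent that is not in the registry cannot act.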

The agent lifecycle should mirror software:

  1. Design, with explicit goals and out-of-scope statements.
  2. Eval, with a test suite that runs before any deployment.
  3. Shadow, in which the agent runs in parallel with the human and its outputs are compared but not used.
  4. Canary, in which the agent handles a small fraction of real traffic with tight monitoring.
  5. Production, with full traffic and ongoing online evals.
  6. Retirement, because agents, like services, deserve a deliberate end of life.
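
The lifecycle above can be enforced as a small state machine so that no stage is skipped. This is a sketch with hypothetical names; treating retirement as a separate, deliberate call rather than a promotion is an assumption of the example.

```python
from enum import Enum


class Stage(Enum):
    DESIGN = 1
    EVAL = 2
    SHADOW = 3
    CANARY = 4
    PRODUCTION = 5
    RETIREMENT = 6


def promote(current: Stage, gate_passed: bool) -> Stage:
    """Advance exactly one stage, and only when the current stage's gate passed."""
    if current in (Stage.PRODUCTION, Stage.RETIREMENT):
        raise ValueError("nothing to promote to; use retire() for end of life")
    if not gate_passed:
        return current                # stay put until the step is earned
    return Stage(current.value + 1)


def retire(current: Stage) -> Stage:
    """Retirement is a deliberate decision, available from any stage."""
    return Stage.RETIREMENT


stage = Stage.DESIGN
for gate_result in [True, True, False, True]:
    stage = promote(stage, gate_result)
# DESIGN -> EVAL -> SHADOW -> (gate failed, still SHADOW) -> CANARY
```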

The deepest architectural choice in this plane is one I made early and never regretted: narrow over general. A “renew this contract” agent that does one thing well, with a known eval set and a bounded action surface, is worth more than a “do anything” agent that needs ten guardrails to keep from embarrassing you. This is the Specialization and Composability tension resolved in favor of specialization at the agent layer, with composability moved up to the Orchestration plane where it belongs. It is the same discipline that moved enterprise architecture from monoliths to microservices. Build small. Compose deliberately. Replace ruthlessly.

The Orchestration and Experience plane: topologies that earn their keep

When agents need to work together, you have three canonical topologies, and the question is not which one is best but which one fits the workload.

Hierarchical (supervisor-worker) is a tree. A supervisor agent decomposes a goal and delegates to specialized workers, then reviews and integrates. This is the workhorse pattern for most production deployments because it concentrates accountability, makes failure isolation tractable, and maps cleanly to existing approval hierarchies. The renewal example above is hierarchical.

Mesh (peer-to-peer) is a graph. Agents broadcast capability manifests and form task graphs dynamically without a single orchestrator. Mesh is powerful when the work is genuinely emergent and no single agent can plan the whole. It is also the topology that amplifies errors fastest, so use it with caution.

Pipeline is a line. Each agent’s output is the next agent’s input. Pipelines are excellent when the work is deterministic in shape and only the content varies. They are also brittle, because every stage’s failure cascades.

The pattern most mature deployments converge on is a hybrid: a hierarchical outer loop, with mesh collaboration inside specific phases where shared evidence accumulates on a blackboard (a shared workspace agents write findings to and read from). Supervisor-worker gives you the control plane. Blackboard gives you the data plane.

This plane is where the Autonomy and Oversight tension gets resolved at the workflow level. The topology you choose determines where humans intervene, how exceptions surface, and how accountability is distributed.

On experience, one strong opinion: chat is the prototype, not the product. The early agent demos were chat-shaped because that is what was easy to build. The mature surface is workflow-native. Agents live inside the systems where work already happens (the source-to-pay system, the email client, the CRM, the underwriting workbench), and they appear as a queue of completed work to be reviewed and approved, not as a conversation to be had. A renewal you have to discuss with an agent over chat is not a renewal that scales. A renewal that arrives in your queue with a recommendation, three alternatives, and a single click to approve is a renewal that scales.

The Trust Fabric: the layer that turns a pile of agents into a system

If I could go back and give myself one piece of advice, it would be to invest in the Trust Fabric six months earlier than felt comfortable. It is the difference between a prototype and a product.

The Trust Fabric is the cross-cutting set of controls that runs through every plane. Eight concerns live here, and each one is non-negotiable for any system meant to act on behalf of a regulated business:

Identity. Every agent is a non-human identity, registered, owned by a named human, authenticated through your enterprise identity provider. No shared service accounts. No anonymous agents. The non-human identity population in most enterprises already outnumbers humans by an order of magnitude, and agents will widen that gap. If you cannot enumerate every agent in your stack by name, owner, and capability today, that is the work to do this quarter.

Authorization. Capability-based, least-privilege, short-lived. Tokens scoped to a specific action, expiring in minutes, issued at the point of need. The Cisco and Microsoft zero-trust frameworks released through 2025 and 2026 codify this well, and your existing identity provider almost certainly has the primitives. Use them.

Guardrails. Prompt injection defense at the gateway. Semantic inspection of agent intent before action. Output validation, especially before any side-effectful call. Treat the agent’s input stream the way you treat user input: assume hostile, validate explicitly.

Observability. Every reasoning step, every tool call, every memory operation traced and stored. The agentic equivalent of distributed tracing is now a solved problem, and there is no excuse for shipping an agent you cannot debug. If something goes wrong in production, you need to replay exactly what the agent saw, decided, and did.

Evals. The eval flywheel is the single most important practice in this stack. Offline evals before deployment. Online evals continuously in production. Regression budgets that flag silent quality drift when an upstream model updates. Treat evals like tests, not like research projects. Make them a required gate. This is where the Velocity and Safety tension is finally resolved.
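
A sketch of an eval gate with a regression budget, under simplifying assumptions: agents are plain functions, scoring is exact match, and the two-point budget is an arbitrary illustrative value. The routing cases and both agents are made up; the candidate simulates silent drift after an upstream model update.

```python
from typing import Callable, List, Tuple

EvalCase = Tuple[str, str]   # (input, expected output)


def pass_rate(agent: Callable[[str], str], cases: List[EvalCase]) -> float:
    passed = sum(1 for prompt, expected in cases if agent(prompt) == expected)
    return passed / len(cases)


def gate(candidate: float, baseline: float,
         regression_budget: float = 0.02) -> bool:
    """Allow deployment only if quality dropped by no more than the budget."""
    return candidate >= baseline - regression_budget


CASES: List[EvalCase] = [
    ("invoice overdue 30 days", "finance"),
    ("SOC 2 report expired", "security"),
    ("seat count exceeds usage", "procurement"),
    ("breach disclosed by vendor", "security"),
]


def baseline_agent(prompt: str) -> str:
    if "SOC 2" in prompt or "breach" in prompt:
        return "security"
    if "invoice" in prompt:
        return "finance"
    return "procurement"


def candidate_agent(prompt: str) -> str:
    # Simulated drift: this version no longer recognizes breach disclosures.
    if "SOC 2" in prompt:
        return "security"
    if "invoice" in prompt:
        return "finance"
    return "procurement"


baseline_score = pass_rate(baseline_agent, CASES)
candidate_score = pass_rate(candidate_agent, CASES)
ship = gate(candidate_score, baseline_score)
```

Wired into CI, `ship` being false blocks the release the same way a failing unit test would.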

Human oversight. Graduated autonomy is the design pattern that lets agents move fast without breaking things. Low-risk decisions execute. Medium-risk decisions notify. High-risk decisions require approval. The 2026 direction of travel is away from human-in-the-loop (humans approve every decision, which does not scale) and toward human-on-the-loop (humans supervise the system and intervene on exceptions). Design for the latter.
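
The three tiers map naturally onto a lookup table. In this sketch the ten percent threshold comes from the renewal example earlier in this piece; treating any smaller nonzero change as medium risk is an assumption made for illustration, as are all of the names.

```python
from enum import Enum


class Risk(Enum):
    LOW = "low"
    MEDIUM = "medium"
    HIGH = "high"


class Handling(Enum):
    EXECUTE = "execute"
    NOTIFY = "execute and notify"
    REQUIRE_APPROVAL = "hold for approval"


HANDLING_BY_RISK = {
    Risk.LOW: Handling.EXECUTE,
    Risk.MEDIUM: Handling.NOTIFY,
    Risk.HIGH: Handling.REQUIRE_APPROVAL,
}


def classify_term_change(change_pct: float) -> Risk:
    """Illustrative thresholds: above 10% is high, any other change is medium."""
    if abs(change_pct) > 10:
        return Risk.HIGH
    if change_pct != 0:
        return Risk.MEDIUM
    return Risk.LOW


def decide(change_pct: float) -> Handling:
    return HANDLING_BY_RISK[classify_term_change(change_pct)]


unchanged = decide(0)
small_change = decide(8)
large_change = decide(12)
```

Keeping the policy in a table rather than scattered through agent prompts is what makes it something a security or legal team can read and sign.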

FinOps. Spend visibility per agent, per workflow, per customer, per feature. Hard budgets enforced at the gateway. Cache hit rate as a first-class KPI. The cost spirals in agentic systems are almost never the fault of the model. They are the fault of a missing budget control.
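
A sketch of a hard per-task ceiling with per-agent attribution. `TaskBudget`, the agent names, and the amounts are hypothetical; spend is tracked in integer cents to avoid floating-point drift.

```python
from collections import defaultdict
from typing import DefaultDict


class CostCeilingExceeded(Exception):
    pass


class TaskBudget:
    """Attributes spend to agents and refuses any charge past the ceiling."""

    def __init__(self, ceiling_cents: int) -> None:
        self.ceiling_cents = ceiling_cents
        self.by_agent: DefaultDict[str, int] = defaultdict(int)

    @property
    def total_cents(self) -> int:
        return sum(self.by_agent.values())

    def charge(self, agent: str, cents: int) -> None:
        if self.total_cents + cents > self.ceiling_cents:
            raise CostCeilingExceeded(
                f"{agent} would push the task past {self.ceiling_cents} cents"
            )
        self.by_agent[agent] += cents


budget = TaskBudget(ceiling_cents=500)
budget.charge("risk-reassessment", 180)
budget.charge("market-benchmark", 240)
try:
    budget.charge("usage-and-fit", 120)   # 420 + 120 = 540 > 500, so refused
    refused = False
except CostCeilingExceeded:
    refused = True
```

The check runs before the spend is recorded, so a runaway loop stops at the ceiling instead of being discovered on the invoice.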

Compliance. EU AI Act obligations have been phasing in since 2025, with further provisions applying in 2026. ISO/IEC 42001 is the de facto AI management system standard. DORA, live for EU financial services since January 2025, is worth addressing precisely because it is widely cited and widely misunderstood in this context. DORA is not an AI regulation. It does not mention agents. It does not have to. Every AI agent is an ICT system, and DORA’s 87 ICT obligations apply in full: third-party model provider risk management, resilience testing, incident classification and reporting, and complete audit trails. If your agents call a third-party LLM provider, that provider is a third-party ICT supplier under DORA and must be risk-assessed accordingly. The observability layer in the Trust Fabric is not a best practice for a DORA-regulated entity. It is a legal requirement. None of these are afterthoughts to be retrofitted. They are architectural constraints that shape how you design identity, audit, and risk classification from the start. The teams that treat compliance as design will move faster, not slower, because they will not have to rebuild their architecture every time a regulator clarifies a rule.

Patterns and anti-patterns

A few patterns I see in teams that succeed:

  • Narrow agents, composed deliberately. Smaller agents with bounded scope are easier to eval, debug, upgrade, and trust.
  • Eval-driven development. Build the eval before the agent. The eval is the spec.
  • Shadow before canary, canary before production. Earn each step.
  • Read-only first, graduated write access. The fastest path to a production-grade agent is to start by not letting it write anything.
  • Cost ceilings per task. A single bug should never produce a five-figure invoice. The cap is a feature, not a constraint.

And the anti-patterns I see in teams that struggle:

  • The God Agent, which tries to do everything in one prompt.
  • The RAG Reflex, where every problem looks like a retrieval problem and every retrieval problem gets thrown at a vector database.
  • Ship First, Eval Later, which always becomes “ship first, debug forever.”
  • Hardcoded Model Names in Application Code, which guarantees lock-in even when the abstraction exists.
  • Long-Lived Credentials in Agent Context, which makes every prompt injection a potential breach.

Print the anti-patterns. Tape them above the team’s desks. Refer to them in code review.

The six tensions at each maturity stage

Most maturity models are too generic to be useful. The table below is specific to the six tensions, and it tells you what each design axis should look like at each stage of your build. Most companies are at stage 1 or 2 and pretend they are at stage 4. The progression is sequential.

| Tension | Stage 1: Single agent in production | Stage 2: Multi-agent workflows | Stage 3: Cross-functional agent mesh | Stage 4: Cross-organizational agent economy |
| --- | --- | --- | --- | --- |
| Autonomy and Oversight | Human reviews every output | Graduated by risk tier | Policy-driven, exceptions only | Inter-organizational policy contracts |
| Quality and Cost | Single frontier model | Routing by task complexity | Portfolio with caching at scale | Federated cost accounting across agents |
| Velocity and Safety | Manual eval before release | CI eval gate, shadow and canary | Continuous online evals, regression budgets | Cross-organizational eval protocols |
| Specialization and Composability | One agent, one task | Supervisor and workers | Mesh with blackboard | Public capability manifests |
| Adaptability and Stability | Hardcoded model and prompts | Gateway abstraction | Multi-provider portfolio | Protocol-level abstraction |
| Action and Reversibility | Dry-run mode only | Idempotency keys and undo hooks | Compensating workflows | Cross-system rollback contracts |

The table is the diagnostic. Look at your current build, find each row, locate yourself honestly, and read across. The rows where you lag your overall maturity are the work to do this quarter.

The 90-day starting move

If you are reading this and wondering where to begin:

  1. Stand up the AI gateway. Everything goes through it from day one. Multi-provider, with semantic caching, budgets, and observability. Your application code calls one address.
  2. Build the agent registry. Even if you have only one agent. Especially if you have only one agent. It is the artifact that scales with you.
  3. Pick one workflow. Make it end to end excellent. Resist the temptation to build a platform first. Platforms are built by abstracting working systems, not the other way around.
  4. Set up the eval harness before you ship. The eval is the spec. The spec is the gate.
  5. Write your graduated autonomy policy. On one page. Sign it. Share it with your security team and your legal team. Live by it.

That is not a roadmap. That is the prerequisite to having a roadmap.

What this means for the next three years

The shift underway is more consequential than the cloud migration, more consequential than mobile, more consequential than the move from on-premise to SaaS. We are moving from software that responds to software that acts. The interface stops being the product. The outcome becomes the product.

In that world, the best model is the one everyone else can also rent. The best architecture is the one you actually built. The companies that win the next decade will not be the ones with the biggest model partnerships. They will be the ones whose architecture made it cheap and safe to compound. Build the five planes. Invest in the Trust Fabric. Hold the six tensions in productive opposition. Make the boring, expensive, foundational choices early. The Monday queue will take care of itself.
