What this architecture solves in a multi-agent system
Bottom Line: Coordinating five or more specialized agents without an explicit control plane produces systems that duplicate work, contradict each other, and drift away from their stated objective. The solution is a clean separation of concerns — a control plane that owns planning, policy, and routing; a data plane that carries task payloads and tool results; and dedicated infrastructure for observability, memory, and governance. Every component in this article belongs to one of those layers. Getting the placement right is the only way to build a system that stays predictable as agent count grows.
As arXiv:2601.13671 states directly: "the agent orchestration layer forms the control plane of a multi-agent system," and "without orchestration, even highly capable agents risk duplication of effort, logical inconsistency, or unbounded autonomy that diverges from the system's objectives." That failure mode is not hypothetical — it appears whenever engineers add a new agent to handle a new task without wiring it into a shared planning and routing layer, treating agent addition as additive rather than systemic.
Microsoft's multi-agent reference architecture frames the design challenge precisely: the goal is not building a capable single agent, but making many specialized agents work together in a way that remains adaptive, auditable, and extensible. Observability appears in that architecture not as a monitoring afterthought but as a first-class design concern, with a dedicated observability guide alongside governance documentation. That framing is the right one — observability cannot be wired in after topology decisions are locked.
Control plane vs data plane in multi-agent orchestration
The control-plane/data-plane distinction is a systems architecture principle, not an implementation detail: the two planes serve fundamentally different purposes, carry different failure modes, and must scale independently.
The control plane owns the cognitive and coordination work: decomposing goals into tasks, assigning tasks to agents, enforcing policies, triggering replanning when results arrive out of spec, and maintaining governance state. Nothing in the control plane should depend on the specific wire format agents use to talk to each other. As the paper "The Orchestration of Multi-Agent Systems: Architectures, Design Principles, and Open Challenges" puts it, orchestration is the control plane, so it should be treated as infrastructure rather than as prompt logic inside any individual agent. The paper also states that the orchestration layer "integrates planning, policy, and communication" into a unified model with defined scope.
The data plane carries the actual payloads: task specifications moving from planner to worker, results moving from worker to reviewer, tool inputs and outputs flowing between agents and external systems. The data plane is where communication protocols live.
This is where Agent2Agent (A2A) and Model Context Protocol (MCP) belong — and the two protocols occupy different positions even within the data plane.
Bottom Line: Google’s Developer's Guide to AI Agent Protocols says, "This blog post explores how six key protocols, including MCP and A2A, simplify AI agent development by replacing custom integration code with standardized communication patterns." That is the right boundary line: A2A and MCP reduce custom wiring, but orchestration still sits above them as the control plane.
```mermaid
flowchart LR
    subgraph ControlPlane["Control Plane"]
        PL[Planner]
        PO[Policy Engine]
        RT[Router / Supervisor]
        GV[Governance Store]
        OB[Observability Collector]
    end
    subgraph DataPlane["Data Plane"]
        A2A_BUS["A2A — Agent-to-Agent Transport"]
        MCP_BUS["MCP — Agent-to-Tool / Data Access"]
    end
    subgraph Agents["Specialized Agents"]
        WA[Worker Agent A]
        WB[Worker Agent B]
        RV[Reviewer Agent]
    end
    subgraph Tools["Tools & Data Sources"]
        T1[External API]
        T2[Vector Store]
        T3[Azure Function / OpenAPI]
    end
    PL -->|task spec| RT
    RT -->|routed task| A2A_BUS
    PO -->|policy decision| RT
    GV <-->|audit log| OB
    OB <-->|spans + events| RT
    A2A_BUS --> WA
    A2A_BUS --> WB
    WA -->|result| A2A_BUS
    WB -->|result| A2A_BUS
    A2A_BUS --> RV
    RV -->|verdict| A2A_BUS
    A2A_BUS -->|evaluated result| RT
    WA -->|tool call| MCP_BUS
    WB -->|tool call| MCP_BUS
    MCP_BUS --> T1
    MCP_BUS --> T2
    MCP_BUS --> T3
```
Google's Developer's Guide to AI Agent Protocols (March 2026) explicitly groups MCP and A2A among six protocols that simplify AI agent development by replacing custom integration code with standardized communication patterns. A2A handles agent-to-agent coordination; MCP handles agent-to-tool and agent-to-data access. The control plane sits above both. Orchestration logic must not leak down into protocol handlers.
Where planning lives and what it owns
Planning occupies the top tier of the control plane. The planner receives a goal — from a human, a trigger, or a higher-level system — decomposes it into ordered or parallel tasks, and assigns those tasks to agents by capability. Planning is also responsible for replanning: when a worker agent returns a partial result or signals an error, the planner decides whether to retry, reroute, escalate, or abort.
The key architectural constraint is that planning must own decomposition exclusively. If communication protocols begin absorbing routing decisions — for example, if an A2A message schema encodes conditional task logic — the control plane loses visibility and you lose the ability to reason about system behavior from a single point. arXiv:2601.13671 frames this as the integration of planning, policy, and communication into a unified model where each has a defined scope.
A2A supports this pattern because it standardizes how agents negotiate task acceptance and report results, without encoding the task decomposition logic itself. Google's A2A announcement positions the protocol as enabling agents to "communicate with each other, securely exchange information, and coordinate actions" — the coordination intent originates in the planner; A2A carries the signal.
Pro Tip: Implement replanning as a first-class planner operation with its own observable event type. If replanning is buried inside a worker agent's retry loop, the control plane cannot distinguish between a transient error and a structural goal decomposition failure — two problems that require entirely different responses.
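One way to keep replanning visible is to have the planner emit a distinct event type for each recovery decision. The sketch below is a minimal illustration under stated assumptions: the `Planner` class, the failure category strings, the `max_retries` cap, and the in-memory `events` list (standing in for an observability sink) are all hypothetical and not taken from any framework.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    REROUTE = "reroute"
    REPLAN = "replan"
    ESCALATE = "escalate"


@dataclass
class Planner:
    max_retries: int = 3  # assumed cap; tune per system
    events: list = field(default_factory=list)  # stand-in for an observability sink

    def on_failure(self, task_id: str, category: str, attempt: int) -> Action:
        """Pick a recovery action and emit it as its own event type."""
        if category == "transient" and attempt < self.max_retries:
            action = Action.RETRY
        elif category == "capability_mismatch":
            action = Action.REROUTE
        elif category in ("transient", "invalid_decomposition"):
            # Retries exhausted, or downstream assumptions broken:
            # the decomposition itself is suspect.
            action = Action.REPLAN
        else:
            action = Action.ESCALATE
        self.events.append({
            "type": f"planner.{action.value}",  # replans never hide inside retries
            "task_id": task_id,
            "failure_category": category,
            "attempt": attempt,
        })
        return action
```

Because `planner.replan` is its own event type, a dashboard can count structural decomposition failures separately from transient retries.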
Where policy executes and which decisions it blocks
Policy is a control-plane component distinct from planning. The planner decides what to do; the policy engine decides whether the system is permitted to do it, under what conditions, and with what approval chain. These two concerns must not be collapsed.
Policy gates include: tool access permissions per agent role, human-in-the-loop approval triggers for high-stakes actions, rate limits on external API calls, routing rules based on data classification, and enforcement of agent versioning constraints. Microsoft's reference architecture, billed as a "Guide for designing adaptive, scalable, and secure enterprise multi-agent systems," documents governance separately from orchestration. The separation matters because policy decisions need independent auditability: a blocked action must be logged with its reason and the identity of the requesting agent, not silently swallowed by the planner.
Policy should not be implemented as prompt-level instructions inside worker agents. Prompt-level policy is unverifiable, can be overridden by context, and produces no audit record. A policy engine outside the agents produces a durable, observable record of every gate evaluation.
Watch Out: Attaching policy decisions to observability outputs retroactively — logging them only when an agent fails — means you cannot distinguish authorized blocked actions from system errors. Wire policy evaluation events directly into your observability pipeline at architecture time, not as a post-hoc audit trail.
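A policy engine that records every gate evaluation, allowed or blocked, might look roughly like the following. This is a sketch under assumptions: `PolicyEngine`, `PolicyDecision`, the role and tool names, and the in-memory `audit_log` are illustrative, and a real system would write to a durable store.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyDecision:
    agent: str
    role: str
    tool: str
    allowed: bool
    reason: str
    evaluated_at: str


@dataclass
class PolicyEngine:
    permitted_tools: dict  # role name -> set of tool names (assumed shape)
    audit_log: list = field(default_factory=list)

    def evaluate(self, agent: str, role: str, tool: str) -> PolicyDecision:
        allowed = tool in self.permitted_tools.get(role, set())
        reason = (
            "tool permitted for role"
            if allowed
            else f"tool '{tool}' not permitted for role '{role}'"
        )
        decision = PolicyDecision(
            agent=agent,
            role=role,
            tool=tool,
            allowed=allowed,
            reason=reason,
            evaluated_at=datetime.now(timezone.utc).isoformat(),
        )
        # Log at evaluation time, not only on failure, so authorized
        # blocks stay distinguishable from system errors.
        self.audit_log.append(decision)
        return decision
```

Logging the allowed decisions as well as the blocked ones is the design choice that matters here: it gives the audit trail a complete denominator.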
Where communication protocols stop and orchestration begins
The most common confusion in multi-agent architecture is treating A2A or MCP as orchestration. They are not. Both protocols are data-plane concerns: they define how messages are formatted, authenticated, and transported. Orchestration is the logic that decides when to send a message, to whom, and what to do with the response.
| Dimension | A2A (Agent2Agent) | MCP (Model Context Protocol) |
|---|---|---|
| Primary role | Agent-to-agent task handoff and collaboration | Agent-to-tool and agent-to-data access |
| Where it sits | Data plane, inter-agent transport | Data plane, tool/resource interface |
| What it carries | Task specifications, capability negotiation, results | Tool call requests, data retrieval, context injection |
| Vendor stance | Open standard for cross-vendor agent interop | Standardized tool/context interface replacing bespoke integrations |
| Relationship to orchestration | Carries orchestrator decisions; does not make them | Carries agent-side tool requests; does not route tasks |
A2A does not decide which agent receives a task — the supervisor or router in the control plane makes that decision, then A2A carries it. MCP does not determine what data an agent needs — the agent's task context drives that; MCP provides a standardized way to retrieve it without custom connectors. Google's protocol guide and Google's A2A announcement are explicit: six protocols together reduce custom integration code, with A2A and MCP serving complementary but non-overlapping roles.
Agent roles, task routing, and memory boundaries
Predictable coordination requires that each agent in the system has an assigned role with defined inputs, outputs, and failure behaviors. Role ambiguity — agents that both plan and execute, or that manage their own memory without a shared schema — produces coupling that makes the system brittle and the observability data uninterpretable.
Pro Tip: Instrument each role boundary as a distinct span in your tracing system. If planner-to-worker handoff and worker-to-reviewer handoff share the same span type, you cannot distinguish coordination latency from execution latency in production. The MAESTRO telemetry schema tracks gen_ai.agent.name, gen_ai.tool.name, agent.retry.attempt_number, and run.outcome as separate fields precisely because role-level granularity is what makes multi-agent debugging tractable.
Planner, supervisor, worker, and reviewer roles
Four roles cover the majority of production multi-agent topologies. Each maps to a distinct position in the control or data plane.
| Role | Control/Data Plane | Core responsibility | Handoff via |
|---|---|---|---|
| Planner | Control plane | Goal decomposition, task graph construction, replanning | Router / Supervisor |
| Supervisor | Control plane | Task routing, agent selection, retry decisions | A2A to workers |
| Worker | Data plane execution | Specialized task execution, tool calls via MCP | A2A result to supervisor |
| Reviewer | Data plane, feedback loop | Output verification, quality gate, escalation signal | A2A verdict to supervisor |
Handoffs between roles must be explicit logged events, not implicit message continuations. When the planner hands a task graph to the supervisor, that transition should produce a distinct trace event. When a worker completes and hands results to the reviewer, that handoff is a separate event. Google's A2A design supports this pattern — agents "securely exchange information, and coordinate actions" through a defined protocol, which means the exchange has a structure that can be logged, not just a prompt passed in a chat thread.
Collapsing supervisor and worker into a single agent saves topology complexity but eliminates the ability to independently observe routing decisions versus execution decisions. When that combined agent fails, you cannot tell which responsibility caused the failure.
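To make handoffs explicit, each role transition can be recorded as its own event with a distinct span type. The following is a minimal sketch, assuming a hypothetical `HandoffLog` that keeps events in memory; the `handoff.<from>_to_<to>` naming and the `payload_ref` field are illustrative choices, not part of A2A or any tracing standard.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoffEvent:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    span_type: str    # one distinct type per role boundary
    payload_ref: str  # pointer to the payload, not the payload itself


class HandoffLog:
    def __init__(self, trace_id: str) -> None:
        self.trace_id = trace_id
        self.events: list = []

    def record(
        self,
        from_role: str,
        to_role: str,
        payload_ref: str,
        parent: Optional[HandoffEvent] = None,
    ) -> HandoffEvent:
        event = HandoffEvent(
            trace_id=self.trace_id,
            span_id=uuid.uuid4().hex[:16],
            parent_span_id=parent.span_id if parent else None,
            span_type=f"handoff.{from_role}_to_{to_role}",
            payload_ref=payload_ref,
        )
        self.events.append(event)
        return event
```

Chaining `parent` through successive `record` calls preserves causal order, so planner-to-supervisor latency can be separated from worker-to-reviewer latency when querying the log.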
State, memory, and context propagation rules
Memory in a multi-agent system operates at three distinct scopes, and conflating them is a coupling hazard.
Task-scoped working memory holds context for a single task execution — the worker's current inputs, intermediate results, and tool call history. This memory should be created when a task is assigned and discarded when the task completes or is handed off. It should never be shared with other agents unless the planner explicitly includes it in a new task specification.
Session-scoped memory persists across tasks within a coordinated workflow. The planner and supervisor access it to maintain goal state, track completed subtasks, and supply context to the next worker. MCP is the right interface here — Google's protocol guidance says MCP is the part that handles data access, which means memory retrieval should flow through a standardized interface rather than being copied between agent prompts. The MAESTRO telemetry fields include trace_id, span_id, parent_span_id, and gen_ai.conversation.id to track causal lineage across this scope.
Long-lived organizational memory — fine-tuned behaviors, historical task outcomes, policy decisions — belongs in an auditable store outside all agents, accessible through governed reads, not broadcast into every context window.
Watch Out: Sharing full session context with every agent at each step couples all agents to the same knowledge state. When a worker modifies context and broadcasts the update, downstream agents may act on stale or contradictory information before the reviewer has validated the change. Propagate context selectively: pass only what the receiving agent needs for its specific task.
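Selective propagation can be as simple as copying only the keys a task specification names. The sketch below assumes session memory is a plain dictionary and that each task spec lists the context keys it needs; `select_context` is a hypothetical helper, not an API from any framework.

```python
import copy
from typing import Iterable


def select_context(session_memory: dict, needed_keys: Iterable[str]) -> dict:
    """Return a task-scoped copy of only the context a worker needs."""
    keys = list(needed_keys)
    missing = [k for k in keys if k not in session_memory]
    if missing:
        # Fail loudly in the control plane instead of letting the worker guess.
        raise KeyError(f"task spec references unknown context keys: {missing}")
    # Deep copy so a worker mutating its working memory cannot
    # silently change session state seen by other agents.
    return {k: copy.deepcopy(session_memory[k]) for k in keys}
```

The deep copy is deliberate: updates flow back to session memory only through a result that the reviewer has validated, never through shared mutable state.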
Task routing and failure recovery patterns
- Route by capability match: The supervisor selects the worker agent whose declared capability profile best matches the task specification. Hard-coded routing tables break under agent evolution; capability registries that agents update at startup scale better.
- Retry with the same agent when: The failure is transient (network timeout, rate limit, API error) and the task specification is unchanged. Cap retries: MAESTRO's telemetry schema keeps agent.retry.attempt_number and agent.failure.category as distinct fields precisely because unbounded retries are a detectable failure mode, not just a slow success.
- Reroute to a fallback agent when: The primary agent returns a failure category that signals capability mismatch (wrong tool access, out-of-scope task) rather than a transient error. The supervisor makes this decision, not the worker.
- Trigger replanning when: Multiple agents fail on subtasks that share a dependency, partial results invalidate downstream task assumptions, or total retry count crosses a threshold that indicates the original decomposition was wrong.
| Failure signal | Recovery action | Observability marker |
|---|---|---|
| Transient API error, ≤ N retries | Retry same agent | agent.retry.attempt_number |
| Capability mismatch result | Reroute to fallback agent | agent.failure.category |
| Partial result invalidates downstream | Replan affected subgraph | New planner span, revised task graph |
| Repeated handoff loops detected | Escalate to human / abort | Span correlation pattern, loop counter |
Loops are detectable before final failure: span correlation across parent_span_id chains reveals the same task being re-queued to the same agent repeatedly. Do not wait for a final timeout to detect this; the observability collector should alert on repeated handoff patterns in real time.
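Real-time loop detection can be sketched as a counter over (task, agent) pairs that the observability collector updates on every handoff. The `LoopDetector` class, its threshold of 3, and the `task_id` key are assumptions for illustration; a production collector would more likely derive the pair from span attributes.

```python
from collections import Counter


class LoopDetector:
    """Flags a task that keeps being handed to the same agent."""

    def __init__(self, threshold: int = 3) -> None:
        self.threshold = threshold  # assumed alerting threshold
        self._counts: Counter = Counter()

    def observe(self, task_id: str, agent_name: str) -> bool:
        """Record one handoff; return True once the pair looks like a loop."""
        key = (task_id, agent_name)
        self._counts[key] += 1
        return self._counts[key] >= self.threshold
```

Calling `observe` on each handoff lets the supervisor escalate or abort as soon as the threshold is reached, instead of waiting for an external timeout.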
Connected agents, tool access, and framework constraints
Connected-agent architectures — where one agent instantiates or delegates to another at runtime — introduce control complexity that scales poorly without explicit depth constraints and externalizable tool access. The architectural principle is that every connection point between agents is a potential observability gap and a coordination failure surface.
Pro Tip: Treat the depth of your agent nesting as a tunable architectural parameter, not a byproduct of feature requirements. Before adding a sub-agent to handle a subtask, ask whether an MCP-backed tool call or an OpenAPI endpoint would accomplish the same outcome with less coordination overhead and full auditability in your existing observability stack.
Why Microsoft caps nesting depth at 2
Microsoft's connected-agents guidance is often read as capping nesting depth at 2 (one parent orchestrator and one layer of specialized subagents). Verify that specific limit against the current documentation before relying on it. The operational point holds either way: beyond shallow nesting, the control plane's ability to trace, audit, and recover from failures degrades sharply.
At deeper nesting levels, a failure in a leaf agent requires the debugging engineer to reconstruct causal chains across multiple span contexts. Policy enforcement becomes ambiguous — which layer is responsible for a given approval gate? Replanning logic must account for partial completion states that themselves contain nested partial states. Microsoft's reference architecture maintains separate observability and governance documentation precisely because these concerns must remain tractable as the graph grows.
Watch Out: Deep agent nesting is not a sign of architectural sophistication — it is usually a sign that task decomposition is happening inside the agent graph rather than in the control-plane planner. If you find yourself at depth 3, the correct fix is almost always to move decomposition logic up into the planner, not to add more observability to a deep stack.
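A depth limit is easiest to enforce when delegation itself checks it. The sketch below assumes a limit of 2 (parent at depth 1, subagents at depth 2) purely for illustration; `AgentNode`, `MAX_DEPTH`, and `NestingDepthExceeded` are hypothetical names.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 2  # assumed limit: one orchestrator plus one layer of subagents


class NestingDepthExceeded(RuntimeError):
    pass


@dataclass
class AgentNode:
    name: str
    depth: int = 1
    children: list = field(default_factory=list)

    def delegate(self, child_name: str) -> "AgentNode":
        """Attach a subagent, refusing to nest past MAX_DEPTH."""
        if self.depth >= MAX_DEPTH:
            raise NestingDepthExceeded(
                f"{self.name} is at depth {self.depth}; "
                "move decomposition up into the planner instead"
            )
        child = AgentNode(child_name, depth=self.depth + 1)
        self.children.append(child)
        return child
```

Raising at delegation time turns an architectural guideline into a failure that shows up in testing rather than in production traces.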
When to use OpenAPI tools or Azure Functions instead of local function calling
Local function calling — where an agent directly invokes a Python function in its runtime environment — is appropriate for stateless, deterministic, low-stakes operations where auditability and retry logic are irrelevant. As soon as any of those conditions fail, external tool access via OpenAPI specifications or Azure Functions can be the better architectural choice, depending on the governance and hosting model you already run.
| Criterion | Local function call | OpenAPI tool / Azure Function |
|---|---|---|
| Auditability | No external trace; only agent-local logs | Externally logged, replayable, independently auditable |
| Retry and circuit-breaking | Agent must implement manually | Platform-managed, consistent across agents |
| Access control | Agent-level prompt boundary only | Policy-enforced at the API/function gateway |
| MCP compatibility | Not directly addressable via MCP | Exposed via standardized MCP tool interface |
| Appropriate when | Pure computation, no side effects | External state, payments, data writes, shared resources |
Google's protocol guidance frames this as replacing "custom integration code with standardized communication patterns" — an OpenAPI-backed MCP tool is exactly that: a standardized, auditable, retryable external integration that any agent in the system can call through the same interface.
How A2A and MCP complement each other in production
A2A and MCP operate at different connection surfaces and should be deployed together, not treated as alternatives.
- Choose A2A when the interaction crosses agent boundaries: task handoff, capability negotiation, result reporting between a worker and supervisor.
- Choose MCP when the interaction crosses an agent-to-resource boundary: retrieving data from a vector store, calling an external API, reading from a structured knowledge base.
- Use neither when the operation is purely within a single agent's execution scope and produces no externally visible side effects.
| Scenario | Protocol | Rationale |
|---|---|---|
| Planner assigns task to Worker B | A2A | Cross-agent task handoff |
| Worker B retrieves customer record | MCP | Agent-to-data access |
| Worker B calls payment API | MCP via OpenAPI tool | Agent-to-external-service |
| Worker B returns result to Supervisor | A2A | Cross-agent result reporting |
| Supervisor routes to Reviewer | A2A | Cross-agent coordination |
| Reviewer escalates to human queue | Control plane event (not A2A) | Control plane decision, not agent-to-agent |
The production constraint is boundary discipline: once you route a human escalation through A2A instead of a control-plane event, you lose policy gating and audit guarantees on that path.
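The boundary rules in the table can be written down as a small classifier, which is useful mainly as a guard in code review or tests. This is a sketch only: `choose_channel` and its string labels are hypothetical, and the mapping simply mirrors the scenarios listed above.

```python
def choose_channel(source: str, target: str) -> str:
    """Map an interaction boundary to the channel that should carry it."""
    if source != "agent":
        raise ValueError(f"unsupported source: {source!r}")
    if target == "self":
        return "none"  # purely internal, no externally visible side effects
    if target == "human":
        # Escalations stay in the control plane so policy gating
        # and audit guarantees apply.
        return "control_plane_event"
    if target == "agent":
        return "a2a"
    if target in ("tool", "data"):
        return "mcp"
    raise ValueError(f"unsupported target: {target!r}")
```

For example, `choose_channel("agent", "human")` returns `"control_plane_event"`, which encodes the rule that human escalation is never routed through A2A.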
Observability, evaluation, and governance in the stack
Bottom Line: Observability, evaluation, and governance must be designed into the control plane before agents are wired together. Bolting them on post-launch means the telemetry schema does not capture role-level events, evaluation metrics measure only final task outcomes rather than collaboration quality, and governance controls lack the audit hooks needed for approval workflows. The MAESTRO framework was built as a "Multi-Agent Evaluation Suite for Testing, Reliability, and Observability." The three concerns in that title depend on one another: observability enables reliability, which in turn enables meaningful evaluation. Microsoft's reference architecture treats governance and observability as separate first-class documentation domains, not subsections of deployment.
What to log, trace, and measure across agent boundaries
Single-agent tracing captures one execution path. Multi-agent tracing must capture causal chains across agents that execute concurrently, retry independently, and communicate asynchronously.
| Signal type | Field / Metric | What it reveals |
|---|---|---|
| Causal linkage | trace_id, span_id, parent_span_id | Which agent actions are causally connected |
| Agent identity | gen_ai.agent.name, gen_ai.conversation.id | Which agent produced a given event |
| Tool usage | gen_ai.tool.name, gen_ai.operation.name | Which tools were called, in what order |
| Token consumption | gen_ai.usage.input_tokens, gen_ai.usage.output_tokens | Cost attribution per agent and task |
| Retry behavior | agent.retry.attempt_number, agent.retry.trigger, agent.retry.reason | Whether failures are transient or structural |
| Outcome | run.outcome, run.judgement, agent.failure.category | Task success, partial completion, or failure type |
| Coordination overhead | communication.total_message_size_bytes | Inter-agent messaging volume and payload growth |
All fields above are drawn from MAESTRO's telemetry schema. Without parent_span_id correlation, a failure in agent C that was caused by a bad result from agent A is invisible — you see C fail, not why. Without agent.retry.attempt_number, a routing loop looks identical to a slow but eventually successful retry chain.
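Following parent_span_id links from a failed span back to its origin can be sketched in a few lines. The example assumes spans are available as dictionaries keyed by the field names above; `causal_chain` is a hypothetical helper, and a real tracing backend would usually provide this query itself.

```python
from typing import Optional


def causal_chain(spans: list, failed_span_id: str) -> list:
    """Walk parent_span_id links from a failed span back to the root."""
    by_id = {s["span_id"]: s for s in spans}
    chain = []
    seen = set()
    current: Optional[dict] = by_id.get(failed_span_id)
    while current is not None and current["span_id"] not in seen:
        seen.add(current["span_id"])  # guard against malformed cyclic data
        chain.append(current)
        current = by_id.get(current.get("parent_span_id"))
    return chain
```

Reading `gen_ai.agent.name` along the returned chain shows which upstream agent's output fed the failing one, which is the link that single-agent tracing cannot provide.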
How to evaluate multi-agent behavior beyond single-turn accuracy
MAESTRO makes the evaluation requirement explicit: "LLM-based multi-agent systems (MAS) must be evaluated not merely by task completion, but as complex systems characterized by dynamic, stochastic execution." Single-turn accuracy — did the system return a correct final answer — is insufficient when success depends on correct handoffs, recovery from intermediate failures, and coordination across agents with different latencies.
- Choose task-completion rate when measuring whether the system achieves its end goal across a representative sample of inputs. It is necessary but not sufficient.
- Choose collaboration quality metrics (handoff success rate, reviewer rejection rate, reroute frequency) when diagnosing where in the agent topology coordination is degrading.
- Choose recovery behavior metrics (replanning trigger rate, retry success rate by failure category) when tuning routing and fallback logic.
- Choose end-to-end latency distributions (p50/p95/p99 across full task graphs) when optimizing for user-facing responsiveness, accounting for concurrent agent execution.
- Use MAESTRO-style standardized MAS configuration when benchmarking topology variants. Controlling for input sources, agent count, tool access, and retry limits ensures comparisons are meaningful.
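Several of the metrics above reduce to simple aggregations over logged events. The sketch below assumes hypothetical record shapes (an `accepted` flag on handoffs, a `verdict` string on reviews) and uses the standard library's `statistics.quantiles` for latency percentiles; it is not MAESTRO's own tooling.

```python
import statistics
from typing import Callable


def rate(events: list, predicate: Callable) -> float:
    """Fraction of events matching predicate; 0.0 for an empty list."""
    if not events:
        return 0.0
    return sum(1 for e in events if predicate(e)) / len(events)


def handoff_success_rate(handoffs: list) -> float:
    return rate(handoffs, lambda h: h["accepted"])


def reviewer_rejection_rate(reviews: list) -> float:
    return rate(reviews, lambda r: r["verdict"] == "reject")


def latency_percentiles(latencies_ms: list) -> dict:
    """p50/p95/p99 over end-to-end task graph latencies (needs >= 2 samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}
```

Tracking these per role boundary rather than per system keeps a drop in handoff success from being averaged away by healthy parts of the topology.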
Governance controls for safety, approvals, and change management
Governance controls belong in the control plane, not in agent prompts. Three categories need hard infrastructure:
Tool access gating: Each agent role should have a declared set of permitted tools. Requests outside that set should be rejected by the policy engine and logged with the agent identity and the requested tool name. This is not a prompt constraint — it is an API-level permission check.
Human approval workflows: High-stakes operations — financial transactions, data deletions, external communications on behalf of users — should trigger a pause in the task graph, route an approval request to a human queue, and resume only on confirmation. The task graph state must be durable across this pause.
Agent version change management: When a worker agent is updated, the governance store should record the version change, flag in-flight tasks that were assigned to the previous version, and require explicit re-validation before routing new high-stakes tasks to the new version.
Watch Out: Prompt-level safety admonitions ("do not perform financial transactions without confirmation") are not governance controls. They provide no audit record, can be overridden by sufficiently forceful context, and offer no enforcement at the API or function boundary. Safety in a production multi-agent system requires infrastructure-level gates, not conversational guardrails.
Production trade-offs for five-plus agent systems
Adding agents to a system is not free. Each agent introduces coordination overhead — additional A2A message round-trips, additional spans in the observability pipeline, additional policy evaluations, and additional failure surfaces that the replanning logic must handle. arXiv:2601.13671 identifies the failure modes directly: "duplication of effort, logical inconsistency, or unbounded autonomy" — all three worsen as agent count increases without proportional investment in the control plane.
| Design dimension | Single agent | 2–4 agents | 5+ agents |
|---|---|---|---|
| Planning complexity | Trivial / in-prompt | Explicit planner beneficial | Planner + supervisor required |
| Observability cost | Low (one trace per request) | Moderate (correlated spans) | High (distributed trace correlation required) |
| Policy surface | One agent, one policy scope | Role-based policies | Per-role, per-tool, per-interaction policies |
| Failure recovery | Single retry loop | Rerouting across agents | Full replanning capability needed |
| A2A / MCP value | Minimal | Useful for external tool access | Essential for cross-vendor interop |
| Justified when | Task fits one context window | Parallelism or specialization gains | Specialization gains exceed coordination cost |
Latency, cost, and concurrency trade-offs
Parallelism is the primary latency benefit of multi-agent design: independent subtasks can execute concurrently across workers, reducing wall-clock time below what a single sequential agent could achieve. The cost side is less commonly modeled.
Each agent interaction in a multi-agent system consumes tokens for context setup, task specification, and result formatting — not only for the core task. MAESTRO's telemetry schema tracks gen_ai.usage.input_tokens and gen_ai.usage.output_tokens per agent, and communication.total_message_size_bytes per interaction, making these costs measurable. In practice, systems where agents pass large context windows at each handoff can spend more on coordination tokens than on task execution tokens.
Pro Tip: Before optimizing agent count for parallelism, measure the actual token cost of your A2A handoff messages relative to the task payload. If coordination messages are growing proportionally to task volume, you have a context propagation design problem, not a parallelism win. Standardized A2A messages with lean schemas cost less than bespoke natural-language handoffs.
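The comparison in the tip above can be computed directly from per-span token counts. In this sketch the `kind` label ("handoff" or "execution") is an assumed tag you would attach when instrumenting; the token field names follow the schema already cited, and `coordination_ratio` is a hypothetical helper.

```python
def token_total(span: dict) -> int:
    return (
        span.get("gen_ai.usage.input_tokens", 0)
        + span.get("gen_ai.usage.output_tokens", 0)
    )


def coordination_ratio(spans: list) -> float:
    """Tokens spent on handoffs divided by tokens spent on task execution."""
    coordination = sum(token_total(s) for s in spans if s["kind"] == "handoff")
    execution = sum(token_total(s) for s in spans if s["kind"] == "execution")
    if execution == 0:
        return float("inf") if coordination else 0.0
    return coordination / execution
```

A ratio above 1.0 means the system is spending more tokens moving context between agents than doing the work, which points at context propagation rather than agent count.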
Failure modes: loops, dead ends, and prompt coupling
Three failure modes dominate production multi-agent systems:
Routing loops: A task is routed to an agent, which fails and returns the task to the supervisor, which routes it to the same agent, which fails again. Without retry counters and loop detection in the supervisor, this continues until an external timeout fires. The MAESTRO telemetry fields agent.retry.attempt_number and agent.failure.category exist specifically to make this pattern detectable before the loop exhausts budget.
Dead ends: A task reaches a worker that lacks the tool access or context to complete it but also lacks a clear failure signal to return — it produces a partial result or a confident-but-wrong output. Reviewer agents catch this if their input includes the expected output schema; without a reviewer role, dead-end outputs propagate silently.
Prompt coupling: Agents implicitly depend on one another's prompt structure rather than on an explicit contract. When agent A is updated and its output format changes slightly, agent B fails in a way that is not immediately attributable to agent A's change. This failure mode is architectural: if inter-agent communication uses A2A with a versioned schema, a format change in agent A's output triggers a schema version mismatch that is immediately traceable. If communication uses unstructured natural language passed through context, the coupling is invisible until failure.
Watch Out: Prompt coupling is the hardest failure mode to detect because it does not surface as an error — it surfaces as degraded quality or subtly wrong outputs. Enforce explicit schema contracts on all A2A message payloads. When a schema changes, treat it as an API version change, not a prompt edit.
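An explicit contract check at the receiving agent turns silent prompt coupling into an immediate, attributable error. The envelope below (`schema`, `schema_version`, `payload`) is a hypothetical message shape chosen for illustration and is not the A2A wire format; `Contract`, `validate`, and `SchemaMismatch` are likewise assumed names.

```python
from dataclasses import dataclass


class SchemaMismatch(Exception):
    pass


@dataclass(frozen=True)
class Contract:
    name: str
    major: int
    required: frozenset


def validate(message: dict, contract: Contract) -> dict:
    """Reject messages that break the contract; return the payload otherwise."""
    if message.get("schema") != contract.name:
        raise SchemaMismatch(
            f"expected schema {contract.name!r}, got {message.get('schema')!r}"
        )
    major = int(str(message.get("schema_version", "0")).split(".")[0])
    if major != contract.major:
        # A major version bump is an API change, not a prompt edit.
        raise SchemaMismatch(
            f"expected major version {contract.major}, got {major}"
        )
    payload = message.get("payload", {})
    missing = contract.required - set(payload)
    if missing:
        raise SchemaMismatch(f"missing required fields: {sorted(missing)}")
    return payload
```

When agent A changes its output format, agent B fails at `validate` with a message naming the schema and version, instead of degrading quietly several steps later.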
When to simplify the topology instead of adding another agent
- Add an agent when the specialization benefit is measurable (a dedicated code-review agent with tool access to static analysis reduces reviewer false negatives), the agent's role boundary is clean, and the control plane can route to it without introducing new shared-state dependencies.
- Replace an agent with a tool call when the agent's only function is to call a single external API, transform a data structure, or apply a deterministic rule. An MCP-backed OpenAPI tool does this with less overhead and full auditability.
- Flatten the topology when routing cost exceeds specialization benefit, when the coordination overhead (A2A messages, policy evaluations, replanning triggers) consumes more latency than the parallel execution saves, or when the supervisor must track more partial states than it can resolve without replanning on every cycle.
- Consolidate when two agents share the same tool set, the same memory scope, and the same success criteria. Separate existence serves architectural elegance, not operational value.
Named patterns from Microsoft, A2A, and production vendors
| Pattern source | Core architectural contribution | Stance on framework |
|---|---|---|
| Microsoft multi-agent reference architecture | Design-for-change, observability, governance, extensibility as first-class building blocks | Framework-agnostic; works with AutoGen, LangGraph, CrewAI, custom |
| Google A2A protocol | Open standard for cross-vendor agent-to-agent interoperability | Vendor-neutral; complements any orchestration framework |
| Anthropic MCP | Standardized agent-to-tool/data interface | Vendor-neutral; replaces bespoke tool connectors |
| Magentic-One / AutoGen | Multi-agent topology with explicit orchestrator and worker roles | Microsoft-ecosystem; aligns with reference architecture |
| LangGraph | Stateful graph-based orchestration; nodes are agents or tools | Framework-specific; maps cleanly to control-plane/data-plane split |
| CrewAI / Crew Enterprise | Role-based agent crews with explicit task delegation | Opinionated; adds governance and enterprise controls at SaaS layer |
What Microsoft's reference architecture contributes
Microsoft's multi-agent reference architecture contributes a framework-agnostic design vocabulary: design-for-change, extensibility, observability, evaluation, security, and governance as the six core building blocks of production systems. As the repo states: "You will find actionable guidance for designing for change, balancing long-term extensibility with pragmatic, shipping-first engineering."
The architecture's practical value is opinionation without prescription. It tells you what concerns must be separated and why — control plane from data plane, planning from policy, observability from execution — without mandating AutoGen over LangGraph or Azure over any other cloud. That neutrality makes it useful as a reference map regardless of stack.
Bottom Line: The Microsoft reference architecture is the closest thing the multi-agent field has to a consensus design vocabulary. Use it to align cross-team conversations about where each concern belongs, not as an implementation specification. Its separate governance and observability guides signal that these concerns require their own engineering investment, not an afternoon of prompt tuning.
What A2A standardization changes for inter-agent interoperability
Before A2A, cross-vendor agent collaboration required custom integration code at every boundary — each pair of agent frameworks negotiated their own message formats, authentication schemes, and result structures. A2A's design goal is to make that negotiation unnecessary: "Agent2Agent (A2A) protocol is an open standard designed to enable AI agents from different vendors and frameworks to collaborate and exchange information across enterprise platforms."
The architectural implication is that the control plane no longer needs to know which framework implements a given worker agent — only its declared capabilities and its A2A endpoint. A LangGraph-based worker and a CrewAI-based worker can receive tasks from the same AutoGen-based supervisor without custom adapters.
| Before A2A | After A2A |
|---|---|
| Per-pair custom message formats | Standardized task and result schemas |
| Per-pair authentication integration | Protocol-level auth |
| Framework-locked topologies | Heterogeneous agent graphs |
| Manual capability negotiation | Declared capability registry |
A2A does not replace the orchestrator — it removes the integration friction that previously forced teams to use a single framework throughout the stack.
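To make the "declared capabilities plus endpoint" idea concrete, here is a minimal sketch of capability-based routing. It assumes a simplified, hypothetical record (`AgentCard`) and registry (`CapabilityRegistry`); neither is the actual A2A schema or SDK, and the endpoint URLs are placeholders. The point is only that the supervisor selects a worker by declared skill, never by framework.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class AgentCard:
    """Simplified stand-in for a declared-capability record (not the real A2A schema)."""
    name: str
    endpoint: str  # hypothetical URL where the agent accepts tasks
    skills: frozenset[str]


@dataclass
class CapabilityRegistry:
    """Hypothetical in-memory registry the control plane consults when routing."""
    cards: list[AgentCard] = field(default_factory=list)

    def register(self, card: AgentCard) -> None:
        self.cards.append(card)

    def route(self, required_skill: str) -> AgentCard:
        """Return the first registered agent that declares the required skill."""
        for card in self.cards:
            if required_skill in card.skills:
                return card
        raise LookupError(f"no agent declares skill {required_skill!r}")


registry = CapabilityRegistry()
registry.register(
    AgentCard("research-worker", "https://agents.example/research", frozenset({"web_research"}))
)
registry.register(
    AgentCard("billing-worker", "https://agents.example/billing", frozenset({"invoice_lookup"}))
)
```

Calling `registry.route("invoice_lookup")` returns the billing worker's card regardless of which framework implements it; an undeclared skill raises `LookupError` instead of guessing at a fallback.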
What vendor products imply about the SaaS-to-Agentic-SaaS shift
The verifiable signal here is architectural, not a market metric: Google's 2026 protocol guidance treats standardized agent protocols as the mechanism for "replacing custom integration code with standardized communication patterns." That framing describes a supply-side shift: product vendors building on MCP and A2A can expose agent-callable interfaces without maintaining proprietary SDKs for each integration partner.
Vendors like Sierra, Cognition, Crew Enterprise, and Lindy are best treated here as examples rather than sourced market claims. The inference is that SaaS products are moving toward agent-callable interfaces and workflow depth while the control-plane primitives — planning, policy, A2A transport, and MCP tool access — are increasingly provided by infrastructure. No verified 2026 metrics from those companies were available at publication time; the conclusion is an inference drawn from public architectural choices and protocol adoption patterns, not from earnings or product announcements.
FAQ for multi-agent orchestration architecture
What is the difference between A2A and MCP?
A2A (Agent2Agent) is an open protocol for agent-to-agent task handoff and collaboration across vendors. MCP (Model Context Protocol) is a standardized interface for agents to access tools, data sources, and external APIs. A2A connects agents; MCP connects agents to resources.
How do multi-agent systems work?
A planner decomposes a goal into tasks and routes them via a supervisor to specialized worker agents. Workers execute tasks, call tools via MCP, and return results through A2A. A reviewer validates outputs. A policy engine gates sensitive operations. An observability collector records all events across agent boundaries.
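The flow in that answer can be sketched as a single loop. This is an illustrative toy under stated assumptions, not a reference implementation: `run_goal`, the `"kind:detail"` task format, and the event names are all hypothetical, and the MCP and A2A transports are replaced by plain function calls.

```python
from typing import Callable

Event = tuple[str, str]  # (stage, task) record kept by the observability collector


def run_goal(
    goal: str,
    plan: Callable[[str], list[str]],
    workers: dict[str, Callable[[str], str]],
    allowed: Callable[[str], bool],
    review: Callable[[str], bool],
) -> tuple[dict[str, str], list[Event]]:
    """Plan a goal, gate each task on policy, run a worker, then review the output."""
    results: dict[str, str] = {}
    events: list[Event] = []
    for task in plan(goal):
        if not allowed(task):  # policy engine gates sensitive operations
            events.append(("policy_denied", task))
            continue
        kind = task.split(":", 1)[0]  # supervisor routes on the task prefix
        output = workers[kind](task)
        events.append(("worker_done", task))
        if review(output):  # reviewer validates before results are accepted
            results[task] = output
            events.append(("review_passed", task))
        else:
            events.append(("review_failed", task))
    return results, events
```

Passing the planner, policy check, and reviewer in as callables keeps each control-plane concern replaceable without touching the loop itself.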
What is the role of orchestration in multi-agent systems?
Orchestration is the control plane: it owns planning, routing, policy enforcement, and replanning. Without it, agents duplicate work, produce contradictory outputs, or pursue unbounded autonomous actions.
How do you build a multi-agent system architecture?
Separate control plane (planner, supervisor, policy engine, governance store) from data plane (A2A transport, MCP tool interfaces). Assign explicit roles to each agent with declared capabilities. Wire observability from day one using distributed trace IDs, per-agent spans, and retry/outcome fields.
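As a rough illustration of "distributed trace IDs, per-agent spans, and retry/outcome fields", the sketch below propagates one trace ID across agent boundaries. The `AgentSpan` record and its field names are assumptions made for this example; they are not the MAESTRO schema or any tracing library's API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AgentSpan:
    """Hypothetical per-agent span; field names are illustrative only."""
    trace_id: str                          # shared by every span in one request
    agent: str                             # which agent produced this span
    parent_span_id: Optional[str] = None   # links a worker span to its caller
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    retries: int = 0
    outcome: str = "pending"               # e.g. "success", "failed", "denied"


def new_trace_id() -> str:
    """Create a fresh trace ID at the entry point of a request."""
    return uuid.uuid4().hex


def child_span(parent: AgentSpan, agent: str) -> AgentSpan:
    """Start a span for a downstream agent, inheriting the parent's trace ID."""
    return AgentSpan(trace_id=parent.trace_id, agent=agent, parent_span_id=parent.span_id)
```

A supervisor would create the root span with `new_trace_id()` and hand `child_span(root, "worker")` to each worker, so every retry and outcome can later be joined on the same `trace_id`.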
What is the maximum depth for connected agents in Microsoft guidance?
Microsoft's connected-agents guidance recommends a maximum parent/subagent nesting depth of 2. Beyond that, control, observability, and policy enforcement degrade significantly. Use external tools (OpenAPI, Azure Functions via MCP) instead of adding a third nesting layer.
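A depth limit like this is easy to check before deployment. The sketch below is a generic validator, not part of any Microsoft SDK: it assumes the topology is an acyclic mapping from each parent agent to its subagents, and it counts a parent with direct subagents as depth 2. The function names are hypothetical.

```python
MAX_NESTING_DEPTH = 2  # limit cited in the answer above


def nesting_depth(tree: dict[str, list[str]], root: str) -> int:
    """Count levels from root to the deepest subagent (a lone root is depth 1).

    Assumes the mapping is acyclic; a cycle would recurse without end.
    """
    children = tree.get(root, [])
    if not children:
        return 1
    return 1 + max(nesting_depth(tree, child) for child in children)


def within_limit(tree: dict[str, list[str]], root: str) -> bool:
    """True when the topology stays within the recommended nesting depth."""
    return nesting_depth(tree, root) <= MAX_NESTING_DEPTH
```

A supervisor with sibling workers passes the check, while a worker that spawns its own subagent fails it; that is the case where the answer above suggests a tool instead of a third layer.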
When should a tool replace an agent?
When the agent's only function is to call a single external API or apply a deterministic rule, replace it with an MCP-backed tool. Tools are cheaper, faster, and fully auditable without coordination overhead.
Sources & References
- Microsoft multi-agent reference architecture (GitHub) — Canonical framework-agnostic reference for planning, policy, observability, governance, and extensibility in production multi-agent systems
- Microsoft governance guide — Detailed governance controls within the reference architecture
- Microsoft observability guide — Observability patterns within the reference architecture
- Google Developers Blog — A2A: A New Era of Agent Interoperability — Official A2A protocol announcement; defines cross-vendor agent-to-agent communication standard
- Google Developers Blog — Developer's Guide to AI Agent Protocols (Mar. 2026) — Overview of six agent protocols including A2A and MCP; positions them as replacements for custom integration code
- arXiv:2601.13671 — The Orchestration of Multi-Agent Systems: Architectures, Design Principles, and Open Challenges — Preprint establishing orchestration as the control plane; identifies failure modes of unorchestrated systems
- arXiv:2601.00481 — MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability — Telemetry schema and evaluation methodology for multi-agent systems; source for MAESTRO field names used throughout
- MAESTRO full telemetry fields (PDF) — Full telemetry field list referenced in logging, tracing, and evaluation sections
Keywords: Agent2Agent (A2A), Model Context Protocol (MCP), Microsoft multi-agent reference architecture, AutoGen, Magentic-One, OpenAPI, Azure Functions, LangGraph, CrewAI, Sierra, Cognition, observability, governance, parent/subagent nesting, distributed systems


