
The orchestration of multi-agent systems: how planning, policy, and communication fit together

A robust multi-agent control plane splits planning, policy, communication, memory, observability, evaluation, and governance into separate building blocks. Microsoft's reference architecture and A2A both point toward this separation as the scalable way to coordinate specialized agents, while the model itself stays framework-agnostic and keeps connected-agent nesting shallow to avoid uncontrolled agent trees.

By AxiomLogica Editorial

Apr 2, 2026 · 28 min read

Reviewed by Editorial


What this architecture solves in a multi-agent system

Bottom Line: Coordinating five or more specialized agents without an explicit control plane produces systems that duplicate work, contradict each other, and drift away from their stated objective. The solution is a clean separation of concerns — a control plane that owns planning, policy, and routing; a data plane that carries task payloads and tool results; and dedicated infrastructure for observability, memory, and governance. Every component in this article belongs to one of those layers. Getting the placement right is the only way to build a system that stays predictable as agent count grows.

As arXiv:2601.13671 states directly: "the agent orchestration layer forms the control plane of a multi-agent system," and "without orchestration, even highly capable agents risk duplication of effort, logical inconsistency, or unbounded autonomy that diverges from the system's objectives." That failure mode is not hypothetical — it appears whenever engineers add a new agent to handle a new task without wiring it into a shared planning and routing layer, treating agent addition as additive rather than systemic.

Microsoft's multi-agent reference architecture frames the design challenge precisely: the goal is not building a capable single agent, but making many specialized agents work together in a way that remains adaptive, auditable, and extensible. Observability appears in that architecture not as a monitoring afterthought but as a first-class design concern, with a dedicated observability guide alongside governance documentation. That framing is the right one — observability cannot be wired in after topology decisions are locked.

Control plane vs data plane in multi-agent orchestration

The distinction that competing articles bury in implementation detail is actually a systems architecture principle: the control plane and the data plane serve fundamentally different purposes, carry different failure modes, and must scale independently.

The control plane owns the cognitive and coordination work: decomposing goals into tasks, assigning tasks to agents, enforcing policies, triggering replanning when results arrive out of spec, and maintaining governance state. Nothing in the control plane should depend on the specific wire format agents use to talk to each other. As "The Orchestration of Multi-Agent Systems: Architectures, Design Principles, and Open Challenges" confirms, orchestration is the control plane, meaning it should be treated as infrastructure, not as prompt logic inside any individual agent. The paper also states that the orchestration layer "integrates planning, policy, and communication" into a unified model with defined scope.

The data plane carries the actual payloads: task specifications moving from planner to worker, results moving from worker to reviewer, tool inputs and outputs flowing between agents and external systems. The data plane is where communication protocols live.

This is where Agent2Agent (A2A) and Model Context Protocol (MCP) belong — and the two protocols occupy different positions even within the data plane.

Bottom Line: Google’s Developer's Guide to AI Agent Protocols says, "This blog post explores how six key protocols, including MCP and A2A, simplify AI agent development by replacing custom integration code with standardized communication patterns." That is the right boundary line: A2A and MCP reduce custom wiring, but orchestration still sits above them as the control plane.

flowchart LR
    subgraph ControlPlane["Control Plane"]
        PL[Planner]
        PO[Policy Engine]
        RT[Router / Supervisor]
        GV[Governance Store]
        OB[Observability Collector]
    end

    subgraph DataPlane["Data Plane"]
        A2A_BUS["A2A — Agent-to-Agent Transport"]
        MCP_BUS["MCP — Agent-to-Tool / Data Access"]
    end

    subgraph Agents["Specialized Agents"]
        WA[Worker Agent A]
        WB[Worker Agent B]
        RV[Reviewer Agent]
    end

    subgraph Tools["Tools & Data Sources"]
        T1[External API]
        T2[Vector Store]
        T3[Azure Function / OpenAPI]
    end

    PL -->|task spec| RT
    RT -->|routed task| A2A_BUS
    PO -->|policy decision| RT
    GV <-->|audit log| OB
    OB <-->|spans + events| RT

    A2A_BUS --> WA
    A2A_BUS --> WB
    WA -->|result| A2A_BUS
    WB -->|result| A2A_BUS
    A2A_BUS --> RV
    RV -->|verdict| A2A_BUS
    A2A_BUS -->|evaluated result| RT

    WA -->|tool call| MCP_BUS
    WB -->|tool call| MCP_BUS
    MCP_BUS --> T1
    MCP_BUS --> T2
    MCP_BUS --> T3

Within the six protocols grouped by Google's Developer's Guide to AI Agent Protocols (March 2026), the division of labor is clear: A2A handles agent-to-agent coordination, and MCP handles agent-to-tool and agent-to-data access. The control plane sits above both, and orchestration logic must not leak down into protocol handlers.

Where planning lives and what it owns

Planning occupies the top tier of the control plane. The planner receives a goal — from a human, a trigger, or a higher-level system — decomposes it into ordered or parallel tasks, and assigns those tasks to agents by capability. Planning is also responsible for replanning: when a worker agent returns a partial result or signals an error, the planner decides whether to retry, reroute, escalate, or abort.

The key architectural constraint is that planning must own decomposition exclusively. If communication protocols begin absorbing routing decisions — for example, if an A2A message schema encodes conditional task logic — the control plane loses visibility and you lose the ability to reason about system behavior from a single point. arXiv:2601.13671 frames this as the integration of planning, policy, and communication into a unified model where each has a defined scope.

A2A supports this pattern because it standardizes how agents negotiate task acceptance and report results, without encoding the task decomposition logic itself. Google's A2A announcement positions the protocol as enabling agents to "communicate with each other, securely exchange information, and coordinate actions" — the coordination intent originates in the planner; A2A carries the signal.

Pro Tip: Implement replanning as a first-class planner operation with its own observable event type. If replanning is buried inside a worker agent's retry loop, the control plane cannot distinguish between a transient error and a structural goal decomposition failure — two problems that require entirely different responses.
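A minimal sketch of the tip above, assuming an in-memory event list stands in for the observability pipeline. The names `Planner`, `PlannerEvent`, `handle_failure`, and the failure category strings are hypothetical and not taken from any framework. The point it illustrates is that every planner decision, including a replan, is emitted as its own event type instead of disappearing into a worker's retry loop.

```python
from dataclasses import dataclass, field
from enum import Enum


class Action(Enum):
    RETRY = "retry"
    REROUTE = "reroute"
    REPLAN = "replan"
    ABORT = "abort"


@dataclass
class PlannerEvent:
    event_type: str  # e.g. "planner.replan", distinct from worker-level retries
    task_id: str
    reason: str


@dataclass
class Planner:
    max_retries: int = 2  # illustrative threshold
    events: list = field(default_factory=list)

    def handle_failure(self, task_id: str, category: str, attempt: int) -> Action:
        """Choose a response to a worker failure and record it as a planner event."""
        if category == "transient":
            # Exhausted retries suggest the decomposition itself may be wrong.
            action = Action.RETRY if attempt < self.max_retries else Action.REPLAN
        elif category == "capability_mismatch":
            action = Action.REROUTE
        elif category == "invalid_decomposition":
            action = Action.REPLAN
        else:
            action = Action.ABORT
        self.events.append(PlannerEvent(f"planner.{action.value}", task_id, category))
        return action
```

Because the event type carries the decision, a dashboard can count `planner.replan` events separately from retries without inspecting worker logs.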

Where policy executes and which decisions it blocks

Policy is a control-plane component distinct from planning. The planner decides what to do; the policy engine decides whether the system is permitted to do it, under what conditions, and with what approval chain. These two concerns must not be collapsed.

Policy gates include: tool access permissions per agent role, human-in-the-loop approval triggers for high-stakes actions, rate limits on external API calls, routing rules based on data classification, and enforcement of agent versioning constraints. Microsoft's governance guide describes a "Guide for designing adaptive, scalable, and secure enterprise multi-agent systems" and separates governance from orchestration precisely because policy decisions need independent auditability — a blocked action must be logged with its reason and the identity of the requesting agent, not silently swallowed by the planner.

Policy should not be implemented as prompt-level instructions inside worker agents. Prompt-level policy is unverifiable, can be overridden by context, and produces no audit record. A policy engine outside the agents produces a durable, observable record of every gate evaluation.

Watch Out: Attaching policy decisions to observability outputs retroactively — logging them only when an agent fails — means you cannot distinguish authorized blocked actions from system errors. Wire policy evaluation events directly into your observability pipeline at architecture time, not as a post-hoc audit trail.
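As an illustration of a policy engine that sits outside the agents, the sketch below logs every gate evaluation, allowed or denied, at the moment it happens. `PolicyEngine`, `PolicyDecision`, and the role and tool names are assumptions made for this example; a production engine would write to a durable store instead of a Python list.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class PolicyDecision:
    agent: str
    tool: str
    allowed: bool
    reason: str
    evaluated_at: str


class PolicyEngine:
    def __init__(self, permitted_tools: dict):
        # Maps a role name to the set of tool names that role may call.
        self._permitted = permitted_tools
        self.audit_log: list = []

    def evaluate(self, agent: str, role: str, tool: str) -> PolicyDecision:
        """Evaluate one tool request and append the decision to the audit log."""
        allowed = tool in self._permitted.get(role, set())
        reason = (
            "tool permitted for role"
            if allowed
            else f"tool not permitted for role '{role}'"
        )
        decision = PolicyDecision(
            agent, tool, allowed, reason, datetime.now(timezone.utc).isoformat()
        )
        self.audit_log.append(decision)  # logged whether allowed or blocked
        return decision
```

Since blocked requests are recorded with the requesting agent and a reason, an authorized block can be told apart from a system error after the fact.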

Where communication protocols stop and orchestration begins

The most common confusion in multi-agent architecture is treating A2A or MCP as orchestration. They are not. Both protocols are data-plane concerns: they define how messages are formatted, authenticated, and transported. Orchestration is the logic that decides when to send a message, to whom, and what to do with the response.

Dimension	A2A (Agent2Agent)	MCP (Model Context Protocol)
Primary role	Agent-to-agent task handoff and collaboration	Agent-to-tool and agent-to-data access
Where it sits	Data plane, inter-agent transport	Data plane, tool/resource interface
What it carries	Task specifications, capability negotiation, results	Tool call requests, data retrieval, context injection
Vendor stance	Open standard for cross-vendor agent interop	Standardized tool/context interface replacing bespoke integrations
Relationship to orchestration	Carries orchestrator decisions; does not make them	Carries agent-side tool requests; does not route tasks

A2A does not decide which agent receives a task — the supervisor or router in the control plane makes that decision, then A2A carries it. MCP does not determine what data an agent needs — the agent's task context drives that; MCP provides a standardized way to retrieve it without custom connectors. Google's protocol guide and Google's A2A announcement are explicit: six protocols together reduce custom integration code, with A2A and MCP serving complementary but non-overlapping roles.

Agent roles, task routing, and memory boundaries

Predictable coordination requires that each agent in the system has an assigned role with defined inputs, outputs, and failure behaviors. Role ambiguity — agents that both plan and execute, or that manage their own memory without a shared schema — produces coupling that makes the system brittle and the observability data uninterpretable.

Pro Tip: Instrument each role boundary as a distinct span in your tracing system. If planner-to-worker handoff and worker-to-reviewer handoff share the same span type, you cannot distinguish coordination latency from execution latency in production. The MAESTRO telemetry schema tracks gen_ai.agent.name, gen_ai.tool.name, agent.retry.attempt_number, and run.outcome as separate fields precisely because role-level granularity is what makes multi-agent debugging tractable.

Planner, supervisor, worker, and reviewer roles

Four roles cover the majority of production multi-agent topologies. Each maps to a distinct position in the control or data plane.

Role	Control/Data Plane	Core responsibility	Handoff via
Planner	Control plane	Goal decomposition, task graph construction, replanning	Router / Supervisor
Supervisor	Control plane	Task routing, agent selection, retry decisions	A2A to workers
Worker	Data plane execution	Specialized task execution, tool calls via MCP	A2A result to supervisor
Reviewer	Data plane, feedback loop	Output verification, quality gate, escalation signal	A2A verdict to supervisor

Handoffs between roles must be explicit logged events, not implicit message continuations. When the planner hands a task graph to the supervisor, that transition should produce a distinct trace event. When a worker completes and hands results to the reviewer, that handoff is a separate event. Google's A2A design supports this pattern — agents "securely exchange information, and coordinate actions" through a defined protocol, which means the exchange has a structure that can be logged, not just a prompt passed in a chat thread.
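One way to make handoffs explicit is to record each one as its own typed event linked to its parent. This is a hedged sketch: `HandoffEvent`, `record_handoff`, and the `handoff_type` labels are hypothetical, and the identifier fields simply borrow the `trace_id` / `span_id` / `parent_span_id` naming used elsewhere in this article.

```python
import uuid
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class HandoffEvent:
    trace_id: str
    span_id: str
    parent_span_id: Optional[str]
    handoff_type: str  # e.g. "planner_to_supervisor", "worker_to_reviewer"
    from_agent: str
    to_agent: str


def record_handoff(
    log: list,
    trace_id: str,
    parent: Optional[HandoffEvent],
    handoff_type: str,
    from_agent: str,
    to_agent: str,
) -> HandoffEvent:
    """Append a distinct event for one role-to-role handoff and return it."""
    event = HandoffEvent(
        trace_id=trace_id,
        span_id=uuid.uuid4().hex,
        parent_span_id=parent.span_id if parent else None,
        handoff_type=handoff_type,
        from_agent=from_agent,
        to_agent=to_agent,
    )
    log.append(event)
    return event
```

Keeping `handoff_type` as a separate field is what lets routing latency and execution latency be queried independently later.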

Collapsing supervisor and worker into a single agent saves topology complexity but eliminates the ability to independently observe routing decisions versus execution decisions. When that combined agent fails, you cannot tell which responsibility caused the failure.

State, memory, and context propagation rules

Memory in a multi-agent system operates at three distinct scopes, and conflating them is a coupling hazard.

Task-scoped working memory holds context for a single task execution — the worker's current inputs, intermediate results, and tool call history. This memory should be created when a task is assigned and discarded when the task completes or is handed off. It should never be shared with other agents unless the planner explicitly includes it in a new task specification.

Session-scoped memory persists across tasks within a coordinated workflow. The planner and supervisor access it to maintain goal state, track completed subtasks, and supply context to the next worker. MCP is the right interface here — Google's protocol guidance says MCP is the part that handles data access, which means memory retrieval should flow through a standardized interface rather than being copied between agent prompts. The MAESTRO telemetry fields include trace_id, span_id, parent_span_id, and gen_ai.conversation.id to track causal lineage across this scope.

Long-lived organizational memory — fine-tuned behaviors, historical task outcomes, policy decisions — belongs in an auditable store outside all agents, accessible through governed reads, not broadcast into every context window.

Watch Out: Sharing full session context with every agent at each step couples all agents to the same knowledge state. When a worker modifies context and broadcasts the update, downstream agents may act on stale or contradictory information before the reviewer has validated the change. Propagate context selectively: pass only what the receiving agent needs for its specific task.
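Selective propagation can be as simple as filtering session context down to the keys a receiving agent has declared. The helper below is a sketch under that assumption; `select_context` and the example keys are hypothetical, and the copy is shallow, so nested mutable values would still be shared.

```python
def select_context(session_context: dict, required_keys: set) -> dict:
    """Return a new dict holding only the keys the receiving agent needs."""
    missing = required_keys - set(session_context)
    if missing:
        # Fail loudly instead of handing a worker an incomplete task spec.
        raise KeyError(f"session context lacks required keys: {sorted(missing)}")
    return {key: session_context[key] for key in required_keys}


# Usage: the worker receives the goal and customer id, but not the full history.
session = {"goal": "refund", "customer_id": "c-1", "history": ["step-1", "step-2"]}
worker_context = select_context(session, {"goal", "customer_id"})
```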

Task routing and failure recovery patterns

Route by capability match: The supervisor selects the worker agent whose declared capability profile best matches the task specification. Hard-coded routing tables break under agent evolution; capability registries that agents update at startup scale better.
Retry with the same agent when: The failure is transient (network timeout, rate limit, API error) and the task specification is unchanged. Cap retries: MAESTRO's telemetry schema tracks agent.retry.attempt_number and agent.failure.category as distinct fields precisely because unbounded retries are a detectable failure mode, not just a slow success.
Reroute to a fallback agent when: The primary agent returns a failure category that signals capability mismatch (wrong tool access, out-of-scope task) rather than a transient error. The supervisor makes this decision, not the worker.
Trigger replanning when: Multiple agents fail on subtasks that share a dependency, partial results invalidate downstream task assumptions, or total retry count crosses a threshold that indicates the original decomposition was wrong.

Failure signal	Recovery action	Observability marker
Transient API error, ≤ N retries	Retry same agent	`agent.retry.attempt_number`
Capability mismatch result	Reroute to fallback agent	`agent.failure.category`
Partial result invalidates downstream	Replan affected subgraph	New planner span, revised task graph
Repeated handoff loops detected	Escalate to human / abort	Span correlation pattern, loop counter
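The "route by capability match" rule from the list above can be sketched with a small registry that agents update at startup. `CapabilityRegistry`, `register`, and `best_match` are illustrative names, and the matching rule (cover every required capability, prefer the fewest unused extras) is one assumed definition of "best match" among many possible ones.

```python
from typing import Optional


class CapabilityRegistry:
    def __init__(self) -> None:
        self._capabilities: dict = {}

    def register(self, agent: str, capabilities: set) -> None:
        """Called by an agent at startup to declare what it can do."""
        self._capabilities[agent] = set(capabilities)

    def best_match(self, required: set) -> Optional[str]:
        """Return the agent covering all required capabilities with the fewest extras."""
        candidates = [
            (len(caps - required), agent)
            for agent, caps in self._capabilities.items()
            if required <= caps
        ]
        # Ties on extra-capability count fall back to alphabetical agent name.
        return min(candidates)[1] if candidates else None


registry = CapabilityRegistry()
registry.register("worker-a", {"sql", "summarize"})
registry.register("worker-b", {"sql"})
```

A `None` result is the supervisor's cue to reroute or trigger replanning instead of forcing the task onto a mismatched worker.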

Loops are detectable before final failure: span correlation across parent_span_id chains reveals the same task being re-queued to the same agent repeatedly. Do not wait for a final timeout to detect this; the observability collector should alert on repeated handoff patterns in real time.
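A rough sketch of that kind of span correlation, assuming spans are held in a dict keyed by span id with hypothetical `task_id` and `agent` fields alongside `parent_span_id`. It walks the ancestor chain and counts how often the same task was already handed to the same agent.

```python
def loop_depth(spans: dict, span_id: str) -> int:
    """Count ancestors of span_id that targeted the same task and agent."""
    target = (spans[span_id]["task_id"], spans[span_id]["agent"])
    depth = 0
    seen = {span_id}
    current = spans[span_id]["parent_span_id"]
    while current is not None and current in spans and current not in seen:
        seen.add(current)  # guards against malformed, cyclic parent links
        if (spans[current]["task_id"], spans[current]["agent"]) == target:
            depth += 1
        current = spans[current]["parent_span_id"]
    return depth


spans = {
    "s1": {"task_id": "t1", "agent": "worker-a", "parent_span_id": None},
    "s2": {"task_id": "t1", "agent": "supervisor", "parent_span_id": "s1"},
    "s3": {"task_id": "t1", "agent": "worker-a", "parent_span_id": "s2"},
    "s4": {"task_id": "t1", "agent": "supervisor", "parent_span_id": "s3"},
    "s5": {"task_id": "t1", "agent": "worker-a", "parent_span_id": "s4"},
}
```

A collector could call this as each span arrives and alert once the depth crosses a chosen threshold, well before any timeout fires.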

Connected agents, tool access, and framework constraints

Connected-agent architectures — where one agent instantiates or delegates to another at runtime — introduce control complexity that scales poorly without explicit depth constraints and externalizable tool access. The architectural principle is that every connection point between agents is a potential observability gap and a coordination failure surface.

Pro Tip: Treat the depth of your agent nesting as a tunable architectural parameter, not a byproduct of feature requirements. Before adding a sub-agent to handle a subtask, ask whether an MCP-backed tool call or an OpenAPI endpoint would accomplish the same outcome with less coordination overhead and full auditability in your existing observability stack.

Why connected-agent nesting is usually capped at depth 2

Microsoft's connected-agents guidance is often read as capping nesting at a depth of 2 (one parent orchestrator and one layer of specialized subagents), though that specific limit is not confirmed against Microsoft's documentation here. The operational point holds either way: beyond shallow nesting, the control plane's ability to trace, audit, and recover from failures degrades sharply.

At deeper nesting levels, a failure in a leaf agent requires the debugging engineer to reconstruct causal chains across multiple span contexts. Policy enforcement becomes ambiguous — which layer is responsible for a given approval gate? Replanning logic must account for partial completion states that themselves contain nested partial states. Microsoft's reference architecture maintains separate observability and governance documentation precisely because these concerns must remain tractable as the graph grows.

Watch Out: Deep agent nesting is not a sign of architectural sophistication — it is usually a sign that task decomposition is happening inside the agent graph rather than in the control-plane planner. If you find yourself at depth 3, the correct fix is almost always to move decomposition logic up into the planner, not to add more observability to a deep stack.
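If nesting depth is treated as an explicit parameter, it can be enforced at the point where one agent connects another. The sketch below assumes a limit of 2 purely for illustration (it is not a confirmed platform value), and `AgentNode`, `connect`, and `NestingDepthExceeded` are hypothetical names.

```python
from dataclasses import dataclass, field

MAX_DEPTH = 2  # assumed limit for illustration, not a confirmed platform value


class NestingDepthExceeded(RuntimeError):
    pass


@dataclass
class AgentNode:
    name: str
    depth: int = 1  # the top-level orchestrator sits at depth 1
    children: list = field(default_factory=list)

    def connect(self, child_name: str) -> "AgentNode":
        """Attach a sub-agent, refusing connections deeper than MAX_DEPTH."""
        if self.depth + 1 > MAX_DEPTH:
            raise NestingDepthExceeded(
                f"{self.name} is at depth {self.depth}; "
                f"move decomposition for '{child_name}' up into the planner"
            )
        child = AgentNode(child_name, self.depth + 1)
        self.children.append(child)
        return child
```

Raising at connection time turns an architectural drift into an immediate, reviewable failure instead of a debugging problem later.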

When to use OpenAPI tools or Azure Functions instead of local function calling

Local function calling — where an agent directly invokes a Python function in its runtime environment — is appropriate for stateless, deterministic, low-stakes operations where auditability and retry logic are irrelevant. As soon as any of those conditions fail, external tool access via OpenAPI specifications or Azure Functions can be the better architectural choice, depending on the governance and hosting model you already run.

Criterion	Local function call	OpenAPI tool / Azure Function
Auditability	No external trace; only agent-local logs	Externally logged, replayable, independently auditable
Retry and circuit-breaking	Agent must implement manually	Platform-managed, consistent across agents
Access control	Agent-level prompt boundary only	Policy-enforced at the API/function gateway
MCP compatibility	Not directly addressable via MCP	Exposed via standardized MCP tool interface
Appropriate when	Pure computation, no side effects	External state, payments, data writes, shared resources

Google's protocol guidance frames this as replacing "custom integration code with standardized communication patterns" — an OpenAPI-backed MCP tool is exactly that: a standardized, auditable, retryable external integration that any agent in the system can call through the same interface.

How A2A and MCP complement each other in production

A2A and MCP operate at different connection surfaces and should be deployed together, not treated as alternatives.

Choose A2A when the interaction crosses agent boundaries: task handoff, capability negotiation, result reporting between a worker and supervisor.
Choose MCP when the interaction crosses an agent-to-resource boundary: retrieving data from a vector store, calling an external API, reading from a structured knowledge base.
Use neither when the operation is purely within a single agent's execution scope and produces no externally visible side effects.

Scenario	Protocol	Rationale
Planner assigns task to Worker B	A2A	Cross-agent task handoff
Worker B retrieves customer record	MCP	Agent-to-data access
Worker B calls payment API	MCP via OpenAPI tool	Agent-to-external-service
Worker B returns result to Supervisor	A2A	Cross-agent result reporting
Supervisor routes to Reviewer	A2A	Cross-agent coordination
Reviewer escalates to human queue	Control plane event (not A2A)	Control plane decision, not agent-to-agent

The production constraint is boundary discipline: once you route a human escalation through A2A instead of a control-plane event, you lose policy gating and audit guarantees on that path.

Observability, evaluation, and governance in the stack

Bottom Line: Observability, evaluation, and governance must be designed into the control plane before agents are wired together. Bolting them on post-launch means the telemetry schema does not capture role-level events, evaluation metrics measure only final task outcomes rather than collaboration quality, and governance controls lack the audit hooks needed for approval workflows. The MAESTRO framework was built as a "Multi-Agent Evaluation Suite for Testing, Reliability, and Observability", and the concerns named in that title depend on one another: observability enables reliability, which enables meaningful evaluation. Microsoft's reference architecture treats governance and observability as separate first-class documentation domains, not subsections of deployment.

What to log, trace, and measure across agent boundaries

Single-agent tracing captures one execution path. Multi-agent tracing must capture causal chains across agents that execute concurrently, retry independently, and communicate asynchronously.

Signal type	Field / Metric	What it reveals
Causal linkage	`trace_id`, `span_id`, `parent_span_id`	Which agent actions are causally connected
Agent identity	`gen_ai.agent.name`, `gen_ai.conversation.id`	Which agent produced a given event
Tool usage	`gen_ai.tool.name`, `gen_ai.operation.name`	Which tools were called, in what order
Token consumption	`gen_ai.usage.input_tokens`, `gen_ai.usage.output_tokens`	Cost attribution per agent and task
Retry behavior	`agent.retry.attempt_number`, `agent.retry.trigger`, `agent.retry.reason`	Whether failures are transient or structural
Outcome	`run.outcome`, `run.judgement`, `agent.failure.category`	Task success, partial completion, or failure type
Coordination overhead	`communication.total_message_size_bytes`	Inter-agent messaging volume and payload growth

All fields above are drawn from MAESTRO's telemetry schema. Without parent_span_id correlation, a failure in agent C that was caused by a bad result from agent A is invisible — you see C fail, not why. Without agent.retry.attempt_number, a routing loop looks identical to a slow but eventually successful retry chain.
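To make the parent_span_id point concrete, here is a small sketch that reconstructs the chain of agents leading to a failing span. It assumes spans are stored as dicts keyed by span id and carrying the `gen_ai.agent.name` and `parent_span_id` fields named above; `causal_chain` is a hypothetical helper, not part of MAESTRO.

```python
def causal_chain(spans: dict, span_id: str) -> list:
    """Return agent names from the root span down to span_id."""
    chain = []
    seen = set()
    current = span_id
    while current is not None and current not in seen:
        seen.add(current)  # stop if parent links are cyclic
        span = spans.get(current)
        if span is None:
            break  # parent span was never collected; the chain is truncated
        chain.append(span["gen_ai.agent.name"])
        current = span.get("parent_span_id")
    return list(reversed(chain))


spans = {
    "a1": {"gen_ai.agent.name": "agent-a", "parent_span_id": None},
    "b1": {"gen_ai.agent.name": "agent-b", "parent_span_id": "a1"},
    "c1": {"gen_ai.agent.name": "agent-c", "parent_span_id": "b1"},
}
```

Starting from the failing span in agent C, the chain shows that agent A sits upstream, which is exactly the link that disappears without parent correlation.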

How to evaluate multi-agent behavior beyond single-turn accuracy

MAESTRO makes the evaluation requirement explicit: "LLM-based multi-agent systems (MAS) must be evaluated not merely by task completion, but as complex systems characterized by dynamic, stochastic execution." Single-turn accuracy — did the system return a correct final answer — is insufficient when success depends on correct handoffs, recovery from intermediate failures, and coordination across agents with different latencies.

Choose task-completion rate when measuring whether the system achieves its end goal across a representative sample of inputs — necessary but not sufficient.
Choose collaboration quality metrics (handoff success rate, reviewer rejection rate, reroute frequency) when diagnosing where in the agent topology coordination is degrading.
Choose recovery behavior metrics (replanning trigger rate, retry success rate by failure category) when tuning routing and fallback logic.
Choose end-to-end latency distributions (p50/p95/p99 across full task graphs) when optimizing for user-facing responsiveness, accounting for concurrent agent execution.
Use MAESTRO-style standardized MAS configuration when benchmarking topology variants — controlling for input sources, agent count, tool access, and retry limits ensures comparisons are meaningful.
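The metric families above can be computed from per-run records with very little code. This sketch assumes a hypothetical record shape (`handoffs`, `handoffs_ok`, `reviewer_rejected`, `latency_ms`) and uses a nearest-rank percentile, which is one of several common percentile definitions.

```python
import math


def percentile(values: list, pct: int):
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def collaboration_metrics(runs: list) -> dict:
    """Aggregate handoff success, reviewer rejection, and p95 latency over runs."""
    total_handoffs = sum(r["handoffs"] for r in runs)
    ok_handoffs = sum(r["handoffs_ok"] for r in runs)
    rejected = sum(1 for r in runs if r["reviewer_rejected"])
    return {
        "handoff_success_rate": ok_handoffs / total_handoffs if total_handoffs else 0.0,
        "reviewer_rejection_rate": rejected / len(runs) if runs else 0.0,
        "latency_p95_ms": percentile([r["latency_ms"] for r in runs], 95),
    }


runs = [
    {"handoffs": 4, "handoffs_ok": 4, "reviewer_rejected": False, "latency_ms": 100},
    {"handoffs": 4, "handoffs_ok": 2, "reviewer_rejected": True, "latency_ms": 300},
]
metrics = collaboration_metrics(runs)
```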

Governance controls for safety, approvals, and change management

Governance controls belong in the control plane, not in agent prompts. Three categories need hard infrastructure:

Tool access gating: Each agent role should have a declared set of permitted tools. Requests outside that set should be rejected by the policy engine and logged with the agent identity and the requested tool name. This is not a prompt constraint — it is an API-level permission check.

Human approval workflows: High-stakes operations — financial transactions, data deletions, external communications on behalf of users — should trigger a pause in the task graph, route an approval request to a human queue, and resume only on confirmation. The task graph state must be durable across this pause.

Agent version change management: When a worker agent is updated, the governance store should record the version change, flag in-flight tasks that were assigned to the previous version, and require explicit re-validation before routing new high-stakes tasks to the new version.
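The human approval workflow described above depends on task graph state surviving the pause. A minimal sketch, assuming the state is JSON-serializable and that the returned snapshot string is written to some durable store (not shown); `pause_for_approval` and `resume` are hypothetical names.

```python
import json


def pause_for_approval(state: dict, action: str) -> str:
    """Mark the task graph as waiting and return a JSON snapshot to persist."""
    paused = {**state, "status": "awaiting_approval", "pending_action": action}
    return json.dumps(paused, sort_keys=True)


def resume(snapshot: str, approved: bool, approver: str) -> dict:
    """Restore state from a snapshot and record the human decision."""
    state = json.loads(snapshot)
    if state.get("status") != "awaiting_approval":
        raise ValueError("snapshot is not awaiting approval")
    state["status"] = "running" if approved else "aborted"
    state["approval"] = {"approved": approved, "approver": approver}
    return state


snapshot = pause_for_approval({"task_id": "t1", "completed": ["lookup"]}, "issue_refund")
resumed = resume(snapshot, approved=True, approver="ops-lead")
```

Because the snapshot is plain JSON, the process that resumes the graph does not need to be the one that paused it.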

Watch Out: Prompt-level safety admonitions ("do not perform financial transactions without confirmation") are not governance controls. They provide no audit record, can be overridden by sufficiently forceful context, and offer no enforcement at the API or function boundary. Safety in a production multi-agent system requires infrastructure-level gates, not conversational guardrails.

Production trade-offs for five-plus agent systems

Adding agents to a system is not free. Each agent introduces coordination overhead — additional A2A message round-trips, additional spans in the observability pipeline, additional policy evaluations, and additional failure surfaces that the replanning logic must handle. arXiv:2601.13671 identifies the failure modes directly: "duplication of effort, logical inconsistency, or unbounded autonomy" — all three worsen as agent count increases without proportional investment in the control plane.

Design dimension	Single agent	2–4 agents	5+ agents
Planning complexity	Trivial / in-prompt	Explicit planner beneficial	Planner + supervisor required
Observability cost	Low (one trace per request)	Moderate (correlated spans)	High (distributed trace correlation required)
Policy surface	One agent, one policy scope	Role-based policies	Per-role, per-tool, per-interaction policies
Failure recovery	Single retry loop	Rerouting across agents	Full replanning capability needed
A2A / MCP value	Minimal	Useful for external tool access	Essential for cross-vendor interop
Justified when	Task fits one context window	Parallelism or specialization gains	Specialization gains exceed coordination cost

Latency, cost, and concurrency trade-offs

Parallelism is the primary latency benefit of multi-agent design: independent subtasks can execute concurrently across workers, reducing wall-clock time below what a single sequential agent could achieve. The cost side is less commonly modeled.

Each agent interaction in a multi-agent system consumes tokens for context setup, task specification, and result formatting — not only for the core task. MAESTRO's telemetry schema tracks gen_ai.usage.input_tokens and gen_ai.usage.output_tokens per agent, and communication.total_message_size_bytes per interaction, making these costs measurable. In practice, systems where agents pass large context windows at each handoff can spend more on coordination tokens than on task execution tokens.

Pro Tip: Before optimizing agent count for parallelism, measure the actual token cost of your A2A handoff messages relative to the task payload. If coordination messages are growing proportionally to task volume, you have a context propagation design problem, not a parallelism win. Standardized A2A messages with lean schemas cost less than bespoke natural-language handoffs.
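A quick way to take that measurement is to compute the share of tokens spent on handoff spans. The sketch below assumes each span carries the `gen_ai.usage.input_tokens` and `gen_ai.usage.output_tokens` fields mentioned above plus a hypothetical `kind` field that marks a span as `"handoff"` or `"execution"`.

```python
def coordination_ratio(spans: list) -> float:
    """Fraction of all tokens consumed by handoff spans instead of task execution."""
    coordination = 0
    total = 0
    for span in spans:
        tokens = span["gen_ai.usage.input_tokens"] + span["gen_ai.usage.output_tokens"]
        total += tokens
        if span["kind"] == "handoff":  # "kind" is an assumed field for this example
            coordination += tokens
    return coordination / total if total else 0.0


spans = [
    {"kind": "handoff", "gen_ai.usage.input_tokens": 200, "gen_ai.usage.output_tokens": 100},
    {"kind": "execution", "gen_ai.usage.input_tokens": 500, "gen_ai.usage.output_tokens": 200},
]
ratio = coordination_ratio(spans)
```

Tracking this ratio over time shows whether handoff messages are growing with task volume, which points to a context propagation problem.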

Failure modes: loops, dead ends, and prompt coupling

Three failure modes dominate production multi-agent systems:

Routing loops: A task is routed to an agent, which fails and returns the task to the supervisor, which routes it to the same agent, which fails again. Without retry counters and loop detection in the supervisor, this continues until an external timeout fires. The MAESTRO telemetry fields agent.retry.attempt_number and agent.failure.category exist specifically to make this pattern detectable before the loop exhausts budget.

Dead ends: A task reaches a worker that lacks the tool access or context to complete it but also lacks a clear failure signal to return — it produces a partial result or a confident-but-wrong output. Reviewer agents catch this if their input includes the expected output schema; without a reviewer role, dead-end outputs propagate silently.

Prompt coupling: Agents implicitly depend on one another's prompt structure rather than on an explicit contract. When agent A is updated and its output format changes slightly, agent B fails in a way that is not immediately attributable to agent A's change. This failure mode is architectural: if inter-agent communication uses A2A with a versioned schema, a format change in agent A's output triggers a schema version mismatch that is immediately traceable. If communication uses unstructured natural language passed through context, the coupling is invisible until failure.

Watch Out: Prompt coupling is the hardest failure mode to detect because it does not surface as an error — it surfaces as degraded quality or subtly wrong outputs. Enforce explicit schema contracts on all A2A message payloads. When a schema changes, treat it as an API version change, not a prompt edit.
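A schema contract check might look like the sketch below. The envelope fields (`schema_version`, `task_id`, `payload`) and the major-version rule are assumptions made for illustration; they are not the A2A specification's actual message format.

```python
SUPPORTED_MAJOR = 1  # hypothetical major version this receiver understands
REQUIRED_FIELDS = {"schema_version", "task_id", "payload"}


class SchemaMismatch(ValueError):
    pass


def validate_message(message: dict) -> dict:
    """Reject messages that lack required fields or use an unsupported major version."""
    missing = REQUIRED_FIELDS - set(message)
    if missing:
        raise SchemaMismatch(f"missing fields: {sorted(missing)}")
    major = int(str(message["schema_version"]).split(".")[0])
    if major != SUPPORTED_MAJOR:
        raise SchemaMismatch(
            f"unsupported schema major version {major}, expected {SUPPORTED_MAJOR}"
        )
    return message


valid = validate_message({"schema_version": "1.2", "task_id": "t1", "payload": {}})
```

With a check like this at the receiving boundary, a format change in the sending agent fails loudly with a named cause instead of degrading output quality silently.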

When to simplify the topology instead of adding another agent

Add an agent when the specialization benefit is measurable (a dedicated code-review agent with tool access to static analysis reduces reviewer false negatives), the agent's role boundary is clean, and the control plane can route to it without introducing new shared-state dependencies.
Replace an agent with a tool call when the agent's only function is to call a single external API, transform a data structure, or apply a deterministic rule. An MCP-backed OpenAPI tool does this with less overhead and full auditability.
Flatten the topology when routing cost exceeds specialization benefit, when the coordination overhead (A2A messages, policy evaluations, replanning triggers) consumes more latency than the parallel execution saves, or when the supervisor must track more partial states than it can resolve without replanning on every cycle.
Consolidate when two agents share the same tool set, the same memory scope, and the same success criteria. Separate existence serves architectural elegance, not operational value.

Named patterns from Microsoft, A2A, and production vendors

Pattern source	Core architectural contribution	Stance on framework
Microsoft multi-agent reference architecture	Design-for-change, observability, governance, extensibility as first-class building blocks	Framework-agnostic; works with AutoGen, LangGraph, CrewAI, custom
Google A2A protocol	Open standard for cross-vendor agent-to-agent interoperability	Vendor-neutral; complements any orchestration framework
Model Context Protocol (MCP, introduced by Anthropic)	Standardized agent-to-tool/data interface	Vendor-neutral; replaces bespoke tool connectors
Magentic-One / AutoGen	Multi-agent topology with explicit orchestrator and worker roles	Microsoft-ecosystem; aligns with reference architecture
LangGraph	Stateful graph-based orchestration; nodes are agents or tools	Framework-specific; maps cleanly to control-plane/data-plane split
CrewAI / Crew Enterprise	Role-based agent crews with explicit task delegation	Opinionated; adds governance and enterprise controls at SaaS layer

What Microsoft's reference architecture contributes

Microsoft's multi-agent reference architecture contributes a framework-agnostic design vocabulary: design-for-change, extensibility, observability, evaluation, security, and governance as the six core building blocks of production systems. As the repo states: "You will find actionable guidance for designing for change, balancing long-term extensibility with pragmatic, shipping-first engineering."

The architecture's practical value is opinionation without prescription. It tells you what concerns must be separated and why — control plane from data plane, planning from policy, observability from execution — without mandating AutoGen over LangGraph or Azure over any other cloud. That neutrality makes it useful as a reference map regardless of stack.

Bottom Line: The Microsoft reference architecture is the closest thing the multi-agent field has to a consensus design vocabulary. Use it to align cross-team conversations about where each concern belongs, not as an implementation specification. Its separate governance and observability guides signal that these concerns require their own engineering investment, not an afternoon of prompt tuning.

What A2A standardization changes for inter-agent interoperability

Before A2A, cross-vendor agent collaboration required custom integration code at every boundary — each pair of agent frameworks negotiated their own message formats, authentication schemes, and result structures. A2A's design goal is to make that negotiation unnecessary: "Agent2Agent (A2A) protocol is an open standard designed to enable AI agents from different vendors and frameworks to collaborate and exchange information across enterprise platforms."

The architectural implication is that the control plane no longer needs to know which framework implements a given worker agent — only its declared capabilities and its A2A endpoint. A LangGraph-based worker and a CrewAI-based worker can receive tasks from the same AutoGen-based supervisor without custom adapters.

Before A2A	After A2A
Per-pair custom message formats	Standardized task and result schemas
Per-pair authentication integration	Protocol-level auth
Framework-locked topologies	Heterogeneous agent graphs
Manual capability negotiation	Declared capability registry

A2A does not replace the orchestrator — it removes the integration friction that previously forced teams to use a single framework throughout the stack.
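To illustrate routing on declared capabilities rather than on framework, here is a small sketch of a registry lookup. `AgentEntry`, `CapabilityRegistry`, the capability strings, and the `example.invalid` endpoints are all hypothetical. The sketch does not implement A2A discovery or transport; it only shows that the router never inspects which framework backs a worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEntry:
    name: str
    endpoint: str            # hypothetical A2A endpoint URL
    capabilities: frozenset  # declared skills, illustrative strings


class CapabilityRegistry:
    """Maps a requested capability to the first agent that declares it."""

    def __init__(self) -> None:
        self._entries: list[AgentEntry] = []

    def register(self, entry: AgentEntry) -> None:
        self._entries.append(entry)

    def route(self, capability: str) -> AgentEntry:
        for entry in self._entries:
            if capability in entry.capabilities:
                return entry
        raise LookupError(f"no agent declares capability: {capability}")


registry = CapabilityRegistry()
# The framework behind each worker is deliberately absent from the entry.
registry.register(AgentEntry(
    "reviewer", "https://reviewer.example.invalid/a2a",
    frozenset({"code-review"}),
))
registry.register(AgentEntry(
    "researcher", "https://researcher.example.invalid/a2a",
    frozenset({"web-research", "summarize"}),
))
target = registry.route("summarize")
```

In this sketch `target.endpoint` is all the supervisor needs to hand off the task; swapping the worker's implementation framework would not change the routing code.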

What vendor products imply about the SaaS-to-Agentic-SaaS shift

The verified signal is architectural rather than market-metric: Google's 2026 protocol guidance treats standardized agent protocols as the mechanism for "replacing custom integration code with standardized communication patterns." That framing describes a supply-side shift — product vendors building on MCP and A2A can expose agent-callable interfaces without maintaining proprietary SDKs for each integration partner.

Vendors like Sierra, Cognition, Crew Enterprise, and Lindy are best treated here as examples rather than sourced market claims. The inference is that SaaS products are moving toward agent-callable interfaces and workflow depth while the control-plane primitives — planning, policy, A2A transport, and MCP tool access — are increasingly provided by infrastructure. No verified 2026 metrics from those companies were available at publication time; the conclusion is an inference drawn from public architectural choices and protocol adoption patterns, not from earnings or product announcements.

FAQ for multi-agent orchestration architecture

What is the difference between A2A and MCP?

A2A (Agent2Agent) is an open protocol for agent-to-agent task handoff and collaboration across vendors. MCP (Model Context Protocol) is a standardized interface for agents to access tools, data sources, and external APIs. A2A connects agents; MCP connects agents to resources.

How do multi-agent systems work?

A planner decomposes a goal into tasks and routes them via a supervisor to specialized worker agents. Workers execute tasks, call tools via MCP, and return results through A2A. A reviewer validates outputs. A policy engine gates sensitive operations. An observability collector records all events across agent boundaries.

What is the role of orchestration in multi-agent systems?

Orchestration is the control plane: it owns planning, routing, policy enforcement, and replanning. Without it, agents duplicate work, produce contradictory outputs, or pursue unbounded autonomous actions.

How do you build a multi-agent system architecture?

Separate control plane (planner, supervisor, policy engine, governance store) from data plane (A2A transport, MCP tool interfaces). Assign explicit roles to each agent with declared capabilities. Wire observability from day one using distributed trace IDs, per-agent spans, and retry/outcome fields.
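As a rough sketch of what "distributed trace IDs, per-agent spans, and retry/outcome fields" could mean in practice, the record below propagates one trace ID across agent boundaries. The field names are assumptions chosen for illustration; they are not the MAESTRO telemetry schema or any tracing library's API.

```python
import uuid
from dataclasses import dataclass, field
from typing import Optional


def new_id() -> str:
    """Generate a random hex identifier for traces and spans."""
    return uuid.uuid4().hex


@dataclass
class Span:
    # Hypothetical span record; field names are illustrative only.
    trace_id: str                       # shared by every span in one request
    agent: str                          # which agent produced this span
    span_id: str = field(default_factory=new_id)
    parent_span_id: Optional[str] = None
    retries: int = 0
    outcome: str = "pending"            # e.g. "ok", "failed", "pending"


def child_span(parent: Span, agent: str) -> Span:
    """Create a span for a downstream agent that inherits the trace ID."""
    return Span(trace_id=parent.trace_id, agent=agent,
                parent_span_id=parent.span_id)


root = Span(trace_id=new_id(), agent="supervisor")
worker = child_span(root, "worker")
worker.retries += 1
worker.outcome = "ok"
```

Because every span carries the same `trace_id` and a pointer to its parent, a collector can rebuild the full call tree for one request even when the agents run in different processes.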

What is the maximum depth for connected agents in Microsoft guidance?

Microsoft's connected-agents guidance recommends a maximum parent/subagent nesting depth of 2. Beyond that, control, observability, and policy enforcement degrade significantly. Use external tools (OpenAPI, Azure Functions via MCP) instead of adding a third nesting layer.
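A depth cap like this can be enforced in code rather than by convention. The sketch below assumes the top-level parent counts as depth 1 and its subagents as depth 2, so a third layer is rejected. That counting convention, along with the `AgentNode` and `DepthLimitError` names, is an assumption made for illustration.

```python
from dataclasses import dataclass, field

MAX_NESTING_DEPTH = 2  # cap from the guidance above; counting is an assumption


class DepthLimitError(RuntimeError):
    """Raised when connecting a subagent would exceed the nesting cap."""


@dataclass
class AgentNode:
    name: str
    depth: int = 1  # assumed convention: the top-level parent is depth 1
    children: list = field(default_factory=list)

    def connect(self, child_name: str) -> "AgentNode":
        """Attach a subagent, refusing to create a third nesting layer."""
        if self.depth + 1 > MAX_NESTING_DEPTH:
            raise DepthLimitError(
                f"{self.name} is at depth {self.depth}; "
                f"expose {child_name} as a tool instead of a subagent"
            )
        child = AgentNode(child_name, self.depth + 1)
        self.children.append(child)
        return child


supervisor = AgentNode("supervisor")
worker = supervisor.connect("worker")
```

Calling `worker.connect(...)` in this sketch raises `DepthLimitError`, which is the point at which the design would switch to an external tool.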

When should a tool replace an agent?

When the agent's only function is to call a single external API or apply a deterministic rule, replace it with an MCP-backed tool. Tools are cheaper, faster, and fully auditable without coordination overhead.

Sources & References

Microsoft multi-agent reference architecture (GitHub) — Canonical framework-agnostic reference for planning, policy, observability, governance, and extensibility in production multi-agent systems
Microsoft governance guide — Detailed governance controls within the reference architecture
Microsoft observability guide — Observability patterns within the reference architecture
Google Developers Blog — A2A: A New Era of Agent Interoperability — Official A2A protocol announcement; defines cross-vendor agent-to-agent communication standard
Google Developers Blog — Developer's Guide to AI Agent Protocols (Mar. 2026) — Overview of six agent protocols including A2A and MCP; positions them as replacements for custom integration code
arXiv:2601.13671 — The Orchestration of Multi-Agent Systems: Architectures, Design Principles, and Open Challenges — Preprint establishing orchestration as the control plane; identifies failure modes of unorchestrated systems
arXiv:2601.00481 — MAESTRO: Multi-Agent Evaluation Suite for Testing, Reliability, and Observability — Telemetry schema and evaluation methodology for multi-agent systems; source for MAESTRO field names used throughout
MAESTRO full telemetry fields (PDF) — Full telemetry field list referenced in logging, tracing, and evaluation sections

Keywords: Agent2Agent (A2A), Model Context Protocol (MCP), Microsoft multi-agent reference architecture, AutoGen, Magentic-One, OpenAPI, Azure Functions, LangGraph, CrewAI, Sierra, Cognition, observability, governance, parent/subagent nesting, distributed systems
