Multi-Agent Systems: What Actually Works (And What Doesn't)

The pitch is seductive: instead of one big agent doing everything, have a team of specialized agents. A researcher, a writer, a reviewer, a publisher. They collaborate, check each other's work, and produce better results than any single agent could.

CrewAI is built on this premise. AutoGen's whole model is multi-agent conversation. LangGraph lets you wire up arbitrary agent topologies. The demos are incredible — watch five agents debate strategy and produce a polished deliverable.

Then you try to run it in production, and you discover that multi-agent systems have failure modes that single-agent systems don't, and they're significantly harder to debug, monitor, and control.

This isn't an argument against multi-agent architectures. It's an argument for understanding when they help and when they're unnecessary complexity. After working with dozens of teams deploying agent systems, here's what we've seen actually work.

The Failure Modes Nobody Talks About

Hallucination Propagation

A single agent that hallucinates is a problem. Multiple agents passing hallucinations to one another are a catastrophe.

Here's the pattern. Agent A is a "researcher." It retrieves some documents and summarizes them. In the summary, it slightly mischaracterizes a data point: it reports $12M in revenue when the source said $12M ARR (annual recurring revenue), which is not the same as total revenue. Agent B, the "analyst," takes that summary as ground truth and builds projections on it. Agent C, the "writer," turns those projections into a client-facing report.

The original error was subtle. By the time it reaches the output, it's been amplified and embedded in analysis that looks authoritative. Three agents agreed on it, so it must be right — except they didn't independently verify anything. They played telephone.

┌──────────┐     ┌──────────┐     ┌──────────┐
│ Agent A   │────▶│ Agent B   │────▶│ Agent C   │
│ Researcher│     │ Analyst   │     │ Writer    │
│           │     │           │     │           │
│ Error: 1% │     │ Error: 5% │     │ Error: 15%│
│ (subtle   │     │ (builds   │     │ (presents │
│  misread) │     │  on it)   │     │  as fact) │
└──────────┘     └──────────┘     └──────────┘

      Hallucination doesn't average out.
      It compounds.

In a single-agent system, you can ground-truth the output against the source. In a multi-agent pipeline, the intermediate representations are lossy summaries that discard the context needed to verify them.
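
One way to keep intermediate results checkable is to make every handoff carry its evidence along with the claim. The following is a minimal sketch, not taken from any framework: the `Claim` record, the `is_grounded` helper, and the sample document are all hypothetical. The check is a plain substring match, so it only confirms that the quoted span exists in the cited source, not that the claim follows from it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Claim:
    """A statement passed between stages, with the evidence it rests on."""
    text: str        # what the upstream agent asserts
    source_id: str   # hypothetical document identifier
    quote: str       # verbatim span copied from the source


def is_grounded(claim: Claim, sources: dict[str, str]) -> bool:
    """Return True only if the quoted span really appears in the cited source."""
    document = sources.get(claim.source_id, "")
    return bool(claim.quote) and claim.quote in document


# Made-up source text for illustration.
sources = {"doc-1": "ARR reached $12M this year."}

good = Claim("ARR was $12M", "doc-1", "ARR reached $12M")
bad = Claim("Total revenue was $12M", "doc-1", "total revenue of $12M")

print(is_grounded(good, sources))  # True
print(is_grounded(bad, sources))   # False
```

Even a crude check like this means a downstream stage, or a human reviewer, can see that the quoted span says "ARR" rather than taking the summary on faith.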

Coordination Overhead

Every message between agents is an LLM call. A "discussion" between three agents that goes five rounds is 15 LLM calls — and that's before any of them use tools.

We've profiled real CrewAI crews. A four-agent crew handling a moderately complex research task averaged 35-45 LLM calls. With GPT-4-class models, that's $2-6 per invocation. The agents spent roughly 40% of their tokens on coordination — explaining context to each other, negotiating task boundaries, summarizing their work for the next agent.

Compare that to a single well-prompted agent with the same tools. It handles the same task in 8-12 LLM calls at $0.50-1.50. The quality is comparable, sometimes better, because there's no information loss between agents.

Multi-agent coordination is worth the cost when the task genuinely requires different capabilities (different tools, different model strengths, different permission scopes). It's not worth it when it's simulating a process that one agent could handle.
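
The arithmetic behind these numbers is easy to reproduce. This sketch takes the midpoints of the ranges quoted above, which is a simplification, and the function names are illustrative. It suggests the per-call price is about the same in both setups; the cost gap comes almost entirely from the number of calls.

```python
def conversation_calls(agents: int, rounds: int) -> int:
    """Each agent speaks once per round, and each turn is one LLM call."""
    return agents * rounds


def cost_per_call(total_cost: float, calls: int) -> float:
    """Average cost of one LLM call within a run."""
    return total_cost / calls


print(conversation_calls(agents=3, rounds=5))  # 15

# Midpoints of the ranges quoted above (35-45 calls at $2-6; 8-12 calls at $0.50-1.50).
crew = cost_per_call(total_cost=4.00, calls=40)
single = cost_per_call(total_cost=1.00, calls=10)
print(f"crew: ${crew:.2f}/call, single: ${single:.2f}/call")
```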

Debugging Impossibility

Something went wrong. The output is bad. Which agent caused it?

In a single-agent system, you read the trace: prompt → reasoning → tool call → reasoning → output. Linear. Debuggable.

In a multi-agent system, you're looking at a conversation graph. Agent A said X to Agent B. Agent B interpreted it as Y and told Agent C. Agent C asked Agent A for clarification. Agent A gave a slightly different answer. Agent B revised its analysis. Agent C produced output based on the revised analysis.

Now find the bug.

This isn't hypothetical. We've seen teams spend days debugging multi-agent issues that would have been immediately obvious in a single-agent trace. The operational cost of debugging is the silent killer: it doesn't show up in the demo, but it dominates the total cost of ownership (TCO).

AutoGen's conversation logging helps. LangSmith traces LangGraph execution graphs. But the fundamental challenge remains: the more agents you have, the more interaction paths exist, and the harder it is to reason about behavior.
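
To put a rough number on "more interaction paths", count the directed channels (ordered sender/receiver pairs) in each topology. This is a back-of-the-envelope sketch that assumes a fully connected conversation in the first case and a strict one-way chain in the second.

```python
def directed_channels(agents: int) -> int:
    """Ordered (sender, receiver) pairs when any agent may message any other."""
    return agents * (agents - 1)


def pipeline_channels(stages: int) -> int:
    """A one-way chain has exactly one handoff between consecutive stages."""
    return max(stages - 1, 0)


for n in (2, 3, 4, 5):
    print(n, directed_channels(n), pipeline_channels(n))
```

Five freely conversing agents have 20 possible channels to inspect; a five-stage chain has 4.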

Non-Deterministic Coordination

Even with temperature=0 (which isn't truly deterministic across providers), multi-agent systems exhibit emergent coordination patterns that vary between runs. Agent A might ask Agent B a question in one run and skip it in another. The "discussion" might go three rounds or seven.

This means:

Costs are unpredictable. The same input can produce wildly different token consumption.
Latency is unpredictable. A task that took 10 seconds last time might take 45 seconds now.
Testing is unreliable. Your test suite passes because the agents happened to coordinate well on that run.

For a demo, this is fine. For a production system with SLAs, it's a serious problem.
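
A single passing test run says little about a system like this. One way to make the variance visible is to run the same input many times and summarize the spread. The sketch below assumes a hypothetical `run_once` callable that invokes your agent system and returns whether the output passed plus how many tokens it used; the fake runs are made-up values for demonstration.

```python
import itertools
import statistics
from typing import Callable, NamedTuple


class RunResult(NamedTuple):
    passed: bool   # did the output meet the acceptance check?
    tokens: int    # total tokens consumed by the run


def measure_stability(run_once: Callable[[], RunResult], trials: int = 20) -> dict:
    """Run the same input repeatedly and summarize how much the runs differ."""
    results = [run_once() for _ in range(trials)]
    tokens = [r.tokens for r in results]
    return {
        "pass_rate": sum(r.passed for r in results) / trials,
        "tokens_min": min(tokens),
        "tokens_max": max(tokens),
        "tokens_stdev": statistics.pstdev(tokens),
    }


# Stand-in for a real agent invocation: cycles through made-up outcomes.
_fake_runs = itertools.cycle([
    RunResult(True, 1000), RunResult(True, 3000),
    RunResult(False, 5000), RunResult(True, 1000),
])
report = measure_stability(lambda: next(_fake_runs), trials=4)
print(report)
```

A pass rate below 1.0, or a wide gap between `tokens_min` and `tokens_max`, is a signal that cost and latency budgets need headroom.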

What Actually Works

After seeing what fails, here's what succeeds in production.

Pattern 1: Single Agent with Structured Tools

For 80% of use cases, one agent with well-designed tools beats a multi-agent system. The key is pushing complexity into the tools, not the agent topology.

Instead of:

Research Agent → Analysis Agent → Writing Agent

Do:

Single Agent
  ├── research_tool(query) → structured data
  ├── analyze_tool(data) → structured analysis
  └── format_tool(analysis, template) → output

The tools are deterministic functions. The agent decides what to research and how to analyze, but the tool implementations enforce structure. The agent can't produce a malformed analysis because analyze_tool returns a typed schema, not free text.

This pattern is boring. It doesn't make for exciting conference talks. But it ships, it scales, and it's debuggable.
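
As a rough illustration of pushing structure into tools, here is a sketch of two deterministic tools with typed results. The `Analysis` fields and the exact signatures are assumptions made for this example. In a real system the agent would choose the arguments, while the code guarantees the shape of the result.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Analysis:
    metric: str
    values: tuple[float, ...]
    mean: float
    trend: str  # "up", "down", or "flat"


def analyze_tool(metric: str, values: list[float]) -> Analysis:
    """Deterministic analysis: the agent picks the inputs, code builds the output."""
    if len(values) < 2:
        raise ValueError("need at least two data points to analyze a trend")
    mean = sum(values) / len(values)
    delta = values[-1] - values[0]
    trend = "up" if delta > 0 else "down" if delta < 0 else "flat"
    return Analysis(metric, tuple(values), mean, trend)


def format_tool(analysis: Analysis, template: str) -> str:
    """Fill a fixed template from the typed analysis; no free-text generation."""
    return template.format(
        metric=analysis.metric, mean=analysis.mean, trend=analysis.trend
    )


result = analyze_tool("signups", [100.0, 120.0, 140.0])
print(format_tool(result, "{metric}: mean {mean:.1f}, trend {trend}"))
# signups: mean 120.0, trend up
```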

Pattern 2: Pipeline (Not Conversation)

When you genuinely need multiple LLM calls with different prompts or models, use a pipeline, not a conversation. The difference:

Conversation (agents talk to each other):

A: "Here's what I found..."
B: "Interesting, but can you clarify..."
A: "Sure, what I meant was..."
B: "OK, based on that, my analysis is..."

Pipeline (structured handoff):

Stage 1: Research  → {structured output schema}
Stage 2: Analysis  → takes schema, produces {analysis schema}
Stage 3: Synthesis → takes analysis schema, produces final output

The pipeline is unidirectional. Each stage has a typed contract. There's no back-and-forth negotiation. If Stage 2 doesn't have enough information, it fails explicitly rather than entering a multi-round clarification loop.
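
A framework-free sketch of the same idea, assuming hypothetical schemas and stage functions. The LLM calls are replaced by stand-ins so the example runs on its own. The point is the shape: each stage accepts one typed input, returns one typed output, and raises when it cannot proceed.

```python
from dataclasses import dataclass


class InsufficientInput(Exception):
    """Raised when a stage cannot proceed; the pipeline stops instead of negotiating."""


@dataclass(frozen=True)
class ResearchOutput:
    topic: str
    findings: tuple[str, ...]


@dataclass(frozen=True)
class AnalysisOutput:
    topic: str
    key_points: tuple[str, ...]


def research_stage(topic: str) -> ResearchOutput:
    # Stand-in for an LLM call whose response is parsed into the schema.
    return ResearchOutput(topic, (f"finding about {topic}",))


def analysis_stage(research: ResearchOutput, min_findings: int = 1) -> AnalysisOutput:
    if len(research.findings) < min_findings:
        raise InsufficientInput(
            f"need {min_findings} findings, got {len(research.findings)}"
        )
    # Stand-in for a second LLM call.
    return AnalysisOutput(research.topic, tuple(f.upper() for f in research.findings))


def synthesis_stage(analysis: AnalysisOutput) -> str:
    return f"{analysis.topic}: " + "; ".join(analysis.key_points)


def run_pipeline(topic: str) -> str:
    """Data flows one way; no stage can call back into an earlier one."""
    return synthesis_stage(analysis_stage(research_stage(topic)))


print(run_pipeline("pricing"))  # pricing: FINDING ABOUT PRICING
```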

LangGraph is actually good at this. You can build a graph that's a pipeline with conditional branches (retry, escalate, take different paths) without giving agents the ability to have open-ended conversations with each other.

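# LangGraph pipeline: a structured handoff, not a conversation.
# ResearchState, the *_node functions, and check_quality are defined elsewhere.
from langgraph.graph import StateGraph, START, END

workflow = StateGraph(ResearchState)

workflow.add_node("research", research_node)     # LLM call 1
workflow.add_node("validate", validate_node)      # deterministic check
workflow.add_node("analyze", analyze_node)        # LLM call 2
workflow.add_node("format", format_node)          # deterministic template

workflow.add_edge(START, "research")              # entry point
workflow.add_edge("research", "validate")
# check_quality returns "pass" or "fail". It must count attempts in state to
# enforce the 2-attempt cap; nothing in the edges themselves limits retries.
workflow.add_conditional_edges("validate", check_quality, {
    "pass": "analyze",
    "fail": "research"   # retry with feedback, max 2 attempts
})
workflow.add_edge("analyze", "format")
workflow.add_edge("format", END)                  # terminate after formatting

app = workflow.compile()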

Notice: validate and format aren't LLM calls. They're deterministic code. This is the winning pattern — use LLMs for the parts that need intelligence, use code for the parts that need reliability.
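
The snippet above leaves `ResearchState`, `validate_node`, and `check_quality` undefined. Here is one way they might look, as a sketch under assumptions: the state fields, the three-finding threshold, and the choice to raise once retries are exhausted are all hypothetical. It relies only on the LangGraph conventions that nodes return partial state updates and that the routing function returns a key of the path map.

```python
from typing import TypedDict

MAX_ATTEMPTS = 2    # mirrors the "max 2 attempts" comment in the graph
MIN_FINDINGS = 3    # hypothetical quality threshold


class ResearchState(TypedDict, total=False):
    query: str
    findings: list[str]
    attempts: int    # how many times validation has run
    feedback: str    # empty when validation passed


def validate_node(state: ResearchState) -> dict:
    """Deterministic check; returns a partial state update."""
    findings = state.get("findings", [])
    attempts = state.get("attempts", 0) + 1
    if len(findings) >= MIN_FINDINGS:
        return {"attempts": attempts, "feedback": ""}
    return {
        "attempts": attempts,
        "feedback": f"only {len(findings)} findings; need {MIN_FINDINGS}",
    }


def check_quality(state: ResearchState) -> str:
    """Routing function: returns "pass" or "fail", or raises when retries run out."""
    if not state.get("feedback"):
        return "pass"
    if state.get("attempts", 0) >= MAX_ATTEMPTS:
        raise RuntimeError(f"research failed validation: {state['feedback']}")
    return "fail"
```

Raising keeps the failure explicit and bounded. An alternative would be a third path-map entry that routes to a dedicated failure node.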

Pattern 3: Supervisor with Workers

When you genuinely need multiple agents — different tools, different permission scopes, different models — use a strict supervisor pattern. One agent decides what to do. Worker agents execute specific, scoped tasks. Workers don't talk to each other.

┌────────────────────┐
│    Supervisor       │
│    (orchestrates)   │
│                     │
│  Decides: what to   │
│  do, in what order, │
│  with what params   │
└───┬────┬────┬──────┘
    │    │    │
    ▼    ▼    ▼
┌──────┐┌──────┐┌──────┐
│Worker││Worker││Worker│
│  A   ││  B   ││  C   │
│      ││      ││      │
│SQL   ││Slack ││Email │
│read  ││post  ││send  │
│only  ││only  ││only  │
└──────┘└──────┘└──────┘

Workers: no inter-communication
Workers: scoped permissions
Workers: single-purpose
Supervisor: full trace of what was delegated

This is how human organizations work. The VP doesn't have the analysts debate each other — she tells each one what to produce, reviews the results, and synthesizes. The workers are stateless, scoped, and replaceable.
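
The constraints in the diagram can be sketched in a few lines. Everything here is illustrative: the `Worker` and `Supervisor` classes are hypothetical, the handlers are stubs rather than real SQL or Slack clients, and in a real system an LLM would decide what to delegate. Workers hold no references to each other, each checks its own allowed actions, and the supervisor records every delegation.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Worker:
    name: str
    allowed_actions: frozenset[str]
    handler: Callable[[str, dict], str]

    def execute(self, action: str, params: dict) -> str:
        # Scope check lives in the worker, so a confused supervisor cannot bypass it.
        if action not in self.allowed_actions:
            raise PermissionError(f"{self.name} may not perform {action!r}")
        return self.handler(action, params)


@dataclass
class Supervisor:
    workers: dict[str, Worker]
    trace: list[tuple[str, str, dict, str]] = field(default_factory=list)

    def delegate(self, worker_name: str, action: str, params: dict) -> str:
        result = self.workers[worker_name].execute(action, params)
        self.trace.append((worker_name, action, params, result))  # full audit trail
        return result


# Stub handlers standing in for real integrations.
sql = Worker("sql", frozenset({"read"}), lambda a, p: f"rows for {p['query']}")
slack = Worker("slack", frozenset({"post"}), lambda a, p: "posted")

boss = Supervisor({"sql": sql, "slack": slack})
boss.delegate("sql", "read", {"query": "SELECT 1"})
boss.delegate("slack", "post", {"text": "report ready"})
print(len(boss.trace))  # 2
```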

Trust Gradients: Matching Architecture to Risk

At PartyBus, we use a trust gradient model (L0-L3) that maps directly to these architectural patterns. The insight is that the right agent architecture depends on how much trust you can place in the agent's outputs.

L0: No Autonomy (Human-in-the-Loop)

The agent suggests actions. A human approves every one. This is where you start with any new agent in a high-stakes domain.

Architecture: Single agent, all tool calls require human approval.

Example: An agent that drafts contract amendments. It proposes changes, a lawyer reviews and approves each one. The agent never modifies the contract directly.

# .partybus.yaml
trust_level: L0
approval:
  required_for: all_tool_calls
  approvers: [legal-team]
  timeout: 24h
  on_timeout: deny

L1: Guarded Autonomy

The agent can execute pre-approved action types independently. Novel or high-risk actions require approval. This is the sweet spot for most production deployments.

Architecture: Single agent or pipeline with deterministic guardrails.

Example: A customer support agent that can look up orders, check status, and issue refunds of up to $50. Refunds over $50 or account modifications require human approval.

trust_level: L1
approval:
  auto_approve:
    - tool: order_lookup
    - tool: refund
      conditions:
        amount_lte: 50.00
  require_approval:
    - tool: refund
      conditions:
        amount_gt: 50.00
    - tool: account_modify
  approvers: [support-leads]
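
To make the semantics of this config concrete, here is a sketch of the decision it describes. This is not PartyBus's implementation; the `decide` function and its return values are hypothetical. Anything not explicitly auto-approved, including unknown tools and malformed amounts, falls through to human approval.

```python
from decimal import Decimal

REFUND_LIMIT = Decimal("50.00")  # mirrors amount_lte in the config above


def decide(tool: str, params: dict) -> str:
    """Return "auto_approve" or "require_approval" for a proposed tool call."""
    if tool == "order_lookup":
        return "auto_approve"
    if tool == "refund":
        amount = params.get("amount")
        if isinstance(amount, Decimal) and Decimal("0") < amount <= REFUND_LIMIT:
            return "auto_approve"
    # Default: larger refunds, account changes, and novel tools go to a human.
    return "require_approval"


print(decide("refund", {"amount": Decimal("50.00")}))  # auto_approve
print(decide("refund", {"amount": Decimal("50.01")}))  # require_approval
print(decide("account_modify", {}))                    # require_approval
```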

L2: Supervised Autonomy

The agent operates independently with monitoring. Anomalies trigger alerts, not blocks. Human review is after-the-fact, not inline.

Architecture: Supervisor-worker pattern for complex tasks. Pipeline pattern for structured workflows.

Example: An internal analytics agent that runs queries, builds dashboards, and distributes reports. It operates autonomously but all actions are logged, cost is tracked, and weekly reviews catch drift.

trust_level: L2
monitoring:
  alert_on:
    - cost_per_request_gt: $5.00
    - tool_calls_per_request_gt: 30
    - error_rate_gt: 0.1
  review_cadence: weekly
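
The difference from L1 is that these thresholds produce alerts while the request proceeds. A sketch of that evaluation, with a hypothetical function name and thresholds copied from the config above:

```python
def alerts_for(request_cost: float, tool_calls: int, error_rate: float) -> list[str]:
    """Return the names of breached thresholds; the caller notifies, it does not block."""
    alerts = []
    if request_cost > 5.00:
        alerts.append("cost_per_request")
    if tool_calls > 30:
        alerts.append("tool_calls_per_request")
    if error_rate > 0.1:
        alerts.append("error_rate")
    return alerts


print(alerts_for(request_cost=6.20, tool_calls=12, error_rate=0.02))
# ['cost_per_request']
```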

L3: Full Autonomy

The agent operates without real-time oversight. Used only for well-understood, low-risk, high-volume tasks where the cost of human review exceeds the cost of errors.

Architecture: Single agent with comprehensive guardrails, circuit breakers, and automatic rollback.

Example: An agent that categorizes and routes incoming support tickets. It ran at L1 for three months with 99.2% accuracy, then was promoted to L3 with automatic fallback to L1 if accuracy drops below 95%.

trust_level: L3
guardrails:
  circuit_breaker:
    error_rate_threshold: 0.05
    window: 1h
    fallback: L1  # demote to guarded autonomy
  rollback:
    on_anomaly: auto
    notification: [ops-team]
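
A rough sketch of how a circuit breaker like this could behave: a sliding-window error rate that demotes the agent once the threshold is crossed. This is an illustration, not PartyBus's implementation. The class, the `min_samples` guard against tripping on a handful of requests, and the rule that a demoted agent stays demoted until someone promotes it again are all assumptions.

```python
from collections import deque


class CircuitBreaker:
    def __init__(self, error_rate_threshold: float = 0.05,
                 window_seconds: float = 3600.0, min_samples: int = 20):
        self.threshold = error_rate_threshold
        self.window = window_seconds
        self.min_samples = min_samples   # assumption: not part of the config above
        self.events = deque()            # (timestamp, is_error) pairs, oldest first
        self.trust_level = "L3"

    def record(self, timestamp: float, is_error: bool) -> str:
        """Record one outcome and return the resulting trust level."""
        self.events.append((timestamp, is_error))
        # Drop events that have aged out of the window.
        while self.events and self.events[0][0] <= timestamp - self.window:
            self.events.popleft()
        errors = sum(1 for _, failed in self.events if failed)
        total = len(self.events)
        if total >= self.min_samples and errors / total > self.threshold:
            self.trust_level = "L1"      # demote; promotion back is a manual decision
        return self.trust_level


breaker = CircuitBreaker()
for second in range(19):
    breaker.record(float(second), is_error=False)
print(breaker.record(19.0, is_error=True))  # L3: 1 error in 20 is exactly 5%, not above it
print(breaker.record(20.0, is_error=True))  # L1: 2 errors in 21 exceeds 5%
```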

The gradient isn't just a label — it configures the entire permission, monitoring, and approval stack. Promoting an agent from L1 to L2 is a deliberate operational decision, like promoting a deployment from staging to production. Demotion happens automatically when guardrails are breached.

The Takeaway

Multi-agent systems are a tool, not a goal. The question isn't "how many agents can we use?" — it's "what's the simplest architecture that reliably solves the problem with acceptable risk?"

For most teams, the answer is:

Start with a single agent and good tools. You'll be surprised how far this gets you.
Graduate to a pipeline when you need multiple LLM steps. Keep them unidirectional with typed contracts.
Use the supervisor-worker pattern when you need multiple agents. Keep workers dumb and scoped.
Match your architecture to your trust level. Start at L0, earn your way up.

If you're building agent systems and want guardrails that make your architecture production-safe — trust levels, scoped permissions, cost controls, automatic demotion — join the PartyBus beta. We're working with teams deploying agents across all four trust levels.

This post is part of a series on production-ready AI agents. Previously: Why Your AI Agents Work in Demos But Fail in Production and The Missing Layer: Authorization for AI Agents.