The four orchestration patterns that dominate production in 2026
Almost every agentic workflow in production today fits one of four patterns, or a hybrid of two. Naming them matters because the pattern dictates the failure modes you are signing up for.
1. Planner / Executor
One model writes a plan as structured output (often JSON), and a second model — or a deterministic runtime — executes each step. The planner only thinks; the executor only does. Anthropic's work on long-horizon coding agents popularised the pattern for software engineering; it also dominates research-and-synthesis workflows, where the planner outlines a report and parallel executors fill in each section.
Strengths: clean separation of concerns, easy to debug ("which step failed?"), naturally parallelisable across independent steps. Weakness: the plan is fixed at start time, so the pattern is brittle on tasks where early discoveries should rewrite the plan. Most production planner/executor systems mitigate this with a "re-plan" tool that the executor can call mid-run.
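A minimal sketch of the re-plan mitigation, in plain Python with no framework. Everything here is an illustrative assumption rather than a reference implementation: `run_plan`, `ReplanRequested`, and the `{"tool": ..., "args": ...}` plan shape are hypothetical names, and `stub_planner` stands in for a real planner model call.

```python
import json
from typing import Callable


class ReplanRequested(Exception):
    """Raised by a step handler when a discovery invalidates the remaining plan."""


def run_plan(goal: str,
             planner: Callable[[str, list], str],
             handlers: dict,
             max_replans: int = 2) -> list:
    """Execute a JSON plan step by step, re-planning when a handler asks for it.

    `planner(goal, notes)` is a stand-in for the planner model. It returns a JSON
    array of {"tool": ..., "args": {...}} steps covering the work still to do.
    """
    notes: list = []      # discoveries fed back to the planner
    results: list = []    # outputs of completed steps, kept across re-plans
    plan = json.loads(planner(goal, notes))
    i = 0
    while i < len(plan):
        step = plan[i]
        try:
            results.append(handlers[step["tool"]](**step["args"]))
            i += 1
        except ReplanRequested as exc:
            if len(notes) >= max_replans:
                raise RuntimeError("re-plan budget exhausted") from exc
            notes.append(str(exc))
            plan = json.loads(planner(goal, notes))  # replaces the remaining steps
            i = 0
    return results


# Hypothetical planner and tools, used only to exercise the loop.
def stub_planner(goal, notes):
    source = "live" if notes else "cache"
    return json.dumps([
        {"tool": "fetch", "args": {"source": source}},
        {"tool": "summarise", "args": {}},
    ])


def fetch(source):
    if source == "cache":
        raise ReplanRequested("cache is empty, fetch from the live source")
    return f"data from {source}"


def summarise():
    return "summary written"


results = run_plan("write the report", stub_planner,
                   {"fetch": fetch, "summarise": summarise})
print(results)
```

The re-plan budget matters as much as the re-plan tool itself: without `max_replans`, a planner that keeps producing the same failing plan turns into a loop.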
2. Supervisor + Workers
A supervisor model receives the user request and routes each sub-task to a specialist worker agent — a research worker, a code worker, a billing worker, and so on. The supervisor does not execute; it dispatches and aggregates. This is the dominant pattern for customer support automation and back-office workflows.
The pattern's win is specialisation: each worker can have its own system prompt, its own tools, and its own model. Its risk is the supervisor becoming a chokepoint: a bad routing decision sends the whole task down the wrong path, and the workers, scoped narrowly, often cannot recover. Production teams typically address this with a "fallback to supervisor" tool that any worker can call when its inputs do not match its scope.
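The fallback mechanism can be sketched as a dispatch loop in which a worker may hand the task back instead of answering. This is an assumed shape, not a specific framework's API: `WorkerResult`, `supervise`, and the keyword-based workers are hypothetical, and `naive_route` stands in for the supervisor model's routing decision.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class WorkerResult:
    answer: Optional[str] = None
    out_of_scope: bool = False   # the "fallback to supervisor" signal


def billing_worker(task: str) -> WorkerResult:
    if "invoice" not in task.lower():
        return WorkerResult(out_of_scope=True)
    return WorkerResult(answer="billing: invoice re-sent")


def code_worker(task: str) -> WorkerResult:
    if "bug" not in task.lower():
        return WorkerResult(out_of_scope=True)
    return WorkerResult(answer="code: patch drafted")


WORKERS = {"billing": billing_worker, "code": code_worker}


def naive_route(task: str, tried: list) -> Optional[str]:
    """Deliberately poor router: always prefers billing, then code."""
    for name in ("billing", "code"):
        if name not in tried:
            return name
    return None


def supervise(task: str,
              route: Callable[[str, list], Optional[str]],
              workers: dict,
              max_hops: int = 3):
    """Dispatch to a worker; re-route when the worker reports the task is out of scope."""
    tried: list = []
    for _ in range(max_hops):
        name = route(task, tried)
        if name is None:
            break
        result = workers[name](task)
        if not result.out_of_scope:
            return name, result.answer
        tried.append(name)   # tell the router not to pick this worker again
    return "human", None     # nobody accepted the task: escalate


print(supervise("There is a bug in checkout", naive_route, WORKERS))
print(supervise("Where is my parcel?", naive_route, WORKERS))
```

In the first call the bad routing decision is recovered because the billing worker declines; in the second, no worker accepts the task and the run ends in a human hand-off rather than a wrong answer.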
3. Hierarchical (Manager-of-Managers)
This is a supervisor of supervisors, used in workflows complex enough that the top-level router cannot reason about leaf tasks directly. One example is an enterprise research agent where the top level routes between "legal," "financial," and "technical" sub-supervisors, each of which routes between its own workers. The pattern is powerful but costly: every level of hierarchy adds latency, tokens, and a class of failure modes that flat workflows do not have.
In practice, most teams that build hierarchical workflows end up flattening them within six months. Three levels is rarely worth the operational complexity.
4. Graph-state (LangGraph-style)
Nodes are agents or tools; edges are transitions; state is a typed object that every node reads and updates. The graph is explicit, so the workflow becomes a piece of software you can lint, type-check, and unit-test rather than a prompt to reason about. LangGraph popularised the pattern; equivalent libraries exist in most ecosystems by 2026.
Graph-state workflows trade some of the flexibility of free-form agents for dramatically better observability and resumability. If a run fails at node 7, you re-run from node 7 instead of starting over. For workflows that touch external systems with side effects — payments, calendar bookings, ticket creation — this property alone justifies the pattern.
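The resumability property can be illustrated without any library. The sketch below is plain Python, not LangGraph's API: `State`, `run_graph`, and the three nodes are hypothetical, and the in-memory `checkpoints` dict stands in for a durable checkpoint store.

```python
from dataclasses import dataclass, replace
from typing import Callable, Optional


@dataclass(frozen=True)
class State:
    request: str
    quote: Optional[float] = None
    payment_id: Optional[str] = None
    ticket_id: Optional[str] = None


Node = Callable[[State], State]


def run_graph(nodes: list, state: State, checkpoints: dict,
              start_at: Optional[str] = None) -> State:
    """Run (name, node) pairs in order, saving the state on entry to each node."""
    names = [name for name, _ in nodes]
    start = names.index(start_at) if start_at else 0
    for name, node in nodes[start:]:
        checkpoints[name] = state   # enough to re-enter this node later
        state = node(state)
    return state


charges = []                 # records side effects so double-charging is visible
attempts = {"ticket": 0}


def price(s: State) -> State:
    return replace(s, quote=42.0)


def pay(s: State) -> State:
    charges.append(s.quote)  # side effect: must not run twice
    return replace(s, payment_id="pay_1")


def ticket(s: State) -> State:
    attempts["ticket"] += 1
    if attempts["ticket"] == 1:
        raise TimeoutError("ticket API timed out")
    return replace(s, ticket_id="T-1")


NODES = [("price", price), ("pay", pay), ("ticket", ticket)]
checkpoints: dict = {}
try:
    final = run_graph(NODES, State("refund order"), checkpoints)
except TimeoutError:
    # Resume from the failed node with the state it was entered with.
    final = run_graph(NODES, checkpoints["ticket"], checkpoints, start_at="ticket")

print(final, charges)
```

The payment node runs exactly once even though the run failed and was resumed, which is the behaviour that matters for side-effecting tools.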
What an agentic workflow actually costs
The number teams quote — "the model API bill" — is usually less than half of the real cost. A realistic decomposition for a moderately complex agentic workflow handling 10,000 runs per month, with an average of 8 LLM calls and 4 tool calls per run, looks roughly like this:
| Cost line | Monthly range (USD) | Main drivers |
| --- | --- | --- |
| LLM tokens (frontier model for reasoning, small model for routing) | $1,200 – $4,500 | Model mix, average input/output tokens per step, prompt-cache hit rate |
| Tool / API calls (search, scrapers, payments, calendar) | $200 – $1,800 | Vendor pricing, retry rate |
| Vector DB and embeddings | $150 – $700 | Corpus size, query volume, re-embedding frequency |
| Observability (LangSmith, Langfuse, Helicone, in-house) | $150 – $900 | Trace volume, retention window |
| Eval runs (offline + online sampling) | $200 – $1,500 | Eval suite size, frequency, LLM-judged vs deterministic |
| Engineering time (amortised, one platform engineer at 30–60%) | $3,000 – $8,000 | Maintenance, prompt drift, vendor churn |
Two numbers move this total more than anything else: prompt-cache hit rate and retry rate. A workflow that re-sends the same 8,000-token system prompt on every step without caching can burn through its token budget two to three times faster than one that uses Anthropic's prompt caching or the equivalent on other providers. A workflow whose tool calls fail and retry silently 15% of the time can out-spend a sibling workflow at twice the run volume.
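The caching effect is easy to estimate on the back of an envelope. The sketch below isolates the system-prompt portion of the token bill for the workload described above (10,000 runs, 8 LLM calls per run, an 8,000-token system prompt). The prices are assumptions chosen for illustration, not quotes: $3 per million input tokens, cached reads billed at 10% of that, and no cache-write surcharge. Substitute your provider's actual rates.

```python
def system_prompt_cost(runs: int,
                       calls_per_run: int,
                       prompt_tokens: int,
                       usd_per_mtok: float,
                       cache_hit_rate: float = 0.0,
                       cached_read_fraction: float = 0.1) -> float:
    """Monthly USD spent re-sending the system prompt.

    cached_read_fraction is the assumed price of a cached token relative to an
    uncached one. Cache-write surcharges are ignored for simplicity.
    """
    total_tokens = runs * calls_per_run * prompt_tokens
    hit_tokens = total_tokens * cache_hit_rate
    miss_tokens = total_tokens - hit_tokens
    billable = miss_tokens + hit_tokens * cached_read_fraction
    return billable * usd_per_mtok / 1_000_000


uncached = system_prompt_cost(10_000, 8, 8_000, usd_per_mtok=3.0)
cached = system_prompt_cost(10_000, 8, 8_000, usd_per_mtok=3.0, cache_hit_rate=0.9)
print(f"uncached: ${uncached:,.0f}  cached at 90% hits: ${cached:,.0f}")
```

Under these assumed prices the system prompt alone costs roughly $1,920 a month uncached and roughly $365 at a 90% hit rate. The effect on the whole bill is smaller than that ratio suggests, because output tokens and per-step context are not cacheable in the same way.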
The failure modes that actually break agentic systems
Infinite loops
The classic agentic failure: the model calls a tool, the result is ambiguous, it calls the same tool again with marginally different arguments, and again, and again. Most production systems cap the loop with a hard step limit (often 20–40 steps) and a soft penalty (the system prompt tells the model when a loop has been detected). The hard cap is non-negotiable. The soft penalty saves money.
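Both controls fit in a few lines. The sketch below is one possible shape, with hypothetical names (`LoopGuard`, `StepLimitExceeded`) and an assumed detection rule: a loop is suspected when the same tool is called several times in a row, regardless of arguments, since the arguments in this failure differ only marginally.

```python
from typing import Optional


class StepLimitExceeded(RuntimeError):
    """Hard cap reached: the runtime must stop the run."""


class LoopGuard:
    def __init__(self, max_steps: int = 30, repeat_threshold: int = 3):
        self.max_steps = max_steps
        self.repeat_threshold = repeat_threshold
        self.steps = 0
        self.last_tool: Optional[str] = None
        self.streak = 0

    def check(self, tool_name: str) -> Optional[str]:
        """Call before each tool invocation.

        Raises StepLimitExceeded at the hard cap. Returns a warning string to
        inject into the model's context when a loop is suspected, else None.
        """
        self.steps += 1
        if self.steps > self.max_steps:
            raise StepLimitExceeded(f"stopped after {self.max_steps} steps")
        self.streak = self.streak + 1 if tool_name == self.last_tool else 1
        self.last_tool = tool_name
        if self.streak >= self.repeat_threshold:
            return (f"Loop detected: '{tool_name}' called {self.streak} times in a row. "
                    "Change approach or finish with what you have.")
        return None


guard = LoopGuard(max_steps=5, repeat_threshold=3)
warnings = [guard.check("search") for _ in range(5)]
print(warnings)
```

The first two calls pass silently, the next three return the warning, and a sixth call would raise. The warning is the cheap intervention; the exception is the one that cannot be argued with.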
Goal drift
Twenty steps into a workflow, the model has quietly forgotten the original instruction and is optimising for something subtly different. This is most acute in long-horizon agents — research, coding, multi-step planning. The most effective mitigations are surprisingly old-school: keep the original instruction pinned in the system prompt across every step, summarise progress every N steps, and re-state the goal at the top of each new sub-task. Frontier-class models in 2026 are dramatically better at this than 2024-era models — but not good enough to skip the mitigations.
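The three mitigations amount to a rule for how the prompt is assembled at each step. A minimal sketch follows, assuming a hypothetical `build_prompt` helper and a `summarise` callable that stands in for a summarisation model call; the `{"system": ..., "user": ...}` return shape is illustrative, not a specific provider's message format.

```python
from typing import Callable


def build_prompt(goal: str,
                 subtask: str,
                 history: list,
                 summarise: Callable[[list], str],
                 every_n: int = 5) -> dict:
    """Assemble the context for the next step.

    - The original instruction is pinned in the system prompt on every step.
    - Completed blocks of `every_n` steps are folded into a summary.
    - The goal is restated at the top of the sub-task message.
    """
    cutoff = (len(history) // every_n) * every_n
    summary = summarise(history[:cutoff]) if cutoff else ""
    recent = history[cutoff:]            # steps not yet summarised, kept verbatim

    user_parts = [f"Goal: {goal}", f"Current sub-task: {subtask}"]
    if summary:
        user_parts.append(f"Progress so far: {summary}")
    user_parts.extend(recent)
    return {
        "system": f"Original instruction (applies to every step): {goal}",
        "user": "\n".join(user_parts),
    }


history = [f"step {i} result" for i in range(7)]
prompt = build_prompt(
    goal="Compare three vendors on price and SLA",
    subtask="Collect SLA terms for vendor C",
    history=history,
    summarise=lambda steps: f"{len(steps)} steps completed",
)
print(prompt["user"])
```

With seven steps of history and `every_n=5`, the first five steps collapse into one summary line and only the last two are carried verbatim, so the goal stays near the top of a context that does not grow without bound.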
Context contamination
Tool outputs include hallucinated facts, error messages get parsed as ground truth, or one worker's output poisons another worker's context. The cleanest defence is a typed state object (the graph-state pattern again) where each field has a known provenance and an expected shape — and where workers cannot write outside their assigned slots.
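A sketch of what slot ownership could look like in practice. `SharedState`, `Slot`, and the schema format are hypothetical; the point is only that every write is checked against an owner and an expected type, and that every stored value carries its provenance.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Slot:
    value: Any
    written_by: str   # provenance: which worker produced this value


class SharedState:
    def __init__(self, schema: dict):
        # schema maps slot name -> (owning worker, expected type)
        self._schema = schema
        self._slots: dict = {}

    def write(self, worker: str, slot: str, value: Any) -> None:
        owner, expected = self._schema.get(slot, (None, object))
        if owner != worker:
            raise PermissionError(f"{worker!r} may not write slot {slot!r}")
        if not isinstance(value, expected):
            raise TypeError(f"slot {slot!r} expects {expected.__name__}")
        self._slots[slot] = Slot(value, worker)

    def read(self, slot: str) -> Slot:
        return self._slots[slot]


state = SharedState({
    "search_hits": ("research", list),
    "draft": ("writer", str),
})
state.write("research", "search_hits", ["doc-1", "doc-2"])

try:
    # The writer tries to overwrite the research worker's evidence.
    state.write("writer", "search_hits", ["made-up source"])
except PermissionError as exc:
    rejected = str(exc)

print(state.read("search_hits"), rejected)
```

A type check does not catch a hallucinated fact, but it does stop an error string landing in a field that downstream nodes treat as a list of documents, and the `written_by` field tells you where to look when something is wrong.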
Cascading tool failures
The vector DB times out, the agent retries, the retry succeeds with a stale result, the agent acts on the stale result, and downstream tools fail in confusing ways. In production this is the single most expensive class of failure, because the symptom usually surfaces far from the cause. The mitigations are unglamorous: timeouts on every tool, structured error returns that the model is trained to interpret, and a circuit breaker that pulls the agent out of the loop and into a human queue after N consecutive related failures.
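The three mitigations compose naturally. The following is a sketch using only the standard library, with hypothetical names (`ToolResult`, `call_with_timeout`, `CircuitBreaker`) and a deliberately slow stand-in tool. Note one limitation of this thread-based approach: the timed-out call keeps running in the background, so a real system would also need cancellation on the tool side.

```python
import concurrent.futures as cf
import time
from dataclasses import dataclass
from typing import Any, Callable


@dataclass
class ToolResult:
    """Structured return the model can interpret instead of a raw stack trace."""
    ok: bool
    value: Any = None
    error_code: str = ""
    retryable: bool = False


def call_with_timeout(fn: Callable, timeout_s: float, *args) -> ToolResult:
    pool = cf.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(fn, *args)
    try:
        return ToolResult(ok=True, value=future.result(timeout=timeout_s))
    except cf.TimeoutError:
        return ToolResult(ok=False, error_code="timeout", retryable=True)
    except Exception as exc:
        return ToolResult(ok=False, error_code=type(exc).__name__)
    finally:
        pool.shutdown(wait=False)   # do not block on the abandoned call


class CircuitBreaker:
    def __init__(self, max_consecutive_failures: int = 3):
        self.max_consecutive_failures = max_consecutive_failures
        self.failures = 0

    def record(self, result: ToolResult) -> bool:
        """Return True when the run should leave the loop for a human queue."""
        self.failures = 0 if result.ok else self.failures + 1
        return self.failures >= self.max_consecutive_failures


def slow_vector_search(query: str) -> list:
    time.sleep(0.2)                 # stand-in for a degraded vector DB
    return ["stale-doc"]


breaker = CircuitBreaker(max_consecutive_failures=3)
escalations = []
for _ in range(3):
    result = call_with_timeout(slow_vector_search, 0.05, "refund policy")
    escalations.append(breaker.record(result))

print(result, escalations)
```

The breaker counts consecutive failures only, so a single success resets it. Whether failures are "related" in the sense used above is a judgment the sketch does not attempt; a per-tool breaker is one simple approximation.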
Observability and evals: the part nobody wants to build
An agentic workflow without traces is debugged by guessing. The minimum production setup in 2026 includes:
- Per-run traces capturing every LLM call, every tool call, inputs, outputs, latencies, and token counts. LangSmith, Langfuse, and Helicone are the three names that come up most.
- Offline evals — a set of frozen test cases the workflow is re-run against before any prompt or model change ships. Most teams under-invest here until their first major regression, then over-correct.
- Online evals — sampled production traces scored by an LLM judge against rubrics (was the answer grounded? did the workflow halt correctly?). The judge is itself a source of error; calibrate it against a human-labelled subset quarterly.
- Cost and latency dashboards per step. Most cost blowups are localised to one node in the graph.
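The per-step dashboard in the last item is, at its core, a group-by over trace spans. A minimal sketch, assuming a hypothetical `Span` record and made-up token prices; real tracing tools export richer data, but these fields are enough to find the expensive node.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    run_id: str
    node: str
    latency_ms: float
    input_tokens: int
    output_tokens: int


def cost_by_node(spans: list, usd_per_mtok_in: float, usd_per_mtok_out: float) -> list:
    """Aggregate calls, cost, and latency per node, most expensive node first."""
    totals = defaultdict(lambda: {"calls": 0, "usd": 0.0, "latency_ms": 0.0})
    for s in spans:
        row = totals[s.node]
        row["calls"] += 1
        row["usd"] += (s.input_tokens * usd_per_mtok_in
                       + s.output_tokens * usd_per_mtok_out) / 1_000_000
        row["latency_ms"] += s.latency_ms
    return sorted(totals.items(), key=lambda kv: kv[1]["usd"], reverse=True)


# Illustrative spans from two runs; the prices below are assumptions.
spans = [
    Span("run-1", "router", 120.0, 500, 20),
    Span("run-1", "synthesise", 2400.0, 9_000, 800),
    Span("run-2", "router", 110.0, 480, 18),
    Span("run-2", "synthesise", 2600.0, 9_500, 750),
]
report = cost_by_node(spans, usd_per_mtok_in=3.0, usd_per_mtok_out=15.0)
for node, row in report:
    print(f"{node:12s} calls={row['calls']} usd={row['usd']:.4f} "
          f"avg_latency_ms={row['latency_ms'] / row['calls']:.0f}")
```

Sorting by cost puts the node worth optimising at the top of the report, which is usually all the dashboard needs to do.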
For the business-level metrics that sit on top of trace-level evals, the related piece on how to measure AI ROI is the natural companion read.
When agentic is overkill
The hardest skill in 2026 is recognising the workflows that should not exist. A useful filter, in three questions:
- Is the task plan stable enough to express as code? If yes, write the code and use the LLM only for the genuinely non-deterministic parts.
- Does each "agent" add a verifiable capability? If two agents could be one agent with one more tool, they should be.
- Could a single tool-using LLM call solve the task in under three steps? If yes, skip the orchestration and revisit only when scale forces the issue.
The current bias in the industry is to over-orchestrate. The teams that ship reliably are the ones that fight that bias — they use agentic patterns for the small portion of their product that genuinely needs autonomy, and ordinary deterministic software for everything else.
Where this is heading
Two directions are converging by mid-2026. First, model-side improvements — better long-horizon reasoning, native parallelism in frontier models, cheaper inference — are pushing more tasks back inside the single-agent envelope. What required orchestration in 2024 increasingly fits in a single ReAct loop today. Second, MCP (Model Context Protocol) standardisation is making the tool layer interchangeable, which lowers the cost of swapping models inside a workflow without rewriting the agents.
For foundational primers, the companion pieces on how AI agents work, RAG fundamentals, and RAG vs. fine-tuning vs. long context are useful. For model selection, the Claude vs ChatGPT vs Gemini comparison is the right next read. For costs at the program level, see AI development cost in 2026.