Benchmark · Multi-Agent LLM Systems

MAS-PromptBench

When Does Prompt Optimization Improve Multi-Agent LLM Systems?

A benchmark for system-prompt optimization in multi-agent LLM systems

[Figure: two panels, Single Agent and Multi-Agent Systems]

Prompt-optimization gains from GEPA, a state-of-the-art optimizer, in single-agent and multi-agent settings. While GEPA consistently improves single-agent performance across all five diverse tasks, its natural multi-agent extension yields highly variable effects across tasks and workflow topologies, ranging from large gains to severe performance drops.

How much can prompt optimization help in MAS,
and how does its effect vary across configurations?

Contributions

1. MAS-PromptBench: a benchmark for MAS prompt optimization

A comprehensive benchmark for evaluating system-prompt optimizers for MAS. It spans diverse MAS configurations across task domains (reasoning, coding, tool calling), five workflow topologies (both existing and newly constructed systems), communication protocols from free-form to highly structured coordination, varying team sizes, and two default prompt optimizers. It provides a foundation for proposing, analyzing, and comparing system-prompt optimization algorithms under controlled MAS configurations.

2. Prompt optimization gains and failures for MAS

We systematically evaluate a natural multi-agent extension of GEPA, a state-of-the-art single-agent prompt optimizer, against default prompts. The results highlight the promise of prompt optimization for MAS: improvements reach up to +24.0 points. Yet they also reveal the need for principled algorithms tailored to multi-agent settings, as performance can drop by as much as 16.0 points in certain configurations.

3. Insights into when prompt optimization works for MAS

Optimization shows greater potential when tasks have explicit, controllable, and verifiable agent-local behaviors, and when communication protocols impose an explicit shared structure that makes agent interactions easier to control and transfer; it also needs to be aware of the workflow topology. Optimization becomes harder as team size grows, which confirms the challenges of scaling MAS prompt optimization and motivates more scalable, robust algorithms.

Benchmark at a Glance

MAS-PromptBench varies each configuration axis independently while holding the others fixed.

Tasks: GPQA-Diamond, HotpotQA, MATH, LiveCodeBench, APPS, SWE-Bench Verified, BFCL, ToolHop, API-Bank
Frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK
Topologies: Single, Independent, Sequential, Centralized, Decentralized
Team Sizes: 2, 4, 8, 10
Communication Protocols: Freeform, Semi-structured, Structured
Optimizers: GEPA, MIPRO
Optimized configuration cells: each is a paired baseline vs. optimized evaluation

Multi-Agent Systems & Four Factors

Optimizer: GEPA, MIPRO, ...

Task Dataset
Reasoning: GPQA-Diamond, HotpotQA, MATH
Coding: LiveCodeBench, APPS, SWE-Bench
Tool-Calling: BFCL, ToolHop, API-Bank
...

Multi-Agent LLM System
Workflow Topology: Single, Sequential, Independent, Centralized, Decentralized, ...
Communication Protocol: Freeform, Semi-structured, Structured, ...
Team Size: 2, 4, 8, 10, ...

Overview of MAS-PromptBench. Given an input task, a multi-agent system produces a final solution through interactions among LLM-based agents. MAS-PromptBench measures prompt-optimization gains across four axes: task distribution, workflow topology, communication protocol, and team size.

We model a MAS as a tuple $\mathcal{M} = (\mathcal{A}, G, P)$: a collection of agents $\mathcal{A}$, an inter-agent coordination workflow $G$, and a communication protocol $P$. Each agent pairs a frozen LLM with a learnable system prompt. With model weights fixed, system-prompt optimization maximizes the expected task score over the joint prompts. For a fixed optimizer and base model, and a configuration $(\mathcal{T}, G, n, P)$ of task distribution $\mathcal{T}$, workflow $G$, team size $n$, and protocol $P$, we measure the prompt-optimization gain $\Delta$ as the expected difference in task score $\mu$ between the system run with the optimized prompts $\pi^\star$ and with the unoptimized prompts $\pi^0$:

$$\Delta(\mathcal{T}, G, n, P) = \mathbb{E}_{(x,y)\sim\mathcal{T}}\big[\,\mu(\mathcal{M}(x;\pi^\star), y) - \mu(\mathcal{M}(x;\pi^0), y)\,\big]$$
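The gain above is a paired mean difference, which can be estimated on a finite evaluation set. The following is a minimal sketch under stated assumptions, not the benchmark's implementation: `optimization_gain`, `toy_system`, and `exact_match` are hypothetical names, the prompt dictionary layout is assumed, and the toy system merely stands in for a real MAS.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

Prompts = Dict[str, str]   # agent name -> system prompt (assumed layout)
Example = Tuple[str, str]  # (input x, reference y)


def optimization_gain(
    system: Callable[[str, Prompts], str],
    metric: Callable[[str, str], float],
    examples: Sequence[Example],
    baseline: Prompts,
    optimized: Prompts,
) -> float:
    """Empirical estimate of Delta: mean per-example score difference."""
    diffs = [
        metric(system(x, optimized), y) - metric(system(x, baseline), y)
        for x, y in examples
    ]
    return mean(diffs)


def toy_system(x: str, prompts: Prompts) -> str:
    # Hypothetical stand-in for a MAS: uppercases only when the prompt says so.
    return x.upper() if prompts.get("solver") == "uppercase" else x


def exact_match(prediction: str, reference: str) -> float:
    return 1.0 if prediction == reference else 0.0


examples = [("a", "A"), ("b", "B"), ("c", "c"), ("d", "D")]
gain = optimization_gain(
    toy_system,
    exact_match,
    examples,
    baseline={"solver": "identity"},
    optimized={"solver": "uppercase"},
)
print(f"gain = {100 * gain:+.1f} points")
```

Because both prompt sets are scored on the same examples, per-example losses (the third example here) offset wins inside the mean, which is how a single cell can show a negative $\Delta$.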

Workflow Topologies

[Figure: diagrams of the five workflow topologies (Single, Sequential, Independent, Centralized, Decentralized). Legend: LLM Agent, Expertise, Task, Solution, Communication.]

The five coordination structures evaluated by our protocol. Arrows indicate message flow; nodes are agents.

Single. A single agent solves the task alone — the reference for whether coordination adds value.
Independent. $n$ agents solve the task in parallel without messaging; outputs are aggregated by majority vote.
Sequential. Agents form a directed chain $A_1 \to A_2 \to \cdots \to A_n$; each output is the next agent's input.
Centralized. A coordinator dispatches subtasks to workers and synthesizes their outputs; workers do not communicate directly.
Decentralized. All agents exchange messages over a fully connected graph for a fixed number of rounds.
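The five structures above can be sketched as directed message graphs. This is an illustrative sketch only: `message_edges` and `majority_vote` are hypothetical helpers, agent 0 is assumed to be the coordinator in the Centralized case, and return edges from workers to the coordinator are assumed because the coordinator synthesizes their outputs.

```python
from collections import Counter
from itertools import permutations
from typing import List, Set, Tuple

Edge = Tuple[int, int]  # (sender index, receiver index)


def message_edges(topology: str, n: int) -> Set[Edge]:
    """Directed agent-to-agent message edges for a team of n agents."""
    if topology in ("single", "independent"):
        return set()  # no inter-agent messaging
    if topology == "sequential":
        return {(i, i + 1) for i in range(n - 1)}  # chain A_1 -> ... -> A_n
    if topology == "centralized":
        # Assumption: agent 0 coordinates; workers only talk to agent 0.
        workers = range(1, n)
        return {(0, w) for w in workers} | {(w, 0) for w in workers}
    if topology == "decentralized":
        return set(permutations(range(n), 2))  # fully connected graph
    raise ValueError(f"unknown topology: {topology}")


def majority_vote(outputs: List[str]) -> str:
    """Aggregate Independent outputs; ties go to the earliest-seen answer."""
    return Counter(outputs).most_common(1)[0][0]


for name in ("independent", "sequential", "centralized", "decentralized"):
    print(name, len(message_edges(name, 4)))
print(majority_vote(["42", "41", "42"]))
```

The edge counts make the scaling difference visible: a Sequential chain grows linearly in $n$, while a Decentralized team grows quadratically.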

Findings

When does a better local prompt become a better system? Each factor controls a different stage of that transfer.

Factor 1 · Task

Explicit, verifiable tasks gain more

Coding and tool-calling tasks gain more — and more consistently — than reasoning tasks, because their structured interfaces (code, tests, function calls) preserve local improvements across handoffs. Domain averages capture it: coding +3.7 and tool-calling +4.3 points, versus only +1.3 for reasoning. The peaks are large too — up to +24.0 on Sequential BFCL — while reasoning tops out near +8.0.

| Domain | Task (metric) | Single (LangGraph) | Independent (LangGraph) | Sequential (CrewAI) | Centralized (AutoGen) | Decentralized (OpenAI SDK) | Average |
|---|---|---|---|---|---|---|---|
| Reasoning | GPQA-Diamond (Acc.) | 54.0 / 58.0 (+4.0) | 73.0 / 73.0 (0.0) | 53.0 / 56.0 (+3.0) | 74.0 / 74.0 (0.0) | 60.0 / 60.0 (0.0) | 62.8 / 64.2 (+1.4) |
| Reasoning | HotpotQA (EM) | 26.0 / 39.0 (+13.0) | 27.0 / 26.0 (−1.0) | 27.0 / 27.0 (0.0) | 20.0 / 22.0 (+2.0) | 16.0 / 18.0 (+2.0) | 23.2 / 26.4 (+3.2) |
| Reasoning | MATH (Acc.) | 49.0 / 51.0 (+2.0) | 76.0 / 60.0 (−16.0) | 58.0 / 62.0 (+4.0) | 63.0 / 69.0 (+6.0) | 66.0 / 66.0 (0.0) | 62.4 / 61.6 (−0.8) |
| Coding | LiveCodeBench (pass@1) | 12.0 / 12.0 (0.0) | 14.0 / 18.0 (+4.0) | 12.0 / 18.0 (+6.0) | 14.0 / 16.0 (+2.0) | 8.0 / 12.0 (+4.0) | 12.0 / 15.2 (+3.2) |
| Coding | APPS (pass@1) | 52.0 / 66.0 (+14.0) | 74.0 / 78.0 (+4.0) | 62.0 / 80.0 (+18.0) | 62.0 / 76.0 (+14.0) | 74.0 / 74.0 (0.0) | 64.8 / 74.8 (+10.0) |
| Coding | SWE-Bench Verified | 33.3 / 30.0 (−3.3) | 36.7 / 33.3 (−3.4) | 33.3 / 30.0 (−3.3) | 16.7 / 20.0 (+3.3) | 40.0 / 36.7 (−3.3) | 32.0 / 30.0 (−2.0) |
| Tool-Calling | BFCL (Acc.) | 84.0 / 88.0 (+4.0) | 88.0 / 88.0 (0.0) | 60.0 / 84.0 (+24.0) | 96.0 / 96.0 (0.0) | 84.0 / 88.0 (+4.0) | 82.4 / 88.8 (+6.4) |
| Tool-Calling | ToolHop (Acc.) | 62.0 / 64.0 (+2.0) | 62.0 / 68.0 (+6.0) | 66.0 / 73.0 (+7.0) | 68.0 / 69.0 (+1.0) | 67.0 / 71.0 (+4.0) | 65.0 / 69.0 (+4.0) |
| Tool-Calling | API-Bank (Acc.) | 77.0 / 79.0 (+2.0) | 74.0 / 76.0 (+2.0) | 60.0 / 66.0 (+6.0) | 77.0 / 72.0 (−5.0) | 62.0 / 69.0 (+7.0) | 70.0 / 72.4 (+2.4) |

Prompt-optimization gains of MAS-GEPA for nine diverse tasks on popular existing MAS frameworks. Each cell reports baseline / optimized performance, followed by the signed change $\Delta$ in percentage points. Blue indicates improvement, orange indicates regression, and gray indicates no change.
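The domain averages quoted in this section can be reproduced from the Average column of the table. The sketch below assumes each domain average is the unweighted mean of its three tasks' average $\Delta$; the variable names are illustrative.

```python
from statistics import mean

# Per-task average gain (Average column of the table above), in points.
avg_delta = {
    "Reasoning": {"GPQA-Diamond": 1.4, "HotpotQA": 3.2, "MATH": -0.8},
    "Coding": {"LiveCodeBench": 3.2, "APPS": 10.0, "SWE-Bench Verified": -2.0},
    "Tool-Calling": {"BFCL": 6.4, "ToolHop": 4.0, "API-Bank": 2.4},
}

# Unweighted mean over the three tasks in each domain, rounded to one decimal.
domain_gain = {
    domain: round(mean(tasks.values()), 1) for domain, tasks in avg_delta.items()
}

for domain, gain in domain_gain.items():
    print(f"{domain}: {gain:+.1f}")
```

Under that assumption the result matches the figures in the text: +1.3 for reasoning, +3.7 for coding, and +4.3 for tool-calling.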

Takeaway. Prompt optimization shows greater potential on tasks with explicit, controllable, and verifiable agent-local behaviors, such as coding and tool-calling, than on reasoning tasks.

Factor 2 · Workflow Topology

Multi-agent systems need topology-awareness

Every multi-agent topology gains less on average than the single-agent setting (+4.2 points); the best of the four, Decentralized, reaches only +2.3, so MAS is simply harder to optimize. Gains also swing by topology: on API-Bank, the same optimizer lifts Sequential by +9.0 yet lowers Centralized by 5.0. The swings have structure: Independent can erase gains as parallel agents overwrite one another (MATH −16.0), while Centralized amplifies both wins and losses (APPS +14.0, HotpotQA −9.0). This motivates topology-aware optimization rather than a one-size-fits-all approach.

| Task (metric) | Single | Independent | Sequential | Centralized | Decentralized |
|---|---|---|---|---|---|
| GPQA (Acc.) | 54.0 / 58.0 (+4.0) | 73.0 / 73.0 (0.0) | 75.0 / 78.0 (+3.0) | 70.0 / 70.0 (0.0) | 71.0 / 71.0 (0.0) |
| HotpotQA (EM) | 26.0 / 39.0 (+13.0) | 27.0 / 26.0 (−1.0) | 29.0 / 28.0 (−1.0) | 19.0 / 10.0 (−9.0) | 20.0 / 32.0 (+12.0) |
| MATH (Acc.) | 49.0 / 51.0 (+2.0) | 76.0 / 60.0 (−16.0) | 74.0 / 74.0 (0.0) | 66.0 / 69.0 (+3.0) | 81.0 / 81.0 (0.0) |
| LiveCodeBench (pass@1) | 12.0 / 12.0 (0.0) | 14.0 / 18.0 (+4.0) | 16.0 / 16.0 (0.0) | 16.0 / 16.0 (0.0) | 18.0 / 18.0 (0.0) |
| APPS (pass@1) | 52.0 / 66.0 (+14.0) | 74.0 / 78.0 (+4.0) | 82.0 / 84.0 (+2.0) | 70.0 / 84.0 (+14.0) | 86.0 / 86.0 (0.0) |
| SWE-Bench Verified (Resolved) | 33.3 / 30.0 (−3.3) | 36.7 / 33.3 (−3.4) | 33.3 / 26.7 (−6.6) | 30.0 / 33.3 (+3.3) | 36.7 / 36.7 (0.0) |
| BFCL (Acc.) | 84.0 / 88.0 (+4.0) | 88.0 / 88.0 (0.0) | 84.0 / 80.0 (−4.0) | 92.0 / 96.0 (+4.0) | 88.0 / 88.0 (0.0) |
| ToolHop (Acc.) | 62.0 / 64.0 (+2.0) | 62.0 / 68.0 (+6.0) | 71.0 / 73.0 (+2.0) | 66.0 / 70.0 (+4.0) | 65.0 / 71.0 (+6.0) |
| API-Bank (Acc.) | 77.0 / 79.0 (+2.0) | 74.0 / 76.0 (+2.0) | 61.0 / 70.0 (+9.0) | 77.0 / 72.0 (−5.0) | 65.0 / 68.0 (+3.0) |
| Average | 49.9 / 54.1 (+4.2) | 58.3 / 57.8 (−0.5) | 58.4 / 58.9 (+0.5) | 56.2 / 57.8 (+1.6) | 59.0 / 61.3 (+2.3) |

Prompt-optimization gains of MAS-GEPA for five workflow topologies. Each cell reports baseline / optimized performance, followed by the signed change $\Delta$ in percentage points. Blue indicates improvement, orange indicates regression, and gray indicates no change.
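One way to quantify how much gains swing within each topology is to summarize the per-task $\Delta$ values by their minimum, maximum, and mean. The sketch below uses the signed changes from the table, listed in the table's row order; the summary itself is an illustrative analysis, not one reported by the authors.

```python
from statistics import mean

# Signed change per task (GPQA, HotpotQA, MATH, LiveCodeBench, APPS,
# SWE-Bench Verified, BFCL, ToolHop, API-Bank), copied from the table.
delta = {
    "Single":        [4.0, 13.0, 2.0, 0.0, 14.0, -3.3, 4.0, 2.0, 2.0],
    "Independent":   [0.0, -1.0, -16.0, 4.0, 4.0, -3.4, 0.0, 6.0, 2.0],
    "Sequential":    [3.0, -1.0, 0.0, 0.0, 2.0, -6.6, -4.0, 2.0, 9.0],
    "Centralized":   [0.0, -9.0, 3.0, 0.0, 14.0, 3.3, 4.0, 4.0, -5.0],
    "Decentralized": [0.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 3.0],
}

# (worst case, best case, mean) per topology, in points.
summary = {
    topology: (min(values), max(values), round(mean(values), 1))
    for topology, values in delta.items()
}

for topology, (worst, best, avg) in summary.items():
    print(f"{topology:<14} min {worst:+.1f}  max {best:+.1f}  mean {avg:+.1f}")
```

The means agree with the Average row, while the min and max columns show the spread that the averages hide: Independent and Centralized each contain both sizable gains and sizable regressions.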

Takeaway. Multi-agent systems need topology-aware prompt optimizers.

Factor 3 · Communication Protocol

Shared structure gives optimization more room

Prompt-optimization gain Δ (pp) by communication protocol, grouped by topology. Adding structure improves gain transfer.

More structure means larger gains: the average rises from +1.6 (Freeform) to +2.4 (Semi-structured) to +4.3 (Structured). The effect is largest on evidence-passing tasks like HotpotQA, where downstream agents must reuse upstream outputs, and weakest on LiveCodeBench, where executable code and tests decide correctness regardless of message format.
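To make the contrast concrete, here is a hypothetical sketch of a free-form message versus a structured one. The source does not specify the protocols' message formats, so the `StructuredMessage` schema, its field names, and the JSON encoding are all assumptions chosen for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class StructuredMessage:
    # Hypothetical schema; the benchmark's actual fields are not given here.
    sender: str
    answer: str
    evidence: List[str]


REQUIRED_FIELDS = ("sender", "answer", "evidence")


def parse_structured(raw: str) -> StructuredMessage:
    """Parse a JSON message and reject it if a required field is missing."""
    payload = json.loads(raw)  # raises a ValueError subclass on non-JSON text
    missing = [key for key in REQUIRED_FIELDS if key not in payload]
    if missing:
        raise ValueError(f"message is missing fields: {missing}")
    return StructuredMessage(
        sender=payload["sender"],
        answer=payload["answer"],
        evidence=list(payload["evidence"]),
    )


# Free-form: the downstream agent must infer what is answer vs. evidence.
freeform = "I think it's Paris, the article on France says so."

# Structured: the same content under an explicit shared schema.
structured = json.dumps(
    {"sender": "retriever", "answer": "Paris", "evidence": ["France article"]}
)

msg = parse_structured(structured)
print(msg.answer, msg.evidence)
```

With a schema like this, a downstream agent can read `msg.evidence` directly instead of re-extracting it from prose, which is one plausible reading of why evidence-passing tasks benefit most from structure.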

Takeaway. Communication protocols with explicit shared structure make agent interactions easier to control and transfer, giving MAS prompt optimization more room to improve.

Factor 4 · Team Size

Larger teams make optimization harder

Prompt-optimization gain Δ (pp) across team sizes by topology. The dashed line is the mean across topologies — gains decay as teams grow.

As teams grow ($n \in \{2,4,8,10\}$), gains generally shrink: the average falls from +2.4 at $n{=}2$ to −2.1 at $n{=}10$, as agent-local improvements get diluted across more handoffs and intermediate states. Topology mediates the effect: Centralized HotpotQA collapses (from +5.0 to −12.0), while Decentralized HotpotQA stays nonnegative at every size.

Takeaway. Larger team size increases the challenge of prompt optimization for MAS: local agent improvements may fail to produce system-level gains.

BibTeX

@article{maspromptbench2026,
  title  = {When Does Prompt Optimization Improve Multi-Agent LLM Systems?},
  author = {Anonymous Authors},
  year   = {2026}
}