MAS-PromptBench

[Figure: two panels, Single Agent and Multi-Agent Systems.]

Prompt-optimization gains from GEPA, a state-of-the-art optimizer, in single-agent and multi-agent settings. While GEPA consistently improves single-agent performance across all five diverse tasks, its natural multi-agent extension has highly variable effects across tasks and workflow topologies, ranging from large gains to severe performance drops.

How much can prompt optimization help in MAS,
and how does its effect vary across configurations?

Contributions

1. MAS-PromptBench: a benchmark for MAS prompt optimization

MAS-PromptBench is a comprehensive benchmark for evaluating system-prompt optimizers for MAS. It spans diverse MAS configurations across task domains (reasoning, coding, tool calling), five workflow topologies (both existing and newly constructed systems), communication protocols from free-form to highly structured coordination, varying team sizes, and two default prompt optimizers. It provides a foundation for proposing, analyzing, and comparing system-prompt optimization algorithms under controlled MAS configurations.

2. Prompt optimization gains and failures for MAS

We systematically evaluate a natural multi-agent extension of GEPA, a state-of-the-art single-agent prompt optimizer, against default prompts. The results highlight the promise of prompt optimization for MAS: improvements reach up to +24.0 points. Yet they also reveal the need for principled algorithms tailored to multi-agent settings, as performance can drop by as much as 16.0 points for certain configurations.

3. Insights into when prompt optimization works for MAS

Optimization shows greater potential when tasks have explicit, controllable, and verifiable agent-local behaviors, and when communication protocols impose an explicit shared structure that makes agent interactions easier to control and transfer. It also needs to be aware of the workflow topology. Optimization becomes harder as team size grows, which confirms the challenges of scaling MAS prompt optimization and motivates more scalable, robust algorithms.

Benchmark at a Glance

MAS-PromptBench varies each configuration axis independently while holding the others fixed.

Tasks: GPQA-Diamond, HotpotQA, MATH, LiveCodeBench, APPS, SWE-Bench Verified, BFCL, ToolHop, API-Bank

Frameworks: LangGraph, CrewAI, AutoGen, OpenAI Agents SDK

Topologies: Single, Independent, Sequential, Centralized, Decentralized

Team Sizes: 2, 4, 8, 10

Communication Protocols: Freeform, Semi-structured, Structured

Optimizers: GEPA, MIPRO

Each optimized configuration cell is a paired baseline vs. optimized evaluation.

Multi-Agent Systems & Four Factors

[Figure: an optimizer (GEPA, MIPRO) and a task dataset (Reasoning: GPQA-Diamond, HotpotQA, MATH; Coding: LiveCodeBench, APPS, SWE-Bench; Tool-Calling: BFCL, ToolHop, API-Bank) feed a multi-agent LLM system defined by its workflow topology (Single, Sequential, Independent, Centralized, Decentralized), communication protocol (Freeform, Semi-structured, Structured), and team size (2, 4, 8, 10).]

Overview of the MAS-PromptBench benchmark. Given an input task, a multi-agent system produces a final solution through interactions among LLM-based agents. MAS-PromptBench measures prompt-optimization gains across four axes: task distribution, workflow topology, communication protocol, and team size.

We model a MAS as a tuple $\mathcal{M} = (\mathcal{A}, G, P)$: a collection of agents $\mathcal{A}$, an inter-agent coordination workflow $G$, and a communication protocol $P$. Each agent pairs a frozen LLM with a learnable system prompt. With model weights fixed, system-prompt optimization maximizes the expected task score over the joint prompts. For a fixed optimizer and base model, and a configuration $(\mathcal{T}, G, n, P)$ with task distribution $\mathcal{T}$ and team size $n$, we measure the prompt-optimization gain $\Delta$ as the expected difference in task score $\mu$ between the optimized ($\pi^\star$) and unoptimized ($\pi^0$) systems:

$$\Delta(\mathcal{T}, G, n, P) = \mathbb{E}_{(x,y)\sim\mathcal{T}}\big[\,\mu(\mathcal{M}(x;\pi^\star), y) - \mu(\mathcal{M}(x;\pi^0), y)\,\big]$$
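In practice the expectation above is estimated on a finite evaluation set. The following is a minimal sketch of that estimate, assuming per-example scores for the baseline and optimized systems are already available on the same examples; the function name and the sample scores are hypothetical and not taken from the benchmark code.

```python
from statistics import mean
from typing import Sequence


def prompt_optimization_gain(
    baseline_scores: Sequence[float],
    optimized_scores: Sequence[float],
) -> float:
    """Empirical estimate of Delta: the mean paired score difference.

    Both sequences must be aligned, i.e. index i refers to the same
    example (x_i, y_i) scored under pi^0 and under pi^star.
    """
    if len(baseline_scores) != len(optimized_scores):
        raise ValueError("scores must be paired on the same examples")
    if not baseline_scores:
        raise ValueError("need at least one example")
    return mean(o - b for b, o in zip(baseline_scores, optimized_scores))


# Hypothetical 0/1 correctness scores on four examples (illustration only).
baseline = [1.0, 0.0, 0.0, 1.0]
optimized = [1.0, 1.0, 0.0, 1.0]

# Multiply by 100 to report the gain in percentage points.
gain_points = 100 * prompt_optimization_gain(baseline, optimized)
```

Pairing the two runs on identical examples matters: the gain is a difference of scores on the same inputs, so evaluating the baseline and the optimized system on different samples would add noise unrelated to the prompts.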

Workflow Topologies

[Figure: one panel per topology (Single, Sequential, Independent, Centralized, Decentralized). Legend: LLM Agent, Expertise, Task, Solution, Communication.]

The five coordination structures evaluated by our protocol. Arrows indicate message flow; nodes are agents.

Single. A single agent solves the task alone — the reference for whether coordination adds value.

Independent. $n$ agents solve the task in parallel without messaging; outputs are aggregated by majority vote.

Sequential. Agents form a directed chain $A_1 \to A_2 \to \cdots \to A_n$; each output is the next agent's input.

Centralized. A coordinator dispatches subtasks to workers and synthesizes their outputs; workers do not communicate directly.

Decentralized. All agents exchange messages over a fully connected graph for a fixed number of rounds.
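The five structures above can be sketched as directed message graphs. This is an illustrative encoding only: the agent indexing, the choice of agent 0 as coordinator, and the tie-breaking rule in the vote are assumptions made here, not details given in the source.

```python
from collections import Counter
from itertools import permutations


def topology_edges(topology: str, n: int) -> list[tuple[int, int]]:
    """Directed message edges (sender, receiver) among agents 0..n-1."""
    if topology in ("single", "independent"):
        # No inter-agent messages: one agent, or n agents working in parallel.
        return []
    if topology == "sequential":
        # Chain A_1 -> A_2 -> ... -> A_n.
        return [(i, i + 1) for i in range(n - 1)]
    if topology == "centralized":
        # Assumption: agent 0 is the coordinator; workers only talk to it.
        dispatch = [(0, w) for w in range(1, n)]
        report = [(w, 0) for w in range(1, n)]
        return dispatch + report
    if topology == "decentralized":
        # Fully connected: every ordered pair of distinct agents.
        return list(permutations(range(n), 2))
    raise ValueError(f"unknown topology: {topology}")


def majority_vote(answers: list[str]) -> str:
    """Aggregate Independent outputs; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


chain = topology_edges("sequential", 4)
final_answer = majority_vote(["A", "B", "A"])
```

Writing the topologies as edge lists makes the structural difference explicit: Independent has no edges at all, so any coordination happens only in the aggregation step.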

Findings

When does a better local prompt become a better system? Each factor controls a different stage of that transfer.

Factor 1 · Task

Explicit, verifiable tasks gain more

Coding and tool-calling tasks gain more, and more consistently, than reasoning tasks, because their structured interfaces (code, tests, function calls) preserve local improvements across handoffs. Domain averages capture it: coding +3.7 and tool-calling +4.3 points, versus only +1.3 for reasoning. The peaks are large too: up to +24.0 on Sequential BFCL, while the largest multi-agent reasoning gain in the table below is +6.0 (Centralized MATH).

Domain	Task (metric)	Single (LangGraph)	Independent (LangGraph)	Sequential (CrewAI)	Centralized (AutoGen)	Decentralized (OpenAI Agents SDK)	Average
Reasoning	GPQA-Diamond (Acc.)	54.0 / 58.0 (+4.0)	73.0 / 73.0 (0.0)	53.0 / 56.0 (+3.0)	74.0 / 74.0 (0.0)	60.0 / 60.0 (0.0)	62.8 / 64.2 (+1.4)
	HotpotQA (EM)	26.0 / 39.0 (+13.0)	27.0 / 26.0 (−1.0)	27.0 / 27.0 (0.0)	20.0 / 22.0 (+2.0)	16.0 / 18.0 (+2.0)	23.2 / 26.4 (+3.2)
	MATH (Acc.)	49.0 / 51.0 (+2.0)	76.0 / 60.0 (−16.0)	58.0 / 62.0 (+4.0)	63.0 / 69.0 (+6.0)	66.0 / 66.0 (0.0)	62.4 / 61.6 (−0.8)
Coding	LiveCodeBench (pass@1)	12.0 / 12.0 (0.0)	14.0 / 18.0 (+4.0)	12.0 / 18.0 (+6.0)	14.0 / 16.0 (+2.0)	8.0 / 12.0 (+4.0)	12.0 / 15.2 (+3.2)
	APPS (pass@1)	52.0 / 66.0 (+14.0)	74.0 / 78.0 (+4.0)	62.0 / 80.0 (+18.0)	62.0 / 76.0 (+14.0)	74.0 / 74.0 (0.0)	64.8 / 74.8 (+10.0)
	SWE-Bench Verified (Resolved)	33.3 / 30.0 (−3.3)	36.7 / 33.3 (−3.4)	33.3 / 30.0 (−3.3)	16.7 / 20.0 (+3.3)	40.0 / 36.7 (−3.3)	32.0 / 30.0 (−2.0)
Tool-Calling	BFCL (Acc.)	84.0 / 88.0 (+4.0)	88.0 / 88.0 (0.0)	60.0 / 84.0 (+24.0)	96.0 / 96.0 (0.0)	84.0 / 88.0 (+4.0)	82.4 / 88.8 (+6.4)
	ToolHop (Acc.)	62.0 / 64.0 (+2.0)	62.0 / 68.0 (+6.0)	66.0 / 73.0 (+7.0)	68.0 / 69.0 (+1.0)	67.0 / 71.0 (+4.0)	65.0 / 69.0 (+4.0)
	API-Bank (Acc.)	77.0 / 79.0 (+2.0)	74.0 / 76.0 (+2.0)	60.0 / 66.0 (+6.0)	77.0 / 72.0 (−5.0)	62.0 / 69.0 (+7.0)	70.0 / 72.4 (+2.4)

Prompt-optimization gains of MAS-GEPA for nine diverse tasks on popular existing MAS frameworks. Each cell reports baseline / optimized performance, followed by the signed change $\Delta$ in percentage points. Blue indicates improvement, orange indicates regression, and gray indicates no change.
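The domain-level averages quoted for this factor can be recovered from the Average column of the table above. The sketch below copies the per-task average gains from that column and takes an unweighted mean per domain; the unweighted mean is an assumption about how the domain figures were computed, though it reproduces the quoted values.

```python
# Per-task average gain (percentage points), copied from the Average column.
task_avg_gain = {
    "Reasoning": {"GPQA-Diamond": 1.4, "HotpotQA": 3.2, "MATH": -0.8},
    "Coding": {"LiveCodeBench": 3.2, "APPS": 10.0, "SWE-Bench Verified": -2.0},
    "Tool-Calling": {"BFCL": 6.4, "ToolHop": 4.0, "API-Bank": 2.4},
}

# Unweighted mean over the three tasks in each domain, rounded to one decimal.
domain_avg = {
    domain: round(sum(gains.values()) / len(gains), 1)
    for domain, gains in task_avg_gain.items()
}

for domain, avg in domain_avg.items():
    print(f"{domain}: {avg:+.1f}")
```

The reasoning average is pulled down by MATH, whose mean gain is negative because of the single large regression under the Independent topology.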

Takeaway. Prompt optimization shows greater potential on tasks with explicit, controllable, and verifiable agent-local behaviors, such as coding and tool-calling, than on reasoning tasks.

Factor 2 · Workflow Topology

Multi-agent systems need topology-awareness

Every multi-agent topology gains less on average than the single-agent baseline (+4.2 points); the best of the four, Decentralized, reaches only +2.3. MAS is simply harder to optimize. Gains also swing by topology: the same optimizer lifts Sequential API-Bank by 9.0 points yet lowers Centralized API-Bank by 5.0. The swings have structure. Independent can erase gains as parallel agents overwrite one another (MATH −16.0), while Centralized amplifies both wins and losses (APPS +14.0, HotpotQA −9.0). This motivates topology-aware optimization rather than a one-size-fits-all approach.

Task (metric)	Single	Independent	Sequential	Centralized	Decentralized
GPQA (Acc.)	54.0 / 58.0 (+4.0)	73.0 / 73.0 (0.0)	75.0 / 78.0 (+3.0)	70.0 / 70.0 (0.0)	71.0 / 71.0 (0.0)
HotpotQA (EM)	26.0 / 39.0 (+13.0)	27.0 / 26.0 (−1.0)	29.0 / 28.0 (−1.0)	19.0 / 10.0 (−9.0)	20.0 / 32.0 (+12.0)
MATH (Acc.)	49.0 / 51.0 (+2.0)	76.0 / 60.0 (−16.0)	74.0 / 74.0 (0.0)	66.0 / 69.0 (+3.0)	81.0 / 81.0 (0.0)
LiveCodeBench (pass@1)	12.0 / 12.0 (0.0)	14.0 / 18.0 (+4.0)	16.0 / 16.0 (0.0)	16.0 / 16.0 (0.0)	18.0 / 18.0 (0.0)
APPS (pass@1)	52.0 / 66.0 (+14.0)	74.0 / 78.0 (+4.0)	82.0 / 84.0 (+2.0)	70.0 / 84.0 (+14.0)	86.0 / 86.0 (0.0)
SWE-Bench Verified (Resolved)	33.3 / 30.0 (−3.3)	36.7 / 33.3 (−3.4)	33.3 / 26.7 (−6.6)	30.0 / 33.3 (+3.3)	36.7 / 36.7 (0.0)
BFCL (Acc.)	84.0 / 88.0 (+4.0)	88.0 / 88.0 (0.0)	84.0 / 80.0 (−4.0)	92.0 / 96.0 (+4.0)	88.0 / 88.0 (0.0)
ToolHop (Acc.)	62.0 / 64.0 (+2.0)	62.0 / 68.0 (+6.0)	71.0 / 73.0 (+2.0)	66.0 / 70.0 (+4.0)	65.0 / 71.0 (+6.0)
API-Bank (Acc.)	77.0 / 79.0 (+2.0)	74.0 / 76.0 (+2.0)	61.0 / 70.0 (+9.0)	77.0 / 72.0 (−5.0)	65.0 / 68.0 (+3.0)
Average	49.9 / 54.1 (+4.2)	58.3 / 57.8 (−0.5)	58.4 / 58.9 (+0.5)	56.2 / 57.8 (+1.6)	59.0 / 61.3 (+2.3)

Prompt-optimization gains of MAS-GEPA for five workflow topologies. Each cell reports baseline / optimized performance, followed by the signed change $\Delta$ in percentage points. Blue indicates improvement, orange indicates regression, and gray indicates no change.
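As a consistency check on the Average row, the per-topology mean gain can be recomputed from the per-task changes in the table above. The sketch copies the signed changes in table order (GPQA through API-Bank) and also counts regressions per topology; the variable names are illustrative.

```python
# Signed change per task (percentage points), in table row order.
deltas = {
    "Single":        [4.0, 13.0, 2.0, 0.0, 14.0, -3.3, 4.0, 2.0, 2.0],
    "Independent":   [0.0, -1.0, -16.0, 4.0, 4.0, -3.4, 0.0, 6.0, 2.0],
    "Sequential":    [3.0, -1.0, 0.0, 0.0, 2.0, -6.6, -4.0, 2.0, 9.0],
    "Centralized":   [0.0, -9.0, 3.0, 0.0, 14.0, 3.3, 4.0, 4.0, -5.0],
    "Decentralized": [0.0, 12.0, 0.0, 0.0, 0.0, 0.0, 0.0, 6.0, 3.0],
}

# Mean gain per topology, rounded to one decimal like the Average row.
mean_gain = {t: round(sum(d) / len(d), 1) for t, d in deltas.items()}

# Number of tasks on which optimization made the system worse.
regressions = {t: sum(1 for x in d if x < 0) for t, d in deltas.items()}

# Best multi-agent topology by mean gain (Single excluded).
best_multi_agent = max(
    (t for t in mean_gain if t != "Single"), key=lambda t: mean_gain[t]
)
```

The regression counts add detail that the averages hide: Decentralized never regresses in this table but is flat on six of nine tasks, while Sequential and Independent each regress on three.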

Takeaway. Multi-agent systems need topology-aware prompt optimizers.

Factor 3 · Communication Protocol

Shared structure gives optimization more room

Prompt-optimization gain Δ (pp) by communication protocol, grouped by topology. Adding structure improves gain transfer.

More structure means larger gains: the average rises from +1.6 (Freeform) to +2.4 (Semi-structured) to +4.3 (Structured). The effect is largest on evidence-passing tasks like HotpotQA, where downstream agents must reuse upstream outputs, and weakest on LiveCodeBench, where executable code and tests decide correctness regardless of message format.

Takeaway. Communication protocols with explicit shared structure make agent interactions easier to control and transfer, giving MAS prompt optimization more room to improve.
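To make the three protocol levels concrete, here is a purely hypothetical illustration of the same handoff written in each style. The field names, the labeled-line format, and the JSON encoding are assumptions made for this sketch; the source does not specify the message schemas used in the benchmark.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class StructuredMessage:
    """Hypothetical fixed schema shared by every agent in the system."""
    sender: str
    claim: str
    evidence: list[str]
    confidence: float


# Freeform: unconstrained natural language; the receiver must interpret it.
freeform = "I think the answer is Paris; the second passage says so."

# Semi-structured: natural language under agreed section labels.
semi_structured = "ANSWER: Paris\nEVIDENCE: passage 2 names Paris as the capital."

# Structured: a typed record that round-trips through a serialized payload.
structured = StructuredMessage(
    sender="retriever", claim="Paris", evidence=["passage 2"], confidence=0.8
)
payload = json.dumps(asdict(structured))
parsed = StructuredMessage(**json.loads(payload))
```

With a fixed schema, a downstream agent can rely on where the claim and its evidence appear, which is one plausible reason an upstream prompt improvement would carry through the handoff more reliably.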

Factor 4 · Team Size

Larger teams make optimization harder

Prompt-optimization gain Δ (pp) across team sizes by topology. The dashed line is the mean across topologies — gains decay as teams grow.

As teams grow ($n \in \{2,4,8,10\}$), gains generally shrink — the average falls from +2.4 at $n{=}2$ to −2.1 at $n{=}10$ — as agent-local improvements get diluted across more handoffs and intermediate states. Topology mediates the effect: Centralized HotpotQA collapses (+5.0 → −12.0), while Decentralized HotpotQA stays nonnegative at every size.

Takeaway. Larger team size increases the challenge of prompt optimization for MAS: local agent improvements may fail to produce system-level gains.
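One rough way to see why handoffs multiply with team size is to count directed message channels per topology. This is a sketch under stated assumptions: the coordinator is counted as one of the $n$ agents, Centralized uses one dispatch and one report channel per worker, and the count is per round. The source does not define such a count.

```python
TEAM_SIZES = (2, 4, 8, 10)


def message_channels(topology: str, n: int) -> int:
    """Directed inter-agent channels for a team of n agents (assumed model)."""
    if topology == "independent":
        return 0                # parallel agents never message each other
    if topology == "sequential":
        return n - 1            # one handoff per link in the chain
    if topology == "centralized":
        return 2 * (n - 1)      # dispatch + report for each worker
    if topology == "decentralized":
        return n * (n - 1)      # every ordered pair of distinct agents
    raise ValueError(f"unknown topology: {topology}")


channels = {
    topology: [message_channels(topology, n) for n in TEAM_SIZES]
    for topology in ("independent", "sequential", "centralized", "decentralized")
}

for topology, counts in channels.items():
    print(f"{topology:>13}: {counts}")
```

Under this model the channel count grows linearly for Sequential and Centralized and quadratically for Decentralized. Channel count alone does not predict the gain, since Decentralized HotpotQA stays nonnegative at every size, which is consistent with topology mediating the effect.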

BibTeX

@article{maspromptbench2026,
  title  = {When Does Prompt Optimization Improve Multi-Agent LLM Systems?},
  author = {Anonymous Authors},
  year   = {2026}
}