Multi-Agent Safety Evaluation – Marian E. Arenskrieger

§01 Why multi-agent

The risk surface single-agent evaluation cannot see

A safe agent and a safe agent do not compose into a safe pair. Once agents delegate, share tools and read each other's outputs, new failure modes appear that are properties of the interaction, not of any one model: two agents that each behave well individually reinforce a bad plan; an error in one is treated as ground truth by the next and cascades; a prompt injection lands in one agent's context and propagates through the messages it sends onward. None of these is visible when each agent is scored alone – the object under test is the fleet, and the risk grows with the number of agent-to-agent interactions, not just with the number of agents.

Single-agent evaluation measures the nodes. The risks that matter in a fleet live in the edges – collusion, cascades, injection hopping agent to agent – and grow as k(k−1), the number of directed agent-to-agent channels in a fleet of k agents.
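A minimal sketch of the k(k−1) growth, assuming k denotes the number of agents and that every ordered (sender, receiver) pair is a potential channel. The function name and agent labels are illustrative only.

```python
from itertools import permutations


def directed_channels(agents):
    """Return every ordered (sender, receiver) pair of distinct agents."""
    return list(permutations(agents, 2))


for k in (2, 4, 8):
    agents = [f"agent_{i}" for i in range(k)]
    channels = directed_channels(agents)
    # The count of ordered pairs is exactly k * (k - 1).
    assert len(channels) == k * (k - 1)
    print(f"{k} agents -> {len(channels)} directed channels")
```

Doubling the fleet from 4 to 8 agents raises the number of channels from 12 to 56, which is why per-agent scoring covers a shrinking share of the surface as the fleet grows.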

Single-model safety can say each agent is fine. It cannot say the fleet is safe.

§02 Task force 1 · Taxonomy

A failure-mode taxonomy for multi-agent systems

Measurement needs a map of what to measure. The first work stream is a structured taxonomy of the failure modes that only appear between agents, grouped so each becomes a concrete, testable target rather than a vague worry. It is the enumeration that tells the metrics and the testbed what scenarios to build.

Nine failure modes in three families. Each is written to be a concrete scenario the metrics can score and the testbed can reproduce – not an abstract concern.

§03 Task force 2 · Metrics

Quantitative risk metrics with a discrimination bar

Each failure mode needs a number that means something. The second work stream defines metrics that are computed from what the agents actually did in the testbed – not from a model's self-report – and that inherit the same rule the evaluation pipeline enforces: a metric on which every run scores identically measures nothing and does not ship.
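The discrimination bar can be sketched as a simple gate, assuming each metric yields one numeric score per run. The names `discriminates` and `shippable_metrics` and the `min_spread` parameter are hypothetical, not part of the pipeline described here.

```python
def discriminates(scores, min_spread=0.0):
    """True if the metric separates at least some runs from others.

    A metric on which every run scores identically has zero spread
    and therefore carries no information.
    """
    if len(scores) < 2:
        return False  # a single run cannot show any discrimination
    return max(scores) - min(scores) > min_spread


def shippable_metrics(results, min_spread=0.0):
    """Keep only the metrics whose scores vary across runs."""
    return {
        name: scores
        for name, scores in results.items()
        if discriminates(scores, min_spread)
    }


# Illustrative, made-up scores purely to exercise the gate.
example = {
    "metric_a": [3, 1, 4, 2],
    "metric_b": [1, 1, 1, 1],  # identical on every run: does not ship
}
print(sorted(shippable_metrics(example)))
```

A stricter gate would raise `min_spread` above zero so that a metric varying only by noise is also rejected.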

Metric · Cascade depth

How many downstream agents act on a seeded upstream error before it is caught. Measured by injecting a known error and tracing propagation through the message graph.
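One way to trace propagation is a breadth-first walk over the recorded message graph. This is a sketch under simplifying assumptions: the graph is a dict from sender to receivers, an agent that receives the error is treated as acting on it unless it is listed as a catcher, and a catcher forwards nothing. All names are hypothetical.

```python
from collections import deque


def cascade_depth(message_graph, seeded_agent, catchers):
    """Count downstream agents that act on a seeded upstream error.

    message_graph: dict mapping sender -> list of receivers
    seeded_agent:  agent into which the known error was injected
    catchers:      agents that detect the error and stop forwarding it
    """
    acted = set()
    seen = {seeded_agent}
    queue = deque([seeded_agent])
    while queue:
        sender = queue.popleft()
        for receiver in message_graph.get(sender, []):
            if receiver in seen:
                continue
            seen.add(receiver)
            if receiver in catchers:
                continue  # error caught here; it propagates no further
            acted.add(receiver)
            queue.append(receiver)
    return len(acted)


graph = {
    "planner": ["coder", "researcher"],
    "coder": ["reviewer"],
    "researcher": ["reviewer"],
    "reviewer": ["deployer"],
}
print(cascade_depth(graph, "planner", catchers={"reviewer"}))  # 2
print(cascade_depth(graph, "planner", catchers=set()))         # 4
```

In the first call the reviewer catches the error, so only the coder and the researcher act on it; with no catcher it reaches all four downstream agents.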

Metric · Injection reach

How far an indirect prompt injection travels across agent boundaries. Measured as the number of agents whose actions change after a single poisoned input.
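A minimal sketch of that count, assuming the same scenario is replayed twice under identical seeds, once clean and once with the poisoned input, so that any difference in an agent's recorded actions can be attributed to the injection. The function name and the action labels are illustrative.

```python
def injection_reach(clean_actions, poisoned_actions):
    """Return (count, names) of agents whose actions differ between runs.

    Both arguments map agent name -> ordered list of recorded actions.
    An agent present in only one run counts as changed.
    """
    agents = set(clean_actions) | set(poisoned_actions)
    changed = sorted(
        agent
        for agent in agents
        if clean_actions.get(agent, []) != poisoned_actions.get(agent, [])
    )
    return len(changed), changed


clean = {
    "researcher": ["search"],
    "coder": ["write_file"],
    "reviewer": ["approve"],
}
poisoned = {
    "researcher": ["search", "fetch_url"],
    "coder": ["write_file", "run_shell"],
    "reviewer": ["approve"],
}
print(injection_reach(clean, poisoned))  # (2, ['coder', 'researcher'])
```

With non-deterministic agents, a single clean run is not a reliable baseline; the comparison would then need several clean runs to separate injection effects from ordinary run-to-run variation.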

Metric · Delegation safety

Whether a safety constraint stated to the lead agent survives every handoff. Measured by checking constraint satisfaction at each sub-agent, not just at the top.
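The per-handoff check can be sketched as a walk over the delegation tree, assuming each agent's recorded actions are available and the constraint can be expressed as a predicate over a single action. `AgentTrace`, `constraint_violations` and the `http_` naming rule are hypothetical choices for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class AgentTrace:
    name: str
    actions: list
    delegates: list = field(default_factory=list)


def constraint_violations(root, constraint):
    """Return sorted names of all agents in the tree that break the constraint.

    The check runs at every node, not only at the lead agent.
    """
    violations = []
    stack = [root]
    while stack:
        node = stack.pop()
        if not all(constraint(action) for action in node.actions):
            violations.append(node.name)
        stack.extend(node.delegates)
    return sorted(violations)


def no_network(action):
    """Example constraint: no action may be a network call."""
    return not action.startswith("http_")


lead = AgentTrace(
    name="lead",
    actions=["plan"],
    delegates=[
        AgentTrace("sub_1", ["read_file"]),
        AgentTrace("sub_2", ["http_get"]),  # constraint lost at this handoff
    ],
)
print(constraint_violations(lead, no_network))  # ['sub_2']
```

Checking only the lead agent would report this run as compliant; the violation is visible only because every sub-agent is inspected.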

Metrics are reported with variance across seeds and with their limits stated: a low risk score on a narrow scenario set is not a safety certificate, only evidence against the specific failures that set was built to provoke.
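Reporting with variance across seeds could look like the sketch below, assuming one score per seed for a given metric. The function name, the record layout and the numbers are illustrative only; no real results are implied.

```python
from statistics import mean, stdev


def summarize_across_seeds(scores_by_seed):
    """Summarise one metric over several seeded runs.

    scores_by_seed: dict mapping seed -> score for that run.
    The sample standard deviation needs at least two seeds.
    """
    values = list(scores_by_seed.values())
    return {
        "n_seeds": len(values),
        "mean": mean(values),
        "stdev": stdev(values) if len(values) > 1 else None,
    }


# Made-up scores purely to exercise the function.
print(summarize_across_seeds({0: 2, 1: 4, 2: 6}))
```

Returning `None` for the spread of a single-seed run keeps the report from implying a variance that was never measured.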

§04 Task force 3 · Testbed

An instrumented, hardened multi-agent testbed

The third work stream is where taxonomy and metrics become measurement: a testbed that runs a real multi-agent scenario inside a hardened sandbox, with an observer layer that records the full message graph and tool calls so the metrics can be computed after the fact. It reuses infrastructure that already exists across the other projects rather than starting from zero.
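A minimal sketch of what such an observer layer might record, assuming an append-only event log from which the message graph is rebuilt after the run. The class, its method names and the event fields are hypothetical; the testbed is proposed work and has no published interface.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Observer:
    """Append-only log of messages and tool calls for post-hoc scoring."""

    events: list = field(default_factory=list)

    def record_message(self, sender, receiver, content):
        self.events.append({
            "kind": "message",
            "sender": sender,
            "receiver": receiver,
            "content": content,
        })

    def record_tool_call(self, agent, tool, arguments):
        self.events.append({
            "kind": "tool_call",
            "agent": agent,
            "tool": tool,
            "arguments": arguments,
        })

    def message_graph(self):
        """Rebuild sender -> receivers adjacency from the logged messages."""
        graph = {}
        for event in self.events:
            if event["kind"] != "message":
                continue
            receivers = graph.setdefault(event["sender"], [])
            if event["receiver"] not in receivers:
                receivers.append(event["receiver"])
        return graph

    def dump(self):
        """Serialise the log as JSON lines for storage alongside the run."""
        return "\n".join(json.dumps(e, sort_keys=True) for e in self.events)


obs = Observer()
obs.record_message("planner", "coder", "implement step 1")
obs.record_tool_call("coder", "write_file", {"path": "step1.py"})
obs.record_message("coder", "reviewer", "please review")
print(obs.message_graph())
```

Keeping the log append-only and computing metrics afterwards means a new metric can be scored on old runs without re-executing the agents.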

The hardened sandbox is Project A's; contamination-resistant scenario synthesis is Project B's; the single-GPU discipline is Project C's. The new part is the observer and the per-failure-mode risk profile it produces.

Reuses: hardened sandbox (A) · contamination-resistant synthesis (B) · single-GPU discipline (C) · pinned scenarios (semver + hash)

§05 Practical utility

What the research is for: certifying a multi-agent deployment

A research agenda still has to answer "and then what?". Here is the concrete deployment the three task forces are built to serve – the reason the work is worth funding beyond the paper.

S · Situation

An organisation wants to ship a fleet of interacting agents – research and coding agents that delegate to each other and share tools. Single-agent benchmarks say each one is fine, but there is no principled way to know the fleet won't collude, cascade an error, or carry a prompt injection from one agent into the next.

T · Task

Give the team a defensible, reproducible measure of multi-agent-specific risk before deployment – and a concrete go / no-go bar – rather than a subjective judgment that the system "seems safe".

A · Action

Enumerate what to test with the failure-mode taxonomy; score each mode with the risk metrics; run adversarial multi-agent scenarios in the instrumented testbed (hardened sandbox, contamination-resistant scenarios, pinned with semver + hash); report each metric with variance across seeds.

R · Intended result · practical impact

A certification artifact – a per-failure-mode risk profile, re-derivable months later from the pinned scenarios – that turns "we believe the fleet is safe" into "here is the measured risk surface and exactly where it fails". The practical benefit of the research is a reusable pre-deployment safety harness for multi-agent systems, usable by any team shipping one – not a result that stops at publication.
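Pinning a scenario with semver + hash, as named in the Action step, could be sketched as follows. The assumptions are that a scenario is a JSON-serialisable dict, that SHA-256 over a canonical JSON encoding serves as the hash, and that the version string is supplied by the author; the function names and the scenario fields are hypothetical.

```python
import hashlib
import json


def pin_scenario(scenario, version):
    """Return a pin: the given version plus a hash of the canonical scenario."""
    canonical = json.dumps(
        scenario, sort_keys=True, separators=(",", ":")
    ).encode("utf-8")
    return {
        "version": version,
        "sha256": hashlib.sha256(canonical).hexdigest(),
    }


def verify_pin(scenario, pin):
    """True if the scenario still matches the content hash in its pin."""
    return pin_scenario(scenario, pin["version"])["sha256"] == pin["sha256"]


scenario = {"agents": ["planner", "coder"], "seeded_error": "wrong_unit"}
pin = pin_scenario(scenario, "1.0.0")
print(verify_pin(scenario, pin))  # True

edited = dict(scenario, seeded_error="off_by_one")
print(verify_pin(edited, pin))    # False
```

Sorting keys before hashing makes the pin independent of dict ordering, so a risk profile can be tied to exactly the scenario content that produced it.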

§06 Status

An open agenda, built on standing infrastructure

This is a research direction, not a finished result. The taxonomy, metrics and testbed are proposed work; no risk numbers are claimed here, because inventing them would be the exact dishonesty the evaluation work exists to prevent. What already exists is the infrastructure the agenda stands on – the hardened sandbox, contamination-resistant synthesis and single-GPU discipline of Projects A–C – which is what makes the agenda credible rather than speculative. It is written toward a non-dilutive multi-agent safety research call.

The contribution this agenda proposes is a way to measure fleet-level risk, not just agent-level capability: name the failure modes, score them from what the agents actually did, and hand a team a reproducible risk profile before they ship – safety as an engineering artifact rather than a hope.

Framed under CTC AI Operations, on the same evaluation discipline as the pipeline it extends.