The risk surface single-agent evaluation cannot see
A safe agent and a safe agent do not compose into a safe pair. Once agents delegate, share tools and read each other's outputs, new failure modes appear that are properties of the interaction, not of any one model: two agents that each behave individually reinforce a bad plan; an error in one is treated as ground truth by the next and cascades; a prompt injection lands in one agent's context and propagates through the messages it sends onward. None of these is visible when each agent is scored alone – the object under test is the fleet, and the risk is combinatorial in its size.
A failure-mode taxonomy for multi-agent systems
Measurement needs a map of what to measure. The first work stream is a structured taxonomy of the failure modes that only appear between agents, grouped so each becomes a concrete, testable target rather than a vague worry. It is the enumeration that tells the metrics and the testbed what scenarios to build.
Quantitative risk metrics with a discrimination bar
Each failure mode needs a number that means something. The second work stream defines metrics that are computed from what the agents actually did in the testbed – not from a model's self-report – and that inherit the same rule the evaluation pipeline enforces: a metric on which every run scores identically measures nothing and does not ship.
Cascade depth
How many downstream agents act on a seeded upstream error before it is caught. Measured by injecting a known error and tracing propagation through the message graph.
Injection reach
How far an indirect prompt injection travels across agent boundaries. Measured as the number of agents whose actions change after a single poisoned input.
Delegation safety
Whether a safety constraint stated to the lead agent survives every handoff. Measured by checking constraint satisfaction at each sub-agent, not just at the top.
Metrics are reported with variance across seeds, and honest about their own limits: a low score on a narrow scenario set is not a safety certificate, only evidence against the specific failures that set was built to provoke.
An instrumented, hardened multi-agent testbed
The third work stream is where taxonomy and metrics become measurement: a testbed that runs a real multi-agent scenario inside a hardened sandbox, with an observer layer that records the full message graph and tool calls so the metrics can be computed after the fact. It reuses infrastructure that already exists across the other projects rather than starting from zero.
What the research is for: certifying a multi-agent deployment
A research agenda still has to answer "and then what?". Here is the concrete deployment the three task forces are built to serve – the reason the work is worth funding beyond the paper.
An organisation wants to ship a fleet of interacting agents – research and coding agents that delegate to each other and share tools. Single-agent benchmarks say each one is fine, but there is no principled way to know the fleet won't collude, cascade an error, or carry a prompt injection from one agent into the next.
Give the team a defensible, reproducible measure of multi-agent-specific risk before deployment – and a concrete go / no-go bar – rather than a subjective judgment that the system "seems safe".
Enumerate what to test with the failure-mode taxonomy; score each mode with the risk metrics; run adversarial multi-agent scenarios in the instrumented testbed (hardened sandbox, contamination-resistant scenarios, pinned with semver + hash); report each metric with variance across seeds.
A certification artifact – a per-failure-mode risk profile, re-derivable months later from the pinned scenarios – that turns "we believe the fleet is safe" into "here is the measured risk surface and exactly where it fails". The practical benefit of the research is a reusable pre-deployment safety harness for multi-agent systems, usable by any team shipping one – not a result that stops at publication.
An open agenda, built on standing infrastructure
The contribution this agenda proposes is a way to measure fleet-level risk, not just agent-level capability: name the failure modes, score them from what the agents actually did, and hand a team a reproducible risk profile before they ship – safety as an engineering artifact rather than a hope.
Framed under CTC AI Operations, on the same evaluation discipline as the pipeline it extends.