Repeat failures, deterministically keyed.
Failed evaluations group by a stable failure_key. Three strikes on the same key trips an improvement trigger, the diagnosis sub-agent reads its own failing spans, and a candidate prompt enters the experiment runner.
—
Active aggregates
—
Above threshold (≥2)
—
Critical types
—
Total occurrences
failure_keyoccurrences · trend
Loading failures…