Baseline vs candidate. Code-evals only.
Each experiment scores the baseline and candidate prompts against the regression set using deterministic code_evals, with no LLM judges in the hot path. The release gate then decides whether the candidate ships.
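As a rough illustration of that flow, the sketch below scores pre-generated baseline and candidate outputs with deterministic checks and applies a gate. Everything named here (`Case`, `release_gate`, the two example checks) is hypothetical. The gate rule is also an assumption: the candidate ships only if no case that passed on the baseline fails on the candidate, and the candidate's pass rate is at least the baseline's. The actual gate criteria are not specified here.

```python
import json
from dataclasses import dataclass
from typing import Callable

# A code eval is a deterministic check: the same output always gets the same verdict.
CodeEval = Callable[[str], bool]


@dataclass(frozen=True)
class Case:
    """One regression-set entry (hypothetical shape, for illustration only)."""
    case_id: str
    baseline_output: str   # output produced with the baseline prompt
    candidate_output: str  # output produced with the candidate prompt
    check: CodeEval


def is_valid_json(output: str) -> bool:
    """Pass if the output parses as JSON."""
    try:
        json.loads(output)
        return True
    except json.JSONDecodeError:
        return False


def is_short(output: str) -> bool:
    """Pass if the output is at most 10 characters long."""
    return len(output) <= 10


def pass_rate(results: list[bool]) -> float:
    """Fraction of passing cases; 0.0 for an empty set."""
    return sum(results) / len(results) if results else 0.0


def release_gate(cases: list[Case]) -> dict:
    """Score both prompts and decide whether the candidate ships.

    Assumed rule: no per-case regressions, and the candidate pass rate
    is at least the baseline pass rate.
    """
    baseline = [c.check(c.baseline_output) for c in cases]
    candidate = [c.check(c.candidate_output) for c in cases]
    regressions = [
        c.case_id
        for c, base_ok, cand_ok in zip(cases, baseline, candidate)
        if base_ok and not cand_ok
    ]
    ship = not regressions and pass_rate(candidate) >= pass_rate(baseline)
    return {
        "baseline_pass_rate": pass_rate(baseline),
        "candidate_pass_rate": pass_rate(candidate),
        "regressions": regressions,
        "ship": ship,
    }


# Toy regression set: the candidate fixes the length failure without breaking JSON.
CASES = [
    Case("json", '{"a": 1}', '{"a": 1}', is_valid_json),
    Case("length", "this is far too long", "short", is_short),
]

report = release_gate(CASES)
```

Because every check is plain code, rerunning the same experiment on the same outputs produces the same report, which is what makes a hard gate practical without an LLM judge.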
Runs
Select an experiment to view results.