The agent that watches itself and rewrites its own prompt.
PhoenixLoop is a Gemini support agent that traces every run with Arize Phoenix, clusters its own failed evaluations, drafts a prompt fix from the failing spans, A/B-tests the candidate against a regression set, and gates promotion on the score. Self-improving, not theoretically — measurably.
gemini-2.5-flash · phoenix-mcp · arize-phoenix-evals

```python
from phoenix.otel import register

tracer_provider = register(
    project_name="phoenixloop",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    headers={"Authorization": f"Bearer {api_key}"},
    auto_instrument=True,
    batch=True,
)
# google-genai + ADK spans now stream to Phoenix.
```
Seven stages. One closed circuit.
An LLM agent without a loop is a sample. With this loop, every failure becomes a labeled example, every cluster becomes a prompt patch, every patch becomes a measurable experiment.
- 01 · Observe: Every agent turn, tool call, and judge call streams to Phoenix as OTel spans. (phoenix.otel.register)
- 02 · Evaluate: Every run is scored by 7 code evals, 4 LLM judges (2 Phoenix Evals templates, 2 custom), and 3 Phoenix tool evals. (arize-phoenix-evals)
- 03 · Cluster: Repeated failures are grouped by a deterministic failure_key. Three strikes trip an improvement trigger. (failure_aggregator)
- 04 · Diagnose: A sub-agent reads its own failing spans via Phoenix MCP (get-spans, get-span-annotations) and names the pattern. (phoenix-mcp:get-spans)
- 05 · Patch prompt: Gemini drafts a minimal one-line addition to the system prompt. The diff is human-readable. (patch_synthesis)
- 06 · Experiment: Baseline vs. candidate on 5 frozen regression examples. Code evals only, no judge round-trips. (experiment.run)
- 07 · Gate: The release-gate verdict comes from the score delta. Promotion is automatic above the threshold; below it, a human stays in the loop. (release_gate)
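The Cluster stage can be sketched roughly as follows. This is a minimal illustration, not the project's actual failure_aggregator: the key derivation (a hash of evaluator name and ticket category), the `FailureAggregator` class, and the `STRIKE_THRESHOLD` constant are all assumptions made for the example.

```python
import hashlib
from collections import Counter
from dataclasses import dataclass, field

STRIKE_THRESHOLD = 3  # "three strikes", per the Cluster stage


def failure_key(evaluator: str, category: str) -> str:
    """Deterministic key: the same evaluator + category always maps to the same key.

    Hypothetical derivation; the real key may be built from other fields.
    """
    raw = f"{evaluator}|{category}".lower().encode("utf-8")
    return hashlib.sha256(raw).hexdigest()[:12]


@dataclass
class FailureAggregator:
    """Counts failed evaluations per failure_key (illustrative only)."""

    counts: Counter = field(default_factory=Counter)

    def record(self, evaluator: str, category: str) -> bool:
        """Record one failure; return True on the strike that reaches the threshold."""
        key = failure_key(evaluator, category)
        self.counts[key] += 1
        # Trip exactly once so a fourth failure does not re-trigger improvement.
        return self.counts[key] == STRIKE_THRESHOLD


aggregator = FailureAggregator()
trips = [aggregator.record("CitationPresence", "refund") for _ in range(4)]
```

Because the key is a pure function of its inputs, the same failure pattern lands in the same bucket across runs and processes, which is what makes a strike count meaningful.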
Three pieces of evidence.
If the agent says it observes itself, the spans should be visible. If it says it evaluates itself, the evaluators should be named. If it says it improves itself, the before-and-after should be auditable.
Real Phoenix spans
- ADK · agent_run
- phoenix-mcp:get-spans
- phoenix-mcp:get-span-annotations
- phoenix-mcp:get-dataset-examples
- google-genai · judges_combined
- google-genai · patch_synthesis
Visible in Arize Phoenix, deep-linked from every run.
Real evaluators
- code · CitationPresence
- code · RefundGuard
- code · ToolSequence
- judge · Hallucination (Phoenix Evals)
- judge · QA-Correctness (Phoenix Evals)
- judge · PolicyCompliance (custom)
Code-evals + Phoenix Evals templates + custom judges.
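To make "code eval" concrete, here is a hedged sketch of what two of the named evaluators might look like as pure functions. The citation format (`[KB-123]`), the function names, and the tool names in the usage lines are assumptions; the page does not show the real implementations.

```python
import re

# Hypothetical citation format; the real agent may cite sources differently.
CITATION_PATTERN = re.compile(r"\[(?:KB|DOC)-\d+\]")


def citation_presence(response: str) -> float:
    """Return 1.0 if the response cites at least one source, else 0.0."""
    return 1.0 if CITATION_PATTERN.search(response) else 0.0


def tool_sequence(called: list[str], required: list[str]) -> float:
    """Return 1.0 if `required` occurs in `called` as an ordered subsequence."""
    remaining = iter(called)
    # `tool in remaining` advances the iterator, so order is enforced.
    return 1.0 if all(tool in remaining for tool in required) else 0.0


cited = citation_presence("Refunds take 5 days, see [KB-12].")
uncited = citation_presence("Refunds take 5 days.")
in_order = tool_sequence(
    ["lookup_order", "retrieve_similar_resolutions", "issue_refund"],
    ["lookup_order", "issue_refund"],
)
out_of_order = tool_sequence(
    ["issue_refund", "lookup_order"],
    ["lookup_order", "issue_refund"],
)
```

Evaluators of this kind are deterministic and need no model call, which is why they are cheap enough to run on every trace.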
Real before/after
- Baseline prompt v1.1 → Candidate prompt v1.2
- Resolution correctness · 0.42 → 0.91
- Citation presence · 0.10 → 0.98
- Regression canaries · 5/5 pass
- Verdict · PROMOTED
Code-evals only in the experiment hot path.
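The verdict above could be derived from the score deltas along these lines. This is a sketch under stated assumptions: the page does not give the promotion threshold, so `PROMOTE_MIN_DELTA`, the `MetricResult` type, and the `NEEDS_REVIEW` label for the human-in-the-loop path are hypothetical.

```python
from dataclasses import dataclass

PROMOTE_MIN_DELTA = 0.05  # assumed threshold; the page does not state one


@dataclass(frozen=True)
class MetricResult:
    name: str
    baseline: float
    candidate: float

    @property
    def delta(self) -> float:
        return self.candidate - self.baseline


def release_gate(
    metrics: list[MetricResult],
    canaries_passed: int,
    canaries_total: int,
) -> str:
    """Return "PROMOTED" or "NEEDS_REVIEW" (hand-off to a human)."""
    if not metrics or canaries_passed < canaries_total:
        return "NEEDS_REVIEW"  # no evidence, or a regression canary failed
    if all(m.delta >= PROMOTE_MIN_DELTA for m in metrics):
        return "PROMOTED"
    return "NEEDS_REVIEW"


# Figures taken from the before/after list above.
metrics = [
    MetricResult("resolution_correctness", 0.42, 0.91),
    MetricResult("citation_presence", 0.10, 0.98),
]
verdict = release_gate(metrics, canaries_passed=5, canaries_total=5)
```

Requiring every metric to clear the threshold, rather than the average, keeps one large gain from masking a regression elsewhere.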
One agent. Three feedback paths.
The support agent calls Phoenix MCP at runtime to retrieve few-shot exemplars from a curated dataset of resolved tickets. Failed runs aggregate. A separate diagnosis sub-agent reads the failing spans back from Phoenix MCP, names the root cause, and proposes a patch. Experiments score before-and-after on the same dataset.
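A before-and-after experiment on a shared dataset can be sketched as below. Everything here is illustrative: `run_experiment`, the `stub_agent` stand-in for the real Gemini agent, and the example prompts are assumptions, chosen so the block runs without network access.

```python
from statistics import mean
from typing import Callable

Example = dict[str, str]
AgentFn = Callable[[str, str], str]           # (system_prompt, ticket) -> response
Evaluator = Callable[[str, Example], float]   # (response, example) -> score


def run_experiment(
    agent: AgentFn,
    system_prompt: str,
    examples: list[Example],
    evaluators: dict[str, Evaluator],
) -> dict[str, float]:
    """Run one prompt over the frozen examples; return the mean score per evaluator."""
    scores: dict[str, list[float]] = {name: [] for name in evaluators}
    for example in examples:
        response = agent(system_prompt, example["ticket"])
        for name, evaluate in evaluators.items():
            scores[name].append(evaluate(response, example))
    return {name: mean(values) for name, values in scores.items()}


def stub_agent(system_prompt: str, ticket: str) -> str:
    """Deterministic stand-in for the real agent: cites only when told to."""
    answer = f"Resolved: {ticket}"
    return f"{answer} [KB-1]" if "cite" in system_prompt.lower() else answer


def has_citation(response: str, example: Example) -> float:
    return 1.0 if "[KB-" in response else 0.0


FROZEN = [{"ticket": f"ticket {i}"} for i in range(5)]  # 5 frozen examples
EVALS = {"citation_presence": has_citation}

baseline = run_experiment(stub_agent, "Be helpful.", FROZEN, EVALS)
candidate = run_experiment(
    stub_agent, "Be helpful. Always cite a KB article.", FROZEN, EVALS
)
```

Both prompts see the same examples and the same evaluators, so any score difference is attributable to the prompt change alone.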
Three pieces of code that carry the claim.
The most reassuring thing on a marketing page is the actual production code.
Trace every Gemini call with one register()
Replaces our hand-rolled OTel setup with the canonical Phoenix call. The auto_instrument=True flag picks up both ADK agent spans and direct google-genai calls — so the LLM judges are no longer invisible.
```python
import os

from phoenix.otel import register

# Assumption: the API key is read from the PHOENIX_API_KEY environment variable.
api_key = os.environ["PHOENIX_API_KEY"]

tracer_provider = register(
    project_name="phoenixloop",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    headers={"Authorization": f"Bearer {api_key}"},
    auto_instrument=True,
    batch=True,
)
# google-genai + ADK spans now stream to Phoenix.
```
Few-shot retrieval from a Phoenix dataset
The support agent calls this tool before drafting any non-trivial response. Top-3 exemplars come back from the successful-resolutions dataset in Phoenix Cloud via MCP. The retrieval span is visible in every trace.
```python
import httpx

# `retry`, `phoenix_client`, and `ResolutionExample` are defined elsewhere
# in the project and are not shown here.
@retry(max_attempts=3, retryable_exceptions=(httpx.TimeoutException,))
async def retrieve_similar_resolutions(
    category: str,
    brief: str,
) -> list[ResolutionExample]:
    """Top-3 exemplars from the Phoenix `successful-resolutions` dataset."""
    examples = await phoenix_client.get_dataset_examples(
        dataset="successful-resolutions",
        filter={"category": category},
        limit=3,
    )
    return [ResolutionExample.model_validate(e) for e in examples]
```
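The `@retry` decorator used above is not shown on the page. One plausible shape for an async decorator with the same keyword arguments is sketched here; the `base_delay` parameter and the exponential backoff are assumptions, and the demo uses the built-in `TimeoutError` instead of `httpx.TimeoutException` so it runs without third-party packages.

```python
import asyncio
import functools
from typing import Any, Awaitable, Callable


def retry(
    max_attempts: int,
    retryable_exceptions: tuple[type[BaseException], ...],
    base_delay: float = 0.0,  # hypothetical knob; 0 keeps the demo instant
) -> Callable[[Callable[..., Awaitable[Any]]], Callable[..., Awaitable[Any]]]:
    """Retry an async function on the listed exceptions, up to max_attempts calls."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")

    def decorator(func: Callable[..., Awaitable[Any]]) -> Callable[..., Awaitable[Any]]:
        @functools.wraps(func)
        async def wrapper(*args: Any, **kwargs: Any) -> Any:
            for attempt in range(1, max_attempts + 1):
                try:
                    return await func(*args, **kwargs)
                except retryable_exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = 0


@retry(max_attempts=3, retryable_exceptions=(TimeoutError,))
async def flaky() -> str:
    """Fails twice with a simulated timeout, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = asyncio.run(flaky())
```

Restricting retries to an explicit exception tuple means validation errors and other non-transient failures still fail fast.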
Diagnosis sub-agent reads its own failing spans
The diagnosis agent uses the Phoenix MCP toolset as its tool surface. The Live Trace pane shows phoenix-mcp:get-spans and phoenix-mcp:get-span-annotations spans every time it runs — real observability data flowing back into reasoning.
```python
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.genai.types import GenerateContentConfig, ThinkingConfig

diagnosis_agent = Agent(
    name="diagnosis",
    model=settings.gemini_model,  # flash
    tools=[mcp_toolset],  # phoenix-mcp
    generate_content_config=GenerateContentConfig(
        thinking_config=ThinkingConfig(thinking_budget=128),
    ),
    instruction=DIAGNOSIS_PROMPT,  # JSON-only output, ≤2 MCP calls
)
result = await Runner(agent=diagnosis_agent, ...).run_async(failure_key)
```
A dense difference, not a vibe.
Most agent demos describe what the agent should do. This table shows what the architecture actually does, and what the common alternative looks like.
| Surface | PhoenixLoop | Common alternative |
|---|---|---|
| Tracing | phoenix.otel.register · batched · auto-instrument | Print-to-stdout / custom JSON logs |
| MCP usage | Real Phoenix MCP at runtime (sub-agent + RAG) | Hand-rolled HTTP call to /api/spans |
| Evaluation | 7 code · 4 judges (2 Phoenix Evals) · 3 tool | One ‘LLM-as-judge’ score, no breakdown |
| Improvement loop | Diagnosis sub-agent reads failing spans · patches prompt | Manual prompt edits in a Notion doc |
| Promotion | Score-gated baseline-vs-candidate experiment | Vibes; ship it and watch metrics |
| Model choice | Flash everywhere · thinking_budget tuned per agent | Pro for everything · billed accordingly |
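The "patches prompt" entry in the table refers to a one-line addition whose diff stays human-readable. A minimal sketch of that idea using the standard library follows; `apply_patch`, `render_diff`, and the example prompt text are hypothetical and do not reflect the project's patch_synthesis code.

```python
import difflib


def apply_patch(prompt: str, addition: str) -> str:
    """Append a single-line rule to the system prompt."""
    rule = addition.strip()
    if not rule or "\n" in rule:
        raise ValueError("patch must be exactly one non-empty line")
    return prompt.rstrip("\n") + "\n" + rule + "\n"


def render_diff(old: str, new: str, old_label: str, new_label: str) -> str:
    """Render a unified diff between two prompt versions."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=old_label,
            tofile=new_label,
        )
    )


baseline_prompt = "You are a support agent.\nBe concise.\n"
candidate_prompt = apply_patch(
    baseline_prompt, "Always cite the KB article you relied on."
)
diff = render_diff(baseline_prompt, candidate_prompt, "prompt v1.1", "prompt v1.2")
added_lines = [
    line for line in diff.splitlines()
    if line.startswith("+") and not line.startswith("+++")
]
```

Rejecting multi-line additions keeps each patch small enough to review at a glance and to attribute a score change to a single rule.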
Things we deliberately did not build.
A short list is a credibility signal. Anyone can claim a self-improving agent. Below is the surface area we said no to.
- A glassmorphic ‘AI gradient’ landing page.
- A fake terminal that types Lorem-ipsum forever.
- A LangSmith-tier observability rewrite.
- An evals framework competing with Phoenix.
- Agents that A2A-call ten dummy agents to look busy.
Boot the stack. Click one ticket. Watch one promotion.
The seed run covers six tickets, two intentional failures, one diagnosis, one experiment, and one release-gate verdict. It takes around 4–5 minutes in live mode (real Gemini calls) or about 90 seconds with LIGHTWEIGHT_DEMO=true (fixture replay).