Arize · Phoenix · Gemini

The agent that watches itself and rewrites its own prompt.

PhoenixLoop is a Gemini support agent that traces every run with Arize Phoenix, clusters its own failed evaluations, drafts a prompt fix from the failing spans, A/B-tests the candidate against a regression set, and gates promotion on the score. Self-improving, not theoretically — measurably.

gemini-2.5-flash phoenix-mcp arize-phoenix-evals

The loop

Seven stages. One closed circuit.

An LLM agent without a loop is a sample. With this loop, every failure becomes a labeled example, every cluster becomes a prompt patch, every patch becomes a measurable experiment.

01
Observe
Every agent turn, tool call, and judge call streams to Phoenix as OTel spans.
phoenix.otel.register
02
Evaluate
7 code evals + 4 LLM judges (2 Phoenix Evals templates + 2 custom) + 3 Phoenix tool evals score every run.
arize-phoenix-evals
03
Cluster
Repeat failures are grouped by a deterministic failure_key. Three strikes trip an improvement trigger.
failure_aggregator
04
Diagnose
A sub-agent reads its own failing spans via Phoenix MCP — get-spans, get-span-annotations — and names the pattern.
phoenix-mcp:get-spans
05
Patch prompt
Gemini drafts a minimal one-line addition to the system prompt. The diff is human-readable.
patch_synthesis
06
Experiment
Baseline-vs-candidate on 5 frozen regression examples. Code-evals only, no judge round-trips.
experiment.run
07
Gate
Release-gate verdict from the score delta. Promotion is automatic above threshold, human-in-the-loop below.
release_gate
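
The Cluster stage can be pictured with a small sketch. This is not the project's failure_aggregator. The key scheme (evaluator name plus ticket category) and the names FailureAggregator and record are assumptions made for illustration. Only the "three strikes" rule comes from the text above.

```python
from collections import Counter
from dataclasses import dataclass, field

STRIKE_THRESHOLD = 3  # "three strikes", per the Cluster stage


def failure_key(evaluator: str, category: str) -> str:
    """Deterministic key (assumed scheme): same evaluator + category -> same cluster."""
    return f"{evaluator.strip().lower()}:{category.strip().lower()}"


@dataclass
class FailureAggregator:
    """Hypothetical stand-in that counts failed evaluations per failure_key."""

    counts: Counter = field(default_factory=Counter)

    def record(self, evaluator: str, category: str) -> bool:
        """Count one failure; return True only on the strike that reaches the threshold."""
        key = failure_key(evaluator, category)
        self.counts[key] += 1
        return self.counts[key] == STRIKE_THRESHOLD


aggregator = FailureAggregator()
# Four identical failures: only the third one trips the improvement trigger.
triggers = [aggregator.record("CitationPresence", "refund") for _ in range(4)]
print(triggers)
```

Returning True only on the exact threshold keeps a noisy cluster from firing the trigger again on every later failure.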

Receipts, not claims

Three pieces of evidence.

If the agent says it observes itself, the spans should be visible. If it says it evaluates itself, the evaluators should be named. If it says it improves itself, the before-and-after should be auditable.

01 · not a console logger

Real Phoenix spans

›ADK · agent_run
›phoenix-mcp:get-spans
›phoenix-mcp:get-span-annotations
›phoenix-mcp:get-dataset-examples
›google-genai · judges_combined
›google-genai · patch_synthesis

Visible in Arize Phoenix, deep-linked from every run.

02 · 14 wired, named, deterministic

Real evaluators

›code · CitationPresence
›code · RefundGuard
›code · ToolSequence
›judge · Hallucination (Phoenix Evals)
›judge · QA-Correctness (Phoenix Evals)
›judge · PolicyCompliance (custom)

Code-evals + Phoenix Evals templates + custom judges.
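
As a rough illustration of what a deterministic code eval could look like, here is a sketch of a citation check. The real CitationPresence evaluator is not shown on this page. The citation format ([KB-123] or [POLICY-7]), the per-sentence scoring rule, and the 0.5 pass threshold are all assumptions.

```python
import re
from dataclasses import dataclass

# Hypothetical citation format: [KB-123] or [POLICY-7].
CITATION_PATTERN = re.compile(r"\[(?:KB|POLICY)-\d+\]")


@dataclass(frozen=True)
class EvalResult:
    name: str
    score: float
    passed: bool


def citation_presence(response: str, threshold: float = 0.5) -> EvalResult:
    """Score = fraction of sentences that carry at least one citation marker."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return EvalResult("CitationPresence", 0.0, False)
    cited = sum(1 for s in sentences if CITATION_PATTERN.search(s))
    score = cited / len(sentences)
    return EvalResult("CitationPresence", score, score >= threshold)


result = citation_presence("Refunds take 5 days [KB-12]. Contact support for more.")
print(result)
```

No model call is involved, so the same response always gets the same score. That is the property the experiment stage relies on.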

03 · 0.42 → 0.91 on one cluster

Real before/after

›Baseline prompt v1.1
›→ Candidate prompt v1.2
›Resolution correctness · 0.42 → 0.91
›Citation presence · 0.10 → 0.98
›Regression canaries · 5/5 pass
›Verdict · PROMOTED

Code-evals only in the experiment hot path.
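
The baseline-vs-candidate comparison and the gate can be sketched in a few lines. This is an illustration under assumptions, not the project's experiment.run or release_gate. The MIN_LIFT value, the canary rule (no example may score lower than baseline), and the stand-in agent and evaluator are all hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence

AgentFn = Callable[[str, str], str]  # (system_prompt, ticket) -> reply
CodeEval = Callable[[str], float]    # reply -> score in [0, 1]

MIN_LIFT = 0.05  # hypothetical promotion threshold


@dataclass(frozen=True)
class GateVerdict:
    baseline_avg: float
    candidate_avg: float
    canaries_passed: int
    canaries_total: int
    verdict: str  # "PROMOTED" or "NEEDS_REVIEW"


def run_gate(agent: AgentFn, baseline_prompt: str, candidate_prompt: str,
             examples: Sequence[str], evaluator: CodeEval,
             min_lift: float = MIN_LIFT) -> GateVerdict:
    """Score both prompts on the same frozen examples and gate on the delta."""
    base = [evaluator(agent(baseline_prompt, x)) for x in examples]
    cand = [evaluator(agent(candidate_prompt, x)) for x in examples]
    # A canary passes when the candidate does not score lower than the baseline.
    canaries = sum(c >= b for b, c in zip(base, cand))
    promoted = (mean(cand) - mean(base)) >= min_lift and canaries == len(examples)
    return GateVerdict(mean(base), mean(cand), canaries, len(examples),
                       "PROMOTED" if promoted else "NEEDS_REVIEW")


# Stand-ins for the real agent and code eval, so the sketch runs offline.
def fake_agent(system_prompt: str, ticket: str) -> str:
    reply = f"Resolved: {ticket}."
    return reply + " [POLICY-7]" if "cite" in system_prompt.lower() else reply


def has_citation(reply: str) -> float:
    return 1.0 if "[POLICY-" in reply else 0.0


tickets = [f"ticket-{i}" for i in range(5)]  # 5 frozen regression examples
result = run_gate(fake_agent, "You are a support agent.",
                  "You are a support agent. Always cite the policy ID.",
                  tickets, has_citation)
print(result)
```

A verdict of NEEDS_REVIEW corresponds to the human-in-the-loop path. Nothing is promoted automatically below the threshold.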

Architecture

One agent. Three feedback paths.

The support agent calls Phoenix MCP at runtime to retrieve few-shot exemplars from a curated dataset of resolved tickets. Failed runs aggregate. A separate diagnosis sub-agent reads the failing spans back from Phoenix MCP, names the root cause, and proposes a patch. Experiments score before-and-after on the same dataset.
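
The patch step adds a single line to the system prompt and keeps the diff human-readable. One way to picture that, using only the standard library, is below. The helper names apply_patch and render_diff and the sample prompt text are assumptions. The v1.1 and v1.2 labels mirror the before/after card on this page.

```python
import difflib


def apply_patch(prompt: str, addition: str) -> str:
    """Append one line to the system prompt (the minimal patch shape described above)."""
    return prompt.rstrip("\n") + "\n" + addition.strip() + "\n"


def render_diff(old: str, new: str, old_label: str = "v1.1", new_label: str = "v1.2") -> str:
    """Unified diff between two prompt versions, for human review."""
    return "".join(difflib.unified_diff(
        old.splitlines(keepends=True),
        new.splitlines(keepends=True),
        fromfile=old_label,
        tofile=new_label,
    ))


baseline = "You are a support agent.\nBe concise.\n"  # hypothetical prompt text
candidate = apply_patch(baseline, "Cite a policy ID for every refund decision.")
diff = render_diff(baseline, candidate)
print(diff)
```

A reviewer only has to read one added line, and the same diff can be attached to the experiment record.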

Code walks

Three pieces of code that carry the claim.

The most reassuring thing on a marketing page is the actual production code.

01 · Code walk

Trace every Gemini call with one register()

This replaces our hand-rolled OTel setup with the canonical Phoenix call. The auto_instrument=True flag picks up both ADK agent spans and direct google-genai calls, provided the matching OpenInference instrumentation packages are installed, so the LLM judges are no longer invisible.

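backend/src/tracing/instrumentor.py · python

import os

from phoenix.otel import register

# Assumption: the key comes from a PHOENIX_API_KEY environment variable;
# the original snippet does not show where api_key is defined.
api_key = os.environ["PHOENIX_API_KEY"]

tracer_provider = register(
    project_name="phoenixloop",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    headers={"Authorization": f"Bearer {api_key}"},
    auto_instrument=True,  # enable every installed OpenInference instrumentor
    batch=True,            # export spans in batches rather than one at a time
)
# google-genai + ADK spans now stream to Phoenix.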

02 · Code walk

Few-shot retrieval from a Phoenix dataset

The support agent calls this tool before drafting any non-trivial response. Top-3 exemplars come back from the successful-resolutions dataset in Phoenix Cloud via MCP. The retrieval span is visible in every trace.

backend/src/agent/tools.py · python

import httpx

# retry, phoenix_client, and ResolutionExample are project-local names
# defined elsewhere in the repo (not shown here).

@retry(max_attempts=3, retryable_exceptions=(httpx.TimeoutException,))
async def retrieve_similar_resolutions(
    category: str,
    brief: str,
) -> list[ResolutionExample]:
    """Top-3 exemplars from the Phoenix `successful-resolutions` dataset."""
    examples = await phoenix_client.get_dataset_examples(
        dataset="successful-resolutions",
        filter={"category": category},
        limit=3,
    )
    # Validate each raw example into a typed model before returning it.
    return [ResolutionExample.model_validate(e) for e in examples]
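
The retry decorator used above is not shown on this page. A minimal sketch of an async decorator with the same two parameters might look like the following. The base_delay parameter and the exponential backoff are assumptions, and the built-in TimeoutError stands in for httpx.TimeoutException so the example runs without extra packages.

```python
import asyncio
import functools


def retry(max_attempts: int = 3,
          retryable_exceptions: tuple = (TimeoutError,),
          base_delay: float = 0.0):
    """Retry an async function on the listed exceptions, up to max_attempts calls."""
    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return await fn(*args, **kwargs)
                except retryable_exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Assumed policy: exponential backoff between attempts.
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


calls = []


@retry(max_attempts=3, retryable_exceptions=(TimeoutError,))
async def flaky_lookup() -> str:
    """Fails twice with a timeout, then succeeds on the third call."""
    calls.append(1)
    if len(calls) < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


outcome = asyncio.run(flaky_lookup())
print(outcome, len(calls))
```

Exceptions outside retryable_exceptions propagate immediately, so a validation error in the tool is not retried.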

03 · Code walk

Diagnosis sub-agent reads its own failing spans

The diagnosis agent uses the Phoenix MCP toolset as its tool surface. The Live Trace pane shows phoenix-mcp:get-spans and phoenix-mcp:get-span-annotations spans every time it runs — real observability data flowing back into reasoning.

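backend/src/agent/diagnosis_agent.py · python

from google.adk.agents import Agent
from google.adk.runners import Runner
from google.genai.types import GenerateContentConfig, ThinkingConfig

# settings, mcp_toolset, and DIAGNOSIS_PROMPT are defined elsewhere in the repo (not shown).
diagnosis_agent = Agent(
    name="diagnosis",
    model=settings.gemini_model,                  # flash
    tools=[mcp_toolset],                          # phoenix-mcp
    generate_content_config=GenerateContentConfig(
        thinking_config=ThinkingConfig(thinking_budget=128),
    ),
    instruction=DIAGNOSIS_PROMPT,                 # JSON-only output, ≤2 MCP calls
)
# Simplified for the page: ADK's Runner.run_async yields events that are
# consumed with `async for`, rather than returning a single awaitable result.
result = await Runner(agent=diagnosis_agent, ...).run_async(failure_key)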
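
Because the diagnosis prompt asks for JSON-only output, the caller presumably validates that output before acting on it. The sketch below shows one way to do that with the standard library. The field names failure_key, pattern, and evidence_span_ids are a hypothetical schema, not the project's actual contract.

```python
import json
from dataclasses import dataclass

# Hypothetical schema for the diagnosis output.
REQUIRED_FIELDS = ("failure_key", "pattern", "evidence_span_ids")


@dataclass(frozen=True)
class Diagnosis:
    failure_key: str
    pattern: str
    evidence_span_ids: tuple


def parse_diagnosis(raw: str) -> Diagnosis:
    """Parse the model's JSON-only reply; raise ValueError on anything malformed."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"diagnosis output is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict):
        raise ValueError("diagnosis output must be a JSON object")
    missing = [name for name in REQUIRED_FIELDS if name not in payload]
    if missing:
        raise ValueError(f"diagnosis output is missing fields: {missing}")
    return Diagnosis(
        failure_key=str(payload["failure_key"]),
        pattern=str(payload["pattern"]),
        evidence_span_ids=tuple(str(s) for s in payload["evidence_span_ids"]),
    )


raw_reply = json.dumps({
    "failure_key": "citationpresence:refund",
    "pattern": "Refund replies omit the policy citation.",
    "evidence_span_ids": ["span-1", "span-2"],
})
diagnosis = parse_diagnosis(raw_reply)
print(diagnosis)
```

Rejecting malformed output early keeps a bad diagnosis from reaching the patch step.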

What it is. What it isn’t.

A dense difference, not a vibe.

Most agent demos describe what the agent should do. The table below says what the architecture actually does, and what the common alternative looks like.

Surface	PhoenixLoop	Common alternative
Tracing	phoenix.otel.register · batched · auto-instrument	Print-to-stdout / custom JSON logs
MCP usage	Real Phoenix MCP at runtime (sub-agent + RAG)	Hand-rolled HTTP call to /api/spans
Evaluation	7 code · 4 judges (2 Phoenix Evals) · 3 tool	One ‘LLM-as-judge’ score, no breakdown
Improvement loop	Diagnosis sub-agent reads failing spans · patches prompt	Manual prompt edits in a Notion doc
Promotion	Score-gated baseline-vs-candidate experiment	Vibes; ship it and watch metrics
Model choice	Flash everywhere · thinking_budget tuned per agent	Pro for everything · billed accordingly

Anti-claim

Things we deliberately did not build.

A short list is a credibility signal. Anyone can claim a self-improving agent. Below is the surface area we said no to.

A glassmorphic ‘AI gradient’ landing page.
A fake terminal that types Lorem-ipsum forever.
A LangSmith-tier observability rewrite.
An evals framework competing with Phoenix.
Agents that A2A-call ten dummy agents to look busy.

Run it locally

Boot the stack. Click one ticket. Watch one promotion.

The seed runs six tickets, two intentional failures, one diagnosis, one experiment, and one release-gate verdict. A full run takes around 4–5 minutes in live mode (real Gemini calls) or about 90 seconds with LIGHTWEIGHT_DEMO=true (fixture replay).
