Arize · Phoenix · Gemini

The agent that watches itself and rewrites its own prompt.

PhoenixLoop is a Gemini support agent that traces every run with Arize Phoenix, clusters its own failed evaluations, drafts a prompt fix from the failing spans, A/B-tests the candidate against a regression set, and gates promotion on the score. The self-improvement is measured, not theoretical.

gemini-2.5-flash phoenix-mcp arize-phoenix-evals
backend/src/tracing/instrumentor.py · python
import os

from phoenix.otel import register

# Assumption: the Phoenix API key is read from the PHOENIX_API_KEY environment variable.
api_key = os.environ["PHOENIX_API_KEY"]

tracer_provider = register(
    project_name="phoenixloop",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    headers={"Authorization": f"Bearer {api_key}"},
    auto_instrument=True,
    batch=True,
)
# google-genai + ADK spans now stream to Phoenix.
live trace · stdout
demo seed
12:14:02.001 evaluator:CitationPresence(FAILED · score=0.42)
Agent runs traced (in Phoenix)
Evaluators wired (code · LLM · MCP)
MCP tool calls / run (avg recent 50)
Prompts auto-promoted (release-gate)
// match-rate lift
Baseline avg (pre-heal score)
Post-heal avg (post-heal score)
Lift (delta)
The loop

Seven stages. One closed circuit.

An LLM agent without a loop is a sample. With this loop, every failure becomes a labeled example, every cluster becomes a prompt patch, every patch becomes a measurable experiment.

  1. Observe

    Every agent turn, tool call, and judge call streams to Phoenix as OTel spans.

    phoenix.otel.register
  2. Evaluate

    7 code evals + 4 LLM judges (2 Phoenix Evals templates + 2 custom) + 3 Phoenix tool evals score every run.

    arize-phoenix-evals
  3. Cluster

    Repeat failures are grouped by a deterministic failure_key. Three strikes trip an improvement trigger.

    failure_aggregator
  4. Diagnose

    A sub-agent reads its own failing spans via Phoenix MCP (get-spans, get-span-annotations) and names the pattern.

    phoenix-mcp:get-spans
  5. Patch prompt

    Gemini drafts a minimal one-line addition to the system prompt. The diff is human-readable.

    patch_synthesis
  6. Experiment

    Baseline-vs-candidate on 5 frozen regression examples. Code-evals only, no judge round-trips.

    experiment.run
  7. Gate

    Release-gate verdict from the score delta. Promotion is automatic above the threshold and goes to a human-in-the-loop review below it.

    release_gate
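
The Cluster stage is the easiest one to picture in code. What follows is a minimal sketch, not the project's failure_aggregator: it assumes the key is built from the evaluator name and the ticket category, and the names FailureAggregator, record, and STRIKE_LIMIT are hypothetical.

```python
import hashlib
from collections import defaultdict

STRIKE_LIMIT = 3  # the "three strikes" rule from the Cluster stage


def failure_key(evaluator: str, category: str) -> str:
    """Deterministic key: the same evaluator and category always hash the same.

    Assumption: the real key may be built from different fields.
    """
    raw = f"{evaluator.strip().lower()}|{category.strip().lower()}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()[:12]


class FailureAggregator:
    """Groups failed runs by key and reports when a key hits the strike limit."""

    def __init__(self, strike_limit: int = STRIKE_LIMIT) -> None:
        self.strike_limit = strike_limit
        self.run_ids = defaultdict(list)  # failure_key -> list of run ids

    def record(self, run_id: str, evaluator: str, category: str) -> bool:
        """Store one failed run; return True exactly when the trigger trips."""
        key = failure_key(evaluator, category)
        self.run_ids[key].append(run_id)
        return len(self.run_ids[key]) == self.strike_limit


aggregator = FailureAggregator()
tripped = [
    aggregator.record(f"run-{i}", "CitationPresence", "refund")
    for i in range(3)
]
```

Because the key is a hash of normalized fields, two runs that fail the same way land in the same bucket without any fuzzy matching, and the trigger fires once, on the third strike.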
Receipts, not claims

Three pieces of evidence.

If the agent says it observes itself, the spans should be visible. If it says it evaluates itself, the evaluators should be named. If it says it improves itself, the before-and-after should be auditable.

01 · not a console logger

Real Phoenix spans

  • ADK · agent_run
  • phoenix-mcp:get-spans
  • phoenix-mcp:get-span-annotations
  • phoenix-mcp:get-dataset-examples
  • google-genai · judges_combined
  • google-genai · patch_synthesis

Visible in Arize Phoenix, deep-linked from every run.

02 · 14 wired, named, deterministic

Real evaluators

  • code · CitationPresence
  • code · RefundGuard
  • code · ToolSequence
  • judge · Hallucination (Phoenix Evals)
  • judge · QA-Correctness (Phoenix Evals)
  • judge · PolicyCompliance (custom)

Code-evals + Phoenix Evals templates + custom judges.
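
To make "deterministic" concrete, here is a rough sketch of what a code evaluator in the spirit of CitationPresence could look like. It is an illustration, not the project's evaluator: the [KB-123] citation format, the 0.8 pass threshold, and the EvalResult type are all assumptions.

```python
import re
from dataclasses import dataclass

CITATION = re.compile(r"\[KB-\d+\]")  # hypothetical citation marker format
PASS_THRESHOLD = 0.8                  # hypothetical pass mark


@dataclass(frozen=True)
class EvalResult:
    name: str
    score: float
    passed: bool


def citation_presence(response: str) -> EvalResult:
    """Score = share of sentences that carry at least one citation marker."""
    # Split on whitespace that follows sentence-ending punctuation.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", response.strip()) if s]
    if not sentences:
        return EvalResult("CitationPresence", 0.0, False)
    cited = sum(1 for s in sentences if CITATION.search(s))
    score = cited / len(sentences)
    return EvalResult("CitationPresence", score, score >= PASS_THRESHOLD)


good = citation_presence("Refunds take 5 days [KB-12]. Contact billing [KB-40].")
bad = citation_presence("Refunds take 5 days. Contact billing [KB-40].")
```

The same input always yields the same score, which is what lets a check like this run without a judge round-trip.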

03 · 0.42 → 0.91 on one cluster

Real before/after

  • Baseline prompt v1.1
  • → Candidate prompt v1.2
  • Resolution correctness · 0.42 → 0.91
  • Citation presence · 0.10 → 0.98
  • Regression canaries · 5/5 pass
  • Verdict · PROMOTED

Code-evals only in the experiment hot path.
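
The verdict above comes down to a score delta plus a canary check. The sketch below shows one plausible shape for that decision. The 0.05 threshold, the function name, and the per-example scores are illustrative assumptions; the scores are chosen only so that their means equal the 0.42 and 0.91 shown above.

```python
from statistics import mean

PROMOTE_THRESHOLD = 0.05  # hypothetical minimum lift for automatic promotion


def release_gate(baseline, candidate, canaries_passed, canaries_total,
                 threshold=PROMOTE_THRESHOLD):
    """Return the lift and a verdict for one baseline-vs-candidate experiment."""
    lift = mean(candidate) - mean(baseline)
    canaries_ok = canaries_passed == canaries_total
    if canaries_ok and lift >= threshold:
        verdict = "PROMOTED"
    else:
        # Assumption: in this sketch a REJECT is what routes to human review.
        verdict = "REJECT"
    return {"lift": round(lift, 4), "verdict": verdict}


# Illustrative per-example scores for five frozen regression examples.
baseline_scores = [0.4, 0.5, 0.4, 0.4, 0.4]
candidate_scores = [0.9, 0.9, 0.9, 0.95, 0.9]

promoted = release_gate(baseline_scores, candidate_scores, 5, 5)
blocked = release_gate(baseline_scores, candidate_scores, 4, 5)
```

Requiring every canary to pass before looking at the lift means a large gain on the failing cluster cannot mask a regression elsewhere.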

Architecture

One agent. Three feedback paths.

The support agent calls Phoenix MCP at runtime to retrieve few-shot exemplars from a curated dataset of resolved tickets. Failed runs aggregate. A separate diagnosis sub-agent reads the failing spans back from Phoenix MCP, names the root cause, and proposes a patch. Experiments score before-and-after on the same dataset.

  • Customer ticket: REST /api/tickets
  • Support agent · ADK: gemini-2.5-flash · 3 tools
  • Phoenix · OTel + Evals: auto_instrument · batch
  • Response: cited · structured
  • Phoenix MCP: get-dataset-examples · get-spans
  • Diagnosis sub-agent: reads its own failing spans
  • Failure aggregator: failure_key · 3-strike rule
  • Patch synthesis: one-line prompt diff
  • Experiment runner: baseline vs candidate · 5 ex
  • Release gate: PROMOTED / REJECT
  • Edge labels: retrieve_similar_resolutions · failing spans
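
The patch-synthesis node promises a one-line, human-readable prompt diff. Here is a small sketch of how such a diff could be produced with Python's standard difflib. The prompt text, the added rule, and both helper names are hypothetical; in PhoenixLoop the added line would come from Gemini.

```python
import difflib


def apply_one_line_patch(prompt: str, addition: str) -> str:
    """Append exactly one line to the prompt; refuse multi-line additions."""
    line = addition.strip()
    if not line or "\n" in line:
        raise ValueError("patch must be a single non-empty line")
    return prompt.rstrip("\n") + "\n" + line + "\n"


def render_diff(old: str, new: str, old_label: str, new_label: str) -> str:
    """Unified diff between two prompt versions, as one printable string."""
    return "".join(
        difflib.unified_diff(
            old.splitlines(keepends=True),
            new.splitlines(keepends=True),
            fromfile=old_label,
            tofile=new_label,
        )
    )


# Hypothetical prompt and patch, for illustration only.
baseline_prompt = "You are a support agent.\nBe concise.\n"
candidate_prompt = apply_one_line_patch(
    baseline_prompt, "Cite a knowledge-base article in every answer."
)
diff_text = render_diff(baseline_prompt, candidate_prompt, "prompt v1.1", "prompt v1.2")
added_lines = [
    line for line in diff_text.splitlines()
    if line.startswith("+") and not line.startswith("+++")
]
```

Rejecting multi-line additions in code keeps each patch small enough that a reviewer can read the whole change at a glance.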
Code walks

Three pieces of code that carry the claim.

The most reassuring thing on a marketing page is the actual production code.

01 · Code walk

Trace every Gemini call with one register()

This replaces our hand-rolled OTel setup with the canonical Phoenix call. The auto_instrument=True flag picks up both ADK agent spans and direct google-genai calls, provided the matching OpenInference instrumentation packages are installed, so the LLM judges are no longer invisible.

backend/src/tracing/instrumentor.py · python
import os

from phoenix.otel import register

# Assumption: the Phoenix API key is read from the PHOENIX_API_KEY environment variable.
api_key = os.environ["PHOENIX_API_KEY"]

tracer_provider = register(
    project_name="phoenixloop",
    endpoint="https://app.phoenix.arize.com/v1/traces",
    headers={"Authorization": f"Bearer {api_key}"},
    auto_instrument=True,
    batch=True,
)
# google-genai + ADK spans now stream to Phoenix.
02 · Code walk

Few-shot retrieval from a Phoenix dataset

The support agent calls this tool before drafting any non-trivial response. Top-3 exemplars come back from the successful-resolutions dataset in Phoenix Cloud via MCP. The retrieval span is visible in every trace.

backend/src/agent/tools.py · python
import httpx

# retry, phoenix_client, and ResolutionExample are project-local names defined elsewhere.
@retry(max_attempts=3, retryable_exceptions=(httpx.TimeoutException,))
async def retrieve_similar_resolutions(
    category: str,
    brief: str,
) -> list[ResolutionExample]:
    """Top-3 exemplars from the Phoenix `successful-resolutions` dataset."""
    examples = await phoenix_client.get_dataset_examples(
        dataset="successful-resolutions",
        filter={"category": category},
        limit=3,
    )
    return [ResolutionExample.model_validate(e) for e in examples]
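
The @retry decorator in the snippet above is project-local and its source is not shown on this page. As a rough idea of what such a helper might look like, here is a standard-library sketch that accepts the same two keyword arguments. The base_delay parameter, the exponential backoff, and the flaky_fetch demo are assumptions.

```python
import asyncio
import functools


def retry(max_attempts=3, retryable_exceptions=(Exception,), base_delay=0.0):
    """Retry an async function on the listed exceptions, with backoff."""

    def decorator(fn):
        @functools.wraps(fn)
        async def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return await fn(*args, **kwargs)
                except retryable_exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff: base_delay, 2x, 4x, ...
                    await asyncio.sleep(base_delay * 2 ** (attempt - 1))

        return wrapper

    return decorator


calls = 0


@retry(max_attempts=3, retryable_exceptions=(TimeoutError,))
async def flaky_fetch() -> str:
    """Hypothetical stand-in for the MCP call: fails twice, then succeeds."""
    global calls
    calls += 1
    if calls < 3:
        raise TimeoutError("simulated timeout")
    return "ok"


result = asyncio.run(flaky_fetch())
```

Only the listed exception types are retried, so a validation error or a bad filter fails fast instead of burning three attempts.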
03 · Code walk

Diagnosis sub-agent reads its own failing spans

The diagnosis agent uses the Phoenix MCP toolset as its tool surface. The Live Trace pane shows phoenix-mcp:get-spans and phoenix-mcp:get-span-annotations spans every time it runs — real observability data flowing back into reasoning.

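backend/src/agent/diagnosis_agent.py · python
from google.adk.agents import Agent
from google.adk.runners import Runner
from google.genai.types import GenerateContentConfig, ThinkingConfig

diagnosis_agent = Agent(
    name="diagnosis",
    model=settings.gemini_model,                  # flash
    tools=[mcp_toolset],                          # phoenix-mcp
    generate_content_config=GenerateContentConfig(
        thinking_config=ThinkingConfig(thinking_budget=128),
    ),
    instruction=DIAGNOSIS_PROMPT,                 # JSON-only output, ≤2 MCP calls
)
runner = Runner(agent=diagnosis_agent, ...)
# run_async is an async generator of events (keyword-only user_id, session_id,
# new_message); the failure_key travels inside new_message.
async for event in runner.run_async(...):
    if event.is_final_response():
        result = event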
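
The prompt comment above sets two constraints: JSON-only output and at most two MCP calls. One way to hold the sub-agent to them is to validate its final answer before it reaches patch synthesis. This is a sketch under assumptions: the field names pattern and root_cause and the parse_diagnosis helper are hypothetical, not the project's schema.

```python
import json

MAX_MCP_CALLS = 2                          # from the "≤2 MCP calls" constraint
REQUIRED_FIELDS = ("pattern", "root_cause")  # hypothetical schema


def parse_diagnosis(raw_text: str, mcp_calls: int) -> dict:
    """Validate the sub-agent's final answer; raise ValueError if it breaks a rule."""
    if mcp_calls > MAX_MCP_CALLS:
        raise ValueError(f"used {mcp_calls} MCP calls, limit is {MAX_MCP_CALLS}")
    try:
        payload = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        raise ValueError("diagnosis output is not valid JSON") from exc
    if not isinstance(payload, dict):
        raise ValueError("diagnosis output must be a JSON object")
    missing = [field for field in REQUIRED_FIELDS if field not in payload]
    if missing:
        raise ValueError(f"diagnosis is missing fields: {missing}")
    return payload


diagnosis = parse_diagnosis(
    '{"pattern": "missing citation", "root_cause": "prompt never asks for one"}',
    mcp_calls=2,
)
```

Failing loudly here means a chatty or malformed diagnosis never turns into a prompt patch.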
What it is. What it isn’t.

A dense difference, not a vibe.

Most agent demos describe what the agent should do. This table shows what the architecture actually does, and what the common alternative looks like.

Surface | PhoenixLoop | Common alternative
Tracing | phoenix.otel.register · batched · auto-instrument | Print-to-stdout / custom JSON logs
MCP usage | Real Phoenix MCP at runtime (sub-agent + RAG) | Hand-rolled HTTP call to /api/spans
Evaluation | 7 code · 4 judges (2 Phoenix Evals) · 3 tool | One ‘LLM-as-judge’ score, no breakdown
Improvement loop | Diagnosis sub-agent reads failing spans · patches prompt | Manual prompt edits in a Notion doc
Promotion | Score-gated baseline-vs-candidate experiment | Vibes; ship it and watch metrics
Model choice | Flash everywhere · thinking_budget tuned per agent | Pro for everything · billed accordingly
Anti-claim

Things we deliberately did not build.

A short list is a credibility signal. Anyone can claim a self-improving agent. Below is the surface area we said no to.

  • A glassmorphic ‘AI gradient’ landing page.
  • A fake terminal that types Lorem-ipsum forever.
  • A LangSmith-tier observability rewrite.
  • An evals framework competing with Phoenix.
  • Agents that A2A-call ten dummy agents to look busy.
Run it locally

Boot the stack. Click one ticket. Watch one promotion.

The seed runs six tickets with two intentional failures, followed by one diagnosis, one experiment, and one release-gate verdict. Expect around 4–5 minutes in live mode (real Gemini calls) or about 90 seconds with LIGHTWEIGHT_DEMO=true (fixture replay).