Evaluating and shipping agents that work
Descript, Bolt, Claude Code — all followed the same arc
Code eval: score: 1 · label: "valid"
LLM judge: score: 0 · label: "incorrect"
explanation: "The response fails to include a budget
breakdown, which is a core requirement. The agent
provides destination info and local recommendations
but omits all cost estimates, making the plan
incomplete for a user who asked specifically
about budget travel to Tokyo."
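Both kinds of evaluator report the same three fields: score, label, and an optional explanation. A minimal sketch of that shape, assuming a hypothetical `EvalResult` container and a hypothetical code check `has_budget_breakdown` (neither comes from the talk or from any library):

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalResult:
    """Hypothetical container for the three fields shown above."""
    score: float
    label: str
    explanation: Optional[str] = None


def has_budget_breakdown(output: str) -> EvalResult:
    """Toy code eval: pass if the text contains at least one currency amount."""
    # Matches amounts such as "$120", "¥3,500" or "€80.50".
    # A deliberately crude heuristic, for illustration only.
    if re.search(r"[$¥€£]\s?\d[\d,]*(\.\d+)?", output):
        return EvalResult(score=1, label="pass")
    return EvalResult(
        score=0,
        label="fail",
        explanation="No cost estimates found in the response.",
    )
```

A code check like this is cheap and deterministic, but it only says that a number is present. Judging whether the budget is any good still needs a reader or an LLM judge with an explanation.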
%pip install claude-agent-sdk openinference-instrumentation-claude-agent-sdk arize arize-otel arize-phoenix anthropic
import os

os.environ["ANTHROPIC_API_KEY"] = "sk-ant-XXXX"
os.environ["ARIZE_API_KEY"] = "YYYY"
os.environ["ARIZE_SPACE_ID"] = "ZZZZ"
from arize.otel import register, Endpoint
tracer_provider = register(
    space_id=..., api_key=...,
    project_name="aie-financial-demo",
    endpoint=Endpoint.ARIZE,
    batch=False,
)
ClaudeAgentSDKInstrumentor().instrument(
    tracer_provider=tracer_provider)
from arize import ArizeClient
arize_client = ArizeClient(api_key=...)
SPACE_ID = os.environ["ARIZE_SPACE_ID"]
PROJECT_NAME = "aie-financial-demo"
npx skills add Arize-ai/arize-skills
options = ClaudeAgentOptions(
    model="claude-haiku-4-5-20251001",
    allowed_tools=["WebSearch"],
    permission_mode="acceptEdits",
)
RESEARCH_PROMPT = (
    "Research {tickers}. Focus on: {focus}. "
    "Use web search to find current financial data."
)
WRITE_PROMPT = (
    "Now write a concise financial report "
    "based on your research above."
)
with tracer.start_as_current_span("financial_report", ...):
    result = await financial_report(
        "TSLA",
        "financial performance and growth outlook"
    )
print(result)
Here's one I made earlier
test_queries = [
    {"tickers": "AAPL", "focus": "revenue growth"},
    {"tickers": "NVDA", "focus": "AI chip demand"},
    {"tickers": "AAPL, MSFT", "focus": "comparative analysis"},
    {"tickers": "RIVN", "focus": "financial health"},
    {"tickers": "KO", "focus": "dividend yield"},
    ...  # 12 in total
]
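One way to turn a query list like this into traces is to loop over it and call the agent once per entry. A rough sketch, assuming the agent is the async `financial_report(tickers, focus)` function called earlier. The stub below is a stand-in so the example runs on its own, and `run_all` is a hypothetical helper, not something from the talk:

```python
import asyncio


async def financial_report(tickers: str, focus: str) -> str:
    """Stand-in stub for the real agent so this sketch runs without API keys."""
    return f"[report on {tickers}: {focus}]"


async def run_all(queries: list[dict]) -> list[str]:
    """Run the agent once per query, sequentially, and collect the reports."""
    reports = []
    for query in queries:
        # Sequential on purpose: one trace at a time is easier to read,
        # and it avoids hammering the API with parallel requests.
        reports.append(await financial_report(query["tickers"], query["focus"]))
    return reports


# In a script:   reports = asyncio.run(run_all(test_queries))
# In a notebook: reports = await run_all(test_queries)
```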
spans_df = arize_client.spans.export_to_df(
    space_id=SPACE_ID,
    project_name=PROJECT_NAME,
    start_time=..., end_time=...,
)
parent_spans = spans_df[spans_df["parent_id"].isna()]
@create_evaluator(name="mentions_ticker", kind="code")
def mentions_ticker(input, output):
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input)
    ...
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
            "explanation": f"Missing: {', '.join(missing)}"}
with suppress_tracing():
    results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[mentions_ticker])
log_eval_to_ax(results, eval_name="mentions_ticker")
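The middle of `mentions_ticker` is elided on the slide. One plausible way to complete the check, written as plain functions with no decorator so it runs standalone. It assumes "missing" means a ticker from the input that never appears as a whole word in the output. That is a guess at the intent, not the talk's actual code, and both function names are hypothetical:

```python
import re


def find_missing_tickers(input_text: str, output_text: str) -> list[str]:
    """Return tickers found in the input that never appear in the output."""
    # Same pattern as the slide: 1 to 5 capital letters as a whole word.
    tickers = re.findall(r"\b([A-Z]{1,5})\b", input_text)
    # dict.fromkeys removes duplicates while keeping first-seen order.
    return [
        ticker for ticker in dict.fromkeys(tickers)
        if not re.search(rf"\b{re.escape(ticker)}\b", output_text)
    ]


def mentions_ticker_sketch(input_text: str, output_text: str) -> dict:
    """Pass only if every requested ticker shows up in the report."""
    missing = find_missing_tickers(input_text, output_text)
    if not missing:
        return {"label": "pass", "score": 1}
    return {"label": "fail", "score": 0,
            "explanation": f"Missing: {', '.join(missing)}"}
```

Note that the pattern also matches ordinary uppercase words such as "AI", so this works best when the input is just the ticker string rather than the full prompt.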
from phoenix.evals.llm import LLM
from phoenix.evals.metrics import CorrectnessEvaluator
llm = LLM(provider="anthropic", model="claude-sonnet-4-6")
correctness_eval = CorrectnessEvaluator(llm=llm)
with suppress_tracing():
    correctness_results = evaluate_dataframe(
        dataframe=parent_spans,
        evaluators=[correctness_eval])
faithfulness_eval = FaithfulnessEvaluator(llm=llm)
with suppress_tracing():
    faith_results = evaluate_dataframe(
        dataframe=spans_with_context,
        evaluators=[faithfulness_eval])
"You are an expert financial analyst evaluator.
Your task is to judge whether a financial report
provides actionable investment guidance,
not just raw data."
<user_query>
{input}
</user_query>
<financial_report>
{output}
</financial_report>
"Based on NVDA's 122% YoY revenue growth driven by
data center demand, strong forward P/E of 35x relative
to sector median of 22x, and expanding margins, NVDA
presents a compelling growth position. Key risk:
concentration in AI training chips (~70% of revenue).
Recommendation: accumulate on pullbacks below $800."
"NVDA is a major player in the semiconductor industry.
The company has seen significant growth in recent years
driven by AI demand. NVDA's stock has performed well.
Investors should consider various factors when making
investment decisions."
Don't end the prompt with "answer ACTIONABLE or NOT"
Define the choices in the evaluator config instead
choices={"actionable": 1.0, "not actionable": 0.0}
actionability_evaluator = ClassificationEvaluator(
    name="actionability",
    llm=llm,
    prompt_template=actionability_template,
    choices={"actionable": 1.0, "not actionable": 0.0},
)
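Why a mapping beats a free-text instruction: with the choices declared as data, any label outside the set can be rejected instead of silently scored. The sketch below illustrates that idea only. `score_from_label` is a hypothetical helper and is not how `ClassificationEvaluator` works internally:

```python
def score_from_label(raw_label: str, choices: dict[str, float]) -> float:
    """Map a judge's label onto its numeric score, rejecting unknown labels."""
    # Normalise case and surrounding whitespace before the lookup.
    label = raw_label.strip().lower()
    if label not in choices:
        raise ValueError(
            f"Unexpected label {raw_label!r}; expected one of {sorted(choices)}"
        )
    return choices[label]


choices = {"actionable": 1.0, "not actionable": 0.0}
```

With free-text output such as "answer ACTIONABLE or NOT", the string "NOT ACTIONABLE" contains "ACTIONABLE", so naive substring parsing would score it wrong. An explicit set of choices avoids that class of bug.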
spans_df = arize_client.spans.export_to_df(...)
ANNOTATION_COL = "annotation.human_actionable.label"
labeled_subset = parent_spans[
    parent_spans[ANNOTATION_COL].notna()]
with suppress_tracing():
    judge_results = evaluate_dataframe(
        dataframe=labeled_subset,
        evaluators=[actionability_evaluator])
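Running the judge on the human-labeled subset only pays off if the two sets of labels are then compared. A minimal sketch of that comparison, assuming both have already been pulled out as plain lists in the same row order. `judge_agreement` is a hypothetical helper, and the labels below are made up to show the call shape, not results from the talk:

```python
from collections import Counter


def judge_agreement(human_labels: list[str], judge_labels: list[str]) -> dict:
    """Compare judge labels with human annotations, pairwise and in order."""
    if len(human_labels) != len(judge_labels):
        raise ValueError("Label lists must be the same length")
    pairs = list(zip(human_labels, judge_labels))
    if not pairs:
        raise ValueError("No labeled examples to compare")
    matches = sum(human == judge for human, judge in pairs)
    return {
        "agreement": matches / len(pairs),
        # Counts keyed by (human, judge); keys where the two differ
        # show which direction the judge errs in.
        "confusion": Counter(pairs),
    }


# Made-up labels, for illustration only.
human = ["actionable", "not actionable", "actionable", "not actionable"]
judge = ["actionable", "not actionable", "not actionable", "not actionable"]
report = judge_agreement(human, judge)
```

The confusion counts matter as much as the headline number: a judge that is too strict and one that is too lenient can show the same agreement rate but need opposite prompt fixes.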
async def improved_financial_report(tickers, focus):
    ...  # uses IMPROVED_RESEARCH_PROMPT / IMPROVED_WRITE_PROMPT
experiment, experiment_df = arize_client.experiments.run(
    name="improved-prompts-v1",
    dataset="aie-financial-demo-fails",
    space=SPACE_ID,
    task=improved_agent_task,
    evaluators=[actionability_eval],
)
Find failures → Read explanations → Fix → Run experiment → Repeat
Application → online evals → eval labels → monitors → alert → improve → repeat
Instrument → trace → read data → eval → validate → iterate → ship → monitor
@seldo.com on Bluesky
Get a free year! Upgrade with code ARIZEAIE2026