How Adk Eval Guide fits into a Paperclip company.

Adk Eval Guide drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Source file
SKILL.md (311 lines, markdown)
---
name: adk-eval-guide
description: >
  MUST READ before running any ADK evaluation.
  ADK evaluation methodology — eval metrics, evalset schema, LLM-as-judge,
  tool trajectory scoring, and common failure causes.
  Use when evaluating agent quality, running adk eval, or debugging eval results.
  Do NOT use for API code patterns (use adk-cheatsheet), deployment
  (use adk-deploy-guide), or project scaffolding (use adk-scaffold).
metadata:
  license: Apache-2.0
  author: Google
---

# ADK Evaluation Guide

> **Scaffolded project?** If you used `/adk-scaffold`, you already have `make eval`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `make eval` and iterate from there.
>
> **Non-scaffolded?** Use `adk eval` directly — see [Running Evaluations](#running-evaluations) below.

## Reference Files

| File | Contents |
|------|----------|
| `references/criteria-guide.md` | Complete metrics reference — all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing — ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools — trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs — evalset schema, built-in metric limitations, custom evaluator pattern |

---

## The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, rerun — don't just report the failure.

### How to iterate

1. **Start small**: Begin with 1-2 eval cases, not the full suite
2. **Run eval**: `make eval` (or `adk eval` if no Makefile)
3. **Read the scores** — identify what failed and why
4. **Fix the code** — adjust prompts, tool logic, instructions, or the evalset
5. **Rerun eval** — verify the fix worked
6. **Repeat steps 3-5** until the case passes
7. **Only then** add more eval cases and expand coverage

**Expect 5-10+ iterations.** This is normal — each iteration makes the agent better.

### What to fix when scores fail

| Failure | What to change |
|---------|---------------|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response — this is semantic, not lexical |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |

---

## Choosing the Right Criteria

| Goal | Recommended Metric |
|------|--------------------|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.

---

## Running Evaluations

```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
```

**CLI options:** `--config_file_path`, `--print_detailed_results`, `--eval_storage_uri`, `--log_level`

**Eval set management:**
```bash
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
```

---

## Configuration Schema (`eval_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

### Full example

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": { "text_property": "The response must be professional and helpful." }
        },
        {
          "rubric_id": "safety",
          "rubric_content": { "text_property": "The agent must NEVER book without asking for confirmation." }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid: `"response_match_score": 0.8`

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.

---

## EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": { "parts": [{ "text": "Find a flight to NYC" }] },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": { "app_name": "my_app", "user_id": "user_1", "state": {} }
    }
  ]
}
```

**Key fields:**
- `intermediate_data.tool_uses` — expected tool call trajectory (chronological order)
- `intermediate_data.intermediate_responses` — expected sub-agent responses (for multi-agent systems)
- `session_input.state` — initial session state (overrides Python-level initialization)
- `conversation_scenario` — alternative to `conversation` for user simulation (see `references/user-simulation.md`)

---

## Common Gotchas

### The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes `tool_trajectory_avg_score` failures with `EXACT` match. Solutions:

1. **Use `IN_ORDER` or `ANY_ORDER` match type** — tolerates extra tool calls between expected ones
2. Include ALL tools the agent might call in your expected trajectory
3. Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
4. Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."

### Multi-turn conversations require tool_uses for ALL turns

The `tool_trajectory_avg_score` evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.

```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": { "parts": [{"text": "Find me a flight from NYC to London"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": {"origin": "NYC", "destination": "LON"} }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": { "parts": [{"text": "Book the first option"}] },
      "final_response": { "role": "model", "parts": [{"text": "Booking confirmed!"}] },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": {"flight_id": "1"} }
        ]
      }
    }
  ]
}
```

### App name must match directory name

The `App` object's `name` parameter MUST match the directory containing your agent:

```python
# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

### The `before_agent_callback` Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
async def initialize_state(callback_context: CallbackContext) -> None:
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}

root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```

### Eval-State Overrides (Type Mismatch Danger)

Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

```json
// WRONG — initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT — matches the Python type (list)
"state": { "feedback_history": [] }

// NOTE: Remove these // comments before using — JSON does not support comments.
```

### Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

---

## Common Eval Failure Causes

| Symptom | Cause | Fix |
|---------|-------|-----|
| Missing `tool_uses` in intermediate turns | Trajectory expects match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add `hallucinations_v1` metric |
| "Session not found" error | App name mismatch | Ensure App `name` matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set `temperature=0` or use rubric-based eval |
| `tool_trajectory_avg_score` always 0 | Agent uses `google_search` (model-internal) | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER`/`ANY_ORDER` match type |
| LLM judge ignores image/audio in eval | `get_text_from_content()` skips non-text parts | Use custom metric with vision-capable judge (see `references/multimodal-eval.md`) |

---

## Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:

- **Evaluation overview**: `https://adk.dev/evaluate/index.md`
- **Criteria reference**: `https://adk.dev/evaluate/criteria/index.md`
- **User simulation**: `https://adk.dev/evaluate/user-sim/index.md`

---

## Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

1. Check if agent uses `google_search` — if so, see `references/builtin-tools-eval.md`
2. Check if using `EXACT` match and agent calls extra tools — try `IN_ORDER`
3. Compare expected `tool_uses` in evalset with actual agent behavior
4. Fix mismatch (update evalset or agent instructions)
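
Steps 2 and 3 of the debugging checklist are easier to reason about with a small model of the three match types. The sketch below is not ADK's implementation. It only illustrates the behavior this guide describes: `EXACT` rejects any extra call, while `IN_ORDER` and `ANY_ORDER` tolerate extra calls around the expected ones. The helper names `same_call` and `trajectory_matches` are hypothetical, and comparing both name and args for equality is an assumption.

```python
from typing import Any

ToolCall = dict[str, Any]  # e.g. {"name": "search_flights", "args": {...}}


def same_call(expected: ToolCall, actual: ToolCall) -> bool:
    """Assumed rule: two calls match when tool name and args are both equal."""
    return (
        expected["name"] == actual["name"]
        and expected.get("args", {}) == actual.get("args", {})
    )


def trajectory_matches(
    expected: list[ToolCall], actual: list[ToolCall], match_type: str = "EXACT"
) -> bool:
    """Illustrative model of the match types (not the ADK source)."""
    if match_type == "EXACT":
        # Same calls, same order, nothing extra.
        return len(expected) == len(actual) and all(
            same_call(e, a) for e, a in zip(expected, actual)
        )
    if match_type == "IN_ORDER":
        # Expected calls must appear as a subsequence of the actual calls.
        remaining = iter(actual)
        return all(any(same_call(e, a) for a in remaining) for e in expected)
    if match_type == "ANY_ORDER":
        # Every expected call must appear somewhere; order is ignored.
        unused = list(actual)
        for e in expected:
            for i, a in enumerate(unused):
                if same_call(e, a):
                    del unused[i]
                    break
            else:
                return False
        return True
    raise ValueError(f"unknown match_type: {match_type}")


# The "proactivity gap": the agent adds a search nobody asked for.
expected = [{"name": "save_preferences", "args": {"theme": "dark"}}]
actual = [
    {"name": "save_preferences", "args": {"theme": "dark"}},
    {"name": "google_search", "args": {"query": "dark themes"}},
]

for match_type in ("EXACT", "IN_ORDER", "ANY_ORDER"):
    print(match_type, trajectory_matches(expected, actual, match_type))
```

In this example the extra `google_search` call makes the `EXACT` comparison fail while the other two pass, which is the situation step 2 asks you to rule out before editing the evalset or the agent instructions.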