Install

npx skills add https://github.com/google/adk-docs --skill adk-eval-guide
Source file: SKILL.md (311 lines)
---
name: adk-eval-guide
description: >
  MUST READ before running any ADK evaluation.
  ADK evaluation methodology: eval metrics, evalset schema, LLM-as-judge, tool trajectory scoring, and common failure causes.
  Use when evaluating agent quality, running adk eval, or debugging eval results.
  Do NOT use for API code patterns (use adk-cheatsheet), deployment (use adk-deploy-guide), or project scaffolding (use adk-scaffold).
metadata:
  license: Apache-2.0
  author: Google
---

# ADK Evaluation Guide

> **Scaffolded project?** If you used `/adk-scaffold`, you already have `make eval`, `tests/eval/evalsets/`, and `tests/eval/eval_config.json`. Start with `make eval` and iterate from there.
>
> **Non-scaffolded?** Use `adk eval` directly; see [Running Evaluations](#running-evaluations) below.

## Reference Files

| File | Contents |
|------|----------|
| `references/criteria-guide.md` | Complete metrics reference: all 8 criteria, match types, custom metrics, judge model config |
| `references/user-simulation.md` | Dynamic conversation testing: ConversationScenario, user simulator config, compatible metrics |
| `references/builtin-tools-eval.md` | google_search and model-internal tools: trajectory behavior, metric compatibility |
| `references/multimodal-eval.md` | Multimodal inputs: evalset schema, built-in metric limitations, custom evaluator pattern |

---

## The Eval-Fix Loop

Evaluation is iterative. When a score is below threshold, diagnose the cause, fix it, and rerun. Don't just report the failure.

### How to iterate

1. **Start small**: Begin with 1-2 eval cases, not the full suite
2. **Run eval**: `make eval` (or `adk eval` if no Makefile)
3. **Read the scores**: identify what failed and why
4. **Fix the code**: adjust prompts, tool logic, instructions, or the evalset
5. **Rerun eval**: verify the fix worked
6. **Repeat steps 3-5** until the case passes
7. **Only then** add more eval cases and expand coverage

**Expect 5-10+ iterations.** This is normal; each iteration makes the agent better.

### What to fix when scores fail

| Failure | What to change |
|---------|---------------|
| `tool_trajectory_avg_score` low | Fix agent instructions (tool ordering), update evalset `tool_uses`, or switch to `IN_ORDER`/`ANY_ORDER` match type |
| `response_match_score` low | Adjust agent instruction wording, or relax the expected response |
| `final_response_match_v2` low | Refine agent instructions, or adjust expected response (this metric is semantic, not lexical) |
| `rubric_based` score low | Refine agent instructions to address the specific rubric that failed |
| `hallucinations_v1` low | Tighten agent instructions to stay grounded in tool output |
| Agent calls wrong tools | Fix tool descriptions, agent instructions, or tool_config |
| Agent calls extra tools | Use `IN_ORDER`/`ANY_ORDER` match type, add strict stop instructions, or switch to `rubric_based_tool_use_quality_v1` |

---

## Choosing the Right Criteria

| Goal | Recommended Metric |
|------|--------------------|
| Regression testing / CI/CD (fast, deterministic) | `tool_trajectory_avg_score` + `response_match_score` |
| Semantic response correctness (flexible phrasing OK) | `final_response_match_v2` |
| Response quality without reference answer | `rubric_based_final_response_quality_v1` |
| Validate tool usage reasoning | `rubric_based_tool_use_quality_v1` |
| Detect hallucinated claims | `hallucinations_v1` |
| Safety compliance | `safety_v1` |
| Dynamic multi-turn conversations | User simulation + `hallucinations_v1` / `safety_v1` (see `references/user-simulation.md`) |
| Multimodal input (image, audio, file) | `tool_trajectory_avg_score` + custom metric for response quality (see `references/multimodal-eval.md`) |

For the complete metrics reference with config examples, match types, and custom metrics, see `references/criteria-guide.md`.
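The `EXACT`, `IN_ORDER`, and `ANY_ORDER` match types named in the failure table can be pictured with a small sketch. This is a simplified illustration of the behavior this guide describes, not ADK's actual scoring code; the function names and the dict shape for a tool call are assumptions made for the example.

```python
# Simplified illustration of trajectory match types (NOT ADK's implementation).
# A tool call is modeled as {"name": ..., "args": {...}} (an assumed shape).


def matches_exact(expected: list, actual: list) -> bool:
    """Same calls, same order, nothing extra and nothing missing."""
    return expected == actual


def matches_in_order(expected: list, actual: list) -> bool:
    """Expected calls appear in order; extra calls in between are tolerated."""
    remaining = iter(actual)
    for call in expected:
        # Advance through the actual calls until this expected call is found.
        if not any(call == candidate for candidate in remaining):
            return False
    return True


def matches_any_order(expected: list, actual: list) -> bool:
    """Every expected call appears somewhere; order and extras are ignored."""
    unmatched = list(actual)
    for call in expected:
        if call not in unmatched:
            return False
        unmatched.remove(call)  # consume one occurrence so duplicates count
    return True


if __name__ == "__main__":
    expected = [{"name": "save_preferences", "args": {"theme": "dark"}}]
    # The agent was "proactive" and searched after saving.
    actual = expected + [{"name": "google_search", "args": {"query": "themes"}}]
    for check in (matches_exact, matches_in_order, matches_any_order):
        print(check.__name__, check(expected, actual))
```

In this sketch the extra `google_search` call fails only the exact comparison, which is why switching match type is the first thing to try when the expected tools were called but the trajectory score is low.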
---

## Running Evaluations

```bash
# Scaffolded projects:
make eval EVALSET=tests/eval/evalsets/my_evalset.json

# Or directly via ADK CLI:
adk eval ./app <path_to_evalset.json> --config_file_path=<path_to_config.json> --print_detailed_results

# Run specific eval cases from a set:
adk eval ./app my_evalset.json:eval_1,eval_2

# With GCS storage:
adk eval ./app my_evalset.json --eval_storage_uri gs://my-bucket/evals
```

**CLI options:** `--config_file_path`, `--print_detailed_results`, `--eval_storage_uri`, `--log_level`

**Eval set management:**

```bash
adk eval_set create <agent_path> <eval_set_id>
adk eval_set add_eval_case <agent_path> <eval_set_id> --scenarios_file <path> --session_input_file <path>
```

---

## Configuration Schema (`eval_config.json`)

Both camelCase and snake_case field names are accepted (Pydantic aliases). The examples below use snake_case, matching the official ADK docs.

### Full example

```json
{
  "criteria": {
    "tool_trajectory_avg_score": {
      "threshold": 1.0,
      "match_type": "IN_ORDER"
    },
    "final_response_match_v2": {
      "threshold": 0.8,
      "judge_model_options": {
        "judge_model": "gemini-2.5-flash",
        "num_samples": 5
      }
    },
    "rubric_based_final_response_quality_v1": {
      "threshold": 0.8,
      "rubrics": [
        {
          "rubric_id": "professionalism",
          "rubric_content": {
            "text_property": "The response must be professional and helpful."
          }
        },
        {
          "rubric_id": "safety",
          "rubric_content": {
            "text_property": "The agent must NEVER book without asking for confirmation."
          }
        }
      ]
    }
  }
}
```

Simple threshold shorthand is also valid: `"response_match_score": 0.8`

For custom metrics, `judge_model_options` details, and `user_simulator_config`, see `references/criteria-guide.md`.
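Because a criterion can be written either as a bare number or as an object, a quick pre-flight check can help when reviewing a config by hand. The following is a minimal sketch using only the standard library. `normalize_criteria` is a hypothetical helper, not part of ADK, and it assumes every object form carries a `threshold` key as in the examples above.

```python
import json


def normalize_criteria(config: dict) -> dict:
    """Expand the numeric shorthand into the object form for easier review.

    Hypothetical helper for illustration; ADK does its own parsing.
    """
    normalized = {}
    for metric, value in config.get("criteria", {}).items():
        if isinstance(value, (int, float)) and not isinstance(value, bool):
            # Shorthand: "response_match_score": 0.8
            normalized[metric] = {"threshold": float(value)}
        elif isinstance(value, dict) and "threshold" in value:
            # Object form: keep extra options such as match_type untouched.
            normalized[metric] = dict(value)
        else:
            raise ValueError(f"Unrecognized criterion format for {metric!r}")
    return normalized


if __name__ == "__main__":
    raw = json.loads(
        """
        {
          "criteria": {
            "response_match_score": 0.8,
            "tool_trajectory_avg_score": {"threshold": 1.0, "match_type": "IN_ORDER"}
          }
        }
        """
    )
    for metric, settings in normalize_criteria(raw).items():
        print(f"{metric}: threshold={settings['threshold']}")
```

Printing the normalized thresholds side by side makes it easier to spot a metric that was left at an unintended value before starting a long eval run.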
---

## EvalSet Schema (`evalset.json`)

```json
{
  "eval_set_id": "my_eval_set",
  "name": "My Eval Set",
  "description": "Tests core capabilities",
  "eval_cases": [
    {
      "eval_id": "search_test",
      "conversation": [
        {
          "invocation_id": "inv_1",
          "user_content": {
            "parts": [{ "text": "Find a flight to NYC" }]
          },
          "final_response": {
            "role": "model",
            "parts": [{ "text": "I found a flight for $500. Want to book?" }]
          },
          "intermediate_data": {
            "tool_uses": [
              { "name": "search_flights", "args": { "destination": "NYC" } }
            ],
            "intermediate_responses": [
              ["sub_agent_name", [{ "text": "Found 3 flights to NYC." }]]
            ]
          }
        }
      ],
      "session_input": {
        "app_name": "my_app",
        "user_id": "user_1",
        "state": {}
      }
    }
  ]
}
```

**Key fields:**

- `intermediate_data.tool_uses`: expected tool call trajectory (chronological order)
- `intermediate_data.intermediate_responses`: expected sub-agent responses (for multi-agent systems)
- `session_input.state`: initial session state (overrides Python-level initialization)
- `conversation_scenario`: alternative to `conversation` for user simulation (see `references/user-simulation.md`)

---

## Common Gotchas

### The Proactivity Trajectory Gap

LLMs often perform extra actions not asked for (e.g., `google_search` after `save_preferences`). This causes `tool_trajectory_avg_score` failures with `EXACT` match. Solutions:

1. **Use `IN_ORDER` or `ANY_ORDER` match type**, which tolerates extra tool calls between expected ones
2. Include ALL tools the agent might call in your expected trajectory
3. Use `rubric_based_tool_use_quality_v1` instead of trajectory matching
4. Add strict stop instructions: "Stop after calling save_preferences. Do NOT search."

### Multi-turn conversations require tool_uses for ALL turns

The `tool_trajectory_avg_score` metric evaluates each invocation. If you don't specify expected tool calls for intermediate turns, the evaluation will fail even if the agent called the right tools.
```json
{
  "conversation": [
    {
      "invocation_id": "inv_1",
      "user_content": {
        "parts": [{ "text": "Find me a flight from NYC to London" }]
      },
      "intermediate_data": {
        "tool_uses": [
          { "name": "search_flights", "args": { "origin": "NYC", "destination": "LON" } }
        ]
      }
    },
    {
      "invocation_id": "inv_2",
      "user_content": {
        "parts": [{ "text": "Book the first option" }]
      },
      "final_response": {
        "role": "model",
        "parts": [{ "text": "Booking confirmed!" }]
      },
      "intermediate_data": {
        "tool_uses": [
          { "name": "book_flight", "args": { "flight_id": "1" } }
        ]
      }
    }
  ]
}
```

### App name must match directory name

The `App` object's `name` parameter MUST match the directory containing your agent:

```python
from google.adk.apps.app import App

# root_agent is your agent, defined elsewhere in the module.

# CORRECT - matches the "app" directory
app = App(root_agent=root_agent, name="app")

# WRONG - causes "Session not found" errors
app = App(root_agent=root_agent, name="flight_booking_assistant")
```

### The `before_agent_callback` Pattern (State Initialization)

Always use a callback to initialize session state variables used in your instruction template. This prevents `KeyError` crashes on the first turn:

```python
from google.adk.agents import Agent
from google.adk.agents.callback_context import CallbackContext


async def initialize_state(callback_context: CallbackContext) -> None:
    # Seed defaults so the instruction template never hits a missing key.
    state = callback_context.state
    if "user_preferences" not in state:
        state["user_preferences"] = {}


root_agent = Agent(
    name="my_agent",
    before_agent_callback=initialize_state,
    instruction="Based on preferences: {user_preferences}...",
)
```

### Eval-State Overrides (Type Mismatch Danger)

Be careful with `session_input.state` in your evalset. It overrides Python-level initialization:

```json
// WRONG: initializes feedback_history as a string, breaks .append()
"state": { "feedback_history": "" }

// CORRECT: matches the Python type (list)
"state": { "feedback_history": [] }

// NOTE: Remove these // comments before using. JSON does not support comments.
```

### Model thinking mode may bypass tools

Models with "thinking" enabled may skip tool calls. Use `tool_config` with `mode="ANY"` to force tool usage, or switch to a non-thinking model for predictable tool calling.

---

## Common Eval Failure Causes

| Symptom | Cause | Fix |
|---------|-------|-----|
| Missing `tool_uses` in intermediate turns | Trajectory expects match per invocation | Add expected tool calls to all turns |
| Agent mentions data not in tool output | Hallucination | Tighten agent instructions; add `hallucinations_v1` metric |
| "Session not found" error | App name mismatch | Ensure App `name` matches directory name |
| Score fluctuates between runs | Non-deterministic model | Set `temperature=0` or use rubric-based eval |
| `tool_trajectory_avg_score` always 0 | Agent uses `google_search` (model-internal) | Remove trajectory metric; see `references/builtin-tools-eval.md` |
| Trajectory fails but tools are correct | Extra tools called | Switch to `IN_ORDER`/`ANY_ORDER` match type |
| LLM judge ignores image/audio in eval | `get_text_from_content()` skips non-text parts | Use custom metric with vision-capable judge (see `references/multimodal-eval.md`) |

---

## Deep Dive: ADK Docs

For the official evaluation documentation, fetch these pages:

- **Evaluation overview**: `https://adk.dev/evaluate/index.md`
- **Criteria reference**: `https://adk.dev/evaluate/criteria/index.md`
- **User simulation**: `https://adk.dev/evaluate/user-sim/index.md`

---

## Debugging Example

User says: "tool_trajectory_avg_score is 0, what's wrong?"

1. Check if agent uses `google_search`; if so, see `references/builtin-tools-eval.md`
2. Check if using `EXACT` match and agent calls extra tools; try `IN_ORDER`
3. Compare expected `tool_uses` in evalset with actual agent behavior
4. Fix mismatch (update evalset or agent instructions)
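The missing `tool_uses` symptom listed under Common Eval Failure Causes can be caught before an eval run with a small lint pass over the evalset file. This is a minimal standard-library sketch. `find_turns_missing_tool_uses` is a hypothetical helper, not an ADK API, and it assumes the evalset layout shown in this guide (`eval_cases` → `conversation` → `intermediate_data.tool_uses`).

```python
import json
import sys


def find_turns_missing_tool_uses(evalset: dict) -> list[tuple[str, str]]:
    """Return (eval_id, invocation_id) pairs whose turn has no tool_uses key.

    Hypothetical helper. Cases that use conversation_scenario instead of
    conversation have no turns to inspect and are skipped.
    """
    missing = []
    for case in evalset.get("eval_cases", []):
        for turn in case.get("conversation", []):
            data = turn.get("intermediate_data") or {}
            # An explicit empty list means "no tools expected" and is accepted;
            # only an absent key is reported.
            if "tool_uses" not in data:
                missing.append(
                    (case.get("eval_id", "<no eval_id>"),
                     turn.get("invocation_id", "<no invocation_id>"))
                )
    return missing


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python check_evalset.py <path_to_evalset.json>")
    with open(sys.argv[1], encoding="utf-8") as handle:
        problems = find_turns_missing_tool_uses(json.load(handle))
    for eval_id, invocation_id in problems:
        print(f"{eval_id}: turn {invocation_id} has no tool_uses")
    sys.exit(1 if problems else 0)
```

The non-zero exit code lets the check run as a step before `make eval` or `adk eval` in CI, so a forgotten intermediate turn is reported by name instead of surfacing as a low trajectory score.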