Name: Arize Experiment
Author: Github
Install
Terminal · npx
$ npx skills add https://github.com/microsoft/github-copilot-for-azure --skill azure-ai
Works with Paperclip
How Arize Experiment fits into a Paperclip company.

Arize Experiment drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
Source file: SKILL.md (326 lines, markdown)
---
name: arize-experiment
description: "INVOKE THIS SKILL when creating, running, or analyzing Arize experiments. Covers experiment CRUD, exporting runs, comparing results, and evaluation workflows using the ax CLI."
---

# Arize Experiment Skill

## Concepts

- **Experiment** = a named evaluation run against a specific dataset version, containing one run per example
- **Experiment Run** = the result of processing one dataset example -- includes the model output, optional evaluations, and optional metadata
- **Dataset** = a versioned collection of examples; every experiment is tied to a dataset and a specific dataset version
- **Evaluation** = a named metric attached to a run (e.g., `correctness`, `relevance`), with optional label, score, and explanation

The typical flow: export a dataset → process each example → collect outputs and evaluations → create an experiment with the runs.

## Prerequisites

Proceed directly with the task -- run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront.

If an `ax` command fails, troubleshoot based on the error:
- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
- Project unclear → check `.env` for `ARIZE_DEFAULT_PROJECT`, or ask, or run `ax projects list -o json --limit 100` and present as selectable options

## List Experiments: `ax experiments list`

Browse experiments, optionally filtered by dataset. Output goes to stdout.

```bash
ax experiments list
ax experiments list --dataset-id DATASET_ID --limit 20
ax experiments list --cursor CURSOR_TOKEN
ax experiments list -o json
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--dataset-id` | string | none | Filter by dataset |
| `--limit, -l` | int | 15 | Max results (1-100) |
| `--cursor` | string | none | Pagination cursor from previous response |
| `-o, --output` | string | table | Output format: table, json, csv, parquet, or file path |
| `-p, --profile` | string | default | Configuration profile |

## Get Experiment: `ax experiments get`

Quick metadata lookup -- returns experiment name, linked dataset/version, and timestamps.

```bash
ax experiments get EXPERIMENT_ID
ax experiments get EXPERIMENT_ID -o json
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `-o, --output` | string | table | Output format |
| `-p, --profile` | string | default | Configuration profile |

### Response fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Experiment ID |
| `name` | string | Experiment name |
| `dataset_id` | string | Linked dataset ID |
| `dataset_version_id` | string | Specific dataset version used |
| `experiment_traces_project_id` | string | Project where experiment traces are stored |
| `created_at` | datetime | When the experiment was created |
| `updated_at` | datetime | Last modification time |

## Export Experiment: `ax experiments export`

Download all runs to a file. By default uses the REST API; pass `--all` to use Arrow Flight for bulk transfer.

```bash
ax experiments export EXPERIMENT_ID
# -> experiment_abc123_20260305_141500/runs.json

ax experiments export EXPERIMENT_ID --all
ax experiments export EXPERIMENT_ID --output-dir ./results
ax experiments export EXPERIMENT_ID --stdout
ax experiments export EXPERIMENT_ID --stdout | jq '.[0]'
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `--all` | bool | false | Use Arrow Flight for bulk export (see below) |
| `--output-dir` | string | `.` | Output directory |
| `--stdout` | bool | false | Print JSON to stdout instead of file |
| `-p, --profile` | string | default | Configuration profile |

### REST vs Flight (`--all`)

- **REST** (default): Lower friction -- no Arrow/Flight dependency, standard HTTPS ports, works through any corporate proxy or firewall. Limited to 500 runs per page.
- **Flight** (`--all`): Required for experiments with more than 500 runs. Uses gRPC+TLS on a separate host/port (`flight.arize.com:443`) which some corporate networks may block.

**Agent auto-escalation rule:** If a REST export returns exactly 500 runs, the result is likely truncated. Re-run with `--all` to get the full dataset.

Output is a JSON array of run objects:

```json
[
  {
    "id": "run_001",
    "example_id": "ex_001",
    "output": "The answer is 4.",
    "evaluations": {
      "correctness": { "label": "correct", "score": 1.0 },
      "relevance": { "score": 0.95, "explanation": "Directly answers the question" }
    },
    "metadata": { "model": "gpt-4o", "latency_ms": 1234 }
  }
]
```

## Create Experiment: `ax experiments create`

Create a new experiment with runs from a data file.

```bash
ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
ax experiments create --name "claude-test" --dataset-id DATASET_ID --file runs.csv
```

### Flags

| Flag | Type | Required | Description |
|------|------|----------|-------------|
| `--name, -n` | string | yes | Experiment name |
| `--dataset-id` | string | yes | Dataset to run the experiment against |
| `--file, -f` | path | yes | Data file with runs: CSV, JSON, JSONL, or Parquet |
| `-o, --output` | string | no | Output format |
| `-p, --profile` | string | no | Configuration profile |

### Passing data via stdin

Use `--file -` to pipe data directly -- no temp file needed:

```bash
echo '[{"example_id": "ex_001", "output": "Paris"}]' | ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file -

# Or with a heredoc
ax experiments create --name "my-experiment" --dataset-id DATASET_ID --file - << 'EOF'
[{"example_id": "ex_001", "output": "Paris"}]
EOF
```

### Required columns in the runs file

| Column | Type | Required | Description |
|--------|------|----------|-------------|
| `example_id` | string | yes | ID of the dataset example this run corresponds to |
| `output` | string | yes | The model/system output for this example |

Additional columns are passed through as `additionalProperties` on the run.

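Because every run needs `example_id` and `output`, it can help to check a runs file before handing it to `ax experiments create`. The following is a minimal Python sketch, not part of the `ax` CLI: `find_invalid_runs`, `write_runs_file`, and the sample data are hypothetical names introduced for illustration, and the check covers only the two required columns listed above.

```python
import json

# The two required columns from the table above.
REQUIRED_COLUMNS = ("example_id", "output")


def find_invalid_runs(runs):
    """Return (index, missing_columns) pairs for runs lacking a required column."""
    problems = []
    for index, run in enumerate(runs):
        missing = [col for col in REQUIRED_COLUMNS if run.get(col) is None]
        if missing:
            problems.append((index, missing))
    return problems


def write_runs_file(runs, path):
    """Validate runs, then write them as a JSON array for use with --file."""
    problems = find_invalid_runs(runs)
    if problems:
        raise ValueError(f"runs missing required columns: {problems}")
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(runs, handle, indent=2)


# Hypothetical sample data (assumption: not taken from a real dataset).
sample_runs = [
    {"example_id": "ex_001", "output": "Paris", "model": "gpt-4o"},
    {"example_id": "ex_002"},  # deliberately missing "output"
]

if __name__ == "__main__":
    # Report which runs would be rejected before any file is written.
    print(find_invalid_runs(sample_runs))
```

Once `find_invalid_runs` reports no problems, `write_runs_file(runs, "runs.json")` produces a file that can be passed with `--file runs.json`.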
## Delete Experiment: `ax experiments delete`

```bash
ax experiments delete EXPERIMENT_ID
ax experiments delete EXPERIMENT_ID --force   # skip confirmation prompt
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `EXPERIMENT_ID` | string | required | Positional argument |
| `--force, -f` | bool | false | Skip confirmation prompt |
| `-p, --profile` | string | default | Configuration profile |

## Experiment Run Schema

Each run corresponds to one dataset example:

```json
{
  "example_id": "required -- links to dataset example",
  "output": "required -- the model/system output for this example",
  "evaluations": {
    "metric_name": {
      "label": "optional string label (e.g., 'correct', 'incorrect')",
      "score": "optional numeric score (e.g., 0.95)",
      "explanation": "optional freeform text"
    }
  },
  "metadata": {
    "model": "gpt-4o",
    "temperature": 0.7,
    "latency_ms": 1234
  }
}
```

### Evaluation fields

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `label` | string | no | Categorical classification (e.g., `correct`, `incorrect`, `partial`) |
| `score` | number | no | Numeric quality score (e.g., 0.0 - 1.0) |
| `explanation` | string | no | Freeform reasoning for the evaluation |

At least one of `label`, `score`, or `explanation` should be present per evaluation.

## Workflows

### Run an experiment against a dataset

1. Find or create a dataset:
   ```bash
   ax datasets list
   ax datasets export DATASET_ID --stdout | jq 'length'
   ```
2. Export the dataset examples:
   ```bash
   ax datasets export DATASET_ID
   ```
3. Process each example through your system, collecting outputs and evaluations
4. Build a runs file (JSON array) with `example_id`, `output`, and optional `evaluations`:
   ```json
   [
     {"example_id": "ex_001", "output": "4", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
     {"example_id": "ex_002", "output": "Paris", "evaluations": {"correctness": {"label": "correct", "score": 1.0}}}
   ]
   ```
5. Create the experiment:
   ```bash
   ax experiments create --name "gpt-4o-baseline" --dataset-id DATASET_ID --file runs.json
   ```
6. Verify: `ax experiments get EXPERIMENT_ID`

### Compare two experiments

1. Export both experiments:
   ```bash
   ax experiments export EXPERIMENT_ID_A --stdout > a.json
   ax experiments export EXPERIMENT_ID_B --stdout > b.json
   ```
2. Compare evaluation scores by `example_id`:
   ```bash
   # Average correctness score for experiment A
   jq '[.[] | .evaluations.correctness.score] | add / length' a.json

   # Same for experiment B
   jq '[.[] | .evaluations.correctness.score] | add / length' b.json
   ```
3. Find examples where results differ:
   ```bash
   # Pair runs by example_id, then keep only pairs whose scores differ
   jq -s '.[0] as $a | .[1][] | . as $run |
     {
       example_id: $run.example_id,
       b_score: $run.evaluations.correctness.score,
       a_score: ($a[] | select(.example_id == $run.example_id) | .evaluations.correctness.score)
     } | select(.a_score != .b_score)' a.json b.json
   ```
4. Score distribution per evaluator (pass/fail/partial counts):
   ```bash
   # Count by label for experiment A
   jq '[.[] | .evaluations.correctness.label] | group_by(.) | map({label: .[0], count: length})' a.json
   ```
5. Find regressions (examples that passed in A but fail in B):
   ```bash
   jq -s '
     [.[0][] | select(.evaluations.correctness.label == "correct")] as $passed_a |
     [.[1][] | select(.evaluations.correctness.label != "correct") |
       select(.example_id as $id | $passed_a | any(.example_id == $id))
     ]
   ' a.json b.json
   ```

**Statistical significance note:** Score comparisons are most reliable with ≥ 30 examples per evaluator. With fewer examples, treat the delta as directional only -- a 5% difference on n=10 may be noise. Report sample size alongside scores: `jq 'length' a.json`.

### Download experiment results for analysis

1. `ax experiments list --dataset-id DATASET_ID` -- find experiments
2. `ax experiments export EXPERIMENT_ID` -- download to file
3. Parse: `jq '.[] | {example_id, score: .evaluations.correctness.score}' experiment_*/runs.json`

### Pipe export to other tools

```bash
# Count runs
ax experiments export EXPERIMENT_ID --stdout | jq 'length'

# Extract all outputs
ax experiments export EXPERIMENT_ID --stdout | jq '.[].output'

# Get runs with low scores
ax experiments export EXPERIMENT_ID --stdout | jq '[.[] | select(.evaluations.correctness.score < 0.5)]'

# Convert to CSV
ax experiments export EXPERIMENT_ID --stdout | jq -r '.[] | [.example_id, .output, .evaluations.correctness.score] | @csv'
```

## Related Skills

- **arize-dataset**: Create or export the dataset this experiment runs against → use `arize-dataset` first
- **arize-prompt-optimization**: Use experiment results to improve prompts → next step is `arize-prompt-optimization`
- **arize-trace**: Inspect individual span traces for failing experiment runs → use `arize-trace`
- **arize-link**: Generate clickable UI links to traces from experiment runs → use `arize-link`

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `ax: command not found` | See references/ax-setup.md |
| `401 Unauthorized` | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| `No profile found` | No profile is configured. See references/ax-profiles.md to create one. |
| `Experiment not found` | Verify experiment ID with `ax experiments list` |
| `Invalid runs file` | Each run must have `example_id` and `output` fields |
| `example_id mismatch` | Ensure `example_id` values match IDs from the dataset (export dataset to verify) |
| `No runs found` | Export returned empty -- verify experiment has runs via `ax experiments get` |
| `Dataset not found` | The linked dataset may have been deleted; check with `ax datasets list` |

## Save Credentials for Future Use

See references/ax-profiles.md § Save Credentials for Future Use.
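
The "Compare two experiments" workflow can also be scripted without `jq`. The following is a minimal Python sketch, assuming two exports shaped like the run objects shown in the Export section; `mean_score`, `find_regressions`, and the inline sample runs are hypothetical and only illustrate the averaging and regression steps for a single metric.

```python
def _evaluation(run, metric):
    """Return the evaluation dict for a metric, or {} when it is absent."""
    return (run.get("evaluations") or {}).get(metric) or {}


def mean_score(runs, metric="correctness"):
    """Average the metric's score over runs that actually carry a score."""
    scores = [_evaluation(run, metric).get("score") for run in runs]
    scores = [score for score in scores if score is not None]
    return sum(scores) / len(scores) if scores else None


def find_regressions(runs_a, runs_b, metric="correctness", passing_label="correct"):
    """List example_ids labeled as passing in A but not in B."""
    passed_a = {
        run["example_id"]
        for run in runs_a
        if _evaluation(run, metric).get("label") == passing_label
    }
    return [
        run["example_id"]
        for run in runs_b
        if run["example_id"] in passed_a
        and _evaluation(run, metric).get("label") != passing_label
    ]


# Hypothetical sample exports (assumption: stand-ins for a.json and b.json).
runs_a = [
    {"example_id": "ex_001", "output": "4",
     "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
    {"example_id": "ex_002", "output": "Paris",
     "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
]
runs_b = [
    {"example_id": "ex_001", "output": "4",
     "evaluations": {"correctness": {"label": "correct", "score": 1.0}}},
    {"example_id": "ex_002", "output": "Lyon",
     "evaluations": {"correctness": {"label": "incorrect", "score": 0.0}}},
]

if __name__ == "__main__":
    # Report sample size next to the scores, as the significance note suggests.
    print(len(runs_a), mean_score(runs_a))
    print(len(runs_b), mean_score(runs_b))
    print(find_regressions(runs_a, runs_b))
```

Unlike the `jq` average, which divides by the total number of runs, this sketch skips runs without a score, so the two can disagree when some runs are unevaluated.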