Claude Agent Skill · by GitHub

Arize Dataset

Install the Arize Dataset skill for Claude Code from github/awesome-copilot.

Install
Terminal · npx
$ npx skills add https://github.com/github/awesome-copilot --skill arize-dataset
Works with Paperclip

How Arize Dataset fits into a Paperclip company.

Arize Dataset drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory · Paired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27 (was $59)
Source file
SKILL.md · 361 lines
---
name: arize-dataset
description: "INVOKE THIS SKILL when creating, managing, or querying Arize datasets and examples. Covers dataset CRUD, appending examples, exporting data, and file-based dataset creation using the ax CLI."
---

# Arize Dataset Skill

## Concepts

- **Dataset** = a versioned collection of examples used for evaluation and experimentation
- **Dataset Version** = a snapshot of a dataset at a point in time; updates can be in-place or create a new version
- **Example** = a single record in a dataset with arbitrary user-defined fields (e.g., `question`, `answer`, `context`)
- **Space** = an organizational container; datasets belong to a space

System-managed fields on examples (`id`, `created_at`, `updated_at`) are auto-generated by the server -- never include them in create or append payloads.

## Prerequisites

Proceed directly with the task -- run the `ax` command you need. Do NOT check versions, env vars, or profiles upfront. If an `ax` command fails, troubleshoot based on the error:

- `command not found` or version error → see references/ax-setup.md
- `401 Unauthorized` / missing API key → run `ax profiles show` to inspect the current profile. If the profile is missing or the API key is wrong: check `.env` for `ARIZE_API_KEY` and use it to create/update the profile via references/ax-profiles.md. If `.env` has no key either, ask the user for their Arize API key (https://app.arize.com/admin > API Keys)
- Space ID unknown → check `.env` for `ARIZE_SPACE_ID`, or run `ax spaces list -o json`, or ask the user
- Project unclear → check `.env` for `ARIZE_DEFAULT_PROJECT`, or ask, or run `ax projects list -o json --limit 100` and present as selectable options

## List Datasets: `ax datasets list`

Browse datasets in a space. Output goes to stdout.
```bash
ax datasets list
ax datasets list --space-id SPACE_ID --limit 20
ax datasets list --cursor CURSOR_TOKEN
ax datasets list -o json
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `--space-id` | string | from profile | Filter by space |
| `--limit, -l` | int | 15 | Max results (1-100) |
| `--cursor` | string | none | Pagination cursor from previous response |
| `-o, --output` | string | table | Output format: table, json, csv, parquet, or file path |
| `-p, --profile` | string | default | Configuration profile |

## Get Dataset: `ax datasets get`

Quick metadata lookup -- returns dataset name, space, timestamps, and version list.

```bash
ax datasets get DATASET_ID
ax datasets get DATASET_ID -o json
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `DATASET_ID` | string | required | Positional argument |
| `-o, --output` | string | table | Output format |
| `-p, --profile` | string | default | Configuration profile |

### Response fields

| Field | Type | Description |
|-------|------|-------------|
| `id` | string | Dataset ID |
| `name` | string | Dataset name |
| `space_id` | string | Space this dataset belongs to |
| `created_at` | datetime | When the dataset was created |
| `updated_at` | datetime | Last modification time |
| `versions` | array | List of dataset versions (id, name, dataset_id, created_at, updated_at) |

## Export Dataset: `ax datasets export`

Download all examples to a file. Use `--all` for datasets larger than 500 examples (unlimited bulk export).
```bash
ax datasets export DATASET_ID
# -> dataset_abc123_20260305_141500/examples.json

ax datasets export DATASET_ID --all
ax datasets export DATASET_ID --version-id VERSION_ID
ax datasets export DATASET_ID --output-dir ./data
ax datasets export DATASET_ID --stdout
ax datasets export DATASET_ID --stdout | jq '.[0]'
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `DATASET_ID` | string | required | Positional argument |
| `--version-id` | string | latest | Export a specific dataset version |
| `--all` | bool | false | Unlimited bulk export (use for datasets > 500 examples) |
| `--output-dir` | string | `.` | Output directory |
| `--stdout` | bool | false | Print JSON to stdout instead of file |
| `-p, --profile` | string | default | Configuration profile |

**Agent auto-escalation rule:** If an export returns exactly 500 examples, the result is likely truncated -- re-run with `--all` to get the full dataset.

**Export completeness verification:** After exporting, confirm the row count matches what the server reports:

```bash
# Get the server-reported count from dataset metadata
ax datasets get DATASET_ID -o json | jq '.versions[-1] | {version: .id, examples: .example_count}'

# Compare to what was exported
jq 'length' dataset_*/examples.json

# If counts differ, re-export with --all
```

Output is a JSON array of example objects. Each example has system fields (`id`, `created_at`, `updated_at`) plus all user-defined fields:

```json
[
  {
    "id": "ex_001",
    "created_at": "2026-01-15T10:00:00Z",
    "updated_at": "2026-01-15T10:00:00Z",
    "question": "What is 2+2?",
    "answer": "4",
    "topic": "math"
  }
]
```

## Create Dataset: `ax datasets create`

Create a new dataset from a data file.
```bash
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.csv
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.json
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.jsonl
ax datasets create --name "My Dataset" --space-id SPACE_ID --file data.parquet
```

### Flags

| Flag | Type | Required | Description |
|------|------|----------|-------------|
| `--name, -n` | string | yes | Dataset name |
| `--space-id` | string | yes | Space to create the dataset in |
| `--file, -f` | path | yes | Data file: CSV, JSON, JSONL, or Parquet |
| `-o, --output` | string | no | Output format for the returned dataset metadata |
| `-p, --profile` | string | no | Configuration profile |

### Passing data via stdin

Use `--file -` to pipe data directly -- no temp file needed:

```bash
echo '[{"question": "What is 2+2?", "answer": "4"}]' | ax datasets create --name "my-dataset" --space-id SPACE_ID --file -

# Or with a heredoc
ax datasets create --name "my-dataset" --space-id SPACE_ID --file - << 'EOF'
[{"question": "What is 2+2?", "answer": "4"}]
EOF
```

To add rows to an existing dataset, use `ax datasets append --json '[...]'` instead -- no file needed.

### Supported file formats

| Format | Extension | Notes |
|--------|-----------|-------|
| CSV | `.csv` | Column headers become field names |
| JSON | `.json` | Array of objects |
| JSON Lines | `.jsonl` | One object per line (NOT a JSON array) |
| Parquet | `.parquet` | Column names become field names; preserves types |

**Format gotchas:**

- **CSV**: Loses type information -- dates become strings, `null` becomes empty string. Use JSON/Parquet to preserve types.
- **JSONL**: Each line is a separate JSON object. A JSON array (`[{...}, {...}]`) in a `.jsonl` file will fail -- use the `.json` extension instead.
- **Parquet**: Preserves column types. Requires `pandas`/`pyarrow` to read locally: `pd.read_parquet("examples.parquet")`.

## Append Examples: `ax datasets append`

Add examples to an existing dataset.
Two input modes -- use whichever fits.

### Inline JSON (agent-friendly)

Generate the payload directly -- no temp files needed:

```bash
ax datasets append DATASET_ID --json '[{"question": "What is 2+2?", "answer": "4"}]'

ax datasets append DATASET_ID --json '[
  {"question": "What is gravity?", "answer": "A fundamental force..."},
  {"question": "What is light?", "answer": "Electromagnetic radiation..."}
]'
```

### From a file

```bash
ax datasets append DATASET_ID --file new_examples.csv
ax datasets append DATASET_ID --file additions.json
```

### To a specific version

```bash
ax datasets append DATASET_ID --json '[{"q": "..."}]' --version-id VERSION_ID
```

### Flags

| Flag | Type | Required | Description |
|------|------|----------|-------------|
| `DATASET_ID` | string | yes | Positional argument |
| `--json` | string | mutex | JSON array of example objects |
| `--file, -f` | path | mutex | Data file (CSV, JSON, JSONL, Parquet) |
| `--version-id` | string | no | Append to a specific version (default: latest) |
| `-o, --output` | string | no | Output format for the returned dataset metadata |
| `-p, --profile` | string | no | Configuration profile |

Exactly one of `--json` or `--file` is required.

### Validation

- Each example must be a JSON object with at least one user-defined field
- Maximum 100,000 examples per request

**Schema validation before append:** If the dataset already has examples, inspect its schema before appending to avoid silent field mismatches:

```bash
# Check existing field names in the dataset
ax datasets export DATASET_ID --stdout | jq '.[0] | keys'

# Verify your new data has matching field names
echo '[{"question": "..."}]' | jq '.[0] | keys'

# Both outputs should show the same user-defined fields
```

Fields are free-form: extra fields in new examples are added, and missing fields become null. However, typos in field names (e.g., `queston` vs `question`) create new columns silently -- verify spelling before appending.
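The validation rules above can also be checked locally before calling the CLI. A minimal pre-flight sketch, assuming `jq` is installed; the inline payload here is illustrative:

```shell
# Pre-flight check for an append payload: it must be a JSON array of objects,
# each object with at least one field, and at most 100,000 examples per request.
payload='[{"question": "What is 2+2?", "answer": "4"}]'

# Number of examples in the array
count=$(printf '%s' "$payload" | jq 'length')

# true only if every element is an object with >= 1 key
all_objects=$(printf '%s' "$payload" | jq 'all(type == "object" and (keys | length) >= 1)')

if [ "$all_objects" = "true" ] && [ "$count" -ge 1 ] && [ "$count" -le 100000 ]; then
  echo "payload ok: $count example(s)"
  # safe to run: ax datasets append DATASET_ID --json "$payload"
else
  echo "invalid payload" >&2
fi
```

If the check fails, `printf '%s' "$payload" | jq 'type'` shows whether the top-level value is an object instead of the required array.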
## Delete Dataset: `ax datasets delete`

```bash
ax datasets delete DATASET_ID
ax datasets delete DATASET_ID --force   # skip confirmation prompt
```

### Flags

| Flag | Type | Default | Description |
|------|------|---------|-------------|
| `DATASET_ID` | string | required | Positional argument |
| `--force, -f` | bool | false | Skip confirmation prompt |
| `-p, --profile` | string | default | Configuration profile |

## Workflows

### Find a dataset by name

Users often refer to datasets by name rather than ID. Resolve a name to an ID before running other commands:

```bash
# Find dataset ID by name
ax datasets list -o json | jq '.[] | select(.name == "eval-set-v1") | .id'

# If the list is paginated, fetch more
ax datasets list -o json --limit 100 | jq '.[] | select(.name | test("eval-set")) | {id, name}'
```

### Create a dataset from file for evaluation

1. Prepare a CSV/JSON/Parquet file with your evaluation columns (e.g., `input`, `expected_output`)
   - If generating data inline, pipe it via stdin using `--file -` (see the Create Dataset section)
2. `ax datasets create --name "eval-set-v1" --space-id SPACE_ID --file eval_data.csv`
3. Verify: `ax datasets get DATASET_ID`
4. Use the dataset ID to run experiments

### Add examples to an existing dataset

```bash
# Find the dataset
ax datasets list

# Append inline or from a file (see Append Examples section for full syntax)
ax datasets append DATASET_ID --json '[{"question": "...", "answer": "..."}]'
ax datasets append DATASET_ID --file additional_examples.csv
```

### Download dataset for offline analysis

1. `ax datasets list` -- find the dataset
2. `ax datasets export DATASET_ID` -- download to file
3. Parse the JSON: `jq '.[] | .question' dataset_*/examples.json`

### Export a specific version

```bash
# List versions
ax datasets get DATASET_ID -o json | jq '.versions'

# Export that version
ax datasets export DATASET_ID --version-id VERSION_ID
```

### Iterate on a dataset

1. Export current version: `ax datasets export DATASET_ID`
2. Modify the examples locally
3. Append new rows: `ax datasets append DATASET_ID --file new_rows.csv`
4. Or create a fresh version: `ax datasets create --name "eval-set-v2" --space-id SPACE_ID --file updated_data.json`

### Pipe export to other tools

```bash
# Count examples
ax datasets export DATASET_ID --stdout | jq 'length'

# Extract a single field
ax datasets export DATASET_ID --stdout | jq '.[].question'

# Convert to CSV with jq
ax datasets export DATASET_ID --stdout | jq -r '.[] | [.question, .answer] | @csv'
```

## Dataset Example Schema

Examples are free-form JSON objects. There is no fixed schema -- columns are whatever fields you provide. System-managed fields are added by the server:

| Field | Type | Managed by | Notes |
|-------|------|-----------|-------|
| `id` | string | server | Auto-generated UUID. Required on update, forbidden on create/append |
| `created_at` | datetime | server | Immutable creation timestamp |
| `updated_at` | datetime | server | Auto-updated on modification |
| *(any user field)* | any JSON type | user | String, number, boolean, null, nested object, array |

## Related Skills

- **arize-trace**: Export production spans to understand what data to put in datasets → use `arize-trace`
- **arize-experiment**: Run evaluations against this dataset → next step is `arize-experiment`
- **arize-prompt-optimization**: Use dataset + experiment results to improve prompts → use `arize-prompt-optimization`

## Troubleshooting

| Problem | Solution |
|---------|----------|
| `ax: command not found` | See references/ax-setup.md |
| `401 Unauthorized` | API key is wrong, expired, or doesn't have access to this space. Fix the profile using references/ax-profiles.md. |
| `No profile found` | No profile is configured. See references/ax-profiles.md to create one. |
| `Dataset not found` | Verify dataset ID with `ax datasets list` |
| `File format error` | Supported: CSV, JSON, JSONL, Parquet. Use `--file -` to read from stdin. |
| `platform-managed column` | Remove `id`, `created_at`, `updated_at` from create/append payloads |
| `reserved column` | Remove `time`, `count`, or any `source_record_*` field |
| `Provide either --json or --file` | Append requires exactly one input source |
| `Examples array is empty` | Ensure your JSON array or file contains at least one example |
| `not a JSON object` | Each element in the `--json` array must be a `{...}` object, not a string or number |

## Save Credentials for Future Use

See references/ax-profiles.md § Save Credentials for Future Use.
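A recurring cause of the `File format error` in the troubleshooting table is a JSON array saved with a `.jsonl` extension (see the format gotchas in the Create Dataset section). A minimal conversion sketch, assuming `jq` is available; file names are illustrative:

```shell
# A JSON array in a .json file (would fail if named .jsonl)
printf '%s' '[{"q": "a"}, {"q": "b"}]' > data.json

# Convert to JSON Lines: jq -c emits one compact object per line
jq -c '.[]' data.json > data.jsonl

cat data.jsonl
# -> {"q":"a"}
#    {"q":"b"}
```

The reverse direction is `jq -s '.' data.jsonl > data.json`, which slurps the lines back into a single array.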