Claude Agent Skill · by Github

Bigquery Pipeline Audit

The bigquery-pipeline-audit skill audits Python and BigQuery pipeline scripts to identify cost overruns, unsafe reruns, and production readiness issues, returni

Install
Terminal · npx
$npx skills add https://github.com/github/awesome-copilot --skill bigquery-pipeline-audit
Works with Paperclip

How Bigquery Pipeline Audit fits into a Paperclip company.

Bigquery Pipeline Audit drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

S
SaaS FactoryPaired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27$59
Explore pack
Source file
SKILL.md129 lines
Expand
---name: bigquery-pipeline-auditdescription: 'Audits Python + BigQuery pipelines for cost safety, idempotency, and production readiness. Returns a structured report with exact patch locations.'--- # BigQuery Pipeline Audit: Cost, Safety and Production Readiness You are a senior data engineer reviewing a Python + BigQuery pipeline script.Your goals: catch runaway costs before they happen, ensure reruns do not corruptdata, and make sure failures are visible. Analyze the codebase and respond in the structure below (A to F + Final).Reference exact function names and line locations. Suggest minimal fixes, notrewrites. --- ## A) COST EXPOSURE: What will actually get billed? Locate every BigQuery job trigger (`client.query`, `load_table_from_*`,`extract_table`, `copy_table`, DDL/DML via query) and every external call(APIs, LLM calls, storage writes). For each, answer:- Is this inside a loop, retry block, or async gather?- What is the realistic worst-case call count?- For each `client.query`, is `QueryJobConfig.maximum_bytes_billed` set?  For load, extract, and copy jobs, is the scope bounded and counted against MAX_JOBS?- Is the same SQL and params being executed more than once in a single run?  Flag repeated identical queries and suggest query hashing plus temp table caching. **Flag immediately if:**- Any BQ query runs once per date or once per entity in a loop- Worst-case BQ job count exceeds 20- `maximum_bytes_billed` is missing on any `client.query` call --- ## B) DRY RUN AND EXECUTION MODES Verify a `--mode` flag exists with at least `dry_run` and `execute` options. - `dry_run` must print the plan and estimated scope with zero billed BQ execution  (BigQuery dry-run estimation via job config is allowed) and zero external API or LLM calls- `execute` requires explicit confirmation for prod (`--env=prod --confirm`)- Prod must not be the default environment If missing, propose a minimal `argparse` patch with safe defaults. --- ## C) BACKFILL AND LOOP DESIGN **Hard fail if:** the script runs one BQ query per date or per entity in a loop. Check that date-range backfills use one of:1. A single set-based query with `GENERATE_DATE_ARRAY`2. A staging table loaded with all dates then one join query3. Explicit chunks with a hard `MAX_CHUNKS` cap Also check:- Is the date range bounded by default (suggest 14 days max without `--override`)?- If the script crashes mid-run, is it safe to re-run without double-writing?- For backdated simulations, verify data is read from time-consistent snapshots  (`FOR SYSTEM_TIME AS OF`, partitioned as-of tables, or dated snapshot tables).  Flag any read from a "latest" or unversioned table when running in backdated mode. Suggest a concrete rewrite if the current approach is row-by-row. --- ## D) QUERY SAFETY AND SCAN SIZE For each query, check:- **Partition filter** is on the raw column, not `DATE(ts)`, `CAST(...)`, or  any function that prevents pruning- **No `SELECT *`**: only columns actually used downstream- **Joins will not explode**: verify join keys are unique or appropriately scoped  and flag any potential many-to-many- **Expensive operations** (`REGEXP`, `JSON_EXTRACT`, UDFs) only run after  partition filtering, not on full table scans Provide a specific SQL fix for any query that fails these checks. --- ## E) SAFE WRITES AND IDEMPOTENCY Identify every write operation. Flag plain `INSERT`/append with no dedup logic. Each write should use one of:1. `MERGE` on a deterministic key (e.g., `entity_id + date + model_version`)2. Write to a staging table scoped to the run, then swap or merge into final3. Append-only with a dedupe view:   `QUALIFY ROW_NUMBER() OVER (PARTITION BY <key>) = 1` Also check:- Will a re-run create duplicate rows?- Is the write disposition (`WRITE_TRUNCATE` vs `WRITE_APPEND`) intentional  and documented?- Is `run_id` being used as part of the merge or dedupe key? If so, flag it.  `run_id` should be stored as a metadata column, not as part of the uniqueness  key, unless you explicitly want multi-run history. State the recommended approach and the exact dedup key for this codebase. --- ## F) OBSERVABILITY: Can you debug a failure? Verify:- Failures raise exceptions and abort with no silent `except: pass` or warn-only- Each BQ job logs: job ID, bytes processed or billed when available,  slot milliseconds, and duration- A run summary is logged or written at the end containing:  `run_id, env, mode, date_range, tables written, total BQ jobs, total bytes`- `run_id` is present and consistent across all log lines If `run_id` is missing, propose a one-line fix:`run_id = run_id or datetime.utcnow().strftime('%Y%m%dT%H%M%S')` --- ## Final **1. PASS / FAIL** with specific reasons per section (A to F).**2. Patch list** ordered by risk, referencing exact functions to change.**3. If FAIL: Top 3 cost risks** with a rough worst-case estimate(e.g., "loop over 90 dates x 3 retries = 270 BQ jobs").