How Explore Data fits into a Paperclip company.

Explore Data drops into any Paperclip agent that handles dataset profiling and exploration. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat, with no prompt engineering or tool wiring.
Source file: SKILL.md (325 lines, markdown)
---
name: explore-data
description: Profile and explore a dataset to understand its shape, quality, and patterns. Use when encountering a new table or file, checking null rates and column distributions, spotting data quality issues like duplicates or suspicious values, or deciding which dimensions and metrics to analyze.
argument-hint: "<table or file>"
---

# /explore-data - Profile and Explore a Dataset

> If you see unfamiliar placeholders or need to check which tools are connected, see [CONNECTORS.md](../../CONNECTORS.md).

Generate a comprehensive data profile for a table or uploaded file. Understand its shape, quality, and patterns before diving into analysis.

## Usage

```
/explore-data <table_name or file>
```

## Workflow

### 1. Access the Data

**If a data warehouse MCP server is connected:**

1. Resolve the table name (handle schema prefixes, suggest matches if ambiguous)
2. Query table metadata: column names, types, descriptions if available
3. Run profiling queries against the live data

**If a file is provided (CSV, Excel, Parquet, JSON):**

1. Read the file and load into a working dataset
2. Infer column types from the data

**If neither:**

1. Ask the user to provide a table name (with their warehouse connected) or upload a file
2. If they describe a table schema, provide guidance on what profiling queries to run

### 2. Understand Structure

Before analyzing any data, understand its structure:

**Table-level questions:**
- How many rows and columns?
- What is the grain (one row per what)?
- What is the primary key? Is it unique?
- When was the data last updated?
- How far back does the data go?

**Column classification** -- categorize each column as one of:
- **Identifier**: Unique keys, foreign keys, entity IDs
- **Dimension**: Categorical attributes for grouping/filtering (status, type, region, category)
- **Metric**: Quantitative values for measurement (revenue, count, duration, score)
- **Temporal**: Dates and timestamps (created_at, updated_at, event_date)
- **Text**: Free-form text fields (description, notes, name)
- **Boolean**: True/false flags
- **Structural**: JSON, arrays, nested structures

### 3. Generate Data Profile

Run the following profiling checks:

**Table-level metrics:**
- Total row count
- Column count and types breakdown
- Approximate table size (if available from metadata)
- Date range coverage (min/max of date columns)

**All columns:**
- Null count and null rate
- Distinct count and cardinality ratio (distinct / total)
- Most common values (top 5-10 with frequencies)
- Least common values (bottom 5 to spot anomalies)

**Numeric columns (metrics):**
```
min, max, mean, median (p50)
standard deviation
percentiles: p1, p5, p25, p75, p95, p99
zero count
negative count (if unexpected)
```

**String columns (dimensions, text):**
```
min length, max length, avg length
empty string count
pattern analysis (do values follow a format?)
case consistency (all upper, all lower, mixed?)
leading/trailing whitespace count
```

**Date/timestamp columns:**
```
min date, max date
null dates
future dates (if unexpected)
distribution by month/week
gaps in time series
```

**Boolean columns:**
```
true count, false count, null count
true rate
```

**Present the profile as a clean summary table**, grouped by column type (dimensions, metrics, dates, IDs).

### 4. Identify Data Quality Issues

Apply the quality assessment framework below. Flag potential problems:

- **High null rates**: Columns with >5% nulls (warn), >20% nulls (alert)
- **Low cardinality surprises**: Columns that should be high-cardinality but aren't (e.g., a "user_id" with only 50 distinct values)
- **High cardinality surprises**: Columns that should be categorical but have too many distinct values
- **Suspicious values**: Negative amounts where only positive expected, future dates in historical data, obviously placeholder values (e.g., "N/A", "TBD", "test", "999999")
- **Duplicate detection**: Check if there's a natural key and whether it has duplicates
- **Distribution skew**: Extremely skewed numeric distributions that could affect averages
- **Encoding issues**: Mixed case in categorical fields, trailing whitespace, inconsistent formats

### 5. Discover Relationships and Patterns

After profiling individual columns:

- **Foreign key candidates**: ID columns that might link to other tables
- **Hierarchies**: Columns that form natural drill-down paths (country > state > city)
- **Correlations**: Numeric columns that move together
- **Derived columns**: Columns that appear to be computed from others
- **Redundant columns**: Columns with identical or near-identical information

### 6. Suggest Interesting Dimensions and Metrics

Based on the column profile, recommend:

- **Best dimension columns** for slicing data (categorical columns with reasonable cardinality, 3-50 values)
- **Key metric columns** for measurement (numeric columns with meaningful distributions)
- **Time columns** suitable for trend analysis
- **Natural groupings** or hierarchies apparent in the data
- **Potential join keys** linking to other tables (ID columns, foreign keys)

### 7. Recommend Follow-Up Analyses

Suggest 3-5 specific analyses the user could run next:

- "Trend analysis on [metric] by [time_column] grouped by [dimension]"
- "Distribution deep-dive on [skewed_column] to understand outliers"
- "Data quality investigation on [problematic_column]"
- "Correlation analysis between [metric_a] and [metric_b]"
- "Cohort analysis using [date_column] and [status_column]"

## Output Format

```
## Data Profile: [table_name]

### Overview
- Rows: 2,340,891
- Columns: 23 (8 dimensions, 6 metrics, 4 dates, 5 IDs)
- Date range: 2021-03-15 to 2024-01-22

### Column Details
[summary table]

### Data Quality Issues
[flagged issues with severity]

### Recommended Explorations
[numbered list of suggested follow-up analyses]
```

---

## Quality Assessment Framework

### Completeness Score

Rate each column:
- **Complete** (>99% non-null): Green
- **Mostly complete** (95-99%): Yellow -- investigate the nulls
- **Incomplete** (80-95%): Orange -- understand why and whether it matters
- **Sparse** (<80%): Red -- may not be usable without imputation

### Consistency Checks

Look for:
- **Value format inconsistency**: Same concept represented differently ("USA", "US", "United States", "us")
- **Type inconsistency**: Numbers stored as strings, dates in various formats
- **Referential integrity**: Foreign keys that don't match any parent record
- **Business rule violations**: Negative quantities, end dates before start dates, percentages > 100
- **Cross-column consistency**: Status = "completed" but completed_at is null

### Accuracy Indicators

Red flags that suggest accuracy issues:
- **Placeholder values**: 0, -1, 999999, "N/A", "TBD", "test", "xxx"
- **Default values**: Suspiciously high frequency of a single value
- **Stale data**: Updated_at shows no recent changes in an active system
- **Impossible values**: Ages > 150, dates in the far future, negative durations
- **Round number bias**: All values ending in 0 or 5 (suggests estimation, not measurement)

### Timeliness Assessment

- When was the table last updated?
- What is the expected update frequency?
- Is there a lag between event time and load time?
- Are there gaps in the time series?

## Pattern Discovery Techniques

### Distribution Analysis

For numeric columns, characterize the distribution:
- **Normal**: Mean and median are close, bell-shaped
- **Skewed right**: Long tail of high values (common for revenue, session duration)
- **Skewed left**: Long tail of low values (less common)
- **Bimodal**: Two peaks (suggests two distinct populations)
- **Power law**: Few very large values, many small ones (common for user activity)
- **Uniform**: Roughly equal frequency across range (often synthetic or random)

### Temporal Patterns

For time series data, look for:
- **Trend**: Sustained upward or downward movement
- **Seasonality**: Repeating patterns (weekly, monthly, quarterly, annual)
- **Day-of-week effects**: Weekday vs. weekend differences
- **Holiday effects**: Drops or spikes around known holidays
- **Change points**: Sudden shifts in level or trend
- **Anomalies**: Individual data points that break the pattern

### Segmentation Discovery

Identify natural segments by:
- Finding categorical columns with 3-20 distinct values
- Comparing metric distributions across segment values
- Looking for segments with significantly different behavior
- Testing whether segments are homogeneous or contain sub-segments

### Correlation Exploration

Between numeric columns:
- Compute correlation matrix for all metric pairs
- Flag strong correlations (|r| > 0.7) for investigation
- Note: Correlation does not imply causation -- flag this explicitly
- Check for non-linear relationships (e.g., quadratic, logarithmic)

## Schema Understanding and Documentation

### Schema Documentation Template

When documenting a dataset for team use:

```markdown
## Table: [schema.table_name]

**Description**: [What this table represents]
**Grain**: [One row per...]
**Primary Key**: [column(s)]
**Row Count**: [approximate, with date]
**Update Frequency**: [real-time / hourly / daily / weekly]
**Owner**: [team or person responsible]

### Key Columns

| Column | Type | Description | Example Values | Notes |
|--------|------|-------------|----------------|-------|
| user_id | STRING | Unique user identifier | "usr_abc123" | FK to users.id |
| event_type | STRING | Type of event | "click", "view", "purchase" | 15 distinct values |
| revenue | DECIMAL | Transaction revenue in USD | 29.99, 149.00 | Null for non-purchase events |
| created_at | TIMESTAMP | When the event occurred | 2024-01-15 14:23:01 | Partitioned on this column |

### Relationships
- Joins to `users` on `user_id`
- Joins to `products` on `product_id`
- Parent of `event_details` (1:many on event_id)

### Known Issues
- [List any known data quality issues]
- [Note any gotchas for analysts]

### Common Query Patterns
- [Typical use cases for this table]
```

### Schema Exploration Queries

When connected to a data warehouse, use these patterns to discover schema:

```sql
-- List all tables in a schema (PostgreSQL)
SELECT table_name, table_type
FROM information_schema.tables
WHERE table_schema = 'public'
ORDER BY table_name;

-- Column details (PostgreSQL)
SELECT column_name, data_type, is_nullable, column_default
FROM information_schema.columns
WHERE table_name = 'my_table'
ORDER BY ordinal_position;

-- Table sizes (PostgreSQL)
SELECT relname, pg_size_pretty(pg_total_relation_size(relid))
FROM pg_catalog.pg_statio_user_tables
ORDER BY pg_total_relation_size(relid) DESC;

-- Row counts for all tables (general pattern)
-- Run per-table: SELECT COUNT(*) FROM table_name
```

### Lineage and Dependencies

When exploring an unfamiliar data environment:

1. Start with the "output" tables (what reports or dashboards consume)
2. Trace upstream: What tables feed into them?
3. Identify raw/staging/mart layers
4. Map the transformation chain from raw data to analytical tables
5. Note where data is enriched, filtered, or aggregated

## Tips

- For very large tables (100M+ rows), profiling queries use sampling by default -- mention if you need exact counts
- If exploring a new dataset for the first time, this command gives you the lay of the land before writing specific queries
- The quality flags are heuristic -- not every flag is a real problem, but each is worth a quick look
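
As a rough illustration of the "All columns" checks and the null-rate thresholds described above (>5% warn, >20% alert), the following is a minimal sketch using only the Python standard library. It is not part of the skill itself: the function name `profile_columns`, the `null_tokens` parameter, the constant names, and the sample data are all hypothetical, and treating empty strings as nulls is an assumption that may not suit every dataset.

```python
import csv
import io
from collections import Counter

# Thresholds from the "High null rates" rule above; the constant names are hypothetical.
WARN_NULL_RATE = 0.05
ALERT_NULL_RATE = 0.20


def profile_columns(rows, null_tokens=("",)):
    """Return per-column null rate, cardinality ratio, and top values.

    `rows` is a list of dicts (for example from csv.DictReader).
    None and any value listed in `null_tokens` are counted as null
    (an assumption; adjust for the dataset at hand).
    """
    total = len(rows)
    columns = list(rows[0].keys()) if rows else []
    profile = {}
    for col in columns:
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None and v not in null_tokens]
        null_count = total - len(non_null)
        null_rate = null_count / total
        if null_rate > ALERT_NULL_RATE:
            flag = "alert"
        elif null_rate > WARN_NULL_RATE:
            flag = "warn"
        else:
            flag = "ok"
        counts = Counter(non_null)
        profile[col] = {
            "null_count": null_count,
            "null_rate": null_rate,
            "distinct": len(counts),
            # Cardinality ratio as defined above: distinct / total.
            "cardinality_ratio": len(counts) / total,
            "top_values": counts.most_common(5),
            "null_flag": flag,
        }
    return profile


# Made-up sample data, for illustration only.
SAMPLE_CSV = """user_id,region,revenue
u1,EU,10.0
u2,EU,
u3,US,25.5
u4,,12.0
u4,US,12.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE_CSV)))
report = profile_columns(rows)
for name, stats in report.items():
    print(name, stats["null_rate"], stats["cardinality_ratio"], stats["null_flag"])
```

In this sample, `user_id` has a cardinality ratio below 1.0 because `u4` repeats, which is the kind of signal the duplicate-detection check is meant to surface. For warehouse tables, the same counts would normally be computed in SQL rather than by loading rows into memory.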