Claude Agent Skill · by Wshobson

Data Quality Frameworks

This gives you production-ready data quality validation using Great Expectations, dbt tests, and data contracts. You get specific expectation suites covering completeness, uniqueness, validity, and the other core quality dimensions.

Install
npx skills add https://github.com/wshobson/agents --skill data-quality-frameworks
Works with Paperclip

How Data Quality Frameworks fits into a Paperclip company.

Data Quality Frameworks drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Paired pack: SaaS Factory — a pre-configured AI company with 18 agents and 18 skills, one-time purchase ($27, regularly $59).
Source file: SKILL.md (583 lines)
---
name: data-quality-frameworks
description: Implement data quality validation with Great Expectations, dbt tests, and data contracts. Use when building data quality pipelines, implementing validation rules, or establishing data contracts.
---

# Data Quality Frameworks

Production patterns for implementing data quality with Great Expectations, dbt tests, and data contracts to ensure reliable data pipelines.

## When to Use This Skill

- Implementing data quality checks in pipelines
- Setting up Great Expectations validation
- Building comprehensive dbt test suites
- Establishing data contracts between teams
- Monitoring data quality metrics
- Automating data validation in CI/CD

## Core Concepts

### 1. Data Quality Dimensions

| Dimension        | Description              | Example Check                                      |
| ---------------- | ------------------------ | -------------------------------------------------- |
| **Completeness** | No missing values        | `expect_column_values_to_not_be_null`              |
| **Uniqueness**   | No duplicates            | `expect_column_values_to_be_unique`                |
| **Validity**     | Values in expected range | `expect_column_values_to_be_in_set`                |
| **Accuracy**     | Data matches reality     | Cross-reference validation                         |
| **Consistency**  | No contradictions        | `expect_column_pair_values_A_to_be_greater_than_B` |
| **Timeliness**   | Data is recent           | `expect_column_max_to_be_between`                  |

### 2. Testing Pyramid for Data

```
          /\
         /  \     Integration Tests (cross-table)
        /────\
       /      \   Unit Tests (single column)
      /────────\
     /          \ Schema Tests (structure)
    /────────────\
```

## Quick Start

### Great Expectations Setup

```bash
# Install
pip install great_expectations

# Initialize project
great_expectations init

# Create datasource
great_expectations datasource new
```

```python
import great_expectations as gx

# Create context
context = gx.get_context()

# Create expectation suite
suite = context.add_expectation_suite("orders_suite")

# Add expectations
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToNotBeNull(column="order_id")
)
suite.add_expectation(
    gx.expectations.ExpectColumnValuesToBeUnique(column="order_id")
)

# Validate
results = context.run_checkpoint(checkpoint_name="daily_orders")
```

## Patterns

### Pattern 1: Great Expectations Suite

```python
# expectations/orders_suite.py
from great_expectations.core import ExpectationSuite
from great_expectations.core.expectation_configuration import ExpectationConfiguration


def build_orders_suite() -> ExpectationSuite:
    """Build comprehensive orders expectation suite"""

    suite = ExpectationSuite(expectation_suite_name="orders_suite")

    # Schema expectations
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_table_columns_to_match_set",
        kwargs={
            "column_set": ["order_id", "customer_id", "amount", "status", "created_at"],
            "exact_match": False  # Allow additional columns
        }
    ))

    # Primary key
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "order_id"}
    ))
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_unique",
        kwargs={"column": "order_id"}
    ))

    # Foreign key
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_not_be_null",
        kwargs={"column": "customer_id"}
    ))

    # Categorical values
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_in_set",
        kwargs={
            "column": "status",
            "value_set": ["pending", "processing", "shipped", "delivered", "cancelled"]
        }
    ))

    # Numeric ranges
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_between",
        kwargs={
            "column": "amount",
            "min_value": 0,
            "max_value": 100000,
            "strict_min": True  # amount > 0
        }
    ))

    # Date validity
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_values_to_be_dateutil_parseable",
        kwargs={"column": "created_at"}
    ))

    # Freshness - data should be recent
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_max_to_be_between",
        kwargs={
            "column": "created_at",
            "min_value": {"$PARAMETER": "now - timedelta(days=1)"},
            "max_value": {"$PARAMETER": "now"}
        }
    ))

    # Row count sanity
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_table_row_count_to_be_between",
        kwargs={
            "min_value": 1000,  # Expect at least 1000 rows
            "max_value": 10000000
        }
    ))

    # Statistical expectations
    suite.add_expectation(ExpectationConfiguration(
        expectation_type="expect_column_mean_to_be_between",
        kwargs={
            "column": "amount",
            "min_value": 50,
            "max_value": 500
        }
    ))

    return suite
```

### Pattern 2: Great Expectations Checkpoint

```yaml
# great_expectations/checkpoints/orders_checkpoint.yml
name: orders_checkpoint
config_version: 1.0
class_name: Checkpoint
run_name_template: "%Y%m%d-%H%M%S-orders-validation"

validations:
  - batch_request:
      datasource_name: warehouse
      data_connector_name: default_inferred_data_connector_name
      data_asset_name: orders
      data_connector_query:
        index: -1 # Latest batch
    expectation_suite_name: orders_suite

action_list:
  - name: store_validation_result
    action:
      class_name: StoreValidationResultAction

  - name: store_evaluation_parameters
    action:
      class_name: StoreEvaluationParametersAction

  - name: update_data_docs
    action:
      class_name: UpdateDataDocsAction

  # Slack notification on failure
  - name: send_slack_notification
    action:
      class_name: SlackNotificationAction
      slack_webhook: ${SLACK_WEBHOOK}
      notify_on: failure
      renderer:
        module_name: great_expectations.render.renderer.slack_renderer
        class_name: SlackRenderer
```

```python
# Run checkpoint
import great_expectations as gx

context = gx.get_context()
result = context.run_checkpoint(checkpoint_name="orders_checkpoint")

if not result.success:
    # run_results maps validation identifiers to dicts that hold the validation result
    failed_expectations = [
        v["validation_result"]
        for v in result.run_results.values()
        if not v["validation_result"].success
    ]
    raise ValueError(f"Data quality check failed: {failed_expectations}")
```

### Pattern 3: dbt Data Tests

```yaml
# models/marts/core/_core__models.yml
version: 2

models:
  - name: fct_orders
    description: Order fact table
    tests:
      # Table-level tests
      - dbt_utils.recency:
          datepart: day
          field: created_at
          interval: 1
      - dbt_utils.at_least_one
      - dbt_utils.expression_is_true:
          expression: "total_amount >= 0"

    columns:
      - name: order_id
        description: Primary key
        tests:
          - unique
          - not_null

      - name: customer_id
        description: Foreign key to dim_customers
        tests:
          - not_null
          - relationships:
              to: ref('dim_customers')
              field: customer_id

      - name: order_status
        tests:
          - accepted_values:
              values:
                ["pending", "processing", "shipped", "delivered", "cancelled"]

      - name: total_amount
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: ">= 0"

      - name: created_at
        tests:
          - not_null
          - dbt_utils.expression_is_true:
              expression: "<= current_timestamp"

  - name: dim_customers
    columns:
      - name: customer_id
        tests:
          - unique
          - not_null

      - name: email
        tests:
          - unique
          - not_null
          # Custom regex test
          - dbt_utils.expression_is_true:
              expression: "email ~ '^[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}$'"
```

### Pattern 4: Custom dbt Tests

```sql
-- tests/generic/test_row_count_in_range.sql
{% test row_count_in_range(model, min_count, max_count) %}

with row_count as (
    select count(*) as cnt from {{ model }}
)

select cnt
from row_count
where cnt < {{ min_count }} or cnt > {{ max_count }}

{% endtest %}

-- Usage in schema.yml:
-- tests:
--   - row_count_in_range:
--       min_count: 1000
--       max_count: 10000000
```

```sql
-- tests/generic/test_sequential_values.sql
{% test sequential_values(model, column_name, interval=1) %}

with lagged as (
    select
        {{ column_name }},
        lag({{ column_name }}) over (order by {{ column_name }}) as prev_value
    from {{ model }}
)

select *
from lagged
where {{ column_name }} - prev_value != {{ interval }}
  and prev_value is not null

{% endtest %}
```

```sql
-- tests/singular/assert_orders_customers_match.sql
-- Singular test: specific business rule

with orders_customers as (
    select distinct customer_id from {{ ref('fct_orders') }}
),

dim_customers as (
    select customer_id from {{ ref('dim_customers') }}
),

orphaned_orders as (
    select o.customer_id
    from orders_customers o
    left join dim_customers c using (customer_id)
    where c.customer_id is null
)

select * from orphaned_orders
-- Test passes if this returns 0 rows
```

### Pattern 5: Data Contracts

```yaml
# contracts/orders_contract.yaml
apiVersion: datacontract.com/v1.0.0
kind: DataContract

metadata:
  name: orders
  version: 1.0.0
  owner: data-platform-team
  contact: data-team@company.com

info:
  title: Orders Data Contract
  description: Contract for order event data from the ecommerce platform
  purpose: Analytics, reporting, and ML features

servers:
  production:
    type: snowflake
    account: company.us-east-1
    database: ANALYTICS
    schema: CORE

terms:
  usage: Internal analytics only
  limitations: PII must not be exposed in downstream marts
  billing: Charged per query TB scanned

schema:
  type: object
  properties:
    order_id:
      type: string
      format: uuid
      description: Unique order identifier
      required: true
      unique: true
      pii: false

    customer_id:
      type: string
      format: uuid
      description: Customer identifier
      required: true
      pii: true
      piiClassification: indirect

    total_amount:
      type: number
      minimum: 0
      maximum: 100000
      description: Order total in USD

    created_at:
      type: string
      format: date-time
      description: Order creation timestamp
      required: true

    status:
      type: string
      enum: [pending, processing, shipped, delivered, cancelled]
      description: Current order status

quality:
  type: SodaCL
  specification:
    checks for orders:
      - row_count > 0
      - missing_count(order_id) = 0
      - duplicate_count(order_id) = 0
      - invalid_count(status) = 0:
          valid values: [pending, processing, shipped, delivered, cancelled]
      - freshness(created_at) < 24h

sla:
  availability: 99.9%
  freshness: 1 hour
  latency: 5 minutes
```

### Pattern 6: Automated Quality Pipeline

```python
# quality_pipeline.py
from dataclasses import dataclass
from typing import List, Dict, Any
from datetime import datetime

import great_expectations as gx


@dataclass
class QualityResult:
    table: str
    passed: bool
    total_expectations: int
    failed_expectations: int
    details: List[Dict[str, Any]]
    timestamp: datetime


class DataQualityPipeline:
    """Orchestrate data quality checks across tables"""

    def __init__(self, context: gx.DataContext):
        self.context = context
        self.results: List[QualityResult] = []

    def validate_table(self, table: str, suite: str) -> QualityResult:
        """Validate a single table against expectation suite"""

        checkpoint_config = {
            "name": f"{table}_validation",
            "config_version": 1.0,
            "class_name": "Checkpoint",
            "validations": [{
                "batch_request": {
                    "datasource_name": "warehouse",
                    "data_asset_name": table,
                },
                "expectation_suite_name": suite,
            }],
        }

        result = self.context.run_checkpoint(**checkpoint_config)

        # Parse results: run_results values are dicts keyed by "validation_result"
        validation_result = list(result.run_results.values())[0]["validation_result"]
        results = validation_result.results

        failed = [r for r in results if not r.success]

        return QualityResult(
            table=table,
            passed=result.success,
            total_expectations=len(results),
            failed_expectations=len(failed),
            details=[{
                "expectation": r.expectation_config.expectation_type,
                "success": r.success,
                "observed_value": r.result.get("observed_value"),
            } for r in results],
            timestamp=datetime.now()
        )

    def run_all(self, tables: Dict[str, str]) -> Dict[str, QualityResult]:
        """Run validation for all tables"""
        results = {}

        for table, suite in tables.items():
            print(f"Validating {table}...")
            results[table] = self.validate_table(table, suite)

        return results

    def generate_report(self, results: Dict[str, QualityResult]) -> str:
        """Generate quality report"""
        report = ["# Data Quality Report", f"Generated: {datetime.now()}", ""]

        total_passed = sum(1 for r in results.values() if r.passed)
        total_tables = len(results)

        report.append(f"## Summary: {total_passed}/{total_tables} tables passed")
        report.append("")

        for table, result in results.items():
            status = "✅" if result.passed else "❌"
            report.append(f"### {status} {table}")
            report.append(f"- Expectations: {result.total_expectations}")
            report.append(f"- Failed: {result.failed_expectations}")

            if not result.passed:
                report.append("- Failed checks:")
                for detail in result.details:
                    if not detail["success"]:
                        report.append(f"  - {detail['expectation']}: {detail['observed_value']}")
            report.append("")

        return "\n".join(report)


# Usage
context = gx.get_context()
pipeline = DataQualityPipeline(context)

tables_to_validate = {
    "orders": "orders_suite",
    "customers": "customers_suite",
    "products": "products_suite",
}

results = pipeline.run_all(tables_to_validate)
report = pipeline.generate_report(results)

# Fail pipeline if any table failed
if not all(r.passed for r in results.values()):
    print(report)
    raise ValueError("Data quality checks failed!")
```

## Best Practices

### Do's

- **Test early** - Validate source data before transformations
- **Test incrementally** - Add tests as you find issues
- **Document expectations** - Clear descriptions for each test
- **Alert on failures** - Integrate with monitoring
- **Version contracts** - Track schema changes

### Don'ts

- **Don't test everything** - Focus on critical columns
- **Don't ignore warnings** - They often precede failures
- **Don't skip freshness** - Stale data is bad data
- **Don't hardcode thresholds** - Use dynamic baselines
- **Don't test in isolation** - Test relationships too
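The dimensions table under Core Concepts maps each quality dimension to a Great Expectations check, but the first two also reduce to simple ratio metrics that are handy for dashboards. A minimal plain-Python sketch (the function names are illustrative, not part of any library):

```python
def completeness(values):
    """Completeness: share of values that are non-null."""
    if not values:
        return 0.0
    non_null = [v for v in values if v is not None]
    return len(non_null) / len(values)


def uniqueness(values):
    """Uniqueness: share of non-null values that are distinct."""
    non_null = [v for v in values if v is not None]
    if not non_null:
        return 0.0
    return len(set(non_null)) / len(non_null)


# A column with one null and one duplicate
col = ["a", "b", None, "b"]
print(completeness(col))  # 0.75 (3 of 4 values present)
print(uniqueness(col))    # ~0.67 (2 distinct of 3 non-null)
```

Tracking these ratios over time often surfaces drift before a hard expectation starts failing.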
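Pattern 5 declares the contract; enforcement needs a consumer-side check. The sketch below validates a single record against a stripped-down subset of the contract's `schema.properties` (`required`, `enum`, `minimum`/`maximum`) in plain Python — a real deployment would use a contract-aware tool or a JSON Schema validator instead:

```python
# Hand-copied subset of the orders contract's schema.properties
CONTRACT = {
    "order_id": {"required": True},
    "status": {"enum": ["pending", "processing", "shipped", "delivered", "cancelled"]},
    "total_amount": {"minimum": 0, "maximum": 100000},
}


def validate_record(record, contract):
    """Return a list of violations for one record; an empty list means it conforms."""
    errors = []
    for field, rules in contract.items():
        value = record.get(field)
        if value is None:
            if rules.get("required"):
                errors.append(f"{field}: required field is missing")
            continue
        if "enum" in rules and value not in rules["enum"]:
            errors.append(f"{field}: {value!r} not in allowed set")
        if "minimum" in rules and value < rules["minimum"]:
            errors.append(f"{field}: {value} below minimum {rules['minimum']}")
        if "maximum" in rules and value > rules["maximum"]:
            errors.append(f"{field}: {value} above maximum {rules['maximum']}")
    return errors


good = {"order_id": "o-1", "status": "shipped", "total_amount": 42.5}
bad = {"status": "returned", "total_amount": -5}
print(validate_record(good, CONTRACT))  # []
print(validate_record(bad, CONTRACT))   # three violations
```

Running this at the pipeline boundary gives the producer and consumer a shared, executable definition of "valid".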
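The "don't hardcode thresholds" guidance can be made concrete: derive the row-count bounds used by `expect_table_row_count_to_be_between` from recent load history instead of fixed constants. A sketch assuming you can query recent daily row counts (the helper name is illustrative):

```python
from statistics import mean, stdev


def dynamic_row_count_bounds(daily_counts, sigma=3.0):
    """Derive (min, max) row-count bounds as mean +/- sigma * stddev of recent loads."""
    mu = mean(daily_counts)
    sd = stdev(daily_counts) if len(daily_counts) > 1 else 0.0
    lower = max(0, int(mu - sigma * sd))
    upper = int(mu + sigma * sd)
    return lower, upper


# Last 7 daily row counts for the orders table
history = [10200, 9800, 10050, 10400, 9900, 10100, 10250]
low, high = dynamic_row_count_bounds(history)
print(low, high)
```

The resulting pair can be passed as `min_value`/`max_value` when the suite is built, recomputed on each run, so the expectation tightens or loosens with the table's actual behavior.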