Name: Airflow Dag Patterns
Author: Wshobson
Install
$ npx skills add https://github.com/wshobson/agents --skill airflow-dag-patterns
Source file
SKILL.md · 519 lines · markdown
---
name: airflow-dag-patterns
description: Build production Apache Airflow DAGs with best practices for operators, sensors, testing, and deployment. Use when creating data pipelines, orchestrating workflows, or scheduling batch jobs.
---

# Apache Airflow DAG Patterns

Production-ready patterns for Apache Airflow including DAG design, operators, sensors, testing, and deployment strategies.

## When to Use This Skill

- Creating data pipeline orchestration with Airflow
- Designing DAG structures and dependencies
- Implementing custom operators and sensors
- Testing Airflow DAGs locally
- Setting up Airflow in production
- Debugging failed DAG runs

## Core Concepts

### 1. DAG Design Principles

| Principle       | Description                             |
| --------------- | --------------------------------------- |
| **Idempotent**  | Running twice produces the same result  |
| **Atomic**      | Tasks succeed or fail completely        |
| **Incremental** | Process only new/changed data           |
| **Observable**  | Logs, metrics, alerts at every step     |

### 2. Task Dependencies

```python
# Linear
task1 >> task2 >> task3

# Fan-out
task1 >> [task2, task3, task4]

# Fan-in
[task1, task2, task3] >> task4

# Complex (diamond: task1 fans out to task2 and task3, both feed task4)
task1 >> task2 >> task4
task1 >> task3 >> task4
```

## Quick Start

```python
# dags/example_dag.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.operators.empty import EmptyOperator

default_args = {
    'owner': 'data-team',
    'depends_on_past': False,
    'email_on_failure': True,
    'email_on_retry': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=5),
    'retry_exponential_backoff': True,
    'max_retry_delay': timedelta(hours=1),
}

with DAG(
    dag_id='example_etl',
    default_args=default_args,
    description='Example ETL pipeline',
    schedule='0 6 * * *',  # Daily at 6 AM
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['etl', 'example'],
    max_active_runs=1,
) as dag:

    start = EmptyOperator(task_id='start')

    def extract_data(**context):
        # 'ds' is the run's logical date as a YYYY-MM-DD string
        logical_date = context['ds']
        # Extract logic here
        return {'records': 1000}

    extract = PythonOperator(
        task_id='extract',
        python_callable=extract_data,
    )

    end = EmptyOperator(task_id='end')

    start >> extract >> end
```

## Patterns

### Pattern 1: TaskFlow API (Airflow 2.0+)

```python
# dags/taskflow_example.py
from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.python import get_current_context

@dag(
    dag_id='taskflow_etl',
    schedule='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    tags=['etl', 'taskflow'],
)
def taskflow_etl():
    """ETL pipeline using TaskFlow API"""

    @task()
    def extract(source: str) -> dict:
        """Extract data from source"""
        import pandas as pd

        # Jinja macros such as {{ ds }} are not rendered inside a task body,
        # so read the logical date from the runtime context instead.
        ds = get_current_context()['ds']
        df = pd.read_csv(f's3://bucket/{source}/{ds}.csv')
        return {'data': df.to_dict(), 'rows': len(df)}

    @task()
    def transform(extracted: dict) -> dict:
        """Transform extracted data"""
        import pandas as pd

        df = pd.DataFrame(extracted['data'])
        # Store an ISO string so the XCom payload stays JSON-serializable
        df['processed_at'] = datetime.now().isoformat()
        df = df.dropna()
        return {'data': df.to_dict(), 'rows': len(df)}

    @task()
    def load(transformed: dict, target: str):
        """Load data to target"""
        import pandas as pd

        ds = get_current_context()['ds']
        df = pd.DataFrame(transformed['data'])
        df.to_parquet(f's3://bucket/{target}/{ds}.parquet')
        return transformed['rows']

    @task()
    def notify(rows_loaded: int):
        """Send notification"""
        print(f'Loaded {rows_loaded} rows')

    # Define dependencies with XCom passing
    extracted = extract(source='raw_data')
    transformed = transform(extracted)
    loaded = load(transformed, target='processed_data')
    notify(loaded)

# Instantiate the DAG
taskflow_etl()
```

### Pattern 2: Dynamic DAG Generation

```python
# dags/dynamic_dag_factory.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator

# Configuration for multiple similar pipelines
PIPELINE_CONFIGS = [
    {'name': 'customers', 'schedule': '@daily', 'source': 's3://raw/customers'},
    {'name': 'orders', 'schedule': '@hourly', 'source': 's3://raw/orders'},
    {'name': 'products', 'schedule': '@weekly', 'source': 's3://raw/products'},
]

def create_dag(config: dict) -> DAG:
    """Factory function to create DAGs from config"""

    dag_id = f"etl_{config['name']}"

    default_args = {
        'owner': 'data-team',
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    }

    dag = DAG(
        dag_id=dag_id,
        default_args=default_args,
        schedule=config['schedule'],
        start_date=datetime(2024, 1, 1),
        catchup=False,
        tags=['etl', 'dynamic', config['name']],
    )

    with dag:
        def extract_fn(source, **context):
            print(f"Extracting from {source} for {context['ds']}")

        def transform_fn(**context):
            print(f"Transforming data for {context['ds']}")

        def load_fn(table_name, **context):
            print(f"Loading to {table_name} for {context['ds']}")

        extract = PythonOperator(
            task_id='extract',
            python_callable=extract_fn,
            op_kwargs={'source': config['source']},
        )

        transform = PythonOperator(
            task_id='transform',
            python_callable=transform_fn,
        )

        load = PythonOperator(
            task_id='load',
            python_callable=load_fn,
            op_kwargs={'table_name': config['name']},
        )

        extract >> transform >> load

    return dag

# Generate DAGs: each one must be bound to a module-level name to be discovered
for config in PIPELINE_CONFIGS:
    globals()[f"dag_{config['name']}"] = create_dag(config)
```

### Pattern 3: Branching and Conditional Logic

```python
# dags/branching_example.py
from datetime import datetime
from airflow.decorators import dag, task
from airflow.operators.python import BranchPythonOperator
from airflow.operators.empty import EmptyOperator
from airflow.utils.trigger_rule import TriggerRule

@dag(
    dag_id='branching_pipeline',
    schedule='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
)
def branching_pipeline():

    @task()
    def check_data_quality() -> dict:
        """Check data quality and return metrics"""
        quality_score = 0.95  # Simulated
        return {'score': quality_score, 'rows': 10000}

    def choose_branch(**context) -> str:
        """Determine which branch to execute"""
        ti = context['ti']
        metrics = ti.xcom_pull(task_ids='check_data_quality')

        if metrics['score'] >= 0.9:
            return 'high_quality_path'
        elif metrics['score'] >= 0.7:
            return 'medium_quality_path'
        else:
            return 'low_quality_path'

    quality_check = check_data_quality()

    branch = BranchPythonOperator(
        task_id='branch',
        python_callable=choose_branch,
    )

    high_quality = EmptyOperator(task_id='high_quality_path')
    medium_quality = EmptyOperator(task_id='medium_quality_path')
    low_quality = EmptyOperator(task_id='low_quality_path')

    # Join point - runs after any branch completes
    join = EmptyOperator(
        task_id='join',
        trigger_rule=TriggerRule.NONE_FAILED_MIN_ONE_SUCCESS,
    )

    quality_check >> branch >> [high_quality, medium_quality, low_quality] >> join

branching_pipeline()
```

### Pattern 4: Sensors and External Dependencies

```python
# dags/sensor_patterns.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.decorators import task
from airflow.sensors.base import PokeReturnValue
from airflow.providers.amazon.aws.sensors.s3 import S3KeySensor
from airflow.sensors.external_task import ExternalTaskSensor
from airflow.operators.python import PythonOperator

with DAG(
    dag_id='sensor_example',
    schedule='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
) as dag:

    # Wait for file on S3
    wait_for_file = S3KeySensor(
        task_id='wait_for_s3_file',
        bucket_name='data-lake',
        bucket_key='raw/{{ ds }}/data.parquet',
        aws_conn_id='aws_default',
        timeout=60 * 60 * 2,  # 2 hours
        poke_interval=60 * 5,  # Check every 5 minutes
        mode='reschedule',  # Free up worker slot while waiting
    )

    # Wait for another DAG to complete
    wait_for_upstream = ExternalTaskSensor(
        task_id='wait_for_upstream_dag',
        external_dag_id='upstream_etl',
        external_task_id='final_task',
        execution_date_fn=lambda dt: dt,  # Same execution date
        timeout=60 * 60 * 3,
        mode='reschedule',
    )

    # Custom sensor using @task.sensor decorator
    @task.sensor(poke_interval=60, timeout=3600, mode='reschedule')
    def wait_for_api() -> PokeReturnValue:
        """Custom sensor for API availability"""
        import requests

        response = requests.get('https://api.example.com/health', timeout=10)
        is_done = response.status_code == 200
        # Only parse the body once the API reports healthy
        xcom_value = response.json() if is_done else None

        return PokeReturnValue(is_done=is_done, xcom_value=xcom_value)

    api_ready = wait_for_api()

    def process_data(**context):
        api_result = context['ti'].xcom_pull(task_ids='wait_for_api')
        print(f"API returned: {api_result}")

    process = PythonOperator(
        task_id='process',
        python_callable=process_data,
    )

    [wait_for_file, wait_for_upstream, api_ready] >> process
```

### Pattern 5: Error Handling and Alerts

```python
# dags/error_handling.py
from datetime import datetime, timedelta
from airflow import DAG
from airflow.operators.python import PythonOperator
from airflow.utils.trigger_rule import TriggerRule

def task_failure_callback(context):
    """Callback on task failure"""
    task_instance = context['task_instance']
    exception = context.get('exception')

    # Send to Slack/PagerDuty/etc
    message = f"""
    Task Failed!
    DAG: {task_instance.dag_id}
    Task: {task_instance.task_id}
    Execution Date: {context['ds']}
    Error: {exception}
    Log URL: {task_instance.log_url}
    """
    # send_slack_alert(message)
    print(message)

def dag_failure_callback(context):
    """Callback on DAG failure"""
    # Aggregate failures, send summary
    pass

with DAG(
    dag_id='error_handling_example',
    schedule='@daily',
    start_date=datetime(2024, 1, 1),
    catchup=False,
    on_failure_callback=dag_failure_callback,
    default_args={
        'on_failure_callback': task_failure_callback,
        'retries': 3,
        'retry_delay': timedelta(minutes=5),
    },
) as dag:

    def might_fail(**context):
        import random
        if random.random() < 0.3:
            raise ValueError("Random failure!")
        return "Success"

    risky_task = PythonOperator(
        task_id='risky_task',
        python_callable=might_fail,
    )

    def cleanup(**context):
        """Cleanup runs regardless of upstream failures"""
        print("Cleaning up...")

    cleanup_task = PythonOperator(
        task_id='cleanup',
        python_callable=cleanup,
        trigger_rule=TriggerRule.ALL_DONE,  # Run even if upstream fails
    )

    def notify_success(**context):
        """Only runs if all upstream succeeded"""
        print("All tasks succeeded!")

    success_notification = PythonOperator(
        task_id='notify_success',
        python_callable=notify_success,
        trigger_rule=TriggerRule.ALL_SUCCESS,
    )

    risky_task >> [cleanup_task, success_notification]
```

### Pattern 6: Testing DAGs

```python
# tests/test_dags.py
import pytest
from airflow.models import DagBag
from airflow.utils.dag_cycle_tester import check_cycle

@pytest.fixture
def dagbag():
    return DagBag(dag_folder='dags/', include_examples=False)

def test_dag_loaded(dagbag):
    """Test that all DAGs load without errors"""
    assert len(dagbag.import_errors) == 0, f"DAG import errors: {dagbag.import_errors}"

def test_dag_structure(dagbag):
    """Test specific DAG structure"""
    dag = dagbag.get_dag('example_etl')

    assert dag is not None
    assert len(dag.tasks) == 3
    # schedule_interval is the Airflow 2.x attribute holding the cron string
    assert dag.schedule_interval == '0 6 * * *'

def test_task_dependencies(dagbag):
    """Test task dependencies are correct"""
    dag = dagbag.get_dag('example_etl')

    extract_task = dag.get_task('extract')
    assert 'start' in [t.task_id for t in extract_task.upstream_list]
    assert 'end' in [t.task_id for t in extract_task.downstream_list]

def test_dag_integrity(dagbag):
    """Test DAG has no cycles and is valid"""
    for dag_id, dag in dagbag.dags.items():
        # check_cycle raises AirflowDagCycleException if a cycle exists
        check_cycle(dag)

# Test individual task logic
def test_extract_function():
    """Unit test for extract function"""
    from dags.example_dag import extract_data

    result = extract_data(ds='2024-01-01')
    assert 'records' in result
    assert isinstance(result['records'], int)
```

## Project Structure

```
airflow/
├── dags/
│   ├── __init__.py
│   ├── common/
│   │   ├── __init__.py
│   │   ├── operators.py    # Custom operators
│   │   ├── sensors.py      # Custom sensors
│   │   └── callbacks.py    # Alert callbacks
│   ├── etl/
│   │   ├── customers.py
│   │   └── orders.py
│   └── ml/
│       └── training.py
├── plugins/
│   └── custom_plugin.py
├── tests/
│   ├── __init__.py
│   ├── test_dags.py
│   └── test_operators.py
├── docker-compose.yml
└── requirements.txt
```

## Best Practices

### Do's

- **Use TaskFlow API** - Cleaner code, automatic XCom
- **Set timeouts** - Prevent zombie tasks
- **Use `mode='reschedule'`** - For sensors, free up workers
- **Test DAGs** - Unit tests and integration tests
- **Idempotent tasks** - Safe to retry

### Don'ts

- **Don't use `depends_on_past=True`** - Creates bottlenecks
- **Don't hardcode dates** - Use `{{ ds }}` macros in templated fields, or `context['ds']` inside Python callables
- **Don't use global state** - Tasks should be stateless
- **Don't skip catchup blindly** - Understand implications
- **Don't put heavy logic in DAG file** - Import from modules
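
The advice to keep tasks idempotent and atomic and to avoid hardcoded dates can be illustrated without Airflow itself. The following is a minimal sketch, not part of the original skill: `load_partition`, the `ds=<date>` directory layout, and the JSON output format are all hypothetical choices made for illustration. The function takes the run's logical date as a parameter, writes to a temporary file, and then renames it over the final path, so a retry for the same date replaces the earlier output instead of appending to it.

```python
import json
import os
import tempfile
from pathlib import Path


def load_partition(base_dir: str, ds: str, rows: list[dict]) -> Path:
    """Write rows for logical date `ds`, replacing any earlier output.

    Hypothetical helper: the name, the ds=<date> layout, and the JSON
    format are illustrative assumptions, not an Airflow API.
    """
    partition_dir = Path(base_dir) / f"ds={ds}"
    partition_dir.mkdir(parents=True, exist_ok=True)
    target = partition_dir / "data.json"

    # Write to a temp file in the same directory, then rename it over the
    # target. A failure before the rename leaves any previous output intact.
    fd, tmp_name = tempfile.mkstemp(dir=partition_dir, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as tmp:
            json.dump(rows, tmp)
        os.replace(tmp_name, target)  # atomic when on the same filesystem
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
    return target


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as base:
        rows = [{"id": 1}, {"id": 2}]
        first = load_partition(base, "2024-01-01", rows)
        retry = load_partition(base, "2024-01-01", rows)  # simulated retry
        assert first == retry
        assert json.loads(retry.read_text()) == rows
```

Inside a DAG, a callable like this could be wrapped in a `PythonOperator` or `@task` and passed `context['ds']`, so the same code serves scheduled runs, retries, and backfills.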