Claude Agent Skill · by Affaan M

Cost-Aware LLM Pipeline

Cost-aware-llm-pipeline provides developers building LLM-powered applications with patterns to optimize API costs while maintaining output quality, using model routing by task complexity, budget tracking, retry logic, and prompt caching.

Install
Terminal · npx
$ npx skills add https://github.com/affaan-m/everything-claude-code --skill cost-aware-llm-pipeline
Works with Paperclip

How Cost-Aware LLM Pipeline fits into a Paperclip company.

Cost-Aware LLM Pipeline drops into any Paperclip agent that makes LLM API calls. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory · Paired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27 (regularly $59) · Explore pack
Source file
SKILL.md · 183 lines
---
name: cost-aware-llm-pipeline
description: Cost optimization patterns for LLM API usage — model routing by task complexity, budget tracking, retry logic, and prompt caching.
origin: ECC
---

# Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

## When to Activate

- Building applications that call LLM APIs (Claude, GPT, etc.)
- Processing batches of items with varying complexity
- Need to stay within a budget for API spend
- Optimizing cost without sacrificing quality on complex tasks

## Core Concepts

### 1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

```python
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30  # items


def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
```

### 2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return a new tracker with the added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
```

### 3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

```python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

# Transient failures worth retrying. AuthenticationError, BadRequestError,
# etc. are permanent and raise immediately.
_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3


def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
```
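The three pieces above already compose. Here is a minimal usage sketch, assuming the definitions from sections 1-3 plus the official `anthropic` Python SDK; the `PRICES` dict and `estimate_cost` helper are illustrative additions (prices mirror the reference table later in this file), not part of the skill itself.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative USD prices per million tokens as (input, output);
# see the pricing reference table below.
PRICES = {
    MODEL_HAIKU: (0.80, 4.00),
    MODEL_SONNET: (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert actual token usage into dollars."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


text = "Summarize this support ticket: ..."
tracker = CostTracker(budget_limit=0.50)
model = select_model(text_length=len(text), item_count=1)  # short text routes to Haiku

response = call_with_retry(lambda: client.messages.create(
    model=model,
    max_tokens=1024,
    messages=[{"role": "user", "content": text}],
))

# response.usage reports real token counts, so the record reflects actual spend.
usage = response.usage
tracker = tracker.add(CostRecord(
    model=model,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
    cost_usd=estimate_cost(model, usage.input_tokens, usage.output_tokens),
))
```

Because `CostTracker` is frozen, rebinding `tracker` on the last statement is the only way state advances, which keeps spend auditing trivial.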
### 4. Prompt Caching

Cache long system prompts to avoid resending them on every request.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this block
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
```

## Composition

Combine all four techniques in a single pipeline function (schematic: `Config`, `Result`, and the helper functions are application-specific):

```python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
```

## Pricing Reference (2025-2026)

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost |
|-------|---------------------|----------------------|---------------|
| Haiku 4.5 | $0.80 | $4.00 | 1x |
| Sonnet 4.6 | $3.00 | $15.00 | ~4x |
| Opus 4.5 | $15.00 | $75.00 | ~19x |

## Best Practices

- **Start with the cheapest model** and only route to expensive models when complexity thresholds are met
- **Set explicit budget limits** before processing batches — fail early rather than overspend
- **Log model selection decisions** so you can tune thresholds based on real data
- **Use prompt caching** for system prompts over 1024 tokens — saves both cost and latency
- **Never retry on authentication or validation errors** — only transient failures (network, rate limit, server error)

## Anti-Patterns to Avoid

- Using the most expensive model for all requests regardless of complexity
- Retrying on all errors (wastes budget on permanent failures)
- Mutating cost-tracking state (makes debugging and auditing difficult)
- Hardcoding model names throughout the codebase (use constants or config)
- Ignoring prompt caching for repetitive system prompts

## When to Use

- Any application calling Claude, OpenAI, or similar LLM APIs
- Batch processing pipelines where cost adds up quickly
- Multi-model architectures that need intelligent routing
- Production systems that need budget guardrails
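As a closing sanity check on the pricing reference, here is a back-of-the-envelope routing comparison; the batch size, per-item token counts, and the 90/10 Haiku/Sonnet split are invented purely for illustration.

```python
# Hypothetical batch: 1,000 items, ~2,000 input and ~500 output tokens each.
ITEMS, IN_TOK, OUT_TOK = 1_000, 2_000, 500


def batch_cost(input_price: float, output_price: float) -> float:
    """Total USD for the whole batch at the given per-million-token prices."""
    return ITEMS * (IN_TOK * input_price + OUT_TOK * output_price) / 1_000_000


all_sonnet = batch_cost(3.00, 15.00)  # every item on Sonnet 4.6
routed = 0.9 * batch_cost(0.80, 4.00) + 0.1 * batch_cost(3.00, 15.00)
print(f"${all_sonnet:.2f} vs ${routed:.2f}")  # $13.50 vs $4.59, roughly 66% cheaper
```

Even a coarse router that escalates only the hardest 10% of items captures most of the savings, which is why the thresholds in `select_model` are worth tuning against logged traffic.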