Claude Agent Skill · by Affaan M

Cost-Aware LLM Pipeline

Cost-aware-llm-pipeline provides developers building LLM-powered applications with patterns to optimize API costs while maintaining output quality, using model routing by task complexity, budget tracking, retry logic, and prompt caching.

Install
Terminal · npx
$ npx skills add https://github.com/affaan-m/everything-claude-code --skill cost-aware-llm-pipeline
Works with Paperclip

How Cost-Aware LLM Pipeline fits into a Paperclip company.

Cost-Aware LLM Pipeline drops into any Paperclip agent that makes LLM API calls. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory · Paired

Pre-configured AI company — 18 agents, 18 skills, one-time purchase.

$27 (regularly $59) · Explore pack
Source file
SKILL.md · 183 lines
---
name: cost-aware-llm-pipeline
description: Cost optimization patterns for LLM API usage — model routing by task complexity, budget tracking, retry logic, and prompt caching.
origin: ECC
---

# Cost-Aware LLM Pipeline

Patterns for controlling LLM API costs while maintaining quality. Combines model routing, budget tracking, retry logic, and prompt caching into a composable pipeline.

## When to Activate

- Building applications that call LLM APIs (Claude, GPT, etc.)
- Processing batches of items with varying complexity
- Need to stay within a budget for API spend
- Optimizing cost without sacrificing quality on complex tasks

## Core Concepts

### 1. Model Routing by Task Complexity

Automatically select cheaper models for simple tasks, reserving expensive models for complex ones.

```python
MODEL_SONNET = "claude-sonnet-4-6"
MODEL_HAIKU = "claude-haiku-4-5-20251001"

_SONNET_TEXT_THRESHOLD = 10_000  # chars
_SONNET_ITEM_THRESHOLD = 30  # items


def select_model(
    text_length: int,
    item_count: int,
    force_model: str | None = None,
) -> str:
    """Select model based on task complexity."""
    if force_model is not None:
        return force_model
    if text_length >= _SONNET_TEXT_THRESHOLD or item_count >= _SONNET_ITEM_THRESHOLD:
        return MODEL_SONNET  # Complex task
    return MODEL_HAIKU  # Simple task (3-4x cheaper)
```

### 2. Immutable Cost Tracking

Track cumulative spend with frozen dataclasses. Each API call returns a new tracker — never mutates state.

```python
from dataclasses import dataclass


@dataclass(frozen=True, slots=True)
class CostRecord:
    model: str
    input_tokens: int
    output_tokens: int
    cost_usd: float


@dataclass(frozen=True, slots=True)
class CostTracker:
    budget_limit: float = 1.00
    records: tuple[CostRecord, ...] = ()

    def add(self, record: CostRecord) -> "CostTracker":
        """Return a new tracker with the added record (never mutates self)."""
        return CostTracker(
            budget_limit=self.budget_limit,
            records=(*self.records, record),
        )

    @property
    def total_cost(self) -> float:
        return sum(r.cost_usd for r in self.records)

    @property
    def over_budget(self) -> bool:
        return self.total_cost > self.budget_limit
```

### 3. Narrow Retry Logic

Retry only on transient errors. Fail fast on authentication or bad request errors.

```python
import time

from anthropic import (
    APIConnectionError,
    InternalServerError,
    RateLimitError,
)

# Transient failures worth retrying. AuthenticationError, BadRequestError,
# etc. are permanent and raise immediately.
_RETRYABLE_ERRORS = (APIConnectionError, RateLimitError, InternalServerError)
_MAX_RETRIES = 3


def call_with_retry(func, *, max_retries: int = _MAX_RETRIES):
    """Retry only on transient errors, fail fast on others."""
    for attempt in range(max_retries):
        try:
            return func()
        except _RETRYABLE_ERRORS:
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # Exponential backoff
```
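The three pieces above already compose. Here is a minimal usage sketch, assuming the definitions from sections 1-3 plus the official `anthropic` Python SDK; the `PRICES` dict and `estimate_cost` helper are illustrative additions (prices mirror the reference table later in this file), not part of the skill itself.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Illustrative USD prices per million tokens as (input, output);
# see the pricing reference table below.
PRICES = {
    MODEL_HAIKU: (0.80, 4.00),
    MODEL_SONNET: (3.00, 15.00),
}


def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    """Convert actual token usage into dollars."""
    input_price, output_price = PRICES[model]
    return (input_tokens * input_price + output_tokens * output_price) / 1_000_000


text = "Summarize this support ticket: ..."
tracker = CostTracker(budget_limit=0.50)
model = select_model(text_length=len(text), item_count=1)  # short text routes to Haiku

response = call_with_retry(lambda: client.messages.create(
    model=model,
    max_tokens=1024,
    messages=[{"role": "user", "content": text}],
))

# response.usage reports real token counts, so the record reflects actual spend.
usage = response.usage
tracker = tracker.add(CostRecord(
    model=model,
    input_tokens=usage.input_tokens,
    output_tokens=usage.output_tokens,
    cost_usd=estimate_cost(model, usage.input_tokens, usage.output_tokens),
))
```

Because `CostTracker` is frozen, rebinding `tracker` on the last statement is the only way state advances, which keeps spend auditing trivial.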
### 4. Prompt Caching

Cache long system prompts to avoid resending them on every request.

```python
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "text",
                "text": system_prompt,
                "cache_control": {"type": "ephemeral"},  # Cache this block
            },
            {
                "type": "text",
                "text": user_input,  # Variable part
            },
        ],
    }
]
```

## Composition

Combine all four techniques in a single pipeline function (schematic: `Config`, `Result`, and the helper functions are application-specific):

```python
def process(text: str, config: Config, tracker: CostTracker) -> tuple[Result, CostTracker]:
    # 1. Route model
    model = select_model(len(text), estimated_items, config.force_model)

    # 2. Check budget
    if tracker.over_budget:
        raise BudgetExceededError(tracker.total_cost, tracker.budget_limit)

    # 3. Call with retry + caching
    response = call_with_retry(lambda: client.messages.create(
        model=model,
        messages=build_cached_messages(system_prompt, text),
    ))

    # 4. Track cost (immutable)
    record = CostRecord(model=model, input_tokens=..., output_tokens=..., cost_usd=...)
    tracker = tracker.add(record)

    return parse_result(response), tracker
```

## Pricing Reference (2025-2026)

| Model | Input ($/1M tokens) | Output ($/1M tokens) | Relative Cost |
|-------|---------------------|----------------------|---------------|
| Haiku 4.5 | $0.80 | $4.00 | 1x |
| Sonnet 4.6 | $3.00 | $15.00 | ~4x |
| Opus 4.5 | $15.00 | $75.00 | ~19x |

## Best Practices

- **Start with the cheapest model** and only route to expensive models when complexity thresholds are met
- **Set explicit budget limits** before processing batches — fail early rather than overspend
- **Log model selection decisions** so you can tune thresholds based on real data
- **Use prompt caching** for system prompts over 1024 tokens — saves both cost and latency
- **Never retry on authentication or validation errors** — only transient failures (network, rate limit, server error)

## Anti-Patterns to Avoid

- Using the most expensive model for all requests regardless of complexity
- Retrying on all errors (wastes budget on permanent failures)
- Mutating cost-tracking state (makes debugging and auditing difficult)
- Hardcoding model names throughout the codebase (use constants or config)
- Ignoring prompt caching for repetitive system prompts

## When to Use

- Any application calling Claude, OpenAI, or similar LLM APIs
- Batch processing pipelines where cost adds up quickly
- Multi-model architectures that need intelligent routing
- Production systems that need budget guardrails
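As a closing sanity check on the pricing reference, here is a back-of-the-envelope routing comparison; the batch size, per-item token counts, and the 90/10 Haiku/Sonnet split are invented purely for illustration.

```python
# Hypothetical batch: 1,000 items, ~2,000 input and ~500 output tokens each.
ITEMS, IN_TOK, OUT_TOK = 1_000, 2_000, 500


def batch_cost(input_price: float, output_price: float) -> float:
    """Total USD for the whole batch at the given per-million-token prices."""
    return ITEMS * (IN_TOK * input_price + OUT_TOK * output_price) / 1_000_000


all_sonnet = batch_cost(3.00, 15.00)  # every item on Sonnet 4.6
routed = 0.9 * batch_cost(0.80, 4.00) + 0.1 * batch_cost(3.00, 15.00)
print(f"${all_sonnet:.2f} vs ${routed:.2f}")  # $13.50 vs $4.59, roughly 66% cheaper
```

Even a coarse router that escalates only the hardest 10% of items captures most of the savings, which is why the thresholds in `select_model` are worth tuning against logged traffic.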