Claude Agent Skill · by Wshobson

SLO Implementation

Provides a complete framework for implementing Service Level Objectives with Prometheus recording rules, multi-window burn rate alerts, and error budget tracking.

Install
Terminal · npx
$ npx skills add https://github.com/wshobson/agents --skill slo-implementation
Source file
SKILL.md (333 lines)
---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```promql
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

Downtime per month assumes a 30-day month; downtime per year assumes 365 days.

| SLO %  | Downtime/Month | Downtime/Year |
| ------ | -------------- | ------------- |
| 99%    | 7.2 hours      | 3.65 days     |
| 99.9%  | 43.2 minutes   | 8.76 hours    |
| 99.95% | 21.6 minutes   | 4.38 hours    |
| 99.99% | 4.32 minutes   | 52.56 minutes |

### Choose Appropriate SLOs

**Consider:**

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**

```yaml
slos:
  - name: api_availability
    target: 99.9 # percent of requests that succeed
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99 # percent of requests served within 500 ms
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**

- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%

### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```

**Reference:** See `references/error-budget.md`

## SLO Implementation

### Prometheus Recording Rules

```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (observed error rate / allowed error rate)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # The alerting rules below also reference 30m, 1h, and 6h burn rates,
      # so those windows need recording rules of the same shape.
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

### SLO Alerting Rules

```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes about 2% of a 30-day error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes about 5% of a 30-day error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```

## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)          │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLI as a percentage (compare against the SLO target)
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate)
(slo:http_availability:error_budget_remaining / 100) * 28 * (1 - 0.999)
/
(1 - sli:http_availability:ratio)
```

## Multi-Window Burn Rate Alerts

```yaml
# Combination of short and long windows reduces false positives
# (fragment: place under a rule group such as slo_alerts)
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```

## SLO Review Process

### Weekly Review

- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review

- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review

- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization
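
## Worked Example: Error Budget Arithmetic

The downtime table, the remaining-budget example, and the burn-rate comments in the alerting rules all follow from the same arithmetic. The Python sketch below is one way to check those numbers by hand. It is illustrative only and not part of the skill: the function names are hypothetical, and it assumes a 30-day month for the downtime figures, which is also the window under which a 14.4x burn for 1 hour comes to roughly 2% of the budget.

```python
MINUTES_PER_DAY = 24 * 60


def error_budget(slo_percent: float) -> float:
    """Error Budget = 1 - SLO Target, expressed as a fraction."""
    return 1 - slo_percent / 100


def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime the budget allows over the window (assumed 30 days)."""
    return error_budget(slo_percent) * window_days * MINUTES_PER_DAY


def budget_remaining_percent(sli_ratio: float, slo_percent: float) -> float:
    """Same formula as the slo:http_availability:error_budget_remaining rule."""
    target = slo_percent / 100
    return (sli_ratio - target) / (1 - target) * 100


def budget_consumed_percent(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Share of the window's budget spent by burning at burn_rate for the given hours."""
    return burn_rate * hours / (window_days * 24) * 100


if __name__ == "__main__":
    print(f"{allowed_downtime_minutes(99.9):.1f}")          # 43.2 minutes per 30 days
    print(f"{budget_remaining_percent(0.9995, 99.9):.1f}")  # 50.0 percent remaining
    print(f"{budget_consumed_percent(14.4, 1):.1f}")        # 2.0 percent (fast burn)
    print(f"{budget_consumed_percent(6, 6):.1f}")           # 5.0 percent (slow burn)
    print(f"{budget_consumed_percent(14.4, 1, 28):.1f}")    # 2.1 percent on a 28-day window
```

The last line shows that with the 28-day window used in the recording rules, the same thresholds consume slightly more of the budget than the 30-day figures quoted in the alert comments, which may be worth accounting for when tuning thresholds.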