npx skills add https://github.com/wshobson/agents --skill slo-implementation
How Slo Implementation fits into a Paperclip company.
Slo Implementation drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
---
name: slo-implementation
description: Define and implement Service Level Indicators (SLIs) and Service Level Objectives (SLOs) with error budgets and alerting. Use when establishing reliability targets, implementing SRE practices, or measuring service performance.
---

# SLO Implementation

Framework for defining and implementing Service Level Indicators (SLIs), Service Level Objectives (SLOs), and error budgets.

## Purpose

Implement measurable reliability targets using SLIs, SLOs, and error budgets to balance reliability with innovation velocity.

## When to Use

- Define service reliability targets
- Measure user-perceived reliability
- Implement error budgets
- Create SLO-based alerts
- Track reliability goals

## SLI/SLO/SLA Hierarchy

```
SLA (Service Level Agreement)
  ↓ Contract with customers
SLO (Service Level Objective)
  ↓ Internal reliability target
SLI (Service Level Indicator)
  ↓ Actual measurement
```

## Defining SLIs

### Common SLI Types

#### 1. Availability SLI

```promql
# Successful requests / Total requests
sum(rate(http_requests_total{status!~"5.."}[28d]))
/
sum(rate(http_requests_total[28d]))
```

#### 2. Latency SLI

```promql
# Requests below latency threshold / Total requests
sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
/
sum(rate(http_request_duration_seconds_count[28d]))
```

#### 3. Durability SLI

```
# Successful writes / Total writes
sum(storage_writes_successful_total)
/
sum(storage_writes_total)
```

**Reference:** See `references/slo-definitions.md`

## Setting SLO Targets

### Availability SLO Examples

| SLO %  | Downtime/Month | Downtime/Year |
| ------ | -------------- | ------------- |
| 99%    | 7.2 hours      | 3.65 days     |
| 99.9%  | 43.2 minutes   | 8.76 hours    |
| 99.95% | 21.6 minutes   | 4.38 hours    |
| 99.99% | 4.32 minutes   | 52.56 minutes |

### Choose Appropriate SLOs

**Consider:**

- User expectations
- Business requirements
- Current performance
- Cost of reliability
- Competitor benchmarks

**Example SLOs:**

```yaml
slos:
  - name: api_availability
    target: 99.9
    window: 28d
    sli: |
      sum(rate(http_requests_total{status!~"5.."}[28d]))
      /
      sum(rate(http_requests_total[28d]))

  - name: api_latency_p95
    target: 99
    window: 28d
    sli: |
      sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
      /
      sum(rate(http_request_duration_seconds_count[28d]))
```

## Error Budget Calculation

### Error Budget Formula

```
Error Budget = 1 - SLO Target
```

**Example:**

- SLO: 99.9% availability
- Error Budget: 0.1% = 43.2 minutes/month
- Current Error: 0.05% = 21.6 minutes/month
- Remaining Budget: 50%

### Error Budget Policy

```yaml
error_budget_policy:
  - remaining_budget: 100%
    action: Normal development velocity
  - remaining_budget: 50%
    action: Consider postponing risky changes
  - remaining_budget: 10%
    action: Freeze non-critical changes
  - remaining_budget: 0%
    action: Feature freeze, focus on reliability
```

**Reference:** See `references/error-budget.md`

## SLO Implementation

### Prometheus Recording Rules

```yaml
# SLI Recording Rules
groups:
  - name: sli_rules
    interval: 30s
    rules:
      # Availability SLI
      - record: sli:http_availability:ratio
        expr: |
          sum(rate(http_requests_total{status!~"5.."}[28d]))
          /
          sum(rate(http_requests_total[28d]))

      # Latency SLI (requests < 500ms)
      - record: sli:http_latency:ratio
        expr: |
          sum(rate(http_request_duration_seconds_bucket{le="0.5"}[28d]))
          /
          sum(rate(http_request_duration_seconds_count[28d]))

  - name: slo_rules
    interval: 5m
    rules:
      # SLO compliance (1 = meeting SLO, 0 = violating)
      - record: slo:http_availability:compliance
        expr: sli:http_availability:ratio >= bool 0.999

      - record: slo:http_latency:compliance
        expr: sli:http_latency:ratio >= bool 0.99

      # Error budget remaining (percentage)
      - record: slo:http_availability:error_budget_remaining
        expr: |
          (sli:http_availability:ratio - 0.999) / (1 - 0.999) * 100

      # Error budget burn rate (observed error ratio / error budget)
      - record: slo:http_availability:burn_rate_5m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[5m]))
            /
            sum(rate(http_requests_total[5m]))
          )) / (1 - 0.999)

      # Burn rates for the other windows used by the alerting rules
      - record: slo:http_availability:burn_rate_30m
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[30m]))
            /
            sum(rate(http_requests_total[30m]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_1h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[1h]))
            /
            sum(rate(http_requests_total[1h]))
          )) / (1 - 0.999)

      - record: slo:http_availability:burn_rate_6h
        expr: |
          (1 - (
            sum(rate(http_requests_total{status!~"5.."}[6h]))
            /
            sum(rate(http_requests_total[6h]))
          )) / (1 - 0.999)
```

### SLO Alerting Rules

```yaml
groups:
  - name: slo_alerts
    interval: 1m
    rules:
      # Fast burn: 14.4x rate, 1 hour window
      # Consumes 2% error budget in 1 hour
      - alert: SLOErrorBudgetBurnFast
        expr: |
          slo:http_availability:burn_rate_1h > 14.4
          and
          slo:http_availability:burn_rate_5m > 14.4
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Fast error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Slow burn: 6x rate, 6 hour window
      # Consumes 5% error budget in 6 hours
      - alert: SLOErrorBudgetBurnSlow
        expr: |
          slo:http_availability:burn_rate_6h > 6
          and
          slo:http_availability:burn_rate_30m > 6
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "Slow error budget burn detected"
          description: "Error budget burning at {{ $value }}x rate"

      # Error budget exhausted
      - alert: SLOErrorBudgetExhausted
        expr: slo:http_availability:error_budget_remaining < 0
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "SLO error budget exhausted"
          description: "Error budget remaining: {{ $value }}%"
```

## SLO Dashboard

**Grafana Dashboard Structure:**

```
┌────────────────────────────────────┐
│ SLO Compliance (Current)           │
│ ✓ 99.95% (Target: 99.9%)           │
├────────────────────────────────────┤
│ Error Budget Remaining: 65%        │
│ ████████░░ 65%                     │
├────────────────────────────────────┤
│ SLI Trend (28 days)                │
│ [Time series graph]                │
├────────────────────────────────────┤
│ Burn Rate Analysis                 │
│ [Burn rate by time window]         │
└────────────────────────────────────┘
```

**Example Queries:**

```promql
# Current SLO compliance
sli:http_availability:ratio * 100

# Error budget remaining
slo:http_availability:error_budget_remaining

# Days until error budget exhausted (at current burn rate):
# remaining budget fraction * window days / burn rate
(slo:http_availability:error_budget_remaining / 100)
* 28
/ ((1 - sli:http_availability:ratio) / (1 - 0.999))
```

## Multi-Window Burn Rate Alerts

```yaml
# Combination of short and long windows reduces false positives
rules:
  - alert: SLOBurnRateHigh
    expr: |
      (
        slo:http_availability:burn_rate_1h > 14.4
        and
        slo:http_availability:burn_rate_5m > 14.4
      )
      or
      (
        slo:http_availability:burn_rate_6h > 6
        and
        slo:http_availability:burn_rate_30m > 6
      )
    labels:
      severity: critical
```

## SLO Review Process

### Weekly Review

- Current SLO compliance
- Error budget status
- Trend analysis
- Incident impact

### Monthly Review

- SLO achievement
- Error budget usage
- Incident postmortems
- SLO adjustments

### Quarterly Review

- SLO relevance
- Target adjustments
- Process improvements
- Tooling enhancements

## Best Practices

1. **Start with user-facing services**
2. **Use multiple SLIs** (availability, latency, etc.)
3. **Set achievable SLOs** (don't aim for 100%)
4. **Implement multi-window alerts** to reduce noise
5. **Track error budget** consistently
6. **Review SLOs regularly**
7. **Document SLO decisions**
8. **Align with business goals**
9. **Automate SLO reporting**
10. **Use SLOs for prioritization**

## Related Skills

- `prometheus-configuration` - For metric collection
- `grafana-dashboards` - For SLO visualization
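## Worked Example: Error Budget Arithmetic

The error budget formula, the downtime table, and the burn-rate thresholds in this skill all follow from the same arithmetic. The sketch below reproduces it in Python so the numbers can be checked by hand. The `Slo` class and its method names are hypothetical, introduced only for illustration, and the 30-day default window is an assumption chosen to match the "Downtime/Month" column and the 2% / 5% burn comments (the Prometheus rules use a 28-day window instead).

```python
from dataclasses import dataclass

MINUTES_PER_DAY = 24 * 60


@dataclass(frozen=True)
class Slo:
    """Hypothetical helper: an SLO target (percent) over a rolling window."""

    target_percent: float  # e.g. 99.9
    window_days: int = 30  # assumption: 30-day month, as in the downtime table

    @property
    def error_budget(self) -> float:
        """Error budget as a fraction: 1 - SLO target."""
        return 1 - self.target_percent / 100

    def allowed_downtime_minutes(self) -> float:
        """Total downtime the budget permits over the window."""
        return self.error_budget * self.window_days * MINUTES_PER_DAY

    def budget_remaining_percent(self, sli: float) -> float:
        """Same formula as slo:http_availability:error_budget_remaining."""
        target = self.target_percent / 100
        return (sli - target) / (1 - target) * 100

    def burn_rate(self, error_ratio: float) -> float:
        """Observed error ratio divided by the error budget."""
        return error_ratio / self.error_budget

    def budget_consumed_percent(self, burn_rate: float, hours: float) -> float:
        """Share of the window's budget spent at a burn rate held for `hours`."""
        return burn_rate * hours / (self.window_days * 24) * 100


if __name__ == "__main__":
    slo = Slo(target_percent=99.9)
    print(f"{slo.allowed_downtime_minutes():.1f} minutes of budget")       # 43.2
    print(f"{slo.budget_remaining_percent(0.9995):.1f}% budget remaining")  # 50.0
    print(f"{slo.burn_rate(0.0144):.1f}x burn rate")                        # 14.4
    print(f"{slo.budget_consumed_percent(14.4, 1):.1f}% spent in 1 hour")   # 2.0
    print(f"{slo.budget_consumed_percent(6, 6):.1f}% spent in 6 hours")     # 5.0
```

With a 28-day window the same burn rates consume slightly more of the budget (14.4x for one hour is about 2.1%), so thresholds may need adjusting if the alerting window differs from the 30-day assumption.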