npx skills add https://github.com/wshobson/agents --skill prometheus-configuration

How Prometheus Configuration fits into a Paperclip company.
Prometheus Configuration drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
SKILL.md (394 lines)
---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│ Prometheus   │ ← Scrapes metrics periodically
│ Server       │
└──────┬───────┘
       │
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is requested through storageSpec.volumeClaimTemplate
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**

```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**

```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing
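
The scrape-target check under Troubleshooting returns a JSON document that is tedious to read by eye once there are more than a handful of targets. The following is a minimal Python sketch, not part of the skill itself, that lists every active target whose health is not `up`. It assumes the response shape of `/api/v1/targets` (a `data.activeTargets` list whose entries carry `labels`, `health`, and `lastError`). `SAMPLE_RESPONSE` and `unhealthy_targets` are hypothetical names introduced for illustration, and the sample payload is made up and trimmed to the fields used.

```python
import json

# Hypothetical, trimmed sample of a /api/v1/targets response (illustration only).
SAMPLE_RESPONSE = """
{
  "status": "success",
  "data": {
    "activeTargets": [
      {
        "labels": {"job": "node-exporter", "instance": "node1"},
        "scrapeUrl": "http://node1:9100/metrics",
        "health": "up",
        "lastError": ""
      },
      {
        "labels": {"job": "my-app", "instance": "app1.example.com:9090"},
        "scrapeUrl": "https://app1.example.com:9090/metrics",
        "health": "down",
        "lastError": "connection refused"
      }
    ],
    "droppedTargets": []
  }
}
"""


def unhealthy_targets(payload):
    """Return (job, instance, lastError) for each active target whose health is not "up".

    Targets reported as "unknown" (not yet scraped) are included as well.
    """
    if payload.get("status") != "success":
        raise ValueError(f"unexpected API status: {payload.get('status')!r}")

    problems = []
    for target in payload["data"]["activeTargets"]:
        if target.get("health") != "up":
            labels = target.get("labels", {})
            problems.append(
                (
                    labels.get("job", ""),
                    labels.get("instance", ""),
                    target.get("lastError", ""),
                )
            )
    return problems


if __name__ == "__main__":
    for job, instance, error in unhealthy_targets(json.loads(SAMPLE_RESPONSE)):
        print(f"{job} {instance}: {error}")
```

Against a live server, the sample string would be replaced by the body returned from `curl http://localhost:9090/api/v1/targets`, for example by piping the output into a script that reads standard input.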