Claude Agent Skill · by Wshobson

Prometheus Configuration

Solid Prometheus setup guide that covers the essentials without getting lost in the weeds. Handles scrape configs for static targets, Kubernetes service discove

Install
$ npx skills add https://github.com/wshobson/agents --skill prometheus-configuration

Source file: SKILL.md (394 lines)
---
name: prometheus-configuration
description: Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
---

# Prometheus Configuration

Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.

## Purpose

Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.

## When to Use

- Set up Prometheus monitoring
- Configure metric scraping
- Create recording rules
- Design alert rules
- Implement service discovery

## Prometheus Architecture

```
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       ├─→ Alertmanager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
```

## Installation

### Kubernetes with Helm

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update

# Persistent storage is requested through storageSpec.volumeClaimTemplate;
# the cluster's default StorageClass is used unless one is set explicitly.
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
```

### Docker Compose

```yaml
version: "3.8"
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - "--config.file=/etc/prometheus/prometheus.yml"
      - "--storage.tsdb.path=/prometheus"
      - "--storage.tsdb.retention.time=30d"

volumes:
  prometheus-data:
```

## Configuration File

**prometheus.yml:**

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: "production"
    region: "us-west-2"

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: "prometheus"
    static_configs:
      - targets: ["localhost:9090"]

  # Node exporters
  - job_name: "node-exporter"
    static_configs:
      - targets:
          - "node1:9100"
          - "node2:9100"
          - "node3:9100"
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: "([^:]+)(:[0-9]+)?"
        replacement: "${1}"

  # Kubernetes pods with annotations
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels:
          [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: "my-app"
    static_configs:
      - targets:
          - "app1.example.com:9090"
          - "app2.example.com:9090"
    metrics_path: "/metrics"
    scheme: "https"
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
```

**Reference:** See `assets/prometheus.yml.template`

## Scrape Configurations

### Static Targets

```yaml
scrape_configs:
  - job_name: "static-targets"
    static_configs:
      - targets: ["host1:9100", "host2:9100"]
        labels:
          env: "production"
          region: "us-west-2"
```

### File-based Service Discovery

```yaml
scrape_configs:
  - job_name: "file-sd"
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
```

**targets/production.json:**

```json
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
```

### Kubernetes Service Discovery

```yaml
scrape_configs:
  - job_name: "kubernetes-services"
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels:
          [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
```

**Reference:** See `references/scrape-configs.md`

## Recording Rules

Create pre-computed metrics for frequently queried expressions:

```yaml
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # 5xx request rate per service
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      # Error rate percentage
      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
```

**Reference:** See `references/recording-rules.md`

## Alert Rules

```yaml
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
```

## Validation

```bash
# Validate configuration
promtool check config prometheus.yml

# Validate rules
promtool check rules /etc/prometheus/rules/*.yml

# Test query
promtool query instant http://localhost:9090 'up'
```

**Reference:** See `scripts/validate-prometheus.sh`

## Best Practices

1. **Use consistent naming** for metrics (prefix_name_unit)
2. **Set appropriate scrape intervals** (15-60s typical)
3. **Use recording rules** for expensive queries
4. **Implement high availability** (multiple Prometheus instances)
5. **Configure retention** based on storage capacity
6. **Use relabeling** for metric cleanup
7. **Monitor Prometheus itself**
8. **Implement federation** for large deployments
9. **Use Thanos/Cortex** for long-term storage
10. **Document custom metrics**

## Troubleshooting

**Check scrape targets:**

```bash
curl http://localhost:9090/api/v1/targets
```

**Check configuration:**

```bash
curl http://localhost:9090/api/v1/status/config
```

**Test query:**

```bash
curl 'http://localhost:9090/api/v1/query?query=up'
```

## Related Skills

- `grafana-dashboards` - For visualization
- `slo-implementation` - For SLO monitoring
- `distributed-tracing` - For request tracing
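
As a complement to the Validation section, the `ServiceDown` alert from the Alert Rules section can also be exercised offline with promtool's rule unit tests, without a running server. The following is a minimal sketch, assuming the rules file lives at `/etc/prometheus/rules/alert_rules.yml` as shown in that section; the test file name `alert_rules_test.yml` and the sample `instance` value are hypothetical choices for illustration.

```yaml
# alert_rules_test.yml (hypothetical file name)
rule_files:
  - /etc/prometheus/rules/alert_rules.yml

evaluation_interval: 1m

tests:
  - interval: 1m
    # Simulated scrape results: the target reports up == 0 for five samples.
    input_series:
      - series: 'up{job="my-app", instance="app1.example.com:9090"}'
        values: "0 0 0 0 0"
    alert_rule_test:
      # ServiceDown uses "for: 1m", so it should be firing well before 3m.
      - eval_time: 3m
        alertname: ServiceDown
        exp_alerts:
          - exp_labels:
              severity: critical
              job: my-app
              instance: app1.example.com:9090
            exp_annotations:
              summary: "Service app1.example.com:9090 is down"
              description: "my-app has been down for more than 1 minute"
```

Run it with `promtool test rules alert_rules_test.yml`. The same pattern should extend to the other alerts, although those depend on recording rules, so the recording rules file would likely need to be listed under `rule_files` as well.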