Claude Agent Skill · by Jeffallan

Monitoring Expert

Install the Monitoring Expert skill for Claude Code from jeffallan/claude-skills.

Install
Terminal · npx
$ npx skills add https://github.com/jeffallan/claude-skills --skill monitoring-expert
Works with Paperclip

How Monitoring Expert fits into a Paperclip company.

Monitoring Expert drops into any Paperclip agent that handles observability and performance work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

SaaS Factory (paired pack) — pre-configured AI company with 18 agents and 18 skills, one-time purchase: $27 (was $59).
Source file: SKILL.md (176 lines)
---
name: monitoring-expert
description: Configures monitoring systems, implements structured logging pipelines, creates Prometheus/Grafana dashboards, defines alerting rules, and instruments distributed tracing. Implements Prometheus/Grafana stacks, conducts load testing, performs application profiling, and plans infrastructure capacity. Use when setting up application monitoring, adding observability to services, debugging production issues with logs/metrics/traces, running load tests with k6 or Artillery, profiling CPU/memory bottlenecks, or forecasting capacity needs.
license: MIT
metadata:
  author: https://github.com/Jeffallan
  version: "1.1.0"
  domain: devops
  triggers: monitoring, observability, logging, metrics, tracing, alerting, Prometheus, Grafana, DataDog, APM, performance testing, load testing, profiling, capacity planning, bottleneck
  role: specialist
  scope: implementation
  output-format: code
  related-skills: devops-engineer, debugging-wizard, architecture-designer
---

# Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

## Core Workflow

1. **Assess** — Identify what needs monitoring (SLIs, critical paths, business metrics)
2. **Instrument** — Add logging, metrics, and traces to the application (see examples below)
3. **Collect** — Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify data arrives before proceeding
4. **Visualize** — Build dashboards using RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) methods
5. **Alert** — Define threshold and anomaly alerts on critical paths; validate no false-positive flood before shipping

## Quick-Start Examples

### Structured Logging (Node.js / Pino)

```js
import pino from 'pino';

const logger = pino({ level: 'info' });

// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);
```

### Prometheus Metrics (Node.js)

```js
import { Counter, Histogram, register } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

// Instrument a route
app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
    end();
  });
  next();
});

// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

### OpenTelemetry Tracing (Node.js)

```js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();

// Manual span around a critical operation
const tracer = trace.getTracer('order-service');
async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
}
```

### Prometheus Alerting Rule

```yaml
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
          / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"
```

### k6 Load Test

```js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 },   // ramp up
    { duration: '5m', target: 50 },   // sustained load
    { duration: '1m', target: 0 },    // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // 95th percentile < 500 ms
    http_req_failed:   ['rate<0.01'],  // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Logging | `references/structured-logging.md` | Pino, JSON logging |
| Metrics | `references/prometheus-metrics.md` | Counter, Histogram, Gauge |
| Tracing | `references/opentelemetry.md` | OpenTelemetry, spans |
| Alerting | `references/alerting-rules.md` | Prometheus alerts |
| Dashboards | `references/dashboards.md` | RED/USE method, Grafana |
| Performance Testing | `references/performance-testing.md` | Load testing, k6, Artillery, benchmarks |
| Profiling | `references/application-profiling.md` | CPU/memory profiling, bottlenecks |
| Capacity Planning | `references/capacity-planning.md` | Scaling, forecasting, budgets |

## Constraints

### MUST DO
- Use structured logging (JSON)
- Include request IDs for correlation
- Set up alerts for critical paths
- Monitor business metrics, not just technical ones
- Use appropriate metric types (counter/gauge/histogram)
- Implement health check endpoints

### MUST NOT DO
- Log sensitive data (passwords, tokens, PII)
- Alert on every error (alert fatigue)
- Use string interpolation in logs (use structured fields)
- Skip correlation IDs in distributed systems
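The MUST DO list calls for health check endpoints, which the skill file mentions but never shows. Below is a minimal sketch using only Node's built-in `http` module; the `/healthz` and `/readyz` paths, the `checkDependencies` probe, and the payload shape are illustrative assumptions, not part of the skill.

```javascript
import http from 'node:http';

// Hypothetical dependency probe: replace with real DB/cache/queue pings.
async function checkDependencies() {
  return { database: true, cache: true };
}

// Build the readiness payload: ready only when every dependency is up.
function readinessPayload(deps) {
  const ready = Object.values(deps).every(Boolean);
  return { status: ready ? 'ok' : 'degraded', checks: deps, ready };
}

const server = http.createServer(async (req, res) => {
  if (req.url === '/healthz') {
    // Liveness: the process is up and able to serve requests.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify({ status: 'ok' }));
  } else if (req.url === '/readyz') {
    // Readiness: downstream dependencies are reachable; 503 tells the
    // orchestrator to stop routing traffic here without killing the process.
    const payload = readinessPayload(await checkDependencies());
    res.writeHead(payload.ready ? 200 : 503, { 'Content-Type': 'application/json' });
    res.end(JSON.stringify(payload));
  } else {
    res.writeHead(404).end();
  }
});
```

Start it with `server.listen(3000)` and point the orchestrator's liveness probe at `/healthz` and its readiness probe at `/readyz`; keeping the two separate avoids restart loops when only a dependency is down.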