Install
Terminal

npx skills add https://github.com/jeffallan/claude-skills --skill monitoring-expert

Works with Paperclip
How Monitoring Expert fits into a Paperclip company.
Monitoring Expert drops into any Paperclip agent that handles observability and performance work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
SaaS Factory (paired pack)
Pre-configured AI company — 18 agents, 18 skills, one-time purchase.
$27 (reduced from $59)
Source file: SKILL.md (176 lines)
---
name: monitoring-expert
description: Configures monitoring systems, implements structured logging pipelines, creates Prometheus/Grafana dashboards, defines alerting rules, and instruments distributed tracing. Implements Prometheus/Grafana stacks, conducts load testing, performs application profiling, and plans infrastructure capacity. Use when setting up application monitoring, adding observability to services, debugging production issues with logs/metrics/traces, running load tests with k6 or Artillery, profiling CPU/memory bottlenecks, or forecasting capacity needs.
license: MIT
metadata:
  author: https://github.com/Jeffallan
  version: "1.1.0"
  domain: devops
  triggers: monitoring, observability, logging, metrics, tracing, alerting, Prometheus, Grafana, DataDog, APM, performance testing, load testing, profiling, capacity planning, bottleneck
  role: specialist
  scope: implementation
  output-format: code
  related-skills: devops-engineer, debugging-wizard, architecture-designer
---

# Monitoring Expert

Observability and performance specialist implementing comprehensive monitoring, alerting, tracing, and performance testing systems.

## Core Workflow

1. **Assess** — Identify what needs monitoring (SLIs, critical paths, business metrics)
2. **Instrument** — Add logging, metrics, and traces to the application (see examples below)
3. **Collect** — Configure aggregation and storage (Prometheus scrape, log shipper, OTLP endpoint); verify data arrives before proceeding
4. **Visualize** — Build dashboards using the RED (Rate/Errors/Duration) or USE (Utilization/Saturation/Errors) method
5. **Alert** — Define threshold and anomaly alerts on critical paths; validate there is no false-positive flood before shipping

## Quick-Start Examples

### Structured Logging (Node.js / Pino)

```js
import pino from 'pino';

const logger = pino({ level: 'info' });

// Good — structured fields, includes correlation ID
logger.info({ requestId: req.id, userId: req.user.id, durationMs: elapsed }, 'order.created');

// Bad — string interpolation, no correlation
console.log(`Order created for user ${userId}`);
```

### Prometheus Metrics (Node.js)

```js
import { Counter, Histogram, register } from 'prom-client';

const httpRequests = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

const httpDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request latency',
  labelNames: ['method', 'route'],
  buckets: [0.05, 0.1, 0.3, 0.5, 1, 2, 5],
});

// Instrument a route
app.use((req, res, next) => {
  const end = httpDuration.startTimer({ method: req.method, route: req.path });
  res.on('finish', () => {
    httpRequests.inc({ method: req.method, route: req.path, status: res.statusCode });
    end();
  });
  next();
});

// Expose scrape endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
```

### OpenTelemetry Tracing (Node.js)

```js
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { trace, SpanStatusCode } from '@opentelemetry/api';

const sdk = new NodeSDK({
  traceExporter: new OTLPTraceExporter({ url: 'http://jaeger:4318/v1/traces' }),
});
sdk.start();

// Manual span around a critical operation
const tracer = trace.getTracer('order-service');

async function processOrder(orderId) {
  const span = tracer.startSpan('order.process');
  span.setAttribute('order.id', orderId);
  try {
    const result = await db.saveOrder(orderId);
    span.setStatus({ code: SpanStatusCode.OK });
    return result;
  } catch (err) {
    span.recordException(err);
    span.setStatus({ code: SpanStatusCode.ERROR });
    throw err;
  } finally {
    span.end();
  }
}
```

### Prometheus Alerting Rule

```yaml
groups:
  - name: api.rules
    rules:
      - alert: HighErrorRate
        expr: |
          rate(http_requests_total{status=~"5.."}[5m])
            / rate(http_requests_total[5m]) > 0.05
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% on {{ $labels.route }}"
```

### k6 Load Test

```js
import http from 'k6/http';
import { check, sleep } from 'k6';

export const options = {
  stages: [
    { duration: '1m', target: 50 }, // ramp up
    { duration: '5m', target: 50 }, // sustained load
    { duration: '1m', target: 0 },  // ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'], // 95th percentile < 500 ms
    http_req_failed: ['rate<0.01'],   // error rate < 1%
  },
};

export default function () {
  const res = http.get('https://api.example.com/orders');
  check(res, { 'status is 200': (r) => r.status === 200 });
  sleep(1);
}
```

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Logging | `references/structured-logging.md` | Pino, JSON logging |
| Metrics | `references/prometheus-metrics.md` | Counter, Histogram, Gauge |
| Tracing | `references/opentelemetry.md` | OpenTelemetry, spans |
| Alerting | `references/alerting-rules.md` | Prometheus alerts |
| Dashboards | `references/dashboards.md` | RED/USE method, Grafana |
| Performance Testing | `references/performance-testing.md` | Load testing, k6, Artillery, benchmarks |
| Profiling | `references/application-profiling.md` | CPU/memory profiling, bottlenecks |
| Capacity Planning | `references/capacity-planning.md` | Scaling, forecasting, budgets |

## Constraints

### MUST DO
- Use structured logging (JSON)
- Include request IDs for correlation
- Set up alerts for critical paths
- Monitor business metrics, not just technical ones
- Use appropriate metric types (counter/gauge/histogram)
- Implement health check endpoints

### MUST NOT DO
- Log sensitive data (passwords, tokens, PII)
- Alert on every error (alert fatigue)
- Use string interpolation in logs (use structured fields)
- Skip correlation IDs in distributed systems
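The MUST DO list above requires health check endpoints, but the quick-start examples do not show one. A minimal sketch using Node's built-in `http` module; the `/healthz` and `/readyz` paths and the `checkDatabase` stub are illustrative conventions, not part of the skill:

```javascript
import http from 'node:http';

// Aggregate dependency probes into a single readiness result.
// A probe is any async function returning truthy when the dependency is healthy.
async function readiness(checks) {
  const results = {};
  for (const [name, probe] of Object.entries(checks)) {
    results[name] = await probe().then(Boolean, () => false);
  }
  const ok = Object.values(results).every(Boolean);
  return {
    statusCode: ok ? 200 : 503,
    body: { status: ok ? 'ready' : 'degraded', checks: results },
  };
}

// Illustrative stub; replace with a real ping (e.g. SELECT 1).
async function checkDatabase() {
  return true;
}

const server = http.createServer(async (req, res) => {
  if (req.url === '/healthz') {
    // Liveness: the process is up and serving requests.
    res.writeHead(200, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify({ status: 'ok' }));
  }
  if (req.url === '/readyz') {
    // Readiness: dependencies reachable; 503 tells the load balancer to drain traffic.
    const { statusCode, body } = await readiness({ db: checkDatabase });
    res.writeHead(statusCode, { 'Content-Type': 'application/json' });
    return res.end(JSON.stringify(body));
  }
  res.writeHead(404);
  res.end();
});

// Call server.listen(3000) when wiring this into the app; left uncalled here.
```

Splitting liveness from readiness matters: a liveness failure restarts the process, while a readiness failure only pulls it out of rotation until dependencies recover.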
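The constraints also require request IDs for correlation in both the MUST DO and MUST NOT lists. A sketch of middleware that supplies them, assuming an Express-style `(req, res, next)` signature; the `x-request-id` header is a common convention, not something the skill mandates:

```javascript
import { randomUUID } from 'node:crypto';

// Reuse an incoming x-request-id (set by a proxy or upstream service)
// or mint a fresh UUID, then echo it back so clients can correlate too.
function requestId(req, res, next) {
  req.id = req.headers['x-request-id'] || randomUUID();
  res.setHeader('x-request-id', req.id);
  next();
}

// app.use(requestId); // register before logging middleware so every log line carries req.id
```

Registered first, this is what makes the `requestId: req.id` field in the Pino example above resolve to a stable value across every log line, metric label, and trace attribute for one request.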
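The "never log sensitive data" rule is easier to keep mechanical than behavioral: scrub known secret fields before anything reaches the log sink. A sketch of such a scrubber; the `SENSITIVE` field list is a hypothetical starting point, and Pino users can get an equivalent effect with its built-in `redact` option instead:

```javascript
// Field names treated as secrets; extend per application.
const SENSITIVE = new Set(['password', 'token', 'authorization', 'ssn', 'creditCard']);

// Recursively replace sensitive fields with a placeholder,
// leaving all other values untouched.
function redact(value) {
  if (Array.isArray(value)) return value.map(redact);
  if (value && typeof value === 'object') {
    const out = {};
    for (const [key, v] of Object.entries(value)) {
      out[key] = SENSITIVE.has(key) ? '[REDACTED]' : redact(v);
    }
    return out;
  }
  return value;
}

// logger.info(redact({ userId: 7, password: 'hunter2' }), 'user.login');
```

Applying the scrubber at the logger boundary means a single forgotten field name is a one-line fix, rather than an audit of every call site.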