Claude Agent Skill · by Wshobson

Service Mesh Observability

A solid implementation guide for getting observability working across Istio and Linkerd deployments. Sets up the full stack with Prometheus metrics, Jaeger tracing, and Grafana dashboards.

Install

$ npx skills add https://github.com/wshobson/agents --skill service-mesh-observability
Works with Paperclip

How Service Mesh Observability fits into a Paperclip company.

Service Mesh Observability drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Paired pack: SaaS Factory, a pre-configured AI company (18 agents, 18 skills, one-time purchase), $27 (down from $59).
Source file: SKILL.md (378 lines)
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```
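To make the correlation between the pillars concrete, here is a minimal Python sketch (hypothetical names, not part of the skill itself) showing how a single request can emit all three signals keyed by the same trace ID — which is exactly what makes cross-pillar correlation (e.g. metric exemplars linking to traces) possible:

```python
import time
import uuid


def observe_request(service: str, path: str, status: int, duration_ms: float):
    """Emit a metric sample, a trace span, and a log line for one request,
    all sharing the same trace ID so they can be correlated later."""
    trace_id = uuid.uuid4().hex  # in a real mesh this comes from inbound headers

    metric = {  # pillar 1: an aggregatable time-series sample
        "name": "requests_total",
        "labels": {"service": service, "status": str(status)},
        "value": 1,
        "exemplar_trace_id": trace_id,  # exemplar links the metric to a trace
    }
    span = {  # pillar 2: one span of a distributed trace
        "trace_id": trace_id,
        "name": f"{service} {path}",
        "duration_ms": duration_ms,
    }
    log = {  # pillar 3: a structured log line carrying trace context
        "ts": time.time(),
        "trace_id": trace_id,
        "msg": f"{status} {path} in {duration_ms}ms",
    }
    return metric, span, log


metric, span, log = observe_request("checkout", "/cart", 200, 42.0)
assert metric["exemplar_trace_id"] == span["trace_id"] == log["trace_id"]
```

In a real mesh the sidecar proxy does this emission for you; the point of the sketch is only that the trace ID is the join key across all three pillars.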
### 2. Golden Signals for Mesh

| Signal         | Description               | Alert Threshold   |
| -------------- | ------------------------- | ----------------- |
| **Latency**    | Request duration P50, P99 | P99 > 500ms       |
| **Traffic**    | Requests per second       | Anomaly detection |
| **Errors**     | 5xx error rate            | > 1%              |
| **Saturation** | Resource utilization      | > 80%             |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
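The `histogram_quantile` queries above estimate percentiles from cumulative bucket counters by linear interpolation. As a sanity aid, here is a simplified Python re-implementation of that interpolation (an approximation for illustration, not the actual Prometheus engine code):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile().
    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    with the last bound being +Inf. Interpolates linearly inside the
    bucket that contains the target rank q * total."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            width = bound - prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + width * frac
        prev_bound, prev_count = bound, count
    return prev_bound


# e.g. istio_request_duration_milliseconds_bucket counts over a 5m window:
buckets = [(100, 900), (250, 950), (500, 990), (1000, 999), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # → 500.0 (ms)
```

Note the practical consequence: the estimate can never be more precise than your bucket boundaries, which is why bucket layout matters for latency SLOs.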
### Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775  # UDP
            - containerPort: 6831  # Thrift
            - containerPort: 6832  # Thrift
            - containerPort: 5778  # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411  # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers

### Don'ts

- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
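"Sample appropriately" and "propagate headers consistently" interact: if every hop made an independent coin flip, most traces would be kept in some services and dropped in others. A common fix is deterministic head sampling keyed on the trace ID, so every service reaches the same decision. A minimal Python sketch (hypothetical helper, not mesh code):

```python
import hashlib


def keep_trace(trace_id: str, sample_pct: float) -> bool:
    """Deterministic head sampling: every service hashing the same trace ID
    makes the same keep/drop decision, so traces are never half-sampled."""
    # Map the trace ID uniformly into [0, 10000) and keep the low slice.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < sample_pct * 100  # e.g. sample_pct=1.0 keeps ~1% of traces


# The same trace ID always yields the same decision, regardless of which
# service evaluates it:
assert keep_trace("4bf92f3577b34da6", 10.0) == keep_trace("4bf92f3577b34da6", 10.0)

# Over many traces, roughly sample_pct percent are kept:
kept = sum(keep_trace(f"trace-{i}", 10.0) for i in range(10_000))
```

This mirrors what `randomSamplingPercentage` in the Telemetry resource above controls; the sidecar makes the decision once at the edge and downstream proxies honor the sampled flag in the propagated headers.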