Claude Agent Skill · by Wshobson

Service Mesh Observability

A solid implementation guide for getting observability working across Istio and Linkerd deployments. Sets up the full stack with Prometheus metrics, Jaeger tracing, and Grafana dashboards.

Install

$ npx skills add https://github.com/wshobson/agents --skill service-mesh-observability
Works with Paperclip

How Service Mesh Observability fits into a Paperclip company.

Service Mesh Observability drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.

Paired pack: SaaS Factory, a pre-configured AI company (18 agents, 18 skills, one-time purchase), $27 (down from $59).
Source file: SKILL.md (378 lines)
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```
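To make the correlation between the pillars concrete, here is a minimal Python sketch (hypothetical names, not part of the skill itself) showing how a single request can emit all three signals keyed by the same trace ID — which is exactly what makes cross-pillar correlation (e.g. metric exemplars linking to traces) possible:

```python
import time
import uuid


def observe_request(service: str, path: str, status: int, duration_ms: float):
    """Emit a metric sample, a trace span, and a log line for one request,
    all sharing the same trace ID so they can be correlated later."""
    trace_id = uuid.uuid4().hex  # in a real mesh this comes from inbound headers

    metric = {  # pillar 1: an aggregatable time-series sample
        "name": "requests_total",
        "labels": {"service": service, "status": str(status)},
        "value": 1,
        "exemplar_trace_id": trace_id,  # exemplar links the metric to a trace
    }
    span = {  # pillar 2: one span of a distributed trace
        "trace_id": trace_id,
        "name": f"{service} {path}",
        "duration_ms": duration_ms,
    }
    log = {  # pillar 3: a structured log line carrying trace context
        "ts": time.time(),
        "trace_id": trace_id,
        "msg": f"{status} {path} in {duration_ms}ms",
    }
    return metric, span, log


metric, span, log = observe_request("checkout", "/cart", 200, 42.0)
assert metric["exemplar_trace_id"] == span["trace_id"] == log["trace_id"]
```

In a real mesh the sidecar proxy does this emission for you; the point of the sketch is only that the trace ID is the join key across all three pillars.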
### 2. Golden Signals for Mesh

| Signal         | Description               | Alert Threshold   |
| -------------- | ------------------------- | ----------------- |
| **Latency**    | Request duration P50, P99 | P99 > 500ms       |
| **Traffic**    | Requests per second       | Anomaly detection |
| **Errors**     | 5xx error rate            | > 1%              |
| **Saturation** | Resource utilization      | > 80%             |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99,
  sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99,
  sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
  by (le, destination_service_name))
```
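The `histogram_quantile` queries above estimate percentiles from cumulative bucket counters by linear interpolation. As a sanity aid, here is a simplified Python re-implementation of that interpolation (an approximation for illustration, not the actual Prometheus engine code):

```python
def histogram_quantile(q: float, buckets: list[tuple[float, float]]) -> float:
    """Approximate Prometheus histogram_quantile().
    buckets: (upper_bound, cumulative_count) pairs sorted by bound,
    with the last bound being +Inf. Interpolates linearly inside the
    bucket that contains the target rank q * total."""
    total = buckets[-1][1]
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                return prev_bound  # cannot interpolate into the +Inf bucket
            width = bound - prev_bound
            frac = (rank - prev_count) / (count - prev_count)
            return prev_bound + width * frac
        prev_bound, prev_count = bound, count
    return prev_bound


# e.g. istio_request_duration_milliseconds_bucket counts over a 5m window:
buckets = [(100, 900), (250, 950), (500, 990), (1000, 999), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # → 500.0 (ms)
```

Note the practical consequence: the estimate can never be more precise than your bucket boundaries, which is why bucket layout matters for latency SLOs.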
### Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775  # UDP
            - containerPort: 6831  # Thrift
            - containerPort: 6832  # Thrift
            - containerPort: 5778  # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411  # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411

    processors:
      batch:
        timeout: 10s

    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889

    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
            / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"

        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
            by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"

        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers

### Don'ts

- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
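"Sample appropriately" and "propagate headers consistently" interact: if every hop made an independent coin flip, most traces would be kept in some services and dropped in others. A common fix is deterministic head sampling keyed on the trace ID, so every service reaches the same decision. A minimal Python sketch (hypothetical helper, not mesh code):

```python
import hashlib


def keep_trace(trace_id: str, sample_pct: float) -> bool:
    """Deterministic head sampling: every service hashing the same trace ID
    makes the same keep/drop decision, so traces are never half-sampled."""
    # Map the trace ID uniformly into [0, 10000) and keep the low slice.
    bucket = int(hashlib.sha256(trace_id.encode()).hexdigest()[:8], 16) % 10_000
    return bucket < sample_pct * 100  # e.g. sample_pct=1.0 keeps ~1% of traces


# The same trace ID always yields the same decision, regardless of which
# service evaluates it:
assert keep_trace("4bf92f3577b34da6", 10.0) == keep_trace("4bf92f3577b34da6", 10.0)

# Over many traces, roughly sample_pct percent are kept:
kept = sum(keep_trace(f"trace-{i}", 10.0) for i in range(10_000))
```

This mirrors what `randomSamplingPercentage` in the Telemetry resource above controls; the sidecar makes the decision once at the edge and downstream proxies honor the sampled flag in the propagated headers.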