npx skills add https://github.com/wshobson/agents --skill service-mesh-observability
SKILL.md (378 lines)
---
name: service-mesh-observability
description: Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
---

# Service Mesh Observability

Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.

## When to Use This Skill

- Setting up distributed tracing across services
- Implementing service mesh metrics and dashboards
- Debugging latency and error issues
- Defining SLOs for service communication
- Visualizing service dependencies
- Troubleshooting mesh connectivity

## Core Concepts

### 1. Three Pillars of Observability

```
┌─────────────────────────────────────────────────────┐
│                    Observability                    │
├─────────────────┬─────────────────┬─────────────────┤
│     Metrics     │     Traces      │      Logs       │
│                 │                 │                 │
│ • Request rate  │ • Span context  │ • Access logs   │
│ • Error rate    │ • Latency       │ • Error details │
│ • Latency P50   │ • Dependencies  │ • Debug info    │
│ • Saturation    │ • Bottlenecks   │ • Audit trail   │
└─────────────────┴─────────────────┴─────────────────┘
```

### 2. Golden Signals for Mesh

| Signal         | Description               | Alert Threshold   |
| -------------- | ------------------------- | ----------------- |
| **Latency**    | Request duration P50, P99 | P99 > 500ms       |
| **Traffic**    | Requests per second       | Anomaly detection |
| **Errors**     | 5xx error rate            | > 1%              |
| **Saturation** | Resource utilization      | > 80%             |

## Templates

### Template 1: Istio with Prometheus & Grafana

```yaml
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus
  namespace: istio-system
data:
  prometheus.yml: |
    global:
      scrape_interval: 15s
    scrape_configs:
      - job_name: 'istio-mesh'
        kubernetes_sd_configs:
          - role: endpoints
            namespaces:
              names:
                - istio-system
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_name]
            action: keep
            regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: istio-mesh
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: istiod
  endpoints:
    - port: http-monitoring
      interval: 15s
```

### Template 2: Key Istio Metrics Queries

```promql
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)

# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
  / sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100

# P99 latency
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m])) by (le, destination_service_name))

# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)

# Request size
histogram_quantile(0.99, sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m])) by (le, destination_service_name))
```

### Template 3: Jaeger Distributed Tracing

```yaml
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
  meshConfig:
    enableTracing: true
    defaultConfig:
      tracing:
        sampling: 100.0 # 100% in dev, lower in prod
        zipkin:
          address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jaeger
  namespace: istio-system
spec:
  selector:
    matchLabels:
      app: jaeger
  template:
    metadata:
      labels:
        app: jaeger
    spec:
      containers:
        - name: jaeger
          image: jaegertracing/all-in-one:1.50
          ports:
            - containerPort: 5775  # UDP
            - containerPort: 6831  # Thrift
            - containerPort: 6832  # Thrift
            - containerPort: 5778  # Config
            - containerPort: 16686 # UI
            - containerPort: 14268 # HTTP
            - containerPort: 14250 # gRPC
            - containerPort: 9411  # Zipkin
          env:
            - name: COLLECTOR_ZIPKIN_HOST_PORT
              value: ":9411"
```

### Template 4: Linkerd Viz Dashboard

```bash
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -

# Access dashboard
linkerd viz dashboard

# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app

# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend

# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend

# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
```

### Template 5: Grafana Dashboard JSON

```json
{
  "dashboard": {
    "title": "Service Mesh Overview",
    "panels": [
      {
        "title": "Request Rate",
        "type": "graph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Error Rate",
        "type": "gauge",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
          }
        ],
        "fieldConfig": {
          "defaults": {
            "thresholds": {
              "steps": [
                { "value": 0, "color": "green" },
                { "value": 1, "color": "yellow" },
                { "value": 5, "color": "red" }
              ]
            }
          }
        }
      },
      {
        "title": "P99 Latency",
        "type": "graph",
        "targets": [
          {
            "expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
            "legendFormat": "{{destination_service_name}}"
          }
        ]
      },
      {
        "title": "Service Topology",
        "type": "nodeGraph",
        "targets": [
          {
            "expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
          }
        ]
      }
    ]
  }
}
```

### Template 6: Kiali Service Mesh Visualization

```yaml
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
  name: kiali
  namespace: istio-system
spec:
  auth:
    strategy: anonymous # or openid, token
  deployment:
    accessible_namespaces:
      - "**"
  external_services:
    prometheus:
      url: http://prometheus.istio-system:9090
    tracing:
      url: http://jaeger-query.istio-system:16686
    grafana:
      url: http://grafana.istio-system:3000
```

### Template 7: OpenTelemetry Integration

```yaml
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
      zipkin:
        endpoint: 0.0.0.0:9411
    processors:
      batch:
        timeout: 10s
    exporters:
      jaeger:
        endpoint: jaeger-collector:14250
        tls:
          insecure: true
      prometheus:
        endpoint: 0.0.0.0:8889
    service:
      pipelines:
        traces:
          receivers: [otlp, zipkin]
          processors: [batch]
          exporters: [jaeger]
        metrics:
          receivers: [otlp]
          processors: [batch]
          exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system
spec:
  tracing:
    - providers:
        - name: otel
      randomSamplingPercentage: 10
```

## Alerting Rules

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: mesh-alerts
  namespace: istio-system
spec:
  groups:
    - name: mesh.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
              / sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error rate for {{ $labels.destination_service_name }}"
        - alert: HighLatency
          expr: |
            histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m])) by (le, destination_service_name)) > 1000
          for: 5m
          labels:
            severity: warning
          annotations:
            summary: "High P99 latency for {{ $labels.destination_service_name }}"
        - alert: MeshCertExpiring
          expr: |
            (certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
          labels:
            severity: warning
          annotations:
            summary: "Mesh certificate expiring in less than 7 days"
```

## Best Practices

### Do's

- **Sample appropriately** - 100% in dev, 1-10% in prod
- **Use trace context** - Propagate headers consistently
- **Set up alerts** - For golden signals
- **Correlate metrics/traces** - Use exemplars
- **Retain strategically** - Hot/cold storage tiers

### Don'ts

- **Don't over-sample** - Storage costs add up
- **Don't ignore cardinality** - Limit label values
- **Don't skip dashboards** - Visualize dependencies
- **Don't forget costs** - Monitor observability costs
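The P99 queries and the HighLatency alert above all lean on `histogram_quantile`, which estimates a quantile from cumulative `_bucket` counts by linear interpolation inside the target bucket. As a mental model only (this is not Prometheus source code, and the bucket values are made up), the estimation looks roughly like this:

```python
def histogram_quantile(phi, buckets):
    """Approximate Prometheus-style histogram_quantile.

    buckets: list of (upper_bound, cumulative_count) pairs sorted by
    bound, ending with (float('inf'), total_count). Interpolates
    linearly within the bucket that contains the phi-quantile.
    """
    total = buckets[-1][1]
    rank = phi * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Quantile falls in the +Inf bucket: return the last
                # finite bound, as no upper edge exists to interpolate to.
                return prev_bound
            # Linear interpolation between the bucket's edges.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / (count - prev_count)
        prev_bound, prev_count = bound, count

# Illustrative buckets in milliseconds (le=100, 250, 500, 1000, +Inf)
# for 1000 observed requests.
buckets = [(100, 800), (250, 950), (500, 990), (1000, 999), (float("inf"), 1000)]
p99 = histogram_quantile(0.99, buckets)  # falls in the (250, 500] bucket
```

The practical takeaway: the estimate can never be more precise than your bucket layout, so the `le` bounds Envoy exports around your SLO threshold (e.g. 500ms) determine how trustworthy a "P99 > 500ms" alert really is.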