Claude Agent Skill · by Wshobson

Distributed Tracing

This gives you the complete setup for distributed tracing with both Jaeger and Grafana Tempo, including OpenTelemetry instrumentation for Python, Node.js, and G

Install
$ npx skills add https://github.com/wshobson/agents --skill distributed-tracing
Source file: SKILL.md (449 lines)
---
name: distributed-tracing
description: Implement distributed tracing with Jaeger and Tempo to track requests across microservices and identify performance bottlenecks. Use when debugging microservices, analyzing request flows, or implementing observability for distributed systems.
---

# Distributed Tracing

Implement distributed tracing with Jaeger and Tempo for request flow visibility across microservices.

## Purpose

Track requests across distributed systems to understand latency, dependencies, and failure points.

## When to Use

- Debug latency issues
- Understand service dependencies
- Identify bottlenecks
- Trace error propagation
- Analyze request paths

## Distributed Tracing Concepts

### Trace Structure

```
Trace (Request ID: abc123)
Span (frontend) [100ms]
Span (api-gateway) [80ms]
  ├→ Span (auth-service) [10ms]
  └→ Span (user-service) [60ms]
      └→ Span (database) [40ms]
```

### Key Components

- **Trace** - End-to-end request journey
- **Span** - Single operation within a trace
- **Context** - Metadata propagated between services
- **Tags** - Key-value pairs for filtering
- **Logs** - Timestamped events within a span

## Jaeger Setup

### Kubernetes Deployment

```bash
# Deploy Jaeger Operator
kubectl create namespace observability
kubectl create -f https://github.com/jaegertracing/jaeger-operator/releases/download/v1.51.0/jaeger-operator.yaml -n observability

# Deploy Jaeger instance
kubectl apply -f - <<EOF
apiVersion: jaegertracing.io/v1
kind: Jaeger
metadata:
  name: jaeger
  namespace: observability
spec:
  strategy: production
  storage:
    type: elasticsearch
    options:
      es:
        server-urls: http://elasticsearch:9200
  ingress:
    enabled: true
EOF
```

### Docker Compose

```yaml
version: "3.8"
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - "5775:5775/udp"
      - "6831:6831/udp"
      - "6832:6832/udp"
      - "5778:5778"
      - "16686:16686" # UI
      - "14268:14268" # Collector
      - "14250:14250" # gRPC
      - "9411:9411" # Zipkin
    environment:
      - COLLECTOR_ZIPKIN_HOST_PORT=:9411
```

**Reference:** See `references/jaeger-setup.md`

## Application Instrumentation

### OpenTelemetry (Recommended)

#### Python (Flask)

```python
from flask import Flask
from opentelemetry import trace
from opentelemetry.exporter.jaeger.thrift import JaegerExporter
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.sdk.resources import SERVICE_NAME, Resource
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor

# Initialize tracer.
# Note: the Jaeger Thrift exporter is deprecated in newer OpenTelemetry
# releases; recent Jaeger versions also accept OTLP directly.
resource = Resource(attributes={SERVICE_NAME: "my-service"})
provider = TracerProvider(resource=resource)
processor = BatchSpanProcessor(JaegerExporter(
    agent_host_name="jaeger",
    agent_port=6831,
))
provider.add_span_processor(processor)
trace.set_tracer_provider(provider)

# Instrument Flask so each incoming request gets a server span
app = Flask(__name__)
FlaskInstrumentor().instrument_app(app)

@app.route('/api/users')
def get_users():
    tracer = trace.get_tracer(__name__)

    with tracer.start_as_current_span("get_users") as span:
        # Business logic
        users = fetch_users_from_db()
        span.set_attribute("user.count", len(users))
        return {"users": users}

def fetch_users_from_db():
    tracer = trace.get_tracer(__name__)

    # Child span: nests under "get_users" via the current context
    with tracer.start_as_current_span("database_query") as span:
        span.set_attribute("db.system", "postgresql")
        span.set_attribute("db.statement", "SELECT * FROM users")
        # Database query
        return query_database()

def query_database():
    # Placeholder: replace with a real database call
    return []
```

#### Node.js (Express)

```javascript
// API shown targets the OpenTelemetry JS SDK 1.x
const { trace } = require("@opentelemetry/api");
const { Resource } = require("@opentelemetry/resources");
const { NodeTracerProvider } = require("@opentelemetry/sdk-trace-node");
const { JaegerExporter } = require("@opentelemetry/exporter-jaeger");
const { BatchSpanProcessor } = require("@opentelemetry/sdk-trace-base");
const { registerInstrumentations } = require("@opentelemetry/instrumentation");
const { HttpInstrumentation } = require("@opentelemetry/instrumentation-http");
const {
  ExpressInstrumentation,
} = require("@opentelemetry/instrumentation-express");

// Initialize tracer
const provider = new NodeTracerProvider({
  resource: new Resource({ "service.name": "my-service" }),
});

const exporter = new JaegerExporter({
  endpoint: "http://jaeger:14268/api/traces",
});

provider.addSpanProcessor(new BatchSpanProcessor(exporter));
provider.register();

// Instrument libraries (must run before the libraries are required)
registerInstrumentations({
  instrumentations: [new HttpInstrumentation(), new ExpressInstrumentation()],
});

const express = require("express");
const app = express();

app.get("/api/users", async (req, res) => {
  const tracer = trace.getTracer("my-service");
  const span = tracer.startSpan("get_users");

  try {
    // fetchUsers is application-specific and defined elsewhere
    const users = await fetchUsers();
    span.setAttributes({ "user.count": users.length });
    res.json({ users });
  } finally {
    // Always end the span, even if the handler throws
    span.end();
  }
});
```

#### Go

```go
package main

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
    "go.opentelemetry.io/otel/exporters/jaeger"
    "go.opentelemetry.io/otel/sdk/resource"
    sdktrace "go.opentelemetry.io/otel/sdk/trace"
    semconv "go.opentelemetry.io/otel/semconv/v1.4.0"
)

func initTracer() (*sdktrace.TracerProvider, error) {
    exporter, err := jaeger.New(jaeger.WithCollectorEndpoint(
        jaeger.WithEndpoint("http://jaeger:14268/api/traces"),
    ))
    if err != nil {
        return nil, err
    }

    tp := sdktrace.NewTracerProvider(
        sdktrace.WithBatcher(exporter),
        sdktrace.WithResource(resource.NewWithAttributes(
            semconv.SchemaURL,
            semconv.ServiceNameKey.String("my-service"),
        )),
    )

    otel.SetTracerProvider(tp)
    return tp, nil
}

// User and fetchUsersFromDB are application-specific and defined elsewhere.
func getUsers(ctx context.Context) ([]User, error) {
    tracer := otel.Tracer("my-service")
    ctx, span := tracer.Start(ctx, "get_users")
    defer span.End()

    span.SetAttributes(attribute.String("user.filter", "active"))

    users, err := fetchUsersFromDB(ctx)
    if err != nil {
        span.RecordError(err)
        return nil, err
    }

    span.SetAttributes(attribute.Int("user.count", len(users)))
    return users, nil
}
```

**Reference:** See `references/instrumentation.md`

## Context Propagation

### HTTP Headers

```
traceparent: 00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01
tracestate: congo=t61rcWkgMzE
```

### Propagation in HTTP Requests

#### Python

```python
import requests
from opentelemetry.propagate import inject

headers = {}
inject(headers)  # Injects the current trace context into the headers dict

response = requests.get('http://downstream-service/api', headers=headers)
```

#### Node.js

```javascript
const axios = require("axios");
const { context, propagation } = require("@opentelemetry/api");

const headers = {};
// Write the active trace context into the outgoing headers
propagation.inject(context.active(), headers);

axios.get("http://downstream-service/api", { headers });
```

## Tempo Setup (Grafana)

### Kubernetes Deployment

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: tempo-config
data:
  tempo.yaml: |
    server:
      http_listen_port: 3200

    distributor:
      receivers:
        jaeger:
          protocols:
            thrift_http:
            grpc:
        otlp:
          protocols:
            http:
            grpc:

    storage:
      trace:
        backend: s3
        s3:
          bucket: tempo-traces
          endpoint: s3.amazonaws.com

    querier:
      frontend_worker:
        frontend_address: tempo-query-frontend:9095
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: tempo
spec:
  replicas: 1
  # apps/v1 Deployments require a selector that matches the pod labels
  selector:
    matchLabels:
      app: tempo
  template:
    metadata:
      labels:
        app: tempo
    spec:
      containers:
        - name: tempo
          image: grafana/tempo:latest
          args:
            - -config.file=/etc/tempo/tempo.yaml
          volumeMounts:
            - name: config
              mountPath: /etc/tempo
      volumes:
        - name: config
          configMap:
            name: tempo-config
```

**Reference:** See `assets/jaeger-config.yaml.template`

## Sampling Strategies

### Probabilistic Sampling

```yaml
# Sample 1% of traces
sampler:
  type: probabilistic
  param: 0.01
```

### Rate Limiting Sampling

```yaml
# Sample max 100 traces per second
sampler:
  type: ratelimiting
  param: 100
```

### Parent-Based Ratio Sampling

```python
from opentelemetry.sdk.trace.sampling import ParentBased, TraceIdRatioBased

# Child spans follow the parent's sampling decision; root spans are
# sampled deterministically by trace ID at a 1% ratio.
sampler = ParentBased(root=TraceIdRatioBased(0.01))
```

## Trace Analysis

### Finding Slow Requests

**Jaeger Query:**

```
service=my-service
duration > 1s
```

### Finding Errors

**Jaeger Query:**

```
service=my-service
error=true
tags.http.status_code >= 500
```

### Service Dependency Graph

Jaeger automatically generates service dependency graphs showing:

- Service relationships
- Request rates
- Error rates
- Average latencies

## Best Practices

1. **Sample appropriately** (1-10% in production)
2. **Add meaningful tags** (user_id, request_id)
3. **Propagate context** across all service boundaries
4. **Log exceptions** in spans
5. **Use consistent naming** for operations
6. **Monitor tracing overhead** (<1% CPU impact)
7. **Set up alerts** for trace errors
8. **Implement distributed context** (baggage)
9. **Use span events** for important milestones
10. **Document instrumentation** standards

## Integration with Logging

### Correlated Logs

```python
import logging
from opentelemetry import trace

logger = logging.getLogger(__name__)

def process_request():
    span = trace.get_current_span()
    trace_id = span.get_span_context().trace_id

    # Format the 128-bit trace ID as 32 hex characters so it matches the tracing UI
    logger.info(
        "Processing request",
        extra={"trace_id": format(trace_id, '032x')}
    )
```

## Troubleshooting

**No traces appearing:**

- Check collector endpoint
- Verify network connectivity
- Check sampling configuration
- Review application logs

**High latency overhead:**

- Reduce sampling rate
- Use batch span processor
- Check exporter configuration

## Related Skills

- `prometheus-configuration` - For metrics
- `grafana-dashboards` - For visualization
- `slo-implementation` - For latency SLOs
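
## Appendix: Decoding a `traceparent` Header

When debugging propagation, it can help to decode by hand the `traceparent` header shown under Context Propagation. The sketch below uses only the Python standard library and follows the W3C Trace Context layout for version `00` (`version-traceid-parentid-flags`). The names `TraceParent` and `parse_traceparent` are hypothetical helpers for illustration, not part of OpenTelemetry, and the parser deliberately rejects anything that is not the exact four-field form.

```python
import re
from typing import NamedTuple, Optional

# W3C Trace Context layout: version-traceid-parentid-flags, lowercase hex.
_TRACEPARENT_RE = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


class TraceParent(NamedTuple):
    """Hypothetical container for the decoded header fields."""
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the decoded fields, or None if the header is malformed."""
    match = _TRACEPARENT_RE.match(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version "ff" and all-zero IDs are invalid per the specification.
    if version == "ff" or set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return None
    # Bit 0 of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


if __name__ == "__main__":
    print(parse_traceparent(
        "00-0af7651916cd43dd8448eb211c80319c-b7ad6b7169203331-01"
    ))
```

If a downstream service logs a header that this function rejects, or one whose sampled flag is unexpectedly off, that narrows the problem to propagation or sampling rather than the exporter.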