## Install

```shell
npx skills add https://github.com/jeffallan/claude-skills --skill chaos-engineer
```

## Works with Paperclip
How Chaos Engineer fits into a Paperclip company.
Chaos Engineer drops into any Paperclip agent that handles this kind of work. Assign it to a specialist inside a pre-configured PaperclipOrg company and the skill becomes available on every heartbeat — no prompt engineering, no tool wiring.
## SKILL.md (182 lines)
---
name: chaos-engineer
description: Designs chaos experiments, creates failure injection frameworks, and facilitates game day exercises for distributed systems — producing runbooks, experiment manifests, rollback procedures, and post-mortem templates. Use when designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises. Invoke for chaos experiments, resilience testing, blast radius control, game days, antifragile systems, fault injection, Chaos Monkey, Litmus Chaos.
license: MIT
metadata:
  author: https://github.com/Jeffallan
  version: "1.1.0"
  domain: devops
  triggers: chaos engineering, resilience testing, failure injection, game day, blast radius, chaos experiment, fault injection, Chaos Monkey, Litmus Chaos, antifragile
  role: specialist
  scope: implementation
  output-format: code
  related-skills: sre-engineer, devops-engineer, kubernetes-specialist
---

# Chaos Engineer

## When to Use This Skill

- Designing and executing chaos experiments
- Implementing failure injection frameworks (Chaos Monkey, Litmus, etc.)
- Planning and conducting game day exercises
- Building blast radius controls and safety mechanisms
- Setting up continuous chaos testing in CI/CD
- Improving system resilience based on experiment findings

## Core Workflow

1. **System Analysis** - Map architecture, dependencies, critical paths, and failure modes
2. **Experiment Design** - Define hypothesis, steady state, blast radius, and safety controls
3. **Execute Chaos** - Run controlled experiments with monitoring and quick rollback
4. **Learn & Improve** - Document findings, implement fixes, enhance monitoring
5. **Automate** - Integrate chaos testing into CI/CD for continuous resilience

## Reference Guide

Load detailed guidance based on context:

| Topic | Reference | Load When |
|-------|-----------|-----------|
| Experiments | `references/experiment-design.md` | Designing hypothesis, blast radius, rollback |
| Infrastructure | `references/infrastructure-chaos.md` | Server, network, zone, region failures |
| Kubernetes | `references/kubernetes-chaos.md` | Pod, node, Litmus, Chaos Mesh experiments |
| Tools & Automation | `references/chaos-tools.md` | Chaos Monkey, Gremlin, Pumba, CI/CD integration |
| Game Days | `references/game-days.md` | Planning, executing, learning from game days |

## Safety Checklist

Non-obvious constraints that must be enforced on every experiment:

- **Steady state first** — define and verify baseline metrics before injecting any failure
- **Blast radius cap** — start with the smallest possible impact scope; expand only after validation
- **Automated rollback ≤ 30 seconds** — abort path must be scripted and tested before the experiment begins
- **Single variable** — change only one failure condition at a time until behaviour is well understood
- **No production without safety nets** — customer-facing environments require circuit breakers, feature flags, or canary isolation
- **Close the loop** — every experiment must produce a written learning summary and at least one tracked improvement

## Output Templates

When implementing chaos engineering, provide:

1. Experiment design document (hypothesis, metrics, blast radius)
2. Implementation code (failure injection scripts/manifests)
3. Monitoring setup and alert configuration
4. Rollback procedures and safety controls
5. Learning summary and improvement recommendations

## Concrete Example: Pod Failure Experiment (Litmus Chaos)

The following shows a complete experiment — from hypothesis to rollback — using Litmus Chaos on Kubernetes.
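The safety checklist's 30-second abort rule implies that the steady-state check itself must be scriptable, not eyeballed. Below is a minimal sketch of such a gate with the Step 1 thresholds (p99 < 200ms, error rate < 0.1%) hard-coded; the metric values are passed in as arguments here, though in practice they would come from your monitoring stack (for example, a Prometheus query):

```shell
#!/usr/bin/env bash
# Steady-state gate: succeeds (exit 0) only while both thresholds hold.
# Arguments: p99 latency in ms, error rate as a fraction.
steady_state_ok() {
  local p99_ms="$1" err_rate="$2"
  awk -v p99="$p99_ms" -v err="$err_rate" \
    'BEGIN { exit !(p99 < 200 && err < 0.001) }'
}

# A watchdog would call this in a loop and trigger the Step 4 abort
# (engineState: stop) the moment it fails.
if steady_state_ok 150 0.0005; then echo "within steady state"; fi
if ! steady_state_ok 250 0.0005; then echo "abort: steady state violated"; fi
```

Wiring this into a polling loop that patches the ChaosEngine on failure gives the automated rollback path the checklist requires.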
### Step 1 — Define and verify steady state

```bash
# Verify baseline: p99 latency < 200ms, error rate < 0.1%
kubectl get deploy my-service -n production
kubectl top pods -n production -l app=my-service
```

### Step 2 — Create and apply a Litmus ChaosEngine manifest

```yaml
# chaos-pod-delete.yaml
apiVersion: litmuschaos.io/v1alpha1
kind: ChaosEngine
metadata:
  name: my-service-pod-delete
  namespace: production
spec:
  appinfo:
    appns: production
    applabel: "app=my-service"
    appkind: deployment
  # Blast radius is capped via PODS_AFFECTED_PERC below
  engineState: active
  chaosServiceAccount: litmus-admin
  experiments:
    - name: pod-delete
      spec:
        components:
          env:
            - name: TOTAL_CHAOS_DURATION
              value: "60"    # seconds
            - name: CHAOS_INTERVAL
              value: "20"    # delete one pod every 20s
            - name: FORCE
              value: "false"
            - name: PODS_AFFECTED_PERC
              value: "33"    # max 33% of replicas affected
```

```bash
# Apply the experiment
kubectl apply -f chaos-pod-delete.yaml

# Watch experiment status
kubectl describe chaosengine my-service-pod-delete -n production
kubectl get chaosresult my-service-pod-delete-pod-delete -n production -w
```

### Step 3 — Monitor during the experiment

```bash
# Tail application logs for errors
kubectl logs -l app=my-service -n production --since=2m -f

# Check ChaosResult verdict when complete
kubectl get chaosresult my-service-pod-delete-pod-delete \
  -n production -o jsonpath='{.status.experimentStatus.verdict}'
```

### Step 4 — Rollback / abort if steady state is violated

```bash
# Immediately stop the experiment
kubectl patch chaosengine my-service-pod-delete \
  -n production --type merge -p '{"spec":{"engineState":"stop"}}'

# Confirm all pods are healthy
kubectl rollout status deployment/my-service -n production
```

## Concrete Example: Network Latency with toxiproxy

```bash
# Install toxiproxy CLI
brew install toxiproxy   # macOS; use the binary release on Linux

# Start toxiproxy server (runs alongside your service)
toxiproxy-server &

# Create a proxy for your downstream dependency
toxiproxy-cli create -l 0.0.0.0:22222 -u downstream-db:5432 db-proxy

# Inject 300ms latency with 10% jitter — blast radius: this proxy only
toxiproxy-cli toxic add db-proxy -t latency -a latency=300 -a jitter=30

# Run your load test / observe metrics here ...

# Remove the toxic to restore normal behaviour
toxiproxy-cli toxic remove db-proxy -n latency_downstream
```

## Concrete Example: Chaos Monkey (Spinnaker / standalone)

```yaml
# chaos-monkey-config.yml — restrict to a single ASG
deployment:
  enabled: true
  regionIndependence: false
chaos:
  enabled: true
  meanTimeBetweenKillsInWorkDays: 2
  minTimeBetweenKillsInWorkDays: 1
  grouping: APP   # kill one instance per app, not per cluster
  exceptions:
    - account: production
      region: us-east-1
      detail: "*-canary"   # never kill canary instances
```

```bash
# Apply and trigger a manual kill for testing
chaos-monkey --app my-service --account staging --dry-run false
```
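To support the "Automate" step of the core workflow, the verdict check from Step 3 can be turned into a pipeline gate that fails the build when an experiment does not pass. This is a sketch: the verdict is taken as an argument here, but in CI it would come from the `kubectl get chaosresult ... -o jsonpath` query shown earlier.

```shell
#!/usr/bin/env bash
# CI gate sketch: succeed only when the Litmus verdict is "Pass".
# Litmus reports the verdict in .status.experimentStatus.verdict
# (e.g. Pass, Fail, or Awaited while still running).
chaos_gate() {
  local verdict="$1"
  case "$verdict" in
    Pass) echo "chaos gate: experiment passed" ;;
    Fail) echo "chaos gate: steady state violated, failing build"; return 1 ;;
    *)    echo "chaos gate: verdict '$verdict' not conclusive, failing build"; return 1 ;;
  esac
}

chaos_gate Pass   # prints "chaos gate: experiment passed"
```

Treating an inconclusive verdict as a failure is deliberate: a gate that passes on missing data would let a broken experiment slip through silently.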