# Runbook Writing

Write operational runbooks for the on-call engineer at 3am: step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.

## Before you start

Gather the following from the user:

1. **What system or service does this cover?** (Service name, what it does, where it runs)
2. **What scenario triggers this runbook?** (Alert name, error condition, user-reported symptom)
3. **Who is the intended audience?** (On-call generalist, team-specific engineer, external vendor)
4. **What access or permissions are required?** (SSH keys, cloud console roles, VPN, database credentials)
5. **What is the blast radius if the procedure goes wrong?** (Data loss risk, downtime scope, affected users)

If the user says "just write a runbook for the payments service," push back: "Which failure mode? A runbook covers one specific scenario. Database connection exhaustion is a different runbook than payment gateway timeouts."

## Runbook template

Use the following structure for every runbook:

### Purpose

One to two sentences stating what this runbook fixes and when to use it. Include the alert name or trigger condition verbatim so engineers can grep for it. Add a "last verified" date.

### Prerequisites

List every tool, permission, and access requirement. The on-call engineer should read this list and immediately know if they can execute the runbook or need to escalate. Example items: cloud console role, SSH/VPN access, specific dashboard URLs, PagerDuty escalation policy membership.

### Symptoms and Triggers

Describe observable signals: alert text, log patterns (exact searchable strings), dashboard anomalies, and user-reported behavior.

### Step-by-Step Procedure

Number every step. Each step must include:

- The exact command or UI action (copy-pasteable)
- Expected output so the engineer can confirm the step worked
- What to do if the output differs

Use decision branches with explicit if/then routing:

```
3. Check pg_stat_activity for connection state:
   $ psql -h <DB_HOST> -U readonly -c "SELECT state, count(*)
     FROM pg_stat_activity WHERE datname = 'checkout' GROUP BY state;"
   - IF active queries > 40: proceed to step 4 (kill long-running queries).
   - IF idle connections > 40: skip to step 5 (restart service).
   - IF neither: escalate. The connection pool issue is not database-side.
```

Every branch must lead somewhere: a next step number, a different runbook, or an escalation. Never leave the engineer at a dead end.

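The dead-end rule can be checked mechanically if each step's branches are written down as data. The following is a minimal sketch under stated assumptions: `find_dead_ends`, the `steps` layout, the `"done"` marker, and the `escalate:` / `runbook:` target prefixes are all hypothetical conventions invented for this illustration, not part of any runbook tool.

```python
def find_dead_ends(steps):
    """Return (step, condition) pairs whose branch target leads nowhere.

    `steps` maps a step number to a dict of {condition: target}. A target
    counts as valid if it is another step number present in `steps`, the
    string "done", or a string starting with "escalate:" or "runbook:".
    These conventions are assumptions made for this sketch.
    """
    dead_ends = []
    for number, branches in steps.items():
        for condition, target in branches.items():
            if isinstance(target, int):
                ok = target in steps
            else:
                ok = isinstance(target, str) and (
                    target == "done"
                    or target.startswith(("escalate:", "runbook:"))
                )
            if not ok:
                dead_ends.append((number, condition))
    return dead_ends


# Step 3 from the example above, plus the two steps it routes to.
steps = {
    3: {
        "active queries > 40": 4,
        "idle connections > 40": 5,
        "neither": "escalate:database-oncall",  # hypothetical policy name
    },
    4: {"queries killed": 5},
    5: {"service restarted": "done"},
}
print(find_dead_ends(steps))  # []
```

A branch that points at a step number missing from the runbook, or that has no target at all, shows up in the returned list and should be fixed before the runbook ships.
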
### Verification Steps

Define how the engineer confirms resolution. Include specific metric thresholds and observation windows:

```
- [ ] Connection pool utilization < 70% for 5 consecutive minutes
- [ ] No new 500 errors in service logs for 5 minutes
- [ ] Monitoring alert auto-resolves
```

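A criterion such as "below 70% for 5 consecutive minutes" combines a threshold with an observation window, and both must hold. The sketch below shows one way to express that check; `sustained_below` is a hypothetical helper and the readings are made-up values, assumed to arrive once per minute.

```python
def sustained_below(samples, threshold, window):
    """True if the last `window` samples all sit strictly below `threshold`.

    `samples` is a list of readings in time order, assumed to be one per
    minute, so `window=5` means five consecutive minutes.
    """
    if window <= 0 or len(samples) < window:
        return False  # not enough data yet: keep observing
    return all(value < threshold for value in samples[-window:])


# Hypothetical pool-utilization readings (percent), one per minute.
recovering = [96, 91, 74, 68, 66, 61, 63, 59]
print(sustained_below(recovering, threshold=70, window=5))      # True
print(sustained_below(recovering[:6], threshold=70, window=5))  # False
```

The second call fails because the five-minute window still contains readings above the threshold, which is the situation where an engineer might otherwise declare resolution too early.
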
### Rollback

Describe how to undo the procedure if it makes things worse. Reference steps by number. If a step is irreversible, say so explicitly and state the safe alternative.

### Escalation

Specify when to escalate (time thresholds, permission gaps, out-of-scope root causes), who to contact (specific team or role, not "engineering"), and how (PagerDuty policy name, Slack channel, phone bridge). Include a fallback if the first contact does not respond within a stated time.

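An escalation section with fallbacks is effectively an ordered ladder: each contact has a channel and a time bound, and silence past that bound moves the engineer to the next rung. The sketch below is illustrative only; the `Contact` type, `who_to_page`, and every role, channel, and wait time in it are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Contact:
    role: str          # a specific team or role, never just "engineering"
    channel: str       # how to reach them
    wait_minutes: int  # how long to wait for a response before moving on


def who_to_page(ladder, minutes_without_response):
    """Return the contact who should be paged after the given silence."""
    elapsed = 0
    for contact in ladder:
        elapsed += contact.wait_minutes
        if minutes_without_response < elapsed:
            return contact
    return ladder[-1]  # end of the ladder: stay with the final fallback


# All roles, channels, and wait times here are hypothetical placeholders.
ladder = [
    Contact("Database on-call", "PagerDuty policy: db-primary", 10),
    Contact("Database team lead", "Slack: #db-incidents", 10),
    Contact("Incident commander", "Phone bridge", 15),
]
print(who_to_page(ladder, 0).role)   # Database on-call
print(who_to_page(ladder, 12).role)  # Database team lead
print(who_to_page(ladder, 60).role)  # Incident commander
```

Writing the ladder out this way makes gaps obvious: a rung without a channel or a wait time is a rung the 3am responder cannot act on.
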
### Related Runbooks

Link to runbooks covering adjacent failure modes or downstream effects so the engineer can pivot quickly if the symptoms do not match this scenario.

## Quality checklist

Before delivering the runbook, verify:

- [ ] Every command is copy-pasteable, with no unmarked placeholder values
- [ ] Decision branches have explicit outcomes with next-step references for each path
- [ ] Prerequisites list every permission and tool needed before step 1
- [ ] Verification steps include specific thresholds and observation windows
- [ ] Escalation section names specific teams or roles with contact methods and time bounds
- [ ] The runbook covers exactly one failure scenario, not a general troubleshooting guide
- [ ] Someone unfamiliar with the service can follow the steps without asking questions

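The placeholder item on this checklist lends itself to a simple automated check. The sketch below assumes placeholders are written as `<ALL_CAPS>` names in angle brackets and that the runbook keeps a mapping from each name to where the real value lives; `undocumented_placeholders` and the `sources` mapping are hypothetical, not an existing tool.

```python
import re

# Matches marked placeholders such as <DB_HOST> (assumed convention).
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def undocumented_placeholders(command, sources):
    """Return placeholder names in `command` that `sources` does not explain.

    `sources` maps a placeholder name to a note on where to find its value.
    Names are returned once each, in order of first appearance.
    """
    missing = []
    for name in PLACEHOLDER.findall(command):
        if name not in sources and name not in missing:
            missing.append(name)
    return missing


command = 'psql -h <DB_HOST> -U readonly -d <DB_NAME> -c "SELECT 1;"'
sources = {"DB_HOST": "listed on the service dashboard (hypothetical)"}
print(undocumented_placeholders(command, sources))  # ['DB_NAME']
```

This only catches placeholders that are already marked. Values that were left as unmarked examples still need a human read-through.
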
## Common mistakes to avoid

- **Writing for the expert, not the 3am responder.** "Check the HPA" means nothing to a generalist on-call. Write "Check the Horizontal Pod Autoscaler: `kubectl get hpa -n checkout`" instead.
- **Omitting expected output.** Every command needs the expected result. Without it, the engineer cannot tell if the step succeeded or if they are looking at a new problem.
- **Unmarked placeholders.** If a command contains values the engineer must replace, use `<ALL_CAPS_WITH_BRACKETS>` and state where to find the real value.
- **Combining multiple failure modes.** A runbook that says "if X, do A; if Y, do B; if Z, do C" is three runbooks. Split them and cross-link in Related Runbooks.
- **Missing rollback instructions.** If a step can make things worse, the engineer needs to know how to undo it. If it is irreversible, say so before they execute.
- **Stale commands.** Runbooks rot. Include a "last verified" date and flag commands that depend on specific tool versions.

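A "last verified" date is only useful if something acts on it. The sketch below shows a minimal staleness check; `is_stale` is a hypothetical helper, and the 90-day default is an assumed review interval, not a recommendation from this guide.

```python
from datetime import date, timedelta


def is_stale(last_verified, today, max_age_days=90):
    """True if the runbook was last verified more than `max_age_days` ago.

    The 90-day default is an assumption for illustration; pick an interval
    that matches how often the underlying service changes.
    """
    return today - last_verified > timedelta(days=max_age_days)


print(is_stale(date(2024, 1, 10), today=date(2024, 3, 1)))  # False (51 days)
print(is_stale(date(2024, 1, 10), today=date(2024, 6, 1)))  # True (143 days)
```
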