Runbook Writing

Write operational runbooks for the on-call engineer at 3am — step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.

# Runbook Writing

## Before you start

Gather the following from the user:

1. **What system or service does this cover?** (Service name, what it does, where it runs)
2. **What scenario triggers this runbook?** (Alert name, error condition, user-reported symptom)
3. **Who is the intended audience?** (On-call generalist, team-specific engineer, external vendor)
4. **What access or permissions are required?** (SSH keys, cloud console roles, VPN, database credentials)
5. **What is the blast radius if the procedure goes wrong?** (Data loss risk, downtime scope, affected users)

If the user says "just write a runbook for the payments service," push back: "Which failure mode? A runbook covers one specific scenario; database connection exhaustion is a different runbook than payment gateway timeouts."
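
The five intake questions can be treated as required fields. The following is a minimal sketch, assuming a hypothetical `RunbookIntake` record whose field names are illustrative and not part of this skill. It reports which answers are still missing before drafting begins:

```python
from dataclasses import dataclass, fields


@dataclass
class RunbookIntake:
    """Hypothetical record of the five intake answers; names are illustrative."""
    service: str = ""
    trigger: str = ""
    audience: str = ""
    access: str = ""
    blast_radius: str = ""


def missing_answers(intake: RunbookIntake) -> list[str]:
    """Return the names of intake fields that are empty or whitespace-only."""
    return [f.name for f in fields(intake) if not getattr(intake, f.name).strip()]


# "Just write a runbook for the payments service" answers only one question.
vague = RunbookIntake(service="payments")
print(missing_answers(vague))  # ['trigger', 'audience', 'access', 'blast_radius']
```

A non-empty result is the cue to push back and ask for the missing details rather than guess.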

## Runbook template

Use the following structure for every runbook:

### Purpose

One to two sentences stating what this runbook fixes and when to use it. Include the alert name or trigger condition verbatim so engineers can grep for it. Add a "last verified" date.
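
A "last verified" date is most useful when something checks it. The sketch below assumes the date is written as `Last verified: YYYY-MM-DD`; that format and the alert name in the sample text are assumptions, since the template only asks for a date.

```python
import re
from datetime import date
from typing import Optional

# Assumed convention: the Purpose section carries a line like
# "Last verified: 2025-01-15". The template itself does not fix a format.
LAST_VERIFIED = re.compile(r"last verified:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def days_since_verified(runbook_text: str, today: date) -> Optional[int]:
    """Return days since the 'last verified' date, or None if no date is found."""
    match = LAST_VERIFIED.search(runbook_text)
    if match is None:
        return None
    return (today - date.fromisoformat(match.group(1))).days


# Hypothetical Purpose text; the alert name is made up for illustration.
purpose = "Fixes CheckoutDBConnectionsExhausted.\nLast verified: 2025-01-15"
print(days_since_verified(purpose, today=date(2025, 4, 15)))  # 90
```

A review job could flag any runbook where the result is `None` or exceeds a threshold the team chooses.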

### Prerequisites

List every tool, permission, and access requirement. The on-call engineer should read this list and immediately know if they can execute the runbook or need to escalate. Example items: cloud console role, SSH/VPN access, specific dashboard URLs, PagerDuty escalation policy membership.
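
The tool part of a prerequisites list can be checked mechanically. This is a small sketch using the standard library's `shutil.which`; it only confirms that a binary is on `PATH` and says nothing about permissions or network access, which still need to be listed and verified by hand.

```python
import shutil


def missing_tools(required: list[str]) -> list[str]:
    """Return the required command-line tools that are not on PATH."""
    return [tool for tool in required if shutil.which(tool) is None]


# Tools that appear in this document's examples; adjust per runbook.
missing = missing_tools(["psql", "kubectl"])
if missing:
    print("Cannot run this runbook from here; missing:", ", ".join(missing))
else:
    print("All required tools found.")
```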

### Symptoms and Triggers

Describe observable signals: alert text, log patterns (exact searchable strings), dashboard anomalies, and user-reported behavior.

### Step-by-Step Procedure

Number every step. Each step must include:

- The exact command or UI action (copy-pasteable)
- Expected output so the engineer can confirm the step worked
- What to do if the output differs

Use decision branches with explicit if/then routing:

```
3. Check pg_stat_activity for connection state:
   $ psql -h <DB_HOST> -U readonly -c "SELECT state, count(*)
     FROM pg_stat_activity WHERE datname = 'checkout' GROUP BY state;"
   - IF active queries > 40: proceed to step 4 (kill long-running queries).
   - IF idle connections > 40: skip to step 5 (restart service).
   - IF neither: escalate; the connection pool issue is not database-side.
```

Every branch must lead somewhere: a next step number, a different runbook, or an escalation. Never leave the engineer at a dead end.
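
The dead-end rule lends itself to a rough automated check. The sketch below assumes branches are written as `- IF ...` lines, as in the example above, and treats a branch as routed if it mentions a step number, a runbook, or an escalation. The keyword heuristic is an assumption and will miss unusual phrasings.

```python
import re

# A branch counts as routed if it names a step number, another runbook,
# or an escalation. This keyword heuristic is an assumption, not a standard.
ROUTED = re.compile(r"\bstep \d+\b|\brunbook\b|\bescalat", re.IGNORECASE)


def dead_end_branches(procedure: str) -> list[str]:
    """Return branch lines ("- IF ...") that do not say where to go next."""
    branches = [line.strip() for line in procedure.splitlines()
                if line.strip().upper().startswith("- IF ")]
    return [b for b in branches if not ROUTED.search(b)]


step = """
3. Check pg_stat_activity for connection state:
   - IF active queries > 40: proceed to step 4 (kill long-running queries).
   - IF idle connections > 40: restart the service.
   - IF neither: escalate to the database on-call.
"""
print(dead_end_branches(step))
# ['- IF idle connections > 40: restart the service.']
```

The flagged branch tells the engineer what to do but not which step covers it, which is exactly the gap the rule is meant to catch.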

### Verification Steps

Define how the engineer confirms resolution. Include specific metric thresholds and observation windows:

```
- [ ] Connection pool utilization < 70% for 5 consecutive minutes
- [ ] No new 500 errors in service logs for 5 minutes
- [ ] Monitoring alert auto-resolves
```
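
A threshold paired with an observation window can be stated precisely in code. This sketch assumes one metric sample per minute, oldest first; the sampling interval and the readings are hypothetical.

```python
def stable_below(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples are all strictly below `threshold`.

    Assumes one sample per minute, oldest first; the sampling interval is an
    assumption of this sketch, not something the runbook template prescribes.
    """
    if window <= 0 or len(samples) < window:
        return False  # not enough observation time yet
    return all(value < threshold for value in samples[-window:])


# Pool utilization (%) per minute after the fix; hypothetical readings.
utilization = [92.0, 85.0, 68.0, 66.0, 61.0, 63.0, 60.0]
print(stable_below(utilization, threshold=70.0, window=5))  # True
print(stable_below(utilization, threshold=70.0, window=6))  # False
```

Returning `False` when there are too few samples keeps the engineer from declaring resolution before the full window has elapsed.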

### Rollback

Describe how to undo the procedure if it makes things worse. Reference steps by number. If a step is irreversible, say so explicitly and state the safe alternative.

### Escalation

Specify when to escalate (time thresholds, permission gaps, out-of-scope root causes), who to contact (specific team or role, not "engineering"), and how (PagerDuty policy name, Slack channel, phone bridge). Include a fallback if the first contact does not respond within a stated time.
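
One way to make the fallback rule unambiguous is to write the escalation path as an ordered chain with a wait time per contact. The roles, channels, and timings below are placeholders invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    role: str          # a specific team or role, never just "engineering"
    channel: str       # how to reach them
    wait_minutes: int  # how long to wait before moving to the next contact


# Hypothetical chain; the roles, channels, and timings are placeholders.
CHAIN = [
    Contact("database on-call", "PagerDuty policy <DB_POLICY_NAME>", 10),
    Contact("platform team lead", "Slack <PLATFORM_CHANNEL>", 15),
    Contact("incident commander", "phone bridge <BRIDGE_NUMBER>", 0),
]


def current_contact(chain: list[Contact], minutes_since_escalation: int) -> Contact:
    """Return who should be paged now, falling through as each wait expires."""
    elapsed = minutes_since_escalation
    for contact in chain[:-1]:
        if elapsed < contact.wait_minutes:
            return contact
        elapsed -= contact.wait_minutes
    return chain[-1]  # the last contact is the final fallback


print(current_contact(CHAIN, 4).role)   # database on-call
print(current_contact(CHAIN, 12).role)  # platform team lead
print(current_contact(CHAIN, 30).role)  # incident commander
```

Writing the chain this way forces every contact to have a stated time bound, which is what the Escalation section asks for.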

### Related Runbooks

Link to runbooks covering adjacent failure modes or downstream effects so the engineer can pivot quickly if the symptoms do not match this scenario.

## Quality checklist

Before delivering the runbook, verify:

- [ ] Every command is copy-pasteable, with no unmarked placeholder values
- [ ] Decision branches have explicit outcomes with next-step references for each path
- [ ] Prerequisites list every permission and tool needed before step 1
- [ ] Verification steps include specific thresholds and observation windows
- [ ] Escalation section names specific teams or roles with contact methods and time bounds
- [ ] The runbook covers exactly one failure scenario, not a general troubleshooting guide
- [ ] Someone unfamiliar with the service can follow the steps without asking questions

## Common mistakes to avoid

- **Writing for the expert, not the 3am responder.** "Check the HPA" means nothing to a generalist on-call. Write "Check the Horizontal Pod Autoscaler: `kubectl get hpa -n checkout`" instead.
- **Omitting expected output.** Every command needs the expected result. Without it, the engineer cannot tell if the step succeeded or if they are looking at a new problem.
- **Unmarked placeholders.** If a command contains values the engineer must replace, use `<ALL_CAPS_WITH_BRACKETS>` and state where to find the real value.
- **Combining multiple failure modes.** A runbook that says "if X, do A; if Y, do B; if Z, do C", where X, Y, and Z are unrelated failures, is three runbooks. Split them and cross-link in Related Runbooks. Diagnostic branches within a single scenario are fine.
- **Missing rollback instructions.** If a step can make things worse, the engineer needs to know how to undo it. If it is irreversible, say so before they execute.
- **Stale commands.** Runbooks rot. Include a "last verified" date and flag commands that depend on specific tool versions.
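
The placeholder convention also makes it possible to refuse a half-filled command. This is a minimal sketch, assuming placeholders follow the `<ALL_CAPS_WITH_BRACKETS>` form; the host name in the usage line is made up.

```python
import re

# Matches marked placeholders such as <DB_HOST>; lowercase tags are ignored.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def fill_placeholders(command: str, values: dict[str, str]) -> str:
    """Substitute marked placeholders, refusing to return a half-filled command."""
    missing = sorted({name for name in PLACEHOLDER.findall(command) if name not in values})
    if missing:
        raise KeyError(f"no value supplied for: {', '.join(missing)}")
    return PLACEHOLDER.sub(lambda m: values[m.group(1)], command)


template = "psql -h <DB_HOST> -U readonly -c 'SELECT 1;'"
# The host below is a made-up example value.
print(fill_placeholders(template, {"DB_HOST": "db.example.internal"}))
# psql -h db.example.internal -U readonly -c 'SELECT 1;'
```

Values that are not marked in this form cannot be detected, which is why the convention has to be applied consistently when the runbook is written.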