operations, engineering

Runbook Writing

Write operational runbooks for the on-call engineer at 3am — step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.

runbooks, operations, on-call, incident-response, procedures

Works well with agents

DevOps Engineer Agent, Incident Commander Agent, Infrastructure Engineer Agent, Release Manager Agent, SRE Engineer Agent, Support Engineer Agent

Works well with skills

Disaster Recovery Plan, Incident Postmortem, Release Checklist, Ticket Writing

runbook-writing/SKILL.md

# Runbook Writing

## Before you start

Gather the following from the user:

1. What system or service does this cover? (Service name, what it does, where it runs)
2. What scenario triggers this runbook? (Alert name, error condition, user-reported symptom)
3. Who is the intended audience? (On-call generalist, team-specific engineer, external vendor)
4. What access or permissions are required? (SSH keys, cloud console roles, VPN, database credentials)
5. What is the blast radius if the procedure goes wrong? (Data loss risk, downtime scope, affected users)

If the user says "just write a runbook for the payments service," push back: "Which failure mode? A runbook covers one specific scenario. Database connection exhaustion is a different runbook than payment gateway timeouts."

## Runbook template

Use the following structure for every runbook:

### Purpose

One to two sentences stating what this runbook fixes and when to use it. Include the alert name or trigger condition verbatim so engineers can grep for it. Add a "last verified" date.
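
A "last verified" date only helps if something checks it. The following is a minimal sketch, assuming the runbook records the date on a line of the form `Last verified: YYYY-MM-DD`. That line format, the `is_stale` helper, and the 90-day default are illustrative assumptions, not part of the template.

```python
import re
from datetime import date, timedelta

# Assumed convention: the Purpose section contains "Last verified: YYYY-MM-DD".
LAST_VERIFIED = re.compile(r"last verified:\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE)


def is_stale(runbook_text: str, today: date, max_age_days: int = 90) -> bool:
    """Return True if the date is missing or older than max_age_days."""
    match = LAST_VERIFIED.search(runbook_text)
    if match is None:
        return True  # a runbook with no date at all is treated as stale
    verified = date.fromisoformat(match.group(1))
    return today - verified > timedelta(days=max_age_days)
```

Passing `today` explicitly, rather than calling `date.today()` inside the function, keeps the check reproducible in tests and CI.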

### Prerequisites

List every tool, permission, and access requirement. The on-call engineer should read this list and immediately know if they can execute the runbook or need to escalate. Example items: cloud console role, SSH/VPN access, specific dashboard URLs, PagerDuty escalation policy membership.

### Symptoms and Triggers

Describe observable signals: alert text, log patterns (exact searchable strings), dashboard anomalies, and user-reported behavior.

### Step-by-Step Procedure

Number every step. Each step must include:

- The exact command or UI action (copy-pasteable)
- Expected output so the engineer can confirm the step worked
- What to do if the output differs

Use decision branches with explicit if/then routing:

```
3. Check pg_stat_activity for connection state:
   $ psql -h <DB_HOST> -U readonly -d checkout -c "SELECT state, count(*)
     FROM pg_stat_activity WHERE datname = 'checkout' GROUP BY state;"
   - IF active queries > 40: proceed to step 4 (kill long-running queries).
   - IF idle connections > 40: skip to step 5 (restart service).
   - IF neither: escalate. The connection pool issue is not database-side.
```

Every branch must lead somewhere: a next step number, a different runbook, or an escalation. Never leave the engineer at a dead end.
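
The "no dead ends" rule can be checked mechanically once the branches are written down as data. The sketch below is one possible approach, assuming each step is recorded as a list of outcomes where an integer is a step number and a string is a terminal action such as an escalation or another runbook. The `find_dead_ends` helper and this representation are hypothetical.

```python
from typing import Union

# An outcome is either a step number or a terminal action written as text,
# for example "escalate" or "runbook: payment gateway timeouts".
Target = Union[int, str]


def find_dead_ends(steps: dict[int, list[Target]]) -> list[str]:
    """Report steps with no outcome and branches that point at missing steps."""
    problems = []
    for number, targets in sorted(steps.items()):
        if not targets:
            problems.append(f"step {number} has no outcome")
        for target in targets:
            if isinstance(target, int) and target not in steps:
                problems.append(f"step {number} points to missing step {target}")
    return problems


# Step 3 from the example above: two numbered branches and one escalation.
procedure = {3: [4, 5, "escalate"], 4: [5], 5: ["verify"]}
print(find_dead_ends(procedure))
```

An empty list means every branch resolves to a real step or a terminal action.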

### Verification Steps

Define how the engineer confirms resolution. Include specific metric thresholds and observation windows:

```
- [ ] Connection pool utilization < 70% for 5 consecutive minutes
- [ ] No new 500 errors in service logs for 5 minutes
- [ ] Monitoring alert auto-resolves
```
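
A threshold paired with an observation window is precise enough to evaluate in code. As a rough sketch, assuming the metric is sampled once per minute and the samples are already available as a list (how they are fetched is out of scope here), a hypothetical `held_below` helper could express the first checklist item:

```python
def held_below(samples: list[float], threshold: float, window: int) -> bool:
    """True if the most recent `window` samples are all strictly below threshold."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # not enough observations yet to call it resolved
    return all(value < threshold for value in samples[-window:])


# One sample per minute, oldest first; the last five are all under 70%.
utilization = [0.92, 0.61, 0.55, 0.48, 0.52, 0.50]
print(held_below(utilization, threshold=0.70, window=5))
```

Returning `False` when there are too few samples keeps the engineer from declaring resolution one minute after the fix.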

### Rollback

Describe how to undo the procedure if it makes things worse. Reference steps by number. If a step is irreversible, say so explicitly and state the safe alternative.

### Escalation

Specify when to escalate (time thresholds, permission gaps, out-of-scope root causes), who to contact (specific team or role, not "engineering"), and how (PagerDuty policy name, Slack channel, phone bridge). Include a fallback if the first contact does not respond within a stated time.
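
Writing the escalation path as an ordered chain makes the fallback timing explicit. The following is an illustrative sketch only: the `Contact` fields, the `who_to_page` helper, and the roles and wait times in the example are all hypothetical, and the placeholders stand in for real policy and bridge details.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Contact:
    role: str          # a specific team or role, not "engineering"
    channel: str       # how to reach them
    wait_minutes: int  # how long to wait before moving to the next contact


def who_to_page(chain: list[Contact], minutes_unacknowledged: int) -> Contact:
    """Return the contact to page after this many minutes without a response."""
    if not chain:
        raise ValueError("escalation chain must name at least one contact")
    elapsed = 0
    for contact in chain:
        elapsed += contact.wait_minutes
        if minutes_unacknowledged < elapsed:
            return contact
    return chain[-1]  # past every deadline: stay on the final fallback


chain = [
    Contact("Database on-call", "PagerDuty policy <POLICY_NAME>", 10),
    Contact("Engineering manager on duty", "Phone bridge <BRIDGE_NUMBER>", 15),
]
print(who_to_page(chain, 12).role)
```

With these example values, the first contact has ten minutes to respond before the fallback is paged.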

### Related Runbooks

Link to runbooks covering adjacent failure modes or downstream effects so the engineer can pivot quickly if the symptoms do not match this scenario.

## Quality checklist

Before delivering the runbook, verify:

- [ ] Every command is copy-pasteable, with no unmarked placeholder values
- [ ] Decision branches have explicit outcomes with next-step references for each path
- [ ] Prerequisites list every permission and tool needed before step 1
- [ ] Verification steps include specific thresholds and observation windows
- [ ] Escalation section names specific teams or roles with contact methods and time bounds
- [ ] The runbook covers exactly one failure scenario, not a general troubleshooting guide
- [ ] Someone unfamiliar with the service can follow the steps without asking questions
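
Most of this checklist needs human judgment, but the template structure itself can be checked automatically. The sketch below assumes the runbook is a Markdown file that uses the template's section names as headings. The `missing_sections` helper is hypothetical and checks only that the headings exist, not that their content is good.

```python
import re

# Section names taken from the runbook template.
REQUIRED_SECTIONS = [
    "Purpose",
    "Prerequisites",
    "Symptoms and Triggers",
    "Step-by-Step Procedure",
    "Verification Steps",
    "Rollback",
    "Escalation",
    "Related Runbooks",
]

# Matches ATX headings such as "### Purpose" at any level from 1 to 6.
HEADING = re.compile(r"^#{1,6}[ \t]+(.+?)[ \t]*$", re.MULTILINE)


def missing_sections(markdown: str) -> list[str]:
    """Return template sections that do not appear as a Markdown heading."""
    headings = {match.group(1) for match in HEADING.finditer(markdown)}
    return [name for name in REQUIRED_SECTIONS if name not in headings]


draft = "### Purpose\n\nFixes pool exhaustion.\n\n### Rollback\n\nRevert step 4.\n"
print(missing_sections(draft))
```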

## Common mistakes to avoid

- Writing for the expert, not the 3am responder. "Check the HPA" means nothing to a generalist on-call. Write "Check the Horizontal Pod Autoscaler: `kubectl get hpa -n checkout`" instead.
- Omitting expected output. Every command needs the expected result. Without it, the engineer cannot tell if the step succeeded or if they are looking at a new problem.
- Unmarked placeholders. If a command contains values the engineer must replace, use `<ALL_CAPS_WITH_BRACKETS>` and state where to find the real value.
- Combining multiple failure modes. A runbook that says "if X, do A; if Y, do B; if Z, do C", where X, Y, and Z are separate failure modes, is three runbooks. Split them and cross-link in Related Runbooks. Decision branches within a single failure scenario are fine.
- Missing rollback instructions. If a step can make things worse, the engineer needs to know how to undo it. If it is irreversible, say so before they execute.
- Stale commands. Runbooks rot. Include a "last verified" date and flag commands that depend on specific tool versions.
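
The placeholder rule above lends itself to a small lint. This is a minimal sketch, assuming placeholders follow the `<ALL_CAPS_WITH_BRACKETS>` convention and that the runbook author keeps a set of placeholder names whose source is already explained. The `undocumented_placeholders` helper and that set are hypothetical.

```python
import re

# Matches marked placeholders such as <DB_HOST> or <POLICY_NAME>.
PLACEHOLDER = re.compile(r"<([A-Z][A-Z0-9_]*)>")


def undocumented_placeholders(command: str, documented: set[str]) -> list[str]:
    """Return placeholders in the command that the runbook does not explain."""
    found = PLACEHOLDER.findall(command)
    # dict.fromkeys drops duplicates while keeping first-seen order.
    return [name for name in dict.fromkeys(found) if name not in documented]


command = "psql -h <DB_HOST> -U <DB_USER> -c 'SELECT 1;'"
print(undocumented_placeholders(command, documented={"DB_HOST"}))
```

A lint like this catches only marked placeholders that lack an explanation. Values that should have been marked but were not still need a human reviewer.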