operationsengineering

Runbook Writing

Write operational runbooks for the on-call engineer at 3am — step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.

runbooksoperationson-callincident-responseprocedures

Works well with agents

DevOps Engineer Agent Incident Commander Agent Infrastructure Engineer Agent Release Manager Agent SRE Engineer Agent Support Engineer Agent

Works well with skills

Disaster Recovery Plan Incident Postmortem Release Checklist Ticket Writing

$ npx skills add The-AI-Directory-Company/(…) --skill runbook-writing

runbook-writing/

high-error-rate.md

Markdown

1	# Runbook: High API Error Rate — checkout-service
2
3	Alert name: `checkout-service-5xx-rate-high`
4	Last verified: 2026-03-10
5
6	## Purpose
7
8	Diagnose and resolve elevated 5xx error rates on the checkout-service API. Use this runbook when the `checkout-service-5xx-rate-high` alert fires (threshold: > 1% of requests returning 5xx for 3 consecutive minutes).
9
10	## Prerequisites
11
12	- [ ] VPN connected to production network
13	- [ ] `kubectl` access to `prod-us-east` cluster (role: `sre-oncall`)
14	- [ ] Read access to Datadog dashboard: `https://app.datadoghq.com/dash/checkout-prod`
15	- [ ] Database read-only credentials in 1Password vault `SRE-Prod` (entry: `checkout-db-readonly`)
16
17	## Symptoms and Triggers
18
19	- PagerDuty alert: `checkout-service-5xx-rate-high`
20	- Datadog: `checkout-service` error rate panel turns red (> 1%)
21	- Log pattern: `level=error msg="request failed" service=checkout status=500`
22	- User reports: "Payment page shows an error" or "Checkout is broken"
23
24	## Step-by-Step Procedure
25
26	1. Confirm the alert is real — open the Datadog dashboard and verify error rate:
27	```
28	https://app.datadoghq.com/dash/checkout-prod
29	```
30	- IF error rate < 1% and falling: monitor for 5 minutes. If it recovers, acknowledge the alert and close.
31	- IF error rate >= 1%: proceed to step 2.
32
33	2. Check pod health:
34	```bash
35	kubectl get pods -n checkout -l app=checkout-service
36	```
37	- Expected: 6/6 pods in `Running` state, 0 restarts.
38	- IF pods are in `CrashLoopBackOff`: proceed to step 3.
39	- IF all pods are healthy: skip to step 4.
40
41	3. Inspect crashing pod logs:
42	```bash
43	kubectl logs -n checkout -l app=checkout-service --tail=100 \| grep "level=error"
44	```
45	- IF logs show `connection refused` to database: skip to step 5.
46	- IF logs show `OOMKilled`: restart the deployment and escalate to checkout-team.
47	```bash
48	kubectl rollout restart deployment/checkout-service -n checkout
49	```
50	- IF logs show a different error: escalate (see Escalation section).
51
52	4. Check downstream dependency health:
53	```bash
54	kubectl exec -n checkout deploy/checkout-service -- curl -s http://localhost:8080/healthz
55	```
56	- Expected: `{"status":"ok","db":"connected","cache":"connected","payment_gateway":"connected"}`
57	- IF `db` shows `disconnected`: proceed to step 5.
58	- IF `payment_gateway` shows `disconnected`: this is a payment-gateway outage. Escalate to payments-team and switch to the `payment-gateway-outage` runbook.
59	- IF `cache` shows `disconnected`: proceed to step 6.
60
61	5. Investigate database connectivity:
62	```bash
63	psql -h <CHECKOUT_DB_HOST> -U readonly -d checkout -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'checkout';"
64	```
65	- IF connection count > 90 (pool max is 100): kill idle connections:
66	```sql
67	SELECT pg_terminate_backend(pid) FROM pg_stat_activity
68	WHERE datname = 'checkout' AND state = 'idle' AND query_start < now() - interval '5 minutes';
69	```
70	- IF cannot connect at all: this is a database outage. Escalate to dba-team immediately.
71
72	6. Check Redis cache:
73	```bash
74	kubectl exec -n checkout deploy/checkout-service -- redis-cli -h <CACHE_HOST> ping
75	```
76	- Expected: `PONG`
77	- IF no response: restart the cache connection by rolling the deployment:
78	```bash
79	kubectl rollout restart deployment/checkout-service -n checkout
80	```
81
82	## Verification
83
84	After taking corrective action, confirm resolution:
85
86	- [ ] Error rate < 0.5% for 5 consecutive minutes on Datadog dashboard
87	- [ ] All 6 pods in `Running` state with 0 recent restarts
88	- [ ] `/healthz` endpoint returns all dependencies `connected`
89	- [ ] PagerDuty alert auto-resolves within 10 minutes
90
91	## Rollback
92
93	- Step 3 (restart): No rollback needed — restart is non-destructive.
94	- Step 5 (kill connections): No rollback needed — application reconnects automatically.
95	- Step 6 (restart): If restart makes things worse, roll back to previous image:
96	```bash
97	kubectl rollout undo deployment/checkout-service -n checkout
98	```
99
100	## Escalation
101
102	Escalate if:
103	- Issue is not resolved within 15 minutes of starting this runbook
104	- Root cause is outside checkout-service (database, payment gateway, infrastructure)
105	- You lack the required access or permissions
106
107	\| Contact \| Method \| Fallback \|
108	\|---------\|--------\|----------\|
109	\| checkout-team \| PagerDuty policy: `checkout-primary` \| Slack: `#checkout-eng` \|
110	\| dba-team \| PagerDuty policy: `dba-oncall` \| Slack: `#dba-support` \|
111	\| payments-team \| PagerDuty policy: `payments-primary` \| Slack: `#payments-eng` \|
112
113	If primary contact does not respond within 10 minutes, use the fallback channel.
114
115	## Related Runbooks
116
117	- `payment-gateway-outage` — When the payment provider is down
118	- `checkout-db-connection-exhaustion` — Detailed database connection pool debugging
119	- `checkout-service-high-latency` — When errors are low but response times are elevated
120