operationsengineering

Runbook Writing

Write operational runbooks for the on-call engineer at 3am — step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.

runbooksoperationson-callincident-responseprocedures

Works well with agents

DevOps Engineer AgentIncident Commander AgentInfrastructure Engineer AgentRelease Manager AgentSRE Engineer AgentSupport Engineer Agent

Works well with skills

Disaster Recovery PlanIncident PostmortemRelease ChecklistTicket Writing
$ npx skills add The-AI-Directory-Company/(…) --skill runbook-writing
runbook-writing/
    • high-error-rate.md4.8 KB
  • SKILL.md5.5 KB
runbook-writing/examples/high-error-rate.md
high-error-rate.md
Markdown
1# Runbook: High API Error Rate — checkout-service
2 
3**Alert name**: `checkout-service-5xx-rate-high`
4**Last verified**: 2026-03-10
5 
6## Purpose
7 
8Diagnose and resolve elevated 5xx error rates on the checkout-service API. Use this runbook when the `checkout-service-5xx-rate-high` alert fires (threshold: > 1% of requests returning 5xx for 3 consecutive minutes).
9 
10## Prerequisites
11 
12- [ ] VPN connected to production network
13- [ ] `kubectl` access to `prod-us-east` cluster (role: `sre-oncall`)
14- [ ] Read access to Datadog dashboard: `https://app.datadoghq.com/dash/checkout-prod`
15- [ ] Database read-only credentials in 1Password vault `SRE-Prod` (entry: `checkout-db-readonly`)
16 
17## Symptoms and Triggers
18 
19- PagerDuty alert: `checkout-service-5xx-rate-high`
20- Datadog: `checkout-service` error rate panel turns red (> 1%)
21- Log pattern: `level=error msg="request failed" service=checkout status=500`
22- User reports: "Payment page shows an error" or "Checkout is broken"
23 
24## Step-by-Step Procedure
25 
261. Confirm the alert is real — open the Datadog dashboard and verify error rate:
27 ```
28 https://app.datadoghq.com/dash/checkout-prod
29 ```
30 - IF error rate < 1% and falling: monitor for 5 minutes. If it recovers, acknowledge the alert and close.
31 - IF error rate >= 1%: proceed to step 2.
32 
332. Check pod health:
34 ```bash
35 kubectl get pods -n checkout -l app=checkout-service
36 ```
37 - Expected: 6/6 pods in `Running` state, 0 restarts.
38 - IF pods are in `CrashLoopBackOff`: proceed to step 3.
39 - IF all pods are healthy: skip to step 4.
40 
413. Inspect crashing pod logs:
42 ```bash
43 kubectl logs -n checkout -l app=checkout-service --tail=100 | grep "level=error"
44 ```
45 - IF logs show `connection refused` to database: skip to step 5.
46 - IF logs show `OOMKilled`: restart the deployment and escalate to checkout-team.
47 ```bash
48 kubectl rollout restart deployment/checkout-service -n checkout
49 ```
50 - IF logs show a different error: escalate (see Escalation section).
51 
524. Check downstream dependency health:
53 ```bash
54 kubectl exec -n checkout deploy/checkout-service -- curl -s http://localhost:8080/healthz
55 ```
56 - Expected: `{"status":"ok","db":"connected","cache":"connected","payment_gateway":"connected"}`
57 - IF `db` shows `disconnected`: proceed to step 5.
58 - IF `payment_gateway` shows `disconnected`: this is a payment-gateway outage. Escalate to payments-team and switch to the `payment-gateway-outage` runbook.
59 - IF `cache` shows `disconnected`: proceed to step 6.
60 
615. Investigate database connectivity:
62 ```bash
63 psql -h <CHECKOUT_DB_HOST> -U readonly -d checkout -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'checkout';"
64 ```
65 - IF connection count > 90 (pool max is 100): kill idle connections:
66 ```sql
67 SELECT pg_terminate_backend(pid) FROM pg_stat_activity
68 WHERE datname = 'checkout' AND state = 'idle' AND query_start < now() - interval '5 minutes';
69 ```
70 - IF cannot connect at all: this is a database outage. Escalate to dba-team immediately.
71 
726. Check Redis cache:
73 ```bash
74 kubectl exec -n checkout deploy/checkout-service -- redis-cli -h <CACHE_HOST> ping
75 ```
76 - Expected: `PONG`
77 - IF no response: restart the cache connection by rolling the deployment:
78 ```bash
79 kubectl rollout restart deployment/checkout-service -n checkout
80 ```
81 
82## Verification
83 
84After taking corrective action, confirm resolution:
85 
86- [ ] Error rate < 0.5% for 5 consecutive minutes on Datadog dashboard
87- [ ] All 6 pods in `Running` state with 0 recent restarts
88- [ ] `/healthz` endpoint returns all dependencies `connected`
89- [ ] PagerDuty alert auto-resolves within 10 minutes
90 
91## Rollback
92 
93- **Step 3 (restart)**: No rollback needed — restart is non-destructive.
94- **Step 5 (kill connections)**: No rollback needed — application reconnects automatically.
95- **Step 6 (restart)**: If restart makes things worse, roll back to previous image:
96 ```bash
97 kubectl rollout undo deployment/checkout-service -n checkout
98 ```
99 
100## Escalation
101 
102Escalate if:
103- Issue is not resolved within 15 minutes of starting this runbook
104- Root cause is outside checkout-service (database, payment gateway, infrastructure)
105- You lack the required access or permissions
106 
107| Contact | Method | Fallback |
108|---------|--------|----------|
109| checkout-team | PagerDuty policy: `checkout-primary` | Slack: `#checkout-eng` |
110| dba-team | PagerDuty policy: `dba-oncall` | Slack: `#dba-support` |
111| payments-team | PagerDuty policy: `payments-primary` | Slack: `#payments-eng` |
112 
113If primary contact does not respond within 10 minutes, use the fallback channel.
114 
115## Related Runbooks
116 
117- `payment-gateway-outage` — When the payment provider is down
118- `checkout-db-connection-exhaustion` — Detailed database connection pool debugging
119- `checkout-service-high-latency` — When errors are low but response times are elevated
120 
AgentsSkillsCompaniesJobsForumBlogFAQAbout

©2026 ai-directory.company

·Privacy·Terms·Cookies·