Runbook Writing
Write operational runbooks for the on-call engineer at 3am — step-by-step procedures with decision trees, escalation paths, and rollback instructions that assume no prior context.
$ npx skills add The-AI-Directory-Company/(…) --skill runbook-writing

runbook-writing/high-error-rate.md
# Runbook: High API Error Rate on checkout-service

**Alert name**: `checkout-service-5xx-rate-high`
**Last verified**: 2026-03-10

## Purpose

Diagnose and resolve elevated 5xx error rates on the checkout-service API. Use this runbook when the `checkout-service-5xx-rate-high` alert fires (threshold: > 1% of requests returning 5xx for 3 consecutive minutes).

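The alert threshold can be read as a sliding-window check: each of the three most recent one-minute samples must exceed a 1% 5xx rate. The sketch below illustrates that arithmetic only. The function name `alert_would_fire` and the `<5xx count>/<total requests>` input format are hypothetical and are not how the monitoring system itself evaluates the alert.

```bash
# Hypothetical helper: arguments are per-minute samples written as
# "<5xx count>/<total requests>", oldest first. Succeeds (exit 0) when
# each of the three most recent minutes exceeds a 1% 5xx rate.
alert_would_fire() {
  [ "$#" -ge 3 ] || return 1            # fewer than 3 minutes of data
  printf '%s\n' "$@" | tail -n 3 |
    # errors * 100 <= total is the integer form of "rate <= 1%"
    awk -F/ '$2 == 0 || $1 * 100 <= $2 { below = 1 } END { exit below + 0 }'
}

if alert_would_fire 3/1000 15/1000 22/1000 18/1000; then
  echo "threshold met for 3 consecutive minutes"
fi
```

Comparing `errors * 100` against the total keeps the check in integer arithmetic, which avoids floating-point surprises right at the 1% boundary.
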
## Prerequisites

- [ ] VPN connected to production network
- [ ] `kubectl` access to `prod-us-east` cluster (role: `sre-oncall`)
- [ ] Read access to Datadog dashboard: `https://app.datadoghq.com/dash/checkout-prod`
- [ ] Database read-only credentials in 1Password vault `SRE-Prod` (entry: `checkout-db-readonly`)

## Symptoms and Triggers

- PagerDuty alert: `checkout-service-5xx-rate-high`
- Datadog: `checkout-service` error rate panel turns red (> 1%)
- Log pattern: `level=error msg="request failed" service=checkout status=500`
- User reports: "Payment page shows an error" or "Checkout is broken"

## Step-by-Step Procedure

1. Confirm the alert is real. Open the Datadog dashboard and verify the error rate:
   ```
   https://app.datadoghq.com/dash/checkout-prod
   ```
   - IF error rate < 1% and falling: monitor for 5 minutes. If it recovers, acknowledge the alert and close.
   - IF error rate >= 1%, or it has not recovered after 5 minutes: proceed to step 2.

2. Check pod health:
   ```bash
   kubectl get pods -n checkout -l app=checkout-service
   ```
   - Expected: 6/6 pods in `Running` state, 0 restarts.
   - IF pods are in `CrashLoopBackOff`: proceed to step 3.
   - IF all pods are healthy: skip to step 4.

3. Inspect crashing pod logs:
   ```bash
   kubectl logs -n checkout -l app=checkout-service --tail=100 | grep "level=error"
   ```
   - IF logs show `connection refused` to database: skip to step 5.
   - IF a pod was `OOMKilled` (reported under `Last State` in `kubectl describe pod <POD_NAME> -n checkout`, not in the application logs): restart the deployment and escalate to checkout-team.
     ```bash
     kubectl rollout restart deployment/checkout-service -n checkout
     ```
   - IF logs show a different error: escalate (see Escalation section).

4. Check downstream dependency health:
   ```bash
   kubectl exec -n checkout deploy/checkout-service -- curl -s http://localhost:8080/healthz
   ```
   - Expected: `{"status":"ok","db":"connected","cache":"connected","payment_gateway":"connected"}`
   - IF `db` shows `disconnected`: proceed to step 5.
   - IF `payment_gateway` shows `disconnected`: this is a payment-gateway outage. Escalate to payments-team and switch to the `payment-gateway-outage` runbook.
   - IF `cache` shows `disconnected`: proceed to step 6.
   - IF all dependencies show `connected` but the error rate is still elevated: escalate to checkout-team (see Escalation section).

5. Investigate database connectivity:
   ```bash
   psql -h <CHECKOUT_DB_HOST> -U readonly -d checkout -c "SELECT count(*) FROM pg_stat_activity WHERE datname = 'checkout';"
   ```
   - IF connection count > 90 (pool max is 100): kill idle connections. `pg_terminate_backend` only works if your role owns the target session, is a member of `pg_signal_backend`, or is a superuser; if the `readonly` role is refused, escalate to dba-team instead.
     ```sql
     SELECT pg_terminate_backend(pid) FROM pg_stat_activity
     WHERE datname = 'checkout' AND state = 'idle' AND query_start < now() - interval '5 minutes';
     ```
   - IF cannot connect at all: this is a database outage. Escalate to dba-team immediately.

6. Check Redis cache:
   ```bash
   kubectl exec -n checkout deploy/checkout-service -- redis-cli -h <CACHE_HOST> ping
   ```
   - Expected: `PONG`
   - IF no response: restart the cache connection by rolling the deployment:
     ```bash
     kubectl rollout restart deployment/checkout-service -n checkout
     ```

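The branches in steps 2 and 4 can be sketched as two small shell helpers that read command output and print which step comes next. This is an illustrative sketch, not part of the runbook's tooling: the function names `pod_triage` and `healthz_next_step`, the `unclear` and `escalate` fallbacks, and the assumption that `/healthz` always returns the flat JSON shape shown in step 4 are all hypothetical.

```bash
# Hypothetical helpers that mirror the decision points in steps 2 and 4.

# Reads `kubectl get pods --no-headers` output on stdin.
# Columns: NAME READY STATUS RESTARTS AGE
pod_triage() {
  awk '
    $3 == "CrashLoopBackOff" { crashing++ }
    $3 != "Running"          { not_running++ }
    END {
      if (crashing > 0) { print "step 3" }
      else if (NR > 0 && not_running == 0) { print "step 4" }
      else { print "unclear" }
    }'
}

# Takes the /healthz response body as its first argument.
healthz_next_step() {
  local body
  # Drop whitespace so the patterns below match pretty-printed JSON too.
  body=$(printf '%s' "$1" | tr -d ' \t\n')
  case "$body" in
    *'"db":"disconnected"'*)              echo "step 5" ;;
    *'"payment_gateway":"disconnected"'*) echo "payment-gateway-outage" ;;
    *'"cache":"disconnected"'*)           echo "step 6" ;;
    *)                                    echo "escalate" ;;
  esac
}
```

For example, `kubectl get pods -n checkout -l app=checkout-service --no-headers | pod_triage` prints `step 3` when any pod is in `CrashLoopBackOff`. The `--no-headers` flag matters here, because a header row would be counted as a pod that is not `Running`.
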
## Verification

After taking corrective action, confirm resolution:

- [ ] Error rate < 0.5% for 5 consecutive minutes on Datadog dashboard
- [ ] All 6 pods in `Running` state with 0 recent restarts
- [ ] `/healthz` endpoint returns all dependencies `connected`
- [ ] PagerDuty alert auto-resolves within 10 minutes

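The first checklist item is a sliding-window condition: the five most recent one-minute samples must all be below 0.5%. A minimal sketch, assuming you have copied per-minute error-rate percentages from the dashboard as plain numbers. The function name `error_rate_recovered` and the input format are hypothetical, not a Datadog feature.

```bash
# Hypothetical helper: succeeds (exit 0) only if the last five samples
# are all below 0.5. Arguments are per-minute error rates in percent,
# oldest first.
error_rate_recovered() {
  [ "$#" -ge 5 ] || return 1            # not enough data to confirm recovery
  printf '%s\n' "$@" | tail -n 5 |
    awk '$1 >= 0.5 { bad = 1 } END { exit bad + 0 }'
}

if error_rate_recovered 2.3 1.1 0.4 0.3 0.2 0.2 0.1; then
  echo "recovered"
else
  echo "still elevated"
fi
```

Requiring at least five samples means the check fails closed: with too little data it reports "still elevated" rather than declaring recovery early.
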
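## Rollback

- **Step 3 (restart)**: No rollback needed. The restart is non-destructive.
- **Step 5 (kill connections)**: No rollback needed. The application reconnects automatically.
- **Step 6 (restart)**: If the restart makes things worse, roll back to the previous image. A plain `kubectl rollout undo` immediately after `rollout restart` only returns to the pre-restart revision, which runs the same image, so choose the target revision explicitly from the history:
  ```bash
  kubectl rollout history deployment/checkout-service -n checkout
  kubectl rollout undo deployment/checkout-service -n checkout --to-revision=<REVISION>
  ```
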
## Escalation

Escalate if:
- Issue is not resolved within 15 minutes of starting this runbook
- Root cause is outside checkout-service (database, payment gateway, infrastructure)
- You lack the required access or permissions

| Contact | Method | Fallback |
|---------|--------|----------|
| checkout-team | PagerDuty policy: `checkout-primary` | Slack: `#checkout-eng` |
| dba-team | PagerDuty policy: `dba-oncall` | Slack: `#dba-support` |
| payments-team | PagerDuty policy: `payments-primary` | Slack: `#payments-eng` |

If primary contact does not respond within 10 minutes, use the fallback channel.

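The contact table can also be encoded as a lookup so the right paging policy is one command away. This is a sketch only: `escalation_contact` is a hypothetical helper name, and its values are copied from the table above, so it would need updating whenever the table changes.

```bash
# Hypothetical lookup of the escalation table. Prints
# "<PagerDuty policy> <Slack fallback>" for a team, or fails for
# an unknown team.
escalation_contact() {
  case "$1" in
    checkout-team) echo "checkout-primary #checkout-eng" ;;
    dba-team)      echo "dba-oncall #dba-support" ;;
    payments-team) echo "payments-primary #payments-eng" ;;
    *)             echo "unknown team: $1" >&2; return 1 ;;
  esac
}

# Example: split the result into the primary policy and the fallback channel.
read -r policy fallback <<EOF
$(escalation_contact dba-team)
EOF
echo "Page $policy; after 10 minutes with no response, post in $fallback"
```
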
## Related Runbooks

- `payment-gateway-outage`: when the payment provider is down
- `checkout-db-connection-exhaustion`: detailed database connection pool debugging
- `checkout-service-high-latency`: when errors are low but response times are elevated