# Disaster Recovery Plan: Multi-Region Failover — Storefront Application

## Scope

This plan covers full-region failover of the Storefront web application from **us-east-1** (primary) to **us-west-2** (secondary). In scope: API servers, PostgreSQL database, Redis cache, CDN origin, and background job workers.

**Not in scope**: Third-party payment gateway (covered by `payment-gateway-dr-plan`), analytics pipeline (Tier 3, restored separately).

## Recovery Objectives

| System | RPO | RTO | Tier |
|--------|-----|-----|------|
| PostgreSQL (orders, users) | 1 minute | 15 minutes | Tier 1 |
| API servers | 0 (stateless) | 10 minutes | Tier 1 |
| Redis cache | 30 minutes | 10 minutes | Tier 2 |
| Background job workers | 1 hour | 30 minutes | Tier 2 |
| CDN origin | 0 (static assets in S3) | 5 minutes | Tier 1 |

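The RTO column doubles as a set of deadlines once an incident is declared. A minimal sketch of turning the table into concrete Tier 1 deadlines from the detection timestamp (the helper name is hypothetical, not part of existing tooling):

```shell
#!/usr/bin/env bash
# Sketch: compute recovery deadlines from a detection time and the RTO table.
# rto_deadline DETECTION_EPOCH RTO_MINUTES -> deadline as epoch seconds
rto_deadline() {
  local detected=$1 rto_min=$2
  echo $(( detected + rto_min * 60 ))
}

detected=$(date +%s)   # incident detection time
echo "CDN origin deadline:  $(rto_deadline "$detected" 5)"
echo "API servers deadline: $(rto_deadline "$detected" 10)"
echo "PostgreSQL deadline:  $(rto_deadline "$detected" 15)"
```

Posting these deadlines in the incident channel at declaration time makes RTO breaches visible as they happen rather than in the retro.
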
## Backup Strategy

```
PostgreSQL:
  Method: Streaming replication to us-west-2 standby + hourly WAL archive to S3
  Lag target: < 30 seconds under normal load
  Retention: 7 days of WAL archives, 30 daily snapshots
  Storage: S3 us-west-2 (cross-region from primary)
  Encryption: AES-256 at rest, TLS 1.3 in transit
  Verification: Automated restore test every Sunday at 04:00 UTC; quarterly manual validation

Redis:
  Method: RDB snapshots every 30 minutes, replicated to us-west-2 S3
  Retention: 48 hours of snapshots
  Note: Cache can be rebuilt from database; snapshot is a warm-start optimization
```
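
The Sunday restore test silently loses value if the snapshot being restored is itself stale. A hedged sketch of a freshness gate that could run before the restore (the 26-hour window and function name are assumptions, not existing tooling):

```shell
#!/usr/bin/env bash
# Sketch: fail the weekly restore test early if the newest snapshot is stale.
# snapshot_fresh SNAPSHOT_EPOCH NOW_EPOCH MAX_AGE_HOURS -> prints FRESH or STALE
snapshot_fresh() {
  local snap=$1 now=$2 max_h=$3
  local age=$(( now - snap ))
  if [ "$age" -le $(( max_h * 3600 )) ]; then
    echo FRESH
  else
    echo STALE
  fi
}

# Daily snapshots, so anything older than ~26 hours means backups have stopped.
now=$(date +%s)
snapshot_fresh $(( now - 7200 )) "$now" 26   # 2-hour-old snapshot -> FRESH
```

A STALE result should page the backup owner immediately: a restore test that passes against week-old data proves nothing about current RPO.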

## Failover Procedure

**Detection**: CloudWatch alarm `region-health-us-east-1` fires when API success rate < 95% for 5 minutes. PagerDuty escalation: `storefront-critical`.

**Decision authority**: VP Engineering (@dthompson) or SRE Tech Lead (@rgarcia). Either can authorize failover. If neither is reachable within 10 minutes, the on-call SRE may proceed.

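Before authorizing failover, it helps to require several consecutive failed probes rather than trusting a single check. A sketch of that confirmation loop, where `probe()` stands in for the curl health check in step 1 of the procedure (probe count and interval are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: treat the alarm as a real outage only if every spaced probe fails.
probe() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
    https://api-east.storefront.internal/healthz
}

# confirm_outage ATTEMPTS -> OUTAGE if all probes fail, TRANSIENT otherwise
confirm_outage() {
  local attempts=${1:-3} failures=0
  for _ in $(seq "$attempts"); do
    [ "$(probe)" = "200" ] || failures=$(( failures + 1 ))
    sleep "${PROBE_INTERVAL:-30}"
  done
  if [ "$failures" -eq "$attempts" ]; then echo OUTAGE; else echo TRANSIENT; fi
}
```

A TRANSIENT result maps to the "monitor for 5 more minutes" branch below; OUTAGE maps to proceeding with the failover steps.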
### Steps

1. Confirm us-east-1 is down (not a transient blip):
   ```bash
   curl -s -o /dev/null -w "%{http_code}" https://api-east.storefront.internal/healthz
   ```
   - IF 200: false alarm. Monitor for 5 more minutes.
   - IF non-200 or timeout: proceed.

2. Verify us-west-2 standby database replication status:
   ```bash
   psql -h db-standby.us-west-2.internal -U sre_admin -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
   ```
   - Expected: lag < 60 seconds. Note the exact lag for the incident record.
   - IF lag > 1 minute: data loss exceeds the 1-minute RPO for PostgreSQL. Escalate to VP Engineering for go/no-go.

3. Promote standby database to primary:
   ```bash
   aws rds promote-read-replica --db-instance-identifier storefront-standby-west2
   ```
   - Wait for status to change to `available` (~2-5 minutes).

4. Update API server configuration to point to the new primary database:
   ```bash
   kubectl set env deployment/storefront-api -n storefront \
     DATABASE_URL="postgresql://app:$DB_PASS@db-primary.us-west-2.internal:5432/storefront" \
     --context=us-west-2
   ```

5. Verify API servers are healthy in us-west-2:
   ```bash
   kubectl get pods -n storefront --context=us-west-2 -l app=storefront-api
   ```

6. Switch DNS to us-west-2:
   ```bash
   aws route53 change-resource-record-sets --hosted-zone-id <ZONE_ID> --change-batch file://failover-dns-west2.json
   ```
   - DNS TTL is 60 seconds. Full propagation within 2-3 minutes.

7. Start background workers in us-west-2:
   ```bash
   kubectl scale deployment/storefront-workers -n storefront --replicas=4 --context=us-west-2
   ```

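Step 6 references `failover-dns-west2.json` without showing its contents. A plausible shape for that Route 53 change batch, written out via heredoc; the record name and the us-west-2 ALB hostname are assumptions, and the real file should live in the DR runbook repo:

```shell
#!/usr/bin/env bash
# Sketch: the change batch consumed by step 6. Record name and ALB target
# are hypothetical; TTL 60 matches the value stated in the procedure.
cat > failover-dns-west2.json <<'EOF'
{
  "Comment": "DR failover: point api.storefront.com at us-west-2",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.storefront.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "storefront-west2-alb.us-west-2.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
EOF
```

Keeping this file pre-generated and version-controlled avoids hand-editing JSON during an incident; `UPSERT` makes the change idempotent if the command is retried.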
## Data Validation (post-failover)

- [ ] Run `SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour'` — compare against last known metric
- [ ] Place a test order through the full checkout flow
- [ ] Verify user authentication works (login, session creation)
- [ ] Confirm background jobs are processing (check Sidekiq dashboard)

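The first checklist item can be scripted as a yes/no gate. A sketch under assumptions: `OBSERVED` would be fed from the psql count above, `BASELINE` from the last pre-incident dashboard reading, and the 50% tolerance is a guess to be tuned:

```shell
#!/usr/bin/env bash
# Sketch: sanity-check post-failover order volume against a baseline.
# orders_plausible OBSERVED BASELINE -> OK if observed is at least half
# of baseline (the outage itself suppresses some traffic).
orders_plausible() {
  local observed=$1 baseline=$2
  if [ $(( observed * 2 )) -ge "$baseline" ]; then
    echo OK
  else
    echo INVESTIGATE
  fi
}

orders_plausible 980 1200   # -> OK: healthy post-failover volume
orders_plausible 40 1200    # -> INVESTIGATE: order writes likely failing
```

An INVESTIGATE result after DNS cutover usually means the API pods are serving traffic but writes to the promoted database are failing, which the test-order checklist item will also surface.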
## Communication Protocol

| Audience | Channel | Timing | Owner |
|----------|---------|--------|-------|
| Incident commander | PagerDuty `storefront-critical` | Immediate | Automated |
| Engineering leadership | Slack #incidents | Within 5 min | Incident commander |
| Customer support | Slack #support-alerts + email template | Within 15 min | Comms lead |
| Customers | status.storefront.com + email | Within 20 min | Comms lead |
| Executive team | Email summary | Within 1 hour | VP Engineering |

## Testing Schedule

- **Tabletop exercise**: Quarterly (next: 2026-04-15), walk through this plan with all stakeholders
- **Database failover drill**: Semi-annual, promote standby and verify data integrity
- **Full failover drill**: Annual, complete DNS cutover to us-west-2 during a low-traffic window (Sunday 05:00 UTC)
- **Replication lag monitoring**: Continuous — alert if lag > 60 seconds

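The continuous lag alert keys off the interval that `now() - pg_last_xact_replay_timestamp()` returns (e.g. `00:00:23.118`). A sketch of the parsing and threshold logic only; the alerting hook is assumed, and the sketch assumes lag stays under a day (a day-long lag would already be a separate emergency):

```shell
#!/usr/bin/env bash
# Sketch: convert a Postgres HH:MM:SS interval to whole seconds and apply
# the 60-second threshold from the testing schedule above.
lag_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  s=${s%%.*}   # drop fractional seconds
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

check_lag() {
  local secs
  secs=$(lag_seconds "$1")
  if [ "$secs" -gt 60 ]; then echo "ALERT (${secs}s)"; else echo "OK (${secs}s)"; fi
}

check_lag "00:00:23.118"   # -> OK (23s)
check_lag "00:01:30"       # -> ALERT (90s)
```

The same threshold feeds the go/no-go check in step 2 of the failover procedure, so keeping the two values in one place avoids drift between monitoring and the runbook.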
## Plan Maintenance

- **Owner**: @rgarcia (SRE Tech Lead)
- **Review cadence**: Quarterly or after any infrastructure change
- **Last reviewed**: 2026-03-15
- **Next review**: 2026-06-15