
Disaster Recovery Plan

Write disaster recovery plans with RPO/RTO targets, failover procedures, communication protocols, and testing schedules — ensuring business continuity when systems fail.

# Disaster Recovery Plan: Multi-Region Failover — Storefront Application

## Scope

This plan covers full-region failover of the Storefront web application from **us-east-1** (primary) to **us-west-2** (secondary). In scope: API servers, PostgreSQL database, Redis cache, CDN origin, and background job workers.

**Not in scope**: Third-party payment gateway (covered by `payment-gateway-dr-plan`), analytics pipeline (Tier 3, restored separately).

## Recovery Objectives

| System | RPO | RTO | Tier |
|--------|-----|-----|------|
| PostgreSQL (orders, users) | 1 minute | 15 minutes | Tier 1 |
| API servers | 0 (stateless) | 10 minutes | Tier 1 |
| Redis cache | 30 minutes | 10 minutes | Tier 2 |
| Background job workers | 1 hour | 30 minutes | Tier 2 |
| CDN origin | 0 (static assets in S3) | 5 minutes | Tier 1 |

## Backup Strategy

```
PostgreSQL:
  Method: Streaming replication to us-west-2 standby + hourly WAL archive to S3
  Lag target: < 30 seconds under normal load
  Retention: 7 days of WAL archives, 30 daily snapshots
  Storage: S3 us-west-2 (cross-region from primary)
  Encryption: AES-256 at rest, TLS 1.3 in transit
  Verification: Automated restore test every Sunday at 04:00 UTC; quarterly manual validation

Redis:
  Method: RDB snapshots every 30 minutes, replicated to us-west-2 S3
  Retention: 48 hours of snapshots
  Note: Cache can be rebuilt from database; snapshot is a warm-start optimization
```
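The 1-minute RPO above only holds while replication lag stays under budget, so lag checks are worth automating. Below is a minimal sketch of the classification logic as a pure shell function; the thresholds mirror this plan, and the `psql` query shown in the comment (not executed here) is how the live lag would be obtained.

```shell
#!/bin/sh
# Sketch: classify observed replication lag against the PostgreSQL RPO.
# Pure function so it can be exercised without a live standby. In production,
# the lag in seconds would come from something like:
#   psql -h db-standby.us-west-2.internal -U sre_admin -tA \
#     -c "SELECT EXTRACT(EPOCH FROM now() - pg_last_xact_replay_timestamp())::int"

check_lag() {
  lag_s=$1   # observed replication lag, whole seconds
  rpo_s=$2   # RPO budget in seconds (60 for Tier 1 PostgreSQL)
  if [ "$lag_s" -le "$rpo_s" ]; then
    echo "OK: lag ${lag_s}s within RPO ${rpo_s}s"
  elif [ "$lag_s" -le 300 ]; then
    echo "WARN: lag ${lag_s}s exceeds RPO ${rpo_s}s"
  else
    echo "BREACH: lag ${lag_s}s, escalate for go/no-go"
  fi
}

check_lag 25 60
check_lag 90 60
check_lag 400 60
```

The 300-second escalation cutoff matches the go/no-go threshold used in the failover steps below; everything between the RPO and that cutoff is a warning, not an automatic abort.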

## Failover Procedure

**Detection**: CloudWatch alarm `region-health-us-east-1` fires when API success rate < 95% for 5 minutes. PagerDuty escalation: `storefront-critical`.

**Decision authority**: VP Engineering (@dthompson) or SRE Tech Lead (@rgarcia). Either can authorize failover. If neither is reachable within 10 minutes, the on-call SRE may proceed.

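For reference, the detection alarm could be defined roughly as follows. This is a sketch, not the live definition: the metric name `ApiSuccessRate`, the `Storefront/API` namespace, and the SNS topic ARN are assumptions for illustration; only the alarm name, 95% threshold, and 5-minute window come from this plan.

```bash
# Hypothetical definition of region-health-us-east-1: average of a custom
# per-region success-rate metric, evaluated over five 60-second periods.
aws cloudwatch put-metric-alarm \
  --alarm-name region-health-us-east-1 \
  --namespace "Storefront/API" \
  --metric-name ApiSuccessRate \
  --dimensions Name=Region,Value=us-east-1 \
  --statistic Average \
  --period 60 \
  --evaluation-periods 5 \
  --threshold 95 \
  --comparison-operator LessThanThreshold \
  --alarm-actions "arn:aws:sns:us-east-1:ACCOUNT_ID:storefront-critical"
```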
### Steps

1. Confirm us-east-1 is down (not a transient blip):

   ```bash
   curl -s -o /dev/null -w "%{http_code}" https://api-east.storefront.internal/healthz
   ```

   - IF 200: false alarm. Monitor for 5 more minutes.
   - IF non-200 or timeout: proceed.

2. Verify us-west-2 standby database replication status:

   ```bash
   psql -h db-standby.us-west-2.internal -U sre_admin \
     -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
   ```

   - Expected: lag < 60 seconds (within the 1-minute RPO). Note the exact lag for the incident record.
   - IF lag > 1 minute: the RPO is breached; record the expected data loss window.
   - IF lag > 5 minutes: escalate to VP Engineering for go/no-go.

3. Promote standby database to primary:

   ```bash
   aws rds promote-read-replica --db-instance-identifier storefront-standby-west2
   ```

   - Wait for instance status to change to `available` (~2-5 minutes).

4. Update API server configuration to point to the new primary database:

   ```bash
   kubectl set env deployment/storefront-api -n storefront \
     DATABASE_URL="postgresql://app:$DB_PASS@db-primary.us-west-2.internal:5432/storefront" \
     --context=us-west-2
   ```

5. Verify API servers are healthy in us-west-2:

   ```bash
   kubectl get pods -n storefront --context=us-west-2 -l app=storefront-api
   ```

6. Switch DNS to us-west-2:

   ```bash
   aws route53 change-resource-record-sets --hosted-zone-id <ZONE_ID> \
     --change-batch file://failover-dns-west2.json
   ```

   - DNS TTL is 60 seconds. Full propagation within 2-3 minutes.

7. Start background workers in us-west-2:

   ```bash
   kubectl scale deployment/storefront-workers -n storefront --replicas=4 --context=us-west-2
   ```

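The command steps above (promote, repoint, DNS cutover, scale workers) can be strung together as one ordered script. This is a minimal dry-run sketch, not a replacement for the procedure: it deliberately skips the human checks in steps 1, 2, and 5, and by default only prints what it would run. The `run` wrapper and the `DRY_RUN`/placeholder variables are assumptions for illustration.

```shell
#!/bin/sh
# Dry-run orchestration sketch for failover steps 3, 4, 6, and 7.
# Set DRY_RUN=0 only during a real failover or an authorized drill.
DRY_RUN=${DRY_RUN:-1}
ZONE_ID="${ZONE_ID:-<ZONE_ID>}"      # placeholder, as in the runbook
DB_PASS="${DB_PASS:-REDACTED}"       # never hardcode the real password

run() {
  # Print the command in dry-run mode; execute it otherwise.
  if [ "$DRY_RUN" -eq 1 ]; then
    echo "WOULD RUN: $*"
  else
    "$@"
  fi
}

# Step 3: promote the standby database.
run aws rds promote-read-replica --db-instance-identifier storefront-standby-west2

# Step 4: repoint API servers at the promoted primary.
run kubectl set env deployment/storefront-api -n storefront \
  DATABASE_URL="postgresql://app:${DB_PASS}@db-primary.us-west-2.internal:5432/storefront" \
  --context=us-west-2

# Step 6: cut DNS over to us-west-2.
run aws route53 change-resource-record-sets --hosted-zone-id "$ZONE_ID" \
  --change-batch file://failover-dns-west2.json

# Step 7: start background workers in the secondary region.
run kubectl scale deployment/storefront-workers -n storefront --replicas=4 --context=us-west-2
```

Keeping the script dry-run by default means it can double as a living checklist during tabletop exercises without touching production.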
## Data Validation (post-failover)

- [ ] Run `SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour'` — compare against last known metric
- [ ] Place a test order through the full checkout flow
- [ ] Verify user authentication works (login, session creation)
- [ ] Confirm background jobs are processing (check Sidekiq dashboard)

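The first checklist item can be made mechanical. Below is a sketch of the comparison, assuming a tolerated shortfall percentage (the 10% figure in the usage lines is an assumption, not from this plan): some deficit is expected after failover from the traffic dip plus up to one RPO of lost writes, but a large one suggests the restore is incomplete.

```shell
#!/bin/sh
# Sketch: compare the post-failover hourly order count against the last
# metric value recorded before the incident. In production, current would
# come from:
#   psql -h db-primary.us-west-2.internal -tA \
#     -c "SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour'"

validate_order_count() {
  current=$1          # count observed on the promoted primary
  last_known=$2       # last pre-incident value of the hourly-orders metric
  max_deficit_pct=$3  # tolerated shortfall, in percent
  deficit=$(( (last_known - current) * 100 / last_known ))
  if [ "$deficit" -le "$max_deficit_pct" ]; then
    echo "PASS: deficit ${deficit}% within ${max_deficit_pct}% tolerance"
  else
    echo "FAIL: deficit ${deficit}% exceeds ${max_deficit_pct}% tolerance"
  fi
}

validate_order_count 950 1000 10
validate_order_count 800 1000 10
```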
## Communication Protocol

| Audience | Channel | Timing | Owner |
|----------|---------|--------|-------|
| Incident commander | PagerDuty `storefront-critical` | Immediate | Automated |
| Engineering leadership | Slack #incidents | Within 5 min | Incident commander |
| Customer support | Slack #support-alerts + email template | Within 15 min | Comms lead |
| Customers | status.storefront.com + email | Within 20 min | Comms lead |
| Executive team | Email summary | Within 1 hour | VP Engineering |

## Testing Schedule

- **Tabletop exercise**: Quarterly (next: 2026-04-15), walk through this plan with all stakeholders
- **Database failover drill**: Semi-annual, promote standby and verify data integrity
- **Full failover drill**: Annual, complete DNS cutover to us-west-2 during low-traffic window (Sunday 05:00 UTC)
- **Replication lag monitoring**: Continuous — alert if lag > 60 seconds

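The automated restore test (every Sunday at 04:00 UTC, per the backup strategy) could be scheduled with a cron entry along these lines; the script path and log location are illustrative assumptions, not part of this plan.

```
# min hour dom mon dow (04:00 UTC every Sunday, dow 0)
0 4 * * 0  /opt/sre/bin/pg-restore-test.sh >> /var/log/pg-restore-test.log 2>&1
```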
## Plan Maintenance

- **Owner**: @rgarcia (SRE Tech Lead)
- **Review cadence**: Quarterly or after any infrastructure change
- **Last reviewed**: 2026-03-15
- **Next review**: 2026-06-15