# Disaster Recovery Plan: Multi-Region Failover — Storefront Application

## Scope

This plan covers full-region failover of the Storefront web application from **us-east-1** (primary) to **us-west-2** (secondary). In scope: API servers, PostgreSQL database, Redis cache, CDN origin, and background job workers.

**Not in scope**: Third-party payment gateway (covered by `payment-gateway-dr-plan`), analytics pipeline (Tier 3, restored separately).

## Recovery Objectives

| System | RPO | RTO | Tier |
|--------|-----|-----|------|
| PostgreSQL (orders, users) | 1 minute | 15 minutes | Tier 1 |
| API servers | 0 (stateless) | 10 minutes | Tier 1 |
| Redis cache | 30 minutes | 10 minutes | Tier 2 |
| Background job workers | 1 hour | 30 minutes | Tier 2 |
| CDN origin | 0 (static assets in S3) | 5 minutes | Tier 1 |

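The RTO column doubles as a set of deadlines once an incident is declared. A minimal sketch of turning the table into concrete Tier 1 deadlines from the detection timestamp (the helper name is hypothetical, not part of existing tooling):

```shell
#!/usr/bin/env bash
# Sketch: compute recovery deadlines from a detection time and the RTO table.
# rto_deadline DETECTION_EPOCH RTO_MINUTES -> deadline as epoch seconds
rto_deadline() {
  local detected=$1 rto_min=$2
  echo $(( detected + rto_min * 60 ))
}

detected=$(date +%s)   # incident detection time
echo "CDN origin deadline:  $(rto_deadline "$detected" 5)"
echo "API servers deadline: $(rto_deadline "$detected" 10)"
echo "PostgreSQL deadline:  $(rto_deadline "$detected" 15)"
```

Posting these deadlines in the incident channel at declaration time makes RTO breaches visible as they happen rather than in the retro.
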
## Backup Strategy

```
PostgreSQL:
  Method: Streaming replication to us-west-2 standby + hourly WAL archive to S3
  Lag target: < 30 seconds under normal load
  Retention: 7 days of WAL archives, 30 daily snapshots
  Storage: S3 us-west-2 (cross-region from primary)
  Encryption: AES-256 at rest, TLS 1.3 in transit
  Verification: Automated restore test every Sunday at 04:00 UTC; quarterly manual validation

Redis:
  Method: RDB snapshots every 30 minutes, replicated to us-west-2 S3
  Retention: 48 hours of snapshots
  Note: Cache can be rebuilt from database; snapshot is a warm-start optimization
```
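
The Sunday restore test silently loses value if the snapshot being restored is itself stale. A hedged sketch of a freshness gate that could run before the restore (the 26-hour window and function name are assumptions, not existing tooling):

```shell
#!/usr/bin/env bash
# Sketch: fail the weekly restore test early if the newest snapshot is stale.
# snapshot_fresh SNAPSHOT_EPOCH NOW_EPOCH MAX_AGE_HOURS -> prints FRESH or STALE
snapshot_fresh() {
  local snap=$1 now=$2 max_h=$3
  local age=$(( now - snap ))
  if [ "$age" -le $(( max_h * 3600 )) ]; then
    echo FRESH
  else
    echo STALE
  fi
}

# Daily snapshots, so anything older than ~26 hours means backups have stopped.
now=$(date +%s)
snapshot_fresh $(( now - 7200 )) "$now" 26   # 2-hour-old snapshot -> FRESH
```

A STALE result should page the backup owner immediately: a restore test that passes against week-old data proves nothing about current RPO.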

## Failover Procedure

**Detection**: CloudWatch alarm `region-health-us-east-1` fires when API success rate < 95% for 5 minutes. PagerDuty escalation: `storefront-critical`.

**Decision authority**: VP Engineering (@dthompson) or SRE Tech Lead (@rgarcia). Either can authorize failover. If neither is reachable within 10 minutes, the on-call SRE may proceed.

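Before authorizing failover, it helps to require several consecutive failed probes rather than trusting a single check. A sketch of that confirmation loop, where `probe()` stands in for the curl health check in step 1 of the procedure (probe count and interval are assumptions):

```shell
#!/usr/bin/env bash
# Sketch: treat the alarm as a real outage only if every spaced probe fails.
probe() {
  curl -s -o /dev/null -w "%{http_code}" --max-time 5 \
    https://api-east.storefront.internal/healthz
}

# confirm_outage ATTEMPTS -> OUTAGE if all probes fail, TRANSIENT otherwise
confirm_outage() {
  local attempts=${1:-3} failures=0
  for _ in $(seq "$attempts"); do
    [ "$(probe)" = "200" ] || failures=$(( failures + 1 ))
    sleep "${PROBE_INTERVAL:-30}"
  done
  if [ "$failures" -eq "$attempts" ]; then echo OUTAGE; else echo TRANSIENT; fi
}
```

A TRANSIENT result maps to the "monitor for 5 more minutes" branch below; OUTAGE maps to proceeding with the failover steps.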
### Steps

1. Confirm us-east-1 is down (not a transient blip):
   ```bash
   curl -s -o /dev/null -w "%{http_code}" https://api-east.storefront.internal/healthz
   ```
   - IF 200: false alarm. Monitor for 5 more minutes.
   - IF non-200 or timeout: proceed.

2. Verify us-west-2 standby database replication status:
   ```bash
   psql -h db-standby.us-west-2.internal -U sre_admin -c "SELECT now() - pg_last_xact_replay_timestamp() AS replication_lag;"
   ```
   - Expected: lag < 60 seconds. Note the exact lag for the incident record.
   - IF lag > 1 minute: data loss exceeds the 1-minute RPO for PostgreSQL. Escalate to VP Engineering for go/no-go.

3. Promote standby database to primary:
   ```bash
   aws rds promote-read-replica --db-instance-identifier storefront-standby-west2
   ```
   - Wait for status to change to `available` (~2-5 minutes).

4. Update API server configuration to point to the new primary database:
   ```bash
   kubectl set env deployment/storefront-api -n storefront \
     DATABASE_URL="postgresql://app:$DB_PASS@db-primary.us-west-2.internal:5432/storefront" \
     --context=us-west-2
   ```

5. Verify API servers are healthy in us-west-2:
   ```bash
   kubectl get pods -n storefront --context=us-west-2 -l app=storefront-api
   ```

6. Switch DNS to us-west-2:
   ```bash
   aws route53 change-resource-record-sets --hosted-zone-id <ZONE_ID> --change-batch file://failover-dns-west2.json
   ```
   - DNS TTL is 60 seconds. Full propagation within 2-3 minutes.

7. Start background workers in us-west-2:
   ```bash
   kubectl scale deployment/storefront-workers -n storefront --replicas=4 --context=us-west-2
   ```

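Step 6 references `failover-dns-west2.json` without showing its contents. A plausible shape for that Route 53 change batch, written out via heredoc; the record name and the us-west-2 ALB hostname are assumptions, and the real file should live in the DR runbook repo:

```shell
#!/usr/bin/env bash
# Sketch: the change batch consumed by step 6. Record name and ALB target
# are hypothetical; TTL 60 matches the value stated in the procedure.
cat > failover-dns-west2.json <<'EOF'
{
  "Comment": "DR failover: point api.storefront.com at us-west-2",
  "Changes": [
    {
      "Action": "UPSERT",
      "ResourceRecordSet": {
        "Name": "api.storefront.com",
        "Type": "CNAME",
        "TTL": 60,
        "ResourceRecords": [
          { "Value": "storefront-west2-alb.us-west-2.elb.amazonaws.com" }
        ]
      }
    }
  ]
}
EOF
```

Keeping this file pre-generated and version-controlled avoids hand-editing JSON during an incident; `UPSERT` makes the change idempotent if the command is retried.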
## Data Validation (post-failover)

- [ ] Run `SELECT count(*) FROM orders WHERE created_at > now() - interval '1 hour'` — compare against last known metric
- [ ] Place a test order through the full checkout flow
- [ ] Verify user authentication works (login, session creation)
- [ ] Confirm background jobs are processing (check Sidekiq dashboard)

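The first checklist item can be scripted as a yes/no gate. A sketch under assumptions: `OBSERVED` would be fed from the psql count above, `BASELINE` from the last pre-incident dashboard reading, and the 50% tolerance is a guess to be tuned:

```shell
#!/usr/bin/env bash
# Sketch: sanity-check post-failover order volume against a baseline.
# orders_plausible OBSERVED BASELINE -> OK if observed is at least half
# of baseline (the outage itself suppresses some traffic).
orders_plausible() {
  local observed=$1 baseline=$2
  if [ $(( observed * 2 )) -ge "$baseline" ]; then
    echo OK
  else
    echo INVESTIGATE
  fi
}

orders_plausible 980 1200   # -> OK: healthy post-failover volume
orders_plausible 40 1200    # -> INVESTIGATE: order writes likely failing
```

An INVESTIGATE result after DNS cutover usually means the API pods are serving traffic but writes to the promoted database are failing, which the test-order checklist item will also surface.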
## Communication Protocol

| Audience | Channel | Timing | Owner |
|----------|---------|--------|-------|
| Incident commander | PagerDuty `storefront-critical` | Immediate | Automated |
| Engineering leadership | Slack #incidents | Within 5 min | Incident commander |
| Customer support | Slack #support-alerts + email template | Within 15 min | Comms lead |
| Customers | status.storefront.com + email | Within 20 min | Comms lead |
| Executive team | Email summary | Within 1 hour | VP Engineering |

## Testing Schedule

- **Tabletop exercise**: Quarterly (next: 2026-04-15), walk through this plan with all stakeholders
- **Database failover drill**: Semi-annual, promote standby and verify data integrity
- **Full failover drill**: Annual, complete DNS cutover to us-west-2 during a low-traffic window (Sunday 05:00 UTC)
- **Replication lag monitoring**: Continuous — alert if lag > 60 seconds

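The continuous lag alert keys off the interval that `now() - pg_last_xact_replay_timestamp()` returns (e.g. `00:00:23.118`). A sketch of the parsing and threshold logic only; the alerting hook is assumed, and the sketch assumes lag stays under a day (a day-long lag would already be a separate emergency):

```shell
#!/usr/bin/env bash
# Sketch: convert a Postgres HH:MM:SS interval to whole seconds and apply
# the 60-second threshold from the testing schedule above.
lag_seconds() {
  local h m s
  IFS=: read -r h m s <<< "$1"
  s=${s%%.*}   # drop fractional seconds
  echo $(( 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

check_lag() {
  local secs
  secs=$(lag_seconds "$1")
  if [ "$secs" -gt 60 ]; then echo "ALERT (${secs}s)"; else echo "OK (${secs}s)"; fi
}

check_lag "00:00:23.118"   # -> OK (23s)
check_lag "00:01:30"       # -> ALERT (90s)
```

The same threshold feeds the go/no-go check in step 2 of the failover procedure, so keeping the two values in one place avoids drift between monitoring and the runbook.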
## Plan Maintenance

- **Owner**: @rgarcia (SRE Tech Lead)
- **Review cadence**: Quarterly or after any infrastructure change
- **Last reviewed**: 2026-03-15
- **Next review**: 2026-06-15