Disaster Recovery Plan
Write disaster recovery plans with RPO/RTO targets, failover procedures, communication protocols, and testing schedules — ensuring business continuity when systems fail.
Tags: disaster-recovery, business-continuity, failover, RPO, RTO, resilience
disaster-recovery-plan/
SKILL.md
# Disaster Recovery Plan

## Before you start

Gather the following from the user before writing:

1. **What systems does this plan cover?** (Service names, data stores, and their business functions)
2. **What are the business-critical operations?** (Revenue-generating flows, regulatory obligations, customer-facing services)
3. **What is the acceptable data loss?** (RPO — Recovery Point Objective: can you lose 0 seconds, 5 minutes, 1 hour, or 24 hours of data?)
4. **What is the acceptable downtime?** (RTO — Recovery Time Objective: how long can the system be unavailable before business impact is severe?)
5. **What disaster scenarios must be covered?** (Region outage, database corruption, ransomware, vendor failure, physical site loss)

If the user says "write a DR plan for our app," push back: "Which failure scenario? A database corruption recovery is a different plan from a full region failover. Each scenario gets its own procedure with its own RPO/RTO targets."

## Disaster recovery plan template

### 1. Scope and objectives

State what this plan covers and what it does not. Define the specific systems, environments, and failure scenarios in scope. List any systems explicitly excluded and reference their separate DR plans if they exist.

Define recovery objectives for each system:

| System | RPO | RTO | Tier | Justification |
|---|---|---|---|---|
| Payment processing | 0 (zero data loss) | 15 minutes | Tier 1 | Revenue-critical, regulatory requirement |
| User database | 5 minutes | 30 minutes | Tier 1 | All services depend on auth |
| Analytics pipeline | 24 hours | 4 hours | Tier 2 | No revenue impact, can reprocess |
| Internal wiki | 24 hours | 48 hours | Tier 3 | Low urgency, daily backups sufficient |

Tier definitions:
- **Tier 1**: Restore first. Business stops without this system.
- **Tier 2**: Restore after Tier 1. Degraded operations are tolerable short-term.
- **Tier 3**: Restore last. No immediate business impact.
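
A tier table like this can also live next to the plan as data, so the restoration order is computed rather than remembered. A minimal sketch, using the illustrative systems from the table above (not real services):

```python
# Recovery objectives as data. Values mirror the illustrative
# table in this section, not a real environment.
SYSTEMS = [
    {"system": "Payment processing", "rpo": "0", "rto": "15 minutes", "tier": 1},
    {"system": "User database", "rpo": "5 minutes", "rto": "30 minutes", "tier": 1},
    {"system": "Analytics pipeline", "rpo": "24 hours", "rto": "4 hours", "tier": 2},
    {"system": "Internal wiki", "rpo": "24 hours", "rto": "48 hours", "tier": 3},
]

def restoration_order(systems):
    """Return system names sorted by tier: Tier 1 first, Tier 3 last."""
    return [s["system"] for s in sorted(systems, key=lambda s: s["tier"])]

print(restoration_order(SYSTEMS))
```

Because `sorted` is stable, systems within the same tier keep their listed order, so ties can be broken simply by how the table is written.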

### 2. Backup strategy

For each system, document:

- **Backup method**: Continuous replication, point-in-time snapshots, file-level backups
- **Backup frequency**: Real-time, every N minutes/hours, daily
- **Retention period**: How long backups are kept and the rotation schedule
- **Storage location**: Region, provider, and whether it is geographically separate from primary
- **Encryption**: At-rest and in-transit encryption standards
- **Verification**: How and how often backup integrity is tested (not just "we assume it works")

```
User database:
  Method: Continuous WAL replication to standby + daily full snapshot
  Frequency: Real-time replication; snapshots at 02:00 UTC daily
  Retention: 30 daily snapshots, 12 weekly snapshots
  Storage: AWS S3 us-west-2 (primary in us-east-1) — cross-region
  Encryption: AES-256 at rest, TLS 1.3 in transit
  Verification: Weekly automated restore test to staging; quarterly manual validation
```
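
Verification can include a freshness check against the RPO: if the newest backup is already older than the RPO window, the target is unmeetable before any disaster happens. A minimal sketch, with illustrative timestamps and a 24-hour RPO:

```python
from datetime import datetime, timedelta, timezone

def backup_within_rpo(last_backup, rpo, now=None):
    """True if the most recent backup is young enough to satisfy the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup <= rpo

# Illustrative check: daily snapshots against a 24-hour RPO.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)   # today's 02:00 UTC snapshot
stale = datetime(2024, 5, 30, 2, 0, tzinfo=timezone.utc)  # a missed snapshot day

print(backup_within_rpo(fresh, timedelta(hours=24), now))  # True
print(backup_within_rpo(stale, timedelta(hours=24), now))  # False
```

A check like this belongs in monitoring, so a silently failing backup job pages someone instead of surfacing during a disaster.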

### 3. Failover procedures

Write step-by-step procedures for each disaster scenario. Each procedure must include:

- **Detection**: How the failure is identified (monitoring alert, customer report, manual check)
- **Decision authority**: Who authorizes the failover (name/role, not "management")
- **Step-by-step execution**: Numbered steps with exact commands, expected outputs, and decision branches
- **Data validation**: How to confirm data integrity after failover
- **Traffic cutover**: How traffic is redirected to the recovery environment

Use the same step format as a runbook — copy-pasteable commands, expected output, and if/then branches at every decision point. Reference runbooks for detailed per-service procedures.
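
A single step in that format might look like the following sketch; the host, data directory, and timings are hypothetical:

```
Step 4: Promote the standby database (standby-db-1, hypothetical host)
  Run:     pg_ctl promote -D /var/lib/postgresql/data
  Expect:  "server promoting", then "database system is ready to
           accept connections" in the server log within 60 seconds
  If promoted:     continue to Step 5 (data validation)
  If not promoted: escalate to on-call DBA; do NOT cut over traffic
```

Every step should be executable under stress by someone who did not write the plan: exact command, observable success criterion, and an explicit branch for failure.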

### 4. Communication protocol

Define who is notified, when, and how:

| Audience | Channel | Timing | Message owner |
|---|---|---|---|
| Incident commander | PagerDuty | Immediate (automated) | Monitoring system |
| Engineering leadership | Slack #incidents | Within 5 minutes | Incident commander |
| Customer support | Email + Slack | Within 15 minutes | Comms lead |
| Affected customers | Status page + email | Within 30 minutes | Comms lead |
| Executive team | Email summary | Within 1 hour | Program owner |

Include message templates for customer-facing communications at each stage: initial acknowledgment, progress update, and resolution confirmation.
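
An initial acknowledgment template might look like this sketch, with angle-bracket placeholders filled in at send time:

```
Subject: [Status] Service disruption affecting <product>

We are investigating an issue affecting <affected functionality>
that began at <time UTC>. Customer data is <known state, e.g.
"not affected" or "under assessment">. Next update by <time UTC>,
or sooner if the situation changes.

Status page: <status page URL>
```

Pre-approving these templates with legal and support before an incident removes a slow review cycle from the critical path.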

### 5. Testing schedule

A plan that has never been tested is a hypothesis, not a plan. Define:

- **Tabletop exercises**: Quarterly walk-throughs of the plan with all stakeholders
- **Component tests**: Monthly restoration of individual backups to verify recoverability
- **Full failover drills**: Semi-annual or annual end-to-end failover to the recovery environment
- **Chaos engineering**: Ongoing injection of controlled failures in production (if applicable)

Each test must produce a written report documenting: what was tested, pass/fail per step, time to complete each phase, and issues discovered with remediation owners.
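
Phase timings from a drill report can be checked against the RTO mechanically rather than by eyeballing. A minimal sketch, with illustrative phase names and durations:

```python
from datetime import timedelta

def rto_met(phase_durations, rto):
    """True if the summed drill phases complete within the RTO target."""
    return sum(phase_durations.values(), timedelta()) <= rto

# Illustrative failover drill against a 30-minute RTO.
drill = {
    "detection": timedelta(minutes=4),
    "decision": timedelta(minutes=3),
    "promote standby": timedelta(minutes=9),
    "data validation": timedelta(minutes=6),
    "traffic cutover": timedelta(minutes=5),
}
print(rto_met(drill, timedelta(minutes=30)))  # True: 27 minutes total
```

Recording per-phase durations this way also shows which phase to optimize first when a drill misses its target.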

### 6. Plan maintenance

- **Review cadence**: Quarterly review or after any infrastructure change
- **Change triggers**: New system added, provider changed, RTO/RPO targets updated, post-incident findings
- **Version control and ownership**: Store in version control (not a wiki that silently drifts) with a named owner responsible for keeping it current

## Quality checklist

Before delivering the plan, verify:

- [ ] RPO and RTO are defined per system with business justification, not just technical preference
- [ ] Every system has a documented backup method, frequency, storage location, and verification process
- [ ] Failover procedures are step-by-step with commands, expected outputs, and decision authority
- [ ] Communication protocol specifies audience, channel, timing, and message owner — no gaps
- [ ] Testing schedule includes at least tabletop, component, and full failover tests with defined frequency
- [ ] Tier classifications are assigned and restoration order is explicit
- [ ] The plan names specific people or roles, not "the team" or "management"
- [ ] A maintenance owner and review cadence are defined

## Common mistakes

- **Setting RPO/RTO without business input.** Engineers pick technically convenient targets. The business must define how much downtime and data loss it can tolerate; engineering then designs to meet those targets.
- **Untested backups.** "We have daily backups" means nothing if you have never restored one. Backups that cannot be restored are not backups.
- **Single-region recovery storage.** Storing backups in the same region as production means a region outage destroys both. Cross-region or cross-provider storage is mandatory.
- **No communication plan.** Technical recovery without customer communication creates a second crisis. Customers who see downtime with no explanation lose trust faster than customers who get timely updates.
- **Plan lives in a wiki nobody reads.** If the plan is not tested regularly and updated after infrastructure changes, it will be wrong when you need it most. Treat it as a living document with a named owner.
- **Skipping decision authority.** In a crisis, "who decides to fail over?" cannot be an open question. Name the role and the backup if that person is unreachable.
| 121 |