Disaster Recovery Plan
Write disaster recovery plans with RPO/RTO targets, failover procedures, communication protocols, and testing schedules — ensuring business continuity when systems fail.
Tags: disaster-recovery, business-continuity, failover, RPO, RTO, resilience
disaster-recovery-plan/
SKILL.md
# Disaster Recovery Plan

## Before you start

Gather the following from the user before writing:

1. **What systems does this plan cover?** (Service names, data stores, and their business functions)
2. **What are the business-critical operations?** (Revenue-generating flows, regulatory obligations, customer-facing services)
3. **What is the acceptable data loss?** (RPO — Recovery Point Objective: can you lose 0 seconds, 5 minutes, 1 hour, or 24 hours of data?)
4. **What is the acceptable downtime?** (RTO — Recovery Time Objective: how long can the system be unavailable before business impact is severe?)
5. **What disaster scenarios must be covered?** (Region outage, database corruption, ransomware, vendor failure, physical site loss)

If the user says "write a DR plan for our app," push back: "Which failure scenario? A database corruption recovery is a different plan from a full region failover. Each scenario gets its own procedure with its own RPO/RTO targets."

## Disaster recovery plan template

### 1. Scope and objectives

State what this plan covers and what it does not. Define the specific systems, environments, and failure scenarios in scope. List any systems explicitly excluded and reference their separate DR plans if they exist.

Define recovery objectives for each system:

| System | RPO | RTO | Tier | Justification |
|---|---|---|---|---|
| Payment processing | 0 (zero data loss) | 15 minutes | Tier 1 | Revenue-critical, regulatory requirement |
| User database | 5 minutes | 30 minutes | Tier 1 | All services depend on auth |
| Analytics pipeline | 24 hours | 4 hours | Tier 2 | No revenue impact, can reprocess |
| Internal wiki | 24 hours | 48 hours | Tier 3 | Low urgency, daily backups sufficient |

Tier definitions:
- **Tier 1**: Restore first. Business stops without this system.
- **Tier 2**: Restore after Tier 1. Degraded operations are tolerable short-term.
- **Tier 3**: Restore last. No immediate business impact.
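
A tier table like this can also live next to the plan as data, so the restoration order is computed rather than remembered. A minimal sketch, using the illustrative systems from the table above (not real services):

```python
# Recovery objectives as data. Values mirror the illustrative
# table in this section, not a real environment.
SYSTEMS = [
    {"system": "Payment processing", "rpo": "0", "rto": "15 minutes", "tier": 1},
    {"system": "User database", "rpo": "5 minutes", "rto": "30 minutes", "tier": 1},
    {"system": "Analytics pipeline", "rpo": "24 hours", "rto": "4 hours", "tier": 2},
    {"system": "Internal wiki", "rpo": "24 hours", "rto": "48 hours", "tier": 3},
]

def restoration_order(systems):
    """Return system names sorted by tier: Tier 1 first, Tier 3 last."""
    return [s["system"] for s in sorted(systems, key=lambda s: s["tier"])]

print(restoration_order(SYSTEMS))
```

Because `sorted` is stable, systems within the same tier keep their listed order, so ties can be broken simply by how the table is written.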

### 2. Backup strategy

For each system, document:

- **Backup method**: Continuous replication, point-in-time snapshots, file-level backups
- **Backup frequency**: Real-time, every N minutes/hours, daily
- **Retention period**: How long backups are kept and the rotation schedule
- **Storage location**: Region, provider, and whether it is geographically separate from primary
- **Encryption**: At-rest and in-transit encryption standards
- **Verification**: How and how often backup integrity is tested (not just "we assume it works")

```
User database:
  Method: Continuous WAL replication to standby + daily full snapshot
  Frequency: Real-time replication; snapshots at 02:00 UTC daily
  Retention: 30 daily snapshots, 12 weekly snapshots
  Storage: AWS S3 us-west-2 (primary in us-east-1) — cross-region
  Encryption: AES-256 at rest, TLS 1.3 in transit
  Verification: Weekly automated restore test to staging; quarterly manual validation
```
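
Verification can include a freshness check against the RPO: if the newest backup is already older than the RPO window, the target is unmeetable before any disaster happens. A minimal sketch, with illustrative timestamps and a 24-hour RPO:

```python
from datetime import datetime, timedelta, timezone

def backup_within_rpo(last_backup, rpo, now=None):
    """True if the most recent backup is young enough to satisfy the RPO."""
    now = now or datetime.now(timezone.utc)
    return now - last_backup <= rpo

# Illustrative check: daily snapshots against a 24-hour RPO.
now = datetime(2024, 6, 1, 12, 0, tzinfo=timezone.utc)
fresh = datetime(2024, 6, 1, 2, 0, tzinfo=timezone.utc)   # today's 02:00 UTC snapshot
stale = datetime(2024, 5, 30, 2, 0, tzinfo=timezone.utc)  # a missed snapshot day

print(backup_within_rpo(fresh, timedelta(hours=24), now))  # True
print(backup_within_rpo(stale, timedelta(hours=24), now))  # False
```

A check like this belongs in monitoring, so a silently failing backup job pages someone instead of surfacing during a disaster.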

### 3. Failover procedures

Write step-by-step procedures for each disaster scenario. Each procedure must include:

- **Detection**: How the failure is identified (monitoring alert, customer report, manual check)
- **Decision authority**: Who authorizes the failover (name/role, not "management")
- **Step-by-step execution**: Numbered steps with exact commands, expected outputs, and decision branches
- **Data validation**: How to confirm data integrity after failover
- **Traffic cutover**: How traffic is redirected to the recovery environment

Use the same step format as a runbook — copy-pasteable commands, expected output, and if/then branches at every decision point. Reference runbooks for detailed per-service procedures.
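
A single step in that format might look like the following sketch; the host, data directory, and timings are hypothetical:

```
Step 4: Promote the standby database (standby-db-1, hypothetical host)
  Run:     pg_ctl promote -D /var/lib/postgresql/data
  Expect:  "server promoting", then "database system is ready to
           accept connections" in the server log within 60 seconds
  If promoted:     continue to Step 5 (data validation)
  If not promoted: escalate to on-call DBA; do NOT cut over traffic
```

Every step should be executable under stress by someone who did not write the plan: exact command, observable success criterion, and an explicit branch for failure.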

### 4. Communication protocol

Define who is notified, when, and how:

| Audience | Channel | Timing | Message owner |
|---|---|---|---|
| Incident commander | PagerDuty | Immediate (automated) | Monitoring system |
| Engineering leadership | Slack #incidents | Within 5 minutes | Incident commander |
| Customer support | Email + Slack | Within 15 minutes | Comms lead |
| Affected customers | Status page + email | Within 30 minutes | Comms lead |
| Executive team | Email summary | Within 1 hour | Program owner |

Include message templates for customer-facing communications at each stage: initial acknowledgment, progress update, and resolution confirmation.
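
An initial acknowledgment template might look like this sketch, with angle-bracket placeholders filled in at send time:

```
Subject: [Status] Service disruption affecting <product>

We are investigating an issue affecting <affected functionality>
that began at <time UTC>. Customer data is <known state, e.g.
"not affected" or "under assessment">. Next update by <time UTC>,
or sooner if the situation changes.

Status page: <status page URL>
```

Pre-approving these templates with legal and support before an incident removes a slow review cycle from the critical path.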

### 5. Testing schedule

A plan that has never been tested is a hypothesis, not a plan. Define:

- **Tabletop exercises**: Quarterly walk-throughs of the plan with all stakeholders
- **Component tests**: Monthly restoration of individual backups to verify recoverability
- **Full failover drills**: Semi-annual or annual end-to-end failover to the recovery environment
- **Chaos engineering**: Ongoing injection of controlled failures in production (if applicable)

Each test must produce a written report documenting: what was tested, pass/fail per step, time to complete each phase, and issues discovered with remediation owners.
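
Phase timings from a drill report can be checked against the RTO mechanically rather than by eyeballing. A minimal sketch, with illustrative phase names and durations:

```python
from datetime import timedelta

def rto_met(phase_durations, rto):
    """True if the summed drill phases complete within the RTO target."""
    return sum(phase_durations.values(), timedelta()) <= rto

# Illustrative failover drill against a 30-minute RTO.
drill = {
    "detection": timedelta(minutes=4),
    "decision": timedelta(minutes=3),
    "promote standby": timedelta(minutes=9),
    "data validation": timedelta(minutes=6),
    "traffic cutover": timedelta(minutes=5),
}
print(rto_met(drill, timedelta(minutes=30)))  # True: 27 minutes total
```

Recording per-phase durations this way also shows which phase to optimize first when a drill misses its target.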

### 6. Plan maintenance

- **Review cadence**: Quarterly review or after any infrastructure change
- **Change triggers**: New system added, provider changed, RTO/RPO targets updated, post-incident findings
- **Version control and ownership**: Store in version control (not a wiki that silently drifts) with a named owner responsible for keeping it current

## Quality checklist

Before delivering the plan, verify:

- [ ] RPO and RTO are defined per system with business justification, not just technical preference
- [ ] Every system has a documented backup method, frequency, storage location, and verification process
- [ ] Failover procedures are step-by-step with commands, expected outputs, and decision authority
- [ ] Communication protocol specifies audience, channel, timing, and message owner — no gaps
- [ ] Testing schedule includes at least tabletop, component, and full failover tests with defined frequency
- [ ] Tier classifications are assigned and restoration order is explicit
- [ ] The plan names specific people or roles, not "the team" or "management"
- [ ] A maintenance owner and review cadence are defined

## Common mistakes

- **Setting RPO/RTO without business input.** Engineers pick technically convenient targets. The business must define how much downtime and data loss it can tolerate; engineering then designs to meet those targets.
- **Untested backups.** "We have daily backups" means nothing if you have never restored one. Backups that cannot be restored are not backups.
- **Single-region recovery storage.** Storing backups in the same region as production means a region outage destroys both. Cross-region or cross-provider storage is mandatory.
- **No communication plan.** Technical recovery without customer communication creates a second crisis. Customers who see downtime with no explanation lose trust faster than customers who get timely updates.
- **Plan lives in a wiki nobody reads.** If the plan is not tested regularly and updated after infrastructure changes, it will be wrong when you need it most. Treat it as a living document with a named owner.
- **Skipping decision authority.** In a crisis, "who decides to fail over?" cannot be an open question. Name the role and the backup if that person is unreachable.
| 121 |