---
name: data-migration-plan
description: Plan and execute data migrations with phased rollout — dual-write, backfill, cutover, and verification stages with rollback triggers and data integrity checks at every step.
---

# Data Migration Plan

## Before you start

Gather the following from the user. If any is missing, ask before proceeding:

1. **Current schema** — Tables, columns, relationships, and storage engine. Include indexes and constraints.
2. **Target schema** — The desired end state. What changes: renamed columns, split tables, new relationships, type changes?
3. **Data volume** — Row counts for affected tables, total size on disk, and growth rate.
4. **Downtime tolerance** — Zero-downtime required? Maintenance window available? If so, how long?
5. **Rollback requirements** — How quickly must you revert? Is data loss during rollback acceptable? What is the point-of-no-return?

If the user says "just rename a column," push back: What reads from this column? Are there cached references? Is the column in any API contract, event payload, or downstream ETL?

## Migration plan template

Use the following structure. Every phase includes explicit entry criteria, steps, and exit criteria.

### Pre-Migration Assessment

Before writing any migration code:

- Catalog every service, job, and query that reads or writes the affected tables
- Identify foreign key dependencies and cascade behavior
- Measure current query performance on affected tables (baseline P50/P99)
- Estimate migration duration: run the backfill query against a production-sized replica
- Confirm backup and restore procedures work — test the restore, not just the backup
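
The P50/P99 baseline in the third bullet can be captured with a simple timing harness. A minimal Python sketch using SQLite, with a hypothetical `orders` table standing in for a real affected table; run the real measurement against a production-sized replica, not production:

```python
import sqlite3
import statistics
import time

def measure_baseline(conn, query, params=(), samples=200):
    """Time repeated executions of a query and return P50/P99 in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        conn.execute(query, params).fetchall()
        timings.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(timings, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98]}

# Hypothetical affected table with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 1.5,) for i in range(10_000)])
baseline = measure_baseline(conn, "SELECT * FROM orders WHERE total > ?", (5000,))
print(baseline)
```

Record the numbers in the plan: the Phase 3 exit criterion ("within 10% of baseline") is meaningless without them.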

### Phase 1: Dual-Write

**Goal**: New writes go to both old and new schema without affecting reads.

1. Deploy schema changes additively (new columns, new tables) — never drop or rename yet
2. Update write paths to populate both old and new locations
3. Add logging to confirm dual-write consistency (compare old vs. new on every write)
4. **Integrity check**: Query for rows where old and new values diverge. Count must be zero before proceeding.
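
The step-4 integrity check is typically a single anti-join query. A minimal sketch, assuming a hypothetical migration of `users.email` (old location) into a `user_emails` table (new location); the null-safe comparison also catches rows that exist in one location but not the other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas: 'email' lives on users (old) and user_emails (new).
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE user_emails (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])
conn.executemany("INSERT INTO user_emails VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

# Count every row where the two write locations disagree or one is missing.
divergent = conn.execute("""
    SELECT COUNT(*) FROM users u
    LEFT JOIN user_emails e ON e.user_id = u.id
    WHERE e.user_id IS NULL OR e.email IS NOT u.email
""").fetchone()[0]
print(divergent)  # must be 0 before proceeding to backfill
```

Run this on a schedule during Phase 1, not once: a divergence that appears only under concurrent load is exactly what this phase exists to surface.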

**Rollback trigger**: If dual-write error rate exceeds 0.1%, disable new write path and revert deployment.

### Phase 2: Backfill

**Goal**: Migrate all historical data from old schema to new schema.

1. Run backfill in batches (1,000–10,000 rows per batch, tuned to avoid lock contention)
2. Throttle to keep replica lag under your threshold (e.g., < 5 seconds)
3. Use idempotent upserts so the backfill can be safely restarted at any point
4. Log progress: batch number, rows processed, elapsed time, estimated time remaining
5. **Integrity check**: After backfill completes, run a full-table comparison — old source vs. new target. Row count must match. Spot-check 1,000 random rows for value equality.
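
The batching and idempotency requirements above can be sketched together. This is illustrative Python over SQLite, with hypothetical `old_orders`/`orders_v2` tables and a type change (float dollars to integer cents) standing in for your real transformation; the upsert is what makes a restart safe:

```python
import sqlite3

def backfill(conn, batch_size=1000):
    """Batched, idempotent backfill: old_orders -> orders_v2.

    Restartable at any point: the upsert makes re-processing a batch a
    no-op, and progress is keyed by the last migrated primary key.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, total FROM old_orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        conn.executemany(
            """INSERT INTO orders_v2 (id, total_cents)
               VALUES (?, ?)
               ON CONFLICT (id) DO UPDATE SET total_cents = excluded.total_cents""",
            [(rid, int(total * 100)) for rid, total in rows])
        conn.commit()
        last_id = rows[-1][0]
        # In production: log batch progress here, and sleep between batches
        # whenever replica lag exceeds your threshold.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, total_cents INTEGER)")
conn.executemany("INSERT INTO old_orders VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1, 2501)])
backfill(conn, batch_size=1000)
backfill(conn, batch_size=1000)  # running twice is safe: the upsert is idempotent
count = conn.execute("SELECT COUNT(*) FROM orders_v2").fetchone()[0]
print(count)  # 2500
```

Keying progress by primary key (rather than OFFSET) keeps each batch query cheap and makes "resume from the last committed batch" trivial after a crash.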

**Rollback trigger**: If backfill causes replica lag > 30 seconds or error rate spikes, pause and reduce batch size. If data corruption is detected, stop and revert to Phase 1 state.

### Phase 3: Cutover

**Goal**: Switch reads from old schema to new schema.

1. Deploy read path changes behind a feature flag — route a small percentage of reads to new schema first (1%, 10%, 50%, 100%)
2. Compare read results between old and new paths during the ramp (shadow reads or dual-read with comparison logging)
3. Monitor latency, error rate, and correctness at each ramp stage
4. At 100%, hold for a soak period (minimum 24 hours in production) before proceeding
5. **Integrity check**: P99 latency on new read path must be within 10% of the old path baseline.
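
The ramp in steps 1 and 2 amounts to a percentage-based router plus a shadow read of the other path. A minimal sketch, where `read_old`, `read_new`, and `log_mismatch` are hypothetical stand-ins for your two read paths and your comparison log:

```python
import random

RAMP_STAGES = [0.01, 0.10, 0.50, 1.00]  # the 1% -> 10% -> 50% -> 100% ramp

def read_user(user_id, flag_pct, read_old, read_new, log_mismatch):
    """Route a read by flag percentage, shadow-read the other path,
    and log any divergence for review at the current ramp stage."""
    use_new = random.random() < flag_pct
    primary = read_new(user_id) if use_new else read_old(user_id)
    shadow = read_old(user_id) if use_new else read_new(user_id)
    if primary != shadow:
        log_mismatch(user_id, primary, shadow)
    return primary  # the caller always gets the primary path's answer

# Illustrative stores standing in for the old and new schema.
old_store = {1: "alice@example.com"}
new_store = {1: "alice@example.com"}
mismatches = []
value = read_user(1, flag_pct=0.5,
                  read_old=old_store.get, read_new=new_store.get,
                  log_mismatch=lambda uid, a, b: mismatches.append(uid))
print(value, len(mismatches))  # alice@example.com 0
```

Note that shadow reads double read load on both schemas; if that is unacceptable, sample the shadow comparison rather than running it on every request.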

**Rollback trigger**: If error rate or latency degrades at any ramp stage, set feature flag back to 0% immediately.

### Phase 4: Verification

**Goal**: Confirm the migration is complete and correct before cleanup.

1. Run final consistency check: compare row counts and checksums between old and new schema
2. Verify all downstream consumers (reports, analytics, ETL jobs, APIs) produce correct output
3. Confirm no service is still reading from or writing to the old schema (check query logs)
4. Document any data that was intentionally dropped, transformed, or defaulted during migration
5. Get sign-off from the data owner before proceeding to cleanup
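
The row-count and checksum comparison in step 1 can use an order-independent fingerprint, so the two sides can be scanned in any order. A sketch, assuming both schemas can be projected to the same column list and value representation (`old_t`/`new_t` are hypothetical names):

```python
import hashlib
import sqlite3

def table_fingerprint(conn, query):
    """Return (row_count, checksum) for a query, where the checksum is an
    XOR of per-row digests and therefore independent of scan order.
    Normalize column order and types in the query so both sides compare."""
    count, acc = 0, 0
    for row in conn.execute(query):
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
        count += 1
    return count, acc

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_t (id INTEGER, email TEXT)")
conn.execute("CREATE TABLE new_t (id INTEGER, email TEXT)")
rows = [(i, f"u{i}@example.com") for i in range(1000)]
conn.executemany("INSERT INTO old_t VALUES (?, ?)", rows)
conn.executemany("INSERT INTO new_t VALUES (?, ?)", rows)

old_fp = table_fingerprint(conn, "SELECT id, email FROM old_t")
new_fp = table_fingerprint(conn, "SELECT id, email FROM new_t")
print(old_fp == new_fp)  # True: counts and checksums both match
```

A matching fingerprint is strong but not absolute evidence (XOR can collide in principle), which is why the plan pairs it with spot-checks and downstream verification rather than relying on it alone.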

### Phase 5: Cleanup

**Goal**: Remove old schema and dual-write code.

1. Remove dual-write code paths and feature flags
2. Drop old columns or tables (after a grace period — typically 1–2 release cycles)
3. Update documentation, runbooks, and schema diagrams
4. Archive the migration scripts for the audit trail

**Do not skip cleanup.** Leftover dual-write code becomes a maintenance burden and a source of subtle bugs.

### Rollback Plan

Document the full rollback procedure upfront, not during an incident:

- **Phase 1 rollback**: Revert deployment, remove new columns/tables. No data loss.
- **Phase 2 rollback**: Stop the backfill, disable the still-active dual-write code, then truncate the new tables.
- **Phase 3 rollback**: Flip feature flag to route all reads back to old schema. Immediate.
- **Phase 4+ rollback**: If the old schema is already dropped, you need a restore from backup. Dropping the old schema is the point of no return — do not cross it until verification is fully signed off.

## Quality checklist

Before delivering the migration plan, verify:

- [ ] Every phase has explicit entry criteria, steps, and exit criteria
- [ ] Rollback triggers are defined with specific thresholds (not "if something goes wrong")
- [ ] Data integrity checks exist at every phase boundary
- [ ] Backfill is idempotent and restartable
- [ ] Downstream consumers are cataloged and their migration path is documented
- [ ] The rollback plan covers every phase, including the point of no return
- [ ] Migration duration has been estimated against production-scale data

## Common mistakes to avoid

- **Skipping the dual-write phase**. Going straight from old schema to new schema forces a big-bang cutover. Dual-write lets you validate incrementally and roll back safely.
- **Non-idempotent backfills**. If your backfill crashes at row 500,000 and its writes are not idempotent, you cannot safely resume or rerun it without risking duplicated or double-applied rows. Always use `INSERT ... ON CONFLICT DO UPDATE` or your database's equivalent upsert.
- **Ignoring downstream consumers**. The migration is not done when the database is updated. Every ETL job, cache layer, search index, and analytics pipeline that reads the old schema needs a migration path too.
- **No production-scale test**. A backfill that takes 2 minutes on staging may take 14 hours on production. Always benchmark against a production-sized dataset before scheduling the migration window.
- **Cleaning up too early**. Dropping the old schema the day after cutover leaves no safety net. Keep the old schema readable for at least one full release cycle.