---
name: data-migration-plan
description: Plan and execute data migrations with phased rollout — dual-write, backfill, cutover, and verification stages with rollback triggers and data integrity checks at every step.
---

# Data Migration Plan

## Before you start

Gather the following from the user. If any is missing, ask before proceeding:

1. **Current schema** — Tables, columns, relationships, and storage engine. Include indexes and constraints.
2. **Target schema** — The desired end state. What changes: renamed columns, split tables, new relationships, type changes?
3. **Data volume** — Row counts for affected tables, total size on disk, and growth rate.
4. **Downtime tolerance** — Zero-downtime required? Maintenance window available? If so, how long?
5. **Rollback requirements** — How quickly must you revert? Is data loss during rollback acceptable? What is the point-of-no-return?

If the user says "just rename a column," push back: What reads from this column? Are there cached references? Is the column in any API contract, event payload, or downstream ETL?

## Migration plan template

Use the following structure. Every phase includes explicit entry criteria, steps, and exit criteria.

### Pre-Migration Assessment

Before writing any migration code:

- Catalog every service, job, and query that reads or writes the affected tables
- Identify foreign key dependencies and cascade behavior
- Measure current query performance on affected tables (baseline P50/P99)
- Estimate migration duration: run the backfill query against a production-sized replica
- Confirm backup and restore procedures work — test the restore, not just the backup
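
The P50/P99 baseline in the third bullet can be captured with a simple timing harness. A minimal Python sketch using SQLite, with a hypothetical `orders` table standing in for a real affected table; run the real measurement against a production-sized replica, not production:

```python
import sqlite3
import statistics
import time

def measure_baseline(conn, query, params=(), samples=200):
    """Time repeated executions of a query and return P50/P99 in milliseconds."""
    timings = []
    for _ in range(samples):
        start = time.perf_counter()
        conn.execute(query, params).fetchall()
        timings.append((time.perf_counter() - start) * 1000.0)
    cuts = statistics.quantiles(timings, n=100)  # 99 percentile cut points
    return {"p50": cuts[49], "p99": cuts[98]}

# Hypothetical affected table with illustrative data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, total REAL)")
conn.executemany("INSERT INTO orders (total) VALUES (?)",
                 [(i * 1.5,) for i in range(10_000)])
baseline = measure_baseline(conn, "SELECT * FROM orders WHERE total > ?", (5000,))
print(baseline)
```

Record the numbers in the plan: the Phase 3 exit criterion ("within 10% of baseline") is meaningless without them.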

### Phase 1: Dual-Write

**Goal**: New writes go to both old and new schema without affecting reads.

1. Deploy schema changes additively (new columns, new tables) — never drop or rename yet
2. Update write paths to populate both old and new locations
3. Add logging to confirm dual-write consistency (compare old vs. new on every write)
4. **Integrity check**: Query for rows where old and new values diverge. Count must be zero before proceeding.
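
The step-4 integrity check is typically a single anti-join query. A minimal sketch, assuming a hypothetical migration of `users.email` (old location) into a `user_emails` table (new location); the null-safe comparison also catches rows that exist in one location but not the other:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical schemas: 'email' lives on users (old) and user_emails (new).
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT)")
conn.execute("CREATE TABLE user_emails (user_id INTEGER PRIMARY KEY, email TEXT)")
conn.executemany("INSERT INTO users VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])
conn.executemany("INSERT INTO user_emails VALUES (?, ?)",
                 [(1, "a@example.com"), (2, "b@example.com")])

# Count every row where the two write locations disagree or one is missing.
divergent = conn.execute("""
    SELECT COUNT(*) FROM users u
    LEFT JOIN user_emails e ON e.user_id = u.id
    WHERE e.user_id IS NULL OR e.email IS NOT u.email
""").fetchone()[0]
print(divergent)  # must be 0 before proceeding to backfill
```

Run this on a schedule during Phase 1, not once: a divergence that appears only under concurrent load is exactly what this phase exists to surface.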

**Rollback trigger**: If dual-write error rate exceeds 0.1%, disable new write path and revert deployment.

### Phase 2: Backfill

**Goal**: Migrate all historical data from old schema to new schema.

1. Run backfill in batches (1,000–10,000 rows per batch, tuned to avoid lock contention)
2. Throttle to keep replica lag under your threshold (e.g., < 5 seconds)
3. Use idempotent upserts so the backfill can be safely restarted at any point
4. Log progress: batch number, rows processed, elapsed time, estimated time remaining
5. **Integrity check**: After backfill completes, run a full-table comparison — old source vs. new target. Row count must match. Spot-check 1,000 random rows for value equality.
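
The batching and idempotency requirements above can be sketched together. This is illustrative Python over SQLite, with hypothetical `old_orders`/`orders_v2` tables and a type change (float dollars to integer cents) standing in for your real transformation; the upsert is what makes a restart safe:

```python
import sqlite3

def backfill(conn, batch_size=1000):
    """Batched, idempotent backfill: old_orders -> orders_v2.

    Restartable at any point: the upsert makes re-processing a batch a
    no-op, and progress is keyed by the last migrated primary key.
    """
    last_id = 0
    while True:
        rows = conn.execute(
            "SELECT id, total FROM old_orders WHERE id > ? ORDER BY id LIMIT ?",
            (last_id, batch_size)).fetchall()
        if not rows:
            break
        conn.executemany(
            """INSERT INTO orders_v2 (id, total_cents)
               VALUES (?, ?)
               ON CONFLICT (id) DO UPDATE SET total_cents = excluded.total_cents""",
            [(rid, int(total * 100)) for rid, total in rows])
        conn.commit()
        last_id = rows[-1][0]
        # In production: log batch progress here, and sleep between batches
        # whenever replica lag exceeds your threshold.

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_orders (id INTEGER PRIMARY KEY, total REAL)")
conn.execute("CREATE TABLE orders_v2 (id INTEGER PRIMARY KEY, total_cents INTEGER)")
conn.executemany("INSERT INTO old_orders VALUES (?, ?)",
                 [(i, i * 0.5) for i in range(1, 2501)])
backfill(conn, batch_size=1000)
backfill(conn, batch_size=1000)  # running twice is safe: the upsert is idempotent
count = conn.execute("SELECT COUNT(*) FROM orders_v2").fetchone()[0]
print(count)  # 2500
```

Keying progress by primary key (rather than OFFSET) keeps each batch query cheap and makes "resume from the last committed batch" trivial after a crash.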

**Rollback trigger**: If backfill causes replica lag > 30 seconds or error rate spikes, pause and reduce batch size. If data corruption is detected, stop and revert to Phase 1 state.

### Phase 3: Cutover

**Goal**: Switch reads from old schema to new schema.

1. Deploy read path changes behind a feature flag — route a small percentage of reads to new schema first (1%, 10%, 50%, 100%)
2. Compare read results between old and new paths during the ramp (shadow reads or dual-read with comparison logging)
3. Monitor latency, error rate, and correctness at each ramp stage
4. At 100%, hold for a soak period (minimum 24 hours in production) before proceeding
5. **Integrity check**: P99 latency on new read path must be within 10% of the old path baseline.
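
The ramp in steps 1 and 2 amounts to a percentage-based router plus a shadow read of the other path. A minimal sketch, where `read_old`, `read_new`, and `log_mismatch` are hypothetical stand-ins for your two read paths and your comparison log:

```python
import random

RAMP_STAGES = [0.01, 0.10, 0.50, 1.00]  # the 1% -> 10% -> 50% -> 100% ramp

def read_user(user_id, flag_pct, read_old, read_new, log_mismatch):
    """Route a read by flag percentage, shadow-read the other path,
    and log any divergence for review at the current ramp stage."""
    use_new = random.random() < flag_pct
    primary = read_new(user_id) if use_new else read_old(user_id)
    shadow = read_old(user_id) if use_new else read_new(user_id)
    if primary != shadow:
        log_mismatch(user_id, primary, shadow)
    return primary  # the caller always gets the primary path's answer

# Illustrative stores standing in for the old and new schema.
old_store = {1: "alice@example.com"}
new_store = {1: "alice@example.com"}
mismatches = []
value = read_user(1, flag_pct=0.5,
                  read_old=old_store.get, read_new=new_store.get,
                  log_mismatch=lambda uid, a, b: mismatches.append(uid))
print(value, len(mismatches))  # alice@example.com 0
```

Note that shadow reads double read load on both schemas; if that is unacceptable, sample the shadow comparison rather than running it on every request.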

**Rollback trigger**: If error rate or latency degrades at any ramp stage, set feature flag back to 0% immediately.

### Phase 4: Verification

**Goal**: Confirm the migration is complete and correct before cleanup.

1. Run final consistency check: compare row counts and checksums between old and new schema
2. Verify all downstream consumers (reports, analytics, ETL jobs, APIs) produce correct output
3. Confirm no service is still reading from or writing to the old schema (check query logs)
4. Document any data that was intentionally dropped, transformed, or defaulted during migration
5. Get sign-off from the data owner before proceeding to cleanup
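
The row-count and checksum comparison in step 1 can use an order-independent fingerprint, so the two sides can be scanned in any order. A sketch, assuming both schemas can be projected to the same column list and value representation (`old_t`/`new_t` are hypothetical names):

```python
import hashlib
import sqlite3

def table_fingerprint(conn, query):
    """Return (row_count, checksum) for a query, where the checksum is an
    XOR of per-row digests and therefore independent of scan order.
    Normalize column order and types in the query so both sides compare."""
    count, acc = 0, 0
    for row in conn.execute(query):
        digest = hashlib.sha256(repr(row).encode()).digest()
        acc ^= int.from_bytes(digest[:8], "big")
        count += 1
    return count, acc

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE old_t (id INTEGER, email TEXT)")
conn.execute("CREATE TABLE new_t (id INTEGER, email TEXT)")
rows = [(i, f"u{i}@example.com") for i in range(1000)]
conn.executemany("INSERT INTO old_t VALUES (?, ?)", rows)
conn.executemany("INSERT INTO new_t VALUES (?, ?)", rows)

old_fp = table_fingerprint(conn, "SELECT id, email FROM old_t")
new_fp = table_fingerprint(conn, "SELECT id, email FROM new_t")
print(old_fp == new_fp)  # True: counts and checksums both match
```

A matching fingerprint is strong but not absolute evidence (XOR can collide in principle), which is why the plan pairs it with spot-checks and downstream verification rather than relying on it alone.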

### Phase 5: Cleanup

**Goal**: Remove old schema and dual-write code.

1. Remove dual-write code paths and feature flags
2. Drop old columns or tables (after a grace period — typically 1–2 release cycles)
3. Update documentation, runbooks, and schema diagrams
4. Archive the migration scripts for the audit trail

**Do not skip cleanup.** Leftover dual-write code becomes a maintenance burden and a source of subtle bugs.

### Rollback Plan

Document the full rollback procedure upfront, not during an incident:

- **Phase 1 rollback**: Revert deployment, remove new columns/tables. No data loss.
- **Phase 2 rollback**: Stop the backfill, disable the still-active dual-write code, then truncate the new tables.
- **Phase 3 rollback**: Flip feature flag to route all reads back to old schema. Immediate.
- **Phase 4+ rollback**: If the old schema is already dropped, you need a restore from backup. Dropping the old schema is the point of no return — do not cross it until verification is fully signed off.

## Quality checklist

Before delivering the migration plan, verify:

- [ ] Every phase has explicit entry criteria, steps, and exit criteria
- [ ] Rollback triggers are defined with specific thresholds (not "if something goes wrong")
- [ ] Data integrity checks exist at every phase boundary
- [ ] Backfill is idempotent and restartable
- [ ] Downstream consumers are cataloged and their migration path is documented
- [ ] The rollback plan covers every phase, including the point of no return
- [ ] Migration duration has been estimated against production-scale data

## Common mistakes to avoid

- **Skipping the dual-write phase**. Going straight from old schema to new schema forces a big-bang cutover. Dual-write lets you validate incrementally and roll back safely.
- **Non-idempotent backfills**. If your backfill crashes at row 500,000 and its writes are not idempotent, you cannot safely resume or rerun it without risking duplicated or double-applied rows. Always use `INSERT ... ON CONFLICT DO UPDATE` or your database's equivalent upsert.
- **Ignoring downstream consumers**. The migration is not done when the database is updated. Every ETL job, cache layer, search index, and analytics pipeline that reads the old schema needs a migration path too.
- **No production-scale test**. A backfill that takes 2 minutes on staging may take 14 hours on production. Always benchmark against a production-sized dataset before scheduling the migration window.
- **Cleaning up too early**. Dropping the old schema the day after cutover leaves no safety net. Keep the old schema readable for at least one full release cycle.