engineeringdata
Data Migration Plan
Plan and execute data migrations with phased rollout — dual-write, backfill, cutover, and verification stages with rollback triggers and data integrity checks at every step.
migrationdataschemarollbackdatabaseETL
Works well with agents
Works well with skills
data-migration-plan/
postgres-to-new-schema.md
Markdown| 1 | # Migration Plan: User Accounts Schema Restructuring |
| 2 | |
| 3 | ## Overview |
| 4 | |
| 5 | Restructure the `users` table into separate `accounts` and `profiles` tables to support multi-tenant access. PostgreSQL 15, ~8.2M rows, zero-downtime requirement. |
| 6 | |
| 7 | ## Pre-Migration Assessment |
| 8 | |
| 9 | - **Affected services**: auth-service, billing-api, admin-dashboard, analytics-etl, email-worker |
| 10 | - **Foreign keys**: `orders.user_id`, `sessions.user_id`, `audit_logs.actor_id` (12 tables total) |
| 11 | - **Baseline performance**: P50 = 2ms, P99 = 18ms on `users` table lookups |
| 12 | - **Estimated backfill duration**: 47 minutes (tested against production replica at 10k rows/batch) |
| 13 | - **Backup verified**: Full pg_dump restore tested on 2026-03-12 — 22 minutes to restore |
| 14 | |
| 15 | ## Phase 1: Dual-Write (Days 1-3) |
| 16 | |
| 17 | **Entry criteria**: All CI checks pass, backup verified within 7 days. |
| 18 | |
| 19 | 1. Deploy additive schema: create `accounts` and `profiles` tables alongside `users` |
| 20 | 2. Update auth-service and billing-api write paths to insert into both `users` and `accounts`/`profiles` |
| 21 | 3. Deploy dual-write consistency logger — compare row-by-row on every write |
| 22 | |
| 23 | **Integrity check**: `SELECT count(*) FROM accounts a LEFT JOIN users u ON a.legacy_user_id = u.id WHERE u.id IS NULL;` must return 0. |
| 24 | |
| 25 | **Rollback trigger**: Dual-write error rate > 0.1% over any 10-minute window. Action: revert auth-service deploy, drop `accounts` and `profiles` tables. |
| 26 | |
| 27 | ## Phase 2: Backfill (Day 4) |
| 28 | |
| 29 | **Entry criteria**: Dual-write running 48+ hours with zero divergences. |
| 30 | |
| 31 | 1. Run backfill in batches of 5,000 rows with 200ms delay between batches |
| 32 | 2. Monitor replica lag — pause if lag exceeds 5 seconds |
| 33 | 3. Use `INSERT INTO accounts (...) SELECT ... FROM users ON CONFLICT (legacy_user_id) DO UPDATE` |
| 34 | |
| 35 | **Integrity check**: Row count match + spot-check 2,000 random rows for field equality: |
| 36 | ```sql |
| 37 | SELECT count(*) FROM users u |
| 38 | JOIN accounts a ON a.legacy_user_id = u.id |
| 39 | WHERE u.email != a.email OR u.created_at != a.created_at; |
| 40 | -- Must return 0 |
| 41 | ``` |
| 42 | |
| 43 | **Rollback trigger**: Replica lag > 30s sustained, or any data mismatch detected. Action: pause backfill, investigate, reduce batch size to 1,000. |
| 44 | |
| 45 | ## Phase 3: Cutover (Days 5-9) |
| 46 | |
| 47 | **Entry criteria**: Backfill complete, 100% row parity confirmed. |
| 48 | |
| 49 | | Stage | Read traffic % | Duration | Gate | |
| 50 | |-------|---------------|----------|------| |
| 51 | | Canary | 1% | 4 hours | Error rate < 0.01%, P99 < 22ms | |
| 52 | | Partial | 10% | 12 hours | Error rate < 0.01%, P99 < 20ms | |
| 53 | | Majority | 50% | 24 hours | Error rate < 0.005% | |
| 54 | | Full | 100% | 48 hours soak | P99 within 10% of baseline | |
| 55 | |
| 56 | **Rollback trigger**: Error rate or latency exceeds gate thresholds. Action: set feature flag `use_new_accounts_schema` to false — immediate revert to `users` table reads. |
| 57 | |
| 58 | ## Phase 4: Verification (Days 10-12) |
| 59 | |
| 60 | 1. Final consistency check: row counts, checksums on email and created_at columns |
| 61 | 2. Verify analytics-etl output matches pre-migration baseline report (< 0.01% variance) |
| 62 | 3. Confirm no queries hit `users` table directly (check `pg_stat_user_tables` for seq/idx scans) |
| 63 | 4. Sign-off from data-eng lead and product-eng lead |
| 64 | |
| 65 | ## Phase 5: Cleanup (Days 20-25) |
| 66 | |
| 67 | 1. Remove dual-write code from auth-service and billing-api |
| 68 | 2. Drop `users` table after 2 release cycles (grace period through Day 25) |
| 69 | 3. Update schema diagrams, ERD docs, and runbooks |
| 70 | 4. Archive migration scripts in `migrations/2026-03-archive/` |
| 71 | |
| 72 | ## Rollback Summary |
| 73 | |
| 74 | | Phase | Rollback action | Data loss risk | |
| 75 | |-------|----------------|----------------| |
| 76 | | 1 — Dual-Write | Revert deploy, drop new tables | None | |
| 77 | | 2 — Backfill | Stop backfill, truncate new tables | None | |
| 78 | | 3 — Cutover | Flip feature flag to 0% | None | |
| 79 | | 4+ — Post-cleanup | Restore from backup (point of no return) | Up to RPO window | |
| 80 |