Release Checklist
Create go/no-go release checklists with pre-deploy verification, staged rollout steps, monitoring checkpoints, rollback triggers, and stakeholder communication plans.
# Release Checklist

## Before you start

Gather the following from the user:

1. **What is being released?** (Service name, version, list of changes or link to changelog)
2. **What environments are involved?** (Staging, canary, production regions)
3. **What is the rollback strategy?** (Feature flags, blue-green, redeploy previous version)
4. **Who are the stakeholders?** (Engineering leads, product owners, support, on-call)
5. **What is the risk level?** (Database migrations, breaking API changes, new infrastructure)

If the user says "just make me a release checklist," push back: "For which release? I need the scope of changes, target environments, and rollback strategy to build a useful checklist."

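The intake step above can be sketched as a small validation routine that reports which inputs are still missing before any checklist is drafted. This is a minimal Python sketch, not part of the skill itself: the `ReleaseRequest` class, its field names, the `missing_inputs` helper, and the example service name are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ReleaseRequest:
    """Hypothetical container for the five inputs listed above."""
    scope: str = ""                                          # service, version, changes or changelog link
    environments: list[str] = field(default_factory=list)   # staging, canary, production regions
    rollback_strategy: str = ""                              # feature flags, blue-green, redeploy
    stakeholders: list[str] = field(default_factory=list)   # leads, product, support, on-call
    risk_factors: list[str] = field(default_factory=list)   # assumed: may legitimately be empty


def missing_inputs(req: ReleaseRequest) -> list[str]:
    """Return the names of required inputs the user has not supplied yet."""
    required = {
        "scope of changes": req.scope,
        "target environments": req.environments,
        "rollback strategy": req.rollback_strategy,
        "stakeholders": req.stakeholders,
    }
    # Empty strings and empty lists both count as "not supplied".
    return [name for name, value in required.items() if not value]


# Hypothetical request where only the scope is known so far.
request = ReleaseRequest(scope="payments-api v2.4.0")
print(missing_inputs(request))
# prints ['target environments', 'rollback strategy', 'stakeholders']
```

A non-empty result is the cue to push back with a question instead of producing a generic checklist.
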
## Release checklist template

### Pre-Release

#### Scope Inventory

List every change shipping in this release. For each item, note the owner, whether it is behind a feature flag, and whether it touches shared infrastructure or data schemas.

#### Risk Classification

Classify as **low**, **medium**, or **high** risk based on: database migrations, breaking API changes, new third-party dependencies, and blast radius. State the classification and reasons. High-risk releases require a dedicated rollback runbook before proceeding.

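One way to make the classification repeatable is to encode the four factors and return both the level and the reasons. The text names the factors but not how to weigh them, so the scoring rule below is an illustrative assumption, as are the `RiskFactors` and `classify_risk` names and the 50% blast-radius cutoff.

```python
from dataclasses import dataclass


@dataclass
class RiskFactors:
    has_db_migration: bool
    has_breaking_api_change: bool
    adds_third_party_dependency: bool
    blast_radius_pct: float  # assumed meaning: share of users or traffic affected, 0-100


def classify_risk(f: RiskFactors) -> tuple[str, list[str]]:
    """Return (level, reasons). The weighting is an assumption, not a standard."""
    reasons = []
    if f.has_db_migration:
        reasons.append("database migration")
    if f.has_breaking_api_change:
        reasons.append("breaking API change")
    if f.adds_third_party_dependency:
        reasons.append("new third-party dependency")
    if f.blast_radius_pct >= 50:  # assumed cutoff for a "wide" blast radius
        reasons.append("wide blast radius")

    # Assumed rule: a breaking API change, or any two factors together, is high risk.
    if f.has_breaking_api_change or len(reasons) >= 2:
        level = "high"
    elif len(reasons) == 1:
        level = "medium"
    else:
        level = "low"
    return level, reasons


level, reasons = classify_risk(RiskFactors(True, False, False, 10.0))
print(level, reasons)  # prints: medium ['database migration']
needs_runbook = level == "high"  # high-risk releases need a dedicated rollback runbook
```

Returning the reasons alongside the level keeps the "state the classification and reasons" requirement in one place.
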
#### Dependency Check

- [ ] Dependent services are deployed and healthy
- [ ] Database migrations tested against a production-sized dataset
- [ ] Feature flags configured in all target environments
- [ ] Secrets and environment variables set in target environments

### Go/No-Go Criteria

Every row must be "Go" to proceed. If any item is "No-Go," the release does not ship.

| Criteria | Owner | Status |
|---|---|---|
| All CI checks pass on release branch | Engineer | Go / No-Go |
| Staging smoke tests pass | QA | Go / No-Go |
| Database migration tested and reversible | DBA / Engineer | Go / No-Go |
| Rollback procedure documented and tested | SRE | Go / No-Go |
| On-call engineer identified and available | Engineering lead | Go / No-Go |
| Stakeholders notified of release window | Release manager | Go / No-Go |

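The "every row must be Go" rule is an all-or-nothing gate, which can be sketched in a few lines. The `Criterion` class and `release_may_proceed` function are hypothetical names. Treating an empty table or an unrecognized status as a block is an assumption that keeps the gate fail-closed.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    owner: str
    status: str  # "Go" or "No-Go"; anything else is treated as No-Go (assumption)


def release_may_proceed(criteria: list[Criterion]) -> tuple[bool, list[str]]:
    """Return the gate decision and the rows that block the release."""
    blockers = [f"{c.name} (owner: {c.owner})" for c in criteria if c.status != "Go"]
    # An empty table never passes: no evidence is not the same as "Go".
    return (len(criteria) > 0 and not blockers), blockers


criteria = [
    Criterion("All CI checks pass on release branch", "Engineer", "Go"),
    Criterion("Staging smoke tests pass", "QA", "Go"),
    Criterion("Rollback procedure documented and tested", "SRE", "No-Go"),
]
ok, blockers = release_may_proceed(criteria)
print("GO" if ok else "NO-GO", blockers)
# prints: NO-GO ['Rollback procedure documented and tested (owner: SRE)']
```

Listing the blocking rows with their owners shows who has to act before the release can be reconsidered.
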
### Staged Rollout Plan

Define each stage with traffic percentage, bake time, and metric thresholds. Adjust based on risk classification: high-risk releases start at 1% with longer bake times.

| Stage | Traffic % | Bake Time | Metric Thresholds |
|---|---|---|---|
| Canary | 1-5% | 15-30 min | Error rate < 0.1%, p99 latency < baseline + 20% |
| Partial | 25% | 30-60 min | Error rate < 0.05%, no new error signatures |
| Majority | 75% | 60 min | Same as partial |
| Full | 100% | Ongoing | Same as partial |

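The rollout table can also be kept as data, so a stage is promoted only when its thresholds hold. The sketch below is a hypothetical encoding: `Stage`, `PLAN`, and `stage_passes` are illustrative names. Where the table gives a range (1-5%, 15-30 min, 30-60 min), the upper end is picked arbitrarily. A real plan would choose values based on the risk classification.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int
    bake_minutes: Optional[int]                     # None means "ongoing"
    max_error_rate_pct: float                       # error rate must stay below this
    max_p99_over_baseline: Optional[float] = None   # 0.20 means baseline + 20%
    forbid_new_error_signatures: bool = False


# Thresholds copied from the table; traffic and bake values use the upper
# end of each range (an arbitrary choice for this sketch).
PLAN = [
    Stage("canary", 5, 30, 0.1, max_p99_over_baseline=0.20),
    Stage("partial", 25, 60, 0.05, forbid_new_error_signatures=True),
    Stage("majority", 75, 60, 0.05, forbid_new_error_signatures=True),
    Stage("full", 100, None, 0.05, forbid_new_error_signatures=True),
]


def stage_passes(stage: Stage, error_rate_pct: float, p99_ms: float,
                 baseline_p99_ms: float, new_signatures: int) -> bool:
    """Check observed metrics against one stage's thresholds."""
    if error_rate_pct >= stage.max_error_rate_pct:
        return False
    if stage.max_p99_over_baseline is not None:
        if p99_ms >= baseline_p99_ms * (1 + stage.max_p99_over_baseline):
            return False
    if stage.forbid_new_error_signatures and new_signatures > 0:
        return False
    return True


# Canary at 0.05% errors and p99 of 110 ms against a 100 ms baseline.
print(stage_passes(PLAN[0], 0.05, 110.0, 100.0, 0))  # prints: True
```

The check covers metrics only. Waiting out the bake time is still a separate requirement before moving to the next stage.
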
### Monitoring Checkpoints

At each rollout stage, check:

- [ ] **Error rates**: Compare canary vs. baseline cohort. New error types are an immediate flag.
- [ ] **Latency**: p50, p95, p99 against pre-release baseline. Watch for gradual degradation, not just spikes.
- [ ] **Resource utilization**: CPU, memory, connection pools. Leaks surface during bake time.
- [ ] **Business metrics**: Conversion rates, checkout completions, or domain-specific KPIs. Drops may not trigger alerts.
- [ ] **Dependency health**: Downstream service error rates and queue depths.

Include specific dashboard URLs and alert names so the engineer can check each item without searching.

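Three of the checks above are simple comparisons against a baseline: new error types, latency ratios, and gradual degradation. The sketch below illustrates them with hypothetical helper names. The drift check is a deliberately crude heuristic (second-half mean against first-half mean, with an assumed 10% tolerance), not an established detection method.

```python
from statistics import mean


def new_error_signatures(canary: set[str], baseline: set[str]) -> set[str]:
    """Error types seen in the canary cohort but not in the baseline cohort."""
    return canary - baseline


def latency_ratios(canary: dict[str, float], baseline: dict[str, float]) -> dict[str, float]:
    """Canary-to-baseline ratio per percentile; 1.0 means unchanged."""
    return {p: canary[p] / baseline[p] for p in ("p50", "p95", "p99")}


def is_drifting_up(samples: list[float], tolerance: float = 0.10) -> bool:
    """Crude drift check over one bake window: is the mean of the second half
    more than `tolerance` above the mean of the first half?"""
    half = len(samples) // 2
    if half == 0:
        return False  # not enough samples to compare
    return mean(samples[half:]) > mean(samples[:half]) * (1 + tolerance)


print(new_error_signatures({"TimeoutError", "KeyError"}, {"TimeoutError"}))  # prints: {'KeyError'}
print(is_drifting_up([100, 102, 101, 118, 121, 125]))  # prints: True
```

The drift check exists because a latency series can stay under every spike threshold while still climbing steadily across the bake window.
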
### Rollback Triggers

Define explicit, measurable conditions that require rollback rather than leaving the decision to judgment:

- Error rate exceeds 2x baseline for more than 5 minutes
- p99 latency exceeds 3x baseline
- Any data corruption or consistency issue detected
- Dependent service reports degradation traced to this release
- Feature flag kill switch fails to disable new behavior

**Rollback procedure:**

1. Halt rollout progression immediately
2. Route traffic back to previous version (feature flag off, revert deployment, or DNS switch)
3. Verify rollback by confirming metrics return to baseline within 10 minutes
4. Notify stakeholders with incident channel link
5. Create incident ticket with timeline and root cause hypothesis

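The two metric-based triggers in this section (error rate and p99 latency) can be evaluated mechanically. The sketch below assumes one sample per minute and uses hypothetical names (`Sample`, `should_roll_back`). The remaining triggers (data corruption, dependency degradation, kill-switch failure) are not metric streams and are left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Sample:
    minute: int        # minutes since the stage started (assumed one sample per minute)
    error_rate: float
    p99_ms: float


def should_roll_back(samples: list[Sample], baseline_error_rate: float,
                     baseline_p99_ms: float) -> Optional[str]:
    """Return the first trigger that fires, or None if no trigger fires."""
    streak_start = None  # minute at which the current error-rate breach began
    for s in samples:
        if s.p99_ms > 3 * baseline_p99_ms:
            return "p99 latency exceeds 3x baseline"
        if s.error_rate > 2 * baseline_error_rate:
            if streak_start is None:
                streak_start = s.minute
            if s.minute - streak_start > 5:
                return "error rate exceeds 2x baseline for more than 5 minutes"
        else:
            streak_start = None  # breach ended, so the window resets
    return None


# Error rate at 3x baseline for minutes 0 through 6, latency normal.
elevated = [Sample(m, 0.03, 100.0) for m in range(7)]
print(should_roll_back(elevated, baseline_error_rate=0.01, baseline_p99_ms=100.0))
# prints: error rate exceeds 2x baseline for more than 5 minutes
```

Because the streak resets whenever the error rate recovers, a brief spike does not fire the trigger, while a sustained breach does.
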
### Post-Release

- [ ] **Verification**: Metrics stable for 1 hour at 100%, smoke tests pass, no new alerts
- [ ] **Communication**: Stakeholders notified, release notes published, support team briefed on new behavior
- [ ] **Cleanup**: Feature flags scheduled for removal, old artifacts torn down, retrospective scheduled if high-risk

## Quality checklist

Before delivering the checklist, verify:

- [ ] Every rollout stage has specific traffic percentages, bake times, and metric thresholds
- [ ] Rollback triggers are measurable conditions, not subjective judgments
- [ ] Go/No-Go table covers CI, testing, rollback readiness, and stakeholder notification
- [ ] Monitoring checkpoints reference specific metrics with comparison baselines
- [ ] Post-release section includes verification, communication, and cleanup steps
- [ ] The checklist is scoped to one release, not a generic process document

## Common mistakes

- **Vague rollback criteria.** "Roll back if things look bad" is not a trigger. State the metric, threshold, and time window.
- **Skipping bake time under pressure.** Bake times exist to surface slow-burn issues like memory leaks and connection exhaustion. Cutting them short defeats the purpose of staged rollout.
- **No baseline comparison.** Metric thresholds mean nothing without a baseline. Always compare canary metrics against the existing production cohort, not against arbitrary numbers.
- **Forgetting business metrics.** A release can have zero errors and perfect latency while silently breaking checkout flows. Include domain-specific KPIs in monitoring checkpoints.
- **Missing stakeholder communication.** Engineering may know the release succeeded, but support, product, and leadership need explicit notification, especially if user-facing behavior changed.
- **Treating the checklist as optional.** If a Go/No-Go item is "No-Go," the release does not proceed. The checklist is a gate, not a suggestion.