Release Checklist

Create go/no-go release checklists with pre-deploy verification, staged rollout steps, monitoring checkpoints, rollback triggers, and stakeholder communication plans.

# Release Checklist

## Before you start

Gather the following from the user:

1. What is being released? (Service name, version, list of changes or link to changelog)
2. What environments are involved? (Staging, canary, production regions)
3. What is the rollback strategy? (Feature flags, blue-green, redeploy previous version)
4. Who are the stakeholders? (Engineering leads, product owners, support, on-call)
5. What is the risk level? (Database migrations, breaking API changes, new infrastructure)

If the user says "just make me a release checklist," push back: "For which release? I need the scope of changes, target environments, and rollback strategy to build a useful checklist."

## Release checklist template

### Pre-Release

#### Scope Inventory

List every change shipping in this release. For each item, note the owner, whether it is behind a feature flag, and whether it touches shared infrastructure or data schemas.

#### Risk Classification

Classify as low, medium, or high risk based on: database migrations, breaking API changes, new third-party dependencies, and blast radius. State the classification and reasons. High-risk releases require a dedicated rollback runbook before proceeding.

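The classification step can be sketched as a small function. This is a minimal illustration, not a prescribed rule: the `ReleaseScope` record, its field names, and the mapping from factors to low/medium/high (including the 50% blast-radius cutoff) are all assumptions; the text above names the factors but not how to weigh them.

```python
from dataclasses import dataclass


@dataclass
class ReleaseScope:
    """Hypothetical summary of the risk factors named above."""
    has_db_migration: bool = False
    has_breaking_api_change: bool = False
    new_third_party_deps: int = 0
    blast_radius_pct: float = 0.0  # share of users or traffic affected, 0-100


def classify_risk(scope: ReleaseScope) -> tuple[str, list[str]]:
    """Return (classification, reasons). The weighting is an illustrative assumption."""
    high, medium = [], []
    if scope.has_db_migration:
        high.append("database migration")
    if scope.has_breaking_api_change:
        high.append("breaking API change")
    if scope.new_third_party_deps > 0:
        medium.append(f"new third-party dependencies: {scope.new_third_party_deps}")
    if scope.blast_radius_pct >= 50:  # assumed cutoff for a wide blast radius
        medium.append(f"blast radius {scope.blast_radius_pct:.0f}%")
    if high:
        return "high", high + medium
    if medium:
        return "medium", medium
    return "low", ["no listed risk factors present"]


def needs_rollback_runbook(classification: str) -> bool:
    """High-risk releases require a dedicated rollback runbook."""
    return classification == "high"
```

Returning the reasons alongside the label keeps the "state the classification and reasons" requirement visible in the output.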
#### Dependency Check

- [ ] Dependent services are deployed and healthy
- [ ] Database migrations tested against a production-sized dataset
- [ ] Feature flags configured in all target environments
- [ ] Secrets and environment variables set in target environments

### Go/No-Go Criteria

Every row must be "Go" to proceed. If any item is "No-Go," the release does not ship.

| Criteria | Owner | Status |
|---|---|---|
| All CI checks pass on release branch | Engineer | Go / No-Go |
| Staging smoke tests pass | QA | Go / No-Go |
| Database migration tested and reversible | DBA / Engineer | Go / No-Go |
| Rollback procedure documented and tested | SRE | Go / No-Go |
| On-call engineer identified and available | Engineering lead | Go / No-Go |
| Stakeholders notified of release window | Release manager | Go / No-Go |

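The all-rows-must-be-Go rule can be expressed as a gate function. The sketch below assumes a hypothetical `Criterion` record mirroring the table columns; treating an empty table as a blocker is an added assumption, on the grounds that nothing has been verified.

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    """One row of the Go/No-Go table (hypothetical structure)."""
    name: str
    owner: str
    status: str  # "Go" or "No-Go"


def evaluate_gate(criteria: list[Criterion]) -> tuple[bool, list[str]]:
    """Return (ship, blockers). Any status other than "Go" blocks the release."""
    blockers = [f"{c.name} (owner: {c.owner})" for c in criteria if c.status != "Go"]
    # Assumption: an empty table also blocks, since no criterion was checked.
    ship = bool(criteria) and not blockers
    return ship, blockers
```

Listing the blockers with their owners gives the release manager a direct follow-up list when the gate fails.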
### Staged Rollout Plan

Define each stage with traffic percentage, bake time, and metric thresholds. Adjust based on risk classification: high-risk releases start at 1% with longer bake times.

| Stage | Traffic % | Bake Time | Metric Thresholds |
|---|---|---|---|
| Canary | 1-5% | 15-30 min | Error rate < 0.1%, p99 latency < baseline + 20% |
| Partial | 25% | 30-60 min | Error rate < 0.05%, no new error signatures |
| Majority | 75% | 60 min | Same as partial |
| Full | 100% | Ongoing | Same as partial |

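One way to make the table above machine-checkable is to encode each stage as data and test whether a stage may advance. This is a sketch under assumptions: the `Stage` record and `may_advance` function are hypothetical, the upper bound of each traffic range and the lower bound of each bake time are used, and "Ongoing" is represented as a bake time of 0.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Stage:
    name: str
    traffic_pct: int           # upper bound of the traffic range in the table
    min_bake_minutes: int      # lower bound of the bake time; 0 stands for "Ongoing"
    max_error_rate_pct: float
    max_p99_over_baseline: Optional[float] = None  # 0.20 means baseline + 20%


# Values transcribed from the table; only the canary row has a latency threshold.
PLAN = [
    Stage("Canary", 5, 15, 0.1, 0.20),
    Stage("Partial", 25, 30, 0.05),
    Stage("Majority", 75, 60, 0.05),
    Stage("Full", 100, 0, 0.05),
]


def may_advance(stage, baked_minutes, error_rate_pct, p99_ms, baseline_p99_ms,
                new_error_signatures=0):
    """True only if the bake time has elapsed and every threshold holds."""
    if baked_minutes < stage.min_bake_minutes:
        return False
    if error_rate_pct >= stage.max_error_rate_pct:
        return False
    if stage.max_p99_over_baseline is not None:
        if p99_ms >= baseline_p99_ms * (1 + stage.max_p99_over_baseline):
            return False
    # "No new error signatures" applies from the partial stage onward.
    if stage.name != "Canary" and new_error_signatures > 0:
        return False
    return True
```

For a high-risk release, the canary entry would be replaced with a 1% stage and a longer `min_bake_minutes`, as the paragraph above suggests.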
### Monitoring Checkpoints

At each rollout stage, check:

- [ ] Error rates: Compare canary vs. baseline cohort. New error types are an immediate flag.
- [ ] Latency: p50, p95, p99 against pre-release baseline. Watch for gradual degradation, not just spikes.
- [ ] Resource utilization: CPU, memory, connection pools. Leaks surface during bake time.
- [ ] Business metrics: Conversion rates, checkout completions, or domain-specific KPIs. Drops may not trigger alerts.
- [ ] Dependency health: Downstream service error rates and queue depths.

Include specific dashboard URLs and alert names so the engineer can check each item without searching.

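The first two checkpoints (error rates and latency against a baseline cohort) can be sketched with the standard library alone. The function names and the report shape below are hypothetical; in practice these numbers would come from the monitoring system rather than raw sample lists.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50, p95, and p99 for a list of latency samples (needs >= 2 samples)."""
    cuts = quantiles(samples_ms, n=100, method="inclusive")  # 99 cut points
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def compare_cohorts(canary_ms, baseline_ms, canary_errors, baseline_errors):
    """Report per-percentile latency ratios and error types seen only in the canary."""
    canary_p = latency_percentiles(canary_ms)
    baseline_p = latency_percentiles(baseline_ms)
    ratios = {k: canary_p[k] / baseline_p[k] for k in canary_p}
    new_error_types = sorted(set(canary_errors) - set(baseline_errors))
    return {"latency_ratio": ratios, "new_error_types": new_error_types}
```

A ratio above 1.0 at any percentile means the canary is slower than baseline at that percentile; a non-empty `new_error_types` list corresponds to the "immediate flag" case.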
### Rollback Triggers

Define explicit conditions that require rollback. Never leave this to judgment:

- Error rate exceeds 2x baseline for more than 5 minutes
- p99 latency exceeds 3x baseline
- Any data corruption or consistency issue detected
- Dependent service reports degradation traced to this release
- Feature flag kill switch fails to disable new behavior

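The two metric-based triggers above can be evaluated mechanically over a time-ordered series of samples. The sketch below assumes a hypothetical `Sample` record and covers only the error-rate and latency triggers; the remaining three depend on signals (data integrity checks, dependency reports, flag state) that are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    minute: float      # minutes since the stage started
    error_rate: float
    p99_ms: float


def should_roll_back(samples, baseline_error_rate, baseline_p99_ms,
                     error_factor=2.0, error_window_min=5.0, latency_factor=3.0):
    """Return the first trigger that fires, or None. Samples must be time-ordered."""
    breach_start = None  # start of the current run of error-rate breaches
    for s in samples:
        if s.p99_ms > latency_factor * baseline_p99_ms:
            return f"p99 latency exceeds {latency_factor:g}x baseline"
        if s.error_rate > error_factor * baseline_error_rate:
            if breach_start is None:
                breach_start = s.minute
            if s.minute - breach_start > error_window_min:
                return (f"error rate exceeds {error_factor:g}x baseline "
                        f"for more than {error_window_min:g} minutes")
        else:
            breach_start = None  # breach ended, so the window resets
    return None
```

A short spike that clears before the window elapses returns `None`, while a sustained breach returns the trigger text, which can be pasted directly into the incident ticket.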
Rollback procedure:

1. Halt rollout progression immediately
2. Route traffic back to previous version (feature flag off, revert deployment, or DNS switch)
3. Verify rollback by confirming metrics return to baseline within 10 minutes
4. Notify stakeholders with incident channel link
5. Create incident ticket with timeline and root cause hypothesis

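Step 3 of the procedure above (confirming metrics return to baseline within 10 minutes) can be sketched as a polling loop. Everything here is an assumption for illustration: `read_error_rate` is a hypothetical callable supplied by the caller, "return to baseline" is interpreted as within 10% of the baseline error rate, and the 30-second poll interval is arbitrary.

```python
import time


def verify_rollback(read_error_rate, baseline_error_rate, tolerance=1.1,
                    timeout_s=600, poll_s=30,
                    sleep=time.sleep, clock=time.monotonic):
    """Poll until the error rate is within tolerance of baseline or time runs out."""
    deadline = clock() + timeout_s
    while True:
        if read_error_rate() <= baseline_error_rate * tolerance:
            return True
        if clock() + poll_s > deadline:
            return False  # the next poll would land past the 10-minute limit
        sleep(poll_s)
```

The `sleep` and `clock` parameters exist only so the loop can be exercised with a fake clock in tests; production callers would leave them at their defaults.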
### Post-Release

- [ ] Verification: Metrics stable for 1 hour at 100%, smoke tests pass, no new alerts
- [ ] Communication: Stakeholders notified, release notes published, support team briefed on new behavior
- [ ] Cleanup: Feature flags scheduled for removal, old artifacts torn down, retrospective scheduled if high-risk

## Quality checklist

Before delivering the checklist, verify:

- [ ] Every rollout stage has specific traffic percentages, bake times, and metric thresholds
- [ ] Rollback triggers are measurable conditions, not subjective judgments
- [ ] Go/No-Go table covers CI, testing, rollback readiness, and stakeholder notification
- [ ] Monitoring checkpoints reference specific metrics with comparison baselines
- [ ] Post-release section includes verification, communication, and cleanup steps
- [ ] The checklist is scoped to one release, not a generic process document

## Common mistakes

- Vague rollback criteria. "Roll back if things look bad" is not a trigger. State the metric, threshold, and time window.
- Skipping bake time under pressure. Bake times exist to surface slow-burn issues like memory leaks and connection exhaustion. Cutting them short defeats the purpose of staged rollout.
- No baseline comparison. Metric thresholds mean nothing without a baseline. Always compare canary metrics against the existing production cohort, not against arbitrary numbers.
- Forgetting business metrics. A release can have zero errors and perfect latency while silently breaking checkout flows. Include domain-specific KPIs in monitoring checkpoints.
- Missing stakeholder communication. Engineering may know the release succeeded, but support, product, and leadership need explicit notification, especially if user-facing behavior changed.
- Treating the checklist as optional. If a Go/No-Go item is "No-Go," the release does not proceed. The checklist is a gate, not a suggestion.