Canary Release Playbook
Step-by-step playbook for canary releases: expose 1–5 % of traffic to a new feature, monitor behavior, then expand or roll back with confidence.
Prerequisites
- A flag created and published to the target environment
- ZENMANAGE_SERVER_KEY exported in your server environment, or VITE_ZENMANAGE_CLIENT_KEY in your front-end build
- Baseline error rate, latency (p95/p99), and business KPI metrics captured before the rollout begins
- An observability stack (logging, metrics, alerting) wired to the service that evaluates the flag
- A rollback owner identified — the person who will make the go/no-go decision at each checkpoint
What is a canary release?
A canary release exposes new behavior to a small slice of production traffic — typically 1–5 % — while the remaining users continue on the current code path. The term comes from the coal-mining practice of using canaries to detect danger before it reaches the wider group.
In Zenmanage, a canary release is a Progressive Rollout set to a low initial percentage. You use the same rollout mechanics — deterministic bucketing, manual or automatic expansion, pause and complete actions — but you apply a stricter monitoring protocol and explicit go/no-go decision gates before widening exposure.
This playbook focuses on the canary-specific workflow: how to size the initial cohort, how long to soak, what to watch, and when to expand or abort. For a broader walkthrough of progressive rollout mechanics, see the Progressive Rollouts Playbook.
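Deterministic bucketing is what keeps a user in the same cohort from request to request. The sketch below illustrates the general technique with a hash of the flag key and user ID. It is not Zenmanage's actual algorithm: the `bucket` and `in_rollout` helpers, the SHA-256 choice, and the key format are all assumptions made for illustration.

```python
import hashlib


def bucket(flag_key: str, user_id: str) -> float:
    """Map a (flag, user) pair to a stable value in [0, 100).

    Hypothetical helper: the hash function and key format are assumptions.
    """
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as an unsigned integer, then scale it
    # to a percentage with two decimal places of resolution.
    n = int.from_bytes(digest[:8], "big")
    return (n % 10_000) / 100.0


def in_rollout(flag_key: str, user_id: str, percentage: float) -> bool:
    """Return True when the user's bucket falls below the rollout percentage."""
    return bucket(flag_key, user_id) < percentage
```

Because the bucket depends only on the flag key and the user ID, a user who is in the cohort at 2 % is still in it at 10 %. Widening the percentage adds users without reshuffling the ones already exposed.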
Phase 1 — Configure the canary
- 1. Choose the initial percentage. Start between 1 % and 5 %. Use the lower end (1–2 %) for high-risk changes that touch payment flows, data integrity, or cross-service contracts. Use the higher end (3–5 %) when the change is lower risk and you need enough traffic to generate statistically meaningful signal within a reasonable soak window.
- 2. Select Manual mode. Canary releases benefit from human judgment at each gate. Manual mode ensures no automatic expansion happens while you are still evaluating the canary window. You can switch to Automatic mode later once the canary phase is complete and you are ready for broader rollout.
- 3. Set the rollout value. Pick the flag value that enables the new behavior. Users outside the canary cohort continue to receive the current published default.
- 4. Confirm your inline default. Your application code should supply a default that matches the current safe behavior. This ensures that users outside the canary — and users during any network interruption — are unaffected.
- 5. Publish or schedule the draft. The rollout moves to Active state. Zenmanage immediately begins serving the new value to the configured percentage of users via deterministic bucketing.
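Step 4 can be sketched as a thin wrapper around whatever flag evaluation call your application uses. The `evaluate` callable here is a stand-in, not a Zenmanage SDK function, and `SAFE_DEFAULT` and `flag_value` are hypothetical names.

```python
from typing import Callable

# The current, known-good behavior. Assumed here to be a boolean flag.
SAFE_DEFAULT = False


def flag_value(
    evaluate: Callable[[str, str], bool],
    flag_key: str,
    user_id: str,
    default: bool = SAFE_DEFAULT,
) -> bool:
    """Evaluate a flag, falling back to the safe default on any failure.

    `evaluate` is a placeholder for your SDK call (an assumption).
    """
    try:
        return evaluate(flag_key, user_id)
    except Exception:
        # A network interruption or SDK error must not change behavior:
        # the user stays on the current code path.
        return default
```

The design choice is that a failure to evaluate never turns the new behavior on. The canary cohort only grows through successful evaluations.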
Why Manual mode for canaries
Phase 2 — Monitor the canary window
The canary window is the observation period at the initial low percentage. This is the most important phase — its entire purpose is to surface regressions before they reach most users.
Recommended monitoring duration
| Traffic pattern | Minimum soak | Recommended soak |
|---|---|---|
| High-traffic consumer product (continuous requests) | 4–6 hours | 24 hours (one full diurnal cycle) |
| Business SaaS (weekday-heavy usage) | 24 hours | 48 hours (captures peak and off-peak) |
| Low-traffic or batch-oriented service | 48 hours | 72 hours or one full processing cycle |
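Soak time and initial percentage interact: a smaller cohort needs a longer window to produce the same number of observations. The arithmetic below is a rough sketch that assumes requests are spread across users roughly in proportion to the rollout percentage. The function names and the traffic figures in the usage note are illustrative only.

```python
def expected_canary_requests(
    requests_per_hour: float, percentage: float, soak_hours: float
) -> int:
    """Estimate how many requests the canary cohort serves during the soak."""
    return round(requests_per_hour * percentage / 100.0 * soak_hours)


def hours_to_reach(
    target_requests: float, requests_per_hour: float, percentage: float
) -> float:
    """Estimate the soak hours needed to observe `target_requests` in the canary."""
    canary_per_hour = requests_per_hour * percentage / 100.0
    if canary_per_hour <= 0:
        raise ValueError("canary traffic must be greater than zero")
    return target_requests / canary_per_hour
```

For example, a service handling 50,000 requests per hour with a 2 % canary sees about 24,000 canary requests in a 24-hour soak. A service handling 500 requests per hour at the same percentage would need about 1,000 hours to observe 10,000 canary requests, which suggests starting nearer 5 % or accepting a smaller sample.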
What to watch
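- Error rate: compare the canary cohort error rate against the baseline captured before launch. A rise of more than one standard deviation above baseline fails the go criteria; a rise of more than two standard deviations that correlates with the rollout window is the strongest signal to pause.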
- Latency: focus on p95 and p99, not averages. The new code path may only be slower for a subset of requests that averages will hide.
- Business KPIs: if the flag touches a user journey, monitor conversion, activation, or transaction events. A drop in conversion correlated with the rollout start is a signal even when error rates look clean.
- Support volume: watch for a spike in support tickets or in-app feedback. User-visible regressions often surface in support channels before they appear in server metrics.
- Downstream dependencies: if the new code path calls a different service or external API, check that service's health dashboard for correlated changes.
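The error-rate and latency checks above can be sketched with the standard library. This is a minimal illustration, assuming you can export baseline error rates (for example, one value per hour before launch) and raw latency samples from your observability stack. `z_score` and `percentile` are hypothetical helpers, and the nearest-rank percentile method is one of several reasonable choices.

```python
import math
import statistics
from typing import Sequence


def z_score(canary_rate: float, baseline_rates: Sequence[float]) -> float:
    """Distance of the canary error rate from the baseline mean, in standard deviations.

    Needs at least two baseline values to compute a sample standard deviation.
    """
    mean = statistics.mean(baseline_rates)
    stdev = statistics.stdev(baseline_rates)
    if stdev == 0:
        # A perfectly flat baseline: any difference is infinitely many deviations away.
        if canary_rate == mean:
            return 0.0
        return math.inf if canary_rate > mean else -math.inf
    return (canary_rate - mean) / stdev


def percentile(samples: Sequence[float], pct: float) -> float:
    """Nearest-rank percentile, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]
```

Computing p95 and p99 from raw samples, rather than reading an average, is what exposes a slow path that only affects a subset of requests.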
Do not shorten the canary window under pressure
Phase 3 — Go/No-go decision
At the end of the canary window, the rollback owner makes an explicit decision. This is the gate that separates a canary release from a regular rollout.
Go criteria (all must be true)
- Error rate for the canary cohort is within one standard deviation of the pre-rollout baseline.
- p95 and p99 latency have not degraded beyond your team's SLO threshold.
- No correlated increase in support tickets or user-reported issues.
- Business KPIs (conversion, activation, revenue) show no statistically significant regression.
- Downstream service health is stable — no new error patterns or capacity warnings.
- The canary soak duration met or exceeded the minimum for your traffic pattern.
No-go criteria (any one triggers rollback)
- Error rate exceeds two standard deviations above baseline and correlates with the rollout window.
- Latency regression exceeds SLO thresholds and is attributable to the new code path.
- Any data integrity issue — even a single confirmed case — is detected.
- Support reports describe a symptom that maps to the feature under the flag.
- A dependent service is experiencing issues that make it impossible to evaluate the canary cleanly.
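The two lists above can be encoded as a small decision function so that the gate is applied the same way at every checkpoint. This is a sketch, not a substitute for the rollback owner's judgment: the `CanarySignals` fields and the `decide` function are hypothetical, attribution and correlation are assumed to have been assessed by a human before the booleans are set, and an unstable downstream service is treated as a no-go for simplicity.

```python
from dataclasses import dataclass


@dataclass
class CanarySignals:
    error_z: float                       # error rate vs. baseline, in standard deviations
    latency_within_slo: bool             # p95/p99 within the team's SLO threshold
    data_integrity_issue: bool           # any confirmed case
    support_reports_match_feature: bool  # a reported symptom maps to the flagged feature
    support_volume_up: bool              # correlated increase in tickets or feedback
    kpi_regression: bool                 # statistically significant KPI drop
    downstream_stable: bool              # no new error patterns or capacity warnings
    soak_hours: float
    min_soak_hours: float


def decide(s: CanarySignals) -> str:
    """Return "go", "no-go", or "extend" for the current checkpoint."""
    # Any single no-go criterion triggers rollback.
    if (
        s.error_z > 2.0
        or not s.latency_within_slo
        or s.data_integrity_issue
        or s.support_reports_match_feature
        or not s.downstream_stable
    ):
        return "no-go"
    # All go criteria must hold; anything in between means more soak time.
    all_go = (
        s.error_z <= 1.0
        and not s.support_volume_up
        and not s.kpi_regression
        and s.soak_hours >= s.min_soak_hours
    )
    return "go" if all_go else "extend"
```

The third outcome matters: an error rate between one and two standard deviations, or a soak that has not yet reached its minimum, neither passes the go criteria nor triggers a rollback, so the function returns "extend".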
When in doubt, extend — do not promote
Phase 4 — Expand the rollout
After a Go decision, begin widening the percentage. A common expansion schedule for a canary that started in the 1–5 % range:
| Stage | Percentage | Soak before next stage |
|---|---|---|
| Canary | 1–5 % | 24–48 hours (canary window) |
| Early adopters | 10 % | 24 hours |
| Broader validation | 25 % | 24 hours |
| Majority | 50 % | 24 hours |
| Full rollout | 100 % | Monitor for 24–48 hours, then Complete |
At each stage, briefly re-evaluate the go/no-go criteria before advancing. As the percentage increases, the blast radius of a regression grows — the soak time at each stage is more important than how fast you advance.
Once you pass the canary phase and are ready for steady expansion, you may optionally switch from Manual to Automatic mode and let Zenmanage advance through the remaining stages on a schedule.
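To see what the schedule above means on a calendar, the stages can be laid out from a start time. The sketch below assumes a 2 % canary and uses the upper end of each soak range from the table. The `STAGES` list and `plan` function are hypothetical, and the resulting dates are the earliest possible ones, reached only if every go/no-go gate passes without an extension.

```python
from datetime import datetime, timedelta
from typing import List, Tuple

# (stage name, percentage, soak hours before the next stage)
STAGES: List[Tuple[str, int, int]] = [
    ("Canary", 2, 48),
    ("Early adopters", 10, 24),
    ("Broader validation", 25, 24),
    ("Majority", 50, 24),
    ("Full rollout", 100, 48),
]


def plan(
    start: datetime, stages: List[Tuple[str, int, int]] = STAGES
) -> Tuple[List[Tuple[str, int, datetime]], datetime]:
    """Return each stage's earliest start time and the earliest Complete time."""
    timeline = []
    current = start
    for name, percentage, soak_hours in stages:
        timeline.append((name, percentage, current))
        current += timedelta(hours=soak_hours)
    return timeline, current
```

With these soak values the full sequence takes 168 hours, so a canary that starts on a Monday morning can be completed no earlier than the following Monday morning.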
Rollback instructions
If a No-go decision is made at any point during the canary or expansion phases:
- 1. Pause the rollout immediately. This stops the new value from being served to any user in the rollout. All users revert to the pre-rollout published default.
- 2. Assess impact. Use the rollout percentage and deterministic bucketing to identify which users were in the canary cohort. Check logs and metrics scoped to that cohort.
- 3. Decide: fix or abandon. If the issue is minor and can be patched quickly, fix the code, deploy the fix, then Resume the rollout. If the issue is severe or complex, Complete the rollout at 0 % (effectively ending it) or delete the rollout configuration entirely.
- 4. For critical incidents, use the kill switch. If the regression is severe enough that you need immediate, global mitigation — not just stopping the rollout — flip the flag's published target to the safe value using the kill-switch pattern. This overrides the rollout and restores safe behavior for all users instantly.
Pause first, investigate second
Quick-reference decision matrix
| Situation | Action |
|---|---|
| All go criteria met after canary soak | Advance to next stage (10 %) |
| Metrics inconclusive — not clearly good or bad | Extend canary window; do not promote |
| Error rate spike of unknown cause | Pause immediately; investigate before deciding |
| Error rate spike confirmed unrelated to the rollout | Resume; document the unrelated cause in notes |
| Any no-go criterion triggered | Pause; assess and either fix-then-resume or abandon |
| Critical data integrity or security issue | Kill switch to revert immediately; do not just pause |
| Feature stable at 100 % for > 48 hours | Complete rollout; schedule code and flag cleanup |
Related resources
- Progressive Rollouts Playbook — full lifecycle walkthrough covering both Manual and Automatic modes, deterministic bucketing, and the complete/archive flow.
- Progressive Rollouts Readiness Checklist — pre-rollout checklist covering flag setup, monitoring, rollback planning, and team communication.
- Kill Switch and Incident Rollback — the fastest path from incident detection to mitigation without a redeploy.
- Debugging and Observability — investigation workflows for when canary metrics show unexpected behavior.
- Analytics and Reporting — how to use the last evaluation date and missing default value reports to verify rollout health.
- Canary Releases use case — solution overview of how progressive rollouts power a canary release strategy.
- Launch Readiness Checklist — end-to-end launch preparation covering documentation, support readiness, communication, and post-launch monitoring.