Canary Release Playbook

Step-by-step playbook for canary releases — expose 1–5% of traffic to a new feature, monitor behavior, then expand or roll back with confidence.

Prerequisites

  • A flag created and published to the target environment
  • ZENMANAGE_SERVER_KEY exported in your server environment, or VITE_ZENMANAGE_CLIENT_KEY in your front-end build
  • Baseline error rate, latency (p95/p99), and business KPI metrics captured before the rollout begins
  • An observability stack (logging, metrics, alerting) wired to the service that evaluates the flag
  • A rollback owner identified — the person who will make the go/no-go decision at each checkpoint
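As a rough illustration of the baseline item above, the sketch below condenses pre-rollout samples into the numbers the later gates compare against. The function names, the dictionary keys, and the nearest-rank percentile method are assumptions made for this example; your metrics platform may compute percentiles differently.

```python
import statistics


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    # Integer ceiling of pct * n / 100, which avoids float rounding surprises.
    rank = max(1, (pct * len(ordered) + 99) // 100)
    return ordered[rank - 1]


def summarize_baseline(error_rates, latencies_ms):
    """Condense pre-rollout samples into a baseline record.

    error_rates: one error-rate sample per interval (for example, per hour).
    latencies_ms: raw request latencies in milliseconds.
    """
    return {
        "error_rate_mean": statistics.mean(error_rates),
        "error_rate_stdev": statistics.stdev(error_rates),
        "latency_p95_ms": percentile(latencies_ms, 95),
        "latency_p99_ms": percentile(latencies_ms, 99),
    }
```

Storing the mean and the standard deviation together is what makes the "within one standard deviation" and "more than two standard deviations" checks later in this playbook computable.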

What is a canary release?

A canary release exposes new behavior to a small slice of production traffic — typically 1–5 % — while the remaining users continue on the current code path. The term comes from the coal-mining practice of using canaries to detect danger before it reaches the wider group.

In Zenmanage, a canary release is a Progressive Rollout set to a low initial percentage. You use the same rollout mechanics — deterministic bucketing, manual or automatic expansion, pause and complete actions — but you apply a stricter monitoring protocol and explicit go/no-go decision gates before widening exposure.

This playbook focuses on the canary-specific workflow: how to size the initial cohort, how long to soak, what to watch, and when to expand or abort. For a broader walkthrough of progressive rollout mechanics, see the Progressive Rollouts Playbook.
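To make "deterministic bucketing" concrete, here is a generic sketch of the technique. This playbook does not describe Zenmanage's actual hashing scheme, so the SHA-256 hash, the `flag_key:user_id` input format, and the 10,000-bucket resolution below are all assumptions chosen for illustration.

```python
import hashlib

BUCKETS = 10_000  # assumed resolution: one bucket equals 0.01 % of users


def bucket_for(flag_key: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..BUCKETS-1."""
    digest = hashlib.sha256(f"{flag_key}:{user_id}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % BUCKETS


def in_rollout(flag_key: str, user_id: str, percentage: float) -> bool:
    """Return True when the user's bucket falls below the rollout threshold."""
    threshold = round(percentage * BUCKETS / 100)
    return bucket_for(flag_key, user_id) < threshold
```

Two properties of this style of bucketing matter for a canary: the same user always gets the same answer at a given percentage, and a user who is in the cohort at 2 % is still in it at 10 %, because raising the percentage only raises the threshold.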

Phase 1 — Configure the canary

  1. Choose the initial percentage. Start between 1 % and 5 %. Use the lower end (1–2 %) for high-risk changes that touch payment flows, data integrity, or cross-service contracts. Use the higher end (3–5 %) when the change is lower risk and you need enough traffic to generate statistically meaningful signal within a reasonable soak window.
  2. Select Manual mode. Canary releases benefit from human judgment at each gate. Manual mode ensures no automatic expansion happens while you are still evaluating the canary window. You can switch to Automatic mode later once the canary phase is complete and you are ready for broader rollout.
  3. Set the rollout value. Pick the flag value that enables the new behavior. Users outside the canary cohort continue to receive the current published default.
  4. Confirm your inline default. Your application code should supply a default that matches the current safe behavior. This ensures that users outside the canary, and all users during a network interruption, are unaffected.
  5. Publish or schedule the draft. The rollout moves to Active state. Zenmanage immediately begins serving the new value to the configured percentage of users via deterministic bucketing.
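The inline default from step 4 can be pictured as a thin wrapper around flag evaluation. This is a minimal sketch, not the Zenmanage SDK: `fetch_value` is a hypothetical callable standing in for whatever call your integration uses to evaluate the flag.

```python
def value_with_inline_default(fetch_value, inline_default):
    """Return the evaluated flag value, or the safe default on any failure.

    fetch_value: hypothetical zero-argument callable that evaluates the flag.
    inline_default: the value that matches the current safe behavior.
    """
    try:
        value = fetch_value()
    except Exception:
        # Deliberately broad: any evaluation failure (network, auth, parsing)
        # should leave the user on the current safe code path.
        return inline_default
    return inline_default if value is None else value
```

The important design choice is that the default describes the current behavior, not the new one, so an outage during the canary degrades to "nothing changed" rather than to full exposure.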

Why Manual mode for canaries

Automatic mode is designed for steady expansion at a predetermined pace. In a canary release, you want to pause at the initial percentage and make an explicit go/no-go decision before any expansion happens. Manual mode gives you that gate by design.

Phase 2 — Monitor the canary window

The canary window is the observation period at the initial low percentage. This is the most important phase — its entire purpose is to surface regressions before they reach most users.

Recommended monitoring duration

| Traffic pattern | Minimum soak | Recommended soak |
|---|---|---|
| High-traffic consumer product (continuous requests) | 4–6 hours | 24 hours (one full diurnal cycle) |
| Business SaaS (weekday-heavy usage) | 24 hours | 48 hours (captures peak and off-peak) |
| Low-traffic or batch-oriented service | 48 hours | 72 hours or one full processing cycle |
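One way to keep the soak window honest is to encode the durations above and check elapsed time before anyone opens the go/no-go discussion. The pattern keys and status strings below are made up for this sketch, and the high-traffic minimum uses the lower bound of the 4–6 hour range.

```python
from datetime import datetime, timedelta

# (minimum soak, recommended soak) per traffic pattern, taken from the table above.
SOAK_WINDOWS = {
    "high_traffic_consumer": (timedelta(hours=4), timedelta(hours=24)),
    "business_saas": (timedelta(hours=24), timedelta(hours=48)),
    "low_traffic_or_batch": (timedelta(hours=48), timedelta(hours=72)),
}


def soak_status(pattern: str, started_at: datetime, now: datetime) -> str:
    """Classify how far the canary is through its soak window."""
    minimum, recommended = SOAK_WINDOWS[pattern]
    elapsed = now - started_at
    if elapsed < minimum:
        return "below_minimum"
    if elapsed < recommended:
        return "minimum_met"
    return "recommended_met"
```

A status of `below_minimum` means the go criteria cannot all be true yet, whatever the metrics look like.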

What to watch

  • Error rate: compare the canary cohort error rate against the baseline captured before launch. A rise of more than two standard deviations above baseline is the strongest signal to pause; a rise between one and two standard deviations is a reason to keep watching rather than promote.
  • Latency: focus on p95 and p99, not averages. The new code path may be slower only for a subset of requests, and averages will hide that.
  • Business KPIs: if the flag touches a user journey, monitor conversion, activation, or transaction events. A drop in conversion correlated with the rollout start is a signal even when error rates look clean.
  • Support volume: watch for a spike in support tickets or in-app feedback. User-visible regressions often surface in support channels before they appear in server metrics.
  • Downstream dependencies: if the new code path calls a different service or external API, check that service's health dashboard for correlated changes.
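For the business KPI check, a drop in conversion is only meaningful if it is larger than sampling noise, which matters because a 1–5 % cohort is small. The sketch below uses a standard one-sided two-proportion z-test with a normal approximation; the function name and its parameters are illustrative, and your analytics tooling may already provide an equivalent test.

```python
import math


def conversion_drop_p_value(control_conversions, control_users,
                            canary_conversions, canary_users):
    """One-sided p-value for 'canary converts worse than control'.

    Small values suggest the observed drop is unlikely to be noise.
    Uses the pooled two-proportion z-test (normal approximation).
    """
    p_control = control_conversions / control_users
    p_canary = canary_conversions / canary_users
    pooled = (control_conversions + canary_conversions) / (control_users + canary_users)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / control_users + 1 / canary_users))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a drop
    z = (p_canary - p_control) / std_err
    # Standard normal CDF evaluated at z.
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

With a few thousand canary users the test has limited power, so a non-significant result is not proof that the KPI is unaffected; it is one more reason to respect the full soak window.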

Do not shorten the canary window under pressure

The purpose of the canary is to catch problems on a small population. Cutting the soak window short because "it looks fine so far" defeats the point. Stick to your predetermined monitoring duration unless you find a clear reason to extend it.

Phase 3 — Go/No-go decision

At the end of the canary window, the rollback owner makes an explicit decision. This is the gate that separates a canary release from a regular rollout.

Go criteria (all must be true)

  • Error rate for the canary cohort is within one standard deviation of the pre-rollout baseline.
  • p95 and p99 latency have not degraded beyond your team's SLO threshold.
  • No correlated increase in support tickets or user-reported issues.
  • Business KPIs (conversion, activation, revenue) show no statistically significant regression.
  • Downstream service health is stable — no new error patterns or capacity warnings.
  • The canary soak duration met or exceeded the minimum for your traffic pattern.

No-go criteria (any one triggers rollback)

  • Error rate exceeds two standard deviations above baseline and correlates with the rollout window.
  • Latency regression exceeds SLO thresholds and is attributable to the new code path.
  • Any data integrity issue — even a single confirmed case — is detected.
  • Support reports describe a symptom that maps to the feature under the flag.
  • A dependent service is experiencing issues that make it impossible to evaluate the canary cleanly.

When in doubt, extend — do not promote

If the data is inconclusive — not clearly good, not clearly bad — the correct action is to extend the canary window, not to promote. Gather more data until the decision is unambiguous.
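The gate logic in this phase (all go criteria, any no-go criterion, otherwise extend) can be written down as a small function so the decision is reproducible. This is a sketch under simplifying assumptions: the field names are invented for the example, and judgments such as "attributable to the new code path" are assumed to have been made by a person before the booleans are filled in.

```python
from dataclasses import dataclass


@dataclass
class CanaryObservation:
    error_rate_sigmas: float  # (canary rate - baseline mean) / baseline stdev
    latency_within_slo: bool
    support_volume_normal: bool
    kpis_stable: bool
    downstream_stable: bool
    minimum_soak_met: bool
    data_integrity_issue: bool = False
    support_reports_match_feature: bool = False
    downstream_blocks_evaluation: bool = False


def decide(obs: CanaryObservation) -> str:
    """Return 'go', 'no-go', or 'extend' for the current canary window."""
    no_go_triggers = [
        obs.error_rate_sigmas > 2,
        not obs.latency_within_slo,
        obs.data_integrity_issue,
        obs.support_reports_match_feature,
        obs.downstream_blocks_evaluation,
    ]
    if any(no_go_triggers):
        return "no-go"
    go_criteria = [
        obs.error_rate_sigmas <= 1,
        obs.latency_within_slo,
        obs.support_volume_normal,
        obs.kpis_stable,
        obs.downstream_stable,
        obs.minimum_soak_met,
    ]
    # Anything that is neither a clear go nor a clear no-go extends the window.
    return "go" if all(go_criteria) else "extend"
```

Note that an error rate between one and two standard deviations above baseline lands in `extend`, which matches the guidance to gather more data rather than promote.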

Phase 4 — Expand the rollout

After a Go decision, begin widening the percentage. A common expansion schedule for a canary that started in the 1–5 % range:

| Stage | Percentage | Soak before next stage |
|---|---|---|
| Canary | 1–5 % | 24–48 hours (canary window) |
| Early adopters | 10 % | 24 hours |
| Broader validation | 25 % | 24 hours |
| Majority | 50 % | 24 hours |
| Full rollout | 100 % | Monitor for 24–48 hours, then Complete |

At each stage, briefly re-evaluate the go/no-go criteria before advancing. As the percentage increases, the blast radius of a regression grows — the soak time at each stage is more important than how fast you advance.
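To see why soak time matters more at later stages, it helps to put numbers on the blast radius. The sketch below encodes the stage percentages from the schedule above; the function names and the 200,000-user figure in the usage note are illustrative only.

```python
from typing import Optional

# Percentages that follow the canary stage in the schedule above.
EXPANSION_STAGES = [10, 25, 50, 100]


def next_percentage(current: float) -> Optional[int]:
    """Return the next stage above the current percentage, or None at 100 %."""
    for pct in EXPANSION_STAGES:
        if pct > current:
            return pct
    return None


def exposed_users(total_users: int, percentage: float) -> int:
    """Approximate number of users served the new value at a given percentage."""
    return round(total_users * percentage / 100)
```

For a hypothetical product with 200,000 active users, `exposed_users(200_000, 2)` is 4,000 and `exposed_users(200_000, 50)` is 100,000, so a regression that slips past the 25 % stage reaches 25 times as many people as one caught in a 2 % canary.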

Once you pass the canary phase and are ready for steady expansion, you may optionally switch from Manual to Automatic mode and let Zenmanage advance through the remaining stages on a schedule.

Rollback instructions

If a No-go decision is made at any point during the canary or expansion phases:

  1. Pause the rollout immediately. This stops the new value from being served to any user in the rollout. All users revert to the pre-rollout published default.
  2. Assess impact. Use the rollout percentage and deterministic bucketing to identify which users were in the canary cohort. Check logs and metrics scoped to that cohort.
  3. Decide: fix or abandon. If the issue is minor and can be patched quickly, fix the code, deploy the fix, then Resume the rollout. If the issue is severe or complex, Complete the rollout at 0 % (effectively ending it) or delete the rollout configuration entirely.
  4. For critical incidents, use the kill switch. If the regression is severe enough that you need immediate, global mitigation (not just stopping the rollout), flip the flag's published target to the safe value using the kill-switch pattern. This overrides the rollout and restores safe behavior for all users instantly.
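For the "assess impact" step, one practical way to scope logs to the canary cohort is to record the evaluated flag value on every request and group by it afterwards. The sketch below assumes exactly that: each log record is a dict with a `flag_value` field and an HTTP-style `status` field, and a status of 500 or above counts as an error. None of these field names come from Zenmanage.

```python
def error_rate_by_flag_value(records, flag_field="flag_value"):
    """Group request records by evaluated flag value and compute error rates.

    records: iterable of dicts with an evaluated flag value and a 'status' code.
    Returns a dict mapping each flag value to its error rate (0.0 to 1.0).
    """
    totals = {}
    errors = {}
    for record in records:
        key = record[flag_field]
        totals[key] = totals.get(key, 0) + 1
        if record["status"] >= 500:  # assumed definition of an error
            errors[key] = errors.get(key, 0) + 1
    return {key: errors.get(key, 0) / totals[key] for key in totals}
```

Comparing the rate for the new value against the rate for the default value over the same time window separates a canary regression from a platform-wide incident.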

Pause first, investigate second

Do not spend time diagnosing a regression while the canary is still active. Pause removes the new code path from production traffic, giving you time to investigate without the issue continuing to affect users.

Quick-reference decision matrix

| Situation | Action |
|---|---|
| All go criteria met after canary soak | Advance to next stage (10 %) |
| Metrics inconclusive (not clearly good or bad) | Extend canary window; do not promote |
| Error rate spike of unknown cause | Pause immediately; investigate before deciding |
| Error rate spike confirmed unrelated to the rollout | Resume; document the unrelated cause in the rollout notes |
| Any no-go criterion triggered | Pause; assess and either fix-then-resume or abandon |
| Critical data integrity or security issue | Kill switch to revert immediately; do not just pause |
| Feature stable at 100 % for > 48 hours | Complete rollout; schedule code and flag cleanup |
