problem
Resetting the shared stage branch is currently manual and disruptive. Engineers have to ask in a channel before resetting to avoid interrupting active testing, which slows recovery when stage is broken.
In practice, most resets should proceed immediately, but there is no lightweight mechanism to protect the minority of PRs that are actively testing.
solution
Introduce an automated stage-reset workflow with default-allow behavior and explicit opt-out:
- Add a new PR label: `stage-reset-blocked`
- When a stage reset is requested, automation checks open PRs labeled `on stage`
- If no PR has `stage-reset-blocked`, automatically reset `stage` to `main`, then re-merge `on stage` PR branches in deterministic order
- If one or more PRs have `stage-reset-blocked`, skip reset and notify owners/Slack with blocked PR list
- Always post a completion/status summary to Slack and keep a full audit trail in Actions logs
This reduces coordination cost in the common case while preserving safety for active testers.
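The default-allow decision described above can be sketched in a few lines of Python. This is only an illustration, not the workflow's actual code: `PullRequest`, `ResetDecision`, and `decide_reset` are hypothetical names, the blocked check is assumed to apply only to PRs that also carry the on-stage label, and "deterministic order" is assumed here to mean ascending PR number (the proposal does not specify an ordering).

```python
from dataclasses import dataclass

ON_STAGE = "on stage"
BLOCKED = "stage-reset-blocked"


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical minimal view of an open PR."""
    number: int
    labels: frozenset[str]


@dataclass(frozen=True)
class ResetDecision:
    proceed: bool
    on_stage: list[int]  # PR numbers to re-merge, in merge order
    blocked: list[int]   # PR numbers that opted out of the reset


def decide_reset(open_prs: list[PullRequest], override: bool = False) -> ResetDecision:
    # Only PRs labeled as on stage take part in the decision.
    # Assumption: order by ascending PR number to make runs repeatable.
    on_stage = sorted(
        (pr for pr in open_prs if ON_STAGE in pr.labels),
        key=lambda pr: pr.number,
    )
    blocked = [pr.number for pr in on_stage if BLOCKED in pr.labels]
    # Default-allow: proceed unless at least one PR opted out,
    # or a maintainer explicitly overrides the block.
    proceed = not blocked or override
    return ResetDecision(proceed, [pr.number for pr in on_stage], blocked)
```

When `proceed` is false, the `blocked` list is what would be reported to owners and Slack; when it is true, `on_stage` gives the re-merge order.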
technical
Implement via GitHub Actions + existing merge-bot conventions:
- Trigger paths
  - Manual `workflow_dispatch` for reset requests (initial rollout)
  - Optional automatic trigger on merge-bot conflict / failed stage deploy (later)
- Data collection
  - Query open PRs with label `on stage`
  - Partition into blocked/unblocked based on `stage-reset-blocked`
- Reset execution rules
  - Proceed only when blocked list is empty (or when explicit maintainer override input is set)
  - Reset branch: align `stage` to `main`
  - Re-merge each `on stage` branch back to `stage`
  - Push branch and report per-PR merge outcomes
- Notifications
  - Pre-action Slack message (reason, on-stage PRs, blocked PRs)
  - Post-action Slack message (success/failure, re-applied PRs, failures requiring manual follow-up)
- Guardrails
  - Dry-run mode for initial rollout
  - Protected release-window bypass rules
  - Full run logs and deterministic ordering for repeatability
- Docs alignment
  - Keep `apps/docs/docs/04-engineering-practices/02-deployment/index.md` in sync with workflow behavior
  - Include policy examples for when to apply/remove `stage-reset-blocked`
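
As a rough illustration of the reset execution rules and the dry-run guardrail, the sketch below drives plain `git` commands from Python. It rests on several assumptions that are not part of the proposal: the remote is named `origin`, `reset_stage` and `run_git` are hypothetical helpers, a failed merge is aborted and recorded rather than stopping the run, and the final push uses `--force-with-lease`. The caller is expected to pass branches already in the agreed deterministic order.

```python
import subprocess
from typing import Callable, Sequence

Runner = Callable[[Sequence[str]], int]


def run_git(args: Sequence[str]) -> int:
    """Run one git command and return its exit code."""
    return subprocess.run(["git", *args], check=False).returncode


def reset_stage(
    branches: Sequence[str],
    dry_run: bool = True,
    run: Runner = run_git,
) -> dict[str, str]:
    """Return a per-branch outcome: 'planned', 'merged', or 'failed'."""
    if dry_run:
        # Dry-run mode: report what would be re-merged, touch nothing.
        return {branch: "planned" for branch in branches}

    # Align stage to main (assumes the remote is called origin).
    if run(["fetch", "origin"]) != 0:
        raise RuntimeError("git fetch failed")
    if run(["checkout", "-B", "stage", "origin/main"]) != 0:
        raise RuntimeError("could not align stage to main")

    outcomes: dict[str, str] = {}
    for branch in branches:  # merged in the order given by the caller
        if run(["merge", "--no-ff", "--no-edit", f"origin/{branch}"]) == 0:
            outcomes[branch] = "merged"
        else:
            # Leave stage clean and flag the PR for manual follow-up.
            run(["merge", "--abort"])
            outcomes[branch] = "failed"

    if run(["push", "--force-with-lease", "origin", "stage"]) != 0:
        raise RuntimeError("push to stage was rejected")
    return outcomes
```

The returned mapping is the kind of per-PR result the post-action Slack message could summarize. Injecting `run` keeps the function testable without a real repository.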
Related context: ENG-3570