nomad/.changelog/25726.txt
Tim Gross 5208ad4c2c scheduler: allow canaries to be migrated on node drain (#25726)
When a draining node is running canaries that are not yet healthy, the canaries
may not be migrated properly and the deployment halts. This happens only if
there are more than `migrate.max_parallel` canaries on the node and the canaries
are not yet healthy (e.g. they have a long `update.min_healthy_time`). In this
circumstance, the drainer correctly marks the first batch of canaries for
migration. But the reconciler then counts these migrated canaries against the
total number of expected canaries and no longer progresses the deployment.
Because too few allocations have reported that they are healthy, the deployment
cannot be promoted.

When the reconciler looks for canaries to cancel, it leaves any canaries that
are already terminal in the list (because there shouldn't be any work to do for
them). But this skips creating a new canary to replace a terminal canary that
has been marked for migration. Add a conditional for this case that removes the
canary from the list of active canaries so that it can be replaced.
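
The conditional described above can be sketched roughly as follows. This is a
simplified illustration, not the reconciler's actual code: the `Alloc` type, its
fields, and the `splitCanaries` and `canariesToPlace` helpers are hypothetical
names made up for this example.

```go
package main

import "fmt"

// Alloc is a hypothetical, simplified stand-in for an allocation.
type Alloc struct {
	ID              string
	Terminal        bool // the allocation has already stopped
	MarkedToMigrate bool // the drainer marked the allocation for migration
}

// splitCanaries partitions canaries into those that still count toward the
// expected number of canaries and those that need a replacement. Terminal
// canaries normally stay in the active list, since there is no work to do for
// them. The extra condition pulls out terminal canaries that were marked for
// migration so that a new canary can be placed for each of them.
func splitCanaries(canaries []Alloc) (active, replace []Alloc) {
	for _, c := range canaries {
		if c.Terminal && c.MarkedToMigrate {
			replace = append(replace, c)
			continue
		}
		active = append(active, c)
	}
	return active, replace
}

// canariesToPlace returns how many new canaries are needed to reach the
// desired count once migrated canaries no longer count as active.
func canariesToPlace(desired int, canaries []Alloc) int {
	active, _ := splitCanaries(canaries)
	if n := desired - len(active); n > 0 {
		return n
	}
	return 0
}

func main() {
	canaries := []Alloc{
		{ID: "c1", Terminal: true, MarkedToMigrate: true}, // drained from the node
		{ID: "c2"},
		{ID: "c3"},
	}
	fmt.Println(canariesToPlace(3, canaries)) // prints 1
}
```

Without the `MarkedToMigrate` check, `c1` would stay in the active list and the
sketch would report that no new canaries are needed, which mirrors the stuck
deployment described above.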

Ref: https://hashicorp.atlassian.net/browse/NMD-560
Fixes: https://github.com/hashicorp/nomad/issues/17842
2025-04-24 09:24:28 -04:00

```release-note:bug
scheduler: Fixed a bug where draining a node with canaries could result in a stuck deployment
```