mirror of
https://github.com/kemko/nomad.git
synced 2026-01-01 16:05:42 +03:00
If a CSI volume has terminal allocations, the volumewatcher will submit an `Unpublish` RPC. But the "past claim" we create is missing the "external" node identifier (ex. the AWS EC2 instance ID). The unpublish RPC can tolerate this if the node still exists in the state store, but if the node has been GC'd the controller unpublish step will return an error.

By that point we've already checkpointed the unpublish workflow, which triggers a notification on the volumewatcher. This results in the volumewatcher getting into a tight loop of retries. Unfortunately, even if we somehow break the loop (perhaps because we hit a different code path), we'll kick off this loop again after a leader election when we spin up the volumewatchers again.

This changeset includes the following:

* Fix the primary bug by including the external node ID when creating a "past claim" for a terminal allocation.
* If we can't look up the external ID because there's no external node ID and the node no longer exists, abandon it in the same way that we do for the node unpublish step.
* Rate limit the volumewatcher loop so that any future bugs of this type don't cause a tight loop.
* Remove some dead code found while working on this.

Fixes: https://github.com/hashicorp/nomad/issues/25349
Ref: https://hashicorp.atlassian.net/browse/NET-12298
4 lines
120 B
Plaintext
```release-note:bug
csi: Fixed a bug where cleaning up volume claims on GC'd nodes would cause errors on the leader
```