Files
nomad/.changelog/19689.txt
Luiz Aoqui 5267eec3ad vault: fix token revocation during workflow migration (#19689)
When transitioning from the legacy token-based workflow to the new JWT
workflow for Vault the previous code would instantiate a no-op Vault if
the server configuration had a `default_identity` block.

This no-op client returned an error for some of its operations were
called, such as `LookupToken` and `RevokeTokens`. The original intention
was that, in the new JWT workflow, none of these methods should be
called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even
in the JWT flow. When a leadership transition happens, the new server
looks for unused Vault accessors from state and tries to revoke them.
Similarly, the `RevokeTokens` method is called every time the
`Node.UpdataStatus` and `Node.UpdateAlloc` RPCs are made by clients, as
the Nomad server tries to find unused Vault tokens for the node/alloc.

Since the new JWT flow does not require Nomad servers to contact Vault,
calling `RevokeTokens` and `MarkForRevocation` is not able to complete
without a Vault token, so this commit changes the logic to use the no-op
Vault client when no token is configured. It also updates the client
itself to not error if these methods are called, but to rather just log
so operators can be made aware that there are Vault tokens created by
Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based
flow, Nomad operators must first upgrade the Nomad version without
removing any of the existing Vault configuration. Doing so can prevent
Nomad servers from managing and cleaning-up existing Vault tokens during
a leadership transition and node or alloc updates.

Operators must also resubmit all jobs with a `vault` block so they are
updated with an `identity` for Vault. Skipping this step may cause
allocations to fail if their Vault token expires (if, for example, the
Nomad client stops running for TTL/2) or if they are rescheduled, since
the new client will try to follow the legacy flow which will fail if the
Nomad server configuration for Vault has already been updated to remove
the Vault address and token.
2024-01-10 13:28:46 -05:00

172 B