When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606 but before the switch to workload
identity most (all?) tokens being created were periodic tokens so this was never
detected.
Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.
Also switch out a `time.After` call in backoff of the derive token caller with a
safe timer so that we don't have to spawn a new goroutine per loop, and have
tighter control over when that's GC'd.
Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.
The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.
Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
* Upgrade to using hashicorp/go-metrics@v0.5.4
This also requires bumping the dependencies for:
* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)
Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
When a Vault lease expires, it's revoked on the server and cannot be removed, so
this error should be treated as fatal.
The errors we get aren't wrapped by the Vault SDK, so unfortunately we have to
read the error messages and can't easily enumerate non-fatal error
messages (which might be bubbling up from the stdlib). I've audited the errors
currently used and have documented their source.
Ref 52ba156d47/vault/expiration.go (L1327)
Fixes: https://github.com/hashicorp/nomad/issues/23859
The Vault "logical" API doesn't allow configuring the namespace on a per-request
basis. Instead, it's set on the client. Our `vaultclient` wrapper locks access
to the API client and sets the namespace (and token, if applicable) for each
request, and then resets the namespace and unlocks the API client.
The logic for resetting the namespace incorrectly assumed that if the Vault
configuration didn't set the namespace that it was canonicalized to the
non-empty string `"default"`. This results in the API client's namespace getting
"stuck" whenever a job uses a non-default namespace if the configuration value
is empty. Update the logic to always go back to the configuration, rather than
accepting the "previous" namespace from the caller.
This changeset also removes some long-dead code in the Vault client wrapper.
Fixes: https://github.com/hashicorp/nomad/issues/22230
Ref: https://hashicorp.atlassian.net/browse/NET-10207
* Revert "vault: always renew tokens using the renewal loop (#18998)"
This reverts commit 7054fe1a8c.
* test: add case for concurrent Vault token renewal
Some users with batch workloads or short-lived prestart tasks want to derive a
Vaul token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.
When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.
Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.
Fixes: https://github.com/hashicorp/nomad/issues/8690
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been change directly on Vault
out-of-band.
Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.
This commit prevents the retries by making the error `invalid lease ID`
a fatal error.
In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:
2d0cde4ccc/vault/expiration.go (L636-L639)
But https://github.com/hashicorp/vault/pull/5346 introduced an earlier
`nil` check that generates a different error message:
750ab337ea/vault/expiration.go (L1362-L1364)
Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
Previously, a Vault token could renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.
But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.
78f0c6b2a9/client/vaultclient/vaultclient.go (L557-L564)
If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.
1604dba508/client/vaultclient/vaultclient.go (L505-L510)
To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
Eliminates the vaultclient test import cycle by putting the test file into the
client package and making vaultclient objects public.
Ref hashicorp/team-nomad#404
Nomad Enterprise will support configuring multiple Vault clients. Instead of
having a single Vault client field in the Nomad client, we'll have a function
that callers can parameterize by the Vault cluster name that returns the
correctly configured Vault API client wrapper.
* vault: configure user agent on Nomad vault clients
This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.
* vault: fixup find-and-replace gone awry
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)
This commit fixes the renewal time by calculating it as:
(lease/2) +/- 10s
For a lease of 60s this means the renewal will occur in [20s, 40s).
This PR fixes a race condition in which the client was not locked while
deriving Vault tokens. This allowed the token to be set which would
cause subsequent Vault requests to fail with permission denied because
the incorrect Vault token was being used.
Further this PR makes the unsetting and unlocking of the client atomic
to avoid an even harder to hit race condition (not sure it was ever hit
but was still incorrect).