Commit Graph

75 Commits

Author SHA1 Message Date
Tim Gross
26004c5407 vault: set renew increment to lease duration (#26041)
When we renew Vault tokens, we use the lease duration to determine how often to
renew. But we also set an `increment` value which is never updated from the
initial 30s. For periodic tokens this is not a problem because the `increment`
field is ignored on renewal. But for non-periodic tokens this prevents the token
TTL from being properly incremented. This behavior has been in place since the
initial Vault client implementation in #1606 but before the switch to workload
identity most (all?) tokens being created were periodic tokens so this was never
detected.

Fix this bug by updating the request's `increment` field to the lease duration
on each renewal.

Also switch out a `time.After` call in backoff of the derive token caller with a
safe timer so that we don't have to spawn a new goroutine per loop, and have
tighter control over when that's GC'd.

Ref: https://github.com/hashicorp/nomad/pull/1606
Ref: https://github.com/hashicorp/nomad/issues/25812
2025-06-13 13:50:54 -04:00
James Rasell
7268053174 vault: Remove legacy token based authentication workflow. (#25155)
The legacy workflow for Vault whereby servers were configured
using a token to provide authentication to the Vault API has now
been removed. This change also removes the workflow where servers
were responsible for deriving Vault tokens for Nomad clients.

The deprecated Vault config options used byi the Nomad agent have
all been removed except for "token" which is still in use by the
Vault Transit keyring implementation.

Job specification authors can no longer use the "vault.policies"
parameter and should instead use "vault.role" when not using the
default workload identity.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2025-02-28 07:40:02 +00:00
Tim Gross
b5faeff233 vault: fix bug in logging logic around renewals (#25040)
In #24409 we fixed a bug where some of the error messages we get from Vault
weren't being caught correctly. This fix itself contains a bug where we changed
the logic that logged the non-fatal errors so that it logs when there is no
renewal error.

Ref: https://github.com/hashicorp/nomad/pull/24409
Fixes: https://github.com/hashicorp/nomad/issues/24933
2025-02-07 08:45:33 -05:00
Matt Keeler
833e240597 Upgrade to using hashicorp/go-metrics@v0.5.4 (#24856)
* Upgrade to using hashicorp/go-metrics@v0.5.4

This also requires bumping the dependencies for:

* memberlist
* serf
* raft
* raft-boltdb
* (and indirectly hashicorp/mdns due to the memberlist or serf update)

Unlike some other HashiCorp products, Nomads root module is currently expected to be consumed by others. This means that it needs to be treated more like our libraries and upgrade to hashicorp/go-metrics by utilizing its compat packages. This allows those importing the root module to control the metrics module used via build tags.
2025-01-31 15:22:00 -05:00
Tim Gross
6be9a50626 vault: catch expired lease as fatal error (#24409)
When a Vault lease expires, it's revoked on the server and cannot be removed, so
this error should be treated as fatal.

The errors we get aren't wrapped by the Vault SDK, so unfortunately we have to
read the error messages and can't easily enumerate non-fatal error
messages (which might be bubbling up from the stdlib). I've audited the errors
currently used and have documented their source.

Ref 52ba156d47/vault/expiration.go (L1327)
Fixes: https://github.com/hashicorp/nomad/issues/23859
2024-11-18 09:12:35 -05:00
Tim Gross
18fdda6242 vault: fix namespace reset for clients with unset namespace (#23491)
The Vault "logical" API doesn't allow configuring the namespace on a per-request
basis. Instead, it's set on the client. Our `vaultclient` wrapper locks access
to the API client and sets the namespace (and token, if applicable) for each
request, and then resets the namespace and unlocks the API client.

The logic for resetting the namespace incorrectly assumed that if the Vault
configuration didn't set the namespace that it was canonicalized to the
non-empty string `"default"`. This results in the API client's namespace getting
"stuck" whenever a job uses a non-default namespace if the configuration value
is empty. Update the logic to always go back to the configuration, rather than
accepting the "previous" namespace from the caller.

This changeset also removes some long-dead code in the Vault client wrapper.

Fixes: https://github.com/hashicorp/nomad/issues/22230
Ref: https://hashicorp.atlassian.net/browse/NET-10207
2024-07-03 10:13:20 -04:00
Luiz Aoqui
62b7d6ffe9 vault: revert #18998 to fix potential deadlock (#19963)
* Revert "vault: always renew tokens using the renewal loop (#18998)"
  This reverts commit 7054fe1a8c.
* test: add case for concurrent Vault token renewal
2024-02-13 09:50:46 -05:00
Tim Gross
0935f443dc vault: support allowing tokens to expire without refresh (#19691)
Some users with batch workloads or short-lived prestart tasks want to derive a
Vaul token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.

When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.

Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.

Fixes: https://github.com/hashicorp/nomad/issues/8690
2024-01-10 14:49:02 -05:00
Luiz Aoqui
099ee06a60 Revert "deps: update go-metrics to v0.5.3 (#19190)" (#19374)
* Revert "deps: update go-metrics to v0.5.3 (#19190)"

This reverts commit ddb060d8b3.

* changelog: add entry for #19374
2023-12-08 08:46:55 -05:00
Luiz Aoqui
ddb060d8b3 deps: update go-metrics to v0.5.3 (#19190)
Update `go-metrics` to v0.5.3 to pick
https://github.com/hashicorp/go-metrics/pull/146.
2023-11-28 12:37:57 -05:00
Tim Gross
b5af87ebf3 set Vault namespace from task in vault_hook JWT login (#19080)
The JWT login codepath for the `vault_hook` was missing the Vault namespace, so
the login request for non-default namespaces would fail.
2023-11-14 09:54:36 -05:00
Luiz Aoqui
ab36cf031c vault: avoid continual renewal of invalid token (#18985)
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons, such as the alloc finished running and it's now
terminal or the token may have been change directly on Vault
out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

2d0cde4ccc/vault/expiration.go (L636-L639)

But https://github.com/hashicorp/vault/pull/5346 introduced an earlier
`nil` check that generates a different error message:

750ab337ea/vault/expiration.go (L1362-L1364)

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
2023-11-07 19:50:19 -05:00
Luiz Aoqui
7054fe1a8c vault: always renew tokens using the renewal loop (#18998)
Previously, a Vault token could renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

78f0c6b2a9/client/vaultclient/vaultclient.go (L557-L564)

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

1604dba508/client/vaultclient/vaultclient.go (L505-L510)

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
2023-11-07 19:49:33 -05:00
Luiz Aoqui
a907273557 vault: fix import cycle in vaultclient (#18965)
* Revert "vault: eliminate vaultclient test import cycle (#18652)"

This reverts commit 03cf9ae7ff.

* vault: remove import cycle in vaultclient_test.go
2023-11-02 11:07:04 -04:00
Michael Schurter
66fbc0f67e identity: default to RS256 for new workload ids (#18882)
OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256.

Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.

**Test Updates**

Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed.

I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package.

In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners.

**TODO**

- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.

**Requiem for ed25519**

When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset.

EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:

1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key.
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.

Life was good with ed25519, but sadly it could not last.

[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:

- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256

RS256 only:

- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
2023-10-31 11:25:20 -07:00
Luiz Aoqui
347389f9f9 vault: derive token using create_from_role (#18880)
Fallback to the ACL role defined in the client's `create_from_role`
configuration when using the JWT flow and the task does not specify a
role to use.
2023-10-27 13:03:44 -04:00
Luiz Aoqui
6d4b62200b log: add Consul and Vault cluster name to output (#18817)
Ensure Consul and Vault loggers have the cluster name as an attribute to
help differentiate log source.
2023-10-20 14:03:56 -04:00
Luiz Aoqui
349c032369 vault: update task runner vault hook to support workload identity (#18534) 2023-10-16 19:37:57 -04:00
Piotr Kazmierczak
03cf9ae7ff vault: eliminate vaultclient test import cycle (#18652)
Eliminates the vaultclient test import cycle by putting the test file into the
client package and making vaultclient objects public.

Ref hashicorp/team-nomad#404
2023-10-05 09:17:16 +02:00
Tim Gross
fdc6c2151d vault: select Vault API client by cluster name (#18533)
Nomad Enterprise will support configuring multiple Vault clients. Instead of
having a single Vault client field in the Nomad client, we'll have a function
that callers can parameterize by the Vault cluster name that returns the
correctly configured Vault API client wrapper.
2023-09-19 14:35:01 -04:00
hashicorp-copywrite[bot]
2d35e32ec9 Update copyright file headers to BUSL-1.1 2023-08-10 17:27:15 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Seth Hoenig
f05aa6d5ec vault: configure user agent on Nomad vault clients (#15745)
* vault: configure user agent on Nomad vault clients

This PR attempts to set the User-Agent header on each Vault API client
created by Nomad. Still need to figure a way to set User-Agent on the
Vault client created internally by consul-template.

* vault: fixup find-and-replace gone awry
2023-01-10 10:39:45 -06:00
Seth Hoenig
b242957990 ci: swap ci parallelization for unconstrained gomaxprocs 2022-03-15 12:58:52 -05:00
Kris Hicks
85ed8ddd4f Add gosimple linter (#9590) 2020-12-09 11:05:18 -08:00
Yoan Blanc
77cf2f0573 vendor: vault api and sdk
Signed-off-by: Yoan Blanc <yoan@dosimple.ch>
2020-03-21 17:57:48 +01:00
Seth Hoenig
674ccaa122 nomad: proxy requests for Service Identity tokens between Clients and Consul
Nomad jobs may be configured with a TaskGroup which contains a Service
definition that is Consul Connect enabled. These service definitions end
up establishing a Consul Connect Proxy Task (e.g. envoy, by default). In
the case where Consul ACLs are enabled, a Service Identity token is required
for these tasks to run & connect, etc. This changeset enables the Nomad Server
to recieve RPC requests for the derivation of SI tokens on behalf of instances
of Consul Connect using Tasks. Those tokens are then relayed back to the
requesting Client, which then injects the tokens in the secrets directory of
the Task.
2020-01-31 19:03:53 -06:00
Seth Hoenig
f8666bb1f9 client: enable nomad client to request and set SI tokens for tasks
When a job is configured with Consul Connect aware tasks (i.e. sidecar),
the Nomad Client should be able to request from Consul (through Nomad Server)
Service Identity tokens specific to those tasks.
2020-01-31 19:03:38 -06:00
Michael Schurter
523586a6e6 vault: remove dead lease code 2019-10-25 15:08:35 -07:00
Michael Schurter
b135d28450 vault: fix data races 2019-04-16 11:22:44 -07:00
Michael Schurter
0e6da17a8f vault: fix renewal time
Renewal time was being calculated as 10s+Intn(lease-10s), so the renewal
time could be very rapid or within 1s of the deadline: [10s, lease)

This commit fixes the renewal time by calculating it as:

	(lease/2) +/- 10s

For a lease of 60s this means the renewal will occur in [20s, 40s).
2019-04-16 11:22:44 -07:00
Chris Baker
2022db72b6 vault client test: minor formatting
vendor: using upstream circonus-gometrics
2019-04-10 10:34:10 -05:00
Chris Baker
20a3884559 docs: -vault-namespace, VAULT_NAMESPACE, and config
agent: added VAULT_NAMESPACE env-based configuration
2019-04-10 10:34:10 -05:00
Chris Baker
1349497152 config/docs: added namespace to vault config
server/client: process `namespace` config, setting on the instantiated vault client
2019-04-10 10:34:10 -05:00
Michael Schurter
b41308f16a tests: port TestTaskRunner_BlockForVault from 0.8
Also fix race conditions in the mock vault client.
2019-02-12 13:46:09 -08:00
Alex Dadgar
95297c608c goimports 2019-01-22 15:44:31 -08:00
Danielle Tomlinson
8a4ffea94a chore: Cleanup formatting 2019-01-17 18:43:13 +01:00
Danielle Tomlinson
dcce2d7247 vaultclient: use require for error assertions 2019-01-17 18:43:13 +01:00
Danielle Tomlinson
3078c24f79 vaultclient: Update tests for vault 1.0 2019-01-17 18:43:13 +01:00
Mahmood Ali
0fc84f4cfb address review comments 2018-11-20 17:10:54 -05:00
Mahmood Ali
88c1698ef5 Emit metric counters for Vault token and renewal failures 2018-11-20 17:10:54 -05:00
Mahmood Ali
feaf6214f9 Set User-Agent header when hitting Vault API 2018-11-20 17:10:54 -05:00
Alex Dadgar
f91b269b2a fix test compiling 2018-10-16 16:56:55 -07:00
Michael Schurter
9da25adc54 client: hclog-ify most of the client
Leaving fingerprinters in case that interface changes with plugins.
2018-10-16 16:53:30 -07:00
Alex Dadgar
98c7abe541 Tests only use testlog package logger 2018-06-13 15:40:56 -07:00
Michael Schurter
8da7335c16 non-Existent -> nonexistent
Reverting from #3963

https://www.merriam-webster.com/dictionary/existent
2018-03-12 11:59:33 -07:00
Josh Soref
02a8be09f9 spelling: semantics 2018-03-11 19:00:26 +00:00
Filip Ochnik
38996137cf Recognize renewing non-renewable Vault lease as fatal 2018-01-08 20:32:31 +01:00
Michael Schurter
04b8f8e7fc Remove structs import from api
Goes a step further and removes structs import from api's tests as well
by moving GenerateUUID to its own package.
2017-09-29 10:36:08 -07:00
Alex Dadgar
a9e3a41407 Enable more linters 2017-09-26 15:26:33 -07:00