This PR tweaks the linux cpu fingerprinter to handle the case where no
NUMA node data is found under /sys/devices/system/, in which case we
need to assume just one node, one socket.
Our codec code generation doesn't honor `json:"..."` tags which, if we were to
ever implement `json.Marshaler` for the `KeyEncryptionKeyWrapper` struct, would
break the on-disk format of all the existing KEKs.
As a precaution, add this struct to the code generator's ignore list (just like
we have done with `IdentityClaims`).
When a user performs a client API call, the Nomad client will
perform an RPC that looks up the ACL policies assigned to the caller's
ACL token. If the ACL token includes dangling (deleted)
policies, the call would previously fail with a permission denied
error.
This change ensures this error is not returned and that the lookup
will succeed in the event of dangling policies.
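The intended behavior can be sketched as a lookup that skips missing policies instead of failing. The `Policy` type, the map-backed store, and `resolvePolicies` are stand-ins invented for illustration; the real RPC and state store differ.

```go
package main

import "fmt"

// Policy is a stand-in for an ACL policy; the real Nomad type differs.
type Policy struct{ Name string }

// resolvePolicies returns the policies that still exist for a token.
// Names with no stored policy (dangling references) are skipped rather
// than causing the whole lookup to fail.
func resolvePolicies(tokenPolicies []string, store map[string]*Policy) []*Policy {
	found := make([]*Policy, 0, len(tokenPolicies))
	for _, name := range tokenPolicies {
		if p, ok := store[name]; ok {
			found = append(found, p)
		}
	}
	return found
}

func main() {
	store := map[string]*Policy{"readonly": {Name: "readonly"}}
	got := resolvePolicies([]string{"readonly", "deleted-policy"}, store)
	fmt.Println(len(got)) // prints 1
}
```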
An interactive setup helper for configuring Consul to accept Nomad WI-enabled workloads.
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
Our codec code generation doesn't honor `json:"..."` tags which breaks
the OIDC Discovery endpoint.
This adds the relevant struct to the code generator's ignore list (just
like we have done with IdentityClaims).
One of our core scheduler tests for GC checks that volumes with invalid
allocations immediately have those claims marked as past claims and are put
into the unpublishing state. This happens synchronously with the GC evaluation
processing, so there's no need for us to wait for the results.
Fixes: #18959
Before this commit, it would bring you to the list of allocations
filtered by status=starting. This status does not exist in the Status
drop-down on the Allocations section of a job in the UI.
The template hook emits an error when the task has a Consul block that requires
WI but there's no WI. The exact error message we get depends on whether we're
running in CE or ENT. Update the test assertion so that we can tolerate this
difference without building ENT-specific test files.
When looking up the Consul or Vault cluster from a client hook, we should always
use an accessor function rather than trying to look up the `Cluster` field, which
may be empty for jobs registered before Nomad 1.7.
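A rough sketch of such an accessor follows. The `Consul` type here is a simplified stand-in, and the `"default"` fallback name and the nil-receiver behavior are assumptions for illustration.

```go
package main

import "fmt"

// defaultCluster is the assumed name used when no cluster was recorded.
const defaultCluster = "default"

// Consul is a simplified stand-in for a job's Consul block.
type Consul struct {
	Cluster string
}

// GetCluster returns the configured cluster, falling back to the default
// when the block is nil or the field was never set (for example, on a job
// registered before the field existed).
func (c *Consul) GetCluster() string {
	if c == nil || c.Cluster == "" {
		return defaultCluster
	}
	return c.Cluster
}

func main() {
	old := &Consul{} // field never populated
	fmt.Println(old.GetCluster())
	fmt.Println((&Consul{Cluster: "east"}).GetCluster())
}
```

Callers that read `c.Cluster` directly would see an empty string for the older job, while the accessor always yields a usable name.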
* identity: support change_mode and change_signal
wip - just jobspec portion
* test struct
* cleanup some insignificant boogs
* actually implement change mode
* docs tweaks
* add changelog
* test identity.change_mode operations
* use more words in changelog
* job endpoint tests
* address comments from code review
---------
Co-authored-by: Tim Gross <tgross@hashicorp.com>
Allocations that were created before Nomad 1.7 will not have the cluster field
set for their Consul blocks. While this can be corrected server-side, that
doesn't help allocations already on clients.
* vault: remove `token_ttl` from `vaultcompat` setup
Since Nomad uses periodic tokens, the right value to set in the role is
`token_period`, not `token_ttl`.
* vault: set 1.11.0 as min version for JWT auth
In order to use workload identity JWT auth with Vault, it's required to
have a Vault cluster running v1.11.0+, which is the version where
`user_claim_json_pointer` was introduced.
Allocations that were created before Nomad 1.7 will not have the `cluster` field
set for their Vault blocks. While this can be corrected server-side, that
doesn't help allocations already on clients.
Also add extra safety on Consul cluster lookup.
When attempting a WebSocket connection upgrade the client may receive a
redirect request from the server, in which case the request should be
reattempted using the new address present in the `Location` header.
Vault tokens requested for WI are "periodic" Vault tokens (ones that get
periodically renewed). The field we should be setting for the renewal window is
`token_period`.
OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256.
Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.
**Test Updates**
Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed.
I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package.
In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners.
**TODO**
- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.
**Requiem for ed25519**
When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset.
EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:
1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key.
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.
Life was good with ed25519, but sadly it could not last.
[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:
- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256
RS256 only:
- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
Job submitters cannot set multiple identities prior to Nomad 1.7, and cluster
administrators should not set the identity configurations for their `consul` and
`vault` configuration blocks until all servers have been upgraded. Validate
these cases during job submission so as to prevent state store corruption when
jobs are submitted in the middle of a cluster upgrade.
The Consul and Vault integrations work shipping in Nomad 1.7 will deprecate the
existing token-based workflows. These will be removed in Nomad 1.9, so add a
note describing this to the upgrade guide.
The `group_service_hook` needs to supply the Consul service client with Consul
tokens for its services. The lookup in the hook resources was looking for the
wrong key. This would cause the service client to ignore the Consul token we've
received and use the agent's own token.
This changeset also moves the prefix formatting into the
`MakeUniqueIdentityName` method to reduce the risk of this kind of bug in the
future.
The allocrunner's `identity_hook` implements the interface for TaskStop, but
this interface is only ever called for task-level hooks. This results in a
leaked goroutine that tries to periodically renew WIs until the client shuts
down gracefully.
Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so
that whenever an allocation is stopped or garbage collected we stop renewing its
Workload Identities. This also requires making the `Shutdown` method of `WIDMgr`
safe to call multiple times.
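A common way to make a shutdown method idempotent in Go is `sync.Once`. The `renewer` type below is a hypothetical stand-in for the manager, shown only to illustrate the pattern; it is not the actual `WIDMgr` code.

```go
package main

import (
	"fmt"
	"sync"
)

// renewer is a hypothetical stand-in for a manager that runs a renewal
// loop until its stop channel is closed.
type renewer struct {
	stopCh   chan struct{}
	stopOnce sync.Once
}

func newRenewer() *renewer {
	return &renewer{stopCh: make(chan struct{})}
}

// Shutdown stops the renewal loop. Closing a channel twice panics, so the
// close is guarded by sync.Once to make repeated calls safe.
func (r *renewer) Shutdown() {
	r.stopOnce.Do(func() { close(r.stopCh) })
}

func main() {
	r := newRenewer()
	r.Shutdown() // e.g. from one hook
	r.Shutdown() // e.g. from a later hook; no panic
	<-r.stopCh   // returns immediately because the channel is closed
	fmt.Println("stopped")
}
```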
The WI we get for Consul services is saved to the client state DB like all other
WIs, but the resulting JWT is never exposed to the task secrets directory
because (a) it's only intended for use with Consul service configuration,
and (b) for group services it could be ambiguous which task to expose it to.
Add a note to the `consul.service_identity` docs that these fields are ignored.
Prior to `consul-template` v0.22.0, automatic PKI renewal wouldn't work properly
based on the expiration of the cert. More recent versions of `consul-template`
can use the expiry to refresh the cert, so it's no longer necessary (and in fact
generates extra load on Vault) to set `generate_lease`. Remove this
recommendation from the docs.
Fixes: #18893
The `BindName` for JWT authentication should always bind to the `nomad_service` field in the JWT and not include the namespace, as the `nomad_service` is what's actually registered in Consul.
* Fix the binding rule for the `consulcompat` test
* Add a reachability assertion so that we don't miss regressions.
* Ensure we have a clean shutdown so that we don't leak state (containers and iptables) between tests.
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placement calculation's assumption that the names of allocations being
replaced can be blindly copied to their replacements. This is not
correct in all cases, particularly when dealing with canaries.
The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicates are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
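The idea can be sketched with a small counter per index. The `nameIndex` type below is illustrative only and is not the scheduler's actual data structure; the job and group names are made up.

```go
package main

import "fmt"

// nameIndex is an illustrative tracker that records how many allocations
// claim each name index.
type nameIndex struct {
	used map[int]int
}

func newNameIndex() *nameIndex { return &nameIndex{used: make(map[int]int)} }

// add records one more allocation using idx.
func (n *nameIndex) add(idx int) { n.used[idx]++ }

// isDuplicate reports whether more than one allocation claims idx.
func (n *nameIndex) isDuplicate(idx int) bool { return n.used[idx] > 1 }

// next returns the lowest index that no allocation currently claims.
func (n *nameIndex) next() int {
	for i := 0; ; i++ {
		if n.used[i] == 0 {
			return i
		}
	}
}

// allocName formats a name in the "job.group[index]" form.
func allocName(job, group string, idx int) string {
	return fmt.Sprintf("%s.%s[%d]", job, group, idx)
}

func main() {
	n := newNameIndex()
	for _, idx := range []int{0, 1, 1} { // index 1 was copied to two allocations
		n.add(idx)
	}
	if n.isDuplicate(1) {
		// prints "replacement renamed to web.api[2]"
		fmt.Println("replacement renamed to", allocName("web", "api", n.next()))
	}
}
```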