nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-02 16:35:44 +03:00

Author	SHA1	Message	Date
James Rasell	4c4cb2c6ad	agent: Fix misaligned contextual k/v logging arguments. (#25629 ) Arguments passed to hclog log lines should always have an even number to provide the expected k/v output.	2025-04-10 14:40:21 +01:00
Tim Gross	e168548341	provide allocrunner hooks with prebuilt taskenv and fix mutation bugs (#25373 ) Some of our allocrunner hooks require a task environment for interpolating values based on the node or allocation. But several of the hooks accept an already-built environment or builder and then keep that in memory. Both of these retain a copy of all the node attributes and allocation metadata, which balloons memory usage until the allocation is GC'd. While we'd like to look into ways to avoid keeping the allocrunner around entirely (see #25372), for now we can significantly reduce memory usage by creating the task environment on-demand when calling allocrunner methods, rather than persisting it in the allocrunner hooks. In doing so, we uncover two other bugs: * The WID manager, the group service hook, and the checks hook have to interpolate services for specific tasks. They mutated a taskenv builder to do so, but each time they mutate the builder, they write to the same environment map. When a group has multiple tasks, it's possible for one task to set an environment variable that would then be interpolated in the service definition for another task if that task did not have that environment variable. Only the service definition interpolation is impacted. This does not leak env vars across running tasks, as each taskrunner has its own builder. To fix this, we move the `UpdateTask` method off the builder and onto the taskenv as the `WithTask` method. This makes a shallow copy of the taskenv with a deep clone of the environment map used for interpolation, and then overwrites the environment from the task. * The checks hook interpolates Nomad native service checks only on `Prerun` and not on `Update`. This could cause unexpected deregistration and registration of checks during in-place updates. To fix this, we make sure we interpolate in the `Update` method. I also bumped into an incorrectly implemented interface in the CSI hook. I've pulled that and some better guardrails out to https://github.com/hashicorp/nomad/pull/25472. Fixes: https://github.com/hashicorp/nomad/issues/25269 Fixes: https://hashicorp.atlassian.net/browse/NET-12310 Ref: https://github.com/hashicorp/nomad/issues/25372	2025-03-24 12:05:04 -04:00
James Rasell	7268053174	vault: Remove legacy token based authentication workflow. (#25155 ) The legacy workflow for Vault whereby servers were configured using a token to provide authentication to the Vault API has now been removed. This change also removes the workflow where servers were responsible for deriving Vault tokens for Nomad clients. The deprecated Vault config options used byi the Nomad agent have all been removed except for "token" which is still in use by the Vault Transit keyring implementation. Job specification authors can no longer use the "vault.policies" parameter and should instead use "vault.role" when not using the default workload identity. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>	2025-02-28 07:40:02 +00:00
Tim Gross	d56e8ad1aa	WI: ensure Consul hook and WID manager interpolate services (#20344 ) Services can have some of their string fields interpolated. The new Workload Identity flow doesn't interpolate the services before requesting signed identities or using those identities to get Consul tokens. Add support for interpolation to the WID manager and the Consul tokens hook by providing both with a taskenv builder. Add an "interpolate workload" field to the WI handle to allow passing the original workload name to the server so the server can find the correct service to sign. This changeset also makes two related test improvements: * Remove the mock WID manager, which was only used in the Consul hook tests and isn't necessary so long as we provide the real WID manager with the mock signer and never call `Run` on it. It wasn't feasible to exercise the correct behavior without this refactor, as the mocks were bypassing the new code. * Fixed swapped expect-vs-actual assertions on the `consul_hook` tests. Fixes: https://github.com/hashicorp/nomad/issues/20025	2024-04-11 15:40:28 -04:00
Luiz Aoqui	b61a31c38f	chore: remove comment about WI change mode (#19047 ) Identity change mode was implemented in #18943 and handles the update at the task level, so workload identity manager receives the update as expected.	2023-11-09 11:06:03 -05:00
Michael Schurter	66fbc0f67e	identity: default to RS256 for new workload ids (#18882 ) OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256. Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again. Test Updates Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed. I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package. In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners. TODO - Docs and changelog entries. - Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days. - Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this. Requiem for ed25519 When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset. EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties: 1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key. 2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious. 3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032). 4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus! 5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too. Life was good with ed25519, but sadly it could not last. [IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256: - [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256 - [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256 RS256 only: - [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration) - [GitLab](https://gitlab.com/.well-known/openid-configuration) - [Google](https://accounts.google.com/.well-known/openid-configuration) - [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration) - [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration) - [SalesForce](https://login.salesforce.com/.well-known/openid-configuration) - [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/) - [TFC](https://app.terraform.io/.well-known/openid-configuration)	2023-10-31 11:25:20 -07:00
Tim Gross	f0330d6df1	`identity_hook`: implement PreKill hook, not TaskStop hook (#18913 ) The allocrunner's `identity_hook` implements the interface for TaskStop, but this interface is only ever called for task-level hooks. This results in a leaked goroutine that tries to periodically renew WIs until the client shuts down gracefully. Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so that whenever an allocation is stopped or garbage collected we stop renewing its Workload Identities. This also requires making the `Shutdown` method of `WIDMgr` safe to call multiple times.	2023-10-30 10:54:22 -04:00
Piotr Kazmierczak	16d71582f6	client: `consul_hook` tests (#18780 ) ref https://github.com/hashicorp/team-nomad/issues/404	2023-10-18 20:02:35 +02:00
Luiz Aoqui	349c032369	vault: update task runner vault hook to support workload identity (#18534 )	2023-10-16 19:37:57 -04:00
Tim Gross	3633ca0f8c	auth: add client-only ACL (#18730 ) The RPC handlers expect to see `nil` ACL objects whenever ACLs are disabled. By using `nil` as a sentinel value, we have the risk of nil pointer exceptions and improper handling of `nil` when returned from our various auth methods that can lead to privilege escalation bugs. This is the third in a series to eliminate the use of `nil` ACLs as a sentinel value for when ACLs are disabled. This patch involves creating a new "virtual" ACL object for checking permissions on client operations and a matching `AuthenticateClientOnly` method for client-only RPCs that can produce that ACL. Unlike the server ACLs PR, this also includes a special case for "legacy" client RPCs where the client was not previously sending the secret as it should (leaning on mTLS only). Those client RPCs were fixed in Nomad 1.6.0, but it'll take a while before we can guarantee they'll be present during upgrades. Ref: https://github.com/hashicorp/nomad-enterprise/pull/1218 Ref: https://github.com/hashicorp/nomad/pull/18703 Ref: https://github.com/hashicorp/nomad/pull/18715 Ref: https://github.com/hashicorp/nomad/pull/16799	2023-10-12 12:21:48 -04:00
Tim Gross	e22c5b82f3	WID manager: request signed identities for services (#18650 ) Includes changes to WID Manager that make it request signed identities for services, as well as a few improvements to WIHandle introduced in #18672. --------- Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>	2023-10-11 12:07:16 +02:00
Tim Gross	928a82a184	WID manager: save and restore signed WIs from client state DB (#18661 ) When clients are restarted and the identity hook runs when we restore allocations, the running allocations are likely to have already-signed Workload Identities that are unexpired. Save these to the client's local state DB so that we can avoid a thundering herd of RPCs during client restart. When we restore, we'll check if there's at least one expired signed WI before making any initial signing request. Included: * Renames `getIdentities` to `getInitialIdentities` to make the workflow more clear. * Renames the existing `widmgr_test.go` file of integration tests, which is in its own package to avoid circular imports to `widmgr_int_test.go`	2023-10-09 09:16:23 -04:00
Piotr Kazmierczak	597d835220	wi: introduce workload identity handler (#18672 ) Any code that tracks workloads and their identities should not rely on string comparisons, especially since we support 2 types of workload identities: those that identify tasks and those that identify services. This means we cannot rely on task.Name for workload-identity pairs. The new type structs.WIHandle solves this problem by providing a uniform way of identifying workloads and their identities.	2023-10-06 18:32:47 +02:00
Piotr Kazmierczak	5dab41881b	client: new consul_hook (#18557 ) This PR introduces a new allocrunner-level consul_hook which iterates over services and tasks, if their provider is consul, fetches consul tokens for all of them, stores them in AllocHookResources and in task secret dirs. Ref: hashicorp/team-nomad#404 --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-09-29 17:41:48 +02:00
Piotr Kazmierczak	0a75a42d94	WI: WIDMgr should expose default identity signatures (#18610 ) Since the identity_hook is meant to be the central place that makes signed identities available to other hooks, it should also expose the default identity that is signed by the plan applier. Ref: hashicorp/team-nomad#404	2023-09-29 15:17:59 +02:00
Piotr Kazmierczak	86d2cdcf80	client: split identity_hook across allocrunner and taskrunner (#18431 ) This commit splits identity_hook between the allocrunner and taskrunner. The allocrunner-level part of the hook signs each task identity, and the taskrunner-level part picks it up and stores secrets for each task. The code revamps the WIDMgr, which is now split into 2 interfaces: IdentityManager which manages renewals of signatures and handles sending updates to subscribers via Watch method, and IdentitySigner which only does the signing. This work is necessary for having a unified Consul login workflow that comes with the new Consul integration. A new, allocrunner-level consul_hook will now be the only hook doing Consul authentication.	2023-09-21 17:31:27 +02:00
James Rasell	a9d5beb141	test: use correct parallel test setup func (#18326 )	2023-08-25 13:51:36 +01:00
Michael Schurter	0e22fc1a0b	identity: add support for multiple identities + audiences (#18123 ) Allows for multiple `identity{}` blocks for tasks along with user-specified audiences. This is a building block to allow workload identities to be used with Consul, Vault and 3rd party JWT based auth methods. Expiration is still unimplemented and is necessary for JWTs to be used securely, so that's up next. --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-08-15 09:11:53 -07:00

18 Commits