nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-04 17:35:43 +03:00

Author	SHA1	Message	Date
Luiz Aoqui	db5ffde2b7	client: prevent start on cgroups init error (#19915 ) The Nomad client expects certain cgroups paths to exist in order to manage tasks. These paths are created when the agent first starts, but if this process fails the agent would just log the error and proceed with its initialization, despite not being able to run tasks. This commit surfaces the errors back to the client initialization so the process can stop early and make clear to operators that something went wrong.	2024-02-09 13:45:29 -05:00
Tim Gross	62c57d208b	fingerprint: eliminate spurious warning logs with Consul CE (#19923 ) Support for fingerprinting the Consul admin partition was added in #19485. But when the client fingerprints Consul CE, it gets a valid fingerprint and working Consul but with a warn-level log. Return "ok" from the partition extractor, but also ensure that we only add the Consul attribute if it actually has a value. Fixes: https://github.com/hashicorp/nomad/issues/19756	2024-02-09 08:19:00 -05:00
hc-github-team-nomad-core	33f0a5b268	Prepare for next release	2024-02-08 10:40:24 -05:00
hc-github-team-nomad-core	875e96cccc	Generate files for 1.7.4 release	2024-02-08 10:40:24 -05:00
Tim Gross	df86503349	template: sandbox template rendering The Nomad client renders templates in the same privileged process used for most other client operations. During internal testing, we discovered that a malicious task can create a symlink that can cause template rendering to read and write to arbitrary files outside the allocation sandbox. Because the Nomad agent can be restarted without restarting tasks, we can't simply check that the path is safe at the time we write without encountering a time-of-check/time-of-use race. To protect Nomad client hosts from this attack, we'll now read and write templates in a subprocess: * On Linux/Unix, this subprocess is sandboxed via chroot to the allocation directory. This requires that Nomad is running as a privileged process. A non-root Nomad agent will warn that it cannot sandbox the template renderer. * On Windows, this process is sandboxed via a Windows AppContainer which has been granted access only to the allocation directory. This does not require special privileges on Windows. (Creating symlinks in the first place can be prevented by running workloads as non-Administrator or non-ContainerAdministrator users.) Both sandboxes cause encountered symlinks to be evaluated in the context of the sandbox, which will result in a "file not found" or "access denied" error, depending on the platform. This change will also require an update to Consul-Template to allow callers to inject a custom `ReaderFunc` and `RenderFunc`. This design is intended as a workaround to allow us to fix this bug without creating backwards compatibility issues for running tasks. A future version of Nomad may introduce a read-only mount specifically for templates and artifacts so that tasks cannot write into the same location that the Nomad agent is. Fixes: https://github.com/hashicorp/nomad/issues/19888 Fixes: CVE-2024-1329	2024-02-08 10:40:24 -05:00
Tim Gross	0d3cd1427f	migration: check symlink sources during archive unpack During allocation directory migration, the client was not checking that any symlinks in the archive aren't pointing to somewhere outside the allocation directory. While task driver sandboxing will protect against processes inside the task from reading/writing thru the symlink, this doesn't protect against the client itself from performing unintended operations outside the sandbox. This changeset includes two changes: * Update the archive unpacking to check the source of symlinks and require that they fall within the sandbox. * Fix a bug in the symlink check where it was using `filepath.Rel` which doesn't work for paths in the sibling directories of the sandbox directory. This bug doesn't appear to be exploitable but caused errors in testing. Fixes: https://github.com/hashicorp/nomad/issues/19887	2024-02-08 10:40:24 -05:00
Juana De La Cuesta	120c3ca3c9	Add granular control of SELinux labels for host mounts (#19839 ) Add new configuration option on task's volume_mounts, to give a fine grained control over SELinux "z" label * Update website/content/docs/job-specification/volume_mount.mdx Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> * fix: typo * func: make volume mount verification happen even on mounts with no volume --------- Co-authored-by: Luiz Aoqui <luiz@hashicorp.com> Co-authored-by: Tim Gross <tgross@hashicorp.com>	2024-02-05 10:05:33 +01:00
Tim Gross	334c383eb6	template: run template tests on Windows where possible (#19856 ) We don't run the whole suite of unit tests on all platforms to keep CI times reasonable, so the only things we've been running on Windows are platform-specific. I'm working on some platform-specific `template` related work and having these tests run on Windows will reduce the risk of regressions. Our Windows CI box doesn't have Consul or Vault, so I've skipped those tests for the time being, and can follow up with that later. There's also a test with assertions looking for specific paths, and the results are different on Windows. I've skipped those for the moment as well and will follow up under a separate PR. Also swap `testify` for `shoenig/test`	2024-02-02 09:22:03 -05:00
Michael Schurter	8f564182ef	connect: rewrite envoy bootstrap on every restart (#19787 ) Fixes #19781 Do not mark the envoy bootstrap hook as done after successfully running once. Since the bootstrap file is written to /secrets, which is a tmpfs on supported platforms, it is not persisted across reboots. This causes the task and allocation to fail on reboot (see #19781). This fixes it by always rewriting the envoy bootstrap file every time the Nomad agent starts. This does mean we may write a new bootstrap file to an already running Envoy task, but in my testing that doesn't have any impact. This commit doesn't necessarily fix every use of Done by hooks, but hopefully improves the situation. The comment on Done has been expanded to hopefully avoid misuse in the future. Done assertions were removed from tests as they add more noise than value. Alternative 1: Use a regular file An alternative approach would be to write the bootstrap file somewhere other than the tmpfs, but this is unsafe as when Consul ACLs are enabled the file will contain a secret token: https://developer.hashicorp.com/consul/commands/connect/envoy#bootstrap Alternative 2: Detect if file is already written An alternative approach would be to detect if the bootstrap file exists, and only write it if it doesn't. This is just a more complicated form of the current fix. I think in general in the absence of other factors task hooks should be idempotent and therefore able to rerun on any agent startup. This simplifies the code and our ability to reason about task restarts vs agent restarts vs node reboots by making them all take the same code path.	2024-01-24 11:26:31 -08:00
Seth Hoenig	5b7f4746ce	client/allocdir: use an interface in place of AllocDir structs (#19703 ) * client/allocdir: use an interface in place of AllocDir structs This PR replaces allocdir.AllocDir with allocdir.Interface such that we may eventually have another implementation of alloc directories. This is in support of the exec2 driver, which will need an implementation of the alloc directory incompatible with the current version. use rlock	2024-01-12 14:13:29 -06:00
Tim Gross	0935f443dc	vault: support allowing tokens to expire without refresh (#19691 ) Some users with batch workloads or short-lived prestart tasks want to derive a Vault token, use it, and then allow it to expire without requiring a constant refresh. Add the `vault.allow_token_expiration` field, which works only with the Workload Identity workflow and not the legacy workflow. When set to true, this disables the client's renewal loop in the `vault_hook`. When Vault revokes the token lease, the token will no longer be valid. The client will also now automatically detect if the Vault auth configuration does not allow renewals and will disable the renewal loop automatically. Note this should only be used when a secret is requested from Vault once at the start of a task or in a short-lived prestart task. Long-running tasks should never set `allow_token_expiration=true` if they obtain Vault secrets via `template` blocks, as the Vault token will expire and the template runner will continue to make failing requests to Vault until the `vault_retry` attempts are exhausted. Fixes: https://github.com/hashicorp/nomad/issues/8690	2024-01-10 14:49:02 -05:00
Marvin Chin	be8575a8a2	Fix server shutdown not waiting for worker run completion (#19560 ) * Move group into a separate helper module for reuse * Add shutdownCh to worker The shutdown channel is used to signal that worker has stopped. * Make server shutdown block on workers' shutdownCh * Fix waiting for eval broker state change blocking indefinitely There was a race condition in the GenericNotifier between the Run and WaitForChange functions, where WaitForChange blocks trying to write to a full unsubscribeCh, but the Run function never reads from the unsubscribeCh as it has already stopped. This commit fixes it by unblocking if the notifier has been stopped. * Bound the amount of time server shutdown waits on worker completion * Fix lostcancel linter error * Fix worker test using unexpected worker constructor * Add changelog --------- Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>	2024-01-05 08:45:07 -06:00
David Ventura	fb43b14fb0	Mark CGroups as off when missing essential controllers (#19176 )	2023-12-15 11:20:52 -05:00
Piotr Kazmierczak	f1fb51422b	client: consul hook not called for templates (#19490 ) Due to some refactoring mishap, task-level Consul hook was never triggered and thus never wrote any secrets in task secret dirs.	2023-12-15 17:16:00 +01:00
Tim Gross	2e33115c15	consul: fingerprint Consul Enterprise admin partitions (#19485 ) Consul Enterprise agents all belong to an admin partition. Fingerprint this attribute when available. When a Consul agent is not explicitly configured with "default" it is in the default partition but will not report this in its `/v1/agent/self` endpoint. Fallback to "default" when missing only for Consul Enterprise. This feature provides users the ability to add constraints for jobs to land on Nomad nodes that have a Consul in that partition. Or it can allow cluster administrators to pair Consul partitions 1:1 with Nomad node pools. We'll also have the option to implement a future `partition` field in the jobspec's `consul` block to create an implicit constraint. Ref: https://github.com/hashicorp/nomad/issues/13139#issuecomment-1856479581	2023-12-15 09:26:25 -05:00
Seth Hoenig	6e4d57b330	numalib: provide a fallback for topology scanning on linux (#19457 ) * numalib: provide a fallback for topology scanning on linux * numalib: better package var names * cl: add cl * lint: fix my sloppy code * cl: fixup wording	2023-12-13 13:06:30 -06:00
Piotr Kazmierczak	b6dd376100	numa: account for incorrect core number on topology.insert (#19383 ) Unsupported environments like containers or guest OSs inside LXD can incorrectly report the number of available cores, thus leading to numalib having trouble detecting cores and panicking. This code adds tests for linux sysfs detection methods and fixes the panic.	2023-12-13 17:40:26 +01:00
Luiz Aoqui	0bc822db40	vault: load default config for tasks without vault (#19439 ) It is often expected that a task that needs access to Vault defines a `vault` block to specify the Vault policy to use to derive a token. But in some scenarios, like when the Nomad client is connected to a local Vault agent that is responsible for authn/authz, the task is not required to define a `vault` block. In these situations, the `default` Vault cluster should be used to render the template.	2023-12-12 14:06:55 -05:00
Luiz Aoqui	099ee06a60	Revert "deps: update go-metrics to v0.5.3 (#19190 )" (#19374 ) * Revert "deps: update go-metrics to v0.5.3 (#19190)" This reverts commit `ddb060d8b3`. * changelog: add entry for #19374	2023-12-08 08:46:55 -05:00
Tim Gross	d7a5274164	client: allow incomplete allocrunners to be removed on restore (#16638 ) If an allocrunner is persisted to the client state but the client stops before task runner can start, we end up with an allocation in the database with allocrunner state but no taskrunner state. This ends up mimicking an old pre-0.9.5 state where this state was not recorded and that hits a backwards compatibility shim. This leaves allocations in the client state that can never be restored, but won't ever be removed either. Update the backwards compatibility shim so that we fail the restore for the allocrunner and remove the allocation from the client state. Taskrunners persist state during graceful shutdown, so it shouldn't be possible to leak tasks that have actually started. This lets us "start over" with the allocation, if the server still wants to place it on the client.	2023-12-07 14:04:55 -05:00
Tim Gross	3c4e2009f5	connect: deployments should wait for Connect sidecar checks (#19334 ) When a Connect service is registered with Consul, Nomad includes the nested `Connect.SidecarService` field that includes health checks for the Envoy proxy. Because these are not part of the job spec, the alloc health tracker created by `health_hook` doesn't know to read the value of these checks. In many circumstances this won't be noticed, but if the Envoy health check happens to take longer than the `update.min_healthy_time` (perhaps because it's been set low), it's possible for a deployment to progress too early such that there will briefly be no healthy instances of the service available in Consul. Update the Consul service client to find the nested sidecar service in the service catalog and attach it to the results provided to the tracker. The tracker can then check the sidecar health checks. Fixes: https://github.com/hashicorp/nomad/issues/19269	2023-12-06 16:59:51 -05:00
Juana De La Cuesta	cf539c405e	Add a new parameter to avoid starting a replacement for lost allocs (#19101 ) This commit introduces the parameter preventRescheduleOnLost which indicates that the task group can't afford to have multiple instances running at the same time. In the case of a node going down, its allocations will be registered as unknown but no replacements will be rescheduled. If the lost node comes back up, the allocs will reconnect and continue to run. In case of max_client_disconnect also being enabled, if there is a reschedule policy, an error will be returned. Implements issue #10366 Co-authored-by: Dom Lavery <dom@circleci.com> Co-authored-by: Tim Gross <tgross@hashicorp.com> Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>	2023-12-06 12:28:42 +01:00
Tim Gross	ae403dcb4b	script_check_hook: handle task-level Consul namespace (#19241 ) The `script_check_hook` runs at the task level but can create script checks for both task-level services and group-level services. Now that we allow the Consul namespace to be set at the task-level `consul.namespace`, we need to have both possible namespaces handy when creating and updating checks.	2023-11-30 11:13:30 -05:00
Luiz Aoqui	1a2d41d30b	consul: refactor allocrunner consul hook (#19229 ) Refactor the JWT token derivation logic to only take a single request since it was only ever called with a map of length one. The original implementation received multiple requests to match the legacy flow, but legacy flow requests were batched from the Nomad client to the server, which doesn't happen for JWT. Each JWT request goes directly from the Nomad client to the Consul agent, so there is no batching involved.	2023-11-30 10:55:03 -05:00
Tim Gross	f77b4baebb	service_hook: ensure task-level `consul.namespace` is respected (#19224 ) The task-level service hook is using the group-level method to get the provider namespace, but this was not designed with task-level `consul` blocks in mind. This leads to task-level services using the group-level `consul.namespace`. Fix by creating a method to get the correct namespace and move this into the service hook itself rather than in the outer `initHooks` method.	2023-11-29 16:46:27 -05:00
Luiz Aoqui	ddb060d8b3	deps: update go-metrics to v0.5.3 (#19190 ) Update `go-metrics` to v0.5.3 to pick https://github.com/hashicorp/go-metrics/pull/146.	2023-11-28 12:37:57 -05:00
Piotr Kazmierczak	6a98e45c53	client: add metadata to tokens requested by Consul client (#19196 ) This way tokens created by Nomad workloads are easier to keep track of.	2023-11-28 16:09:31 +01:00
Piotr Kazmierczak	711da2e653	client: change Consul client interface (#19140 ) DeriveSITokenWithJWT is a misleading method name, because it's used to derive Consul ACL tokens for other purposes too.	2023-11-21 16:01:26 +01:00
Piotr Kazmierczak	e9019d5fc8	client: make sure consul_hook does not perform double requests for tasks (#19137 )	2023-11-21 10:24:45 +01:00
Tim Gross	b5af87ebf3	set Vault namespace from task in `vault_hook` JWT login (#19080 ) The JWT login codepath for the `vault_hook` was missing the Vault namespace, so the login request for non-default namespaces would fail.	2023-11-14 09:54:36 -05:00
Luiz Aoqui	f0acf72ae7	client: fix Consul token retrieval for templates (#19058 ) The template hook must use the Consul token for the cluster defined in the task-level `consul` block or, if `nil`, in the group-level `consul` block. The Consul tokens are generated by the allocrunner consul hook, but during the transition period we must fall back to the Nomad agent token if workload identities are not being used. So an empty token returned from `GetConsulTokens()` is not enough to determine if we should use the legacy flow (either this is an old task or the cluster is not configured for Consul WI), or if there is a misconfiguration (the task or group `consul` block is using a cluster that doesn't have an `identity` set). In order to distinguish between the two scenarios we must iterate over the task identities looking for one suitable for the Consul cluster being used.	2023-11-10 13:42:30 -05:00
Tim Gross	5ad715b281	fix taskrunner test after broken signature (#19056 ) PRs #19034 and #19040 accidentally conflicted with each other without a merge conflict when #19034 changes the method signature of `SetConsulTokens`. Because CI doesn't rebase, both PRs tested fine and only were broken once they landed on `main`. Fix that.	2023-11-09 15:53:25 -05:00
Luiz Aoqui	b61a31c38f	chore: remove comment about WI change mode (#19047 ) Identity change mode was implemented in #18943 and handles the update at the task level, so workload identity manager receives the update as expected.	2023-11-09 11:06:03 -05:00
Luiz Aoqui	6d8417014f	client: pass alloc hook resources to template hook (#19040 ) The task template hook uses the alloc resource to retrieve Consul tokens, so it must be passed from the allocation.	2023-11-09 10:55:35 -05:00
Tim Gross	c7c3b3ae33	revoke Consul tokens obtained via WI when alloc stops (#19034 ) Add a `Postrun` and `Destroy` hook to the allocrunner's `consul_hook` to ensure that Consul tokens we've created via WI get revoked via the logout API when we're done with them. Also add the logout to the `Prerun` hook if we've hit an error.	2023-11-09 10:08:09 -05:00
Tim Gross	7191c78928	refactor: rename allocrunner's Consul service reg handler (#19019 ) The allocrunner has a service registration handler that proxies various API calls to Consul. With multi-cluster support (for ENT), the service registration handler is what selects the correct Consul client. The name of this field in the allocrunner and taskrunner code base looks like it's referring to the actual Consul API client. This was actually the case before Nomad native service discovery was implemented, but now the name is misleading.	2023-11-08 15:39:32 -05:00
Luiz Aoqui	ab36cf031c	vault: avoid continual renewal of invalid token (#18985 ) A series of errors may happen when a token is invalidated while the Vault client is waiting to renew it. The token may have been invalidated for several reasons, such as the alloc finished running and it's now terminal or the token may have been changed directly on Vault out-of-band. Most of the errors are caused by retries that will never succeed until Vault fully removes the token from its state. This commit prevents the retries by making the error `invalid lease ID` a fatal error. In earlier versions of Vault, this case was covered by the error `lease not found or lease is not renewable`, which is already considered to be a fatal error by Nomad: `2d0cde4ccc/vault/expiration.go (L636-L639)` But https://github.com/hashicorp/vault/pull/5346 introduced an earlier `nil` check that generates a different error message: `750ab337ea/vault/expiration.go (L1362-L1364)` Both errors happen for the same reason (`le == nil`) and so should be considered fatal on renewal.	2023-11-07 19:50:19 -05:00
Luiz Aoqui	7054fe1a8c	vault: always renew tokens using the renewal loop (#18998 ) Previously, a Vault token could be renewed either periodically via the renewal loop or immediately by calling `RenewToken()`. But a race condition in the renewal loop could cause an attempt to renew an expired token. If both `updateCh` and `renewalCh` are active (such as when a task stops at the same time its token is waiting for renewal), the following `select` picks a `case` at random. `78f0c6b2a9/client/vaultclient/vaultclient.go (L557-L564)` If `case <-renewalCh` is picked, the token is incorrectly re-added to the heap, causing unnecessary renewals of a token that is already expired. `1604dba508/client/vaultclient/vaultclient.go (L505-L510)` To prevent this situation, the `renew()` function should only renew tokens that are currently in the heap, so `RenewToken()` must first push the token to the heap and wait for the renewal to happen instead of calling `renew()` directly since this could cause another race condition where the token is renewed twice: once by `RenewToken()` calling `renew()` directly and a second time if the renewal happens to pick the token as soon as `RenewToken()` adds it to the heap.	2023-11-07 19:49:33 -05:00
Tim Gross	50f0ce5412	config: remove old Vault/Consul config blocks from client (#18994 ) Remove the now-unused original configuration blocks for Consul and Vault from the client. When the client needs to refer to a Consul or Vault block it will always be for a specific cluster for the task/service. Add a helper for accessing the default clusters (for the client's own use). This is two of three changesets for this work. The remainder will implement the same changes in the `command/agent` package. As part of this work I discovered and fixed two bugs: * The gRPC proxy socket that we create for Envoy is only ever created using the default Consul cluster's configuration. This will prevent Connect from being used with the non-default cluster. * The Consul configuration we use for templates always comes from the default Consul cluster's configuration, but will use the correct Consul token for the non-default cluster. This will prevent templates from being used with the non-default cluster. Ref: https://github.com/hashicorp/nomad/issues/18947 Ref: https://github.com/hashicorp/nomad/pull/18991 Fixes: https://github.com/hashicorp/nomad/issues/18984 Fixes: https://github.com/hashicorp/nomad/issues/18983	2023-11-07 09:15:37 -05:00
Seth Hoenig	1604dba508	client: fingerprint cpu on raspberry pi (#18982 ) This PR tweaks the linux cpu fingerprinter to handle the case where no NUMA node data is found under /sys/devices/system/, in which case we need to assume just one node, one socket.	2023-11-02 15:52:37 -05:00
Luiz Aoqui	a907273557	vault: fix import cycle in `vaultclient` (#18965 ) * Revert "vault: eliminate vaultclient test import cycle (#18652)" This reverts commit `03cf9ae7ff`. * vault: remove import cycle in vaultclient_test.go	2023-11-02 11:07:04 -04:00
Tim Gross	483e78615d	template: fix test assertion to be compatible between CE/ENT (#18957 ) The template hook emits an error when the task has a Consul block that requires WI but there's no WI. The exact error message we get depends on whether we're running in CE or ENT. Update the test assertion so that we can tolerate this difference without building ENT-specific test files.	2023-11-01 13:26:45 -04:00
Tim Gross	dd62e8a319	consul/vault: use accessor method to get cluster name in client (#18955 ) When looking up the Consul or Vault cluster from a client hook, we should always use an accessor function rather than trying to lookup the `Cluster` field, which may be empty for jobs registered before Nomad 1.7.	2023-11-01 10:59:59 -04:00
Michael Schurter	e49ca3c431	identity: Implement `change_mode` (#18943 ) * identity: support change_mode and change_signal wip - just jobspec portion * test struct * cleanup some insignificant boogs * actually implement change mode * docs tweaks * add changelog * test identity.change_mode operations * use more words in changelog * job endpoint tests * address comments from code review --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-11-01 09:41:11 -05:00
Tim Gross	d62213a135	consul: fix lookups of default cluster across upgrades (#18945 ) Allocations that were created before Nomad 1.7 will not have the cluster field set for their Consul blocks. While this can be corrected server-side, that doesn't help allocations already on clients.	2023-11-01 10:11:54 -04:00
Seth Hoenig	5b56a5c5d1	client: fix cpu core/freq calculation on intel macs (#18934 )	2023-11-01 07:16:26 -05:00
Tim Gross	c1fa145765	vault: fix lookups of default cluster across upgrades (#18940 ) Allocations that were created before Nomad 1.7 will not have the `cluster` field set for their Vault blocks. While this can be corrected server-side, that doesn't help allocations already on clients. Also add extra safety on Consul cluster lookup too	2023-10-31 17:30:01 -04:00
Luiz Aoqui	3ddf1ecf1d	actions: minor bug fixes and improvements (#18904 )	2023-10-31 17:06:02 -04:00
Michael Schurter	66fbc0f67e	identity: default to RS256 for new workload ids (#18882 ) OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256. Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again. Test Updates Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed. I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package. In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners. TODO - Docs and changelog entries. - Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days. - Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this. 
Requiem for ed25519 When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset. EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties: 1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key. 2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious. 3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032). 4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus! 5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too. Life was good with ed25519, but sadly it could not last. [IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). 
A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256: - [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256 - [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256 RS256 only: - [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration) - [GitLab](https://gitlab.com/.well-known/openid-configuration) - [Google](https://accounts.google.com/.well-known/openid-configuration) - [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration) - [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration) - [SalesForce](https://login.salesforce.com/.well-known/openid-configuration) - [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/) - [TFC](https://app.terraform.io/.well-known/openid-configuration)	2023-10-31 11:25:20 -07:00
Tim Gross	6fd3143fe7	services: fix lookup for Consul tokens (#18914 ) The `group_service_hook` needs to supply the Consul service client with Consul tokens for its services. The lookup in the hook resources was looking for the wrong key. This would cause the service client to ignore the Consul token we've received and use the agent's own token. This changeset also moves the prefix formatting into `MakeUniqueIdentityName` method to reduce the risk of this kind of bug in the future.	2023-10-30 13:42:18 -04:00


4901 Commits