nomad

mirror of https://github.com/kemko/nomad.git synced 2026-01-01 16:05:42 +03:00

Author	SHA1	Message	Date
James Rasell	4a89a0a0f2	changelog: fix entry wording for #18873 (#18927 )	2023-11-01 09:56:31 +00:00
Tim Gross	c1fa145765	vault: fix lookups of default cluster across upgrades (#18940 ) Allocations that were created before Nomad 1.7 will not have the `cluster` field set for their Vault blocks. While this can be corrected server-side, that doesn't help allocations already on clients. Also add extra safety on Consul cluster lookup too	2023-10-31 17:30:01 -04:00
Luiz Aoqui	d7edbd44b7	api: handle redirect during websocket upgrade (#18903 ) When attempting a WebSocket connection upgrade the client may receive a redirect request from the server, in which case the request should be reattempted using the new address present in the `Location` header.	2023-10-31 17:12:11 -04:00
Luiz Aoqui	3ddf1ecf1d	actions: minor bug fixes and improvements (#18904 )	2023-10-31 17:06:02 -04:00
Tim Gross	2bff6d2a6a	docs: fix `token_period` in example Vault role for WI (#18939 ) Vault tokens requested for WI are "periodic" Vault tokens (ones that get periodically renewed). The field we should be setting for the renewal window is `token_period`.	2023-10-31 16:33:03 -04:00
Michael Schurter	9afc70ef5a	Fix Vault docs to use HCL instead of JSON (#18938 )	2023-10-31 13:25:20 -07:00
Michael Schurter	f8a65b6c29	docs: changelog & basic docs for 1.7 WI changes (#18936 ) Changelog entries and bare minimum docs for workload identity changes in 1.7.	2023-10-31 13:06:08 -07:00
Michael Schurter	66fbc0f67e	identity: default to RS256 for new workload ids (#18882 ) OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256. Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again. Test Updates Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed. I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package. In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners. TODO - Docs and changelog entries. - Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days. - Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this. 
Requiem for ed25519 When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset. EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties: 1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key. 2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious. 3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032). 4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus! 5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too. Life was good with ed25519, but sadly it could not last. [IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). 
A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256: - [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256 - [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256 RS256 only: - [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration) - [GitLab](https://gitlab.com/.well-known/openid-configuration) - [Google](https://accounts.google.com/.well-known/openid-configuration) - [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration) - [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration) - [SalesForce](https://login.salesforce.com/.well-known/openid-configuration) - [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/) - [TFC](https://app.terraform.io/.well-known/openid-configuration)	2023-10-31 11:25:20 -07:00
Tim Gross	01d050c36b	identity: version check multiple and implicit identities (#18926 ) Job submitters cannot set multiple identities prior to Nomad 1.7, and cluster administrators should not set the identity configurations for their `consul` and `vault` configuration blocks until all servers have been upgraded. Validate these cases during job submission so as to prevent state store corruption when jobs are submitted in the middle of a cluster upgrade.	2023-10-31 13:57:53 -04:00
Tim Gross	ea3e711fa6	docs: upgrade guide for integrations deprecation warnings (#18928 ) The Consul and Vault integrations work shipping in Nomad 1.7 will deprecate the existing token-based workflows. These will be removed in Nomad 1.9, so add a note describing this to the upgrade guide.	2023-10-31 13:21:47 -04:00
Tim Gross	790d4d5d7a	changelog entries for Integrations feature work (#18923 )	2023-10-31 11:53:43 -04:00
Phil Renaud	d98ed87c1b	Actions changelog update to feature (#18921 )	2023-10-30 20:28:50 -04:00
Tim Gross	4850f07295	docs: name, audience, and TTL fields for `identity` blocks (#18916 )	2023-10-30 13:45:40 -04:00
Tim Gross	6fd3143fe7	services: fix lookup for Consul tokens (#18914 ) The `group_service_hook` needs to supply the Consul service client with Consul tokens for its services. The lookup in the hook resources was looking for the wrong key. This would cause the service client to ignore the Consul token we've received and use the agent's own token. This changeset also moves the prefix formatting into `MakeUniqueIdentityName` method to reduce the risk of this kind of bug in the future.	2023-10-30 13:42:18 -04:00
Dave May	0748918a3a	cli: Add file prediction for operator raft/snapshot commands (#18901 )	2023-10-30 13:40:21 -04:00
Seth Hoenig	b5469dd0eb	Post 1.6.3 release (#18918 ) * Generate files for 1.6.3 release * Prepare for next release * Merge release 1.6.3 files --------- Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>	2023-10-30 12:38:16 -05:00
Tim Gross	f0330d6df1	`identity_hook`: implement PreKill hook, not TaskStop hook (#18913 ) The allocrunner's `identity_hook` implements the interface for TaskStop, but this interface is only ever called for task-level hooks. This results in a leaked goroutine that tries to periodically renew WIs until the client shuts down gracefully. Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so that whenever an allocation is stopped or garbage collected we stop renewing its Workload Identities. This also requires making the `Shutdown` method of `WIDMgr` safe to call multiple times.	2023-10-30 10:54:22 -04:00
Dave May	1f4965e877	docs: Add code fence to Improvements example (#18902 )	2023-10-30 14:13:19 +00:00
Tim Gross	9463d7f88a	docs: add note about `consul.service_identity` ignoring fields (#18900 ) The WI we get for Consul services is saved to the client state DB like all other WIs, but the resulting JWT is never exposed to the task secrets directory because (a) it's only intended for use with Consul service configuration, and (b) for group services it could be ambiguous which task to expose it to. Add a note to the `consul.service_identity` docs that these fields are ignored.	2023-10-30 09:19:15 -04:00
Luiz Aoqui	347389f9f9	vault: derive token using `create_from_role` (#18880 ) Fallback to the ACL role defined in the client's `create_from_role` configuration when using the JWT flow and the task does not specify a role to use.	2023-10-27 13:03:44 -04:00
Luiz Aoqui	71a471b90a	cli: deprecate -vault-token flag (#18881 ) Apply the same deprecation notice from #18863 to the `nomad job plan` command.	2023-10-27 12:48:11 -04:00
James Rasell	2daf49df9a	server: use same receiver name for all server funcs. (#18896 )	2023-10-27 16:36:10 +01:00
Tim Gross	694a5ec19d	docs: remove stale note about `generate_lease` from template docs (#18895 ) Prior to `consul-template` v0.22.0, automatic PKI renewal wouldn't work properly based on the expiration of the cert. More recent versions of `consul-template` can use the expiry to refresh the cert, so it's no longer necessary (and in fact generates extra load on Vault) to set `generate_lease`. Remove this recommendation from the docs. Fixes: #18893	2023-10-27 11:09:09 -04:00
Justin Yang	b76e0429c4	client: add support for NetBSD clients (#18562 ) Bumps `shirou/gopsutil` to v3.23.9	2023-10-27 10:33:00 -04:00
Tim Gross	139a96ad12	e2e: fix bind name to allow Connect reachability (#18878 ) The `BindName` for JWT authentication should always bind to the `nomad_service` field in the JWT and not include the namespace, as the `nomad_service` is what's actually registered in Consul. * Fix the binding rule for the `consulcompat` test * Add a reachability assertion so that we don't miss regressions. * Ensure we have a clean shutdown so that we don't leak state (containers and iptables) between tests.	2023-10-27 10:15:17 -04:00
James Rasell	3c8eb54dfc	scheduler: ensure dup alloc names are fixed before plan submit. (#18873 ) This change fixes a bug within the generic scheduler which meant duplicate alloc indexes (names) could be submitted to the plan applier and written to state. The bug originates from the placements calculation notion that names of allocations being replaced are blindly copied to their replacement. This is not correct in all cases, particularly when dealing with canaries. The fix updates the alloc name index tracker to include minor duplicate tracking. This can be used when computing placements to ensure duplicates are found, and a new name picked before the plan is submitted. The name index tracking is now passed from the reconciler to the generic scheduler via the results, so this does not have to be regenerated, or another data structure used.	2023-10-27 14:16:41 +01:00
Juana De La Cuesta	e8efe2d251	fix: handling non reschedule disconnecting and reconnecting allocs (#18701 ) This PR fixes a long-lived bug where disconnecting allocations were never rescheduled by their policy, but only because the group count was short. The default reschedule time for services and batches is 30 and 5 seconds respectively. To properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.	2023-10-27 13:14:39 +02:00
Robert Sturla	23665a5685	docs: update link to tc-redirect-tap (#18879 )	2023-10-26 14:21:10 -04:00
Seth Hoenig	fdde8a56ae	docs: add job-specification docs for numa (#18864 ) * docs: add job-specification docs for numa * docs: take suggestions Co-authored-by: Tim Gross <tgross@hashicorp.com> * docs: more cr suggestions --------- Co-authored-by: Tim Gross <tgross@hashicorp.com>	2023-10-26 11:39:08 -05:00
Luiz Aoqui	61d4ee7e60	vault: validate tasks using non-default clusters (#18810 ) Since Nomad servers only start a Vault client for the default cluster, tasks using non-default clusters must provide an identity to be used for token derivation, either in the task itself or in the agent configuration.	2023-10-26 11:50:42 -04:00
Tim Gross	8f8265fa6d	add deprecation warning for Vault/Consul token usage (#18863 ) Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and intended for removal in Nomad 1.9. Add a deprecation warning to the CLI when the user passes in the appropriate flag or environment variable. Nomad agents will no longer need a Vault token when configured with workload identity, and we'll ignore Vault tokens in the agent config after Nomad 1.9. Log a warning at agent startup. Ref: https://github.com/hashicorp/nomad/issues/15617 Ref: https://github.com/hashicorp/nomad/issues/15618	2023-10-26 10:46:02 -04:00
Seth Hoenig	8ed82416e3	client: fix detection of cpuset.mems on cgroups v1 systems (#18868 )	2023-10-26 09:42:10 -05:00
Tim Gross	47f2118f40	docs: Vault Workload Identity integration (#18704 ) Documentation updates to support the new Vault integration with Nomad Workload Identity. Included: * Added a large section to the Vault integration docs to explain how to set up auth methods, roles, and policies (by hand, assuming we don't ship a `nomad setup-vault` tool for now), and how to safely migrate from the existing workflow to the new one. * Shuffled around some of the existing text so that the legacy authentication method text is in its own section. * Added a compatibility matrix to the Vault integration page.	2023-10-26 10:33:52 -04:00
Seth Hoenig	afac9d10dd	deps: purge and prohibit use of go-set/v1 (#18869 )	2023-10-26 08:56:43 -05:00
Piotr Kazmierczak	7f62dec473	consul WI: rename default auth method for services (#18867 ) It should be called nomad-services instead of nomad-workloads.	2023-10-26 09:43:33 +02:00
Seth Hoenig	de28760928	cl: add changelog for numa (#18847 )	2023-10-25 10:41:17 -05:00
James Rasell	b3e41bec2d	scheduler: remove unused alloc index functions. (#18846 )	2023-10-25 09:09:47 +01:00
Michael Schurter	9b3c38b3ed	docs: deprecate rsadecrypt (#18856 ) `rsadecrypt` uses PKCS #1 v1.5 padding which has multiple known weaknesses. While it is possible to use safely in Nomad, we should not encourage our users to use bad cryptographic primitives. If users want to decrypt secrets in jobspecs we should choose a cryptographic primitive designed for that purpose. `rsadecrypt` was inherited from Terraform which only implemented it to support decrypting Windows passwords on AWS EC2 instances: https://github.com/hashicorp/terraform/pull/16647 This is not something that should ever be done in a jobspec, therefore there's no reason for Nomad to support this HCL2 function.	2023-10-24 15:48:15 -07:00
Tim Gross	6c2d5a0fbb	E2E: Consul compatibility matrix tests (#18799 ) Set up a new test suite that exercises Nomad's compatibility with Consul. This suite installs all currently supported versions of Consul, spins up a Consul agent with appropriate configuration, and a Nomad agent running in dev mode. Then it runs a Connect job against each pair.	2023-10-24 16:03:53 -04:00
Seth Hoenig	8de7af51cb	cl: remove cgroup mountpoint (#18848 ) * cl: remove cgroup mountpoint attribute * cl: add changelog for cgroups attribute changes	2023-10-24 11:38:26 -05:00
Daniel Bennett	b46b41a2e9	scheduler: appropriately unblock evals with quotas (#18838 ) When an eval is blocked due to e.g. cpu exhausted on nodes, but there happens to also be a quota on the job's namespace, the eval would not get auto- unblocked when the node cpu got freed up. This change ensures, when considering quota during BlockedEvals.unblock(), that the block was due to quota in the first place, so unblocking does not get skipped due to the mere existence of a quota on the namespace.	2023-10-24 11:22:24 -05:00
Seth Hoenig	5cf4c6cc06	cl: note breaking change of numcores attribute on apple systems (#18850 ) I goofed the name the first time around, "power" should have been "performance" which is consistent with both Apple and Intel branding.	2023-10-24 10:54:26 -05:00
Seth Hoenig	9ae4b10dc6	cl: minor features are listed as improvements (#18845 ) The Features header is reserved for "tent-pole" features of a Nomad version.	2023-10-24 10:53:40 -05:00
James Rasell	f64ade2304	cli: ensure HCL env vars are added to the job submission object. (#18832 )	2023-10-24 16:48:13 +01:00
Kerim Satirli	5e1bbf90fc	docs: update all URLs to `developer.hashicorp.com` (#16247 )	2023-10-24 11:00:11 -04:00
Seth Hoenig	951cde4e3b	numa: fix cpu topology conversion for non linux systems (#18843 )	2023-10-24 09:12:34 -05:00
Tim Gross	cb3fde3c96	metrics: prevent negative counter from iowait decrease (#18835 ) The iowait metric obtained from `/proc/stat` can under some circumstances decrease. The relevant condition is when an interrupt arrives on a different core than the one that gets woken up for the IO, and a particular counter in the kernel for that core gets interrupted. This is documented in the man page for the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that can't be changed for the sake of ABI compatibility. In Nomad, we get the current "busy" time (everything except for idle) and compare it to the previous busy time to get the counter increment. If the iowait counter decreases and the idle counter increases more than the increase in the total busy time, we can get a negative total. This previously caused a panic in our metrics collection (see #15861) but that is being prevented by reporting an error message. Fix the bug by putting a zero floor on the values we return from the host CPU stats calculator. Fixes: #15861 Fixes: #18804	2023-10-24 09:58:25 -04:00
Seth Hoenig	043b1a95a7	deps: bump go-set/v2 to alpha.3 (#18844 ) fixes a rather critical bug in .Equals implementation	2023-10-24 08:23:25 -05:00
James Rasell	b55dcb3967	test: use must lib for bitmap tests. (#18834 )	2023-10-24 07:40:02 +01:00
Luiz Aoqui	70b1862026	test: add E2E `vaultcompat` test for JWT auth flow (#18822 ) Test the JWT auth flow using real Nomad and Vault agents.	2023-10-23 20:00:55 -04:00
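
The zero floor described in cb3fde3c96 (#18835) can be illustrated with a short Go sketch. This is not Nomad's actual code: the `cpuSample` type, its fields, and the `busyDelta` function are hypothetical names chosen for illustration, and the sample values are made up. It only shows the general idea of clamping a counter increment so that a decreasing iowait value cannot produce a negative result.

```go
package main

import "fmt"

// cpuSample is a hypothetical snapshot of cumulative CPU time counters,
// loosely modelled on the fields found in /proc/stat.
type cpuSample struct {
	User   float64
	System float64
	IOWait float64
	Idle   float64
}

// busy returns the cumulative non-idle time: everything except Idle.
func (s cpuSample) busy() float64 {
	return s.User + s.System + s.IOWait
}

// busyDelta returns the increase in busy time between two samples.
// Because IOWait may decrease between reads, the raw difference can be
// negative; the result is floored at zero so callers never see a
// negative counter increment.
func busyDelta(prev, cur cpuSample) float64 {
	delta := cur.busy() - prev.busy()
	if delta < 0 {
		return 0
	}
	return delta
}

func main() {
	prev := cpuSample{User: 100, System: 50, IOWait: 20, Idle: 1000}

	// IOWait dropped by 5 while User rose by only 1: raw delta is -4.
	dipped := cpuSample{User: 101, System: 50, IOWait: 15, Idle: 1010}

	// Ordinary case: every busy counter increased.
	normal := cpuSample{User: 110, System: 55, IOWait: 22, Idle: 1005}

	fmt.Println(busyDelta(prev, dipped))
	fmt.Println(busyDelta(prev, normal))
}
```

Clamping at the point where the delta is computed keeps the guard in one place, so downstream metric consumers do not each need their own negative-value check.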


25227 Commits