Commit Graph

25227 Commits

Author SHA1 Message Date
James Rasell
4a89a0a0f2 changelog: fix entry wording for #18873 (#18927) 2023-11-01 09:56:31 +00:00
Tim Gross
c1fa145765 vault: fix lookups of default cluster across upgrades (#18940)
Allocations that were created before Nomad 1.7 will not have the `cluster` field
set for their Vault blocks. While this can be corrected server-side, that
doesn't help allocations already on clients.

Also add extra safety on Consul cluster lookup too
2023-10-31 17:30:01 -04:00
Luiz Aoqui
d7edbd44b7 api: handle redirect during websocket upgrade (#18903)
When attempting a WebSocket connection upgrade the client may receive a
redirect request from the server, in which case the request should be
reattempted using the new address present in the `Location` header.
2023-10-31 17:12:11 -04:00
Luiz Aoqui
3ddf1ecf1d actions: minor bug fixes and improvements (#18904) 2023-10-31 17:06:02 -04:00
Tim Gross
2bff6d2a6a docs: fix token_period in example Vault role for WI (#18939)
Vault tokens requested for WI are "periodic" Vault tokens (ones that get
periodically renewed). The field we should be setting for the renewal window is
`token_period`.
2023-10-31 16:33:03 -04:00
Michael Schurter
9afc70ef5a Fix Vault docs to use HCL instead of JSON (#18938) 2023-10-31 13:25:20 -07:00
Michael Schurter
f8a65b6c29 docs: changelog & basic docs for 1.7 WI changes (#18936)
Changelog entries and bare minimum docs for workload identity changes in 1.7.
2023-10-31 13:06:08 -07:00
Michael Schurter
66fbc0f67e identity: default to RS256 for new workload ids (#18882)
OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256.

Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.

**Test Updates**

Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed.

I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package.

In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners.

**TODO**

- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.

**Requiem for ed25519**

When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset.

EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:

1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key.
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.

Life was good with ed25519, but sadly it could not last.

[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:

- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256

RS256 only:

- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
2023-10-31 11:25:20 -07:00
Tim Gross
01d050c36b identity: version check multiple and implicit identities (#18926)
Job submitters cannot set multiple identities prior to Nomad 1.7, and cluster
administrators should not set the identity configurations for their `consul` and
`vault` configuration blocks until all servers have been upgraded. Validate
these cases during job submission so as to prevent state store corruption when
jobs are submitting in the middle of a cluster upgrade.
2023-10-31 13:57:53 -04:00
Tim Gross
ea3e711fa6 docs: upgrade guide for integrations deprecation warnings (#18928)
The Consul and Vault integrations work shipping in Nomad 1.7 will deprecated the
existing token-based workflows. These will be removed in Nomad 1.9, so add a
note describing this to the upgrade guide.
2023-10-31 13:21:47 -04:00
Tim Gross
790d4d5d7a changelog entries for Integrations feature work (#18923) 2023-10-31 11:53:43 -04:00
Phil Renaud
d98ed87c1b Actions changelog update to feature (#18921) 2023-10-30 20:28:50 -04:00
Tim Gross
4850f07295 docs: name, audience, and TTL fields for identity blocks (#18916) 2023-10-30 13:45:40 -04:00
Tim Gross
6fd3143fe7 services: fix lookup for Consul tokens (#18914)
The `group_service_hook` needs to supply the Consul service client with Consul
tokens for its services. The lookup in the hook resources was looking for the
wrong key. This would cause the service client to ignore the Consul token we've
received and use the agent's own token.

This changeset also moves the prefix formatting into `MakeUniqueIdentityName` method
to reduce the risk of this kind of bug in the future.
2023-10-30 13:42:18 -04:00
Dave May
0748918a3a cli: Add file prediction for operator raft/snapshot commands (#18901) 2023-10-30 13:40:21 -04:00
Seth Hoenig
b5469dd0eb Post 1.6.3 release (#18918)
* Generate files for 1.6.3 release

* Prepare for next release

* Merge release 1.6.3 files

---------

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2023-10-30 12:38:16 -05:00
Tim Gross
f0330d6df1 identity_hook: implement PreKill hook, not TaskStop hook (#18913)
The allocrunner's `identity_hook` implements the interface for TaskStop, but
this interface is only ever called for task-level hooks. This results in a
leaked goroutine that tries to periodically renew WIs until the client shuts
down gracefully.

Add an implementation for the allocrunner's `PreKill` and `Destroy` hooks, so
that whenever an allocation is stopped or garbage collected we stop renewing its
Workload Identities. This also requires making the `Shutdown` method of `WIDMgr`
safe to call multiple times.
2023-10-30 10:54:22 -04:00
Dave May
1f4965e877 docs: Add code fence to Improvements example (#18902) 2023-10-30 14:13:19 +00:00
Tim Gross
9463d7f88a docs: add note about consul.service_identity ignoring fields (#18900)
The WI we get for Consul services is saved to the client state DB like all other
WIs, but the resulting JWT is never exposed to the task secrets directory
because (a) it's only intended for use with Consul service configuration,
and (b) for group services it could be ambiguous which task to expose it to.

Add a note to the `consul.service_identity` docs that these fields are ignored.
2023-10-30 09:19:15 -04:00
Luiz Aoqui
347389f9f9 vault: derive token using create_from_role (#18880)
Fallback to the ACL role defined in the client's `create_from_role`
configuration when using the JWT flow and the task does not specify a
role to use.
2023-10-27 13:03:44 -04:00
Luiz Aoqui
71a471b90a cli: deprecate -vault-token flag (#18881)
Apply the same deprecation notice from #18863 to the `nomad job plan`
command.
2023-10-27 12:48:11 -04:00
James Rasell
2daf49df9a server: use same receiver name for all server funcs. (#18896) 2023-10-27 16:36:10 +01:00
Tim Gross
694a5ec19d docs: remove stale note about generate_lease from template docs (#18895)
Prior to `consul-template` v0.22.0, automatic PKI renewal wouldn't work properly
based on the expiration of the cert. More recent versions of `consul-template`
can use the expiry to refresh the cert, so it's no longer necessary (and in fact
generates extra load on Vault) to set `generate_lease`. Remove this
recommendation from the docs.

Fixes: #18893
2023-10-27 11:09:09 -04:00
Justin Yang
b76e0429c4 client: add support for NetBSD clients (#18562)
Bumps `shirou/gopsutil` to v3.23.9
2023-10-27 10:33:00 -04:00
Tim Gross
139a96ad12 e2e: fix bind name to allow Connect reachability (#18878)
The `BindName` for JWT authentication should always bind to the `nomad_service` field in the JWT and not include the namespace, as the `nomad_service` is what's actually registered in Consul. 

* Fix the binding rule for the `consulcompat` test 
* Add a reachability assertion so that we don't miss regressions.
* Ensure we have a clean shutdown so that we don't leak state (containers and iptables) between tests.
2023-10-27 10:15:17 -04:00
James Rasell
3c8eb54dfc scheduler: ensure dup alloc names are fixed before plan submit. (#18873)
This change fixes a bug within the generic scheduler which meant
duplicate alloc indexes (names) could be submitted to the plan
applier and written to state. The bug originates from the
placements calculation notion that names of allocations being
replaced are blindly copied to their replacement. This is not
correct in all cases, particularly when dealing with canaries.

The fix updates the alloc name index tracker to include minor
duplicate tracking. This can be used when computing placements to
ensure duplicate are found, and a new name picked before the plan
is submitted. The name index tracking is now passed from the
reconciler to the generic scheduler via the results, so this does
not have to be regenerated, or another data structure used.
2023-10-27 14:16:41 +01:00
Juana De La Cuesta
e8efe2d251 fix: handling non reschedule disconnecting and reconnecting allocs (#18701)
This PR fixes a long lived bug, where disconnecting allocations where never rescheduled by their policy but because the group count was short. The default reschedule time for services and batches is 30 and 5 seconds respectively, in order to properly reschedule disconnected allocs, they need to be able to be rescheduled for later, a path that was not handled before. This PR introduces a way to handle such allocations.
2023-10-27 13:14:39 +02:00
Robert Sturla
23665a5685 docs: update link to tc-redirect-tap (#18879) 2023-10-26 14:21:10 -04:00
Seth Hoenig
fdde8a56ae docs: add job-specification docs for numa (#18864)
* docs: add job-specification docs for numa

* docs: take suggestions

Co-authored-by: Tim Gross <tgross@hashicorp.com>

* docs: more cr suggestions

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-10-26 11:39:08 -05:00
Luiz Aoqui
61d4ee7e60 vault: validate tasks using non-default clusters (#18810)
Since Nomad servers only start a Vault client for the default cluster,
tasks using non-default clusters must provide an identity to be used for
token derivation, either in the task itself or in the agent
configuration.
2023-10-26 11:50:42 -04:00
Tim Gross
8f8265fa6d add deprecation warning for Vault/Consul token usage (#18863)
Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and
intended for removal in Nomad 1.9. Add a deprecation warning to the CLI when the
user passes in the appropriate flag or environment variable.

Nomad agents will no longer need a Vault token when configured with workload
identity, and we'll ignore Vault tokens in the agent config after Nomad 1.9. Log
a warning at agent startup.

Ref: https://github.com/hashicorp/nomad/issues/15617
Ref: https://github.com/hashicorp/nomad/issues/15618
2023-10-26 10:46:02 -04:00
Seth Hoenig
8ed82416e3 client: fix detection of cpuset.mems on cgroups v1 systems (#18868) 2023-10-26 09:42:10 -05:00
Tim Gross
47f2118f40 docs: Vault Workload Identity integration (#18704)
Documentation updates to support the new Vault integration with Nomad Workload
Identity. Included:

* Added a large section to the Vault integration docs to explain how to set up
  auth methods, roles, and policies (by hand, assuming we don't ship a `nomad
  setup-vault` tool for now), and how to safely migrate from the existing workflow
  to the new one.
* Shuffled around some of the existing text so that the legacy authentication
  method text is in its own section.
* Added a compatibility matrix to the Vault integration page.
2023-10-26 10:33:52 -04:00
Seth Hoenig
afac9d10dd deps: purge and prohibit use of go-set/v1 (#18869) 2023-10-26 08:56:43 -05:00
Piotr Kazmierczak
7f62dec473 consul WI: rename default auth method for services (#18867)
It should be called nomad-services instead of nomad-workloads.
2023-10-26 09:43:33 +02:00
Seth Hoenig
de28760928 cl: add changelog for numa (#18847) 2023-10-25 10:41:17 -05:00
James Rasell
b3e41bec2d scheduler: remove unused alloc index functions. (#18846) 2023-10-25 09:09:47 +01:00
Michael Schurter
9b3c38b3ed docs: deprecate rsadecrypt (#18856)
`rsadecrypt` uses PKCS #1 v1.5 padding which has multiple known
weaknesses. While it is possible to use safely in Nomad, we should not
encourage our users to use bad cryptographic primitives.

If users want to decrypt secrets in jobspecs we should choose a
cryptographic primitive designed for that purpose. `rsadecrypt` was
inherited from Terraform which only implemented it to support decrypting
Window's passwords on AWS EC2 instances:

https://github.com/hashicorp/terraform/pull/16647

This is not something that should ever be done in a jobspec, therefore
there's no reason for Nomad to support this HCL2 function.
2023-10-24 15:48:15 -07:00
Tim Gross
6c2d5a0fbb E2E: Consul compatibility matrix tests (#18799)
Set up a new test suite that exercises Nomad's compatibility with Consul. This
suite installs all currently supported versions of Consul, spins up a Consul
agent with appropriate configuration, and a Nomad agent running in dev
mode. Then it runs a Connect job against each pair.
2023-10-24 16:03:53 -04:00
Seth Hoenig
8de7af51cb cl: remove cgroup mountpoint (#18848)
* cl: remove cgroup mountpoint attribute

* cl: add changelog for cgroups attribute changes
2023-10-24 11:38:26 -05:00
Daniel Bennett
b46b41a2e9 scheduler: appropriately unblock evals with quotas (#18838)
When an eval is blocked due to e.g. cpu exhausted
on nodes, but there happens to also be a quota on
the job's namespace, the eval would not get auto-
unblocked when the node cpu got freed up.

This change ensures, when considering quota during
BlockedEvals.unblock(), that the block was due to
quota in the first place, so unblocking does not
get skipped due to the mere existence of a quota
on the namespace.
2023-10-24 11:22:24 -05:00
Seth Hoenig
5cf4c6cc06 cl: note breaking change of numcores attribute on apple systems (#18850)
I goofed the name the first time around, "power" should have been
"performance" which is consistent with both Apple and Intel branding.
2023-10-24 10:54:26 -05:00
Seth Hoenig
9ae4b10dc6 cl: minor features are listed as improvements (#18845)
The Features header is reserved for "tent-pole" features of a Nomad version.
2023-10-24 10:53:40 -05:00
James Rasell
f64ade2304 cli: ensure HCL env vars are added to the job submission object. (#18832) 2023-10-24 16:48:13 +01:00
Kerim Satirli
5e1bbf90fc docs: update all URLs to developer.hashicorp.com (#16247) 2023-10-24 11:00:11 -04:00
Seth Hoenig
951cde4e3b numa: fix cpu topology conversion for non linux systems (#18843) 2023-10-24 09:12:34 -05:00
Tim Gross
cb3fde3c96 metrics: prevent negative counter from iowait decrease (#18835)
The iowait metric obtained from `/proc/stat` can under some circumstances
decrease. The relevant condition is when an interrupt arrives on a different
core than the one that gets woken up for the IO, and a particular counter in the
kernel for that core gets interrupted. This is documented in the man page for
the `proc(5)` pseudo-filesystem, and considered an unfortunate behavior that
can't be changed for the sake of ABI compatibility.

In Nomad, we get the current "busy" time (everything except for idle) and
compare it to the previous busy time to get the counter incremeent. If the
iowait counter decreases and the idle counter increases more than the increase
in the total busy time, we can get a negative total. This previously caused a
panic in our metrics collection (see #15861) but that is being prevented by
reporting an error message.

Fix the bug by putting a zero floor on the values we return from the host CPU
stats calculator.

Fixes: #15861
Fixes: #18804
2023-10-24 09:58:25 -04:00
Seth Hoenig
043b1a95a7 deps: bump go-set/v2 to alpha.3 (#18844)
fixes a rather critical bug in .Equals implementation
2023-10-24 08:23:25 -05:00
James Rasell
b55dcb3967 test: use must lib for bitmap tests. (#18834) 2023-10-24 07:40:02 +01:00
Luiz Aoqui
70b1862026 test: add E2E vaultcompat test for JWT auth flow (#18822)
Test the JWT auth flow using real Nomad and Vault agents.
2023-10-23 20:00:55 -04:00