25265 Commits

Author SHA1 Message Date
Luiz Aoqui
ab36cf031c vault: avoid continual renewal of invalid token (#18985)
A series of errors may happen when a token is invalidated while the
Vault client is waiting to renew it. The token may have been invalidated
for several reasons: for example, the alloc finished running and is now
terminal, or the token was changed directly in Vault out-of-band.

Most of the errors are caused by retries that will never succeed until
Vault fully removes the token from its state.

This commit prevents the retries by making the error `invalid lease ID`
a fatal error.

In earlier versions of Vault, this case was covered by the error `lease
not found or lease is not renewable`, which is already considered to be
a fatal error by Nomad:

2d0cde4ccc/vault/expiration.go (L636-L639)

But https://github.com/hashicorp/vault/pull/5346 introduced an earlier
`nil` check that generates a different error message:

750ab337ea/vault/expiration.go (L1362-L1364)

Both errors happen for the same reason (`le == nil`) and so should be
considered fatal on renewal.
2023-11-07 19:50:19 -05:00
Luiz Aoqui
7054fe1a8c vault: always renew tokens using the renewal loop (#18998)
Previously, a Vault token could be renewed either periodically via the
renewal loop or immediately by calling `RenewToken()`.

But a race condition in the renewal loop could cause an attempt to renew
an expired token. If both `updateCh` and `renewalCh` are active (such as
when a task stops at the same time its token is waiting for renewal),
the following `select` picks a `case` at random.

78f0c6b2a9/client/vaultclient/vaultclient.go (L557-L564)

If `case <-renewalCh` is picked, the token is incorrectly re-added to
the heap, causing unnecessary renewals of a token that is already expired.

1604dba508/client/vaultclient/vaultclient.go (L505-L510)

To prevent this situation, the `renew()` function should only renew
tokens that are currently in the heap, so `RenewToken()` must first push
the token to the heap and wait for the renewal to happen instead of
calling `renew()` directly since this could cause another race condition
where the token is renewed twice: once by `RenewToken()` calling
`renew()` directly and a second time if the renewal happens to pick the
token as soon as `RenewToken()` adds it to the heap.
2023-11-07 19:49:33 -05:00
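The rule from commit 7054fe1a8c that only tokens still tracked for renewal may be renewed can be illustrated with a small guard. This sketch is an assumption-laden simplification: it uses a map where the commit describes a heap, the renewal itself is simulated, and `tokenTracker`, `track`, `stop`, and `renew` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// tokenTracker is a hypothetical stand-in for the renewal heap: only tokens
// present in the set are eligible for renewal.
type tokenTracker struct {
	mu      sync.Mutex
	tracked map[string]struct{}
}

func newTokenTracker() *tokenTracker {
	return &tokenTracker{tracked: make(map[string]struct{})}
}

// track registers a token for renewal.
func (t *tokenTracker) track(token string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	t.tracked[token] = struct{}{}
}

// stop removes a token, for example when its task stops.
func (t *tokenTracker) stop(token string) {
	t.mu.Lock()
	defer t.mu.Unlock()
	delete(t.tracked, token)
}

// renew simulates a renewal and reports whether it was performed. A token
// that is no longer tracked is skipped instead of being re-added.
func (t *tokenTracker) renew(token string) bool {
	t.mu.Lock()
	defer t.mu.Unlock()
	_, ok := t.tracked[token]
	return ok
}

func main() {
	tr := newTokenTracker()
	tr.track("token-a")
	fmt.Println(tr.renew("token-a")) // still tracked
	tr.stop("token-a")
	fmt.Println(tr.renew("token-a")) // stopped before the renewal fired
}
```

With a guard like this, it no longer matters which `case` the `select` picks when both channels are ready: a renewal that fires after the stop finds nothing to renew.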
Phil Renaud
783572de7d [ui] Actions implementation in the web UI (#18793)
* runAction model and adapter funcs

* Hacky but functional action running from job index

* remove proxy hack

* runAction added to taskSubRow

* Added tty and ws_handshake to job action endpoint call

* delog

* Bunch of streaming work

* action started, running, and finished notification titles, neutral color, and ansi escape

* Handle random alloc selection in the web ui

* Run on All implementation in web ui

* [ui] Helios two-step button and uniform title bar for Actions (#18912)

* Initial pass at title bar button uniformity

* Vertical align on actions dropdown toggle and small edits to prevent keynav overflow issue

* We represent loading state w text and disable now

* Pageheader component to align buttons

* Buttons standardized

* Actions dropdown reveal for multi-alloc job

* Notification code styles

* An action-having single alloc job

* Mirageed

* Actions-laden jobs in mirage

* Separating allocCount and taskCount in mirage mocks

* Unbreak stop job tests

* Permissions for actions dropdown

* tests for running actions from the job index page

* running from a task row actions tests

* some todocleanup

* PR feedback addressed, including page helper for actions
2023-11-07 15:29:43 -05:00
Seth Hoenig
cf2f48efd4 build: update to Go 1.21.4 (#19013) 2023-11-07 13:18:07 -06:00
Seth Hoenig
a2f7ab2645 e2e disable windows (#19012)
* e2e: disable windows client

* e2e: disable windows artifact test
2023-11-07 09:34:18 -06:00
Tim Gross
50f0ce5412 config: remove old Vault/Consul config blocks from client (#18994)
Remove the now-unused original configuration blocks for Consul and Vault from
the client. When the client needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the client's own use).

This is the second of three changesets for this work. The remainder will implement the
same changes in the `command/agent` package.

As part of this work I discovered and fixed two bugs:

* The gRPC proxy socket that we create for Envoy is only ever created using the
  default Consul cluster's configuration. This will prevent Connect from being
  used with the non-default cluster.
* The Consul configuration we use for templates always comes from the default
  Consul cluster's configuration, but will use the correct Consul token for the
  non-default cluster. This will prevent templates from being used with the
  non-default cluster.

Ref: https://github.com/hashicorp/nomad/issues/18947
Ref: https://github.com/hashicorp/nomad/pull/18991
Fixes: https://github.com/hashicorp/nomad/issues/18984
Fixes: https://github.com/hashicorp/nomad/issues/18983
2023-11-07 09:15:37 -05:00
Tim Gross
1998004483 move deprecation warning for Vault/Consul token to admission hook (#18995)
Submitting a Consul or Vault token with a job is deprecated in Nomad 1.7 and
intended for removal in Nomad 1.9. We added a deprecation warning to the CLI
when the user passes in the appropriate flag or environment variable, but this
warning is also shown to users whose jobs do not use Vault or Consul but who
happen to have the appropriate environment variable set. While this is generally a bad practice (because
the token is leaked to Nomad), it's also the existing practice for some users.

Move the warning to the job admission hook. This will allow us to warn only when
appropriate, and that will also help the migration process by producing warnings
only for the relevant jobs.
2023-11-07 08:37:06 -05:00
Seth Hoenig
3ba364e42f deps: update some dependencies (#19002)
* deps: update shoenig/test to 1.7.0

* deps: update go-set/v2 to v2.1.0

* deps: update shoenig/go-landlock to v1.2.0
2023-11-07 07:34:40 -06:00
Piotr Kazmierczak
7c6863b479 cli: setup vault command (#18910)
An interactive setup helper for configuring Vault to accept Nomad WI-enabled
workloads.

---------

Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-11-07 10:42:00 +01:00
Dave May
e4f98a8d1d docs: fix broken links in docker.mdx (#19003) 2023-11-07 07:34:47 +00:00
Tim Gross
1ef99f0536 config: remove old Vault/Consul config blocks from server (#18991)
Remove the now-unused original configuration blocks for Consul and Vault from
the server. When the server needs to refer to a Consul or Vault block it will
always be for a specific cluster for the task/service. Add a helper for
accessing the default clusters (for the server's own use).

This is one of three changesets for this work. The remainder will implement the
same changes in the `client` package and in the `command/agent` package.

As part of this work I discovered that the job submission hook for Vault only
checks the enabled flag on the default cluster, rather than the clusters that
are used by the job being submitted. This will return an error on job
registration saying that Vault is disabled. Fix that to check only the
cluster(s) used by the job.

Ref: https://github.com/hashicorp/nomad/issues/18947
Fixes: https://github.com/hashicorp/nomad/issues/18990
2023-11-06 10:26:20 -05:00
dependabot[bot]
a13f0c6c2d build(deps-dev): bump next from 13.4.2 to 14.0.1 in /website (#18999)
Bumps [next](https://github.com/vercel/next.js) from 13.4.2 to 14.0.1.
- [Release notes](https://github.com/vercel/next.js/releases)
- [Changelog](https://github.com/vercel/next.js/blob/canary/release.js)
- [Commits](https://github.com/vercel/next.js/compare/v13.4.2...v14.0.1)

---
updated-dependencies:
- dependency-name: next
  dependency-type: direct:development
  update-type: version-update:semver-major
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2023-11-06 09:22:53 -05:00
Tim Gross
b62c5c51d2 cli: extend coverage of operator client-state command (#18996)
The `operator client-state` command is mostly used for developer debugging of
the Nomad client state, but it hasn't been updated with several recent
additions. Add allocation identities, network status, and dynamic volumes to the
objects it outputs.

Also, fix a bug where reading the state for an allocation without task states
will crash the CLI. This can happen if the Nomad client stops after an alloc is
persisted to disk but before the task actually starts.
2023-11-03 15:43:05 -04:00
Erwan Ben Souiden
9f995e76a4 docs: fix Grafana doc breaking link (#18988) 2023-11-03 14:31:37 +00:00
James Rasell
5f98e6473c acl: use token locality consts when validating auth methods. (#18975) 2023-11-03 07:22:54 +00:00
Seth Hoenig
1604dba508 client: fingerprint cpu on raspberry pi (#18982)
This PR tweaks the linux cpu fingerprinter to handle the case where no
NUMA node data is found under /sys/devices/system/, in which case we
need to assume just one node, one socket.
2023-11-02 15:52:37 -05:00
Michael Schurter
78f0c6b2a9 cli: update acl bootstrap help to match docs (#18961)
See https://developer.hashicorp.com/nomad/docs/commands/acl/bootstrap
2023-11-02 08:52:21 -07:00
Tim Gross
142884b384 ignore KEK wrapper struct for codegen (#18973)
Our codec code generation doesn't honor `json:"..."` tags which, if we were to
ever implement `json.Marshaler` for the `KeyEncryptionKeyWrapper` struct, would
break the on-disk format of all the existing KEKs.

As a precaution, add this struct to the code generator's ignore list (just like
we have done with `IdentityClaims`).
2023-11-02 11:25:40 -04:00
James Rasell
6d0893cf57 acl/client: fix incorrect denied error on calls with dangling policies. (#18972)
When a user performs a client API call, the Nomad client will
perform an RPC which looks up the ACL policies assigned to the caller's
ACL token. If the ACL token includes dangling (deleted)
policies, the call would previously fail with a permission denied
error.

This change ensures this error is not returned and that the lookup
will succeed in the event of dangling policies.
2023-11-02 15:23:42 +00:00
Luiz Aoqui
a907273557 vault: fix import cycle in vaultclient (#18965)
* Revert "vault: eliminate vaultclient test import cycle (#18652)"

This reverts commit 03cf9ae7ff.

* vault: remove import cycle in vaultclient_test.go
2023-11-02 11:07:04 -04:00
Seth Hoenig
61e21db2b4 docs: add 1.7 cpu upgrade notes and tweak cpu concepts doc (#18977)
* docs: add 1.7 cpu upgrade notes and tweak cpu concepts doc

* docs: fix spelling
2023-11-02 09:58:16 -05:00
Seth Hoenig
0dc9c49c6c docs: add a Concepts/CPU docs page (#18924)
* docs: add a Concepts/CPU docs page

* docs: cpu doc cr feedback

* docs: cpu fix image
2023-11-02 08:45:43 -05:00
Piotr Kazmierczak
d69a1238cd cli: consul setup command (#18820)
An interactive setup helper for configuring Consul to accept Nomad WI-enabled workloads.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Luiz Aoqui <luiz@hashicorp.com>
2023-11-02 09:02:07 +01:00
James Rasell
0822af35af cli: remove unused raft tool helper. (#18954) 2023-11-02 07:43:44 +00:00
Michael Schurter
0040427c6d identity: don't generate codec for oidc config (#18964)
Our codec code generation doesn't honor `json:"..."` tags which breaks
the OIDC Discovery endpoint.

This adds the relevant struct to the code generator's ignore list (just
like we have done with IdentityClaims).
2023-11-01 13:20:00 -07:00
Tim Gross
feede21d9a test: make CSI bad state GC test synchronous (#18960)
One of our core scheduler GC tests checks that volumes with invalid
allocations immediately have those claims marked as past claims and put
into the unpublishing state. This happens synchronously with the GC evaluation
processing, so there's no need for us to wait for the results.

Fixes: #18959
2023-11-01 15:31:42 -04:00
Seth Hoenig
51b8737ca9 Release/1.7.0 beta.1 (#18962)
* Prepare release 1.7.0-beta.1

* cl: tweak actions cl entry

* Generate files for 1.7.0-beta.1 release

* Prepare for next release

---------

Co-authored-by: hc-github-team-nomad-core <github-team-nomad-core@hashicorp.com>
2023-11-01 14:27:59 -05:00
Logan Attwood
0e643501de Fix the "Starting" allocations link (#18866)
Before this commit, it would bring you to the list of allocations
filtered by status=starting. This status does not exist in the Status
drop-down on the Allocations section of a job in the UI.
2023-11-01 15:23:43 -04:00
Michael Schurter
0b0ae40199 docs: recommend rotating keys on upgrade (#18958)
RIP EdDSA.
2023-11-01 10:57:33 -07:00
Tim Gross
483e78615d template: fix test assertion to be compatible between CE/ENT (#18957)
The template hook emits an error when the task has a Consul block that requires
WI but there's no WI. The exact error message we get depends on whether we're
running in CE or ENT. Update the test assertion so that we can tolerate this
difference without building ENT-specific test files.
2023-11-01 13:26:45 -04:00
Anthony
e1acf72eb5 Automated license utilization reporting docs (#17976) 2023-11-01 12:18:04 -04:00
Seth Hoenig
02d433225f cl: use caps for feature (#18956) 2023-11-01 10:56:39 -05:00
Tim Gross
dd62e8a319 consul/vault: use accessor method to get cluster name in client (#18955)
When looking up the Consul or Vault cluster from a client hook, we should always
use an accessor function rather than trying to lookup the `Cluster` field, which
may be empty for jobs registered before Nomad 1.7.
2023-11-01 10:59:59 -04:00
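The accessor pattern from commit dd62e8a319 can be sketched as below. The type and method names are hypothetical, and the default cluster name `"default"` is an assumption made for illustration.

```go
package main

import "fmt"

// defaultCluster is the assumed name of the default cluster.
const defaultCluster = "default"

// consulBlock is a hypothetical, pared-down version of a job's Consul block.
type consulBlock struct {
	Cluster string
}

// clusterName returns the configured cluster, falling back to the default
// when the block is nil or the field is empty, as it may be for jobs
// registered before the field existed.
func (c *consulBlock) clusterName() string {
	if c == nil || c.Cluster == "" {
		return defaultCluster
	}
	return c.Cluster
}

func main() {
	var missing *consulBlock
	fmt.Println(missing.clusterName())                        // nil block
	fmt.Println((&consulBlock{}).clusterName())               // empty field
	fmt.Println((&consulBlock{Cluster: "east"}).clusterName()) // explicit cluster
}
```

Routing every lookup through one method keeps the fallback in a single place, so callers cannot forget the empty-field case.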
Michael Schurter
e49ca3c431 identity: Implement change_mode (#18943)
* identity: support change_mode and change_signal

wip - just jobspec portion

* test struct

* cleanup some insignificant boogs

* actually implement change mode

* docs tweaks

* add changelog

* test identity.change_mode operations

* use more words in changelog

* job endpoint tests

* address comments from code review

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2023-11-01 09:41:11 -05:00
Tim Gross
d62213a135 consul: fix lookups of default cluster across upgrades (#18945)
Allocations that were created before Nomad 1.7 will not have the cluster field
set for their Consul blocks. While this can be corrected server-side, that
doesn't help allocations already on clients.
2023-11-01 10:11:54 -04:00
James Rasell
4ec27a97d1 docs: clarify ACL agent config TTL params apply to auth methods. (#18949) 2023-11-01 13:45:13 +00:00
Luiz Aoqui
bfb2dcd172 Vault small fixes (#18942)
* vault: remove `token_ttl` from `vaultcompat` setup

Since Nomad uses periodic tokens, the right value to set in the role is
`token_period`, not `token_ttl`.

* vault: set 1.11.0 as min version for JWT auth

In order to use workload identity JWT auth with Vault, it's required to
have a Vault cluster running v1.11.0+, which is the version where
`user_claim_json_pointer` was introduced.
2023-11-01 08:23:19 -04:00
Seth Hoenig
5b56a5c5d1 client: fix cpu core/freq calculation on intel macs (#18934) 2023-11-01 07:16:26 -05:00
James Rasell
4a89a0a0f2 changelog: fix entry wording for #18873 (#18927) 2023-11-01 09:56:31 +00:00
Tim Gross
c1fa145765 vault: fix lookups of default cluster across upgrades (#18940)
Allocations that were created before Nomad 1.7 will not have the `cluster` field
set for their Vault blocks. While this can be corrected server-side, that
doesn't help allocations already on clients.

Also add extra safety to the Consul cluster lookup.
2023-10-31 17:30:01 -04:00
Luiz Aoqui
d7edbd44b7 api: handle redirect during websocket upgrade (#18903)
When attempting a WebSocket connection upgrade the client may receive a
redirect request from the server, in which case the request should be
reattempted using the new address present in the `Location` header.
2023-10-31 17:12:11 -04:00
Luiz Aoqui
3ddf1ecf1d actions: minor bug fixes and improvements (#18904) 2023-10-31 17:06:02 -04:00
Tim Gross
2bff6d2a6a docs: fix token_period in example Vault role for WI (#18939)
Vault tokens requested for WI are "periodic" Vault tokens (ones that get
periodically renewed). The field we should be setting for the renewal window is
`token_period`.
2023-10-31 16:33:03 -04:00
Michael Schurter
9afc70ef5a Fix Vault docs to use HCL instead of JSON (#18938) 2023-10-31 13:25:20 -07:00
Michael Schurter
f8a65b6c29 docs: changelog & basic docs for 1.7 WI changes (#18936)
Changelog entries and bare minimum docs for workload identity changes in 1.7.
2023-10-31 13:06:08 -07:00
Michael Schurter
66fbc0f67e identity: default to RS256 for new workload ids (#18882)
OIDC mandates the support of the RS256 signing algorithm so in order to maximize workload identity's usefulness this change switches from using the EdDSA signing algorithm to RS256.

Old keys will continue to use EdDSA but new keys will use RS256. The EdDSA generation code was left in place because it's fast and cheap and I'm not going to lie I hope we get to use it again.

**Test Updates**

Most of our Variables and Keyring tests had a subtle assumption in them that the keyring would be initialized by the time the test server had elected a leader. ed25519 key generation is so fast that the fact that it was happening asynchronously with server startup didn't seem to cause problems. Sadly rsa key generation is so slow that basically all of these tests failed.

I added a new `testutil.WaitForKeyring` helper to replace `testutil.WaitForLeader` in cases where the keyring must be initialized before the test may continue. However this is mostly used in the `nomad/` package.

In the `api` and `command/agent` packages I decided to switch their helpers to wait for keyring initialization by default. This will slow down tests a bit, but allow those packages to not be as concerned with subtle server readiness details. On my machine rsa key generation takes 63ms, so hopefully the difference isn't significant on CI runners.

**TODO**

- Docs and changelog entries.
- Upgrades - right now upgrades won't get RS256 keys until their root key rotates either manually or after ~30 days.
- Observability - I'm not sure there's a way for operators to see if they're using EdDSA or RS256 unless they inspect a key. The JWKS endpoint can be inspected to see if EdDSA will be used for new identities, but it doesn't technically define which key is active. If upgrades can be fixed to automatically rotate keys, we probably don't need to worry about this.

**Requiem for ed25519**

When workload identities were first implemented we did not immediately consider OIDC compliance. Consul, Vault, and many other third parties support JWT auth methods without full OIDC compliance. For the machine<-->machine use cases workload identity is intended to fulfill, OIDC seemed like a bigger risk than asset.

EdDSA/ed25519 is the signing algorithm we chose for workload identity JWTs because of all these lovely properties:

1. Deterministic keys that can be derived from our preexisting root keys. This was perhaps the biggest factor since we already had a root encryption key around from which we could derive a signing key.
2. Wonderfully compact: 64 byte private key, 32 byte public key, 64 byte signatures. Just glorious.
3. No parameters. No choices of encodings. It's all well-defined by [RFC 8032](https://datatracker.ietf.org/doc/html/rfc8032).
4. Fastest performing signing algorithm! We don't even care that much about the performance of our chosen algorithm, but what a free bonus!
5. Arguably one of the most secure signing algorithms widely available. Not just from a cryptanalysis perspective, but from an API and usage perspective too.

Life was good with ed25519, but sadly it could not last.

[IDPs](https://en.wikipedia.org/wiki/Identity_provider), such as AWS's IAM OIDC Provider, love OIDC. They have OIDC implemented for humans, so why not reuse that OIDC support for machines as well? Since OIDC mandates RS256, many implementations don't bother implementing other signing algorithms (or at least not advertising their support). A quick survey of OIDC Discovery endpoints revealed only 2 out of 10 OIDC providers advertised support for anything other than RS256:

- [PayPal](https://www.paypalobjects.com/.well-known/openid-configuration) supports HS256
- [Yahoo](https://api.login.yahoo.com/.well-known/openid-configuration) supports ES256

RS256 only:

- [GitHub](https://token.actions.githubusercontent.com/.well-known/openid-configuration)
- [GitLab](https://gitlab.com/.well-known/openid-configuration)
- [Google](https://accounts.google.com/.well-known/openid-configuration)
- [Intuit](https://developer.api.intuit.com/.well-known/openid_configuration)
- [Microsoft](https://login.microsoftonline.com/fabrikamb2c.onmicrosoft.com/v2.0/.well-known/openid-configuration)
- [SalesForce](https://login.salesforce.com/.well-known/openid-configuration)
- [SimpleLogin (acquired by ProtonMail)](https://app.simplelogin.io/.well-known/openid-configuration/)
- [TFC](https://app.terraform.io/.well-known/openid-configuration)
2023-10-31 11:25:20 -07:00
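The size trade-off mentioned in commit 66fbc0f67e can be seen by signing the same payload with both algorithms using Go's standard library. This is only a sketch: the 2048-bit RSA key size is an assumption (the commit message does not state one), the payload is a placeholder rather than a real JWT, and the helper names are hypothetical. RS256 here means RSASSA-PKCS1-v1_5 over a SHA-256 digest.

```go
package main

import (
	"crypto"
	"crypto/ed25519"
	"crypto/rand"
	"crypto/rsa"
	"crypto/sha256"
	"fmt"
)

// signEd25519 signs msg with a fresh ed25519 key and verifies the result.
func signEd25519(msg []byte) ([]byte, bool, error) {
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		return nil, false, err
	}
	sig := ed25519.Sign(priv, msg)
	return sig, ed25519.Verify(pub, msg, sig), nil
}

// signRS256 signs msg with a fresh RSA key (2048 bits, an assumed size)
// using PKCS #1 v1.5 over SHA-256, then verifies the result.
func signRS256(msg []byte) ([]byte, bool, error) {
	priv, err := rsa.GenerateKey(rand.Reader, 2048)
	if err != nil {
		return nil, false, err
	}
	digest := sha256.Sum256(msg)
	sig, err := rsa.SignPKCS1v15(rand.Reader, priv, crypto.SHA256, digest[:])
	if err != nil {
		return nil, false, err
	}
	verified := rsa.VerifyPKCS1v15(&priv.PublicKey, crypto.SHA256, digest[:], sig) == nil
	return sig, verified, nil
}

func main() {
	msg := []byte("header.payload") // placeholder for a JWT signing input

	edSig, edOK, err := signEd25519(msg)
	if err != nil {
		panic(err)
	}
	rsSig, rsOK, err := signRS256(msg)
	if err != nil {
		panic(err)
	}
	fmt.Printf("ed25519: %d-byte signature, verified=%v\n", len(edSig), edOK)
	fmt.Printf("RS256:   %d-byte signature, verified=%v\n", len(rsSig), rsOK)
}
```

An ed25519 signature is always 64 bytes, while an RSA signature is as long as the modulus (256 bytes for a 2048-bit key), and RSA key generation is noticeably slower, which matches the test slowdown the commit message describes.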
Tim Gross
01d050c36b identity: version check multiple and implicit identities (#18926)
Job submitters cannot set multiple identities prior to Nomad 1.7, and cluster
administrators should not set the identity configurations for their `consul` and
`vault` configuration blocks until all servers have been upgraded. Validate
these cases during job submission so as to prevent state store corruption when
jobs are submitted in the middle of a cluster upgrade.
2023-10-31 13:57:53 -04:00
Tim Gross
ea3e711fa6 docs: upgrade guide for integrations deprecation warnings (#18928)
The Consul and Vault integrations work shipping in Nomad 1.7 will deprecate the
existing token-based workflows. These will be removed in Nomad 1.9, so add a
note describing this to the upgrade guide.
2023-10-31 13:21:47 -04:00
Tim Gross
790d4d5d7a changelog entries for Integrations feature work (#18923) 2023-10-31 11:53:43 -04:00
Phil Renaud
d98ed87c1b Actions changelog update to feature (#18921) 2023-10-30 20:28:50 -04:00