25539 Commits

Author SHA1 Message Date
dependabot[bot]
40bbddf3d8 chore(deps): bump github.com/prometheus/client_golang (#19733) 2024-01-15 08:24:43 +00:00
Luiz Aoqui
e1e80f383e vault: add new nomad setup vault -check command (#19720)
The new `nomad setup vault -check` command can be used to retrieve
information about the changes required before a cluster is migrated from
the deprecated legacy authentication flow with Vault to use only
workload identities.
2024-01-12 15:48:30 -05:00
Seth Hoenig
5b7f4746ce client/allocdir: use an interface in place of AllocDir structs (#19703)
* client/allocdir: use an interface in place of AllocDir structs

This PR replaces *allocdir.AllocDir with allocdir.Interface such that we
may eventually have another implementation of alloc directories. This is
in support of the exec2 driver, which will need an implementation of the
alloc directory incompatible with the current version.

* use rlock
2024-01-12 14:13:29 -06:00
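The interface-for-struct swap described in 5b7f4746ce can be sketched as follows. This is only an illustration of the pattern: the method set (`Build`, `TaskDir`) and both implementations (`defaultDir`, `isolatedDir`) are hypothetical names, not Nomad's actual `allocdir` API.

```go
package main

import (
	"fmt"
	"path"
)

// Interface is a hypothetical, trimmed-down alloc directory abstraction.
// The real method set is not given in the commit message.
type Interface interface {
	Build() error
	TaskDir(task string) string
}

// defaultDir stands in for the existing implementation.
type defaultDir struct{ root string }

func (d *defaultDir) Build() error               { return nil }
func (d *defaultDir) TaskDir(task string) string { return path.Join(d.root, task) }

// isolatedDir stands in for a second, incompatible layout such as the one
// a new driver might need.
type isolatedDir struct{ root string }

func (d *isolatedDir) Build() error { return nil }
func (d *isolatedDir) TaskDir(task string) string {
	return path.Join(d.root, "tasks", task, "root")
}

// describe depends only on the interface, so either layout can be passed in.
func describe(d Interface, task string) string {
	if err := d.Build(); err != nil {
		return "build failed: " + err.Error()
	}
	return d.TaskDir(task)
}

func main() {
	fmt.Println(describe(&defaultDir{root: "/alloc"}, "web"))
	fmt.Println(describe(&isolatedDir{root: "/alloc"}, "web"))
}
```

Callers written against `Interface` need no changes when a second implementation is added, which is the property the commit is after.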
Piotr Kazmierczak
858a805d7d e2e: add a note about provisioning the infrastructure on macOS/Apple Silicon (#19727) 2024-01-12 14:09:50 +01:00
Piotr Kazmierczak
5d12ca4f57 state store: better handling of job deletion (#19609)
When jobs are deleted with -purge, all their deployments and allocations should
be deleted from the state store, and the status of their evals should be set to complete.
Otherwise we end up in a situation where users could re-submit previously
failing jobs, but these new jobs would not get deployments allocated unless
system gc got called.
2024-01-12 10:08:55 +01:00
Luiz Aoqui
b2aa6ffd05 docs: fix Consul ACL requirements (#19721)
Even with the new workload identity based flow, the Nomad servers still
need the `acl = "write"` permission in order to revoke service identity
tokens.
2024-01-11 15:52:23 -05:00
Seth Hoenig
a58f0eca8e e2e: move rawexec oversub tests into oversubscription e2e test suite (#19717)
* e2e: move rawexec oversub tests into oversubscription e2e test suite

This PR moves two tests for raw_exec and memory oversubscription into
the oversubscription test suite, which has the necessary plumbing to
activate and restore the oversubscription configuration of the scheduler
during the test.

* cr: rename files for better readability
2024-01-11 14:27:05 -06:00
Luiz Aoqui
8d0a469000 vault: remove revoked Vault accessors from state (#19706)
When using the no-op Vault client the Nomad server still needs to delete
the revoked Vault accessors from state to prevent them from lingering
forever after the cluster migrates to the workload identity flow.
2024-01-11 14:38:51 -05:00
Seth Hoenig
aad932eeee build: update to go1.21.6 (#19709) 2024-01-11 09:48:56 -06:00
Tim Gross
4c206d0b19 docs: changelog entry for ENT PR (#19705)
Ref: https://github.com/hashicorp/nomad-enterprise/pull/1370
2024-01-11 10:36:08 -05:00
Seth Hoenig
0c08f94c8e build: use setup-golang@v3 to handle auto caching (#19707)
* wip: try on branch

* build: use setup-golang@v3 to handle auto caching
2024-01-11 08:51:56 -06:00
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
Tim Gross
1254468600 consul: refactor job mutation hook (#19699)
The job mutation logic for Nomad CE and Nomad ENT is nearly identical except
for a prelude that grabs the correct default cluster. Factor this out into a
method that can be shared between both code bases.
2024-01-10 16:29:05 -05:00
CJ
c9cd8480fa docs: considerations for Stateful Workloads (#19077)
Co-authored-by: Adrian Todorov <adrian.todorov@hashicorp.com>
2024-01-10 16:06:45 -05:00
Piotr Kazmierczak
930339a0fa e2e: remove broken Consul WI test (#19697) 2024-01-10 21:31:18 +01:00
Tim Gross
0935f443dc vault: support allowing tokens to expire without refresh (#19691)
Some users with batch workloads or short-lived prestart tasks want to derive a
Vault token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.

When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.

Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.

Fixes: https://github.com/hashicorp/nomad/issues/8690
2024-01-10 14:49:02 -05:00
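The renewal decision described in 0935f443dc can be sketched as a small predicate. The struct and function names below (`tokenSettings`, `shouldRunRenewalLoop`) are hypothetical and do not come from Nomad's `vault_hook`; the sketch only encodes the two conditions stated in the commit message.

```go
package main

import "fmt"

// tokenSettings is a hypothetical view of the inputs the client would need.
type tokenSettings struct {
	AllowTokenExpiration bool // mirrors vault.allow_token_expiration in the job
	Renewable            bool // whether the Vault auth configuration allows renewals
}

// shouldRunRenewalLoop reports whether the client should keep renewing the
// token. The loop is skipped when the job opts into expiration, or when the
// token cannot be renewed in the first place.
func shouldRunRenewalLoop(s tokenSettings) bool {
	if s.AllowTokenExpiration {
		return false
	}
	return s.Renewable
}

func main() {
	cases := []tokenSettings{
		{AllowTokenExpiration: false, Renewable: true},
		{AllowTokenExpiration: true, Renewable: true},
		{AllowTokenExpiration: false, Renewable: false},
	}
	for _, c := range cases {
		fmt.Printf("%+v -> renew=%t\n", c, shouldRunRenewalLoop(c))
	}
}
```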
Luiz Aoqui
5267eec3ad vault: fix token revocation during workflow migration (#19689)
When transitioning from the legacy token-based workflow to the new JWT
workflow for Vault the previous code would instantiate a no-op Vault if
the server configuration had a `default_identity` block.

This no-op client returned an error if some of its operations were
called, such as `LookupToken` and `RevokeTokens`. The original intention
was that, in the new JWT workflow, none of these methods should be
called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even
in the JWT flow. When a leadership transition happens, the new server
looks for unused Vault accessors from state and tries to revoke them.
Similarly, the `RevokeTokens` method is called every time the
`Node.UpdateStatus` and `Node.UpdateAlloc` RPCs are made by clients, as
the Nomad server tries to find unused Vault tokens for the node/alloc.

Since the new JWT flow does not require Nomad servers to contact Vault,
calling `RevokeTokens` and `MarkForRevocation` is not able to complete
without a Vault token, so this commit changes the logic to use the no-op
Vault client when no token is configured. It also updates the client
itself to not error if these methods are called, but to rather just log
so operators can be made aware that there are Vault tokens created by
Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based
flow, Nomad operators must first upgrade the Nomad version without
removing any of the existing Vault configuration. Removing it too early can
prevent Nomad servers from managing and cleaning up existing Vault tokens
during a leadership transition and node or alloc updates.

Operators must also resubmit all jobs with a `vault` block so they are
updated with an `identity` for Vault. Skipping this step may cause
allocations to fail if their Vault token expires (if, for example, the
Nomad client stops running for TTL/2) or if they are rescheduled, since
the new client will try to follow the legacy flow which will fail if the
Nomad server configuration for Vault has already been updated to remove
the Vault address and token.
2024-01-10 13:28:46 -05:00
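The behavior change in 5267eec3ad (a no-op client that logs instead of returning an error) might look roughly like the sketch below. The interface, the `noopRevoker` type, and the `needsNoopClient` helper are hypothetical; only the method name `RevokeTokens` and the "no token configured" rule come from the commit message, and the signature used here is an assumption.

```go
package main

import (
	"fmt"
	"log"
	"os"
)

// tokenRevoker is a hypothetical, minimal interface for this sketch.
type tokenRevoker interface {
	RevokeTokens(accessors []string) error
}

// noopRevoker never contacts Vault. It records and logs what it skipped so
// operators can see that tokens were left to expire on their own.
type noopRevoker struct {
	logger  *log.Logger
	skipped int
}

func newNoopRevoker() *noopRevoker {
	return &noopRevoker{logger: log.New(os.Stderr, "vault-noop: ", 0)}
}

// RevokeTokens logs instead of failing, so callers on the leadership
// transition and node/alloc update paths are not interrupted.
func (n *noopRevoker) RevokeTokens(accessors []string) error {
	if len(accessors) == 0 {
		return nil
	}
	n.skipped += len(accessors)
	n.logger.Printf("%d accessor(s) not revoked: no Vault token configured", len(accessors))
	return nil
}

// needsNoopClient encodes the selection rule from the commit message: use
// the no-op client when the server has no Vault token configured.
func needsNoopClient(serverToken string) bool {
	return serverToken == ""
}

func main() {
	var r tokenRevoker = newNoopRevoker()
	err := r.RevokeTokens([]string{"accessor-1", "accessor-2"})
	fmt.Println("error:", err, "noop selected:", needsNoopClient(""))
}
```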
Tim Gross
d3e5cae1eb consul: support admin partitions (#19665)
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: https://github.com/hashicorp/nomad/issues/13139
2024-01-10 10:41:29 -05:00
Daniel Peinhopf
9eb357020d Docs: Alternative IIS Task Driver (#19411) 2024-01-10 14:14:30 +00:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
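The `memory_max = -1` rule from cb7d078c1d can be sketched as a small resolver. The function name, the sentinel constant, and the fallback behavior when oversubscription is disabled are assumptions made for illustration, not the driver's actual code.

```go
package main

import "fmt"

// unlimited is the sentinel value described in the commit (memory_max = -1).
const unlimited = -1

// effectiveLimitMB returns the memory ceiling to enforce for a task, or
// unlimited when no ceiling should be applied. Assumption for this sketch:
// with oversubscription disabled, memory_max is ignored and the regular
// memory value is the limit.
func effectiveLimitMB(memoryMB, memoryMaxMB int, oversubscription bool) int {
	switch {
	case !oversubscription:
		return memoryMB
	case memoryMaxMB == unlimited:
		return unlimited
	case memoryMaxMB > memoryMB:
		return memoryMaxMB
	default:
		return memoryMB
	}
}

func main() {
	fmt.Println(effectiveLimitMB(256, unlimited, true))
	fmt.Println(effectiveLimitMB(256, 512, true))
	fmt.Println(effectiveLimitMB(256, unlimited, false))
}
```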
Egor Mikhailov
18f49e015f auth: add new optional OIDCDisableUserInfo setting for OIDC auth provider (#19566)
Add new optional `OIDCDisableUserInfo` setting for OIDC auth provider which
disables a request to the identity provider to get OIDC UserInfo.

This option is helpful when your identity provider doesn't send any additional
claims from the UserInfo endpoint, such as Microsoft AD FS OIDC Provider:

> The AD FS UserInfo endpoint always returns the subject claim as specified in the
> OpenID standards. AD FS doesn't support additional claims requested via the
> UserInfo endpoint

Fixes #19318
2024-01-09 13:41:46 -05:00
Tim Gross
c875f3e49a docs: expand docs on implicit ACL capabilities grants (#19681)
An audit of Nomad's ACLs resulted in some confusion around whether the
`NamespaceValidator` method is conjunctive ("and", as implied by the docs) or
disjunctive ("or", as it is by design). Clarify the ACL documentation as
follows:

* Call out where fine-grained capabilities imply grants to other
  capabilities (for example, that `csi-read-volume` grants `csi-list-volume`).
* Fix an incorrectly documented ACL requirement for the CSI List External
  Volumes API.
* Clarify how ACLs are expected to work for the two search API endpoints, such
  that you need list/read access to the objects in the search context.
2024-01-09 13:25:05 -05:00
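The disjunctive ("or") check and the implied-capability grants described in c875f3e49a can be sketched as below. The types and helpers (`capabilitySet`, `newCapabilitySet`, `anyOf`) are hypothetical and are not Nomad's ACL implementation; the only implication taken from the commit message is `csi-read-volume` granting `csi-list-volume`.

```go
package main

import "fmt"

// capabilitySet is a hypothetical set of capabilities held in one namespace.
type capabilitySet map[string]bool

// implied lists capabilities that are granted along with another one.
var implied = map[string][]string{
	"csi-read-volume": {"csi-list-volume"},
}

// newCapabilitySet expands each granted capability with the ones it implies.
func newCapabilitySet(granted ...string) capabilitySet {
	set := capabilitySet{}
	for _, c := range granted {
		set[c] = true
		for _, extra := range implied[c] {
			set[extra] = true
		}
	}
	return set
}

// anyOf builds a disjunctive validator: it passes when the set holds at
// least one of the listed capabilities, not all of them.
func anyOf(caps ...string) func(capabilitySet) bool {
	return func(set capabilitySet) bool {
		for _, c := range caps {
			if set[c] {
				return true
			}
		}
		return false
	}
}

func main() {
	set := newCapabilitySet("csi-read-volume")
	canList := anyOf("csi-list-volume", "csi-write-volume")
	canSubmit := anyOf("submit-job")
	fmt.Println("list volumes:", canList(set))
	fmt.Println("submit job:", canSubmit(set))
}
```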
James Rasell
a3a03dff78 acl: ensure auth method configs are correctly and fully hashed. (#19677) 2024-01-09 14:03:26 +00:00
dependabot[bot]
f3bc9c7c41 chore(deps): bump github.com/docker/docker (#19672) 2024-01-09 08:24:20 +00:00
Tim Gross
a399f16a31 docs: describe cgroup controller requirements (#19493)
Nomad can only use cgroups to control resource requirements if all the cgroups
controllers are actually enabled. Add this to our requirements documentation as
well as the impacted `exec` and `java` task drivers.
2024-01-08 10:01:14 -05:00
am-ak
7dc82f233f [DOCS] Update docker.mdx (#19657)
Removed info regarding development of Nomad
2024-01-08 14:32:57 +00:00
James Rasell
fbea8d1051 server: Fix panic when validating non-service reschedule block. (#19652) 2024-01-08 14:14:00 +00:00
Shantanu Gadgil
6bbd3b0cec reschedule is at group level (#19653)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2024-01-08 10:54:52 +00:00
dependabot[bot]
398b5000c1 chore(deps): bump github.com/hashicorp/go-plugin from 1.4.10 to 1.6.0 (#19646)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2024-01-08 08:26:34 +00:00
James Rasell
ff2d0d6453 cli: Fix dummy FSM create to ensure snapshot state command works. (#19630)
The Nomad state store function was recently updated to validate
certain parameters, fixing a panic condition. This change meant the
dummy FSM used for the snapshot state command was always failing
this validation and the command no longer worked.

This change adds the required parameter to pass validation and
therefore makes the CLI command functional again.
2024-01-05 16:00:24 +00:00
Marvin Chin
be8575a8a2 Fix server shutdown not waiting for worker run completion (#19560)
* Move group into a separate helper module for reuse

* Add shutdownCh to worker

The shutdown channel is used to signal that the worker has stopped.

* Make server shutdown block on workers' shutdownCh

* Fix waiting for eval broker state change blocking indefinitely

There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.

* Bound the amount of time server shutdown waits on worker completion

* Fix lostcancel linter error

* Fix worker test using unexpected worker constructor

* Add changelog

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
2024-01-05 08:45:07 -06:00
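The race fix described in be8575a8a2 (unblocking a writer when the notifier has already stopped) follows a common Go pattern: select on the send and on a stop signal together. The sketch below uses hypothetical names (`notifier`, `stopped`, `unsubscribe`) and is not the actual `GenericNotifier` code.

```go
package main

import (
	"fmt"
	"time"
)

// notifier is a hypothetical, stripped-down stand-in for a notifier whose
// run loop normally drains unsubscribeCh.
type notifier struct {
	unsubscribeCh chan chan struct{}
	stopped       chan struct{} // closed once the run loop has exited
}

// unsubscribe hands the subscription back to the run loop. If the loop has
// stopped and the channel is full, it gives up instead of blocking forever.
func (n *notifier) unsubscribe(sub chan struct{}) bool {
	select {
	case n.unsubscribeCh <- sub:
		return true
	case <-n.stopped:
		return false
	}
}

func main() {
	n := &notifier{
		unsubscribeCh: make(chan chan struct{}, 1),
		stopped:       make(chan struct{}),
	}
	n.unsubscribeCh <- make(chan struct{}) // fill the buffer; nothing reads it
	close(n.stopped)                       // simulate the run loop exiting

	done := make(chan bool, 1)
	go func() { done <- n.unsubscribe(make(chan struct{})) }()

	select {
	case delivered := <-done:
		fmt.Println("unsubscribe returned, delivered =", delivered)
	case <-time.After(time.Second):
		fmt.Println("unsubscribe is still blocked")
	}
}
```

Without the `<-n.stopped` case, the send on the full channel would block indefinitely once the reader is gone, which is the situation the commit describes.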
James Rasell
5a00440b06 api: Fix operator snapshot API streaming. (#19608) 2024-01-05 14:33:39 +00:00
dependabot[bot]
37af843b01 chore(deps): bump github.com/opencontainers/runc from 1.1.8 to 1.1.10 (#19289) 2024-01-05 09:57:54 +00:00
dependabot[bot]
c2e6d8aee2 build(deps): bump github.com/containerd/containerd from 1.6.18 to 1.6.26 (#19531) 2024-01-05 09:29:14 +00:00
James Rasell
f3ed406b0f state: ensure the job submission table is persisted and restored. (#19605) 2024-01-05 08:12:27 +00:00
James Rasell
2abbd7e485 cli: fix operator snapshot save help output examples. (#19606) 2024-01-05 07:43:12 +00:00
Phil Renaud
a5881963dd Error message typo fix: Filed to Failed (#19611) 2024-01-04 21:56:23 -05:00
Phil Renaud
16876697a1 [ui] Adds group-name tooltips to deploying and steady-state job panels (#19601)
* Adds group-name tooltips to deploying and steady-state job panels

* Default tooltip text for mirage edge cases
2024-01-04 13:10:37 -05:00
Phil Renaud
75b830ef04 [ui] Changelog for multi-line variables (#19600)
* Changelog for multi-line variables

* Multi-entry changelog
2024-01-04 12:00:50 -05:00
Seth Hoenig
4b3ee77d6b docs: update raw_exec driver docs and 1.7 upgrade notes (#19598) 2024-01-04 08:26:46 -06:00
Seth Hoenig
ccfb13a72d e2e: add test for raw_exec memory_max configuration (#19596)
* e2e: add test for raw_exec memory_max configuration

* docs: note raw_exec supports memory_max in resources documentation
2024-01-04 08:25:56 -06:00
Piotr Kazmierczak
aa197cf824 e2e: pass Nomad address to Consul WI test (#19603) 2024-01-04 08:52:39 +01:00
Phil Renaud
89cceebb91 [ui] Multi-line variable values and helios upgrades generally (#19544)
* Multi-line variable values and helios upgrades generally

* Variables page titles and actions restyle

* Hacky fix to keyboard shortcut otherwise bumping space on shift

* Related entities heliosified

* Namespace and path fields heliosed

* Paths table heliosified

* Variable view table

* Fixups after design discussion

* Monospaced editing

* De-commented template placeholder

* Acceptance tests updated for helios components across variables

* Tests helios'd in variable-form-test

* PR suggestions
2024-01-03 15:54:22 -05:00
Marvin Chin
d75293d2ab Add OOM detection for exec driver (#19563)
* Add OomKilled field to executor proto format

* Teach linux executor to detect and report OOMs

* Teach exec driver to propagate OOMKill information

* Fix data race

* use tail /dev/zero to create oom condition

* use new test framework

* minor tweaks to executor test

* add cl entry

* remove type conversion

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-03 09:50:27 -06:00
Tim Gross
f2630add91 acl: remove timestamps from WhoAmI response (#19578)
In Nomad 1.7 we updated our JWT library to go-jose, but this changed the wire
format of the embedded struct we have in the `IdentityClaims` struct that we
return as part of the `WhoAmI` RPC response. This wasn't originally intended to
be sent over the wire but other changes in Nomad 1.5+ added a caller to the
client. The library change causes a deserialization error on Nomad 1.5 and 1.6
clients, which prevents access to Nomad Variables and SD via template blocks.

Removed the incompatible fields from the response, which are unused by any
current caller. In a future version of Nomad, we'll likely remove the `WhoAmI`
callers from the client in favor of using the public keys the clients have to
check auth.

Fixes: https://github.com/hashicorp/nomad/issues/19555
2024-01-03 08:24:38 -05:00
James Rasell
91cba75f5c copywrite: fix and add copywrite config enterprise comments. (#19590)
Nomad CI checks for copywrite headers using multiple config files
for specific exemption paths. This means the top-level config file
does not take effect when running the copywrite script within
these sub-folders. Exempt files therefore need to be added to the
sub-config files, along with the top level.
2024-01-03 08:58:53 +00:00
Piotr Kazmierczak
a87aa71f55 e2e: fix typo in Consul e2e (#19589) 2024-01-03 09:34:38 +01:00
Tim Gross
e7ca2b51ad vault: ignore allow_unauthenticated config if identity is set (#19585)
When the server's `vault` block has a default identity, we don't check the
user's Vault token (and in fact, we warn them on job submit if they've provided
one). But the validation hook still checks for a token if
`allow_unauthenticated` is set to false. This is a misconfiguration but there's
no reason for Nomad not to do the expected thing here.

Fixes: https://github.com/hashicorp/nomad/issues/19565
2024-01-02 16:46:34 -05:00
Luiz Aoqui
cd8a03431c docs: add scale_in_protection to AWS Autoscaler (#19546)
Document new `scale_in_protection` configuration of the AWS ASG
Autoscaler target plugin.
2024-01-02 14:48:56 -05:00
Luiz Aoqui
0bef6f05a2 docs: add note about * namespace on autoscaling (#19547)
Explain the behaviour when the wildcard namespace value `*` is used to
configure the Nomad Autoscaler agent.
2024-01-02 14:48:20 -05:00