Commit Graph

25528 Commits

Author SHA1 Message Date
Seth Hoenig
9410c519ff drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration (#19599)
* drivers/raw_exec: remove plumbing for ineffective no_cgroups configuration

* fix tests
2024-01-11 08:20:15 -06:00
Tim Gross
1254468600 consul: refactor job mutation hook (#19699)
The job mutation logic for Nomad CE and Nomad ENT is nearly identical except
for a prelude that grabs the correct default cluster. Factor this out into a
method that can be shared between both code bases.
2024-01-10 16:29:05 -05:00
CJ
c9cd8480fa docs: considerations for Stateful Workloads (#19077)
Co-authored-by: Adrian Todorov <adrian.todorov@hashicorp.com>
2024-01-10 16:06:45 -05:00
Piotr Kazmierczak
930339a0fa e2e: remove broken Consul WI test (#19697) 2024-01-10 21:31:18 +01:00
Tim Gross
0935f443dc vault: support allowing tokens to expire without refresh (#19691)
Some users with batch workloads or short-lived prestart tasks want to derive a
Vault token, use it, and then allow it to expire without requiring a constant
refresh. Add the `vault.allow_token_expiration` field, which works only with the
Workload Identity workflow and not the legacy workflow.

When set to true, this disables the client's renewal loop in the
`vault_hook`. When Vault revokes the token lease, the token will no longer be
valid. The client will also now automatically detect if the Vault auth
configuration does not allow renewals and will disable the renewal loop
automatically.

Note this should only be used when a secret is requested from Vault once at the
start of a task or in a short-lived prestart task. Long-running tasks should
never set `allow_token_expiration=true` if they obtain Vault secrets via
`template` blocks, as the Vault token will expire and the template runner will
continue to make failing requests to Vault until the `vault_retry` attempts are
exhausted.

Fixes: https://github.com/hashicorp/nomad/issues/8690
2024-01-10 14:49:02 -05:00
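As a note on the `vault.allow_token_expiration` commit above, the decision it describes can be sketched in Go roughly as follows. This is a minimal sketch, not Nomad's actual `vault_hook` code: the helper name `shouldRunRenewalLoop` and its two boolean parameters are assumptions made for illustration.

```go
package main

import "fmt"

// shouldRunRenewalLoop is a hypothetical helper (not part of Nomad).
// It reports whether a client should keep renewing a derived Vault token.
//
//   - allowTokenExpiration mirrors the `vault.allow_token_expiration` field.
//   - tokenRenewable stands for whether the Vault auth configuration
//     allows renewals at all.
func shouldRunRenewalLoop(allowTokenExpiration, tokenRenewable bool) bool {
	if allowTokenExpiration {
		// The job asked for the token to expire without refresh.
		return false
	}
	// Renewing a token that Vault will not renew would only produce errors.
	return tokenRenewable
}

func main() {
	fmt.Println(shouldRunRenewalLoop(false, true))  // true
	fmt.Println(shouldRunRenewalLoop(true, true))   // false
	fmt.Println(shouldRunRenewalLoop(false, false)) // false
}
```

Under this sketch, a long-running task that reads secrets through `template` blocks would want the first case, where the loop keeps running.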
Luiz Aoqui
5267eec3ad vault: fix token revocation during workflow migration (#19689)
When transitioning from the legacy token-based workflow to the new JWT
workflow for Vault, the previous code would instantiate a no-op Vault client if
the server configuration had a `default_identity` block.

This no-op client returned an error when some of its operations were
called, such as `LookupToken` and `RevokeTokens`. The original intention
was that, in the new JWT workflow, none of these methods should be
called, so returning an error could help surface potential bugs.

But the `RevokeTokens` and `MarkForRevocation` methods _are_ called even
in the JWT flow. When a leadership transition happens, the new server
looks for unused Vault accessors from state and tries to revoke them.
Similarly, the `RevokeTokens` method is called every time the
`Node.UpdateStatus` and `Node.UpdateAlloc` RPCs are made by clients, as
the Nomad server tries to find unused Vault tokens for the node/alloc.

Since the new JWT flow does not require Nomad servers to contact Vault,
calling `RevokeTokens` and `MarkForRevocation` is not able to complete
without a Vault token, so this commit changes the logic to use the no-op
Vault client when no token is configured. It also updates the client
itself to not error if these methods are called, but to rather just log
so operators can be made aware that there are Vault tokens created by
Nomad that have not been force-expired.

When migrating an existing cluster to the new workload identity based
flow, Nomad operators must first upgrade the Nomad version without
removing any of the existing Vault configuration. Removing it can prevent
Nomad servers from managing and cleaning up existing Vault tokens during
a leadership transition and node or alloc updates.

Operators must also resubmit all jobs with a `vault` block so they are
updated with an `identity` for Vault. Skipping this step may cause
allocations to fail if their Vault token expires (if, for example, the
Nomad client stops running for TTL/2) or if they are rescheduled, since
the new client will try to follow the legacy flow which will fail if the
Nomad server configuration for Vault has already been updated to remove
the Vault address and token.
2024-01-10 13:28:46 -05:00
Tim Gross
d3e5cae1eb consul: support admin partitions (#19665)
Add support for Consul Enterprise admin partitions. We added fingerprinting in
https://github.com/hashicorp/nomad/pull/19485. This PR adds a `consul.partition`
field. The expectation is that most users will create a mapping of Nomad node
pool to Consul admin partition. But we'll also create an implicit constraint for
the fingerprinted value.

Fixes: https://github.com/hashicorp/nomad/issues/13139
2024-01-10 10:41:29 -05:00
Daniel Peinhopf
9eb357020d Docs: Alternative IIS Task Driver (#19411) 2024-01-10 14:14:30 +00:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Egor Mikhailov
18f49e015f auth: add new optional OIDCDisableUserInfo setting for OIDC auth provider (#19566)
Add new optional `OIDCDisableUserInfo` setting for OIDC auth provider which
disables a request to the identity provider to get OIDC UserInfo.

This option is helpful when your identity provider doesn't send any additional
claims from the UserInfo endpoint, such as Microsoft AD FS OIDC Provider:

> The AD FS UserInfo endpoint always returns the subject claim as specified in the
> OpenID standards. AD FS doesn't support additional claims requested via the
> UserInfo endpoint

Fixes #19318
2024-01-09 13:41:46 -05:00
Tim Gross
c875f3e49a docs: expand docs on implicit ACL capabilities grants (#19681)
An audit of Nomad's ACLs resulted in some confusion around whether the
`NamespaceValidator` method is conjunctive ("and", as implied by the docs) or
disjunctive ("or", as it is by design). Clarify the ACL documentation as
follows:

* Call out where fine-grained capabilities imply grants to other
  capabilities (for example, that `csi-read-volume` grants `csi-list-volume`).
* Fix an incorrectly documented ACL requirement for the CSI List External
  Volumes API.
* Clarify how ACLs are expected to work for the two search API endpoints, such
  that you need list/read access to the objects in the search context.
2024-01-09 13:25:05 -05:00
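As a note on the ACL docs commit above, the difference between a conjunctive ("and") and a disjunctive ("or") capability check can be illustrated with a short Go sketch. The `capabilitySet` type and the `allowsAny` / `allowsAll` helpers are hypothetical and are not Nomad's `NamespaceValidator` implementation; they only show the two semantics side by side.

```go
package main

import "fmt"

// capabilitySet is a hypothetical stand-in for the capabilities an ACL
// policy grants on a single namespace.
type capabilitySet map[string]bool

// allowsAny is disjunctive: it passes if at least one requested
// capability is granted.
func allowsAny(granted capabilitySet, want ...string) bool {
	for _, c := range want {
		if granted[c] {
			return true
		}
	}
	return false
}

// allowsAll is conjunctive: it passes only if every requested
// capability is granted.
func allowsAll(granted capabilitySet, want ...string) bool {
	for _, c := range want {
		if !granted[c] {
			return false
		}
	}
	return true
}

func main() {
	granted := capabilitySet{"read-job": true}
	fmt.Println(allowsAny(granted, "read-job", "submit-job")) // true
	fmt.Println(allowsAll(granted, "read-job", "submit-job")) // false
}
```

With the same grant, the disjunctive check admits the request and the conjunctive one rejects it, which is the kind of confusion the documentation change addresses.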
James Rasell
a3a03dff78 acl: ensure auth method configs are correctly and fully hashed. (#19677) 2024-01-09 14:03:26 +00:00
dependabot[bot]
f3bc9c7c41 chore(deps): bump github.com/docker/docker (#19672) 2024-01-09 08:24:20 +00:00
Tim Gross
a399f16a31 docs: describe cgroup controller requirements (#19493)
Nomad can only use cgroups to control resource requirements if all the cgroups
controllers are actually enabled. Add this to our requirements documentation as
well as the impacted `exec` and `java` task drivers.
2024-01-08 10:01:14 -05:00
am-ak
7dc82f233f [DOCS] Update docker.mdx (#19657)
Removed info regarding development of Nomad
2024-01-08 14:32:57 +00:00
James Rasell
fbea8d1051 server: Fix panic when validating non-service reschedule block. (#19652) 2024-01-08 14:14:00 +00:00
Shantanu Gadgil
6bbd3b0cec reschedule is at group level (#19653)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2024-01-08 10:54:52 +00:00
dependabot[bot]
398b5000c1 chore(deps): bump github.com/hashicorp/go-plugin from 1.4.10 to 1.6.0 (#19646)
Co-authored-by: James Rasell <jrasell@hashicorp.com>
2024-01-08 08:26:34 +00:00
James Rasell
ff2d0d6453 cli: Fix dummy FSM create to ensure snapshot state command works. (#19630)
The Nomad state store function was recently updated to validate
certain parameters, fixing a panic condition. This change meant the
dummy FSM used for the snapshot state command was always failing
this validation and the command no longer worked.

This change adds the required parameter to pass validation and
therefore makes the CLI command functional again.
2024-01-05 16:00:24 +00:00
Marvin Chin
be8575a8a2 Fix server shutdown not waiting for worker run completion (#19560)
* Move group into a separate helper module for reuse

* Add shutdownCh to worker

The shutdown channel is used to signal that the worker has stopped.

* Make server shutdown block on workers' shutdownCh

* Fix waiting for eval broker state change blocking indefinitely

There was a race condition in the GenericNotifier between the
Run and WaitForChange functions, where WaitForChange blocks
trying to write to a full unsubscribeCh, but the Run function never
reads from the unsubscribeCh as it has already stopped.

This commit fixes it by unblocking if the notifier has been stopped.

* Bound the amount of time server shutdown waits on worker completion

* Fix lostcancel linter error

* Fix worker test using unexpected worker constructor

* Add changelog

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
2024-01-05 08:45:07 -06:00
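As a note on the server shutdown commit above, the "unblock if the notifier has been stopped" fix can be sketched in Go with a `select` over the send and a stop channel. The `notifier` type, its `stopCh` field, and the `unsubscribe` method are assumptions for illustration; the commit message names only `GenericNotifier`, `Run`, `WaitForChange`, and `unsubscribeCh`.

```go
package main

import (
	"fmt"
	"time"
)

// notifier is a simplified, hypothetical stand-in for a notifier whose
// run loop normally drains unsubscribeCh.
type notifier struct {
	unsubscribeCh chan chan struct{}
	stopCh        chan struct{} // assumed: closed once the run loop has exited
}

// unsubscribe tries to hand sub to the run loop. If the loop has already
// stopped, it gives up instead of blocking forever on a full channel.
// It reports whether the request was delivered.
func (n *notifier) unsubscribe(sub chan struct{}) bool {
	select {
	case n.unsubscribeCh <- sub:
		return true
	case <-n.stopCh:
		return false
	}
}

func main() {
	n := &notifier{
		unsubscribeCh: make(chan chan struct{}, 1),
		stopCh:        make(chan struct{}),
	}
	n.unsubscribeCh <- make(chan struct{}) // fill the buffer
	close(n.stopCh)                        // the run loop is gone, nobody will read

	done := make(chan bool, 1)
	go func() { done <- n.unsubscribe(make(chan struct{})) }()

	select {
	case delivered := <-done:
		fmt.Println("returned, delivered =", delivered)
	case <-time.After(time.Second):
		fmt.Println("still blocked")
	}
}
```

Without the `stopCh` case, the send in `unsubscribe` would block indefinitely in this scenario, which matches the race the commit message describes.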
James Rasell
5a00440b06 api: Fix operator snapshot API streaming. (#19608) 2024-01-05 14:33:39 +00:00
dependabot[bot]
37af843b01 chore(deps): bump github.com/opencontainers/runc from 1.1.8 to 1.1.10 (#19289) 2024-01-05 09:57:54 +00:00
dependabot[bot]
c2e6d8aee2 build(deps): bump github.com/containerd/containerd from 1.6.18 to 1.6.26 (#19531) 2024-01-05 09:29:14 +00:00
James Rasell
f3ed406b0f state: ensure the job submission table is persisted and restored. (#19605) 2024-01-05 08:12:27 +00:00
James Rasell
2abbd7e485 cli: fix operator snapshot save help output examples. (#19606) 2024-01-05 07:43:12 +00:00
Phil Renaud
a5881963dd Error message typo fix: Filed to Failed (#19611) 2024-01-04 21:56:23 -05:00
Phil Renaud
16876697a1 [ui] Adds group-name tooltips to deploying and steady-state job panels (#19601)
* Adds group-name tooltips to deploying and steady-state job panels

* Default tooltip text for mirage edge cases
2024-01-04 13:10:37 -05:00
Phil Renaud
75b830ef04 [ui] Changelog for multi-line variables (#19600)
* Changelog for multi-line variables

* Multi-entry changelog
2024-01-04 12:00:50 -05:00
Seth Hoenig
4b3ee77d6b docs: update raw_exec driver docs and 1.7 upgrade notes (#19598) 2024-01-04 08:26:46 -06:00
Seth Hoenig
ccfb13a72d e2e: add test for raw_exec memory_max configuration (#19596)
* e2e: add test for raw_exec memory_max configuration

* docs: note raw_exec supports memory_max in resources documentation
2024-01-04 08:25:56 -06:00
Piotr Kazmierczak
aa197cf824 e2e: pass Nomad address to Consul WI test (#19603) 2024-01-04 08:52:39 +01:00
Phil Renaud
89cceebb91 [ui] Multi-line variable values and helios upgrades generally (#19544)
* Multi-line variable values and helios upgrades generally

* Variables page titles and actions restyle

* Hacky fix to keyboard shortcut otherwise bumping space on shift

* Related entities heliosified

* Namespace and path fields heliosed

* Paths table heliosified

* Variable view table

* Fixups after design discussion

* Monospaced editing

* De-commented template placeholder

* Acceptance tests updated for helios components across variables

* Tests helios'd in variable-form-test

* PR suggestions
2024-01-03 15:54:22 -05:00
Marvin Chin
d75293d2ab Add OOM detection for exec driver (#19563)
* Add OomKilled field to executor proto format

* Teach linux executor to detect and report OOMs

* Teach exec driver to propagate OOMKill information

* Fix data race

* use tail /dev/zero to create oom condition

* use new test framework

* minor tweaks to executor test

* add cl entry

* remove type conversion

---------

Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-03 09:50:27 -06:00
Tim Gross
f2630add91 acl: remove timestamps from WhoAmI response (#19578)
In Nomad 1.7 we updated our JWT library to go-jose, but this changed the wire
format of the embedded struct we have in the `IdentityClaims` struct that we
return as part of the `WhoAmI` RPC response. This wasn't originally intended to
be sent over the wire but other changes in Nomad 1.5+ added a caller to the
client. The library change causes a deserialization error on Nomad 1.5 and 1.6
clients, which prevents access to Nomad Variables and SD via template blocks.

Removed the incompatible fields from the response, which are unused by any
current caller. In a future version of Nomad, we'll likely remove the `WhoAmI`
callers from the client in favor of using the public keys the clients have to
check auth.

Fixes: https://github.com/hashicorp/nomad/issues/19555
2024-01-03 08:24:38 -05:00
James Rasell
91cba75f5c copywrite: fix and add copywrite config enterprise comments. (#19590)
Nomad CI checks for copywrite headers using multiple config files
for specific exemption paths. This means the top-level config file
does not take effect when running the copywrite script within
these sub-folders. Exempt files therefore need to be added to the
sub-config files, along with the top level.
2024-01-03 08:58:53 +00:00
Piotr Kazmierczak
a87aa71f55 e2e: fix typo in Consul e2e (#19589) 2024-01-03 09:34:38 +01:00
Tim Gross
e7ca2b51ad vault: ignore allow_unauthenticated config if identity is set (#19585)
When the server's `vault` block has a default identity, we don't check the
user's Vault token (and in fact, we warn them on job submit if they've provided
one). But the validation hook still checks for a token if
`allow_unauthenticated` is set to false. This is a misconfiguration but there's
no reason for Nomad not to do the expected thing here.

Fixes: https://github.com/hashicorp/nomad/issues/19565
2024-01-02 16:46:34 -05:00
Luiz Aoqui
cd8a03431c docs: add scale_in_protection to AWS Autoscaler (#19546)
Document new `scale_in_protection` configuration of the AWS ASG
Autoscaler target plugin.
2024-01-02 14:48:56 -05:00
Luiz Aoqui
0bef6f05a2 docs: add note about * namespace on autoscaling (#19547)
Explain the behaviour when the wildcard namespace value `*` is used to
configure the Nomad Autoscaler agent.
2024-01-02 14:48:20 -05:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.

In production, we have been running our tasks wrapped in a script that
does `echo 0 > /proc/self/oom_score_adj` to avoid this issue.

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00
Seth Hoenig
c06f804cea build: make copywrite thing happy (#19577) 2024-01-02 13:33:45 -06:00
Luiz Aoqui
7eecca65ec docs: add autoscaler AWS retry_attempts config (#19549)
Document the Nomad Autoscaler AWS target plugin config `retry_attempts`.
2024-01-02 14:08:10 -05:00
Luiz Aoqui
56b1bf3240 docs: add policy_id and target_name metric labels (#19551) 2024-01-02 14:06:37 -05:00
Luiz Aoqui
1694e69b77 docs: clarify the behaviour of lower_bound and upper_bound (#19552) 2024-01-02 14:06:07 -05:00
hc-github-team-es-release-engineering
a4ecc2fbc8 Merge pull request #19283 from hashicorp/RELENG-960-EOY-license-fixes
[DO NOT MERGE UNTIL EOY] update year in LICENSE and copywrite files
2024-01-02 09:38:54 -08:00
Seth Hoenig
23e5ffbfd0 build: bump setup-golang action version to v2 (#19568) 2024-01-02 09:41:50 -06:00
Luiz Aoqui
09731442e4 docs: add node_pool autoscaler node selector (#19548)
Document the `node_pool` node selector configuration.
2024-01-02 10:19:58 -05:00
Piotr Kazmierczak
bb3d2227a2 e2e: add a test for checking default WI Consul workflow for services and tasks (#19500) 2024-01-02 16:02:32 +01:00
James Rasell
76ba3e10e7 docs: add Nomad Autoscaler HA configuration details. (#19010)
Co-authored-by: Piotr Kazmierczak <470696+pkazmierczak@users.noreply.github.com>
2023-12-27 08:00:07 +00:00
Mike Nomitch
dd15bdff9c Adds vault role to JWT claims if specified in jobspec (#19535) 2023-12-20 15:51:34 -08:00