26240 Commits

Author SHA1 Message Date
Daniel Bennett
a0d7fb6b09 connect: fix ipv6 bind_address test (#24216) 2024-10-16 08:23:44 -05:00
Tim Gross
6b8ddff1fa windows: set job object for executor and children (#24214)
On Windows, if the `raw_exec` driver's executor exits, the child processes are
not also killed. Create a Windows "job object" (not to be confused with a Nomad
job) and add the executor to it. Child processes of the executor will inherit
the job automatically. When the handle to the job object is freed (on executor
exit), the job itself is destroyed and this causes all processes in that job to
exit.

Fixes: https://github.com/hashicorp/nomad/issues/23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
2024-10-16 09:20:26 -04:00
James Rasell
0f6561bdfe docs: Add initial nomad-driver-virt driver plugin documentation. (#24094)
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2024-10-15 17:05:30 +01:00
Tim Gross
d261d58ea2 build: update hc-install to current (#24199)
Installing Vault and Consul from releases.hashicorp.com via `hc-install` has
been failing intermittently. Update the `hc-install` binaries to be current and
add one retry to downloads for our compat tests so that we can get builds more
reliably green while the underlying issue is being debugged.
2024-10-15 10:07:58 -04:00
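The "one retry" approach described in d261d58ea2 could look roughly like the sketch below. It is not the code from that commit: `withOneRetry`, `errTransient`, and the fake download function are hypothetical names used only for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errTransient is a hypothetical error standing in for an intermittent
// download failure.
var errTransient = errors.New("transient download failure")

// withOneRetry runs download and, if it fails, runs it exactly one more time.
// If both attempts fail, both errors are reported.
func withOneRetry(download func() error) error {
	firstErr := download()
	if firstErr == nil {
		return nil
	}
	if retryErr := download(); retryErr != nil {
		return errors.Join(firstErr, retryErr)
	}
	return nil
}

func main() {
	attempts := 0
	err := withOneRetry(func() error {
		attempts++
		if attempts == 1 {
			return errTransient // fail only on the first attempt
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```

A single retry keeps a persistent failure visible in CI while smoothing over a one-off network error.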
James Rasell
61dd1f3f10 docs: CLI node pool list does not accept arguments. (#24188) 2024-10-15 07:49:37 +01:00
Daniel Bennett
067afcda26 Consul Connect over IPv6 (except tproxy) (#24203)
* detect ipv6 on "bridge" network and set
  service.connect.sidecar_proxy.config.bind_address
  for envoy to "::" instead of "0.0.0.0"
* allow users to set bind_address in jobspec
  e.g. "" would defer to consul proxy-defaults
* caveat: tproxy still does not work, because
  the CNI plugin does not configure ip6tables
2024-10-14 18:52:02 -05:00
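As a rough illustration of the bind address selection described in 067afcda26: the sketch below assumes the decision is made from the allocation's IP address and an optional user-supplied value. `bindAddress` and its parameters are hypothetical; how Nomad actually detects IPv6 on the bridge network is not shown in the commit message.

```go
package main

import (
	"fmt"
	"net"
)

// bindAddress picks the Envoy bind address. A user-set value always wins,
// including the empty string, which defers to Consul proxy-defaults.
// Otherwise an IPv6 allocation address selects "::" and anything else
// selects "0.0.0.0".
func bindAddress(userSet *string, allocIP string) string {
	if userSet != nil {
		return *userSet
	}
	ip := net.ParseIP(allocIP)
	if ip != nil && ip.To4() == nil {
		return "::" // parsed successfully but is not an IPv4 address
	}
	return "0.0.0.0"
}

func main() {
	empty := ""
	fmt.Printf("%q\n", bindAddress(nil, "fd00::2"))
	fmt.Printf("%q\n", bindAddress(nil, "172.26.64.2"))
	fmt.Printf("%q\n", bindAddress(&empty, "fd00::2"))
}
```

A pointer distinguishes "not set in the jobspec" from "explicitly set to the empty string", which the commit message treats as different cases.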
Aimee Ukasick
5beb1ce58e Docs: Update job version section with tutorial links (#24179)
* Update job page with tutorial links

* Update section links
2024-10-14 12:29:56 -05:00
Tim Gross
fec91d1dc8 windows: trade heap for stack to build process tree for stats in linear space (#24182)
In #20619 we overhauled how we were gathering stats for Windows
processes. Unlike in Linux where we can ask for processes in a cgroup, on
Windows we have to make a single expensive syscall to get all the processes and
then build the tree ourselves. Our algorithm to do so is recursive and quadratic
in both steps and space with the number of processes on the host. For busy hosts
this hits the stack limit and panics the Nomad client.

We already build a map of parent PID to PID, so modify this to be a map of
parent PID to slice of children and then traverse that tree only from the root
we care about (the executor PID). This moves the allocations to the heap but
makes the stats gathering linear in steps and space required.

This changeset also moves as much of this code as possible into an area not
conditionally-compiled by OS, as the tagged test file was not being run in CI.

Fixes: https://github.com/hashicorp/nomad/issues/23984
2024-10-14 11:26:38 -04:00
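The approach described in fec91d1dc8 (index the snapshot as parent PID to slice of children, then walk only from the executor PID) can be sketched as follows. This is a minimal illustration, not the Nomad implementation: the `proc` type and both function names are hypothetical, and the snapshot is a hard-coded slice rather than the result of a Windows syscall.

```go
package main

import "fmt"

// proc is a hypothetical snapshot entry: a process ID and its parent's ID.
type proc struct {
	PID  uint32
	PPID uint32
}

// childrenByParent indexes a flat snapshot as parent PID -> child PIDs.
// It makes one pass over the snapshot, so it is linear in steps and space.
func childrenByParent(snapshot []proc) map[uint32][]uint32 {
	children := make(map[uint32][]uint32, len(snapshot))
	for _, p := range snapshot {
		children[p.PPID] = append(children[p.PPID], p.PID)
	}
	return children
}

// descendants returns root plus every process below it. It uses an explicit
// stack that lives on the heap instead of recursion, so a deep tree cannot
// exhaust the goroutine stack.
func descendants(root uint32, children map[uint32][]uint32) []uint32 {
	seen := map[uint32]bool{root: true}
	stack := []uint32{root}
	var out []uint32
	for len(stack) > 0 {
		pid := stack[len(stack)-1]
		stack = stack[:len(stack)-1]
		out = append(out, pid)
		for _, child := range children[pid] {
			if !seen[child] { // guards against a process listed as its own ancestor
				seen[child] = true
				stack = append(stack, child)
			}
		}
	}
	return out
}

func main() {
	snapshot := []proc{
		{PID: 10, PPID: 1}, // the executor in this example
		{PID: 11, PPID: 10},
		{PID: 12, PPID: 10},
		{PID: 13, PPID: 11},
		{PID: 20, PPID: 1}, // unrelated process, never visited
	}
	fmt.Println(descendants(10, childrenByParent(snapshot)))
}
```

Each process is pushed and popped at most once, so the walk touches only the subtree under the root rather than every process on the host.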
Aimee Ukasick
8f4a9326be Docs: Add 1.9 release notes (#24161)
* Add 1.9 release notes

* Add deprecated items

* Update Virt driver docs link to point to repo
2024-10-14 09:57:15 -05:00
James Rasell
a7dad68996 changelog: remove doubled entry for 1.9 release. (#24192) 2024-10-14 14:48:50 +01:00
dependabot[bot]
294ebd1540 chore(deps): bump actions/checkout from 4.2.0 to 4.2.1 (#24183)
Bumps [actions/checkout](https://github.com/actions/checkout) from 4.2.0 to 4.2.1.
- [Release notes](https://github.com/actions/checkout/releases)
- [Changelog](https://github.com/actions/checkout/blob/main/CHANGELOG.md)
- [Commits](d632683dd7...eef61447b9)

---
updated-dependencies:
- dependency-name: actions/checkout
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 08:26:34 -05:00
dependabot[bot]
e439d6e408 chore(deps): bump actions/upload-artifact from 4.4.0 to 4.4.3 (#24184)
Bumps [actions/upload-artifact](https://github.com/actions/upload-artifact) from 4.4.0 to 4.4.3.
- [Release notes](https://github.com/actions/upload-artifact/releases)
- [Commits](50769540e7...b4b15b8c7c)

---
updated-dependencies:
- dependency-name: actions/upload-artifact
  dependency-type: direct:production
  update-type: version-update:semver-patch
...

Signed-off-by: dependabot[bot] <support@github.com>
Co-authored-by: dependabot[bot] <49699333+dependabot[bot]@users.noreply.github.com>
2024-10-14 08:24:59 -05:00
Michael Smithhisler
436ff75f15 scheduler: fix reconnecting allocations getting rescheduled (#24165)
* scheduler: fix reconnecting allocations getting rescheduled
2024-10-14 09:00:58 -04:00
James Rasell
e7154f1d81 Merge pull request #24187 from hashicorp/post-1.9.0-release
admin: post 1.9.0 release
2024-10-14 09:15:14 +02:00
James Rasell
67f2f32027 Merge release 1.9.0 files 2024-10-14 07:42:14 +01:00
hc-github-team-nomad-core
da654ead34 Prepare for next release 2024-10-14 07:26:46 +01:00
hc-github-team-nomad-core
f1714162df Generate files for 1.9.0 release 2024-10-14 07:26:36 +01:00
Aimee Ukasick
c839f38cab Docs: Golden Versions updates (#24153)
* Add language from CLI help to job revert for version|tag

* Add CLI job tag subcommand page

* Add API create delete tag

Examples use same names between CLI and API

* Update CLI revert, tag; API jobs

* Add job version content

* add tag name unique per job to CLI/API; address Phil's feedback

Add partial explaining why tag, add to CLI/API

* Add diff_version to API jobs list job versions

* Apply suggestions from code review

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>

* remove tutorial links since not published yet.

---------

Co-authored-by: Jeff Boruszak <104028618+boruszak@users.noreply.github.com>
2024-10-11 12:36:32 -05:00
Tim Gross
4de1665942 consul: improve reliability of deregistration (#24166)
When the local Consul agent receives a deregister request, it performs a
pre-flight check using the locally cached ACL token. The agent then sends the
request upstream to the Consul servers as part of anti-entropy, using its own
token. This requires that the token we use for deregistration is valid even
though that's not the token used to write to the Consul server.

There are several cases where the service identity token might no longer exist
at the time of deregistration:
* A race condition between the sync and destroying the allocation.
* Misconfiguration of the Consul auth method with a TTL.
* Out-of-band destruction of the token.

Additionally, Nomad's sync with Consul returns early if there are any errors,
which means that a single broken token can prevent any other service on the
Nomad agent from being registered or deregistered.

Update Nomad's sync with Consul to use the Nomad agent's own Consul token for
deregistration, regardless of which token the service was registered
with. Accumulate errors from the sync so that they no longer block
deregistration of other services.

Fixes: https://github.com/hashicorp/nomad/issues/20159
2024-10-11 12:32:23 -04:00
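The "accumulate errors instead of returning early" part of 4de1665942 can be illustrated with a small sketch. It assumes a generic `deregister` callback; `deregisterAll` and `errTokenGone` are hypothetical names, and no Consul API is called.

```go
package main

import (
	"errors"
	"fmt"
)

// errTokenGone is a hypothetical error standing in for a token that no
// longer exists at deregistration time.
var errTokenGone = errors.New("ACL token not found")

// deregisterAll attempts every deregistration and joins the failures,
// rather than returning on the first error and skipping the rest.
func deregisterAll(ids []string, deregister func(id string) error) error {
	var errs []error
	for _, id := range ids {
		if err := deregister(id); err != nil {
			errs = append(errs, fmt.Errorf("deregister %q: %w", id, err))
		}
	}
	return errors.Join(errs...) // nil when nothing failed
}

func main() {
	err := deregisterAll([]string{"web", "db", "cache"}, func(id string) error {
		if id == "web" {
			return errTokenGone // one broken service
		}
		fmt.Println("deregistered", id)
		return nil
	})
	fmt.Println("sync error:", err)
}
```

With this shape, the failure for one service is still reported, but the remaining services are processed in the same pass.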
Tim Gross
5bb6d96773 build: update versions file for backports (#24174) 2024-10-11 12:30:34 -04:00
Seth Hoenig
f1ce127524 jobspec: add a chown option to artifact block (#24157)
* jobspec: add a chown option to artifact block

This PR adds a boolean 'chown' field to the artifact block.

It indicates whether the Nomad client should chown the downloaded files
and directories to be owned by the task.user. This is useful for drivers
like raw_exec and exec2 which are subject to the host filesystem user
permissions structure. Before, these drivers might not be able to use or
manage the downloaded artifacts since they would be owned by the root
user on a typical Nomad client configuration.

* api: no need for pointer of chown field
2024-10-11 11:30:27 -05:00
Tim Gross
7381f8419b docs: clarify requirements for Consul token policies and TTLs (#24167)
As of #24166, Nomad agents will use their own token to deregister services and
checks from Consul. This returns the deregistration path to the pre-Workload
Identity workflow. Expand the documentation to make clear why certain ACL
policies are required for clients.

Additionally, we did not explicitly call out that auth methods should not set an
expiration on Consul tokens. Nomad does not have a facility to refresh these
tokens if they expire. Even if Nomad could, there's no way to re-inject them
into Envoy sidecars for Consul Service Mesh without recreating the task anyways,
which is what happens today. Warn users that they should not set an expiration.

Closes: https://github.com/hashicorp/nomad/issues/20185 (wontfix)
Ref: https://hashicorp.atlassian.net/browse/NET-10262
2024-10-11 11:59:21 -04:00
Daniel Bennett
373aae7b32 docs: add Resource Quota specification page (#24152)
and update some related pages

Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-10-10 15:03:10 -05:00
Daniel Bennett
278a2df3af e2e: ui: update playwright to 1.48.0 (#24158)
steps to update:
 * edit run.sh IMAGE variable manually
 * run ./run.sh test
2024-10-09 10:34:53 -05:00
Phil Renaud
dc45066ae7 [ui] Separate Diffs and Versions from the /versions endpoint as far as Ember is concerned (#24145)
* Separate Diffs and Versions from the /versions endpoint as far as Ember is concerned

* Back to async true

* Handle undefined-diffs case
2024-10-08 12:13:01 -04:00
the-sun-will-rise-tomorrow
1ba9cc266c docs: Link directly to podman's --network option (#24149) 2024-10-08 09:05:14 -05:00
Daniel Bennett
4562b9ac8a Release/1.9.0 beta.2 2024-10-04 14:07:13 -05:00
hc-github-team-nomad-core
7d7a88d7e0 Prepare for next release 2024-10-04 16:18:34 +00:00
hc-github-team-nomad-core
668a827b2b Generate files for 1.9.0-beta.2 release 2024-10-04 16:18:27 +00:00
Daniel Bennett
3f1bba1643 Prepare release 1.9.0-beta.2 2024-10-04 12:13:01 -04:00
Tim Gross
7531b7a62f fix data race in node upsert (#24127)
While testing with agents built with the race-detection option enabled, I
encountered a data race while draining a node.

When we upsert a node we copy the `NodeResources` struct and then perform a
fixup for backwards compatibility of the topology struct. This fixup was being
executed on the original struct and not the copy, which means we're uselessly
fixing up the wrong struct and we're corrupting the state store in the
process (albeit harmlessly, I suspect).

Fix the data race by calling the method on the correct pointer.
2024-10-04 08:41:14 -04:00
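The class of bug fixed in 7531b7a62f, a fixup applied to the original struct rather than to its copy, can be shown with a reduced sketch. The types and method names below are hypothetical stand-ins, not Nomad's `NodeResources` code.

```go
package main

import "fmt"

// topology and nodeResources are hypothetical stand-ins for illustration.
type topology struct {
	Cores []int
}

type nodeResources struct {
	LegacyCores []int     // older field the fixup reads from
	Topology    *topology // newer field the fixup fills in
}

// Copy returns a deep copy so the caller can mutate it freely.
func (r *nodeResources) Copy() *nodeResources {
	c := &nodeResources{LegacyCores: append([]int(nil), r.LegacyCores...)}
	if r.Topology != nil {
		c.Topology = &topology{Cores: append([]int(nil), r.Topology.Cores...)}
	}
	return c
}

// compatibleTopology is the backwards-compatibility fixup; it writes to r.
func (r *nodeResources) compatibleTopology() {
	if r.Topology == nil {
		r.Topology = &topology{Cores: append([]int(nil), r.LegacyCores...)}
	}
}

// prepareForUpsert copies first and then fixes up the copy. Calling
// shared.compatibleTopology() here instead would write to memory that other
// goroutines may be reading, which is the kind of data race described above.
func prepareForUpsert(shared *nodeResources) *nodeResources {
	c := shared.Copy()
	c.compatibleTopology()
	return c
}

func main() {
	shared := &nodeResources{LegacyCores: []int{0, 1}}
	stored := prepareForUpsert(shared)
	fmt.Println("shared untouched:", shared.Topology == nil)
	fmt.Println("copy fixed up:", stored.Topology != nil)
}
```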
Daniel Bennett
1c76dd9c1c update example device readme (#24124) 2024-10-03 13:24:58 -05:00
Tim Gross
b7595c646d alloc fs: use case-insensitive check for reads of secret/private dir (#24125)
When using the Client FS APIs, we check to ensure that reads don't traverse into
the allocation's secret dir and private dir. But this check can be bypassed on
case-insensitive file systems (ex. Windows, macOS, and Linux with obscure ext4
options enabled). This allows a user with `read-fs` permissions but not
`alloc-exec` permissions to read from the secrets dir.

This changeset updates the check so that it's case-insensitive. This risks false
positives for escape (see linked Go issue), but only if a task without
filesystem isolation deliberately writes into the task working directory to do
so, which is a fail-safe failure mode.

Ref: https://github.com/golang/go/issues/18358

Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
2024-10-03 14:20:24 -04:00
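A case-insensitive check in the spirit of b7595c646d might look like the sketch below. It is an assumption-laden simplification: `isReservedPath` is a hypothetical helper, paths are slash-separated and relative to the allocation directory, and `strings.ToLower` is used as an approximation of file-system case folding.

```go
package main

import (
	"fmt"
	"path"
	"strings"
)

// isReservedPath reports whether rel names one of the reserved directories
// or anything beneath it, ignoring case. Both sides are cleaned first so
// that "." and ".." segments cannot hide the match.
func isReservedPath(rel string, reserved ...string) bool {
	cleaned := strings.ToLower(path.Clean(rel))
	for _, dir := range reserved {
		dir = strings.ToLower(path.Clean(dir))
		// Require an exact match or a match up to a path separator, so a
		// sibling such as "web/secrets-backup" is not treated as reserved.
		if cleaned == dir || strings.HasPrefix(cleaned, dir+"/") {
			return true
		}
	}
	return false
}

func main() {
	reserved := []string{"web/secrets", "web/private"}
	for _, p := range []string{"web/SECRETS/token", "web/./Private", "web/local/out.txt"} {
		fmt.Printf("%-20s reserved=%v\n", p, isReservedPath(p, reserved...))
	}
}
```

On a case-sensitive file system this rejects a directory that merely differs in case from a reserved one, which matches the false-positive trade-off the commit message describes as fail-safe.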
Michael Schurter
da75d4ff4b docs: fix aed -> aead typo (#24123) 2024-10-03 13:31:32 -04:00
Tim Gross
f7d4bd2fd1 test: wait for keyring in plan submission tests (#24122)
In #23977 we merged a change to how the keyring was stored. Because keyring
initialization takes slightly longer now, this uncovered existing timing bugs in
some of our tests where tests that require the keyring (ex. plan applier tests)
were waiting for the leader but not the keyring initialization. Fix another
example we've seen causing test flakes.
2024-10-03 13:22:41 -04:00
Daniel Bennett
7526c91ccd scheduler: non-nil err when no devices match (#24118) 2024-10-03 10:29:36 -05:00
Aimee Ukasick
4c131229f4 Add devices to NUMA section of CPU page (#24113) 2024-10-03 09:09:10 -05:00
Aimee Ukasick
e5b18affa1 nvidia driver: add MIG support to overview paragraph (#24099) 2024-10-03 09:08:43 -05:00
James Rasell
1fabbaa179 driver: remove LXC and ECS driver documentation. (#24107)
Co-authored-by: Aimee Ukasick <aimee.ukasick@hashicorp.com>
2024-10-03 08:55:39 +01:00
Phil Renaud
2fc7544ff3 [ui] Modify variable access permissions for UI users with write in only certain namespaces (#24073)
* Modify variable access permissions for UI users with write in only certain namespaces

* Addressing some PR comments

* Variables index namespaces on * and ability checks are now namespaced

* Mistook Delete for Destroy, and update unit tests for multi-return allPaths
2024-10-02 16:02:40 -04:00
Tim Gross
64881eefce docs: remove references to serf.io site (#24114)
The serf.io site is being taken down, so change all our links to point to the
repo docs instead.

Ref: https://github.com/hashicorp/serf/pull/743
2024-10-02 14:33:04 -04:00
Daniel Bennett
6b9bcb8582 differently exclude tagged job versions from being pruned (#24102)
* test bug: tagged versions count against limit
  specifically tagged versions that are not the oldest

* fix: use original logic, sans tagged versions
2024-10-02 09:58:35 -05:00
Martijn Vegter
3ecf0d21e2 metrics: introduce client config to include alloc metadata as part of the base labels (#23964) 2024-10-02 10:55:44 -04:00
Tim Gross
6c03e1991d refactor: clean up slice initialization in node status (#24109)
We initialize this slice with a zeroed array and then append to it, which means
we then have to clean out the empty strings later. Initialize to the correct
capacity up front so there are no empty values.

Ref: https://github.com/hashicorp/nomad/pull/24104
2024-10-02 10:40:32 -04:00
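The slice initialization issue described in 6c03e1991d is a common Go pitfall and can be reproduced in a few lines. The function names here are hypothetical; only the `make` and `append` behavior is the point.

```go
package main

import "fmt"

// withZeroedLength allocates len(names) zero values and then appends, so the
// result starts with len(names) empty strings that must be cleaned out later.
func withZeroedLength(names []string) []string {
	out := make([]string, len(names))
	for _, n := range names {
		out = append(out, n)
	}
	return out
}

// withCapacity allocates the same backing storage but with length zero, so
// append fills it from the start and no empty values appear.
func withCapacity(names []string) []string {
	out := make([]string, 0, len(names))
	for _, n := range names {
		out = append(out, n)
	}
	return out
}

func main() {
	names := []string{"a", "b"}
	fmt.Printf("%q\n", withZeroedLength(names))
	fmt.Printf("%q\n", withCapacity(names))
}
```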
Tim Gross
7dc57efe1b build: update go toolchain to 1.23.2 (#24108)
Picks up some small bug fixes but one especially relevant to Nomad is the
`os/exec` file descriptor, which could impact script check / change mode for
task drivers without isolated exec (ex. `raw_exec`).

Ref: https://github.com/golang/go/issues?q=milestone%3AGo1.23.2+label%3ACherryPickApproved
Ref: https://github.com/golang/go/issues/69402
2024-10-02 10:29:10 -04:00
Tim Gross
651d8d6f88 tests: fixup copywrite in test file (#24101)
In #24007 we merged new HCL files but they were missing copywrite headers
because the scan didn't run on this PR for some reason. I've already backported
this to the Enterprise branches.
2024-10-01 16:43:10 -04:00
Tim Gross
e9ba630639 docker: fix script check execution (#24098)
In #24095 we made a fix for non-streaming exec into Docker tasks for script
checks and `change_mode = "script"`, but didn't complete E2E testing. We need to
use `ContainerExecAttach` in the new API in order to get stdout/stderr from
tasklets, but the previous `ContainerExecStart` call will prevent this from
running successfully with an error that the exec has already run.

* Ref: [NET-11202 (comment)](https://hashicorp.atlassian.net/browse/NET-11202?focusedCommentId=551618)
* This has shipped in Nomad 1.9.0-beta.1 but not production yet.
* This should fix the remaining issues in nightly E2E for Docker.
2024-10-01 16:41:38 -04:00
Juliano Martinez
4a74fda8ce Allow client template config block to be parsed when using json config (#24007)
- Adds tests
- Adds sample test data for parsing hcl and json
- Adds changelog
2024-10-01 15:44:36 -04:00
Seth Hoenig
8ae7f21d41 docs: stats_period device configuration no longer exists (#24097) 2024-10-01 13:47:04 -05:00
Tim Gross
5e1ad14f1f scaling policy: use request namespace as target if unset in jobspec (#24065)
When jobs are submitted with a scaling policy, the scaling policy's target only
includes the job's namespace if the `namespace` field is set in the jobspec and
not from the request. Normally jobs are canonicalized in the RPC handler before
being written to Raft. But the scaling policy targets are instead written during
the conversion from `api.Job` to `structs.Job`. We populate the `structs.Job`
namespace from the request here as well, but only after the conversion has
occurred. Swap the order of these operations so that the conversion is always
happening with a correct namespace.

Long-term we should not be making mutations during conversion either. But we
can't remove it immediately because API requests may come from any agent across
upgrades. Move the scaling target creation into the `Canonicalize` method and
mark it for future removal in the API conversion code path.

Fixes: https://github.com/hashicorp/nomad/issues/24039
2024-10-01 11:41:40 -04:00
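The ordering problem described in 5e1ad14f1f can be sketched with simplified types. Everything below is hypothetical (`apiJob`, `job`, `convert`, `fromRequest`); it only shows why the namespace has to be filled in before a conversion that derives the scaling target from it.

```go
package main

import "fmt"

// apiJob and job are simplified, hypothetical stand-ins for the API and
// internal job representations.
type apiJob struct {
	ID         string
	Namespace  *string // nil when the jobspec does not set a namespace
	HasScaling bool
}

type job struct {
	ID            string
	Namespace     string
	ScalingTarget map[string]string
}

// convert derives the scaling target from whatever namespace the job has at
// conversion time, so a missing namespace is baked into the target.
func convert(in *apiJob) *job {
	out := &job{ID: in.ID}
	if in.Namespace != nil {
		out.Namespace = *in.Namespace
	}
	if in.HasScaling {
		out.ScalingTarget = map[string]string{"Namespace": out.Namespace, "Job": out.ID}
	}
	return out
}

// fromRequest fills in the request namespace first and converts second, so
// the target always sees a populated namespace.
func fromRequest(in *apiJob, requestNamespace string) *job {
	withNS := *in // shallow copy so the caller's value is untouched
	if withNS.Namespace == nil {
		withNS.Namespace = &requestNamespace
	}
	return convert(&withNS)
}

func main() {
	spec := &apiJob{ID: "web", HasScaling: true} // no namespace in the jobspec
	fmt.Println(convert(spec).ScalingTarget)             // namespace missing from target
	fmt.Println(fromRequest(spec, "prod").ScalingTarget) // namespace taken from request
}
```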