Commit Graph

300 Commits

Author SHA1 Message Date
Daniel Bennett
7c633f8109 exec: don't panic on rootless raw_exec tasks (#26401)
the executor dies, leaving an orphaned process still running.

the panic fix:
 * don't `panic()`
 * and return an empty, but non-nil, func on cgroup error

feature fix:
 * allow non-root agent to proceed with exec when cgroups are off
2025-08-04 13:58:35 -04:00
James Rasell
5989d5862a ci: Update golangci-lint to v2 and fix highlighted issues. (#26334) 2025-07-25 10:44:08 +01:00
Tim Gross
c8dcd3c2db docker: clamp CPU shares to minimum of 2 (#26081)
In #25963 we added normalization of CPU shares for large hosts where the total
compute was larger than the maximum CPU shares. But if the result after
normalization is less than 2, runc will have an integer overflow. We prevent
this in the shared executor for the `exec`/`rawexec` driver by clamping to the
safe minimum value. Do this for the `docker` driver as well and add test
coverage of it for the shared executor too.

Fixes: https://github.com/hashicorp/nomad/issues/26080
Ref: https://github.com/hashicorp/nomad/pull/25963
2025-06-19 13:48:06 -04:00
Tim Gross
34e96932a1 drivers: normalize CPU shares/weights to fit large hosts (#25963)
The `resources.cpu` field is scheduled in MHz. On most Linux task drivers, this
value is then mapped to a `cpu.share` (cgroups v1) or `cpu.weight` (cgroups
v2). But this means on very large hosts where the total compute is greater than
the Linux kernel defined maximum CPU shares, you can't set a `resources.cpu`
value large enough to consume the entire host.

The `cpu.share`/`cpu.weight` value is relative within the parent cgroup's slice,
which is owned by Nomad. So we can fix this by re-normalizing the weight on very
large hosts such that the maximum `resources.cpu` matches up with largest
possible CPU share. This happens in the task driver so that the rest of Nomad
doesn't need to be aware of this implementation detail. Note that these functions 
will result in bad share config if the request is more than the available, but that's 
supposed to be caught in the scheduler so by not catching it here we intentionally 
hit the runc error.

Fixes: https://hashicorp.atlassian.net/browse/NMD-297
Fixes: https://github.com/hashicorp/nomad/issues/7731
Ref: https://go.hashi.co/rfc/nmd-211
2025-06-03 15:57:40 -04:00
Tim Gross
77c8acb422 telemetry: fix excessive CPU consumption in executor (#25870)
Collecting metrics from processes is expensive, especially on platforms like
Windows. The executor code has a 5s cache of stats to ensure that we don't
thrash syscalls on nodes running many allocations. But the timestamp used to
calculate TTL of this cache was never being set, so we were always treating it
as expired. This causes excess CPU utilization on client nodes.

Ensure that when we fill the cache, we set the timestamp. In testing on Windows,
this reduces exector CPU overhead by roughly 75%.

This changeset includes two other related items:

* The `telemetry.publish_allocation_metrics` field correctly prevents a node
  from publishing metrics, but the stats hook on the taskrunner still collects
  the metrics, which can be expensive. Thread the configuration value into the
  stats hook so that we don't collect if `telemetry.publish_allocation_metrics =
  false`.

* The `linuxProcStats` type in the executor's `procstats` package is misnamed as
  a result of a couple rounds of refactoring. It's used by all task executors,
  not just Linux. Rename this and move a comment about how Windows processes are
  listed so that the comment is closer to where the logic is implemented.

Fixes: https://github.com/hashicorp/nomad/issues/23323
Fixes: https://hashicorp.atlassian.net/browse/NMD-455
2025-05-19 09:24:13 -04:00
Piotr Kazmierczak
0fa0624576 exec: Fix incorrect HOME and USER env variables for tasks that have user set (#25859)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-16 15:02:45 +02:00
Tim Gross
374e987b9b metrics: emit cache and rss stats on cgroup v2 (#25751)
In cgroups v2, a different map of memory stats is available from the kernel than
in v1. The Docker API reflects this change. But there are equivalent values in
the map for RSS (anonymously mapped memory) and cache (filesystem cache and
tmpfs), which the Docker driver is not currently emitting.

Fallback to these alternate values when the cgroups v1 values are not
available. Include the anonymous mapping in the "measured" allocation stats as
"RSS" so that they both show up in allocation metrics. We can do this on both
the `docker` driver and the Linux executor for `exec` and `java` drivers.

Fixes: https://github.com/hashicorp/nomad/issues/19185
Ref: https://hashicorp.atlassian.net/browse/NMD-437
Ref: https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html#memory-interface-files
Ref: https://www.kernel.org/doc/Documentation/cgroup-v1/memory.txt
2025-04-24 12:48:18 -04:00
James Rasell
c85c723336 ci: Run core tests groups workflow on amd64 and arm64 runners. (#25695) 2025-04-17 15:16:29 +01:00
Tim Gross
48f304d0ca java: only set nobody user on Unix (#25648)
In #25496 we introduced the ability to have `task.user` set for on Windows, so
long as the user ID fits a particular shape. But this uncovered a 7 year old bug
in the `java` driver introduced in #5143, where we set the `task.user` to the
non-existent Unix user `nobody`, even if we're running on Windows.

Prior to the change in #25496 we always ignored the `task.user`, so this was not
a problem. We don't set the `task.user` in the `raw_exec` driver, and the
otherwise very similar `exec` driver is Linux-only, so we never see the problem
there.

Fix the bug in the `java` driver by gating the change to the `task.user` on not
being Windows. Also add a check to the new code path that the user is non-empty
before parsing it, so that any third party drivers that might be borrowing the
executor code don't hit the same probem on Windows.

Ref: https://github.com/hashicorp/nomad/pull/5143
Ref: https://github.com/hashicorp/nomad/pull/25496
Fixes: https://github.com/hashicorp/nomad/issues/25638
2025-04-10 10:34:34 -04:00
Denis Rodin
aca0ff438a raw_exec windows: add support for setting the task user (#25496) 2025-04-03 11:21:13 -04:00
Piotr Kazmierczak
e9ebbed32c drivers: unflake TestExecutor_OOMKilled (#25521)
Every now and then TestExecutor_OOMKilled would fail with: "unable to start
container process: container init was OOM-killed (memory limit too low?)" which
started happening since we upgraded libcontainer.

This PR removes manual (and arbitrary) resource limits on the test
task, since it should be OOMd with resources inherited from the
testExecutorCommandWithChroot, and it fixes a small possible goroutine leak in
the OOM checker in exec driver.
2025-03-28 11:35:02 +01:00
Piotr Kazmierczak
16bbdd9833 drivers: adapt shared executor code to use opencontainers/runc 1.2 (#25138)
Co-authored-by: Michael Smithhisler <michael.smithhisler@hashicorp.com>
2025-03-17 14:32:16 +01:00
Simon Zou
73ceacd236 ListProcesses through PID when cgroup is not found in Linux (#25198)
* ListProcesses through PID when cgroup is not found

* add changelog entry

* update the ListByPid for windows
2025-03-06 17:41:51 +01:00
Juana De La Cuesta
6ffe441983 [gh-24931] Return dummy function for moving processes when running rootless (#24944)
* fix: stop executor launch if nomad doesnt have permissions

* func: return move function if c group is not enabled
2025-03-06 10:34:21 +01:00
Jorge Marey
25426f0777 fingerprint: add config option to disable dmidecode (#25108) 2025-02-13 11:20:48 -05:00
Michael Smithhisler
4e2d9675e7 executor: fail early on reattach if listener is not executor (#24538) 2024-12-02 09:56:00 -05:00
Piotr Kazmierczak
3a18f22c18 goflags: go:build linux for tests that won't compile on other platforms (#24559)
I'm a heavy LSP user and I frequently goto:next_error. This confuses my
editor on macOS.
2024-11-28 15:05:00 +01:00
Seth Hoenig
dd396a3900 windows: revert process listing logic to that of v1.6.10 (#24494)
* windows: revert process listing logic to that of v1.6.10

In Nomad 1.7 much of the process management code was refactored, including
a rewrite of how the process tree of an executor was determined on Windows
machines. Unfortunately that rewrite has been cursed with performance issues
and bugs. Instead, revert to the logic used in v1.6.10.

* changelog
2024-11-20 11:20:20 -06:00
Piotr Kazmierczak
5dfb38d806 drivers: fix capabilities on non-linux systems (#24450)
Recently we moved from github.com/syndtr/gocapability to
github.com/moby/sys/capability due to the former package no longer being
maintainer. The new package's capability function works differently: the
known/supported functionality is split now, and the .ListSupported() call will
always return an empty list on non-linux systems. This means Nomad agents won't
start on darwin or windows.
2024-11-13 15:58:25 +01:00
Kir Kolyshkin
d09c8ddf21 deps: switch to moby/sys/capability (#24093)
github.com/moby/sys/capability is a fork of the (no longer maintained)
github.com/syndtr/gocapability package.

For changes since the fork took place, see
https://github.com/moby/sys/blob/main/capability/CHANGELOG.md

Note that the "workaround for RHEL6" is removed for a number of reasons.
Feel free to choose the one you like the most, either is sufficient:

1. /proc/sys/kernel/cap_last_cap is available since RHEL 6.7
   (kernel 2.6.32-573.el6), released 9 years ago (2015-07-22).

2. It incorrectly returns CAP_BLOCK_SUSPEND (36), which was only added
   in kernel v3.5 and was never backported to RHEL6 kernels. The
   correct value for RHEL6 would be CAP_MAC_ADMIN (33).

3. As far as upstream kernels go, /proc/sys/kernel/cap_last_cap was
   added in kernel v3.2, and a correct value depends on the kernel
   version. It could be CAP_WAKE_ALARM (35), added to kernel v3.0, or
   CAP_SYSLOG (34), added to kernel v2.6.38, or possibly a lesser value
   for even older kernels.

Signed-off-by: Kir Kolyshkin <kolyshkin@gmail.com>
2024-11-11 14:07:31 -05:00
Seth Hoenig
a0ff07393b drivers: provide empty implementations of cgroup helpers for non-root nomad (#24392) 2024-11-07 12:24:37 -06:00
Seth Hoenig
b58abf48c1 drivers: move executor process out of v1 task cgroup after process starts (#24340)
* drivers: move executor process out of v1 task cgroup after process starts

This PR changes the behavior of the raw exec task driver on old cgroups v1
systems such that the executor process is no longer a member of the cgroups
created for the task. Now, the executor process is placed into those
cgroups and starts the task child process (just as before), but now then
exits those cgroups and exists in the nomad parent cgroup. This change
makes the behavior sort of similar to cgroups v2 systems, where we never
have the executor enter the task cgroup to begin with (because we can
directly clone(3) the task process into it).

Fixes #23951

* executor: handle non-linux case

* cgroups: add test case for no executor process in task cgroup (v1)

* add changelog

* drivers: also move executor out of cpuset cgroup
2024-11-07 07:31:38 -06:00
Michael Smithhisler
658c429d75 Drivers: add work_dir config to exec/raw_exec/java drivers (#24249)
---------

Co-authored-by: wurosh <uros.m.perisic@gmail.com>
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-11-01 11:04:40 -04:00
Juanadelacuesta
80e398bbf7 test: add tests for validateBounds 2024-10-31 14:54:27 +01:00
Juanadelacuesta
8752bb0a65 func: move the user lookup into the validation, it's used everywhere the function is called 2024-10-31 10:34:26 +01:00
Juana De La Cuesta
f1439f54f7 Update drivers/shared/validators/validators.go
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
2024-10-31 09:32:51 +01:00
Juanadelacuesta
445b19ce3e docs: update func docs 2024-10-30 12:35:06 +01:00
Juanadelacuesta
a491ceff5f fix: put back MSL license header 2024-10-30 11:25:27 +01:00
Juanadelacuesta
9a6d2648c8 style: improve debug logging 2024-10-30 11:21:51 +01:00
Juanadelacuesta
2b9bb7a289 license: change missing file to BUSL 2024-10-30 10:24:35 +01:00
Juanadelacuesta
a9a452341c license: update headers to BUSL 2024-10-29 15:54:09 +01:00
Juanadelacuesta
0227788e22 fix: update tests configuration 2024-10-29 15:24:12 +01:00
Juanadelacuesta
0cd1b5ff13 func: move the validation to a dependency and use id sets 2024-10-28 18:59:51 +01:00
Mike Nomitch
fd7e81dbce Fixing accidental move of helper fn to unix only validators file 2024-10-28 11:15:41 +01:00
Mike Nomitch
c4f2a41da6 Splitting validators unix functions into own file 2024-10-28 11:15:41 +01:00
Mike Nomitch
e1c226e633 Restructuring IDRange 2024-10-28 11:15:41 +01:00
Mike Nomitch
0fbf592131 moving user out of validators 2024-10-28 11:15:41 +01:00
Mike Nomitch
916af5a948 Moving idrange struct location 2024-10-28 11:15:41 +01:00
Mike Nomitch
9565dde138 Only parsing id ranges once 2024-10-28 11:15:41 +01:00
Mike Nomitch
cf36509474 Removing unnecessary int conversion 2024-10-28 11:15:40 +01:00
Mike Nomitch
9cc3992ca6 Adds ability to restrict uid and gids in exec and raw_exec 2024-10-28 11:15:37 +01:00
Tim Gross
6b8ddff1fa windows: set job object for executor and children (#24214)
On Windows, if the `raw_exec` driver's executor exits, the child processes are
not also killed. Create a Windows "job object" (not to be confused with a Nomad
job) and add the executor to it. Child processes of the executor will inherit
the job automatically. When the handle to the job object is freed (on executor
exit), the job itself is destroyed and this causes all processes in that job to
exit.

Fixes: https://github.com/hashicorp/nomad/issues/23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
2024-10-16 09:20:26 -04:00
Tim Gross
fec91d1dc8 windows: trade heap for stack to build process tree for stats in linear space (#24182)
In #20619 we overhauled how we were gathering stats for Windows
processes. Unlike in Linux where we can ask for processes in a cgroup, on
Windows we have to make a single expensive syscall to get all the processes and
then build the tree ourselves. Our algorithm to do so is recursive and quadratic
in both steps and space with the number of processes on the host. For busy hosts
this hits the stack limit and panics the Nomad client.

We already build a map of parent PID to PID, so modify this to be a map of
parent PID to slice of children and then traverse that tree only from the root
we care about (the executor PID). This moves the allocations to the heap but
makes the stats gathering linear in steps and space required.

This changeset also moves as much of this code as possible into an area
 not conditionally-compiled by OS, as the tagged test file was not being run in CI.

Fixes: https://github.com/hashicorp/nomad/issues/23984
2024-10-14 11:26:38 -04:00
Piotr Kazmierczak
981ca36049 docker: use official client instead of fsouza/go-dockerclient (#23966)
This PR replaces fsouza/go-dockerclient 3rd party docker client library with
docker's official SDK.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-09-26 18:41:44 +02:00
Seth Hoenig
51215bf102 deps: update to go-set/v3 and refactor to use custom iterators (#23971)
* deps: update to go-set/v3

* deps: use custom set iterators for looping
2024-09-16 13:40:10 -05:00
Tim Gross
b25f1b66ce resources: allow job authors to configure size of secrets tmpfs (#23696)
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `templates`, which need extra room to write a temporary file to the
secrets directory that gets renamed to the old file atomically.

This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.

Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.

For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.

Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
2024-08-05 16:06:58 -04:00
Piotr Kazmierczak
f22ce921cd docker: adjust capabilities on Windows (#23599)
Adjusts Docker capabilities per OS, and checks for runtime on Windows.

---------

Co-authored-by: Tim Gross <tgross@hashicorp.com>
2024-07-17 09:01:45 +02:00
Piotr Kazmierczak
85430be6dd raw_exec: oom_score_adj support (#23308) 2024-06-14 11:36:27 +02:00
Luke Palmer
75874136ac fix cgroup setup for non-default devices (#22518) 2024-06-13 09:27:19 -04:00
Seth Hoenig
7d00a494d9 windows: fix inefficient gathering of task processes (#20619)
* windows: fix inefficient gathering of task processes

* return set of just executor pid in case of ps error
2024-05-17 09:46:23 -05:00