Commit Graph

34 Commits

Author SHA1 Message Date
Piotr Kazmierczak
0fa0624576 exec: Fix incorrect HOME and USER env variables for tasks that have user set (#25859)
Co-authored-by: Tim Gross <tgross@hashicorp.com>
2025-05-16 15:02:45 +02:00
Simon Zou
73ceacd236 ListProcesses through PID when cgroup is not found in Linux (#25198)
* ListProcesses through PID when cgroup is not found

* add changelog entry

* update the ListByPid for windows
2025-03-06 17:41:51 +01:00
Juana De La Cuesta
6ffe441983 [gh-24931] Return dummy function for moving processes when running rootless (#24944)
* fix: stop executor launch if nomad doesnt have permissions

* func: return move function if c group is not enabled
2025-03-06 10:34:21 +01:00
Seth Hoenig
a0ff07393b drivers: provide empty implementations of cgroup helpers for non-root nomad (#24392) 2024-11-07 12:24:37 -06:00
Seth Hoenig
b58abf48c1 drivers: move executor process out of v1 task cgroup after process starts (#24340)
* drivers: move executor process out of v1 task cgroup after process starts

This PR changes the behavior of the raw exec task driver on old cgroups v1
systems such that the executor process is no longer a member of the cgroups
created for the task. Now, the executor process is placed into those
cgroups and starts the task child process (just as before), but now then
exits those cgroups and exists in the nomad parent cgroup. This change
makes the behavior sort of similar to cgroups v2 systems, where we never
have the executor enter the task cgroup to begin with (because we can
directly clone(3) the task process into it).

Fixes #23951

* executor: handle non-linux case

* cgroups: add test case for no executor process in task cgroup (v1)

* add changelog

* drivers: also move executor out of cpuset cgroup
2024-11-07 07:31:38 -06:00
Seth Hoenig
51215bf102 deps: update to go-set/v3 and refactor to use custom iterators (#23971)
* deps: update to go-set/v3

* deps: use custom set iterators for looping
2024-09-16 13:40:10 -05:00
Piotr Kazmierczak
85430be6dd raw_exec: oom_score_adj support (#23308) 2024-06-14 11:36:27 +02:00
Seth Hoenig
7d00a494d9 windows: fix inefficient gathering of task processes (#20619)
* windows: fix inefficient gathering of task processes

* return set of just executor pid in case of ps error
2024-05-17 09:46:23 -05:00
Tim Gross
623486b302 deps: vendor containernetworking/plugins functions for net NS utils (#20556)
We bring in `containernetworking/plugins` for the contents of a single file,
which we use in a few places for running a goroutine in a specific network
namespace. This code hasn't needed an update in a couple of years, and a good
chunk of what we need was previously vendored into `client/lib/nsutil`
already.

Updating the library via dependabot is causing errors in Docker driver tests
because it updates a lot of transient dependencies, and it's bringing in a pile
of new transient dependencies like opentelemetry. Avoid this problem going
forward by vendoring the remaining code we hadn't already.

Ref: https://github.com/hashicorp/nomad/pull/20146
2024-05-13 09:10:16 -04:00
Seth Hoenig
14a022cbc0 drivers/raw_exec: enable setting cgroup override values (#20481)
* drivers/raw_exec: enable setting cgroup override values

This PR enables configuration of cgroup override values on the `raw_exec`
task driver. WARNING: setting cgroup override values eliminates any
gauruntee Nomad can make about resource availability for *any* task on
the client node.

For cgroup v2 systems, set a single unified cgroup path using `cgroup_v2_override`.
The path may be either absolute or relative to the cgroup root.

config {
  cgroup_v2_override = "custom.slice/app.scope"
}

or

config {
  cgroup_v2_override = "/sys/fs/cgroup/custom.slice/app.scope"
}

For cgroup v1 systems, set a per-controller path for each controller using
`cgroup_v1_override`. The path(s) may be either absolute or relative to
the controller root.

config {
  cgroup_v1_override = {
    "pids": "custom/app",
    "cpuset": "custom/app",
  }
}

or

config {
  cgroup_v1_override = {
    "pids": "/sys/fs/cgroup/pids/custom/app",
    "cpuset": "/sys/fs/cgroup/cpuset/custom/app",
  }
}

* drivers/rawexec: ensure only one of v1/v2 cgroup override is set

* drivers/raw_exec: executor should error if setting cgroup does not work

* drivers/raw_exec: create cgroups in raw_exec tests

* drivers/raw_exec: ensure we fail to start if custom cgroup set and non-root

* move custom cgroup func into shared file

---------

Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
2024-05-07 16:46:27 -07:00
Seth Hoenig
cb7d078c1d drivers/raw_exec: enable configuring raw_exec task to have no memory limit (#19670)
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit

This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.

This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.

* cl: add cl
2024-01-09 14:57:13 -06:00
Matt Robenolt
656bb5cafa drivers/executor: set oom_score_adj for raw_exec (#19515)
* drivers/executor: set oom_score_adj for raw_exec

This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.

We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.

Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.

I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.

We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.

* drivers/executor: minor cleanup of setting oom adjustment

* e2e: add test for raw_exec oom adjust score

* e2e: set oom score adjust to -999

* cl: add cl

---------

Co-authored-by: Seth Hoenig <shoenig@duck.com>
2024-01-02 13:35:09 -06:00
Tim Gross
0236bd0907 qemu: fix panic from missing resources block (#19089)
The `qemu` driver uses our universal executor to run the qemu command line
tool. Because qemu owns the resource isolation, we don't pass in the resource
block that the universal executor uses to configure cgroups and core
pinning. This resulted in a panic.

Fix the panic by returning early in the cgroup configuration in the universal
executor. This fixes `qemu` but also any third-party drivers that might exist
and are using our executor code without passing in the resource block.

In future work, we should ensure that the `resources` block is being translated
into qemu equivalents, so that we have support for things like NUMA-aware
scheduling for that driver.

Fixes: https://github.com/hashicorp/nomad/issues/19078
2023-11-14 16:26:44 -05:00
Seth Hoenig
e3c8700ded deps: upgrade to go-set/v2 (#18638)
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
2023-10-05 11:56:17 -05:00
Seth Hoenig
2e1974a574 client: refactor cpuset partitioning (#18371)
* client: refactor cpuset partitioning

This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.

Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.

Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.

* cr: tweaks for feedback
2023-09-12 09:11:11 -05:00
Seth Hoenig
a4cc76bd3e numa: enable numa topology detection (#18146)
* client: refactor cgroups management in client

* client: fingerprint numa topology

* client: plumb numa and cgroups changes to drivers

* client: cleanup task resource accounting

* client: numa client and config plumbing

* lib: add a stack implementation

* tools: remove ec2info tool

* plugins: fixup testing for cgroups / numa changes

* build: update makefile and package tests and cl
2023-08-10 17:05:30 -05:00
hashicorp-copywrite[bot]
f005448366 [COMPLIANCE] Add Copyright and License Headers 2023-04-10 15:36:59 +00:00
Seth Hoenig
e4e5bc5cef client: protect user lookups with global lock (#14742)
* client: protect user lookups with global lock

This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).

* cl: add cl
2022-09-29 09:30:13 -05:00
Seth Hoenig
6d9e179338 deps: update opencontainers/runc to v1.1.3 2022-08-04 12:56:49 -05:00
Seth Hoenig
be7ec8de3e raw_exec: make raw exec driver work with cgroups v2
This PR adds support for the raw_exec driver on systems with only cgroups v2.

The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.

There is a bit of refactoring in this PR, but the fundamental design remains
the same.

Closes #12351 #12348
2022-04-04 16:11:38 -05:00
Seth Hoenig
5da1a31e94 client: enable support for cgroups v2
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.

Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.

Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.

When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.

Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.

The new cpuset management strategy fixes #11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.

[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html

Closes #11289
Fixes #11705 #11773 #11933
2022-03-23 11:35:27 -05:00
Tim Gross
358a46819b fix integer bounds checks (#11815)
* driver: fix integer conversion error

The shared executor incorrectly parsed the user's group into int32 and
then cast to uint32 without bounds checking. This is harmless because
an out-of-bounds gid will throw an error later, but it triggers
security and code quality scans. Parse directly to uint32 so that we
get correct error handling.

* helper: fix integer conversion error

The autopilot flags helper incorrectly parses a uint64 to a uint which
is machine specific size. Although we don't have 32-bit builds, this
sets off security and code quality scaans. Parse to the machine sized
uint.

* driver: restrict bounds of port map

The plugin server doesn't constrain the maximum integer for port
maps. This could result in a user-visible misconfiguration, but it
also triggers security and code quality scans. Restrict the bounds
before casting to int32 and return an error.

* cpuset: restrict upper bounds of cpuset values

Our cpuset configuration expects values in the range of uint16 to
match the expectations set by the kernel, but we don't constrain the
values before downcasting. An underflow could lead to allocations
failing on the client rather than being caught earlier. This also make
security and code quality scanners happy.

* http: fix integer downcast for per_page parameter

The parser for the `per_page` query parameter downcasts to int32
without bounds checking. This could result in underflow and
nonsensical paging, but there's no server-side consequences for
this. Fixing this will silence some security and code quality scanners
though.
2022-01-25 11:16:48 -05:00
Seth Hoenig
87dbc7162b deps: upgrade docker and runc
This PR upgrades
 - docker dependency to the latest tagged release (v20.10.12)
 - runc dependency to the latest tagged release (v1.0.3)

Docker does not abide by [semver](https://github.com/moby/moby/issues/39302), so it is marked +incompatible,
and transitive dependencies are upgrade manually.

Runc made three relevant breaking changes

 * cgroup manager .Set changed to accept Resources instead of Cgroup
   3f65946756

 * config.Device moved to devices.Device
   https://github.com/opencontainers/runc/pull/2679

 * mountinfo.Mounted now returns an error if the specified path does not exist
   https://github.com/moby/sys/blob/mountinfo/v0.5.0/mountinfo/mountinfo.go#L16
2022-01-18 08:35:26 -06:00
zhsj
46b335d652 deps: update runc to v1.0.0-rc93
includes updates for breaking changes in runc v1.0.0-rc93
2021-03-31 10:57:02 -04:00
Mahmood Ali
edaa16589b honor task user when execing into raw_exec task (#9439)
Fix #9210 .

This update the executor so it honors the User when using nomad alloc exec. The bug was that the exec task didn't honor the init command when execing.
2020-11-25 09:34:10 -05:00
Shengjing Zhu
274bf2ee1c Adjust cgroup change in libcontainer 2020-08-20 00:31:07 +08:00
Nick Ethier
149578ca1e executor: rename wrapNetns to withNetworkIsolation 2019-09-30 21:38:31 -04:00
Nick Ethier
e6ce6d2c2b comment wrapNetns 2019-09-30 12:06:52 -04:00
Nick Ethier
2f16eb9640 executor: run exec commands in netns if set 2019-09-30 11:50:22 -04:00
Nick Ethier
d28d865100 executor: support network namespacing on universal executor 2019-07-31 01:03:58 -04:00
Lang Martin
c7cd018655 executor_universal_linux log a link to the docs on cgroup error 2019-07-24 12:37:33 -04:00
Lang Martin
cab04997f0 executor_universal_linux raw_exec cgroup failure is not fatal 2019-07-22 15:16:36 -04:00
Lang Martin
1a9c598fc2 executor_universal_linux getAllPids chooses cgroup when available 2019-07-17 17:33:55 -04:00
Danielle Tomlinson
d2136e0aa7 drivers: Move client/drivers/executor to drivers/shared/executor 2018-11-30 10:46:13 +01:00