Collecting metrics from processes is expensive, especially on platforms like
Windows. The executor code has a 5s cache of stats to ensure that we don't
thrash syscalls on nodes running many allocations. But the timestamp used to
calculate TTL of this cache was never being set, so we were always treating it
as expired. This causes excess CPU utilization on client nodes.
Ensure that when we fill the cache, we set the timestamp. In testing on Windows,
this reduces exector CPU overhead by roughly 75%.
This changeset includes two other related items:
* The `telemetry.publish_allocation_metrics` field correctly prevents a node
from publishing metrics, but the stats hook on the taskrunner still collects
the metrics, which can be expensive. Thread the configuration value into the
stats hook so that we don't collect if `telemetry.publish_allocation_metrics =
false`.
* The `linuxProcStats` type in the executor's `procstats` package is misnamed as
a result of a couple rounds of refactoring. It's used by all task executors,
not just Linux. Rename this and move a comment about how Windows processes are
listed so that the comment is closer to where the logic is implemented.
Fixes: https://github.com/hashicorp/nomad/issues/23323
Fixes: https://hashicorp.atlassian.net/browse/NMD-455
Every now and then TestExecutor_OOMKilled would fail with: "unable to start
container process: container init was OOM-killed (memory limit too low?)" which
started happening since we upgraded libcontainer.
This PR removes manual (and arbitrary) resource limits on the test
task, since it should be OOMd with resources inherited from the
testExecutorCommandWithChroot, and it fixes a small possible goroutine leak in
the OOM checker in exec driver.
Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.
This commit introduces two mechanisms to avoid this behaviour:
* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
The value for the executor cgroup CPU weight must be within the limits
imposed by the Linux kernel.
Nomad used the task `resource.cpu`, an unbounded value, directly as the
cgroup CPU weight, causing it to potentially go outside the imposed
values.
This commit clamps the CPU shares values to be within the limits
allowed.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
* Add OomKilled field to executor proto format
* Teach linux executor to detect and report OOMs
* Teach exec driver to propagate OOMKill information
* Fix data race
* use tail /dev/zero to create oom condition
* use new test framework
* minor tweaks to executor test
* add cl entry
* remove type conversion
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test slit tables to be possible
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolved
the issue.
This issue was most noticeable on ARM64, as there auto-detection
always failed.
* client: do not disable memory swappiness if kernel does not support it
This PR adds a workaround for very old Linux kernels which do not support
the memory swappiness interface file. Normally we write a "0" to the file
to explicitly disable swap. In the case the kernel does not support it,
give libcontainer a nil value so it does not write anything.
Fixes#17448
* client: detect swappiness by writing to the file
* fixup changelog
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process. In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.
Per CAPABILITIES (7):
> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged. The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.
Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.
Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
This PR upgrades our CI images and fixes some affected tests.
- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)
- eliminate go-machine-recent-image (no longer necessary)
- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)
- fix tcp dial error check (message seems to be OS specific)
- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)
- use safe MkdirTemp for generating tmpfiles
NOT applied: (too flakey)
- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)
- upgrade resource type for all imanges to large (2C -> 4C)
Explicitly set the `oom_score_adj` value for `exec` and `java` tasks.
We recommend that the Nomad service to have oom_score_adj of a low value
(e.g. -1000) to avoid having nomad agent OOM Killed if the node is
oversubscriped.
However, Nomad's workloads should not inherit Nomad's process, which is
the default behavior.
Fixes#10663
This PR enables setting allow_caps on the exec driver
plugin configuration, as well as cap_add and cap_drop in
exec task configuration. These options replicate the
functionality already present in the docker task driver.
Important: this change also reduces the default set of
capabilities enabled by the exec driver to match the
default set enabled by the docker driver. Until v1.0.5
the exec task driver would enable all capabilities supported
by the operating system. v1.0.5 removed NET_RAW from that
list of default capabilities, but left may others which
could potentially also be leveraged by compromised tasks.
Important: the "root" user is still special cased when
used with the exec driver. Older versions of Nomad enabled
enabled all capabilities supported by the operating system
for tasks set with the root user. To maintain compatibility
with existing clusters we continue supporting this "feature",
however we maintain support for the legacy set of capabilities
rather than enabling all capabilities now supported on modern
operating systems.
The default Linux Capabilities set enabled by the docker, exec, and
java task drivers includes CAP_NET_RAW (for making ping just work),
which has the side affect of opening an ARP DoS/MiTM attack between
tasks using bridge networking on the same host network.
https://docs.docker.com/engine/reference/run/#runtime-privilege-and-linux-capabilities
This PR disables CAP_NET_RAW for the docker, exec, and java task
drivers. The previous behavior can be restored for docker using the
allow_caps docker plugin configuration option.
A future version of nomad will enable similar configurability for the
exec and java task drivers.
This PR adds pid_mode and ipc_mode options to the exec and java task
driver config options. By default these will defer to the default_pid_mode
and default_ipc_mode agent plugin options created in #9969. Setting
these values to "host" mode disables isolation for the task. Doing so
is not recommended, but may be necessary to support legacy job configurations.
Closes#9970
This PR adds default_pid_mode and default_ipc_mode options to the exec and java
task drivers. By default these will default to "private" mode, enabling PID and
IPC isolation for tasks. Setting them to "host" mode disables isolation. Doing
so is not recommended, but may be necessary to support legacy job configurations.
Closes#9969
This has to have been unused because the HasPrefix operation is
backwards, meaning a Command.Env that includes PATH= never would have
worked; the default path was always used.