No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
* client: protect user lookups with global lock
This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).
* cl: add cl
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
* driver: fix integer conversion error
The shared executor incorrectly parsed the user's group into int32 and
then cast to uint32 without bounds checking. This is harmless because
an out-of-bounds gid will throw an error later, but it triggers
security and code quality scans. Parse directly to uint32 so that we
get correct error handling.
* helper: fix integer conversion error
The autopilot flags helper incorrectly parses a uint64 to a uint which
is machine specific size. Although we don't have 32-bit builds, this
sets off security and code quality scaans. Parse to the machine sized
uint.
* driver: restrict bounds of port map
The plugin server doesn't constrain the maximum integer for port
maps. This could result in a user-visible misconfiguration, but it
also triggers security and code quality scans. Restrict the bounds
before casting to int32 and return an error.
* cpuset: restrict upper bounds of cpuset values
Our cpuset configuration expects values in the range of uint16 to
match the expectations set by the kernel, but we don't constrain the
values before downcasting. An underflow could lead to allocations
failing on the client rather than being caught earlier. This also make
security and code quality scanners happy.
* http: fix integer downcast for per_page parameter
The parser for the `per_page` query parameter downcasts to int32
without bounds checking. This could result in underflow and
nonsensical paging, but there's no server-side consequences for
this. Fixing this will silence some security and code quality scanners
though.
Fix#9210 .
This update the executor so it honors the User when using nomad alloc exec. The bug was that the exec task didn't honor the init command when execing.