On Windows, if the `raw_exec` driver's executor exits, the child processes are
not also killed. Create a Windows "job object" (not to be confused with a Nomad
job) and add the executor to it. Child processes of the executor will inherit
the job automatically. When the handle to the job object is freed (on executor
exit), the job itself is destroyed and this causes all processes in that job to
exit.
Fixes: https://github.com/hashicorp/nomad/issues/23668
Ref: https://learn.microsoft.com/en-us/windows/win32/procthread/job-objects
In #20619 we overhauled how we were gathering stats for Windows
processes. Unlike in Linux where we can ask for processes in a cgroup, on
Windows we have to make a single expensive syscall to get all the processes and
then build the tree ourselves. Our algorithm to do so is recursive and quadratic
in both steps and space with the number of processes on the host. For busy hosts
this hits the stack limit and panics the Nomad client.
We already build a map of parent PID to PID, so modify this to be a map of
parent PID to slice of children and then traverse that tree only from the root
we care about (the executor PID). This moves the allocations to the heap but
makes the stats gathering linear in steps and space required.
This changeset also moves as much of this code as possible into an area
not conditionally-compiled by OS, as the tagged test file was not being run in CI.
Fixes: https://github.com/hashicorp/nomad/issues/23984
On supported platforms, the secrets directory is a 1MiB tmpfs. But some tasks
need larger space for downloading large secrets. This is especially the case for
tasks using `templates`, which need extra room to write a temporary file to the
secrets directory that gets renamed to the old file atomically.
This changeset allows increasing the size of the tmpfs in the `resources`
block. Because this is a memory resource, we need to include it in the memory we
allocate for scheduling purposes. The task is already prevented from using more
memory in the tmpfs than the `resources.memory` field allows, but can bypass
that limit by writing to the tmpfs via `template` or `artifact` blocks.
Therefore, we need to account for the size of the tmpfs in the allocation
resources. Simply adding it to the memory needed when we create the allocation
allows it to be accounted for in all downstream consumers, and then we'll
subtract that amount from the memory resources just before configuring the task
driver.
For backwards compatibility, the default value of 1MiB is "free" and ignored by
the scheduler. Otherwise we'd be increasing the allocated resources for every
existing alloc, which could cause problems across upgrades. If a user explicitly
sets `resources.secrets = 1` it will no longer be free.
Fixes: https://github.com/hashicorp/nomad/issues/2481
Ref: https://hashicorp.atlassian.net/browse/NET-10070
Whenever the "exec" task driver is being used, nomad runs a plug in that in time runs the task on a container under the hood. If by any circumstance the executor is killed, the task is reparented to the init service and wont be stopped by Nomad in case of a job updated or stop.
This commit introduces two mechanisms to avoid this behaviour:
* Adds signal catching and handling to the executor, so in case of a SIGTERM, the signal will also be passed on to the task.
* Adds a pre start clean up of the processes in the container, ensuring only the ones the executor runs are present at any given time.
We bring in `containernetworking/plugins` for the contents of a single file,
which we use in a few places for running a goroutine in a specific network
namespace. This code hasn't needed an update in a couple of years, and a good
chunk of what we need was previously vendored into `client/lib/nsutil`
already.
Updating the library via dependabot is causing errors in Docker driver tests
because it updates a lot of transient dependencies, and it's bringing in a pile
of new transient dependencies like opentelemetry. Avoid this problem going
forward by vendoring the remaining code we hadn't already.
Ref: https://github.com/hashicorp/nomad/pull/20146
* drivers/raw_exec: enable setting cgroup override values
This PR enables configuration of cgroup override values on the `raw_exec`
task driver. WARNING: setting cgroup override values eliminates any
gauruntee Nomad can make about resource availability for *any* task on
the client node.
For cgroup v2 systems, set a single unified cgroup path using `cgroup_v2_override`.
The path may be either absolute or relative to the cgroup root.
config {
cgroup_v2_override = "custom.slice/app.scope"
}
or
config {
cgroup_v2_override = "/sys/fs/cgroup/custom.slice/app.scope"
}
For cgroup v1 systems, set a per-controller path for each controller using
`cgroup_v1_override`. The path(s) may be either absolute or relative to
the controller root.
config {
cgroup_v1_override = {
"pids": "custom/app",
"cpuset": "custom/app",
}
}
or
config {
cgroup_v1_override = {
"pids": "/sys/fs/cgroup/pids/custom/app",
"cpuset": "/sys/fs/cgroup/cpuset/custom/app",
}
}
* drivers/rawexec: ensure only one of v1/v2 cgroup override is set
* drivers/raw_exec: executor should error if setting cgroup does not work
* drivers/raw_exec: create cgroups in raw_exec tests
* drivers/raw_exec: ensure we fail to start if custom cgroup set and non-root
* move custom cgroup func into shared file
---------
Co-authored-by: Michael Schurter <mschurter@hashicorp.com>
* exec2: add client support for unveil filesystem isolation mode
This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.
* actually create alloc-mounts-dir directory
* fix doc strings about alloc mount dir paths
The value for the executor cgroup CPU weight must be within the limits
imposed by the Linux kernel.
Nomad used the task `resource.cpu`, an unbounded value, directly as the
cgroup CPU weight, causing it to potentially go outside the imposed
values.
This commit clamps the CPU shares values to be within the limits
allowed.
Co-authored-by: Tim Gross <tgross@hashicorp.com>
On Windows, Nomad uses `syscall.NewLazyDLL` and `syscall.LoadDLL` functions to
load a few system DLL files, which does not prevent DLL hijacking
attacks. Hypothetically a local attacker on the client host that can place an
abusive library in a specific location could use this to escalate privileges to
the Nomad process. Although this attack does not fall within the Nomad security
model, it doesn't hurt to follow good practices here.
We can remove two of these DLL loads by using wrapper functions provided by the
stdlib in `x/sys/windows`
Co-authored-by: dduzgun-security <deniz.duzgun@hashicorp.com>
* drivers/raw_exec: enable configuring raw_exec task to have no memory limit
This PR makes it possible to configure a raw_exec task to not have an
upper memory limit, which is how the driver would behave pre-1.7.
This is done by setting memory_max = -1. The cluster (or node pool) must
have memory oversubscription enabled.
* cl: add cl
* Add OomKilled field to executor proto format
* Teach linux executor to detect and report OOMs
* Teach exec driver to propagate OOMKill information
* Fix data race
* use tail /dev/zero to create oom condition
* use new test framework
* minor tweaks to executor test
* add cl entry
* remove type conversion
---------
Co-authored-by: Marvin Chin <marvinchin@users.noreply.github.com>
Co-authored-by: Seth Hoenig <shoenig@duck.com>
* drivers/executor: set oom_score_adj for raw_exec
This might not be wholly true since I don't know all configurations of
Nomad, but in our use cases, we run some of our tasks as `raw_exec` for
reasons.
We observed that our tasks were running with `oom_score_adj = -1000`,
which prevents them from being OOM'd. This value is being inherited from
the nomad agent parent process, as configured by systemd.
Similar to #10698, we also were shocked to have this value inherited
down to every child process and believe that we should also set this
value to 0 explicitly.
I have no idea if there are other paths that might leverage this or
other ways that `raw_exec` can manifest, but this is how I was able to
observe and fix in one of our configurations.
We have been running in production our tasks wrapped in a script that
does: `echo 0 > /proc/self/oom_score_adj` to avoid this issue.
* drivers/executor: minor cleanup of setting oom adjustment
* e2e: add test for raw_exec oom adjust score
* e2e: set oom score adjust to -999
* cl: add cl
---------
Co-authored-by: Seth Hoenig <shoenig@duck.com>
The `qemu` driver uses our universal executor to run the qemu command line
tool. Because qemu owns the resource isolation, we don't pass in the resource
block that the universal executor uses to configure cgroups and core
pinning. This resulted in a panic.
Fix the panic by returning early in the cgroup configuration in the universal
executor. This fixes `qemu` but also any third-party drivers that might exist
and are using our executor code without passing in the resource block.
In future work, we should ensure that the `resources` block is being translated
into qemu equivalents, so that we have support for things like NUMA-aware
scheduling for that driver.
Fixes: https://github.com/hashicorp/nomad/issues/19078
No functional changes, just cleaning up deprecated usages that are
removed in v2 and replace one call of .Slice with .ForEach to avoid
making the intermediate copy.
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test slit tables to be possible
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
Although nomad officially does not support running the client as a non-root
user, doing so has been more or less possible with the raw_exec driver as
long as you don't expect features to work like networking or running tasks
as specific users. In the cgroups refactoring I bulldozed right over the
special casing we had in place for raw_exec to continue working if the cgroups
were unable to be created. This PR restores that behavior - you can now
(as before) run the nomad client as a non-root user and make use of the
raw_exec task driver.
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolved
the issue.
This issue was most noticeable on ARM64, as there auto-detection
always failed.
This complements the `env` parameter, so that the operator can author
tasks that don't share their Vault token with the workload when using
`image` filesystem isolation. As a result, more powerful tokens can be used
in a job definition, allowing it to use template stanzas to issue all kinds of
secrets (database secrets, Vault tokens with very specific policies, etc.),
without sharing that issuing power with the task itself.
This is accomplished by creating a directory called `private` within
the task's working directory, which shares many properties of
the `secrets` directory (tmpfs where possible, not accessible by
`nomad alloc fs` or Nomad's web UI), but isn't mounted into/bound to the
container.
If the `disable_file` parameter is set to `false` (its default), the Vault token
is also written to the NOMAD_SECRETS_DIR, so the default behavior is
backwards compatible. Even if the operator never changes the default,
they will still benefit from the improved behavior of Nomad never reading
the token back in from that - potentially altered - location.
* client: do not disable memory swappiness if kernel does not support it
This PR adds a workaround for very old Linux kernels which do not support
the memory swappiness interface file. Normally we write a "0" to the file
to explicitly disable swap. In the case the kernel does not support it,
give libcontainer a nil value so it does not write anything.
Fixes#17448
* client: detect swappiness by writing to the file
* fixup changelog
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
---------
Co-authored-by: James Rasell <jrasell@users.noreply.github.com>
On Windows the executor returns an error when trying to open the `NUL` device
when we pass it `os.DevNull` for the stdout/stderr paths. Instead of opening the
device, use the discard pipe so that we have platform-specific behavior from the
executor itself.
Fixes: #17148
Currently, the `exec` driver is only setting the Bounding set, which is
not sufficient to actually enable the requisite capabilities for the
task process. In order for the capabilities to survive `execve`
performed by libcontainer, the `Permitted`, `Inheritable`, and `Ambient`
sets must also be set.
Per CAPABILITIES (7):
> Ambient: This is a set of capabilities that are preserved across an
> execve(2) of a program that is not privileged. The ambient capability
> set obeys the invariant that no capability can ever be ambient if it
> is not both permitted and inheritable.
* Update ioutil library references to os and io respectively for drivers package
No user facing changes so I assume no change log is required
* Fix failing tests
The exec driver and other drivers derived from the shared executor check the
path of the command before handing off to libcontainer to ensure that the
command doesn't escape the sandbox. But we don't check any host volume mounts,
which should be safe to use as a source for executables if we're letting the
user mount them to the container in the first place.
Check the mount config to verify the executable lives in the mount's host path,
but then return an absolute path within the mount's task path so that we can hand
that off to libcontainer to run.
Includes a good bit of refactoring here because the anchoring of the final task
path has different code paths for inside the task dir vs inside a mount. But
I've fleshed out the test coverage of this a good bit to ensure we haven't
created any regressions in the process.
* client: protect user lookups with global lock
This PR updates Nomad client to always do user lookups while holding
a global process lock. This is to prevent concurrency unsafe implementations
of NSS, but still enabling NSS lookups of users (i.e. cannot not use osusergo).
* cl: add cl
The `golang.org/x/net/context` package was merged into the stdlib as of go
1.7. Update the imports to use the identical stdlib version. Clean up import
blocks for the impacted files to remove unnecessary package aliasing.
* test: use `T.TempDir` to create temporary test directory
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix TestLogmon_Start_restart on Windows
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestConsul_Integration
t.TempDir fails to perform the cleanup properly because the folder is
still in use
testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This test exercises upgrades between 0.8 and Nomad versions greater
than 0.9. We have not supported 0.8.x in a very long time and in any
case the test has been marked to skip because the downloader doesn't
work.
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
* driver: fix integer conversion error
The shared executor incorrectly parsed the user's group into int32 and
then cast to uint32 without bounds checking. This is harmless because
an out-of-bounds gid will throw an error later, but it triggers
security and code quality scans. Parse directly to uint32 so that we
get correct error handling.
* helper: fix integer conversion error
The autopilot flags helper incorrectly parses a uint64 to a uint which
is machine specific size. Although we don't have 32-bit builds, this
sets off security and code quality scaans. Parse to the machine sized
uint.
* driver: restrict bounds of port map
The plugin server doesn't constrain the maximum integer for port
maps. This could result in a user-visible misconfiguration, but it
also triggers security and code quality scans. Restrict the bounds
before casting to int32 and return an error.
* cpuset: restrict upper bounds of cpuset values
Our cpuset configuration expects values in the range of uint16 to
match the expectations set by the kernel, but we don't constrain the
values before downcasting. An underflow could lead to allocations
failing on the client rather than being caught earlier. This also make
security and code quality scanners happy.
* http: fix integer downcast for per_page parameter
The parser for the `per_page` query parameter downcasts to int32
without bounds checking. This could result in underflow and
nonsensical paging, but there's no server-side consequences for
this. Fixing this will silence some security and code quality scanners
though.
This PR upgrades our CI images and fixes some affected tests.
- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)
- eliminate go-machine-recent-image (no longer necessary)
- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)
- fix tcp dial error check (message seems to be OS specific)
- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)
- use safe MkdirTemp for generating tmpfiles
NOT applied: (too flakey)
- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)
- upgrade resource type for all imanges to large (2C -> 4C)
github.com/kr/pty was moved to github.com/creack/pty
Swap this dependency so we can upgrade to the latest version
and no longer need a replace directive.