* exec2: add client support for unveil filesystem isolation mode
This PR adds support for a new filesystem isolation mode, "Unveil". The
mode introduces a "alloc_mounts" directory where tasks have user-owned
directory structure which are bind mounts into the real alloc directory
structure. This enables a task driver to use landlock (and maybe the
real unveil on openbsd one day) to isolate a task to the task owned
directory structure, providing sandboxing.
* actually create alloc-mounts-dir directory
* fix doc strings about alloc mount dir paths
* drivers: plumb hardware topology via grpc into drivers
This PR swaps out the temporary use of detecting system hardware manually
in each driver for using the Client's detected topology by plumbing the
data over gRPC. This ensures that Client configuration is taken to account
consistently in all references to system topology.
* cr: use enum instead of bool for core grade
* cr: fix test slit tables to be possible
* client: refactor cpuset partitioning
This PR updates the way Nomad client manages the split between tasks
that make use of resources.cpus vs. resources.cores.
Previously, each task was explicitly assigned which CPU cores they were
able to run on. Every time a task was started or destroyed, all other
tasks' cpusets would need to be updated. This was inefficient and would
crush the Linux kernel when a client would try to run ~400 or so tasks.
Now, we make use of cgroup heirarchy and cpuset inheritence to efficiently
manage cpusets.
* cr: tweaks for feedback
Before this commit, it was only used for fingerprinting, but not
for CPU stats on nodes or tasks. This meant that if the
auto-detection failed, setting the cpu_total_compute didn't resolved
the issue.
This issue was most noticeable on ARM64, as there auto-detection
always failed.
* Update ioutil library references to os and io respectively for drivers package
No user facing changes so I assume no change log is required
* Fix failing tests
* test: use `T.TempDir` to create temporary test directory
This commit replaces `ioutil.TempDir` with `t.TempDir` in tests. The
directory created by `t.TempDir` is automatically removed when the test
and all its subtests complete.
Prior to this commit, temporary directory created using `ioutil.TempDir`
needs to be removed manually by calling `os.RemoveAll`, which is omitted
in some tests. The error handling boilerplate e.g.
defer func() {
if err := os.RemoveAll(dir); err != nil {
t.Fatal(err)
}
}
is also tedious, but `t.TempDir` handles this for us nicely.
Reference: https://pkg.go.dev/testing#T.TempDir
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix TestLogmon_Start_restart on Windows
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
* test: fix failing TestConsul_Integration
t.TempDir fails to perform the cleanup properly because the folder is
still in use
testing.go:967: TempDir RemoveAll cleanup: unlinkat /tmp/TestConsul_Integration2837567823/002/191a6f1a-5371-cf7c-da38-220fe85d10e5/web/secrets: device or resource busy
Signed-off-by: Eng Zer Jun <engzerjun@gmail.com>
This PR adds support for the raw_exec driver on systems with only cgroups v2.
The raw exec driver is able to use cgroups to manage processes. This happens
only on Linux, when exec_driver is enabled, and the no_cgroups option is not
set. The driver uses the freezer controller to freeze processes of a task,
issue a sigkill, then unfreeze. Previously the implementation assumed cgroups
v1, and now it also supports cgroups v2.
There is a bit of refactoring in this PR, but the fundamental design remains
the same.
Closes#12351#12348
This PR introduces support for using Nomad on systems with cgroups v2 [1]
enabled as the cgroups controller mounted on /sys/fs/cgroups. Newer Linux
distros like Ubuntu 21.10 are shipping with cgroups v2 only, causing problems
for Nomad users.
Nomad mostly "just works" with cgroups v2 due to the indirection via libcontainer,
but not so for managing cpuset cgroups. Before, Nomad has been making use of
a feature in v1 where a PID could be a member of more than one cgroup. In v2
this is no longer possible, and so the logic around computing cpuset values
must be modified. When Nomad detects v2, it manages cpuset values in-process,
rather than making use of cgroup heirarchy inheritence via shared/reserved
parents.
Nomad will only activate the v2 logic when it detects cgroups2 is mounted at
/sys/fs/cgroups. This means on systems running in hybrid mode with cgroups2
mounted at /sys/fs/cgroups/unified (as is typical) Nomad will continue to
use the v1 logic, and should operate as before. Systems that do not support
cgroups v2 are also not affected.
When v2 is activated, Nomad will create a parent called nomad.slice (unless
otherwise configured in Client conifg), and create cgroups for tasks using
naming convention <allocID>-<task>.scope. These follow the naming convention
set by systemd and also used by Docker when cgroups v2 is detected.
Client nodes now export a new fingerprint attribute, unique.cgroups.version
which will be set to 'v1' or 'v2' to indicate the cgroups regime in use by
Nomad.
The new cpuset management strategy fixes#11705, where docker tasks that
spawned processes on startup would "leak". In cgroups v2, the PIDs are
started in the cgroup they will always live in, and thus the cause of
the leak is eliminated.
[1] https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.htmlCloses#11289Fixes#11705#11773#11933
This PR upgrades our CI images and fixes some affected tests.
- upgrade go-machine-image to premade latest ubuntu LTS (ubuntu-2004:202111-02)
- eliminate go-machine-recent-image (no longer necessary)
- manage GOPATH in GNUMakefile (see https://discuss.circleci.com/t/gopath-is-set-to-multiple-directories/7174)
- fix tcp dial error check (message seems to be OS specific)
- spot check values measured instead of specifically 'RSS' (rss no longer reported in cgroups v2)
- use safe MkdirTemp for generating tmpfiles
NOT applied: (too flakey)
- eliminate setting GOMAXPROCS=1 (build tools were also affected by this setting)
- upgrade resource type for all imanges to large (2C -> 4C)
Fixes#2522
Skip embedding client.alloc_dir when building chroot. If a user
configures a Nomad client agent so that the chroot_env will embed the
client.alloc_dir, Nomad will happily infinitely recurse while building
the chroot until something horrible happens. The best case scenario is
the filesystem's path length limit is hit. The worst case scenario is
disk space is exhausted.
A bad agent configuration will look something like this:
```hcl
data_dir = "/tmp/nomad-badagent"
client {
enabled = true
chroot_env {
# Note that the source matches the data_dir
"/tmp/nomad-badagent" = "/ohno"
# ...
}
}
```
Note that `/ohno/client` (the state_dir) will still be created but not
`/ohno/alloc` (the alloc_dir).
While I cannot think of a good reason why someone would want to embed
Nomad's client (and possibly server) directories in chroots, there
should be no cause for harm. chroots are only built when Nomad runs as
root, and Nomad disables running exec jobs as root by default. Therefore
even if client state is copied into chroots, it will be inaccessible to
tasks.
Skipping the `data_dir` and `{client,server}.state_dir` is possible, but
this PR attempts to implement the minimum viable solution to reduce risk
of unintended side effects or bugs.
When running tests as root in a vm without the fix, the following error
occurs:
```
=== RUN TestAllocDir_SkipAllocDir
alloc_dir_test.go:520:
Error Trace: alloc_dir_test.go:520
Error: Received unexpected error:
Couldn't create destination file /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/testtask/nomad/test/testtask/.../nomad/test/testtask/secrets/.nomad-mount: open /tmp/TestAllocDir_SkipAllocDir1457747331/001/nomad/test/.../testtask/secrets/.nomad-mount: file name too long
Test: TestAllocDir_SkipAllocDir
--- FAIL: TestAllocDir_SkipAllocDir (22.76s)
```
Also removed unused Copy methods on AllocDir and TaskDir structs.
Thanks to @eveld for not letting me forget about this!
Symlinking busybox may fail when the test code and the test temporary
directory live on different volumes/partitions; so we should copy
instead. This situation arises in the Vagrant setup, where the code
repository live on special file sharing volume.
Somewhat unrelated, remove `f.Sync()` invocation from a test copyFile
helper function. Sync is useful only for crash recovery, and isn't
necessary in our test setup. The sync invocation is a significant
overhead as it requires the OS to flush any cached writes to disk.
plugins/driver: update driver interface to support streaming stats
client/tr: use streaming stats api
TODO:
* how to handle errors and closed channel during stats streaming
* prevent tight loop if Stats(ctx) returns an error
drivers: update drivers TaskStats RPC to handle streaming results
executor: better error handling in stats rpc
docker: better control and error handling of stats rpc
driver: allow stats to return a recoverable error
* master: (71 commits)
Fix output of 'nomad deployment fail' with no arg
Always create a running allocation when testing task state
tests: ensure exec tests pass valid task resources (#4992)
some changes for more idiomatic code
fix iops related tests
fixed bug in loop delay
gofmt
improved code for readability
client: updateAlloc release lock after read
fixup! device attributes in `nomad node status -verbose`
drivers/exec: support device binds and mounts
fix iops bug and increase test matrix coverage
tests: tag image explicitly
changelog
ci: install lxc-templates explicitly
tests: skip checking rdma cgroup
ci: use Ubuntu 16.04 (Xenial) in TravisCI
client: update driver info on new fingerprint
drivers/docker: enforce volumes.enabled (#4983)
client: Style: use fluent style for building loggers
...
Using statically linked busybox binary to setup a basic rootfs for
testing, by symlinking it to provide the basic commands used in tests.
I considered using a proper rootfs tarball, but the overhead of managing
tarfile and expanding it seems significant enough that I went with this
implementation.
IOPS have been modelled as a resource since Nomad 0.1 but has never
actually been detected and there is no plan in the short term to add
detection. This is because IOPS is a bit simplistic of a unit to define
the performance requirements from the underlying storage system. In its
current state it adds unnecessary confusion and can be removed without
impacting any users. This PR leaves IOPS defined at the jobspec parsing
level and in the api/ resources since these are the two public uses of
the field. These should be considered deprecated and only exist to allow
users to stop using them during the Nomad 0.9.x release. In the future,
there should be no expectation that the field will exist.